glm_events — processor model-specific performance counter events
This manual page describes events specific to the following Intel
CPU models and is derived from Intel's perfmon data. For more information,
please consult the Intel Software Developer's Manual or Intel's perfmon
website.
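These events are typically counted through the operating system's
performance-monitoring interface. As a minimal sketch, on Linux an event can
be programmed with perf_event_open(2), assuming a raw encoding (event select
and umask) taken from the perfmon data; the config value below is a
placeholder, not a real encoding:

    #include <linux/perf_event.h>
    #include <sys/syscall.h>
    #include <sys/ioctl.h>
    #include <unistd.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        struct perf_event_attr attr;
        memset(&attr, 0, sizeof(attr));
        attr.size = sizeof(attr);
        attr.type = PERF_TYPE_RAW;
        attr.config = 0x0000;      /* placeholder: (umask << 8) | event
                                      select, taken from the perfmon data */
        attr.disabled = 1;
        attr.exclude_kernel = 1;

        /* count the event for this thread, on any CPU */
        int fd = syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
        if (fd < 0) { perror("perf_event_open"); return 1; }

        ioctl(fd, PERF_EVENT_IOC_RESET, 0);
        ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
        /* ... workload under measurement ... */
        ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

        uint64_t count;
        read(fd, &count, sizeof(count));
        printf("count: %llu\n", (unsigned long long)count);
        close(fd);
        return 0;
    }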
CPU models described by this document:
The following events are supported:
- ld_blocks.data_unknown
- Counts loads blocked from using a store forward because the store data was
  not available at the right time. The forward might occur subsequently,
  once the data is available.
- ld_blocks.store_forward
- Counts loads blocked from using a store forward because of an address/size
  mismatch; only one of the loads blocked by each store is counted.
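A minimal sketch of code that tends to provoke this event, assuming the
compiler preserves the narrow store and the wider overlapping load:

    #include <stdint.h>
    #include <string.h>

    /* A 1-byte store followed by a wider, overlapping 4-byte load: the
     * store cannot be forwarded because of the size mismatch, so the
     * load may be blocked and counted by ld_blocks.store_forward. */
    uint32_t narrow_store_wide_load(uint8_t *buf)
    {
        buf[0] = 0x5a;        /* narrow store */
        uint32_t v;
        memcpy(&v, buf, 4);   /* wider load overlapping the store */
        return v;
    }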
- ld_blocks.4k_alias
- Counts loads that block because their address modulo 4K matches a pending
store.
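A minimal sketch that can provoke this event, assuming a GCC-style
alignment attribute to place two arrays exactly 4 KiB apart:

    /* a and b are exactly 4 KiB apart, so &a[i] and &b[i] share
     * address bits 11:0 for every i. */
    static char mem[8192] __attribute__((aligned(4096)));

    float sum_with_4k_alias(void)
    {
        float *a = (float *)mem;           /* offset 0    */
        float *b = (float *)(mem + 4096);  /* offset 4096 */
        float s = 0.0f;
        for (int i = 0; i < 1024; i++) {
            a[i] = (float)i;  /* store to a[i] ... */
            s += b[i];        /* ... then a load whose address matches
                               * the pending store modulo 4K; it may be
                               * blocked and counted by
                               * ld_blocks.4k_alias. */
        }
        return s;
    }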
- ld_blocks.utlb_miss
- Counts loads blocked because they are unable to find their physical
address in the micro TLB (UTLB).
- ld_blocks.all_block
- Counts retired loads that were blocked for any reason.
- page_walks.d_side_cycles
- Counts every core cycle when a Data-side (walks due to a data operation)
page walk is in progress.
- page_walks.i_side_cycles
- Counts every core cycle when an Instruction-side (walks due to an
  instruction fetch) page walk is in progress.
- page_walks.cycles
- Counts every core cycle a page-walk is in progress due to either a data
memory operation or an instruction fetch.
- uops_issued.any
- Counts uops issued by the front end and allocated into the back end of the
  machine. This event counts uops that retire as well as uops that were
  speculatively executed but didn't retire. The sorts of speculative uops
  that might be counted include, but are not limited to, uops issued in the
  shadow of a mispredicted branch, uops inserted during an assist (such as
  for a denormal floating point result), and (previously allocated) uops
  that might be canceled during a machine clear.
- misalign_mem_ref.load_page_split
- Counts when a memory load uop that spans a page boundary (a split) is
  retired.
- misalign_mem_ref.store_page_split
- Counts when a memory store uop that spans a page boundary (a split) is
  retired.
- longest_lat_cache.miss
- Counts memory requests originating from the core that miss in the L2
cache.
- longest_lat_cache.reference
- Counts memory requests originating from the core that reference a cache
line in the L2 cache.
- l2_reject_xq.all
- Counts the number of demand and prefetch transactions that the L2 XQ
rejects due to a full or near-full condition, which likely indicates back
pressure from the intra-die interconnect (IDI) fabric. The XQ may reject
transactions from the L2Q (non-cacheable requests), L2 misses and L2
write-back victims.
- core_reject_l2q.all
- Counts the number of demand and L1 prefetcher requests rejected by the L2Q
due to a full or nearly full condition, which likely indicates back
pressure from L2Q. It also counts requests that would have gone directly
to the XQ, but are rejected due to a full or nearly full condition,
indicating back pressure from the IDI link. The L2Q may also reject
transactions from a core to ensure fairness between cores, or to delay a
core's dirty eviction when the address conflicts with incoming external
snoops.
- cpu_clk_unhalted.core_p
- Core cycles when the core is not halted. This event uses a (_P)rogrammable
  general purpose performance counter.
- cpu_clk_unhalted.ref
- Reference cycles when the core is not halted. This event uses a
  programmable general purpose performance counter.
- dl1.dirty_eviction
- Counts when a modified (dirty) cache line is evicted from the data L1
cache and needs to be written back to memory. No count will occur if the
evicted line is clean, and hence does not require a writeback.
- icache.hit
- Counts requests to the Instruction Cache (ICache) for one or more bytes in
an ICache Line and that cache line is in the ICache (hit). The event
strives to count on a cache line basis, so that multiple accesses which
hit in a single cache line count as one ICACHE.HIT. Specifically, the
event counts when straight line code crosses the cache line boundary, or
when a branch target is to a new line, and that cache line is in the
ICache. This event counts differently than it does on Intel processors
based on the Silvermont microarchitecture.
- icache.misses
- Counts requests to the Instruction Cache (ICache) for one or more bytes in
an ICache Line and that cache line is not in the ICache (miss). The event
strives to count on a cache line basis, so that multiple accesses which
miss in a single cache line count as one ICACHE.MISS. Specifically, the
event counts when straight line code crosses the cache line boundary, or
when a branch target is to a new line, and that cache line is not in the
ICache. This event counts differently than it does on Intel processors
based on the Silvermont microarchitecture.
- icache.accesses
- Counts requests to the Instruction Cache (ICache) for one or more bytes in
an ICache Line. The event strives to count on a cache line basis, so that
multiple fetches to a single cache line count as one ICACHE.ACCESS.
Specifically, the event counts when accesses from straight line code cross
the cache line boundary, or when a branch target is to a new line. This
event counts differently than it does on Intel processors based on the
Silvermont microarchitecture.
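Assuming ICACHE.ACCESSES is the sum of ICACHE.HIT and ICACHE.MISSES, which
the descriptions above suggest but do not state outright, a line-granular
miss ratio can be derived as:

    ICACHE.MISSES / ICACHE.ACCESSES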
- itlb.miss
- Counts the number of times the machine was unable to find a translation in
the Instruction Translation Lookaside Buffer (ITLB) for a linear address
of an instruction fetch. It counts when new translations are filled into
the ITLB. The event is speculative in nature, but will not count
translations (page walks) that are begun and not finished, or translations
that are finished but not filled into the ITLB.
- fetch_stall.all
- Counts cycles that fetch is stalled for any reason. That is, the
decoder queue is able to accept bytes, but the fetch unit is unable to
provide bytes. This will include cycles due to an ITLB miss, ICache miss
and other events.
- fetch_stall.itlb_fill_pending_cycles
- Counts cycles that fetch is stalled due to an outstanding ITLB miss. That
is, the decoder queue is able to accept bytes, but the fetch unit is
unable to provide bytes due to an ITLB miss. Note: this event is not the
same as page walk cycles to retrieve an instruction translation.
- fetch_stall.icache_fill_pending_cycles
- Counts cycles that fetch is stalled due to an outstanding ICache miss.
That is, the decoder queue is able to accept bytes, but the fetch unit is
unable to provide bytes due to an ICache miss. Note: this event is not the
same as the total number of cycles spent retrieving instruction cache
lines from the memory hierarchy.
- uops_not_delivered.any
- This event is used to measure front-end inefficiencies, i.e. when the
  front-end of the machine is not delivering uops to the back-end and the
  back-end is not stalled. This event can be used to identify if the machine
  is truly front-end bound. When this event occurs, it is an indication that
  the front-end of the machine is operating at less than its theoretical
  peak performance. Background: we can think of the processor pipeline as
  divided into two broader parts: front-end and back-end. The front-end is
  responsible for fetching instructions, decoding them into uops in a
  machine-understandable format, and putting them into a uop queue to be
  consumed by the back-end. The back-end then takes these uops and allocates
  the required resources. When all resources are ready, uops are executed.
  If the back-end is not ready to accept uops from the front-end, then we do
  not want to count these as front-end bottlenecks. However, whenever we
  have bottlenecks in the back-end, we will have allocation unit stalls that
  eventually force the front-end to wait until the back-end is ready to
  receive more uops. This event counts only when the back-end is requesting
  more uops and the front-end is not able to provide them. When 3 uops are
  requested and none are delivered, the event counts 3. When 3 are requested
  and only 1 is delivered, the event counts 2. When only 2 are delivered,
  the event counts 1. Alternatively stated, the event will not count if 3
  uops are delivered, or if the back-end is stalled and not requesting any
  uops at all. Counts indicate missed opportunities for the front-end to
  deliver a uop to the back-end. Some examples of conditions that cause
  front-end inefficiencies are: ICache misses, ITLB misses, and decoder
  restrictions that limit front-end bandwidth. Known issues: some uops
  require multiple allocation slots. These uops will not be charged as a
  front-end 'not delivered' opportunity, and will be regarded as a back-end
  problem. For example, the INC instruction has one uop that requires 2
  issue slots. A stream of INC instructions will not count as
  UOPS_NOT_DELIVERED, even though only one instruction can be issued per
  clock. The low uop issue rate for a stream of INC instructions is
  considered to be a back-end issue.
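The three-slot accounting above implies three issue slots per core cycle,
so a front-end bound fraction can be sketched as the ratio below (an
interpretation of the description, not an officially defined metric):

    UOPS_NOT_DELIVERED.ANY / (3 * CPU_CLK_UNHALTED.CORE_P)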
- inst_retired.any_p
- Counts the number of instructions that retire execution. For instructions
that consist of multiple uops, this event counts the retirement of the
last uop of the instruction. The event continues counting during hardware
interrupts, traps, and inside interrupt handlers. This is an architectural
performance event. This event uses a (_P)rogrammable general purpose
performance counter. *This event is Precise Event capable: The EventingRIP
field in the PEBS record is precise to the address of the instruction
which caused the event. Note: Because PEBS records can be collected only
on IA32_PMC0, only one event can use the PEBS facility at a time.
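Combined with the cycle events above, this event gives the usual
instructions-per-cycle ratio:

    IPC = INST_RETIRED.ANY_P / CPU_CLK_UNHALTED.CORE_P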
- uops_retired.any
- Counts uops that retired.
- uops_retired.ms
- Counts uops retired that are from the complex flows issued by the
micro-sequencer (MS). Counts both the uops from a micro-coded instruction,
and the uops that might be generated from a micro-coded assist.
- uops_retired.fpdiv
- Counts the number of floating point divide uops retired.
- uops_retired.idiv
- Counts the number of integer divide uops retired.
- machine_clears.all
- Counts machine clears for any reason.
- machine_clears.smc
- Counts the number of times that the processor detects that a program is
writing to a code section and has to perform a machine clear because of
that modification. Self-modifying code (SMC) causes a severe penalty in
all Intel® architecture processors.
- machine_clears.memory_ordering
- Counts machine clears due to memory ordering issues. This occurs when a
snoop request happens and the machine is uncertain if memory ordering will
be preserved as another core is in the process of modifying the data.
- machine_clears.fp_assist
- Counts machine clears due to floating point (FP) operations needing
assists. For instance, if the result was a floating point denormal, the
hardware clears the pipeline and reissues uops to produce the correct IEEE
compliant denormal result.
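A minimal sketch that can trigger such assists, assuming the FPU is left in
its default mode (no flush-to-zero or denormals-are-zero):

    #include <float.h>

    /* Repeatedly halving the smallest normal float drives the result
     * into the subnormal (denormal) range; producing the correct
     * IEEE-compliant result may require an assist that is counted by
     * machine_clears.fp_assist. The value reaches zero after roughly
     * 24 halvings. */
    float make_denormals(void)
    {
        float x = FLT_MIN;          /* smallest normal float */
        for (int i = 0; i < 30; i++)
            x *= 0.5f;
        return x;
    }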
- machine_clears.disambiguation
- Counts machine clears due to memory disambiguation. Memory disambiguation
happens when a load which has been issued conflicts with a previous
unretired store in the pipeline whose address was not known at issue time,
but is later resolved to be the same as the load address.
- br_inst_retired.all_branches
- Counts branch instructions retired for all branch types. This is an
architectural performance event.
- br_inst_retired.jcc
- Counts Jcc (Jump on Conditional Code/Jump if Condition is Met) branch
  instructions retired, including both when the branch was taken and when it
  was not taken.
- br_inst_retired.all_taken_branches
- Counts the number of taken branch instructions retired.
- br_inst_retired.far_branch
- Counts far branch instructions retired. This includes far jump, far call
and return, and Interrupt call and return.
- br_inst_retired.non_return_ind
- Counts near indirect call or near indirect jmp branch instructions
retired.
- br_inst_retired.return
- Counts near return branch instructions retired.
- br_inst_retired.call
- Counts near CALL branch instructions retired.
- br_inst_retired.ind_call
- Counts near indirect CALL branch instructions retired.
- br_inst_retired.rel_call
- Counts near relative CALL branch instructions retired.
- br_inst_retired.taken_jcc
- Counts Jcc (Jump on Conditional Code/Jump if Condition is Met) branch
instructions retired that were taken; it does not count when the Jcc
branch instruction was not taken.
- br_misp_retired.all_branches
- Counts mispredicted branch instructions retired including all branch
types.
- br_misp_retired.jcc
- Counts mispredicted Jcc (Jump on Conditional Code/Jump if Condition is
  Met) branch instructions retired, including both when the
branch was supposed to be taken and when it was not supposed to be taken
(but the processor predicted the opposite condition).
- br_misp_retired.non_return_ind
- Counts mispredicted branch instructions retired that were near indirect
call or near indirect jmp, where the target address taken was not what the
processor predicted.
- br_misp_retired.return
- Counts mispredicted near RET branch instructions retired, where the return
address taken was not what the processor predicted.
- br_misp_retired.ind_call
- Counts mispredicted near indirect CALL branch instructions retired, where
the target address taken was not what the processor predicted.
- br_misp_retired.taken_jcc
- Counts mispredicted Jcc (Jump on Conditional Code/Jump if Condition is
  Met) branch instructions retired that were supposed to be taken but that
  the processor predicted would not be taken.
- issue_slots_not_consumed.any
- Counts the number of issue slots per core cycle that were not consumed by
the backend due to either a full resource in the backend (RESOURCE_FULL)
or due to the processor recovering from some event (RECOVERY).
- issue_slots_not_consumed.resource_full
- Counts the number of issue slots per core cycle that were not consumed
  because of a full resource in the backend, including but not limited to
  the Re-order Buffer (ROB), reservation stations (RS), load/store buffers,
  physical registers, or any other needed machine resource that is currently
  unavailable. Note that uops must be available for consumption in order for
  this event to fire. If a uop is not available (the Instruction Queue is
  empty), this event will not count.
- issue_slots_not_consumed.recovery
- Counts the number of issue slots per core cycle that were not consumed by
the backend because allocation is stalled waiting for a mispredicted jump
to retire or other branch-like conditions (e.g. the event is relevant
during certain microcode flows). Counts all issue slots blocked while
within this window including slots where uops were not available in the
Instruction Queue.
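By the same three-slot accounting, a back-end bound fraction can be
sketched as the ratio below (again an interpretation of the descriptions,
not an officially defined metric):

    ISSUE_SLOTS_NOT_CONSUMED.ANY / (3 * CPU_CLK_UNHALTED.CORE_P)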
- hw_interrupts.received
- Counts hardware interrupts received by the processor.
- hw_interrupts.masked
- Counts the number of core cycles during which interrupts are masked
(disabled). Increments by 1 each core cycle that EFLAGS.IF is 0,
regardless of whether interrupts are pending or not.
- hw_interrupts.pending_and_masked
- Counts core cycles during which there are pending interrupts, but
interrupts are masked (EFLAGS.IF = 0).
- cycles_div_busy.all
- Counts core cycles if either divide unit is busy.
- cycles_div_busy.idiv
- Counts core cycles the integer divide unit is busy.
- cycles_div_busy.fpdiv
- Counts core cycles the floating point divide unit is busy.
- mem_uops_retired.dtlb_miss_loads
- Counts load uops retired that caused a DTLB miss.
- mem_uops_retired.dtlb_miss_stores
- Counts store uops retired that caused a DTLB miss.
- mem_uops_retired.dtlb_miss
- Counts uops retired that had a DTLB miss on load, store or either. Note
that when two distinct memory operations to the same page miss the DTLB,
only one of them will be recorded as a DTLB miss.
- mem_uops_retired.lock_loads
- Counts locked memory uops retired. This includes regular locks and bus
locks. (To specifically count bus locks only, see the Offcore response
event.) A locked access is one with a lock prefix, or an exchange to
memory. See the SDM for a complete description of which memory load
accesses are locks.
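As an illustration, both statements in the sketch below compile to locked
memory accesses on x86 (using C11 atomics):

    #include <stdatomic.h>

    void locked_accesses(atomic_int *p)
    {
        atomic_fetch_add(p, 1);   /* LOCK-prefixed read-modify-write */
        atomic_exchange(p, 0);    /* XCHG with memory: implicitly locked */
    }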
- mem_uops_retired.split_loads
- Counts load uops retired where the data requested spans a 64-byte cache
  line boundary.
- mem_uops_retired.split_stores
- Counts store uops retired where the data requested spans a 64-byte cache
  line boundary.
- mem_uops_retired.split
- Counts memory uops retired where the data requested spans a 64-byte cache
  line boundary.
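A minimal sketch of a load that spans a cache line, assuming buf is 64-byte
aligned:

    #include <stdint.h>
    #include <string.h>

    /* A 4-byte load starting at offset 62 of a 64-byte-aligned buffer
     * crosses the cache line boundary and would be counted by
     * mem_uops_retired.split_loads when it retires. */
    uint32_t load_across_line(const char *buf /* 64-byte aligned */)
    {
        uint32_t v;
        memcpy(&v, buf + 62, 4);
        return v;
    }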
- mem_uops_retired.all_loads
- Counts the number of load uops retired.
- mem_uops_retired.all_stores
- Counts the number of store uops retired.
- mem_uops_retired.all
- Counts the number of memory uops retired that are either a load, a store,
  or both.
- mem_load_uops_retired.l1_hit
- Counts load uops retired that hit the L1 data cache.
- mem_load_uops_retired.l2_hit
- Counts load uops retired that hit in the L2 cache.
- mem_load_uops_retired.l1_miss
- Counts load uops retired that miss the L1 data cache.
- mem_load_uops_retired.l2_miss
- Counts load uops retired that miss in the L2 cache.
- mem_load_uops_retired.hitm
- Counts load uops retired where the cache line containing the data was in
  the modified state in another core's or module's cache (HITM). More
  specifically, this means that when the load address was checked by other
  caching agents (typically another processor) in the system, one of those
  caching agents indicated that it had a dirty copy of the data. Loads that
  obtain a HITM response incur greater latency than is typical for a load.
  In addition, since HITM indicates that some other processor had this data
  in its cache, it implies that the data was shared between processors, or
  potentially was a lock or semaphore value. This event is useful for
  locating sharing, false sharing, and contended locks.
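A minimal sketch of false sharing that this event can expose: two threads
update adjacent fields that share a cache line (names illustrative):

    #include <pthread.h>

    /* a and b occupy the same 64-byte cache line, so each core's loads
     * frequently find the line Modified in the other core's cache and
     * receive HITM responses. */
    static struct { long a; long b; } shared;

    static void *bump_a(void *arg) {
        for (long i = 0; i < 100000000; i++) shared.a++;
        return NULL;
    }

    static void *bump_b(void *arg) {
        for (long i = 0; i < 100000000; i++) shared.b++;
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, bump_a, NULL);
        pthread_create(&t2, NULL, bump_b, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        return 0;
    }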
- mem_load_uops_retired.wcb_hit
- Counts memory load uops retired where the data is retrieved from the WCB
  (or fill buffer), indicating that the load found its data while that data
  was in the process of being brought into the L1 cache. Typically a load
  will receive this indication when some other load or prefetch missed the
  L1 cache and was in the process of retrieving the cache line containing
  the data, but that process had not yet finished (and written the data back
  to the cache). For example, consider loads X and Y, both referencing the
  same cache line that is not in the L1 cache. If load X misses cache first,
  it obtains a WCB (or fill buffer) and begins the process of requesting the
  data. When load Y requests the data, it will hit either the WCB or the L1
  cache, depending on exactly when the request from Y occurs.
- mem_load_uops_retired.dram_hit
- Counts memory load uops retired where the data is retrieved from DRAM.
Event is counted at retirement, so speculative loads are ignored. A
memory load can hit (or miss) the L1 cache, hit (or miss) the L2 cache,
hit DRAM, hit in the WCB or receive a HITM response.
- baclears.all
- Counts the number of times a BACLEAR is signaled for any reason,
including, but not limited to, indirect branch/call, Jcc (Jump on
Conditional Code/Jump if Condition is Met) branch, unconditional
branch/call, and returns.
- baclears.return
- Counts BACLEARS on return instructions.
- baclears.cond
- Counts BACLEARS on Jcc (Jump on Conditional Code/Jump if Condition is Met)
branches.
- ms_decoded.ms_entry
- Counts the number of times the Microcode Sequencer (MS) starts a flow of
uops from the MSROM. It does not count every time a uop is read from the
MSROM. The most common case that this counts is when a micro-coded
instruction is encountered by the front end of the machine. Other cases
include when an instruction encounters a fault, trap, or microcode assist
of any sort that initiates a flow of uops. The event will count MS
startups for uops that are speculative and are subsequently cleared by a
branch mispredict or a machine clear.
- decode_restriction.predecode_wrong
- Counts the number of times the prediction (from the predecode cache) for
instruction length is incorrect.