slm_events — processor model specific performance counter events
This manual page describes events specific to the following Intel
CPU models and is derived from Intel's perfmon data. For more information,
please consult the Intel Software Developer's Manual or Intel's perfmon
website.
CPU models described by this document: Intel processors based on the Silvermont microarchitecture.
The following events are supported:
- br_inst_retired.all_branches
- ALL_BRANCHES counts the number of any branch instructions retired. Branch
prediction predicts the branch target and enables the processor to begin
executing instructions long before the branch's true execution path is
known. All branches utilize the branch prediction unit (BPU) for
prediction. This unit predicts the target address not only based on the
EIP of the branch but also based on the execution path through which
execution reached this EIP. The BPU can efficiently predict the following
branch types: conditional branches, direct calls and jumps, indirect calls
and jumps, returns.
- br_inst_retired.jcc
- JCC counts the number of conditional branch (JCC) instructions retired.
Branch prediction predicts the branch target and enables the processor to
begin executing instructions long before the branch's true execution path is
known. All branches utilize the branch prediction unit (BPU) for
prediction. This unit predicts the target address not only based on the
EIP of the branch but also based on the execution path through which
execution reached this EIP. The BPU can efficiently predict the following
branch types: conditional branches, direct calls and jumps, indirect calls
and jumps, returns.
- br_inst_retired.taken_jcc
- TAKEN_JCC counts the number of taken conditional branch (JCC) instructions
retired. Branch prediction predicts the branch target and enables the
processor to begin executing instructions long before the branch's true
execution path is known. All branches utilize the branch prediction unit
(BPU) for prediction. This unit predicts the target address not only based
on the EIP of the branch but also based on the execution path through
which execution reached this EIP. The BPU can efficiently predict the
following branch types: conditional branches, direct calls and jumps,
indirect calls and jumps, returns.
- br_inst_retired.call
- CALL counts the number of near CALL branch instructions retired. Branch
prediction predicts the branch target and enables the processor to begin
executing instructions long before the branch's true execution path is
known. All branches utilize the branch prediction unit (BPU) for
prediction. This unit predicts the target address not only based on the
EIP of the branch but also based on the execution path through which
execution reached this EIP. The BPU can efficiently predict the following
branch types: conditional branches, direct calls and jumps, indirect calls
and jumps, returns.
- br_inst_retired.rel_call
- REL_CALL counts the number of near relative CALL branch instructions
retired. Branch prediction predicts the branch target and enables the
processor to begin executing instructions long before the branch's true
execution path is known. All branches utilize the branch prediction unit
(BPU) for prediction. This unit predicts the target address not only based
on the EIP of the branch but also based on the execution path through
which execution reached this EIP. The BPU can efficiently predict the
following branch types: conditional branches, direct calls and jumps,
indirect calls and jumps, returns.
- br_inst_retired.ind_call
- IND_CALL counts the number of near indirect CALL branch instructions
retired. Branch prediction predicts the branch target and enables the
processor to begin executing instructions long before the branch's true
execution path is known. All branches utilize the branch prediction unit
(BPU) for prediction. This unit predicts the target address not only based
on the EIP of the branch but also based on the execution path through
which execution reached this EIP. The BPU can efficiently predict the
following branch types: conditional branches, direct calls and jumps,
indirect calls and jumps, returns.
- br_inst_retired.return
- RETURN counts the number of near RET branch instructions retired. Branch
prediction predicts the branch target and enables the processor to begin
executing instructions long before the branch's true execution path is
known. All branches utilize the branch prediction unit (BPU) for
prediction. This unit predicts the target address not only based on the
EIP of the branch but also based on the execution path through which
execution reached this EIP. The BPU can efficiently predict the following
branch types: conditional branches, direct calls and jumps, indirect calls
and jumps, returns.
- br_inst_retired.non_return_ind
- NON_RETURN_IND counts the number of near indirect JMP and near indirect
CALL branch instructions retired. Branch prediction predicts the branch
target and enables the processor to begin executing instructions long
before the branch's true execution path is known. All branches utilize the
branch prediction unit (BPU) for prediction. This unit predicts the target
address not only based on the EIP of the branch but also based on the
execution path through which execution reached this EIP. The BPU can
efficiently predict the following branch types: conditional branches,
direct calls and jumps, indirect calls and jumps, returns.
- br_inst_retired.far_branch
- FAR counts the number of far branch instructions retired. Branch
prediction predicts the branch target and enables the processor to begin
executing instructions long before the branch's true execution path is
known. All branches utilize the branch prediction unit (BPU) for
prediction. This unit predicts the target address not only based on the
EIP of the branch but also based on the execution path through which
execution reached this EIP. The BPU can efficiently predict the following
branch types: conditional branches, direct calls and jumps, indirect calls
and jumps, returns.
- br_misp_retired.all_branches
- ALL_BRANCHES counts the number of any mispredicted branch instructions
retired. This umask is an architecturally defined event. This event counts
the number of retired branch instructions that were mispredicted by the
processor, categorized by type. A branch misprediction occurs when the
processor predicts that the branch would be taken, but it is not, or
vice-versa. When the misprediction is discovered, all the instructions
executed in the wrong (speculative) path must be discarded, and the
processor must start fetching from the correct path.
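Dividing this count by BR_INST_RETIRED.ALL_BRANCHES yields the retired branch misprediction rate. A minimal Python sketch, assuming both counts were collected over the same run (the function name and example numbers are illustrative, not from Intel's data):

    def branch_mispredict_rate(mispredicted, retired):
        """Fraction of retired branches that were mispredicted."""
        return mispredicted / retired if retired else 0.0

    # Hypothetical counts from BR_MISP_RETIRED.ALL_BRANCHES and
    # BR_INST_RETIRED.ALL_BRANCHES.
    print(branch_mispredict_rate(1_200_000, 95_000_000))  # ~0.0126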
- br_misp_retired.jcc
- JCC counts the number of mispredicted conditional branches (JCC)
instructions retired. This event counts the number of retired branch
instructions that were mispredicted by the processor, categorized by type.
A branch misprediction occurs when the processor predicts that the branch
would be taken, but it is not, or vice-versa. When the misprediction is
discovered, all the instructions executed in the wrong (speculative) path
must be discarded, and the processor must start fetching from the correct
path.
- br_misp_retired.taken_jcc
- TAKEN_JCC counts the number of mispredicted taken conditional branch (JCC)
instructions retired. This event counts the number of retired branch
instructions that were mispredicted by the processor, categorized by type.
A branch misprediction occurs when the processor predicts that the branch
would be taken, but it is not, or vice-versa. When the misprediction is
discovered, all the instructions executed in the wrong (speculative) path
must be discarded, and the processor must start fetching from the correct
path.
- br_misp_retired.ind_call
- IND_CALL counts the number of mispredicted near indirect CALL branch
instructions retired. This event counts the number of retired branch
instructions that were mispredicted by the processor, categorized by type.
A branch misprediction occurs when the processor predicts that the branch
would be taken, but it is not, or vice-versa. When the misprediction is
discovered, all the instructions executed in the wrong (speculative) path
must be discarded, and the processor must start fetching from the correct
path.
- br_misp_retired.return
- RETURN counts the number of mispredicted near RET branch instructions
retired. This event counts the number of retired branch instructions that
were mispredicted by the processor, categorized by type. A branch
misprediction occurs when the processor predicts that the branch would be
taken, but it is not, or vice-versa. When the misprediction is discovered,
all the instructions executed in the wrong (speculative) path must be
discarded, and the processor must start fetching from the correct
path.
- br_misp_retired.non_return_ind
- NON_RETURN_IND counts the number of mispredicted near indirect JMP and
near indirect CALL branch instructions retired. This event counts the
number of retired branch instructions that were mispredicted by the
processor, categorized by type. A branch misprediction occurs when the
processor predicts that the branch would be taken, but it is not, or
vice-versa. When the misprediction is discovered, all the instructions
executed in the wrong (speculative) path must be discarded, and the
processor must start fetching from the correct path.
- uops_retired.ms
- This event counts the number of micro-ops retired that were supplied from
MSROM.
- uops_retired.all
- This event counts the number of micro-ops retired. The processor decodes
complex macro instructions into a sequence of simpler micro-ops. Most
instructions are composed of one or two micro-ops. Some instructions are
decoded into longer sequences such as repeat instructions, floating point
transcendental instructions, and assists. In some cases micro-op sequences
are fused or whole instructions are fused into one micro-op. See other
UOPS_RETIRED events for differentiating retired fused and non-fused
micro-ops.
- machine_clears.smc
- This event counts the number of times that a program writes to a code
section. Self-modifying code causes a severe penalty in all Intel®
architecture processors.
- machine_clears.memory_ordering
This event counts the number of times that the pipeline was cleared due to
memory ordering issues.
- machine_clears.fp_assist
This event counts the number of times that the pipeline stalled due to FP
operations needing assists.
- machine_clears.all
- Machine clears happen when something happens in the machine that causes
the hardware to need to take special care to get the right answer. When
such a condition is signaled on an instruction, the front end of the
machine is notified that it must restart, so no more instructions will be
decoded from the current path. All instructions "older" than
this one will be allowed to finish. This instruction and all
"younger" instructions must be cleared, since they must not be
allowed to complete. Essentially, the hardware waits until the problematic
instruction is the oldest instruction in the machine. This means all older
instructions are retired, and all pending stores (from older instructions)
are completed. Then the new path of instructions from the front end are
allowed to start into the machine. There are many conditions that might
cause a machine clear (including the receipt of an interrupt, or a trap or
a fault). All those conditions (including but not limited to
MACHINE_CLEARS.MEMORY_ORDERING, MACHINE_CLEARS.SMC, and
MACHINE_CLEARS.FP_ASSIST) are captured in the ANY event. In addition, some
conditions can be specifically counted (i.e. SMC, MEMORY_ORDERING,
FP_ASSIST). However, the sum of SMC, MEMORY_ORDERING, and FP_ASSIST
machine clears will not necessarily equal the number of ANY.
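Because the named sub-events need not sum to the ALL count, a breakdown can attribute the remainder to causes with no dedicated sub-event (interrupts, traps, faults). A minimal Python sketch with hypothetical counts:

    def machine_clear_breakdown(total, smc, memory_ordering, fp_assist):
        """Attribute machine clears to known causes; the remainder covers
        causes with no dedicated sub-event (interrupts, traps, faults)."""
        other = total - (smc + memory_ordering + fp_assist)
        return {"smc": smc, "memory_ordering": memory_ordering,
                "fp_assist": fp_assist, "other": max(other, 0)}

    print(machine_clear_breakdown(5000, 1200, 800, 300))
    # {'smc': 1200, 'memory_ordering': 800, 'fp_assist': 300, 'other': 2700}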
- no_alloc_cycles.rob_full
- Counts the number of cycles when no uops are allocated and the ROB is full
(less than 2 entries available).
- no_alloc_cycles.mispredicts
Counts the number of cycles when no uops are allocated and the alloc pipe
is stalled waiting for a mispredicted jump to retire. After the
misprediction is detected, the front end starts immediately, but the
allocate pipe stalls until the mispredicted jump retires.
- no_alloc_cycles.rat_stall
Counts the number of cycles when no uops are allocated and a RAT stall is
asserted.
- no_alloc_cycles.not_delivered
- The NO_ALLOC_CYCLES.NOT_DELIVERED event is used to measure front-end
inefficiencies, i.e. when the front-end of the machine is not delivering
micro-ops to the back-end and the back-end is not stalled. This event can
be used to identify if the machine is truly front-end bound. When this
event occurs, it is an indication that the front-end of the machine is
operating at less than its theoretical peak performance. Background: We
can think of the processor pipeline as being divided into 2 broader parts:
Front-end and Back-end. Front-end is responsible for fetching the
instruction, decoding it into micro-ops (uops) in a machine-understandable
format, and putting them into a micro-op queue to be consumed by the
back-end. The back-end then takes these micro-ops and allocates the
required resources.
When all resources are ready, micro-ops are executed. If the back-end is
not ready to accept micro-ops from the front-end, then we do not want to
count these as front-end bottlenecks. However, whenever we have
bottlenecks in the back-end, we will have allocation unit stalls that
eventually force the front-end to wait until the back-end is ready to
receive more UOPS. This event counts cycles only when the back-end is
requesting more uops and the front-end is not able to provide them. Some
examples of conditions that cause front-end inefficiencies are: ICache
misses, ITLB misses, and decoder restrictions that limit the front-end
bandwidth.
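To estimate how front-end bound a workload is, this count can be divided by total unhalted core cycles (CPU_CLK_UNHALTED.CORE_P). A minimal Python sketch, assuming both counts cover the same measurement window (hypothetical numbers):

    def frontend_bound_fraction(not_delivered, core_cycles):
        """Fraction of cycles where the back-end requested uops but the
        front-end could not deliver them."""
        return not_delivered / core_cycles if core_cycles else 0.0

    # Hypothetical counts from NO_ALLOC_CYCLES.NOT_DELIVERED and
    # CPU_CLK_UNHALTED.CORE_P.
    print(frontend_bound_fraction(30_000_000, 200_000_000))  # 0.15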
- no_alloc_cycles.all
- The NO_ALLOC_CYCLES.ALL event counts the number of cycles when the
front-end does not provide any instructions to be allocated for any
reason. This event indicates the cycles where an allocation stall occurs
and no UOPS are allocated in that cycle.
- rs_full_stall.mec
Counts the number of cycles the allocation pipeline is stalled while
waiting for a free MEC reservation station entry. The cycles are counted
appropriately for cracked ops; e.g., in the case of a cracked load-op, the
load portion is sent to the MEC.
- rs_full_stall.all
- Counts the number of cycles the Alloc pipeline is stalled when any one of
the RSs (IEC, FPC and MEC) is full. This event is a superset of all the
individual RS stall event counts.
- inst_retired.any_p
- This event counts the number of instructions that retire execution. For
instructions that consist of multiple micro-ops, this event counts the
retirement of the last micro-op of the instruction. The counter continues
counting during hardware interrupts, traps, and inside interrupt
handlers.
- cycles_div_busy.all
Cycles the divider is busy. This event counts the cycles when the divide
unit is unable to accept a new divide UOP because it is busy processing a
previously dispatched UOP. The cycles will be counted irrespective of
whether or not another divide UOP is waiting to enter the divide unit
(from the RS). This event might count cycles while a divide is in progress
even if the RS is empty. The divide instruction is one of the longest
latency instructions in the machine. Hence, it has a special event
associated with it to help determine if divides are delaying the
retirement of instructions.
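To judge whether divides are delaying retirement, this count can be compared against total unhalted core cycles. A minimal Python sketch with hypothetical counts:

    def divider_busy_fraction(div_busy_cycles, core_cycles):
        """Fraction of unhalted core cycles the divide unit was busy."""
        return div_busy_cycles / core_cycles if core_cycles else 0.0

    # Hypothetical counts from CYCLES_DIV_BUSY.ALL and
    # CPU_CLK_UNHALTED.CORE_P.
    print(divider_busy_fraction(10_000_000, 200_000_000))  # 0.05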
- cpu_clk_unhalted.core_p
- This event counts the number of core cycles while the core is not in a
halt state. The core enters the halt state when it is running the HLT
instruction. In mobile systems the core frequency may change from time to
time. For this reason this event may have a changing ratio with regards to
time.
- cpu_clk_unhalted.ref
- This event counts the number of reference cycles that the core is not in a
halt state. The core enters the halt state when it is running the HLT
instruction. In mobile systems the core frequency may change from time to
time.
This event is not affected by core frequency changes but counts as if the
core is running at the maximum frequency all the time.
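Since CPU_CLK_UNHALTED.CORE_P counts at the actual core frequency while this event counts as if the core always ran at the maximum frequency, their ratio approximates the average core frequency relative to the maximum. A minimal Python sketch with hypothetical counts:

    def avg_frequency_ratio(core_cycles, ref_cycles):
        """Average core frequency as a fraction of the maximum frequency."""
        return core_cycles / ref_cycles if ref_cycles else 0.0

    # Hypothetical: the core ran at ~80% of its maximum frequency
    # while unhalted.
    print(avg_frequency_ratio(160_000_000, 200_000_000))  # 0.8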
- l2_reject_xq.all
- This event counts the number of demand and prefetch transactions that the
L2 XQ rejects due to a full or near full condition which likely indicates
back pressure from the IDI link. The XQ may reject transactions from the
L2Q (non-cacheable requests), BBS (L2 misses) and WOB (L2 write-back
victims).
- core_reject_l2q.all
- Counts the number of (demand and L1 prefetchers) core requests rejected by
the L2Q due to a full or nearly full condition which likely indicates
back pressure from L2Q. It also counts requests that would have gone
directly to the XQ, but are rejected due to a full or nearly full
condition, indicating back pressure from the IDI link. The L2Q may also
reject transactions from a core to ensure fairness between cores, or to
delay a core's dirty eviction when the address conflicts with incoming
external
snoops. (Note that L2 prefetcher requests that are dropped are not counted
by this event.)
- longest_lat_cache.reference
This event counts requests originating from the core that reference a
cache line in the L2 cache.
- longest_lat_cache.miss
This event counts requests originating from the core that miss the L2
cache.
- icache.accesses
- This event counts all instruction fetches, not including most uncacheable
fetches.
- icache.hit
This event counts all instruction fetches that hit the instruction cache.
- icache.misses
- This event counts all instruction fetches that miss the Instruction cache
or produce memory requests. This includes uncacheable fetches. An
instruction fetch miss is counted only once and not once for every cycle
it is outstanding.
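The three ICACHE events can be combined into hit and miss rates. Note that, per the descriptions above, HIT plus MISSES need not equal ACCESSES exactly, since MISSES includes uncacheable fetches that ACCESSES mostly excludes. A minimal Python sketch with hypothetical counts:

    def icache_rates(accesses, hits, misses):
        """Instruction cache hit and miss rates relative to all fetches."""
        if accesses == 0:
            return 0.0, 0.0
        return hits / accesses, misses / accesses

    # Hypothetical counts from ICACHE.ACCESSES, ICACHE.HIT, ICACHE.MISSES.
    hit_rate, miss_rate = icache_rates(40_000_000, 39_400_000, 500_000)
    print(hit_rate, miss_rate)  # 0.985 0.0125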
- fetch_stall.itlb_fill_pending_cycles
- Counts cycles that fetch is stalled due to an outstanding ITLB miss. That
is, the decoder queue is able to accept bytes, but the fetch unit is
unable to provide bytes due to an ITLB miss. Note: this event is not the
same as page walk cycles to retrieve an instruction translation.
- fetch_stall.icache_fill_pending_cycles
- Counts cycles that fetch is stalled due to an outstanding ICache miss.
That is, the decoder queue is able to accept bytes, but the fetch unit is
unable to provide bytes due to an ICache miss. Note: this event is not the
same as the total number of cycles spent retrieving instruction cache
lines from the memory hierarchy.
- fetch_stall.all
- Counts cycles that fetch is stalled due to any reason. That is, the
decoder queue is able to accept bytes, but the fetch unit is unable to
provide bytes. This will include cycles due to an ITLB miss, ICache miss
and other events.
- baclears.all
- The BACLEARS event counts the number of times the front end is resteered,
mainly when the Branch Prediction Unit cannot provide a correct prediction
and this is corrected by the Branch Address Calculator at the front end.
The BACLEARS.ANY event counts the number of baclears for any type of
branch.
- baclears.return
- The BACLEARS event counts the number of times the front end is resteered,
mainly when the Branch Prediction Unit cannot provide a correct prediction
and this is corrected by the Branch Address Calculator at the front end.
The BACLEARS.RETURN event counts the number of RETURN baclears.
- baclears.cond
- The BACLEARS event counts the number of times the front end is resteered,
mainly when the Branch Prediction Unit cannot provide a correct prediction
and this is corrected by the Branch Address Calculator at the front end.
The BACLEARS.COND event counts the number of JCC (Jump on Conditional Code)
baclears.
- ms_decoded.ms_entry
- Counts the number of times the MSROM starts a flow of UOPS. It does not
count every time a UOP is read from the microcode ROM. The most common
case that this counts is when a micro-coded instruction is encountered by
the front end of the machine. Other cases include when an instruction
encounters a fault, trap, or microcode assist of any sort. The event will
count MSROM startups for UOPS that are speculative, and subsequently
cleared by branch mispredict or machine clear. Background: UOPS are
produced by two mechanisms. Either they are generated by hardware that
decodes instructions into UOPS, or they are delivered by a ROM (called the
MSROM) that holds UOPS associated with a specific instruction. MSROM UOPS
might also be delivered in response to some condition such as a fault or
other exceptional condition. This event is an excellent mechanism for
detecting instructions that require the use of the MSROM.
- decode_restriction.predecode_wrong
- Counts the number of times a decode restriction reduced the decode
throughput due to wrong instruction length prediction.
- rehabq.ld_block_st_forward
- This event counts the number of retired loads that were prohibited from
receiving forwarded data from the store because of an address mismatch.
- rehabq.ld_block_std_notready
- This event counts the cases where a forward was technically possible, but
did not occur because the store data was not available at the right
time.
- rehabq.st_splits
This event counts the number of retired stores that experienced cache line
boundary splits.
- rehabq.ld_splits
This event counts the number of retired loads that experienced cache line
boundary splits.
- rehabq.lock
- This event counts the number of retired memory operations with lock
semantics. These are either implicit locked instructions such as the XCHG
instruction or instructions with an explicit LOCK prefix (0xF0).
- rehabq.sta_full
- This event counts the number of retired stores that are delayed because
there is not a store address buffer available.
- rehabq.any_ld
This event counts the number of load uops reissued from the RehabQ.
- rehabq.any_st
This event counts the number of store uops reissued from the RehabQ.
- mem_uops_retired.l1_miss_loads
- This event counts the number of load ops retired that miss in L1 Data
cache. Note that prefetch misses will not be counted.
- mem_uops_retired.l2_hit_loads
- This event counts the number of load ops retired that hit in the L2.
- mem_uops_retired.l2_miss_loads
- This event counts the number of load ops retired that miss in the L2.
- mem_uops_retired.dtlb_miss_loads
This event counts the number of load ops retired that had a DTLB miss.
- mem_uops_retired.utlb_miss
This event counts the number of load ops retired that had a UTLB miss.
- mem_uops_retired.hitm
- This event counts the number of load ops retired that got data from the
other core or from the other module.
- mem_uops_retired.all_loads
- This event counts the number of load ops retired.
- mem_uops_retired.all_stores
- This event counts the number of store ops retired.
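The load sub-events above can be combined into a simple locality breakdown, treating loads that did not miss the L1 as L1 hits (a sketch with hypothetical counts; prefetch effects are ignored):

    def load_locality(all_loads, l1_miss, l2_hit, l2_miss):
        """Fraction of retired loads served at each cache level."""
        if all_loads == 0:
            return {}
        return {"l1_hit": (all_loads - l1_miss) / all_loads,
                "l2_hit": l2_hit / all_loads,
                "l2_miss": l2_miss / all_loads}

    # Hypothetical counts from MEM_UOPS_RETIRED.ALL_LOADS, .L1_MISS_LOADS,
    # .L2_HIT_LOADS, and .L2_MISS_LOADS.
    print(load_locality(100_000_000, 4_000_000, 3_000_000, 1_000_000))
    # {'l1_hit': 0.96, 'l2_hit': 0.03, 'l2_miss': 0.01}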
- page_walks.d_side_walks
- This event counts when a data (D) page walk is completed or started. Since
a page walk implies a TLB miss, the number of TLB misses can be counted by
counting the number of page walks.
- page_walks.d_side_cycles
- This event counts every cycle when a D-side (walks due to a load) page
walk is in progress. Page walk duration divided by number of page walks is
the average duration of page walks.
- page_walks.i_side_walks
- This event counts when an instruction (I) page walk is completed or
started. Since a page walk implies a TLB miss, the number of TLB misses
can be counted by counting the number of page walks.
- page_walks.i_side_cycles
This event counts every cycle when an I-side (walks due to an instruction
fetch) page walk is in progress. Page walk duration divided by number of
page walks is the average duration of page walks.
- page_walks.walks
- This event counts when a data (D) page walk or an instruction (I) page
walk is completed or started. Since a page walk implies a TLB miss, the
number of TLB misses can be counted by counting the number of page
walks.
- page_walks.cycles
- This event counts every cycle when a data (D) page walk or instruction (I)
page walk is in progress. Since a page walk implies a TLB miss, the
approximate cost of a TLB miss can be determined from this event.
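As the descriptions above note, dividing PAGE_WALKS.CYCLES by PAGE_WALKS.WALKS gives the average page walk duration, which approximates the cost of a TLB miss in cycles. A minimal Python sketch with hypothetical counts:

    def avg_page_walk_cycles(walk_cycles, walks):
        """Average page walk duration in core cycles; since each walk
        implies a TLB miss, this approximates the cost of a TLB miss."""
        return walk_cycles / walks if walks else 0.0

    # Hypothetical counts from PAGE_WALKS.CYCLES and PAGE_WALKS.WALKS.
    print(avg_page_walk_cycles(2_600_000, 100_000))  # 26.0 cycles per walk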
- br_inst_retired.all_taken_branches
- ALL_TAKEN_BRANCHES counts the number of all taken branch instructions
retired. Branch prediction predicts the branch target and enables the
processor to begin executing instructions long before the branch's true
execution path is known. All branches utilize the branch prediction unit
(BPU) for prediction. This unit predicts the target address not only based
on the EIP of the branch but also based on the execution path through
which execution reached this EIP. The BPU can efficiently predict the
following branch types: conditional branches, direct calls and jumps,
indirect calls and jumps, returns.