skx_events - processor model specific performance counter events
This manual page describes events specific to the following Intel
CPU models and is derived from Intel's perfmon data. For more information,
please consult the Intel Software Developer's Manual or Intel's perfmon
website.
CPU models described by this document:
- Family 0x6, Model 0x55, Stepping 0x0
- Family 0x6, Model 0x55, Stepping 0x1
- Family 0x6, Model 0x55, Stepping 0x2
- Family 0x6, Model 0x55, Stepping 0x3
- Family 0x6, Model 0x55, Stepping 0x4
The following events are supported:
- ld_blocks.store_forward
- Counts the number of times where store forwarding was prevented for a load
operation. The most common case is a load blocked due to the address of
memory access (partially) overlapping with a preceding uncompleted store.
Note: See the table of unsupported store forwards in the Optimization
Guide.
- ld_blocks.no_sr
- The number of times that split load operations are temporarily blocked
because all resources for handling the split accesses are in use.
- ld_blocks_partial.address_alias
- Counts false dependencies in the MOB (Memory Order Buffer) when the partial
address comparison of the loose net check matched and the dependency was
resolved by the Enhanced Loose net mechanism.
This may not result in high performance penalties. Loose net checks can
fail when loads and stores are 4k aliased.
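As an illustration of the 4K aliasing mentioned for ld_blocks_partial.address_alias, the sketch below checks whether two addresses collide in their low 12 bits. It assumes the usual meaning of "4k aliased" (two different addresses with the same offset within a 4 KiB page); that definition, and the helper name `is_4k_aliased`, are not taken from this manual page.

```python
PAGE_OFFSET_MASK = 0xFFF  # low 12 bits: offset within a 4 KiB page (assumed definition)


def is_4k_aliased(load_addr: int, store_addr: int) -> bool:
    """Return True when two distinct addresses share the same low 12 bits."""
    same_offset = ((load_addr ^ store_addr) & PAGE_OFFSET_MASK) == 0
    return load_addr != store_addr and same_offset


if __name__ == "__main__":
    print(is_4k_aliased(0x1008, 0x5008))  # same page offset, different pages
    print(is_4k_aliased(0x1000, 0x2004))  # different page offsets
```

A high count for this event alongside such address pairs in a hot loop may suggest changing buffer alignment or spacing, although, as the description says, the penalty is not necessarily large.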
- dtlb_load_misses.miss_causes_a_walk
- Counts demand data loads that caused a page walk of any page size
(4K/2M/4M/1G). This implies it missed in all TLB levels, but the walk need
not have completed.
- dtlb_load_misses.walk_completed_4k
- Counts completed page walks (4K sizes) caused by demand data loads. This
implies address translations missed in the DTLB and further levels of TLB.
The page walk can end with or without a fault.
- dtlb_load_misses.walk_completed_2m_4m
- Counts completed page walks (2M/4M sizes) caused by demand data loads.
This implies address translations missed in the DTLB and further levels of
TLB. The page walk can end with or without a fault.
- dtlb_load_misses.walk_completed_1g
- Counts completed page walks (1G sizes) caused by demand data loads. This
implies address translations missed in the DTLB and further levels of TLB.
The page walk can end with or without a fault.
- dtlb_load_misses.walk_completed
- Counts completed page walks (all page sizes) caused by demand data loads.
This implies it missed in the DTLB and further levels of TLB. The page
walk can end with or without a fault.
- dtlb_load_misses.walk_pending
- Counts 1 per cycle for each PMH that is busy with a page walk for a load.
EPT page walk durations are excluded in the Skylake microarchitecture.
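Because dtlb_load_misses.walk_pending accumulates one count per busy PMH per cycle, dividing it by dtlb_load_misses.walk_completed gives a rough average walk duration. The sketch below is only an approximation under that assumption (walks that never complete still contribute pending cycles); the function name and sample values are hypothetical.

```python
from typing import Optional


def avg_page_walk_cycles(walk_pending: int, walk_completed: int) -> Optional[float]:
    """Approximate average cycles per completed load page walk.

    walk_pending:   value of dtlb_load_misses.walk_pending
    walk_completed: value of dtlb_load_misses.walk_completed
    Returns None when no walk completed in the measured interval.
    """
    if walk_completed == 0:
        return None
    return walk_pending / walk_completed


if __name__ == "__main__":
    # Hypothetical counter readings, not measured data.
    print(avg_page_walk_cycles(3000, 100))
```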
- dtlb_load_misses.walk_active
- Counts cycles when at least one PMH (Page Miss Handler) is busy with a
page walk for a load.
- dtlb_load_misses.stlb_hit
- Counts loads that miss the DTLB (Data TLB) and hit the STLB (Second level
TLB).
- memory_disambiguation.history_reset
- tbd
- int_misc.recovery_cycles
- Core cycles the Resource allocator was stalled due to recovery from an
earlier branch misprediction or machine clear event.
- int_misc.recovery_cycles_any
- Core cycles the allocator was stalled due to recovery from earlier clear
event for any thread running on the physical core (e.g. misprediction or
memory nuke).
- int_misc.clear_resteer_cycles
- Cycles the issue-stage is waiting for front-end to fetch from resteered
path following branch misprediction or machine clear events.
- uops_issued.stall_cycles
- Counts cycles during which the Resource Allocation Table (RAT) does not
issue any Uops to the reservation station (RS) for the current
thread.
- uops_issued.any
- Counts the number of uops that the Resource Allocation Table (RAT) issues
to the Reservation Station (RS).
- uops_issued.vector_width_mismatch
- Counts the number of Blend Uops issued by the Resource Allocation Table
(RAT) to the reservation station (RS) in order to preserve upper bits of
vector registers. Starting with the Skylake microarchitecture, these Blend
uops are needed since every Intel SSE instruction executed in Dirty Upper
State needs to preserve bits 128-255 of the destination register. For more
information, refer to the Mixing Intel AVX and Intel SSE Code section of the
Optimization Guide.
- uops_issued.slow_lea
- Number of slow LEA uops being allocated. A uop is generally considered
SlowLea if it has 3 sources (e.g. 2 sources + immediate), regardless of
whether it results from an LEA instruction.
- arith.divider_active
- Cycles when divide unit is busy executing divide or square root
operations. Accounts for integer and floating-point operations.
- l2_rqsts.demand_data_rd_miss
- Counts the number of demand Data Read requests that miss L2 cache. Only
loads that were not rejected are counted.
- l2_rqsts.rfo_miss
- Counts the RFO (Read-for-Ownership) requests that miss L2 cache.
- l2_rqsts.code_rd_miss
- Counts L2 cache misses when fetching instructions.
- l2_rqsts.all_demand_miss
- Demand requests that miss L2 cache.
- l2_rqsts.pf_miss
- Counts requests from the L1/L2/L3 hardware prefetchers or Load software
prefetches that miss L2 cache.
- l2_rqsts.miss
- All requests that miss L2 cache.
- l2_rqsts.demand_data_rd_hit
- Counts the number of demand Data Read requests, initiated by load
instructions, that hit L2 cache.
- l2_rqsts.rfo_hit
- Counts the RFO (Read-for-Ownership) requests that hit L2 cache.
- l2_rqsts.code_rd_hit
- Counts L2 cache hits when fetching instructions (code reads).
- l2_rqsts.pf_hit
- Counts requests from the L1/L2/L3 hardware prefetchers or Load software
prefetches that hit L2 cache.
- l2_rqsts.all_demand_data_rd
- Counts the number of demand Data Read requests (including requests from
L1D hardware prefetchers). These loads may hit or miss L2 cache. Only loads
that were not rejected are counted.
- l2_rqsts.all_rfo
- Counts the total number of RFO (read for ownership) requests to L2 cache.
L2 RFO requests include both L1D demand RFO misses as well as L1D RFO
prefetches.
- l2_rqsts.all_code_rd
- Counts the total number of L2 code requests.
- l2_rqsts.all_demand_references
- Demand requests to L2 cache.
- l2_rqsts.all_pf
- Counts the total number of requests from the L2 hardware prefetchers.
- l2_rqsts.references
- All L2 requests.
- core_power.lvl0_turbo_license
- Core cycles where the core was running with power-delivery for baseline
license level 0. This includes non-AVX codes, SSE, AVX 128-bit, and
low-current AVX 256-bit codes.
- core_power.lvl1_turbo_license
- Core cycles where the core was running with power-delivery for license
level 1. This includes high current AVX 256-bit instructions as well as
low current AVX 512-bit instructions.
- core_power.lvl2_turbo_license
- Core cycles where the core was running with power-delivery for license
level 2 (introduced in Skylake Server microarchitecture). This includes
high current AVX 512-bit instructions.
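The three core_power.lvl*_turbo_license counts can be turned into the share of cycles spent at each license level. The sketch below assumes the three counters together cover the cycles of interest; the function name and the sample numbers are illustrative only.

```python
def turbo_license_shares(lvl0: int, lvl1: int, lvl2: int) -> dict:
    """Return the fraction of counted core cycles spent at each license level.

    lvl0, lvl1, lvl2: values of core_power.lvl0_turbo_license,
    core_power.lvl1_turbo_license and core_power.lvl2_turbo_license.
    """
    total = lvl0 + lvl1 + lvl2
    if total == 0:
        return {"lvl0": 0.0, "lvl1": 0.0, "lvl2": 0.0}
    return {"lvl0": lvl0 / total, "lvl1": lvl1 / total, "lvl2": lvl2 / total}


if __name__ == "__main__":
    # Hypothetical counter readings, not measured data.
    for level, share in turbo_license_shares(600, 300, 100).items():
        print(f"{level}: {share:.1%}")
```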
- core_power.throttle
- Core cycles the out-of-order engine was throttled due to a pending power
level request.
- longest_lat_cache.miss
- Counts core-originated cacheable requests that miss the L3 cache (Longest
Latency cache). Requests include data and code reads, Reads-for-Ownership
(RFOs), speculative accesses and hardware prefetches from L1 and L2. It
does not include all misses to the L3.
The following errata may apply to this: SKL057
- longest_lat_cache.reference
- Counts core-originated cacheable requests to the L3 cache (Longest Latency
cache). Requests include data and code reads, Reads-for-Ownership (RFOs),
speculative accesses and hardware prefetches from L1 and L2. It does not
include all accesses to the L3.
The following errata may apply to this: SKL057
- sw_prefetch_access.nta
- Number of PREFETCHNTA instructions executed.
- sw_prefetch_access.t0
- Number of PREFETCHT0 instructions executed.
- sw_prefetch_access.t1_t2
- Number of PREFETCHT1 or PREFETCHT2 instructions executed.
- sw_prefetch_access.prefetchw
- Number of PREFETCHW instructions executed.
- cpu_clk_unhalted.thread_p
- This is an architectural event that counts the number of thread cycles
while the thread is not in a halt state. The thread enters the halt state
when it is running the HLT instruction. The core frequency may change from
time to time due to power or thermal throttling. For this reason, this
event may have a changing ratio with regards to wall clock time.
- cpu_clk_unhalted.thread_p_any
- Core cycles when at least one thread on the physical core is not in halt
state.
- cpu_clk_unhalted.ring0_trans
- Counts when the Current Privilege Level (CPL) transitions from ring 1, 2
or 3 to ring 0 (Kernel).
- cpu_clk_thread_unhalted.ref_xclk
- Core crystal clock cycles when the thread is unhalted.
- cpu_clk_thread_unhalted.ref_xclk_any
- Core crystal clock cycles when at least one thread on the physical core is
unhalted.
- cpu_clk_unhalted.ref_xclk_any
- Core crystal clock cycles when at least one thread on the physical core is
unhalted.
- cpu_clk_unhalted.ref_xclk
- Core crystal clock cycles when the thread is unhalted.
- cpu_clk_thread_unhalted.one_thread_active
- Core crystal clock cycles when this thread is unhalted and the other
thread is halted.
- cpu_clk_unhalted.one_thread_active
- Core crystal clock cycles when this thread is unhalted and the other
thread is halted.
- l1d_pend_miss.pending_cycles
- Counts duration of L1D miss outstanding in cycles.
- l1d_pend_miss.pending
- Counts the duration of outstanding L1D misses, that is, the number of Fill
Buffers (FB) outstanding each cycle that are required by Demand Reads. An FB
is either held by demand loads, or held by non-demand loads and hit at least
once by a demand. The valid outstanding interval ends at FB deallocation and
starts in one of the following ways: from FB allocation, if the FB is
allocated by demand; from the demand hit on the FB, if it is allocated by
hardware or software prefetch. Note: In the L1D, a Demand Read contains
cacheable or noncacheable demand loads, including ones causing cache-line
splits and reads due to page walks resulting from any request type.
- l1d_pend_miss.pending_cycles_any
- Cycles with L1D load Misses outstanding from any thread on physical
core.
- l1d_pend_miss.fb_full
- Number of times a request needed a FB (Fill Buffer) entry but there was no
entry available for it. A request includes cacheable/uncacheable demands
that are load, store or SW prefetch instructions.
- dtlb_store_misses.miss_causes_a_walk
- Counts demand data stores that caused a page walk of any page size
(4K/2M/4M/1G). This implies it missed in all TLB levels, but the walk need
not have completed.
- dtlb_store_misses.walk_completed_4k
- Counts completed page walks (4K sizes) caused by demand data stores. This
implies address translations missed in the DTLB and further levels of TLB.
The page walk can end with or without a fault.
- dtlb_store_misses.walk_completed_2m_4m
- Counts completed page walks (2M/4M sizes) caused by demand data stores.
This implies address translations missed in the DTLB and further levels of
TLB. The page walk can end with or without a fault.
- dtlb_store_misses.walk_completed_1g
- Counts completed page walks (1G sizes) caused by demand data stores. This
implies address translations missed in the DTLB and further levels of TLB.
The page walk can end with or without a fault.
- dtlb_store_misses.walk_completed
- Counts completed page walks (all page sizes) caused by demand data stores.
This implies it missed in the DTLB and further levels of TLB. The page
walk can end with or without a fault.
- dtlb_store_misses.walk_pending
- Counts 1 per cycle for each PMH that is busy with a page walk for a store.
EPT page walk durations are excluded in the Skylake microarchitecture.
- dtlb_store_misses.walk_active
- Counts cycles when at least one PMH (Page Miss Handler) is busy with a
page walk for a store.
- dtlb_store_misses.stlb_hit
- Stores that miss the DTLB (Data TLB) and hit the STLB (2nd Level
TLB).
- load_hit_pre.sw_pf
- Counts all not software-prefetch load dispatches that hit the fill buffer
(FB) allocated for the software prefetch. It can also be incremented by
some lock instructions. So it should only be used with profiling so that
the locks can be excluded by ASM (Assembly File) inspection of the nearby
instructions.
- ept.walk_pending
- Counts cycles for each PMH (Page Miss Handler) that is busy with an EPT
(Extended Page Table) walk for any request type.
- l1d.replacement
- Counts L1D data line replacements including opportunistic replacements,
and replacements that require stall-for-replace or block-for-replace.
- tx_mem.abort_conflict
- Number of times a TSX line had a cache conflict.
- tx_mem.abort_capacity
- Number of times a transactional abort was signaled due to a data capacity
limitation for transactional reads or writes.
- tx_mem.abort_hle_store_to_elided_lock
- Number of times a TSX Abort was triggered due to a non-release/commit
store to lock.
- tx_mem.abort_hle_elision_buffer_not_empty
- Number of times a TSX Abort was triggered due to commit but Lock Buffer
not empty.
- tx_mem.abort_hle_elision_buffer_mismatch
- Number of times a TSX Abort was triggered due to release/commit but data
and address mismatch.
- tx_mem.abort_hle_elision_buffer_unsupported_alignment
- Number of times a TSX Abort was triggered due to attempting an unsupported
alignment from Lock Buffer.
- tx_mem.hle_elision_buffer_full
- Number of times the Lock Buffer could not be allocated.
- partial_rat_stalls.scoreboard
- Counts cycles during which microcode scoreboard stalls occur.
- tx_exec.misc1
- Counts the number of times a class of instructions that may cause a
transactional abort was executed. Since this is the count of execution, it
may not always cause a transactional abort.
- tx_exec.misc2
- Unfriendly TSX abort triggered by a vzeroupper instruction.
- tx_exec.misc3
- Unfriendly TSX abort triggered by a nest count that is too deep.
- tx_exec.misc4
- RTM region detected inside HLE.
- tx_exec.misc5
- Counts the number of times an HLE XACQUIRE instruction was executed inside
an RTM transactional region.
- rs_events.empty_end
- Counts end of periods where the Reservation Station (RS) was empty. Could
be useful to precisely locate front-end Latency Bound issues.
- rs_events.empty_cycles
- Counts cycles during which the reservation station (RS) is empty for the
thread. Note: In ST-mode, the inactive thread should drive 0. This is
usually caused by severely costly branch mispredictions, or allocator/FE
issues.
- offcore_requests_outstanding.cycles_with_demand_data_rd
- Counts cycles when offcore outstanding Demand Data Read transactions are
present in the super queue (SQ). A transaction is considered to be in the
Offcore outstanding state between L2 miss and transaction completion sent
to requestor (SQ de-allocation).
- offcore_requests_outstanding.demand_data_rd
- Counts the number of offcore outstanding Demand Data Read transactions in
the super queue (SQ) every cycle. A transaction is considered to be in the
Offcore outstanding state between L2 miss and transaction completion sent
to requestor. See the corresponding Umask under OFFCORE_REQUESTS. Note: A
prefetch promoted to Demand is counted from the promotion point.
- offcore_requests_outstanding.demand_data_rd_ge_6
- Cycles with at least 6 offcore outstanding Demand Data Read transactions
in uncore queue.
- offcore_requests_outstanding.demand_code_rd
- Counts the number of offcore outstanding Code Read transactions in the
super queue every cycle. The 'Offcore outstanding' state of the
transaction lasts from the L2 miss until the sending transaction
completion to requestor (SQ deallocation). See the corresponding Umask
under OFFCORE_REQUESTS.
- offcore_requests_outstanding.cycles_with_demand_code_rd
- Counts the number of offcore outstanding Code Read transactions in the
super queue every cycle. The 'Offcore outstanding' state of the
transaction lasts from the L2 miss until the sending transaction
completion to requestor (SQ deallocation). See the corresponding Umask
under OFFCORE_REQUESTS.
- offcore_requests_outstanding.demand_rfo
- Counts the number of offcore outstanding RFO (store) transactions in the
super queue (SQ) every cycle. A transaction is considered to be in the
Offcore outstanding state between L2 miss and transaction completion sent
to requestor (SQ de-allocation). See corresponding Umask under
OFFCORE_REQUESTS.
- offcore_requests_outstanding.cycles_with_demand_rfo
- Counts the number of offcore outstanding demand RFO read transactions in
the super queue every cycle. The 'Offcore outstanding' state of the
transaction lasts from the L2 miss until the sending transaction
completion to requestor (SQ deallocation). See the corresponding Umask
under OFFCORE_REQUESTS.
- offcore_requests_outstanding.cycles_with_data_rd
- Counts cycles when offcore outstanding cacheable Core Data Read
transactions are present in the super queue. A transaction is considered
to be in the Offcore outstanding state between L2 miss and transaction
completion sent to requestor (SQ de-allocation). See corresponding Umask
under OFFCORE_REQUESTS.
- offcore_requests_outstanding.all_data_rd
- Counts the number of offcore outstanding cacheable Core Data Read
transactions in the super queue every cycle. A transaction is considered
to be in the Offcore outstanding state between L2 miss and transaction
completion sent to requestor (SQ de-allocation). See corresponding Umask
under OFFCORE_REQUESTS.
- offcore_requests_outstanding.l3_miss_demand_data_rd
- Counts number of Offcore outstanding Demand Data Read requests that miss
L3 cache in the superQ every cycle.
- offcore_requests_outstanding.l3_miss_demand_data_rd_ge_6
- Cycles with at least 6 Demand Data Read requests that miss L3 cache in the
superQ.
- offcore_requests_outstanding.cycles_with_l3_miss_demand_data_rd
- Cycles with at least 1 Demand Data Read request that misses L3 cache in
the superQ.
- idq.mite_cycles
- Counts cycles during which uops are being delivered to Instruction Decode
Queue (IDQ) from the MITE path. Counting includes uops that may 'bypass'
the IDQ.
- idq.mite_uops
- Counts the number of uops delivered to Instruction Decode Queue (IDQ) from
the MITE path. Counting includes uops that may 'bypass' the IDQ. This also
means that uops are not being delivered from the Decode Stream Buffer
(DSB).
- idq.dsb_cycles
- Counts cycles during which uops are being delivered to Instruction Decode
Queue (IDQ) from the Decode Stream Buffer (DSB) path. Counting includes
uops that may 'bypass' the IDQ.
- idq.dsb_uops
- Counts the number of uops delivered to Instruction Decode Queue (IDQ) from
the Decode Stream Buffer (DSB) path. Counting includes uops that may
'bypass' the IDQ.
- idq.ms_dsb_cycles
- Counts cycles during which uops initiated by Decode Stream Buffer (DSB)
are being delivered to Instruction Decode Queue (IDQ) while the Microcode
Sequencer (MS) is busy. Counting includes uops that may 'bypass' the
IDQ.
- idq.all_dsb_cycles_any_uops
- Counts the number of cycles uops were delivered to Instruction Decode
Queue (IDQ) from the Decode Stream Buffer (DSB) path. Count includes uops
that may 'bypass' the IDQ.
- idq.all_dsb_cycles_4_uops
- Counts the number of cycles 4 uops were delivered to Instruction Decode
Queue (IDQ) from the Decode Stream Buffer (DSB) path. Count includes uops
that may 'bypass' the IDQ.
- idq.ms_mite_uops
- Counts the number of uops initiated by MITE and delivered to Instruction
Decode Queue (IDQ) while the Microcode Sequencer (MS) is busy. Counting
includes uops that may 'bypass' the IDQ.
- idq.all_mite_cycles_any_uops
- Counts the number of cycles uops were delivered to the Instruction Decode
Queue (IDQ) from the MITE (legacy decode pipeline) path. Counting includes
uops that may 'bypass' the IDQ. During these cycles uops are not being
delivered from the Decode Stream Buffer (DSB).
- idq.all_mite_cycles_4_uops
- Counts the number of cycles 4 uops were delivered to the Instruction
Decode Queue (IDQ) from the MITE (legacy decode pipeline) path. Counting
includes uops that may 'bypass' the IDQ. During these cycles uops are not
being delivered from the Decode Stream Buffer (DSB).
- idq.ms_cycles
- Counts cycles during which uops are being delivered to Instruction Decode
Queue (IDQ) while the Microcode Sequencer (MS) is busy. Counting includes
uops that may 'bypass' the IDQ. Uops may be initiated by Decode Stream
Buffer (DSB) or MITE.
- idq.ms_uops
- Counts the total number of uops delivered by the Microcode Sequencer (MS).
Any instruction over 4 uops will be delivered by the MS. Some instructions
such as transcendentals may additionally generate uops from the MS.
- idq.ms_switches
- Number of switches from DSB (Decode Stream Buffer) or MITE (legacy decode
pipeline) to the Microcode Sequencer.
- icache_16b.ifdata_stall
- Cycles where a code line fetch is stalled due to an L1 instruction cache
miss. The legacy decode pipeline works at a 16 Byte granularity.
- icache_64b.iftag_hit
- Instruction fetch tag lookups that hit in the instruction cache (L1I).
Counts at 64-byte cache-line granularity.
- icache_64b.iftag_miss
- Instruction fetch tag lookups that miss in the instruction cache (L1I).
Counts at 64-byte cache-line granularity.
- icache_64b.iftag_stall
- Cycles where a code fetch is stalled due to L1 instruction cache tag
miss.
- itlb_misses.miss_causes_a_walk
- Counts page walks of any page size (4K/2M/4M/1G) caused by a code fetch.
This implies it missed in the ITLB and further levels of TLB, but the walk
need not have completed.
- itlb_misses.walk_completed_4k
- Counts completed page walks (4K page sizes) caused by a code fetch. This
implies it missed in the ITLB (Instruction TLB) and further levels of TLB.
The page walk can end with or without a fault.
- itlb_misses.walk_completed_2m_4m
- Counts completed page walks (2M/4M page sizes) caused by a code fetch.
This implies it missed in the ITLB (Instruction TLB) and further levels of
TLB. The page walk can end with or without a fault.
- itlb_misses.walk_completed_1g
- Counts completed page walks (1G page sizes) caused by a code fetch. This
implies it missed in the ITLB (Instruction TLB) and further levels of TLB.
The page walk can end with or without a fault.
- itlb_misses.walk_completed
- Counts completed page walks (all page sizes) caused by a code fetch. This
implies it missed in the ITLB (Instruction TLB) and further levels of TLB.
The page walk can end with or without a fault.
- itlb_misses.walk_pending
- Counts 1 per cycle for each PMH (Page Miss Handler) that is busy with a
page walk for an instruction fetch request. EPT page walk durations are
excluded in the Skylake microarchitecture.
- itlb_misses.stlb_hit
- Instruction fetch requests that miss the ITLB and hit the STLB.
- ild_stall.lcp
- Counts cycles during which Instruction Length Decoder (ILD) stalls occurred
due to a dynamically changing prefix length of the decoded instruction (caused
by the operand-size prefix 0x66, the address-size prefix 0x67, or REX.W for
Intel64). The count is proportional to the number of prefixes in a 16B-line.
This may result in a three-cycle penalty for each LCP (Length Changing Prefix)
in a 16-byte chunk.
- idq_uops_not_delivered.cycles_fe_was_ok
- Counts cycles FE delivered 4 uops or Resource Allocation Table (RAT) was
stalling FE.
- idq_uops_not_delivered.cycles_le_3_uop_deliv.core
- Cycles with at most 3 uops delivered by the front-end.
- idq_uops_not_delivered.cycles_le_2_uop_deliv.core
- Cycles with at most 2 uops delivered by the front-end.
- idq_uops_not_delivered.cycles_le_1_uop_deliv.core
- Counts, on the per-thread basis, cycles when at most 1 uop is delivered
to Resource Allocation Table (RAT), that is, IDQ_Uops_Not_Delivered.core >=
3.
- idq_uops_not_delivered.cycles_0_uops_deliv.core
- Counts, on the per-thread basis, cycles when no uops are delivered to
Resource Allocation Table (RAT). IDQ_Uops_Not_Delivered.core =4.
- idq_uops_not_delivered.core
- Counts the number of uops not delivered to Resource Allocation Table (RAT)
per thread, adding 4 - x when Resource Allocation Table (RAT) is not stalled
and Instruction Decode Queue (IDQ) delivers x uops to Resource Allocation
Table (RAT) (where x belongs to {0,1,2,3}). Counting does not cover cases
when: a. IDQ-Resource Allocation Table (RAT) pipe serves the other thread.
b. Resource Allocation Table (RAT) is stalled for the thread (including
uop drops and clear BE conditions). c. Instruction Decode Queue (IDQ)
delivers four uops.
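The accounting described for idq_uops_not_delivered.core can be sketched as a small software model: each eligible cycle contributes the shortfall from four delivered uops, and cycles in which the RAT is stalled or the pipe serves the other thread contribute nothing. The tuple layout and function name below are hypothetical and exist only for illustration; this is a model of the description, not of the hardware.

```python
ISSUE_WIDTH = 4  # uops the IDQ can deliver to the RAT per cycle, per the description


def uops_not_delivered(cycles) -> int:
    """Model the idq_uops_not_delivered.core count.

    cycles: iterable of (delivered, rat_stalled, serves_other_thread) tuples,
            where delivered is the number of uops (0..4) handed to the RAT.
    """
    total = 0
    for delivered, rat_stalled, serves_other_thread in cycles:
        if rat_stalled or serves_other_thread:
            continue  # excluded cases (a) and (b)
        total += ISSUE_WIDTH - delivered  # adds 0 when four uops are delivered
    return total


if __name__ == "__main__":
    trace = [
        (4, False, False),  # full delivery: adds 0
        (2, False, False),  # adds 2
        (0, False, False),  # adds 4
        (1, True, False),   # RAT stalled: not counted
        (3, False, True),   # pipe serves the other thread: not counted
    ]
    print(uops_not_delivered(trace))
```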
- uops_dispatched_port.port_0
- Counts, on the per-thread basis, cycles during which at least one uop is
dispatched from the Reservation Station (RS) to port 0.
- uops_dispatched_port.port_1
- Counts, on the per-thread basis, cycles during which at least one uop is
dispatched from the Reservation Station (RS) to port 1.
- uops_dispatched_port.port_2
- Counts, on the per-thread basis, cycles during which at least one uop is
dispatched from the Reservation Station (RS) to port 2.
- uops_dispatched_port.port_3
- Counts, on the per-thread basis, cycles during which at least one uop is
dispatched from the Reservation Station (RS) to port 3.
- uops_dispatched_port.port_4
- Counts, on the per-thread basis, cycles during which at least one uop is
dispatched from the Reservation Station (RS) to port 4.
- uops_dispatched_port.port_5
- Counts, on the per-thread basis, cycles during which at least one uop is
dispatched from the Reservation Station (RS) to port 5.
- uops_dispatched_port.port_6
- Counts, on the per-thread basis, cycles during which at least one uop is
dispatched from the Reservation Station (RS) to port 6.
- uops_dispatched_port.port_7
- Counts, on the per-thread basis, cycles during which at least one uop is
dispatched from the Reservation Station (RS) to port 7.
- resource_stalls.any
- Counts resource-related stall cycles.
- resource_stalls.sb
- Counts allocation stall cycles caused by the store buffer (SB) being full.
This counts cycles that the pipeline back-end blocked uop delivery from
the front-end.
- cycle_activity.cycles_l2_miss
- Cycles while L2 cache miss demand load is outstanding.
- cycle_activity.cycles_l3_miss
- Cycles while L3 cache miss demand load is outstanding.
- cycle_activity.stalls_total
- Total execution stalls.
- cycle_activity.stalls_l2_miss
- Execution stalls while L2 cache miss demand load is outstanding.
- cycle_activity.stalls_l3_miss
- Execution stalls while L3 cache miss demand load is outstanding.
- cycle_activity.cycles_l1d_miss
- Cycles while L1 cache miss demand load is outstanding.
- cycle_activity.stalls_l1d_miss
- Execution stalls while L1 cache miss demand load is outstanding.
- cycle_activity.cycles_mem_any
- Cycles while memory subsystem has an outstanding load.
- cycle_activity.stalls_mem_any
- Execution stalls while memory subsystem has an outstanding load.
- exe_activity.exe_bound_0_ports
- Counts cycles during which no uops were executed on all ports and
Reservation Station (RS) was not empty.
- exe_activity.1_ports_util
- Counts cycles during which a total of 1 uop was executed on all ports and
Reservation Station (RS) was not empty.
- exe_activity.2_ports_util
- Counts cycles during which a total of 2 uops were executed on all ports
and Reservation Station (RS) was not empty.
- exe_activity.3_ports_util
- Counts cycles during which a total of 3 uops were executed on all ports
and Reservation Station (RS) was not empty.
- exe_activity.4_ports_util
- Counts cycles during which a total of 4 uops were executed on all ports
and Reservation Station (RS) was not empty.
- exe_activity.bound_on_stores
- Cycles where the Store Buffer was full and there was no outstanding load.
- lsd.uops
- Number of uops delivered to the back-end by the LSD (Loop Stream
Detector).
- lsd.cycles_active
- Counts the cycles when at least one uop is delivered by the LSD
(Loop-stream detector).
- dsb2mite_switches.count
- Counts the number of Decode Stream Buffer (DSB)-to-MITE switches,
including all misses because of a missing Decode Stream Buffer (DSB) cache
entry and u-arch forced misses. Note: Invoking MITE requires a delay of two
or three cycles.
- dsb2mite_switches.penalty_cycles
- Counts Decode Stream Buffer (DSB)-to-MITE switch true penalty cycles.
These cycles do not include uops routed through because of the switch
itself, for example, when Instruction Decode Queue (IDQ) pre-allocation is
unavailable, or Instruction Decode Queue (IDQ) is full. DSB-to-MITE switch
true penalty cycles happen after the merge mux (MM) receives Decode Stream
Buffer (DSB) Sync-indication until receiving the first MITE uop. MM is
placed before Instruction Decode Queue (IDQ) to merge uops being fed from
the MITE and Decode Stream Buffer (DSB) paths. Decode Stream Buffer (DSB)
inserts the Sync-indication whenever a Decode Stream Buffer (DSB)-to-MITE
switch occurs. Penalty: A Decode Stream Buffer (DSB) hit followed by a
Decode Stream Buffer (DSB) miss can cost up to six cycles in which no uops
are delivered to the IDQ. Most often, such switches from the Decode Stream
Buffer (DSB) to the legacy pipeline cost 0 to 2 cycles.
- itlb.itlb_flush
- Counts the number of flushes of the big or small ITLB pages. Counting
includes both TLB Flush (covering all sets) and TLB Set Clear
(set-specific).
- offcore_requests.demand_data_rd
- Counts the Demand Data Read requests sent to uncore. Use it in conjunction
with OFFCORE_REQUESTS_OUTSTANDING to determine average latency in the
uncore.
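The latency calculation that offcore_requests.demand_data_rd alludes to can be sketched as follows: offcore_requests_outstanding.demand_data_rd sums the number of in-flight requests every cycle, so dividing it by the number of requests gives an average time in the queue per request. The function name and sample values are hypothetical, and the result is an average over the whole measurement interval, not a per-request measurement.

```python
from typing import Optional


def avg_demand_data_rd_latency(outstanding: int, requests: int) -> Optional[float]:
    """Average cycles a Demand Data Read spends outstanding in the uncore.

    outstanding: value of offcore_requests_outstanding.demand_data_rd
    requests:    value of offcore_requests.demand_data_rd
    Returns None when no requests were counted.
    """
    if requests == 0:
        return None
    return outstanding / requests


if __name__ == "__main__":
    # Hypothetical counter readings, not measured data.
    print(avg_demand_data_rd_latency(12000, 100))
```

Both counters should be collected over the same interval and on the same thread for the ratio to be meaningful.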
- offcore_requests.demand_code_rd
- Counts both cacheable and non-cacheable code read requests.
- offcore_requests.demand_rfo
- Counts the demand RFO (read for ownership) requests including regular
RFOs, locks, ItoM.
- offcore_requests.all_data_rd
- Counts the demand and prefetch data reads. All Core Data Reads include
cacheable 'Demands' and L2 prefetchers (not L3 prefetchers). Counting also
covers reads due to page walks resulted from any request type.
- offcore_requests.l3_miss_demand_data_rd
- Demand Data Read requests that miss L3 cache.
- offcore_requests.all_requests
- Counts memory transactions that reached the super queue, including
requests initiated by the core, all L3 prefetches, page walks, etc.
- uops_executed.cycles_ge_4_uops_exec
- Cycles where at least 4 uops were executed per-thread.
- uops_executed.cycles_ge_3_uops_exec
- Cycles where at least 3 uops were executed per-thread.
- uops_executed.cycles_ge_2_uops_exec
- Cycles where at least 2 uops were executed per-thread.
- uops_executed.cycles_ge_1_uop_exec
- Cycles where at least 1 uop was executed per-thread.
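The four uops_executed.cycles_ge_N_uop(s)_exec events are threshold counts over the same per-cycle quantity, so each is a subset of the one with the next lower threshold. The sketch below models this on a made-up per-cycle trace; the trace and the helper name are hypothetical and only illustrate how the thresholds relate.

```python
def cycles_with_at_least(uops_per_cycle, threshold: int) -> int:
    """Count cycles in which at least `threshold` uops were executed."""
    return sum(1 for executed in uops_per_cycle if executed >= threshold)


if __name__ == "__main__":
    # Hypothetical number of uops executed by one thread in each cycle.
    trace = [0, 1, 4, 2, 3, 5, 0]
    for n in (1, 2, 3, 4):
        print(f"cycles_ge_{n}: {cycles_with_at_least(trace, n)}")
```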
- uops_executed.stall_cycles
- Counts cycles during which no uops were dispatched from the Reservation
Station (RS) per thread.
- uops_executed.thread
- Number of uops to be executed per-thread each cycle.
- uops_executed.core
- Number of uops executed from any thread.
- uops_executed.core_cycles_none
- Cycles with no micro-ops executed from any thread on physical core.
- uops_executed.core_cycles_ge_4
- Cycles when at least 4 micro-ops are executed from any thread on the
physical core.
- uops_executed.core_cycles_ge_3
- Cycles when at least 3 micro-ops are executed from any thread on the
physical core.
- uops_executed.core_cycles_ge_2
- Cycles when at least 2 micro-ops are executed from any thread on the
physical core.
- uops_executed.core_cycles_ge_1
- Cycles when at least 1 micro-op is executed from any thread on the
physical core.
- uops_executed.x87
- Counts the number of x87 uops executed.
- offcore_requests_buffer.sq_full
- Counts the number of cases when the offcore requests buffer cannot take
more entries for the core. This can happen when the superqueue does not
contain eligible entries, or when the L1D writeback pending FIFO is
full. Note: Writeback pending FIFO has six entries.
- tlb_flush.dtlb_thread
- Counts the number of DTLB flush attempts of the thread-specific
entries.
- tlb_flush.stlb_any
- Counts the number of any STLB flush attempts (such as entire, VPID, PCID,
InvPage, CR3 write, etc.).
- inst_retired.any_p
- Counts the number of instructions (EOMs) retired. Counting covers
macro-fused instructions individually (that is, increments by two).
The following errata may apply to this: SKL091, SKL044
- inst_retired.prec_dist
- A version of INST_RETIRED that allows for a more unbiased distribution of
samples across instructions retired. It utilizes the Precise Distribution
of Instructions Retired (PDIR) feature to mitigate some bias in how
retired instructions get sampled.
The following errata may apply to this: SKL091, SKL044
- inst_retired.total_cycles_ps
- Number of cycles using an always true condition applied to PEBS
instructions retired event. (inst_ret < 16)
The following errata may apply to this: SKL091, SKL044
- other_assists.any
- Number of times a microcode assist is invoked by HW other than FP-assist.
Examples include AD (page Access Dirty) and AVX* related assists.
- uops_retired.total_cycles
- Number of cycles using an always true condition (uops_ret < 16) applied to
the non-PEBS uops retired event.
- uops_retired.stall_cycles
- This event counts cycles without actually retired uops.
- uops_retired.retire_slots
- Counts the retirement slots used.
- uops_retired.macro_fused
- Counts the number of macro-fused uops retired. (non precise)
- machine_clears.count
- Number of machine clears (nukes) of any type.
- machine_clears.memory_ordering
- Counts the number of memory ordering Machine Clears detected. Memory
Ordering Machine Clears can result from one of the following: (a) memory
disambiguation, (b) external snoop, or (c) cross SMT-HW-thread snoop
(stores) hitting load buffer.
The following errata may apply to this: SKL089
- machine_clears.smc
- Counts self-modifying code (SMC) detected, which causes a machine
clear.
- br_inst_retired.all_branches
- Counts all (macro) branch instructions retired.
The following errata may apply to this: SKL091
- br_inst_retired.conditional
- This event counts conditional branch instructions retired.
The following errata may apply to this: SKL091
- br_inst_retired.near_call
- This event counts both direct and indirect near call instructions retired.
The following errata may apply to this: SKL091
- br_inst_retired.all_branches_pebs
- This is a precise version of BR_INST_RETIRED.ALL_BRANCHES that counts all
(macro) branch instructions retired.
The following errata may apply to this: SKL091
- br_inst_retired.near_return
- This event counts return instructions retired.
The following errata may apply to this: SKL091
- br_inst_retired.not_taken
- This event counts not taken branch instructions retired.
The following errata may apply to this: SKL091
- br_inst_retired.cond_ntaken
- This event counts not taken branch instructions retired.
The following errata may apply to this: SKL091
- br_inst_retired.near_taken
- This event counts taken branch instructions retired.
The following errata may apply to this: SKL091
- br_inst_retired.far_branch
- This event counts far branch instructions retired.
The following errata may apply to this: SKL091
- br_misp_retired.all_branches
- Counts all the retired branch instructions that were mispredicted by the
processor. A branch misprediction occurs when the processor incorrectly
predicts the destination of the branch. When the misprediction is
discovered at execution, all the instructions executed in the wrong
(speculative) path must be discarded, and the processor must start
fetching from the correct path.
- br_misp_retired.conditional
- This event counts mispredicted conditional branch instructions
retired.
- br_misp_retired.near_call
- Counts both taken and not taken retired mispredicted direct and indirect
near calls, including both register and memory indirect.
- br_misp_retired.all_branches_pebs
- This is a precise version of BR_MISP_RETIRED.ALL_BRANCHES that counts all
mispredicted macro branch instructions retired.
- br_misp_retired.near_taken
- Number of near branch instructions retired that were mispredicted and
taken.
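The br_inst_retired and br_misp_retired counts are commonly combined into a misprediction ratio and a mispredictions-per-kilo-instruction (MPKI) figure. The following is a minimal sketch of that arithmetic. The function name and sample numbers are hypothetical, and it assumes br_inst_retired.all_branches, br_misp_retired.all_branches, and inst_retired.any_p were collected over the same interval.

```python
def branch_mispredict_stats(br_inst_all, br_misp_all, inst_retired):
    """Derive two common ratios from raw event counts.

    br_inst_all  -- count of br_inst_retired.all_branches
    br_misp_all  -- count of br_misp_retired.all_branches
    inst_retired -- count of inst_retired.any_p
    """
    if br_inst_all <= 0 or inst_retired <= 0:
        raise ValueError("branch and instruction counts must be positive")
    return {
        # Fraction of retired branches that were mispredicted.
        "mispredict_ratio": br_misp_all / br_inst_all,
        # Mispredicted branches per 1000 retired instructions.
        "mpki": 1000.0 * br_misp_all / inst_retired,
    }


# Hypothetical sample values, not measured data.
stats = branch_mispredict_stats(200_000, 5_000, 1_000_000)
```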
- fp_arith_inst_retired.scalar_double
- Number of SSE/AVX computational scalar double precision floating-point
instructions retired; some instructions will count twice as noted below.
Each count represents 1 computation. Applies to SSE* and AVX* scalar
double precision floating-point instructions: ADD SUB MUL DIV MIN MAX
RCP14 RSQRT14 SQRT DPP FM(N)ADD/SUB. DPP and FM(N)ADD/SUB instructions
count twice as they perform 2 calculations per element.
- fp_arith_inst_retired.scalar_single
- Number of SSE/AVX computational scalar single precision floating-point
instructions retired; some instructions will count twice as noted below.
Each count represents 1 computation. Applies to SSE* and AVX* scalar
single precision floating-point instructions: ADD SUB MUL DIV MIN MAX
RCP14 RSQRT14 SQRT DPP FM(N)ADD/SUB. DPP and FM(N)ADD/SUB instructions
count twice as they perform 2 calculations per element.
- fp_arith_inst_retired.128b_packed_double
- Number of SSE/AVX computational 128-bit packed double precision
floating-point instructions retired; some instructions will count twice as
noted below. Each count represents 2 computation operations, one for each
element. Applies to SSE* and AVX* packed double precision floating-point
instructions: ADD SUB HADD HSUB SUBADD MUL DIV MIN MAX SQRT RSQRT14 RCP14
DPP FM(N)ADD/SUB. DPP and FM(N)ADD/SUB instructions count twice as they
perform 2 calculations per element.
- fp_arith_inst_retired.128b_packed_single
- Number of SSE/AVX computational 128-bit packed single precision
floating-point instructions retired; some instructions will count twice as
noted below. Each count represents 4 computation operations, one for each
element. Applies to SSE* and AVX* packed single precision floating-point
instructions: ADD SUB MUL DIV MIN MAX RCP14 RSQRT14 SQRT DPP FM(N)ADD/SUB.
DPP and FM(N)ADD/SUB instructions count twice as they perform 2
calculations per element.
- fp_arith_inst_retired.256b_packed_double
- Number of SSE/AVX computational 256-bit packed double precision
floating-point instructions retired; some instructions will count twice as
noted below. Each count represents 4 computation operations, one for each
element. Applies to SSE* and AVX* packed double precision floating-point
instructions: ADD SUB MUL DIV MIN MAX RCP14 RSQRT14 SQRT DPP FM(N)ADD/SUB.
DPP and FM(N)ADD/SUB instructions count twice as they perform 2
calculations per element.
- fp_arith_inst_retired.256b_packed_single
- Number of SSE/AVX computational 256-bit packed single precision
floating-point instructions retired; some instructions will count twice as
noted below. Each count represents 8 computation operations, one for each
element. Applies to SSE* and AVX* packed single precision floating-point
instructions: ADD SUB MUL DIV MIN MAX RCP14 RSQRT14 SQRT DPP FM(N)ADD/SUB.
DPP and FM(N)ADD/SUB instructions count twice as they perform 2
calculations per element.
- fp_arith_inst_retired.512b_packed_double
- Number of SSE/AVX computational 512-bit packed double precision
floating-point instructions retired; some instructions will count twice as
noted below. Each count represents 8 computation operations, one for each
element. Applies to SSE* and AVX* packed double precision floating-point
instructions: ADD SUB MUL DIV MIN MAX RCP14 RSQRT14 SQRT DPP FM(N)ADD/SUB.
DPP and FM(N)ADD/SUB instructions count twice as they perform 2
calculations per element.
- fp_arith_inst_retired.512b_packed_single
- Number of SSE/AVX computational 512-bit packed single precision
floating-point instructions retired; some instructions will count twice as
noted below. Each count represents 16 computation operations, one for each
element. Applies to SSE* and AVX* packed single precision floating-point
instructions: ADD SUB MUL DIV MIN MAX RCP14 RSQRT14 SQRT DPP FM(N)ADD/SUB.
DPP and FM(N)ADD/SUB instructions count twice as they perform 2
calculations per element.
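Because each fp_arith_inst_retired count represents a fixed number of computation operations (1 for scalar, and one per element for packed forms), a floating-point operation total can be estimated as a weighted sum. The sketch below takes its weights from the descriptions above. The table name, function name, and sample counts are hypothetical. No extra factor is applied for FM(N)ADD/SUB or DPP, since the descriptions state that those instructions already count twice.

```python
# Computation operations represented by one count of each
# fp_arith_inst_retired.* sub-event, per the descriptions above.
OPS_PER_COUNT = {
    "scalar_double": 1,
    "scalar_single": 1,
    "128b_packed_double": 2,
    "128b_packed_single": 4,
    "256b_packed_double": 4,
    "256b_packed_single": 8,
    "512b_packed_double": 8,
    "512b_packed_single": 16,
}


def estimate_fp_ops(counts):
    """Sum raw counts weighted by operations per count.

    counts maps a sub-event suffix (e.g. "256b_packed_double") to its
    raw counter value; sub-events that were not measured may be omitted.
    """
    unknown = set(counts) - set(OPS_PER_COUNT)
    if unknown:
        raise KeyError(f"unknown sub-events: {sorted(unknown)}")
    return sum(OPS_PER_COUNT[name] * value for name, value in counts.items())


# Hypothetical sample values, not measured data.
total = estimate_fp_ops(
    {"scalar_double": 10, "256b_packed_double": 5, "512b_packed_single": 2}
)
```

Dividing the total by elapsed time gives an operations-per-second estimate. Masked or partially used vector lanes are not accounted for in this sketch.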
- hle_retired.start
- Number of times we entered an HLE region. Does not count nested
transactions.
- hle_retired.commit
- Number of times HLE commit succeeded.
- hle_retired.aborted
- Number of times HLE abort was triggered.
- hle_retired.aborted_mem
- Number of times an HLE execution aborted due to various memory events
(e.g., read/write capacity and conflicts).
- hle_retired.aborted_timer
- Number of times an HLE execution aborted due to hardware timer
expiration.
- hle_retired.aborted_unfriendly
- Number of times an HLE execution aborted due to HLE-unfriendly
instructions and certain unfriendly events (such as AD assists etc.).
- hle_retired.aborted_memtype
- Number of times an HLE execution aborted due to incompatible memory
type.
- hle_retired.aborted_events
- Number of times an HLE execution aborted due to unfriendly events (such as
interrupts).
- rtm_retired.start
- Number of times we entered an RTM region. Does not count nested
transactions.
- rtm_retired.commit
- Number of times RTM commit succeeded.
- rtm_retired.aborted
- Number of times RTM abort was triggered.
- rtm_retired.aborted_mem
- Number of times an RTM execution aborted due to various memory events
(e.g. read/write capacity and conflicts).
- rtm_retired.aborted_timer
- Number of times an RTM execution aborted due to uncommon conditions.
- rtm_retired.aborted_unfriendly
- Number of times an RTM execution aborted due to HLE-unfriendly
instructions.
- rtm_retired.aborted_memtype
- Number of times an RTM execution aborted due to incompatible memory
type.
- rtm_retired.aborted_events
- Number of times an RTM execution aborted due to none of the previous 4
categories (e.g. interrupt).
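One way to read the rtm_retired counts together is as an abort ratio plus a per-cause share of aborts. The following is a minimal sketch of that calculation. The function name and sample numbers are hypothetical, and this page does not state whether the per-cause events are mutually exclusive or sum to rtm_retired.aborted, so the shares should be treated as indicative only.

```python
def rtm_abort_breakdown(start, aborted, by_cause):
    """Summarize RTM abort behavior from raw counts.

    start    -- count of rtm_retired.start
    aborted  -- count of rtm_retired.aborted
    by_cause -- mapping of per-cause event suffix (e.g. "aborted_mem")
                to its raw count
    """
    if start <= 0:
        raise ValueError("rtm_retired.start must be positive")
    result = {"abort_ratio": aborted / start}
    for cause, count in by_cause.items():
        # Share of all aborts attributed to this cause; 0.0 if no aborts.
        result[cause + "_share"] = count / aborted if aborted else 0.0
    return result


# Hypothetical sample values, not measured data.
summary = rtm_abort_breakdown(
    1000, 200, {"aborted_mem": 150, "aborted_events": 50}
)
```

The same arithmetic applies to the hle_retired events, which follow the same naming scheme.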
- fp_assist.any
- Counts cycles with any input and output SSE or x87 FP assist. If an input
and output assist are detected on the same cycle the event increments by
1.
- hw_interrupts.received
- Counts the number of hardware interrupts received by the
processor.
- rob_misc_events.lbr_inserts
- Increments when an entry is added to the Last Branch Record (LBR) array
(or removed from the array in case of RETURNs in call stack mode). The
event requires LBR enable via IA32_DEBUGCTL MSR and branch type selection
via MSR_LBR_SELECT.
- rob_misc_events.pause_inst
- Number of retired PAUSE instructions (that do not end up with a VMExit to
the VMM; TSX aborted Instructions may be counted). This event is not
supported on first SKL and KBL products.
- mem_inst_retired.stlb_miss_loads
- Retired load instructions that miss the STLB.
- mem_inst_retired.stlb_miss_stores
- Retired store instructions that miss the STLB.
- mem_inst_retired.lock_loads
- Retired load instructions with locked access.
- mem_inst_retired.split_loads
- Counts retired load instructions that split across a cacheline
boundary.
- mem_inst_retired.split_stores
- Counts retired store instructions that split across a cacheline
boundary.
- mem_inst_retired.all_loads
- All retired load instructions.
- mem_inst_retired.all_stores
- All retired store instructions.
- mem_load_retired.l1_hit
- Counts retired load instructions with at least one uop that hit in the L1
data cache. This event includes all SW prefetches and lock instructions
regardless of the data source.
- mem_load_retired.l2_hit
- Retired load instructions with L2 cache hits as data sources.
- mem_load_retired.l3_hit
- Counts retired load instructions with at least one uop that hit in the L3
cache.
- mem_load_retired.l1_miss
- Counts retired load instructions with at least one uop that missed in the
L1 cache.
- mem_load_retired.l2_miss
- Retired load instructions that missed the L2 cache as data sources.
- mem_load_retired.l3_miss
- Counts retired load instructions with at least one uop that missed in the
L3 cache.
- mem_load_retired.fb_hit
- Counts retired load instructions with at least one uop that missed in
the L1 but hit the FB (Fill Buffers) due to a preceding miss to the same
cache line with data not ready.
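The mem_load_retired hit and miss counts can be turned into a rough picture of where retired loads were satisfied. The sketch below normalizes five of the counts against their own sum. The function name and sample numbers are hypothetical, and treating l1_hit, fb_hit, l2_hit, l3_hit, and l3_miss as disjoint outcomes is an assumption that this page does not confirm.

```python
def load_source_shares(l1_hit, fb_hit, l2_hit, l3_hit, l3_miss):
    """Return each count as a fraction of the sum of all five counts.

    Arguments are raw counts of the matching mem_load_retired.* events.
    """
    counts = {
        "l1_hit": l1_hit,
        "fb_hit": fb_hit,
        "l2_hit": l2_hit,
        "l3_hit": l3_hit,
        "l3_miss": l3_miss,
    }
    total = sum(counts.values())
    if total <= 0:
        raise ValueError("at least one count must be positive")
    return {name: value / total for name, value in counts.items()}


# Hypothetical sample values, not measured data.
shares = load_source_shares(800, 50, 100, 30, 20)
```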
- mem_load_l3_hit_retired.xsnp_miss
- Retired load instructions whose data sources were L3 hits and whose
cross-core snoop missed in the on-pkg core cache.
- mem_load_l3_hit_retired.xsnp_hit
- Retired load instructions whose data sources were L3 hits and cross-core
snoop hits in the on-pkg core cache.
- mem_load_l3_hit_retired.xsnp_hitm
- Retired load instructions whose data sources were HitM responses from
shared L3.
- mem_load_l3_hit_retired.xsnp_none
- Retired load instructions whose data sources were hits in L3 without
snoops required.
- mem_load_l3_miss_retired.local_dram
- Retired load instructions whose data sources missed L3 but were serviced
from local DRAM.
- mem_load_l3_miss_retired.remote_dram
- Retired load instructions whose data sources missed L3 but were serviced
from remote DRAM.
- mem_load_l3_miss_retired.remote_hitm
- Retired load instructions whose data source was a remote HITM.
- mem_load_l3_miss_retired.remote_fwd
- Retired load instructions whose data source was forwarded from a remote
cache.
- mem_load_misc_retired.uc
- Retired instructions with at least 1 uncacheable load or lock.
- baclears.any
- Counts the number of times the front-end is resteered when it finds a
branch instruction in a fetch line. This occurs for the first time a
branch instruction is fetched or when the branch is not tracked by the BPU
(Branch Prediction Unit) anymore.
- core_snoop_response.rsp_ihiti
- tbd
- core_snoop_response.rsp_ihitfse
- tbd
- core_snoop_response.rsp_shitfse
- tbd
- core_snoop_response.rsp_sfwdm
- tbd
- core_snoop_response.rsp_ifwdm
- tbd
- core_snoop_response.rsp_ifwdfe
- tbd
- core_snoop_response.rsp_sfwdfe
- tbd
- l2_trans.l2_wb
- Counts L2 writebacks that access L2 cache.
- l2_lines_in.all
- Counts the number of L2 cache lines filling the L2. Counting does not
cover rejects.
- l2_lines_out.silent
- Counts the number of lines that are silently dropped by L2 cache when
triggered by an L2 cache fill. These lines are typically in Shared state.
A non-threaded event.
- l2_lines_out.non_silent
- Counts the number of lines that are evicted by L2 cache when triggered by
an L2 cache fill. Those lines can be either in modified state or clean
state. Modified lines may either be written back to L3 or directly written
to memory and not allocated in L3. Clean lines may either be allocated in
L3 or dropped.
- l2_lines_out.useless_pref
- This event is deprecated. Refer to new event
L2_LINES_OUT.USELESS_HWPF.
- l2_lines_out.useless_hwpf
- Counts the number of lines that have been hardware prefetched but not used
and now evicted by L2 cache.
- sq_misc.split_lock
- Counts the number of cache line split locks sent to the uncore.
- idi_misc.wb_upgrade
- Counts number of cache lines that are allocated and written back to L3
with the intention that they are more likely to be reused shortly.
- idi_misc.wb_downgrade
- Counts number of cache lines that are dropped and not written back to L3
as they are deemed to be less likely to be reused shortly.