AMD_F1AH_ZEN5_EVENTS(3CPC) CPU Performance Counters Library Functions AMD_F1AH_ZEN5_EVENTS(3CPC)

amd_f1ah_zen5_events - AMD Family 1ah Zen5 processor performance monitoring events

This manual page describes events specific to AMD Family 1ah Zen5 processors. For more information, please consult the appropriate AMD BIOS and Kernel Developer's guide or Open-Source Register Reference.

Each of the events listed below includes the AMD mnemonic, which matches the name found in the AMD manual, and a brief summary of the event. If available, a more detailed description of the event follows, and then any additional unit values that modify the event. A unit is combined with an event to create a new event in the system by placing the '.' character between the event name and the unit name.
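As a minimal sketch of the naming rule above, the helper below joins an event name and a unit name with the '.' character. The function name and the names `example_event` and `example_unit` are hypothetical placeholders, not real Zen5 mnemonics; substitute the mnemonics from the AMD manual.

```python
def event_with_unit(event: str, unit: str) -> str:
    """Join an event name and a unit name with '.', per the rule above."""
    if not event or not unit:
        raise ValueError("both an event name and a unit name are required")
    return f"{event}.{unit}"


# Hypothetical placeholder names, for illustration only.
spec = event_with_unit("example_event", "example_unit")
print(spec)
```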

The following events are supported:

FP retired x87 uops

Number of retired x87 arithmetic operations. Can be used to calculate x87 FLOPs.

This event has the following units which may be used to modify the behavior of the event:

x87 Divide or square root uops.
x87 Multiply uops.
x87 Add/subtract uops.
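The FLOPs calculation mentioned above can be sketched as a sum over the three units. This assumes each retired x87 arithmetic uop counts as one floating-point operation, which this page does not state explicitly; the function and parameter names are hypothetical.

```python
def x87_flops(div_sqrt_uops: int, mul_uops: int, add_sub_uops: int) -> int:
    """Total x87 floating-point operations over a measurement interval."""
    # Assumption: one retired x87 arithmetic uop equals one FLOP.
    return div_sqrt_uops + mul_uops + add_sub_uops


def flops_per_second(total_flops: int, elapsed_seconds: float) -> float:
    """Average rate over the interval in which the counts were collected."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return total_flops / elapsed_seconds
```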
FP retired SSE and AVX FLOPs

Number of SSE and AVX floating point arithmetic operations retired. Number of arithmetic operations retired is dependent on number of uops retired, data size (scalar/128/256/512), data type (BF16/FP16/FP32/FP64) and type of operation (add/sub/mul/mac/...). Use MergeEvent feature for accurate results.

FP uops retired by size

Report number of FP uops retired by size. Can be used to determine how vectorized code is and how much MMX / x87 content is in the code.

This event has the following units which may be used to modify the behavior of the event:

Packed 512-bit uops retired.
Packed 256-bit uops retired.
Packed 128-bit uops retired.
Scalar uops retired.
MMX uops retired.
x87 uops retired.
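One way to judge how vectorized code is from these units is to compare packed uops with all FP uops retired. The sketch below assumes the six unit counts were collected over the same interval; the dictionary keys are hypothetical labels, not unit mnemonics.

```python
def packed_fraction(counts: dict) -> float:
    """Fraction of retired FP uops that were packed (128, 256, or 512 bit)."""
    packed = counts["packed512"] + counts["packed256"] + counts["packed128"]
    total = packed + counts["scalar"] + counts["mmx"] + counts["x87"]
    # An interval with no FP uops has no meaningful fraction; report 0.0.
    return packed / total if total else 0.0


sample = {"packed512": 10, "packed256": 20, "packed128": 30,
          "scalar": 30, "mmx": 5, "x87": 5}
print(packed_fraction(sample))
```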
FP uops retired sorted by vector or scalar

Number of FP uops retired of selected type sorted by vector (AVX/SSE packed) or scalar (x87, AVX/SSE scalar). Can be used to profile FP codes.

FP executed integer type uops sorted by vector or scalar

Number of integer uops executed in the FP retired of selected type sorted by vector (SSE/AVX) or scalar (MMX). Can be used to profile vector INT / MMX codes.

FP uops retired sorted by packed 128 or packed 256

Number of FP uops retired of selected type sorted by 128-bit packed dest (XMM) or 256-bit packed dest (YMM). Can be used to profile FP codes.

FP executed packed integer uops sorted by packed 128 or packed 256

Number of integer uops executed in FP retired of selected type sorted by 128-bit packed dest (XMM) or 256-bit packed dest (YMM). Can be used to profile FP codes.

FP Dispatch Faults

Number of FP dispatch faults triggered by type. Dispatch fill/spill faults occur when FP either does not have the data needed to operate on in its local registers (fill), or FP needs to empty out upper register data for proper SSE merging behavior when executing AVX code (spill).

This event has the following units which may be used to modify the behavior of the event:

YMM spill fault
YMM fill fault
XMM fill fault
x87 fill fault

Bad Status 2

Store To Load Interlock (STLI) events are loads that were unable to complete because of a possible match with an older store, where the older store could not do Store To Load Forwarding (STLF) for some reason.

This event has the following units which may be used to modify the behavior of the event:

Store-to-load conflicts: A load was unable to complete due to a non-forwardable conflict with an older store. Most commonly, a load's address range partially but not completely overlaps with an uncompleted older store. Software can avoid this problem by using same-size and same-alignment loads and stores when accessing the same data. Vector/SIMD code is particularly susceptible to this problem; software should construct wide vector stores by manipulating vector elements in registers using shuffle/blend/swap instructions prior to storing to memory, instead of using narrow element-by-element stores.
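The most common conflict described above, a load that partially but not completely overlaps an older store, can be illustrated with a simple address-range check. This is a simplified model for reasoning about access patterns, not a description of the hardware's actual forwarding rules; the function name is hypothetical.

```python
def partially_overlaps(load_addr: int, load_size: int,
                       store_addr: int, store_size: int) -> bool:
    """True if the load touches the store's bytes without lying fully inside them."""
    load_end = load_addr + load_size
    store_end = store_addr + store_size
    overlaps = load_addr < store_end and store_addr < load_end
    contained = store_addr <= load_addr and load_end <= store_end
    return overlaps and not contained


# Same-size, same-alignment access: no partial overlap.
print(partially_overlaps(0x1000, 8, 0x1000, 8))
# A 16-byte vector load after a narrow 4-byte element store: partial overlap.
print(partially_overlaps(0x1000, 16, 0x1000, 4))
```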
Retired Lock Instructions

Counts retired atomic read-modify-write instructions with a LOCK prefix.

Retired CLFLUSH Instructions

The number of retired CLFLUSH instructions. This is a non-speculative event.

Retired CPUID Instructions

The number of CPUID instructions retired.

LS Dispatch

Counts the number of operations dispatched to the LS unit. Unit Masks events are ADDed.

This event has the following units which may be used to modify the behavior of the event:

Dispatch of a single op that performs a load from and store to the same memory address.
Dispatch of a single op that performs a memory store.
Dispatch of a single op that performs a memory load.

SMIs Received

Counts the number of System Management Interrupts (SMIs) received.

Interrupts Taken

Counts the number of interrupts taken.

This event has the following units which may be used to modify the behavior of the event:

Number of interrupts taken. This event is also counted when UnitMask[7:0]=0.

Store to Load Forward

Number of STLF hits.

Store Globally Visible Cancels 2

Counts reasons why a Store Coalescing Buffer (SCB) commit is canceled.

This event has the following units which may be used to modify the behavior of the event:

Older SCB we are waiting on to become globally visible was unable to become globally visible.

LS MAB Allocates by Type

Counts when an LS pipe allocates a Miss Address Buffer (MAB) entry to make a miss request.

Demand Data Cache Fills by Data Source

Counts fills into the DC that were initiated by demand ops, per data source.

This event has the following units which may be used to modify the behavior of the event:

Requests that return from Extension Memory.
Requests that target another NUMA node and return from DRAM or MMIO.
Requests that target another NUMA node and return from another CCX's cache.
Requests that target the same NUMA node and return from DRAM or MMIO.
Requests that target the same NUMA node and return from another CCX's cache.
Data returned from L3 or different L2 in the same CCX.
Data returned from local L2.
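Per-source counts such as these are often easier to read as shares of the total. The sketch below assumes one count per data source collected over the same interval; the source labels are hypothetical, not unit mnemonics.

```python
def fill_shares(fills: dict) -> dict:
    """Convert per-source fill counts into fractions of all fills."""
    total = sum(fills.values())
    if total == 0:
        return {source: 0.0 for source in fills}
    return {source: count / total for source, count in fills.items()}


sample_fills = {"local_l2": 50, "l3_or_other_l2": 30, "dram_same_node": 20}
for source, share in fill_shares(sample_fills).items():
    print(f"{source}: {share:.1%}")
```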
Any Data Cache Fills by Data Source

Counts all fills into the DC, per data source.

This event has the following units which may be used to modify the behavior of the event:

Requests that return from Extension Memory.
Requests that target another NUMA node and return from DRAM or MMIO.
Requests that target another NUMA node and return from another CCX's cache.
Requests that target the same NUMA node and return from DRAM or MMIO.
Requests that target the same NUMA node and return from another CCX's cache.
Data returned from L3 or different L2 in the same CCX.
Data returned from local L2.

L1 DTLB Reloads

Counts L1 DTLB reloads.

This event has the following units which may be used to modify the behavior of the event:

DTLB reload to a 1G page that missed in the L2DTLB.
DTLB reload to a 2M page that missed in the L2DTLB.
DTLB reload to a coalesced page that missed in the L2DTLB.
DTLB reload to a 4K page that missed in the L2DTLB.
DTLB reload to a 1G page that hit in the L2DTLB.
DTLB reload to a 2M page that hit in the L2DTLB.
DTLB reload to a coalesced page that hit in the L2DTLB.
DTLB reload to a 4K page that hit in the L2DTLB.

Misaligned Load Flows

The number of misaligned load flows.

This event has the following units which may be used to modify the behavior of the event:

The number of 4KB misaligned (i.e., page crossing) loads or LdOpSt.
The number of 64B misaligned (i.e., cacheline crossing) loads or LdOpSt.
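Whether a given access would fall into either category can be checked from its address and size. The sketch below tests whether the first and last byte of an access land in different 64-byte cache lines or 4 KB pages; the function and constant names are hypothetical.

```python
CACHE_LINE_BYTES = 64
PAGE_BYTES = 4096


def crosses_boundary(addr: int, size: int, boundary: int) -> bool:
    """True if the access [addr, addr + size) spans more than one aligned block."""
    if size <= 0:
        raise ValueError("size must be positive")
    return addr // boundary != (addr + size - 1) // boundary


# An 8-byte load at offset 60 crosses a cache line but not a page.
print(crosses_boundary(60, 8, CACHE_LINE_BYTES), crosses_boundary(60, 8, PAGE_BYTES))
```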
Prefetch Instructions Dispatched

Software prefetch instructions dispatched (speculative).

This event has the following units which may be used to modify the behavior of the event:

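PrefetchNTA instruction. See PREFETCHlevel in docAPM3 (AMD64 Architecture Programmer's Manual, Volume 3).
PrefetchW instruction. See PREFETCHlevel in docAPM3 (AMD64 Architecture Programmer's Manual, Volume 3).
PrefetchT0, T1, and T2 instructions. See PREFETCHlevel in docAPM3 (AMD64 Architecture Programmer's Manual, Volume 3).

Write Combining Buffer Close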

Counts events that cause a Write Combining Buffer (WCB) entry to close.

This event has the following units which may be used to modify the behavior of the event:

All 64 bytes of the WCB entry have been written.

Ineffective Software Prefetches

The number of software prefetches that did not fetch data outside of the processor core.

This event has the following units which may be used to modify the behavior of the event:

Software PREFETCH instruction saw a match on an already-allocated miss request.
Software PREFETCH instruction saw a DC hit.

Software Prefetch Data Cache Fills by Data Source

Counts fills into the DC that were initiated by software prefetch instructions, per data source.

This event has the following units which may be used to modify the behavior of the event:

Requests that return from Extension Memory.
Requests that target another NUMA node and return from DRAM or MMIO.
Requests that target another NUMA node and return from another CCX's cache.
Requests that target the same NUMA node and return from DRAM or MMIO.
Requests that target the same NUMA node and return from another CCX's cache.
Data returned from L3 or different L2 in the same CCX.
Data returned from local L2.

Hardware Prefetch Data Cache Fills by Data Source

Counts fills into the DC that were initiated by hardware prefetches, per data source.

This event has the following units which may be used to modify the behavior of the event:

Requests that return from Extension Memory.
Requests that target another NUMA node and return from DRAM or MMIO.
Requests that target another NUMA node and return from another CCX's cache.
Requests that target the same NUMA node and return from DRAM or MMIO.
Requests that target the same NUMA node and return from another CCX's cache.
Data returned from L3 or different L2 in the same CCX.
Data returned from local L2.

Allocated DC misses

Counts the number of in-flight DC misses each cycle.

Cycles Not in Halt

Counts cycles when the thread is not in a HALTed state.

All TLB Flushes

TLB flush events.

P0 Freq Cycles not in Halt

Counts cycles not in Halt, at the P0 P-state frequency, regardless of the current P-state.

This event has the following units which may be used to modify the behavior of the event:

Counts at the P0 frequency (same as Core::X86::Msr::MPERF) when not in Halt.
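Comparing this event with Cycles Not in Halt gives a rough estimate of the average clock rate while the thread was not halted, in the same way that APERF is commonly compared with MPERF. The sketch assumes that Cycles Not in Halt advances at the actual core clock, that both counts cover the same interval, and that the P0 frequency is known from elsewhere; the function and parameter names are hypothetical.

```python
def effective_hz(cycles_not_in_halt: int,
                 p0_cycles_not_in_halt: int,
                 p0_hz: float) -> float:
    """Estimate the average non-halted clock rate over one interval."""
    if p0_cycles_not_in_halt <= 0:
        raise ValueError("p0_cycles_not_in_halt must be positive")
    # Ratio of actual cycles to P0-rate cycles scales the P0 frequency.
    return p0_hz * cycles_not_in_halt / p0_cycles_not_in_halt
```

A result above `p0_hz` would suggest the core ran boosted for part of the interval; a result below it would suggest a lower P-state.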
Instruction Cache Refills From L2

The number of 64-byte instruction cache lines fulfilled from the L2 cache.

Instruction Cache Refills from System

The number of 64-byte instruction cache lines fulfilled from system memory or another cache.

L1 ITLB Miss, L2 ITLB Hit

The number of instruction fetches that miss in the L1 ITLB but hit in the L2 ITLB.

L1 ITLB Miss, L2 ITLB Miss

The number of instruction fetches that miss in both the L1 ITLB and L2 ITLB.

This event has the following units which may be used to modify the behavior of the event:

Walk for >4k coalesced page (implemented as 16k)
Walk for 1G page
Walk for 2M page
Walk for 4k page

BP Pipe Correction or Cancel

The Branch Predictor flushed its own pipeline due to internal conditions such as a second level prediction structure. Does not count the number of bubbles caused by these internal flushes.

Variable Target Predictions

The number of times a branch used the indirect predictor to make a prediction.

Early Redirects

Number of times that an Early Redirect is sent to Branch Predictor. This happens when either the decoder or dispatch logic is able to detect that the Branch Predictor needs to be redirected.

ITLB Instruction Fetch Hits

The number of instruction fetches that hit in the L1 ITLB.

This event has the following units which may be used to modify the behavior of the event:

L1 Instruction TLB Hit (1G page size)
L1 Instruction TLB Hit (2M page size)
L1 Instruction TLB Hit (4k or 16k coalesced page size)

BP Redirects

Counts redirects of the branch predictor. To support legacy software, counts both EX mispredicts and resyncs when UnitMask[7:0] is set to 0.

This event has the following units which may be used to modify the behavior of the event:

Mispredict redirect from EX (execution-time)
Resync redirect (Retire-time) from RT

Fetch IBS events

Counts significant Fetch IBS State transitions.

This event has the following units which may be used to modify the behavior of the event:

Counts the number of valid Fetch Instruction Based Sampling (fetch IBS) samples that were collected. Each valid sample also created an IBS interrupt.
Counts the number of Fetch IBS tagged fetches that were discarded due to IBS filtering. When a tagged fetch is discarded the Fetch IBS facility will automatically tag a new fetch.
Counts when the Fetch IBS facility discards an IBS tagged fetch for reasons other than IBS filtering. When a tagged fetch is discarded the Fetch IBS facility will automatically tag a new fetch.
Counts the number of fetches tagged for Fetch IBS. Not all tagged fetches create an IBS interrupt and valid fetch sample.

IC Tag Hit and Miss Events

Counts the number of microtag and full tag events as selected by unit mask.

Op Cache Hit or Miss

Counts Op Cache micro-tag hit/miss events.

Op Queue Empty

Cycles where the Op Queue is empty.

Source of Op Dispatched From Decoder

Counts the number of ops dispatched from the decoder classified by op source.

This event has the following units which may be used to modify the behavior of the event:

Count of ops dispatched from OpCache
Count of ops dispatched from x86 decoder

Types of Ops Dispatched From Decoder

Counts the number of ops dispatched from the decoder classified by op type. The UnitMask value encodes which types of ops are counted.

Dynamic Tokens Dispatch Stall Cycles 1

Cycles where a dispatch group is valid but does not get dispatched due to a token stall. UnitMask bits select the stall types included in the count.

This event has the following units which may be used to modify the behavior of the event:

FP NSQ token stall.
Taken branch buffer resource stall.
STQ tokens unavailable.
Load queue token stall.
Integer physical register file resource stall.

Dynamic Tokens Dispatch Stall Cycles 2

Cycles where a dispatch group is valid but does not get dispatched due to a token stall. UnitMask bits select the stall types included in the count.

This event has the following units which may be used to modify the behavior of the event:

Retire queue tokens unavailable
Integer Execution flush recovery pending
Agen tokens unavailable
ALU tokens unavailable

No_Dispatch_per_Slot

Counts the number of dispatch slots (each cycle) that remained unused for reasons selected by UnitMask.

Dispatch Additional Resource Stalls

This PMC event counts additional resource stalls that are not captured by Dispatch_Stall_Cycle_Dynamic_Tokens_Part_1 or Dispatch_Stall_Cycles_Dynamic_Tokens_Part_2.

Retired Instructions

The number of instructions retired.
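A common derived metric divides this event by Cycles Not in Halt to obtain instructions per cycle (IPC). The sketch below assumes both counts were collected over the same interval; the function and parameter names are hypothetical.

```python
def instructions_per_cycle(retired_instructions: int,
                           cycles_not_in_halt: int) -> float:
    """IPC over one measurement interval."""
    if cycles_not_in_halt <= 0:
        raise ValueError("cycles_not_in_halt must be positive")
    return retired_instructions / cycles_not_in_halt
```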

Retired Macro-Ops

The number of macro-ops retired.

Retired Branch Instructions

The number of branch instructions retired. This includes all types of architectural control flow changes, including exceptions and interrupts.

Retired Branch Instructions Mispredicted

The number of retired branch instructions that were mispredicted. Note that only EX mispredicts are counted.
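This event is usually read relative to Retired Branch Instructions or Retired Instructions. The sketch below computes a misprediction rate and mispredicts per thousand instructions (MPKI), assuming all counts cover the same interval; the function and parameter names are hypothetical.

```python
def mispredict_rate(mispredicted: int, retired_branches: int) -> float:
    """Fraction of retired branches that were mispredicted."""
    return mispredicted / retired_branches if retired_branches else 0.0


def mispredicts_per_kilo_instruction(mispredicted: int,
                                     retired_instructions: int) -> float:
    """Mispredicted branches per 1000 retired instructions (MPKI)."""
    if retired_instructions <= 0:
        raise ValueError("retired_instructions must be positive")
    return 1000.0 * mispredicted / retired_instructions
```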

Retired Taken Branch Instructions

The number of taken branches that were retired. This includes all types of architectural control flow changes, including exceptions and interrupts.

Retired Taken Branch Instructions Mispredicted

The number of retired taken branch instructions that were mispredicted. Note that only EX mispredicts are counted.

Retired Far Control Transfers

The number of far control transfers retired including far call/jump/return, IRET, SYSCALL and SYSRET, plus exceptions and interrupts. Far control transfers are not subject to branch prediction.

Retired Near Return Branch Instructions

The number of near return instructions (RET [C3] or RET Iw [C2]) retired.

Retired Near Return Branch Instructions Mispredicted

The number of near returns retired that were not correctly predicted by the return address predictor. Each such mispredict incurs the same penalty as a mispredicted conditional branch instruction. Note that only EX mispredicts are counted.

Retired Indirect Branch Instructions Mispredicted

The number of indirect branches retired that were not correctly predicted. Each such mispredict incurs the same penalty as a mispredicted conditional branch instruction. Note that only EX mispredicts are counted.

Retired MMX FP Instructions

The number of MMX, SSE or x87 instructions retired. The UnitMask allows the selection of the individual classes of instructions listed below. Each increment represents one complete instruction. Since this event includes non-numeric instructions, it is not suitable for measuring MFLOPs.

This event has the following units which may be used to modify the behavior of the event:

SSE instructions (SSE, SSE2, SSE3, SSSE3, SSE4A, SSE41, SSE42, AVX).
MMX instructions
x87 instructions

Retired Indirect Branch Instructions

The number of indirect branches retired.

Retired Conditional Branch Instructions

Count of conditional branch instructions that retired.

Div Cycles Busy count

Counts cycles when the divider is busy.

Div Op Count

Counts the number of divide ops.

Cycles with no retire

This event counts cycles when the hardware thread does not retire any ops for reasons selected by UnitMask[4:0]. UnitMask events [4:0] are mutually exclusive. If multiple reasons apply for a given cycle, the lowest numbered UnitMask event is counted.

This event has the following units which may be used to modify the behavior of the event:

The number of cycles where ops could have retired (i.e. did not fall into the sub-events [0]...[3]) but did not retire because the thread arbitration did not select the thread for retire.
The number of cycles where ops could have retired (self and older ops are complete), but were stopped from retirement for other reasons: retire breaks, traps, faults, etc.
The number of cycles where the oldest retire slot did not have its completion bits set.
The number of cycles when there were no valid ops in the retire queue. This may be caused by front-end bottlenecks or pipeline redirects.
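The attribution rule above (mutually exclusive reasons, with the lowest numbered one counted) amounts to taking the minimum over the reasons that apply in a cycle. The sketch below works on bit positions only, since this page does not state which unit corresponds to which bit; the function name is hypothetical.

```python
def counted_reason(applicable_bits):
    """Return the UnitMask bit charged for a cycle, or None if no reason applies."""
    bits = set(applicable_bits)
    if not bits:
        return None
    if not bits <= set(range(5)):
        raise ValueError("reasons are limited to UnitMask bits 0 through 4")
    # When several reasons apply, only the lowest numbered one is counted.
    return min(bits)


print(counted_reason({3, 1}))
```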
Retired Microcoded Instructions

The number of retired microcoded instructions.

Retired Microcode Ops

The number of microcode ops that have retired.

Retired Conditional Branch Instructions Mispredicted

The number of retired conditional branch instructions that were not correctly predicted because of a branch direction mismatch.

Retired Unconditional Branch Instructions Mispredicted

The number of retired unconditional indirect branch instructions that were mispredicted.

Retired Unconditional Branch Instructions

The number of retired unconditional branch instructions.

Tagged IBS Ops

Counts Op IBS related events.

This event has the following units which may be used to modify the behavior of the event:

Number of times an op could not be tagged by IBS because of a previous tagged op that has not yet signaled interrupt.
Number of Ops tagged by IBS that retired
Number of Ops tagged by IBS

Retired Fused Instructions

Counts retired fused instructions.

Requests to L2 Group1

All L2 Cache Requests (Breakdown 1 - Common)

This event has the following units which may be used to modify the behavior of the event:

Data Cache Reads (including hardware and software prefetch).
Data Cache Stores.
Data Cache Shared Reads.
Instruction Cache Reads.
All prefetches accepted by L2 pipeline, hit or miss. Types of PF and L2 hit/miss are broken out in a separate perfmon event.
Various noncacheable requests: non-cached data reads, non-cached instruction reads, and self-modifying code checks.

Requests to L2 Group2

All L2 Cache Requests (Breakdown 2 - Rare).

This event has the following units which may be used to modify the behavior of the event:

LS sized read, coherent non-cacheable.
LS sized read, non-coherent, non-cacheable.

Write Combining Buffer Requests

Write Combining Buffer operations. For information on write combining, see the following sections of docAPM2 (AMD64 Architecture Programmer's Manual, Volume 2): Memory System, Memory Types, and Buffering and Combining Memory Writes.

This event has the following units which may be used to modify the behavior of the event:

Write Combining Buffer close

Core to L2 Cacheable Request Access Status

L2 Cache Request Outcomes (not including L2 Prefetch).

This event has the following units which may be used to modify the behavior of the event:

Data Cache Shared Read Hit in L2.
Data Cache Read Hit Modifiable Line in L2.
Data Cache Read Hit Non-Modifiable Line in L2.
Data Cache Store Hit in L2.
Data Cache Req Miss in L2.
Instruction Cache Hit Modifiable Line in L2.
Instruction Cache Hit Non-Modifiable Line in L2.
Instruction Cache Req Miss in L2.

L2 Prefetch Hit in L2

Counts all L2 prefetches accepted by L2 pipeline which hit in the L2 cache.

L2 Prefetcher Hits in L3

Counts all L2 prefetches accepted by the L2 pipeline which miss the L2 cache and hit the L3.

L2 Prefetcher Misses in L3

Counts all L2 prefetches accepted by the L2 pipeline which miss the L2 and the L3 caches.

L2 Fill Response Source

Counts fill responses based on their source. Selecting an event mask of 0xfe counts all L3 responses to fill requests. This event is similar to LS PMC 0x44.

This event has the following units which may be used to modify the behavior of the event:

Requests that return from Extension Memory
Requests that target another NUMA node and return from either DRAM or MMIO from another NUMA node, either from the same or different NUMA node.
Requests that target another NUMA node and return from another CCX's cache.
Requests that target the same NUMA node and return from either DRAM or MMIO from the same NUMA node.
Requests that target the same NUMA node and return from another CCX's cache.
Data returned from L3 or different L2 in the same CCX.
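The mask value 0xfe mentioned above selects every unit mask bit except bit 0, which a short decoding sketch makes explicit. Which unit corresponds to which bit is not stated on this page, so the example reports bit positions only; the function name is hypothetical.

```python
def selected_bits(mask: int) -> list:
    """List the bit positions set in an 8-bit unit mask."""
    if not 0 <= mask <= 0xFF:
        raise ValueError("mask must fit in 8 bits")
    return [bit for bit in range(8) if mask & (1 << bit)]


print(selected_bits(0xFE))  # [1, 2, 3, 4, 5, 6, 7]
```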

cpc(3CPC)

March 25, 2019 OmniOS