dwww Home | Manual pages | Find package

PERF-ARM-SPE(1)                   perf Manual                  PERF-ARM-SPE(1)

NAME
       perf-arm-spe - Support for Arm Statistical Profiling Extension within
       Perf tools

SYNOPSIS
       perf record -e arm_spe//

DESCRIPTION
       The SPE (Statistical Profiling Extension) feature provides accurate
       attribution of latencies and events down to individual instructions.
       Rather than being interrupt-driven, it picks an instruction to sample
       and then captures data for it during execution. Data includes execution
       time in cycles. For loads and stores it also includes data address,
       cache miss events, and data origin.

       The sampling has 5 stages:

        1. Choose an operation

        2. Collect data about the operation

        3. Optionally discard the record based on a filter

        4. Write the record to memory

        5. Interrupt when the buffer is full

   Choose an operation
       This is chosen from a sample population, for SPE this is an
       IMPLEMENTATION DEFINED choice of all architectural instructions or all
       micro-ops. Sampling happens at a programmable interval. The
       architecture provides a mechanism for the SPE driver to infer the
       minimum interval at which it should sample. This minimum interval is
       used by the driver if no interval is specified. A pseudo-random
       perturbation is also added to the sampling interval by default.

   Collect data about the operation
       Program counter, PMU events, timings and data addresses related to the
       operation are recorded. Sampling ensures there is only one sampled
       operation is in flight.

   Optionally discard the record based on a filter
       Based on programmable criteria, choose whether to keep the record or
       discard it. If the record is discarded then the flow stops here for
       this sample.

   Write the record to memory
       The record is appended to a memory buffer

   Interrupt when the buffer is full
       When the buffer fills, an interrupt is sent and the driver signals Perf
       to collect the records. Perf saves the raw data in the perf.data file.

OPENING THE FILE
       Up until this point no decoding of the SPE data was done by either the
       kernel or Perf. Only when the recorded file is opened with perf report
       or perf script does the decoding happen. When decoding the data, Perf
       generates "synthetic samples" as if these were generated at the time of
       the recording. These samples are the same as if normal sampling was
       done by Perf without using SPE, although they may have more attributes
       associated with them. For example a normal sample may have just the
       instruction pointer, but an SPE sample can have data addresses and
       latency attributes.

WHY SAMPLING?
       •   Sampling, rather than tracing, cuts down the profiling problem to
           something more manageable for hardware. Only one sampled operation
           is in flight at a time.

       •   Allows precise attribution data, including: Full PC of instruction,
           data virtual and physical addresses.

       •   Allows correlation between an instruction and events, such as TLB
           and cache miss. (Data source indicates which particular cache was
           hit, but the meaning is implementation defined because different
           implementations can have different cache configurations.)

       However, SPE does not provide any call-graph information, and relies on
       statistical methods.

COLLISIONS
       When an operation is sampled while a previous sampled operation has not
       finished, a collision occurs. The new sample is dropped. Collisions
       affect the integrity of the data, so the sample rate should be set to
       avoid collisions.

       The sample_collision PMU event can be used to determine the number of
       lost samples. Although this count is based on collisions before
       filtering occurs. Therefore this can not be used as an exact number for
       samples dropped that would have made it through the filter, but can be
       a rough guide.

THE EFFECT OF MICROARCHITECTURAL SAMPLING
       If an implementation samples micro-operations instead of instructions,
       the results of sampling must be weighted accordingly.

       For example, if a given instruction A is always converted into two
       micro-operations, A0 and A1, it becomes twice as likely to appear in
       the sample population.

       The coarse effect of conversions, and, if applicable, sampling of
       speculative operations, can be estimated from the sample_pop and
       inst_retired PMU events.

KERNEL REQUIREMENTS
       The ARM_SPE_PMU config must be set to build as either a module or
       statically.

       Depending on CPU model, the kernel may need to be booted with page
       table isolation disabled (kpti=off). If KPTI needs to be disabled, this
       will fail with a console message "profiling buffer inaccessible. Try
       passing kpti=off on the kernel command line".

CAPTURING SPE WITH PERF COMMAND-LINE TOOLS
       You can record a session with SPE samples:

           perf record -e arm_spe// -- ./mybench

       The sample period is set from the -c option, and because the minimum
       interval is used by default it’s recommended to set this to a higher
       value. The value is written to PMSIRR.INTERVAL.

   Config parameters
       These are placed between the // in the event and comma separated. For
       example -e arm_spe/load_filter=1,min_latency=10/

           branch_filter=1     - collect branches only (PMSFCR.B)
           event_filter=<mask> - filter on specific events (PMSEVFR) - see bitfield description below
           jitter=1            - use jitter to avoid resonance when sampling (PMSIRR.RND)
           load_filter=1       - collect loads only (PMSFCR.LD)
           min_latency=<n>     - collect only samples with this latency or higher* (PMSLATFR)
           pa_enable=1         - collect physical address (as well as VA) of loads/stores (PMSCR.PA) - requires privilege
           pct_enable=1        - collect physical timestamp instead of virtual timestamp (PMSCR.PCT) - requires privilege
           store_filter=1      - collect stores only (PMSFCR.ST)
           ts_enable=1         - enable timestamping with value of generic timer (PMSCR.TS)

       * Latency is the total latency from the point at which sampling started
       on that instruction, rather than only the execution latency.

       Only some events can be filtered on; these include:

           bit 1     - instruction retired (i.e. omit speculative instructions)
           bit 3     - L1D refill
           bit 5     - TLB refill
           bit 7     - mispredict
           bit 11    - misaligned access

       So to sample just retired instructions:

           perf record -e arm_spe/event_filter=2/ -- ./mybench

       or just mispredicted branches:

           perf record -e arm_spe/event_filter=0x80/ -- ./mybench

   Viewing the data
       By default perf report and perf script will assign samples to separate
       groups depending on the attributes/events of the SPE record. Because
       instructions can have multiple events associated with them, the samples
       in these groups are not necessarily unique. For example perf report
       shows these groups:

           Available samples
           0 arm_spe//
           0 dummy:u
           21 l1d-miss
           897 l1d-access
           5 llc-miss
           7 llc-access
           2 tlb-miss
           1K tlb-access
           36 branch-miss
           0 remote-access
           900 memory

       The arm_spe// and dummy:u events are implementation details and are
       expected to be empty.

       To get a full list of unique samples that are not sorted into groups,
       set the itrace option to generate instruction samples. The period
       option is also taken into account, so set it to 1 instruction unless
       you want to further downsample the already sampled SPE data:

           perf report --itrace=i1i

       Memory access details are also stored on the samples and this can be
       viewed with:

           perf report --mem-mode

   Common errors
       •   "Cannot find PMU ‘arm_spe’. Missing kernel support?"

               Module not built or loaded, KPTI not disabled (see above), or running on a VM

       •   "Arm SPE CONTEXT packets not found in the traces."

               Root privilege is required to collect context packets. But these only increase the accuracy of
               assigning PIDs to kernel samples. For userspace sampling this can be ignored.

       •   Excessively large perf.data file size

               Increase sampling interval (see above)

SEE ALSO
       perf-record(1), perf-script(1), perf-report(1), perf-inject(1)

perf                              11/18/2025                   PERF-ARM-SPE(1)

Generated by dwww version 1.16 on Wed Dec 17 09:52:05 CET 2025.