PERF-ARM-SPE(1) perf Manual PERF-ARM-SPE(1)
perf-arm-spe - Support for Arm Statistical Profiling Extension
within Perf tools
perf record -e arm_spe//
The SPE (Statistical Profiling Extension) feature provides
accurate attribution of latencies and events down to individual
instructions. Rather than being interrupt-driven, it picks an
instruction to sample and then captures data for it during
execution. Data includes execution time in cycles. For loads and
stores it also includes data address, cache miss events, and data
origin.
The sampling has 5 stages:
1. Choose an operation
2. Collect data about the operation
3. Optionally discard the record based on a filter
4. Write the record to memory
5. Interrupt when the buffer is full
Choose an operation
This is chosen from a sample population, for SPE this is an
IMPLEMENTATION DEFINED choice of all architectural instructions or
all micro-ops. Sampling happens at a programmable interval. The
architecture provides a mechanism for the SPE driver to infer the
minimum interval at which it should sample. This minimum interval
is used by the driver if no interval is specified. A pseudo-random
perturbation is also added to the sampling interval by default.
Collect data about the operation
Program counter, PMU events, timings and data addresses related to
the operation are recorded. Sampling ensures there is only one
sampled operation is in flight.
Optionally discard the record based on a filter
Based on programmable criteria, choose whether to keep the record
or discard it. If the record is discarded then the flow stops here
for this sample.
Write the record to memory
The record is appended to a memory buffer
Interrupt when the buffer is full
When the buffer fills, an interrupt is sent and the driver signals
Perf to collect the records. Perf saves the raw data in the
perf.data file.
Up until this point no decoding of the SPE data was done by either
the kernel or Perf. Only when the recorded file is opened with
perf report or perf script does the decoding happen. When decoding
the data, Perf generates "synthetic samples" as if these were
generated at the time of the recording. These samples are the same
as if normal sampling was done by Perf without using SPE, although
they may have more attributes associated with them. For example a
normal sample may have just the instruction pointer, but an SPE
sample can have data addresses and latency attributes.
• Sampling, rather than tracing, cuts down the profiling problem
to something more manageable for hardware. Only one sampled
operation is in flight at a time.
• Allows precise attribution data, including: Full PC of
instruction, data virtual and physical addresses.
• Allows correlation between an instruction and events, such as
TLB and cache miss. (Data source indicates which particular
cache was hit, but the meaning is implementation defined
because different implementations can have different cache
configurations.)
However, SPE does not provide any call-graph information, and
relies on statistical methods.
When an operation is sampled while a previous sampled operation
has not finished, a collision occurs. The new sample is dropped.
Collisions affect the integrity of the data, so the sample rate
should be set to avoid collisions.
The sample_collision PMU event can be used to determine the number
of lost samples. Although this count is based on collisions before
filtering occurs. Therefore this can not be used as an exact
number for samples dropped that would have made it through the
filter, but can be a rough guide.
If an implementation samples micro-operations instead of
instructions, the results of sampling must be weighted
accordingly.
For example, if a given instruction A is always converted into two
micro-operations, A0 and A1, it becomes twice as likely to appear
in the sample population.
The coarse effect of conversions, and, if applicable, sampling of
speculative operations, can be estimated from the sample_pop and
inst_retired PMU events.
The ARM_SPE_PMU config must be set to build as either a module or
statically.
Depending on CPU model, the kernel may need to be booted with page
table isolation disabled (kpti=off). If KPTI needs to be disabled,
this will fail with a console message "profiling buffer
inaccessible. Try passing kpti=off on the kernel command line".
For the full criteria that determine whether KPTI needs to be
forced off or not, see function unmap_kernel_at_el0() in the
kernel sources. Common cases where it’s not required are on the
CPUs in kpti_safe_list, or on Arm v8.5+ where FEAT_E0PD is
mandatory.
The SPE interrupt must also be described by the firmware. If the
module is loaded and KPTI is disabled (or isn’t required to be
disabled) but the SPE PMU still doesn’t show in
/sys/bus/event_source/devices/, then it’s possible that the SPE
interrupt isn’t described by ACPI or DT. In this case no warning
will be printed by the driver.
You can record a session with SPE samples:
perf record -e arm_spe// -- ./mybench
The sample period is set from the -c option, and because the
minimum interval is used by default it’s recommended to set this
to a higher value. The value is written to PMSIRR.INTERVAL.
Config parameters
These are placed between the // in the event and comma separated.
For example -e arm_spe/load_filter=1,min_latency=10/
branch_filter=1 - collect branches only (PMSFCR.B)
event_filter=<mask> - filter on specific events (PMSEVFR) - see bitfield description below
jitter=1 - use jitter to avoid resonance when sampling (PMSIRR.RND)
load_filter=1 - collect loads only (PMSFCR.LD)
min_latency=<n> - collect only samples with this latency or higher* (PMSLATFR)
pa_enable=1 - collect physical address (as well as VA) of loads/stores (PMSCR.PA) - requires privilege
pct_enable=1 - collect physical timestamp instead of virtual timestamp (PMSCR.PCT) - requires privilege
store_filter=1 - collect stores only (PMSFCR.ST)
ts_enable=1 - enable timestamping with value of generic timer (PMSCR.TS)
discard=1 - enable SPE PMU events but don't collect sample data - see 'Discard mode' (PMBLIMITR.FM = DISCARD)
* Latency is the total latency from the point at which sampling
started on that instruction, rather than only the execution
latency.
Only some events can be filtered on; these include:
bit 1 - instruction retired (i.e. omit speculative instructions)
bit 3 - L1D refill
bit 5 - TLB refill
bit 7 - mispredict
bit 11 - misaligned access
So to sample just retired instructions:
perf record -e arm_spe/event_filter=2/ -- ./mybench
or just mispredicted branches:
perf record -e arm_spe/event_filter=0x80/ -- ./mybench
Viewing the data
By default perf report and perf script will assign samples to
separate groups depending on the attributes/events of the SPE
record. Because instructions can have multiple events associated
with them, the samples in these groups are not necessarily unique.
For example perf report shows these groups:
Available samples
0 arm_spe//
0 dummy:u
21 l1d-miss
897 l1d-access
5 llc-miss
7 llc-access
2 tlb-miss
1K tlb-access
36 branch
0 remote-access
900 memory
The arm_spe// and dummy:u events are implementation details and
are expected to be empty.
To get a full list of unique samples that are not sorted into
groups, set the itrace option to generate instruction samples. The
period option is also taken into account, so set it to 1
instruction unless you want to further downsample the already
sampled SPE data:
perf report --itrace=i1i
Memory access details are also stored on the samples and this can
be viewed with:
perf report --mem-mode
Common errors
• "Cannot find PMU ‘arm_spe’. Missing kernel support?"
Module not built or loaded, KPTI not disabled, interrupt not described by firmware,
or running on a VM. See 'Kernel Requirements' above.
• "Arm SPE CONTEXT packets not found in the traces."
Root privilege is required to collect context packets. But these only increase the accuracy of
assigning PIDs to kernel samples. For userspace sampling this can be ignored.
• Excessively large perf.data file size
Increase sampling interval (see above)
PMU events
SPE has events that can be counted on core PMUs. These are
prefixed with SAMPLE_, for example SAMPLE_POP, SAMPLE_FEED,
SAMPLE_COLLISION and SAMPLE_FEED_BR.
These events will only count when an SPE event is running on the
same core that the PMU event is opened on, otherwise they read as
0. There are various ways to ensure that the PMU event and SPE
event are scheduled together depending on the way the event is
opened. For example opening both events as per-process events on
the same process, although it’s not guaranteed that the PMU event
is enabled first when context switching. For that reason it may be
better to open the PMU event as a systemwide event and then open
SPE on the process of interest.
Discard mode
SPE related (SAMPLE_* etc) core PMU events can be used without the
overhead of collecting sample data if discard mode is supported
(optional from Armv8.6). First run a system wide SPE session (or
on the core of interest) using options to minimize output. Then
run perf stat:
perf record -e arm_spe/discard/ -a -N -B --no-bpf-event -o - > /dev/null &
perf stat -e SAMPLE_FEED_LD
perf-record(1), perf-script(1), perf-report(1), perf-inject(1)
This page is part of the perf (Performance analysis tools for
Linux (in Linux source tree)) project. Information about the
project can be found at
⟨https://perf.wiki.kernel.org/index.php/Main_Page⟩. If you have a
bug report for this manual page, send it to
linux-kernel@vger.kernel.org. This page was obtained from the
project's upstream Git repository
⟨http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git⟩
on 2025-08-11. (At that time, the date of the most recent commit
that was found in the repository was 2025-08-10.) If you discover
any rendering problems in this HTML version of the page, or you
believe there is a better or more up-to-date source for the page,
or you have corrections or improvements to the information in this
COLOPHON (which is not part of the original manual page), send a
mail to man-pages@man7.org
perf 2025-01-10 PERF-ARM-SPE(1)
Pages that refer to this page: perf(1), perf-c2c(1), perf-mem(1)