OPAL/Skiboot In-Memory Collection (IMC) interface Documentation

Overview:

In-Memory-Collection (IMC) is performance monitoring infrastrcuture for counters that (once started) can be read from memory at any time by an operating system. Such counters include those for the Nest and Core units, enabling continuous monitoring of resource utilisation on the chip.

The API is agnostic as to how these counters are implemented. For the Nest units, they’re implemented by having microcode in an on-chip microcontroller and for core units, they are implemented as part of core logic to gather data and periodically write it to the memory locations.

Nest (On-Chip, Off-Core) unit:

Nest units have dedicated hardware counters which can be programmed to monitor various chip resources such as memory bandwidth, xlink bandwidth, alink bandwidth, PCI, NVlink and so on. These Nest unit PMU counters can be programmed in-band via scom. But alternatively, programming of these counters and periodically moving the counter data to memory are offloaded to a hardware engine part of OCC (On-Chip Controller).

Microcode, starts to run at system boot in OCC complex, initialize these Nest unit PMUs and periodically accumulate the nest pmu counter values to memory. List of supported events by the microcode is packages as a DTS and stored in IMA_CATALOG partition.

Core unit:

Core IMC PMU counters are handled in the core-imc unit. Each core has 4 Core Performance Monitoring Counters (CPMCs) which are used by Core-IMC logic. Two of these are dedicated to count core cycles and instructions. The 2 remaining CPMCs have to multiplex 128 events each.

Core IMC hardware does not support interrupts and it peridocially (based on sampling duration) fetches the counter data and accumulate to main memory. Memory to accumulate counter data are refered from “PDBAR” (per-core scom) and “LDBAR” per-thread spr.

Trace mode of IMC:

POWER9 support two modes for IMC which are the Accumulation mode and Trace mode. In Accumulation mode event counts are accumulated in system memory. Hypervisor/kernel then reads the posted counts periodically, or when requested. In IMC Trace mode, the 64 bit trace scom value is initialized with the event information. The CPMC*SEL and CPMC_LOAD in the trace scom, specifies the event to be monitored and the sampling duration. On each overflow in the CPMC*SEL, hardware snapshots the program counter along with event counts and writes into memory pointed by LDBAR. LDBAR has bits to indicate whether hardware is configured for accumulation or trace mode. Currently the event monitored for trace-mode is fixed as cycle.

PMI interrupt handling is avoided, since IMC trace mode snapshots the program counter and update to the memory. And this also provide a way for the operating system to do instruction sampling in real time without PMI(Performance Monitoring Interrupts) processing overhead.

Example:

Performance data using ‘perf top’ with and without trace-imc event:

PMI interrupts count when `perf top` command is executed without trace-imc event.

# cat /proc/interrupts  (a snippet from the output)
9944      1072        804        804       1644        804       1306
804        804        804        804        804        804        804
804        804       1961       1602        804        804       1258
[-----------------------------------------------------------------]
803        803        803        803        803        803        803
803        803        803        803        804        804        804
804        804        804        804        804        804        803
803        803        803        803        803       1306        803
803   Performance monitoring interrupts

PMI interrupts count when `perf top` command executed with trace-imc event (executed right after ‘perf top’ without trace-imc event).

# perf top -e trace_imc/trace_cycles/
12.50%  [kernel]          [k] arch_cpu_idle
11.81%  [kernel]          [k] __next_timer_interrupt
11.22%  [kernel]          [k] rcu_idle_enter
10.25%  [kernel]          [k] find_next_bit
 7.91%  [kernel]          [k] do_idle
 7.69%  [kernel]          [k] rcu_dynticks_eqs_exit
 5.20%  [kernel]          [k] tick_nohz_idle_stop_tick
     [-----------------------]

# cat /proc/interrupts (a snippet from the output)

9944      1072        804        804       1644        804       1306
804        804        804        804        804        804        804
804        804       1961       1602        804        804       1258
[-----------------------------------------------------------------]
803        803        803        803        803        803        803
803        803        803        804        804        804        804
804        804        804        804        804        804        803
803        803        803        803        803       1306        803
803   Performance monitoring interrupts

Here the PMI interrupts count remains the same.

OPAL APIs:

The OPAL API is simple: a call to init a counter type, and calls to start and stop collection. The memory locations are described in the device tree.

See OPAL_IMC_COUNTERS_INIT and IMC Device Tree Bindings