skiboot-5.8

skiboot v5.8 was released on Thursday August 31st 2017. It is the first release of skiboot 5.8, which becomes the new stable release. It follows the 5.7 release, first released 25th July 2017.

skiboot v5.8 contains all bug fixes as of skiboot-5.4.6 and skiboot-5.1.20 (the currently maintained stable releases). We do not currently expect to do any 5.7.x stable releases.

For how the skiboot stable releases work, see Skiboot stable tree rules and releases.

Over skiboot-5.7, we have the following changes:

New Features

  • sensors: occ: Add support to clear sensor groups

    Adds a generic API to clear sensor groups. OCC inband sensor groups such as CSM, Profiler and Job Scheduler can be cleared using this API. It will clear the min/max of all sensors belonging to OCC sensor groups.

  • sensors: occ: Add CSM_{min/max} sensors

    HWMON’s lowest/highest attribute is used by CSM agent, so map min/max device-tree properties “sensor-data-min” and “sensor-data-max” to the min/max of CSM.

  • sensors: occ: Add support for OCC inband sensors

    Add support to parse and export OCC inband sensors which are copied by OCC to main memory in P9. Each OCC writes three buffers: one names buffer for sensor metadata and two buffers for sensor readings. While OCC writes to one buffer, the sensor values can be read from the other buffer. The sensors are updated every 100ms.

    This patch adds power, temperature, current and voltage sensors to /ibm,opal/sensors device-tree node which can be exported by the ibmpowernv-hwmon driver in Linux.

  • psr: occ: Add support to change power-shifting-ratio

    Add support to set the CPU-GPU power shifting ratio which is used by the OCC power capping algorithm. PSR value of 100 takes all power away from CPU first and a PSR value of 0 caps GPU first.

  • powercap: occ: Add a generic powercap framework

    This patch adds a generic powercap framework and exports OCC powercap sensors, through which the system powercap can be set inband via the OPAL-OCC command-response interface.

  • phb4: Enable PCI peer-to-peer

    P9 supports PCI peer-to-peer: a PCI device can write directly to the mmio space of another PCI device. It completely by-passes the CPU.

    It requires some configuration on the PHBs involved:

    1. on the initiating side, the address for the read/write operation is in the mmio space of the target, i.e. well outside the range normally allowed. So we disable range-checking on the TVT entry in bypass mode.
    2. on the target side, we need to explicitly enable p2p by setting a bit in a configuration register. It has the side-effect of reserving an outbound (as seen from the CPU) store queue for p2p. Therefore we only enable p2p on the PHBs using it, as we don’t want to waste the resource if we don’t have to.

    P9 supports p2p mmio writes. Reads are currently only supported if the two devices are under the same PHB but that is expected to change in the future, and it raises questions about intermediate switches configuration, so we report an error for the time being.

    The patch adds a new OPAL call to allow the OS to declare a p2p (initiator, target) pair.

  • NX 842 and GZIP support on POWER9

POWER9 DD2

Further support for POWER9 DD2 revision chips. Notable changes include:

  • xscom: Grab P9 DD2 revision level

  • vas: Set mmio enable bits in DD2

    POWER9 DD2 added some new “enable” bits that must be set for VAS to work. These bits were unused in DD1.

  • hdat: Add POWER9 DD2.0 specific pa_features

    Same as the default but with TM off.

POWER9

Since skiboot-5.8-rc1:

  • hw/npu2.c: Add ibm,nvlink-speed device-tree property

    NVLink2 links can support multiple different speeds. However the device driver has no way of determining which speed was programmed so pass it down as a device tree property.

  • hw/npu2-hw-procedures.c: Update PHY_RESET procedure

    Newer versions of Hostboot will have various clocks powered down by default to save power. Therefore we need to power them up before accessing the OBUS PHY.

  • p8-i2c: Fix random data corruption (POWER9 specific)

    While waiting for the OCC to signal that it has finished using the I2C master we put the master into the, poorly named, occache_dis state. While in this state the transaction hasn’t been started, but p8_i2c_check_status() will only skip its checks when the master is in the idle state. Any action that cranks the I2C state machine (interrupt, poll, etc) will call p8_i2c_check_status() and since the master is not idle, it will check the status register, see the transaction complete flag set and complete the I2C request without actually doing anything.

    If the transaction was an I2C read, the resulting output will be a zeroed data buffer.

  • hw/p8-i2c: Fix OCC locking (POWER9 specific)

    There’s a few issues with the Host<->OCC I2C bus handshaking. First up, skiboot is currently examining the wrong bit when checking if the OCC is currently using the bus. Secondly, when we need to wait for the OCC to release the bus we are scheduling a recovery timer to run zero timebase ticks after the current moment so the recovery timeout handler will run immediately after the bus was requested, which will in turn re-schedule itself, etc, etc. There’s also a race between the OCC interrupt and the recovery handler which can result in an assertion failure in the recovery thread. All of this is bad.

    This patch addresses all these issues and sets the recovery timeout to 10ms.

  • vas: export chip-id to vas platform device

    This is needed so VAS in the kernel can perform cpu to vas id mapping.

  • slw: Modify the power9 stop0_lite latency & residency

    Currently skiboot exposes the exit-latency for stop0_lite as 200ns and the target-residency to be 2us.

    However, the kernel cpu-idle infrastructure rounds the latency to whole microseconds and lists the stop0_lite latency as 0us, putting it on par with snooze state. As a result, when the predicted latency is small (< 1us), cpuidle will select stop0_lite instead of snooze. The difference between these states is that snooze doesn’t require an interrupt to exit from the state, but stop0_lite does. And the value 200ns doesn’t include the interrupt latency.

    This shows up in the context_switch2 benchmark (http://ozlabs.org/~anton/junkcode/context_switch2.c) where the number of context switches per second with the stop0_lite disabled is found to be roughly 30% more than with stop0_lite enabled. This can be correlated with the number of times cpuidle enters stop0_lite compared to snooze.

    Hence, bump up the exit latency of stop0_lite to 1us. Since the target residency is chosen to be 10 times the exit latency, set the target residency to 10us.

    With these values, we see a 50% improvement in the number of context switches.
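
The rounding effect described above can be illustrated with a small sketch. It assumes the nanosecond to microsecond conversion is a plain integer division, which is an assumption made for illustration only; the kernel code is not reproduced here and the helper names are hypothetical.

```python
# Hypothetical illustration of the stop0_lite latency rounding; not kernel code.

def ns_to_us(latency_ns: int) -> int:
    """Convert nanoseconds to whole microseconds, discarding the remainder (assumed)."""
    return latency_ns // 1000


def target_residency_ns(exit_latency_ns: int, factor: int = 10) -> int:
    """Target residency chosen as 10 times the exit latency, as the text states."""
    return exit_latency_ns * factor


old_latency_ns = 200    # previous stop0_lite exit latency
new_latency_ns = 1000   # new exit latency (1us)

print(ns_to_us(old_latency_ns))                       # 0, same as snooze
print(ns_to_us(new_latency_ns))                       # 1
print(ns_to_us(target_residency_ns(new_latency_ns)))  # 10
```

With the old value the listed latency collapses to 0us, so the governor cannot tell stop0_lite apart from snooze; with 1us it can.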

Since skiboot-5.7:

  • Base NPU2 support on POWER9 DD2

  • hdata/i2c: Work around broken I2C array version

    Work around a bug in the I2C devices array that shows the array version as being v2 when only the v1 data is populated.

  • Recognize the 2s2u zz platform

    OPAL currently doesn’t know about the 2s2u zz. It recognizes such a box as a generic BMC machine and fails to boot. Add the 2s2u as a supported platform.

    There will subsequently be a 2s2u-L system which may have a different compatible property, which will need to be handled later.

  • hdata/spira: POWER9 NX isn’t software compatible with P7/P8 NX, don’t claim so

  • NX: Add P9 NX support for gzip compression engine

    POWER9 introduces the NX gzip compression engine. This patch adds gzip compression support in NX. Virtual Accelerator Switch (VAS) is used to access the NX gzip engine and the channel configuration will be done with the receive FIFO. So RxFIFO address, logical partition ID (lpid), process ID (pid) and thread ID (tid) are used to configure RxFIFO. P9 NX supports high and normal priority FIFOs. Skiboot configures the User Mode Access Control (UMAC) notify match register with these values and also enables other registers to enable / disable the engine.

    Creates the following device-tree entries to provide RxFIFO address, RxFIFO size, Fifo priority, lpid, pid and tid values so that kernel can drive P9 NX gzip engine.

    The following nodes are located under an xscom node:

    /xscom@<xscom_addr>/nx@<nx_addr>

    /ibm,gzip-high-fifo
        High priority gzip RxFIFO
    /ibm,gzip-normal-fifo
        Normal priority gzip RxFIFO

    Each RxFIFO node contains:

    compatible
        ibm,p9-nx-gzip
    priority
        High or Normal
    rx-fifo-address
        RxFIFO address
    rx-fifo-size
        RxFIFO size
    lpid
        0xfff (1’s for 12 bits in UMAC notify match register)
    pid
        gzip coprocessor type
    tid
        counter for gzip

  • NX: Add P9 NX support for 842 compression engine

    This patch adds changes needed for the 842 compression engine on POWER9. Virtual Accelerator Switch (VAS) is used to access the NX 842 engine on P9 and the channel setup will be done with the receive FIFO. So RxFIFO address, logical partition ID (lpid), process ID (pid) and thread ID (tid) are used for this setup. P9 NX supports high and normal priority FIFOs. skiboot is not involved in processing data with the 842 engine, but configures the User Mode Access Control (UMAC) notify match register with these values and exports them to the kernel with device-tree entries.

    Also configures the appropriate registers to set up and enable / disable the engine. Creates the following device-tree entries to provide RxFIFO address, RxFIFO size, FIFO priority, lpid, pid and tid values so that the kernel can drive the P9 NX 842 engine.

    The following nodes are located under an xscom node:

    /xscom@<xscom_addr>/nx@<nx_addr>

    /ibm,842-high-fifo
        High priority 842 RxFIFO
    /ibm,842-normal-fifo
        Normal priority 842 RxFIFO

    Each RxFIFO node contains:

    compatible
        ibm,p9-nx-842
    priority
        High or Normal
    rx-fifo-address
        RxFIFO address
    rx-fifo-size
        RxFIFO size
    lpid
        0xfff (1’s for 12 bits set in UMAC notify match register)
    pid
        842 coprocessor type
    tid
        Counter for 842

  • vas: Create MMIO device tree node

    Create a device tree node for VAS and add properties that Linux will need to configure/use VAS.

  • opal: Extract sw checkstop fir address from HDAT.

    Extract sw checkstop fir address info from HDAT and populate device tree node ibm,sw-checkstop-fir.

    This patch is required for OPAL_CEC_REBOOT2 OPAL call to work as expected on p9.

    With this patch a device property ‘ibm,sw-checkstop-fir’ is now properly populated:

    # lsprop ibm,sw-checkstop-fir
    ibm,sw-checkstop-fir
                     05012000 0000001f
    

PHB4

  • hdat: Fix PCIe GEN4 lane-eq setting for DD2

    For PCIe GEN4, DD2 uses only 1 byte per PCIe lane for the lane-eq settings (DD1 uses 2 bytes).

  • pci: Wait for CRS and switch link when restoring bus numbers

    When a complete reset occurs, after the PHB recovers it propagates a reset down the wire to every device. At the same time, skiboot talks to every device in order to restore the state of devices to what they were before the reset.

    In some situations, such as devices that recovered slowly and/or were behind a switch, skiboot attempted to access config space of the device before the link was up and the device could respond.

    Fix this by retrying CRS until the device responds correctly, and for devices behind a switch, making sure the switch has its link up first.
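
The retry idea can be sketched as a simple polling loop. This is not the skiboot implementation: the function name, retry count and delay are hypothetical, and the sketch assumes CRS Software Visibility is enabled, in which case a vendor ID read returns 0x0001 while the device is still signalling CRS.

```python
import time
from typing import Callable

CRS_VENDOR_ID = 0x0001  # vendor ID value seen while a device signals CRS


def wait_for_crs(read_vendor_id: Callable[[], int],
                 retries: int = 40, delay_s: float = 0.0) -> bool:
    """Poll the vendor ID until the device stops returning the CRS value.

    Returns True once a real vendor ID is read, False if retries run out.
    """
    for _ in range(retries):
        if read_vendor_id() != CRS_VENDOR_ID:
            return True
        time.sleep(delay_s)
    return False


# Simulated device (hypothetical): not ready for the first three reads.
responses = iter([0x0001, 0x0001, 0x0001, 0x1014])
print(wait_for_crs(lambda: next(responses)))  # True
```

For a device behind a switch, the same kind of wait would first be applied to the switch link, as the text describes.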

  • pci: Track whether a PCI device is a virtual function

    This can be checked from config space, but we will need to know this when restoring the PCI topology, and it is not always safe to access config space during this period.

  • phb4: Enhanced PCIe training tracing

    This adds more details to the PCI training tracing (aka Rick Mata mode). It enables the PCIe Link Training and Status State Machine (LTSSM) tracing and details on speed and link width.

    Output now looks like this when enabled (via nvram):

    [    1.096995141,3] PHB#0000[0:0]: TRACE:0x0000001101000000  0ms          GEN1:x16:detect
    [    1.102849137,3] PHB#0000[0:0]: TRACE:0x0000102101000000 11ms presence GEN1:x16:polling
    [    1.104341838,3] PHB#0000[0:0]: TRACE:0x0000182101000000 14ms training GEN1:x16:polling
    [    1.104357444,3] PHB#0000[0:0]: TRACE:0x00001c5101000000 14ms training GEN1:x16:recovery
    [    1.104580394,3] PHB#0000[0:0]: TRACE:0x00001c5103000000 14ms training GEN3:x16:recovery
    [    1.123259359,3] PHB#0000[0:0]: TRACE:0x00001c5104000000 51ms training GEN4:x16:recovery
    [    1.141737656,3] PHB#0000[0:0]: TRACE:0x0000144104000000 87ms presence GEN4:x16:L0
    [    1.141752318,3] PHB#0000[0:0]: TRACE:0x0000154904000000 87ms trained  GEN4:x16:L0
    [    1.141757964,3] PHB#0000[0:0]: TRACE: Link trained.
    [    1.096834019,3] PHB#0001[0:1]: TRACE:0x0000001101000000  0ms          GEN1:x16:detect
    [    1.105578525,3] PHB#0001[0:1]: TRACE:0x0000102101000000 17ms presence GEN1:x16:polling
    [    1.112763075,3] PHB#0001[0:1]: TRACE:0x0000183101000000 31ms training GEN1:x16:config
    [    1.112778956,3] PHB#0001[0:1]: TRACE:0x00001c5081000000 31ms training GEN1:x08:recovery
    [    1.113002083,3] PHB#0001[0:1]: TRACE:0x00001c5083000000 31ms training GEN3:x08:recovery
    [    1.114833873,3] PHB#0001[0:1]: TRACE:0x0000144083000000 35ms presence GEN3:x08:L0
    [    1.114848832,3] PHB#0001[0:1]: TRACE:0x0000154883000000 35ms trained  GEN3:x08:L0
    [    1.114854650,3] PHB#0001[0:1]: TRACE: Link trained.
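
When comparing training runs it can help to pull the fields out of these trace lines. The following is a minimal sketch that parses the format exactly as shown in the sample output above; the regular expression and field names are assumptions based on that sample, not a documented log format.

```python
import re

# Pattern derived from the sample trace lines; field names are hypothetical.
TRACE_RE = re.compile(
    r"PHB#(?P<phb>[0-9a-fA-F]{4})\[\d+:\d+\]: TRACE:(?P<reg>0x[0-9a-fA-F]{16})"
    r"\s+(?P<ms>\d+)ms\s+(?:(?P<state>[a-z]+)\s+)?"
    r"GEN(?P<gen>\d):x(?P<width>\d+):(?P<ltssm>\w+)"
)


def parse_trace(line):
    """Return a dict of trace fields, or None for lines that do not match."""
    m = TRACE_RE.search(line)
    if m is None:
        return None
    return {
        "phb": m.group("phb"),
        "ms": int(m.group("ms")),
        "state": m.group("state") or "",   # empty before presence is detected
        "gen": int(m.group("gen")),
        "width": int(m.group("width")),
        "ltssm": m.group("ltssm"),
    }


sample = ("[    1.114848832,3] PHB#0001[0:1]: "
          "TRACE:0x0000154883000000 35ms trained  GEN3:x08:L0")
print(parse_trace(sample))
```

Lines such as "TRACE: Link trained." carry no register value and are returned as None.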
    
  • phb4: Fix reading wrong size registers in EEH dump

    These registers are supposed to be 16-bit; reading them at the wrong size makes part of the register dump misleading.

  • phb4: Ignore slot state if performing complete reset

    If a PHB is being completely reset, its state is about to be blown away anyway, so if it’s not in an appropriate state, creset it regardless.

  • phb4: Prepare for link down when creset called from kernel

    phb4_creset() is typically called by functions that prepare the link to go down. In cases where creset() is called directly by the kernel, this isn’t the case and it can cause issues. Prepare for link down in creset, just like we do in freset and hreset.

  • phb4: Skip attempting to fix PHBs broken on boot

    If a PHB is marked broken it didn’t work on boot, and if it didn’t work on boot then there’s no point trying to recover it later.

  • phb4: Fix duplicate in EEH register dump

  • phb4: Be more conservative on link presence timeout

    In an earlier patch we tuned our link timing to be more aggressive: cf960e2884 phb4: Improve reset and link training timing

    Cards should take only 32ms but unfortunately we’ve seen some take up to 440ms. Hence bump our timer up to 1000ms.

    This can hurt boot times on systems where slots indicate a hotplug status but no electrical link is present (which we’ve seen). Since we have to wait 1 second between PERST and touching config space anyway, it shouldn’t hurt too much.

  • phb4: Assert PERST before PHB reset

    Currently we don’t assert PERST before issuing a PHB reset. This means any link issues while resetting the PHB will be logged as errors.

    This asserts PERST before we start resetting the PHB to avoid this.

  • Revert “phb4: Read PERST signal rather than assuming it’s asserted”

    This reverts commit b42ff2b904165addf32e77679cebb94a08086966

    The original patch assumes that PERST has been asserted well before (> 250ms) we hit here (ie. during hostboot).

    In a subsequent patch this will no longer be the case as we need to assert PERST during PHB reset, which may only be a few milliseconds before we hit this code.

    Hence revert this patch. Go back to the software mechanism using skip_perst to determine if PERST should be asserted or not. This allows us to keep the speed optimisation on boot.

  • phb4: Set REGB error enables based on link state

    Currently we always set these enables when initing the PHB. If the link is already down, we shouldn’t set them as it may cause spurious errors.

    This changes the code to only set them if the link is up.

  • phb4: Mark PHB as fenced on creset

    If we have to inject an error to trigger recovery, we end up not marking the PHB as fenced in the PHB struct. This fixes that.

  • phb4: Clear errors before deasserting reset

    During reset we may have logged some errors (eg. due to the link going down).

    Hence before we deassert PERST or Hot Reset, we need to clear these errors. This ensures that once link training starts, only new errors are logged.

  • phb4: Disable device config space access when fenced

    On DD2 you can’t access device config space when fenced, so just disable access whenever we are fenced.

  • phb4: Dump devctl and devstat registers

    Dump devctl and devstat registers. These would have been useful when debugging the MPS issue.

  • phb4: Only clear some PHB config space registers on errors

    Currently on error we clear the entire PHB config space. This is a problem as the PCIe Maximum Payload Size (MPS) negotiation may have already occurred. Clearing MPS in the PHB back to a default of 128 bytes will result in an error for a device which already has a larger MPS configured.

    This will manifest itself as an error due to a malformed TLP packet, i.e. phbPblErrorStatus bit 41 = "Malformed TLP error".

    This has been seen after kexec with some adapters.

    This fixes the problem by only clearing a subset of registers on a phb error.

Utilities

  • external/xscom-utils: Add --list-bits

    When using getscom/putscom it’s helpful to know what bits are set in the register. This patch adds an option to print out which bits are set along with the value that was read/written to the register. Note that this output indicates which bits are set using the IBM bit ordering since that’s what the XSCOM documentation uses.
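
IBM bit ordering numbers bit 0 as the most significant bit, which is the reverse of the usual convention. The sketch below shows how set bits map to IBM bit numbers for a 64-bit value; it illustrates the numbering only and does not reproduce the output format of getscom/putscom. The function name is hypothetical.

```python
def ibm_bits_set(value: int, width: int = 64) -> list:
    """Return set bit positions in IBM numbering (bit 0 is the most significant bit)."""
    return [i for i in range(width) if value & (1 << (width - 1 - i))]


print(ibm_bits_set(0x8000000000000001))  # [0, 63]
```

Here the top bit of the register is reported as bit 0 and the bottom bit as bit 63.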

opal-prd

  • opal-prd: Do not pass pnor file while starting daemon.

    This change to the included systemd init file means opal-prd can start and run on IBM FSP based systems.

    We do not have PNOR support on all systems, and we have logic to autodetect PNOR. Hence do not pass --pnor by default.

  • opal-prd: Disable pnor access interface on FSP system

    On FSP systems the host does not have access to PNOR. Hence disable the PNOR access interfaces.

OPAL Sensors

  • sensor-groups : occ: Add ‘ops’ DT property

    Add new device-tree property ‘ops’ to define different operations supported on each sensor-group.

  • OCC: Map OCC sensor to a chip-id

    Parse device tree to get chip-id for OCC sensor.

  • HDAT: Add chip-id property to ipmi sensors

    Presently we do not have a way to map sensor to chip id. Hence we are always passing chip id 0 for occ_reset request (see occ_sensor_id_to_chip()).

    This patch adds a chip-id property to sensors (whenever it’s available) so that we can map an OCC sensor to a chip-id and pass a valid chip-id to the occ_reset request.

  • xive: Check for valid PIR index when decoding

    This fixes an unlikely but possible assert() fail on kdump.

  • sensors: occ: Skip the deconfigured core sensors

    This patch skips the deconfigured cores from the core sensors while parsing the sensor names in the main memory as these sensor values are not updated by OCC.

IBM FSP systems

Since skiboot-5.8-rc1:

  • mktime: fix off-by-one error calling days_in_month

    From auditing all the mktime() users, there seems to be only a very small window around new years day where we could possibly return incorrect data to the OS, and even then, there would have to be FSP reset/reload on FSP machines. I don’t think there’s an opportunity on other machines.

Tests

Since skiboot-5.8-rc1:

  • travis: Debian Stretch must pass
  • test kernels: link with -N
  • core/test/run-msg: don’t depend on unittest mem layout

Since skiboot-5.7:

  • hdata_to_dt: use a realistic PVR and chip revision

  • nx: PR_INFO that NX RNG and Crypto not yet supported on POWER9

  • external/pflash: Add tests

  • external/pflash: Reinstate the progress bars

    Recent work did some optimising which unfortunately removed some of the progress bars in pflash.

    It turns out that there’s only one thing people prefer to correctly programmed flash chips: the ability to watch little equals characters go across their screens for potentially minutes.

  • external/pflash: Correct erase alignment checks

    pflash should check the alignment of addresses and sizes when asked to erase. There are two possibilities:

    1. The user has specified sizes manually, in which case pflash should be as flexible as possible; blocklevel_smart_erase() permits this. To prevent possible mistakes, pflash will require --force to perform a manual erase of unaligned sizes.
    2. The user used -P to specify a partition. Partitions aren’t necessarily erase granule aligned anymore, which blocklevel_smart_erase() can handle. In this case it doesn’t make sense to warn/error about misalignment since the misalignment is inherent to the FFS partition and not really user input.
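
The two cases above amount to a small decision rule, sketched here. This is an illustration of the described policy rather than pflash code; the function and parameter names are hypothetical.

```python
def check_erase(start: int, size: int, erase_granule: int,
                from_partition: bool, force: bool) -> bool:
    """Decide whether an erase request may proceed.

    Partition-derived ranges (-P) are always allowed; manual ranges that are
    not aligned to the erase granule are allowed only when force is set.
    """
    aligned = start % erase_granule == 0 and size % erase_granule == 0
    if from_partition or aligned:
        return True
    return force


# Aligned manual erase: allowed.
print(check_erase(0x1000, 0x2000, 0x1000, from_partition=False, force=False))  # True
# Unaligned manual erase without force: refused.
print(check_erase(0x1800, 0x2000, 0x1000, from_partition=False, force=False))  # False
# Unaligned partition erase: allowed, misalignment is inherent to the partition.
print(check_erase(0x1800, 0x2000, 0x1000, from_partition=True, force=False))   # True
```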
  • external/pflash: Check the result of strtoul

    Also add 0x in front of --info output to avoid a copy and paste mistake.

  • libflash/file: Break up MTD erase ioctl() calls

    Unfortunately not all drivers are created equal and several drivers on which pflash relies block in the kernel for quite some time and ignore signals.

    This is really only a problem if pflash is to perform large erases. So don’t; perform these ops in small chunks.

    An in-kernel fix is possible in most cases but it takes time and systems will be running older drivers for quite some time. Since sector erases aren’t significantly slower than whole chip erases there isn’t much of a performance penalty to breaking up the erase ioctl()s.
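
Splitting one large erase into smaller requests can be sketched as follows. The chunk size and function name are hypothetical and this is not the libflash code; each yielded pair stands for the range one erase ioctl() would cover.

```python
def erase_chunks(start: int, length: int, chunk: int):
    """Yield (offset, size) pairs covering [start, start + length) in pieces of at most chunk bytes."""
    if chunk <= 0:
        raise ValueError("chunk must be positive")
    end = start + length
    offset = start
    while offset < end:
        size = min(chunk, end - offset)
        yield offset, size
        offset += size


for offset, size in erase_chunks(0, 0x2800, 0x1000):
    print(hex(offset), hex(size))
```

Because each request is short, the process returns to user space between chunks and can respond to signals there instead of blocking in the kernel for the whole erase.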

General

Since skiboot-5.8-rc1:

  • gcov: support GCC 7.1+
  • Tests build and pass on Debian

    A few things related to the Debian toolchain.

Since skiboot-5.7:

  • opal-msg: Increase the max-async completion count by max chips possible

  • occ: Add support for OPAL-OCC command/response interface

    This patch adds support for a shared memory based command/response interface between OCC and OPAL. In HOMER, there is an OPAL command buffer and an OCC response buffer which are used to send inband commands to OCC.

  • HDAT/device-tree: only add lid-type on pre-POWER9 systems

    Largely a relic of back when we had multiple entry points into OPAL depending on which mechanism on an FSP we were using to get loaded, this isn’t needed on modern P9 as we only have one entry point (we don’t do the PHYP LID hack).

Contributors

  • Processed 156 csets from 17 developers
  • 1 employers found
  • A total of 6888 lines added, 1089 removed (delta 5799)

Developers with the most changesets

Developer # %
Cyril Bur 35 (22.4%)
Stewart Smith 32 (20.5%)
Michael Neuling 23 (14.7%)
Sukadev Bhattiprolu 11 (7.1%)
Reza Arbab 10 (6.4%)
Russell Currey 9 (5.8%)
Shilpasri G Bhat 9 (5.8%)
Oliver O’Halloran 5 (3.2%)
Haren Myneni 5 (3.2%)
Alistair Popple 4 (2.6%)
Vasant Hegde 4 (2.6%)
Nicholas Piggin 3 (1.9%)
Andrew Donnellan 2 (1.3%)
Gautham R. Shenoy 1 (0.6%)
Mahesh Salgaonkar 1 (0.6%)
Ananth N Mavinakayanahalli 1 (0.6%)
Frederic Barrat 1 (0.6%)

Developers with the most changed lines

Developer # %
Shilpasri G Bhat 1935 (27.9%)
Cyril Bur 1868 (26.9%)
Stewart Smith 866 (12.5%)
Sukadev Bhattiprolu 663 (9.5%)
Haren Myneni 584 (8.4%)
Michael Neuling 384 (5.5%)
Frederic Barrat 168 (2.4%)
Reza Arbab 98 (1.4%)
Oliver O’Halloran 98 (1.4%)
Vasant Hegde 93 (1.3%)
Alistair Popple 77 (1.1%)
Russell Currey 60 (0.9%)
Mahesh Salgaonkar 28 (0.4%)
Andrew Donnellan 11 (0.2%)
Gautham R. Shenoy 6 (0.1%)
Nicholas Piggin 4 (0.1%)
Ananth N Mavinakayanahalli 1 (0.0%)

Developers with the most signoffs

Developer # %
Stewart Smith 124 (97.6%)
Benjamin Herrenschmidt 2 (1.6%)
Vaidyanathan Srinivasan 1 (0.8%)
Total 127 (100%)

Developers with the most reviews

Developer # %
Samuel Mendoza-Jonas 19 (52.8%)
Andrew Donnellan 11 (30.6%)
Vasant Hegde 2 (5.6%)
Cédric Le Goater 1 (2.8%)
Russell Currey 1 (2.8%)
Reza Arbab 1 (2.8%)
Cyril Bur 1 (2.8%)
Total 36 (100%)

Developers with the most test credits

Developer # %
Vasant Hegde 1 (50.0%)
Hari Bathini 1 (50.0%)

Developers who gave the most tested-by credits

Developer # %
Russell Currey 1 (50.0%)
Mahesh Salgaonkar 1 (50.0%)

Developers with the most report credits

Developer # %
Anton Blanchard 1 (16.7%)
Mark Linimon 1 (16.7%)
Pavaman Subramaniyam 1 (16.7%)
Pridhiviraj Paidipeddi 1 (16.7%)
Rob Lippert 1 (16.7%)
Michael Neuling 1 (16.7%)

Developers who gave the most report credits

Developer # %
Stewart Smith 2 (33.3%)
Michael Neuling 1 (16.7%)
Andrew Donnellan 1 (16.7%)
Cyril Bur 1 (16.7%)
Gautham R. Shenoy 1 (16.7%)