PCI
===

Debugging
---------

There are a couple of NVRAM options for enabling extra debug
functionality to help debug PCI issues. These are not ABI and may be
changed or removed at **any** time.

Verbose EEH
^^^^^^^^^^^
::

  nvram -p ibm,skiboot --update-config pci-eeh-verbose=true

Disable EEH MMIO
^^^^^^^^^^^^^^^^
::

  nvram -p ibm,skiboot --update-config pci-eeh-mmio=disabled

Check for RX errors after link training
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Some PHB4 PHYs can get stuck in a bad state where they are constantly
retraining the link. This happens transparently to skiboot and Linux
but will cause PCIe to be slow. Resetting the PHB4 clears the problem.

We can detect this case by looking at the RX error count where we check
for link stability. Skiboot does this in the link optimal check by also
looking at RX errors: if errors are occurring, the link is retrained
irrespective of the chip revision or card.

Normally when this problem occurs, the RX error count is maxed out at
255. When there is no problem, the count is 0. A maximum of 8 RX errors
was chosen to give some margin for a few stray errors. There is also a
knob that can be used to set the error threshold at which the link is
retrained, i.e. ::

  nvram -p ibm,skiboot --update-config phb-rx-err-max=8

Retrain link if degraded
^^^^^^^^^^^^^^^^^^^^^^^^

On P9 Scale Out (Nimbus) DD2.0 and Scale Up (Cumulus) DD1.0 (and below)
the PCIe PHY can lock up, causing training issues. This can cause a
degradation in speed or width in ~5% of training cases (depending on
the card). This is fixed in later chip revisions. This issue can also
cause PCIe links to not train at all, but that case is already handled.

Skiboot checks whether the PCIe link has trained optimally and, if not,
does a full PHB reset (to fix the PHY lockup) and retrains.

One complication is that some devices are known to train degraded
unless device-specific configuration is performed. Because of this, we
only retrain when the device is in a whitelist. All devices in the
current whitelist have been tested on a P9DSU/Boston, ZZ and
Witherspoon.

We always gather information on the link and print it in the logs, even
if the card is not in the whitelist.

For testing purposes, there is an NVRAM option to retry all PCIe cards
and all P9 chips when a degraded link is detected. The option is
``pci-retry-all=true``, which can be set using: ::

  nvram -p ibm,skiboot --update-config pci-retry-all=true

This option may increase the boot time if used on a badly behaving
card.

Maximum link speed
^^^^^^^^^^^^^^^^^^

Was useful during bringup on P9 DD1. ::

  nvram -p ibm,skiboot --update-config pcie-max-link-speed=4

Ric Mata Mode
^^^^^^^^^^^^^

This mode (for PHB4) traces the training process closely. It activates
as soon as PERST is deasserted and produces human readable output of
the process. It also adds PCIe Link Training and Status State Machine
(LTSSM) tracing and details on speed and link width.

Output looks a bit like this: ::

  [    1.096995141,3] PHB#0000[0:0]: TRACE:0x0000001101000000  0ms          GEN1:x16:detect
  [    1.102849137,3] PHB#0000[0:0]: TRACE:0x0000102101000000 11ms presence GEN1:x16:polling
  [    1.104341838,3] PHB#0000[0:0]: TRACE:0x0000182101000000 14ms training GEN1:x16:polling
  [    1.104357444,3] PHB#0000[0:0]: TRACE:0x00001c5101000000 14ms training GEN1:x16:recovery
  [    1.104580394,3] PHB#0000[0:0]: TRACE:0x00001c5103000000 14ms training GEN3:x16:recovery
  [    1.123259359,3] PHB#0000[0:0]: TRACE:0x00001c5104000000 51ms training GEN4:x16:recovery
  [    1.141737656,3] PHB#0000[0:0]: TRACE:0x0000144104000000 87ms presence GEN4:x16:L0
  [    1.141752318,3] PHB#0000[0:0]: TRACE:0x0000154904000000 87ms trained  GEN4:x16:L0
  [    1.141757964,3] PHB#0000[0:0]: TRACE: Link trained.
  [    1.096834019,3] PHB#0001[0:1]: TRACE:0x0000001101000000  0ms          GEN1:x16:detect
  [    1.105578525,3] PHB#0001[0:1]: TRACE:0x0000102101000000 17ms presence GEN1:x16:polling
  [    1.112763075,3] PHB#0001[0:1]: TRACE:0x0000183101000000 31ms training GEN1:x16:config
  [    1.112778956,3] PHB#0001[0:1]: TRACE:0x00001c5081000000 31ms training GEN1:x08:recovery
  [    1.113002083,3] PHB#0001[0:1]: TRACE:0x00001c5083000000 31ms training GEN3:x08:recovery
  [    1.114833873,3] PHB#0001[0:1]: TRACE:0x0000144083000000 35ms presence GEN3:x08:L0
  [    1.114848832,3] PHB#0001[0:1]: TRACE:0x0000154883000000 35ms trained  GEN3:x08:L0
  [    1.114854650,3] PHB#0001[0:1]: TRACE: Link trained.

Enabled via NVRAM: ::

  nvram -p ibm,skiboot --update-config pci-tracing=true

Named after the person the output of this mode is typically sent to.

**WARNING**: The documentation below **urgently needs updating** and is
*woefully* incomplete.

IODA PE Setup Sequences
-----------------------

(**WARNING**: this was rescued from old internal documentation.
Needs verification)

To setup basic PE mappings, the host performs this basic sequence:

For ibm,opal-ioda2, prior to allocating PHB resources to PEs, the host
must allocate memory for PE structures and then call
``opal_pci_set_phb_table_memory(phb_id, rtt_addr, ivt_addr, ivt_len,
rrba_addr, peltv_addr)`` to define them to the PHB. OPAL returns
``OPAL_UNSUPPORTED`` status for ``ibm,opal-ioda`` PHBs.

The host calls ``opal_pci_set_pe(phb_id, pe_number, bus, dev, func,
validate_mask, bus_mask, dev_mask, func_mask)`` to map a PE to a PCI
RID or range of RIDs in the same PE domain.

The host calls ``opal_pci_set_peltv(phb_id, parent_pe, child_pe,
state)`` to set a parent PELT vector bit for the child PE argument to 1
(a child of the parent) or 0 (not in the parent PE domain).
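Purely as an illustration of the sequence above, the host-side calls
might be strung together as in the following hedged sketch. This is not
actual skiboot or Linux code: the prototypes are transcribed from the
argument lists quoted in this section (the real OPAL API definitions
may differ), the integer types are guesses, and the helper name, PE
numbers, RID masks and table addresses are made-up placeholders: ::

  /* Hedged sketch of the IODA2 PE setup sequence described above.
   * Prototypes follow the argument lists quoted in the text; the real
   * OPAL API definitions may differ.
   */
  #include <stdint.h>

  int64_t opal_pci_set_phb_table_memory(uint64_t phb_id, uint64_t rtt_addr,
                                        uint64_t ivt_addr, uint64_t ivt_len,
                                        uint64_t rrba_addr, uint64_t peltv_addr);
  int64_t opal_pci_set_pe(uint64_t phb_id, uint64_t pe_number, uint64_t bus,
                          uint64_t dev, uint64_t func, uint8_t validate_mask,
                          uint8_t bus_mask, uint8_t dev_mask, uint8_t func_mask);
  int64_t opal_pci_set_peltv(uint64_t phb_id, uint32_t parent_pe,
                             uint32_t child_pe, uint8_t state);

  static void setup_basic_pe_mappings(uint64_t phb_id)
  {
      /* ibm,opal-ioda2 only: hand the PHB its PE structure memory.
       * All addresses and lengths here are placeholders. */
      opal_pci_set_phb_table_memory(phb_id, /* rtt_addr   */ 0x1000000,
                                    /* ivt_addr   */ 0x2000000,
                                    /* ivt_len    */ 0x10000,
                                    /* rrba_addr  */ 0x3000000,
                                    /* peltv_addr */ 0x4000000);

      /* Map PE 2 to every RID under bus 1 (mask values are examples). */
      opal_pci_set_pe(phb_id, /* pe_number */ 2, /* bus */ 1,
                      /* dev */ 0, /* func */ 0,
                      /* validate_mask */ 1,
                      /* bus_mask */ 0xff, /* dev_mask */ 0,
                      /* func_mask */ 0);

      /* Mark PE 2 as a child of PE 1 in PE 1's PELT vector. */
      opal_pci_set_peltv(phb_id, /* parent_pe */ 1, /* child_pe */ 2,
                         /* state */ 1);
  }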
IODA MMIO Setup Sequences
-------------------------

(**WARNING**: this was rescued from old internal documentation.
Needs verification)

The host calls ``opal_pci_phb_mmio_enable(phb_id, window_type,
window_num, 0x0)`` to disable the MMIO window.

The host calls ``opal_pci_set_phb_mmio_window(phb_id, mmio_window,
starting_real_address, starting_pci_address, segment_size)`` to change
the MMIO window location in PCI and/or processor real address space, or
to change the segment size -- and corresponding window size -- of a
particular MMIO window.

The host calls ``opal_pci_map_pe_mmio_window(pe_number, mmio_window,
segment_number)`` to map PEs to window segments, for each segment
mapped to each PE.

The host calls ``opal_pci_phb_mmio_enable(phb_id, window_type,
window_num, 0x1)`` to enable the MMIO window.
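Again as a hedged illustration only, the four calls above might be
combined as below. The prototypes follow the argument lists quoted in
this section (the real OPAL API definitions may differ), and the window
type, window number, addresses and segment size are made-up
placeholders: ::

  /* Hedged sketch of the MMIO window sequence described above.
   * Prototypes are written from the argument lists quoted in the text;
   * all numeric values are placeholders.
   */
  #include <stdint.h>

  int64_t opal_pci_phb_mmio_enable(uint64_t phb_id, uint16_t window_type,
                                   uint16_t window_num, uint16_t enable);
  int64_t opal_pci_set_phb_mmio_window(uint64_t phb_id, uint16_t mmio_window,
                                       uint64_t starting_real_address,
                                       uint64_t starting_pci_address,
                                       uint64_t segment_size);
  int64_t opal_pci_map_pe_mmio_window(uint64_t pe_number, uint16_t mmio_window,
                                      uint16_t segment_number);

  static void remap_mmio_window(uint64_t phb_id, uint16_t window_type,
                                uint16_t window_num, uint64_t pe_number)
  {
      /* 1. Disable the window before changing it. */
      opal_pci_phb_mmio_enable(phb_id, window_type, window_num, 0x0);

      /* 2. Move/resize the window (example addresses and segment size). */
      opal_pci_set_phb_mmio_window(phb_id, window_num,
                                   /* starting_real_address */ 0x600000000ULL,
                                   /* starting_pci_address  */ 0x80000000ULL,
                                   /* segment_size          */ 0x10000000ULL);

      /* 3. Map each segment this PE owns (just segment 0 here). */
      opal_pci_map_pe_mmio_window(pe_number, window_num,
                                  /* segment_number */ 0);

      /* 4. Re-enable the window. */
      opal_pci_phb_mmio_enable(phb_id, window_type, window_num, 0x1);
  }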
IODA MSI Setup Sequences
------------------------

(**WARNING**: this was rescued from old internal documentation.
Needs verification)

To setup MSIs:

1. For ibm,opal-ioda PHBs, the host chooses an MVE for a PE to use and
   calls ``opal_pci_set_mve(phb_id, mve_number, pe_number)`` to setup
   the MVE for the PE number. HAL treats this call as a NOP and returns
   hal_success status for ibm,opal-ioda2 PHBs.

2. The host chooses an XIVE to use with a PE and calls

   a. ``opal_pci_set_xive_pe(phb_id, xive_number, pe_number)`` to
      authorize that PE to signal that XIVE as an interrupt. The host
      must call this function for each XIVE assigned to a particular
      PE, but may use this call for all XIVEs prior to calling
      ``opal_pci_set_mve()`` to bind the PE XIVEs to an MVE. For MSI
      conventional, the host must bind a unique MVE for each sequential
      set of 32 XIVEs.

   b. The host forms the interrupt_source_number from the combination
      of the device tree MSI property base BUID and XIVE number, as an
      input to ``opal_set_xive(interrupt_source_number, server_number,
      priority)`` and ``opal_get_xive(interrupt_source_number,
      server_number, priority)`` to set or return the server and
      priority numbers within an XIVE.

   c. ``opal_get_msi_64[32](phb_id, mve_number, xive_num, msi_range,
      msi_address, message_data)`` to determine the MSI DMA address (32
      or 64 bit) and message data value for that XIVE. For MSI
      conventional, the host uses this for each sequential power-of-2
      set of 1 to 32 MSIs, to determine the MSI DMA address and
      starting message data value for that MSI range. For MSI-X, the
      host calls this uniquely for each MSI interrupt with an msi_range
      input value of 1.

3. For ``ibm,opal-ioda`` PHBs, once the MVE and XIVRs are setup for a
   PE, the host calls ``opal_pci_set_mve_enable(phb_id, mve_number,
   state)`` to enable that MVE to be a valid target of MSI DMAs. The
   host may also call this function to disable an MVE when changing PE
   domains or states.

IODA DMA Setup Sequences
------------------------

(**WARNING**: this was rescued from old internal documentation.
Needs verification)

To manage DMA windows:

1. The host calls ``opal_pci_map_pe_dma_window(phb_id,
   dma_window_number, pe_number, tce_levels, tce_table_addr,
   tce_table_size, tce_page_size, uint64_t *pci_start_addr)`` to setup
   a DMA window for a PE to translate through a TCE table structure in
   KVM memory.

2. The host calls ``opal_pci_map_pe_dma_window_real(phb_id,
   dma_window_number, pe_number, mem_low_addr, mem_high_addr)`` to
   setup a DMA window for a PE that is not translated (but is validated
   by the PHB as an untranslated address space authorized to this PE).

A combined sketch of the MSI and DMA sequences above is given at the
end of this document.

Device Tree Bindings
--------------------

See :doc:`device-tree/pci` for device tree information.
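Finally, the IODA MSI and DMA sequences described above can be pictured
end to end as in the following hedged sketch. The prototypes are
transcribed from the argument lists quoted in those sections (the real
OPAL API definitions may differ), the integer types are guesses,
``msi_address``, ``message_data`` and ``pci_start_addr`` are assumed to
be output pointers, and the helper name and every numeric value are
placeholders for illustration: ::

  /* Hedged sketch of the MSI and DMA window sequences described above.
   * Prototypes follow the argument lists quoted in the text; the real
   * OPAL API definitions may differ.
   */
  #include <stdint.h>

  int64_t opal_pci_set_mve(uint64_t phb_id, uint32_t mve_number,
                           uint64_t pe_number);
  int64_t opal_pci_set_xive_pe(uint64_t phb_id, uint32_t xive_number,
                               uint64_t pe_number);
  int64_t opal_set_xive(uint32_t interrupt_source_number,
                        uint16_t server_number, uint8_t priority);
  int64_t opal_get_msi_64(uint64_t phb_id, uint32_t mve_number,
                          uint32_t xive_num, uint8_t msi_range,
                          uint64_t *msi_address, uint32_t *message_data);
  int64_t opal_pci_set_mve_enable(uint64_t phb_id, uint32_t mve_number,
                                  uint8_t state);
  int64_t opal_pci_map_pe_dma_window(uint64_t phb_id,
                                     uint16_t dma_window_number,
                                     uint64_t pe_number, uint16_t tce_levels,
                                     uint64_t tce_table_addr,
                                     uint64_t tce_table_size,
                                     uint64_t tce_page_size,
                                     uint64_t *pci_start_addr);

  /* interrupt_source_number is formed from the device tree MSI base BUID
   * and the XIVE number; the exact encoding is taken as given here. */
  static void setup_one_msi_and_dma(uint64_t phb_id, uint64_t pe_number,
                                    uint32_t interrupt_source_number)
  {
      uint32_t mve = 0, xive = 0, message_data = 0;
      uint64_t msi_address = 0, pci_start = 0;

      /* MSI, ibm,opal-ioda style: bind an MVE to the PE, authorize the
       * XIVE for the PE, set its server/priority, then fetch the MSI
       * address/data to program into the device before enabling the MVE. */
      opal_pci_set_mve(phb_id, mve, pe_number);
      opal_pci_set_xive_pe(phb_id, xive, pe_number);
      opal_set_xive(interrupt_source_number, /* server */ 0, /* priority */ 5);
      opal_get_msi_64(phb_id, mve, xive, /* msi_range */ 1,
                      &msi_address, &message_data);
      opal_pci_set_mve_enable(phb_id, mve, /* state */ 1);

      /* DMA: point window 0 of this PE at a single-level TCE table
       * (placeholder address and size, 4K TCE pages). */
      opal_pci_map_pe_dma_window(phb_id, /* dma_window_number */ 0,
                                 pe_number, /* tce_levels */ 1,
                                 /* tce_table_addr */ 0x5000000,
                                 /* tce_table_size */ 0x10000,
                                 /* tce_page_size  */ 0x1000,
                                 &pci_start);
  }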