How to log errors on OPAL

Currently, errors reported by OPAL interfaces are free-form, whereas errors reported by the service processor use the standard Platform Error Log (PEL) format. For out-of-band management via IPMI interfaces, errors reported by OPAL must be pushed down to the service processor via the mailbox in PEL format.

A PEL can vary in size from 2K to 16K bytes, and its fields need to be populated based on the kind of event or error being reported. All the information to be reported as part of the error is passed by the user through the error-logging interfaces outlined below. The PEL structure is then generated from that input and passed on to the service processor.

We do create the eSEL error log format for some service processors, but it is just a wrapper around the PEL format. The actual data still stays in PEL format.

Error logging interfaces in OPAL

Interfaces are provided for the user to log/report an error in OPAL. Using these interfaces, the relevant error information is collected, later converted to PEL format, and then pushed to the service processor.

Step 1: To report an error, invoke opal_elog_create() with the required arguments.

struct errorlog *opal_elog_create(struct opal_err_info *e_info, uint32_t tag);

Parameters:

  • struct opal_err_info *e_info: Struct to hold information identifying error/event source.

  • uint32_t tag: Unique value to identify the data.

    Ideally the ASCII value of a 4-byte string.
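
As an illustration of the 4-byte ASCII tag convention, the sketch below packs four characters into a uint32_t with the first character in the most significant byte. elog_tag() is a hypothetical helper introduced here for illustration; it is not an OPAL interface.

```c
#include <stdint.h>

/*
 * Pack four ASCII characters into a 32-bit tag, first character in the
 * most significant byte. elog_tag() is a hypothetical helper, not part
 * of the OPAL error-logging interfaces.
 */
static inline uint32_t elog_tag(char a, char b, char c, char d)
{
	return ((uint32_t)(uint8_t)a << 24) |
	       ((uint32_t)(uint8_t)b << 16) |
	       ((uint32_t)(uint8_t)c << 8)  |
	        (uint32_t)(uint8_t)d;
}
```

With this helper, elog_tag('K', 'K', 'K', 'K') gives 0x4b4b4b4b, the tag value used in the sample code at the end of this document.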

The opal_err_info struct holds several pieces of information to help identify the error/event. The struct can be obtained via the DEFINE_LOG_ENTRY macro as below - it only needs to be called once.

DEFINE_LOG_ENTRY(OPAL_RC_ATTN, OPAL_PLATFORM_ERR_EVT, OPAL_CHIP,
               OPAL_PLATFORM_FIRMWARE, OPAL_PREDICTIVE_ERR_GENERAL,
               OPAL_NA);

The various attributes set by this macro are described below.

uint8_t opal_error_event_type: Classification of error/events type reported on OPAL.

/* Platform Events/Errors: Report Machine Check Interrupt */
#define OPAL_PLATFORM_ERR_EVT           0x01
/* INPUT_OUTPUT: Report all I/O related events/errors */
#define OPAL_INPUT_OUTPUT_ERR_EVT       0x02
/* RESOURCE_DEALLOC: Hotplug events and errors */
#define OPAL_RESOURCE_DEALLOC_ERR_EVT   0x03
/* MISC: Miscellaneous error */
#define OPAL_MISC_ERR_EVT               0x04

uint16_t component_id: Component ID of OPAL component as listed in include/errorlog.h.

uint8_t subsystem_id: ID of the sub-system reporting error.

/* OPAL Subsystem IDs listed for reporting events/errors */
#define OPAL_PROCESSOR_SUBSYSTEM        0x10
#define OPAL_MEMORY_SUBSYSTEM           0x20
#define OPAL_IO_SUBSYSTEM               0x30
#define OPAL_IO_DEVICES                 0x40
#define OPAL_CEC_HARDWARE               0x50
#define OPAL_POWER_COOLING              0x60
#define OPAL_MISC                       0x70
#define OPAL_SURVEILLANCE_ERR           0x7A
#define OPAL_PLATFORM_FIRMWARE          0x80
#define OPAL_SOFTWARE                   0x90
#define OPAL_EXTERNAL_ENV               0xA0

uint8_t event_severity: Severity of the event/error to be reported.

#define OPAL_INFO                                   0x00
#define OPAL_RECOVERED_ERR_GENERAL                  0x10

/* 0x2X series is to denote set of Predictive Error */
/* 0x20 Generic predictive error */
#define OPAL_PREDICTIVE_ERR_GENERAL                         0x20
/* 0x21 Predictive error, degraded performance */
#define OPAL_PREDICTIVE_ERR_DEGRADED_PERF                   0x21
/* 0x22 Predictive error, fault may be corrected after reboot */
#define OPAL_PREDICTIVE_ERR_FAULT_RECTIFY_REBOOT            0x22
/*
 * 0x23 Predictive error, fault may be corrected after reboot,
 * degraded performance
 */
#define OPAL_PREDICTIVE_ERR_FAULT_RECTIFY_BOOT_DEGRADE_PERF 0x23
/* 0x24 Predictive error, loss of redundancy */
#define OPAL_PREDICTIVE_ERR_LOSS_OF_REDUNDANCY              0x24

/* 0x4X series for Unrecoverable Error */
/* 0x40 Generic Unrecoverable error */
#define OPAL_UNRECOVERABLE_ERR_GENERAL                      0x40
/* 0x41 Unrecoverable error bypassed with degraded performance */
#define OPAL_UNRECOVERABLE_ERR_DEGRADE_PERF                 0x41
/* 0x44 Unrecoverable error bypassed with loss of redundancy */
#define OPAL_UNRECOVERABLE_ERR_LOSS_REDUNDANCY              0x44
/* 0x45 Unrecoverable error bypassed with loss of redundancy
 * and performance
 */
#define OPAL_UNRECOVERABLE_ERR_LOSS_REDUNDANCY_PERF         0x45
/* 0x48 Unrecoverable error bypassed with loss of function */
#define OPAL_UNRECOVERABLE_ERR_LOSS_OF_FUNCTION             0x48

#define OPAL_ERROR_PANIC                                    0x50
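
The severity values are grouped by their upper nibble (0x2X for predictive errors, 0x4X for unrecoverable errors). A minimal sketch of how a consumer might classify a severity byte is shown below. The enum and function names are hypothetical and exist only for this example; the other severities are matched by their exact values as listed above.

```c
#include <stdint.h>

/* Hypothetical severity classes, for illustration only. */
enum elog_sev_class {
	ELOG_SEV_INFO,
	ELOG_SEV_RECOVERED,
	ELOG_SEV_PREDICTIVE,
	ELOG_SEV_UNRECOVERABLE,
	ELOG_SEV_PANIC,
	ELOG_SEV_UNKNOWN,
};

/* Classify an event_severity byte using the value ranges listed above. */
static enum elog_sev_class elog_sev_classify(uint8_t sev)
{
	if ((sev & 0xF0) == 0x20)	/* 0x2X series: predictive errors */
		return ELOG_SEV_PREDICTIVE;
	if ((sev & 0xF0) == 0x40)	/* 0x4X series: unrecoverable errors */
		return ELOG_SEV_UNRECOVERABLE;

	switch (sev) {
	case 0x00:			/* OPAL_INFO */
		return ELOG_SEV_INFO;
	case 0x10:			/* OPAL_RECOVERED_ERR_GENERAL */
		return ELOG_SEV_RECOVERED;
	case 0x50:			/* OPAL_ERROR_PANIC */
		return ELOG_SEV_PANIC;
	default:
		return ELOG_SEV_UNKNOWN;
	}
}
```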

uint8_t  event_subtype: Event Sub-type

#define OPAL_NA                                         0x00
#define OPAL_MISCELLANEOUS_INFO_ONLY                    0x01
#define OPAL_PREV_REPORTED_ERR_RECTIFIED                0x10
#define OPAL_SYS_RESOURCES_DECONFIG_BY_USER             0x20
#define OPAL_SYS_RESOURCE_DECONFIG_PRIOR_ERR            0x21
#define OPAL_RESOURCE_DEALLOC_EVENT_NOTIFY              0x22
#define OPAL_CONCURRENT_MAINTENANCE_EVENT               0x40
#define OPAL_CAPACITY_UPGRADE_EVENT                     0x60
#define OPAL_RESOURCE_SPARING_EVENT                     0x70
#define OPAL_DYNAMIC_RECONFIG_EVENT                     0x80
#define OPAL_NORMAL_SYS_PLATFORM_SHUTDOWN               0xD0
#define OPAL_ABNORMAL_POWER_OFF                         0xE0

uint8_t opal_srctype: SRC type, value should be OPAL_SRC_TYPE_ERROR.

SRC refers to System Reference Code. It is a 4-byte hexadecimal number that reflects the current system state. E.g. for BB821010:

  • 1st byte -> BB -> SRC Type

  • 2nd byte -> 82 -> Subsystem

  • 3rd, 4th byte -> Component ID and Reason Code

SRC needs to be generated on the fly depending on the state of the system. All the parameters needed to generate a SRC should be provided during reporting of an event/error.
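
The byte layout of the SRC described above can be sketched as follows. This is only an illustration of the field positions; struct src_fields, src_decode() and src_encode() are hypothetical names and do not reflect how OPAL builds the SRC internally.

```c
#include <stdint.h>

/* Hypothetical view of a 4-byte SRC, for illustration only. */
struct src_fields {
	uint8_t  type;		/* 1st byte: SRC type */
	uint8_t  subsystem;	/* 2nd byte: subsystem */
	uint16_t comp_reason;	/* 3rd and 4th bytes: component ID and reason code */
};

/* Split a 32-bit SRC value into its fields. */
static struct src_fields src_decode(uint32_t src)
{
	struct src_fields f;

	f.type = (uint8_t)(src >> 24);
	f.subsystem = (uint8_t)(src >> 16);
	f.comp_reason = (uint16_t)(src & 0xFFFF);
	return f;
}

/* Rebuild the 32-bit SRC value from its fields. */
static uint32_t src_encode(uint8_t type, uint8_t subsystem,
			   uint16_t comp_reason)
{
	return ((uint32_t)type << 24) |
	       ((uint32_t)subsystem << 16) |
	        (uint32_t)comp_reason;
}
```

For the example value BB821010, src_decode() yields type 0xBB, subsystem 0x82 and a combined component/reason value of 0x1010.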

uint32_t reason_code: Reason for failure as stated in include/errorlog.h for OPAL.

Eg: Reason code for code-update failures can be

  • OPAL_RC_CU_INIT -> Initialisation failure

  • OPAL_RC_CU_FLASH -> Flash failure

Step 2: Data can be appended to the user data section using either of the two interfaces below:

void log_append_data(struct errorlog *buf, unsigned char *data,
                      uint16_t size);

Parameters:

struct errorlog *buf: Pointer returned by the opal_elog_create() call.

unsigned char *data: Pointer to the dump data.

uint16_t size: Size of the dump data.

void log_append_msg(struct errorlog *buf, const char *fmt, ...);

Parameters:

struct errorlog *buf: Pointer returned by the opal_elog_create() call.

const char *fmt: Formatted error log string.

Additional user data sections can be added to the error log to separate data (e.g. readable text vs. binary data) by calling log_add_section(). The interfaces in Step 2 operate on the ‘last’ user data section of the error log.

void log_add_section(struct errorlog *buf, uint32_t tag);

Parameters:

struct errorlog *buf: Pointer returned by the opal_elog_create() call.

uint32_t tag: Unique value to identify the data.

Ideally the ASCII value of a 4-byte string.

Step 3: There is a platform hook for the OPAL error log to be committed on any service processor (currently used for FSP and BMC based machines).

Below is a snippet of code showing how this hook is called.

void log_commit(struct errorlog *elog)
{
        ....
        ....
        if (platform.elog_commit) {
                rc = platform.elog_commit(elog);
                if (rc)
                        prerror("ELOG: Platform commit error %d"
                                "\n", rc);
                return;
        }
        ....
        ....
}

Step 3.1 FSP:
.elog_commit            = elog_fsp_commit

Once all the data for an error has been logged, the error needs to be committed to the FSP.

While committing an error to the FSP, the log info is first converted internally to PEL format and then pushed to the FSP. The FSP then takes care of sending all logs, both its own and those from OPAL, to POWERNV.

OPAL maintains a timeout field for every error log it sends to the FSP. If a log is not committed within the allotted time period (e.g. if the FSP is down), OPAL sends that log to POWERNV instead.

Step 3.2 BMC:
.elog_commit            = ipmi_elog_commit

In case of BMC machines, error logs are first converted to eSEL format, i.e.:

eSEL = SEL header + PEL data

The SEL header contains the fields below.

struct sel_header {
        uint16_t id;
        uint8_t record_type;
        uint32_t timestamp;
        uint16_t genid;
        uint8_t evmrev;
        uint8_t sensor_type;
        uint8_t sensor_num;
        uint8_t dir_type;
        uint8_t signature;
        uint8_t reserved[2];
};

After filling in the SEL header fields, OPAL copies the error log PEL data after the header section in the error log buffer. The eSEL is then logged to the BMC using the IPMI interface.
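
The "SEL header + PEL data" layout can be sketched as a simple concatenation into one buffer. The function below is a hypothetical illustration, not the code used by ipmi_elog_commit; it treats the header and the PEL as opaque byte ranges and only shows the ordering and the bounds check.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Lay out an eSEL record as "SEL header + PEL data" in out[].
 * Returns the total length written, or 0 if out[] is too small.
 * esel_build() is a hypothetical helper, for illustration only.
 */
static size_t esel_build(uint8_t *out, size_t out_len,
			 const void *sel_hdr, size_t hdr_len,
			 const void *pel, size_t pel_len)
{
	if (hdr_len > out_len || pel_len > out_len - hdr_len)
		return 0;

	memcpy(out, sel_hdr, hdr_len);		/* SEL header first */
	memcpy(out + hdr_len, pel, pel_len);	/* PEL data right after it */
	return hdr_len + pel_len;
}
```

The caller is assumed to have filled in the header fields before the copy; the PEL bytes are not modified.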

If the user does not intend to dump various user data sections, but just wants to log the error with a short description, they can use the simple error logging interface.

log_simple_error(uint32_t reason_code, char *fmt, ...);

For example:

log_simple_error(OPAL_RC_SURVE_STATUS,
                "SURV: Error retrieving surveillance status: %d\n",
                                                        err_len);

Using the reason code, an error log is generated with the information derived from the look-up table, populated and committed to service processor. All of it is done with just one call.

Error log retrieval from FSP:

The FSP sends an error log notification to OPAL via the mailbox protocol.

OPAL maintains the lists below:

  • Free list: List of free nodes.

  • Pending list: List of nodes which are yet to be read by POWERNV.

  • Processed list: List of nodes which have been read but are still waiting for acknowledgement.

Below is the structure of the node:

struct fsp_log_entry {
        uint32_t log_id;
        size_t log_size;
        struct list_node link;
};

OPAL maintains a state machine which has the following states.

enum elog_head_state {
        ELOG_STATE_FETCHING,    /* In the process of reading log from FSP. */
        ELOG_STATE_FETCHED_INFO,/* Indicates reading log info is completed */
        ELOG_STATE_FETCHED_DATA,/* Indicates reading log is completed */
        ELOG_STATE_HOST_INFO,   /* Host read log info */
        ELOG_STATE_NONE,        /* Indicates to fetch next log */
        ELOG_STATE_REJECTED,    /* resend all pending logs to linux */
};

Initially, the state machine is in ELOG_STATE_NONE. When OPAL is notified about an error log, it takes a node from the free list, puts it on the pending list, and moves the state machine to the fetching state (ELOG_STATE_FETCHING). It also responds to the FSP to acknowledge the error log notification.

It then queues a mailbox message to fetch the error log data into the OPAL error log buffer. Once that completes, the state machine moves to the fetched state (ELOG_STATE_FETCHED_DATA). After that, OPAL notifies the POWERNV host to fetch the new error log.

POWERNV uses the OPAL interface to first get the error log info (elogid, elog_size, elog_type), and then reads the error log data into its buffer, which moves the pending error log to the processed list. After reading, the state machine moves to the ELOG_STATE_NONE state.

After reading the error log data, POWERNV acknowledges the error log id by making a call to OPAL, which in turn sends the acknowledgement mbox message to the FSP and moves the node from the processed list back to the free list. This process repeats for every FSP error log.
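
The movement of a node between the free, pending and processed lists can be sketched as below. This is a simplified model under stated assumptions: it tracks only which list each node is on (not the elog_head_state state machine or any mailbox traffic), it uses a fixed pool of 128 nodes, and all names other than ELOG_READ_MAX_RECORD are hypothetical.

```c
#include <stdint.h>

#define ELOG_READ_MAX_RECORD 128

/* Which list a node is currently on (hypothetical names). */
enum node_list { ON_FREE = 0, ON_PENDING, ON_PROCESSED };

struct log_node {
	uint32_t log_id;
	enum node_list where;
};

/* Zero-initialised, so every node starts on the free list. */
static struct log_node nodes[ELOG_READ_MAX_RECORD];

/* Move one node from list 'from' to list 'to'; returns -1 if none found. */
static int move_node(enum node_list from, enum node_list to,
		     uint32_t id, int match_id)
{
	for (int i = 0; i < ELOG_READ_MAX_RECORD; i++) {
		if (nodes[i].where != from)
			continue;
		if (match_id && nodes[i].log_id != id)
			continue;
		nodes[i].log_id = id;
		nodes[i].where = to;
		return 0;
	}
	return -1;
}

/* FSP notification: free -> pending. -1 means no free node is left. */
static int elog_notify(uint32_t id)
{
	return move_node(ON_FREE, ON_PENDING, id, 0);
}

/* Host has read the log data: pending -> processed. */
static int elog_host_read(uint32_t id)
{
	return move_node(ON_PENDING, ON_PROCESSED, id, 1);
}

/* Host has acknowledged the log: processed -> free. */
static int elog_host_ack(uint32_t id)
{
	return move_node(ON_PROCESSED, ON_FREE, id, 1);
}

/* Count the nodes currently on a given list. */
static int elog_count(enum node_list l)
{
	int n = 0;

	for (int i = 0; i < ELOG_READ_MAX_RECORD; i++)
		n += (nodes[i].where == l);
	return n;
}
```

In this model an acknowledgement for a log that has not been read yet fails, and a notification that arrives when no free node is left returns -1, which is the point at which a real implementation would have to reject the log.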

Design constraints:

#define ELOG_READ_MAX_RECORD            128

Currently, the number of FSP error logs that OPAL can hold is limited to 128. If OPAL runs out of free nodes in the list for a new error log, it sends a ‘Discarded by OPAL’ message to the FSP. It is then up to the FSP to notify OPAL again about the discarded error log at some point in the future.

#define ELOG_WRITE_MAX_RECORD           64

The number of OPAL error logs that OPAL can hold is also limited, to 64. If it runs out of buffers in the pool, it logs the message ‘Failed to get the buffer’.

Note

  • For more information regarding error logging and the PEL format, refer to the PAPR doc and the P7 PEL and SRC PLDD document.

  • Refer to include/errorlog.h for all the error logging interface parameters and include/pel.h for PEL structures.

Sample error logging

 DEFINE_LOG_ENTRY(OPAL_RC_ATTN, OPAL_PLATFORM_ERR_EVT, OPAL_ATTN,
                OPAL_PLATFORM_FIRMWARE, OPAL_PREDICTIVE_ERR_GENERAL,
                OPAL_NA);

 void report_error(int index)
 {
       struct errorlog *buf;
       char data1[] = "This is a sample user defined data section1";
       char data2[] = "Error logging sample. These are dummy errors. Section 2";
       char data3[] = "Sample error Sample error Sample error Sample error \
                        Sample error abcdefghijklmnopqrstuvwxyz";
       uint32_t tag;

       printf("ELOG: In machine check report error index: %d\n", index);

       /* To report an error, create an error log with the relevant
        * information using opal_elog_create(). The call returns a
        * pre-allocated buffer of type 'struct errorlog' with the relevant
        * fields updated.
        */

       /* tag -> unique ascii tag to identify a particular data dump section */
       tag = 0x4b4b4b4b;
       buf = opal_elog_create(&e_info(OPAL_RC_ATTN), tag);
       if (buf == NULL) {
               printf("ELOG: Error getting buffer.\n");
               return;
       }

       /* Append data or text with log_append_data() or log_append_msg() */
       log_append_data(buf, data1, sizeof(data1));

       /* If the user wants to add multiple sections of various dump data
        * for better debug, data sections can be added using this interface:
        * void log_add_section(struct errorlog *buf, uint32_t tag);
        */
       tag = 0x4c4c4c4c;
       log_add_section(buf, tag);
       log_append_data(buf, data2, sizeof(data2));
       log_append_data(buf, data3, sizeof(data3));

       /* Once all info is updated, ready to be sent to FSP */
       printf("ELOG:commit to FSP\n");
       log_commit(buf);
}

Sample PEL dump output obtained from FSP

$ errl -d -x 0x533C9B37
|   00000000     50480030  01004154  20150728  02000500     PH.0..AT ..(....   |
|   00000010     20150728  02000566  4B000107  00000000      ..(...fK.......   |
|   00000020     00000000  00000000  B0000002  533C9B37     ............S..7   |
|   00000030     55480018  01004154  80002000  00000000     UH....AT.. .....   |
|   00000040     00002000  01005300  50530050  01004154     .. ...S.PS.P..AT   |
|   00000050     02000008  00000048  00000080  00000000     .......H........   |
|   00000060     00000000  00000000  00000000  00000000     ................   |
|   00000070     00000000  00000000  42423832  31343130     ........BB821410   |
|   00000080     20202020  20202020  20202020  20202020                        |
|   00000090     20202020  20202020  4548004C  01004154             EH.L..AT   |
|   000000A0     38323836  2D343241  31303738  34415400     8286-42A10784AT.   |
|   000000B0     00000000  00000000  00000000  00000000     ................   |
|   000000C0     00000000  00000000  00000000  00000000     ................   |
|   000000D0     00000000  00000000  20150728  02000500     ........ ..(....   |
|   000000E0     00000000  4D54001C  01004154  38323836     ....MT....AT8286   |
|   000000F0     2D343241  31303738  34415400  00000000     -42A10784AT.....   |
|   00000100     5544003C  01004154  4B4B4B4B  00340000     UD....ATKKKK.4..   |
|   00000110     54686973  20697320  61207361  6D706C65     This is a sample   |
|   00000120     20757365  72206465  66696E65  64206461      user defined da   |
|   00000130     74612073  65637469  6F6E3100  554400A7     ta section1.UD..   |
|   00000140     01004154  4C4C4C4C  009F0000  4572726F     ..ATLLLL....Erro   |
|   00000150     72206C6F  6767696E  67207361  6D706C65     r logging sample   |
|   00000160     2E205468  65736520  61726520  64756D6D     . These are dumm   |
|   00000170     79206572  726F7273  2E205365  6374696F     y errors. Sectio   |
|   00000180     6E203200  53616D70  6C652065  72726F72     n 2.Sample error   |
|   00000190     2053616D  706C6520  6572726F  72205361      Sample error Sa   |
|   000001A0     6D706C65  20657272  6F722053  616D706C     mple error Sampl   |
|   000001B0     65206572  726F7220  09090953  616D706C     e error ...Sampl   |
|   000001C0     65206572  726F7220  61626364  65666768     e error abcdefgh   |
|   000001D0     696A6B6C  6D6E6F70  71727374  75767778     ijklmnopqrstuvwx   |
|   000001E0     797A00                                     yz.                |
|------------------------------------------------------------------------------|
|                       Platform Event Log - 0x533C9B37                        |
|------------------------------------------------------------------------------|
|                                Private Header                                |
|------------------------------------------------------------------------------|
| Section Version          : 1                                                 |
| Sub-section type         : 0                                                 |
| Created by               : 4154                                              |
| Created at               : 07/28/2015 02:00:05                               |
| Committed at             : 07/28/2015 02:00:05                               |
| Creator Subsystem        : OPAL                                              |
| CSSVER                   :                                                   |
| Platform Log Id          : 0xB0000002                                        |
| Entry Id                 : 0x533C9B37                                        |
| Total Log Size           : 483                                               |
|------------------------------------------------------------------------------|
|                                 User Header                                  |
|------------------------------------------------------------------------------|
| Section Version          : 1                                                 |
| Sub-section type         : 0                                                 |
| Log Committed by         : 4154                                              |
| Subsystem                : Platform Firmware                                 |
| Event Scope              : Unknown - 0x00000000                              |
| Event Severity           : Predictive Error                                  |
| Event Type               : Not Applicable                                    |
| Return Code              : 0x00000000                                        |
| Action Flags             : Report Externally                                 |
| Action Status            : Sent to Hypervisor                                |
|------------------------------------------------------------------------------|
|                        Primary System Reference Code                         |
|------------------------------------------------------------------------------|
| Section Version          : 1                                                 |
| Sub-section type         : 0                                                 |
| Created by               : 4154                                              |
| SRC Format               : 0x80                                              |
| SRC Version              : 0x02                                              |
| Virtual Progress SRC     : False                                             |
| I5/OS Service Event Bit  : False                                             |
| Hypervisor Dump Initiated: False                                             |
| Power Control Net Fault  : False                                             |
|                                                                              |
| Valid Word Count         : 0x08                                              |
| Reference Code           : BB821410                                          |
| Hex Words 2 - 5          : 00000080 00000000 00000000 00000000               |
| Hex Words 6 - 9          : 00000000 00000000 00000000 00000000               |
|                                                                              |
|------------------------------------------------------------------------------|
|                             Extended User Header                             |
|------------------------------------------------------------------------------|
| Section Version          : 1                                                 |
| Sub-section type         : 0                                                 |
| Created by               : 4154                                              |
| Reporting Machine Type   : 8286-42A                                          |
| Reporting Serial Number  : 10784AT                                           |
| FW Released Ver          :                                                   |
| FW SubSys Version        :                                                   |
| Common Ref Time          : 07/28/2015 02:00:05                               |
| Symptom Id Len           : 0                                                 |
| Symptom Id               :                                                   |
|------------------------------------------------------------------------------|
|                      Machine Type/Model & Serial Number                      |
|------------------------------------------------------------------------------|
| Section Version          : 1                                                 |
| Sub-section type         : 0                                                 |
| Created by               : 4154                                              |
| Machine Type Model       : 8286-42A                                          |
| Serial Number            : 10784AT                                           |
|------------------------------------------------------------------------------|
|                              User Defined Data                               |
|------------------------------------------------------------------------------|
| Section Version          : 1                                                 |
| Sub-section type         : 0                                                 |
| Created by               : 4154                                              |
|                                                                              |
|   00000000     4B4B4B4B  00340000  54686973  20697320     KKKK.4..This is    |
|   00000010     61207361  6D706C65  20757365  72206465     a sample user de   |
|   00000020     66696E65  64206461  74612073  65637469     fined data secti   |
|   00000030     6F6E3100                                   on1.               |
|                                                                              |
|------------------------------------------------------------------------------|
|                              User Defined Data                               |
|------------------------------------------------------------------------------|
| Section Version          : 1                                                 |
| Sub-section type         : 0                                                 |
| Created by               : 4154                                              |
|                                                                              |
|   00000000     4C4C4C4C  009F0000  4572726F  72206C6F     LLLL....Error lo   |
|   00000010     6767696E  67207361  6D706C65  2E205468     gging sample. Th   |
|   00000020     65736520  61726520  64756D6D  79206572     ese are dummy er   |
|   00000030     726F7273  2E205365  6374696F  6E203200     rors. Section 2.   |
|   00000040     53616D70  6C652065  72726F72  2053616D     Sample error Sam   |
|   00000050     706C6520  6572726F  72205361  6D706C65     ple error Sample   |
|   00000060     20657272  6F722053  616D706C  65206572      error Sample er   |
|   00000070     726F7220  09090953  616D706C  65206572     ror ...Sample er   |
|   00000080     726F7220  61626364  65666768  696A6B6C     ror abcdefghijkl   |
|   00000090     6D6E6F70  71727374  75767778  797A00       mnopqrstuvwxyz.    |
|                                                                              |
|------------------------------------------------------------------------------|
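
As a reading aid for the User Defined Data sections in the dump above, the sketch below decodes the first bytes of such a section as a big-endian 4-byte tag followed by a big-endian 2-byte size. This layout is inferred from the sample dump only, not taken from include/pel.h, and struct ud_hdr and ud_parse() are hypothetical names.

```c
#include <stdint.h>

/* Hypothetical view of the start of a user data section. */
struct ud_hdr {
	uint32_t tag;	/* bytes 0-3, big-endian */
	uint16_t size;	/* bytes 4-5, big-endian */
};

/* Decode tag and size from the first six bytes at p. */
static struct ud_hdr ud_parse(const uint8_t *p)
{
	struct ud_hdr h;

	h.tag = ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
		((uint32_t)p[2] << 8) | (uint32_t)p[3];
	h.size = (uint16_t)(((unsigned int)p[4] << 8) | p[5]);
	return h;
}
```

For the first section in the dump (4B4B4B4B 00340000) this gives tag 0x4B4B4B4B ("KKKK") and size 0x34 (52). In this dump, 52 matches the 8 bytes shown before the payload plus the 44 bytes of data1 from the sample code.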