Another technique to view kernel panic messages

Sometimes servers encounter fatal error conditions in the Linux kernel, and the kernel enters the “panic” state. In this state, it stops running all programs and prints to the screen the chain of function calls that led to the error, so that somebody can take a photo and present it to kernel developers as evidence. It then freezes until that somebody resets the server manually. In croit, however, the kernel is configured to reboot automatically 45 seconds after a panic, so after the fact it looks like a mysterious and unexpected server reboot.
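The reboot delay described above corresponds to the standard Linux kernel.panic sysctl, exposed as /proc/sys/kernel/panic. Below is a minimal sketch for checking it on a running host. The function name describe_panic_timeout is made up for illustration, and the value of 45 on croit systems comes from the text above, not from this snippet.

```shell
# Hypothetical helper: explain the value stored in the kernel.panic sysctl.
# An optional argument overrides the file to read (handy for testing).
describe_panic_timeout() {
    file="${1:-/proc/sys/kernel/panic}"
    if [ ! -r "$file" ]; then
        echo "cannot read $file"
        return 1
    fi
    seconds=$(cat "$file")
    if [ "$seconds" -gt 0 ]; then
        # Positive value: automatic reboot after that many seconds.
        echo "reboots $seconds seconds after a panic"
    elif [ "$seconds" -eq 0 ]; then
        # Zero: the kernel stays frozen until someone resets the machine.
        echo "stays halted after a panic"
    else
        # Negative value: reboot without any delay.
        echo "reboots immediately after a panic"
    fi
}

describe_panic_timeout || true
```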

In April 2023 we published a blog post that explained how to use netconsole to capture and view kernel panic messages. The technique presented there is still valid and reliable, but it requires upfront preparation. In most cases, there is another way.

Some theory

In June 2009, Hewlett-Packard, Intel, Microsoft, Phoenix, and Toshiba jointly published version 4.0 of the ACPI specification. ACPI advertises the system components and the available firmware functionality to the operating system via tables, some of which contain bytecode in the ACPI Machine Language (AML) that the operating system interprets. One of the new features of ACPI 4.0 is the ACPI Platform Error Interfaces (APEI), which include the Error Record Serialization Table (ERST).

The availability and implementation of ACPI features, including ERST, vary between BIOS and firmware vendors. As of 2024, not all computers implement these interfaces, but the vast majority of server-class machines do.

Linux recognizes the Error Record Serialization Table that the system BIOS puts into the system RAM. Upon encountering a fatal crash, it follows the instructions in this table to serialize the last portion of the kernel messages. The resulting records are stored in a platform-specific location whose details depend on the implementation and hardware architecture. The most common implementation stores them in on-board non-volatile memory, i.e., on the same flash chip that holds the system firmware. The bottom line is that the information persists across reboots and power cycles.

During the next boot, the Linux kernel retrieves the error records from the platform storage and exposes them through the pstore filesystem, usually mounted at /sys/fs/pstore. There is also a systemd service, systemd-pstore.service, that moves them into /var, thus freeing the precious space in the system flash chip.
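To see whether any records are still waiting in the platform storage, one can look at the pstore filesystem, which is usually mounted at /sys/fs/pstore. The sketch below only counts the files there. The helper name count_pstore_records is hypothetical, and the default path is an assumption about where pstore is mounted on your system.

```shell
# Hypothetical helper: count crash records waiting in a pstore directory.
# Defaults to /sys/fs/pstore, the usual mount point of the pstore filesystem.
count_pstore_records() {
    dir="${1:-/sys/fs/pstore}"
    if [ ! -d "$dir" ]; then
        echo "no pstore directory at $dir"
        return 1
    fi
    # Count regular files only; an empty directory means nothing is pending.
    count=$(find "$dir" -maxdepth 1 -type f | wc -l)
    echo "$((count)) record file(s) in $dir"
}

count_pstore_records || true
```

If systemd-pstore.service has already run during this boot, a count of zero is expected, because the service has moved the records into /var by then.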

Instructions

To check whether a server supports ERST, run this command on it:

dmesg | grep ERST

If there is no output, it means that the system does not support ERST, and there is nothing you can do to enable it. On a server that supports ERST, the output would look like this:

[    0.007156] ACPI: ERST 0x00000000A77B4000 000230 (v01 AMIER  AMI.ERST 00000000 AMI. 00000000)
[    0.007190] ACPI: Reserving ERST table memory at [mem 0xa77b4000-0xa77b422f]
[    1.005528] ERST: Error Record Serialization Table (ERST) support is initialized.

If you get similar output, congratulations: no further action is needed. In croit OS images, everything is already set up to capture kernel crash logs to the platform storage via ERST by default.
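When checking many servers, the same test can be wrapped in a small function that reads kernel log lines on standard input and looks for the initialization message shown above. This is only a sketch: the name erst_status is made up, and the match relies on the exact message text from the sample output, which may differ between kernel versions.

```shell
# Hypothetical helper: read kernel log lines on stdin and report ERST status.
# The message text is taken from the sample dmesg output in this post.
erst_status() {
    if grep -qF 'ERST: Error Record Serialization Table (ERST) support is initialized'; then
        echo "initialized"
    else
        echo "not found in log"
    fi
}

# Typical use on a live host (dmesg may require root privileges):
#   dmesg | erst_status
```

Note that "not found in log" can also appear on a long-running host whose kernel ring buffer has already dropped the early boot messages.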

You can crash a server for testing purposes like this:

echo c > /proc/sysrq-trigger

If a server crashes and reboots without any logs, and you suspect that it was a kernel panic, all that is needed to read the panic message is to look into /var/lib/systemd/pstore:

cat /var/lib/systemd/pstore/*/dmesg.txt

That’s it! Note, however, that this file will not persist across the next reboot, as /var is volatile on croit systems. It is better to copy it to a safer place, e.g., to your laptop.
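One possible way to get the logs off the server is to pack the whole directory into a single archive first and then fetch that file. The sketch below assumes the default systemd-pstore location used above. The function name archive_pstore and the default archive path are hypothetical.

```shell
# Hypothetical helper: pack saved crash logs into one tarball.
# $1 = source directory, $2 = archive to create; prints the archive path.
archive_pstore() {
    src="${1:-/var/lib/systemd/pstore}"
    dest="${2:-/tmp/pstore-$(uname -n).tar.gz}"
    if [ ! -d "$src" ]; then
        echo "no saved crash logs at $src" >&2
        return 1
    fi
    # -C keeps the paths inside the archive relative to the source directory.
    tar -czf "$dest" -C "$src" . && echo "$dest"
}

# Typical use on the server:
#   archive_pstore
```

The resulting file can then be fetched from your laptop, for example with scp. The default archive location may itself be volatile, so fetch it before the next reboot.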

If this command finds no files, it means that the cause of the unexpected server reboot was not a kernel panic. The suggested diagnostic step is then to examine the IPMI logs:

ipmitool sel list

Conclusion

As already mentioned in the previous blog post, a kernel panic is always a bug in the kernel. Having more than one option to capture the panic message, and especially a setup that is enabled by default on hardware that supports it, increases the chances for kernel developers to identify its root cause.