Troubleshooting Kernel Panic (ESXi/Linux)

Kernel panic on Linux is hard to identify and troubleshoot. Finding a root cause of a kernel panic often requires reproducing a situation that occurs rarely and collecting data that is difficult to gather. As the name implies, the Linux kernel gets into a situation where it doesn’t know what to do next. When this happens, the kernel gives as much information as it can about what caused the problem, depending on what caused the panic.

1. Overview of Kernel Panic

There are two kinds of kernel panics:

  1. Hard Panic (also known as Aieee!) – mostly hardware related
  2. Soft Panic (also known as Oops…) – mostly software related


Most panics are the result of unhandled processor exceptions in kernel code, such as references to invalid memory addresses. They can also be caused by one or more of the following issues:

  • Defective or incompatible RAM (damaged RAM memory unit, dirty/disconnecting RAM unit, faulty RAM controller etc.)
  • Incompatible, obsolete, or corrupted kernel extensions. If a kernel extension or one of its dependencies is corrupted, such as the result of hard disk corruption, kernel panics are likely to occur when the kernel attempts to load or execute the extension (this means OS corruption 90% of the time)
  • Incompatible, obsolete, or corrupted drivers. Similar to kernel extensions, drivers for third-party hardware that are incompatible with the OS version you are using, or that have become corrupted, will result in kernel panics (driver re-installation or upgrade needed)
  • Hard disk corruption, including bad sectors, directory corruption, and other hard-disk ills (easily the most common cause of panics)
  • Incorrect permissions on System-related files or folders (no root access to /root or /var etc.)
  • Insufficient RAM and available hard disk space (VM over provisioning, thin disk over growth etc.)
  • Improperly installed hardware or software (faulty add-ons or packages, extension cards not fully seated, etc.)
  • Defective hardware or software. Hardware failures, including a defective CPU, or programming errors can result in kernel panics
  • Incompatible hardware (more RAM than the CPU can address, hardware voltage requirements higher than the PSU can supply, etc.)

Only modules that are located within kernel space can directly cause the kernel to panic. To see which modules are dynamically loaded, run `lsmod` – it lists all dynamically loaded modules (Dialogic drivers, LiS, SCSI driver, filesystem, etc.). In addition to these dynamically loaded modules, components that are built into the kernel (memory map, etc.) can cause a panic.
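As a quick sketch of that check, the commands below list the loaded modules and then test for one by name; `streams_dlgnDriver` is a hypothetical module name used only for illustration:

```shell
# List all dynamically loaded kernel modules (name, size, use count, users).
lsmod

# lsmod reads /proc/modules, so you can also grep it directly for a
# specific module. "streams_dlgnDriver" is a hypothetical example name.
if grep -q '^streams_dlgnDriver' /proc/modules 2>/dev/null; then
    echo "module loaded"
else
    echo "module not loaded"
fi
```

On a system without the Dialogic drivers installed, the check prints "module not loaded".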

Since hard panics and soft panics are different in nature, we will discuss how to deal with each separately.

2. Troubleshooting Hard Kernel Panic

2.1 Symptoms of a hard kernel panic

  • Machine is completely locked up and unusable
  • Num Lock / Caps Lock / Scroll Lock keys usually blink
  • If in console mode, dump is displayed on monitor (including the phrase Aieee!)
  • Similar to Windows Blue Screen

2.2 Hard kernel panic signature causes

The most common cause of a hard kernel panic is when a driver crashes within an interrupt handler, usually because it tried to access a null pointer within the interrupt handler. When this happens, that driver cannot handle any new interrupts and eventually the system crashes. This is not exclusive to Dialogic drivers.


2.3 Hard kernel panic – information to collect

Depending on the nature of the panic, the kernel will log all information it can prior to locking up. Since a kernel panic is a drastic failure, it is uncertain how much information will be logged. Below are key pieces of information to collect. It is important to collect as many of these as possible, but there is no guarantee that all of them will be available, especially the first time a panic is seen.

  • /var/log/messages — sometimes the entire kernel panic stack trace will be logged there
  • Application / Library logs (RTF, cheetah, etc.) – may show what was happening before the panic
  • Other information about what happened just prior to the panic, or how to reproduce
  • Screen dump from console. Since the OS is locked, you cannot cut and paste from the screen. There are two common ways to get this info:
    - Digital picture of the screen (preferred, since it’s quicker and easier); even a quick picture from a smartphone will do
    - Copying the screen by hand with pen and paper, or typing it into another computer

If the dump is not available either in `/var/log/messages` or on the screen, follow these tips to get a dump:

  • If in GUI mode, switch to full console mode – no dump info is passed to the GUI (not even to GUI shell).
  • Make sure screen stays on during full test run – if a screen saver kicks in, the screen won’t return after a kernel panic. Use these settings to ensure the screen stays on.
    setterm -blank 0
    setterm -powerdown 0
    setvesablank off
  • From console, copy dump from screen (see above).

2.4 Hard kernel panic – Troubleshooting with a full trace

The stack trace is the most important piece of information to use in troubleshooting a kernel panic. It is often crucial to have a full stack trace, something that may not be available if only a screen dump is provided – the top of the stack may scroll off the screen, leaving only a partial stack trace. If a full trace is available, it is usually sufficient to isolate root cause. To identify whether or not you have a large enough stack trace, look for a line with EIP, which will show what function call and module caused the panic. In the example below, this is shown in the following line:
EIP is at _dlgn_setevmask [streams-dlgnDriver] 0xe

If the culprit is a Dialogic driver, you will see a module name of the form:
streams-xxxxDriver
where xxxx is either dlgn, dvbm, mercd or similar.
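Once a trace has been saved to a file, a short grep sketch like the one below pulls out the key lines; the sample file simply reuses lines from the trace shown in this article:

```shell
# Build a sample trace file from lines like those shown in this article.
cat > /tmp/panic-trace.txt <<'EOF'
EIP is at _dlgn_setevmask [streams-dlgnDriver] 0xe
[] dlgnwput [streams-dlgnDriver] 0xe8
[] lis_safe_putnext [streams] 0x169
EOF

# The EIP line names the faulting function and module.
grep 'EIP is at' /tmp/panic-trace.txt

# List any Dialogic STREAMS driver modules that appear in the trace.
grep -o 'streams-[a-z]*Driver' /tmp/panic-trace.txt | sort -u
```

On a real system, point the greps at your saved trace file instead of the sample.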

Full trace example:
Unable to handle kernel NULL pointer dereference at virtual address 0000000c
printing eip:
f89e568b
*pde = 32859002
*pte = 00000000
Oops: 0000
Kernel 2.4.9-31enterprise
CPU: 1
EIP: 0010:[] Tainted: PF
EFLAGS: 00010096
EIP is at _dlgn_setevmask [streams-dlgnDriver] 0xe
eax: 00000000 ebx: f65f5410 ecx: f5e17710 edx: f65f5410
esi: 00001ea0 edi: f5e23c30 ebp: f65f5410 esp: f1cf7e78
ds: 0018 es: 0018 ss: 0018
Process pwcallmgr (pid: 10334, stackpage=f1cg7000)
Stack: 00000000 c01067fa 00000086 f1cf7ec0 00001ea0 f5e23c30 f65f5410 f89e53ec
f89fcd60 f5e16710 f65f5410 f65f5410 f8a54420 f1cf7ec0 f8a4d73a 0000139e
f5e16710 f89fcd60 00000086 f5e16710 f5e16754 f65f5410 0000034a f894e648
Call Trace: [setup_sigcontext+218/288] setup_sigcontext [kernel] 0xda
Call Trace: [] setup_sigcontext [kernel] 0xda
[] dlgnwput [streams-dlgnDriver] 0xe8
[] Sm_Handle [streams-dlgnDriver] 0x1ea0
[] intdrv_lock [streams-dlgnDriver] 0x0
[] Gn_Maxpm [streams-dlgnDriver] 0x8ba
[] Sm_Handle [streams-dlgnDriver] 0x1ea0
[] lis_safe_putnext [streams] 0x169
[] __insmod_streams-dvbmDriver_S.bss_L117376 [streams-dvbmDriver] 0xab8
[] dvbmwput [streams-dvbmDriver] 0x6f5
[] dvwinit [streams-dvbmDriver] 0x2c0
[] lis_safe_putnext [streams] 0x168
[] lis_strputpmsg [streams] 0x54c
[] __insmod_streams_S.rodata_L35552 [streams] 0x182e
[] sys_putpmsg [streams] 0x6f
[system_call+51/56] system_call [kernel] 0x33
[] system_call [kernel] 0x33
Jul 12 15:07:18 talus kernel:
Jul 12 15:07:18 talus kernel:
Code: 8b 70 0c 8b 06 83 f8 20 8b 54 24 20 8b 6c 24 24 76 1c 89 5c

2.5 Hard kernel panic – Troubleshooting without a full trace

If only a partial stack trace is available, it can be tricky to isolate the root cause, since there is no explicit information about which module or function call caused the panic. Instead, only commands leading up to the final command will be seen in a partial stack trace. In this case, it is very important to collect as much information as possible about what happened leading up to the kernel panic (application logs, library traces, steps to reproduce, etc.).

Partial trace example (note there is no line with EIP information):
[] ip_rcv [kernel] 0x357
[] sramintr [streams_dlgnDriver] 0x32d
[] lis_spin_lock_irqsave_fcn [streams] 0x7d
[] inthw_lock [streams_dlgnDriver] 0x1c
[] pwswtbl [streams_dlgnDriver] 0x0
[] dlgnintr [streams_dlgnDriver] 0x4b
[] Gn_Maxpm [streams_dlgnDriver] 0x7ae
[] __run_timers [kernel] 0xd1
[] handle_IRQ_event [kernel] 0x5e
[] do_IRQ [kernel] 0xa4
[] default_idle [kernel] 0x0
[] default_idle [kernel] 0x0
[] call_do_IRQ [kernel] 0x5
[] default_idle [kernel] 0x0
[] default_idle [kernel] 0x0
[] default_idle [kernel] 0x2d
[] cpu_idle [kernel] 0x2d
[] __call_console_drivers [kernel] 0x4b
[] call_console_drivers [kernel] 0xeb
Code: 8b 50 0c 85 d2 74 31 f6 42 0a 02 74 04 89 44 24 08 31 f6 0f
<0> Kernel panic: Aiee, killing interrupt handler!
In interrupt handler – not syncing

If only a partial trace is available and the supporting information is not sufficient to isolate root cause, it may be useful to use KDB. KDB (Kernel Debugger) is a tool that is compiled into the kernel that causes the kernel to break into a shell rather than lock up when a panic occurs. This enables you to collect additional information about the panic, which is often useful in determining root cause.

3. Troubleshooting Soft Kernel Panic

3.1 Symptoms of a soft kernel panic

  • Much less severe than hard panic.
  • Usually results in a segmentation fault
  • Can see an oops message – search /var/log/messages for string Oops (see the full trace above)
  • Machine still somewhat usable (but should be rebooted after information is collected); machine might be really slow/unresponsive, but still operable
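Searching the log for the oops is a one-line grep; the sketch below demonstrates on a small sample file, since on a real system the log lives at `/var/log/messages`:

```shell
# Build a small sample standing in for the real /var/log/messages.
cat > /tmp/messages.sample <<'EOF'
Jul 12 15:07:17 talus kernel: Unable to handle kernel NULL pointer dereference
Jul 12 15:07:18 talus kernel: Oops: 0000
Jul 12 15:07:18 talus kernel: EIP is at _dlgn_setevmask [streams-dlgnDriver] 0xe
EOF

# Show the oops line plus one line of context before and after
# (on a real system, point this at /var/log/messages instead).
grep -B 1 -A 1 'Oops' /tmp/messages.sample
```

Widen the `-A` context (e.g. `-A 30`) to capture the full stack trace that follows the oops line.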

3.2 Soft kernel panic signature causes

Almost anything that causes a module to crash when it is not within an interrupt handler can cause a soft panic. In this case, the driver itself will crash, but will not cause catastrophic system failure since it was not locked in the interrupt handler. The same possible causes exist for soft panics as do for hard panics (i.e. accessing a null pointer during runtime).

3.3 Soft kernel panic – information to collect

When a soft panic occurs, the kernel will generate a dump that contains kernel symbols – this information is logged in /var/log/messages. To begin troubleshooting, use the ksymoops utility to turn kernel symbols into meaningful data. To generate a ksymoops file:

  1. Create new file from text of stack trace found in /var/log/messages. Make sure to strip off timestamps, otherwise ksymoops will fail
  2. Run ksymoops on new stack trace file:
    Generic: ksymoops -o [location of Dialogic drivers] filename
    Example: ksymoops -o /lib/modules/2.4.18-5/misc ksymoops.log
    All other defaults should work fine
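Step 1 (stripping timestamps) can be done with sed. The prefix pattern below assumes the usual `Mon DD HH:MM:SS host kernel: ` syslog format; adjust it if your log lines differ. The sketch demonstrates on a sample file:

```shell
# Sample raw syslog lines, as copied from /var/log/messages.
cat > /tmp/oops.raw <<'EOF'
Jul 12 15:07:18 talus kernel: Oops: 0000
Jul 12 15:07:18 talus kernel: EIP is at _dlgn_setevmask [streams-dlgnDriver] 0xe
EOF

# Strip the "Mon DD HH:MM:SS host kernel: " prefix so ksymoops can parse it.
sed 's/^[A-Z][a-z]\{2\} [ 0-9][0-9] [0-9:]\{8\} [^ ]* kernel: //' \
    /tmp/oops.raw > /tmp/oops.txt

cat /tmp/oops.txt
```

The cleaned file (`/tmp/oops.txt`) is then ready to feed to ksymoops as in the example above.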

For complete details, see the ksymoops manual page (`man ksymoops`).
