Debugging Nanos Unikernels with GDB and OPS

Some people think you can't debug unikernels. Tell that to the team of kernel engineers that routinely have to debug unikernels, sometimes without source code, sometimes without access to the system in question.

Today we are going to show you a newish feature we rolled into OPS that makes debugging easier than it used to be.

As software engineers we routinely run into bugs or places where software does not work as expected. Indeed, there is a whole slew of companies that specialize in just collecting, parsing and presenting bugs so product owners can come back and triage them. The first thing you should do when trying to diagnose a bug in your unikernel is to figure out where it is happening. Is it a bug in OPS? Is it a bug in Nanos? Is it a bug in your actual application?

If you can't boot your unikernel and you aren't getting any dumps there's a high likelihood that the bug is in OPS. However, let's say you are getting a dump. How do you tell if it's in Nanos or your application?

You can start by turning on the '--trace' flag. When the unikernel dumps, it will tell you whether the fault is in Nanos or in your program.

general protection fault in user mode, rip 0x13783afe7

Clearly in this example we see a GPF in user mode. Great! We now know it's a problem in our own program. But what if the program works normally on Linux and only crashes inside Nanos? Then let's debug inside! Let's walk through a quick little example where we inject a segfault:

#include <stdio.h>
#include <stdlib.h>

void mybad() {
  int x = 1;
  char *stuff = "asdf";

  printf("about to die\n");
  *(int*)0 = 0;
}

int main(void) {
  mybad();
  printf("should not get here\n");

  return 0;
}

We compile the example with debugging symbols on and link it statically. Nanos fully supports dynamic linking but static linking makes this contrived example easier to show:

cc main.c -static -g -o main

Now let's run ops with the gdb listener enabled. By default Nanos randomizes the location of .text and other parts of your program, just like Linux, so behind the scenes this reloads your unikernel with ASLR turned off. It also disables hardware acceleration to remove any optimization weirdness. Then it stops qemu and injects a hook for gdb.

eyberg@box:~/segfault$ ops run -d main
booting /home/eyberg/.ops/images/main.img ...
You have disabled hardware acceleration

Waiting for gdb connection. Connect to qemu through "(gdb) target remote localhost:1234"
See further instructions in https://nanovms.gitbook.io/ops/debugging

If you've already been using ops for a while you'll notice we've recently moved the tracing flag to '--trace'; '-d' is now used for starting the gdb listener.

Now let's go to another window and start gdb, pointing it at whatever kernel image you are using:

eyberg@box:~$ gdb ~/.ops/0.1.30/kernel.img
GNU gdb (Ubuntu 8.1-0ubuntu3.2) 8.1.0.20180409-git
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /home/eyberg/.ops/0.1.30/kernel.img...done.

Next let's target the remote:

(gdb) target remote localhost:1234
Remote debugging using localhost:1234
0x000000000000fff0 in ?? ()

Then let's load the symbols for our program:

(gdb) symbol-file ~/segfault/main
Load new symbol table from "~/segfault/main"? (y or n) y
Reading symbols from ~/segfault/main...done.

With the symbols loaded, gdb can now show us the source code:

(gdb) list
1       #include <stdio.h>
2       #include <stdlib.h>
3
4       void mybad() {
5         int x = 1;
6         char *stuff = "asdf";
7
8         printf("about to die\n");
9         *(int*)0 = 0;
10      }

This of course allows us to figure out where to put breakpoints or watchpoints.

We already know that our program is blowing up somewhere in the 'mybad' function. Let's put a breakpoint on it so whenever we execute it, the debugger will stop and allow us to figure out what is going on:

(gdb) b mybad
Breakpoint 1 at 0x400b5d: file main.c, line 4.

If we continue, you can see the program starts up, immediately invokes mybad, and hits our breakpoint.

(gdb) c
Continuing.

Breakpoint 1, mybad () at main.c:4
4       void mybad() {

Now we can single step each line of code and print out variable values and such:

(gdb) s
5         int x = 1;
(gdb) s
6         char *stuff = "asdf";
(gdb) s
8         printf("about to die\n");
(gdb) p x
$1 = 1
(gdb) p stuff
$2 = 0x492184 "asdf"

Of course in this example, if we step two more times we find that we are segfaulting on line 9.

(gdb) s
9         *(int*)0 = 0;
(gdb) s
Remote connection closed

Then we'll see our program segfault in the other shell:

assigned: 10.0.2.15
signal 11 received by tid 1, errno 0, code 1
   fault address 0x0
   core dump (unimplemented)
exit status 255

The problem was that line 9 dereferences a null pointer. Your friendly crustaceans are probably screaming right now. :)

Hopefully this helps you further down the road when you hit bugs in your program and need to debug a unikernel. And if anyone says you can't debug unikernels, just point them at this article.
