Dumping Null Terminated Strings

We recently made a very large change in Nanos: we jettisoned every null-terminated string that we could. This had been on our long list of known issues for a while, and we finally bit the bullet and started the transition.

A lot of people will bring up "attack surface reduction" and point at KLOCs as a "this is why unikernels are secure" argument but that's a pretty weak argument. Hackers have about a bajillion different methods to screw with your code. Yes, lines of code are important but so are attack vectors - arguably way more so. Going from 20 kilos of plutonium-239 down to 5 is great but you still have 5 kilos of weapons grade plutonium. What is better is saying we killed off 100% of the plutonium, got rid of 50% of the tomahawks and 80% of javelins.

Kernels are different from normal user programs in that they are constantly having to deal with raw memory. Until someone designs a new type of computer this is just reality. Some of it is a lot easier to deal with than others. For instance drivers tend to be a lot more wild west. Thus, validation of buffers is an important concept when it comes to security. While we'll touch on others in this article we're really going to be talking about strings today.

"I note with fear and horror that even in 1980, language designers and users have not learned this lesson."
- C. A. R. Hoare, on the Algol 60 compiler he implemented in the early 1960s, which did array bounds checking.

Bounds Checking

If you look at any of Kees Cook's talks, bounds checking is a recurring theme. Improper null termination of strings, a subset of this, is enough of an issue that it's listed as CWE-170 in MITRE's weakness list.

String-handling functions in an operating system are different from those in applications, since applications tend to link against a library of some kind. In an operating system you get to write your own. That matters because the ones defined in various libcs can be quite complicated and much larger than you'd think.

This is also one reason why simply switching to an alt-libc typically comes with tradeoffs in performance, security and portability.

So which string functions are 'bad'? Well, one can consult Git's list of banned functions, but I think it's more correct to put these on a spectrum of worst to best.

  • strcpy: doesn't check bounds
  • strcat: doesn't check bounds
  • sprintf: doesn't check bounds
  • vsprintf: doesn't check bounds
  • strncpy: doesn't null terminate on overflow
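
To make the last bullet concrete, here's a minimal sketch of the strncpy pitfall next to one bounded alternative. The names copy_bounded, is_terminated and strncpy_leaves_unterminated are made up for illustration; they aren't from Nanos or any libc.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy at most dstsz - 1 bytes and always terminate.
 * Returns the number of bytes copied, not counting the terminator. */
static size_t copy_bounded(char *dst, size_t dstsz, const char *src)
{
    assert(dst != NULL && src != NULL && dstsz > 0);
    size_t n = 0;
    while (n + 1 < dstsz && src[n] != '\0') {
        dst[n] = src[n];
        n++;
    }
    dst[n] = '\0';
    return n;
}

/* Returns 1 if buf[0..sz) contains a terminator, 0 otherwise. */
static int is_terminated(const char *buf, size_t sz)
{
    return memchr(buf, '\0', sz) != NULL;
}

/* strncpy writes exactly sizeof(dst) bytes here; because the source is
 * longer than the buffer, none of them is a terminator. */
static int strncpy_leaves_unterminated(void)
{
    char dst[4];
    strncpy(dst, "overflow", sizeof(dst));
    return !is_terminated(dst, sizeof(dst));
}
```

Any later strlen or printf on that strncpy destination would run off the end of the buffer, which is exactly the class of bug CWE-170 describes.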

Different Types of Strings

There are of course other implementations of strings out there, such as antirez's sds, and what you should use depends on the type of project you're working on.

Null-terminated strings aren't the only string type on the block. Pascal strings prefix the string with its length, and they come with their own advantages and disadvantages.

In fact let's take a quick look at how those work. In this example we have a simple hello world. I've marked it with 'AAAA' instead of 'hello world' so it's easy to find and we can contrast with other examples.

program Hello;
begin
  writeln ('AAAA');
end.

We can compile this with the free pascal compiler (fpc):

fpc hi.pp

Next we can use hexdump to look at what is produced.

hexdump -C hi | more

If we page down we'll find our 'A's as four '41's and the length conveniently placed right before.

00024000  04 41 41 41 41 00 00 00  00 00 00 00 00 00 00 00  |.AAAA...........|

If we double up the string to 8 'A's we can again look at the output and see that we've declared eight of them.

hexdump -C hi | more
00024000  08 41 41 41 41 41 41 41  41 00 00 00 00 00 00 00  |.AAAAAAAA.......|

This has the nice perk of giving us an O(1) length lookup: the size is read straight from the prefix instead of being found by scanning for a terminator.
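
To make that concrete, here is a minimal C sketch contrasting the two lookups. The one-byte length prefix mirrors the hexdump above; pstr_len and cstr_len are names made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Layout assumed here: one length byte followed by that many characters,
 * matching the dump above (04 41 41 41 41). */
static size_t pstr_len(const unsigned char *p)
{
    assert(p != NULL);
    return p[0];            /* O(1): the length is already stored */
}

/* The null-terminated equivalent has to walk the string. */
static size_t cstr_len(const char *s)
{
    assert(s != NULL);
    size_t n = 0;
    while (s[n] != '\0')    /* O(n): scan until the first zero byte */
        n++;
    return n;
}
```

The flip side of a one-byte prefix is that it caps the string at 255 characters, which is one of the tradeoffs of this particular layout.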

How Do Other Languages Do It?

Before we talk about how other languages do this, let's talk about how Linux does it. The short of it is that Linux uses null-terminated strings because it more or less has to: POSIX compatibility demands it. As an example, take a look at the exec family of syscalls. What's interesting about what we are doing is that while we can run lots of software without modifications, we are not limited by the same rules that Linux is. It will never be the year of the Nanos desktop, and we're fine with that. :) Exec, however, isn't alone: even something such as open(2)'s first argument is a path that is null-terminated. Env vars are also null-terminated. Even main's function prototype ... wait for it ... has argv as an array of null-terminated strings:

int main(int argc, char **argv);

So it's not really easy, straightforward or realistic to just snap your fingers and say "begone". We are also far less concerned with running everything Linux runs, and our subset (only virtualized server-side workloads, with no concept of a large interactive userland) simply doesn't need the same constraints. Those exec syscalls? We don't got em and we don't want em. This is one of the most commonly overlooked benefits of unikernel architecture: when you know you are in a VM you get to make new rules, just like all the interpreted languages get to do. That's a different type of VM, but the same principle applies.

There are places that it's really hard to remove these from (look out, device drivers) but that doesn't mean that we can't isolate those and prevent them from showing up elsewhere.

Some of you might be wondering why these exist at all. One of the reasons why null-terminated strings exist is because when C was created memory was a very scarce resource. One termination byte is less than four bytes to store an integer representing the size of a string. There have been a lot of languages made since then - what do they do? Let's take a look at two popular languages in use today.

If we take a look at Go's slice type you'll see an eerie resemblance to the length-carrying strings we started with:

type slice struct {
  array unsafe.Pointer
  len   int
  cap   int
}

Rust works more or less the same way, but if you look at the source for str you might be scratching your head asking where the definition is, since you'll only find the implementation. It's actually defined as a 'primitive type' in the HIR (high-level intermediate representation), essentially a representation of the AST that is generated by the parser.
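
For comparison, the same pointer-plus-length idea can be sketched in C. This is only an illustration: view and view_sub are hypothetical names, not part of Go, Rust or Nanos, and the capacity field is left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical pointer-plus-length view, loosely modeled on the slice
 * header above (without cap). */
typedef struct view {
    const char *ptr;
    size_t len;
} view;

/* Take the sub-range [start, end) without copying any bytes.
 * Out-of-range requests yield an empty view instead of a wild pointer. */
static view view_sub(view v, size_t start, size_t end)
{
    assert(v.ptr != NULL || v.len == 0);
    view out = { NULL, 0 };
    if (start > end || end > v.len)
        return out;
    out.ptr = v.ptr + start;
    out.len = end - start;
    return out;
}
```

Because the length travels with the pointer, a substring is just a new (ptr, len) pair into the same memory. With null termination you'd have to copy the bytes or overwrite one with a zero.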

Our Changes

This brings us to our current (but by no means finished) implementation. How do we do things like strlen in Nanos then, if we aren't using null terminators?

For reference - we used to do something like this:

int runtime_strlen(const char *a)
{
    int i = 0;
    for (; *a; a++, i++);
    return i;
}

This is of course fairly basic and not so hot: it walks the whole string on every call and trusts that a terminator is actually there. Now we just read the length field, such as:

str.len

We still need to deal with null-terminated strings, or to be more precise C string literals, so we convert them at compile time into sstring types by using the ss() macro. We have defined a few new functions and macros:

  • sstring
  • isstring
  • ss
  • ss_static_init
  • sstring_from_cstring

We have our base struct, which contains a length and a pointer to the string's bytes.

typedef struct sstring {
    bytes len;
    char *ptr;
} sstring;
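
Carrying the length around makes some operations cheaper as well as safer. Here's a minimal sketch of an equality check. Note that sstring_equal is a hypothetical helper for illustration, not necessarily what Nanos ships, and bytes is assumed to be an unsigned integer type.

```c
#include <assert.h>
#include <string.h>

typedef unsigned long bytes;    /* assumption: stand-in for the Nanos bytes type */

typedef struct sstring {
    bytes len;
    char *ptr;
} sstring;

/* Hypothetical helper: compare lengths first, then contents. */
static int sstring_equal(sstring a, sstring b)
{
    assert(a.len == 0 || a.ptr != NULL);
    assert(b.len == 0 || b.ptr != NULL);
    if (a.len != b.len)
        return 0;               /* O(1) rejection, nothing is scanned */
    if (a.len == 0)
        return 1;               /* two empty strings; don't touch the pointers */
    return memcmp(a.ptr, b.ptr, a.len) == 0;
}
```

Neither buffer needs a terminator for this to work, since memcmp is told exactly how many bytes to look at.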

We use the ss macro to transform string literals at compile time:

#define ss(x)  ({               \
    assert_string_literal(x);   \
    sstring s = {               \
        .len = sizeof(x) - 1,   \
        .ptr = x,               \
    };                          \
    s;                          \
})
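
Here's a hedged, self-contained sketch of how that macro might be used. The assert_string_literal definition below is an assumed stand-in (the real one isn't shown here), bytes is assumed to be an unsigned integer type, and greeting_len is a made-up example function. The ({ ... }) form is a GNU C statement expression, so this needs GCC or Clang.

```c
#include <assert.h>

typedef unsigned long bytes;    /* assumption: stand-in for the Nanos bytes type */

typedef struct sstring {
    bytes len;
    char *ptr;
} sstring;

/* Assumed stand-in: adjacent-literal concatenation only compiles when x
 * really is a string literal, so ss(some_pointer) is rejected. */
#define assert_string_literal(x)    ((void)("" x ""))

/* Same shape as the macro above. */
#define ss(x)  ({               \
    assert_string_literal(x);   \
    sstring s = {               \
        .len = sizeof(x) - 1,   \
        .ptr = x,               \
    };                          \
    s;                          \
})

/* Hypothetical usage: the length is known without ever calling strlen. */
static bytes greeting_len(void)
{
    sstring g = ss("hello");
    /* The literal's terminator is still in memory, it just isn't counted. */
    assert(g.ptr[g.len] == '\0');
    return g.len;
}
```

Since sizeof is evaluated by the compiler, the length costs nothing at runtime, and the literal check keeps the macro from being handed a plain char pointer, where sizeof would give the pointer size instead.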

For static initializers we can create new strings with the ss_static_init macro:

#define ss_static_init(x)   {   \
    .len = sizeof(x) - 1,       \
    .ptr = string_literal(x),   \
}

Finally, sstring_from_cstring gives us a way of converting C strings into sstrings, scanning at most maxlen bytes for the terminator:

static inline sstring sstring_from_cstring(const char *cstring, bytes maxlen)
{
    sstring s = {
        .ptr = (char *)cstring,
    };
    bytes len;
    for (len = 0; len < maxlen; len++)
        if (cstring[len] == '\0')
            break;
    s.len = len;
    return s;
}
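
As a hedged usage sketch, here's what the maxlen bound buys us when the input turns out not to be terminated. The function is repeated so the example stands alone; bytes is assumed to be an unsigned integer type and name_field_len is a made-up example.

```c
#include <assert.h>

typedef unsigned long bytes;    /* assumption: stand-in for the Nanos bytes type */

typedef struct sstring {
    bytes len;
    char *ptr;
} sstring;

static inline sstring sstring_from_cstring(const char *cstring, bytes maxlen)
{
    sstring s = {
        .ptr = (char *)cstring,
    };
    bytes len;
    for (len = 0; len < maxlen; len++)
        if (cstring[len] == '\0')
            break;
    s.len = len;
    return s;
}

/* Hypothetical example: a fixed-size field that is completely full,
 * so there is no terminator anywhere inside it. */
static bytes name_field_len(void)
{
    char field[4] = { 'e', 't', 'h', '0' };
    sstring s = sstring_from_cstring(field, sizeof(field));
    /* The scan stops at maxlen instead of running past the buffer. */
    assert(s.len <= sizeof(field));
    return s.len;
}
```

A plain strlen on that field would keep reading into whatever happens to follow it in memory.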

When you make a (multi-thousand line) change like this, it also means that you might need to change other dependencies that you have, such as our lwIP fork (which has a ton of other changes in it as well). lwIP has a handful of string functions of its own that needed converting, alongside minor signature changes such as this:

-ip6addr_aton(const char *cp, ip6_addr_t *addr)
+ip6addr_aton(sstring cp, ip6_addr_t *addr)

and fixing various loops such as this:

-  for (s = cp; *s != 0; s++) {
+  for (size_t i = 0; i < cp.len; i++) {
+    const char *s = cp.ptr + i;
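
Written out in full, a loop of that shape might look like the sketch below. count_char is a hypothetical function made up for illustration (it isn't taken from lwIP), and bytes is assumed to be an unsigned integer type.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned long bytes;    /* assumption: stand-in for the Nanos bytes type */

typedef struct sstring {
    bytes len;
    char *ptr;
} sstring;

/* Count occurrences of c, bounded by cp.len rather than by a terminator. */
static bytes count_char(sstring cp, char c)
{
    assert(cp.len == 0 || cp.ptr != NULL);
    bytes n = 0;
    for (bytes i = 0; i < cp.len; i++) {
        const char *s = cp.ptr + i;
        if (*s == c)
            n++;
    }
    return n;
}
```

Keeping a local s inside the loop means the rest of a converted loop body can stay as it was, which keeps diffs against upstream small.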

You might run into another issue such as this macro where the # operator converts x into a string literal.

#define HANDLER(x) x, #x

We now convert it via ss_static_init:

#define HANDLER(x) x, ss_static_init(#x)
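
Here's a minimal sketch of how such a macro could populate a static table. Everything besides HANDLER and ss_static_init is assumed for illustration: the string_literal definition is a stand-in, bytes is assumed to be an unsigned integer type, and handler_entry, do_open, do_close and handler_name are made-up names.

```c
#include <assert.h>

typedef unsigned long bytes;    /* assumption: stand-in for the Nanos bytes type */

typedef struct sstring {
    bytes len;
    char *ptr;
} sstring;

/* Assumed stand-in: only compiles when x is a string literal. */
#define string_literal(x)   "" x ""

#define ss_static_init(x)   {   \
    .len = sizeof(x) - 1,       \
    .ptr = string_literal(x),   \
}

#define HANDLER(x) x, ss_static_init(#x)

/* Hypothetical table pairing each function with its own name. */
struct handler_entry {
    int (*fn)(void);
    sstring name;
};

static int do_open(void)  { return 1; }
static int do_close(void) { return 2; }

static const struct handler_entry handlers[] = {
    { HANDLER(do_open) },
    { HANDLER(do_close) },
};

/* Look up a handler's name by index; the length comes from the table. */
static sstring handler_name(unsigned i)
{
    assert(i < sizeof(handlers) / sizeof(handlers[0]));
    return handlers[i].name;
}
```

A brace initializer is needed here because a statement-expression macro like ss() can't appear in a file-scope initializer in GNU C.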

Some people wrongly believe that a unikernel is not an operating system. It most definitely is. The architecture allows us to fix issues that are not so easy to fix in other systems, primarily because of legacy support. Expect more pushes from us in this direction as more of our customers see the inherent benefits of moving and retrofitting their existing virtualized workloads into newer systems.
