Introducing Memfd Create and Anonymous Files for the Nanos Unikernel

We recently introduced support for the memfd_create syscall, which allows one to create anonymous files. Anonymous files differ from something like a temporary file in that they live in RAM rather than being stored on disk. As you can imagine, this opens up a handful of use-cases; however, it is worth exploring why Nanos didn't have this to begin with.

One of the main reasons Nanos had not supported this to date is that Nanos doesn't do multiple processes and thus has no need to share memory between them. The multi-process software architecture was highly prevalent in the 90s because commodity SMP-enabled machines were not available outside of a few select high-end options. Performant threading was also not really an option. In fact, if you look at a lot of the literature or even mailing list chatter from that period, many people would talk down threading because of its atrocious performance at the time.

Things have thankfully changed in the past twenty to thirty years though. Nowadays performance-oriented applications default to threads and multi-process design is heavily shunned. The amount of RAM being used in certain applications is really starting to drive a nail in this coffin. What is interesting, however, is that the lineage of that model is still kicking in a different form - go pop open any random docker-compose.yml on GitHub.

The Nanos model encourages the use of threads, which all share the same heap by default. This is not the first time we've added support for something we originally didn't think we would.

Why is this relevant? Shared memory was designed to let separate processes in the multi-process model exchange data efficiently, so a single-process system had little obvious need for it.

Adding memfd_create support ended up producing two new klibs: shmem and tmpfs.

The original use case was brought up in this ticket, which included the following code:

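#include <iostream>
#include <cassert>
#include <unistd.h>
#include <sys/mman.h>

int main()
{
    // Buffer size; a multiple of the page size so it can be mapped.
    size_t capacity = getpagesize() * 2;

    // Create an anonymous, memory-backed file and size it.
    int fd = memfd_create("queue_buffer", 0);
    ftruncate(fd, capacity);

    // Reserve 2 * capacity of contiguous address space, then map the
    // same file into both halves of that reservation.
    char* data = static_cast<char*>(mmap(NULL, 2 * capacity, PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0));
    mmap(data, capacity, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_FIXED, fd, 0);
    mmap(data + capacity, capacity, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_FIXED, fd, 0);

    // A write through the first mapping is visible through the second.
    data[0] = 'a';
    assert(data[capacity] == 'a');
}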

The original user had a ring buffer that used mmap. One benefit of using an anonymous file is that, instead of creating temporary files under /tmp (for example with O_TMPFILE), your data can just go into memory, which is typically a lot faster than going to disk. Of course malware authors have found a use-case for this as well. 'Fileless malware' refers to malware that doesn't touch the filesystem. For instance, after exploiting a flaw an attacker could download a new binary and execute it from memory without having to write it to disk, which might trigger a defender's scanning solution and show up on a SIEM. (Obviously in Nanos we don't do fork/exec.)
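To make the ring buffer motivation concrete, here is a minimal sketch of why the double mapping is handy: a message that runs past the end of the buffer can be written with a single memcpy, and the bytes that spill into the second mapping show up at the start of the first. The helper names map_mirrored and wrap_demo are hypothetical and not taken from the ticket; the sketch assumes a Linux-compatible memfd_create and mmap.

```cpp
#include <cstddef>
#include <cstring>
#include <string>
#include <sys/mman.h>
#include <unistd.h>

// Hypothetical helper: maps one anonymous file of `capacity` bytes twice,
// back to back. `capacity` must be a multiple of the page size.
// Returns nullptr on failure.
char* map_mirrored(size_t capacity)
{
    int fd = memfd_create("queue_buffer", 0);
    if (fd < 0) return nullptr;
    if (ftruncate(fd, capacity) != 0) { close(fd); return nullptr; }

    // Reserve 2 * capacity of contiguous address space.
    void* base = mmap(nullptr, 2 * capacity, PROT_NONE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED) { close(fd); return nullptr; }
    char* data = static_cast<char*>(base);

    // Map the same file into both halves of the reservation.
    void* lo = mmap(data, capacity, PROT_READ | PROT_WRITE,
                    MAP_SHARED | MAP_FIXED, fd, 0);
    void* hi = mmap(data + capacity, capacity, PROT_READ | PROT_WRITE,
                    MAP_SHARED | MAP_FIXED, fd, 0);
    close(fd); // the mappings keep the anonymous file alive

    if (lo == MAP_FAILED || hi == MAP_FAILED) {
        munmap(base, 2 * capacity);
        return nullptr;
    }
    return data;
}

// Hypothetical demo: writes 5 bytes starting 2 bytes before the end of the
// buffer, then reads them back using wrapped (modulo) offsets.
std::string wrap_demo()
{
    size_t capacity = getpagesize();
    char* data = map_mirrored(capacity);
    if (!data) return "";

    size_t head = capacity - 2;
    // One contiguous copy; no need to split it at the buffer boundary.
    std::memcpy(data + head, "hello", 5);

    std::string out;
    for (size_t i = 0; i < 5; i++)
        out += data[(head + i) % capacity];

    munmap(data, 2 * capacity);
    return out;
}
```

Calling wrap_demo() should return the string "hello": two bytes come from the end of the first mapping and three from its start, where the spill-over into the second mapping landed.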

One of the reasons memfd_create actually exists is file sealing. For example, you can write to a file created with memfd_create, then seal it to ensure no more writes can occur. Keep in mind this work grew out of the need for those using shared memory (i.e. memory shared among different processes) to ensure that another, completely untrusted process won't modify the contents. We're not even talking about a hacker messing with things - we're just talking about a program that might've accidentally overwritten a shared piece of memory, or shrunk it, or done something else. Now you might understand why some think this is not such a great programming model.

Having said that, file sealing is still an interesting use-case.
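As a rough illustration of sealing, the sketch below follows the Linux interface: the file is created with MFD_ALLOW_SEALING, written once, sealed with fcntl(F_ADD_SEALS), and a later write is expected to fail with EPERM. The function name sealed_write_is_rejected is hypothetical, and this post does not state whether the Nanos shmem klib implements seals, so treat this as a description of Linux behavior rather than a guarantee about Nanos.

```cpp
#include <cerrno>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

// Hypothetical demo: returns true if a write attempted after sealing
// is rejected with EPERM (the behavior documented for Linux).
bool sealed_write_is_rejected()
{
    // Sealing must be opted into when the file is created.
    int fd = memfd_create("sealed_buffer", MFD_ALLOW_SEALING);
    if (fd < 0) return false;

    const char msg[] = "config-v1";
    bool ok = write(fd, msg, sizeof msg) == static_cast<ssize_t>(sizeof msg);

    // Forbid further writes, shrinking and growing, and (with F_SEAL_SEAL)
    // forbid adding any more seals.
    ok = ok && fcntl(fd, F_ADD_SEALS,
                     F_SEAL_WRITE | F_SEAL_SHRINK | F_SEAL_GROW | F_SEAL_SEAL) == 0;

    // This write should now fail.
    errno = 0;
    ssize_t n = write(fd, "x", 1);
    bool rejected = ok && n == -1 && errno == EPERM;

    close(fd);
    return rejected;
}
```

Note that on Linux F_SEAL_WRITE cannot be added while a writable shared mapping of the file exists, which is why this sketch seals the file before mapping it anywhere.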

Why 2 New Klibs

To enable the shared memory functionality we needed an underlying tmpfs filesystem. That filesystem is useful for other purposes too, so it was written as a separate klib that does not depend on shmem.

shmem

The shmem klib provides functionality for working with shared memory. This is a feature more commonly found in general purpose multi-process systems such as Linux, through interfaces such as shm_open, shmget, and so forth. Shared memory is typically faster than using a unix socket or a pipe, which in turn is usually faster than setting up a TCP/IP connection between the two processes.

However, if you've been following along, unikernel architecture doesn't do multiple processes, so there was little reason to add support for it until now. As the prior example showed, though, there are indeed use-cases even when you aren't trying to have multiple processes talk to each other.

tmpfs

The shmem klib itself relies on another klib called 'tmpfs'. Tmpfs allows you to create arbitrary filesystems that reside in memory rather than on disk. This gives you a clear performance advantage, with the tradeoff that everything in the filesystem is volatile, meaning if you reboot or crash everything is gone. You'll sometimes see people claim certain benchmarks by relying on tmpfs - we'd just remind you about "web scale" benchmarking.

The other limitation to take into account is that whatever you want in the filesystem needs to fit into RAM. If you're using some fancy new Python AI framework and you want it on the tmpfs filesystem, be aware that many of those can require a gig or more of disk (or in this case RAM).

To use this new functionality all you need to do is add the klibs to your config:

{
  "Klibs": ["shmem", "tmpfs"]
}

The addition of not just one but two new klibs opens up a lot of terrain for you to create new systems with new capabilities. Happy hacking.
