User Mode Concurrency Groups - Faster than Futexes

We took Google's UMCG (user mode concurrency groups) implementation (which is a part of Google Fibers - no, not the ISP) and added it as an optional klib to Nanos. This allows you to build your own user-space schedulers, which opens up a whole new world of highly scalable software. I can see a lot of applications for this, including high performance databases and making various language runtimes a lot smarter. We also think it fits in very well with the Nanos architecture.

This is a feature that has never really been successfully merged into mainline Linux. Linux is used by a lot of different people with a lot of differing use-cases, so sometimes you'll see a lot of frustration on the LKML about how something could or should be used. Suffice it to say, if you are a big enough company you have your own forks and patches, so it doesn't really matter if something gets merged or not. This is a dynamic that unikernel development can play off of, because we have a very explicit set of use-cases and they all involve modern server-side development. We don't do desktop, multiple programs, bare metal or any of that. It's far easier for us to embrace optional features like this. The same dynamic encouraged us to port OpenBSD's pledge and unveil.

UMCG is also not the only user-space scheduling framework to come out of Google -- even though it's very different, take a look at ghOSt.

So What is It?

This feature is basically a lightweight M:N threading model that gives userspace control over scheduling decisions, which is an excellent fit for unikernels: you only ever have one program, but that program can have many threads. In UMCG, 'server' threads are the N kernel threads that make the scheduling decisions and 'worker' threads are the M application threads that get scheduled onto them.

There are a few advantages to this approach. For one, the kernel doesn't necessarily know what an application is up to just like a hypervisor doesn't really know what a guest is up to for scheduling purposes.

It is also easy for us to point at the success of language-specific implementations that are similar in spirit, such as goroutines, Java virtual threads and, to a degree, Erlang processes (though those are even more different).

Having said this, if I were a betting man -- I'd expect to see some form of this land in Linux eventually.

A key insight behind UMCG is something a lot of people get hung up on: it's easy to assume that switching into the kernel and back is the slow part, but in reality waiting for the kernel to schedule the next thread can be just as slow.

How Are Things Done Today?

Today Linux uses futexes, and they're not really something most developers ever find themselves working with directly. In fact, if you are using plain pthreads you are more than likely using futexes already - they've been in the kernel for quite a while. You can verify that with this simple code sample:

#include <pthread.h>
#include <stdio.h>

void *thread(void *ptr) {
    long type = (long)ptr;
    fprintf(stderr, "Thread - %ld\n", type);
    return ptr;
}

int main(int argc, char **argv) {
    pthread_t thread1, thread2;
    long thr = 1;
    long thr2 = 2;

    pthread_create(&thread1, NULL, thread, (void *)thr);
    pthread_create(&thread2, NULL, thread, (void *)thr2);

    pthread_join(thread1, NULL);
    pthread_join(thread2, NULL);
    return 0;
}

If we strace it we can see futexes in use, both via set_robust_list and the explicit call to futex:

arch_prctl(ARCH_SET_FS, 0x7f3d1d431740) = 0
set_tid_address(0x7f3d1d431a10)         = 3217574
set_robust_list(0x7f3d1d431a20, 24)     = 0
rseq(0x7f3d1d432060, 0x20, 0, 0x53053053) = 0
mprotect(0x7f3d1d3f2000, 16384, PROT_READ) = 0
...*snip*...
futex(0x7f44407fe990, FUTEX_WAIT_BITSET|FUTEX_CLOCK_REALTIME, 3340930, NULL, FUTEX_BITSET_MATCH_ANY) = 0

I'll let Ingo describe robust futexes in more detail. A mutex sounds like a relatively straightforward concept at first blush, but they get really complicated really fast: you could have a recursive mutex, one that avoids priority inversion through inheritance or protection protocols, or a fair versus unfair one. Pthreads alone gives you a handful of mutex types to choose from.
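For example, picking a recursive mutex is just a matter of setting a mutex attribute before initialization. A minimal sketch using the standard pthread attribute API:

#include <pthread.h>

/* Select a mutex type up front via an attribute object; PTHREAD_MUTEX_RECURSIVE
   lets the owning thread re-lock the mutex without deadlocking on itself. */
pthread_mutex_t lock;

void init_recursive_lock(void) {
    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);
    pthread_mutex_init(&lock, &attr);
    pthread_mutexattr_destroy(&attr);
}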

Let's look at some other languages and operating systems, see what they use, and consider whether UMCG might make sense in those ecosystems.

NodeJS

Node is interesting because JavaScript is probably one of the most widely used languages today. (Also, when I say Node you can substitute Bun or Deno or any of the other runtimes out there.) Node.js programs (that is, the JavaScript, not the runtime) don't really do true multi-threading because JavaScript itself doesn't have multi-threading support. What happens in JavaScript land if you spin up, say, a 'worker thread' is that it spawns a new V8 isolate. Isolates don't share a heap, so they have to duplicate all the memory, and worker threads rely on message passing for exactly that reason.

Now, I said JavaScript doesn't do multi-threading, but Node is a runtime, so it has multiple threads - it specifically uses V8 and libuv, both of which have access to multi-threading. So while you may not use UMCG inside your Node program itself, I could see someone forking or patching libuv or V8 to add this support.

Rust

Rust, on the other hand - I could definitely see a Rust library being made here. Rust threads also use futexes. You might be asking yourself why I'm pointing out various languages and whether or not they use futexes. Languages naturally have to target many operating systems, which may or may not have the same sort of support, and even when they do, that support might be out of sync with a similar facility on a different system. I'm also comparing and contrasting the facilities different languages have access to by virtue of what kind of language they are.

use std::thread;

const NTHREADS: u32 = 10;

fn main() {
    let mut children = vec![];

    for i in 0..NTHREADS {
        children.push(thread::spawn(move || {
            println!("this is thread number {}", i);
        }));
    }

    for child in children {
        let _ = child.join();
    }
}

Again, we can see the same thing here:

arch_prctl(ARCH_SET_FS, 0x7ff4c3328780) = 0
set_tid_address(0x7ff4c3328a50)         = 3226629
set_robust_list(0x7ff4c3328a60, 24)     = 0
rseq(0x7ff4c33290a0, 0x20, 0, 0x53053053) = 0

and here.

It'd be interesting to see what a rust implementation of this looks like. If anyone ends up hacking something together please let us know - we'd love to see it.

macOS

macOS used to have very slow mutexes with pthreads, but it seems that is now history with ulock_wait and ulock_wake.

However, if you run the same sample pthreads program under ktrace you'll see very different output. In particular we'll see 'BSC_bsdthread_create' and our ulock waits/wakes:

sudo ktrace trace -Ss -f C4,S0x010c -c ./main
2024-02-05 20:41:35.410433 PST          1.8 BSC_bsdthread_create                 100317e9c        2 16fbff000        16fbff000        833f7f             4(AP-P) main(52791)
2024-02-05 20:41:35.410436 PST          2.6(2.6) BSC_bsdthread_create                 0                6fbff000         1 ce37             833f7f             4(AP-P) main(52791)
2024-02-05 20:41:35.410437 PST          0.7 BSC_sys_ulock_wait                   1020002          16fb73034 b03              0                833f7f             4(AP-P) main(52791)

and we can see calls to it in the source.

Coming back to Rust, there are even Rust bindings for ulock.

"Context switching lacks the context."
- Paul Turner

FreeBSD

FreeBSD has _umtx_op with UMTX_OP_WAKE and UMTX_OP_WAIT_UINT.
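For comparison, a wait/wake pair there has much the same shape as a Linux futex call. Here's a rough sketch based on the umtx(2) interface (argument details, especially timeouts, are simplified):

#include <sys/types.h>
#include <sys/umtx.h>

/* Sketch only: block while *(&word) still holds 'expected', and wake one
   waiter from another thread. Timeout arguments are left NULL for brevity. */
static unsigned int word = 0;

void wait_on_word(unsigned long expected) {
    _umtx_op(&word, UMTX_OP_WAIT_UINT, expected, NULL, NULL);
}

void wake_one_waiter(void) {
    _umtx_op(&word, UMTX_OP_WAKE, 1, NULL, NULL);
}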

Our Implementation

Ok, now that we have taken the grand tour of how things work in other ecosystems, let's take a look at how we do UMCG in Nanos (keep in mind we still ship futex by default, so this isn't either/or).

We've ported this implementation based upon the version found at https://www.spinics.net/lists/kernel/msg4740112.html. I'd imagine we'll track any upstreaming that occurs in the future too.

If you want to try this in Nanos now, you can simply include it as a klib and set your nanos-version when using ops.

You can target the umcg runtime test via:

make TARGET=umcg run

If you edit test/runtime/umcg.manifest you can adjust how you invoke the test. (Note: These manifest files are auto-generated by ops on demand and not something you hand edit outside of nanos development.)

We added a new syscall, __NR_umcg_ctl, and call it like so:

syscall(__NR_umcg_ctl, flags, cmd, next_tid, abs_timeout, events, event_sz);
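In the test snippets below this raw syscall is wrapped in a small sys_umcg_ctl helper. Something along these lines works as a sketch (the argument types here are illustrative; __NR_umcg_ctl and the UMCG_* constants come from the patched headers, not stock Linux):

#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Thin wrapper over the raw syscall so call sites read a little cleaner. */
static long sys_umcg_ctl(uint64_t flags, uint64_t cmd, int32_t next_tid,
                         uint64_t abs_timeout, uint64_t *events, int event_sz)
{
    return syscall(__NR_umcg_ctl, flags, cmd, next_tid, abs_timeout,
                   events, event_sz);
}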

Testing

Our test compares a few methods of doing the same very simple thing: we create a worker thread alongside the main thread and measure how many times we can switch from main to the worker and back again per unit of time.

Condition Variables

We first look at condition variables. Condition variables give the developer a way to wait for a state change without constantly polling while holding a mutex. This test uses the pthread_cond_wait family of functions.
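The handoff itself is the usual mutex-plus-condition-variable pattern. A trimmed-down sketch of the worker side (not the exact test source) looks like this:

#include <pthread.h>
#include <stdbool.h>

/* Sketch of the ping-pong: a shared 'turn' flag guarded by a mutex, with each
   side signalling the other and then waiting for the flag to flip back. */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
static bool worker_turn = false;

static void *worker(void *arg) {
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&lock);
        while (!worker_turn)
            pthread_cond_wait(&cond, &lock);
        worker_turn = false;          /* hand the turn back to main */
        pthread_cond_signal(&cond);
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

/* main() does the mirror image: set worker_turn, signal, wait for it to clear,
   and count one cycle per round trip until the timer expires. */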

Futexes

Next we look at futexes, or 'fast user-space mutexes'. A futex lets a thread wait for the value at an address to change and lets another thread wake up anyone waiting on that address. This is pretty much what everything uses nowadays. We call the futex syscall directly in our test, although most of the applications you use will have pthreads calling it for you.

syscall(SYS_futex, word_ptr, FUTEX_WAIT | FUTEX_PRIVATE_FLAG, 0, 0, NULL, 0);
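The matching call on the other side is the same syscall with FUTEX_WAKE, for example waking a single waiter:

/* Wake up to one thread currently waiting on word_ptr. */
syscall(SYS_futex, word_ptr, FUTEX_WAKE | FUTEX_PRIVATE_FLAG, 1, NULL, NULL, 0);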

UMCG

Now we get to the meat and potatoes. For UMCG we register a server via the umcg_ctl syscall:

sys_umcg_ctl(0, UMCG_REGISTER_SERVER, 0, 0, NULL, 0);

Then we register a worker, again with the same syscall:

sys_umcg_ctl(0, UMCG_REGISTER_WORKER, 0, syscall(SYS_gettid) << UMCG_WORKER_ID_SHIFT, NULL, 0);

We can wait for any worker thread:

sys_umcg_ctl(0, UMCG_WAIT, 0, 0, NULL, 0);

Finally after the worker thread is finished we can unregister it:

sys_umcg_ctl(0, UMCG_UNREGISTER, 0, 0, NULL, 0);
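Strung together, a minimal server/worker pair looks roughly like this. This is only a sketch built from the calls above - error handling, the actual work, and the finer points of the wait/wake dance are omitted:

#include <pthread.h>
#include <sys/syscall.h>
#include <unistd.h>

static void *worker_thread(void *arg) {
    (void)arg;
    /* Register this thread as a UMCG worker, tagged with its tid. */
    sys_umcg_ctl(0, UMCG_REGISTER_WORKER, 0,
                 syscall(SYS_gettid) << UMCG_WORKER_ID_SHIFT, NULL, 0);

    /* ... do the actual work here ... */

    /* Unregister once the worker is finished. */
    sys_umcg_ctl(0, UMCG_UNREGISTER, 0, 0, NULL, 0);
    return NULL;
}

int main(void) {
    /* The main thread acts as the server making the scheduling decisions. */
    sys_umcg_ctl(0, UMCG_REGISTER_SERVER, 0, 0, NULL, 0);

    pthread_t t;
    pthread_create(&t, NULL, worker_thread, NULL);

    /* Block until a worker needs servicing. */
    sys_umcg_ctl(0, UMCG_WAIT, 0, 0, NULL, 0);

    pthread_join(t, NULL);
    return 0;
}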

As you can see it's a fairly straightforward interface. Now let's take a look at the results:

Condition Variable Results

Condition variable performance test duration: 10 seconds...
Results: 4063380 cycles (406338 cycles/sec)

Futex Results

Futex performance test duration: 10 seconds...
Results: 4491598 cycles (449160 cycles/sec)

UMCG Results

UMCG performance test duration: 10 seconds...
Results: 6748530 cycles (674853 cycles/sec)

The results, while preliminary, are rather impressive, and we think this unlocks all sorts of interesting applications, especially given the architecture we are putting it to work in. (For anyone wishing to replicate this at home, keep in mind these numbers come from a 1 vCPU virtual machine on KVM, not from your bare-metal laptop.) If you are looking to build a fibers-like system there is more work to do here, but I hope this inspires you to explore further and start hacking around. What will you build?
