Segmentation Offload Comes to the Nanos Unikernel

Some people seem to think you can wave a few magic wands and instantly get superior performance. Unfortunately, reality is not that cut and dried. Performance engineering requires extensive measuring, engineering, and re-engineering to achieve results. Don't listen to charlatans who tell you otherwise.

As a quick example, let's say you have a function that copies bytes from src to dst, or even just sets n bytes to a value. A naive implementation might set/copy a single byte at a time, and that little innocuous piece of code could be called everywhere. Not only could it be nested deeply, it could be called a lot. Perhaps it's even used in something like, oh I don't know, memset?

static inline void runtime_memset(u8 *a, u8 b, bytes len)
{
    /* naive: touch one byte per iteration */
    for (bytes i = 0; i < len; i++)
        a[i] = b;
}

This isn't theoretical: years ago we had code that looked exactly like this. However, perhaps you should be setting a word at a time, or handling different alignments? Or maybe you need to zero out a lot of pages? What is the difference in performance? Quite a lot.
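
To give a rough idea of what "a word at a time" means, here is a minimal sketch, not the actual Nanos implementation: it assumes the kernel's fixed-width u64 type and an 8-byte word size, fills aligned words in the middle, and falls back to bytes at the edges.

static inline void memset_words(u8 *a, u8 b, bytes len)
{
    /* replicate the fill byte into every lane of a 64-bit word */
    u64 pattern = 0x0101010101010101ULL * b;

    /* lead-in: fill single bytes until the pointer is 8-byte aligned */
    while (len > 0 && ((unsigned long)a & 7)) {
        *a++ = b;
        len--;
    }

    /* main loop: fill 8 bytes per iteration */
    while (len >= 8) {
        *(u64 *)a = pattern;
        a += 8;
        len -= 8;
    }

    /* tail: fill any remaining bytes */
    while (len > 0) {
        *a++ = b;
        len--;
    }
}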

There is a reason why kernels and operating systems can take years to come to fruition, and this is one of them.

TCP Segmentation

We recently addressed another known performance capability that other systems have had for some time: we added TCP segmentation offload (TSO) to our network stack. What does this mean? Essentially we can shift some of the TCP segmentation work into the NIC (or virtio-net in this case) instead of doing it on the (virtual) CPU, which in our tests gives us a 4X improvement in iperf throughput. On one hand it's amazing that we've gone this far without hitting this or a user/customer asking about it, but on the other hand it's not so surprising. After all, one of our first changes to our networking stack was to add SYN flood protection. The project we forked our TCP/IP stack from a while ago still doesn't have that.
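
To give a flavor of what "offloading to virtio-net" looks like at the driver level: the virtio spec defines a small header that prefixes every transmitted packet, and filling in its GSO fields tells the device to do the segmentation. This is a simplified sketch based on the virtio-net specification, not the actual Nanos driver code; fill_tso_hdr is a hypothetical helper and the offsets assume plain Ethernet/IPv4/TCP with no options.

#define VIRTIO_NET_HDR_F_NEEDS_CSUM 1   /* flag value from the virtio spec     */
#define VIRTIO_NET_HDR_GSO_TCPV4    1   /* gso_type value from the virtio spec */

/* per-packet header prepended to each virtio-net transmit buffer */
struct virtio_net_hdr {
    u8  flags;        /* e.g. VIRTIO_NET_HDR_F_NEEDS_CSUM                  */
    u8  gso_type;     /* e.g. VIRTIO_NET_HDR_GSO_TCPV4                     */
    u16 hdr_len;      /* length of the ethernet + IP + TCP headers         */
    u16 gso_size;     /* MSS: max payload size of each resulting segment   */
    u16 csum_start;   /* offset at which checksumming should start         */
    u16 csum_offset;  /* where the device stores the computed checksum     */
};

/* hand the device one oversized TCP packet and let it cut it into MSS-sized segments */
static void fill_tso_hdr(struct virtio_net_hdr *h, u16 header_len, u16 mss)
{
    h->flags = VIRTIO_NET_HDR_F_NEEDS_CSUM;
    h->gso_type = VIRTIO_NET_HDR_GSO_TCPV4;
    h->hdr_len = header_len;    /* ethernet + IP + TCP header bytes               */
    h->gso_size = mss;          /* device emits segments no larger than this      */
    h->csum_start = 14 + 20;    /* start of the TCP header (no VLAN, no options)  */
    h->csum_offset = 16;        /* checksum field offset within the TCP header    */
}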

Let's revisit some of these terms and define them so we have a better understanding of how this works.

First off - What is segmentation?

As you might know, TCP breaks up the application data (e.g. your HTTP webserver traffic) into segments whose size is controlled by the MSS (Maximum Segment Size). The MSS limits the payload size, while the size of the entire packet is controlled by the MTU (Maximum Transmission Unit).
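
As a quick back-of-the-envelope example (assuming a standard 1500-byte Ethernet MTU, IPv4, and no IP or TCP options):

static int typical_mss(void)
{
    int mtu = 1500;          /* maximum IP packet size on the link */
    int ip_header = 20;      /* minimum IPv4 header                */
    int tcp_header = 20;     /* minimum TCP header                 */
    return mtu - ip_header - tcp_header;   /* 1460 bytes of TCP payload */
}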

When you send data using TCP/IP the data/body portion of the IP packet is formatted as a TCP segment.

Segments are encapsulated inside packets which are encapsulated inside frames.

These are all different terms, but the units they describe are similar enough that there is an umbrella term covering all of them.

What is a PDU?

To make things simpler we can call these encapsulation abstractions PDUs: protocol data units.

  • Transport Layer: Segment
  • Network Layer: Packet
  • Data Link Layer: Frame

Let's get illustrative. Using HTTP as an example, we stuff that application data into the body of a TCP segment.

-------------------------
| tcp header | tcp body |
-------------------------
TCP Segment

We then turn around and stuff that segment into an ip body to form a packet:

-------------------------
| ip header | ip body   |
-------------------------
IP Packet

Then we stuff the ip packet into (probably an ethernet) frame:

----------------------------------
| mac header | pkt body | chksum |
----------------------------------
Frame
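
The same nesting can be sketched in code. These struct layouts are simplified illustrations of an Ethernet II frame carrying IPv4 and TCP with no options (real code would mark them packed, mind network byte order, and let the NIC append the trailing checksum); the u8/u16/u32 fixed-width types are assumed, and this is not how the Nanos network stack actually declares its headers.

/* data link layer: Ethernet II header (the FCS trailer is added by the NIC) */
struct eth_header {
    u8  dst_mac[6];
    u8  src_mac[6];
    u16 ethertype;              /* 0x0800 for IPv4 */
};

/* network layer: minimal IPv4 header (20 bytes, no options) */
struct ip_header {
    u8  version_ihl;            /* version (4) and header length */
    u8  tos;
    u16 total_length;           /* header + payload              */
    u16 id;
    u16 flags_fragment_offset;
    u8  ttl;
    u8  protocol;               /* 6 for TCP                     */
    u16 checksum;
    u32 src_addr;
    u32 dst_addr;
};

/* transport layer: minimal TCP header (20 bytes, no options) */
struct tcp_header {
    u16 src_port;
    u16 dst_port;
    u32 seq;
    u32 ack;
    u16 data_offset_flags;
    u16 window;
    u16 checksum;
    u16 urgent_ptr;
};

/* on the wire: | eth_header | ip_header | tcp_header | HTTP bytes | FCS | */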

Segmentation has to occur because we are operating on a stream and the various limits prevent us from sending too much at a time. The question is whether our CPU does the segmentation and reassembly or the NIC does. In many cases we have a NIC that can do this, freeing up the CPU to do more application/system-specific work. This is where offload comes in. The NIC is not alone in this either; modern hardware architecture looks nothing like what many people might picture in their head. For more information on this, check out this excellent talk.

Unsurprisingly there is a similar process with IP called IP fragmentation. The difference is that TCP segmentation happens at the endpoints (e.g. the client and the server), but IP fragmentation can happen in transit. One reason is that MTUs can change from network to network.
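
As a hypothetical worked example (the numbers here are illustrative and not tied to our stack): a full-size 1500-byte IPv4 packet reaching a link with a 576-byte MTU gets split into fragments whose payloads must be multiples of 8 bytes.

static int fragment_count(void)
{
    int mtu = 576;
    int ip_header = 20;
    int payload = 1500 - ip_header;             /* 1480 bytes to carry                    */
    int per_fragment = (mtu - ip_header) & ~7;  /* offsets count in 8-byte units, so 552  */
    return (payload + per_fragment - 1) / per_fragment;  /* 3 fragments: 552 + 552 + 376  */
}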

We also added support for UDP fragmentation offload (UFO), which provides similar functionality. The changes we made also allow us to enable MRG_RXBUF (mergeable receive buffers), so we don't have to allocate huge rx-side buffers, which has the effect of lowering memory utilization.
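
For reference, these capabilities correspond to feature bits the driver negotiates with the device at initialization time. The bit numbers below come from the virtio-net specification; negotiate_features is a simplified, hypothetical helper rather than the actual Nanos driver code.

/* virtio-net feature bits (values from the virtio specification) */
#define VIRTIO_NET_F_CSUM       (1ULL << 0)   /* device handles tx checksums          */
#define VIRTIO_NET_F_HOST_TSO4  (1ULL << 11)  /* device segments TCPv4 for us (TSO)   */
#define VIRTIO_NET_F_HOST_TSO6  (1ULL << 12)  /* device segments TCPv6 for us         */
#define VIRTIO_NET_F_HOST_UFO   (1ULL << 14)  /* device fragments UDP for us (UFO)    */
#define VIRTIO_NET_F_MRG_RXBUF  (1ULL << 15)  /* rx buffers can be merged             */

/* accept only the offloads both the device and the driver support */
static u64 negotiate_features(u64 device_features)
{
    u64 wanted = VIRTIO_NET_F_CSUM | VIRTIO_NET_F_HOST_TSO4 |
                 VIRTIO_NET_F_HOST_TSO6 | VIRTIO_NET_F_HOST_UFO |
                 VIRTIO_NET_F_MRG_RXBUF;
    return device_features & wanted;
}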

Performance engineering is not a light switch. It's more like having 10,000 different knobs. This one knob pushed our throughput by 4X.

Deploy Your First Open Source Unikernel In Seconds

Get Started Now.