Are unikernels unfit for production? Are unikernels completely undebuggable? There are so many unsubstantiated claims being made about 'debugging unikernels' that I feel I need to address some of them. I'll state right off the bat that you should question every claim you hear and ask whether the person making it has actually tried to do what they say is impossible with unikernels, or, hell, whether they have even booted one. My guess is probably not, because if they had, they'd realize how insane some of these claims truly are.
Let’s start by breaking down the notion of ‘debugging unikernels’, because it gets thrown around yet never really gets spelled out. I've grouped the claims I've seen into the following categories:
Real Application Level Debugging
This, surprisingly or not so surprisingly, works out of the box. We use GDB all the time to debug our kernel because it’s written in C, and we work mostly on KVM, which is trivial to attach to. As for application-level tooling, use the tools that already exist. Have a problem with your Ruby app? Use the tools that exist, like irb. Have a problem with Go? Delve exists, but most Go developers I know are comfortable enough with their composition abilities, test suites and such that they don’t feel it necessary to single-step and hit breakpoints in their application. Whichever way you swing, existing tools do the trick.
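For the kernel side, "trivial to attach to" usually means QEMU's built-in gdbstub. The sketch below is a minimal, hypothetical Go helper, assuming the unikernel ships as a raw disk image and that you have an ELF build of the kernel with debug symbols (names like `app.img` and `kernel.elf` are placeholders, not files any particular toolchain produces). It assembles a `qemu-system-x86_64` argument list with `-gdb tcp::PORT` and, optionally, `-S` so the guest CPU stays paused until GDB connects.

```go
package main

import "fmt"

// qemuDebugArgs builds a hypothetical qemu-system-x86_64 argument list that
// boots a unikernel disk image under KVM with QEMU's gdbstub enabled.
// The image path, memory size and port are placeholders for your own setup.
func qemuDebugArgs(image string, gdbPort int, waitForDebugger bool) []string {
	args := []string{
		"-enable-kvm", // hardware virtualization via KVM
		"-m", "512M", // guest memory
		"-nographic", // serial console on stdio
		"-drive", fmt.Sprintf("file=%s,format=raw,if=virtio", image),
		"-gdb", fmt.Sprintf("tcp::%d", gdbPort), // gdbstub listens on this TCP port
	}
	if waitForDebugger {
		// -S freezes the guest CPU at startup until GDB issues "continue".
		args = append(args, "-S")
	}
	return args
}

// gdbAttachCommand returns a GDB invocation that loads symbols from the
// (assumed) kernel ELF file and connects to the gdbstub started above.
func gdbAttachCommand(kernelELF string, gdbPort int) string {
	return fmt.Sprintf("gdb %s -ex 'target remote :%d'", kernelELF, gdbPort)
}
```

Run QEMU with those arguments, then run the GDB command in another terminal; set your breakpoints and type `continue` to release the paused CPU.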
The next category is monitoring and APM. Typically we are talking about tools like Prometheus, which also work out of the box. When we booted our first Go unikernel, guess what worked out of the box? Full APM support, including the ability to see panics, garbage collection over time, request context tracing, open file descriptors, heap dumps, etc. Sometimes people like to lump logging into this category as well. Well, syslog, Elasticsearch, Splunk: once again, you can utilize the tools that exist.
Production DevOps/SRE Tooling
This is probably the main topic to hit, and as a guy who was given his first Slackware floppies in 1994 and still uses mutt for some of his email accounts, you know I've got opinions on this.
The first thing to note, and I think it's a very common misconception, is the idea that unikernels are somehow a stripped-down Linux. You need to get rid of that thinking. Unikernels are not in the JEOS (just enough operating system) vein. They are not like CoreOS or Alpine. Unikernels are a fundamentally different beast. Once you get your head around that, things stop looking as crazy as you think they are. It also helps to go and boot one or two to actually understand what is going on.
Tools like htop, ps and lsof (really, the list goes on ad infinitum) are built for multi-process systems. Maybe you are used to pulling up htop when you ssh into a system to see which application is hogging all the CPU? Well, on a unikernel system there is no question: it’s the one process. Maybe you are used to running df/du to find what is eating your precious disk space, and where. Darn, the Spark log files strike again! There is no question: it’s the one process. Maybe you are used to grepping through ps output to find the pid of the application you want to profile with strace? Well, on a unikernel system there is no question. Maybe you’re trying to find the process that is spawning a ton of connections with lsof? On a unikernel system there is no question, and it’s easily instrumentable.
The point I’m trying to make here is that a lot of these tools exist because, without them, it’s really hard to tell the activity of one process from another on a multi-process monolith like Linux. On a single-process system there is nothing stopping you from exporting anything of interest to your observability and monitoring solutions, and most people who call themselves SREs would argue that’s what you should be doing regardless.
This raises the question of how these things are actually deployed. Every single company I know of that utilizes unikernels deploys them on top of hypervisors. If you are in a public cloud such as AWS or GCE, you are probably sitting on a modified form of KVM (AWS is in the process of moving off Xen). If you are in a private datacenter, there is a strong probability you are virtualized as well. The real question, though, is whether you are in control of the virtualization or not. This isn't a question of good or bad style; it's an organizational one. If you are in control, every single tool you can think of is available to you, end of story. Most of the customers we work with who run unikernels are given a platform to orchestrate them through KVM, so all those tools remain available to their ops people.
If you are on something like AWS or GCE, then you will need to start treating your servers like cattle. That is, you’ll need to rely more heavily on instrumentation than on ssh’ing into hosts to poke and prod. That said, that’s not your only choice in public cloud environments. GCE supports nested virtualization and AWS supports ‘bare metal’ (well, not really, but that’s a different blog post). If you use those options, you can keep your management plane shell and have your cake too. I fully expect to see serverless runtimes built on top of unikernels doing this in the very near future, the same way Heroku started out.
Debugging in Production
Lastly, the concept of ‘debugging in production’ is basically the same as ‘developing in production’ or ‘testing in production’: don’t do it. It is widely considered a bad practice. If you are going to sit there and debug in production, you might as well get rid of your continuous integration, stop doing pull requests and stop doing code reviews. This argument is completely dead on arrival.
Now that we’ve addressed some of the categories for this argument, let’s address the obvious: there are methods of deployment and infrastructure that many people already use that look suspiciously like unikernels.
Platforms as a Service:
Heroku was one of the first companies to adopt this deployment/infrastructure pattern. Guess what? It looks and feels a *lot* like unikernels when it comes to debugging arguments like this, and obviously more than a handful of companies use it. Then there is Google App Engine, which, for reasons I will never understand, was in place for years before Google decided to roll out GCE.
The container market is largely an extension/re-invention of what the prior platform as a service market looked like. Keep in mind Docker used to be known as dotCloud. If you have arguments against unikernels and ‘debugging’, then you have to have the precise same arguments against containers, because operationally they look a lot alike. Container people advocate having one process per container. Container people rely on higher-level orchestration frameworks like Kubernetes to resolve some of the issues they perceive in managing them. Then there is a whole wealth of ‘debugging’-like tooling on top to deal with anything you might have problems with in container land. No one I know is ssh’ing into a container and utilizing all the tooling described in the ops section. They might do that on the hosts that run Kubernetes, but once again, that’s an outer layer. The same thing exists for unikernels.
Serverless
Then we have serverless and functions as a service. Once again the same arguments apply here. No one is giving you a shell to pop into these things and diagnose production related issues. The onus is on the app developer to debug locally and the ops person to instrument it correctly to ensure that the software works as expected.
As we can clearly see, the unikernel debugging myths are just that, and they hold no weight. I’d be very careful about listening to a random HN/Twitter user spouting off when they have clearly never touched unikernels, especially when they don’t back up their arguments with real proof.