
Scalability from the operating system & code perspective

While overnight success happens for only 0.01% of mobile & web services, as entrepreneurs we are all secretly expecting and preparing for it. Cloud providers know how to use these success stories to rent hardware at a premium in exchange for lightning-fast scalability (read "flexibility"). Indeed, scalability is now a matter of a few clicks & some automation, while not so long ago it took engineers/nights/pizzas/coffees.

The downside of this unlimited on-demand power is that it has progressively made today's engineers lazy about architecture & code optimization. As developer time costs money, the common wisdom today is to throw dollars at the infrastructure, not at the code. The number of (amazingly slow) CMSes available on the market today, and WordPress being used on 23% of Internet websites, confirms this.

The ever-increasing pressure to deliver projects as fast as possible, combined with cheap IT resources, brought us to the "ship fast, optimize later" attitude. Just 15 years ago, unoptimized code running on a Pentium 100 with 16 MB of RAM was immediately noticeable, forcing us to focus on fast code, as opposed to "code fast" today. Not to mention today's abundance of horribly heavy frameworks and libraries that save developer time in exchange for poor runtime performance. While that doesn't mean I pay no attention to how I code today, code optimization is clearly a lower priority than it was even 10 years ago. Optimization is usually left for later, and "later" often becomes "never".

A few recent examples made me reconsider my view on this:

First example: one of our clients runs a marketing analytics solution that costs them $2,000 per month. Their competitor spends roughly $90,000 per month on Amazon AWS to achieve exactly the same tasks. The catch? Extremely well-written & optimized code, no fancy framework, bare-metal hardware and amazing developers.

Second example: a large enterprise recently called us to audit a solution of theirs that was having load problems. After rewriting a few (horrible) lines of code in a script, we went from 9 machines down to only 2 handling the same traffic load.

Third example: in 2009, StackOverflow's architecture was serving 16 million page views a month. At that time, StackOverflow ran on only two Lenovo ThinkServer RS110 1U machines (4 cores, 2.83 GHz, 12 MB L2 cache, 8 GB of RAM) for the web tier and a single Lenovo ThinkServer RD120 2U (8 cores, 2.5 GHz, 24 MB L2 cache, 48 GB of RAM) for the database. That is only three servers, with specs that aren't mind-blowing by today's standards, serving 16M PV a month… The key to StackOverflow's performance lies in well-written/designed code, bare-metal servers & a simple but efficient architecture.

So it was with great attention & delight that I followed Robert Graham's talk at Shmoocon 2013, "C10M: Defending the Internet at Scale".

Every application developer who learnt network programming actually learnt BSD/Unix network programming, with the usual pattern: socket/bind/listen/accept/recv/send. All these primitives rely on the OS kernel. I was extremely surprised to learn that the Linux kernel actually performs poorly when handling these functions (I had assumed the opposite). By moving these primitives we have relied on for years from the kernel to userland, we can finally have specialized (optimized) code with more control over how things are done.
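That pattern, in Python's socket module (which wraps the BSD primitives one-to-one), looks like this. Every call below is a trip through the kernel, which is exactly the cost kernel-bypass techniques try to avoid:

```python
import socket
import threading

# socket/bind/listen: the classic BSD sequence -- each call crosses into the kernel
srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.bind(("127.0.0.1", 0))   # port 0: let the kernel pick a free port
srv.listen(1)
port = srv.getsockname()[1]

def serve():
    conn, _ = srv.accept()         # accept: blocks in the kernel until a client connects
    conn.sendall(conn.recv(1024))  # recv/send: each copies the payload across the kernel boundary
    conn.close()

t = threading.Thread(target=serve)
t.start()

cli = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
cli.connect(("127.0.0.1", port))
cli.sendall(b"hello")
reply = cli.recv(1024)
cli.close()
t.join()
srv.close()
print(reply)  # b'hello'
```

Counting the syscalls on the round trip above makes the per-connection overhead concrete: even this trivial echo costs around ten kernel crossings.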


Network Data Flow through the Linux Kernel

The kernel is then bypassed for the data plane (the actual movement of data packets), which is handled in userland. The kernel is left with only the control plane, which it does well. Handling the data plane in userland opens up a new world of possibilities (and several optimizations): packet decapsulation can be done faster (and lazily, I suspect, as you don't always need to decapsulate every protocol header; I have a few ideas on this) thanks to much more optimized memory management and faster code (fewer CPU cycles).
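To make the lazy-decapsulation idea concrete, here is a toy sketch in Python (not DPDK code, and the `LazyPacket` class is my own invention): header fields are only parsed from the raw frame the first time they are asked for, so a packet that is forwarded unchanged never pays for a full parse:

```python
import struct

class LazyPacket:
    """Parses protocol headers on demand instead of decapsulating everything up front."""
    def __init__(self, frame):
        self.frame = frame
        self._ethertype = None
        self._ip_dst = None

    @property
    def ethertype(self):
        if self._ethertype is None:
            # Ethernet header: 6B dst MAC, 6B src MAC, 2B ethertype
            (self._ethertype,) = struct.unpack_from("!H", self.frame, 12)
        return self._ethertype

    @property
    def ip_dst(self):
        if self._ip_dst is None:
            assert self.ethertype == 0x0800  # this sketch only knows IPv4
            # IPv4 destination address sits at bytes 16..19 of the IP header
            octets = struct.unpack_from("!4B", self.frame, 14 + 16)
            self._ip_dst = ".".join(map(str, octets))
        return self._ip_dst

# Minimal hand-built frame: zeroed MACs, ethertype 0x0800, 20-byte IPv4 header, dst 10.0.0.1
eth = b"\x00" * 12 + b"\x08\x00"
ipv4 = b"\x45\x00" + b"\x00" * 14 + bytes([10, 0, 0, 1])
pkt = LazyPacket(eth + ipv4)
print(hex(pkt.ethertype), pkt.ip_dst)  # 0x800 10.0.0.1
```

A userland stack can take this much further (checksum deferral, zero-copy views into the NIC ring), but the principle is the same: never touch bytes you don't need.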

After Googling a bit, I found DPDK (Data Plane Development Kit). The DPDK framework offers a few interesting features that achieve amazing performance improvements for network-related operations on Linux:

Fast packet processing: don't let the kernel handle/decapsulate packets for you anymore; it was not designed for this and can't do it well. Take the packets directly from the NIC and do the processing in userland. Intel recently announced it achieved over 80 Mpps of throughput this way, versus 12.2 Mpps with the native Linux stack.
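The shape of such an application is a poll-mode loop that pulls packets from the NIC in bursts rather than waiting on interrupts. Here is a Python simulation of that pattern (a real DPDK app would do this in C with `rte_eth_rx_burst` on a dedicated core; the queue and burst size below are stand-ins I chose for illustration):

```python
from collections import deque

BURST = 32  # pulling packets in bursts amortizes the per-call overhead

def rx_burst(nic_queue, max_pkts):
    """Stand-in for a driver burst-receive call: drain up to max_pkts from the ring."""
    batch = []
    while nic_queue and len(batch) < max_pkts:
        batch.append(nic_queue.popleft())
    return batch

def poll_loop(nic_queue, handle):
    """Busy-poll the ring instead of sleeping on a kernel interrupt."""
    processed = 0
    while nic_queue:  # a real loop spins forever on a core reserved for this
        for pkt in rx_burst(nic_queue, BURST):
            handle(pkt)
            processed += 1
    return processed

queue = deque(("pkt%d" % i).encode() for i in range(100))
seen = []
count = poll_loop(queue, seen.append)
print(count)  # 100
```

The trade-off is explicit: the loop burns a whole core even when idle, buying latency and throughput with electricity.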

Better memory management for your application: DPDK provides huge-page memory, which reduces the page table size. Your application can reserve a large block of memory once and then do its own management without asking the kernel for anything.

Better use of CPU cores: most applications are not designed to take full advantage of multiple CPU cores. The multicore framework seems to aim at improving this as well, but I have yet to dig into the details. Using processor affinity is one idea: keep each CPU working for a single purpose within your application, never interrupted by the kernel.
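On Linux you can experiment with affinity without any framework at all, via `os.sched_getaffinity`/`os.sched_setaffinity` (thin wrappers over the `sched_setaffinity(2)` syscall; Linux-only, and core 0 below is an arbitrary pick for the sketch):

```python
import os

PID = 0  # 0 means "the calling process"

before = os.sched_getaffinity(PID)   # the set of cores we are allowed to run on
os.sched_setaffinity(PID, {0})       # pin this process to core 0 (arbitrary choice)
pinned = os.sched_getaffinity(PID)
print(pinned)  # {0}

os.sched_setaffinity(PID, before)    # restore the original mask
```

Pinning alone only keeps your code on a core; to keep the kernel's own work off that core you would combine this with boot-time isolation (e.g. `isolcpus`), which is what the serious userland-networking setups do.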

It was an unexpected discovery for me that the Linux kernel, which powers roughly 70% of the Internet today, is such a huge performance bottleneck when dealing with network operations. With frameworks like DPDK getting more attention, I look forward to seeing web applications serve millions of connections per second on a fraction of the hardware required today, and with a very low energy footprint. With data centers worldwide using nearly 30 billion watts of electricity, code can definitely have a negative or positive impact on this consumption.
