system:benches10gbps:direct

This documentation was written after the fact, once effective results had been obtained.

We will retrace the road of tuning the client and the server, but to make it easier to focus on one side at a time, we will use a “good” configuration for the other peer.

Monitoring graphs for the different benches can be found here.

Server

The main focus was to tune the server so it could handle a lot of connections.

Changes are made and ordered so that each one gives a noticeable gain. Some changes could have been done much earlier, but often with small impact at that point.

baseline

No tuning, just a fresh nginx install, serving the default home page (a very small HTML file).

Input file: small-1.txt
new page0 0
      get 10.128.0.0:80 /
/root/inject -p 24 -d 60 -u 500 -s 20 -f small-1.txt -S 10.140.0.0-10.140.15.255:1024-65535
20932 hits/s

Ok, that gives us a baseline: what we can get without even trying.

All your core are belong to us

Nginx's default configuration only has 4 workers. The system sees 24 CPUs. Let's get 24 workers!

file: /etc/nginx/nginx.conf
-worker_processes 4;
+worker_processes 24;
/root/inject -p 24 -d 60 -u 500 -s 20 -f small-1.txt -S 10.140.0.0-10.140.15.255:1024-65535

Getting some errors in /var/log/nginx/error.log

[...] accept4() failed (24: Too many open files)

Increase the number of open files. That's just memory, and memory is cheap. Instead of 1k (ulimit -n shows 1024), let's ask for 1M files (1048576).

file:/etc/default/nginx
+ULIMIT="-n 1048576"
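
After restarting nginx, a quick way to check that a worker actually picked up the new limit (a sketch; the pgrep pattern assumes the usual "nginx: worker process" title):

grep 'open files' /proc/$(pgrep -f 'nginx: worker' | head -n1)/limits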

New error…

[...] "/var/log/nginx/access.log" failed (28: No space left on device) while logging request [...]

No space left? Damn, why am I even logging my requests? That's some heavy disk I/O and should just be removed. Let's stop writing a useless access.log (keep the error.log: there shouldn't be anything in it, and if there is, it will probably be useful).

file: /etc/nginx/nginx.conf
-access_log /var/log/nginx/access.log;
+access_log off;

Yet another error…

768 worker_connections are not enough

Let's allow A LOT of connections (we don't want this error to show up again anytime soon).

file: /etc/nginx/nginx.conf
-worker_connections 768;
+worker_connections 524288;

Yeah, no more errors.

/root/inject -p 24 -d 60 -u 500 -s 20 -f small-1.txt -S 10.140.0.0-10.140.15.255:1024-65535
47875 hits/s

Good… we are getting somewhere.

We have 24 processes that can handle connections; that's better than 4.

Multiple way to get in

There might be some limitation around the bound socket. (Like the kernel locking the socket to check that the waiting list is not too long before accepting the connection… pure speculation, code not checked.)

Let's try replacing the single listen with multiple IPs to listen on.

file: /etc/nginx/sites-enabled/default
-#listen 80;
+listen 10.128.0.0:80;
+listen 10.128.0.1:80;
[...]
+listen 10.128.0.23:80;
New input file: small-24.txt
new page0 0
        get 10.128.0.0:80 /
new page1 0
        get 10.128.0.1:80 /
[...]
new page23 0
        get 10.128.0.23:80 /
/root/inject -p 24 -d 60 -u 500 -s 20 -f small-24.txt -S 10.140.0.0-10.140.15.255:1024-65535
50743 hits/s

Good, it does help not to be limited to a single listening socket.
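
For reference, those listen addresses have to exist on the server. A sketch of how they could be added as aliases (the interface name eth1 and the /16 prefix are assumptions, adjust to the actual setup):

for i in $(seq 0 23); do
    ip addr add 10.128.0.$i/16 dev eth1
done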

sorry to interrupt

The overall CPU graph shows that one CPU is much, much more used than the others. Checking the CPU#0 graph, we can see a lot of its time is spent in soft interrupts. We should try to assign the interrupts to other CPUs too…

As we can see in /proc/interrupts, we have 24 interrupts for each interface (as many as there are CPUs - hardware threads - seen by the system). A first approach is to assign them to CPUs in order.

eth1-TxRx-0 → cpu 0, eth1-TxRx-1 → cpu 1, […], eth1-TxRx-23 → cpu 23
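
This mapping can be applied by writing a CPU bitmask into /proc/irq/<n>/smp_affinity for each queue. A minimal sketch, assuming the queues show up as eth1-TxRx-* in /proc/interrupts (run as root; if irqbalance is running it will likely rewrite these values, so stop it first):

for n in $(seq 0 23); do
    irq=$(grep -w "eth1-TxRx-$n" /proc/interrupts | cut -d: -f1 | tr -d ' ')
    # hex bitmask with only bit n set (e.g. cpu 4 -> 10)
    printf '%x' $((1 << n)) > /proc/irq/$irq/smp_affinity
done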

/root/inject -p 24 -d 60 -u 500 -s 20 -f small-24.txt -S 10.140.0.0-10.140.15.255:1024-65535
53721 hits/s

Better.

stop locking yourselves

Now that our network interrupts aren't a bottleneck anymore, we get a nice number of connections each second. Nginx just doesn't accept them fast enough. By default, nginx uses a mutex so only one process accepts connections at a time. Well, who cares? What if everyone tries to? Ok, most processes will fail, but what if they each get new sockets too? That could speed things up.

file:/etc/nginx/nginx.conf
+accept_mutex off;
/root/inject -p 24 -d 60 -u 500 -s 20 -f small-24.txt -S 10.140.0.0-10.140.15.255:1024-65535
97682 hits/s

Wow, that much was lost just to nginx locking itself and preventing other workers from picking up new connections at the same time.

too crowded

  • We have 24 interrupts spread over our 24 CPUs.
  • We have 24 nginx workers on our 24 CPUs.

What if we use fewer workers?

file: /etc/nginx/nginx.conf
-worker_processes 24;
+worker_processes 16;
/root/inject -p 24 -d 60 -u 500 -s 20 -f small-24.txt -S 10.140.0.0-10.140.15.255:1024-65535
126731 hits/s

What if we get down to 12 ?

file: /etc/nginx/nginx.conf
-worker_processes 16;
+worker_processes 12;
/root/inject -p 24 -d 60 -u 500 -s 20 -f small-24.txt -S 10.140.0.0-10.140.15.255:1024-65535
138247 hits/s

So much for having as many workers as CPUs.

let's focus

Not having as many workers as CPUs gives better performance. Right, that leaves more free CPUs to handle IRQs…

What if we split the IRQs onto some CPUs, and the workers onto the others?

By checking the information in /sys/bus/cpu/devices/cpu*/topology/{core,thread}_siblings_list, we get an idea of how the CPUs map onto physical processors, cores and threads:

CPU     processor   core   thread
0-5     0           0-5    0
6-11    1           0-5    0
12-17   0           0-5    1
18-23   1           0-5    1
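
For reference, this mapping can be dumped quickly from the same sysfs files:

grep . /sys/bus/cpu/devices/cpu*/topology/{core,thread}_siblings_list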

How to split? Let's try different splits.

Each core has 2 threads. Let's use one thread for IRQs, one for a worker.

irq 0-23 => cpu 0-11,0-11
workers - cpu 12-23
184769 hits/s

We have 2 physical processors with 12 threads each. Let's try one processor for IRQs and one processor for workers.

irq 0-23 => cpu 0-5,12-17,0-5,12-17 (processor #0)
workers - set on 6-11,18-23 (processor #1)
190712 hits/s

better

What if we use the first 3 cores (2 threads per core) of each processor for IRQs, and the last 3 for workers?

irq 0-23 => cpu 0-2,6-8,12-14,18-20,0-2,6-8,12-14,18-20
workers - cpu 3-5,9-11,15-17,21-23
187394 hits/s

not as good.

Maybe now that we have a separation, we can add a few more workers again, and concentrate the IRQs onto fewer CPUs?

8 cpu for IRQ, 16 workers

Let's try again using one thread for IRQs and one for a worker… on the first 4 cores of each processor.

irq 0-23 => cpu 0-3,6-9,0-3,6-9,0-3,6-9
worker - cpu 4,5,10-23
153129 hits/s

ouch. Not that good…

What about keeping the IRQs on a single processor… its first 4 cores (both threads)?

irq 0-23 => 0-3,12-15,0-3,12-15,0-3,12-15
worker - cpu 4-11,16-23
218857 hits/s

Wow, much better. Just changing which threads handle what has a big impact.

pin the hopper

Ok, our nginx has 16 processes working on 16 CPUs. Why not associate each process with a single CPU, so they stop hopping from one to another?
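
nginx can do the pinning itself with the worker_cpu_affinity directive (one bitmask per worker, CPU 0 being the rightmost bit). A sketch matching the split above (16 workers on CPUs 4-11 and 16-23); the masks shown are an illustration, one per worker:

file: /etc/nginx/nginx.conf
+worker_cpu_affinity 000000000000000000010000
+                    000000000000000000100000
[...]
+                    100000000000000000000000;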

224544 hits/s

And better yet, with just affinity.

keep it opened

Now that we have nice, quick data transfers, our nginx serves a single file about 200k times per second. Maybe it should cache that file, instead of opening it from scratch each time. At that rate, it might make a difference.

file:/etc/nginx/nginx.conf
+open_file_cache max=1000;
236607 hits/s

I can has cookies

The kernel logs some SYN flood warnings…

TCP: Possible SYN flooding on port 80. Sending cookies.  Check SNMP counters.

Let's get that off our back (some of these options are not related to that message, but are included here too):

file:/etc/sysctl.conf
+net.ipv4.tcp_fin_timeout = 1
+net.ipv4.tcp_tw_recycle = 1
+net.ipv4.tcp_tw_reuse = 1
+net.ipv4.tcp_syncookies = 0
+net.core.netdev_max_backlog = 1048576
+net.core.somaxconn = 1048576
+net.ipv4.tcp_max_syn_backlog = 1048576
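
These settings can be applied without a reboot:

sysctl -p /etc/sysctl.conf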

Let's check how it holds up over a longer period:

/root/inject -p 24 -d 600 -u 500 -s 20 -f small-24.txt -S 10.140.0.0-10.140.15.255:1024-65535
236103 hits/s

Ok, we can sustain 236k connections per second without hitting any limit in any log.

about client

The server bench was done with a patched version of inject that pinned each process to a single CPU, with network interrupts gathered on a few CPUs. This is what gave the best results at the time, but further client tests show it is not optimal.

Client

Ok, now let's get back to tuning the client. We reset the client to a default configuration, and tune it to reach a high hit rate.

We keep the server in the latest configuration.

We already established that hitting multiple IPs is better than hitting a single one. We will keep that part in place.

As our client needs to connect at a high rate, we have to use multiple source IPs. If we don't, we will soon exhaust the source ip/port → destination ip/port tuples.

Having the client bind to an IP without specifying the port (letting it be picked from the ephemeral port range) would still hit the same limit (at least under Linux). That means we need a client that binds to a specific IP AND port for each outgoing connection.

inject seems to do just that. It takes a range of IPs and a range of ports. It splits the ports between the processes, and each process uses every IP in the range on its current port before moving on to the next port.
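
To make that iteration order concrete, here is a tiny illustration (just a shell loop printing a few source tuples in the order described, with shortened ranges):

for port in $(seq 1024 1026); do
    for host in $(seq 0 3); do
        echo "source 10.140.0.$host:$port"
    done
done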

At our connection rate, and hoping to present a decent number of different sources, a /20 is used (4096 IPs) along with all the upper ports (1024 → 65535), which gives about 264M ip/port tuples (4096 × 64512).

Note: at the rate we reach, this burns through an average of 60 ports per second (240k hits/s spread over 4096 source IPs), so it takes about 18 minutes before looping back to the first ports.

baseline

Let's get a few baselines.

Let's start with 1 process and 1 user.

/root/inject -d 60 -u 1 -s 20 -f small-24.txt -S 10.140.0.0-10.140.15.255:1024-65535
4984 hits/s

Ok, that's what a single user can get… about 0.20 ms per query (1/4984 s).

more processes

1 process is nice, but there is no reason not to use more, as we have 24 hardware threads available.

/root/inject -p 24 -d 60 -u 1 -s 20 -f small-24.txt -S 10.140.0.0-10.140.15.255:1024-65535
51080 hits/s

interrupt someone else

As we can see, CPU#0 is saturated with soft interrupts.

Let's spread the network IRQs over all CPUs (queues 0-23 to CPUs 0-23).

/root/inject -p 24 -d 60 -u 1 -s 20 -f small-24.txt -S 10.140.0.0-10.140.15.255:1024-65535
112035 hits/s

more users

Let each process simulate more users.

/root/inject -p 24 -d 60 -u 500 -s 20 -f small-24.txt -S 10.140.0.0-10.140.15.255:1024-65535
228367 hits/s

no timestamp

By default, TCP puts timestamps on its connections. When we are chasing the last bit of performance we are missing, it can be a good idea to disable them. (Note: this can be done on the server OR the client with similar results.)

file: /etc/sysctl.conf
+net.ipv4.tcp_timestamps = 0
/root/inject -p 24 -d 60 -u 500 -s 20 -f small-24.txt -S 10.140.0.0-10.140.15.255:1024-65535
241193 hits/s

dual

To check on which side the bottleneck is, let's try 2 servers, then 2 clients.

Tests are done with the latest configurations (client and server), which give about 240k hits/s.

dual servers

We set up a second server with the same configuration, and checked that it can also handle 240k hits/s. Then we change the scenario to hit the 24 IPs of both servers.

New input file: dual-24.txt
new page0a 0
        get 10.128.0.0:80 /
new page0b 0
        get 10.132.0.0:80 /
new page1a 0
        get 10.128.0.1:80 /
new page1b 0
        get 10.132.0.1:80 /
[...]
new page23a 0
        get 10.128.0.23:80 /
new page23b 0
        get 10.132.0.23:80 /
/root/inject -p 24 -d 60 -u 500 -s 20 -f dual-24.txt -S 10.140.0.0-10.140.15.255:1024-65535
401391 hits/s

Though the client seemed to use all its CPUs at 240k hits/s, it can still go up and handle 400k hits/s. The bottleneck is probably not really on that side.

dual clients

We set up a second client with the same configuration, and checked that it can also generate 240k hits/s.

To launch both clients at the same time, cssh is very nice :)

/root/inject -p 24 -d 60 -u 500 -s 20 -f small-24.txt -S 10.140.0.0-10.140.15.255:1024-65535
123016 hits/s
121312 hits/s
total: 244328 hits/s

Ok, the client is clearly not the limitation: with two clients, we get the same total.

conclusions

The above benches show the following:

  • As everyone knows, using multiple cores is better than using only one
  • SMP affinity is important, and can make a huge difference
  • under high load, it might be better to segregate core usage (as shown by separating IRQs and nginx)
  • under high load, reducing the number of processes to just one per used core is better
  • 240k connections per second is doable with a single host

For some unknown reason (at the time of writing this documentation), the connection rate drops sharply for 1-2 s at times, as can be seen on the bench-bad/nginx-bad graphs. I tried to avoid using results triggering such behaviour. Any ideas/hints on what could produce this are welcome.

post-bench

After publishing the first benches, someone advised using httpterm instead of nginx. Unlike nginx, httpterm is aimed only at stress benching, not at serving real pages.

Benching with a multi-process httpterm directly shows a bug: it still sends headers, but fails to send data. Getting down to 1 process keeps it running, but obviously does not use all cores.

As we have 16 cores for the web server, 16 processes with 1 IP each were launched, each pinned to a CPU with taskset (see the launcher sketch after the configuration below).

file-0.cfg:
# taskset 000010 ./httpterm -D -f file-0.cfg
global
        maxconn 30000
        ulimit-n 500000
        nbproc 1
        quiet

listen proxy1 10.128.0.0:80
        object weight 1 name test1 code 200 size 200
        clitimeout 10000
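
To launch the 16 instances, a small loop around taskset can be used. A sketch (the file-0.cfg … file-15.cfg naming with one IP per file, and the CPU choice matching the worker CPUs used earlier, are assumptions):

# one config file per instance/IP, each instance pinned to one of the 16 web-server CPUs (4-11, 16-23)
for i in $(seq 0 15); do
    taskset -c $(( i < 8 ? i + 4 : i + 8 )) ./httpterm -D -f file-$i.cfg
done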

That gives us more connections per second: 278765 hits/s

So httpterm helps get even more requests per second, but we still see some stalls at times.
