We will try to get back on the road of tuning the client and server, but to make it easier to focus on a single side at once, we will be using a "good" configuration for the other peer.

Monitoring graphs for the different benches can be found [[http://www.hagtheil.net/files/system/benches10gbps/direct/|here]].
====== Server ======
[...]
get 10.128.0.0:80 /
/root/inject -p 24 -d 60 -u 500 -s 20 -f small-1.txt -S 10.140.0.0-10.140.15.255:1024-65535
20932 hits/s
[...]
+worker_processes 24;
/root/inject -p 24 -d 60 -u 500 -s 20 -f small-1.txt -S 10.140.0.0-10.140.15.255:1024-65535
Getting some errors in /var/log/nginx/error.log
[...]
Yeah, no more errors.
/root/inject -p 24 -d 60 -u 500 -s 20 -f small-1.txt -S 10.140.0.0-10.140.15.255:1024-65535
47875 hits/s
[...]
get 10.128.0.23:80 /
/root/inject -p 24 -d 60 -u 500 -s 20 -f small-24.txt -S 10.140.0.0-10.140.15.255:1024-65535
50743 hits/s
[...]
eth1-TxRx-23 23
/root/inject -p 24 -d 60 -u 500 -s 20 -f small-24.txt -S 10.140.0.0-10.140.15.255:1024-65535
53721 hits/s
[...]
+accept_mutex off;
/root/inject -p 24 -d 60 -u 500 -s 20 -f small-24.txt -S 10.140.0.0-10.140.15.255:1024-65535
97682 hits/s
[...]
+worker_processes 16;
/root/inject -p 24 -d 60 -u 500 -s 20 -f small-24.txt -S 10.140.0.0-10.140.15.255:1024-65535
126731 hits/s
[...]
+worker_processes 12;
/root/inject -p 24 -d 60 -u 500 -s 20 -f small-24.txt -S 10.140.0.0-10.140.15.255:1024-65535
138247 hits/s
[...]
What if we split the IRQs onto a few CPUs, and the nginx workers onto the other CPUs?

By checking the information in ''/sys/bus/cpu/devices/cpu*/topology/{core,thread}_siblings_list'' (see the command sketch after the table), we get an idea of how the CPUs map to processors, cores and threads:

^ CPU ^ processor ^ core ^ thread ^
| 0-5 | 0 | 0-5 | 0 |
| 6-11 | 1 | 0-5 | 0 |
| 12-17 | 0 | 0-5 | 1 |
| 18-23 | 1 | 0-5 | 1 |
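
For reference, a one-liner like this dumps the raw topology files the table was built from (a sketch; CPU numbering and output are machine-specific):

grep . /sys/bus/cpu/devices/cpu*/topology/{core,thread}_siblings_list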

How to split? Let's try different splits.
[...]
Check how it holds up over a longer period:
/root/inject -p 24 -d 600 -u 500 -s 20 -f small-24.txt -S 10.140.0.0-10.140.15.255:1024-65535
236103 hits/s
Ok, we can hold 236k connections per second, without hitting any limit in any log.

===== about client =====

The server bench was done with a patched version of inject that pinned each process to a single CPU, and with the network interrupts gathered on a few CPUs. This is what gave the best result at the time, but further client tests show it is not optimal.
====== Client ======

Ok, now let's get back to tuning the client. We will reset the client to a default configuration, and tune it to reach a high number of hits per second.

We keep the server in its latest configuration.

We already established that hitting multiple IPs is better than hitting a single one. We will keep that part in place.

As our client needs to connect at a high rate, we have to use multiple source IPs. If we don't, we would soon hit the limit on source ip/port -> destination ip/port tuples.

Having the client bind to an IP without specifying the port (letting it be picked from the ephemeral port range) would still hit the same limit (at least under Linux). That means we need a client that binds to a specific IP AND port for each outgoing connection.

inject seems to do just that. It takes a range of IPs and a range of ports. It splits the ports between the processes, and tries each port with every IP in the range before getting to the next port. All IPs in the range will be used before a process moves on to the next port.
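
As an illustration of that ordering (a sketch only, not inject's actual code; the ranges are the ones from the command lines used here), the source tuples come out like this:

# sketch of the source-address ordering: every IP of the /20 is used on a
# given source port before the next port is tried
for port in $(seq 1024 65535); do      # each inject process only handles a slice of this range
  for i in $(seq 0 15); do             # 10.140.0.0-10.140.15.255 is 16 * 256 = 4096 IPs
    for j in $(seq 0 255); do
      echo "source 10.140.$i.$j:$port"
    done
  done
done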

At our connection rate, and hoping to present a nice amount of different sources, a /20 is used (4096 IPs) along with all the upper ports (1024 -> 65535), which leaves about 264M ip/port tuples.

Note: at the high rate we reach, this burns an average of about 60 ports per second, and it would take about 18 minutes before it loops back to the first ports.
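
A quick back-of-the-envelope check of those numbers (plain shell arithmetic, using the ranges from the inject command line):

ips=$((16 * 256))                             # 10.140.0.0/20 -> 4096 source IPs
ports=$((65535 - 1024 + 1))                   # 1024-65535 -> 64512 source ports
echo "tuples:   $((ips * ports))"             # 264241152, about 264M ip/port tuples
echo "ports/s:  $((240000 / ips))"            # ~58 ports burned per second at 240k conn/s
echo "wrap (s): $((ports / (240000 / ips)))"  # ~1112 s, roughly 18 minutes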

===== baseline =====

Let's get a few baselines.

Let's start with 1 process and 1 user:
/root/inject -d 60 -u 1 -s 20 -f small-24.txt -S 10.140.0.0-10.140.15.255:1024-65535
4984 hits/s

Ok, that's what a single user can get... about 0.20 ms per query.

===== more processes =====

1 process is nice, but there is no reason not to use more processes, as we have 24 hardware threads on the processors.

/root/inject -p 24 -d 60 -u 1 -s 20 -f small-24.txt -S 10.140.0.0-10.140.15.255:1024-65535
51080 hits/s

===== interrupt someone else =====

As we can see, CPU#0 is saturated with soft interrupts.

Let's spread the network IRQs over all the CPUs (queues 0-23 to CPUs 0-23), as sketched below.
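
A minimal sketch of that spreading, assuming the interrupts are named eth1-TxRx-0 to eth1-TxRx-23 as on the server (names and IRQ numbers are machine-specific, check /proc/interrupts on your host):

for n in $(seq 0 23); do
  # find the IRQ number of queue n and pin it to CPU n (hex affinity mask)
  irq=$(awk -v q="eth1-TxRx-$n" '$NF == q {sub(":", "", $1); print $1}' /proc/interrupts)
  printf '%x' $((1 << n)) > /proc/irq/$irq/smp_affinity
done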

/root/inject -p 24 -d 60 -u 1 -s 20 -f small-24.txt -S 10.140.0.0-10.140.15.255:1024-65535
112035 hits/s

===== more users =====

Let the processes use more users.

/root/inject -p 24 -d 60 -u 500 -s 20 -f small-24.txt -S 10.140.0.0-10.140.15.255:1024-65535
228367 hits/s

===== no timestamp =====

By default, TCP puts timestamps on its connections. When we are trying to gain the little performance we are missing, it can be a good idea to disable them. (Note: this can be done on the server OR the client, with similar results.)

file: /etc/sysctl.conf
net.ipv4.tcp_timestamps = 0
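
To apply it right away without a reboot (standard sysctl usage):

sysctl -w net.ipv4.tcp_timestamps=0    # or "sysctl -p" after editing /etc/sysctl.conf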

/root/inject -p 24 -d 60 -u 500 -s 20 -f small-24.txt -S 10.140.0.0-10.140.15.255:1024-65535
241193 hits/s

====== dual ======

To check on which side the bottleneck is, let's try 2 servers, or 2 clients.

Tests were done with the latest configurations (client and server), which could give 240k hits/s.

===== dual servers =====

We get a second server with the same configuration, and check that it can also handle the 240k/s. Then we change the scenario to hit 24 IPs on each of the two servers.

New input file: dual-24.txt
new page0a 0
get 10.128.0.0:80 /
new page0b 0
get 10.132.0.0:80 /
new page1a 0
get 10.128.0.1:80 /
new page1b 0
get 10.132.0.1:80 /
[...]
new page23a 0
get 10.128.0.23:80 /
new page23b 0
get 10.132.0.23:80 /
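
The full file is repetitive, so it can be generated with a small loop (a sketch that simply expands the pattern above):

# generate dual-24.txt: one page per target IP on each of the two servers
for i in $(seq 0 23); do
  echo "new page${i}a 0"
  echo "get 10.128.0.$i:80 /"
  echo "new page${i}b 0"
  echo "get 10.132.0.$i:80 /"
done > dual-24.txt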

/root/inject -p 24 -d 60 -u 500 -s 20 -f dual-24.txt -S 10.140.0.0-10.140.15.255:1024-65535
401391 hits/s

Though the client seems to use all its CPUs at 240k/s, it can still go up and handle 400k hits/s. The bottleneck is probably not really on that side.

===== dual client =====

We get a second client with the same configuration, and check that it can also generate the 240k/s.

To launch both clients at the same time, cssh is very nice :)

/root/inject -p 24 -d 60 -u 500 -s 20 -f small-24.txt -S 10.140.0.0-10.140.15.255:1024-65535
123016 hits/s
121312 hits/s
total: 244328 hits/s

Ok, the client is clearly not the limitation, as with two clients we get the same total.

====== conclusions ======

The above benches show the following:

  * As everyone knows, using multiple cores is better than using only one
  * SMP affinity is important, and can make a huge difference
  * under high load, it can be better to segregate core usage (as shown by separating IRQs and nginx)
  * in a high-load configuration, reducing the number of processes to just one per used core is better
  * 240k connections per second is doable on a single host

For some unknown reason (at the time of writing this documentation), the connection rate drops sharply for 1-2 s, as can be seen on the [[http://www.hagtheil.net/files/system/benches10gbps/direct/bench-bad/nginx-bad/elastiques-nginx/|bench-bad/nginx-bad]] graphs. I tried to avoid using results triggering such behaviour. Any ideas/hints on what could produce this are welcome.

====== post-bench ======

After publishing the first benches, someone advised using httpterm instead of nginx. Unlike nginx, httpterm is aimed only at stress benching, not at serving real pages.

Benching with multi-process httpterm quickly shows a bug: it still sends the headers, but fails to send the data. Getting down to 1 process keeps it running, but obviously does not use all cores.

As we have 16 cores for the web server, 16 processes with 1 IP each were launched, each pinned to a CPU with taskset (a launch sketch follows the example config).

file-0.cfg:
# taskset 000010 ./httpterm -D -f file-0.cfg
global
    maxconn 30000
    ulimit-n 500000
    nbproc 1
    quiet

listen proxy1 10.128.0.0:80
    object weight 1 name test1 code 200 size 200
    clitimeout 10000
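
A launch sketch for the 16 instances (it assumes file-0.cfg to file-15.cfg exist, each listening on its own 10.128.0.N address; the core numbering is an assumption):

# start 16 single-process httpterm instances, each pinned to one core
for n in $(seq 0 15); do
  taskset -c $n ./httpterm -D -f file-$n.cfg
done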

That gives us more connections per second: 278765

That helps get even more requests per second, but we still see stalls at times.