Differences

This shows you the differences between two versions of the page.

--- system:benches10gbps:direct [2012/09/28 17:10]
ze redone, with stored graph each time
+++ system:benches10gbps:direct [2012/10/04 13:16] (current)
ze add httpterm benches
@@ Line 2: / Line 2: @@
 We will go back over the tuning of client and server, but to
-make it easier to focus on a single side at once, we will be using one
+make it easier to focus on a single side at once, we will be using a
-of the best found configuration for the other peer.
+"good" configuration for the other peer.
+Monitoring graphs for the different benches can be found [[http://www.hagtheil.net/files/system/benches10gbps/direct/|here]].
 ====== Server ======
@@ Line 23: / Line 26: @@
         get 10.128.0.0:80 /
-  /root/inject -b -d 60 -u 500 -s 20 -f small-1.txt -S 10.140.0.0-10.140.15.255:1024-65535
+  /root/inject -p 24 -d 60 -u 500 -s 20 -f small-1.txt -S 10.140.0.0-10.140.15.255:1024-65535
 hits/s
@@ Line 37: / Line 40: @@
   +worker_processes 24;
-  /root/inject -b -d 60 -u 500 -s 20 -f small-1.txt -S 10.140.0.0-10.140.15.255:1024-65535
+  /root/inject -p 24 -d 60 -u 500 -s 20 -f small-1.txt -S 10.140.0.0-10.140.15.255:1024-65535
 Getting some errors in /var/log/nginx/error.log
@@ Line 74: / Line 77: @@
 Yeah, no more errors.
-  /root/inject -b -d 60 -u 500 -s 20 -f small-1.txt -S 10.140.0.0-10.140.15.255:1024-65535
+  /root/inject -p 24 -d 60 -u 500 -s 20 -f small-1.txt -S 10.140.0.0-10.140.15.255:1024-65535
 hits/s
@@ Line 105: / Line 108: @@
           get 10.128.0.23:80 /
-  /root/inject -b -d 60 -u 500 -s 20 -f small-24.txt -S 10.140.0.0-10.140.15.255:1024-65535
+  /root/inject -p 24 -d 60 -u 500 -s 20 -f small-24.txt -S 10.140.0.0-10.140.15.255:1024-65535
 hits/s
@@ Line 126: / Line 129: @@
 eth1-TxRx-23 23
-  /root/inject -b -d 60 -u 500 -s 20 -f small-24.txt -S 10.140.0.0-10.140.15.255:1024-65535
+  /root/inject -p 24 -d 60 -u 500 -s 20 -f small-24.txt -S 10.140.0.0-10.140.15.255:1024-65535
 hits/s
@@ Line 143: / Line 146: @@
   +accept_mutex off;
-  /root/inject -b -d 60 -u 500 -s 20 -f small-24.txt -S 10.140.0.0-10.140.15.255:1024-65535
+  /root/inject -p 24 -d 60 -u 500 -s 20 -f small-24.txt -S 10.140.0.0-10.140.15.255:1024-65535
 hits/s
@@ Line 160: / Line 163: @@
   +worker_processes 16;
-  /root/inject -b -d 60 -u 500 -s 20 -f small-24.txt -S 10.140.0.0-10.140.15.255:1024-65535
+  /root/inject -p 24 -d 60 -u 500 -s 20 -f small-24.txt -S 10.140.0.0-10.140.15.255:1024-65535
 hits/s
@@ Line 169: / Line 172: @@
   +worker_processes 12;
-  /root/inject -b -d 60 -u 500 -s 20 -f small-24.txt -S 10.140.0.0-10.140.15.255:1024-65535
+  /root/inject -p 24 -d 60 -u 500 -s 20 -f small-24.txt -S 10.140.0.0-10.140.15.255:1024-65535
 hits/s
@@ Line 180: / Line 183: @@
 What if we put the IRQs on a few CPUs, and the workers on the other CPUs?
+The information in
+''/sys/bus/cpu/devices/cpu*/topology/{core,thread}_siblings_list'' gives
+an idea of how the CPUs map to processors, cores and threads:
+^  CPU  ^  processor  ^  core  ^  thread  ^
+|  0-5  |  0  |  0-5  |  0  |
+|  6-11  |  1  |  0-5  |  0  |
+|  12-17  |  0  |  0-5  |  1  |
+|  18-23  |  1  |  0-5  |  1  |
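A minimal sketch of how such a table could be derived, assuming the usual Linux sysfs layout. The helper names `parse_cpu_list` and `read_topology` are hypothetical and are not part of the bench. `core_siblings_list` lists the CPUs sharing a physical processor, `thread_siblings_list` the CPUs sharing a core.

```python
import glob
import re


def parse_cpu_list(text):
    """Expand a sysfs CPU list such as "0-5,12-17" into a sorted list of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            first, last = part.split("-")
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def read_topology(base="/sys/bus/cpu/devices"):
    """Return {cpu: (core_siblings, thread_siblings)} for every cpuN found."""
    topology = {}
    for path in glob.glob(base + "/cpu[0-9]*"):
        cpu = int(re.search(r"cpu(\d+)$", path).group(1))
        siblings = []
        for name in ("core_siblings_list", "thread_siblings_list"):
            with open(f"{path}/topology/{name}") as handle:
                siblings.append(parse_cpu_list(handle.read()))
        topology[cpu] = tuple(siblings)
    return topology
```

Calling `read_topology()` on the server and grouping the CPUs that share the same sibling lists should give back the processor and thread columns of the table.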
 How to split? Let's try different splits.
@@ Line 270: / Line 283: @@
 Check how it gets on a longer period :
-  /root/inject -b -d 600 -u 500 -s 20 -f small-$max.txt -S 10.140.0.0-10.140.15.255:1024-65535
+  /root/inject -p 24 -d 600 -u 500 -s 20 -f small-24.txt -S 10.140.0.0-10.140.15.255:1024-65535
 hits/s
 Ok, we can hold 236k connections per second, without hitting any limit
 in any log.
+===== about client =====
+The server bench was done with a patched version of inject that pinned
+each process to a single CPU, with the network interrupts gathered
+on a few CPUs.  This gave the best result at the time, but further
+client tests show it is not optimal.
+====== Client ======
+Ok, now let's get back to tuning the client. We will reset the client to
+a default configuration, and tune it to reach a high number of hits per second.
+We keep the server in the latest configuration.
+We already established that hitting multiple IPs was better than hitting
+a single one. We will keep that part in place.
+As our client needs to connect at a high rate, we have to use multiple
+source IPs. If we don't, we will soon run out of source ip/port ->
+destination ip/port tuples.
+Having the client bind to an IP without specifying the port (letting it
+be taken from the ephemeral port range) would still hit the same limit (at
+least under Linux). That means we need a client that binds to a specific
+IP AND port for each outgoing connection.
+inject seems to do just that. It takes a range of IPs and a range of
+ports. It splits the ports between the processes, and each process uses
+a port with every IP in the range before moving on to the next port.
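The iteration order described above can be sketched as follows. This is only an illustration, not inject's actual code: the function name `source_tuples` is hypothetical, and the way the port range is cut between processes (contiguous, equal slices) is an assumption, since the page does not say how inject splits it.

```python
from ipaddress import IPv4Address


def source_tuples(first_ip, last_ip, first_port, last_port, nproc, proc_index):
    """Yield (ip, port) pairs for one process: every IP is used before the port advances.

    Assumption: the port range is cut into nproc contiguous slices.
    """
    total_ports = last_port - first_port + 1
    per_proc = total_ports // nproc
    start = first_port + proc_index * per_proc
    # The last process also takes the ports left over by the integer division.
    end = last_port + 1 if proc_index == nproc - 1 else start + per_proc
    lo, hi = int(IPv4Address(first_ip)), int(IPv4Address(last_ip))
    for port in range(start, end):
        for ip in range(lo, hi + 1):
            yield str(IPv4Address(ip)), port
```

With `-p 24` and `-S 10.140.0.0-10.140.15.255:1024-65535`, process 0 under this assumption would go through all 4096 addresses on port 1024 before touching port 1025.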
+Given our high connection rate, and hoping to present a nice
+number of different sources, a /20 is used (4096 IPs) along with all
+upper ports (1024 -> 65535), which leaves about 264 million (252 x 2^20) ip/port tuples.
+Note: at the high rate we get, it burns an average of 60 ports per
+second, and takes about 18 minutes before it loops back to
+the first ports.
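The figures above can be checked with a few lines of arithmetic. The 240k connections per second used here is the round figure quoted elsewhere on this page, taken as an assumption for the rate.

```python
ips = 2 ** (32 - 20)            # a /20 holds 4096 addresses
ports = 65535 - 1024 + 1        # 64512 usable source ports
tuples = ips * ports            # 264241152 ip/port pairs

rate = 240_000                  # connections per second (assumed round figure)
ports_per_second = rate / ips   # each port is used once per source IP
wrap_seconds = ports / ports_per_second

print(tuples, tuples // 2**20)                                    # 264241152 252
print(round(ports_per_second, 1), round(wrap_seconds / 60, 1))    # 58.6 18.4
```

So the "252M" is counted in units of 2^20, and the rounded "60 ports per second" and "18 minutes" match a rate of about 240k connections per second.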
+===== baseline =====
+Let's get a few baselines.
+Let's start with 1 process and 1 user.
+  /root/inject -d 60 -u 1 -s 20 -f small-24.txt -S 10.140.0.0-10.140.15.255:1024-65535
+hits/s
+Ok, that's what a single user can get... that's about 0.20 ms per query.
+===== more processes =====
+One process is nice, but there is no reason not to run more processes, as we have
+many threads on the processors.
+  /root/inject -p 24 -d 60 -u 1 -s 20 -f small-24.txt -S 10.140.0.0-10.140.15.255:1024-65535
+hits/s
+===== interrupt someone else =====
+As we can see, CPU#0 is saturated with soft interrupts.
+Let's spread the network IRQs over all CPUs (queues 0-23 to CPUs 0-23).
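A minimal sketch of the one-queue-per-CPU mapping, assuming the Linux `/proc/irq/<irq>/smp_affinity` interface, which takes a hexadecimal CPU bitmask. The function names are hypothetical, and the `eth1-TxRx-<n>` queue naming is borrowed from the server section; the interface name on the client may differ.

```python
def affinity_mask(cpu):
    """Hex bitmask selecting a single CPU, in the form smp_affinity accepts."""
    return format(1 << cpu, "x")


def spread_plan(queues=24, ncpu=24):
    """Map queue index -> mask, one queue per CPU (queue i on CPU i modulo ncpu)."""
    return {queue: affinity_mask(queue % ncpu) for queue in range(queues)}


for queue, mask in spread_plan().items():
    # Only prints the plan. Writing the mask to /proc/irq/<irq>/smp_affinity needs
    # root and the real IRQ number of each queue, as listed in /proc/interrupts.
    print(f"eth1-TxRx-{queue} -> {mask}")
```

The sketch deliberately stops at printing: the IRQ numbers are machine specific and are not given on this page.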
+  /root/inject -p 24 -d 60 -u 1 -s 20 -f small-24.txt -S 10.140.0.0-10.140.15.255:1024-65535
+hits/s
+===== more users =====
+Let each process use more users.
+  /root/inject -p 24 -d 60 -u 500 -s 20 -f small-24.txt -S 10.140.0.0-10.140.15.255:1024-65535
+hits/s
+===== no timestamp =====
+By default, TCP adds timestamps to its connections. When we are
+trying to gain the last bit of performance we are missing, it can be a good
+idea to disable them. (Note: this can be done on the server OR the client
+with similar results.)
+  file: /etc/sysctl.conf
+  net.ipv4.tcp_timestamps = 0
+  /root/inject -p 24 -d 60 -u 500 -s 20 -f small-24.txt -S 10.140.0.0-10.140.15.255:1024-65535
+hits/s
+====== dual ======
+To check on which side we have a bottleneck, let's try with 2
+servers, or 2 clients.
+Tests were done with the latest configurations (client and server), which
+could give 240k hits/s.
+===== dual servers =====
+We set up a second server with the same configuration, and checked that it
+can also handle 240k/s. Then we changed the scenario to hit the 24 IPs
+of both servers.
+  New input file: dual-24.txt
+  new page0a 0
+          get 10.128.0.0:80 /
+  new page0b 0
+          get 10.132.0.0:80 /
+  new page1a 0
+          get 10.128.0.1:80 /
+  new page1b 0
+          get 10.132.0.1:80 /
+  [...]
+  new page23a 0
+          get 10.128.0.23:80 /
+  new page23b 0
+          get 10.132.0.23:80 /
+  /root/inject -p 24 -d 60 -u 500 -s 20 -f dual-24.txt -S 10.140.0.0-10.140.15.255:1024-65535
+hits/s
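A scenario file like dual-24.txt is tedious to write by hand. The sketch below generates one, assuming the elided `[...]` part follows the same pattern as the lines shown and that the `get` lines are indented with eight spaces. The function name `dual_scenario` is hypothetical.

```python
def dual_scenario(prefixes=("10.128.0", "10.132.0"), count=24):
    """Build scenario text with one page per server and per IP, interleaved."""
    lines = []
    for index in range(count):
        for suffix, prefix in zip("ab", prefixes):
            lines.append(f"new page{index}{suffix} 0")
            lines.append(f"        get {prefix}.{index}:80 /")
    return "\n".join(lines) + "\n"
```

Writing the result of `dual_scenario()` to `dual-24.txt` gives 48 pages, alternating between the two servers.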
+Though the client seems to use all its CPU for 240k/s, it can still go
+up and handle 400k hits/s. The bottleneck is probably not really on
+that side.
+===== dual client =====
+We set up a second client with the same configuration, and checked that it
+can also generate 240k/s.
+To launch both clients at the same time, cssh is very nice :)
+  /root/inject -p 24 -d 60 -u 500 -s 20 -f small-24.txt -S 10.140.0.0-10.140.15.255:1024-65535
+hits/s
+hits/s
+  total: 244328 hits/s
+Ok, the client is clearly not the limitation: with two clients, we get
+the same total.
+====== conclusions ======
+The benches above show the following:
+  * As everyone knows, using multiple cores is better than using only one
+  * SMP affinity is important, and can make a huge difference
+  * under high load, it might be better to segregate core usage (as shown by separating IRQs and nginx)
+  * in a high-load configuration, reducing the number of processes to just one per used core is better
+  * 240k connections per second is doable with a single host
+For some unknown reason (at the time of writing this documentation), the
+connection rate drops sharply for 1-2s, as can be seen on the
+[[http://www.hagtheil.net/files/system/benches10gbps/direct/bench-bad/nginx-bad/elastiques-nginx/|bench-bad/nginx-bad]]
+graphs. I tried to avoid using results that triggered such behaviour. Any ideas/hints on what could cause it are welcome.
+====== post-bench ======
+After publishing the first benches, someone advised using httpterm instead of nginx. Unlike nginx, httpterm is aimed only at stress benches, not at serving real pages.
+Benching multi-process httpterm immediately shows a bug: it still sends headers, but fails to send data. Dropping to 1 process keeps it running, but obviously without using all cores.
+As we have 16 cores for the web server, 16 processes with 1 IP each were launched, each pinned to its own CPU with taskset.
+  file-0.cfg:
+  # taskset 000010 ./httpterm -D -f file-0.cfg
+  global
+          maxconn 30000
+          ulimit-n 500000
+          nbproc 1
+          quiet
+  listen proxy1 10.128.0.0:80
+          object weight 1 name test1 code 200 size 200
+          clitimeout 10000
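taskset reads its mask as hexadecimal, with or without a leading `0x`, so the `000010` above selects CPU 4. The sketch below shows the conversion in both directions; the helper names are hypothetical, and the page does not say which 16 CPUs were used for the 16 processes.

```python
def cpus_in_mask(mask):
    """List the CPU numbers selected by a taskset-style hexadecimal mask."""
    value = int(mask, 16)
    return [bit for bit in range(value.bit_length()) if value >> bit & 1]


def mask_for_cpu(cpu, width=6):
    """Zero-padded hexadecimal mask selecting exactly one CPU."""
    return format(1 << cpu, "x").zfill(width)


print(cpus_in_mask("000010"))   # [4]
print(mask_for_cpu(4))          # 000010
```

Each of the 16 httpterm instances would get its own configuration file and its own single-CPU mask built this way.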
+That gives more connections per second: 278765
+That helps get even more requests per second, but we still see some stalls at times.
