We will try to get back on the road of tuning the client and the server, but to
make it easier to focus on a single side at a time, we will be using a
"good" configuration for the other peer.

Monitoring graphs for the different benches can be found [[http://www.hagtheil.net/files/system/benches10gbps/direct/|here]].

====== Server ======
The main focus was to tune the server so it could handle a lot of
connections.

Changes are made and ordered to get a noticeable gain after each. Some
changes could be done much earlier, but often with small impact.

===== baseline =====
          get 10.128.0.0:80 /

  /root/inject -p 24 -d 60 -u 500 -s 20 -f small-1.txt -S 10.140.0.0-10.140.15.255:1024-65535
  20932 hits/s

Ok, that gives us our baseline. What we can get without even trying.
  
===== All your core are belong to us =====

Nginx's default configuration only has 4 workers. The system sees 24 CPUs.

  +worker_processes 24;

  /root/inject -p 24 -d 60 -u 500 -s 20 -f small-1.txt -S 10.140.0.0-10.140.15.255:1024-65535

Getting some errors in /var/log/nginx/error.log:
  
  [...] accept4() failed (24: Too many open files)
  
Increase the number of open files. That's just memory, and memory is cheap.
Let's say that instead of 1k (ulimit -n shows 1024) we want 1M files
(1048576).
  
  file:/etc/default/nginx
  +ULIMIT="-n 1048576"
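
Not part of the original run, but a quick way to double-check that the new
limit really applies to the running workers is to read it back from procfs:

  # sketch: show the "Max open files" limit of every running nginx worker
  for pid in $(pgrep -f "nginx: worker"); do
          grep "Max open files" /proc/$pid/limits
  done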
  
New error...
  [...] "/var/log/nginx/access.log" failed (28: No space left on device) while logging request [...]
  
Yeah, no more errors.

  /root/inject -p 24 -d 60 -u 500 -s 20 -f small-1.txt -S 10.140.0.0-10.140.15.255:1024-65535
  47875 hits/s

Good... we are getting somewhere.

We have 24 processes that can handle a connection; that's better than 4.

===== Multiple ways to get in =====

There might be some limitation with the bound socket. (Like the kernel
locking the socket to check that the waiting list is not too long before
accepting the connection... pure speculation, code not checked.)

Let's try to replace the single listen with multiple IPs to listen on.

  file: /etc/nginx/sites-enabled/default
  -#listen 80;
  +listen 10.128.0.0:80;
  +listen 10.128.0.1:80;
  [...]
  +listen 10.128.0.23:80;

  New input file: small-24.txt
  new page0 0
          get 10.128.0.0:80 /
  new page1 0
          get 10.128.0.1:80 /
  [...]
  new page23 0
          get 10.128.0.23:80 /

  /root/inject -p 24 -d 60 -u 500 -s 20 -f small-24.txt -S 10.140.0.0-10.140.15.255:1024-65535
  50743 hits/s

Good, it does help not to be limited to a single socket.
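
Writing the 24 listen directives and the matching scenario file by hand is
tedious; a small loop can generate both (a sketch, the output file names are
just placeholders):

  # sketch: emit the 24 listen lines and the matching inject scenario
  for i in $(seq 0 23); do
          echo "listen 10.128.0.$i:80;" >> listen-24.snippet
          printf 'new page%d 0\n        get 10.128.0.%d:80 /\n' $i $i >> small-24.txt
  done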
  
===== sorry to interrupt =====

  [...]
  eth1-TxRx-23 23
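
The exact commands for the queue-to-CPU mapping above are not shown here, but
this kind of pinning is normally done by writing a CPU mask to
/proc/irq/<n>/smp_affinity; a sketch (run as root, interface name taken from
the listing above):

  # sketch: pin each eth1-TxRx-<n> queue to CPU <n> (mask = 1 << cpu, in hex)
  grep eth1-TxRx /proc/interrupts | while read -r line; do
          irq=$(echo "$line" | cut -d: -f1 | tr -d ' ')
          cpu=$(echo "$line" | sed 's/.*eth1-TxRx-//')
          printf '%x' $((1 << cpu)) > /proc/irq/$irq/smp_affinity
  done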
  
  /root/inject -p 24 -d 60 -u 500 -s 20 -f small-24.txt -S 10.140.0.0-10.140.15.255:1024-65535
  53721 hits/s
  
Better.

  +accept_mutex off;
  
  /root/inject -p 24 -d 60 -u 500 -s 20 -f small-24.txt -S 10.140.0.0-10.140.15.255:1024-65535
  97682 hits/s
  
Wow, that much was just due to nginx locking itself, and preventing
workers from accepting connections in parallel.

  +worker_processes 16;
  
  /root/inject -p 24 -d 60 -u 500 -s 20 -f small-24.txt -S 10.140.0.0-10.140.15.255:1024-65535
  126731 hits/s
  
What if we get down to 12?

  +worker_processes 12;
  
  /root/inject -p 24 -d 60 -u 500 -s 20 -f small-24.txt -S 10.140.0.0-10.140.15.255:1024-65535
  138247 hits/s
  
So much for having as many workers as CPUs.
  
What if we decided to split the IRQs onto a few CPUs, and the workers onto
the other CPUs?

By checking the information from
''/sys/bus/cpu/devices/cpu*/topology/{core,thread}_siblings_list'', we get some
idea of how the CPUs map to processors, cores and threads:

^  CPU  ^  processor  ^  core  ^  thread  ^
|  0-5  |  0  |  0-5  |  0  |
|  6-11  |  1  |  0-5  |  0  |
|  12-17  |  0  |  0-5  |  1  |
|  18-23  |  1  |  0-5  |  1  |
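
The table above is just a condensed view of those files; for reference, the
raw data can be dumped like this (a sketch, the output layout varies per
machine):

  # sketch: dump core/thread sibling lists for every CPU
  for c in /sys/bus/cpu/devices/cpu[0-9]*; do
          echo "$c: core=$(cat $c/topology/core_siblings_list) thread=$(cat $c/topology/thread_siblings_list)"
  done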
  
How to split? Let's try different splits.

  irq 0-23 => cpu 0-11,0-11
  workers - cpu 12-23
  184769 hits/s
  
We have 2 real processors with 12 threads each. Let's try one processor for
the IRQs and the other for the workers.

  irq 0-23 => cpu 0-5,12-17,0-5,12-17 (processor #0)
  workers - set on 6-11,18-23 (processor #1)
  190712 hits/s
  
Better.
  
What if we use the first 3 cores (2 threads per core) of each processor for
the IRQs, and the rest for the workers?

  irq 0-23 => cpu 0-2,6-8,12-14,18-20,0-2,6-8,12-14,18-20
  workers - cpu 3-5,9-11,15-17,21-23
  187394 hits/s
  
Not as good.
  
Maybe now that we have a separation we can include a few more workers on the
remaining CPUs.

  irq 0-23 => cpu 0-3,6-9,0-3,6-9,0-3,6-9
  worker - cpu 4,5,10-23
  153129 hits/s
  
Ouch. Not that good...
  
What about one processor for the IRQs... the first 4 cores (both threads)?
  
  irq 0-23 => 0-3,12-15,0-3,12-15,0-3,12-15
  worker - cpu 4-11,16-23
  218857 hits/s
  
Wow, much better. Just changing which threads handle what has a big
impact.
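
For reference (not the exact commands used here), a split like "0-3,12-15"
means round-robining the queues over that CPU list, which can be done with
the list form of the procfs interface:

  # sketch: distribute the eth1 queues round-robin over a chosen CPU list
  cpus="0 1 2 3 12 13 14 15"    # assumption: the 0-3,12-15 split above
  set -- $cpus
  for irq in $(grep eth1-TxRx /proc/interrupts | cut -d: -f1); do
          echo $1 > /proc/irq/$irq/smp_affinity_list
          shift; [ $# -eq 0 ] && set -- $cpus
  done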
  
===== pin the hopper =====

Let's now associate each worker process with a single CPU, so they stop
hopping from one to another.
  
  224544 hits/s
  
And yet better, with just affinity.
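
The exact directive is not shown above; nginx normally pins workers with
''worker_cpu_affinity'', which takes one binary CPU mask per worker. A purely
illustrative sketch for 4 workers on CPUs 4-7 (not the real 16-worker layout
used in this bench):

  file:/etc/nginx/nginx.conf
  worker_processes 4;
  worker_cpu_affinity 00010000 00100000 01000000 10000000;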
  
===== keep it opened =====

  file:/etc/nginx/nginx.conf
  +open_file_cache max=1000;
  236607 hits/s
  
===== I can has cookies =====

  +net.ipv4.tcp_tw_recycle = 1
  +net.ipv4.tcp_tw_reuse = 1
  +net.ipv4.tcp_syncookies = 0
  +net.core.netdev_max_backlog = 1048576
  +net.core.somaxconn = 1048576
  +net.ipv4.tcp_max_syn_backlog = 1048576
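
These keys go in /etc/sysctl.conf, like the tcp_timestamps tweak used later on
the client; the standard way to apply them without rebooting (not specific to
this bench) is:

  # reload /etc/sysctl.conf so the new settings take effect immediately
  sysctl -p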
  
Check how it holds up over a longer period:
  
  /root/inject -p 24 -d 600 -u 500 -s 20 -f small-24.txt -S 10.140.0.0-10.140.15.255:1024-65535
  236103 hits/s

Ok, we can hold 236k connections per second, without hitting any limit
in any log.

===== about client =====

The bench for the server was done with a patched version of inject that
pinned each process to a single CPU, and gathered the network interrupts
on a few CPUs. This was what gave the best results at the time, but further
client tests show it is not optimal.

====== Client ======

Ok, now let's get back to tuning the client. We will reset the client to
a default configuration, and tune it to reach a high hits-per-second rate.

We keep the server in its latest configuration.

We already established that hitting multiple IPs was better than hitting
a single one, so we will keep that part in place.

As our client needs to connect at a high rate, we have to use multiple
source IPs. If we don't, we would soon hit the limit of source ip/port ->
destination ip/port tuples.

Having the client bind to an IP without specifying the port (letting it
be picked from the ephemeral range) would still hit the same limit (at
least under Linux). That means we need a client that binds to a specific
IP AND port for each outgoing connection.

inject seems to do just that. It takes a range of IPs and a range of
ports. It splits the ports between the processes, and tries each port
with every IP in the range before moving to the next port. All IPs in
the range will be used before a process moves on to the next port.

At our connection rate, and hoping to present a nice amount of different
sources, a /20 is used (4096 IPs) along with all the upper ports
(1024 -> 65535), which leaves about 264M ip/port tuples.

Note: at the high rate we reach, it burns through an average of 60 ports
per second, and it would take about 18 minutes before it loops back to
the first ports.
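
Both figures follow directly from the numbers above, assuming a rate of
roughly 240k hits/s:

  # sketch: where the ~60 ports/s and ~18 minutes figures come from
  echo $(( 240000 / 4096 ))                  # ports consumed per second, ~58
  echo $(( (65535 - 1024 + 1) / 58 / 60 ))   # minutes before wrapping, ~18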

===== baseline =====

Let's get a few baselines.

Let's start with 1 process and 1 user:
  /root/inject -d 60 -u 1 -s 20 -f small-24.txt -S 10.140.0.0-10.140.15.255:1024-65535
  4984 hits/s

Ok, that's what a single user can get... about 0.20 ms per query.

===== more processes =====

1 process is nice, but there is no reason not to use more processes, as we
have 24 threads across the two processors.

  /root/inject -p 24 -d 60 -u 1 -s 20 -f small-24.txt -S 10.140.0.0-10.140.15.255:1024-65535
  51080 hits/s

===== interrupt someone else =====

As we can see, CPU#0 is saturated with soft interrupts.
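
Two quick ways to see that for yourself (not from the original page; mpstat
comes from the sysstat package):

  # per-CPU %soft column, refreshed every second
  mpstat -P ALL 1
  # raw per-CPU NET_RX soft-interrupt counters
  grep NET_RX /proc/softirqs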

Let's spread the network IRQs over all CPUs (queues 0-23 to CPUs 0-23).

  /root/inject -p 24 -d 60 -u 1 -s 20 -f small-24.txt -S 10.140.0.0-10.140.15.255:1024-65535
  112035 hits/s

===== more users =====

Let the processes simulate more users.

  /root/inject -p 24 -d 60 -u 500 -s 20 -f small-24.txt -S 10.140.0.0-10.140.15.255:1024-65535
  228367 hits/s

===== no timestamp =====

By default, TCP puts timestamps on its connections. When we are trying to
gain the little performance we are still missing, it can be a good idea not
to set the timestamps. (Note: this can be done on the server OR the client
with similar results.)

  file: /etc/sysctl.conf
  net.ipv4.tcp_timestamps = 0

  /root/inject -p 24 -d 60 -u 500 -s 20 -f small-24.txt -S 10.140.0.0-10.140.15.255:1024-65535
  241193 hits/s

====== dual ======

To check on which side the bottleneck sits, let's try 2 servers, then 2
clients.

Tests are done with the latest configurations (client and server), which
could give 240k hits/s.

===== dual servers =====

We set up a second server with the same configuration, and checked that it
can also handle 240k/s. Then we change the scenario to hit the 24 IPs of
both servers.

  New input file: dual-24.txt
  new page0a 0
          get 10.128.0.0:80 /
  new page0b 0
          get 10.132.0.0:80 /
  new page1a 0
          get 10.128.0.1:80 /
  new page1b 0
          get 10.132.0.1:80 /
  [...]
  new page23a 0
          get 10.128.0.23:80 /
  new page23b 0
          get 10.132.0.23:80 /

  /root/inject -p 24 -d 60 -u 500 -s 20 -f dual-24.txt -S 10.140.0.0-10.140.15.255:1024-65535
  401391 hits/s

Though the client seems to use all its CPU at 240k/s, it can still go up
and handle 400k hits/s. The bottleneck is probably not really on that side.

===== dual clients =====

We set up a second client with the same configuration, and checked that it
can also generate 240k/s.

To launch both clients at the same time, cssh is very nice :)
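
Plain ssh also does the trick if cssh is not around (a sketch; the host names
are placeholders):

  # sketch: start both injecters at roughly the same time
  for h in client1 client2; do
          ssh root@$h "/root/inject -p 24 -d 60 -u 500 -s 20 -f small-24.txt -S 10.140.0.0-10.140.15.255:1024-65535" &
  done
  wait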

  /root/inject -p 24 -d 60 -u 500 -s 20 -f small-24.txt -S 10.140.0.0-10.140.15.255:1024-65535
  123016 hits/s
  121312 hits/s
  total: 244328 hits/s

Ok, the client is clearly not the limitation, as with two clients we get
the same total.

====== conclusions ======

The above benches show the following:

  * As everyone knows, using multiple cores is better than using only one
  * SMP affinity is important, and can make a huge difference
  * under high load, it might be better to segregate core usage (as shown by separating the IRQs and nginx)
  * in a high-load configuration, reducing the number of processes to one per used core is better
  * 240k connections per second is doable with a single host

For some unknown reason (at the time of writing this documentation), the
connection rate drops sharply for 1-2s at times, as can be seen on the
[[http://www.hagtheil.net/files/system/benches10gbps/direct/bench-bad/nginx-bad/elastiques-nginx/|bench-bad/nginx-bad]]
graphs. I tried to avoid using results that trigger such behaviour. Any
ideas/hints on what could produce it are welcome.

====== post-bench ======

After publishing the first benches, someone advised using httpterm instead of
nginx. Unlike nginx, httpterm is aimed only at stress benches, not at serving
real pages.

Benching with a multi-process httpterm directly shows a bug: it still sends
the headers, but fails to send the data. Getting down to 1 process keeps it
running, but obviously does not use all the cores.

As we have 16 cores for the web server, 16 processes with 1 IP each were
launched, each pinned to a CPU with taskset.

  file-0.cfg:
  # taskset 000010 ./httpterm -D -f file-0.cfg
  global
          maxconn 30000
          ulimit-n 500000
          nbproc 1
          quiet
  
  listen proxy1 10.128.0.0:80
          object weight 1 name test1 code 200 size 200
          clitimeout 10000
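
Launching the 16 copies can be scripted; a sketch (the CPU list is an
assumption, mirroring the worker CPUs used for nginx above, and each
file-<n>.cfg listens on its own IP):

  # sketch: one pinned httpterm instance per dedicated CPU
  i=0
  for c in 4 5 6 7 8 9 10 11 16 17 18 19 20 21 22 23; do
          taskset -c $c ./httpterm -D -f file-$i.cfg
          i=$((i + 1))
  done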

That gives us more connections per second: 278765.

That helps get even more requests per second, but we still get some stalls
at times.
  