We will try to get back on the road of tuning the client and the server, but to
make it easier to focus on a single side at a time, we will be using a
"good" configuration for the other peer.

Monitoring graphs for the different benches can be found [[http://www.hagtheil.net/files/system/benches10gbps/direct/|here]].

====== Server ======
The main focus was to tune the server so it could handle a lot of
connections.

Changes are made and ordered to get a noticeable gain after each. Some
changes could be done much earlier, but often with small impact.

===== baseline =====
          get 10.128.0.0:80 /

  /root/inject -p 24 -d 60 -u 500 -s 20 -f small-1.txt -S 10.140.0.0-10.140.15.255:1024-65535
  20932 hits/s

Ok, that gives us our baseline. What we can get without even trying.
  
===== All your core are belong to us =====

Nginx's default configuration only has 4 workers. The system sees 24 CPUs.

  +worker_processes 24;

  /root/inject -p 24 -d 60 -u 500 -s 20 -f small-1.txt -S 10.140.0.0-10.140.15.255:1024-65535

Getting some errors in /var/log/nginx/error.log:
  
  [...] accept4() failed (24: Too many open files)
  
Increase the number of open files. That's just memory, and memory is cheap.
Let's say that instead of 1k (ulimit -n shows 1024) we want 1M files
(1048576).
  
  file:/etc/default/nginx
  +ULIMIT="-n 1048576"
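
Not part of the original run, but a quick way to double-check that the new
limit really applies to the running workers is to read it back from procfs:

  # sketch: show the "Max open files" limit of every running nginx worker
  for pid in $(pgrep -f "nginx: worker"); do
          grep "Max open files" /proc/$pid/limits
  done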
  
New error...
  [...] "/var/log/nginx/access.log" failed (28: No space left on device) while logging request [...]
  
Yeah, no more errors.

  /root/inject -p 24 -d 60 -u 500 -s 20 -f small-1.txt -S 10.140.0.0-10.140.15.255:1024-65535
  47875 hits/s

Good... we are getting somewhere.

We have 24 processes that can handle a connection; that's better than 4.

===== Multiple ways to get in =====

There might be some limitation with the bound socket. (Like the kernel
locking the socket to check that the waiting list is not too long before
accepting the connection... pure speculation, code not checked.)

Let's try to replace the single listen with multiple IPs to listen on.

  file: /etc/nginx/sites-enabled/default
  -#listen 80;
  +listen 10.128.0.0:80;
  +listen 10.128.0.1:80;
  [...]
  +listen 10.128.0.23:80;

  New input file: small-24.txt
  new page0 0
          get 10.128.0.0:80 /
  new page1 0
          get 10.128.0.1:80 /
  [...]
  new page23 0
          get 10.128.0.23:80 /

  /root/inject -p 24 -d 60 -u 500 -s 20 -f small-24.txt -S 10.140.0.0-10.140.15.255:1024-65535
  50743 hits/s

Good, it does help not to be limited to a single socket.
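
Writing the 24 listen directives and the matching scenario file by hand is
tedious; a small loop can generate both (a sketch, the output file names are
just placeholders):

  # sketch: emit the 24 listen lines and the matching inject scenario
  for i in $(seq 0 23); do
          echo "listen 10.128.0.$i:80;" >> listen-24.snippet
          printf 'new page%d 0\n        get 10.128.0.%d:80 /\n' $i $i >> small-24.txt
  done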
  
===== sorry to interrupt =====

  [...]
  eth1-TxRx-23 23
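
The exact commands for the queue-to-CPU mapping above are not shown here, but
this kind of pinning is normally done by writing a CPU mask to
/proc/irq/<n>/smp_affinity; a sketch (run as root, interface name taken from
the listing above):

  # sketch: pin each eth1-TxRx-<n> queue to CPU <n> (mask = 1 << cpu, in hex)
  grep eth1-TxRx /proc/interrupts | while read -r line; do
          irq=$(echo "$line" | cut -d: -f1 | tr -d ' ')
          cpu=$(echo "$line" | sed 's/.*eth1-TxRx-//')
          printf '%x' $((1 << cpu)) > /proc/irq/$irq/smp_affinity
  done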
  
  /root/inject -p 24 -d 60 -u 500 -s 20 -f small-24.txt -S 10.140.0.0-10.140.15.255:1024-65535
  53721 hits/s
  
Better.

  +accept_mutex off;
  
  /root/inject -p 24 -d 60 -u 500 -s 20 -f small-24.txt -S 10.140.0.0-10.140.15.255:1024-65535
  97682 hits/s
  
Wow, that much was just due to nginx locking itself, and preventing
workers from accepting connections in parallel.

  +worker_processes 16;
  
  /root/inject -p 24 -d 60 -u 500 -s 20 -f small-24.txt -S 10.140.0.0-10.140.15.255:1024-65535
  126731 hits/s
  
What if we get down to 12?

  +worker_processes 12;
  
  /root/inject -p 24 -d 60 -u 500 -s 20 -f small-24.txt -S 10.140.0.0-10.140.15.255:1024-65535
  138247 hits/s
  
So much for having as many workers as CPUs.
  
What if we decided to split the IRQs onto a few CPUs, and the workers onto
the other CPUs?

By checking the information from
''/sys/bus/cpu/devices/cpu*/topology/{core,thread}_siblings_list'', we get some
idea of how the CPUs map to processors, cores and threads:

^  CPU  ^  processor  ^  core  ^  thread  ^
|  0-5  |  0  |  0-5  |  0  |
|  6-11  |  1  |  0-5  |  0  |
|  12-17  |  0  |  0-5  |  1  |
|  18-23  |  1  |  0-5  |  1  |
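
The table above is just a condensed view of those files; for reference, the
raw data can be dumped like this (a sketch, the output layout varies per
machine):

  # sketch: dump core/thread sibling lists for every CPU
  for c in /sys/bus/cpu/devices/cpu[0-9]*; do
          echo "$c: core=$(cat $c/topology/core_siblings_list) thread=$(cat $c/topology/thread_siblings_list)"
  done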
  
How to split? Let's try different splits.

  irq 0-23 => cpu 0-11,0-11
  workers - cpu 12-23
  184769 hits/s
  
We have 2 real processors with 12 threads each. Let's try one processor for
the IRQs and the other for the workers.

  irq 0-23 => cpu 0-5,12-17,0-5,12-17 (processor #0)
  workers - set on 6-11,18-23 (processor #1)
  190712 hits/s
  
Better.
  
What if we use the first 3 cores (2 threads per core) of each processor for
the IRQs, and the rest for the workers?

  irq 0-23 => cpu 0-2,6-8,12-14,18-20,0-2,6-8,12-14,18-20
  workers - cpu 3-5,9-11,15-17,21-23
  187394 hits/s
  
Not as good.
  
Maybe now that we have a separation we can include a few more workers on the
remaining CPUs.

  irq 0-23 => cpu 0-3,6-9,0-3,6-9,0-3,6-9
  worker - cpu 4,5,10-23
  153129 hits/s
  
Ouch. Not that good...
  
What about one processor for the IRQs... the first 4 cores (both threads)?
  
  irq 0-23 => 0-3,12-15,0-3,12-15,0-3,12-15
  worker - cpu 4-11,16-23
  218857 hits/s
  
Wow, much better. Just changing which threads handle what has a big
impact.
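
For reference (not the exact commands used here), a split like "0-3,12-15"
means round-robining the queues over that CPU list, which can be done with
the list form of the procfs interface:

  # sketch: distribute the eth1 queues round-robin over a chosen CPU list
  cpus="0 1 2 3 12 13 14 15"    # assumption: the 0-3,12-15 split above
  set -- $cpus
  for irq in $(grep eth1-TxRx /proc/interrupts | cut -d: -f1); do
          echo $1 > /proc/irq/$irq/smp_affinity_list
          shift; [ $# -eq 0 ] && set -- $cpus
  done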
  
===== pin the hopper =====

Let's now associate each worker process with a single CPU, so they stop
hopping from one to another.
  
  224544 hits/s
  
And yet better, with just affinity.
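
The exact directive is not shown above; nginx normally pins workers with
''worker_cpu_affinity'', which takes one binary CPU mask per worker. A purely
illustrative sketch for 4 workers on CPUs 4-7 (not the real 16-worker layout
used in this bench):

  file:/etc/nginx/nginx.conf
  worker_processes 4;
  worker_cpu_affinity 00010000 00100000 01000000 10000000;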
  
===== keep it opened =====

  file:/etc/nginx/nginx.conf
  +open_file_cache max=1000;
  236607 hits/s
  
===== I can has cookies =====

  +net.ipv4.tcp_tw_recycle = 1
  +net.ipv4.tcp_tw_reuse = 1
  +net.ipv4.tcp_syncookies = 0
  +net.core.netdev_max_backlog = 1048576
  +net.core.somaxconn = 1048576
  +net.ipv4.tcp_max_syn_backlog = 1048576
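
These keys go in /etc/sysctl.conf, like the tcp_timestamps tweak used later on
the client; the standard way to apply them without rebooting (not specific to
this bench) is:

  # reload /etc/sysctl.conf so the new settings take effect immediately
  sysctl -p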
  
Check how it holds up over a longer period:
  
  /root/inject -p 24 -d 600 -u 500 -s 20 -f small-24.txt -S 10.140.0.0-10.140.15.255:1024-65535
  236103 hits/s

Ok, we can hold 236k connections per second, without hitting any limit
in any log.

===== about client =====

The bench for the server was done with a patched version of inject that
pinned each process to a single CPU, and gathered the network interrupts
on a few CPUs. This was what gave the best results at the time, but further
client tests show it is not optimal.

====== Client ======

Ok, now let's get back to tuning the client. We will reset the client to
a default configuration, and tune it to reach a high hits-per-second rate.

We keep the server in its latest configuration.

We already established that hitting multiple IPs was better than hitting
a single one, so we will keep that part in place.

As our client needs to connect at a high rate, we have to use multiple
source IPs. If we don't, we would soon hit the limit of source ip/port ->
destination ip/port tuples.

Having the client bind to an IP without specifying the port (letting it
be picked from the ephemeral range) would still hit the same limit (at
least under Linux). That means we need a client that binds to a specific
IP AND port for each outgoing connection.

inject seems to do just that. It takes a range of IPs and a range of
ports. It splits the ports between the processes, and tries each port
with every IP in the range before moving to the next port. All IPs in
the range will be used before a process moves on to the next port.

At our connection rate, and hoping to present a nice amount of different
sources, a /20 is used (4096 IPs) along with all the upper ports
(1024 -> 65535), which leaves about 264M ip/port tuples.

Note: at the high rate we reach, it burns through an average of 60 ports
per second, and it would take about 18 minutes before it loops back to
the first ports.
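
Both figures follow directly from the numbers above, assuming a rate of
roughly 240k hits/s:

  # sketch: where the ~60 ports/s and ~18 minutes figures come from
  echo $(( 240000 / 4096 ))                  # ports consumed per second, ~58
  echo $(( (65535 - 1024 + 1) / 58 / 60 ))   # minutes before wrapping, ~18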

===== baseline =====

Let's get a few baselines.

Let's start with 1 process and 1 user:
  /root/inject -d 60 -u 1 -s 20 -f small-24.txt -S 10.140.0.0-10.140.15.255:1024-65535
  4984 hits/s

Ok, that's what a single user can get... about 0.20 ms per query.

===== more processes =====

1 process is nice, but there is no reason not to use more processes, as we
have 24 threads across the two processors.

  /root/inject -p 24 -d 60 -u 1 -s 20 -f small-24.txt -S 10.140.0.0-10.140.15.255:1024-65535
  51080 hits/s

===== interrupt someone else =====

As we can see, CPU#0 is saturated with soft interrupts.
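
Two quick ways to see that for yourself (not from the original page; mpstat
comes from the sysstat package):

  # per-CPU %soft column, refreshed every second
  mpstat -P ALL 1
  # raw per-CPU NET_RX soft-interrupt counters
  grep NET_RX /proc/softirqs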

Let's spread the network IRQs over all CPUs (queues 0-23 to CPUs 0-23).

  /root/inject -p 24 -d 60 -u 1 -s 20 -f small-24.txt -S 10.140.0.0-10.140.15.255:1024-65535
  112035 hits/s

===== more users =====

Let the processes simulate more users.

  /root/inject -p 24 -d 60 -u 500 -s 20 -f small-24.txt -S 10.140.0.0-10.140.15.255:1024-65535
  228367 hits/s

===== no timestamp =====

By default, TCP puts timestamps on its connections. When we are trying to
gain the little performance we are still missing, it can be a good idea not
to set the timestamps. (Note: this can be done on the server OR the client
with similar results.)

  file: /etc/sysctl.conf
  net.ipv4.tcp_timestamps = 0

  /root/inject -p 24 -d 60 -u 500 -s 20 -f small-24.txt -S 10.140.0.0-10.140.15.255:1024-65535
  241193 hits/s

====== dual ======

To check on which side the bottleneck sits, let's try 2 servers, then 2
clients.

Tests are done with the latest configurations (client and server), which
could give 240k hits/s.

===== dual servers =====

We set up a second server with the same configuration, and checked that it
can also handle 240k/s. Then we change the scenario to hit the 24 IPs of
both servers.

  New input file: dual-24.txt
  new page0a 0
          get 10.128.0.0:80 /
  new page0b 0
          get 10.132.0.0:80 /
  new page1a 0
          get 10.128.0.1:80 /
  new page1b 0
          get 10.132.0.1:80 /
  [...]
  new page23a 0
          get 10.128.0.23:80 /
  new page23b 0
          get 10.132.0.23:80 /

  /root/inject -p 24 -d 60 -u 500 -s 20 -f dual-24.txt -S 10.140.0.0-10.140.15.255:1024-65535
  401391 hits/s

Though the client seems to use all its CPU at 240k/s, it can still go up
and handle 400k hits/s. The bottleneck is probably not really on that side.

===== dual clients =====

We set up a second client with the same configuration, and checked that it
can also generate 240k/s.

To launch both clients at the same time, cssh is very nice :)
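
Plain ssh also does the trick if cssh is not around (a sketch; the host names
are placeholders):

  # sketch: start both injecters at roughly the same time
  for h in client1 client2; do
          ssh root@$h "/root/inject -p 24 -d 60 -u 500 -s 20 -f small-24.txt -S 10.140.0.0-10.140.15.255:1024-65535" &
  done
  wait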

  /root/inject -p 24 -d 60 -u 500 -s 20 -f small-24.txt -S 10.140.0.0-10.140.15.255:1024-65535
  123016 hits/s
  121312 hits/s
  total: 244328 hits/s

Ok, the client is clearly not the limitation, as with two clients we get
the same total.

====== conclusions ======

The above benches show the following:

  * As everyone knows, using multiple cores is better than using only one
  * SMP affinity is important, and can make a huge difference
  * under high load, it might be better to segregate core usage (as shown by separating the IRQs and nginx)
  * in a high-load configuration, reducing the number of processes to one per used core is better
  * 240k connections per second is doable with a single host

For some unknown reason (at the time of writing this documentation), the
connection rate drops sharply for 1-2s at times, as can be seen on the
[[http://www.hagtheil.net/files/system/benches10gbps/direct/bench-bad/nginx-bad/elastiques-nginx/|bench-bad/nginx-bad]]
graphs. I tried to avoid using results that trigger such behaviour. Any
ideas/hints on what could produce it are welcome.

====== post-bench ======

After publishing the first benches, someone advised using httpterm instead of
nginx. Unlike nginx, httpterm is aimed only at stress benches, not at serving
real pages.

Benching with a multi-process httpterm directly shows a bug: it still sends
the headers, but fails to send the data. Getting down to 1 process keeps it
running, but obviously does not use all the cores.

As we have 16 cores for the web server, 16 processes with 1 IP each were
launched, each pinned to a CPU with taskset.

  file-0.cfg:
  # taskset 000010 ./httpterm -D -f file-0.cfg
  global
          maxconn 30000
          ulimit-n 500000
          nbproc 1
          quiet
  
  listen proxy1 10.128.0.0:80
          object weight 1 name test1 code 200 size 200
          clitimeout 10000
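
Launching the 16 copies can be scripted; a sketch (the CPU list is an
assumption, mirroring the worker CPUs used for nginx above, and each
file-<n>.cfg listens on its own IP):

  # sketch: one pinned httpterm instance per dedicated CPU
  i=0
  for c in 4 5 6 7 8 9 10 11 16 17 18 19 20 21 22 23; do
          taskset -c $c ./httpterm -D -f file-$i.cfg
          i=$((i + 1))
  done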

That gives us more connections per second: 278765.

That helps get even more requests per second, but we still get some stalls
at times.
  