  
We will try to get back to tuning the client and the server but, to make
it easier to focus on a single side at a time, we will be using a
"good" configuration for the other peer.

Monitoring graphs for the different benches can be found
[[http://www.hagtheil.net/files/system/benches10gbps/direct/|here]].
  
====== Server ======
          get 10.128.0.0:80 /

  /root/inject -p 24 -d 60 -u 500 -s 20 -f small-1.txt -S 10.140.0.0-10.140.15.255:1024-65535
  20932 hits/s
  
  +worker_processes 24;

  /root/inject -p 24 -d 60 -u 500 -s 20 -f small-1.txt -S 10.140.0.0-10.140.15.255:1024-65535

Getting some errors in /var/log/nginx/error.log
Yeah, no more errors.

  /root/inject -p 24 -d 60 -u 500 -s 20 -f small-1.txt -S 10.140.0.0-10.140.15.255:1024-65535
  47875 hits/s
  
          get 10.128.0.23:80 /

  /root/inject -p 24 -d 60 -u 500 -s 20 -f small-24.txt -S 10.140.0.0-10.140.15.255:1024-65535
  50743 hits/s
  
eth1-TxRx-23 23

  /root/inject -p 24 -d 60 -u 500 -s 20 -f small-24.txt -S 10.140.0.0-10.140.15.255:1024-65535
  53721 hits/s
  
  +accept_mutex off;

  /root/inject -p 24 -d 60 -u 500 -s 20 -f small-24.txt -S 10.140.0.0-10.140.15.255:1024-65535
  97682 hits/s
  
  +worker_processes 16;

  /root/inject -p 24 -d 60 -u 500 -s 20 -f small-24.txt -S 10.140.0.0-10.140.15.255:1024-65535
  126731 hits/s
  
  +worker_processes 12;

  /root/inject -p 24 -d 60 -u 500 -s 20 -f small-24.txt -S 10.140.0.0-10.140.15.255:1024-65535
  138247 hits/s
  
  
What if we split the IRQs onto a few CPUs, and the workers onto the
other CPUs?

By checking the information in
''/sys/bus/cpu/devices/cpu*/topology/{core,thread}_siblings_list'', we
get an idea of how the CPUs map to physical processors, cores and
threads:

^  CPU  ^  processor  ^  core  ^  thread  ^
|  0-5  |  0  |  0-5  |  0  |
|  6-11  |  1  |  0-5  |  0  |
|  12-17  |  0  |  0-5  |  1  |
|  18-23  |  1  |  0-5  |  1  |
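
A quick sketch (not the exact commands used for these benches) of dumping that mapping straight from sysfs:

  # package / core / hyperthread-sibling info for each CPU
  for c in /sys/bus/cpu/devices/cpu[0-9]*; do
      echo "$(basename $c): package=$(cat $c/topology/physical_package_id)" \
           "core=$(cat $c/topology/core_id)" \
           "siblings=$(cat $c/topology/thread_siblings_list)"
  done

The split itself then comes down to writing CPU bitmasks to ''/proc/irq/<n>/smp_affinity'' for the NIC queues, and keeping the nginx workers on the remaining CPUs (for instance with the ''worker_cpu_affinity'' directive).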
  
How to split? Let's try different splits.
Check how it holds over a longer period:

  /root/inject -p 24 -d 600 -u 500 -s 20 -f small-24.txt -S 10.140.0.0-10.140.15.255:1024-65535
  236103 hits/s

Ok, we can hold 236k connections per second, without hitting any limit
in any log.

===== about client =====

The server bench was done with a patched version of inject that pinned
each process to a single CPU and gathered the network interrupts on a
few CPUs. This is what gave the best result at the time, but further
client tests show it is not optimal.

====== Client ======

Ok, now let's get back to tuning the client. We will reset the client to
a default configuration, and tune it to get up to a high number of hits
per second.

We keep the server in its latest configuration.

We already established that hitting multiple IPs was better than hitting
a single one. We will keep that part in place.

As our client needs to connect at a high rate, we have to use multiple
source IPs. If we don't, we soon hit the limit on distinct source
ip/port -> destination ip/port tuples.
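
With a single source IP, every connection to one destination ip/port has
to take its source port from the ephemeral range, which is rather small
by default (a quick check, not part of the bench itself):

  # ephemeral port range: bounds the source ports available per source IP
  cat /proc/sys/net/ipv4/ip_local_port_range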

Having the client bind to an IP without specifying the port (letting it
be picked from the ephemeral range) would still hit the same limit (at
least under Linux). That means we need a client that binds to a specific
IP AND port for each outgoing connection.
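
Just to illustrate the idea with a standard tool (this is not how the
bench is run), curl can force both the source address and the source
port of a single request:

  # hypothetical one-shot request bound to source 10.140.0.1:2000
  curl --interface 10.140.0.1 --local-port 2000 http://10.128.0.0/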

inject seems to do just that. It takes a range of IPs and a range of
ports. It splits the ports between its processes, and each process uses
its current port with every IP in the range before moving on to the
next port.

Given our high connection rate, and to present a decent number of
different sources, a /20 is used (4096 IPs) along with all the upper
ports (1024 -> 65535), which gives about 264M source ip/port tuples
(4096 x 64512).

Note: at the rate we end up with (about 240k hits/s spread over 4096
source IPs), an average of about 60 ports is consumed per second, so it
takes about 18 minutes before we loop back to the first ports.

===== baseline =====

Let's get a few baselines.

Let's start with 1 process and 1 user:
  /root/inject -d 60 -u 1 -s 20 -f small-24.txt -S 10.140.0.0-10.140.15.255:1024-65535
  4984 hits/s

Ok, that's what a single user can get... that's about 0.20 ms per query.

===== more processes =====

1 process is nice, but there is no reason not to use more processes, as
we have 24 hardware threads on the processors.

  /root/inject -p 24 -d 60 -u 1 -s 20 -f small-24.txt -S 10.140.0.0-10.140.15.255:1024-65535
  51080 hits/s

===== interrupt someone else =====

As we can see, CPU#0 is saturated with soft interrupts.

Let's spread the network IRQs over all the CPUs (queues 0-23 to CPUs
0-23), for instance as sketched below.
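
A sketch of doing that by hand, assuming the 24 queues show up as
eth1-TxRx-0 to eth1-TxRx-23 in ''/proc/interrupts'' (as on the server
side); this is not necessarily the exact script used here:

  # send the interrupt of queue N to CPU N (smp_affinity takes a hex CPU bitmask)
  i=0
  grep eth1-TxRx /proc/interrupts | cut -d: -f1 | while read irq; do
      printf '%x\n' $((1 << i)) > /proc/irq/$irq/smp_affinity
      i=$((i + 1))
  done

With the interrupts spread out, the same bench is run again: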

  /root/inject -p 24 -d 60 -u 1 -s 20 -f small-24.txt -S 10.140.0.0-10.140.15.255:1024-65535
  112035 hits/s

===== more users =====

Let the processes simulate more users.

  /root/inject -p 24 -d 60 -u 500 -s 20 -f small-24.txt -S 10.140.0.0-10.140.15.255:1024-65535
  228367 hits/s

===== no timestamp =====

By default, TCP puts timestamps on its connections. Since we are trying
to gain the little performance we are still missing, it can be a good
idea not to set them. (Note: this can be done on the server OR the
client, with similar results.)

  file: /etc/sysctl.conf
  net.ipv4.tcp_timestamps = 0
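
To apply it without rebooting, the same setting can also be pushed at
runtime:

  # takes effect immediately; /etc/sysctl.conf keeps it across reboots
  sysctl -w net.ipv4.tcp_timestamps=0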

  /root/inject -p 24 -d 60 -u 500 -s 20 -f small-24.txt -S 10.140.0.0-10.140.15.255:1024-65535
  241193 hits/s

====== dual ======

To check on which side the bottleneck is, let's try with 2 servers, or
with 2 clients.

Tests are done with the latest configurations (client and server), which
could give 240k hits/s.

===== dual servers =====

We set up a second server with the same configuration, and checked that
it can also handle the 240k/s. Then we change the scenario to hit the 24
IPs of both servers.

  New input file: dual-24.txt
  new page0a 0
          get 10.128.0.0:80 /
  new page0b 0
          get 10.132.0.0:80 /
  new page1a 0
          get 10.128.0.1:80 /
  new page1b 0
          get 10.132.0.1:80 /
  [...]
  new page23a 0
          get 10.128.0.23:80 /
  new page23b 0
          get 10.132.0.23:80 /
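
Writing the 48 entries by hand is tedious; a small loop can generate the
file (a sketch assuming the format above, the page names being arbitrary
labels):

  for i in $(seq 0 23); do
      printf 'new page%da 0\n        get 10.128.0.%d:80 /\n' $i $i
      printf 'new page%db 0\n        get 10.132.0.%d:80 /\n' $i $i
  done > dual-24.txt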

  /root/inject -p 24 -d 60 -u 500 -s 20 -f dual-24.txt -S 10.140.0.0-10.140.15.255:1024-65535
  401391 hits/s

Though the client seemed to be using all of its CPU at 240k/s, it can
still go up and handle 400k hits/s. The bottleneck is probably not
really on that side.

===== dual client =====

We set up a second client with the same configuration, and checked that
it can also generate the 240k/s.

To launch both clients at the same time, cssh is very nice :)

  /root/inject -p 24 -d 60 -u 500 -s 20 -f small-24.txt -S 10.140.0.0-10.140.15.255:1024-65535
  123016 hits/s
  121312 hits/s
  total: 244328 hits/s

Ok, the client is clearly not the limitation, as with two clients we get
the same total.

====== conclusions ======

The above benches show the following:

  * As everyone knows, using multiple cores is better than using only one
  * SMP affinity is important, and can make a huge difference
  * under high load, it can be better to segregate core usage (as shown by separating the IRQs and nginx)
  * in a high-load configuration, reducing the number of processes to one per used core is better
  * 240k connections per second is doable with a single host

For some unknown reason (at the time of writing this documentation), the
connection rate drops sharply for 1-2 s at times, as can be seen on the
[[http://www.hagtheil.net/files/system/benches10gbps/direct/bench-bad/nginx-bad/elastiques-nginx/|bench-bad/nginx-bad]]
graphs. I tried to avoid using results triggering such behaviour. Any
ideas/hints on what could produce this are welcome.

====== post-bench ======

After publishing the first benches, someone advised using httpterm
instead of nginx. Unlike nginx, httpterm is aimed only at stress
benches, not at serving real pages.

Benching with a multi-process httpterm directly shows a bug: it still
sends the headers, but fails to send the data. Going down to 1 process
keeps it working, but obviously does not use all the cores.

As we have 16 CPUs for the web server, 16 single-process instances with
1 IP each were launched, each pinned to a CPU with taskset, as sketched
after the configuration below.

  file-0.cfg:
  # taskset 000010 ./httpterm -D -f file-0.cfg
  global
          maxconn 30000
          ulimit-n 500000
          nbproc 1
          quiet
  
  listen proxy1 10.128.0.0:80
          object weight 1 name test1 code 200 size 200
          clitimeout 10000
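
Launching the 16 instances then looks roughly like this (a sketch: the
file-0.cfg to file-15.cfg names, each listening on its own IP, and the
CPU numbering are assumptions, not the exact commands used):

  # one single-process httpterm per instance, each pinned to its own CPU
  # (here instance i simply goes to CPU i; pick CPUs away from the NIC IRQs)
  for i in $(seq 0 15); do
      taskset $(printf '%x' $((1 << i))) ./httpterm -D -f file-$i.cfg
  done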

That gives us more connections per second: 278765

That helps get even more requests per second, but we still see some
stalls at times.
  