IEPM

Bulk throughput measurements

Bulk throughput measurements | Bulk throughput simulation | Windows vs. streams | Effect of load on RTT and loss | Bulk file transfer measurements | QBSS measurements
Les Cottrell. Last update February 13, 2003

Introduction | Methodology | Determining System Configuration | Pathchar and Traceroutes | Measurement duration | CPU Utilization | Streams and Window Sizes | Summary

Sample Bulk Throughput Measurements

Cal Tech | APAN-JP | CERN | Colorado | IN2P3 | INFN/Rome | INFN/Padova | Daresbury | Manchester University | Rutherford | Stanford to Daresbury | ANL | BNL | FNAL | GSFC | JLab | LANL | LBL | NERSC | Pacific Northwest GigaPoP | ORNL | Rice | RIKEN | SDSC | SLAC LAN | SLAC To Stanford's Campus | Southern eXchange GigaPoP in Atlanta | TRIUMF | University College London | UFL | UIUC | University of Michigan | UT Dallas | Wisconsin

Introduction

With the success of BaBar and the need to support multiple remote computer centers, high performance networking between the remote computer centers and SLAC became imperative. To assist with understanding the performance, we set out to measure bulk data flows to critical sites (BaBar regional centers etc.) using the production networks available today, to see how well they performed.

For more information on how we tuned the TCP stack and application to optimize bulk-data throughput and the impact on hosts etc. see High Performance Throughput Tuning/Measurements. For information on some experimental high speed transfer measurements see SC2001 bulk thruput measurements to SLAC.

Methodology

To begin, we selected a set of sites that were relevant to BaBar. Typically they were either BaBar regional computing centers or sites with high throughput requirements to and from SLAC. At each of these sites we needed to install an iperf server on a Unix host where we had an account, and ideally to have access to link utilization information.

For each site we used iperf to send TCP bulk data from a client at SLAC to an iperf server at the remote site. Each bulk data transfer ran for 10 seconds. For each site we sent 10 second transfers at each of the following window sizes: 8kBytes, 1024kBytes, 16kBytes, 512kBytes, 32kBytes, 256kBytes, 128kBytes, 64kBytes and 55kBytes. For each window size we used different numbers of parallel data streams to comprise each transfer. The numbers of streams used were 1, 20, 5, 12, 8, 10, 4, 11, 3, 7, 15, 9, 2 and 6. The sequences of window sizes and numbers of parallel streams were as given above, and were deliberately chosen so that they did not monotonically increase or decrease. Simultaneously with the data transfer we also sent ten 100 byte pings separated by 1 second, each with a 20 second timeout. Following each transfer we sent ten more pings. The idea of the two sets of pings was to evaluate the RTT with and without the bulk data transfer. No consideration was given to "fairness" or shared TCP blocks (see RFC 2140). A sketch of the measurement loop is given below.
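
The following is a minimal sketch of the measurement loop described above, not the exact script we ran. The host name remote.example.org and the log file names are placeholders, and the ping options shown are Linux-style (the Solaris ping syntax differs slightly):

#!/bin/sh
# Window sizes and stream counts, deliberately not in monotonic order
WINDOWS="8K 1024K 16K 512K 32K 256K 128K 64K 55K"
STREAMS="1 20 5 12 8 10 4 11 3 7 15 9 2 6"
HOST=remote.example.org    # placeholder for the remote iperf server

for w in $WINDOWS; do
  for p in $STREAMS; do
    # 10 second TCP transfer with the given window size and number of parallel streams
    iperf -c $HOST -t 10 -w $w -P $p > iperf_${w}_${p}.log 2>&1 &
    # ten 100 byte pings, 1 second apart, while the transfer is running (loaded RTT)
    ping -c 10 -s 100 $HOST > ping_during_${w}_${p}.log 2>&1
    wait
    # ten more pings after the transfer completes (unloaded RTT)
    ping -c 10 -s 100 $HOST > ping_after_${w}_${p}.log 2>&1
  done
done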

We saved the data and analyzed it with Excel.

Determining System Configuration

All hosts were running either Solaris or Linux. We determined the operating system of each host using the Unix uname -a command. For Solaris we determined the hardware configuration using the command /usr/platform/`uname -i`/sbin/prtdiag -v if available, or /usr/sbin/psrinfo -v otherwise. For Linux we used the command more /proc/cpuinfo. Using iperf on Linux generally limited the window size that could be used to <= 128KBytes, and typically the window size granted was double that requested by the client. On Solaris we determined the link speed using ifconfig -a, and assumed that ge0 identified a Gbit Ethernet interface and hme0 a 100Mbps interface. If we had sudo access to ndd we were also able to verify this using ndd -get /dev/ge0 link_speed. For example:

ccasn07:tcsh[48] ifconfig -a
lo0: flags=849 mtu 8232
        inet 127.0.0.1 netmask ff000000
ge0: flags=863 mtu 1500
        inet 134.158.105.27 netmask fffff800 broadcast 134.158.111.255
ccasn07:tcsh[49] ndd -get /dev/ge link_speed
couldn't push module 'ge', No such device or address
ccasn07:tcsh[50] sudo ndd -get /dev/ge link_speed
Password:
1000

To find out the TCP buffer window settings use the commands below (for more information see TCP Tuning Guide and How to achieve Gigabit speeds with Linux, Ipsysctl tutorial, Take charge of processor affinity, Identify performance bottlenecks with OProfile for Linux, TCP Tuning and Network Troubleshooting):
Solaris:
  ndd /dev/tcp tcp_max_buf        (4194304)
  ndd /dev/tcp tcp_cwnd_max       (2097152)
  ndd /dev/tcp tcp_xmit_hiwat     (65536)
  ndd /dev/tcp tcp_recv_hiwat     (65536)

Linux 2.2.x:
  more /proc/sys/net/core/wmem_max       (8388608)
  more /proc/sys/net/core/rmem_max       (8388608)
  more /proc/sys/net/core/rmem_default   (65536)
  more /proc/sys/net/core/wmem_default   (65536)

Linux 2.4.x:
  more /proc/sys/net/ipv4/tcp_rmem       (4096 87380 4194304)
  more /proc/sys/net/ipv4/tcp_wmem       (4096 65536 4194304)
  more /proc/sys/net/core/wmem_max       (8388608)
  more /proc/sys/net/core/rmem_max       (8388608)
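
With root access these limits can also be raised. The following commands are a sketch only; the values shown are illustrative, not recommendations:

# Solaris
ndd -set /dev/tcp tcp_max_buf 4194304
ndd -set /dev/tcp tcp_xmit_hiwat 4194304
ndd -set /dev/tcp tcp_recv_hiwat 4194304
# Linux 2.4.x
echo 8388608 > /proc/sys/net/core/wmem_max
echo 8388608 > /proc/sys/net/core/rmem_max
echo "4096 87380 4194304" > /proc/sys/net/ipv4/tcp_rmem
echo "4096 65536 4194304" > /proc/sys/net/ipv4/tcp_wmem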

Pathchar and Traceroutes

We also used pathchar to characterize the paths, and traceroute to get the routes (in both directions where possible). See Pathchar measurements from SLAC and Traceroutes from SLAC. The bottleneck bandwidth and RTT can also be measured with pathchar, which then uses these to predict the number of bytes required to fill the pipe. The relevant pathchar measurements are available.
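
As a rough illustration (using numbers that appear later on this page rather than pathchar output), a path with a 450Mbits/s bottleneck and a 135ms RTT holds about 450e6/8 x 0.135 ≈ 7.6 million bytes, i.e. roughly 7MBytes, which is the window needed to keep such a pipe full with a single TCP stream.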

Measurement duration

To evaluate the effect of the duration of the individual measurements, for a single stream, we selected durations of 2, 5, 10, 20, 40, 80, 160, 250 and 320 seconds, and window sizes of 256, 512, 1024, 2048 and 4096 KBytes. For each of the possible (window, duration) pairs we made 17 single stream measurements of the iperf TCP throughput from plato.cacr.caltech.edu to pharlap.slac.stanford.edu. The results are shown to the right. The points are the medians of each set of 17 measurements, and the error bars are determined from the Inter Quartile Ranges (IQRs). Though the medians continue to rise for durations of over 10 seconds (by about 10% going from 10 to 20 seconds), to within the accuracy of the measurements this is a small effect. So for most of our measurements we settled on 10 second durations.

Figure: Effect of measurement duration on throughput

To see whether larger RTTs might result in a different conclusion, we repeated the measurement with window sizes of 64, 128, 256, 512, 1024, 2048 and 4096 KBytes from pharlap.slac.stanford.edu to ccasn07.in2p3.fr for the same durations as above. The RTT in this case was min/avg/max (std) 178/178/181 (0.460) msec. Each point in the plot below is the median of about 20 measurements. Again it is seen that there is little to be gained by measuring for longer than 10 seconds.

To further investigate the 10 second knee in the throughput versus duration plot, Brian Tierney of CERN/LBL measured the iperf TCP throughput from LBL to hosts at CERN and ANL over a 24 hour period on December 12 '01. Each duration setting was measured 6 times and the results for a single host pair and duration averaged. The CERN "edge" host is pcgiga, which is outside the CERN firewall; that is why it is so much faster than the other CERN host. "It is surprising that this value (10 seconds) seems to be independent of latency. The RTT time from LBL to CERN is 3 times the RTT from LBL to ANL, so I would have thought that it would take longer to get a good estimate. In fact the ANL estimate at 10 sec looks to be more accurate than the CERN estimate, but it's closer than I would have thought." Brian Tierney, email to Les Cottrell, 12/13/01.

Matt Mathis of PSC pointed out to us that the optimum duration may depend critically on the buffer sizes in the first router in the path. Also, as one moves to larger RTT*Bandwidth products, slow start will take longer and thus more time will be needed for the aggregate throughput to get close to capacity.

Tom Dunigan of ORNL produced the plot below for a 100Mbits/s * 100ms path and a 1Gbits/s * 200ms path. The "measurements" are ns simulations with no losses and delayed ACKs.


We looked at one of our higher bandwidth*delay product links to see whether the 10s duration would fail for it. The link was from SLAC to APAN-JP with an RTT of 135ms and a maximum bandwidth of about 450Mbits/s (see Bulk Throughput Measurements - APAN-JP). As above, we measured using iperf TCP transfers for one stream with various durations and a maximum window/buffer size of 4MBytes. The results are shown below. It can be seen that 10 seconds is not long enough; in fact after 10 seconds the throughput only reaches about 80% of the asymptotic bandwidth for a single stream (we took the asymptotic bandwidth to be the maximum achieved during these tests). A plot of the maximum bandwidth minus the achieved throughput is also shown below for the same data.

"With a little work, we can probably get the analytical solution, but the rough estimate is that if you want to reach 90% of actual bandwidth to be reported by iperf at the end, then how long you have to run depends on MSS, RTT and target bandwidth (either capacity or your window size)." Tom Dunigan, email to Les Cottrell, Jun 7 '02.

Applying this algorithm to the APAN-JP link with a window of 5MBytes and an RTT of 135ms, we deduce that we should achieve 90% of the maximum throughput in 15.4s. This appears to agree well with what we observe.
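
As a rough consistency check (this is not necessarily the exact method used above): if slow start roughly doubles the congestion window each round trip starting from one 1460 byte segment, then opening a 5MByte window takes about log2(5MBytes/1460Bytes) ≈ 12 round trips, or about 1.6s at an RTT of 135ms. If the data transferred during slow start is neglected, the measurement must then run about ten times that long for the average reported at the end to reach 90% of the asymptotic rate, i.e. about 16s, which is of the same order as the 15.4s quoted above.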

The table below gives some values (using the method outlined above) of the time required for a measurement in which the single stream spends less than 10% of the total measurement time in slow start. If multiple streams are used then the bandwidth is roughly equally shared among the streams, so the BW used in the table is divided by the number of streams.

CPU Utilization

The graph to the right shows the iperf client CPU MHz/Mbps for several CPUs with different operating systems. The points are the medians for each complete set of measurements made with the various window sizes and streams. The error bars are the Inter Quartile Range (IQR) for each complete set. It is seen that there is a lot of variability in the observed values. More measurements would be needed to determine whether one OS is superior to another in terms of minimizing MHz/Mbps.
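
As an illustration of the metric (the numbers here are invented for the example, not measured values): if the iperf client consumes 30% of a 1000MHz processor while achieving 300Mbits/s, the cost is 0.3 x 1000MHz / 300Mbits/s = 1 MHz per Mbit/s, i.e. roughly 1GHz of CPU would be needed to drive 1Gbit/s at that efficiency.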

If one looks at CPU utilization versus streams, windows and throughput, one gets graphs like the one shown to the right for measurements from SLAC to RIKEN in Japan. Each line is for a fixed window size and shows the variation of CPU utilization with throughput. It is apparent that CPU utilization increases with increasing throughput. The overall relation between all the measurements is CPU MHz ~ 0.97 * Mbits/s. It is also observed that for a fixed achievable throughput the CPU utilization goes down with increasing window size. The lines joining the points are for the different window sizes (the line with the leftmost points is for 8KB, the next for 16KB, then 32KB, 64KB, 128KB, 256KB, 512KB and 1024KB); within each line the points are joined in order of increasing numbers of streams. It can be seen that for a given window size (single colored line) throughput and CPU utilization increase with an increased number of streams. For the larger window sizes there is typically a hook back for the last point (i.e. the last point has lower throughput and possibly higher CPU utilization than the penultimate point). This is thought to be due to over-driving the link with too large window*stream products.

Streams and Window Sizes

One would also expect the throughput to increase as one increases the window size or the number of parallel streams towards the product of the RTT and the bottleneck bandwidth of the link, since this helps keep the pipe full and reduces time wasted waiting for acknowledgements. "On a congested network, parallel streams often provide linear speedup! This only helps for large reads/writes." Brian Tierney, LBNL, in TCP Tuning Guide for Distributed Applications on Wide Area Networks. See Bulk throughput: windows vs. streams for comparisons of using large windows versus many streams; a simple iperf illustration is also given below.

For more on selecting window sizes see Enabling High Performance Data Transfers on Hosts (see also the UNIX IP Stack Tuning Guide; for Linux see the Ipsysctl tutorial; and see Solaris 2.x - Tuning Your TCP/IP Stack and More for how to configure the TCP stack for improved security), and see Adjusting IP MTU, TCP MSS, and PMTUD on Windows and Sun Systems for information on adjusting the Maximum Transmission Unit (MTU). See TCP auto-tuning zoo for information on auto-tuning TCP stacks. For information on and comparisons of various high performance network file transfer/copy mechanisms see Bulk Data Transfer Tools by Tim Adye. For pointers on modifying TCP for high throughput, see High Speed TCP by Sally Floyd, Scalable TCP by Tom Kelly, and FAST TCP. Also see A Brief History of TCP for an overview of earlier modifications to TCP. There is also a presentation given at SC2002 by Phil Dykstra on High Performance networking, and the last half of another, From Airplanes to Elephants, gives "What worked, and What didn't" for achieving performance.
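
As a simple illustration of the trade-off (the host name is a placeholder and the sizes are only examples), one can compare a single large-window iperf transfer with several parallel streams whose combined windows are roughly the same:

# single stream with a 4MByte window (if the OS allows a window that large)
iperf -c remote.example.org -t 10 -w 4M
# sixteen parallel streams of 256KBytes each, about the same aggregate window
iperf -c remote.example.org -t 10 -w 256K -P 16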

Summary


Created August 25, 2000.
Comments to iepm-l@slac.stanford.edu