IEPM

Bulk thruput: windows vs. streams


Introduction

From Bulk thruput measurements made using iperf from SLAC to various sites, it appears that using multiple parallel streams of data is often more effective in achieving high thruput than using large windows. This page looks at 3 links, chosen from the Bulk thruput measurements, in more detail, in an attempt to provide a more empirical evaluation of this phenomenon.

Methodology

We selected three sites of interest to BaBar: CERN, Caltech and IN2P3 in Lyon, France. These were chosen since, in the Bulk thruput measurements, the first two exhibit the effect of multiple streams being more effective than large windows, while for IN2P3 both multiple streams and large windows are effective. Also, we have strong relationships with these sites, and they were kind enough to let us run iperf on one of their hosts and to impact their links with our high performance measurement traffic. We also show some selected data from thruput measurements with other sites: Daresbury Lab near Liverpool, England, and Stanford University, California.

For each site we used iperf to send TCP bulk data with a 256kByte window and 2 streams from a client at SLAC to an iperf server at the remote site for 40 seconds. At the same time we sent 40 56-byte pings (one per second, with a timeout of 20 seconds). Following this we sent 40 more similar pings but without any iperf traffic. The above measurement was then repeated but with a 64kByte window and 8 streams (i.e. the same product of streams * window). The above is referred to as a measurement cycle.
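The following is a minimal sketch of one such measurement cycle, written in Python for illustration only; the remote host name, the exact iperf and ping options, and the lack of output parsing are assumptions, and the scripts actually used (driven from Solaris hosts) differed in detail.

```python
import subprocess

# A minimal sketch of one measurement cycle (hypothetical host name and
# Linux-style ping options; not the scripts actually used at SLAC).
REMOTE = "iperf-server.example.org"

def run(cmd):
    """Run a command and return its standard output as text."""
    return subprocess.run(cmd, capture_output=True, text=True).stdout

def measurement_cycle(window, streams, secs=40, npings=40):
    # Loaded phase: a 40 second iperf TCP transfer while pinging once per
    # second with 56 byte packets.
    ping = subprocess.Popen(["ping", "-c", str(npings), "-s", "56", REMOTE],
                            stdout=subprocess.PIPE, text=True)
    loaded_iperf = run(["iperf", "-c", REMOTE, "-w", window,
                        "-P", str(streams), "-t", str(secs)])
    loaded_pings = ping.communicate()[0]

    # Unloaded phase: the same number of pings with no iperf traffic.
    unloaded_pings = run(["ping", "-c", str(npings), "-s", "56", REMOTE])
    return loaded_iperf, loaded_pings, unloaded_pings

# One cycle: 256kByte window with 2 streams, then 64kByte window with
# 8 streams (the same streams * window product).
for window, streams in [("256K", 2), ("64K", 8)]:
    results = measurement_cycle(window, streams)
```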

The above measurement cycle was repeated a couple of hundred times; the data was saved and analyzed with Excel. The whole series of measurements was repeated for each site.

We also used pathchar to characterize the paths and traceroute to get the routes (in both directions where possible). See Pathchar measurements from SLAC and Traceroutes from SLAC.

SLAC to CERN

The plot below shows the SLAC to CERN performance as measured via Internet 2 with 8 streams and a window size of 64kBytes. The contribution of each individual stream is shown in additive form. It can be seen that the individual streams contribute roughly equal amounts to the aggregate thruput. The aggregate thruput varies from 10 to 30 Mbits/s.
CERN thruput for 64kB window and 8 streams
The next plot shows the measurement made with 2 streams and 256kByte windows. It can be seen again that the 2 streams contribute about equally to the aggregate. However, confirming the results in Bulk thruput measurements, the aggregate for this case (2 streams, 256kByte windows) is considerably (roughly 3 times) below that for 8 streams and 64kByte windows, even though the individual streams carry more data (about 4.7:3) in the case of the larger windows.
CERN thruput with 256kB window and 2 streams

Caltech

We made similar measurements from SLAC to Caltech. The table below summarizes the results and compares them with the SLAC to CERN measurements. The number reported for the aggregate thruput is the median of all the measurements; IQR is the inter quartile range of the thruput measurements (it indicates the spread of the measurements). The RTTs (Round Trip Times) are those measured by ping. It can be seen that the best performance to Caltech is better than that to CERN (by roughly a factor of 2). The performance with more streams is better than with fewer streams. Using more streams also results in a greater RTT change between loaded and unloaded measurements.
Site     Window     Streams  Aggregate thruput  IQR          Min loaded RTT  Avg loaded RTT  Min unloaded RTT  Avg unloaded RTT
CERN     256kBytes  2        9.45Mbits/s        4.63Mbits/s  165ms           167ms           165ms             165ms
CERN     64kBytes   8        26.8Mbits/s        10.6Mbits/s  164ms           181ms           165ms             165.4ms
Caltech  256kBytes  2        46.5Mbits/s        1.7Mbits/s   16ms            21ms            15ms              17ms
Caltech  64kBytes   8        63.5Mbits/s        4.6Mbits/s   17ms            30ms            16ms              18ms
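As a note on how the table entries are derived, the sketch below (Python, with made-up thruput values) shows the median and inter quartile range computed over the aggregate thruputs of the repeated measurement cycles.

```python
import numpy as np

# Aggregate thruputs (Mbits/s) from repeated measurement cycles;
# the values here are made up for illustration.
aggregate_thruput = np.array([25.1, 31.7, 22.4, 28.9, 35.2, 19.8, 27.3])

median = np.median(aggregate_thruput)
q1, q3 = np.percentile(aggregate_thruput, [25, 75])
iqr = q3 - q1   # inter quartile range: indicates the spread of the measurements

print(f"median = {median:.1f} Mbits/s, IQR = {iqr:.1f} Mbits/s")
```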

Effect of increasing number of streams for CERN & IN2P3

To better understand the impact of increasing the number of streams while keeping the window size constant, we measured the thruput, losses and RTTs with varying numbers of streams (in the order 1, 20, 5, 12, 8, 10, 4, 11, 3, 7, 15, 9, 2, 6), from SLAC to CERN with 8kByte, 16kByte and 55kByte windows. The duration of each thruput measurement (for a fixed window size and number of streams) was 20 seconds, during which both ping RTT and loss were also measured (loaded loss & RTT), followed by 20 pings (1 second separation and 20 second timeout with 100 byte packets) with no iperf thruput to measure the unloaded RTT and loss.

Below we show plots of the thruput, unloaded loss and the difference between the loaded and unloaded RTTs (dRTT) versus the number of parallel streams for window sizes of 8kBytes, 16kBytes and 55kBytes. For each number of streams there are roughly 21 measurements of the RTT difference and thruput. The solid green lines represent the dRTT averaged over all the measurements for a given number of streams. For the 55kByte windows there were a few instances of very long ping RTTs (> 4 seconds) which make the averages oscillate wildly, so we add the median of the dRTTs as a solid blue line. The red line shows the IQR of the RTT. The magenta dots show the data rate variability, defined as 1 - avg(data rate)/max(data rate), where the data rate is here defined as (bytes sent + bytes received back) / RTT; the magenta line shows the average data rate variability.

It is seen that from 1 thru about 10 streams the thruput grows roughly linearly with the number of streams. Above 7 streams the difference in RTT also starts to grow, indicating a greater impact on cross-traffic. At the same time, beyond 7 streams the variability in the thruput also increases. This may indicate that there is an optimal number of streams that limits the impact while providing good thruput, for example by minimizing the ratio of RTT difference (dRTT) to thruput. It is also seen that for larger numbers of streams the variability in RTT often increases as the thruput saturates.
CERN thruput, RTT & loss vs streams, 8kB window CERN thruput, RTT & loss vs streams, 16kB window CERN thruput, RTT & loss vs streams, 55kB window
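For reference, the sketch below (Python, with made-up RTT samples) shows how dRTT and the data rate variability are computed for one setting of window size and number of streams, following the definitions above; the exact averaging conventions are an assumption.

```python
import numpy as np

# Loaded and unloaded ping RTTs (ms) for one window size / stream count;
# the sample values are made up for illustration.
loaded_rtt   = np.array([166.0, 172.0, 181.0, 169.0, 175.0])
unloaded_rtt = np.array([165.0, 165.0, 166.0, 165.0, 165.0])
ping_bytes   = 100 + 100   # bytes sent plus bytes echoed back (100 byte pings)

# dRTT: difference between the loaded and unloaded average RTTs.
drtt = loaded_rtt.mean() - unloaded_rtt.mean()

# Data rate = (bytes sent + bytes received back) / RTT, so the maximum rate
# corresponds to the minimum RTT and the average rate to the average RTT.
max_rate = ping_bytes / loaded_rtt.min()
avg_rate = ping_bytes / loaded_rtt.mean()

# Data rate variability = 1 - avg(data rate)/max(data rate),
# i.e. the same as 1 - min(RTT)/avg(RTT).
rate_variability = 1.0 - avg_rate / max_rate

print(f"dRTT = {drtt:.1f} ms, rate variability = {rate_variability:.1%}")
```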

Similar graphs are shown below for SLAC to IN2P3 (Lyon, France), measured Oct 8 and 9, 2000, from pharlap.slac.stanford.edu to ccasn01.in2p3.fr. The best thruputs appear to be about 20Mbits/s, and are achieved with larger windows and more streams.
Thruput, dRTT & loss SLAC to IN2P3 by streams for 8kByte window Thruput, dRTT & loss SLAC to IN2P3 by streams for 16kByte window Thruput, dRTT & loss SLAC to IN2P3 by streams for 32kByte window Thruput, dRTT & loss SLAC to IN2P3 by streams for 55kByte window Thruput, dRTT & loss SLAC to IN2P3 by streams for 64kByte window Thruput, dRTT & loss SLAC to IN2P3 by streams for 128kByte window Thruput, dRTT & loss SLAC to IN2P3 by streams for 256kByte window

We also made similar measurements from Daresbury Lab (near Liverpool, UK) to SLAC. One example for a 55kByte window is shown below.
Thruput, dRTT & loss DL to SLAC for 55kByte window

Effect of thruput on RTT

In the graphs above it appears that the variability of the various RTT metrics increases as the thruput increases. To demonstrate this further, below we plot the dRTT and the rate variability versus the thruput, for measurements from SLAC to IN2P3 with a 256kByte window and from Daresbury Lab to SLAC with a 55kByte window. It is seen that the dRTT and the loaded rate variability (i.e. the rate variability measured while the thruput loading is being executed) track one another and also increase with thruput. There is a steep increase in dRTT and rate variability as the thruput passes 13 Mbits/s.
RTT metrics vs stream thruput RTT vs thruput for DL to SLAC, 55kB window

The frequency distributions of thruput and loaded rate variability for the IN2P3 case are also seen below. It can be seen that the thruput loading affects the loaded rate variability by 10% or less about 73% of the time. The curves are an exponential fit to the frequency and a log fit to the CDF. The parameters of the fits are shown.
Histogram of thruput frequency Histogram of rate variability frequency
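The sketch below indicates how such fits could be made with scipy; the functional forms (a decaying exponential for the frequency, a logarithm for the CDF) and the bin values are assumptions for illustration, not the actual fits used.

```python
import numpy as np
from scipy.optimize import curve_fit

# Histogram of loaded rate variability reduced to bin centres and counts;
# the numbers are made up for illustration.
bins = np.array([2.5, 7.5, 12.5, 17.5, 22.5])   # rate variability (%)
freq = np.array([140, 80, 40, 25, 15])          # counts per bin
cdf = np.cumsum(freq) / freq.sum()

def expo(x, a, b):      # exponential fit to the frequency
    return a * np.exp(-b * x)

def logfit(x, a, b):    # log fit to the CDF
    return a * np.log(x) + b

pf, _ = curve_fit(expo, bins, freq, p0=(150.0, 0.1))
pc, _ = curve_fit(logfit, bins, cdf)
print("frequency ~ %.1f * exp(-%.2f * x)" % tuple(pf))
print("CDF ~ %.2f * ln(x) + %.2f" % tuple(pc))
```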
The differences between the unloaded and loaded rate variabilities for 300 loaded and 300 unloaded measurements from SLAC to IN2P3 with 256kByte windows are seen in the table below. The low unloaded rate variability values probably indicate that, at the time of the measurements, the cross-traffic (i.e. non-iperf traffic) was not high (e.g. < 30%). In such a case the rate variability is roughly proportional to the dRTT, since the unloaded average RTT is likely to be fairly constant and the minimum loaded RTT is also fairly constant.

          Average  Stdev  Median  IQR   Min   Max
Unloaded  0.4%     3.4%   0%      0%    0%    56.5%
Loaded    7.1%     7.8%   4.4%    9.6%  0.6%  54.4%
The agreement between the dRTT and the rate variability is seen below for SLAC to IN2P3. Note that two points at (-285, 1.14%) and (320, 0.61%) are not shown, since they were due to abnormally high RTTs (~ 440 msec) seen for a brief period (one set of 20 loaded and one set of 20 unloaded pings) in both the loaded and unloaded min, average and max pings. It is seen that there is good agreement between the dRTT and the rate variability. We also looked at the agreement between the so-called rate predictability, RP = 1 - rate variability = min(RTT)/avg(RTT), and the dRTT for Stanford to SLAC (these measurements were made using the methodology outlined in Bulk thruput measurements), as shown in the second plot below. Again there is seen to be good agreement (R2 > 0.6).
dRTT vs rate variability dRTT vs rate predictability
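The correlation quoted above can be computed as in the sketch below (Python); the arrays of dRTT and RTT statistics are made up for illustration.

```python
import numpy as np

# dRTT (ms) and the corresponding loaded min and average RTTs (ms) for a
# series of measurements; made-up values for illustration.
drtt    = np.array([1.0, 3.0, 8.0, 15.0, 30.0, 60.0])
min_rtt = np.array([165.0, 165.0, 164.0, 165.0, 165.0, 164.0])
avg_rtt = np.array([166.0, 168.5, 172.0, 181.0, 196.0, 226.0])

rp = min_rtt / avg_rtt   # rate predictability = 1 - rate variability

# Squared correlation coefficient (R2) between dRTT and RP.
r = np.corrcoef(drtt, rp)[0, 1]
print(f"R2 = {r * r:.2f}")
```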

Effect of duration of measurement

A 256kByte window is roughly 170 packets (at 1500 Bytes/packet), and a 4.5Mbits/s stream carries roughly 340 packets per second. Thus we felt that 40 seconds should be sufficient time to establish a stable TCP flow pattern. In order to verify this we repeated the measurement cycles, but with the duration varied from cycle to cycle. In turn we made measurements for 10, 320, 40, 80, 160 and 20 second durations. We repeated the above twice for each host. The plots below show the aggregate thruput as a function of duration. It can be seen that the thruput does not exhibit any strong correlation with duration.
CERN thruput vs duration Caltech thruput vs duration

Effect on cpu utilization

We ran the iperf client transferring data from SLAC to IN2P3 on an unloaded Sun 4500 host with 6 cpus at 336MHz (pharlap.slac.stanford.edu) running Solaris 2.8, noting the throughput and using the Unix time function to report the user and system/kernel times for different window sizes and numbers of parallel streams. Each pair of streams and window size was measured for 10 seconds. This was later repeated for measurement durations of 60 seconds. The user and kernel cpu times as a function of throughput are seen below. In the first plot it is seen that both user and kernel cpu times increase roughly linearly with throughput. In this case about an order of magnitude more kernel cpu was used compared to user cpu. The effect is more dramatic and more strongly correlated for the kernel cpu than for the user cpu utilization (R2=0.91 vs 0.30). The second plot, of the percentage cpu (user + kernel) time used versus throughput for the two measurement durations (10 seconds and 60 seconds), shows that the total (kernel + user) cpu percentage utilization goes roughly as % cpu utilization ~ alpha * throughput, where alpha is between 0.14 and 0.15, and R2 > 0.9. We also briefly looked at the iperf server utilization on the same Sun 4500 for a similar transfer from IN2P3 to SLAC and observed that the total server cpu utilization was roughly equal to that of the client.
Kernel and user cpu utilization versus throughput cpu utilization vs throughput for 10 & 60 sec measurements
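As a rough illustration of the linear fit above (assuming % cpu utilization ~ alpha * throughput, with throughput in Mbits/s and alpha between 0.14 and 0.15), one can estimate the client cpu cost of a given transfer rate; the example rate below is arbitrary.

```python
# Rough estimate based on the linear fit above; alpha and the Mbits/s units
# are taken from the fit quoted in the text, the example rate is arbitrary.
def cpu_percent(throughput_mbps, alpha=0.145):
    return alpha * throughput_mbps

print(cpu_percent(20.0))   # a 20 Mbits/s transfer -> roughly 3% cpu (user + kernel)
```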
Iperf client kernel cpu utilization
It seems intuitive that, for a given product of streams * window size, having more streams is likely to use more cpu time due to the need to manage multiple threads etc. This was tested on July 7, 2001, with a fixed product of streams * window size = 2048 kBytes, sending iperf data from SLAC to IN2P3 for 10 seconds and for 60 seconds. It was observed that there was little dependence of the user time on the number of streams. However, there was a dependence of the kernel time on the number of streams. This can be seen in the plot to the right. The lines are straight line fits to guide the eye. The move to the right (higher cpu utilization) as one goes to higher numbers of streams illustrates the increase in cpu utilization for a given throughput as the number of streams is increased. The plot also indicates that there is a strong correlation of iperf client cpu utilization (user + kernel) increasing with throughput. For small numbers of streams (< 32) and a fixed product of streams * window of 2048 kBytes, which gives roughly uniform throughput, the cpu utilization for the client is roughly equal to that for the server, with the client usually being slightly (~10%) higher. For > 16 streams the increase in cpu cycles is more marked for the client than for the server.

Summary

Further Reading
  • Protocol Parallelism
  • Effects of Ensemble TCP



    Created August 25, 2000, last update October 14, 2000.
    Comments to iepm-l@slac.stanford.edu