Bulk thruput: windows vs. streams
Introduction
From Bulk thruput measurements made using
iperf
from SLAC to various sites it appears that using multiple parallel
streams of data is often more effective in achieving high thruput than using
large windows. This page looks at three links, chosen from the Bulk thruput measurements,
in more detail to
provide a more empirical
evaluation of this phenomenon.
Methodology
We selected three sites of interest to BaBar. These were CERN,
Caltech and IN2P3 in Lyon France. These were chosen since from the
Bulk thruput measurements
the first two exhibit the effect of multiple streams being more effective than
large windows, while for IN2P3 both multiple streams and large windows are
effective. Also we have strong relationships with the sites,
and the sites were kind enough to
enable us to run iperf on one of their hosts, and to impact their
links with our high performance measurement traffic. We also show
some selected data from thruput measurements with two other sites: Daresbury
Lab near Liverpool, England, and Stanford University, California.
For each site we used iperf
to send TCP bulk data with a 256kByte window and 2 streams
from a client at
SLAC to an iperf server at the remote site for 40 seconds.
At the same time we sent 40 pings of 56 bytes each (one/second, with a timeout of
20 seconds). Following this we sent 40 more similar
pings but without any iperf traffic.
The above measurement was then repeated but with a 64kByte window and 8
streams (i.e. the same product of streams * window).
The above is referred to as a measurement cycle.
This measurement cycle was repeated a couple of hundred times, and
the data were saved and analyzed with Excel. The whole series of measurements
was repeated for each site.
We also used
pathchar to characterize the paths and traceroute to get the routes
(in both directions where possible). See
Pathchar measurements from SLAC and
Traceroutes from SLAC.
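As a rough illustration of one measurement cycle, the Python sketch below only builds the command lines involved; it does not run them. The option names follow iperf version 2 (-c, -w, -P, -t) and the Linux iputils ping (-c, -i, -s, -W), which is an assumption: the ping on the hosts used for these measurements may take different arguments. The host name remote.example.org is a placeholder, and reading "56 byte pings" as a 56 byte payload is also an assumption.

```python
import shlex


def iperf_client_cmd(server, window, streams, seconds=40):
    """Build an iperf (version 2 style) client command line."""
    return ["iperf", "-c", server, "-w", window,
            "-P", str(streams), "-t", str(seconds)]


def ping_cmd(host, count=40, payload_bytes=56, timeout_s=20):
    """Build a ping command line (Linux iputils options; other systems differ)."""
    return ["ping", "-c", str(count), "-i", "1",
            "-s", str(payload_bytes), "-W", str(timeout_s), host]


def measurement_cycle(server):
    """Return the ordered steps of one cycle as (label, list of commands).

    In a "loaded" step the iperf and ping commands would run concurrently;
    in an "unloaded" step only the pings are sent.
    """
    steps = []
    for window, streams in (("256K", 2), ("64K", 8)):
        label = f"{window} window x {streams} streams"
        steps.append((label + ", loaded",
                      [iperf_client_cmd(server, window, streams),
                       ping_cmd(server)]))
        steps.append((label + ", unloaded", [ping_cmd(server)]))
    return steps


# Print the commands for one cycle against a placeholder host.
for label, cmds in measurement_cycle("remote.example.org"):
    print(label)
    for cmd in cmds:
        print("   ", shlex.join(cmd))
```

Both configurations keep the product streams * window at 512kBytes, so any difference in thruput comes from how that total is split.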
SLAC to CERN
The plot below shows the SLAC to CERN performance as measured
via Internet 2 with 8 streams and a window size of 64kBytes. The contribution
of each individual stream is shown in additive form. It can be seen
that the individual components roughly contribute equal amounts to the
aggregate thruput. The aggregate thruput varies from 10 to 30 Mbits/sec.
The next plot shows the measurement made with 2 streams and 256kByte windows.
It can be seen again that the 2 streams contribute about equally to the aggregate.
However, confirming the results in
Bulk thruput measurements the aggregate for this
case (2 streams, 256kByte windows) is considerably (roughly 3 times)
below that for 8 streams and
64kByte windows, even though the individual streams carry more data
(about 4.7:3) in the case of
the larger windows.
Caltech
We made similar measurements from SLAC to Caltech. The table below summarizes
the results and compares them with the SLAC to CERN measurements. The number reported
for the aggregate thruput is the median of all the measurements;
IQR is the interquartile
range of the thruput measurements
(it indicates the spread of the measurements). The RTTs (Round Trip Times)
are those measured by ping. It can be seen that the best performance to Caltech is
better than that to CERN (by roughly a factor of 2).
The performance with more streams is better than with fewer streams.
More streams also result in a greater RTT change between loaded and unloaded
measurements.
| Site | Window | Streams | Aggregate thruput | IQR | Min loaded RTT | Avg loaded RTT | Min unloaded RTT | Avg unloaded RTT |
| CERN | 256kBytes | 2 | 9.45Mbits/s | 4.63Mbits/s | 165ms | 167ms | 165ms | 165ms |
| CERN | 64kBytes | 8 | 26.8Mbits/s | 10.6Mbits/s | 164ms | 181ms | 165ms | 165.4ms |
| Caltech | 256kBytes | 2 | 46.5Mbits/s | 1.7Mbits/s | 16ms | 21ms | 15ms | 17ms |
| Caltech | 64kBytes | 8 | 63.5Mbits/s | 4.6Mbits/s | 17ms | 30ms | 16ms | 18ms |
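The ratios quoted in the text can be re-derived from the table. The short Python sketch below does this, using only the table values; the variable and function names are illustrative. It computes the per-stream thruput, the gain from using 8 streams rather than 2, and the difference between average loaded and unloaded RTT.

```python
# (site, window in kBytes, streams, aggregate Mbits/s,
#  avg loaded RTT in ms, avg unloaded RTT in ms), copied from the table.
ROWS = [
    ("CERN", 256, 2, 9.45, 167.0, 165.0),
    ("CERN", 64, 8, 26.8, 181.0, 165.4),
    ("Caltech", 256, 2, 46.5, 21.0, 17.0),
    ("Caltech", 64, 8, 63.5, 30.0, 18.0),
]


def derived(row):
    """Per-stream thruput and loaded-minus-unloaded average RTT for one row."""
    site, _window, streams, mbps, loaded_ms, unloaded_ms = row
    return {
        "site": site,
        "streams": streams,
        "per_stream_mbps": mbps / streams,
        "avg_rtt_increase_ms": loaded_ms - unloaded_ms,
    }


def gain(site):
    """Aggregate thruput with 8 streams divided by that with 2 streams."""
    by_streams = {row[2]: row[3] for row in ROWS if row[0] == site}
    return by_streams[8] / by_streams[2]


for row in ROWS:
    print(derived(row))
for site in ("CERN", "Caltech"):
    print(site, "gain from 8 streams vs 2:", round(gain(site), 2))
```

For CERN the gain is a little under 3, while each of the 2 large-window streams carries more than each of the 8 small-window streams, in line with the observations above.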
Effect of increasing number of streams for CERN & IN2P3
To better understand the impact of increasing the number of streams while keeping the
window size constant we measured the thruput, losses and RTTs with varying numbers
of streams (in the order 1, 20, 5, 12, 8, 10, 4, 11, 3, 7, 15, 9, 2, 6), from SLAC to
CERN with 8kByte and 16kByte windows. The duration of each thruput measurement
(for a fixed window size and number of streams) was 20 seconds during which both
ping RTT and loss were also measured (loaded loss & RTT), followed by
20 pings (1 second separation and 20 second timeout with 100byte packets) with no
iperf thruput to measure the unloaded RTT and loss.
Below we show plots of the thruput, unloaded loss and the difference in the loaded and
unloaded RTTs (dRTT) versus the number of parallel streams for window sizes of
8kBytes and 16kBytes. The solid green lines represent the dRTT averaged over all the
measurements for a given number of streams. For the 55kByte windows there were
a few instances of very long ping RTTs (> 4 seconds) which make the averages
oscillate wildly, so we add the median of the dRTTs as a solid blue line.
For the 8kByte, 16kByte and 55kByte window measurements, for each number of streams
there are roughly 21 measurements of the RTT difference and thruput.
It is seen that from 1 thru about 10 streams the thruput grows
roughly linearly with the number of streams. Above 7 streams the difference in RTT
also starts to grow indicating a greater impact on cross-traffic. At the same time beyond
7 streams the variability in the thruput also increases. This may indicate that there
is an optimal number of streams that limits the impact while providing good
thruput, by for example minimizing the ratio of RTT difference (dRTT) to thruput.
It is also seen that for larger numbers of streams the variability in RTT often increases
as thruput saturates. The red line shows the IQR of the RTT. The magenta
dots show the
data rate variability, defined as
1 - avg(data rate)/max(data rate), where the data rate is defined as
(bytes sent + bytes received back) / RTT. The magenta line shows the average
data rate variability.
Similar graphs are shown below for SLAC to IN2P3 (Lyon, France) measured Oct 8 and 9,
2000, from pharlap.slac.stanford.edu to ccasn01.in2p3.fr. The best thruputs appear
to be about 20Mbits/s, and are achieved for larger windows and more streams.
We also made similar measurements from Daresbury Lab (near Liverpool, UK)
to SLAC. One example for a 55kByte window is shown below.
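The idea of picking the number of streams that minimizes the ratio of dRTT to thruput can be sketched as follows. The numbers in SAMPLES are invented for illustration and are not read from the plots. The noise_floor_ms parameter is also an assumption, added so that dRTT values within the ping measurement noise do not automatically favour a single stream.

```python
# Hypothetical data: streams -> (thruput in Mbits/s, dRTT in ms).
SAMPLES = {
    1: (1.0, 0.2),
    2: (2.0, 0.1),
    4: (4.1, 0.4),
    7: (7.0, 0.9),
    10: (9.5, 3.0),
    15: (10.5, 9.0),
    20: (10.8, 20.0),
}


def impact_ratio(thruput_mbps, drtt_ms, noise_floor_ms=1.0):
    """dRTT per unit of thruput, with dRTT clamped to a noise floor."""
    return max(drtt_ms, noise_floor_ms) / thruput_mbps


def best_stream_count(samples, noise_floor_ms=1.0):
    """Return the number of streams with the smallest dRTT/thruput ratio."""
    return min(
        samples,
        key=lambda n: impact_ratio(samples[n][0], samples[n][1], noise_floor_ms),
    )


print("suggested number of streams:", best_stream_count(SAMPLES))
```

While dRTT stays below the floor the ratio falls as thruput rises, and once dRTT starts to grow faster than thruput the ratio turns upward, so the minimum sits near the knee of the curve. The choice of floor matters and would need to be set from the observed unloaded RTT scatter.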
Effect of thruput on RTT
In the graphs above it appears that the variability of the various RTT
metrics increases as
the thruput increases. To demonstrate this further,
below we plot the dRTT and the
rate variability versus the thruput for thruput measured from
SLAC to IN2P3 with a 256kByte window and from Daresbury Lab to SLAC with a
55kByte window.
It is seen that dRTT and loaded rate variability (i.e. the rate variability
measured when the thruput loading is being executed)
track one another and also increase with thruput. There is seen to be a steep
increase in dRTT and rate variability as the thruput
passes 13 Mbits/s.
The frequency distributions of thruput and loaded rate
variability are also
seen below for the IN2P3 case.
It can be seen that the thruput loading affects the loaded
rate variability by 10% or less for about 73% of the time.
The curves are an exponential fit to the frequency and a log fit to the
CDF. The parameters of the fits are shown.
The difference between the unloaded and loaded rate variabilities
for 300 loaded and 300 unloaded measurements from SLAC to IN2P3 with
256kByte windows is seen in the
table below. The low unloaded rate variability values probably indicate that,
at the time of the measurements, the other (non-iperf)
cross-traffic was not high (e.g.
< 30%). In such a case the rate variability is roughly
proportional to the dRTT since
the unloaded average RTT is likely to be fairly constant and the
minimum loaded RTT is also fairly constant.
| Rate variability | Average | Stdev | Median | IQR | Min | Max |
| Unloaded | 0.4% | 3.4% | 0% | 0% | 0% | 56.5% |
| Loaded | 7.1% | 7.8% | 4.4% | 9.6% | 0.6% | 54.4% |
The agreement between dRTT and the rate variability is seen below
for SLAC to IN2P3. Note
that two points at (-285, 1.14%) and (320, 0.61%) are not shown since
they were due to abnormally high RTTs (~ 440 msec) seen for a
brief period (one set of 20 loaded and one set of 20 unloaded pings)
in both the loaded and unloaded min, average and max pings.
It is seen that there is good agreement between the dRTT and rate
variability. We also looked at the agreement for the so called
rate predictability = RP = 1 - rate variability = min(RTT)/avg(RTT)
and the dRTT for Stanford to SLAC (these measurements were made
using the methodology outlined in
Bulk thruput measurements)
as shown in the second plot below.
Again there is seen to be good agreement (R² > 0.6).
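The metrics compared in this section can be computed from a set of ping RTTs as in the Python sketch below. It uses the definition given in the Summary, where the rate variability is the complement of average(data rate)/maximum(data rate), together with the rough approximation RP ~ min(RTT)/avg(RTT). The function names and the default packet sizes are illustrative assumptions.

```python
from statistics import mean


def data_rates(rtts_ms, bytes_out, bytes_back):
    """Data rate per ping in bytes/s: (bytes sent + bytes received back) / RTT."""
    return [(bytes_out + bytes_back) / (rtt / 1000.0) for rtt in rtts_ms]


def rate_variability(rtts_ms, bytes_out=100, bytes_back=100):
    """RV = 1 - avg(data rate) / max(data rate)."""
    rates = data_rates(rtts_ms, bytes_out, bytes_back)
    return 1.0 - mean(rates) / max(rates)


def rate_predictability_approx(rtts_ms):
    """Rough approximation RP ~ min(RTT) / avg(RTT)."""
    return min(rtts_ms) / mean(rtts_ms)


def drtt(loaded_rtts_ms, unloaded_rtts_ms):
    """Average loaded RTT minus average unloaded RTT, in ms."""
    return mean(loaded_rtts_ms) - mean(unloaded_rtts_ms)


loaded = [165.0, 170.0, 181.0, 240.0]    # made-up sample values, in ms
unloaded = [165.0, 165.0, 166.0, 165.0]  # made-up sample values, in ms
print("loaded RV:", rate_variability(loaded))
print("unloaded RV:", rate_variability(unloaded))
print("dRTT (ms):", drtt(loaded, unloaded))
```

Because the packet sizes are the same for every ping, they cancel in the ratio, so RV depends only on the RTT samples.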
Effect of duration of measurement
A 256kByte window is roughly 170 packets (at 1500 Bytes/packet) and a 4.5Mbps
stream carries roughly 340 packets per second. Thus we felt
that 40 seconds should be sufficient time to establish a stable TCP flow pattern.
In order to verify this we repeated the measurement cycles, varying
the duration from cycle to cycle. In turn we made measurements for 10, 320, 40, 80,
160 and 20 second durations.
We repeated the above twice for each host. The plots below
show the aggregate thruput as a function of duration. It can be seen that
the thruput
does not exhibit any strong correlation with duration.
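The packet arithmetic behind the choice of 40 seconds can be checked with a few lines of Python. This is only a back-of-envelope sketch: it assumes every packet is exactly 1500 bytes and that k and M mean 1000 and 1000000. With those assumptions the results come out close to, but not identical with, the rounded figures quoted above (the packet rate in particular comes out somewhat higher).

```python
PACKET_BYTES = 1500  # packet size used in the text


def packets_per_window(window_bytes, packet_bytes=PACKET_BYTES):
    """Number of full-size packets needed to fill one window."""
    return window_bytes / packet_bytes


def packets_per_second(rate_bits_per_s, packet_bytes=PACKET_BYTES):
    """Packet rate of a stream sending full-size packets at the given bit rate."""
    return rate_bits_per_s / (8 * packet_bytes)


window_pkts = packets_per_window(256_000)
pkt_rate = packets_per_second(4.5e6)
print("packets per 256kByte window:", round(window_pkts))
print("packets per second at 4.5Mbps:", round(pkt_rate))
print("packets sent in 40 seconds:", round(40 * pkt_rate))
```

Either way, a 40 second measurement sends many thousands of packets per stream, which is many times the window size.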
Effect on cpu utilization
We ran the iperf client transferring data from SLAC to IN2P3 on an unloaded
Sun 4500 host with 6 cpus at 336MHz (pharlap.slac.stanford.edu)
running Solaris 2.8,
noting the throughput and using the Unix time
command to report the user and system/kernel times for different window
sizes and numbers of parallel streams. Each combination of number of streams and window size
was measured for 10 seconds. This was later repeated for measurement durations
of 60 seconds.
The user and kernel cpu times as a function of throughput are seen below.
In the first plot it is seen that both user and kernel cpu times increase
roughly linearly with throughput.
In this case about an order of magnitude more kernel cpu was used compared to
user cpu.
The effect is more dramatic and more strongly correlated (R² = 0.91
vs 0.30) for the kernel cpu than for the user cpu utilization. The
second plot (of the percentage cpu (user + kernel) time versus
throughput for the two measurement durations (10 seconds and 60 seconds))
shows that total (kernel + user) cpu percentage utilization goes roughly as
% cpu utilization ~ alpha * throughput, where alpha is
between 0.14 and 0.15, and
R² > 0.9. We also briefly looked at the iperf server utilization
on the same Sun 4500 for a similar transfer from IN2P3 to SLAC and observed that the
total server cpu utilization was roughly equal to that of the client.
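A measurement of this kind can be approximated in Python on a Unix-like system, as sketched below. The run_and_time function reports the user and system cpu times of a child process, much as the Unix time command does, and fit_alpha estimates alpha in % cpu utilization ~ alpha * throughput. Fitting a line through the origin and taking throughput in Mbits/s are assumptions, since the fit method and units are not stated above. The data points in the usage lines are made up.

```python
import resource
import subprocess
import sys
import time


def run_and_time(cmd):
    """Run cmd and return (wall_s, user_s, sys_s) for the child process."""
    before = resource.getrusage(resource.RUSAGE_CHILDREN)
    start = time.monotonic()
    subprocess.run(cmd, check=True, stdout=subprocess.DEVNULL)
    wall = time.monotonic() - start
    after = resource.getrusage(resource.RUSAGE_CHILDREN)
    return (wall,
            after.ru_utime - before.ru_utime,
            after.ru_stime - before.ru_stime)


def cpu_percent(wall_s, user_s, sys_s):
    """Total (user + kernel) cpu time as a percentage of elapsed time."""
    return 100.0 * (user_s + sys_s) / wall_s


def fit_alpha(throughputs, cpu_percents):
    """Least-squares slope of a line through the origin: cpu% ~ alpha * throughput."""
    num = sum(x * y for x, y in zip(throughputs, cpu_percents))
    den = sum(x * x for x in throughputs)
    return num / den


# Time a stand-in command; an iperf client invocation would go here instead.
wall, user, system = run_and_time([sys.executable, "-c", "sum(range(10**6))"])
print(f"wall={wall:.3f}s user={user:.3f}s sys={system:.3f}s")

# Made-up (throughput in Mbits/s, % cpu) points to show the fit.
print("alpha:", fit_alpha([5.0, 10.0, 20.0], [0.7, 1.5, 2.9]))
```

With alpha around 0.14 to 0.15 under these assumptions, a 20 Mbits/s transfer would use roughly 3% cpu.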
It seems intuitive that, for a given product of
streams * window size,
having more streams is likely to use more cpu time due to the need to manage
multiple threads etc. This was tested on
July 7th 2001, for a fixed product
streams * window size = 2048kBytes for sending iperf data from SLAC to IN2P3
for 10 seconds and for 60 seconds. It was observed that there was little dependence
of the user time on the number of streams. However, there was a
dependence of kernel time
on the number of streams. This can be seen in the plot to the right. The lines are
straight line fits to guide the eye. The move to the right (higher cpu
utilization) as one goes to higher numbers of streams illustrates the
increase in cpu utilization for a given throughput as the number of
streams is increased. The plot also
indicates that there is a strong correlation for iperf client cpu utilization
(user + kernel)
increasing with throughput.
For small numbers of streams (< 32) and fixed product of
streams * window of 2048kBytes, which gives roughly
uniform throughput, the cpu utilization for
the client is roughly equal to that for the server, with the client usually being
slightly (~10%) higher.
For > 16 streams the increase in cpu cycles is more marked for the
client than server.
Summary
- For some links it is the product of streams * windows that determines the
maximum throughput.
- We confirm that at least for some links a
larger number of streams is more effective than
fewer streams and larger windows (in this case with the product streams * window
kept constant).
As Harvey Newman of Caltech suggests:
In terms of performance, one argument for using N streams
is that if you have a packet
loss on a single stream, then the multiple-decrease,
additive-increase (MDAI) behavior of TCP will only affect
one stream of N, and you will regain performance on that
stream (to 1/N of the aggregate performance) faster.
Further, I note that the likelihood of multiple packet
losses, leading to a TCP slow-restart, is lower.
As is pointed out in the contribution to CHEP01 by Jason Lee et al., Applied Techniques for High Bandwidth
Data Transfers across Wide Area Networks,
if the bandwidth is limited
by a small router in the path, all the streams are likely to experience packet loss in
synchrony (when the buffer fills, arriving packets from each stream are all dropped) and
thus gain little advantage over a single stream.
- The individual streams in a measurement contribute about equally
to the aggregate.
- The individual streams for the larger window and smaller number of streams
carry more data than those for the smaller window and larger number of streams.
- For small window sizes, the thruput grows roughly
linearly with number of streams for a small number of
streams.
- Choosing too many streams does not improve performance, but does increase
the impact on others (the difference in the loaded minus unloaded RTT and
the loss rate increases while the thruput does not improve).
- For a fixed streams * window size product increasing the number
of streams increases the system/kernel time used while iperf is
running.
- Opening multiple streams may be considered overly aggressive and
not "TCP compatible" (see
RFC2914). Future
versions of TCP may use the concept of flows to aggregate multiple
streams to help with congestion management.
- On the other hand it may be easier to set up multiple streams than to
set up a different version of the kernel
or configure an existing kernel to support the larger windows.
Some systems
(see an
example for Linux)
do not do a good job of supporting large windows, so using multiple
streams may be a way to get round this limitation.
Further it is possible that large windows may not be supported through
some network components. For example, BNL had a Cisco PIX firewall that reset the
window scaling factor thus limiting the maximum window sizes to 64KBytes.
- As pointed out to me by Jason Leigh of University of Illinois, Chicago,
there are libraries emerging that will make
parallel sockets simpler.
The
CAVERNsoft toolkit
for example provides both a parallel tcp implementation and a non-parallel
implementation with the same API calls (except where you have to give a
parameter for # of sockets).
- A problem with parallel sockets is knowing how many sockets to open.
That will change depending on available bandwidth at any given moment in time.
- It is possible that multiple streams may work better than
large windows when one has multiple paths (e.g. multiple OC12s rather than a
single OC48).
- As increased thruput drives the network to saturation,
performance and RTT appear to fluctuate more and more.
This is consistent
with queuing theory, which suggests that the variation in RTT
is proportional to 1/(1-L)
where L is the current network load, 0≤L≤1 (see
Internetworking with TCP/IP, volume 1, Principles, protocols
and architectures, Douglas E. Comer, Prentice Hall (1995), section
13.19 Responding to High Variance in Delay.)
- It is possible that one might be able to optimize thruput
while minimizing the
impact on other traffic by varying the number of streams to maximize
thruput while minimizing the fluctuations, e.g. by minimizing dRTT.
For an otherwise (i.e. other than the high
thruput load) lightly loaded link,
the fluctuations in the thruput or
the RTT may also be an adequate indicator of load.
The fluctuation in the RTT may be estimated using the IQR(RTT) or by
calculating the average(data rate)/maximum(data rate) or its
complement (we refer to this complement as the rate variability),
where the data rate
is ~ (bytes in outbound packet + bytes in inbound packet)/(RTT of packet)
or more accurately by calculating the slope of
the data rate with respect to the number of bytes. The rate
variability (RV) and rate predictability (RP), for which rough approximations are
RV ~ 1 - min(RTT)/avg(RTT) and
RP ~ min(RTT)/avg(RTT), are fairly easy to measure and
calculate and are also dimensionless and so
may scale well.
- The loss measured by ping does not appear to be very sensitive to the thruput for these
measurements.
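The argument quoted above, that a single packet loss costs less and is recovered faster when the same total window is split over N streams, can be illustrated with a small idealized model in Python. It assumes standard TCP congestion avoidance, in which the stream that sees the loss halves its window and then grows it by one segment per RTT. Slow start, timeouts and synchronized losses across streams are ignored. The total of 340 segments (two 256kByte windows of roughly 170 packets each) and the 165 ms RTT are taken from the SLAC to CERN case; the function name is illustrative.

```python
def single_loss_effect(total_window_segments, n_streams, rtt_s):
    """Effect of one loss on one of n_streams that share a fixed total window.

    The affected stream halves its window, then regains one segment per RTT.
    """
    per_stream = total_window_segments / n_streams
    lost_segments = per_stream / 2.0
    return {
        "fraction_of_aggregate_lost": lost_segments / total_window_segments,
        "rtts_to_recover": lost_segments,
        "seconds_to_recover": lost_segments * rtt_s,
    }


for n in (1, 2, 8):
    effect = single_loss_effect(340, n, 0.165)
    print(n, "streams:", effect)
```

In this model both the fraction of the aggregate window that is lost and the recovery time scale as 1/N, so 8 streams recover four times faster than 2 streams after a single loss.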
Further Reading
Protocol Parallelism
Effects of Ensemble TCP
Created August 25, 2000, last update October 14, 2000.
Comments to iepm-l@slac.stanford.edu