To extend the network bulk throughput
measurements to the application level, we made measurements with
bbcp, a peer-to-peer secure remote file copy program developed by Andy
Hanushevsky of SLAC, and also with
bbftp, a secure client/server FTP
program written by Gilles Farache of IN2P3, Lyon, France. Both programs are
in heavy daily production use in the HENP community, so understanding how they
perform is very important.
Both programs allow one to set the number of streams
and the window sizes for the transfer;
see Bulk Data Transfer Tools for a comparison of bbcp and
bbftp. Bbcp also allows us to set the TOS bits for QoS.
We installed bbcp at SLAC, IN2P3, DL, and CERN and
set up the appropriate ssh permissions
to allow access between SLAC and the remote site without requiring a password
to be manually entered.
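As a rough sketch of that setup step (the account and remote host names below are illustrative, not the actual accounts used), one can generate an ssh key pair without a passphrase and append the public key to each remote account's authorized_keys file:

```perl
#!/usr/bin/env perl
# Sketch only: account and host names are illustrative, not the actual SLAC accounts.
use strict;
use warnings;

my $pubkey  = "$ENV{HOME}/.ssh/id_rsa.pub";
my @remotes = ('user@ccasn02.in2p3.fr', 'user@remotehost.cern.ch');

# Generate a passphrase-less key pair once, if one does not already exist.
system('ssh-keygen', '-t', 'rsa', '-N', '', '-f', "$ENV{HOME}/.ssh/id_rsa")
    unless -e $pubkey;

# Append the public key to each remote account's authorized_keys file.
for my $remote (@remotes) {
    system("cat $pubkey | ssh $remote 'mkdir -p ~/.ssh && cat >> ~/.ssh/authorized_keys'") == 0
        or warn "could not install key on $remote\n";
}
```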
We wrote a Perl script, bbcpload.pl, to allow selection of the source
and destination for a file and to cycle through multiple
window sizes (by default
8kBytes, 1024kBytes, 16kBytes, 512kBytes, 32kBytes,
256kBytes, 128kBytes and 64kBytes, in this order)
and, for each window size, a set of stream counts (by default
1, 20, 5, 12, 8, 10, 4, 25, 11, 3, 7, 40, 15, 9, 2, 30 and 6, in this order),
copying the file once for
each window and stream setting.
We verified that bbcp was indeed setting the window sizes correctly by
using the Solaris snoop
command on pharlap to
capture packets and looking at the window sizes advertised in the
stream-initiation SYN and SYN/ACK packets.
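A hedged sketch of such a check (the interface name and peer host are illustrative; snoop's -d, -c, -o, -i and -v options are the standard Solaris ones) might look like:

```perl
#!/usr/bin/env perl
# Sketch: capture the start of a bbcp transfer on pharlap and print the TCP
# header lines of the connection-setup packets so the advertised windows (and
# any window scale option) can be read off. Interface and peer are illustrative.
use strict;
use warnings;

my $peer    = 'ccasn02.in2p3.fr';      # remote end of the copy
my $capture = '/tmp/bbcp_syn.snoop';

# Capture the first 200 packets exchanged with the peer while bbcp starts up.
system('snoop', '-d', 'ge0', '-c', '200', '-o', $capture, 'host', $peer) == 0
    or die "snoop capture failed\n";

# Decode the capture verbosely and keep only the TCP flag and window lines.
open my $fh, '-|', "snoop -i $capture -v tcp" or die "snoop decode failed\n";
while (<$fh>) {
    print if /Flags|Window/i;
}
close $fh;
```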
For each copy, we noted the transfer rate reported by bbcp, the file size, the
window size, the number of streams, the user, system/kernel and real times reported by
the Unix time command, and the throughput (file size / real time). We also noted the
loaded (when bbcp was running) and unloaded (when bbcp was not running) ping times.
Between measurements we slept
for the duration of the previous bbcp measurement to limit the load imposed by
bbcpload.pl and to allow the unloaded ping measurements to be made.
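The script itself is not reproduced here, but the core of its measurement loop can be sketched as follows (a minimal sketch, assuming bbcp's -w window-size and -s streams options; the file names are illustrative, and the real script also records the bbcp-reported rate and the ping times):

```perl
#!/usr/bin/env perl
# Minimal sketch of the bbcpload.pl measurement loop (not the original script).
# Assumes bbcp's -w (window size) and -s (streams) options; file names illustrative.
use strict;
use warnings;
use Time::HiRes qw(time sleep);

my $src     = '/tmp/testfile_100MB';
my $dest    = 'ccasn02.in2p3.fr:/dev/null';
my @windows = qw(8k 1024k 16k 512k 32k 256k 128k 64k);       # default window order
my @streams = (1,20,5,12,8,10,4,25,11,3,7,40,15,9,2,30,6);   # default stream order

for my $win (@windows) {
    for my $str (@streams) {
        my $start = time;
        system('bbcp', '-w', $win, '-s', $str, $src, $dest) == 0
            or warn "bbcp failed for window=$win streams=$str\n";
        my $elapsed = time - $start;
        printf "window=%s streams=%d real=%.1fs\n", $win, $str, $elapsed;
        # Sleep for the duration of the copy just made, to limit the load
        # imposed by the script and to allow unloaded ping measurements.
        sleep $elapsed;
    }
}
```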
The maximum number of streams allowed by bbcp was 64. The maximum
window size on ccasn02.in2p3.fr was 1MByte.
The host at SLAC (pharlap) was a Sun E4500 with six 336 MHz CPUs and a Gbps
Ethernet interface, running Solaris 5.8.
The host at IN2P3 was a Sun Enterprise 450 with four 300MHz CPUs and a 100Mbps Ethernet
interface, running Solaris 5.6.
To reduce the impact of disk/file system performance on the measurements,
most copies were made from a file stored in /tmp/ and were written to
/dev/null on the destination host.
To help understand disk I/O and file system
limitations we also made measurements with the
source file stored in /tmp/ (i.e. in memory),
and with the destination "file" for different measurement sets
stored in /tmp/, in /afs/
or in /dev/null.
We also made measurements with different file
sizes to verify that
we were close to the maximum throughput, i.e.
that, to within the measurement deviation errors,
the throughput had reached the asymptotic limit for an
infinitely long file. Examples of such measurements are shown to the right
for CERN (each point was measured ~30 times) and for the
SLAC LAN (each point was measured about 25 times). The transfer rate maxima
differ: ~70Mbits/s for CERN compared to about 450Mbits/s on the SLAC LAN. The measurements
indicate that asymptotia is reached
for file sizes of 100MBytes (transfer time ~7-10 secs) and
50MBytes-100MBytes (transfer time 1-2 seconds) respectively. The transfer time here is estimated as
the file size / throughput. Henceforth we mainly use 100MByte files for the measurements.
To see how the throughput increased at the start of a copy, we
also used bbcp with a 64kByte window and 4 streams
to record the incremental throughput from plato.cacr.caltech.edu to
pharlap.slac.stanford.edu for each second over the first
couple of hundred seconds of a copy. The results are shown below for 4 copies.
The 3rd and 4th copies use the right-hand axis, which is offset from the
left-hand y axis to help distinguish overlapping points.
It is seen that the throughput rapidly rises to its maximum.
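A hedged example of the kind of invocation behind these traces (assuming bbcp's -P option produces a progress report every given number of seconds, with -w and -s as before; file names are illustrative) is:

```perl
#!/usr/bin/env perl
# Sketch: run one copy with a 64kByte window and 4 streams, asking bbcp for a
# progress report every second (the -P option is assumed here), and log the
# per-second lines for later plotting. File names are illustrative.
use strict;
use warnings;

my $src  = '/tmp/testfile_100MB';
my $dest = 'pharlap.slac.stanford.edu:/dev/null';

open my $log,  '>',  '/tmp/bbcp_progress.log' or die $!;
open my $bbcp, '-|', "bbcp -P 1 -w 64k -s 4 $src $dest 2>&1" or die $!;
while (<$bbcp>) {
    print;           # echo to the terminal
    print $log $_;   # keep for later analysis of the incremental throughput
}
close $bbcp;
close $log;
```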
On the other hand, if we use a single stream with a 1MByte window, then the copy
takes considerably longer to get up to speed. This is shown in the second graph
below. It is seen that the incremental throughput reaches a maximum at around 20
seconds, but the average throughput takes longer. The dashed line is a fit
of the average throughput to a
logarithmic function with the parameters shown.
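As a sketch of the form used (an assumption, since only a logarithmic function is stated here and the fitted parameters appear only in the plot), the average throughput would be modelled as throughput(t) ≈ a + b*ln(t), with a and b the parameters shown.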
The sawtooth behavior of the incremental throughput (also seen in the 3rd and 4th
graphs, which are for window sizes of 2 and 4 MBytes respectively)
is reminiscent of the behavior
of the TCP congestion window (cwnd) when retransmissions occur, which
in turn take place when duplicate ACKs are received indicating a packet loss.
This is all described in TCP/IP Illustrated Volume 1, The Protocols by
W. Richard Stevens, published by Addison-Wesley, in the chapter on
TCP Timeout and Retransmission. Looking at retransmissions with tcpdump,
they appear to occur only about once per minute. On the other hand, for the
packets sent from plato to pharlap the reorders are about 0.1%, or about
one per second on average. Similar sawtooth behavior is seen below also
for 2MByte and 4MByte windows. In these cases the asymptotic maximum throughput appears to be
reached by about 15 seconds.
Though the previous measurements on the effect of file sizes (or individual measurement
durations) indicated that asymptotia was reached for 100MByte files or durations of < 10 seconds,
those measurements were made with multiple streams. The slow rise
of the throughput with time of copy (duration) seen above for a single stream required us
to measure carefully the impact of duration on throughput, as a function of window size,
for a single stream. In particular, we were concerned that biases against measurements with
a single stream could lead to mistaken conclusions about the effectiveness of using
multiple streams.
Measuring copies that take a long time (say 320 seconds) would require
a very large file (over 10GBytes) at transfer rates of 300Mbits/s, so we decided to make these
measurements using iperf rather than bbcp. We selected durations of 2, 5, 10, 20, 40, 80,
160, 250 and 320 seconds, and window sizes of 256, 512, 1024, 2048 and 4096 kBytes. For each
of the possible pairs we made a single-stream
measurement of the iperf TCP throughput 10 times from
plato.cacr.caltech.edu to pharlap.slac.stanford.edu. The results are shown to the right
for 5 measurements for each pair of possible values (window, duration). The points are
the medians of each set of 17 measurements, and the error bars are determined from the
Inter Quartile Ranges (IQRs).
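A minimal sketch of this scan (using the standard iperf client options -c, -t, -w and -f m; the repetition count and output parsing are illustrative, and an iperf server is assumed to be running on pharlap) is:

```perl
#!/usr/bin/env perl
# Sketch of the single-stream iperf duration/window scan (not the original procedure).
# Uses the standard iperf client options -c (server), -t (duration in seconds),
# -w (TCP window) and -f m (report in Mbits/s); output parsing is illustrative.
use strict;
use warnings;

my $server    = 'pharlap.slac.stanford.edu';
my @durations = (2, 5, 10, 20, 40, 80, 160, 250, 320);        # seconds
my @windows   = ('256K', '512K', '1024K', '2048K', '4096K');  # kBytes
my $repeats   = 10;

for my $win (@windows) {
    for my $dur (@durations) {
        for my $run (1 .. $repeats) {
            my $out = `iperf -c $server -t $dur -w $win -f m`;
            # The "Mbits/sec" figure in the report is the achieved throughput.
            my ($mbps) = $out =~ /([\d.]+)\s+Mbits\/sec/;
            $mbps = 'n/a' unless defined $mbps;
            printf "window=%s duration=%ds run=%d throughput=%s Mbits/s\n",
                   $win, $dur, $run, $mbps;
        }
    }
}
```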
It is seen that, though the medians continue to rise for durations of over 10 seconds
(by about 10% going from 10 to 20 seconds),
to within the accuracy of the measurements this is a small effect. So for most of our
measurements we settled on 10-second durations.
Using the Unix top
command we looked at memory utilization and it appeared that both the iperf client and
server resided comfortably in main memory. The client on plato appeared to use about 1MByte
and typically there were 367MBytes free; the server on pharlap
used about 75MBytes when running, and there were typically 450MBytes free.
Bbcp has an option to set a maximum transfer rate. If this is requested,
the data is clocked out at the specified rate per second:
bbcp simply skips
clock ticks to achieve the requested long-term average, but it still sends window-sized
buffers whenever it does send data, and those buffers are sent at whatever speed
the network will allow.
To see the impact of using this option, we measured the bbcp throughput from plato.cacr.caltech.edu
to pharlap.slac.stanford.edu with a 1MByte window and a single stream. We did this for
several selected values of the maximum transfer rate, namely 512kBytes/s, 1MByte/s,
2MBytes/s, 5MBytes/s, 10MBytes/s, 15MBytes/s, 20MBytes/s, 25MBytes/s, 30MBytes/s and
35MBytes/s, and with no maximum transfer rate limit.
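A hedged sketch of this scan follows; note that the name of the maximum-transfer-rate switch is an assumption (written as -x below; the option name should be checked against the installed bbcp's help), and the file names are illustrative:

```perl
#!/usr/bin/env perl
# Sketch of the self-pacing scan (not the original procedure). The rate-limit
# switch is ASSUMED to be -x here; check "bbcp -h" for the actual option name.
# File names are illustrative.
use strict;
use warnings;

my $src   = '/tmp/testfile_100MB';
my $dest  = 'pharlap.slac.stanford.edu:/dev/null';
my @rates = qw(512k 1m 2m 5m 10m 15m 20m 25m 30m 35m);    # Bytes/s limits

for my $rate (@rates, undef) {                             # undef = no rate limit
    my @cmd = ('bbcp', '-w', '1m', '-s', '1', $src, $dest);
    splice @cmd, 1, 0, ('-x', $rate) if defined $rate;     # insert the assumed rate option
    printf "--- maximum transfer rate: %s ---\n", defined $rate ? $rate : 'none';
    system(@cmd) == 0 or warn "bbcp copy failed\n";
}
```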
The unloaded ping RTT was min/avg/max (std) 24.5/126.96.36.199 (0.15)msec. Some examples
of the throughput are shown below.
The magenta line labelled KB/s is the cumulative throughput so far, and the
blue line labelled dKB/s is the incremental throughput for the last second.
We also measured the impact of self-pacing when using 32 streams with 64kByte windows.
The results are shown below for 15MBytes/s and 25MBytes/s self-pacing and for no self-pacing.
It can be seen that, with the extra agility provided by more streams, bbcp is able to
more closely provide the bandwidth specified in the maximum bandwidth option.