IEPM

Bulk File Transfer Measurements

Les Cottrell and Andy Hanushevsky, SLAC


Introduction | Methodology | Measurement duration | Bbcp self-pacing | Summary | Measuring disk throughputs on remote hosts

Introduction

To extend the network bulk throughput measurements to the application level, we made measurements with bbcp (man pages, presentation), a peer-to-peer secure remote file copy program developed by Andy Hanushevsky of SLAC, and also with bbftp, a secure client/server FTP program written by Gilles Farache of IN2P3, Lyon, France. Both programs are in heavy daily production use in the HENP community, so understanding how they perform is very important. Both programs allow one to set the number of streams and the window size for a transfer. See Bulk Data Transfer Tools for a comparison of bbcp and bbftp. Bbcp also allows us to set the TOS bits for QoS.
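As an illustration only (the host and file names are placeholders, and the -s and -w options give the number of streams and the window size as we understand them from the bbcp man page), a bbcp copy with these parameters set explicitly looks something like:

bbcp -s 8 -w 256k /tmp/testfile remotehost.example.org:/dev/null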

Methodology

We installed bbcp at SLAC, IN2P3, DL, and CERN and set up the appropriate ssh permissions to allow access between SLAC and each remote site without requiring a password to be entered manually. We wrote a Perl script, bbcpload.pl, to allow the selection of the source and destination for a file and to cycle through multiple window sizes (by default 8kBytes, 1024kBytes, 16kBytes, 512kBytes, 32kBytes, 256kBytes, 128kBytes, 64kBytes, in this order) and, for each window size, a set of streams (by default 1, 20, 5, 12, 8, 10, 4, 25, 11, 3, 7, 40, 15, 9, 2, 30, 6, in this order), copying the file once for each window and stream setting. We verified that bbcp was indeed setting the window sizes correctly by using the Solaris snoop command on pharlap to capture packets and looking at the stream-initiation SYN and SYN/ACK packets.

For each copy, we noted the transfer rate reported by bbcp, the file size, the window size, the number of streams, the Unix user, system/kernel, and real times, and the throughput (file size / real time). We also noted the loaded (bbcp running) and unloaded (bbcp not running) ping times. Between measurements we slept for the duration of the previous bbcp copy, both to limit the load imposed by bbcpload.pl and to allow the unloaded ping measurements to be made. The maximum number of streams allowed by bbcp was 64. The maximum window size on ccasn02.in2p3.fr was 1MByte.

The host at SLAC (pharlap) was a Sun E4500 with 6 CPUs running at 336 MHz and a Gbps Ethernet interface, running Solaris 5.8. The host at IN2P3 was a Sun Enterprise 450 with 4 * 300MHz CPUs and a 100Mbps Ethernet interface, running Solaris 5.6. To reduce the impact of disk/file system performance on the measurements, most copies read a source file saved in /tmp/ and wrote to /dev/null on the destination host. To help understand disk I/O and file system limitations, we also made measurements with the source file stored in memory on /tmp/ and the destination "file" for different measurement sets stored on /tmp/, on /afs/, or on /dev/null.
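The following simplified Perl sketch illustrates the structure of such a measurement loop. It is not the actual bbcpload.pl: the host and file names are placeholders, the bbcp -w (window size) and -s (streams) options are assumed from the bbcp man page, and the bookkeeping of user/system times and ping measurements is omitted. It cycles through the window sizes and stream counts, copies the file once per setting, derives the throughput as file size / real time, and then sleeps for the duration of the previous copy:

#!/usr/bin/perl
# Simplified sketch of the measurement loop (not the production bbcpload.pl).
# Host names, file names and the bbcp -w/-s options are assumptions for
# illustration only.
use strict;

my $file    = "/tmp/100MB.dat";              # source file held in memory (/tmp)
my $dest    = "remotehost.example.org:/dev/null";
my $fsize   = -s $file;                      # file size in bytes
my @windows = qw(8k 1024k 16k 512k 32k 256k 128k 64k);  # default order used
my @streams = (1,20,5,12,8,10,4,25,11,3,7,40,15,9,2,30,6);

foreach my $w (@windows) {
  foreach my $s (@streams) {
    my $t0 = time();
    system("bbcp", "-w", $w, "-s", $s, $file, $dest);
    my $real = time() - $t0;                 # wall-clock (real) time in seconds
    my $mbps = $real ? 8 * $fsize / $real / 1e6 : 0;  # throughput in Mbits/s
    printf "window=%s streams=%d real=%ds throughput=%.1f Mbits/s\n",
           $w, $s, $real, $mbps;
    sleep($real);                            # idle for as long as the copy took
  }
}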

Measurement Duration

We also made measurements with different file sizes to verify that we were close to the maximum throughput, i.e. that, to within the measurement deviation errors, the throughput had reached the asymptotic limit for an infinitely long file. Examples of such measurements are shown to the right for CERN (each point was measured ~30 times) and on the SLAC LAN (each point was measured about 25 times). The transfer rate maxima differ considerably: ~70Mbits/s for CERN compared to about 450Mbits/s on the SLAC LAN. The measurements indicate that asymptotia is reached for file sizes of 100MBytes (transfer time ~7-10 seconds) for CERN and 50MBytes-100MBytes (transfer time 1-2 seconds) on the SLAC LAN. The transfer time is here estimated as the file size / throughput. Henceforth we mainly use 100MByte files for the measurements.

In order to see how the throughput increases at the start of a copy, we also used bbcp with a 64KByte window and 4 streams to record the incremental throughput from plato.cacr.caltech.edu to pharlap.slac.stanford.edu for each second of the first couple of hundred seconds of a copy. The results are shown below for 4 copies. The 3rd and 4th copies use the right-hand axis, which is offset from the left-hand y axis to help distinguish overlapping points. It is seen that the throughput rapidly rises to its maximum.
On the other hand, if we use a single stream with a 1MByte window, then the copy takes considerably longer to get up to speed. This is shown in the second graph below. It is seen that the incremental throughput reaches a maximum around 20 seconds, but the average throughput takes longer. The dashed line is a fit of the average throughput to a logarithmic function with the parameters shown. The sawtooth behavior of the incremental throughput (also seen in the 3rd and 4th graphs, which are for window sizes of 2MBytes and 4MBytes respectively) is reminiscent of the behavior of the TCP congestion window (cwnd) when retransmissions occur, which in turn happen when duplicate ACKs are received indicating a packet loss. This is all described in TCP/IP Illustrated, Volume 1: The Protocols by W. Richard Stevens, published by Addison-Wesley, in the chapter on TCP Timeout and Retransmission. Looking at retransmissions with tcpdump, they appear to occur only about once per minute. On the other hand, for the packets sent from plato to pharlap the reorders are about 0.1%, or about 1 per second on average. Similar sawtooth behavior is seen below also for the 2MByte and 4MByte windows. In these cases the asymptotic maximum throughput appears to be reached by about 15 seconds.
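For reference, the incremental and average throughputs plotted in these graphs can be derived from per-second byte counts as in the following small Perl sketch. The numbers below are made up for illustration only; in practice they come from the per-second progress recorded during a copy:

# Illustrative sketch: derive the incremental (last-second) and average
# (cumulative) throughput from a list of bytes transferred in each second.
use strict;

my @bytes_per_sec = (2.0e6, 3.5e6, 5.0e6, 5.2e6, 4.8e6);  # hypothetical samples
my $total = 0;
for my $t (1 .. @bytes_per_sec) {
  my $b = $bytes_per_sec[$t - 1];
  $total += $b;
  my $incr = 8 * $b / 1e6;            # incremental throughput, Mbits/s
  my $avg  = 8 * $total / $t / 1e6;   # cumulative average throughput, Mbits/s
  printf "t=%3ds incremental=%.1f Mbits/s average=%.1f Mbits/s\n",
         $t, $incr, $avg;
}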

Though the previous measurements on the effect of file sizes (or individual measurement durations) indicated that asymptotia was reached for 100MByte files or durations of < 10 seconds, those measurements were made with multiple streams. The slow rise of the throughput with copy time (duration) seen above for a single stream required us to measure carefully the impact of duration on throughput, as a function of window size, for a single stream. In particular we were concerned that a bias against single-stream measurements could lead to mistaken conclusions about the effectiveness of using multiple streams. To measure copies taking a long time (say 320 seconds) would require a very large file (over 10GBytes) at transfer rates of 300Mbits/s, so we decided to make the measurements using iperf rather than bbcp. We selected durations of 2, 5, 10, 20, 40, 80, 160, 250 and 320 seconds, and window sizes of 256, 512, 1024, 2048 and 4096 KBytes. For each of these (window, duration) pairs we made single-stream measurements of the iperf TCP throughput 10 times from plato.cacr.caltech.edu to pharlap.slac.stanford.edu. The results are shown to the right. The points are the medians of each set of measurements, and the error bars are determined from the Inter Quartile Ranges (IQRs). It is seen that though the medians continue to rise for durations of over 10 seconds (by about 10% going from 10 to 20 seconds), to within the accuracy of the measurements this is a small effect. So for most of our measurements we settled on 10 second durations. Using the Unix top command we looked at memory utilization, and it appeared that both the iperf client and server resided comfortably in main memory. The client on plato appeared to use about 1MByte, and typically there were 367MBytes free; the server on pharlap used about 75MBytes when running, and there were typically 450MBytes free.
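A simplified Perl sketch of this duration/window scan is shown below. The iperf client options -c (server host), -t (duration in seconds) and -w (TCP window) are the standard iperf client options; the parsing of the iperf output is omitted and replaced by a placeholder, and the median and inter-quartile range calculation is a rough illustration only:

# Simplified sketch of the duration/window scan (illustrative only).
use strict;

my @durations = (2, 5, 10, 20, 40, 80, 160, 250, 320);   # seconds
my @windows   = (256, 512, 1024, 2048, 4096);            # KBytes

# Median and a rough inter-quartile range of a list of numbers.
sub median_iqr {
  my @s = sort { $a <=> $b } @_;
  my $n = @s;
  my $med = $n % 2 ? $s[$n/2] : ($s[$n/2 - 1] + $s[$n/2]) / 2;
  my $iqr = $s[int(3*$n/4)] - $s[int($n/4)];
  return ($med, $iqr);
}

foreach my $w (@windows) {
  foreach my $t (@durations) {
    my @mbps;
    for (1 .. 10) {                   # repeat each (window, duration) pair
      # system("iperf", "-c", "pharlap.slac.stanford.edu", "-t", $t, "-w", "${w}K");
      push @mbps, 0;                  # placeholder: parse the reported throughput here
    }
    my ($med, $iqr) = median_iqr(@mbps);
    printf "window=%dKB duration=%ds median=%.1f Mbits/s IQR=%.1f\n",
           $w, $t, $med, $iqr;
  }
}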

Bbcp self-pacing

Bbcp has an option to set a maximum transfer rate. If this is requested then the data is clocked out at the specified rate per second: bbcp simply skips clock ticks to achieve the long-term average. It still sends window-sized buffers whenever it does send data, and those buffers are sent at whatever speed the network will allow. To investigate the impact of using this option we measured the bbcp throughput from plato.cacr.caltech.edu to pharlap.slac.stanford.edu with a 1MByte window and a single stream. We did this for several selected values of the maximum transfer rate, including 512KBytes/s, 1MByte/s, 2MBytes/s, 5MBytes/s, 10MBytes/s, 15MBytes/s, 20MBytes/s, 25MBytes/s, 30MBytes/s, 35MBytes/s, and no maximum transfer rate limit. The unloaded ping RTT was min/avg/max (std) 24.5/24.7/25.3 (0.15) msec. Some examples of the throughput are shown below. The magenta line labelled KB/s is the cumulative throughput so far, and the blue line labelled dKB/s is the incremental throughput for the last second.
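The pacing idea described above can be sketched in Perl as follows. This is an illustration of the clock-tick-skipping scheme, not bbcp's actual code: each window-sized buffer is sent at full network speed, and the sender then sleeps until the long-term average has fallen back to the requested maximum rate.

# Illustrative sketch of rate limiting by skipping clock ticks (NOT bbcp's
# actual implementation).
use strict;
use Time::HiRes qw(time sleep);

my $max_rate = 15e6;          # requested maximum rate, bytes/s (e.g. 15 MBytes/s)
my $bufsize  = 1024 * 1024;   # window-sized buffer, e.g. 1 MByte
my $sent     = 0;
my $start    = time();

for (1 .. 100) {              # pretend to send 100 buffers
  # send_buffer($bufsize);    # hypothetical placeholder: the real send, at full speed
  $sent += $bufsize;
  my $earliest = $start + $sent / $max_rate;   # when the average is back on target
  my $now = time();
  sleep($earliest - $now) if $earliest > $now; # skip ticks until then
}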

We also measured the impact of self-pacing when using 32 streams with 64KByte windows. The results are shown below for 15MBytes/s and 25MBytes/s self-pacing and for no self-pacing. It can be seen that, with the extra agility provided by more streams, bbcp is able to follow more closely the bandwidth specified by the maximum transfer rate option.

Summary of Bulk File Transfer Measurements

Measuring Disk Throughputs on Remote Hosts


Created August 25, 2000, last update August 30, 2001.
Authors: Les Cottrell and Andy Hanushevsky, SLAC. Comments to iepm-l@slac.stanford.edu