IEPM

Bulk File Transfer with Compression Measurements

Introduction | Methodology | Results | Summary

Introduction

To understand the costs and impacts of compression on application-level file transfer throughput, we made measurements using the compression features of bbcp (man pages, PDF paper), a peer-to-peer secure remote file copy program developed by Andy Hanushevsky of SLAC. Bbcp allows one to set the number of streams and the window size for a transfer, as well as the compression factor. The compression feature uses the Unix zlib (gzip) library, which provides compression factors of 0 (no compression) through 9 (maximum compression).
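For illustration, a single copy can be driven as in the following minimal sketch. The -w (window), -s (streams) and -c (compression level) flags follow the bbcp man page; the host and file names are hypothetical:

    #!/usr/bin/perl
    # Minimal sketch of one bbcp copy with an explicit window size,
    # stream count and compression level (names are hypothetical).
    use strict; use warnings;
    my ($window, $streams, $level) = ("256k", 12, 1);
    my $src = "/tmp/babar.objy";                      # source file
    my $dst = "user\@remote.example.org:/dev/null";   # discard at the target
    my $cmd = "bbcp -c $level -s $streams -w $window $src $dst";
    system($cmd) == 0 or warn "bbcp failed: $?\n";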

Methodology

We installed bbcp on about 30 destination hosts in 8 countries, using the IEPM-BW infrastructure. We wrote a set of Perl scripts (bbcpcomp-all.pl, bbcpcomp.pl, bbcpload.pl and bbcpflow.pl) to allow selection of the destinations (userid@host_name:file_name), the windows (by default 8kBytes, 1024kBytes, 16kBytes, 512kBytes, 32kBytes, 256kBytes, 128kBytes and 64kBytes, in this order), the streams (by default 1, 20, 5, 12, 8, 10, 4, 25, 11, 3, 7, 40, 15, 9, 2, 30 and 6, in this order) and the compression factors (by default 0, 5, 1, 7, 3, 9, 2 and 4, in this order); the loop structure is sketched below.
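In outline, the measurement driver amounted to the following. This is a sketch in the spirit of bbcpcomp.pl, not the actual script; the file and host names are hypothetical:

    #!/usr/bin/perl
    # Sketch of the measurement loop: iterate over the default window,
    # stream and compression-factor orders given above.
    use strict; use warnings;
    my @windows = qw(8k 1024k 16k 512k 32k 256k 128k 64k);
    my @streams = (1,20,5,12,8,10,4,25,11,3,7,40,15,9,2,30,6);
    my @levels  = (0,5,1,7,3,9,2,4);
    my $src = "/tmp/babar.objy";                      # hypothetical names
    my $dst = "user\@remote.example.org:/dev/null";
    foreach my $w (@windows) {
        foreach my $s (@streams) {
            foreach my $c (@levels) {
                my $t0 = time;
                system("bbcp -c $c -s $s -w $w $src $dst");
                sleep(time - $t0);    # idle for as long as the copy took
            }
        }
    }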

We verified that bbcp was indeed setting the window sizes correctly by using the Solaris snoop command on pharlap to capture packets and looking at the stream initiation SYN and SYN/ACK packets.
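The check can be scripted along the following lines. This sketch assumes snoop's verbose decode, which includes TCP "Flags =" and "Window =" lines; the port number is hypothetical, since bbcp negotiates its data ports:

    #!/usr/bin/perl
    # Sketch: capture the first packets of a stream setup (run as root
    # while a transfer starts) and print the TCP flags and window fields.
    use strict; use warnings;
    open(my $snoop, "-|", "snoop -v -c 20 port 5031")
        or die "cannot run snoop: $!";
    while (<$snoop>) {
        print if /Flags = |Window = /;   # SYN and SYN/ACK flags, window sizes
    }
    close $snoop;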

For each copy we noted: the transfer rate reported by bbcp; the file size; the window size; the number of streams; the compression factor and the compression achieved; the Unix user, system/kernel and real times; and the source and target host cpu usage reported by bbcp. We also noted the loaded ping times (while bbcp was running) and the unloaded ping times (while it was not). Between measurements we slept for the duration of the previous bbcp measurement, both to limit the load imposed by bbcpload.pl and to allow the unloaded ping measurements to be made. We also recorded the remote host's operating system and version, together with the number of cpus and their clock rates in MHz. The maximum number of streams allowed by bbcp was 64. The host at SLAC (pharlap) was a Sun E4500 with 4 cpus running at 336 MHz and a Gbps Ethernet interface, running Solaris 5.8.
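A sketch of this per-copy bookkeeping follows. The host name is hypothetical, and the ping form shown is Solaris' "ping -s host size count":

    #!/usr/bin/perl
    # Sketch: unloaded ping, then a copy with a concurrent (loaded) ping,
    # then rest for as long as the copy took before the next test.
    use strict; use warnings;
    use Time::HiRes qw(time sleep);
    my $host = "remote.example.org";
    my @unloaded = `ping -s $host 56 10`;            # RTTs with no transfer running
    defined(my $pid = fork) or die "fork: $!";
    if ($pid == 0) { exec("ping -s $host 56 10") or die "exec: $!"; }  # loaded RTTs
    my $t0 = time;
    system("bbcp -c 1 -s 12 -w 256k /tmp/babar.objy user\@$host:/dev/null");
    my $elapsed = time - $t0;
    waitpid($pid, 0);
    sleep($elapsed);    # limit the load imposed on the path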

The source file was read from /tmp; on pharlap /tmp is stored in swap space. The source file used was a 60 MByte BaBar Objectivity file. The destination file was always written to /dev/null.

All measurements were made for a duration of approximately 10 seconds (see Measurement Duration for more details).

Results

First we measured the impacts of compression on the size of the file to be sent (after compression) and on the cpu cycles required on the source host to achieve the compression. The graph to the right shows the percent of a single 336 MHz Sun cpu utilized on pharlap and the compression achieved for various specified compression factors when copying the file from SLAC to ANL. We fixed the TCP window size and number of streams at an optimum (determined separately using iperf; see Bulk throughput measurements for more details) and varied the compression factors. It can be seen that there is a large (a factor of 7 or more) increase in both compression and cpu utilization in going from a compression factor of 0 to 1. Beyond that, increasing the compression factor yields only a small further gain in compression (< 10%) at the cost of a further increase in cpu utilization.
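The same tradeoff can be reproduced locally with Perl's Compress::Zlib, the same zlib engine bbcp uses. A sketch, with a hypothetical file name:

    #!/usr/bin/perl
    # Sketch: compression ratio and cpu seconds for each zlib level 0-9.
    use strict; use warnings;
    use Compress::Zlib;
    open(my $fh, "<", "/tmp/babar.objy") or die "open: $!";
    binmode $fh;
    my $data = do { local $/; <$fh> };    # slurp the whole file
    close $fh;
    for my $level (0 .. 9) {
        my @t0 = times;                   # own user+system cpu before
        my $out = compress($data, $level);
        my @t1 = times;
        printf "factor %d: compression %.2f, cpu %.2f s\n", $level,
               length($data) / length($out),
               ($t1[0] - $t0[0]) + ($t1[1] - $t0[1]);
    }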

The average cpu work for a compression factor of 0 (no compression) is 8.3 +- 0.91 MHz-seconds, and for a compression factor of 1 (compression = 6.9) it is 57.3 +- 0.46 MHz-seconds (MHz-seconds are cpu seconds multiplied by the cpu clock rate in MHz, giving a roughly machine-independent measure of cpu work). Next we measured bbcp throughput to 22 remote hosts with compression factors of 0 and 1, using the optimal TCP window size and number of streams for each host. The results are shown to the right, with the compression factor of 0 shown in blue and the compression factor of 1 shown in red diagonal stripes. It can be seen that the maximum compressed throughput is about 50 Mbits/s. If the uncompressed throughput exceeds this rate (as in the case of NERSC2, ANL, LANL, Caltech, SDSC, Stanford, NERSC, Mich, Wisc and LBL), then there is no improvement from using compression. If the uncompressed throughput is < ~50 Mbits/s, then compression can help (by more than a factor of 4 in the case of KEK, which has only a 10 Mbits/s bottleneck bandwidth between it and SLAC). When using a compression factor of 1 (compression of 6.7), the average compressed bbcp throughput is 58 +- 0.46 Mbits/s.
The consistency of the compressed throughput indicates that there is a common cause. To ascertain whether this common cause was the measuring host (pharlap), we repeated the measurements from antonia, a host with two 532 MHz cpus running Linux 2.4 and a Gigabit Ethernet NIC, and from hercules, a host with two 1131 MHz cpus running Linux 2.4 and two Gigabit Ethernet NICs. Comparing the antonia results with pharlap's, it is apparent that the maximum uncompressed throughput is reduced from about 400 Mbits/s to about 165 Mbits/s. This is believed to be because the pharlap source file is read from memory (/tmp is in swap space), whereas on antonia and hercules it is read from disk (/dev/sda2 on antonia and hercules, and /dev/sda9 on testlnx05).
For the 1131 MHz cpus (hercules) it can be seen that uncompressed throughputs of over 400 Mbits/s are achievable, and the median compressed throughput is over 140 Mbits/s. To understand these compressed throughput values better, we measured the system time gzip took to compress the 380 MByte Objectivity file on each measuring host and converted this to Mbits/s. The table below compares the results from all the measuring hosts. It can be seen that there is reasonable agreement between the median bbcp compressed throughput and the gzip throughput, with gzip typically being 10-17% lower. This reduction may be due to the gzip source and destination being on the same host, whereas the bbcp measurements used separate source and destination hosts. To pursue this further, we used bbcp to compress and copy the above file from and to the same host, and measured the source process cpu seconds and Mbits/s. The graph to the right of the table shows the median bbcp compressed (compression factor 1, compression 6.9) throughput from each measuring host versus the MHz of that host's cpu:
Host     | OS          | # cpus | MHz  | NIC Mbps | Median compressed Mbits/s | Stdev | Gzip Mbits/s | Bbcp src=tgt Mbits/s | c/x
Pharlap  | Solaris 5.8 | 4      | 360  | 1000     | 58                        | 0.46  | 48           | 41                   | 4.42
Antonia  | Linux 2.4   | 2      | 532  | 1000     | 67                        | 5     | 61           | 63                   | 3.69
Hercules | Linux 2.4   | 2      | 1131 | 1000     | 142                       | 36    | 134          | 139                  | 3.32
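The gzip column above can be derived along these lines. In this sketch, gzip level 1 is assumed (to match the bbcp compression factor), the child's user+system cpu time is charged rather than the system time alone, and the file name is hypothetical:

    #!/usr/bin/perl
    # Sketch: time gzip over the file and express the uncompressed
    # size moved per cpu-second as Mbits/s.
    use strict; use warnings;
    my $file  = "/tmp/babar380.objy";
    my $bytes = -s $file or die "cannot stat $file";
    my @t0 = times;
    system("gzip -1 -c $file > /dev/null") == 0 or die "gzip failed";
    my @t1 = times;
    my $cpu = ($t1[2] - $t0[2]) + ($t1[3] - $t0[3]);   # child user + system time
    printf "gzip throughput ~ %.1f Mbits/s\n", $bytes * 8 / ($cpu * 1e6);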

Since bbcp reports both the source and target cpu utilizations, we are able to compare the cpu times required to compress (source) versus those required to decompress (target). The compression (c) / decompression (x) cpu ratios are also reported in the table above.

For an uncompressed copy, the average ratio of source_host MHz-seconds / target_host MHz-seconds was 1.2 +- 0.3. The average target host MHz-seconds for Solaris (5.6, 5.7 and 5.8, a total of 5 hosts) was 5460 +- 367, and for Linux (2.2 & 2.4, a total of 20 hosts) was 7805 +- 745. There was little variation among the various Solaris versions or among the various Linux versions. We also looked for a correlation between target MHz-seconds and the throughput in Mbits/s or the number of streams, but could find little evidence of any correlation (see the plots below).
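For reference, bbcp reported the remote cpu figures, but the MHz-seconds bookkeeping on the local side simply scales the copy's cpu seconds by the clock rate. A sketch, using pharlap's 336 MHz clock and the hypothetical names from the examples above:

    #!/usr/bin/perl
    # Sketch: MHz-seconds = cpu seconds x cpu clock rate in MHz.
    use strict; use warnings;
    my $mhz = 336;                                      # measuring host clock rate
    my @t0 = times;
    system("bbcp /tmp/babar.objy user\@remote.example.org:/dev/null");
    my @t1 = times;
    my $cpu_s = ($t1[2] - $t0[2]) + ($t1[3] - $t0[3]);  # child user + system time
    printf "%.1f MHz-seconds\n", $cpu_s * $mhz;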

Looking at the bbcp compression throughput graph for hercules above, we see that the Stanford compressed throughput (96 Mbits/s) is much lower than the median (~143 Mbits/s), while its uncompressed bbcp throughput is about 90 Mbits/s, so there is no lack of network bandwidth. The Stanford host has a single 299 MHz cpu running Linux 2.2. Thus, taking the c/x ratio to be ~3.32, the best compressed throughput is limited by the target cpu to about:

143 Mbits/s (the median compressed throughput measured on a 1131 MHz source host) * (c/x) * (299 MHz / 1131 MHz)
= 143 * 3.32 * 0.264, or ~125 Mbits/s.
Some of the discrepancy between the estimate of 125 Mbits/s and the observed 96 Mbits/s may be due to the difference between Linux 2.4 (running on hercules) and Linux 2.2 (running on the Stanford host). The other hosts with low (< 100 Mbits/s) compressed throughputs (e.g. Rome (4*400 MHz, Solaris 5.7), NIKHEF (2*999 MHz, Linux 2.4) and RAL (604 MHz, Linux 2.2)) are less likely to be limited by target cpu MHz; there the discrepancy is due to some other cause, such as limited bandwidth. According to iperf / uncompressed bbcp, the Rome maximum throughput is about 23/4.8 Mbits/s, NIKHEF about 48/8.3 Mbits/s and RAL about 11/11 Mbits/s.
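The estimate can be written out in executable form. The numbers are from the text above; the scaling assumes the target needs 1/(c/x) of the source's compression cpu per bit:

    #!/usr/bin/perl
    # Sketch: scale the source's compressed throughput by c/x and by
    # the target/source clock-rate ratio.
    use strict; use warnings;
    my ($median_mbps, $src_mhz, $cx) = (143, 1131, 3.32);   # hercules source figures
    my $tgt_mhz = 299;                                      # Stanford target cpu
    printf "cpu-limited compressed throughput ~ %.1f Mbits/s\n",
           $median_mbps * $cx * $tgt_mhz / $src_mhz;        # ~ 125 Mbits/s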

By fixing the window size at 16 kBytes and varying the number of streams with no compression for bbcp copies from SLAC to ANL, we were able to increase the file copy throughput in a fairly linear-with-streams fashion from about 1.3 Mbits/s to over 115 Mbits/s. Using the same window size and varying the number of streams for compression factors of 1 through 9, we were able to visualize the effectiveness of compression on throughput for varying uncompressed file copy rates. This is seen in the graph to the right. For uncompressed copies the throughput is less than about 50 Mbits/s for fewer than 18 streams, and greater than 50 Mbits/s for 18 or more streams. Thus, as can be seen from the graph for a window size of 16 kBytes between SLAC and ANL, compression is effective in increasing throughput for fewer than 18 parallel streams. It can also be seen that a compression factor of 1 is the most effective.
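The roughly linear growth with streams is what the one-window-per-round-trip TCP bound predicts: aggregate throughput <= streams * window / RTT. A sketch, with an assumed SLAC to ANL RTT of ~50 ms (not given above), shows the 18-stream knee landing near the 50 Mbits/s figure:

    #!/usr/bin/perl
    # Sketch: per-stream TCP throughput bound summed over parallel streams.
    use strict; use warnings;
    my ($window_bytes, $rtt_s) = (16 * 1024, 0.050);   # 16 kBytes, assumed 50 ms RTT
    for my $streams (1, 9, 18, 36) {
        printf "%2d streams: <= %.1f Mbits/s\n",
               $streams, $streams * $window_bytes * 8 / ($rtt_s * 1e6);
    }

With these assumptions, 18 streams bound the aggregate at about 47 Mbits/s, consistent with the observed threshold above which compression no longer helps.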

Summary


Created December 30, 2001, last update December 30, 2001.
Authors: Les Cottrell and Andy Hanushevsky, SLAC. Comments to iepm-l@slac.stanford.edu