Bulk File Transfer with Compression Measurements
We verified that bbcp was indeed setting the window sizes correctly by using the Solaris snoop command on pharlap to capture packets and looking at the stream initiation SYN and SYN/ACK packets.
For each copy, we noted the transfer rate reported by bbcp, the file size, the window size, the number of streams, the compression factor and the compression achieved, the user, system/kernel and real times reported by the Unix time command, and the source and target host cpu usage reported by bbcp. We also noted the ping times when bbcp was running (loaded) and when it was not running (unloaded). Between measurements we slept for the duration of the previous bbcp measurement, to limit the load imposed by bbcpload.pl and to allow unloaded ping measurements to be made. We also noted the operating system and version of the remote host, together with the number of cpus and their MHz. The maximum number of streams allowed by bbcp was 64. The host at SLAC (pharlap) was a Sun E4500 with 4 cpus running at 336 MHz and a Gbps Ethernet interface, running Solaris 5.8.
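The measure-then-rest schedule described above can be sketched as follows. This is only an illustration, not the actual bbcpload.pl logic: `run_with_equal_rest` and the `copies` callables are hypothetical names, and each callable stands in for one bbcp transfer.

```python
import time
from typing import Callable, List


def run_with_equal_rest(
    copies: List[Callable[[], None]],
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> List[float]:
    """Run each copy, then rest for as long as that copy took.

    Returns the measured duration of each copy in seconds.
    """
    durations = []
    for copy in copies:
        start = clock()
        copy()                     # hypothetical stand-in for one bbcp transfer
        elapsed = clock() - start
        durations.append(elapsed)
        sleep(elapsed)             # idle period in which unloaded pings can be taken
    return durations
```

Resting for as long as each copy ran keeps the measurement load at roughly half of the elapsed time.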
The source file was read from /tmp. On pharlap, /tmp is stored in swap space. The source file used was a 60Mbyte BaBar Objectivity file. The destination file was always written to /dev/null.
All measurements were made for a duration of approximately 10 seconds (see Measurement Duration for more details).
First we measured the impacts of compression on the size of the file to be sent (after
compression) and on the cpu cycles, on the source host, required to achieve the compression.
The graph to the right shows the percent of a single 336MHz Sun cpu utilized on pharlap and the
compression achieved for various specified compression factors when copying the file
from SLAC to ANL.
We fixed the TCP window size and number of streams to an optimum (determined
separately by
using iperf, see
Bulk throughput
measurements for more details), and varied the compression factors.
It can be seen that there is a big (factor 7 or more)
increase in compression and cpu
utilization as one goes from a compression factor of 0 to 1. After that increasing
the compression factor results in a small increase (< 10%) in compression for a further
increase in cpu utilization.
The average MHz-seconds used for a compression factor = 0 (no compression) is
8.3 +- 0.91 MHz-secs, and for a compression factor of 1 (compression = 6.9) is
57.3 +- 0.46 MHz-secs.
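As a small worked example, the cpu cost ratio between the two settings can be computed from the averages quoted above. The definition of MHz-seconds used here (cpu seconds consumed multiplied by the clock rate in MHz) is an assumption, and `mhz_seconds` is a hypothetical helper name.

```python
def mhz_seconds(cpu_seconds: float, clock_mhz: float) -> float:
    """CPU cost normalized by clock rate (assumed definition, see lead-in)."""
    return cpu_seconds * clock_mhz


# Averages quoted in the text, in MHz-seconds.
UNCOMPRESSED_COST = 8.3   # compression factor 0
COMPRESSED_COST = 57.3    # compression factor 1

cost_ratio = COMPRESSED_COST / UNCOMPRESSED_COST
print(f"cpu cost ratio, factor 1 vs factor 0: {cost_ratio:.1f}")
```

The ratio comes out at about 6.9, in line with the factor of roughly 7 noted above.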
Next we made measurements of bbcp throughput to 22 remote hosts with
compression factors of 0 and 1, and with an optimal
TCP window size and number of streams selected for each host.
The results are shown to the right, with the compression factor of
0 being shown in
blue and that for a compression factor of 1 shown in red diagonal
stripes.
It can be seen that the maximum compressed throughput is about 50Mbits/s.
If the uncompressed
throughput exceeds this rate (as in the case of NERSC2, ANL,
LANL, caltech, SDSC, Stanford, NERSC, Mich, Wisc, and LBL) then
there is no improvement from using compression. If the uncompressed
throughput is below about 50Mbits/s
then compression can help (by more than a factor of 4 in the case of KEK, which has only a 10Mbit/s
bottleneck bandwidth between it and SLAC). When using a compression factor of
1 (or compression of 6.7), the average compressed bbcp
throughput is 58 +- 0.46 Mbits/s.
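One way to read this result is as a simple two-limit model: with compression on, the delivered file rate is capped either by the source cpu (the roughly 50Mbits/s ceiling described above) or by the network rate multiplied by the compression achieved. The sketch below is a simplification under that assumption, not a model taken from the measurements; the function names are hypothetical, and the uncompressed throughput is used as a stand-in for the available network rate.

```python
def compressed_copy_rate(link_mbps: float, compression: float,
                         cpu_ceiling_mbps: float) -> float:
    """Estimated file data rate (Mbits/s) with compression enabled."""
    return min(cpu_ceiling_mbps, link_mbps * compression)


def compression_helps(uncompressed_mbps: float, compression: float,
                      cpu_ceiling_mbps: float) -> bool:
    """True when the compressed estimate beats the uncompressed rate."""
    estimate = compressed_copy_rate(uncompressed_mbps, compression, cpu_ceiling_mbps)
    return estimate > uncompressed_mbps


# A 10 Mbit/s bottleneck (as for KEK) with compression 6.9 and a 50 Mbits/s ceiling.
kek_estimate = compressed_copy_rate(10.0, 6.9, 50.0)
print(kek_estimate, compression_helps(10.0, 6.9, 50.0))
# A path that already carries 100 Mbits/s uncompressed gains nothing.
print(compression_helps(100.0, 6.9, 50.0))
```

Under this model the 10Mbit/s path is cpu-limited once compression is on, which is consistent with a gain of more than a factor of 4.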
The consistency of the compressed throughput
indicates that there is a common cause. To ascertain whether this common cause
was the measuring host (pharlap), we repeated the measurements from antonia, a
host with 2*532 MHz cpus running Linux 2.4 with a Gig Ethernet NIC, and from hercules,
a host with 2*1131 MHz cpus running Linux 2.4 with 2 Gig Ethernet NICs.
Comparing the antonia results with pharlap's, it is apparent that the
maximum uncompressed throughput is reduced from about 400Mbits/s to about
165Mbits/s. This is believed to be because the pharlap
source file is read from memory
(/tmp is in swap space) whereas on
antonia and hercules it is read from disk (/dev/sda2 for antonia and
hercules and /dev/sda9 on testlnx05).
For the 1131MHz cpu (hercules) it can be seen that
uncompressed throughputs of over 400Mbits/s are achievable,
and the median compressed
throughput is over 140Mbits/s.
To understand these compressed throughput values better we measured the
system time gzip
took to compress the 380MByte Objectivity file on the measurement hosts
and reported this as Mbits/s.
The table below compares
the results from all the measuring hosts.
It can be seen that there is reasonable agreement between the median
bbcp compressed throughput and the Gzip throughputs, with Gzip typically
being 10-17% lower. This reduction may be due to the gzip source and destination
being on the same host, whereas the bbcp measurements used separate source and destination hosts.
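A comparable compression-rate figure can be obtained on any host with a short script. The sketch below is an approximation of the gzip measurement described above, not the procedure actually used: it compresses in memory with Python's zlib module (the same DEFLATE algorithm as gzip) and divides the input size by the process cpu time (user plus system), so it excludes disk I/O. The function name and the choice of level 1 are assumptions.

```python
import time
import zlib
from typing import Tuple


def compression_rate(data: bytes, level: int = 1) -> Tuple[float, float]:
    """Return (input Mbits per cpu second, compression ratio) for one pass."""
    start = time.process_time()
    compressed = zlib.compress(data, level)
    cpu_seconds = time.process_time() - start
    ratio = len(data) / len(compressed)
    if cpu_seconds <= 0:
        # Timer resolution too coarse for this input; rate is not measurable.
        return float("inf"), ratio
    return len(data) * 8 / cpu_seconds / 1e6, ratio


# Illustrative input only: repetitive text compresses far better than real data.
sample = b"BaBar Objectivity database page " * 200_000
rate_mbps, ratio = compression_rate(sample)
print(f"{rate_mbps:.0f} Mbits/s at compression {ratio:.1f}")
```

For a meaningful comparison the input should be a representative data file rather than the synthetic sample used here.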
To pursue this further we used bbcp to compress and copy the above file from and
to the same host and measured the source process cpu seconds and Mbits/s.
The graph to the right
of the table shows the median bbcp compressed
(compression factor 1, compression 6.9)
throughput from the measuring host versus the MHz of the measuring host's
cpu:
| Host | OS | # cpus | MHz | NIC Mbps | Median compressed MBits/s | Stdev | Gzip Mbits/s | Bbcp sce=tgt Mbits/s | c/x |
|---|---|---|---|---|---|---|---|---|---|
| Pharlap | Solaris 5.8 | 4 | 360 | 1000 | 58 | 0.46 | 48 | 41 | 4.42 |
| Antonia | Linux 2.4 | 2 | 532 | 1000 | 67 | 5 | 61 | 63 | 3.69 |
| Hercules | Linux 2.4 | 2 | 1131 | 1000 | 142 | 36 | 134 | 139 | 3.32 |
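The relationship between cpu clock rate and median compressed throughput can be examined directly from the table. The sketch below only divides one column by another; the values are copied from the table as listed (including the MHz column), and `mbps_per_mhz` is a hypothetical helper name.

```python
# (host, cpu MHz, median compressed bbcp Mbits/s), copied from the table above.
TABLE = [
    ("Pharlap", 360, 58),
    ("Antonia", 532, 67),
    ("Hercules", 1131, 142),
]


def mbps_per_mhz(mhz: float, mbps: float) -> float:
    """Median compressed throughput normalized by cpu clock rate."""
    return mbps / mhz


per_mhz = {host: mbps_per_mhz(mhz, mbps) for host, mhz, mbps in TABLE}
for host, value in per_mhz.items():
    print(f"{host}: {value:.3f} Mbits/s per MHz")
```

The two Linux hosts give nearly the same normalized rate, while the Solaris host comes out somewhat higher.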
For an uncompressed
copy, the average ratio of source_host MHz-seconds / target_host MHz-seconds was
1.2 +- 0.3. The average target host MHz-seconds for Solaris (5.6, 5.7 and 5.8, a total of
5 hosts) was 5460+-367 MHz-seconds, and for Linux (2.2 & 2.4, a total of 20 hosts) was
7805+-745 MHz-seconds. There was
little variation within the various Solaris versions or the various Linux versions.
We also looked for a correlation between target MHz-seconds and the throughput
in Mbits/s or the number
of streams but could find little evidence for any correlation. See the plots below:
Looking at the bbcp compression throughput graph for Hercules above, we see that the Stanford compressed throughput rate (96 Mbits/s) is much lower than the median (~ 143 Mbits/s), and its uncompressed bbcp throughput is about 90Mbits/s so there is no lack of network bandwidth. The Stanford cpu is a single 299MHz cpu running Linux 2.2. Thus if we take the c/x ratio to be ~ 3.32, the best compressed throughput is limited by its cpu to about:
By fixing the window size at 16Kbytes and varying the number of streams with no compression
for bbcp copies from SLAC to ANL, we were able to increase the file copy throughput in a fairly
linear-with-streams fashion from about 1.3Mbits/s to over 115 Mbits/s.
Using the same window sizes and varying
the numbers of streams for compression factors of 1 through 9 we were able to visualize
the effectiveness of compression on throughput for
varying uncompressed file copy rates.
This is seen on the graph to the right. It is seen that for uncompressed copies the throughput
is less than about 50Mbits/sec for fewer than 18 streams. For 18 or more streams the
uncompressed throughput is greater than 50 Mbits/s. Thus, as can be seen from the graph for a
window size of 16KBytes between SLAC and ANL, compression is effective in increasing throughput
for fewer than 18 parallel streams. It can also be seen that a compression factor of 1 is
most effective.
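The near-linear growth with the number of streams is what one would expect while each stream is limited by its window: a TCP stream cannot carry more than one window of data per round trip. The sketch below computes that upper bound; it ignores loss, host limits and the bottleneck bandwidth, and the round trip time in the example is a purely illustrative value, not a measurement from this study.

```python
def window_limited_mbps(window_bytes: int, rtt_seconds: float, streams: int = 1) -> float:
    """Upper bound on aggregate TCP throughput (Mbits/s) for window-limited streams."""
    return streams * window_bytes * 8 / rtt_seconds / 1e6


WINDOW = 16 * 1024          # 16 KByte window, as in the measurements above
EXAMPLE_RTT = 0.05          # hypothetical round trip time in seconds

for n in (1, 8, 18, 64):
    print(n, round(window_limited_mbps(WINDOW, EXAMPLE_RTT, n), 1))
```

Once the aggregate bound exceeds the compressed-throughput ceiling, adding streams favors the uncompressed copy, which matches the crossover seen in the graph.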