By Saad Ansari saad@slac.stanford.edu
UDP Mon and TCP multi-stream are parts of a software package that measures data rates by varying several traffic parameters, most visibly the inter-packet gap.
The software follows a “request-response” design, analogous to a client-server architecture. It runs as a user-mode process, does not modify the kernel, and places no stringent requirements on it. The client (requester) sends requests to the server (responder) while varying traffic and system characteristics. These include:
· Packet length. This is the length of a packet, not including its headers.
· Packet-wait time. This is the time the client waits before sending the next packet.
· Number of probes. This is the total number of packets the client sends for each measurement.
· Packet length increments. This specifies the increment by which the length of the packet should be increased with each subsequent transmission.
· Packet-wait time increments. This specifies the increment by which the wait time between packets should be increased with each subsequent transmission.
· Port number. This specifies the port number on which the communication will take place.
· Buffer size. This is the maximum buffer the system will allocate to handle incoming packets, thereby constraining the packet length to be less than or equal to this value. It defaults to 65536 bytes (64 KB) and the default value was used in all test cases.
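The parameters listed above can be pictured as a small sweep configuration. The following is a minimal Python sketch, not the tool's actual code: the names `ProbeConfig` and `sweep` are hypothetical, the port value is purely illustrative, and the sketch assumes that each increment is applied once per successive step and that the packet length may not exceed the buffer size.

```python
from dataclasses import dataclass
from typing import Iterator, Tuple


@dataclass
class ProbeConfig:
    """Hypothetical container for the request parameters listed above."""
    packet_length: int               # payload bytes, headers excluded
    wait_time_us: float              # wait before sending the next packet
    num_probes: int                  # packets sent per measurement
    length_increment: int = 0        # bytes added per step (assumed semantics)
    wait_increment_us: float = 0.0   # microseconds added per step (assumed)
    port: int = 5001                 # illustrative value, not the tool's default
    buffer_size: int = 65536         # default stated in the text


def sweep(cfg: ProbeConfig, steps: int) -> Iterator[Tuple[int, float]]:
    """Yield (packet_length, wait_time_us) for each step of the sweep."""
    for i in range(steps):
        length = cfg.packet_length + i * cfg.length_increment
        if length > cfg.buffer_size:
            break  # packet length is constrained by the buffer size
        yield length, cfg.wait_time_us + i * cfg.wait_increment_us
```

For example, starting at 100 bytes and 0 µseconds with increments of 100 bytes and 10 µseconds, three steps would give lengths 100, 200, 300 and wait times 0, 10, 20.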
The server responds by echoing back the request. The client, upon receiving the echoed data, calculates various statistics like:
· Received Data Rate. This is the data rate that the client perceives for the entire communication, i.e. both transmit and receive.
· Send Data Rate. This is the rate at which the client can send data to its NIC.
· Latency. This is the time taken for a packet to go to its destination and be echoed back.
· Time/frame. This shows the time taken to process an outgoing/incoming packet.
Of these, the received data-rate stands out most conspicuously. Data-rate calculation is done using a simple formula:
(Data per packet in bytes * 8 * number of packets sent * 2)
-----------------------------------------------------------
(Time taken to send and receive data)
The numerator reflects the total data sent and received for that measurement (which is double the data initially sent by the client).
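The formula can be written out as a short Python function. This is a sketch of the arithmetic only; the function name and the elapsed time used in the example are hypothetical and are not taken from the tool or from any measurement in this document.

```python
def received_data_rate_bps(payload_bytes: int, packets_sent: int,
                           elapsed_seconds: float) -> float:
    """Data rate in bits per second, counting both the sent and echoed data."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    # Factor 8 converts bytes to bits; factor 2 counts the echoed copy.
    total_bits = payload_bytes * 8 * packets_sent * 2
    return total_bits / elapsed_seconds


if __name__ == "__main__":
    # Hypothetical timing: 300 packets of 1450 bytes echoed in 11.6 ms.
    rate = received_data_rate_bps(1450, 300, 0.0116)
    print(f"{rate / 1e6:.1f} Mbps")
```

With these illustrative numbers, 6,960,000 bits are exchanged in 0.0116 seconds, which works out to roughly 600 Mbps.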
The following section highlights some of the main software packages and attempts to discern their utility by analyzing the results obtained from running these packages.
UDP BW Mon is a utility which is used to estimate the capacity of a link. It follows the architecture described above and uses the aforementioned characteristics to generate traffic, from which various traffic statistics are extracted for analysis.
Varying inter-packet gaps did not reveal much, because cross-traffic introduces non-deterministic gaps anyway and the specified gaps are not preserved along the path. Ideally, the inter-packet gap should be close to 0 if the objective is to measure raw throughput on the link. However, higher values of up to 10 µseconds gave similar results. With even higher values, throughput fell sharply, as predicted, because longer delays mean lower utilization of the link capacity.
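The effect of the gap on utilization can be illustrated with an idealized model. The sketch below is an assumption, not the tool's behavior: it supposes that the sender idles for the full gap after each packet has been serialized onto a link of known rate, and the function name is hypothetical. The measurements reported here found similar results for gaps of up to 10 µseconds, so the real sender evidently does not follow this simple model exactly; it only shows why large gaps must lower the achievable rate.

```python
def paced_rate_bps(payload_bytes: int, gap_us: float, link_bps: float) -> float:
    """Idealized payload rate when each packet is followed by an idle gap."""
    bits = payload_bytes * 8
    serialization_s = bits / link_bps      # time to put one packet on the wire
    return bits / (serialization_s + gap_us * 1e-6)


if __name__ == "__main__":
    # Link rate taken from the 622 Mbps bottleneck mentioned in this document.
    for gap in (0, 10, 20, 50):
        rate = paced_rate_bps(1450, gap, 622e6)
        print(f"gap {gap:>2} us -> {rate / 1e6:.0f} Mbps (idealized)")
```

In this model a gap of 0 reproduces the link rate, and the rate falls monotonically as the gap grows.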
Another parameter that was varied was packet size. Bigger packets (close to the MTU of 1500 bytes) gave close to optimal bandwidth results, while smaller packets showed low utilization.
UDP Mon was also able to pick up transient network congestion and this was reflected as a trough in all the measurements.
Initially tests were run over multiple hosts, but most links had confounding traffic variation, which skewed results, making it difficult to discover patterns in the resulting statistics. Looping back packets to the same machine was also not a good idea as data would never hit the wire. Therefore, a reasonably fast link was required which would have mostly deterministic characteristics without too much variation.
The path to the server (responder) had a bottleneck link of 622 Mbps. Therefore, theoretically, the maximum achievable rate would be close to this value, discounting for link layer headers, processing delay and potentially queuing delay.
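As a rough illustration of the header discount, the sketch below estimates the payload rate left over once per-packet headers are subtracted. It assumes IPv4 without options (20 bytes) and the standard 8-byte UDP header; the link-layer overhead on this path is not known, so it is left as a parameter that defaults to 0. The function name is hypothetical, and processing and queuing delays are not modeled.

```python
UDP_HEADER_BYTES = 8     # fixed UDP header size
IPV4_HEADER_BYTES = 20   # IPv4 header without options (assumption)


def payload_ceiling_bps(link_bps: float, payload_bytes: int,
                        link_overhead_bytes: int = 0) -> float:
    """Upper bound on payload rate after subtracting per-packet headers."""
    wire_bytes = (payload_bytes + UDP_HEADER_BYTES + IPV4_HEADER_BYTES
                  + link_overhead_bytes)
    return link_bps * payload_bytes / wire_bytes


if __name__ == "__main__":
    for size in (1450, 1000, 200, 100):
        ceiling = payload_ceiling_bps(622e6, size)
        print(f"{size:>4} bytes -> at most {ceiling / 1e6:.0f} Mbps of payload")
```

Header overhead alone takes only a few percent off the ceiling for 1450-byte packets but over a fifth for 100-byte packets, before any per-packet processing cost is considered.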
Client Machine: Hercules.slacs.stanford.edu
Server Machine: pdsfgrid2.nersc.gov
Physical Distance between client and server: 45 miles, RTT = 2ms
The first scenario had the following traffic parameters:
Packet length = 1450 bytes
Inter-packet wait times = 0, 10, 20, 50 µseconds
Number of probes per measurement[1] = 300

Figure 1 Hercules to pdsfgrid2, packet size 1450 bytes
The graph shows that for wait times of 0-20 µseconds, the data rate estimates followed the same pattern. Transient network delays were reflected in most measurements and showed up as a trough in all plots for that measurement (consider measurement 16). There are a few outliers, but it is reasonable to ignore them, as they can be attributed to transient network congestion. The result also shows that delays of 50 µseconds yield highly inaccurate bandwidth measurements and therefore have little utility as far as raw data rate measurement is concerned. Data rate measurement is done at the receiver end (server side).
The second scenario had the following traffic parameters:
Packet length = 1000 bytes
Inter-packet wait times = 0, 10, 20, 50 µseconds
Number of probes per measurement = 300

Figure 2 Hercules to pdsfgrid2, packet size 1000 bytes
This graph adds a little more to our analysis. It seems that the wait time is not the only factor affecting the accuracy of measurement; packet size also affects the results. With a packet size of 1000 bytes, delays greater than 10 µseconds fail to give precise values. An interesting difference from the previous measurement is that delays of 10 µseconds give much higher estimates of the data rate than delays of 0 µseconds, and these estimates are closer to the values obtained in the earlier tests. To check that this was not transient behavior, I ran the same test at a different time from another machine to the same server and got the following graph.

Figure 3 Antonia to pdsfgrid2, packet size 1000 bytes
This graph also shows better throughput for a wait of 10 µseconds as compared to packets sent back to back. I was unable to come up with a plausible explanation for this behavior.
In order to verify the claim that smaller packet sizes will show lower throughput, the following traffic parameters were used:
Packet lengths = 200 and 100 bytes
Inter-packet wait times = 0, 10, 20, 50 µseconds
Number of probes per measurement = 300

Figure 4 Hercules to pdsfgrid2, packet size 200 bytes

Figure 5 Hercules to pdsfgrid2, packet size 100 bytes
Both tests show that lowering packet sizes decreases the throughput. There is now much greater separation between the 0 and 10 µseconds measurements. Higher wait times show very low data rate estimates.
UDP Mon also supplies an estimate of the data rate at which the client is able to send requests. This result is much higher than the bandwidth estimates for the link.

Figure 6 Comparing machine and link data rates from hercules to pdsfgrid2
As the data rate of the link was much lower than the rate at which data was supposedly being pumped onto the network, it would seem that packets would eventually buffer up at the bottleneck and some would probably be dropped. As this flow was UDP, dropped packets would not be retransmitted. The UDP Mon application did, however, keep track of packet sequences and counted any dropped or missing packets from the sequence it was expecting. These tests revealed that no packets were dropped. This seems to contradict the reasoning above, but closer inspection of the code shows that the calculation of the machine data rate does not account for the wait time between packets and therefore reports a much higher bandwidth than the link rate estimate. Another interesting observation is that sudden drops in link throughput do not always reflect transient network delay; they may occur because the machine itself has slowed down due to a context switch or CPU scheduling. This would explain some of the sudden drops in throughput in the measurements recorded in this document.
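The difference between a send rate that ignores the wait time and one that includes it can be sketched as follows. This is not the tool's code: the function name is hypothetical, the per-packet send durations are made-up illustrative values, and the sketch assumes one wait between each pair of consecutive packets.

```python
from typing import Sequence, Tuple


def send_rates_bps(payload_bytes: int, send_durations_s: Sequence[float],
                   wait_s: float) -> Tuple[float, float]:
    """Return (rate ignoring waits, rate including waits) in bits per second."""
    n = len(send_durations_s)
    if n == 0:
        raise ValueError("at least one packet is required")
    total_bits = payload_bytes * 8 * n
    busy_s = sum(send_durations_s)     # time spent inside the send calls
    waited_s = wait_s * (n - 1)        # idle time between consecutive packets
    return total_bits / busy_s, total_bits / (busy_s + waited_s)


if __name__ == "__main__":
    # Hypothetical: 300 packets of 1450 bytes, 5 us per send call, 10 us wait.
    excluding, including = send_rates_bps(1450, [5e-6] * 300, 10e-6)
    print(f"ignoring waits:  {excluding / 1e6:.0f} Mbps")
    print(f"including waits: {including / 1e6:.0f} Mbps")
```

A rate computed only over the time spent sending can therefore exceed the link rate even when the traffic actually offered to the network, once the waits are counted, is much lower and no packets need to be dropped.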
Another test was run to measure the maximum data rate at which a machine could transmit. It revealed that this statistic is directly related to the processing capability of the machine. The following graphs corroborate this claim.

Figure 7 Hercules to Pdsfgrid2, Machine Data Rate
Figure 7 shows that packets sent from Hercules can reach a data rate of up to 2100 Mbps. Transmitting bigger packets was more efficient than transmitting smaller ones, perhaps because less time is spent splitting the same amount of data into packets. The next graph shows the same test run from a slower machine, which revealed a much lower machine data rate.

Figure 8 Antonia to Pdsfgrid2, Machine Data Rate
TCP Mon is very similar to the UDP facility described earlier, but runs over TCP. It also adds one more parameter for varying traffic characteristics: the number of parallel TCP streams used to transmit the data.
As there was no way to control the maximum window size (i.e. set it to a high value) or to tweak the underlying TCP stack, throughput remained low for transfers. In addition, because of TCP congestion control (slow start followed by AIMD), the flows being measured were either still in slow start when they ended or had spent much of their lifetime in slow start ramping up to the maximum window size. This initial ramp-up skewed the data rate estimates, and the rates predicted by UDP in the earlier tests were not achieved. Therefore, unless there is some mechanism to modify the underlying TCP stack, it does not seem feasible to use TCP Mon to measure the maximum available TCP throughput for very high speed connections and bulk transfers. Also, more than 39 simultaneous streams cause a segmentation fault.
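The window limitation can be made concrete with the usual bound of one window per round trip. The sketch below is illustrative only: the function names are hypothetical, and the 65535-byte window is an assumption (the largest window TCP can advertise without the window scale option), not a value measured on the test machines. Slow start is not modeled, so real short transfers would fall below this bound.

```python
from typing import Optional


def bdp_bytes(link_bps: float, rtt_s: float) -> float:
    """Bandwidth-delay product: window needed to keep the link full."""
    return link_bps * rtt_s / 8


def window_limited_bps(window_bytes: int, rtt_s: float, streams: int = 1,
                       link_bps: Optional[float] = None) -> float:
    """Steady-state bound: each stream moves at most one window per RTT."""
    rate = streams * window_bytes * 8 / rtt_s
    return rate if link_bps is None else min(rate, link_bps)


if __name__ == "__main__":
    rtt = 0.002   # 2 ms RTT of the test path
    link = 622e6  # bottleneck rate of the test path
    print(f"window to fill the link: {bdp_bytes(link, rtt):.0f} bytes")
    for n in (1, 10):
        bound = window_limited_bps(65535, rtt, n, link)
        print(f"{n:>2} stream(s): at most {bound / 1e6:.0f} Mbps")
```

Under these assumptions a single stream cannot fill the 622 Mbps bottleneck, while several parallel streams together can in principle, which is consistent with the motivation for running multiple streams.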
Following are a few tests that were run to test the utility of this software (the same test bed was used for comparable results).
Packet length = 1450 bytes
Inter-packet wait times = 0, 10, 20, 50 µseconds
Number of probes per measurement = 300
Number of parallel streams = 1, 10, 20, 30, 35

Figure 9 TCP Mon - Hercules to Pdsfgrid2, number of streams = 1

Figure 10 TCP Mon - Hercules to Pdsfgrid2, number of streams = 10

Figure 11 TCP Mon - Hercules to Pdsfgrid2, number of streams = 20

Figure 12 TCP Mon - Hercules to Pdsfgrid2, number of streams = 30

Figure 13 TCP Mon - Hercules to Pdsfgrid2, number of streams = 35
The results show that increasing the number of parallel TCP streams increases the total throughput. However, they also show that with a larger number of TCP streams there is greater variation in the throughput. Consider the graph with 35 TCP streams: apart from the 3 outliers where the data rate dropped below 50 Mbps, there is greater variation in the throughput across wait times. To investigate further how throughput reacts to an increased number of connections, the next scenario was created.
Packet length = 1450 bytes
Inter-packet wait time = 0 µseconds
Number of probes per measurement = 300
Number of parallel streams = 1-38

Figure 14 Varying the number of TCP streams
This graph shows that although throughput generally increases with the number of connections, it begins to taper off at the end. The reason could be the CPU cycles and memory consumed in managing the different TCP connections. The software also had a strict limitation: it did not allow more than 39 TCP connections and would exit with a segmentation fault, possibly because of statically sized buffers.
The software is useful for revealing available link bandwidth, and its results are corroborated by other tests (namely tests run by Jiri). However, investigating the utility of varying inter-packet gaps is beyond the scope of this document. One option would be to vary inter-packet gaps randomly, emulating cross-traffic, and study the results. Documentation detailing how calculations were made was sparse, so the code had to be studied to find out. Although simple in essence, the code's single main function, with gotos sprinkled inside loops and other control structures, made the flow much harder to follow than it should have been. TCP Mon did not prove very useful either, and would not unless the TCP window size could be modified. Although TCP has a checksum mechanism, UDP Mon assumed that whatever packets it received were correct. A checksum calculation may have been deliberately left out to avoid introducing further delay in the calculations, but the program offered no remedy for a packet that got clobbered on the network.
[1] Each probe consisted of packets of the specified length. Measurements for all probes were aggregated and averaged over the total number of probes sent.