December 2004 MICS Progress Report for the IEPM Project

Date: June 2004 - December 2004
Project title: TeraPaths: DataGrid WAN Monitoring Infrastructure
PI: Les Cottrell
Co-PI: Connie Logg
Institution: Stanford Linear Accelerator Center
Number: Graduate students = 1; PhD students = 0; PostDoc Fellows = 0
Project Website: http://www-iepm.slac.stanford.edu/
 

Summary

This half year we focused on preparing the next version of the IEPM-BW high performance network monitoring toolkit, based on experience and feedback from the successful deployment of the current version (ten sites, of which four use it in production) together with the needs of new projects. The main improvements are to: add a new optional security model to simplify deployment; add traceroute measurements, analysis and visualization; develop, implement and evaluate the ability to detect anomalous behavior in bandwidth capacity and availability; and provide improved management tools and documentation. We have started deploying the new version, will extend the deployment in the coming quarter, and will visit sites to assist with it.

In addition, a major activity for part of the period was to lead, prepare for and successfully execute the SC2004 Bandwidth Challenge with Caltech, CERN, FNAL, and many others. Besides winning the challenge for the highest sustained bandwidth, this provided valuable insight into how to achieve high throughput, the impact of TCP Offload Engines, and comparisons between TCP and UDP-based reliable transports.

We continued development and extension of the PingER project, in particular to extend monitoring coverage to more of the developing world and to improve visualization and administration. This was done in collaboration with NIIT Pakistan under a grant from the US Department of State to cover travel.

Bandwidth/Throughput measurement (IEPM-BW)

Following concerns about the impact of iperf testing on network traffic, we examined the effects and documented them at http://www.slac.stanford.edu/grp/scs/net/case/iepm-jul04/.

We added U Victoria to the sites monitored by IEPM-BW. We also provided assistance to the UVic HEP group in trying to isolate why their file transfer performance from SLAC to UVic is much lower than that from UVic to SLAC.

We installed and studied BWCTL from Internet 2, which we would like to use for scheduling on-demand, higher impact bandwidth measurements. Currently it can only run single stream iperf, which is inadequate for our measurement purposes, and it cannot run other data transfer protocol tests such as GridFTP, bbftp and bbcp. We have communicated both needs to the developers of BWCTL.

IEPM-BW version 3 is now in first customer ship state. We have installed the monitoring toolkit at SLAC and Caltech (see http://socrates.cacr.caltech.edu/iepm-bw.cacr.caltech.edu/slac_wan_bw_tests.html). Efforts are currently under way to install IEPM-BW at BNL and NIIT. We will follow this up with CERN, Caltech and FNAL. As part of this we are arranging trips by IEPM staff members to visit these institutes to install software and to train local administrators.

We are working with Bill Allcock (ANL/GridFTP) and Internet 2 to try to improve the monitoring of GridFTP traffic. Currently it is not possible to get a good estimate of GridFTP's network utilization in terms of bytes, since it uses ephemeral ports. This is not usually the case for bbcp and bbftp, and we would like to compare the various utilizations.
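The difficulty with ephemeral ports can be illustrated with a minimal sketch of per-application byte accounting from flow records. This is not the project's monitoring code. The record layout (source port, destination port, byte count) and the port map are assumptions made for illustration: 2811 is the registered GridFTP control port, while the bbftp and bbcp entries are placeholders to be replaced with whatever ports a site actually configures.

```python
from collections import Counter

# Assumed port-to-application map, for illustration only.
# 2811 is the registered GridFTP control port; the bbftp and bbcp
# entries are placeholders, not values taken from this report.
KNOWN_PORTS = {2811: "gridftp-control", 5021: "bbftp", 5031: "bbcp"}


def bytes_by_application(flows, known_ports=KNOWN_PORTS):
    """Sum bytes per application for (src_port, dst_port, nbytes) records.

    A flow is attributed to an application when either endpoint uses a
    port listed in known_ports. Everything else, including data channels
    opened on ephemeral ports, is counted as "unattributed".
    """
    totals = Counter()
    for src_port, dst_port, nbytes in flows:
        app = (known_ports.get(dst_port)
               or known_ports.get(src_port)
               or "unattributed")
        totals[app] += nbytes
    return dict(totals)
```

With records like these, only the small control connection is visible as GridFTP, and the bulk data lands in the "unattributed" bucket along with any other traffic on unlisted ports.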

Lightweight Bandwidth Estimation

We assisted people from SDSC/CAIDA in evaluating pathload on the IEPM-BW testbed. We demonstrated ABwE at SC2004.

Bandwidth performance anomalous events

We implemented a modified and simplified version of the NLANR "plateau" algorithm, and tuned its anomalous event detection to minimize diurnal effects. We have developed extensive graphical displays of our implementation of the algorithm in action, which make it easier to understand exactly how it works. We have started to compare different methods of detecting anomalous events in time series. As part of this we have set up an informal collaboration of people from Loughborough (looking at the Kolmogorov-Smirnov technique), NIIT (looking at the Principal Component Analysis approach), FNAL (looking at Holt-Winters) and SLAC (looking at Holt-Winters and plateau) to develop and compare the various techniques.
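As a rough illustration of the general idea behind plateau-style detection, the sketch below keeps a history buffer describing normal behavior and a trigger buffer of suspect samples, and flags an event once enough consecutive suspect samples accumulate. It is not the IEPM-BW implementation: the function name, the buffer lengths and the sensitivity factor are all assumptions chosen for illustration, and no diurnal compensation is attempted.

```python
from collections import deque
from statistics import mean, pstdev


def detect_drops(samples, history_len=20, trigger_len=5, sensitivity=2.0):
    """Return the indices at which a sustained drop is flagged.

    A sample is "suspect" when it falls below
    mean(history) - sensitivity * stddev(history).
    An event is flagged once trigger_len consecutive suspect samples
    have been seen. All parameter values are illustrative assumptions.
    """
    history = deque(maxlen=history_len)
    trigger = []
    events = []
    for i, value in enumerate(samples):
        if len(history) < history_len:
            history.append(value)  # still learning the baseline
            continue
        threshold = mean(history) - sensitivity * pstdev(history)
        if value < threshold:
            trigger.append(value)
            if len(trigger) == trigger_len:
                events.append(i)
                # After an event, adopt the new level as the baseline.
                history.clear()
                history.extend(trigger)
                trigger = []
        else:
            # An isolated dip is forgiven; the normal sample joins history.
            trigger = []
            history.append(value)
    return events
```

Requiring several consecutive suspect samples is what keeps a single bad measurement from raising an event, at the cost of a detection delay of trigger_len samples.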

Traceroute Analysis and Visualization

We improved the traceroute topology visualization tool, applied it to the AMP data and incorporated it into the IEPM-BW traceroute visualization.
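To make the idea of a traceroute topology concrete, here is a minimal sketch, under assumed data shapes, of merging per-destination hop lists into a single graph and of listing where two routes to the same destination differ. The function names and the input format (a dictionary of hop lists) are hypothetical and do not describe the actual tool.

```python
from collections import defaultdict
from itertools import zip_longest


def build_topology(routes):
    """Merge hop lists into a directed adjacency map.

    routes maps a destination name to its ordered list of hop addresses.
    Shared hops collapse into one node, which is what turns a set of
    separate traceroutes into a tree-like topology.
    """
    edges = defaultdict(set)
    for hops in routes.values():
        for here, nxt in zip(hops, hops[1:]):
            edges[here].add(nxt)
    return {node: sorted(nexts) for node, nexts in edges.items()}


def changed_hops(old, new):
    """Return (position, old_hop, new_hop) wherever two routes differ.

    A missing hop (routes of different length) is reported as None.
    """
    return [(i, a, b)
            for i, (a, b) in enumerate(zip_longest(old, new))
            if a != b]
```

The adjacency map can be handed to any graph drawing tool, and a non-empty result from changed_hops between two successive measurements indicates a route change worth highlighting.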

PingER

We worked with Florida International University to get agreement to install a PingER monitoring site there. As of December 8th, it is running. This will be particularly useful for understanding South American connectivity. We also successfully set up monitoring to an Indian commercial site to get a handle on the relative performance of commercial versus Academic and Research sites in India. Most PingER sites are Academic and Research, and we want to find out whether the poor performance to India extends to commercial sites. We are working to set up a PingER monitoring site in Bangalore.

We worked with NIIT/Pakistan to develop a mouse sensitive map of PingER deployment (see http://www-dev.slac.stanford.edu/cgi-wrap/zoomtst-plot.pl). We will extend this to other projects such as ABwE and IEPM.

We collaborated with Dr. Richard Baker and his team at the University of New England, Armidale Australia to analyze the PingER data and visualize the congestion wave as it moves around the globe with time of day.
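One way to think about the congestion wave is that each site's loss peaks at roughly the same local time of day, so the peak sweeps westward in UTC. The sketch below shows the two small calculations involved, assuming hourly loss averages indexed by UTC hour and a site longitude in degrees east. It only illustrates the idea and is not the analysis that was actually performed; both function names are hypothetical.

```python
def local_solar_hour(utc_hour, longitude_deg):
    """Approximate local solar time from UTC hour and longitude.

    The Earth turns 15 degrees per hour, so each 15 degrees east of
    Greenwich adds one hour. Political time zones are ignored.
    """
    return (utc_hour + longitude_deg / 15.0) % 24


def peak_loss_hour_utc(hourly_loss):
    """Return the UTC hour (0-23) with the highest average packet loss.

    hourly_loss is a sequence of 24 loss averages indexed by UTC hour.
    """
    return max(range(24), key=lambda h: hourly_loss[h])
```

Plotting local_solar_hour(peak_loss_hour_utc(loss), longitude) for many sites would show whether the peaks cluster around the same local hours, which is the pattern a congestion wave implies.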

SC2004

Monitoring demos

For SC2004 we set up part of the SC2004 SLAC/FNAL booth to demonstrate Internet monitoring. We had two animated demos: one showing real time pings from SC2004 to various regions around the world (see PingWorld at http://www-iepm.slac.stanford.edu/tools/pingworld), and a second showing the Internet congestion wave (measured from PingER packet loss statistics for an entire year) moving around the world with time of day. In addition we prepared several PowerPoint presentations illustrating Internet throughput growth, PingER, and bringing the Internet to China. We also gave demonstrations of version 3 of the IEPM bandwidth monitoring, analysis and visualization toolkit.

SC2004 Bandwidth Challenge (see http://www-iepm.slac.stanford.edu/monitoring/bulk/sc2004/hiperf.html)

We set up a collaboration with Caltech, FNAL, University of Manchester, England; several companies including Chelsio, S2io, Sun and Cisco; ESnet; National Lambda Rail and others to participate in this year's SC2004 Bandwidth Challenge. For more details see the web site at http://www-iepm.slac.stanford.edu/monitoring/bulk/sc2004/hiperf.html.

As part of this we secured loans of two 10Gbits/s wavelengths (from NLR and ESnet/QWest) from Sunnyvale to Pittsburgh (this year's site for SC2004), eleven Sun Opteron (V20Z and V40Z) compute servers, eleven 10GE interfaces, three Sun file servers (based on Sun 3510 disk arrays) and Cisco equipment (router, XENPAKs and routing blades), as well as space at Sunnyvale in both ESnet's and CENIC's points of presence (PoPs). We also secured space for two electronics racks and a table in the SLAC/FNAL booth at SC2004.

We set up a 1GE file server and 10GE compute server at the ESnet PoP and four 10GE compute servers at the CENIC PoP, plus eleven 10GE compute servers and two file servers in the SLAC/FNAL booth. On the servers we installed a mix of the Linux 2.6 and Sun Solaris 10 x86 operating systems and a mix of Chelsio 10GE NIC (with TCP Offload Engines - TOE) and S2io 10GE NICs. We made tests with iperf/TCP, UDT and file transfers using bbcp and bbftp.

Together with the Caltech booth, we achieved over 100Gbits/s throughput and won this year's Bandwidth Challenge for sustained throughput (beating last year's record by more than a factor of four). In addition: we were able to saturate over 99% of a transcontinental 10Gbits/s wavelength from Pittsburgh to Sunnyvale; we were able to send over 11.4Gbits/s from a single V40Z Opteron host with two 10GE NICs to two V20Z Opteron hosts each with a 10GE NIC; we demonstrated the smooth interworking between Chelsio and S2io NICs in the multi Gbits/s range; with UDT we were able to achieve about 4.45Gbits/s throughput; and we were able to compare the CPU utilization of TCP/TOE versus non-TOE NICs and also of TCP versus the UDP-based UDT transport.

Proposals

Publications

Representation

We attended the DoE PI network research meeting at FNAL and made two presentations:

We attended the NASA/LSN workshop on Optical Network Technologies at NASA Ames. We prepared and gave a presentation on WAN Monitoring Issues (see http://www.slac.stanford.edu/grp/scs/net/talk03/jet-aug-4.html) and also served on a panel.

We attended the kick-off meeting for the UltraLight project.

As a member of the program committees for the Protocols for Fast Long Distance Networks 2005 and the Passive and Active Measurements 2005 workshops, Les wrote and submitted reviews for 17 papers.

We entertained visits by teams led by: Kars Ohrenberg of DESY Hamburg; Douglas Leith of Hamilton Institute, Dublin; Grenville Armitage of Swinburne, Australia; and Robert Baker of University of New England Armidale Australia.

Talks