Report on IEPM PPDG Efforts for the Quarter July - September 2005

Report by Les Cottrell, SLAC

Bandwidth/Throughput Monitoring

The DataGrid Wide Area Network Monitoring Infrastructure (DWMI) now has IEPM-BW monitoring successfully installed at BNL, Caltech, CERN, FNAL, and SLAC, where it is making measurements and collecting, analyzing and reporting results.

We are now using the plateau method in production to detect significant, persistent drops (events) in network performance, and it is used to generate email alerts. Typically we see a couple of alerts per week. These are being carefully reviewed and case studies (see Network Problem Case Studies) are being developed. The results are encouraging; next we need to quantify the success of the method in terms of false positives, missed events, etc. We are also working on gathering extra relevant information to report in the alerts.
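
As a rough illustration of the idea behind detecting persistent drops, the Python sketch below keeps a history buffer of recent measurements and a trigger buffer of unusually low ones, and flags an event only when the trigger buffer fills. The function name, buffer lengths and thresholds are illustrative assumptions, not the values used in production, and the trigger buffer handling is simplified (one normal sample empties it).

```python
from collections import deque
from statistics import mean, pstdev


def detect_drops(samples, history_len=20, trigger_len=5,
                 sensitivity=2.0, min_drop=0.3):
    """Return the indices at which a persistent drop is flagged.

    All parameter defaults are hypothetical, chosen only for illustration.
    """
    history = deque(maxlen=history_len)  # recent "normal" measurements
    trigger = []                         # consecutive suspiciously low values
    events = []
    for i, x in enumerate(samples):
        if len(history) < history_len:
            history.append(x)            # still learning the baseline
            continue
        base = mean(history)
        spread = pstdev(history)
        if x < base - sensitivity * spread:
            trigger.append(x)
            if len(trigger) >= trigger_len:
                # Persistent: report only if the drop is large enough.
                if mean(trigger) < (1.0 - min_drop) * base:
                    events.append(i)
                # Adopt the new level as the start of a fresh baseline.
                history.clear()
                history.extend(trigger)
                trigger = []
        else:
            history.append(x)
            trigger = []                 # simplification: one good sample resets
    return events
```

With these assumed defaults, a single low measurement is ignored, whereas five consecutive low measurements averaging more than 30% below the baseline produce one event.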

We are studying pathneck, a new packet train method that appears to work better at high speeds than packet pair techniques. We hope to use it to gather information on path bottlenecks after an event is detected.

We worked with the author of the achievable TCP throughput tool thrulay to specify required new features. Google funded development of thrulay over the summer so the enhancements have been added. We now need to evaluate the enhancements.

The integration of IEPM-BW into MonALISA to provide improved navigation and visualization has been completed.

Passive Monitoring

Passive monitoring provides data from real user applications making real file-to-file transfers for real users to real collaborating sites. It adds no extra traffic to the network and does not require us to make reservations or obtain accounts, passwords, keys or certificates. We are evaluating its effectiveness for providing estimates of achievable throughput (e.g. for grid middleware) by looking at Netflow records of large (>1 MByte) flows at the SLAC border router for the last 9 months. There are about 30K passive Netflow measurements per day to about 70 sites. Comparisons with the active measurements (where available) show good agreement, and aggregating multiple parallel streams is relatively simple and accurate. From the active measurements, 90% of the paths have negligible seasonal variation, so the data can be aggregated over long periods. Over a 9 month period, 40% of the throughput distributions of the flows between SLAC and a given site have a single mode, 30% have two modes and 30% have three or more modes. We are evaluating the causes of the multi-modality, e.g. hosts with different network connections, CPU speeds or configurations. We are also looking at what to report in terms of percentiles etc.
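
To make the aggregation of parallel streams concrete, here is a minimal Python sketch under stated assumptions. The Flow record, its field names and the rule of merging flows between the same host pair that overlap in time are hypothetical; they are one plausible way to do it, not a description of the actual analysis code. Only the >1 MByte cut comes from the text above.

```python
from collections import namedtuple

# Hypothetical flow record; start and end are in seconds.
Flow = namedtuple("Flow", "src dst start end nbytes")

MIN_BYTES = 1_000_000  # keep only large (>1 MByte) flows


def mbits_per_s(nbytes, start, end):
    """Throughput in Mbits/s for nbytes sent between start and end."""
    return nbytes * 8 / (end - start) / 1e6


def aggregate_parallel(flows, min_bytes=MIN_BYTES):
    """Merge time-overlapping flows between the same host pair.

    Returns a list of (src, dst, throughput in Mbits/s), one entry per
    merged transfer. Zero-duration flows are skipped.
    """
    by_pair = {}
    for f in flows:
        if f.nbytes > min_bytes and f.end > f.start:
            by_pair.setdefault((f.src, f.dst), []).append(f)

    results = []
    for (src, dst), group in by_pair.items():
        group.sort(key=lambda f: f.start)
        start, end, total = group[0].start, group[0].end, group[0].nbytes
        for f in group[1:]:
            if f.start <= end:           # overlaps: treat as a parallel stream
                end = max(end, f.end)
                total += f.nbytes
            else:                        # gap: close the transfer, start anew
                results.append((src, dst, mbits_per_s(total, start, end)))
                start, end, total = f.start, f.end, f.nbytes
        results.append((src, dst, mbits_per_s(total, start, end)))
    return results
```

For example, two 50 MByte flows between the same hosts that both run from t=0 s to t=10 s are reported as a single 80 Mbits/s transfer.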

PingER and Developing Region Monitoring

The focus this quarter is on providing better management tools for PingER so we can more easily ensure the data is of high quality. To check that hosts are where we believe they are, we are building a tool to make round trip measurements to selected hosts from landmarks (e.g. PingER monitoring sites) so we can triangulate to determine the real position of the host. To support this we put together a secure ping server to be deployed at PingER monitoring sites.
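
One simple consistency check that round trip measurements from landmarks allow can be sketched in Python as follows. It is not the tool itself: the function names and the input layout are hypothetical. It relies only on the fact that a signal in fibre travels at roughly two thirds of the speed of light (about 200 km per millisecond), so a minimum round trip time puts an upper bound on how far the host can be from a landmark.

```python
import math

KM_PER_MS = 200.0        # approximate one-way propagation speed in fibre
EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def max_distance_km(min_rtt_ms):
    """Upper bound on the landmark-to-host distance implied by an RTT."""
    return (min_rtt_ms / 2.0) * KM_PER_MS


def inconsistent_landmarks(claimed, measurements):
    """Return the landmarks whose RTT is too small for the claimed position.

    claimed: (lat, lon) of the host as recorded in the database.
    measurements: iterable of (name, lat, lon, min_rtt_ms), one per landmark.
    """
    bad = []
    for name, lat, lon, rtt in measurements:
        distance = haversine_km(claimed[0], claimed[1], lat, lon)
        if distance > max_distance_km(rtt):
            bad.append(name)
    return bad
```

A non-empty result means the recorded position cannot be right, since the host answered faster than light in fibre could cover the claimed distance. An empty result does not prove the position is correct, only that the measurements do not contradict it.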

We put together a case study of the fiber outage to Pakistan from June 27 to July 8, 2005.

We added a monitoring site in S. Africa, and monitored sites in four African countries, in Manaus, Brazil, and in Israel. We are working with contacts to get sites in Palestine. We validated the data being measured from S. Africa and configured the new monitoring site to measure to a suitable set of sites.


The 10Gbps wide area network testbed at Sunnyvale is still in place with a connection to UltraLight.

With Caltech, Manchester, FNAL, CERN and others, we are once again preparing to participate in the SC2005 (in Seattle) Bandwidth Challenge (BWC). We have put together a web site to publicize our efforts. Equipment loans have been secured from Sun, Cisco, Boston Computers, QLogic, Neterion, and Chelsio. We have arranged for seven 10 Gbits/s waves to the SLAC/FNAL booth (2 from SLAC, 4 from FNAL and 1 from the UK). At SLAC we are installing an xrootd cluster of ten Sun v20z dual 1.8GHz Opterons, plus 4 file servers. At SC2005 we will have eight file servers from Boston Computers, a cluster of ten Sun v20z with dual 2.4GHz Opterons, and a 40 Gbits/s fibre channel connection to 20 TBytes in the StorCloud booth. We are hoping to win the BWC for the third year in succession.

We have made contact with Microsoft and are working on an MOU to evaluate a new TCP stack on real networks.

Admin, visits, papers, presentations, proposals etc.

Article on PingER published in Science Grid This Week.

Submitted proposal to USAID for the SLAC/NIIT collaboration to provide monitoring for PERN/NTC.

Submitted paper on "Anomalous Event Detection" to NOMS 2006.

We made the following presentations: