Report on IEPM PPDG Efforts for the Quarter July - September 2005
Report by Les Cottrell,
SLAC
Bandwidth/Throughput Monitoring
The DataGrid Wide Area Network Monitoring Infrastructure (DWMI) now has
IEPM-BW monitoring successfully installed, making measurements, and
collecting, analyzing and reporting results at BNL, Caltech, CERN, FNAL
and SLAC.
We are now using the plateau method of detecting significant,
persistent drops (events) in network performance in production.
It is now used to generate email alerts. Typically we are seeing a couple of
alerts per week. These are being carefully reviewed, and case studies (see
Network Problem Case Studies) are being developed. The results are
encouraging. Next we need to carefully quantify the success of the
method in terms of false positives, missed events, etc. We are also
working on gathering extra relevant information to report in the
alerts.
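To illustrate the general idea of flagging only persistent drops, here is a
minimal Python sketch in the spirit of a plateau-style detector. It is not
the production code: the function name, the buffer lengths and the
sensitivity threshold are all hypothetical, and the deployed algorithm may
differ in its details. A sample is treated as suspicious when it falls more
than a chosen number of standard deviations below the recent baseline, and
an event is declared only after several consecutive suspicious samples.

```python
from collections import deque
from statistics import mean, pstdev


def detect_drops(samples, history_len=20, trigger_len=5, sensitivity=2.0):
    """Return the indices at which a persistent drop (event) is declared.

    All parameter names and defaults are illustrative assumptions.
    """
    history = deque(maxlen=history_len)  # recent "normal" observations
    trigger = []                         # consecutive suspiciously low observations
    events = []
    for i, x in enumerate(samples):
        if len(history) < history_len:
            history.append(x)            # still learning the baseline
            continue
        baseline, spread = mean(history), pstdev(history)
        if x < baseline - sensitivity * spread:
            trigger.append(x)
            if len(trigger) == trigger_len:
                events.append(i)         # the drop has persisted: raise an event
                history.clear()          # adopt the new, lower level as baseline
                history.extend(trigger)
                trigger = []
        else:
            trigger = []                 # an isolated dip is forgotten
            history.append(x)
    return events


if __name__ == "__main__":
    # Hypothetical throughput series (Mbits/s): steady, then a sustained drop.
    series = [100.0] * 30 + [50.0] * 10
    print(detect_drops(series))
```

Requiring several consecutive low samples is what keeps a single bad
measurement from generating an alert, at the cost of a delay before a real
drop is reported.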
We are studying a new packet train method, pathneck, that appears to
work better at high speeds than packet pair techniques. We are hoping to use
it to gather information on path bottlenecks after detecting an event.
We worked with the author of the achievable TCP throughput tool
thrulay to specify required new features. Google funded development
of thrulay over the summer, so the enhancements have been added. We now need
to evaluate them.
The integration of IEPM-BW into MonALISA to provide improved navigation
and visualization has been completed.
Passive Monitoring
Passive monitoring provides data from real user applications
making real file-to-file transfers for real users to real
collaborating sites. It adds no extra traffic to the network and
does not require us to make reservations or get accounts, passwords,
keys or certificates.
We are evaluating its effectiveness for providing estimates
of achievable throughput (e.g. for grid middleware) by looking at
Netflow records of large (>1 MByte) flows at the SLAC border router
for the last 9 months. Daily there are about 30K passive Netflow
measurements to about 70 sites. Comparisons with the active
measurements (where available) show good agreement, and aggregating
multiple parallel streams is relatively simple and accurate.
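As an illustration of how parallel streams might be aggregated, the Python
sketch below groups flows between the same pair of hosts that start within
a short window of each other and reports one throughput per group. The
record layout (src, dst, start, end, bytes), the function name and the
one second grouping window are assumptions made for the example. They are
not the Netflow export format or the method actually used in the analysis.

```python
from collections import defaultdict


def aggregate_parallel_flows(flows, window_s=1.0, min_bytes=1_000_000):
    """Combine parallel streams into one throughput estimate per transfer.

    flows: iterable of dicts with hypothetical keys
           src, dst, start (s), end (s), bytes.
    """
    transfers = defaultdict(list)  # (src, dst) -> list of stream groups
    for f in sorted(flows, key=lambda f: f["start"]):
        if f["bytes"] <= min_bytes:
            continue               # keep only large (>1 MByte) flows
        groups = transfers[(f["src"], f["dst"])]
        # Streams starting close to the first stream of the latest group
        # are treated as parallel streams of the same transfer.
        if groups and f["start"] - groups[-1][0]["start"] <= window_s:
            groups[-1].append(f)
        else:
            groups.append([f])

    results = []
    for (src, dst), groups in transfers.items():
        for streams in groups:
            total_bytes = sum(s["bytes"] for s in streams)
            duration = max(s["end"] for s in streams) - min(s["start"] for s in streams)
            if duration > 0:
                results.append({
                    "src": src,
                    "dst": dst,
                    "streams": len(streams),
                    "mbps": total_bytes * 8 / duration / 1e6,
                })
    return results


if __name__ == "__main__":
    # Hypothetical records: two parallel streams, one tiny flow, one later flow.
    records = [
        {"src": "a", "dst": "b", "start": 0.0, "end": 10.0, "bytes": 5_000_000},
        {"src": "a", "dst": "b", "start": 0.2, "end": 10.0, "bytes": 5_000_000},
        {"src": "a", "dst": "b", "start": 0.3, "end": 0.4, "bytes": 500},
        {"src": "a", "dst": "b", "start": 100.0, "end": 104.0, "bytes": 2_000_000},
    ]
    for row in aggregate_parallel_flows(records):
        print(row)
```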
From the active measurements, 90% of the paths have negligible
seasonal variation, so the data can be aggregated over long periods.
Over a 9 month period, 40% of the throughput distributions of the flows
between SLAC and a given site have a single mode, 30% have two modes and
30% have three or more modes.
We are evaluating the causes of the multi-modality, e.g. hosts with
different network connections, CPU speeds or configurations. We are also
looking at what to report in terms of percentiles etc.
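One possible way to summarize a per-site throughput distribution is shown
in the Python sketch below. It reports a few percentiles and makes a crude
count of modes by looking for local maxima in a coarse histogram. The
function name, bin width, minimum share threshold and choice of
percentiles are illustrative assumptions, not the analysis actually being
used.

```python
from statistics import quantiles


def summarize_throughput(throughputs, bin_width=10.0, min_share=0.1):
    """Return percentiles and a rough mode count for a list of throughputs.

    bin_width is in the same units as the data. A histogram peak counts as
    a mode only if it holds at least min_share of all samples. Needs at
    least two samples.
    """
    cuts = quantiles(throughputs, n=100, method="inclusive")  # 99 cut points
    summary = {"p10": cuts[9], "p50": cuts[49], "p90": cuts[89]}

    # Build a coarse histogram keyed by bin index.
    counts = {}
    for t in throughputs:
        b = int(t // bin_width)
        counts[b] = counts.get(b, 0) + 1
    hist = [counts.get(b, 0) for b in range(min(counts), max(counts) + 1)]

    # Count local maxima, padding with zeros so edge bins can be peaks.
    padded = [0] + hist + [0]
    modes = 0
    for i in range(1, len(padded) - 1):
        is_peak = padded[i] > padded[i - 1] and padded[i] >= padded[i + 1]
        if is_peak and padded[i] >= min_share * len(throughputs):
            modes += 1
    summary["modes"] = modes
    return summary


if __name__ == "__main__":
    # Hypothetical data (Mbits/s): a slow group of hosts and a fast group.
    data = [100, 102, 104, 106, 108] * 4 + [300, 305] * 5
    print(summarize_throughput(data))
```

A fixed bin width is sensitive to noise, so a real analysis would probably
smooth the histogram or use a density estimate before counting modes.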
PingER and Developing Region Monitoring
The focus this quarter is on providing better management tools for PingER
so we can more easily ensure the data is of high quality. To check that
hosts are where we believe they are, we are building a tool to make
round trip measurements to selected hosts from landmarks (e.g.
PingER monitoring sites) so we can triangulate to determine the
real position of the host. To support this we put together a secure
ping server to be deployed at PingER monitoring sites.
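The reasoning behind such a location check can be sketched in Python as
follows. A minimum round trip time puts an upper bound on how far a host
can be from a landmark, so a claimed location that lies outside that bound
for any landmark must be wrong. This is only a consistency check under
stated assumptions, not the tool being built: the function names and
landmark coordinates are hypothetical, and the bound of roughly 100 km per
millisecond of round trip time assumes signals travel at about two thirds
of the speed of light in fibre.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0
# Assumption: about 200 km per ms one way in fibre, so 100 km per ms of RTT.
KM_PER_MS_RTT = 100.0


def great_circle_km(a, b):
    """Haversine distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (a[0], a[1], b[0], b[1]))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(min(1.0, sqrt(h)))


def inconsistent_landmarks(claimed, measurements):
    """Return names of landmarks whose RTT is too small for the claimed location.

    claimed:      (lat, lon) where we believe the host is.
    measurements: list of (name, (lat, lon), min_rtt_ms) per landmark.
    """
    bad = []
    for name, position, min_rtt_ms in measurements:
        max_distance_km = min_rtt_ms * KM_PER_MS_RTT
        if great_circle_km(claimed, position) > max_distance_km:
            bad.append(name)
    return bad


if __name__ == "__main__":
    # Hypothetical example: a host claimed to be at (33.7, 73.1) answers a
    # landmark at (37.4, -122.2) in 20 ms, which is physically impossible.
    print(inconsistent_landmarks((33.7, 73.1), [("landmark_a", (37.4, -122.2), 20.0)]))
```

An empty result does not prove the claimed location is right. It only
means no landmark contradicts it, so more landmarks with small round trip
times give a tighter check.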
We put together a case study of
the fiber outage to Pakistan from June 27 to July 8, 2005.
We added a monitoring site in S. Africa, and monitored sites
in four African countries, in Manaus, Brazil, and in Israel. We are working
with contacts to get sites in Palestine. We validated the data being measured
from S. Africa and configured it to measure to a suitable set of sites.
Testbeds
The 10Gbps wide area network testbed at Sunnyvale is still in place with a
connection to UltraLight.
With Caltech, Manchester, FNAL, CERN and others, we are once again
preparing to participate in the SC2005 (in Seattle) BandWidth Challenge
(BWC). We have put together a
web site to publicize our efforts. Equipment loans have been secured from
Sun, Cisco, Boston Computers, QLogic, Neterion, and Chelsio.
We have arranged for seven 10 Gbits/s waves to the SLAC/FNAL booth
(2 from SLAC, 4 from FNAL and one from the UK). At SLAC we are installing
an xrootd cluster of ten Sun v20z dual 1.8GHz Opterons, plus 4 file
servers. At SC2005 we will have eight file servers from Boston Computers,
a cluster of ten Sun v20z with dual 2.4GHz Opterons, and a
40 Gbits/s fibre channel connection to 20 TBytes in the StorCloud
booth.
We are hoping to win the BWC for the third year in succession.
We have made contact with Microsoft and are working on an MOU to evaluate
a new TCP stack on real networks.
Admin, visits, papers, presentations, proposals etc.
Article on PingER published in Science Grid This Week.
Submitted proposal to USAID for the SLAC/NIIT collaboration to provide
monitoring for PERN/NTC.
Submitted paper on "Anomalous Event Detection" to NOMS 2006.
We made the following presentations:
- Terapaths: Datagrid Wide Area Monitoring Infrastructure (DWMI)
  presented by Les Cottrell at the DoE Network Research PI meeting, BNL,
  Sept '05.
- Report from ICFA Digital Divide Workshop, Daegu, Korea, May 23-27 '05
  presented by Les Cottrell at the Internet2 Fall Members meeting,
  Philadelphia, Sept 21, 2005.
- Network Monitoring for SCIC
  prepared by Les Cottrell for the ICFA meeting, September 2005.
- Network Monitoring Tools for High Performance Networks presented
  by Les Cottrell at the Internet2 Fall 2005 Members meeting, Philadelphia,
  Sep 19, 2005.
- Network Monitoring for ICFA/SCIC
  presented by Les Cottrell at the ICFA/SCIC meeting, 8/24/05.
- SLAC Site Report, presented by Les Cottrell for the
  ICFA/SCIC meeting, 8/24/05.
- Monitoring 10Gbits/s and Beyond presented by Les Cottrell at the
  LHC tier0, tier1 meeting, CERN, July 19 '05.