FNAL/SLAC IEPM meeting

10/12/00, at SLAC

Rough notes by Les Cottrell


Administrivia:

Attendees:
Warren Matthews, Les Cottrell, Frank Nagy, Al Thomas.

Boldface: Action items
Italics: Decisions

People issues

HEPNRC has been shut down. George Beranek has gone to ANL and is available for consulting. Frank Nagy (the technology planning person at FNAL) has taken over at less than one FTE to get the current system running in a maintainable form. He will also do strategy planning. The goal is to hire someone full time to do ongoing development; they have the funding in FY01. The new person will need to be computer literate (in Unix/Linux/Solaris, possibly NT), a physicist with knowledge of handling data, analysing data, etc. The plan is to look at submitting a FWP in Spring FY01 to provide ongoing support at the 1 FTE level. They have a DC level of about 0.5 FTE. FNAL management wants to continue PingER/IEPM as a joint SLAC/FNAL project.

Computing Issues

The current system at FNAL is hardly maintainable and not extensible. Frank is looking at a redesign with generic accounts/directories, network storage with archiving, and new hardware. SAS is very heavyweight, only lightly used, and costly to replicate. Nowadays MySQL and Perl/Python/Java look like better candidates: more ubiquitous, less costly, better known. Performance is not an issue. There had been no decision on the operating system, Linux vs. WNT. The SLAC code runs on Linux; it could be ported to NT, but at some effort. It was agreed to use Linux as the system of choice. Any changes will need to be made in a way that is evolutionary for users, i.e. the existing system will continue to work during the transition.

Languages

Python is very popular at FNAL. SLAC is heavily oriented to Perl. Perl is pretty mature and probably more widespread than Python. It was decided to go ahead with Perl. We are also looking at Java, especially for visualization. Do we need a code management system such as CVS? Most code is developed by a single person. One advantage of a code management system is that it enables multi-site/multi-developer work and also allows for a person to move in or out. Both SLAC and FNAL use CVS. We agreed to use CVS when a code management system is needed. There is also an email list set up to facilitate communication between developers; it is at pinger@fnal.gov. It will be archived.

Ping itself

Replace or enhance ping with SMTP, HTTP, FTP, IPERF ... This may require a redesign of the infrastructure. Correlate ping and traceping. Make the contents of the ping packet meaningful (e.g. hello this packet is from ... for further information contact pinger@fnal.gov); this should be easy to do and will help people understand who is probing them. Frank will look at this.

Timeping

Currently the nodes table provides the name of the remote site and optionally the IP address. Using only the IP address causes a problem if the machine's IP address changes. On the other hand, name lookup fails quite often. So we would like a better algorithm that tries the name first and, if that fails, tries the IP address.
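A minimal sketch of the name-then-IP fallback, in Python purely for illustration (the notes settle on Perl for the real code). The function name resolve_target and its lookup parameter are hypothetical; lookup defaults to the standard socket.gethostbyname and is injectable only so the logic can be exercised without a network.

```python
import socket


def resolve_target(name, fallback_ip=None, lookup=socket.gethostbyname):
    """Return the address to ping: try the name first, then the stored IP.

    name        -- host name from the nodes table
    fallback_ip -- optional IP address from the nodes table
    lookup      -- resolver function (hypothetical hook, for testing)
    """
    try:
        return lookup(name)
    except OSError:
        # socket.gaierror and socket.herror are subclasses of OSError.
        if fallback_ip is not None:
            return fallback_ip
        raise  # no stored IP to fall back on; let the caller log the failure
```

With this ordering a renumbered host is still found by name, and a failed lookup still yields a measurement as long as the table holds an IP address.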

Use Poisson scheduling to make the measurements at random times. This is a better sampling method. We need to do it; it has been "promised" for a long time. We have the NIKHEF ping with source code. The NIKHEF ping currently only allows scheduling down to 1 second. It could be modified to be subsecond. There are two scheduling levels: the current half-hour scheduling of a cluster/sample of 21 pings, and the scheduling of the pings within the cluster. The latter would need a modification of the ping code.
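As a rough illustration of the outer scheduling level: in a Poisson process the gaps between events are exponentially distributed, so sample start times can be generated by summing exponential gaps. This is a sketch only, in Python rather than the agreed Perl; the function name and parameters are hypothetical, and it does not address the sub-second scheduling inside a cluster.

```python
import random


def poisson_start_times(mean_interval_s, horizon_s, rng=None):
    """Return sample start times (seconds from 0) within horizon_s.

    Gaps between consecutive starts are drawn from an exponential
    distribution with mean mean_interval_s, which makes the start
    times a Poisson process.
    """
    rng = rng or random.Random()
    rate = 1.0 / mean_interval_s  # expovariate takes the rate, not the mean
    times = []
    t = rng.expovariate(rate)
    while t < horizon_s:
        times.append(t)
        t += rng.expovariate(rate)
    return times
```

For example, poisson_start_times(1800, 86400) gives one day of start times that average one per half hour but are not evenly spaced.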

Ping customizing/directives to allow different packet sizes or frequency. Al had a concern about allowing complete flexibility on the size directive since it could be used to do ping o' death. We will limit the maximum size to an MTU (~1500 bytes).

Other directives include other measurement methods (e.g. HTTP, IPERF...), frequency of sending, how to log the data (i.e. format), the scheduling algorithms, the duration of the sample (today we only note the time of the start of the sample). It would also be good to look at the first ping since it primes the caches etc. and behaves differently. This would be facilitated if we used self describing data, e.g. XML.

The directives file may need validation before it is actually put into production; this could be a parameter to timeping so that the existing code can be used.
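A sketch of what such a validation pass might check, assuming a directive is represented as a dictionary. The key names (host, size, interval_s) and the defaults are assumptions for illustration; no directive format has been defined. The 1500-byte cap is the MTU limit agreed above.

```python
MAX_PACKET_BYTES = 1500  # MTU-sized cap agreed in the notes


def validate_directive(directive):
    """Return a list of problems found in one directive (empty if valid).

    The keys "host", "size" and "interval_s" are hypothetical names;
    the defaults (100 bytes, 1800 s) mirror current practice.
    """
    problems = []
    if not directive.get("host"):
        problems.append("missing host")
    size = directive.get("size", 100)
    if not isinstance(size, int) or not 1 <= size <= MAX_PACKET_BYTES:
        problems.append("size must be an integer from 1 to %d" % MAX_PACKET_BYTES)
    interval = directive.get("interval_s", 1800)
    if not isinstance(interval, (int, float)) or interval <= 0:
        problems.append("interval_s must be positive")
    return problems
```

Run in a check-only mode, this would reject an oversized packet request before the file goes into production.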

Data Collection

There have been problems at SLAC in collecting, and ensuring the integrity of, data from DoE/MICS and U Wisconsin. It is unknown whether FNAL has similar problems. Frank will look at this. This is a candidate for automation, requiring human intervention only for exceptions.

FNAL is looking at how to do this in an architected fashion. They are looking at DVD RAMs and/or robotics. There may be more than one need: one for integrity of data (e.g. a traditional backup), and one for user access to archives. The latter might be handled by just having enough disk storage online.

Further analysis

We would like to provide mean time between failures (MTBF) and mean time to repair (MTTR) for unreachability.
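One possible way to derive these from the existing samples, sketched in Python. It assumes each sample reduces to a (time, reachable) pair and that the state seen at one sample holds until the next; both are simplifying assumptions, and the function name is hypothetical.

```python
def mtbf_mttr(samples):
    """samples: list of (time_s, reachable) tuples sorted by time.

    Returns (mtbf_s, mttr_s). Either value is None when the data
    contains no failure or no completed repair, respectively.
    """
    up_time = down_time = 0.0
    failures = repairs = 0
    for (t0, ok0), (t1, ok1) in zip(samples, samples[1:]):
        if ok0:
            up_time += t1 - t0
            if not ok1:
                failures += 1  # reachable -> unreachable transition
        else:
            down_time += t1 - t0
            if ok1:
                repairs += 1  # unreachable -> reachable transition
    mtbf = up_time / failures if failures else None
    mttr = down_time / repairs if repairs else None
    return mtbf, mttr
```

With half-hour samples the resolution of both figures is limited to 30 minutes, and an outage still in progress at the end of the data inflates the MTTR slightly.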

Analysis code needs to be able to handle ping sizes other than 100 and 1000 bytes, as might be created by timeping directives. The analysis keeps the current 2 sizes separate and allows the user to select which size. But what if the monitoring site does not use these sizes? It sounds like one should either restrict the choice of sizes (e.g. 56, 100, 1000) or join sizes together (e.g. 56-100 = small, 1000 = large). An example of a site that has non-standard sizes is Taiwan. Small packets may restrict the message inside the ping. Warren will make a decision on what sizes to allow so Frank can implement it in timeping.
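If the "join sizes together" option were chosen, the analysis could map each packet size to a class before aggregating. The sketch below uses the example ranges from the discussion (56-100 = small, 1000 = large); the actual ranges are still to be decided, and the function name is hypothetical.

```python
def size_class(packet_bytes):
    """Map a ping packet size to an analysis class, or None if unsupported.

    Ranges are the examples from the meeting, not a final decision.
    """
    if 56 <= packet_bytes <= 100:
        return "small"
    if packet_bytes == 1000:
        return "large"
    return None  # size outside the allowed set; exclude or flag it
```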

Anomalies such as out-of-order packets and duplicate packets: how should these be recognized, analyzed, and reported on?
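One simple way to recognize both anomalies is from the sequence numbers of the replies in arrival order. This is a sketch under that assumption only: it ignores sequence-number wrap-around, and the function name is hypothetical.

```python
def classify_replies(seqs):
    """seqs: reply sequence numbers (non-negative) in arrival order.

    Returns (duplicates, out_of_order): a reply is a duplicate if its
    sequence number was already seen, and out of order if it arrives
    after a reply with a higher sequence number.
    """
    seen = set()
    highest = -1
    duplicates, out_of_order = [], []
    for seq in seqs:
        if seq in seen:
            duplicates.append(seq)
        elif seq < highest:
            out_of_order.append(seq)
        seen.add(seq)
        highest = max(highest, seq)
    return duplicates, out_of_order
```

For the arrival order 0, 1, 3, 2, 2, 4 this reports sequence number 2 once as out of order and once as a duplicate.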

Application monitoring is interesting to users, but is not a good measure of network performance. However, users are interested in Web access. This might be a secondary project. Some of the mechanisms, such as IPERF, are quite intrusive. This also gets into an area where commercial concerns are very interested, so there may be alternatives.

Alarms are a fascinating area but non-trivial if one does not want false positives. One needs a profile of the expected performance together with the variability. This is of considerable interest to network operators and probably beyond our scope.

Throughput information from PingER may be possible and useful, e.g. using the RTT and loss to predict a bandwidth. This will need careful validation and understanding. Can one look at load vs. RTT, e.g. is the variability of the RTT a measure of utilization? Pathchar provides a method of measuring throughput that works reasonably at low bottleneck bandwidths (< T3), but it is network intrusive and can take a long time.
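One well-known candidate for predicting bandwidth from RTT and loss is the Mathis et al. approximation for steady-state TCP throughput, BW <= (MSS/RTT) * C/sqrt(p), with C about sqrt(3/2). Whether this particular formula is the one to use was not decided at the meeting; the sketch below is only an illustration, and treating ping loss as the TCP loss rate p is exactly the kind of assumption that needs validation.

```python
import math


def mathis_throughput_bps(mss_bytes, rtt_s, loss_rate, c=math.sqrt(1.5)):
    """Upper-bound TCP throughput estimate in bits per second.

    mss_bytes -- TCP maximum segment size in bytes
    rtt_s     -- round trip time in seconds
    loss_rate -- packet loss probability, 0 < loss_rate <= 1
    c         -- constant from the Mathis model (about 1.22)
    """
    if rtt_s <= 0 or not 0 < loss_rate <= 1:
        raise ValueError("need rtt_s > 0 and 0 < loss_rate <= 1")
    return 8 * mss_bytes * c / (rtt_s * math.sqrt(loss_rate))
```

With an MSS of 1460 bytes, a 100 ms RTT and 1% loss, the estimate is about 1.4 Mbit/s. The formula is undefined at zero loss, so samples with no lost packets need separate handling.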

Passive monitoring means sniffing packets using tcpdump, OC3Mon or netflow and then analysing them to provide information on flows, protocols, and utilization, supplementing the active monitoring. This can also be useful for security/intrusion detection. The FNAL network folks are running netflow and analyzing the data. The contact is Phil DeMar.

Traceroute provides useful data, but is tricky to analyze and is increasingly blocked. We are using traceping from Oxford University. Getting the ASs out of the traceroutes is also useful and possible, but the databases of AS information can be suspect.

Improved visualization of the PingER data is a frequent request. There are CAIDA/NLANR tools (e.g. Cichlid and Otter) for providing topology displays. What about using ROOT? We have a local community of HEP people who can help with this, and it could help our migration away from SAS. An alternative might be to use the Perl graphing modules.

Future possible developments include:

Prioritization

Timeping needs a clean up and extension:

Other things:

Next Steps

FNAL will continue to work on hiring someone. Frank is also working on backup/archive, getting new hardware, organizing software, etc. Frank will look at some of the top priority items. There will be an exchange of emails on the new data format. Les will send out notes from the meeting. We will use pinger@fnal.gov for communication.