Outage frequencies between SLAC, FNAL, CMU and CERN
Supported by
DOE MICS
The plots on this page show the outage duration frequencies between SLAC,
FNAL, CMU and CERN. The outage duration is measured from the Surveyor
one-way probes by finding how long consecutive probes fail to get through.
The Surveyor probes are launched on average 2 times/second with a
Poisson distribution.
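As a minimal sketch of this measurement, the function below scans a time-ordered probe log and reports how long each run of consecutive lost probes lasted. The record layout (a send time in seconds plus a received flag) and the choice to measure an outage from the first lost probe to the next probe that gets through are assumptions made for illustration; they are not taken from the Surveyor software.

```python
def outage_durations(probes):
    """Return the duration in seconds of each run of consecutive lost probes.

    probes: list of (send_time_seconds, received) tuples sorted by send time.
    This layout is hypothetical, not the Surveyor data format.
    """
    durations = []
    run_start = None  # send time of the first lost probe in the current run
    for send_time, received in probes:
        if not received:
            if run_start is None:
                run_start = send_time
        elif run_start is not None:
            # The run ends at the first probe that gets through again.
            durations.append(send_time - run_start)
            run_start = None
    # A run still open at the end of the log has no known end and is dropped.
    return durations


sample = [(0.0, True), (0.5, False), (1.0, False), (1.5, True),
          (2.0, False), (2.5, True), (3.0, False)]
durations = outage_durations(sample)
```

With probes sent about twice per second, this approach cannot resolve outages much shorter than the spacing between probes.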
Outage frequencies for all sites
The plot below shows the outage frequency for all site pairs combined
for November 1998 through July 1999. It includes data from about 284 million probes, or 142
million seconds of measurement.
To assist in seeing how the outage duration frequencies behave for different
outage ranges, the data are binned into 1 second (dark blue), 10 second (red)
and 100 second (green) bins. The light blue points are the 1 second bins out to a 20
second duration. The lines are power series fits, shown with their parameters
and R² values. The blue dashed line is a fit to the data binned in 1 second
bins out to 20 seconds. The data are seen to have a strong correlation (R² > 0.6)
to the power series fits.
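The kind of fit described here can be sketched as follows, assuming the fitted curves have the form frequency = a * duration^b and that the fit and R² are computed by ordinary least squares on the logarithms of both quantities. Both are assumptions; the page does not say how the fits were produced. The sample numbers are synthetic, not measured data.

```python
import math


def fit_power_law(durations, frequencies):
    """Least-squares fit of frequency = a * duration**b in log-log space.

    Returns (a, b, r_squared). All inputs must be strictly positive.
    """
    xs = [math.log(d) for d in durations]
    ys = [math.log(f) for f in frequencies]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx                # slope of the log-log line (the exponent)
    log_a = mean_y - b * mean_x  # intercept of the log-log line
    ss_res = sum((y - (log_a + b * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return math.exp(log_a), b, 1.0 - ss_res / ss_tot


# Synthetic data lying exactly on 100 * d**-1.5, for illustration only.
bins = [1.0, 2.0, 4.0, 8.0]
counts = [100.0 * d ** -1.5 for d in bins]
a, b, r_squared = fit_power_law(bins, counts)
```

Empty bins have to be dropped before fitting, since the logarithm of zero is undefined.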
Outage probabilities for all sites
The plot below shows the probability of seeing an outage of a given length during
a phone call of duration 3 minutes (magenta dots), and the probability of not observing
an outage of > the given duration in a 3 minute phone call (dark blue crosses).
The magenta line is a power series fit to the magenta dots. The parameters of the
power series are given together with the R². The point depicting the
probability of no outage of > 1 second in a call of length 3 minutes (the value is 75%)
is not shown, in order to make it easier to read off the probabilities for
2 seconds and beyond.
The blue crosses illustrate the probability P of not observing an outage
of greater than the given number of seconds in a 3 minute time period. The 3 minute
period is chosen as the typical length of a phone call. P is defined as follows:
- Let $F_i$ be the observed frequency of an outage of duration $s_i$
seconds ($s_i < s_{i+1}$), $i = 1 \ldots M$, then
-
$K_i = K_{i+1} + F_i = \sum_{j=i}^{M} F_j$,
- is the reverse cumulative frequency distribution, with $K_{M+1} = 0$,
-
$J_i = C \cdot K_i / N$,
- where $C$ is the call length (in the above case set to 180
seconds), $N$ is the total number of seconds over which the measurement was made, and
$J_i$ is the probability of an outage of a given length in a call of length $C$,
then
-
$P_i = 1 - J_i$
- is the probability of not observing an outage
of greater than $s_i$ seconds in a call of $C$ seconds.
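The definitions above can be sketched in a few lines. The function name and the sample numbers are hypothetical and chosen only to make the arithmetic easy to check; they are not the measured frequencies.

```python
def no_outage_probabilities(frequencies, call_length, total_seconds):
    """Return [P_1, ..., P_M] from the definitions above.

    frequencies:   [F_1, ..., F_M], ordered by increasing outage duration s_i
    call_length:   C, the call length in seconds
    total_seconds: N, the total measurement time in seconds
    """
    # Reverse cumulative frequency: K_i = K_{i+1} + F_i with K_{M+1} = 0.
    k = 0
    reverse_cumulative = []
    for f in reversed(frequencies):
        k += f
        reverse_cumulative.append(k)
    reverse_cumulative.reverse()
    # J_i = C * K_i / N and P_i = 1 - J_i.
    return [1 - call_length * k_i / total_seconds for k_i in reverse_cumulative]


# Hypothetical example: 5, 3 and 2 outages in three duration bins,
# observed over 3600 seconds, for a 180 second call.
p = no_outage_probabilities([5, 3, 2], call_length=180, total_seconds=3600)
```

In this made-up example K = [10, 5, 2], so J = [0.5, 0.25, 0.1] and P = [0.5, 0.75, 0.9]. Note that J_i is only a sensible probability while C * K_i / N stays well below 1.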
Another way of looking at this data is the Outage Events metric, defined by
the Automotive Network eXchange (ANX) as the number
of outage events of 30 seconds or greater per year. For the data shown here this comes out
to about 450/year/site-pair, which is much higher than the ANX limit of 10 such events.
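A minimal sketch of this metric for a single site pair: count the outages of 30 seconds or longer and scale the count to a one-year rate. The function name, the 365-day year and the sample durations are assumptions for illustration, not part of the ANX definition or of the measured data.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # assumes a 365-day year


def outage_events_per_year(durations, observed_seconds, threshold=30.0):
    """Annualized count of outages lasting at least `threshold` seconds.

    durations:        outage durations in seconds for one site pair
    observed_seconds: length of the observation period for that pair
    """
    events = sum(1 for d in durations if d >= threshold)
    return events * SECONDS_PER_YEAR / observed_seconds


# Hypothetical example: four outages seen during half a year of monitoring.
rate = outage_events_per_year([5.0, 30.0, 45.0, 120.0], SECONDS_PER_YEAR / 2)
```

Three of the four sample outages reach the 30 second threshold, so half a year of observation scales to 6 events per year.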
Outage frequencies by month and by site pairs
The
plots below are organized by month (one month per row) and by
site pairs (one site pair
per column). The log of the frequency is plotted against the log of the
outage duration in seconds.
There are many causes of the outages, each with its own characteristic
time scale, including:
- Short term (subsecond to several seconds) outages caused by
router congestion and queue overflows.
- Longer term outages (typically tens to hundreds of seconds)
caused by failed components which cause
route reconfiguration and router convergence on the new routes.
The plot of
Ping RTT & loss between SLAC & CERN May-11, 1999 illustrates this type
of outage.
- Outages (scheduled or unscheduled) of components that do not have
automatic failover. Such outages can extend over many hours; examples include
cable cuts to a site or a power outage involving a site's border router.
[Plots: log of outage frequency versus log of outage duration in seconds, one row per month and one column per site pair: CERN to SLAC, SLAC to CERN, CERN to FNAL, FNAL to CERN, SLAC to FNAL, FNAL to SLAC, SLAC to CMU, CMU to SLAC.]
Created: 23 September 1999; last update 28 September 1999
URL: http://www-iepm.slac.stanford.edu/monitoring/surveyor/outage.html
Comments to
iepm-l@slac.stanford.edu