Alarm Descriptions


Summary:
CRC_ERROR='CRC MISMATCH'                #CRC error if caught by mover.
CRC_ENCP_ERROR='CRC ENCP MISMATCH'      #CRC error if caught by encp.
CRC_ECRC_ERROR='CRC ECRC MISMATCH'      #CRC error if caught by encp --ecrc.
CRC_DCACHE_ERROR = "CRC DCACHE MISMATCH"  #CRC error if caught by encp.
DEVICE_ERROR = "DEVICE ERROR"           #read()/write() call stuck in
kernel.

Note: "read_tape CRC error" and "write_tape CRC error" alarms raised 
by the mover mean "CRC MISMATCH" to encp.



E-mail:
Alarms that are e-mailed to CDF (cdfdh_oper@fnal.gov) and D0 
(d0farm@d0mino01.fnal.gov *) **:
CRC ENCP MISMATCH
CRC ECRC MISMATCH
CRC DCACHE MISMATCH
DEVICE ERROR

* Why this is d0farm@d0mino01.fnal.gov and not d0farm@fnal.gov is a
  mystery to me, but that is what D0 requested.

** We don't have the ability to send these e-mails based on storage 
   groups. This is why on STKEN ISA needs to be proactive to tell the 
   experiments something is wrong.



Full description:

read_tape CRC error
write_tape CRC error
CRC MISMATCH:  These are the CRC errors detected by the mover.  If this 
error(s) occurs then some combination of the tape or drive must be looked 
at.  ISA should start by looking at the drive, but if nothing is found 
move on to looking at the tape.

CRC ENCP MISMATCH:  This error occurred because the file became corrupted
while in transit between the mover and encp.  Statistically we will see 
one of these every so often.  However, more than one (in a reasonable 
period of time) indicates a problem.  The user should retry the file if 
nothing is found broken (i.e. disk, nic, memory) on their node.

CRC ECRC MISMATCH:  This error is detected when the file is written to 
local disk (reading from tape) and then reread by encp to verify it was 
written correctly only to find it was not written correctly.  One of these 
errors my occur every now and then due to hardware reliability issues.  
However, most often these occur in great numbers and indicate that the 
disk on the user's computer is failing.
   Note: User's need to enable this check on the command line: --ecrc

CRC DCACHE MISMATCH:  First, encp has transfered the file and the mover 
and encp CRC's match.  But when encp compares those matching CRCs with the 
CRC dCache wrote in layer 2 they do not match.  This mostly occurs when 
writing the file to tape and happens because of rewrites in dCache.
The file on disk can not be trusted to be correct and the user/experiment
need to rewrite the file to dCache.

DEVICE_ERROR:  First you need to examine the error to determine if the
error was in read() or write().  Next, you need to determine if the 
read()/write() in error was to the local file or to the network talking to 
the mover (HINT: check the infile [read] and outfile [write] names).  If 
the error was on the users local disk, then they need to be informed
of the disk errors.  Past experience with the error happening on the 
network side indicates that dCache is thrashing the node to death;
it would be nice to notify the appropriate system admin (for CMS 
or CDF; for FNDCA that would be ISA) that such a state has been detected.
This type of error happens almost exclusively in large numbers.

selective CRC check error:  This is the mover alarming about a possible 
bad drive.  The drive should be checked by someone in ISA.  If nothing is 
found wrong with the drive the tape should be checked.  There should be 
a matching "Possible drive failure" alarm.

Possible drive failure:  This is the mover alarming about a possible bad 
drive.  The drive should be checked by someone in ISA.


    

Last modified: Fri Sep 8 15:08:03 CDT 2006