Items to Watch to Make Sure Enstore Runs Smoothly
$Revision: 1.11 $
$Date: 2000/02/10 15:48:39 $GMT
Enstore was written to monitor itself and take appropriate action
whenever it can. Sometimes this is not possible. Keeping an eye on
the following items allows an administrator to be proactive and ensure
that the system keeps running efficiently.
- The Alarm Page generated by the Alarm server lists items that need
immediate administrator attention. After the alarm has been resolved,
you can clear it by using the web page. Although the page is
automatically updated, you can also hit the "reload" button to get the
current status. Old Alarms can also be reviewed or searched.
- Another good page to watch is the Patrol Page.
It provides a status page, with the same information as the alarm
page, that will be centrally monitored. You might have to wait (at
most) 15 minutes for the Patrol cronjob to update the Patrol
information, and you might have to hit the "update" button at the top
of the page to get the current status. One thing to remember: you have
to hit "back" in your browser quite a few times to get back to your
previous page, since the patrol page has lots of frames.
- If you are interested in details, the
Status Page provides a thorough listing of all the various parts
of Enstore.
- The inquisitor generates this page every few minutes and the
page should automatically update (depending on your browser) for
you as well. Please remember that the information may not be
current, but might represent the state up to 5 minutes ago. Also,
since the various Enstore components are queried at different
times, on rare occasions it could look like there is contradictory
information (mover says it is writing a tape but the library
manager says the mover is idle); the situation should resolve
itself with the next write of the web page.
- One item that is easy to spot is when servers are
highlighted. That means the inquisitor has not been able to
communicate with them, which usually implies that something has
happened and the server has either crashed or hung. These
situations are becoming more and more infrequent. If you see
highlighted servers, don't be too quick to try and restart things -
the inquisitor probably has already sent the appropriate commands
to do that. One needs to understand more of what is happening
before taking action.
- The most detailed source of what is going on in the system is the
Log Pages. It requires some effort to read and understand.
- One nice feature about the web interface is that you can
search a subset of log files.
- You need to keep hitting "reload" to get the most current
information if you read the log through the web interface.
- If there is a problem, it may be easier to log into d0ensrv2
and look at the log directly. You can find it in
/diska/enstore-log.
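When reading the log directly on the node, a small script can stand in for the web search. The following is only a minimal sketch: the directory comes from the text above, but the file naming inside it is not described here, so the `name_glob` parameter and the function name `search_logs` are assumptions made for illustration.

```python
from pathlib import Path


def search_logs(pattern, log_dir="/diska/enstore-log", name_glob="*"):
    """Return (file name, line number, line) for every line containing pattern.

    name_glob is a hypothetical filter; adjust it to the real log file names.
    """
    hits = []
    for path in sorted(Path(log_dir).glob(name_glob)):
        if not path.is_file():
            continue
        # errors="replace" keeps the scan going if a log holds odd bytes.
        with path.open(errors="replace") as handle:
            for number, line in enumerate(handle, start=1):
                if pattern in line:
                    hits.append((path.name, number, line.rstrip("\n")))
    return hits
```

Calling `search_logs("some text")` on the log node would scan every file in the directory; narrowing `name_glob` limits the search to a subset of log files.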
- You can get a good idea of how the system is performing by
watching the Encp History/Transfer Page. Generally there should be
activity all the time. Remember that Mammoth-1 drives have a maximum
rate of 3 MB/s, DLT-7000 drives have a maximum rate of 5 MB/s, and the
maximum throughput of a single Fast Ethernet connection is 11 MB/s.
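To put an observed transfer rate in context, it can be compared against these ceilings. The sketch below assumes that a single transfer is limited by the slower of the drive and the network link; that assumption, and the function names, are illustrative and not part of Enstore.

```python
# Maximum rates quoted above, in MB/s.
DRIVE_MAX_MB_S = {"mammoth-1": 3.0, "DLT-7000": 5.0}
FAST_ETHERNET_MAX_MB_S = 11.0


def rate_ceiling(drive_type):
    """Best-case rate for one transfer: the slower of drive and network."""
    return min(DRIVE_MAX_MB_S[drive_type], FAST_ETHERNET_MAX_MB_S)


def fraction_of_ceiling(drive_type, observed_mb_s):
    """Observed rate as a fraction of the best-case rate for that drive."""
    return observed_mb_s / rate_ceiling(drive_type)
```

For example, `fraction_of_ceiling("DLT-7000", 2.5)` gives 0.5, meaning the transfer ran at half of what the drive could deliver.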
- Enstore runs quite a few cronjobs. You can check their status on
the Cronjob Status Page. The plots are updated once an hour.
- Each graph shows exit status on the y axis and job time on the
x axis. The graph goes back 1 week in time. The date on the rightmost
part of the graph should be tomorrow's date. If it's not, you
are looking at old graphs. The graphs are remade once per hour (via
a cronjob, of course). The graphs are titled with the node name
and the name of the cronjob.
- Each time a cron job is started, an "x" is made on the graph at
10. Look for a "regular" pattern. Investigate unusual empty places
on the graphs.
- When a cronjob finishes, an "x" is made on the graph with its
exit status. The only correct status is 0. Investigate all non-zero
values.
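The two checks described above (non-zero exit statuses and unusual gaps between starts) can be sketched in code. This is an illustration only: it assumes you already have the start times and the (finish time, exit status) pairs as Python values, which is not something the Cronjob Status Page is known to export. The function names and the `slack` factor are hypothetical.

```python
from datetime import datetime, timedelta


def failed_runs(finishes):
    """Return the (time, status) pairs whose exit status is not 0."""
    return [(when, status) for when, status in finishes if status != 0]


def start_gaps(starts, expected_interval, slack=1.5):
    """Return (previous, next) start pairs further apart than slack * interval."""
    ordered = sorted(starts)
    limit = expected_interval * slack
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b - a > limit]
```

For an hourly job, `start_gaps(starts, timedelta(hours=1))` reports any place where a start appears to have been skipped.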
- You can see all the Enstore Tasks
that are running on all the d0en nodes. This page is located
within the miscellaneous status page and is updated every 10
minutes. You can log into a console server (d0ensrv3, for example)
as enstore and issue the command "enstore EPS" to get an up-to-date
display. This output is simply that of the "ps" command with
non-enstore commands filtered out.
- It is a good idea to check the volumes defined to the system.
There's currently no web page for this.
- Log into one of the console servers, d0ensrv3 or d0ensrv5.
- Type "setup enstore"
- Type "enstore volume --vols"
- Look especially for volumes marked "NOACCESS" and try to
determine why they are in that state. Don't just clear them!
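If the listing is long, the NOACCESS volumes can be picked out mechanically. The sketch below assumes the output of "enstore volume --vols" has been saved as text and that each affected volume appears on a line containing the word NOACCESS; the exact column layout is not described here, so nothing else about the format is relied upon. The function name is hypothetical.

```python
def noaccess_volumes(listing):
    """Return the lines of a saved volume listing that mention NOACCESS."""
    return [line.strip() for line in listing.splitlines() if "NOACCESS" in line]


# Made-up sample text for illustration; the real listing may look different.
sample = "VOL001 full\nVOL002 NOACCESS\nVOL003 none\n"
flagged = noaccess_volumes(sample)
```

The result is only a starting list for investigation; it does not clear anything.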
- You can look at the monthly performance of the system by looking
at the Activity Plot, Bytes/Day Plot, Mount Latency Plot, and the
Mounts/Hr Plots. These plots are updated once per day by the
inquisitor (or they can be generated on demand).
- A recent snapshot of the tape drives in the AML/2 robot is
available on the AML/2
Drive Status Page (within the miscellaneous status page). The page
is updated every 10 minutes.
- You can also query the AML/2 drive status directly by logging
onto one of the d0en server nodes, preferably one of the console
server nodes, d0ensrv3 or d0ensrv5, and issuing the command
"dasadmin listd2".
- DECDLT drives DE01-DE02, Mammoth-1 drives DC03-DC06 and AIT-1
drives DM07-DM12 should have their state "st: UP" and be assigned to
"client: rip5". These drives are used for Enstore testing.
- DECDLT drives DE13-DE14, Mammoth-1 drives DC15-DC17, DC20-DC29
and AIT-1 drives DM18-DM19 should have their state "st: UP" and be
assigned to "client: d0ensrv4". These drives are used for D0
production.
- The "clean_count" indicates how many times a tape has been
mounted in the drive. For your information, you can tell how
many times a particular tape has been mounted by using the command
"dasadmin view -t [8MM|DECDLT] tape_label".
- "volser: label" indicates which tape is currently mounted, if
any, in the tape drive.
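The drive assignments above lend themselves to an automated check. The following is a sketch under a loud assumption: that each drive appears on a single line of the "dasadmin listd2" output together with its "st:" and "client:" fields. The real layout may differ, and the function names are hypothetical. The drive ranges and client names are taken from the list above.

```python
import re


def expected_clients():
    """Map drive name to expected client, built from the ranges listed above."""
    groups = {
        "rip5": [("DE", 1, 2), ("DC", 3, 6), ("DM", 7, 12)],
        "d0ensrv4": [("DE", 13, 14), ("DC", 15, 17), ("DC", 20, 29), ("DM", 18, 19)],
    }
    mapping = {}
    for client, ranges in groups.items():
        for prefix, first, last in ranges:
            for number in range(first, last + 1):
                mapping[f"{prefix}{number:02d}"] = client
    return mapping


# Assumed field spellings: a drive name such as DE01, "st: UP", "client: rip5".
DRIVE_RE = re.compile(r"\b(D[ECM]\d{2})\b")
STATE_RE = re.compile(r"\bst:\s*(\S+)")
CLIENT_RE = re.compile(r"\bclient:\s*(\S+)")


def check_drives(listing):
    """Return a list of problems found in a saved drive listing."""
    expected = expected_clients()
    problems = []
    for line in listing.splitlines():
        drive = DRIVE_RE.search(line)
        if drive is None or drive.group(1) not in expected:
            continue
        name = drive.group(1)
        state = STATE_RE.search(line)
        client = CLIENT_RE.search(line)
        if state is None or state.group(1) != "UP":
            problems.append(f"{name}: state is not UP")
        if client is None or client.group(1) != expected[name]:
            problems.append(f"{name}: client is not {expected[name]}")
    return problems
```

An empty result means every recognized drive was UP and assigned to the client listed above; drives not in the table are ignored.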
- You can see the inter-d0en ethernet rates by looking at the Enstore
Node Uptime Page (within the miscellaneous status page). The page
is updated every 10 minutes.
- You can also query the PC state directly by logging onto one of
the d0en server nodes, preferably one of the console server nodes,
d0ensrv3 or d0ensrv5, and issuing the command "rgang -n d0en
/usr/local/bin/uptime"
- The first item listed is the nodename. Movers should be listed
twice since there are 2 ethernet ports. [Both respond with the "a"
mover name.]
- Next is an item "bogo" which lists the bogomips for each
processor on the node. Since all nodes, except the console servers,
have dual 450 MHz processors, you should see 2 bogo values around 450.
If there is just 1, that means only 1 processor is active. The 2
console servers have dual 400 MHz processors, so they should have 2
values, both around 400.
- Next is "memf/mtot" which is the ratio of memory free to memory
total. All nodes have 512 MB of memory (which is reduced to 505
somehow?)
- Next you'll see the current time. All the nodes are ntp
synced, so they should all have the correct time. Earlier nodes
may show a minute earlier than the later ones since the information
is currently gathered serially for all the nodes.
- Next you'll see the uptime. Generally the nodes should be up
for a long time. Look for nodes that seem to have booted recently
and find out why.
- Next is number of users. Typically, only the console servers
should have users on them. Find out who is using the nodes and why.
- Next is the load average. There are 3 numbers showing what the
average has been during the last minute, 5 minutes and 15 minutes.
High numbers (anything above 3) are probably bad.
- Finally, the kernel version and when it was compiled are listed.
They should all be alike, except for the console servers.
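The load average check can be scripted once the output is captured. This sketch assumes each line contains the three values in the form "load average: a, b, c", as the standard uptime command prints them. The /usr/local/bin/uptime used here is a local script whose exact format is not shown in this document, so treat the pattern as a guess to adapt. The function names are hypothetical.

```python
import re

# Assumed layout: "load average: 0.10, 0.20, 0.30" somewhere in the line.
LOAD_RE = re.compile(
    r"load average:\s*(\d+(?:\.\d+)?),?\s+(\d+(?:\.\d+)?),?\s+(\d+(?:\.\d+)?)"
)


def parse_loads(line):
    """Return the (1 min, 5 min, 15 min) load averages, or None if absent."""
    match = LOAD_RE.search(line)
    if match is None:
        return None
    return tuple(float(value) for value in match.groups())


def is_overloaded(line, threshold=3.0):
    """True if any of the three load averages on the line exceeds threshold."""
    loads = parse_loads(line)
    return loads is not None and max(loads) > threshold
```

Lines without a recognizable load average are treated as not overloaded, so a format mismatch shows up as silence rather than as false alarms; check `parse_loads` on a real line before relying on it.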
- You can see the PC temperatures, voltages and fan speeds by
checking the SDR
Page (located within the Enstore Log File Page). This page is
updated every 15 minutes.
- You can view the PC's System Event Log
(located within the Enstore Log File Page). This provides
details on crashes, memory problems, etc. This page is updated every
15 minutes.
- You can view the PC's Detailed Memory Settings (located within the
Enstore Log File Page). This page is updated every 15 minutes.
- You can view the last 25 lines of the PC's Console Log
(located within the Enstore Log File Page). This
page is updated every 15 minutes.
- You can check the Raid Status Page (within the miscellaneous status
page) to look for errors. This page is updated every few
minutes. You can also log into d0ensrv1 and type "raidinfo" to get
current values. Pay particular attention to the first few
lines, which show the disk's current state.
- Information about the network is available at the BigA Switch Page
(within the Enstore Log File Page). Remember that all nodes have full
duplex Fast Ethernet lines, so there should not be any errors,
collisions, or jabbers at all. MRTG Rate plots for the nodes are also
available from the BigA Switch Page.
- If there are AML/2 problems, you can check the AML/2 Log
(within the Enstore Log File Page) for errors. These pages are
updated every 15 minutes. The information is detailed and takes some
time to understand completely. An attempt to summarize
the information is also available. The summary filters out most
extraneous information for you, but is still not perfect. Direct
access to the AML/2 logs is possible either by looking at the OS/2
adic2 node in front of the robot, or by logging into adic2 as enstore,
changing directory with 'cd \das\bin', and running 'amulog' (you quit
by typing 'q' and 'exit').