PNFS Scanning Instructions

Execution:

The scanning is done by the script $ENSTORE_DIR/src/scanfiles.py.  There
are two main ways to scan.  The first is a forward scan, which looks at
pnfs first and matches it against the Enstore DB.  The other is a reverse
scan, which looks at the Enstore DB and compares it to what is in PNFS.


A forward scan is invoked by passing one or more files or directories.
If a directory is passed to scanfiles it will recursively check all
files underneath the specified directory.

    python $ENSTORE_DIR/src/scanfiles.py [target1] [target2 [target3 ...]]
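
For example, to forward scan everything under a single mount point (the
/pnfs/mist path is only an illustration):

    python $ENSTORE_DIR/src/scanfiles.py /pnfs/mist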

There are two ways of doing a reverse scan.  The first is to give it
one or more BFIDs.  The second is to give it one or more volumes, which
causes a lookup of all files on the tape in the Enstore DB to obtain a
list of BFIDs; a scan on that list of BFIDs is then done in the same way
the --bfid scan is performed.

    python $ENSTORE_DIR/src/scanfiles.py --bfid [bfid1] [bfid2 [bfid3 ...]]
    python $ENSTORE_DIR/src/scanfiles.py --vol [vol1] [vol2 [vol3 ...]]
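
For example (the BFID and volume label below are made up for
illustration; real values come from the Enstore DB):

    python $ENSTORE_DIR/src/scanfiles.py --bfid CDMS112345678900000
    python $ENSTORE_DIR/src/scanfiles.py --vol VO1234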

If no files, BFIDs or volumes are listed on the command line, scanfiles
will read a list of the respective targets from stdin.
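
This makes it easy to feed scanfiles.py from another command.  A minimal
sketch, assuming a file named file_list with one pathname per line:

    cat file_list | python $ENSTORE_DIR/src/scanfiles.py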

Each file is listed on one line with one of three classifications:
ERROR, WARNING or OK.  ERRORs indicate that the scan found something
that will require intervention to fix; these range from annoyances to
really bad problems.  WARNINGs usually (though not always) flag files
for which errors were found, but which are so new that there is a high
chance they are still being written to tape.  These WARNINGs should
simply be rerun at a later time to weed out any false positives.
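
The output lines look roughly like the following; the first is the size
error example from later in this document, the second is a hypothetical
clean file:

    /pnfs/mist/NULL/1KB_093 ... size(1024, 0, 1024) ... ERROR
    /pnfs/mist/NULL/1KB_094 ... OK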

It is recommended that the list of ERRORs and WARNINGs be grepped out
and rerun against the production system.  Files that then come up as OK
or "does not exist" require no further attention.
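
A minimal sketch of that workflow, assuming the scan results were saved
in a file named scan_output and that the pathname is the first
whitespace-delimited field on each line:

    grep -e ERROR -e WARNING scan_output | cut -d ' ' -f 1 | \
        python $ENSTORE_DIR/src/scanfiles.py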

There exist two scripts: forward_scan and reverse_scan.  To run them,
just execute them.  They produce a few items of output.

The first is a file listing what the script plans to scan.  For a
forward scan it will contain a list of pnfs mount points.  For a reverse
scan it will contain a list of storage groups.  It is possible to pass
an edited version of these files, named all_mount_points and
all_storage_groups, to the script to scan only a desired subset of
targets.

    forward_scan [all_mount_points_file]
    reverse_scan [all_storage_groups_file]

For CMS reverse scans it is necessary to hand edit an all_storage_groups
file.
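
A sketch of one way to cut the file down to a subset of storage groups
(the grep pattern cms is only an illustration; hand edit the file when
the selection is more complicated):

    grep cms all_storage_groups > my_storage_groups
    reverse_scan my_storage_groups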

The other output is the creation of a directory to hold the scan
results, typically named something like 'SCAN_09_01_2006'.  This
directory will contain a forward_scan.status or reverse_scan.status
file and files containing the output from "python scanfiles.py ..."
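
Once a scan finishes, the failures can be pulled out of the result files
in one pass; a sketch using the example directory name above:

    grep -e ERROR -e WARNING SCAN_09_01_2006/*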

    

Fixing common errors:

Most of these scripts will accept raw output from scanfiles.py.  They will
simply strip away everything after the first " ... " found.

size:

Here is a sample size error.  The file DB and layer 4 sizes agree, but
the os/ls size (in the middle) is zero.  If the size mismatch errors do
not fit this pattern then a more detailed investigation needs to be done
than this description can give.

    /pnfs/mist/NULL/1KB_093 ... size(1024, 0, 1024) ... ERROR

The manual method to fix this is to execute:

    enstore pnfs --size <filename> <size>

The automated way to fix this error is to use the
$ENSTORE/tools/fix/fix_size.py script.  There are two ways to use it.
One is to pipe the list of files to fix_size.py:

    cat <filename> | $ENSTORE/tools/fix/fix_size.py

The other is to pass it on the command line:

    $ENSTORE/tools/fix/fix_size.py <filename>
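
Since these fix scripts accept raw scanfiles.py output (see above), the
size errors can be grepped straight out of a saved scan and piped in.
A sketch, assuming the results were saved in a file named scan_output:

    grep "size(" scan_output | $ENSTORE/tools/fix/fix_size.py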

found temporary file:

These should just be removed.  In most cases using rm is sufficient.
Take note if "invalid directory entry" also appears in the scanfiles.py
output; this means that the entry in the directory listing does not
point to a valid file (aka a ghost file).

deleted:

This likely means that the offline pnfs server still contains the file,
but: the file has been removed from the production pnfs server,
delfile.py marked the file deleted and the scan found this discrepancy
with the offline pnfs server.  If a rerun of the scan of these files on
the production pnfs server gives "does not exist" then the user deleted
the file and everything is okay.  If the file rescans as OK there also
is nothing else to do.  If the error remains "deleted" then there is a
conflict between the enstore and pnfs databases.  In most cases (but
not all) the simplest thing is to unmark the file as deleted in the
file DB:

    enstore file --bfid <BFID> --deleted no
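
If a whole list of files needs to be unmarked, a small shell loop works.
A sketch, assuming a file named deleted_bfids with one BFID per line:

    while read bfid; do
        enstore file --bfid $bfid --deleted no
    done < deleted_bfids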

missing file:

The file has no size and no layer 1, layer 2 or layer 4.  It is as if
the user simply touched the file.  The best thing to do is to contact
the experiment and ask them about these files.  It is best if they can
be rewritten; otherwise the owner should remove them from PNFS.
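
The layers can be inspected by hand through pnfs's dot-command files.
A sketch, assuming the cwd is the directory containing the file (for a
missing file these should all come back empty):

    ls -l <filename>               # size
    cat ".(use)(1)(<filename>)"    # layer 1
    cat ".(use)(4)(<filename>)"    # layer 4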

does not exist:

The file has been removed from pnfs. Not much to do.

younger than 1 day:

This is a warning.  It says that the file is very new and may not have
finished being written to tape.  Check the file again after a day or so.

missing the original of link:

This is also a warning.  The target of the link was not found.  The
owner of the link should be notified, especially if the link points to
files not in pnfs.
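
A sketch of checking where a broken link points (the name is a
placeholder):

    ls -l <link name>    # the target pathname appears after "->"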

reused pnfsid: (reverse scan)

In all likelihood the file received an error while being written to
tape and a retry succeeded.  The files in question should be marked as
deleted:

    enstore file --bfid <BFID> --deleted yes

invalid directory entry:

These errors occur when a hard link succeeds in creating the new
directory entry, but fails to update the link count.  This is a similar
situation to the "duplicate entry" errors described below.

To confirm the problem:

    cd <broken dir>
    ls | grep <basename>    #will take a while for large directories.
    stat <basename>

If the "ls | grep" says the file exists, but stat says "No such file or
directory" then this is a corrupted filesystem (AKA ghost file).

    bash-2.05b# ls | grep CAF-MCv3-34870-tmb_p17.09.06_NumEv-10000_gam-z-ee_sm.n_dzero_mcp17_Wuppertal_34870-115410790723761-JIM_MERGE_0_p18.05.00.root
    CAF-MCv3-34870-tmb_p17.09.06_NumEv-10000_gam-z-ee_sm.n_dzero_mcp17_Wuppertal_34870-115410790723761-JIM_MERGE_0_p18.05.00.root
    bash-2.05b# stat CAF-MCv3-34870-tmb_p17.09.06_NumEv-10000_gam-z-ee_sm.n_dzero_mcp17_Wuppertal_34870-115410790723761-JIM_MERGE_0_p18.05.00.root
    stat: cannot stat `CAF-MCv3-34870-tmb_p17.09.06_NumEv-10000_gam-z-ee_sm.n_dzero_mcp17_Wuppertal_34870-115410790723761-JIM_MERGE_0_p18.05.00.root': No such file or directory

To fix the problem (as root, with the broken directory as the cwd):

    setup pnfs
    cd <broken dir>
    $PNFS_DIR/tools/sclient rmdirentry 1122 `enstore pnfs --id .` <basename>

For example:

    $PNFS_DIR/tools/sclient rmdirentry 1122 `enstore pnfs --id .` CAF-MCv3-34870-tmb_p17.09.06_NumEv-10000_gam-z-ee_sm.n_dzero_mcp17_Wuppertal_34870-115410790723761-JIM_MERGE_0_p18.05.00.root

duplicate entry:

These should never be seen anymore.  However, this section is to
document what needs to be done to fix one.  Most of these commands
require you to be user root.

If you list the directory with the duplicate error you will see
something like:

    bash-2.05b# ls
    all  all  all
    bash-2.05b# /local/ups/prd/pnfs/v3_1_10-f8/tools/scandir.sh
    000900000000000000163470 000900000000000000163430 0000000000000000 0 all
    000900000000000000163470 0009000000000000003525B8 0000000000000200 1 all
    000900000000000000163470 0009000000000000003525B8 0000000000000200 2 all

The problem is that different files might be hidden behind the first
"all" entry.  The first one listed is the path that gets used.

    bash-2.05b# enstore pnfs --id all
    000900000000000000163430

Also, the link count might be different from the number of hard links
there really are.  In the following "stat" examples the link counts are
consistent, but the hidden "all" entries are files, not directories.

    bash-2.05b# stat ".(access)(000900000000000000163430)"
      File: `.(access)(000900000000000000163430)'
      Size: 512   Blocks: 1   IO Block: 512   Directory
    Device: ah/10d   Inode: 152450096   Links: 1
    Access: (0755/drwxr-xr-x)  Uid: ( 7816/ sam)  Gid: ( 0/ root)
    Access: 2006-03-24 10:36:01.000000000 -0600
    Modify: 2006-03-24 10:36:01.000000000 -0600
    Change: 2001-06-12 14:51:51.000000000 -0500

    bash-2.05b# stat ".(access)(0009000000000000003525B8)"
      File: `.(access)(0009000000000000003525B8)'
      Size: 0   Blocks: 0   IO Block: 512   Regular File
    Device: ah/10d   Inode: 154478008   Links: 2
    Access: (0644/-rw-r--r--)  Uid: ( 7816/ sam)  Gid: ( 1507/ e740)
    Access: 2001-09-15 00:50:07.000000000 -0500
    Modify: 2001-09-15 00:50:07.000000000 -0500
    Change: 2001-09-15 00:50:07.000000000 -0500

If necessary (the example does not need this) modify the link count:

    $PNFS_DIR/tools/sclient modlink 1122 <file id> <link diff>

"link diff" will be negative to shrink the link count.

If necessary modify the links (assuming the cwd is the broken
directory):

    $PNFS_DIR/tools/sclient rmdirentry 1122 `enstore pnfs --id .` <name>
    $PNFS_DIR/tools/sclient adddirentry 1122 `enstore pnfs --id .` <name> <pnfsid>

To rename the inaccessible "all" files in the example:

    $PNFS_DIR/tools/sclient adddirentry 1122 `enstore pnfs --id .` all2 0009000000000000003525B8
    $PNFS_DIR/tools/sclient adddirentry 1122 `enstore pnfs --id .` all3 0009000000000000003525B8

At this point the output from scandir.sh is:

    00090000000000000045ED60 0009000000000000003525B8 0000000000000000 0 all2
    00090000000000000045ED68 0009000000000000003525B8 0000000000000000 0 all3
    000900000000000000163470 000900000000000000163430 0000000000000000 0 all
    000900000000000000163470 0009000000000000003525B8 0000000000000200 1 all
    000900000000000000163470 0009000000000000003525B8 0000000000000200 2 all

Continuing using the output from scandir.sh.  We can not just use rm on
"all" here.

    $PNFS_DIR/tools/sclient rmdirentrypos 1122 000900000000000000163470 0009000000000000003525B8 1
    $PNFS_DIR/tools/sclient rmdirentrypos 1122 000900000000000000163470 0009000000000000003525B8 2

At this point the output from scandir.sh is:

    00090000000000000045ED60 0009000000000000003525B8 0000000000000000 0 all2
    00090000000000000045ED68 0009000000000000003525B8 0000000000000000 0 all3
    000900000000000000163470 000900000000000000163430 0000000000000000 0 all

The output from ls may be old due to file system caching (especially on
Linux):

    bash-2.05b# ls
    all  all  all2  all3

To fix this:

    umount /pnfs/fs && mount /pnfs/fs

And ls will then give:

    bash-2.05b# ls
    all  all2  all3

parent id:

The tricky part of this error is that it may be a valid file.  If there
are two hard links to the i-node from different directories, one
pathname will give this error.
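
A sketch of checking whether the error is only a second hard link,
using commands shown earlier in this document (the filename is a
placeholder):

    stat <filename>                 # a link count of 2 or more suggests a hard link
    enstore pnfs --id <filename>    # the same pnfsid from both paths confirms it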

Last modified: Tue Sep 12 10:13:37 CDT 2006