An unexpected interruption while the Volume Server or File Server is manipulating the data in a volume can leave the volume in an intermediate (corrupted) state, rather than just creating a discrepancy between the information in the VLDB and the volume headers. For example, the failure of the operation that saves changes to a file (by overwriting old data with new) can leave the old and new data mixed together on the disk.
If an operation halts because the Volume Server or File Server exits unexpectedly, the BOS Server automatically shuts down all components of the fs process and invokes the Salvager. The Salvager checks for and repairs any inconsistencies it can. Sometimes, however, there are symptoms of the following sort, which indicate corruption serious enough to create problems but not serious enough to cause the File Server component to fail. In these cases you can invoke the Salvager yourself by issuing the bos salvage command.
Symptom: A file appears in the output of the ls command, but attempts to access the file fail with messages indicating that it does not exist.
Possible cause: The Volume Server or File Server exited in the middle of a file-creation operation, after changing the directory structure, but before actually storing data. (Other possible causes are that the ACL on the directory does not grant the permissions you need to access the file, or there is a process, machine, or network outage. Check for these causes before assuming the file is corrupted.)
Salvager's solution: Remove the file's entry from the directory structure.
Symptom: A volume is marked Off-line in the output from the vos examine and vos listvol commands, or attempts to access the volume fail.
Possible cause: Two files or versions of a file are sharing the same disk blocks because of an interrupted operation. The File Server and Volume Server normally refuse to attach volumes that exhibit this type of corruption, because it can be very dangerous. If the Volume Server or File Server does attach the volume but is unsure of the status of the affected disk blocks, it sometimes tries to write yet more data there. When it cannot perform the write, the data is lost. This effect can cascade, causing loss of all data on a partition.
Salvager's solution: Delete the data from the corrupted disk blocks in preference to losing an entire partition.
Symptom: There is less space available on the partition than you expect based on the size statistic reported for each volume by the vos listvol command.
Possible cause: There are orphaned files and directories. An orphaned element is completely inaccessible because it is not referenced by any directory that can act as its parent (is higher in the file tree). An orphaned element is not counted in the calculation of a volume's size (or against its quota), even though it occupies space on the server partition.
Salvager's solution: By default, print a message to the /usr/afs/logs/SalvageLog file reporting how many orphans were found and the approximate number of kilobytes they are consuming. You can use the -orphans argument to remove or attach orphaned elements instead. See To salvage volumes.
When you notice symptoms such as these, use the bos salvage command to invoke the Salvager before corruption spreads. (Even though it operates on volumes, the command belongs to the bos suite because the BOS Server must coordinate the shutdown and restart of the Volume Server and File Server with the Salvager. It shuts them down before the Salvager starts, and automatically restarts them when the salvage operation finishes.)
All of the AFS data stored on a file server machine is inaccessible during the salvage of one or more partitions. If you salvage just one volume, it alone is inaccessible.
When processing one or more partitions, the command restores consistency to corrupted read/write volumes where possible. For read-only or backup volumes, it inspects only the volume header:
If the volume header is corrupted, the Salvager removes the volume completely and records the removal in its log file, /usr/afs/logs/SalvageLog. Issue the vos release or vos backup command to create the read-only or backup volume again.
If the volume header is intact, the Salvager skips the volume (does not check for corruption in the contents). However, if the File Server notices corruption as it initializes, it sometimes refuses to attach the volume or bring it online. In this case, it is simplest to remove the volume by issuing the vos remove or vos zap command. Then issue the vos release or vos backup command to create it again.
Combine the bos salvage command's arguments as indicated to salvage different numbers of volumes:
To salvage all volumes on a file server machine, combine the -server argument and the -all flag.
To salvage all volumes on one partition, combine the -server and -partition arguments.
To salvage only one read/write volume, combine the -server, -partition, and -volume arguments. Only that volume is inaccessible to Cache Managers, because the BOS Server does not shut down the File Server and Volume Server processes during the salvage of a single volume. Do not name a read-only or backup volume with the -volume argument. Instead, remove the volume, using the vos remove or vos zap command. Then create a new copy of the volume with the vos release or vos backup command.
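The three combinations above can be sketched as a small shell helper. This is only an illustration: salvage_cmd is a hypothetical function, it prints the command line rather than running it, and the machine, partition, and volume names (fs1.example.com, /vicepa, user.pat) are placeholder values.

```shell
# Hypothetical helper: print (do not run) the bos salvage command line for a
# given scope. Usage: salvage_cmd <server> [<partition> [<volume>]]
salvage_cmd() {
    server=$1
    partition=$2
    volume=$3

    if [ -z "$server" ]; then
        echo "salvage_cmd: a server name is required" >&2
        return 1
    fi

    if [ -z "$partition" ]; then
        # No partition named: salvage every volume on the machine.
        printf 'bos salvage -server %s -all\n' "$server"
    elif [ -z "$volume" ]; then
        # Partition only: salvage every volume on that partition.
        printf 'bos salvage -server %s -partition %s\n' "$server" "$partition"
    else
        # Partition and volume: salvage one read/write volume.
        printf 'bos salvage -server %s -partition %s -volume %s\n' \
            "$server" "$partition" "$volume"
    fi
}

# Placeholder names, for illustration only.
salvage_cmd fs1.example.com
salvage_cmd fs1.example.com /vicepa
salvage_cmd fs1.example.com /vicepa user.pat
```

Printing the command first makes it easy to review the scope of a salvage before taking a whole machine or partition offline.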
The Salvager always writes a trace to the /usr/afs/logs/SalvageLog file on the file server machine where it runs. To record the trace in another file as well (either in AFS or on the local disk of the machine where you issue the bos salvage command), name the file with the -file argument. Or, to display the trace on the standard output stream as it is written to the /usr/afs/logs/SalvageLog file, include the -showlog flag.
By default, multiple Salvager subprocesses run in parallel: one for each partition up to four, and four subprocesses for four or more partitions. To increase or decrease the number of subprocesses running in parallel, provide a positive integer value for the -parallel argument.
If there is more than one server partition on a physical disk, the Salvager by default salvages them serially to avoid the inefficiency of constantly moving the disk head from one partition to another. However, this strategy is often not ideal if the partitions are configured as logical volumes that span multiple disks. To force the Salvager to salvage logical volumes in parallel, provide the string all as the value for the -parallel argument. Append a positive integer, with no intervening space, to specify the number of subprocesses to run in parallel (for example, -parallel all5 for five subprocesses), or omit the integer to run up to four subprocesses, depending on the number of logical volumes being salvaged.
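As a rough check of the value forms accepted by the -parallel argument, the sketch below follows the wording of the argument description: an integer from 1 to 32, the string all, or all followed immediately by such an integer. The function name is_valid_parallel is hypothetical, and the check reflects this document's description rather than the Salvager's own parsing.

```shell
# Hypothetical validator for a -parallel value, based on the forms this
# document describes: N, all, or allN, where N is an integer from 1 to 32.
is_valid_parallel() {
    value=$1

    # The bare string "all" is valid on its own.
    if [ "$value" = "all" ]; then
        return 0
    fi

    # Strip an optional leading "all"; what remains must be the integer.
    count=${value#all}

    # Reject an empty remainder or anything containing a non-digit.
    case $count in
        ''|*[!0-9]*) return 1 ;;
    esac

    [ "$count" -ge 1 ] && [ "$count" -le 32 ]
}

for value in 1 32 all all5 0 33 many; do
    if is_valid_parallel "$value"; then
        echo "$value: accepted"
    else
        echo "$value: rejected"
    fi
done
```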
The Salvager creates temporary files as it runs, by default writing them to the partition it is salvaging. The number of files can be quite large, and if the partition is too full to accommodate them, the Salvager terminates without completing the salvage operation (it always removes the temporary files before exiting). Other Salvager subprocesses running at the same time continue until they finish salvaging all other partitions where there is enough disk space for temporary files. To complete the interrupted salvage, reissue the command against the appropriate partitions, adding the -tmpdir argument to redirect the temporary files to a local disk directory that has enough space.
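A minimal sketch of reissuing the command for the partitions that were not completed: the hypothetical resalvage_cmds function prints one bos salvage command per partition with the -tmpdir argument added. The machine name, partition names, and the /var/tmp/salvage directory are placeholders, and the directory is assumed to be on the local disk of the file server machine.

```shell
# Hypothetical helper: print one bos salvage command per interrupted
# partition, redirecting temporary files to a directory with enough space.
# Usage: resalvage_cmds <server> <tmpdir> <partition>...
resalvage_cmds() {
    server=$1
    tmpdir=$2
    shift 2

    for partition in "$@"; do
        printf 'bos salvage -server %s -partition %s -tmpdir %s\n' \
            "$server" "$partition" "$tmpdir"
    done
}

# Placeholder values, for illustration only.
resalvage_cmds fs1.example.com /var/tmp/salvage /vicepb /vicepd
```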
The -orphans argument controls how the Salvager handles orphaned files and directories that it finds on server partitions it is salvaging. An orphaned element is completely inaccessible because it is not referenced by the vnode of any directory that can act as its parent (is higher in the filespace). Orphaned objects occupy space on the server partition, but do not count against the volume's quota.
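The -orphans argument takes one of the three values shown in the command syntax (ignore, remove, or attach), with reporting only as the default behavior. The sketch below uses a hypothetical orphans_option function to reject any other value before a command is built; the machine and partition names are placeholders.

```shell
# Hypothetical helper: emit the -orphans option for a valid mode, defaulting
# to "ignore" (report only) when no mode is given.
orphans_option() {
    mode=${1:-ignore}

    case $mode in
        ignore|remove|attach)
            printf '%s %s\n' -orphans "$mode"
            ;;
        *)
            echo "orphans_option: expected ignore, remove, or attach; got '$mode'" >&2
            return 1
            ;;
    esac
}

# Print (do not run) a command that attaches orphans on one partition.
echo "bos salvage -server fs1.example.com -partition /vicepa $(orphans_option attach)"
```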
During the salvage, the output of the bos status command reports the following auxiliary status for the fs process:
Salvaging file system
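One way to script a check for this status is to search the bos status output for the string. The sketch below is an assumption-laden illustration: salvage_in_progress is a hypothetical function, the match ignores case in case the capitalization of the real output differs, and a canned line stands in for real output so the example runs without an AFS server.

```shell
# Read bos status output on standard input; succeed if it reports a salvage.
salvage_in_progress() {
    grep -qi 'salvaging file system'
}

# Against a live server (hypothetical machine name), this might be used as:
#   bos status -server fs1.example.com -instance fs | salvage_in_progress
# Here a canned line stands in for real output.
if printf 'Auxiliary status is: Salvaging file system\n' | salvage_in_progress; then
    echo "fs process is being salvaged"
fi
```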
Verify that you are listed in the /usr/afs/etc/UserList file. If necessary, issue the bos listusers command, which is fully described in To display the users in the UserList file.
% bos listusers <machine name>
Issue the bos salvage command to salvage one or more volumes.
% bos salvage -server <machine name> [-partition <salvage partition>] \
      [-volume <salvage volume number or volume name>] \
      [-file <salvage log output file>] [-all] [-showlog] \
      [-parallel <# of max parallel partition salvaging>] \
      [-tmpdir <directory to place tmp files>] \
      [-orphans <ignore | remove | attach>]
where
-server
Names the file server machine on which to salvage volumes. This argument can be combined with the -all flag, the -partition argument, or both the -partition and -volume arguments.
-partition
Names a single partition on which to salvage all volumes. The -server argument must be provided along with this one.
-volume
Specifies the name or volume ID number of one read/write volume to salvage. Combine this argument with the -server and -partition arguments.
-file
Specifies the complete pathname of a file into which to write a trace of the salvage operation, in addition to the /usr/afs/logs/SalvageLog file on the server machine. If the file pathname is local, the trace is written to the specified file on the local disk of the machine where the bos salvage command is issued. If the -volume argument is included, the file can be in AFS, though not in the volume being salvaged. Do not combine this argument with the -showlog flag.
-all
Salvages all volumes on all of the partitions on the machine named by the -server argument.
-showlog
Displays the trace of the salvage operation on the standard output stream, as well as writing it to the /usr/afs/logs/SalvageLog file.
-parallel
Specifies the maximum number of Salvager subprocesses to run in parallel. Provide one of three values:
An integer from the range 1 to 32. A value of 1 means that a single Salvager process salvages the partitions sequentially.
The string all to run up to four Salvager subprocesses in parallel on partitions formatted as logical volumes that span multiple physical disks. Use this value only with such logical volumes.
The string all followed immediately (with no intervening space) by an integer from the range 1 to 32, to run the specified number of Salvager subprocesses in parallel on partitions formatted as logical volumes. Use this value only with such logical volumes.
The BOS Server never starts more Salvager subprocesses than there are partitions, and always starts only one process to salvage a single volume. If this argument is omitted, up to four Salvager subprocesses run in parallel.
-tmpdir
Specifies the full pathname of a local disk directory to which the Salvager process writes temporary files as it runs. By default, it writes them to the partition it is currently salvaging.
-orphans
Controls how the Salvager handles orphaned files and directories. Choose one of the following three values:
ignore
Leaves the orphaned objects on the disk, but prints a message to the /usr/afs/logs/SalvageLog file reporting how many orphans were found and the approximate number of kilobytes they are consuming. This is the default if you omit the -orphans argument.
remove
Removes the orphaned objects, and prints a message to the /usr/afs/logs/SalvageLog file reporting how many orphans were removed and the approximate number of kilobytes they were consuming.
attach
Attaches the orphaned objects by creating a reference to them in the vnode of the volume's root directory. Since each object's actual name is now lost, the Salvager assigns each one a name of the following form:
__ORPHANFILE__.index for files
__ORPHANDIR__.index for directories
where index is a two-digit number that uniquely identifies each object. The orphans are charged against the volume's quota and appear in the output of the ls command issued against the volume's root directory.
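After a salvage with -orphans attach, the reattached objects can be found by listing the volume's root directory for the naming pattern above. The sketch below assumes the names begin with two consecutive underscores (__ORPHANFILE__. and __ORPHANDIR__.); list_attached_orphans is a hypothetical function, and a temporary directory stands in for a real volume root so the example runs anywhere.

```shell
# Hypothetical helper: print the names of attached orphans found directly
# under the given directory (normally the volume's root directory).
list_attached_orphans() {
    root=$1

    for entry in "$root"/__ORPHANFILE__.* "$root"/__ORPHANDIR__.*; do
        # An unmatched glob stays literal, so skip names that do not exist.
        [ -e "$entry" ] || continue
        basename "$entry"
    done
}

# Demonstration against a scratch directory rather than a real volume.
demo=$(mktemp -d)
touch "$demo/__ORPHANFILE__.01" "$demo/notes.txt"
mkdir "$demo/__ORPHANDIR__.02"
list_attached_orphans "$demo"
rm -rf "$demo"
```

Because each object's original name is lost, reviewing this list and renaming or removing each entry is a manual follow-up step.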