RD Controls
EPICURE Software Release Note 116.1
VAX Disk Restoration Techniques

Deb Baddorf

February 19, 1999

Single File(s) Restoration

Several segments of this writeup are redundant. In order to make each section mostly self-contained, for use as a step-by-step guide during a crisis, some sections are repeated in several places.

What Files

The user must tell you what files he wants restored, and where they are to be put. If numerous files from the same directory are desired, try to use wildcards. If the files are unrelated by wildcarding, or if wildcards will restore too many files, then the destination should be a different, empty directory. You may want to use the empty directory technique any time you restore more than one or two files. You can restore the whole original directory, or the wildcarded set, into this new directory. Then the user can choose the files he wants and move them back to the desired destination. This way you avoid over-writing files unintentionally.

When Last Seen?

Have the user tell you, to the best of his ability, when the files were last known to exist. Remember that the backups begin at 00:05 or 03:00 and take till about 08:30, depending on the machine in question. Thus, for some files you will be able to restore the 00:05 version of the file, while for others the backup will have occurred at 08:00. This may matter if the accidental deletion or corruption occurred at 06:00.

Find A Backup Of The Files

Find the Journal File

A journal file is made of each backup we do. Journal files are kept for about the same length of time as the backup tape itself, before the tape is rotated through and overwritten. Journal files are kept in the SYS$BACKUPS directory, and are named by the day or date. Use the directory command to find appropriate journal files.

$ DIR /DATE /SINCE=date1 /BEFORE=date2   SYS$BACKUPS:
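The date1 and date2 placeholders take absolute dates such as 16-FEB-1999. As a minimal sketch (written in Python for illustration, not DCL), the helper below builds the DIR command for a window around the day the file was last seen. The function names and the one-day padding are assumptions made for this example, not part of the backup procedures.

```python
from datetime import date, timedelta

# Month abbreviations are spelled out so the result does not depend on locale.
MONTHS = ["JAN", "FEB", "MAR", "APR", "MAY", "JUN",
          "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"]

def vms_date(d):
    """Format a date as DD-MMM-YYYY, e.g. 16-FEB-1999."""
    return f"{d.day:02d}-{MONTHS[d.month - 1]}-{d.year}"

def journal_dir_command(last_seen, pad_days=1):
    """Build the DIR command covering pad_days on either side of last_seen."""
    since = last_seen - timedelta(days=pad_days)
    before = last_seen + timedelta(days=pad_days)
    return (f"$ DIR /DATE /SINCE={vms_date(since)} "
            f"/BEFORE={vms_date(before)} SYS$BACKUPS:")

print(journal_dir_command(date(1999, 2, 16)))
```

Widen pad_days if the user is unsure when the file was last intact.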

Find the File in a Journal

A journal file is not in readable format. Once you've identified several journal files which may be applicable, read them as follows.

$ BACKUP /JOURNAL=SYS$BACKUPS:journalfile /LIST=node-date.TMP
$                  ! the above might take 2-3 minutes
$ SEARCH   node-date.TMP   sought-for-file

Find the Save Set

When you find the journal file which actually backed up the disk in question, and which did contain the file you want, find the name of the save set on the tape:

$ SEARCH   node-date.TMP   "save set",sought-for-file
The output would look like this if I searched for the file containing this writeup:
     $  BACKUP /JOURNAL=SYS$BACKUPS:216ZZZ-16-FEB-1999-DAFFY_NIGHT2.BJL -
                 /LIST=DAFFY-16FEB.TMP
     $  SEARCH  DAFFY-16FEB.TMP  "SAVE SET",HOWTO_RESTORE_DISK.TEX
     Save set USR$DISK1.BCK created on 16-FEB-1999 00:07:33.37
         [BADDORF.DOCS.TEK]HOWTO_RESTORE_DISK.TEX;111
         [BADDORF.DOCS.TEK]HOWTO_RESTORE_DISK.TEX;110
         [BADDORF.DOCS.TEK]HOWTO_RESTORE_DISK.TEX;109
         [BADDORF.DOCS.TEK]HOWTO_RESTORE_DISK.TEX;108
         [BADDORF.DOCS.TEK]HOWTO_RESTORE_DISK.TEX;107
         [BADDORF.DOCS.TEK.NUMBERED]HOWTO_RESTORE_DISK.TEX;102
     Save set SYS_MASTER.BCK created on 16-FEB-1999 04:00:28.18
The information you are looking for is the fact that the desired file was backed up in save set USR$DISK1.BCK.
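If the listing is long, the same lookup can be done mechanically. The following is a rough sketch in Python, assuming the listing lines look like the sample output above (a "Save set ... created on ..." line followed by the files in that save set). The function name is hypothetical.

```python
def save_sets_containing(listing_lines, filename):
    """Return the names of the save sets whose section lists `filename`."""
    current = None   # save set named by the most recent "Save set" line
    found = []
    for line in listing_lines:
        text = line.strip()
        if text.upper().startswith("SAVE SET "):
            # "Save set USR$DISK1.BCK created on ..." -> third word is the name
            current = text.split()[2]
        elif filename.upper() in text.upper() and current and current not in found:
            found.append(current)
    return found

sample = [
    "Save set USR$DISK1.BCK created on 16-FEB-1999 00:07:33.37",
    "    [BADDORF.DOCS.TEK]HOWTO_RESTORE_DISK.TEX;111",
    "    [BADDORF.DOCS.TEK.NUMBERED]HOWTO_RESTORE_DISK.TEX;102",
    "Save set SYS_MASTER.BCK created on 16-FEB-1999 04:00:28.18",
]
print(save_sets_containing(sample, "HOWTO_RESTORE_DISK.TEX"))
```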

Find the Tape Itself

Take the date of the journal file and find the appropriate tape.

For WARNER or RDIV files, the tape may be below the computers in the computer room. If it was the first working day of the week or month, the tape should be in the cabinet in Deb Baddorf's office.

For DISNEY files or files on other nodes located at the Op Center, the tapes are in a set of drawers in the computer room.

Writelock the Tape

Take the desired backup tape and slide the write protect tab over. You should now see a RED BAR (1/4 inch or so). This will write protect the tape. Insert the tape into a tape drive.

Restore the File(s)

Log in as SYSMANAGER or SYSTEM, or if you have the privilege, turn on SYSPRV and BYPASS. This is necessary in order to restore the file with the original person as the owner.

Create the Destination Area

If restoring a number of files, I suggest you create an empty directory to put them in. The user can then look at all the files restored, and move the ones he really wants. A good possibility for a directory name is [.RESTORED], where this directory is one down from the place the user wants the files to be. This way the user will own the directory too, and can delete it later.

Identify the Drive Name

On RDIV, HYDRA, and some WARNER and DISNEY nodes, the tape drive is identified by a logical TAPE$8MM. This should be true on any node that has more than one tape drive.

$ SHOW LOGICAL TAPE$*
If there is no such logical defined, there should be only one tape drive. Many are MKA700, but some are not. VS3200 nodes have drives which are called MUxx. Find out by typing the following.
$ SHOW DEVICE MK
$ SHOW DEVICE MU

The Restore Command

Use the following commands to do the restore. If TAPE$8MM is not defined, use the drive name you found above. You may wish to set your screen to 132 columns wide, since the command line and the status messages are very long.

$ BACKUP /LOG drive:saveset.BCK /SELECT=([directory]file.ext) -
disk:[destination]*.*;* /BY_OWNER=PARENT
You may use a list of files in the SELECT clause. Put a comma between each file specification, and specify the complete directory and filename for each one. Do not use any logicals, and do not specify the disk, just the directory. The BY_OWNER=PARENT clause is to make the requestor be the file owner, and not you.

Caution: this is one of the few DCL commands where the position of the qualifiers (the /somethings) does matter. The SELECT has to go after the save set name, and the BY_OWNER has to go after the destination name. Both have different meanings if placed somewhere else.

For a more specific example:

$ BACKUP /LOG TAPE$8MM:USR$DISK1.BCK -
/SELECT=([BADDORF.DOCS.TEK]HOWTO_RESTORE_DISK.TEX;*) -
USR$DISK1:[BADDORF.DOCS.TEK.RESTORED]*.*;* /BY_OWNER=PARENT
It can take an hour or two to do the restore, depending on where the desired save set is on the tape, and where the desired file is in the save set. Therefore, you want to do only one backup pass, if possible. Try to use wildcards in the SELECT clause. You can also use a list of filenames. If you are worried about getting the line just right, you can even edit it into a small file. Then when it is just right, execute the file with an @ sign.
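Since qualifier position matters, it can help to generate the command text rather than type it. Below is a minimal sketch in Python that assembles the three-line command in the layout shown above, ready to paste into a small command file. The function name and the validation rules are assumptions for illustration. The sketch only builds text; it does not run BACKUP.

```python
def restore_command(drive, save_set, selections, destination):
    """Assemble the restore command with qualifiers in the required positions.

    drive       -- tape drive or logical, without the colon (e.g. TAPE$8MM)
    save_set    -- save set name on the tape (e.g. USR$DISK1.BCK)
    selections  -- list of [directory]file specifications, no disk or logical
    destination -- disk and directory to restore into
    """
    for spec in selections:
        # SELECT entries must name the directory only: no disk, no logicals.
        if not spec.startswith("[") or ":" in spec:
            raise ValueError(f"give directory and file only: {spec}")
    select = ",".join(selections)
    return "\n".join([
        f"$ BACKUP /LOG {drive}:{save_set} -",
        f"/SELECT=({select}) -",
        f"{destination}*.*;* /BY_OWNER=PARENT",
    ])

print(restore_command(
    "TAPE$8MM", "USR$DISK1.BCK",
    ["[BADDORF.DOCS.TEK]HOWTO_RESTORE_DISK.TEX;*"],
    "USR$DISK1:[BADDORF.DOCS.TEK.RESTORED]",
))
```

With several file specifications the SELECT line grows long, so check the generated text before executing it.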

Check the Output

The LOG qualifier will cause backup to tell you exactly what files it finds and restores. Read the output! This can tell you if you really found the right file. It can indicate if your wildcarding was too broad. It can tell you where the file was actually restored to; if it is wrong, you'll need to know where the file went so you can move or delete it.

Also, when the LOG output indicates that you've gotten all the files you wanted, you may ctrl-Y the backup command. Otherwise, backup will continue to look through the rest of the save set for matching files. For example, if you are restoring files for user AAA, you can ctrl-Y as soon as those files appear in the log, rather than waiting while backup searches through the B-Z directories before it stops.

Dismount the Tape

The backup command will mount the tape for you. You don't need to mount a tape. But you should attempt to dismount it, particularly if you ctrl-Y'ed out of backup. If it tells you that the tape is not mounted, that's okay.

$ DISMOUNT TAPE$8MM
or
$ DISMOUNT drive
If you are not done with the tape, use:
$ DISMOUNT /NOUNLOAD drive
to rewind it so you can start again. This will not eject the tape, so you don't have to mess with reinserting it. This is especially useful to remember if you are doing this remotely, and having someone else insert the tape for you.

Cleaning Up

Put the tape's write-protect tab back to the position it was in originally. Unless the tape was write-protected when you started, you should un-protect the tape so it will be ready for the next use. Slide the tab so the red bar is not visible. Now put the tape back where you found it.

Tell the user his files are restored, and have him check that they are the ones he wanted.

Log out, if you used the SYSMANAGER or SYSTEM account!

Complete User Disk Restoration

A user disk on our clusters is a disk which doesn't have any VMS system software on it. Our standalone nodes have only one disk, so the VMS system software and the user software share the disk. Refer to the chapter on stand-alone nodes to restore such a disk. On the clusters, a user disk can be restored without disrupting the system software. One does, however, have to shut down the node for a time to replace the physical disk, if that is necessary.

Dismount The Disk

Attempt to Dismount

If a user disk is failing badly, the first step is to dismount the disk as soon as you can. Otherwise, users on different nodes will continue to try to use the disk. As they access the disk, their processes may become hung in a resource wait state (SHOW SYSTEM shows the process in RWxxx). If you cannot eventually get the disk to dismount on that node, you will have to reboot the node. Hence, if the disk is failing badly enough that processes are starting to go into resource wait states, you should try to dismount the disk as soon as possible, to minimize the number of nodes requiring a reboot.

$ DISMOUNT /CLUSTER disk:
You will be told which nodes cannot dismount the disk, because they still have open files.

Kill Processes As Needed

Find out what processes still hold what files open on those nodes. Log into each node which would not dismount the disk. On each node, do:

$ SHOW DEVICE/FILE disk
Try to stop the processes which have the files open, unless it is a system process. Use EOPERATOR to properly shutdown EPICURE processes, if possible. For other processes, use SHOW SYSTEM to get the process's id, and then try the following if it is not a batch job:
$ FORCEXIT :== $RDCS$EXE:FORCEXIT
$ FORCEXIT /ID=nnnnnnnn
If the process is a batch job, find the batch job entry number. Look at the queues for the node on which the process is running:
$ SHOW QUEUE /BATCH /ALL node*
Delete the batch job if you can find the entry number:
$ DELETE/ENTRY nnn
Depending on the batch job, it might take some time to actually finish aborting. A backup job, for instance, has to rewind the tape, which is a lengthy task. If the process is already stuck waiting for the disk which has failed, then the abort may never finish. If you cannot eventually dismount the disk with the commands in this section, you will have to reboot the node.

Try Dismount Again

When you have removed all the processes on one node which still have open files on the failing disk, try the DISMOUNT again. Keep removing processes and attempting to DISMOUNT until you get no further changes.

When you can get no further by killing processes, you can try

$ DISMOUNT /ABORT /CLUSTER disk:
This command is intended mostly (according to the manual) for dismounting disks after they go into a mount-verify condition. They will certainly go into mount-verify when you shutdown the node owning the disk and swap the disk out. However, if processes are hung in a resource wait state, even this command does not always succeed. You may still need to reboot several nodes.

Why All The Bother About Dismounting?

Nodes which still refuse to dismount the failing disk will eventually have to be rebooted. You can do these reboots now if it is convenient to do them. But you will have to do them before you can software mount the new disk on the owner node. This is because the new disk, after the physical installation, will have a different physical id. No nodes will allow the new disk to be mounted until all nodes stop referring to the old disk which used to be there. You can do the physical disk installation before rebooting the other nodes, but you will not be able to software-mount the new disk until the hung nodes are rebooted.

Find the Latest Good Backup Tape

  You may want to begin locating the restore tape while waiting for the hardware people to arrive to change the disk. Whenever you get to this step, do the following.

To find the latest GOOD backup, read the batch job log files which are accumulated and stored in:

WARNER::USR$DISK1:[BADDORF]CHECKBACKUP.LOG-SAVE;*
If you cannot get to this file (is USR$DISK1 the one that went bad?) the originals are kept in:
CMN$MANAGER:AUTO_BACKUP.LOG;* and
OCMN$MANAGER:AUTO_BACKUP.LOG;*
Edit the file(s) for read, or search for the section about your node - the one owning the failed disk (say, ELMER).
$ SEARCH CHECKBACKUP.LOG-SAVE ELMER /REMAIN
Now look for the section pertaining to the disk you need to restore. We'll say USR$DISK7 for example purposes.

Errors Which Are Okay

``Open for write by another user''

These are usually ok. Some judgement may be required here. See which particular file it was.

``Error opening `file' as input''

That's ok. The file was deleted by MAD_DELETER before it got around to backing it up. Probably a log file.

``Data not copied, File marked no backup''

This is a file whose contents are not important (like a page/swap file). Backup merely saves the file header and size.

``End of file position mismatch''

The verify pass on the backup found that the file has changed since backed up (there can be 30 minutes between the backup and the verify pass). This is normal for files which hold MAIL, batch or print queue information, or other files which may have changed since the backup started.

``Verification error for block n of file''

As long as no disk error is mentioned, this is usually ok. Files like the SYSUAF file change each time somebody logs in, and so may not match on the verify pass.

``Starting backup date recording pass''

This is just an informational line, not an error.

``Error writing backup date for file XXX. No such file.''

Another case where the MAD_DELETER removed the file halfway through the backup. That's ok.

``Finished Full backup of ...''

The section about node ELMER should end with something like
     Finished Full backup of SYS_MASTER on 16-FEB-1999 05:11:34.01
     * * * * *	 
       SYSMANAGER   job terminated at 16-FEB-1999 05:11:34.44
If you've reached this marker and have no errors other than the ones above, then this was a good backup tape.

Errors Which Are NOT Okay

What you don't want to see:

``Error opening MKA700''

The batch job couldn't use the tape drive, so this is not a good backup.
     Error opening MKA700:[SYSMGR]16-FEB-1999.BCK;
     Device not ready, not mounted, or unavailable
     Device already allocated to another user
     * * * * *
       SYSMANAGER   job terminated at 16-FEB-1999 03:00:09.51
This means that no backup was done on 16-FEB for DEWEY because it couldn't use the tape drive.

``Parity Error''

You also don't want to see a parity error, which means the backup job had tape trouble and stopped.

     fatal error on ELMER$MKA700:[]USR$DISK3.BCK;
     Parity error
This may or may not be followed by a ``job terminated'' line ... the accumulated record of these log files may go right on to the next node in the list.

``CRC error''

Cyclic Redundancy Check error. An error was found while trying to write to the tape. 8mm tape drives correct for quite a few errors already, so by the time this message appears, there is a really bad problem with the tape.

Any ``FATAL'' error

Any FATAL error is probably not good. It usually means the backup job did not successfully finish.
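As a first pass over a long log, the two lists above can be turned into a rough filter. The sketch below, in Python, matches on the message phrases quoted in this section. The function name and the three result labels are assumptions for illustration. A result of "ok" only means the line matches the tolerable list; cases such as ``Open for write by another user'' still call for a look at which file was involved.

```python
# Phrases from "Errors Which Are NOT Okay", lower-cased.
BAD = ["parity error", "crc error", "fatal",
       "device not ready", "device already allocated"]

# Phrases from "Errors Which Are Okay", lower-cased.
OKAY = ["open for write by another user", "data not copied",
        "end of file position mismatch", "verification error for block",
        "starting backup date recording pass", "error writing backup date",
        "finished full backup"]

def classify(line):
    """Return 'bad', 'ok', or 'review' for one line of a backup log."""
    text = line.lower()
    if any(phrase in text for phrase in BAD):
        return "bad"
    if "error opening" in text:
        # Opening a file "as input" is tolerable; failing to open the
        # tape drive is not.
        return "ok" if "as input" in text else "bad"
    if any(phrase in text for phrase in OKAY):
        return "ok"
    return "review"   # not in either list: read it yourself

print(classify("Error opening MKA700:[SYSMGR]16-FEB-1999.BCK;"))
```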

Get the Tape

 

Pick a Good Backup Date, and Find the Physical Tape

Look at the log files backwards through time until you find a good backup date. Note which machine backed up the disk you need: on WARNER, nodes DAFFY and ELMER each back up some disks; on DISNEY, nodes MICKEY and MINNIE each back up some disks. Go to the tape shelf (or ask an operator to do it if it is a DISNEY disk) and get the tape with that date on it, for the correct node. If today's tape is the good one you are looking for, it may still be in the tape drive if no one has changed tapes yet today.

Find the Save Set Name

  The logfile section about the disk you desire will start with a phrase like the following.

     Starting Full backup of USR$DISK3 on 16-FEB-1999 
                  as file USR$DISK3.bck
The name of the save set file is ``USR$DISK3.BCK''. You will need to know this name.

There is a pattern to the formation of these names, which is listed below. However, you should read the log file to see if we have changed this naming convention when you finally need to use this writeup.


     Save Set Naming Pattern                 Example
     USR$DISKn        USR$DISKn.BCK          USR$DISK7.BCK
     RFD$DISKn        RFD$DISKn.BCK          RFD$DISK2.BCK
     SYS_MASTER       SYS_MASTER.BCK         DAFFY sysdisk
                                             MICKEY sysdisk
                                             RDIV01 sysdisk
     SYS_SECOND       SYS_SECOND.BCK         ELMER sysdisk
                                             MINNIE sysdisk
     SYS_ALPHA        SYS_ALPHA.BCK          RDIVA2 sysdisk
     SYS$SYSDEVICE    SYS$SYSDEVICE.BCK      HUEY disk
                                             DEWEY disk ...

Shutdown and Change the Disk

Shut down the node owning the disk. Log in as username SHUTDOWN, either from any RDCS account or from the HYDRA VCS monitor attached to the OPA0 console. If you use VCS, you will need to type SELECT node at the VCS command prompt. Once you reach the SHUTDOWN account, pick SHUTDOWN from the menu. Give users a few minutes to log out when it asks how many minutes till the shutdown. When the VCS window displays ``SHUTDOWN COMPLETE. USE CONSOLE TO HALT NODE.'' you can tell the hardware people to go ahead with changing the disk drive.

Reboot the node when the disk has been changed. The power up after the disk change may do the reboot for you. If not, type BOOT at the ``>>>'' prompt on the VCS console. Your cursor should be inside the window for the node in question. See the VCS instructions later in this note if you need them.

Remember to bring your cursor back out of the VCS window by typing ctrl-G when you are done. Then you can leave VCS by typing EXIT.

Initialize the Disk

The hardware person will prefer that you try to initialize the disk while he is still there, so he can see if the new disk is going to be okay.

Log In

Log into any node in the cluster (even while the node owning the disk is still booting). Use username SYSTEM, or use SYSMANAGER and turn on all privileges. Do a SHOW DEVICE, and when the booting node is far enough along in the boot procedure, you will be able to see the new disk drive. Since it has no software on it, the normal boot procedures will not be able to mount it, but you should see that the device exists.

Disk Must Be ONLINE

If SHOW DEVICE indicates that the disk is remotely mounted (on some other node) then you have to cause that condition to go away. You can try the following again, on the node which still has the drive mounted:

$ DISMOUNT /CLUSTER /ABORT disk:
If it doesn't help, then now is the time you need to reboot all those other nodes which still have the drive mounted. When you get all those nodes to let go of the disk (via dismount or reboot) you should see ``Online'' in response to the SHOW DEVICE command.

Initialize the Disk

Do SHOW DEVICE and locate the full name for the disk drive. You can no longer refer to it as ``USR$DISK7'', for instance. It must now be referred to as ``ELMER$DKA300''.

You are going to initialize the disk with the name ``BLANKDISK''. This is not the correct name for any of our disks. This way, if any node is rebooting during this process of restoring the disk, even if the node tries to mount the disk, it will fail because the disk has the wrong name. You don't want anybody to mount the disk until you finish restoring onto it.

The initialize command is:

$ INIT   disk   BLANKDISK /SYSTEM /NOHIGHWATER

Restore the Software

Login as SYSTEM (not SYSMANAGER) to a node which owns a tape drive. The node which owns the disk may be fastest, since you are not restoring across the network, but it is not necessary to use that node. You do have to pick a node in the same cluster.

Privately Mount the Disk

Mount the disk privately, just for yourself to use. You are also mounting it FOREIGN, which only backup uses. You cannot do a DIRECTORY on a disk mounted foreign.

$ MOUNT /FOREIGN disk:

Write Protect the Tape

Take the good backup tape, and slide the write protect tab over. You should now see a RED BAR (1/4 inch or so). This will write protect the tape.

Mount the Tape

Insert the tape in the drive, on the node which you are logged into.

Identify the Drive Name

On some WARNER and DISNEY nodes, the tape drive is identified by a logical TAPE$8MM. This should be true on any node that has more than one tape drive.

$ SHOW LOGICAL TAPE$*
If there is no such logical defined, there should be only one tape drive. Many are MKA700, but some are not. VS3200 nodes have drives which are called MUxx. Find out by typing the following:
$ SHOW DEVICE MK
$ SHOW DEVICE MU

Restore the Backup Tape

There is no need to say MOUNT for the backup tape. BACKUP itself will do that for you. However, if you want to double check the date on the tape, you can mount it. The status line after the mount should indicate which tape is mounted, and the tape label usually contains the date. The backup scheme will encode the date as MDDxxx, where M is 1 through C for the hex value of the month. The last three letters, xxx, can be ignored. Again, you don't need to do this, but the command would be:

$ MOUNT /FOREIGN tape:
For example:
$ MOUNT /FOREIGN TAPE$8MM:
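The MDDxxx label encoding described above can be decoded by hand, or with a few lines of code. This is a minimal sketch in Python; the function name is hypothetical, and the sample label 216ZZZ is borrowed from the journal file name shown earlier in this note on the assumption that it follows the same scheme.

```python
def decode_tape_label(label):
    """Return (month, day) from a label of the form MDDxxx.

    M is one hex digit, 1 through C, for the month; DD is the day.
    The last three characters are ignored.
    """
    month = int(label[0], 16)   # 'C' -> 12
    day = int(label[1:3])
    if not 1 <= month <= 12 or not 1 <= day <= 31:
        raise ValueError(f"not an MDDxxx label: {label}")
    return month, day

print(decode_tape_label("216ZZZ"))
```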

The backup command is as follows. You need to supply the name of the tape drive (tape), the name of the save set (saveset), and the name of the disk (disk).

$ BACKUP /IMAGE   tape:saveset   disk
For example:
$ BACKUP /IMAGE   TAPE$8MM:USR$DISK7.BCK   ELMER$DKA300:
The backup may take on the order of an hour. For a large or full disk it may be even longer. Do not panic.

Check the Disk

When the backup is done, check the files a bit before turning the disk over to the users. Since the disk is currently mounted foreign, you cannot do directory commands. So, you must dismount the disk from the foreign mount, and then quickly remount it in the normal fashion so you can do directory commands, but privately so other users can't use it yet.

$ DISMOUNT /NOUNLOAD disk
$ MOUNT /OVERRIDE=ID disk
Do some DIRECTORY commands on `disk' to see if it is okay. You cannot yet use the normal logicals such as USR$DISK7; you must still use the full name ELMER$DKA300:.
$ DIRECTORY disk:[000000] /OWNER
$ DIRECTORY disk:[somedirectory] /OWNER
If you did this from a SYSTEM account, the original owners should be correctly assigned.

Check the Disk Label

  You should have no problem with this if you did the restore from the SYSTEM account. However, since Therese ran into a problem once, check the label on the disk.

$ SHOW DEVICE /FULL disk:
The ``Volume Label'' field should have the name which corresponds to the USR$DISKn type name. See the following chart.


     Volume Naming Pattern                   Example
     Logical          Volume Name            Volume Name
     USR$DISKn        USERDISKn              USERDISK7
     RFD$DISKn        RFDDISKn               RFDDISK2
     SYS_MASTER       node_SYS               DAFFY_SYS
                                             MICKEY_SYS
                                             RDIV01_SYS
     SYS_SECOND       node_SYS               ELMER_SYS
                                             MINNIE_SYS
     SYS_ALPHA        node_SYS               RDIVA2_SYS
     SYS$SYSDEVICE    node_SYS               HUEY_SYS
                                             DEWEY_SYS ...

If the name is wrong, change it as follows:

$ SET VOLUME disk /LABEL=abcdefghi
For example:
$ SET VOLUME ELMER$DKA300: /LABEL=USERDISK7
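The naming chart in this section can be expressed as a small lookup, which may help when checking several disks. The sketch below is in Python; the function name is hypothetical, and the rules are only those in the chart, which was valid at the time of this writeup.

```python
# Logical names whose volume label is built from the owning node's name.
SYSTEM_LOGICALS = ("SYS_MASTER", "SYS_SECOND", "SYS_ALPHA", "SYS$SYSDEVICE")

def expected_volume_label(logical, node=None):
    """Return the volume label the chart predicts for a disk logical."""
    name = logical.upper()
    if name.startswith("USR$DISK"):
        return "USERDISK" + name[len("USR$DISK"):]
    if name.startswith("RFD$DISK"):
        return "RFDDISK" + name[len("RFD$DISK"):]
    if name in SYSTEM_LOGICALS:
        if not node:
            raise ValueError("system disks need the node name")
        return node.upper() + "_SYS"
    raise ValueError(f"no naming rule known for {logical}")

print(expected_volume_label("USR$DISK7"))
```

Compare the result with the ``Volume Label'' field from SHOW DEVICE /FULL before deciding whether SET VOLUME is needed.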

Mount the Disk to the Cluster

Dismount the Tape and the Disk

Backup probably left the tape mounted (ignore the error if backup dismounted it, but it usually doesn't). So you need to dismount the tape drive.

$ DISMOUNT tape
You have the disk mounted for private use, while you checked things over. Dismount it, so you can remount it for the whole cluster.
$ DISMOUNT /NOUNLOAD disk
Now, mount it for the cluster. You are still logged in as SYSTEM, are you not?
$ @CMN$STARTUP:MOUNTDISKS
This will mount it for the whole cluster, and will give it the normal logicals that we use to address it.

Clean Up

Logout of the SYSTEM account, and any other spare logins you may have left lying around. Put the tape back where it was found. Put the write protect tab back into the original position, too.

System Disk Restoration - Clusters

Non Clustered nodes

This chapter describes how to restore a system disk for a cluster boot node. Refer to the chapter on stand-alone nodes for nodes which are not clustered.

Clustered System Disks

If a WARNER or DISNEY or RDIV system disk needs to be replaced, you will have to shut down the whole cluster. The physical location of files on the system disk will change after the restoration, and the physical id of the disk will be different. These factors mean that you cannot allow any other cluster node to think it still knows where files are on the system disk. It will be wrong in that belief, and you will be the one to suffer for it. You will have to restore the system disk again if you have not shut down all the workstations and other cluster nodes ahead of time!

Locate a Good Backup Tape

 

If The Cluster Is Still Up

If the cluster is still up, you can locate the most recent successful backup of the system disk by reading the log files of the backup batch jobs.

Refer to the section ``Find the Latest Good Backup Tape'' to look through backup log files in order to find a successful backup.

Refer to the section ``Find the Save Set Name'' to get the name of the saveset on the tape. You will need this saveset name later.

If The Cluster Is Already Down

If the cluster is already down, because the system disk is down, you'll just have to use your own judgment in choosing a backup tape. In this case, use the chart in the section ``Find the Save Set Name'' to identify the name of the saveset, which you will need later. This chart is valid at the time of this writeup.

Shutdown The Cluster

You need to shutdown all the nodes in the cluster, saving the boot nodes (DAFFY and ELMER, or MICKEY and MINNIE, or RDIV01 and RDIVA2) for last.

Get a List and Use It

Make a list of all nodes. See HTTP://XENA.FNAL.GOV for the current nodes. I like to fill in a column title with each task that I have to do, and then put a check mark beside each node as I do that task. For instance, I label my columns ``Run Shutdown'', ``Shutdown Is Done'', ``HALT Button'', and later on ``Typed BOOT'', ``Node Is Up''.

Cluster Already Hung?

If the cluster is already hung because the boot node is dead, skip to the step ``HALT Each Node''.

Run SHUTDOWN If Cluster Is Still Up

  If the cluster is still up, use these procedures to shutdown the other nodes in a graceful manner. You can log into account SHUTDOWN on each node, and run shutdown from there. However, it is faster and easier to remotely tell all the nodes to shutdown. You accomplish that as follows.

Log into one of the boot nodes as user SYSMANAGER or SYSTEM. Type the following command, to tell remote nodes to shut themselves down.

$ @SYS$MGR_UTIL:MASS_SHUTDOWN
Answer the questions as follows.

``Nodes (for RemoteDCL)''

Enter five or six nodenames, separated by commas and no spaces. Do only 5-6 at a time, and don't do the boot nodes till everything else is done.

node1,node2,node3,node4,node5

``Select SHUTDOWN or REBOOT''

Type SHUTDOWN, or just S.

``Minutes until shutdown commences (0)''

Type 3 for the first batch of nodes, then 5 for the next batch, etc. The idea is to stagger the shutdown a little bit, so they don't compete with each other while accessing the system disk to close files. I suggest not using 0 minutes the first time (3 or 5 is fine for the first delay) so that the first nodes to start shutting down don't busy up the system so much that your MASS_SHUTDOWN command doesn't have time to complete.

``Shutdown reason:''

Type a succinct and pithy reason for the shutdown. If you have to replace a system disk, you are probably mad enough to be extremely terse and concise! One line is all you get.

Repeat the MASS_SHUTDOWN step as necessary until you get all the nodes shut down. Use your checklist to tick off nodes which have been told to shut down, and nodes which have finished their shutdown. A node has finished its shutdown when SHOW CLUSTER shows it gone, and when SHOW QUEUE node* shows all queues stopped.

Do SHOW CLUSTER to verify that all nodes are shutdown, and that the list was correct. When we add a new node to the cluster, it sometimes takes a little time before I add the node to the list. Same goes when we change the name or location of a node.
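The batching and staggering described above can be planned on paper or with a short script. The following Python sketch splits a node list into groups of five, leaves the boot nodes out, and assigns each group a delay. The function name and the node names N1 through N7 are hypothetical, and the step of 2 minutes between batches is an assumption based on the "3, then 5, etc." suggestion.

```python
def shutdown_batches(nodes, boot_nodes, batch_size=5, first_delay=3, step=2):
    """Return a list of (node-list answer, minutes) pairs for MASS_SHUTDOWN.

    Boot nodes are left out; they are shut down last, by hand.
    """
    others = [n for n in nodes if n not in boot_nodes]
    batches = []
    for start in range(0, len(others), batch_size):
        group = others[start:start + batch_size]
        delay = first_delay + step * (start // batch_size)
        # Names are joined with commas and no spaces, as the prompt expects.
        batches.append((",".join(group), delay))
    return batches

nodes = ["N1", "N2", "N3", "N4", "N5", "N6", "N7", "DAFFY", "ELMER"]
for answer, minutes in shutdown_batches(nodes, ["DAFFY", "ELMER"]):
    print(minutes, answer)
```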

HALT Each Node

  Find and press the HALT button on each node. If you can't find it, you may turn off the power switch for the node. Use your checklist and mark off nodes as you HALT them.

Shutdown The Boot Nodes

When all the other nodes in the cluster are halted, shutdown both of the boot nodes.

Log into your account on HYDRA, or use the VCSMONITOR account. VCSMONITOR can be used while physically at the VCS workstation or from any RDCS account on a node which isn't in the cluster you are shutting down. From your personal account, type VCSMON to run the VCS software. The VCSMONITOR account will automatically run VCS for you.

Rather than typing the usual ``SELECT node'' at the VCS command prompt, you want to shutdown both boot nodes in parallel, so you will send the commands to both nodes at the same time. At the VCS ``Command:'' prompt, you can tell it to OUTPUT text to a particular node or nodes. Use the OUTPUT command, by typing the following to log into SHUTDOWN on both boot nodes. The example will be for WARNER. For DISNEY, use MICKEY and MINNIE in place of DAFFY and ELMER. Remember that you can use the up-arrow key and line editing so that you don't have to keep typing the beginning of the line over and over.

Command: OUTPUT DAFFY,ELMER ""    !a CR to get the USERNAME prompt
Command: OUTPUT DAFFY,ELMER SHUTDOWN    !the username
Command: OUTPUT DAFFY,ELMER SHUTDOWN    !choose SHUTDOWN from the menu
Command: OUTPUT DAFFY,ELMER YES        !if it asks ARE YOU SURE
Command: OUTPUT DAFFY,ELMER 0          !how many minutes
Command: OUTPUT DAFFY,ELMER SYSTEM DISK    !reason
Command: OUTPUT DAFFY,ELMER 2 HRS      !estimate time down

Since this is a cluster, one boot node depends on the other. Invariably, one node will shutdown a trifle faster than the other. When the VCS window for the faster node displays ``SHUTDOWN COMPLETE. USE CONSOLE TO HALT NODE.'' the other node will hang due to loss of quorum. It will not be able to shutdown any further. At this point you can hit the HALT button on both nodes.

Tell the hardware people to go ahead with changing the disk drive.

Boot StandAlone Backup

  Go back to the HYDRA screen. Move the cursor into the window for the node where the system disk has been replaced.

Command: SELECT node
Commands you type will now go directly to that node, rather than to the VCS command line (although of course, the VCS software is actually doing the sending). Refer to the other VCS steps in this note for more VCS commands.

Boot up ``standalone backup'' from another disk in the cluster or from a tape. On the HYDRA screen, the cursor is now inside the window for your node.

SA Backup from Disk

Booting SA Backup from disk is far faster than the alternative. For this reason, we've installed SA Backup on just about every disk we own. If your node can see another disk besides the one just changed, try to boot SA Backup from that disk. To find out what disks are available, at the ``>>>'' prompt type

SHOW DEV

When you know what disks are available, pick a disk other than the newly replaced one. At the ``>>>'' prompt type

B/E0000000 otherdisk      !(that's E + 7 zeros)
If it says it can't find any software to boot, try any other disk which is visible. Watch the boot. It takes about 1-2 minutes. Enter the date and time when requested, in the exact format they ask for.
16-FEB-1999 16:45
It'll CONFIGURE disk and tape devices DKnn and MKnn (disk and tape). Then it will ask if all the devices you want to use are visible. Type YES. If there is a real long wait (10 minutes is too long) for it to ask the question
Enter ``YES'' when all needed devices are available
it may be that the question is there but the screen didn't print it. Try a carriage return; it'll ask the question again.

STANDALONE BACKUP will now finish booting, and you will have a $ prompt in another 1-2 minutes.
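If you want to have the date and time string ready before you sit down at the console, the sketch below prints a timestamp in the same layout as the 16-FEB-1999 16:45 example above. It is only an illustration: the function name is made up, and zero-padding the day is an assumption, since the example does not show a single-digit day. Always follow the format the boot prompt actually asks for.

```python
from datetime import datetime

# Month abbreviations are spelled out here so the result does not depend
# on the locale of the machine running this sketch.
MONTHS = ["JAN", "FEB", "MAR", "APR", "MAY", "JUN",
          "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"]

def sa_backup_datetime(when):
    """Return `when` in the DD-MMM-YYYY HH:MM layout of the example above."""
    return "%02d-%s-%04d %02d:%02d" % (
        when.day, MONTHS[when.month - 1], when.year, when.hour, when.minute)

print(sa_backup_datetime(datetime(1999, 2, 16, 16, 45)))  # 16-FEB-1999 16:45
```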

SA Backup from CD

If there is no other disk which will boot standalone backup, then you have to boot from a CD with standalone backup on it. Tape SA backup was obsoleted with VMS 6.1. (Node CRYBAK is purposely still at VMS V5.5-5. You can use SA backup from a tape for node CRYBAK. Such tapes are in my cabinet.)

The SA backup CD is in my cabinet. Look for a Software Product Library folder on the top shelf. It is marked up with a tag, and the label is marked up to say SABACKUP. Look for the last CD in the set. The label is ``Binaries & Documentation VAXVMS061 Disk 1 of 1.'' (There is another collection if you are working on an alpha. The disk is AXPVMS061.)

You will need to attach a portable SCSI CD drive to the node. There is one in my cabinet, about knee height, and up against the right wall. It is shrouded in a plastic bag, so that all the pieces stay together. The ``kit'' includes a SCSI cable, a power strip, and the CD drive. Caution: The SCSI cable may not have the right connectors for your node.

Determine the SCSI numbers which are already in use on the computer. Check the setting on the CD drive; there are jumpers on the bottom. Power off the computer. Attach the SCSI drive, power up the drive, and then power up the computer again. Beware of SCSI number conflicts. (You can't use 6 on a VAX and you can't use 7 on an Alpha.)
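The SCSI number check above can be sketched as a small helper. This is a rough illustration only: it assumes a bus with SCSI numbers 0 through 7, the function name is invented, and the numbers in the example are hypothetical. The two excluded numbers come straight from the caution above.

```python
# Numbers the text says you cannot use for the CD drive on each platform.
RESERVED_ID = {"VAX": 6, "ALPHA": 7}

def free_scsi_ids(ids_in_use, platform):
    """List SCSI numbers 0-7 that are neither in use nor excluded on `platform`."""
    blocked = set(ids_in_use) | {RESERVED_ID[platform.upper()]}
    return [n for n in range(8) if n not in blocked]

# Hypothetical example: a VAX with devices already at numbers 0 and 7.
print(free_scsi_ids([0, 7], "VAX"))   # prints [1, 2, 3, 4, 5]
```

Set the jumpers on the CD drive to one of the numbers returned, and still confirm with SHOW DEV that the drive is visible.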

Make sure the drive is visible when you type

SHOW DEV

Insert the VAXVMS061 CD in the drive, pressing gently until it clicks into place.

At the ``>>>'' prompt type

B DKAn00
where n is the SCSI id of the CD drive. Watch the boot. Enter the date and time when requested, in the exact format they ask for.
16-FEB-1999 16:45
It'll CONFIGURE the disk and tape devices, DKnn and MKnn. Then it will ask if all the devices you want to use are visible. Type YES. If there is a really long wait (10 minutes is too long) for it to ask the question
Enter ``YES'' when all needed devices are available
it may be that the question is there but the screen didn't print it. Try a carriage return; it'll ask the question again.

STANDALONE BACKUP will now finish booting, and you will have a $ prompt in 1-2 minutes from your YES answer.

Write Protect the Tape

Take the good backup save tape which you found earlier. Slide the write protect tab over. You should now see a RED BAR (1/4 inch or so). This will write protect the tape. You must do this, since restoring the tape will also restore the backup job which was making the tape, and it will try to write over the good tape on the first reboot.

Insert the Tape

Put the tape in the drive.

Go back to the HYDRA screen. The cursor should still be inside the window for your node.

StandAlone Backup Command

Get the saveset name, which you looked up earlier.

The tape drive for most of our boot nodes is MKA700 currently, so the instructions here will just say MKA700.

Get the name of the system disk sysdisk you are restoring: currently the system disk is DKA0 for DAFFY, ELMER, MICKEY, MINNIE, RDIV01 and RDIVA2.

Type the backup restore command:

$ BACKUP/IMAGE    MKA700:saveset.BCK    sysdisk:
Take a minute to stare at it and make SURE you have it right, then hit CR. Upper case isn't important; lower case is fine.

A system disk for one of our cluster boot nodes can take one hour (DISNEY) to two hours or perhaps more for WARNER. (These times are estimates; don't count on them.) You won't see anything on the screen during this time. You should see the tape drive light flashing, if you look. If the front panel is removed, you can also see the drive light flashing.

SA Backup is Done

SA Backup was a Success

StandAlone Backup is done when you get a message saying PROCDONE.

     %BACKUP-I-PROCDONE, operation completed.  Processing finished at ...
     If you do not want to perform another standalone BACKUP operation,
     use the console to halt the system.

     If you do want to perform another standalone BACKUP operation, 
     ensure the standalone application volume is online and ready.
     Enter ``YES'' to continue:

Since the backup was OK, hit the HALT button, if it is enabled, or the RESTART button otherwise.

SA Backup was NOT Okay

If the message does not say

%BACKUP-I-PROCDONE
but instead says something like
%BACKUP-W-
%BACKUP-F-
then read the message carefully. Maybe it wasn't a good save tape after all. In this case, type YES (you need to try again) and find another backup save tape to attempt to restore. Start again at ``Write Protect the Tape'' with the new tape.

If the machine dies (out of the StandAlone Backup program) you can always start again at ``Boot StandAlone Backup'' and boot StandAlone Backup from the disk, or from the StandAlone Backup CD.
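The success and failure cases above can be told apart by the severity letter in the message. The sketch below is only an aid for reading a captured console log; it does not replace reading the message yourself. It assumes the usual VMS message layout %FACILITY-S-IDENT, where the severity letter is S, I, W, E or F. The text above only mentions W and F; treating E the same way is an assumption. The function name is made up.

```python
import re

# Matches the start of a VMS message such as "%BACKUP-I-PROCDONE, ...".
MESSAGE = re.compile(r"^\s*%([A-Z0-9$_]+)-([SIWEF])-([A-Z0-9$_]*)")

def backup_outcome(console_lines):
    """Return 'ok', 'check' or 'unknown' for standalone BACKUP output."""
    done = False
    for line in console_lines:
        m = MESSAGE.match(line)
        if not m or m.group(1) != "BACKUP":
            continue
        severity, ident = m.group(2), m.group(3)
        if severity in "WEF":       # warning, error or fatal: go read it
            return "check"
        if ident == "PROCDONE":
            done = True
    return "ok" if done else "unknown"
```

A result of 'check' means a person needs to read the message carefully, as described above; 'unknown' means PROCDONE has not appeared yet.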

Boot Both Boot Nodes

Bring your cursor on the VCS node back out of the node window by typing ctrl-G.

The CPUs should have a stored value telling them which disk to boot from. However, this value can get lost during the disk change. At the VCS command line, type

Command: OUTPUT DAFFY,ELMER SHOW BOOT
Command: VIEW DAFFY
Command: VIEW ELMER
Read both VIEW screens and make sure the nodes know where to boot. (Do the same thing for MICKEY and MINNIE if doing the DISNEY cluster, or RDIV01 and RDIVA2 for the RDIV cluster.) DAFFY, ELMER, MICKEY, MINNIE, RDIV01, and RDIVA2 should each be set to boot from DKA0. If any VAX node does not have the correct value, type
Command: OUTPUT node SET BOOT DKA0
On RDIVA2, the command is
Command: OUTPUT node SET BOOTDEF_DEV DKA0
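Since the command differs between RDIVA2 and the VAX nodes, here is a small sketch that picks the right VCS line for a given node. It only restates the two commands above; the function name is invented for illustration, and DKA0 is the boot disk named in the text.

```python
# Per the text: RDIVA2 takes SET BOOTDEF_DEV; the VAX nodes take SET BOOT.
BOOTDEF_NODES = {"RDIVA2"}

def set_boot_command(node, disk="DKA0"):
    """Return the VCS line that sets the default boot device on `node`."""
    verb = "SET BOOTDEF_DEV" if node.upper() in BOOTDEF_NODES else "SET BOOT"
    return "OUTPUT %s %s %s" % (node.upper(), verb, disk)

for node in ("DAFFY", "ELMER", "RDIVA2"):
    print(set_boot_command(node))
```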

One boot node cannot boot by itself; it will wait for a quorum of nodes to join the cluster. Boot both nodes now. Use the VCS OUTPUT command to type BOOT to both nodes.

Command: OUTPUT DAFFY,ELMER BOOT
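Why one boot node waits for the other can be shown with the standard OpenVMS quorum arithmetic: quorum is (EXPECTED_VOTES + 2) / 2, rounded down. The sketch below assumes each boot node carries one vote and nothing else votes. That is an assumption made for illustration; this note does not list the actual VOTES and EXPECTED_VOTES settings.

```python
def quorum(expected_votes):
    """OpenVMS cluster quorum: (EXPECTED_VOTES + 2) / 2, rounded down."""
    return (expected_votes + 2) // 2

def can_form_cluster(votes_present, expected_votes):
    """True when the votes present reach quorum."""
    return votes_present >= quorum(expected_votes)

# Assumed configuration: two boot nodes, one vote each, nothing else voting.
print(quorum(2))               # 2
print(can_form_cluster(1, 2))  # False: one boot node alone waits
print(can_form_cluster(2, 2))  # True: both boot nodes together proceed
```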

Check The 2 Node Cluster

Log into the two node mini-cluster and see that things look ok. If you need to reboot either node again, feel free to do so.

Clean Up

Remove CD drive

If you attached a CD drive, run SHUTDOWN on the computer and power it off again. Power off the CD drive, and remove it. Reattach any SCSI devices you may have removed. Boot the computer again, and return the CD drive and the VAXVMS061 CD to their storage locations.

Put the Tape Back

Remove the ``good backup'' tape from the tape drive. Put the write protect switch back to the position you found it in, and put the tape away. Re-insert the backup tape for tomorrow's backup job.

Finish Booting

Reboot the rest of the cluster. Use your checklist to make sure you get all the nodes.

Log Off

Type ctrl-G in the VCS window, if needed. EXIT the VCS software, and log out of HYDRA. Log out of the SYSTEM account, and any other spare logins you may have left lying around.

System Disk Restoration - StandAlone Nodes

  This chapter describes how to restore a system disk for a stand-alone node.

How to Use This Chapter

Connected to HYDRA

Nodes connected to HYDRA include HUEY, DEWEY, LOUIE, WEBBY and DONALD.

Stand Alone Nodes

The remaining stand alone nodes are RODRNR, CRYBAK, and HYDRA. HYDRA, of course, cannot use VCS to monitor itself.

Use the instructions in this chapter. Every reference to doing something ``in the HYDRA window'' or ``in the VCS window'' should be read as ``at the screen for your node''.

Shutdown the Node

  Log in to the SYSTEM or SYSMANAGER account. Type:

$ @SYS$SYSTEM:SHUTDOWN
and answer the questions.

You can halt the system when you see:

SYSTEM SHUTDOWN COMPLETE - USE CONSOLE TO HALT SYSTEM

How to Halt the System

  Type ctrl-P in the HYDRA window where you are connected to the node (if it asks ``Do you want to transmit a BREAK'', answer YES), or hit the HALT button on the front of the node.

Locate a Good Backup Tape

Node is Still Up

If the node is still up, you can locate the most recent successful backup of the system disk by reading the log files of the backup batch jobs.

Look through the backup log files, as described elsewhere in this note, in order to find a successful backup.

Also get the name of the saveset on the tape, as described elsewhere in this note. You will need this saveset name later.

Node is Already Down

If the node is already down, because the system disk is down, you'll just have to use your own judgment in choosing a backup tape. In this case, use the saveset chart elsewhere in this note to identify the name of the saveset, which you will need later. This chart is valid at the time of this writeup.

Write Protect the Tape

Take the backup tape, and slide the write protect tab over. You should now see a RED BAR (1/4 inch or so). This will write protect the tape. You must do this, since restoring the tape will also restore the backup job which was making the tape, and it will try to write over the good tape on the first reboot.

Insert the Tape

Put the tape in the drive.

Go back to the HYDRA screen. The cursor should be inside the window for node HUEY (or whatever node you are restoring). If it isn't inside the window, put it there by typing

Command: SELECT nodename

Boot Standalone Backup

Boot up ``standalone backup'' from the disk or from a CD (or, for CRYBAK, from a tape). On the HYDRA screen, the cursor is now inside the window for node HUEY.

SA Backup from Disk

At the ``>>>'' prompt type

B/E0000000 DKA0      !(that's E + 7 zeros)
Watch the boot. It takes about 1-2 minutes. Enter the date and time when requested, in the exact format they ask for.
16-FEB-1999 16:45
It'll CONFIGURE devices DKA0 and MKA700 (disk and tape). Type YES to say that that is all you expect to see. If there is a really long wait (10 minutes is too long) for it to ask the question
Enter ``YES'' when all needed devices are available
it may be that the question is there but the screen didn't print it. Try a carriage return; it'll ask the question again.

STANDALONE BACKUP will now finish booting, and you will have a $ prompt in another 1-2 minutes.

SA Backup from CD

If it is a newly installed disk, or if the disk is too far gone to let you boot standalone backup from the disk, then you have to boot standalone backup from a CD. Tape SA backup was obsoleted with VMS 6.1. (Node CRYBAK is purposely still at VMS V5.5-5. You can use SA backup from a tape for node CRYBAK. Such tapes are in my cabinet.)

The SA backup CD is in my cabinet. Look for a Software Product Library folder on the top shelf. It is marked up with a tag, and the label is marked up to say SABACKUP. Look for the last CD in the set. The label is ``Binaries & Documentation VAXVMS061 Disk 1 of 1.'' (There is another collection if you are working on an alpha. The disk is AXPVMS061.)

You will need to attach a portable SCSI CD drive to the node. There is one in my cabinet, about knee height, and up against the right wall. It is shrouded in a plastic bag, so that all the pieces stay together. The ``kit'' includes a SCSI cable, a power strip, and the CD drive. Caution: The SCSI cable may not have the right connectors for your node.

Determine the SCSI numbers which are already in use on the computer. Check the setting on the CD drive; there are jumpers on the bottom. Power off the computer. Attach the SCSI drive, power up the drive, and then power up the computer again. Beware of SCSI number conflicts. (You can't use 6 on a VAX and you can't use 7 on an Alpha.)

Make sure the drive is visible when you type

SHOW DEV

Insert the VAXVMS061 CD in the drive, pressing gently until it clicks into place.

At the ``>>>'' prompt type

B DKAn00
where n is the SCSI id of the CD drive. Watch the boot. Enter the date and time when requested, in the exact format they ask for.
16-FEB-1999 16:45
It'll CONFIGURE the disk and tape devices, DKnn and MKnn. Then it will ask if all the devices you want to use are visible. Type YES. If there is a really long wait (10 minutes is too long) for it to ask the question
Enter ``YES'' when all needed devices are available
it may be that the question is there but the screen didn't print it. Try a carriage return; it'll ask the question again.

STANDALONE BACKUP will now finish booting, and you will have a $ prompt in 1-2 minutes from your YES answer.

StandAlone Backup Command

  When you reach the $ prompt, insert the tape with the ``good HUEY backup'' (write protected). Type the backup restore command:

$ BACKUP/IMAGE    MKA700:    DKA0:
Take a minute to stare at it and make SURE you have it right, then hit CR. Upper case isn't important; lower case is fine.

The front end nodes (HUEY, DEWEY, LOUIE, WEBBY) take about 30 minutes to do the restore. You won't see anything on the screen during this time. You should see the tape drive light flashing, if you look. If the front panel is removed, you can also see the drive light flashing.

SA Backup is Done

SA Backup was a Success

StandAlone Backup is done when you get a message saying PROCDONE.

     %BACKUP-I-PROCDONE, operation completed.  Processing finished at ...
     If you do not want to perform another standalone BACKUP operation,
     use the console to halt the system.

     If you do want to perform another standalone BACKUP operation, 
     ensure the standalone application volume is online and ready.
     Enter ``YES'' to continue:

Since the backup was OK, hit ctrl-P for a HALT and answer YES to ``do you really want to send a BREAK''. You will get

     02 EXT HLT 
            PC = 8015D6A4
     >>>

SA Backup was NOT Okay

If the message does not say

%BACKUP-I-PROCDONE
but instead says something like
%BACKUP-W-
%BACKUP-F-
then read the message carefully. Maybe it wasn't a good save tape after all. In this case, type YES (you need to try again) and find another backup save tape to attempt to restore. Write protect the new tape and start again at ``StandAlone Backup Command''.

If the machine dies (out of the StandAlone Backup program) you can always start again at ``Boot Standalone Backup'' and boot StandAlone Backup from the disk, or from the StandAlone Backup CD (or tape, for CRYBAK).

Boot Up

Cross your fingers. Type

>>> B DKA0
and watch the boot to make sure it works normally.

Clean Up

Remove CD drive

If you attached a CD drive, run SHUTDOWN on the computer and power it off again. Power off the CD drive, and remove it. Reattach any SCSI devices you may have removed. Boot the computer again, and return the CD drive and the VAXVMS061 CD to their storage locations.

Put the Tape Back

Remove the ``good backup'' tape from the tape drive. Put the write protect switch back to the position you found it in, and put the tape away. Re-insert the backup tape for tomorrow's backup job.

Log Off HYDRA

If you typed SELECT HUEY, then your cursor is in the HYDRA window. If you typed CONNECT HUEY, then your cursor is attached to HUEY and you don't even see the HYDRA window bar any more. In either case, remember that you are still in HYDRA! When you are done, you must remember to type ctrl-G and answer YES to return control to HYDRA, so other people can monitor systems. Then type EXIT to quit the VCS program, and LOGOUT to leave HYDRA. If you used the VCSMONITOR account, you don't need to log out since it automatically does that when you EXIT from VCS.

Ctrl-G
YES
Command: EXIT
$ LOGOUT

Camac FE ``Scribbling On System Disk''

How to recognize and fix the Huey, Dewey, Louie, or Webby disk problem, if it's the same one we had before (in 1991, repeatedly).

Look at VCS Info

  Log into HYDRA and watch the node in question. If you do not have a personal account on HYDRA, you can use the VCSMONITOR captive account. You must log into VCSMONITOR from a known account: from a WARNER or DISNEY ``RDCS'' account or system account, or from a DISNEY crew chief account. Alternatively, you can refer to the password list and log in as SYSTEM. You do not want to use the SYSMANAGER account, since it cannot run VCS, due to protections which are unavoidable. The SYSTEM account can run VCS.

A few basic HYDRA commands include:

$ VCSMON
Command: VIEW HUEY
Command: REVIEW HUEY /SINCE="03-DEC 13:15"
Command: SELECT HUEY !to move cursor into window if needed
Command: EXIT
The REVIEW command can get you data which is too old to appear on your screen. However, it doesn't continue to update with current information. You need to use VIEW to get back to the active info. Other commands to move within the HYDRA windows include:
Command: prev-screen or next-screen key
Command: KP5      !scroll back one line in the window
Command: KP2      !scroll forward one line in the window
Command: FIND/REVERSE phrase      !find, backwards. Can use KP7.
Command: FIND/FORWARD phrase      !find, forwards. Can use KP8.
Command: GOTO LAST or KP4      !move to the current stuff,
!                             !at the bottom of the window
Command: GOTO FIRST or KP6     !move to top of window
!                             !oldest stuff available

If you get an ``Access conflict'' message on the very bottom line (under Command) this implies that two people are looking at the same screen and the software got confused. It will probably not update the screen anymore, so you won't see current data. This doesn't imply that two people should not look at the same thing, just that occasionally you'll have to jog VCS's memory. You may be able to get the display back to current data by typing VIEW `node' again, but you may have to exit and re-enter VCSMON to fix the condition.

Also, if at any time you expect to see something happening on the node you are watching, and you don't see the HYDRA screen update for too long a time, try exiting and re-entering VCSMON. The software does have occasional memory lapses of this sort, though not very often.

Diagnosis: Some Symptoms

  If you cannot get basic things to work (like booting) then the disk may be messed up. You can look for some of the symptoms mentioned below.

If you already know the disk is messed up, but the system is still up, go to ``Shutdown the Node'' to shut down the node and then replace the disk. If the node is already down, skip the shutdown step.

Disk Goes Into Mount Verification

After a reboot, disk node$DKA0 goes into mount verification: ``HUEY$DKA0 contains the wrong volume. Mount verification in progress.'' At this point it will either hang forever (I only waited 10 minutes) or will eventually boot itself. If it doesn't boot itself, you need to boot it. Press the halt button and type BOOT, or power cycle the node.

Disk Becomes Write Locked

After the second boot, HYDRA will show possibly:

Message from user SOMETHING on HUEY (ERRFMT, JOB_CONTROL, etc)
-SYSTEM-F-DATACHECK, write check error

Message from user SOMETHING on HUEY (ERRFMT, JOB_CONTROL, etc)
-RMS-E-WLK, device currently write locked
A system cannot run on a write locked disk. It needs to write system logfiles and stuff.
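If you have saved the console text from HYDRA to a file, a quick scan for the two messages above can be sketched as follows. This is an illustration only: the function name is made up, and it simply looks for the two message idents quoted above, nothing more.

```python
# Message idents quoted in the text above.
SYMPTOMS = {
    "-SYSTEM-F-DATACHECK": "write check error",
    "-RMS-E-WLK": "device currently write locked",
}

def find_symptoms(console_text):
    """Return the sorted symptom idents that appear in `console_text`."""
    return sorted(ident for ident in SYMPTOMS if ident in console_text)

sample = ("Message from user SOMETHING on HUEY\n"
          "-RMS-E-WLK, device currently write locked\n")
for ident in find_symptoms(sample):
    print(ident, "=", SYMPTOMS[ident])
```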

When do the Symptoms Occur?

In all of the cases we've encountered so far, the reboot was fine until the point when the EPICURE batch job ran. This has led us to insert some delays into the EPICURE startup job, to try to remove conditions which seem to cause the ``scribble on system disk'' problem.

How to Log On Once It's Write Locked

Once the disk is write locked (the overall symptom of this problem), no users will be allowed to log in. The only way to log in will be through the HYDRA screen. You will then have to type

$ VCSMON
Command: SELECT HUEY
This will put your cursor up into the view window. REMEMBER that you are still in HYDRA! When you are done, you must remember to type ctrl-G and return control to HYDRA, so other people can monitor systems.

Now you can log in as anybody - it doesn't have a sysuaf file, so it can't check. You will have all privileges. It may be the case that it won't let you log in as anybody; if so, then use the username SYSTEM.

Proceed to Restore the System Disk

Continue with the instructions in the chapter ``System Disk Restoration - StandAlone Nodes'' to restore the system disk after it has been corrupted.

Boot Up

Cross your fingers. Type

>>> B DKA0
and hope this boot is a good one. I had to do this whole process twice in August 1991, on node HUEY. On December 3, 1991, between John and me this whole process was repeated about 7 times. In all of these cases, the reboot was fine until the point when the EPICURE batch job ran. This has led us to insert some delays into the EPICURE startup job, to try to remove conditions which seem to cause the ``scribble on system disk'' problem.

The problem has not been seen since 1991; we think it was cured. (The front ends were vs3200s then, and are now vs4000-60's.)

Keywords: WARNER, DISNEY, backups, restore, disk, tape
Distribution:
normal

baddorf@fnal.gov
