
RD/EE Controls

EPICURE Software Release Note 122.0

When the QUEUE file is corrupted

Debra S. Baddorf

VMS VERSION

Most, if not all, of this information will be obsolete when we go to VMS V5.5. Bear this in mind!

SYMPTOMS

During startup, there will be complaints that the queues are not running, that the queue manager is not started or cannot be started, and/or that the queue file is corrupted. The name of the queue file is SYS$SYSTEM:JBCSYSQUE.DAT before VMS V5.5.

If you log on, SHOW QUEUE will tell you that the queue manager isn't running, and no queues will be listed. If you try to remedy this by logging in as SYSMANAGER and running @CMN$STARTUP:STARTQUEUES, you will be told that the queue manager cannot start because the queue file is corrupt.

CURE

All of the following should be done as SYSMANAGER.

Try to Recover the old queue file

Need a Running Queue Manager

If the queue manager is still running on at least one node, log into that node and skip ahead to the ``Kludge to recover old queue file'' step. Until you STOP the queue manager (or the node reboots), it may continue to run on the corrupt queue file, so some nodes' batch jobs may still be OK.

To find out if any nodes still have a running queue manager, type

$ SYSMAN

SYSMAN> SET ENV/CLUSTER

SYSMAN> SET TIMEOUT 0:2

SYSMAN> DO SHOW QUEUE

SYSMAN> EXIT

If all nodes say that the queue manager is not running, continue with this step and try to force one to start. If that doesn't work, you'll have to abandon the attempt to recover the old queue file and go to the ``Make a New Queue File from Scratch'' section. This trick is ``unsupported and undocumented,'' but it works well enough for someone else to have written these procedures, and it did work the only time I tried it.

Pick a node and log into it (it might be safer not to use DAFFY or ELMER, unless those are the only nodes booted and you have a minimum cluster). Do the following to set the dynamic SYSGEN parameter JOBCTLD to 1 (this tells the queue manager to skip the file checking).

$ SYSGEN

SYSGEN> USE ACTIVE

SYSGEN> SHOW JOBCTLD

SYSGEN> SET JOBCTLD 1

SYSGEN> WRITE ACTIVE

SYSGEN> EXIT

$ START/QUEUE/MANAGER

$ SYSGEN

SYSGEN> USE ACTIVE

SYSGEN> SET JOBCTLD 0

SYSGEN> WRITE ACTIVE

SYSGEN> EXIT

If the START/QUEUE/MANAGER step worked, you can now do SHOW QUEUE (on this node only). If it failed with an error message, you'll have to go to the ``Make a New Queue File from Scratch'' section. In either case, run the second SYSGEN sequence above, so that JOBCTLD is set back to 0, which is how it started.

Kludge to recover old queue file

You should already be logged into a node which has a queue manager running. Do the following:

$ SET DEFAULT USR$DISK1:[BADDORF.WIP.FIXQUE]

$ @FIXQUE

This will do SHOW QUEUE commands with the output sent to files called FIXQUE_CHARS.LIST, FIXQUE_FORMS.LIST, and FIXQUE_QUEUE.LIST. It then parses the contents of these files to create a command procedure called FIXQUE_RELOAD.COM.
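FIXQUE itself is a DCL procedure, but the idea behind the parse-and-regenerate step can be sketched in a few lines of Python. Everything here is illustrative: the listing format is a guess at pre-V5.5 SHOW QUEUE output, and `reload_commands` is a hypothetical stand-in for what FIXQUE does when it turns FIXQUE_QUEUE.LIST into FIXQUE_RELOAD.COM.

```python
import re

def reload_commands(show_queue_listing):
    """Turn a SHOW QUEUE listing into DCL commands that recreate the queues.

    Assumed listing format (a guess at pre-V5.5 output):
        Batch queue WARNER_SYSTEM, idle, on WARNER::
    Each recognized line becomes an INITIALIZE/QUEUE command.
    """
    commands = []
    for line in show_queue_listing.splitlines():
        m = re.match(r"(Batch|Printer) queue (\S+), \w+, on (\S+)::", line)
        if not m:
            continue  # skip headers, blank lines, job detail lines
        kind, name, node = m.groups()
        flag = "/BATCH" if kind == "Batch" else "/DEVICE=PRINTER"
        commands.append(f"$ INITIALIZE/QUEUE{flag}/ON={node}:: {name}")
    return commands

listing = """\
Batch queue WARNER_SYSTEM, idle, on WARNER::
Printer queue DISNEY_LPS40, stopped, on DISNEY::
"""
for cmd in reload_commands(listing):
    print(cmd)
```

The real procedure must also carry over queue characteristics and forms (hence FIXQUE_CHARS.LIST and FIXQUE_FORMS.LIST), but the structure is the same: parse the listings, emit one DCL command per definition.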

Now start a new, empty queue file, and we'll load this saved information into it.

$ STOP/QUEUE/MANAGER

$ START/QUEUE/MANAGER/NEW MSYS_COMMON:[SYSEXE]

$ @FIXQUE_RELOAD

This should reconstruct the information in the queue file. Proceed to the ``Post Cure'' section for any nodes which are already up. Reboot those which are still down; that is all they will need.

Make a New Queue File from Scratch

If you can't recover the old queue file, you need to create a new queue file from scratch. Do the following.

$ START/QUEUE/MANAGER/NEW MSYS_COMMON:[SYSEXE]

$ @SYS$MGR_UTIL:RESTORE_PRINT_FORMS

$ SYSMAN

SYSMAN> SET ENV/CLUSTER

SYSMAN> SET TIMEOUT 0:5

SYSMAN> @CMN$STARTUP:STARTQUEUES

SYSMAN> EXIT

This will reconstruct the bare queues, but none of the jobs that were in them; those are lost. Proceed to the ``Post Cure'' section for any nodes which are already up. Reboot those which are still down; that is all they will need.

POST CURE

Problem occurred during cluster reboot

If this queue file failure occurred during a cluster reboot (which it frequently does, on a power glitch during backup jobs), then we need to go back and run everything that should have run in the batch queues during startup. It didn't run during the reboots since the batch queues were busted, remember?

First, create a temporary file USR$SCRATCH:[SYSMGR]AFILE.TMP, and put the following lines in it:

$ @CMN$STARTUP:EPICURE_SYSSTARTUP

$ scsnode == f$getsyi("NODENAME")

$ SYS_SUBMIT := SUBMIT /NOPRINT /QUEUE='scsnode'_SYSTEM -
    /USER=SYSMANAGER /LOG=SYS$MANAGER: /PRIORITY=255

$ SYS_SUBMIT /NAME=BOOTJOB SYS_STARTUP:BOOTBATCH.JOB
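The point of the SYS_SUBMIT symbol is that the queue name is derived from the node name (f$getsyi("NODENAME") in DCL), so each node resubmits its boot job to its own <node>_SYSTEM batch queue. A small illustrative Python mirror of how that command string comes together (the function name is mine, not part of the procedure):

```python
def boot_submit_command(node_name):
    """Build the SUBMIT command the temporary file issues on each node.

    Mirrors the DCL symbol SYS_SUBMIT: the queue name comes from the node
    name, so the boot job runs on the node that missed it during the reboot.
    """
    return ("SUBMIT /NOPRINT /QUEUE=" + node_name + "_SYSTEM "
            "/USER=SYSMANAGER /LOG=SYS$MANAGER: /PRIORITY=255 "
            "/NAME=BOOTJOB SYS_STARTUP:BOOTBATCH.JOB")

print(boot_submit_command("WARNER"))
```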

Now, do the following. Yes, if you came here from the ``Make a New Queue File from Scratch'' section, you are doing STARTQUEUES twice. That is on purpose: some queues won't be defined properly until certain other queues have been created.

$ SYSMAN

SYSMAN> SET ENV/CLUSTER

SYSMAN> SET TIMEOUT 0:5

SYSMAN> @CMN$STARTUP:STARTQUEUES !yes, again

SYSMAN> @USR$SCRATCH:[SYSMGR]AFILE.TMP

SYSMAN> EXIT

Your temporary command file will run the epicure startup job and the other batch-based startup jobs which didn't happen during the reboot, due to the file problem.

Problem occurred while nodes were live

If the queue file went bad while the cluster was running, then most queue managers are probably still running. In that case, you were probably able to recover the queue definitions using the ``Kludge to recover old queue file'' procedure. So you now have a new queue manager running on one node, while the old ones still run on the other nodes, talking to the corrupt queue file. We need to stop all those old managers and restart them against the new queue file.

Create a temporary file USR$SCRATCH:[SYSMGR]AFILE.TMP, and put the following lines in it:

$ STOP/QUEUE/MANAGER

$ @CMN$STARTUP:STARTQUEUES

Now, run this file on all nodes in the cluster.

$ SYSMAN

SYSMAN> SET ENV/CLUSTER

SYSMAN> SET TIMEOUT 0:5

SYSMAN> DO @USR$SCRATCH:[SYSMGR]AFILE.TMP

SYSMAN> EXIT

Keywords: RDCS, WARNER, DISNEY, controls, queues

Distribution:

D Baddorf

J DeVoy

B Kramper

T Watts
