However, we had a bunch of IO related to the interlock subsystem that we were hard pressed to find room for in the new mount computer. So we left the 26 volt IO (along with a 16-bit ADC, the Acromag VME board) in the old mount chassis, which at that point was renamed the "interlock" computer.
In a properly operating system, things work like this: the interlock crate sends a steady stream of data packets to the mount crate, and the mount crate sends heartbeat packets back to the interlock crate. A hardware watchdog stands behind these software checks and will drop the drive enable signal on its own if the heartbeats stop.
It is vital to note that interpreting event log information during a communication failure is highly uncertain. Information may or may not be getting to the event log from either the mount or interlock crates. Either crate may be performing actions for which the usual messages are not making their way to the logs.
It is also important to note that the timestamps in the event log are uncertain. We have three computers participating, with unsynchronized clocks, and we do not know what network latencies may exist between a log message being sent and its being entered in the log. In any event, the times logged are the times on the computer running the log server (typically hacksaw) when each message is received and entered into the log, not the times when the events occurred.
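To make that concrete, the logging behaves roughly like the sketch below (illustrative only, not the actual event log code; the function names and message format are made up). The timestamp comes from the log server's own clock at the moment the message arrives, so sender clock skew and network latency are folded invisibly into every recorded time.

    /* Illustrative only: a log server that stamps messages on receipt
     * with its own clock, as the real event log does.  Everything else
     * about this sketch is made up. */
    #include <stdio.h>
    #include <time.h>

    static void log_message(FILE *log, const char *host, const char *msg)
    {
        struct timespec now;

        /* Clock of the machine running the log server (hacksaw), read
         * when the message is entered -- not when the event occurred. */
        clock_gettime(CLOCK_REALTIME, &now);
        fprintf(log, "%ld.%03ld %s %s\n",
                (long)now.tv_sec, now.tv_nsec / 1000000L, host, msg);
    }

    int main(void)
    {
        log_message(stdout, "MOUNT (128.196.100.228)", "Interlock comm failure");
        return 0;
    }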
The mount crate has a 0.5 second timeout running while waiting for data packets; if this timeout expires, 5 data messages have gone unreceived, and the mount emits the message "Interlock comm failure". It would like to kill the drives at that point, but does not expect to be able to while communication is down. When and if it again receives a valid data packet, it kills the drives (this cleans up a bunch of state in the mount code). The mount does not send heartbeat signals while it is not receiving data from the interlock crate. Only the interlock crate has the ability to kill the drives in the event of a communication failure: it will do so if it stops receiving heartbeat signals, and, independently of that, the watchdog will expire and kill the drives.
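In outline, the mount-side logic amounts to something like the sketch below. This is not the actual mount code: recv_data_packet(), send_heartbeat(), kill_drives(), and event_log() are hypothetical stand-ins (left as externs), and the 10 Hz data rate is inferred from "0.5 second timeout = 5 missed messages".

    /* Sketch of the mount-side receive loop described above.  All four
     * primitives are hypothetical; only the timing and the ordering of
     * actions follow the text. */
    #include <stdbool.h>

    #define DATA_TIMEOUT_SEC 0.5   /* 5 missed packets at 10 Hz */

    extern bool recv_data_packet(double timeout_sec); /* false on timeout */
    extern void send_heartbeat(void);
    extern void kill_drives(void);
    extern void event_log(const char *msg);

    void mount_comm_loop(void)
    {
        bool comm_failed = false;

        for (;;) {
            if (recv_data_packet(DATA_TIMEOUT_SEC)) {
                if (comm_failed) {
                    /* Communication restored: only now can the mount
                     * kill the drives and clean up its own state. */
                    kill_drives();
                    comm_failed = false;
                }
                /* Heartbeats go out only while data is coming in. */
                send_heartbeat();
            } else if (!comm_failed) {
                event_log("Interlock comm failure");
                comm_failed = true;
                /* We would like to kill the drives here, but with comm
                 * down only the interlock crate and the watchdog can. */
            }
        }
    }

The asymmetry to notice: the mount can detect the outage immediately, but as the code stands it only performs its own drive kill once communication returns.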
The interlock crate likewise has a 0.5 second timeout running while waiting for heartbeat packets. Once 10 of these timeouts have occurred (after 5 seconds), it announces "lost heartbeat, killing drives". By this time, though, the watchdog has already killed the drives, some 3 seconds earlier.
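The interlock side, in the same hypothetical terms (again a sketch, not the real code; recv_heartbeat() and event_log_outage_end() are made-up names):

    /* Sketch of the interlock-side heartbeat check.  Ten consecutive
     * 0.5 second timeouts (5 seconds) trigger the software kill; the
     * hardware watchdog has normally dropped the drive enable signal
     * about 3 seconds before this fires. */
    #include <stdbool.h>

    #define HEARTBEAT_TIMEOUT_SEC 0.5
    #define MAX_MISSED_HEARTBEATS 10

    extern bool recv_heartbeat(double timeout_sec); /* false on timeout */
    extern void kill_drives(void);
    extern void event_log(const char *msg);
    extern void event_log_outage_end(int missed);   /* hypothetical */

    void interlock_heartbeat_loop(void)
    {
        int missed = 0;

        for (;;) {
            if (recv_heartbeat(HEARTBEAT_TIMEOUT_SEC)) {
                if (missed >= MAX_MISSED_HEARTBEATS)
                    /* e.g. "End of heartbeat outage after 1965 missing" */
                    event_log_outage_end(missed);
                missed = 0;
            } else if (++missed == MAX_MISSED_HEARTBEATS) {
                event_log("lost heartbeat, killing drives");
                kill_drives();
            }
        }
    }

Note that missed keeps counting through an outage; a figure like the "1965 missing" in the second log excerpt below falls out of exactly this kind of bookkeeping.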
In practice, what we see is that the watchdog kills the drives in about 2 seconds. The interlock crate notices that the drive enabling signal has been lost and confirms the kill (keeping software state consistent, as it does for all ready chain situations). If communication is ever restored, the mount will also command a kill of the drives, as the last participant in this beating of a dead horse.
Thu May 31, 2012 03:25:54.793 MOUNT (128.196.100.228) Interlock comm failure
Thu May 31, 2012 03:25:54.6210 INTERLOCK (128.196.100.26) Chain kills alt drives outbits: 111100000000000000001010000000000000000000000000 inbits: 111111111111111000010001011111101111111111000110100110110000111100011011000000000001100001111000 [DETAILS]
Thu May 31, 2012 03:25:57.5127 INTERLOCK (128.196.100.26) lost heartbeat, killing drives
Thu May 31, 2012 03:26:08.775 MOUNT (128.196.100.228) Killed ALT drive due to: DAC out of range
Thu May 31, 2012 03:26:24.753 MOUNT (128.196.100.228) Rotator drives shutting down NE head: 70.18997
Thu May 31, 2012 03:26:24.1054 MOUNT (128.196.100.228) F check (Excess PERR) killing DEROT axis

Here we have event log messages from both the mount and interlock computers. The watchdog kills the drives at about the same time that the mount realizes that it is not getting data from the interlock computer.
The "DAC out of range" message (14 seconds later) is due to servo windup after the amplifier has been switched off and the drives killed. The mount computer has no idea that the servo amplifier is turned off and is still trying to control the telescope.
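For what it is worth, the windup mechanism can be shown with a toy PI loop. Everything here is made up for illustration (the gains, the 16-bit DAC limit, the 100 Hz loop rate); the gains are chosen only so the toy trips after roughly 13 seconds, the same ballpark as the 14 seconds in the log.

    /* Toy illustration of integrator windup -- not the real servo code.
     * With the amplifier off, the axis never moves, so the position
     * error never shrinks, the integral term grows without bound, and
     * the commanded DAC value eventually leaves its legal range. */
    #include <stdio.h>

    #define DAC_MAX 32767      /* assumed 16-bit DAC full scale */
    #define KP 200.0           /* made-up gains */
    #define KI 2500.0

    int main(void)
    {
        double error = 1.0;    /* position error that never closes */
        double integral = 0.0;
        double dt = 0.01;      /* assumed 100 Hz servo loop */

        for (int tick = 0; ; tick++) {
            integral += error * dt;           /* winds up forever... */
            double dac = KP * error + KI * integral;
            if (dac > DAC_MAX) {              /* ...until this trips */
                printf("DAC out of range after %.2f s\n", tick * dt);
                break;
            }
            /* Amplifier off: the telescope does not move, so nothing
             * ever reduces the error or bounds the integral. */
        }
        return 0;
    }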
It is curious that the Rotator drives do not get killed by the watchdog.
I intend to make a code change such that the mount does a drive kill as soon as it first detects a "comm failure". This will avoid the servo windup that currently follows the watchdog turning off the amplifiers.
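In terms of the hypothetical mount loop sketched earlier, the change would look something like this (a sketch of intent, not the actual patch):

    /* Same hypothetical primitives as before.  The one change: kill the
     * drives locally as soon as the comm failure is first detected,
     * instead of waiting for communication to return.  The interlock
     * crate and the watchdog still do the real work of dropping the
     * amplifiers; the local kill just cleans up the mount's servo state
     * so the integrator cannot wind up into "DAC out of range". */
    #include <stdbool.h>

    #define DATA_TIMEOUT_SEC 0.5

    extern bool recv_data_packet(double timeout_sec);
    extern void send_heartbeat(void);
    extern void kill_drives(void);
    extern void event_log(const char *msg);

    void mount_comm_loop(void)
    {
        bool comm_failed = false;

        for (;;) {
            if (recv_data_packet(DATA_TIMEOUT_SEC)) {
                comm_failed = false;
                send_heartbeat();
            } else if (!comm_failed) {
                event_log("Interlock comm failure");
                kill_drives();   /* NEW: do not wait for comm to return */
                comm_failed = true;
            }
        }
    }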
The operators' log indicates that the GUIs "went white" and were not updating. Skip says that he could not ping the mount, and that only the mount was rebooted.
It is worth noting that the mount was able to make event log entries at the start of the problem, and again some 32 seconds later, but eventually became entirely unresponsive.
Fri Jun 08, 2012 20:37:10.4950 INTERLOCK (128.196.100.26) Chain kills alt drives outbits: 111100000000000000001010000000000000000000000000 inbits: 111111111111111000010001011111101111111111000110011010110000111100111111000000000001101001111000 [DETAILS]
Fri Jun 08, 2012 20:37:13.3995 INTERLOCK (128.196.100.26) lost heartbeat, killing drives
.... .... ....
Fri Jun 08, 2012 20:53:37.3297 INTERLOCK (128.196.100.26) End of heartbeat outage after 1965 missing (interlock)

Here we have just the opposite situation: no messages are making it to the event log from the mount crate, and we have only the interlock crate's side of the story. It was necessary to reboot the mount crate to recover from whatever went wrong (which included the GUI interfaces whiting out). We believe it was not responding to pings prior to the reboot, but cannot be certain. The interlock crate apparently recovered by itself after waiting out 1965 timeouts (1965 x 0.5 seconds is about 16.4 minutes, consistent with the log timestamps).