Introduction

Once upon a time there was just one computer that controlled the motions of the MMT. We called this the "mount crate". It was a VME machine with a Motorola MVME167 CPU. At some point we wanted to improve the performance of the elevation servo and needed more horsepower than a 33 MHz CISC processor with 16 megabytes of RAM could provide. We introduced a new mount "crate" based on a Pentium processor running at over 2 GHz with several gigabytes of RAM. This machine also had a PCI network card capable of 100 Mbit/s communication. All of these were significant upgrades over the previous mount computer.

However, we had a bunch of I/O related to the interlock subsystem that we were hard pressed to find room for in the new mount computer. So we left the 26-volt I/O (as well as a 16-bit ADC, the Acromag VME board) in the old mount chassis, which was then renamed the "interlock" computer.

UDP communication

The interlock computer sends a packet of information at 10 Hz to the mount computer via UDP. This is a fairly compact bundle of information that includes all of the 26-volt "bits" as well as the ADC values. It allows the mount computer to keep a local "mirror" of the values in the interlock crate (effectively a read-only distributed shared memory scheme). Changes to the interlock settings are commanded via TCP commands sent from the mount computer to the interlock computer. In general, only the mount computer communicates with the interlock computer. This is not strictly true: the interlock computer does boot from hacksaw and makes event log entries on a log server (currently hacksaw). The intent, however, is that all application software is unaware of the partitioning and still talks only to the mount computer, just as it did before control was split between two computers. Nothing prevents other computers or processes from talking directly to the interlock computer, but this is strongly discouraged and not supported.
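
To make the scheme concrete, here is a rough sketch (not the actual MMT source) of what the 10 Hz status packet and the mount-side mirror loop could look like in C. The struct layout, field sizes, and the UDP port number are assumptions; the bit counts are only guessed from the log excerpts later in this note.

/* Hypothetical sketch of the 10 Hz interlock status packet and the
 * mount-side loop that mirrors it.  Field names, sizes, and the UDP
 * port are illustrative assumptions, not the real MMT definitions. */
#include <stdint.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

#define STATUS_PORT 5800              /* assumed UDP port */

struct interlock_status {
    uint8_t inbits[12];               /* 26-volt input bits (96 assumed)   */
    uint8_t outbits[6];               /* 26-volt output bits (48 assumed)  */
    int16_t adc[8];                   /* 16-bit Acromag ADC readings
                                         (channel count assumed)           */
};

static struct interlock_status mirror;    /* local read-only copy */

int main(void)
{
    int sock = socket(AF_INET, SOCK_DGRAM, 0);
    struct sockaddr_in addr = { .sin_family = AF_INET,
                                .sin_port   = htons(STATUS_PORT),
                                .sin_addr.s_addr = htonl(INADDR_ANY) };
    bind(sock, (struct sockaddr *)&addr, sizeof addr);

    for (;;) {
        struct interlock_status pkt;
        ssize_t n = recv(sock, &pkt, sizeof pkt, 0);   /* arrives at 10 Hz */
        if (n == (ssize_t)sizeof pkt)
            mirror = pkt;             /* update the local mirror */
    }
}

On the interlock side the same structure would be filled from the VME hardware and sent with sendto(); the TCP command path for changing interlock settings is separate and not shown here.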

The heartbeat

The system as described thus far would be adequate to run the telescope if nothing ever went wrong. We were concerned, however, about possible loss of network communication or failure of the mount computer itself. The original (pre-partition) design included a hardware watchdog to guard against hung software. The watchdog is a device which needs to be continually reset; if not reset, it will time out (after approximately 2 seconds) and open the circuit providing power to the relays that enable the telescope drives. In the original system, the reset pulse was sent from the thread that was running the telescope servos (at 100 Hz). In the new system, only the interlock computer can directly access the watchdog, and it must reset it in response to some kind of regular "thumbs up" signal from the mount crate. We use a UDP packet with no particular data content that we call the heartbeat packet. The mount computer sends this heartbeat packet in response to every data update message it receives from the interlock crate.

In a properly operating system, things work like this: the interlock crate sends a data packet to the mount crate every tenth of a second; the mount crate answers each data packet with a heartbeat packet; and each heartbeat the interlock crate receives causes it to reset the watchdog, so the watchdog never expires and the drives stay enabled.
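
A minimal sketch of the interlock-crate side of this cycle, again with the port, the packet contents, and the watchdog interface as placeholder assumptions:

/* Hypothetical sketch of the interlock-crate side of the normal cycle:
 * send a status packet roughly every 100 ms and reset the hardware
 * watchdog whenever a heartbeat reply comes back from the mount crate.
 * The port, the packet contents, and watchdog_reset() are placeholders. */
#include <string.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

#define MOUNT_IP    "128.196.100.228"
#define STATUS_PORT 5800                          /* assumed UDP port */

static void watchdog_reset(void)
{
    /* placeholder: the real routine strobes the hardware watchdog that
     * powers the drive-enable relays */
}

int main(void)
{
    int sock = socket(AF_INET, SOCK_DGRAM, 0);
    struct sockaddr_in mount = { .sin_family = AF_INET,
                                 .sin_port   = htons(STATUS_PORT) };
    inet_pton(AF_INET, MOUNT_IP, &mount.sin_addr);

    char status[64];
    memset(status, 0, sizeof status);             /* stand-in for bits + ADC values */

    for (;;) {
        /* the 10 Hz data update to the mount crate */
        sendto(sock, status, sizeof status, 0,
               (struct sockaddr *)&mount, sizeof mount);

        /* wait up to 100 ms (roughly one cycle) for the heartbeat reply */
        fd_set rd;
        FD_ZERO(&rd);
        FD_SET(sock, &rd);
        struct timeval tv = { 0, 100000 };
        if (select(sock + 1, &rd, NULL, NULL, &tv) > 0) {
            char hb[8];
            recv(sock, hb, sizeof hb, 0);         /* heartbeat carries no real data */
            watchdog_reset();                     /* keep the drive relays powered */
        }
        /* if heartbeats stop, nothing resets the watchdog; it times out
         * after about 2 seconds and opens the drive-enable circuit */
    }
}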

When things go wrong

It is vital to note that interpreting event log information during a communication failure is highly uncertain. Information may or may not be getting to the event log from either the mount or interlock crates. Either crate may be performing actions for which the usual messages are not making their way to the logs.

It is also important to note that there is uncertainty about the timestamps in the event log. We have three computers participating, with unsynchronized clocks. We do not know what network latencies may exist between a log message being sent and its being entered in the log. In any event, the times logged are the time on the computer running the log server (typically hacksaw) when the message is received and entered into the log, not the time when the event occurred.

The mount crate has a 0.5 second timeout running while waiting for data packets. If this timeout expires, it means that 5 consecutive data messages have not been received, and the mount emits the message "Interlock comm failure". It would like to kill the drives, but does not expect to be able to if communication has failed. When and if it again receives a valid data packet, it will kill the drives (this cleans up a bunch of state in the mount code). It does not send heartbeat signals while it is not receiving data from the interlock crate. Only the interlock crate has the ability to kill the drives in the event of a communication failure. It will do so when it stops receiving heartbeat signals, and in any case the watchdog will expire and kill the drives on its own.
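
Sketched in the same spirit, the mount-side behavior described above might look like the following; the helper functions are assumed stand-ins for the real mount code, and only the logic follows the text.

/* Hypothetical sketch of the mount-side handling of a comm failure.
 * The helpers are assumed stand-ins; the behavior follows the text:
 * a 0.5 s timeout (5 missed 10 Hz updates) logs "Interlock comm
 * failure", no heartbeats are sent while data is missing, and the
 * drives are killed once data from the interlock crate returns. */
#include <stdbool.h>

extern bool wait_for_status_packet(int timeout_ms);  /* true if a packet arrived */
extern void send_heartbeat(void);                    /* UDP reply to the interlock */
extern void kill_drives(void);
extern void log_event(const char *msg);

void mount_receive_loop(void)
{
    bool comm_failed = false;

    for (;;) {
        if (!wait_for_status_packet(500)) {          /* 5 missed data messages */
            if (!comm_failed)
                log_event("Interlock comm failure");
            comm_failed = true;                      /* and send no heartbeats */
            continue;
        }
        if (comm_failed) {
            kill_drives();              /* cleans up mount-side state on recovery */
            comm_failed = false;
        }
        send_heartbeat();               /* answer every data packet we receive */
    }
}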

The interlock crate also has a 0.5 second timeout running while waiting for heartbeat packets. Once 10 of these timeouts have occurred (after 5 seconds), it announces "lost heartbeat, killing drives". By this time, though, the watchdog has already killed the drives, roughly 3 seconds earlier.
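
And here is the corresponding interlock-side monitor, with the same caveat that the helper names are assumptions (note that in practice the hardware watchdog will already have dropped the drives about 2 seconds into the outage, well before this software path fires).

/* Hypothetical sketch of the interlock-side heartbeat monitor.  The
 * numbers (0.5 s timeout, 10 misses) come from the text; the helpers
 * are assumed stand-ins.  The hardware watchdog acts independently. */
extern int  wait_for_heartbeat(int timeout_ms);      /* nonzero if one arrived */
extern void kill_drives(void);
extern void log_event(const char *msg);

void interlock_heartbeat_monitor(void)
{
    int missed = 0;

    for (;;) {
        if (wait_for_heartbeat(500)) {
            /* a recovery here is what produces the "End of heartbeat
             * outage after N missing" entry seen in the 6-8-2012 log */
            missed = 0;
        } else if (++missed == 10) {                 /* 5 seconds of silence */
            log_event("lost heartbeat, killing drives");
            kill_drives();
        }
    }
}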

In practice, what we see is that the watchdog kills the drives in about 2 seconds. The interlock notices that the drive-enable signal has been lost and confirms this killing of the drives (keeping software state consistent, as it does for all ready-chain situations). If communication is ever restored, the mount will also command a kill of the drives as the last participant in this beating of a dead horse.

Heartbeat Loss 5-31-2012 (night of 5-30-2012)

Thu May 31, 2012 03:25:54.793
MOUNT (128.196.100.228)
Interlock comm failure

Thu May 31, 2012 03:25:54.6210
INTERLOCK (128.196.100.26)
Chain kills alt drives
outbits: 111100000000000000001010000000000000000000000000
inbits: 111111111111111000010001011111101111111111000110100110110000111100011011000000000001100001111000
 
Thu May 31, 2012 03:25:57.5127
INTERLOCK (128.196.100.26)
lost heartbeat, killing drives
 
Thu May 31, 2012 03:26:08.775
MOUNT (128.196.100.228)
Killed ALT drive due to: DAC out of range

Thu May 31, 2012 03:26:24.753
MOUNT (128.196.100.228)
Rotator drives shutting down
NE head: 70.18997
 
Thu May 31, 2012 03:26:24.1054
MOUNT (128.196.100.228)
F check (Excess PERR) killing DEROT axis

Here we have event log messages from both the mount and interlock computers. The watchdog kills the drives at about the same time that the mount realizes that it is not getting data from the interlock computer.

The "DAC out of range" message (14 seconds later) is due to servo windup after the amplifier has been switched off and the drives killed. The mount computer has no idea that the servo amplifier is turned off and is still trying to control the telescope.

It is curious that the Rotator drives do not get killed by the watchdog.

I intend to make a code change where the mount will do a drive kill when it first detects a "comm failure". This will avoid the servo windup when the watchdog turns off the amplifiers.
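
In terms of the mount-side sketch above, the change would amount to something like this (again just an illustration under the same assumptions, not the actual code):

/* Hypothetical form of the proposed change: kill the drives as soon as
 * the comm failure is first detected instead of waiting for data from
 * the interlock crate to return. */
#include <stdbool.h>

extern bool wait_for_status_packet(int timeout_ms);
extern void send_heartbeat(void);
extern void kill_drives(void);
extern void log_event(const char *msg);

void mount_receive_loop(void)
{
    bool comm_failed = false;

    for (;;) {
        if (!wait_for_status_packet(500)) {
            if (!comm_failed) {
                log_event("Interlock comm failure");
                kill_drives();     /* new: zero the servos right away so they
                                      cannot wind up against dead amplifiers */
            }
            comm_failed = true;
            continue;
        }
        comm_failed = false;
        send_heartbeat();
    }
}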

The operators' log indicates that the GUIs "went white" and were not updating. Skip says that he could not ping the mount. Only the mount was rebooted.

It is worthy of note that the mount was able to make event log entries during the start of the problem, and again some 32 seconds later, but eventually was entirely unresponsive.

Heartbeat Loss 6-8-2012 (night of 6-8-2012)

The first thing we see in the event log is that the interlock crate notices that the watchdog has killed the drives.

Fri Jun 08, 2012 20:37:10.4950
INTERLOCK (128.196.100.26)
Chain kills alt drives
outbits: 111100000000000000001010000000000000000000000000
inbits: 111111111111111000010001011111101111111111000110011010110000111100111111000000000001101001111000
 
Fri Jun 08, 2012 20:37:13.3995
INTERLOCK (128.196.100.26)
lost heartbeat, killing drives

....
....
....

Fri Jun 08, 2012 20:53:37.3297
INTERLOCK (128.196.100.26)
End of heartbeat outage after 1965 missing (interlock)

Here we have just the opposite situation: no messages are making it to the event log from the mount crate, and we have only the interlock crate's side of the story. It was necessary to reboot the mount crate to recover from whatever went wrong (which included the GUI interfaces whiting out). We believe it was not responding to pings prior to the reboot, but are not certain. The interlock crate apparently recovered by itself after waiting out 1965 timeouts (1965 times 0.5 seconds is about 982 seconds, or 16.4 minutes, which agrees with the gap between the "lost heartbeat" and "End of heartbeat outage" entries above).