Power Management on Bigbro and Remote

***** Intro

Bigbro, Remote and the switch are all powered via an American Power Conversion (APC) Back-UPS 650 UPS. Bigbro communicates with the UPS over a serial cable to determine when there is a power failure. Remote and Bigbro communicate via TCP/IP. When power fails, both machines shut down.

***** Cable Between UPS and Bigbro

The cable connecting Bigbro and the UPS is not a standard 9-pin serial cable. In the APCUPSD manual it's referred to as a "SIMPLE" cable.

Computer - DB9 Female
  DTR - PIN 4 - 4.7K Ohm connected to PIN 8 on Female and PIN 5 on Male (UPS)
  CTS - PIN 8 - Low Battery
  GND - PIN 5 - Ground
  DCD - PIN 1 - On Battery
  RTS - PIN 7 - Kill UPS Power

UPS - DB9 Male
  CTS - PIN 5 - Low Battery
  GND - PIN 4 - Ground
  DCD - PIN 2 - On Battery
  RTS - PIN 1 - Kill UPS Power

The only unusual part is that Pin 4 and Pin 8 of the Female are hooked together via a 4.7K Ohm resistor. Look in apcupsd.pdf for details.

******* Configuring Bigbro

Main config is in /etc/apcupsd/apcupsd.conf. Edit it before firing up the daemon.

UPSCABLE simple - We've got a "simple" type cable as we have a basic UPS.
UPSTYPE backups - A fairly dumb UPS that just reports when it's using batteries and when those batteries are close to flat.
ANNOY 60 - Pester users every 60 seconds when running on battery.
ANNOYDELAY 30 - Wait 30 seconds after batteries switch on before the first warning.
TIMEOUT 90 - Turn the computer off 90 seconds after we switch to battery power. Ignore MINUTES and BATTERYLEVEL as neither works for BASIC_UPS.
NOLOGON disable - Don't stop users logging in even though we're on emergency power.
NETSERVER off - Don't want to open up a port we don't need.
EVENTSFILE /var/log/apcupsd.events - This is where we'll learn about apcupsd's day.
STATTIME 5 - Write to the status file every 5 seconds.
STATFILE /var/log/apcupsd.status - Where status is reported.

The rest of the config is for networking.

UPSCLASS netmaster - Would be standalone if we weren't controlling Remote.
UPSMODE net - Disable if standing alone.
NETTIME 10 - Report to remote.biophys.cornell.edu every 10 seconds.
NETPORT 6663 - We're using port 6663 to talk to Remote.
SLAVE remote.biophys.cornell.edu

Checked Bigbro's firewall to make sure the two machines can communicate on that port. Finally, added remote.biophys.cornell.edu to /etc/hosts so that reaching Remote doesn't depend on name resolution.

Initializing and Stopping Daemon

The script for starting, stopping and checking the status of /sbin/apcupsd is /etc/rc.d/apcupsd. Included it in /etc/rc.d/rc5.d and rc3.d via SuSE Control Centre. APCUPSD comes up after the network so it can talk to remote.biophys.cornell.edu.

Status File

Every 5 seconds /var/log/apcupsd.status is updated with the status of the UPS.

Event Log File

/var/log/apcupsd.events lists the last 50 great things to happen to apcupsd.

Scripts Controlling Shutdown

When a shutdown is called for, apcupsd uses the script /etc/apcupsd/apccontrol. There's a special "safe" version, apccontrol.safe, that won't actually halt the PC. It's great for testing whether your config works. You just copy apccontrol to apccontrol.real and rename the safe version to apccontrol. Now you can pull the plug on the UPS and have a fake "shutdown" to show the system works well.

I've edited Bigbro's apccontrol script to insert a 30 second delay on the halt, just to give Remote a fighting chance of shutting down. The command is "sleep 30" and sits just after the line "doshutdown)". See page 60 of apcupsd.pdf for details.

******** Configuring Remote

Main config is in /etc/apcupsd/apcupsd.conf.

UPSCABLE ether
UPSTYPE backups
ANNOY 20 - Wait 20 seconds between messages telling users to hop off.
ANNOYDELAY 5 - Wait 5 seconds on batteries before warning everyone that the power's failed.
TIMEOUT 40 - Turn Remote off after 40 seconds on battery power.
NETSERVER off - Don't want port 7000 open, but I turned it on while configuring so /sbin/apcaccess could check the status of Remote.
STATTIME, STATFILE /var/log/apcupsd.status - Status reporting doesn't seem to work. Don't know why.

The networking section is vital.

UPSCLASS netslave
UPSMODE net
NETACCESS true - Don't quite know what this does.
NETTIME 10 - Hear from bigbro every 10 seconds. Don't think this matters for the slave.
NETPORT 6663 - TCP port to listen to bigbro on.
MASTER bigbro.biophys.cornell.edu
USERMAGIC remote_magic - This is passed between bigbro and remote so they can recognize each other as they communicate. Any name will do.

Couldn't get the status file, /var/log/apcupsd.status, to work, so set NETSERVER on and used the program /sbin/apcaccess to check.

***************** Restarting Bigbro

1. Check the Back-UPS 650 is turned on.
2. Check the network switch is on.
3. Press the POWER button on the front of bigbro.
4. Set the KVM switch to channel 2 to watch Bigbro's boot script. It should take a couple of minutes to get to a login screen.
5. Once bigbro has rebooted, check you can read your email, and try to send yourself a test email from outside our subnet (e.g. bounce it via your @cornell.edu address back to bigbro).
6. Check to see if the web page has come up.

Potential Problems

1. If there is no access outside our subnet (firewall broken) then Bigbro may take quite a while to boot.
2. If bigbro shut down unexpectedly, it may have errors on the hard drives. It will scan and attempt to solve any problems itself, but if there are too many problems it will choke and hand control over to you. Here's the official word:

"Most of the time, any file system problems are minor ones caused by file buffers not being written to the disk, such as deleted inodes still marked in use. In the majority of cases, the file system check will be able to detect and repair such anomalies automatically, and upon completion the Linux boot process will continue normally.
Should a file system problem be more severe (such problems tend to be caused by faulty hardware such as a bad hard drive or memory chip; something to keep in mind should file system corruption happen frequently), the file system check may not be able to repair the problem automatically. This is usually, but not always, the case when the root file system itself is corrupted. In this case, the Red Hat boot process will display an error message and drop you into a shell, allowing you to attempt file system repairs manually. As the recovery shell unmounts all file systems, and then mounts the root file system "read-only", you will be able to perform full file system checks using the appropriate utilities. Likely you will be able to run e2fsck on the corrupted file system(s) which should hopefully resolve all the problems found."

***************** Restarting Remote

1. Check the Back-UPS 650 is turned on.
2. Check the network switch is on.
3. Press the POWER button on the front of remote.
4. Set the KVM switch to channel 4 to watch Remote's boot script. It should take a couple of minutes to get to a login screen.
5. Check you can log in and that's that.

Potential Problems

1. If there is no access outside our subnet (firewall broken) then Bigbro may take quite a while to boot.
2. If bigbro shut down unexpectedly, it may have errors on the hard drives. It will scan and attempt to solve any problems itself, but if there are too many problems it will choke and hand control over to you. Here's the official word:

"Most of the time, any file system problems are minor ones caused by file buffers not being written to the disk, such as deleted inodes still marked in use. In the majority of cases, the file system check will be able to detect and repair such anomalies automatically, and upon completion the Linux boot process will continue normally.
Should a file system problem be more severe (such problems tend to be caused by faulty hardware such as a bad hard drive or memory chip; something to keep in mind should file system corruption happen frequently), the file system check may not be able to repair the problem automatically. This is usually, but not always, the case when the root file system itself is corrupted. In this case, the Red Hat boot process will display an error message and drop you into a shell, allowing you to attempt file system repairs manually. As the recovery shell unmounts all file systems, and then mounts the root file system "read-only", you will be able to perform full file system checks using the appropriate utilities. Likely you will be able to run e2fsck on the corrupted file system(s) which should hopefully resolve all the problems found."

******* Implementation Log

Installed version apcupsd-3.8.5 from the SuSE 8.1 CDs.

Looked in /etc/rc.d/halt to see if the APC daemon stuff was there. It actually seems to have gone into /etc/rc.d/halt.local.

The status file is supposed to be going out of fashion. Instead, you're supposed to check the status of APCUPSD via a TCP/IP port. Here's how it's supposed to work. In /etc/apcupsd/apcupsd.conf you set

  NETSTATUS on
  STATUSPORT 7000

This fires up a daemon, /sbin/apcnisd, which listens on STATUSPORT. You can talk to it with the executable /sbin/apcaccess. This works. However, the way you want to do it is via /etc/inetd.conf. I tried this and couldn't get it to work.

Tried to use the NETSTATUS system via inetd. Turned off NETSTATUS and added a line to /etc/services for apcnisd on port 7000. Added a line to /etc/inetd.conf and reloaded. Also edited hosts.allow. See page 68 of apcupsd.pdf. Gave strange and consistent error messages.

There's a program called apctest that lets you try out your cable (see page 64 of apcupsd.pdf). Our cable works.

***** References

APCUPSD manual - apcupsd.pdf
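
The networking settings listed under Configuring Bigbro and Configuring Remote only work if the two config files agree with each other (same NETPORT, netmaster on one side and netslave on the other). Below is a minimal sketch of a consistency check. It is not part of apcupsd: the function names parse_conf and check_pair are hypothetical, the sample text contains only the networking directives quoted in these notes, and the parser assumes one "DIRECTIVE value" pair per line with # comments, which simplifies the real file format.

```python
def parse_conf(text):
    """Parse 'DIRECTIVE value' lines, skipping blank lines and # comments.

    Simplification: a repeated directive (e.g. several SLAVE lines)
    keeps only its last value.
    """
    settings = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        parts = line.split(None, 1)
        key = parts[0].upper()
        value = parts[1].strip() if len(parts) > 1 else ""
        settings[key] = value
    return settings


def check_pair(master, slave):
    """Return a list of human-readable problems; empty means the pair agrees."""
    problems = []
    if master.get("UPSCLASS") != "netmaster":
        problems.append("master: UPSCLASS should be netmaster")
    if slave.get("UPSCLASS") != "netslave":
        problems.append("slave: UPSCLASS should be netslave")
    for name, conf in (("master", master), ("slave", slave)):
        if conf.get("UPSMODE") != "net":
            problems.append(name + ": UPSMODE should be net")
    if master.get("NETPORT") != slave.get("NETPORT"):
        problems.append("NETPORT differs between master and slave")
    if not master.get("SLAVE"):
        problems.append("master: no SLAVE host listed")
    if not slave.get("MASTER"):
        problems.append("slave: no MASTER host listed")
    return problems


# Networking directives as quoted in the notes above.
BIGBRO = """
UPSCABLE simple
UPSCLASS netmaster
UPSMODE net
NETTIME 10
NETPORT 6663
SLAVE remote.biophys.cornell.edu
"""

REMOTE = """
UPSCABLE ether
UPSCLASS netslave
UPSMODE net
NETTIME 10
NETPORT 6663
MASTER bigbro.biophys.cornell.edu
USERMAGIC remote_magic
"""

problems = check_pair(parse_conf(BIGBRO), parse_conf(REMOTE))
print(problems or "master and slave settings agree")
```

To check the live machines, read each host's /etc/apcupsd/apcupsd.conf into a string and pass the two parsed results to check_pair in place of the samples.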