==============
Keywords: software autoreboot, autorebooting, auto-reboot, auto-rebooting, auto rebooting
Watchdog is a program that can reboot your server automatically in many failure cases.
It has been used successfully to reboot servers hit by the "Unexplained Crash" problem, which can be caused by disk queue starvation or by a quota/ext3 filesystem deadlock, and which crashes the server randomly and repeatedly. If downtime from crashes is a problem on your system, watchdog can probably give you your peace of mind back.
This works on any Linux system, whatever the control panel: Ensim, Plesk, CPanel, etc.
As documented in /usr/src/[your-linux-kernel]/Documentation/watchdog.txt, the kernel provides watchdog timer interfaces through a device named /dev/watchdog, "which when open must be written to within a timeout or the machine will reboot. Each write delays the reboot time another timeout. In the case of the software watchdog the ability to reboot will depend on the state of the machines and interrupts. The hardware boards physically pull the machine down off their own onboard timers and will reboot from almost anything." The default timeout is 60 seconds.
The watchdog program simply uses the /dev/watchdog device. It activates the softdog module on your system, if your kernel supports it, and writes to /dev/watchdog at intervals of no more than 10 seconds, while making several other (configurable) checks on your system. If your system crashes, or watchdog stops working, or for any other reason watchdog fails to write to that device for 60 seconds while the kernel is still alive, the kernel will reboot the machine.
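To make the mechanism concrete, here is a minimal sketch of what "writing to the device within the timeout" looks like. The function name pet_watchdog and its three arguments are my own invention for illustration, and the real watchdog daemon does much more than this. Try it only against a scratch file: pointed at the real /dev/watchdog it arms the timer, and the machine may reboot one timeout after the loop ends.

```shell
# Hypothetical helper, for illustration only (not part of the watchdog package).
# Usage: pet_watchdog DEVICE INTERVAL COUNT
# Opens DEVICE once, writes one byte every INTERVAL seconds, COUNT times.
pet_watchdog() {
    dev=$1
    interval=$2
    count=$3
    i=0
    while [ "$i" -lt "$count" ]; do
        printf '.'              # each write postpones the reboot by one timeout
        i=$((i + 1))
        sleep "$interval"
    done > "$dev"               # the redirection keeps DEVICE open for the whole loop
}

# Safe dry run against a scratch file:
# pet_watchdog /tmp/fake-watchdog 1 5
```

The interval must stay below the timer margin (60 seconds by default), which is why the daemon writes much more often than that.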
I know that the following RedHat kernels already come with support for the softdog module:
2.4.18-27.7.x
2.4.20-19.7
2.4.20-24.7
2.4.20-27.7
2.4.20-28.7
2.4.21-27.ELsmp (RHEL3)
The major distros already come with softdog module support. If you don't use any of the above kernels, check whether your version/distro comes with softdog module support by running "modprobe softdog" and then "lsmod|grep softdog". If the module shows up, quickly execute "rmmod softdog", so that your server does not reboot automatically. If it is not supported, you must compile a kernel with watchdog support, setting these parameters:
CONFIG_WATCHDOG=y
CONFIG_SOFT_WATCHDOG=m
Refer to "Kernel compile HowTo" to compile a new kernel for your system.
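Before compiling anything, it may be worth checking whether the running kernel was already built with these parameters. The sketch below assumes your distro ships the kernel configuration as a text file; the helper name kernel_has_softdog and the example path are assumptions of mine, so adjust them to your system.

```shell
# Hypothetical helper: succeeds if the given kernel config file enables
# the software watchdog, either built in (y) or as a module (m).
kernel_has_softdog() {
    grep -Eq '^CONFIG_SOFT_WATCHDOG=(y|m)$' "$1"
}

# Example call (the path is an assumption, not every distro puts it there):
# kernel_has_softdog "/boot/config-$(uname -r)" && echo "softdog supported"
```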
Installation
============
In general, to install watchdog it is enough to download it, install it, and change a few parameters in /etc/watchdog.conf. It's very simple. But in *NO* way experiment with watchdog!!! You can have a bad experience and need to restore your server. Only proceed if you know what you are doing! Be advised. I'm an experienced network administrator (20 years in IT, 11 years with hosting), and despite my experience, it cost me 2 (two) restores with EV1 to learn this.
Always check your backups *before* installing watchdog.
Download:
=========
If you are using Ensim, download from: http://rpm.pbone.net/
# wget ftp://ftp.pbone.net/mirror/dag.wieers.com...g.rh73.i386.rpm (several other versions are available in the dag repository)
Run your rpm:
# rpm -ivh watchdog-5.2-5.dag.rh73.i386.rpm
FIRST IMPORTANT THING TO DO: Disable auto-start of watchdog (the reason is explained below):
(This is for RedHat-like distros. Check how to do it on other distros.)
# chkconfig watchdog off
Configuration:
==============
Softdog is auto-loaded by watchdog, so you don't need to do anything for it.
You need at least to edit /etc/watchdog.conf and uncomment the following lines:
Uncomment:
=================================
#file = /var/log/messages
#watchdog-device = /dev/watchdog
=================================
Turning them into:
=================================
file = /var/log/messages
watchdog-device = /dev/watchdog
=================================
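If you prefer not to edit the file by hand, the same change can be sketched with sed. The function name uncomment_watchdog_conf is hypothetical, and I assume the two lines appear in your watchdog.conf commented exactly as shown above (spaces or tabs around the "=" are tolerated). A backup copy is kept next to the file.

```shell
# Hypothetical helper: uncomment the two lines in a watchdog.conf-style file.
# Usage: uncomment_watchdog_conf /etc/watchdog.conf
uncomment_watchdog_conf() {
    conf=$1
    # Keep a backup so the change is easy to undo.
    cp -p "$conf" "$conf.bak" || return 1
    # Strip the leading '#' from exactly these two settings, nothing else.
    sed -e 's|^#[[:space:]]*\(file[[:space:]]*=[[:space:]]*/var/log/messages\)|\1|' \
        -e 's|^#[[:space:]]*\(watchdog-device[[:space:]]*=\)|\1|' \
        "$conf.bak" > "$conf"
}
```

Compare the result with the backup (diff /etc/watchdog.conf.bak /etc/watchdog.conf) before starting the daemon.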
You can adjust any other settings to your taste. Also check the file '/etc/sysconfig/watchdog' (on RedHat-like distros) for the startup / command line options of watchdog (for example, mine is OPTIONS="-v -b", for verbose logging and soft reboot).
Create the watchdog device:
# mknod /dev/watchdog c 10 130
Check that it really exists:
# ls -alF /dev/watchdog
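The ls output should show a character device with major 10 and minor 130. If you want to script that check, here is a small sketch. The helper name check_chardev is mine, and it assumes the GNU version of stat (which prints the device numbers in hexadecimal with %t and %T); older systems may not have it.

```shell
# Hypothetical helper: succeeds if PATH is a character device with the
# given major and minor numbers (decimal). Assumes GNU stat.
# Usage: check_chardev /dev/watchdog 10 130
check_chardev() {
    path=$1
    want_major=$2
    want_minor=$3
    [ -c "$path" ] || return 1
    # GNU stat reports the device numbers in hex, so convert them to decimal.
    major=$((0x$(stat -c '%t' "$path")))
    minor=$((0x$(stat -c '%T' "$path")))
    [ "$major" -eq "$want_major" ] && [ "$minor" -eq "$want_minor" ]
}
```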
If it is ok, execute the following:
# service watchdog start
Watchdog is now running.
Check your /var/log/messages for lines like the following:
Jan 13 15:06:13 ensim kernel: Software Watchdog Timer: 0.05, timer margin: 60 sec
Jan 13 15:06:13 ensim kernel: pcwd: v1.13 (03/06/2002) Ken Hollis (kenji@bitgate.com)
Jan 13 15:06:13 ensim kernel: pcwd: No card detected, or port not available
Jan 13 15:06:13 ensim kernel: WDT driver for Acquire single board computer initialising.
Jan 13 15:06:13 ensim watchdog: watchdog startup succeeded
Jan 13 15:06:13 ensim watchdog[3130]: starting daemon (5.1): (... long line with options...)
If so, everything is all right.
Next, test watchdog by making it reboot your server:
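The log check can also be scripted. This is only a sketch: the helper name watchdog_started is hypothetical, and it assumes the daemon logs a "starting daemon" line in the same format as in my log above.

```shell
# Hypothetical helper: succeeds if the log file contains a watchdog
# "starting daemon" line like the one shown above.
# Usage: watchdog_started /var/log/messages
watchdog_started() {
    grep -q 'watchdog\[[0-9]*]: starting daemon' "$1"
}
```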
# service watchdog stop
(NOTICE 1: This is not a true shutdown/reboot procedure! The kernel will do a hard reboot here. So consider the consequences if any program is writing to your disk. If you are worried about it, close all processes first by shutting down all daemons beforehand. It is the kernel that reboots your machine, not the watchdog daemon program.)
In 60 seconds your system will reboot. If not, check whether the module is loaded, with the "lsmod" command. On more modern systems, like some newer versions of RedHat-like distros, stopping or restarting watchdog through the "service" command does nothing. This makes working with watchdog much safer. If that is your case, reboot your server manually. Your system should come back in the normal time (nearly two minutes. Pray!)
If your system is ok, start watchdog again (service watchdog restart), and you can add a line to the end of the file /etc/rc.d/rc.local:
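For the lsmod check, a plain "lsmod|grep softdog" is enough by eye. In a script, matching the module name exactly is a little safer. A sketch follows; the helper name has_softdog is mine, and it simply reads lsmod output on standard input.

```shell
# Hypothetical helper: reads lsmod output on stdin and succeeds if a
# module named exactly "softdog" is listed in the first column.
has_softdog() {
    awk '$1 == "softdog" { found = 1 } END { exit !found }'
}

# Example call:
# lsmod | has_softdog && echo "softdog is loaded"
```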
# echo "/sbin/service watchdog restart" >> /etc/rc.d/rc.local
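If you run that echo twice, the line ends up in rc.local twice. A small sketch that appends it only when it is missing (the helper name add_line_once is hypothetical):

```shell
# Hypothetical helper: append LINE to FILE unless FILE already has
# exactly that line.
# Usage: add_line_once /etc/rc.d/rc.local "/sbin/service watchdog restart"
add_line_once() {
    file=$1
    line=$2
    grep -qxF -- "$line" "$file" 2>/dev/null || printf '%s\n' "$line" >> "$file"
}
```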
If you want to, test watchdog again:
# service watchdog stop
If the reboot is ok, you are already protected.
If it does not reboot properly, ask EV1 for a reboot in single user mode, or with a different kernel, and undo the changes yourself. The main thing is to remove the device /dev/watchdog, with the "rm -f /dev/watchdog" command. When the computer starts and finds no /dev/watchdog, softdog does nothing, and your server stops rebooting from the next boot on.
!!! CAUTION !!! CAUTION !!! CAUTION !!!
1) Never, never, never use chkconfig to make watchdog start automatically at the next boot. RedHat kills processes when changing runlevels, and when it kills watchdog your system will reboot eternally, needing a restore from EV1. Don't experiment with watchdog on a real production machine.
2) Following the steps above rigorously worked for me, and I think it can work for you. But I cannot guarantee anything, so you are ultimately responsible for following them.
!!! CAUTION !!! CAUTION !!! CAUTION !!!
The steps above are safe. But before you install a new kernel, NEVER forget to drop the watchdog line from your /etc/rc.d/rc.local file (comment it out). Test watchdog again on the new system, as shown before, but without starting it automatically at the next boot, that is, with the watchdog start line in rc.local commented out. If there is any problem, simply ask EV1 for a reboot and all is ok again, letting you find out what failed: the kernel not supporting watchdog, an installation problem, etc.
Some successful and unsuccessful (mine) installations of watchdog are described in:
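Commenting out the rc.local line before a kernel upgrade can be sketched like this. The helper name disable_watchdog_line is hypothetical, and it assumes the line in your rc.local starts with "/sbin/service watchdog", as in the echo command shown earlier. A backup copy is kept.

```shell
# Hypothetical helper: comment out the watchdog start line in an
# rc.local-style file, keeping a backup as FILE.bak.
# Usage: disable_watchdog_line /etc/rc.d/rc.local
disable_watchdog_line() {
    file=$1
    cp -p "$file" "$file.bak" || return 1
    # Prefix the matching line with '#'; every other line is left alone.
    sed -e 's|^[[:space:]]*/sbin/service watchdog |#&|' "$file.bak" > "$file"
}
```

After the new kernel has passed the manual test, remove the leading "#" again to restore the automatic start.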
<http://forum.ev1servers.net/showthread.php...?threadid=33908>
I don't understand why EV1/ThePlanet does not offer servers with watchdog/softdog installed by default. It would save their staff an enormous amount of time.
Enjoy it.
Regards,
Netino