gloomy
May 13 2005, 02:19 AM
I decided to post it here before contacting the support.
I administer a dedicated FreeBSD server (5.4-STABLE).
At unpredictable times I start getting 'watchdog timeout' errors in [/var/log/messages]. At that point the network becomes unreachable.
The only way to solve the problem is to hard-reboot the server using the web interface for the master reboot switch.
The server has almost no network or CPU/IO load. The network adapter is manufactured by RealTek.
Browsing the forums, I figured out that the problem is usually NIC/IRQ related. The commonly suggested solutions are:
a) turn ACPI support OFF
b) turn Plug and Play OS support OFF [in the BIOS]
c) replace the NIC with another brand, such as 3Com or Intel.
I disabled ACPI (hint.acpi.0.disabled="1" in [/boot/device.hints]), and that did not help.
I have no way of turning off the PnP OS option or resetting the IRQ settings in the BIOS.
And I have no idea whether changing the NIC is even possible.
Any ideas?
jbyers
May 14 2005, 11:49 AM
QUOTE (gloomy)
"The server has almost no network or CPU/IO load. The network adapter is manufactured by RealTek."
What type of server do you have? Most servers come with on-board Intel Gigabit network adapters, not RealTek (with the exception of SG360 firewall cards).
Are these "watchdog timeouts" from the NIC itself or from the link/cable? What are the contents of a tcpdump trace prior to the system "crash"?
gloomy
May 15 2005, 12:23 PM
Hola,
QUOTE (jbyers)
What type of server do you have? Most servers come with on-board Intel Gigabit network adapters, not RealTek (with the exception of SG360 firewall cards).
The server is dedicated, running FreeBSD (1.3 GHz Celeron, 512 MB RAM).
Here is what dmesg says about the NIC:
rl0: port 0xec00-0xecff mem 0xd4400000-0xd44000ff irq 11 at device 14.0 on pci0
So it is not integrated.
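The probe line carries everything needed to reason about the IRQ question: the driver instance (`rl0`), the interrupt (`irq 11`) and the PCI address (`device 14.0 on pci0`). A minimal Python sketch that extracts those fields, assuming the line format quoted in this thread; `parse_probe` is a hypothetical helper, not part of FreeBSD's tooling, and it tolerates a device description in angle brackets being present or missing.

```python
import re
from typing import Optional

# Assumed format, as quoted in this thread:
# "rl0: port 0xec00-0xecff mem 0xd4400000-0xd44000ff irq 11 at device 14.0 on pci0"
# Anything between "rl0:" and "irq" (ports, memory, a <description>) is skipped.
PROBE_RE = re.compile(
    r"^(?P<dev>[a-z]+\d+):.*?\birq (?P<irq>\d+) "
    r"at device (?P<slot>\d+)\.(?P<func>\d+) on (?P<bus>pci\d+)"
)


def parse_probe(line: str) -> Optional[dict]:
    """Return the device name, IRQ and PCI location from a probe line, or None."""
    m = PROBE_RE.search(line)
    if m is None:
        return None
    return {
        "device": m["dev"],
        "irq": int(m["irq"]),
        "pci_slot": int(m["slot"]),
        "pci_function": int(m["func"]),
        "bus": m["bus"],
    }
```

Applied to kreely's probe line further down the thread, it reports irq 19 for the same port range, so the card was not on IRQ 11 on every box.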
QUOTE (jbyers)
Are these "watchdog timeouts" from the NIC itself or the link/cable?
From the log: May 13 03:30:05 server kernel: rl0: watchdog timeout
I guess this means that the error comes from the NIC.
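The `rl0:` prefix is the name of the driver instance that printed the message, which is why the line points at the NIC rather than the cable. A minimal Python sketch for pulling such lines out of [/var/log/messages] and counting them per interface, assuming the classic syslog layout shown here (`Mon DD HH:MM:SS host kernel: ...`); the function names are hypothetical.

```python
import re
from collections import Counter
from typing import Iterable, Optional, Tuple

# Assumed syslog layout: "May 13 03:30:05 server kernel: rl0: watchdog timeout"
WATCHDOG_RE = re.compile(
    r"^(?P<ts>[A-Z][a-z]{2} +\d{1,2} \d{2}:\d{2}:\d{2}) "
    r"(?P<host>\S+) kernel: (?P<iface>[a-z]+\d+): watchdog timeout\s*$"
)


def parse_watchdog(line: str) -> Optional[Tuple[str, str, str]]:
    """Return (timestamp, host, interface) for a watchdog line, else None."""
    m = WATCHDOG_RE.match(line)
    if m is None:
        return None
    return m["ts"], m["host"], m["iface"]


def count_watchdogs(lines: Iterable[str]) -> Counter:
    """Count watchdog timeouts per interface; other log lines are ignored."""
    counts: Counter = Counter()
    for line in lines:
        hit = parse_watchdog(line)
        if hit is not None:
            counts[hit[2]] += 1
    return counts
```

Fed the four log lines in kreely's excerpt later in the thread, `count_watchdogs` reports two timeouts for `rl0`.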
QUOTE (jbyers)
What are the contents of a tcpdump trace prior to the system "crash"?
The point is that when the problem occurs, the server doesn't crash; only the network stops working. I haven't been checking the TCP traffic, and I'm not sure I want to log all of it. I will try to find a way of capturing the last traffic before the errors.
One more thing: the NIC and the USB controller share the same IRQ 11. I guess this could be the problem. [irq11: rl0 uhci0+]
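One way to get "the last traffic before the errors" without logging everything is a bounded ring buffer: keep only the most recent N capture lines and snapshot them when a watchdog message appears. A minimal Python sketch, assuming a single merged stream of capture lines (for example text output from tcpdump) and syslog lines; the function name and the merged-stream setup are assumptions for illustration.

```python
from collections import deque
from typing import Deque, Iterable, List


def last_packets_before_timeout(events: Iterable[str], keep: int = 100) -> List[List[str]]:
    """For each watchdog timeout seen, return the last `keep` capture lines before it.

    Only the bounded deque is held in memory, so nothing is written out
    except these snapshots.
    """
    recent: Deque[str] = deque(maxlen=keep)  # oldest lines fall off automatically
    snapshots: List[List[str]] = []
    for line in events:
        if "watchdog timeout" in line:
            snapshots.append(list(recent))  # freeze the current window
        else:
            recent.append(line)
    return snapshots
```

The `maxlen` bound is the whole point of the design: memory use stays fixed no matter how long the box runs between failures.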
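To confirm whether the NIC still shares its interrupt after a BIOS or kernel change, the `vmstat -i` style row quoted above can be checked mechanically. A minimal Python sketch, assuming rows look like `irq11: rl0 uhci0+`, optionally followed by numeric counters; it treats more than one handler name, or a trailing `+` (assumed here to mark handlers that did not fit in the label), as a shared line. `shared_irqs` is a hypothetical helper.

```python
import re
from typing import Dict, Iterable, List

# Assumed row layout: "irq11: rl0 uhci0+" with optional numeric counters after the names.
ROW_RE = re.compile(r"^(?P<irq>irq\d+):\s*(?P<rest>.*)$")


def shared_irqs(rows: Iterable[str]) -> Dict[str, List[str]]:
    """Map each IRQ that appears to be shared to its handler names."""
    shared: Dict[str, List[str]] = {}
    for row in rows:
        m = ROW_RE.match(row.strip())
        if m is None:
            continue  # headers, the "Total" line, anything unexpected
        # Drop the numeric counters; what remains are handler names.
        names = [tok for tok in m["rest"].split() if not tok.isdigit()]
        # Assumption: a trailing "+" means more handlers than fit in the label.
        if len(names) > 1 or any(name.endswith("+") for name in names):
            shared[m["irq"]] = [name.rstrip("+") for name in names]
    return shared
```

For the row quoted above it returns `{"irq11": ["rl0", "uhci0"]}`.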
eztiger
May 17 2005, 02:53 AM
hey,
I had the same thing yesterday (I got my box on Friday), and in the course of troubleshooting I occasionally had to reboot using the master reboot switch (because the watchdog timeout would drop the network interface).
Eventually things got into such a state that the box wouldn't come back at all.
I submitted a trouble ticket asking for a reboot and for them to check on the box, and also asked if they could disable ACPI in the BIOS for me while they were there.
To my surprise, the support team responded in the trouble ticket that they had indeed changed the BIOS settings for me.
So, in summary, I was pleasantly surprised they'd go out of their way to do that, and I'd suggest you give it a go. It's been about 24 hours now and (touch wood) no more problems.
The support here has so far been nothing short of fantastic.
Kev
eztiger
May 17 2005, 02:55 AM
Sorry, a slight addition: I noticed you also said you have a USB port and the network controller sharing an IRQ.
I custom-recompiled my kernel when I upgraded from FreeBSD 5.3 --> 5.4 and disabled all the USB settings, so I don't have that problem.
I'd suggest you either recompile the kernel or ask support to also fiddle with the IRQ settings in the BIOS. Again, it seems a bit beyond their remit, but so far they've been a helpful, accommodating bunch.
Kev
gloomy
May 17 2005, 01:15 PM
QUOTE (eztiger)
I custom-recompiled my kernel when I upgraded from FreeBSD 5.3 --> 5.4 and disabled all the USB settings, so I don't have that problem.
Hola.
I said in the first post that I already had 5.4 (-; but that's OK. I will submit a trouble ticket and ask them to disable the USB controller in the BIOS and reset the IRQ pool.
*crosses fingers*
P.S. RealTek NICs suck in general, and paired with cheap motherboards they can cause trouble. Beware.
P.S. 2: Yes, paying little money for dedicated servers, we can only pray.
kreely
May 23 2005, 11:18 PM
It's the NIC.
May 23 23:31:01 comic su: tbelding to root on /dev/ttyp1
May 23 23:34:22 comic kernel: rl0: watchdog timeout
May 23 23:35:02 comic kernel: rl0: watchdog timeout
May 23 23:38:02 comic syslogd: kernel boot file is /boot/kernel/kernel
May 23 23:38:02 comic kernel: rl0: port 0xec00-0xecff mem 0xd3400000-0xd34000ff irq 19 at device 14.0 on pci0
Specific problem: under high load conditions (well, relative to normal web loads), the cards issue a 'watchdog timeout' and cease to respond to outside events.
Setting up an automatic script to monitor the network cards, then stop the card and restart it (ifconfig rl0 down, ifconfig rl0 up), doesn't fix the problem; only a hard reboot does.
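The monitor-and-bounce approach described above can be sketched as follows; per this report it did not bring the card back, so this only illustrates the technique that was tried. A minimal Python sketch, assuming the recent log lines are already available as a list; `bounce_if_wedged` and `recovery_commands` are hypothetical names, and the command runner is injectable so the logic can be exercised without touching a real interface.

```python
import subprocess
from typing import Callable, List, Sequence


def recovery_commands(iface: str) -> List[List[str]]:
    """The down/up cycle from the post, as argv lists for ifconfig."""
    return [["ifconfig", iface, "down"], ["ifconfig", iface, "up"]]


def bounce_if_wedged(
    log_lines: Sequence[str],
    iface: str,
    run: Callable[[List[str]], object] = subprocess.run,
) -> bool:
    """Cycle `iface` if a watchdog timeout for it appears in `log_lines`.

    Returns True if the cycle was attempted. According to the thread,
    this did NOT recover the rl0 card; only a hard reboot did.
    """
    needle = f"{iface}: watchdog timeout"
    if not any(needle in line for line in log_lines):
        return False
    for argv in recovery_commands(iface):
        run(argv)  # by default runs the real ifconfig command
    return True
```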
High load conditions, in this case, are 5 megabits per second and higher. An FTP transfer to a Red Hat box also hosted at EV1 servers hit 6.99 megabits per second and then locked up completely, after transferring only 9 megabytes.
A continuous 'wget' at 200k, however, works with no problems.
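A back-of-the-envelope check on those figures: at 6.99 Mbit/s, 9 MB is only about 10 seconds of transfer, and if "200k" means 200 KB/s (an assumption; wget reports its rate in bytes), the wget stays around 1.6 Mbit/s, well under the 5 Mbit/s threshold. A minimal Python sketch of that arithmetic, using decimal prefixes for a rough figure; the function names are illustrative only.

```python
def transfer_seconds(megabytes: float, megabits_per_s: float) -> float:
    """Seconds to move `megabytes` at `megabits_per_s` (8 bits per byte, same prefix)."""
    return megabytes * 8 / megabits_per_s


def kbytes_to_mbits(kbytes_per_s: float) -> float:
    """Convert KB/s to Mbit/s, assuming 1 KB = 1000 bytes for a rough estimate."""
    return kbytes_per_s * 8 / 1000


# The FTP case: 9 MB at 6.99 Mbit/s is roughly 10.3 seconds before the lockup.
ftp_seconds = transfer_seconds(9, 6.99)
# The wget case: 200 KB/s is about 1.6 Mbit/s.
wget_mbits = kbytes_to_mbits(200)
```

Using 1024-byte kilobytes instead nudges the wget figure to about 1.64 Mbit/s, which does not change the conclusion.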
kreely
gloomy
Jun 1 2005, 02:56 AM
Disabling ACPI did not help.
Dedicating IRQ for the NIC did not help.
Changing the hardware (NIC/motherboard) did help; I haven't seen the error for a week now. So 'watchdog timeouts' are mostly a hardware-related problem.
How nice it is when you don't have to reboot the server every day (-;
P.S. Thanks to the support team.