Nethead
Oct 2 2003, 09:09 PM
I have a 1.7GHz Celeron box running cPanel. Multiple times in the past month now this box has died without warning. Suddenly the box becomes unresponsive to all remote connections as if all services have died. The box however still is pingable.
Observations:
1. The server load is not high (< 1.0) leading up to the crash
2. All log files just abruptly end at the time the server becomes unresponsive with no preceding errors or signs of impending doom.
3. Logging open server sockets, running processes, and the like right before the crash shows nothing odd.
4. A reboot is required to get back online.
5. Since the start of these problems I have messed with DMA settings fearing that this was an issue. The server still eventually crashed.
6. A firewall is in place and thorough security audits on the box suggest no compromises.
I have seen similar threads on the forums here, but no consensus on the root cause of the problem. EV1Servers has swapped out the hard drive, tested the hard drive, and tested the ram, but no problems were found.
These failures occur at random intervals sometimes days and sometimes weeks apart. The server is fairly busy but logs I keep of server load don't show anything unusual.
EV1Servers refuses to do a motherboard swap on this box, instead insisting that I would need to move to a whole new box with new IPs, if anything. This is not a technically reasonable nor desirable solution.
So, if you have any experience with a problem like this and any clue on how I might go about tracking down the problem myself, please chime in.
Thanks!
Doobla
Oct 2 2003, 09:39 PM
While I am sorry for your fortune, I am glad to see that somebody else is having the same problem as I am and that it's not an Ensim thing.
I am having the same problem except that my crashes seem to be more regular than yours, usually happening once every 6-10 days. I convinced EV1 to replace my server in hopes that the problem was actually a hardware issue, but it was not, as within a week I was requesting a reboot again. Since my clients are frustrated to no end, I chose a band-aid approach for now and set cron to reboot the system automatically for me Sunday and Wednesday morning at 2am.
I have tried upgrading kernels, compiling the kernel from source with a minimal driver set (although I am not an expert in that area by any means) and a host of other software "solutions" to no avail. I believe that it comes down to a combination of hardware in the server and how the hardware relates to each other, not that I think any hardware is bad per se. This is speculation at this point but I am frustrated beyond belief with this and don't know where to turn.
Here's hoping that somebody will have a suggestion or two....
Jon
Erwin
Oct 2 2003, 09:47 PM
Have you tried looking for faulty scripts that cause processes to go into a loop and not die? Whenever I have issues like this, it's usually from a script (in my case PHP) with badly written code - once this is fixed, the server is back to normal.
Check your error logs and see if that helps.
Nethead
Oct 2 2003, 09:55 PM
In the case of my issue there are no errors recorded. All logs just abruptly stop.
I have a script running via cron to record server state every 5 minutes. Today it took this snapshot no more than 1 minute before a crash. Examining it, all looked normal with no high loads, no excessive open connections, and no unusual or high number of processes active.
I'm stumped beyond stumped. I'm running a CPU burn-in test now to see if that kills it. So far it's hanging in there.
Wouldn't runaway scripts show up in processor overloads?
Doobla
Oct 2 2003, 10:05 PM
QUOTE
Originally posted by Nethead
In the case of my issue there are no errors recorded. All logs just abruptly stop.
I have a script running via cron to record server state every 5 minutes. Today it took this snapshot no more than 1 minute before a crash. Examining it, all looked normal with no high loads, no excessive open connections, and no unusual or high number of processes active.
I'm stumped beyond stumped. I'm running a CPU burn-in test now to see if that kills it. So far it's hanging in there.
Same scenario as me and when I tried the cpu burn it hung in there for 2 days and at that point I figured that it wasn't the cpu. I also tested the RAM and that wasn't it.
When EV1 replaced my server they replaced with like hardware and so if it was a hardware conflict then like hardware would cause the same thing to happen which it did.
But just as Nethead said, logs just stop along with just about everything else. SIM stops working without notice. Only way to get the system back is to put in a trouble ticket.
Nethead, what other hardware do you have in your system? It should be reported when you boot up.
What are the other possibilities for this kind of system crash?
Nethead
Oct 2 2003, 10:14 PM
QUOTE
When EV1 replaced my server they replaced with like hardware and so if it was a hardware conflict then like hardware would cause the same thing to happen which it did.
What are you running? On this end its a Celeron 1.7 GHz, 60 GB Drive, 1 GB RAM, CPanel 7
Kernel wise I have Linux version 2.4.20-18.7 (bhcompile@stripples.devel.redhat.com)
beams
Oct 2 2003, 10:31 PM
Well just for the record, I experienced the very same thing for the first time last night.
5:37am AU time the machine stops responding and all logs stop. I had to reboot the machine to bring it up and can't find any sign of a reason.
I'm on a Celeron 1.3 Ensim machine, latest patches, apf firewall, chkrootkit etc.
No sign of intrusion, no unusual httpd activity, no errors in any logs.
I'd sure be interested in any hints you may have.
BTW Nethead - what script are you using to 'snapshot' the server status?
Tom
Nethead
Oct 2 2003, 10:37 PM
QUOTE
Originally posted by beams
BTW Nethead - what script are you using to 'snapshot' the server status?
The following was posted somewhere else here in the forums by an RS staffer:
CODE
#!/bin/bash
#This is a very simple system monitoring script.
#Written by RS-Nate 05/09/03
echo "0000000000000000000000000000000" >> /var/log/system-snapshot.log
date >> /var/log/system-snapshot.log
uptime >> /var/log/system-snapshot.log
cat /proc/meminfo >> /var/log/system-snapshot.log
ps fuaxww >> /var/log/system-snapshot.log
netstat -na >> /var/log/system-snapshot.log
I just have it scheduled to run every 5 minutes through a cron job. Though it ran less than a minute before the box went down today, it didn't show anything of note.
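A possible variant of the script above, as a sketch rather than a drop-in replacement: it collects the same data through one redirect, notes any command that fails, and calls `sync` so the snapshot is flushed to disk even if the box hangs shortly afterwards. The function name `snapshot` and the cron path in the comment are assumptions for illustration.

```shell
#!/bin/bash
# Sketch: same data as the snapshot script above, written to one log file.
# Hypothetical cron line to run it every 5 minutes:
#   */5 * * * * /root/system-snapshot.sh
snapshot() {
    local log="${1:-/var/log/system-snapshot.log}"
    local cmd
    {
        echo "0000000000000000000000000000000"
        # $cmd is left unquoted on purpose so the arguments are split.
        for cmd in "date" "uptime" "cat /proc/meminfo" "ps fuaxww" "netstat -na"; do
            $cmd || echo "($cmd failed)"
        done
    } >> "$log" 2>&1
    # Flush buffered writes so the last snapshot survives a sudden hang.
    sync
}
```

Calling `snapshot` with no argument appends to /var/log/system-snapshot.log; passing a path writes elsewhere, which is handy for a trial run.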
Doobla
Oct 2 2003, 11:08 PM
I have a celeron 1.3 with 512MB ram and 60 GB drive. The basic server available.
This started happening for me a few months ago after installing Ensim Pro and upgrading to Redhat 7.3 from 7.2 and likewise upgrading the kernel. About a month passed between the installation of Ensim Pro (and Redhat 7.3) and the first crash, and I had upgraded the kernel again during that time. Since that kernel upgrade (or that time period) I have had this problem.
Upgrading kernels has no effect. I am currently at 2.4.20-19.7, which was part of an agreement with the EV1 techs that if I put my kernel back at the "stock" kernel in their image and the server still had problems then they would replace the box. I have just not upgraded since getting my box replaced.
Here's some info from the logs that may or may not be relevant:
CODE
Linux version 2.4.20-19.7 (bhcompile@porky.devel.redhat.com) (gcc version 2.96 20000731 (Red Hat Linux 7.3 2.96-113))
CPU: Intel(R) Celeron(TM) CPU 1300MHz stepping 01
PCI: Using IRQ router VIA [1106/0686] at 00:07.0
VFS: Disk quotas vdquot_6.5.1
Uniform Multi-Platform E-IDE driver Revision: 7.00beta3-.2.4
ide: Assuming 33MHz system bus speed for PIO modes; override with idebus=xx
VP_IDE: IDE controller at PCI slot 00:07.1
VP_IDE: chipset revision 6
VP_IDE: not 100%% native mode: will probe irqs later
VP_IDE: VIA vt82c686b (rev 40) IDE UDMA100 controller on pci00:07.1
ide0: BM-DMA at 0xe000-0xe007, BIOS settings: hda:DMA, hdb:pio
ide1: BM-DMA at 0xe008-0xe00f, BIOS settings: hdc:pio, hdd:pio
hda: ST360021A, ATA DISK drive
hda: 117231408 sectors (60022 MB) w/2048KiB Cache, CHS=7297/255/63, UDMA(100)
PCI: Found IRQ 12 for device 00:07.3
PCI: Sharing IRQ 12 with 00:07.2
usb-uhci.c: USB UHCI at I/O 0xe800, IRQ 12
usb-uhci.c: Detected 2 ports
8139too Fast Ethernet driver 0.9.26
PCI: Found IRQ 11 for device 00:11.0
eth0: RealTek RTL8139 Fast Ethernet at 0xe014c000, 00:01:80:23:88:51, IRQ 11
eth0: Setting 100mbps full-duplex based on auto-negotiated partner ability 45e1.
Ultimately I don't know what is relevant so I am just posting possibly relevant info. I have apf installed. Mailscanner is installed but disabled because it increased my load. Ensim Pro 3.5.19.
Jon
By the way, can somebody PM me how to check to make sure that my hard drive is running at UDMA 100, etc for performance?
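On the UDMA question: the `hda: ... UDMA(100)` line in the boot log above already suggests the drive negotiated UDMA/100 at boot. To check the live setting, `hdparm -i /dev/hda` lists the supported UDMA modes and marks the active one with `*` (udma5 corresponds to UDMA/100). A minimal sketch that pulls out the active mode follows; the helper name `active_udma` is made up for illustration, and it assumes hdparm is installed and run as root.

```shell
#!/bin/bash
# Read `hdparm -i` output on stdin and print the UDMA mode marked active (*).
active_udma() {
    sed -n 's/.*\*\(udma[0-9]\).*/\1/p'
}

# Usage (as root):
#   hdparm -i /dev/hda | active_udma
```

`hdparm -d /dev/hda` reports whether DMA is enabled at all, and `hdparm -tT /dev/hda` gives a rough read-throughput figure.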
Nethead
Oct 3 2003, 09:37 AM
Maybe it's just coincidence, but if I follow correctly all of us in this thread have a Celeron. I wonder if there is some quirky issue between our Linux builds and that motherboard?
I have had this server since June. It ran fine until September. Now it's gone down a few times since then. Somewhere back in August I needed a restore due to another issue. So the only change of note between the months of good service and the spottiness of the recent weeks is that restore in August. For the life of me though I have no way to know if the August image differed from the June one.
I hope amongst us all here we can come up with a way to trigger the problem. Once we can predictably reproduce it, we are sure to get to the bottom of it.
I have a new ticket in again with EV1, I'll post any updates based on that.
Konrad Frye
Oct 3 2003, 12:50 PM
QUOTE
Originally posted by Doobla
While I am sorry for your fortune, I am glad to see that somebody else is having the same problem as I am and that it's not an Ensim thing.
I am having the same problem except that my crashes seem to be more regular than yours, usually happening once every 6-10 days. I convinced EV1 to replace my server in hopes that the problem was actually a hardware issue, but it was not, as within a week I was requesting a reboot again. Since my clients are frustrated to no end, I chose a band-aid approach for now and set cron to reboot the system automatically for me Sunday and Wednesday morning at 2am.
I have tried upgrading kernels, compiling the kernel from source with a minimal driver set (although I am not an expert in that area by any means) and a host of other software "solutions" to no avail. I believe that it comes down to a combination of hardware in the server and how the hardware relates to each other, not that I think any hardware is bad per se. This is speculation at this point but I am frustrated beyond belief with this and don't know where to turn.
Here's hoping that somebody will have a suggestion or two....
Jon
It seems that there are quite a few people having this problem. I agree with you that obscure hardware incompatibilities are the likely culprit. Celeron users seem to be the ones that are most affected.
My friend has an ensim box that exhibits this problem and she's gone through pretty much the same troubleshooting routine as everyone else. Sometimes the box stays up for 4 or 5 weeks, other times it dies 7-10 days after the last reboot. The server is home to Vbulletin forums but that's about it. Machine load rarely goes above 0.3 and the crashes seem to be totally random. Sometimes they take place late at night, sometimes in the middle of the afternoon. Day of the week doesn't seem to matter either.
Someone suggested it might be a swap issue but I haven't seen any evidence of that.
Nethead
Oct 3 2003, 01:21 PM
QUOTE
Originally posted by Konrad Frye
Someone suggested it might be a swap issue but I haven't seen any evidence of that.
The only oddity I have noticed with respect to swapping is in viewing the swap info through 'vmstat'.
CODE
   procs                      memory    swap          io     system         cpu
 r  b  w   swpd   free   buff  cache  si  so    bi    bo   in    cs  us  sy  id
 1  0  0 109900   9788 126336 655532   1   5    73   160  207   132  35   5  60
The "so" value is never at 0 and in fact has at times climbed to 10 and higher. I record this info every 5 minutes along with server load and it seems that over time there is a slow trend up or down with regards to the swap out value.
The server does have a reasonable number of sites on it, but even upgrading from 512MB to 1 GB RAM didn't seem to help this curious phenomenon. Not sure if this is normal or not, and this is probably too little info to conclude anything from, but I throw it out there as a curious observation just in case it's somehow related. The high rate of interrupts and context switches also catches my eye.
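One thing worth ruling out: when `vmstat` is run without an interval, the single row it prints is an average since boot, so a nonzero "so" there does not by itself mean the box is swapping at that moment. A minimal sketch that keeps only the swap-in/swap-out columns so an interval sample can be logged instead; `swap_rates` is a hypothetical helper name, and the columns are looked up by header because the layout varies between procps versions.

```shell
#!/bin/bash
# Print "si so" for each data row of vmstat output read from stdin.
swap_rates() {
    awk '
        $1 == "r" && $2 == "b" {             # column-name header row
            for (i = 1; i <= NF; i++) col[$i] = i
            next
        }
        ("si" in col) && $1 ~ /^[0-9]+$/ {   # numeric data row
            print $(col["si"]), $(col["so"])
        }
    '
}

# Usage: sample for 5 seconds and keep only the interval row
# (the first data row is the since-boot average):
#   vmstat 5 2 | swap_rates | tail -n 1
```

Fed the output posted above, it prints `1 5`, matching the si and so columns.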
Nethead
Oct 3 2003, 03:23 PM
EV1 took another look. Their latest trouble ticket response is:
QUOTE
However as your services are having problem, hence i suggest you to upgrade or reinstall all the services for which you have to upgrade cpanel to latest version. You can do the same from WHM under Cpanel menu and clicking on "Upgrade to Latest Version" link. Do the same and monitor the server for few hours. If the problem still persists, then i suggest you to order a restore in which OS and cpanel will be resinstalled.
I think most of us affected have already done most of these things? Other affected please confirm.
Another observation was:
QUOTE
The main reason for the server going down was because someone changed the httpd.conf file to set the maxclients variable to 260, this caused apache to spawn processes until it died and took the machine with it.
Though this had been automatically adjusted higher by cPanel one day when server load skyrocketed due to a hot site on the server, this doesn't directly correlate with the crashes. In other words, there were no high loads at the time of the crash. In fact there were no more than maybe 2 dozen apache processes running at the time.
One last thing, in tweaking the httpd.conf they switched apache to run under apache instead of nobody. This broke scripts so I changed it back. I assume nobody is the normal user for apache with cPanel?
Doobla
Oct 3 2003, 03:28 PM
QUOTE
I think most of us affected have already done most of these things? Other affected please confirm.
Yes and we're having the exact same problem but using two different panels so it is not the control panel...
QUOTE
In other words, there were no high loads at the time of the crash. In fact there were no more than maybe 2 dozen apache processes running at the time.
Same here. I think they are just baffled and so they don't want to take the time to investigate. It is not a load problem by any means.
rrand
Oct 8 2003, 04:00 PM
I too have been having the same problem on a Celeron. Usually it's been once a month for the last 4 months. But it just happened two days in a row and now I'm getting really worried.
beams
Oct 8 2003, 04:15 PM
This thread seems to be very helpful - it suggests a possible hardware<->redhat problem.
http://forum.ev1servers.net/showthread.php...&threadid=23255
Tom
Doobla
Oct 8 2003, 04:19 PM
QUOTE
Originally posted by beams
This thread seems to be very helpful - it suggests a possible hardware<->redhat problem.
http://forum.ev1servers.net/showthread.php...&threadid=23255
Tom
Helpful, yes. But there was nothing there that resolved my problem. I tried the kernel route and it didn't do anything for me (actually made it worse).
Maybe I should reattempt, I dunno.
Nethead
Oct 14 2003, 04:26 PM
Had another occurrence of the mystery crash on the same box. Again nothing logged. Server remained responsive to pings.
I guess the next step is to somehow convince RS that I have a lemon of a server and need it replaced.
beams
Oct 14 2003, 04:41 PM
Good luck,
I managed to get them to replace my RAM and cables and do some tests, but they told me the HW was all ok and that I should increase my RAM.
I don't believe it is my RAM from the logs, but I figured I should comply with their suggestions otherwise I would not get any further help.
Doobla
Oct 14 2003, 04:52 PM
Well, I had added a cron entry to reboot my server at 2 am on Sunday and Wednesday mornings to try to head this off at the pass, and I still had a crash on Sunday and one on Monday, so I opened another ticket with Rackshack. Below is the ticket in chronological order:
QUOTE
10/13/03 10:32:12 PM
My server for some time has been locking up at random times requiring a reboot ticket to be put in. It started about a month after I upgraded to Ensim Pro which was several months ago and I believe there was a kernel upgrade that happened around that time. Since then I had my hard drive replaced as the techs found that there were bad sectors on the drive and they thought that might have been the cause of the problems, however it wasn't but a few days later that my system hung again, forcing another reboot ticket. Long story short (the full story you would probably read in my trouble tickets) my box was replaced and still I have the problem. After one time of it happening, after the server was replaced, I put cron entries in to reboot the server at 2am on Sunday and Wednesday mornings as a band-aid to the situation as many of my clients were going to leave me. I'm still worried about this as no particular load can be observed before the crashes and it obviously wasn't "bad" hardware but I don't have anything out of the ordinary installed on the server. It has been suggested that Redhat Linux sometimes has problems with certain *combinations* of hardware and so I was wondering if you could investigate this and/or advise me on my next steps. Obviously the band-aid approach I have taken will have to be temporary and I have tried everything I could find on the forums as a software solution to no avail.
The previous paragraph was sent in an email to Patrick who never responded to the email. Now over the past two days the server has locked twice for no apparent reason. I have not changed the configuration that was on the hard drive when they gave me a restore except to upgrade Ensim, etc. Nothing out of the ordinary was done and there were no hacks performed on the machine. Going on the idea that it could be a hardware conflict, is it possible to give me a new server with a different hardware config? I don't know if you use different network cards or motherboards or what, but I will move my sites to whatever server you'd be willing to give me that has a different configuration and then I would give up the old server obviously. Otherwise I'll entertain any suggestions you have on this matter, however I am not the only person that has this problem.
Please refer to these threads for reference to my problem as well as others who have expressed a similar issue.
http://forums.rackshack.net/showthread.php...&threadid=22377
http://forum.rackshack.net/showthread.php?...?threadid=23255
http://forum.ev1servers.net/showthread.php...?threadid=33908
Thank you very much in advance for your help.
Jon
QUOTE
10/14/03 9:37:15 AM
Dear Customer,
I was still unable to find any reason for your random crashes. I do have a theory though. I see that you were running apf as a firewall. Is it possible your firewall is misbehaving, and it appears that the server is down when it's just being filtered?
What I would like to do is to wait until the next outage, and then have DataCenter console into the machine to see if it's just the firewall blocking access, and not the machine being down itself.
I ran a memtest & chkrootkit, which came back clean, and the hardware has been totally switched out, so the only thing I can think of is either the firewall misbehaving, or perhaps you just need to reimage the machine and start over.
I'll go ahead and close this ticket. If the machine goes down again, please open a new ticket requesting that DataCenter verify if the firewall is blocking access or is the machine down itself.
-Eve1 Servers Support
QUOTE
10/14/03 5:04:23 PM
I have considered the firewall and eliminated that as a problem. First of all, APF wasn't being used when this problem first started. Secondly, there was a period of time recently when I had a kernel installed (trying to see if a kernel configuration change would fix the problem) where I could not get iptables to work, so apf wouldn't even run, and I still had a couple of crashes during that time.
As for reimaging the machine, I don't see how that will help since I haven't changed the configuration except for security updates and the installation of a few programs that shouldn't affect anything (like mrtg).
Also, regarding your comment about the hardware, I do not believe that any hardware is bad but that Redhat is having a problem with the configuration of the hardware. How many different configurations of Celeron boxes do you have? Do you use different motherboards? There is an obvious hardware pattern to this problem as I have only heard of Celeron customers with this problem so my thoughts are (as well as some of theirs) that it is the mix of hardware that makes up the box that redhat doesn't like.
I would entertain any other suggestions that you might have and I would really really appreciate it if you would look at the threads that I mentioned in my original ticket as these people are also having the same problem.
Thanks in advance for working to resolve this for me. My customers are not happy at all and this seems way out of my control.
QUOTE
10/14/03 5:26:53 PM
Dear Customer,
The problem that seems to appear on your system is an isolated one and has not been seen on many of the other systems. I have gone through the threads and although the issue seems similar (wrt crashes), the logs do not exactly correlate.
The hardware has been checked and verified, so it cannot be a hardware issue. Also, chkrootkit has been run and system integrity verified.
At this point the only option we have is to wait for another outage as suggested earlier and then look at the system to verify whether it is the firewall causing it or some other reason. It is only after this that we can ascertain that the reimage is essential.
Please reopen this ticket or put in a new ticket as soon as another outage occurs and I assure you we shall look into it.
Thank You,
Ev1servers Support Team
So I guess I am waiting for a crash again and hoping that it doesn't come at a really bad time. I'm taking my cron entries out too so that this thing doesn't drag on any longer than necessary.
Jon
beams
Oct 14 2003, 05:15 PM
QUOTE
10/14/03 5:26:53 PM
Dear Customer,
The problem that seems to appear on your system is an isolated one and has not been seen on many of the other systems. I have gone through the threads and although the issue seems similar (wrt crashes), the logs do not exactly correlate.
I'm not so sure your problem is isolated, as I seem to be having the same problem (thankfully not as often).
The problem we face here is that there is no coordinated approach to the problem. On our end, we see quite a few people with similar problems, but we are not working as a team.
On the RS end, they probably have numerous techs doing different tasks and no one person or team is seeing all these problems. So knowledge is being lost or never accumulated.
We really need some kind of 'register' where people with this or similar problems log all the necessary details.
Doobla
Oct 14 2003, 05:19 PM
QUOTE
Originally posted by beams
I'm not so sure your problem is isolated, as I seem to be having the same problem (thankfully not as often).
The problem we face here is that there is no coordinated approach to the problem. On our end, we see quite a few people with similar problems, but we are not working as a team.
On the RS end, they probably have numerous techs doing different tasks and no one person or team is seeing all these problems. So knowledge is being lost or never accumulated.
We really need some kind of 'register' where people with this or similar problems log all the necessary details.
So how do you propose we move forward on such a coordinated effort? I referenced this thread along with a couple of other ones in my trouble ticket and the tech said that he/she read through them but said that my problem was different for some reason. I disagree. Maybe different than some of the ones in the other threads but certainly not in this thread. Anyways, I am all about getting this fixed ASAP so any suggestions are welcome!
Jon
Nethead
Oct 14 2003, 05:35 PM
Though I also run APF, I have to concur that it does not appear to be a plausible cause of our crashes. Sure a misbehaving firewall may lock people out, but it does not explain ALL server log files abruptly ending with no sign of any problems.
I am suspicious of the motherboard / OS interaction in this case. I am stymied though when it comes to how to resolve this. It has been hinted at before by RS that if the problems persist that perhaps they would need to replace the server. Sounds great until you consider this means all new IPs impacting a boat-load of sites in my case. Also, they don't seem to be at that point of committing to providing a totally new server just yet. Frankly I don't want nor trust another RS Celeron box at this point.
beams
Oct 14 2003, 05:46 PM
I run apf too, for the record. I agree with the comments above; it does not make sense that apf would stop all file system access.
But it is common to the three of us.
amps
Oct 14 2003, 06:32 PM
The prob is not APF if you have no logs during the downtime.
The problem IS an IO issue and whoever was noticing irregularities in the swap activity and various IO data may be on to something. The fact that everything locks up with no logs means that your IO bus / ATA controller is locking up preventing the kernel from writing to it.
The fact you have lockups after Ensim Pro reinforces my point. Ensim Pro is much more CPU and disk intensive than 3.1. The extra disk IO over extended periods is probably the main cause of the controller loop. I had a similar issue where by disabling a Promise ATAraid controller I was able to completely cure the issue. Note that this could be cumulative -- i.e. IO processes over time with a slowly leaking buffer somewhere. Doing a quick 1 or 2 day test may not isolate the issue. I think that you will find as your average server load increases, the crashes will become more frequent.
VIA chipsets are notoriously buggy and I am 99.9% certain this is your problem. Because linux is designed around server grade hardware not desktop systems (VIA), it is unlikely there are enough people in true production environments utilizing VIA based motherboards to contribute enough to solving this issue with your particular make/model which was designed for your average desktop clone running WinXP. MS has had more experience with working around cheap hardware bugs since they design desktop grade OS.
I have seen similar issues with certain Adaptec SCSI controllers. But like I said, all the symptoms you guys have point directly to the hard drive controller / chipset you're running.
It is unlikely RS techs will be of ANY help. They are not going to admit there may be a problem with the hardware they chose to run linux on. And since this is the most likely cause, well you get the picture. You will go in circles all day until it's time to reboot your machine again.
I noticed as well they are now selling Linux servers with Promise Fasttrack controllers. All of these people are going to have the SAME PROBLEM as you guys... what they need to do is actually test these configurations in a real production environment for months on end before they start selling them.
There WAS a post somewhere around here where someone was able to compile a plain kernel with some custom options and fix the issue with the rackshack celery VIA's.
I wish you all the best of luck....
beams
Oct 14 2003, 06:49 PM
Maybe I should know this, but how do you monitor disk IO?
What tools are available? If I knew what to do I would implement some logging process to 'see' what is happening.
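A low-tech starting point, as a sketch only: the `bi`/`bo` columns of `vmstat` show blocks read from and written to block devices, and `iostat -d` (from the sysstat package, if it is installed) breaks activity down per device. The function name `log_disk_io` and the default log path below are assumptions for illustration.

```shell
#!/bin/bash
# Append one timestamped disk I/O sample to a log file.
log_disk_io() {
    local log="${1:-/var/log/disk-io.log}"
    {
        echo "=== $(date '+%Y-%m-%d %H:%M:%S') ==="
        if command -v vmstat >/dev/null 2>&1; then
            # Second row is a 1-second interval sample; the first is since boot.
            vmstat 1 2 | tail -n 1
        fi
        if command -v iostat >/dev/null 2>&1; then
            iostat -d
        fi
    } >> "$log"
    # Flush to disk so the sample survives a sudden hang.
    sync
}
```

Run from cron every few minutes, this would leave a trail of I/O figures leading up to a hang, which could then be compared against quiet periods.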
Doobla
Oct 14 2003, 08:22 PM
QUOTE
Originally posted by amps
There WAS a post somewhere around here where someone was able to compile a plain kernel with some custom options and fix the issue with the rackshack celery VIA's.
I wish you all the best of luck....
Man, if you could help dig that up that'd be great! I tried several times myself and it only made my conversation with RS go bad because I had to request reboots for a failed kernel upgrade because I used the wrong network controller or something so they said all of my issues were my own fault (at least at the time).
Anyways, I'm going to look around for that post so anybody that can, how about giving me a hand.
Also, regarding APF, I know APF wasn't my problem because there was a period of time when APF wasn't even installed on my machine and I still had the same problems.
Thank you for your contributions amps,
Jon
Netino
Oct 14 2003, 09:35 PM
I've been researching this problem for a year, and I firmly believe APF has nothing to do with this issue. If it did, it would be easy to discover: search your log and see if APF is blocking anyone during the freezes. You will see that nothing is logged!
My research is taking new directions, and the problem exists across all hardware/software and OS combinations. My latest discovery is that PHP/Apache already caused a similar problem with file buffering on the older RedHat 6; I didn't test it, but apparently that problem was solved. See <http://www.phpbuilder.com/mail/php-develop...200001/0544.php>.
Regards,
Netino
Doobla
Oct 14 2003, 09:45 PM
QUOTE
Originally posted by Netino
I've been researching this problem for a year, and I firmly believe APF has nothing to do with this issue. If it did, it would be easy to discover: search your log and see if APF is blocking anyone during the freezes. You will see that nothing is logged!
My research is taking new directions, and the problem exists across all hardware/software and OS combinations. My latest discovery is that PHP/Apache already caused a similar problem with file buffering on the older RedHat 6; I didn't test it, but apparently that problem was solved. See <http://www.phpbuilder.com/mail/php-develop...200001/0544.php>.
Regards,
Netino
So installing a kernel from source didn't fix your problem netino? I read that solution in another thread similar to this one.
Doobla
Oct 14 2003, 10:18 PM
Hey amps,
You fixed your problem with a kernel upgrade right? And you used a redhat rpm to do that?
What is an smp kernel (vs. some other kernel rpm)???? In other words, what does the smp stand for?
thanks,
Jon
Netino
Oct 14 2003, 10:56 PM
QUOTE
Originally posted by Doobla
So installing a kernel from source didn't fix your problem netino? I read that solution in another thread similar to this one.
Yes, Doobla. As Einstein said, one experiment can bring down 100 theories. Sorry for not adding a new post and only editing my (un)"successful" post. I thought the problem was solved, but it happened again, so I am searching for a solution again. I posted a new thread where I suspected a sendmail problem, but my research is taking new directions again.
I don't know if it is related, but I changed some parameters in /etc/sysctl.conf, and my server crashed three more times in one day. This had never happened before, so it may have something to do with the problem. The parameters were:
net.core.rmem_default = 8388608
net.core.rmem_max = 8388608
net.ipv4.tcp_rmem = 4096 87380 8388608
net.core.wmem_default = 65536
net.core.wmem_max = 8388608
net.ipv4.tcp_wmem = 4096 65536 8388608
net.core.optmem_max = 8388608
After the crashes, I changed to the following parameters:
net.core.rmem_default = 25165824
net.core.rmem_max = 25165824
net.ipv4.tcp_rmem = 4096 25165824 25165824
net.core.wmem_default = 65536
net.core.wmem_max = 25165824
net.ipv4.tcp_wmem = 4096 65536 25165824
net.core.optmem_max = 25165824
(already applied with "sysctl -p")
Some values are three times larger.
I will watch how it behaves.
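For scale, the arithmetic behind those numbers can be checked in the shell. This is only a worked conversion, not a tuning recommendation; `to_mib` is a helper made up for the example.

```shell
#!/bin/bash
# Convert a byte count to whole MiB (1 MiB = 1048576 bytes).
to_mib() { echo $(( $1 / 1048576 )); }

old=8388608     # maximum used in the first set of parameters
new=25165824    # maximum used in the second set

echo "old: $(to_mib "$old") MiB, new: $(to_mib "$new") MiB, ratio: $(( new / old ))x"
# prints: old: 8 MiB, new: 24 MiB, ratio: 3x
```

Note that the three fields of net.ipv4.tcp_rmem and net.ipv4.tcp_wmem are minimum, default, and maximum, so the second set also raises the default TCP receive buffer from 87380 bytes to 24 MiB per socket. Whether that has any bearing on the crashes is not established here.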
Regards,
Netino
amps
Oct 15 2003, 09:07 AM
QUOTE
Originally posted by Doobla
Hey amps,
You fixed your problem with a kernel upgrade, right? And you used a Red Hat rpm to do that?
What is an smp kernel (vs. some other kernel rpm)? In other words, what does the smp stand for?
thanks,
Jon
My case was sort of unique. I had to compile in Promise modules to run my IDE drives in a RAID configuration. After a couple of weeks I began having the exact same problems as you guys. The only solution was upgrading to a standard Red Hat kernel and disabling the Promise controller.
I have since had solid uptime for over 100 days now. But my server doesn't use a VIA chipset... it's all Intel. I think those with newer VIA chipsets facing this issue may not be able to solve it easily, since it is a Red Hat <=> hardware incompatibility issue. I can almost guarantee there is no software on your box that is causing this. The problem is the hardware hasn't been around long enough for the Red Hat developers to build in a workaround (such as the Southbridge workaround for the older VIA KT133 chipsets, etc.). Red Hat kernels are always pretty far behind the Linux curve.
You may have luck with compiling a bleeding edge kernel from Kernel.Org .. but I do not have experience with doing this. They may have already provided a fix in the plain kernels.
Doobla - SMP stands for symmetric multiprocessing; SMP kernels are for multi-CPU servers.
Doobla
Oct 15 2003, 09:33 AM
I see, thanks for the info. I wonder if a BIOS upgrade would help the situation at all....?
devo-x
Oct 15 2003, 11:29 AM
Which processes are consuming the most CPU and RAM according to commands such as 'top d1' and 'ps -aux --forest'?
Try to isolate which programs are active immediately before the 'crash' ......
Post the contents of /proc/interrupts .....
amps
Oct 15 2003, 12:17 PM
If these guys are in the same hardware issue boat as I was, there will be NO abnormal activity or spike in usage leading up to the crash. It just suddenly dies out of nowhere... you could be editing a file with Pico with nothing else going on, you could be pushing 100 megs of traffic... there really is no rhyme or reason other than TIME and average load over that time.
Doobla
Oct 15 2003, 04:37 PM
QUOTE
Originally posted by amps
If these guys are in the same hardware issue boat as I was, there will be NO abnormal activity or spike in usage leading up to the crash. It just suddenly dies out of nowhere... you could be editing a file with Pico with nothing else going on, you could be pushing 100 megs of traffic... there really is no rhyme or reason other than TIME and average load over that time.
Although my situation IS a little different than amps in that I only have one drive and no Raid setup going on, I will say that his description is accurate for my situation. There really is nothing going on and then bam, server goes down.
Nethead
Oct 15 2003, 04:58 PM
I will concur that though I capture complete snapshots of all running processes, open sockets, memory utilization, etc every 5 minutes, and have captured state seemingly seconds before my most recent failure, no unusual processor hogs, memory usage, etc are noted.
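A periodic snapshot of this kind might look like the following sketch. It is not the actual script used here; the file name `snapshot.log`, the function names, and the choice of /proc files are assumptions. Each record is flushed and fsync'ed so that the last entry before a hang has the best chance of reaching the disk.

```python
import os
import time


def parse_meminfo(text):
    """Parse /proc/meminfo-style 'Key:   value kB' lines into {key: int}."""
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields and fields[0].isdigit():
            info[key.strip()] = int(fields[0])
    return info


def format_snapshot(loadavg_text, meminfo_text, now=None):
    """Build one log line from the raw contents of the two /proc files."""
    mem = parse_meminfo(meminfo_text)
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(now))
    load1 = loadavg_text.split()[0]  # 1-minute load average
    return "%s load1=%s MemFree=%s SwapFree=%s" % (
        stamp, load1, mem.get("MemFree"), mem.get("SwapFree"))


def write_snapshot(path="snapshot.log"):
    """Append one snapshot and force it to disk before returning."""
    with open("/proc/loadavg") as f:
        loadavg = f.read()
    with open("/proc/meminfo") as f:
        meminfo = f.read()
    with open(path, "a") as out:
        out.write(format_snapshot(loadavg, meminfo) + "\n")
        out.flush()
        os.fsync(out.fileno())


if __name__ == "__main__":
    write_snapshot()  # intended to be run from cron, e.g. every 5 minutes
```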
I want to add to the info I contributed at the start of this thread based on the discussion we are having.
Looking at /var/log/messages at boot time, as far as I can tell I do not have a VIA chipset.
CODE
kernel: ICH4: IDE controller at PCI slot 00:1f.1
kernel: PCI: Found IRQ 10 for device 00:1f.1
kernel: PCI: Sharing IRQ 10 with 00:02.0
kernel: ICH4: chipset revision 2
kernel: ICH4: not 100%% native mode: will probe irqs later
kernel: ide0: BM-DMA at 0xf000-0xf007, BIOS settings: hda:DMA, hdb:pio
kernel: ide1: BM-DMA at 0xf008-0xf00f, BIOS settings: hdc:pio, hdd:pio
kernel: hda: ST360021A, ATA DISK drive
kernel: blk: queue c0377440, I/O limit 4095Mb (mask 0xffffffff)
kernel: ide0 at 0x1f0-0x1f7,0x3f6 on irq 14
For what it is worth, here is my 1.7 GHz Celeron based system's /proc/interrupts:
CODE
           CPU0
  0:    6226134          XT-PIC  timer
  1:          3          XT-PIC  keyboard
  2:          0          XT-PIC  cascade
  4:         45          XT-PIC  serial
  5:          0          XT-PIC  ehci-hcd
  8:          1          XT-PIC  rtc
  9:          0          XT-PIC  usb-uhci
 10:          0          XT-PIC  usb-uhci
 11:    6178586          XT-PIC  usb-uhci, eth0
 14:    1466219          XT-PIC  ide0
NMI:          0
ERR:          0
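One thing worth pulling out of a listing like this is which IRQ lines are shared between devices. The sketch below is a hypothetical helper (the name `shared_irqs` does not come from any existing tool) and assumes the single-CPU /proc/interrupts layout shown above.

```python
def shared_irqs(text):
    """Return {irq: [device, ...]} for IRQ lines listing more than one device.

    Assumes the single-CPU layout shown above:
    '<irq>: <count> <controller> <dev>[, <dev>...]'
    """
    shared = {}
    for line in text.splitlines():
        irq, sep, rest = line.strip().partition(":")
        if not sep or not irq.isdigit():
            continue  # skip the CPU0 header and the NMI/ERR lines
        fields = rest.split(None, 2)  # count, controller, device list
        if len(fields) < 3:
            continue
        devices = [d.strip() for d in fields[2].split(",")]
        if len(devices) > 1:
            shared[int(irq)] = devices
    return shared


# The listing posted above, pasted in as test input.
SAMPLE = """\
CPU0
0: 6226134 XT-PIC timer
1: 3 XT-PIC keyboard
2: 0 XT-PIC cascade
4: 45 XT-PIC serial
5: 0 XT-PIC ehci-hcd
8: 1 XT-PIC rtc
9: 0 XT-PIC usb-uhci
10: 0 XT-PIC usb-uhci
11: 6178586 XT-PIC usb-uhci, eth0
14: 1466219 XT-PIC ide0
NMI: 0
ERR: 0
"""

if __name__ == "__main__":
    print(shared_irqs(SAMPLE))  # {11: ['usb-uhci', 'eth0']}
```

On this listing it reports only IRQ 11, shared by usb-uhci and eth0. Whether that sharing has anything to do with the hangs is not established in this thread.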
Doobla
Oct 15 2003, 05:32 PM
Well, can you dig up more hardware info like what I posted on page 1 of this thread? All I can see that you and I have in common hardware-wise is that we both use a Celeron (different speeds) and we both have the exact same hard drive model.
Other than that, not enough info.
beams
Oct 15 2003, 05:53 PM
I'm not good with hardware so I'm not sure what chipset I have, but I have a Celeron 1.3 and the HD is the same as you guys -
hda: ST360021A, ATA DISK drive
CPU: Intel® Celeron CPU 1300MHz stepping 01
Nethead
Oct 15 2003, 06:10 PM
Here is more info from /var/log/messages. Basically those areas covered in your post of the same:
CODE
Linux version 2.4.20-20.7 (bhcompile@porky.devel.redhat.com) (gcc version 2.96 20000731 (Red Hat Linux 7.3 2.96-113)) #1 Mon Aug 18 14:56:30 EDT 2003
CPU: Intel(R) Celeron(R) CPU 1.70GHz stepping 03
PCI: PCI BIOS revision 2.10 entry at 0xfb430, last bus=1
PCI: Using configuration type 1
PCI: Probing PCI hardware
PCI: Ignoring BAR0-3 of IDE controller 00:1f.1
Transparent bridge - Intel Corp. 82801BA/CA/DB PCI Bridge
PCI: Using IRQ router PIIX [8086/24c0] at 00:1f.0
PCI: Found IRQ 10 for device 00:1f.1
PCI: Sharing IRQ 10 with 00:02.0
Uniform Multi-Platform E-IDE driver Revision: 7.00beta3-.2.4
ide: Assuming 33MHz system bus speed for PIO modes; override with idebus=xx
ICH4: IDE controller at PCI slot 00:1f.1
PCI: Found IRQ 10 for device 00:1f.1
PCI: Sharing IRQ 10 with 00:02.0
ICH4: chipset revision 2
ICH4: not 100%% native mode: will probe irqs later
ide0: BM-DMA at 0xf000-0xf007, BIOS settings: hda:DMA, hdb:pio
ide1: BM-DMA at 0xf008-0xf00f, BIOS settings: hdc:pio, hdd:pio
hda: ST360021A, ATA DISK drive
blk: queue c0377440, I/O limit 4095Mb (mask 0xffffffff)
ide0 at 0x1f0-0x1f7,0x3f6 on irq 14
hda: attached ide-disk driver.
hda: host protected area => 1
hda: 117231408 sectors (60022 MB) w/2048KiB Cache, CHS=7297/255/63, UDMA(100)
8139too Fast Ethernet driver 0.9.26
PCI: Found IRQ 11 for device 01:0d.0
eth0: RealTek RTL8139 Fast Ethernet at 0xf88da000, 00:10:de:c1:fe:17, IRQ 11
eth0: Setting 100mbps full-duplex based on auto-negotiated partner ability 45e1.
Other failure / interesting(?) messages:
CODE
PCI: 00:1d.7 PCI cache line size set incorrectly (0 bytes) by BIOS/FW.
PCI: 00:1d.7 PCI cache line size corrected to 128.
I continue to do digging on this end. Also I am exploring a remote logging setup to see if I can capture anything that doesn't otherwise make it to the local disk log files.
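For the remote logging idea, one option is to send messages over UDP syslog to a second machine, so they survive even if the local disk never receives the write. Below is a minimal sketch using Python's standard `logging.handlers.SysLogHandler`. The host name `loghost.example.com`, the logger name `crashwatch`, and port 514 are placeholders, and the remote syslog daemon must be configured to accept network messages.

```python
import logging
import logging.handlers


def make_remote_logger(host, port=514, name="crashwatch"):
    """Return a logger that sends each record to a remote syslog over UDP."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    # SysLogHandler uses UDP by default when given a (host, port) address.
    handler = logging.handlers.SysLogHandler(address=(host, port))
    handler.setFormatter(logging.Formatter("%(name)s: %(message)s"))
    logger.addHandler(handler)
    return logger


if __name__ == "__main__":
    log = make_remote_logger("loghost.example.com")  # placeholder host
    log.info("heartbeat: box is alive")
```

UDP gives no delivery guarantee, but it also never blocks on a stalled disk, which is the point here.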
I guess it's a matter of waiting for it to crash again. If it wasn't a production box I'd play with some of the values Netino posted earlier just to get it to crash sooner.
amps
Oct 15 2003, 07:17 PM
QUOTE
Originally posted by Doobla
Although my situation IS a little different than amps in that I only have one drive and no Raid setup going on, I will say that his description is accurate for my situation. There really is nothing going on and then bam, server goes down.
Please note that the problem occurred with or without the RAID configuration. The only way to stop the crashing was to completely recompile the kernel and remove SUPPORT for the Promise controller.
amps
Oct 15 2003, 07:21 PM
Just some more food for thought -- the Realtek 8139 controller is a POS that is unstable under Linux when pushing anything more than a few megs. They are worth about $5.99 and have no place in anything but the $299 special from Clones-R-Us. Three other servers of mine ALL had Realteks and ALL dropped offline frequently under heavy network load. The fix was yanking them and putting in Intel 10/100s.
If they are on-board, I'm sorry... =)
The good thing is that if the Realtek is locking up on you, you should still have logs during the downtime indicating the server was indeed alive. Another telltale sign the Realtek is locking up on you is that you can't ping it while it's down, and issuing "service network restart" from the console will put it back online.
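The two failure patterns described in this thread can be told apart with two observations: whether the box answers ping during the outage, and whether local logs keep being written. A rough sketch of that decision follows. The function name and return strings are invented for illustration, and it encodes only the rules of thumb from these posts, not a reliable diagnosis.

```python
def classify_outage(pingable, logs_continued):
    """Map two observations about an outage to the pattern it resembles."""
    if not pingable and logs_continued:
        # Server alive but unreachable: matches the Realtek lockup described above.
        return "NIC lockup suspected"
    if pingable and not logs_continued:
        # Answers ping, but services are dead and logs end abruptly:
        # matches the hang reported at the start of this thread.
        return "system hang suspected"
    return "inconclusive"


if __name__ == "__main__":
    print(classify_outage(pingable=True, logs_continued=False))
```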
tgillespie
Oct 15 2003, 07:23 PM
I too have the same problem. For the last few days the server has been crashing randomly without logging anything. The only way to get it back online is to submit a reboot ticket. I have asked over and over for someone to take a look. The ticket always goes into investigation, and is then resolved with some random fix for whatever they claim is at fault. A few hours later the server rolls over again.
I am getting very upset, and am almost fed up with RackShack. The tech support is basically useless because when the server goes down, they can't access it. The server comes back up and tech support thinks everything is fine. Having the actual tech support miles away from the machines is a large bottleneck.
I can assure you that if these random server crashes continue, I will be picking up my machines, and moving elsewhere.
The machine that crashes regularly is a Celeron 1.3 running Ensim Pro 3.5.19. The only thing remotely likely to be causing this is my recent installation of ModernBill. The crashes started around the time we installed it.
Doobla
Oct 15 2003, 07:44 PM
Just a thought, but it seems like the first thing that people look at is heavy load or high RAM usage, but it seems that most if not all of the people reporting this problem have reported low loads at the time of the crash. Is it possible that maybe Power Management in the BIOS or something is kicking in? I've just started to think in new directions and that popped into my head.
Anybody know where to find a BIOS editor for Linux that can be run via the shell?
Nethead
Oct 15 2003, 09:13 PM
Could someone who is also experiencing these odd crashes post a typical output of the 'vmstat' command on your box? As noted earlier in this thread, my numbers are a bit odd, and I want to rule in or rule out this as a common thread among the servers we have crashing.
Here is my problem box:
CODE
   procs                      memory    swap          io     system         cpu
 r  b  w   swpd   free   buff  cache  si  so    bi    bo   in    cs  us  sy  id
 0  0  0 166988  21888 120952 603956   1  13   107   169  216   149  12   5  84
Thanks!
Doobla
Oct 15 2003, 09:21 PM
Here's my current vmstat:
CODE
   procs                      memory    swap          io     system         cpu
 r  b  w   swpd   free   buff  cache  si  so    bi    bo   in    cs  us  sy  id
 0  0  0  48268  14812  60528 229596  60  86    42    94   27   240  12   8  80
Jon
Netino
Oct 15 2003, 09:34 PM
QUOTE
Originally posted by Nethead
Could someone who is also experiencing these odd crashes post a typical output of the 'vmstat' command on your box? (...)
Here is my problem box:
CODE
   procs                      memory    swap          io     system         cpu
 r  b  w   swpd   free   buff  cache  si  so    bi    bo   in    cs  us  sy  id
 0  0  0 166988  21888 120952 603956   1  13   107   169  216   149  12   5  84
Thanks!
Do not use "vmstat" command without the parameter '1'. Each time you run it that way, some columns show inflated values.
I already used it without the '1' parameter to monitor a crash, recording the output to a file, but the results right before the crash were inconclusive. The 'bi' and 'bo' columns had grown a little, but if you run plain "vmstat" several times in a row you will see the same thing on every single run. With the '1' parameter, the first line carries a little overhead, but the following lines are really normal.
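The reason for the difference is that vmstat run without an interval prints a single line of averages since boot. With an interval, the first line is still that since-boot average, while the later lines are per-interval samples. Below is a small sketch for logging that drops the first sample. `parse_vmstat` and `sample` are hypothetical helper names, and the column names are taken from the header line of the output itself.

```python
import subprocess


def parse_vmstat(text, skip_first=True):
    """Turn vmstat output into a list of {column: int} dicts.

    With skip_first=True the first data row (averages since boot) is dropped.
    """
    lines = [l.split() for l in text.splitlines() if l.strip()]
    # The column-name header is the first line whose first field is 'r'.
    header_idx = next(i for i, f in enumerate(lines) if f[0] == "r")
    header = lines[header_idx]
    rows = [dict(zip(header, map(int, f)))
            for f in lines[header_idx + 1:]
            if len(f) == len(header) and all(x.isdigit() for x in f)]
    return rows[1:] if skip_first else rows


def sample(count=5):
    """Run 'vmstat 1' and return `count` per-second samples."""
    out = subprocess.run(["vmstat", "1", str(count + 1)],
                         capture_output=True, text=True, check=True).stdout
    return parse_vmstat(out)


if __name__ == "__main__":
    for row in sample(3):
        print(row["bi"], row["bo"], row["id"])
```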
Regards,
Netino
Nethead
Oct 15 2003, 09:44 PM
QUOTE
Originally posted by Netino
Do not use "vmstat" command without the parameter '1'.
I learn something new every day. Thanks for the info. I just tried what you said and see the difference. Thanks again!
devo-x
Oct 16 2003, 07:29 AM
Have you tried running the '8139 diag-tool'?
Download Here
I have noticed cases in which the link speed, mode or NIC driver settings were incorrect .....(Mainly in Red Hat 7.3)