Hi All,
I've been running a Fedora server for over a year now, and I have seen a very strange issue that makes it really uncomfortable to use and that I cannot manage to solve.
The problem is as follows: every 20-30 minutes or so (the interval is quite random), the server completely hangs.
I use it mainly remotely, through ssh and nfs. What I see when it "hangs" is the following:
-no ping response
-nfs stalled
-ssh sessions hang (for example, if I run "top", it simply stops updating)
-disk activity LED stays completely off (in normal operation it is almost always blinking, even from the server's own "internal" activity, so a network disconnection doesn't explain it either)
The strangest thing is that, to resume it, I basically have 3 options: wait (sometimes a few tens of seconds, usually between 2 and 4 minutes!), just
hit a key on the keyboard!, or do something like plugging or unplugging any kind of USB peripheral (which makes me think that some interrupt mechanism is stalled).
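A minimal sketch of one way to look into that interrupt suspicion, assuming the per-source counters that Linux exposes in /proc/interrupts are a useful place to start. The helper names changed_irqs and snapshot_and_compare and the 10-second interval are made up for illustration. It lists the interrupt lines whose counters moved between two snapshots, so a source that stays silent while the machine is otherwise busy might deserve a closer look.

```shell
#!/bin/sh
# Print the lines of the second snapshot that do not appear verbatim in the
# first one, i.e. the interrupt sources whose counters changed in between.
# $1: earlier snapshot file, $2: later snapshot file
changed_irqs() {
    grep -F -x -v -f "$1" "$2"
}

# Take two snapshots of an interrupt table and compare them.
# $1: file to read (default /proc/interrupts)
# $2: seconds to wait between snapshots (default 10, an arbitrary choice)
snapshot_and_compare() {
    src=${1:-/proc/interrupts}
    interval=${2:-10}
    before=$(mktemp) && after=$(mktemp) || return 1
    cat "$src" > "$before"
    sleep "$interval"
    cat "$src" > "$after"
    changed_irqs "$before" "$after"
    rm -f "$before" "$after"
}
```

Calling snapshot_and_compare with no arguments reads /proc/interrupts twice, 10 seconds apart, and prints only the lines that changed.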
When it "wakes up", I see my "top" session over ssh suddenly refresh hundreds of times in quick succession (catching up on everything it missed during the "pause"), ping reports that all the packets were actually received (with long round-trip times, for example packet 1: 80xxx ms, packet 2: 79xxx ms, packet 3: 78xxx ms ... packet 79: 1xxx ms, packet 80: 0.xxx ms), the disk is quite busy for a few seconds, and everything is back to normal!!
Furthermore, and I think this is related, my clock drifts by a few hours per day (a hold of about 3 minutes every 20-30 minutes makes me think there is a connection!)
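A quick sanity check of that arithmetic, as a sketch under two assumptions: the clock stops for the whole stall, and the 20-30 minutes are counted from the start of one stall to the start of the next. The function name lost_minutes_per_day is made up for illustration.

```shell
#!/bin/sh
# Minutes of wall-clock time lost per day if the clock stops during each stall.
# $1: stall length in minutes, $2: minutes from one stall to the next
lost_minutes_per_day() {
    echo $(( 1440 * $1 / $2 ))   # 1440 = minutes in a day
}

for period in 20 30; do
    lost=$(lost_minutes_per_day 3 "$period")
    echo "one 3-minute stall every $period minutes: $lost minutes lost per day"
done
```

This gives 216 and 144 minutes, so roughly 2.4 to 3.6 hours per day, which is in line with the drift of a few hours.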
I tried to set up ntpd to compensate for it, but sometimes the suspensions are just too long, and I ended up with
Code:
Nov 27 01:22:36 server ntpd[699]: 0.0.0.0 0617 07 panic_stop +1203 s; set clock manually within 1000 s
and ntpd dies...
You'll ask me to have a look at the logs, which I did with
Code:
tail -f /var/log/* /var/log/*/*
, but when the suspension happens, there is absolutely nothing new in any of those files!
Version information :
Code:
uname -a
Linux vmserver.grelot.net 2.6.35.6-48.fc14.x86_64 #1 SMP Fri Oct 22 15:36:08 UTC 2010 x86_64 x86_64 x86_64 GNU/Linux
I said in the title "F13-F14", because I already had the problem with F13, but after some time it disappeared (I may have changed some configuration, I have to admit...). I still have a backup of the "/etc" tree of that "working" F13.
The hardware for F13[working] and F14[not_working] is the same: Phenom II X4 on an ASUS M3N78 PRO, data on RAID 10, system on a separate SATA disk, 8 GB RAM, some qemu VMs running (between 4 and 6).
It would be really great if someone had an idea about this problem: it has annoyed me for several months now, and I cannot stay calm anymore when I suddenly cannot do anything but wait, since I'm not near enough to press the "Ctrl" key on the keyboard!!
Thanks a lot in advance,
Goulou.