Fedora Linux Support Community & Resources Center
  #1  
Old 12th December 2009, 06:47 AM
secmakc Offline
Registered User
 
Join Date: Dec 2009
Posts: 1
Kernel error: BUG: soft lockup - CPU#1 stuck for 61s!

Hi all,
We have configured a four-node cluster in our organization, running Fedora 10. Its main purpose is to run batch jobs using PBSPro (version 10.0). Here is the detailed configuration of the cluster:
Hardware: one Dell PowerEdge R610 and three Dell PowerEdge R410.
Operating System: Fedora release 10 with kernel 2.6.27.5-117.fc10.x86_64.
Software: MPICH2-1.2 to run MPI-based jobs.
Two nodes of the cluster crashed. This is what I found in /var/log/messages after rebooting:
Dec 7 12:05:01 dell-server kernel: oceanM[27938]: segfault at 18 ip 000000329327b042 sp 00007fff6cb53f00 error 4 in libc-2.9.so[3293200000+168000]
Dec 7 16:38:46 dell-server mpd: dell-server_43532 (handle_rhs_input 1209): lost rhs; re-entering ring
Dec 7 16:38:47 dell-server mpd: dell-server_43532 (reenter_ring 843): reenter_ring rc=0 after numTries=1
Dec 7 16:38:47 dell-server mpd: dell-server_43532 (handle_rhs_input 1214): back in ring
Dec 7 16:39:03 dell-server mpd: dell-server_43532 (runmainloop 320): no pulse_ack from rhs; re-entering ring
Dec 7 16:39:04 dell-server mpd: dell-server_43532 (reenter_ring 843): reenter_ring rc=0 after numTries=1
Dec 7 16:39:04 dell-server mpd: dell-server_43532 (runmainloop 325): back in ring
Dec 7 16:44:13 dell-server mpd: dell-server_48202 (runmainloop 320): no pulse_ack from rhs; re-entering ring
Dec 7 16:44:14 dell-server mpd: dell-server_48202 (reenter_ring 843): reenter_ring rc=0 after numTries=1
Dec 7 16:44:14 dell-server mpd: dell-server_48202 (runmainloop 325): back in ring
Dec 7 16:44:46 dell-server kernel: BUG: soft lockup - CPU#1 stuck for 61s! [oceanM:29477]
Dec 7 16:44:46 dell-server kernel: Modules linked in: nfsd auth_rpcgss exportfs nfs lockd nfs_acl sunrpc ipv6 dm_multipath uinput dcdbas pcspkr iTCO_wdt iTCO_vendor_support serio_raw bnx2 ses enclosure joydev shpchp megaraid_sas [last unloaded: freq_table]
Dec 7 16:44:46 dell-server kernel: CPU 1:
Dec 7 16:44:46 dell-server kernel: Modules linked in: nfsd auth_rpcgss exportfs nfs lockd nfs_acl sunrpc ipv6 dm_multipath uinput dcdbas pcspkr iTCO_wdt iTCO_vendor_support serio_raw bnx2 ses enclosure joydev shpchp megaraid_sas [last unloaded: freq_table]
Dec 7 16:44:46 dell-server kernel: Pid: 29477, comm: oceanM Not tainted 2.6.27.5-117.fc10.x86_64 #1
Dec 7 16:44:46 dell-server kernel: RIP: 0010:[<ffffffff812e1dcf>] [<ffffffff812e1dcf>] tcp_transmit_skb+0x1cf/0x64e
Dec 7 16:44:46 dell-server kernel: RSP: 0000:ffff88063e49fa10 EFLAGS: 00000202
Dec 7 16:44:46 dell-server kernel: RAX: 0000000000000110 RBX: ffff88063e49fa80 RCX: ffff880332d22d10
Dec 7 16:44:46 dell-server kernel: RDX: ffff88063e49fa00 RSI: 0000000000000020 RDI: ffff88056f038000
Dec 7 16:44:46 dell-server kernel: RBP: ffff88063e49f990 R08: 000000004e02a8c0 R09: 0000000000017d2b
Dec 7 16:44:46 dell-server kernel: R10: 0000000000000020 R11: ffff88063e49f890 R12: ffffffff810113d8
Dec 7 16:44:46 dell-server kernel: R13: ffff88063e49f990 R14: 0000000000000000 R15: ffff88056fc3f500
Dec 7 16:44:46 dell-server kernel: FS: 00007f58123296f0(0000) GS:ffff88033e42d300(0000) knlGS:0000000000000000
Dec 7 16:44:46 dell-server kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Dec 7 16:44:46 dell-server kernel: CR2: 00007f580fde7e48 CR3: 000000033e519000 CR4: 00000000000006e0
Dec 7 16:44:46 dell-server kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Dec 7 16:44:46 dell-server kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Dec 7 16:44:46 dell-server kernel:
Dec 7 16:44:46 dell-server kernel: Call Trace:
Dec 7 16:44:46 dell-server kernel: <IRQ> [<ffffffff81010a07>] ? restore_args+0x0/0x30
Dec 7 16:44:46 dell-server kernel: [<ffffffff812e2416>] ? tcp_send_ack+0xfd/0x101
Dec 7 16:44:46 dell-server kernel: [<ffffffff812dfb57>] ? __tcp_ack_snd_check+0x65/0x7d
Dec 7 16:44:46 dell-server kernel: [<ffffffff812e03e8>] ? tcp_rcv_established+0x5b3/0x84d
Dec 7 16:44:46 dell-server kernel: [<ffffffff812e7a28>] ? tcp_v4_do_rcv+0x1dd/0x38b
Dec 7 16:44:46 dell-server kernel: [<ffffffff81010a07>] ? restore_args+0x0/0x30
Dec 7 16:44:46 dell-server kernel: [<ffffffff810258cb>] ? __ticket_spin_lock+0xe/0x1a
Dec 7 16:44:46 dell-server kernel: [<ffffffff812e804b>] ? tcp_v4_rcv+0x475/0x6a8
Dec 7 16:44:46 dell-server kernel: [<ffffffff812cda35>] ? ip_local_deliver_finish+0x0/0x19f
Dec 7 16:44:46 dell-server kernel: [<ffffffff812cdb38>] ? ip_local_deliver_finish+0x103/0x19f
Dec 7 16:44:46 dell-server kernel: [<ffffffff812cdc46>] ? ip_local_deliver+0x72/0x7a
Dec 7 16:44:46 dell-server kernel: [<ffffffff812cd72d>] ? ip_rcv_finish+0x305/0x321
Dec 7 16:44:46 dell-server kernel: [<ffffffff812cd9a7>] ? ip_rcv+0x25e/0x294
Dec 7 16:44:46 dell-server kernel: [<ffffffff812a5694>] ? netif_receive_skb+0x3cb/0x3f0
Dec 7 16:44:46 dell-server kernel: [<ffffffffa003422a>] ? bnx2_poll_work+0x97f/0xad2 [bnx2]
Dec 7 16:44:46 dell-server kernel: [<ffffffff81021e1e>] ? ack_apic_level+0x3d/0xe8
Dec 7 16:44:46 dell-server kernel: [<ffffffff81337a40>] ? bad_gs+0x1593/0x2563
Dec 7 16:44:46 dell-server kernel: [<ffffffff81010a07>] ? restore_args+0x0/0x30
Dec 7 16:44:46 dell-server kernel: [<ffffffffa00345c5>] ? bnx2_poll+0x11a/0x1e9 [bnx2]
Dec 7 16:44:46 dell-server kernel: [<ffffffff812a3c7d>] ? net_rx_action+0xd4/0x1fd
Dec 7 16:44:46 dell-server kernel: [<ffffffff81046b22>] ? __do_softirq+0x7e/0x10c
Dec 7 16:44:46 dell-server kernel: [<ffffffff81011bcc>] ? call_softirq+0x1c/0x28
Dec 7 16:44:46 dell-server kernel: [<ffffffff81012dd2>] ? do_softirq+0x4d/0xb0
Dec 7 16:44:46 dell-server kernel: [<ffffffff810466f7>] ? irq_exit+0x4e/0x9d
Dec 7 16:44:46 dell-server kernel: [<ffffffff810130ee>] ? do_IRQ+0x147/0x169
Dec 7 16:44:46 dell-server kernel: [<ffffffff81010933>] ? ret_from_intr+0x0/0x2e
Dec 7 16:44:46 dell-server kernel: <EOI> [<ffffffff81010a56>] ? retint_careful+0x14/0x6c
Dec 7 16:44:46 dell-server kernel: [<ffffffff81010a4d>] ? retint_careful+0xb/0x6c
Dec 7 16:44:46 dell-server kernel:
Dec 7 16:44:51 dell-server kernel: BUG: soft lockup - CPU#6 stuck for 61s! [oceanM:29473]
Dec 7 16:44:51 dell-server kernel: Modules linked in: nfsd auth_rpcgss exportfs nfs lockd nfs_acl sunrpc ipv6 dm_multipath uinput dcdbas pcspkr iTCO_wdt iTCO_vendor_support serio_raw bnx2 ses enclosure joydev shpchp megaraid_sas [last unloaded: freq_table]
Dec 7 16:44:51 dell-server kernel: CPU 6:
Dec 7 16:44:51 dell-server kernel: Modules linked in: nfsd auth_rpcgss exportfs nfs lockd nfs_acl sunrpc ipv6 dm_multipath uinput dcdbas pcspkr iTCO_wdt iTCO_vendor_support serio_raw bnx2 ses enclosure joydev shpchp megaraid_sas [last unloaded: freq_table]
Dec 7 16:44:51 dell-server kernel: Pid: 29473, comm: oceanM Not tainted 2.6.27.5-117.fc10.x86_64 #1
Dec 7 16:44:51 dell-server kernel: RIP: 0010:[<ffffffff810258d3>] [<ffffffff810258d3>] __ticket_spin_lock+0x16/0x1a
Dec 7 16:44:51 dell-server kernel: RSP: 0018:ffff88062d1ffab8 EFLAGS: 00000297
Dec 7 16:44:51 dell-server kernel: RAX: 0000000000007c7b RBX: ffff88062d1ffab8 RCX: 0000000000000003
Dec 7 16:44:51 dell-server kernel: RDX: 0000000000000000 RSI: ffff88062d1fe010 RDI: ffff88056fc3f540
Dec 7 16:44:51 dell-server kernel: RBP: ffffffffff5fc380 R08: 0000000000000040 R09: 0000000000000000
Dec 7 16:44:51 dell-server kernel: R10: ffffffff814e8000 R11: 0000000000000246 R12: 0000000000001fbc
Dec 7 16:44:51 dell-server kernel: R13: ffff8803c9377000 R14: ffff88062d1fe000 R15: ffffffff816db990
Dec 7 16:44:51 dell-server kernel: FS: 00007ff739e016f0(0000) GS:ffff88063e44d300(0000) knlGS:0000000000000000
Dec 7 16:44:51 dell-server kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Dec 7 16:44:51 dell-server kernel: CR2: 00007ff73686fe48 CR3: 000000062d163000 CR4: 00000000000006e0
Dec 7 16:44:51 dell-server kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Dec 7 16:44:51 dell-server kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400

Dec 7 16:46:57 dell-server kernel:
Can anyone please help me out?
Why did the system crash? It has happened twice with the same error messages.
Thanks in advance.
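
For anyone triaging a log like the one above, it can help to pull out just the soft-lockup reports and see which CPUs and processes are involved. The following is a minimal sketch in Python and not part of the original post: the function names find_soft_lockups and summarize are made up for illustration, and the regular expression assumes the exact message format shown in the log above.

```python
import re
from collections import Counter

# Assumed format, taken from the log lines quoted in this thread:
#   ... kernel: BUG: soft lockup - CPU#1 stuck for 61s! [oceanM:29477]
SOFT_LOCKUP = re.compile(
    r"BUG: soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<comm>[^:\]]+):(?P<pid>\d+)\]"
)


def find_soft_lockups(lines):
    """Return one dict per soft-lockup report found in an iterable of log lines."""
    events = []
    for line in lines:
        match = SOFT_LOCKUP.search(line)
        if match:
            events.append({
                "cpu": int(match.group("cpu")),
                "seconds": int(match.group("secs")),
                "comm": match.group("comm"),
                "pid": int(match.group("pid")),
            })
    return events


def summarize(events):
    """Count soft-lockup reports per process name (the 'comm' field)."""
    return Counter(event["comm"] for event in events)
```

Passing an open file object for /var/log/messages to find_soft_lockups would return one entry per report. In the log above, both reports name the same program (oceanM) on different CPUs, which is the kind of pattern summarize is meant to surface.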
  #2  
Old 26th December 2009, 07:17 PM
richardpitt Offline
Registered User
 
Join Date: Dec 2009
Posts: 2
I've been getting the same message on two SMP machines, an AMD quad-core and an Intel i7, both under fairly heavy CPU load at the time.
The AMD is running FC11 with kernel 2.6.30.5-43.fc11.x86_64.
The Intel i7 is running FC12 with kernel 2.6.31.6-162.fc12.x86_64.

Kernel.org seems to think the fix is this patch http://git.kernel.org/tip/bfeed8fcf9...4e77fd49930e36 mentioned in this post http://bugzilla.kernel.org/show_bug.cgi?id=14289 but of course those of us using Fedora don't usually compile our own kernels. For now I've pinned my AMD at full speed by turning cpuspeed off; so far this has at least delayed the onset. I've just done the same to the i7 and will report back if this fails. I don't expect it to be a real solution; only getting the above patch in will be.
The i7 is remote, in a place I can only access after calling for someone to unlock it, sometimes as much as a day later. There the system seems to reboot itself after a while (many hours). I once managed to get a "ps -ef | wc" command to run, and the machine appeared to reboot after the process table filled up with cron jobs that had started but never finished.
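
One way to check whether frequency scaling is still active after turning cpuspeed off is to read the cpufreq governor of each CPU from sysfs. This is a rough sketch in Python under stated assumptions, not something from the original post: the helper name read_governors is hypothetical, and it assumes the kernel exposes cpufreq at /sys/devices/system/cpu/cpuN/cpufreq/scaling_governor. An empty result would suggest that no cpufreq driver is loaded, in which case the CPUs are not being scaled at all.

```python
import glob
import os


def read_governors(sysfs_root="/sys/devices/system/cpu"):
    """Map each CPU directory name (e.g. 'cpu0') to its cpufreq scaling governor.

    Returns an empty dict when no CPU exposes a cpufreq directory.
    """
    governors = {}
    pattern = os.path.join(sysfs_root, "cpu[0-9]*", "cpufreq", "scaling_governor")
    for path in sorted(glob.glob(pattern)):
        # path looks like <root>/cpu0/cpufreq/scaling_governor
        cpu_name = path.split(os.sep)[-3]
        with open(path) as handle:
            governors[cpu_name] = handle.read().strip()
    return governors
```

If any CPU still reports a governor such as "ondemand", its clock can still change under load, so the workaround would not be fully in effect on that machine.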

richard