Hi all,
We have set up a four-node cluster in our organization, running the Fedora 10 operating system. The main purpose of the cluster is to run batch jobs using PBSPro (version 10.0). Here is the detailed configuration of the cluster:
Hardware : one Dell PowerEdge R610 and three Dell PowerEdge R410.
Operating System : Fedora release 10 with kernel 2.6.27.5-117.fc10.x86_64.
Software : MPICH2-1.2 to run MPI-based jobs.
Two nodes of the cluster crashed. This is what I found in /var/log/messages after rebooting:
Dec 7 12:05:01 dell-server kernel: oceanM[27938]: segfault at 18 ip 000000329327b042 sp 00007fff6cb53f00 error 4 in libc-2.9.so[3293200000+168000]
Dec 7 16:38:46 dell-server mpd: dell-server_43532 (handle_rhs_input 1209): lost rhs; re-entering ring
Dec 7 16:38:47 dell-server mpd: dell-server_43532 (reenter_ring 843): reenter_ring rc=0 after numTries=1
Dec 7 16:38:47 dell-server mpd: dell-server_43532 (handle_rhs_input 1214): back in ring
Dec 7 16:39:03 dell-server mpd: dell-server_43532 (runmainloop 320): no pulse_ack from rhs; re-entering ring
Dec 7 16:39:04 dell-server mpd: dell-server_43532 (reenter_ring 843): reenter_ring rc=0 after numTries=1
Dec 7 16:39:04 dell-server mpd: dell-server_43532 (runmainloop 325): back in ring
Dec 7 16:44:13 dell-server mpd: dell-server_48202 (runmainloop 320): no pulse_ack from rhs; re-entering ring
Dec 7 16:44:14 dell-server mpd: dell-server_48202 (reenter_ring 843): reenter_ring rc=0 after numTries=1
Dec 7 16:44:14 dell-server mpd: dell-server_48202 (runmainloop 325): back in ring
Dec 7 16:44:46 dell-server kernel: BUG: soft lockup - CPU#1 stuck for 61s! [oceanM:29477]
Dec 7 16:44:46 dell-server kernel: Modules linked in: nfsd auth_rpcgss exportfs nfs lockd nfs_acl sunrpc ipv6 dm_multipath uinput dcdbas pcspkr iTCO_wdt iTCO_vendor_support serio_raw bnx2 ses enclosure joydev shpchp megaraid_sas [last unloaded: freq_table]
Dec 7 16:44:46 dell-server kernel: CPU 1:
Dec 7 16:44:46 dell-server kernel: Modules linked in: nfsd auth_rpcgss exportfs nfs lockd nfs_acl sunrpc ipv6 dm_multipath uinput dcdbas pcspkr iTCO_wdt iTCO_vendor_support serio_raw bnx2 ses enclosure joydev shpchp megaraid_sas [last unloaded: freq_table]
Dec 7 16:44:46 dell-server kernel: Pid: 29477, comm: oceanM Not tainted 2.6.27.5-117.fc10.x86_64 #1
Dec 7 16:44:46 dell-server kernel: RIP: 0010:[<ffffffff812e1dcf>] [<ffffffff812e1dcf>] tcp_transmit_skb+0x1cf/0x64e
Dec 7 16:44:46 dell-server kernel: RSP: 0000:ffff88063e49fa10 EFLAGS: 00000202
Dec 7 16:44:46 dell-server kernel: RAX: 0000000000000110 RBX: ffff88063e49fa80 RCX: ffff880332d22d10
Dec 7 16:44:46 dell-server kernel: RDX: ffff88063e49fa00 RSI: 0000000000000020 RDI: ffff88056f038000
Dec 7 16:44:46 dell-server kernel: RBP: ffff88063e49f990 R08: 000000004e02a8c0 R09: 0000000000017d2b
Dec 7 16:44:46 dell-server kernel: R10: 0000000000000020 R11: ffff88063e49f890 R12: ffffffff810113d8
Dec 7 16:44:46 dell-server kernel: R13: ffff88063e49f990 R14: 0000000000000000 R15: ffff88056fc3f500
Dec 7 16:44:46 dell-server kernel: FS: 00007f58123296f0(0000) GS:ffff88033e42d300(0000) knlGS:0000000000000000
Dec 7 16:44:46 dell-server kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Dec 7 16:44:46 dell-server kernel: CR2: 00007f580fde7e48 CR3: 000000033e519000 CR4: 00000000000006e0
Dec 7 16:44:46 dell-server kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Dec 7 16:44:46 dell-server kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Dec 7 16:44:46 dell-server kernel:
Dec 7 16:44:46 dell-server kernel: Call Trace:
Dec 7 16:44:46 dell-server kernel: <IRQ> [<ffffffff81010a07>] ? restore_args+0x0/0x30
Dec 7 16:44:46 dell-server kernel: [<ffffffff812e2416>] ? tcp_send_ack+0xfd/0x101
Dec 7 16:44:46 dell-server kernel: [<ffffffff812dfb57>] ? __tcp_ack_snd_check+0x65/0x7d
Dec 7 16:44:46 dell-server kernel: [<ffffffff812e03e8>] ? tcp_rcv_established+0x5b3/0x84d
Dec 7 16:44:46 dell-server kernel: [<ffffffff812e7a28>] ? tcp_v4_do_rcv+0x1dd/0x38b
Dec 7 16:44:46 dell-server kernel: [<ffffffff81010a07>] ? restore_args+0x0/0x30
Dec 7 16:44:46 dell-server kernel: [<ffffffff810258cb>] ? __ticket_spin_lock+0xe/0x1a
Dec 7 16:44:46 dell-server kernel: [<ffffffff812e804b>] ? tcp_v4_rcv+0x475/0x6a8
Dec 7 16:44:46 dell-server kernel: [<ffffffff812cda35>] ? ip_local_deliver_finish+0x0/0x19f
Dec 7 16:44:46 dell-server kernel: [<ffffffff812cdb38>] ? ip_local_deliver_finish+0x103/0x19f
Dec 7 16:44:46 dell-server kernel: [<ffffffff812cdc46>] ? ip_local_deliver+0x72/0x7a
Dec 7 16:44:46 dell-server kernel: [<ffffffff812cd72d>] ? ip_rcv_finish+0x305/0x321
Dec 7 16:44:46 dell-server kernel: [<ffffffff812cd9a7>] ? ip_rcv+0x25e/0x294
Dec 7 16:44:46 dell-server kernel: [<ffffffff812a5694>] ? netif_receive_skb+0x3cb/0x3f0
Dec 7 16:44:46 dell-server kernel: [<ffffffffa003422a>] ? bnx2_poll_work+0x97f/0xad2 [bnx2]
Dec 7 16:44:46 dell-server kernel: [<ffffffff81021e1e>] ? ack_apic_level+0x3d/0xe8
Dec 7 16:44:46 dell-server kernel: [<ffffffff81337a40>] ? bad_gs+0x1593/0x2563
Dec 7 16:44:46 dell-server kernel: [<ffffffff81010a07>] ? restore_args+0x0/0x30
Dec 7 16:44:46 dell-server kernel: [<ffffffffa00345c5>] ? bnx2_poll+0x11a/0x1e9 [bnx2]
Dec 7 16:44:46 dell-server kernel: [<ffffffff812a3c7d>] ? net_rx_action+0xd4/0x1fd
Dec 7 16:44:46 dell-server kernel: [<ffffffff81046b22>] ? __do_softirq+0x7e/0x10c
Dec 7 16:44:46 dell-server kernel: [<ffffffff81011bcc>] ? call_softirq+0x1c/0x28
Dec 7 16:44:46 dell-server kernel: [<ffffffff81012dd2>] ? do_softirq+0x4d/0xb0
Dec 7 16:44:46 dell-server kernel: [<ffffffff810466f7>] ? irq_exit+0x4e/0x9d
Dec 7 16:44:46 dell-server kernel: [<ffffffff810130ee>] ? do_IRQ+0x147/0x169
Dec 7 16:44:46 dell-server kernel: [<ffffffff81010933>] ? ret_from_intr+0x0/0x2e
Dec 7 16:44:46 dell-server kernel: <EOI> [<ffffffff81010a56>] ? retint_careful+0x14/0x6c
Dec 7 16:44:46 dell-server kernel: [<ffffffff81010a4d>] ? retint_careful+0xb/0x6c
Dec 7 16:44:46 dell-server kernel:
Dec 7 16:44:51 dell-server kernel: BUG: soft lockup - CPU#6 stuck for 61s! [oceanM:29473]
Dec 7 16:44:51 dell-server kernel: Modules linked in: nfsd auth_rpcgss exportfs nfs lockd nfs_acl sunrpc ipv6 dm_multipath uinput dcdbas pcspkr iTCO_wdt iTCO_vendor_support serio_raw bnx2 ses enclosure joydev shpchp megaraid_sas [last unloaded: freq_table]
Dec 7 16:44:51 dell-server kernel: CPU 6:
Dec 7 16:44:51 dell-server kernel: Modules linked in: nfsd auth_rpcgss exportfs nfs lockd nfs_acl sunrpc ipv6 dm_multipath uinput dcdbas pcspkr iTCO_wdt iTCO_vendor_support serio_raw bnx2 ses enclosure joydev shpchp megaraid_sas [last unloaded: freq_table]
Dec 7 16:44:51 dell-server kernel: Pid: 29473, comm: oceanM Not tainted 2.6.27.5-117.fc10.x86_64 #1
Dec 7 16:44:51 dell-server kernel: RIP: 0010:[<ffffffff810258d3>] [<ffffffff810258d3>] __ticket_spin_lock+0x16/0x1a
Dec 7 16:44:51 dell-server kernel: RSP: 0018:ffff88062d1ffab8 EFLAGS: 00000297
Dec 7 16:44:51 dell-server kernel: RAX: 0000000000007c7b RBX: ffff88062d1ffab8 RCX: 0000000000000003
Dec 7 16:44:51 dell-server kernel: RDX: 0000000000000000 RSI: ffff88062d1fe010 RDI: ffff88056fc3f540
Dec 7 16:44:51 dell-server kernel: RBP: ffffffffff5fc380 R08: 0000000000000040 R09: 0000000000000000
Dec 7 16:44:51 dell-server kernel: R10: ffffffff814e8000 R11: 0000000000000246 R12: 0000000000001fbc
Dec 7 16:44:51 dell-server kernel: R13: ffff8803c9377000 R14: ffff88062d1fe000 R15: ffffffff816db990
Dec 7 16:44:51 dell-server kernel: FS: 00007ff739e016f0(0000) GS:ffff88063e44d300(0000) knlGS:0000000000000000
Dec 7 16:44:51 dell-server kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Dec 7 16:44:51 dell-server kernel: CR2: 00007ff73686fe48 CR3: 000000062d163000 CR4: 00000000000006e0
Dec 7 16:44:51 dell-server kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Dec 7 16:44:51 dell-server kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400

Dec 7 16:46:57 dell-server kernel:
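In case it helps anyone reading the log, here is a minimal Python sketch that pulls out just the soft lockup events (time, CPU, process, PID). The function name `find_soft_lockups` and the regular expression are my own illustration, written only against the line format shown above; other kernels may format the message differently.

```python
import re

# Matches the "BUG: soft lockup" report line in the format seen in the log above.
# Assumption: classic syslog prefix "Mon D HH:MM:SS host kernel:".
LOCKUP_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+[\d:]{8})\s+(?P<host>\S+)\s+kernel:\s+"
    r"BUG: soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<comm>[^:\]]+):(?P<pid>\d+)\]"
)


def find_soft_lockups(lines):
    """Return one dict per soft lockup report found in the given log lines."""
    events = []
    for line in lines:
        match = LOCKUP_RE.match(line)
        if match is None:
            continue  # not a soft lockup report line
        events.append({
            "stamp": match.group("stamp"),
            "host": match.group("host"),
            "cpu": int(match.group("cpu")),
            "seconds": int(match.group("secs")),
            "comm": match.group("comm"),
            "pid": int(match.group("pid")),
        })
    return events


if __name__ == "__main__":
    # Two lines copied from the log above, plus one unrelated line.
    sample = [
        "Dec 7 16:44:14 dell-server mpd: dell-server_48202 (runmainloop 325): back in ring",
        "Dec 7 16:44:46 dell-server kernel: BUG: soft lockup - CPU#1 stuck for 61s! [oceanM:29477]",
        "Dec 7 16:44:51 dell-server kernel: BUG: soft lockup - CPU#6 stuck for 61s! [oceanM:29473]",
    ]
    for event in find_soft_lockups(sample):
        print(event)
```

To run it on the real file, something like `find_soft_lockups(open("/var/log/messages"))` should work, given read permission on the log.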
Can anyone please help me out?
Why did the system crash? It has happened twice with the same error messages.
Thanks in advance.