He's running with stonith enabled (which I hardly ever do, it's hard on the hardware). I can reproduce his problem quite readily - I set the mem stat interval down to 5 minutes, and boom, every 5 minutes it dies.
It needs stonith for it to fail. I removed the stonith option and it didn't fail.
I can also make it happen when I send it a SIGUSR2.
It looks like if I send a SIGUSR2 to the highest process id, then it prints out the memory stats and dies -- or just dies...
I need to look at the config file when it comes back up... I had 8 processes, one of them a zombie on sgi2. I'm not sure how many the config wanted...
The config had two links: one serial and one udp. The processes we create are:
control process (parent process: runs last)
write process
read process
write process
read process
master status process
But, I had 8 processes at that time...
There seems to be something wrong here ;-)
Here's the pids from the logs:
310 Prints "configuration validated"
312 prints "udp heartbeat started on..."
320 Still running... Locked...
321 Still running...
322 Still running... Locked...
322 Still running... Locked...
323 Still running... NOT Locked...
324 prints local status now set to... and Heartbeat restart on... prints "link sgi2:eth0 up" Still running LOCKED prints "resource acquisition completed (none)" prints "Link sgi1:eth0 dead" prints "mach_down takeover complete"
644 defunct... prints "resource acquisition completed"
697 Control process...
697 Also prints heartbeat restart on... (?)
697 Also prints "link sgi2:eth0 up"
684 Control process prints "starting serial heartbeat on ..."
693 HBWRITE
694 HBREAD
696 HBWRITE
696 HBREAD
697: writes all the messages ;-) Master status process. prints and Heartbeat restart on... but only for local node
Found a fork in initiate_reset() with no exit at the end of the child's code... and the same thing in an error leg in req_our_resources(), and in giveup_resources().
This appears to have been the bug. After replacing the implicit return with an explicit exit, and doing so in a couple of other places in some funky error legs, I can't reproduce the problem any more.
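For readers unfamiliar with this class of bug, here is a minimal sketch of the pattern, not the actual heartbeat code. The names spawn_helper(), reap_helper() and do_child_work() are hypothetical, and the use of _exit() rather than exit() is my assumption (it avoids flushing stdio buffers inherited from the parent). If the child branch of a fork() simply returns instead of exiting, the child carries on executing the caller's code as a second copy of the parent, which would be consistent with the extra processes counted above.

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical stand-in for whatever the forked helper has to do. */
static int do_child_work(void)
{
    return 0;
}

/*
 * Fork a helper process.
 * Returns the child's pid in the parent, or -1 if fork() fails.
 * The child never returns from this function.
 */
pid_t spawn_helper(void)
{
    pid_t pid = fork();

    if (pid < 0) {
        perror("fork");
        return -1;
    }
    if (pid == 0) {
        /* Child: do the work, then terminate explicitly.
         * Without this _exit(), control would fall out of the if-block
         * and the child would return into the caller's code, running
         * as a duplicate of the parent. */
        int rc = do_child_work();
        _exit(rc == 0 ? 0 : 1);
    }
    /* Parent: only the parent ever reaches this return. */
    return pid;
}

/* Wait for a helper and return its exit status, or -1 on error
 * or if the child did not terminate normally. */
int reap_helper(pid_t pid)
{
    int status;

    if (waitpid(pid, &status, 0) < 0) {
        return -1;
    }
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

A caller would invoke spawn_helper(), carry on with its own work, and later pass the returned pid to reap_helper() so the child does not linger as a zombie.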
I had also gotten a bug report that the multicast option-parsing code didn't work. I had "broken" it by fixing the ppp-udp code. However, my change was correct, and the multicast parsing code was incorrect, so I fixed the multicast parsing code. In the process I discovered that even with the bug fix in, it still didn't work, because the install process didn't install the mcast code. So, now I have both the mcast code working and this bizarre Stonith bug fixed.
I've been running this test configuration with multicast (which I'd never tested before) and with the stonith fix (but with stonith turned off, because I suspect that the test code won't deal well with the machines getting rebooted each time they leave the cluster). Guess I ought to run a hundred iterations or so of that (and fix the test code if it's broken). Robert_Macaulay@Dell.com (the original bug reporter) is currently setting it up for testing on his machines. It seems pretty likely that it'll work just fine for him.
I ran 1000 iterations of the test code. The final results are:
2001/03/11_14:23:38 Running test Restart [1000]
2001/03/11_14:24:26 Stopping Cluster Manager on all nodes
2001/03/11_14:24:31 ****************
2001/03/11_14:24:31 Overall Results:{'BadNews': 0, 'success': 1000, 'failure': 0}
2001/03/11_14:24:31 ****************
2001/03/11_14:24:31 Detailed Results
2001/03/11_14:24:31 Test Restart:{'success': 524, 'WasStopped': 156, 'node:sgi1': 253, 'calls': 524, 'node:sgi2': 271, 'skipped': 0, 'failure': 0, 'auditfail':0}
2001/03/11_14:24:31 Test flip:{'down->up': 160, 'up->down': 316, 'success': 476, 'started': 160, 'calls': 476, 'stopped': 316, 'skipped': 0, 'failure': 0, 'auditfail': 0}
2001/03/11_14:24:31 <<<<<<<<<<<<<<<< TESTS COMPLETED
435.94user 75.19system 13:28:48elapsed 1%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (6605568major+2597838minor)pagefaults 0swaps
This is great news. I need to run another set of 1000, and then some other tests (probably involving the stonith_host option), and then we'll declare it stable I think. Many thanks to Aaron Nienhuis and Robert Macaulay for finding these bugs and saving our users from finding them in a "stable" release.