I'm back, even if late ;-)
As I didn't have any dump or trace, debugging was nearly impossible and opening a PMR with no debugging material would have been useless (or they would have asked me to reproduce the case... which I am not ready to attempt).
BUT, I got a rather similar problem on another lpar and that helped me to better understand how and when 'ARSYSPIN' started to die ... bringing the running lpar and all the others down with it.
Moreover, I think we found a way to avoid this bad case (even if it is still to be checked and checked again) or, at least, to avoid a sysplex-wide crash, should this case still occur.
First of all, the issue does not seem to be limited to the case where two ARSYSPINs are running on the same lpar. It may happen with a single ARSYSPIN.
1) For some obscure reason, ARSYSPIN seems to be looping, doing large amounts of I/O and consuming a lot of CPU cycles.
2) As it does not archive anything, after some long minutes/hours, we may be tempted to do a 'P ARSYSPIN'. First mistake... but still nothing dramatic at this point.
3) As ARSYSPIN does not stop, we decide to 'CANCEL' it. Second mistake, and a far more harmful one:
after some minutes we got messages reporting PDSE problems on the 'SARSLOAD' library and asking for a PDSE analysis.
The analysis shows a 'latch' held by ARSYSPIN on SARSLOAD.
If you don't issue the 'V SMS,PDSE,FREELATCH' (or don't remember the syntax of this command) quickly enough, SMSPDSE on the other lpars of the sysplex gets
locked as well, and you see messages appearing, saying that the 'faulty' lpar is not responding and seems to be stalled.
If you enter the freelatch or 'deactivate' the faulty lpar quickly enough, the sky clears and the other lpars will survive.
4) Now, why does ARSYSPIN start a kind of loop in some cases?
5) I noticed that on the two lpars where I got this issue,
many outputs were still waiting to be archived (nothing dramatic anyway, just 400-500 outputs)
and, above all, the parameter MAXJESDELAY was not coded in ARSYparm (on the 'safe' lpars, MAXJESDELAY was = 5).
6) Keeping my 'vary SMS,PDSE,analysis and freelatch' toolbox ready, I coded MAXJESDELAY=2 and resubmitted ARSYSPIN on the restarted lpar (no outputs purged).
7) Then, miracle, ARSYSPIN started doing I/O and consuming CPU but... archived all outputs rather quickly (in several blocks, due to MAXFC=250),
and I could stop it (P ARSYSPIN) without any problem.
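For reference, here is a sketch of the PDSE commands from step 3, as I understand the syntax (please check it against the MVS System Commands manual for your z/OS level before relying on it). latchaddr, asid and tcbaddr are only placeholders for the values reported in the ANALYSIS output, not real values. If the latch is reported for the restartable SMSPDSE1 address space rather than SMSPDSE, I believe the same commands are entered with PDSE1 instead of PDSE.

```
V SMS,PDSE,ANALYSIS
V SMS,PDSE,FREELATCH(latchaddr,asid,tcbaddr)
```

The first command reports the latch holder; the second one releases the latch, using the three values taken from that report.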
As I read it, MAXJESDELAY should have defaulted to 60 seconds when not coded. So maybe we should have been more patient and waited a few more hours to
see it archive some outputs...
But I'm still amazed that only 400-500 outputs (jobs) + a MAXJESDELAY=60 can 'cost' so many I/Os and CPU cycles.
In addition, I still don't understand why ARSYSPIN was holding this 'latch' on the SARSLOAD library.
If any member of the user group has also had problems related to MAXJESDELAY, I would appreciate it if you shared the info.
Jean