Topic: Running 2 versions (8.4.1 and 8.5) of ARSYSPIN on one lpar crashes... the lpar

JGODFROID

Hello. I'm busy migrating CMOD for z/OS from 8.4.1 to 8.5 (all maintenance applied). We are running z/OS 2.1.
I can play in a 'system sandbox' lpar (which is, however, a member of a 'prodplex').
As there was a CMOD 8.4 test environment already set up on this lpar, I reused it.
After installing CMOD 8.5 for z/OS, I customized an 'ARSSOCKD' a bit (steplib, arsbin, plus changes in 'parms' to target the DB2 server running on the sandbox) and started it.
I also customized an ARSYSPIN job (i.e. STEPLIB pointing to the new SARSLOAD and an ARSBIN DD pointing to the new CMOD 8.5 bin, ...). This worked perfectly, and I saw spool files correctly archived and readable using a CMOD 8.5.0.6 client.
Then I stopped these new ARSSOCKD and ARSYSPIN and started an 'old version' ARSYSPIN (CMOD 8.4.1 running in the sandbox but targeting a big DB2 server in a prod lpar). This old ARSYSPIN worked perfectly and archived spool files correctly.
Then I started the new ARSSOCKD and a new ARSYSPIN (both running CMOD 8.5 and targeting the sandbox DB2). And then it was a disaster: the new ARSYSPIN consumed a lot of CPU cycles and did huge amounts of I/O (without archiving anything). After some long minutes I decided to cancel it, without success... then SMSPDSE crashed and ZFS crashed, with some impact on the whole prodplex (ZFS on some other LPARs crashed as well).
Frankly speaking I'm getting scared about going on with testing   :-\
Has anybody heard about possible coexistence problems when two versions of ARSYSPIN/ARSLOAD run on the same lpar?
My feeling is that the problem is somewhere in OMVS... but I have no evidence for that.
The only difference I saw is that in /usr/lpp/ars841 there is a symlink V8R4M1 pointing to ... /usr/lpp/ars841,
while in /usr/lpp/ars850 the symlink V8R5M0 did not exist (I have created one in the meantime... but didn't dare to retry the failing scenario).
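
For anyone wanting to compare their own install, the symlink layout described above can be checked or recreated from the z/OS UNIX shell. This is only a sketch, assuming the 8.5 directory should mirror the 8.4.1 one (a V8R5M0 link pointing back at /usr/lpp/ars850). The function name ensure_version_link is made up for illustration, and nothing here establishes that the missing link had anything to do with the crash.

```shell
# Hypothetical helper: make sure <install_dir>/<link_name> is a symlink
# pointing back at <install_dir>, mirroring the V8R4M1 -> /usr/lpp/ars841
# layout described in the post.
ensure_version_link() {
    install_dir=$1
    link_name=$2
    link_path="$install_dir/$link_name"

    if [ -L "$link_path" ]; then
        # Already a symlink: report it and change nothing.
        echo "exists: $link_path"
    elif [ -e "$link_path" ]; then
        # Something else (file or directory) has that name: do not touch it.
        echo "not a symlink, left untouched: $link_path" >&2
        return 1
    else
        ln -s "$install_dir" "$link_path" && echo "created: $link_path"
    fi
}

# Usage with the paths mentioned in the post (run on a test system first):
# ensure_version_link /usr/lpp/ars850 V8R5M0
```

The function deliberately refuses to overwrite anything that already exists under that name, so it can be rerun safely.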

Tks upfront for any help, idea,...   I will appreciate anything, including that I was stupid to attempt such coexistence  ;)

Jean   

Greg Ira

Well I can tell you that you're not out of line trying to run both versions.  We went from v8.4 - v8.5 (z/OS) a few years back and that is exactly what we did.  We have multiple instances running and we moved them one at a time while the other instances ran the old version.  There's no reason it shouldn't work. If you've verified everything is configured correctly I would probably open an issue with IBM.

Ed_Arnold

Quote
...SMSPDSE crashed and ZFS crashed, with some impact on the whole prodplex (ZFS on some other LPARs crashed as well).

Wow.  Definitely worth opening up a PMR.  Do you have dumps from either of those?  If so, I recommend you open up a PMR with the SMSPDSE folks or the zFS folks directly.

Ed
#zOS #ODF

JGODFROID

I'm back even if late ;-)
As I didn't have any dump or trace, debugging was nearly impossible, and opening a PMR with no debugging material would have been useless (or they would have asked me to reproduce the case... which I am not ready to attempt).
BUT I got a rather similar problem on another lpar, and that helped me to better understand how and when 'ARSYSPIN' started to die... bringing the running lpar and all the others down with it.
Moreover, I think we found a way to avoid this bad case (even if it is still to be checked and checked again) or, at least, to avoid a sysplex-wide crash, should this case still occur.
First of all, the issue does not seem to be limited to the case where two ARSYSPINs are running on the same lpar. It may happen with one ARSYSPIN.
1) For some obscure reason, ARSYSPIN seems to be looping, doing large amounts of I/O and consuming a lot of CPU cycles.
2) As it does not archive anything, after some long minutes/hours we may be tempted to do a 'P ARSYSPIN'. First mistake... but still not dramatic at this point.
3) As ARSYSPIN does not stop, we decide to 'CANCEL' it. Second mistake, and far more harmful:
         after some minutes we got messages about PDSE problems on the 'SARSLOAD' library, asking for a PDSE analysis;
         the analysis shows a 'latch' held by ARSYSPIN on SARSLOAD.
         If you don't do the 'V SMS,PDSE,FREELATCH' (or don't remember the syntax of this command) quickly enough, SMSPDSE on the other lpars of the sysplex gets
         locked as well, and you see messages appearing saying that the 'faulty' lpar is not responding and should be stalled.
         If you enter the freelatch or 'deactivate' the faulty lpar quickly enough, the sky clears and the other lpars survive.
4) Now, why does ARSYSPIN start a kind of loop in some cases?
5) I noticed that on the two lpars where I got this issue,
           many outputs were still waiting to be archived (nothing dramatic anyway, just 400-500 outputs)
           and, above all, parameter MAXJESDELAY was not coded in ARSYparm (on 'safe' lpars, MAXJESDELAY was = 5).
6) Keeping my 'VARY SMS,PDSE,ANALYSIS and FREELATCH' toolbox ready, I coded MAXDELAY=2 and resubmitted ARSYSPIN on the restarted lpar (no outputs purged).
7) Then, miracle, ARSYSPIN started doing I/O and consuming CPU but... archived all outputs rather quickly (in several blocks, due to a MAXFC=250),
    and I could stop it (P ARSYSPIN) without problem.
8) As I read it, MAXJESDELAY should have defaulted to 60 seconds when not coded. So maybe we should have been more patient and waited some more hours to
    see it archive some outputs...
    But I'm still amazed that only 400-500 outputs (jobs) + a MAXDELAY=60 can 'cost' so many I/Os and CPU cycles.
    In addition, I still don't understand why this 'latch' is held by ARSYSPIN on the SARSLOAD library.

If any member of the user group has also had problems related to MAXJESDELAY, I would appreciate it if you shared the info.

Jean   
   
   

Ed_Arnold

Jean - just curious how current the service is on your 8.4.1 system....

PK92973: SLOW READ BY ARSYSPIN FOR DYNAMIC SYSOUTS

http://www-01.ibm.com/support/docview.wss?uid=swg1PK92973

--- or ---

You wouldn't be caught by the ARSYSPIN and JES2 co-dependency mentioned here, would you?

http://www.odusergroup.org/forums/index.php?topic=897.0

Ed
#zOS #ODF