What was the "Mandatory Maintenance for Java Cloud Service" last night?
04 Oct 2015 by Simon Haslam (in Cloud)
Had I been working in Operations and on call at an organisation using Java Cloud Service, it appears I might well have been rudely woken up last night!
At around 2am in the UK this email innocently dropped into my mail box:
This warning of an impending outage to Java Cloud Service was closely followed by one for Database Cloud Service. Judging by the email - "during the maintenance your services will be unavailable" - it seems an hour and a half later the instance failure notifications (from Enterprise Manager or Nagios etc) would have probably started coming in.
Then at 4:36am I had an email to say my JCS was available again, and at 5:11am another to say my DBCS was too: a total outage, according to the emails, of just under 2 hours. However, all of my Java and Database test instances were already down when it happened and, as far as I can tell, no changes were actually made to them (they were not started up, for example) - be sure to read my conclusions at the end of this post!
So this evening I've been having a good look round. The most obvious first change is that you can provision new domains with WebLogic 12.1.3.0.4, and existing full JCS ones (i.e. the ones including 'cloud tooling' for patching) have the following patches available:
If you follow the "Readme File" links above they go to the underlying product's Read Me in My Oracle Support, but I know that previously the WebLogic installation was more than just the base release plus PSU. The September release entry in What's New in Oracle Java Cloud Service describes some minor functional changes, but I couldn't find anything more detailed about the patches. Maybe I've just missed documentation elsewhere?
Anyway, I created a new JCS instance to see what WebLogic patches are installed and found these (opatch inventory output is edited here for brevity):
Interim patches (4) :
Patch 20838345 : applied on Fri Jul 31 17:55:57 UTC 2015
Patch description: "WebLogic Server 12.1.3.0.4 PSU Patch for BUG20838345 July 2015"
Created on 20 May 2015, 21:43:06 hrs PST8PDT
Created on 12 Jan 2015, 03:07:37 hrs PST8PDT
Patch 19917991 : applied on Fri Jul 31 17:55:57 UTC 2015
Patch description: "One-off"
Created on 30 Jul 2015, 19:04:30 hrs PST8PDT
Patch 20716006 : applied on Fri Jul 31 17:55:56 UTC 2015
Patch description: "One-off"
Created on 7 Apr 2015, 20:05:28 hrs PST8PDT
We can actually deduce a surprising amount from these snippets:
- JCS does indeed include the WebLogic 12.1.3.0.4 bundled patch, plus 3 additional one-off patches.
- The newest patch was not created until 30 July, a fortnight after the CPU/PSU was released (on 14th July).
- All of the patches are applied in one sitting, so this golden image of the Oracle Home which is used in JCS is cleanly rebuilt from scratch for every version - this is very good and suggests a fully automated approach (probably part of continuous integration).
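The deductions above can be checked mechanically. Here is a minimal Python sketch, assuming inventory text in the same shape as my edited snippet (the function names and the sample string are mine, not part of OPatch, and real opatch output has more fields than this). It pulls out each patch's applied and created dates, then reports how tightly grouped the apply times are and which patch was built most recently.

```python
import re
from datetime import datetime

# Sample in the shape of the edited inventory output quoted in this post.
SAMPLE = """\
Patch 20838345 : applied on Fri Jul 31 17:55:57 UTC 2015
Created on 20 May 2015, 21:43:06 hrs PST8PDT
Patch 19917991 : applied on Fri Jul 31 17:55:57 UTC 2015
Created on 30 Jul 2015, 19:04:30 hrs PST8PDT
Patch 20716006 : applied on Fri Jul 31 17:55:56 UTC 2015
Created on 7 Apr 2015, 20:05:28 hrs PST8PDT
"""

APPLIED_RE = re.compile(r"^Patch\s+(\d+)\s*:\s*applied on (.+?) UTC (\d{4})$")
CREATED_RE = re.compile(r"^Created on (\d{1,2} \w{3} \d{4}), ")


def parse_inventory(text):
    """Return one dict per patch with its id, applied time and created date."""
    patches = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        applied = APPLIED_RE.match(line)
        if applied:
            when = datetime.strptime(
                f"{applied.group(2)} {applied.group(3)}", "%a %b %d %H:%M:%S %Y"
            )
            current = {"id": applied.group(1), "applied": when, "created": None}
            patches.append(current)
            continue
        created = CREATED_RE.match(line)
        # Only the first "Created on" after a patch header is attached to it.
        if created and current is not None and current["created"] is None:
            current["created"] = datetime.strptime(created.group(1), "%d %b %Y").date()
    return patches


def apply_window_seconds(patches):
    """Seconds between the earliest and latest 'applied on' timestamps."""
    times = [p["applied"] for p in patches]
    return int((max(times) - min(times)).total_seconds())


def newest_patch(patches):
    """The patch with the most recent 'Created on' date."""
    dated = [p for p in patches if p["created"] is not None]
    return max(dated, key=lambda p: p["created"])


patches = parse_inventory(SAMPLE)
print(apply_window_seconds(patches))
print(newest_patch(patches)["id"], newest_patch(patches)["created"])
```

An apply window of a second or so is what you would expect from a scripted image build rather than someone patching by hand.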
In my earlier experiments a couple of months or so ago I had already worked out that there was a problem with JCS and the July PSU. As soon as the patches were released, and the JCS one didn't appear, out of curiosity I used OPatch to manually try to update 12.1.3.0.3 to .4 and hit a conflict - there was a patch on JCS for bug "MS GOES TO ADMIN MODE DUE TO APP DEPLOY ERRORS IN MSI MODE (Patch 19917991)" that was incompatible with 12.1.3.0.4. After backing that patch out I did successfully apply the PSU, but of course at the cost of a bug apparently "un-fixing itself". Therefore I think it's very reasonable that Oracle waited until it had a merge of this patch before making 12.1.3.0.4+ available to JCS users; the obvious questions though are: why wasn't this patch already in the PSU and, if not, why had no-one anticipated this conflict earlier?
One thought that has been troubling me is what happens if you want to create old versions, e.g. to reproduce a known, working state, or you want to create a new JCS instance before you are ready to adopt the latest PSU? Whilst you can't create old versions through the JCS console, back in July I tried via the REST API to create a 12.1.2 version that I knew existed on JCS and it was rejected. I had hoped that maybe the version immediately prior to this new update would still work but sadly it too was rejected:
HTTP/1.1 400 Bad Request
Date: Sun, 04 Oct 2015 19:26:47 GMT
Invalid Weblogic version [12.1.3.0.3] specified. Version must be from the following list: [12.1.3.0.4, 10.3.6.0.12].
This does present an interesting dilemma. Let's say, for the sake of argument, that the new PSU introduced a bug which broke something in your application (which a customer of mine saw for 18.104.22.168.3 over 22.214.171.124.1): you then have no way to recreate the old version on JCS. Maybe this is less important for the higher level PaaS products, such as DOCS, but for low level ones like JCS, where you are deploying your own application code, I can see it being an issue in some, though relatively rare, cases. On the other hand the older images are definitely still on the OPC platform (they must be, to allow you to roll back patches) so perhaps there's a hidden option somewhere (maybe only available to Oracle Support) to enable creation of instances for older versions.
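One small consolation is that the rejection message lists the versions it will accept, so a provisioning script can at least discover them rather than hard-coding them. The following is a rough Python sketch that only parses the error text. It assumes the message keeps the wording shown above, the function names are my own, and it makes no call to the JCS REST API itself.

```python
import re

# Matches the wording of the 400 response body quoted in this post.
ERROR_RE = re.compile(
    r"Invalid Weblogic version \[([^\]]+)\] specified\. "
    r"Version must be from the following list: \[([^\]]*)\]"
)


def parse_version_error(message):
    """Return (rejected_version, [allowed_versions]) or None if no match."""
    match = ERROR_RE.search(message)
    if match is None:
        return None
    rejected = match.group(1).strip()
    allowed = [v.strip() for v in match.group(2).split(",") if v.strip()]
    return rejected, allowed


def version_key(version):
    """Turn '12.1.3.0.4' into (12, 1, 3, 0, 4) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def same_release_alternative(rejected, allowed):
    """Highest allowed version sharing the first four components, if any."""
    prefix = version_key(rejected)[:4]
    candidates = [v for v in allowed if version_key(v)[:4] == prefix]
    return max(candidates, key=version_key) if candidates else None


# Illustrative message in the same shape as the response above.
message = (
    "Invalid Weblogic version [12.1.3.0.3] specified. "
    "Version must be from the following list: [12.1.3.0.4, 10.3.6.0.12]."
)
rejected, allowed = parse_version_error(message)
print(rejected, allowed, same_release_alternative(rejected, allowed))
```

This only tells you what you are allowed to create; it does nothing to bring back the version you actually wanted.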
Each of last night's emails suggested a "Service Instance" was being upgraded, but there was only one email for JCS and one for DBCS (when I have multiple instances). In JCS terminology an "instance" is a complete environment, referenced by an alphanumeric ID - not the Service Instance ID number (which I can't find anywhere) quoted in the email. Therefore I think perhaps the emails are only referring to outages of the Service Managers (i.e. that run the consoles and APIs), not the JCS/DBCS Instances themselves. If so, the emails were definitely misleading (the service managers aren't too important during normal operation), though it is puzzling why such an upgrade would take so long.
Hopefully this post has been useful - these are just my observations and speculation: if you have any different experiences please let me know (e.g. on Twitter https://twitter.com/simon_haslam).