Deploying SOA across Active-Active Data Centres
09 Aug 2015 by Simon Haslam (in HA)
I've recently started on another active-active SOA infrastructure project and thought I'd share a few introductory notes since it's a relatively new topic in the context of Fusion Middleware.
In recent years some European customers have accepted a short distance (e.g. 15 km) between data centres in return for a high-bandwidth, low-latency interconnect (typically ‘dark’ or dedicated fibre). This enables them to share normal production workload across both sites – so-called "active/active" data centres – in contrast to the traditional approach of a highly segregated Disaster Recovery (DR) site (an "active/standby" topology).
Prior to mid-2013 Oracle supported a single approach to High Availability (HA): clustering within a single site, as described in the Fusion Middleware High Availability Guide, not stretched across two sites. The Oracle-supported approach to Disaster Recovery was whole-site failover using an active-standby model, as shown below (from the Fusion Middleware Disaster Recovery Guide):
A few large customers had negotiated agreements with Oracle to support architectures that were simultaneously active in two data centres, but these were typically customers spending considerable sums on licences and ACS support.
Over the last five years the cost of site-to-site fibre has fallen considerably, with “metropolitan”-scale data centre links becoming common. Whilst the term Metropolitan Area Network (MAN) originally meant a connection across a city, these days it tends also to apply to low-latency connections between neighbouring conurbations, typically running Ethernet over DWDM. The scale of a MAN falls in between the Local Area Network (LAN) within a single data centre and the traditional Wide Area Network (WAN), which can span hundreds of miles.
When comparing a data centre LAN with a site-to-site MAN, other than bandwidth/latency characteristics (which in practical terms are converging), the main difference is one of control and risk. The fibre between sites is usually owned and managed by a third party, and crosses third party land, so is subject to disturbance outside the control of the data centre operator, such as damage from road-works. This can be mitigated to a degree (such as by using multiple routes and suppliers) but nevertheless a MAN still has a risk profile that is different to a network that is entirely under your own control.
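To put the latency convergence in rough numbers, here is a minimal back-of-the-envelope sketch. The fibre speed and the fixed equipment delay are assumptions for illustration (light travels at roughly two-thirds of c in glass, i.e. about 200,000 km/s, and real links add DWDM/switching delay and extra path length), not figures from any particular vendor:

```python
# Rough round-trip latency over fibre: an illustrative estimate only.
# Assumes ~200,000 km/s propagation in glass plus a nominal fixed
# per-direction equipment delay; real links vary.

SPEED_IN_FIBRE_KM_PER_MS = 200.0  # ~200,000 km/s expressed in km per millisecond

def fibre_rtt_ms(route_km: float, equipment_delay_ms: float = 0.1) -> float:
    """Estimated round-trip time for a site-to-site fibre link."""
    one_way = route_km / SPEED_IN_FIBRE_KM_PER_MS + equipment_delay_ms
    return 2 * one_way

# A 15 km metro link stays well under a millisecond round-trip,
# whereas a traditional WAN-scale link does not.
print(round(fibre_rtt_ms(15), 3))   # metro-scale MAN
print(round(fibre_rtt_ms(500), 3))  # WAN scale
```

Even with generous equipment overheads, a 15 km MAN round-trip sits in the hundreds of microseconds – the same order as a busy LAN – which is why control and risk, rather than raw latency, become the deciding factors.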
Oracle doesn't have a recommended architecture for active-active SOA infrastructures as part of the official documentation set. The closest to this is a white paper ("Best Practices for Oracle Fusion Middleware SOA 11g Multi Data Center Active-Active Deployment"), resulting from some excellent work by the MAA (Maximum Availability Architecture) team, released in the middle of 2013 (whilst I was working on another active-active SOA project). This paper is referenced in MOS 1374202.1 which says that Oracle will support configurations that follow the recommendations stated in this paper.
The configuration the MAA paper proposes is:
(diagram copied from page 13 of http://www.oracle.com/technetwork/database/availability/fmw11gsoamultidc-aa-1998491.pdf)
This design has local load balancing at each site with global load balancing across sites. The examples all use different networks at each site, though a comment suggests a single stretched network is possible too.
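The two-tier routing behaviour can be illustrated with a toy sketch: a global step picks a healthy site (preferring the client's usual one), then a local step round-robins across that site's managed servers. The site names, server names, and the simplistic health model here are all illustrative, not taken from the MAA paper:

```python
# Toy two-tier load balancing: global site selection, then local round-robin.
# Names and the health model are hypothetical, for illustration only.

from itertools import cycle

SITES = {
    "site1": {"healthy": True, "servers": cycle(["soa_server1", "soa_server2"])},
    "site2": {"healthy": True, "servers": cycle(["soa_server3", "soa_server4"])},
}

def route(preferred_site: str) -> str:
    """Global step: preferred site if healthy, else any healthy site.
    Local step: next server in that site's round-robin rotation."""
    site = preferred_site if SITES[preferred_site]["healthy"] else next(
        name for name, s in SITES.items() if s["healthy"])
    return next(SITES[site]["servers"])

print(route("site1"))          # soa_server1
SITES["site1"]["healthy"] = False
print(route("site1"))          # global tier fails over to site2
```

The point of the sketch is the separation of concerns: the global tier only decides *which site*, and each local tier only decides *which server*, so a site outage is absorbed without the local balancers needing any cross-site knowledge.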
A recent industry trend has been the availability of stretched networks (i.e. Layer 2, aka data centre bridging) which allow the same network to be presented at both sites. This has encouraged some platform designers to stretch their WebLogic domains and clusters without giving it too much thought – something I call "Unwittingly Stretched Clusters"! The beauty of such a stretched environment is that you can treat it like a single-site installation, and it will probably work without problems under normal conditions and moderate load. Its biggest drawback is that when it fails it can do so catastrophically, losing service at both data centres – probably not the intention of the budget holder who signed off the expenditure on the second live site!
Network infrastructure manufacturers have various features to try to mitigate failure of stretched VLANs (such as the "Layer 2.5" protocol introduced with DWDM) but it seems everyone I know running stretched VLANs has had a serious outage at some point – whether this is due to immaturity of the technology, or complexity of the underlying configuration, I'm not sure. Certainly my larger customers either don't have stretched VLANs at all or deploy them only for isolated situations (when there's no alternative, I suspect).
If you are considering an Active-Active SOA platform to reduce the need for an isolated DR site there are several factors you need to think about:
- An active-active data centre topology may or may not improve resilience over a traditional active/standby design. On the upside the second site will always have the latest software deployed, removing the risk a DR site has of being out of date or 'broken' in some way. On the downside the second site is far more vulnerable to faults correlated with the first site, such as a faulty change deployed to both.
- Either site needs to be able to handle the full production load, so a 1+1 approach may not save any money on licensing. An n+1 approach, such as three data centres each sized for half the peak load, may be advantageous in this respect, though even more exotic!
- Site outage testing will need to be carried out on production systems which in itself could introduce risk (as compared to testing an isolated DR system).
- By definition any redundant two-node topology introduces the "split-brain" problem. For the database tier this almost certainly means it is safer and simpler to stick with an active-standby approach (typically Data Guard) rather than, say, an extended RAC cluster. When you consider how much of the enterprise compute workload is typically database, this rather dilutes the active-active philosophy.
- Finally, finding good SOA EDG/MAA skills is already relatively tricky; finding people with a deep understanding and real-world experience of active-active systems will be even more so.
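The split-brain point above can be made concrete with a sketch of majority quorum voting (the standard way clustered systems decide which partition may keep serving). With two sites of equal weight, a severed interconnect leaves each side with exactly half the votes, so neither side can safely continue; adding a third voter breaks the tie. The witness/arbiter wording here is generic, not a specific Oracle feature:

```python
def has_quorum(votes_visible: int, total_votes: int) -> bool:
    """A partition may continue serving only if it sees a strict majority."""
    return votes_visible > total_votes / 2

# Two equal sites, interconnect cut: each side sees only its own vote.
assert not has_quorum(1, 2)   # site A: no majority
assert not has_quorum(1, 2)   # site B: no majority either -> outage, or
                              # split-brain if both carry on regardless

# Add a third voter (e.g. a small witness site): one partition can win.
assert has_quorum(2, 3)       # the partition containing the witness continues
assert not has_quorum(1, 3)   # the isolated site correctly stands down
```

This is why an even split of resources across exactly two sites is awkward: either you accept a designated-primary model (as Data Guard does for the database tier) or you need a tie-breaking vote somewhere outside both sites.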
From sessions at last year's OpenWorld conference and other discussions it is clear that Oracle recognises the demand for active-active data centres, from customers as well as internally for the Oracle Public Cloud. In the future I think we can expect to see fully supported dispersed active-active architectures, which I predict will offer a degree of autonomy between sites but without significant administration overhead. I expect such solutions will work across different networks, since enterprise customers, and cloud availability zones in different regions, are currently very unlikely to use Layer 2 bridging for the reasons described above, though of course that could change over the medium term.
The idea of running SOA Suite in some sort of active-active mode across well-connected data centres is conceptually an appealing one: it sounds simple and seems to offer improved value for money. However, this is a relatively new approach (with quite a few subtleties) which may actually reduce SOA platform availability following a data centre disaster as compared to more traditional active-standby techniques. This article is not intended to dissuade architects from considering active-active SOA topologies, but simply to make sure their design is carefully considered at all layers in the infrastructure stack.