Saturday, June 9, 2007

Disaster Recovery

#002 Different Approach

Building and maintaining a disaster recovery (DR) site is never an inexpensive or simple task.

Many factors must be taken into account to satisfy all the needs or to reach a consensus: corporate policy, regulations, budget, etc.

Considering how much effort and funding such an implementation needs, it would be wise to make use of our investment and not just keep an inactive infrastructure on the DR site. For example, we can use our disaster recovery equipment as a backup when we are upgrading our production environment.

But if we’d like to utilize our DR environment, either as a backup for our production (PR) environment or as part of day-to-day operations, the failover and failback procedures must be very simple.

This short article presents a real-life example of a disaster recovery implementation for a BizTalk environment with a semiautomatic failover/failback procedure.

Our production BizTalk environment is a fairly common one. In the background there is a two-node failover cluster hosting:
· two instances of SQL Server (for BizTalk and a custom application)
· clustered WebSphereMQ Server
· clustered MSMQ
· clustered ENTSSO Service
· clustered MQSAgent

In our BizTalk Group there are two front-end (FE) BizTalk servers. The FE servers are also Network Load Balanced.
All the receive locations (MSMQ and WebSphereMQ) are clustered, and one MQSeries send queue is also clustered. Every BizTalk Host has a Host Instance on both FE servers. Our web application is hosted on the load-balanced IIS.

This configuration gives us fault tolerance and load balancing.

When we designed our solution the goals were:
· use as few software licenses as possible
· implement an automatic failover procedure
· implement a semiautomatic failback procedure

In the end, because losing the connection between the PR and DR sites without an actual disaster could trigger an automatic failover, we implemented a semiautomatic failover procedure instead.
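To make the idea concrete, the decision logic can be sketched in a few lines of Python. This is only an illustration and not our actual tooling: the `SiteStatus` fields and the action names are made up for the example. The point is that a lost link alone only raises an alert, and the failover starts only after an operator confirms that the PR site is really down.

```python
from dataclasses import dataclass


@dataclass
class SiteStatus:
    """Hypothetical snapshot of what the DR site knows about the PR site."""
    link_up: bool             # can the DR site reach the PR site at all?
    operator_confirmed: bool  # has a human confirmed that PR is really down?


def failover_action(status: SiteStatus) -> str:
    """Return the next action: 'none', 'alert' or 'failover'."""
    if status.link_up:
        # PR is reachable, so there is nothing to do.
        return "none"
    if not status.operator_confirmed:
        # The link may be broken while PR is still healthy: ask a human first.
        return "alert"
    # Link is down and an operator confirmed the outage.
    return "failover"
```

A fully automatic procedure would skip the `operator_confirmed` check, which is exactly the case that could start the DR resources while PR is still running.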

Here is the schema of our solution:



To solve the problem of the FE servers, we added a third BizTalk Server box to the group on the DR site and joined it to our load-balanced server farm. From the BizTalk point of view this third server is a cold server: no BizTalk services are running on it (no license is needed). So, if the FE servers fail on the PR site, we just start the services on the third node. Because the third node is load balanced, the client applications see nothing unusual; the clients do not know how many servers are behind the load-balanced URL. This model is pretty simple.
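Starting the services on the third node can be scripted. The following Python sketch is an assumption about how one might do it, not the script we use: it calls the standard Windows `net start` command for each service, and the service names passed in are placeholders that have to be replaced with the real ones from the server. The `run` parameter exists only so the function can be tried out without touching real services.

```python
import subprocess
from typing import Callable, List, Sequence


def start_services(service_names: Sequence[str],
                   run: Callable = subprocess.run) -> List[str]:
    """Start each Windows service in order and return the names started.

    With the default runner, a failing `net start` raises
    subprocess.CalledProcessError and the remaining services are not started.
    """
    started = []
    for name in service_names:
        # Equivalent to typing: net start <name>
        run(["net", "start", name], check=True)
        started.append(name)
    return started


# Hypothetical usage on the cold node (placeholder service names):
# start_services(["ExampleBizTalkHostService1", "ExampleBizTalkHostService2"])
```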

The DR solution for the back-end cluster is just a bit more complex. First, we took one domain controller from the corporate network, isolated it, and on the isolated network installed a single-node cluster with exactly the same cluster resources as in the PR environment: the same network names and IP addresses, the same message queues, the same SQL Server instances, etc. Only the physical server’s name is different. Then we stopped all the cluster resources and the cluster service and connected the server to the corporate network at the DR location. Third, we implemented continuous file-level replication, using a third-party solution, for each SQL database, MSMQ store, MQSeries store, etc.

I can come back to the file replication implementation in a later post if needed.

From that point on we just had to be careful never to start the cluster service and the cluster resources on both locations at the same time.

So, the failover procedure for the back-end cluster is as follows:
· Check that the production cluster is down
· Start the services on the DR single node cluster

That’s it! Simple as that!
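The two steps can be sketched in Python as follows. This is a minimal illustration under our own assumptions, not the procedure we actually run: `is_reachable` is a simple TCP probe, `fail_over` refuses to start anything while production still answers, and the start steps are passed in as plain callables. Keep in mind that a failed probe from the DR site can also mean a broken link, so in practice a person should confirm the outage before the steps are run.

```python
import socket
from typing import Callable, Iterable


def is_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution errors.
        return False


def fail_over(production_is_up: Callable[[], bool],
              start_steps: Iterable[Callable[[], None]]) -> bool:
    """Run the DR start steps only if production is down.

    Returns True if the steps were run, False if failover was refused.
    """
    if production_is_up():
        # Never bring up the same names and IP addresses on both sites.
        return False
    for step in start_steps:
        step()
    return True
```

For example, `production_is_up` could be `lambda: is_reachable("pr-sql.example", 1433)`, where the host name is hypothetical and 1433 is the default SQL Server port.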

When the production cluster is lost, BizTalk loses its SQL connections and its endpoint locations. A few minutes later the required resources are alive on the DR site, and BizTalk reconnects just as if the PR cluster were still alive.

Failover time is measured in minutes! Not hours, not days!

If the complete PR site is lost, we just have to start the services on the single-node cluster and on the third BizTalk box.
The key point of the solution is the single node cluster.

As the failover procedure is quick and simple, we can utilize the DR solution in different ways.

For example, we use the DR site as a backup in case a production server upgrade fails. It is nice to know that if the PR server goes down, a fully functional server is available within a few minutes.

On another occasion we lost the production cluster storage. Once the storage was up and running again, we just copied the files (SQL storage files, MQSeries and MSMQ configuration and storage files) back from the DR server, and our live environment was as functional as before.

Fortunately, we have not had a real disaster recovery situation yet.
