Thursday, June 21, 2007

HPCPAL München – Day 3

#007 High hopes, low expectations

It paid off! If the training had been only one day long, the third day should have been that day.

There was lots of interesting stuff about HPC development on the MS platform with MS tools, and about migration from Unix to MS.
That was what I was waiting for!

The guy from Microsoft Germany shared real-life programming experience on the subject matter, and not only on MS but on other platforms, too.

Thinking back, it was worth the time I spent in München.

(The end)

P.S. Now, it is time for a holiday

HPCPAL München – Day 2

#006 Disaster

The morning session was good, but nothing was said that we could not read on our own (on the internet or in the documentation). The lab was like the first day: chaotic and hot.

I came here to see what MS has to offer on HPC, yet in the afternoon the Intel guys took over. We listened to talks about Intel compilers, libraries, network card setup and other tools. Thinking it over, the Intel part was OK.

Hopefully the last day will be better than the first two!

(To be continued…)

HPCPAL München – Day 1

#005 Should be impressed, but not

Not exactly München but Unterschleißheim. It is a small and quiet town (village) between München and the München Airport.

The location is superb! The training is in the hotel where we are staying. There was one real problem, though: no free internet access?! We had to purchase internet access at the front desk of the hotel!

The agenda was:
· Introduction to High Performance Computing
· Windows Compute Cluster Server 2003 Overview
· Intro to verticals using Compute Clusters
· Intro to third party applications capable of running on Compute Clusters
· Topologies of Compute clusters
· Intro to Interconnects/networking
· Intro to storage technology for HPC implementations
· Installing and configuring Compute Clusters
· Managing Compute Clusters

As we started late, the first four (4) bullets were not delivered.

The guys from MS wanted to impress us! They brought eight (8) clusters with them! Eight four (4)-node blade systems! Nice, but they caused us more problems than they were worth.

The blades were not prepared on time. We could not connect to them effectively. There were no drivers prepared for the systems. The blades produced a lot of heat and noise. Etc.

It would have been better to have a set of virtual machines for the labs. Later on we got an implicit (never spoken out loud) answer as to why the labs were not on virtual machines: MS Virtual Server and Virtual PC do not support 64-bit guest OSs, and MS does not use VMware Workstation or Server for its presentations. Logical, isn't it?!

On the other hand, the talks and the knowledge of the guys were superb. Tips and tricks, what's supported and what's not (but still works), what and how, etc. Hats off!

(To be continued…)

My first Windows HPC Cluster

#004 The virtual one

Last week I managed (had enough free time) to set up my first Microsoft Windows Compute Cluster Server 2003 environment.

I encourage everybody thinking about building an HPC cluster, on MS or non-MS technology, to try out the virtual approach.

First I installed a domain controller, then built the head node and installed two compute nodes, all on a single 64-bit machine with 4 GB of RAM running Windows XP x64 and VMware Workstation. To be exact, on my son's gaming station.

For sure, it is not a high-performance computer, but you can still test every aspect of the platform: installation, administration, management, application (OpenMP, MPI) development, parallel remote debugging, etc.
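
Just to illustrate the kind of application development such a virtual cluster is good for, here is a minimal MPI "hello world" in C. It is only a sketch, assuming the mpi.h header and the MPI library that ship with the Compute Cluster Pack SDK; the file name hello_mpi.c is my own choice.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, size, name_len;
    char name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);                  /* start the MPI runtime */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* rank of this process */
    MPI_Comm_size(MPI_COMM_WORLD, &size);    /* total number of processes */
    MPI_Get_processor_name(name, &name_len); /* name of the node we run on */

    printf("Hello from rank %d of %d on %s\n", rank, size, name);

    MPI_Finalize();
    return 0;
}

If it prints one line per process from more than one compute node, the nodes and the MPI network are wired correctly. If I remember correctly, it can be scheduled on the cluster with something like job submit /numprocessors:4 mpiexec hello_mpi.exe, but check the CCS documentation for the exact syntax.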

The installation was fun and went smoothly once I figured out how to configure each NIC, as I chose the most complex network topology. All the cluster nodes have three NICs: one for the public network, one for the private (management) network and one for the MPI network.

I was not using a DHCP server, so I configured the NICs as follows.

Public Network:
*192.168.0.X (255.255.255.0)
*Configured default gateway
*Configured DNS server

Management Network:
*10.0.0.X (255.255.255.0)
*No default gateway
*Configured DNS server (DC in my case)

MPI Network:
*10.0.1.X (255.255.255.0)
*No default gateway
*Configured DNS server (DC in my case)

Thursday, June 14, 2007

Started the Campaign

#003 First CCS Presentation

I started my Microsoft Windows Compute Cluster Server 2003 campaign. I gave an entry-level presentation about the technology at the main Croatian Microsoft event, WinDays 2007.

It was an interesting experience. In an audience of around 60 people there was only one person with practical HPC knowledge.


Microsoft WinDays 2007

My next CCS event is Munich: HPC PAL. I will be back on that when I am back!

Saturday, June 9, 2007

Disaster Recovery

#002 Different Approach

Building and maintaining a disaster recovery (DR) site is never an inexpensive or simple task.

A lot of factors must be taken into account to satisfy all the needs or to reach a consensus: corporate policy, regulations, budget, etc.

Considering how much effort and how many funds such an implementation needs, it would be wise to utilize our investment and not just keep an inactive infrastructure on the DR site. For example, we can use our disaster recovery equipment as a backup while we are upgrading our production environment.

But if we would like to utilize our DR environment, either as a backup for our production (PR) environment or as part of the day-to-day operation, the failover and failback procedures must be very simple.

This short article presents a real-life example of a disaster recovery implementation for a BizTalk environment with a semi-automatic failover/failback procedure.

Our production BizTalk environment is a fairly common one: a two-node failover cluster in the background hosting:
· two instances of SQL Server (for BizTalk and custom application)
· clustered WebSphereMQ Server
· clustered MSMQ
· clustered ENTSSO Service
· clustered MQSAgent

In our BizTalk Group there are two front-end (FE) BizTalk servers, and the FE servers are also Network Load Balanced.
All the receive locations (MSMQ and WebSphere MQ) are clustered, and one MQSeries send queue is also clustered. For every BizTalk Host there is a Host Instance on both FE servers. Our web application is hosted on the load-balanced IIS.

This configuration gives us fault tolerance and load balancing.

When we designed our solution the goals were:
· use as few software licenses as possible
· implement an automatic failover procedure
· implement a semi-automatic failback procedure

In the end, as losing the connection between the PR and DR sites without a real disaster could initiate an automatic failover, we implemented a semi-automatic failover procedure.

Here is a diagram of our solution:



To solve the problem of the FE servers, we added a third BizTalk Server box to the group and also joined it to our load-balanced server farm, on the DR site. This third server is a cold server from the BizTalk point of view: no BizTalk services are running on it (so no license is needed). If the FE servers on the PR site fail, we just start the services on the third node. As the third node is load balanced, the client applications think that everything is OK; the clients simply do not know how many servers are behind the load-balanced URL. This model is pretty simple.

The DR solution for the back-end cluster is just a bit more complex. First, we took one domain controller from the corporate network, isolated it, and on the isolated network we installed a single-node cluster with exactly the same cluster resources as in the PR environment: the same network names, IP addresses, message queues, SQL Server instances, etc. Only the physical server's name is different. Then we stopped all the cluster resources and the cluster service and connected the server to the corporate network at the DR location. Third, we implemented continuous file-level replication, using a third-party solution, for each SQL database, the MSMQ store, the MQSeries store, etc.

I can get back to the file replication implementation if needed.

From that moment on, we just had to be careful not to run the cluster service and the cluster resources on both locations at the same time.

So, the failover procedure for the back-end cluster is as follows:
· Check that the production cluster is down
· Start the services on the DR single node cluster

That’s it! Simple as that!

By “losing” the production cluster, BizTalk loses its SQL connections and its endpoint locations. A few minutes later the needed resources are alive again, on the DR site, and BizTalk reconnects just as if the PR cluster were still alive.

Failover time is measured in minutes! Not hours, not days!

If the complete PR site is lost, we just have to start the services on the single node cluster and on the BizTalk box.
The key point of the solution is the single node cluster.

As the failover procedure is quick and simple, we can utilize the DR solution in different ways.

For example, we use the DR site as a backup in case a production server upgrade fails. It is nice to know that if the PR server is down, there is a fully functional server available in a few minutes.

Or, once when we lost the production cluster storage: after the storage was up and running again, we simply took the files (SQL storage files, MQSeries and MSMQ configuration and storage files) back from the DR server, and our live environment was functional as before.

Fortunately, there has been no real disaster recovery situation yet.

Saturday, June 2, 2007

“Unwanted” feature - When a feature becomes a bug

#001 Unexpected property demotion in BizTalk Server

A feature should not bring ambiguity into our solutions, and we should be empowered to turn it on or off at will.
Property promotion and demotion is a documented feature in BizTalk Server, widely blogged about and discussed in newsgroups when it does not work as expected.
In the following sample, property demotion works, but it is not really what we expected to happen!

Here is my sample schema... (Schema.xml)

The only thing we should notice is that the PP element is promoted using Quick Promotion in the Visual Studio Schema Editor and that it is also a Distinguished Field.

Our sample orchestration receives a sample message (MsgIn). In the Message Assignment Shape MsgIn is assigned to MsgOut and the "NP" string constant is assigned to the MsgOut.PP distinguished field. At the end, the orchestration sends out the MsgOut message.
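
In other words, the Message Assignment shape contains something like this (reconstructed from the description above; the comments are mine):

MsgOut = MsgIn;   // assign MsgIn to MsgOut
MsgOut.PP = "NP"; // assign the "NP" string constant to the MsgOut.PP distinguished field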




I created a FILE Receive Port/Location pair with the XMLReceive pipeline and a FILE Send Port with the XMLTransmit send pipeline, and bound the sample orchestration to those endpoints.

If the MsgIn message is:
(MsgIn.xml)

the MsgOut message arriving at the send port's destination is: (MsgOut.xml)

Just as expected.

But if, for any reason, we change the promoted property's (PP's) Property Schema Base in the property schema to MessageContextPropertyBase, the resulting message will be equal to the original message. The PP element value stays "PP" instead of "NP". That is not what we expect to happen.

So, what happened? Property demotion.


If we debug the orchestration, everything seems fine. The PP value changes to "NP" and the orchestration sends out the MsgOut message with the "NP" value in it. Still, the message on the file system shows "PP" for the PP element. The PP value is changed back in the pipeline: as we assigned MsgIn to MsgOut with the MsgOut = MsgIn; expression, the message context was also copied, and with MsgOut.PP = "NP"; we only changed the body data, not the context property, so the XMLTransmit pipeline demoted the old context value back into the message body.

If we change the content of the Message Assignment shape to:


MsgOut = MsgIn;                              // assign MsgIn to MsgOut (the message context is copied as well)
MsgOut(UnwantedPD.PropertySchema.PP) = "NP"; // write the context property; the body element is updated at the same time


the resulting message body's PP element value becomes "NP" immediately, during the execution of the shape. With the execution of the second line both the message context property and the body element value are changed at the same time. Later on, if property demotion happens, nothing changes.

So, be careful using distinguished fields when they are also promoted properties.

It would be even better if we could turn property promotion/demotion on or off in our pipelines.

A few days ago, this behavior cost me several hours of debugging. The receive location was a SharePoint view and the send location was the same SharePoint InfoPath form library on which the view was defined. I debugged my orchestration and saw that the SharePoint view filter value in my message was being changed, yet there were no forms in my library and every second a new orchestration started up, endlessly.