For systems integrators who don't necessarily specialise in any one technology, these things can seem more like a dark art, and the myriad documents produced by the various authors of WebSphere technical books don't seem to shed enough light on what is actually going on.
Yesterday, I had a misbehaving IBM Tivoli Identity Manager (ITIM) instance: every transaction sat in a pending state, and not a single one was being flushed through to completion.
The setup was: ITIM v5.0 running in a WebSphere v6.1.0.9 clustered environment with two physical servers. Everything about the deployment was fairly vanilla (though the amount of data going through the system is quite large).
The problem started on Sunday night, it seems, during the automatic regeneration of the LTPA keys. The Deployment Manager lost the ability to control the Node Agents and clusters, which forced the following sequence of events:
- Forced shutdown of the Application Servers and Node Agents
- Removal of Global Security
- Manual synchronisation of the nodes on the physical servers
- Startup of the Node Agents
- Reconfiguration of Global Security
- Resynchronisation of the nodes (via the Deployment Manager)
- Startup of the Messaging Cluster
- Startup of the Application Cluster
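For reference, the steps above map roughly onto the standard WebSphere command-line tools in each profile's bin directory. This is only a sketch: the install root, profile names, server name and credentials below are assumptions (on this Windows 2003 setup the .bat equivalents apply), and the security reconfiguration and cluster startups were done through the Deployment Manager's admin console rather than from the command line.

```shell
# Assumed install root and profile name - adjust for your cell.
WAS_BIN=/opt/IBM/WebSphere/AppServer/profiles/AppSrv01/bin

# Forced shutdown of an application server and the node agent (per node).
# With the dmgr unresponsive, killing the JVM processes may be the only option.
$WAS_BIN/stopServer.sh server1 -username wasadmin -password secret
$WAS_BIN/stopNode.sh -username wasadmin -password secret

# With Global Security disabled, manually synchronise each node
# against the dmgr (8879 is the default dmgr SOAP port)
$WAS_BIN/syncNode.sh dmgr_host 8879

# Bring the node agent back up; resynchronisation, security and the
# cluster startups then happen from the dmgr admin console
$WAS_BIN/startNode.sh
```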
Everything looked OK, until transactions started to appear in ITIM and they still sat there, pending!
The logs showed that the Messaging Cluster would start, then stop, then start, then stop, over and over.
Communications channels between the local queues and the shared queues weren't behaving as they ought to have been, and the root cause seems to have been an inconsistency between four views of the pending work: the transactions ITIM expected to be pending, the transactions the Messaging Cluster thought it had, the messages in the "physical" DB2 storage backing the messaging cluster, and the contents of the transaction logs (tranlog).
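One way to see part of that inconsistency is to look directly at the DB2 tables backing the Service Integration Bus message stores. This is a hedged sketch: the database name and the schema/table names follow the ITIM defaults mentioned further down (itiml000.sib000 and friends) and may differ in your installation.

```shell
# Connect to the ITIM database with the DB2 command-line processor
db2 connect to ITIMDB

# Row counts in the SIB data stores - messages the messaging engines
# believe they are holding. Non-zero counts here, while ITIM shows
# everything pending and the engines flap, point at the mismatch.
db2 "select count(*) from itiml000.sib000"
db2 "select count(*) from itiml001.sib000"
db2 "select count(*) from itims000.sib000"

db2 connect reset
```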
NOTE: The recovery procedure below is probably not something anyone should undertake lightly, but it was the final straw after 12 hours of trying various tactics that might have saved my pending transactions (all of which proved futile).
- Stop the clusters
- Stop the Deployment Manager
- Kill any rogue WebSphere processes (of which there were a few)
- Stop the database manager supporting ITIMDB
- Delete the various tranlog files for the two application server instances
- Reboot the two servers (probably not required, but these were Windows 2003 Server instances that seemed to be hanging on to various ports needlessly, and I wanted a clean environment in which to start everything up again)
- DANGER: delete from itiml000.sib000 in the ITIM database (and likewise for the itiml001 and itims000 schemas, and for the sib001 tables)
- Start up ITIM
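The whole procedure can be sketched roughly as follows. Again, treat this as a sketch, not a script to run: the paths, profile names, server names and the tranlog location placeholders are assumptions (on Windows 2003 the .bat equivalents in the profile bin directories apply), and the SIB deletes discard pending work permanently.

```shell
# Assumed profile locations - adjust for your cell
WAS_BIN=/opt/IBM/WebSphere/AppServer/profiles/AppSrv01/bin
DMGR_BIN=/opt/IBM/WebSphere/AppServer/profiles/Dmgr01/bin

# Stop the cluster members on both nodes, then the Deployment Manager
$WAS_BIN/stopServer.sh server1 -username wasadmin -password secret
$DMGR_BIN/stopManager.sh -username wasadmin -password secret

# Find and kill any rogue WebSphere java processes still holding ports
# (Task Manager / taskkill on Windows)
ps -ef | grep java

# Stop the database manager supporting ITIMDB, then clear the SIB
# message stores - THIS DISCARDS ALL PENDING TRANSACTIONS
db2stop force
db2start
db2 connect to ITIMDB
db2 "delete from itiml000.sib000"
db2 "delete from itiml000.sib001"
db2 "delete from itiml001.sib000"
db2 "delete from itiml001.sib001"
db2 "delete from itims000.sib000"
db2 "delete from itims000.sib001"
db2 connect reset

# Delete the transaction logs for both application server instances;
# <cell>/<node>/<server> are placeholders for your topology
rm -f /opt/IBM/WebSphere/AppServer/profiles/AppSrv01/tranlog/<cell>/<node>/<server>/transaction/tranlog/*

# Reboot if desired, then bring everything back up in order:
# dmgr, node agents, messaging cluster, application cluster
$DMGR_BIN/startManager.sh
$WAS_BIN/startNode.sh
```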
At this stage, all was well. Apart from my physical well-being.
CAUTION: The above approach was a very brutal way of clearing everything out to the point where the system was operational once again. I never want to have to repeat this process; a clean ITIM instance that is well tuned and well looked after is far preferable to having to perform this kind of recovery.