The Chinese colleagues in my organization usually go out every Monday for lunch at a nearby Chinese restaurant. It’s a good time to share common concerns about the economy, the stock market or food security in China. Topics like work or technology seldom come up in the lunch discussion, unless somebody starts complaining, which is what happened this Monday.

One of my colleagues complained that he had spent hours trying to solve the mystery of why some messages failed during processing, for no apparent reason, in the Production environment. Eventually, he found out that the deployment team had rolled out a new server the previous weekend, and that it was not configured correctly. The new server, participating in a cluster, picked up some of the messages and failed them. That explained why only some messages failed while others went through successfully.

But why did it take him so long to figure it out? There were two reasons. First, he wasn’t aware of the environment change (the rollout of the new server). Second, there weren’t enough logs to show which server had processed each message. He browsed through the logs of all the known servers trying to find traces of the failure, with no luck. If there had been central logs or database records showing which server in the cluster processed each message, the error would have been obvious.
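To make that last point concrete, here is a minimal sketch in Python of what "showing which server processed the message" could look like. It is only an illustration under my own assumptions, not what my colleague's system does: the names `HostnameFilter`, `build_logger` and `process_message`, the log format, and the sample host name are all hypothetical. The idea is simply that every log record carries the name of the server that handled the message, so a central log can be searched by message ID.

```python
import io
import logging
import socket


class HostnameFilter(logging.Filter):
    """Attach the name of the server handling the message to every log record."""

    def __init__(self, hostname=None):
        super().__init__()
        # Default to the real host name; allow an override for testing.
        self.hostname = hostname or socket.gethostname()

    def filter(self, record):
        record.hostname = self.hostname
        return True  # never drop a record, only annotate it


def build_logger(stream, hostname=None):
    """Build a logger whose output always includes host=<server name>."""
    logger = logging.getLogger(f"message-processing.{id(stream)}")
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(levelname)s host=%(hostname)s %(message)s")
    )
    handler.addFilter(HostnameFilter(hostname))
    logger.addHandler(handler)
    return logger


def process_message(logger, message_id, handler_fn):
    """Run handler_fn on one message and record the outcome with its ID."""
    try:
        handler_fn(message_id)
        logger.info("message_id=%s status=processed", message_id)
        return True
    except Exception as exc:
        logger.error("message_id=%s status=failed error=%s", message_id, exc)
        return False


if __name__ == "__main__":
    # Hypothetical example: a misconfigured server failing a message.
    def broken_handler(message_id):
        raise RuntimeError("missing config")

    out = io.StringIO()
    log = build_logger(out, hostname="new-server-01")
    process_message(log, "msg-42", broken_handler)
    print(out.getvalue(), end="")
```

In a real cluster the stream would be replaced by whatever ships records to the central log store. With a host name on every line, searching for a failed message ID would point straight at the unfamiliar server.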

Production Support – The ITIL Way is a presentation I prepared to introduce how to handle Production issues the ITIL way.

Two ITIL processes are related to this topic: Incident Management and Problem Management. The presentation covers both processes in detail and ends with some suggestions for improving the current Production Support processes.

Production issues are what trouble us most. They seem to consume our time endlessly. How to tame those issues and take back control of our lives is the main focus of this presentation.

