The Chinese colleagues in my organization usually go out every Monday to have lunch at a nearby Chinese restaurant. It’s a good time to share common concerns about the economy, the stock market or food safety in China. Topics like work or technology seldom come up over lunch, unless somebody starts complaining, as happened this Monday.

One of my colleagues complained that he had spent hours trying to decipher the mystery of why some messages were failing in the Production environment for no apparent reason. Eventually, he found out that the deploy team had rolled out a new server the previous weekend, and it was not configured correctly. The new server, as part of a cluster, picked up some of the messages and failed to process them. That explained why only some messages failed while others went through successfully.

But why did it take him so long to figure it out? Two reasons. First, he wasn’t aware of the environment change (the rollout of the new server). Second, there weren’t enough logs to show which server had processed which messages. He browsed through the logs of all the known servers trying to find traces of the failure, with no luck. If there had been central logs or database records showing which server in the cluster processed each message, the error would have been obvious.
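
A small sketch of that idea, using Python’s standard logging module (the HostnameFilter name and the log format are my own illustration, not what our system actually uses): stamp every log line with the host that handled the message, so whoever reads the central logs can see at once which server in the cluster was involved.

```python
import logging
import socket

# Illustrative sketch: stamp every record with the host that produced it,
# so central logs show which server in the cluster handled each message.
class HostnameFilter(logging.Filter):
    def filter(self, record):
        record.hostname = socket.gethostname()
        return True

handler = logging.StreamHandler()
handler.addFilter(HostnameFilter())
handler.setFormatter(logging.Formatter(
    "%(asctime)s %(hostname)s %(levelname)s %(name)s - %(message)s"))

logger = logging.getLogger("message.processor")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

def process_message(message_id, payload):
    # Logging the message id on entry means a failed message can always be
    # traced back to the host that picked it up.
    logger.info("processing message %s", message_id)
    # ... business logic ...
```

With the hostname baked into the format, even plain-text logs shipped to a central store answer the question “which server touched this message?” immediately.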

The first cause shows how important information about changes is to the resolution of issues. In most organizations, the processes of release management, change management and incident management are disconnected, especially in big organizations where those processes are handled by different groups of people. The remedy lies in adding the actions and tools needed to link these seemingly independent processes and teams together and to let information flow among them.

The second and more fundamental issue, the lack of logs, is very common. Any developer who has ever provided second-line or third-line support for Production issues knows the feeling of despair when they cannot find enough logs for the problem. There is nothing they can do! That’s when they start to complain about (or curse) the laziness of whoever wrote the code, if not themselves.

But why don’t the developers who write the code put in enough logs, if logs are so important for providing support? I can think of many reasons: lack of time, pressure to deliver, lack of experience (never having done a support job), expanded scope (e.g. a system designed to run on a single node but later expanded to a cluster), etc. But the key reason, in my opinion, is the disconnect between the development process and the operational process.

When a system is designed, we tend to focus on the requirements of the business functions and neglect future operations. Developers are evaluated by how many functions they implement each week during the development cycle, not by how smoothly the code will run later in the operation cycle. That’s too far away to worry about today.

We celebrate when each project is “successfully delivered”. The number of production issues and the “effectiveness and efficiency” of the system after the rollout are seldom counted in the measurement of success.

On the other hand, I am fully aware of the pressure to deliver, especially in a fast-paced business atmosphere like today’s. There is no time to waste building unnecessary error-proofing functions. It’s hard, if not impossible, to predict all the error conditions that may happen in operation and build safeguards against them.

So, the ultimate solution is not simply writing as many logs as we can or trying to predict all the error conditions and prevent them from the very beginning. Instead, I advocate a more agile and iterative approach:

  1. Build a good, generic logging strategy for any new project to start with. It should be independent of business logic, easy to use, and flexible to change or extend (see the sketch after this list).
  2. Set up some basic logging rules on day one and educate the developers about the importance of logs. It’s for their own benefit, since they will be the ones providing technical support.
  3. Put in a sensible amount of logging, but don’t overdo it in the early phase of a new project.
  4. Measure the success of each project not only by development time and budget but also by quality: the number of Production issues, the average time to resolve an issue, availability and performance.
  5. For each production issue, analyze not only the root cause but also the diagnosis process. Fix not only the bug, but also any gaps in the logs.
  6. Gather successful and failed logging patterns during operations. Write down what works and what doesn’t. Standardize them into product-wide or department-wide logging rules. Recommend or enforce those rules. Change the logging strategy/framework to support those rules.
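
To make point 1 concrete, here is a minimal sketch of what such a project-wide logging module could look like, again in Python. The names (configure_logging, get_logger, new_correlation_id) are hypothetical; the point is that all configuration and conventions live in one small module, separate from the business logic, so they can evolve as the rules in points 2 and 6 evolve.

```python
import logging
import uuid

# Hypothetical sketch of a project-wide logging module: all configuration and
# conventions live here, independent of business logic, so they can change
# without touching application code.

class ContextFilter(logging.Filter):
    """Stamps every record with a correlation id so one message can be traced end to end."""
    correlation_id = "-"

    def filter(self, record):
        record.correlation_id = ContextFilter.correlation_id
        return True


def configure_logging(level=logging.INFO):
    """Called once at application startup; modules never set up handlers themselves."""
    handler = logging.StreamHandler()
    handler.addFilter(ContextFilter())
    handler.setFormatter(logging.Formatter(
        "%(asctime)s %(levelname)s %(name)s [%(correlation_id)s] %(message)s"))
    root = logging.getLogger()
    root.addHandler(handler)
    root.setLevel(level)


def get_logger(name):
    """Business modules only ever ask for a named logger."""
    return logging.getLogger(name)


def new_correlation_id():
    """Call at the start of handling each incoming message or request."""
    ContextFilter.correlation_id = uuid.uuid4().hex[:8]
    return ContextFilter.correlation_id
```

A business module would then simply do log = get_logger(__name__), call new_correlation_id() when a new message arrives, and log as usual; it never touches handlers or formats, and the logging rules can be tightened later in one place.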

Systematically and consistently following these steps will keep the software systems we design and support healthy. And hopefully, we can go home early instead of being stuck on another mysterious production issue that leaves no trace in the logs.

Improving logging is an iterative and continuous process. It should be considered one of the key measures of software quality. A clearly defined and constantly improving logging strategy will have an enormous impact on daily operations.

Happy logging!
