Production Operations: Manual Intervention

[Image: Blue Screen - Production Operations]

DevOps has come a long way in bringing automation to Production Operations. There are tools like Nagios for infrastructure monitoring and alerting. Splunk aggregates logs and analyzes data in real time. In AWS, CloudWatch alarms and actions can be combined to build self-healing environments. When everything is working like clockwork, it’s business as usual. However, it’s important to remain vigilant and be ready to step in when an unplanned situation makes its way into the system. This is where manual intervention comes into play.
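
To illustrate that last point, here is a minimal sketch of a self-healing setup: a CloudWatch alarm that automatically recovers an EC2 instance when its system status check fails. The instance ID and region are placeholders, and the thresholds are just one reasonable choice.

    # Sketch only: auto-recover an EC2 instance when the system status check
    # fails twice in a row (instance ID and region are placeholders)
    aws cloudwatch put-metric-alarm \
        --alarm-name ec2-auto-recover-i-0123456789abcdef0 \
        --namespace AWS/EC2 \
        --metric-name StatusCheckFailed_System \
        --dimensions Name=InstanceId,Value=i-0123456789abcdef0 \
        --statistic Maximum \
        --period 60 \
        --evaluation-periods 2 \
        --threshold 1 \
        --comparison-operator GreaterThanOrEqualToThreshold \
        --alarm-actions arn:aws:automate:us-east-1:ec2:recover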

Even a giant like AWS is not immune to such emergencies. In early 2017, the S3 component of AWS suffered an unplanned outage (http://www.bizjournals.com/seattle/news/2017/03/02/amazon-aws-outage-cause.html).

Production Operations – Working under Pressure

When the unexpected does occur, engineers need to work as quickly and efficiently as possible. Everyone is after the same goal: restore the system to normal. However, during an outage scramble, making a mistake is all too easy. The clock is ticking, emails are flying, and the team is on high alert. Let’s examine some common scenarios, their classic failures, and how to come through them as safely and efficiently as possible. In the scenarios that follow, the OS is Linux, PuTTY is used to SSH into the systems, and the shell is Bash. Of course the lessons apply to all kinds of configurations, but they are easier to illustrate with a specific setup.

Disk Space 100% Full

In this scenario, the disk space monitoring mechanism didn’t work, and disk space has to be freed to make room for new files. The plan is to delete files from the latest two weeks, since those can be quickly restored from backups, as long as they match a certain size threshold and other such criteria. The engineer starts PuTTY, establishes an SSH connection, and fires off a sophisticated one-liner. He then does a space check to see how much room he cleared, but now there is complete silence in the room. The space available is much larger than expected. He does a directory listing and discovers that all the files in the directory are now missing!

It turns out there was a typo in the one-liner; let’s say it was off by a single character. It would have been nice if the command had failed and done nothing, but that’s not what happened. The command did exactly what it was instructed to do. One-liners are great: they are usually efficient, compact, and elegant. However, firing off one-liners to save a few seconds during a production outage is risky business. It’s great when they work, but it can be a disaster when they don’t.
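
To make the risk concrete, here is a hypothetical example of how one stray character changes the meaning of a command entirely; the directory path is made up for illustration.

    # Intended: remove only the contents of the cache directory (hypothetical path)
    rm -rf /data/app/cache/*

    # One accidental space later: this removes the cache directory itself AND
    # every file in whatever directory the shell is currently sitting in
    rm -rf /data/app/cache/ *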

Instead of executing the one-liner, which requires 100% accuracy, there is a safer approach: break the command up into multiple steps. First, do a dry run that only prints out (ls, echo) the files to be deleted. Once happy with the list, and only then, run the rm (delete) command.
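
A minimal sketch of that approach, assuming the goal is to remove files from the last 14 days above a certain size (since those can be restored from backup); the path and thresholds here are hypothetical.

    # Step 1: dry run - print the candidate files and review the list
    find /data/app/archive -type f -mtime -14 -size +100M -print

    # Step 2: only after the list looks right, perform the actual delete
    find /data/app/archive -type f -mtime -14 -size +100M -delete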

Application Process Restarts

An unexpected condition killed the application processes, and the automatic restarts did not kick in as expected. The Production Operations team needs to bring the system back up as soon as possible. Let’s say there are 10 processes, one process per server. The engineer fires up 10 PuTTY sessions and logs into each server. He then executes the necessary commands to start the required processes. Thinking he has completed all the tasks, he gives the all-clear to the person responsible for the maintenance page. The site is now back online. A few minutes later, a new outage is reported. The engineer then realizes that he executed some of the commands in the wrong PuTTY terminals.

Starting multiple processes across multiple PuTTY terminals is not necessarily a difficult task when things are calm. But here, the engineer is working as fast as he can to reduce the outage downtime. To avoid this mistake, he can run the hostname command in each PuTTY session, so he is sure which server he is connected to. He can position the terminals in a logical sequence on the screen and keep a checklist on the side to track his work. Granted, he wants to save a few minutes by working extra fast, toggling between PuTTY windows and executing commands in parallel. But his good intentions did not materialize; instead, he caused another outage.
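
A quick sketch of that habit; the prompt format below is just one possible choice.

    # Run this in every terminal before typing anything else - it confirms the host
    hostname

    # Optional: keep the hostname permanently visible in the Bash prompt while
    # toggling between windows (\h expands to the hostname, \u to the user)
    export PS1='[\u@\h \W]\$ '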

Lessons to Remember

No one likes to solve production issues manually during an outage. Under pressure, Production Operations engineers can sometimes take shortcuts and try to solve problems as fast as possible. While this style may work out at times, it puts undue risk on the whole process.

One wrong move, and not only is the original problem still unsolved, but now there are more problems than before. For this reason, it’s worth losing a few percentage points of time and working through the problem more slowly, but more safely. Of course the end goal is to automate everything and never have to log into a PuTTY session for any Production Operations work. However, that will not always be the case.