Thou shalt keep Calm!
Real bugs happen in production, when your systems are under heavy load or when you need them the most. Yes, your boss is doing a demo of the software to clients. Yes, it is Black Friday and money is at stake. Yes, your company is losing money. Yes, you will have to explain yourself. But now is the time to keep calm. Stress will not help you. It will make matters worse. Pro tip: Kick your boss out of the room, especially if he stresses you out.
Thou shalt check the Plug!
If the screen is black, the power might be out. The computer might be unplugged. Do not take anything for granted. Do not assume anything: sometimes the cause is the most obvious thing, and sometimes it is the most unlikely one.
Check your data sources. Are some APIs down? Is there an outage? Do you have a CDN in front of your service? Is it blocking requests? Can your server reach the database? Is your database overloaded?
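A checklist like this can be turned into a small script you run before touching any code. The following is only a minimal sketch: checkDependencies, withTimeout and the probe names are hypothetical, and it assumes each dependency can be probed by a function that throws (or rejects) when something is wrong.

```javascript
// Hypothetical helper: reject if `promise` does not settle within `ms` milliseconds.
function withTimeout(promise, ms) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`timed out after ${ms} ms`)), ms);
  });
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Run every probe in parallel and report which dependency answers and which does not.
// `probes` maps a dependency name to a function that throws or rejects on failure.
async function checkDependencies(probes, timeoutMs = 2000) {
  return Promise.all(
    Object.entries(probes).map(async ([name, probe]) => {
      try {
        // Promise.resolve().then(...) also catches probes that throw synchronously.
        await withTimeout(Promise.resolve().then(probe), timeoutMs);
        return { name, ok: true };
      } catch (err) {
        const reason = err instanceof Error ? err.message : String(err);
        return { name, ok: false, reason };
      }
    })
  );
}
```

In practice each probe would wrap one question from the list: an HTTP call to an API's health endpoint, a trivial query against the database, a request that goes through the CDN. Running all of them at once tells you quickly which "plug" to look at first.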
I remember a conference where a sudden spike of traffic from one IP address, the IP address of the congress center, got that address pretty much banned by a “smart” security algorithm, which mistook the congress attendees for a DDoS attack. As you can imagine, everything was working fine: the servers were up, all the dashboards were green, and yet none of the 25,000 people gathered for the conference could use the app, the website or access the program. Only a lucky few could get one of the 10 concurrent connections the security algorithm allowed for that IP…
Thou shalt reproduce it!
Until you can reproduce it, you won’t fix it. If you did not fix it, it is not fixed. It will happen again. You can buy yourself time by restarting your systems, but you need to reproduce the issue.
Thou shalt divide and conquer!
Long before developers used the “divide and conquer” technique to create more efficient algorithms, military leaders and politicians used that exact technique to succeed in their endeavours. None other than Niccolò Machiavelli wrote about it in the Art of War as well as in the Discourses on the First Decade of Titus Livius:
[…] it will always happen that, by exercising a little dexterity, the one will be able to divide the many, and weaken the force which was strong while it was united. (Chap. 3.11)
The key here is exercising a little dexterity. When you know it is not the plug, it’s most certainly the code… The debugger is your best friend. Even a simple console.log() will save the day.
Just put a breakpoint in the middle of your code: in the service that is failing, in the class that is most likely betraying you. If everything still looks correct when you reach the breakpoint, the issue is happening in the second half; if the state is already wrong, it is in the first half. Divide the guilty half in half again and put a breakpoint there. Continue! It is a binary search. When you have isolated the problem, you are almost there!
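The same binary search can be sketched in code. This is an illustration, not a recipe: the pipeline stages, findFirstBadStage and isHealthy are all made up for the example, and it assumes that once the state goes wrong it stays wrong in every later stage. Each call to isHealthy plays the role of one breakpoint (or one console.log()).

```javascript
// Hypothetical pipeline: each stage takes an order and returns a new one.
function parsePrice(order) { return { ...order, price: Number(order.price) }; }
function applyDiscount(order) { return { ...order, price: order.price - order.discount }; } // bug: discount may be undefined
function addTax(order) { return { ...order, price: order.price * 1.2 }; }
function roundPrice(order) { return { ...order, price: Math.round(order.price * 100) / 100 }; }

// State after running the first `count` stages (the "breakpoint" position).
function stateAfter(stages, input, count) {
  return stages.slice(0, count).reduce((state, stage) => stage(state), input);
}

// Binary search for the first stage whose output fails the health check.
// Returns the stage index, or -1 if the final state is healthy.
function findFirstBadStage(stages, input, isHealthy) {
  if (isHealthy(stateAfter(stages, input, stages.length))) return -1;
  let lo = 0;             // state after `lo` stages is assumed healthy
  let hi = stages.length; // state after `hi` stages is known to be broken
  while (hi - lo > 1) {
    const mid = Math.floor((lo + hi) / 2);
    if (isHealthy(stateAfter(stages, input, mid))) lo = mid;
    else hi = mid;
  }
  return hi - 1;
}

const stages = [parsePrice, applyDiscount, addTax, roundPrice];
const isHealthy = (order) => Number.isFinite(Number(order.price));
const index = findFirstBadStage(stages, { price: "10.00" }, isHealthy);
console.log(`first bad stage: ${stages[index].name}`); // prints: first bad stage: applyDiscount
```

With four stages you only need two checks instead of four. With a thousand lines of suspicious code, that is about ten breakpoints instead of a thousand.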
Thou shalt change one thing at a time!
Do not update packages and change the code at the same time. Do one or the other, not both! Otherwise you will not know which change made matters worse. You might have fixed the issue with the code and introduced an even bigger problem by upgrading a package. Do not change two things in your code either! Change just one and test. Remember, you are debugging, not making things nicer or trying to improve anything. Just squash the bug and release the hotfix. Pro tip: Of course you have tests… run them, you are not wasting time!
Thou shalt take notes!
Take notes of everything you do. You will have to explain yourself, or you might need help. If you are able to quickly share what you did, people will not retry what you already tried. Then, when the issue is resolved, you will be able to do a nice post mortem and put measures in place so that the issue does not happen again.
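Notes do not need fancy tooling; a shared document or a chat channel is enough. If you prefer to keep them next to your debugging scripts, here is a minimal sketch of a timestamped log. The name createIncidentLog and the entry format are assumptions made for this example, not an established tool.

```javascript
// Hypothetical in-memory incident log. `now` is injectable so the clock can be faked in tests.
function createIncidentLog(now = () => new Date()) {
  const entries = [];
  return {
    // Record what you did and what happened, with a UTC timestamp.
    note(action, outcome) {
      entries.push({ at: now().toISOString(), action, outcome });
    },
    // One line per entry, ready to paste into a chat or a post mortem.
    toText() {
      return entries.map((e) => `${e.at}  ${e.action} -> ${e.outcome}`).join("\n");
    },
  };
}
```

Call note() after every action, including the ones that did not help. Those are exactly the ones your colleagues would otherwise try again.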
Thou shalt share with others!
Your failure is valuable experience! Share it with others. Leave a comment about your most frightful experience. We all have one.