How a Release Failed
Hi,
After 9 hours, the decision was clear: We had to roll back.
The first release in almost 1.5 years had failed.
And it was going so well. The meticulous preparation of the past weeks had paid off. Everything was going according to plan. The necessary changes to the VMs went as expected. The major migration was successfully completed after one hour of runtime. The new content files were successfully deployed with the new system. The new CD pipeline ran for the first time on production – and it was successful.
It took a while, but by 1:00 PM, the new version was online.
It was going too well.
The maintenance page was still up.
And we began to test the system. We ran the automated tests, checked the new content, and conducted smoke tests.
I looked at the logs with the developers. And there was something strange:
WARNING: Possible connection leak detected.
The warning was not an isolated case.
It appeared hundreds of times in the logs.
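A warning like this typically comes from the connection pool's leak detection: a connection has been checked out longer than a configured threshold. In HikariCP, which we use, that is controlled by leakDetectionThreshold. A minimal sketch of the kind of configuration involved; the JDBC URL, pool name, and threshold here are assumptions for illustration, not our actual settings:

```java
import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;

public class PoolSetup {
    public static HikariDataSource buildPool() {
        HikariConfig config = new HikariConfig();
        // JDBC URL and credentials are placeholders.
        config.setJdbcUrl("jdbc:postgresql://db-host:5432/app");
        config.setUsername("app");
        config.setPassword("secret");

        // If a connection stays out of the pool longer than this (ms),
        // HikariCP logs a leak warning like the one we saw.
        config.setLeakDetectionThreshold(30_000);

        // Expose pool metrics via JMX so we can inspect them later.
        config.setRegisterMbeans(true);
        config.setPoolName("AppPool");

        return new HikariDataSource(config);
    }
}
```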
Do we have a problem?
The system seemed fine on its own. No one reported an error. The tests were passing. No timeouts, no unexpected behavior. Everything was calm.
So, a look at the monitoring:
If connection leaks are visible in the log, then we should have many active connections in the pool. Hikari had registered an MBean. We could access that. But… everything was fine. We never had more than one connection open.
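For the record, this is roughly how such a check looks. HikariCP registers a HikariPoolMXBean when MBean registration is enabled, and it can be queried over JMX. The pool name "AppPool" is again just an assumption for the sketch:

```java
import com.zaxxer.hikari.HikariPoolMXBean;

import javax.management.JMX;
import javax.management.MBeanServer;
import javax.management.ObjectName;
import java.lang.management.ManagementFactory;

public class PoolInspector {
    public static void printPoolState() throws Exception {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        // The pool name ("AppPool") is an assumption for this sketch.
        ObjectName name = new ObjectName("com.zaxxer.hikari:type=Pool (AppPool)");
        HikariPoolMXBean pool = JMX.newMXBeanProxy(server, name, HikariPoolMXBean.class);

        System.out.println("active:  " + pool.getActiveConnections());
        System.out.println("idle:    " + pool.getIdleConnections());
        System.out.println("total:   " + pool.getTotalConnections());
        System.out.println("waiting: " + pool.getThreadsAwaitingConnection());
    }
}
```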
What’s going on here?
Where is the leak coming from?
A further look at the log revealed the culprit: the 3rd-party API. Authenticating against that API required an RPC call to another service. And that call was blocking. And all the while, we were keeping the database connection open.
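In simplified form, the problematic pattern looks like this. All names in the sketch are made up for illustration; the point is that the database connection is borrowed first and then held while a blocking remote call runs:

```java
import javax.sql.DataSource;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

// Hypothetical illustration of the anti-pattern, not our actual code.
public class OrderHandler {
    private final DataSource dataSource;

    public OrderHandler(DataSource dataSource) {
        this.dataSource = dataSource;
    }

    public String loadOrder(long orderId, String credentials) throws SQLException {
        try (Connection conn = dataSource.getConnection()) { // connection leaves the pool here
            // Blocking RPC to the 3rd-party API's auth service.
            // While this call waits, the connection above is held,
            // potentially for the full TCP timeout.
            String token = authenticateViaRpc(credentials);

            // Only now is the connection actually used.
            try (PreparedStatement ps = conn.prepareStatement(
                    "SELECT payload FROM orders WHERE id = ?")) {
                ps.setLong(1, orderId);
                try (ResultSet rs = ps.executeQuery()) {
                    return rs.next() ? rs.getString(1) + " / " + token : null;
                }
            }
        }
    }

    // Stand-in for the real blocking RPC call.
    private String authenticateViaRpc(String credentials) {
        // ... network call that can block for up to 60 seconds ...
        return "token";
    }
}
```

The obvious direction for the fix: authenticate first (or cache the token), and only then borrow a connection for the actual query.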
We can verify this!
What happens if we generate load here? A developer still had a load test that could simulate 100 clients.
And indeed.
As soon as it started: The system became unusable. Our connection pool was exhausted immediately.
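A load test like that does not need to be sophisticated. Something along these lines is enough to reproduce the effect; the endpoint is a placeholder, and the real test was an existing tool, not this sketch:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class SimpleLoadTest {
    public static void main(String[] args) throws InterruptedException {
        int clients = 100;                       // simulate 100 parallel clients
        HttpClient http = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://example.com/api/orders/42")) // placeholder endpoint
                .GET()
                .build();

        ExecutorService pool = Executors.newFixedThreadPool(clients);
        CountDownLatch done = new CountDownLatch(clients);

        for (int i = 0; i < clients; i++) {
            pool.submit(() -> {
                try {
                    HttpResponse<String> response =
                            http.send(request, HttpResponse.BodyHandlers.ofString());
                    System.out.println("status: " + response.statusCode());
                } catch (Exception e) {
                    System.out.println("failed: " + e.getMessage());
                } finally {
                    done.countDown();
                }
            });
        }

        done.await();
        pool.shutdown();
    }
}
```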
Damn. We should have foreseen this.
What can we do?
We can’t do a rebuild today. The risk is too great.
How many requests per second do we have live on the API? About three. What if we make the connection pool big enough to cover that? The TCP timeout is 60 seconds, so in the worst case each blocked call holds a connection for a full minute: 60 x 3 = 180. Let's say we increase the pool to 500. That gives us some buffer.
And we would buy time. For the right fix.
So, let’s go.
Change the configuration. Deploy. Everything looks good. The system is there.
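The stopgap itself is a single setting. Assuming the pool is configured in code as in the sketch above (it could just as well be a properties file), it boils down to:

```java
// Inside buildPool() from the earlier sketch:
// worst case 3 requests/s * 60 s TCP timeout = 180 held connections,
// 500 leaves a comfortable buffer on top.
config.setMaximumPoolSize(500);
```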
Next load test.
Damn.
The application is unusable again. We only have 100 open connections. But the database is not cooperating. 100% CPU usage.
It’s 4 PM.
Does anyone have another idea? We could set a rate limit on the API in Apache. But to be safe, it would have to allow only a single parallel call.
That’s unrealistic. It would make the API practically unusable.
We could shut down the API. But that’s not an option. The financial damage would be too high.
What if we deploy the old version alongside the new one?
The API is compatible. There were no changes. We could route the API to the old application.
It’s conceivable that this would work. But have we considered everything? We have never tested this setup. And our infrastructure is not designed for it. The risk is too high.
That’s it.
We can’t go live like this.
It’s 4:30 PM.
We have failed.
We roll back.
After 9 hours – we had the old version running again.
Then the whisky was opened.
Have we failed?
Yes.
But it was also a success.
It was clear that it would be a challenge after such a long time. I had expected many more problems before the application was back online.
And that part was not a problem. We have proven that we know how to maintain the application. We know how to deploy it. And all our changes from the last few months were successful.
Finding a mistake like this in the final stages is annoying. But now is not the time to wallow in self-pity. We will fix the error, and the fix will not be hard. And then we’ll try again. We will make it. We will get our foot in the door.
And then we can look forward. Then we can make sure to never end up in such a situation again.
I’m looking forward to it!
Rule the Backend,
~ Marcus