A few months ago I made the post about debugging a gate failure. It has been linked around and copied to quite a few places and seems to be a very popular post. (definitely the most popular so far on this blog) I figured since the bug I opened from that was closed as invalid a while ago that I should write an update about the conclusion to the triage efforts for the OOM failures on neutron jobs. It turns out that my suppositions in the earlier post were only partially correct. The cause of the failures was running out of memory but what was leading to the OOM failures wasn’t just limited to neutron. It was just that the neutron jobs ran with more services which used more memory which made failures there more common.
Recently I was helping someone debug a gate failure they had hit on one of their patches. After going through the logs with them and finding the cause of the failures, I was asked to go through how I debug gate failures. To help people understand how to get from a failure to a fix.
I figured I would go through what I did and why on that particular failure and use it as an example to explain my initial debug process. I do want to preface this by saying that I wasted a great deal of time debugging this failure, because I missed certain details at first, but I’m leaving all of those steps in here on purpose.
The log url for this failure is: