A few months ago I made the post about debugging a gate failure. It has been linked around and copied to quite a few places and seems to be a very popular post. (definitely the most popular so far on this blog) I figured since the bug I opened from that was closed as invalid a while ago that I should write an update about the conclusion to the triage efforts for the OOM failures on neutron jobs. It turns out that my suppositions in the earlier post were only partially correct. The cause of the failures was running out of memory but what was leading to the OOM failures wasn’t just limited to neutron. It was just that the neutron jobs ran with more services which used more memory which made failures there more common.
Recently I was helping someone debug a gate failure they had hit on one of their patches. After going through the logs with them and finding the cause of the failures, I was asked to go through how I debug gate failures. To help people understand how to get from a failure to a fix.
I figured I would go through what I did and why on that particular failure and use it as an example to explain my initial debug process. I do want to preface this by saying that I wasted a great deal of time debugging this failure, because I missed certain details at first, but I’m leaving all of those steps in here on purpose.
The log url for this failure is:
Based on some of the comments that were posted on the recent “Which Program for Rally” ML thread I feel that there’s been some continued confusion around exactly how all the projects work together in the QA program. So after discussing it with a wise council of my elders, I decided to start a blog so that I had a place to post more details and I could give a high level overview and clarify how everything works. I’m not really sure how much I’ll be using this blog in the future, as having one is something I’ve resisted for quite some time. But, I felt that making this post warranted me giving in to peer pressure.
Today’s the QA program :
So today in the QA program we have 3 projects, here is a high level over:
- Tempest: The OpenStack Integrated test suite, it’s concerned with just with having tests and running them
- devstack: A documented shell script to build complete OpenStack development environments.
- grenade: Upgrade testing using 2 versions of devstack to test an offline upgrade. Tempest can optionally be used after each version is deployed to verify the cloud.
Each of these projects is independent and is useful by itself. They have defined scope (which admittedly gets blurred and constantly evolves) and when used together along with external tools they can be used in different pipelines for certain goals.
Continue reading QA Program: From Juno into the Future