A few months ago I wrote a post about debugging a gate failure. It has been linked around and copied to quite a few places, and seems to be the most popular post on this blog so far. Since the bug I opened as part of that post was closed a while ago, I figured I should write an update on the conclusion of the triage efforts for the OOM failures on neutron jobs. It turns out my suppositions in the earlier post were only partially correct. The failures were indeed caused by running out of memory, but the underlying problem wasn't limited to neutron. The neutron jobs simply ran more services, which used more memory, which made failures there more common.
Diagnosing the root cause of the OOM failure
Unfortunately, because I waited so long to follow up on my original post (the issue was fixed several months ago), all the example logs have expired and been removed from the log server. This means I'm unable to paste in images and links to examples like I did before, so I'll only be able to explain the cause of the failures without referring to the logs. But, as before, all the diagnosis was done by looking through the captured job artifacts available on the log server.
Back in September we started to see a noticeable increase in the failure frequency in elastic-recheck, which also showed the failure commonly occurring on non-neutron jobs. Neutron jobs were still the most common failures, but no longer the only ones. Because of this increased failure frequency, the OOM failures became a higher priority and got some extra attention. When we looked at the memory consumption on the devstack node in general, outside the context of just neutron, we noticed that 2 things were occurring during an OOM failure.
First, we were launching far too many API workers. nova-api alone was launching 24 workers: 8 workers each for its 3 API services (osapi_compute, ec2, and metadata). The devstack gate nodes only have 8 vCPUs and 8 GB of RAM, and having more than 50 API workers between all the OpenStack services running on that kind of machine was excessive and counterproductive. It was also consuming quite a bit of memory, because even workers that don't do anything still consume RAM.
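Those worker counts came from the ps output captured in the job artifacts. As a rough illustration (not the exact commands we ran), a shell sketch like this can tally worker processes from a ps listing; the count_workers helper and the nova-api pattern are just examples:

```shell
# Hypothetical helper: count processes whose command line matches a
# pattern, reading ps-style output on stdin.
count_workers() {
    grep -c -- "$1"
}

# Example: tally nova-api workers on a live node. The bracketed first
# character keeps grep from matching its own entry in the listing;
# grep -c exits non-zero when the count is 0, hence the || true.
ps -eo args= | count_workers '[n]ova-api' || true
```

On a loaded devstack node the same pattern can be repeated per service to see where the worker count adds up.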
The second issue was that there were unused services running which weren't actually being installed by devstack. For example, by looking at the dstat and ps output we could see that zookeeper was installed and running, but nowhere in the devstack or devstack-gate logs was it being installed. By itself, zookeeper ended up being one of the top memory consumers shown in the ps log file (which gets generated at the end of the run). It turned out this was a bug in system-config that led to extra packages being installed on the slaves.
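The kind of check that surfaced zookeeper is easy to reproduce on any Linux box. A minimal sketch (assuming GNU procps for the --sort flag), sorting processes by resident memory, much like the ps snapshot captured at the end of a job run:

```shell
# List the ten largest resident-memory consumers, RSS in KiB first.
# Requires GNU procps for the --sort option.
ps -eo rss,comm --sort=-rss | head -n 10
```

Anything near the top of that list that devstack never logged installing is a candidate for the same kind of stowaway.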
These 2 factors combined were eating roughly 1 GB of RAM (unfortunately I don't recall the exact breakdown, but I remember that the sum of all the extra workers was the largest consumer), which is what was putting so much extra pressure on the memory constraints of the dsvm nodes.
How the issue was resolved
First, Clark Boylan fixed the config side issue to ensure we don't install unnecessary packages (including zookeeper) on the dsvm slaves with: http://git.openstack.org/cgit/openstack-infra/system-config/commit/?id=de01e82ec099b07a1da4af0e30cfdb38b7925a84
Secondly, Dean Troyer changed how we spawn API workers in devstack for all of the services with: http://git.openstack.org/cgit/openstack-dev/devstack/commit/?id=05bd7b803d87bbdd1a6f11cfd278eec319c819ea
This also changed the default number of workers in devstack to nproc/2 for everything, which essentially cut the number of workers started during a job in half.
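As a sketch of that default (the variable name API_WORKERS matches devstack's setting, but the floor-of-one clamp here is illustrative rather than a copy of devstack's code):

```shell
# Default the worker count to half the CPU count unless the caller
# already set it; clamp to at least one worker. A minimal sketch of
# the post-change devstack behavior.
API_WORKERS=${API_WORKERS:-$(( $(nproc) / 2 ))}
if [ "$API_WORKERS" -lt 1 ]; then
    API_WORKERS=1
fi
echo "API_WORKERS=$API_WORKERS"
```

On the 8-vCPU gate nodes this works out to 4 workers per API service instead of the old 8.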
With these 2 changes, and the fixes which grew on them over time, we successfully reduced the memory footprint of a running devstack so that we were no longer in danger of running out of memory during a gate run. Looking at a recent tempest neutron gate job, the maximum memory consumed by a job is now just shy of 7 GB (6972 MB).
The original bug I opened as part of the earlier blog post was left open for a while even after the above changes were merged, just in case the fixes were incomplete; we wanted to be able to track it for a few days to make sure. Once the 10-day logstash window no longer showed any failures, the bug was closed.