Cutting OpenStack Gate-Time from 12 Hours to 20 Minutes*
Adapted from the October 2016 OpenStack Summit—Barcelona presentation by Melvin Hillsman (Rackspace) and Isaac Gonzalez (Intel)
In addition to keeping the world’s largest OpenStack development cloud up and running, the OpenStack Innovation Center (OSIC) operations team in San Antonio, TX, works hard to make it easier for OpenStack contributors to upstream their code. OSIC team members Melvin Hillsman and Isaac Gonzalez explain how they cut OpenStack gate time from 12 hours to 20 minutes.*
The focus of their effort was OpenStack Infra, the support services (Jenkins, Gerrit, etc.) that test contributed upstream code for the OpenStack community. OSIC provides as many virtual machines (VMs) to Infra as all the other infrastructure donors combined.
The OSIC operations team deployed OpenStack Infra on two clusters: a large bare-metal cluster of about 300 nodes for development and testing, and a smaller VM-based cluster of 22 physical nodes that offers flexibility for experimentation.
Deploying OpenStack Infra is not trivial. The team ran into issues with IPv4 versus IPv6 addressing, Raw versus QCOW image formats, and provider network priority. Because of those issues, when Infra launched groups of nodes, a significant amount of time elapsed before the nodes were ready for production-level activity: API calls to get and create services took much more time than expected, and job run times were much longer than expected.
The OSIC team remedied the issues, resulting in an overall 36X* reduction in OpenStack gate time for upstream contributors. Here is a brief summary of the issues and remedies:
Traffic was getting dropped because there were not enough IPv4 addresses: Infra needs a static IP address for each VM it launches, so giving Infra 1,000 VMs would have required 1,000 matching IPv4 addresses, which wasn’t feasible. Moving to IPv6 fixed that, but it required changes to the cloud nodes as well as to edge devices.
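As a rough sketch of the IPv6 side of that change, an IPv6 subnet can be added to the Infra tenant network with the OpenStack CLI so VMs autoconfigure addresses instead of consuming static IPv4 addresses. The network name, prefix, and SLAAC modes below are illustrative assumptions, not details from the OSIC deployment:

```shell
# Add an IPv6 subnet to the existing Infra network
# (network name and prefix are placeholders, not OSIC's values)
openstack subnet create \
  --network infra-net \
  --ip-version 6 \
  --ipv6-ra-mode slaac \
  --ipv6-address-mode slaac \
  --subnet-range 2001:db8:100::/64 \
  infra-v6-subnet
```

With SLAAC, each VM derives its own address from router advertisements, so no per-VM static assignment is needed.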
Infra uses Grafana to chart performance and capacity utilization, and those charts showed that, at first, some VMs took 30 minutes or more to build. The Raw versus QCOW image format issue emerged in part because the OSIC reference architecture uses Ceph for shared storage, to support live-migration testing. The team figured out that every time a VM was launched, the QCOW image was converted to Raw format for Ceph. The fix was to store all images in Raw format, so no conversion was needed at boot.
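The boot-time conversion can be avoided by converting each image once, up front, and uploading it to Glance as Raw. A sketch using qemu-img and the OpenStack CLI; the filenames and image name here are placeholders:

```shell
# One-time conversion from qcow2 to raw (filenames are placeholders)
qemu-img convert -f qcow2 -O raw base-image.qcow2 base-image.raw

# Upload as raw so Ceph-backed Nova can clone it at boot
# instead of converting the image every time a VM launches
openstack image create \
  --disk-format raw \
  --container-format bare \
  --file base-image.raw \
  base-image-raw
```

Raw images are larger on disk than QCOW, but with Ceph the payoff is that each VM boot becomes a fast copy-on-write clone rather than a full format conversion.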
The provider network priority issue stemmed from the OSIC upgrade from the Liberty release to Mitaka. In Liberty, with multi-homed networks, the first available NIC was set as the gateway, and it was set for IPv4 rather than IPv6. In Mitaka, the fastest available NIC is set as the gateway, and the team needed that gateway NIC to serve IPv6. By forcing the router to set the desired NIC for IPv6, the team reduced node failures by a factor of more than 100.
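One way to verify, and if necessary pin, which NIC owns the IPv6 gateway is at the routing layer on each node. The device name and link-local gateway address below are placeholders, not the OSIC configuration:

```shell
# Show which NIC currently owns the default IPv6 route
ip -6 route show default

# Pin the default IPv6 route to the intended NIC
# (gateway address and device name are placeholders)
ip -6 route replace default via fe80::1 dev bond0
```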
For more information, check out the OpenStack Summit presentation, “Using OpenStack Infra to Benchmark Your OpenStack.”
* Intel® technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. No computer system can be absolutely secure. Check with your system manufacturer or retailer or learn more at intel.com.
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.