
Working towards a more resilient OpenStack with Live Migration

OSIC Team

Over the last few years, many enterprise customers have moved application workloads into public and private clouds, such as those powered by OpenStack, and this trend is projected to grow significantly through 2020. Moving to the cloud offers customers lower costs, consolidation of their virtual estates, and the improved manageability that OpenStack provides.

Host maintenance is a common task in running a cloud: rebooting to install a security fix, patching the host operating system, or replacing hardware ahead of an imminent failure. In these cases, live migration lets the administrator move a virtual machine (VM) to an unaffected host before the disruptive maintenance is performed, ensuring almost no instance downtime during normal cloud operations.
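To make that maintenance workflow concrete, the sketch below drains a host with the openstacksdk Python library. It is a minimal illustration only, assuming admin credentials for a cloud named "mycloud" in clouds.yaml; the hostname and timeout are made up, and this is not the exact tooling used in the tests described here.

    import openstack

    conn = openstack.connect(cloud="mycloud")  # cloud name is an assumption

    # Move every VM off the host that is about to undergo maintenance.
    for server in conn.compute.servers(all_projects=True, host="compute-01"):
        # host=None lets the Nova scheduler pick an unaffected target host.
        conn.compute.live_migrate_server(server, host=None, block_migration=False)
        # Wait until the instance is ACTIVE again before moving to the next one.
        conn.compute.wait_for_server(server, status="ACTIVE", wait=600)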

During the Ocata release, the OpenStack Innovation Center (OSIC) team benchmarked live migration to discover the best way to move forward with non-impacting cloud maintenance. The team deployed two 22-node OpenStack clouds using OpenStack-Ansible to test two types of live migration:

  • One with local storage, where the team could test block storage live migration (migration of both VM memory (RAM) and disk).
  • One with a shared storage back end based on Ceph, to test non-block storage live migration (migration of VM memory (RAM) only). Because the disk resides on remote shared storage, it does not need to be copied during live migration (see the sketch after this list).
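In terms of the API, the difference between the two clouds shows up in the block_migration flag passed to the live migration call. The snippet below is illustrative only (openstacksdk again, with assumed cloud and server names), not the team's exact configuration.

    import openstack

    conn = openstack.connect(cloud="mycloud")
    server = conn.compute.find_server("app-vm-01")  # hypothetical VM name

    # Local-storage cloud: block storage live migration copies disk and RAM.
    conn.compute.live_migrate_server(server, block_migration=True)

    # Ceph-backed cloud: the disk is shared, so only RAM needs to be copied.
    # conn.compute.live_migrate_server(server, block_migration=False)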

Ceph is an open source, software-defined storage solution for scale-out block, object, and file storage. It runs on commodity hardware, is self-healing and self-managing, has no single point of failure, and integrates easily with an OpenStack cloud to provide high availability for your data.

The team then used OpenStack's Rally project to build a test suite that serially live-migrated several VMs from one host (host A) to another (host B) and then back to host A. This cycle was repeated several times to reduce the uncertainty in the results.
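Conceptually, the Rally suite boils down to a ping-pong loop like the one sketched below with openstacksdk. The host and VM names, iteration count, and timeout are assumptions for illustration; the team's actual Rally task definitions are not reproduced here.

    import time
    import openstack

    conn = openstack.connect(cloud="mycloud")
    vms = [conn.compute.find_server(name) for name in ("vm-01", "vm-02", "vm-03")]
    HOST_A, HOST_B = "compute-a", "compute-b"  # hypothetical compute hosts

    for iteration in range(5):            # repeat to reduce uncertainty
        for target in (HOST_B, HOST_A):   # host A -> host B, then back again
            for vm in vms:                # serial: one VM at a time
                start = time.monotonic()
                conn.compute.live_migrate_server(vm, host=target, block_migration=False)
                conn.compute.wait_for_server(vm, status="ACTIVE", wait=600)
                print(f"{vm.name} -> {target}: {time.monotonic() - start:.1f}s")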

Another part of the test ensured that the VMs had a suitable workload running inside them to exercise live migration. For the duration of the test, Spark Streaming ran continuously inside the VMs, performing stream processing. The workload needed to generate some disk usage, to exercise moving the disk between hosts, and enough memory dirtying to exercise the copying of memory during live migration.
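For readers who want to reproduce the general idea, a trivial stand-in workload can dirty memory and generate disk I/O, as in the toy script below. This is not the Spark Streaming job the team actually ran; the buffer size, file path, and rates are arbitrary.

    import os
    import random
    import time

    buf = bytearray(256 * 1024 * 1024)          # 256 MiB of RAM to keep dirtying

    with open("/tmp/migration_io.dat", "wb") as f:
        while True:
            # Touch random pages so the hypervisor has to re-copy them.
            for _ in range(10_000):
                buf[random.randrange(len(buf))] = random.randrange(256)
            # Generate steady disk writes for the block-migration case.
            f.write(os.urandom(1024 * 1024))
            f.flush()
            os.fsync(f.fileno())
            time.sleep(0.1)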

Before the live migration iterations started, the benchmarking tests were launched and began sending packets to the VMs. To rate the performance of the live migration operations, a benchmarker tool configured by the OSIC DevOps team measured the following KPIs: per-VM live migration time, per-VM downtime, per-VM TCP stream continuity, and per-VM metrics (CPU, network bandwidth, disk I/O, and RAM). In an upcoming white paper, we will explain why we chose these KPIs and provide further details on our results.
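The downtime and TCP continuity KPIs can be approximated with a simple probe loop like the sketch below: repeated TCP connections to a service inside the VM, with gaps between successful probes flagged as potential downtime. The address, port, and thresholds are assumptions; this is not the OSIC benchmarker tool itself.

    import socket
    import time

    VM_ADDR = ("203.0.113.10", 22)   # hypothetical VM IP address and port

    last_ok = None
    while True:
        try:
            with socket.create_connection(VM_ADDR, timeout=1):
                now = time.monotonic()
                if last_ok is not None and now - last_ok > 2:
                    print(f"gap of {now - last_ok:.2f}s detected")
                last_ok = now
        except OSError:
            pass                      # probe failed; keep trying
        time.sleep(0.5)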

To obtain reliable results, 240 VM live migrations were performed for four cases (tunneling here refers to libvirt's option to tunnel the migration data stream through the libvirtd connection rather than transferring it directly between the hypervisors):

  • Block storage live migration with tunneling disabled.
  • Non-block storage live migration with tunneling disabled.
  • Block storage live migration with tunneling enabled.
  • Non-block storage live migration with tunneling enabled.

The average time and standard deviation of VM live migration were recorded for each case. These results gave the team a sense of how much the time needed for live migration varied. The team plotted the following example graph to show these results:

[Example graph: average live migration time and standard deviation for each of the four cases]
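For reference, the reported figures are plain sample statistics over the measured durations, as in this small sketch (the numbers are placeholders, not the team's measurements):

    import statistics

    durations = [21.4, 23.1, 22.7, 25.0, 22.2]   # seconds, illustrative values only
    print(f"mean={statistics.mean(durations):.1f}s  "
          f"stdev={statistics.stdev(durations):.1f}s")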

The tests revealed two bugs, which the team addressed.

  • In the first bug, the team discovered that the live migrations were incorrectly tracked. The team successfully submitted a fix upstream, which was merged.
  • In the second bug, the team encountered a race condition that was causing failures during the cleanup after the live migration had completed. The team successfully submitted a fix upstream.

After the team applied these fixes, no live migration failures occurred in any of the test runs. The workload was then tuned to help get results that mirror what is observed in production.

The OSIC team continues to experiment with live migration and will release a more comprehensive white paper before the Boston summit.