
Do Not Use Shared Storage for OpenStack Instances

Updated December 31, 2014


The advent of cloud computing brought the cloud computing methodology and, with it, a different way of doing things. Instead of providing high availability only at the infrastructure layer, high availability now needs to exist at both the infrastructure layer and the application layer. Even though the infrastructure layer is architected to be highly available, your application should be designed to expect something at the infrastructure layer to fail, and when something does fail, it should not bring anything else down with it. This is a shared-nothing architecture.

With a shared-nothing architecture, when a compute node fails, only the OpenStack Instances on that compute node fail with it and nothing else is affected. When you introduce shared storage for OpenStack Instances into the cloud environment, the following things occur:

  1. You no longer have a shared-nothing architecture
  2. You have created a single point of failure
  3. You have added something that inhibits scale

Perhaps you are running a small OpenStack environment (2 to 8 compute nodes) and do not plan to scale. Using shared storage for OpenStack Instances still goes against the cloud computing methodology, and all you gain from it are the following three things:

  1. Ability to perform faster KVM live migrations
  2. Ability to create OpenStack Instances with larger root disks
  3. Ability to manually evacuate OpenStack Instances from a failed compute node to a healthy compute node

Let’s talk through each of these points.

First, the ability to perform faster KVM live migrations. KVM can live migrate OpenStack Instances with or without shared storage. If you do have shared storage, KVM performs a true live migration, which is significantly faster because the entire virtual machine image does not need to be copied over the network. If you do not have shared storage, KVM performs a live block migration, which copies the entire virtual machine image from the source compute node to the destination compute node and then live migrates the remaining virtual machine state. How quickly the virtual machine image is transferred from the source compute node to the destination compute node largely depends on your network link speed. It is also worth mentioning that KVM live block migration is frowned upon because of its instability and possible deprecation in favor of something else.
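
To make the difference concrete, here is a minimal sketch using python-novaclient, assuming an admin-scoped client; the credentials, server UUID, and host names are hypothetical placeholders.

```python
# Minimal sketch with python-novaclient (legacy auth signature assumed);
# credentials, server UUID, and host names below are hypothetical.
from novaclient import client

nova = client.Client("2", "admin", "password", "admin",
                     "http://controller:5000/v2.0")

server = nova.servers.get("11111111-2222-3333-4444-555555555555")

# With shared storage: a true live migration, only memory and state move.
server.live_migrate(host="compute-02", block_migration=False,
                    disk_over_commit=False)

# Without shared storage: a live block migration, which copies the entire
# disk image over the network before migrating the remaining state.
server.live_migrate(host="compute-02", block_migration=True,
                    disk_over_commit=False)
```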

So, what if a compute node requires maintenance? Do you migrate the OpenStack Instances to another compute node? If your application is cloud ready, you should not need to. If your application needs the same or additional capacity available while the compute node is down, you should be able to create new OpenStack Instances, deploy your application (with whatever configuration and deployment tools you use), and continue on with your day. Otherwise, bring down the compute node, perform the necessary maintenance, and bring the compute node back up. Your application should then be back to the state it was in before the maintenance, without any disruption.
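
Creating that replacement capacity is a single API call; a minimal sketch, again with python-novaclient, where the image, flavor, and instance names are hypothetical and your own deployment tooling would take over from there.

```python
# Minimal sketch: boot replacement capacity on healthy compute nodes,
# then let configuration management deploy the application onto it.
# Image, flavor, and instance names are hypothetical.
from novaclient import client

nova = client.Client("2", "admin", "password", "admin",
                     "http://controller:5000/v2.0")

image = nova.images.find(name="ubuntu-14.04")
flavor = nova.flavors.find(name="m1.medium")

replacement = nova.servers.create(name="app-03", image=image, flavor=flavor)
```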

Second, the ability to create OpenStack Instances with larger root disks. If you have a storage-heavy application, you can quickly consume an entire hypervisor’s local disk with one or two OpenStack Instances and waste the remaining CPU and RAM. With shared storage you more than likely have more storage available than you would on one hypervisor, so you can create an OpenStack Instance with a much larger root disk and not worry about wasting a hypervisor’s CPU and RAM. However, if you are in this situation, you would be better off not creating a large root disk and instead adding block storage to the OpenStack Instance using Cinder. With Cinder, the implications of disk I/O going over the network are the same, but you can easily detach the Cinder Volume and attach it to another OpenStack Instance, and you do not have to worry about wasting a hypervisor’s CPU and RAM.
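
A minimal sketch of that approach with python-cinderclient and python-novaclient follows; the volume size, names, device path, and server UUID are hypothetical.

```python
# Minimal sketch: create a Cinder Volume instead of a large root disk and
# attach it to an instance. Size, names, device, and UUID are hypothetical.
from cinderclient import client as cinder_client
from novaclient import client as nova_client

cinder = cinder_client.Client("2", "admin", "password", "admin",
                              "http://controller:5000/v2.0")
nova = nova_client.Client("2", "admin", "password", "admin",
                          "http://controller:5000/v2.0")

# A 500 GB block storage volume rather than a 500 GB root disk.
volume = cinder.volumes.create(size=500, name="app-data")

# Attach it to the instance; it can later be detached and reattached to
# another OpenStack Instance.
nova.volumes.create_server_volume("11111111-2222-3333-4444-555555555555",
                                  volume.id, "/dev/vdb")
```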

Third, the ability to manually evacuate OpenStack Instances from a failed compute node to a healthy compute node. This functionality is somewhat similar to VMware High Availability, where all virtual machines on a failed VMware hypervisor will automatically start up on a healthy VMware hypervisor. I say “somewhat similar” because within OpenStack this exact feature does not currently exist; it only exists as a manual, multi-step process. There is a blueprint (if there are others, let me know) to implement this exact feature, but it has not yet been completed. Why? Because it goes against the cloud computing methodology. If a compute node fails, you should not need to revive the OpenStack Instances that were running on it. As already mentioned, you should be able to create new OpenStack Instances, deploy your application (with whatever configuration and deployment tools you use), and continue on with your day.
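
For reference, the manual process looks roughly like the following python-novaclient sketch; the failed and target host names are hypothetical, and it assumes an admin-scoped client.

```python
# Minimal sketch of the manual evacuation steps; host names are hypothetical.
from novaclient import client

nova = client.Client("2", "admin", "password", "admin",
                     "http://controller:5000/v2.0")

# List every instance that was running on the failed compute node.
failed_instances = nova.servers.list(
    search_opts={"host": "compute-01", "all_tenants": 1})

# Rebuild each one on a healthy compute node; on_shared_storage=True reuses
# the existing disk image rather than rebuilding from the base image.
for server in failed_instances:
    nova.servers.evacuate(server, host="compute-02", on_shared_storage=True)
```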

So, using shared storage for OpenStack Instances brings these three features to your cloud environment at the expense of violating the cloud computing methodology, creating a single point of failure, and inhibiting your ability to easily scale your cloud environment. If your applications need these features then they probably do not belong in a cloud environment.

Jesse Proudman, Founder and CTO of Blue Box, has written a great post titled Live Migration is a Perk, not a Panacea that further elaborates on why live migration is not prevalent in cloud infrastructures.

At some point in the distant future all applications will probably be cloud ready and will not require these or other similar infrastructure-level high availability features, but that time is not now. There is still a need for both traditional virtual environments and cloud environments. With that comes the Hybrid Cloud, but that’s a discussion for another time.