EBS, a very well done solution to one of the hardest computer science problems

Real cloud storage lessons from the AWS outage

It’s been very interesting watching the online firestorm over Amazon’s EBS outage.  I was not at the recent Interop show, but apparently there was an entire panel discussion about it, followed by a Twitter flame war between representatives from VMware and Amazon.  Then there were countless articles and blogs, all of which focused on some questions of mild interest with obvious answers.

  • Is EBS a good or bad service?
  • Will this affect people’s move to public cloud?

The answers to these are pretty uncontroversial, in my opinion.

  • EBS is a very well done solution to one of the hardest computer science problems out there – how do you construct an infinitely scalable storage service out of commodity disk for read/write transactional data with strict consistency?  The fact that EBS has gone this long without an outage of this magnitude is a tribute to the AWS team.  They clearly made some poor choices, but they will surely fix those over time.  However, few seem to be paying attention to what AWS has done correctly.  The naysayers would be hard pressed to name a production service with EBS’ characteristics deployed at EBS’ scale.
  • The best online comment on the topic of the impact to public cloud adoption likened this to an airline crash.  Airplanes crash from time to time, and those crashes always make for sensational news.  But, in terms of cost and safety, air travel remains the best way to travel long distances, so people forget the crash and keep flying.  There is a segment of people who will be nervous for some time with public cloud for some data and apps, but the trend towards public cloud will continue as before.

While all of the uproar is quite entertaining, it is not useful.  As an industry, we can be a bit more thoughtful than this.  This blog post is an attempt to get to the real lessons of this incident, which center on the industry’s transition to scale out storage models for transactional storage in the cloud.

The scale out model is undoubtedly the right architecture for cloud storage in general, especially where eventual consistency is sufficient.  It provides:

  • A highly virtualized interface – it’s one big pool of storage where placement across even thousands of nodes is completely automated
  • Great aggregate performance
  • Flexibility in the face of arbitrary failures
  • The ability to grow steadily in small increments of commodity parts as the cloud itself grows, rather than in massive chunks of proprietary equipment

But, when applied to transactional workloads, there are issues with scale out storage that (a) the vendor community still needs to work out and that (b) customers must be aware of and plan around when they take the leap.

So, why is scale-out storage for transactional workloads so hard?  To be useful to clouds, the scale-out transactional storage system needs to have the following qualities with respect to traditional enterprise storage.

  • Reliability: Cloud storage needs to be almost as reliable as enterprise storage – four or five 9’s is called for.  In this blog, I’ll speak of local availability only, not cross-site DR – that’s a topic for another day.
  • Cost: The cost of scale-out storage is expected to be considerably cheaper than enterprise storage.  Keep in mind that a lot of the enterprise storage price is in software, services, and margin.  For big cloud deals, storage vendors will negotiate down closer to the real cost of the system and/or provide leasing plans.  So, a good scale-out deployment needs to actually be cost conscious and can’t just rely on the promise of commodity parts.
  • Consistency: When using read/write transactional storage, a committed write must really be committed.  Eventual consistency does not cut it.  Any write must be truly safe as even a minute or less of data loss can be a fatal issue.
  • Performance: The transactional performance for both reads and writes must be usable, even if somewhat lower than more expensive enterprise storage systems.

With a requirement for strict consistency, protection of storage against loss and inaccessibility comes from either RAID or synchronous replication across multiple enclosures.  When you really trust an enclosure, the way people (reasonably or unreasonably) trust traditional enterprise storage, you can use RAID and minimize disk proliferation – something like 20% extra disk is a reasonable price to pay for your five 9’s.  When you don’t trust the enclosure because it’s a cheaper commodity system, you start looking at RAID over the network or erasure codes.  The challenge with this strategy is that performance can be abysmal, especially in degraded mode, where each read requires too many network accesses and parity calculations.  In order to (a) make sure your data is never lost because a critical set of commodity disks and/or enclosures fails before rebuilds can complete while (b) maintaining adequate performance, you start mirroring the data over the network multiple times, usually 3x in scale-out systems targeted at the enterprise.  This drives up the actual cost – think TCO: disks, enclosures, power, cooling, footprint, and so on.  It is notable that the mirroring system was the mechanism that brought EBS to its knees during the outage.  So, the scale-out vendor is always balancing cost, consistency, and reliability – you can get any two, but not all three at once.
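To put rough numbers on the disk proliferation described above, here is a small back-of-the-envelope sketch. The 20% RAID overhead and the 3x mirroring factor come from the paragraph above; reading "20% extra disk" as a 1.2x multiplier, the 100 TB usable target, and the function name are my own assumptions for illustration.

```python
# Raw disk needed to deliver a given amount of protected, usable capacity.
# Overhead factors are taken from the text; the 100 TB target is arbitrary.

def raw_capacity_tb(usable_tb: float, overhead_factor: float) -> float:
    """Raw capacity required to deliver `usable_tb` of protected storage."""
    return usable_tb * overhead_factor

USABLE_TB = 100.0
RAID_FACTOR = 1.2    # "something like 20% extra disk" inside a trusted enclosure
MIRROR_FACTOR = 3.0  # three full copies mirrored over the network

raid_raw = raw_capacity_tb(USABLE_TB, RAID_FACTOR)
mirror_raw = raw_capacity_tb(USABLE_TB, MIRROR_FACTOR)
disk_ratio = mirror_raw / raid_raw  # how many times more raw disk mirroring needs

print(f"RAID: {raid_raw:.0f} TB raw, 3x mirror: {mirror_raw:.0f} TB raw, "
      f"ratio {disk_ratio:.1f}x")
```

Under these assumptions the mirrored system needs two and a half times the raw disk, which is why the commodity parts have to be much cheaper per TB before the TCO math works out.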

Even when you achieve an acceptable balance of the first three considerations, performance can still be an issue due to economics.  To keep costs low in the face of the relatively expensive replication system, there is a temptation to pack high-density storage very tightly, reducing the effective IOPS per GB and making contention a significant problem.  For all that is good about EBS, you often hear customers complaining about its performance, both in terms of maximum throughput and in terms of variability over time, even when there is no news-making outage.
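The density effect is easy to see with a quick calculation. The drive figures below are hypothetical round numbers chosen only to show the shape of the problem, not measurements of EBS or of any real product.

```python
# Effective IOPS per GB falls as you pack more capacity behind each spindle.
# All figures are hypothetical illustration values.

def iops_per_gb(spindle_iops: float, drive_gb: float) -> float:
    """IOPS available per GB stored on a single drive."""
    return spindle_iops / drive_gb

SPINDLE_IOPS = 150.0      # assumed random IOPS one spindle can deliver
SMALL_DRIVE_GB = 500.0    # assumed lower-density drive
DENSE_DRIVE_GB = 2000.0   # assumed high-density drive

small = iops_per_gb(SPINDLE_IOPS, SMALL_DRIVE_GB)
dense = iops_per_gb(SPINDLE_IOPS, DENSE_DRIVE_GB)
density_penalty = small / dense  # how many times fewer IOPS each GB gets

print(f"{small:.3f} vs {dense:.3f} IOPS/GB, penalty {density_penalty:.0f}x")
```

With the same spindle behind four times the capacity, each GB gets a quarter of the IOPS, so tenants sharing that spindle contend for the same small budget.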

All of this is just about the characteristics of the storage itself and does not deal with operational issues, which is where EBS really ran into trouble.  Again, we’re not talking about TB or PB of storage, we’re talking about operations approaching the exabyte scale.

All the operations need to be completely automated – provisioning, placement, and failure response.  All this automation can be implemented in one of two places:

  • Independently on each storage node
  • Through centralized controllers

The storage nodes in most scale-out systems generally just store and serve data.  When data is sent to them, they store it.  When they get a read request, they serve it.  When they are given a replication partner, they send their data over.  People try to avoid putting too much logic in the storage nodes because (a) they want storage nodes to focus on streaming data and (b) if storage nodes were doing too much thinking, they’d all need to coordinate, making for a potentially unsolvable peer-to-peer coordination problem.  Therefore, most instruction comes from one or more control systems.

In general, storage provisioning and placement operations (for primary copies, initial replicas, and new replicas after a failure), as well as data lookups, are done through more centralized controllers.  Some requirements here are as follows:

  • With thousands and thousands of users, you can’t have a single control node.  You need to have the control system itself scale out to many, many nodes (though certainly fewer than the number of storage nodes).
  • The control system needs to be even more accessible than the data.  You never want a situation where the control service nodes all die, get confused, lose metadata, or are simply starved for resources, because when that happens, all administrators and end users lose the ability to interact with the storage system as a whole.
  • The algorithms of the control system need to be very clever since they are controlling thousands and thousands of individual storage nodes, which generally obey even inappropriate and/or heavyweight commands faithfully. 

If you read Amazon’s well written, open, and frank EBS post-mortem, you know how and where these guidelines were violated and where the EBS team will undoubtedly be placing their efforts to improve the service over time.  But, for you, the cloud builder, here is what you need to talk about when you talk to a scale-out storage provider.

1) What tradeoffs were made between reliability, cost, and consistency?   If you need strong consistency on transactional data, find out what the uptime guarantees are, and what the implications are for overall system cost.  Dig deep into any uptime guarantees.  Make sure you understand the assumptions regarding probability of individual failures and adjust those assumptions if they do not apply exactly in your datacenter.

2) What is the price per usable GB and per IOP?   If you are building a big enough cloud and can negotiate a great price from an enterprise storage vendor, make sure the scale-out system is cost-competitive even though they will be using many more disks.   Think about TCO – don’t forget about the power, cooling, and footprint costs that come along!  This is not to say that scale-out is more expensive than traditional storage, or that you should not go for it if you don’t get the savings you hope for.  But you should double check all the math and make sure the TCO is what you expect.

3) What is the performance of the transactional storage, both in normal mode and in degraded mode?   Make sure that they are not assuming a lower spindle-to-IOPS ratio than is reasonable (like EBS) to give you a rosy picture on price.  Assume your transactional storage will actually be accessed, forcing you to increase spindles, use less dense storage, and/or have a really good caching/tiering/ILM story.

4) What are the assumptions of the storage system?   The EBS design assumed that their redundant network would always be available and that there would never be a general loss of connectivity from all to all.  Has your scale-out vendor designed for this eventuality and tested it at scale?  What other datacenter assumptions are they making?

5) What happens with split-brain at scale?  Traditional enterprise storage is very simple in this area.  Local availability is handled inside a single chassis and DR is done with dedicated replication partnerships.   It is inflexible and not responsive to changing conditions.  Scale-out storage is way better in this regard, but if not done right, the flexibility of the scale-out system can backfire, just like in the EBS case where all nodes tried to re-establish replicas of all data at once.

6) Does the storage understand temporary vs. permanent outages?  If so, what if something that appeared permanent turns out to be actually temporary?   Can your storage system react to the return of service in a reasonable way, especially when the permanent failure response is very heavyweight?  EBS, unlike traditional enterprise storage, kept re-mirroring to new nodes rather than simply sync’ing back up with old mirrors when they once again became accessible.

7) Can the control system guarantee access to users and administrators?  In the EBS outage, the automated failure response overloaded the control service, which is what actually affected all users, even if they had properly replicated their data between availability zones.

8) Are your availability zones really isolated?  In EBS, there was a shared resource between availability zones.   This is what made the impact of the response to #7 so bad.

9) Does the automation know when to stop trying something?   Once it was clear that no more space was to be had and that the control systems were not responsive, the automated re-protection kept going.  Sometimes, like people, software needs to stop, take a breath and let the situation cool down.  And even though this is cloud, when the storage is in this state, it’s best to have it ask for administrator intervention rather than continuing to try to do the impossible repeatedly.

10) Are failures graceful, even the unlikely ones?   The EBS system had a corner case that crashed the nodes rather than failing an operation gracefully.  In most software, you can get away with letting those corner cases go, but when approaching the exabyte scale, you can’t.  Make sure your vendor has good software engineering practices here.

11) Are there good fail-safes?   The EBS outage started to get better when the EBS admins were able to stop some of the communication and get out of the vicious cycle.  Does your scale-out vendor have similar controls to allow you to manually stop heavyweight operations that you, as the cloud operator, determine need stopping for the sake of the cloud as a whole?

12) Are the requirements for the end customer documented?  After the outage, Amazon put out some excellent documentation on building cloud applications that everyone should read.  Does your scale-out system, due to performance or reliability tradeoffs, require end users to use the storage system in any specific and non-obvious ways?  If so, make sure those are clearly documented so you can educate your end users.
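To make questions 6 and 9 concrete, here is a minimal sketch of the behavior you would hope to hear described. It is not how EBS or any particular product works; the names, the five-minute grace period, and the backoff limits are all hypothetical. The first function prefers a cheap resync when an old mirror comes back, and the second retries re-protection with exponential backoff, then stops and asks for an administrator.

```python
import time
from enum import Enum
from typing import Callable


class Recovery(Enum):
    WAIT = "wait"
    RESYNC_OLD_MIRROR = "resync with the returned mirror"
    REMIRROR_TO_NEW_NODE = "build a new replica elsewhere"


def choose_recovery(peer_reachable: bool, seconds_unreachable: float,
                    grace_seconds: float = 300.0) -> Recovery:
    """Question 6: take the heavyweight path only when the outage looks permanent."""
    if peer_reachable:
        # The "permanent" failure turned out to be temporary: catch up, don't copy.
        return Recovery.RESYNC_OLD_MIRROR
    if seconds_unreachable < grace_seconds:
        return Recovery.WAIT
    return Recovery.REMIRROR_TO_NEW_NODE


def reprotect_with_limit(try_once: Callable[[], bool], max_attempts: int = 5,
                         base_delay: float = 1.0, max_delay: float = 60.0,
                         sleep: Callable[[float], None] = time.sleep) -> str:
    """Question 9: retry with exponential backoff, then stop and escalate."""
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        if try_once():
            return "reprotected"
        if attempt < max_attempts:
            sleep(delay)
            delay = min(delay * 2, max_delay)
    return "needs_admin"


# Demonstration: re-protection never succeeds, and a recording stand-in
# replaces sleep so nothing actually waits.
waits = []
outcome = reprotect_with_limit(lambda: False, max_attempts=4, sleep=waits.append)
print(outcome, waits)
```

The point of the `needs_admin` outcome is that the automation hands control back to a human instead of hammering an already overloaded control service.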

While items 4-10 in this list derive from the EBS problems, this blog posting should not be seen as anti-EBS.   With EBS, Amazon has created something unique in the industry, a massive read/write transactional storage system with strong consistency that can be operated by a reasonably sized IT staff.   Its major outage was the first of this level of seriousness in years and the long-term effects have been quite minimal.   The success of EBS has influenced the rise of a plethora of scale-out storage startups that want to give you something EBS-like in your datacenter.   It has also scared the traditional storage vendors on technology and pricing and pushed them to innovate in a way they’ve not done in a long time – see their recent product announcements and M&A activity.  EBS is a great service that will only get better.

While EBS’ failure in this case was spectacular, in a way it was fortuitous for the cloud industry because it educates us on what to look for in storage vendors.  Hopefully, the scale-out storage vendors have been paying attention as well.  They can learn important lessons about operations at massive scale without needing to do a very expensive real-world QA and without causing an outage for a paying customer.   These lessons should be the focus of our attention, not the drama.

Taking Advantage Of Multi-Tenancy To Build Collaborative Clouds

When one hears of the advantages of cloud computing, the same benefits come up again and again.

  • The IT consumer gets real agility. This means instant response times to provisioning and deprovisioning requests – no red tape, no trouble tickets – just go.  The consumer also gets a radically different economic model – no pre-planning, no reservation, no sunk costs – the consumer uses as much as they want, grows and shrinks in whatever size increment they want, and keeps hold of the resources for only as long as they want.  Lastly, the consumer gets true transparency in their spending – each cent spent is tied to a specific resource used over a specific length of time.
  • If a proper cloud infrastructure is built, acquired, or assembled, the operations costs for the datacenter administrator are much lower than with traditional IT. Cloud infrastructure software, if done right, gives scale-out management of commodity parts by introducing (a) load balancing and rapid automated recovery of stateless components and (b) policy-based automation of workload placement and resource allocation.  Customer requests automatically trigger provisioning activity, and if anything goes wrong, the system automatically corrects.  The datacenter admin is relieved of the day-to-day burdens of end user provisioning and break/fix systems management.

The challenge in this world stems from the fact that for all this to be delivered, clouds must span organizational units. There needs to be economy of scale to drive down costs. There need to be many workloads from multiple customers peaking at different times so that the “law of large numbers” yields high utilization and predictable growth. Once you have multiple customers on the same shared infrastructure, you get the inevitable concerns: Is my data secure? Do I have guaranteed resources? Can another tenant, through malice or accident, compromise my work?
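The utilization argument can be shown with a toy calculation. The demand figures below are invented for illustration: three hypothetical tenants whose load peaks in different time slots.

```python
# Capacity needed when tenants are provisioned separately versus pooled.
# Demand per time slot is made-up data; tenant names are hypothetical.

tenant_demand = {
    "retail": [2, 3, 8, 4],
    "batch":  [7, 2, 1, 2],
    "dev":    [1, 6, 2, 3],
}

# Isolated infrastructure: each tenant must be sized for its own peak.
sum_of_peaks = sum(max(demand) for demand in tenant_demand.values())

# Shared infrastructure: size for the peak of the combined demand.
combined = [sum(slot) for slot in zip(*tenant_demand.values())]
shared_peak = max(combined)

print(f"separate: {sum_of_peaks} units, shared: {shared_peak} units")
```

In this toy case the shared cloud needs roughly half the capacity, and the more tenants with uncorrelated peaks you add, the smoother the combined curve tends to become.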

Clouds, both public and private, strive to provide secure multi-tenancy. Each service provider and each cloud software vendor promises that every tenant is completely isolated from every other tenant. Obviously, different providers do this with varying levels of competency and sophistication, but there is no controversy regarding the need for this isolation.

Once you are comfortable with your cloud’s isolation strategy, though, you should turn around and ask, “How do I take advantage of multi-tenancy?”  We live in an ever more interconnected world and different organizations need to collaborate on projects large and small, short-term and long term. If two collaborators share a common cloud, or two or more clouds that can communicate with each other, shouldn’t the cloud facilitate controlled and responsible sharing of applications and data? Shouldn’t we turn multi-tenancy from the cloud’s biggest risk into its biggest long-term benefit?

To answer this challenge, we need to ask

  1. Why would we need to do this?
  2. Are there any specific examples of this today?
  3. How would we go about achieving a more generalized solution?

First, why would we do this?   There are many examples in many sectors.

  • Within large enterprises, different business units generally need to be isolated from one another, for privacy or regulatory reasons, or simply to keep trade secrets on a need to know basis. But, when large cross-functional teams are asked to deliver a complex project together, sharing becomes necessary.
  • Also in business, external contractors are used for some projects. How can they work as truly part of the team for one assignment, while being safely locked out of all other projects?
  • In education, universities collaborate on some projects and compete on others. How can the right teams work together openly while others are completely isolated?
  • In government and law enforcement at all levels, collaboration can save lives and property, but proper separation must be enforced to protect civil rights and personal privacy.
  • In medicine, doctors and insurance need to share certain records and results in order to streamline care, facilitate approvals, and reduce mistakes.   But, privacy must be protected with only the proper and allowed sharing taking place.

Since this seems like a nirvana state, the second question is what is practically being done along these lines today. To this, I would say that the SaaS providers have been on this path for some time. Google Calendar allows you to selectively share your schedule in a fine-grained manner – who can see your availability, who can see your details, and who can edit your meetings. LinkedIn allows you to share your profile at varying levels of depth and regulate inbound messages based on your level of connection and common interests.

This leads to the third question – how can we do this more generally? How can a single cloud or a group of clouds facilitate generic sharing of any application or data without breaking the base isolation that multi-tenancy generally requires? Obviously, in a blog we can’t answer in gory detail, but we can discuss some high-level requirements.

1. Recognize distributed authority and have a permissions scheme that models this well

In all the examples we discussed in the “why” section, there was no shared authority. From the point of view of someone who wants to access something of someone else’s, there are two completely different and independent sources of authority. First, does my manager authorize me to be working on this project with these collaborators? Second, do those collaborators want to share with me, what exactly do they want to share, and what level of control over their objects do they allow me? A cloud that facilitates collaboration must have a permissions system that allows these different authorities to independently delegate rights without the need for an arbitrating force. Imagine if two government agencies needed to go to the president to settle an access control issue.  With doctors and insurance companies, who would a central authority even be? Once you have a permissions system capable of encoding multiple authority sources, you need the ability to apply that system to compute, storage, and network resources. You need to apply it to data and applications. You need to apply it to built-in cloud services and third party services.
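As a rough illustration of the two independent authorities described above, here is a minimal sketch. It is not Nimbula Director’s permissions model; the class, method names, and example identities are all hypothetical. Access is allowed only when the requester’s own organization has assigned them to the project and the resource owner has separately granted the specific action, with no arbiter sitting above both.

```python
from dataclasses import dataclass, field
from typing import Dict, Set, Tuple


@dataclass
class CollaborationPermissions:
    # Authority 1: (user, project) pairs approved by the user's own management.
    assignments: Set[Tuple[str, str]] = field(default_factory=set)
    # Authority 2: (resource, user) -> actions granted by the resource's owner.
    grants: Dict[Tuple[str, str], Set[str]] = field(default_factory=dict)

    def assign(self, user: str, project: str) -> None:
        self.assignments.add((user, project))

    def grant(self, resource: str, user: str, action: str) -> None:
        self.grants.setdefault((resource, user), set()).add(action)

    def is_allowed(self, user: str, project: str, resource: str, action: str) -> bool:
        approved_by_manager = (user, project) in self.assignments
        granted_by_owner = action in self.grants.get((resource, user), set())
        return approved_by_manager and granted_by_owner


perms = CollaborationPermissions()
perms.assign("dr_lee", "claims-review")              # the hospital's decision
perms.grant("claim-1234", "dr_lee", "read")          # the insurer's decision

can_read = perms.is_allowed("dr_lee", "claims-review", "claim-1234", "read")
can_edit = perms.is_allowed("dr_lee", "claims-review", "claim-1234", "edit")
print(can_read, can_edit)
```

Each side can add or withdraw its half of the permission on its own schedule, which is the property the doctor and insurer example needs.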

2. Provide extremely flexible networking connectivity and security

Permissions speak to who can do what on what objects shared on a cloud network. The next part is about the network traffic itself. The cloud needs to govern connectivity in a secure, but still self-service manner. It will be impossible to build a responsive and agile collaborative environment over legacy VLANs and static firewalls. Once collaboration is set up politically, project owners need to be able to flip the switch to start the communication flow immediately. If a project ends, they need to be able to turn it off just as quickly if not faster. Given a project that already has network connectivity, as that project expands, new workloads added to the project need to be instantly granted the same network access as all the other workloads. For all this to happen, there need to be network policies that govern communications. These policies need to instantly regulate all new workloads on the cloud.  They need to be created, destroyed, and modified by the actual collaborators, not network admins. Lastly, these policies need to be governed by the collaborative permissions system described in requirement #1 so that proper governance is achieved without requiring a common authority.
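One way to picture such policies is to attach them to projects rather than to individual machines or VLANs. The sketch below is a hypothetical in-memory model, not a real networking API, and it leaves out the permission checks from requirement #1 for brevity. Because rules name projects, a workload added later inherits access the moment it joins, and revoking the rule cuts everything off at once.

```python
from typing import Dict, Set, Tuple


class NetworkPolicies:
    def __init__(self) -> None:
        self.members: Dict[str, Set[str]] = {}    # project -> workloads
        self.allowed: Set[Tuple[str, str]] = set()  # (source project, destination project)

    def add_workload(self, project: str, workload: str) -> None:
        self.members.setdefault(project, set()).add(workload)

    def allow(self, src_project: str, dst_project: str) -> None:
        self.allowed.add((src_project, dst_project))

    def revoke(self, src_project: str, dst_project: str) -> None:
        self.allowed.discard((src_project, dst_project))

    def can_connect(self, src_workload: str, dst_workload: str) -> bool:
        # Traffic is permitted only if some rule covers both workloads' projects.
        return any(
            src_workload in self.members.get(src, set())
            and dst_workload in self.members.get(dst, set())
            for src, dst in self.allowed
        )


policies = NetworkPolicies()
policies.add_workload("genomics-lab", "lab-vm-1")
policies.add_workload("partner-university", "partner-vm-1")

before = policies.can_connect("lab-vm-1", "partner-vm-1")
policies.allow("genomics-lab", "partner-university")        # flip the switch
after_allow = policies.can_connect("lab-vm-1", "partner-vm-1")

policies.add_workload("genomics-lab", "lab-vm-2")            # project expands
new_workload = policies.can_connect("lab-vm-2", "partner-vm-1")

policies.revoke("genomics-lab", "partner-university")        # project ends
after_revoke = policies.can_connect("lab-vm-1", "partner-vm-1")
print(before, after_allow, new_workload, after_revoke)
```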

3. Have a way to extend these systems across clouds

Once you have a permissions model and a networking model that work within a cloud, you need to extend those functions to work across clouds so that multiple organizations can share their resources amongst each other, not just when they share a common public or community cloud, but even when hosted in their own separate private clouds. For this to happen, identity must be agreed upon. User permissions from one cloud must be trusted by the second cloud so that those permissions can be mapped against what has been delegated by that second cloud. The networking policy mechanisms must be transferable across the Internet and take into account various levels of routing, NAT’ing, and firewalling.

Nimbula believes that we are on the path to providing general purpose collaborative clouds. Our flagship product, Nimbula Director, is architected to deliver this value in the long term and has taken substantial steps in this direction in our generally available 1.0 release.

The Cloud Ecosystem

Nimbula’s co-founder and VP of Products, Willem van Biljon, spoke at the recent Cloud Connect event in Santa Clara. Here are some of the points Willem made during his talk. The video of the full talk is available online at http://bcove.me/6fllnnzg

Building a proper cloud, whether it is a private or public cloud, is more than buying and implementing a product. It is a rather complex architecture with many interrelated pieces that need to be considered. Ultimately, it is about a whole bunch of things that need to work together.

So, what is involved to make this work?

  • Compute and Storage hardware
  • Networking infrastructure
  • A Cloud Operating System, something that will make all of the infrastructure accessible to the outside world. 
  • On top of that, the various services that people are going to need (PaaS, SaaS, etc.)
  • Alongside we need some management infrastructure, billing, external storage or compute resources, etc.

So overall, it is a pretty large ecosystem and many vendors and products come into play.

The Infrastructure as a Service (IaaS) layer provides the software that gives control of the hardware layer, just like a traditional operating system but across a large set of hardware. The issues we think are important are:

  • Scale: lessons learned from large scale matter at any scale. Large properties like Google, Amazon or Yahoo learned lessons that we can apply to all data centers
  • Automation: low cost implies low human touch
  • Resource management: who gets what
  • Permission / policy management: who can get what

If we look at the hypervisor, the first lesson is that the hypervisor is not the Cloud OS. It is an essential component, but not all of it. In particular, it does not provide resource management across multiple machines. The hypervisor market is rapidly maturing and one should not build applications or a cloud architecture that rely on a specific hypervisor. 

Large enterprises have shown that commodity hardware can lower costs. The magic is in the software, not the hardware: design the application for commodity hardware and you can dramatically lower costs.

In the network, as applications are no longer bound to specific servers, the topology no longer defines security. The network security now needs to be configured automatically and managed dynamically. 

How do I federate to other people’s cloud – whether private or public? There are a number of key challenges around the API, the identity that I need to present, the data that I need to move and the application environment in which the virtual machine will execute. Of all of these, identity is probably the main challenge to address. 

Billing is about getting money back for the resources that are consumed. It generally breaks down into three elements: first, you need to be able to properly measure and meter what is used; second, assign proper rates to the various resource elements; and finally, generate a bill. The important element is finding and assigning the appropriate rate for a given resource – that is where data is transformed into business value.
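The three billing elements can be sketched in a few lines. Everything here is invented for illustration: the tenants, resource names, quantities, and rates are hypothetical, and a real system would meter continuously rather than from a fixed list.

```python
from collections import defaultdict
from decimal import Decimal
from typing import Dict, List, Tuple

# Step 1: metered usage as (tenant, resource, quantity) records.
UsageRecord = Tuple[str, str, Decimal]
usage: List[UsageRecord] = [
    ("alice", "vm_hours", Decimal("10")),
    ("alice", "gb_months", Decimal("50")),
    ("bob", "vm_hours", Decimal("3")),
]

# Step 2: a rate per unit of each resource (hypothetical prices).
rates: Dict[str, Decimal] = {
    "vm_hours": Decimal("0.10"),
    "gb_months": Decimal("0.05"),
}


# Step 3: generate a bill per tenant. Decimal avoids float rounding on money.
def generate_bills(records: List[UsageRecord],
                   rate_card: Dict[str, Decimal]) -> Dict[str, Decimal]:
    bills: Dict[str, Decimal] = defaultdict(Decimal)
    for tenant, resource, quantity in records:
        bills[tenant] += quantity * rate_card[resource]
    return dict(bills)


bills = generate_bills(usage, rates)
for tenant, amount in sorted(bills.items()):
    print(f"{tenant}: {amount}")
```

The rate card is the only part that encodes business judgment; metering and bill generation are mechanical once it exists.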

There is a massive amount of data on enterprise systems today and there is an equally massive opportunity to re-architect that storage to use cheaper systems. There is no simple, one-size-fits-all answer. The key is balance: figure out where you need today’s high-end enterprise storage and where you need the lower-cost, highly scalable newer storage systems.

So in conclusion, the cloud ecosystem has many components and many issues per component. We believe that one should start by focusing on the key issues per component and find the right answer for each part.

Taking Advantage of Public and Private Clouds Requires the Right Cloud Management Software

Cloud computing is just a few years old, but it has already given rise to two separate approaches and architectures: one public, like Amazon Web Services, the other private, usually inside a corporate data center. Computer users assigned to business units are attracted to the direct access and easy provisioning of the public cloud, since servers can be up and running in a few minutes. IT organizations, on the other hand, value the security and control they associate with private clouds, and worry about the proliferation of public cloud instances and its potential impact on corporate data and security policies. It’s a familiar tug-of-war.

Successful businesses have lately come to realize that both public and private clouds have advantages, and want to be able to use both of them when appropriate. Consider Intuit: the software company does the load testing for its online TurboTax program on servers at Amazon; because real customer data is not being used, there are no regulatory or privacy issues. However, once the software is made available to the public, it runs on Intuit’s on-premises machines, as one would expect for information of such a sensitive nature.

Being able to move between public and private clouds in this manner requires the right kind of cloud management software, a true “Cloud Operating System” that doesn’t take a one-size-fits-all approach to cloud architecture. Instead, it must make use of, when appropriate, the growing number of cloud technologies the marketplace is accepting.

In a properly designed Cloud Operating System, an application runs in either the public or the private cloud depending on the application itself, in connection with company policies. These policies might involve, for example, the kinds of data the application uses, or the extent to which the application is mission-critical to the organization.

The actual placement of an individual application’s workload in either the public or private cloud should occur automatically and transparently to end users. Be they in IT or in business units, users should concern themselves only with choosing the proper policy for the workload. Cloud management software should then take over, determining where precisely in the public-private cloud ecosystem the program will run.
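As a sketch of what policy-driven placement might look like, the user below declares only two facts about a workload and the software picks the destination. The two policy fields mirror the examples in the text (kinds of data used, degree of mission-criticality); the class names and the specific rule are assumptions of mine, not any product’s actual logic.

```python
from dataclasses import dataclass
from enum import Enum


class Cloud(Enum):
    PUBLIC = "public"
    PRIVATE = "private"


@dataclass(frozen=True)
class WorkloadPolicy:
    uses_sensitive_data: bool  # e.g. real customer records
    mission_critical: bool


def place(policy: WorkloadPolicy) -> Cloud:
    """Keep sensitive or mission-critical workloads private; send the rest public."""
    if policy.uses_sensitive_data or policy.mission_critical:
        return Cloud.PRIVATE
    return Cloud.PUBLIC


# Loosely modeled on the Intuit example: load tests without real customer
# data can go public, while the production service stays on-premises.
load_test = place(WorkloadPolicy(uses_sensitive_data=False, mission_critical=False))
production = place(WorkloadPolicy(uses_sensitive_data=True, mission_critical=True))
print(load_test.value, production.value)
```

The user never names a cloud; changing company policy means changing `place`, not retraining every user.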

This means that, to be effective, Cloud Operating System software needs to shield users from the multitude of different command systems they currently need to master to move between public and private clouds. Instead, the software must present a unified user experience, with the same authorization, access control, and interfaces regardless of the workload’s final destination. Users can focus on their workload needs using credentials set up centrally by IT. That protects the enterprise from employees disclosing their credentials to others, or worse, taking them with them when they leave the organization.

A Cloud Operating System must also give users a painless way to move data and applications back and forth between public and private clouds. That’s a seemingly straightforward task, but one whose current complexity routinely leads to lengthy and unexpected delays in what IT workers had assumed was going to be a straightforward migration process.

So how might these hybrid public-private architectures play out in an enterprise? Traditional mission-critical ERP programs are less likely to migrate to new cloud infrastructures just yet. That’s because these programs have strict requirements for stability and fault tolerance and their data is subject to stringent regulatory and compliance regimes. In addition, the programs themselves do not require the constant changing and updating that can occur so easily in a cloud environment. ERP customers are much more concerned about keeping the programs running stably than they are with making daily adjustments to the underlying infrastructure. While mission-critical workloads won’t be the first ones that IT will move to cloud infrastructures, they will clearly be candidates for the private cloud in the second phase of cloud adoption.

By contrast, programs built on new generations of Web-based development environments, such as Ruby on Rails, are perfect candidates for internal clouds right away. Whether you are in a development and test environment or beginning work with a new Platform as a Service or Software as a Service offering, Cloud Operating System technologies will bring a new level of agility and flexibility to your organization. You can scale your infrastructure as fast as you can stack racks of hardware without having to bother with the lengthy server provisioning cycles once associated with IT deployment.

Of course, you can also use third party cloud resources like Amazon to complement your own infrastructure when doing so makes sense. Intuit used the cloud for testing; some companies move to the cloud to meet seasonal demands, or to run one of the many commercial SaaS offerings becoming available. Cloud management software can transform the public cloud from a rogue resource snuck in the back door by business units trying to circumvent IT into a viable business tool, properly integrated into an enterprise’s systems.

There are a few more things that IT managers need to be aware of when choosing cloud management software besides its ability to handle both public and private clouds. Has the software been designed from the ground up to deal with the complexities of today’s computing environments or are those features bolted-on as an afterthought to software initially designed simply to set up virtual machines? How much does it automate the time-consuming, repetitive manual tasks often associated with creating and configuring virtual machines? And can it scale up as effortlessly as modern IT operations are discovering they need to?

IT managers will need to deal with those issues, too, as they make a decision about cloud management software. But at the very least, they need to make sure that when they ask a cloud management vendor if they are public or private, the answer they hear back is “Yes.”