Containers and persistent data
Increasingly, users want to containerize their entire infrastructure, which means not just web servers, queues, and proxies, but database servers, file storage, and caches as well. In fact, one of the first questions from the audience at the recent CoreOS Fest during the application container (appc) specification panel was: "What about stateful containers?" These services require persistent data, or "state", that can't be casually discarded. As Tim Hockin of Google said: "Ideally everything is stateless. But there has to be a turtle at the bottom that holds the state."
Storing persistent data for containers has been a chronic issue since Docker was introduced, because the initial design for Docker simply didn't deal with the concept of data that needs to outlive the container runtime and that can't be easily moved from one machine to another. Docker's only concession to the need for state in version 1.0 was to allow "volumes": external filesystems that the container could access, but that were otherwise unmanaged by the Docker system. The conventional wisdom was not to put your data into containers.
According to the panel, the appc spec may take this a step further by specifying that the entire container image and all of its initial files be immutable, except for specific configured directories or mounts. The idea is that any data that administrators care about needs to be put in specific, network-mounted directories, anyway. Managing these directories will then be left to orchestration frameworks.
A reader would be forgiven for thinking that both Docker and CoreOS had decided to ignore the issue of persistent data for the time being. Certainly a lot of developers in the "world of containers" seem to think so and are working to fill that gap. They described some of their projects to deal with persistent data at ContainerCamp and CoreOS Fest. One solution is to make distributed data storage trustworthy, so that administrators don't have to worry about data being persistent on any specific node.
The Raft consensus algorithm
A reliable distributed data store needs to have some way of ensuring that the data in it will eventually, if not immediately, become consistent across all nodes. While this is easy on a single-node database, a multi-node database where any of the individual nodes is allowed to fail requires more complex logic to determine how to make writes consistent. This is called "consensus", which is an agreement on shared state across multiple nodes. Diego Ongaro, developer of Raft, LogCabin, and RAMCloud, described to the audience how it works.
The Paxos consensus algorithm was introduced by Leslie Lamport in 1989, and for over two decades was the first and last word in consensus. But it had some problems, the chief of which was extreme complexity. "Maybe five people really, truly understood every part of Paxos," Ongaro said. Students and programmers alike found it difficult or impossible to write tests that validated whether the protocols described by Paxos were correctly implemented. This meant that, despite being based on the same algorithm, different Paxos implementations were radically different from each other, and could not be proven to have correctly implemented the algorithm, he said.
To solve this, Ongaro and Professor John Ousterhout at Stanford created a new consensus algorithm that they designed to be simple, testable, and easy to explain to developers. Ongaro's PhD thesis described the algorithm [PDF], called Raft; the name refers to the desire to "escape the island of Paxos." Ongaro described their reasoning: "At every design choice, we asked ourselves: what's easier to explain?"
Judging by the number of projects that implement Raft, chief among them CoreOS's etcd, they have been successful in their goal of comprehensibility. Ongaro explained the core workings of Raft in a half-hour session, and demonstrated it using a Raft model written entirely in JavaScript.
Each node in the Raft cluster has a consensus module and a state machine. The state machine stores the data you care about, including a serialized log of state changes to that data. The consensus module makes sure that that log is consistent with the log of every other node in the cluster, which requires all state changes (writes) to be initiated by a "leader" node. With each write, the leader sends out a message to all nodes confirming the write, and only makes the write permanent if a majority of all nodes (called a "quorum") confirms. This is a form of "two-phase commit", which has long been a component of distributed databases.
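As a rough illustration, here is a toy model of that quorum-confirmed write (a Python sketch, not the actual Raft or etcd code; node names and the `leader_write()` helper are invented): the leader counts acknowledgments from all nodes, itself included, and commits only when a majority has stored the entry.

```python
# Minimal sketch of a quorum-confirmed write (not real Raft: terms,
# log indexes, and retries are omitted for clarity).

class Node:
    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy
        self.log = []

    def append(self, entry):
        # A real follower would also check term numbers and log indexes.
        if self.healthy:
            self.log.append(entry)
            return True
        return False

def leader_write(leader, followers, entry):
    """Commit `entry` only if a quorum of the whole cluster stores it."""
    cluster_size = len(followers) + 1
    quorum = cluster_size // 2 + 1
    acks = 1 if leader.append(entry) else 0  # the leader counts itself
    for f in followers:
        if f.append(entry):
            acks += 1
    return acks >= quorum  # committed only with majority confirmation

leader = Node("n1")
followers = [Node("n2"), Node("n3", healthy=False), Node("n4"), Node("n5")]
print(leader_write(leader, followers, "x=1"))  # True: 4 of 5 nodes acked
```

Note that the write succeeds even with one failed node; a quorum tolerates the failure of any minority of the cluster.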
Each node in the cluster also has a countdown clock that waits a random but significant time for a leader to send a message. If the current leader is unavailable, the node that counts down first sends out a message requesting a leader election. If a quorum confirms the leader election message, then the sender becomes the new leader. Log messages sent out by the new leader have a new value for the "term" field in the message that indicates which node was leader when the write happened, preventing log conflicts.
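The election step can be sketched the same way (a deliberately simplified toy model, not real Raft: every live node grants its vote, and the log-comparison rules that real Raft uses to reject stale candidates are omitted):

```python
import random

def elect(nodes, failed_leader, current_term):
    """Toy Raft-style election after `failed_leader` stops sending messages."""
    live = [n for n in nodes if n != failed_leader]
    # Each live node picks a randomized countdown; the first to fire
    # stands as candidate and requests votes.
    candidate = min(live, key=lambda n: random.uniform(150, 300))
    votes = len(live)  # toy model: the candidate and all live nodes vote yes
    quorum = len(nodes) // 2 + 1
    if votes >= quorum:
        # The incremented term marks log entries from the new leader,
        # preventing conflicts with stale messages from the old one.
        return candidate, current_term + 1
    return None, current_term

leader, term = elect(["n1", "n2", "n3", "n4", "n5"], "n1", current_term=7)
print(leader, term)  # one of n2..n5 (chosen by the shortest timeout), term 8
```

The randomized timeout is what makes split votes rare in practice: it is unlikely that two nodes count down to zero at the same moment.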
There is obviously more to Raft than that, such as how missing and extraneous entries are dealt with, but Ongaro was able to explain the core design in less than half an hour. This simplicity means that many developers have been able to produce software using Raft, including projects like etcd, CockroachDB, Consul, and libraries for Python, C, Go, Erlang, Java, and Scala.
[ Update: Josh Berkus updates some of the information about Raft in a comment below. ]
Etcd and Intel
![Nic Weaver](https://static.lwn.net/images/2015/coreosfest-weaver-sm.jpg)
Nic Weaver, who works on software-defined infrastructure (SDI) at Intel, spoke about what the company has been doing to improve etcd. Intel has a strong interest in both Docker and CoreOS because it is trying to help users scale to larger numbers of machines per administrator. Cloud hosting has allowed companies to scale to numbers of services where configuration management by itself isn't adequate, and Intel sees containers as a way to scale further.
As such, Intel has tasked Weaver's team with helping to improve container infrastructure software. In addition to releasing the Tectonic cluster server stack with Supermicro, as mentioned in the first article of this series, Intel also put work into the software itself. The component it decided to start with was etcd — looking at what it would take to build a really large etcd cluster. As container infrastructures managed with tools based on etcd grow to thousands of containers, the number of required etcd nodes grows, and the number of writes to etcd grows even faster.
The problem that the team observed with etcd was that the more nodes you had in a cluster, the slower it would get. This is because, per the Raft algorithm, a write requires more than 50% of the etcd nodes to sync to disk and return success. So even a few nodes with chronically slow storage can hold up the whole cluster. If disk sync were not required, that would eliminate one source of cluster slowdown, but it would put the entire cluster at risk of corruption in the event of a data-center power loss.
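A back-of-the-envelope way to see the effect: since a write commits once the fastest majority has synced, commit latency is the quorum-th smallest fsync time, so one slow disk is harmless — until slow disks make up part of every quorum. A hypothetical illustration (the numbers are invented, not Intel's measurements):

```python
# Why quorum-based writes tolerate a minority of slow disks but stall
# once slow disks outnumber the fast ones: commit latency is the
# quorum-th smallest per-node sync time, not the average.

def commit_latency(sync_times_ms):
    quorum = len(sync_times_ms) // 2 + 1
    return sorted(sync_times_ms)[quorum - 1]

# Five nodes, two with chronically slow storage:
print(commit_latency([2, 3, 250, 3, 300]))    # 3: the fast majority wins
# Five nodes, three with slow storage:
print(commit_latency([2, 250, 260, 3, 300]))  # 250: slow disks now sit in every quorum
```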
Their solution to this was to make use of a facility added to Xeon processors called "asynchronous DRAM self-refresh" (ADR). This is a small designated area of RAM that is preserved when there is a crash and restored on system restart. It was created to support dedicated storage devices. There is a Linux API for ADR, though, so applications like etcd can use it.
Modifying etcd to use the ADR buffer as the write buffer to its logs was a success. Write time went from 25% to 2% of overall time in the cluster, and it was able to double throughput to 10,000 writes per second. This patch will soon be submitted to the etcd project.
CockroachDB
One of the natural steps to take with the Raft consensus algorithm is to go beyond the etcd key-value store and to build a full-service database around it. While etcd is adequate for configuration information, it lacks many of the features users want in application databases, such as transactions and support for complex requests. Spencer Kimball of Cockroach Labs explained how his team was doing so. The new database is called CockroachDB, because it is intended to be, in his words, "impossible to stamp out. You kill it, and it pops up again somewhere else."
CockroachDB is designed to be similar to Google's Megastore, a project Kimball was quite familiar with from his time at Google. The idea is to support consistency and availability across the whole cluster, including support for transactions. The project is planning to add a SQL-compatible layer on top of the distributed key-value store, as Google's Spanner project did. By having transactions as well as both SQL and key-value modes, it can enable most of the common uses of databases. "We want users to build apps, not workarounds," said Kimball.
The database is deployed as a set of containers, distributed across servers. The key-value address space for the database is partitioned into "ranges", and each range is copied to a subset of the available nodes, usually three or five. Kimball calls this cluster of individual Raft consensus groups "MultiRaft". This allows the entire cluster to contain more data than is present on any individual node, helping to scale the database.
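The range-partitioning idea can be sketched in a few lines (a toy model; the split points, node names, and placement policy below are invented for illustration and are not CockroachDB's actual code):

```python
# Sketch of CockroachDB-style range partitioning: the sorted key space
# is split into contiguous "ranges", and each range is replicated to a
# small subset of nodes, each subset running its own consensus group.

import bisect

range_splits = ["g", "n", "t"]  # ranges: < "g", "g"-"n", "n"-"t", >= "t"
nodes = ["node1", "node2", "node3", "node4", "node5"]

def range_for_key(key):
    """Binary-search the split points to find the range holding `key`."""
    return bisect.bisect_right(range_splits, key)

def replicas_for_range(range_id, replication_factor=3):
    # Toy placement policy: consecutive nodes, wrapping around the cluster.
    return [nodes[(range_id + i) % len(nodes)]
            for i in range(replication_factor)]

r = range_for_key("mydata")
print(r, replicas_for_range(r))  # range 1, stored on node2..node4
```

Because each range has its own three- or five-node Raft group, no single node needs to hold — or reach consensus on — the whole keyspace, which is what lets the cluster scale past the capacity of any one machine.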
Each node runs in "serializable" transaction mode by default, which means that all transactions must be replayable in log order. If a transaction has a serialization failure, it is rolled back on the originating node. This permits distributed transactions without unnecessary locking.
From this, it sounds like CockroachDB might be the answer to everyone's distributed infrastructure data persistence issues. It has one major fault though: the project isn't yet close to a stable release, and many of the planned features, such as SQL support, haven't been written yet. So while it may solve many persistence issues in the future, there are other solutions for right now.
High-availability PostgreSQL
Since highly distributed databases aren't yet ready for production use, developers are taking existing popular databases and making them fault-tolerant and container-friendly. Two such projects build up high-availability PostgreSQL: Flocker from ClusterHQ and Governor from Compose.io.
Luke Marsden, CTO of ClusterHQ, presented Flocker at ContainerCamp. Flocker is a data volume management tool designed to help host databases in containers. It provides volume management that supports migrating database containers from one physical machine to another. This means that orchestration frameworks can redeploy database containers in almost the same way they would stateless services, which has been one of the challenges to containerizing databases.
Flocker is able to support migrating containers between physical machines by making use of ZFS on Linux from the project of the same name. Flocker creates Docker volumes on specially managed ZFS directories, allowing the user to move and copy those volumes by using exportable ZFS snapshots. Operations are performed via a simple declarative command-line interface.
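The underlying ZFS mechanism can be sketched as follows (a hypothetical Python helper, not Flocker's actual code; the dataset, pool, and host names are invented): take a copy-on-write snapshot, then stream it to the destination with `zfs send` piped into `zfs receive`.

```python
# Sketch of the ZFS snapshot/send/receive flow that volume migration
# builds on. This only constructs the commands (so the logic is easy
# to inspect); a real tool would execute them via subprocess.

def zfs_migration_commands(dataset, snapshot, src_pool, dest_host, dest_pool):
    snap = f"{src_pool}/{dataset}@{snapshot}"
    return [
        # 1. A point-in-time, copy-on-write snapshot of the volume.
        ["zfs", "snapshot", snap],
        # 2. `zfs send` serializes the snapshot; piping it to
        #    `zfs receive` on the destination recreates the dataset there.
        ["sh", "-c",
         f"zfs send {snap} | ssh {dest_host} zfs receive {dest_pool}/{dataset}"],
    ]

for cmd in zfs_migration_commands("pgdata", "snap1", "tank", "host2", "tank"):
    print(" ".join(cmd))
```

Because snapshots are incremental at the block level, repeated sends of the same volume only ship the blocks that changed, which keeps migrations fast.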
Flocker is designed as a plugin for Docker. One challenge for the Flocker team is that Docker doesn't currently support plugins. The team created a plugin infrastructure called Powerstrip, but that tool has yet to be accepted into mainstream Docker. Until it is, the Flocker project can't provide a unified management interface.
![Chris Winslett](https://static.lwn.net/images/2015/coreosfest-winslett-sm.jpg)
If Flocker solves the container migration problem, then the Governor project from Compose, presented by Chris Winslett at CoreOS Fest, aims to solve the availability problem. Governor is an orchestration prototype for a self-managing replicated PostgreSQL cluster, and is a simplified version of the Compose infrastructure.
Compose is a Software-as-a-Service (SaaS) hosting company, which means that the services it offers need to be entirely automated. In order for Compose to deploy PostgreSQL, it needed to support automatic database replica deployment and failover. Since users have full database access, Compose also needed a solution that didn't require making any changes to the PostgreSQL code or to users' databases.
One of the things Winslett figured out quickly was that PostgreSQL could not be the canonical store of its own availability and replication state, because the master and all replicas would have identical information. This led to implementing a solution based on Consul, a distributed high-availability information service. However, Consul requires 40GB of virtual memory for each data node, which wasn't practical for tiny cloud server nodes. Winslett abandoned Consul for the much simpler etcd and, in the process, substantially simplified the failover logic.
Governor works by having the governor daemon control PostgreSQL in each container. Governor queries etcd to find out who the current master is and replicates from it on startup. If there is no master, it attempts to seize the leader key on etcd, and etcd ensures that only one requester can win that contest. Whichever node gets the leader key becomes the new master and the other nodes start replicating from it. Since the leader key has a time-to-live (TTL), if the master fails, a new leader election will follow shortly, ensuring that there will quickly be a new master.
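The leader-key contest can be modeled in a few lines (a toy in-memory stand-in for etcd's atomic test-and-set on a key with a TTL — not Governor's actual code; the class and method names are invented):

```python
import time

class TinyEtcd:
    """Toy in-memory stand-in for etcd's atomic acquisition of one key."""
    def __init__(self):
        self.leader = None
        self.expires = 0.0

    def acquire(self, node, ttl, now=None):
        """Become (or remain) leader iff the key is free or expired."""
        now = time.monotonic() if now is None else now
        if self.leader is None or now >= self.expires or self.leader == node:
            self.leader = node
            self.expires = now + ttl  # the master must keep refreshing this
            return True
        return False  # someone else already holds the leader key

store = TinyEtcd()
print(store.acquire("pg1", ttl=30, now=0))   # True: pg1 becomes master
print(store.acquire("pg2", ttl=30, now=10))  # False: key not yet expired
print(store.acquire("pg2", ttl=30, now=45))  # True: TTL lapsed, failover
```

The healthy master refreshes the key well inside the TTL; a failed master simply stops refreshing, and the expiry does the rest — no separate failure-detection protocol is needed.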
This means that Compose can treat PostgreSQL almost the same as it treats the multi-master, non-relational databases it supports, like MongoDB and Elasticsearch. In the new system, PostgreSQL nodes can be configured in etcd, and then deployed using container orchestration systems, without hands-on administration or handling those containers differently.
Conclusion
Of course, there are many more projects and presentations than the ones mentioned above. For example, Sean McCord spoke at CoreOS Fest about using the distributed filesystem Ceph as a block device inside Docker containers, as well as running Ceph with each node in a container. While this approach is fairly rudimentary right now, it offers another option for containers that need to run services that depend on large file storage. Cloudconfig, CoreOS's new boot-from-a-container tool, was also introduced by Alex Crawford.
As the Linux container ecosystem moves from test instances and web servers into databases and file storage, we can expect to continue to see new approaches to solving these kinds of problems, increasing scale, and integrating with other tools. If anything is clear from CoreOS Fest and ContainerCamp, it's that Linux container technology is still young and we can expect many more new projects and dramatic changes in approaches over the next year.
| Index entries for this article | |
|---|---|
| GuestArticles | Berkus, Josh |
| Conference | CoreOS Fest/2015 |
Posted May 29, 2015 11:32 UTC (Fri)
by ms (subscriber, #41272)
[Link] (2 responses)
Paxos is not particularly complex and this myth that it's hard to understand is very unfortunate. Anyone who thinks so has not tried hard enough to understand it. There's a paper which demonstrates Raft is not really much easier to understand than Paxos itself (Raft refloated: http://www.cl.cam.ac.uk/~ms705/pub/papers/2015-osr-raft.pdf). Most of this myth is due to Lamport's original paper; even the subsequent "Paxos made simple" and various others such as "Paxos made Live" either elide details or, due to page count limits, leave the reader to do some thinking. The facts are that distributed consensus is a hard problem generally, so you need to think hard to understand the problem before you can understand the solution; and that Paxos is a very general tool, but a very low-level one. It can be used in many different ways (which is often overlooked), whereas Raft targets just one, fairly common, use of Paxos.
Posted May 30, 2015 4:32 UTC (Sat)
by andresfreund (subscriber, #69562)
[Link] (1 responses)
I think neither Paxos nor Raft is simple. Especially not when implemented in real-world software.
Posted May 30, 2015 9:09 UTC (Sat)
by ms (subscriber, #41272)
[Link]
> In our experience, Raft’s high level ideas were easy to grasp—more so than with Paxos. However, the protocol still has many subtleties, particularly regarding the handling of client requests. The iterative protocol description modularizes and isolates the different aspects of the protocol for understandability by a reader, but this in our experience hinders implementation as it introduces unexpected interaction between components (see previous section). As with Paxos, the brevity of the original paper also leaves many implementation decisions to the reader. Some of these omissions, including detailed documentation of the more subtle aspects of the protocol and an analysis of the authors’ design decisions, are rectified in Ongaro’s PhD thesis on Raft. However, this only became available towards the end of our effort. Despite this, we believe that Raft has achieved its goal of being a “more understandable” consensus algorithm than Paxos.
There seems to me to be little scientific evidence for their final statement, and it does seem to me to be consistent with my earlier claim that "Raft is not really much easier to understand than Paxos", though you may disagree. In the original Raft paper ("In search of an understandable consensus algorithm") there is quite a lot of scientific effort put into measuring understandability and comparing Raft and Paxos, and Raft does much better. I completely agree that understandability is incredibly important in algorithms - I'm not trying to dismiss the effort or approach of the Raft authors. I'm simply suggesting that the reputation Paxos has isn't entirely deserved.
Posted May 29, 2015 23:37 UTC (Fri)
by jberkus (guest, #55561)
[Link] (2 responses)
First, the quote about Paxos was apparently him quoting someone else, a paper reviewer. The full quote is as follows:
This came from an anonymous NSDI reviewer in a paper rejection: "I'm really glad that the authors have created a more understandable algorithm for distributed consensus. The dirty little secret of the NSDI community is that at most five people really, truly understand every part of Paxos ;-)."
Second, he requested a link to his full list of publications: https://raftconsensus.github.io/#pubs
Third: Raft is designed for *immediate* consistency, not *eventual* consistency.
Fourth, explanation for the Paxos validation issues which I mention tersely above. Ongaro provided me with a quote from the paper Paxos Made Live which explains it:
"There are significant gaps between the description of the Paxos algorithm and the needs of a real-world system. In order to build a real-world system, an expert needs to use numerous ideas scattered in the literature and make several relatively small protocol extensions. The cumulative effort will be substantial and the final system will be based on an unproven protocol."
Posted May 30, 2015 9:23 UTC (Sat)
by ms (subscriber, #41272)
[Link] (1 responses)
Yes, completely agree. Irritatingly, the "Paxos Made Live" paper also misses out all sorts of concerns and issues. My own write-up on Paxos can be found at http://wellquite.org/blog/paxos_notes.html
Now in terms of proofs and so forth, if you're working in this area then IMO you really need to be model checking your implementation at a minimum, i.e. exhaustively searching every possible permutation of events and checking that the right thing happens. Learning TLA+ and building a Paxos model in it takes at most a couple of days. The model you're left with is reasonably code-like, in that you could get your TLA+ model and your code looking pretty similar and then convince yourself line-by-line that it's the same thing. Or, the alternative is to build your own simulation environment for your programming language of choice; then you can drive your real production-ready code through your simulator and demonstrate that that exact code does the right thing in every scenario you've generated and tested.
Obviously, model checking is not the same as a proof. TLA+ does support some proof extensions, but otherwise you're looking at the usual candidates for proof assistants (https://en.wikipedia.org/wiki/Proof_assistant) or pencil and paper. I would suggest, though, that even if an algorithm has been proved correct, that does not really help prove that your implementation matches the algorithm. The typical extensions to Paxos (Fast Paxos, Cheap Paxos) are papers by Lamport and I would expect proofs or TLA+ models to be available online (certainly Fast Paxos has a TLA+ implementation at the back of the paper). Finally, there is an argument to be made that if you're writing distributed systems software and you are unable to reason mentally about how and why your software works, then you might not wish to continue working in this area. You only need to look at aphyr's work at https://aphyr.com/tags/Jepsen to see how hopelessly awful almost all software is in this area. It almost always fails to work as claimed.
Posted May 31, 2015 8:00 UTC (Sun)
by paulj (subscriber, #341)
[Link]
Posted Jun 1, 2015 12:25 UTC (Mon)
by robbe (guest, #16131)
[Link] (4 responses)
> Xeon processors called "asynchronous DRAM self-refresh" (ADR). This is a small designated area of RAM that is preserved when there is a crash and restored on system restart.

Self-refresh is actually a feature of the RAM modules. What the Xeon (and probably other CPUs containing a memory controller) supports is turning it on in case of a detected error condition.
While this will save your data across a reboot or crash, self-refresh obviously still needs power. If you experience an outage of a few seconds, it's time to say goodbye to your bits. I wouldn't trust it with anything more than crash dumps.
Posted Jun 3, 2015 7:28 UTC (Wed)
by pixelpapst (guest, #55301)
[Link]
Posted Jun 8, 2015 17:25 UTC (Mon)
by lynxbat (guest, #103035)
[Link] (2 responses)
To be clear, this is not pure ADR - it is something new. But in this case we are using "features" of ADR for something new, specifically for small memory blocks. The restoration is not necessarily to RAM even...
We are actively working on it now and are excited to help CoreOS and others. More details as we firm things up and get it out where people can play with it.
Posted Jun 10, 2015 11:58 UTC (Wed)
by ykorman (guest, #101556)
[Link]
Posted Jun 11, 2015 20:15 UTC (Thu)
by jberkus (guest, #55561)
[Link]