Out in the Open: The Abandoned Facebook Tech That Now Helps Power Apple

Matt Pfeil drove from Austin to San Antonio with only one thing in mind: stopping Jonathan Ellis from quitting his job at the cloud computing company Rackspace.

Ellis had emailed his colleagues, including Pfeil, to say he planned on leaving Rackspace to start a new company around Cassandra, a sweeping open-source database originally created by Facebook to help juggle the scads of digital information generated by its popular social network. Pfeil had worked with Cassandra at Rackspace, so he knew the value of the project, but he didn't want Rackspace to lose Ellis. When he wasn't writing code, Pfeil helped with the company's recruiting efforts, and that made it especially difficult to see Ellis leave.

The two met for lunch at a tiny Thai restaurant in San Antonio, and Pfeil brought a long list of reasons why Ellis shouldn't quit. But soon, his plan started to backfire. When Pfeil pointed out that Ellis didn't have anyone to run the business side of the startup, Ellis invited him to join the new company too. "When he asked me, I started thinking about what I wanted to do with my life," Pfeil remembers. "I was in my 20s. I hadn't started a family. It was the best time to do a startup. I didn't decide right then, but he planted the seeds."

Soon, even Rackspace was behind the plan. The company not only gave its blessing to their new venture, but provided seed money as well. Having seen the power of Cassandra first hand, Rackspace---and two of its key employees---knew how useful it could be to other operations struggling to accommodate ever increasing amounts of online data. And, now, four years later, their leap of faith has paid off in big ways.

Today, their startup, DataStax, is part of a growing flock of companies that are remaking the multi-million-dollar database market and slowly loosening the grip of software giant Oracle. Unlike traditional databases, such as those from Oracle, Cassandra and its peers are specifically designed to run across a vast cluster of machines, juggling huge amounts of data, and that's what the modern world needs.

Though Facebook has all but abandoned Cassandra, the technology has gone on to power critical web infrastructure at companies like Twitter, Netflix, even Apple. And DataStax has built a version of the tool for all sorts of other businesses. Having raised over $84 million, the startup now spans over 300 employees, and it's well on its way to an IPO, landing over 500 customers, including 25 of the Fortune 100, according to Ellis.

The Birth of Cassandra

Facebook engineers Avinash Lakshman and Prashant Malik originally built Cassandra to power the engine that let you search your inbox on the social network. Like other so-called "NoSQL" databases, it did away with the traditional relational model---where data is organized in neat rows and columns on a single machine---in order to more easily scale across thousands of machines. That's vitally important for a growing web service the size of Facebook. Lakshman had worked on Amazon's distributed data storage system called Dynamo, but the two also drew inspiration from a paper Google published in 2006 describing its internal database BigTable.

Mark Zuckerberg and company open sourced Cassandra in the summer of 2008, and it helped kick off the now enormous NoSQL movement, along with other databases like CouchDB and MongoDB. Rackspace hired Ellis that very year to evaluate options for a next-generation database, and he tried all the various NoSQL databases available at the time. None, he says, could top Cassandra. "Facebook open sourced it, but weren't moving it forward," he says. "But the technical foundations were ahead of everyone else."

Facebook hadn't built a community around Cassandra, which was both a liability and an opportunity. Ellis could tailor the open source project to meet Rackspace's needs---build and guide the community himself. But the idea to start his own Cassandra company didn't come until 2010. Cassandra was already gaining traction outside of Facebook and Rackspace, but when an engineer at another company told Ellis it had decided to use a competing NoSQL database because there was a startup that would provide technical support for the software, he knew he had to act.

Keep On Chuggin'

Even as Cassandra grew behind the scenes, the initial buzz wore off. Today, there are too many NoSQL databases to keep track of. And when Facebook decided to use HBase instead of Cassandra for its messaging system, it took a little sheen off the database. But even as the NoSQL hype faded, Cassandra kept chugging along, picking up new users along the way. According to data compiled by Austrian consulting firm Solid IT, Cassandra is the second most popular NoSQL database in the world, after MongoDB, and the third fastest growing database overall.

DataStax is a big part of this, offering service and support for a proprietary version of Cassandra called DataStax Enterprise. "A lot of companies have more time than money, so they use the open source Cassandra and contribute back," Ellis says. "But other companies prefer to trade money for time, and they pay for the enterprise version. Personally, though the sales team would disagree, I'm happy to work with people from either camp."

At the same time, the larger Cassandra community has continued to grow, with many other companies supporting its development. Apple is now the second largest contributor to the project, though it's quiet about how it uses the database. Ellis couldn't confirm whether Apple is a DataStax customer, but three Apple engineers are speaking at the annual Cassandra Summit in September. And Cassandra has found its way back into Facebook thanks to the company's acquisition of Instagram, which is a heavy user of the database.

Chasing the Future

The tech community has reached a point where one database product from one company will no longer dominate the market. From now on, there will be multiple different approaches to storing and working with data. But the big data landscape has evolved since 2008. Since then Google has unveiled numerous new tools, such as Dremel, which it uses to query data at insanely fast speeds, and Spanner, its internal replacement for the database that inspired Cassandra.

The open source community is trying to keep up with these advances. MapR started building a Dremel clone called Drill in 2012, and a startup called Databricks has been developing an analytics system called Spark that is now in use by Yahoo. More recently, a team of ex-Google engineers began building a Spanner clone called CockroachDB.

Ellis says the strategy for Cassandra and DataStax is to ensure that their technology can work with whatever new technology comes along. For example, DataStax recently released a connector for Spark that will enable developers to easily use Spark to analyze data stored in Cassandra. "We're trying to be the database that drives our application, not necessarily the analytics," he says. "There's nothing that marries us to one of those platforms."

Correction 8/4/2014 7:15 PM EST: An earlier version of this story said that Yahoo was developing Spark, but it's actually being developed by a company called Databricks.