Google Unleashes More Big-Data Genius With a New Cloud Service

Google continues to share the wealth of the uniquely powerful software systems it erected to run its enormous online empire.

On Tuesday morning, at its Google I/O developer conference in San Francisco, the tech giant introduced a cloud computing service it calls Google Cloud Dataflow. Based on two software systems that have helped Google drive its own online operation for years--Flume and MillWheel--the service is a way of more easily moving, processing, and analyzing massive amounts of digital information. As he unveiled the service, Google's Urs Hölzle--the man who oversaw the creation of Google's global network of data centers--said it's designed to help companies deal with petabytes of data--a.k.a. millions of gigabytes.

"Cloud DataFlow is the result of over a decade of experience in data analytics," he said. During the conference keynote, one Googler showed how the system could be used to analyze reactions to World Cup matches posted to Twitter.

This is just the latest way that Google is sharing its unprecedented online infrastructure with the world at large through its cloud services. Google Compute Engine and Google App Engine--cloud services that let companies and independent developers build and run large software applications--are based on internal Google infrastructure, as is BigQuery, a way of almost instantly asking questions of massive datasets. Following the lead of Amazon--the company that pioneered modern cloud computing--Google sees cloud computing as a potentially enormous market, one that might even eclipse the market for online ads, its primary business today.

Long ago, with a sweeping software system called MapReduce, Google set the standard for processing "big data." A tool that ran across hundreds of servers, MapReduce is what the company used to build the enormous index of webpages that underpins its search engine. Thanks to an open source clone of MapReduce--Hadoop--the rest of the world now crunches data in similar ways. But Hölzle says that Google no longer uses MapReduce. It now uses Flume, aka FlumeJava, for this kind of massive "batch processing."
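
MapReduce's model is simple enough to sketch in miniature: a "map" step emits key-value pairs from raw input, a "shuffle" groups the pairs by key, and a "reduce" step folds each group into a result. Here's a toy single-machine word count in Java, assuming nothing beyond the standard library; the real thing spreads these same steps across hundreds of servers.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ToyMapReduce {
    public static void main(String[] args) {
        List<String> documents = List.of("the quick fox", "the lazy dog");

        // Map + shuffle: emit (word, 1) for every word, grouped by key.
        Map<String, List<Integer>> grouped = new HashMap<>();
        for (String doc : documents) {
            for (String word : doc.split("\\s+")) {
                grouped.computeIfAbsent(word, k -> new ArrayList<>()).add(1);
            }
        }

        // Reduce: fold each group of values into a single count per word.
        Map<String, Integer> counts = new HashMap<>();
        grouped.forEach((word, ones) ->
                counts.put(word, ones.stream().mapToInt(Integer::intValue).sum()));

        System.out.println(counts); // e.g. {the=2, quick=1, fox=1, lazy=1, dog=1}
    }
}
```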

After Hölzle's keynote, Google director of product management Greg DeMichillie told us that Flume essentially removes much of the pain that came with MapReduce. It lets the company more easily build complex "data pipelines," meaning the entire process of ingesting, cleaning, and analyzing data.
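
That pipeline idea is clearer as code. The sketch below uses a made-up fluent Pipeline class--not Google's actual FlumeJava API--to show how clean-up and analysis stages chain into one declarative whole, hiding the map-and-reduce plumbing underneath.

```java
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

// A made-up stand-in for a FlumeJava-style pipeline: each stage transforms
// the dataset and returns a new Pipeline, so steps chain fluently.
class Pipeline<T> {
    private final List<T> data;
    Pipeline(List<T> data) { this.data = data; }

    <R> Pipeline<R> apply(Function<List<T>, List<R>> stage) {
        return new Pipeline<>(stage.apply(data));
    }

    List<T> results() { return data; }
}

public class IngestCleanAnalyze {
    public static void main(String[] args) {
        List<String> output = new Pipeline<>(List.of(" raw Record ", "", "another record "))
                // Clean: trim whitespace and drop empty records.
                .apply(rows -> rows.stream().map(String::trim)
                        .filter(r -> !r.isEmpty()).collect(Collectors.toList()))
                // Analyze: normalize case so downstream counting is consistent.
                .apply(rows -> rows.stream().map(String::toLowerCase).collect(Collectors.toList()))
                .results();

        System.out.println(output); // [raw record, another record]
    }
}
```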

Now, DeMichillie says, Google is not only sharing this system with the rest of the world, it's also combining Flume with MillWheel, a similar system that handles "stream processing." Whereas batch processing is a way of crunching data that has already been collected, stream processing involves analyzing data in near real-time as it comes off the net. Many companies require both types of data analysis, and Cloud Dataflow brings both under one umbrella.
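
The difference between the two modes is easiest to see next to the batch examples above. In this toy Java sketch, a consumer thread updates running counts the instant each event arrives, rather than waiting for a complete dataset; the queue stands in for data coming off the net.

```java
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;

public class ToyStream {
    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> events = new LinkedBlockingQueue<>();
        Map<String, Integer> counts = new ConcurrentHashMap<>();

        // Stream processing: update the running counts the moment each
        // event arrives, instead of waiting for a complete dataset.
        Thread consumer = new Thread(() -> {
            try {
                while (true) {
                    String event = events.take();
                    if (event.equals("STOP")) return; // shutdown sentinel
                    counts.merge(event, 1, Integer::sum);
                    System.out.println("counts so far: " + counts);
                }
            } catch (InterruptedException ignored) {
            }
        });
        consumer.start();

        // Producer: a stand-in for data arriving off the net in real time.
        for (String e : new String[] {"click", "view", "click"}) {
            events.put(e);
        }
        events.put("STOP");
        consumer.join();
    }
}
```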

Others have built similar tools. Twitter, for instance, has created an open source contraption it calls Summingbird. But Dataflow is a little different in that Google is offering it solely as a cloud service, something that anyone can access over the internet. The company is not distributing software that you could install on your own machines.

At today's conference, Google also introduced new tools for monitoring and debugging applications that you build and run on Compute Engine and App Engine. DeMichillie showed off a tool called Google Cloud Trace, which helps you find particular performance bottlenecks that may plague your applications. He tells us it uses the same concepts as DTrace, a tool originally developed at Sun Microsystems, but he says that the Cloud Trace technology was developed entirely at Google.
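
Cloud Trace is a managed service and Google hasn't published its internals, but the core idea behind this kind of tracing--time named spans of work, then attribute request latency to them--can be sketched in a few lines of Java. The Span class here is hypothetical, not a Cloud Trace API.

```java
public class ToySpanTracer {
    // Hypothetical Span, not a Cloud Trace API: it times a named unit of
    // work, and try-with-resources records the duration automatically.
    static class Span implements AutoCloseable {
        private final String name;
        private final long start = System.nanoTime();
        Span(String name) { this.name = name; }
        @Override public void close() {
            long micros = (System.nanoTime() - start) / 1_000;
            System.out.printf("span=%s took=%dus%n", name, micros);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        try (Span request = new Span("handle-request")) {
            try (Span db = new Span("db-query")) {
                Thread.sleep(20); // stand-in for a slow database call
            }
            try (Span render = new Span("render-page")) {
                Thread.sleep(5); // stand-in for template rendering
            }
        }
        // Comparing the nested spans shows which stage dominates latency.
    }
}
```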