"That schema-less approach, which lets you just store the data and then figure out what you want to do with it, is much more appropriate for unstructured and semi-structured data like Web log data, as well as for data that you know has value for the organization, but you may need to do some experimentation to figure out what that value is," Aslett says. "The cost of doing that in an enterprise data warehouse would just be prohibitive."
Return Path, an email certification and reputation monitoring company, started experimenting with Hadoop in 2008, attracted by its enormous storage potential and the ability to easily scale the platform by adding servers. Return Path collects massive amounts of data from ISPs and analyzes it to establish email sender reputations, pinpoint deliverability issues or monitor potentially harmful messages, for instance.
In the early days, signing on a new ISP or two could quadruple Return Path's data volume. The company found itself in a position where it couldn't keep data as long as it wanted to, nor could it process the data as fast as it wanted to, recalls CTO Andy Sautins. Over the years, he and his team tried a few custom solutions to augment the company's traditional enterprise data warehouse. "These worked fairly well but required much more time and investment in software development than made sense," Sautins says.
Hadoop was a game-changer. "It let us change the conversation around what it meant to retain data. It wasn't in terms of weeks, it was years," Sautins says. "Hadoop really helped us be able to weather the storm of retaining and processing more data."
Moving out of the shadows
Apache Hadoop includes two main subprojects: the Hadoop Distributed File System (HDFS), which provides high-throughput access to application data, and Hadoop MapReduce, which is a software framework for distributed processing of large data sets on compute clusters. It's augmented by a growing group of Apache projects, such as Pig, Hive and Zookeeper, that extend its usability.
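To make the MapReduce idea concrete, here is a minimal single-process sketch of the map, shuffle and reduce steps, counting HTTP status codes in a few Web log lines. It is an illustration of the programming model only, not Hadoop's actual API: the function names, the three-field log format and the sample data are all assumptions made for this example, and a real job would read its input from HDFS and spread the work across many servers.

```python
from collections import defaultdict

# Hypothetical log format assumed for this sketch: "<client-ip> <path> <status>"
SAMPLE_LOG = [
    "10.0.0.1 /index.html 200",
    "10.0.0.2 /missing 404",
    "10.0.0.1 /about.html 200",
    "garbled",
]


def map_phase(line):
    """Emit a (status_code, 1) pair for one log line; skip malformed lines."""
    fields = line.split()
    if len(fields) >= 3:
        yield fields[2], 1


def shuffle(pairs):
    """Group emitted values by key, as the framework would between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Collapse all values for one key into a single count."""
    return key, sum(values)


def run_job(lines):
    """Run map, shuffle and reduce in one process (no cluster involved)."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(run_job(SAMPLE_LOG))
```

Because each call to the map function looks at one line in isolation and each call to the reduce function looks at one key in isolation, a framework can run many of them at once on different machines, which is what allows this style of processing to scale by adding servers.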
Hadoop's emergence as an enterprise platform mirrors in many ways the arrival of Linux: Deployments were preceded by shadow IT projects, or skunk works, to test the merits of the software before adopting it on a wider scale.
Adoption is growing largely through developers "who've got an ear to ground, figuring out what the other companies are doing," 451 Research's Aslett says. "It's just as we saw Linux move into enterprises through the IT department and internal projects, when the CEO/CIO didn't necessarily know that it was in there. It's exactly the same with Hadoop."
The emergence of vendors with commercial, enterprise-oriented Hadoop distributions -- including support, management tools and configuration assistance -- has further accelerated adoption in the enterprise realm. Key players in this arena are Cloudera, MapR Technologies and Hortonworks, which was spun out of Yahoo last year to develop its own distribution of Hadoop.