Hadoop is coming out of the shadows and into production in IT shops that are drawn to its ability to store, process and analyze extremely large volumes of data. But the relative newness of the open-source platform and a shortage of experienced Hadoop talent pose technical challenges that enterprise IT teams need to address.
Hadoop grew out of the work of Doug Cutting and Mike Cafarella, who originally developed it to support Apache Nutch, an open-source search engine. It became an Apache project when Cutting and a team of engineers at Yahoo split the distributed computing code out of the Nutch crawler to create Hadoop.
Today Hadoop powers every click at Yahoo, where the Hadoop production environment spans more than 42,000 nodes. That kind of scalability is a sweet spot of Hadoop, which is designed to handle data-intensive distributed applications spanning thousands of nodes and exabytes of data, with a high degree of fault tolerance.
Hadoop pioneers in the online world -- including eBay, Facebook, LinkedIn, Netflix and Twitter -- paved the way for companies in other data-intensive industries such as finance, technology, telecom and government. Increasingly, IT shops are finding a place for Hadoop in their data architecture plans. The appeal, in a nutshell, is that Hadoop can enable massively parallel computing on inexpensive commodity servers. Companies can collect more data, retain it longer, and perform analyses that weren't practical in the past because of cost, complexity and a lack of tools.
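To make the parallel-processing idea concrete, here is a minimal sketch of the map-and-reduce pattern that Hadoop is generally associated with, written in plain Python rather than against Hadoop itself. The function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) and the sample viewer events are hypothetical, and a local thread pool stands in for a cluster of commodity servers.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_phase(chunk):
    """Emit a (user, 1) pair for every non-empty record in one chunk."""
    return [(line.split()[0], 1) for line in chunk if line.strip()]


def shuffle(mapped_chunks):
    """Group the emitted values by key across all chunks."""
    grouped = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Collapse each key's values into a single count."""
    return {key: sum(values) for key, values in grouped.items()}


def run_job(chunks, workers=4):
    """Map every chunk in parallel, then shuffle and reduce the results."""
    # Assumption: threads stand in for separate servers in this sketch.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(map_phase, chunks))
    return reduce_phase(shuffle(mapped))


# Hypothetical input: each inner list plays the role of one block of a log file.
chunks = [
    ["alice play movie_1", "bob play movie_2"],
    ["alice pause movie_1", "carol play movie_3"],
]
counts = run_job(chunks)
print(counts)
```

Because each chunk is mapped independently, adding workers (or, in a real cluster, servers) increases throughput without changing the logic, which is the property the cost argument rests on.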
At Concurrent Computer, the decision to use Hadoop was driven in large part by volume.
"Scalability was the biggest concern. With a traditional relational database, every time you want to scale or get bigger, you end up paying a premium," says Will Lazzaro, director of engineering at Concurrent, which provides video-on-demand systems and processes billions of records a day related to viewers, content consumption and platform operations.
"When it comes to the heavy lifting of getting yesterday's data into our system, or plugging through gigabits-big log files, [Hadoop] is the opportune technology to bring in that data, whether it's structured, semi-structured or even unstructured," Lazzaro says.
Playing with big data
Hadoop lets enterprises store and process data they previously discarded -- log files, for example -- because it was too hard to process and didn't fit cleanly into traditional database schemas. That's the crux of so-called big data, says Matt Aslett, research manager, data management and analytics, at 451 Research. "It's about doing things with data that was previously thrown away in a way that enables new applications and new projects."
In addition to being scalable, Hadoop computing systems are flexible. Hadoop is schema-less, which lets users join and aggregate data from disparate sources for more complex analyses. New nodes can be added as needed, and Hadoop's built-in fault tolerance features allow the system to redirect work to another location if a node is lost.
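The two ideas in that description, interpreting data only when it is read and moving work off a lost node, can be illustrated with a short sketch. This is plain Python, not Hadoop's API; the record formats, the node names and the helpers `read_record`, `total_minutes` and `assign_tasks` are all assumptions made for illustration.

```python
import json
from collections import defaultdict


def read_record(raw):
    """Interpret a raw line at read time: JSON if it parses, else 'user,minutes'."""
    try:
        rec = json.loads(raw)
        return rec["user"], float(rec.get("minutes", 0))
    except (ValueError, KeyError, TypeError):
        user, minutes = raw.split(",")
        return user.strip(), float(minutes)


def total_minutes(raw_lines):
    """Aggregate viewing minutes per user across differently shaped sources."""
    totals = defaultdict(float)
    for raw in raw_lines:
        user, minutes = read_record(raw)
        totals[user] += minutes
    return dict(totals)


def assign_tasks(tasks, nodes, failed_nodes):
    """Assign tasks round-robin, skipping any node marked as failed."""
    assignments = {}
    for i, task in enumerate(tasks):
        for offset in range(len(nodes)):
            node = nodes[(i + offset) % len(nodes)]
            if node not in failed_nodes:
                assignments[task] = node
                break
        else:
            raise RuntimeError("no healthy node available for " + task)
    return assignments


# Hypothetical data: one JSON source and one CSV-style source, stored as-is.
lines = ['{"user": "alice", "minutes": 12}', "alice,30", "bob, 5"]
totals = total_minutes(lines)

# Hypothetical cluster in which node-b has been lost.
plan = assign_tasks(
    ["block-0", "block-1", "block-2"],
    ["node-a", "node-b", "node-c"],
    failed_nodes={"node-b"},
)
print(totals, plan)
```

Nothing about the stored lines had to be declared up front; the structure is imposed by `read_record` at query time, and the work planned for the failed node simply lands on the next healthy one.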