
Hadoop: How open source can whittle Big Data down to size

Rohan Pearce | March 5, 2012
In 2011 'Big Data' was, next to 'Cloud', the most dropped buzzword of the year. In 2012 Big Data is set to become a serious issue that many IT organisations across the public and private sectors will need to come to grips with.

The challenge essentially comes down to this: How do you store the massive amounts of often-unstructured data generated by end users and then transform it into meaningful, useful information?

One tool that enterprises have turned to for help with this is Hadoop, an open source framework for the distributed processing of large amounts of data.

Hadoop lets organisations "analyse much greater amounts of information than they could previously," says its creator, Doug Cutting. "Hadoop was developed out of the technologies that search engines use to analyse the entire Web. Now it's being used in lots of other places."
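At its core, Hadoop 1.0 pairs a distributed file system (HDFS) with the MapReduce programming model: a map function emits key/value pairs from raw input and a reduce function aggregates them, with the framework spreading both phases across a cluster. As a purely illustrative sketch (not drawn from the article), the classic word-count job against the Hadoop 1.0-era org.apache.hadoop.mapreduce API looks roughly like this; class names and the command-line input/output paths are assumptions for the example.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Illustrative word-count job: counts how often each word appears across the input files.
public class WordCount {

  // Map phase: for every word in a line of input, emit the pair (word, 1).
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer tokens = new StringTokenizer(value.toString());
      while (tokens.hasMoreTokens()) {
        word.set(tokens.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: sum the counts emitted for each distinct word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable count : values) {
        sum += count.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "word count");            // Hadoop 1.x-style job setup
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);        // local pre-aggregation on each node
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory (e.g. in HDFS)
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory (must not exist)
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

On a working cluster, a job like this would typically be packaged into a jar and submitted with the hadoop jar command, with the input and output directories living in HDFS.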

In January this year Hadoop finally hit version 1.0. The software is now developed under the aegis of the Apache Software Foundation.

"The releases coming this year will effectively become Hadoop 2.0," Cutting says. "We're going to see enhanced performance, high-availability and an increased variety of distributed computing metaphors to better support more applications. Hadoop's becoming the kernel of a distributed operating system for Big Data."

Hadoop grew out of Nutch, a project to build an open source search engine Cutting was involved in. Development of Nutch is also conducted under the patronage of the Apache Software Foundation.

"The Hadoop ecosystem now has more than a dozen projects around it," says Cutting. "This is a testament to the utility of the technology and its open source development model. Folks find it useful from the start. Then they want to enhance it, building new systems on top.

"Apache's community-based approach to software development lets users productively collaborate with other companies to build technologies from which they can all profitably share."

Hadoop setups are available from big names in the Cloud computing space, including Amazon (through Amazon Elastic MapReduce) and IBM; in December Microsoft announced a "limited preview" of Hadoop on its Windows Azure Cloud service. Hortonworks, a company set up by Yahoo (which runs a 42,000-node Hadoop environment and is a key driver of the project), and Cloudera, which employs Cutting as chief architect, also offer Hadoop-related services.

Cloudera offers a distribution of Big Data software called CDH -- Cloudera's Distribution Including Apache Hadoop. "This is open source, Apache-licensed software," Cutting says. "Folks can develop their applications against these APIs without fear of ever being locked into paying any one vendor."

 
