What's the big deal about Hadoop?

Todd R. Weiss | Feb. 15, 2012
Hadoop is all the rage, it seems. With more than 150 enterprises of various sizes using it -- including major companies such as JP Morgan Chase, Google and Yahoo -- it may seem inevitable that the open-source Big Data management system will land in your shop, too.

For example, when it comes to transactions, "it makes total sense to use a relational database system," he says. But overall the idea is to remain "flexible in what technologies we use at eBay; we don't see a world where there will be one unifying technology."

Tech tips

eBay's Williams offers these strategies when dealing with Hadoop:

Manage Hadoop efficiently by learning its organizational structure. "If you have large numbers of people using a Hadoop cluster, they'll likely be trying to do some of the same things at once," Williams says. "That means they'll probably be generating the same intermediate data sets to analyze, and that's a waste."

Instead, he suggests, run common data queries once each morning and save the results in one place where anyone who needs them can use them. That saves large amounts of processing time and related resources. "Think very hard about what data sets are useful for your users and create those data sets."
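The "run it once, share the result" idea can be sketched in a few lines of Python. This is only an illustration, not eBay's actual tooling: the function name daily_dataset, the JSON file layout and the build callback (standing in for an expensive Hadoop query) are all assumptions made for the example.

```python
import json
from datetime import date
from pathlib import Path


def daily_dataset(name, build, cache_dir, day=None):
    """Return the shared copy of a dataset for a given day, building it only if missing.

    `build` is a hypothetical stand-in for an expensive job, such as a Hadoop query.
    """
    day = day or date.today()
    path = Path(cache_dir) / f"{name}-{day.isoformat()}.json"
    if path.exists():
        # Someone already ran the job for this day; reuse the saved result.
        return json.loads(path.read_text())
    result = build()
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(result))
    return result
```

The first caller on a given day pays for the query; everyone after that reads the saved file. A real deployment would more likely schedule the job centrally and would need to handle two users starting the build at the same moment, which this sketch does not.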

Cleaning up your Hadoop cluster is a key maintenance item. "This is really important," Williams says. "You'll probably run a lot of Hadoop jobs and you'll create a lot of data. Often, though, the people doing the work with the files will just walk away. That's pretty typical for users. If you do that, though, you'll end up with lots of extra Hadoop files.

"So you really have to create a strategy to keep your Hadoop cluster neat so you don't run out of disk space. Have people clean up what they don't need. Those kinds of things turn out to be pretty important if you've got a large Hadoop cluster."
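One way to picture such a cleanup strategy is a report of files nobody has touched in a while, grouped by owner. The Python sketch below is hypothetical: it works on a plain list of file records rather than calling any Hadoop API, and the FileInfo fields and the 30-day threshold are assumptions chosen for illustration.

```python
from datetime import datetime, timedelta
from typing import NamedTuple


class FileInfo(NamedTuple):
    """Hypothetical record for one file in a cluster listing."""
    path: str
    owner: str
    size_bytes: int
    modified: datetime


def stale_files(listing, now, max_age_days=30):
    """Return the files last modified more than max_age_days before `now`."""
    cutoff = now - timedelta(days=max_age_days)
    return [f for f in listing if f.modified < cutoff]


def bytes_by_owner(files):
    """Total the size of the given files per owner, to show who should clean up."""
    totals = {}
    for f in files:
        totals[f.owner] = totals.get(f.owner, 0) + f.size_bytes
    return totals
```

Feeding the output of stale_files into bytes_by_owner gives a per-user tally of reclaimable space, which is the kind of nudge that gets people to delete what they no longer need.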

The same is true at Concurrent. Hadoop hasn't replaced the company's use of traditional relational databases, including MySQL, PostgreSQL and Oracle. "It is a combined solution," Lazzaro says. "We use Hadoop to do the heavy lifting, such as large-scale data processing. We then use Map/Reduce within Hadoop to create summary data that is easily accessible through a traditional RDBMS."

What tends to happen in relational databases, he explains, is that when the system grows too large -- say, 250 million records a day -- the database becomes "non-responsive to data queries." "However," he says, "Hadoop at that scale is not even breaking a sweat. Hadoop therefore can store, say, 5 billion records and with Map/Reduce we can create a summary of that data and insert it into a standard RDBMS for quick access."
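The pattern Lazzaro describes, reducing a large record set to a small summary and loading that summary into a relational database, can be sketched in miniature. The Python below is an illustration only, not Concurrent's pipeline: the map and reduce steps run in a single process rather than on Hadoop, SQLite stands in for the RDBMS, and the record layout and the daily_summary table are assumptions.

```python
import sqlite3
from collections import defaultdict


def map_record(record):
    # Map step: emit (key, value) pairs; here, a count of 1 keyed by the record's day.
    yield record["day"], 1


def reduce_counts(key, values):
    # Reduce step: collapse all values for one key into a single summary row.
    return key, sum(values)


def summarize(records):
    """Group mapped values by key, then reduce each group (a single-process stand-in)."""
    grouped = defaultdict(list)
    for record in records:
        for key, value in map_record(record):
            grouped[key].append(value)
    return [reduce_counts(key, values) for key, values in sorted(grouped.items())]


def load_summary(conn, rows):
    """Insert the summary rows into a relational table for quick lookups."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS daily_summary (day TEXT PRIMARY KEY, records INTEGER)"
    )
    conn.executemany("INSERT OR REPLACE INTO daily_summary VALUES (?, ?)", rows)
    conn.commit()
```

However many raw records go in, the table ends up with one row per day, so the database only ever answers queries against the small summary.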

In general, Williams says, "I don't think too much" about Hadoop's limitations. "I think about the opportunities. You can find solutions to any problems pretty quickly" through the open source community. "Some people do gripe about different aspects of Hadoop, but it's a reasonably new thing. It's like Linux was back in 1993 or 1994."

