How an online real estate company optimized its Hadoop clusters

Thor Olavsrud | April 21, 2016
Every night, Trulia crunches more than a terabyte of new data and cross-references it with about 2 petabytes of existing data to deliver the most up-to-date real estate information to its users. Here's how it ensures consistent quality of service from its Hadoop clusters.

"We're watching how every application on the cluster is actually using the hardware," says Sean Suchter, co-founder and CEO of Pepperdata. "If there is any contention between some high priority thing and some computationally expensive ad-hoc thing, we'll detect that and slow down or otherwise affect the low priority thing just enough to give a consistent, high quality of service to the high priority thing."

"The performance gains we get scale pretty well with the chaos of the cluster," he adds. "The more chaos you have, the more applications you run, the more different tenants you have, the better we can do. We're able to react in a second-by-second fashion and do a lot of optimization. The opportunity goes higher the more complex the environment is."
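To make the idea Suchter describes concrete, here is a minimal Python sketch of priority-aware throttling. It is not Pepperdata's implementation, which the article does not detail. The `App` class, the `throttle` function, the single shared `capacity` figure and the proportional scaling rule are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class App:
    """Hypothetical description of one application on a shared cluster."""
    name: str
    high_priority: bool
    demand: float  # resource the app wants right now, e.g. disk I/O in MB/s


def throttle(apps, capacity):
    """Return a per-app allocation whose total never exceeds capacity.

    Assumed policy: high-priority apps are served first. Low-priority apps
    share whatever is left and are scaled down proportionally, and only
    when their combined demand does not fit.
    """
    high = sum(a.demand for a in apps if a.high_priority)
    low = sum(a.demand for a in apps if not a.high_priority)

    # High-priority work is only scaled back if it alone exceeds capacity.
    high_scale = min(1.0, capacity / high) if high > 0 else 1.0
    remaining = max(0.0, capacity - high * high_scale)

    # Low-priority work is slowed "just enough" to fit in what remains.
    low_scale = min(1.0, remaining / low) if low > 0 else 1.0

    return {
        a.name: a.demand * (high_scale if a.high_priority else low_scale)
        for a in apps
    }
```

With a capacity of 100, a high-priority job asking for 60 and an ad-hoc job asking for 80, `throttle` leaves the first job untouched and cuts the second to 40. A real system would re-run a decision like this every second or so against measured usage rather than declared demand.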

Trulia has used Pepperdata's alerting feature to create detailed notifications that proactively track performance metrics across its Hadoop environment. Between the dashboards and the alerting functions, Trulia can now identify problems much more easily and quickly. With the new visibility, the company has been able to optimize its Hadoop usage and maximize utilization.

"We rolled out Pepperdata last year," Williamson says. "It's been an amazing tool for us to diagnose problems. Within hours, rather than days, we could zoom in on what was going on and make changes."

He notes that Trulia now uses Pepperdata to manage five different Hadoop clusters, ranging from a dozen nodes to more than 40. There are about 2 petabytes of data across all the clusters. The company also has a number of clusters on AWS that are not yet managed by Pepperdata because they're used for batch-driven EMR workloads that aren't persistent. But he's working with the Pepperdata team to bring those clusters under Pepperdata management too.

"It's definitely on my roadmap," he says. "I feel like I'm running blind here."

