How an online real estate company optimized its Hadoop clusters

Thor Olavsrud | April 21, 2016
Every night, Trulia crunches more than a terabyte of new data and cross-references it with about 2 petabytes of existing data to deliver the most up-to-date real estate information to its users. Here's how it ensures consistent quality of service from its Hadoop clusters.

San Francisco-based online residential real estate company Trulia lives and dies by data. To compete successfully in today's housing market, it must deliver the most up-to-date real estate information available to its customers. But until recently, doing so was a daily struggle.

Acquired by online real estate database company Zillow in 2014 for $3.5 billion, Trulia is one of the largest online residential real estate marketplaces around, with more than 55 million unique site visitors each month.

Hadoop at heart

With so much data to store and process, the company adopted Hadoop in 2008, and it has since become the heart of Trulia's data infrastructure. The company has expanded its use of Hadoop to an entire data engineering department consisting of several teams using multiple clusters. This allows Trulia to deliver personalized recommendations to customers based on sophisticated data science models that analyze more than a terabyte of data daily. That data is drawn from new listings, public records and user behavior, all of which is then cross-referenced with search criteria to alert customers quickly when new properties become available.

To make it all work, throughout each night, the company must complete dozens of workflows and hundreds of complex jobs on time. With many teams writing Hadoop jobs or using Hive or Spark concurrently, Trulia has to ensure reliability in its multi-tenant, multi-workload environment. Delayed or unpredictable jobs throw a wrench in the works and can seriously affect the bottom line. Until recently, that meant Trulia had to intentionally underutilize its Hadoop clusters to ensure jobs completed on time.

"We process, on a daily basis, over a terabyte of new information: public records, listings, user activity," says Zane Williamson, senior DevOps engineer at Trulia. "We process this data across multiple Hadoop clusters and use the information to send out email and push notifications to our users. That's the lead driver to get users back to the site and interacting. It's very important that it gets done in a daily fashion. Reliability and uptime for the workflows is essential."

"It's been a pretty painful process, I think," Williamson adds, noting that he joined Trulia relatively recently. "It's been a pretty big challenge to reliably run this data cycle, maintain uptime and troubleshoot issues. Troubleshooting issues could sometimes take days to dial in on."

Sprinkle with Pepperdata

To ease that pain and achieve more reliable Hadoop job completion, Trulia turned to Pepperdata, whose adaptive performance software is designed to guarantee quality of service on Hadoop.

Pepperdata provides a granular view of everything happening across your Hadoop clusters, actively governing use of CPU, memory, disk I/O and network for every task, job, user and group. For Trulia, the pièce de résistance was Pepperdata's newest feature: the capability to turn any trackable metric into an alert defined at any level of granularity, from cluster to node, user, queue, job or task.

