Etsy gets crafty with big data

Ann Bednarz | Feb. 26, 2013
Handmade marketplace Etsy has grown to 800,000 sellers and 40+ million monthly visitors. All that activity generates enormous quantities of data, which Etsy uses to drive site improvements.

"We have our own custom event-logging frameworks, and we store all the data in [Hadoop Distributed File System (HDFS)]. We process the data into ETL using a data flow language known as Cascading, and then we push it downstream to a data warehouse, which is Vertica," Mardenfeld says.
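The flow Mardenfeld describes (raw event logs, an ETL step, then rows loaded into a warehouse) can be illustrated with a small sketch. This is not Etsy's code: Etsy runs Cascading jobs on Hadoop, whereas the example below is plain single-machine Python, and the JSON log layout, the field names `timestamp` and `event_type`, and the function `aggregate_events` are all assumptions made for illustration.

```python
import json
from collections import Counter


def aggregate_events(log_lines):
    """Count events per (day, event_type) from JSON-formatted log lines.

    Assumes each line is a JSON object with hypothetical fields
    "timestamp" (ISO 8601 string) and "event_type". Lines that are
    blank, malformed, or missing either field are skipped.
    """
    counts = Counter()
    for line in log_lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not valid JSON
        if not isinstance(event, dict):
            continue  # skip valid JSON that is not an object
        timestamp = event.get("timestamp")
        event_type = event.get("event_type")
        if not isinstance(timestamp, str) or not isinstance(event_type, str):
            continue
        day = timestamp[:10]  # "YYYY-MM-DD" prefix of an ISO timestamp
        counts[(day, event_type)] += 1
    # Flat, sorted rows of the kind a warehouse table might hold.
    return sorted((day, etype, n) for (day, etype), n in counts.items())
```

The output is a list of `(day, event_type, count)` tuples. In a real pipeline, rows like these would be bulk-loaded into the warehouse rather than returned in memory.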

Etsy also uses Amazon Elastic MapReduce clusters to analyze the data and perform predictive analytics. "Hadoop is an important part of our pipeline. I don't think we'd be able to do any of this without it," Mardenfeld says.

To digest the data, Etsy has built a number of homegrown tools. "We write a bunch of custom UIs for this, for our internal tools. One of them is what we call the A/B Analyzer, which allows us to easily do analysis on experiments that we run. We also have our own internal funnel tool and our own dashboard tool," Mardenfeld says.
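The article does not say how the A/B Analyzer computes its results. As a rough sketch of the kind of calculation such a tool might run, the example below compares the conversion rates of a control group and a variant using a standard two-proportion z-test. The function name, its parameters, and the choice of test are assumptions, not a description of Etsy's implementation.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Compare conversion rates of control (A) and variant (B).

    conv_a, conv_b: number of converting visitors in each group.
    n_a, n_b: total visitors in each group.
    Returns (lift, z, p_value), where lift is the absolute difference
    in conversion rate (B minus A) and p_value is two-sided.
    """
    if n_a <= 0 or n_b <= 0:
        raise ValueError("sample sizes must be positive")
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both groups convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        # No variation at all (e.g. zero conversions in both groups).
        return rate_b - rate_a, 0.0, 1.0
    z = (rate_b - rate_a) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return rate_b - rate_a, z, p_value
```

A call such as `two_proportion_z_test(200, 1000, 250, 1000)` returns the lift, the z statistic, and a p-value. A small p-value suggests the difference between the two groups is unlikely to be chance alone.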

The homegrown presentation tools make it easier for teams throughout Etsy, including people without statistical expertise, to use data for experimentation and to inform product development. A launch calendar tracks every active experiment at Etsy; employees can click on an experiment and see its results to date in the homegrown dashboards.

"We had a lot of questions that were of the same type, so we've generalized those so it's easy for people to get the answers to those questions without doing a lot of work," Mardenfeld says. "For more custom questions, you can answer questions in Vertica, you can use SQL, and for more in-depth data mining and analysis and building products, then you can jump down to writing things in MapReduce and Cascading."

Big data, little changes

Etsy's continuous deployment approach is well suited to tying single, isolated site changes to experiments, and it makes it easy to identify the culprit when a code change causes problems.

"When you make multiple changes to a site or a page, it's hard to figure out what's not working the way you want it to," Mardenfeld says. "When you change one thing at a time, you're able to see where you went down the wrong path and can backtrack very easily."

Another benefit of continuous deployment is the ability to pull the plug on a code change that doesn't live up to expectations. "We're more likely, we think, to notice that we're doing something that's bad and to stop," McKinley says. "Whereas operationally and emotionally, if you work on something for many months and then release it, there's nothing that will stop you from releasing it because you're invested in it."


