The user data was due to be available for download Thursday through Yahoo Labs' Webscope data sharing program, a library of anonymized data sets for non-commercial use.
It's based on user interactions with Yahoo News, Sports, Finance, Movies, and Real Estate. The data was gathered over four months early last year from 20 million Yahoo users. In addition to the interaction data, it includes categorized demographic information, like age range and gender, for a subset of the users. Yahoo is also releasing the title, summary, and key phrases of the related news articles.
Yahoo says the previous largest data set, released last year by the online marketing firm Criteo, was 1TB in size and included some 4 billion events.
It says its goal is to level the playing field a bit for academic researchers, who often have more freedom to pursue long-range projects than their peers at corporations but lack the real-world data to do so.
"They might be able to solve problems in a way that we can make use of at Yahoo, or come up with new research problems that we haven't even thought of yet," Rajan said.