The Panama Papers have become a worldwide phenomenon. Contents aside, the technical aspects of the related investigations, involving 2.6 terabytes of data -- 11.5 million documents! -- are intriguing. How did journalists manage, organize and analyze this huge amount of data?
Over the past year, the International Consortium of Investigative Journalists (ICIJ) established the technical foundations for the worldwide media investigation. The ICIJ also laid the technical groundwork for the "Offshore Leaks" involving international tax-shelter accounts in 2013, the "Lux Leaks" of Luxembourg's tax rulings in 2014 and the "Swiss Leaks" of cash shelters in Switzerland in 2015, and so has a lot of experience in data analytics for journalistic purposes. Mar Cabra works in the data and research unit at the ICIJ and knows the Panama Papers' technical challenges better than anyone else. I had the opportunity to speak with the Spanish journalist a few days ago.
Computerwoche: The Panama Papers involve the largest volume of data journalists have ever worked with in a single research project: 2.6 terabytes of e-mails, PDF documents, images and database files. How was it possible to analyze the data and make it searchable?
Mar Cabra: This is our fourth investigation based on a leak from the offshore world. We've learned over the years that at the beginning of every investigation like this we need to spend time getting to know the data in front of us. So we spent a good couple of months studying the data and its different formats to work out how we could process it. We used platforms from our previous investigations, but improved them to cope with this amount of data.
The first thing we knew was that we needed a platform to host all the documents. Unfortunately, a third of the documents were images in PDFs or TIFFs. So we had to set up a complex optical character recognition (OCR) processing chain to extract text from those documents. Then we indexed the documents and put them on a cloud platform that allowed us to search them from anywhere in the world.
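Editor's note: the "index, then search" step described above can be illustrated with a very small sketch. It assumes the text has already been extracted from the scans by OCR, and it is not the ICIJ's actual tooling, which the interview does not name. The document IDs, sample texts and the function names tokenize, build_index and search are all hypothetical and chosen only for illustration.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Build an inverted index: token -> set of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return the IDs of documents that contain every token in the query."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result


# Hypothetical sample data standing in for OCR output.
documents = {
    "doc-001": "Certificate of incorporation for Example Holdings Ltd.",
    "doc-002": "Email regarding shareholder register of Example Holdings",
    "doc-003": "Scanned invoice, unrelated company",
}

index = build_index(documents)
matches = search(index, "example holdings")
print(sorted(matches))
```

A production search platform would add ranking, phrase queries and distributed storage on top of this idea; the sketch only shows why indexing extracted text once makes later searches fast.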
At the same time, we realized that we also had documents from Mossack Fonseca's internal database, which covered more than 200,000 companies in tax havens across 21 jurisdictions. So we knew that we needed another tool to visualize that data. We decided to move the database into Neo4j and then feed the Neo4j database into Linkurio.us, software that allowed us to visualize graphs very easily and see the connections between companies, beneficiaries, shareholders and all their addresses. Those were the two main platforms we had for the reporters to mine these 2.6 terabytes of information.
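Editor's note: to show why a graph model suits this kind of data, here is a minimal sketch in plain Python. It does not use Neo4j or Linkurio.us and does not reflect the ICIJ's actual schema. The entity names, the relationship labels and the functions build_graph and shortest_path are hypothetical. The idea is that companies, people and addresses become nodes, relationships become edges, and a connection between two parties is a path through the graph.

```python
from collections import defaultdict, deque


def build_graph(relationships):
    """Build an undirected adjacency map from (source, relation, target) triples."""
    graph = defaultdict(set)
    for source, _relation, target in relationships:
        graph[source].add(target)
        graph[target].add(source)
    return graph


def shortest_path(graph, start, goal):
    """Breadth-first search for the shortest chain of links between two entities."""
    if start not in graph or goal not in graph:
        return None
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbor in sorted(graph[node]):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None


# Hypothetical relationships; the labels are illustrative only.
relationships = [
    ("Person A", "SHAREHOLDER_OF", "Company X"),
    ("Person B", "BENEFICIARY_OF", "Company X"),
    ("Company X", "REGISTERED_AT", "Address 1"),
    ("Company Y", "REGISTERED_AT", "Address 1"),
    ("Person C", "SHAREHOLDER_OF", "Company Z"),
]

graph = build_graph(relationships)
path = shortest_path(graph, "Person A", "Company Y")
print(" -> ".join(path))
```

In this toy data, Person A and Company Y are linked only through a shared address, the kind of indirect connection that is hard to spot in tabular records but easy to follow in a graph.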