CW: Which parts of the work could be automated, which had to be done manually?
Cabra: Nothing was done automatically. I mean, processing 2.6 terabytes of information in 11.5 million files takes a long time. We had to spend many resources improving the platforms that we were using. One thing we have to keep in mind at ICIJ when using software to help us with an investigation is our very wide range of users.
On the one hand we have journalists who are very good at their job but are not good with technology. On the other hand we have very tech-driven journalists who know everything about encryption and computers, some of whom are even developers themselves. So with every tool, we need to cater to both audiences.
For the document search platform we needed something that just allows people to search in a search box -- like they do in Google -- but also allows more complex queries like searching for regular expressions and patterns like bank accounts, IDs, passports. Same thing with Linkurio.us and Neo4j.
The good thing about Linkurio.us is that it can visualize graph data very easily. Everybody can work with dots. So our journalists who aren't very techie can just click on a dot, and then several other dots and connections appear. They find it very useful because it's very intuitive. However, Linkurio.us and Neo4j are integrated in such a way that the more advanced users can write queries in Cypher -- Neo4j's query language -- which express things like "show me all the people connected to this person within two steps" or "show me all the persons who are connected to more than twenty companies." So that was very important for us -- to set up the platform in a way that both types of journalists could work with it. We invested in one full-time programmer to improve the document platform and process the documents for one full year.
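To make the two example questions concrete, here is a minimal sketch in plain Python rather than Cypher. It is not ICIJ's code or data: the toy graph, the `person:` / `company:` naming, and both function names are assumptions made up for illustration. It only shows what "within two steps" and "connected to more than N companies" mean on a small graph.

```python
from collections import deque

# Hypothetical toy graph: each node maps to the set of nodes it is linked to.
# All names are invented for illustration only.
GRAPH = {
    "person:A": {"company:X", "company:Y"},
    "person:B": {"company:X"},
    "person:C": {"company:Y", "company:Z"},
    "person:D": {"company:Z"},
    "company:X": {"person:A", "person:B"},
    "company:Y": {"person:A", "person:C"},
    "company:Z": {"person:C", "person:D"},
}


def people_within(graph, start, max_steps):
    """Return the people reachable from `start` in at most `max_steps` hops."""
    seen = {start}
    queue = deque([(start, 0)])
    found = set()
    while queue:
        node, dist = queue.popleft()
        if dist == max_steps:
            continue  # do not expand beyond the step limit
        for neighbour in graph.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
                if neighbour.startswith("person:"):
                    found.add(neighbour)
    return found


def people_with_more_than(graph, min_companies):
    """Return the people linked to more than `min_companies` companies."""
    return {
        node
        for node, links in graph.items()
        if node.startswith("person:")
        and sum(1 for n in links if n.startswith("company:")) > min_companies
    }
```

In this toy graph, two steps from a person means "shares a company with that person": `people_within(GRAPH, "person:A", 2)` follows A's companies and returns the other people attached to them. A graph database answers the same kind of question declaratively, without the user writing the traversal by hand.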
CW: What was the biggest challenge in the process?
Cabra: The processing side. We had to set up a very complex chain that would basically take the documents and look at them to see if the machine could extract structured text. If it couldn't, it would send the documents to OCR for recognition and then send them to the index. We did this with parallel processing on 30 to 40 machines in the cloud. If we had only had one queue of documents, it would have taken us forever.
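The chain described here (try to read the text directly, fall back to OCR, then index, with many workers in parallel) can be sketched roughly as follows. This is an illustrative assumption, not ICIJ's actual pipeline: `extract_text` and `run_ocr` are hypothetical placeholders standing in for real extraction and OCR tools, the "index" is just a dictionary, and a local thread pool stands in for the 30 to 40 cloud machines.

```python
from concurrent.futures import ThreadPoolExecutor


def extract_text(doc):
    """Placeholder: return the embedded text, or None if the file has none."""
    return doc.get("text")


def run_ocr(doc):
    """Placeholder for an OCR step; here it just reads a pre-filled field."""
    return doc.get("scanned_text", "")


def process(doc):
    """Route one document: direct extraction first, OCR as the fallback."""
    text = extract_text(doc)
    route = "direct"
    if not text:
        text = run_ocr(doc)
        route = "ocr"
    return {"id": doc["id"], "text": text, "route": route}


def build_index(docs, workers=4):
    """Process documents in parallel and collect the results keyed by id."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return {entry["id"]: entry for entry in pool.map(process, docs)}
```

The point of the design is that each document is independent, so the work can be split across as many workers as are available instead of passing through a single queue.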
CW: You talked about improvements of the platforms. How did you improve them?
Cabra: Since there were so many documents in so many different formats, this investigation was something special for ICIJ. For example, some journalists wanted a feature that would let them feed the platform a list of names from their country and get back a list of those that appear in the documents. So we developed this feature, "batch searching." You put a spreadsheet of names in, and a couple of minutes later you get a result list out. We had to improve the tools with features that we didn't need in previous investigations.
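A rough sketch of the batch-searching idea, assuming the spreadsheet is a one-column CSV and the documents are already available as plain text. The function names, the normalization rule, and the simple substring match are all assumptions for illustration; the real feature would query a search index rather than scan strings.

```python
import csv
import io


def normalize(name):
    """Lowercase and collapse whitespace so trivial variations still match."""
    return " ".join(name.lower().split())


def batch_search(names_csv, documents):
    """Return {name: [doc ids]} for each listed name found in any document."""
    reader = csv.reader(io.StringIO(names_csv))
    names = [row[0].strip() for row in reader if row and row[0].strip()]
    normalized_docs = {doc_id: normalize(text) for doc_id, text in documents.items()}
    results = {}
    for name in names:
        needle = normalize(name)
        hits = [doc_id for doc_id, text in normalized_docs.items() if needle in text]
        if hits:
            results[name] = sorted(hits)
    return results
```

Names with no match are simply left out of the result, so the output is the short list a reporter would want to follow up on.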