"It's worth remembering that there's a multi-stage process here," Monash added. "For example, a PDF can be converted to text (and image) data, (Name, value) pairs can be extracted. Those can have their spelling corrected. Then the company names can be regularized. In real life, there can be tens of steps."
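The stages Monash lists can be sketched as a chain of small functions, each taking the previous stage's output. The Python below is only an illustration under stated assumptions: the PDF-to-text step is assumed to have already run, and the field vocabulary, suffix list, sample text, and function names are all hypothetical rather than anything used at the hackathon.

```python
import re
from difflib import get_close_matches

# Hypothetical vocabulary of known field names, used for spelling correction.
KNOWN_FIELDS = ["company", "amount", "date"]

# Hypothetical corporate suffixes to strip when regularizing company names.
SUFFIXES = re.compile(r"[,\s]+(inc|corp|corporation|ltd|llc|co)\.?$", re.IGNORECASE)


def extract_pairs(text):
    """Pull "name: value" pairs out of already-extracted text."""
    pairs = []
    for line in text.splitlines():
        if ":" in line:
            name, value = line.split(":", 1)
            pairs.append((name.strip().lower(), value.strip()))
    return pairs


def correct_spelling(pairs):
    """Snap misspelled field names to the closest known field, if any."""
    corrected = []
    for name, value in pairs:
        match = get_close_matches(name, KNOWN_FIELDS, n=1, cutoff=0.7)
        corrected.append((match[0] if match else name, value))
    return corrected


def regularize_companies(pairs):
    """Strip corporate suffixes so variants of a company name compare equal."""
    result = []
    for name, value in pairs:
        if name == "company":
            value = SUFFIXES.sub("", value).strip()
        result.append((name, value))
    return result


def run_pipeline(text):
    """Chain the stages; PDF-to-text conversion is assumed to have run already."""
    pairs = extract_pairs(text)
    for stage in (correct_spelling, regularize_companies):
        pairs = stage(pairs)
    return pairs


# Made-up sample standing in for text extracted from a PDF.
sample = "Compnay: Acme Corp.\nAmuont: 1200\nDate: 2020-01-31"
print(run_pipeline(sample))
```

Keeping each stage as a separate function makes it straightforward to add further steps, which matters if a real pipeline runs to tens of them.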
As for the hackathon's potential value, "a large fraction of the world's interesting information is on paper, or in paper-like formats such as PDF," he added. "Of course it's worthwhile to make all that more accessible."