At first glance, Microsoft's acquisition of document e-discovery software maker Equivio looks like another key addition to Office 365's growing portfolio of document management tools for major industry verticals. Equivio supports law firms with a set of tools that automatically generate relevance-based indexes from large streams of text, helping attorneys locate information vital to their cases.
But a big key to today's acquisition may have just been delivered to Equivio by the U.S. Patent and Trademark Office. Patent number 8,938,461, issued just today, describes a "Method for organizing large numbers of documents" that may very easily apply to a future document management system for premium Office 365 subscribers.
Today's patent, applied for in July 2010 by Equivio CEO Amir Milo and vice president for engineering Yiftach Ravid, describes a system for detecting and sorting near-duplicate documents, especially e-mails. While it's relatively easy, even with a "big data" store, to identify identical documents and eliminate duplicates, e-mails tend to contain fragmented quotes and partial excerpts, frequently broken up by ">" characters at the beginnings of lines to mark them as excerpts. In the e-discovery process, attorneys need to be able to connect threads of discussions compiled from multiple streams of e-mails, almost none of which contain absolutely identical excerpts.
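To see why quoted excerpts defeat exact matching, consider a minimal Python sketch. It is not the method claimed in the patent. It assumes a common textbook approach (word shingles compared by Jaccard similarity), and the names `strip_quote_markers`, `shingles` and `jaccard` are hypothetical, chosen only for illustration.

```python
import re


def strip_quote_markers(text: str) -> str:
    """Drop leading '>' quote markers, rejoin wrapped lines, lowercase."""
    lines = [re.sub(r"^\s*(?:>\s*)+", "", line) for line in text.splitlines()]
    return " ".join(" ".join(lines).split()).lower()


def shingles(text: str, k: int = 3) -> set:
    """Return the set of overlapping k-word sequences in the cleaned text."""
    words = strip_quote_markers(text).split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: str, b: str, k: int = 3) -> float:
    """Similarity from 0.0 (nothing shared) to 1.0 (same shingle sets)."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


# Illustrative sample messages (invented for this sketch).
original = "Please send the signed contract to legal by Friday."
reply = "> Please send the signed contract\n> to legal by Friday.\n"

print(original == reply)         # the raw strings differ
print(jaccard(original, reply))  # the cleaned text matches
```

A byte-for-byte comparison treats the two messages as different, while the cleaned comparison scores them as the same text. A production system would need far more than this, since real excerpts are also trimmed, reworded and interleaved with new replies.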
"Businesses and governments around the world generate enormous volumes of data every day," reads a blog post this morning (http://blogs.microsoft.com/blog/2015/01/20/microsoft-acquires-equivio-provider-machine-learning-powered-compliance-solutions/) by Microsoft Outlook and Office 365 corporate vice president Rajesh Jha. "Sifting through that data to find what is relevant to a legal or compliance matter is costly and time consuming. Traditional techniques for finding relevant documents are falling behind as the growth of data outpaces people's ability to manually process it."
Theoretically, a cloud-based service incorporating this technology could draw these associations between discussions among related parties in real time. This could become useful for any number of paying Office 365 subscribers, perhaps on a premium tier, well beyond the legal profession.
The "Method for organizing" patent explains how such a system analyzes streams of document data and compiles fingerprint information from those streams. Those fingerprints are continually checked for similarities, and when they arise, the system constructs "presumed documents" that may later correspond to the original documents from which fragments are excerpted. Such documents are assemblies of nodes, each of which is described by metadata. When multiple properties of this metadata appear to correspond, the system compares fingerprints to see whether separate fragments actually match. If they do, they're incorporated into the "presumed documents."
It's a way of compiling a multi-dimensional index from several huge streams of data without requiring "root documents" (files known to be the originals from which the other copies are made) in advance.
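The two-stage check described above (metadata first, fingerprints second) can be illustrated with a short Python sketch. This is an assumption-laden toy, not Equivio's implementation: the `Fragment` class, the use of the subject line as the only metadata property, hashed three-word shingles as the fingerprint, and the 0.8 threshold are all hypothetical choices made for illustration.

```python
import hashlib
import re
from dataclasses import dataclass


@dataclass
class Fragment:
    subject: str  # stands in for the metadata a real system would track
    text: str


def subject_key(subject: str) -> str:
    """Normalize a subject line by stripping Re:/Fw:/Fwd: prefixes."""
    stripped = re.sub(r"^(?:\s*(?:re|fw|fwd)\s*:\s*)+", "", subject,
                      flags=re.IGNORECASE)
    return stripped.strip().lower()


def fingerprint(text: str, k: int = 3) -> frozenset:
    """Hash every k-word shingle of the text, ignoring '>' quote markers."""
    cleaned = re.sub(r"^\s*(?:>\s*)+", "", text, flags=re.MULTILINE)
    words = cleaned.lower().split()
    grams = [" ".join(words[i:i + k])
             for i in range(max(len(words) - k + 1, 1))]
    return frozenset(hashlib.sha1(g.encode("utf-8")).hexdigest()[:16]
                     for g in grams)


def overlap(a: frozenset, b: frozenset) -> float:
    """Share of the smaller fingerprint that is contained in the larger."""
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


def group_fragments(fragments, threshold: float = 0.8):
    """Greedily assign each fragment to a 'presumed document' group."""
    groups = []
    for frag in fragments:
        key, fp = subject_key(frag.subject), fingerprint(frag.text)
        for group in groups:
            # Stage 1: metadata must correspond. Stage 2: fingerprints must.
            if group["key"] == key and overlap(fp, group["prints"]) >= threshold:
                group["members"].append(frag)
                group["prints"] = group["prints"] | fp
                break
        else:
            groups.append({"key": key, "prints": fp, "members": [frag]})
    return groups
```

Note that no fragment is designated as the original. Each group simply accumulates the fingerprints of whatever fragments have matched so far, which loosely mirrors the idea of building a presumed document without a known root. A usage example might pass in a message, a quoted reply to it, and an unrelated message, and expect the first two to land in one group.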