Corporate fraud, excesses and the subsequent scandals, loss of public trust and even bankruptcies are a recurring phenomenon. We witnessed such excesses as recently as the period leading up to the 2008 “mini depression”.
However, one of the most memorable examples of corporate fraud comes from the mid-to-late 1990s. Fortune called this company “The most innovative company in corporate America”. Goldman Sachs called it an "extraordinary" and "unique" business in early 2001. It was a pioneer of the “asset-light” business model.
The company was Enron. Things soon went south, and it went bankrupt on December 2, 2001.
Subsequently, roughly half a million Enron email messages were released to the public; they still serve as a dataset for research. This case highlighted, perhaps for the first time, the tremendous value of text data in identifying and potentially preventing fraud.
It is estimated that almost 80% of data in an enterprise is unstructured - most of it text. It is no wonder that text analytics continues to take such a prominent position.
> Some 13 years after Enron, auditors still can’t stop managers cooking the books. If accounting scandals no longer dominate headlines as they did when Enron and WorldCom imploded in 2001–2002, that is not because they have vanished but because they have become routine. - The Economist, Dec 13th, 2014
60% Savings on Legal Investigation Costs and 500% Time-Savings in Classifying and Identifying Confidential Data
Application of text mining to compliance, fraud prevention and protection of sensitive information is a thriving business. For example, Dathena, a startup founded by ex-Tier 1 bank employees, is building one such solution. The estimated ROI for their platform is a whopping 60% savings on legal investigation costs and a 500% time-savings in the financial services sector.
In Enron's case, the legal fees were an astronomical $688M, and legal experts estimate that discovery costs can run as high as 90% of the total legal effort. Even a minor reduction in legal effort spent on discovery can translate into massive dollar savings.
Application of Text Mining to Fraud Detection
This topic does attract research attention. A recent paper explains a variety of ways to diagnose fraud in enterprises' financial filings. It also includes a good literature review of the work in this domain.
One of the approaches in this paper (“approach 2”) leverages a classic text-mining technique: creating a document-term matrix as the feature set for learning.
The text is put through a variety of pre-processing steps before analysis. Then a slight modification of the traditional technique is used in creating the document-term matrix: bigrams and trigrams, i.e., two- and three-word sequences, are used as the features instead of the unigrams (individual words) used routinely. Feature selection is applied on top of this before training a variety of machine learning classifiers.
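To make the idea concrete, here is a minimal sketch of building a document-term matrix with bigram and trigram features. It uses a toy whitespace tokenizer and toy documents of my own invention; a real pipeline would add the pre-processing steps mentioned above (stemming, stop-word removal, etc.) and feed the resulting matrix to a classifier.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return the n-grams (as space-joined strings) in a token list."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def document_term_matrix(docs):
    """Build a document-term matrix whose features are bigrams and trigrams.

    Returns (vocab, matrix), where matrix[d][t] is the count of
    vocabulary term t in document d.
    """
    counts = []
    for doc in docs:
        tokens = doc.lower().split()  # toy tokenizer for illustration
        c = Counter()
        for n in (2, 3):              # bigrams and trigrams as features
            c.update(ngrams(tokens, n))
        counts.append(c)
    vocab = sorted(set().union(*counts))
    matrix = [[c[t] for t in vocab] for c in counts]
    return vocab, matrix

# Two toy "filings" that differ in a single word
docs = ["revenue was restated last quarter",
        "revenue was reported last quarter"]
vocab, matrix = document_term_matrix(docs)
```

Each row of the matrix is one document's feature vector; after feature selection, these vectors are what the classifiers in the paper learn from.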
While the paper has all the details, some of the key takeaways are:
- Patterns of word usage are stable in the financial domain as it constitutes an identifiable genre of text. Where there are differences, significance can be attached to them.
- Deception in text is manifested by dense syntactic structure to reduce readability and comprehension.
- Adverbs, adjectives and connectives are also used differently when fraud is present versus absent.
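The second takeaway, that deceptive text tends to be denser and harder to read, can be quantified with a standard readability metric. The sketch below computes the Flesch reading-ease score (higher means easier to read) with a deliberately naive vowel-group syllable heuristic; the sample sentences are my own, not from the paper.

```python
import re

def syllables(word):
    """Very rough syllable estimate: count groups of consecutive vowels."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text):
    """Flesch reading ease: lower scores indicate denser, harder text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syl = sum(syllables(w) for w in words)
    return (206.835
            - 1.015 * (len(words) / len(sentences))
            - 84.6 * (syl / len(words)))

plain = "We lost money. Sales fell."
dense = ("Notwithstanding previously articulated expectations, "
         "consolidated operational performance deteriorated materially.")
```

A fraud-detection pipeline could include such a score as one feature alongside the n-gram features, flagging filings whose readability drops sharply from the firm's usual genre.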