Comparing the time it takes an attacker to compromise an organization with the time it takes the organization to discover a breach. Data graphed with the R package ggplot2. Credit: Verizon
Analyzing 200,000 records may not seem like a big task. But when those records are security incidents with potentially hundreds of attributes each -- types of bad actors, assets affected, category of organization and more -- it starts getting a little complex for a spreadsheet. So Verizon's annual security report, which was initially done in Excel, is now generated "soup to nuts" in R.
In fact, the Verizon Data Breach Report is something of "a love letter to R," Bob Rudis, managing principal and senior data scientist at Verizon Enterprise Solutions, told the EARL (Effective Applications of the R Language) Boston conference yesterday.
R is "a lot of fun to work with," he said.
One of the main issues in deciding to move from a spreadsheet to R was the complexity of the data format. Verizon researchers receive incident data from contributing organizations as nested JSON, which means numerous categories also have subcategories. Importing and analyzing all that with Excel was problematic.
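To see why nested records fit poorly into a flat grid of rows and columns, consider the minimal sketch below. It is written in Python purely for illustration (the report itself was produced in R), and the record and its field names are made up for this example rather than taken from Verizon's data or the actual schema. The hypothetical flatten helper turns each nested value into one dotted-path column, which is roughly what has to happen before such data can sit in a spreadsheet.

```python
import json

# Hypothetical incident record; the field names are illustrative assumptions,
# not the real schema used by Verizon.
RAW = """
{
  "incident_id": "example-001",
  "actor": {"external": {"variety": ["Organized crime"], "motive": ["Financial"]}},
  "asset": {"assets": [{"variety": "S - Database"}, {"variety": "U - Laptop"}]},
  "victim": {"industry": "Finance"}
}
"""


def flatten(node, prefix=""):
    """Walk nested dicts and lists, yielding (path, value) for each leaf value."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from flatten(value, f"{prefix}.{key}" if prefix else key)
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from flatten(value, f"{prefix}[{index}]")
    else:
        yield prefix, node


record = json.loads(RAW)
flat = dict(flatten(record))

# Each leaf becomes its own "column"; lists add one column per element.
for path, value in flat.items():
    print(f"{path} = {value}")
```

Even this tiny record expands into six columns, and every additional list element adds another, so the column set differs from one incident to the next. With hundreds of possible attributes per incident, that is the kind of bookkeeping a spreadsheet handles poorly.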
There were other advantages in using R, Rudis said. Because R's ggplot2 package can produce sophisticated publication-quality graphics, the company saved an estimated $15,000 to $20,000 by no longer needing an external graphics design firm. The only change made to the R-created graphics prior to release was swapping in new type fonts. "R [stinks] at fonts," Rudis said.
However, R has great tools for modeling, clustering and other statistical analysis that Verizon wants to do beyond counting, such as examining what attackers are likely to do depending on the type of organization. Even within financial services, he pointed out, top threats are considerably different for, say, banks compared with insurance companies.
The report team also used R to create interactive visualizations such as one that explores which industries have similar threat profiles.
The security data is recorded in an open-source format called VERIS, the Vocabulary for Event Recording and Incident Sharing. For those who would like to analyze publicly reported breach data, there is a VERIS Community Database as well as an R package called verisr that makes it easier to work with that data. Rudis and Jay Jacobs also authored a book, Data-Driven Security, which details how to use the VERIS schema and R to record and analyze security incidents.
There is considerably more data analyzed in the Verizon report than is available in the public database, including incidents sent in by agencies such as the U.S. Secret Service and FBI, Rudis said.