* Be aware of the signal-to-noise ratio. The phrase “garbage in, garbage out” looms large in big data. Some data sources are poor quality and have a low signal-to-noise ratio. Application logs are a great example of this problem. Many applications throw exceptions and log errors as part of normal operation. Enabling verbose logging can provide good information, but it comes with a huge amount of irrelevant noise.
Another example of this problem is threat detection systems, which have been in the news in association with high-profile data breaches. Threat detection systems generate thousands of alerts every day, far more than IT and security teams can actually investigate. The low signal-to-noise ratio of these systems means that alerts are often ignored altogether, and actual threats are missed amidst the chaos.
Finding the signal in all that noise can be hard, and when time is of the essence, cutting through the noise can become mission-critical. If you’re sifting through garbage, the chances of finding what you need in time drop dramatically.
* Consider the motion of your data. Is the data you’re trying to analyze at rest or in flight? The answer to this question has a huge impact on how you process, view, and analyze the data, as well as the value you can derive from it.
Most big data is at rest and analyzed post hoc in batch processes that rely on indexing and parallel processing using techniques based on sharding or MapReduce. At its core, this approach is all about volume and variety, and enterprises are leveraging multiple frameworks and data stores – such as Hadoop, MongoDB and Cassandra – for a variety of structured and unstructured data. While multiple data sources provide context and insight, this approach is always going to be retrospective.
Recently, data-in-flight has drawn greater attention as the need for agility and adaptability drives demand for higher-velocity analysis. Imagine that you’re the CIO of a major retailer. It’s Cyber Monday and your website's page load times are averaging more than 30 seconds. Post-hoc analysis isn't going to save your company's Cyber Monday. Being able to tell the CEO that it won't happen again next year will be cold comfort.
If you’re that CIO, you need insight into what’s going wrong and guidance about how to fix it immediately. For these situations, high-velocity data-in-flight is of paramount importance, giving IT the ability to see how systems are behaving in the moment, compare that behavior to established baselines, and drill down to find the root cause of a problem.
While data-in-flight can provide incredible value, analyzing it requires a fundamentally different approach based on stream processing and summary metrics. In many cases, the data volume is so large that it must be processed in flight. In other cases, the information loses value as it ages, so it is worth the most in real time. For example, wire data is too voluminous to store but is extremely valuable for answering questions about what is happening in the IT environment in real time.
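To make "stream processing and summary metrics" concrete, here is a minimal Python sketch of one way such analysis could work. It keeps a running mean and standard deviation in constant memory (Welford's online method, chosen here for illustration) and flags values that stray far from that baseline. The names `RunningStats` and `find_anomalies`, the three-standard-deviation threshold, and the warm-up length are all assumptions for this example, not something the article prescribes.

```python
import math


class RunningStats:
    """Summary metrics for a stream, kept in constant memory (Welford's method)."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, value):
        """Fold one observation into the summary without storing it."""
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (value - self.mean)

    @property
    def stddev(self):
        """Sample standard deviation of everything seen so far."""
        if self.count < 2:
            return 0.0
        return math.sqrt(self._m2 / (self.count - 1))


def find_anomalies(samples, threshold=3.0, warmup=5):
    """Return (index, value) pairs that deviate sharply from the baseline.

    threshold and warmup are illustrative defaults, not recommended values.
    """
    stats = RunningStats()
    anomalies = []
    for index, value in enumerate(samples):
        baseline_ready = stats.count >= warmup and stats.stddev > 0
        if baseline_ready and abs(value - stats.mean) > threshold * stats.stddev:
            anomalies.append((index, value))
            continue  # keep the outlier out of the baseline
        stats.update(value)
    return anomalies


# Hypothetical page load times in seconds, arriving one at a time.
load_times = [1.0, 1.2, 0.9, 1.1, 1.0, 1.05, 30.0, 1.1]
print(find_anomalies(load_times))  # [(6, 30.0)]
```

Because each observation is folded into the summary and then discarded, this style of processing can apply to data that is too voluminous to store. A production system would likely use windowed or decaying baselines rather than a single all-time average.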