With social data, though, it's only the extracts that can be practically handled in a pure cloud warehouse.
The underlying details (for ad hoc queries, hypothesis testing, and extract formulation) will almost certainly have to be handled with on-premises databases. Fortunately, disk and memory prices continue to fall while capacity expands. (The laptop I'm writing this article on, for example, has more than a TB of internal disk space.)
The real cost of the on-premises warehouse, though, will be the software and the data analyst. While there is good news in terms of analytical power, neither the people nor the software is likely to come down in price any time soon.
Prune Social Data Early, Often
I'm usually a card-carrying data pack-rat, but with social data warehousing, there's not much point in keeping detailed data around for too long.
The first reason is information value. Much of your social data becomes obsolete as the rules of the game continue to change.
- Cyber-social mores are evolving rapidly. There's not much profit to be gained by understanding how people interacted in MySpace or Second Life. Let's face it: Some social-network behaviors are vulnerable to fads. Leave the pure research projects to the academics.
- Advertising platforms and tactics are evolving rapidly, particularly for mobile audiences. The threshold values for click-through ratio, and its importance in understanding conversion ratios, are hardly fixed.
- Your competitors' actions affect your results, and your goals will change from year to year. Given the size and complexity of the data space, it will be almost impossible to normalize analytics over the long term. There's not much hope of discovering universal coefficients and algorithms that will be good over long periods, so focus on the here and now.
The second reason is signal-to-noise ratio and the costs of processing it.
- A significant proportion of the social data you collect will be noise. The initial data points may look promising, but in many cases the user you're tracking took no action or simply disappeared from view. In some example data sets, we were able to throw out all the data from more than 95 percent of the prospects we were tracking.
- Even if you get your data warehouse software or service for free, there's a non-zero cost for the time and effort of managing, let alone analyzing, the avalanche of data. We've seen some real-time systems that were not able to even delete more than a month's worth of data per query. Imagine how long it would take to aggregate all your data.
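To make the two pruning ideas above concrete, here is a minimal sketch, not a description of any particular product. It assumes a hypothetical `events` table (columns `prospect_id`, `event_time`, `action`) and uses Python's built-in SQLite module as a stand-in for whatever warehouse you actually run. The function names are invented for illustration. One function throws out every row for prospects who never took an action; the other deletes old data one time window at a time, which is one way to live with systems that can't delete much per query.

```python
import sqlite3
from datetime import datetime, timedelta


def create_events_table(conn):
    """Create the hypothetical events table used in this sketch."""
    conn.execute(
        "CREATE TABLE events ("
        "prospect_id TEXT NOT NULL, "
        "event_time TEXT NOT NULL, "  # ISO-8601 text, so strings sort by time
        "action TEXT)"                # NULL means the prospect took no action
    )


def drop_inactive_prospects(conn):
    """Delete every row for prospects that never recorded an action."""
    cur = conn.execute(
        "DELETE FROM events WHERE prospect_id NOT IN "
        "(SELECT DISTINCT prospect_id FROM events WHERE action IS NOT NULL)"
    )
    conn.commit()
    return cur.rowcount


def prune_old_events(conn, cutoff, window_days=30):
    """Delete rows older than `cutoff`, one time window per DELETE."""
    oldest = conn.execute("SELECT MIN(event_time) FROM events").fetchone()[0]
    if oldest is None:
        return 0  # empty table, nothing to prune
    start = datetime.fromisoformat(oldest)
    deleted = 0
    while start < cutoff:
        end = min(start + timedelta(days=window_days), cutoff)
        cur = conn.execute(
            "DELETE FROM events WHERE event_time >= ? AND event_time < ?",
            (start.isoformat(), end.isoformat()),
        )
        conn.commit()  # commit per window to keep each transaction small
        deleted += cur.rowcount
        start = end
    return deleted
```

Running the inactive-prospect pass first means the windowed deletes have far less data to walk through. The 30-day window is an arbitrary default; the right size depends on how much your system can delete in one statement.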
Social data warehousing is so new that we have to reinvent the practice, as well as the tools, to be effective. Make sure your social data warehousing project has a clear (and probably short-term) goal, as well as tight management, so it doesn't become a money pit.