If you think the storage systems in your data center are out of control, imagine having 450 billion objects in your database or having to add 40 terabytes of data each week.
The challenges of managing massive amounts of data involve storing huge files, creating long-term archives and, of course, making the data accessible. While data management has always been a key function in IT, "the current frenzy has taken market activity to a whole new level," says Richard Winter, an analyst at WinterCorp Consulting Services, which analyzes big data trends.
New products appear regularly from established companies and startups alike. Whether it's Hadoop, MapReduce, NoSQL or one of several dozen data warehousing appliances, file systems and new architectures, the segment is booming, Winter says.
Some IT shops know all too well about the challenges inherent in managing big data. At the Library of Congress, Amazon and Mazda, the task requires innovative approaches for handling billions of objects and peta-scale storage mediums, tagging data for quick retrieval and rooting out errors.
1. Library of Congress
The Library of Congress processes 2.5 petabytes of data each year, which amounts to around 40TB each week. And Thomas Youkel, group chief of enterprise systems engineering at the library, estimates that the data load will quadruple in the next few years, thanks to the library's dual mandates to serve up data for historians and to preserve information in all its forms.
The library stores information on 15,000 to 18,000 spinning disks attached to 600 servers in two data centers. More than 90% of the data, or over 3PB, is stored on a fiber-attached SAN, and the rest is stored on network-attached storage drives.
The Library of Congress has an "interesting model" in that part of the information stored is metadata -- or data about the data that's stored -- while the other is the actual content, says Greg Schulz, an analyst at consulting firm StorageIO. Plenty of organizations use metadata, but what makes the library unique is the sheer size of its data store and the fact that it tags absolutely everything in its collection, including vintage audio recordings, videos, photos and other media, Schulz explains.
The actual content -- which is seldom accessed -- is ideally kept offline and on tape, Schulz says, with perhaps a thumbnail or low-resolution copy on disk.
Today, the library holds around 500 million objects per database, but Youkel expects that number to grow to as many as 5 billion. To prepare, Youkel's team has started rethinking the library's namespace system. "We're looking at new file systems that can handle that many objects," he says.
Gene Ruth, a storage analyst at Gartner, says that scaling up and out correctly is critical. When a data store grows beyond 10PB, the time and expense of backing up and otherwise handling that much data go quickly skyward. One approach, he says, is to have infrastructure in a primary location that handles most of the data and another facility for secondary, long-term archival storage.
Sign up for Computerworld eNewsletters.