A current “big thing” is “big data.” Big data grew out of a real or perceived need of organizations to know more about what is happening inside the organization. Companies like Wal-Mart or Kmart keep track of every item that moves within their company, from the time it arrives at a warehouse until it leaves the store with a customer. They know exactly how many Graco SnugRide Baby Car Seats have been sold at each store, the inventory in each store and warehouse, and the number ordered from the manufacturer and on the way to one of their warehouses. They use this information to predict future sales, make sure that no store ever runs out of the item, and yet keep their inventory as low as possible.
Most companies are growing their data storage requirements by about 20% a year. At that rate, and given how quickly the cost of storage is falling, a company should have relatively flat storage costs into the future, whether it maintains its own storage farms or uses the Cloud. Some government agencies recently in the news are growing their storage requirements by 20% a month or more, which compounds to almost nine times in one year. The storage Amazon offers to its Cloud customers (AWS S3) tripled in 2011, which works out to roughly 10% growth per month.
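The monthly and yearly figures above are linked by compound growth, and the arithmetic is easy to check. The short Python sketch below is only an illustration: the function names are made up for this example, and it assumes growth compounds evenly once a month, which real storage growth rarely does.

```python
def annual_growth_factor(monthly_rate: float) -> float:
    """Return the growth multiple after 12 months of compounding."""
    return (1 + monthly_rate) ** 12


def monthly_rate_for(annual_factor: float) -> float:
    """Return the steady monthly rate that yields annual_factor in 12 months."""
    return annual_factor ** (1 / 12) - 1


# 20% a month compounds to almost nine times in a year.
print(f"20% a month for a year: {annual_growth_factor(0.20):.2f}x")

# Tripling in a year corresponds to a little under 10% a month.
print(f"Tripling in a year: {monthly_rate_for(3.0):.1%} a month")
```

Under those assumptions, 20% a month comes to about 8.9 times in a year, and tripling in a year needs about 9.6% a month, so both rounded figures in the text hold up.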
But few of these large databases are really “big data.” The term “big data” applies to collections of data that are so large they are difficult to handle: difficult to capture, validate, store, search, move, or even analyze. New software products come out periodically that make analysis faster, and some can even look for correlations between multiple sets of big data. Big data collections are exceedingly useful in science and research. For example, the Large Hadron Collider generates 40 million data points per second from each of 150 million sensors. The Sloan Digital Sky Survey started collecting data in 2000. At the rate of 200 GB per night, in a few weeks it collected more data than all of the data collected in the history of astronomy. The Large Synoptic Survey Telescope, scheduled to go active in 2016, is expected to collect the equivalent of 2.5 years of Sloan data each day.
Between the data collected by companies, research organizations and governments, 90% of all the data in the world has been generated in the last two years.
If your company is collecting, or even thinking of collecting, large amounts of data, here are a few questions to ask.
- Why are you collecting the data? What will you do with it?
- What is the value of the data? How much increased revenue will it bring in? How much will it cost to store, replicate, analyze and report?
- What is the potential cost of someone stealing some or all of the information?
- If your customers, partners and shareholders knew exactly what data you were collecting and what you were doing with it, would they be happy, resigned, or furious?
- Do you have a policy to get rid of obsolete data, and verify the destruction?
- Is any of the information linked to a specific individual?
That last question is important, because information linked to individuals can expose your organization to civil and criminal penalties and severely damage its reputation. Some additional considerations:
- What laws and certifications cover what parts of the data?
- Do you provide a way for individuals to opt out so their data is not collected?
- Do you provide a way for individuals to see and correct data about themselves?
- Do you tell them what data you are collecting?
Big data can be very valuable, but can be a big expense and a big risk. Beware of too much optimism. Even Wal-Mart gets surprised by a run on Graco SnugRide Baby Car Seats. Don’t become overly complacent that you really know what is going on. Big data can only answer the questions you ask.
The last word:
Most uses of big data are beneficial or at worst benign. However, data that provides information about individuals can be dangerous, both to the individual and to society. Utilities and other service providers are collecting more and more information about individual customers through Smart Meters, DVR devices that report what a household watches, electronic medical records, and security systems that allow you or a potential cybercriminal to monitor your house. Some email providers mine emails looking for ways to sell advertising targeted to individuals, and insurance companies are trying to monitor where and how you drive so they can set an individual rate.
All of this data is vulnerable to cybercriminals and governments. Many of the companies collecting and storing it are not very good at protecting it, and none of them are very good at destroying it when it is no longer needed.
But the biggest risk is from government. Government agencies have demonstrated their vulnerability to cybercriminals, and there has been a recent set of reported abuses of data by their own employees and contractors. The biggest problem is the government’s ability to simply say it can’t talk about it and to admit no responsibility for any outcome.
Keep your sense of humor.