Big Data: The Four Vs

Special Section

Big data is a big topic with a lot of potential. Before realizing this potential, however, we need to get on the same page about what big data is, how it can be analyzed, and what we can do with it.

The term big data is somewhat misleading, because size (volume) is not the only thing that makes a data set big data. Volume is just one aspect, and it describes the sheer amount of data available. A study by Peter Lyman and Hal R. Varian of the Univ. of California, Berkeley, estimated that the amount of new data stored each year grew by about 30%/yr between 1999 and 2002, reaching 5 exabytes (5 billion gigabytes) in 2002. Ninety-two percent of the new data were stored on magnetic media, mostly on hard disks. For reference, 5 exabytes is equivalent to the data stored in 37,000 libraries the size of the Library of Congress, which houses 17 million books. And, according to IBM, the amount of data created each day is expected to grow to 43 trillion gigabytes by 2020, from about 2.3 trillion gigabytes per day in 2005. In the chemical process industries (CPI), data are coming from many sources, including employees, customers, vendors, manufacturing plants, and laboratories.

In addition to volume, big data is characterized by three other Vs — velocity, variety, and veracity. Velocity refers to the rate at which data are coming into your organization. Data are now streaming continuously into servers in real time. IBM puts this in context — the New York Stock Exchange captures 1,000 gigabytes of trade information during each trading...

