To do science, a scientist must collect information about the phenomena under study. Before computers, data collection was laborious, and scientists often had too little information about a phenomenon to conduct a sound analysis. Data collection frequently consumed far more time and energy than the analysis itself. With the advent of computers and automated data collection, the situation has reversed: in many instances, collecting the data takes much less time than analyzing it. “Big data” is the catchphrase coined to describe this situation.
So what is big data? The answer, like the answer to so many other questions in science, is: it depends. In general, big data is an amount of data that strains the processing capacity and equipment available to a researcher. The problem arises because data can be collected far faster than it can be processed, and much of what is collected is unstructured. It is the scientist’s job, with the aid of data processing equipment, to sort through all of the unstructured data and identify the subset germane to the problem under consideration. When that process consumes inordinate amounts of time, money, and other resources, big data has become a big problem.
For example, the Sloan Digital Sky Survey, an effort to map the observable universe, collected more data in its first few weeks of operation than had ever been collected in astronomy before. It has produced a digital image of the sky composed of more than one trillion pixels and containing over 500 million objects. New telescopes that will collect even more data are already in the planning stages.
Another example of a project that produces huge amounts of data is the Large Hadron Collider (LHC). Scientists hope that data produced by this massive particle accelerator will answer many fundamental questions in physics. The LHC produces over 600 million collisions per second during runs lasting as long as 10 hours, yet only about 100 collisions each second are of interest to scientists. The collider is expected to produce about 15 petabytes of data per year.
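The scale of that selection problem can be made concrete with a back-of-envelope calculation using only the figures quoted above; this is an illustrative sketch, not an account of how the LHC's actual trigger systems work.

```python
# Back-of-envelope estimate of LHC data selection,
# using the figures quoted in the text (illustrative only).

collisions_per_second = 600_000_000   # total collision rate
interesting_per_second = 100          # collisions worth keeping
run_hours = 10                        # length of a long run

# Fraction of collisions that survive selection
selection_fraction = interesting_per_second / collisions_per_second

# Total collisions produced over one 10-hour run
collisions_per_run = collisions_per_second * run_hours * 3600

print(f"Fraction of collisions kept: {selection_fraction:.2e}")
print(f"Collisions in one 10-hour run: {collisions_per_run:.2e}")
```

Roughly one collision in six million survives, which is why the filtering itself, not the detection, dominates the computing effort.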
So why collect so much data? For some questions, it is necessary for a scientifically sound evaluation of a problem. For example, the U.S. Environmental Protection Agency (EPA) is tasked with monitoring potentially harmful environmental chemicals. In the U.S., there are more than 85,000 commercial chemicals; of these, over 2,500 are produced in quantities exceeding one million pounds per year, and each year about 2,000 more are introduced. It is humanly impossible to test all of these chemicals rigorously for every possible adverse health effect. Prohibiting or restricting the production of any chemical has economic consequences for all who produce and use the substance, so it is unwise to do so based on limited information. To approach this problem, the Tox21 program was established; it will use robotic technology to test 10,000 environmental chemicals in numerous biochemical assays, assessing their potential to disrupt biochemical reactions necessary to life. As in the examples given above, only a fraction of the data generated by this endeavor will be relevant, and it is the task of the scientists involved to use modern bioinformatics technology to separate the wheat from the chaff. The goal of the project is to identify which substances may be hazardous and then subject those to more intense scrutiny.
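A quick calculation suggests the scale of this screening effort. The text does not say how many assays are run per chemical, so the assay count below is a hypothetical placeholder chosen only to illustrate how fast the measurements multiply.

```python
# Rough scale of the Tox21 screening effort, using figures from
# the text plus an ASSUMED assay count (hypothetical; the actual
# number of assays is not given here).

total_commercial_chemicals = 85_000
chemicals_screened = 10_000
assumed_assays = 100   # hypothetical placeholder

coverage = chemicals_screened / total_commercial_chemicals
data_points = chemicals_screened * assumed_assays

print(f"Share of commercial chemicals screened: {coverage:.0%}")
print(f"Assay measurements to sift through: {data_points:,}")
```

Even this partial screen, covering roughly one chemical in eight, yields on the order of a million measurements, most of which will prove irrelevant, which is exactly the wheat-from-chaff problem described above.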
As our technology evolves, our capacity for data collection will only increase, so big data will continue to present big challenges. Those who rise to meet these challenges will become the leaders in this new and exciting field.