Big data is everywhere – social media, companies, public administration, and research in various fields. The prevalence of the term “Big Data” might even exceed that of actual big data. Nonetheless, the age of big data has begun. Lots and lots of data are collected on the go as we digitalize more and more parts of our lives. The capacity to store that data is expanding. And a new scientific and practical discipline is evolving.
As the idiom points out: the more trees there are, the easier it becomes to miss the forest for the trees – to miss the obvious. For several reasons, I see this as a serious threat accompanying the many advantages of big data. First, the huge amount of data requires automated processing, which necessarily creates more distance between the data analyst and the data. This magnifies the opportunities for mistakes and misinterpretations. Second, with the increasing variety of data considered (numbers, text, pictures, videos, network structures), more and more scientific fields are involved. Third, the expansion of applications confronts many people with questions and issues of data analysis for the first time.
How to navigate when entering the age of big data?
Big data might be new – data analysis, however, is not a new field. The laws of statistics apply to both small and big data. Thus, we do not need to start from scratch. In the following, I will discuss some general threats to the validity of data analysis and relate them to big data. Some of them can be avoided by using big data; others might even be magnified. The aim is to provide some basic guidance for big data analysis and for decision-making based on big data.
Which conclusions can be drawn?
The threats to this internal validity are numerous. Here are just a few examples. First and foremost, context matters. This is of particular importance when analyzing data from publicly available messages, for example on Twitter. The same word can be used in different contexts. To draw on an example from Susan Etlinger’s TED talk, “smoking” can refer to smoking cigarettes, smoking weed, smoked food or even smoking hot women. Referring back to the real-life context is one of the major challenges that come with big data.
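To make the ambiguity problem concrete, here is a minimal sketch – with made-up messages and a hypothetical `mentions_tobacco` helper – of how naive keyword matching lumps very different uses of a word together:

```python
# Hypothetical messages; none of this is real data.
messages = [
    "trying to quit smoking this year",
    "this smoked salmon recipe is amazing",
    "that concert was smoking hot",
]

def mentions_tobacco(text: str) -> bool:
    # Naive approach: any occurrence of the stem "smok" counts as tobacco-related.
    return "smok" in text.lower()

flagged = [m for m in messages if mentions_tobacco(m)]
print(len(flagged))  # prints 3: all messages match, though only the first is about tobacco
```

An automated pipeline built this way would confidently miscount – the numbers look precise, but the real-life context is lost.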
On the other hand, the problems of small data sets and very costly data collection belong to the past. In this light, it is tempting to see big data as THE solution. Of course, pre-election surveys and customer feedback surveys are easier to conduct. However, statistics remains statistics. Wrong conclusions are already drawn all too often. As Michael Seemann points out, surveys cannot forecast specific individual behavior. At most, they allow predictions, assigning probabilities to events and comparing group averages.
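A small simulation – with invented turnout probabilities, not survey data – illustrates the distinction: group averages can be estimated well, while any single individual's behavior remains a matter of probability:

```python
import random

random.seed(0)
# Hypothetical probabilities that members of two groups show a behavior.
p_a, p_b = 0.7, 0.5
group_a = [random.random() < p_a for _ in range(10_000)]
group_b = [random.random() < p_b for _ in range(10_000)]

rate_a = sum(group_a) / len(group_a)
rate_b = sum(group_b) / len(group_b)
# The group difference is recovered accurately from the sample...
print(rate_a - rate_b)
# ...yet for any single person in group A, the best statement possible is
# "acts this way with probability ~0.7", not what he or she will actually do.
```

The averages stabilize with sample size; the individual uncertainty does not shrink at all.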
Furthermore, conducting the analysis itself might influence people’s attitudes and choices. As in the case of pre-election surveys, publishing the results enters the decision-making process of voters. Finally, data size does not replace data quality. The latter should always be critically considered. Sometimes a slightly smaller data set with more complete and more reliable information might be the better choice.
Can the results be generalized?
This question of external validity is twofold: Does the conclusion hold for everyone who is of interest, e.g. all customers? And does it generalize across environments, such as countries and cultures?
A huge advantage of digitalization for data science is data collection “on the go”. A lot of data are collected without people even noticing. They do not feel observed and do not change their behavior. Even when surveys are explicit, many of them are integrated into structures, such as social media, which people would use anyway. This also makes it possible, more often, to directly observe behavior rather than relying on attitudes and answers about hypothetical behavior. The observed behavior is more natural and less distorted. Comparisons between different countries have also become a lot easier, since the digital infrastructure is used across borders. Developing countries, too, are catching up, with mobile phones becoming universal devices.
However, almost more important than who is observed is the question of who is not observed. Relying solely on digital infrastructure to collect data leaves out specific groups. For example, the criterion “being on Facebook” is likely related to age. But it could well relate to other characteristics, too, such as being concerned about privacy. When conclusions are drawn from a specific group, a careful interpretation holds only for that group. Whether the result generalizes is a question of how representative that group is for the question of interest.
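A quick simulation – with an invented population and a made-up membership rule, purely for illustration – shows how such self-selection skews estimates:

```python
import random

random.seed(1)
# Hypothetical population, ages 18-80; younger people are assumed
# more likely to be on the platform (this rule is entirely invented).
population = [random.randint(18, 80) for _ in range(50_000)]
on_platform = [
    age for age in population
    if random.random() < max(0.05, 1 - (age - 18) / 62)
]

pop_mean = sum(population) / len(population)
sample_mean = sum(on_platform) / len(on_platform)
print(sample_mean < pop_mean)  # prints True: the platform sample skews younger
```

No amount of additional platform data fixes this: the sample grows, but the groups left out stay left out.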
What would I like to know?
This is the main question, above all others. However, it is tempting to put less emphasis on it now. In the pre-big-data era, data had to be collected specifically, data storage was expensive, and capacities were restricted; there was more pressure to know in advance where an analysis would lead and which questions should be addressed. The age of big data is one of exploratory data analysis. Often data are collected first, patterns are discovered only by automated algorithms, and questions of utilization and value creation are answered last.
This freedom is a great asset. However, when setting up a data strategy, many decisions have to be made in advance. In order to make the best of the data, these decisions should be informed by setting an aim for the data analysis. Otherwise, important characteristics might be missing in the end, or specific analyses will not be possible. The more sophisticated the questions are, the more important it is to know them before collecting the data – especially if you want to go beyond observing correlations and identify causal relationships.
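As a closing illustration – with entirely made-up variables – a short simulation shows why observed correlations alone cannot establish causal relationships: a hidden common cause can make two unrelated outcomes strongly correlated:

```python
import random

random.seed(2)
n = 10_000
# Hypothetical hidden common cause (e.g. summer heat), never recorded in the data set.
heat = [random.random() for _ in range(n)]
ice_cream = [h + random.gauss(0, 0.1) for h in heat]  # driven by heat
sunburn = [h + random.gauss(0, 0.1) for h in heat]    # also driven by heat

def corr(x, y):
    # Pearson correlation, computed from scratch to keep the sketch self-contained.
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)
    sx = (sum((a - mx) ** 2 for a in x) / len(x)) ** 0.5
    sy = (sum((b - my) ** 2 for b in y) / len(y)) ** 0.5
    return cov / (sx * sy)

print(corr(ice_cream, sunburn) > 0.8)  # prints True: strongly correlated, yet neither causes the other
```

An exploratory algorithm that never recorded the confounder would happily report the ice-cream/sunburn link – one more reason to think about the question, and the variables it requires, before collecting the data.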