“Knowledge is power only if man knows what facts not to bother with.” – Robert Staughton Lynd
Knowledge is power, but sometimes ignorance is bliss. As we become more and more interconnected through mobile devices, computers and the internet, the amount of data gathered in the world keeps increasing.
The term “Big Data” was introduced in 1997, so the concept has been around for decades. However, most people are still confused about its meaning, and chances are you will get a different definition every time you ask someone new. For the purpose of this article, we define Big Data as voluminous amounts of data (structured, semi-structured or unstructured) that render traditional data processing applications inadequate.
But Big Data is not just about large amounts of data; it’s about what businesses can do with it.
At ByteCubed, we use Big Data to expand our customers’ intelligence. By examining a broad range of data from various sources, we improve operational efficiencies and create predictive analytics that translate into successful everyday business transactions. We find and visualize our customers’ data to help them gain real-time visibility into their operations and customer experience, and that’s not always easy…
Being able to analyze this data is extremely valuable to companies: it can lead to new products and services, predict customers’ preferences, improve online interactions, and inform important business decisions. But while Big Data has big potential, it also comes with big dilemmas. “A commonly overlooked issue in Big Data systems is that they can incorporate and even reinforce discriminatory stereotypes to the detriment of both users and the effectiveness of the system itself,” says Saul Dorfman, data scientist at ByteCubed. A perfect example is Pokemon Go. The once-popular game shows how technology reflects real-world biases: players noticed that there were fewer Pokemon and PokeStops in predominantly African American neighborhoods, as pointed out by Kendra James, a writer from New Jersey.
From selecting the data set used to make forecasts to making decisions based on the results of Big Data analysis, there are a thousand ways errors and biases could be introduced. This raises a question of transparency: how can companies maximize the benefits of Big Data and limit its harms?
Data and Data Sets as inputs
Just because Big Data techniques are data-driven doesn’t mean that they are objective. As companies use Big Data to make decisions that affect our daily lives (from lending and credit, to employment opportunity, to education), it is important to make sure that they provide benefits and opportunities to consumers while protecting them from any detrimental impact. The first step in applying Big Data fairly is to ask yourself the right questions:
1) Is your data representative, complete, reliable, correct, and up to date?
Addressing missing values is very important, as it helps prevent inaccuracies or gaps in the data. If the answer to any part of the question above is no, you should first try to understand the distribution of your missing data. Why is data missing? Is there attrition due to social or natural processes? Was information left out, intentionally or unintentionally, during data collection, or is there a skip pattern? Why is the data incomplete or incorrect?
Image highlighting the need for a representative sample size.
Image courtesy of Vertical Measures.
After identifying the source of the problem, identify the patterns: how are the missing values distributed? Are certain responses more likely to be missing, and why? Then determine the missingness mechanism and consider its impact on your research or study: is the information missing completely at random (MCAR), missing at random (MAR) or missing not at random (MNAR)?
Lastly, you should decide on the best strategy to overcome the missing information challenge, whether that means using the deletion method, single imputation method, or the statistical model-based method. Making sure that your data set is as representative as possible makes a huge difference when it comes to extracting truth from it.
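The steps above can be sketched in a few lines of Python. This is a minimal illustration, not a recommended pipeline: the `records` list, the `group` and `income` fields, and the helper names are all hypothetical. It first checks whether missing values cluster in one group, then applies the deletion method and a simple single imputation (replacing missing values with the observed mean).

```python
from statistics import mean

# Hypothetical survey records; None marks a missing income value.
records = [
    {"group": "A", "income": 52_000},
    {"group": "A", "income": 48_000},
    {"group": "A", "income": 50_000},
    {"group": "B", "income": 30_000},
    {"group": "B", "income": None},
    {"group": "B", "income": None},
]

def missing_rate_by_group(rows, key, field):
    """Share of rows in each group whose `field` is missing."""
    totals, missing = {}, {}
    for row in rows:
        group = row[key]
        totals[group] = totals.get(group, 0) + 1
        missing[group] = missing.get(group, 0) + (row[field] is None)
    return {group: missing[group] / totals[group] for group in totals}

def listwise_delete(rows, field):
    """Deletion method: drop every row with a missing value."""
    return [row for row in rows if row[field] is not None]

def mean_impute(rows, field):
    """Single imputation: fill missing values with the observed mean."""
    observed_mean = mean(row[field] for row in rows if row[field] is not None)
    return [
        {**row, field: observed_mean if row[field] is None else row[field]}
        for row in rows
    ]

rates = missing_rate_by_group(records, "group", "income")
complete = listwise_delete(records, "income")
imputed = mean_impute(records, "income")

print("missing rate per group:", rates)
print("rows left after deletion:", len(complete))
print("mean income after imputation:", mean(r["income"] for r in imputed))
```

In this made-up sample, every missing value falls in group B, which suggests the data is not missing completely at random. Both deletion and mean imputation then lean on group A's values, so neither strategy restores representativeness on its own.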
2) Is your data model accounting for biases?
Whether it happens during the collection or the analytics stage of the process, it is important to avoid incorporating biases. Data is a product of human design and thus often succumbs (unintentionally) to the same human biases, resulting in bad conclusions and negative business outcomes. Let’s review six of the most common sources of bias:
- Confirmation Bias: Confirmation bias is the tendency to look for or interpret information (consciously or unconsciously) in a way that confirms one’s preexisting beliefs, which leads to errors.
- Selection Bias: Selection bias occurs when the selected data is not random or representative enough to support a conclusion.
- Outliers: Outliers are values that differ greatly from the overall pattern of data. They can sometimes skew the results of an analysis.
- Simpson’s Paradox: Simpson’s paradox occurs when a trend that appears in different groups of data viewed individually disappears or reverses when the groups are combined.
- Underfitting and Overfitting: A model is underfitting when it is unable to capture the relationship between X and Y. A model is overfitting when it is unable to generalize to unseen examples (it accounts for too much noise).
- Confounding Variables: A confounding variable is an outside variable that affects both the dependent variable and the independent variable, so that the results obtained do not properly reflect the actual relationship between the two variables studied.
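Simpson’s paradox is easier to see with numbers. The sketch below uses illustrative success counts for two hypothetical options, "A" and "B", split into "small" and "large" cases; none of these figures come from the article. It compares success rates within each subgroup and then on the combined data.

```python
# Illustrative (successes, trials) counts per option and subgroup.
results = {
    "A": {"small": (81, 87), "large": (192, 263)},
    "B": {"small": (234, 270), "large": (55, 80)},
}

def rate(successes, trials):
    """Success rate for a single subgroup."""
    return successes / trials

def overall_rate(groups):
    """Success rate after pooling all subgroups of one option."""
    successes = sum(s for s, _ in groups.values())
    trials = sum(t for _, t in groups.values())
    return successes / trials

for subgroup in ("small", "large"):
    a = rate(*results["A"][subgroup])
    b = rate(*results["B"][subgroup])
    print(f"{subgroup}: A={a:.2f} B={b:.2f}")

overall_a = overall_rate(results["A"])
overall_b = overall_rate(results["B"])
print(f"overall: A={overall_a:.2f} B={overall_b:.2f}")
```

Option A has the higher rate in both subgroups, yet option B looks better once the groups are pooled, because A was applied mostly to the harder "large" cases. Here case size acts as a confounding variable, which is why looking only at aggregated data can point to the wrong conclusion.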
Here is a very short (but funny) video that illustrates confirmation bias:
All credit goes to the original “If Google Was a Guy (Part 3)” video by CollegeHumor.
Each of these types of biases needs to be handled differently. Your data scientists should find out where the data came from, how it was gathered, and what biases might have been brought into its interpretation.
3) Did you check your algorithm?
Poorly designed algorithms may lead to discriminatory outcomes. What if a credit card company lowered a customer’s credit limit because he or she shopped at a store frequented by people with low credit scores, rather than because of that customer’s actual payment history? What if you never got to see your dream job posting because of your degree or social network?
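One simple way to start checking an algorithm is to compare how often it produces a favorable outcome for different groups. The sketch below is only an illustration under stated assumptions: the `decisions` data and group labels are hypothetical, and the 0.8 threshold borrows the "four-fifths" rule of thumb as an assumed cutoff, not something this article prescribes.

```python
from collections import Counter

# Hypothetical model decisions as (group, approved) pairs.
decisions = (
    [("group_1", True)] * 8 + [("group_1", False)] * 2
    + [("group_2", True)] * 5 + [("group_2", False)] * 5
)

def approval_rates(rows):
    """Share of favorable decisions per group."""
    totals = Counter(group for group, _ in rows)
    approved = Counter(group for group, ok in rows if ok)
    return {group: approved[group] / totals[group] for group in totals}

def impact_ratio(rates):
    """Lowest group rate divided by the highest group rate."""
    return min(rates.values()) / max(rates.values())

THRESHOLD = 0.8  # assumed rule-of-thumb cutoff, not a legal test

rates = approval_rates(decisions)
ratio = impact_ratio(rates)
flagged = ratio < THRESHOLD

print("approval rates:", rates)
print("impact ratio:", round(ratio, 3))
print("needs review:", flagged)
```

A low ratio does not prove discrimination, and a high one does not rule it out. It is a prompt to look at which inputs drive the gap, such as the store a customer shops at standing in for their actual payment history.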
Algorithms and Machine Learning
Like data, algorithms are not free of human influence. Everybody knows people have biases, so is it really surprising that the computer programs they write have them too? Here’s an example of the way algorithms can reinforce prejudice:
Bing Autofill Search Engine on July 6th, 2016
See how Bing’s search finishes the queries? While algorithms are considered to be neutral, they are designed to account for the trends and data they work with. Carelessly trusting these new systems without questioning and testing the mechanics behind them might lead to hidden bias problems that negatively affect people’s lives.
The solution to reduce bias?
Well, while bias is sometimes unavoidable, the first step to reduce it is increased awareness. The more you know about bias, the more you look out for it and try to reduce its impact on your work. Using the principle of “equal opportunity by design” and teaching your data scientists and engineers to increase inclusion by building systems that support fairness and transparency could help you get the most out of your Big Data. Always be critical; challenge your assumptions and designs. Understanding the different types of biases and their origins will take you a long way toward the right conclusions because, in the end, it is less about making sure your data has no bias and more about understanding what biases you are willing to accept.
Written by Julie Crauet