Updated: Aug 1
Researchers who study race in the social sciences are sometimes hamstrung by a lack of useful data. If a survey does not ask respondents about their race or ethnicity, for example, how can we use the survey to draw inferences relating to race? One method for overcoming this problem is to use a classification algorithm to predict individuals' race from other characteristics available in the data. I am currently working on improving the existing algorithms used for this classification task, and I wanted to detail some of my progress here.
The algorithm I'm using comes from the category of machine learning methods called Naive Bayes classifiers. If you're not familiar with the terminology, Naive Bayes is a way to classify observations from your data into groups, or "classes", based on the probability distribution of other variables for each observation. A common example of Naive Bayes is an email spam filter. Let's say you've collected a bunch of emails and estimated the probability of seeing various words in spam and non-spam emails. If you then receive an email containing the sentence "Please provide your credit card number and social security number to verify your identity", the Naive Bayes algorithm can combine the prior information you have about these word probabilities to give you a posterior probability that the email should or should not be labeled spam.
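To make the spam example concrete, here is a minimal sketch in Python. The word probabilities and the prior share of spam are invented for illustration, and the function name `spam_posterior` is hypothetical; a real filter would estimate these tables from a labeled collection of emails.

```python
import math

# Hypothetical per-word probabilities, invented for illustration only.
P_WORD_GIVEN_SPAM = {"credit": 0.20, "card": 0.15, "verify": 0.10, "meeting": 0.01}
P_WORD_GIVEN_HAM = {"credit": 0.02, "card": 0.03, "verify": 0.02, "meeting": 0.10}
P_SPAM = 0.4  # assumed prior share of spam among all emails


def spam_posterior(words, p_spam=P_SPAM):
    """Return P(spam | words), treating words as independent given the class."""
    # Work in log space so that many small probabilities do not underflow.
    log_spam = math.log(p_spam)
    log_ham = math.log(1.0 - p_spam)
    for word in words:
        # Words missing from the tables are simply skipped in this sketch.
        if word in P_WORD_GIVEN_SPAM and word in P_WORD_GIVEN_HAM:
            log_spam += math.log(P_WORD_GIVEN_SPAM[word])
            log_ham += math.log(P_WORD_GIVEN_HAM[word])
    # Normalize the two joint scores into a posterior probability.
    top = max(log_spam, log_ham)
    spam_score = math.exp(log_spam - top)
    ham_score = math.exp(log_ham - top)
    return spam_score / (spam_score + ham_score)


email = "please provide your credit card number to verify your identity".split()
posterior = spam_posterior(email)
print(f"P(spam | email) = {posterior:.3f}")
```

With these made-up numbers, the words "credit", "card", and "verify" are each several times more likely in spam than in non-spam, so the posterior moves well above the prior even though no single word is decisive.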
Rather than predicting whether an email is spam or not, my goal is to predict which of the following six racial or ethnic groups an individual is most likely to identify with: Black, White, Hispanic, Asian and Pacific Islander, American Indian and Alaska Native, or Mixed-race. Note: these are the primary categories used by the United States Census, which I use as a source for much of my data on prior probabilities. There are certainly many more racial groups in the US than these, and there is important overlap among certain groups (particularly for those who identify as Hispanic, many of whom also identify as Black, White, or American Indian). These are considerations which I hope to incorporate into later versions of this project.
Statewide voter files provide the perfect setting to test the prediction algorithm. These are data sets which contain the voting histories for every registered voter in a state, as well as personal information such as first name, last name, address, age, gender, and party affiliation. They don't, however, typically record an individual's race. So if you are interested in whether certain policies, such as voter ID laws, have a disparate impact on voter turnout for different racial groups (as I am), then you will have to impute the missing race variable somehow.
The Naive Bayes algorithm I use builds on earlier work from Imai and Khanna (2016) and Voicu (2018). I incorporate the prior probability distributions of the six race categories over first names, last names, Census blocks (geographic areas containing around 100 households), political parties, age, gender, and occupancy in multi-unit housing to calculate the posterior probability for each race for each individual in the voter files. Whichever category ends up with the highest posterior probability is selected as the final "classification". The main improvement of my method over this earlier work comes from incorporating more predictor variables. Imai and Khanna only use last names, Census blocks, and political parties, and Voicu only uses first names, last names, and Census blocks. Below is the formula I use for computing the posterior probability for a given race:
Here the left-hand side p(r | s, f, g, p, a) is read as "the probability of race r given surname s, first name f, geolocation (Census block) g, party p, and apartment occupancy a (whether or not someone lives in multi-unit housing)." I omitted the age and gender variables both to save space and because they provide very little predictive power on top of these five. The output of the formula above is a number between 0 and 1 that tells us the probability an individual with those characteristics identifies with a particular race. Despite all the notation, there is nothing very fancy or clever about this equation. It is simply an extension of Bayes's formula that you may have encountered in a statistics course. The technical difference between this and Bayes's formula is that I am assuming independence among the predictor variables, conditional on race, to aid the calculation (i.e., within a given racial group, someone's first name should not provide information about which political party they belong to).
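For readers who prefer code to notation, here is a minimal sketch of this kind of calculation in Python. It assumes the generic Naive Bayes form, in which the posterior for race r is proportional to p(r) multiplied by p(x | r) for each predictor x; my actual formula may arrange the terms differently. Every number below is an invented placeholder rather than a Census figure, and the names `RACES` and `posterior_over_races` are hypothetical.

```python
RACES = ["Black", "White", "Hispanic", "Asian/PI", "AI/AN", "Mixed"]


def posterior_over_races(prior, likelihoods):
    """Combine a prior p(r) with per-predictor tables p(x | r).

    prior:       dict mapping race -> p(r)
    likelihoods: list of dicts, each mapping race -> p(observed value | r)
    Returns a dict mapping race -> posterior probability (sums to 1).
    """
    scores = {}
    for race in RACES:
        score = prior[race]
        # Conditional independence given race: multiply the likelihoods.
        for table in likelihoods:
            score *= table[race]
        scores[race] = score
    total = sum(scores.values())
    return {race: score / total for race, score in scores.items()}


# Invented placeholder values for one hypothetical voter.
prior = {"Black": 0.12, "White": 0.62, "Hispanic": 0.17,
         "Asian/PI": 0.05, "AI/AN": 0.01, "Mixed": 0.03}
surname_given_race = {"Black": 0.004, "White": 0.001, "Hispanic": 0.0002,
                      "Asian/PI": 0.0001, "AI/AN": 0.0005, "Mixed": 0.001}
block_given_race = {"Black": 0.003, "White": 0.0005, "Hispanic": 0.001,
                    "Asian/PI": 0.0005, "AI/AN": 0.0005, "Mixed": 0.001}
apartment_given_race = {"Black": 0.40, "White": 0.20, "Hispanic": 0.35,
                        "Asian/PI": 0.30, "AI/AN": 0.25, "Mixed": 0.30}

posterior = posterior_over_races(
    prior, [surname_given_race, block_given_race, apartment_given_race]
)
# The final classification is the race with the highest posterior probability.
predicted = max(posterior, key=posterior.get)
print(predicted, round(posterior[predicted], 3))
```

Adding a predictor in this setup just means appending one more table to the list, which is why extending the earlier methods with more variables is mostly a matter of finding good data for each table.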
The tricky part is thinking about which variables to use in the state voter files, and then finding the prior probability distributions for these among racial groups in the United States. A lot of these distributions come from the Census, which provides racial distribution data on last names, age, gender, location, and apartment occupancy. The racial distribution among first names comes from a nationwide data set on mortgage applications, and the party distribution is based on a 2012 Pew survey.
One unsavory aspect of this project is that it involves taking advantage of some of the terrible history of racial discrimination in this country. Maybe not so much for the first name, surname, or political party variables, but the fact that geographic location and multi-unit housing occupancy provide such useful predictive power regarding an individual's race is disturbing. The legacy of systematic and widespread housing discrimination, such as redlining, is why so many Census blocks today are racially homogeneous, and why the proportion of African Americans who live in apartments is twice as high as the proportion of whites. I hope that in the future these variables will have less predictive power for race.
To test the predictive performance of my algorithm, I use two states, North Carolina and Florida, whose voter files do provide race or ethnicity information for their registered voters. This gives me the opportunity to calculate the predicted race for all 12 million individuals in these data sets and compare my predictions against the true racial categories individuals selected. There are dozens of different metrics for measuring predictive performance. Below I show my method compared to Imai and Khanna's on the overall error rate, as well as the false negative and false positive rates for each racial group.
The overall error rate is the proportion of incorrectly classified individuals over the entire data set. It can be a little misleading, however, if your data is highly imbalanced towards one group. For example, if I wrote an algorithm that simply classified everyone in Vermont as white, I would have an overall error rate of only 3.8 percent because the state is very white to begin with! The false negative and false positive rates give a better sense of how well the algorithm performs among particular groups. An example of a false negative is classifying someone as a race other than Black when they are in fact Black. And an example of a false positive is classifying someone as Black who is in fact not Black. Ideally we would like to minimize both the false negative and false positive rates, but there is an inherent trade-off between the two. You can imagine that if we classified everyone as Black, our false negative rate for Black people would disappear, but that would drastically increase our false positive rate for Black people! As you can see, however, adding more prior information into the algorithm like I have provides better prediction amongst almost all groups.
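The three metrics above are simple to compute once you have true and predicted labels side by side. Here is a minimal sketch in Python, using a tiny made-up list of six voters rather than real voter-file data; the function name `error_rates` is hypothetical.

```python
def error_rates(true_labels, predicted_labels, group):
    """Return (overall error rate, false negative rate, false positive rate).

    The false negative and false positive rates are computed for `group`.
    This sketch assumes the data contain at least one person inside the
    group and at least one person outside it.
    """
    pairs = list(zip(true_labels, predicted_labels))
    overall_error = sum(t != p for t, p in pairs) / len(pairs)
    # Predictions for people who truly belong to the group.
    in_group = [p for t, p in pairs if t == group]
    # Predictions for people who truly do not belong to the group.
    out_group = [p for t, p in pairs if t != group]
    false_negative_rate = sum(p != group for p in in_group) / len(in_group)
    false_positive_rate = sum(p == group for p in out_group) / len(out_group)
    return overall_error, false_negative_rate, false_positive_rate


# Made-up labels for six hypothetical voters.
true_labels = ["Black", "Black", "White", "White", "White", "Hispanic"]
some_mistakes = ["Black", "White", "White", "White", "Black", "Hispanic"]
everyone_black = ["Black"] * len(true_labels)

mixed_result = error_rates(true_labels, some_mistakes, "Black")
degenerate_result = error_rates(true_labels, everyone_black, "Black")
print(mixed_result)
print(degenerate_result)
```

The second call illustrates the trade-off: labeling everyone as Black drives the false negative rate for Black voters to zero, but every non-Black voter becomes a false positive.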
Thanks for reading! There is still a lot more work to do on this, and I am in the process of uploading my code to GitHub if you want to check out how the implementation works under the hood. Eventually I will apply the algorithm to as many state voter files as I can get my hands on and finally be able to do some substantive analysis of the impact of various policies on voter turnout among different races.