• Bertrand

Updated: Aug 1

Researchers who study race in the social sciences are sometimes hamstrung by a lack of useful data. If a survey does not ask respondents about their race or ethnicity, for example, how can we make use of the survey to draw inferences relating to race? One method for overcoming this problem is to use a classification algorithm to predict individuals' race from other available characteristics found in the data. I am currently working on improving the existing algorithms used to accomplish this classification problem—and I wanted to detail some of my current progress here.

The algorithm I'm using comes from the category of machine learning methods called Naive Bayes classifiers. If you're not familiar with the terminology, Naive Bayes is a way to classify observations from your data into groups, or "classes", based on the probability distribution of other variables for each observation. A common example of Naive Bayes is an email spam filter. Let's say you've collected a bunch of emails and counted the probability of seeing various words in a spam email or non-spam email. If you then receive an email containing the sentence "Please provide your credit card number and social security number to verify your identity", the Naive Bayes algorithm can incorporate the prior information you have about these word probabilities to give you a posterior probability that the email should or should not be labeled spam.
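To make the spam example concrete, here is a minimal sketch of that calculation in Python. The word probabilities and prior base rates below are made up purely for illustration, not taken from real email data:

```python
# Toy Naive Bayes spam filter. All probabilities here are invented
# for illustration; a real filter would estimate them from a corpus.
word_probs = {
    "spam":     {"credit": 0.20, "card": 0.15, "verify": 0.10, "hello": 0.01},
    "not_spam": {"credit": 0.02, "card": 0.02, "verify": 0.01, "hello": 0.10},
}
priors = {"spam": 0.4, "not_spam": 0.6}  # assumed base rates of each class

def classify(words):
    # Multiply the prior by p(word | class) for each word -- the "naive"
    # independence assumption -- then normalize to get posteriors.
    scores = {}
    for cls in priors:
        score = priors[cls]
        for w in words:
            score *= word_probs[cls].get(w, 0.05)  # crude fallback for unseen words
        scores[cls] = score
    total = sum(scores.values())
    return {cls: s / total for cls, s in scores.items()}

posterior = classify(["credit", "card", "verify"])
print(posterior)  # the "spam" posterior should dominate
```

The same machinery carries over directly once "words" become names, locations, and so on, and "spam / not spam" becomes a set of racial categories.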

Rather than predicting whether an email is spam or not, my goal is to predict which of the following six racial or ethnic groups an individual is more likely to identify with: Black, White, Hispanic, Asian and Pacific Islander, American Indian and Alaska Native, or Mixed-race. Note: these are the primary categories used by the United States Census, which I use as a source for much of my data on prior probabilities. There are certainly many more racial groups in the US than these, and there is important overlap among certain groups (particularly for those who identify as Hispanic—many of whom also identify as Black, White, or American Indian). These are considerations I hope to incorporate into later versions of this project.

Statewide voter files provide the perfect setting to test the prediction algorithm. These are data sets which contain the voting histories for every registered voter in a state, as well as personal information such as first name, last name, address, age, gender, and party affiliation. They don't, however, typically ask for an individual's race. So if you are interested in whether certain policies, such as voter ID laws, have a disparate impact on voter turnout for different racial groups (as I am), then you will have to impute the missing race variable somehow.

The Naive Bayes algorithm I use builds off earlier work from Imai and Khanna (2016) and Voicu (2018). I incorporate the prior probability distributions of the six race categories over first names, last names, Census blocks (geographic areas containing around 100 households), political parties, age, gender, and occupancy in multi-unit housing to calculate the posterior probability for each race for each individual in the voter files. Whichever category ends up with the highest posterior probability is selected as the final "classification". The main improvement my method has over these earlier works comes from incorporating more predictor variables. Imai and Khanna only use last names, Census blocks, and political parties, and Voicu only uses first names, last names, and Census blocks. Below is the formula I use for computing the posterior probability for a given race:
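Written out with Bayes's rule and the independence assumption across the five predictors described below (this is my reconstruction from that description; the original image did not carry over), the posterior takes the standard Naive Bayes form:

```latex
p(r \mid s, f, g, p, a) =
  \frac{p(s \mid r)\, p(f \mid r)\, p(g \mid r)\, p(p \mid r)\, p(a \mid r)\, p(r)}
       {\sum_{r'} p(s \mid r')\, p(f \mid r')\, p(g \mid r')\, p(p \mid r')\, p(a \mid r')\, p(r')}
```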

Here the left-hand side p(r | s, f, g, p, a) is read as "the probability of race r given surname s, first name f, geolocation (Census block) g, party p, and apartment occupancy a (whether or not someone lives in multi-unit housing)." I omitted the age and gender variables both to save space and because they provide very little predictive power on top of these five. The output of the formula above is a number between 0 and 1 that tells us the probability an individual with those characteristics identifies with a particular race. Despite all the notation, there is nothing very fancy or clever about this equation. It is simply an extension of Bayes's formula that you may have encountered in a statistics course. The technical difference between this and Bayes's formula is that I am assuming independence among the predictor variables to aid the calculation (i.e. someone's first name should not provide information about which political party they belong to).

The tricky part is thinking about which variables to use in the state voter files, and then finding the prior probability distributions for these among racial groups in the United States. Many of these distributions come from the Census, which provides racial distribution data on last names, age, gender, location, and apartment occupancy. The racial distribution among first names comes from a nationwide data set on mortgage applications, and the party distribution is based on a 2012 Pew survey.

One unsavory aspect of this project is that it involves taking advantage of some of the terrible history of racial discrimination in this country. Maybe not so much for the first name, surname, or political party variables, but the fact that geographic location and multi-unit housing occupancy provide such useful predictive power regarding an individual's race is disturbing. The legacy of systematic and widespread housing discrimination, such as redlining, is why so many Census blocks today are racially homogenous, and why the proportion of African Americans who live in apartments is twice as high as the proportion of whites. I hope that in the future these variables will have less predictive power for race.

To test the predictive performance of my algorithm, I use two states, North Carolina and Florida, whose voter files do provide race or ethnicity information for their registered voters. This gives me the opportunity to calculate the predicted race for all 12 million individuals in these data sets and compare my predictions against the true racial categories individuals selected. There are dozens of different metrics for measuring predictive performance. Below I show my method compared to Imai and Khanna on the overall error rate, as well as the false negative and false positive rates for each racial group.

The overall error rate is the proportion of incorrectly classified individuals over the entire data set. It can be a little misleading, however, if your data is highly imbalanced towards one group. For example, if I wrote an algorithm that simply classified everyone in Vermont as white, I would have an overall error rate of only 3.8 percent because the state is very white to begin with! The false negative and false positive rates give a better sense of how well the algorithm performs among particular groups. An example of a false negative is classifying someone as a race other than Black when they are in fact Black. And an example of a false positive is classifying someone as Black who is in fact not Black. Ideally we would like to minimize both the false negative and false positive rates, but there is an inherent trade-off between the two. You can imagine that if we classified everyone as Black, our false negative rate for Black people would disappear, but that would drastically increase our false positive rate! As you can see, however, adding more prior information into the algorithm like I have provides better prediction among almost all groups.
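These definitions can be sketched directly in code. The labels below are made up purely to illustrate the formulas, not drawn from the actual voter files:

```python
# Computing the overall error rate, plus the false negative and false
# positive rates for one group ("Black"), from predicted vs. true labels.
# These six labels are invented just to demonstrate the arithmetic.
true = ["White", "Black", "Black", "Hispanic", "White", "Black"]
pred = ["White", "Black", "White", "Hispanic", "Black", "Black"]

# Overall error rate: share of all individuals classified incorrectly.
errors = sum(t != p for t, p in zip(true, pred))
error_rate = errors / len(true)

def fn_fp_rates(group, true, pred):
    # False negative rate: of those truly in the group, share predicted out.
    # False positive rate: of those truly not in the group, share predicted in.
    fn = sum(t == group and p != group for t, p in zip(true, pred))
    fp = sum(t != group and p == group for t, p in zip(true, pred))
    n_in = sum(t == group for t in true)
    n_out = len(true) - n_in
    return fn / n_in, fp / n_out

fnr, fpr = fn_fp_rates("Black", true, pred)
```

With these six toy labels, two of six predictions are wrong, one truly Black person is missed, and one non-Black person is labeled Black, so all three rates come out to 1/3.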

Thanks for reading! There is still a lot more work to do on this and I am in the process of uploading my code on GitHub if you want to check out how the implementation works under the hood. Eventually I will apply the algorithm to as many state voter files as I can get my hands on and finally be able to do some substantive analysis on the impact of various policies on voter turnout among different races.

  • Bertrand

The Route

In 1983 my mom rode her bike from England to Istanbul. To do this she had to cross the Iron Curtain and cycle into Eastern Bloc countries such as Czechoslovakia, Hungary, Yugoslavia, and Bulgaria. I'm not as brave or adventurous as her, but I wanted to do something similar during my last summer of freedom before beginning my PhD program. Paris was chosen as the starting and ending point because it was the only airport where I could find direct flights with ample leg room. When you're 6'9" (206 cm to the friends I met on the trip), leg room on a ten hour flight is a matter of life and death. If only I had planned the start of my tour with the same attention to detail I gave to buying my tickets. I had a vague idea that once I landed in Paris I would ride to France's northern coast and then head east to eventually reach Scandinavia. From there I thought I might wander up north to the Arctic Circle in Norway and catch a glimpse of the midnight sun. As you can see from the map above, it didn't quite work out that way.

The black pin on the map is Paris and my journey started on the teal line heading west (teal for cycling, red for trains/ferries). Due to a combination of disorientation and jet lag, I ended up sleeping on a park bench on my first night in France. Things didn't improve much from there. After waking at 4:00 am, I tried to use the river Seine as my guide to reach the coast. Obviously a river flows to the ocean, right? So all I have to do is keep it on my right-hand side and it'll take me straight to where I want to go! Well, it turns out that Paris has two rivers running through it (???). This resulted in me riding around in circles all morning. Near the park where I slept I noticed a statue of Gandhi in a town square. Two hours later I came upon a statue of Gandhi again and thought to myself, "Wow, the French must really love this guy." Turns out it was the same exact statue, two hours of riding later.

Clearly I needed to sort out my navigation issues. But how could I do this without cellular-data-fueled Google Maps to guide me?

Relaxing on the Seine

Phase 1:

McDonald's Hopping

All McDonald's in the world (*except Germany) have WiFi! So if I could find one of those, I could pre-load routes onto my phone and use its GPS to make sure I was following them! McDonald's also conveniently solved some of the language anxiety I was feeling at the time. Ordering food in a foreign language was daunting for me throughout the trip, but I had seen Pulp Fiction enough times to know that the magic words at any French McDonald's were "Royale with Cheese". In retrospect I ate way too much McDonald's food when I should have been enjoying the local food...

It probably says something about my addiction to the internet that McDonald's became my oases. They were social media sanctuaries in what was otherwise an Airplane-mode desert. Even upon returning to the US, whenever I notice a McDonald's along the road my brain gets a little tingle of endorphin-fueled anticipation. But I don't expect I'll ever eat at one again.

From one McDonald's to the next I leapfrogged my way around France and into Belgium. The bike infrastructure improved dramatically once I crossed the Belgian border and it was a real joy to ride on.

The French countryside in Normandy

Dieppe, on the English Channel

Into Belgium, land of waffles

Fairy tale town of Bruges

On my way out of Belgium

Phase 2:

Dutch Daydream or Netherlands Nightmare?

On the sixth day of the journey I rode into the Netherlands, aka the Mecca of cycling. What could be better than a country that is completely flat, picturesque, and carpeted in protected bike lanes? In fact, it appears to be possible to ride from one end of the Netherlands to the other without ever leaving a bike path or bike lane. What should have been a few days of easy, stress-free riding, however, turned into one of the most harrowing experiences of my whole trip.

My mistake was in not planning a sleeping spot for my first day in Holland. I had already wild camped once in France, so I figured I could find similar spots here. Or, failing that, I would come upon a cheap hotel or hostel to stay at on my way north from the southern border of Holland to Amsterdam. I cycled for hours and hours but could not find any accommodations that met either criterion. The countryside was too developed with farmland and suburbs to wild camp (the ideal locations for setting up an illegal tent are secluded wooded areas), and every hotel I passed in southern Holland was far too bougie for a disheveled and sweat-drenched cyclist like myself. At about 11:00 PM I gave up and tried to rest on a bench beside the Rotterdam harbor for a few hours. In keeping with my bad luck from earlier, it began to pour at 3:00 AM and I had no choice but to pack up and continue riding in the dark and in the rain.

Without my carefully planned out McDonald's network, I cycled in a general northward direction—towards what I thought was the city glow of Amsterdam far off in the distance. When I found out that the glow was actually from the very bright lights of an artificially-lit greenhouse, I nearly lost all hope. Somehow I meandered into Amsterdam at around 4:00 PM and immediately checked into a cheap hotel to escape the persistent rain. The upside to being confined inside my room was that it gave me plenty of time to plan out my next several nights of accommodation—as well as expand my crucial McDonald's map network.

Cycling towards the glow

And this is what I found

When it stopped raining for 5 minutes in Amsterdam

Cycling north out of Amsterdam

A campground in northern Netherlands

Phase 3:

Do I Even Like This?

I spent the next week or so riding through Germany, then taking a ferry to Denmark, riding up the coast to Copenhagen, taking another ferry across to Sweden, and then riding up the western coast of Sweden on my way to Gothenburg. At this point in the trip I was finally settling into the routine. McDonald's in Germany don't provide WiFi for some reason, so I had to adapt and plan my routes out further than normal. I also became better at wild camping and found great spots to sleep in Germany and Sweden. The actual cycling became easier as well, as my physical fitness improved. My longest day in terms of mileage (130 miles in about 13 hours of riding) was during my second day in Germany when the weather finally became pleasant again.

While I enjoyed certain aspects of the trip—for example, Copenhagen was a beautiful city and a wondrous display of proper urban biking infrastructure—the interminability of the trip began to weigh on me. At every prior stage of the trip I had thought to myself, "I might not be loving this at the moment, but as soon as I get to X, everything is going to be awesome!" (Replace X with "the French coast", "the great bike paths in the Netherlands/Denmark", or "the rugged natural beauty of Scandinavia in Sweden".) About halfway up the coast of Sweden it finally hit me that there wasn't going to be a dramatic improvement anywhere I went, and that this was basically it in terms of the day-to-day experience of cycling across Europe. This depressing realization was enough to get me to hop off my bike and grab the next train north to Gothenburg (see the red section in Sweden on the map above).

It was only towards the end of my whole trip that I realized the error in my "it'll get better once I reach X" mentality. If, instead of killing myself cycling 100+ mile days in the hopes of reaching a particular country or region faster, I had taken my time and simply enjoyed wherever I might be, the first few weeks of the trip would have been much more enjoyable. My main piece of advice to anyone thinking of doing a similar long-distance bike tour would be to forcibly impose leisure upon yourself. Stop at a proper campground early in the afternoon and enjoy wandering around the nearby town. Don't do what I did and cycle all day until just before dark (usually around 9:30 or 10:00 PM) and then desperately search around for wild camping locations.

Wild Camping in Germany

Taking a ferry across the Elbe

Cycling north in Denmark out of Copenhagen

The rocky western coast of Sweden

Phase 5:

Troll Country

Gothenburg was a major turning point in the trip. The two days I spent there reinvigorated my excitement for the trip and raised my spirits considerably. This was probably due to a few reasons. First, the fact that Sweden and Norway permitted wild camping basically anywhere that wasn't explicitly private land took a ton of stress out of each day. Because these countries were so wooded, I knew I'd always be able to find a great place to sleep every night. Second, the natural splendor of the Scandinavian scenery kept increasing as I rode north. I realized that this was the sort of outdoors I really enjoyed—not the bucolic fields of France and Germany. And thirdly, I began listening to the fantastic audio book version of The Lord of the Rings during this time. The rugged scenery matched the events in the book perfectly and I began to strongly identify with the hobbits Frodo and Sam. Like them, I was lost and out of my element, and all I wanted was to be at home with a proper warm-cooked meal.

I found Oslo to be a miserable city (poor biking infrastructure, super crowded, and 40% of the downtown was under construction at the time I was there), but that was only a minor hiccup in my growing contentment while cycling through Norway. Mile after mile of riding alongside fjords, with steep green slopes to my right and crystal blue water to my left, made for a perfect journey. Even when I finally had to climb up and out of the valleys (covering 3000 feet of elevation gain in about 10 miles) I had nothing to complain about. But the further north I traveled, the more I had to ride alongside busy highways packed with holiday-makers in their RVs. Scandinavian drivers are usually extremely courteous towards cyclists, but by the time I reached Trondheim I had reached my limit of cars zooming past inches from me.

Trondheim would prove to be the northernmost point I'd reach on the trip (217 miles south of the Arctic Circle). I decided to turn towards the southeast at this point because it looked like I would be cycling alongside busy highways if I wanted to go any further north in Norway. One downside of the beautiful Norwegian fjords and mountains is that all vehicle traffic gets concentrated onto just a few roads. So I took a train just over the border back to Sweden and planned to continue my trip south to Stockholm. This phase of the trip in Sweden turned out to be a lot better than the last time I'd been there. As I coasted down the mountains on my way towards the Baltic Sea, my spirits were considerably higher than those of my friends on their way to Mordor.

The authentic IKEA

Sleeping on a moss slope in Sweden

The fjord on the Swedish/Norwegian border

Overlooking a fjord

The beautiful Norwegian highlands


Cycling through the Swedish highlands near Storlien

A lovely place for a picnic

Back to civilization in Stockholm

Phase 6:

Cruising Through the Baltics

I had been cycling for over a month by the time I left Stockholm on August 7th. You'd think I would've lost some weight during that time, but if I did, it wasn't noticeable. Evidently it is possible to offset the burning of 5000+ calories a day if you subsist mostly on croissants, donuts, and candy bars. I also consumed about a week's worth of food on the overnight ferry I took from Stockholm to Tallinn. Having six plates at a buffet makes the 30 euro price worth it!

My route through the Baltics took me south from Tallinn, in the north of Estonia, through Riga in Latvia, and then to Vilnius in Lithuania. Despite noticeably worse cycling infrastructure in these countries (what would have been smooth paved farm roads in Western Europe were generally bumpy gravel here in the East), my week in the Baltics was great. Coming from Scandinavia, whose countries have some of the highest standards of living in the world, Eastern Europe was incredibly cheap. Food seemed to cost about 60% of the Scandinavian prices and I could find beds in hostels for under $10 a night. I want to give a particular shout-out to Vilnius. This beautiful city in a remote corner of Europe was probably my favorite place to stay on the whole trip. You saw old Soviet architecture mixed with modern sleek high-rises, all surrounding a gorgeous medieval downtown. After Vilnius I cycled another two days over the Polish border and officially ended my cycling journey in the town of Augustow.

My quarters on the ferry across the Baltic Sea

The park and art museum in Tallinn

Tallinn, like other Eastern European cities, has a great juxtaposition of the new and the old

The Gulf of Riga

Crossing from Latvia to Lithuania

Vilnius at night

The final terminus

The Last Phase:

Coming Home

From Augustow I took trains back westward, stopping for a night each in Warsaw, Berlin, and Cologne along the way. Then I completed one last mini tour on my bike, from Cologne to Brussels via Maastricht in the Netherlands. The flat, paved bike paths of Western Europe that I had grown so bored with during the first half of my trip were a welcome joy after dealing with the Scandinavian and Baltic wilderness for the past three weeks. I finally felt like this might be something I would want to do again.

If I were to take another bike tour, I'd do several things differently. The first would be riding a more heavy-duty bike with wider tires. Unless you only plan to ride around the Netherlands and Belgium, you're going to have to go off-road at some point on your trip. There were extended periods of riding on dirt roads in Scandinavia and Eastern Europe, and wider tires would have even helped with the annoying cobblestones they have everywhere in France and Germany. Next, I would absolutely get a SIM card and European data plan. The reliance on intermittent WiFi for my navigation led to so much stress and frustration during the early days of the trip. As we have seen, relying on McDonald's is tempting fate. Lastly, like I mentioned earlier, if I were to go on a long bike trip again I would pace myself much more conservatively. It sounds obvious in retrospect, but the whole point of traveling is to enjoy your surroundings. To do this you have to stop pedaling every once in a while.

The site of the 1992 Maastricht Treaty, which created the modern European Union

Good luck Ursula von der Leyen!

Back in Paris after 7 weeks

Packing up at the Paris airport

  • Bertrand

Updated: May 28

Board gaming can be an expensive hobby. I own over 130 games, and when you consider that each cost somewhere between $30 and $60, that's a lot of money spent! Therefore I try to put a lot of time and effort into researching specific games before I buy them. I read reviews and watch YouTube videos about the game, as well as ask friends whether they have heard of or played it. So far this strategy has reduced the number of forgotten board games in my collection: ones I excitedly bought, unwrapped, and read the rule book for, then set on my bookshelf to languish, unplayed, forever.

But for one category of board games it is impossible to collect this kind of information before making a purchase. These are the games funded through the website Kickstarter. With Kickstarter the idea is that you're pre-ordering a game before the game has even been made. So you have to put your money down before there are any reviews or word-of-mouth to go on! I've only backed a couple of games on Kickstarter—expansions to games I already own—but I wanted to see if I could use machine learning to make predictions about how good a new Kickstarter game would likely be.

To start with, I would need lots of data on existing board games. I decided to collect this by web-scraping the website BoardGameGeek using Python. BoardGameGeek contains information for pretty much every game ever made and allows users to rate each game on a scale from 1 to 10. This rating would be my target attribute (or dependent variable) on which I would test the predictive ability of various other features of a given game. Below are some screenshots showing the data I collected from BoardGameGeek's database of games (the extracted information is highlighted in red).

BoardGameGeek has close to 105,000 games in its database, but I decided to stop after collecting data on only 2,500 for a few reasons. First, a few thousand games deep (starting from the most highly rated), entries began to lack key pieces of information. These are games very few people have ever played or even heard of, so I wanted my sample to contain only relatively popular games. Second, web-scraping takes a ton of time. In order not to overload BoardGameGeek's servers, I set a timer to wait five seconds between each page request while scraping. This meant it took over seven hours to gather data on just 2,500 games!
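The two pieces of that scraper (paced requests, then pulling a field out of the page) can be sketched like this. The HTML fragment and class names here are hypothetical, not BoardGameGeek's actual markup, and `polite_fetch` only illustrates the five-second pacing idea:

```python
import re
import time

# Hypothetical page fragment -- the real BoardGameGeek markup differs;
# this only illustrates the extract-a-field step of the scraper.
page = '<div class="rating-overall">7.84</div><div class="num-voters">12034</div>'

def extract_rating(html):
    # Pull the average rating out of the (hypothetical) rating div.
    match = re.search(r'class="rating-overall">([\d.]+)<', html)
    return float(match.group(1)) if match else None

def polite_fetch(urls, fetch, delay=5):
    # Wait `delay` seconds between requests so as not to hammer the
    # server; `fetch` would wrap e.g. urllib.request.urlopen in practice.
    pages = []
    for url in urls:
        pages.append(fetch(url))
        time.sleep(delay)
    return pages

print(extract_rating(page))  # 7.84
```

In a real scraper you'd use a proper HTML parser rather than a regex, but the pacing-plus-extraction loop is the same shape.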

It was now time to start analyzing. After converting all my categorical features into dummies (e.g. creating a new feature called "Economic" for which each game had a value of 0 or 1 depending on whether it fit into the Economic category of games), I was left with a data set that had 1,261 columns. Evidently there are lots and lots of ways to categorize games! The statistics gods will smite you if you try to plug 1,260 features into a linear regression equation, so I needed a way to select only those which were most important for predicting the average rating of a particular game.
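The dummy conversion looks something like this. The games and the single category column are made up (real games can belong to several categories at once, so the real version one-hot encodes a list per game):

```python
import pandas as pd

# Hypothetical slice of the scraped data: three invented games, each
# with one category label and an average rating.
games = pd.DataFrame({
    "name": ["Game A", "Game B", "Game C"],
    "category": ["Economic", "Wargame", "Economic"],
    "avg_rating": [7.2, 6.8, 7.9],
})

# One 0/1 dummy column per category, e.g. a new "Economic" feature.
dummies = pd.get_dummies(games["category"])
data = pd.concat([games, dummies], axis=1)
print(data[["name", "Economic", "Wargame"]])
```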

The solution was to use a random forest regression to rank each of my dummies by importance (for predicting a game's average rating). The graph below shows the top ten features my random forest model found:
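Since the scraped data set isn't shown here, a sketch of that importance ranking on synthetic data looks like this. Feature 0 is deliberately constructed to influence the rating while the others are noise, so it should float to the top of the ranking:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Synthetic stand-in for the dummy matrix: 200 games x 5 dummy features.
# Feature 0 is built to actually move the rating; the rest are noise.
X = rng.integers(0, 2, size=(200, 5)).astype(float)
y = 6.5 + 1.2 * X[:, 0] + rng.normal(0, 0.3, size=200)

forest = RandomForestRegressor(n_estimators=200, random_state=0)
forest.fit(X, y)

# feature_importances_ sums to 1; higher means more predictive of rating.
ranking = np.argsort(forest.feature_importances_)[::-1]
print(ranking[0])  # feature 0 should rank first
```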

I then took the first seven features from the list above and plugged them into a linear regression model to see how they performed. The result was an R-squared with a paltry value of 0.215 (a medium or strong relationship would have been upwards of 0.6 or 0.7). So although I found the features which "best" predicted whether a game would have a high rating or not, there remains a lot of unexplained "noise" in the model.
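The regression step itself is just a fit-and-score. Here is a sketch on synthetic data with a deliberately weak signal, to mimic the kind of low R-squared described above (the numbers are invented, not my actual results):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)

# Synthetic weak-signal data: seven stand-in features, but most of the
# variation in the "rating" y is noise the features cannot explain.
X = rng.normal(size=(500, 7))
y = 0.3 * X[:, 0] + rng.normal(size=500)

model = LinearRegression().fit(X, y)
r2 = model.score(X, y)  # R-squared on the training data
print(round(r2, 3))     # small: only a little variance is explained
```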


Sadly I was left with a model that produced no useful predictive power for deciding whether to back a Kickstarter campaign. You may already be thinking of a few things that were wrong in my assumptions. First, does the average rating, as defined by BoardGameGeek users, really tell us anything about whether any one individual will enjoy a game? Certainly not for me! Sure I enjoy many of the games in the top 100, but there are some I can't stand! Pandemic Legacy: Season 2, perhaps the worst board game experience I have ever had to endure, is rated #33. And my beloved 878 Vikings (from my top 10 list) is way down at #741!

This reveals the real problem with my attempted analysis: taste in board games is far too variable to be predicted by any collection of features a game might have. For example, I have a friend whose name starts with a "C" and rhymes with "Blonnor" who hates word games, yet to me games like Codenames, Decrypto, or Crosstalk (rated #2274?!) are some of my all-time favorites! There are so many different board games and so many different types of people who enjoy them.

Just for fun I also decided to see whether a game's weight (i.e. complexity) on a scale of 1 to 5 could predict its average rating. The results show just how much variance there is in people's preferences. Sad to say, although still inadequate, the weight of a game does a better job at predicting its rating than my fancy random forest algorithm from earlier.

Each dot represents one board game