```
# Packages used
library(dplyr)
library(brms)
library(tidybayes)
library(ggplot2)
library(ggdist)
```

This post is organized as follows:

- Section 1 provides a conceptual overview of IRT ideology models.
- If you want to skip all the Greek letters and math notation, go to Section 2 for the coding workflow in R.
- And if you only want to read the interpretation of the model results through my uninformed and half-baked analysis of the Supreme Court, skip to Section 3.

Item-Response Theory (IRT) models are a class of statistical models used to measure latent traits in individuals.^{1} These traits are characteristics which, unlike height or weight, we cannot observe directly, and which we instead have to infer indirectly through observed actions. For example, a student’s responses to questions on an exam might give us some idea about their latent intelligence—or a politician’s votes in Congress might give us some idea about their underlying political ideology.

Say we want to determine where a Supreme Court justice lies on a left-right ideological scale. We will call this variable $\theta$. One place to start would be to qualitatively code each Supreme Court decision as either being liberal (0) or conservative (1), and then look at the proportion of times each justice sided with the conservative outcome. Expressed as a statistical model we get:

$$
y_{ij} \sim \mathrm{Bernoulli}\big(\Phi(\theta_j)\big) \tag{1}
$$

Where whether each justice $j$ sides with a conservative decision on case $i$ ($y_{ij}$) is based probabilistically on the (scaled) proportion of conservative positions ($\theta_j$). The Standard Normal cumulative distribution function ($\Phi$) is there to add some random noise to the model. We don’t want our ideology measurements to be deterministic based on past decisions. Instead, we want to allow some room for some idiosyncratic errors to occur. On even the most conservative possible decision, we allow for *some tiny* probability that Clarence Thomas takes the liberal side. The Bernoulli distribution turns the probabilities produced by the $\Phi$ function into observed 0’s and 1’s (liberal or conservative votes). See my post on Probit regression models for more on this.
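As a toy sketch of this machinery (the $\theta$ value here is made up for illustration, not taken from any model below), base R’s `pnorm()` plays the role of $\Phi$ and `rbinom()` plays the role of the Bernoulli draw:

```
# Hypothetical justice with latent ideology theta = 1.2 (made-up value)
theta <- 1.2
p_conservative <- pnorm(theta)  # Phi: latent score -> probability in [0, 1]

# Bernoulli draws turn that probability into observed 0/1 votes,
# leaving room for idiosyncratic "errors" on any given case
set.seed(1)
rbinom(10, size = 1, prob = p_conservative)
```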

The model in Equation 1 has at least one major flaw. Because there are only parameters for justices ($\theta_j$) and none for cases, it treats all cases before the Supreme Court as interchangeable. Additive index variables such as these implicitly assume that each “item” (i.e. case) contributes the same amount of weight towards measuring the latent construct in question. In the example of the Supreme Court this is clearly a bad assumption to make since some cases clearly have more ideological importance than others.^{2}

Let’s fix this flaw by adding a case-level parameter ($\xi_i$) to the model:

$$
y_{ij} \sim \mathrm{Bernoulli}\big(\Phi(\theta_j + \xi_i)\big) \tag{2}
$$

Equation 2 is commonly known as the *1-Parameter IRT Model*.^{3} Each case now has an independent latent variable for how likely *every* justice is to vote in the conservative direction. For IRT models within the context of standardized tests, $\xi$ is called the “difficulty” parameter—questions on exams vary in how difficult they are to answer correctly.

The 1-Parameter IRT model in Equation 2 is a big improvement over the additive index model in Equation 1, but if we want to be serious about measuring Supreme Court justice ideology we need to go further.

$$
y_{ij} \sim \mathrm{Bernoulli}\big(\Phi(\gamma_i \theta_j + \xi_i)\big) \tag{3}
$$

The *2-Parameter IRT model* in Equation 3 adds one more case-level parameter ($\gamma_i$) which allows the *ideological valence* of each case to vary. In the test-taking context, $\gamma$ is referred to as the “discrimination” parameter. What this means in the context of the Supreme Court is that we expect certain cases to more strongly separate liberal justices from conservative justices.^{4}

The *2-Parameter IRT model* in Equation 3 was originally developed and applied to Supreme Court justices by Martin and Quinn (2002). For an excellent overview on the latest in judicial ideology measurement methods see Bonica and Sen (2021).

Now let’s turn to coding up the IRT model in Equation 3, and use it to measure the ideology of Supreme Court justices. There are three steps to this process:

- Prepare the data
- Build the model (and check the model)
- Extract the ideology estimates

The Washington University Law Supreme Court Database is a fantastic resource for data on Supreme Court cases. We will be using the justice-centered data because ultimately it is justice characteristics we care about.

`votes <- readr::read_csv(here::here("posts", "irt-brms", "data-raw", "SCDB_2023_01_justiceCentered_Vote.csv"))`

The `votes` data frame contains justice voting data stretching back to 1946. It is already in “long format”, which is great because that’s what works best with our modeling approach using the brms R package. By long format we mean that every row contains a unique justice-case pair.^{5}
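To make the long-versus-wide distinction concrete, here is a hypothetical miniature of each layout (justice and case names invented; `tidyr::pivot_wider()` converts between them if an older package ever demands a vote matrix):

```
# Long format: one row per justice-case pair (what brms wants)
votes_long <- data.frame(
  justiceName = c("JusticeA", "JusticeA", "JusticeB", "JusticeB"),
  caseId      = c("case1", "case2", "case1", "case2"),
  direction   = c(1, 0, 0, 1)
)

# Wide format: rows = justices, columns = cases (what pscl-era tools want)
votes_wide <- tidyr::pivot_wider(votes_long,
                                 names_from  = caseId,
                                 values_from = direction)
```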

```
votes_recent <- votes |>
  filter(term == 2022) |>
  mutate(direction = case_when(direction == 1 ~ 1,
                               direction == 2 ~ 0,
                               .default = NA))
```

Next we will filter out all years except for the 2022 term because this is where the 3-3-3 vs 6-3 debate is taking place. Lastly, we will recode the outcome variable, `direction`, such that `1` represents the conservative position and `0` represents the liberal position (`bernoulli()` in brms expects a 0/1 outcome). This helps align liberal with “left-wing” and conservative with “right-wing” on the unidimensional ideology scale we are building. The method behind coding a decision as liberal versus conservative is explained in more detail here.

With our data ready to go it is time to translate the model from Equation 3 into R code. The brms R package makes constructing the model, as well as extracting the results, relatively straightforward.^{6}

```
irt_formula <- bf(
  direction ~ gamma * theta + xi,
  gamma ~ (1 | caseId),
  theta ~ (1 | justiceName),
  xi ~ (1 | caseId),
  nl = TRUE
)
```

We start with writing out the formula for our ideology model: `irt_formula`. The top line `direction ~ gamma * theta + xi` translates Equation 3 into code with the `direction` variable—whether a justice took a conservative or liberal position on a case—swapped in for $y_{ij}$. Each of `gamma`, `theta`, and `xi` are modeled hierarchically using either the case variable `caseId` or the justice variable `justiceName`. Hierarchical modeling allows each of these three parameters to partially pool information from other cases or justices, which imposes regularization on the estimates and improves out-of-sample fit. This should be the default practice whenever building an IRT model. Lastly, the line `nl = TRUE` is necessary because the term `gamma * theta` means that our model is “non-linear”.

Priors are important in all Bayesian models, but they are especially important for IRT due to these models’ inherently tricky identification problems. A model is “properly identified” if, given a specific set of data, the model will produce a unique set of plausible parameter values. As it currently stands this is not the case for either Equation 3 or its code-equivalent `irt_formula`. Identification is difficult for IRT models because there is no inherent center, scale, or polarity for latent variables. It might be natural to think of 0 as the center for ideology, but nothing in Equation 3 makes that so. Likewise, there is no one way of telling how stretched out or compressed the ideology scale should be. And finally, there is nothing to tell us whether increasing values of $\theta$ should correspond to ideology becoming more liberal or to becoming more conservative (polarity).
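A quick numerical sketch (with arbitrary parameter values) shows the center problem: shifting every `theta` by a constant and absorbing the shift into `xi` produces the exact same likelihood, so the data alone cannot pin the parameters down:

```
gamma <- 0.8; theta <- 1.5; xi <- -0.3   # arbitrary illustrative values
p1 <- pnorm(gamma * theta + xi)

# Shift theta by +10 and compensate in xi: identical vote probability
p2 <- pnorm(gamma * (theta + 10) + (xi - gamma * 10))
all.equal(p1, p2)  # TRUE
```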

```
irt_priors <-
  prior(normal(0, 2), class = b, nlpar = gamma, lb = 0) +
  prior(normal(0, 2), class = b, nlpar = theta) +
  prior(normal(0, 2), class = b, nlpar = xi)
```

We will solve each of these three identification problems by setting a few priors on the parameters. Each of `gamma`, `theta`, and `xi` will get relatively narrow Normal(0, 2) priors. These encode a default center and scale into the model. Lastly we set `lb = 0` on `gamma`, which means that its lower bound cannot be less than zero, and therefore `gamma` must be positive for all cases. This, in conjunction with defining the `direction` variable such that higher values = conservative and lower values = liberal, fixes the polarity identification problem.

```
get_prior(
  formula = irt_formula,
  data = votes_recent,
  family = bernoulli(link = "probit")
)
```

For help setting priors in brms you can use the `get_prior()` function with your formula, data, and model family. It will tell you what the default priors are for this model. To solve the identification problems in `irt_formula` we only need to set priors on the `class = b` intercepts, but if you wanted to get a little more fancy you could add custom priors to the `class = sd` scale parameters (the default Student t(3, 0, 2.5) seems fine to me).

```
irt_fit <- brm(
  formula = irt_formula,
  prior = irt_priors,
  data = votes_recent,
  family = bernoulli(link = "probit"),
  backend = "cmdstanr",
  cores = 8,
  threads = threading(2),
  control = list(adapt_delta = 0.99,
                 max_treedepth = 15),
  refresh = 0,
  seed = 555
)
```

```
Running MCMC with 4 chains, at most 8 in parallel, with 2 thread(s) per chain...
Chain 1 finished in 33.6 seconds.
Chain 3 finished in 34.4 seconds.
Chain 2 finished in 48.7 seconds.
Chain 4 finished in 51.0 seconds.
All 4 chains finished successfully.
Mean chain execution time: 41.9 seconds.
Total execution time: 51.1 seconds.
```

Let’s finally add our IRT formula, priors, and data into the `brm()` function and fit the IRT model. The `brm()` function takes these inputs and translates them into Stan code which is run using `backend = "cmdstanr"`.^{7} The default four chains will sample in parallel if you set `cores = 4` or greater. Combining `cores = 8` with `threads = threading(2)` allows two of your cores to work on each chain, which can help speed up the sampling. The `adapt_delta = 0.99` and `max_treedepth = 15` options give the sampler a bit more *oomph*, to use a technical term. This will help make sure things don’t run off the rails due to identification issues during sampling—which can still creep up in IRT models despite our best efforts in setting priors.

`summary(irt_fit)`

```
 Family: bernoulli 
  Links: mu = probit 
Formula: direction ~ gamma * theta + xi 
         gamma ~ (1 | caseId)
         theta ~ (1 | justiceName)
         xi ~ (1 | caseId)
   Data: votes_recent (Number of observations: 623) 
  Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup draws = 4000

Multilevel Hyperparameters:
~caseId (Number of levels: 55) 
                    Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(gamma_Intercept)     1.70      1.11     0.38     4.60 1.00     1026     1677
sd(xi_Intercept)        2.75      0.69     1.57     4.28 1.00      540     1752

~justiceName (Number of levels: 9) 
                    Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(theta_Intercept)     1.43      1.03     0.32     4.02 1.00     1687     2130

Regression Coefficients:
                Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
gamma_Intercept     1.10      0.71     0.22     2.94 1.00     1560     2082
theta_Intercept    -1.05      1.06    -3.28     1.13 1.00      358      439
xi_Intercept        0.94      0.75    -0.69     2.32 1.00      293      567

Draws were sampled using sample(hmc). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
```

Inspecting the output in `summary(irt_fit)` won’t tell us much about the substantive results, but it is crucial for ensuring that the model has fit properly. If your IRT model is poorly identified, Stan’s Hamiltonian Monte Carlo (HMC) sampler will likely yell at you about a number of things:

- If you get more than a handful of divergent transition warnings, there is likely something seriously wrong with the model.
- Check for high Rhat values for some or all parameters. Rhats above ~1.02 signify that the four HMC chains do not share a lot of agreement regarding where the posterior distribution should be. Typically this comes from poor identification—for example, if polarity is not fixed, the *same data* will produce negative $\theta$ values for some chains and positive $\theta$ values for other chains.
- Make sure the Effective Sample Size numbers (Bulk_ESS and Tail_ESS) are sufficiently large (ideally several hundred) for all parameters.
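These checks can also be run programmatically. A sketch, assuming the fitted `irt_fit` object from above (`rhat()` and `nuts_params()` are brms helpers):

```
# Flag any parameters whose Rhat suggests the chains disagree
rhats <- brms::rhat(irt_fit)
any(rhats > 1.02, na.rm = TRUE)  # want FALSE

# Count divergent transitions across all chains (want zero, or close to it)
brms::nuts_params(irt_fit) |>
  dplyr::filter(Parameter == "divergent__") |>
  dplyr::summarise(n_divergent = sum(Value))
```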

After fitting the model and checking the sampling diagnostics we are finally ready to extract the ideology estimates (posterior distributions for `theta`) for each justice. This can be done directly in brms, but I prefer to use the tidybayes R package because it is specifically built for working with post-estimation quantities from Bayesian models.

`get_variables(irt_fit)`

We start by identifying the names of the parameters we’re interested in using `get_variables()`. In this case they are the `r_justiceName__theta` terms.

```
justice_draws <- irt_fit |>
  spread_draws(r_justiceName__theta[justice,]) |>
  ungroup() |>
  mutate(justice = case_when(justice == "SAAlito" ~ "Alito",
                             justice == "CThomas" ~ "Thomas",
                             justice == "NMGorsuch" ~ "Gorsuch",
                             justice == "ACBarrett" ~ "Barrett",
                             justice == "JGRoberts" ~ "Roberts",
                             justice == "BMKavanaugh" ~ "Kavanaugh",
                             justice == "KBJackson" ~ "Jackson",
                             justice == "EKagan" ~ "Kagan",
                             justice == "SSotomayor" ~ "Sotomayor"),
         theta = r_justiceName__theta,
         justice = forcats::fct_reorder(justice, theta))
```

Draws from the posterior distribution for each justice’s `r_justiceName__theta` can be extracted using tidybayes’s `spread_draws()` function. The `[justice,]` part gives us draws for each justice and names the new variable distinguishing justices as `justice`. In this code chunk we also rename the justices to only their last name, and we reorder them by their median `theta` value using `forcats::fct_reorder()`.

```
p <- justice_draws |>
  ggplot(aes(x = theta,
             y = justice)) +
  stat_slabinterval(aes(fill_ramp = after_stat(x)),
                    fill = "green",
                    density = "unbounded",
                    alpha = .75) +
  scale_fill_ramp_continuous(from = "blue", guide = "none") +
  xlim(c(-3.5, 3.5)) +
  labs(x = expression("Ideology Estimate," ~ theta),
       y = "",
       title = "Supreme Court Justice IRT Model Results",
       subtitle = "2022 Term") +
  theme_minimal()
```

The ggdist R package contains many excellent options for graphing distributions of values and plays very nicely with tidybayes (Matthew Kay is the author of both packages). In this case we’ll use `stat_slabinterval()` to show us the full posterior distribution for `theta`, along with the median and 66% + 95% intervals.

What should we take away from the ideology estimates from the model above? First, the ordering roughly matches intuition. The three liberal, Democrat-appointed justices—Sotomayor, Kagan, and Jackson—receive left-wing ideology scores. Kavanaugh and Roberts are considerably more right-wing than those three, followed by Barrett and Gorsuch. And Thomas and Alito are even more extreme in their conservatism compared to their other four Republican-appointed colleagues.

A second takeaway is that these estimates contain a lot of uncertainty. The `theta` posteriors for each justice are quite wide, especially for those on the ideological periphery. This is largely due to a lack of data. We are only examining a single year of Supreme Court cases (55 total in the model), and we only have nine individuals who are taking positions on these cases. IRT models produce more confident results as both items and responses increase. In principle we could extend this analysis back further in time by incorporating data on more Supreme Court terms. However, this is not necessarily a good idea because the ideological composition of the Court’s docket changes every year.
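To attach numbers to that uncertainty, each justice’s draws can be reduced to a median and a credible interval. A self-contained sketch, with simulated draws standing in for one justice’s `theta` posterior:

```
# Simulated stand-in for one justice's 4,000 posterior draws of theta
set.seed(42)
draws <- rnorm(4000, mean = -1, sd = 0.8)

# Posterior median plus a 95% credible interval
c(median = median(draws), quantile(draws, c(0.025, 0.975)))
```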

This leads us to the third takeaway. Be careful when extrapolating these results to the broader political context. An ideology score of 0 on this scale should not be construed as “centrist” or “moderate”! The Supreme Court docket is *not* a representative sample of the political issues facing the country each year. Justices on the Court choose to grant *certiorari* to only a small proportion of potential cases—a process which biases the ideological landscape of cases in a given term. If justice ideology impacts how they decide cases, it should also impact how they select which cases to decide in the first place. Furthermore, selection bias can occur at the lower court stage. Conservative activists are more likely to appeal extremist cases up to this incarnation of the Supreme Court because they know they have a better shot at winning on these issues compared to past terms. Conversely, liberal activists may not bother trying to get favorable cases on the court’s docket because they know they stand no chance.

So what do these results say about the 3-3-3 vs 6-3 debate?^{8} Perhaps it’s actually more of a 3-4-2 Court. That’s not to say that the four in the middle are true centrists though—they are simply slightly more moderate than Thomas and Alito (a *very* low bar).

To conclude, pack the court.

`sessionInfo()`

```
R version 4.4.0 (2024-04-24)
Platform: aarch64-apple-darwin20
Running under: macOS Sonoma 14.3
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
time zone: America/Los_Angeles
tzcode source: internal
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] ggdist_3.3.2 ggplot2_3.5.1 tidybayes_3.0.6 brms_2.21.0
[5] Rcpp_1.0.12 dplyr_1.1.4
loaded via a namespace (and not attached):
[1] gtable_0.3.5 tensorA_0.36.2.1 xfun_0.44
[4] QuickJSR_1.1.3 htmlwidgets_1.6.4 processx_3.8.4
[7] inline_0.3.19 lattice_0.22-6 tzdb_0.4.0
[10] ps_1.7.6 vctrs_0.6.5 tools_4.4.0
[13] generics_0.1.3 stats4_4.4.0 curl_5.2.1
[16] parallel_4.4.0 tibble_3.2.1 fansi_1.0.6
[19] cmdstanr_0.8.0.9000 pkgconfig_2.0.3 Matrix_1.7-0
[22] data.table_1.15.4 checkmate_2.3.1 distributional_0.4.0
[25] RcppParallel_5.1.7 lifecycle_1.0.4 farver_2.1.2
[28] compiler_4.4.0 stringr_1.5.1 Brobdingnag_1.2-9
[31] munsell_0.5.1 codetools_0.2-20 htmltools_0.5.8.1
[34] bayesplot_1.11.1 yaml_2.3.8 crayon_1.5.2
[37] pillar_1.9.0 tidyr_1.3.1 arrayhelpers_1.1-0
[40] StanHeaders_2.32.8 bridgesampling_1.1-2 abind_1.4-5
[43] nlme_3.1-164 posterior_1.5.0 rstan_2.32.6
[46] tidyselect_1.2.1 digest_0.6.35 svUnit_1.0.6
[49] mvtnorm_1.2-4 stringi_1.8.3 reshape2_1.4.4
[52] purrr_1.0.2 labeling_0.4.3 forcats_1.0.0
[55] rprojroot_2.0.4 fastmap_1.1.1 grid_4.4.0
[58] here_1.0.1 colorspace_2.1-0 cli_3.6.2
[61] magrittr_2.0.3 loo_2.7.0 pkgbuild_1.4.4
[64] utf8_1.2.4 readr_2.1.5 withr_3.0.0
[67] scales_1.3.0 backports_1.5.0 bit64_4.0.5
[70] rmarkdown_2.26 matrixStats_1.3.0 bit_4.0.5
[73] gridExtra_2.3 hms_1.1.3 coda_0.19-4.1
[76] evaluate_0.23 knitr_1.47 V8_4.4.2
[79] rstantools_2.4.0 rlang_1.1.3 glue_1.7.0
[82] vroom_1.6.5 rstudioapi_0.16.0 jsonlite_1.8.8
[85] plyr_1.8.9 R6_2.5.1
```

IRT can also be used on non-individual units, such as organizations, but most examples use individual people.↩︎

The no-ideological-difference-among-items assumption is pretty much always wrong, yet researchers continue to use additive index scales of latent variables in the social sciences all the time. Do better! It’s not that hard!↩︎

Which is confusing because there are two parameters in the model: $\theta$ and $\xi$. Note that $\theta$ in Equation 2 is not formulated exactly the same as the additive index in Equation 1. In Equation 2, $\theta$ is simply an arbitrary parameter for the latent variable as opposed to the scaled proportion of conservative votes as in Equation 1. We can, however, still interpret larger values of $\theta$ as more conservative and lower values of $\theta$ as more liberal.↩︎

A note on notation: in the dozens of books/articles I’ve read on IRT modeling, I have not found even two which share the same Greek letters for the ability, difficulty, and discrimination parameters. Sometimes $\beta$ is in place of $\xi$. Sometimes $\alpha$ is in place of $\gamma$. The ability parameter can be any number of letters. I have decided to contribute to this ongoing mess and confusion by using my own “$\gamma\theta\xi$”, whose exact permutation I have not seen anywhere else.↩︎

Long data is in contrast to “wide” data in a vote matrix—where the rows are justices and the columns are cases. Older IRT estimation packages, such as pscl, prefer data in the form of a vote matrix.↩︎

See Bürkner 2020 for a comprehensive introduction to IRT modeling using brms.↩︎

CmdStanR is not the default backend for brms, but I prefer it to RStan because the output is more concise and it seems to sample faster.↩︎

Technically the 3-3-3 advocates are trying to put the nine justices on a two-dimensional scale, as opposed to the unidimensional left-right scale in our IRT model. They call this fake second scale “institutionalism”. We could, in principle, add another dimension to our IRT model, but there is nothing in the data that explicitly codes cases as either pro-institutionalist or anti-institutionalist, so there is not really a principled way of going about this.↩︎

In this project I develop the *NYC Rat Index* using geospatial analysis in R. We will walk through some GIS wrangling steps and then employ Bayesian modeling to measure rat activity across New York. Use these results to figure out which neighborhoods to avoid—or which to seek out—depending on your overall disposition towards wild rodents.

```
# Packages used:
library(dplyr)
library(lubridate)
library(tidyr)
library(tidycensus)
library(purrr)
library(ggplot2)
library(MetBrewer)
library(spdep)
library(sf)
library(INLA)
library(leaflet)
```

Our data for constructing the **Rat Index** comes from the NYC Rat Information Portal (RIP).

```
rats <- readr::read_csv(here::here("posts", "nyc-rats", "data-raw", "Rodent_Inspection_20240528.csv")) |>
  janitor::clean_names()
```

Each of the 2.5 million observations is a rodent inspection from the year 2010 to present. A lot of inspections don’t find evidence of rats, so we will focus only on the rows where `result == "Rat Activity"`. Mapping rat populations with this data is a bit problematic because rodent inspections are not random. The RIP discusses this in their disclaimer:

> Notes on data limitations: Please note that if a property/taxlot does not appear in the file, that does not indicate an absence of rats - rather just that it has not been inspected. Similarly, neighborhoods with higher numbers properties with active rat signs may not actually have higher rat populations but simply have more inspections.

We can deal with some of this bias by building a geospatial model of rat activity—rather than by simply using the raw data on its own. The idea here is that we infer rat activity in areas with low inspection rates by how proximate they are to areas with high rat activity/inspection rates in the data. Given the fact that rats love scurrying around from place to place, let’s try to account for this spatial correlation statistically.

`data_years <- 2015:2021`

Although the rat data ranges from 2010 to present, we’ll restrict our analysis to between 2015 and 2021 because we’ll need to incorporate Census data later which is only available for this period.

```
rats_zip <- rats |>
  mutate(year = year(mdy_hms(inspection_date)),
         # some ZIPs got put into the wrong boroughs
         borough = case_when(zip_code == 10463 ~ "Bronx",
                             zip_code == 10451 ~ "Bronx",
                             zip_code == 11370 ~ "Queens",
                             zip_code == 11207 ~ "Brooklyn",
                             .default = borough)) |>
  summarise(n_rats = sum(result == "Rat Activity", na.rm = TRUE),
            .by = c("zip_code", "borough", "year")) |>
  filter(year %in% data_years, !is.na(borough))
```

Let’s start by creating a new data frame, `rats_zip`, which collapses the number of rat activity observations (`n_rats`) down to the ZIP code-year level. We’ll use ZIP code as the level of spatial aggregation because most people know which ZIP code they live in (as opposed to Census tract or other designation). This will make it easier for New Yorkers to know the *Rat Index* value where they live. The downside of using ZIP codes, however, is that they are constructed to facilitate mail delivery—not balanced geospatial analysis.^{1}

```
get_rat_predictors <- function(year) {
  zip_data <- get_acs(
    geography = "zcta",
    variables = c("n_kitchens" = "B25051_002",
                  "n_food_workers" = "C24050_040",
                  "buildings_before_1939" = "B25034_011",
                  "total_population" = "B01003_001"),
    year = year,
    survey = "acs5",
    output = "wide",
    progress = FALSE
  )
  return(zip_data)
}
```
```

Next, let’s grab ZIP code covariates using the `get_acs()` function in the **tidycensus** R package. These are variables from the Census American Community Survey (ACS) which I believe can help predict rat activity in our model. Because most of my knowledge about rat psychology comes from the movie Ratatouille, the number of kitchens and the number of food workers (e.g. chefs) seem like they will be very important. The number of old buildings (built before 1939) in a ZIP code also seems relevant. I envision rats having an easier time infiltrating older buildings compared to new ones. And lastly we’ll use the total population variable because apparently rats like living near us humans.

```
nyc_rat_vars <- split(data_years, data_years) |>
  map(get_rat_predictors) |>
  list_rbind(names_to = "year") |>
  mutate(zip_code = as.numeric(GEOID),
         year = as.numeric(year)) |>
  filter(zip_code %in% rats_zip$zip_code)
```
```

Because our rat data is a time series, we’ll run `get_rat_predictors()` over each year in `data_years` using `purrr::map()`. Ideally we would use the ACS 1-year file for each of these years, but ZIP code data is only available in the ACS 5-year file.

```
nyc_zips <- tigris::zctas(progress = FALSE) |>
  mutate(zip_code = as.numeric(ZCTA5CE20)) |>
  filter(zip_code %in% rats_zip$zip_code)
```
```

In addition to the outcome variable `n_rats` and the Census covariates used to predict rat activity, we need the geospatial geometry of each ZIP code. Ideally we could get this information using the option `geometry = TRUE` in `get_acs()`. But I found that the ZIP code geometries had minuscule differences from year to year, which made any subsequent analysis with them a huge pain. So instead we will pull the spatial geometry data just once using the `zctas()` function in the **tigris** package.

```
# Merge the ZIP data in with the rats data
rats_all <- rats_zip |>
  inner_join(nyc_rat_vars, by = c("zip_code", "year")) |>
  left_join(nyc_zips, by = "zip_code") |>
  # model functions like data to be indexed as 1, 2, 3, etc.
  mutate(zip_code_code = as.integer(as.factor(zip_code)),
         year_code = as.integer(as.factor(year)))
```
```

Lastly we merge all three data sources (rats, rat predictors, ZIP code geometry) into a single data frame called `rats_all`.

To infer rat activity from proximate ZIP codes, we need to know which ZIP codes are neighbors.

```
# Using only zip codes, boroughs, and geometry to illustrate the neighbor graph
zips_only <- rats_all |>
  st_as_sf() |>
  select(zip_code, borough) |>
  distinct()

# Neighbor adjacency graph
zips_adj <- poly2nb(zips_only)
```

The function `poly2nb()` from the **spdep** package takes our data containing spatial geometry polygons and returns a neighbor adjacency graph. This tells us, for each ZIP code in our data, which ZIP codes share at least one boundary point with it. We could take a look at this network graph using the base `plot(zips_adj)` method, but I prefer the look and flexibility of ggplot.

```
nb_to_sf <- function(nb_obj, sf_obj) {
  sf_out <- as(nb2lines(nb_obj, coords = coordinates(as_Spatial(sf_obj))), "sf") |>
    st_set_crs(st_crs(sf_obj))
  return(sf_out)
}
```

The function `nb_to_sf()` can be used to convert `zips_adj` back into an `sf` data object which plays nicely with ggplot functions.

```
ggplot(zips_only) +
  geom_sf(color = "white", aes(fill = borough)) +
  geom_sf(data = nb_to_sf(zips_adj, zips_only)) +
  scale_fill_manual(values = met.brewer("Hokusai3")) +
  theme_void()
```

Here is our beautiful NYC ZIP code adjacency graph! Unfortunately for us, however, the `poly2nb()` function only connected ZIP codes if they were terrestrial neighbors. This leaves all the boroughs except Brooklyn and Queens disconnected from one another. The Rockaway beach area is also isolated from the rest of the city, and poor Roosevelt Island is all alone with zero neighbors. We need to make some adjustments if we want to properly model rat activity. While I’m not sure whether a rat could swim across the East River from Manhattan to Brooklyn, I have firsthand knowledge of them traveling across the city’s various tunnels and bridges.

```
add_neighbors <- function(nb_obj, links, node_vec) {
  for (i in seq_along(links)) {
    from <- match(names(links[i]), node_vec)
    to   <- match(links[i], node_vec)
    nb_obj[[from]] <- setdiff(as.integer(sort(c(nb_obj[[from]], to))), 0)
    nb_obj[[to]]   <- setdiff(as.integer(sort(c(nb_obj[[to]], from))), 0)
  }
  return(nb_obj)
}
```
```

It turns out that adding manual connections to an `nb` spatial neighbors object is extremely annoying. The **spdep** package supposedly has a function for this, `edit.nb()`, but you will get an error saying “do not use in RStudio” (???). I refuse to work out of the Console like a caveman, so instead I wrote a function `add_neighbors()` to help with this. Enter your original `nb` spatial neighbors object, a vector of neighbor pairs you wish to connect, and the geography variable from the data frame you used to create the `nb` object—and you will get out a new `nb` spatial neighbors object with all those nodes connected. The `links` argument in this example will be ZIP code pairs and the `node_vec` argument will be the ZIP code column in `zips_only`, but `add_neighbors()` will also work if you want to connect Census tracts or any other geographic unit.

```
connect_nyc_neighbor_zips <- partial(
  add_neighbors,
  links = c("11414" = "11693", # Cross Bay Blvd
            "11234" = "11697", # Marine Parkway Bridge
            "10305" = "11209", # Verrazzano Bridge
            "10004" = "11231", # Brooklyn-Battery Tunnel
            "10038" = "11201", # Brooklyn Bridge
            "10002" = "11201", # Manhattan Bridge
            "10002" = "11211", # Williamsburg Bridge
            "10017" = "11109", # Queens-Midtown Tunnel
            "10022" = "11101", # Queensboro Bridge
            "10044" = "11106", # Roosevelt Island Bridge
            "10035" = "11102", # Triborough Bridge
            "10035" = "10454", # Triborough Bridge
            "10037" = "10451", # Madison Ave. Bridge
            "10039" = "10451", # 145th St. Bridge
            "10033" = "10453", # Washington Bridge
            "10034" = "10468", # University Heights Bridge
            "10034" = "10463", # Broadway Bridge
            "10465" = "11360", # Throgs Neck Bridge
            "10465" = "11357") # Whitestone Bridge
)
```
```

Because we will use `add_neighbors()` multiple times in this project, I created a `purrr::partial()` version of it with the major NYC bridges and tunnels connected by default. Now I know what you are thinking. What about the *SUBWAY*?! Rats *LOVE* the *SUBWAY*! Sorry, but it was tedious enough looking up all these bridges. A serious GIS specialist could probably do something fancy like overlay a shapefile of the NYC subway system on top of the ZIP code shapefile, and add connections that way. But for now we will just assume that rats are banned from riding the subway.

```
zips_adj_c <- zips_adj |>
  connect_nyc_neighbor_zips(node_vec = zips_only$zip_code)
```
```

With that caveat out of the way, let’s create a new spatial neighbor object, `zips_adj_c`, with the ZIP codes that we want connected.

```
ggplot(zips_only) +
  geom_sf(color = "white", aes(fill = borough)) +
  geom_sf(data = nb_to_sf(zips_adj_c, zips_only), color = "red") +
  geom_sf(data = nb_to_sf(zips_adj, zips_only)) +
  scale_fill_manual(values = met.brewer("Hokusai3")) +
  theme_void()
```
```

Plotting these two graphs on top of each other confirms that we fully connected the entire city! The rats will now be free to roam from borough to borough in our model.

Let’s return to the ZIP-year level data, `rats_all`, for constructing the *Rat Index*.

```
zips_adj_long <- rats_all |>
  st_as_sf() |>
  poly2nb() |>
  connect_nyc_neighbor_zips(node_vec = rats_all$zip_code)

zip_mat <- nb2mat(zips_adj_long, style = "B")
```

We’ll create an `nb` spatial neighbor object from this long data using `poly2nb()` as before, and add the tunnel and bridge connections using `connect_nyc_neighbor_zips()`. We will then encode the neighbor relations in `zips_adj_long` as an adjacency matrix using `nb2mat()` with `style = "B"` for “binary”. This creates a square matrix of 1’s and 0’s where the 1’s denote adjacency between ZIP codes. This adjacency matrix is the form the neighbor relations need to take for the modeling step.
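
To make the binary adjacency matrix concrete, here is a hand-built sketch in base R (three toy regions, not the real ZIP matrix) of the kind of output `nb2mat(style = "B")` produces for regions sitting in a row:

```
# Three regions in a row: pairs 1-2 and 2-3 are neighbors, 1-3 are not
adj <- matrix(0, nrow = 3, ncol = 3)
adj[1, 2] <- adj[2, 1] <- 1
adj[2, 3] <- adj[3, 2] <- 1

adj
#>      [,1] [,2] [,3]
#> [1,]    0    1    0
#> [2,]    1    0    1
#> [3,]    0    1    0
```

The matrix is symmetric because adjacency runs both ways, and the diagonal is all 0’s because a region is not its own neighbor.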

```
rat_mod <- inla(
  n_rats ~ total_populationE + buildings_before_1939E + n_kitchensE + n_food_workersE +
    f(zip_code_code, model = "bym2", graph = zip_mat),
  data = rats_all,
  family = "poisson"
)
```

Time to build the rat model. We’ll use the **INLA** package to predict rat activity given the population, building age, kitchen, and food worker variables we assembled earlier. **INLA** is very popular for fitting Bayesian models with spatial components. It uses a form of approximate Bayesian inference called *I*ntegrated *N*ested *L*aplace *A*pproximation, which means it runs a lot faster than full Markov Chain Monte Carlo samplers such as Stan.

The `f(zip_code_code, model = "bym2", graph = zip_mat)` section in the model is the spatial component: the part of the model that allows neighboring ZIP codes to share rat information with each other. **INLA** and all the spatial functions we used in **spdep** (`poly2nb()` and `nb2mat()`) are kind of old-school when it comes to their reliance on integer indexing, as opposed to allowing ZIP codes to be their true, un-ordered, selves. This is why I use the variable `zip_code_code` instead of `zip_code` here. It contains integer values matching each ZIP code with its position in the `nb` spatial neighbor object. The `model = "bym2"` part tells **INLA** we are doing spatial analysis using the Besag-York-Mollié method. BYM2 gives each ZIP code a varying intercept (i.e. a “random effect”) which is a combination of spatial correlation with its neighboring ZIP codes and an unstructured effect for non-spatial rat behavior.
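
Sketched in code with simulated toy values (the numbers for `rho` and `sigma` below are made up for illustration, not estimates from the rat data), the BYM2 varying intercept for each area combines the two pieces like this:

```
set.seed(1)
n <- 5                   # five toy areas
structured <- rnorm(n)   # stand-in for the spatially smoothed effect
unstructured <- rnorm(n) # iid effect for non-spatial behavior
rho <- 0.6               # mixing weight: share of variance that is spatial
sigma <- 0.5             # overall scale of the varying intercepts

intercepts <- sigma * (sqrt(rho) * structured + sqrt(1 - rho) * unstructured)
```

When `rho` is near 1 the intercepts are almost entirely driven by the spatial term; near 0 they reduce to ordinary iid random effects.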

We’re using `family = "poisson"` in the model because the outcome variable, `n_rats`, is a count of rat inspections, and Poisson likelihoods are good for count data.^{2}
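
A quick simulated check (toy draws, not the rat data) on why the Poisson distribution suits counts: draws are non-negative integers, and the mean roughly equals the variance.

```
set.seed(42)
x <- rpois(100000, lambda = 4) # simulate counts with an average rate of 4

min(x)  # never negative
mean(x) # close to 4
var(x)  # also close to 4: the Poisson mean-variance property
```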

```
rat_scores <- rats_all |>
  mutate(rat_score = rat_mod$summary.fitted.values$mode) |>
  st_as_sf() |>
  summarise(rat_score = mean(rat_score),
            across(geometry, st_union),
            .by = c("zip_code", "borough")) |>
  mutate(log_rat_score = log10(rat_score),
         zip_rat_rank = as.integer(as.factor(-log_rat_score)),
         zip_rat_perc = percent_rank(-zip_rat_rank))
```

The predicted rat activity can be extracted from the fitted model using `rat_mod$summary.fitted.values`. The mean, median, and mode of these posterior values are all about the same, so we’ll save `$mode` as our new `rat_score` variable. Because `rat_mod` uses the time-series data, we will aggregate rat scores across years down to the ZIP code level. Rat activity is highly skewed, so for plotting purposes we’ll use `log_rat_score`. And we’ll also create rat index rank and percentile variables so that we can find the biggest rat hot-spots in the city.
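
For the percentile variable, `percent_rank()` just rescales ranks to the [0, 1] interval. A base-R sketch with toy scores (not the real rat scores):

```
scores <- c(10, 40, 20, 30)

# Equivalent to dplyr::percent_rank(): (rank - 1) / (n - 1)
pr <- (rank(scores) - 1) / (length(scores) - 1)

pr
#> [1] 0.0000000 1.0000000 0.3333333 0.6666667
```

Because the post ranks on the *negative* rank, the biggest rat hot-spots end up at percentile 1.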

```
l <- st_as_sf(rat_scores) |>
  leaflet() |>
  addTiles()

labels <- sprintf(
  "<strong> ZIP Code: %s </strong> <br/>
  Rat Index: %s <br/> Percentile: %s",
  rat_scores$zip_code,
  round(rat_scores$rat_score, 2),
  round(rat_scores$zip_rat_perc, 3)
) |>
  lapply(htmltools::HTML)

pal <- colorNumeric(
  palette = "plasma",
  domain = rat_scores$log_rat_score
)

l |>
  addPolygons(
    smoothFactor = .1, fillOpacity = .8,
    fillColor = ~pal(log_rat_score),
    weight = .1,
    highlightOptions = highlightOptions(weight = 5, color = "white"),
    label = labels,
    labelOptions = labelOptions(
      style = list(
        "font-weight" = "normal",
        padding = "3px 8px"
      ),
      textsize = "15px", direction = "auto"
    )
  ) |>
  addLegend(
    pal = pal, values = ~log_rat_score, opacity = 1,
    labFormat = labelFormat(transform = function(x) 10^x),
    title = "Rat Index", position = "bottomright"
  )
```