Using two-step regression to study gender gaps across countries
This post illustrates how you can use two-step regression in R to analyze cross-country differences — in this case, why women are politically further on the left than men in some countries but not in others.
Men and women differ in many respects, including (unfortunately) income and wealth or their political attitudes. But these gender gaps do not look the same in all countries. For example, OECD data show that men earn on average about 30% more than women in South Korea, but only about 2.5% more in Bulgaria.
This is similar when it comes to political attitudes and ideology. For instance, when people are asked to place themselves on an ideological left-right scale ranging from 0 (far left) to 10 (far right), women typically place themselves further to the left than men (on average).¹
But this also varies between countries. When, as the graph below shows, respondents to the 2014 European Social Survey were asked to place themselves ideologically, women in Denmark were clearly further to the left than men but the pattern was the exact opposite in Lithuania.
This is one example of the type of cross-country differences in some individual-level pattern that political and social data analysts are regularly interested in and try to explain.
Two-step regression is one tool that can be used to study such patterns (next to multilevel regression models).² Lewis and Linzer (2005) introduced the technique, and they also provide R code for a part of the estimation. Two-step regression requires a dataset that is hierarchical (i.e., has multiple levels) like the European Social Survey, where lower-level observations (respondents) are clustered within upper-level observations (countries).
In the first step, you focus on the lower level. For example, if you are interested in differences between men and women, then you first run a set of regression models in which you regress some outcome of interest (e.g., political ideology) on gender and perhaps some other micro-level control variables — running one estimation per higher-level cluster (e.g., country). After each estimation, you store the coefficient on gender (which captures the difference between men and women) and its standard error (because the effect of gender is estimated rather than known, which needs to be taken into account in the second step).
At the end of step one, you will have a mini dataset of lower-level coefficients and their standard errors. This is your outcome or dependent variable for step two.
You can then merge this with additional explanatory variables for your higher-level units (e.g., countries) that you think can explain the variation in coefficients. Once you have a complete higher-level dataset, you move on to step two.
In step two, you run one additional regression in which you regress the coefficient estimates from the first step on your explanatory variables and where you use the standard errors of the coefficients from step one to calculate weights.
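To see the logic of both steps in one place, here is a minimal sketch on simulated data. Everything in it is made up for illustration (the object names sim, step1 and step2, the 15 groups, and the relationship built into the simulation). Step two here uses simple inverse-variance weights as a stand-in, which is a simplification and not the Lewis and Linzer estimator.

```r
# Toy illustration of the two-step logic (simulated data, hypothetical names)
set.seed(42)

n_groups <- 15    # upper-level units ("countries")
n_per    <- 500   # lower-level units per group ("respondents")

# Upper-level predictor and the true group-specific gender gaps
z        <- runif(n_groups, 0, 4)
true_gap <- 0.2 - 0.3 * z + rnorm(n_groups, sd = 0.1)

# Simulate individual-level data, one block per group
sim <- do.call(rbind, lapply(seq_len(n_groups), function(g) {
  female <- rbinom(n_per, 1, 0.5)
  data.frame(group = g, female = female,
             y = 5 + true_gap[g] * female + rnorm(n_per, sd = 2))
}))

# Step one: one regression per group; keep the coefficient and its variance
step1 <- do.call(rbind, lapply(split(sim, sim$group), function(d) {
  fit <- summary(lm(y ~ female, data = d))$coefficients
  data.frame(group = d$group[1],
             beta  = fit["female", "Estimate"],
             omega = fit["female", "Std. Error"]^2)
}))
step1$z <- z[step1$group]   # attach the upper-level predictor

# Step two (simplified): weighted regression of the estimates on z
step2 <- lm(beta ~ z, data = step1, weights = 1 / omega)
coef(step2)
```

Because the simulated gaps shrink as z grows, the step-two slope on z should come out negative.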
To make this more concrete, let's walk through an example analysis in R. I stick with the earlier example and try to explain why women are more leftist in some countries than in others. To do so, I use (as above) data from the 2014 round of the European Social Survey.
The data can be accessed and downloaded via europeansocialsurvey.org. I use the full 2014 data (all variables, all countries) in SAV format. I load the tidyverse package for easier data handling and visualization and import the data with haven and labelled:
library(tidyverse)
# Importing data
ess <- labelled::unlabelled(haven::read_sav("ess7.sav"))
To get an idea of how the data look, I cross-tabulate them by country and gender:
table(ess$gndr,ess$cntry)
            AT   BE   CH   CZ   DE   DK   EE   ES   FI   FR   GB   HU   IE   IL   LT   NL   NO   PL   PT   SE   SI
Male       853  896  766  998 1545  779  835  988 1027  913 1024  722 1102 1164  869  859  764  740  571  893  563
Female     942  873  766 1128 1500  723 1216  937 1060 1004 1240  976 1288 1398 1381 1060  672  875  694  898  661
No answer    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
This is what a typical comparative social survey dataset looks like: roughly 1,200 to 3,000 respondents (about half male, half female) in each of 21 countries.
Now to step one of the two-step procedure, in which we estimate the political gender gap within each country. First, I set up an empty data frame that stores the main results for later: The coefficients (“betas”) on gender and their estimated variances (the squared standard errors, called “omegas”) for each country:
results <- data.frame(cntry = unique(ess$cntry),
                      betas = rep(0,length(unique(ess$cntry))),
                      omegas = rep(0,length(unique(ess$cntry))))
Then I set up a loop in which I run country-by-country regression models of political ideology (lrscale, as above ranging from far left [0] to far right [10]) on gender and age as a control (you will most likely have additional controls in a real analysis):
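for (cn in unique(ess$cntry)) {
  # Regress left-right placement (0-10) on gender and age within one country
  mod0 <- broom::tidy(lm(as.numeric(lrscale)-1 ~ droplevels(gndr) + agea,
                         data = ess[which(ess$cntry==cn),]))
  # Row 2 of the tidy output is the gender coefficient (row 1 is the intercept)
  results[which(results$cntry==cn),"betas"] <- mod0$estimate[2]
  # Store the squared standard error as the estimated variance
  results[which(results$cntry==cn),"omegas"] <- (mod0$std.error[2])^2
}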
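Two details of the model formula are easy to miss, so here is a small sketch of what they do. It assumes that lrscale is imported as a factor whose first eleven levels run from "Left" over "1" to "9" up to "Right", and that gndr carries an unused "No answer" level (as in the cross-tabulation above). The vectors lr and gn are hypothetical stand-ins, not ESS data.

```r
# Hypothetical miniature of the left-right variable stored as a factor
lr <- factor(c("Left", "5", "Right", "3"),
             levels = c("Left", 1:9, "Right"))

as.numeric(lr)       # internal level codes, which start at 1
as.numeric(lr) - 1   # shifted so that "Left" is 0 and "Right" is 10

# Hypothetical miniature of the gender variable with an empty category
gn <- factor(c("Male", "Female", "Female"),
             levels = c("Male", "Female", "No answer"))

levels(gn)              # includes the unused "No answer" level
levels(droplevels(gn))  # only the levels that actually occur
```

With "Male" as the first level, the gender coefficient is the difference of women relative to men, which is why negative values mean that women place themselves further to the left.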
The results can be directly visualized:
results %>%
  mutate(upper = betas + 1.965*sqrt(omegas),
         lower = betas - 1.965*sqrt(omegas)) %>%
  ggplot(aes(y = reorder(cntry, betas), x = betas, xmin = lower, xmax = upper)) +
  geom_col() +
  labs(x = "Estimated political gender gap", y = "",
       caption = "Negative scores = women are more leftist than men. 95% confidence intervals") +
  geom_linerange() +
  theme_classic()
As above, Danish women are clearly more to the left than Danish men, and this is the same in most other countries. In a few the difference is not statistically significant, and there are two where women are more conservative than men (Lithuania and Spain).
With these estimates in hand, we can proceed to step two, where we try to explain this variation. In other words, the estimated gender differences from step one are now the dependent variable or outcome of interest.
One possible explanation for why some countries have a larger political gender gap than others was formulated by Iversen & Rosenbluth. Their point is (very briefly put) that women are more leftist in countries where they have greater economic opportunities. This in turn depends on public policies: where governments provide policies that support women’s employment, they increase women’s economic opportunities and thus make women more leftist.
To test this explanation, I use data on public spending on family policies (e.g., childcare) from the Comparative Political Data Set (CPDS) and merge them with the step-one estimates:
# Adding CPDS macro-level data
##############################
cpds <- haven::read_dta("https://www.cpds-data.org/images/Update2022/CPDS_1960-2020_Update_2022.dta")
cpds %>%
  filter(year==2014) %>%
  mutate(cntry = countrycode::countrycode(country, origin = "country.name",
                                          destination = "iso2c")) %>%
  select(cntry, country,family_pmp) -> cpds_ex
# merge
results %>%
  left_join(cpds_ex,
            by = "cntry") %>%
  filter(!is.na(country)) -> results_merged
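The join followed by the filter silently drops every country that has a step-one estimate but no macro-level data, so it can be worth checking which ones those are before merging. The following is a small base R sketch with invented country codes and invented values; est, macro, merged and kept are hypothetical names.

```r
# Toy step-one results and toy macro data (all codes and values are invented)
est   <- data.frame(cntry = c("AA", "BB", "CC"),
                    betas = c(-0.5, 0.2, -0.1))
macro <- data.frame(cntry = c("AA", "BB", "DD"),
                    family_pmp = c(3.5, 1.0, 3.4))

# Countries with an estimate but without macro data
unmatched <- setdiff(est$cntry, macro$cntry)
unmatched

# Base R counterpart of a left join, then dropping the unmatched rows
merged <- merge(est, macro, by = "cntry", all.x = TRUE)
kept   <- merged[!is.na(merged$family_pmp), ]
kept
```

Running setdiff() on the real country codes before the join shows how many upper-level observations the second step will actually be based on.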
The family_pmp variable measures public spending on family policies as a percentage of GDP.
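Then I use the edvreg() function provided by Lewis and Linzer to estimate the second-step regression. The function is not part of the packages loaded above, so its definition has to be read into the R session first: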
# Running estimation
mod2 <- edvreg(mod = results_merged$betas ~ results_merged$family_pmp,
               omegasq = results_merged$omegas)
summary(mod2)
Call:
lm(formula = y ~ X - 1, weights = w)
Weighted Residuals:
Min 1Q Median 3Q Max
-2.3332 -0.3982 0.1067 0.5998 1.6072
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.18723 0.13513 1.386 0.18281
results_merged$family_pmp -0.15645 0.05377 -2.909 0.00935 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.9982 on 18 degrees of freedom
Multiple R-squared: 0.5945, Adjusted R-squared: 0.5495
F-statistic: 13.2 on 2 and 18 DF, p-value: 0.0002963
It turns out that family policies do have the expected effect: Where countries spend more on public family policies, the gender gap is significantly larger with women being further to the left than men.
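To give an idea of what the weighting in the second step does, here is a rough sketch. It is my own reconstruction of the general idea and not the authors' code. It assumes that each country's weight is 1/(omega_i + sigma^2), where omega_i is the sampling variance from step one and sigma^2 is the variance of the second-step error, estimated from the residuals of an unweighted first pass. The function name fgls_sketch and the toy data are hypothetical, and the variance formula should be checked against Lewis and Linzer (2005) before relying on it.

```r
# Sketch of a feasible GLS second step (hypothetical helper, toy data)
fgls_sketch <- function(y, x, omegasq) {
  X   <- cbind(1, x)                             # design matrix with intercept
  ols <- lm.fit(X, y)                            # unweighted first pass
  G   <- diag(omegasq, nrow = length(omegasq))   # step-one sampling variances
  # Correction term for the sampling noise absorbed by the OLS fit
  adj <- sum(diag(solve(crossprod(X)) %*% t(X) %*% G %*% X))
  # Estimated variance of the second-step error, bounded below at zero
  sigmasq <- max(0, (sum(ols$residuals^2) - sum(omegasq) + adj) / ols$df.residual)
  w <- 1 / (omegasq + sigmasq)                   # noisier estimates count less
  lm(y ~ x, weights = w)
}

# Toy data: 20 "countries" with known sampling variances
set.seed(1)
x       <- runif(20, 0, 4)
omegasq <- runif(20, 0.01, 0.05)
y       <- 0.2 - 0.15 * x + rnorm(20, sd = 0.1) + rnorm(20, sd = sqrt(omegasq))

fit <- fgls_sketch(y, x, omegasq)
coef(fit)
```

If sigmasq is estimated as zero, the weights reduce to plain inverse-variance weights. The larger sigmasq is relative to the omegas, the closer the result gets to unweighted OLS.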
A more intuitive way to present this result is to visualize the relationship in a graph (using naive linear regression):
results_merged %>%
  ggplot(aes(x = family_pmp, y = betas)) +
  geom_smooth(method = "lm", linetype = "dashed", color = "gray",
              alpha = .4) +
  geom_text(aes(label = cntry)) +
  labs(x = "Public spending on family policies (%GDP)",
       y = "Political gender gap\n(Est. gender difference in left-right placement)",
       caption = "95% confidence intervals based on naive OLS regression.") +
  theme_classic()
The graph nicely illustrates that public policies can account for quite a bit of the between-country differences in the political gender gap: where countries spend more on family policies (as in Denmark or Sweden), women are further to the left than men. But where countries spend relatively little, as in Spain or Lithuania, the gap is smaller or even reversed. That said, the Netherlands and Switzerland seem to be outliers, where women are quite clearly to the left of men despite a relatively low level of public investment in family policies.
Obviously, this technique can also be used to study other interesting patterns such as the gender pay gap or non-gender related patterns that differ between countries.
[1] This is a relatively recent pattern. A few decades ago, women used to be more conservative than men, on average. See e.g., Inglehart, R. and Norris, P. (2000). The developmental theory of the gender gap: Women’s and men’s voting behavior in global perspective. International Political Science Review, 21(4):441–463.
[2] See e.g., the article by Franzese on the (dis-)advantages of different methods to work with multilevel data: Franzese, R. (2005). Empirical Strategies for Various Manifestations of Multilevel Data. Political Analysis, 13(4), 430–446.