Gapminder & Brexit Data Analysis
Task 1: gapminder country comparison
The gapminder dataset is imported, and this has data on life expectancy, population, and GDP per capita for 142 countries from 1952 to 2007. Below is a ‘glimpse’ of the data frame, as well as the first 20 lines within the ‘table’.
glimpse(gapminder)
## Rows: 1,704
## Columns: 6
## $ country <fct> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", …
## $ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, …
## $ year <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, …
## $ lifeExp <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.854, 40.8…
## $ pop <int> 8425333, 9240934, 10267083, 11537966, 13079460, 14880372, 12…
## $ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 786.1134, …
head(gapminder, 20) # look at the first 20 rows of the dataframe
## # A tibble: 20 × 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 1952 28.8 8425333 779.
## 2 Afghanistan Asia 1957 30.3 9240934 821.
## 3 Afghanistan Asia 1962 32.0 10267083 853.
## 4 Afghanistan Asia 1967 34.0 11537966 836.
## 5 Afghanistan Asia 1972 36.1 13079460 740.
## 6 Afghanistan Asia 1977 38.4 14880372 786.
## 7 Afghanistan Asia 1982 39.9 12881816 978.
## 8 Afghanistan Asia 1987 40.8 13867957 852.
## 9 Afghanistan Asia 1992 41.7 16317921 649.
## 10 Afghanistan Asia 1997 41.8 22227415 635.
## 11 Afghanistan Asia 2002 42.1 25268405 727.
## 12 Afghanistan Asia 2007 43.8 31889923 975.
## 13 Albania Europe 1952 55.2 1282697 1601.
## 14 Albania Europe 1957 59.3 1476505 1942.
## 15 Albania Europe 1962 64.8 1728137 2313.
## 16 Albania Europe 1967 66.2 1984060 2760.
## 17 Albania Europe 1972 67.7 2263554 3313.
## 18 Albania Europe 1977 68.9 2509048 3533.
## 19 Albania Europe 1982 70.4 2780097 3631.
## 20 Albania Europe 1987 72 3075321 3739.
This data can be used to produce two graphs of how life expectancy has changed over the years for the country and the continent I come from - in this case the United Kingdom and Europe. We will now produce three graphs which show how the life expectancy (for different regions) evolves over time, starting by filtering the data for these specific regions.
country_data <- gapminder %>%
filter(country == "United Kingdom")
continent_data <- gapminder %>%
filter(continent == "Europe")
Life Expectancy in the United Kingdom over time.
First, we plot of life expectancy over time for a single country of choice - the United Kingdom. This is done by mapping year on the x-axis, and lifeExp on the y-axis. We use ‘geom_smooth’ to incorporate a line of best fit, as well as ‘labs’ to label the graph.
plot1 <- ggplot(data = country_data, mapping = aes(x = year, y = lifeExp))+
geom_point() +
geom_smooth(se = FALSE)+
labs(title = "Average UK Life Expectancy over time.",
x = "Year",
y = "Average Life Expectancy") +
NULL
plot1
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

As we can see, the life expectancy in the UK has been gradually increasing since the 1950s, following what seems to be (close to) a linear progression.
Life Expectancy in Europe over time.
Secondly, we produce a plot for all countries in Europe. This is done by mapping the country variable to the colour aesthetic. We also map country to the group aesthetic, so that all points for each country are grouped together.
ggplot(continent_data, mapping = aes(x = year , y = lifeExp , colour = country , group = country))+
geom_line() +
geom_smooth(se = FALSE) +
labs(title = "Average Life Expectancy over time for European Countries.",
x = "Year",
y = "Average Life Expectancy") +
NULL
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Although this graph is hard to interperet, due to the large quantity of data, we can notably see that all European countries have seen an increase in life expectancy. The most notable outlier, Turkey, started with a particularly low life expactancy (less than 45) in 1950, but has now caught up with the rest of the pack.
Life Expectancy in different continents over time.
Finally, using the original gapminder data, we produce a life expectancy over time graph, grouped (or faceted) by continent - using the ‘facet_wrap’ function. The legend is removed by adding the theme(legend.position="none") in the end of our ggplot.
ggplot(data = gapminder , mapping = aes(x = year , y = lifeExp , colour = continent ))+
geom_point() +
geom_smooth(se = FALSE) +
facet_wrap(~continent) +
theme(legend.position="none") + #remove all legends
labs(title = "Average Life Expectancy over time for different Continents.",
x = "Year",
y = "Average Life Expectancy") +
NULL
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

The continents that have seen the sharpest rises in life expectancy since 1952 are those which contain emerging economies. Countries that have exhibited substantial economic growth in this period, such as Brazil, China, India, Indonesia, Malaysia and Mexico, are typically located in Asia or the Americas. This is likely why these continents seem to have exhibited the highest growth, as the economic development within their countries has led to improved healthcare, technology, diet and living standards.
Growth in Africa followed an upward trend between 1950 - 1990 that exhibited a similar gradient to other fast-growing continents. But this increase levelled off shortly after, which is likely due to limited economic power (which is enhanced by major corruption in many of these countries), as well as poor healthcare and many conflicts/wars on this continent causing many young deaths.
As expected, continents whose economic growth primarily accelerated before the 1950s - namely Europe and Oceania - exhibit a smaller increase in life expectancy during the above time-frame.
Task 2: Brexit vote analysis
We will have a look at the results of the 2016 Brexit vote in the UK. First we read the data using read_csv() and have a quick glimpse at the data (shown below).
brexit_results <- read_csv("../../data/brexit_results.csv")
glimpse(brexit_results)
## Rows: 632
## Columns: 11
## $ Seat <chr> "Aldershot", "Aldridge-Brownhills", "Altrincham and Sale W…
## $ con_2015 <dbl> 50.592, 52.050, 52.994, 43.979, 60.788, 22.418, 52.454, 22…
## $ lab_2015 <dbl> 18.333, 22.369, 26.686, 34.781, 11.197, 41.022, 18.441, 49…
## $ ld_2015 <dbl> 8.824, 3.367, 8.383, 2.975, 7.192, 14.828, 5.984, 2.423, 1…
## $ ukip_2015 <dbl> 17.867, 19.624, 8.011, 15.887, 14.438, 21.409, 18.821, 21.…
## $ leave_share <dbl> 57.89777, 67.79635, 38.58780, 65.29912, 49.70111, 70.47289…
## $ born_in_uk <dbl> 83.10464, 96.12207, 90.48566, 97.30437, 93.33793, 96.96214…
## $ male <dbl> 49.89896, 48.92951, 48.90621, 49.21657, 48.00189, 49.17185…
## $ unemployed <dbl> 3.637000, 4.553607, 3.039963, 4.261173, 2.468100, 4.742731…
## $ degree <dbl> 13.870661, 9.974114, 28.600135, 9.336294, 18.775591, 6.085…
## $ age_18to24 <dbl> 9.406093, 7.325850, 6.437453, 7.747801, 5.734730, 8.209863…
The data comes from Elliott Morris, who cleaned it and made it available through his DataCamp class on analysing election and polling data in R.
Our main outcome variable (or y) is leave_share, which is the percent of votes cast in favour of Brexit, or leaving the EU. Each row is a UK parliament constituency.
To get a sense of the spread, or distribution, of the data, we can plot a histogram, a density plot, and the empirical cumulative distribution function of the leave % in all constituencies.
# histogram
ggplot(brexit_results, aes(x = leave_share)) +
geom_histogram(binwidth = 2.5) +
labs(title = "Brexit Leave Share - Histogram",
x = "Leave Share [%]",
y = "Frequency of Constituencies",
subtitle = "Figure 3.1")

# density plot-- think smoothed histogram
ggplot(brexit_results, aes(x = leave_share)) +
geom_density()+
labs(title = "Brexit Leave Share - Density Plot",
x = "Leave Share [%]",
y = "Frequency of Constituencies",
subtitle = "Figure 3.2")

# The empirical cumulative distribution function (ECDF)
ggplot(brexit_results, aes(x = leave_share)) +
stat_ecdf(geom = "step", pad = FALSE) +
scale_y_continuous(labels = scales::percent)+
labs(title = "Brexit Leave Share - Cumulative Distribution Function Plot",
x = "Leave Share [%]",
y = "Frequency of Constituencies",
subtitle = "Figure 3.3")

One common explanation for the Brexit outcome was fear of immigration and opposition to the EU’s more open border policy. We can check the relationship (or correlation) between the proportion of native born residents (born_in_uk) in a constituency and its leave_share. To do this, let us get the correlation between the two variables. This can be done using the ‘cor’ function.
brexit_results %>%
select(leave_share, born_in_uk) %>%
cor()
## leave_share born_in_uk
## leave_share 1.0000000 0.4934295
## born_in_uk 0.4934295 1.0000000
The correlation is almost 0.5, which shows that the two variables are positively correlated - although the correlation could be stronger (up to a potential value of +1).
We can also create a scatterplot between these two variables using geom_point. We also add the best fit line, using geom_smooth(method = "lm").
ggplot(brexit_results, aes(x = born_in_uk, y = leave_share)) +
geom_point(alpha=0.3) +
# add a smoothing line, and use method="lm" to get the best straight-line
geom_smooth(method = "lm") +
# use a white background and frame the plot with a black box
theme_bw() +
labs(title = "Relationship between the proportion of native born residents and leave share.",
x = "Born in the UK [%]",
y = "Leave Share [%]",
subtitle = "Figure 3.4")
## `geom_smooth()` using formula 'y ~ x'

This relationship clearly shows that constituencies with a larger percentage of native British people are more likely to have a larger leave share. This is expected as those born in the UK are expected to feel more nationalistic when it comes to the key issues of Brexit - namely, immigration, pressure on the NHS and other public services, etc.
This task was extrapolated further in AM01 - Applied Statistics Homework 3, where for every constituency, we plotted leave share (%) against the percentage (%) of votes for the UK’s main political parties in the 2015 general election. This allows us to examine the potentiaal correlation between voting for a particular political party and voting for Brexit.
This involved using the ‘select’ and ‘pivot_longer’ function to group data by the main political parties (Labour, Conservatives, UKIP and Liberal Democrat). By pivoting the table, we can easily plot the leave percentage against the party vote percentage for each party within each constituency. We fit a line of best fit to the data for each political party using geom_smooth - which allows for useful comparison and prediction.
temp_brexit <- brexit_results %>%
select(con_2015, lab_2015, ld_2015, ukip_2015, leave_share) %>%
pivot_longer(cols = 1:4, names_to = "party", values_to = "party_percentage")
cols <- c("con_2015" = "#0087dc", "lab_2015" = "#d50000", "ld_2015" = "#FDBB30", "ukip_2015" = "#EFE600")
temp_brexit %>%
ggplot(aes(x=party_percentage,y=leave_share, group=party, color=party)) +
geom_point(size = 0.5, alpha = 0.2)+
geom_smooth(method=lm, size = 0.5)+
labs(title = "How political affiliation translated to Brexit Voting", x = "Party % in the UK 2015 general election", y = "Leave % in the 2016 Brexit referendum")+
scale_colour_manual(labels = c("Conservative", "Labour","Lib Dems","UKIP"), values = cols)+
theme_bw()+
theme(legend.position = "bottom",plot.title = element_text(face = "bold", size = 13))+
theme(legend.title = element_blank())+
NULL
## `geom_smooth()` using formula 'y ~ x'

The graph here can be used to determine whether there is a correlation between voting for a particular political party and voting for Brexit. As we can see there is a strong positive correlation between UKIP voting and leave voting - this is expected as the values of UKIP are closely aligned with those of Brexit. Although weaker, there is a negative correlation between voting for the liberal democrats and voting for leave - indicating that Liberal Democrat voters are more likely to be pro-European.
Finally, we can see that the correlations for the main two political parties were fairly weak at the time of voting. There is a very slight positive correlation associated with voting Brexit and voting Conservative, along with a very weak negative correlation associated with voting Brexit and voting Labour. The explanation of this could probably fill its own blog, but ultimately, it shows that Dominic Cummings and the ‘Vote Leave’ team were successful in getting the issues associated with Brexit to transcend traditional political ideology. By making issues such as the NHS, fishing rights and migrant’s (realistically minimal) impact on wages and public services central to the campaign, Vote Leave was able to convince both traditionally eurosceptic Conservative voters and poor, marginalised Labour voters that leaving the EU would be beneficial.