DataScience+ An online community for showcasing R & Python tutorials. It operates as a networking platform for data scientists to promote their talent and get hired. Our mission is to empower data scientists by bridging the gap between talent and opportunity.

Map the Life Expectancy in the United States with Data from Wikipedia

Recently, I became interested in scraping data from web pages, such as Wikipedia, and visualizing it with R. As in my previous post, I use the rvest package to get the data from the web page and the ggplot2 package to visualize it.

In this post, I will map the life expectancy of White and African-American people in the US.

Load the required packages.

## LOAD THE PACKAGES ####
library(rvest)
library(ggplot2)
library(dplyr)
library(scales)
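
Two packages are used later without being loaded explicitly: map_data("state") relies on the maps package and coord_map() on the mapproj package. The sketch below checks which packages are missing before running the rest of the post. The variable names needed, installed and to_install are only illustrative.

```r
# Packages used in this post, including the two indirect dependencies
needed = c("rvest", "ggplot2", "dplyr", "scales", "maps", "mapproj")

# requireNamespace() returns FALSE when a package is not installed
installed = vapply(needed, requireNamespace, logical(1), quietly = TRUE)
to_install = needed[!installed]

if (length(to_install) > 0) {
  message("Install first: ", paste(to_install, collapse = ", "))
}
```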

Import the data from Wikipedia.

## LOAD THE DATA ####
le = read_html("https://en.wikipedia.org/wiki/List_of_U.S._states_by_life_expectancy")

le = le %>%
  html_nodes("table") %>%
  # keep the second table on the page (its position may change if the article is edited)
  .[[2]] %>%
  html_table(fill = TRUE)
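
To see what html_nodes("table") and [[2]] do without depending on the live Wikipedia page, here is a minimal sketch on a toy HTML string. The page content is made up for illustration (only the two life expectancy values are taken from the output shown later), and the names page, tables and demo are assumptions.

```r
library(rvest)

# A toy page with two tables; only the second one holds the data we want
page = read_html('<html><body>
  <table><tr><th>Note</th></tr><tr><td>navigation box</td></tr></table>
  <table>
    <tr><th>State</th><th>LE</th></tr>
    <tr><td>Hawaii</td><td>81.3</td></tr>
    <tr><td>Alabama</td><td>75.4</td></tr>
  </table>
</body></html>')

# html_nodes() returns every <table> on the page, in document order
tables = page %>% html_nodes("table")
length(tables)

# pick the second table and convert it to a data frame
demo = tables[[2]] %>% html_table()
demo
```

If the scraped data frame looks wrong, checking length(tables) and inspecting each element in turn is a quick way to find the right index.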

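Now I have to clean the data. The comments in the code explain what each line does. The output of the two str() calls follows the code; note that the second output was captured after the later steps, so it already lists the le_diff and region columns created further down.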

## CLEAN THE DATA ####
# check the structure of the dataset
str(le)
# keep only the first eight columns, which hold the data
le = le[c(1:8)]
# use the values in the 3rd row as column names
names(le) = unlist(le[3, ])
# drop the header rows (1 to 3) and the columns I am not interested in (5 to 7)
le = le[-c(1:3), ]
le = le[, -c(5:7)]
# rename the 4th and 5th columns
names(le)[c(4,5)] = c("le_black", "le_white")
# convert to numeric; entries without data ("-") become NA with a warning
le = le %>% 
  mutate(
    le_black = as.numeric(le_black), 
    le_white = as.numeric(le_white))
# check the structure of the dataset again
str(le)
'data.frame':	54 obs. of  417 variables:
 $ X1  : chr  "" "Rank\nState\nLife Expectancy, All\n(in years)\nLife Expectancy, African American\n(in years)\nLife Expectancy, Asian American\n"| __truncated__ "Rank" "1" ...
 $ X2  : chr  NA "Rank" "State" "Hawaii" ...
 $ X3  : chr  NA "State" "Life Expectancy, All\n(in years)" "81.3" ...
 $ X4  : chr  NA "Life Expectancy, All\n(in years)" "Life Expectancy, African American\n(in years)" "-" ...
 $ X5  : chr  NA "Life Expectancy, African American\n(in years)" "Life Expectancy, Asian American\n(in years)" "82.0" ...
 $ X6  : chr  NA "Life Expectancy, Asian American\n(in years)" "Life Expectancy, Latino\n(in years)" "76.8" ...
 $ X7  : chr  NA "Life Expectancy, Latino\n(in years)" "Life Expectancy, Native American\n(in years)" "-" ...
.....
.....

'data.frame':	51 obs. of  7 variables:
 $ Rank                            : chr  "1" "2" "3" "4" ...
 $ State                           : chr  "Hawaii" "Minnesota" "Connecticut" "California" ...
 $ Life Expectancy, All
(in years): chr  "81.3" "81.1" "80.8" "80.8" ...
 $ le_black                        : num  NA 79.7 77.8 75.1 78.8 77.4 NA NA 75.5 NA ...
 $ le_white                        : num  80.4 81.2 81 79.8 80.4 80.5 80.4 80.1 80.3 80.1 ...
 $ le_diff                         : num  NA 1.5 3.2 4.7 1.6 ...
 $ region                          : chr  "hawaii" "minnesota" "connecticut" "california" ...
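
The NA values in le_black come from the "-" placeholders in the scraped table. A minimal sketch of that conversion, using a made-up vector named raw:

```r
# Values as scraped: numbers stored as text, "-" where there is no data
raw = c("79.7", "-", "75.1")

# as.numeric() cannot parse "-", so it returns NA for it (and warns)
clean = suppressWarnings(as.numeric(raw))
clean
is.na(clean)
```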

Since life expectancy differs between White and African-American people, I will calculate the difference for each state and map it.

le = le %>% mutate(le_diff = (le_white - le_black))
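
As a quick check of this step, the sketch below repeats the calculation on four states, with the values copied from the str() output above, and sorts by the gap. The data frame name toy is an assumption. States without African-American data keep NA in le_diff.

```r
library(dplyr)

# First four states, values taken from the str() output of le
toy = data.frame(
  State    = c("Hawaii", "Minnesota", "Connecticut", "California"),
  le_black = c(NA, 79.7, 77.8, 75.1),
  le_white = c(80.4, 81.2, 81.0, 79.8)
)

# difference in years, largest gap first; NA rows are sorted last
toy = toy %>%
  mutate(le_diff = le_white - le_black) %>%
  arrange(desc(le_diff))
toy
```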

I will load the map data and merge the two datasets together.

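## LOAD THE MAP DATA ####
states = map_data("state")
str(states)
# create a lower-case state name that matches the region column of the map data
le$region = tolower(le$State)
# merge the datasets, keeping every row of the map data
states = merge(states, le, by = "region", all.x = TRUE)
# merge() does not guarantee row order, so restore the polygon drawing order
states = states[order(states$order), ]
str(states)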
'data.frame':	15537 obs. of  6 variables:
 $ long     : num  -87.5 -87.5 -87.5 -87.5 -87.6 ...
 $ lat      : num  30.4 30.4 30.4 30.3 30.3 ...
 $ group    : num  1 1 1 1 1 1 1 1 1 1 ...
 $ order    : int  1 2 3 4 5 6 7 8 9 10 ...
 $ region   : chr  "alabama" "alabama" "alabama" "alabama" ...
 $ subregion: chr  NA NA NA NA ...

'data.frame':	15537 obs. of  12 variables:
 $ region                          : chr  "alabama" "alabama" "alabama" "alabama" ...
 $ long                            : num  -87.5 -87.5 -87.5 -87.5 -87.6 ...
 $ lat                             : num  30.4 30.4 30.4 30.3 30.3 ...
 $ group                           : num  1 1 1 1 1 1 1 1 1 1 ...
 $ order                           : int  1 2 3 4 5 6 7 8 9 10 ...
 $ subregion                       : chr  NA NA NA NA ...
 $ Rank                            : chr  "49" "49" "49" "49" ...
 $ State                           : chr  "Alabama" "Alabama" "Alabama" "Alabama" ...
 $ Life Expectancy, All
(in years): chr  "75.4" "75.4" "75.4" "75.4" ...
 $ le_black                        : num  72.9 72.9 72.9 72.9 72.9 72.9 72.9 72.9 72.9 72.9 ...
 $ le_white                        : num  76 76 76 76 76 76 76 76 76 76 ...
 $ le_diff                         : num  3.1 3.1 3.1 3.1 3.1 ...
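
With all.x = TRUE every row of the map data is kept, and states without a match in le get NA. Rows of le whose region does not appear in the map data are dropped silently (map_data("state") has no polygons for Alaska and Hawaii, for example). A minimal sketch with toy data frames, whose names map_toy, le_toy and merged are assumptions:

```r
# Toy map data: two polygon points per state, with a drawing order
map_toy = data.frame(
  region = c("alabama", "alabama", "texas", "texas"),
  order  = 1:4
)
# Toy life expectancy data: hawaii has no polygon, texas is left out on purpose
le_toy = data.frame(
  region   = c("alabama", "hawaii"),
  le_black = c(72.9, NA)
)

# left join on region, then restore the drawing order
merged = merge(map_toy, le_toy, by = "region", all.x = TRUE)
merged = merged[order(merged$order), ]
merged

# regions in the life expectancy data that found no match in the map data
setdiff(le_toy$region, map_toy$region)
```

Running the same setdiff() on le$region and the region column of the map data shows which states will not appear on the map.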

Now it's time to make the plot. First I will plot the life expectancy of African Americans in the US. For a few states we don't have data, so I will color them grey.

## MAKE THE PLOT ####

# Life expectancy in African American
ggplot(states, aes(x = long, y = lat, group = group, fill = le_black)) + 
  geom_polygon(color = "white") +
  scale_fill_gradient(name = "Years", low = "#ffe8ee", high = "#c81f49", guide = "colorbar", na.value="#eeeeee", breaks = pretty_breaks(n = 5)) +
  labs(title="Life expectancy in African American") +
  coord_map()

Here is the plot:
[Figure: Le_african_american]
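
The breaks argument in the code above controls the tick marks on the legend. pretty_breaks(n = 5) does not return the breaks themselves but a function that computes roughly five round values for whatever range it is given. A small sketch, where the range 72.9 to 81.3 is only an illustrative choice:

```r
library(scales)

# pretty_breaks() returns a function; calling it on a range yields the breaks
make_breaks = pretty_breaks(n = 5)
legend_breaks = make_breaks(c(72.9, 81.3))
legend_breaks
```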

The code below does the same for White people in the US.

# Life expectancy in White American
ggplot(states, aes(x = long, y = lat, group = group, fill = le_white)) + 
  geom_polygon(color = "white") +
  scale_fill_gradient(name = "Years", low = "#ffe8ee", high = "#c81f49", guide = "colorbar", na.value="Gray", breaks = pretty_breaks(n = 5)) +
  labs(title="Life expectancy in White") +
  coord_map()

Here is the plot:
[Figure: Le_white]

Finally, I will map the differences between White and African-American people in the US.

# Differences in Life expectancy between White and African American
ggplot(states, aes(x = long, y = lat, group = group, fill = le_diff)) + 
  geom_polygon(color = "white") +
  scale_fill_gradient(name = "Years", low = "#ffe8ee", high = "#c81f49", guide = "colorbar", na.value="#eeeeee", breaks = pretty_breaks(n = 5)) +
  labs(title="Differences in Life Expectancy between \nWhite and African Americans by States in US") +
  coord_map()

Here is the plot:
[Figure: Le_differences]
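
The three maps differ only in the fill column and the title, so the repeated code could be wrapped in a small function. This is a sketch, not part of the original analysis: plot_le is a hypothetical helper, and the one-square data frame is toy data used only to show the call. Rendering the result needs the mapproj package because of coord_map().

```r
library(ggplot2)
library(scales)

# plot_le() is a hypothetical helper: same layers as the maps above,
# with the fill column passed as a string
plot_le = function(data, fill_col, title) {
  ggplot(data, aes(x = long, y = lat, group = group, fill = .data[[fill_col]])) +
    geom_polygon(color = "white") +
    scale_fill_gradient(name = "Years", low = "#ffe8ee", high = "#c81f49",
                        guide = "colorbar", na.value = "#eeeeee",
                        breaks = pretty_breaks(n = 5)) +
    labs(title = title) +
    coord_map()
}

# Toy data: a single square "state"; with the real data, pass states instead
square = data.frame(
  long = c(-90, -89, -89, -90), lat = c(30, 30, 31, 31),
  group = 1, le_diff = 3.1
)
p = plot_le(square, "le_diff", "Toy example")
```

With the merged data from this post, the calls would look like plot_le(states, "le_black", "Life expectancy in African American").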

On my previous post I got a comment asking for a pop-up effect when hovering over the states. This is a simple task, as Andrea explained in his comment. All you have to do is install the plotly package, store the ggplot in an object, and then pass that object to the function ggplotly(map_plot).

library(plotly)
map_plot = ggplot(states, aes(x = long, y = lat, group = group, fill = le_black)) + 
  geom_polygon(color = "white") +
  scale_fill_gradient(name = "Years", low = "#ffe8ee", high = "#c81f49", guide = "colorbar", na.value="#eeeeee", breaks = pretty_breaks(n = 5)) +
  labs(title="Life expectancy in African American") +
  coord_map()
ggplotly(map_plot)

Here is the plot:
[Figure: le_plotly]
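
The interactive plot can also be saved as an HTML file to share or embed. A minimal sketch, assuming the htmlwidgets package (installed together with plotly) and a toy polygon in place of the map; the names square, toy_plot, widget and out_file are illustrative.

```r
library(ggplot2)
library(plotly)

# Toy polygon standing in for the map; with the real data use map_plot
square = data.frame(
  long = c(0, 1, 1, 0), lat = c(0, 0, 1, 1),
  group = 1, le_black = 72.9
)
toy_plot = ggplot(square, aes(x = long, y = lat, group = group, fill = le_black)) +
  geom_polygon(color = "white")

# convert the ggplot object to an interactive plotly widget
widget = ggplotly(toy_plot)

# selfcontained = FALSE avoids the pandoc dependency; assets go to a side folder
out_file = file.path(tempdir(), "le_plotly.html")
htmlwidgets::saveWidget(widget, out_file, selfcontained = FALSE)
file.exists(out_file)
```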

That's all! Leave a comment below if you have any questions.