# Analyzing Obesity across USA

• Published on July 12, 2017 at 8:48 am

The main aim of this project is to find the US states with the highest rates of obesity among adults and among children and teens. A second objective is to learn how to scrape data from an HTML page in R using the rvest package, and how to generate beautiful maps using the ggplot2 and maps packages. A similar post was published earlier at DataScience+.

library(rvest)
#rvest is the package to scrape Web pages in R
# ?rvest to search more on this package

library(ggplot2)
library(dplyr)
library(scales)
library(maps)

Now we scrape the data from the Wikipedia article and convert it to an R data frame.

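#Loading the Data
#wiki_url is a placeholder: it must hold the address of the Wikipedia article
#with the obesity table (the address is not given in this post)
obesity = read_html(wiki_url)

?read_html #for knowing more on this function

#html_nodes() to select a particular HTML element from the above page
#xpath of the Wikipedia table data
#Converting to a R dataframe
obesity = obesity %>%
html_nodes(xpath='//*[@id="mw-content-text"]/div/table[2]') %>%
.[[1]] %>%
html_table(fill=TRUE)

head(obesity)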

##   State and District of Columbia Obese adults
## 1                        Alabama        30.1%
## 2                         Alaska        27.3%
## 3                        Arizona        23.3%
## 4                       Arkansas        28.1%
## 5                     California        23.1%
##   Overweight (incl. obese) adults Obese children and adolescents
## 1                           65.4%                          16.7%
## 2                           64.5%                          11.1%
## 3                           59.5%                          12.2%
## 4                           64.7%                          16.4%
## 5                           59.4%                          13.2%
## 6                           55.0%                           9.9%
##   Obesity rank
## 1            3
## 2           14
## 3           40
## 4            9
## 5           41
## 6           51
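
To try the scraping steps without a network connection, here is a minimal sketch that parses a small HTML table from a string. The HTML snippet and the names `toy_page` and `toy_table` are made up for illustration; only `read_html()`, `html_node()` and `html_table()` come from rvest.

```r
library(rvest)

# A tiny, made-up HTML page holding one table (illustrative only)
toy_page = read_html('<html><body>
<table>
<tr><th>State</th><th>Obese adults</th></tr>
<tr><td>Alabama</td><td>30.1%</td></tr>
<tr><td>Alaska</td><td>27.3%</td></tr>
</table>
</body></html>')

# Select the first <table> node by XPath and convert it to a data frame
toy_table = toy_page %>%
  html_node(xpath = "//table") %>%
  html_table()

print(toy_table)
```

The percentages come back as character strings because of the trailing "%", which is why a cleaning step is needed before plotting.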



## Data Transformation

#to check the structure of the data
str(obesity)
## 'data.frame':    51 obs. of  5 variables:
##  $ State and District of Columbia : chr  "Alabama" "Alaska" "Arizona" "Arkansas" ...
##  $ Obese adults                   : chr  "30.1%" "27.3%" "23.3%" "28.1%" ...
##  $ Overweight (incl. obese) adults: chr  "65.4%" "64.5%" "59.5%" "64.7%" ...
##  $ Obese children and adolescents : chr  "16.7%" "11.1%" "12.2%" "16.4%" ...
##  $ Obesity rank                   : int  3 14 40 9 41 51 49 43 22 39 ...

We need to remove the ‘%’ from the data and convert it to numeric data type to draw plots using ggplot package.

#removing the % and making the data numeric
for(i in 2:4){
obesity[,i] = gsub("%", "", obesity[,i])
obesity[,i] = as.numeric(obesity[,i])
}
str(obesity)
## 'data.frame':    51 obs. of  5 variables:
##  $ State and District of Columbia : chr  "Alabama" "Alaska" "Arizona" "Arkansas" ...
##  $ Obese adults                   : num  30.1 27.3 23.3 28.1 23.1 21 20.8 22.1 25.9 23.3 ...
##  $ Overweight (incl. obese) adults: num  65.4 64.5 59.5 64.7 59.4 55 58.7 55 63.9 60.8 ...
##  $ Obese children and adolescents : num  16.7 11.1 12.2 16.4 13.2 9.9 12.3 14.8 22.8 14.4 ...
##  $ Obesity rank                   : int  3 14 40 9 41 51 49 43 22 39 ...
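
The same cleaning can also be written without an explicit loop. The sketch below is an alternative, not the approach used in this post: it uses a small made-up data frame (`toy`) and a hypothetical helper (`pct_to_num`) applied to several columns at once with `lapply()`.

```r
# Made-up data frame with percentages stored as text
toy = data.frame(
  state = c("Alabama", "Alaska"),
  adults = c("30.1%", "27.3%"),
  children = c("16.7%", "11.1%"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: strip the literal "%" and convert to numeric
pct_to_num = function(x) as.numeric(gsub("%", "", x, fixed = TRUE))

# Apply the helper to every percentage column in one step
pct_cols = c("adults", "children")
toy[pct_cols] = lapply(toy[pct_cols], pct_to_num)

str(toy)
```

Selecting the columns by name rather than by position (`2:4`) keeps the code working if the column order of the scraped table changes.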



Now we fix the column names to remove the spaces in them.

#Fixing the names to remove spaces

names(obesity)
## [1] "State and District of Columbia"  "Obese adults"
## [3] "Overweight (incl. obese) adults" "Obese children and adolescents"
## [5] "Obesity rank"



We will use make.names to remove spaces and make syntactically valid names.

names(obesity) = make.names(names(obesity))
names(obesity)
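## [1] "State.and.District.of.Columbia"  "Obese.adults"
## [3] "Overweight..incl..obese..adults" "Obese.children.and.adolescents"
## [5] "Obesity.rank"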
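
As a quick illustration of what `make.names()` does, the sketch below runs it on a few of the column names from this table. The vectors `raw_names`, `clean_names` and `deduped` are names chosen for this example only.

```r
# Every character that is not valid in an R name (spaces, brackets) becomes "."
raw_names = c("Obese adults", "Overweight (incl. obese) adults", "Obesity rank")
clean_names = make.names(raw_names)
print(clean_names)

# unique = TRUE additionally guards against two names collapsing to the same result
deduped = make.names(c("a b", "a.b"), unique = TRUE)
print(deduped)
```

Valid names matter here because columns such as `Obese.adults` are later referenced directly inside `aes()` and `summarise()`.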



### Loading the Map data for USA and merging it with Obesity Data frame

The maps package provides geographical data for many countries and for the world. map_data() from ggplot2 converts this data into a data frame of longitudes and latitudes, which can then be used to draw maps in R.

#Loading the map data-----------------

states = map_data("state")
?map_data #to read more on using this function

# create a new variable region for state
obesity$region = tolower(obesity$State.and.District.of.Columbia)

#merging the datasets
states = merge(states, obesity, by="region", all.x=TRUE)
#merge() does not guarantee the row order, so restore the polygon drawing order
states = states[order(states$order), ]
str(states)
## 'data.frame':    15537 obs. of  11 variables:
##  $ region                         : chr  "alabama" "alabama" "alabama" "alabama" ...
##  $ long                           : num  -87.5 -87.5 -87.5 -87.5 -87.6 ...
##  $ lat                            : num  30.4 30.4 30.4 30.3 30.3 ...
##  $ group                          : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ order                          : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ subregion                      : chr  NA NA NA NA ...
##  $ State.and.District.of.Columbia : chr  "Alabama" "Alabama" "Alabama" "Alabama" ...
##  $ Obese.adults                   : num  30.1 30.1 30.1 30.1 30.1 30.1 30.1 30.1 30.1 30.1 ...
##  $ Overweight..incl..obese..adults: num  65.4 65.4 65.4 65.4 65.4 65.4 65.4 65.4 65.4 65.4 ...
##  $ Obese.children.and.adolescents : num  16.7 16.7 16.7 16.7 16.7 16.7 16.7 16.7 16.7 16.7 ...
##  $ Obesity.rank                   : int  3 3 3 3 3 3 3 3 3 3 ...
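
To see how a left join with `merge()` behaves, here is a minimal sketch on two made-up data frames (`shape` and `rates` are illustrative names, not objects from this post). It shows that `all.x = TRUE` keeps every map row, that regions without a match get `NA`, and that sorting by the `order` column afterwards restores the sequence in which polygon points must be drawn.

```r
# Made-up outline points for two regions, deliberately not sorted by region
shape = data.frame(
  region = c("b", "b", "a", "a"),
  order = c(3, 4, 1, 2),
  stringsAsFactors = FALSE
)

# Made-up rates: region "b" has no entry
rates = data.frame(region = "a", rate = 30.1, stringsAsFactors = FALSE)

# Left join: keep all rows of shape, fill missing rates with NA
joined = merge(shape, rates, by = "region", all.x = TRUE)

# merge() does not promise to keep the original row order, so sort explicitly
joined = joined[order(joined$order), ]

print(joined)
```

In the maps of this post, regions whose rate is `NA` would be drawn in the colour given by `na.value`.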



## Visualizing the Data

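#a data frame for adding Names to the states on the Map- making a new data frame
#one row per state: label position (middle of the bounding box) and obesity rates

statenames = states %>%
group_by(region) %>%
summarise(
long = mean(range(long)),
lat = mean(range(lat)),
group = mean(group),
Obese.adults = mean(Obese.adults),
Obese.children.and.adolescents = mean(Obese.children.and.adolescents)
)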
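
The label position used here, `mean(range(x))`, is the midpoint of a state's bounding box rather than the average of all its outline points. The sketch below, on a made-up outline (`outline` and `label_pos` are illustrative names), shows the difference between the two.

```r
# Made-up outline with one extra point that pulls the plain average off-centre
outline = data.frame(
  long = c(0, 4, 4, 0, 1),
  lat = c(0, 0, 2, 2, 1)
)

# Midpoint of the bounding box: halfway between the smallest and largest value
label_pos = c(
  long = mean(range(outline$long)),
  lat = mean(range(outline$lat))
)
print(label_pos)

# Plain average of the points, for comparison
print(mean(outline$long))
```

The bounding-box midpoint is not affected by how densely the outline is sampled along one side, which makes it a reasonable choice for placing labels.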


Now we find the top 10 states with the highest percentage of obese adults using the dplyr package.

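#Data frame consisting of top 10 Most Obese Adults States
topstate = states %>%
group_by(region) %>%
#one row per state; the rate is constant within a state, so mean() returns it
summarise(Obese.adults = mean(Obese.adults)) %>%
#keep the 10 states with the highest rate
top_n(10, Obese.adults)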
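
When no ranking column is given, `top_n()` ranks by the last column of the data frame, so it is safer to name the column explicitly. Here is a minimal sketch on made-up data (`toy` and `top2` are illustrative names) that collapses repeated rows per region and then keeps the two regions with the highest adult rate.

```r
library(dplyr)

# Made-up data: two rows per region, as in the merged map data
toy = data.frame(
  region = c("a", "a", "b", "b", "c", "c"),
  adults = c(10, 10, 30, 30, 20, 20),
  children = c(9, 9, 1, 1, 5, 5),
  stringsAsFactors = FALSE
)

top2 = toy %>%
  group_by(region) %>%
  summarise(adults = mean(adults), children = mean(children)) %>%
  top_n(2, adults) %>%      # rank by adults, not by the last column (children)
  arrange(desc(adults))     # highest rate first

print(top2)
```

Without the explicit `adults` argument, the ranking would use `children` and return regions "a" and "c" instead.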


### Plotting the States with Most Obese Adults

Making a Barplot.

#Plotting the top 10 states

ggplot(data = topstate, aes(x = reorder(region,Obese.adults), y = Obese.adults)) +
geom_col(color="black",fill="#1EDBC2",alpha=0.6) +
labs(y = "Percentage of Obese Adults",x="Top 10 States") +
coord_flip()



Gives this plot:

From the barplot we notice that the state with the highest percentage of obese adults is Mississippi.

### Plotting a Map for Most Obese Adult Population

#Plotting the data on a map------------------------

ggplot(states, aes(x = long, y = lat, group = group, fill = Obese.adults)) +
geom_polygon(color = "white",show.legend = T) +
scale_fill_gradient(name = "Percent", low = "#FAB8D2", high = "#F91C74", guide = "colorbar", na.value="black", breaks = pretty_breaks(n = 5)) +
labs(title="Obesity in Adults for USA",x = "Longitude",y = "Latitude") +
coord_map() +
#adding States names to the states on the map
geom_text(data=statenames, aes(x = long, y = lat, label = region), size=3)


Gives this plot:

The darker regions on the map indicate the states with the highest percentage of obese adults. The state with the lowest percentage of obese adults is Connecticut.

### Obesity in Children and Teen Population

Creating a new data frame with the top 15 states by percentage of obese children and teens.

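#Now Analyzing the Obese Children and Teens

#Finding top 15 States with Most Obese Children and Teens
topChild = states %>%
group_by(region) %>%
#one row per state, with the column name used in the barplot
summarise(Obese.Child.and.Teens = mean(Obese.children.and.adolescents)) %>%
top_n(15, Obese.Child.and.Teens)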


Making a Barplot

#Barplot

ggplot(data = topChild, aes(x = reorder(region,Obese.Child.and.Teens), y = Obese.Child.and.Teens))+
geom_col(color="black",fill="#6EE543",alpha=0.8) +
coord_flip()


Gives this plot:

As we can see, the state with the highest percentage of obese children and teens is Delaware.

#Map for Obesity in Children

ggplot(states, aes(x = long, y = lat, group = group, fill = Obese.children.and.adolescents)) +
geom_polygon(color = "white") +
scale_fill_gradient(name = "Percent Obese", low = "#B8D5EC", high = "#0A4B7D", guide = "colorbar", na.value="black", breaks = pretty_breaks(n = 5)) +
labs(title="Obesity in Children and Teens", x = "Longitude",y = "Latitude") +
coord_map() +
#adding States names to the states on the map
geom_text(data=statenames, aes(x = long, y = lat, label = region), size=3)


Gives this plot:

Now let’s plot a complete barplot of all states and their percentage of obese children and teens.

ggplot(aes(x = reorder(region,Obese.children.and.adolescents),y = Obese.children.and.adolescents),
data = statenames) +
geom_col(color="black",fill="#F43E3E",width=1)   +
coord_flip() +
labs(x = "States", y ="Percentage of Obese Children and Teens",title="Barplot of Obese Children and Teens")


Gives this plot:

The state with the highest percentage of obese children and teens is Delaware, and the state with the lowest is Utah.

## Conclusion

In this project we first learned to scrape data from Wikipedia using the rvest package, and then analyzed and visualized the states with the highest percentages of obese adults and children. We also learned how to create beautiful maps using the ggplot2 and maps packages in R.

I hope you liked the article and found it interesting enough to get started with scraping data from any web document, analyzing it yourself in R, and creating beautiful maps and plots.

Make sure to like and share it. Cheers!