This is the fourth part of our post series about the exploratory analysis of a publicly available dataset reporting earthquakes and similar events within a specific 30 days time span. In this post, we are going to show a cluster analysis of earthquakes located into the California state mainland. In the specific

Packages

I am going to take advantage of the following packages.

suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(lubridate))
suppressPackageStartupMessages(library(sf))
suppressPackageStartupMessages(library(spData))
suppressPackageStartupMessages(library(ggmap))
suppressPackageStartupMessages(library(tmap))
suppressPackageStartupMessages(library(factoextra))

Packages versions are herein listed.

packages <- c("dplyr", "lubridate", "sf", "spData", "ggmap", "tmap", "factoextra")
version <- lapply(packages, packageVersion)
version_c <- do.call(c, version)
data.frame(packages=packages, version = as.character(version_c))
##     packages version
## 1      dplyr 0.8.0.1
## 2  lubridate   1.7.4
## 3         sf   0.7.3
## 4     spData   0.3.0
## 5      ggmap   3.0.0
## 6       tmap     2.2
## 7 factoextra   1.0.5

Running on Windows-10 the following R language version.

R.version
##                _                           
## platform       x86_64-w64-mingw32          
## arch           x86_64                      
## os             mingw32                     
## system         x86_64, mingw32             
## status                                     
## major          3                           
## minor          5.2                         
## year           2018                        
## month          12                          
## day            20                          
## svn rev        75870                       
## language       R                           
## version.string R version 3.5.2 (2018-12-20)
## nickname       Eggshell Igloo

Getting Data

As shown in the previous posts, we download the earthquake dataset from earthquake.usgs.gov, specifically the last 30 days dataset. Please note that such earthquake dataset is day by day updated to cover the last 30 days of data collection. Further, it is in CSV format. If not already present into our workspace, we download and save it and then we load it into quakes local variable. Moreover, it is not the most recent dataset available as I collected it some weeks ago.

if ("all_week.csv" %in% dir(".") == FALSE) {
  url <- "https://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/all_week.csv"
  download.file(url = url, destfile = "all_week.csv")
}
quakes <- read.csv("all_month.csv", header=TRUE, sep=',', stringsAsFactors = FALSE)

quakes$time <- ymd_hms(quakes$time)
quakes$updated <- ymd_hms(quakes$updated)

quakes$magType <- as.factor(quakes$magType)
quakes$net <- as.factor(quakes$net)
quakes$type <- as.factor(quakes$type)
quakes$status <- as.factor(quakes$status)
quakes$locationSource <- as.factor(quakes$locationSource)
quakes$magSource <- as.factor(quakes$magSource)
quakes <- arrange(quakes, -row_number())

# earthquakes dataset
earthquakes <- quakes %>% filter(type == "earthquake")

Analysis

First, we have to show the location of the earthquakes occurred inside California mainland. At the purpose, here are the steps to do it whose numbering corresponds to the comments within the shown source code below.

  1. we filter out the California map from the US one, as made available by the spData package
  2. we convert our earthquakes dataset to a sf (simple features) object by st_as_sf() function within the sf package
  3. we inner join (left = FALSE) our simple features object with the california map; that gives a new simple features object providing with earthquakes occured exactly within California geographical boundaries
  4. we create a tmap-element as based on California map; such tmap-element instance specifies a spatial data object using the world Simple Features object as available within the spData package.
  5. we choose the natural style for out map and gray color fill color with borders for regions
  6. we set the title onto the map
  7. we add the compass chossing the 8star type in the right+top position
  8. we add a scale bar 0-100-200 km in the left+bottom position
  9. we add the previously build Simple Features object resulting from the inner join at step #2; we define a new variable p1 since in the prosecution of the analysis the variable p will be used as starting base for a new further plot

  10. we use the dot symbol to indicate earthquake events on the map with a color scale associated with the magnitude of the event
#1
map_california <- us_states %>% filter(NAME == "California")
#2
df <- st_as_sf(x = earthquakes, coords = c("longitude", "latitude"), crs = st_crs(map_california))
#3
df_map_inner_join <- st_join(df, map_california, left=FALSE)
## although coordinates are longitude/latitude, st_intersects assumes that they are planar
#4
p <- tm_shape(map_california)
#5
p <- p + tm_style("natural") + tm_fill(col = "gray") + tm_borders() 
#6
p <- p + tm_layout(main.title = paste("California earthquakes map from ", paste(as.Date(df_map_inner_join$time[1]), as.Date(df_map_inner_join$time[nrow(df_map_inner_join)]), sep = " to "))) 
#7
p <- p + tm_compass(type = "8star", position = c("right", "top")) 
#8
p <- p + tm_scale_bar(breaks = c(0, 100, 200), size = 1, position = c("left", "bottom"))
#9
p1 <- p + tm_shape(df_map_inner_join) 
#10
p1 <- p1 + tm_dots(size = 0.1, col = "mag", palette = "YlOrRd")
p1

We apply hierarchical clustering on California earthquakes dataset. Steps are:

  1. We extract the coordinates from the simple feature data frame to determine a new dataframe
  2. we give proper names to the columns of our coordinates dataframe
  3. we determines and visualize the optimal number of clusters using the wss (Weighted Squared Sum) method by means of fviz_nbclust() function as available within the factoextra package
#1
cal_quakes_coordinates <- as.data.frame(st_coordinates((df_map_inner_join)))
#2
colnames(cal_quakes_coordinates) <- c("longitude", "latitude")
#3
fviz_nbclust(cal_quakes_coordinates, kmeans, method = "wss") + geom_vline(xintercept = 4, linetype = 2) + labs(subtitle = "Elbow method")

The optimal number of clusters for our dataset is indicated as equal to 4 by the above plot.

Let us give a brief overview of how this works, as outlined by ref. [4].

The elbow method looks at the total WSS as a function of the number of clusters. It chooses a number of clusters so that adding another cluster doesn't improve much better the total WSS.

The optimal number of clusters can be defined as follow:

Other methods available are silhouette and gap_statistics, see ref. [4] and [5] for more details.

In the prosecution of the analysis,

  1. we use computed optimal number of clusters (4) to run a kmeans based clustering
  2. we create a new column cluster in our simple feature data frame
  3. we update our plot object p (as introduced in our first plot) in order to show clusters; that is achieved by passing as argument the updated simple features data frame instance
  4. we color the dots indicating where earthquakes occurred with the cluster identifier
  5. we show the resulting plot
#1
kmeans_res <- kmeans(cal_quakes_coordinates, centers = 4)
#2
df_map_inner_join$cluster <- as.factor(kmeans_res$cluster)
#3
p1 <- p + tm_shape(df_map_inner_join) 
#4
p1 <- p1 + tm_dots(size = 0.1, col = "cluster", palette = "YlOrRd")
#5
p1

The same shown on a stamen sourced map. Here are the steps.

  1. we add the longitude and latitude columns to our simple features dataframe
  2. we take advantage of the qmplot() function within the tmap package passing our simple features dataframe without the geometry associated to; a darken factor is suggested for better highlight the dots
#1
df_map_inner_join$longitude <- cal_quakes_coordinates$longitude
df_map_inner_join$latitude <- cal_quakes_coordinates$latitude
#2
suppressMessages(qmplot(x = longitude, y = latitude, data = st_drop_geometry(df_map_inner_join), geom = "point", color = cluster, source = "stamen", zoom = 9, darken = 0.2) + scale_color_brewer(palette="YlOrRd"))

If you have any questions, please feel free to comment below.

References

  1. Earthquake dataset
  2. Eathquake dataset terms
  3. Geocomputation with R, Robin Lovelace, Jakub Nowosad, Jannes Muenchow, CRC press
  4. Determining the optimal number of clusters
  5. Practical Guide to Cluster Analysis
  6. K-Means