Earthquake Analysis (4/4): Cluster Analysis

This is the fourth part of our post series about the exploratory analysis of a publicly available dataset reporting earthquakes and similar events within a specific 30-day time span. In this post, we are going to show a cluster analysis of earthquakes located within the California mainland. Specifically:

we will show a California map where earthquakes are highlighted by different colors based on their magnitude
we will determine the optimal number of clusters by means of the elbow method
we will take advantage of k-means clustering; k-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, which serves as a prototype of the cluster (ref. [6])
we will then plot our California map with the earthquake locations again, this time colored by cluster identifier

Packages

I am going to take advantage of the following packages.

suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(lubridate))
suppressPackageStartupMessages(library(sf))
suppressPackageStartupMessages(library(spData))
suppressPackageStartupMessages(library(ggmap))
suppressPackageStartupMessages(library(tmap))
suppressPackageStartupMessages(library(factoextra))

Package versions are listed below.

packages <- c("dplyr", "lubridate", "sf", "spData", "ggmap", "tmap", "factoextra")
version <- lapply(packages, packageVersion)
version_c <- do.call(c, version)
data.frame(packages=packages, version = as.character(version_c))
##     packages version
## 1      dplyr 0.8.0.1
## 2  lubridate   1.7.4
## 3         sf   0.7.3
## 4     spData   0.3.0
## 5      ggmap   3.0.0
## 6       tmap     2.2
## 7 factoextra   1.0.5

The analysis runs on Windows 10 with the following R version.

R.version
##                _                           
## platform       x86_64-w64-mingw32          
## arch           x86_64                      
## os             mingw32                     
## system         x86_64, mingw32             
## status                                     
## major          3                           
## minor          5.2                         
## year           2018                        
## month          12                          
## day            20                          
## svn rev        75870                       
## language       R                           
## version.string R version 3.5.2 (2018-12-20)
## nickname       Eggshell Igloo

Getting Data

As shown in the previous posts, we download the earthquake dataset from earthquake.usgs.gov, specifically the dataset covering the last 30 days. Please note that this dataset is updated day by day so that it always covers the last 30 days of data collection. It is in CSV format. If it is not already present in our working directory, we download and save it; then we load it into the quakes local variable. Note also that the dataset analyzed here is not the most recent one available, as I collected it some weeks ago.

# download the 30-day feed only if a local copy is not already available
if (!file.exists("all_month.csv")) {
  url <- "https://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/all_month.csv"
  download.file(url = url, destfile = "all_month.csv")
}
quakes <- read.csv("all_month.csv", header = TRUE, sep = ",", stringsAsFactors = FALSE)

quakes$time <- ymd_hms(quakes$time)
quakes$updated <- ymd_hms(quakes$updated)

quakes$magType <- as.factor(quakes$magType)
quakes$net <- as.factor(quakes$net)
quakes$type <- as.factor(quakes$type)
quakes$status <- as.factor(quakes$status)
quakes$locationSource <- as.factor(quakes$locationSource)
quakes$magSource <- as.factor(quakes$magSource)
quakes <- arrange(quakes, -row_number())

# earthquakes dataset
earthquakes <- quakes %>% filter(type == "earthquake")

Analysis

First, we show the location of the earthquakes that occurred inside the California mainland. Here are the steps; their numbering corresponds to the comments in the source code shown below.

1. we extract the California map from the US one, as made available by the spData package
2. we convert our earthquakes dataset to an sf (simple features) object with the st_as_sf() function of the sf package
3. we inner join (left = FALSE) our simple features object with the California map; that gives a new simple features object containing only the earthquakes that occurred within California's geographical boundaries
4. we create a tmap-element based on the California map; such tmap-element specifies the spatial data object to plot, here the California Simple Features object extracted from the us_states dataset of the spData package
5. we choose the natural style for our map and a gray fill color with borders for regions
6. we set the title of the map
7. we add the compass, choosing the 8star type in the right+top position
8. we add a scale bar 0-100-200 km in the left+bottom position
9. we add the previously built Simple Features object resulting from the inner join at step #3; we define a new variable p1 since, later in the analysis, the variable p will be used as the starting base for a further plot
10. we use the dot symbol to indicate earthquake events on the map with a color scale associated with the magnitude of the event

#1
map_california <- us_states %>% filter(NAME == "California")
#2
df <- st_as_sf(x = earthquakes, coords = c("longitude", "latitude"), crs = st_crs(map_california))
#3
df_map_inner_join <- st_join(df, map_california, left=FALSE)
## although coordinates are longitude/latitude, st_intersects assumes that they are planar

#4
p <- tm_shape(map_california)
#5
p <- p + tm_style("natural") + tm_fill(col = "gray") + tm_borders() 
#6
p <- p + tm_layout(main.title = paste("California earthquakes map from ", paste(as.Date(df_map_inner_join$time[1]), as.Date(df_map_inner_join$time[nrow(df_map_inner_join)]), sep = " to "))) 
#7
p <- p + tm_compass(type = "8star", position = c("right", "top")) 
#8
p <- p + tm_scale_bar(breaks = c(0, 100, 200), size = 1, position = c("left", "bottom"))
#9
p1 <- p + tm_shape(df_map_inner_join) 
#10
p1 <- p1 + tm_dots(size = 0.1, col = "mag", palette = "YlOrRd")
p1
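To make step #3 more concrete, here is a minimal sketch of what an inner spatial join does, run on toy data rather than on the real datasets. The square region, the three events and all variable names below are hypothetical and only serve the illustration; no coordinate reference system is set, so the geometry is treated as planar.

```r
library(sf)

# Hypothetical toy region: a 10 x 10 square standing in for the California polygon
square <- st_polygon(list(rbind(c(0, 0), c(10, 0), c(10, 10), c(0, 10), c(0, 0))))
region <- st_sf(NAME = "toy region", geometry = st_sfc(square))

# Hypothetical events: the first two fall inside the square, the third outside
events <- data.frame(id = 1:3,
                     mag = c(1.2, 2.5, 3.1),
                     longitude = c(2, 5, 15),
                     latitude = c(3, 5, 5))
events_sf <- st_as_sf(events, coords = c("longitude", "latitude"))

# Inner spatial join: events that do not intersect the region are dropped,
# and the region attributes (here NAME) are attached to the remaining ones
inside <- st_join(events_sf, region, left = FALSE)
inside$id
```

With left = TRUE (the default) all three events would be kept, and the NAME column would be NA for the event outside the region.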

We now apply k-means clustering to the California earthquakes dataset, starting with the choice of the number of clusters. The steps are:

1. we extract the coordinates from the simple features data frame into a new dataframe
2. we give proper names to the columns of our coordinates dataframe
3. we determine and visualize the optimal number of clusters using the wss (within-cluster sum of squares) method by means of the fviz_nbclust() function available in the factoextra package

#1
cal_quakes_coordinates <- as.data.frame(st_coordinates((df_map_inner_join)))
#2
colnames(cal_quakes_coordinates) <- c("longitude", "latitude")
#3
fviz_nbclust(cal_quakes_coordinates, kmeans, method = "wss") + geom_vline(xintercept = 4, linetype = 2) + labs(subtitle = "Elbow method")

The plot above indicates 4 as the optimal number of clusters for our dataset.

Let us give a brief overview of how this works, as outlined by ref. [4].

The elbow method looks at the total WSS as a function of the number of clusters. It chooses a number of clusters such that adding another cluster does not substantially reduce the total WSS.

The optimal number of clusters can be determined as follows:

compute clustering algorithm for different values of k
for each k, calculate the total within-cluster sum of squares (wss)
plot the curve of wss according to the number of clusters k
the location of a bend (knee) in the plot is generally considered as an indicator of the appropriate number of clusters
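The steps above can be sketched in base R without factoextra. This is only an illustration on synthetic data, not the computation behind the plot above: the three toy cluster centers, the variable names and the range of k are assumptions made for the example.

```r
# Synthetic data (hypothetical): three well-separated groups of 30 points each
set.seed(42)
centers <- rbind(c(0, 0), c(10, 0), c(5, 10))
toy <- do.call(rbind, lapply(1:3, function(i) {
  cbind(x = rnorm(30, centers[i, 1], 0.5),
        y = rnorm(30, centers[i, 2], 0.5))
}))

# Run k-means for each candidate k and store the total within-cluster sum of squares
k_values <- 1:8
wss <- sapply(k_values, function(k) {
  kmeans(toy, centers = k, nstart = 25)$tot.withinss
})

# Plot WSS against k; the bend of the curve suggests the number of clusters
plot(k_values, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```

On this toy dataset the curve drops sharply up to k = 3 and flattens afterwards, which is the behavior the elbow method looks for.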

Other methods available are silhouette and gap statistic (method = "silhouette" and method = "gap_stat" in fviz_nbclust()); see ref. [4] and [5] for more details.
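As a rough illustration of the silhouette alternative, the sketch below computes the average silhouette width for several values of k using the silhouette() function of the cluster package, and picks the k with the highest average. The synthetic data and the variable names are assumptions for the example; this is not the analysis run on the earthquake data.

```r
library(cluster)  # provides silhouette()

# Synthetic data (hypothetical): three well-separated groups of 30 points each
set.seed(42)
centers <- rbind(c(0, 0), c(10, 0), c(5, 10))
toy <- do.call(rbind, lapply(1:3, function(i) {
  cbind(x = rnorm(30, centers[i, 1], 0.5),
        y = rnorm(30, centers[i, 2], 0.5))
}))
d <- dist(toy)  # pairwise Euclidean distances, needed by silhouette()

# Average silhouette width for each candidate k (silhouette needs k >= 2)
k_values <- 2:6
avg_sil <- sapply(k_values, function(k) {
  km <- kmeans(toy, centers = k, nstart = 25)
  mean(silhouette(km$cluster, d)[, "sil_width"])
})

# The k with the highest average silhouette width is the suggested one
best_k <- k_values[which.max(avg_sil)]
best_k
```

Unlike the elbow method, which requires reading a bend off a plot, the silhouette criterion gives a single maximum to pick.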

Continuing the analysis:

1. we use the computed optimal number of clusters (4) to run a k-means based clustering
2. we create a new column cluster in our simple features data frame
3. we build a new plot object p1 on top of p (as introduced in our first plot) in order to show the clusters; that is achieved by passing the updated simple features data frame as the argument to tm_shape()
4. we color the dots indicating where earthquakes occurred by cluster identifier
5. we show the resulting plot

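#1
# kmeans() starts from randomly chosen centers; fixing the seed (arbitrary value) makes the result reproducible
set.seed(123)
kmeans_res <- kmeans(cal_quakes_coordinates, centers = 4)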
#2
df_map_inner_join$cluster <- as.factor(kmeans_res$cluster)
#3
p1 <- p + tm_shape(df_map_inner_join) 
#4
p1 <- p1 + tm_dots(size = 0.1, col = "cluster", palette = "YlOrRd")
#5
p1
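For readers who want to look inside the object returned by kmeans(), here is a small sketch on synthetic data. It shows the main components of the result and checks the property mentioned at the beginning of the post, namely that each observation belongs to the cluster with the nearest mean. The toy data and all variable names are hypothetical.

```r
# Synthetic data (hypothetical): three well-separated groups of 30 points each
set.seed(42)
centers <- rbind(c(0, 0), c(10, 0), c(5, 10))
toy <- do.call(rbind, lapply(1:3, function(i) {
  cbind(x = rnorm(30, centers[i, 1], 0.5),
        y = rnorm(30, centers[i, 2], 0.5))
}))

# nstart = 25 repeats the random initialization 25 times and keeps the best run
km <- kmeans(toy, centers = 3, nstart = 25)

km$size          # number of points in each cluster
km$centers       # cluster means, one row per cluster
km$tot.withinss  # total within-cluster sum of squares

# Squared Euclidean distance from every point to every cluster mean
sq_dist <- sapply(seq_len(nrow(km$centers)), function(j) {
  rowSums(sweep(toy, 2, km$centers[j, ])^2)
})

# Index of the nearest mean for each point, compared with the assigned cluster
nearest <- apply(sq_dist, 1, which.min)
all(nearest == km$cluster)
```

The cluster identifiers themselves (1, 2, 3, ...) are arbitrary labels and may change from run to run unless the seed is fixed.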

The same clusters can be shown on a Stamen-sourced map. Here are the steps.

1. we add the longitude and latitude columns to our simple features dataframe
2. we take advantage of the qmplot() function from the ggmap package, passing our simple features dataframe with its geometry dropped; a darken factor is suggested to better highlight the dots

#1
df_map_inner_join$longitude <- cal_quakes_coordinates$longitude
df_map_inner_join$latitude <- cal_quakes_coordinates$latitude
#2
suppressMessages(qmplot(x = longitude, y = latitude, data = st_drop_geometry(df_map_inner_join), geom = "point", color = cluster, source = "stamen", zoom = 9, darken = 0.2) + scale_color_brewer(palette="YlOrRd"))

If you have any questions, please feel free to comment below.

References
