The present tutorial analyses the Tennis Grand Slam tournaments main results from the statistical point of view. Specifically, I try to answer the following questions:

– How to fit the distribution of the Grand Slam tournaments number of victories across players?
– How to compute the probability of having player’s victories greater than a specific number?
– How the number of Grand Slam tournaments winners increases along with time?
– How can we assign a metric to the tennis players based on the number of Grand Slam tournaments they won?

Our dataset is provided within ref. [1], whose content is based on what reported by the ESP site at: ESPN site tennis history table.

Packages

suppressPackageStartupMessages(library(fitdistrplus))
suppressPackageStartupMessages(library(extremevalues))
suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(knitr))
suppressPackageStartupMessages(library(ggplot2))
Copy

Analysis

The analysis within the present tutorial is organized as follows:

Basic Data Exploration

Loading the Tennis Grand Slam tournaments dataset and running a basic data exploration.

url_file <- "https://datascienceplus.com/wp-content/uploads/2017/12/tennis-grand-slam-winners_end2017.txt"
slam_win <- read.delim(url(url_file), sep="\t", stringsAsFactors = FALSE)
dim(slam_win)
[1] 489   4
Copy
kable(head(slam_win), row.names=FALSE)
| YEAR|TOURNAMENT      |WINNER        |RUNNER.UP      |
|----:|:---------------|:-------------|:--------------|
| 2017|U.S. Open       |Rafael Nadal  |Kevin Anderson |
| 2017|Wimbledon       |Roger Federer |Marin Cilic    |
| 2017|French Open     |Rafael Nadal  |Stan Wavrinka  |
| 2017|Australian Open |Roger Federer |Rafael Nadal   |
| 2016|U.S. Open       |Stan Wawrinka |Novak Djokovic |
| 2016|Wimbledon       |Andy Murray   |Milos Raonic   |

Copy
kable(tail(slam_win), row.names=FALSE)
| YEAR|TOURNAMENT |WINNER           |RUNNER.UP          |
|----:|:----------|:----------------|:------------------|
| 1881|U.S. Open  |Richard D. Sears |William E. Glyn    |
| 1881|Wimbledon  |William Renshaw  |John Hartley       |
| 1880|Wimbledon  |John Hartley     |Herbert Lawford    |
| 1879|Wimbledon  |John Hartley     |V. St. Leger Gould |
| 1878|Wimbledon  |Frank Hadow      |Spencer Gore       |
| 1877|Wimbledon  |Spencer Gore     |William Marshall   |

Copy
nr <- nrow(slam_win)
start_year <- slam_win[nr, "YEAR"]
end_year <- slam_win[1, "YEAR"]
(years_span <- end_year - start_year + 1)
[1] 141

Copy
(total_slam_winners <- length(unique(slam_win[,"WINNER"])))
[1] 166

Copy

So we have 166 winners spanning over 141 years of Tennis Grand Slam tournaments. We observe that during first and second World Wars, a reduced number of Grand Slam tournaments were played for obvious reasons.

slam_win_df <- as.data.frame(table(slam_win[,"WINNER"]))
slam_win_df =  slam_win_df %>% arrange(desc(Freq))
# computing champions' leaderboard position
pos <- rep(0, nrow(slam_win_df))
pos[1] <- 1
for(i in 2:nrow(slam_win_df)) {
  pos[i] <- ifelse(slam_win_df$Freq[i] != slam_win_df$Freq[i-1], i, pos[i-1])
}
last_win_year = sapply(slam_win_df$Var1, function(x) {slam_win %>% filter(WINNER == x) %>% dplyr::select(YEAR) %>% max()})
# creating and showing leaderboard dataframe
slam_winners <- data.frame(RANK = pos,
                           PLAYER = slam_win_df$Var1,
                           WINS = slam_win_df$Freq,
                           LAST_WIN_YEAR = last_win_year)
kable(slam_winners)
| RANK|PLAYER                      | WINS| LAST_WIN_YEAR|
|----:|:---------------------------|----:|-------------:|
|    1|Roger Federer               |   19|          2017|
|    2|Rafael Nadal                |   16|          2017|
|    3|Pete Sampras                |   14|          2002|
|    4|Novak Djokovic              |   12|          2016|
|    4|Roy Emerson                 |   12|          1967|
...

Copy

The WINS and log(WINS) distribution density are shown.

par(mfrow=c(1,2))
plot(density(slam_winners$WINS), main = "Wins Density")
plot(density(log(slam_winners$WINS)), main = "Log Wins Density")
Copy

Gives this plot:

You may want to arrange the same dataframe ordering by the champions’ last win year.

par(mfrow=c(1,1))
slam_winners_last_win_year = slam_winners %>% arrange(LAST_WIN_YEAR)
kable(slam_winners %>% arrange(desc(LAST_WIN_YEAR)))
| RANK|PLAYER                      | WINS| LAST_WIN_YEAR|
|----:|:---------------------------|----:|-------------:|
|    1|Roger Federer               |   19|          2017|
|    2|Rafael Nadal                |   16|          2017|
|    4|Novak Djokovic              |   12|          2016|
|   43|Andy Murray                 |    3|          2016|
...

Copy

We may want to visualize the timeline of the number of Tennis Grand Slam champions.

df_nwin = data.frame()
for (year in start_year : end_year) {
  n_slam_winners = slam_win %>% filter(YEAR <= year) %>% dplyr::select(WINNER) %>% unique %>% nrow
  df_nwin = rbind(df_nwin, data.frame(YEAR = year, N_WINNERS = n_slam_winners))
}
plot(x = df_nwin$YEAR, y = df_nwin$N_WINNERS, type ='s', xlab = "year", ylab = "no_winners")
grid()
Copy

Gives this plot:

We may want to visualize the timeline of the Grand Slam tournaments wins record.

df2_nwin = data.frame()
for (year in start_year : end_year) {
  slam_win_years = slam_win %>% filter(YEAR <= year)
  slam_win_record = as.data.frame(table(slam_win_years[,"WINNER"]))
  df2_nwin = rbind(df2_nwin, data.frame(YEAR = year, RECORD_WINS = max(slam_win_record$Freq)))
}
plot(x = df2_nwin$YEAR, y = df2_nwin$RECORD_WINS, type ='s', xlab = "year", ylab = "record_wins")
grid()
Copy

Gives this plot:

It is interesting to have a look at the number of wins frequency.

wins_frequency <- as.data.frame(table(slam_winners[,"WINS"]))
colnames(wins_frequency) <- c("WINS", "FREQUENCY")
kable(wins_frequency)
|WINS | FREQUENCY|
|:----|---------:|
|1    |        80|
|2    |        25|
|3    |        19|
|4    |        12|
|5    |         5|
|6    |         3|
|7    |         6|
|8    |         8|
|10   |         1|
|11   |         2|
|12   |         2|
|14   |         1|
|16   |         1|
|19   |         1|

Copy
summary(slam_winners[,"WINS"])
Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.000   1.000   2.000   2.946   3.750  19.000 

Copy

Probabilistic Distribution Fit

We now take advantage of the fitdist() function within the fitdistr package to fit a lognormal distribution for our Grand Slam wins data.

fw <- fitdist(slam_winners$WINS, "lnorm")
summary(fw)
Fitting of the distribution ' lnorm ' by maximum likelihood 
Parameters : 
         estimate Std. Error
meanlog 0.7047927 0.06257959
sdlog   0.8062817 0.04425015
Loglikelihood:  -316.7959   AIC:  637.5918   BIC:  643.8158 
Correlation matrix:
        meanlog sdlog
meanlog       1     0
sdlog         0     1

Copy

Then we can plot the distribution fit results.

plot(fw)
Copy

Gives this plot:

# left outliers quantile
left_thresh <- 0.05 
# right outliers quantile
right_thresh <- 0.95
# determining the outliers
slam_outlier <- getOutliersI(as.vector(slam_winners$WINS), 
                             FLim = c(left_thresh, right_thresh), 
                             distribution = "lognormal")
# outliers are plotted in red color
outlierPlot(slam_winners$WINS, slam_outlier, mode="qq")
Copy

Gives this plot:

The outliers are:

slam_winners[slam_outlier$iRight,]
 RANK        PLAYER WINS LAST_WIN_YEAR
1    1 Roger Federer   19          2017
2    2  Rafael Nadal   16          2017

Copy

The mean and standard deviation associated to the log-normal fit are:

(mean_log <- fw$estimate["meanlog"])
meanlog 
0.7047927 

Copy
(sd_log <- fw$estimate["sdlog"])
sdlog 
0.8062817

Copy

Now we compute the probability associated with 19 and 16 wins.

# clearing names
names(mean_log) <- NULL
names(sd_log) <- NULL
# probability associated to 19 wins performance or more
(lnorm_19 <- plnorm(19, mean_log, sd_log, lower.tail=FALSE))
[1] 0.002736863

Copy
# probability associated to 16 wins performance or more
(lnorm_16 <- plnorm(16, mean_log, sd_log, lower.tail=FALSE))
[1] 0.005164628

Copy

However, if a random variable follows a log-normal distribution, its logarithm follows a normal distribution. Hence we fit the logarithm of the variable under analysis using a normal distribution and compare the results with above log-normal fit.

fw_norm <- fitdist(log(slam_winners$WINS), "norm")
summary(fw_norm)
Fitting of the distribution ' norm ' by maximum likelihood 
Parameters : 
      estimate Std. Error
mean 0.7047927 0.06257959
sd   0.8062817 0.04425015
Loglikelihood:  -199.8003   AIC:  403.6006   BIC:  409.8246 
Correlation matrix:
     mean sd
mean    1  0
sd      0  1

Copy

Similarly, we plot the fit results.

plot(fw_norm)
Copy

Gives this plot:

# left outliers quantile
left_thresh <- 0.05 
# right outliers quantile
right_thresh <- 0.95
slam_outlier <- getOutliersI(log(as.vector(slam_winners$WINS)), 
                             FLim = c(left_thresh, right_thresh), 
                             distribution = "normal")
outlierPlot(slam_winners$WINS, slam_outlier, mode="qq")
Copy

Gives this plot:

The outliers are:

slam_winners[slam_outlier$iRight,]
 RANK        PLAYER WINS LAST_WIN_YEAR
1    1 Roger Federer   19          2017
2    2  Rafael Nadal   16          2017

Copy

The mean and standard deviation values are:

# mean and standard deviation of the fitted lognormal distribution
(mean_norm <- fw_norm$estimate["mean"])
mean 
0.7047927

Copy
(sd_norm <- fw_norm$estimate["sd"])
 sd 
0.8062817 

Copy

As we can see above, same fitting parameters result from the two approaches, even though with different log-likelihood, AIC and BIC metrics. Now we compute the probability associated to 19 and 16 wins together with their distance from the mean in terms of multiples of the standard deviation.

# clearing names
names(mean_norm) <- NULL
names(sd_norm) <- NULL
# probability associated to the 19 wins performance or better
(norm_19 <- pnorm(log(19), mean_norm, sd_norm, lower.tail=FALSE))
[1] 0.002736863

Copy
# standard deviation times from the mean associated to 19 wins
(deviation_19 <- (log(19) - mean_norm)/sd_norm)

[1] 2.777747

Copy
# probability associated to the 16 wins performance or better
(norm_16 <- pnorm(log(16), mean_norm, sd_norm, lower.tail=FALSE))

[1] 0.005164628

Copy
# standard deviation times from the mean associated to 16 wins
(deviation_16 <- (log(16) - mean_norm)/sd_norm)

[1] 2.564608

Copy

As we can see above, we also obtained the same probability value as resulting from the log-normal distribution fit. In the following, we consider the second fitting approach (the one which takes the log of the original variable) for easing the computation of the distance from the mean in terms of multiples of the standard deviation.

Regression Analysis

Let us see again the plot of the number of tennis Grand Slam winners against their timeline.

year <- df_nwin $YEAR
winners <- df_nwin$N_WINNERS
plot(x = year, y = winners, type ='l')
grid()
Copy

Gives this plot:

It is visually evident the linear relationship between the variables. Hence, a linear regression would help in understanding how many newbie Grand Slam winners we may have each year.

year_lm <- lm(winners ~ year)
summary(year_lm)
Call:
lm(formula = winners ~ year)

Residuals:
    Min      1Q  Median      3Q     Max 
-9.8506 -1.9810 -0.4683  2.6102  6.2866 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -2.388e+03  1.220e+01  -195.8   <2e-16 ***
year         1.270e+00  6.264e-03   202.8   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.027 on 139 degrees of freedom
Multiple R-squared:  0.9966,	Adjusted R-squared:  0.9966 
F-statistic: 4.113e+04 on 1 and 139 DF,  p-value: < 2.2e-16

Copy

Coefficients are reported as significant and the R-squared value is very high. On average each year, 1.27 Grand Slam tournaments newbie winners show up. Residuals analysis has not been reported for brevity. Similarly, we can regress the year against the number of winners.

n_win_lm <- lm(year ~ winners)
summary(n_win_lm)
Call:
lm(formula = year ~ winners)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.8851 -1.9461  0.3268  1.4327  7.9641 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 1.880e+03  3.848e-01  4886.4   <2e-16 ***
winners     7.846e-01  3.868e-03   202.8   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.379 on 139 degrees of freedom
Multiple R-squared:  0.9966,	Adjusted R-squared:  0.9966 
F-statistic: 4.113e+04 on 1 and 139 DF,  p-value: < 2.2e-16

Copy

Coefficients are reported as significant and the R-squared value is very high. On average, each new Grand Slam tournaments winner appears every 0.7846 fraction of year. Residuals analysis has not been reported for brevity. Such model can be used to predict the year when a given number of Grand Slam winners may show up. For example, considering Federer, Nadal and Sampras wins, we obtain:

# probability associated to the 14 wins performance or better
(norm_14 <- pnorm(log(14), mean_norm, sd_norm, lower.tail=FALSE))
[1] 0.008220098

Copy
# standard deviation times from the mean associated to 14 wins
(deviation_14 <- (log(14) - mean_norm)/sd_norm)
[1] 2.398994

Copy
# average number of Grand Slam winners to expect for a 19 Grand Slam wins champion
(x_19 <- round(1/norm_19))
[1] 365

Copy
# average number of Grand Slam winners to expect for a 16 Grand Slam wins champion
(x_16 <- round(1/norm_16))
[1] 194

Copy
# average number of Grand Slam winners to expect for a 14 Grand Slam wins champion
(x_14 <- round(1/norm_14))
[1] 122

Copy

The x_19, x_16 and x_14 values can be interpreted as the average size of Grand Slam tournaments winners population to therein find a 19, 16, 14 times winner respectively. As a consequence, the prediction of the calendar years to see players capable to win 19, 16, 14 times is:

predict(n_win_lm, newdata = data.frame(winners = c(x_19, x_16, x_14)), interval = "p")
      fit      lwr      upr
1 2166.732 2161.549 2171.916
2 2032.573 2027.779 2037.366
3 1976.084 1971.355 1980.813

Copy

The table above shows the earliest year when, on average, to expect a Grand Slam tournament winner capable to win 19, 16, 14 times (fit column), together with lower (lwr) and upper (upr) bound predicted values. In the real world, 14 wins champion showed up a little bit later than expected by our linear regression model, whilst 16 and 19 win champions did much earlier than expected by the same model.

Population Statistical Dispersion Analysis

In our previous analysis, we computed the distance from the mean for 19, 16 and 14 Grand Slam tournaments win probabilities, distance expressed in terms of multiples of the standard deviation.

deviation_19
[1] 2.777747

Copy
deviation_16
[1] 2.564608

Copy
deviation_14
[1] 2.398994

Copy

Based on above values, we can compute the probability to have a 19, 16 and 14 times winner. As we saw before, we resume up such result using the pnorm() function.

(prob_19 <- pnorm(mean_norm+deviation_19*sd_norm, mean_norm, sd_norm, lower.tail = FALSE))
[1] 0.002736863

Copy
(prob_16 <- pnorm(mean_norm+deviation_16*sd_norm, mean_norm, sd_norm, lower.tail = FALSE))
[1] 0.005164628

Copy
(prob_14 <- pnorm(mean_norm+deviation_14*sd_norm, mean_norm, sd_norm, lower.tail = FALSE))
[1] 0.008220098

Copy

Similarly to the Intellectual Quotient (IQ) assigning a value equal to 100 at the mean and +/- 15 points for each standard deviation of distance from the mean itself, we can figure out a Tennis Quotient (TQ).

Here in below, we show a plot to remember how the IQ is computed:

We notice that the median of our player’s population scores a TQ equal to 100.

(median_value <- median(slam_winners$WINS))
[1] 2

Copy
(deviation_median <- (log(median_value) - mean_norm)/sd_norm)
[1] -0.01444343

Copy
round(100 + 15*deviation_median)
[1] 100

Copy

We now compute the Tennis Quotients (TQ) for leading champions.

(Federer_TQ <- round(100 + 15*deviation_19))
[1] 142

Copy
(Nadal_TQ <- round(100 + 15*deviation_16))
[1] 138

Copy
(Sampras_TQ <- round(100 + 15*deviation_14))
[1] 136

Copy

And what about for example 7 times Grand Slam tournament winner?

(deviation_7 <- (log(7) - mean_norm)/sd_norm)
[1] 1.53931

Copy
TQ_7wins <- round(100 + 15*deviation_7)
[1] 123

Copy

Let us then compute the Tennis Quotients (TQ) for all our tennis Grand Slam tournaments winners.

tq_compute <- function(x) {
  deviation_x <- (log(x) - mean_norm)/sd_norm
  round(100 + 15*deviation_x)
}
slam_winners = slam_winners %>% mutate(TQ = tq_compute(WINS))
kable(slam_winners)
| RANK|PLAYER                      | WINS| LAST_WIN_YEAR|  TQ|
|----:|:---------------------------|----:|-------------:|---:|
|    1|Roger Federer               |   19|          2017| 142|
|    2|Rafael Nadal                |   16|          2017| 138|
|    3|Pete Sampras                |   14|          2002| 136|
|    4|Novak Djokovic              |   12|          2016| 133|
....

Copy

We then visualize the top twenty Grand Slam tournaments champions.

ggplot(data=slam_winners[1:20,], aes(x=reorder(PLAYER, TQ), y=TQ, fill = TQ)) + 
  geom_bar(stat ="identity") + coord_flip() + 
  scale_y_continuous(breaks = seq(0,150,10), limits = c(0,150)) +
  xlab("PLAYER")
Copy

Gives this plot.

Conclusions

The answers to our initial questions:

– The log-normal distribution
– Based upon the fitted distribution and taking advantage of plnorm() or pnorm() stats package functions, probabilities have been computed
– A linear increase is a very good fit for that, resulting in significative regression coefficients and high R-squared values
– Yes, we defined the Tennis Quotient similarly to the Intellectual Quotient and show the resulting leaderboard. Federer is confirmed “genius” in that respect, however, a few other very talented players are not that far from him.

If you have any questions, please feel free to comment below.

References