An online community for showcasing R & Python tutorials. It operates as a networking platform for data scientists to promote their talent and get hired. Our mission is to empower data scientists by bridging the gap between talent and opportunity.

Basic Statistics

- Published on December 11, 2017 at 9:00 am

- 650 Views
- Shares
- 0 Comments

The present tutorial analyses the Tennis Grand Slam tournaments main results from the statistical point of view. Specifically, I try to answer the following questions:

– How to fit the distribution of the Grand Slam tournaments number of victories across players?

– How to compute the probability of having player’s victories greater than a specific number?

– How the number of Grand Slam tournaments winners increases along with time?

– How can we assign a metric to the tennis players based on the number of Grand Slam tournaments they won?

Our dataset is provided within ref. [1], whose content is based on what reported by the ESP site at: ESPN site tennis history table.

suppressPackageStartupMessages(library(fitdistrplus)) suppressPackageStartupMessages(library(extremevalues)) suppressPackageStartupMessages(library(dplyr)) suppressPackageStartupMessages(library(knitr)) suppressPackageStartupMessages(library(ggplot2))

The analysis within the present tutorial is organized as follows:

- a basic data exploration is run
- log-normal distribution is fit against the Grand Slam tournaments victories data
- regression analysis is presented in order to evaluate the increase of tennis Grand Slam champions along with time
- population statistical dispersion is evaluated to determine a Tennis Quotient assigned to tournaments’ winners

Loading the Tennis Grand Slam tournaments dataset and running a basic data exploration.

url_file <- "https://datascienceplus.com/wp-content/uploads/2017/12/tennis-grand-slam-winners_end2017.txt" slam_win <- read.delim(url(url_file), sep="\t", stringsAsFactors = FALSE) dim(slam_win)[1] 489 4

kable(head(slam_win), row.names=FALSE)| YEAR|TOURNAMENT |WINNER |RUNNER.UP | |----:|:---------------|:-------------|:--------------| | 2017|U.S. Open |Rafael Nadal |Kevin Anderson | | 2017|Wimbledon |Roger Federer |Marin Cilic | | 2017|French Open |Rafael Nadal |Stan Wavrinka | | 2017|Australian Open |Roger Federer |Rafael Nadal | | 2016|U.S. Open |Stan Wawrinka |Novak Djokovic | | 2016|Wimbledon |Andy Murray |Milos Raonic |

kable(tail(slam_win), row.names=FALSE)| YEAR|TOURNAMENT |WINNER |RUNNER.UP | |----:|:----------|:----------------|:------------------| | 1881|U.S. Open |Richard D. Sears |William E. Glyn | | 1881|Wimbledon |William Renshaw |John Hartley | | 1880|Wimbledon |John Hartley |Herbert Lawford | | 1879|Wimbledon |John Hartley |V. St. Leger Gould | | 1878|Wimbledon |Frank Hadow |Spencer Gore | | 1877|Wimbledon |Spencer Gore |William Marshall |

nr <- nrow(slam_win) start_year <- slam_win[nr, "YEAR"] end_year <- slam_win[1, "YEAR"] (years_span <- end_year - start_year + 1)[1] 141

(total_slam_winners <- length(unique(slam_win[,"WINNER"])))[1] 166

So we have 166 winners spanning over 141 years of Tennis Grand Slam tournaments. We observe that during first and second World Wars, a reduced number of Grand Slam tournaments were played for obvious reasons.

slam_win_df <- as.data.frame(table(slam_win[,"WINNER"])) slam_win_df = slam_win_df %>% arrange(desc(Freq)) # computing champions' leaderboard position pos <- rep(0, nrow(slam_win_df)) pos[1] <- 1 for(i in 2:nrow(slam_win_df)) { pos[i] <- ifelse(slam_win_df$Freq[i] != slam_win_df$Freq[i-1], i, pos[i-1]) } last_win_year = sapply(slam_win_df$Var1, function(x) {slam_win %>% filter(WINNER == x) %>% dplyr::select(YEAR) %>% max()}) # creating and showing leaderboard dataframe slam_winners <- data.frame(RANK = pos, PLAYER = slam_win_df$Var1, WINS = slam_win_df$Freq, LAST_WIN_YEAR = last_win_year) kable(slam_winners)| RANK|PLAYER | WINS| LAST_WIN_YEAR| |----:|:---------------------------|----:|-------------:| | 1|Roger Federer | 19| 2017| | 2|Rafael Nadal | 16| 2017| | 3|Pete Sampras | 14| 2002| | 4|Novak Djokovic | 12| 2016| | 4|Roy Emerson | 12| 1967| ...

The WINS and log(WINS) distribution density are shown.

par(mfrow=c(1,2)) plot(density(slam_winners$WINS), main = "Wins Density") plot(density(log(slam_winners$WINS)), main = "Log Wins Density")

You may want to arrange the same dataframe ordering by the champions’ last win year.

par(mfrow=c(1,1)) slam_winners_last_win_year = slam_winners %>% arrange(LAST_WIN_YEAR) kable(slam_winners %>% arrange(desc(LAST_WIN_YEAR)))| RANK|PLAYER | WINS| LAST_WIN_YEAR| |----:|:---------------------------|----:|-------------:| | 1|Roger Federer | 19| 2017| | 2|Rafael Nadal | 16| 2017| | 4|Novak Djokovic | 12| 2016| | 43|Andy Murray | 3| 2016| ...

We may want to visualize the timeline of the number of Tennis Grand Slam champions.

df_nwin = data.frame() for (year in start_year : end_year) { n_slam_winners = slam_win %>% filter(YEAR <= year) %>% dplyr::select(WINNER) %>% unique %>% nrow df_nwin = rbind(df_nwin, data.frame(YEAR = year, N_WINNERS = n_slam_winners)) } plot(x = df_nwin$YEAR, y = df_nwin$N_WINNERS, type ='s', xlab = "year", ylab = "no_winners") grid()

We may want to visualize the timeline of the Grand Slam tournaments wins record.

df2_nwin = data.frame() for (year in start_year : end_year) { slam_win_years = slam_win %>% filter(YEAR <= year) slam_win_record = as.data.frame(table(slam_win_years[,"WINNER"])) df2_nwin = rbind(df2_nwin, data.frame(YEAR = year, RECORD_WINS = max(slam_win_record$Freq))) } plot(x = df2_nwin$YEAR, y = df2_nwin$RECORD_WINS, type ='s', xlab = "year", ylab = "record_wins") grid()

It is interesting to have a look at the number of wins frequency.

wins_frequency <- as.data.frame(table(slam_winners[,"WINS"])) colnames(wins_frequency) <- c("WINS", "FREQUENCY") kable(wins_frequency)|WINS | FREQUENCY| |:----|---------:| |1 | 80| |2 | 25| |3 | 19| |4 | 12| |5 | 5| |6 | 3| |7 | 6| |8 | 8| |10 | 1| |11 | 2| |12 | 2| |14 | 1| |16 | 1| |19 | 1|

summary(slam_winners[,"WINS"])Min. 1st Qu. Median Mean 3rd Qu. Max. 1.000 1.000 2.000 2.946 3.750 19.000

We now take advantage of the `fitdist()`

function within the `fitdistr`

package to fit a lognormal distribution for our Grand Slam wins data.

fw <- fitdist(slam_winners$WINS, "lnorm") summary(fw)Fitting of the distribution ' lnorm ' by maximum likelihood Parameters : estimate Std. Error meanlog 0.7047927 0.06257959 sdlog 0.8062817 0.04425015 Loglikelihood: -316.7959 AIC: 637.5918 BIC: 643.8158 Correlation matrix: meanlog sdlog meanlog 1 0 sdlog 0 1

Then we can plot the distribution fit results.

plot(fw)

# left outliers quantile left_thresh <- 0.05 # right outliers quantile right_thresh <- 0.95 # determining the outliers slam_outlier <- getOutliersI(as.vector(slam_winners$WINS), FLim = c(left_thresh, right_thresh), distribution = "lognormal") # outliers are plotted in red color outlierPlot(slam_winners$WINS, slam_outlier, mode="qq")

The outliers are:

slam_winners[slam_outlier$iRight,]RANK PLAYER WINS LAST_WIN_YEAR 1 1 Roger Federer 19 2017 2 2 Rafael Nadal 16 2017

The mean and standard deviation associated to the log-normal fit are:

(mean_log <- fw$estimate["meanlog"])meanlog 0.7047927

(sd_log <- fw$estimate["sdlog"])sdlog 0.8062817

Now we compute the probability associated with 19 and 16 wins.

# clearing names names(mean_log) <- NULL names(sd_log) <- NULL # probability associated to 19 wins performance or more (lnorm_19 <- plnorm(19, mean_log, sd_log, lower.tail=FALSE))[1] 0.002736863

# probability associated to 16 wins performance or more (lnorm_16 <- plnorm(16, mean_log, sd_log, lower.tail=FALSE))[1] 0.005164628

However, if a random variable follows a log-normal distribution, its logarithm follows a normal distribution. Hence we fit the logarithm of the variable under analysis using a normal distribution and compare the results with above log-normal fit.

fw_norm <- fitdist(log(slam_winners$WINS), "norm") summary(fw_norm)Fitting of the distribution ' norm ' by maximum likelihood Parameters : estimate Std. Error mean 0.7047927 0.06257959 sd 0.8062817 0.04425015 Loglikelihood: -199.8003 AIC: 403.6006 BIC: 409.8246 Correlation matrix: mean sd mean 1 0 sd 0 1

Similarly, we plot the fit results.

plot(fw_norm)

# left outliers quantile left_thresh <- 0.05 # right outliers quantile right_thresh <- 0.95 slam_outlier <- getOutliersI(log(as.vector(slam_winners$WINS)), FLim = c(left_thresh, right_thresh), distribution = "normal") outlierPlot(slam_winners$WINS, slam_outlier, mode="qq")

The outliers are:

slam_winners[slam_outlier$iRight,]RANK PLAYER WINS LAST_WIN_YEAR 1 1 Roger Federer 19 2017 2 2 Rafael Nadal 16 2017

The mean and standard deviation values are:

# mean and standard deviation of the fitted lognormal distribution (mean_norm <- fw_norm$estimate["mean"])mean 0.7047927

(sd_norm <- fw_norm$estimate["sd"])sd 0.8062817

As we can see above, same fitting parameters result from the two approaches, even though with different log-likelihood, AIC and BIC metrics. Now we compute the probability associated to 19 and 16 wins together with their distance from the mean in terms of multiples of the standard deviation.

# clearing names names(mean_norm) <- NULL names(sd_norm) <- NULL # probability associated to the 19 wins performance or better (norm_19 <- pnorm(log(19), mean_norm, sd_norm, lower.tail=FALSE))[1] 0.002736863

# standard deviation times from the mean associated to 19 wins (deviation_19 <- (log(19) - mean_norm)/sd_norm)[1] 2.777747

# probability associated to the 16 wins performance or better (norm_16 <- pnorm(log(16), mean_norm, sd_norm, lower.tail=FALSE))[1] 0.005164628

# standard deviation times from the mean associated to 16 wins (deviation_16 <- (log(16) - mean_norm)/sd_norm)[1] 2.564608

As we can see above, we also obtained the same probability value as resulting from the log-normal distribution fit. In the following, we consider the second fitting approach (the one which takes the log of the original variable) for easing the computation of the distance from the mean in terms of multiples of the standard deviation.

Let us see again the plot of the number of tennis Grand Slam winners against their timeline.

year <- df_nwin $YEAR winners <- df_nwin$N_WINNERS plot(x = year, y = winners, type ='l') grid()

It is visually evident the linear relationship between the variables. Hence, a linear regression would help in understanding how many newbie Grand Slam winners we may have each year.

year_lm <- lm(winners ~ year) summary(year_lm)Call: lm(formula = winners ~ year) Residuals: Min 1Q Median 3Q Max -9.8506 -1.9810 -0.4683 2.6102 6.2866 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -2.388e+03 1.220e+01 -195.8 <2e-16 *** year 1.270e+00 6.264e-03 202.8 <2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 3.027 on 139 degrees of freedom Multiple R-squared: 0.9966, Adjusted R-squared: 0.9966 F-statistic: 4.113e+04 on 1 and 139 DF, p-value: < 2.2e-16

Coefficients are reported as significant and the R-squared value is very high. On average each year, 1.27 Grand Slam tournaments newbie winners show up. Residuals analysis has not been reported for brevity. Similarly, we can regress the year against the number of winners.

n_win_lm <- lm(year ~ winners) summary(n_win_lm)Call: lm(formula = year ~ winners) Residuals: Min 1Q Median 3Q Max -4.8851 -1.9461 0.3268 1.4327 7.9641 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 1.880e+03 3.848e-01 4886.4 <2e-16 *** winners 7.846e-01 3.868e-03 202.8 <2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 2.379 on 139 degrees of freedom Multiple R-squared: 0.9966, Adjusted R-squared: 0.9966 F-statistic: 4.113e+04 on 1 and 139 DF, p-value: < 2.2e-16

Coefficients are reported as significant and the R-squared value is very high. On average, each new Grand Slam tournaments winner appears every 0.7846 fraction of year. Residuals analysis has not been reported for brevity. Such model can be used to predict the year when a given number of Grand Slam winners may show up. For example, considering Federer, Nadal and Sampras wins, we obtain:

# probability associated to the 14 wins performance or better (norm_14 <- pnorm(log(14), mean_norm, sd_norm, lower.tail=FALSE))[1] 0.008220098

# standard deviation times from the mean associated to 14 wins (deviation_14 <- (log(14) - mean_norm)/sd_norm)[1] 2.398994

# average number of Grand Slam winners to expect for a 19 Grand Slam wins champion (x_19 <- round(1/norm_19))[1] 365

# average number of Grand Slam winners to expect for a 16 Grand Slam wins champion (x_16 <- round(1/norm_16))[1] 194

# average number of Grand Slam winners to expect for a 14 Grand Slam wins champion (x_14 <- round(1/norm_14))[1] 122

The x_19, x_16 and x_14 values can be interpreted as the average size of Grand Slam tournaments winners population to therein find a 19, 16, 14 times winner respectively. As a consequence, the prediction of the calendar years to see players capable to win 19, 16, 14 times is:

predict(n_win_lm, newdata = data.frame(winners = c(x_19, x_16, x_14)), interval = "p")fit lwr upr 1 2166.732 2161.549 2171.916 2 2032.573 2027.779 2037.366 3 1976.084 1971.355 1980.813

The table above shows the earliest year when, on average, to expect a Grand Slam tournament winner capable to win 19, 16, 14 times (fit column), together with lower (lwr) and upper (upr) bound predicted values. In the real world, 14 wins champion showed up a little bit later than expected by our linear regression model, whilst 16 and 19 win champions did much earlier than expected by the same model.

In our previous analysis, we computed the distance from the mean for 19, 16 and 14 Grand Slam tournaments win probabilities, distance expressed in terms of multiples of the standard deviation.

deviation_19[1] 2.777747

deviation_16[1] 2.564608

deviation_14[1] 2.398994

Based on above values, we can compute the probability to have a 19, 16 and 14 times winner. As we saw before, we resume up such result using the `pnorm()`

function.

(prob_19 <- pnorm(mean_norm+deviation_19*sd_norm, mean_norm, sd_norm, lower.tail = FALSE))[1] 0.002736863

(prob_16 <- pnorm(mean_norm+deviation_16*sd_norm, mean_norm, sd_norm, lower.tail = FALSE))[1] 0.005164628

(prob_14 <- pnorm(mean_norm+deviation_14*sd_norm, mean_norm, sd_norm, lower.tail = FALSE))[1] 0.008220098

Similarly to the Intellectual Quotient (IQ) assigning a value equal to 100 at the mean and +/- 15 points for each standard deviation of distance from the mean itself, we can figure out a Tennis Quotient (TQ).

Here in below, we show a plot to remember how the IQ is computed:

We notice that the median of our player’s population scores a TQ equal to 100.

(median_value <- median(slam_winners$WINS))[1] 2

(deviation_median <- (log(median_value) - mean_norm)/sd_norm)[1] -0.01444343

round(100 + 15*deviation_median)[1] 100

We now compute the Tennis Quotients (TQ) for leading champions.

(Federer_TQ <- round(100 + 15*deviation_19))[1] 142

(Nadal_TQ <- round(100 + 15*deviation_16))[1] 138

(Sampras_TQ <- round(100 + 15*deviation_14))[1] 136

And what about for example 7 times Grand Slam tournament winner?

(deviation_7 <- (log(7) - mean_norm)/sd_norm)[1] 1.53931

TQ_7wins <- round(100 + 15*deviation_7)[1] 123

Let us then compute the Tennis Quotients (TQ) for all our tennis Grand Slam tournaments winners.

tq_compute <- function(x) { deviation_x <- (log(x) - mean_norm)/sd_norm round(100 + 15*deviation_x) } slam_winners = slam_winners %>% mutate(TQ = tq_compute(WINS)) kable(slam_winners)| RANK|PLAYER | WINS| LAST_WIN_YEAR| TQ| |----:|:---------------------------|----:|-------------:|---:| | 1|Roger Federer | 19| 2017| 142| | 2|Rafael Nadal | 16| 2017| 138| | 3|Pete Sampras | 14| 2002| 136| | 4|Novak Djokovic | 12| 2016| 133| ....

We then visualize the top twenty Grand Slam tournaments champions.

ggplot(data=slam_winners[1:20,], aes(x=reorder(PLAYER, TQ), y=TQ, fill = TQ)) + geom_bar(stat ="identity") + coord_flip() + scale_y_continuous(breaks = seq(0,150,10), limits = c(0,150)) + xlab("PLAYER")

The answers to our initial questions:

– The log-normal distribution

– Based upon the fitted distribution and taking advantage of`plnorm()`

or`pnorm()`

stats package functions, probabilities have been computed

– A linear increase is a very good fit for that, resulting in significative regression coefficients and high R-squared values

– Yes, we defined the Tennis Quotient similarly to the Intellectual Quotient and show the resulting leaderboard. Federer is confirmed “genius” in that respect, however, a few other very talented players are not that far from him.

If you have any questions, please feel free to comment below.

**References**