We publish R tutorials from scientists at academic and scientific institutions with a goal to give everyone in the world access to a free knowledge. Our tutorials cover different topics including statistics, data manipulation and visualization!
Visualizing Data

# Building Wordclouds in R

In this article, I will show you how to use text data to build word clouds in R. We will use a dataset containing around 200k Jeopardy questions. The dataset can be downloaded here (thanks to reddit user trexmatt for providing the dataset).

We will require three packages for this: tm, SnowballC, and wordcloud.

library(tm)
library(SnowballC)
library(wordcloud)

jeopQ <- read.csv('JEOPARDY_CSV.csv', stringsAsFactors = FALSE)


The actual questions are available in the Question column.

Now, we will perform a series of operations on the text data to simplify it.
First, we need to create a corpus.

jeopCorpus <- Corpus(VectorSource(jeopQ\$Question))


Next, we will convert the corpus to a lowercase.

jeopCorpus <- tm_map(jeopCorpus, content_transformer(tolower))


Then, we will remove all punctuation and stopwords, and convert it to a plain text document.. Stopwords are commonly used words in the English language such as I, me, my, etc. You can see the full list of stopwords using stopwords('english').

jeopCorpus <- tm_map(jeopCorpus, removePunctuation)
jeopCorpus <- tm_map(jeopCorpus, PlainTextDocument)
jeopCorpus <- tm_map(jeopCorpus, removeWords, stopwords('english'))


Next, we will perform stemming. This means that all the words are converted to their stem (Ex: learning -> learn, walked -> walk, etc.). This will ensure that different forms of the word are converted to the same form and plotted only once in the wordcloud.

jeopCorpus <- tm_map(jeopCorpus, stemDocument)


Now, we will plot the wordcloud.

wordcloud(jeopCorpus, max.words = 100, random.order = FALSE)


This will produce the following wordcloud:

There are a few ways to customize it.

• scale: This is used to indicate the range of sizes of the words.
• max.words and min.freq: These parameters are used to limit the number of words plotted. max.words will plot the specified number of words and discard least frequent terms, whereas, min.freq will discard all terms whose frequency is below the specified value.
• random.order: By setting this to FALSE, we make it so that the words with the highest frequency are plotted first. If we don’t set this, it will plot the words in a random order, and the highest frequency words may not necessarily appear in the center.
• rot.per: This value determines the fraction of words that are plotted vertically.
• colors: The default value is black. If you want to use different colors based on frequency, you can specify a vector of colors, or use one of the pre-defined color palettes. You can find a list here.