We share R tutorials from scientists at academic and scientific institutions with a goal to give everyone in the world access to a free knowledge. Our tutorials cover different topics including statistics, data manipulation and visualization!

Programming

Machine learning makes sentiment analysis more convenient. This post would introduce how to do sentiment analysis with machine learning using R. In the landscape of R, the **sentiment** R package and the more general text mining package have been well developed by Timothy P. Jurka. You can check out the sentiment package and the fantastic RTextTools package. Actually, Timothy also writes an maxent package for low-memory multinomial logistic regression (also known as maximum entropy).

However, the naive bayes method is not included into RTextTools. The e1071 package did a good job of implementing the naive bayes method. e1071 is a course of the Department of Statistics (e1071), TU Wien. Its primary developer is David Meyer.

It is still necessary to learn more about **text analysis**. Text analysis in R has been well recognized (see the R views on natural language processing). Part of the success belongs to the tm package: A framework for text mining applications within R. It did a good job for text cleaning (stemming, delete the stopwords, etc) and transforming texts to document-term matrix (dtm). There is one paper about it. As you know the most important part of text analysis is to get the feature vectors for each document. The word feature is the most important one. Of course, you can also extend the **unigram** word features to **bigram and trigram**, and so on to **n-grams**. However, here for our simple case, we stick to the unigram word features.

Note that it’s easy to use `ngrams`

in R. In the past, the package of `Rweka`

supplies functions to do it, check this example. Now, you can set the `ngramLength`

in the function of create_matrix using `RTextTools`

.

The first step is to read data:

library(RTextTools) library(e1071) pos_tweets = rbind( c('I love this car', 'positive'), c('This view is amazing', 'positive'), c('I feel great this morning', 'positive'), c('I am so excited about the concert', 'positive'), c('He is my best friend', 'positive') ) neg_tweets = rbind( c('I do not like this car', 'negative'), c('This view is horrible', 'negative'), c('I feel tired this morning', 'negative'), c('I am not looking forward to the concert', 'negative'), c('He is my enemy', 'negative') ) test_tweets = rbind( c('feel happy this morning', 'positive'), c('larry friend', 'positive'), c('not like that man', 'negative'), c('house not great', 'negative'), c('your song annoying', 'negative') ) tweets = rbind(pos_tweets, neg_tweets, test_tweets)

Then we can build the document-term matrix:

# build dtmmatrix= create_matrix(tweets[,1], language="english", removeStopwords=FALSE, removeNumbers=TRUE, stemWords=FALSE)

Now, we can train the naive Bayes model with the training set. Note that, e1071 asks the response variable to be numeric or factor. Thus, we convert characters to factors here. This is a little trick.

# train the modelmat = as.matrix(matrix) classifier = naiveBayes(mat[1:10,], as.factor(tweets[1:10,2]) )

Now we can step further to test the accuracy.

# test the validitypredicted = predict(classifier, mat[11:15,]); predicted table(tweets[11:15, 2], predicted) recall_accuracy(tweets[11:15, 2], predicted)

Apparently, the result is the same with Python (compare it with the results in an another post).

As I mentioned, we can do it using `RTextTools`

. Let’s rock!

First, to specify our data:

# build the data to specify response variable, training set, testing set.container = create_container(matrix, as.numeric(as.factor(tweets[,2])), trainSize=1:10, testSize=11:15,virgin=FALSE)

Second, to train the model with multiple machine learning algorithms:

models = train_models(container, algorithms=c("MAXENT" , "SVM", "RF", "BAGGING", "TREE"))

Now, we can classify the testing set using the trained models.

results = classify_models(container, models)

How about the accuracy?

# accuracy tabletable(as.numeric(as.factor(tweets[11:15, 2])), results[,"FORESTS_LABEL"]) table(as.numeric(as.factor(tweets[11:15, 2])), results[,"MAXENTROPY_LABEL"])# recall accuracyrecall_accuracy(as.numeric(as.factor(tweets[11:15, 2])), results[,"FORESTS_LABEL"]) recall_accuracy(as.numeric(as.factor(tweets[11:15, 2])), results[,"MAXENTROPY_LABEL"]) recall_accuracy(as.numeric(as.factor(tweets[11:15, 2])), results[,"TREE_LABEL"]) recall_accuracy(as.numeric(as.factor(tweets[11:15, 2])), results[,"BAGGING_LABEL"]) recall_accuracy(as.numeric(as.factor(tweets[11:15, 2])), results[,"SVM_LABEL"])

To summarize the results (especially the validity) in a formal way:

# model summaryanalytics = create_analytics(container, results) summary(analytics) head(analytics@document_summary) analytics@ensemble_summar

To cross validate the results:

N=4 set.seed(2014) cross_validate(container,N,"MAXENT") cross_validate(container,N,"TREE") cross_validate(container,N,"SVM") cross_validate(container,N,"RF")

The results can be found on my Rpub page. It seems that maxent reached the same recall accuracy as naive Bayes. The other methods even did a worse job. This is understandable, since we have only a very small data set. To enlarge the training set, we can get a much better results for sentiment analysis of tweets using more sophisticated methods. I will show the results with anther example.

The data comes from victorneo. victorneo shows how to do sentiment analysis for tweets using Python. Here, I will demonstrate how to do it in R.

Read data:

################### "load data" ###################setwd("D:/Twitter-Sentimental-Analysis-master/") happy = readLines("./happy.txt") sad = readLines("./sad.txt") happy_test = readLines("./happy_test.txt") sad_test = readLines("./sad_test.txt") tweet = c(happy, sad) tweet_test= c(happy_test, sad_test) tweet_all = c(tweet, tweet_test) sentiment = c(rep("happy", length(happy) ), rep("sad", length(sad))) sentiment_test = c(rep("happy", length(happy_test) ), rep("sad", length(sad_test))) sentiment_all = as.factor(c(sentiment, sentiment_test)) library(RTextTools)

First, try naive Bayes.

# naive bayesmat= create_matrix(tweet_all, language="english", removeStopwords=FALSE, removeNumbers=TRUE, stemWords=FALSE, tm::weightTfIdf) mat = as.matrix(mat) classifier = naiveBayes(mat[1:160,], as.factor(sentiment_all[1:160])) predicted = predict(classifier, mat[161:180,]); predicted table(sentiment_test, predicted) recall_accuracy(sentiment_test, predicted)

Then, try the other methods:

# the other methodsmat= create_matrix(tweet_all, language="english", removeStopwords=FALSE, removeNumbers=TRUE, stemWords=FALSE, tm::weightTfIdf) container = create_container(mat, as.numeric(sentiment_all), trainSize=1:160, testSize=161:180,virgin=FALSE) #可以设置removeSparseTerms models = train_models(container, algorithms=c("MAXENT", "SVM", #"GLMNET", "BOOSTING", "SLDA","BAGGING", "RF", # "NNET", "TREE" ))# test the modelresults = classify_models(container, models) table(as.numeric(as.numeric(sentiment_all[161:180])), results[,"FORESTS_LABEL"]) recall_accuracy(as.numeric(as.numeric(sentiment_all[161:180])), results[,"FORESTS_LABEL"])

Here we also want to get the formal test results, including:

**analytics@algorithm_summary**: Summary of precision, recall, f-scores, and accuracy sorted by topic code for each algorithm**analytics@label_summary**: Summary of label (e.g. Topic) accuracy**analytics@document_summary**: Raw summary of all data and scoring**analytics@ensemble_summary**: Summary of ensemble precision/coverage. Uses the n variable passed into create_analytics()

Now let’s see the results:

# formal testsanalytics = create_analytics(container, results) summary(analytics) head(analytics@algorithm_summary) head(analytics@label_summary) head(analytics@document_summary) analytics@ensemble_summary # Ensemble Agreement# Cross ValidationN=3 cross_SVM = cross_validate(container,N,"SVM") cross_GLMNET = cross_validate(container,N,"GLMNET") cross_MAXENT = cross_validate(container,N,"MAXENT")

You can find that compared with naive Bayes, the other algorithms did a much better job to achieve a recall accuracy higher than 0.95. Check the results on Rpub.

If you have any question, feel free to post a comment below.