DataScience+ An online community for showcasing R & Python tutorials. It operates as a networking platform for data scientists to promote their talent and get hired. Our mission is to empower data scientists by bridging the gap between talent and opportunity.

How to Search PubMed with RISmed package in R

In the last tutorial, we developed a simple shiny App to provide a tool to collect and analyze PubMed data. There was some interest in learning more about RISmed itself, so I’ll back up a little and present some of the core RISmed package.

About PubMed and RISmed

PubMed is a public query database of published journal articles and other literature maintained and made available by the National Institutes of Health. The RISmed packages extracts content from the Entrez Programming Utilities (E-Utilities) interface to the PubMed query and database at NCBI into R. You will find great tutorials to get RISmed up and running by checking out this introduction or this very nice post by some dude named Dave Tang. For raw code, you can find Stephanie Kovalchik’s terrific RISmed github page here.

Introducing EUtilsSummary

RISmed allows us to call the EUtilsSummary() function to produce an object for us to analyze. EUtilsSummary() is overloaded with many arguments. First, we send a character string in quotes as the word/phrase/acronymn/author for which we wish to search PubMed. Here we will stick with type=esearch.

You can click the links above to learn more about arguments to EUtilsSummary(), here we are choosing PubMed, specifying a date range, and stopping after 500 publications are found.

res <- EUtilsSummary("pinkeye", type="esearch", db="pubmed", datetype='pdat', mindate=2000, maxdate=2015, retmax=500)

Now we have an object called res of class “Medline”. We can call QueryCount() to see how many publications we found.


There are 23 publications in PubMed between 2000 and 2015 that contain the word “pinkeye”.

Introducing EUtilsGet

The Medline class collects pertinent information about the publications in our search, and stores the information conveniently for analysis. We can access our data by calling EUtilsGet(). Here we create a new object t that is a character vector containing all the publication titles by using the EUtilsGet() function inside ArticleTitle(). We can access them like any other character vector.

"Infectious bovine keratoconjunctivitis (pinkeye)."
"Moraxella spp. isolated from field outbreaks of infectious bovine keratoconjunctivitis: a retrospective study of case submissions from 2010 to 2013."

res contains a lot of data relating to dates. For instance, we can access the year a publication was received, or the year it was accepted into the PubMed database.

y <- YearPubmed(EUtilsGet(res))
r <- YearReceived(EUtilsGet(res))
2015 2014 2013 2012 2011 2011 2011 2010 2010 2010 2010 2008 2008 2008 2008 2007 2007 2007 2007 2006 2005 2004 2002
NA   NA 2012 2012 2011 2011   NA 2010   NA   NA   NA   NA 2008 2007   NA 2007 2007 2007   NA   NA 2004   NA   NA

Be mindful of your data! If you make a plot of the number of documents received for publication in a given year, you can easily be missing many publications that simply don’t have that info recorded (this happens quite often). I suggest using YearPubmed, because it seems to have very few (if any) NA’s.

It’s a little tough to see a pattern in the YearPubmed results above, here is code to make a barplot.

names(count)<-c("Year", "Counts")
num <- data.frame(Year=count$Year, Counts=cumsum(count$Counts))
num$g <- "g"
names(num) <- c("Year", "Counts", "g")
q <- qplot(x=Year, y=Counts, data=count, geom="bar", stat="identity")
q <- q + ggtitle(paste("PubMed articles containing \'", g,            "\' ", "= ", max(num$Counts), sep="")) +
     ylab("Number of articles") +
     xlab(paste("Year \n Query date: ", Sys.time(), sep="")) +
     labs(colour="") +

Here is the plot we made.

The interesting thing about this (slightly contrived) plot is that there is a sudden increase in publications in 2007 that lasts for a few years. This is the sort of thing we might want to investigate. We might take a look at term frequencies to see if there were any differences between 2006 and 2007.

Introducing text mining

Here is a function that takes a year as an argument and returns the twenty most frequent words in the abstracts for that year, after some manipulations.

articles1<-data.frame('Abstract'=AbstractText(fetch), 'Year'=YearPubmed(fetch))
abstractsOnly<-paste(abstractsOnly, sep="", collapse="")
stsp<-rm_stopwords(abstractsOnly, stopwords = qdapDictionaries::Top100Words)
ord<-ord[order(ord$Freq, decreasing=TRUE),]

Let’s run our function twice and compare.

all<-cbind(oSix, oSeven)

                   2006 freq          2007 freq
10               bovine    6         bovis   12
38           infectious    6        calves   11
42 keratoconjunctivitis    6             m    8
60                  qtl    4           ibk    7
14          centimorgan    3     moraxella    7
24                    f    3       vaccine    6
32             hereford    3       control    5
50            offspring    3     cytotoxin    5
2                  also    2      isolates    5
4                 angus    2       pinkeye    5
5               animals    2   recombinant    5
13               cattle    2             s    5
16           chromosome    2      analysis    4
17                 cows    2    associated    4
20              disease    2             b    4
22             evidence    2      efficacy    4
27           fstatistic    2          gene    4
31                group    2          mbxa    4
33           identified    2 phospholipase    4
41             interval    2     pilinmbxa    4

In addition to recognizing that pinkeye seems to have nothing to do with second-graders, there does seem to be a difference in the types of words that are most frequent. Articles from 2006 have words associated with generational studies of different breeds like “offspring”, “centimorgan”, “angus” and “hereford”, “f-statistic”, “group”, and “interval”. In contrast, 2007 articles have biochemistry looking terms like “ibk”, “Moraxella”, “cytotoxin”, “isolates”, “recombinant”, “gene”, “mbxa”, “phospholipase”, and even perhaps “vaccine”. Maybe the spike in articles is the result of a discovery concerning the host pathways controlled by pinkeye bacteria toxins, or the emergence of a new strategy for treatment that targets a bacterial pathway.

A good strategy might be to search words like “Moraxella” or “mbxa” to see if there is a corresponding spike before 2007. Even “bovis” might be a candidate to search 2006 for a discovery that led to the spike in scientific interest. (It doesn’t seem like the kind of topic that would warrant a nyTimes API scrub or twitter feeds, maybe in the rural Midwest only). We could investigate funding sources by searching NIH grant titles, or searching rural news sources for clues.

EUtilsGet subtleties

Another strategy in this particular case might be reviewing publications by a researcher who specializes in pinkeye. Author(EUtilsGet()) returns a list of data frames, so we need to use an inline function to extract all the names.

Last<-sapply(auths, function(x)paste(x$LastName))
auths2<, dec=TRUE))
   LastName  ForeName Initials order
1 Kizilkaya     Kadir        K     1
2      Tait Richard G       RG     2
3   Garrick  Dorian J       DJ     3
4  Fernando   Rohan L       RL     4
5     Reecy   James M       JM     5

Angelos     5
Ball        3
O'Connor    3
Reecy       3
Tait        3
Casas       2

Maybe Dr. Angelos has written a review after 2006 that describes the change in interest in bovine pinkeye.

Until next time!