DataScience+ An online community for showcasing R & Python tutorials. It operates as a networking platform for data scientists to promote their talent and get hired. Our mission is to empower data scientists by bridging the gap between talent and opportunity.
Getting Data

Pulling Data Out of Census Spreadsheets Using R

  • Published on August 31, 2017 at 8:22 am

In this post, I show a method for extracting small amounts of data from somewhat large Census Bureau Excel spreadsheets, using R.  The objects of interest are expenditures of state and local governments on hospital capital in Iowa for the years 2004 to 2014. The data can be found at http://www2.census.gov/govs/local/. The files at the site are yearly files.

The files to be used are those named ‘yrslsstab1a.xls’, where ‘yr‘ is replaced by the two digits of the year for a given year, for example, ’04’ or ’11’. The individual yearly files contain data for the whole country and for all of the states, over all classes of state and local government revenue and expenditures. The task is to extract three data points from each file – state and local expenditures, state expenditures, and local expenditures – for the state of Iowa.

The structure of the files varies from year to year, so first reviewing the files is important. I found two patterns for the expenditure data – data with and data without margins of error. The program locates the columns for Iowa and the row for hospital capital expenditures. Then, the data are extracted and put in a matrix for outputting.

First, character strings of the years are created, to be used in referencing the data sets, and a data frame is created to contain the final result.

years = c(paste("0", 4:9, sep=""), paste(10:14))
hospital.capital.expend <- data.frame(NA,NA,NA)

Second, the library ‘gdata’ is opened. The library ‘gdata’ contains functions useful for manipulating data in R and provides for reading data into R from an URL containing an Excel file.

library(gdata)

Third, a loop is run through the eleven years to fill in the ‘hospital.capital.expend’ data frame with the data from each year. The object ‘fn’ contains the URL of the Excel file for a given year. The function ‘paste’ concatenates the three parts of the URL. Note that ‘sep’ must be set to “” in the function.

for (i in 1:11)
{
fn = paste("http://www2.census.gov/govs/local/",years[i], "slsstab1a.xls", sep="")

Next, the Excel file is read into the object ‘ex’. The argument ‘header’ is set to ‘F’ so that all of the rows are input. Also, since all of the columns contain some character data, all of the data is forced to be character by setting ‘stringsAsFactors’ to ‘F’.  The function used to read the spreadsheet is ‘read.xls’ in the package ‘gdata’.

ex = read.xls(fn, sheet=1, header=F, stringsAsFactors=F)

Next, the row and column indices of the data are found using the functions ‘grepl’ and ‘which’. The first argument in ‘grepl’ is a pattern to be matched. For a data frame, the ‘grepl’ function returns a logical vector of ‘T’s and ‘F’s of length equal to the number of columns in the data frame – giving ‘T’ if the column contains the pattern and ‘F’ if not. Note that ‘*’ can be used as a wild card in the pattern.  For a character vector, ‘grepl’ returns ‘T’ if an element of the vector matches the pattern and ‘F’ otherwise.

The ‘which’ function returns the indices of a logical vector which have the value ‘T’. So, ‘ssi1’ contains the index of the column containing ‘Hospital’ and ‘ssi2’ contains the index of the column containing ‘Iowa’. The object ‘ssi4’ contains the rows containing ‘Hospital’, since ‘ex[,ssi1]’ is a character vector instead of a data frame.   For all of the eleven years, the second incidence of ‘Hospital’ in the ‘Hospital’ column contains hospital expenditures.

ssi1 = which(grepl("*Hospital*", ex, ignore.case=T))
ssi2 = which(grepl("Iowa", ex, ignore.case=T))
ssi4 = which(grepl("Hospital",ex[,ssi1], ignore.case=T))[2]

Next, the data are extracted, and the temporary files are removed. If the column index of ‘Iowa’ is less that 80, no margin of error was included and the data points are in the column of ‘Iowa’ and in the next two columns. If the column index of ‘Iowa’ is larger than 79, a margin of error was included and the data are in the column of ‘Iowa’ and the second and third columns to the right.

The capital expenditures are found one row below the ‘Hospital’ row, so one is added to ‘ssi4’ to get the correct row index. The data are put in the data frame ‘df.1’ which is row bound to the data frame ‘hospital.capital.expend’. The names of the columns in ‘df.1’ are set to ‘NA’ so that the row bind will work.  Then the temporary files are removed and the loop ends.

if (ssi2<80) ssi5=ssi2+0:2
else ssi5 = ssi2 + c(0,2,3)
df.1 = data.frame(ex[ssi4+1, ssi5], stringsAsFactors = F)
names(df.1)=c(NA,NA,NA)
hospital.capital.expend = rbind(hospital.capital.expend, df.1)
rm(fn, ex, df.1, ssi1, ssi2, ssi4, ssi5)
}

There are just a few steps left to clean things up. The first row of ‘hospital.capital.expend’, which just contains ‘NA’s, is removed. Then, the commas within the numbers, as extracted from the census file, are removed from the character strings using the function ‘gsub’ and the data frame is converted to a numeric matrix. Next, the eleven years are column bound to the matrix. Last, the columns are given names and the matrix is printed out.

hospital.capital.expend = as.matrix(hospital.capital.expend[-1,])
hospital.capital.expend = matrix(as.numeric(gsub(",","",hospital.capital.expend)),ncol=3)
hospital.capital.expend = cbind(2004:2014,hospital.capital.expend)
colnames(hospital.capital.expend) = c("Year", "State.Local", "State", "Local")
print(hospital.capital.expend)

That’s it!!!