DataScience+ An online community for showcasing R & Python tutorials. It operates as a networking platform for data scientists to promote their talent and get hired. Our mission is to empower data scientists by bridging the gap between talent and opportunity.
Introduction

Working with the Data Frame in R

Here you will learn about transforming, merging, ordering a data frame, changing the column order, removing a variable, sub setting and indexing.

Transposing

This means put the rows as columns and the columns as the rows, this is done very easily in one line:

data(mtcars)
mtcars <- t(mtcars)

Merging two Data Frame

Often it happen that you have information scattered across several dataset and sometimes you might want to bring them together in the same dataset, as example let’s take again our mtcars data frame and say that we have colour information for each model in a different data frame, we want to merge the two data frame in one by associating to each model its correct colour.

Here is the code:

data(mtcars)
df <- data.frame(Color=c(rep("red",5),rep("grey",10),rep("white",10),rep("black",5),rep("chrome",2)))
df$Model <- rownames(mtcars)
mtcars$Model <- rownames(mtcars)
df_merged <- merge(mtcars,df)
head(df_merged)
             Model  mpg cyl disp  hp drat    wt  qsec vs am gear carb Color
1        AMC Javelin 15.2   8  304 150 3.15 3.435 17.30  0  0    3    2 white
2 Cadillac Fleetwood 10.4   8  472 205 2.93 5.250 17.98  0  0    3    4  grey
3         Camaro Z28 13.3   8  350 245 3.73 3.840 15.41  0  0    3    4 white
4  Chrysler Imperial 14.7   8  440 230 3.23 5.345 17.42  0  0    3    4 white
5         Datsun 710 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1   red
6   Dodge Challenger 15.5   8  318 150 2.76 3.520 16.87  0  0    3    2 white

Here is a new function merge that take as arguments two data frames and associate row that share a common value at a common column, so R look at the column names of the two data frame and when two of them are equivalent it merges the data frame by this column.

Now let’s play a bit around with this function:

colnames(df)[2] <- "Models"
df[33,] <-c ("grey","Citroën")
df_merged <- merge(mtcars,df,by.x="Model",by.y="Models",all.x=TRUE,all.y=TRUE)
df_merged[df_merged$Model=="Citroën",]
    Model mpg cyl disp hp drat wt qsec vs am gear carb Color
5 Citroën  NA  NA   NA NA   NA NA   NA NA NA   NA   NA  grey

What I did there was to change the name of the common column, and to add a line to the second data frame, this time we need to specify to R which column to merge by, by.x give the name of the column of the first data frame and by.y give the name of the column of the second data frame. Since we have more line in the second data frame we need to specify all.y as TRUE so that all row that didn’t fit to the first data frame will be added and filled with NA were values are missing (for every column of the first data frame. In this example all.x is not necessary but we still use it for example purposes. I will explain later the awful looking last line when I will speak about indexing.

So to sum up merge put together two data frame by a common column, certain precautions mus be taken, if there is more than one match between rows (for example if we had two times ‘Citroën’ in one data frame) then multiple row will be created, inflating the size of your data frame, so usualy this function is used one there is only one match between rows.

Sorting a data frame

Another common operation done on data frame is ordering, so classify the rows in a certain order (from biggest to lowest..), let’s use again our merged mtcars data frame and sort it in ascending order (lowest first) on the number of gears;

df_order <- df_merged[order(df_merged$gear),]
df_merged$gear
df_order$gear
[1]  3  3  3  3 NA  4  3  3  5  4  4  5  4  3  3  3  5  5  4  4  4  4  4  4  3  3  3  3  5  4  3  3  4
[1]  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  4  4  4  4  4  4  4  4  4  4  4  4  5  5 5  5  5 NA

The function order returns a vector of row indices sorting the rows in ascending order based on the vector given (here df_merged$gear). If we want to sort in a decreasing order here is an equivalent command (note the adding of the minus sign):

df_order <- df_merged[order(-df_merged$gear),]

Now a very handy extension of this would be if we wanted to sort the column by a ascending order of their column name (first A last Z)

df_order <- df_merged[,order(colnames(df_merged))]
colnames(df_merged)
colnames(df_order)
[1] "Model" "mpg"   "cyl"   "disp"  "hp"    "drat"  "wt"    "qsec"  "vs"    "am"    "gear"  "carb"  "Color"
[1] "am"    "carb"  "Color" "cyl"   "disp"  "drat"  "gear"  "hp"    "Model" "mpg"   "qsec" "vs"    "wt"

Here we put the comma as the first argument which R interpret as columns, then we gave as argument the list of the column names.

Changing the column order

We already saw just before how to order the column, but if we want to be finer for example only wanting to put the last column at the second place we need to use a different strategy;

df_merged <- df_merged[,c(1,13,2:12)]
head(df_merged)
               Model Color  mpg cyl disp  hp drat    wt  qsec vs am gear carb
1        AMC Javelin white 15.2   8  304 150 3.15 3.435 17.30  0  0    3    2
2 Cadillac Fleetwood  grey 10.4   8  472 205 2.93 5.250 17.98  0  0    3    4
3         Camaro Z28 white 13.3   8  350 245 3.73 3.840 15.41  0  0    3    4
4  Chrysler Imperial white 14.7   8  440 230 3.23 5.345 17.42  0  0    3    4
5            Citroën  grey   NA  NA   NA  NA   NA    NA    NA NA NA   NA   NA
6         Datsun 710   red 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1

What R did was to take the first column in the df_merged data frame and put it in the same place, then it took the 13eth column and put it in the second place and all remaining columns were rearranged. The c is necessary (as always!) because we are giving a list of more than one elements (this always true in R whenever you want to give more than one element outside a function context you have to use c).

Deleting a variable

There are numerous way to delete a variable from a data frame here is just one that I selected because you don’t need to know about the position of the variable in the data frame just the name of it:

df_merged$Color <- NULL
head(df_merged)
              Model  mpg cyl disp  hp drat    wt  qsec vs am gear carb
1        AMC Javelin 15.2   8  304 150 3.15 3.435 17.30  0  0    3    2
2 Cadillac Fleetwood 10.4   8  472 205 2.93 5.250 17.98  0  0    3    4
3         Camaro Z28 13.3   8  350 245 3.73 3.840 15.41  0  0    3    4
4  Chrysler Imperial 14.7   8  440 230 3.23 5.345 17.42  0  0    3    4
5         Datsun 710 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
6   Dodge Challenger 15.5   8  318 150 2.76 3.520 16.87  0  0    3    2

Very easy we set the variable color to NULL so all its content is erased.

Subseting and Indexing

This is a very powerful tool it allows you to take subset of the data frame, there are two ways to do it the first one by using the squared bracket and the index that R automatically use (remember when we use them to select specific columns in a matrix) or to use the function subset, from our data frame we want to select only the cars with 4 or more gears

new <- df_merged[df$gear>=4,]
new2 <- subset(df_merged,gear>=4)
new$gear
new2$gear
[1] 4 5 4 4 5 4 5 5 4 4 4 4 4 4 5 4 4
[1] 4 5 4 4 5 4 5 5 4 4 4 4 4 4 5 4 4

In both ways there is a logical expression (here >=) the only major difference is that in the first one we have to add a comma at the end to work on every rows when the function subset does this automatically. This subseting is very often used so I will show a few more example of what we can do:

n <- 8
new <-df_merged[df_merged$cyl%in%n,]
new$cyl
[1] 8 8 8 8 8 8 8 8 8 8 8 8 8 8

Here the %in% mean the intersection between df_merged$cyl and n, this intersection is very useful when you have a list of names and you want to extract information related to them in a data frame.

new <- subset(df_merged, gear==4 & cyl!=8)
new$cyl
new$gear
[1] 4 4 4 4 6 6 4 4 6 6 4 4
[1] 4 4 4 4 4 4 4 4 4 4 4 4

Now here is a nested subsetting in this case we are only keeping the rows (car models) that have 4 gears and not 8 cylinders (the != operator mean not equal to).

You can play around indefinitely with these tools, this allow you to really master your data. Please note that some packages helps you manage your data frame, a very handy one is the dplyr package I would recommend that you look into this. To fully appreciate the power of this package one need to have spent lines and lines of code writing what dplyr can elegantly do in a few lines.