AnalyticsDojo

Introduction to R - Merging and Aggregating Data

rpi.analyticsdojo.com

Overview

  • Merging Dataframes
  • Aggregating Dataframes
  • Advanced Functions

Merging Data Frame with Vector

  • Can combine vector with data frame in multiple ways.
  • data.frame(a,b) where a & b can be vectors, matrices, or data frames.
#Below is the sample data we will be creating 2 dataframes  
key=(1:10)

#Here we are passing the row names and column names as a list. 
m<- data.frame(matrix(rnorm(40, mean=20, sd=5), nrow=10, ncol=4, dimnames=list((1:10),c("a","b","c","d"))))
m2<- data.frame(matrix(rnorm(40, mean=1000, sd=5), nrow=10, ncol=4, dimnames=list((1:10),c("e","f","g","h"))))

#This is one way of combining a vector with a dataframe. 
df<-  data.frame(key,m)
df2<- data.frame(key,m2)

#This is another way way of combining a vector with a dataframe. 
dfb<-  cbind(key,m)
df2b<- cbind(key,m2)

df
df2
dfb
df2b


<th scope=col>key</th><th scope=col>a</th><th scope=col>b</th><th scope=col>c</th><th scope=col>d</th>
1 16.19873 14.41495 27.73225 15.564688
2 27.76677 18.54772 21.06688 15.697810
3 11.68592 14.91207 22.82086 15.666790
4 16.39982 28.30284 10.97550 19.083633
5 22.16232 16.82574 14.28676 20.162797
6 17.17425 14.36932 18.55487 13.498498
7 20.15380 18.00987 15.99028 14.325000
8 20.68866 12.83505 25.24119 24.538494
9 18.84664 24.01079 12.69775 8.095156
10 16.29913 21.51270 15.14676 23.722103
<th scope=col>key</th><th scope=col>e</th><th scope=col>f</th><th scope=col>g</th><th scope=col>h</th>
1 1004.1240 997.4379 997.56971000.8540
2 1002.6933 998.40411009.17201010.4120
3 995.91381001.09591004.60251002.5405
4 999.5493 998.80541003.96491000.0133
5 1007.23731006.25801000.1882 992.9980
6 1000.2068 994.7482 998.28761002.7093
7 999.1622 998.6231 998.7175 998.0497
8 1003.12631002.72791004.16231000.5204
9 1003.1548 994.20301002.1614 999.3726
10 1006.87441004.2677 998.8720 993.5726
<th scope=col>key</th><th scope=col>a</th><th scope=col>b</th><th scope=col>c</th><th scope=col>d</th>
1 16.19873 14.41495 27.73225 15.564688
2 27.76677 18.54772 21.06688 15.697810
3 11.68592 14.91207 22.82086 15.666790
4 16.39982 28.30284 10.97550 19.083633
5 22.16232 16.82574 14.28676 20.162797
6 17.17425 14.36932 18.55487 13.498498
7 20.15380 18.00987 15.99028 14.325000
8 20.68866 12.83505 25.24119 24.538494
9 18.84664 24.01079 12.69775 8.095156
10 16.29913 21.51270 15.14676 23.722103
<th scope=col>key</th><th scope=col>e</th><th scope=col>f</th><th scope=col>g</th><th scope=col>h</th>
1 1004.1240 997.4379 997.56971000.8540
2 1002.6933 998.40411009.17201010.4120
3 995.91381001.09591004.60251002.5405
4 999.5493 998.80541003.96491000.0133
5 1007.23731006.25801000.1882 992.9980
6 1000.2068 994.7482 998.28761002.7093
7 999.1622 998.6231 998.7175 998.0497
8 1003.12631002.72791004.16231000.5204
9 1003.1548 994.20301002.1614 999.3726
10 1006.87441004.2677 998.8720 993.5726

Merging Columns of Data Frame with another Data Frame

  • Can combine data frame in multiple ways.
  • merge(a,b,by="key") where a & b are dataframes with the same keys.
  • cbind(a,b) where a & b are dataframes with the same number of rows.
# This manages the merge by an associated key.
df3 <- merge(df,df2,by="key")
# This just does a "column bind" 
df4<- cbind(df,df2)
df5<- data.frame(df,df2)
df3
df4
df5

<th scope=col>key</th><th scope=col>a</th><th scope=col>b</th><th scope=col>c</th><th scope=col>d</th><th scope=col>e</th><th scope=col>f</th><th scope=col>g</th><th scope=col>h</th>
1 16.27879923.00429719.26252422.11648 1004.4714 995.40551001.51561004.8862
2 19.28725219.22925318.81757513.67939 1000.73991004.6248 995.89501008.5242
3 18.83362316.14200418.45422418.29241 994.37941002.2578 998.6004 999.7609
4 23.78084710.93420713.44854017.83936 1001.47951009.08851002.5866 998.8287
5 13.98293516.92440219.03747520.19748 993.9745 999.98681001.0336 987.3751
6 15.53458923.437320 6.79592620.19305 996.12841008.84401005.51961003.6926
7 16.66007618.31507732.10713923.35534 994.50261004.99901004.09721005.6532
8 19.44779918.27838411.82310813.09162 1007.8858 993.87451005.1093 996.8686
9 9.22506924.92579613.86802117.06181 997.60261001.1045 991.79691000.5898
10 25.809451 7.49274718.48300324.99244 995.61901010.2642 998.6192 998.8618
<th scope=col>key</th><th scope=col>a</th><th scope=col>b</th><th scope=col>c</th><th scope=col>d</th><th scope=col>key</th><th scope=col>e</th><th scope=col>f</th><th scope=col>g</th><th scope=col>h</th>
1 16.27879923.00429719.26252422.11648 1 1004.4714 995.40551001.51561004.8862
2 19.28725219.22925318.81757513.67939 2 1000.73991004.6248 995.89501008.5242
3 18.83362316.14200418.45422418.29241 3 994.37941002.2578 998.6004 999.7609
4 23.78084710.93420713.44854017.83936 4 1001.47951009.08851002.5866 998.8287
5 13.98293516.92440219.03747520.19748 5 993.9745 999.98681001.0336 987.3751
6 15.53458923.437320 6.79592620.19305 6 996.12841008.84401005.51961003.6926
7 16.66007618.31507732.10713923.35534 7 994.50261004.99901004.09721005.6532
8 19.44779918.27838411.82310813.09162 8 1007.8858 993.87451005.1093 996.8686
9 9.22506924.92579613.86802117.06181 9 997.60261001.1045 991.79691000.5898
10 25.809451 7.49274718.48300324.99244 10 995.61901010.2642 998.6192 998.8618
<th scope=col>key</th><th scope=col>a</th><th scope=col>b</th><th scope=col>c</th><th scope=col>d</th><th scope=col>key.1</th><th scope=col>e</th><th scope=col>f</th><th scope=col>g</th><th scope=col>h</th>
1 16.27879923.00429719.26252422.11648 1 1004.4714 995.40551001.51561004.8862
2 19.28725219.22925318.81757513.67939 2 1000.73991004.6248 995.89501008.5242
3 18.83362316.14200418.45422418.29241 3 994.37941002.2578 998.6004 999.7609
4 23.78084710.93420713.44854017.83936 4 1001.47951009.08851002.5866 998.8287
5 13.98293516.92440219.03747520.19748 5 993.9745 999.98681001.0336 987.3751
6 15.53458923.437320 6.79592620.19305 6 996.12841008.84401005.51961003.6926
7 16.66007618.31507732.10713923.35534 7 994.50261004.99901004.09721005.6532
8 19.44779918.27838411.82310813.09162 8 1007.8858 993.87451005.1093 996.8686
9 9.22506924.92579613.86802117.06181 9 997.60261001.1045 991.79691000.5898
10 25.809451 7.49274718.48300324.99244 10 995.61901010.2642 998.6192 998.8618

Merging Rows of Data Frame with another Data Frame

  • rbind(a,b) combines rows of data frames of a and b.
  • rbind(a,b, make.row.names=FALSE) this will reset the index.
#Here we can combine rows with rbind. 
df5<-df
#The make Row
df6<-rbind(df,df5)
df6
df7<-rbind(df,df5, make.row.names=FALSE)
df7

<th scope=col>key</th><th scope=col>a</th><th scope=col>b</th><th scope=col>c</th><th scope=col>d</th><th scope=row>1</th><th scope=row>2</th><th scope=row>3</th><th scope=row>4</th><th scope=row>5</th><th scope=row>6</th><th scope=row>7</th><th scope=row>8</th><th scope=row>9</th><th scope=row>10</th><th scope=row>11</th><th scope=row>21</th><th scope=row>31</th><th scope=row>41</th><th scope=row>51</th><th scope=row>61</th><th scope=row>71</th><th scope=row>81</th><th scope=row>91</th><th scope=row>101</th>
1 16.19873 14.41495 27.73225 15.564688
2 27.76677 18.54772 21.06688 15.697810
3 11.68592 14.91207 22.82086 15.666790
4 16.39982 28.30284 10.97550 19.083633
5 22.16232 16.82574 14.28676 20.162797
6 17.17425 14.36932 18.55487 13.498498
7 20.15380 18.00987 15.99028 14.325000
8 20.68866 12.83505 25.24119 24.538494
9 18.84664 24.01079 12.69775 8.095156
10 16.29913 21.51270 15.14676 23.722103
1 16.19873 14.41495 27.73225 15.564688
2 27.76677 18.54772 21.06688 15.697810
3 11.68592 14.91207 22.82086 15.666790
4 16.39982 28.30284 10.97550 19.083633
5 22.16232 16.82574 14.28676 20.162797
6 17.17425 14.36932 18.55487 13.498498
7 20.15380 18.00987 15.99028 14.325000
8 20.68866 12.83505 25.24119 24.538494
9 18.84664 24.01079 12.69775 8.095156
10 16.29913 21.51270 15.14676 23.722103
<th scope=col>key</th><th scope=col>a</th><th scope=col>b</th><th scope=col>c</th><th scope=col>d</th>
1 16.19873 14.41495 27.73225 15.564688
2 27.76677 18.54772 21.06688 15.697810
3 11.68592 14.91207 22.82086 15.666790
4 16.39982 28.30284 10.97550 19.083633
5 22.16232 16.82574 14.28676 20.162797
6 17.17425 14.36932 18.55487 13.498498
7 20.15380 18.00987 15.99028 14.325000
8 20.68866 12.83505 25.24119 24.538494
9 18.84664 24.01079 12.69775 8.095156
10 16.29913 21.51270 15.14676 23.722103
1 16.19873 14.41495 27.73225 15.564688
2 27.76677 18.54772 21.06688 15.697810
3 11.68592 14.91207 22.82086 15.666790
4 16.39982 28.30284 10.97550 19.083633
5 22.16232 16.82574 14.28676 20.162797
6 17.17425 14.36932 18.55487 13.498498
7 20.15380 18.00987 15.99028 14.325000
8 20.68866 12.83505 25.24119 24.538494
9 18.84664 24.01079 12.69775 8.095156
10 16.29913 21.51270 15.14676 23.722103
df7

<th scope=col>key</th><th scope=col>a</th><th scope=col>b</th><th scope=col>c</th><th scope=col>d</th>
1 16.19873 14.41495 27.73225 15.564688
2 27.76677 18.54772 21.06688 15.697810
3 11.68592 14.91207 22.82086 15.666790
4 16.39982 28.30284 10.97550 19.083633
5 22.16232 16.82574 14.28676 20.162797
6 17.17425 14.36932 18.55487 13.498498
7 20.15380 18.00987 15.99028 14.325000
8 20.68866 12.83505 25.24119 24.538494
9 18.84664 24.01079 12.69775 8.095156
10 16.29913 21.51270 15.14676 23.722103
1 16.19873 14.41495 27.73225 15.564688
2 27.76677 18.54772 21.06688 15.697810
3 11.68592 14.91207 22.82086 15.666790
4 16.39982 28.30284 10.97550 19.083633
5 22.16232 16.82574 14.28676 20.162797
6 17.17425 14.36932 18.55487 13.498498
7 20.15380 18.00987 15.99028 14.325000
8 20.68866 12.83505 25.24119 24.538494
9 18.84664 24.01079 12.69775 8.095156
10 16.29913 21.51270 15.14676 23.722103

aggregate and by

  • Aggregation is a very important function.
  • Can have variables/analyses that happen at different levels.
  • by(x, by, FUN) provides similar functionality.
iris=read.csv(file="../../input/iris.csv", header=TRUE,sep=",")
head(iris)

<th scope=col>sepal_length</th><th scope=col>sepal_width</th><th scope=col>petal_length</th><th scope=col>petal_width</th><th scope=col>species</th>
5.1 3.5 1.4 0.2 setosa
4.9 3.0 1.4 0.2 setosa
4.7 3.2 1.3 0.2 setosa
4.6 3.1 1.5 0.2 setosa
5.0 3.6 1.4 0.2 setosa
5.4 3.9 1.7 0.4 setosa
iris<-read.csv(file="../../input/iris.csv", header=TRUE,sep=",")

#Aggregate by Species  aggregate(x, by, FUN, ...)
iris.agg<-aggregate(iris[,1:4], by=list("species" = iris$species), mean)
print(iris.agg)

#Notice this gives us the same output but structured differently. 
by(iris[, 1:4], iris$species, colMeans)

     species sepal_length sepal_width petal_length petal_width
1     setosa        5.006       3.418        1.464       0.244
2 versicolor        5.936       2.770        4.260       1.326
3  virginica        6.588       2.974        5.552       2.026
iris$species: setosa
sepal_length  sepal_width petal_length  petal_width 
       5.006        3.418        1.464        0.244 
------------------------------------------------------------ 
iris$species: versicolor
sepal_length  sepal_width petal_length  petal_width 
       5.936        2.770        4.260        1.326 
------------------------------------------------------------ 
iris$species: virginica
sepal_length  sepal_width petal_length  petal_width 
       6.588        2.974        5.552        2.026 

apply(plus lapply/sapply/tapply/rapply)

  • apply - Applying a function to an array or matrix, return a vector or array or list of values. apply(X, MARGIN, FUN, ...)
  • lapply - Apply a function to each element of a list or vector, return a list.
  • sapply - A user-friendly version if lapply. Apply a function to each element of a list or vector, return a vector.
  • tapply - Apply a function to subsets of a vector (and the subsets are defined by some other vector, usually a factor), return a vector.
  • rapply - Apply a function to each element of a nested list structure, recursively, return a list.
  • Some functions aren’t vectorized, or you may want to use a function on every row or column of a matrix/data frame, every element of a list, etc.
  • For more info see this tutorial

apply

  • apply - Applying a function to an array or matrix, return a vector or array or list of values. apply(X, MARGIN, FUN, ...)
  • If you are using a data frame the data types must all be the same.
  • `apply(X, MARGIN, FUN, …) where X is an array or matrix.
  • MARGIN is a vector giving the where function should be applied. E.g., for a matrix 1 indicates rows, 2 indicates columns, c(1, 2) indicates rows and columns.
  • FUN is any function.
iris<-read.csv(file="../../input/iris.csv", header=TRUE,sep=",")
iris$sum<-apply(iris[1:4], 1, sum) #This provides a sum across  for each row. 
iris$mean<-apply(iris[1:4], 1, mean)#This provides a mean across collumns for each row. 
head(iris)
apply(iris[1:4], 2, mean)

<th scope=col>sepal_length</th><th scope=col>sepal_width</th><th scope=col>petal_length</th><th scope=col>petal_width</th><th scope=col>species</th><th scope=col>sum</th><th scope=col>mean</th>
5.1 3.5 1.4 0.2 setosa10.2 2.550
4.9 3.0 1.4 0.2 setosa 9.5 2.375
4.7 3.2 1.3 0.2 setosa 9.4 2.350
4.6 3.1 1.5 0.2 setosa 9.4 2.350
5.0 3.6 1.4 0.2 setosa10.2 2.550
5.4 3.9 1.7 0.4 setosa11.4 2.850
<dl class=dl-horizontal>
sepal_length
5.84333333333333
sepal_width
3.054
petal_length
3.75866666666667
petal_width
1.19866666666667
</dl>

lapply & sapply

  • lapply - Apply a function to each element of a list or vector, return a list.
  • lapply(X, FUN, ...)
  • sapply - A user-friendly version if lapply. Apply a function to each element of a list or vector, return a vector.
  • sapply(X, FUN, ...)
# create a list with 2 elements
sample <- list("count" = 1:5, "numbers" =5:10)

# sum each and return as a list. 
sample.sum<-lapply(sample, sum)

class(sample.sum)
print(c(sample.sum, sample.sum["numbers"],sample.sum["count"]))


'list'
$count
[1] 15

$numbers
[1] 45

$numbers
[1] 45

$count
[1] 15

# create a list with 2 elements
sample <- list("count" = 1:5, "numbers" =5:10)

# sum each and return as a list. 
sample.sum<-sapply(sample, sum)

class(sample.sum)
print(c(sample.sum, sample.sum["numbers"],sample.sum["count"],sample.sum[["count"]]))

#Note the differenece between #sample.sum[["count"]] and sample.sum["count"]

'integer'
  count numbers numbers   count         
     15      45      45      15      15 
# We can also utilize simple 
square<-function(x) x^2
square(1:5)

# We can use our own function here.     
sapply(1:10, square)

#We can also specify the function directly in sapply.
sapply(1:10, function(x) x^2)


<ol class=list-inline>
  • 1
  • 4
  • 9
  • 16
  • 25
  • </ol>
    <ol class=list-inline>
  • 1
  • 4
  • 9
  • 16
  • 25
  • 36
  • 49
  • 64
  • 81
  • 100
  • </ol>
    <ol class=list-inline>
  • 1
  • 4
  • 9
  • 16
  • 25
  • 36
  • 49
  • 64
  • 81
  • 100
  • </ol>

    tapply

    • tapply - Apply a function to subsets of a vector (and the subsets are defined by some other vector, usually a factor), return a vector.
    • Can do something similar to aggregate.
    #Tapply example
    #tapply(X, INDEX, FUN, …) 
    #X = a vector, INDEX = list of one or more factor, FUN = Function or operation that needs to be applied. 
    iris<-read.csv(file="../../input/iris.csv", header=TRUE,sep=",")
    iris.sepal_length.agg<-tapply(iris$sepal_length, iris$species, mean)
    print(iris.sepal_length.agg)
    
    
    
    
        setosa versicolor  virginica 
         5.006      5.936      6.588 
    

    CREDITS

    Copyright AnalyticsDojo 2016. This work is licensed under the Creative Commons Attribution 4.0 International license agreement. This work is adopted from the Berkley R Bootcamp.