Introduction to R - Merging and Aggregating Data

rpi.analyticsdojo.com

Overview

Merging Dataframes
Aggregating Dataframes
Advanced Functions

Merging Data Frame with Vector

Can combine vector with data frame in multiple ways.
data.frame(a,b) where a & b can be vectors, matrices, or data frames.

#Below is the sample data we will be creating 2 dataframes  
key=(1:10)

#Here we are passing the row names and column names as a list. 
m<- data.frame(matrix(rnorm(40, mean=20, sd=5), nrow=10, ncol=4, dimnames=list((1:10),c("a","b","c","d"))))
m2<- data.frame(matrix(rnorm(40, mean=1000, sd=5), nrow=10, ncol=4, dimnames=list((1:10),c("e","f","g","h"))))

#This is one way of combining a vector with a dataframe. 
df<-  data.frame(key,m)
df2<- data.frame(key,m2)

#This is another way way of combining a vector with a dataframe. 
dfb<-  cbind(key,m)
df2b<- cbind(key,m2)

df
df2
dfb
df2b

1	16.19873	14.41495	27.73225	15.564688
2	27.76677	18.54772	21.06688	15.697810
3	11.68592	14.91207	22.82086	15.666790
4	16.39982	28.30284	10.97550	19.083633
5	22.16232	16.82574	14.28676	20.162797
6	17.17425	14.36932	18.55487	13.498498
7	20.15380	18.00987	15.99028	14.325000
8	20.68866	12.83505	25.24119	24.538494
9	18.84664	24.01079	12.69775	8.095156
10	16.29913	21.51270	15.14676	23.722103

1	1004.1240	997.4379	997.5697	1000.8540
2	1002.6933	998.4041	1009.1720	1010.4120
3	995.9138	1001.0959	1004.6025	1002.5405
4	999.5493	998.8054	1003.9649	1000.0133
5	1007.2373	1006.2580	1000.1882	992.9980
6	1000.2068	994.7482	998.2876	1002.7093
7	999.1622	998.6231	998.7175	998.0497
8	1003.1263	1002.7279	1004.1623	1000.5204
9	1003.1548	994.2030	1002.1614	999.3726
10	1006.8744	1004.2677	998.8720	993.5726

1	16.19873	14.41495	27.73225	15.564688
2	27.76677	18.54772	21.06688	15.697810
3	11.68592	14.91207	22.82086	15.666790
4	16.39982	28.30284	10.97550	19.083633
5	22.16232	16.82574	14.28676	20.162797
6	17.17425	14.36932	18.55487	13.498498
7	20.15380	18.00987	15.99028	14.325000
8	20.68866	12.83505	25.24119	24.538494
9	18.84664	24.01079	12.69775	8.095156
10	16.29913	21.51270	15.14676	23.722103

1	1004.1240	997.4379	997.5697	1000.8540
2	1002.6933	998.4041	1009.1720	1010.4120
3	995.9138	1001.0959	1004.6025	1002.5405
4	999.5493	998.8054	1003.9649	1000.0133
5	1007.2373	1006.2580	1000.1882	992.9980
6	1000.2068	994.7482	998.2876	1002.7093
7	999.1622	998.6231	998.7175	998.0497
8	1003.1263	1002.7279	1004.1623	1000.5204
9	1003.1548	994.2030	1002.1614	999.3726
10	1006.8744	1004.2677	998.8720	993.5726

Merging Columns of Data Frame with another Data Frame

Can combine data frame in multiple ways.
merge(a,b,by="key") where a & b are dataframes with the same keys.
cbind(a,b) where a & b are dataframes with the same number of rows.

# This manages the merge by an associated key.
df3 <- merge(df,df2,by="key")
# This just does a "column bind" 
df4<- cbind(df,df2)
df5<- data.frame(df,df2)
df3
df4
df5

1	16.278799	23.004297	19.262524	22.11648	1004.4714	995.4055	1001.5156	1004.8862
2	19.287252	19.229253	18.817575	13.67939	1000.7399	1004.6248	995.8950	1008.5242
3	18.833623	16.142004	18.454224	18.29241	994.3794	1002.2578	998.6004	999.7609
4	23.780847	10.934207	13.448540	17.83936	1001.4795	1009.0885	1002.5866	998.8287
5	13.982935	16.924402	19.037475	20.19748	993.9745	999.9868	1001.0336	987.3751
6	15.534589	23.437320	6.795926	20.19305	996.1284	1008.8440	1005.5196	1003.6926
7	16.660076	18.315077	32.107139	23.35534	994.5026	1004.9990	1004.0972	1005.6532
8	19.447799	18.278384	11.823108	13.09162	1007.8858	993.8745	1005.1093	996.8686
9	9.225069	24.925796	13.868021	17.06181	997.6026	1001.1045	991.7969	1000.5898
10	25.809451	7.492747	18.483003	24.99244	995.6190	1010.2642	998.6192	998.8618

1	16.278799	23.004297	19.262524	22.11648	1	1004.4714	995.4055	1001.5156	1004.8862
2	19.287252	19.229253	18.817575	13.67939	2	1000.7399	1004.6248	995.8950	1008.5242
3	18.833623	16.142004	18.454224	18.29241	3	994.3794	1002.2578	998.6004	999.7609
4	23.780847	10.934207	13.448540	17.83936	4	1001.4795	1009.0885	1002.5866	998.8287
5	13.982935	16.924402	19.037475	20.19748	5	993.9745	999.9868	1001.0336	987.3751
6	15.534589	23.437320	6.795926	20.19305	6	996.1284	1008.8440	1005.5196	1003.6926
7	16.660076	18.315077	32.107139	23.35534	7	994.5026	1004.9990	1004.0972	1005.6532
8	19.447799	18.278384	11.823108	13.09162	8	1007.8858	993.8745	1005.1093	996.8686
9	9.225069	24.925796	13.868021	17.06181	9	997.6026	1001.1045	991.7969	1000.5898
10	25.809451	7.492747	18.483003	24.99244	10	995.6190	1010.2642	998.6192	998.8618

1	16.278799	23.004297	19.262524	22.11648	1	1004.4714	995.4055	1001.5156	1004.8862
2	19.287252	19.229253	18.817575	13.67939	2	1000.7399	1004.6248	995.8950	1008.5242
3	18.833623	16.142004	18.454224	18.29241	3	994.3794	1002.2578	998.6004	999.7609
4	23.780847	10.934207	13.448540	17.83936	4	1001.4795	1009.0885	1002.5866	998.8287
5	13.982935	16.924402	19.037475	20.19748	5	993.9745	999.9868	1001.0336	987.3751
6	15.534589	23.437320	6.795926	20.19305	6	996.1284	1008.8440	1005.5196	1003.6926
7	16.660076	18.315077	32.107139	23.35534	7	994.5026	1004.9990	1004.0972	1005.6532
8	19.447799	18.278384	11.823108	13.09162	8	1007.8858	993.8745	1005.1093	996.8686
9	9.225069	24.925796	13.868021	17.06181	9	997.6026	1001.1045	991.7969	1000.5898
10	25.809451	7.492747	18.483003	24.99244	10	995.6190	1010.2642	998.6192	998.8618

Merging Rows of Data Frame with another Data Frame

rbind(a,b) combines rows of data frames of a and b.
rbind(a,b, make.row.names=FALSE) this will reset the index.

#Here we can combine rows with rbind. 
df5<-df
#The make Row
df6<-rbind(df,df5)
df6
df7<-rbind(df,df5, make.row.names=FALSE)
df7


1	16.19873	14.41495	27.73225	15.564688
2	27.76677	18.54772	21.06688	15.697810
3	11.68592	14.91207	22.82086	15.666790
4	16.39982	28.30284	10.97550	19.083633
5	22.16232	16.82574	14.28676	20.162797
6	17.17425	14.36932	18.55487	13.498498
7	20.15380	18.00987	15.99028	14.325000
8	20.68866	12.83505	25.24119	24.538494
9	18.84664	24.01079	12.69775	8.095156
10	16.29913	21.51270	15.14676	23.722103
1	16.19873	14.41495	27.73225	15.564688
2	27.76677	18.54772	21.06688	15.697810
3	11.68592	14.91207	22.82086	15.666790
4	16.39982	28.30284	10.97550	19.083633
5	22.16232	16.82574	14.28676	20.162797
6	17.17425	14.36932	18.55487	13.498498
7	20.15380	18.00987	15.99028	14.325000
8	20.68866	12.83505	25.24119	24.538494
9	18.84664	24.01079	12.69775	8.095156
10	16.29913	21.51270	15.14676	23.722103

1	16.19873	14.41495	27.73225	15.564688
2	27.76677	18.54772	21.06688	15.697810
3	11.68592	14.91207	22.82086	15.666790
4	16.39982	28.30284	10.97550	19.083633
5	22.16232	16.82574	14.28676	20.162797
6	17.17425	14.36932	18.55487	13.498498
7	20.15380	18.00987	15.99028	14.325000
8	20.68866	12.83505	25.24119	24.538494
9	18.84664	24.01079	12.69775	8.095156
10	16.29913	21.51270	15.14676	23.722103
1	16.19873	14.41495	27.73225	15.564688
2	27.76677	18.54772	21.06688	15.697810
3	11.68592	14.91207	22.82086	15.666790
4	16.39982	28.30284	10.97550	19.083633
5	22.16232	16.82574	14.28676	20.162797
6	17.17425	14.36932	18.55487	13.498498
7	20.15380	18.00987	15.99028	14.325000
8	20.68866	12.83505	25.24119	24.538494
9	18.84664	24.01079	12.69775	8.095156
10	16.29913	21.51270	15.14676	23.722103

df7

1	16.19873	14.41495	27.73225	15.564688
2	27.76677	18.54772	21.06688	15.697810
3	11.68592	14.91207	22.82086	15.666790
4	16.39982	28.30284	10.97550	19.083633
5	22.16232	16.82574	14.28676	20.162797
6	17.17425	14.36932	18.55487	13.498498
7	20.15380	18.00987	15.99028	14.325000
8	20.68866	12.83505	25.24119	24.538494
9	18.84664	24.01079	12.69775	8.095156
10	16.29913	21.51270	15.14676	23.722103
1	16.19873	14.41495	27.73225	15.564688
2	27.76677	18.54772	21.06688	15.697810
3	11.68592	14.91207	22.82086	15.666790
4	16.39982	28.30284	10.97550	19.083633
5	22.16232	16.82574	14.28676	20.162797
6	17.17425	14.36932	18.55487	13.498498
7	20.15380	18.00987	15.99028	14.325000
8	20.68866	12.83505	25.24119	24.538494
9	18.84664	24.01079	12.69775	8.095156
10	16.29913	21.51270	15.14676	23.722103

`aggregate` and `by`

Aggregation is a very important function.
Can have variables/analyses that happen at different levels.
by(x, by, FUN) provides similar functionality.

iris=read.csv(file="../../input/iris.csv", header=TRUE,sep=",")
head(iris)

<th scope=col>sepal_length</th><th scope=col>sepal_width</th><th scope=col>petal_length</th><th scope=col>petal_width</th><th scope=col>species</th>

5.1	3.5	1.4	0.2	setosa
4.9	3.0	1.4	0.2	setosa
4.7	3.2	1.3	0.2	setosa
4.6	3.1	1.5	0.2	setosa
5.0	3.6	1.4	0.2	setosa
5.4	3.9	1.7	0.4	setosa

iris<-read.csv(file="../../input/iris.csv", header=TRUE,sep=",")

#Aggregate by Species  aggregate(x, by, FUN, ...)
iris.agg<-aggregate(iris[,1:4], by=list("species" = iris$species), mean)
print(iris.agg)

#Notice this gives us the same output but structured differently. 
by(iris[, 1:4], iris$species, colMeans)

     species sepal_length sepal_width petal_length petal_width
   setosa        5.006       3.418        1.464       0.244
versicolor        5.936       2.770        4.260       1.326
virginica        6.588       2.974        5.552       2.026

iris$species: setosa
sepal_length  sepal_width petal_length  petal_width 
       5.006        3.418        1.464        0.244 
------------------------------------------------------------ 
iris$species: versicolor
sepal_length  sepal_width petal_length  petal_width 
       5.936        2.770        4.260        1.326 
------------------------------------------------------------ 
iris$species: virginica
sepal_length  sepal_width petal_length  petal_width 
       6.588        2.974        5.552        2.026 

`apply`(plus `lapply`/`sapply`/`tapply`/`rapply`)

apply - Applying a function to an array or matrix, return a vector or array or list of values. apply(X, MARGIN, FUN, ...)
lapply - Apply a function to each element of a list or vector, return a list.
sapply - A user-friendly version if lapply. Apply a function to each element of a list or vector, return a vector.
tapply - Apply a function to subsets of a vector (and the subsets are defined by some other vector, usually a factor), return a vector.
rapply - Apply a function to each element of a nested list structure, recursively, return a list.
Some functions aren’t vectorized, or you may want to use a function on every row or column of a matrix/data frame, every element of a list, etc.
For more info see this tutorial

`apply`

apply - Applying a function to an array or matrix, return a vector or array or list of values. apply(X, MARGIN, FUN, ...)
If you are using a data frame the data types must all be the same.
`apply(X, MARGIN, FUN, …) where X is an array or matrix.
MARGIN is a vector giving the where function should be applied. E.g., for a matrix 1 indicates rows, 2 indicates columns, c(1, 2) indicates rows and columns.
FUN is any function.

iris<-read.csv(file="../../input/iris.csv", header=TRUE,sep=",")
iris$sum<-apply(iris[1:4], 1, sum) #This provides a sum across  for each row. 
iris$mean<-apply(iris[1:4], 1, mean)#This provides a mean across collumns for each row. 
head(iris)
apply(iris[1:4], 2, mean)

<th scope=col>sepal_length</th><th scope=col>sepal_width</th><th scope=col>petal_length</th><th scope=col>petal_width</th><th scope=col>species</th><th scope=col>sum</th><th scope=col>mean</th>

5.1	3.5	1.4	0.2	setosa	10.2	2.550
4.9	3.0	1.4	0.2	setosa	9.5	2.375
4.7	3.2	1.3	0.2	setosa	9.4	2.350
4.6	3.1	1.5	0.2	setosa	9.4	2.350
5.0	3.6	1.4	0.2	setosa	10.2	2.550
5.4	3.9	1.7	0.4	setosa	11.4	2.850

sepal_length

5.84333333333333

sepal_width

3.054

petal_length

3.75866666666667

petal_width

1.19866666666667

</dl>

`lapply` & `sapply`

lapply - Apply a function to each element of a list or vector, return a list.
lapply(X, FUN, ...)
sapply - A user-friendly version if lapply. Apply a function to each element of a list or vector, return a vector.
sapply(X, FUN, ...)

# create a list with 2 elements
sample <- list("count" = 1:5, "numbers" =5:10)

# sum each and return as a list. 
sample.sum<-lapply(sample, sum)

class(sample.sum)
print(c(sample.sum, sample.sum["numbers"],sample.sum["count"]))

'list'

$count
[1] 15

$numbers
[1] 45

$numbers
[1] 45

$count
[1] 15

# create a list with 2 elements
sample <- list("count" = 1:5, "numbers" =5:10)

# sum each and return as a list. 
sample.sum<-sapply(sample, sum)

class(sample.sum)
print(c(sample.sum, sample.sum["numbers"],sample.sum["count"],sample.sum[["count"]]))

#Note the differenece between #sample.sum[["count"]] and sample.sum["count"]

'integer'

  count numbers numbers   count         
     15      45      45      15      15 

# We can also utilize simple 
square<-function(x) x^2
square(1:5)

# We can use our own function here.     
sapply(1:10, square)

#We can also specify the function directly in sapply.
sapply(1:10, function(x) x^2)

</ol>

100

</ol>

100

</ol>

`tapply`

tapply - Apply a function to subsets of a vector (and the subsets are defined by some other vector, usually a factor), return a vector.
Can do something similar to aggregate.

#Tapply example
#tapply(X, INDEX, FUN, …) 
#X = a vector, INDEX = list of one or more factor, FUN = Function or operation that needs to be applied. 
iris<-read.csv(file="../../input/iris.csv", header=TRUE,sep=",")
iris.sepal_length.agg<-tapply(iris$sepal_length, iris$species, mean)
print(iris.sepal_length.agg)

    setosa versicolor  virginica 
     5.006      5.936      6.588 

CREDITS

Copyright AnalyticsDojo 2016. This work is licensed under the Creative Commons Attribution 4.0 International license agreement. This work is adopted from the Berkley R Bootcamp.

Introduction to R - Merging and Aggregating Data

rpi.analyticsdojo.com

Overview

Merging Data Frame with Vector

Merging Columns of Data Frame with another Data Frame

Merging Rows of Data Frame with another Data Frame

aggregate and by

apply(plus lapply/sapply/tapply/rapply)

apply

lapply & sapply

tapply

CREDITS

`aggregate` and `by`

`apply`(plus `lapply`/`sapply`/`tapply`/`rapply`)

`apply`

`lapply` & `sapply`

`tapply`