AnalyticsDojo

Introduction to R - DataFrames

rpi.analyticsdojo.com

Introduction to R DataFrames

  • A data frame is a combination of vectors (columns) that all have the same length but can be of different types.
  • It is a special type of list.
  • Data frames are used for standard rectangular (record by field) datasets, similar to a spreadsheet.
  • Data frames set R apart from some languages (e.g., Matlab) and provide functionality similar to some statistical packages (e.g., Stata, SAS) and Python's pandas package.
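To make the points above concrete, here is a minimal sketch that builds a small data frame from three equal-length vectors of different types and checks that it is also a list. The vector names and values (`id`, `height`, `label`) are made up for illustration and are not part of the iris data.

```r
# Three vectors of the same length but different types (hypothetical values).
id     <- c(1L, 2L, 3L)
height <- c(1.5, 2.25, 3.0)
label  <- c("a", "b", "c")

# Combine them column-wise into a data frame.
small <- data.frame(id = id, height = height, label = label,
                    stringsAsFactors = FALSE)

class(small)          # "data.frame"
is.list(small)        # TRUE: each column is one element of the list
sapply(small, class)  # one type per column: integer, numeric, character
length(small)         # 3, the number of columns
```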
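#stringsAsFactors=TRUE reads text columns as factors (the default before R 4.0), matching the output below.
frame <- read.csv(file="../../input/iris.csv", header=TRUE, sep=",", stringsAsFactors=TRUE)
class(frame)
head(frame) #The first few rows.
tail(frame) #The last few rows.
str(frame) #The structure.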



'data.frame'
sepal_length  sepal_width  petal_length  petal_width  species
5.1           3.5          1.4           0.2          setosa
4.9           3.0          1.4           0.2          setosa
4.7           3.2          1.3           0.2          setosa
4.6           3.1          1.5           0.2          setosa
5.0           3.6          1.4           0.2          setosa
5.4           3.9          1.7           0.4          setosa

     sepal_length  sepal_width  petal_length  petal_width  species
145  6.7           3.3          5.7           2.5          virginica
146  6.7           3.0          5.2           2.3          virginica
147  6.3           2.5          5.0           1.9          virginica
148  6.5           3.0          5.2           2.0          virginica
149  6.2           3.4          5.4           2.3          virginica
150  5.9           3.0          5.1           1.8          virginica
'data.frame':	150 obs. of  5 variables:
 $ sepal_length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ sepal_width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ petal_length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ petal_width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
dim(frame) #Returns rows x columns
nrow(frame)  #The number of rows
names(frame) #The column names
length(frame) #The number of columns
summary(frame) #Provides summary statistics.
is.matrix(frame) #Yields FALSE: a data frame is a list of columns, not a matrix.
is.list(frame) #Yields TRUE
class(frame$sepal_length)
class(frame$species)
levels(frame$species)

    150 5
    150
    'sepal_length' 'sepal_width' 'petal_length' 'petal_width' 'species'
    5
      sepal_length    sepal_width     petal_length    petal_width   
     Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
     1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
     Median :5.800   Median :3.000   Median :4.350   Median :1.300  
     Mean   :5.843   Mean   :3.054   Mean   :3.759   Mean   :1.199  
     3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
     Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
           species  
     setosa    :50  
     versicolor:50  
     virginica :50  
                    
                    
                    
    
    FALSE
    TRUE
    'numeric'
    'factor'
    'setosa' 'versicolor' 'virginica'
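The `is.matrix()` and `class()` results above follow from how the columns are stored. The sketch below shows what happens when a mixed-type data frame is forced into a matrix. It uses R's built-in `iris` dataset so that it runs without `iris.csv`; note the assumption that the built-in copy names its columns `Sepal.Length`, `Species`, and so on, rather than the `sepal_length` style used in this notebook.

```r
# Built-in iris data, used here so the example runs without iris.csv.
df <- iris

# Inspect the type of every column at once.
col_types <- sapply(df, class)
col_types

# Forcing a mixed-type data frame into a matrix coerces every value to character.
m_all <- as.matrix(df)
is.matrix(m_all)  # TRUE
typeof(m_all)     # "character"

# Keeping only the numeric columns gives a numeric matrix instead.
m_num <- as.matrix(df[sapply(df, is.numeric)])
typeof(m_num)     # "double"
dim(m_num)        # 150 4
```

This is one reason to keep rectangular data with mixed types in a data frame rather than a matrix.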
    frame[c("species","sepal_width")]
    
    
    species    sepal_width
    setosa     3.5
    setosa     3.0
    setosa     3.2
    setosa     3.1
    setosa     3.6
    setosa     3.9
    setosa     3.4
    setosa     3.4
    setosa     2.9
    setosa     3.1
    setosa     3.7
    setosa     3.4
    setosa     3.0
    setosa     3.0
    setosa     4.0
    setosa     4.4
    setosa     3.9
    setosa     3.5
    setosa     3.8
    setosa     3.8
    setosa     3.4
    setosa     3.7
    setosa     3.6
    setosa     3.3
    setosa     3.4
    setosa     3.0
    setosa     3.4
    setosa     3.5
    setosa     3.4
    setosa     3.2
    ...
    virginica  3.2
    virginica  2.8
    virginica  2.8
    virginica  2.7
    virginica  3.3
    virginica  3.2
    virginica  2.8
    virginica  3.0
    virginica  2.8
    virginica  3.0
    virginica  2.8
    virginica  3.8
    virginica  2.8
    virginica  2.8
    virginica  2.6
    virginica  3.0
    virginica  3.4
    virginica  3.1
    virginica  3.0
    virginica  3.1
    virginica  3.1
    virginica  3.1
    virginica  2.7
    virginica  3.2
    virginica  3.3
    virginica  3.0
    virginica  2.5
    virginica  3.0
    virginica  3.4
    virginica  3.0
    frame['petals']<-0
    frame$petals2<-0
    head(frame)
    
    
    sepal_length  sepal_width  petal_length  petal_width  species  petals  petals2
    5.1           3.5          1.4           0.2          setosa   0       0
    4.9           3.0          1.4           0.2          setosa   0       0
    4.7           3.2          1.3           0.2          setosa   0       0
    4.6           3.1          1.5           0.2          setosa   0       0
    5.0           3.6          1.4           0.2          setosa   0       0
    5.4           3.9          1.7           0.4          setosa   0       0
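New columns do not have to be constants. As a rough sketch, the block below computes a column from two existing columns and then removes a column by assigning `NULL`. It uses the built-in `iris` data (column names such as `Petal.Length` differ from those in `iris.csv`), and the column names `petal_area` and `petal_area_copy` are hypothetical, chosen only for this example.

```r
# Built-in iris data; its column names differ from those in iris.csv.
df <- iris

# A new column can be computed element-wise from existing columns.
df$petal_area <- df$Petal.Length * df$Petal.Width

# Assigning NULL to a column removes it from the data frame.
df$petal_area_copy <- df$petal_area
df$petal_area_copy <- NULL

names(df)               # the original five columns plus petal_area
head(df$petal_area, 3)  # products for the first three rows
```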
    mean.sepalLength.all<-mean(frame[,'sepal_length']) #Mean over all rows (all species), not just setosa
    
    

    Slicing a Dataframe by Column

    • Remember the syntax df[rows,columns].
    • Using dataframe$column provides one way of selecting a column.
    • We can also specify the index position: dataframe[,columnIndex]
    • We can also specify the column name as a quoted string: dataframe[,"columnName"]
    sepal_length1<-frame$sepal_length #Using the dollar sign and the column name.
    sepal_length2<- frame[,1]  #Using the index location
    sepal_length3<- frame[,'sepal_length'] #Using the column name
    sepal_length4<- frame[,c('sepal_length','sepal_width')] #Two columns, so this is a data frame rather than a vector
    
    sepal_length1[1:5]  #Print the first 5 values
    sepal_length2[1:5]
    sepal_length3[1:5]
    
    
    
    5.1 4.9 4.7 4.6 5
    5.1 4.9 4.7 4.6 5
    5.1 4.9 4.7 4.6 5
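The selection forms above differ in what they return: some give a plain vector and some give a data frame. The following sketch illustrates the difference, assuming the built-in `iris` data (whose column is named `Sepal.Length` rather than `sepal_length`).

```r
# Built-in iris data; its column names differ from those in iris.csv.
df <- iris

v1 <- df$Sepal.Length                     # numeric vector
v2 <- df[, "Sepal.Length"]                # numeric vector: a single column is simplified
v3 <- df[["Sepal.Length"]]                # numeric vector: list-style extraction
d1 <- df["Sepal.Length"]                  # one-column data frame
d2 <- df[, "Sepal.Length", drop = FALSE]  # one-column data frame: drop = FALSE prevents simplification

identical(v1, v2)  # TRUE
identical(v1, v3)  # TRUE
class(d1)          # "data.frame"
identical(d1, d2)  # TRUE
```

Using `drop = FALSE` is a common choice when later code expects a data frame regardless of how many columns were selected.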

    Selecting Rows

    • We can select rows from a dataframe using index position: dataframe[rowIndex,columnIndex].
    • Use c(row1, row2, row3) to select out specific rows.
    frame2<-frame[1:20,]   
    frame3<-frame[c(1,5,6),] #This selects out specific rows
    nrow(frame2)
    nrow(frame3)
    frame3
    
    
    20
    3
       sepal_length  sepal_width  petal_length  petal_width  species  petals  petals2
    1  5.1           3.5          1.4           0.2          setosa   0       0
    5  5.0           3.6          1.4           0.2          setosa   0       0
    6  5.4           3.9          1.7           0.4          setosa   0       0
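Two details of row selection are worth a short sketch: negative indices drop rows instead of keeping them, and selected rows keep their original row labels (as the 1, 5, 6 labels in the output above show). The example assumes the built-in `iris` data; the variable names are chosen only for illustration.

```r
# Built-in iris data, used so the example runs without iris.csv.
df <- iris

# Negative indices drop the listed rows and keep everything else.
without_first <- df[-1, ]
without_some  <- df[-c(1, 5, 6), ]
nrow(without_first)  # 149
nrow(without_some)   # 147

# Selected rows keep their original row labels.
picked <- df[c(1, 5, 6), ]
rownames(picked)     # "1" "5" "6"

# Assigning NULL resets the labels to 1, 2, 3, ...
rownames(picked) <- NULL
rownames(picked)     # "1" "2" "3"
```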

    Conditional Statements and Dataframes with Subset

    • We can select a subset of a dataframe by giving subset() a logical condition on the rows, such as an equality test.
    • The result of subset() is also a dataframe.
    • We can optionally select columns with select = c(col1, col2).
    setosa.df <- subset(frame, species == 'setosa')
    
    head(setosa.df)
    class(setosa.df)
    nrow(setosa.df)
    mean.sepalLength.setosa<-mean(setosa.df$sepal_length) #Mean sepal length for setosa, stored as a length-one numeric vector
    mean.sepalLength.setosa
    #Keep only the setosa rows with above-average sepal length.
    setosa.df.highSepalLength <- subset(setosa.df, sepal_length > mean.sepalLength.setosa)
    nrow(setosa.df.highSepalLength)
    head(setosa.df.highSepalLength)
    #Same condition, but keep only two columns.
    setosa.df.highSepalLength2 <- subset(setosa.df, sepal_length > mean.sepalLength.setosa, select = c(sepal_length, species))
    head(setosa.df.highSepalLength2)
    
    
    sepal_length  sepal_width  petal_length  petal_width  species  petals  petals2
    5.1           3.5          1.4           0.2          setosa   0       0
    4.9           3.0          1.4           0.2          setosa   0       0
    4.7           3.2          1.3           0.2          setosa   0       0
    4.6           3.1          1.5           0.2          setosa   0       0
    5.0           3.6          1.4           0.2          setosa   0       0
    5.4           3.9          1.7           0.4          setosa   0       0
    'data.frame'
    50
    5.006
    22
        sepal_length  sepal_width  petal_length  petal_width  species  petals  petals2
    1   5.1           3.5          1.4           0.2          setosa   0       0
    6   5.4           3.9          1.7           0.4          setosa   0       0
    11  5.4           3.7          1.5           0.2          setosa   0       0
    15  5.8           4.0          1.2           0.2          setosa   0       0
    16  5.7           4.4          1.5           0.4          setosa   0       0
    17  5.4           3.9          1.3           0.4          setosa   0       0

        sepal_length  species
    1   5.1           setosa
    6   5.4           setosa
    11  5.4           setosa
    15  5.8           setosa
    16  5.7           setosa
    17  5.4           setosa
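Conditions passed to `subset()` can be combined with `&` (and) and `|` (or). The sketch below is one way this might look, assuming the built-in `iris` data (columns `Species` and `Sepal.Length` rather than the names in `iris.csv`). The 5.5 threshold and the variable names are arbitrary choices for illustration.

```r
# Built-in iris data; its column names differ from those in iris.csv.
df <- iris

# Both conditions must hold: setosa rows with sepal length above 5.5.
long_setosa <- subset(df, Species == "setosa" & Sepal.Length > 5.5)

# Either condition may hold, and only two columns are kept.
two_species <- subset(df, Species == "setosa" | Species == "virginica",
                      select = c(Sepal.Length, Species))

nrow(long_setosa)
nrow(two_species)   # 100: 50 setosa rows plus 50 virginica rows
names(two_species)  # "Sepal.Length" "Species"
```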

    Subsetting Rows Using Indices

    • Just as in pandas, we can put a logical condition in the row position of the brackets to select specific rows.
    • See here for good coverage and examples.
    setosa.df <- frame[frame$species == "setosa",]
    head(setosa.df)
    class(setosa.df)
    nrow(setosa.df)
    mean.sepalLength.setosa<-mean(setosa.df$sepal_length) #Mean sepal length for setosa, stored as a length-one numeric vector
    mean.sepalLength.setosa
    #Keep only the setosa rows with above-average sepal length.
    setosa.df.highSepalLength <- setosa.df[setosa.df$sepal_length > mean.sepalLength.setosa,]
    nrow(setosa.df.highSepalLength)
    head(setosa.df.highSepalLength)
    
    
       sepal_length  sepal_width  petal_length  petal_width  species  petals
    1  5.1           3.5          1.4           0.2          setosa   0
    2  4.9           3            1.4           0.2          setosa   0
    3  4.7           3.2          1.3           0.2          setosa   0
    4  4.6           3.1          1.5           0.2          setosa   0
    5  5             3.6          1.4           0.2          setosa   0
    6  5.4           3.9          1.7           0.4          setosa   0
    'data.frame'
    50
    5.006
    22
        sepal_length  sepal_width  petal_length  petal_width  species  petals
    1   5.1           3.5          1.4           0.2          setosa   0
    6   5.4           3.9          1.7           0.4          setosa   0
    11  5.4           3.7          1.5           0.2          setosa   0
    15  5.8           4            1.2           0.2          setosa   0
    16  5.7           4.4          1.5           0.4          setosa   0
    17  5.4           3.9          1.3           0.4          setosa   0
    specific.df <- frame[frame$sepal_length %in% c(5.1,5.8),]
    head(specific.df)
    
    
    
        sepal_length  sepal_width  petal_length  petal_width  species  petals
    1   5.1           3.5          1.4           0.2          setosa   0
    15  5.8           4.0          1.2           0.2          setosa   0
    18  5.1           3.5          1.4           0.3          setosa   0
    20  5.1           3.8          1.5           0.3          setosa   0
    22  5.1           3.7          1.5           0.4          setosa   0
    24  5.1           3.3          1.7           0.5          setosa   0
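Logical indexing and `subset()` give the same rows on the iris data, but they can differ when the condition evaluates to `NA`. The following sketch uses a small made-up data frame (the columns `x` and `g` and their values are hypothetical) to show the difference.

```r
# A small hypothetical data frame with one missing value in x.
df <- data.frame(x = c(1, NA, 3, 4),
                 g = c("a", "b", "a", "b"),
                 stringsAsFactors = FALSE)

# The condition is FALSE, NA, TRUE, TRUE.
cond <- df$x > 2

by_index  <- df[cond, ]          # the NA in cond produces a row filled with NA
by_subset <- subset(df, x > 2)   # subset() drops rows where the condition is NA
by_which  <- df[which(cond), ]   # which() keeps only the positions that are TRUE

nrow(by_index)   # 3
nrow(by_subset)  # 2
nrow(by_which)   # 2
```

When a column may contain missing values, wrapping the condition in `which()` or using `subset()` avoids the extra all-`NA` rows.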

    Basics

    1. Load the Titanic train.csv data into an R data frame.
    2. Calculate the number of rows in the data frame.
    3. Calculate general descriptive statistics for the data frame.
    4. Slice the data frame into 2 parts, selecting the first half of the rows.
    5. Select just the columns for the passenger ID and whether the passenger survived or not.

    CREDITS

    Copyright AnalyticsDojo 2016. This work is licensed under the Creative Commons Attribution 4.0 International license agreement. Adapted from the Berkeley R Bootcamp.