AnalyticsDojo

Revisiting Boston Housing with PyTorch

rpi.analyticsdojo.com

```
!pip install torch torchvision
```

</div>

</div>



<div markdown="1" class="cell code_cell">
<div class="input_area" markdown="1">
```
#Let's get rid of some imports
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
#Define the model 
import torch
import torch.nn as nn
import torch.nn.functional as F
```

## Overview

- Getting the Data
- Reviewing Data
- Modeling
- Model Evaluation
- Using Model
- Storing Model

## Getting Data

```
#From sklearn tutorial.
#Note: load_boston was removed in scikit-learn 1.2; use an older version to run this as-is.
from sklearn.datasets import load_boston
boston = load_boston()
print("Type of boston dataset:", type(boston))
```

</div>

</div>



<div markdown="1" class="cell code_cell">
<div class="input_area" markdown="1">
```
#A bunch, as you remember, is a dictionary-based dataset. Dictionaries are addressed by keys. 
#Let's look at the keys. 
print(boston.keys())
```


```
#DESCR sounds like it could be useful. Let's print the description.
print(boston['DESCR'])
```

</div>

</div>
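Since `load_boston` returns a scikit-learn `Bunch`, here is a minimal standalone sketch of how a Bunch behaves; the two-key Bunch below is a hypothetical stand-in, not the Boston data:

```python
from sklearn.utils import Bunch

# A Bunch is a dict subclass whose keys are also attributes.
b = Bunch(data=[1, 2, 3], target=[0, 1, 0])
print(sorted(b.keys()))     # ['data', 'target']
print(b['data'] == b.data)  # True -- key access and attribute access agree
```

This is why `boston['DESCR']` and `boston.DESCR` are interchangeable in the cells above.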



<div markdown="1" class="cell code_cell">
<div class="input_area" markdown="1">
```
# Let's change the data to a Pandas DataFrame
import pandas as pd
boston_df = pd.DataFrame(boston['data'])
boston_df.head()
```

```
#Now add the column names.
boston_df.columns = boston['feature_names']
boston_df.head()
```

</div>

</div>



<div markdown="1" class="cell code_cell">
<div class="input_area" markdown="1">
```
#Add the target as PRICE. 
boston_df['PRICE'] = boston['target']
boston_df.head()
```

## Attribute Information (in order)
Looks like they are all continuous IVs and a continuous DV.
- CRIM: per capita crime rate by town
- ZN: proportion of residential land zoned for lots over 25,000 sq.ft.
- INDUS: proportion of non-retail business acres per town
- CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
- NOX: nitric oxides concentration (parts per 10 million)
- RM: average number of rooms per dwelling
- AGE: proportion of owner-occupied units built prior to 1940
- DIS: weighted distances to five Boston employment centres
- RAD: index of accessibility to radial highways
- TAX: full-value property-tax rate per $10,000
- PTRATIO: pupil-teacher ratio by town
- B: 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
- LSTAT: % lower status of the population
- MEDV: median value of owner-occupied homes in $1000's

Let's check for missing values.

```
import numpy as np
#Check for missing values.
print(np.sum(np.isnan(boston_df)))
```

</div>

</div>
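The `np.isnan`/`np.sum` idiom above yields a per-column count of missing values. A small sketch on a hypothetical two-column frame with one NaN:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': [4.0, 5.0, 6.0]})
# isnan gives a boolean frame; summing collapses it to a count per column.
counts = np.sum(np.isnan(df))
print(counts['a'], counts['b'])  # 1 0
```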



## What type of data are there?
- First let's focus on the dependent variable, as the nature of the DV is critical to selection of model. 
- *Median value of owner-occupied homes in $1000's* is the Dependent Variable  (continuous variable).
- It is relevant to look at the distribution of the dependent variable, so let's do that first.
- Here there is a normal distribution for the most part, with some at the top end of the distribution we could explore later.
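That "mostly normal, heavier at the top end" shape can also be checked numerically via skewness. A sketch on synthetic right-tailed data (a stand-in, since the plot below uses the actual `PRICE` column):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic stand-in: mostly normal "prices" plus a small cluster of high values.
prices = pd.Series(np.concatenate([rng.normal(22, 7, 480), rng.uniform(45, 50, 26)]))
print(prices.skew() > 0)  # positive skew indicates a heavier upper tail
```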



<div markdown="1" class="cell code_cell">
<div class="input_area" markdown="1">
```
#Let's use seaborn, because it is pretty. ;) 
#See more here. http://seaborn.pydata.org/tutorial/distributions.html
import seaborn as sns
sns.distplot(boston_df['PRICE']);
```

```
#We can quickly look at other data.
#Look at the bottom row to see things likely correlated with price.
#Look along the diagonal to see histograms of each.
sns.pairplot(boston_df);
```

</div>

</div>



## Preparing to Model
- It is common to separate `y` as the dependent variable and `X` as the matrix of independent variables.
- Here we are using `train_test_split` to split the test and train.
- This creates 4 subsets, with IV and DV separated: `X_train, X_test, y_train, y_test`
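The split can be sketched on a tiny synthetic frame (the names `X_demo`/`y_demo` are hypothetical); with `test_size=0.3`, 30% of rows go to the test set:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Small synthetic stand-in for the Boston data: 10 rows, 3 features.
X_demo = np.arange(30).reshape(10, 3)
y_demo = np.arange(10)

# random_state fixes the shuffle so the split is reproducible.
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)
print(X_tr.shape, X_te.shape)  # (7, 3) (3, 3)
```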
 




<div markdown="1" class="cell code_cell">
<div class="input_area" markdown="1">
```
#This will throw an error at import if you haven't upgraded. 
# from sklearn.cross_validation import train_test_split  
from sklearn.model_selection import train_test_split
#y is the dependent variable.
y = boston_df['PRICE']
#As we know, iloc is used to slice the array by index number. Here this is the matrix of 
#independent variables.
X = boston_df.iloc[:,0:13]

# Split the data into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
```


```
#Define training hyperparameters.
batch_size = 50
num_epochs = 200
learning_rate = 0.01
size_hidden = 100

#Calculate some other hyperparameters based on data.
batch_no = len(X_train) // batch_size  #Number of batches
cols = X_train.shape[1]  #Number of columns in input matrix
n_output = 1  #Number of outputs
```

</div>

</div>
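Assuming the 506-row Boston set and the 70/30 split above, the training set has 354 rows, so floor division fixes the batch count; note that the leftover rows past the last full batch are never visited within an epoch:

```python
n_train = 354     # training rows after a 70/30 split of 506
batch_size = 50

batch_no = n_train // batch_size
print(batch_no)                          # 7 full batches per epoch
print(n_train - batch_no * batch_size)   # 4 rows are dropped each epoch
```

Shuffling between epochs (done in the training loop below) means different rows are dropped each time, so no row is permanently excluded.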



<div markdown="1" class="cell code_cell">
<div class="input_area" markdown="1">
```
#Create the model
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
# Assume that we are on a CUDA machine, then this should print a CUDA device:
print("Executing the model on :", device)

class Net(torch.nn.Module):
    def __init__(self, n_feature, size_hidden, n_output):
        super(Net, self).__init__()
        self.hidden = torch.nn.Linear(n_feature, size_hidden)   # hidden layer
        self.predict = torch.nn.Linear(size_hidden, n_output)   # output layer

    def forward(self, x):
        x = F.relu(self.hidden(x))      # activation function for hidden layer
        x = self.predict(x)             # linear output
        return x

net = Net(cols, size_hidden, n_output)
```

```
#Adam is a specific flavor of gradient descent which is typically better.
optimizer = torch.optim.Adam(net.parameters(), lr=learning_rate)
#optimizer = torch.optim.SGD(net.parameters(), lr=0.2)
criterion = torch.nn.MSELoss(reduction='sum')  #Regression mean squared loss, summed over the batch
```

</div>

</div>
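A quick standalone shape check of this layer stack may help; the dimensions are taken from the notebook's hyperparameters (13 features, 100 hidden units, 1 output), and the random input is just a stand-in for a mini-batch:

```python
import torch
import torch.nn.functional as F

hidden = torch.nn.Linear(13, 100)   # 13 features -> 100 hidden units
predict = torch.nn.Linear(100, 1)   # 100 hidden units -> 1 regression output

x = torch.randn(5, 13)              # a hypothetical mini-batch of 5 rows
out = predict(F.relu(hidden(x)))
print(out.shape)                    # torch.Size([5, 1])
```

Each row in the batch produces one predicted price, so the output is `[batch, 1]` rather than `[batch]`.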



<div markdown="1" class="cell code_cell">
<div class="input_area" markdown="1">
```
#Change to numpy array. 
X_train = X_train.values
y_train = y_train.values
X_test = X_test.values
y_test = y_test.values
```

```
from sklearn.utils import shuffle
from torch.autograd import Variable

running_loss = 0.0
for epoch in range(num_epochs):
    #Shuffle just mixes up the dataset between epochs
    X_train, y_train = shuffle(X_train, y_train)
    # Mini batch learning
    for i in range(batch_no):
        start = i * batch_size
        end = start + batch_size
        inputs = Variable(torch.FloatTensor(X_train[start:end]))
        labels = Variable(torch.FloatTensor(y_train[start:end]))

        # zero the parameter gradients
        optimizer.zero_grad()

        # forward + backward + optimize
        outputs = net(inputs)
        #print("outputs",outputs,outputs.shape,"labels",labels, labels.shape)
        loss = criterion(outputs, torch.unsqueeze(labels, dim=1))
        loss.backward()
        optimizer.step()

        # print statistics
        running_loss += loss.item()

    print('Epoch {}'.format(epoch+1), "loss: ", running_loss)
    running_loss = 0.0
```
</div>

</div>
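The `torch.unsqueeze(labels, dim=1)` call in the loop matters because the network outputs shape `[batch, 1]` while the labels tensor is shape `[batch]`; without it, broadcasting would silently compute the wrong loss. A sketch of the shape fix on made-up values:

```python
import torch

labels = torch.tensor([1.0, 2.0, 3.0])         # shape [3], like y_train slices
outputs = torch.tensor([[1.5], [2.5], [2.0]])  # shape [3, 1], like the net's output

labels2d = torch.unsqueeze(labels, dim=1)      # shape [3, 1] -- now aligned
loss = torch.nn.MSELoss(reduction='sum')(outputs, labels2d)
print(labels2d.shape)  # torch.Size([3, 1])
print(loss.item())     # 0.25 + 0.25 + 1.0 = 1.5
```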



<div markdown="1" class="cell code_cell">
<div class="input_area" markdown="1">

```
import pandas as pd
from sklearn.metrics import r2_score

X = Variable(torch.FloatTensor(X_train))
result = net(X)
pred = result.data[:,0].numpy()
print(len(pred), len(y_train))
#r2_score expects (y_true, y_pred)
r2_score(y_train, pred)
```

</div>

</div>



<div markdown="1" class="cell code_cell">
<div class="input_area" markdown="1">
```import pandas as pd
from sklearn.metrics import r2_score
#This is a little bit tricky to get the resulting prediction.  
def calculate_r2(x,y=[]):
    """
    This function will return the r2 if passed x and y or return predictions if just passed x. 
    """
    # Evaluate the model with the test set. 
    X = Variable(torch.FloatTensor(x))  
    result = net(X) #This outputs the value for regression
    result=result.data[:,0].numpy()
  
    if len(y) != 0:
        r2 = r2_score(y, result)  #r2_score expects (y_true, y_pred)
        print("R-Squared", r2)
        #print('Accuracy {:.2f}'.format(num_right / len(y)), "for a total of ", len(y), "records")
        return pd.DataFrame(data= {'actual': y, 'predicted': result})
    else:
        print("returning predictions")
        return result



result2 = calculate_r2(X_test, y_test)
```

## Modeling

- First import the package: `from sklearn.linear_model import LinearRegression`
- Then create the model object.
- Then fit the data.
- This creates a trained model (an object) of class regression.
- The variety of methods and attributes available for regression are shown here.

```
from sklearn.linear_model import LinearRegression
lm = LinearRegression()
lm.fit(X_train, y_train)
```

</div>

</div>



## Evaluating the Model Results
- You have fit a model. 
- You can now store this model, save the object to disk, or evaluate it with different outcomes. 
- Trained regression objects have coefficients (`coef_`) and intercepts (`intercept_`) as attributes. 
- R-Squared is determined from the `score` method of the regression object.
- For Regression, we are going to use the coefficient of determination as our way of evaluating the results, [also referred to as R-Squared](https://en.wikipedia.org/wiki/Coefficient_of_determination)
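These attributes can be sketched on hypothetical exactly-linear data (`X_demo`, `y_demo` with y = 2x + 1 are made up for illustration), where the manual R-Squared formula agrees with `score`:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X_demo = np.array([[0.0], [1.0], [2.0], [3.0]])
y_demo = 2 * X_demo[:, 0] + 1          # exactly linear, so the fit is perfect

lm_demo = LinearRegression().fit(X_demo, y_demo)
print(round(float(lm_demo.coef_[0]), 6), round(float(lm_demo.intercept_), 6))  # 2.0 1.0

# score() computes R^2 = 1 - SS_res/SS_tot; a perfect fit gives 1.0.
print(round(lm_demo.score(X_demo, y_demo), 6))  # 1.0
```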



<div markdown="1" class="cell code_cell">
<div class="input_area" markdown="1">

```
print('R2 for Train', lm.score(X_train, y_train))
print('R2 for Test (cross validation)', lm.score(X_test, y_test))
```

Copyright AnalyticsDojo 2016. This work is licensed under the Creative Commons Attribution 4.0 International license agreement.