
Revisiting Boston Housing with PyTorch

!pip install torch torchvision



<div markdown="1" class="cell code_cell">
<div class="input_area" markdown="1">
#Let's get some imports out of the way
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
#PyTorch imports used to define the model
import torch
import torch.nn as nn
import torch.nn.functional as F


  • Getting the Data
  • Reviewing Data
  • Modeling
  • Model Evaluation
  • Using Model
  • Storing Model

Getting Data

#From sklearn tutorial.
#Note: load_boston was removed in scikit-learn 1.2, so this cell needs an older version.
from sklearn.datasets import load_boston
boston = load_boston()
print( "Type of boston dataset:", type(boston))



<div markdown="1" class="cell code_cell">
<div class="input_area" markdown="1">
#A Bunch, if you remember, is a dictionary-based dataset.  Dictionaries are addressed by keys.
#Let's look at the keys.
print(boston.keys())

#DESCR sounds like it could be useful. Let's print the description.
print(boston['DESCR'])



<div markdown="1" class="cell code_cell">
<div class="input_area" markdown="1">
# Let's change the data to a pandas DataFrame
import pandas as pd
boston_df = pd.DataFrame(boston['data'] )

#Now add the column names.
boston_df.columns = boston['feature_names']
boston_df.head()



<div markdown="1" class="cell code_cell">
<div class="input_area" markdown="1">
#Add the target as PRICE.
boston_df['PRICE']= boston['target']

## Attribute Information (in order):
Looks like they are nearly all continuous IVs (CHAS is a binary dummy variable), with a continuous DV.
- CRIM: per capita crime rate by town
- ZN: proportion of residential land zoned for lots over 25,000 sq.ft.
- INDUS: proportion of non-retail business acres per town
- CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
- NOX: nitric oxides concentration (parts per 10 million)
- RM: average number of rooms per dwelling
- AGE: proportion of owner-occupied units built prior to 1940
- DIS: weighted distances to five Boston employment centres
- RAD: index of accessibility to radial highways
- TAX: full-value property-tax rate per $10,000
- PTRATIO: pupil-teacher ratio by town
- B: 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
- LSTAT: % lower status of the population
- MEDV: Median value of owner-occupied homes in $1000's

Let's check for missing values.

import numpy as np
#check for missing values
print(np.sum(np.isnan(boston_df)))



## What type of data are there?
- First let's focus on the dependent variable, as the nature of the DV is critical to selection of model. 
- *Median value of owner-occupied homes in $1000's* is the Dependent Variable  (continuous variable).
- It is relevant to look at the distribution of the dependent variable, so let's do that first.
- Here the distribution is roughly normal, with some values at the top end of the distribution that we could explore later.
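
A quick numeric check can complement a plot when judging the shape of the dependent variable. The sketch below is only an illustration: it uses a synthetic stand-in for `boston_df['PRICE']` (the values, the cap at 50, and the name `price_demo` are all assumptions) and reports summary statistics, skewness, and the bin counts a histogram would draw.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for boston_df['PRICE'] (assumed values, not the real data):
# roughly normal, capped at 50 to mimic a pile-up at the top end.
rng = np.random.default_rng(0)
raw = rng.normal(loc=22.5, scale=9.0, size=506)
price_demo = pd.Series(np.minimum(raw, 50.0), name="PRICE")

# Summary statistics and skewness give a first numeric read on the shape.
print(price_demo.describe())
print("skewness:", price_demo.skew())

# Bin counts are the numbers a histogram would draw.
counts, bin_edges = np.histogram(price_demo, bins=10)
print(counts)
```

With the real data, the same calls would be applied to `boston_df['PRICE']` directly.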

<div markdown="1" class="cell code_cell">
<div class="input_area" markdown="1">
#Let's use seaborn, because it is pretty. ;)
#See more here.
import seaborn as sns

#We can quickly look at other data.
#Look at the bottom row to see things likely correlated with price.
#Look along the diagonal to see histograms of each.
sns.pairplot(boston_df);



## Preparing to Model
- It is common to separate `y` as the dependent variable and `X` as the matrix of independent variables.
- Here we are using `train_test_split` to split the data into training and test sets.
- This creates 4 subsets, with IV and DV separated: `X_train, X_test, y_train, y_test`
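
As a rough illustration of the split described above, the sketch below applies `train_test_split` to a synthetic stand-in with the same shape as the Boston data (506 rows, 13 features). The random arrays and the names `X_demo`, `y_demo`, `X_tr`, `X_te`, `y_tr`, and `y_te` are assumptions made for this example only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the Boston data: 506 rows, 13 features (assumed values).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(506, 13))
y_demo = rng.normal(size=506)

# Same arguments as in the tutorial: 30% held out, fixed seed for reproducibility.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

# The four subsets: training/test features and training/test targets.
print(X_tr.shape, X_te.shape, y_tr.shape, y_te.shape)
```

Fixing `random_state` means the same rows land in the training and test sets on every run, which keeps later results comparable.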

<div markdown="1" class="cell code_cell">
<div class="input_area" markdown="1">
#This will throw an error at import if you haven't upgraded scikit-learn.
# from sklearn.cross_validation  import train_test_split  
from sklearn.model_selection  import train_test_split
#y is the dependent variable.
y = boston_df['PRICE']
#As we know, iloc is used to slice the array by index number. Here this is the matrix of 
#independent variables.
X = boston_df.iloc[:,0:13]

# Split the data into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

#Define training hyperparameters.
batch_size = 50
num_epochs = 200
learning_rate = 0.01
size_hidden= 100

#Calculate some other hyperparameters based on data.
batch_no = len(X_train) // batch_size  #batches
cols=X_train.shape[1] #Number of columns in input matrix
n_output=1



<div markdown="1" class="cell code_cell">
<div class="input_area" markdown="1">
#Create the model
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
# Assume that we are on a CUDA machine, then this should print a CUDA device:
print("Executing the model on :",device)
class Net(torch.nn.Module):
    def __init__(self, n_feature, size_hidden, n_output):
        super(Net, self).__init__()
        self.hidden = torch.nn.Linear(n_feature, size_hidden)   # hidden layer
        self.predict = torch.nn.Linear(size_hidden, n_output)   # output layer

    def forward(self, x):
        x = F.relu(self.hidden(x))      # activation function for hidden layer
        x = self.predict(x)             # linear output
        return x
net = Net(cols, size_hidden, n_output)

#Adam is a variant of gradient descent with adaptive learning rates that often converges faster than plain SGD
optimizer = torch.optim.Adam(net.parameters(), lr=learning_rate)
#optimizer = torch.optim.SGD(net.parameters(), lr=0.2)
criterion = torch.nn.MSELoss(reduction='sum')  # summed squared error loss for regression (size_average=False is deprecated)



<div markdown="1" class="cell code_cell">
<div class="input_area" markdown="1">
#Change to numpy arrays.
X_train = X_train.values
y_train = y_train.values
X_test = X_test.values
y_test = y_test.values

from sklearn.utils import shuffle
from torch.autograd import Variable
running_loss = 0.0
for epoch in range(num_epochs):
    #Shuffle just mixes up the dataset between epochs
    X_train, y_train = shuffle(X_train, y_train)
    # Mini batch learning
    for i in range(batch_no):
        start = i * batch_size
        end = start + batch_size
        inputs = Variable(torch.FloatTensor(X_train[start:end]))
        labels = Variable(torch.FloatTensor(y_train[start:end]))
        # zero the parameter gradients
        optimizer.zero_grad()

        # forward + backward + optimize
        outputs = net(inputs)
        #print("outputs",outputs,outputs.shape,"labels",labels, labels.shape)
        loss = criterion(outputs, torch.unsqueeze(labels,dim=1))
        loss.backward()
        optimizer.step()

        # accumulate the loss over the batches of this epoch
        running_loss += loss.item()

    # print statistics once per epoch, then reset the running total
    print('Epoch {}'.format(epoch+1), "loss: ",running_loss)
    running_loss = 0.0
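
The manual slicing above only visits `len(X_train) // batch_size` full batches, so any leftover rows are skipped in each epoch. One alternative, sketched below, is PyTorch's `TensorDataset` and `DataLoader`, which handle shuffling and the final partial batch. This is a minimal sketch on synthetic data rather than the Boston set; the names `X_demo`, `y_demo`, `demo_net`, `demo_optimizer`, and `demo_criterion` are hypothetical.

```python
import torch
import torch.nn as nn
from torch.utils.data import TensorDataset, DataLoader

torch.manual_seed(0)

# Synthetic regression data (assumed): 354 rows, 13 features, linear target plus noise.
X_demo = torch.randn(354, 13)
true_w = torch.randn(13, 1)
y_demo = X_demo @ true_w + 0.1 * torch.randn(354, 1)

# DataLoader reshuffles every epoch and also yields the last, smaller batch.
loader = DataLoader(TensorDataset(X_demo, y_demo), batch_size=50, shuffle=True)

demo_net = nn.Sequential(nn.Linear(13, 100), nn.ReLU(), nn.Linear(100, 1))
demo_optimizer = torch.optim.Adam(demo_net.parameters(), lr=0.01)
demo_criterion = nn.MSELoss(reduction='sum')

epoch_losses = []
for epoch in range(20):
    running_loss = 0.0
    for inputs, labels in loader:
        demo_optimizer.zero_grad()              # reset gradients from the previous step
        outputs = demo_net(inputs)              # forward pass
        loss = demo_criterion(outputs, labels)  # summed squared error for this batch
        loss.backward()                         # backward pass
        demo_optimizer.step()                   # parameter update
        running_loss += loss.item()
    epoch_losses.append(running_loss)

print("first epoch loss:", epoch_losses[0], "last epoch loss:", epoch_losses[-1])
```

Because the targets here already have shape `(n, 1)`, no `unsqueeze` is needed before computing the loss.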


<div markdown="1" class="cell code_cell">
<div class="input_area" markdown="1">

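import pandas as pd
from sklearn.metrics import r2_score

X = Variable(torch.FloatTensor(X_train))
result = net(X)
#Detach from the graph before converting the predictions to numpy.
pred = result.detach()[:,0].numpy()
print(len(pred),len(y_train))
#r2_score expects the true values first, then the predictions.
r2_score(y_train, pred)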



<div markdown="1" class="cell code_cell">
<div class="input_area" markdown="1">
import pandas as pd
from sklearn.metrics import r2_score
#This is a little bit tricky to get the resulting prediction.
def calculate_r2(x, y=[]):
    """
    This function will return the r2 if passed x and y or return predictions if just passed x.
    """
    # Evaluate the model with the given data.
    X = Variable(torch.FloatTensor(x))
    result = net(X) #This outputs the value for regression
    #Detach from the graph and flatten the single output column to a numpy array.
    result = result.detach()[:,0].numpy()
    if len(y) != 0:
        #r2_score expects the true values first, then the predictions.
        r2 = r2_score(y, result)
        print("R-Squared", r2)
        return pd.DataFrame(data= {'actual': y, 'predicted': result})
    else:
        print("returning predictions")
        return result



  • First import the package: `from sklearn.linear_model import LinearRegression`
  • Then create the model object.
  • Then fit the data.
  • This creates a trained model (an object) of class `LinearRegression`.
  • The variety of methods and attributes available for regression are shown here.

from sklearn.linear_model import LinearRegression
lm = LinearRegression()
lm.fit(X_train, y_train)



## Evaluating the Model Results
- You have fit a model. 
- You can now store this model, save the object to disk, or evaluate it with different outcomes. 
- Trained regression objects have coefficients (`coef_`) and intercepts (`intercept_`) as attributes. 
- R-Squared is determined from the `score` method of the regression object.
- For regression, we are going to use the coefficient of determination, also referred to as R-Squared, as our way of evaluating the results.
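
To make the attributes and the metric above concrete, the sketch below fits `LinearRegression` on synthetic data (an assumption; it is not the Boston set), reads `coef_` and `intercept_`, and checks that `score` agrees with R-Squared computed by hand as 1 - SS_res / SS_tot. The names `X_demo`, `y_demo`, and `lm_demo` are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic data (assumed): y = 3*x0 - 2*x1 + 5 plus a little noise.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
y_demo = 3 * X_demo[:, 0] - 2 * X_demo[:, 1] + 5 + rng.normal(scale=0.1, size=200)

lm_demo = LinearRegression().fit(X_demo, y_demo)
print("coefficients:", lm_demo.coef_, "intercept:", lm_demo.intercept_)

# R-Squared by hand: 1 - (residual sum of squares / total sum of squares).
pred = lm_demo.predict(X_demo)
ss_res = np.sum((y_demo - pred) ** 2)
ss_tot = np.sum((y_demo - y_demo.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

# score() and r2_score(y_true, y_pred) should report the same value.
print(r2_manual, lm_demo.score(X_demo, y_demo), r2_score(y_demo, pred))
```

Note that `r2_score` is not symmetric in its arguments: the true values go first and the predictions second.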

<div markdown="1" class="cell code_cell">
<div class="input_area" markdown="1">

print('R2 for Train', lm.score( X_train, y_train ))
print('R2 for Test (cross validation)', lm.score(X_test, y_test))
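
The outline lists storing the model as a final step. One common option for a fitted scikit-learn object is Python's `pickle` module, sketched below under the assumption that a plain pickle file is acceptable. The synthetic data, the names `lm_demo` and `lm_loaded`, and the file name `lm_demo.pkl` are all hypothetical.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (assumed) standing in for X_train / y_train.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([1.0, 2.0, 3.0]) + 0.5

lm_demo = LinearRegression().fit(X_demo, y_demo)

# Save the fitted object to disk, then load it back.
path = os.path.join(tempfile.mkdtemp(), "lm_demo.pkl")  # hypothetical file name
with open(path, "wb") as f:
    pickle.dump(lm_demo, f)
with open(path, "rb") as f:
    lm_loaded = pickle.load(f)

# Check that the reloaded model reproduces the original predictions.
same = np.allclose(lm_loaded.predict(X_demo), lm_demo.predict(X_demo))
print("predictions match after reload:", same)
```

For the PyTorch network, the usual counterpart is `torch.save(net.state_dict(), path)` followed by `load_state_dict` on a freshly constructed `Net`.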


