AnalyticsDojo

Introduction to Python - Kaggle Baseline

rpi.analyticsdojo.com

Running Code using Kaggle Notebooks

  • Kaggle utilizes Docker to create a fully functional environment for hosting competitions in data science.
  • You could download/run this locally or view the published version and fork it.
  • Kaggle has created an incredible resource for learning analytics. You can view a number of toy examples that can be used to understand data science and also compete in real problems faced by top companies.

```!wget https://raw.githubusercontent.com/rpi-techfundamentals/spring2019-materials/master/input/train.csv !wget https://raw.githubusercontent.com/rpi-techfundamentals/spring2019-materials/master/input/test.csv

</div>

</div>



<div markdown="1" class="cell code_cell">
<div class="input_area" markdown="1">
```import numpy as np 
import pandas as pd 

# Input data files are available in the "../input/" directory.
# Let's input them into a Pandas DataFrame
train = pd.read_csv("train.csv")
test  = pd.read_csv("test.csv")

train and test set on Kaggle

  • The train file contains a wide variety of information that might be useful in understanding whether they survived or not. It also includes a record as to whether they survived or not.
  • The test file contains all of the columns of the first file except whether they survived. Our goal is to predict whether the individuals survived.


Baseline Models: No Survivors

  • The Titanic problem is one of classification, and often the simplest baseline of all 0/1 is an appropriate baseline.
  • Think of the baseline as the simplest model you can think of that can be used to lend intuition on how your model is working.
  • Even if you aren’t familiar with the history of the tragedy, by checking out the Wikipedia Page we can quickly see that the majority of people (68%) died.
  • As a result, our baseline model will be for no survivors.

```test[“Survived”] = 0

</div>

</div>



<div markdown="1" class="cell code_cell">
<div class="input_area" markdown="1">
```submission = test.loc[:,["PassengerId", "Survived"]]


Write to CSV

The code below will write your dataframe to a CSV.

```submission.to_csv(‘everyone_dies.csv’, index=False)

</div>

</div>



## Download from Colab
Working on colab requires you to download a file via a google specific package.  



<div markdown="1" class="cell code_cell">
<div class="input_area" markdown="1">
```from google.colab import files
files.download('everyone_dies.csv')

The First Rule of Shipwrecks

  • You may have seen it in a movie or read it in a novel, but women and children first has at it’s roots something that could provide our first model.
  • Now let’s recode the Survived column based on whether was a man or a woman.
  • We are using conditionals to select rows of interest (for example, where test[‘Sex’] == ‘male’) and recoding appropriate columns.

```#Here we can code it as Survived, but if we do so we will overwrite our other prediction. #Instead, let’s code it as PredGender

test.loc[test[‘Sex’] == ‘male’, ‘PredGender’] = 0 test.loc[test[‘Sex’] == ‘female’, ‘PredGender’] = 1 #test.PredGender.astype(int) test

</div>

</div>



<div markdown="1" class="cell code_cell">
<div class="input_area" markdown="1">
```submission = test.loc[:,['PassengerId', 'PredGender']]
# But we have to change the column name.
# Option 1: submission.columns = ['PassengerId', 'Survived']
# Option 2: Rename command.
submission.rename(columns={'PredGender': 'Survived'}, inplace=True)

Writeout and then Download your File

Try your first submission to Kaggle!

```submission.to_csv(‘women_survive.csv’, index=False)

</div>

</div>



<div markdown="1" class="cell code_cell">
<div class="input_area" markdown="1">
```from google.colab import files
files.download('women_survive.csv')