Introduction to Python - Kaggle Baseline
rpi.analyticsdojo.com
## Running Code using Kaggle Notebooks
- Kaggle uses Docker to provide a fully functional environment for hosting data science competitions.
- You can download and run this notebook locally, or view the published version and fork it.
- Kaggle is an excellent resource for learning analytics. It offers a number of toy examples for understanding data science, as well as competitions built on real problems faced by top companies.
```
!wget https://raw.githubusercontent.com/rpi-techfundamentals/spring2019-materials/master/input/train.csv
!wget https://raw.githubusercontent.com/rpi-techfundamentals/spring2019-materials/master/input/test.csv
```
</div>
</div>
<div markdown="1" class="cell code_cell">
<div class="input_area" markdown="1">
```python
import numpy as np
import pandas as pd

# Input data files are available in the "../input/" directory.
# Let's input them into a Pandas DataFrame
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")
```
## `train` and `test` sets on Kaggle
- The `train` file contains a wide variety of information about each passenger that might be useful in understanding whether they survived. It also records whether each passenger survived.
- The `test` file contains all of the columns of the first file except whether the passengers survived. Our goal is to predict whether these individuals survived.
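As a quick check of the relationship described above, the column sets of the two files can be compared. The sketch below uses two tiny hand-made frames (`toy_train` and `toy_test`, hypothetical rows rather than the real Kaggle data) so that it runs without the downloaded files; with the real files, the same comparison could be applied to `train` and `test`.

```python
import pandas as pd

# Hypothetical miniature stand-ins for train.csv and test.csv (not real rows).
toy_train = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Sex": ["male", "female", "female"],
    "Survived": [0, 1, 1],
})
toy_test = pd.DataFrame({
    "PassengerId": [4, 5],
    "Sex": ["female", "male"],
})

# Columns present in the training data but absent from the test data.
missing_from_test = sorted(set(toy_train.columns) - set(toy_test.columns))
print(missing_from_test)
```

The column that shows up in this difference is the one a submission has to supply.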
## Baseline Models: No Survivors
- The Titanic problem is one of classification, and predicting a single class for every row (all 0s or all 1s) is often an appropriate simplest baseline.
- Think of the baseline as the simplest model you can come up with; it gives you a reference point for judging how well your later models are working.
- Even if you aren't familiar with the history of the tragedy, the Wikipedia page quickly shows that the majority of people (68%) died.
- As a result, our baseline model will predict no survivors.
```python
test["Survived"] = 0
```
</div>
</div>
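To see why an all-zero prediction is a meaningful baseline, it helps to score a constant prediction against known labels. The following is a minimal sketch: `baseline_accuracy` is a hypothetical helper, and the label list is made up for illustration rather than taken from the Titanic data.

```python
def baseline_accuracy(labels, constant=0):
    """Return the share of labels that a constant prediction gets right."""
    if not labels:
        raise ValueError("labels must be non-empty")
    correct = sum(1 for y in labels if y == constant)
    return correct / len(labels)


# Hypothetical labels: 7 of 10 passengers did not survive.
toy_labels = [0, 0, 1, 0, 0, 1, 0, 0, 1, 0]

acc_all_zero = baseline_accuracy(toy_labels, constant=0)
acc_all_one = baseline_accuracy(toy_labels, constant=1)
print(acc_all_zero, acc_all_one)
```

Predicting the majority class scores exactly the majority share, which is the number any later model should beat.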
<div markdown="1" class="cell code_cell">
<div class="input_area" markdown="1">
```python
submission = test.loc[:, ["PassengerId", "Survived"]]
```
## Write to CSV
The code below will write your DataFrame to a CSV file.
```python
submission.to_csv('everyone_dies.csv', index=False)
```
</div>
</div>
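The `index=False` argument matters for the file layout. As a small illustration on a hypothetical two-row frame (`toy_submission` is not the real submission), calling `to_csv` without a path returns the CSV text, which makes the header line easy to inspect:

```python
import pandas as pd

# Hypothetical two-row submission standing in for the real one.
toy_submission = pd.DataFrame({"PassengerId": [1, 2], "Survived": [0, 0]})

# With no path argument, to_csv returns the CSV content as a string.
without_index = toy_submission.to_csv(index=False)
with_index = toy_submission.to_csv()

header_without_index = without_index.splitlines()[0]
header_with_index = with_index.splitlines()[0]
print(header_without_index)
print(header_with_index)
```

Without `index=False`, the row index is written as an extra unnamed first column, so the file would no longer contain only `PassengerId` and `Survived`.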
## Download from Colab
When working in Colab, you download a file to your machine through a Google-specific package.
<div markdown="1" class="cell code_cell">
<div class="input_area" markdown="1">
```python
from google.colab import files
files.download('everyone_dies.csv')
```
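Because `google.colab` can only be imported inside Colab, the cell above raises an import error in other environments. One possible guard is sketched below; `download_if_colab` is a hypothetical helper name, and returning the local path as a fallback is an assumption about what would be useful outside Colab.

```python
import os


def download_if_colab(path):
    """Trigger a browser download in Colab; otherwise return the local path."""
    try:
        from google.colab import files  # only importable inside Colab
    except ImportError:
        # Not running in Colab: the file is already on the local disk.
        return os.path.abspath(path)
    files.download(path)
    return None


result = download_if_colab("everyone_dies.csv")
print(result)
```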
## The First Rule of Shipwrecks
- You may have seen it in a movie or read it in a novel, but "women and children first" has at its root something that could provide our first model.
- Now let's recode the `Survived` column based on whether the passenger was a man or a woman.
- We are using conditionals to select rows of interest (for example, where `test['Sex'] == 'male'`) and recoding the appropriate columns.
```python
# Here we could code it as Survived, but if we did so we would overwrite our other prediction.
# Instead, let's code it as PredGender.
test.loc[test['Sex'] == 'male', 'PredGender'] = 0
test.loc[test['Sex'] == 'female', 'PredGender'] = 1
# test.PredGender.astype(int)
test
```
</div>
</div>
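One way to judge such a rule before submitting is to apply it to rows where the outcome is known and count the matches. The sketch below does this on a hypothetical five-row frame (`toy_train`, invented values); on the real training file the same comparison would use `train['Sex']` and `train['Survived']`.

```python
import pandas as pd

# Hypothetical labeled rows (not real passengers).
toy_train = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "male"],
    "Survived": [0, 1, 0, 0, 1],
})

# Same rule as above: predict 1 for women and 0 for men.
toy_train["PredGender"] = (toy_train["Sex"] == "female").astype(int)

# Fraction of rows where the rule matches the recorded outcome.
accuracy = (toy_train["PredGender"] == toy_train["Survived"]).mean()
print(accuracy)
```

Comparing this number with the score of the all-zero baseline shows whether the rule adds anything.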
<div markdown="1" class="cell code_cell">
<div class="input_area" markdown="1">
```python
submission = test.loc[:, ['PassengerId', 'PredGender']]
# But we have to change the column name.
# Option 1: submission.columns = ['PassengerId', 'Survived']
# Option 2: Rename command.
submission.rename(columns={'PredGender': 'Survived'}, inplace=True)
```
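Because `PredGender` was filled in through masked assignments, it may end up stored as floating point (0.0 and 1.0), which is what the commented-out `astype(int)` line hints at. As a sketch on hypothetical rows (`toy_test` is invented), the selection, rename, and integer cast can be chained without `inplace`:

```python
import pandas as pd

# Hypothetical rows standing in for the real test set.
toy_test = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Sex": ["male", "female", "male"],
})
toy_test.loc[toy_test["Sex"] == "male", "PredGender"] = 0
toy_test.loc[toy_test["Sex"] == "female", "PredGender"] = 1

# Select the two columns, rename the prediction, and cast it to integers.
submission = (
    toy_test[["PassengerId", "PredGender"]]
    .rename(columns={"PredGender": "Survived"})
    .astype({"Survived": int})
)
print(submission)
```

The cast assumes every row received a prediction; a row whose `Sex` matched neither condition would leave a missing value, and `astype(int)` would then raise an error.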
## Write Out and then Download Your File
Try your first submission to Kaggle!
```python
submission.to_csv('women_survive.csv', index=False)
```
</div>
</div>
<div markdown="1" class="cell code_cell">
<div class="input_area" markdown="1">
```python
from google.colab import files
files.download('women_survive.csv')
```