Introduction to Python - Web Mining

rpi.analyticsdojo.com

This tutorial is taken directly from the BeautifulSoup documentation.

[https://www.crummy.com/software/BeautifulSoup/bs4/doc/](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)

Before you begin

If you are running locally, make sure beautifulsoup4 is installed by running `conda install beautifulsoup4`.

It should already be installed on Colab.
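One way to confirm the installation before going further is to ask Python whether the packages can be found. The sketch below is only an illustration: the helper name `is_installed` is hypothetical, and it relies on the fact that beautifulsoup4 is imported under the module name `bs4`.

```python
import importlib.util


def is_installed(module_name):
    """Return True if the module can be located without importing it."""
    return importlib.util.find_spec(module_name) is not None


# beautifulsoup4 is imported as "bs4"; requests is used later in this tutorial.
for name in ("bs4", "requests"):
    status = "installed" if is_installed(name) else "missing"
    print(f"{name}: {status}")
```

If either package is reported as missing, install it and rerun the check.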

All HTML documents have structure. Here is a basic HTML page.

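<div markdown="1" class="cell code_cell">
<div class="input_area" markdown="1">
```python
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
```
</div>

</div>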

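<div markdown="1" class="cell code_cell">
<div class="input_area" markdown="1">
```python
from bs4 import BeautifulSoup
import requests

# Parse the document with Python's built-in HTML parser.
soup = BeautifulSoup(html_doc, 'html.parser')

print(soup.prettify())
```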

</div>

</div>

### A Retrieved Beautiful Soup Object
- Can be navigated with dot notation to traverse down the hierarchy by *tag name*; methods such as `find_all` and `find` search by *class name*, *id*, and other attributes.
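The difference between dot notation and an attribute search can be sketched as follows. The HTML snippet and the variable name `soup_demo` are hypothetical and exist only so the example runs on its own without touching the tutorial's `soup` object.

```python
from bs4 import BeautifulSoup

# Small hypothetical snippet so the example is self-contained.
snippet = """
<p class="title"><b>Heading</b></p>
<p class="story">
  <a class="sister" id="link1" href="http://example.com/elsie">Elsie</a>
  <a class="sister" id="link2" href="http://example.com/lacie">Lacie</a>
</p>
"""

soup_demo = BeautifulSoup(snippet, "html.parser")

# Dot notation returns the first tag with that name.
first_p = soup_demo.p
print(first_p.name)
print(first_p["class"])  # class can hold several values, so a list is returned

# find_all with the class_ keyword searches by CSS class instead.
sisters = soup_demo.find_all("a", class_="sister")
print([a["id"] for a in sisters])
```

Dot notation is convenient for reaching the first matching tag, while `find_all` is the better fit when every match is needed.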

<div markdown="1" class="cell code_cell">
<div class="input_area" markdown="1">
```python
soup
```

```python
# Select the title tag.
soup.title
```

</div>

</div>



<div markdown="1" class="cell code_cell">
<div class="input_area" markdown="1">
```python
# Name of the tag.
soup.title.name
```

```python
# String contents inside the tag.
soup.title.string
```

</div>

</div>



<div markdown="1" class="cell code_cell">
<div class="input_area" markdown="1">
```python
# Parent in the hierarchy.
soup.title.parent.name
```

```python
# Select the first p tag.
soup.p
```

</div>

</div>



<div markdown="1" class="cell code_cell">
<div class="input_area" markdown="1">
```python
# List the class of the first p tag.
soup.p['class']
```

```python
# Select the first a tag.
soup.a
```

</div>

</div>



<div markdown="1" class="cell code_cell">
<div class="input_area" markdown="1">
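```python
# List all a tags.
soup.find_all('a')
```

```python
# Find the tag whose id is "link3".
soup.find(id="link3")
```

```python
# The robots.txt file lists which crawlers are allowed to access which paths.
response = requests.get("https://en.wikipedia.org/robots.txt")
txt = response.text
print(txt)
```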

</div>

</div>
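Reading a robots.txt file by eye works for a quick look, but the rules can also be checked programmatically. The sketch below uses the standard library's `urllib.robotparser`; the robots.txt content, the URLs, and the helper name `is_allowed` are hypothetical, chosen so the example runs offline.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used here so the example runs offline.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_text, url, user_agent="*"):
    """Return True if the robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(SAMPLE_ROBOTS, "https://example.com/index.html"))
print(is_allowed(SAMPLE_ROBOTS, "https://example.com/private/data.html"))
```

For a live site, the text returned by `requests.get(...)` for its robots.txt could be passed in place of `SAMPLE_ROBOTS`.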



<div markdown="1" class="cell code_cell">
<div class="input_area" markdown="1">
```python
# Download a live page and parse it.
response = requests.get("https://www.rpi.edu")
txt = response.text
soup = BeautifulSoup(txt, 'html.parser')

print(soup.prettify())
```

```python
# Experiment with a website of your own choosing: replace the placeholder with its URL.
response = requests.get("enter url here")
txt = response.text
soup = BeautifulSoup(txt, 'html.parser')

print(soup.prettify())
```
</div>

</div>
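Once a page has been parsed, a common next step is pulling out the URLs it links to. The following is a minimal sketch, assuming that only `<a>` tags with an `href` attribute are of interest; the function name `extract_links` and the sample HTML are hypothetical.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_links(html, base_url):
    """Return absolute URLs for every <a> tag that has an href attribute."""
    page = BeautifulSoup(html, "html.parser")
    links = []
    for a in page.find_all("a", href=True):  # href=True skips anchors without one
        # urljoin resolves relative paths against the page's own URL.
        links.append(urljoin(base_url, a["href"]))
    return links


sample_html = (
    '<a href="/about">About</a> '
    '<a href="https://example.org/x">X</a> '
    "<a>no link</a>"
)
print(extract_links(sample_html, "https://example.com/"))
```

With a fetched page, `response.text` and the requested URL would take the place of `sample_html` and the base URL.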

For more info, see [https://github.com/stanfordjournalism/search-script-scrape](https://github.com/stanfordjournalism/search-script-scrape)