Beautiful Soup

No, not actual soup. Beautiful Soup is a Python library that makes it easy to scrape information from web pages. It sits atop an HTML or XML parser, providing Pythonic idioms for iterating over, searching, and modifying the parse tree.

Setting it up

Beautiful Soup gives us a handful of useful functions for parsing the HTML we fetched. To get started, install the library from your terminal. Like most Python libraries, Beautiful Soup is installed with pip:

$ pip install beautifulsoup4

Then import the library and create a BeautifulSoup object:

import requests
from bs4 import BeautifulSoup

URL = 'https://www.monster.com/jobs/search/?q=Software-Developer&where=Australia'
page = requests.get(URL)

soup = BeautifulSoup(page.content, 'html.parser')

When you create the object, Beautiful Soup parses the content with the parser you name in the second argument (here Python's built-in html.parser), and the resulting parse tree is assigned to the soup variable.
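If you want to experiment without making a network request, the constructor also accepts a plain string. The following is a minimal sketch; the markup and the `html` variable are made up for illustration and stand in for `page.content`.

```python
from bs4 import BeautifulSoup

# Made-up markup standing in for page.content from a real request.
html = "<html><head><title>Jobs</title></head><body><p>Hello</p></body></html>"

# Same call as above: the content first, then the parser name.
soup = BeautifulSoup(html, "html.parser")

# Accessing a tag name as an attribute returns the first matching element.
print(soup.title.string)  # Jobs
```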

Find Elements by ID

In an HTML page, DOM elements can have specific attributes such as an ID, which makes the element uniquely identifiable on the page. With developer tools we can locate the HTML element we want and read its ID value.

Note: Keep in mind that it’s helpful to switch back to your browser and explore the page using developer tools. This helps you learn how to find the elements you want to fetch.

<div id="ExampleDiv">
    <article class="article-content">Article 1.
        <a href="thelink.com">Random Link</a>
    </article>
    <article class="article-content">Article 2.</article>
    <article class="article-content">Article 3.</article>
</div>

Beautiful Soup has a dedicated function for finding DOM elements by their attributes. In this case, we ask it to find the element whose ID is ExampleDiv.

results = soup.find(id='ExampleDiv')
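One detail worth knowing is that find returns None when nothing matches, so it can be useful to check the result before using it. A small sketch, assuming a trimmed-down copy of the sample markup; the ID NoSuchId is hypothetical and deliberately absent.

```python
from bs4 import BeautifulSoup

# Trimmed-down copy of the sample markup, inlined so the sketch runs on its own.
html = '<div id="ExampleDiv"><article class="article-content">Article 1.</article></div>'
soup = BeautifulSoup(html, "html.parser")

results = soup.find(id="ExampleDiv")
missing = soup.find(id="NoSuchId")  # hypothetical ID that is not in the markup

print(results.name)  # div
print(missing)       # None
```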

Find Elements by HTML Class Name

Beautiful Soup gives us more flexibility for getting DOM elements: besides filtering on an element's attributes, we can also filter on the element type, like so:

articles = results.find_all('article', class_='article-content')

Note: Don't forget the trailing underscore in class_. Because class is a reserved keyword in Python, Beautiful Soup uses class_ as the argument name instead.

The first argument is the element type, and the second is an attribute filter, this time on the class attribute. Notice that we are no longer searching the whole page: we call find_all on the results variable, so Beautiful Soup only searches inside the div we fetched earlier.

Use this wisely. Limiting the scope keeps computing time low because you don't have to search the whole page every time, and it makes your scraping more accurate by helping you avoid fetching unwanted elements.

We also used a different function this time. Instead of find, which returns only the first match, we use find_all, which returns an iterable containing the HTML of every matching article on the page.

# articles now contains something like:
#<article class="article-content">Lorem Ipsum.</article>
#<article class="article-content">Lorem Ipsum.</article>
#<article class="article-content">Lorem Ipsum.</article>
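To make the effect of scoping concrete, here is a small sketch that compares a whole-document search with a search limited to the container. The extra article placed outside the div is an assumption added purely to show the difference.

```python
from bs4 import BeautifulSoup

# Sample markup; the first article sits outside the container on purpose.
html = """
<article class="article-content">Outside the container.</article>
<div id="ExampleDiv">
    <article class="article-content">Article 1.</article>
    <article class="article-content">Article 2.</article>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
results = soup.find(id="ExampleDiv")

# Searching from soup covers the whole document.
everywhere = soup.find_all("article", class_="article-content")
# Searching from results only covers the descendants of the div.
scoped = results.find_all("article", class_="article-content")

print(len(everywhere))  # 3
print(len(scoped))      # 2
```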

Extracting Data

Extracting Text

Now that we know how to find the DOM elements we can learn how to extract data from them. And behold! Beautiful Soup has got you covered. You can add .text to a Beautiful Soup object to return only the text content of the HTML elements that the object contains:

for article in articles:
    print(article.text)  # .text is a property, not a method

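If your results look something like this, you are doing great (the first article will also include the text of its nested link and some surrounding whitespace):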

Article 1.
Article 2.
Article 3.
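When an element contains nested tags and indentation, the raw text can come back with stray whitespace. One way to tidy it, sketched here on the first sample article, is get_text with a separator and strip=True.

```python
from bs4 import BeautifulSoup

# First article from the sample markup, including its nested link.
html = """<article class="article-content">Article 1.
        <a href="thelink.com">Random Link</a>
    </article>"""
soup = BeautifulSoup(html, "html.parser")
article = soup.find("article")

# strip=True trims each text fragment and drops empty ones;
# the first argument is the separator used to join what remains.
clean = article.get_text(" ", strip=True)
print(clean)  # Article 1. Random Link
```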

Extract Attributes

We already learned how to extract text from DOM elements, but that does not include an element's attributes. This part is important because most web scrapers fetch anchor elements and their href attribute. We start by fetching the <a> element, then extract the value of its href attribute using square-bracket notation:

for article in articles:
    anchor = article.find('a')  # None when the article has no link
    if anchor is not None:
        link = anchor['href']

You can use the same square-bracket notation to extract other HTML attributes as well. A common use case is to fetch the URL of a link, as you did above.
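Square-bracket access raises a KeyError when the attribute is absent, so for optional attributes the get method can be a safer choice. A short sketch; the second anchor without an href and the extra class values are made up for illustration.

```python
from bs4 import BeautifulSoup

# Two anchors: one with attributes, one without an href (made up for illustration).
html = '<a href="thelink.com" class="external link">Random Link</a><a>No href</a>'
soup = BeautifulSoup(html, "html.parser")
first, second = soup.find_all("a")

print(first["href"])       # thelink.com
print(first["class"])      # ['external', 'link'] (class is multi-valued, so a list)
print(second.get("href"))  # None instead of a KeyError
```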
