# Beautiful Soup

## Setting it up

Beautiful Soup gives us a couple of useful functions that we can use to parse the HTML we fetched. To get started, install the library from your terminal; like most Python libraries, Beautiful Soup is installed with pip:

```
$ pip install beautifulsoup4
```

{% hint style="success" %}
**Success**: Then, import the library and proceed to create an object:
{% endhint %}

```python
import requests
from bs4 import BeautifulSoup

URL = 'https://www.monster.com/jobs/search/?q=Software-Developer&where=Australia'
page = requests.get(URL)

soup = BeautifulSoup(page.content, 'html.parser')
```

When you create the object, Beautiful Soup parses the content with the given parser and assigns the result to the variable, here `soup`.
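If you want to experiment without hitting a live site, you can also feed Beautiful Soup a hardcoded HTML string instead of `page.content`. The markup below is made up purely for illustration:

```python
from bs4 import BeautifulSoup

# A small hardcoded page, standing in for page.content
html = """
<html><body>
  <h1>Job Listings</h1>
  <p>First result.</p>
</body></html>
"""

soup = BeautifulSoup(html, 'html.parser')
print(soup.h1.text)   # prints: Job Listings
```

This is handy for testing your parsing logic, since the result is the same kind of `BeautifulSoup` object you would get from a real request.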

## Find Elements by ID

In an HTML page, DOM elements can have specific attributes such as `id`, which makes an element uniquely identifiable on the page. With the browser's developer tools we can inspect the HTML element that we want and read its `id` value.

{% hint style="info" %}
**Note:** Keep in mind that it’s helpful to switch back to your browser and explore the page using developer tools. This helps you learn how to find the elements you want to fetch.
{% endhint %}

```markup
<div id="ExampleDiv">
    <article class="article-content">Article 1.
        <a href="thelink.com">Random Link</a>
    </article>
    <article class="article-content">Article 2.</article>
    <article class="article-content">Article 3.</article>
</div>
```

Beautiful Soup has a specific function to find these **DOM** elements. In this case we are asking **BS** to find the **DOM** element whose **`id`** is **`ExampleDiv`**:

```python
results = soup.find(id='ExampleDiv')
```
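As a quick sanity check, here is the same lookup run against the sample markup from above, parsed from a string so it works offline:

```python
from bs4 import BeautifulSoup

html = """
<div id="ExampleDiv">
    <article class="article-content">Article 1.
        <a href="thelink.com">Random Link</a>
    </article>
    <article class="article-content">Article 2.</article>
    <article class="article-content">Article 3.</article>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')
results = soup.find(id='ExampleDiv')

print(results.name)    # the tag we matched: div
print(results['id'])   # ExampleDiv
```

`find` returns the first matching element (or `None` if nothing matches), so checking the result before using it is a good habit.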

## Find Elements by HTML Class Name

Beautiful Soup gives us more flexibility for getting **DOM** elements: besides an element's attributes, we can also filter by the element type, like so:

```python
articles = results.find_all('article', class_='article-content')
```

{% hint style="info" %}
**Note:** Be careful not to forget the underscore at the end. Because `class` is a reserved keyword in Python, Beautiful Soup uses the name *`class_`* instead.
{% endhint %}

So the first argument is the element type, and the second is a different attribute this time: the **`class`** attribute. And as you can see, we are no longer searching the scope of the whole page; by calling the function on the **results** variable, **BS** will only search inside the `div` we fetched earlier.

> Use this wisely: limiting your scope keeps computing time low, because you don't have to search the whole page every time, and it can also make your scraping more accurate by helping you avoid fetching unwanted elements.

As you can see, we used a different **BS** function this time: instead of `find` we use `find_all`, which returns an iterable containing the HTML of every matching article on the page.

```python
articles
# [<article class="article-content">Article 1.
#      <a href="thelink.com">Random Link</a>
#  </article>,
#  <article class="article-content">Article 2.</article>,
#  <article class="article-content">Article 3.</article>]
```

## Extracting Data

### Extracting Text

Now that we know how to find the **DOM** elements, we can learn how to extract data from them. And behold! Beautiful Soup has got you covered. You can use the `.text` property of a Beautiful Soup object to return only the **text content** of the HTML elements that the object contains:

```python
for article in articles:
    print(article.text.strip())
```

If your results look like this, you are doing great:

```
Article 1.
        Random Link
Article 2.
Article 3.
```

### Extract Attributes

We already learned how to extract data from **DOM** elements, but that only covers text, not the actual attributes of an element. This part is really important, because in most web scrapers you will be fetching anchor elements and their `href` attribute. We will start by fetching the `<a>` element, then extract the value of its `href` attribute using square-bracket notation:

```python
for article in articles:
    anchor = article.find('a')
    if anchor is not None:  # not every article contains a link
        link = anchor['href']
```

> You can use the same square-bracket notation to extract other HTML attributes as well. A common use case is to fetch the URL of a link, as you did above.
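Putting the pieces together, here is a sketch of the whole flow on the sample markup from above (parsed from a string rather than a live request). The guard on `find('a')` matters because not every article contains a link:

```python
from bs4 import BeautifulSoup

html = """
<div id="ExampleDiv">
    <article class="article-content">Article 1.
        <a href="thelink.com">Random Link</a>
    </article>
    <article class="article-content">Article 2.</article>
    <article class="article-content">Article 3.</article>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')
results = soup.find(id='ExampleDiv')

links = []
for article in results.find_all('article', class_='article-content'):
    anchor = article.find('a')
    if anchor is not None:           # articles 2 and 3 have no link
        links.append(anchor['href'])

print(links)   # ['thelink.com']
```

The same pattern scales to a real page: swap the hardcoded string for `page.content` and adjust the `id` and class names to match the site you are scraping.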
