Python Web Scrapper
News Scraper

We will now build a news scraper using what we learned about our tools.

Getting The Links

We use the homepage of the news site to collect the links to the most recent articles. To do that, we inspect the page to find what all the article links have in common. In this case each one sits inside an element with the right-post-category class, so we select those elements and append every link they contain to an array. In this example we will use GazetaExpress.

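import requests
from bs4 import BeautifulSoup

linksArray = []  # collected article URLs

r = requests.get('https://www.gazetaexpress.com/zgjedhjet2021/')
soup = BeautifulSoup(r.content, 'html.parser')
# Match any element carrying the right-post-category class
data = soup.find_all(attrs={'class': 'right-post-category'})
for div in data:
    links = div.find_all('a')
    for link in links:
        if link.get('href'):  # skip anchors without an href
            linksArray.append(link.get('href'))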
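
The link-collection step can also be tried offline before pointing it at a live site. The sketch below is only an illustration: SAMPLE_HTML, the example.com URLs, and the extract_links helper are hypothetical and not part of the original project. It additionally resolves relative links with urljoin and drops duplicates, which the loop above does not do.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical markup standing in for the news homepage.
SAMPLE_HTML = """
<div class="right-post-category"><a href="/article-one/">One</a></div>
<div class="right-post-category"><a href="https://example.com/article-two/">Two</a></div>
<div class="right-post-category"><a href="/article-one/">One again</a></div>
<div class="sidebar"><a href="/ignored/">Ignored</a></div>
"""


def extract_links(html, base_url, css_class="right-post-category"):
    """Return unique absolute article URLs, in page order."""
    soup = BeautifulSoup(html, "html.parser")
    seen = set()
    links = []
    for block in soup.find_all(attrs={"class": css_class}):
        # href=True keeps only anchors that actually have an href attribute
        for anchor in block.find_all("a", href=True):
            url = urljoin(base_url, anchor["href"])  # resolve relative hrefs
            if url not in seen:
                seen.add(url)
                links.append(url)
    return links


links_array = extract_links(SAMPLE_HTML, "https://example.com/")
print(links_array)
```

With a real page, the html argument would be r.content and base_url would be the address that was requested.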

Getting The Content

Now that we have the article links, we need to find which element holds the article text and which class identifies it. In this case, the content is inside a div with the single__content class.

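keyword = 'zgjedhje'  # example search term (placeholder); keep it lowercase

for link in linksArray:
    r = requests.get(link)
    soup = BeautifulSoup(r.text, 'html.parser')
    content = soup.find('div', attrs={'class': 'single__content'})
    if content is None:  # page has no single__content div, skip it
        continue
    text = content.text.lower()
    print(text.count(keyword))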
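
To check the extraction logic without making any requests, the same idea can be wrapped in a small function and run against inline HTML. This is a minimal sketch: SAMPLE_ARTICLE and count_keyword are hypothetical names introduced here for illustration, and the English sample text is made up.

```python
from bs4 import BeautifulSoup

# Hypothetical article page; only the single__content div matters here.
SAMPLE_ARTICLE = """
<html><body>
<div class="menu">Election menu entry</div>
<div class="single__content"><p>Election day. The ELECTION results are in.</p></div>
</body></html>
"""


def count_keyword(html, keyword, css_class="single__content"):
    """Count case-insensitive occurrences of keyword in the article body.

    Returns 0 when the page has no div with the expected class.
    """
    soup = BeautifulSoup(html, "html.parser")
    content = soup.find("div", attrs={"class": css_class})
    if content is None:
        return 0
    return content.get_text().lower().count(keyword.lower())


print(count_keyword(SAMPLE_ARTICLE, "election"))
```

Note that str.count matches substrings, so a search for "election" would also count "elections".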

We can now use the article text we get back for research or any other purpose. For example, the count function tells us how many times a word appears in each article, which lets us spot patterns or build simple statistics.
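
As one possible way to turn the scraped text into statistics, the sketch below tallies whole words across several articles with collections.Counter. The articles list and the word_frequencies helper are hypothetical stand-ins for real scraped content, and splitting on \w+ is a simplifying assumption about what counts as a word.

```python
import re
from collections import Counter

# Hypothetical article texts standing in for the scraped content.
articles = [
    "The election results were announced. The election was close.",
    "Voters discussed the election and the economy.",
    "Sports news with no politics at all.",
]


def word_frequencies(texts):
    """Count whole-word occurrences across all texts, case-insensitively."""
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r"\w+", text.lower()))
    return counts


frequencies = word_frequencies(articles)

# Number of articles that mention the word at least once
articles_mentioning = sum(
    "election" in re.findall(r"\w+", text.lower()) for text in articles
)

print(frequencies["election"], articles_mentioning)
print(frequencies.most_common(3))
```

Unlike str.count, this counts whole words only, so "election" and "elections" are tallied separately.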

Last updated 4 years ago