If you want to explore web scraping, Python is the best place to start. Thanks to its simple syntax and great library support, Python makes it easy to extract data from websites.
In this tutorial, you’ll learn how to use Requests and Beautiful Soup to scrape web pages and analyze them. As an example, the project will collect post titles from the r/programming subreddit and determine the most mentioned programming languages.
Web scraping is the automated collection of data from websites.
Scrapers fetch a page’s HTML and extract the data they need. More advanced tools may even drive headless browsers to simulate real user actions.
⚠️ Web scraping can break easily when a website’s structure changes. Always check for available APIs before scraping.
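Reddit is a good example: appending `.json` to most listing URLs returns structured data with no HTML parsing required. Here is a minimal sketch using the Requests library (installed in the next section); the field layout follows Reddit’s standard listing format:

```python
import requests

# Most old Reddit listings are also served as JSON
response = requests.get(
    "https://old.reddit.com/r/programming/.json",
    headers={"User-agent": "Sorry, learning Python!"},
)
data = response.json()
# Each post sits under data -> children -> data
titles = [post["data"]["title"] for post in data["data"]["children"]]
```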
Python offers unmatched simplicity and a strong ecosystem:
- Requests for handling HTTP requests
- BeautifulSoup for HTML parsing
- Scrapy and Playwright for advanced use cases
These tools are well-documented, reliable, and widely used by developers.
You’ll need Python installed. Then, install the libraries:
```
pip install requests
pip install beautifulsoup4
```

Create a file named scraper.py for your code.
Fetching the page data is the first step. The example below loads the front page of r/programming through the old Reddit interface, whose server-rendered HTML is much simpler to parse than the modern site.
```python
import requests

# Set a custom User-agent string: Reddit throttles requests
# that arrive with the default Requests User-agent
page = requests.get(
    "https://old.reddit.com/r/programming/",
    headers={'User-agent': 'Sorry, learning Python!'},
)
html = page.content
```
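Before parsing, it’s worth confirming that the request actually succeeded. A minimal check on the `page` object from above:

```python
# Raise an exception on 4xx/5xx responses (e.g. 429 when rate-limited)
page.raise_for_status()
```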
To extract the titles from the HTML, use Beautiful Soup.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")
# On old Reddit, every post title is a link inside a <p class="title"> element
p_tags = soup.find_all("p", "title")
titles = [p.find("a").get_text() for p in p_tags]
print(titles)
```

This prints the titles of the posts on the first page.
You can extend the script to scrape multiple pages by following the link behind each page’s next button.
```python
import requests
from bs4 import BeautifulSoup
import time

post_titles = []
next_page = "https://old.reddit.com/r/programming/"

# Collect titles from 20 consecutive pages
for _ in range(20):
    page = requests.get(next_page, headers={'User-agent': 'Sorry, learning Python!'})
    html = page.content
    soup = BeautifulSoup(html, "html.parser")
    p_tags = soup.find_all("p", "title")
    titles = [p.find("a").get_text() for p in p_tags]
    post_titles += titles
    # The next button's link points to the following page of the listing
    next_page = soup.find("span", "next-button").find("a")['href']
    # Wait between requests to stay polite and reduce the risk of rate-limiting
    time.sleep(3)

print(post_titles)
```
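One fragility worth noting: on the last available page there is no next button, so `soup.find("span", "next-button")` returns `None` and the script crashes. A defensive variant of that step, as a sketch:

```python
# Stop cleanly if the listing runs out of pages
next_button = soup.find("span", "next-button")
if next_button is None:
    break
next_page = next_button.find("a")["href"]
```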
After scraping, you can analyze which programming languages appear most often in the post titles.

```python
# Languages to look for, in lowercase to match the normalized words
language_counter = {
    "javascript": 0, "html": 0, "css": 0, "sql": 0,
    "python": 0, "typescript": 0, "java": 0, "c#": 0,
    "c++": 0, "php": 0, "c": 0, "powershell": 0,
    "go": 0, "rust": 0, "kotlin": 0, "dart": 0, "ruby": 0,
}

# Split every title into lowercase words
words = []
for title in post_titles:
    words += [word.lower() for word in title.split()]

# Count exact matches against the language names
for word in words:
    if word in language_counter:
        language_counter[word] += 1

print(language_counter)
```
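To read the results at a glance, you can print the counts ranked from most to least mentioned; a small optional addition:

```python
# Print the languages ranked by number of mentions
for language, count in sorted(language_counter.items(),
                              key=lambda item: item[1], reverse=True):
    print(f"{language}: {count}")
```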
Frequent scraping from a single IP address can get you blocked. A proxy server hides your IP and distributes requests across many addresses. Here’s an example with IPRoyal Residential Proxies:
```python
PROXIES = {
    "http": "http://yourusername:yourpassword@geo.iproyal.com:22323",
    "https": "http://yourusername:yourpassword@geo.iproyal.com:22323",
}

page = requests.get(
    next_page,
    headers={'User-agent': 'Just learning Python, sorry!'},
    proxies=PROXIES,
)
```

This routes every request through the proxy pool, which reduces the risk of rate-limiting and bans.
You’ve learned how to:
- Fetch and parse HTML with Requests + BeautifulSoup
- Scrape multiple pages of Reddit
- Count programming language mentions
- Route requests through a proxy for safer scraping
For more advanced scraping, explore frameworks like Scrapy or Playwright.
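As a taste of what those tools add, here is a minimal Playwright sketch that drives a headless browser to collect the same titles. The CSS selector is an assumption based on old Reddit’s markup, and it requires `pip install playwright` plus `playwright install chromium` first:

```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()  # headless by default
    page = browser.new_page()
    page.goto("https://old.reddit.com/r/programming/")
    # Assumed selector: title links inside <p class="title"> elements
    titles = page.locator("p.title a.title").all_inner_texts()
    print(titles)
    browser.close()
```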