
Web Scraping with Python: How to Extract Data from Websites

  • Writer: Samul Black
  • 6 min read

Web scraping with Python is a powerful technique that allows you to extract data from websites automatically, saving time and effort compared to manual copying. In this tutorial, we’ll dive into web scraping with Python using the popular Beautiful Soup library.


To make things practical, we’ll demonstrate how to extract data directly from ColabCodes, showing step by step how to collect information from real web pages. By the end of this guide, you’ll have a solid foundation to scrape data from any website for your projects or analysis.


What is Web Scraping?

Web scraping is the process of automatically collecting information from websites. Instead of manually copying and pasting content, web scraping allows you to extract large amounts of structured data quickly and efficiently. This can include anything from product prices and blog posts to news articles, stock information, and even social media data. Essentially, web scraping turns the unstructured content on web pages into usable data that you can analyze, store, or use in your applications.


Web scraping is not just about collecting data—it’s about unlocking insights that are otherwise hidden within websites. With web scraping, you can track trends over time, compare information across multiple sources, and even build datasets for advanced analytics or machine learning projects. It allows individuals and businesses to make data-driven decisions by transforming vast amounts of online content into structured, actionable information. Whether you are a researcher, developer, marketer, or data enthusiast, web scraping opens the door to opportunities that manual data collection simply cannot match.


Why Web Scraping with Python?

Python has become the go-to language for web scraping due to its simplicity, readability, and powerful libraries. Its syntax is beginner-friendly, which means even those new to programming can start scraping websites with minimal effort. Key advantages of using Python for web scraping:


  1. Ease of use: Clear syntax and simple code make scraping faster to implement.

  2. Powerful libraries: Beautiful Soup, Requests, and Selenium handle most scraping tasks efficiently.

  3. Flexibility: Works with static pages, dynamic content, and even websites requiring interaction.

  4. Community support: Extensive tutorials, forums, and documentation help solve problems quickly.

  5. Integration-ready: Python can easily process, analyze, and store scraped data for further use.


Libraries like Beautiful Soup simplify HTML parsing, Requests allows you to fetch web pages effortlessly, and Selenium can handle dynamic, interactive websites that require user actions. Together, these tools provide a flexible and robust scraping ecosystem.
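To illustrate how these pieces fit together, here is a minimal sketch that fetches a page with Requests and parses it with Beautiful Soup. The URL https://example.com is only a placeholder, not a page from this tutorial:

import requests
from bs4 import BeautifulSoup

# Placeholder URL - swap in the page you actually want to scrape
url = "https://example.com"

# Fetch the page; a timeout avoids hanging on unresponsive servers
response = requests.get(url, timeout=10)
response.raise_for_status()

# Parse the HTML and pull out a couple of common elements
soup = BeautifulSoup(response.text, "html.parser")
print(soup.title.string)            # the page <title> text
for link in soup.find_all("a"):
    print(link.get("href"))         # every hyperlink on the page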

Python’s strong community, extensive documentation, and wealth of tutorials make it easy to find solutions, troubleshoot issues, and improve your scraping projects continuously. This combination of simplicity, versatility, and support is why Python dominates the web scraping landscape.


Importance of Web Scraping

Web scraping plays a crucial role in many areas, allowing individuals and businesses to access information quickly and efficiently. Its importance goes beyond mere data collection, helping users gain actionable insights and make data-driven decisions. Key benefits of web scraping include:


  1. Data Analysis: Collect data from multiple sources to study trends, perform research, or conduct competitive analysis.

  2. Automation: Save time and reduce errors by automating repetitive tasks like monitoring updates on websites.

  3. Business Intelligence: Extract insights from market trends, pricing, product availability, or customer behavior.

  4. Content Aggregation: Compile news articles, blog posts, reviews, or product listings into a single, structured source.

  5. Decision Making: Access timely and accurate data to support strategic planning and operations.


By leveraging these benefits, web scraping empowers businesses, researchers, and developers to work more efficiently, make smarter decisions, and gain insights that would be difficult or impossible to collect manually.


Common Use Cases of Web Scraping

Web scraping is widely used across industries and projects, enabling businesses, researchers, and developers to leverage web data in meaningful ways. Popular use cases include:


  1. E-commerce: Collect product details, prices, reviews, and availability from online stores.

  2. Finance & Stock Monitoring: Track stock prices, cryptocurrency trends, or financial reports.

  3. Data Science & Machine Learning: Build datasets for AI, natural language processing, or predictive analytics projects.

  4. Media & Content Aggregation: Aggregate news articles, blogs, and social media content for analysis or dashboards.

  5. Market Research: Monitor competitors’ offerings, promotions, and trends in real time.


These use cases highlight how versatile web scraping is, providing practical solutions for businesses, academics, and developers to harness online data for insights, innovation, and strategic advantage.


Hands-On Implementation: Web Scraping in Python Using the Beautiful Soup Module

In this hands-on section, we’ll put theory into practice by demonstrating how to scrape data from a real website using Python and the Beautiful Soup library. Instead of manually copying content, we’ll automate the process to extract meaningful text from multiple pages of ColabCodes. By leveraging techniques like parsing HTML and navigating website structures, this practical example will show how web scraping can be used to collect, organize, and analyze web content efficiently. Whether you’re a beginner or looking to expand your Python skills, this exercise provides a clear, step-by-step approach to building a functional web scraping workflow.


Step 1: Fetching the Sitemap

The first step in our web scraping workflow is to access the website’s sitemap, which contains a structured list of all the pages we want to scrape. Using Python’s requests library, we send a GET request to the sitemap URL while including custom headers to mimic a real browser. This makes it more likely that the website treats our request as a regular browser visit and returns the XML content of the sitemap. By checking the response status code, we confirm whether the sitemap was successfully fetched before proceeding to parse and extract the URLs. This initial step sets the foundation for efficiently scraping all relevant pages without manually collecting URLs.

import requests
from bs4 import BeautifulSoup
import json

# Sitemap URL
SITEMAP_URL = "https://www.colabcodes.com/pages-sitemap.xml"

# Custom headers so the request looks like a regular browser visit
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36"
}

# Fetch the sitemap and stop early if it cannot be retrieved
print(f"Fetching sitemap: {SITEMAP_URL}")
resp = requests.get(SITEMAP_URL, headers=HEADERS, timeout=30)
if resp.status_code != 200:
    print(f"Failed to load sitemap (code {resp.status_code})")
    raise SystemExit(1)

Step 2: Parsing the Sitemap and Extracting URLs

After successfully fetching the sitemap, the next step is to parse its content and extract all the page URLs. We use Beautiful Soup with the lxml parser (installed separately with pip install lxml) to read the XML structure of the sitemap. By finding all <loc> tags, which store individual page URLs, we create a list of every page on the website. This approach allows us to automatically gather all relevant pages without manually searching for links. At the same time, we initialize a dictionary to store the text content of each page, setting the stage for systematically scraping and organizing the website’s data.

# Parse URLs from sitemap
soup = BeautifulSoup(resp.content, "lxml")
urls = [tag.text.strip() for tag in soup.find_all("loc")]
print(f"Found {len(urls)} URLs in sitemap.")

# Dictionary to store page text
pages_dict = {}
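
As an optional sanity check (not part of the original workflow), you can preview a few of the parsed URLs before scraping to confirm the sitemap was read correctly:

# Optional: preview the first few URLs to confirm the sitemap parsed correctly
for url in urls[:5]:
    print(url)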

Step 3: Scraping and Storing Text from Each Web Page

With the list of URLs ready, the script now loops through each page and fetches its content one by one. For every URL, a GET request is sent with the same browser-like headers used earlier. The HTML response is then parsed using Beautiful Soup, allowing us to extract the visible text from the page. By calling get_text(), the script collects readable content while ignoring HTML tags, making the data easier to analyze or process later. Each page’s extracted text is stored in a dictionary using the page URL as the key, ensuring that the content remains organized and easy to reference. Error handling is included to skip inaccessible pages and continue scraping without interrupting the entire process, making the approach reliable and scalable.

# Loop through each page URL
for i, page_url in enumerate(urls, start=1):
    try:
        print(f"[{i}/{len(urls)}] Fetching: {page_url}")
        page_resp = requests.get(page_url, headers=HEADERS, timeout=30)
        if page_resp.status_code != 200:
            print(f"  Skipped (status {page_resp.status_code})")
            continue

        # Parse page HTML
        page_soup = BeautifulSoup(page_resp.text, "html.parser")

        # Extract main text 
        body_text = page_soup.get_text(separator="\n", strip=True)

        # Save in dictionary
        pages_dict[page_url] = body_text
        print(f"  Stored text for {page_url}")

    except Exception as e:
        print(f"  Error scraping {page_url}: {e}")

Output:
[1/41] Fetching: https://www.colabcodes.com/freelance-programming-experts
  Stored text for https://www.colabcodes.com/freelance-programming-experts
[2/41] Fetching: https://www.colabcodes.com/computer-vision-phd-help
  Stored text for https://www.colabcodes.com/computer-vision-phd-help
[3/41] Fetching: https://www.colabcodes.com/nlp-project-and-research-help
  Stored text for https://www.colabcodes.com/nlp-project-and-research-help
[4/41] Fetching: https://www.colabcodes.com/coding-help-code-coach
  Stored text for https://www.colabcodes.com/coding-help-code-coach
...

Step 4: Formatting and Displaying the Extracted Content

After collecting text from all web pages, the final step focuses on presenting the scraped data in a clean and readable format. This helper function iterates through the stored page content and prints each page’s text with clear visual separators. By splitting the extracted text into individual lines and removing empty entries, the output becomes easier to read and understand. Additional spacing between paragraphs improves readability, making it simpler to review the scraped content directly in the console. This step is especially useful for validating the extracted data before saving it to files or using it for further analysis.

def pretty_print_pages(pages_dict):
    for url, text in pages_dict.items():
        print("="*80)
        print(f"URL: {url}")
        print("="*80 + "\n")

        # Split text by lines, remove empty lines, and print with spacing
        lines = [line.strip() for line in text.splitlines() if line.strip()]
        for line in lines:
            print(line)
            print()  # extra newline for paragraph spacing

        print("\n" + "="*80 + "\n")

        # Preview only the first page; remove this break to print every page
        break

pretty_print_pages(pages_dict)

Output:
================================================================================
URL: https://www.colabcodes.com/freelance-programming-experts
================================================================================

Hire Freelance Programming Experts for Product Builds | ColabCodes

top of page

contact.colabcodes@gmail.com

+918899822578

Freelance

Generative AI Experts
...
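
Beyond printing to the console, you may want to keep the scraped content for later analysis rather than re-scraping the site each time. Here is a minimal sketch that saves pages_dict to a JSON file using the json module imported in Step 1; the filename colabcodes_pages.json is just an example:

import json  # already imported in Step 1

# Persist the scraped text to disk so it can be reused without re-scraping
with open("colabcodes_pages.json", "w", encoding="utf-8") as f:
    json.dump(pages_dict, f, ensure_ascii=False, indent=2)

print(f"Saved {len(pages_dict)} pages to colabcodes_pages.json")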

Conclusion

Web scraping with Python opens up a practical way to work with the vast amount of information available across websites. With libraries like Beautiful Soup, extracting meaningful data becomes both accessible and efficient, even when dealing with multiple pages. Using structured sources such as sitemaps further streamlines the process, allowing web content to be collected in an organized and scalable manner.

As web data continues to play a critical role in analysis, automation, and intelligent applications, having a clear understanding of web scraping techniques becomes increasingly valuable. The concepts and approach discussed here provide a strong foundation that can be adapted and expanded for more advanced use cases, from refined content extraction to large-scale data pipelines and analytics-driven projects.

Get in touch for customized mentorship, research and freelance solutions tailored to your needs.
