Automating-Webscraping-with-Airflow
https://github.com/fanqingsong/Automating-Webscraping-with-Airflow
This project provides step-by-step instructions for setting up and running a web scraping project with Apache Airflow. It scrapes data from an e-commerce site (tonaton.com) using Airflow's scheduling capabilities. Below are the key steps to set up and run the project.
Build the Airflow Docker image using the provided Dockerfile:
docker build -f Dockerfile.Airflow . -t airflow-scraper
This command will create an Airflow image named airflow-scraper.

Create the necessary project folders for your Airflow environment, including dags, logs, and plugins:

mkdir ./dags ./logs ./plugins

Create an .env file to set the Airflow user ID (UID) and group ID (GID). Your .env file should look like this:

AIRFLOW_UID=33333
AIRFLOW_GID=0

This ensures that Airflow runs with the correct permissions.
Start the Airflow and Spark containers using Docker Compose:
docker-compose -f docker-compose.Airflow.yaml up -d

This command will launch the containers in the background.
Define and configure the Airflow DAG for your web scraping task. Ensure that the DAG includes the necessary tasks, operators, and dependencies. Customize the DAG according to your web scraping requirements.
Once the DAG is defined and configured, you can trigger its execution through the Airflow CLI or web interface. The DAG will schedule and run your web scraping task at the specified intervals.
Monitor the progress of your web scraping tasks using the Airflow web interface or command-line tools. You can view logs, check task statuses, and troubleshoot any issues that may arise during the execution of the DAG.
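For example, a few common Airflow 2 CLI calls for the DAG defined later in this repo look like the following (a minimal sketch; run them inside the Airflow scheduler or webserver container, and note that the DAG id scraper_dag and task id egotickets_scraper come from dags/dag.py):

airflow dags unpause scraper_dag                               # let the scheduler pick up the DAG
airflow dags trigger scraper_dag                               # start a manual run outside the schedule
airflow dags list-runs -d scraper_dag                          # list recent runs and their state
airflow tasks test scraper_dag egotickets_scraper 2024-01-01   # run one task once, without recording state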
Feel free to customize the web scraping logic, add more tasks, or scale the project as needed. You can adjust scheduling intervals, add error handling, or integrate with external data storage systems to enhance your web scraping pipeline.
With these steps, you can set up a web scraping project using Apache Airflow, allowing you to automate data extraction from websites on a scheduled basis. Ensure that you adhere to the website's terms of service and legal regulations when performing web scraping activities.
https://github.com/fanqingsong/Automating-Webscraping-with-Airflow/blob/main/dags/tonaton_scraper.py
import logging
import time

import requests
from bs4 import BeautifulSoup

from .utils import save_to_csv_s3

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s - %(levelname)s - %(message)s"
)


def get_products(categories: list[str]):
    """Scrape the first three pages of each category and return a list of product dicts."""
    product_list = []
    for category in categories:
        url = f"https://tonaton.com/{category}"
        page = 1
        logging.info(f"fetching category: {category}")
        while page <= 3:
            logging.info(f"fetching page {page}")
            resp = requests.get(url + f"?page={page}")
            content = BeautifulSoup(resp.content, "html.parser")
            for product in content.select("div.product__content"):
                try:
                    title = product.select("span.product__title")[0].get_text().strip()
                    location = product.select("p.product__location")[0].get_text().strip()
                    description = product.select("p.product__description")[0].get_text().strip()
                    condition = product.select("div.product__tags span")[0].get_text().strip()
                    data = {
                        "title": title,
                        "location": location,
                        "description": description,
                        "condition": condition
                    }
                except IndexError:
                    # skip listings that are missing any of the expected fields
                    continue
                product_list.append(data)
            time.sleep(1)  # be polite between page requests
            page += 1
    return product_list


# categories to scrape
categories = ["c_vehicles", "c_mobile-phones-tablets", "c_fashion-and-beauty"]


def run():
    object_key = "src/product_list.csv"
    product_list = get_products(categories)
    save_to_csv_s3(object_key, product_list)
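The scraper depends on a save_to_csv_s3 helper from dags/utils.py, which is not reproduced here. A minimal sketch of what such a helper could look like, assuming boto3 and a bucket name taken from an environment variable (both assumptions, not necessarily the repo's actual implementation):

# dags/utils.py (hypothetical sketch)
import csv
import io
import os

import boto3


def save_to_csv_s3(object_key: str, rows: list[dict]) -> None:
    """Serialize a list of dicts to CSV and upload it to S3."""
    if not rows:
        return

    # Write the rows to an in-memory CSV buffer
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)

    # Upload the CSV to the configured bucket (bucket name is an assumption)
    s3 = boto3.client("s3")
    s3.put_object(
        Bucket=os.environ.get("SCRAPER_BUCKET", "my-scraper-bucket"),
        Key=object_key,
        Body=buffer.getvalue().encode("utf-8"),
    )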
https://github.com/fanqingsong/Automating-Webscraping-with-Airflow/blob/main/dags/dag.py
from datetime import timedelta

import airflow
from airflow import DAG
from airflow.operators.python import PythonOperator

from dags.tonaton_scraper import run

default_args = {
    "owner": "airflow",
    "retry_delay": timedelta(minutes=5),
}

# set up dag
with DAG(
    "scraper_dag",
    default_args=default_args,
    description="Run scraper with python operator",
    start_date=airflow.utils.dates.days_ago(1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    tonaton_scraper = PythonOperator(
        task_id="egotickets_scraper",
        python_callable=run,  # scrapes tonaton.com and uploads the CSV to S3
    )
Why Airflow is a Game Changer for Data Pipelines
https://webscrapingsite.com/resources/building-scraping-pipeline-apache-airflow/
Apache Airflow provides a framework to programmatically author, schedule and monitor complex workflows. According to a recent survey by Astronomer, 78% of data teams rely on Airflow for ETL and data integration.
Here are some of the key advantages Airflow delivers:
Visual Pipeline authoring
Airflow uses Python to define Directed Acyclic Graphs (DAGs), which provide a bird's-eye view of the workflow pipeline. Nodes represent tasks, while edges show dependencies. Constructing pipelines visually makes the logic easy to comprehend.
Dynamic scheduling with smart retry logic
Tasks can run on standard cron schedules, while also supporting dynamic date-based scheduling. Failed tasks are automatically retried using exponential backoff. The scheduler is highly configurable.
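In a DAG, this retry behaviour is driven by standard per-task arguments; a minimal sketch with illustrative values:

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def scrape_page():
    """Placeholder callable standing in for the real scraping logic."""
    ...


with DAG(
    "retry_demo",
    start_date=datetime(2024, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    scrape_task = PythonOperator(
        task_id="scrape",
        python_callable=scrape_page,
        retries=3,                           # retry a failed task up to 3 times
        retry_delay=timedelta(minutes=5),    # wait 5 minutes before the first retry
        retry_exponential_backoff=True,      # grow the wait between retries
        max_retry_delay=timedelta(hours=1),  # cap the backoff interval
    )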
Scalable distributed execution
Airflow can leverage Celery (via the CeleryExecutor) to distribute tasks across many workers in parallel. Pipelines effortlessly scale to handle millions of tasks. Asana saw a 5x speedup by switching to Airflow.
Advanced monitoring and alerting
Airflow has built-in charts, logs and task views to monitor pipelines. It can push metadata to enterprise APM tools. Task success and failure alerts can be configured using webhooks or email.
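A minimal sketch of how failure alerts are commonly wired up, using email_on_failure (which assumes SMTP is configured in airflow.cfg) and an on_failure_callback that could be pointed at a webhook instead of the log call shown here:

import logging


def notify_failure(context):
    """on_failure_callback: Airflow passes the task context when a task fails."""
    task_id = context["task_instance"].task_id
    logging.error("Task %s failed; post to a webhook or chat channel here.", task_id)


default_args = {
    "owner": "airflow",
    "email": ["alerts@example.com"],   # illustrative address
    "email_on_failure": True,          # needs [smtp] settings in airflow.cfg
    "email_on_retry": False,
    "on_failure_callback": notify_failure,
}

Because these live in default_args, every task in the DAG inherits the same alerting behaviour.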
Extensible and Open
Airflow has a modular plug-in architecture. The rich Python API makes it easy to customize and extend capabilities. The open-source community contributes operators for almost every tool.
For complex web scraping projects, Airflow addresses many common pain points out of the box through its enterprise-grade workflow orchestration capabilities. Next, let's walk through a simple web scraping example.
Scraping Product Data
To demonstrate core concepts, we’ll build a scraper for an e-commerce site using Python and Requests. While basic, it illustrates the key tasks in a web scraping pipeline.
Let's start by scraping data from a single page:
import requests
from bs4 import BeautifulSoup

URL = 'https://www.ecommerce.com/p/1'


def scrape_page(url):
    response = requests.get(url)
    # Parse the HTML and extract fields like title and price into a dict
    soup = BeautifulSoup(response.text, 'html.parser')
    data = {
        "title": soup.find("h1").text,
        "price": soup.find("div", class_="price").text,
    }
    return data


page_data = scrape_page(URL)
print(page_data)
This makes a request to the URL, parses the HTML, extracts fields like the title and price into a Python dict, and returns the structured data.
With some additional logic, we can loop through multiple product pages:
URLS = [...]  # List of 1000 urls

all_data = []
for url in URLS:
    data = scrape_page(url)
    all_data.append(data)

print(all_data)
This allows batch scraping of an entire catalog. But even for a few thousand pages, issues can arise:
- Slow performance – Requests are made sequentially, limited by network I/O
- Fragility – Any single failure breaks the entire loop
- Duplicates – No deduplication if a page is scraped multiple times
- No work tracking – Can't identify pages that still need scraping
Without robust orchestration, these problems are amplified at scale. Airflow provides the solution.
Creating an Airflow DAG for Web Scraping
In Airflow, pipelines are defined through DAGs (Directed Acyclic Graphs). A DAG outlines the steps and their dependencies as code.
We'll create a DAG to:
- Fetch URL list – Get CSV of product URLs from S3
- Scrape Page – Download and parse each URL
- Store Data – Insert JSON output into PostgreSQL
First we define some helper functions:
# extractor.py
import json

import requests
from bs4 import BeautifulSoup


def scrape(url):
    """Scrape page and return JSON data"""
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    data = {
        "title": soup.find("h1").text,
        "price": soup.find("div", class_="price").text,
        # ... extract other fields
    }
    return json.dumps(data)


# loader.py
import psycopg2


def insert_row(data):
    """Insert JSON data into Postgres"""
    conn = psycopg2.connect(DATABASE_URI)
    cur = conn.cursor()
    # Use psycopg2 to insert JSON data
    conn.commit()
    cur.close()
These contain the main business logic for scraping and data loading.
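The insert_row body above is deliberately left as a stub. One way it could be fleshed out, assuming a products table with a single JSON column named data and a DATABASE_URI connection string (both assumptions for illustration):

import psycopg2

DATABASE_URI = "postgresql://user:password@localhost:5432/scraper"  # illustrative connection string


def insert_row(data):
    """Insert one JSON document produced by scrape() into the products table."""
    conn = psycopg2.connect(DATABASE_URI)
    try:
        with conn.cursor() as cur:
            # data is already a JSON string, so it can be bound as a single parameter
            cur.execute("INSERT INTO products (data) VALUES (%s)", (data,))
        conn.commit()
    finally:
        conn.close()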
Now we import them into our DAG file:
# scraper_dag.py
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

from extractor import scrape
from loader import insert_row

default_args = {
    'owner': 'My Name',
    'depends_on_past': False,
    'email': ['me@example.com'],
    'retries': 2,
    'retry_delay': timedelta(minutes=5)
}

dag = DAG(
    'webscraper',
    default_args=default_args,
    start_date=datetime(2019, 1, 1),
    schedule_interval='@daily',
)

fetch_task = PythonOperator(
    task_id='fetch',
    python_callable=fetch_url_list,  # Loads CSV from S3
    dag=dag
)

scrape_task = PythonOperator(
    task_id='scrape',
    python_callable=scrape,
    op_kwargs={'url': '{{ ti.xcom_pull(task_ids="fetch") }}'},
    dag=dag,
)

store_task = PythonOperator(
    task_id='store',
    python_callable=insert_row,
    op_kwargs={'data': '{{ ti.xcom_pull(task_ids="scrape") }}'},
    dag=dag,
)

fetch_task >> scrape_task >> store_task
This DAG encapsulates the workflow by:
- Instantiating a PythonOperator for each step
- Passing data between tasks using XComs (see the sketch after this list)
- Setting directional dependencies between tasks
When run, Airflow will:
- Call fetch_url_list to load the CSV of URLs
- Parse the CSV and send each URL to the scrape task
- Take the JSON output of scraping and insert it into Postgres using insert_row
This provides a robust, scalable pipeline for web scraping!
Visualizing the DAG
The Airflow UI provides a visual representation of each DAG.
We can check:
- Number of task instances currently running
- Overall progress and health
- Logs and metadata
- Success/failure breakdown
The pipeline logic is transparently documented. When issues occur, Airflow surfaces them immediately to be diagnosed and fixed.
Comparing Airflow to Other Workflow Tools
Airflow is commonly compared to tools like Apache Oozie and Azkaban. Here's how it stacks up for web scraping:
| Tool | Strengths | Weaknesses |
| --- | --- | --- |
| Airflow | Python DAGs, pluggable architecture, great UI, scalability | Steep learning curve, can get complex fast |
| Oozie | Built-in Hadoop support, simple XML workflows | Designed for batch, limited flexibility |
| Azkaban | Simple key-value configs, great visualization | Not as extensible, smaller community |

For general-purpose pipelines, Airflow provides more power and flexibility. The pluggable architecture makes it easy to slot in new data sources, tools, and processing logic as needed.
Scraping Data From Multiple Sources
A key strength of Airflow is supporting complex workflows that span multiple systems.
Let's look at an example DAG that scrapes two different sites, then combines and aggregates the data:
from airflow import DAG
from airflow.operators.python import PythonOperator

# Scraping operators
scrape_site1 = PythonOperator(task_id='scrape_site1', ...)
scrape_site2 = PythonOperator(task_id='scrape_site2', ...)

# Transformation operators
clean_data = PythonOperator(task_id='clean_data', ...)
combine_data = PythonOperator(task_id='combine_data', ...)

# Aggregate and load
calculate_stats = PythonOperator(task_id='stats', ...)
load_warehouse = PythonOperator(task_id='load_warehouse', ...)

# Dependencies
[scrape_site1, scrape_site2] >> clean_data >> combine_data >> calculate_stats >> load_warehouse
This demonstrates a complex branching pipeline with aggregation logic. Airflow's code-based approach makes it straightforward to build multi-step workflows.
Moving to Production
While the example provided demonstrates core concepts, more care is needed when moving to production. Here are some best practices:
Error Handling
- Set up SMTP/SMS/Slack alerts for task failures
- Build custom Airflow plugins to implement retry policies
- Perform ad-hoc scrapes through API/CLI when needed
Logging & Metrics
- Stream logs to ElasticSearch for analysis
- Record business metrics like number of rows scraped
- Add sensors to check SLAs and data quality (a per-task SLA sketch follows this list)
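Besides custom sensors, Airflow's built-in per-task sla parameter is one way to get SLA alerts: if a task has not finished within the window, an SLA miss is recorded and a callback can fire. A minimal sketch with illustrative values:

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def alert_sla_miss(dag, task_list, blocking_task_list, slas, blocking_tis):
    """Called when an SLA miss is detected; hook in email/Slack notifications here."""
    print(f"SLA missed for: {task_list}")


with DAG(
    "sla_demo",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    sla_miss_callback=alert_sla_miss,
) as dag:
    scrape = PythonOperator(
        task_id="scrape",
        python_callable=lambda: None,  # placeholder callable
        sla=timedelta(hours=1),        # flag the task if not finished within 1 hour of the run
    )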
Scaling Out
- Run Airflow on Kubernetes for auto-scaling
- Add queues and pools for different work priority levels (see the sketch after this list)
- Split work across regions using multiple DAGs
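Pools cap how many tasks run concurrently against a shared resource (handy for rate-limiting a target site), while queue routes a task to a specific group of Celery workers. A minimal sketch, assuming a pool named scraping_pool has already been created in the UI or with airflow pools set:

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

with DAG("pool_demo", start_date=datetime(2024, 1, 1), schedule_interval=None, catchup=False) as dag:
    scrape_task = PythonOperator(
        task_id="scrape",
        python_callable=lambda: None,  # placeholder callable
        pool="scraping_pool",          # shared concurrency cap, defined centrally
        queue="scraping",              # routed to Celery workers listening on this queue
        priority_weight=5,             # scheduled ahead of lower-weight tasks
    )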
Security
- Enable authentication and RBAC
- Encrypt credentials and sensitive data
- Use private VPCs with firewall rules to isolate workers
Monitoring
- Set up external APM tools like Datadog to track performance
- Create custom metrics on queue lag, data processed, pipeline duration etc.
Testing
- Leverage Airflow CLI to run one-off tests
- Mock outputs to simulate different scenarios
- Implement integration tests that validate pipeline correctness (a minimal DAG-integrity check is sketched below)
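A common starting point is a DAG integrity test run under pytest: load every DAG file and fail if any of them cannot be imported. A minimal sketch (the dag and task ids are the ones from this guide's example, and Airflow must be installed with AIRFLOW_HOME pointing at the project):

# test_dag_integrity.py
from airflow.models import DagBag


def test_dags_load_without_errors():
    """Fail the build if any DAG file cannot be imported."""
    dag_bag = DagBag(include_examples=False)
    assert dag_bag.import_errors == {}, f"DAG import errors: {dag_bag.import_errors}"


def test_scraper_dag_has_expected_task():
    """Sanity-check the scraper DAG structure."""
    dag_bag = DagBag(include_examples=False)
    dag = dag_bag.get_dag("webscraper")
    assert dag is not None
    assert "scrape" in dag.task_ids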
These are just a few tips – entire books have been written on mastering Airflow in production!
The key is to incrementally evolve the pipeline's robustness as your use case grows while leaning on Airflow's architecture every step of the way.
Conclusion
Apache Airflow is invaluable for creating resilient, scalable data pipelines. For web scraping projects, it provides an enterprise-grade orchestration framework out of the box.
Some of the key benefits we explored include:
- Modeling complex workflows as code through Python DAGs
- Dynamic task scheduling and retries
- Distributed scaling to handle large workloads
- Advanced monitoring capabilities
- Flexible framework to incorporate new data sources and processing logic
Airflow has a bit of a learning curve, but it pays dividends in taming unwieldy scraping pipelines. If you found this guide useful, check out the Official Airflow Docs for many more examples of production-grade workflows. Feel free to reach out if you have any other questions!