Automating-Webscraping-with-Airflow
https://github.com/fanqingsong/Automating-Webscraping-with-Airflow
This project provides step-by-step instructions for setting up and running a web scraping project with Apache Airflow. It scrapes data from an e-commerce site (tonaton.com) using Airflow's scheduling capabilities. Below are the key steps to set up and run the project.
Build the Airflow Docker image using the provided Dockerfile:
docker build -f Dockerfile.Airflow . -t airflow-scraper
This command will create an Airflow image named airflow-scraper.

Create the necessary project folders for your Airflow environment, including dags, logs, and plugins:

mkdir ./dags ./logs ./plugins

Create an .env file to set the Airflow user ID (UID) and group ID (GID). Your .env file should look like this:

AIRFLOW_UID=33333
AIRFLOW_GID=0

This ensures that Airflow runs with the correct permissions.
Start the Airflow and Spark containers using Docker Compose:
docker-compose -f docker-compose.Airflow.yaml up -d

This command will launch the containers in the background.
Define and configure the Airflow DAG for your web scraping task. Ensure that the DAG includes the necessary tasks, operators, and dependencies. Customize the DAG according to your web scraping requirements.
Once the DAG is defined and configured, you can trigger its execution through the Airflow CLI or web interface. The DAG will schedule and run your web scraping task at the specified intervals.
Monitor the progress of your web scraping tasks using the Airflow web interface or command-line tools. You can view logs, check task statuses, and troubleshoot any issues that may arise during the execution of the DAG.
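For example, a few common Airflow 2 CLI calls for the DAG defined later in this repo look like the following (a minimal sketch; run them inside the Airflow scheduler or webserver container, and note that the DAG id scraper_dag and task id egotickets_scraper come from dags/dag.py):

airflow dags unpause scraper_dag                               # let the scheduler pick up the DAG
airflow dags trigger scraper_dag                               # start a manual run outside the schedule
airflow dags list-runs -d scraper_dag                          # list recent runs and their state
airflow tasks test scraper_dag egotickets_scraper 2024-01-01   # run one task once, without recording state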
Feel free to customize the web scraping logic, add more tasks, or scale the project as needed. You can adjust scheduling intervals, add error handling, or integrate with external data storage systems to enhance your web scraping pipeline.
With these steps, you can set up a web scraping project using Apache Airflow, allowing you to automate data extraction from websites on a scheduled basis. Ensure that you adhere to the website's terms of service and legal regulations when performing web scraping activities.
https://github.com/fanqingsong/Automating-Webscraping-with-Airflow/blob/main/dags/tonaton_scraper.py
import logging
import time

import requests
from bs4 import BeautifulSoup

from .utils import save_to_csv_s3

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s - %(levelname)s - %(message)s"
)


def get_products(categories: list[str]):
    """Scrape the first three pages of each category and return a list of product dicts."""
    product_list = []
    for category in categories:
        url = f"https://tonaton.com/{category}"
        page = 1
        logging.info(f"fetching category: {category}")
        while page <= 3:
            logging.info(f"fetching page {page}")
            resp = requests.get(url + f"?page={page}")
            content = BeautifulSoup(resp.content, "html.parser")
            for product in content.select("div.product__content"):
                try:
                    title = product.select("span.product__title")[0].get_text().strip()
                    location = product.select("p.product__location")[0].get_text().strip()
                    description = product.select("p.product__description")[0].get_text().strip()
                    condition = product.select("div.product__tags span")[0].get_text().strip()
                    data = {
                        "title": title,
                        "location": location,
                        "description": description,
                        "condition": condition
                    }
                except IndexError:
                    # skip listings that are missing any of the expected fields
                    continue
                product_list.append(data)
            time.sleep(1)  # be polite between page requests
            page += 1
    return product_list


# categories to scrape
categories = ["c_vehicles", "c_mobile-phones-tablets", "c_fashion-and-beauty"]


def run():
    object_key = "src/product_list.csv"
    product_list = get_products(categories)
    save_to_csv_s3(object_key, product_list)
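The scraper depends on a save_to_csv_s3 helper from dags/utils.py, which is not reproduced here. A minimal sketch of what such a helper could look like, assuming boto3 and a bucket name taken from an environment variable (both assumptions, not necessarily the repo's actual implementation):

# dags/utils.py (hypothetical sketch)
import csv
import io
import os

import boto3


def save_to_csv_s3(object_key: str, rows: list[dict]) -> None:
    """Serialize a list of dicts to CSV and upload it to S3."""
    if not rows:
        return

    # Write the rows to an in-memory CSV buffer
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)

    # Upload the CSV to the configured bucket (bucket name is an assumption)
    s3 = boto3.client("s3")
    s3.put_object(
        Bucket=os.environ.get("SCRAPER_BUCKET", "my-scraper-bucket"),
        Key=object_key,
        Body=buffer.getvalue().encode("utf-8"),
    )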
https://github.com/fanqingsong/Automating-Webscraping-with-Airflow/blob/main/dags/dag.py
from datetime import timedelta

import airflow
from airflow import DAG
from airflow.operators.python import PythonOperator

from dags.tonaton_scraper import run

default_args = {
    "owner": "airflow",
    "retry_delay": timedelta(minutes=5),
}

# set up dag
with DAG(
    "scraper_dag",
    default_args=default_args,
    description="Run scraper with python operator",
    start_date=airflow.utils.dates.days_ago(1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    tonaton_scraper = PythonOperator(
        task_id="egotickets_scraper",
        python_callable=run,  # scrapes tonaton.com and uploads the CSV to S3
    )
Why Airflow is a Game Changer for Data Pipelines
https://webscrapingsite.com/resources/building-scraping-pipeline-apache-airflow/
Apache Airflow provides a framework to programmatically author, schedule and monitor complex workflows. According to a recent survey by Astronomer, 78% of data teams rely on Airflow for ETL and data integration.
Here are some of the key advantages Airflow delivers:
Visual Pipeline authoring
Airflow uses Python to define Directed Acyclic Graphs (DAGs), which provide a bird's-eye view of the workflow pipeline. Nodes represent tasks, while edges show dependencies. Constructing pipelines visually makes the logic easy to comprehend.
Dynamic scheduling with smart retry logic
Tasks can run on standard cron schedules, while also supporting dynamic date-based scheduling. Failed tasks are automatically retried using exponential backoff. The scheduler is highly configurable.
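In a DAG, this retry behaviour is driven by standard per-task arguments; a minimal sketch with illustrative values:

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def scrape_page():
    """Placeholder callable standing in for the real scraping logic."""
    ...


with DAG(
    "retry_demo",
    start_date=datetime(2024, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    scrape_task = PythonOperator(
        task_id="scrape",
        python_callable=scrape_page,
        retries=3,                           # retry a failed task up to 3 times
        retry_delay=timedelta(minutes=5),    # wait 5 minutes before the first retry
        retry_exponential_backoff=True,      # grow the wait between retries
        max_retry_delay=timedelta(hours=1),  # cap the backoff interval
    )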
Scalable distributed execution
Airflow can leverage Celery (via the CeleryExecutor) to distribute tasks across many workers in parallel. Pipelines effortlessly scale to handle millions of tasks. Asana saw a 5x speedup by switching to Airflow.
Advanced monitoring and alerting
Airflow has built-in charts, logs and task views to monitor pipelines. It can push metadata to enterprise APM tools. Task success and failure alerts can be configured using webhooks or email.
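A minimal sketch of how failure alerts are commonly wired up, using email_on_failure (which assumes SMTP is configured in airflow.cfg) and an on_failure_callback that could be pointed at a webhook instead of the log call shown here:

import logging


def notify_failure(context):
    """on_failure_callback: Airflow passes the task context when a task fails."""
    task_id = context["task_instance"].task_id
    logging.error("Task %s failed; post to a webhook or chat channel here.", task_id)


default_args = {
    "owner": "airflow",
    "email": ["alerts@example.com"],   # illustrative address
    "email_on_failure": True,          # needs [smtp] settings in airflow.cfg
    "email_on_retry": False,
    "on_failure_callback": notify_failure,
}

Because these live in default_args, every task in the DAG inherits the same alerting behaviour.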
Extensible and Open
Airflow has a modular plug-in architecture. The rich Python API makes it easy to customize and extend capabilities. The open-source community contributes operators for almost every tool.
For complex web scraping projects, Airflow addresses many common pain points out of the box through its enterprise-grade workflow orchestration capabilities. Next, let's walk through a simple web scraping example.
Scraping Product Data
To demonstrate core concepts, we’ll build a scraper for an e-commerce site using Python and Requests. While basic, it illustrates the key tasks in a web scraping pipeline.
Let's start by scraping data from a single page:
import requests
from bs4 import BeautifulSoup

URL = 'https://www.ecommerce.com/p/1'


def scrape_page(url):
    response = requests.get(url)
    # Parse the HTML and extract fields like title and price into a dict
    soup = BeautifulSoup(response.text, 'html.parser')
    data = {
        "title": soup.find("h1").text,
        "price": soup.find("div", class_="price").text,
    }
    return data


page_data = scrape_page(URL)
print(page_data)
This makes a request to the URL, parses the HTML, extracts fields like the title and price into a Python dict, and returns the structured data.
With some additional logic, we can loop through multiple product pages:
URLS = [...]  # List of 1000 urls

all_data = []
for url in URLS:
    data = scrape_page(url)
    all_data.append(data)

print(all_data)
This allows batch scraping of an entire catalog. But even for a few thousand pages, issues can arise:
- Slow performance – Requests are made sequentially, limited by network I/O
- Fragility – Any single failure breaks the entire loop
- Duplicates – No deduplication if a page is scraped multiple times
- No work tracking – Can't identify pages that still need scraping
Without robust orchestration, these problems are amplified at scale. Airflow provides the solution.
Creating an Airflow DAG for Web Scraping
In Airflow, pipelines are defined through DAGs (Directed Acyclic Graphs). A DAG outlines the steps and their dependencies as code.
We'll create a DAG to:
- Fetch URL list – Get CSV of product URLs from S3
- Scrape Page – Download and parse each URL
- Store Data – Insert JSON output into PostgreSQL
First we define some helper functions:
# extractor.py
import json

import requests
from bs4 import BeautifulSoup


def scrape(url):
    """Scrape page and return JSON data"""
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    data = {
        "title": soup.find("h1").text,
        "price": soup.find("div", class_="price").text,
        # ... extract other fields
    }
    return json.dumps(data)


# loader.py
import psycopg2


def insert_row(data):
    """Insert JSON data into Postgres"""
    conn = psycopg2.connect(DATABASE_URI)
    cur = conn.cursor()
    # Use psycopg2 to insert JSON data
    conn.commit()
    cur.close()
These contain the main business logic for scraping and data loading.
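The insert_row body above is deliberately left as a stub. One way it could be fleshed out, assuming a products table with a single JSON column named data and a DATABASE_URI connection string (both assumptions for illustration):

import psycopg2

DATABASE_URI = "postgresql://user:password@localhost:5432/scraper"  # illustrative connection string


def insert_row(data):
    """Insert one JSON document produced by scrape() into the products table."""
    conn = psycopg2.connect(DATABASE_URI)
    try:
        with conn.cursor() as cur:
            # data is already a JSON string, so it can be bound as a single parameter
            cur.execute("INSERT INTO products (data) VALUES (%s)", (data,))
        conn.commit()
    finally:
        conn.close()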
Now we import them into our DAG file:
# scraper_dag.py
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

from extractor import scrape
from loader import insert_row

default_args = {
    'owner': 'My Name',
    'depends_on_past': False,
    'email': ['me@example.com'],
    'retries': 2,
    'retry_delay': timedelta(minutes=5)
}

dag = DAG(
    'webscraper',
    default_args=default_args,
    start_date=datetime(2019, 1, 1),
    schedule_interval='@daily',
)

fetch_task = PythonOperator(
    task_id='fetch',
    python_callable=fetch_url_list,  # Loads CSV from S3
    dag=dag
)

scrape_task = PythonOperator(
    task_id='scrape',
    python_callable=scrape,
    op_kwargs={'url': '{{ ti.xcom_pull(task_ids="fetch") }}'},
    dag=dag,
)

store_task = PythonOperator(
    task_id='store',
    python_callable=insert_row,
    op_kwargs={'data': '{{ ti.xcom_pull(task_ids="scrape") }}'},
    dag=dag,
)

fetch_task >> scrape_task >> store_task
This DAG encapsulates the workflow by:
- Instantiating a PythonOperator for each step
- Passing data between tasks using XComs (see the sketch after this list)
- Setting directional dependencies between tasks
When run, Airflow will:
- Call fetch_url_list to load the CSV of URLs
- Parse the CSV and send each URL to the scrape task
- Take the JSON output of scraping and insert it into Postgres using insert_row
This provides a robust, scalable pipeline for web scraping!
Visualizing the DAG
The Airflow UI provides a visual representation of each DAG.
We can check:
- Number of task instances currently running
- Overall progress and health
- Logs and metadata
- Success/failure breakdown
The pipeline logic is transparently documented. When issues occur, Airflow surfaces them immediately to be diagnosed and fixed.
Comparing Airflow to Other Workflow Tools
Airflow is commonly compared to tools like Apache Oozie and Azkaban. Here's how it stacks up for web scraping:
| Tool | Strengths | Weaknesses |
| --- | --- | --- |
| Airflow | Python DAGs, pluggable architecture, great UI, scalability | Steep learning curve, can get complex fast |
| Oozie | Built-in Hadoop support, simple XML workflows | Designed for batch, limited flexibility |
| Azkaban | Simple key-value configs, great visualization | Not as extensible, smaller community |

For general-purpose pipelines, Airflow provides more power and flexibility. The pluggable architecture makes it easy to slot in new data sources, tools, and processing logic as needed.
Scraping Data From Multiple Sources
A key strength of Airflow is supporting complex workflows that span multiple systems.
Let's look at an example DAG that scrapes two different sites, then combines and aggregates the data:
from airflow import DAG
from airflow.operators.python import PythonOperator

# Scraping operators
scrape_site1 = PythonOperator(task_id='scrape_site1', ...)
scrape_site2 = PythonOperator(task_id='scrape_site2', ...)

# Transformation operators
clean_data = PythonOperator(task_id='clean_data', ...)
combine_data = PythonOperator(task_id='combine_data', ...)

# Aggregate and load
calculate_stats = PythonOperator(task_id='stats', ...)
load_warehouse = PythonOperator(task_id='load_warehouse', ...)

# Dependencies
[scrape_site1, scrape_site2] >> clean_data >> combine_data >> calculate_stats >> load_warehouse
This demonstrates a complex branching pipeline with aggregation logic. Airflow's code-based approach makes it straightforward to build multi-step workflows.
Moving to Production
While the example provided demonstrates core concepts, more care is needed when moving to production. Here are some best practices:
Error Handling
- Set up SMTP/SMS/Slack alerts for task failures
- Build custom Airflow plugins to implement retry policies
- Perform ad-hoc scrapes through API/CLI when needed
Logging & Metrics
- Stream logs to ElasticSearch for analysis
- Record business metrics like number of rows scraped
- Add sensors to check SLAs and data quality (a per-task SLA sketch follows this list)
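Besides custom sensors, Airflow's built-in per-task sla parameter is one way to get SLA alerts: if a task has not finished within the window, an SLA miss is recorded and a callback can fire. A minimal sketch with illustrative values:

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def alert_sla_miss(dag, task_list, blocking_task_list, slas, blocking_tis):
    """Called when an SLA miss is detected; hook in email/Slack notifications here."""
    print(f"SLA missed for: {task_list}")


with DAG(
    "sla_demo",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    sla_miss_callback=alert_sla_miss,
) as dag:
    scrape = PythonOperator(
        task_id="scrape",
        python_callable=lambda: None,  # placeholder callable
        sla=timedelta(hours=1),        # flag the task if not finished within 1 hour of the run
    )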
Scaling Out
- Run Airflow on Kubernetes for auto-scaling
- Add queues and pools for different work priority levels (see the sketch after this list)
- Split work across regions using multiple DAGs
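Pools cap how many tasks run concurrently against a shared resource (handy for rate-limiting a target site), while queue routes a task to a specific group of Celery workers. A minimal sketch, assuming a pool named scraping_pool has already been created in the UI or with airflow pools set:

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

with DAG("pool_demo", start_date=datetime(2024, 1, 1), schedule_interval=None, catchup=False) as dag:
    scrape_task = PythonOperator(
        task_id="scrape",
        python_callable=lambda: None,  # placeholder callable
        pool="scraping_pool",          # shared concurrency cap, defined centrally
        queue="scraping",              # routed to Celery workers listening on this queue
        priority_weight=5,             # scheduled ahead of lower-weight tasks
    )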
Security
- Enable authentication and RBAC
- Encrypt credentials and sensitive data
- Use private VPCs with firewall rules to isolate workers
Monitoring
- Set up external APM tools like Datadog to track performance
- Create custom metrics on queue lag, data processed, pipeline duration etc.
Testing
- Leverage Airflow CLI to run one-off tests
- Mock outputs to simulate different scenarios
- Implement integration tests that validate pipeline correctness (a minimal DAG-integrity check is sketched below)
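A common starting point is a DAG integrity test run under pytest: load every DAG file and fail if any of them cannot be imported. A minimal sketch (the dag and task ids are the ones from this guide's example, and Airflow must be installed with AIRFLOW_HOME pointing at the project):

# test_dag_integrity.py
from airflow.models import DagBag


def test_dags_load_without_errors():
    """Fail the build if any DAG file cannot be imported."""
    dag_bag = DagBag(include_examples=False)
    assert dag_bag.import_errors == {}, f"DAG import errors: {dag_bag.import_errors}"


def test_scraper_dag_has_expected_task():
    """Sanity-check the scraper DAG structure."""
    dag_bag = DagBag(include_examples=False)
    dag = dag_bag.get_dag("webscraper")
    assert dag is not None
    assert "scrape" in dag.task_ids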
These are just a few tips – entire books have been written on mastering Airflow in production!
The key is to incrementally evolve the pipeline's robustness as your use case grows while leaning on Airflow's architecture every step of the way.
Conclusion
Apache Airflow is invaluable for creating resilient, scalable data pipelines. For web scraping projects, it provides an enterprise-grade orchestration framework out of the box.
Some of the key benefits we explored include:
- Modeling complex workflows as code through Python DAGs
- Dynamic task scheduling and retries
- Distributed scaling to handle large workloads
- Advanced monitoring capabilities
- Flexible framework to incorporate new data sources and processing logic
Airflow has a bit of a learning curve, but it pays dividends in taming unwieldy scraping pipelines. If you found this guide useful, check out the Official Airflow Docs for many more examples of production-grade workflows. Feel free to reach out if you have any other questions!