2025/2/12

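Python is a powerful tool for crawling data, and Hadoop's HDFS is well suited to storing data at scale. This post shows how to crawl data with Python and store it in HDFS.
Python crawler: use the requests and BeautifulSoup libraries to fetch and parse web page data.
HDFS access: use the hdfs library to write the data to HDFS.
Example code: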

import requests
from bs4 import BeautifulSoup
from hdfs import InsecureClient

# Crawl the page and return one "title,link" line per matching item
def crawl_data(url):
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # fail early on HTTP errors
    soup = BeautifulSoup(response.text, 'html.parser')
    data = []
    for item in soup.find_all('div', class_='item-class'):  # adjust to the target page's structure
        title = item.find('h2')
        link = item.find('a', href=True)
        if title is None or link is None:
            continue  # skip items that lack a title or a link
        data.append(f"{title.get_text(strip=True)},{link['href']}\n")
    return data

# Write the data to HDFS
def write_to_hdfs(data, hdfs_path):
    # HDFS (WebHDFS) address; 50070 is the Hadoop 2.x default NameNode HTTP port, Hadoop 3.x uses 9870
    client = InsecureClient('http://localhost:50070', user='hadoop')
    with client.write(hdfs_path, encoding='utf-8', overwrite=True) as writer:
        for line in data:
            writer.write(line)

if __name__ == "__main__":
    url = "http://example.com"  # target page
    hdfs_path = "/user/hadoop/crawled_data.csv"
    data = crawl_data(url)
    write_to_hdfs(data, hdfs_path)
    print(f"Data successfully written to {hdfs_path}")
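The script builds each row by joining the title and the link with a comma, so a title that itself contains a comma or a quote would shift columns when the file is read back. One possible way around this is the standard-library csv module, which quotes such fields. The sketch below is only an illustration; the helper name rows_to_csv_lines is hypothetical and not part of the original script.

```python
import csv
import io


def rows_to_csv_lines(rows):
    """Format (title, link) pairs as CSV lines with proper quoting.

    Hypothetical helper: it returns a list of text lines, the same shape
    that crawl_data returns, so write_to_hdfs could consume it unchanged.
    """
    lines = []
    for title, link in rows:
        buffer = io.StringIO()
        # lineterminator="\n" matches the line endings used in the script
        csv.writer(buffer, lineterminator="\n").writerow([title, link])
        lines.append(buffer.getvalue())
    return lines


if __name__ == "__main__":
    sample = [("Hello, world", "/a"), ("Plain title", "/b")]
    for line in rows_to_csv_lines(sample):
        print(line, end="")
```

To use it, crawl_data would collect (title, link) tuples instead of preformatted strings and pass them through this helper before writing.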

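Steps to run:
Install the required Python libraries:

pip install requests beautifulsoup4 hdfs
Make sure the Hadoop services are running, and check the HDFS status with the hdfs command (for example, hdfs dfsadmin -report).
Save the code above as crawl_and_store.py.
Run the script:

python crawl_and_store.py
Check the data in HDFS: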

hdfs dfs -cat /user/hadoop/crawled_data.csv
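The same check can also be made from Python. The hdfs library talks to the NameNode over the WebHDFS REST API, so the file can be read back with a plain HTTP request as well. Below is a minimal sketch using only the standard library. It assumes the same address and user as the script (http://localhost:50070 and hadoop), and the helper names webhdfs_url and read_back are hypothetical.

```python
from urllib.parse import quote, urlencode
from urllib.request import urlopen


def webhdfs_url(base, hdfs_path, op, user):
    """Build a WebHDFS REST URL of the form <base>/webhdfs/v1<path>?op=...&user.name=..."""
    query = urlencode({"op": op, "user.name": user})
    return f"{base.rstrip('/')}/webhdfs/v1{quote(hdfs_path)}?{query}"


def read_back(base, hdfs_path, user):
    """Read a file through WebHDFS and return its contents as text.

    urlopen follows the NameNode's redirect to a DataNode, so the DataNode
    hostname must be resolvable from the machine running this code.
    """
    url = webhdfs_url(base, hdfs_path, "OPEN", user)
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8")
```

Against a running cluster, read_back("http://localhost:50070", "/user/hadoop/crawled_data.csv", "hadoop") should return the same text that hdfs dfs -cat prints.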

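With a Python crawler, data can be extracted from web pages and written straight to HDFS. The approach suits large-scale crawling jobs, since the collected data lands on Hadoop's distributed storage.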

posted @ 2025-02-12 21:21 伐木工熊大
