04 XPath [Example]: Scraping maoyan

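XPath: installing and using the lxml library

Content to extract

A randomly chosen fragment

The movie information contained in a <dd> node looks like this: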

<dd>
	<i class="board-index board-index-1">1</i>
	<a href="/films/1200486" title="我不是药神" class="image-link" data-act="boarditem-click" data-val="{movieId:1200486}">
		<img src="//s3plus.meituan.net/v1/mss_e2821d7f0cfe4ac1bf9202ecf9590e67/cdn-prod/file:5788b470/image/loading_2.e3d934bf.png" alt="" class="poster-default">
		<img alt="我不是药神" class="board-img" src="https://p0.pipi.cn/mmdb/d2dad59253751bd236338fa5bd5a27c710413.jpg?imageView2/1/w/160/h/220">
	</a>
	<div class="board-item-main">
		<div class="board-item-content">
			<div class="movie-item-info">
				<p class="name">
					<a href="/films/1200486" title="我不是药神" data-act="boarditem-click" data-val="{movieId:1200486}">我不是药神</a>
				</p>
				<p class="star">
					主演:徐峥,周一围,王传君
				</p>
				<p class="releasetime">上映时间:2018-07-05</p>
			</div>
			<div class="movie-item-number score-num">
				<p class="score">
					<i class="integer">9.</i>
					<i class="fraction">6</i>
				</p>
			</div>

		</div>
	</div>
</dd>
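As a quick check of the XPath expressions before running the full spider, here is a minimal sketch that parses a trimmed copy of the fragment above with lxml. The helper names `first_text` and `parse_dd` are hypothetical and not part of the original code. The score extraction (joining the `integer` and `fraction` parts) is an extra field the spider below does not collect.

```python
from lxml import etree

# Trimmed copy of the <dd> fragment shown above (poster images omitted).
SAMPLE = """
<dl class="board-wrapper">
  <dd>
    <i class="board-index board-index-1">1</i>
    <div class="movie-item-info">
      <p class="name"><a href="/films/1200486" title="我不是药神">我不是药神</a></p>
      <p class="star">
        主演:徐峥,周一围,王传君
      </p>
      <p class="releasetime">上映时间:2018-07-05</p>
    </div>
    <div class="movie-item-number score-num">
      <p class="score"><i class="integer">9.</i><i class="fraction">6</i></p>
    </div>
  </dd>
</dl>
"""


def first_text(node, path, default=""):
    """Return the first XPath result, stripped, or `default` if nothing matched."""
    found = node.xpath(path)
    return found[0].strip() if found else default


def parse_dd(dd):
    """Extract one movie's fields from a <dd> element (hypothetical helper)."""
    # The score is split across two <i> tags, so join their text nodes.
    score = "".join(dd.xpath('.//p[@class="score"]/i/text()'))
    return {
        "title": first_text(dd, './/p[@class="name"]/a/text()'),
        "star": first_text(dd, './/p[@class="star"]/text()'),
        "release_time": first_text(dd, './/p[@class="releasetime"]/text()'),
        "score": score,
    }


tree = etree.HTML(SAMPLE)
movies = [parse_dd(dd) for dd in tree.xpath('//dl[@class="board-wrapper"]/dd')]
print(movies)
```

Using a helper instead of indexing with `[0]` directly means a missing node yields an empty string rather than an IndexError, which may be convenient if the page layout changes.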

Code

# coding=utf-8
import requests
import random
import csv
from lxml import etree


class MaoyanSpider(object):
    def user_agent(self):
        """
        Return a randomly chosen User-Agent string.
        :return: one entry from ua_list
        """
        ua_list = [
            'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.101',
            'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.122',
            'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.71',
            'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95',
            'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.71',
            'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)',
            'Mozilla/5.0 (Windows NT 5.1; U; en; rv:1.8.1) Gecko/20061208 Firefox/2.0.0 Opera 9.50',
            'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:34.0) Gecko/20100101 Firefox/34.0',
        ]
        return random.choice(ua_list)

    def __init__(self):
        self.url = 'https://www.maoyan.com/board/4?offset=0'
        # self.headers = {'User-Agent': self.user_agent()}
        self.headers = {
            'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/113.0",
            "Referer": "https://www.maoyan.com/"
        }

    def do_requests(self, url):
        print(self.headers)
        r = requests.get(url=url, headers=self.headers)
        r.encoding = 'utf-8'
        res = r.text
        return res

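    # Download an image; mp4 and zip files can be downloaded the same way
    # Note: the img/ directory must already exist, otherwise open() raises FileNotFoundError
    def save_data_b(self, url, img_name):
        r = requests.get(url)
        with open('img/' + img_name, mode='wb') as f:
            f.write(r.content)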

    # Save as CSV
    def save_data_csv(self, data_list):
        headers = data_list[0].keys()  # column names come from the first row's keys

        with open('movice.csv', 'w', newline='', encoding='utf-8') as f:
            f_csv = csv.DictWriter(f, headers)
            f_csv.writeheader()
            f_csv.writerows(data_list)

    # Parse the main page
    def parser_main_text(self, html_str):
        # Build the parse tree
        parse_html = etree.HTML(html_str)

        # Select the dd children of the dl with class="board-wrapper"; this matches 10 <dd> nodes
        dd_list = parse_html.xpath('//dl[@class="board-wrapper"]/dd')
        print(dd_list)

        # .// selects all descendants of the current dd node
        data_list = list()
        for dd in dd_list:
            # Extract the fields
            dataKV = {}
            dataKV['title'] = dd.xpath('.//p[@class="name"]/a/text()')[0].strip()
            dataKV['star'] = dd.xpath('.//p[@class="star"]/text()')[0].strip()
            dataKV['release_time'] = dd.xpath('.//p[@class="releasetime"]/text()')[0].strip()
            # The poster URL is read from data-src; the sample fragment shows src,
            # presumably because it was copied from the rendered page
            dataKV['img'] = dd.xpath('.//a/img[@class="board-img"]/@data-src')[0].strip()
            # e.g. https://p0.pipi.cn/mmdb/d2dad59253751bd236338fa5bd5a27c710413.jpg?imageView2/1/w/160/h/220
            img_name = (dataKV['img'].split('?')[0]).split('/')[-1]
            # Download the image
            self.save_data_b(dataKV['img'], img_name)
            data_list.append(dataKV)

        # Save the data as CSV
        # print(dataKV)
        self.save_data_csv(data_list)

    def run(self):
        # Request the page
        res_main_content = self.do_requests(self.url)
        # Parse the page
        self.parser_main_text(res_main_content)


if __name__ == '__main__':
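    # Steps
    # 1. Request the content of the main page
    # 2. Scrape each movie's title, cast, and release date from the main page
    # 3. Download the images
    # 4. Save as CSV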
    spider = MaoyanSpider()
    spider.run()
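The spider derives the image file name by splitting the URL on `?` and `/`. As an alternative sketch, the standard library's `urllib.parse.urlsplit` can separate the path from the query string first. The function name `image_name_from_url` is hypothetical, and the URL is the one from the code comment above.

```python
import posixpath
from urllib.parse import urlsplit


def image_name_from_url(url):
    """Return the file name part of an image URL, ignoring the query string."""
    # urlsplit separates the path from the "?imageView2/1/w/160/h/220" query.
    return posixpath.basename(urlsplit(url).path)


url = ("https://p0.pipi.cn/mmdb/"
       "d2dad59253751bd236338fa5bd5a27c710413.jpg?imageView2/1/w/160/h/220")
name = image_name_from_url(url)
print(name)
```

For URLs of this shape, both approaches give the same file name; the `urlsplit` version simply makes the intent more explicit.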

Generated CSV

title,star,release_time,img
我不是药神,"主演:徐峥,周一围,王传君",上映时间:2018-07-05,https://p0.pipi.cn/mmdb/d2dad59253751bd236338fa5bd5a27c710413.jpg?imageView2/1/w/160/h/220
肖申克的救赎,"主演:蒂姆·罗宾斯,摩根·弗里曼,鲍勃·冈顿",上映时间:1994-09-10(加拿大),https://p0.pipi.cn/mmdb/fb7386020fa51b0fafcf3e2e3a0bbe694d17d.jpg?imageView2/1/w/160/h/220
海上钢琴师,"主演:蒂姆·罗斯,比尔·努恩 ,克兰伦斯·威廉姆斯三世",上映时间:2019-11-15,https://p0.pipi.cn/mmdb/d2dad592c7e7e1d2365bf1b63cd25951b722b.jpg?imageView2/1/w/160/h/220
绿皮书,"主演:维果·莫腾森,马赫沙拉·阿里,琳达·卡德里尼",上映时间:2019-03-01,https://p0.pipi.cn/mmdb/d2dad59253751b230f21f0818a5bfd4d8679c.jpg?imageView2/1/w/160/h/220
霸王别姬,"主演:张国荣,张丰毅,巩俐",上映时间:1993-07-26,https://p0.pipi.cn/mmdb/fb7386beddd338537c8ea3bb80d25a9078b13.jpg?imageView2/1/w/160/h/220
美丽人生,"主演:罗伯托·贝尼尼,朱斯蒂诺·杜拉诺,赛尔乔·比尼·布斯特里克",上映时间:2020-01-03,https://p0.pipi.cn/mmdb/d2dad592c7e7e1d2367a3507befaed31a5903.jpg?imageView2/1/w/160/h/220
这个杀手不太冷,"主演:让·雷诺,加里·奥德曼,娜塔莉·波特曼",上映时间:1994-09-14(法国),https://p0.pipi.cn/mmdb/d2dad592c7e7e13ba3ddd25677b4d70fc45fa.jpg?imageView2/1/w/160/h/220
小偷家族,"主演:中川雅也,安藤樱,松冈茉优",上映时间:2018-08-03,https://p0.pipi.cn/mmdb/d2dad5925372ffd7c387a9d01bddad81625c3.jpg?imageView2/1/w/160/h/220
哪吒之魔童降世,"主演:吕艳婷,囧森瑟夫,瀚墨",上映时间:2019-07-26,https://p0.pipi.cn/mmdb/d2dad592537923f0ee07acada3ac59b9f3ffb.jpg?imageView2/1/w/160/h/220
怦然心动,"主演:玛德琳·卡罗尔,卡兰·麦克奥利菲,艾丹·奎因",上映时间:2010-07-26(美国),https://p0.pipi.cn/mmdb/d2dad592b122ff8d3387a93ccab6036f616c1.jpg?imageView2/1/w/160/h/220
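Notice that the star column is wrapped in double quotes in the output above. That is because the cast list itself contains commas, and the csv module quotes such fields automatically. The small sketch below reproduces this with one row from the output, writing to an in-memory buffer instead of a file (the buffer is an assumption made for illustration only), and then reads the row back to confirm the field survives intact.

```python
import csv
import io

rows = [{
    "title": "我不是药神",
    "star": "主演:徐峥,周一围,王传君",
    "release_time": "上映时间:2018-07-05",
}]

# Write the rows the same way save_data_csv does, but into memory.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)

# Read the text back; the quoted field is restored as a single value.
read_back = list(csv.DictReader(io.StringIO(csv_text)))
print(read_back[0]["star"])
```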

Downloaded images

The downloaded images are saved in the img directory.

Reference

http://c.biancheng.net/python_spider/lxml-case.html

posted @ 2023-06-05 22:58  HaimaBlog