Web Scraping: Downloading Taobao Images

Scraping product images from the Taobao site:

1. Analyze how the search URL is constructed
2. Locate the image URLs in the HTML
3. Download the images

Analyzing the Taobao URL pattern

https://s.taobao.com/search?spm=a21bo.jianhua.201867-main.2.5af911d9zRurGY&q=%E5%86%85%E8%A1%A3

https://s.taobao.com/search?spm=a21bo.jianhua.201867-main.2.5af911d9zRurGY&q=%E5%86%85%E8%A1%A3
&bcoffset=1&ntoffset=1&p4ppushleft=2%2C48&s=44

https://s.taobao.com/search?spm=a21bo.jianhua.201867-main.2.5af911d9zRurGY&q=%E5%86%85%E8%A1%A3
&bcoffset=-2&ntoffset=-2&p4ppushleft=2%2C48&s=88

s is the result offset: s = 44 * (n - 1), where n is the page number (44 items per page)
q is the search keyword, percent-encoded
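The pattern above can be sketched as a small helper. This is a simplified sketch that keeps only the q and s parameters; the real search URL carries many extra tracking parameters (spm, bcoffset, and so on):

```python
import urllib.parse

def build_search_url(keyword, page):
    # s is the result offset: 44 items per page, so page n starts at 44 * (n - 1)
    s = 44 * (page - 1)
    # percent-encode the keyword, e.g. 内衣 -> %E5%86%85%E8%A1%A3
    q = urllib.parse.quote(keyword)
    return "https://s.taobao.com/search?q={}&s={}".format(q, s)

print(build_search_url("内衣", 3))  # page 3 -> s=88, matching the captured URLs above
```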

Analyzing the Taobao image URL pattern

1. Get the image links
2. Check whether the URLs appear directly in the HTML (only fall back to capturing JS/XHR traffic if they really aren't there)
Image URL analysis
Full-size image:
http://g-search3.alicdn.com/img/bao/uploaded/i4/i1/63801264/O1CN01DJOh0V1LCxKN1zvlN_!!63801264.jpg
Thumbnail:
https://g-search3.alicdn.com/img/bao/uploaded/i4/i1/63801264/O1CN01DJOh0V1LCxKN1zvlN_!!63801264.jpg
_250x250.jpg_.webp
As it appears in the HTML:
"pic_url":"//g-search3.alicdn.com/img/bao/uploaded/i4/i1/63801264/O1CN01DJOh0V1LCxKN1zvlN_!!63801264.jpg"

Note:
    Taobao requires login here, so simulate a session by adding your own cookie to the request headers.
    Otherwise you may hit: UnicodeEncodeError: 'latin-1' codec can't encode characters in position 702-706: ordin
    Why this happens:
        The cookie bytes are Latin-1, but the browser renders them as UTF-8, and you copy that UTF-8 rendering:
        latin-1 bytes -- rendered as UTF-8 -- copied as a UTF-8-decoded string -- that string contains characters
        outside Latin-1, so urllib fails when it tries to encode the header value.
        The fix:
        re-encode the copied string back to bytes with UTF-8, then decode those bytes as Latin-1
        to recover a Latin-1-safe string before putting it in the header:
        "cookie":"cookie".encode("utf-8").decode("latin1")

Taobao image scraper code

import time
import urllib.request
import urllib.parse
import re

key = "短裙"   # search keyword
page = 5       # number of result pages to fetch
enKey = urllib.parse.quote(key)

for i in range(page):
    # s = 44 * page_index is the result offset (44 items per page)
    url = "https://s.taobao.com/search?q={}&suggest=0_1&commend=all&ssid=s5-e&search_type=item&sourceId=tb.index&spm=a21bo.jianhua.201856-taobao-item.2&ie=utf8&initiative_id=tbindexz_20170306&_input_charset=utf-8&wq=&suggest_query=&source=suggest&bcoffset=1&ntoffset=1&p4ppushleft=2%2C48&s={}".format(enKey, i * 44)
    header = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36 Edg/106.0.1370.42",
              # re-encode the cookie copied from the browser so it is Latin-1 safe
              "cookie": "paste your own cookie here".encode("utf-8").decode("latin1")
              }
    req = urllib.request.Request(url, headers=header)
    data = urllib.request.urlopen(req).read().decode("utf-8", "ignore")
    with open("./htmls/{}.html".format(i), "wt", encoding="utf-8") as f:
        f.write(data)
    pat = '"pic_url":"(.*?)"'
    imgUrls = re.compile(pat).findall(data)
    for count, j in enumerate(imgUrls):
        imgUrl = "http:" + j   # pic_url is protocol-relative, so add a scheme
        print(imgUrl)
        path = "./imgs/" + str(i) + str(count) + ".jpg"
        print(path)
        # send the same headers on the image request too,
        # otherwise it gets flagged by the anti-scraping checks
        req2 = urllib.request.Request(imgUrl, headers=header)
        res2 = urllib.request.urlopen(req2)
        with open(path, "wb") as f:
            f.write(res2.read())
        time.sleep(0.05)   # throttle requests a little
posted @ 2022-10-17 16:31 cc学习之路