Taobao Image Scraping
1. Analyze how the search URL is constructed
2. Find where the image URLs sit in the HTML
3. Download the images
Taobao URL pattern analysis
https://s.taobao.com/search?spm=a21bo.jianhua.201867-main.2.5af911d9zRurGY&q=%E5%86%85%E8%A1%A3
https://s.taobao.com/search?spm=a21bo.jianhua.201867-main.2.5af911d9zRurGY&q=%E5%86%85%E8%A1%A3&bcoffset=1&ntoffset=1&p4ppushleft=2%2C48&s=44
https://s.taobao.com/search?spm=a21bo.jianhua.201867-main.2.5af911d9zRurGY&q=%E5%86%85%E8%A1%A3&bcoffset=-2&ntoffset=-2&p4ppushleft=2%2C48&s=88
s is the result offset: s = 44 * (n - 1), where n is the page number
q is the search keyword (URL-encoded)
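To make the pattern concrete, a minimal sketch of building the URL for a given page (build_search_url is a hypothetical helper, and the trimmed query string assumes q and s are the only parameters that matter):

import urllib.parse

def build_search_url(keyword, page_no):
    # q is the URL-encoded keyword, s is the offset: 44 results per page
    return "https://s.taobao.com/search?q={}&s={}".format(
        urllib.parse.quote(keyword), 44 * (page_no - 1))

print(build_search_url("内衣", 3))
# https://s.taobao.com/search?q=%E5%86%85%E8%A1%A3&s=88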
Taobao image pattern analysis
1. Get the image URL
2. Check whether that URL appears directly in the HTML (only fall back to capturing JS/XHR traffic if it really isn't there)
Image URL analysis
Full-size image:
http://g-search3.alicdn.com/img/bao/uploaded/i4/i1/63801264/O1CN01DJOh0V1LCxKN1zvlN_!!63801264.jpg
Thumbnail:
https://g-search3.alicdn.com/img/bao/uploaded/i4/i1/63801264/O1CN01DJOh0V1LCxKN1zvlN_!!63801264.jpg_250x250.jpg_.webp
In the HTML:
"pic_url":"//g-search3.alicdn.com/img/bao/uploaded/i4/i1/63801264/O1CN01DJOh0V1LCxKN1zvlN_!!63801264.jpg"
Note:
The search page requires login, so you have to simulate login with a cookie: add your own cookie to the request headers.
Doing that naively raises UnicodeEncodeError: 'latin-1' codec can't encode characters in position 702-706: ordin
Cause:
Presumably the cookie is latin-1 encoded, but the browser renders it as UTF-8, and that UTF-8 text is what gets copied into the header.
latin-1 bytes -> displayed as UTF-8 -> copied as UTF-8 -> Python string holds the UTF-8-decoded characters -> header serialization fails (because those characters fall outside latin-1)
so the value is already wrong by the time it goes into the header
The fix:
latin-1 bytes -> displayed as UTF-8 -> copied as UTF-8 -> first encode the string back to bytes with UTF-8 -> then decode those bytes as latin-1 -> the resulting string serializes into the header correctly
"cookie":"cookie".encode("utf-8").decode("latin1")
Taobao image scraper code
import os
import re
import time
import urllib.parse
import urllib.request

key = "短裙"    # search keyword
page = 5        # number of result pages to fetch
enKey = urllib.parse.quote(key)

header = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36 Edg/106.0.1370.42",
    # paste your own cookie here; the encode/decode round-trip keeps it latin-1 safe
    "cookie": "paste your own cookie here".encode("utf-8").decode("latin1"),
}

# make sure the output directories exist
os.makedirs("./htmls", exist_ok=True)
os.makedirs("./imgs", exist_ok=True)

for i in range(page):
    url = "https://s.taobao.com/search?q={}&suggest=0_1&commend=all&ssid=s5-e&search_type=item&sourceId=tb.index&spm=a21bo.jianhua.201856-taobao-item.2&ie=utf8&initiative_id=tbindexz_20170306&_input_charset=utf-8&wq=&suggest_query=&source=suggest&bcoffset=1&ntoffset=1&p4ppushleft=2%2C48&s={}".format(enKey, i * 44)
    req = urllib.request.Request(url, headers=header)
    data = urllib.request.urlopen(req).read().decode("utf-8", "ignore")

    # keep a copy of the page source for inspection
    with open("./htmls/{}.html".format(i), "wt", encoding="utf-8") as f:
        f.write(data)

    # pull every pic_url out of the page
    pat = '"pic_url":"(.*?)"'
    imgUrls = re.compile(pat).findall(data)
    for count, j in enumerate(imgUrls):
        imgUrl = "http:" + j
        path = "./imgs/" + str(i) + str(count) + ".jpg"
        print(imgUrl)
        print(path)
        # send the same headers here too, otherwise the anti-scraping check blocks the request
        req2 = urllib.request.Request(imgUrl, headers=header)
        res2 = urllib.request.urlopen(req2)
        with open(path, "wb") as f:
            f.write(res2.read())
        # alternative: urllib.request.urlretrieve(imgUrl, path)
        time.sleep(0.05)