Weekly Summary 14
Posted: 2023-05-17
This week I studied two problems that come up when scraping: cookie-based session validation and handling CAPTCHAs. The example below logs in to the Gushiwen site (so.gushiwen.cn), which requires both.
import requests
from bs4 import BeautifulSoup

# Login page (the "from" query parameter is the URL-encoded redirect target).
login_url = 'https://so.gushiwen.cn/user/login.aspx?from=http%3a%2f%2fso.gushiwen.cn%2fuser%2fcollect.aspx'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.0.0 Safari/537.36',
}

# Use one session for everything, so the cookies set by the login page
# are carried into the CAPTCHA download and the login POST.
session = requests.Session()
response = session.get(url=login_url, headers=headers)

# Parse the hidden ASP.NET form fields that must be posted back with the form.
soup = BeautifulSoup(response.text, 'lxml')
viewstate = soup.select('#__VIEWSTATE')[0].attrs['value']
viewstategenerator = soup.select('#__VIEWSTATEGENERATOR')[0].attrs['value']

data = {
    '__VIEWSTATE': viewstate,
    '__VIEWSTATEGENERATOR': viewstategenerator,
    'from': 'http://so.gushiwen.cn/user/collect.aspx',
    'email': 'your-account',   # fill in your own account
    'pwd': 'your-password',    # fill in your own password
    'code': '',                # CAPTCHA, filled in below
    'denglu': '登录',          # value of the "log in" submit button
}

# Download the CAPTCHA image through the same session. Do NOT use
# urllib.request.urlretrieve here: it opens a new connection without the
# session cookie, so the server would issue a different CAPTCHA.
# The image URL could also be read from the page itself:
# code_url = 'https://so.gushiwen.cn/' + soup.select('#imgCode')[0].attrs['src']
code_url = 'https://so.gushiwen.cn/RandCode.ashx'
response = session.get(code_url)
with open('code.jpg', 'wb') as fp:
    fp.write(response.content)

# Open code.jpg and type the CAPTCHA in by hand.
data['code'] = input('Enter the CAPTCHA: ')

# Submit the login form and save the resulting page.
response = session.post(url=login_url, data=data, headers=headers)
with open('gushiwen.html', 'w', encoding='utf-8') as fp:
    fp.write(response.text)
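The key idea above is that the CAPTCHA request and the login POST must share one `requests.Session`, because the server ties the expected CAPTCHA answer to the session cookie. A minimal sketch of that behavior, with a hand-set demo cookie (the cookie name and value here are made up for illustration, not taken from the real site):

```python
import requests

# A Session keeps cookies across requests made through it.
session = requests.Session()

# Simulate the server having set a session cookie on an earlier response.
session.cookies.set("ASP.NET_SessionId", "demo123", domain="so.gushiwen.cn")

# Any later request prepared by this session automatically carries the cookie,
# without the request ever touching the network.
req = requests.Request("GET", "https://so.gushiwen.cn/RandCode.ashx")
prepared = session.prepare_request(req)
print(prepared.headers["Cookie"])  # ASP.NET_SessionId=demo123
```

A fresh `requests.get(...)` call, by contrast, starts with an empty cookie jar, which is exactly why downloading the CAPTCHA with `urllib.request.urlretrieve` breaks the login.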