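Web Crawler Assignment
(2) Use the get() function of the requests library to access one of the websites below 20 times. Print the response status and the contents of the text attribute, and compute the length of the page content returned by the text attribute and by the content attribute. (Each student does the page assigned to their student ID; this task is required for a passing grade.)
d: 360 Search homepage (for student IDs ending in 7 or 8)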
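from requests import get

try:
    # Request the 360 Search homepage 20 times and print each response status.
    for i in range(20):
        r = get("https://www.so.com/", timeout=5)
        r.raise_for_status()
        r.encoding = 'utf-8'
        print(r)
    # Lengths of the last response: decoded text (characters) and raw body (bytes).
    print(len(r.text))
    print(len(r.content))
except Exception as e:
    print("Error:", e)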
Result:
runfile('C:/Users/燃/untitled0.py', wdir='C:/Users/燃') <Response [200]> <Response [200]> <Response [200]> <Response [200]> <Response [200]> <Response [200]> <Response [200]> <Response [200]> <Response [200]> <Response [200]> <Response [200]> <Response [200]> <Response [200]> <Response [200]> <Response [200]> <Response [200]> <Response [200]> <Response [200]> <Response [200]> <Response [200]> 4986 5294
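The two lengths differ because `r.text` is a decoded `str`, measured in characters, while `r.content` is the raw `bytes` body, measured in bytes. A minimal sketch of the same effect without any network access, using a made-up sample string (an assumption for illustration, not the real 360 Search page):

```python
# Hypothetical sample standing in for a page body; not the real so.com HTML.
sample_text = "<title>360搜索</title>"

# What r.content would hold: the raw bytes of the body (UTF-8 assumed here).
sample_content = sample_text.encode("utf-8")

char_count = len(sample_text)      # analogous to len(r.text)
byte_count = len(sample_content)   # analogous to len(r.content)

print(char_count, byte_count)
```

The two Chinese characters in the sample each take three bytes in UTF-8, so the byte count exceeds the character count. A page that mixes ASCII markup with Chinese text shows the same kind of gap.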
(3) This is a simple HTML page. Store it as a string and complete the tasks that follow. (Required for a "good" grade.)
Problem:
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<title>菜鸟教程(runoob.com)</title>
</head>
<body>
<hl>我的第一个标题</hl>
<p id="first">我的第一个段落。</p>
</body>
<table border="1">
<tr>
<td>row 1, cell 1</td>
<td>row 1, cell 2</td>
</tr>
<tr>
<td>row 2, cell 1</td>
<td>row 2, cell 2</td>
<tr>
</table>
</html>
a. Print the contents of the head tag and the last two digits of your student ID
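# -*- encoding:utf-8 -*-
from bs4 import BeautifulSoup
from requests import get


def getText(url):
    """Fetch url and return its decoded text, or '' if the request fails."""
    try:
        r = get(url, timeout=5)
        r.raise_for_status()
        r.encoding = 'utf-8'
        return r.text
    except Exception as e:
        print("Error:", e)
        return ''


url = "http://www.runoob.com/"
html = getText(url)
# Name the parser explicitly; BeautifulSoup(html) alone emits a warning.
# Note: if the fetch failed, html is '' and soup.head / soup.body are None.
soup = BeautifulSoup(html, "html.parser")
print("head:", soup.head)
print("head:", len(soup.head))  # len() of a tag counts its direct children
print("Last two digits of student ID: 24=7")
print("body:", soup.body)
print("body:", len(soup.body))
print("title:", soup.title)
print("title_string:", soup.title.string)
print("special_id", soup.find(id='cd-login'))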
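The task statement asks for the page to be stored as a string, whereas the code above fetches a live site. A minimal sketch of the string-based variant follows, assuming `beautifulsoup4` is installed and using Python's built-in `html.parser`. The variable names are assumptions, the table is omitted for brevity, and the student-ID line is left out:

```python
from bs4 import BeautifulSoup

# The assignment's page stored as a string (abridged: the table is omitted).
html_doc = """<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<title>菜鸟教程(runoob.com)</title>
</head>
<body>
<hl>我的第一个标题</hl>
<p id="first">我的第一个段落。</p>
</body>
</html>"""

# Parse the string directly; no network request is needed.
soup = BeautifulSoup(html_doc, "html.parser")

head = soup.head                                  # the whole <head> element
title_text = soup.title.string                    # text inside <title>
first_paragraph = soup.find(id="first").get_text()  # text of <p id="first">

print("head:", head)
print("title_string:", title_text)
print("first paragraph:", first_paragraph)
```

Parsing the stored string keeps the result reproducible, since it does not depend on what the live site returns on a given day.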
