Web Scraping Assignment

(2) Use the requests library's get() function to access the website below 20 times. Print the returned status and the text content, and compute the lengths of the page content returned by the text attribute and the content attribute. (Different student IDs do different pages; completing the required part earns a pass grade.)

d: 360 Search homepage (for student IDs ending in 7 or 8)

import requests

try:
    for i in range(20):
        r = requests.get("https://www.so.com/")
        r.raise_for_status()        # raise an exception for non-200 responses
        r.encoding = 'utf-8'
        print(r)                    # prints the response status, e.g. <Response [200]>
    print(len(r.text))              # length of the decoded text (a str)
    print(len(r.content))           # length of the raw response bytes
except Exception as e:
    print("Error:", e)

 

Result:

runfile('C:/Users/燃/untitled0.py', wdir='C:/Users/燃')
<Response [200]>
<Response [200]>
<Response [200]>
<Response [200]>
<Response [200]>
<Response [200]>
<Response [200]>
<Response [200]>
<Response [200]>
<Response [200]>
<Response [200]>
<Response [200]>
<Response [200]>
<Response [200]>
<Response [200]>
<Response [200]>
<Response [200]>
<Response [200]>
<Response [200]>
<Response [200]>
4986
5294
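The two lengths differ because r.text is the decoded string, in which a multi-byte character counts as a single character, while r.content is the raw byte sequence, in which the same character occupies several bytes under UTF-8. A minimal sketch of the difference (the sample string is only illustrative):

s = "360搜索"
print(len(s))                  # 5 characters: 3 ASCII digits + 2 Chinese characters
print(len(s.encode('utf-8')))  # 9 bytes: each Chinese character takes 3 bytes in UTF-8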

(3) Here is a simple HTML page. Save it as a string and complete the tasks that follow. (Good grade)

Problem:
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<title>菜鸟教程(runoob.com)</title>
</head>
<body>
        <h1>我的第一个标题</h1>
        <p id="first">我的第一个段落。</p>
</body>
<table border="1">
        <tr>
                <td>row 1, cell 1</td>
                <td>row 1, cell 2</td>
        </tr>
        <tr>
                <td>row 2, cell 1</td>
                <td>row 2, cell 2</td>
        </tr>
</table>
</html>

 

a. Print the content of the head tag and the last two digits of your student ID.

 

 

# -*- encoding:utf-8 -*-
from requests import get
from bs4 import BeautifulSoup

def getText(url):
    # fetch a page and return its decoded text; return '' on any failure
    try:
        r = get(url, timeout=5)
        r.raise_for_status()
        r.encoding = 'utf-8'
        return r.text
    except Exception as e:
        print("Error:", e)
        return ''

url = "http://www.runoob.com/"
html = getText(url)
soup = BeautifulSoup(html, 'html.parser')   # name the parser explicitly to avoid the default-parser warning

# a. the head tag and the last two digits of the student ID
print("head:", soup.head)
print("head:", len(soup.head))
print("学号后两位:24=7")

# the body tag and the number of its direct children
print("body:", soup.body)
print("body:", len(soup.body))

# the title tag
print("title:", soup.title)

# the title text only
print("title_string:", soup.title.string)

# the element with a specific id
print("special_id", soup.find(id='cd-login'))

 
