2018-12-03-Python Full-Stack Development - Day 91 - Web Scraping Basics

1. Web scraping

  Fetching pages from a website and filtering out the data you want.

2. Basic workflow

  Download the page (requests) -- filter it (BeautifulSoup)

  

import requests

data1 = requests.get(url='http://www.baidu.com')
data1.encoding = data1.apparent_encoding  # use the encoding detected from the response body
print(data1.text)

  A basic example of downloading a page from a URL.
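For reference, the same download can be hardened a little with a timeout and an HTTP status check. A minimal sketch; the timeout value and User-Agent header are my own assumptions, not part of the original notes:

```python
import requests

def fetch(url, timeout=10):
    """Download a page, guess its encoding, and return the decoded text.

    The timeout and User-Agent are illustrative choices, not requirements.
    """
    resp = requests.get(
        url,
        timeout=timeout,  # avoid hanging forever on an unresponsive server
        headers={'User-Agent': 'Mozilla/5.0'},  # some sites reject the default UA
    )
    resp.raise_for_status()                  # surface 4xx/5xx errors early
    resp.encoding = resp.apparent_encoding   # guess the charset from the body bytes
    return resp.text
```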

  

import uuid

import requests
from bs4 import BeautifulSoup

data1 = requests.get(url='http://www.autohome.com.cn/news/')
data1.encoding = data1.apparent_encoding

soup = BeautifulSoup(data1.text, features='html.parser')

# Narrow the search to the news-list container first.
tag = soup.find(id='auto-channel-lazyload-article')

li_list = tag.find_all('li')  # find every <li> inside the container; returns a list

for i in li_list:
    # each element of the loop is a Tag object
    a = i.find('a')
    if a:
        print(a.attrs.get('href'))  # print the link inside the <a> tag
        try:
            text1 = a.find('h3').text  # the article title
            img_url = a.find('img').attrs.get('src')
            if img_url.startswith('//'):  # the src is protocol-relative; prepend a scheme
                img_url = 'https:' + img_url
            img_file = requests.get(url=img_url)
            file_name = str(uuid.uuid4()) + '.jpg'  # random file name for the saved image
            with open(file_name, 'wb') as f:
                f.write(img_file.content)
        except Exception:
            # some <li> entries have no <h3> or <img>; skip them
            pass
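The same extraction can also be written with CSS selectors via `soup.select`, which collapses the nested `find` calls into one expression. A minimal, network-free sketch on a made-up HTML fragment (the fragment and its URLs are my own, not taken from the real site):

```python
from bs4 import BeautifulSoup

# A small, made-up HTML fragment standing in for the downloaded news page.
html = '''
<div id="auto-channel-lazyload-article">
  <ul>
    <li><a href="/news/1.html"><h3>Title One</h3><img src="//img.example.com/1.jpg"></a></li>
    <li><a href="/news/2.html"><h3>Title Two</h3><img src="//img.example.com/2.jpg"></a></li>
  </ul>
</div>
'''

soup = BeautifulSoup(html, features='html.parser')

# One CSS selector replaces the find(id=...) + find_all('li') + find('a') chain.
results = []
for a in soup.select('#auto-channel-lazyload-article li a'):
    title = a.select_one('h3').text
    img_url = 'https:' + a.select_one('img')['src']  # prepend a scheme to the protocol-relative src
    results.append((title, img_url))

print(results)
```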
To be continued

 

posted @ 2018-12-03 22:38  brownbearye