Python Crawler Preparation

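How a crawler works

1. Imitate a browser: send a network request and download the target page to the local machine.

2. Apply matching rules to extract the data you need from the page and filter out the rest.

3. Store the extracted data on disk as needed (JSON, CSV, Excel, or a database).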
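Step 3 lists a database as one storage target. As a rough sketch of that option, the snippet below writes a list of dicts into SQLite using only the standard library. The function name `save_to_sqlite`, the `movies` table, its three columns, and the sample rows are all assumptions made for illustration, not part of the original program.

```python
import sqlite3

def save_to_sqlite(rows, db_path=":memory:"):
    """Store a list of dicts in a SQLite table and return the open connection."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS movies (title TEXT, score TEXT, director TEXT)"
    )
    # Named placeholders (:title, ...) let each dict be passed through unchanged.
    conn.executemany(
        "INSERT INTO movies (title, score, director) "
        "VALUES (:title, :score, :director)",
        rows,
    )
    conn.commit()
    return conn

# Hypothetical sample rows, not real scraped data.
sample = [
    {"title": "Movie A", "score": "8.1", "director": "Director A"},
    {"title": "Movie B", "score": "7.4", "director": "Director B"},
]
conn = save_to_sqlite(sample)
print(conn.execute("SELECT title, score FROM movies ORDER BY title").fetchall())
```

Passing a file path instead of the default `":memory:"` would keep the data on disk between runs.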

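Libraries to install

1. requests: sends the HTTP requests (step 1 of how a crawler works). Install with: pip install requests

2. bs4: parses the downloaded data (step 2 of how a crawler works). Install with: pip install bs4 (the package itself is published as beautifulsoup4, so pip install beautifulsoup4 also works)

3. lxml: parses HTML and XML data. BeautifulSoup is essentially just a shell; underneath it still relies on a parser such as lxml, html5lib, or html.parser. Install with: pip install lxml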
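To get a feel for what the underlying parser does beneath the BeautifulSoup "shell", here is a minimal sketch that uses Python's built-in html.parser module directly, with no third-party install. The class name `NowPlayingParser` and the sample markup are made up for illustration; the real page is larger and may differ.

```python
from html.parser import HTMLParser

class NowPlayingParser(HTMLParser):
    """Collect data-* attributes from <li data-category="nowplaying"> tags."""

    def __init__(self):
        super().__init__()
        self.movies = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs.
        attrs = dict(attrs)
        if tag == "li" and attrs.get("data-category") == "nowplaying":
            self.movies.append(
                {"title": attrs.get("data-title"), "score": attrs.get("data-score")}
            )

# Made-up markup for illustration only.
sample_html = """
<ul>
  <li data-category="nowplaying" data-title="Movie A" data-score="8.1"></li>
  <li data-category="upcoming" data-title="Movie B" data-score="0"></li>
</ul>
"""

parser = NowPlayingParser()
parser.feed(sample_html)
print(parser.movies)
```

BeautifulSoup wraps this kind of event-driven parsing in a searchable tree, which is why a call like find_all is so much shorter than a hand-written handler.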

import requests
from bs4 import BeautifulSoup
import json

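# Part 1: fetch the target page
def get_page():
    # 1. URL of the Douban "now playing" page
    url = "https://movie.douban.com/cinema/nowplaying/shanghai/"
    # 2. Headers sent with the request; the User-Agent identifies the client as a browser
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36"
    }
    # 3. This page is fetched with a GET request (not POST)
    # 4. Send the request; the timeout keeps the call from hanging indefinitely
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()  # raise an exception on a 4xx/5xx response
    text = response.text
    # print(response.text)
    return text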

# Part 2: apply matching rules to extract the needed data and filter out the rest
def parse_page(text):
    # BeautifulSoup extracts the data you want according to the rules you give it
    # (Side note: the 360 browser is built on the IE and Google Chrome engines)
    soup = BeautifulSoup(text, 'lxml')
    movies = []
    # Find every <li> whose data-category attribute is "nowplaying"
    li_list = soup.find_all("li", attrs={"data-category": "nowplaying"})
    for li in li_list:
        # print(li)
        # print("========")
        # Read the data-* attributes that hold the fields we want
        img = li.find('img')
        movie = {
            'title': li['data-title'],
            'score': li['data-score'],
            'release': li['data-release'],
            'duration': li['data-duration'],
            'director': li['data-director'],
            'actors': li['data-actors'],
            # Guard against an <li> that has no <img> child
            'thumbnail': img['src'] if img else None,
        }
        movies.append(movie)
        # print(movie)
    return movies
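# Part 3: store the data
def save_data(data):
    with open('douban.json', 'w', encoding='utf-8') as fp:
        # Serialize the list of dicts as JSON and write it to the file object fp;
        # ensure_ascii=False writes Chinese characters as-is instead of \uXXXX escapes
        json.dump(data, fp, ensure_ascii=False)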

if __name__ == '__main__':
    text = get_page()
    movies = parse_page(text)
    save_data(movies)
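The program above saves JSON, but the storage step also mentions CSV. A possible CSV variant of save_data might look like the sketch below. The function name `save_csv`, the default file name `douban.csv`, and the `FIELDS` list are assumptions for illustration; the field names simply mirror the keys built in parse_page.

```python
import csv

# Column order for the CSV file; mirrors the keys of each movie dict.
FIELDS = ["title", "score", "release", "duration", "director", "actors", "thumbnail"]

def save_csv(movies, path="douban.csv"):
    # newline="" is what the csv module expects when opening files;
    # utf-8-sig adds a BOM, which helps Excel recognise the file as UTF-8.
    with open(path, "w", newline="", encoding="utf-8-sig") as fp:
        writer = csv.DictWriter(fp, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(movies)
```

It could be called as save_csv(movies) in place of, or alongside, save_data(movies).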

posted on 2019-03-12 21:05 harper7777
