Scraping the Douban Top 250 movies, as practice with Python HTTP requests, storing data in a database, and regular expressions.
1. Identify the tags to scrape
Get the rank: x.find('em')
Get the title
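A minimal sketch of what these two lookups return, run against a trimmed, hypothetical fragment of one Top 250 entry (the real markup has more nesting):

from bs4 import BeautifulSoup

html = '''
<div class="item">
  <em>1</em>
  <a href="https://movie.douban.com/subject/1292052/" class="">
    <span class="title">肖申克的救赎</span>
  </a>
</div>
'''
soup = BeautifulSoup(html, "html.parser")
item = soup.find("div", class_="item")
print(item.find('em').text)                    # rank: 1
print(item.find('span', class_='title').text)  # title: 肖申克的救赎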
Problems encountered:
-
Three years ago this was done with urllib2; it is no longer used that way.
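In Python 3 the same pieces live under urllib.request and urllib.error, which is exactly what the script below imports:

from urllib.request import Request, urlopen   # Python 3 home of urllib2.Request / urllib2.urlopen
from urllib.error import URLError, HTTPError  # the exception classes moved here as well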
-
itemlist = soup.find_all("div",class_="item") finds all tags whose class is item and stores them in a list
-
url = i.find('a',class_='').get('href') gets the content of the a tag's href attribute
-
title = i.find('span',class_='title').text
-
Splitting with a regex: from a string like aa11a123/bbb2(sjd, extract a digit run of any length followed by /, and a digit run of any length followed by (.
The building blocks: [0-9] matches a single digit in the range 0-9; [0-9]+ matches a digit run of any length, where + means one or more repetitions.
[0-9]+/ matches n digits followed by /, e.g. 123/
[0-9]+[/,(] matches n digits followed by / or ( (the comma inside the character class is matched literally as well), e.g. 123/ and 2(
The Python convention is to write the pattern as a raw string: r'pattern'
patt = r'[0-9]+[/,(]'
#a digit run of any length followed by a single / or ( character
match = re.findall(patt,infoTmp)  # infoTmp is the string being processed
(The r'' raw-string notation is a Python convention; it is not written this way in HTML.)
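A quick check of the pattern against the example string from above (note again that the comma in [/,(] also matches a literal comma, so [/(] would be a slightly tighter class):

import re

patt = r'[0-9]+[/,(]'             # one or more digits followed by /, , or (
infoTmp = 'aa11a123/bbb2(sjd'     # the example string from above
print(re.findall(patt, infoTmp))  # prints ['123/', '2(']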
-
The first page attempted during crawling had the information inside its li elements loaded dynamically, so it cannot be fetched with the current approach.
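One common workaround for JavaScript-rendered pages (a sketch of an assumption, not something these notes actually used) is to let a real browser render the page via Selenium and feed the resulting HTML to the same parser:

from selenium import webdriver

driver = webdriver.Chrome()             # needs chromedriver available on PATH
driver.get('https://example.com/page')  # placeholder URL for a dynamically loaded page
html = driver.page_source               # the HTML after JavaScript has run
driver.quit()
# html can now go through BeautifulSoup exactly like the static pages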
-
phpstudy is installed on the machine and bundles its own MySQL, which conflicts with the locally installed MySQL. Fix: open cmd as administrator in the bin folder of the local MySQL install directory and register, then start, the service:
cd C:\D\mysql-8.0.19-winx64\bin
mysqld.exe -install (prints: Service successfully installed.)
net start mysql
-
Navicat Premium reported error 2059; following https://www.cnblogs.com/uncle-kay/p/9751805.html, the MySQL password was changed (the usual cause of 2059 is MySQL 8's caching_sha2_password default, fixed by switching the user's auth plugin back to mysql_native_password).
-
Navicat expired; to register it, download a keygen, take care to select Navicat for MySQL as the product to patch, stay offline during activation, and shut down 360 before running the keygen. See "navicat15 for mysql 激活".
-
How to write the INSERT SQL (a parameterized alternative is sketched after the UPDATE example below):
tumpe = (movie_range,title,infoTmp,rating_num,inq,url)
sql='insert into movie(movie_range,title,info,rating_num,inq,url) values("%s","%s","%s","%s","%s","%s")'%tumpe
# "%s" 不能写成%s,会报错
-
How to write the UPDATE SQL:
sql='''update movie set country="%s",year="%s",
film_Genres="%s",director_actor="%s"
where movie_range = "%s"
'''%tumpe
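Both statements format values straight into the SQL string, which breaks as soon as a title contains a double quote. A safer sketch of the same INSERT using pymysql's own parameter binding (here the bare %s is correct, because pymysql quotes and escapes each value itself):

sql = 'insert into movie(movie_range,title,info,rating_num,inq,url) values(%s,%s,%s,%s,%s,%s)'
cursor.execute(sql, tumpe)  # tumpe as in the script below

The full script: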
#!/usr/bin/python
#coding: UTF-8
from urllib.request import Request, urlopen
from urllib.error import URLError, HTTPError
from bs4 import BeautifulSoup
import pymysql
import re

def getConn():
    # connection settings for the local MySQL instance
    conn = pymysql.connect(
        host='localhost',
        port=3306,
        user='root',
        passwd='password',
        db='douban',
    )
    return conn

def getContent(url):
    req = Request(url)
    # add a browser-style User-Agent header so the request is not rejected
    req.add_header('User-Agent', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.75 Safari/537.36')
    html = None  # stays None if the request fails
    try:
        response = urlopen(req)
        buff = response.read()
        html = buff.decode("utf8")
        response.close()
    except HTTPError as e:
        print('The server couldn\'t fulfill the request.')
        print('Error code: ', e.code)
    except URLError as e:
        print('reason: %s' % e.reason)
    return html

def saveContent(content, url):
    soup = BeautifulSoup(content, "html.parser")
    # print(soup)
    itemlist = soup.find_all("div", class_="item")  # every movie sits in a div.item
    conn = getConn()
    cursor = conn.cursor()
    num = 0
    for i in itemlist:
        num += 1
        print(num)
        movie_range = i.find('em').text           # rank
        url = i.find('a', class_='').get('href')  # link to the detail page
        title = i.find('span', class_='title').text
        bd = i.find('div', class_='bd')
        info = bd.find('p', class_='').text
        infoTmp = info.replace('\xa0', '').replace(' ', '').replace('\n', '')
        rating_num = i.find('span', class_='rating_num').text
        # a few entries have no one-line quote, so guard against None
        inq_tag = i.find('span', class_='inq')
        inq = inq_tag.text if inq_tag else ''
        # -------- insert into the database --------
        tumpe = (movie_range, title, infoTmp, rating_num, inq, url)
        sql = 'insert into movie(movie_range,title,info,rating_num,inq,url) values("%s","%s","%s","%s","%s","%s")' % tumpe
        # the "%s" must be quoted; bare %s raises a SQL syntax error
        cursor.execute(sql)
        conn.commit()
    cursor.close()  # close the cursor
    conn.close()    # close the connection

# --------- for rows already inserted, add year, country, genre etc., keyed on the rank ---------
def saveyear(content, url):
    soup = BeautifulSoup(content, "html.parser")
    itemlist = soup.find_all("div", class_="item")
    conn = getConn()
    cursor = conn.cursor()
    num = 0
    for i in itemlist:
        num += 1
        print(num)
        movie_range = i.find('em').text  # the rank, used as the key
        bd = i.find('div', class_='bd')
        info = bd.find('p', class_='').text
        infoTmp = info.replace('\xa0', '').replace(' ', '').replace('\n', '')
        patt = r'[0-9]+[/,(]'
        # a digit run of any length followed by a single / or ( character
        match = re.findall(patt, infoTmp)
        # e.g. 导演:奥利维·那卡什/艾力克·托兰达Toledano主...2011/法国/剧情喜剧
        director_actor = infoTmp.split(match[0])[0]  # everything before the year
        year = match[0].replace('/', '')             # drop the trailing /
        country_type = infoTmp.split(match[0])[1]    # e.g. 法国/剧情喜剧
        country = country_type.split('/')[0]
        film_Genres = country_type.split('/')[1]
        tumpe = (country, year, film_Genres, director_actor, movie_range)
        # -------- update the database --------
        sql = '''update movie set country="%s",year="%s",
            film_Genres="%s",director_actor="%s"
            where movie_range = "%s"
            ''' % tumpe
        # the "%s" must be quoted; bare %s raises a SQL syntax error
        cursor.execute(sql)
        conn.commit()
    cursor.close()  # close the cursor
    conn.close()    # close the connection

def geturl():
    # the Top 250 spans 10 pages of 25 movies each; start= is the offset
    for i in range(0, 10):
        j = 25 * i
        print(j)
        url = 'https://movie.douban.com/top250?start=%s&filter=' % j
        print(url)
        content = getContent(url)
        # saveContent(content, url)
        saveyear(content, url)

geturl()
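For reference, a hypothetical CREATE TABLE matching the columns the INSERT and UPDATE statements assume; the column names come from the script, the types are guesses:

def createTable():
    conn = getConn()
    cursor = conn.cursor()
    cursor.execute('''
        create table if not exists movie(
            movie_range varchar(8) primary key,  -- the rank, used as the key in saveyear
            title varchar(128),
            info varchar(512),
            rating_num varchar(8),
            inq varchar(256),
            url varchar(256),
            country varchar(64),
            year varchar(16),
            film_Genres varchar(64),
            director_actor varchar(512)
        ) default charset = utf8mb4
    ''')
    conn.commit()
    cursor.close()
    conn.close()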
