Scraping the Douban Top 250 movies, as practice with Python HTTP requests, storing data in a database, and regular expressions.
1. Identify the tags to scrape
Get the rank: x.find('em')
Get the title
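A minimal sketch of what these two lookups return, run against a trimmed, hypothetical fragment of one Top 250 entry (the real markup has more nesting):

from bs4 import BeautifulSoup

html = '''
<div class="item">
  <em>1</em>
  <a href="https://movie.douban.com/subject/1292052/" class="">
    <span class="title">肖申克的救赎</span>
  </a>
</div>
'''
soup = BeautifulSoup(html, "html.parser")
item = soup.find("div", class_="item")
print(item.find('em').text)                    # rank: 1
print(item.find('span', class_='title').text)  # title: 肖申克的救赎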
Problems encountered:
-
Three years ago this was done with urllib2; it is no longer used that way.
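In Python 3 the same pieces live under urllib.request and urllib.error, which is exactly what the script below imports:

from urllib.request import Request, urlopen   # Python 3 home of urllib2.Request / urllib2.urlopen
from urllib.error import URLError, HTTPError  # the exception classes moved here as well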
-
itemlist = soup.find_all("div",class_="item") finds all tags whose class is item and stores them in a list
-
url = i.find('a',class_='').get('href') gets the content of the a tag's href attribute
-
title = i.find('span',class_='title').text
-
Splitting with a regex: from a string like aa11a123/bbb2(sjd, extract a digit run of any length followed by /, and a digit run of any length followed by (.
The building blocks: [0-9] matches a single digit in the range 0-9; [0-9]+ matches a digit run of any length, where + means one or more repetitions.
[0-9]+/ matches n digits followed by /, e.g. 123/
[0-9]+[/,(] matches n digits followed by / or ( (the comma inside the character class is matched literally as well), e.g. 123/ and 2(
The Python convention is to write the pattern as a raw string: r'pattern'
patt = r'[0-9]+[/,(]'
#a digit run of any length followed by a single / or ( character
match = re.findall(patt,infoTmp)  # infoTmp is the string being processed
(The r'' raw-string notation is a Python convention; it is not written this way in HTML.)
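A quick check of the pattern against the example string from above (note again that the comma in [/,(] also matches a literal comma, so [/(] would be a slightly tighter class):

import re

patt = r'[0-9]+[/,(]'             # one or more digits followed by /, , or (
infoTmp = 'aa11a123/bbb2(sjd'     # the example string from above
print(re.findall(patt, infoTmp))  # prints ['123/', '2(']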
-
The first page attempted during crawling had the information inside its li elements loaded dynamically, so it cannot be fetched with the current approach.
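One common workaround for JavaScript-rendered pages (a sketch of an assumption, not something these notes actually used) is to let a real browser render the page via Selenium and feed the resulting HTML to the same parser:

from selenium import webdriver

driver = webdriver.Chrome()             # needs chromedriver available on PATH
driver.get('https://example.com/page')  # placeholder URL for a dynamically loaded page
html = driver.page_source               # the HTML after JavaScript has run
driver.quit()
# html can now go through BeautifulSoup exactly like the static pages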
-
phpstudy is installed on the machine and bundles its own MySQL, which conflicts with the locally installed MySQL. Fix: open cmd as administrator in the bin folder of the local MySQL install directory and register, then start, the service:
cd C:\D\mysql-8.0.19-winx64\bin
mysqld.exe -install (prints: Service successfully installed.)
net start mysql
-
Navicat Premium reported error 2059; following https://www.cnblogs.com/uncle-kay/p/9751805.html, the MySQL password was changed (the usual cause of 2059 is MySQL 8's caching_sha2_password default, fixed by switching the user's auth plugin back to mysql_native_password).
-
Navicat expired; to register it, download a keygen, take care to select Navicat for MySQL as the product to patch, stay offline during activation, and shut down 360 before running the keygen. See "navicat15 for mysql 激活".
-
How to write the INSERT SQL (a parameterized alternative is sketched after the UPDATE example below):
tumpe = (movie_range,title,infoTmp,rating_num,inq,url)
sql='insert into movie(movie_range,title,info,rating_num,inq,url) values("%s","%s","%s","%s","%s","%s")'%tumpe
# "%s" 不能写成%s,会报错
-
How to write the UPDATE SQL:
sql='''update movie set country="%s",year="%s",
film_Genres="%s",director_actor="%s"
where movie_range = "%s"
'''%tumpe
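Both statements format values straight into the SQL string, which breaks as soon as a title contains a double quote. A safer sketch of the same INSERT using pymysql's own parameter binding (here the bare %s is correct, because pymysql quotes and escapes each value itself):

sql = 'insert into movie(movie_range,title,info,rating_num,inq,url) values(%s,%s,%s,%s,%s,%s)'
cursor.execute(sql, tumpe)  # tumpe as in the script below

The full script: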
#!/usr/bin/python
#coding: UTF-8
from urllib.request import Request, urlopen
from urllib.error import URLError, HTTPError
from bs4 import BeautifulSoup
import pymysql
import re

def getConn():
    # connection settings for the local MySQL instance
    conn = pymysql.connect(
        host='localhost',
        port=3306,
        user='root',
        passwd='password',
        db='douban',
    )
    return conn

def getContent(url):
    req = Request(url)
    # add a browser-style User-Agent header so the request is not rejected
    req.add_header('User-Agent', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.75 Safari/537.36')
    html = None  # stays None if the request fails
    try:
        response = urlopen(req)
        buff = response.read()
        html = buff.decode("utf8")
        response.close()
    except HTTPError as e:
        print('The server couldn\'t fulfill the request.')
        print('Error code: ', e.code)
    except URLError as e:
        print('reason: %s' % e.reason)
    return html

def saveContent(content, url):
    soup = BeautifulSoup(content, "html.parser")
    # print(soup)
    itemlist = soup.find_all("div", class_="item")  # every movie sits in a div.item
    conn = getConn()
    cursor = conn.cursor()
    num = 0
    for i in itemlist:
        num += 1
        print(num)
        movie_range = i.find('em').text           # rank
        url = i.find('a', class_='').get('href')  # link to the detail page
        title = i.find('span', class_='title').text
        bd = i.find('div', class_='bd')
        info = bd.find('p', class_='').text
        infoTmp = info.replace('\xa0', '').replace(' ', '').replace('\n', '')
        rating_num = i.find('span', class_='rating_num').text
        # a few entries have no one-line quote, so guard against None
        inq_tag = i.find('span', class_='inq')
        inq = inq_tag.text if inq_tag else ''
        # -------- insert into the database --------
        tumpe = (movie_range, title, infoTmp, rating_num, inq, url)
        sql = 'insert into movie(movie_range,title,info,rating_num,inq,url) values("%s","%s","%s","%s","%s","%s")' % tumpe
        # the "%s" must be quoted; bare %s raises a SQL syntax error
        cursor.execute(sql)
        conn.commit()
    cursor.close()  # close the cursor
    conn.close()    # close the connection

# --------- for rows already inserted, add year, country, genre etc., keyed on the rank ---------
def saveyear(content, url):
    soup = BeautifulSoup(content, "html.parser")
    itemlist = soup.find_all("div", class_="item")
    conn = getConn()
    cursor = conn.cursor()
    num = 0
    for i in itemlist:
        num += 1
        print(num)
        movie_range = i.find('em').text  # the rank, used as the key
        bd = i.find('div', class_='bd')
        info = bd.find('p', class_='').text
        infoTmp = info.replace('\xa0', '').replace(' ', '').replace('\n', '')
        patt = r'[0-9]+[/,(]'
        # a digit run of any length followed by a single / or ( character
        match = re.findall(patt, infoTmp)
        # e.g. 导演:奥利维·那卡什/艾力克·托兰达Toledano主...2011/法国/剧情喜剧
        director_actor = infoTmp.split(match[0])[0]  # everything before the year
        year = match[0].replace('/', '')             # drop the trailing /
        country_type = infoTmp.split(match[0])[1]    # e.g. 法国/剧情喜剧
        country = country_type.split('/')[0]
        film_Genres = country_type.split('/')[1]
        tumpe = (country, year, film_Genres, director_actor, movie_range)
        # -------- update the database --------
        sql = '''update movie set country="%s",year="%s",
            film_Genres="%s",director_actor="%s"
            where movie_range = "%s"
            ''' % tumpe
        # the "%s" must be quoted; bare %s raises a SQL syntax error
        cursor.execute(sql)
        conn.commit()
    cursor.close()  # close the cursor
    conn.close()    # close the connection

def geturl():
    # the Top 250 spans 10 pages of 25 movies each; start= is the offset
    for i in range(0, 10):
        j = 25 * i
        print(j)
        url = 'https://movie.douban.com/top250?start=%s&filter=' % j
        print(url)
        content = getContent(url)
        # saveContent(content, url)
        saveyear(content, url)

geturl()
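For reference, a hypothetical CREATE TABLE matching the columns the INSERT and UPDATE statements assume; the column names come from the script, the types are guesses:

def createTable():
    conn = getConn()
    cursor = conn.cursor()
    cursor.execute('''
        create table if not exists movie(
            movie_range varchar(8) primary key,  -- the rank, used as the key in saveyear
            title varchar(128),
            info varchar(512),
            rating_num varchar(8),
            inq varchar(256),
            url varchar(256),
            country varchar(64),
            year varchar(16),
            film_Genres varchar(64),
            director_actor varchar(512)
        ) default charset = utf8mb4
    ''')
    conn.commit()
    cursor.close()
    conn.close()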
