Scraping news from the campus news homepage

1. Use the requests and BeautifulSoup libraries to scrape the title, link, and body text of the news items on the campus news homepage.

import requests
from bs4 import BeautifulSoup

url = 'http://news.gzcc.cn/html/xiaoyuanxinwen/'
res = requests.get(url)
res.encoding = 'utf-8'  # decode as UTF-8 so the Chinese text is not garbled
soup = BeautifulSoup(res.text, 'html.parser')

# Only the <li> elements that contain a .news-list-title are news entries
for news in soup.select('li'):
    if len(news.select('.news-list-title')) > 0:
        print(news.select('.news-list-title'))  # debug: show the matched tags
        t = news.select('.news-list-title')[0].text  # title
        dt = news.select('.news-list-info')[0].contents[0].text  # publication date
        a = news.select('a')[0].attrs['href']  # link to the article
        print(dt, t, a)

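The loop above takes each link straight from the href attribute, and the next step passes it to requests.get. That works as long as the hrefs are absolute. If a page returned relative links instead (an assumption; the source does not say either way), they would have to be resolved against the page URL first. A minimal sketch with the standard library's urljoin follows; the helper name absolute_link and the sample hrefs are hypothetical.

```python
from urllib.parse import urljoin

BASE_URL = 'http://news.gzcc.cn/html/xiaoyuanxinwen/'


def absolute_link(href, base=BASE_URL):
    """Resolve href against base; absolute URLs pass through unchanged."""
    return urljoin(base, href)


# Hypothetical hrefs, for illustration only
print(absolute_link('http://news.gzcc.cn/html/2018/example.html'))
print(absolute_link('/html/2018/example.html'))
print(absolute_link('index_2.html'))
```

In the scraping loop this would be used as a = absolute_link(news.select('a')[0].attrs['href']).
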
2. Parse the info string of each article to extract the publication time, author, source, photographer, and similar fields.

from datetime import datetime

for news in soup.select('li'):
    if len(news.select('.news-list-title')) > 0:
        title = news.select('.news-list-title')[0].text
        a = news.select('a')[0].attrs['href']

        # Fetch and parse the article's detail page
        resd = requests.get(a)
        resd.encoding = 'utf-8'
        soupd = BeautifulSoup(resd.text, 'html.parser')
        d = soupd.select('#content')[0].text  # body text
        info = soupd.select('.show-info')[0].text  # metadata line (class show-info)
        print(info)
        dt = info.lstrip('发布时间:')[:19]  # publication time
        dt2 = datetime.strptime(dt, '%Y-%m-%d %H:%M:%S')
        print(dt2)
        i = info.find('来源:')
        if i >= 0:  # find() returns -1 when the label is absent
            s = info[i:].split()[0].replace('来源:', '')  # source
            print(s)
        j = info.find('作者:')  # separate name so the link in `a` is not overwritten
        if j >= 0:
            l = info[j:].split()[0].replace('作者:', '')  # author
            print(l)
        y = info.find('摄影:')
        if y >= 0:
            u = info[y:].split()[0].replace('摄影:', '')  # photographer
            print(u)

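The find/split steps above repeat the same pattern for every label, so they can be folded into one helper. The sketch below is one possible way to do that with regular expressions, not the method used in this exercise. It assumes the info text has the form "label:value" with fields separated by whitespace, as the code above implies. The function name parse_info, the English keys, and the sample string are all hypothetical.

```python
import re
from datetime import datetime

# Labels as they appear in the page text, mapped to English keys (keys are hypothetical)
LABELS = {'作者': 'author', '来源': 'source', '摄影': 'photographer'}


def parse_info(info):
    """Extract the publication time and any optional fields from an info string."""
    result = {}
    # Accept either an ASCII or a full-width colon after the label
    m = re.search(r'发布时间[:：]\s*(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})', info)
    if m:
        result['published'] = datetime.strptime(m.group(1), '%Y-%m-%d %H:%M:%S')
    for label, key in LABELS.items():
        # The value runs up to the next whitespace, like split()[0] in the loop above
        m = re.search(label + r'[:：](\S+)', info)
        if m:
            result[key] = m.group(1)
    return result


# Made-up sample string, for illustration only
sample = '发布时间:2018-04-01 11:57:00  作者:Alice  来源:Office  摄影:Bob'
print(parse_info(sample))
```

Fields missing from the info string are simply left out of the returned dict, so callers can use result.get('author') instead of checking an index.
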
3. Convert the publication time from str to the datetime type.

import requests
from bs4 import BeautifulSoup
from datetime import datetime

gzccurl = 'http://news.gzcc.cn/html/xiaoyuanxinwen/'
res = requests.get(gzccurl)
res.encoding = 'utf-8'
soup = BeautifulSoup(res.text, 'html.parser')

for news in soup.select('li'):
    if len(news.select('.news-list-title')) > 0:
        title = news.select('.news-list-title')[0].text  # title
        url = news.select('a')[0]['href']  # link
        date_str = news.select('.news-list-info')[0].contents[0].text  # date as a string
        dt = datetime.strptime(date_str, '%Y-%m-%d')  # str -> datetime
        source = news.select('.news-list-info')[0].contents[1].text  # source
        print(dt, '\n', title, '\n', url, '\n', source, '\n')
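
The two pages give the time in different shapes: the list page has a date only ('%Y-%m-%d'), while the detail page in step 2 has a full timestamp ('%Y-%m-%d %H:%M:%S'). datetime.strptime raises ValueError when the string does not match the format, so one way to handle both is to try the formats in turn. A minimal sketch follows; the helper name to_datetime is hypothetical.

```python
from datetime import datetime

# Formats seen in steps 2 and 3, most specific first
FORMATS = ('%Y-%m-%d %H:%M:%S', '%Y-%m-%d')


def to_datetime(text):
    """Convert a date string to datetime, trying each known format in order."""
    text = text.strip()
    for fmt in FORMATS:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue  # not this format, try the next one
    raise ValueError('unrecognized date format: ' + repr(text))


print(to_datetime('2018-04-02'))           # 2018-04-02 00:00:00
print(to_datetime('2018-04-02 16:05:00'))  # 2018-04-02 16:05:00
```

A date-only string comes back with the time set to midnight, so values from both pages can be compared or sorted together.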

 

posted @ 2018-04-02 16:05  091梁耀