Web Crawler Basics Exercise

import requests
from bs4 import BeautifulSoup

url = "http://localhost:63342/test/test.html?_ijt=shuod2pjvpnr4epahes024vk0l"
res = requests.get(url)
res.encoding = "utf-8"

soup = BeautifulSoup(res.text, 'html.parser')

Extract the text of the h1 tag

print(soup.h1.text)

Extract the link (href) of the a tag

print(soup.a.attrs['href'])
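
The URL above points at a local preview server, so the page is only reachable on the author's machine. To try the same two lookups elsewhere, here is a minimal sketch that parses an inline string instead. `SAMPLE_HTML` is a made-up stand-in, not the real test.html. It also shows `Tag.get`, which returns `None` for a missing attribute, whereas `attrs['href']` would raise `KeyError`.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for test.html; the real page's markup is not shown in the post.
SAMPLE_HTML = """
<html><body>
  <h1>Sample heading</h1>
  <a href="https://example.com/first">first link</a>
  <a>link without href</a>
</body></html>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# soup.h1 and soup.a return the FIRST matching tag in the document.
heading = soup.h1.text
first_href = soup.a.attrs["href"]

# The second <a> has no href; .get() returns None instead of raising KeyError.
missing_href = soup.find_all("a")[1].get("href")

print(heading)
print(first_href)
print(missing_href)
```

If the page has no h1 or a tag at all, `soup.h1` and `soup.a` are themselves `None`, so a check for `None` before reading `.text` avoids an `AttributeError`.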

Extract all li tags together with all of their contents

print(soup.select('li'))
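
`soup.select('li')` returns a list-like collection of Tag objects, so printing it shows the raw markup of every li. If only the text is wanted, one option is to loop over the result. A small sketch, assuming a made-up list (`SAMPLE_HTML` is hypothetical, not the author's page):

```python
from bs4 import BeautifulSoup

# Hypothetical markup used only for illustration.
SAMPLE_HTML = "<ul><li> one </li><li><a href='/two'>two</a></li><li>three</li></ul>"

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

items = soup.select("li")  # every <li> in document order

# get_text(strip=True) drops surrounding whitespace and includes nested tag text.
li_texts = [li.get_text(strip=True) for li in items]

print(len(items))
print(li_texts)
```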

Extract the attributes of the 3rd div tag inside the a tag of the 2nd li tag

print(soup.select('li')[1].a.select('div')[2].attrs)
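
The chained lookup above raises `IndexError` when there are fewer li or div tags than expected, and `AttributeError` when the li has no a tag. A guarded version might look like the sketch below. The helper name `nth_div_attrs` and the sample markup are assumptions made for illustration only.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the real test.html is not shown in the post.
SAMPLE_HTML = """
<ul>
  <li><a href="/one"><div>only one div</div></a></li>
  <li><a href="/two">
    <div class="thumb"></div>
    <div class="title"></div>
    <div class="meta" id="m2"></div>
  </a></li>
</ul>
"""


def nth_div_attrs(soup, li_index, div_index):
    """Return the attrs dict of the div at div_index inside the <a> of the
    li at li_index, or None if any step of the path is missing."""
    items = soup.select("li")
    if li_index >= len(items) or items[li_index].a is None:
        return None
    divs = items[li_index].a.select("div")
    if div_index >= len(divs):
        return None
    return divs[div_index].attrs


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

print(nth_div_attrs(soup, 1, 2))  # 2nd li, 3rd div (indexes start at 0)
print(nth_div_attrs(soup, 0, 2))  # 1st li has only one div
```

Note that `class` comes back as a list, because BeautifulSoup treats it as a multi-valued attribute.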

Extract the title, link, publication time, and source of one news item

print(soup.select('.post-title')[0].text)
print(soup.select('a')[2].attrs['href'])
print(soup.select('.post-list-info')[0].text)
print(soup.select('.post-list-info')[1].text)
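
The four prints above depend on fixed positions in the whole page (for example, the third a tag). One alternative is to scope the lookups to a single news container and collect the fields in a dict. This is only a sketch: the `.post` wrapper, the `parse_news` helper, and the sample markup are assumptions, since the structure of the real page is not shown. Only the `.post-title` and `.post-list-info` class names come from the code above.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for one news item; only the class names
# .post-title and .post-list-info are taken from the original code.
SAMPLE_HTML = """
<div class="post">
  <a href="https://example.com/news/1"><h2 class="post-title">Sample news title</h2></a>
  <span class="post-list-info">2018-03-29</span>
  <span class="post-list-info">Sample source</span>
</div>
"""


def parse_news(post):
    """Collect title, link, time, and source from one news container.
    Fields that cannot be found are set to None."""
    title = post.select_one(".post-title")
    link = post.select_one("a[href]")
    infos = post.select(".post-list-info")
    return {
        "title": title.get_text(strip=True) if title is not None else None,
        "link": link["href"] if link is not None else None,
        "time": infos[0].get_text(strip=True) if len(infos) > 0 else None,
        "source": infos[1].get_text(strip=True) if len(infos) > 1 else None,
    }


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
news = parse_news(soup.select_one(".post"))
print(news)
```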

 

posted @ 2018-03-29 20:09  157-符致伟