七月在线爬虫班学习笔记(三)——爬虫基础知识与简易爬虫实现
第三课的主要内容有:

















css例子
以下四个html页面在浏览器中打开即可看到效果。
css_background_color.html:
<html>
<head>
<style type="text/css">
body {background-color: yellow}
h1 {background-color: #00ff00}
h2 {background-color: transparent}
p {background-color: rgb(250,0,255)}
p.no2 {background-color: gray; padding: 20px;}
</style>
</head>
<body>
<h1>这是标题 1</h1>
<h2>这是标题 2</h2>
<p>这是段落</p>
<p class="no2">这个段落设置了内边距。</p>
</body>
</html>
css_board_color.html:
<html>
<head>
<style type="text/css">
p.one
{
border-style: solid;
border-color: #0000ff
}
p.two
{
border-style: solid;
border-color: #ff0000 #0000ff
}
p.three
{
border-style: solid;
border-color: #ff0000 #00ff00 #0000ff
}
p.four
{
border-style: solid;
border-color: #ff0000 #00ff00 #0000ff rgb(250,0,255)
}
</style>
</head>
<body>
<p class="one">One-colored border!</p>
<p class="two">Two-colored border!</p>
<p class="three">Three-colored border!</p>
<p class="four">Four-colored border!</p>
<p><b>注释:</b>"border-width" 属性如果单独使用的话是不会起作用的。请首先使用 "border-style" 属性来设置边框。</p>
</body>
</html>
css_font_family.html:
<html>
<head>
<style type="text/css">
p.serif{font-family:"Times New Roman",Georgia,Serif}
p.sansserif{font-family:Arial,Verdana,Sans-serif}
</style>
</head>
<body>
<h1>CSS font-family</h1>
<p class="serif">This is a paragraph, shown in the Times New Roman font.</p>
<p class="sansserif">This is a paragraph, shown in the Arial font.</p>
</body>
</html>
css_text_decoration.html:
<html>
<head>
<style type="text/css">
h1 {text-decoration: overline}
h2 {text-decoration: line-through}
h3 {text-decoration: underline}
h4 {text-decoration:blink}
a {text-decoration: none}
</style>
</head>
<body>
<h1>这是标题 1</h1>
<h2>这是标题 2</h2>
<h3>这是标题 3</h3>
<h4>这是标题 4</h4>
<p><a href="http://www.w3school.com.cn/index.html">这是一个链接</a></p>
</body>
</html>
解析xml,下面是课程中使用到的book.xml:
<?xml version="1.0" encoding="ISO-8859-1"?>
<bookstore>
<book>
<title lang="eng">Harry Potter</title>
<price>29.99</price>
</book>
<book>
<title lang="eng">Learning XML</title>
<price>39.95</price>
</book>
</bookstore>
Python处理XML方法之DOM:
![]()
from xml.dom import minidom
doc = minidom.parse('book.xml')
root = doc.documentElement
# print(dir(root))
print(root.nodeName)
books = root.getElementsByTagName('book')
print(type(books))
for book in books:
titles = book.getElementsByTagName('title')
print(titles[0].childNodes[0].nodeValue)
#results
bookstore
<class 'xml.dom.minicompat.NodeList'>
Harry Potter
Learning XML
Python处理XML方法之SAX:
![]()
1 import string 2 from xml.parsers.expat import ParserCreate 3 4 class DefaultSaxHandler(object): 5 def start_element(self, name, attrs): 6 self.element = name 7 print('element: %s, attrs: %s' % (name, str(attrs))) 8 9 def end_element(self, name): 10 print('end element: %s' % name) 11 12 def char_data(self, text): 13 if text.strip(): 14 print("%s's text is %s" % (self.element, text)) 15 16 handler = DefaultSaxHandler() 17 parser = ParserCreate() 18 parser.StartElementHandler = handler.start_element 19 parser.EndElementHandler = handler.end_element 20 parser.CharacterDataHandler = handler.char_data 21 with open('book.xml', 'r') as f: 22 parser.Parse(f.read())
1 element: bookstore, attrs: {} 2 element: book, attrs: {} 3 element: title, attrs: {'lang': 'eng'} 4 title's text is Harry Potter 5 end element: title 6 element: price, attrs: {} 7 price's text is 29.99 8 end element: price 9 end element: book 10 element: book, attrs: {} 11 element: title, attrs: {'lang': 'eng'}
1 010-12345 2 0 9 3 分组 4 ('010', '12345') 5 010-12345 6 010 7 12345 8 分割 9 <class '_sre.SRE_Pattern'> 10 ['one', 'two', 'three', 'four', ''] 11 ('20', '15', '45')
12 title's text is Learning XML 13 end element: title 14 element: price, attrs: {} 15 price's text is 39.95 16 end element: price 17 end element: book 18 end element: bookstore




实例:
1 import re 2 3 m = re.match(r'\d{3}\-\d{3,8}', '010-12345') 4 # print(dir(m)) 5 print(m.string) 6 print(m.pos, m.endpos) 7 8 # 分组 9 print('分组') 10 m = re.match(r'^(\d{3})-(\d{3,8})$', '010-12345') 11 print(m.groups()) 12 print(m.group(0)) 13 print(m.group(1)) 14 print(m.group(2)) 15 16 # 分割 17 print('分割') 18 p = re.compile(r'\d+') 19 print(type(p)) 20 print(p.split('one1two3three3four4')) 21 22 t = '20:15:45' 23 m = re.match(r'^(0[0-9]|1[0-9]|2[0-3]|[0-9])\:(0[0-9]|1[0-9]|2[0-9]|3[0-9]|4[0-9]|5[0-9]|[0-9])\:(0[0-9]|1[0-9]|2[0-9]|3[0-9]|4[0-9]|5[0-9]|[0-9])$', t) 24 print(m.groups())
输出结果:
010-12345
0 9
分组
('010', '12345')
010-12345
010
12345
分割
<class '_sre.SRE_Pattern'>
['one', 'two', 'three', 'four', '']
('20', '15', '45')
电商网站数据爬取
selenium直接pip安装即可。pip install selenium
windows上需要使用使用浏览器的驱动,我使用的chrome浏览器,和课程中的一样。驱动是chromedriver。
我这里提供一个下载地址:http://docs.seleniumhq.org/download/
我的驱动是放在tools这个文件夹里的。
下载好驱动后,需要将这个驱动添加到系统属性变量中才行,不然会出错。

准备工作已经完成了。下面我们开始爬取17huo.com这个网站.我们要爬取大衣这个分类里的每个商品的标题、价格。课程的时间已经过去很久,
网站已经改版,我对课程中的代码自己进行了改动,实测可用,成功爬取前三页的信息。0、1、2共三页。
1 from selenium import webdriver 2 import time 3 4 browser = webdriver.Chrome() 5 browser.set_page_load_timeout(50) 6 browser.get('http://www.17huo.com/newsearch/?k=%E5%A4%A7%E8%A1%A3') 7 page_info = browser.find_element_by_css_selector('body > div.wrap > div.search_container > div.pagem.product_list_pager > div') 8 # print(page_info.text) 9 # 共 40 页,每页 60 条 10 pages = int((page_info.text.split(',')[0]).split(' ')[1]) 11 # print(pages) 12 for page in range(pages): 13 if page > 2: 14 break 15 url = 'http://www.17huo.com/newsearch/?k=大衣&page=' + str(page + 1) 16 browser.get(url) 17 browser.execute_script("window.scrollTo(0, document.body.scrollHeight);") 18 time.sleep(5) # 不然会load不完整 19 goods = browser.find_element_by_css_selector( 20 '.book-item-list').find_elements_by_tag_name('a') 21 print('%d页有%d件商品' % ((page + 1), len(goods))) 22 for good in goods: 23 try: 24 title = good.find_element_by_css_selector('a:nth-child(1) > p:nth-child(2)').text 25 #a:nth - child(2) > div:nth - child(3) > div:nth - child(2) 26 price = good.find_element_by_css_selector('span:nth - child(1)').text 27 #span:nth - child(1) 28 print(title, price) 29 except: 30 print(good.text)
部分结果:
1 1页有180件商品 2 3 ¥ 155.00 4 黄格子大衣 5 黄格子大衣 6 7 ¥ 350.00 8 中老年妈妈冬季仿貂绒大衣连帽女装宽松外套羊剪绒上衣 9 KXLCMML1308 10 11 ¥ 350.00 12 中老年女装冬新款羊剪绒加厚仿皮草宽松外套妈妈装大衣 13 KXLCMML1307
情不知所起一往而深
浙公网安备 33010602011771号