Selenium3 + Python3 Automation (27): Scraping the Page Source (page_source)

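Preface

Sometimes an element is hard to locate by its attributes. In that case you can pull the information you need straight out of the page source, which Selenium exposes through the page_source property.

One use for scraping the page source: extract every URL on the page, then request those URLs in bulk to check for errors such as 404.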
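As a rough illustration of that bulk check, the sketch below requests each URL and reports the ones that do not come back with status 200. The helper names `fetch_status` and `find_broken` are hypothetical and not part of Selenium. Using the standard-library urllib for the requests is an assumption made here, not something the original post prescribes.

```python
import urllib.error
import urllib.request


def fetch_status(url, timeout=10):
    """Return the HTTP status code for url, or None if the request fails outright."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses are raised as HTTPError; the code is still useful
        return err.code
    except (urllib.error.URLError, ValueError, OSError):
        # Unsupported scheme, DNS failure, refused connection, timeout, etc.
        return None


def find_broken(urls, fetcher=fetch_status):
    """Return (url, status) pairs for every URL whose status is not 200."""
    broken = []
    for url in urls:
        status = fetcher(url)
        if status != 200:
            broken.append((url, status))
    return broken
```

Passing the fetcher in as a parameter keeps the loop easy to try out with a stand-in function before pointing it at a real site.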

I. page_source

1. Selenium's page_source property returns the page source directly, as a str.
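A minimal sketch of reading the source, assuming selenium is installed and a local Chrome with a matching driver is available. The helper `get_page_source` and the `demo` function are hypothetical names used only for illustration. Only `driver.get`, `driver.page_source` and `driver.quit` come from Selenium itself.

```python
def get_page_source(driver, url):
    """Open url with the given WebDriver and return the page source as a str."""
    driver.get(url)
    return driver.page_source


def demo():
    # Needs selenium plus a local Chrome/chromedriver; imported here so the
    # helper above can be used without them.
    from selenium import webdriver

    driver = webdriver.Chrome()
    try:
        html = get_page_source(driver, "https://www.cnblogs.com/canglongdao")
        print(type(html))   # the type of page_source
        print(html[:200])   # first 200 characters of the source
    finally:
        driver.quit()
```

Call `demo()` to run it against a real browser. Note that page_source is read as an attribute, without parentheses.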

 

 

II. Non-greedy matching with re

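1. Import the re module first.

2. Match with a regular expression in non-greedy mode.

3. The findall method returns a list.

4. Some of the matches turn out not to be URLs, so they need to be filtered afterwards.

findall finds every substring of the string that matches the regular expression and returns them as a list; if nothing matches, it returns an empty list. When the pattern contains a single capture group, as it does here, the list holds the text captured by that group rather than the whole match.

Syntax: re.findall(pattern, string, flags=0)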
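To see why the non-greedy form matters, the short sketch below runs both variants of the pattern on a small HTML string. The sample string is made up for illustration.

```python
import re

# Made-up sample with two links on one line
html = '<a href="https://example.com/a">A</a> <a href="/b">B</a>'

# Greedy: .+ runs on to the LAST double quote in the string
greedy = re.findall('href="(.+)"', html)

# Non-greedy: .+? stops at the FIRST double quote after href="
non_greedy = re.findall('href="(.+?)"', html)

print(greedy)      # ['https://example.com/a">A</a> <a href="/b']
print(non_greedy)  # ['https://example.com/a', '/b']
```

The greedy pattern swallows everything between the first and last quotes, while the non-greedy one returns each href value separately.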

Reference code:

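from selenium import webdriver
import re

driver = webdriver.Chrome()
driver.get("https://www.cnblogs.com/canglongdao")
# page_source is already a str; it is encoded to bytes here only to compare the types
rs = driver.page_source.encode("utf-8")
print(type(rs), type(str(rs)))
# Search the str directly: str(rs) would give the bytes repr (b'...') with escaped characters
# Non-greedy match: capture the text between href=" and the next double quote
aurl = re.findall('href="(.+?)"', driver.page_source)
print(aurl)
driver.quit()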

Output:

<class 'bytes'> <class 'str'>
['//common.cnblogs.com/favicon.ico?v=20200522', '/css/blog-common.min.css?v=7Pwqzj5EBy4dBv4DJNI181rFKP8_OF0hT7jO3o8jAa0', '/skins/book/bundle-book-2.min.css', '/skins/book/bundle-book-mobile.min.css?v=XFoR99E4sMNWcYA_LxWBPY7uXp4-8NCPb1RnsUN1Mwo', 'https://www.cnblogs.com/canglongdao/rss', 'https://www.cnblogs.com/canglongdao/rsd.xml', 'https://www.cnblogs.com/canglongdao/wlwmanifest.xml', 'https://www.cnblogs.com/canglongdao/', 'https://www.cnblogs.com/canglongdao/archive/2020/09/01.html', 'https://www.cnblogs.com/canglongdao/p/13595372.html', 'https://www.cnblogs.com/canglongdao/p/13595372.html', 'https://i.cnblogs.com/EditPosts.aspx?postid=13595372', 'https://www.cnblogs.com/canglongdao/p/13594914.html', 'https://www.cnblogs.com/canglongdao/p/13594914.html', 'https://i.cnblogs.com/EditPosts.aspx?postid=13594914', 'https://www.cnblogs.com/canglongdao/p/13594459.html', 'https://www.cnblogs.com/canglongdao/p/13594459.html', 'https://i.cnblogs.com/EditPosts.aspx?postid=13594459', 'https://www.cnblogs.com/canglongdao/p/13590722.html', 'https://www.cnblogs.com/canglongdao/p/13590722.html', 'https://i.cnblogs.com/EditPosts.aspx?postid=13590722', 'https://www.cnblogs.com/canglongdao/archive/2020/08/31.html', 'https://www.cnblogs.com/canglongdao/p/13590348.html', 'https://www.cnblogs.com/canglongdao/p/13590348.html', 'https://i.cnblogs.com/EditPosts.aspx?postid=13590348', 'https://www.cnblogs.com/canglongdao/p/13589720.html', 'https://www.cnblogs.com/canglongdao/p/13589720.html', 'https://i.cnblogs.com/EditPosts.aspx?postid=13589720', 'https://www.cnblogs.com/canglongdao/p/13587969.html', 'https://www.cnblogs.com/canglongdao/p/13587969.html', 'https://i.cnblogs.com/EditPosts.aspx?postid=13587969', 'https://www.cnblogs.com/canglongdao/archive/2020/08/30.html', 'https://www.cnblogs.com/canglongdao/p/13587061.html', 'https://www.cnblogs.com/canglongdao/p/13587061.html', 'https://i.cnblogs.com/EditPosts.aspx?postid=13587061', 
'https://www.cnblogs.com/canglongdao/p/13586938.html', 'https://www.cnblogs.com/canglongdao/p/13586938.html', 'https://i.cnblogs.com/EditPosts.aspx?postid=13586938', 'https://www.cnblogs.com/canglongdao/p/13585477.html', 'https://www.cnblogs.com/canglongdao/p/13585477.html', 'https://i.cnblogs.com/EditPosts.aspx?postid=13585477', 'https://www.cnblogs.com/canglongdao/default.html?page=2', 'https://www.cnblogs.com/', 'javascript:void(0);', 'javascript:void(0);', 'https://www.cnblogs.com/canglongdao/archive/2020/09/01.html', 'https://www.cnblogs.com/', 'https://www.cnblogs.com/canglongdao/', 'https://i.cnblogs.com/EditPosts.aspx?opt=1', 'https://msg.cnblogs.com/send/%E6%98%9F%E7%A9%BA6', 'javascript:void(0)', 'https://www.cnblogs.com/canglongdao/rss/', 'https://i.cnblogs.com/', 'https://home.cnblogs.com/u/canglongdao/', 'https://home.cnblogs.com/u/canglongdao/', 'https://home.cnblogs.com/u/canglongdao/followers/', 'https://home.cnblogs.com/u/canglongdao/followees/', 'javascript:void(0)', 'https://www.cnblogs.com/canglongdao/p/', 'https://www.cnblogs.com/canglongdao/MyComments.html', 'https://www.cnblogs.com/canglongdao/OtherPosts.html', 'https://www.cnblogs.com/canglongdao/RecentComments.html', 'https://www.cnblogs.com/canglongdao/tag/', 'https://www.cnblogs.com/canglongdao/category/1593317.html', 'https://www.cnblogs.com/canglongdao/category/1694849.html', 'https://www.cnblogs.com/canglongdao/category/1633461.html', 'https://www.cnblogs.com/canglongdao/category/1616592.html', 'https://www.cnblogs.com/canglongdao/category/1609028.html', 'https://www.cnblogs.com/canglongdao/category/1633189.html', 'https://www.cnblogs.com/canglongdao/category/1750002.html', 'https://www.cnblogs.com/canglongdao/category/1566249.html', 'https://www.cnblogs.com/canglongdao/category/1606140.html', 'https://www.cnblogs.com/canglongdao/category/1629226.html', 'https://www.cnblogs.com/canglongdao/category/1588735.html', 'https://www.cnblogs.com/canglongdao/category/1815562.html', 
'https://www.cnblogs.com/canglongdao/category/1588084.html', 'https://www.cnblogs.com/canglongdao/category/1589277.html', 'https://www.cnblogs.com/canglongdao/category/1834572.html', 'https://www.cnblogs.com/canglongdao/category/1611757.html', 'https://www.cnblogs.com/canglongdao/category/1589392.html', 'https://www.cnblogs.com/canglongdao/category/1627263.html', 'https://www.cnblogs.com/canglongdao/category/1619655.html', 'https://www.cnblogs.com/canglongdao/category/1657195.html', 'https://www.cnblogs.com/canglongdao/category/1612257.html', 'https://www.cnblogs.com/canglongdao/category/1769926.html', 'https://www.cnblogs.com/canglongdao/category/1635972.html', 'https://www.cnblogs.com/canglongdao/category/1630667.html', 'https://www.cnblogs.com/canglongdao/archive/2020/09.html', 'https://www.cnblogs.com/canglongdao/archive/2020/08.html', 'https://www.cnblogs.com/canglongdao/archive/2020/07.html', 'https://www.cnblogs.com/canglongdao/archive/2020/06.html', 'https://www.cnblogs.com/canglongdao/archive/2020/05.html', 'https://www.cnblogs.com/canglongdao/archive/2020/04.html', 'https://www.cnblogs.com/canglongdao/archive/2020/03.html', 'https://www.cnblogs.com/canglongdao/archive/2020/02.html', 'https://www.cnblogs.com/canglongdao/archive/2020/01.html', 'https://www.cnblogs.com/canglongdao/archive/2019/12.html', 'https://www.cnblogs.com/canglongdao/archive/2019/11.html', 'https://www.cnblogs.com/canglongdao/archive/2019/10.html', 'https://www.cnblogs.com/canglongdao/p/13380505.html', 'https://www.cnblogs.com/canglongdao/p/12636403.html', 'https://www.cnblogs.com/canglongdao/p/11973931.html', 'https://www.cnblogs.com/canglongdao/p/12013291.html', 'https://www.cnblogs.com/canglongdao/p/12722846.html', 'https://www.cnblogs.com/canglongdao/p/12606952.html', 'https://www.cnblogs.com/canglongdao/p/12019714.html', 'https://www.cnblogs.com/canglongdao/p/12436272.html', 'https://www.cnblogs.com/canglongdao/p/12726642.html', 
'https://www.cnblogs.com/canglongdao/p/11973931.html', 'https://www.cnblogs.com/canglongdao/p/12013291.html', 'https://www.cnblogs.com/canglongdao/p/13380505.html', 'https://www.cnblogs.com/canglongdao/p/12636403.html', 'https://www.cnblogs.com/canglongdao/p/12067902.html', 'https://www.cnblogs.com/canglongdao/p/13380505.html', 'https://www.cnblogs.com/canglongdao/p/12636403.html', 'https://www.cnblogs.com/canglongdao/p/12601894.html', 'https://www.cnblogs.com/canglongdao/p/13414829.html']

III. Filtering out the URLs

1. Add an if statement: if 'http' appears in the string, treat it as a normal URL.

2. Collect all of those URLs into one list, which is the result we want.
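The `'http' in i` test is a simple substring check, so it would also accept a relative link that merely contains "http" somewhere in its query string. As a possible stricter variant, the sketch below checks the URL scheme, drops duplicates, and can optionally resolve relative links against a base URL. The function `filter_urls` and its `base_url` parameter are hypothetical names. This is an alternative to the post's approach, not the author's method.

```python
from urllib.parse import urljoin, urlparse


def filter_urls(hrefs, base_url=None):
    """Keep only http(s) links, in first-seen order, without duplicates.

    If base_url is given, relative links are resolved against it first.
    """
    seen = set()
    result = []
    for href in hrefs:
        if base_url is not None:
            href = urljoin(base_url, href)
        # Skips javascript: links and, when no base_url is given, relative paths
        if urlparse(href).scheme not in ("http", "https"):
            continue
        if href not in seen:
            seen.add(href)
            result.append(href)
    return result
```

For example, `filter_urls(aurl, "https://www.cnblogs.com/canglongdao")` would turn entries such as `/css/blog-common.min.css` into absolute URLs instead of discarding them.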

Reference code:

# coding:utf-8
from selenium import webdriver
import re

driver = webdriver.Chrome()
driver.get("https://www.cnblogs.com/canglongdao")
# page_source is already a str, so it can be searched directly
aurl = re.findall('href="(.+?)"', driver.page_source)
print(aurl)
driver.quit()

# Keep only the entries that contain 'http'
url = []
for i in aurl:
    if 'http' in i:
        url.append(i)
# Final list of URLs
print(len(url), url)

Output:

Posted on 2020-09-01 15:19 by 星空6
