不是孩子了 - 博客园

December 12, 2021

Summary: from lxml import html etree = html.etree # load the HTML file tree = etree.parse("b.html", etree.HTMLParser()) # ['百度', '谷歌', '搜狗'] # result = tree.xpath("/html [Read more]
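The snippet above loads an HTML file with lxml and queries it with XPath, but it is cut off before the query itself. As a rough sketch of the same idea, the block below uses the standard library's `xml.etree.ElementTree` instead of lxml. This is a swapped-in technique: ElementTree supports only a limited XPath subset and needs well-formed markup, whereas lxml's `xpath()` accepts full XPath 1.0. The inline page and its URLs are hypothetical stand-ins for `b.html`.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for b.html; it must be well-formed for ElementTree.
PAGE = """
<html>
  <body>
    <ul>
      <li><a href="https://www.baidu.com">百度</a></li>
      <li><a href="https://www.google.com">谷歌</a></li>
      <li><a href="https://www.sogou.com">搜狗</a></li>
    </ul>
  </body>
</html>
"""

root = ET.fromstring(PAGE)  # root is the <html> element

# Paths are relative to the root element, so the full-XPath form
# "/html/body/ul/li/a" is written "./body/ul/li/a" here.
link_texts = [a.text for a in root.findall("./body/ul/li/a")]

# Attribute predicate: select the <a> whose href has a given value.
sogou = root.find(".//a[@href='https://www.sogou.com']")

# Attributes are read from the element itself rather than with "/@href".
hrefs = [a.get("href") for a in root.iter("a")]

print(link_texts)
print(sogou.text, hrefs)
```

With lxml, the text and attribute steps would normally be written inside the XPath expression itself (`/text()` and `/@href`), which ElementTree does not support.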

posted @ 2021-12-12 21:03 不是孩子了 Views(43) Comments(0) Recommended(0)

16-Getting Started with XPath Parsing 01

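Summary: # XPath is a language for searching content in XML documents # HTML is tag-based markup structured like XML, so XPath also works on HTML # install the lxml module pip install lxml # parse with XPath from lxml import html etree = html.etree xml = """ <book> <id>1</id> <n [Read more]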
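To make the path syntax concrete, here is a minimal sketch of child, descendant, and wildcard steps. The summary's own XML is truncated, so the document below is hypothetical, and the code uses the standard library's `xml.etree.ElementTree` (a limited XPath subset) rather than lxml so that it runs without extra packages.

```python
import xml.etree.ElementTree as ET

# Hypothetical document; the original post's XML is truncated in the summary.
XML = """
<book>
  <id>1</id>
  <title>Sample Book</title>
  <authors>
    <nick id="a1">Alice</nick>
    <nick id="a2">Bob</nick>
    <group><nick>Carol</nick></group>
  </authors>
</book>
"""

book = ET.fromstring(XML)  # the <book> element

# "/" steps walk to direct children only.
title = book.findtext("./title")
direct = [n.text for n in book.findall("./authors/nick")]

# "//" matches descendants at any depth.
anywhere = [n.text for n in book.findall(".//nick")]

# "*" stands for any single element at that level.
via_wildcard = [n.text for n in book.findall("./authors/*/nick")]

print(title, direct, anywhere, via_wildcard)
```

The same three steps exist in lxml, where they would be passed to `tree.xpath(...)` as absolute paths such as `/book/authors/nick`.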

posted @ 2021-12-12 18:30 不是孩子了 Views(48) Comments(0) Recommended(0)

15-bs4-Hands-On Image Scraping

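Summary: https://m.ivsky.com/ # 1. Fetch the main page's source, then extract the links (href) to the child pages # 2. Use each href to fetch the child page's content and find the image download address there (img -> src) # 3. Download the image import requests from bs4 import BeautifulSo [Read more]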
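The three steps above can be sketched without touching the network. This is an illustration under assumptions, not the post's code: it swaps bs4 for the standard library's `html.parser`, takes the page fetcher as a parameter, and uses hypothetical pages on `example.com`. The names `AttrCollector`, `collect`, and `find_image_urls` are invented for this sketch.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class AttrCollector(HTMLParser):
    """Collect one attribute's value from every occurrence of one tag."""

    def __init__(self, wanted_tag, wanted_attr):
        super().__init__()
        self.wanted_tag = wanted_tag
        self.wanted_attr = wanted_attr
        self.found = []

    def handle_starttag(self, tag, attrs):
        if tag == self.wanted_tag:
            value = dict(attrs).get(self.wanted_attr)
            if value:
                self.found.append(value)


def collect(html_text, tag, attr):
    parser = AttrCollector(tag, attr)
    parser.feed(html_text)
    return parser.found


def find_image_urls(base_url, fetch):
    """Steps 1 and 2: follow each href on the main page, gather img src values."""
    image_urls = []
    for href in collect(fetch(base_url), "a", "href"):
        child_url = urljoin(base_url, href)  # hrefs are often relative
        for src in collect(fetch(child_url), "img", "src"):
            image_urls.append(urljoin(child_url, src))
    return image_urls


# Hypothetical pages standing in for real HTTP responses.
PAGES = {
    "https://example.com/": '<a href="/pic/1.html">one</a><a href="/pic/2.html">two</a>',
    "https://example.com/pic/1.html": '<img src="/img/1.jpg">',
    "https://example.com/pic/2.html": '<img src="//cdn.example.com/img/2.jpg">',
}

urls = find_image_urls("https://example.com/", PAGES.__getitem__)
print(urls)
```

Against a live site, `fetch` could be something like `lambda u: requests.get(u).text`, and step 3 would write `requests.get(url).content` to a file opened in binary mode.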

posted @ 2021-12-12 17:45 不是孩子了 Views(714) Comments(0) Recommended(0)

14-bs4 Basic Usage---Scraping Vegetable Prices

Summary: First install bs4: pip install bs4 from bs4 import BeautifulSoup import requests import csv url = "http://www.maicainan.com/offer/show/id/3242.html" resp = requ [Read more]
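The imports above suggest the usual pattern of reading table rows out of a page and writing them to CSV. Below is a minimal sketch of that pattern, assuming the prices sit in plain `<tr>`/`<td>` cells. The table contents, the column names, and the `TableRows` class are hypothetical, and the standard library's `html.parser` stands in for bs4 so the example runs offline.

```python
import csv
import io
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None       # cells of the row being read, None outside <tr>
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data.strip()

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            if self._row:
                self.rows.append(self._row)
            self._row = None


# Hypothetical price table; the real page layout may differ.
HTML = """
<table>
  <tr><td>Cabbage</td><td>1.20</td><td>kg</td></tr>
  <tr><td>Tomato</td><td>3.50</td><td>kg</td></tr>
</table>
"""

parser = TableRows()
parser.feed(HTML)

# In-memory buffer for the demo; a real run would use
# open("prices.csv", "w", newline="", encoding="utf-8") instead.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["name", "price", "unit"])
writer.writerows(parser.rows)
csv_text = buffer.getvalue()
print(csv_text)
```

Passing `newline=""` when opening a real file matters because `csv.writer` emits its own line endings.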

posted @ 2021-12-12 15:59 不是孩子了 Views(126) Comments(0) Recommended(0)

13-re in Practice---Scraping the 电影天堂 Site---Scraping Child Page Content

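Summary: # Start at the 电影天堂 home page, which has a "2021必看热片" (must-watch hits of 2021) section # Clicking any link there opens another page with a download address near the bottom; that download address is what we want to scrape import requests import re url = "https://dytt89.com/" headers = { "user-agent": [Read more]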
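The approach described above is a two-stage regex extraction: narrow the home page down to one section, pull the child-page links out of it, then search each child page for its download link. Here is a minimal offline sketch of that flow. The markup, the download URL, and the three patterns are hypothetical, since the summary does not show the site's real HTML.

```python
import re
from urllib.parse import urljoin

# Hypothetical markup; the real site's HTML is not shown in the summary.
HOME = """
<h2>2021必看热片</h2>
<ul>
  <li><a href='/i/1001.html'>Movie A</a></li>
  <li><a href='/i/1002.html'>Movie B</a></li>
</ul>
<h2>Other</h2>
<ul><li><a href='/i/9999.html'>Ignored</a></li></ul>
"""
CHILD = '<td><a href="https://example.com/files/movie-a.mkv">download</a></td>'

# Stage 1: cut out only the list that follows the section heading.
# re.S lets "." cross newlines; ".*?" stops at the first closing </ul>.
section_re = re.compile(r"2021必看热片.*?<ul>(?P<ul>.*?)</ul>", re.S)
# Stage 2: pull each href out of that fragment.
href_re = re.compile(r"<a href='(?P<href>.*?)'")
# Child page: capture the download link inside the table cell.
download_re = re.compile(r'<td><a href="(?P<download>.*?)"')

base = "https://dytt89.com/"
section = section_re.search(HOME).group("ul")
child_urls = [urljoin(base, m.group("href")) for m in href_re.finditer(section)]
download = download_re.search(CHILD).group("download")

print(child_urls)
print(download)
```

Restricting stage 2 to the captured fragment is what keeps links from other sections out of the result.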

posted @ 2021-12-12 00:13 不是孩子了 Views(140) Comments(0) Recommended(0)

December 11, 2021

12-re in Practice---Scraping a Novel Site

Summary: We scrape each novel's title, whether it is finished, and the names of the male and female leads import requests import re url = "http://m.pinsuu.com/paihang/nanpindushi/" headers = { "User-Agent": "Mozilla/5.0 (Linux; [Read more]

posted @ 2021-12-11 21:51 不是孩子了 Views(88) Comments(0) Recommended(0)

11-Using Python's re Module (2)

Summary: Put the extracted content into a named group, then retrieve what we want through the group's name. [Read more]
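As a small illustration of named groups, the sketch below captures two fields per match and reads them back by name. The markup and the group names `id` and `name` are hypothetical.

```python
import re

# Hypothetical markup used only to illustrate named groups.
html = """
<div class='item'><span id='1'>Alice</span></div>
<div class='item'><span id='2'>Bob</span></div>
"""

# (?P<name>...) gives a group a name; re.S lets "." also match newlines.
pattern = re.compile(
    r"<div class='item'><span id='(?P<id>\d+)'>(?P<name>.*?)</span></div>",
    re.S,
)

# group("name") returns one named group from each match.
names = [m.group("name") for m in pattern.finditer(html)]

# groupdict() returns every named group of a match as a dict.
records = [m.groupdict() for m in pattern.finditer(html)]

print(names)
print(records)
```

Reading fields by name keeps the extraction code stable if more groups are later added to the pattern, which positional indexes such as `group(2)` do not.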

posted @ 2021-12-11 20:20 不是孩子了 Views(19) Comments(0) Recommended(0)

10-Using Python's re Module (1)

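Summary: import re # The r prefix marks a raw string, so backslashes in the pattern are not treated as Python escapes; it is optional, but it is the convention for regular expressions # findall: finds everything that matches the regular expression; it is not used much, because it builds the whole result list at once, which is inefficient when there are many matches list = re.findall(r"\d+", "My phone number is 10086, your phone number is 10 [Read more]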
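For comparison, here is a short sketch of `re.findall` next to the other common entry points of the `re` module. The sample text is made up for the example.

```python
import re

text = "My number is 10086, yours is 10010"  # hypothetical sample text

# findall returns a list of the matched strings.
all_numbers = re.findall(r"\d+", text)

# finditer yields match objects one at a time, with positions available.
spans = [(m.group(), m.start()) for m in re.finditer(r"\d+", text)]

# search returns the first match found anywhere in the text.
first = re.search(r"\d+", text).group()

# match only succeeds at the very start of the text, so this is None.
at_start = re.match(r"\d+", text)

# The r prefix only changes how Python reads backslashes in the literal.
same_pattern = r"\d+" == "\\d+"

print(all_numbers, first, at_start, same_pattern)
```

`finditer` is usually the better fit for large inputs because it does not build the whole list up front.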

posted @ 2021-12-11 19:18 不是孩子了 Views(37) Comments(0) Recommended(0)

09-Parsing with re---Regular Expressions

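Summary: By default each metacharacter matches exactly one character; for example, one dot matches one character and two dots match two. (1) Greedy matching (.*): first find “玩儿” ("play"), then let . match as many characters as possible, ending at the farthest “游戏” ("game"). (2) Lazy matching: first find “玩儿” and start out matching as much as possible, but because of the ? the match becomes as short as possible instead, so [Read more]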
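The difference between the two modes is easiest to see on one string. The sketch below uses “玩儿” and “游戏” as the anchors, as in the summary; the sample sentence itself is made up.

```python
import re

# Hypothetical sample sentence containing "游戏" three times.
text = "玩儿吃鸡游戏，晚上一起上游戏，干嘛呢？打游戏啊"

# Greedy: ".*" takes as much as it can, so the match ends at the last "游戏".
greedy = re.search(r"玩儿.*游戏", text).group()

# Lazy: ".*?" takes as little as it can, so the match ends at the first "游戏".
lazy = re.search(r"玩儿.*?游戏", text).group()

# Each "." matches exactly one character, so ".." matches exactly two.
two_chars = re.match(r"..", text).group()

print(greedy)
print(lazy)
print(two_chars)
```

Lazy matching is usually what scraping code wants, because it stops at the nearest closing delimiter instead of swallowing everything up to the last one.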

posted @ 2021-12-11 18:56 不是孩子了 Views(36) Comments(0) Recommended(0)

08-Overview of Data Parsing

Summary: ![](https://img2020.cnblogs.com/blog/2506674/202112/2506674-20211211180556697-1107424489.png) [Read more]

posted @ 2021-12-11 18:06 不是孩子了 Views(39) Comments(0) Recommended(0)
