Download and Installation

https://pypi.python.org/pypi/lxml/3.4.2#downloads
pip install lxml
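
A quick way to confirm that the install worked is to import the package, print its version, and parse a tiny document. This is a minimal sketch, assuming the pip command above succeeded in the active environment; the sample XML string is made up for illustration.

```python
from lxml import etree

# LXML_VERSION is a tuple of version numbers
print("lxml version:", etree.LXML_VERSION)

# Parse a tiny (hypothetical) document to confirm the parser itself works
root = etree.fromstring("<a><b>ok</b></a>")
print(root.findtext("b"))
```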

Basic Syntax

2.1
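    Expression  Description
    nodename    Selects all child elements named nodename of the current node.
    /           Selects from the root node.
    //          Selects matching nodes anywhere below the current node, regardless of their position in the document.
    .           Selects the current node.
    ..          Selects the parent of the current node.
    @           Selects attributes.
    a/text()    Selects the text nodes directly under the a element.
    string(.)   Returns all the text under the current node, concatenated into one string.
    string(..)  Returns all the text under the parent node, concatenated into one string.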
    
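[Example 1]
    bookstore       Selects all bookstore elements that are children of the current node.
    /bookstore      Selects the root element bookstore.
        Note: a path that starts with a slash ( / ) is always an absolute path to an element.
    bookstore/book  Selects all book elements that are children of bookstore.
    //book          Selects all book elements, wherever they are in the document.
    bookstore//book Selects all book elements that are descendants of bookstore, however deep below bookstore they are.
    //@lang         Selects all attributes named lang.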
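[Example 2]
    /bookstore/book[1]                  Selects the first book element that is a child of bookstore.
    /bookstore/book[last()]             Selects the last book element that is a child of bookstore.
    /bookstore/book[last()-1]           Selects the second-to-last book element that is a child of bookstore.
    /bookstore/book[position()<3]       Selects the first two book elements that are children of bookstore.
    //title[@lang]                      Selects all title elements that have an attribute named lang.
    //title[@lang='eng']                Selects all title elements whose lang attribute has the value eng.
    /bookstore/book[price>35.00]        Selects all book elements of bookstore whose price element has a value greater than 35.00.
    /bookstore/book[price>35.00]/title  Selects the title elements of all book elements of bookstore whose price element has a value greater than 35.00.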
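
The expressions above can be tried against a small document. The following is a minimal sketch; the bookstore XML is hypothetical sample data written for this illustration, and the variable names are arbitrary.

```python
from lxml import etree

# Hypothetical sample data, modeled on the bookstore used in the expressions above
xml = """
<bookstore>
  <book><title lang="eng">Harry Potter</title><price>29.99</price></book>
  <book><title lang="eng">Learning XML</title><price>39.95</price></book>
  <book><title>XQuery Kick Start</title><price>49.99</price></book>
</bookstore>
"""
root = etree.fromstring(xml)

# Positional predicates
first = root.xpath("/bookstore/book[1]/title/text()")
last = root.xpath("/bookstore/book[last()]/title/text()")
first_two = root.xpath("/bookstore/book[position()<3]/title/text()")

# Attribute and value predicates
with_lang = root.xpath("//title[@lang]/text()")
expensive = root.xpath("/bookstore/book[price>35.00]/title/text()")
langs = root.xpath("//@lang")

# string() concatenates every text node under the selected node
all_text = root.xpath("string(/bookstore/book[1])")

print(first, last, first_two)
print(with_lang, expensive, langs)
print(all_text)
```

Note that xpath() returns a list for node-set expressions and a single string for string().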
    
    

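2.2 XPath Axes
    ancestor            Selects all ancestors (parent, grandparent, etc.) of the current node.
    ancestor-or-self    Selects all ancestors (parent, grandparent, etc.) of the current node and the current node itself.
    attribute           Selects all attributes of the current node.
    child               Selects all children of the current node.
    descendant          Selects all descendants (children, grandchildren, etc.) of the current node.
    descendant-or-self  Selects all descendants (children, grandchildren, etc.) of the current node and the current node itself.
    following           Selects everything in the document after the closing tag of the current node.
    following-sibling   Selects all siblings after the current node.
    namespace           Selects all namespace nodes of the current node.
    parent              Selects the parent of the current node.
    preceding           Selects everything in the document before the start tag of the current node, excluding its ancestors.
    preceding-sibling   Selects all siblings before the current node.
    self                Selects the current node.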
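    [Examples]
    child::book             Selects all book nodes that are children of the current node.
    attribute::lang         Selects the lang attribute of the current node.
    child::*                Selects all child elements of the current node.
    attribute::*            Selects all attributes of the current node.
    child::text()           Selects all text node children of the current node.
    child::node()           Selects all children of the current node.
    descendant::book        Selects all book descendants of the current node.
    ancestor::book          Selects all book ancestors of the current node.
    ancestor-or-self::book  Selects all book ancestors of the current node, and the current node as well if it is a book node.
    child::*/child::price   Selects all price grandchildren of the current node.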
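
A short sketch of several axes in use, evaluated relative to one element. The XML and the id values are invented for this illustration.

```python
from lxml import etree

# Hypothetical sample data
root = etree.fromstring(
    "<bookstore>"
    "<book id='b1'><title>A</title><price>10</price></book>"
    "<book id='b2'><title>B</title><price>20</price></book>"
    "<book id='b3'><title>C</title><price>30</price></book>"
    "</bookstore>"
)

# Use the second book as the context node for the relative expressions below
book2 = root.xpath("/bookstore/book[2]")[0]

children = [e.tag for e in book2.xpath("child::*")]
attrs = book2.xpath("attribute::*")
ancestors = [e.tag for e in book2.xpath("ancestor::*")]
self_or_ancestors = {e.tag for e in book2.xpath("ancestor-or-self::*")}
before = book2.xpath("preceding-sibling::book/@id")
after = book2.xpath("following-sibling::book/@id")

# From the root element, price elements are grandchildren
grandchild_prices = root.xpath("child::*/child::price/text()")

print(children, attrs, ancestors)
print(before, after, grandchild_prices)
```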
    
 
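 2.3 Functions
starts-with     //div[starts-with(@id,"ma")]    Selects div nodes whose id starts with ma.
contains        //div[contains(@id,"ma")]       Selects all div nodes whose id contains ma.
and             //div[contains(@id,"ma") and contains(@id,"in")]    Selects div nodes whose id contains both ma and in.
text()          //div[contains(text(),"ma")]    Selects div nodes whose text contains ma.
                //*[@id='app']/descendant::div[@class='stock-name']/text()    Selects the text of the div nodes with class stock-name below the node whose id is app.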
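
The functions above can be compared side by side on a few div elements. This is a minimal sketch; the ids and texts are made up so that each expression matches a different subset.

```python
from lxml import etree

# Hypothetical sample markup
html = etree.HTML("""
<div id="main">first</div>
<div id="map">second</div>
<div id="format">third</div>
<div id="footer">the mat</div>
""")

# id starts with "ma"
starts = html.xpath('//div[starts-with(@id, "ma")]/@id')

# id contains "ma" anywhere
has_ma = html.xpath('//div[contains(@id, "ma")]/@id')

# id contains both "ma" and "in"
both = html.xpath('//div[contains(@id, "ma") and contains(@id, "in")]/@id')

# the text of the div (not its id) contains "ma"
text_ma = html.xpath('//div[contains(text(), "ma")]/@id')

print(starts, has_ma, both, text_ma)
```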

CSS Selector Syntax

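Syntax                      Description
*                           Selects all nodes.
#container                  Selects the node whose id is container.
.container                  Selects all nodes whose class includes container.
div,p                       Selects all div elements and all p elements.
li a                        Selects all a nodes anywhere inside an li.
ul + p                      Selects the p element that immediately follows a ul.
div#container > ul          Selects the ul elements that are direct children of the div whose id is container.
ul ~ p                      Selects all p siblings that come after a ul.
a[title]                    Selects all a elements that have a title attribute.
a[href="http://baidu.com"]  Selects all a elements whose href attribute is http://baidu.com.
a[href*="baidu"]            Selects all a elements whose href value contains baidu.
a[href^="http"]             Selects all a elements whose href value starts with http.
a[href$=".jpg"]             Selects all a elements whose href value ends with .jpg.
input[type=radio]:checked   Selects the radio inputs that are checked.
div:not(#container)         Selects all div elements whose id is not container.
li:nth-child(3)             Selects every li that is the third child of its parent.
li:nth-child(2n)            Selects every li at an even position among its siblings.
a::attr(href)               Selects the href attribute of a elements (Scrapy extension, not standard CSS).
a::text                     Selects the text under a elements (Scrapy extension, not standard CSS).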
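
Several of the CSS selectors above have direct XPath equivalents, which is useful with plain lxml because its cssselect() method depends on the separate cssselect package. The sketch below pairs each selector with one possible XPath translation; the markup is hypothetical, and the translations are illustrations rather than the exact output of any CSS-to-XPath converter.

```python
from lxml import etree

# Hypothetical sample markup
html = etree.HTML("""
<div id="container">
  <ul><li><a href="http://baidu.com" title="home">Baidu</a></li>
      <li><a href="/img/logo.jpg">Logo</a></li></ul>
  <p>first</p>
  <p>second</p>
</div>
""")

# CSS: div#container > ul   (direct children only)
direct_ul = html.xpath('//div[@id="container"]/ul')

# CSS: ul + p   (the element right after ul, if it is a p)
adjacent_p = html.xpath('//ul/following-sibling::*[1][self::p]/text()')

# CSS: ul ~ p   (every later p sibling)
sibling_p = html.xpath('//ul/following-sibling::p/text()')

# CSS: a[title]
titled = html.xpath('//a[@title]/@title')

# CSS: a[href^="http"]
http_links = html.xpath('//a[starts-with(@href, "http")]/@href')

# CSS: a[href$=".jpg"]   (XPath 1.0 has no ends-with(), so compare the last 4 characters)
jpg_links = html.xpath(
    '//a[substring(@href, string-length(@href) - 3) = ".jpg"]/@href'
)

# Scrapy's a::attr(href) and a::text correspond to //a/@href and //a/text()
texts = html.xpath('//a/text()')

print(len(direct_ul), adjacent_p, sibling_p)
print(titled, http_links, jpg_links, texts)
```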
  
from lxml import etree  # load the module

# html is assumed to hold the page source as a string
html_data = etree.HTML(html)
print(etree.tostring(html_data))
content = html_data.xpath("/html/head/title/text()")

# Scrapy: response is the Response object passed to a spider callback
response.xpath("//tr[not(@class)]")
response.xpath("//li[@class=' left pic_logo']/img/@src").extract_first("")
response.xpath('//li[@class="hp-dropDownMenu"][position()<=4]/a')
response.xpath('//table[@class="players_table bott"]//tr[not(@class)]')
response.xpath("//a[@class='noactive' and @id='next']")  # match on several attributes

html_data.xpath('//li/a[@href]/text()')  # text of the a elements that have an href attribute
html_data.xpath('//li/a/@href')          # values of the href attribute of the a elements
html_data.xpath('.//a[@class="huxing"]/span[not(@class="building-area")]/text()')

for d in html_data.xpath("/html/head/title"):
    print(d.text)

# Attributes and methods available on each element (such as d above):
# ['addnext', 'addprevious', 'append', 'attrib', 'base', 'clear', 'cssselect', 'extend', 'find', 'findall', 'findtext', 'get', 'getchildren', 'getiterator',
# 'getnext', 'getparent', 'getprevious', 'getroottree', 'index', 'insert', 'items', 'iter', 'iterancestors', 'iterchildren', 'iterdescendants', 'iterfind', 'itersiblings',
# 'itertext', 'keys', 'makeelement', 'nsmap', 'prefix', 'remove', 'replace', 'set', 'sourceline', 'tag', 'tail', 'text', 'values', 'xpath']

node = html_data.xpath("/html/head/title")
content = html_data.xpath("body/div/div[@id='id2']/ul/li[1]/div[2]/a/@href")
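
A few of the element attributes and methods listed above, shown on a small page. This is a sketch with hypothetical markup; it covers only tag, text, tail, attrib, get, find, getparent and itertext.

```python
from lxml import etree

# Hypothetical sample markup
html_data = etree.HTML(
    "<html><head><title>Demo</title></head>"
    "<body><p class='x'>Hello <b>world</b>!</p></body></html>"
)

# xpath() returns a list of elements, so take the first match
title = html_data.xpath("/html/head/title")[0]
print(title.tag, title.text)

p = html_data.xpath("//p")[0]
print(p.attrib)            # all attributes, dict-like
print(p.get("class"))      # one attribute value
print(repr(p.text))        # text before the first child element

b = p.find("b")            # first child element named b
print(repr(b.tail))        # text that follows the closing </b> tag
print(b.getparent().tag)   # parent element

full_text = "".join(p.itertext())  # all text under p, in document order
print(full_text)
```

The text and tail attributes only hold the text immediately around an element, which is why itertext() (or string(.) in XPath) is the easier way to get everything.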

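Get text                        /html/title/text()
Get an attribute                /html/link/@href
Get a list of elements          /html/a
Current node                    ./
Parent node                     ../
First item in the list          /html/a[1]
Last item in the list           /html/a[last()]
First two items                 /html/a[position()<3]
Union (either expression)       /html/a[1]|/html/a[3]    //a[1]|//a[3]
a at any depth below html       /html//a
Match a fixed id or class       /html/a[@id="id"]    /html/a[@class="class"]
Text of the element and all its descendants    /html/a//text()
Filter by text                  //a[text() = '下一页']    (下一页 is "next page")
Source: https://blog.csdn.net/qq_40942329/article/details/79755339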
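
The difference between /text() and //text(), along with the union and text-filter patterns from the list above, can be checked with a short sketch. The markup is hypothetical, and //a is used instead of /html/a because a real page has a body element between html and a.

```python
from lxml import etree

# Hypothetical sample markup
page = etree.HTML(
    "<html><body>"
    "<a id='prev' href='/p1'>Previous</a>"
    "<a id='next' href='/p3'><span>Next</span> page</a>"
    "<a href='/about'>About</a>"
    "</body></html>"
)

# /text() returns only the text nodes that are direct children of the element
direct = page.xpath("//a[@id='next']/text()")

# //text() also returns text nested inside child elements
deep = page.xpath("//a[@id='next']//text()")

# | combines the results of two expressions
first_or_third = page.xpath("//a[1]/@href | //a[3]/@href")

# Filter on the exact text of the element
by_text = page.xpath("//a[text() = 'About']/@href")

# last() picks the final match among siblings
last_href = page.xpath("//a[last()]/@href")

print(direct, deep)
print(first_or_third, by_text, last_href)
```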

Example

html='''<div id="content">
   <ul id="useful">
      <li>Useful info 1</li>
      <li>Useful info 2</li>
      <li>Useful info 3</li>
   </ul>
   <ul id="useless">
      <li>Useless info 1</li>
      <li>Useless info 2</li>
      <li>Useless info 3</li>
   </ul>
</div>
<div id="url">
   <a href="http://cighao.com">Chen Hao's blog</a>
   <a href="http://cighao.com.photo" title="Chen Hao's photo album">Click to open</a>
</div>'''

from lxml import etree
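# Assumes the html variable already exists and holds the source above
selector = etree.HTML(html)
# Extract useful info 1, 2 and 3 from the li elements
content = selector.xpath('//ul[@id="useful"]/li/text()')
for each in content:
    print(each)
# Extract the href attribute of the a elements
link = selector.xpath('//a/@href')
for each in link:
    print(each)

# Extract the title attribute of the a elements
title = selector.xpath('//a/@title')
for each in title:
    print(each)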
    
<2>
wb_data = b"""
        <html><body><div>
        <style>
        a{
        color:#456;
        }
        </style>
           <ul>
         <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html">third item</a></li>
         <li class="item-1"><a href="link4.html">fourth item</a></li>
         <li class="item-0 a"><a href="link5.html">fifth item</a>
         <span class="ah" style="background: rgba(0, 0, 0, 0) url('/images/upload/advertisement/1/1480322326.jpg') no-repeat scroll center center;">hello</span>
             </li></ul> 
         </div>
        </body></html>
        """
#print(type(wb_data))
html = etree.HTML(wb_data)
#print(html)
print(html.xpath('//li[@class]/a/text()'))
print(html.xpath("//span[@class='ah']/@style"))

for box in html.xpath('//ul/li'):
    print('a=', box.xpath('a/text()'))
    print(etree.tostring(box, method='html'))  # show the HTML of this li
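
When xpath() is called on an element inside a loop, as above, it matters whether the expression is relative or starts with //. The following sketch uses a shortened, hypothetical version of the list markup to show that an expression starting with // still searches the whole document, while a/text() and .//a/text() stay inside the current li.

```python
from lxml import etree

# Hypothetical, shortened version of the list markup
doc = etree.HTML(
    "<ul>"
    "<li class='item-0'><a href='link1.html'>first item</a></li>"
    "<li class='item-1'><a href='link2.html'>second item</a></li>"
    "</ul>"
)

rows = []
for li in doc.xpath("//ul/li"):
    relative = li.xpath("a/text()")     # a children of this li only
    dotted = li.xpath(".//a/text()")    # a descendants of this li only
    absolute = li.xpath("//a/text()")   # every a in the whole document
    rows.append((li.get("class"), relative, dotted, absolute))

for row in rows:
    print(row)
```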

posted on 2024-02-23 17:35 boye169
