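04. Using XPath

I. Introduction to XPath

XPath is a language for finding information in XML documents. It can be used to traverse the elements and attributes of an XML document. Compared with BeautifulSoup, XPath (used through lxml) is usually more efficient at extracting data.

II. How to use it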

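Install:
 pip install lxml

Import:
 from lxml import etree

Use:
 page = etree.HTML('html source')  # parses the HTML document and returns the html root node
 print(type(page))                 # <class 'lxml.etree._Element'>

HTML is a function with this signature:
 def HTML(text, parser=None, base_url=None):

An xpath() call that selects nodes always returns a list.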
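
The steps above can be sketched as follows. This is a minimal illustration, assuming a tiny hypothetical HTML string (`snippet`) in place of a real page:

```python
from lxml import etree

# Hypothetical HTML snippet, used only for illustration.
snippet = "<html><body><p>hello</p><p>world</p></body></html>"

page = etree.HTML(snippet)        # parse the string; the result is the root <html> element
paragraphs = page.xpath("//p")    # a node-selecting query returns a list

print(page.tag)                   # html
print(type(paragraphs).__name__)  # list
print(len(paragraphs))            # 2
print(page.xpath("//div"))        # [] when nothing matches
```

Because the result is a list even when there is one match (or none), it is worth checking its length before indexing into it.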

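Reference (W3School tutorial, not the official specification): https://www.w3school.com.cn/xpath/xpath_nodes.asp

1. Selecting nodes:

XPath uses path expressions to select nodes in an XML/HTML document. Nodes are selected by following a path, or steps. The most useful path expressions:
1. nodename : selects the child elements of the current node that are named nodename.
2. / : selects from the root node, that is, the path starts at the top of the document.
3. // : selects all matching nodes in the document, wherever they are.
4. . : selects the current node.
5. .. : selects the parent of the current node.
6. @ : selects attributes.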

 Experiment 1:
 html_doc = '''
 <html>
 <head>
 <title>The Dormouse's story</title>
 </head>
 <body>
 <p class="title"><b>The Dormouse's story</b></p>
 <p class="story">Once upon a time there were three little sisters; and their names were
 <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
 <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
 <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
 </p>
 and they lived at the bottom of a well.
 <p class="story">...</p>
 </body>
 </html>
 '''
 from lxml import etree
 page = etree.HTML(html_doc)  # parse and return the html node
 # print(type(page))

 
 # Syntax: nodename selects the child elements of the current node that have that name.
 head_demo = page.xpath('head')  # select the head node
 print(head_demo)                # returns [<Element head at 0x7f16883d0888>]

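 print(page.xpath('body')[0].xpath('p'))
 # Selects all p nodes under the body node. Note: I once made the mistake of leaving out
 # the [0] here. page.xpath('body') returns a list, and (always remember this) a list has
 # no xpath method, so you get: AttributeError: 'list' object has no attribute 'xpath'.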
 
 # Syntax: . selects the current node
 print(page.xpath('.'))          # returns [<Element html at 0x7f0bfc65e908>]
 print(head_demo[0].xpath('.'))  # returns [<Element head at 0x7f82507a18c8>]; head_demo is a list, so index it first

 
 # Syntax: .. selects the parent of the current node.
 print(head_demo[0].xpath('..'))
 # returns [<Element html at 0x7fc98f030948>]

 
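 # Syntax: / selects from the root node
 print(head_demo[0].xpath('/body'))       # empty, because the only node under / (the root) is html
 print(head_demo[0].xpath('/html/body'))  # works: html under the root, then body under html. Result: [<Element body at 0x7f537005d8c8>]
 print(page.xpath('/html/body'))          # same result: [<Element body at 0x7f537005d8c8>]
 # Both results are the same because / always starts from the root node, whether the call is made on page or on head_demo[0].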

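 # Syntax: // selects all matching nodes, wherever they are in the document.
 print(page.xpath('//p'))
 print(head_demo[0].xpath('//p'))
 # These two results are also identical: // finds every matching node regardless of position, so it makes no difference whether the call starts from page or from head_demo[0].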

 # Syntax: @ selects attributes of the current node.
 print(page.xpath('/html/body/p/a')[0].xpath('@href'))
 print(page.xpath('/html/body/p/a')[1].xpath('@href'))

 print(page.xpath('//a')[0].xpath('@href'))
 print(page.xpath('//a')[1].xpath('@href'))
 # The output is below. Both pairs give the same values; one uses // and the other uses /.
 # ['http://example.com/elsie']
 # ['http://example.com/lacie']
 # ['http://example.com/elsie']
 # ['http://example.com/lacie']
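
Since // ignores the node it is called on, a search limited to one element has to combine . and // as .// . The sketch below illustrates this with a hypothetical snippet; the div elements and their ids are assumptions made up for the example, not part of the experiment above:

```python
from lxml import etree

# Hypothetical snippet for illustration only.
snippet = """
<html>
  <head><title>demo</title></head>
  <body>
    <div id="a"><p>one</p><p>two</p></div>
    <div id="b"><p>three</p></div>
  </body>
</html>
"""
page = etree.HTML(snippet)

first_div = page.xpath("//div")[0]  # xpath() returns a list, so index into it

all_p = first_div.xpath("//p")      # '//' starts at the document root: every <p>
inner_p = first_div.xpath(".//p")   # './/' starts at first_div: only its own <p>
div_ids = first_div.xpath("@id")    # attribute values come back as strings

print(len(all_p))    # 3
print(len(inner_p))  # 2
print(div_ids)       # ['a']
```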

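2. Predicates:

Predicates are used to find a specific node or a node that contains a specific value. A predicate is written inside square brackets.

Common predicates:
last() : selects the last element.
last()-1 : selects the second-to-last element.
position()<3 : selects the elements whose position is less than 3, that is, the first two.
[1] : square brackets around a number select the element at that position. (Note: 1 is the first element; counting does not start at 0.)

Experiment 2: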

 html_doc ='''
 <html>
 <body>
 <bookstore>
 <book category='WWW'>
 <title lang="eng">Harry Potter</title>
 <price>29.99</price>
 </book>

 <book category='sss'>
 <title lang="eng">Learning XML</title>
 <price>39.95</price>
 </book>

 <book category='QQQ'>
 <title lang="eng">There is a will</title>
 <price>99.99</price>
 </book>
 </bookstore>
 </body>
 </html>
 '''
 from lxml import etree
 demo = etree.HTML(html_doc)  # parse and return the html node

 print(demo.xpath('//book'))
 '''
 [<Element book at 0x7f64ffa69a08>,
 <Element book at 0x7f64ffa69a48>, 
 <Element book at 0x7f64ffa69a88>]
 '''
 print(demo.xpath('//book[1]'))
 # returns [<Element book at 0x7f64ffa69a08>]
 # Note that this differs from Python indexing: 1 is the first element
 print(demo.xpath('//book[2]'))
 # [<Element book at 0x7f64ffa69a48>]
 print(demo.xpath('//book[last()]'))
 # the last one: [<Element book at 0x7f64ffa69a88>]
 print(demo.xpath('//book[last()-1]'))

 print(demo.xpath('//book[position() < 3]'))
 #[<Element book at 0x7fa849af2948>, <Element book at 0x7fa849af2988>]

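 print(demo.xpath("//book[@category='QQQ']"))
 # Filters by attribute value; the same approach is handy later for locating images.
 # (The outer quotes must differ from the inner ones, otherwise Python raises a SyntaxError.)

 print(demo.xpath('//bookstore/book[price>35.00]/title'))
 # selects the title nodes of the books whose price is greater than 35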
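
The experiment prints element objects; to see which book each predicate picks, the title text can be pulled out instead. This is a minimal sketch under two assumptions that differ from the experiment: the sample is a compact, hypothetical copy of the bookstore data, and it is parsed with etree.XML (it is well-formed XML) rather than etree.HTML:

```python
from lxml import etree

# Hypothetical, well-formed sample; parsed as XML (an assumption of this sketch).
xml_doc = """
<bookstore>
  <book category="WWW"><title lang="eng">Harry Potter</title><price>29.99</price></book>
  <book category="sss"><title lang="eng">Learning XML</title><price>39.95</price></book>
  <book category="QQQ"><title lang="eng">There is a will</title><price>99.99</price></book>
</bookstore>
""".strip()
root = etree.XML(xml_doc)

first = root.xpath("//book[1]/title/text()")                  # positions start at 1
last = root.xpath("//book[last()]/title/text()")
first_two = root.xpath("//book[position() < 3]/title/text()")
by_attr = root.xpath("//book[@category='QQQ']/title/text()")  # double quotes outside, single inside
expensive = root.xpath("//book[price > 35.00]/title/text()")

print(first)      # ['Harry Potter']
print(last)       # ['There is a will']
print(first_two)  # ['Harry Potter', 'Learning XML']
print(by_attr)    # ['There is a will']
print(expensive)  # ['Learning XML', 'There is a will']
```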

3. Selecting unknown nodes:

* : matches any element node.
@* : matches any attribute node.
node() : matches any node of any kind.

Experiment 3:

 from lxml import etree

 html_doc = '''
 <html>
   <body>
     <bookstore>
       <book category='WWW'>
         <title lang="eng">Harry Potter</title>
         <price>29.99</price>
       </book>

       <book category='sss'>
         <title lang="eng">Learning XML</title>
         <price>39.95</price>
       </book>

       <book category='QQQ'>
         <title lang="eng">There is a will</title>
         <price>99.99</price>
       </book>
     </bookstore>
   </body>
 </html>
 '''

 demo = etree.HTML(html_doc)

 # Syntax: * matches any element node
 print(demo.xpath('//book/*'))  # selects all child elements of every book
 # [<Element title at 0x7fe919841988>, <Element price at 0x7fe9198419c8>, <Element title at 0x7fe919841a08>,
 #  <Element price at 0x7fe919841a48>, <Element title at 0x7fe919841a88>,
 #  <Element price at 0x7fe919841b08>]
 print(demo.xpath('//book[position() = 1]/*'))  # selects all child elements of the book at position 1
 # [<Element title at 0x7fe919841988>, <Element price at 0x7fe9198419c8>]

 # Syntax: @* matches any attribute node.
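
The experiment stops before showing @* or node() in use. A minimal sketch of all three wildcards follows, assuming a single hypothetical book element with a bit of leading text ("intro") added so that node() has something other than elements to return:

```python
from lxml import etree

# Hypothetical element; the leading text "intro" is made up for illustration.
xml_doc = '<book category="WWW">intro<title lang="eng">Harry Potter</title><price>29.99</price></book>'
book = etree.XML(xml_doc)

elements = book.xpath("*")    # child elements only
attrs = book.xpath("@*")      # every attribute value of <book>
nodes = book.xpath("node()")  # child nodes of any kind: text and elements

print([e.tag for e in elements])  # ['title', 'price']
print(attrs)                      # ['WWW']
print(len(nodes))                 # 3
print(nodes[0])                   # intro
```

The difference between * and node() only shows up when an element has text or comments mixed in with its child elements, as here.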

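4. Selecting several paths:

By using the | operator in a path expression, you can select several paths at once.

Examples:
 //book/title | //book/price      # selects all title and price elements of all book elements.
 //title | //price                # selects all title and price elements in the document.
 /bookstore/book/title | //price  # selects all title elements of the book elements of the bookstore element, plus all price elements in the document.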
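
These expressions can be tried with lxml as sketched below. The sample is a hypothetical two-book document parsed with etree.XML, so that bookstore really is the root element; that is an assumption of this sketch. When the same text goes through etree.HTML, the root is html and an absolute path beginning with /bookstore matches nothing:

```python
from lxml import etree

# Hypothetical two-book sample, parsed as XML so <bookstore> is the root.
xml_doc = (
    "<bookstore>"
    "<book><title>Harry Potter</title><price>29.99</price></book>"
    "<book><title>Learning XML</title><price>39.95</price></book>"
    "</bookstore>"
)
root = etree.XML(xml_doc)

both = root.xpath("//book/title | //book/price")      # titles and prices together
mixed = root.xpath("/bookstore/book/title | //price")  # absolute path plus a // path

print(len(both))   # 4
print(len(mixed))  # 4

# Parsed as HTML, the document is wrapped in <html>, so /bookstore is not at the root.
as_html = etree.HTML(xml_doc)
print(as_html.tag)                             # html
print(as_html.xpath("/bookstore/book/title"))  # []
```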

5. Getting the text of a node: note that '\n' also counts as text.

(1) text()

Example 1:

 print(demo.xpath('//book[position() = 1]/text()'))
 # returns only the text nodes that are direct children of the selected node
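
For a book element whose only direct text is the line breaks and indentation between its children, text() returns just that whitespace. A minimal sketch, assuming a hypothetical indented book element parsed with etree.XML:

```python
from lxml import etree

# Hypothetical indented element; the whitespace between tags is what matters here.
xml_doc = """<book category="WWW">
  <title lang="eng">Harry Potter</title>
  <price>29.99</price>
</book>"""
book = etree.XML(xml_doc)

direct = book.xpath("text()")            # only whitespace sits directly inside <book>
title_text = book.xpath("title/text()")  # text directly inside <title>
cleaned = [t.strip() for t in book.xpath(".//text()") if t.strip()]

print(direct)      # ['\n  ', '\n  ', '\n']
print(title_text)  # ['Harry Potter']
print(cleaned)     # ['Harry Potter', '29.99']
```

Filtering with strip(), as in `cleaned`, is one simple way to drop the whitespace-only entries; it is a choice made for this sketch rather than something the text above prescribes.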

(2) string(): gets all the text

Example 2:

 print(demo.xpath('string(//book[position() = 1])'))
 # returns the text of the selected node and all of its descendants, joined into one string
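
The contrast between the two is easiest to see when an element has nested markup. The sketch below assumes a hypothetical title containing a <b> element (made up for this example). Note that string() returns a single string rather than a list, and when given several nodes it uses only the first one:

```python
from lxml import etree

# Hypothetical element with nested markup inside <title>.
xml_doc = '<book><title lang="eng">Harry <b>Potter</b></title><price>29.99</price></book>'
book = etree.XML(xml_doc)

direct = book.xpath("title/text()")   # text directly inside <title> only
full = book.xpath("string(title)")    # <title> text plus all descendant text
first_only = book.xpath("string(*)")  # several nodes selected: only the first is used

print(direct)      # ['Harry ']
print(full)        # Harry Potter
print(first_only)  # Harry Potter
```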