04. Using BeautifulSoup

I. BeautifulSoup

1. Introduction

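BeautifulSoup is a Python library for extracting data from HTML or XML files; its main use is scraping data from web pages. It supports the HTML parser in the Python standard library as well as several third-party parsers. If no third-party parser is installed, the standard-library parser is used. The lxml parser is more capable and faster, so lxml is recommended.

BeautifulSoup automatically converts input documents to Unicode and output documents to UTF-8. You do not need to think about encodings unless the document does not declare one and BeautifulSoup cannot detect it automatically; in that case you only need to state the original encoding.

The key idea: BeautifulSoup turns HTML into tag objects, taking advantage of the structure of HTML. A node can contain several child nodes and several strings; for example, the html node contains the head and body nodes. BeautifulSoup can therefore represent an HTML page as layers of nested nodes.

BeautifulSoup has four main kinds of objects:
 1. BeautifulSoup: the object obtained by parsing a page.
 2. Tag (the one to master): data is extracted through the BeautifulSoup object, and almost every operation revolves around Tag objects.
 3. NavigableString (a string that can be navigated): the text wrapped inside a tag is generally a NavigableString.
 4. Comment: comments in the page and other special strings.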
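To make the four object types concrete, here is a minimal sketch. The one-line HTML snippet is made up for illustration, and html.parser is used only so that the example runs without third-party parsers.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Hypothetical one-line document: a tag holding some text and a comment.
html = "<p>Hello<!-- a note --></p>"
soup = BeautifulSoup(html, "html.parser")

p = soup.p               # a Tag
text, note = p.contents  # a NavigableString and a Comment

kinds = [type(obj).__name__ for obj in (soup, p, text, note)]
print(kinds)  # ['BeautifulSoup', 'Tag', 'NavigableString', 'Comment']
```

Comment is a subclass of NavigableString, so isinstance(note, NavigableString) is also True.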

2. Advantages of BeautifulSoup

Compared with regular expressions, it is simpler and more convenient to use.

II. Usage:

Install: pip install beautifulsoup4

Import: from bs4 import BeautifulSoup

Specify a parser: BeautifulSoup needs a working parser to parse a page. The main parsers are listed below.

[Figure: bs解析器.png, a table of the main parsers]

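If no parser is specified, BeautifulSoup picks the best parser available on the system (lxml in the message below; html.parser when no third-party parser is installed) and emits a warning that none was specified.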

UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for 
this system ("lxml"). This usually isn't a problem, but if you run this code on another system, or 
in a different virtual environment, it may use a different parser and behave differently.

The code that caused this warning is on line 40 of the file /home/pyvip/Python_crawler_pratice
/bs_xpath/bs4的使用.py. To get rid of this warning, pass the additional argument 'features="lxml"' 
to the BeautifulSoup constructor.

 soup = BeautifulSoup(html_str)

Tip: if an HTML or XML document is malformed, different parsers may return different results, so always specify one parser explicitly.
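A minimal sketch of passing the parser explicitly. The malformed snippet is invented for illustration. html.parser is chosen here only because it ships with Python; "lxml" can be passed instead when that package is installed.

```python
from bs4 import BeautifulSoup

# Hypothetical malformed snippet: the <p> tag is never closed.
broken_html = "<p>unclosed paragraph"

# Naming the parser avoids the warning and keeps the result the same
# on every machine, whatever other parsers happen to be installed.
soup = BeautifulSoup(broken_html, "html.parser")

repaired = str(soup)
print(repaired)          # the parser's repaired version of the markup
print(soup.p.get_text())
```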

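1. Using Tag:

A tag can be fetched by writing its name as an attribute of the BeautifulSoup object; the result is an instance of bs4.element.Tag. Note that this only finds the first matching tag in the whole document. A Tag has two important attributes: name and attrs.

• ① Create the BeautifulSoup object:

 soup = BeautifulSoup(html_str, 'lxml')  # turn the HTML text into an object that can be manipulated
 print(type(soup))  # result: <class 'bs4.BeautifulSoup'>

• ② Get a tag: only the first matching tag is returned, with all of its content

 a1 = soup.a  # the first matching tag
 print(a1)  # result: <a href="http://www.taobao.com">淘宝</a>

• ③ Get attributes:

 soup.a["href"]  # an attribute of the first matching tag; raises KeyError if the attribute is missing
 soup.a.get('href')  # same result here, 'http://www.taobao.com'; returns None if the attribute is missing
 soup.a.attrs  # all attributes of the a tag, as a dict
 soup.a.name  # the tag name, i.e. 'a'
 soup.name  # the BeautifulSoup object itself is special: this is the string '[document]', not a list

• ④ Get the text:

 text = soup.a.text
 print(text)  # result: 淘宝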
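The steps above can be tried end to end on a small document. The HTML below is a made-up fragment (the class values and the second link are assumptions for illustration), parsed with html.parser so that nothing else needs to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with two links; only the first is found by soup.a.
html = (
    '<div>'
    '<a href="http://www.taobao.com" class="nav top">淘宝</a>'
    '<a href="https://www.baidu.com">other</a>'
    '</div>'
)
soup = BeautifulSoup(html, "html.parser")

first = soup.a                # first <a> in the document
href = first["href"]          # subscript access: KeyError if missing
missing = first.get("title")  # .get(): None if missing
attrs = first.attrs           # dict of all attributes

print(first.name, href, missing)
print(attrs)                  # class is multi-valued, so it is a list
print(first.text, soup.name)
```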

(1) Selecting related tags (navigating the document tree):

Attributes:

1. children: gets all direct child nodes of a Tag; returns an iterator (<class 'list_iterator'>)

Example:

         p = soup.p
         print(p.children)
         print(list(p.children))

 ['\n', <a class="product" href="https://www.baidu.com">关于Python</a>, '\n', 
<a href="http://www.taobao.com">好好学习</a>, '\n', <a href="javascript:void(0)">      人生苦短</a>,
 '\n', <a href="javascript:void(0)">我用Python</a>, '\n']

Note: '\n' also counts as a child node here.

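2. contents: gets all direct child nodes of a Tag; returns a list (<class 'list'>)

Example 1:

 print(type(p.contents))  # <class 'list'>
 print(p.contents)  # individual elements can be fetched by index

Note: children and contents both return the direct children of the current Tag; one is an iterator, the other a list.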
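A short sketch of the difference, on an invented list snippet. It also shows one common way to skip the whitespace text nodes: keep only the children that are Tag instances.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical snippet; the line breaks become '\n' text nodes.
html = "<ul>\n<li>one</li>\n<li>two</li>\n</ul>"
soup = BeautifulSoup(html, "html.parser")
ul = soup.ul

contents = ul.contents        # a list, so it supports indexing
children = list(ul.children)  # an iterator, consumed here into a list

# Keep only real tags, dropping the '\n' strings between them.
tag_children = [child for child in ul.children if isinstance(child, Tag)]

print(len(contents), repr(contents[0]))
print([tag.text for tag in tag_children])
```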

3. descendants: gets all descendant nodes of a Tag; returns a generator (<class 'generator'>)

Example 1:

 print(type(p.descendants))
 print(list(p.descendants))

['\n', <a href="http://www.taobao.com">淘宝</a>, '淘宝', '\n', <span class="link">
<a class="product" href="https://www.baidu.com">关于Python</a>
<a href="http://www.taobao.com">好好学习</a>
<a href="javascript:void(0)">人生苦短</a>
<a href="javascript:void(0)">我用Python</a>
</span>, '\n', <a class="product" href="https://www.baidu.com">关于Python</a>, 
'关于Python', '\n', <a href="http://www.taobao.com">好好学习</a>, '好好学习', '\n',
 <a href="javascript:void(0)">人生苦短</a>, '人生苦短', '\n', <a href="javascript:void(0)">我用Python</a>, 
'我用Python', '\n', '\n', <span class="about-me">关于我: <i class="PyWhich py-wechat"></i> 贤思齐</span>, 
'关于我: ', <i class="PyWhich py-wechat"></i>, ' 贤思齐', '\n']
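The traversal order of descendants is easier to see on a tiny tree. The snippet below is hypothetical; it contrasts the single direct child with the full depth-first list of descendants, where each tag is followed by its own contents.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical nested snippet without any whitespace nodes.
html = "<div><p>hi<b>there</b></p></div>"
soup = BeautifulSoup(html, "html.parser")
div = soup.div

direct = list(div.children)   # only the <p> tag
deep = list(div.descendants)  # <p>, 'hi', <b>, 'there'

# Label tags as <name> and keep strings as they are.
summary = [f"<{node.name}>" if isinstance(node, Tag) else str(node) for node in deep]
print(len(direct), summary)  # 1 ['<p>', 'hi', '<b>', 'there']
```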

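4. strings: if a Tag contains more than one string, that is, all the text inside its descendants, use strings to get them and then iterate over the result ('\n' is also treated as a string). Returns a generator (<class 'generator'>)

Example 1:

 print(type(p.strings))
 print(list(p.strings))

['\n', '淘宝', '\n', '\n', '关于Python', '\n', '好好学习', '\n',
 '人生苦短', '\n', '我用Python', '\n', '\n', '关于我: ', ' 贤思齐', '\n']

5. string: unlike strings, it returns a single string only; if the tag contains more than one piece of text, it returns None.

Example 1:

 print(p.string)  # None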

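6. stripped_strings: used the same way as strings, except that surrounding whitespace is stripped and whitespace-only strings are dropped. Returns a generator (<class 'generator'>)

Example 1:

 print(type(p.stripped_strings))  #  <class 'generator'>
 print(list(p.stripped_strings)) 

['淘宝', '关于Python', '好好学习', '人生苦短', '我用Python', '关于我:', '贤思齐']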

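7. parent: gets the direct parent node of a Tag

Example 1:

 print(type(p.parent))  # <class 'bs4.element.Tag'>
 print(p.parent)

8. parents: walks up through all ancestors of the element; returns a generator

9. next_sibling: gets the next sibling node of a Tag

Note: if the tag is followed by a line break, the result is '\n', because '\n' is also treated as a node. To skip it, go one step further:

 print(p.next_sibling.next_sibling)  # the sibling after the next one

10. next_siblings: all following siblings; returns a generator

11. previous_sibling: gets the previous sibling node of a Tag

12. previous_siblings: all preceding siblings; returns a generator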
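A sketch of moving up and sideways in the tree, on a made-up fragment with two paragraphs separated by a line break. find_next_sibling() is not covered in the list above; it is shown here as one way to jump straight to the next tag without stepping over the '\n' node by hand.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: two sibling <p> tags with a '\n' between them.
html = '<div><p id="a">A</p>\n<p id="b">B</p></div>'
soup = BeautifulSoup(html, "html.parser")
a = soup.find(id="a")
b = soup.find(id="b")

parent_name = a.parent.name                  # 'div'
ancestors = [tag.name for tag in a.parents]  # up to the document itself

next_node = a.next_sibling                   # the '\n' text node
next_tag = a.next_sibling.next_sibling       # the second <p>
same_tag = a.find_next_sibling("p")          # skips text nodes directly

following = list(a.next_siblings)            # '\n' and <p id="b">
previous_node = b.previous_sibling           # the same '\n' node

print(parent_name, ancestors)
print(repr(next_node), next_tag["id"], same_tag["id"])
```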

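(2) find_all (searching the document tree):

1. find_all(name, attrs, recursive, text, **kwargs)  # returns all matching Tags; searches all descendants by default

Remember: the return value is BeautifulSoup's own result set (<class 'bs4.element.ResultSet'>), which holds tag objects.

Parameters:

-name: search by tag name. A list is also accepted, in which case every tag matching any of the names is returned. Example:

 soup.find_all(['p', 'a'])  # all p tags and all a tags

-attrs: search by attributes, by passing a dict to attrs. Example:

 soup.find_all(attrs={'class': 'MW'})  # all Tags whose class is MW

-text: when passed on its own, the result is a list of the matching strings. (Since Beautiful Soup 4.4.0 this parameter is called string; text is the older name.) Example 1:

 soup.find_all(text='China')  # returns the string 'China' itself

What if you want the tag that contains the text? Example 2:

 soup.find_all('a', text='China')  # returns the a tags whose text is China

-recursive: setting recursive=False limits the search to direct children. True: recursive, all descendants are searched; False: not recursive, only direct children.

-kwargs: any other keyword argument is treated as a filter on the attribute of that name, for example soup.find_all(id='link1'). Filters can also be regular expressions. Example 1:

 import re
 soup.find_all(re.compile('^b'))  # tags whose name starts with b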
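The parameters above can be exercised together on one small document. The HTML is invented for illustration, and the sketch uses string= (the current name of the text parameter) so that it does not depend on the older spelling.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical document: one link is a direct child of <body>, one is nested.
html = (
    '<body>'
    '<p class="MW">x</p>'
    '<b>bold</b>'
    '<a href="/c" id="l1">China</a>'
    '<div><a href="/d">deep</a></div>'
    '</body>'
)
soup = BeautifulSoup(html, "html.parser")

by_name = soup.find_all(["p", "a"])                  # list of names
by_attrs = soup.find_all(attrs={"class": "MW"})      # attrs dict
only_strings = soup.find_all(string="China")         # matching strings
tags_with_text = soup.find_all("a", string="China")  # tags with that text
direct = soup.body.find_all("a", recursive=False)    # direct children only
by_keyword = soup.find_all(id="l1")                  # keyword attribute filter
starts_with_b = [tag.name for tag in soup.find_all(re.compile("^b"))]

print([tag.name for tag in by_name])
print(only_strings, [tag["href"] for tag in direct])
print(starts_with_b)
```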

(3) CSS selectors

BeautifulSoup supports most CSS selectors.

Method: soup.select()
Parameter: a str; CSS selector syntax can be used to find the target Tags.
Return value: remember that the selected results are returned as a list.

 from bs4 import BeautifulSoup
 soup = BeautifulSoup('html text', 'lxml')  # second argument: the parser, lxml recommended

1. Search by tag name:

 soup.select('title')  # the title tag
 soup.select("p:nth-of-type(3)")  # the third p among its siblings; when all p tags share one parent, same element as soup.select('p')[2]
 soup.select("body a")  # all a tags under body
 soup.select('p > a')  # a tags that are direct children of a p tag
 soup.select("#link1 ~ .mysis")  # all later siblings of id=link1 that have class mysis
 soup.select("#link1 + .mysis")  # the sibling immediately after id=link1, if it has class mysis

2. Search by class name:

 soup.select("a.mysis")  # a tags that have the class mysis
 soup.select('a[class="mysis"]')  # the same search written as an attribute search; matches when the class attribute is exactly mysis

3. Search by id:

 soup.select('a#link1')  # a tags whose id is link1

4. Search by attribute:

 soup.select("a[myname]")  # a tags that have a myname attribute
 soup.select("a[href='http://example.com/lacie']")  # a tags with href=http://example.com/lacie
 soup.select('a[href^="http"]')  # a tags whose href starts with http
 soup.select('a[href$="lacie"]')  # a tags whose href ends with lacie
 soup.select('a[href*=".com"]')  # a tags whose href contains .com

To remove a tag from the HTML, so that soup no longer contains any script tag:

 [s.extract() for s in soup('script')]

To remove several kinds of tag:

 [s.extract() for s in soup(['script', 'frame'])]
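A few of these selectors, plus the extract() idiom, can be checked on a small, made-up fragment. The class name mysis and the URLs follow the examples above; everything else in the HTML is an assumption for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: three paragraphs, two links and a script tag.
html = (
    '<div>'
    '<p>one</p><p>two</p><p>three</p>'
    '<a class="mysis" id="link1" href="http://example.com/elsie">Elsie</a>'
    '<a class="mysis" href="http://example.com/lacie">Lacie</a>'
    '<script>var x = 1;</script>'
    '</div>'
)
soup = BeautifulSoup(html, "html.parser")

third = soup.select("p:nth-of-type(3)")       # list with the third <p>
by_class = soup.select("a.mysis")             # both links
by_attr = soup.select('a[class="mysis"]')     # same links, attribute syntax
later = soup.select("#link1 ~ .mysis")        # siblings after #link1
ends_with = soup.select('a[href$="lacie"]')   # href ends with "lacie"

# Remove every <script> tag from the tree; a plain loop avoids
# building a throwaway list just for the side effect.
for script in soup("script"):
    script.extract()

print(third[0].text, len(by_class), later[0].text)
print(soup.find("script"))
```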

5. Getting content: get_text() and the strings attribute

get_text() method: returns a single string (str).
strings attribute: returns an iterable (a generator).

Example 1:

 html_doc = """<html> <head> <title>The Dormouse's story</title> </head> <body>
 <p class="title"><b>The Dormouse's story</b></p>
 <p class="story">Once upon a time there were three little sisters; and their names were
 <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
 <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
 <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
 and they lived at the bottom of a well.</p>
 <p class="story">...</p>
 </body> """

 from bs4 import BeautifulSoup

 soup = BeautifulSoup(html_doc, 'html.parser')
 s = soup.select('p.story')  # p tags whose class is story; select() returns a list
 s[0].get_text()  # text of the p node and all of its descendants
 s[0].get_text("|")  # use | as the separator between pieces of text
 s[0].get_text("|", strip=True)  # strip the whitespace around each piece of text
 print(s[0].get("class"))  # list of class values of the p node (class is multi-valued; most other attributes return a string)
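To see what the separator and strip arguments actually do, here is a sketch on a shorter, invented paragraph where the results are easy to read off.

```python
from bs4 import BeautifulSoup

# Hypothetical paragraph with two classes and two links.
html = (
    '<p class="story big">Once '
    '<a href="#" id="link1">Elsie</a>, '
    '<a href="#">Lacie</a></p>'
)
soup = BeautifulSoup(html, "html.parser")
p = soup.select("p.story")[0]  # select() returns a list

plain = p.get_text()                    # 'Once Elsie, Lacie'
joined = p.get_text("|")                # 'Once |Elsie|, |Lacie'
stripped = p.get_text("|", strip=True)  # 'Once|Elsie|,|Lacie'

classes = p.get("class")  # multi-valued attribute: a list
link_id = p.a.get("id")   # ordinary attribute: a string

print(plain, joined, stripped, sep="\n")
print(classes, link_id)
```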

2. Using NavigableString:

-NavigableString: (commonly used)
Introduction: the name means "a string that can be navigated". The text wrapped inside a tag is generally a NavigableString, and the string attribute is used to get the text inside a tag. Example 1:

 html = '''<html>
 <td>some text</td> 
 <td></td>
 <td><p>more text</p></td>
 <td>even <p>more text</p></td>
 </html>'''

 from bs4 import BeautifulSoup
 soup = BeautifulSoup(html, 'lxml')
 tds = soup.find_all('td')

 for td in tds:
     print(td.string)
     print(type(td.string))

 for td in tds:
     print(td.text)
     print(type(td.text))

The string attribute returns a bs4.element.NavigableString, while the text attribute returns a str. If a tag has no text inside, string returns None, whereas text never returns None (it returns an empty string).
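The difference shows up clearly when the results are collected into lists. This sketch reuses the four td cases, wrapped in a table row and parsed with html.parser; both of those choices are assumptions made so that the example runs without lxml.

```python
from bs4 import BeautifulSoup

# The four cases: plain text, empty, one nested tag, text plus a nested tag.
html = (
    "<table><tr>"
    "<td>some text</td>"
    "<td></td>"
    "<td><p>more text</p></td>"
    "<td>even <p>more text</p></td>"
    "</tr></table>"
)
soup = BeautifulSoup(html, "html.parser")
tds = soup.find_all("td")

strings = [td.string for td in tds]  # None when there is no single string
texts = [td.text for td in tds]      # always a str, possibly empty

print(strings)  # ['some text', None, 'more text', None]
print(texts)    # ['some text', '', 'more text', 'even more text']
print(type(tds[0].string).__name__, type(tds[0].text).__name__)
```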

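3. Using Comment:

Introduction: comments in the page and other special strings. A Comment object is a special type of NavigableString; its output does not include the comment delimiters.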
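A minimal sketch of picking comments out of a parsed document. The fragment is invented, and filtering with a function passed to string= is one common idiom rather than the only way to do it.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# Hypothetical fragment containing one HTML comment.
html = "<div><!-- footer start --><p>text</p></div>"
soup = BeautifulSoup(html, "html.parser")

first = soup.div.contents[0]
is_comment = isinstance(first, Comment)
comment_text = str(first)  # the text without the <!-- and --> delimiters

# Collect every comment in the document.
comments = soup.find_all(string=lambda s: isinstance(s, Comment))

print(is_comment, repr(comment_text))
print(comments)
```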

html_str="""<!DOCTYPE html>
<html>
<head>
<title>39爬虫少年们</title>
  <meta charset="utf-8">
  <link rel="stylesheet" href="http://www.taobao.com">
  <link rel="stylesheet" href="https://www.baidu.com">
  <link rel="stylesheet" href="http://at.alicdn.com/t/font_684044_un7umbuwwfp.css">
</head>
<body>
<!-- footer start -->
<footer id="footer">
    <div class="footer-box">
        <div class="footer-content"  >
            <p class="top-content"  id="111">
                    <a  href="http://www.taobao.com">淘宝</a> 
                    <span class="link">
                        <a class="product"  href="https://www.baidu.com">关于Python</a> 
                        <a  href="http://www.taobao.com">好好学习</a> 
                        <a href="javascript:void(0)">人生苦短</a> 
                        <a href="javascript:void(0)">我用Python</a>
                    </span>
                <span class="about-me">关于我: <i class="PyWhich py-wechat"></i> 贤思齐</span>
            </p>
            <p class="bottom-content">
                <span>地址： xxxx</span>
                <span>联系方式: <a href="tel:400-1567-315">400-1567-315</a> (24小时在线)</span>
            </p>
        </div>
        <p class="copyright-desc">
            Copyright &copy; 2008 - 2019 xxx有限公司. All Rights Reserved
        </p>
    </div>
</footer>

</body>
</html>
"""

Quick quiz:

html_doc = """<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
<body>
    <p class="title"><b>The Dormouse's story</b></p>
    <p class="story">Once upon a time there were three little sisters; and their names were
        <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
        <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
        <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    </p>
        and they lived at the bottom of a well.
    <p class="story">...</p>
</body>
</html>
"""

from bs4 import BeautifulSoup

# select() returns its results as a list

soup = BeautifulSoup(html_doc, 'html.parser')
soup.select('title')  # the title tag
soup.select("p:nth-of-type(3)")  # the third p node
soup.select('body a')  # all a nodes anywhere under body
soup.select('p > a')  # all a nodes that are direct children of a p node
soup.select('p > #link1')  # the node with id=link1 that is a direct child of a p node
soup.select('#link1 ~ .sister')  # all later siblings of the id=link1 node that have class=sister
soup.select('#link1 + .sister')  # the sibling immediately after the id=link1 node, if it has class=sister
soup.select('.sister')  # all nodes with class=sister
soup.select('[class="sister"]')  # all nodes whose class attribute is exactly sister
soup.select("#link1")  # the node with id=link1
soup.select("a#link1")  # a nodes with id=link1
soup.select('a[href]')  # all a nodes that have an href attribute
soup.select('a[href="http://example.com/elsie"]')  # all a nodes with exactly this href
soup.select('a[href^="http://example.com/"]')  # all a nodes whose href starts with the given value
soup.select('a[href$="tillie"]')  # all a nodes whose href ends with the given value
soup.select('a[href*=".com/el"]')  # all a nodes whose href contains the given substring (not a regular expression)