Python Web Scraping in Depth: A Guide to Using Beautiful Soup

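I. Introduction to Beautiful Soup

Beautiful Soup is a powerful parsing library that extracts data from a web page by working with the page's structure and the attributes of its nodes.

It provides functions for navigating, searching, and modifying the parse tree. You rarely need to worry about the document's encoding, because Beautiful Soup converts incoming documents to Unicode automatically. Beautiful Soup does not parse markup by itself: it relies on an underlying parser, and lxml is a common choice.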
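
As a small illustration of the parser dependency, the sketch below passes the parser name as the second argument to the constructor. It assumes the built-in "html.parser" so that it runs without extra packages; the markup and the explicit from_encoding value are made up for this example and are not part of test03.html.

```python
from bs4 import BeautifulSoup

# Hypothetical input: a tiny page supplied as UTF-8 bytes.
raw = "<html><head><title>百度一下</title></head></html>".encode("utf-8")

# The second argument names the parser. "html.parser" ships with Python;
# "lxml" must be installed separately before it can be named here.
soup = BeautifulSoup(raw, "html.parser", from_encoding="utf-8")

# Whatever the input encoding, the parsed text comes back as a Unicode str.
title = soup.title.string
print(title)
```

Swapping "html.parser" for "lxml" should be the only change needed once lxml is installed.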

II. Using Beautiful Soup

The examples below use the following test file, test03.html:

    <!DOCTYPE html>
    <html>
    <head>
        <meta content="text/html;charset=utf-8" http-equiv="content-type" />
        <meta content="IE=Edge" http-equiv="X-UA-Compatible" />
        <meta content="always" name="referrer" />
        <link href="https://ss1.bdstatic.com/5eN1bjq8AAUYm2zgoY3K/r/www/cache/bdorz/baidu.min.css" rel="stylesheet" type="text/css" />
        <title>百度一下，你就知道 </title>
    </head>
    <body link="#0000cc">
      <div id="wrapper">
        <div id="head">
            <div class="head_wrapper">
              <div id="u1">
                <a class="mnav" href="http://news.baidu.com" name="tj_trnews">新闻 </a>
                <a class="mnav" href="https://www.hao123.com" name="tj_trhao123">hao123 </a>
                <a class="mnav" href="http://map.baidu.com" name="tj_trmap">地图 </a>
                <a class="mnav" href="http://v.baidu.com" name="tj_trvideo">视频 </a>
                <a class="mnav" href="http://tieba.baidu.com" name="tj_trtieba">贴吧 </a>
                <a class="bri" href="//www.baidu.com/more/" name="tj_briicon" style="display: block;">更多产品 </a>
              </div>
            </div>
        </div>
      </div>
    </body>
    </html>

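1. Node Selectors

As we saw earlier, a web page is made up of element nodes, and extracting the content of a particular node gives us the data shown on the page. Node selectors simplify this: they let us get at the data precisely without resorting to regular expressions.

    from bs4 import BeautifulSoup

    # Read the sample page as bytes; Beautiful Soup handles the decoding.
    with open("./test03.html", "rb") as file:
        html = file.read()

    soup = BeautifulSoup(html, "lxml")
    print(soup.head)        # the <head> node
    print(soup.head.title)  # the <title> nested inside <head>
    print(soup.a)           # the first <a> node in the document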

[Output]

<head>
<meta content="text/html;charset=utf-8" http-equiv="content-type"/>
<meta content="IE=Edge" http-equiv="X-UA-Compatible"/>
<meta content="always" name="referrer"/>
<link href="https://ss1.bdstatic.com/5eN1bjq8AAUYm2zgoY3K/r/www/cache/bdorz/baidu.min.css" rel="stylesheet" type="text/css"/>
<title>百度一下，你就知道 </title>
</head>
<title>百度一下，你就知道 </title>
<a class="mnav" href="http://news.baidu.com" name="tj_trnews">新闻 </a>

Analysis:

The first print statement outputs the page's head node.

The second outputs the title node inside head. This uses a nested selection, because the title node is nested inside the head node.

The third outputs an a node. The source contains several a nodes, but only the first one is matched. When several nodes share a name, this style of selection returns only the first match and ignores the rest.
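
The first-match behaviour described above can be seen in a short sketch. The inline markup is invented for the example, and "html.parser" is assumed so that it runs without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with two links.
html = '<div><a href="/one">one</a><a href="/two">two</a></div>'
soup = BeautifulSoup(html, "html.parser")

first = soup.a              # attribute access stops at the first <a>
every = soup.find_all("a")  # find_all() collects every <a>
missing = soup.table        # a tag that does not exist yields None

print(first["href"], len(every), missing)
```

Because a missing tag gives None rather than raising an error, chained access such as soup.table.tr fails with an AttributeError, so it is worth checking for None first.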

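2. Extracting Information

The data we need usually lives in a node's name, its attribute values, or its text. The following code shows how to read each of the three:

    from bs4 import BeautifulSoup

    # Read the sample page as bytes; Beautiful Soup handles the decoding.
    with open("./test03.html", "rb") as file:
        html = file.read()

    soup = BeautifulSoup(html, "lxml")
    print(soup.body.name)              # node name
    print(soup.body.a.attrs['class'])  # class attribute of the first <a>
    print(soup.body.a.attrs['href'])   # href attribute of the first <a>
    print(soup.body.a.string)          # text of the first <a>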

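[Output]

body
['mnav']
http://news.baidu.com
新闻

Analysis:

The first line is the name of the body node.

The second is the class attribute of the a node. It is a list because class can hold several values.

The third is the href attribute of the a node.

The fourth is the text of the a node.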
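
The sketch below repeats the three kinds of extraction on a made-up tag and adds the get() method, which returns None instead of raising KeyError for a missing attribute. The markup and the "html.parser" choice are assumptions for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical link carrying two class values.
html = '<a class="mnav bri" href="http://news.baidu.com">新闻</a>'
link = BeautifulSoup(html, "html.parser").a

name = link.name            # node name: "a"
classes = link["class"]     # multi-valued attribute: a list of strings
href = link["href"]         # single-valued attribute: a plain string
title = link.get("title")   # absent attribute: None rather than KeyError
text = link.string          # the text inside the tag

print(name, classes, href, title, text)
```

link["class"] is shorthand for link.attrs["class"], the form used in the listing above.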

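3. Related-Node Selection

(1) Children and descendants

Child nodes are available through the contents and children attributes, and all descendants through the descendants attribute. contents returns a list, while children and descendants return iterators, so all three can be walked with a for loop.

    from bs4 import BeautifulSoup

    # Read the sample page as bytes; Beautiful Soup handles the decoding.
    with open("./test03.html", "rb") as file:
        html = file.read()

    soup = BeautifulSoup(html, "lxml")
    # print(soup.body.contents)
    # Number each direct child of <body> as it is printed.
    for i, content in enumerate(soup.body.contents):
        print(i, content)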

[Output]

0

1 <div id="wrapper">
<div id="head">
<div class="head_wrapper">
<div id="u1">
<a class="mnav" href="http://news.baidu.com" name="tj_trnews">新闻 </a>
<a class="mnav" href="https://www.hao123.com" name="tj_trhao123">hao123 </a>
<a class="mnav" href="http://map.baidu.com" name="tj_trmap">地图 </a>
<a class="mnav" href="http://v.baidu.com" name="tj_trvideo">视频 </a>
<a class="mnav" href="http://tieba.baidu.com" name="tj_trtieba">贴吧 </a>
<a class="bri" href="//www.baidu.com/more/" name="tj_briicon" style="display: block;">更多产品 </a>
</div>
</div>
</div>
</div>
2

Items 0 and 2 are whitespace text nodes (the line breaks around the div), which also count as children of body.
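
To make the difference between the three attributes concrete, here is a minimal sketch on an invented list, parsed with "html.parser" (an assumption so that it runs without lxml).

```python
from bs4 import BeautifulSoup

# Hypothetical markup: a list whose first item contains a nested tag.
html = "<ul><li>a<b>x</b></li><li>c</li></ul>"
ul = BeautifulSoup(html, "html.parser").ul

direct = ul.contents                # a real list of direct children
children = list(ul.children)        # iterator over the same direct children
descendants = list(ul.descendants)  # every nested tag and text node

# Text nodes have no tag name, so this keeps only the tags.
tag_names = [node.name for node in descendants if node.name]

print(len(direct), len(children), len(descendants), tag_names)
```

descendants walks the tree depth-first, so the nested b tag appears before the second li.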

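(2) Parents and ancestors

To get a node's parent, use the parent attribute. For example, to get the parent of the title node in the sample:

    from bs4 import BeautifulSoup

    # Read the sample page as bytes; Beautiful Soup handles the decoding.
    with open("./test03.html", "rb") as file:
        html = file.read()

    soup = BeautifulSoup(html, "lxml")
    print(soup.title.parent)  # the <head> node that contains <title>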

[Output]

<head>
<meta content="text/html;charset=utf-8" http-equiv="content-type"/>
<meta content="IE=Edge" http-equiv="X-UA-Compatible"/>
<meta content="always" name="referrer"/>
<link href="https://ss1.bdstatic.com/5eN1bjq8AAUYm2zgoY3K/r/www/cache/bdorz/baidu.min.css" rel="stylesheet" type="text/css"/>
<title>百度一下，你就知道 </title>
</head>

Similarly, to get a node's ancestors, use the parents attribute, which yields each ancestor in turn from the nearest outward.
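
A minimal sketch of walking the ancestors with parents follows. The markup is a cut-down, made-up version of the sample page, and "html.parser" is assumed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one link nested three levels deep.
html = "<html><body><div id='u1'><a href='/x'>x</a></div></body></html>"
soup = BeautifulSoup(html, "html.parser")

# parents yields the nearest ancestor first and ends at the soup object
# itself, whose name is the special string "[document]".
chain = [ancestor.name for ancestor in soup.a.parents]

print(chain)
```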

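(3) Siblings

next_sibling returns the node's next sibling;

previous_sibling returns the node's previous sibling;

next_siblings returns an iterator over all siblings that follow the node;

previous_siblings returns an iterator over all siblings that precede the node.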
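
The four sibling attributes can be compared side by side in the sketch below. The markup is invented and "html.parser" is assumed. The last part shows a common surprise: whitespace between tags is a text node, so it is a sibling too.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: three adjacent links with no whitespace between them.
html = "<p><a>1</a><a>2</a><a>3</a></p>"
soup = BeautifulSoup(html, "html.parser")
first, second, third = soup.find_all("a")

# Singular attributes return one neighbouring node, or None at the edge.
after_second = second.next_sibling
before_second = second.previous_sibling
after_third = third.next_sibling

# Plural attributes iterate over every remaining sibling in that direction.
following = [node.string for node in first.next_siblings]
preceding = [node.string for node in third.previous_siblings]

# With a line break between the tags, the nearest sibling is the "\n" text.
spaced = BeautifulSoup("<p><a>1</a>\n<a>2</a></p>", "html.parser")
raw_next = spaced.a.next_sibling
tag_next = spaced.a.find_next_sibling("a")  # skips text, finds the next <a>

print(following, preceding, repr(raw_next), tag_next)
```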

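4. Method Selectors

find_all()

Finds all elements that match the given conditions. Its signature is:

    find_all(name, attrs, recursive, string, limit, **kwargs)

Older releases of Beautiful Soup call the string argument text.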

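(1) name

Queries elements by node name. For example, to find the a elements in the sample:

    from bs4 import BeautifulSoup

    # Read the sample page as bytes; Beautiful Soup handles the decoding.
    with open("./test03.html", "rb") as file:
        html = file.read()

    soup = BeautifulSoup(html, "lxml")
    print(soup.find_all(name="a"))  # the whole result list at once
    # Then each matching tag on its own line.
    for a in soup.find_all(name="a"):
        print(a)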

[Output]

[<a class="mnav" href="http://news.baidu.com" name="tj_trnews">新闻 </a>, <a class="mnav" href="https://www.hao123.com" name="tj_trhao123">hao123 </a>, <a class="mnav" href="http://map.baidu.com" name="tj_trmap">地图 </a>, <a class="mnav" href="http://v.baidu.com" name="tj_trvideo">视频 </a>, <a class="mnav" href="http://tieba.baidu.com" name="tj_trtieba">贴吧 </a>, <a class="bri" href="//www.baidu.com/more/" name="tj_briicon" style="display: block;">更多产品 </a>]
<a class="mnav" href="http://news.baidu.com" name="tj_trnews">新闻 </a>
<a class="mnav" href="https://www.hao123.com" name="tj_trhao123">hao123 </a>
<a class="mnav" href="http://map.baidu.com" name="tj_trmap">地图 </a>
<a class="mnav" href="http://v.baidu.com" name="tj_trvideo">视频 </a>
<a class="mnav" href="http://tieba.baidu.com" name="tj_trtieba">贴吧 </a>
<a class="bri" href="//www.baidu.com/more/" name="tj_briicon" style="display: block;">更多产品 </a>

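(2) attrs

A query can also filter on tag attributes. The attrs argument is a dictionary.

    from bs4 import BeautifulSoup

    # Read the sample page as bytes; Beautiful Soup handles the decoding.
    with open("./test03.html", "rb") as file:
        html = file.read()

    soup = BeautifulSoup(html, "lxml")
    # Keep only the <a> tags whose class attribute includes "bri".
    print(soup.find_all(name="a", attrs={"class": "bri"}))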

[Output]

[<a class="bri" href="//www.baidu.com/more/" name="tj_briicon" style="display: block;">更多产品 </a>]

As you can see, once the class="bri" condition is added, only one a element remains in the result.
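
Besides the attrs dictionary, find_all() also accepts attribute filters as keyword arguments. The sketch below compares the two forms on made-up markup, assuming "html.parser". class is a reserved word in Python, so the keyword form is spelled class_.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modelled loosely on the sample page.
html = """
<div id="u1">
  <a class="mnav" href="http://news.baidu.com">新闻</a>
  <a class="bri" href="//www.baidu.com/more/">更多产品</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

by_dict = soup.find_all("a", attrs={"class": "bri"})  # dictionary form
by_keyword = soup.find_all("a", class_="bri")         # keyword form
by_id = soup.find_all(id="u1")                        # any tag with this id

print(by_dict, by_keyword, by_id[0].name)
```

The dictionary form remains the safer choice for attribute names that are not valid Python identifiers, such as data-* attributes.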

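(3) text

The text argument matches against a node's text. It accepts either a string or a compiled regular expression object. Newer Beautiful Soup releases name this argument string and keep text as the older name.

    import re
    from bs4 import BeautifulSoup

    # Read the sample page as bytes; Beautiful Soup handles the decoding.
    with open("./test03.html", "rb") as file:
        html = file.read()

    soup = BeautifulSoup(html, "lxml")
    # Match <a> tags whose text contains "新闻" (regular expression search).
    print(soup.find_all(name="a", text=re.compile('新闻')))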

[Output]

[<a class="mnav" href="http://news.baidu.com" name="tj_trnews">新闻 </a>]

Only the a tag whose text contains "新闻" is returned.
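
The difference between passing a plain string and passing a regular expression can be sketched as follows. The markup is invented, "html.parser" is assumed, and the example uses the newer argument name string.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical markup: two links mention "新闻", one does not.
html = "<div><a>新闻</a><a>新闻中心</a><a>地图</a></div>"
soup = BeautifulSoup(html, "html.parser")

# A plain string must equal the tag's text exactly.
exact = soup.find_all("a", string="新闻")

# A compiled pattern is applied as a search, so partial matches count.
partial = soup.find_all("a", string=re.compile("新闻"))

print(len(exact), len(partial))
```

This is one reason the listing above uses a regular expression: the link text in test03.html ends with a space, so it is not identical to "新闻".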

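find()

find() is used the same way. The only difference is that find() matches just the first element found and returns that single element (or None when nothing matches), whereas find_all() matches every element that meets the conditions and returns a list.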
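
A short sketch of the contrast, on invented markup and assuming "html.parser". It also shows the limit argument of find_all(), which caps the number of results.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with three links.
html = (
    '<div><a class="mnav">新闻</a>'
    '<a class="mnav">地图</a>'
    '<a class="bri">更多产品</a></div>'
)
soup = BeautifulSoup(html, "html.parser")

first_link = soup.find("a")               # a single Tag
first_two = soup.find_all("a", limit=2)   # a list capped at two results
no_tag = soup.find("table")               # None when nothing matches
no_tags = soup.find_all("table")          # empty list when nothing matches

print(first_link.string, len(first_two), no_tag, len(no_tags))
```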

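5. CSS Selectors

To use CSS selectors, call the select() method and pass in a CSS selector.

For example, to get the a tags in the sample with a CSS selector:

    from bs4 import BeautifulSoup

    # Read the sample page as bytes; Beautiful Soup handles the decoding.
    with open("./test03.html", "rb") as file:
        html = file.read()

    soup = BeautifulSoup(html, "lxml")
    print(soup.select('a'))  # the whole result list at once
    # Then each matching tag on its own line.
    for a in soup.select('a'):
        print(a)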

[Output]

[<a class="mnav" href="http://news.baidu.com" name="tj_trnews">新闻 </a>, <a class="mnav" href="https://www.hao123.com" name="tj_trhao123">hao123 </a>, <a class="mnav" href="http://map.baidu.com" name="tj_trmap">地图 </a>, <a class="mnav" href="http://v.baidu.com" name="tj_trvideo">视频 </a>, <a class="mnav" href="http://tieba.baidu.com" name="tj_trtieba">贴吧 </a>, <a class="bri" href="//www.baidu.com/more/" name="tj_briicon" style="display: block;">更多产品 </a>]
<a class="mnav" href="http://news.baidu.com" name="tj_trnews">新闻 </a>
<a class="mnav" href="https://www.hao123.com" name="tj_trhao123">hao123 </a>
<a class="mnav" href="http://map.baidu.com" name="tj_trmap">地图 </a>
<a class="mnav" href="http://v.baidu.com" name="tj_trvideo">视频 </a>
<a class="mnav" href="http://tieba.baidu.com" name="tj_trtieba">贴吧 </a>
<a class="bri" href="//www.baidu.com/more/" name="tj_briicon" style="display: block;">更多产品 </a>
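
select() accepts more than bare tag names. The sketch below tries a class selector, an id-plus-child selector, and an attribute-prefix selector on a made-up fragment, assuming "html.parser". select_one() is the single-result counterpart of select().

```python
from bs4 import BeautifulSoup

# Hypothetical fragment modelled loosely on the sample page.
html = """
<div id="u1">
  <a class="mnav" href="http://news.baidu.com">新闻</a>
  <a class="mnav" href="https://www.hao123.com">hao123</a>
  <a class="bri" href="//www.baidu.com/more/">更多产品</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

by_class = soup.select("a.mnav")               # <a> tags with class "mnav"
by_parent = soup.select("#u1 > a")             # <a> children of id="u1"
by_prefix = soup.select('a[href^="http://"]')  # href starts with "http://"
first_bri = soup.select_one("a.bri")           # first match, or None

print(len(by_class), len(by_parent), len(by_prefix), first_bri.string)
```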

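Getting attributes

To get the href attribute of the a tags above:

    from bs4 import BeautifulSoup

    # Read the sample page as bytes; Beautiful Soup handles the decoding.
    with open("./test03.html", "rb") as file:
        html = file.read()

    soup = BeautifulSoup(html, "lxml")
    # Index a tag like a dictionary to read one of its attributes.
    for a in soup.select('a'):
        print(a['href'])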

[Output]

http://news.baidu.com
https://www.hao123.com
http://map.baidu.com
http://v.baidu.com
http://tieba.baidu.com
//www.baidu.com/more/
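
The last link in this output is protocol-relative, so it cannot be requested as it stands. One possible way to normalise such values is urljoin from the standard library, sketched below. The BASE_URL value is an assumption about where the page was fetched from, the markup is invented, and "html.parser" is assumed.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Assumed address of the page the links were taken from.
BASE_URL = "https://www.baidu.com/"

# Hypothetical markup: an absolute link, a protocol-relative link,
# and an anchor that has no href at all.
html = (
    '<div><a href="http://news.baidu.com">新闻</a>'
    '<a href="//www.baidu.com/more/">更多产品</a>'
    '<a name="no-href">x</a></div>'
)
soup = BeautifulSoup(html, "html.parser")

links = []
for a in soup.select("a"):
    href = a.get("href")  # None when missing; a["href"] would raise KeyError
    if href is None:
        continue
    # Absolute URLs pass through unchanged; relative ones borrow from the base.
    links.append(urljoin(BASE_URL, href))

print(links)
```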

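Getting text

To get the text of the a tags above, use the get_text() method or the string attribute:

    from bs4 import BeautifulSoup

    # Read the sample page as bytes; Beautiful Soup handles the decoding.
    with open("./test03.html", "rb") as file:
        html = file.read()

    soup = BeautifulSoup(html, "lxml")
    # Print each link's text twice, once per approach.
    for a in soup.select('a'):
        print(a.get_text())
        print(a.string)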
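
For the links in test03.html the two approaches print the same text, but they differ once a tag contains nested tags. A minimal sketch on invented markup, assuming "html.parser":

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one simple link and one with a nested <b> tag.
html = "<div><a>新闻 </a><a>更多<b>产品</b></a></div>"
soup = BeautifulSoup(html, "html.parser")
simple, nested = soup.find_all("a")

simple_string = simple.string               # the single text child, space kept
simple_clean = simple.get_text(strip=True)  # surrounding whitespace removed

nested_string = nested.string    # None: the tag has more than one child
nested_text = nested.get_text()  # all nested text joined together

print(repr(simple_string), simple_clean, nested_string, nested_text)
```

get_text() is therefore the safer choice when the markup inside a tag is not known in advance.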