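Installation
bs4 (Beautiful Soup 4) is used to extract data from HTML or XML strings.
pip install bs4
Usage
Suppose, for example, that the following string has already been obtained: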
<html>
<head>
<title>
The Dormouse's story
</title>
</head>
<body>
<p class="title">
<b>
The Dormouse's story
</b>
</p>
<p class="story">
Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">
Elsie
</a>
,
<a class="sister" href="http://example.com/lacie" id="link2">
Lacie
</a>
and
<a class="sister" href="http://example.com/tillie" id="link3">
Tillie
</a>
; and they lived at the bottom of a well.
</p>
<p class="story">
...
</p>
</body>
</html>
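A string like this can be handed to BeautifulSoup directly, with no network request involved. The following is a minimal sketch under a few assumptions: the variable name `html_doc` is hypothetical, the markup is a shortened copy of the sample, and only the built-in "html.parser" is used.

```python
from bs4 import BeautifulSoup

# Shortened copy of the sample markup; `html_doc` is a hypothetical name.
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>;
and they lived at the bottom of a well.</p>
</body></html>
"""

soup = BeautifulSoup(html_doc, "html.parser")

title_text = soup.title.string  # text inside <title>
first_link = soup.a             # first <a> tag in the document

print(title_text)
print(first_link["href"], first_link["id"])
```

Accessing a tag name as an attribute (`soup.title`, `soup.a`) returns only the first matching tag, which is convenient for elements that occur once.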
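- Fetch the page content and parse it with bs4
import requests
from bs4 import BeautifulSoup
url = "http://example.com"  # placeholder URL (assumption); replace with the page to fetch
response = requests.get(url)
response.encoding = 'utf-8'  # decode response.text as UTF-8
# Available parsers:
# HTML parsers: "html.parser", "lxml", "html5lib"
# XML parsers: "lxml-xml", "xml" (both use lxml's XML parser)
soup = BeautifulSoup(response.text, "html.parser")
- Get the content of specific tags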
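# The links in the sample markup are <a> tags, so search for 'a', not 'div'
for tag in soup.find_all('a', class_='sister'):
    print(tag.attrs)  # attributes as a dict
    print(tag.string.strip())  # text content
# Get the <body> section and pretty-print it
print(soup.body.prettify())
# .strings is a generator over all text under the node, so wrap it in list() to print it;
# use stripped_strings instead to drop surrounding whitespace
print(list(soup.body.strings))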
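These calls can also be tried offline. The sketch below assumes a reduced fragment of the sample markup (the name `fragment` is hypothetical, and ids are omitted). The links in the sample are `<a>` tags, so it searches for "a" and then collects the link targets and the whitespace-free text.

```python
from bs4 import BeautifulSoup

# Reduced fragment of the sample markup; `fragment` is a hypothetical name.
fragment = """
<body>
<p class="story">
  <a class="sister" href="http://example.com/elsie"> Elsie </a>
  <a class="sister" href="http://example.com/lacie"> Lacie </a>
  <a class="sister" href="http://example.com/tillie"> Tillie </a>
</p>
</body>
"""

soup = BeautifulSoup(fragment, "html.parser")

# find_all returns every matching tag; class_ avoids the reserved word `class`.
links = soup.find_all("a", class_="sister")
hrefs = [a["href"] for a in links]         # attribute access works like a dict
names = [a.string.strip() for a in links]  # .string is the tag's single text child

# stripped_strings yields text nodes with surrounding whitespace removed
# and skips nodes that contain only whitespace.
texts = list(soup.body.stripped_strings)

print(hrefs)
print(names)
print(texts)
```

Because `.string` is `None` when a tag has more than one child, `stripped_strings` or `get_text()` tends to be the safer choice for tags with nested markup.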