Installation
Beautiful Soup (bs4) extracts data from HTML or XML strings.
pip install beautifulsoup4
Usage
For example, suppose you have already obtained the following string:
<html> <head> <title> The Dormouse's story </title> </head> <body> <p class="title"> <b> The Dormouse's story </b> </p> <p class="story"> Once upon a time there were three little sisters; and their names were <a class="sister" href="http://example.com/elsie" id="link1"> Elsie </a> , <a class="sister" href="http://example.com/lacie" id="link2"> Lacie </a> and <a class="sister" href="http://example.com/tillie" id="link3"> Tillie </a> ; and they lived at the bottom of a well. </p> <p class="story"> ... </p> </body> </html>
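- Fetch a web page and parse it with bs4
import requests
from bs4 import BeautifulSoup
url = "http://example.com/"  # placeholder; replace with the page you want to fetch
response = requests.get(url)
response.encoding = 'utf-8'  # decode the response body as UTF-8
# Available parsers:
# HTML parsers: "html.parser" (built in), "lxml", "html5lib" (the last two are installed separately)
# XML parsers: "lxml-xml", or its alias "xml" (both require lxml)
soup = BeautifulSoup(response.text, "html.parser")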
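The same parsing step also works on a string you already hold, with no network request. A minimal sketch, assuming a trimmed fragment of the sample document (the variable names `html_doc`, `title_text`, and `first_link` are illustrative only):

```python
from bs4 import BeautifulSoup

# Trimmed fragment of the sample document (an assumption made for brevity).
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters:
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a> and
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>.</p>
</body></html>
"""

# Parse the string directly with the built-in parser.
soup = BeautifulSoup(html_doc, "html.parser")

title_text = soup.title.string   # text of the first <title> tag
first_link = soup.a              # first <a> tag in the document
first_p_class = soup.p["class"]  # class is multi-valued, so this is a list

print(title_text)
print(first_link["href"])
print(first_p_class)
```

Attribute-style access such as `soup.title` or `soup.a` returns only the first matching tag; use `find_all` when every match is needed.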
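- Get the content of specific tags
for tag in soup.find_all('a', class_='sister'):  # every <a> tag with class "sister"
    print(tag.attrs)  # the tag's attributes as a dict
    print(tag.string.strip())  # the tag's text, with surrounding whitespace removed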
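Beyond `find_all`, individual attributes and CSS selectors are often convenient. The following is a rough sketch on a two-link fragment of the sample (the fragment and the variable names are assumptions for illustration, not part of the original notes):

```python
from bs4 import BeautifulSoup

# Small fragment modeled on the sample document (assumption).
html_doc = """
<p class="story">
<a class="sister" href="http://example.com/elsie" id="link1"> Elsie </a>,
<a class="sister" href="http://example.com/lacie" id="link2"> Lacie </a>
</p>
"""
soup = BeautifulSoup(html_doc, "html.parser")

links = soup.find_all("a", class_="sister")

# .get() returns None for a missing attribute instead of raising KeyError.
hrefs = [a.get("href") for a in links]

# get_text(strip=True) removes the surrounding whitespace from the text.
names = [a.get_text(strip=True) for a in links]

# Look up a single tag by its id attribute.
by_id = soup.find(id="link2")

# CSS selector: <a class="sister"> that is a direct child of <p class="story">.
selected = soup.select("p.story > a.sister")

print(hrefs)
print(names)
print(by_id["href"])
print(len(selected))
```

`find` returns the first match (or `None`), while `find_all` and `select` return lists, so the latter two are safe to iterate over even when nothing matches.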
# Get the <body> element and pretty-print it
print(soup.body.prettify())
print(list(soup.body.strings))  # all text strings under the node (.strings is a generator); use stripped_strings to skip whitespace
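The difference between `strings`, `stripped_strings`, and `get_text` is easiest to see side by side. A small sketch, assuming a cut-down body from the sample (the names `raw`, `clean`, and `joined` are illustrative):

```python
from bs4 import BeautifulSoup

# Cut-down body from the sample document (assumption), whitespace kept on purpose.
html_doc = """<body><p class="title"> <b> The Dormouse's story </b> </p><p class="story"> ... </p></body>"""
soup = BeautifulSoup(html_doc, "html.parser")

# .strings is a generator, so wrap it in list() to inspect the values.
raw = list(soup.body.strings)

# .stripped_strings strips each string and skips whitespace-only ones.
clean = list(soup.body.stripped_strings)

# get_text() joins the strings; strip=True applies the same stripping.
joined = soup.body.get_text(" ", strip=True)

print(raw)
print(clean)
print(joined)
```

`stripped_strings` suits cases where the individual text fragments matter, and `get_text` suits cases where one combined string is enough.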