BeautifulSoup

2020-03-25

Installation

BeautifulSoup extracts data from HTML or XML strings.

pip install bs4

Usage

For example, suppose you already have the following string:

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ; and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>
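  • Fetch the page content and parse it with bs4
import requests
from bs4 import BeautifulSoup

url = "http://example.com/"  # placeholder; replace with the page to fetch
response = requests.get(url)
response.encoding = 'utf-8'
# Available parsers:
# HTML parsers: "html.parser", "lxml", "html5lib"
# XML parsers: "lxml-xml", "xml"
soup = BeautifulSoup(response.text, "html.parser")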
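
The sample string shown earlier can also be parsed directly, without any network request. The following is a minimal sketch, assuming a shortened copy of that sample stored in a hypothetical variable `html_doc`; it uses the built-in "html.parser" so nothing beyond bs4 needs to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened stand-in for the sample document above.
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
</body></html>
"""

# Parse the string; "html.parser" is part of the Python standard library.
soup = BeautifulSoup(html_doc, "html.parser")

title_text = soup.title.string   # text inside the <title> tag
first_link = soup.a              # first <a> tag in the document

print(title_text)
print(first_link["href"])
```

Accessing a tag name as an attribute (`soup.title`, `soup.a`) returns only the first matching tag; `find_all` is the way to collect every match.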
  • Get the content of specific tags
for tag in soup.find_all('a', class_='sister'):
    print(tag.attrs)   # attributes as a dict
    print(tag.string.strip())  # text content

# Get the <body> section and pretty-print it
print(soup.body.prettify())
# .strings yields every text node under the tag; use stripped_strings to skip whitespace
print(list(soup.body.strings))
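
The tag lookups above can be tried end to end on a small string. This is only a sketch: `html_doc` is a hypothetical, shortened version of the sample, and `get_text(strip=True)` is used as an alternative to `.string.strip()`, since `.string` is `None` when a tag has more than one child.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened version of the sample document.
html_doc = """
<body><p class="story">Names:
<a class="sister" href="http://example.com/elsie" id="link1"> Elsie </a>,
<a class="sister" href="http://example.com/lacie" id="link2"> Lacie </a> and
<a class="sister" href="http://example.com/tillie" id="link3"> Tillie </a>.</p></body>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# class_ (with the trailing underscore) filters on the CSS class,
# because "class" is a reserved word in Python.
links = soup.find_all("a", class_="sister")

names = [a.get_text(strip=True) for a in links]   # text with whitespace trimmed
hrefs = [a["href"] for a in links]                # a single attribute value
classes = links[0]["class"]                       # multi-valued attribute: a list

# stripped_strings is a generator, so wrap it in list() to see the values.
body_text = list(soup.body.stripped_strings)

print(names)
print(hrefs)
print(body_text)
```

Both `strings` and `stripped_strings` are generators, which is why printing them directly shows a generator object rather than the text itself.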