from BeautifulSoup import BeautifulSoup # For processing HTML
from BeautifulSoup import BeautifulStoneSoup # For processing XML
import BeautifulSoup # To get everything
===== Creating a BeautifulSoup object =====
All you need to create a BeautifulSoup object is a string of HTML text.
The following code creates one:
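from BeautifulSoup import BeautifulSoup
import re  # needed by the regular-expression searches further down

doc = ['<html><head><title>PythonClub.org</title></head>',
       '<body><p id="firstpara" align="center">This is paragraph <b>one</b> of ptyhonclub.org.',
       '<p id="secondpara" align="blah">This is paragraph <b>two</b> of pythonclub.org.',
       '</html>']
soup = BeautifulSoup(''.join(doc))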
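For readers on Python 3, roughly the same object can be built with the newer bs4 package. The sketch below is an illustration rather than part of the original example: it assumes Beautiful Soup 4 is installed (imported as bs4), uses Python's built-in html.parser, and the markup is a hypothetical stand-in for the page above.

```python
from bs4 import BeautifulSoup  # assumption: Beautiful Soup 4, not the BeautifulSoup 3 module

# Hypothetical sample markup, modelled on the example page.
doc = [
    '<html><head><title>PythonClub.org</title></head>',
    '<body><p id="firstpara">This is paragraph <b>one</b> of pythonclub.org.</p>',
    '<p id="secondpara">This is paragraph <b>two</b> of pythonclub.org.</p>',
    '</body></html>',
]

# Naming the parser explicitly keeps the result the same across installations.
soup = BeautifulSoup("".join(doc), "html.parser")

print(soup.title)               # <title>PythonClub.org</title>
print(len(soup.find_all("p")))  # 2
```

Joining the list first mirrors the original example; passing a single string works just as well.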
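===== Finding specific elements in the HTML =====
BeautifulSoup lets you reach a specific HTML element directly with "." (dot) notation.
==== Searching by HTML tag: finding the title ====
Use soup.html.head.title to get the title tag, then read its name and its string value.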
>>> soup.html.head.title
<title>PythonClub.org</title>
>>> soup.html.head.title.name
u'title'
>>> soup.html.head.title.string
u'PythonClub.org'
You can also jump straight to the element with soup.title:
>>> soup.title
<title>PythonClub.org</title>
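The two ways of reaching the title can be compared side by side. This is a minimal sketch, assuming the Beautiful Soup 4 package (bs4) and a hypothetical one-paragraph page; the variable names full_path and shortcut are made up for illustration.

```python
from bs4 import BeautifulSoup  # assumption: Beautiful Soup 4

# Hypothetical markup with a single <title>.
html = "<html><head><title>PythonClub.org</title></head><body><p>Hello</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

full_path = soup.html.head.title  # walk down the tree one tag at a time
shortcut = soup.title             # first <title> found anywhere in the document

print(shortcut.name)              # title
print(shortcut.string)            # PythonClub.org
print(full_path == shortcut)      # True
print(soup.table)                 # None: there is no <table> in this page
```

Because a missing tag yields None, chaining further attribute access onto it (for example soup.table.tr) raises an AttributeError, so it is worth checking for None first.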
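==== Searching by content: finding the tag that contains a given string ====
The following example finds the text nodes that contain "para", then uses .parent to get the tag around the first match: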
>>> soup.findAll(text=re.compile("para"))
[u'This is paragraph ', u'This is paragraph ']
>>> soup.findAll(text=re.compile("para"))[0].parent
<p id="firstpara" align="center">This is paragraph <b>one</b> of ptyhonclub.org.</p>
>>> soup.findAll(text=re.compile("para"))[0].parent.contents
[u'This is paragraph ', <b>one</b>, u' of ptyhonclub.org.']
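The same text search can be sketched for Python 3, assuming the Beautiful Soup 4 package (bs4). There the method is spelled find_all, and the keyword is string (bs4 4.4.0 or later) rather than text. The markup and the names matches and first_tag are hypothetical.

```python
import re
from bs4 import BeautifulSoup  # assumption: Beautiful Soup 4.4.0 or later

# Hypothetical markup: two paragraphs, each split by a <b> tag.
html = (
    '<html><body>'
    '<p id="firstpara">This is paragraph <b>one</b> of pythonclub.org.</p>'
    '<p id="secondpara">This is paragraph <b>two</b> of pythonclub.org.</p>'
    '</body></html>'
)
soup = BeautifulSoup(html, "html.parser")

# The search returns text nodes, not tags.
matches = soup.find_all(string=re.compile("para"))
print([str(m) for m in matches])  # ['This is paragraph ', 'This is paragraph ']

# .parent climbs from a text node to the tag that encloses it.
first_tag = matches[0].parent
print(first_tag["id"])            # firstpara
print(len(first_tag.contents))    # 3: text, the <b> tag, text
```

Each match is only the fragment before the <b> tag, because the bold element splits the paragraph into separate text nodes.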
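==== Searching by attribute (such as a CSS id) ====
Pass the attribute name as a keyword argument, or pass a dictionary through attrs; both calls return the same tags.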
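soup.findAll(id=re.compile("para$"))
# [<p id="firstpara" align="center">This is paragraph <b>one</b> of ptyhonclub.org.</p>,
#  <p id="secondpara" align="blah">This is paragraph <b>two</b> of pythonclub.org.</p>]
soup.findAll(attrs={'id' : re.compile("para$")})
# [<p id="firstpara" align="center">This is paragraph <b>one</b> of ptyhonclub.org.</p>,
#  <p id="secondpara" align="blah">This is paragraph <b>two</b> of pythonclub.org.</p>]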
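The attribute search can be sketched the same way, assuming the Beautiful Soup 4 package (bs4). The pattern is applied as a search rather than a full match, so "para$" selects any id that ends in "para". The three-paragraph markup is hypothetical, with one id added that should not match.

```python
import re
from bs4 import BeautifulSoup  # assumption: Beautiful Soup 4

# Hypothetical markup: two ids end in "para", one only starts with it.
html = (
    '<p id="firstpara">one</p>'
    '<p id="secondpara">two</p>'
    '<p id="paragraph3">three</p>'
)
soup = BeautifulSoup(html, "html.parser")
pattern = re.compile("para$")

by_keyword = soup.find_all(id=pattern)           # attribute as a keyword argument
by_attrs = soup.find_all(attrs={"id": pattern})  # attribute in a dictionary

ids = [tag["id"] for tag in by_keyword]
print(ids)                                       # ['firstpara', 'secondpara']
print([tag["id"] for tag in by_attrs] == ids)    # True
```

The attrs form is the one to reach for when the attribute name cannot be written as a Python keyword argument, such as class or a data-* attribute.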
===== Going deeper with BeautifulSoup =====
* [[modules:beautifulsoup:encode|BeautifulSoup and encodings]]
* [[modules:beautifulsoup:tricks|BeautifulSoup tips and tricks]]