Capturing ids with XPath in Python from a URL's source

Imagine I have content such as:

cont="""<a id="test1" class="SSSS" title="DDDD" href="AAAA">EXAMPLE1</a>.....<a id="test2" class="GGGG" title="ZZZZ" href="VVVV">EXAMPLE2</a>.... 
"""

What I want:

id1='test1' 
id2='test2' 
idn='testn'

Can you correct me?

if '<a id=' in cont: 
    ....?

Do I have to use a regular expression in Python, or is there a way to grab them with XPath?

Note: I only want the ids from the <a> tags.


2014-11-06 MLSC

Why not use something like BeautifulSoup or lxml? – 2014-11-06 08:11:35

BeautifulSoup does seem to be a simple way to do this: http://www.crummy.com/software/BeautifulSoup/bs4/doc/ – 2014-11-06 08:12:43

@Vincent Beltman If you know a reliable method, it would be welcome... – MLSC 2014-11-06 08:12:45

Download BS4 here: http://www.crummy.com/software/BeautifulSoup/

Documentation: http://www.crummy.com/software/BeautifulSoup/bs4/doc/

This should work:

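from bs4 import BeautifulSoup

# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(cont, 'html.parser')
for a in soup.select('a'):  # Or soup.find_all('a') if you prefer
    # Skip <a> tags that have no id attribute
    if a.get('id') is not None:
        print(a.get('id'))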

Or use a list comprehension to get a list:

ids = [a.get('id') for a in BeautifulSoup(cont, 'html.parser').select('a') if a.get('id') is not None]
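The question also asks whether XPath can do this. Here is a minimal sketch, assuming the third-party lxml package is installed (it is mentioned in the comments but not used in this answer). The expression //a/@id selects the id attribute of every <a> element that has one, so tags without an id are skipped automatically.

```python
from lxml import html  # assumption: lxml is installed (pip install lxml)

cont = """<a id="test1" class="SSSS" title="DDDD" href="AAAA">EXAMPLE1</a>.....<a id="test2" class="GGGG" title="ZZZZ" href="VVVV">EXAMPLE2</a>....
"""

# Parse the HTML fragment into an element tree
tree = html.fromstring(cont)

# //a/@id returns the id attribute values of all <a> elements that have one
ids = [str(value) for value in tree.xpath("//a/@id")]
print(ids)
```

Fetching the page from a URL is left out here. The sketch only covers extracting the ids once the HTML text is in a string.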


2014-11-06 08:15:42

`html` should be changed to `cont`. I did: `soup = BeautifulSoup(cont); for ids in soup.find_all('a'): print(ids.get('id'))` and it works well – MLSC 2014-11-06 08:18:15

@MortezaLSC But it only shows the values 'test1', 'test2', not `id1='test1'` – 2014-11-06 08:19:56

@Avinash Raj, thank you... no problem, I think I should put them into a list and use them – MLSC 2014-11-06 08:22:26
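If the id1, id2, ... idn names from the question are really needed, one possible approach is a dictionary rather than separate variables. This is only a sketch. The key names "id1", "id2" are hypothetical labels mirroring the question, and the input is a shortened copy of `cont`.

```python
from bs4 import BeautifulSoup

# Shortened copy of the question's content
cont = ('<a id="test1" class="SSSS" href="AAAA">EXAMPLE1</a>.....'
        '<a id="test2" class="GGGG" href="VVVV">EXAMPLE2</a>....')

soup = BeautifulSoup(cont, "html.parser")

# id=True keeps only the <a> tags that actually carry an id attribute
ids = [a["id"] for a in soup.find_all("a", id=True)]

# Number the values from 1 to build the hypothetical keys id1, id2, ...
named = {"id%d" % n: value for n, value in enumerate(ids, start=1)}
print(named)
```

A dictionary keeps the number of ids open-ended, which separate variables cannot do when n is not known in advance.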

With a list comprehension and BeautifulSoup:

>>> from bs4 import BeautifulSoup
>>> cont="""<a id="test1" class="SSSS" title="DDDD" href="AAAA">EXAMPLE1</a>.....<a id="test2" class="GGGG" title="ZZZZ" href="VVVV">EXAMPLE2</a>.... 
""" 
>>> soup = BeautifulSoup(cont, 'html.parser')
>>> [i.get('id') for i in soup.find_all('a') if i.get('id') is not None]
['test1', 'test2']
>>> # i['id'] raises KeyError on a tag without an id, so select only tags that have one
>>> [i['id'] for i in soup.find_all('a', id=True)]
['test1', 'test2']


2014-11-06 08:26:03

But there is a problem...! How can I exclude the ids that are of None type...? Just print test1 and test2? My result right now is: `['test1', 'None', 'test2', 'None']` – MLSC 2014-11-06 08:57:17

What if you try this: `[i['id'] for i in soup.findAll('a') if i['id'] != 'None']` – 2014-11-06 09:01:25

It returns an error. So I changed it to: `print [i.get('id') for i in soup.findAll('a') if i.get['id'] != 'None']` ... error – MLSC 2014-11-06 09:04:04
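A likely reason for the errors in the comments above: `i['id']` raises KeyError on a tag that has no id, `i.get['id']` subscripts the method instead of calling it, and comparing against the string 'None' never matches the None object that `.get()` returns. The sketch below illustrates this on a hypothetical input that mixes anchors with and without an id.

```python
from bs4 import BeautifulSoup

# Hypothetical input: two anchors with an id, two without
html_text = (
    '<a id="test1" href="AAAA">EXAMPLE1</a>'
    '<a href="BBBB">NO ID</a>'
    '<a id="test2" href="VVVV">EXAMPLE2</a>'
    '<a href="CCCC">NO ID</a>'
)
soup = BeautifulSoup(html_text, "html.parser")

# .get() returns the None object (not the string 'None') for a missing attribute
all_values = [a.get("id") for a in soup.find_all("a")]

# Comparing against None itself drops the anchors without an id
ids = [a.get("id") for a in soup.find_all("a") if a.get("id") is not None]

# Subscript access on a tag without that attribute raises KeyError
try:
    [a["id"] for a in soup.find_all("a")]
    raised = False
except KeyError:
    raised = True

print(all_values, ids, raised)
```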
