Extracting web page elements from a website using Python

I want to extract various elements from the tables and paragraph text on this website.

https://www.instituteforsupplymanagement.org/about/MediaRoom/newsreleasedetail.cfm?ItemNumber=30655

Here is the code I am using:

from lxml import etree
from urllib.request import urlopen  # Python 3 replacement for urllib2

source = urlopen('https://www.instituteforsupplymanagement.org/about/MediaRoom/newsreleasedetail.cfm?ItemNumber=30656&SSO=1').read()
x = etree.HTML(source)
# Position-based XPath: single quotes outside so the inner double quotes are legal
growth = x.xpath('//*[@id="home_feature_container"]/div/div[2]/div/table[2]/tbody/tr[3]/td[2]/p')
print(growth)

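What is the best way to extract the elements I want from a website without having to change the XPath every time? They publish new data on the same site every month, but the XPath sometimes changes slightly.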

Source

2017-02-26 prashanth manohar

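Can you give an example of an element you want? Your XPath is invalid and cannot be tested on this page.

I changed the XPath. I need the elements in the "Manufacturing at a Glance" table, and also the paragraph text.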

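If the position of the items changes often, try retrieving them by name instead. For example, here is how to extract the elements from the "New Orders" row of the table.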

import requests  # better than urllib
from lxml import html, etree

url = 'https://www.instituteforsupplymanagement.org/about/MediaRoom/newsreleasedetail.cfm?ItemNumber=30655&SSO=1'
page = requests.get(url)
tree = html.fromstring(page.content)

# Anchor on the row label rather than on the row's position in the table
neworders = tree.xpath('//strong[text()="New Orders"]/../../following-sibling::td/p/text()')

print(neworders)

Or, if you want the whole HTML table:

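# Anchor on the header text, then climb two levels up from the <th>
data = tree.xpath('//th[text()="MANUFACTURING AT A GLANCE"]/../..')

for element in data:
    # encoding="unicode" returns str instead of bytes, so it prints cleanly on Python 3
    print(etree.tostring(element, pretty_print=True, encoding="unicode"))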
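To try the label-based lookup without hitting the live site, here is a minimal sketch that runs against an inline sample. The markup and the numbers in `SAMPLE` are made up for illustration (the real page may be structured differently), and `row_values` is a hypothetical helper name, not part of lxml.

```python
from lxml import html

# Hypothetical sample markup with made-up values; the real page may differ.
SAMPLE = """
<html><body>
<table>
  <tr><th colspan="3">MANUFACTURING AT A GLANCE</th></tr>
  <tr><td><p><strong>PMI</strong></p></td><td><p>56.0</p></td><td><p>54.5</p></td></tr>
  <tr><td><p><strong>New Orders</strong></p></td><td><p>60.4</p></td><td><p>60.3</p></td></tr>
</table>
</body></html>
"""


def row_values(tree, label):
    """Return the text of the cells that follow the cell labelled `label`."""
    # The label is passed as an XPath variable ($label), which avoids
    # quoting problems when the label itself contains quotes.
    cells = tree.xpath(
        '//td[normalize-space(.)=$label]/following-sibling::td',
        label=label,
    )
    return [cell.text_content().strip() for cell in cells]


tree = html.fromstring(SAMPLE)
new_orders = row_values(tree, "New Orders")
print(new_orders)
```

Because the expression matches on the label text and not on `tr[3]/td[2]`, it should keep working if rows are added or reordered, as long as the label itself stays the same.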

Using BeautifulSoup:

from bs4 import BeautifulSoup 
import requests 

url = "https://www.instituteforsupplymanagement.org/about/MediaRoom/newsreleasedetail.cfm?ItemNumber=30655&SSO=1" 

content = requests.get(url).content 

soup = BeautifulSoup(content, "lxml") 

table = soup.find_all('table')[1] 

table_body = table.find('tbody') 

data = []
rows = table_body.find_all('tr')
for row in rows:
    cols = row.find_all('td')
    cols = [ele.text.strip() for ele in cols]
    data.append([ele for ele in cols if ele])  # drop empty cells

print(data)
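Picking the table with `find_all('table')[1]` depends on the table's position, which is the kind of thing that can shift from month to month. One possible alternative, sketched below on made-up sample markup, is to locate the table by its header text and key each row by its first cell. `find_table_by_heading` and `table_to_dict` are hypothetical helper names, and the built-in `html.parser` is used here only so the sketch has no extra dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup with made-up values; the real page may differ.
SAMPLE = """
<html><body>
<table><tr><td>Navigation</td><td>Links</td></tr></table>
<table>
  <tr><th colspan="3">MANUFACTURING AT A GLANCE</th></tr>
  <tr><td><p><strong>PMI</strong></p></td><td><p>56.0</p></td><td><p>54.5</p></td></tr>
  <tr><td><p><strong>New Orders</strong></p></td><td><p>60.4</p></td><td><p>60.3</p></td></tr>
</table>
</body></html>
"""


def find_table_by_heading(soup, heading):
    """Return the first <table> that has a <th> matching `heading`, else None."""
    for th in soup.find_all("th"):
        if th.get_text(strip=True).upper() == heading.upper():
            return th.find_parent("table")
    return None


def table_to_dict(table):
    """Map each row's first cell to the list of its remaining cells."""
    result = {}
    for row in table.find_all("tr"):
        cells = [td.get_text(strip=True) for td in row.find_all("td")]
        if len(cells) > 1:  # skip header-only or single-cell rows
            result[cells[0]] = cells[1:]
    return result


soup = BeautifulSoup(SAMPLE, "html.parser")
table = find_table_by_heading(soup, "Manufacturing at a Glance")
glance = table_to_dict(table)
print(glance["New Orders"])
```

Looking rows up by name (`glance["New Orders"]`) means a reordered table does not silently return the wrong row; a renamed label raises a `KeyError` instead.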

Source

2017-02-26 01:50:39

Hey Ettore, there is a small problem. I have described it here: http://stackoverflow.com/q/42592948/4399016 Thanks!

BeautifulSoup to the rescue:

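from bs4 import BeautifulSoup
from urllib.request import urlopen  # Python 3 replacement for urllib2

r = urlopen('https://www.instituteforsupplymanagement.org/about/MediaRoom/newsreleasedetail.cfm?ItemNumber=30655')
soup = BeautifulSoup(r, 'lxml')  # name the parser explicitly
# find() takes (name, attrs, recursive, ...), so a third positional 'h4' would not filter by tag;
# locate the container first, then search inside it.
container = soup.find('div', {'id': 'home_feature_container'})
container.find('h4')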

This code is a starting point toward what you describe. If you use soup.find().contents, you get a list of every item contained in the element.
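To make the `.contents` remark concrete, here is a small sketch on made-up markup. The div id matches the one used above, but the headline and paragraph text are placeholders, not content from the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical sample; only the div id is taken from the thread.
SAMPLE = (
    '<div id="home_feature_container">'
    '<h4>Report headline</h4><p>First paragraph.</p>'
    '</div>'
)

soup = BeautifulSoup(SAMPLE, "html.parser")
container = soup.find("div", {"id": "home_feature_container"})

# .contents is a list of the element's direct children.
children = container.contents
child_names = [child.name for child in children]
texts = [child.get_text(strip=True) for child in children]

print(child_names)
print(texts)
```

On real pages `.contents` usually also includes whitespace-only strings between tags, so some filtering is typically needed before the list is useful.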

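As for accounting for changes to the page, it really depends. If the changes are large, you will have to change the soup.find() call. Otherwise, you may be able to write code that is generic enough to always apply. (For example, if the div called home_feature_container is always present, you will never have to change that part.)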
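One way to read "generic enough" is: depend only on the stable container id, ignore how deeply the content is nested inside it, and fail loudly if the container disappears. The sketch below assumes exactly that; `find_container` and `paragraphs` are hypothetical helper names, and the sample text is made up.

```python
from bs4 import BeautifulSoup

# Hypothetical sample; nesting depth inside the container is arbitrary.
SAMPLE = """
<div id="home_feature_container">
  <div><div>
    <h4>Headline</h4>
    <p>Economic activity <strong>expanded</strong> this month.</p>
    <p>   </p>
  </div></div>
</div>
"""


def find_container(soup, container_id="home_feature_container"):
    """Return the div with the given id, or raise if it is missing."""
    container = soup.find("div", id=container_id)
    if container is None:
        raise LookupError(
            f"div#{container_id} not found; the page layout may have changed"
        )
    return container


def paragraphs(soup):
    """Return the non-empty paragraph texts inside the container, at any depth."""
    container = find_container(soup)
    found = []
    for p in container.find_all("p"):
        text = p.get_text(" ", strip=True)
        if text:  # skip paragraphs that hold only whitespace
            found.append(text)
    return found


soup = BeautifulSoup(SAMPLE, "html.parser")
texts = paragraphs(soup)
print(texts)
```

Raising an error when the container is missing is a deliberate choice here: a monthly scraper that returns an empty list without complaint is harder to notice than one that stops with a clear message.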

Source

2017-02-26 01:20:15 celestialroad

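Hi, could you show a code example that returns some values? There is a table, "Manufacturing at a Glance". Could you show some of the elements being extracted and displayed with your technique? Many thanks!!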
