How to scrape data from an HTML table in Python


<tr class="even"> 
 
<td><strong><a href='../eagleweb/viewDoc.jsp?node=DOC186S8881'>DEED<br/> 
 
2016002023</a></strong></td> 
 
<td><a href='../eagleweb/viewDoc.jsp?node=DOC186S8881'><b> Recording Date: </b>01/12/2016 08:05:17 AM&nbsp;&nbsp;&nbsp;<b>Book Page: </b> <table cellspacing=0 width="100%"><tr><td width="50%" valign="top"><b>Grantor:</b> ARELLANO ISAIAS</td><td width="50%" valign="top"><b>Grantee:</b> ARELLANO ISAIAS, ARELLANO ALICIA</td></tr></table> 
 
<b>Number Pages:</b> 3<br></a></td> 
 
<td></td> 
 
<td></td></tr>

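I am new to Python and scraping. Please help me extract the data from this table. To log in, go to the public login page and then enter the from and to dates. Data model: the data model has columns in the following specific order: "record_date", "doc_number", "doc_type", "role", "name", "apn", "transfer_amount", "county", and "state". The "role" column is either "Grantor" or "Grantee", depending on where the name is assigned. If the grantor or grantee has multiple names, give each name its own row and copy the recording date, document number, document type, role, and apn.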

https://crarecords.sonomacounty.ca.gov/recorder/eagleweb/docSearchResults.jsp?searchId=0

Source

2017-03-17 Vishal Gahlot

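I want to extract these fields. Data model: the data model has columns in the following specific order: "record_date", "doc_number", "doc_type", "role", "name", "apn", "transfer_amount", "county", and "state". The "role" column is either "Grantor" or "Grantee", depending on where the name is assigned. If the grantor or grantee has multiple names, give each name its own row and copy the recording date, document number, document type, role, and apn. If you have questions about how to structure the CSV results, please ask me. –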

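This looks like a secure site that requires credentials. All I get is "You must be logged in to access the requested page." Could you copy the HTML table into your question? – davedwards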

OK, wait, I will post a screenshot –

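The HTML you posted does not contain all of the column fields listed in the data model. For the fields it does contain, the following builds a Python dictionary from which you can fill in the data model: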

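import urllib.request
from pprint import pprint

from bs4 import BeautifulSoup

url = "the_url_of_webpage_to_scrape"  # Replace with the URL of your webpage

with urllib.request.urlopen(url) as response:
    html = response.read()

soup = BeautifulSoup(html, 'html.parser')

# The first result row (<tr class="even">), as posted in the question
table = soup.find("tr", attrs={"class": "even"})

# Labels taken from the <b> tags, e.g. "Recording Date", "Grantor", "Grantee"
btags = [str(b.text).strip().strip(':') for b in table.find_all("b")]

# Text that directly follows each <b> tag, with non-breaking spaces removed
bsibs = [str(b.next_sibling.replace(u'\xa0', '')).strip() for b in table.find_all('b')]

# Map each label to the value that follows it
data = dict(zip(btags, bsibs))

data_model = {"record_date": None, "doc_number": None, "doc_type": None, "role": None, "name": None, "apn": None, "transfer_amount": None, "county": None, "state": None}

data_model["record_date"] = data['Recording Date']
# Note: this stores the grantee names under 'role'; the data model in the
# question would put "Grantee" in 'role' and each name in 'name'
data_model['role'] = data['Grantee']

# pprint sorts the keys and prints one entry per line
pprint(data_model)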

Output:

{'apn': None, 
'county': None, 
'doc_number': None, 
'doc_type': None, 
'name': None, 
'record_date': '01/12/2016 08:05:17 AM', 
'role': 'ARELLANO ISAIAS, ARELLANO ALICIA', 
'state': None, 
'transfer_amount': None}
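
The `doc_number` and `doc_type` entries are still `None` in that output, although the first cell of the posted row appears to hold both values ("DEED", a line break, then "2016002023"). A minimal sketch of reading them, assuming the first text fragment in that cell is always the document type and the second is always the document number (the posted HTML shows only one row, so this is not confirmed for other rows):

```python
from bs4 import BeautifulSoup

# Trimmed copy of the first cell from the row posted in the question.
html = """
<tr class="even">
<td><strong><a href='../eagleweb/viewDoc.jsp?node=DOC186S8881'>DEED<br/>
2016002023</a></strong></td>
<td></td>
</tr>
"""

soup = BeautifulSoup(html, "html.parser")
row = soup.find("tr", attrs={"class": "even"})

# The link text is "DEED", a <br/>, then "2016002023". stripped_strings
# yields each non-empty text fragment with surrounding whitespace removed.
first_cell = row.find("td")
parts = list(first_cell.stripped_strings)

# Assumption: fragment 0 is the document type, fragment 1 the document number.
doc_type, doc_number = parts[0], parts[1]

print(doc_type, doc_number)
```

These two values could then be assigned to `data_model['doc_type']` and `data_model['doc_number']`.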

With this, you can do:

print(data_model['record_date']) # 01/12/2016 08:05:17 AM 
print(data_model['role'])  # ARELLANO ISAIAS, ARELLANO ALICIA
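
The question also asks for one row per name, with "Grantor" or "Grantee" in the `role` column and the other fields repeated. A rough sketch of that step using only the standard library follows. The helper name `expand_rows` is hypothetical, the names are assumed to be comma-separated as in the posted row, and the `doc_type` and `doc_number` values are copied by hand from the first cell of that row. Fields that are not in the posted HTML (`apn`, `transfer_amount`, `county`, `state`) are left empty.

```python
import csv
import io

# Values as they appear in the row posted in the question.
data = {
    "Recording Date": "01/12/2016 08:05:17 AM",
    "Grantor": "ARELLANO ISAIAS",
    "Grantee": "ARELLANO ISAIAS, ARELLANO ALICIA",
}

# Column order required by the data model in the question.
COLUMNS = ["record_date", "doc_number", "doc_type", "role", "name",
           "apn", "transfer_amount", "county", "state"]


def expand_rows(data, doc_type, doc_number):
    """Return one dict per grantor/grantee name (hypothetical helper)."""
    rows = []
    for role in ("Grantor", "Grantee"):
        # Assumption: multiple names are separated by commas.
        for name in data.get(role, "").split(","):
            name = name.strip()
            if not name:
                continue
            row = dict.fromkeys(COLUMNS, "")  # unknown fields stay empty
            row.update(
                record_date=data["Recording Date"],
                doc_number=doc_number,
                doc_type=doc_type,
                role=role,
                name=name,
            )
            rows.append(row)
    return rows


rows = expand_rows(data, doc_type="DEED", doc_number="2016002023")

# Write to an in-memory buffer here; to save a file instead, pass a file
# object opened with open(path, "w", newline="").
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=COLUMNS)
writer.writeheader()
writer.writerows(rows)

print(buffer.getvalue())
```

For the posted row this gives three data rows: one for the grantor and two for the grantees.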

Hope this helps.

Source

2017-03-18 02:26:05 davedwards

Thanks @downshift :) –
