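Getting the title from link tags in HTML using BeautifulSoup

I am extracting data from https://data.gov.au/dataset?organization=reservebankofaustralia&_groups_limit=0&groups=business and I get the output I want, except that the text is truncated: I get "Business Support..." and "Reserve Bank of Australia..." instead of the full text. I want to print the whole text, not the "......." version. I replaced lines 9 and 10 in jezrael's answer (see "Fetching content from html and write fetched content in a specific format in CSV") with this code:

    org = soup.find_all('a', {'class':'nav-item active'})[0].get('title')
    groups = soup.find_all('a', {'class':'nav-item active'})[1].get('title')

Running it on its own raises an error: list index out of range. What should I use to extract the complete text? I also tried:

    org = soup.find_all('span',class_="filtered pill")

which returned a string-type result when I ran it separately, but it does not work inside the whole code.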
Answers
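Entries with longer text keep the full text in the title attribute; for entries with shorter text the title attribute is empty. So add two if checks: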
import urllib.request

import pandas as pd
from bs4 import BeautifulSoup

# webpage_urls is the list of page URLs built in the question's code
dfs = []  # collects one DataFrame per page
for i in webpage_urls:
    wiki2 = i
    page = urllib.request.urlopen(wiki2)
    soup = BeautifulSoup(page, "lxml")
    lobbying = {}
    # always only 2 active li, so select first by [0] and second by [1]
    l = soup.find_all('li', class_="nav-item active")
    # full text is in the title attribute; if it is empty, use the span text
    org = l[0].a.get('title')
    if org == '':
        org = l[0].span.get_text()
    groups = l[1].a.get('title')
    if groups == '':
        groups = l[1].span.get_text()
    data2 = soup.find_all('h3', class_="dataset-heading")
    for element in data2:
        lobbying[element.a.get_text()] = {}
    prefix = "https://data.gov.au"
    for element in data2:
        lobbying[element.a.get_text()]["link"] = prefix + element.a["href"]
        lobbying[element.a.get_text()]["Organisation"] = org
        lobbying[element.a.get_text()]["Group"] = groups
    #print(lobbying)
    df = pd.DataFrame.from_dict(lobbying, orient='index') \
           .rename_axis('Titles').reset_index()
    dfs.append(df)

df = pd.concat(dfs, ignore_index=True)
df1 = df.drop_duplicates(subset='Titles').reset_index(drop=True)
# remove "(number)" from the labels; newer pandas needs regex=True for patterns
df1['Organisation'] = df1['Organisation'].str.replace(r'\(\d+\)', '', regex=True)
df1['Group'] = df1['Group'].str.replace(r'\(\d+\)', '', regex=True)
print(df1.head())
Titles \
0 Banks – Assets
1 Consolidated Exposures – Immediate and Ultimat...
2 Foreign Exchange Transactions and Holdings of ...
3 Finance Companies and General Financiers – Sel...
4 Liabilities and Assets – Monthly
link \
0 https://data.gov.au/dataset/banks-assets
1 https://data.gov.au/dataset/consolidated-expos...
2 https://data.gov.au/dataset/foreign-exchange-t...
3 https://data.gov.au/dataset/finance-companies-...
4 https://data.gov.au/dataset/liabilities-and-as...
Organisation Group
0 Reserve Bank of Australia Business Support and Regulation
1 Reserve Bank of Australia Business Support and Regulation
2 Reserve Bank of Australia Business Support and Regulation
3 Reserve Bank of Australia Business Support and Regulation
4 Reserve Bank of Australia Business Support and Regulation
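
The last two assignments in the code above remove a parenthesised number from each label. Here is a minimal sketch of what that pattern does, written with the standard re module so it runs without pandas. The sample labels and the helper name strip_count are made up for illustration, and the extra .strip() is my addition to drop the space left behind; it is not in the answer's code.

```python
import re

# Same pattern the answer passes to Series.str.replace.
COUNT_PATTERN = re.compile(r"\(\d+\)")


def strip_count(label: str) -> str:
    """Remove a parenthesised number such as "(12)" and trim leftover spaces."""
    return COUNT_PATTERN.sub("", label).strip()


# Hypothetical sample labels; the real counts on the page may differ.
samples = [
    "Reserve Bank of Australia (12)",
    "Business Support and Regulation (7)",
    "No count here",
]
cleaned = [strip_count(s) for s in samples]
print(cleaned)
```

Without the .strip() call, the first two labels would keep a trailing space, which is what the pandas version in the answer leaves in the column.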
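
I think this is what you are trying to do. The links here carry a title attribute, so I simply check whether a link has a title attribute and, if it does, print it.
The output contains blank lines because a few links have title="". You can skip those with a conditional statement and keep only the non-empty titles.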
>>> l = soup.find_all('a')
>>> for i in l:
...     if i.has_attr('title'):
...         print(i['title'])
...
Remove
Remove
Reserve Bank of Australia
Business Support and Regulation
Creative Commons Attribution 3.0 Australia
>>>
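
The conditional mentioned above could look roughly like this. It is a sketch on a hand-written HTML snippet, not the real page: the markup, the hrefs and the variable names are assumptions, and html.parser is used only so the example needs no extra parser.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one empty title, two real ones, one link without title.
html = """
<a href="/x" title="">Remove</a>
<a href="/org" title="Reserve Bank of Australia">Reserve Bank of...</a>
<a href="/grp" title="Business Support and Regulation">Business Support...</a>
<a href="/plain">No title attribute</a>
"""

soup = BeautifulSoup(html, "html.parser")

# title=True keeps only <a> tags that have a title attribute at all;
# the if clause then drops titles that are empty or whitespace only.
titles = [a["title"] for a in soup.find_all("a", title=True) if a["title"].strip()]
print(titles)
```

Collecting the values in a list instead of printing them makes it easier to pick the organisation and group by position afterwards.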
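Thank you very much. Can you explain what the logic "if org == '':" is doing? – Arti123
It checks whether the title attribute in the HTML is empty. For some items it is, so the check is needed; if you omit it, you will not get the text for those items. – jezrael
@jezrael :), have a nice day! – Arti123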
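
The check discussed in these comments can be tried on a small hand-written snippet. This is only a sketch: the markup and the helper name label are hypothetical. It uses "if not title" rather than "== ''", a slight broadening so that a missing title attribute (for which .get returns None) falls back to the span text as well.

```python
from bs4 import BeautifulSoup

# Hypothetical markup imitating the three cases: full title, empty title,
# and no title attribute at all.
html = """
<ul>
  <li class="nav-item active"><a href="#" title="Reserve Bank of Australia"><span>Reserve Bank of...</span></a></li>
  <li class="nav-item active"><a href="#" title=""><span>Business</span></a></li>
  <li class="nav-item active"><a href="#"><span>Health</span></a></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")


def label(li):
    """Return the link's title, or the visible span text if title is empty or missing."""
    title = li.a.get("title")  # '' for title="", None when the attribute is absent
    if not title:
        return li.span.get_text()
    return title


items = soup.find_all("li", class_="nav-item active")
labels = [label(li) for li in items]
print(labels)
```

The first item returns its title because the visible text is truncated, while the other two return the span text because there is no usable title.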