2017-07-07 96 views
1

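I extracted data from https://data.gov.au/dataset?organization=reservebankofaustralia&_groups_limit=0&groups=business and got the output I wanted. The problem now is that the values I get are truncated, "Business Support..." and "Reserve Bank of Australia...", and I want to print the whole text, not "...". I replaced lines 9 and 10 in jezrael's answer (see Fetching content from html and write fetched content in a specific format in CSV) with the code org = soup.find_all('a', {'class':'nav-item active'})[0].get('title') groups = soup.find_all('a', {'class':'nav-item active'})[1].get('title') . When I run it on its own I get the error: list index out of range. What should I use to extract the full sentence? I also tried: org = soup.find_all('span',class_="filtered pill"), which gives a string-type answer when I run it on its own but does not work with the whole code.

Get the title from a link tag in HTML using BeautifulSoup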

Answers

1

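For the longer labels the full text is stored in the title attribute; for the shorter ones the title is empty and the full text is in the link's span. So add two if checks: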

import urllib.request

import pandas as pd
from bs4 import BeautifulSoup

prefix = "https://data.gov.au"
dfs = []

# webpage_urls is assumed to be the list of page URLs built earlier in the asker's script
for url in webpage_urls:
    page = urllib.request.urlopen(url)
    soup = BeautifulSoup(page, "lxml")

    # always only 2 active li, so select first by [0] and second by [1]
    l = soup.find_all('li', class_="nav-item active")

    # the full text is in the title attribute; if it is empty, fall back to the span text
    org = l[0].a.get('title')
    if org == '':
        org = l[0].span.get_text()

    groups = l[1].a.get('title')
    if groups == '':
        groups = l[1].span.get_text()

    # one entry per dataset heading on this page
    lobbying = {}
    data2 = soup.find_all('h3', class_="dataset-heading")
    for element in data2:
        lobbying[element.a.get_text()] = {
            "link": prefix + element.a["href"],
            "Organisation": org,
            "Group": groups,
        }

    # build one DataFrame per page, after all of its datasets are collected
    df = pd.DataFrame.from_dict(lobbying, orient='index') \
           .rename_axis('Titles').reset_index()
    dfs.append(df)

df = pd.concat(dfs, ignore_index=True)
df1 = df.drop_duplicates(subset='Titles').reset_index(drop=True)

# strip the parenthesised counts from the labels; regex=True is required in recent pandas
df1['Organisation'] = df1['Organisation'].str.replace(r'\(\d+\)', '', regex=True)
df1['Group'] = df1['Group'].str.replace(r'\(\d+\)', '', regex=True)

print(df1.head())

                                              Titles  \
0                                     Banks – Assets
1  Consolidated Exposures – Immediate and Ultimat...
2  Foreign Exchange Transactions and Holdings of ...
3  Finance Companies and General Financiers – Sel...
4                   Liabilities and Assets – Monthly

                                                link  \
0           https://data.gov.au/dataset/banks-assets
1  https://data.gov.au/dataset/consolidated-expos...
2  https://data.gov.au/dataset/foreign-exchange-t...
3  https://data.gov.au/dataset/finance-companies-...
4  https://data.gov.au/dataset/liabilities-and-as...

                Organisation                            Group
0  Reserve Bank of Australia  Business Support and Regulation
1  Reserve Bank of Australia  Business Support and Regulation
2  Reserve Bank of Australia  Business Support and Regulation
3  Reserve Bank of Australia  Business Support and Regulation
4  Reserve Bank of Australia  Business Support and Regulation
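
To make the fallback easier to follow, here is a minimal sketch run against a small inline HTML snippet. The markup is hypothetical, modelled only on what the selectors in this answer imply (the real data.gov.au page may differ), and the helper name full_label is made up for illustration. It also shows why the asker's selector raised "list index out of range": the class appears to sit on the li element, not on the a element, so searching a tags by that class finds nothing.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, modelled on what the answer's selectors imply;
# the real page may be structured differently.
html = """
<ul>
  <li class="nav-item active">
    <a href="#" title=""><span>Reserve Bank of Australia (12)</span></a>
  </li>
  <li class="nav-item active">
    <a href="#" title="Business Support and Regulation"><span>Business Support... (12)</span></a>
  </li>
</ul>
"""


def full_label(li):
    """Return the link's title, or the span text when the title is missing or empty."""
    title = li.a.get('title')
    return title if title else li.span.get_text()


soup = BeautifulSoup(html, "html.parser")

# The class is on the <li> elements, so this finds both active items.
items = soup.find_all('li', class_="nav-item active")
labels = [full_label(li) for li in items]
print(labels)

# Searching <a> tags by the same class finds nothing here,
# so indexing the result with [0] would raise IndexError.
no_links = soup.find_all('a', {'class': 'nav-item active'})
print(len(no_links))
```

Using "if title" instead of "== ''" also covers the case where the attribute is absent and get('title') returns None.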
+0

Thank you very much. Could you explain what the logic "if org == '':" is doing? – Arti123

+1

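If you inspect the HTML, some of the title attributes are empty, and that is the problem, so the if is needed. If you omit it, you get no text for those items. – jezrael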

+0

@jezrael, :), have a nice day! – Arti123

1

I think this is what you are trying to do. Every link here has a title attribute, so I simply check whether a link has a title attribute and, if it does, print it.

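There are blank lines in the output because several links have title="". You can skip those with a conditional statement and then collect all the remaining titles.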

>>> l = soup.find_all('a') 
>>> for i in l: 
...  if i.has_attr('title'): 
...    print(i['title']) 
... 
Remove 
Remove 
Reserve Bank of Australia 

Business Support and Regulation 













Creative Commons Attribution 3.0 Australia 
>>> 
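
One way to skip the empty titles mentioned above is sketched here. The HTML is a hypothetical stand-in for the real page; title=True asks BeautifulSoup for tags that have the attribute at all, and the extra condition drops the ones whose value is empty or only whitespace.

```python
from bs4 import BeautifulSoup

# Hypothetical links standing in for the real page.
html = """
<a href="#" title="Remove">x</a>
<a href="#" title="">Reserve Bank...</a>
<a href="#" title="Business Support and Regulation">Business Support...</a>
<a href="#">no title attribute</a>
"""

soup = BeautifulSoup(html, "html.parser")

# Keep only links that have a title attribute with a non-empty value.
titles = [
    a['title']
    for a in soup.find_all('a', title=True)
    if a['title'].strip()
]
print(titles)
```

Note that this drops links whose title is empty, so the visible text of those links still has to be read separately if it is needed.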
+0

Thanks, it works for one URL. I am now running the program over all the URLs. Let's see what it outputs. – Arti123

+0

@shashank, when I run it over several URLs I get the same thing each time. I think it needs to go through them in a loop. – Arti123

+0

Could you elaborate on what you want to do? I mean, how do you plan to fetch the data? – Shashank
