2017-07-26

I am new to Python (using 3.6.2) and trying to scrape data from the first 2 pages of a site using a specific keyword. So far I can get the data into the Python IDLE window, but I am having trouble exporting it to CSV. I have tried BeautifulSoup 4 and pandas, but could not get the export to work. This is what I have done so far; any help would be greatly appreciated. Scrape web content from the first two pages and export the scraped data to CSV using Python and BS4.

import csv 
import requests 
from bs4 import BeautifulSoup 
import pandas as pd 

url = "http://www.amazon.in/s/ref=nb_sb_noss?url=search-alias%3Dautomotive&field-keywords=helmets+for+men&rh=n%3A4772060031%2Ck%3Ahelmets+for+men&ajr=0" 
request = requests.get(url)  
soup = BeautifulSoup(request.content, "lxml") 
#filename = auto.csv 
#with open(str(auto.csv,"r+","\n")) as csvfile: 
    #headers = "Count , Asin \n" 
    #fo.writer(headers) 
for url in soup.find_all('li'): 
    Nand = url.get('data-asin') 
    #print(Nand) 
    Result = url.get('id') 
    #print(Result) 
    #d=(str(Nand), str(Result)) 


df=pd.Index(url.get_attribute('url')) 
#with open("auto.txt", "w",newline='') as dumpfile: 
    #dumpfilewriter = csv.writer(dumpfile) 
    #for Nand in soup: 
     #value = Nand.__gt__   
     #if value: 
      #dumpfilewriter.writerows([value]) 
df.to_csv(dumpfile) 
dumpfile.close() 
csvfile.csv.writer("auto.csv," , ',' ,'|' , "\n") 
Can someone help me with this? I am trying to export the results to csv. I need "data-asin" and "id" in the csv. – Sunny

I forgot to add the version I am using: Python 3.6.2 – Sunny

Fix your indentation, the code cannot execute as posted –

Answers

I added a user-agent field to the request to escape automatic blocking of bots. You were getting a lot of None values because you did not specify which <li> tags you wanted; I added that to the code as well.

import requests 
from bs4 import BeautifulSoup 
import pandas as pd 


url = "http://www.amazon.in/s/ref=nb_sb_noss?url=search-alias%3Dautomotive&field-keywords=helmets+for+men&rh=n%3A4772060031%2Ck%3Ahelmets+for+men&ajr=0" 
request = requests.get(url, headers={'User-Agent':'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36'})  
soup = BeautifulSoup(request.content, "lxml") 

res = [] 

for url in soup.find_all('li', class_ = 's-result-item'): 
    res.append([url.get('data-asin'), url.get('id')]) 

df = pd.DataFrame(data=res, columns=['Nand', 'Result'])  
df.to_csv('path/where/you/want/to/store/file.csv') 

EDIT: to process all the pages you need a loop that builds the resulting URL, which you then pass to the main processing block (you already have it). Take a look at this page: http://www.amazon.in/s/ref=sr_pg_2?rh=n%3A4772060031%2Ck%3Ahelmets+for+men&page=2&keywords=helmets+for+men&ie=UTF8&qid=1501133688&spIA=B01N0MAT2E,B01MY1ZZDS,B01N0RMJ1H
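The paginated URL above can also be assembled programmatically instead of by hand; a minimal sketch using only the standard library, assuming the query parameters visible in that URL stay fixed (build_page_url is a hypothetical helper name, not part of the original code):

```python
from urllib.parse import urlencode

# Hypothetical helper: rebuilds the search URL for a given page number.
# Parameter names and values are taken from the Amazon URL quoted above.
def build_page_url(page):
    base = "http://www.amazon.in/s/ref=sr_pg_2"
    params = {
        "rh": "n:4772060031,k:helmets for men",
        "keywords": "helmets for men",
        "ie": "UTF8",
        "page": page,
    }
    # urlencode percent-escapes ':' and ',' and turns spaces into '+'
    return base + "?" + urlencode(params)

print(build_page_url(2))
```

This keeps the escaping rules in one place instead of hand-editing percent-encoded strings for every page.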

EDIT_2: let's loop over the page parameter. You can manually append page to the URL you pass to requests.get().

import requests 
from bs4 import BeautifulSoup 
import pandas as pd 

base_url = "http://www.amazon.in/s/ref=sr_pg_2?rh=n%3A4772060031%2Ck%3Ahelmets+for+men&keywords=helmets+for+men&ie=UTF8" 
#excluding page from base_url for further adding 
res = [] 

for page in range(1, 72): # such range because the last page for the needed category is 71

    request = requests.get(base_url + '&page=' + str(page), headers={'User-Agent':'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36'}) # here adding page  
    if request.status_code == 404: # added just in case of error 
        break
    soup = BeautifulSoup(request.content, "lxml") 

    for url in soup.find_all('li', class_='s-result-item'): 
        res.append([url.get('data-asin'), url.get('id')]) 

df = pd.DataFrame(data=res, columns=['Nand', 'Result'])  
df.to_csv('path/where/you/want/to/store/file.csv') 
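One detail about the to_csv call above: by default pandas also writes the DataFrame's integer index as an unnamed first column in the file. A minimal sketch with a made-up row (the ASIN value is dummy data) showing index=False to drop it:

```python
import pandas as pd

# Dummy data standing in for scraped values; the ASIN here is made up.
df = pd.DataFrame(data=[["B01N0MAT2E", "result_0"]], columns=["Nand", "Result"])

# index=False omits the row-number column from the output
csv_text = df.to_csv(index=False)
print(csv_text)
```

Without index=False the file would start with an extra leading column of 0, 1, 2, …, which usually just gets in the way when the CSV is re-imported.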

@Nandish, any feedback? Did my solution work? –


Super.. thanks for the code.. it works. I had missed the class 's-result-item'. – Sunny


Glad it helped, good luck =) –


Question: Help me with exporting the data of variable "Nand" and "Result" to csv file

with open("auto.csv", 'w') as fh: 
    writer = csv.DictWriter(fh, fieldnames=['Nand', 'Result']) 
    writer.writeheader() 
    data = {} 
    for url in soup.find_all('li'): 
        data['Nand'] = url.get('data-asin') 
        data['Result'] = url.get('id') 
        writer.writerow(data) 
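As a quick sanity check on that DictWriter output, here is a small round trip through an in-memory buffer (the row values are made up), reading the data back with csv.DictReader:

```python
import csv
import io

# Write a header plus one row the same way the answer does, but into a buffer
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["Nand", "Result"])
writer.writeheader()
writer.writerow({"Nand": "B01N0MAT2E", "Result": "result_0"})

# DictReader consumes the header row and yields one dict per data row
rows = list(csv.DictReader(io.StringIO(buf.getvalue())))
print(rows[0]["Nand"])  # → B01N0MAT2E
```

The same DictReader call works on the real auto.csv file once it has been written.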

Tested with Python 3.4.2


I am getting an error at line 140, in __init__ self.writer = writer(f, dialect, *args, **kwds) TypeError: argument 1 must have a "write" method – Sunny


Never mind, stovfl. Yes, the code works fine.. thank you. – Sunny