2017-02-18 99 views
3

I have code that fetches an AWS S3 object. How can I read the resulting StreamingBody with Python's csv.DictReader, that is, how do I read a CSV stored in S3 using csv.DictReader?

import boto3, csv 

session = boto3.session.Session(aws_access_key_id=<>, aws_secret_access_key=<>, region_name=<>) 
s3_resource = session.resource('s3') 
s3_object = s3_resource.Object(<bucket>, <key>) 
streaming_body = s3_object.get()['Body'] 

#csv.DictReader(???) 

'csv.DictReader(streaming_body)'? – Leon

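'csv.DictReader(streaming_body)' raises the error "TypeError: argument 1 must be an iterator". If I call read() and decode() before passing it in (which I would rather not do, because it loads the whole file into memory), it returns every character of the file separately. – Jon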
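
The per-character behaviour described in this comment can be reproduced without S3. The sketch below is only an illustration with made-up sample data: iterating over a `str` yields single characters, so the csv module treats each character as its own line, whereas wrapping the text in `io.StringIO` gives it real lines. Note that this still holds the whole text in memory.

```python
import csv
import io

# Hypothetical sample data standing in for the decoded S3 object.
text = "name,city\nAda,London\n"

# Iterating over a str yields one character at a time,
# so csv.reader parses each character as a separate "line".
per_char = list(csv.reader(text))

# A file-like wrapper yields whole lines, which is what csv expects.
per_line = list(csv.DictReader(io.StringIO(text)))

print(per_char[:3])
print(per_line)
```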

Answers

4

The code would look something like this:

import boto3 
import csv 

# get a handle on s3 
s3 = boto3.resource(u's3') 

# get a handle on the bucket that holds your file 
bucket = s3.Bucket(u'bucket-name') 

# get a handle on the object you want (i.e. your file) 
obj = bucket.Object(key=u'test.csv') 

# get the object 
response = obj.get() 

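# read the contents of the file, decode the bytes to text (UTF-8 assumed), 
# and split the text into a list of lines; this loads the whole object into memory 
lines = response[u'Body'].read().decode('utf-8').splitlines() 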

# now iterate over those lines 
for row in csv.DictReader(lines): 

    # here you get a sequence of dicts 
    # do whatever you want with each line here 
    print(row) 

You could condense this a bit in real code, but I tried to keep it step by step to show the boto3 object hierarchy.

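Edit, following your comment about avoiding reading the whole file into memory: I have not run into that requirement myself, so I cannot speak authoritatively, but I would try wrapping the stream so that I get a text-file-like iterator. For example, you could use the codecs library to replace the CSV parsing section above with something like: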

import codecs  # place this with the other imports at the top 

for row in csv.DictReader(codecs.getreader('utf-8')(response[u'Body'])): 
    print(row) 
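
To try the wrapping idea without an AWS account, here is a minimal sketch that uses `io.BytesIO` as a stand-in for `response['Body']`. That substitution is an assumption: it relies only on the body exposing a `read()` method that returns bytes. The helper name `iter_csv_rows` and the sample data are hypothetical, not part of boto3.

```python
import codecs
import csv
import io


def iter_csv_rows(byte_stream, encoding="utf-8"):
    """Yield one dict per CSV row, decoding the byte stream incrementally."""
    # codecs.getreader returns a StreamReader class; wrapping the byte
    # stream gives an iterator of decoded text lines.
    text_stream = codecs.getreader(encoding)(byte_stream)
    for row in csv.DictReader(text_stream):
        yield row


# Stand-in for response['Body']; a real StreamingBody is assumed to
# behave the same way as far as read() is concerned.
fake_body = io.BytesIO(b"name,city\nAda,London\nLinus,Helsinki\n")

rows = list(iter_csv_rows(fake_body))
for row in rows:
    print(row["name"], row["city"])
```

Collecting the rows into a list is only for the demonstration; iterating over `iter_csv_rows(...)` directly avoids keeping every row in memory.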

@Jon, does this answer your question? – gary

Yes. Is there any way to do this so that I don't have to read() the whole file into memory? – Jon

The 'codecs.getreader()' solution solved this problem for me –
