2015-09-27 52 views

I started using mrjob only a few days ago and have tried a few low- and medium-level tasks. Now I am stuck on giving a Common Crawl (from now on referred to as CC) location as input to Amazon EMR using Python mrjob.

My config file looks like this:

runners:
    emr:
        aws_access_key_id: <AWS Access Key>
        aws_secret_access_key: <AWS Secret Access Key>
        aws_region: us-east-1
        ec2_key_pair: cslab
        ec2_key_pair_file: ~/cslab.pem
        ec2_instance_type: m1.small
        num_ec2_instances: 5
    local:
        base_tmp_dir: /tmp

In short: I am trying to count the number of words in each web page of a site.

In full: my code is below.

My code:

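import urlparse

import boto
import warc
from boto.s3.key import Key
from gzipstream import GzipStreamFile  # assumed helper, as in the Common Crawl mrjob examples
from mrjob.job import MRJob

class MRcount(MRJob):
    # ...

    def mapper(self, _, s3_path):
        # Each input line is treated as the location of one CC file.
        s3_url_parsed = urlparse.urlparse(s3_path)
        bucket_name = s3_url_parsed.netloc
        key_path = s3_url_parsed.path[1:]
        conn = boto.connect_s3()
        bucket = conn.get_bucket('aws-publicdatasets', validate=False)
        key = Key(bucket, s3_path)
        # Assumed: the elided step iterates over the WARC records in the key.
        for record in warc.WARCFile(fileobj=GzipStreamFile(key)):
            webpage_text = record.payload.read()
            # Emit the page URI with its whitespace-separated word count.
            yield record.header['warc-target-uri'], len(webpage_text.split())

if __name__ == '__main__':
    MRcount.run()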
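
The counting step itself can be checked locally, without S3 or EMR. A minimal sketch, assuming each record has already been reduced to a (URI, text) pair; the function name `count_words_per_page` and the sample URLs are hypothetical and not part of the code above:

```python
def count_words_per_page(records):
    """Yield (uri, word_count) for each (uri, text) pair."""
    for uri, text in records:
        # str.split() with no arguments splits on any run of whitespace.
        yield uri, len(text.split())


# Hypothetical sample data standing in for WET record payloads.
sample = [
    ('http://example.com/a', 'hello common crawl'),
    ('http://example.com/b', ''),
]
print(list(count_words_per_page(sample)))
```

This mirrors what the mapper yields per record, so the counting logic can be verified before any cluster is started.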

Everything was fine until now, when I try to run it.

Command:

$ python mr_crawl.py -r emr s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2014-52/wet.paths.gz 

Error:

boto.exception.S3ResponseError: S3ResponseError: 301 Moved Permanently 
<?xml version="1.0" encoding="UTF-8"?> 
<Error><Code>PermanentRedirect</Code><Message>The bucket you are attempting to access must be addressed using the specified endpoint. Please send all future requests to this endpoint.</Message><RequestId>06660583263444FC</RequestId><Bucket>smarkets-db</Bucket><HostId>TCZJTKZ8wo8V1h0xjkOI6grojs/r9IBkhMOcvolXv06QEtxTX89M55aLTPGOo/ht</HostId><Endpoint>eu-west-bucket.s3.amazonaws.com</Endpoint></Error>
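
The error body names the bucket and the endpoint that S3 expects, which helps when checking whether the region is really the cause. A minimal sketch that pulls those fields out with the standard library; the function name `summarize_s3_error` is hypothetical, and the sample string restores the `>` missing after `</Message` so that it parses:

```python
import xml.etree.ElementTree as ET

# Error body copied from the response above, with </Message> closed properly.
ERROR_BODY = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    '<Error><Code>PermanentRedirect</Code>'
    '<Message>The bucket you are attempting to access must be addressed '
    'using the specified endpoint. Please send all future requests to '
    'this endpoint.</Message>'
    '<RequestId>06660583263444FC</RequestId>'
    '<Bucket>smarkets-db</Bucket>'
    '<HostId>TCZJTKZ8wo8V1h0xjkOI6grojs/r9IBkhMOcvolXv06QEtxTX89M55aLTPGOo/ht</HostId>'
    '<Endpoint>eu-west-bucket.s3.amazonaws.com</Endpoint></Error>'
)


def summarize_s3_error(body):
    """Return the Code, Bucket and Endpoint fields of an S3 error document."""
    # Parse from bytes so the encoding declaration is honoured.
    root = ET.fromstring(body.encode('utf-8'))
    return {tag: root.findtext(tag) for tag in ('Code', 'Bucket', 'Endpoint')}


print(summarize_s3_error(ERROR_BODY))
```

Comparing the reported bucket and endpoint with the bucket the job actually reads from shows which request triggered the redirect.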

I thought this was caused by the region in my config file, so I removed it, but then I got a new error.

My new config file:

runners:
    emr:
        aws_access_key_id: <AWS Access Key>
        aws_secret_access_key: <AWS Secret Access Key>
        ec2_key_pair: cslab
        ec2_key_pair_file: ~/cslab.pem
        ec2_instance_type: m1.small
        num_ec2_instances: 5
    local:
        base_tmp_dir: /tmp

I now get the following SSH error:

using configs in /etc/mrjob.conf 
using existing scratch bucket mrjob-4db6342a70e021ad 
using s3://mrjob-4db6342a70e021ad/tmp/ as our scratch dir on S3 
creating tmp directory /tmp/word_count.20140603.181541.006786 
writing master bootstrap script to /tmp/word_count.20140603.181541.006786/b.py 
Copying non-input files into s3://mrjob-4db6342a70e021ad/tmp/word_count.matthew.20140603.181541.006786/files/ 
Waiting 5.0s for S3 eventual consistency 
Creating Elastic MapReduce job flow 
Job flow created with ID: j-3DCN7LULSRILW 
Created new job flow j-3DCN7LULSRILW 
Job on job flow j-3DCN7LULSRILW failed with status FAILED: The given SSH key name was invalid 
Logs are in s3://mrjob-4db6342a70e021ad/tmp/logs/j-3DCN7LULSRILW/ 
Scanning S3 logs for probable cause of failure 
Waiting 5.0s for S3 eventual consistency 
Terminating job flow: j-3DCN7LULSRILW 
Traceback (most recent call last): 
    File "word_count.py", line 16, in <module> 
    MRcount.run() 
    File "/usr/local/lib/python2.7/dist-packages/mrjob/job.py", line 494, in run 
    mr_job.execute() 
    File "/usr/local/lib/python2.7/dist-packages/mrjob/job.py", line 512, in execute 
    super(MRJob, self).execute() 
    File "/usr/local/lib/python2.7/dist-packages/mrjob/launch.py", line 147, in execute 
    self.run_job() 
    File "/usr/local/lib/python2.7/dist-packages/mrjob/launch.py", line 208, in run_job 
    runner.run() 
    File "/usr/local/lib/python2.7/dist-packages/mrjob/runner.py", line 458, in run 
    self._run() 
    File "/usr/local/lib/python2.7/dist-packages/mrjob/emr.py", line 809, in _run 
    self._wait_for_job_to_complete() 
    File "/usr/local/lib/python2.7/dist-packages/mrjob/emr.py", line 1599, in _wait_for_job_to_complete 
    raise Exception(msg) 
Exception: Job on job flow j-3DCN7LULSRILW failed with status FAILED: The given SSH key name was invalid 

Thanks,


The SSH key name must be the same as the name shown in the AWS console. – Pykler


@Pykler I did not provide an SSH key in my code. – The6thSense

Answers


In your mrjob config, you need to set ec2_key_pair to a name that appears in the list of key pairs in your AWS console:

runners:
    emr:
        aws_access_key_id: <AWS Access Key>
        aws_secret_access_key: <AWS Secret Access Key>
        ec2_key_pair: cslab # <---- this name doesn't exist in AWS, so AWS doesn't know which public key to use
        ec2_key_pair_file: ~/cslab.pem # <-- you can comment this out if you don't need to log in to the machine via SSH
        ec2_instance_type: m1.small
        num_ec2_instances: 5
    local:
        base_tmp_dir: /tmp

To see the list of key pairs you have in AWS, see this doc.
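
The same check can be done from Python. A minimal sketch, assuming boto 2 is installed and AWS credentials are configured; the helper names `key_pair_exists` and `list_key_pair_names` and the sample key names are hypothetical. EC2 key pairs are scoped to a region, so the lookup should use the region the job flow runs in:

```python
def key_pair_exists(configured_name, available_names):
    """Return True if the configured ec2_key_pair exactly matches a known name."""
    return configured_name in set(available_names)


def list_key_pair_names(region):
    """Fetch key pair names for one region via boto 2 (needs AWS credentials)."""
    import boto.ec2  # imported lazily so key_pair_exists works without boto
    conn = boto.ec2.connect_to_region(region)
    return [key_pair.name for key_pair in conn.get_all_key_pairs()]


# Hypothetical names standing in for what list_key_pair_names would return.
names = ['my-laptop', 'build-server']
print(key_pair_exists('cslab', names))
```

With real credentials, `key_pair_exists('cslab', list_key_pair_names('us-east-1'))` would indicate whether the name in the config is known in that region.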