2015-10-06

On CDH 5.4, I am trying to set up the Twitter analytics demo: Hive errors when querying an external table over Flume-streamed data.

  1. Flume captures tweets into an HDFS folder
  2. Hive queries the tweets using a JSON SerDe

Step 1 succeeds. I can see that tweets are being captured and routed correctly to the desired HDFS folder. I observe that a temporary file is created first and then renamed to a permanent file:

-rw-r--r-- 3 root hadoop  7548 2015-10-06 06:39 /user/flume/tweets/FlumeData.1444127932782 
-rw-r--r-- 3 root hadoop  10034 2015-10-06 06:39 /user/flume/tweets/FlumeData.1444127932783.tmp 
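For reference, a minimal sketch of a Flume HDFS-sink configuration that produces files like the ones above (the agent and sink names here are placeholders, not taken from my actual setup):

```properties
# Hypothetical agent/sink names; only the hdfs.* properties matter
TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://master.ds.com:8020/user/flume/tweets
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text
TwitterAgent.sinks.HDFS.hdfs.batchSize = 1000
TwitterAgent.sinks.HDFS.hdfs.rollSize = 0
TwitterAgent.sinks.HDFS.hdfs.rollCount = 10000
```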

I am using the following table declaration:

CREATE EXTERNAL TABLE tweets(
    id bigint, 
    created_at string, 
    lang string, 
    source string, 
    favorited boolean, 
    retweet_count int, 
    retweeted_status 
    struct<text:string,user:struct<screen_name:string,name:string>>, 
    entities struct<urls:array<struct<expanded_url:string>>, 
    user_mentions:array<struct<screen_name:string,name:string>>, 
    hashtags:array<struct<text:string>>>, 
    text string, 
    user struct<location:string,geo_enabled:string,screen_name:string,name:string,friends_count:int,followers_count:int,statuses_count:int,verified:boolean,utc_offset:int,time_zone:string>, 
    in_reply_to_screen_name string) 
ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe' 
STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' 
LOCATION 'hdfs://master.ds.com:8020/user/flume/tweets'; 
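(Side note: the SerDe class must be on Hive's classpath. In the Cloudera Twitter demo this is typically done per session with ADD JAR; the jar path below is an example and will differ per installation:)

```sql
-- Register the JSON SerDe used by the tweets table (path is an assumption)
ADD JAR /usr/lib/hive/lib/hive-serdes-1.0-SNAPSHOT.jar;
```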

However, when I query this table, I get the following error:

hive> select count(*) from tweets; 

Ended Job = job_1443526273848_0140 with errors 
... 
Diagnostic Messages for this Task: 
Error: java.io.IOException: java.lang.reflect.InvocationTargetException 
     at org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderCreation 
     ... 11 more 

Caused by: java.io.FileNotFoundException: File does not exist: /user/flume/tweets/FlumeData.1444128601078.tmp 
     at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:66) 
     ... 

FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask 
MapReduce Jobs Launched: 

Stage-Stage-1: Map: 2 Reduce: 1 Cumulative CPU: 1.19 sec HDFS Read: 10492 HDFS Write: 0 FAIL 

I think the problem is related to the temporary file: the map-reduce job spawned by the Hive query tries to read it, but the file no longer exists by then (Flume has already renamed it). Is there a workaround or configuration change to handle this successfully?

Answer


I had the same problem and solved it by adding the following to the HDFS sink in my Flume configuration:

some_agent.hdfssink.hdfs.inUsePrefix = .
some_agent.hdfssink.hdfs.inUseSuffix = .temp

Hope it helps.
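This works because MapReduce's default FileInputFormat skips paths whose names begin with "." or "_", so files still being written are invisible to the Hive query. With those settings, an in-progress file would appear as a hidden dotted name (illustrative, not from an actual listing):

```
/user/flume/tweets/.FlumeData.1444128601078.temp   <- ignored while being written
/user/flume/tweets/FlumeData.1444128601078         <- visible once Flume renames it
```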