2017-05-15 78 views
0

我知道这是最重复的问题之一。我几乎到处都是这样,没有资源可以解决我所面临的问题。 以下是我的问题陈述的简化版本。但在实际的数据是有点复杂,所以我必须使用UDF无法打开别名的迭代器<alias_name>

我的输入文件:(input.txt中)

NotNeeded1,NotNeeded11;Needed1 
NotNeeded2,NotNeeded22;Needed2 

我所要的输出是

Needed1 
Needed2 

所以,我写下面的UDF (Java代码):

package com.company.pig; 

import java.io.IOException; 
import org.apache.pig.EvalFunc; 
import org.apache.pig.data.Tuple; 

public class myudf extends EvalFunc<String>{ 
    public String exec(Tuple input) throws IOException { 
     if (input == null || input.size() == 0) 
      return null; 
     String s = (String)input.get(0); 
     String str = s.split("\\,")[1]; 
     String str1 = str.split("\\;")[1]; 
     return str1; 
    } 
} 

它包装成

rollupreg_extract-jar-with-dependencies.jar 

下面是我的猪shell代码

grunt> REGISTER /pig/rollupreg_extract-jar-with-dependencies.jar; 
grunt> DEFINE myudf com.company.pig.myudf; 
grunt> data = LOAD 'hdfs://sandbox.hortonworks.com:8020/pig_hdfs/input.txt' USING PigStorage(','); 
grunt> extract = FOREACH data GENERATE myudf($1); 
grunt> DUMP extract; 

我也得到了以下错误:

2017-05-15 15:58:15,493 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: UNKNOWN 
2017-05-15 15:58:15,577 [main] INFO org.apache.pig.data.SchemaTupleBackend - Key [pig.schematuple] was not set... will not generate code. 
2017-05-15 15:58:15,659 [main] INFO org.apache.pig.newplan.logical.optimizer.LogicalPlanOptimizer - {RULES_ENABLED=[AddForEach, ColumnMapKeyPrune, ConstantCalculator, GroupByConstParallelSetter, LimitOptimizer, LoadTypeCastInserter, MergeFilter, MergeForEach, PartitionFilterOptimizer, PredicatePushdownOptimizer, PushDownForEachFlatten, PushUpFilter, SplitFilter, StreamTypeCastInserter]} 
2017-05-15 15:58:15,774 [main] INFO org.apache.pig.impl.util.SpillableMemoryManager - Selected heap (PS Old Gen) of size 699400192 to monitor. collectionUsageThreshold = 489580128, usageThreshold = 489580128 
2017-05-15 15:58:15,865 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - File concatenation threshold: 100 optimistic? false 
2017-05-15 15:58:15,923 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 1 
2017-05-15 15:58:15,923 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 1 
2017-05-15 15:58:16,184 [main] INFO org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl - Timeline service address: http://sandbox.hortonworks.com:8188/ws/v1/timeline/ 
2017-05-15 15:58:16,196 [main] INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at sandbox.hortonworks.com/172.17.0.2:8050 
2017-05-15 15:58:16,396 [main] INFO org.apache.hadoop.yarn.client.AHSProxy - Connecting to Application History server at sandbox.hortonworks.com/172.17.0.2:10200 
2017-05-15 15:58:16,576 [main] INFO org.apache.pig.tools.pigstats.mapreduce.MRScriptState - Pig script settings are added to the job 
2017-05-15 15:58:16,580 [main] WARN org.apache.pig.tools.pigstats.ScriptState - unable to read pigs manifest file 
2017-05-15 15:58:16,584 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3 
2017-05-15 15:58:16,588 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - This job cannot be converted run in-process 
2017-05-15 15:58:17,258 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/pig/rollupreg_extract-jar-with-dependencies.jar to DistributedCache through /tmp/temp-1119775568/tmp-858482998/rollupreg_extract-jar-with-dependencies.jar 
2017-05-15 15:58:17,276 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up single store job 
2017-05-15 15:58:17,294 [main] INFO org.apache.pig.data.SchemaTupleFrontend - Key [pig.schematuple] is false, will not generate code. 
2017-05-15 15:58:17,295 [main] INFO org.apache.pig.data.SchemaTupleFrontend - Starting process to move generated code to distributed cacche 
2017-05-15 15:58:17,295 [main] INFO org.apache.pig.data.SchemaTupleFrontend - Setting key [pig.schematuple.classes] with classes to deserialize [] 
2017-05-15 15:58:17,354 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map-reduce job(s) waiting for submission. 
2017-05-15 15:58:17,510 [JobControl] INFO org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl - Timeline service address: http://sandbox.hortonworks.com:8188/ws/v1/timeline/ 
2017-05-15 15:58:17,511 [JobControl] INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at sandbox.hortonworks.com/172.17.0.2:8050 
2017-05-15 15:58:17,511 [JobControl] INFO org.apache.hadoop.yarn.client.AHSProxy - Connecting to Application History server at sandbox.hortonworks.com/172.17.0.2:10200 
2017-05-15 15:58:17,753 [JobControl] WARN org.apache.hadoop.mapreduce.JobResourceUploader - No job jar file set. User classes may not be found. See Job or Job#setJar(String). 
2017-05-15 15:58:17,820 [JobControl] INFO org.apache.pig.builtin.PigStorage - Using PigTextInputFormat 
2017-05-15 15:58:17,830 [JobControl] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1 
2017-05-15 15:58:17,830 [JobControl] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1 
2017-05-15 15:58:17,884 [JobControl] INFO com.hadoop.compression.lzo.GPLNativeCodeLoader - Loaded native gpl library 
2017-05-15 15:58:17,889 [JobControl] INFO com.hadoop.compression.lzo.LzoCodec - Successfully loaded & initialized native-lzo library [hadoop-lzo rev 7a4b57bedce694048432dd5bf5b90a6c8ccdba80] 
2017-05-15 15:58:17,922 [JobControl] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 1 
2017-05-15 15:58:18,525 [JobControl] INFO org.apache.hadoop.mapreduce.JobSubmitter - number of splits:1 
2017-05-15 15:58:18,692 [JobControl] INFO org.apache.hadoop.mapreduce.JobSubmitter - Submitting tokens for job: job_1494853652295_0023 
2017-05-15 15:58:18,879 [JobControl] INFO org.apache.hadoop.mapred.YARNRunner - Job jar is not present. Not adding any jar to the list of resources. 
2017-05-15 15:58:18,973 [JobControl] INFO org.apache.hadoop.yarn.client.api.impl.YarnClientImpl - Submitted application application_1494853652295_0023 
2017-05-15 15:58:19,029 [JobControl] INFO org.apache.hadoop.mapreduce.Job - The url to track the job: http://sandbox.hortonworks.com:8088/proxy/application_1494853652295_0023/ 
2017-05-15 15:58:19,030 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_1494853652295_0023 
2017-05-15 15:58:19,030 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Processing aliases data,extract 
2017-05-15 15:58:19,030 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - detailed locations: M: data[2,7],extract[3,10] C: R: 
2017-05-15 15:58:19,044 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete 
2017-05-15 15:58:19,044 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1494853652295_0023] 
2017-05-15 15:58:29,156 [main] WARN org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Ooops! Some job has failed! Specify -stop_on_failure if you want Pig to stop immediately on failure. 
2017-05-15 15:58:29,156 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - job job_1494853652295_0023 has failed! Stop running all dependent jobs 
2017-05-15 15:58:29,157 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete 
2017-05-15 15:58:29,790 [main] INFO org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl - Timeline service address: http://sandbox.hortonworks.com:8188/ws/v1/timeline/ 
2017-05-15 15:58:29,791 [main] INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at sandbox.hortonworks.com/172.17.0.2:8050 
2017-05-15 15:58:29,793 [main] INFO org.apache.hadoop.yarn.client.AHSProxy - Connecting to Application History server at sandbox.hortonworks.com/172.17.0.2:10200 
2017-05-15 15:58:30,311 [main] INFO org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl - Timeline service address: http://sandbox.hortonworks.com:8188/ws/v1/timeline/ 
2017-05-15 15:58:30,312 [main] INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at sandbox.hortonworks.com/172.17.0.2:8050 
2017-05-15 15:58:30,313 [main] INFO org.apache.hadoop.yarn.client.AHSProxy - Connecting to Application History server at sandbox.hortonworks.com/172.17.0.2:10200 
2017-05-15 15:58:30,465 [main] ERROR org.apache.pig.tools.pigstats.mapreduce.MRPigStatsUtil - 1 map reduce job(s) failed! 
2017-05-15 15:58:30,467 [main] WARN org.apache.pig.tools.pigstats.ScriptState - unable to read pigs manifest file 
2017-05-15 15:58:30,472 [main] INFO org.apache.pig.tools.pigstats.mapreduce.SimplePigStats - Script Statistics: 

HadoopVersion PigVersion  UserId StartedAt  FinishedAt  Features 
2.7.3.2.5.0.0-1245    root 2017-05-15 15:58:16  2017-05-15 15:58:30  UNKNOWN 

Failed! 

Failed Jobs: 
JobId Alias Feature Message Outputs 
job_1494853652295_0023 data,extract MAP_ONLY  Message: Job failed! hdfs://sandbox.hortonworks.com:8020/tmp/temp-1119775568/tmp-1619300225, 

Input(s): 
Failed to read data from "/pig_hdfs/input.txt" 

Output(s): 
Failed to produce result in "hdfs://sandbox.hortonworks.com:8020/tmp/temp-1119775568/tmp-1619300225" 

Counters: 
Total records written : 0 
Total bytes written : 0 
Spillable Memory Manager spill count : 0 
Total bags proactively spilled: 0 
Total records proactively spilled: 0 

Job DAG: 
job_1494853652295_0023 


2017-05-15 15:58:30,472 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Failed! 
2017-05-15 15:58:30,499 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open iterator for alias extract 
Details at logfile: /pig/pig_1494863836458.log 

我知道抱怨

Failed to read data from "/pig_hdfs/input.txt" 

但我相信这不是真正的问题。如果我不使用udf并直接转储数据,我会得到输出结果。所以,这不是问题。

回答

1

首先,您不需要使用udf来获得所需的输出。您可以在装载语句中使用分号作为分号并获取所需的列。

data = LOAD 'hdfs://sandbox.hortonworks.com:8020/pig_hdfs/input.txt' USING PigStorage(';'); 
extract = FOREACH data GENERATE $1; 
DUMP extract; 

如果你坚持使用UDF,那么你必须将记录加载到单个字段,然后使用udf.Also,你的UDF是incorrect.You应该拆分字符串s的“;”作为从猪脚本传递的分隔符。

String s = (String)input.get(0); 
String str1 = s.split("\\;")[1]; 

而且在你的猪脚本,你需要将整个记录加载到1场,并使用UDF现场$ 0

REGISTER /pig/rollupreg_extract-jar-with-dependencies.jar; 
DEFINE myudf com.company.pig.myudf; 
data = LOAD 'hdfs://sandbox.hortonworks.com:8020/pig_hdfs/input.txt' AS (f1:chararray); 
extract = FOREACH data GENERATE myudf($0); 
DUMP extract; 
+0

我做到了。但猪正在给出同样的问题。如果我在加载它后注册了'jar,然后是'DUMP data'',那么即使没有使用UDF,它也会出现同样的问题。 –

+0

如果您使用了2列,那么你不认为PigStorage(';')会将它分成两列吗? PIG不是这里的问题。请检查您的输入和脚本。如果您使用UDF,您是否更改了您的Java udf和pigscript?如果您不使用udf,注册udf有什么意义? –

+0

我的数据很复杂。为了保持简单,我提出了问题的示例数据。没有'REGISTER ''当我发布'DUMP数据'时,它正在正确加载数据。当我注册jar时,我所有的'DUMP'和'STORE'命令都失败了。我找到了解决方案..由于UDF jar而失败。这是因为'org.apache.pig' v0.12依赖关系不兼容。我尝试过其他版本的相同的依赖项,但没有奏效。最后依赖于[this](http://stackoverflow.com/a/21396370/2716962)答案解决了这个问题 –