Run wordcount or a Pig script on a directory to produce results in separate files

I am new to Hadoop/Pig.

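I have a directory containing several files, and I need a word count for each of them. I can run the Hadoop example wordcount on the directory, but then all the output ends up in a single file. What should I do if I want the output for each input file written to a separate file? I could also use Pig and give it the directory as input, but how can I read the file names in the directory and pass them to LOAD? What I mean is:
Suppose I have a directory test containing five files: test1, test2, test3, test4, test5. I want the word count for each file written to its own separate file. I know I could name each file individually and run the job once per file, but that would take a long time. Is it possible to read the file names from the directory and feed them to Pig's LOAD as input?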


2012-07-11 Uno

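You can extend the PigStorage code to add the file name to each tuple; see Code Sample and look for the question "Q: I load data from a directory which contains different files. How do I find out where the data comes from?". For output, you can write a similar extension of PigStorage that writes to different output files.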


2012-07-12 10:26:42 alexeipab

Thanks, alexeipab. The link looks interesting. But how would this work with Pig embedded in Java? Any ideas? – Uno 2012-07-12 17:01:03

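If you are using Pig version 0.10.0 or later, you can combine source tagging (the -tagsource option of PigStorage) with MultiStorage to keep track of which file each record came from.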

For example, suppose the input directory pigin contains the following files and contents:

pigin 
|-test1 => "hello" 
|-test2 => "world" 
|-test3 => "Apache" 
|-test4 => "Hadoop" 
|-test5 => "Pig"

The following script reads every file in that directory and writes the contents of each file to a separate output directory.

%declare inputPath 'pigin' 
%declare outputPath 'pigout' 

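-- MultiStorage lives in piggybank; the jar path below is an assumption,
-- adjust it for your installation
register piggybank.jar;

-- Define MultiStorage to write output to different directories based on the 
-- first element in the tuple 
define MultiStorage org.apache.pig.piggybank.storage.MultiStorage('$outputPath','0'); 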

-- Load the input files, prepending each tuple with the file name 
A = load '$inputPath' using PigStorage(',', '-tagsource'); 

-- Write output to different directories 
store A into '$outputPath' using MultiStorage();

The script above creates an output directory tree that looks like this:

pigout 
|-test1 
| `-test1-0 => "test1 hello" 
|-test2 
| `-test2-0 => "test2 world" 
|-test3 
| `-test3-0 => "test3 Apache" 
|-test4 
| `-test4-0 => "test4 Hadoop" 
|-test5 
| `-test5-0 => "test5 Pig"

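The -0 at the end of each file name corresponds to the reducer that produced the output. If you have more than one reducer, you may see more than one file in each directory.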
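
The script in this answer only splits the records by source file; it does not count anything. To get the per-file word count the question asks for, the same two pieces can be combined with a tokenize/group step. What follows is only a rough sketch, not something tested against a cluster. It assumes that piggybank.jar is on a path Pig can find, that the input lines contain no tab characters (so that each line arrives as a single field), and that pigout_counts is a made-up output path. The aliases filename, line, word and n are likewise names chosen for illustration.

```pig
%declare inputPath 'pigin'
%declare outputPath 'pigout_counts'   -- hypothetical output directory

-- Assumed jar location; adjust for your installation
register piggybank.jar;

-- Route each record to a directory named after field 0 (the file name)
define MultiStorage org.apache.pig.piggybank.storage.MultiStorage('$outputPath', '0');

-- Load every file in the directory; -tagsource prepends the source file name.
-- Assumes no tabs inside a line, so the whole line lands in the second field.
A = load '$inputPath' using PigStorage('\t', '-tagsource')
    as (filename:chararray, line:chararray);

-- One record per (file, word)
B = foreach A generate filename, flatten(TOKENIZE(line)) as word;

-- Count occurrences of each word within each file.
-- parallel 1 requests a single reducer, which keeps it to one output file
-- per directory at the cost of parallelism.
C = group B by (filename, word) parallel 1;
D = foreach C generate group.filename as filename, group.word as word, COUNT(B) as n;

-- Write the counts into one directory per input file
store D into '$outputPath' using MultiStorage();
```

With the pigin example above, this should produce one directory per input file under pigout_counts, each holding tab-separated file name, word and count records. Drop the parallel 1 clause if the input is large enough that a single reducer becomes a bottleneck; you will then get several files per directory, as described above.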


2012-07-28 01:34:03 cyang
