2014-10-09 61 views

I have a dataset in Pig that looks like this:

6009544 "NY" 6009545 "NY" 
6009544 "NY" 6009545 "NY" 
6009548 "NY" 6009546 "OR" 
6009546 "OR" 6009546 "OR" 
6009545 "NY" 6009546 "OR" 
6009548 "NY" 6009547 "AZ" 
6009547 "AZ" 6009547 "AZ" 
6009547 "AZ" 6009548 "NY" 
6009544 "NY" 6009548 "NY" 

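The first row reads as: "Patent 6009544 originated in NY and cites patent 6009545, which also originated in NY." For each state, I am trying to find the percentage of its citations that go to patents originating in the same state. So my desired output should be: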

NY: .5 
OR: 1 
AZ: .5 

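This is because, of the 6 citations made by patents originating in NY, 3 cite patents that also originated in NY. The 1 citation made by a patent originating in OR cites a patent that also originated in OR. Of the 2 citations made by patents originating in AZ, 1 cites a patent that also originated in AZ.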

Can anyone suggest a good way to do this in Pig?

Answers


Can you try this?

input.txt 
6009544 "NY" 6009545 "NY" 
6009544 "NY" 6009545 "NY" 
6009548 "NY" 6009546 "OR" 
6009546 "OR" 6009546 "OR" 
6009545 "NY" 6009546 "OR" 
6009548 "NY" 6009547 "AZ" 
6009547 "AZ" 6009547 "AZ" 
6009547 "AZ" 6009548 "NY" 
6009544 "NY" 6009548 "NY" 

PigScript: 
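-- Load each row as a single string.
A = LOAD 'input.txt' AS (line:chararray);
-- Split the row into citing patent, citing state, cited patent, cited state.
B = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT_ALL(line,'(\\d+)\\s+"(\\w+)"\\s+(\\d+)\\s+"(\\w+)"')) AS (f1:int,f2:chararray,f3:int,f4:chararray);
-- Group the rows by the citing patent's state.
C = GROUP B BY f2;
D = FOREACH C {
       -- Rows whose cited patent comes from the same state.
       FilterByPatent = FILTER B BY f2==f4;
       -- All rows for this state.
       CityPatentCount = COUNT(B.f2);
       GENERATE group,((float)COUNT(FilterByPatent)/(float)CityPatentCount);
       }
DUMP D;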

Output: 
(AZ,0.5) 
(NY,0.5) 
(OR,1.0) 
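
As a cross-check of the numbers above, the same group-and-ratio logic can be sketched in plain Python. This is not a Pig solution, only a way to verify the arithmetic on the sample rows. The function name same_state_share and the variable names are hypothetical, and the regular expression mirrors the one used in the Pig script.

```python
import re
from collections import Counter

# Sample rows from the question: citing patent, its state, cited patent, its state.
ROWS = """\
6009544 "NY" 6009545 "NY"
6009544 "NY" 6009545 "NY"
6009548 "NY" 6009546 "OR"
6009546 "OR" 6009546 "OR"
6009545 "NY" 6009546 "OR"
6009548 "NY" 6009547 "AZ"
6009547 "AZ" 6009547 "AZ"
6009547 "AZ" 6009548 "NY"
6009544 "NY" 6009548 "NY"
"""

# Same pattern as the REGEX_EXTRACT_ALL call in the Pig script.
PATTERN = re.compile(r'(\d+)\s+"(\w+)"\s+(\d+)\s+"(\w+)"')


def same_state_share(text):
    """Return, per citing state, the fraction of rows citing the same state."""
    total = Counter()  # all rows per citing state
    same = Counter()   # rows where citing state == cited state
    for line in text.splitlines():
        match = PATTERN.match(line.strip())
        if match is None:
            continue  # skip blank or malformed rows
        _, citing_state, _, cited_state = match.groups()
        total[citing_state] += 1
        if citing_state == cited_state:
            same[citing_state] += 1
    return {state: same[state] / total[state] for state in total}


if __name__ == "__main__":
    result = same_state_share(ROWS)
    for state in sorted(result):
        print(state, result[state])
```

The two counters play the roles of COUNT(B.f2) and COUNT(FilterByPatent) in the nested FOREACH, so the ratios should agree with the Pig output.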

This approach works great - thanks! – Luke 2014-10-09 14:33:55


I changed the sample data so that the fields are separated by single spaces:

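-- Load space-delimited rows; the state fields keep their double quotes.
A = load '/padata' using PigStorage(' ') as (pno:int,pcity:chararray,pci:int,pccity:chararray);
-- Group the rows by the citing patent's state.
b = group A by pcity;
r = foreach b {
       -- All rows for this state.
       copcity = COUNT(A.pcity);
       -- Rows whose cited patent comes from the same state.
       samdata = FILTER A by pcity==pccity;
       csamdata = COUNT(samdata);
       percent = (float)csamdata/(float)copcity;
       generate group,percent;
       }
dump r;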

Output:
("AZ",0.5)
("NY",0.5)
("OR",1.0)