2017-04-25 101 views

Filtering data after a join in Pig

I want to filter records after joining two files.

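The file BX-Books.csv contains book data, and the file BX-Book-Ratings.csv contains book rating data. ISBN is the column common to both files, and the inner join between them is done on this column.
I want to get the books published in 2002.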

I used the script below, but I get 0 records.

grunt> BookXRecords = LOAD '/user/pradeep/BX-Books.csv' USING PigStorage(';') AS (ISBN:chararray,BookTitle:chararray,BookAuthor:chararray,YearOfPublication:chararray, Publisher:chararray,ImageURLS:chararray,ImageURLM:chararray,ImageURLL:chararray); 
grunt> BookXRating = LOAD '/user/pradeep/BX-Book-Ratings.csv' USING PigStorage(';') AS (user:chararray,ISBN:chararray,rating:chararray); 
grunt> BxJoin = JOIN BookXRecords BY ISBN, BookXRating BY ISBN; 
grunt> BxJoin_Mod = FOREACH BxJoin GENERATE $0 AS ISBN, $1, $2, $3, $4; 
grunt> FLTRBx2002 = FILTER BxJoin_Mod BY $3 == '2002'; 

What is the output of "DESCRIBE BxJoin_Mod"? Does your data actually contain rows with YearOfPublication 2002? – Amit


grunt> DESCRIBE BxJoin_Mod; BxJoin_Mod: {ISBN: chararray,BookXRecords::BookTitle: chararray,BookXRecords::BookAuthor: chararray,BookXRecords::YearOfPublication: chararray,BookXRecords::Publisher: chararray} –


Yes, my data has rows with YearOfPublication == 2002 –

Answers


I created test.csv, test-rating.csv, and a Pig script to reproduce this, and the filter works as expected.

test.csv

1;abc;author1;2002 
2;xyz;author2;2003 

test-rating.csv

user1;1;3 
user2;2;5 

Pig script:

A = LOAD 'test.csv' USING PigStorage(';') AS (ISBN:chararray,BookTitle:chararray,BookAuthor:chararray,YearOfPublication:chararray); 
describe A; 
dump A; 

B = LOAD 'test-rating.csv' USING PigStorage(';') AS (user:chararray,ISBN:chararray,rating:chararray); 
describe B; 
dump B; 

C = JOIN A BY ISBN, B BY ISBN; 
describe C; 
dump C; 

D = FOREACH C GENERATE $0 as ISBN,$1,$2,$3; 
describe D; 
dump D; 

E = FILTER D BY $3 == '2002'; 
describe E; 
dump E; 

Output:

A: {ISBN: chararray,BookTitle: chararray,BookAuthor: chararray,YearOfPublication: chararray} 
(1,abc,author1,2002) 
(2,xyz,author2,2003) 
B: {user: chararray,ISBN: chararray,rating: chararray} 
(user1,1,3) 
(user2,2,5) 
C: {A::ISBN: chararray,A::BookTitle: chararray,A::BookAuthor: chararray,A::YearOfPublication: chararray,B::user: chararray,B::ISBN: chararray,B::rating: chararray} 
(1,abc,author1,2002,user1,1,3) 
(2,xyz,author2,2003,user2,2,5) 
D: {ISBN: chararray,A::BookTitle: chararray,A::BookAuthor: chararray,A::YearOfPublication: chararray} 
(1,abc,author1,2002) 
(2,xyz,author2,2003) 
E: {ISBN: chararray,A::BookTitle: chararray,A::BookAuthor: chararray,A::YearOfPublication: chararray} 
(1,abc,author1,2002) 
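
Since the same logic returns a row on the small test files, the 0-record result on the real data may come from the stored values rather than from the script. One possibility, which is only an assumption and is not confirmed anywhere in this thread, is that the fields in BX-Books.csv are wrapped in double quotes or padded with whitespace, so YearOfPublication holds "2002" (quotes included) and never equals '2002'. A minimal sketch for checking and working around that follows; the aliases Books, FirstRows, Cleaned and Books2002 are illustrative names only.

```pig
-- Load with the same schema as in the question.
Books = LOAD '/user/pradeep/BX-Books.csv' USING PigStorage(';') AS (ISBN:chararray, BookTitle:chararray, BookAuthor:chararray, YearOfPublication:chararray, Publisher:chararray, ImageURLS:chararray, ImageURLM:chararray, ImageURLL:chararray);

-- Look at a few raw rows to see whether values carry quotes or spaces.
FirstRows = LIMIT Books 5;
DUMP FirstRows;

-- Assumption: values may be double-quoted. Remove quote characters and
-- surrounding whitespace from the year before comparing.
Cleaned = FOREACH Books GENERATE ISBN, BookTitle, BookAuthor, TRIM(REPLACE(YearOfPublication, '"', '')) AS YearOfPublication;

Books2002 = FILTER Cleaned BY YearOfPublication == '2002';
DUMP Books2002;
```

If the DUMP of FirstRows shows clean, unquoted years, this cleanup is unnecessary and the cause lies elsewhere.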

Requirement: get the books published in 2002.

Two datasets are not needed for this. It can be done using only "BookXRecords".

grunt> BookXRecords = LOAD '/user/pradeep/BX-Books.csv' USING PigStorage(';') AS (ISBN:chararray,BookTitle:chararray,BookAuthor:chararray,YearOfPublication:chararray, Publisher:chararray,ImageURLS:chararray,ImageURLM:chararray,ImageURLL:chararray);
grunt> A = FILTER BookXRecords BY YearOfPublication == '2002';
grunt> DUMP A;
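
If the ratings are still wanted next to each 2002 title, the same filter can be applied before the join, so that only the matching books take part in it. The sketch below assumes the paths and schemas from the question; the aliases Books, Ratings, Books2002, Joined and Result are illustrative names only. It also refers to joined fields by their qualified names (alias::field) instead of positions such as $3, which makes it harder to pick the wrong column.

```pig
Books = LOAD '/user/pradeep/BX-Books.csv' USING PigStorage(';') AS (ISBN:chararray, BookTitle:chararray, BookAuthor:chararray, YearOfPublication:chararray, Publisher:chararray, ImageURLS:chararray, ImageURLM:chararray, ImageURLL:chararray);
Ratings = LOAD '/user/pradeep/BX-Book-Ratings.csv' USING PigStorage(';') AS (user:chararray, ISBN:chararray, rating:chararray);

-- Filter first so that only books from 2002 are joined.
Books2002 = FILTER Books BY YearOfPublication == '2002';

-- Inner join on the shared ISBN column.
Joined = JOIN Books2002 BY ISBN, Ratings BY ISBN;

-- Pick fields by qualified name rather than by position.
Result = FOREACH Joined GENERATE Books2002::ISBN AS ISBN, Books2002::BookTitle AS BookTitle, Books2002::YearOfPublication AS YearOfPublication, Ratings::rating AS rating;
DUMP Result;
```

Because this is an inner join, books from 2002 that have no rating do not appear in Result.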