2014-11-04

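Loading a CSV with a custom line terminator

I have a CSV like the one below, in which some records are terminated by "+++" instead of a newline. How can I load the CSV so that a line break is applied wherever the string "+++" occurs?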

VTS,51,0071,9739965515,NM,GP,INF01,V,19,072219,291014,0000.0000,N,00000.0000,E,07AE 

VTS,01,0097,9739965515,SP,GP,18,072253,V,0000.0000,N,00000.0000,E,0.0,0.0,291014,0000,00,4000,11,999,169,B205+++VTS,51,0071,9739965515,NM,GP,INF01,V,18,072311,291014,0000.0000,N,00000.0000,E,C24E+++VTS,01,0097,9739965515,NM,GP,19,072311,V,0000.0000,N,00000.0000,E,0.0,0.0,291014,0000,00,4000,11,999,171,B358 

VTS,51,0071,9739965515,NM,GP,INF01,V,18,072319,291014,0000.0000,N,00000.0000,E,012F 
VTS,51,0071,9739965515,NM,GP,INF01,V,19,072326,291014,0000.0000,N,00000.0000,E,B2E6+++VTS,01,0097,9739965515,NM,GP,18,072326,V,0000.0000,N,00000.0000,E,0.0,0.0,291014,0000,00,4000,11,999,173,EAA0 
VTS,51,0071,9739965515,NM,GP,INF01,V,18,072333,291014,0000.0000,N,00000.0000,E,9896 
VTS,51,0071,9739965515,NM,GP,INF01,V,18,072340,291014,0000.0000,N,00000.0000,E,9B23 

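First, I need to split the data into records wherever a newline or a "+++" marker occurs, and load those records. Then I need to keep only the records whose second column has the value 01.

Expected output: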

VTS,01,0097,9739965515,SP,GP,18,072253,V,0000.0000,N,00000.0000,E,0.0,0.0,291014,0000,00,4000,11,999,169,B205 
VTS,01,0097,9739965515,NM,GP,19,072311,V,0000.0000,N,00000.0000,E,0.0,0.0,291014,0000,00,4000,11,999,171,B358 
VTS,01,0097,9739965515,NM,GP,18,072326,V,0000.0000,N,00000.0000,E,0.0,0.0,291014,0000,00,4000,11,999,173,EAA0 
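What is your expected output? – 2014-11-04 07:55:07

@SivasakthiJayaraman The expected output has been added. – 2014-11-04 08:16:50

I have updated the solution. Please verify it and let me know whether it works for you. – 2014-11-04 12:06:59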

Answer

Pig script:

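-- Load each physical line of the file as a single chararray field
A = LOAD 'input.csv' AS (line:chararray); 
-- Break each line into separate records at the '+++' marker
B = FOREACH A { 
       splitRow = TOKENIZE(line,'+++'); 
       GENERATE FLATTEN(splitRow) AS newList; 
       } 
-- Split each record on commas; 23 is the total number of columns in the 01 records
-- (this answer originally used 16 here, see the comments below)
C = FOREACH B GENERATE FLATTEN(STRSPLIT(newList,',',23)); 
-- Keep only the records whose second column is 01
D = FILTER C BY $1==01; 
DUMP D; 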

Output:

(VTS,01,0097,9739965515,SP,GP,18,072253,V,0000.0000,N,00000.0000,E,0.0,0.0,291014,0000,00,4000,11,999,169,B205) 
(VTS,01,0097,9739965515,NM,GP,19,072311,V,0000.0000,N,00000.0000,E,0.0,0.0,291014,0000,00,4000,11,999,171,B358) 
(VTS,01,0097,9739965515,NM,GP,18,072326,V,0000.0000,N,00000.0000,E,0.0,0.0,291014,0000,00,4000,11,999,173,EAA0) 
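To make the individual steps easier to follow, here is a rough Python model of what the script does. It is an illustration only, not Pig code: the function names `split_records` and `filter_by_second_column` are made up for this sketch, the sample is the first two non-empty lines of the question's data, and the sketch assumes that records are separated by the exact string "+++" and that the second column is compared as the text "01".

```python
# Python model of the Pig steps above (illustration only, not Pig code).
# SAMPLE holds the first two non-empty lines of the data from the question.
SAMPLE = (
    "VTS,51,0071,9739965515,NM,GP,INF01,V,19,072219,291014,0000.0000,N,00000.0000,E,07AE\n"
    "VTS,01,0097,9739965515,SP,GP,18,072253,V,0000.0000,N,00000.0000,E,0.0,0.0,"
    "291014,0000,00,4000,11,999,169,B205"
    "+++VTS,51,0071,9739965515,NM,GP,INF01,V,18,072311,291014,0000.0000,N,00000.0000,E,C24E"
    "+++VTS,01,0097,9739965515,NM,GP,19,072311,V,0000.0000,N,00000.0000,E,0.0,0.0,"
    "291014,0000,00,4000,11,999,171,B358\n"
)


def split_records(text, marker="+++"):
    """Relations A and B: read line by line, then break each line at the marker."""
    for line in text.splitlines():
        for record in line.split(marker):
            record = record.strip()
            if record:  # skip blank lines and empty pieces
                yield record


def filter_by_second_column(records, wanted="01", limit=23):
    """Relations C and D: split on commas, keep rows whose second field matches."""
    for record in records:
        # At most `limit` pieces, i.e. at most limit - 1 splits.
        fields = record.split(",", limit - 1)
        if len(fields) > 1 and fields[1] == wanted:
            yield fields


rows = list(filter_by_second_column(split_records(SAMPLE)))
for row in rows:
    print(",".join(row))
```

For this sample the sketch prints the two 01 records (the ones ending in B205 and B358), which correspond to the first two rows of the output above.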
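Can you explain the steps above, so that I can understand what is happening internally? – 2014-11-06 10:58:53

When I add an extra line, E = FOREACH D GENERATE $7, $15; DUMP E; the result I get is (060037,061114,0068,00,4000,00,999,149,9594) (060113,061114,0068,00,4000,00,999,152,B927). Why does it return all of the remaining fields as well, instead of just $7 and $15? – 2014-11-06 11:01:34

Change the line to "C = FOREACH B GENERATE FLATTEN(STRSPLIT(newList,',',23));", that is, use 23 instead of 16. The value is basically the total number of columns. I gave 16 by mistake. – 2014-11-06 18:07:23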
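The effect of that limit can be reproduced in a small Python sketch. It assumes the limit caps the number of pieces a record is split into, which is consistent with the result reported in the comment above; `limited_split` is a hypothetical helper written for this illustration, not a Pig function, and the record is the first 01 record from the question.

```python
# Illustration of a capped split (Python model, not Pig code).
# RECORD is the first 01 record from the question; it has 23 columns.
RECORD = (
    "VTS,01,0097,9739965515,SP,GP,18,072253,V,0000.0000,N,00000.0000,E,"
    "0.0,0.0,291014,0000,00,4000,11,999,169,B205"
)


def limited_split(text, delimiter=",", limit=0):
    """Split text into at most `limit` pieces; a limit of 0 means no cap."""
    if limit > 0:
        # Producing at most `limit` pieces means splitting at most limit - 1 times.
        return text.split(delimiter, limit - 1)
    return text.split(delimiter)


with_16 = limited_split(RECORD, ",", 16)
with_23 = limited_split(RECORD, ",", 23)

# With a cap of 16, the 16th piece (index 15) keeps everything that was left over.
print(with_16[7], "|", with_16[15])  # 072253 | 291014,0000,00,4000,11,999,169,B205
# With a cap of 23, every column gets its own piece.
print(with_23[7], "|", with_23[15])  # 072253 | 291014
```

With a cap of 16 the last piece swallows columns 16 to 23, which matches the shape of the unexpected result for $15; a cap equal to the column count keeps each column separate.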