
Loading CSV data into an HBase table with multiple columns using Flume. The spooling directory CSV file format (sample.csv):

8600000US00601,00601,006015-DigitZCTA,0063-DigitZCTA,11102 
8600000US00602,00602,006025-DigitZCTA,0063-DigitZCTA,12869 
8600000US00603,00603,006035-DigitZCTA,0063-DigitZCTA,12423 
8600000US00604,00604,006045-DigitZCTA,0063-DigitZCTA,33548 
8600000US00606,00606,006065-DigitZCTA,0063-DigitZCTA,10603 

My flume.conf:

agent.sources = spool 
agent.channels = fileChannel2 
agent.sinks = sink2 

agent.sources.spool.type = spooldir 
agent.sources.spool.spoolDir = /home/cloudera/cloudera 
agent.sources.spool.fileSuffix = .completed 
agent.sources.spool.channels = fileChannel2 
#agent.sources.spool.deletePolicy = immediate 

agent.sinks.sink2.type = org.apache.flume.sink.hbase.HBaseSink 
agent.sinks.sink2.channel = fileChannel2 
agent.sinks.sink2.table = sample 
agent.sinks.sink2.columnFamily = s1 
agent.sinks.sink2.serializer = org.apache.flume.sink.hbase.RegexHbaseEventSerializer 
agent.sinks.sink1.serializer.regex = ^([^,]+),([^,]+),([^,]+),([^,]+)$ 
#agent.sinks.sink2.serializer.regexIgnoreCase = true 
agent.sinks.sink1.serializer.colNames =col1,col2,col3,col4 
agent.sinks.sink2.batchSize = 100 
agent.channels.fileChannel2.type=memory 

I am able to load the data into a single column using Flume, but I cannot get the regex right to split it across multiple columns. Any help loading it into multiple HBase columns would be appreciated. Thanks.
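
To make the goal concrete, this is roughly what I want one CSV line to end up as in HBase, written as manual hbase shell puts (the row key and column qualifiers here are only illustrative, not something Flume produces on its own):

# Illustrative target layout for the first sample line, using table 'sample' and family 's1'
hbase shell <<'EOF'
put 'sample', '8600000US00601', 's1:col2', '00601'
put 'sample', '8600000US00601', 's1:col3', '006015-DigitZCTA'
put 'sample', '8600000US00601', 's1:col4', '0063-DigitZCTA'
put 'sample', '8600000US00601', 's1:col5', '11102'
EOF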

Did you ever get your answer? – 2015-07-13 06:46:03

If you have the answer, please share. Thanks. – sayan 2015-08-26 06:35:34

I have the same problem :( Please share!!! – akaliza 2015-12-23 11:08:47

Answers


I found the answer: there was a problem with the regex in my code above.

I fixed it by correcting the regex:

agent.sources = spool 
agent.channels = fileChannel2 
agent.sinks = sink2 

agent.sources.spool.type = spooldir 
agent.sources.spool.spoolDir = /home/cloudera/cloudera 

#agent.sources.spool.type = exec 
#agent.sources.spool.command = tail -F /home/cloudera/cloudera/data.csv 
agent.sources.spool.fileSuffix = .completed 
agent.sources.spool.channels = fileChannel2 
#agent.sources.spool.deletePolicy = immediate 

agent.sinks.sink2.type = org.apache.flume.sink.hbase.HBaseSink
agent.sinks.sink2.channel = fileChannel2
agent.sinks.sink2.table = sample
agent.sinks.sink2.columnFamily = s1
agent.sinks.sink2.serializer = org.apache.flume.sink.hbase.RegexHbaseEventSerializer
agent.sinks.sink2.serializer.regex = (.+),(.+),(.+),(.+),(.+)
agent.sinks.sink2.serializer.rowKeyIndex = 0
agent.sinks.sink2.serializer.colNames = ROW_KEY,col2,col3,col4,col5
agent.sinks.sink2.serializer.regexIgnoreCase = true
#agent.channels.fileChannel2.type = FILE
agent.sinks.sink2.batchSize = 100
agent.channels.fileChannel2.type = memory
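
For completeness, this is a minimal sketch of how to run it, assuming the configuration above is saved as conf/flume.conf and the agent is named agent (the HBase sink does not create the table, so it has to exist first):

# Create the target table and column family named in the sink config
echo "create 'sample', 's1'" | hbase shell

# Start the agent against the config above
flume-ng agent --conf conf --conf-file conf/flume.conf --name agent -Dflume.root.logger=INFO,console

# After a file in the spool directory has been renamed to *.completed, check the rows
echo "scan 'sample'" | hbase shell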

Something like this works for me:

agent.sinks.s1.type = hbase 
agent.sinks.s1.table = test 
agent.sinks.s1.columnFamily = r 
agent.sinks.s1.serializer = org.apache.flume.sink.hbase.RegexHbaseEventSerializer 
agent.sinks.s1.serializer.rowKeyIndex = 0 
agent.sinks.s1.serializer.regex = ^(\\S+),(\\d+),(\\d+),(\\d)$ 
agent.sinks.s1.serializer.colNames = ROW_KEY,r:colA,r:colB,r:colC 

If you want to specify the row key yourself instead of having a random one generated, you can use:

agent.sinks.s1.serializer.rowKeyIndex = 0 
agent.sinks.s1.serializer.colNames = ROW_KEY,r:colA,r:colB,r:colC 
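
For illustration, the capture groups map onto colNames in order, and the group at rowKeyIndex becomes the row key. With a made-up input line that matches the regex above (grep -E needs POSIX classes instead of \S and \d, so this is only an approximate check):

# Hypothetical line for the pattern ^(\S+),(\d+),(\d+),(\d)$  (invented for illustration)
echo 'host1,100,200,3' | grep -E '^([^[:space:]]+),([0-9]+),([0-9]+),([0-9])$'
# With rowKeyIndex = 0 and colNames = ROW_KEY,r:colA,r:colB,r:colC the groups land as:
#   group 1 "host1" -> row key
#   group 2 "100"   -> r:colA
#   group 3 "200"   -> r:colB
#   group 4 "3"     -> r:colC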

Here is a link if you want more flexibility: http://www.rittmanmead.com/2014/05/trickle-feeding-log-data-into-hbase-using-flume/

In short, I think it was because the regex was incorrect.
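
A quick way to see this is to test both patterns against one of the sample lines; grep -E is used here only as a stand-in for the Java regex engine, so treat it as a rough check:

line='8600000US00601,00601,006015-DigitZCTA,0063-DigitZCTA,11102'

# Pattern from the question: four [^,]+ groups can never cover a five-field line
echo "$line" | grep -E '^([^,]+),([^,]+),([^,]+),([^,]+)$' || echo 'no match'

# Corrected pattern shown above: five groups, one per field
echo "$line" | grep -E '(.+),(.+),(.+),(.+),(.+)' && echo 'match'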

This works fine for me: agent.sinks.sink2.serializer.regex = (.+),(.+),(.+),(.+),(.+) – RaviJ 2016-09-22 12:45:52