2012-01-13

I set up a test integration of Cassandra + Pig/Hadoop: 8 nodes are Cassandra + TaskTracker nodes, and 1 node is the JobTracker/NameNode. With Pig integrated against Cassandra, a trivial distributed query takes several minutes to complete. Is that normal?

I launched the Cassandra CLI and created the small bit of sample data listed in the Cassandra distribution's README.txt:

[default@unknown] create keyspace Keyspace1; 
[default@unknown] use Keyspace1; 
[default@Keyspace1] create column family Users with comparator=UTF8Type and default_validation_class=UTF8Type and key_validation_class=UTF8Type; 
[default@Keyspace1] set Users[jsmith][first] = 'John'; 
[default@Keyspace1] set Users[jsmith][last] = 'Smith'; 
[default@Keyspace1] set Users[jsmith][age] = long(42); 

Then I ran the sample Pig query listed in CASSANDRA_HOME (using pig_cassandra):

grunt> rows = LOAD 'cassandra://Keyspace1/Users' USING CassandraStorage() AS (key, columns: bag {T: tuple(name, value)}); 
grunt> cols = FOREACH rows GENERATE flatten(columns); 
grunt> colnames = FOREACH cols GENERATE $0; 
grunt> namegroups = GROUP colnames BY (chararray) $0; 
grunt> namecounts = FOREACH namegroups GENERATE COUNT($1), group; 
grunt> orderednames = ORDER namecounts BY $0; 
grunt> topnames = LIMIT orderednames 50; 
grunt> dump topnames; 
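For reference, the script just counts how often each column name occurs across all rows, then orders and limits the result. A minimal Python sketch of the same logic (using the single sample row created above; the data is assumed from the question, not read from Cassandra) might look like this:

```python
from collections import Counter

# One row as (row_key, [(column_name, value), ...]) -- mirrors the 'rows'
# relation loaded from cassandra://Keyspace1/Users above.
rows = [("jsmith", [("first", "John"), ("last", "Smith"), ("age", 42)])]

# FOREACH/flatten: pull every column name out of every row's bag of columns.
colnames = [name for _key, columns in rows for name, _value in columns]

# GROUP colnames BY name + COUNT, then ORDER BY count and LIMIT 50.
namecounts = Counter(colnames)
topnames = sorted(namecounts.items(), key=lambda kv: kv[1])[:50]
print(topnames)
```

The three resulting (count, name) records line up with the "Successfully stored 3 records" line in the job output below.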

It took about 3 minutes to complete.

HadoopVersion PigVersion  UserId StartedAt    FinishedAt       Features 
    1.0.0    0.9.1   root 2012-01-12  22:16:53  2012-01-12 22:20:22  GROUP_BY,ORDER_BY,LIMIT 
Success! 

Job Stats (time in seconds): 
JobId Maps Reduces MaxMapTime  MinMapTIme  AvgMapTime  MaxReduceTime MinReduceTime AvgReduceTime Alias Feature Outputs 
job_201201121817_0010 8  1  12  6  9  21  21  21  colnames,cols,namecounts,namegroups,rows  GROUP_BY,COMBINER  
job_201201121817_0011 1  1  6  6  6  15  15  15  orderednames SAMPLER 
job_201201121817_0012 1  1  9  9  9  15  15  15  orderednames ORDER_BY,COMBINER  hdfs://xxxx/tmp/temp-744158198/tmp-1598279340, 

Input(s): 
Successfully read 1 records (3232 bytes) from: "cassandra://Keyspace1/Users" 

Output(s): 
Successfully stored 3 records (63 bytes) in: "hdfs://xxxx/tmp/temp-744158198/tmp-1598279340" 

Counters: 
Total records written : 3 
Total bytes written : 63 
Spillable Memory Manager spill count : 0 
Total bags proactively spilled: 0 
Total records proactively spilled: 0 

There were no errors or warnings in the logs.

Is this normal, or is something wrong?

Answer

Yes, this is normal: launching a Map/Reduce job on Hadoop typically takes on the order of a minute just to start up, and Pig generates multiple Map/Reduce jobs depending on the complexity of the script.
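Some rough arithmetic over the Job Stats table in the question supports this: the times below are copied from that output, and summing each job's maximum map and reduce task times suggests most of the wall-clock time is job startup and scheduling rather than actual map/reduce work (a back-of-the-envelope estimate, since tasks overlap within a job):

```python
from datetime import datetime

# StartedAt / FinishedAt timestamps copied from the Pig summary above.
started = datetime(2012, 1, 12, 22, 16, 53)
finished = datetime(2012, 1, 12, 22, 20, 22)
wall_seconds = (finished - started).total_seconds()  # 209.0

# (MaxMapTime, MaxReduceTime) per job, in seconds, from the Job Stats table.
jobs = [(12, 21),  # GROUP_BY,COMBINER job
        (6, 15),   # SAMPLER job
        (9, 15)]   # ORDER_BY,COMBINER job
compute_seconds = sum(m + r for m, r in jobs)        # 78

# Everything else is overhead spread across launching the 3 chained jobs.
overhead_seconds = wall_seconds - compute_seconds    # ~131 s
print(wall_seconds, compute_seconds, overhead_seconds)
```

Roughly 131 of the 209 seconds are unaccounted for by task time, i.e. about 40-45 seconds of launch overhead per job, which matches the "about a minute to start a job" rule of thumb.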
