2015-02-23 54 views
1

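How do I populate a related table in Cassandra using CQL?

I am trying to practice Cassandra using this example (under the Composite Columns section).

So I created the tweets table, which looks like this: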

cqlsh:twitter> SELECT * from tweets; 

 tweet_id                             | author      | body
--------------------------------------+-------------+--------------
 73954b90-baf7-11e4-a7d0-27983e9e7f51 | gwashington | I chopped...

(1 rows) 

Now I want to populate timeline, which is a related table, using CQL, and I am not sure how to do it. I tried the SQL approach, but it did not work:

cqlsh:twitter> INSERT INTO timeline (user_id, tweet_id, author, body) SELECT 'gmason', 73954b90-baf7-11e4-a7d0-27983e9e7f51, author, body FROM tweets WHERE tweet_id = 73954b90-baf7-11e4-a7d0-27983e9e7f51; 
Bad Request: line 1:55 mismatched input 'select' expecting K_VALUES 

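So I have two questions:

  1. How do I populate the timeline table so that it relates to tweets?
  2. How do I make sure the physical layout of timeline is created as shown in that example?

Thanks.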

Edit:

This explains my question #2 above (picture taken from here):

Answers

3

tldr;

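  1. Use the cqlsh COPY command to export tweets, modify the file, then use COPY to import it into timeline.

  2. Use cassandra-cli to verify the physical structure.

Long version...

  1. I would go about this a different way; I think it will probably be easier to use the native COPY command in cqlsh.

I followed examples similar to those found here. After creating the tweets and timeline tables in cqlsh, I inserted rows into tweets as directed. My tweets table then looked like this: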

cqlsh:stackoverflow> SELECT * FROM tweets;

tweet_id        | author  | body 
--------------------------------------+-------------+--------------------------------------------------------------------------------------------------------------------------------------------------- 
05a5f177-f070-486d-b64d-4e2bb28eaecc |  gmason | Those gentlemen, who will be elected senators, will fix themselves in the federal town, and become citizens of that town more than of your state. 
b67fe644-4dbe-489b-bc71-90f809f88636 | jmadison |                     All men having power ought to be distrusted to a certain degree. 
819d95e9-356c-4bd5-9ad0-8cd36a7aa5e1 | gwashington |                 To be prepared for war is one of the most effectual means of preserving peace. 

I then exported them like this:

cqlsh:stackoverflow> COPY tweets TO '/home/aploetz/tweets_20150223.txt'
WITH DELIMITER='|' AND HEADER=true; 

3 rows exported in 0.052 seconds. 

I then edited the tweets_20150223.txt file, adding a user_id column at the front and duplicating a few rows, like this:

userid|tweet_id|author|body 
gmason|05a5f177-f070-486d-b64d-4e2bb28eaecc|gmason|Those gentlemen, who will be elected senators, will fix themselves in the federal town, and become citizens of that town more than of your state. 
jmadison|b67fe644-4dbe-489b-bc71-90f809f88636|jmadison|All men having power ought to be distrusted to a certain degree. 
gwashington|819d95e9-356c-4bd5-9ad0-8cd36a7aa5e1|gwashington|To be prepared for war is one of the most effectual means of preserving peace. 
jmadison|819d95e9-356c-4bd5-9ad0-8cd36a7aa5e1|gwashington|To be prepared for war is one of the most effectual means of preserving peace. 
ahamilton|819d95e9-356c-4bd5-9ad0-8cd36a7aa5e1|gwashington|To be prepared for war is one of the most effectual means of preserving peace. 
ahamilton|05a5f177-f070-486d-b64d-4e2bb28eaecc|gmason|Those gentlemen, who will be elected senators, will fix themselves in the federal town, and become citizens of that town more than of your state. 

I saved the file as timeline_20150223.txt and imported it into the timeline table, like this:

cqlsh:stackoverflow> COPY timeline FROM '/home/aploetz/timeline_20150223.txt'
WITH DELIMITER='|' AND HEADER=true; 

6 rows imported in 0.016 seconds. 
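
The manual editing step above (prepending a user_id column and duplicating rows for other users) could also be scripted. The following is only a sketch: the function name `build_timeline_rows` and the `followers` mapping are hypothetical and not part of the original answer, and it assumes the export has the header `tweet_id|author|body` as produced by COPY with HEADER=true.

```python
import csv
import io


def build_timeline_rows(tweets_csv, followers):
    """Turn a pipe-delimited tweets export into pipe-delimited timeline text.

    tweets_csv: text of the COPY export (header: tweet_id|author|body).
    followers:  hypothetical mapping of author -> users who should also
                get a copy of that author's tweets in their timeline.
    """
    reader = csv.DictReader(io.StringIO(tweets_csv), delimiter="|")
    out = io.StringIO()
    writer = csv.writer(out, delimiter="|", lineterminator="\n")
    writer.writerow(["user_id", "tweet_id", "author", "body"])
    for row in reader:
        # The author sees their own tweet; each follower gets a duplicate row.
        for user_id in [row["author"], *followers.get(row["author"], [])]:
            writer.writerow([user_id, row["tweet_id"], row["author"], row["body"]])
    return out.getvalue()


# Small in-memory demo using one row from the export above.
export_text = (
    "tweet_id|author|body\n"
    "819d95e9-356c-4bd5-9ad0-8cd36a7aa5e1|gwashington|"
    "To be prepared for war is one of the most effectual means of preserving peace.\n"
)
timeline_text = build_timeline_rows(
    export_text, {"gwashington": ["jmadison", "ahamilton"]}
)
```

The resulting text can be written to a file and loaded with COPY timeline FROM, exactly as in the manual version.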
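
  2. Yes, timeline will be a wide-row table, partitioned on user_id and then clustered on tweet_id. I verified the "under the hood" structure by running the cassandra-cli tool and listing the timeline column family (table). Here you can see how the rows are partitioned by user_id, and how each column has the tweet_id UUID as part of its name: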

    [[email protected]] list timeline; 
    Using default limit of 100 
    Using default cell limit of 100 
    ------------------- 
    RowKey: ahamilton 
    => (name=05a5f177-f070-486d-b64d-4e2bb28eaecc:, value=, timestamp=1424707827585904) 
    => (name=05a5f177-f070-486d-b64d-4e2bb28eaecc:author, value=676d61736f6e, timestamp=1424707827585904) 
    => (name=05a5f177-f070-486d-b64d-4e2bb28eaecc:body, value=54686f73652067656e746c656d656e2c2077686f2077696c6c20626520656c65637465642073656e61746f72732c2077696c6c20666978207468656d73656c76657320696e20746865206665646572616c20746f776e2c20616e64206265636f6d6520636974697a656e73206f66207468617420746f776e206d6f7265207468616e206f6620796f75722073746174652e, timestamp=1424707827585904) 
    => (name=819d95e9-356c-4bd5-9ad0-8cd36a7aa5e1:, value=, timestamp=1424707827585715) 
    => (name=819d95e9-356c-4bd5-9ad0-8cd36a7aa5e1:author, value=6777617368696e67746f6e, timestamp=1424707827585715) 
    => (name=819d95e9-356c-4bd5-9ad0-8cd36a7aa5e1:body, value=546f20626520707265706172656420666f7220776172206973206f6e65206f6620746865206d6f73742065666665637475616c206d65616e73206f662070726573657276696e672070656163652e, timestamp=1424707827585715) 
    ------------------- 
    RowKey: gmason 
    => (name=05a5f177-f070-486d-b64d-4e2bb28eaecc:, value=, timestamp=1424707827585150) 
    => (name=05a5f177-f070-486d-b64d-4e2bb28eaecc:author, value=676d61736f6e, timestamp=1424707827585150) 
    => (name=05a5f177-f070-486d-b64d-4e2bb28eaecc:body, value=54686f73652067656e746c656d656e2c2077686f2077696c6c20626520656c65637465642073656e61746f72732c2077696c6c20666978207468656d73656c76657320696e20746865206665646572616c20746f776e2c20616e64206265636f6d6520636974697a656e73206f66207468617420746f776e206d6f7265207468616e206f6620796f75722073746174652e, timestamp=1424707827585150) 
    ------------------- 
    RowKey: gwashington 
    => (name=819d95e9-356c-4bd5-9ad0-8cd36a7aa5e1:, value=, timestamp=1424707827585475) 
    => (name=819d95e9-356c-4bd5-9ad0-8cd36a7aa5e1:author, value=6777617368696e67746f6e, timestamp=1424707827585475) 
    => (name=819d95e9-356c-4bd5-9ad0-8cd36a7aa5e1:body, value=546f20626520707265706172656420666f7220776172206973206f6e65206f6620746865206d6f73742065666665637475616c206d65616e73206f662070726573657276696e672070656163652e, timestamp=1424707827585475) 
    ------------------- 
    RowKey: jmadison 
    => (name=819d95e9-356c-4bd5-9ad0-8cd36a7aa5e1:, value=, timestamp=1424707827585597) 
    => (name=819d95e9-356c-4bd5-9ad0-8cd36a7aa5e1:author, value=6777617368696e67746f6e, timestamp=1424707827585597) 
    => (name=819d95e9-356c-4bd5-9ad0-8cd36a7aa5e1:body, value=546f20626520707265706172656420666f7220776172206973206f6e65206f6620746865206d6f73742065666665637475616c206d65616e73206f662070726573657276696e672070656163652e, timestamp=1424707827585597) 
    => (name=b67fe644-4dbe-489b-bc71-90f809f88636:, value=, timestamp=1424707827585348) 
    => (name=b67fe644-4dbe-489b-bc71-90f809f88636:author, value=6a6d616469736f6e, timestamp=1424707827585348) 
    => (name=b67fe644-4dbe-489b-bc71-90f809f88636:body, value=416c6c206d656e20686176696e6720706f776572206f7567687420746f206265206469737472757374656420746f2061206365727461696e206465677265652e, timestamp=1424707827585348) 
    
    4 Rows Returned. 
    Elapsed time: 35 msec(s). 
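
The `value=` fields in the cassandra-cli listing are hex-encoded bytes. To confirm that they hold the expected text, they can be decoded; this is a minimal sketch, assuming the cells are UTF-8 text, and `decode_cell` is a hypothetical helper name.

```python
def decode_cell(hex_value):
    """Decode a hex-encoded text cell as printed by cassandra-cli."""
    return bytes.fromhex(hex_value).decode("utf-8")


# Values copied from the listing above.
author = decode_cell("676d61736f6e")
body = decode_cell(
    "416c6c206d656e20686176696e6720706f776572206f7567687420746f20626520"
    "6469737472757374656420746f2061206365727461696e206465677265652e"
)
```

Here `author` decodes to the author column stored under RowKey ahamilton and gmason, and `body` to the jmadison tweet text.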
    
Nicely written! +1 – 2015-02-24 05:58:44

Very detailed, thank you! – jazzblue 2015-02-24 16:37:42

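2

  1. To do this you would need an ETL tool, such as Hadoop or Spark. There is no INSERT/SELECT in CQL, and that is for a reason. In the real world you would execute 2 inserts from your application, one per table.

  2. You will have to trust that when you use a primary key with a partition key and a clustering key, the data will be stored in wide-row format.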
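
The "one insert per table" approach from point 1 could look roughly like this on the application side. It is a sketch under assumptions: the helper name `statements_for_tweet` and the `follower_ids` argument are hypothetical, the column lists are taken from the question, and the `%s` placeholders follow the style the DataStax Python driver accepts for simple statements. The code only builds the statements; executing them is left to whichever client you use.

```python
INSERT_TWEET = "INSERT INTO tweets (tweet_id, author, body) VALUES (%s, %s, %s)"
INSERT_TIMELINE = (
    "INSERT INTO timeline (user_id, tweet_id, author, body) "
    "VALUES (%s, %s, %s, %s)"
)


def statements_for_tweet(tweet_id, author, body, follower_ids):
    """Return (cql, params) pairs: one tweets insert plus one timeline
    insert for the author and for each follower (hypothetical fan-out)."""
    statements = [(INSERT_TWEET, (tweet_id, author, body))]
    for user_id in [author, *follower_ids]:
        statements.append((INSERT_TIMELINE, (user_id, tweet_id, author, body)))
    return statements
```

Each pair can then be passed to the client's execute call, so the tweet is written to tweets once and to timeline once per user who should see it.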

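Thanks, Roman. Also, regarding my question #2, I edited the question above to add a picture of the expected physical layout of timeline. Do you know how the "timeline" table gets organized this way automatically, as wide rows? Thanks. – jazzblue 2015-02-23 15:41:44

Bryce did an awesome job on this answer, while I was too busy on my first day at a new job :) – 2015-02-24 05:57:59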