2015-10-16 82 views

PySpark - Convert an RDD into a key-value pair RDD whose values are lists

I have an RDD whose elements are tuples of the form:

[("a1","b1","c1","d1","e1"), ("a2","b2","c2","d2","e2"), ... 

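I want to transform it into a key-value pair RDD, where the first field is the first string (the key) and the second field is a list of the remaining strings (the value). That is, I want to turn it into the following form: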

[("a1",["b1","c1","d1","e1"]), ("a2",["b2","c2","d2","e2"]), ... 

Answer

>>> rdd = sc.parallelize([("a1","b1","c1","d1","e1"), ("a2","b2","c2","d2","e2")])
>>> result = rdd.map(lambda x: (x[0], list(x[1:])))
>>> print(result.collect())
[('a1', ['b1', 'c1', 'd1', 'e1']), ('a2', ['b2', 'c2', 'd2', 'e2'])]

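Explanation of lambda x: (x[0], list(x[1:])):

  1. x[0] makes the first element of the input tuple the first element of the output (the key).
  2. x[1:] takes every element except the first and puts them in the second element of the output (the value).
  3. list(x[1:]) forces that value to be a list, because slicing a tuple returns a tuple by default.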
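The three steps above can be checked without a Spark cluster. The following is a minimal sketch in plain Python, assuming an ordinary list stands in for the RDD; the function name to_key_value is hypothetical and is not part of the original answer.

```python
def to_key_value(record):
    """Split a tuple into (first field, list of the remaining fields)."""
    # record[0] is the key; record[1:] is a tuple slice, so wrap it in list()
    return (record[0], list(record[1:]))

# Plain Python list used here in place of the RDD (an assumption for illustration)
rows = [("a1", "b1", "c1", "d1", "e1"), ("a2", "b2", "c2", "d2", "e2")]

sliced = rows[0][1:]                      # slicing a tuple yields a tuple
pairs = [to_key_value(r) for r in rows]   # same transformation the lambda performs

print(type(sliced).__name__)
print(pairs)
```

With a real RDD, the same function could be passed to map in place of the lambda, i.e. rdd.map(to_key_value).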

Exactly what I needed, thanks! – nikos