
I have the following list of lists of tuples in PySpark:

[('HOMICIDE', [('2017', 1)]), 
('DECEPTIVE PRACTICE', [('2017', 14), ('2016', 14), ('2015', 10), ('2013', 4), ('2014', 3)]), 
('ROBBERY', [('2017', 1)])] 

How do I convert it into the following?

[('HOMICIDE', ('2017', 1)), 
('DECEPTIVE PRACTICE', ('2015', 10)), 
('DECEPTIVE PRACTICE', ('2014', 3)), 
('DECEPTIVE PRACTICE', ('2017', 14)), 
('DECEPTIVE PRACTICE', ('2016', 14))] 

When I try to use map as shown below, it throws "AttributeError: 'list' object has no attribute 'map'":

rdd = sc.parallelize([('HOMICIDE', [('2017', 1)]), ('DECEPTIVE PRACTICE', [('2017', 14), ('2016', 14), ('2015', 10), ('2013', 4), ('2014', 3)])]) 
y = rdd.map(lambda x : (x[0],tuple(x[1]))) 
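
(For context: that AttributeError usually means map was called on the plain Python list itself rather than on an RDD. The snippet below is only an illustrative sketch of that failure mode; the variable name data is hypothetical.)

data = [('HOMICIDE', [('2017', 1)]), ('ROBBERY', [('2017', 1)])] 
# data.map(...) fails because map is an RDD method, not a list method: 
# AttributeError: 'list' object has no attribute 'map' 
rdd = sc.parallelize(data)   # wrapping the list in an RDD makes map/flatMap available 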

Answers


map is a method of an RDD, not of a Python list, so you need to parallelize your list first. Then you can use flatMap to flatten the inner lists:

rdd = sc.parallelize([('HOMICIDE', [('2017', 1)]), 
         ('DECEPTIVE PRACTICE', [('2017', 14), ('2016', 14), ('2015', 10), ('2013', 4), ('2014', 3)]), 
         ('ROBBERY', [('2017', 1)])]) 

rdd.flatMap(lambda x: [(x[0], y) for y in x[1]]).collect() 

# [('HOMICIDE', ('2017', 1)), 
# ('DECEPTIVE PRACTICE', ('2017', 14)), 
# ('DECEPTIVE PRACTICE', ('2016', 14)), 
# ('DECEPTIVE PRACTICE', ('2015', 10)), 
# ('DECEPTIVE PRACTICE', ('2013', 4)), 
# ('DECEPTIVE PRACTICE', ('2014', 3)), 
# ('ROBBERY', ('2017', 1))] 
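
An equivalent option, assuming the same rdd as above, is flatMapValues, which applies a function to each value of a (key, value) pair and re-pairs every yielded element with the original key; a minimal sketch:

# flatMapValues keeps the key and flattens the inner list of (year, count) tuples 
rdd.flatMapValues(lambda values: values).collect() 
# same flattened output as the flatMap version above 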

Got it working... thanks for the info... –


What about a list comprehension instead?

y = [(x[0], i) for x in rdd for i in x[1]] 

This returns:

[('HOMICIDE', ('2017', 1)), ('DECEPTIVE PRACTICE', ('2017', 14)), ('DECEPTIVE PRACTICE', ('2016', 14)), ('DECEPTIVE PRACTICE', ('2015', 10)), ('DECEPTIVE PRACTICE', ('2013', 4)), ('DECEPTIVE PRACTICE', ('2014', 3))] 
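
Note that this comprehension runs on the driver, so if the input is actually an RDD (as the sc.parallelize call in the question suggests) you would first need to bring it back with collect(); a sketch assuming the rdd defined in the question:

# collect() returns the RDD's contents as a plain Python list on the driver, 
# after which the same comprehension applies 
y = [(x[0], i) for x in rdd.collect() for i in x[1]] 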

It works fine in plain Python, but when I use PySpark I have to move the data to disk... I think it's my bad for not mentioning sc.parallelize in my question. Thanks @asongtoruin –


@SachinSukumaran My bad! Anyway, the other answer seems to have you covered. – asongtoruin