How can I use one column as an index to look up words in another column with SparkSQL?

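My DataFrame has a `words` column holding an array of words and a `top5` column holding an array of indexes.

I want to use the list of indexes in `top5` to find the corresponding words.

For example, if in the first row `words` is [I, am, a, student, how, about, you] and `top5` is [5, 4, 0, 1, 2], then I want a new column containing the words from `words` whose indexes are the numbers in `top5`, so the result is I, am, a, how, about. How can I do this?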

Source

2017-04-06 Liu Chong

Since the number of values in `top5` is fixed, you can simply use bracket notation or `getItem`. Using the example from the question:

from pyspark.sql.functions import col, array 

df = sc.parallelize([ 
    (["I", "am", "a", "student", "how", "about", "you"], [5, 4, 0, 1, 2]) 
]).toDF(["words", "top5"])

you can:

df.select([col("words")[col("top5")[i]] for i in range(5)])

or:

df.select([col("words").getItem(col("top5")[i]) for i in range(5)])

Both give the same result:

+--------------+--------------+--------------+--------------+--------------+
|words[top5[0]]|words[top5[1]]|words[top5[2]]|words[top5[3]]|words[top5[4]]|
+--------------+--------------+--------------+--------------+--------------+
|         about|           how|             I|            am|             a|
+--------------+--------------+--------------+--------------+--------------+

If you want a single array column, wrap the expressions above with the `array` function:

df.select(array(*[ 
    col("words").getItem(col("top5")[i]) for i in range(5) 
]).alias("top5mapped"))

+----------------------+
|top5mapped            |
+----------------------+
|[about, how, I, am, a]|
+----------------------+
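The column expressions above rely on `top5` always holding exactly five indexes. For cases where that does not hold, the lookup itself can be sketched in plain Python. The helper name `pick_by_index` and its choice to return `None` for out-of-range indexes are assumptions made for this illustration, not part of the answer.

```python
from typing import List, Optional, Sequence


def pick_by_index(words: Sequence[str], indexes: Sequence[int]) -> List[Optional[str]]:
    """Return words[i] for each i in indexes, in the order given by indexes.

    Indexes outside 0..len(words)-1 yield None instead of raising. This is an
    assumption for the sketch; negative indexes also count as out of range.
    """
    return [words[i] if 0 <= i < len(words) else None for i in indexes]


words = ["I", "am", "a", "student", "how", "about", "you"]

# Lookup in the order given by top5, matching the Spark output above.
print(pick_by_index(words, [5, 4, 0, 1, 2]))          # ['about', 'how', 'I', 'am', 'a']

# Sorting the indexes first reproduces the order written in the question.
print(pick_by_index(words, sorted([5, 4, 0, 1, 2])))  # ['I', 'am', 'a', 'how', 'about']

# An index list of a different length, with one index out of range.
print(pick_by_index(words, [6, 9]))                   # ['you', None]
```

In PySpark, a function like this could be wrapped with `pyspark.sql.functions.udf` and a return type of `ArrayType(StringType())`. That would likely be slower than the built-in column expressions, because each row has to pass through the Python interpreter.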

Source

2017-04-06 13:31:50 user6910411

I can offer a solution in Scala. I hope it helps.

I assume you have the data in a DataFrame named df.

import scala.collection.mutable.ListBuffer

val result = df.rdd // gives you an RDD of Row
  .map { row =>
    val id = row.getString(0) // first column
    val words = row.getAs[Seq[String]]("words").toArray // second column
    val top5 = row.getAs[Seq[Int]]("top5").toArray // third column

    val requiredValues = new ListBuffer[String]() // to store the result

    top5.foreach(x => requiredValues += words(x)) // look up the word in "words" for every index in "top5"

    (id, words, top5, requiredValues.toArray)
  }

Source

2017-04-06 11:58:34

# prepare data =>
data = [[12345, ['I', 'am', 'a', 'student', 'how', 'about', 'you'], [5, 4, 0, 1, 2]],
        [12346, ['And', 'I', 'want', 'to', 'use', 'the', 'list', 'of', 'indexes'], [1, 2, 5, 6, 7]],
        [365464, ['whose', 'index', 'to', 'the', 'number', 'of'], [4, 0, 2, 3, 1]]
        ]

# assumes an existing SparkContext named sc, as in the first answer
rdd = sc.parallelize(data)

def getdata(row):
    words = row[1]
    top5 = row[2]
    # look up the word at each index in top5 and append the list as a new field
    result = [[words[index] for index in top5]]
    return row + result

rdd.map(getdata).collect()
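To check what this mapping returns for each sample row without a running Spark context, the same per-row logic can be applied to the plain Python list. This is only a local sketch; the function name `with_top5_words` is hypothetical and stands in for `getdata`.

```python
# Same sample rows as in the answer: [id, words, top5]
data = [
    [12345, ["I", "am", "a", "student", "how", "about", "you"], [5, 4, 0, 1, 2]],
    [12346, ["And", "I", "want", "to", "use", "the", "list", "of", "indexes"], [1, 2, 5, 6, 7]],
    [365464, ["whose", "index", "to", "the", "number", "of"], [4, 0, 2, 3, 1]],
]


def with_top5_words(row):
    """Return the row with a fourth field: the words found at the top5 indexes."""
    row_id, words, top5 = row
    return [row_id, words, top5, [words[i] for i in top5]]


# A list comprehension plays the role of rdd.map(...).collect() here.
results = [with_top5_words(row) for row in data]

for row in results:
    print(row[0], row[3])
# 12345 ['about', 'how', 'I', 'am', 'a']
# 12346 ['I', 'want', 'the', 'list', 'of']
# 365464 ['number', 'whose', 'to', 'the', 'index']
```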

Source

2017-04-06 13:38:55 Pushkr
