How do I convert strings to integers for Spark MLlib in Scala?

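As far as I know, MLlib only supports integers.
So I want to convert strings to integers in Scala. For example, I have many reviewerID and productID values in a text file. How can I convert these strings to integers in Spark with Scala?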

reviewerID productID 
03905X0912 ZXASQWZXAS 
0325935ODD PDLFMBKGMS 
...

Source

2017-05-14 DaehyunPark

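Can you elaborate on what you want to do with the integers, and on "As far as I know, MLlib only supports integers"? Which algorithm will you use? It would be much easier to offer a solution to your **real** problem. Is it ALS, perhaps? Or another recommendation algorithm?

I will use the ALS algorithm (matrix factorization). – DaehyunPark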

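StringIndexer is the solution. It is an estimator that produces a transformer, so it fits into an ML pipeline. Essentially, once you set the input column, it counts the frequency of each category and numbers the categories starting from 0, with the most frequent category getting index 0. If needed, you can add IndexToString at the end of the pipeline to restore the original strings.

For more details, see the "Extracting, transforming and selecting features" section of the ML documentation.

In your case, it would look like this: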

import org.apache.spark.ml.feature.StringIndexer 

val indexer = new StringIndexer().setInputCol("productID").setOutputCol("productIndex") 
val indexed = indexer.fit(df).transform(df) 
indexed.show()
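Since the question is about ALS, both ID columns need indexing, not only productID. The following is a minimal sketch, assuming Spark 2.x, a local session, and a `rating` column that does not appear in the question; the object name, the `indexIds` helper, and the sample ratings are hypothetical. StringIndexer emits doubles, so the sketch casts the indices to `int` explicitly before passing them to ALS.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.ml.recommendation.ALS
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object IndexIdsForAls {
  // Adds integer columns reviewerIndex and productIndex derived from the string IDs.
  def indexIds(df: DataFrame): DataFrame = {
    val reviewerIndexer = new StringIndexer()
      .setInputCol("reviewerID").setOutputCol("reviewerIndexRaw")
    val productIndexer = new StringIndexer()
      .setInputCol("productID").setOutputCol("productIndexRaw")

    new Pipeline()
      .setStages(Array(reviewerIndexer, productIndexer))
      .fit(df)
      .transform(df)
      // StringIndexer emits doubles; cast explicitly so ALS sees integer IDs
      .withColumn("reviewerIndex", col("reviewerIndexRaw").cast("int"))
      .withColumn("productIndex", col("productIndexRaw").cast("int"))
      .drop("reviewerIndexRaw", "productIndexRaw")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]").appName("IndexIdsForAls").getOrCreate()
    import spark.implicits._

    // Hypothetical sample data; the rating values are made up for illustration
    val df = Seq(
      ("03905X0912", "ZXASQWZXAS", 5.0f),
      ("0325935ODD", "PDLFMBKGMS", 3.0f),
      ("03905X0912", "PDLFMBKGMS", 4.0f)
    ).toDF("reviewerID", "productID", "rating")

    val indexed = indexIds(df)

    // Small rank and iteration count, chosen only to keep the example fast
    val als = new ALS()
      .setUserCol("reviewerIndex")
      .setItemCol("productIndex")
      .setRatingCol("rating")
      .setRank(2)
      .setMaxIter(5)

    als.fit(indexed).transform(indexed).show()
    spark.stop()
  }
}
```

Because the same string always maps to the same index, a reviewer who appears in several rows keeps a single ID, which is what a matrix factorization needs.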

Source

2017-05-15 03:17:53 sourabh

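You can add a new column with a unique ID for each reviewerID (productID). You can add the new column in the following ways.

Using monotonicallyIncreasingId (named monotonically_increasing_id in current Spark versions):

import org.apache.spark.sql.functions.monotonically_increasing_id
import spark.implicits._

val data = spark.sparkContext.parallelize(Seq(
    ("123xyx", "ab"),
    ("123xyz", "cd")
)).toDF("reviewerID", "productID")
// monotonically_increasing_id() replaces monotonicallyIncreasingId, deprecated since Spark 2.0
data.withColumn("uniqueReviID", monotonically_increasing_id()).show()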

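Using zipWithUniqueId:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{LongType, StructField, StructType}

// Prepend a unique Long ID to every row
val rows = data.rdd.zipWithUniqueId.map {
    case (r: Row, id: Long) => Row.fromSeq(id +: r.toSeq)
}

// Extend the original schema with the new ID column
val finalDf = spark.createDataFrame(rows, StructType(StructField("uniqueRevID", LongType, false) +: data.schema.fields))

finalDf.show()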

You can also do this with row_number() in SQL syntax:

import spark.implicits._

val data = spark.sparkContext.parallelize(Seq(
    ("123xyx", "ab"),
    ("123xyz", "cd")
)).toDF("reviewerID", "productID")
// createOrReplaceTempView returns Unit, so register the view in a separate statement
data.createOrReplaceTempView("review")

val tmpTable1 = spark.sql(
    "select row_number() over (order by reviewerID) as id, reviewerID, productID from review")
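Note that the approaches above assign one ID per row, so a reviewerID that appears in several rows receives a different ID each time, and the generated values are 64-bit. For ALS you would likely want one ID per distinct string, within the Int range. The following is a minimal sketch of that idea, assuming Spark 2.x and a local session; the object name, the `buildLookup` helper, and the `<column>Int` naming are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object DistinctIdMapping {
  // Builds a lookup table with one consecutive Int ID per distinct string in `column`.
  // Assumes fewer than Int.MaxValue distinct values, otherwise idx.toInt overflows.
  def buildLookup(spark: SparkSession, df: DataFrame, column: String): DataFrame = {
    import spark.implicits._
    df.select(column).distinct()
      .as[String]
      .rdd
      .zipWithIndex()
      .map { case (value, idx) => (value, idx.toInt) }
      .toDF(column, s"${column}Int")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]").appName("DistinctIdMapping").getOrCreate()
    import spark.implicits._

    // Sample data with a repeated reviewerID and a repeated productID
    val data = Seq(
      ("123xyx", "ab"),
      ("123xyz", "cd"),
      ("123xyx", "cd")
    ).toDF("reviewerID", "productID")

    // Join each lookup back so every row carries stable integer IDs
    val withIds = data
      .join(buildLookup(spark, data, "reviewerID"), Seq("reviewerID"))
      .join(buildLookup(spark, data, "productID"), Seq("productID"))

    withIds.show()
    spark.stop()
  }
}
```

Keeping the lookup tables around also makes it possible to map the integer IDs in the ALS output back to the original strings with another join.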

Hope this helps!

Source

2017-05-14 14:33:01

With the zipWithUniqueId approach, an error occurs:
scala> val rows = data.rdd.zipWithUniqueId.map {
     | case (r: Row, id: Long) => Row.fromSeq(id +: r.toSeq)
     | }
:29: error: not found: type Row
case (r: Row, id: Long) => Row.fromSeq(id +: r.toSeq)
^
:29: error: not found: value Row
case (r: Row, id: Long) => Row.fromSeq(id +: r.toSeq)
– DaehyunPark

Did you try my example? If so, it should work.
