
I am new to Spark and SparkR. I have successfully installed both. How can I build a logistic regression model in SparkR?

When I try to build a logistic regression model with R and Spark on a CSV file stored in HDFS, I get the error "incorrect number of dimensions".

My code is:

points <- cache(lapplyPartition(textFile(sc, "hdfs://localhost:54310/Henry/data.csv"), readPartition)) 
collect(points) 

# D (the number of features), iterations, and the readPartition parser 
# are defined elsewhere and not shown here 
w <- runif(n=D, min = -1, max = 1) 
cat("Initial w: ", w, "\n") 

# Compute logistic regression gradient for a matrix of data points 
gradient <- function(partition) { 
    partition = partition[[1]] 
    Y <- partition[, 1] # point labels (first column of input file) 
    X <- partition[, -1] # point coordinates 
    # For each point (x, y), compute gradient function 

    dot <- X %*% w 
    logit <- 1/(1 + exp(-Y * dot)) 
    grad <- t(X) %*% ((logit - 1) * Y) 
    list(grad) 
} 


for (i in 1:iterations) { 
    cat("On iteration ", i, "\n") 
    w <- w - reduce(lapplyPartition(points, gradient), "+") 
} 

The error message is:

On iteration 1 
Error in partition[, 1] : incorrect number of dimensions 
Calls: do.call ... func -> FUN -> FUN -> Reduce -> <Anonymous> -> FUN -> FUN 
Execution halted 
14/09/27 01:38:13 ERROR Executor: Exception in task 0.0 in stage 181.0 (TID 189) 
java.lang.NullPointerException 
    at edu.berkeley.cs.amplab.sparkr.RRDD.compute(RRDD.scala:125) 
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) 
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) 
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62) 
    at org.apache.spark.scheduler.Task.run(Task.scala:54) 
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177) 
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146) 
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) 
    at java.lang.Thread.run(Thread.java:701) 
14/09/27 01:38:13 WARN TaskSetManager: Lost task 0.0 in stage 181.0 (TID 189, localhost): java.lang.NullPointerException: 
     edu.berkeley.cs.amplab.sparkr.RRDD.compute(RRDD.scala:125) 
     org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) 
     org.apache.spark.rdd.RDD.iterator(RDD.scala:229) 
     org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62) 
     org.apache.spark.scheduler.Task.run(Task.scala:54) 
     org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177) 
     java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146) 
     java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) 
     java.lang.Thread.run(Thread.java:701) 
14/09/27 01:38:13 ERROR TaskSetManager: Task 0 in stage 181.0 failed 1 times; aborting job 
Error in .jcall(getJRDD(rdd), "Ljava/util/List;", "collect") : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 181.0 failed 1 times, most recent failure: Lost task 0.0 in stage 181.0 (TID 189, localhost): java.lang.NullPointerException: edu.berkeley.cs.amplab.sparkr.RRDD.compute(RRDD.scala:125) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62) org.apache.spark.scheduler.Task.run(Task.scala:54) org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177) java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) java.lang.Thread.run(Thread.java:701) Driver stacktrace: 

Dimensions of the data (read locally as a sample):

data <- read.csv("/home/Henry/data.csv") 
dim(data) 
[1] 17 541

What could be the cause of this error?


I think you forgot to tell us the result of dim(data). – voidHead 2014-09-26 22:43:15


@voidHead, I have added dim(data). – Hanry 2014-09-29 13:38:51

Answer


The problem is that textFile() reads plain text data and returns a distributed collection of strings, each element corresponding to one line of the text file. That is why partition[, -1] fails later in the program (a minimal reproduction follows below). The real intent of the program seems to be to treat points as a distributed collection of data matrices. We are working on providing data frame support in SparkR (SPARKR-1).
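
To see why the indexing fails, here is a minimal local reproduction. The line contents are made up for illustration; the point is that each element textFile() delivers is a single character string, so matrix-style indexing on it raises exactly this error:

partition <- list("1,0.5,0.3", "0,0.2,0.9")   # what textFile() delivers: raw lines 
line <- partition[[1]]                        # a single character string, not a matrix 
line[, 1]   # Error in line[, 1] : incorrect number of dimensions 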

To work around it, you can manipulate partition with plain string operations to extract X and Y correctly, as sketched below. Another approach (which I think you may have already seen) is to generate a different kind of distributed collection from the beginning, as is done here: examples/logistic_regression.R
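
A minimal sketch of that string-manipulation fix, assuming each element of partition is one comma-separated line with the label in the first field. parsePartition is a hypothetical helper, not a SparkR function; w is taken from the enclosing scope as in the original program:

# Parse a partition of raw CSV lines into a numeric matrix 
parsePartition <- function(partition) { 
    rows <- lapply(partition, function(line) as.numeric(strsplit(line, ",")[[1]])) 
    do.call(rbind, rows)  # one row per data point 
} 

gradient <- function(partition) { 
    mat <- parsePartition(partition) 
    Y <- mat[, 1]   # labels (first column) 
    X <- mat[, -1]  # feature coordinates 
    dot <- X %*% w 
    logit <- 1/(1 + exp(-Y * dot)) 
    list(t(X) %*% ((logit - 1) * Y)) 
} 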


Meanwhile, both reduce and lapplyPartition are being removed from the interface (https://issues.apache.org/jira/browse/SPARK-7230), so once DataFrames become available this program will be completely unusable. – piccolbo 2015-05-20 16:38:48
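
For readers arriving after the DataFrame API landed, a hedged sketch of the newer route. It assumes Spark 1.5+, where SparkR exposes a DataFrame-based glm(), plus the com.databricks:spark-csv package on the classpath; the default column names (C0, C1, ...) and the choice of C0 as the label are assumptions about this particular file:

library(SparkR) 
sc <- sparkR.init() 
sqlContext <- sparkRSQL.init(sc) 

# Read the CSV into a DataFrame (spark-csv names columns C0, C1, ... by default) 
df <- read.df(sqlContext, "hdfs://localhost:54310/Henry/data.csv", 
              source = "com.databricks.spark.csv", inferSchema = "true") 

# Fit logistic regression with the DataFrame-based glm() 
model <- glm(C0 ~ ., data = df, family = "binomial") 
summary(model) 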