Inference for Labeled LDA/pLDA [Topic Modeling Toolbox]

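I have been trying to get inference working for a trained Labeled LDA model, and for pLDA, using the TMT toolbox (Stanford NLP group). I have gone through the examples provided at the following links: http://nlp.stanford.edu/software/tmt/tmt-0.3/ http://nlp.stanford.edu/software/tmt/tmt-0.4/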

Here is the code I am trying to use for Labeled LDA inference:

val modelPath = file("llda-cvb0-59ea15c7-31-61406081-75faccf7"); 

val model = LoadCVB0LabeledLDA(modelPath);

val source = CSVFile("pubmed-oa-subset.csv") ~> IDColumn(1); 

val text = { 
    source ~>        // read from the source file 
    Column(4) ~>       // select column containing text 
    TokenizeWith(model.tokenizer.get)  //tokenize with model's tokenizer 
} 

val labels = { 
    source ~>        // read from the source file 
    Column(2) ~>       // take column two, the year 
    TokenizeWith(WhitespaceTokenizer())  
} 

val outputPath = file(modelPath, source.meta[java.io.File].getName.replaceAll(".csv","")); 

val dataset = LabeledLDADataset(text,labels,model.termIndex,model.topicIndex); 

val perDocTopicDistributions = InferCVB0LabeledLDADocumentTopicDistributions(model, dataset); 

val perDocTermTopicDistributions =EstimateLabeledLDAPerWordTopicDistributions(model, dataset, perDocTopicDistributions); 

TSVFile(outputPath+"-word-topic-distributions.tsv").write({ 
    for ((terms,(dId,dists)) <- text.iterator zip perDocTermTopicDistributions.iterator) yield { 
    require(terms.id == dId); 
    (terms.id, 
    for ((term,dist) <- (terms.value zip dists)) yield { 
     term + " " + dist.activeIterator.map({ 
     case (topic,prob) => model.topicIndex.get.get(topic) + ":" + prob 
     }).mkString(" "); 
    }); 
    } 
});

Error:

found   : scalanlp.collection.LazyIterable[(String, Array[Double])]
required: Iterable[(String, scalala.collection.sparse.SparseArray[Double])]
EstimateLabeledLDAPerWordTopicDistributions(model, dataset, perDocTopicDistributions);

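I understand that this is a type mismatch error, but I do not know how to resolve it in Scala. Basically, I do not understand how I should extract (1) the per-document topic distribution and (2) the per-document label distribution from the output of the infer command.

Please help. The same goes for pLDA: I get as far as the inference command and am then stuck.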


2012-07-28 Rohit Jain

Scala's type system is more complex than Java's, and understanding it will make you a better programmer. The problem lies here:

val perDocTermTopicDistributions =EstimateLabeledLDAPerWordTopicDistributions(model, dataset, perDocTopicDistributions);

because one of the arguments (model, dataset, or perDocTopicDistributions) has the type:

scalanlp.collection.LazyIterable[(String, Array[Double])]

while EstimateLabeledLDAPerWordTopicDistributions.apply expects an

Iterable[(String, scalala.collection.sparse.SparseArray[Double])]

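The best way to investigate this kind of error is to look at the ScalaDoc (for example, the one for TMT: http://nlp.stanford.edu/software/tmt/tmt-0.4/api/#package). If you cannot easily find where the problem lies, make the types of your variables explicit in the code, like this: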

val perDocTopicDistributions:LazyIterable[(String, Array[Double])] = InferCVB0LabeledLDADocumentTopicDistributions(model, dataset)
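As a small illustration of why the annotation helps, here is a sketch in plain Scala with no TMT dependency. The infer function and the AnnotationSketch object are hypothetical stand-ins, not part of the toolbox. The point is only that an annotated val is checked against the declared type on its own line, so a wrong assumption surfaces there rather than at a later call.

```scala
object AnnotationSketch {
  // Hypothetical stand-in for an inference call whose return type
  // is not obvious at the call site.
  def infer(): Iterable[(String, Array[Double])] =
    List("doc1" -> Array(0.5, 0.5))

  // The annotation documents the type, and the compiler verifies it here.
  val dists: Iterable[(String, Array[Double])] = infer()

  // Declaring the type you merely *expected* would be rejected on this line,
  // which pinpoints the mismatch immediately:
  // val wrong: Iterable[(String, Map[Int, Double])] = infer()
}
```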

If we look together at the ScalaDoc for edu.stanford.nlp.tmt.stage:

def EstimateLabeledLDAPerWordTopicDistributions(model: edu.stanford.nlp.tmt.model.llda.LabeledLDA[_, _, _], dataset: Iterable[LabeledLDADocumentParams], perDocTopicDistributions: Iterable[(String, SparseArray[Double])]): LazyIterable[(String, Array[SparseArray[Double]])]

def InferCVB0LabeledLDADocumentTopicDistributions(model: CVB0LabeledLDA, dataset: Iterable[LabeledLDADocumentParams]): LazyIterable[(String, Array[Double])]

It should now be clear that the return value of InferCVB0LabeledLDADocumentTopicDistributions cannot be passed directly to EstimateLabeledLDAPerWordTopicDistributions.

I have never used Stanford NLP, but this is how the API is designed, so you just need to convert your scalanlp.collection.LazyIterable[(String, Array[Double])] into an Iterable[(String, scalala.collection.sparse.SparseArray[Double])] before calling the function.

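If you look in the ScalaDoc for how to do this conversion, it is quite simple. In the stage package, inside package.scala, I can read import scalanlp.collection.LazyIterable;

So I know where to look. At http://www.scalanlp.org/docs/core/data/#scalanlp.collection.LazyIterable you will find a toIterable method that turns a LazyIterable into an Iterable, but you still have to convert the inner Array into a SparseArray.

Again, I look into package.scala for the stage package inside TMT and I see: import scalala.collection.sparse.SparseArray; so I look up the scalala documentation: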

http://www.scalanlp.org/docs/scalala/0.4.1-SNAPSHOT/#scalala.collection.sparse.SparseArray

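The constructors look complicated to me, which suggests that I should look for a factory method in the companion object. It turns out the method I am looking for is there, and it is called apply, as usual in Scala.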

def apply[T](values: T*)(implicit arg0: ClassManifest[T], arg1: DefaultArrayValue[T]): SparseArray[T]

Using this, you can write a function with the following signature:

def f: Array[Double] => SparseArray[Double]

Once that is done, you can turn the result of InferCVB0LabeledLDADocumentTopicDistributions into a non-lazy Iterable of sparse arrays with one line of code:

result.toIterable.map { case (name, values) => (name, f(values)) }
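To make the conversion step concrete, here is a minimal self-contained sketch in plain Scala. It deliberately does not use TMT or scalala: SparseVec is a hypothetical stand-in for SparseArray, toSparse plays the role of f, and a List view stands in for the LazyIterable returned by inference. With the real library, f would presumably be built on the SparseArray companion's apply shown above; that variant is untested here.

```scala
object SparseConversionSketch {
  // Hypothetical stand-in for scalala's SparseArray: only non-zero
  // entries are stored, keyed by their index.
  final case class SparseVec(length: Int, active: Map[Int, Double]) {
    def apply(i: Int): Double = active.getOrElse(i, 0.0)
  }

  // Same shape as f: Array[Double] => SparseArray[Double] in the answer.
  def toSparse(values: Array[Double]): SparseVec =
    SparseVec(
      values.length,
      values.zipWithIndex.collect { case (v, i) if v != 0.0 => i -> v }.toMap
    )

  // Stand-in for the lazy (docId, dense distribution) pairs from inference.
  val inferred: Iterable[(String, Array[Double])] =
    List("doc1" -> Array(0.0, 0.75, 0.25), "doc2" -> Array(1.0, 0.0, 0.0)).view

  // toList forces the lazy view so the result is a strict collection;
  // each dense array is then converted to its sparse form.
  val converted: Iterable[(String, SparseVec)] =
    inferred.toList.map { case (id, dist) => (id, toSparse(dist)) }
}
```

The document ids are carried through unchanged, which matters because the posted code later checks require(terms.id == dId) when zipping the text with the per-word distributions.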


2012-08-03 09:57:20 Edmondo1984
