Stanford CoreNLP tokenize.whitespace property not working for Chinese

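I am using Stanford CoreNLP for POS tagging and NER on pre-tokenized Chinese text. The official documentation at https://stanfordnlp.github.io/CoreNLP/tokenize.html says of the tokenize.whitespace option: "If set to true, separates words only when whitespace is encountered." That is exactly what I want.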

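However, I use Python and pycorenlp to talk to the CoreNLP server, and I know nothing about Java. I read the answer to "How to NER and POS tag a pre-tokenized text with Stanford CoreNLP?" and concluded that the only thing I could do was add 'tokenize.whitespace': 'true' and one other property to the properties dictionary of my POST request, but that does not work. I start my server like this: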

java -Xmx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -serverProperties StanfordCoreNLP-chinese.properties -port 9000 -timeout 150000

and in my Jupyter notebook I run:

from pycorenlp import StanfordCoreNLP 
nlp = StanfordCoreNLP('http://localhost:9000') 

output = nlp.annotate('公司 作为 物联网 行业', properties={ 
    'annotators': 'pos,ner', 
    'tokenize.whitespace': 'true', # first property 
    'ssplit.eolonly': 'true', # second property 
    'outputFormat': 'json' 
}) 

for sentence in output['sentences']: 
    print(' '.join([token['word'] for token in sentence['tokens']]))

This gives:

公司 作为 物 联网 行业

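CoreNLP still splits the token '物联网', just as if I had not added the two properties. I then tried creating a .properties file and passing it on the command line instead of StanfordCoreNLP-chinese.properties, but that did not work either. My test.properties contains: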

tokenize.whitespace=true 
ssplit.eolonly=true

Then I start the server like this:

java -Xmx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -serverProperties 'test.properties' -port 9000 -timeout 150000

But it behaves as if I had changed nothing. Does anyone know what I am missing? Any help is appreciated :)

Source

2017-07-25 nichen

In the end I solved my own problem.

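Using tokenize.whitespace = true with Chinese text seems not to work at all. Instead, add

'tokenize.language': 'Whitespace'

to your properties dictionary or, equivalently, add

tokenize.language: Whitespace

to your properties file. That does the right thing.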

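This property is documented on the same page, https://stanfordnlp.github.io/CoreNLP/tokenize.html#options, but I had not noticed it. It is a little confusing that two properties exist for the same purpose.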
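For reference, the fix can be sketched as a small pycorenlp helper. This is only a sketch under assumptions: the helper names (build_properties, tokens_preserved, annotate_pretokenized) are made up for illustration, the server is assumed to be running at http://localhost:9000 with the Chinese properties as in the question, and the annotator list is copied from the question rather than verified against every CoreNLP version.

```python
def build_properties(annotators="pos,ner"):
    """Return request properties that ask CoreNLP to split on whitespace only."""
    return {
        "annotators": annotators,            # same annotators as in the question
        "tokenize.language": "Whitespace",   # the property that worked in the answer
        "ssplit.eolonly": "true",            # one sentence per line, as in the question
        "outputFormat": "json",
    }


def tokens_preserved(text, output):
    """Check that the returned tokens match the whitespace-separated input tokens."""
    expected = text.split()
    returned = [
        token["word"]
        for sentence in output["sentences"]
        for token in sentence["tokens"]
    ]
    return expected == returned


def annotate_pretokenized(text, url="http://localhost:9000"):
    """Send pre-tokenized text to a running CoreNLP server (hypothetical helper)."""
    # Imported lazily so the helpers above can be used without pycorenlp installed.
    from pycorenlp import StanfordCoreNLP

    nlp = StanfordCoreNLP(url)
    return nlp.annotate(text, properties=build_properties())
```

With the server running, calling annotate_pretokenized('公司 作为 物联网 行业') and passing the result to tokens_preserved is a quick way to confirm that '物联网' is no longer split.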

Source

2017-07-26 03:14:49 nichen
