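Korean language tokenizer

What are the best tokenizers for Korean language processing?

I tried CJKTokenizer in Solr 4.0. It does tokenize the text, but the accuracy is very low.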
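For context on the low accuracy: Lucene's CJKTokenizer does not use a dictionary. It emits overlapping two-character bigrams, so many of the resulting tokens are not real Korean words or morphemes. The Python 3 sketch below is only a simplified illustration of that idea, not Lucene's implementation; the function name `cjk_bigrams` is made up for this example, and it assumes the input is whitespace-separated Hangul.

```python
def cjk_bigrams(text):
    """Split text on whitespace, then emit overlapping character bigrams.

    Simplified stand-in for bigram-style CJK tokenization (an assumption
    for illustration, not the actual Lucene code).
    """
    tokens = []
    for word in text.split():
        if len(word) == 1:
            # A lone character has no bigram, so keep it as-is.
            tokens.append(word)
        else:
            tokens.extend(word[i:i + 2] for i in range(len(word) - 1))
    return tokens


if __name__ == "__main__":
    # Several of the printed bigrams cross morpheme boundaries.
    print(cjk_bigrams("한국어를 공부합니다"))
```

A morphological analyzer would instead split the same sentence into stems and particles, which is why the answers below point to POS taggers rather than bigram tokenizers.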

2012-11-20 gangatharan

You are looking for a free/OSS tokenizer, aren't you? I'm afraid the only tokenizers for CJKV languages I know of that work more or less correctly are commercial.

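POSTECH/K is a Korean morphological analyzer that can tokenize and POS-tag Korean data without much effort. The software reports 90.7% on the corpus it was trained and tested on (see http://nlp.postech.ac.kr/download/postag_k/9908_cljournal_gblee.pdf).

On the Korean data of a multilingual corpus project I have been working on, POS tagging reached 81%.

There is a catch, however: you have to use Windows to run the software. I have a script that works around this restriction; here it is: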

#!/bin/bash -x 
############################################################################### 
## Sejong-Shell is a script to call POSTAG/SEJONG tagger on Unix Machine 
## because POSTAG/Sejong is only usable in Korean Microsoft Windows environment 
## the original POSTAG/Sejong can be downloaded from 
## http://isoft.postech.ac.kr/Course/CS730b/2005/index.html 
## 
## Sejong-Shell is dependent on WINdows Emulator. 
## The WINE program can be downloaded from 
## http://www.winehq.org/download/ 
## 
## The shell script accepts the input files from one directory and 
## outputs the tagged files into another while retaining the filenames 
############################################################################### 

cd <source-file_dir> 
# <source-file_dir> is the directory that holds the text files that need tagging 
for file in * 
do 
    echo $file 
    sudo cp <source-file_dir>/"$file" <POSTAG-Sejong_dir>/input.txt 
    # <POSTAG-Sejong_dir> refers to the directory where the pos-tagger is saved 
    wine start /Unix "$HOME/postagsejong/sjTaggerInteg.exe" 
    sleep 30 
    # This is necessary so that the file from the current loop won't be 
    # overlapping with the next, do increase the time for sleep if the file 
    # is large and needs more than 30 sec for POSTAG/Sejong to tag. 
    sudo cp <POSTAG-Sejong_dir>/output.txt <target-file_dir>/"$file" 
    # <target-file_dir> is where you want the output files to be stored 
done 

# Instead of using "sleep 30" to prevent the overlap, you can pause the 
# loop until a key is pressed by putting this command in its place: 
# read -p "Press any key to continue…"
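The fixed `sleep 30` in the script is a guess at how long the tagger needs. One possible alternative, sketched here in Python 3, is to poll until the output file exists and has stopped growing. This is an assumption on my part rather than part of the original script: the helper name `wait_for_output` and its parameters are hypothetical, and it only works if any stale output.txt is deleted before each run.

```python
import os
import time


def wait_for_output(path, timeout=300.0, poll=1.0):
    """Return True once `path` is non-empty and its size is unchanged
    between two consecutive polls; return False if `timeout` elapses.

    Hypothetical helper: assumes the tagger writes `path` and that any
    output from a previous run was removed beforehand.
    """
    deadline = time.monotonic() + timeout
    last_size = -1
    while time.monotonic() < deadline:
        if os.path.exists(path):
            size = os.path.getsize(path)
            if size > 0 and size == last_size:
                return True
            last_size = size
        time.sleep(poll)
    return False
```

A driver written in Python could call this after launching the tagger and before copying output.txt, increasing `timeout` for large files instead of tuning a sleep value by hand.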

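Note that POSTECH/K expects EUC-KR encoding, so if your data is UTF-8 you need to convert it first. You can use the following script to convert UTF-8 to EUC-KR.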

#!/usr/bin/python 
# -*- coding: utf-8 -*- 

''' 
pre-sejong clean 
''' 

import codecs 
import nltk 
import os, sys, re, glob 
from nltk.tokenize import RegexpTokenizer 

reload(sys) 
sys.setdefaultencoding('utf-8') 

cwd = './gizaclean_ko' #os.getcwd() 
wrd = './presejong_ko' 

kr_sent_tokenizer = nltk.RegexpTokenizer(u'[^！？.?!]*[！？."www.*"]') 


for infile in glob.glob(os.path.join(cwd, '*.txt')): 
    # if infile == './extract_ko/singapore-sling.txt': continue 
    # if infile == './extract_ko/ion-orchard.txt': continue 
    print infile 
    (PATH, FILENAME) = os.path.split(infile) 
    reader = open(infile) 
    writer = open(os.path.join(wrd, FILENAME).encode('euc-kr'), 'w') 
    for line in reader: 
        para = [] 
        para.append(kr_sent_tokenizer.tokenize(unicode(line, 'utf-8').strip())) 
        for sent in para[0]: 
            # Swap a few non-Korean characters for ASCII stand-ins before encoding. 
            newsent = sent.replace(u'\xa0', ' '.encode('utf-8')) 
            newsent2 = newsent.replace(u'\xe7', 'c'.encode('utf-8')) 
            newsent3 = newsent2.replace(u'\xe9', 'e'.encode('utf-8')) 
            newsent4 = newsent3.replace(u'\u2013', '-') 
            newsent5 = newsent4.replace(u'\xa9', '(c)') 
            newsent6 = newsent5.encode('euc-kr').strip() 
            print newsent6 
            writer.write(newsent6 + '\n') 
    reader.close() 
    writer.close()
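The script above is Python 2. For readers on Python 3, the core of the conversion can be sketched roughly as follows. This is a minimal sketch, not a port of the whole script: it leaves out the NLTK sentence splitting, the names `REPLACEMENTS` and `utf8_to_euckr` are made up for the example, and replacing any remaining unencodable character with `?` is my own assumption about reasonable behaviour.

```python
# Substitutions mirroring the ones in the Python 2 script above.
REPLACEMENTS = {
    "\xa0": " ",     # non-breaking space
    "\xe7": "c",     # c with cedilla
    "\xe9": "e",     # e with acute accent
    "\u2013": "-",   # en dash
    "\xa9": "(c)",   # copyright sign
}


def utf8_to_euckr(data: bytes) -> bytes:
    """Decode UTF-8 bytes, apply the substitutions, and re-encode as EUC-KR.

    Characters that EUC-KR still cannot represent become '?' (an assumption;
    the original script would raise an error instead).
    """
    text = data.decode("utf-8")
    for src, dst in REPLACEMENTS.items():
        text = text.replace(src, dst)
    return text.encode("euc-kr", errors="replace")


if __name__ == "__main__":
    sample = "한국어 café".encode("utf-8")
    print(utf8_to_euckr(sample).decode("euc-kr"))
```

To convert whole files, read each input file in binary mode, pass the bytes through `utf8_to_euckr`, and write the result in binary mode to the directory the tagger reads from.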

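(Source for Sejong-Shell: Liling Tan. 2011. Building the foundation text for the Nanyang Technological University - Multilingual Corpus (NTU-MC). Final year project. Singapore: Nanyang Technological University. p. 44.)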


2013-01-20 00:51:56 alvas

I suggest you use this site. It performs very well, although its functionality is only available as a RESTful service.


2013-06-21 01:19:15 Kim
