2012-11-20 103 views
5

What is the best tokenizer for processing Korean? (Korean language tokenizer)

I tried CJKTokenizer in Solr 4.0. It does tokenize the text, but the accuracy is very low.

+1

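You are looking for a free/OSS tokenizer, aren't you? I'm afraid the only tokenizers I know of that work more or less correctly for CJKV languages are commercial products.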

Answers

2

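POSTECH/K is a Korean morphological analyzer that can tokenize and POS-tag Korean data without much effort. The software reports 90.7% on the corpus it was trained and tested on (see http://nlp.postech.ac.kr/download/postag_k/9908_cljournal_gblee.pdf).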

On the Korean data of the multilingual corpus project I have been working on, POS tagging reached 81%.

There is one catch, though: you have to use Windows to run the software. I have a script that works around this restriction; here it is:

#!/bin/bash -x 
############################################################################### 
## Sejong-Shell is a script that calls the POSTAG/Sejong tagger on a Unix machine,
## because POSTAG/Sejong is only usable in a Korean Microsoft Windows environment.
## the original POSTAG/Sejong can be downloaded from 
## http://isoft.postech.ac.kr/Course/CS730b/2005/index.html 
## 
## Sejong-Shell is dependent on WINdows Emulator. 
## The WINE program can be downloaded from 
## http://www.winehq.org/download/ 
## 
## The shell script reads the input files from one directory and
## writes the tagged files to another, keeping the original filenames
############################################################################### 

SRC_DIR="<source-file_dir>"
# <source-file_dir> is the directory that holds the text files that need tagging
TAGGER_DIR="<POSTAG-Sejong_dir>"
# <POSTAG-Sejong_dir> refers to the directory where the pos-tagger is saved
OUT_DIR="<target-file_dir>"
# <target-file_dir> is where you want the output files to be stored

cd "$SRC_DIR" || exit 1
# Iterate with a glob instead of parsing `dir` output, so filenames with
# spaces are handled correctly
for file in *
do
    echo "$file"
    sudo cp "$SRC_DIR/$file" "$TAGGER_DIR/input.txt"
    wine start /Unix "$HOME/postagsejong/sjTaggerInteg.exe"
    sleep 30
    # This is necessary so that the file from the current loop won't be
    # overlapping with the next; increase the sleep time if the file
    # is large and POSTAG/Sejong needs more than 30 sec to tag it.
    sudo cp "$TAGGER_DIR/output.txt" "$OUT_DIR/$file"
done

# Instead of using the sleep command to prevent the overlap:
# $sleep 30
# you can manually continue the loop with the following command,
# which resumes after a single keystroke:
# $read -n 1 -s -p "Press any key to continue…"
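
If the fixed `sleep 30` turns out to be fragile, one option is to drive the same copy-in, tag, copy-out loop from Python and block on the tagger process instead. The following is only a rough sketch under assumptions I have not verified: `tag_directory` and `run_tagger_with_wine` are hypothetical helper names, and the approach only works if the tagger exits on its own once `output.txt` has been written.

```python
import shutil
import subprocess
from pathlib import Path


def run_tagger_with_wine(exe_path):
    # Assumption: the tagger process exits once output.txt is written,
    # so a blocking call can stand in for the fixed `sleep 30`.
    subprocess.run(["wine", str(exe_path)], check=True)


def tag_directory(src_dir, tagger_dir, out_dir, run_tagger):
    """Tag every file in src_dir one at a time and return the output paths.

    run_tagger is any zero-argument callable that reads input.txt and
    writes output.txt inside tagger_dir.
    """
    src_dir, tagger_dir, out_dir = Path(src_dir), Path(tagger_dir), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for src in sorted(p for p in src_dir.iterdir() if p.is_file()):
        # Same fixed file names the shell script uses
        shutil.copyfile(src, tagger_dir / "input.txt")
        run_tagger()
        target = out_dir / src.name  # keep the original filename
        shutil.copyfile(tagger_dir / "output.txt", target)
        written.append(target)
    return written
```

A call would look like `tag_directory(src, tagger, out, lambda: run_tagger_with_wine(Path(tagger) / "sjTaggerInteg.exe"))`. Passing the runner in as a callable also makes it easy to try the loop with a stand-in before involving wine at all.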

Note that POSTECH/K expects its input in euc-kr encoding, so if your data is in utf8 you can use the following script to convert utf8 to euc-kr.

#!/usr/bin/python
# -*- coding: utf-8 -*-

''' 
pre-sejong clean 
''' 

import codecs 
import nltk 
import os, sys, re, glob 
from nltk.tokenize import RegexpTokenizer 

reload(sys) 
sys.setdefaultencoding('utf-8') 

cwd = './gizaclean_ko' #os.getcwd() 
wrd = './presejong_ko' 

kr_sent_tokenizer = nltk.RegexpTokenizer(u'[^!?.?!]*[!?."www.*"]') 


for infile in glob.glob(os.path.join(cwd, '*.txt')):
    # if infile == './extract_ko/singapore-sling.txt': continue
    # if infile == './extract_ko/ion-orchard.txt': continue
    print infile
    (PATH, FILENAME) = os.path.split(infile)
    reader = open(infile)
    writer = open(os.path.join(wrd, FILENAME).encode('euc-kr'),'w')
    for line in reader:
        # Split each line into sentences
        para = []
        para.append(kr_sent_tokenizer.tokenize(unicode(line,'utf-8').strip()))
        for sent in para[0]:
            # Replace characters that cause trouble when encoding to euc-kr
            newsent = sent.replace(u'\xa0', ' '.encode('utf-8'))
            newsent2 = newsent.replace(u'\xe7', 'c'.encode('utf-8'))
            newsent3 = newsent2.replace(u'\xe9', 'e'.encode('utf-8'))
            newsent4 = newsent3.replace(u'\u2013', '-')
            newsent5 = newsent4.replace(u'\xa9', '(c)')
            newsent6 = newsent5.encode('euc-kr').strip()
            print newsent6
            # One sentence per line in the output file
            writer.write(newsent6+'\n')
    reader.close()
    writer.close()
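
The script above is Python 2. For anyone on Python 3, the conversion step on its own could be sketched roughly as follows, using only the standard library. This is not a port of the whole script: it skips the NLTK sentence splitting. The names `REPLACEMENTS`, `to_euc_kr` and `convert_file` are made up for illustration, and using `errors="replace"` (so unencodable characters become `?` instead of raising) is my own choice rather than what the original does.

```python
from pathlib import Path

# Same substitutions as in the script above, collected in one table
REPLACEMENTS = {
    "\xa0": " ",     # non-breaking space
    "\xe7": "c",     # c with cedilla
    "\xe9": "e",     # e with acute accent
    "\u2013": "-",   # en dash
    "\xa9": "(c)",   # copyright sign
}


def to_euc_kr(text):
    """Apply the substitutions, then encode the text as euc-kr bytes."""
    for old, new in REPLACEMENTS.items():
        text = text.replace(old, new)
    # Assumption: replacing unencodable characters is acceptable here;
    # use errors="strict" to fail loudly instead.
    return text.encode("euc-kr", errors="replace")


def convert_file(src, dst):
    """Read a utf8 file and write its euc-kr equivalent."""
    text = Path(src).read_text(encoding="utf-8")
    Path(dst).write_bytes(to_euc_kr(text))
```

For example, `convert_file("gizaclean_ko/a.txt", "presejong_ko/a.txt")` would produce a file the tagger can read, assuming the directory layout from the script above.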

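(Source of Sejong-Shell: Liling Tan. 2011. Building the foundation text for Nanyang Technological University - Multilingual Corpus (NTU-MC). Final year project. Singapore: Nanyang Technological University. pp. 44.)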

0

I suggest you use this site. It performs very well, although it is only available as a RESTful service.