Stop word removal and saving a new file

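I have text files from which I need to remove stop words. The stop words are stored in a text file. I load the "stop-word" text file into my Perl script and store the stop words in an array called "stops".

Currently I load a set of different text files, store them in a separate array, and then pattern match to see whether any of the words are in fact stop words. I can print the stop words and tell which files they occur in, but how do I remove them from the text files and store new text files that contain no stop words?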

i.e. stop words: the a to and into

Text file: "the girl driving crashed into a man"

Resulting file: girl driving crashed man

I load the files with:

$dirtoget = "/Users/j/temp/"; 
opendir(IMD, $dirtoget) || die("Cannot open directory"); 
@thefiles = readdir(IMD); 

foreach $f (@thefiles) { 
if ($f =~ m/\.txt$/) { 

    open(FILE, "/Users/j/temp/$f") or die "Cannot open FILE"; 

    while (<FILE>) { 
     @file = <FILE>;

Here is the pattern-matching loop:

foreach $word(split) { 
       foreach $x (@stop) { 
        if ($x =~ m/\b\Q$word\E\b/) { 
       $word=''; 
         print $word,"\n";

This sets $word to an empty string.

Or I could do this:

$word = '' if exists $stops{$word};

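I just don't know how to make the output file no longer contain the matched words. Would it be silly to store the words that don't match in an array and write that out to a file?

Source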

2011-03-03 jenniem001

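Overwriting a file in place is possible, but a hassle. The Unix way to do this is to print only the non-stop words to standard output (which is where print writes by default) and redirect: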

./remove_stopwords.pl textfile.txt > withoutstopwords.txt

Then carry on working with the file withoutstopwords.txt. This also lets the program be used in a pipeline.
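A minimal sketch of what such a remove_stopwords.pl might look like. The answer does not show the script, so everything here is an assumption: the stop list is hard-coded for illustration (a real script would read it from the stop-word file), and strip_stop_words is a hypothetical helper name.

```perl
use strict;
use warnings;

# Hypothetical stop list, hard-coded for illustration only.
my %stops = map { $_ => 1 } qw(the a to and into);

# Return the line with every stop word dropped; the remaining words
# are re-joined with single spaces and the newline is not kept.
sub strip_stop_words {
    my ($line) = @_;
    return join ' ', grep { !exists $stops{lc $_} } split ' ', $line;
}

# Filter the files named on the command line and print to STDOUT.
# The @ARGV guard only keeps this sketch from waiting on STDIN when it
# is run without arguments; remove it to use the script in a pipeline.
if (@ARGV) {
    while (my $line = <>) {
        print strip_stop_words($line), "\n";
    }
}

# Small demonstration on a literal line.
print strip_stop_words('the girl driving crashed into a man'), "\n";
```

Because nothing is written except through print, the redirection shown above decides where the cleaned text ends up.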

Source

2011-03-03 17:08:11

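That will print everything that has to be removed, but how do I remove the words from the original file? – jenniem001 2011-03-03 17:21:59

'mv withoutstopwords.txt textfile.txt'. Or keep them in an array and then write it out. – 2011-03-03 17:28:15

In short: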

use strict; 
use warnings; 
use English qw<$LIST_SEPARATOR $NR>; 

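# Assumes @stop already holds the stop words (declared earlier with "my" or "our").
my $stop_regex 
    = do { 
     # Joining the words with \E|\Q quotes each one: \b(\Qword1\E|\Qword2\E)\b
     local $LIST_SEPARATOR = '\\E|\\Q'; 
     eval "qr/\\b(\\Q@{stop}\\E)\\b/"; 
    }; 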
@ARGV = glob('/Users/j/temp/*.txt'); 
while (<>) { 
    next unless m/$stop_regex/; 
    print "Stop word '$1' found at $ARGV line $NR\n"; 
}

What do you want to do with these words? If you want to replace them, you can do this:

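use English qw<$INPLACE_EDIT $LIST_SEPARATOR $NR>; 
local $INPLACE_EDIT = '.bak';   # suffix appended to the name of each backup file

... 
while (<>) { 
    if (m/$stop_regex/) { 
     s/$stop_regex/$something_else/g;   # $something_else is a placeholder for the replacement text
    } 
    print; 
}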

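With $INPLACE_EDIT set, Perl keeps each original file as a backup under the ".bak" suffix and sends everything you print into a new file with the original name, moving on to the next file in the same way. If that is what you want to do.

Source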

2011-03-03 17:13:19 Axeman

You can use the substitution operator to remove the words from your file:

use warnings; 
use strict; 

my @stop = qw(foo bar); 
while (<DATA>) { 
    my $line = $_; 
    $line =~ s/\b$_\b//g for @stop; 
    print $line; 
} 

__DATA__ 
here i am 
with a foo 
and a bar too 
lots of foo foo food

Prints:

here i am 
with a 
and a too 
lots of food

Source

2011-03-03 17:13:31 toolic

If I edit your code to take in my files: 'use warnings; open(STOPWORD, "/Users/j/stopWordList.txt") or die "Cannot open: $!\n"; @stops = <STOPWORD>; $dirtoget = "/Users/j/temp/"; opendir(IMD, $dirtoget) || die("Cannot open directory"); @thefiles = readdir(IMD); foreach $f (@thefiles) { if ($f =~ m/\.txt$/) { open(FILE, "/Users/j/temp/$f") or die "Cannot open FILE"; while (<FILE>) { my $line = $_; $line =~ s/\b$_\b//g for @stops; print $line; } } }' it seems to just print the whole file? – jenniem001 2011-03-03 17:34:22

It should print all the lines of the input file with the stop words removed, just as my example shows. – toolic 2011-03-03 18:12:55

@jenniem001 - try 'chomp(@stops = <STOPWORD>)'. Without the call to 'chomp', all the stop words have a newline at the end of them. – mob 2011-03-03 19:45:20
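Putting the chomp fix together with the substitution approach, the following is a rough sketch of writing each cleaned file out under a new name. It is not code from any of the answers: the helper names load_stop_words and write_without_stop_words, the ".nostop" output suffix, and the temporary demo directory are all assumptions made for illustration, and the whitespace clean-up is an extra step the thread does not ask for.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Read one stop word per line; chomp removes the trailing newlines
# that would otherwise stop the words from ever matching.
sub load_stop_words {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    chomp(my @stops = <$fh>);
    close $fh;
    return grep { $_ ne '' } @stops;    # ignore blank lines
}

# Copy $in to $out with every whole-word stop word removed.
sub write_without_stop_words {
    my ($in, $out, @stops) = @_;
    my $alternatives = join '|', map { quotemeta($_) } @stops;
    my $stop_regex   = qr/\b(?:$alternatives)\b/;

    open my $in_fh,  '<', $in  or die "Cannot open $in: $!";
    open my $out_fh, '>', $out or die "Cannot open $out: $!";
    while (my $line = <$in_fh>) {
        $line =~ s/$stop_regex//g;      # delete the stop words
        $line =~ s/[ \t]{2,}/ /g;       # collapse the gaps they leave behind
        $line =~ s/^[ \t]+//;           # drop leading whitespace
        print {$out_fh} $line;
    }
    close $in_fh;
    close $out_fh or die "Cannot close $out: $!";
}

# Demo in a throwaway directory so no real files are touched.
my $dir       = tempdir(CLEANUP => 1);
my $stop_file = "$dir/stopWordList.txt";

open my $sw, '>', $stop_file or die "Cannot write $stop_file: $!";
print {$sw} "the\na\ninto\n";
close $sw or die "Cannot close $stop_file: $!";

open my $tf, '>', "$dir/story.txt" or die "Cannot write story.txt: $!";
print {$tf} "the girl driving crashed into a man\n";
close $tf or die "Cannot close story.txt: $!";

my @stops = load_stop_words($stop_file);

opendir my $dh, $dir or die "Cannot open directory $dir: $!";
my @texts = grep { /\.txt$/ && $_ ne 'stopWordList.txt' } readdir $dh;
closedir $dh;

for my $name (@texts) {
    (my $new = $name) =~ s/\.txt$/.nostop/;   # story.txt -> story.nostop
    write_without_stop_words("$dir/$name", "$dir/$new", @stops);
}

open my $res, '<', "$dir/story.nostop" or die "Cannot open result: $!";
print <$res>;
close $res;
```

Writing to a separate file leaves the originals untouched; if the cleaned text should replace them, the new file can be renamed over the old one afterwards, as suggested in the comments above.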
