Looking for duplicate words

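I have to write (as an exercise) a Perl program that checks whether a text file contains the same word more than once and then prints the words to a new file (without the duplicates).

Can someone help me? I understand that I can look for a word with m//, but how do I look for words I may not know in advance? For example, if the text file contains:

Hello, Hello, how are you?

I want to copy this file to a new one without one of the 'Hello's. Of course, I don't know in advance whether the file contains any repeated words... the idea is for the program to search for the repeated words itself.

I have a basic script that writes the words out in alphabetical order, but step 2, finding the repeated words... I can't figure out. Here is the script (hopefully correct so far):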

#!/usr/bin/perl 
use strict; 
use warnings; 

my $source = shift(@ARGV); 
my $cible = shift(@ARGV); 

open (SOURCE, '<', $source) or die ("Can't open $source\n"); 
open (CIBLE, '>', $cible) or die ("Can't open $cible\n"); 

my @lignes = <SOURCE>; 
my @lignes_sorted = sort (@lignes); 

print CIBLE @lignes_sorted; 

chomp @lignes; 
chomp @lignes_sorted; 

print "Original text : @lignes\n"; 

sleep (1); 

print "Sorted text : @lignes_sorted\n"; 

close(SOURCE); 
close (CIBLE);

2013-03-16 joesh

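Thanks Kamituel, I just edited it again so that the script is correct. I read the instructions too late (after posting). – joesh 2013-03-16 15:17:29

When you die, include the error message '$!': 'die("Can't open $source: $!\n");' – 2013-03-17 03:56:25

Hi Andy, can you explain why I have to replace 'or die' with '$!: die'? Is that what you mean? – joesh 2013-03-17 08:51:52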

In Perl:

#!/usr/bin/perl -w 
use strict; 

my $source = shift(@ARGV); 
my $cible = shift(@ARGV); 

open (SOURCE, '<', $source) or die ("Can't open $source\n"); 
open (CIBLE, '>', $cible) or die ("Can't open $cible\n"); 

my @input = sort <SOURCE>; 
my %words =(); 
foreach (@input) { 
    foreach my $word (split(/\s/)) { 
     print CIBLE $word." " unless (exists $words{$word}); 
     $words{$word} = 1; 
    } 
} 

close(SOURCE); 
close (CIBLE);

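The basic idea is to split the whole text into individual words (using the split function) and then build a hash with each word as a key. When reading the next word, simply check whether that word is already in the hash. If it is, it is a duplicate.

For the string Hello, Hello, how are you? it prints: Hello, how are you?.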
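The hash-as-key idea can be looked at in isolation. The following is only a minimal sketch: the helper name unique_words is made up for illustration and is not part of the answer's script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the original script): return the given
# words with later repeats dropped, keeping first-seen order.
sub unique_words {
    my %words;      # keys are the words seen so far
    my @unique;
    foreach my $word (@_) {
        next if exists $words{$word};   # already a key, so a duplicate
        $words{$word} = 1;
        push @unique, $word;
    }
    return @unique;
}

my @words = split ' ', "Hello, Hello, how are you?";
print join(' ', unique_words(@words)), "\n";   # prints: Hello, how are you?
```

To get one word per line in alphabetical order, something like print "$_\n" for sort { $a cmp $b } unique_words(@words); should work. The explicit comparison block matters here, because Perl would otherwise read a sub name placed directly after sort as the comparison routine.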

2013-03-16 15:25:33 kamituel

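Great, thanks. My original file has one word per line; it is a special kind of text. Your code puts the words on a single line. I should work out how to combine my code with yours. Any suggestions? – joesh 2013-03-16 17:04:50

Got it. I need to add a "\n" to the line: print CIBLE $word."\n" unless (exists $words{$word}). Thanks a lot. I need to look up the '.' because I don't remember what it does in the code. – joesh 2013-03-16 17:09:06

Another small problem is that the script doesn't sort the words alphabetically the way my original script did. I'll have to see whether I can work out where to put the sort command. – joesh 2013-03-17 08:40:33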

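I don't know how to do this in Perl, but it is easy to do with sed and a couple of Unix tools. The algorithm would be:

Separate the individual words by replacing the spaces with a newline
Sort the words
Send the sorted word list through uniq with the -c option (word count)
Delete the words that occur only once (a count of 1 in the first column)

The command that does all of this is

(replace \t with a literal TAB and \n with a literal ENTER)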

sed 's/[ \t,.][ \t,.]*/\n/g' filename | sort | uniq -c | sed '/^ *\<1\>/d'
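Since the question asks for Perl, the same four steps can be sketched with a hash of counts. This is only an illustration: the helper name repeated_words is hypothetical, and it assumes the same separators as the sed expression (spaces, tabs, commas, periods) plus newlines.

```perl
use strict;
use warnings;

# Hypothetical helper: list the words that occur more than once in $text,
# each prefixed by its count, roughly like "sort | uniq -c" minus the
# lines whose count is 1.
sub repeated_words {
    my ($text) = @_;
    my %count;
    foreach my $word (split /[ \t\n,.]+/, $text) {
        next if $word eq '';    # a leading separator yields an empty field
        $count{$word}++;
    }
    my @repeated = grep { $count{$_} > 1 } sort keys %count;
    return map { $count{$_} . " " . $_ } @repeated;
}

print "$_\n" for repeated_words("Hello, Hello, how are you?");   # prints: 2 Hello
```

Like the pipeline, this reports which words are repeated; it does not write out the text with the repeats removed.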

Hope that helps.

2013-03-16 15:23:08 unxnut

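Deduplicating the words of a sentence is more complicated than it sounds. For example, if you split the sentence on whitespace you get "words" such as Hello, that contain non-word characters, and such a word is not treated as a duplicate of the real word Hello. There are many factors to consider, but assuming the simplest case, where every run of non-whitespace characters is a legitimate word, you can do this: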

$ perl -anlwe '@F=grep !$seen{$_}++, @F; print "@F";' hello.txt 
Hello, how are you? 
yada Yada this is test material dupe Dupe 

$ cat hello.txt 
Hello, Hello, how are you? 
yada Yada this is test material dupe dupe Dupe

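As you can see, it does not consider yada and Yada duplicates. Nor does it consider Hello a duplicate of Hello,. You can adjust this by using lc or uc to remove the case dependency, and by allowing delimiters other than just whitespace.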
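One possible way to make that adjustment is sketched below. The helper name dedupe_line and the choice to strip leading and trailing non-word characters from the lookup key are assumptions for illustration, not something the one-liner above does.

```perl
use strict;
use warnings;

# Hypothetical helper: drop repeated words from one line, comparing them
# case-insensitively and ignoring surrounding punctuation. $seen is a
# hash reference shared across lines.
sub dedupe_line {
    my ($line, $seen) = @_;
    my @kept;
    foreach my $word (split ' ', $line) {
        (my $key = lc $word) =~ s/^\W+|\W+$//g;   # "Hello," becomes "hello"
        push @kept, $word unless $seen->{$key}++;
    }
    return "@kept";
}

my %seen;
print dedupe_line("Hello, Hello, how are you?", \%seen), "\n";
print dedupe_line("yada Yada this is test material dupe dupe Dupe", \%seen), "\n";
```

With this key, the first spelling of each word is the one that is kept, so the second line comes out as yada this is test material dupe. A token made up only of punctuation ends up with an empty key, which a real program would probably want to handle separately.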

What we are doing here is using the hash %seen to keep track of the words that have appeared before. The basic program is:

while (<>) {   # reading input file or stdin 
    @F = split;  # splitting $_ on whitespace by default 
    @F = grep !$seen{$_}++, @F; # remove duplicates 
    print "@F\n"; # print array elements space-separated (-l adds the newline in the one-liner)
}

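The effect of !$seen{$_}++ is that the expression returns true the first time a new key is entered and false every time after that. How does it work? These are the separate steps that take place: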

$seen{$_}  # value for key $_ is fetched 
$seen{$_}++ # value for key $_ is incremented, undef -> 1 
       # $foo++ returns the value *before* it is incremented, 
       # so it returns undef 
!$seen{$_}++ # this is now "! undef", meaning "not false", as in true.

For values of 1 and above, which are all true, the not operator negates them to false.
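A small trace makes the steps visible. This is just a sketch for illustration; the variable names $before, $keep and @log are made up here.

```perl
use strict;
use warnings;

my %seen;
my @log;
foreach my $word (qw(dupe dupe dupe)) {
    my $before = $seen{$word} // 'undef';            # the value ++ will return
    my $keep   = !$seen{$word}++ ? 'kept' : 'dropped';
    push @log, "$before -> $seen{$word}: $keep";
}
print "$_\n" for @log;
# prints:
# undef -> 1: kept
# 1 -> 2: dropped
# 2 -> 3: dropped
```

Only the first pass sees undef, so only the first occurrence survives the grep.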

2013-03-16 15:58:43 TLP

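Thanks a lot. This is a bit more complicated than what I'm used to, but I'll study it carefully. – joesh 2013-03-16 17:03:41

I've been looking at the 'while' solution you suggested, but I'm not sure I know where to put it in the script. – joesh 2013-03-17 08:48:38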

If you aren't worried about finding duplicate words that differ in case, then you can do this with a single substitution.

use strict; 
use warnings; 

my ($source, $cible) = @ARGV; 

my $data; 
{ 
    open (my $source_fh, '<', $source) or die ("Can't open $source\n"); 
    local $/; 
    $data = <$source_fh>; 
} 

$data =~ s/\b(\w+)\W+(?=\1\b)//g; 

open (my $cible_fh, '>', $cible) or die ("Can't open $cible\n"); 
print $cible_fh $data;
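To see what the substitution does and does not catch, it can be tried on a few strings. In this sketch the wrapper name drop_adjacent_repeats is hypothetical, and the /i flag is an added assumption that makes the backreference match regardless of case.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the substitution above. The /i flag is an
# addition for illustration; without it "yada Yada" is left alone.
sub drop_adjacent_repeats {
    my ($text) = @_;
    $text =~ s/\b(\w+)\W+(?=\1\b)//gi;
    return $text;
}

print drop_adjacent_repeats("Hello, Hello, how are you?"), "\n";       # Hello, how are you?
print drop_adjacent_repeats("yada Yada this is dupe dupe Dupe"), "\n"; # Yada this is Dupe
print drop_adjacent_repeats("one two one"), "\n";                      # one two one
```

Two things stand out: the earlier copy is the one removed, so the last spelling survives, and only repeats that directly follow each other are caught, which is why the third string is unchanged.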

2013-03-17 03:43:32 Borodin
