2010-11-13 59 views
6

我有一个使用OCR软件创建的文本文件 - 大小约为1兆字节。 一些不常见的字符出现在整个文档上,其中大部分都是OCR错误。如何使用Unix命令收集文本文件中的字符使用情况统计信息?

我想查找文档中使用的所有字符,以便轻松发现错误(如UNIQ命令,但对于字符,而不是线)。

我在Ubuntu上。 我应该用什么Unix命令来显示文本文件中使用的所有字符?

回答

8

这应该做你要找的内容:

cat inputfile | sed 's/\(.\)/\1\n/g' | sort | uniq -c 

的前提是sed本身把每个字符的文件到线,则通常sort | uniq -c序列剔除掉所有但之一每个独特的字符发生,并提供每次发生的次数。

此外,您可以在整个序列的末尾附加| sort -n,以按每个字符出现的次数排序输出。例如:

$ echo hello | sed 's/\(.\)/\1\n/g' | sort | uniq -c | sort -n 
    1 
    1 e 
    1 h 
    1 o 
    2 l 
+0

来到这里后正是这一点。但是,唉,我太慢了。但这里有一个有用的提示。最后,如果你想把它全部放在一行上,你可以添加“| perl -pe'chomp'” – OmnipotentEntity 2010-11-13 19:02:58

+0

非常感谢。这正是我所期待的。清洁和简单 – Hali 2010-11-15 20:40:46

0

至于使用* nix命令,回答上面是好的,但它并没有得到使用统计信息。但是,如果你真的想要文件中的统计信息(比如最常用的,中等的,最常用的等),那么这个Python应该做到这一点。

def get_char_counts(fname): 
    f = open(fname) 
    usage = {} 
    for c in f.read(): 
     if c not in usage: 
      usage.update({c:1}) 
     else: 
      usage[c] += 1 
    return usage 
+0

但上面的答案也有统计数字。谢谢你回答我的问题。 – Hali 2010-11-15 20:48:44

+0

@Hali - 没有看到上面的编辑(而且答案需要一大块Python) – new123456 2010-11-15 22:10:31

1

这将做到这一点:

#!/usr/bin/perl -n 
# 
# charcounts - show how many times each code point is used 
# Tom Christiansen <[email protected]> 

use open ":utf8"; 

++$seen{ ord() } for split //; 

END { 
    for my $cp (sort {$seen{$b} <=> $seen{$a}} keys %seen) { 
     printf "%04X %d\n", $cp, $seen{$cp}; 
    } 
} 

运行本身,该程序产生:

$ charcounts /tmp/charcounts | head 
0020 46 
0065 20 
0073 18 
006E 15 
000A 14 
006F 12 
0072 11 
0074 10 
0063 9 
0070 9 

如果你想要的字符的文字字符和/或名称,也这很容易添加。

如果你想要更复杂的东西,这个程序通过Unicode属性计算出字符。这对你的目的来说可能已经足够了,如果没有,你应该可以适应它。

#!/usr/bin/perl 
# 
# unicats - show character distribution by Unicode character property 
# Tom Christiansen <[email protected]> 

use strict; 
use warnings qw<FATAL all>; 

use open ":utf8"; 

my %cats; 

our %Prop_Table; 
build_prop_table(); 

if (@ARGV == 0 && -t STDIN) { 
    warn <<"END_WARNING"; 
$0: reading UTF-8 character data directly from your tty 
\tSo please type stuff... 
\t and then hit your tty's EOF sequence when done. 
END_WARNING 

} 

while (<>) { 
    for (split(//)) { 

     $cats{Total}++; 

     if (/\p{ASCII}/) { $cats{ASCII}++ } 
     else    { $cats{Unicode}++ } 

     my $gcat = get_general_category($_); 
     $cats{$gcat}++; 

     my $subcat = get_general_subcategory($_); 
     $cats{$subcat}++; 

    } 
} 

my $width = length $cats{Total}; 

my $mask = "%*d %s\n"; 

for my $cat(qw< Total ASCII Unicode >) { 
    printf $mask, $width => $cats{$cat} || 0, $cat; 
} 
print "\n"; 

my @catnames = qw[ 
    L Lu Ll Lt Lm Lo 
    N Nd Nl No 
    S Sm Sc Sk So 
    P Pc Pd Ps Pe Pi Pf Po 
    M Mn Mc Me 
    Z Zs Zl Zp 
    C Cc Cf Cs Co Cn 
]; 

#for my $cat (sort keys %cats) { 
for my $cat (@catnames) { 
    next if length($cat) > 2; 
    next unless $cats{$cat}; 

    my $prop = length($cat) == 1 
       ? (" " . q<\p> . $cat   ) 
       : (  q<\p> . "{$cat}" . "\t") 
      ; 

    my $desc = sprintf("%-6s %s", $prop, $Prop_Table{$cat}); 

    printf $mask, $width => $cats{$cat}, $desc; 
} 

exit; 

sub get_general_category { 
    my $_ = shift(); 
    return "L" if /\pL/; 
    return "S" if /\pS/; 
    return "P" if /\pP/; 
    return "N" if /\pN/; 
    return "C" if /\pC/; 
    return "M" if /\pM/; 
    return "Z" if /\pZ/; 

    die "not reached one: $_"; 
} 

sub get_general_subcategory { 
    my $_ = shift(); 

    return "Lu" if /\p{Lu}/; 
    return "Ll" if /\p{Ll}/; 
    return "Lt" if /\p{Lt}/; 
    return "Lm" if /\p{Lm}/; 
    return "Lo" if /\p{Lo}/; 

    return "Mn" if /\p{Mn}/; 
    return "Mc" if /\p{Mc}/; 
    return "Me" if /\p{Me}/; 

    return "Nd" if /\p{Nd}/; 
    return "Nl" if /\p{Nl}/; 
    return "No" if /\p{No}/; 

    return "Pc" if /\p{Pc}/; 
    return "Pd" if /\p{Pd}/; 
    return "Ps" if /\p{Ps}/; 
    return "Pe" if /\p{Pe}/; 
    return "Pi" if /\p{Pi}/; 
    return "Pf" if /\p{Pf}/; 
    return "Po" if /\p{Po}/; 

    return "Sm" if /\p{Sm}/; 
    return "Sc" if /\p{Sc}/; 
    return "Sk" if /\p{Sk}/; 
    return "So" if /\p{So}/; 

    return "Zs" if /\p{Zs}/; 
    return "Zl" if /\p{Zl}/; 
    return "Zp" if /\p{Zp}/; 

    return "Cc" if /\p{Cc}/; 
    return "Cf" if /\p{Cf}/; 
    return "Cs" if /\p{Cs}/; 
    return "Co" if /\p{Co}/; 
    return "Cn" if /\p{Cn}/; 

    die "not reached two: <$_> " . sprintf("U+%vX", $_); 

} 

sub build_prop_table { 

    for my $line (<<"End_of_Property_List" =~ m{ \S .* \S }gx) { 

     L   Letter 
     Lu   Uppercase_Letter 
     Ll   Lowercase_Letter 
     Lt   Titlecase_Letter 
     Lm   Modifier_Letter 
     Lo   Other_Letter 

     M   Mark (combining characters, including diacritics) 
     Mn   Nonspacing_Mark 
     Mc   Spacing_Mark 
     Me   Enclosing_Mark 

     N   Number 
     Nd   Decimal_Number (also Digit) 
     Nl   Letter_Number 
     No   Other_Number 

     P   Punctuation 
     Pc   Connector_Punctuation 
     Pd   Dash_Punctuation 
     Ps   Open_Punctuation 
     Pe   Close_Punctuation 
     Pi   Initial_Punctuation (may behave like Ps or Pe depending on usage) 
     Pf   Final_Punctuation (may behave like Ps or Pe depending on usage) 
     Po   Other_Punctuation 

     S   Symbol 
     Sm   Math_Symbol 
     Sc   Currency_Symbol 
     Sk   Modifier_Symbol 
     So   Other_Symbol 

     Z   Separator 
     Zs   Space_Separator 
     Zl   Line_Separator 
     Zp   Paragraph_Separator 

     C   Other (means not L/N/P/S/Z) 
     Cc   Control (also Cntrl) 
     Cf   Format 
     Cs   Surrogate (not usable) 
     Co   Private_Use 
     Cn   Unassigned 

End_of_Property_List 

      my($short_prop, $long_prop) = $line =~ m{ 
       \b 
       (\p{Lu} \p{Ll} ?) 
       \s + 
       (\p{Lu} [\p{L&}_] +) 
       \b 
      }x; 

      $Prop_Table{$short_prop} = $long_prop; 

    } 

} 

例如:

$ unicats book.txt 
2357232 Total 
2357199 ASCII 
    33 Unicode 

1604949 \pL Letter 
    74455 \p{Lu} Uppercase_Letter 
1530485 \p{Ll} Lowercase_Letter 
     9 \p{Lo} Other_Letter 
    10676 \pN Number 
    10676 \p{Nd} Decimal_Number 
    19679 \pS Symbol 
    10705 \p{Sm} Math_Symbol 
    8365 \p{Sc} Currency_Symbol 
    603 \p{Sk} Modifier_Symbol 
     6 \p{So} Other_Symbol 
111899 \pP Punctuation 
    2996 \p{Pc} Connector_Punctuation 
    6145 \p{Pd} Dash_Punctuation 
    11392 \p{Ps} Open_Punctuation 
    11371 \p{Pe} Close_Punctuation 
    79995 \p{Po} Other_Punctuation 
548529 \pZ Separator 
548529 \p{Zs} Space_Separator 
    61500 \pC Other 
    61500 \p{Cc} Control 
+0

看起来太干净了:-)谢谢你回答我的问题。 – Hali 2010-11-15 20:44:54

+0

@Hali,判断不是cleanliless的编程语言,而是程序员。任何生成乱码的程序员都是白痴和威胁。 – tchrist 2010-11-15 21:03:11