省略一组名称

好吧，我确定有人在某个地方肯定已经为此提出了一个算法，所以我想我会在我离开前自己去重新创建它。省略一组名称

我有一个任意（用户输入的）非空文本字符串的列表。每个字符串可以是任意长度（0除外），它们都是唯一的。我想将它们展示给用户，但我想将它们修剪成我确定的固定长度，并用省略号（...）替换它们的一部分。问题是我希望所有的输出字符串都是唯一的。

例如，如果我有字符串：

的Microsoft Internet Explorer 6
微软的Internet Explorer 7
微软的Internet Explorer 8
Mozilla Firefox浏览器3
Mozilla Firefox浏览器4
Google Chrome 14

然后我不想修剪字符串的末尾，因为这是独特的部分（不想显示“Microsoft Internet ...”3次），但可以剪掉中间部分：

微软... RER 6
微软... RER 7
微软... RER 8
是Mozilla Firefox 3
的Mozilla Firefox 4
谷歌浏览器14

其他时候，中间部分可能是独一无二的，我要修剪的结尾：

公司会议，2010年5月25日的

分钟 - 仅供内部使用
公司会议，2010年6月24日的纪要 - 内部使用公司会议，23/7/2010只有
分钟 - 仅供内部使用

将变成：

公司会议，2010年5月25日的分...
公司会议，2010年6月24日的分...
公司会议，23/7/2010纪要...

我猜应该可能永远ellipsize的非常开头的字符串，即使否则将被允许的，因为这将是怪异。我猜想它可以在字符串中的多个位置上省略椭圆，但在合理范围内 - 也许2次会好，但3次或更多似乎过多。或者可能的次数并不像剩余块的大小那么重要：椭圆之间少于大约5个字符将是毫无意义的。

输入（数量和大小）不会很大，所以性能不是一个主要问题（好吧，只要算法不会像列举所有可能的字符串一样愚蠢，直到找到一个集合这样可行！）。

我想这些要求看起来很具体，但我其实很宽松 - 我只是想描述一下我的想法。

之前有过类似的事情吗？是否有一些现有的算法或库这样做？我搜索了一些，但到目前为止还没有发现任何东西（但也许我只是在使用google搜索而已）。我必须相信有人想要解决这个问题！

来源

2011-02-14 Ken

这听起来像的longest common substring problem.

申请更换最长共同子串用省略号所有字符串。如果字符串仍然过长，并且允许您使用其他省略号，请重复。

您必须认识到，您可能无法“椭圆化”一组足够符合长度要求的字符串。

来源

2011-02-14 20:28:17 erickson

嗯，这不是一个不好的起点，但我不认为这是我想要的。也许我的例子没有被选择清楚，但我不要求椭圆只替换相等的子字符串：只是输出字符串是唯一的。例如，如果给出两个输入“Herzkreislaufwiederbelebung”和“Geschwindigkeitsbegrenzung”，并且我想修剪到长度= 12（包括圆点），则可以返回“Herzkreis ...”和“Geschwind ...”。 – Ken 2011-02-14 20:37:23

@Ken听起来像你可以让他们呆滞。 – Orbling 2011-02-14 20:56:27

排序字符串。保留每个字符串的前X个字符。如果此前缀对字符串前后不唯一，则前进至唯一字符（与前后字符串相比）。（如果找不到唯一字符，则该字符串没有唯一部分，请参阅帖子底部）在这些唯一字符前后添加省略号。

注意，这可能仍然看起来很有趣：

Microsoft Office -> Micro...ffice 
Microsoft Outlook -> Micro...utlook

我不知道你要做到这一点用什么语言，但这里有一个Python实现。

def unique_index(before, current, after, size): 
    '''Returns the index of the first part of _current_ of length _size_ that is 
     unique to it, _before_, and _after_. If _current_ has no part unique to it, 
     _before_, and _after_, it returns the _size_ letters at the end of _current_''' 
    before_unique = False 
    after_unique = False 
    for i in range(len(current)-size): 
     #this will be incorrect in the case mentioned below 
     if i > len(before)-1 or before[i] != current[i]: 
      before_unique = True 
     if i > len(after)-1 or after[i] != current[i]: 
      after_unique = True 
     if before_unique and after_unique: 
      return i 

    return len(current)-size 

def ellipsize(entries, prefix_size, max_string_length): 
    non_prefix_size = max_string_length - prefix_size #-len("...")? Post isn't clear about this. 

    #If you want to preserve order then make a copy and make a mapping from the copy to the original 
    entries.sort() 

    ellipsized = [] 

    # you could probably remove all this indexing with something out of itertools 
    for i in range(len(entries)): 
     current = entries[i] 

     #entry is already short enough, don't need to truncate 
     if len(current) <= max_string_length: 
      ellipsized.append(current) 
      continue 

     #grab empty strings if there's no string before/after 
     if i == 0: 
      before = '' 
     else: 
      before = entries[i-1] 
     if i == len(entries)-1: 
      after = '' 
     else: 
      after = entries[i+1] 

     #Is the prefix unique? If so, we're done. 
     current_prefix = entries[i][:prefix_size]  
     if not before.startswith(current_prefix) and not after.startswith(current_prefix): 
      ellipsized.append(current[:max_string_length] + '...') #again, possibly -3 

     #Otherwise find the unique part after the prefix if it exists. 
     else: 
      index = prefix_size + unique_index(before[prefix_size:], current[prefix_size:], after[prefix_size:], non_prefix_size) 
      if index == prefix_size: 
       header = '' 
      else: 
       header = '...' 
      if index + non_prefix_size == len(current): 
       trailer = '' 
      else: 
       trailer = '...' 
      ellipsized.append(entries[i][:prefix_size] + header + entries[i][index:index+non_prefix_size] + trailer) 
    return ellipsized

另外，你提到的字符串本身是唯一的，但它们是否都有独特的部分？例如，“Microsoft”和“Microsoft Internet Explorer 7”是两个不同的字符串，但第一个没有与第二个不同的部分。如果是这种情况，那么您必须在规范中添加一些内容，以便使该案例明确无误。（如果添加“Xicrosoft”，“MXcrosoft”，“MiXrosoft”等与这两个字符串混合在一起，则存在否唯一字符串比原始字符串短以代表“Microsoft”）（另一种思考方式它：如果你有所有可能的X字符串，你不能将它们全部压缩到X-1或更少的字符串。就像没有压缩方法可以压缩所有的输入一样，因为这本质上是一种压缩方法）。结果来自原帖：

>>> for entry in ellipsize(["Microsoft Internet Explorer 6", "Microsoft Internet Explorer 7", "Microsoft Internet Explorer 8", "Mozilla Firefox 3", "Mozilla Firefox 4", "Google Chrome 14"], 7, 20): 
    print entry 

Google Chrome 14 
Microso...et Explorer 6 
Microso...et Explorer 7 
Microso...et Explorer 8 
Mozilla Firefox 3 
Mozilla Firefox 4 
>>> for entry in ellipsize(["Minutes of Company Meeting, 5/25/2010 -- Internal use only", "Minutes of Company Meeting, 6/24/2010 -- Internal use only", "Minutes of Company Meeting, 7/23/2010 -- Internal use only"], 15, 40): 
    print entry 

Minutes of Comp...5/25/2010 -- Internal use... 
Minutes of Comp...6/24/2010 -- Internal use... 
Minutes of Comp...7/23/2010 -- Internal use...

来源

2011-02-14 22:01:35 user470379

省略一组名称

回答

相关问题