Pattern matching for URL classification

As part of a project, a few of us are working on URL classification. What we are trying to achieve is actually quite simple: just look at a URL, find the relevant keywords in it, and classify the page accordingly.

For example: if the URL is http://cnnworld/sports/abcd, we would classify it under the category "Sports".

To do this, we have a database of mappings of the form: keyword -> category.

What we currently do is: for each URL, we keep reading every entry in the database and use String.find() to check whether the keyword appears in the URL. Once a match is found, we stop.
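
For concreteness, a minimal sketch of what the current approach amounts to (the Mapping class and the sample rows are hypothetical, for illustration only):

import java.util.*;

public class CurrentApproachSketch {
    /** Hypothetical stand-in for one row of the keyword -> category table. */
    static class Mapping {
        final String keyword;
        final String category;
        Mapping(String keyword, String category) { this.keyword = keyword; this.category = category; }
    }

    // Scans the whole table for every URL and stops at the first hit:
    // one substring search per keyword, and only a single category can ever be returned.
    static String classify(String url, List<Mapping> table) {
        for (Mapping m : table) {
            if (url.contains(m.keyword)) {   // the String.find()-style check
                return m.category;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        List<Mapping> table = Arrays.asList(
                new Mapping("sports", "Sports"),
                new Mapping("news", "News"));
        System.out.println(classify("http://cnnworld/sports/abcd", table)); // prints Sports
    }
}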

But this approach has a few problems, mainly:

(i) Our database is very large, so this repeated querying runs extremely slowly.

(ii) A page may belong to more than one category, and our approach does not handle that. Of course, a simple way to ensure this would be to keep querying the database even after a category match is found, but that would only make things slower.

I was thinking about alternatives and wondering whether the work could be done in reverse - parse the URL, find the words that occur in it, and then query the database only for those words.

A naive algorithm for this would run in O(n^2) - query the database for every substring that occurs in the URL.

I was wondering whether there is a better way to do this. Any ideas?? Thanks in advance :)

Answers

0

If you have (many) fewer categories than keywords, you could create a regex for each category that matches any of that category's keywords. You would then run your URL against each category's regex. This also solves the problem of matching multiple categories.
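
A minimal sketch of that idea, assuming the keyword-to-category mapping has already been loaded into memory (the class name, map contents and sample keywords below are made up for illustration):

import java.util.*;
import java.util.regex.*;

public class CategoryRegexMatcher {

    // Build one compiled regex per category from that category's keywords.
    static Map<String, Pattern> buildPatterns(Map<String, List<String>> keywordsByCategory) {
        Map<String, Pattern> patterns = new HashMap<>();
        for (Map.Entry<String, List<String>> e : keywordsByCategory.entrySet()) {
            StringBuilder alternation = new StringBuilder();
            for (String keyword : e.getValue()) {
                if (alternation.length() > 0) alternation.append('|');
                alternation.append(Pattern.quote(keyword));   // escape regex metacharacters
            }
            // \b ... \b keeps "sport" from matching inside "transport"; case-insensitive
            patterns.put(e.getKey(),
                    Pattern.compile("\\b(?:" + alternation + ")\\b", Pattern.CASE_INSENSITIVE));
        }
        return patterns;
    }

    // Return every category whose regex matches somewhere in the URL.
    static List<String> categorize(String url, Map<String, Pattern> patterns) {
        List<String> categories = new ArrayList<>();
        for (Map.Entry<String, Pattern> e : patterns.entrySet()) {
            if (e.getValue().matcher(url).find()) {
                categories.add(e.getKey());
            }
        }
        return categories;
    }

    public static void main(String[] args) {
        // Placeholder data standing in for the keyword -> category table
        Map<String, List<String>> db = new HashMap<>();
        db.put("Sports", Arrays.asList("sports", "football", "tennis"));
        db.put("News", Arrays.asList("news", "politics"));
        System.out.println(categorize("http://cnnworld/sports/abcd", buildPatterns(db)));
        // prints [Sports]
    }
}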

0

I think your suggestion of splitting the URL apart to find the useful bits and then querying for just those items sounds like a decent way to go.

I threw together some Java that might help illustrate code-wise what I think this would entail. The most valuable parts are probably the regexes, but I hope the general algorithm of it helps some as well:

import java.io.UnsupportedEncodingException; 
import java.net.URLDecoder; 
import java.util.List; 

public class CategoryParser 
{ 
    /** The db field that keywords should be checked against */ 
    private static final String DB_KEYWORD_FIELD_NAME = "keyword"; 

    /** The db field that categories should be pulled from */ 
    private static final String DB_CATEGORY_FIELD_NAME = "category"; 

    /** The name of the table to query */ 
    private static final String DB_TABLE_NAME = "KeywordCategoryMap"; 

    /** 
    * This method takes a URL and from that text alone determines what categories that URL belongs in. 
    * @param url - String URL to categorize 
    * @return categories - A List<String> of categories the URL seemingly belongs in 
    */ 
    public static List<String> getCategoriesFromUrl(String url) { 

     // Clean the URL to remove useless bits and encoding artifacts 
     String normalizedUrl = normalizeURL(url); 

     // Break the url apart and get the good stuff 
     String[] keywords = tokenizeURL(normalizedUrl); 

     // Construct the query we can query the database with 
     String query = constructKeywordCategoryQuery(keywords); 

     System.out.println("Generated Query: " + query); 

     // At this point, you'd need to fire this query off to your database, 
     // and the results you'd get back should each be a valid category 
     // for your URL. This code is not provided because it's very implementation specific, 
     // and you already know how to deal with databases. 


     // Returning null to make this compile, even though you'd obviously want to return the 
     // actual List of Strings 
     return null; 
    } 

    /** 
    * Removes the protocol, if it exists, from the front and 
    * removes any random encoding characters 
    * Extend this to do other url cleaning/pre-processing 
    * @param url - The String URL to normalize 
    * @return normalizedUrl - The String URL that has no junk or surprises 
    */ 
    private static String normalizeURL(String url) 
    { 
     // Decode URL to remove any %20 type stuff 
     String normalizedUrl = url; 
     try { 
      // I've used a URLDecoder that's part of Java here, 
      // but this functionality exists in most modern languages 
      // and is universally called url decoding 
      normalizedUrl = URLDecoder.decode(url, "UTF-8"); 
     } 
     catch(UnsupportedEncodingException uee) 
     { 
      System.err.println("Unable to Decode URL. Decoding skipped."); 
      uee.printStackTrace(); 
     } 

     // Remove the protocol, http:// ftp:// or similar from the front 
     if (normalizedUrl.contains("://")) 
     { 
      normalizedUrl = normalizedUrl.split(":\\/\\/")[1]; 
     } 

     // Room here to do more pre-processing 

     return normalizedUrl; 
    } 

    /** 
    * Takes apart the url into the pieces that make at least some sense 
    * This doesn't guarantee that each token is a potentially valid keyword, however 
    * because that would require actually iterating over them again, which might be 
    * seen as a waste. 
    * @param url - Url to be tokenized 
    * @return tokens - A String array of all the tokens 
    */ 
    private static String[] tokenizeURL(String url) 
    { 
     // I assume that we're going to use the whole URL to find tokens in 
     // If you want to just look in the GET parameters, or you want to ignore the domain 
     // or you want to use the domain as a token itself, that would have to be 
     // processed above the next line, and only the remaining parts split 
     String[] tokens = url.split("\\b|_"); 

     // One could alternatively use a more complex regex to remove more invalid matches 
     // but this is subject to your (?:in)?ability to actually write the regex you want 

     // These next two get rid of tokens that are too short, also. 

     // Destroys anything that's not alphanumeric and things that are 
     // alphanumeric but only 1 character long 
     //String[] tokens = url.split("(?:[\\W_]+\\w)*[\\W_]+"); 

     // Destroys anything that's not alphanumeric and things that are 
     // alphanumeric but only 1 or 2 characters long 
     //String[] tokens = url.split("(?:[\\W_]+\\w{1,2})*[\\W_]+"); 

     return tokens; 
    } 

    private static String constructKeywordCategoryQuery(String[] keywords) 
    { 
     // This will hold our WHERE body, keyword OR keyword2 OR keyword3 
     StringBuilder whereItems = new StringBuilder(); 

     // Potential query, if we find anything valid 
     String query = null; 

     // Iterate over every found token 
     for (String keyword : keywords) 
     { 
      // Reject invalid keywords 
      if (isKeywordValid(keyword)) 
      { 
       // If we need an OR 
       if (whereItems.length() > 0) 
       { 
        whereItems.append(" OR "); 
       } 

       // Simply append this item to the query 
       // Yields something like "keyword='thisKeyword'" 
       whereItems.append(DB_KEYWORD_FIELD_NAME); 
       whereItems.append("='"); 
       whereItems.append(keyword); 
       whereItems.append("'"); 
      } 
     } 

     // If a valid keyword actually made it into the query 
     if (whereItems.length() > 0) 
     { 
      query = "SELECT DISTINCT(" + DB_CATEGORY_FIELD_NAME + ") FROM " + DB_TABLE_NAME 
        + " WHERE " + whereItems.toString() + ";"; 
     } 

     return query; 
    } 

    private static boolean isKeywordValid(String keyword) 
    { 
     // Keywords better be at least 2 characters long 
     return keyword.length() > 1 
       // And they better be only composed of letters and numbers 
       && keyword.matches("\\w+") 
       // And they better not be *just* numbers 
       // && !keyword.matches("\\d+") // If you want this 
       ; 
    } 

    // How this would be used 
    public static void main(String[] args) 
    { 
     List<String> soQuestionUrlClassifications = getCategoriesFromUrl("http://stackoverflow.com/questions/10046178/pattern-matching-for-url-classification"); 
     List<String> googleQueryURLClassifications = getCategoriesFromUrl("https://www.google.com/search?sugexp=chrome,mod=18&sourceid=chrome&ie=UTF-8&q=spring+is+a+new+service+instance+created#hl=en&sugexp=ciatsh&gs_nf=1&gs_mss=spring%20is%20a%20new%20bean%20instance%20created&tok=lnAt2g0iy8CWkY65Te75sg&pq=spring%20is%20a%20new%20bean%20instance%20created&cp=6&gs_id=1l&xhr=t&q=urlencode&pf=p&safe=off&sclient=psy-ab&oq=url+en&gs_l=&pbx=1&bav=on.2,or.r_gc.r_pw.r_cp.r_qf.,cf.osb&fp=2176d1af1be1f17d&biw=1680&bih=965"); 
    } 
} 

The query generated for the SO link would look like this:

SELECT DISTINCT(category) FROM KeywordCategoryMap WHERE keyword='stackoverflow' OR keyword='com' OR keyword='questions' OR keyword='10046178' OR keyword='pattern' OR keyword='matching' OR keyword='for' OR keyword='url' OR keyword='classification' 

There is plenty of room for optimization there, but I imagine it would be much faster than checking your string against every possible keyword.

0

The Aho-Corasick algorithm is best suited for finding keywords inside a string in a single traversal. You build a tree (the Aho-Corasick trie) from your keywords, and the terminal node of each keyword stores a number that maps to that specific keyword.

Now you simply traverse the tree with the URL string. Whenever you reach one of those numbers (which acts as a flag in this scenario), it means a mapped category was found. Look that number up in a hash map to find the corresponding category for further use.

I think this will help you. See this link: good animation of aho-corasick by ivan
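
A compact Java sketch of that approach (the keyword list in main is made up; in practice it would be loaded from the keyword -> category table, and each match would then be mapped back to its categories):

import java.util.*;

/** Minimal Aho-Corasick sketch: build a trie of keywords with failure links,
 *  then scan the URL once, collecting every keyword that occurs in it. */
public class AhoCorasickSketch {

    private static class Node {
        Map<Character, Node> children = new HashMap<>();
        Node fail;                                 // failure link
        List<String> outputs = new ArrayList<>();  // keywords ending at this node
    }

    private final Node root = new Node();

    /** Add a keyword to the trie (lower-cased, so matching is case-insensitive). */
    public void addKeyword(String keyword) {
        Node node = root;
        for (char c : keyword.toLowerCase().toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new Node());
        }
        node.outputs.add(keyword);
    }

    /** Build failure links with a breadth-first pass over the trie. */
    public void buildFailureLinks() {
        Deque<Node> queue = new ArrayDeque<>();
        root.fail = root;
        for (Node child : root.children.values()) {
            child.fail = root;
            queue.add(child);
        }
        while (!queue.isEmpty()) {
            Node current = queue.poll();
            for (Map.Entry<Character, Node> e : current.children.entrySet()) {
                char c = e.getKey();
                Node child = e.getValue();
                Node fail = current.fail;
                while (fail != root && !fail.children.containsKey(c)) {
                    fail = fail.fail;
                }
                child.fail = fail.children.getOrDefault(c, root);
                child.outputs.addAll(child.fail.outputs); // inherit shorter suffix matches
                queue.add(child);
            }
        }
    }

    /** Scan the text once and return every keyword found in it. */
    public Set<String> findKeywords(String text) {
        Set<String> found = new HashSet<>();
        Node node = root;
        for (char c : text.toLowerCase().toCharArray()) {
            while (node != root && !node.children.containsKey(c)) {
                node = node.fail;
            }
            node = node.children.getOrDefault(c, root);
            found.addAll(node.outputs);
        }
        return found;
    }

    public static void main(String[] args) {
        AhoCorasickSketch matcher = new AhoCorasickSketch();
        // Hypothetical keyword list; in practice this comes from the database
        for (String kw : Arrays.asList("sports", "news", "cnn")) {
            matcher.addKeyword(kw);
        }
        matcher.buildFailureLinks();
        System.out.println(matcher.findKeywords("http://cnnworld/sports/abcd"));
        // prints "sports" and "cnn", in some order
    }
}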

2

In our commercial classification system we have a database of 4M keywords :) and we also search the body of the HTML. There are a few ways to deal with numbers like that:

  1. Using Aho-Corasick - we used a modified version of the algorithm specialized for web content; for example, it treats tab, space, \r and \n as whitespace and collapses runs of them, so two spaces are treated as one, and it also ignores lower/upper case.
  2. Another option is to put all your keywords in a tree (e.g. std::map), so lookups become very fast; the downside is memory consumption, and a lot of it, but if it runs on a server you won't feel it (see the sketch after this list).
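
A small Java sketch of the second option (a TreeMap standing in for std::map; the table entries below are placeholders):

import java.util.*;

public class KeywordTreeLookup {
    public static void main(String[] args) {
        // Keyword -> category map held entirely in memory, sorted for O(log n) lookups.
        // The entries here stand in for the real keyword/category table.
        NavigableMap<String, String> keywordToCategory = new TreeMap<>();
        keywordToCategory.put("sports", "Sports");
        keywordToCategory.put("football", "Sports");
        keywordToCategory.put("news", "News");

        String url = "http://cnnworld/sports/abcd";

        // Tokenize the URL on non-alphanumeric characters and look up each token directly,
        // instead of scanning the whole keyword table for every URL.
        Set<String> categories = new LinkedHashSet<>();
        for (String token : url.toLowerCase().split("[^a-z0-9]+")) {
            String category = keywordToCategory.get(token);
            if (category != null) {
                categories.add(category);
            }
        }
        System.out.println(categories); // prints [Sports]
    }
}
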
+0

Please don't paste signatures or append them to any post or question. You're welcome to put that on your profile, but in SO threads it just adds noise. – ForceMagic 2012-10-14 01:00:57