2010-10-19 182 views
0

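My database contains URLs stored as text fields, and each URL contains a representation of the report's date, which is missing from the report itself. How do I parse the date from the URL format?

So I need to parse the date in the URL field into a string representation such as: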

2010-10-12 
2007-01-03 
2008-02-07 

What is the best way to extract the date?

Some are in this format:

http://e.com/data/invoices/2010/09/invoices-report-wednesday-september-1st-2010.html 

http://e.com/data/invoices/2010/09/invoices-report-thursday-september-2-2010.html 

http://e.com/data/invoices/2010/09/invoices-report-wednesday-september-15-2010.html 

http://e.com/data/invoices/2010/09/invoices-report-monday-september-13th-2010.html 

http://e.com/data/invoices/2010/08/invoices-report-monday-august-30th-2010.html 

http://e.com/data/invoices/2009/05/invoices-report-friday-may-8th-2009.html 

http://e.com/data/invoices/2010/10/invoices-report-wednesday-october-6th-2010.html 

http://e.com/data/invoices/2010/09/invoices-report-tuesday-september-21-2010.html 

Note the inconsistent use of the ordinal suffix (such as th) after the day of the month, as in these two:

http://e.com/data/invoices/2010/09/invoices-report-wednesday-september-15-2010.html 

http://e.com/data/invoices/2010/09/invoices-report-monday-september-13th-2010.html 

Others are in this format (with three hyphens before the date starts, no year at the end, and an optional invoices- before report):

http://e.com/data/invoices/2010/09/invoices-report---wednesday-september-1.html 

http://e.com/data/invoices/2010/09/invoices-report---thursday-september-2.html 

http://e.com/data/invoices/2010/09/invoices-report---wednesday-september-15.html 

http://e.com/data/invoices/2010/09/invoices-report---monday-september-13.html 

http://e.com/data/invoices/2010/08/report---monday-august-30.html 

http://e.com/data/invoices/2009/05/report---friday-may-8.html 

http://e.com/data/invoices/2010/10/report---wednesday-october-6.html 

http://e.com/data/invoices/2010/09/report---tuesday-september-21.html 

Answers

5

You want a regex like this:

"^http://e.com/data/invoices/(\\d{4})/(\\d{2})/\\D+(\\d{1,2})" 

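This exploits the fact that everything up to the /year/month/ part of the URL is always the same, and that no digit appears after it until the day of the month. Once you have those three values, you don't care about anything else in the URL.

The first capture group is the year, the second is the month, and the third is the day. The day may not have a leading zero; either convert the string to an integer and format it as needed, or just check the string length and, if it is not two, prepend the string "0".

For example: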

import java.util.regex.Matcher;
import java.util.regex.Pattern;

class URLDate {
    public static void main(String[] args) {
        String text = "http://e.com/data/invoices/2010/09/invoices-report-wednesday-september-1st-2010.html";
        // Groups: 1 = year, 2 = month, 3 = day of the month.
        String regex = "http://e.com/data/invoices/(\\d{4})/(\\d{2})/\\D+(\\d{1,2})";
        Pattern p = Pattern.compile(regex);
        Matcher m = p.matcher(text);
        if (m.find()) {
            int count = m.groupCount();
            System.out.format("matched with groups:\n");
            // Group 0 is the whole match, so loop from 0 through count inclusive.
            for (int i = 0; i <= count; ++i) {
                String group = m.group(i);
                System.out.format("\t%d: %s\n", i, group);
            }
        } else {
            System.out.println("failed to match!");
        }
    }
}

This gives the output:

matched with groups: 
    0: http://e.com/data/invoices/2010/09/invoices-report-wednesday-september-1
    1: 2010 
    2: 09 
    3: 1 
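
To get from the three captured groups to the yyyy-MM-dd strings the question asks for, the integer-conversion route mentioned above could look roughly like the sketch below. The class name UrlDateFormatter and the helper extractDate are made up for illustration, and returning null for a non-matching URL is just one possible choice.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UrlDateFormatter {
    // Same pattern as in the answer: year, month, then the first run of digits in the file name.
    private static final Pattern URL_DATE =
        Pattern.compile("^http://e.com/data/invoices/(\\d{4})/(\\d{2})/\\D+(\\d{1,2})");

    // Hypothetical helper: returns "yyyy-MM-dd", or null when the URL does not match.
    static String extractDate(String url) {
        Matcher m = URL_DATE.matcher(url);
        if (!m.find()) {
            return null;
        }
        int year = Integer.parseInt(m.group(1));
        int month = Integer.parseInt(m.group(2));
        int day = Integer.parseInt(m.group(3));
        // %02d pads single-digit months and days with a leading zero.
        return String.format("%04d-%02d-%02d", year, month, day);
    }

    public static void main(String[] args) {
        String[] urls = {
            "http://e.com/data/invoices/2010/09/invoices-report-wednesday-september-1st-2010.html",
            "http://e.com/data/invoices/2009/05/report---friday-may-8.html",
            "http://e.com/data/invoices/2010/09/invoices-report---monday-september-13.html"
        };
        for (String url : urls) {
            System.out.println(extractDate(url));
        }
    }
}
```

Because the year and month come from the path rather than the file name, the same helper handles both URL styles from the question, with or without the ordinal suffix and the trailing year.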

(Note that to use Matcher.matches() instead of Matcher.find(), you will have to make the pattern consume the entire input string by appending .*$ to it.)
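
A small sketch of that difference, in case it helps. The class and method names here are invented for the demonstration; only Pattern, Matcher.find() and Matcher.matches() come from java.util.regex.

```java
import java.util.regex.Pattern;

class FindVersusMatches {
    static final String PREFIX =
        "^http://e.com/data/invoices/(\\d{4})/(\\d{2})/\\D+(\\d{1,2})";

    // find() succeeds as soon as some part of the input matches the pattern.
    static boolean findsPrefix(String url) {
        return Pattern.compile(PREFIX).matcher(url).find();
    }

    // matches() requires the pattern to cover the whole input,
    // so the bare prefix pattern fails on a full URL.
    static boolean matchesPrefixOnly(String url) {
        return Pattern.compile(PREFIX).matcher(url).matches();
    }

    // Appending ".*$" lets the pattern consume the rest of the URL.
    static boolean matchesWithTail(String url) {
        return Pattern.compile(PREFIX + ".*$").matcher(url).matches();
    }

    public static void main(String[] args) {
        String url =
            "http://e.com/data/invoices/2010/09/invoices-report-wednesday-september-1st-2010.html";
        System.out.println("find, prefix only:    " + findsPrefix(url));
        System.out.println("matches, prefix only: " + matchesPrefixOnly(url));
        System.out.println("matches, with .*$:    " + matchesWithTail(url));
    }
}
```

The capture groups are the same in all three cases; only whether the match succeeds, and what group 0 contains, depends on which method and pattern you pick.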

+0

Perfect. Thanks for the warning about 'matches()' versus 'find()'. – snoopy 2010-10-19 19:00:28