How to index web pages using Xapian

I am using Ubuntu 12.04 and Python 2.7. How can I index web pages with Xapian so that a search returns their URLs?

This is my code for fetching the content of a given URL:

import urllib

def get_page(url):
    '''Gets the contents of a page from a given URL'''
    try:
        f = urllib.urlopen(url)
        page = f.read()
        f.close()
        return page
    except IOError:
        # Could not fetch the page; treat it as empty
        return ""

To filter the content of the page returned by get_page(url):

import re

def filterContents(content):
    '''Filters the content from a page'''
    filteredContent = ''
    # Match text between a '>' and the next '<', ignoring a '>' that
    # directly follows the word "script"
    regex = re.compile(r"(?<!script)[>](?![\s\#'-<]).+?[<]")
    for words in regex.findall(content):
        # split_string() returns a string, not a list, so append it whole
        filteredContent = filteredContent + split_string(words, """ ,"!-.()<>[]{};:?!-=/_`&""")
    return filteredContent

def split_string(source, splitlist):
    '''Replaces every character of source found in splitlist with a space'''
    return ''.join([w if w not in splitlist else ' ' for w in source])

How do I index filteredContent with Xapian so that, when I run a query, I get back the URLs of the pages in which the query terms appear?

2013-04-20 VeilEclipse

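I'm not entirely clear what your filterContents() and split_string() are actually doing (throwing away some HTML tag content, then splitting up words), so let me talk about a similar problem that doesn't have that complexity folded into it.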

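Let's assume we have a function strip_tags() that returns just the text content of an HTML document, plus your get_page() function. We want to build a Xapian database where:

each document represents the representation of a resource pulled from a particular URL
the "words" in that representation (after it has passed through strip_tags()) become searchable terms that index those documents
each document contains, as its document data, the URL it was pulled from.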
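
The answer assumes strip_tags() exists but does not define it. As a rough sketch only, one way to write it with nothing but the standard library is shown here; the _TextExtractor class is a hypothetical helper, and the choice to drop the bodies of script and style elements is my assumption about what "text content" should mean.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2.7, as used in the question


class _TextExtractor(HTMLParser):
    """Collects text nodes, skipping the bodies of <script> and <style>."""

    def __init__(self):
        HTMLParser.__init__(self)
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Only keep text that is outside script/style elements
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join the chunks and collapse runs of whitespace
        return " ".join(" ".join(self._chunks).split())


def strip_tags(html):
    """Return the visible text of an HTML document (sketch, not robust)."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Real-world pages are messier than this handles (character encodings, malformed markup), so treat it as a starting point rather than a finished extractor.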

So you can index as follows:

import xapian 
def index_url(database, url): 
    text = strip_tags(get_page(url)) 
    doc = xapian.Document() 

    # TermGenerator will split text into words 
    # and then (because we set a stemmer) stem them 
    # into terms and add them to the document 
    termgenerator = xapian.TermGenerator() 
    termgenerator.set_stemmer(xapian.Stem("en")) 
    termgenerator.set_document(doc) 
    termgenerator.index_text(text) 

    # We want to be able to get at the URL easily 
    doc.set_data(url) 
    # And we want to ensure each URL only ends up in 
    # the database once. Note that if your URLs are long 
    # then this won't work; consult the FAQ on unique IDs 
    # for more: http://trac.xapian.org/wiki/FAQ/UniqueIds 
    idterm = 'Q' + url 
    doc.add_boolean_term(idterm) 
    database.replace_document(idterm, doc)

# then index an example URL 
db = xapian.WritableDatabase("exampledb", xapian.DB_CREATE_OR_OPEN) 

index_url(db, "https://stackoverflow.com/")
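
The comment in index_url() warns that long URLs will not work as unique ID terms and points to the FAQ. One possible workaround, sketched here, is to fall back to a hash of the URL when the term would be too long. The make_id_term() helper, the "sha1-" marker and the 245-byte figure are all assumptions of mine for illustration; check the FAQ and the documentation for your Xapian version before relying on the limit.

```python
import hashlib

# Assumed limit: Xapian's disk-based backends have historically capped
# term length at roughly 245 bytes. Verify this for your version.
MAX_TERM_BYTES = 245


def make_id_term(url, prefix="Q"):
    """Build a unique-ID term for a URL given as a text string (hypothetical helper)."""
    raw = url if isinstance(url, bytes) else url.encode("utf-8")
    if len(prefix) + len(raw) <= MAX_TERM_BYTES:
        # Short enough: use the URL itself, as the answer does
        return prefix + url
    # Too long: substitute a fixed-length digest of the URL
    return prefix + "sha1-" + hashlib.sha1(raw).hexdigest()
```

In index_url() this would replace the line that builds idterm, for example idterm = make_id_term(url). The URL stored with set_data() stays untouched, so search results still show the full address.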

Searching is then straightforward, although it can obviously be made more sophisticated if needed:

qp = xapian.QueryParser() 
qp.set_stemmer(xapian.Stem("en")) 
qp.set_stemming_strategy(qp.STEM_SOME) 
# A single-word query such as qp.parse_query('question') works the same way
query = qp.parse_query('question and answer')
enquire = xapian.Enquire(db) 
enquire.set_query(query) 
for match in enquire.get_mset(0, 10): 
    print match.document.get_data()

This will display 'https://stackoverflow.com/', because "question and answer" appears on the homepage when you are not logged in.

I recommend looking at the Xapian getting started guide for both the concepts and the code.

2013-04-22 13:28:21

Thank you for your time and help. How can I display the page content as well as the URL? – VeilEclipse 2013-04-24 09:32:58

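Get to grips with the concepts of Xapian. For example, you can put anything you want in the document data; the right approach depends on your situation and what you are doing, so I can't give specific advice. – 2013-04-25 14:35:44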
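
To make the point about document data concrete, here is one possible approach, offered only as a sketch: store a small JSON object holding both the URL and an excerpt of the page text. The helper names and the 200-character excerpt length are assumptions for illustration, not part of Xapian.

```python
import json


def pack_document_data(url, text, snippet_chars=200):
    """Serialise the URL and a short excerpt of the text (hypothetical helper)."""
    return json.dumps({"url": url, "snippet": text[:snippet_chars]})


def unpack_document_data(data):
    """Reverse pack_document_data(); accepts a string or UTF-8 bytes."""
    if isinstance(data, bytes):
        data = data.decode("utf-8")
    return json.loads(data)
```

At indexing time you would pass pack_document_data(url, text) to doc.set_data() instead of the bare URL, and at search time call unpack_document_data(match.document.get_data()) to get both fields back. Whether to store an excerpt, the whole page, or only a key into another store depends on how large your pages are and what you want to show.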
