2015-09-06 82 views
0

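Finding the person a document refers to. Let's say I have:

  • A database of 13,000 person entries, each with first name, name, birthday, street, zip code, city

  • A long text that contains the personal data of one specific person. Because it was processed by OCR, it may contain spelling errors

Here is an example of such a text: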

Harry Potter, born 25.03.1995, resident at Jahnstreet 43, London is a series of seven fantasy novels written by British author J. K. Rowling. The series chronicles the adventures of a young wizard, Harry Potter, the titular character, and his friends Ronald Weasley and Hermione Granger, all of whom are students at Hogwarts School of Witchcraft and Wizardry. The main story arc concerns Harry's quest to defeat the Dark wizard Lord Voldemort, who aims to become immortal, conquer the wizarding world, subjugate non-magical people, and destroy all those who stand in his way, especially Harry Potter. Since the release of the first novel, Harry Potter and the Philosopher's Stone, on 30 June 1997, the books have gained immense popularity, critical acclaim and commercial success worldwide.[2] The series has also had some share of criticism, including concern about the increasingly dark tone as the series progressed. As of May 2015, the books have sold more than 450 million copies worldwide, making the series the best-selling book series in history, and have been translated into 73 languages.[3][4] The last four books consecutively set records as the fastest-selling books in history, with the final installment selling roughly 11 million copies in the United States within the first 24 hours of its release. A series of many genres, including fantasy, coming of age and the British school story (with elements of mystery, thriller, adventureand romance), it has many cultural meanings and references.[5] According to Rowling, the main theme is death.[6] There are also many other themes in the series, such as prejudice and corruption.[7] 


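Now I want to find the database entry for the person this document refers to.


I have several ideas about how to do this, but I don't know which one gives the best results. Which approach would you prefer or recommend? Thanks.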

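  1. I split the text into an array, go through each birthday in the database, and look for it in the text with JavaScript's text.search('25.03.1995'). When there is a hit, I move on to the next field, e.g. text.search('Harry'). If several fields hit, I have found the right record.

    • Pros: easy to implement, no database commands needed, pure JavaScript
    • Cons: if the OCR makes a mistake and reads e.g. Harly instead of Harry, I can't recognize it. The same happens if the date format differs
  2. First, I index the text with the help of a database. Then I take an approach similar to the first one and go through each column in the database, but this time using the database's CONTAINS

    • Pros: faster, better results?
    • Cons: I need a database with good full-text search
  3. I split the text and search the database columns for every single word with SQL LIKE

    • Pros: I don't have to index the document; better than CONTAINS?
    • Cons: not as fast as a text index?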
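As a rough illustration of the first idea, here is a minimal JavaScript sketch. The record layout (`firstName`, `name`, `birthday` stored as an ISO date) and the helper names `dateVariants` and `findByBirthday` are assumptions made for the example, not the asker's actual schema. It tries a few common spellings of the date so that a differing date format does not immediately cause a miss.

```javascript
// Hypothetical records; field names and values are assumptions for illustration.
const people = [
  { firstName: "Harry", name: "Potter", birthday: "1995-03-25", city: "London" },
  { firstName: "Ronald", name: "Weasley", birthday: "1980-03-01", city: "Devon" },
];

// Render an ISO date (YYYY-MM-DD) in a few formats the text might use.
function dateVariants(iso) {
  const [y, m, d] = iso.split("-");
  return [
    `${d}.${m}.${y}`,                 // 25.03.1995
    `${d}/${m}/${y}`,                 // 25/03/1995
    `${Number(d)}.${Number(m)}.${y}`, // 25.3.1995 (no leading zeros)
    iso,                              // 1995-03-25
  ];
}

// Keep records whose birthday occurs in the text in any variant,
// then narrow down by also requiring first name and name.
function findByBirthday(text, records) {
  const lower = text.toLowerCase();
  return records.filter(
    (p) =>
      dateVariants(p.birthday).some((v) => text.includes(v)) &&
      lower.includes(p.firstName.toLowerCase()) &&
      lower.includes(p.name.toLowerCase())
  );
}

const sampleText =
  "Harry Potter, born 25.03.1995, resident at Jahnstreet 43, London";
const hits = findByBirthday(sampleText, people);
console.log(hits.map((p) => `${p.firstName} ${p.name}`));
```

The sketch uses `includes` rather than `search` because `String.prototype.search` converts a string argument into a regular expression, so each `.` in '25.03.1995' would match any character. Exact substring matching like this still fails on OCR errors such as Harly.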

Thanks for your help with this.

+0

Maybe some kind of fuzzy search can help you get around the OCR errors. Try this example: http://glench.github.io/fuzzyset.js/

Answers

1

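I think that, because of OCR errors, you will sometimes have to sort through multiple possible matches, and 13,000 entries don't take much memory. So the first approach is probably easier and can be done entirely in JS. Either way, you will have to try parsing the CSV.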
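One way to sort through several possible matches is to count how many fields of each record occur in the text and rank the records by that count. The following is only a sketch: the records, the field names, and the helpers `scoreRecord` and `rankRecords` are made up for illustration.

```javascript
// Hypothetical in-memory records; 13,000 of these fit comfortably in memory.
const candidates = [
  { firstName: "Anna", name: "Schmidt", birthday: "25.03.1995", street: "Hauptstrasse 1", city: "Berlin" },
  { firstName: "Harry", name: "Miller", birthday: "12.01.1960", street: "Baker Street 5", city: "London" },
  { firstName: "Harry", name: "Potter", birthday: "25.03.1995", street: "Jahnstreet 43", city: "London" },
];

// Score = number of field values that occur verbatim (case-insensitive) in the text.
function scoreRecord(text, record) {
  const lower = text.toLowerCase();
  return Object.values(record).filter((value) =>
    lower.includes(String(value).toLowerCase())
  ).length;
}

// Drop records with no matching field and sort the rest, best score first.
function rankRecords(text, records) {
  return records
    .map((record) => ({ record, score: scoreRecord(text, record) }))
    .filter((entry) => entry.score > 0)
    .sort((a, b) => b.score - a.score);
}

const ocrText =
  "Harry Potter, born 25.03.1995, resident at Jahnstreet 43, London";
const ranked = rankRecords(ocrText, candidates);
console.log(ranked.map((e) => `${e.record.firstName} ${e.record.name}: ${e.score}`));
```

A record that shares only a birthday or a city with the text still appears in the list, but with a lower score than the record that matches on most fields.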

I think it depends on how bad the OCR is. If it is bad, a full-text index might help.

You could also try a string distance measure, using something like the natural module on npm.
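To show what a string distance buys you, here is a self-contained sketch of the Levenshtein (edit) distance. It is written by hand rather than calling the natural module, and the helper name `fuzzyContains` and the default threshold of one edit are assumptions chosen for the example.

```javascript
// Dynamic-programming Levenshtein distance: the minimum number of
// single-character insertions, deletions, and substitutions.
function levenshtein(a, b) {
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(
        prev[j] + 1,       // deletion
        curr[j - 1] + 1,   // insertion
        prev[j - 1] + cost // substitution (or match)
      );
    }
    prev = curr;
  }
  return prev[b.length];
}

// True if any word of the text is within maxDistance edits of the value.
// Only meaningful for single-word values such as a first name.
function fuzzyContains(text, value, maxDistance = 1) {
  const target = value.toLowerCase();
  return text
    .toLowerCase()
    .split(/[^\p{L}\p{N}]+/u)
    .some((word) => word.length > 0 && levenshtein(word, target) <= maxDistance);
}

console.log(levenshtein("Harly", "Harry")); // 1
console.log(fuzzyContains("Harly Potter, born 25.03.1995", "Harry")); // true
```

With a threshold of one edit, the OCR misreading Harly still matches Harry. The threshold is a trade-off: a larger value tolerates more OCR noise but also produces more false matches, especially for short names.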

+0

Thanks for your help!

+0

OK. I added another idea I just had.