Extracting text from MS Word files in Python

18

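You could make a subprocess call to antiword. Antiword is a Linux command-line utility for dumping the text out of a Word .doc file. It works pretty well for simple documents (obviously it loses the formatting). It is available through apt, probably as an RPM too, or you could compile it yourself.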
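A minimal sketch of such a subprocess call, assuming `antiword` is on the PATH and prints the extracted text to standard output. The function name `doc_to_text` and the `tool` parameter are illustrative, not part of any library. The output encoding depends on how antiword is configured, so undecodable bytes are replaced here rather than raising.

```python
import shutil
import subprocess


def doc_to_text(path, tool="antiword"):
    """Return the text of a .doc file, or None if the tool is not installed."""
    if shutil.which(tool) is None:
        return None
    # Pass the arguments as a list so paths with spaces need no shell quoting.
    result = subprocess.run([tool, path], stdout=subprocess.PIPE, check=True)
    # Assumption: the tool emits UTF-8 (or something close to it).
    return result.stdout.decode("utf-8", errors="replace")
```

Passing `tool="catdoc"` should behave the same way on systems where catdoc is installed instead, since it also writes the text to standard output.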

Source

2008-09-24 04:13:03

+0

antiword can also convert a Word document to DocBook XML, which preserves (at least some of) the formatting. – 2015-09-30 11:38:45

+0

If antiword is not available, `catdoc` works too. – Xiflado 2016-04-13 20:06:13

4

Take a look at how the doc format works and create word document using PHP in linux. The former is especially useful. Abiword is my recommended tool. There are limitations though:

However, if the document has complicated tables, text boxes, embedded spreadsheets, and so forth, then it might not work as expected. Developing good MS Word filters is a very difficult process, so please bear with us as we work on getting Word documents to open correctly. If you have a Word document which fails to load, please open a Bug and include the document so we can improve the importer.

Source

2008-09-24 03:17:26 Swati

+0

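Not only that, but even the most basic text saved in Word 97 format is nearly impossible to get at easily without relying on Word to do it for you (COM). Most Word documents are not HTML! – 2008-09-24 03:30:30

+0

Abiword does not treat it as an HTML document, and considering how widespread the tool is, I imagine it was not easy to implement. Abiword is the tool that helps you read MS Word files, and since the author only cares about text retrieval, that should be enough. – Swati 2008-09-24 03:42:19

+0

Ah, I always thought abiword was just another word processor! Man, that would have saved me a headache. – 2008-09-24 12:11:05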

3

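I am not sure you will have much luck without using COM. The .doc format is ridiculously complex, and is often called a "memory dump" of Word at the time of saving!

@Swati: that is in HTML, which is fine and dandy, but most Word documents are not so nice!

Source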

2008-09-24 03:19:53

10

OpenOffice.org can be scripted with Python: see here.

Since OOo can load most MS Word files flawlessly, I would say that is your best bet.

Source

2008-09-24 03:23:42

+10

Not flawless. Close, but in my experience (OO 2.0 - 3.0) it is far from perfect. – SpliFF 2009-05-26 15:17:30

+4

As well as MS Word N+1 opens MS Word N files, and better than MS Word N+1 opens MS Word N-1 files, IMHO – voyager 2009-09-29 14:50:55

5

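I know this is an old question, but I was recently trying to find a way to extract text from MS Word files, and the best solution by far that I found was wvLib:

http://wvware.sourceforge.net/

After installing the library, using it from Python is pretty easy:

# Python 2 only: the commands module was removed in Python 3
import commands

# word_file and output_txt_file are paths you define yourself
exe = 'wvText ' + word_file + ' ' + output_txt_file
out = commands.getoutput(exe)
exe = 'cat ' + output_txt_file
out = commands.getoutput(exe)

And that's it. Pretty much, we are using the commands.getoutput function to run a couple of shell commands, namely wvText (which extracts the text from a Word document) and cat (to read the output file). After that, the entire text of the Word document is in the out variable, ready to use.

Hopefully this will help anyone who has a similar problem in the future.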
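The `commands` module exists only in Python 2. A rough Python 3 sketch of the same two steps follows, assuming `wvText` takes the input path and the output path as its two arguments, exactly as in the snippet above. The name `wv_extract`, the `tool` parameter and the temporary file name are made up for illustration, and the UTF-8 decoding is an assumption about what wvText writes.

```python
import os
import shutil
import subprocess
import tempfile


def wv_extract(word_file, tool="wvText"):
    """Run wvText on word_file and return the text it wrote, or None if it is missing."""
    if shutil.which(tool) is None:
        return None
    with tempfile.TemporaryDirectory() as tmp_dir:
        output_txt_file = os.path.join(tmp_dir, "output.txt")
        # Argument list instead of string concatenation: no quoting problems.
        subprocess.run([tool, word_file, output_txt_file], check=True)
        # Read the file directly instead of shelling out to cat.
        with open(output_txt_file, encoding="utf-8", errors="replace") as handle:
            return handle.read()
```

Reading the output file in Python avoids the second process that the `cat` call needs.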

Source

2009-01-01 01:14:38 David

4

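(Note: I posted this on this question as well, but it seems relevant here, so please excuse the repost.)

Now, this is pretty ugly and pretty hacky, but it seems to work for me for basic text extraction. Obviously, to use this in a Qt program you would have to spawn a process for it and so on, but the command line I have hacked together is: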

unzip -p file.docx | grep '<w:t' | sed 's/<[^<]*>//g' | grep -v '^[[:space:]]*$'

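So that's:

unzip -p file.docx: -p == "unzip to stdout"

grep '<w:t': grab just the lines containing '<w:t' (<w:t> is the Word 2007 XML element for "text", as far as I can tell)

sed 's/<[^<]*>//g': remove everything inside tags

grep -v '^[[:space:]]*$': remove blank lines

There is probably a more efficient way to do this, but it seems to work on the few documents I have tested it with.

As far as I know, unzip, grep and sed all have ports for Windows and for any of the Unixes, so it should be reasonably cross-platform, despite being a bit of an ugly hack ;)

Source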

2009-08-11 05:38:51

4

If your intention is to use purely Python modules without calling a subprocess, you can use the zipfile Python module.

import zipfile

content = ""
# Load the DocX into a zipfile
docx = zipfile.ZipFile('/home/whateverdocument.docx')
# List the members of the archive
unpacked = docx.infolist()
# Find the word/document.xml file in the package and read it into a string
for item in unpacked:
    if item.orig_filename == 'word/document.xml':
        # read() returns bytes; decode them so the string cleanup works (assumes UTF-8)
        content = docx.read(item.orig_filename).decode('utf-8')

Your content string needs to be cleaned up, however. One way of doing this is:

# Clean the content string from xml tags for better search
fullyclean = []
halfclean = content.split('<')
for item in halfclean:
    if '>' in item:
        bad_good = item.split('>')
        if bad_good[-1] != '':
            fullyclean.append(bad_good[-1])

# Assemble a new string with all pure content
content = " ".join(fullyclean)

But there is surely a more elegant way to clean up the string, probably using the re module. Hope this helps.
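One alternative to splitting strings by hand is to let the standard library's XML parser find the text runs. The sketch below is only an illustration, not the answer author's method: `docx_paragraphs` is a made-up name, and it assumes the body text lives in `<w:t>` elements inside `<w:p>` paragraphs of `word/document.xml`. Paragraphs nested inside other paragraphs (text boxes, for example) would be reported twice.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the w: prefix in document.xml
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_paragraphs(source):
    """Return a list with the text of each <w:p> paragraph in a .docx file."""
    with zipfile.ZipFile(source) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter("{%s}p" % W_NS):
        # A paragraph is split into runs; join the text of every <w:t> in it.
        runs = [node.text or "" for node in para.iter("{%s}t" % W_NS)]
        paragraphs.append("".join(runs))
    return paragraphs
```

`source` can be a path or a file-like object, so `"\n".join(docx_paragraphs('/home/whateverdocument.docx'))` would give one line per paragraph.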

Source

2009-11-12 16:18:16 benjamin

11

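benjamin's answer is a pretty good one. I have just consolidated it...

import re
import zipfile

docx = zipfile.ZipFile('/path/to/file/mydocument.docx')
# read() returns bytes; decode before applying a str pattern (assumes UTF-8)
content = docx.read('word/document.xml').decode('utf-8')
cleaned = re.sub('<(.|\n)*?>', '', content)
print(cleaned)

Source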

2009-12-28 03:39:54 Chad

29

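Use the native Python docx module. Here is how to extract all the text from a document:

import docx

filename = 'document.docx'  # path to the file to read
document = docx.Document(filename)
# paragraph.text is already a str, so no encoding step is needed before joining
docText = '\n\n'.join(
    paragraph.text for paragraph in document.paragraphs
)
print(docText)

See the Python DocX site.

Also check out Textract, which pulls out tables and so on.

Parsing XML with regexes invokes Cthulhu. Don't do it!

Source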

2009-12-30 12:17:09 mikemaccana

3

Unoconv might also be a good alternative: http://linux.die.net/man/1/unoconv

Source

2012-05-16 11:35:57 fccoelho

1

Just another option for reading "doc" files without using COM: miette. It should work on any platform.

Source

2013-02-12 09:25:30 alecxe

1

If you have LibreOffice installed, you can simply call it from the command line to convert the file to text, then load the text into Python.
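A minimal sketch of that approach, under a few assumptions: the LibreOffice binary is reachable as `soffice` (on some systems it is `libreoffice`), the `txt:Text` filter is used for plain text, and the converted file is decoded as UTF-8, which may not match the filter's default on every system. The function names are invented for this example.

```python
import os
import shutil
import subprocess
import tempfile


def build_convert_command(path, out_dir, binary="soffice"):
    """Build the LibreOffice command line that converts path to plain text in out_dir."""
    return [binary, "--headless", "--convert-to", "txt:Text", "--outdir", out_dir, path]


def libreoffice_to_text(path, binary="soffice"):
    """Convert path with LibreOffice and return the text, or None if the binary is missing."""
    if shutil.which(binary) is None:
        return None
    with tempfile.TemporaryDirectory() as out_dir:
        subprocess.run(build_convert_command(path, out_dir, binary), check=True)
        # LibreOffice names the result after the input file, with a .txt extension.
        base = os.path.splitext(os.path.basename(path))[0]
        txt_path = os.path.join(out_dir, base + ".txt")
        with open(txt_path, encoding="utf-8", errors="replace") as handle:
            return handle.read()
```

Converting into a temporary directory keeps the generated .txt file from landing next to the original document.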

Source

2015-05-08 11:31:23 markling

1

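Is this an old question? I believe that such a thing does not exist. There are only answered and unanswered questions. This one is pretty much unanswered, or half answered if you wish. Methods for reading *.docx (MS Word 2007 and later) documents without using COM interop are all covered. But a method for extracting text from *.doc (MS Word 97-2000) using Python only is missing. Is this complicated? To do: not really. To understand: well, that is another thing.

When I did not find any finished code, I read some format specifications and dug out some algorithms proposed in other languages.

An MS Word (*.doc) file is an OLE2 compound file. Without bothering you with a lot of unnecessary details, think of it as a file system stored in a file. It actually uses a FAT structure, so the definition holds. (Hm, maybe you can loop-mount it on Linux?) This way you can store more files within one file, such as pictures. The same is done in *.docx by using a ZIP archive instead. There are packages available on PyPI for reading OLE files (olefile, compoundfiles, ...). I used the compoundfiles package to open the *.doc file. However, in MS Word 97-2000 the internal subfiles are not XML or HTML, but binary files. And as if that were not enough, each one contains information about another, so you have to read at least two of them and unravel the stored information accordingly. To understand this fully, read the PDF document from which I took the algorithm.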
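The container difference described above can be checked from the first few bytes of a file. The following is a small sketch, not part of the answer's module: `sniff_word_container` is a hypothetical helper that compares the leading bytes with the OLE2 compound file signature and the ZIP local-file-header signature. Other OLE2 files (old Excel workbooks, for instance) carry the same signature, so this identifies the container, not Word itself.

```python
OLE2_MAGIC = bytes.fromhex("d0cf11e0a1b11ae1")  # compound file header signature
ZIP_MAGIC = b"PK\x03\x04"  # local file header of the first ZIP entry


def sniff_word_container(data):
    """Guess the container format from the first bytes of a file."""
    if data.startswith(OLE2_MAGIC):
        return "ole2"  # binary .doc (Word 97-2003) and other OLE2 documents
    if data.startswith(ZIP_MAGIC):
        return "zip"  # .docx (Word 2007 and later) and other ZIP-based files
    return "unknown"
```

In practice `data` would be the result of reading the first eight bytes of the file in binary mode.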

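The code below was written in a hurry and tested on a small number of files. As far as I can see, it works as intended. Sometimes some gibberish appears, almost always at the end of the text. There can also be some odd characters in the middle.

Those of you who just want to search the text will be happy. Still, I urge anyone who can help improve this code to do so.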


doc2text module:
"""
This is a Python implementation of the C# algorithm proposed in:
http://b2xtranslator.sourceforge.net/howtos/How_to_retrieve_text_from_a_binary_doc_file.pdf

Python implementation author is Dalen Bernaca.
Code needs refining and probably bug fixing!
As I am not a C# expert I would like some code rechecks by one.
Parts of which I am uncertain are:
    * Did the author of the original algorithm use uint32 and int32 correctly when unpacking?
      I copied each occurrence as in the original algorithm.
    * Is the FIB length for MS Word 97 1472 bytes as in MS Word 2000, and would it make any difference if it is not?
    * Did I interpret each C# command correctly?
      I think I did!
"""

from compoundfiles import CompoundFileReader, CompoundFileError
from struct import unpack

__all__ = ["doc2text"]

def doc2text(path):
    text = ""
    cr = CompoundFileReader(path)
    # Load WordDocument stream:
    try:
        f = cr.open("WordDocument")
        doc = f.read()
        f.close()
    except Exception:
        cr.close()
        raise CompoundFileError("The file is corrupted or it is not a Word document at all.")
    # Extract file information block and piece table stream information from it.
    # "<L" and "<l" force 4-byte little-endian fields; a bare "L" is 8 bytes on some platforms.
    fib = doc[:1472]
    fcClx = unpack("<L", fib[0x01a2:0x01a6])[0]
    lcbClx = unpack("<L", fib[0x01a6:0x01a6+4])[0]
    tableFlag = (unpack("<L", fib[0x000a:0x000a+4])[0] & 0x0200) == 0x0200
    tableName = ("0Table", "1Table")[tableFlag]
    # Load piece table stream:
    try:
        f = cr.open(tableName)
        table = f.read()
        f.close()
    except Exception:
        cr.close()
        raise CompoundFileError("The file is corrupt. '%s' piece table stream is missing." % tableName)
    cr.close()
    # Find piece table inside a table stream:
    clx = table[fcClx:fcClx+lcbClx]
    pos = 0
    pieceTable = b""
    lcbPieceTable = 0
    while True:
        # Slicing gives bytes on both Python 2 and 3 and never raises IndexError
        if clx[pos:pos+1] == b"\x02":
            # This is the piece table, we store it:
            lcbPieceTable = unpack("<l", clx[pos+1:pos+5])[0]
            pieceTable = clx[pos+5:pos+5+lcbPieceTable]
            break
        elif clx[pos:pos+1] == b"\x01":
            # This is the beginning of some other substructure, we skip it:
            pos = pos+1+1+ord(clx[pos+1:pos+2])
        else:
            break
    if not pieceTable:
        raise CompoundFileError("The file is corrupt. Cannot locate a piece table.")
    # Read info from pieceTable about each piece and extract it from the WordDocument stream:
    pieceCount = (lcbPieceTable-4)//12
    for x in range(pieceCount):
        cpStart = unpack("<l", pieceTable[x*4:x*4+4])[0]
        cpEnd = unpack("<l", pieceTable[(x+1)*4:(x+1)*4+4])[0]
        ofsetDescriptor = ((pieceCount+1)*4)+(x*8)
        pieceDescriptor = pieceTable[ofsetDescriptor:ofsetDescriptor+8]
        fcValue = unpack("<L", pieceDescriptor[2:6])[0]
        isANSII = (fcValue & 0x40000000) == 0x40000000
        fc = fcValue & 0xbfffffff
        cb = cpEnd-cpStart
        enc = ("utf-16", "cp1252")[isANSII]
        cb = (cb*2, cb)[isANSII]
        text += doc[fc:fc+cb].decode(enc, "ignore")
    return "\n".join(text.splitlines())
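The module above reads several 32-bit fields out of the file information block with struct. The sketch below isolates that step; `read_u32` and `table_stream_name` are hypothetical helper names, and the offset 0x000A and mask 0x0200 are simply the values used in the code above. An explicit `<` prefix is used so that the field is always four little-endian bytes, whatever the platform's native `long` size is.

```python
from struct import unpack_from


def read_u32(buf, offset):
    """Read an unsigned 32-bit little-endian integer from buf at offset."""
    return unpack_from("<L", buf, offset)[0]


def table_stream_name(fib):
    """Pick the table stream name from bit 0x0200 of the field at offset 0x000A."""
    return "1Table" if read_u32(fib, 0x000A) & 0x0200 else "0Table"
```

Helpers like these keep the offsets in one place and fail with a clear struct.error when the buffer is too short.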

Source

2015-06-01 20:20:14 Dalen

2

To read Word 2007 and later files, including .docx files, you can use the python-docx package:

from docx import Document 
document = Document('existing-document-file.docx') 
document.save('new-file-name.docx')

To read .doc files from Word 2003 and earlier, make a subprocess call to antiword. You need to install antiword first:

sudo apt-get install antiword

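Then just call it from your Python script:

import subprocess
input_word_file = "input_file.doc"
output_text_file = "output_file.txt"
# An argument list avoids shell quoting problems with spaces in file names
with open(output_text_file, "wb") as out:
    subprocess.call(["antiword", input_word_file], stdout=out)

Source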

2016-08-03 18:17:37
