Unicode strings in a file contain different

My system is Fedora. For some reason, the last field of each record is a Unicode string (the data is copied with memcpy from a guest machine running in qemu). The Unicode string is a Windows registry key name.

smss.exe|NtOpenKey|304|4|4|0|\^@R^@e^@g^@i^@s^@t^@r^@y^@\^@M^@a^@c^@h^@i^@n^@e^@\^@S^@y^@s^@t^@e^@m^@\^@C^@u^@r^@r^@e^@n^@t^@C^@o^@n^@t^@r^@o^@l^@S^@e^@t^@\^@C^@o^@n^@t^@r^@o^@l^@\^@S^@e^@s^@s^@i^@o^@n^@ ^@M^@a^@n^@a^@g^@e^@r^@
smss.exe|NtClose|304|4|4|0|
System|NtOpenKey|4|0|2147484532|0|\^@R^@e^@g^@i^@s^@t^@r^@y^@\^@M^@a^@c^@h^@i^@n^@e^@\^@S^@y^@s^@t^@e^@m^@\^@C^@u^@r^@r^@e^@n^@t^@C^@o^@n^@t^@r^@o^@l^@S^@e^@t^@
services.exe|NtOpenKey|680|624|636|0|\^@R^@E^@G^@I^@S^@T^@R^@Y^@\^@M^@A^@C^@H^@I^@N^@E^@\^@S^@y^@s^@t^@e^@m^@\^@C^@u^@r^@r^@e^@n^@t^@C^@o^@n^@t^@r^@o^@l^@S^@e^@t^@\^@S^@e^@r^@v^@i^@c^@e^@s^@

Here is a hex dump of part of the file. '|' is the field separator. The first six fields are ASCII strings. The last field is a Windows Unicode string (I think it is UTF-16).

0000000 6d73 7373 652e 6578 4e7c 4f74 6570 4b6e
0000010 7965 337c 3430 347c 347c 307c 5c7c 5200
0000020 6500 6700 6900 7300 7400 7200 7900 5c00
0000030 4d00 6100 6300 6800 6900 6e00 6500 5c00
0000040 5300 7900 7300 7400 6500 6d00 5c00 4300
0000050 7500 7200 7200 6500 6e00 7400 4300 6f00
0000060 6e00 7400 7200 6f00 6c00 5300 6500 7400
0000070 5c00 4300 6f00 6e00 7400 7200 6f00 6c00
0000080 5c00 5300 6500 7300 7300 6900 6f00 6e00
0000090 2000 4d00 6100 6e00 6100 6700 6500 7200

I want to parse it with Python and insert it into a database. Here is how I process it:

def parsecreate(filename): 
    sourcefile = codecs.open("data.db",mode="r",encoding='utf-8') 
    cx = sqlite3.connect("sqlite.db") 
    cu = cx.cursor() 
    cu.execute("create table data(id integer primary key,command text, ntfunc text, pid text, ppid text, handle text, roothandle text, genevalue text)") 
    eachline = [] 
    for lines in sourcefile: 
     eachline = lines.split('|') 
     eachline[-1] = eachline[-1].strip('\n') 
     eachline[-1] = eachline[-1].decode('utf-8') 

     cu.execute("insert into data(command,ntfunc,pid,ppid,handle,roothandle,genevalue) values(?,?,?,?,?,?,?)",(eachline[0],eachline[1],eachline[2],eachline[3],eachline[4],eachline[5],eachline[-1])) 

    cx.commit() 
    cx.close()

I get this error:

File "./parse1.py", line 18, in parsecreate
    for lines in sourcefile:
File "/usr/lib/python2.7/codecs.py", line 684, in next
    return self.reader.next()
File "/usr/lib/python2.7/codecs.py", line 615, in next
    line = self.readline()
File "/usr/lib/python2.7/codecs.py", line 530, in readline
    data = self.read(readsize, firstline=True)
File "/usr/lib/python2.7/codecs.py", line 477, in read
    newchars, decodedbytes = self.decode(data, self.errors)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xd0 in position 51: invalid continuation byte

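I assume this is because the Unicode string can contain a byte that the UTF-8 decoder does not recognize. How do I read the last field correctly?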

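Put simply: the file contains a UTF-16-encoded Unicode string, so how do I get that field inserted into the database correctly? Python reads the file using a single encoding. Can I read the raw bytes instead and then assemble those bytes into a unicode string?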

Source

2012-04-14 jiamo

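Your data file is not a plain text file, so open it in binary mode and decode the text field explicitly. I had to manipulate the data you posted to get what I believe is the original binary data. It looks like the original data may have been a sqlite3.exe dump similar to my final output below, except that the last field was stored as a UTF-16-encoded BLOB rather than TEXT.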

Note that splitting by line and by '|' can run into problems if the UTF-16 data contains bytes that represent '\n' or '|', but I will ignore that for now.
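The '|' half of that caveat can be reduced by splitting on only the first six separators, so that any 0x7C byte inside the UTF-16 field stays in the last element. The following is a minimal sketch, not part of the tested code in this answer. It assumes the first six fields are ASCII and never contain '|' themselves; `parse_record` and the sample key name are made up for illustration.

```python
def parse_record(record):
    """Split one raw record (bytes) into six ASCII fields plus a UTF-16LE name."""
    # maxsplit=6: only the first six '|' bytes are treated as separators,
    # so a 0x7C byte inside the UTF-16 data is left alone.
    fields = record.split(b'|', 6)
    if len(fields) != 7:
        raise ValueError('expected 7 fields, got %d' % len(fields))
    head = [f.decode('ascii') for f in fields[:6]]
    return head + [fields[6].decode('utf-16-le')]


# Hypothetical record whose key name contains a '|' character.
key = u'\\Registry\\A|B'.encode('utf-16-le')
record = b'smss.exe|NtOpenKey|304|4|4|0|' + key
parsed = parse_record(record)
```

This does nothing about 0x0A bytes inside the field: iterating over the file by line would still cut such a record in two.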

Here is my test:

from binascii import unhexlify 
import sqlite3 

data = unhexlify('''\
6d73 7373 652e 6578 4e7c 4f74 6570 4b6e 
7965 337c 3430 347c 347c 307c 5c7c 5200 
6500 6700 6900 7300 7400 7200 7900 5c00 
4d00 6100 6300 6800 6900 6e00 6500 5c00 
5300 7900 7300 7400 6500 6d00 5c00 4300 
7500 7200 7200 6500 6e00 7400 4300 6f00 
6e00 7400 7200 6f00 6c00 5300 6500 7400 
5c00 4300 6f00 6e00 7400 7200 6f00 6c00 
5c00 5300 6500 7300 7300 6900 6f00 6e00 
2000 4d00 6100 6e00 6100 6700 6500 7200'''.replace(' ','').replace('\n','')) 

# OP's data dump must have been decoded from the original data 
# as little-endian words, and is missing a final 0x00 byte. 
# Byte-swapping and adding missing zero byte to get back what 
# was likely the original binary data. 
data = ''.join(a+b for a,b in zip(data[1::2],data[::2])) + '\x00' 

with open('data.db','wb') as f: 
    f.write(data) 

def parsecreate(filename):
    with open(filename,'rb') as sourcefile:
        with sqlite3.connect("sqlite.db") as cx:
            cu = cx.cursor()
            cu.execute("create table data(id integer primary key,command text, ntfunc text, pid text, ppid text, handle text, roothandle text, genevalue text)")
            eachline = []
            for line in sourcefile:
                eachline = line.split('|')
                eachline[-1] = eachline[-1].decode('utf-16le')
                cu.execute("insert into data(command,ntfunc,pid,ppid,handle,roothandle,genevalue) values(?,?,?,?,?,?,?)",(eachline[0],eachline[1],eachline[2],eachline[3],eachline[4],eachline[5],eachline[-1]))

parsecreate('data.db')

Output:

C:\>sqlite3 sqlite.db 
SQLite version 3.7.9 2011-11-01 00:52:41 
Enter ".help" for instructions 
Enter SQL statements terminated with a ";" 
sqlite> select * from data; 
1|smss.exe|NtOpenKey|304|4|4|0|\Registry\Machine\System\CurrentControlSet\Control\Session Manager

Source

2012-04-14 13:50:13

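Thank you very much. I have only just got home, so I will test it tomorrow. I can see two differences: 1. the file is read in 'b' mode; 2. since the last field is already in its final form, it only needs to be decoded to a unicode string. By the way, the data.db file you wrote is "mssse.exN|OtepKnye3|404|4|0|\|a string". I think the problem is that I copied only a small part of the file. It should look like "**.exe|NtOpenKey|**". – jiamo 2012-04-14 15:46:44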

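Yes, it would be best to have the original raw data, or at least a dump as bytes rather than what I suspect are little-endian words. I have updated my answer to unscramble your data. – 2012-04-14 17:16:51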

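I have a problem handling '\n'. If I do not write a '\n' at the end of each record, how can I read the file one line at a time? If I do write a '\n', then: 1. without `eachline[-1] = eachline[-1].strip('\n')` I get "UnicodeDecodeError: 'utf16' codec can't decode byte 0x0a in position 132: truncated data"; 2. with `eachline[-1] = eachline[-1].strip('\n')` I wonder whether it could remove a byte that belongs to the unicode string. – jiamo 2012-04-15 06:11:57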
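On the second point in that last comment, `strip('\n')` removes every leading and trailing 0x0A byte, so it can eat a byte that belongs to the final UTF-16 code unit. The sketch below illustrates this under stated assumptions: `drop_line_feed` is a hypothetical helper, and the key name is invented so that its last character (U+0A41) encodes as the bytes 41 0A in UTF-16LE.

```python
def drop_line_feed(raw):
    """Remove exactly one trailing line-feed byte, if there is one."""
    return raw[:-1] if raw.endswith(b'\n') else raw


# Hypothetical name ending in U+0A41, which is bytes 41 0A in UTF-16LE.
name = u'\\Registry\\Test\u0a41'
field = name.encode('utf-16-le') + b'\n'   # last field plus the record's '\n'

stripped = field.strip(b'\n')      # removes both trailing 0x0A bytes
trimmed = drop_line_feed(field)    # removes only the record terminator

try:
    stripped.decode('utf-16-le')
    strip_ok = True
except UnicodeDecodeError:
    # An odd number of bytes is left, so the decoder reports truncated data.
    strip_ok = False
```

Removing a single byte only fixes the stripping step. Reading the file with `for line in sourcefile` still splits at any 0x0A byte inside the UTF-16 data, so a format that stores the field length, or keeps the name as a BLOB, would probably be more robust.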
