如何在文件中密集存储大量数字？

我需要存储和处理大量的非常长的数字，范围从0到f 64倍（ffffffffff ..... ffff）。如何在文件中密集存储大量数字？

如果我在文件中存储这些数据，我需要为每个字符（数字）用于\ N个码元=高达66个字节1个字节+ 2个字节。但是，为了表示所有可能的数字，我们不需要多于34个字节（4位表示从0到f的数字，因此4 [位] * 64 [十六进制数量]/8 [位a的字节] = 32字节+ \ n ，当然）。

有什么办法来存储数量，而不消耗过多的内存？

到目前为止，我已经创建了十六进制（每个符号16位数）的转换为基数为76（十六进制+所有字母和其他符号）的数字，这将数字的大小减少到41 + 2个字节。

来源

2017-08-07 Devilhunter

无关记：'\ n'只需要一个字节。在Windows上，行尾是'\ r \ n'，它需要两个字节。 – wecsam

你有没有考虑将文件存储为二进制文件？ –

相关备注：您可能想了解MIDI格式如何存储数字。基本上，有一个字节数组，但只使用每个字节的低七位。这些七位段被连接在一起形成大整数。除了最后一个字节以外的所有字节的最高位为0（或者也许是另一种方式......我不记得了）。全部0的所有连续字节从左侧省略。 – wecsam

您正在试图存储32个字节长。为什么不把它们存储为二进制数字？这样你只需要存储每个数字32个字节而不是41个或其他。您可以添加各种准压缩方案，以利用诸如大多数数字少于32个字节的情况。

如果您的号码是一个字符串，将其转换为int第一。 Python3 int s为基本无限的精度，这样你就不会丢失任何信息：

>>> num = '113AB87C877AAE3790' 
>>> num = int(num, 16) 
>>> num 
317825918024297625488

现在你可以将结果转换为字节数组，并将其写入一个文件打开二进制写作：

with open('output.bin', 'wb') as file: 
    file.write(num.to_bytes(32, byteorder='big'))

int方法to_bytes将您的号码转换为可放置在文件中的一串字节。您需要指定字符串长度和顺序。 'big'可以更容易地读取文件的十六进制转储。

读取文件回来了，它使用int.from_bytes以类似的方式进行解码：

with open('output.bin', 'rb') as file: 
    bytes = file.read(32) 
    num = int.from_bytes(bytes, byteorder='big')

记住，总是包含在文件模式b，或者如果您尝试读取你可能会遇到意想不到的问题或写入代码为\n的数据。

无论是读取和写入操作可以循环作为理所当然的事。

来源

2017-08-07 19:56:59

不知道更多的关于数字将是每数256个比特（32个字节）的最密集的可能的方式。

您可以将它们一个接一个地存储起来。

一个函数写一个文件可能是这样的：

def write_numbers(numbers, file): 
    for n in numbers: 
     file.write(n.to_bytes(32, 'big')) 

with open('file_name', 'wb') as f: 
    write_numbers(get_numbers(), f)

，并读取数据，你可以做这样的功能：

def read_numbers(file): 
    while True: 
     read = file.read(32) 
     if not read: 
      break 
     yield int.from_bytes(read, 'big') 

with open('file_name', 'rb') as f: 
    for n in read_numbers(f): 
     do_stuff(n)

来源

2017-08-07 19:46:27 Artyer

如果您预计存储的偶数分布，然后看看疯狂的物理学家的回答。但是，如果您预计主要存储小数字，但需要存储少量大数字，那么这些方案可能也很有用。

如果只需要以考虑是255首或更少个字节（2040位或更少位）长，然后简单地转换int为bytes对象并存储长度中的附加字节，像这样的整数：

# This was only tested with non-negative integers! 
def encode(num): 
    assert isinstance(num, int) 
    # Convert the number to a byte array and strip away leading null bytes. 
    # You can also use byteorder="little" and rstrip. 
    # If the integer does not fit into 255 bytes, an OverflowError will be raised. 
    encoded = num.to_bytes(255, byteorder="big").lstrip(b'\0') 
    # Return the length of the integer in the first byte, followed by the encoded integer. 
    return bytes([len(encoded)]) + encoded 
def encode_many(nums): 
    return b''.join(encode(num) for num in nums) 
def decode_many(byte_array): 
    assert isinstance(byte_array, bytes) 
    result = [] 
    start = 0 
    while start < len(byte_array): 
     # The first byte contains the length of the integer. 
     int_length = byte_array[start] 
     # Read int_length bytes and decode them as int. 
     new_int = int.from_bytes(byte_array[(start+1):(start+int_length+1)], byteorder="big") 
     # Add the new integer to the result list. 
     result.append(new_int) 
     start += int_length + 1 
    return result

要存储（实际上）无限长度的整数，您可以使用此方案，基于MIDI文件格式中的可变长度量。首先，规则：

一个字节有8位（对于那些谁不知道）。
在除最后一个以外的每个字节中，最左边的位（最高位位）将为1。

当连接在一起时，每个字节中的较低七位（即除最左边的位以外的所有位）形成具有可变数量位的整数。

这里有几个例子：

0二进制是00000000。它可以用一个字节表示而不用修改为00000000。

127二进制是01111111。它可以表示为一个字节而不需要修改为01111111。

128二进制是10000000。必须将其转换为两字节表示形式：10000001 00000000。让我们打破下来：

第一个字节的最左边位为1，这意味着它是不最后一个字节。

第二个字节中最左边的位是0，这意味着它是最后一个字节。

第一个字节的低七位是0000001，第二个字节的低七位是0000000。这些串联在一起，你会得到00000010000000，这是128

173249806138790二进制是100111011001000111011101001001101111110110100110。

存储它：

首先，分割二进制数到的七个比特组：0100111 0110010 0011101 1101001 0011011 1111011 0100110（领先0加入）

然后，在每个字节前面添加一个1除了最后，这得到了0：10100111 10110010 10011101 11101001 10011011 11111011 00100110

找回它：

首先，删除每个字节的第一位：0100111 0110010 0011101 1101001 0011011 1111011 0100110

您剩下一个由7位段组成的数组。将它们加在一起：100111011001000111011101001001101111110110100110

当它被转换成十进制时，你得到173,249,806,138,790。

为什么，你问，我们让每个号码0的最后一个字节最左边的位？那么，这样做可以让你连接多个数字而不用换行符。将数字写入文件时，只需将它们一个接一个地写出来。当从文件中读取数字时，使用一个循环构建一个整数数组，每当它检测到最左边的位为0的字节时结束每个整数。

这里有两个功能，encode和decode，这int和bytes之间的转换在Python 3

# Important! These methods only work with non-negative integers! def encode(num): assert isinstance(num, int) # If the number is 0, then just return a single null byte. if num <= 0: return b'\0' # Otherwise... result_bytes_reversed = [] while num > 0: # Find the right-most seven bits in the integer. current_seven_bit_segment = num & 0b1111111 # Change the left-most bit to a 1. current_seven_bit_segment |= 0b10000000 # Add that to the result array. result_bytes_reversed.append(current_seven_bit_segment) # Chop off the right-most seven bits. num = num >> 7 # Change the left-most bit in the lowest-order byte (which is first in the list) back to a 0. result_bytes_reversed[0] &= 0b1111111 # Un-reverse the order of the bytes and convert the list into a byte string. return bytes(reversed(result_bytes_reversed)) def decode(byte_array): assert isinstance(byte_array, bytes) result = 0 for part in byte_array: # Shift the result over by seven bits. result = result << 7 # Add in the right-most seven bits from this part. result |= (part & 0b1111111) return result

这里有两个函数与int小号列出工作：

def encode_many(nums): return [encode(num) for num in nums] def decode_many(byte_array): parts = [] # Split the byte array after each byte where the left-most bit is 0. start = 0 for i, b in enumerate(byte_array): # Check whether the left-most bit in this byte is 0. if not (b & 0b10000000): # Copy everything up to here into a new part. parts.append(byte_array[start:(i+1)]) start = i + 1 return [decode(part) for part in parts]

来源

2017-08-07 20:00:25 wecsam

问题有[python]标签。没有Python代码？ – gboffi

你最好只在一个单独的字节中编码长度。平均情况下，从每个字节取一位需要2个字节，而任何数目最多为32的位都可以很容易地放入5位。 –

@gboffi我在Python 3中编写了一些示例代码，并将其添加到我的答案中。 – wecsam

如何在文件中密集存储大量数字？

回答

相关问题