Reading large CSV files and feeding them into TensorFlow

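I want to read my CSV files into Python, split the data into training and test sets (n-fold cross-validation), and then feed it into a deep learning architecture I have already built. However, I have a few questions after reading the TensorFlow tutorial on how to read CSV files, shown here: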

import tensorflow as tf

filename_queue = tf.train.string_input_producer(["file0.csv", "file1.csv"])

reader = tf.TextLineReader() 
key, value = reader.read(filename_queue) 

# Default values, in case of empty columns. Also specifies the type of the 
# decoded result. 
record_defaults = [[1], [1], [1], [1], [1]] 
col1, col2, col3, col4, col5 = tf.decode_csv(
    value, record_defaults=record_defaults) 
# Note: tf.pack was renamed to tf.stack in TensorFlow 1.0.
features = tf.pack([col1, col2, col3, col4])

with tf.Session() as sess: 
    # Start populating the filename queue. 
    coord = tf.train.Coordinator() 
    threads = tf.train.start_queue_runners(coord=coord) 

    for i in range(1200):
        # Retrieve a single instance:
        example, label = sess.run([features, col5])

    coord.request_stop() 
    coord.join(threads)

Everything in this code makes sense except the part at the end with the for loop.

Question 1: What is the significance of the 1200 in the for loop? Is it the number of records in the data?

The next part of the tutorial talks about batching examples, with the following code:

def read_my_file_format(filename_queue): 
    reader = tf.SomeReader() 
    key, record_string = reader.read(filename_queue) 
    example, label = tf.some_decoder(record_string) 
    processed_example = some_processing(example) 
    return processed_example, label 

def input_pipeline(filenames, batch_size, num_epochs=None): 
    filename_queue = tf.train.string_input_producer(
     filenames, num_epochs=num_epochs, shuffle=True) 
    example, label = read_my_file_format(filename_queue) 
    # min_after_dequeue defines how big a buffer we will randomly sample
    # from -- bigger means better shuffling but slower start up and more
    # memory used.
    # capacity must be larger than min_after_dequeue and the amount larger
    # determines the maximum we will prefetch. Recommendation:
    # min_after_dequeue + (num_threads + a small safety margin) * batch_size
    min_after_dequeue = 10000 
    capacity = min_after_dequeue + 3 * batch_size 
    example_batch, label_batch = tf.train.shuffle_batch(
     [example, label], batch_size=batch_size, capacity=capacity, 
     min_after_dequeue=min_after_dequeue) 
    return example_batch, label_batch

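I understand that this is asynchronous and that the code blocks until it has received everything. When I inspect the values of example and label after the code runs, I see that each holds the information of only one specific record in the data.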

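Question 2: Is the code under "read_my_file" supposed to be the same as the first code block I posted? And does the input_pipeline function then batch the individual records together into groups of batch_size? If the read_my_file function is the same as the first code block, why doesn't it have the same for loop? (This goes back to my first question.)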

I would appreciate any clarification, as this is my first time using TensorFlow. Thanks for your help!

Source: 2016-06-10, Chandra_Rathnam

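(1) The 1200 is arbitrary. We should fix the example to use a named constant there to make that clearer. Thanks for spotting it. :) With the way the CSV reading example is set up, continued reads will go through the two CSV files as many times as desired (the string_input_producer holding the filenames is not given a num_epochs argument, so it defaults to cycling forever). So 1200 is just the number of records the programmer chose to retrieve in the example.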
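To make the cycling behaviour concrete, here is a plain-Python analogy rather than TensorFlow code. The helper `record_stream` and the in-memory file contents are hypothetical. It only mimics how a filename producer without `num_epochs` keeps looping, so asking for more records than exist simply wraps around, while a producer limited to a fixed number of epochs eventually runs dry.

```python
import csv
import io
import itertools

# Hypothetical stand-ins for file0.csv and file1.csv (5 integer columns each).
FILES = {
    "file0.csv": "1,2,3,4,0\n5,6,7,8,1\n",
    "file1.csv": "9,10,11,12,0\n",
}


def record_stream(filenames, num_epochs=None):
    """Yield (features, label) rows; loop forever when num_epochs is None."""
    epochs = itertools.count() if num_epochs is None else range(num_epochs)
    for _ in epochs:
        for name in filenames:
            for row in csv.reader(io.StringIO(FILES[name])):
                values = [int(v) for v in row]
                yield values[:4], values[4]


# Unbounded: 7 reads succeed although only 3 records exist (they repeat).
endless = record_stream(["file0.csv", "file1.csv"])
seven = [next(endless) for _ in range(7)]

# Bounded to one epoch: the stream stops once every record has been seen,
# which is the plain-Python counterpart of running out of input.
one_pass = list(record_stream(["file0.csv", "file1.csv"], num_epochs=1))
```

In this sketch the loop count (7) plays the role of the 1200 in the tutorial: it is chosen by the caller and is unrelated to how many records the files hold.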

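If you only want to read as many examples as the files contain, you can either catch the OutOfRangeError that is thrown when the input producers run out of input, or read exactly the number of records there are. There is a new reading op in progress that should also make this easier, but I don't think it is included in 0.9.

(2) It should set up a very similar set of ops, but not actually do the reading. Remember that most of what you write in Python is constructing a graph, which is a sequence of ops that TensorFlow will execute. So the contents of read_my_file are pretty much what comes before the tf.Session() is created. In the upper example, the code in the for loop is actually executing the tf graph to extract examples back into Python. But in the second part of the example, you are just setting up the plumbing to read the items into Tensors, and then adding extra ops that consume those tensors and do something useful with them. In this case they are thrown into a queue to create larger batches, which are themselves probably destined for later consumption by other TF ops.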
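The distinction in (2) between describing the pipeline and running it can be illustrated with lazy generators. This is a loose plain-Python analogy, not how TensorFlow queues are implemented; `read_records`, `batch_pipeline`, and the sample rows are made-up names for illustration.

```python
import itertools

# Hypothetical sample data: four features followed by one label per row.
ROWS = [[1, 2, 3, 4, 0], [5, 6, 7, 8, 1], [9, 10, 11, 12, 0], [13, 14, 15, 16, 1]]
reads = []  # tracks which rows have actually been consumed


def read_records(rows):
    """Lazily turn each raw row into (features, label), one record at a time."""
    for row in rows:
        reads.append(row)
        yield row[:4], row[4]


def batch_pipeline(rows, batch_size):
    """Chain batching onto the reader; calling this reads nothing yet."""
    records = read_records(rows)
    while True:
        batch = list(itertools.islice(records, batch_size))
        if not batch:
            return
        features = [f for f, _ in batch]
        labels = [l for _, l in batch]
        yield features, labels


pipeline = batch_pipeline(ROWS, batch_size=2)  # analogous to building the graph
reads_before_running = len(reads)              # still 0: only plumbing exists

# The loop lives in the consumer, like the sess.run loop in the first block.
batches = [batch for batch in pipeline]
```

The reader function contains no fixed-count loop of its own; how many batches are pulled is decided by whoever consumes the pipeline, which mirrors why the tutorial's second snippet has no `for i in range(1200)`.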

Source: 2016-06-10 20:10:29, dga

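That makes a lot of sense! So I have two more questions. Q1: If I have 100 records and my training batch size is 80, can I return 80 records from input_pipeline (using 80 as the batch_size argument, right?) and then keep track of the other 20 records for testing? Basically, do you know a way to track which 80 I used for training, so I can test with the rest (after shuffling, of course)? Q2: When I initialize and run the Session in my code (located in another file) to train, test, and so on, should I feed in the data after calling input_pipeline? Thanks!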
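One possible way to handle the train/test question in this comment, which the answer above does not address, is to split the records before they reach any input pipeline, so it is always known which rows were held out. The following is a minimal sketch in plain Python; `k_fold_indices` and the record count of 100 are illustrative assumptions.

```python
import random


def k_fold_indices(num_records, k, seed=0):
    """Shuffle record indices once, then yield (train, test) index lists per fold."""
    indices = list(range(num_records))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of (nearly) equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# 100 records with k=5 gives 80 training and 20 test records per fold.
splits = list(k_fold_indices(100, k=5))
train_idx, test_idx = splits[0]
```

The rows selected by `train_idx` and `test_idx` could then, for example, be written to separate CSV files, with each file list passed to its own `input_pipeline` call.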
