
I'm implementing a convnet for token classification of string data. I need to pull string data out of a TFRecord, shuffle-batch it, then perform some processing that expands the data, and batch it again. Is this possible with two shuffle_batch operations? In other words: how do I double-batch TensorFlow input data?

Here's what I need to do:

  1. Enqueue the filenames into a filename queue 
  2. Put each serialized Example into a shuffle_batch 
  3. As I dequeue each example from the shuffled batch, I need to replicate it according to its sequence length, pairing each copy with a position vector; this creates multiple examples out of each original example in the first batch. Then I need to batch again. 

Of course, one workaround is to preprocess the data before loading it into TF, but that would take up far more disk space than necessary.

DATA

Here is some sample data. I have two "Examples". Each Example contains a tokenized sentence and a label for each token:

sentences = [ 
      ['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog', '.'], 
      ['then', 'the', 'lazy', 'dog', 'slept', '.'] 
      ] 
sent_labels = [ 
      ['O', 'O', 'O', 'ANIMAL', 'O', 'O', 'O', 'O', 'ANIMAL', 'O'], 
      ['O', 'O', 'O', 'ANIMAL', 'O', 'O'] 
      ] 
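For context, a minimal sketch (not from the question) of how Examples like these could be written to the TFRecord in the first place, assuming the same outfilepath and feature keys used in the code below, with the TF 1.x tf.python_io writer:

import tensorflow as tf 

def bytes_feature(values): 
    # one bytes_list entry per token/label 
    return tf.train.Feature( 
        bytes_list=tf.train.BytesList(value=[v.encode() for v in values])) 

writer = tf.python_io.TFRecordWriter(outfilepath) 
for tokens, labels in zip(sentences, sent_labels): 
    example = tf.train.Example(features=tf.train.Features(feature={ 
        'sentence': bytes_feature(tokens), 
        'labels': bytes_feature(labels), 
    })) 
    writer.write(example.SerializeToString()) 
writer.close() 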

Each "Example" now has features like the following (some reduction for clarity):

features { 
  feature { 
    key: "labels" 
    value { 
      bytes_list { 
        value: "O" 
        value: "O" 
        value: "O" 
        value: "ANIMAL" 
        ... 
      } 
    } 
  } 

  feature { 
    key: "sentence" 
    value { 
      bytes_list { 
        value: "the" 
        value: "quick" 
        value: "brown" 
        value: "fox" 
        ... 
      } 
    } 
  } 
} 

TRANSFORMATION

After batching the sparse data, I receive each sentence as a list of tokens:

['the', 'quick', 'brown', 'fox', ...] 

I first need to pad the list to a predetermined SEQ_LEN, then insert position indices into each example, rotating the positions so that the token I want to classify is at position 0 and every other position index is relative to position 0:

[ 
['the', 0 , 'quick', 1 , 'brown', 2 , 'fox', 3 , 'PAD', 4] # classify 'the' 
['the', -1, 'quick', 0 , 'brown', 1 , 'fox', 2 , 'PAD', 3] # classify 'quick' 
['the', -2, 'quick', -1, 'brown', 0 , 'fox', 1 , 'PAD', 2] # classify 'brown' 
['the', -3, 'quick', -2, 'brown', -1, 'fox', 0 , 'PAD', 1] # classify 'fox' 
] 
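The question doesn't show how this matrix is built, so here is my own minimal sketch of the offset arithmetic (not the author's replicate_and_insert_positions; it returns tokens and offsets as two separate tensors rather than interleaving them into rows of length SEQ_LEN * 2 as the author does):

import tensorflow as tf 

def replicate_with_positions(padded_sentence, sent_len, seq_len): 
    # padded_sentence: string tensor of shape (seq_len,) 
    # Build one row per real token (no rows for PAD): row i pairs every 
    # token with its offset relative to token i, so the token being 
    # classified always sits at offset 0. 
    positions = tf.range(seq_len)                 # [0, 1, ..., seq_len - 1] 
    starts = tf.range(sent_len)                   # one row per real token 
    offsets = (tf.expand_dims(positions, 0) 
               - tf.expand_dims(starts, 1))       # shape (sent_len, seq_len) 
    tokens = tf.tile(tf.expand_dims(padded_sentence, 0), [sent_len, 1]) 
    return tokens, offsets 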

BATCHING AND REBATCHING THE DATA

Here is a simplified version of what I'm trying to do:

# Enqueue the Filenames and serialize 
filenames =[outfilepath] 
fq = tf.train.string_input_producer(filenames, num_epochs=num_epochs, shuffle=True, name='FQ') 
reader = tf.TFRecordReader() 
key, serialized_example = reader.read(fq) 

# Dequeue Examples with batch_size == 1. Because all examples are SparseTensors, do 1 at a time 
initial_batch = tf.train.shuffle_batch([serialized_example], batch_size=1, 
                                       capacity=capacity, 
                                       min_after_dequeue=min_after_dequeue) 


# Parse Sparse Tensors, make into single dense Tensor 
# ['the', 'quick', 'brown', 'fox'] 
parsed = tf.parse_example(initial_batch, features=feature_mapping) 
dense_tensor_sentence = tf.sparse_tensor_to_dense(parsed['sentence'], default_value='<PAD>') 
sent_len = tf.shape(dense_tensor_sentence)[1] 

SEQ_LEN = 5 
NUM_PADS = SEQ_LEN - sent_len 
#['the', 'quick', 'brown', 'fox', 'PAD'] 
padded_sentence = pad(dense_tensor_sentence, NUM_PADS) 
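# (sketch) `pad` isn't shown in the question; assuming dense_tensor_sentence 
# has shape (1, sent_len), one way to write it would be: 
#   pad_block = tf.tile(tf.constant([['<PAD>']]), [1, NUM_PADS]) 
#   padded_sentence = tf.concat([dense_tensor_sentence, pad_block], axis=1) 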

# make sent_len X SEQ_LEN copy of sentence, position vectors 
#[ 
# ['the', 0 , 'quick', 1 , 'brown', 2 , 'fox', 3, 'PAD', 4 ] 
# ['the', -1, 'quick', 0 , 'brown', 1 , 'fox', 2 'PAD', 3 ] 
# ['the', -2, 'quick', -1, 'brown', 0 , 'fox', 1 'PAD', 2 ] 
# ['the', -3, 'quick', -2, 'brown', -1, 'fox', 0 'PAD', 1 ] 
# NOTE: There is no row where PAD is with a position 0, because I don't 
# want to classify the PAD token 
#] 
examples_with_positions = replicate_and_insert_positions(padded_sentence) 

# While my SEQ_LEN will be constant, the sent_len will not. Therefore, 
#I don't know the number of rows, but I can guarantee the number of 
# columns. shape = (?,SEQ_LEN) 

dynamic_input = final_reshape(examples_with_positions) # shape = (?, SEQ_LEN) 

# Try Random Shuffle Queue: 

# Rebatch <-- This is where the problem is 
#reshape_concat.set_shape((None, SEQ_LEN)) 

random_queue = tf.RandomShuffleQueue(10000, 50, [tf.int64], shapes=(SEQ_LEN,)) 
random_queue.enqueue_many(dynamic_input) 
batch = random_queue.dequeue_many(4) 


init_op = tf.group(tf.global_variables_initializer(), tf.local_variables_initializer(), tf.initialize_all_tables()) 

sess = create_session() 
sess.run(init_op) 

#tf.get_default_graph().finalize() 
coord = tf.train.Coordinator() 
threads = tf.train.start_queue_runners(sess=sess, coord=coord) 

try: 
    i = 0 
    while True: 
        print(sess.run(batch)) 
        i += 1 
except tf.errors.OutOfRangeError as e: 
    print("No more inputs.") 

EDIT

I'm now trying to use a RandomShuffleQueue. On each enqueue I want to enqueue a batch with shape (None, SEQ_LEN). I've modified the code above to reflect this.

I no longer get complaints about the input shapes, but the enqueue hangs at sess.run(batch).
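A likely cause of the hang (my diagnosis, not stated in the thread): tf.train.start_queue_runners only starts QueueRunners that have been registered with the graph, and a bare RandomShuffleQueue registers none, so the enqueue_many op above is never run by any thread and dequeue_many blocks forever. A sketch of wiring it up by hand:

enqueue_op = random_queue.enqueue_many(dynamic_input) 
qr = tf.train.QueueRunner(random_queue, [enqueue_op]) 
tf.train.add_queue_runner(qr)   # now start_queue_runners will drive the enqueue 
batch = random_queue.dequeue_many(4) 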


Just trying to understand: the second time you batch, you want to split those position matrices back into separate sentences, right? Won't those have different lengths, in which case packing them into a single dense tensor would be impossible? –


Sorry, I forgot to mention that I pad every input to a constant SEQ_LEN. I've rewritten the code sample, which will hopefully clarify things. I take one sentence, pad it, then tile and reshape it so that each token is paired with a position vector. The input to the second batch will be shape = (sent_len, SEQ_LEN). But because I don't know sent_len, I can't use QueueRunners – Neal


In that case is 'enqueue_many' what you want? Then batch as (sent_len_1 + sent_len_2 + ..., SEQ_LEN). The batch dimension for 'enqueue_many' shouldn't need static shape information (just make sure the remaining dimensions have static shapes). –

ANSWER


I was approaching the whole problem incorrectly. I mistakenly thought I had to define the complete shape of the batch when feeding tf.train.shuffle_batch, but in fact I only needed to define the shape of each element I was enqueuing, and set enqueue_many=True.

Here is the corrected code:

single_batch=1 
input_batch_size = 64 
min_after_dequeue = 10 
capacity = min_after_dequeue + 3 * input_batch_size 
num_epochs=2 
SEQ_LEN = 10 
filenames =[outfilepath] 

fq = tf.train.string_input_producer(filenames, num_epochs=num_epochs, shuffle=True) 
reader = tf.TFRecordReader() 
key, serialized_example = reader.read(fq) 

# Dequeue examples of batch_size == 1. Because all examples are Sparse Tensors, do 1 at a time 
first_batch = tf.train.shuffle_batch([serialized_example], batch_size=single_batch, 
          capacity=capacity, min_after_dequeue=min_after_dequeue) 

# Get a single sentence. shape=(sent_len,) 
single_sentence = tf.parse_example(first_batch, features=feature_mapping) 

# Preprocess the sentence. shape=(sent_len, SEQ_LEN * 2). Each row is one example 
processed_inputs = preprocess(single_sentence) 

# Re batch 
input_batch = tf.train.shuffle_batch([processed_inputs], 
       batch_size=input_batch_size, 
       capacity=capacity, min_after_dequeue=min_after_dequeue, 
       shapes=[SEQ_LEN * 2], enqueue_many=True) #<- This is the fix 


init_op = tf.group(tf.global_variables_initializer(), tf.local_variables_initializer(), tf.initialize_all_tables()) 

sess = create_session() 
sess.run(init_op) 

#tf.get_default_graph().finalize() 
coord = tf.train.Coordinator() 
threads = tf.train.start_queue_runners(sess=sess, coord=coord) 

try: 
    i = 0 
    while True: 
        print(i) 
        print(sess.run(input_batch)) 
        i += 1 
except tf.errors.OutOfRangeError as e: 
    print("No more inputs.")