2017-08-17 155 views
2

When I run my TensorFlow application, it only prints "Killed". How can I debug this? Why does TensorFlow just print "Killed"?

source code

[email protected]:~/tensorflow# python sample_cnn.py 
INFO:tensorflow:Using default config. 
INFO:tensorflow:Using config: {'_save_checkpoints_secs': 600, '_session_config': None, '_keep_checkpoint_max': 5, '_tf_random_seed': 1, '_keep_checkpoint_every_n_hours': 10000, '_save_checkpoints_steps': None, '_model_dir': 'data/convnet_model', '_save_summary_steps': 100} 
INFO:tensorflow:Create CheckpointSaverHook. 
2017-08-17 12:56:53.160481: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations. 
2017-08-17 12:56:53.160536: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations. 
2017-08-17 12:56:53.160545: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations. 
2017-08-17 12:56:53.160550: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations. 
2017-08-17 12:56:53.160555: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations. 
Killed 

Thanks! You guys are awesome! After tuning some of my parameters, I can run it on a 16 GB laptop. – reachlin

Answers

4

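When I ran the code I got the same behavior. If you type dmesg afterwards, you will see a trace like the following, which confirms what gdelab was hinting at: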

[38607.234089] python3 invoked oom-killer: gfp_mask=0x24280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), nodemask=0, order=0, oom_score_adj=0 
[38607.234090] python3 cpuset=/ mems_allowed=0 
[38607.234094] CPU: 3 PID: 1420 Comm: python3 Tainted: G   O 4.9.0-3-amd64 #1 Debian 4.9.30-2+deb9u2 
[38607.234094] Hardware name: Dell Inc. XPS 15 9560/05FFDN, BIOS 1.2.4 03/29/2017 
[38607.234096] 0000000000000000 ffffffffa9f28414 ffffa50090317cf8 ffff940effa5f040 
[38607.234097] ffffffffa9dfe050 0000000000000000 0000000000000000 0101ffffa9d82dd0 
[38607.234098] e09c7db7f06d0ac2 00000000ffffffff 0000000000000000 0000000000000000 
[38607.234100] Call Trace: 
[38607.234104] [<ffffffffa9f28414>] ? dump_stack+0x5c/0x78 
[38607.234106] [<ffffffffa9dfe050>] ? dump_header+0x78/0x1fd 
[38607.234108] [<ffffffffa9d8047a>] ? oom_kill_process+0x21a/0x3e0 
[38607.234109] [<ffffffffa9d800fd>] ? oom_badness+0xed/0x170 
[38607.234110] [<ffffffffa9d80911>] ? out_of_memory+0x111/0x470 
[38607.234111] [<ffffffffa9d85b4f>] ? __alloc_pages_slowpath+0xb7f/0xbc0 
[38607.234112] [<ffffffffa9d85d8e>] ? __alloc_pages_nodemask+0x1fe/0x260 
[38607.234113] [<ffffffffa9dd7c3e>] ? alloc_pages_vma+0xae/0x260 
[38607.234115] [<ffffffffa9db39ba>] ? handle_mm_fault+0x111a/0x1350 
[38607.234117] [<ffffffffa9c5fd84>] ? __do_page_fault+0x2a4/0x510 
[38607.234118] [<ffffffffaa207658>] ? page_fault+0x28/0x30 
... 
[38607.234158] [ pid ] uid tgid total_vm  rss nr_ptes nr_pmds swapents oom_score_adj name 
... 
[38607.234332] [ 1396] 1000 1396 4810969 3464995 6959  21  0    0 python3 
[38607.234332] Out of memory: Kill process 1396 (python3) score 568 or sacrifice child 
[38607.234357] Killed process 1396 (python3) total-vm:19243876kB, anon-rss:13859980kB, file-rss:0kB, shmem-rss:0kB 
[38607.720757] oom_reaper: reaped process 1396 (python3), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB 

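This basically means that Python started consuming too much memory and the kernel decided to kill the process. If you add some print statements to your code, you will see that mnist_classifier.train() is the function that is active when this happens. However, a few naive tests (such as removing the logging and lowering the number of steps) did not seem to help.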
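To read the numbers in the dmesg output above, it helps to convert the page counts in the process table to bytes. The sketch below assumes a page size of 4 KiB (typical on x86-64, but an assumption here; the log does not state it). The helper name pages_to_gib is made up for illustration.

```python
# Assumed page size in KiB; typical for x86-64, not stated in the log itself.
PAGE_KIB = 4

def pages_to_gib(pages, page_kib=PAGE_KIB):
    """Convert a page count from the OOM process table to GiB."""
    return pages * page_kib / 1024 / 1024

# Values copied from the python3 row (pid 1396) in the dmesg output.
total_vm_pages = 4810969
rss_pages = 3464995

# These products match the total-vm and anon-rss figures in the "Killed process" line.
print(total_vm_pages * PAGE_KIB)  # 19243876 (kB)
print(rss_pages * PAGE_KIB)       # 13859980 (kB)

# Resident memory at the time of the kill, in GiB.
print(round(pages_to_gib(rss_pages), 1))  # 13.2
```

Under that assumption, the process was holding roughly 13 GiB of resident memory when the kernel killed it.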

3

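Your program is being killed by your operating system. TensorFlow has no idea why, which is why it doesn't output anything. This is probably due to an out-of-memory error.

Check whether your syslog contains a line like this: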

<date> <computer> kernel: [...] Out of memory: Kill process <id> (python) score <...> or sacrifice child 

If so, you need to increase the memory available to Python and/or reduce the memory your program uses.
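The syslog check can also be done programmatically. The following is a minimal sketch that scans log text for the OOM-killer message in the format quoted in this thread. The function name find_oom_kills is hypothetical, and the message wording may differ on other kernel versions, so the pattern is an assumption tied to the lines shown here.

```python
import re

# Pattern based on the kernel message format quoted in this thread;
# other kernel versions may word the message differently.
OOM_PATTERN = re.compile(r"Out of memory: Kill process (\d+) \(([^)]+)\)")

def find_oom_kills(log_text):
    """Return a list of (pid, process_name) for each OOM-killer line found."""
    return [(int(m.group(1)), m.group(2)) for m in OOM_PATTERN.finditer(log_text)]

# Sample line copied from the dmesg output earlier in the thread.
sample = ("[38607.234332] Out of memory: Kill process 1396 (python3) "
          "score 568 or sacrifice child")
print(find_oom_kills(sample))  # [(1396, 'python3')]
```

In practice you would pass in the contents of your syslog file or the captured output of dmesg instead of the sample string.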

3

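As other commenters have said, your operating system is killing your process because it runs out of memory. You are trying to build a huge network. Let's look at your last dense layer. It has 65536 inputs and 65536 units. Each unit has a weight for every input, so that makes 65536 * 65536 = 4294967296 weights. The size of each weight depends on the dtype of your input, which I think is float64, so multiply by 64 bits and you get 32 GB of weights (65536 * 65536 * 64/1024/1024/1024/8 = 32). All of these weights form a single tensor that has to be operated on as a whole, so it must fit entirely in RAM. Does your system have 32 GB of RAM?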
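The arithmetic above is easy to reproduce for other layer sizes or dtypes. This is a small sketch in plain Python; the helper name dense_weight_bytes is made up for illustration, and it counts only the weight matrix (no biases, gradients, or optimizer state, which would add to the total).

```python
GIB = 1024 ** 3  # bytes per GiB

def dense_weight_bytes(n_inputs, n_units, bytes_per_weight):
    """Bytes needed for the weight matrix of a fully connected layer."""
    return n_inputs * n_units * bytes_per_weight

# float64 uses 8 bytes per weight, float32 uses 4.
print(dense_weight_bytes(65536, 65536, 8) / GIB)  # 32.0
print(dense_weight_bytes(65536, 65536, 4) / GIB)  # 16.0
```

Even with float32 the single weight tensor would need 16 GiB, so shrinking the layer itself is the more effective fix.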
