Resolving the Apache Beam pipeline import error [BoundedSource objects larger than the allowable limit]

I have a bunch of text files (~1M) stored on Google Cloud Storage. When I read these files into a Google Cloud Dataflow pipeline for processing, I always get the following error:

Total size of the BoundedSource objects returned by BoundedSource.split() operation is larger than the allowable limit

The troubleshooting page says:

You might encounter this error if you're reading from a very large number of files via TextIO, AvroIO or some other file-based source. The particular limit depends on the details of your source (e.g. embedding schema in AvroIO.Read will allow fewer files), but it is on the order of tens of thousands of files in one pipeline.

Does this mean I have to split my files into smaller batches instead of importing them all at once?

I am using the Dataflow Python SDK to develop the pipeline.
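To make the batching idea in the question concrete, here is a minimal sketch in plain Python of splitting a list of file paths into fixed-size groups. The bucket name, the file names, and the batch size of 10 are all hypothetical and chosen for illustration only. The sketch does not call any Beam or Dataflow API, and it does not claim that batching is the recommended fix.

```python
from typing import Iterator, List, Sequence


def batch_paths(paths: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `paths`, each holding at most `batch_size` items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(paths), batch_size):
        yield list(paths[start:start + batch_size])


# Hypothetical object names; a real job would list them from the bucket.
paths = [f"gs://example-bucket/input/part-{i:07d}.txt" for i in range(25)]

# 10 is an arbitrary illustrative size, not a documented Dataflow limit.
batches = list(batch_paths(paths, batch_size=10))

for index, batch in enumerate(batches):
    # Each batch could be handed to a separate read step or pipeline run.
    print(index, len(batch), batch[0])
```

Each batch keeps the original file order, so the groups can be processed one after another or submitted as separate jobs. A suitable batch size would have to be found by experiment, since the troubleshooting text only gives an order of magnitude.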

2017-08-29 Youxun Shen

I'm not sure why people voted to close this question. It is a perfectly reasonable problem that people often run into when programming with Apache Beam. – jkff