Initialization time too long when using a Partitioner in Spring Batch?

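I'm using a Partitioner to parallelize the import of *.csv files. There are about 30k files in the folder.

Problem: job initialization takes about 1-2 hours before all files are set up. The bottleneck is in SimpleStepExecutionSplitter.split().

Question: is it normal for step initialization to take this long? Or can I improve it somehow?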

@Bean 
public Step partitionStep(Partitioner partitioner) { 
    return stepBuilderFactory.get("partitionStep") 
      .partitioner(step()) 
      .partitioner("partitioner", partitioner) 
      .taskExecutor(taskExecutor()) 
      .build(); 
} 

@Bean 
public TaskExecutor taskExecutor() { 
    ThreadPoolTaskExecutor taskExecutor = new ThreadPoolTaskExecutor(); 
    taskExecutor.setCorePoolSize(4); //run import always with 4 parallel files 
    taskExecutor.setMaxPoolSize(4); 
    taskExecutor.afterPropertiesSet(); 
    return taskExecutor; 
} 


@Bean 
public Partitioner partitioner() throws IOException { 
    MultiResourcePartitioner p = new MultiResourcePartitioner(); 
    p.setResources(new PathMatchingResourcePatternResolver().getResources("mypath/*.csv")); 
    return p; 
}

Source

2017-05-05 membersound

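MultiResourcePartitioner creates one partition per resource. Creating the partitions is itself very fast (the partitioner returns the map of execution contexts almost immediately), but Spring Batch then spends a lot of time populating the corresponding metadata tables. In my own experience it becomes very slow once the number of partitions exceeds about 100.

According to the only answer here, some improvements were made, but I'm using the latest version and partitioning is still slow with more than 100 partitions.

See also this.

I think that unless you are prepared to rewrite a chunk of the API code yourself, there is no option other than reducing the number of partitions.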
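
One way to reduce the partition count is to give each partition a batch of files instead of a single file. The following is a minimal sketch in plain Java, assuming simple round-robin grouping. `FileGrouping` and `group` are hypothetical names, not Spring Batch API. In a real job the same logic would presumably live inside a custom `Partitioner.partition(int gridSize)` and store each group in that partition's `ExecutionContext`, with the worker step deriving the output file name from each input file it processes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; not part of Spring Batch.
class FileGrouping {

    // Distributes file names round-robin into at most gridSize groups,
    // so the number of partitions no longer grows with the number of files.
    static Map<String, List<String>> group(List<String> files, int gridSize) {
        if (gridSize < 1) {
            throw new IllegalArgumentException("gridSize must be >= 1");
        }
        Map<String, List<String>> partitions = new LinkedHashMap<>();
        for (int i = 0; i < files.size(); i++) {
            String key = "partition" + (i % gridSize);
            partitions.computeIfAbsent(key, k -> new ArrayList<>()).add(files.get(i));
        }
        return partitions;
    }

    public static void main(String[] args) {
        List<String> files = Arrays.asList("a.csv", "b.csv", "c.csv", "d.csv", "e.csv");
        // prints {partition0=[a.csv, c.csv, e.csv], partition1=[b.csv, d.csv]}
        System.out.println(group(files, 2));
    }
}
```

With 30k files and a grid size of, say, 100, this would yield 100 partitions of about 300 files each, which stays near the range where the metadata overhead was reported to be tolerable.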

Source

2017-05-05 16:23:59

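I'm using 'spring.batch.initializer.enabled=false' and 'MapJobRepository', so all metadata is kept in memory only. Even so, the "spring-batch stuff" in the batch job seems to slow it down (for my 30k files I have 30k partitions, but I have to stick with the 'Partitioner' because I need to define an output file name for each file based on its input). So I will probably have to drop 'spring-batch' here. – membersound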

I use a custom splitter, because the default splitter (https://github.com/spring-projects/spring-batch/blob/master/spring-batch-core/src/main/java/org/springframework/batch/core/partition/support/SimpleStepExecutionSplitter.java) calls jobRepository.getLastStepExecution for every StepExecution. I don't use Spring Batch's restartability, so I can write my own splitter. Step initialization now takes a few seconds for thousands of files (it took several minutes before).
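
The answer does not show its splitter, so the following is only a sketch of what such a class might look like, assuming the Spring Batch 3.x/4.x API (`StepExecutionSplitter`, `JobExecution.createStepExecution`, `JobRepository.addAll`). `NonRestartableSplitter` is a hypothetical name. It skips the per-partition `getLastStepExecution` lookup entirely, which means a restarted job would run every partition again.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

import org.springframework.batch.core.JobExecution;
import org.springframework.batch.core.JobExecutionException;
import org.springframework.batch.core.StepExecution;
import org.springframework.batch.core.partition.StepExecutionSplitter;
import org.springframework.batch.core.partition.support.Partitioner;
import org.springframework.batch.core.repository.JobRepository;
import org.springframework.batch.item.ExecutionContext;

// Hypothetical class for illustration; not part of Spring Batch.
// Assumption: restartability is not needed, so no previous executions are looked up.
public class NonRestartableSplitter implements StepExecutionSplitter {

    private final JobRepository jobRepository;
    private final String stepName;
    private final Partitioner partitioner;

    public NonRestartableSplitter(JobRepository jobRepository, String stepName,
                                  Partitioner partitioner) {
        this.jobRepository = jobRepository;
        this.stepName = stepName;
        this.partitioner = partitioner;
    }

    @Override
    public String getStepName() {
        return stepName;
    }

    @Override
    public Set<StepExecution> split(StepExecution stepExecution, int gridSize)
            throws JobExecutionException {
        JobExecution jobExecution = stepExecution.getJobExecution();
        Map<String, ExecutionContext> contexts = partitioner.partition(gridSize);

        Set<StepExecution> executions = new HashSet<>(contexts.size());
        for (Map.Entry<String, ExecutionContext> entry : contexts.entrySet()) {
            // Same "stepName:partitionKey" naming that the default splitter uses.
            StepExecution execution =
                    jobExecution.createStepExecution(stepName + ":" + entry.getKey());
            execution.setExecutionContext(entry.getValue());
            executions.add(execution);
        }

        // Persist all partition step executions in a single repository call.
        jobRepository.addAll(executions);
        return executions;
    }
}
```

If the `PartitionStepBuilder` in your version exposes `splitter(...)`, an instance could be registered there in place of the `.partitioner("partitioner", partitioner)` call. Check this against the version you run before relying on it.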

Source

2017-08-03 09:23:31

This should be a comment –
