
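Specifying additional jars in AWS EMR custom jar application

I am trying to run a Hadoop job on an EMR cluster. It runs as a Java command for which I use a jar-with-dependencies. The job pulls data from Teradata, and I assumed the Teradata-related jars were also packaged inside the jar-with-dependencies. However, I still get this exception: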

Exception in thread "main" java.lang.RuntimeException: java.lang.RuntimeException: java.lang.ClassNotFoundException: com.teradata.jdbc.TeraDriver 
at org.apache.hadoop.mapreduce.lib.db.DBInputFormat.setConf(DBInputFormat.java:171) 

The pom has the following relevant dependencies:

<dependency> 
    <groupId>teradata</groupId> 
    <artifactId>terajdbc4</artifactId> 
    <version>14.10.00.17</version> 
</dependency> 

<dependency> 
    <groupId>teradata</groupId> 
    <artifactId>tdgssconfig</artifactId> 
    <version>14.10.00.17</version> 
</dependency> 

I package the full jar in the pom as follows:

<build>
    <plugins>
        <plugin>
            <artifactId>maven-compiler-plugin</artifactId>
            <version>3.1</version>
            <configuration>
                <source>1.8</source>
                <target>1.8</target>
                <compilerArgument>-Xlint:-deprecation</compilerArgument>
            </configuration>
        </plugin>

        <plugin>
            <artifactId>maven-assembly-plugin</artifactId>
            <version>2.2.1</version>

            <configuration>
                <descriptors>
                </descriptors>
                <archive>
                    <manifest>
                    </manifest>
                </archive>
                <descriptorRefs>
                    <descriptorRef>jar-with-dependencies</descriptorRef>
                </descriptorRefs>
            </configuration>

            <executions>
                <execution>
                    <id>make-assembly</id>
                    <phase>package</phase>
                    <goals>
                        <goal>single</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>

    </plugins>
</build>

The assembly.xml file:

<assembly>
    <id>aws-emr</id>
    <formats>
        <format>jar</format>
    </formats>
    <includeBaseDirectory>false</includeBaseDirectory>
    <dependencySets>
        <dependencySet>
            <unpack>false</unpack>
            <includes>
            </includes>
            <scope>runtime</scope>
            <outputDirectory>lib</outputDirectory>
        </dependencySet>
        <dependencySet>
            <unpack>true</unpack>
            <includes>
                <include>${groupId}:${artifactId}</include>
            </includes>
        </dependencySet>
    </dependencySets>
</assembly>

The EMR command I run:

aws emr create-cluster --release-label emr-5.3.1 \
--instance-groups \
    InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge \
    InstanceGroupType=CORE,InstanceCount=5,BidPrice=0.1,InstanceType=m3.xlarge \
--service-role EMR_DefaultRole --log-uri s3://my-bucket/logs \
--applications Name=Hadoop --name TeradataPullerTest \
--ec2-attributes <ec2-attributes> \
--steps Type=CUSTOM_JAR,Name=EventsPuller,Jar=s3://path-to-jar-with-dependencies.jar,\
Args=[com.my.package.EventsPullerMR],ActionOnFailure=TERMINATE_CLUSTER \
--auto-terminate

Is there a way to specify the Teradata jars when the map-reduce tasks execute, so that they are added to the classpath?

EDIT: I confirmed that the missing class is packaged in the jar-with-dependencies.

aws-emr$ jar tf target/aws-emr-0.0.1-SNAPSHOT-jar-with-dependencies.jar | grep TeraDriver 
com/ncr/teradata/TeraDriver.class 
com/teradata/jdbc/TeraDriver.class 

Answer


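I have not fully solved the problem, but I found a way to make it work. The ideal solution would be to package the Teradata jars inside the uber jar. That packaging does happen, but the bundled jars are not added to the classpath, and I am not sure why.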

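I worked around it by creating 2 separate jars: one for my code package and another for all the required dependencies. I uploaded both jars to S3 and then wrote a script that does the following (pseudocode):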

# download the main jar
aws s3 cp <s3-path-to-myjar.jar> .

# download the dependency jar into a temp directory
mkdir -p temp
aws s3 cp <s3-path-to-dependency-jar> temp/dependencies.jar

# unzip the bundled jars into another directory (say `jars`), dropping their internal paths
unzip -j temp/dependencies.jar '<path-within-jar-to-unzip>/*' -d jars

# comma-separated list for -libjars (paste avoids a trailing comma)
LIBJARS=$(find jars -name '*.jar' | paste -s -d , -)

# colon-separated list for the classpath
HADOOP_CLASSPATH=$(echo "${LIBJARS}" | tr ',' ':')

CLASSPATH=$HADOOP_CLASSPATH

export CLASSPATH HADOOP_CLASSPATH

# run via the hadoop command; -libjars goes before the job's own arguments
hadoop jar myjar.jar com.my.package.EventsPullerMR -libjars "${LIBJARS}" <arguments to the job>

With this, the job starts and runs.
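
The list-building step in the script above can be isolated into a small helper. This is only a sketch: the function name `join_jars` is hypothetical, and it assumes the dependency jars have already been unpacked into a single flat directory.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Hadoop or the AWS CLI):
# join every *.jar directly under a directory with the given separator.
join_jars() {
  local dir="$1" sep="$2"
  # sort gives a stable order; paste -s joins the lines
  # without leaving a trailing separator
  find "$dir" -maxdepth 1 -name '*.jar' | sort | paste -s -d "$sep" -
}

# Example usage, assuming the jars were unpacked into ./jars:
#   LIBJARS=$(join_jars jars ,)
#   HADOOP_CLASSPATH=$(join_jars jars :)
#   export HADOOP_CLASSPATH
```

The comma-separated form is what `-libjars` expects and the colon-separated form is what `HADOOP_CLASSPATH` expects. Note that `-libjars` is one of Hadoop's generic options, so this approach presumably relies on the driver class being launched through `ToolRunner` (or otherwise using `GenericOptionsParser`).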
