java.lang.ClassCastException: org.apache.hadoop.hive.ql.io.orc.OrcStruct cannot be cast to org.apache.hadoop.io.Text. JSON SerDe error

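I am new to Hive and to working with JSON data in it. I am developing a Spark application that fetches JSON data and stores it in a Hive table. I have a JSON like this:

which looks like this when expanded:

I am able to read the JSON into a DataFrame and save it to a location in HDFS. Getting Hive to read the data is the hard part.

After searching online, I tried the following:

Use STRUCT for all the JSON fields, then access the elements with column.element.

For example:

web_app_security would be a column of type STRUCT in the table, and the other JSON objects nested inside it, such as config_web_cms_authentication and web_threat_intel_alert_external, would also be structs (with rating and rating_numeric as fields).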
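
As a rough sketch of the column.element access described above, assuming a table named jsons whose columns match the definition given further down, a query could look like this:

```sql
-- Sketch only: assumes the jsons table and struct layout from the table definition below.
SELECT
  web_app_security.rating,                                        -- field of the top-level struct
  web_app_security.config_web_cms_authentication.rating,          -- field of a nested struct
  web_app_security.config_web_cms_authentication.rating_numeric
FROM jsons;
```

Each dot steps one level deeper into the struct, so nested ratings can be read without flattening the data first.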

I tried to create the table with a JSON SerDe. Here is my table definition:

CREATE EXTERNAL TABLE jsons (
web_app_security struct<config_web_cms_authentication: struct<rating: string, rating_numeric: float>, web_threat_intel_alert_external: struct<rating: string, rating_numeric: float>, web_http_security_headers: struct<rating: string, rating_numeric: float>, rating: string, rating_numeric: float>, 
dns_security struct<domain_hijacking_protection: struct<rating: string, rating_numeric: float>, rating: string, rating_numeric: float, dns_hosting_providers: struct<rating:string, rating_numeric: float>>, 
email_security struct<rating: string, email_encryption_enabled: struct<rating: string, rating_numeric: float>, rating_numeric: float, email_hosting_providers: struct<rating: string, rating_numeric: float>, email_authentication: struct<rating: string, rating_numeric: float>>, 
threat_intell struct<rating: string, threat_intel_alert_internal_3: struct<rating: string, rating_numeric: float>, threat_intel_alert_internal_1: struct<rating: string, rating_numeric: float>, rating_numeric: float, threat_intel_alert_internal_12: struct<rating: string, rating_numeric: float>, threat_intel_alert_internal_6: struct<rating: string, rating_numeric: float>>, 
data_loss struct<data_loss_6: struct<rating: string, rating_numeric: float>, rating: string, data_loss_36plus: struct<rating: string, rating_numeric: float>, rating_numeric: float, data_loss_36: struct<rating: string, rating_numeric: float>, data_loss_12: struct<rating: string, rating_numeric: float>, data_loss_24: struct<rating: string, rating_numeric: float>>, 
system_hosting struct<host_hosting_providers: struct<rating: string, rating_numeric: float>, hosting_countries: struct<rating: string, rating_numeric: float>, rating: string, rating_numeric: float>, 
defensibility struct<attack_surface_web_ip: struct<rating: string, rating_numeric: float>, shared_hosting: struct<rating: string, rating_numeric: float>, defensibility_hosting_providers: struct<rating: string, rating_numeric: float>, rating: string, rating_numeric: float, attack_surface_web_hostname: struct<rating: string, rating_numeric: float>>, 
software_patching struct<patching_web_cms: struct<rating: string, rating_numeric: float>, rating: string, patching_web_server: struct<rating: string, rating_numeric: float>, patching_vuln_open_ssl: struct<rating: string, rating_numeric: float>, patching_app_server: struct<rating: string, rating_numeric: float>, rating_numeric: float>, 
governance struct<governance_customer_base: struct<rating: string, rating_numeric: float>, governance_security_certifications: struct<rating: string, rating_numeric: float>, governance_regulatory_requirements: struct<rating: string, rating_numeric: float>, rating: string, rating_numeric: float> 
)ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe' 
STORED AS orc 
LOCATION 'hdfs://nameservice1/data/gis/final/rr_current_analysis'

My intent was to parse the rows with the JSON SerDe. After saving some data to the table, I get the following error when I try to query it:

Error: java.io.IOException: java.lang.ClassCastException: org.apache.hadoop.hive.ql.io.orc.OrcStruct cannot be cast to org.apache.hadoop.io.Text (state=,code=0)

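I am not sure whether I am doing this the right way.

I am open to any other way of storing the data in the table. Any help would be appreciated. Thanks.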

2017-07-15 Hemanth Annavarapu

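That is because you are mixing ORC as the storage format (STORED AS orc) with JSON as the SerDe (ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'). This overrides ORC's default SerDe, OrcSerde, but not the input (OrcInputFormat) and output (OrcOutputFormat) formats. As a result, OrcInputFormat hands OrcStruct rows to a JsonSerDe that expects Text, which is exactly the cast that fails.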

Either use ORC storage without overriding the default SerDe. In that case, make sure your Spark application writes ORC to the table location, not JSON.
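
A minimal sketch of the ORC variant, trimmed to one of the question's columns for brevity. The table name jsons_orc and the LOCATION path are hypothetical placeholders, not values from the question:

```sql
-- Sketch only: jsons_orc and the LOCATION path are made-up names for illustration.
-- No ROW FORMAT SERDE clause, so STORED AS ORC keeps OrcSerde together with
-- OrcInputFormat and OrcOutputFormat.
CREATE EXTERNAL TABLE jsons_orc (
  system_hosting struct<
    host_hosting_providers: struct<rating: string, rating_numeric: float>,
    hosting_countries: struct<rating: string, rating_numeric: float>,
    rating: string,
    rating_numeric: float>
)
STORED AS ORC
LOCATION '/tmp/example/rr_current_analysis_orc';
```

The remaining columns from the question would follow the same pattern. The files under the location then have to be ORC files, not JSON text.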

Or, if you want the data stored as JSON, use JsonSerDe together with plain text files as storage (STORED AS TEXTFILE).
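
A sketch of the JSON-as-text variant, again trimmed to one column. The table name jsons_text and the LOCATION path are hypothetical, and it assumes the JsonSerDe jar is already available to Hive and that each file holds one JSON object per line:

```sql
-- Sketch only: jsons_text and the LOCATION path are made-up names for illustration.
-- TEXTFILE storage hands each line to the SerDe as Text, which is what JsonSerDe expects.
CREATE EXTERNAL TABLE jsons_text (
  system_hosting struct<
    host_hosting_providers: struct<rating: string, rating_numeric: float>,
    hosting_countries: struct<rating: string, rating_numeric: float>,
    rating: string,
    rating_numeric: float>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
STORED AS TEXTFILE
LOCATION '/tmp/example/rr_current_analysis_json';
```

With this layout the Spark application would write JSON text rather than ORC, and the SerDe and the storage format agree on the row type.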

The Hive Developer Guide explains how SerDes and storage work: https://cwiki.apache.org/confluence/display/Hive/DeveloperGuide#DeveloperGuide-HiveSerDe

2017-07-16 23:34:56

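Thanks for the answer. I tried saving my DataFrame as a text file using 'df.rdd.saveAsTextFile("path")', but I get this error: 'org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory already exists'. I don't know why it tries to create a new directory for every DataFrame instead of creating a new file in the given path. Is there a better way to save a DataFrame as a text file? Or is there a way to save the DataFrame as CSV and give a suitable table definition that reads the CSV file and uses the JSON SerDe? @Sergey Khudyakov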

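@HemanthAnnavarapu take a look at 'df.write', and especially 'df.write.mode(SaveMode)'. I am not sure why you are bringing up CSV files now, but I strongly recommend reading the Hive Developer Guide (linked in the answer) and the Spark DataFrame API documentation first. The "better way" really depends on what you are trying to achieve, what kind of table you want to have in Hive, and so on.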

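I am trying CSV because I searched for ways to save a DataFrame as text, and most of the examples use 'DF.write.format("org.databricks.spark.csv").save("path")'. Since 'Df.saveAsTextFile' did not work, I tried CSV. Is there any way to use a 'serde' on a 'csv' file? Or can I keep the 'orc' format and override the 'input format'? @Sergey Khudyakov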
