
I want to read a simple text file from Google Cloud Storage using a local Spark Scala program, as shown in the blog below:

https://cloud.google.com/blog/big-data/2016/06/google-cloud-dataproc-the-fast-easy-and-safe-way-to-try-spark-20-preview

I am trying to read a file from Google Cloud Storage using Spark with Scala. For that I have imported the Google Cloud Storage connector and google-cloud-storage as follows:

// https://mvnrepository.com/artifact/com.google.cloud/google-cloud-storage 
compile group: 'com.google.cloud', name: 'google-cloud-storage', version: '0.7.0' 

// https://mvnrepository.com/artifact/com.google.cloud.bigdataoss/gcs-connector 
compile group: 'com.google.cloud.bigdataoss', name: 'gcs-connector', version: '1.6.0-hadoop2' 

After that I created a simple Scala object file like the one below (it creates a SparkSession):

val csvData = spark.read.csv("gs://my-bucket/project-data/csv") 

But it throws the following error:

17/03/01 20:16:02 INFO GoogleHadoopFileSystemBase: GHFS version: 1.6.0-hadoop2 
17/03/01 20:16:23 WARN HttpTransport: exception thrown while executing request 
java.net.SocketTimeoutException: connect timed out 
    at java.net.DualStackPlainSocketImpl.waitForConnect(Native Method) 
    at java.net.DualStackPlainSocketImpl.socketConnect(DualStackPlainSocketImpl.java:85) 
    at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350) 
    at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206) 
    at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188) 
    at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:172) 
    at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) 
    at java.net.Socket.connect(Socket.java:589) 
    at sun.net.NetworkClient.doConnect(NetworkClient.java:175) 
    at sun.net.www.http.HttpClient.openServer(HttpClient.java:432) 
    at sun.net.www.http.HttpClient.openServer(HttpClient.java:527) 
    at sun.net.www.http.HttpClient.<init>(HttpClient.java:211) 
    at sun.net.www.http.HttpClient.New(HttpClient.java:308) 
    at sun.net.www.http.HttpClient.New(HttpClient.java:326) 
    at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:1169) 
    at sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1105) 
    at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:999) 
    at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:933) 
    at com.google.api.client.http.javanet.NetHttpRequest.execute(NetHttpRequest.java:93) 
    at com.google.api.client.http.HttpRequest.execute(HttpRequest.java:981) 
    at com.google.cloud.hadoop.util.CredentialFactory$ComputeCredentialWithRetry.executeRefreshToken(CredentialFactory.java:158) 
    at com.google.api.client.auth.oauth2.Credential.refreshToken(Credential.java:489) 
    at com.google.cloud.hadoop.util.CredentialFactory.getCredentialFromMetadataServiceAccount(CredentialFactory.java:205) 
    at com.google.cloud.hadoop.util.CredentialConfiguration.getCredential(CredentialConfiguration.java:70) 
    at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.configure(GoogleHadoopFileSystemBase.java:1816) 
    at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.initialize(GoogleHadoopFileSystemBase.java:1003) 
    at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.initialize(GoogleHadoopFileSystemBase.java:966) 
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2433) 
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:88) 
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2467) 
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2449) 
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:367) 
    at org.apache.hadoop.fs.Path.getFileSystem(Path.java:287) 
    at org.apache.spark.sql.execution.datasources.DataSource.hasMetadata(DataSource.scala:317) 
    at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:354) 
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:149) 
    at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:413) 
    at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:349) 
    at test$.main(test.scala:41) 
    at test.main(test.scala) 
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) 
    at java.lang.reflect.Method.invoke(Method.java:498) 
    at com.intellij.rt.execution.application.AppMain.main(AppMain.java:147) 

I have also set up all the authentication. I have no idea how to fix the timeout.

EDIT

I am trying to run the above code through IntelliJ IDEA (on Windows). A JAR of the same code works fine on Google Cloud Dataproc, but running it from my local system gives the error above. I have installed the Spark, Scala, and Google Cloud plugins in IntelliJ.

One more thing: I created a Dataproc instance and tried to connect to its external IP address via SSH as described in the documentation, https://cloud.google.com/compute/docs/instances/connecting-to-instance#standardssh

It was unable to connect to the server and gave a timeout error.

Answers


Thanks Dennis for pointing me in the right direction on this issue. Since I am on Windows there is no core-site.xml, because Hadoop is not available for Windows.

I downloaded a pre-built Spark and configured the parameters you mentioned in the code itself, as follows:

Create a SparkSession and use its variable to set the Hadoop parameters, e.g. spark.sparkContext.hadoopConfiguration.set("google.cloud.auth.service.account.json.keyfile", "<KeyFile Path>"); all the other parameters that would normally go in core-site.xml need to be set the same way.

After setting all of these, the program could access the files in Google Cloud Storage.
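Put together, the configuration described above might look like the following sketch. This is a minimal example, not the exact code from the question: the key-file path and bucket name are placeholders, and the `fs.gs.impl` property values are the standard gcs-connector class names that would otherwise be declared in core-site.xml.

```scala
import org.apache.spark.sql.SparkSession

object GcsLocalRead {
  def main(args: Array[String]): Unit = {
    // Local SparkSession; assumes the gcs-connector JAR is on the classpath.
    val spark = SparkSession.builder()
      .appName("gcs-local-read")
      .master("local[*]")
      .getOrCreate()

    val conf = spark.sparkContext.hadoopConfiguration
    // Register the GCS filesystem implementations (normally done in core-site.xml).
    conf.set("fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
    conf.set("fs.AbstractFileSystem.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")
    // Authenticate with a service-account JSON key instead of the GCE metadata server.
    conf.set("google.cloud.auth.service.account.enable", "true")
    conf.set("google.cloud.auth.service.account.json.keyfile", "<KeyFile Path>")

    val csvData = spark.read.csv("gs://my-bucket/project-data/csv")
    csvData.show()
  }
}
```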


You need to set google.cloud.auth.service.account.json.keyfile to the local path of the JSON credentials file for a service account you create by following these instructions for generating a private key. The stack trace shows the connector thinks it is on a GCE VM and is trying to obtain credentials from the local metadata server. If that does not work, try setting fs.gs.auth.service.account.json.keyfile instead.
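On a cluster where a core-site.xml does exist, the same settings would be an XML fragment along these lines (a sketch; the key-file path is a placeholder):

```xml
<!-- Fragment for core-site.xml; the key-file path is a placeholder. -->
<property>
  <name>google.cloud.auth.service.account.enable</name>
  <value>true</value>
</property>
<property>
  <name>google.cloud.auth.service.account.json.keyfile</name>
  <value>/path/to/service-account-key.json</value>
</property>
```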

When trying to SSH, have you tried gcloud compute ssh <instance name>? You may also want to check the Compute Engine firewall rules to make sure inbound connections on port 22 are allowed.
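For example, assuming the gcloud CLI is installed and the instance name and zone are placeholders:

```shell
# Connect by instance name; gcloud handles key management itself.
gcloud compute ssh my-instance --zone us-central1-a

# List firewall rules to verify that tcp:22 (SSH) is allowed inbound.
gcloud compute firewall-rules list
```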


I downloaded the JSON credentials file for a service account key and set it as the environment variable GOOGLE_APPLICATION_CREDENTIALS, since I am on Windows, and tried to run the program, but I got the same timeout error. I hope I have correctly implemented your suggestion of setting google.cloud.auth.service.account.json.keyfile to the local path of the JSON file; if not, please correct me. I am not sure where to set fs.gs.auth.service.account.json.keyfile. If any documentation is available, please suggest what configuration is needed on Windows. – Shawn


When trying to SSH, I already tried gcloud compute ssh as you mentioned, but it also gives me a timeout error. – Shawn


To my surprise, I can create buckets in Google Cloud Storage using the Storage class. I am not sure what is wrong with reading files from a bucket. – Shawn
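For reference, creating a bucket with the Storage class of the google-cloud-storage client looks roughly like this. This is a hedged sketch: the accessor names shown are from later client versions and vary slightly in early releases such as 0.7.0, and the bucket name is a placeholder. Note that this client talks to GCS directly over HTTPS using GOOGLE_APPLICATION_CREDENTIALS, whereas the Spark read path goes through the Hadoop gcs-connector, which has its own credential configuration, which may explain why one works while the other times out.

```scala
import com.google.cloud.storage.{BucketInfo, StorageOptions}

object CreateBucket {
  def main(args: Array[String]): Unit = {
    // Picks up credentials from the GOOGLE_APPLICATION_CREDENTIALS env variable.
    val storage = StorageOptions.getDefaultInstance.getService
    // Bucket name is a placeholder; names must be globally unique.
    val bucket = storage.create(BucketInfo.of("my-example-bucket"))
    println(s"Created bucket: ${bucket.getName}")
  }
}
```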