
By default, spark_read_jdbc() reads the entire database table into Spark. I use the following syntax to create these connections. How can I use predicates when reading from a JDBC connection?

library(sparklyr)
library(dplyr)

config <- spark_config()
# Point sparklyr at the MySQL JDBC driver jar
config$`sparklyr.shell.driver-class-path` <- "mysql-connector-java-5.1.43/mysql-connector-java-5.1.43-bin.jar"

sc <- spark_connect(master         = "local",
                    version        = "1.6.0",
                    hadoop_version = "2.4",
                    config         = config)

db_tbl <- sc %>%
    spark_read_jdbc(sc   = .,
                    name = "table_name",
                    options = list(url      = "jdbc:mysql://localhost:3306/schema_name",
                                   user     = "root",
                                   password = "password",
                                   dbtable  = "table_name"))

However, I now have a table in a MySQL database where I would rather read only a subset of the table into Spark.

How can I get spark_read_jdbc to accept predicates? I have tried adding predicates to the options list without success:

db_tbl <- sc %>%
    spark_read_jdbc(sc   = .,
                    name = "table_name",
                    options = list(url        = "jdbc:mysql://localhost:3306/schema_name",
                                   user       = "root",
                                   password   = "password",
                                   dbtable    = "table_name",
                                   predicates = "field > 1"))

Answer


You can use a query in place of dbtable. Spark's JDBC source wraps the dbtable value in its own SELECT, so a parenthesized subquery with an alias is accepted:

db_tbl <- sc %>%
    spark_read_jdbc(sc   = .,
                    name = "table_name",
                    options = list(url      = "jdbc:mysql://localhost:3306/schema_name",
                                   user     = "root",
                                   password = "password",
                                   dbtable  = "(SELECT * FROM table_name WHERE field > 1) as my_query"))

But for a simple condition like this, Spark should push the filter down automatically when you write:

db_tbl %>% filter(field > 1) 

Just make sure to set memory = FALSE in spark_read_jdbc(); otherwise the table is cached eagerly, which materializes the full table before your filter ever runs.
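
Putting it together, a minimal sketch of the lazy-read approach (the table name, field, and credentials are the placeholders from the question; show_query from dplyr is used here only as a sanity check of the generated SQL):

# Lazy read: memory = FALSE skips the eager cache, so the read stays
# unevaluated until an action forces it
db_tbl <- spark_read_jdbc(sc,
    name    = "table_name",
    options = list(url      = "jdbc:mysql://localhost:3306/schema_name",
                   user     = "root",
                   password = "password",
                   dbtable  = "table_name"),
    memory  = FALSE)

filtered <- db_tbl %>% filter(field > 1)

show_query(filtered)  # prints the SQL that dplyr generates for this tbl

With the read left lazy, Spark's JDBC source should be able to push a simple comparison like field > 1 down to MySQL, which you can confirm as PushedFilters in the query's physical plan.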