2017-03-10 150 views
0

Casting a string to int gives null

I have a Spark DataFrame, results, with two string columns that I want to convert to numbers:

>>> results.show() 
+--------------------+-----------------+------------------------+
|       Hospital Name|HCAHPS Base Score|HCAHPS Consistency Score|
+--------------------+-----------------+------------------------+
|"ADIRONDACK MEDIC...|             "43"|                    "20"|
|"BAYLOR MEDICAL C...|             "32"|                    "20"|
|"GOOD SHEPHERD ME...|             "25"|                    "20"|
|"GOOD SHEPHERD ME...|             "25"|                    "20"|
|"MASONIC HOME AND...|  "Not Available"|         "Not Available"|
|"ST HELENA HOSPITAL"|             "41"|                    "20"|
|   "TOURO INFIRMARY"|             "15"|                    "18"|
|"WAHIAWA GENERAL ...|             "17"|                    "10"|
|"ANNA JAQUES HOSP...|             "27"|                    "18"|
|    "CMC-BLUE RIDGE"|             "31"|                    "18"|
|"EVANSTON REGIONA...|             "15"|                    "15"|
|"OKLAHOMA SPINE H...|             "79"|                    "20"|
|"PICKENS COUNTY M...|  "Not Available"|         "Not Available"|
|"PORTNEUF MEDICAL...|             "11"|                    "17"|
|"PRESENCE SAINT J...|             "20"|                    "17"|
|"RIVERSIDE MEDICA...|             "39"|                    "20"|
|"RIVERSIDE MEDICA...|             "39"|                    "20"|
|"RIVERSIDE MEDICA...|             "39"|                    "20"|
|"SOUTH GEORGIA ME...|    "3 out of 10"|                    "24"|
|"TAMPA GENERAL HO...|             "23"|                    "16"|
+--------------------+-----------------+------------------------+

Trying the following gives me a table of null values:

>>> results2 = results.select(results["Hospital Name"], results["HCAHPS Base Score"].cast(IntegerType()).alias("HCAHPS Base Score"), results["HCAHPS Consistency Score"].cast(IntegerType()).alias("HCAHPS Consistency Score"))
>>> results2.show() 
+--------------------+-----------------+------------------------+
|       Hospital Name|HCAHPS Base Score|HCAHPS Consistency Score|
+--------------------+-----------------+------------------------+
|"ADIRONDACK MEDIC...|             null|                    null|
|"BAYLOR MEDICAL C...|             null|                    null|
|"GOOD SHEPHERD ME...|             null|                    null|
|"GOOD SHEPHERD ME...|             null|                    null|
|"MASONIC HOME AND...|             null|                    null|
|"ST HELENA HOSPITAL"|             null|                    null|
|   "TOURO INFIRMARY"|             null|                    null|
|"WAHIAWA GENERAL ...|             null|                    null|
|"ANNA JAQUES HOSP...|             null|                    null|
|    "CMC-BLUE RIDGE"|             null|                    null|
|"EVANSTON REGIONA...|             null|                    null|
|"OKLAHOMA SPINE H...|             null|                    null|
|"PICKENS COUNTY M...|             null|                    null|
|"PORTNEUF MEDICAL...|             null|                    null|
|"PRESENCE SAINT J...|             null|                    null|
|"RIVERSIDE MEDICA...|             null|                    null|
|"RIVERSIDE MEDICA...|             null|                    null|
|"RIVERSIDE MEDICA...|             null|                    null|
|"SOUTH GEORGIA ME...|             null|                    null|
|"TAMPA GENERAL HO...|             null|                    null|
+--------------------+-----------------+------------------------+

only showing top 20 rows 

Is it possible to convert string columns to integers in PySpark?

Answer

4

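First you need to strip the double quotes; after that you should be able to cast to IntegerType. The values literally contain quote characters ("43" rather than 43), so the cast cannot parse them as numbers and returns null. You can use the following UDF to remove them.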

>>> def stripDQ(string): 
...     return string.replace('"', "")
... 
>>> from pyspark.sql.functions import udf 
>>> from pyspark.sql.types import StringType, IntegerType 
>>> udf_stripDQ = udf(stripDQ, StringType()) 
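One caveat with the function above: Spark hands null cells to a Python UDF as None, and None has no replace method, so a column containing nulls would make the UDF fail. A minimal None-safe variant, sketched in plain Python so it can be checked without a Spark session (the name strip_dq_safe is hypothetical, not part of the original answer):

```python
def strip_dq_safe(value):
    """Remove double quotes from a string; pass None (a null cell) through unchanged."""
    if value is None:
        return None
    return value.replace('"', "")


# Plain-Python check on values taken from the table in the question
print(strip_dq_safe('"43"'))             # 43
print(strip_dq_safe('"Not Available"'))  # Not Available
print(strip_dq_safe(None))               # None
```

It could be registered the same way as stripDQ, with udf(strip_dq_safe, StringType()). If a Python UDF is not required, the built-in regexp_replace function in pyspark.sql.functions can also remove the quote characters.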

We will use it below.

Your actual DataFrame:

>>> results.show()
+------------------+-----------------+------------------------+
|     Hospital Name|HCAHPS Base Score|HCAHPS Consistency Score|
+------------------+-----------------+------------------------+
|"ADIRONDACK MEDIC"|             "43"|                    "20"|
|"BAYLOR MEDICAL C"|             "32"|                    "20"|
|"GOOD SHEPHERD ME"|             "25"|                    "20"|
|"GOOD SHEPHERD ME"|             "25"|                    "20"|
|"MASONIC HOME AND"|  "Not Available"|         "Not Available"|
+------------------+-----------------+------------------------+

Now we will use our UDF to strip the double quotes from both columns.

>>> results1 = results.withColumn("HCAHPS Base Score", udf_stripDQ(results["HCAHPS Base Score"])).withColumn("HCAHPS Consistency Score", udf_stripDQ(results["HCAHPS Consistency Score"])) 
>>> results1.show() 
+------------------+-----------------+------------------------+
|     Hospital Name|HCAHPS Base Score|HCAHPS Consistency Score|
+------------------+-----------------+------------------------+
|"ADIRONDACK MEDIC"|               43|                      20|
|"BAYLOR MEDICAL C"|               32|                      20|
|"GOOD SHEPHERD ME"|               25|                      20|
|"GOOD SHEPHERD ME"|               25|                      20|
|"MASONIC HOME AND"|    Not Available|           Not Available|
+------------------+-----------------+------------------------+

Now cast to integer:

>>> results2 = results1.select(results1["Hospital Name"], results1["HCAHPS Base Score"].cast(IntegerType()).alias("HCAHPS Base Score"), results1["HCAHPS Consistency Score"].cast(IntegerType()).alias("HPS Consistency Score")) 
>>> results2.show() 
+------------------+-----------------+---------------------+
|     Hospital Name|HCAHPS Base Score|HPS Consistency Score|
+------------------+-----------------+---------------------+
|"ADIRONDACK MEDIC"|               43|                   20|
|"BAYLOR MEDICAL C"|               32|                   20|
|"GOOD SHEPHERD ME"|               25|                   20|
|"GOOD SHEPHERD ME"|               25|                   20|
|"MASONIC HOME AND"|             null|                 null|
+------------------+-----------------+---------------------+
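
The null values in the last row are expected: once the quotes are removed, Not Available is still not a number, so the cast has nothing to return. A value such as "3 out of 10" from the question's table would end up as null for the same reason. The plain-Python sketch below illustrates this; to_int_or_none is a hypothetical helper that only approximates the string-to-integer cast for the kinds of values in this thread, and it is not Spark's actual implementation.

```python
def to_int_or_none(value):
    """Approximate a string-to-integer cast: return an int, or None if parsing fails."""
    if value is None:
        return None
    try:
        return int(value.strip())
    except ValueError:
        # Quoted numbers and free text both end up here
        return None


samples = ['"43"', "43", "Not Available", "3 out of 10", None]
for s in samples:
    print(repr(s), "->", to_int_or_none(s))
```

Only the unquoted "43" parses; every other sample maps to None. Running a check like this on a handful of distinct values before casting is one way to see which rows will be lost.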