2017-02-10 330 views

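I want to run a linear regression in Spark using Python 3.5 instead of Python 2.7. So first I exported PYSPARK_PYTHON=python3. I then got the error "No module named numpy". I tried "pip install numpy", but pip does not take the PYSPARK_PYTHON setting into account. How do I tell pip to install numpy for Python 3.5? Thank you... How do I install numpy and pandas for Python 3.5 in Spark?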

$ export PYSPARK_PYTHON=python3 

$ spark-submit linreg.py 
.... 
Traceback (most recent call last):
  File "/home/yoda/Code/idenlink-examples/test22-spark-linreg/linreg.py", line 115, in <module>
    from pyspark.ml.linalg import Vectors
  File "/home/yoda/install/spark/python/lib/pyspark.zip/pyspark/ml/__init__.py", line 22, in <module>
  File "/home/yoda/install/spark/python/lib/pyspark.zip/pyspark/ml/base.py", line 21, in <module>
  File "/home/yoda/install/spark/python/lib/pyspark.zip/pyspark/ml/param/__init__.py", line 26, in <module>
ImportError: No module named 'numpy'

$ pip install numpy 
Requirement already satisfied: numpy in /home/yoda/.local/lib/python2.7/site-packages 

$ pyspark 
Python 3.5.2 (default, Nov 17 2016, 17:05:23) 
[GCC 5.4.0 20160609] on linux 
Type "help", "copyright", "credits" or "license" for more information. 
17/02/09 20:29:20 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 
17/02/09 20:29:20 WARN Utils: Your hostname, yoda-VirtualBox resolves to a loopback address: 127.0.1.1; using 10.0.2.15 instead (on interface enp0s3) 
17/02/09 20:29:20 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address 
17/02/09 20:29:31 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException 
Welcome to 
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.1.0
      /_/

Using Python version 3.5.2 (default, Nov 17 2016 17:05:23) 
SparkSession available as 'spark'. 
>>> import site; site.getsitepackages() 
['/usr/local/lib/python3.5/dist-packages', '/usr/lib/python3/dist-packages', '/usr/lib/python3.5/dist-packages'] 
>>> 

Hint: Spark can, and usually does, do its work on a *cluster* of machines. – 2017-02-10 19:30:22


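You will have to install the numpy lib on every machine in the cluster you use. That is, if you only use it on your local machine, just download and add the library properly there. Spark should not care whether it is numpy or some other lib, as long as it is linked correctly.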


@JackManey It looks like local mode. The OP is just using the wrong pip :) Joshua - using virtualenv, Anaconda or another env management tool is a good idea. – zero323

Answer


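So I don't actually see this as a Spark problem. It looks like you need help with environments. As the commenters said, you need to set up a Python 3 environment, activate it, and then install numpy into it. Take a look at this for some help on working with environments. Once you have a python3 environment set up, activate it and run "pip install numpy" or "conda install numpy", and you should be good to go.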
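
To make the "wrong pip" point concrete: the pip on the PATH in the question belongs to Python 2.7, which is why it reports numpy as already satisfied under python2.7/site-packages. Running pip as a module of the interpreter Spark uses (for example "python3 -m pip install numpy") installs into that interpreter's own site-packages. Below is a minimal diagnostic sketch, assuming it is run with the same interpreter that PYSPARK_PYTHON points to. The helper names pip_install_command and has_module are made up for illustration; they are not part of Spark or pip.

```python
import importlib.util
import os
import shutil
import sys


def pip_install_command(interpreter, packages):
    """Build a command that runs pip under a specific interpreter.

    Hypothetical helper: it only assembles the argument list,
    it does not run anything.
    """
    return [interpreter, "-m", "pip", "install"] + list(packages)


def has_module(module_name):
    """Return True if the *current* interpreter can find the module."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # Fall back to "python3" if PYSPARK_PYTHON is unset (an assumption).
    target = os.environ.get("PYSPARK_PYTHON", "python3")

    print("Running under:", sys.executable)
    print("PYSPARK_PYTHON resolves to:", shutil.which(target))

    for mod in ("numpy", "pandas"):
        print(mod, "importable here:", has_module(mod))

    # The command to run in a shell if a module is missing.
    print("Install with:", " ".join(pip_install_command(target, ["numpy", "pandas"])))
```

Save it under any name and run it as "python3 your_script.py". If numpy shows as not importable, run the printed command (inside the activated environment if you use one) and retry spark-submit. On a real cluster, as the comments note, the same packages need to be present on every worker.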