Issue with 'pandas on spark' used with conda: "No module named 'pyspark.pandas'" even though both pyspark and pandas are installed - pandas

I have installed both Spark 3.1.3 and Anaconda 4.12.0 on Ubuntu 20.04.
I have set PYSPARK_PYTHON to be the python bin of a conda environment called my_env
export PYSPARK_PYTHON=~/anaconda3/envs/my_env/bin/python
I installed several packages on conda environment my_env using pip. Here is a portion of the output of pip freeze command:
numpy==1.22.3
pandas==1.4.1
py4j==0.10.9.3
pyarrow==7.0.0
N.B: package pyspark is not installed on the conda environment my_env. I would like to be able to launch a pyspark shell on different conda environments without having to reinstall pyspark in every environment (I would like to only modify PYSPARK_PYTHON). This would also avoids having different versions of Spark on different conda environments (which is sometimes desirable but not always).
When I launch a pyspark shell using pyspark command, I can indeed import pandas and numpy which confirms that PYSPARK_PYTHON is properly set (my_env is the only conda env with pandas and numpy installed, moreover pandas and numpy are not installed on any other python installation even outside conda, and finally if I change PYSPARK_PYTHON I am no longer able to import pandas or numpy).
Inside the pyspark shell, the following code works fine (creating and showing a toy Spark dataframe):
sc.parallelize([(1,2),(2,4),(3,5)]).toDF(["a", "b"]).show()
However, if I try to convert the above dataframe into a pandas on spark dataframe it does not work. The command
sc.parallelize([(1,2),(2,4),(3,5)]).toDF(["t", "a"]).to_pandas_on_spark()
returns:
AttributeError: 'DataFrame' object has no attribute 'to_pandas_on_spark'
I tried to first import pandas (which works fine) and then pyspark.pandas before running the above command but when I run
import pyspark.pandas as ps
I obtain the following error:
ModuleNotFoundError: No module named 'pyspark.pandas'
Any idea why this happens ?
Thanks in advance

From here, it seems that you need apache spark 3.2, not 3.1.3. Update to 3.2 and you will have the desired API.

pip install pyspark #need spark 3.3
import pyspark.pandas as ps

Related

Cannot import name 'to_html' from 'pandas_profiling.report' using JupyterLab

I'm new using Jupyter Lab and Pandas profiling.
I'm trying to install and import and install Pandas Profiling in a jupyter notebook. I'm able to install pandas collab using pip, but unable to import the library. The error says I cannot import name 'to_html' from 'pandas_profiling.report'.
Here's the code and the error.
Funny thing is: I also tried to run the notebook in Google Colab, but I got a different but similar error:
ImportError: cannot import name 'PandasProfiling' from 'pandas_profiling' (/usr/local/lib/python3.8/dist-packages/pandas_profiling/__init__.py)
I already tried to use Jupyter Lab and Jupyter Notebook from Anaconda and Google Colab to see if it works, but no look.
conda install -c conda-forge pandas-profiling
See this question.
from pandas_profiling import ProfileReport
https://pandas-profiling.ydata.ai/docs/master/pages/getting_started/quickstart.html
PandasProfiling object does not exist.

pandas installed, but shows ModuleNotFoundError When running python script

Windows 11;
Tried Anaconda, PyCharm, Python IDLE, and PowerShell;
installed pandas under PyCharm/Terminal, PowerShell:
using pip3 install pandas
In PowerShell, it works with input:
Python
import pandas
In Pycharm,
In the terminal, when I tried to install pandas again, it showed "Requirement already satisfied: pandas in d:\anaconda3\lib\site-packages (1.4.4)..."
However, it still shows "ModuleNotFoundError: No Module named 'pandas' when I run a.py with just one sentence:
import pandas
What's wrong?

Pandas incompatible with numpy

I am using anaconda 3. When I try to import pandas I receive the following message:
ImportError: this version of pandas is incompatible with numpy < 1.15.4
your numpy version is 1.15.3.
Please upgrade numpy to >= 1.15.4 to use this pandas version
Printing numpy.__path__ gives me the following
['C:\Users\andrei\AppData\Roaming\Python\Python37\site-packages\numpy']
In conda list, my numpy version is 1.19.1. I checked the above directory to find that it has only numpy 1.15.3 inside and nothing else. Spyder is using this path instead of the anaconda's path to numpy for some arcane reason.
Looks like you have somehow installed several versions of NumPy. Try to remove them all by running several times conda remove numpy and pip uninstall numpy. If you have two versions, the corresponding uninstall command needs to be run twice. After these, install a fresh version of NumPy conda install numpy
You can verify if you still have a version of NumPy installed
conda list | grep numpy
pip list | grep numpy
Note that these commands show only one version number even if you have several copies installed.
You can use conda to upgrade to upgrade your numpy. Run this command in the terminal:
conda update numpy
You need to remove this directory
C:\Users\andrei\AppData\Roaming\Python\
to fix this problem. It seems at some point you used pip to install numpy and that's interfering with the packages installed by conda (which is reporting the right version, as you said).
Furthermore, please be aware that pip and conda packages are binary incompatible, so you should avoid as much as possible to mix them.

ModuleNotFoundError for pandas_datareader: Jupyter Notebook using different packages from conda environment

I am using Anaconda windows v5.3.
I am getting the error:
ModuleNotFoundError: No module named 'pandas_datareader'
When I tried to print out the packages used by Jupyter Notebook, I realized that pandas_datareader is not in, and a different version of pandas (0.23.0) is used:
import pkg_resources
for i in pkg_resources.working_set:
print(i)
Output
...
pandocfilters 1.4.2
pandas 0.23.0
packaging 17.1
openpyxl 2.5.3
...
This differs from the library installed in the pyfinance environment:
>conda list
# Name Version Build
pandas 0.20.3 py36_0
pandas-datareader 0.4.0 py36_0
Hence, pandas_datareader seem to work in the python shell in the command prompt, but not in jupyter notebook. Are there anyways to sync jupyter notebook environment to the conda environment?
I realized to sync jupyter notebook you just have to do:
conda install jupyter

ipython cannot search matplotlib while using tensorflow and jupyter also has import error

I am using python 2.7 in Ubuntu and recently updated tensorflow 0.12.1.
I installed jupyter today for my sample code of tf and I need to use matplotlib. It does not find module name matplotlib and ipython in tensorflow has same error.
1. How can I set path in virtualenv or ipython or jupyter?
After activate tensorflow, I need to use jupyter notebook.
This below in the script for error does not work.
import sys
sys.path.append('my/path/to/module/folder')
import module-of-interest
2. other information: My environments are below.
mickyefromsd#DEKSTOP~$source ~/tensorflow/bin/activate
When I find matplotlib by python script under TF condition and before TF activation, it has below;
/usr/lib/python2.7/dist-packages/matplotlib/
When I type 'which ipython', it has below (not by /usr/bin/ipython) ;
/home/mickeyfromd/tensorflow/bin/ipython
Btw, /tensorflow/lib/python2.7/site-packages/ it has ipython and jupyter.
(not in the same path of matplotlib)
3. My ipython under TF cannot find my existing matplotlib.
(tensorflow) mickeyfromd#DK-DESKTOP:~$ ipython
Python 2.7.11+ (default, Apr 17 2016, 14:00:29)
In [1]: import matplotlib ImportError: No module named matplotlib
4. I wanted to setup virtualenv, so I just run this
I followed this site. site:http://help.pythonanywhere.com/pages/IPythonNotebookVirtualenvs
(tensorflow) mickeyfromd#ipython kernelspec install-self --user
.....
Installed kernelspec python2 in /home/mickeyfromd/.local/share/jupyter/kernels/python2
(tensorflow) mickeyfromd#DK-DESKTOP:~$
I cannot move the the folder (in the second step)
How can I make ipython to have path for Matplotlib?
That import error is due to change in environment of the jupyter notebook. You might have installed the packages in one environment and you are running the jupyter notebook in another environment.
I have got two environments (envs) in my Anaconda folder ( I have Anaconda3 folder to be specific ).
(windows key+cmd ) -> open the windows command prompt run as administrator.
Activate (name of the environment) -> eg.: activate tensorflow-gpu
Start installing packages using conda install
Note: For each environment you need to install all the packages you want to use, separately using the same process. This solution is for windows users, might work for linux users not sure though.
Additionally to make sure your conda environment is up to date run:
conda update conda
conda update anaconda
check this out : https://pradyumnamajumder.wordpress.com/2017/09/30/solution-to-the-python-packages-import-error-in-jupyter/