ModuleNotFoundError: No module named 'pyspark' on EMR cluster - amazon-emr

I have created an EMR cluster along with a Jupyter notebook and chose PySpark for it. When I try to import pyspark, it gives me the error "ModuleNotFoundError: No module named 'pyspark'". I have run pip install pyspark and it appears to have installed successfully, but I still get the same error. How can this be fixed?
Any help would be much appreciated.
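One common cause, assuming the notebook is running a plain Python 3 kernel rather than the PySpark kernel: pip can install pyspark into a different interpreter than the one the kernel uses, while the PySpark kernel on EMR already provides pyspark on its path. A minimal diagnostic sketch; findspark and the /usr/lib/spark path are assumptions based on EMR's default Spark install location:

import sys
print(sys.executable)  # confirm which interpreter the kernel actually runs

import findspark  # pip install findspark
findspark.init("/usr/lib/spark")  # EMR's default Spark home
import pyspark
print(pyspark.__version__)

Alternatively, switching the notebook's kernel to PySpark sidesteps the problem entirely, since that kernel ships with Spark support built in.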

Related

Cannot import name 'to_html' from 'pandas_profiling.report' using JupyterLab

I'm new to JupyterLab and pandas-profiling.
I'm trying to install and import pandas-profiling in a Jupyter notebook. I'm able to install the package using pip, but I'm unable to import the library. The error says I cannot import name 'to_html' from 'pandas_profiling.report'.
Here's the code and the error.
Funny thing is: I also tried to run the notebook in Google Colab, but I got a different but similar error:
ImportError: cannot import name 'PandasProfiling' from 'pandas_profiling' (/usr/local/lib/python3.8/dist-packages/pandas_profiling/__init__.py)
I already tried using JupyterLab and Jupyter Notebook from Anaconda and Google Colab to see if it works, but no luck.
See this question. The PandasProfiling object does not exist; the class to import is ProfileReport:
from pandas_profiling import ProfileReport
If the package itself is missing or broken, reinstall it from conda-forge:
conda install -c conda-forge pandas-profiling
The quickstart covers the details:
https://pandas-profiling.ydata.ai/docs/master/pages/getting_started/quickstart.html
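For reference, a minimal usage sketch based on the quickstart linked above; the toy DataFrame is just an illustration:

import pandas as pd
from pandas_profiling import ProfileReport

df = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})  # toy data
profile = ProfileReport(df, title="Profiling Report")
profile.to_file("report.html")  # writes a standalone HTML report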

No module named 'stable_baseline3' even when it is installed in Google Colab

I am trying to set up Stable Baselines3 in Google Colab. The notebook is connected to a local runtime on my PC through Jupyter. On my PC I have installed Stable Baselines3 using Anaconda, and the output essentially said that Stable Baselines3 was installed. I have also run the cells:
!pip install stable-baselines3[extra]
!pip install stable-baselines3
and
!pip install stable-baselines3 --upgrade
Despite this, when I run the cell:
import stable_baseline3
from stable_baselines3 import DQN
etc...
I get the error ModuleNotFoundError: No module named 'stable_baseline3' on line 1. I don't understand why this is happening; does anybody know how it could be solved?
I had the same problem.
Try running the install in its own cell first, and the import should then work:
!pip install stable-baselines3
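Also note that the import in the question reads stable_baseline3, but the installed module is named stable_baselines3 (with an 's' in 'baselines'), which on its own would produce exactly this ModuleNotFoundError. A corrected import:

import stable_baselines3  # note the trailing 's' in 'baselines'
from stable_baselines3 import DQN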

Issue with 'pandas on spark' used with conda: "No module named 'pyspark.pandas'" even though both pyspark and pandas are installed

I have installed both Spark 3.1.3 and Anaconda 4.12.0 on Ubuntu 20.04.
I have set PYSPARK_PYTHON to be the python bin of a conda environment called my_env
export PYSPARK_PYTHON=~/anaconda3/envs/my_env/bin/python
I installed several packages on conda environment my_env using pip. Here is a portion of the output of pip freeze command:
numpy==1.22.3
pandas==1.4.1
py4j==0.10.9.3
pyarrow==7.0.0
N.B.: the pyspark package is not installed in the conda environment my_env. I would like to be able to launch a pyspark shell on different conda environments without having to reinstall pyspark in every environment (I would like to only modify PYSPARK_PYTHON). This would also avoid having different versions of Spark on different conda environments (which is sometimes desirable, but not always).
When I launch a pyspark shell using the pyspark command, I can indeed import pandas and numpy, which confirms that PYSPARK_PYTHON is properly set (my_env is the only conda env with pandas and numpy installed; moreover, pandas and numpy are not installed in any other Python installation, even outside conda; and finally, if I change PYSPARK_PYTHON I am no longer able to import pandas or numpy).
Inside the pyspark shell, the following code works fine (creating and showing a toy Spark dataframe):
sc.parallelize([(1,2),(2,4),(3,5)]).toDF(["a", "b"]).show()
However, if I try to convert the above dataframe into a pandas on spark dataframe it does not work. The command
sc.parallelize([(1,2),(2,4),(3,5)]).toDF(["t", "a"]).to_pandas_on_spark()
returns:
AttributeError: 'DataFrame' object has no attribute 'to_pandas_on_spark'
I tried importing pandas first (which works fine) and then pyspark.pandas before running the above command, but when I run
import pyspark.pandas as ps
I obtain the following error:
ModuleNotFoundError: No module named 'pyspark.pandas'
Any idea why this happens?
Thanks in advance.
From here, it seems that you need Apache Spark 3.2, not 3.1.3: pyspark.pandas was introduced in Spark 3.2. Update to 3.2 or later and you will have the desired API.
pip install "pyspark>=3.2"  # pyspark.pandas needs Spark 3.2 or later
import pyspark.pandas as ps
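After upgrading, the conversion from the question should go through. A minimal sketch, assuming a pyspark shell where sc is already defined:

import pyspark.pandas as ps

sdf = sc.parallelize([(1, 2), (2, 4), (3, 5)]).toDF(["a", "b"])
psdf = sdf.to_pandas_on_spark()  # newer releases prefer sdf.pandas_api()
print(type(psdf))  # pyspark.pandas.frame.DataFrame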

I keep getting this error: ModuleNotFoundError: No module named 'object_detection'

I am using a Jupyter notebook.
I have already installed the object_detection module using:
pip install object-detection-api
But I keep getting this error.
Can somebody help?
Thank you in advance.
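A common culprit, though it is an assumption here since the environment is not shown: pip may have installed the package into a different interpreter than the one the notebook kernel runs. A quick check from inside the notebook:

import sys
print(sys.executable)  # the interpreter the kernel is using

# install into that same interpreter instead of whatever 'pip' resolves to
!{sys.executable} -m pip install object-detection-api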

cannot import name 'register_extension_dtype' while importing pandas

I am trying to import pandas, but it gives me the error below. Earlier it was giving me different errors, but I fixed those; now I am stuck on this one. I have never had such a problem before while importing pandas.
Here are your to-dos:
1. Try shutting down and then restarting the notebook.
2. If step 1 does not work, reinstall pandas using "conda install -f pandas" if you are using Anaconda. Do not forget to shut down and restart the notebook after the installation.
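After the reinstall, a quick sanity check; register_extension_dtype lives in pandas.api.extensions, so a clean installation should expose it:

import pandas as pd
from pandas.api.extensions import register_extension_dtype

print(pd.__version__, pd.__file__)  # confirm which pandas installation is imported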