How to read HDF5 file in Python/Pandas via SSH? - pandas

I'm accessing a remote machine via SSH (PuTTY). A dataset is stored in a directory on that machine, and I need to read it with pandas in Python on my local computer. I am trying dataframe = pandas.read_hdf(path, key="data"), but I don't know what path to specify in my local Python code so that it points to the dataset on the remote machine, since the file isn't stored locally. As I mentioned, I access the dataset using PuTTY.
What should the path look like?
I tried replacing C: with the host name followed by the path I use in PuTTY to access the file.
Thanks in advance.

I don't know precisely what you mean by "read", but you can display the dataframe with the following steps:
SSH to your remote server
Navigate to the directory your dataframe is stored in:
cd /directory/of/dataframe
Launch a Python or IPython interpreter: python or ipython
Execute these Python commands:
>>> import pandas as pd
>>> dataframe = pd.read_hdf("hdf_file.h5", key="data")
# This should work because `hdf_file.h5` is in the
# directory from which you launched Python
Print your dataframe: print(dataframe)
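
If you actually need to load the data in a local Python session rather than on the server, one option is to copy the file over SFTP first and then read the local copy. Below is a minimal sketch using paramiko; the hostname, credentials and paths are placeholders, not values from the question.
import paramiko
import pandas as pd

# Copy the HDF5 file from the remote machine over SFTP (placeholder host, credentials and paths)
client = paramiko.SSHClient()
client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
client.connect("remote-host", username="user", password="secret")

sftp = client.open_sftp()
sftp.get("/directory/of/dataframe/hdf_file.h5", "hdf_file.h5")  # remote path -> local copy
sftp.close()
client.close()

# Now the file exists locally, so a plain local path works with pandas
dataframe = pd.read_hdf("hdf_file.h5", key="data")
print(dataframe.head())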

Related

Running remote Pycharm interpreter with tensorflow and cuda (with module load)

I am using a remote computer in order to run my program on its GPU. My program contains some code with tensorflow functions, and for easier debugging with PyCharm I would like to connect via SSH with a remote interpreter to the computer with the GPU. This part can be done easily since PyCharm has this option, so I can connect. However, tensorflow is not loaded automatically, so I get an import error.
Note that in our institution, we run module load cuda/10.0 and module load tensorflow/1.14.0 each time the computer is loaded. Now this part is the tricky one. Opening a remote terminal creates another session that is not related to the remote interpreter session, so it does not affect the remote interpreter's modules.
I know that module load generally configures environment variables, but I am not sure how to export those environment variables into the environment variables PyCharm configures before a run.
Any help would be appreciated. Thanks in advance.
The workaround turned out to be relatively simple: first, I installed the EnvFile plugin, as explained here: https://stackoverflow.com/a/42708476/13236698
Then I created a .env file with a quick Python script: it extracts all environment variables and their values from os.environ and writes them to a file in the format <env_variable>=<variable_value>, saved with a .env extension (see the sketch below). Then I loaded it into PyCharm, and voilà - all tensorflow modules loaded fine.
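For reference, a minimal sketch of that script; the output file name is an assumption:
import os

# Dump the current environment (captured after running `module load cuda/10.0` and
# `module load tensorflow/1.14.0`) into a file the EnvFile plugin can read.
with open("remote_interpreter.env", "w") as f:
    for name, value in os.environ.items():
        f.write(f"{name}={value}\n")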

Pyspark EMR Notebook - Unable to save file to EMR environment

This seems really basic but I can't seem to figure it out. I am working in a PySpark notebook on EMR and have taken a PySpark dataframe and converted it to a pandas dataframe using toPandas().
Now, I would like to save this dataframe to the local environment using the following code:
movie_franchise_counts.to_csv('test.csv')
But I keep getting a permission error:
[Errno 13] Permission denied: 'test.csv'
Traceback (most recent call last):
File "/usr/local/lib64/python3.6/site-packages/pandas/core/generic.py", line 3204, in to_csv
formatter.save()
File "/usr/local/lib64/python3.6/site-packages/pandas/io/formats/csvs.py", line 188, in save
compression=dict(self.compression_args, method=self.compression),
File "/usr/local/lib64/python3.6/site-packages/pandas/io/common.py", line 428, in get_handle
f = open(path_or_buf, mode, encoding=encoding, newline="")
PermissionError: [Errno 13] Permission denied: 'test.csv'
Any help would be much appreciated.
When you run PySpark in an EMR Notebook you are connecting to the EMR cluster via Apache Livy. Therefore all your variables and dataframes are stored on the cluster, and when you run df.to_csv('file.csv') you are trying to save the CSV on the cluster and not in your local environment. I struggled a bit, but this worked for me:
Store your PySpark dataframe as temporary view: df.createOrReplaceTempView("view_name")
Load SparkMagic: %load_ext sparkmagic.magics
Select from the view and use SparkMagic to load the output to a local dataframe (-o flag):
%%sql -o df_local --maxrows 10
SELECT * FROM view_name
Now you have your data in the pandas dataframe df_local and can save it with df_local.to_csv('file.csv')
It depends on where exactly your kernel is running, i.e., whether it runs locally or on the remote cluster. In the case of EMR Notebooks, for EMR release labels 5.30 and 5.32+ (except 6.0 and 6.1), all kernels run remotely on the attached EMR cluster, so when you try to save a file it is actually saved on the cluster, and you may not have access to that directory there. For release labels other than those mentioned above, kernels run locally, so with those release labels you would be able to save the file locally with your code.
I believe the best way to do it would be to save to S3 directly from a PySpark dataframe, like this:
df.repartition(1).write.mode('overwrite').csv('s3://s3-bucket-name/folder/', header=True)
Note: you don't need a file name here since PySpark will create a file with a generated name such as part-00000-d129fe1-7721-41cd-a97e-36e076ea470e-c000.csv
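If the goal is still to end up with a single file on the machine you are working from, you can then pull that part file down from S3, for example with boto3. This is only a sketch; the bucket name, prefix and output file name are placeholders.
import boto3

s3 = boto3.client("s3")
bucket = "s3-bucket-name"
prefix = "folder/"

# Find the part-*.csv object Spark wrote and download it under a friendlier name
for obj in s3.list_objects_v2(Bucket=bucket, Prefix=prefix).get("Contents", []):
    if obj["Key"].endswith(".csv"):
        s3.download_file(bucket, obj["Key"], "movie_franchise_counts.csv")
        break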

nifi pyspark - "no module named boto3"

I'm trying to run a PySpark job I created that downloads and uploads data from S3 using the boto3 library. The job runs fine in PyCharm, but when I try to run it in NiFi using this template https://github.com/Teradata/kylo/blob/master/samples/templates/nifi-1.0/template-starter-pyspark.xml
the ExecutePySpark processor errors with "No module named boto3".
I made sure boto3 is installed in the conda environment that is active.
Any ideas? I'm sure I'm missing something obvious.
Here is a picture of the nifi spark processor.
Thanks,
tim
The Python environment that PySpark runs in is configured via the PYSPARK_PYTHON variable.
Go to the Spark installation directory
Go to conf
Edit spark-env.sh
Add this line: export PYSPARK_PYTHON=PATH_TO_YOUR_CONDA_ENV
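A quick way to confirm the setting took effect is to check, from inside the PySpark job itself, which interpreter is running; this is just a diagnostic sketch, not part of the NiFi template.
import sys

# Should print a path inside the conda environment once PYSPARK_PYTHON is picked up
print(sys.executable)

# Should import cleanly if that environment has boto3 installed
import boto3
print(boto3.__version__)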

X11 forwarding with PyCharm and Docker Interpreter

I am developing a project in PyCharm using a Docker interpreter, but I am running into issues when doing most "interactive" things. e.g.,
import matplotlib.pyplot as plt
plt.plot([1, 2, 3], [4, 5, 6])
gives
RuntimeError: Invalid DISPLAY variable
I can circumvent this using
import matplotlib
matplotlib.use('agg')
which gets rid of the error, but no plot is produced when I do plt.show(). I also get the same error as in the thread [pycharm remote python console]: "cannot connect to X server" error with import pandas when trying to debug after importing Pandas, but I cannot SSH into my docker container, so the solution proposed there doesn't work. I have seen the solution of passing "-e DISPLAY=$DISPLAY" into the "docker run" command, but I don't believe PyCharm has any functionality for specifying command-line parameters like this with a Docker interpreter. Is there any way to set up some kind of permanent, generic X11 forwarding (if that is indeed the root cause) so that the plots will be appropriately passed to the DISPLAY on my local machine? More generally, has anyone used matplotlib with a Docker interpreter in PyCharm successfully?
Here's the solution I came up with. I hope this helps others. The steps are as follows:
Install and run Socat
socat TCP-LISTEN:6000,reuseaddr,fork UNIX-CLIENT:\"$DISPLAY\"
Install and run XQuartz (probably already installed)
Edit the PyCharm run/debug configuration for your project, setting the appropriate address for the DISPLAY variable (in my case 192.168.0.6:0)
Running/debugging the project results in a new quartz popup displaying the plotted graph, without any need to save to an image, etc.
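A small test script to confirm the forwarding works; it assumes DISPLAY is set in the run configuration as described above and that an interactive backend such as TkAgg is available in the container.
import matplotlib
matplotlib.use("TkAgg")  # any interactive backend installed in the container
import matplotlib.pyplot as plt

# If X11 forwarding is wired up correctly, this opens a window via XQuartz on the local machine
plt.plot([1, 2, 3], [4, 5, 6])
plt.show()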
Run xhost + on the host and add these options to the docker run: -e DISPLAY -v /tmp/.X11-unix:/tmp/.X11-unix

ipython kernel with remote display [duplicate]

This question already has an answer here:
ipython notebook on linux VM running matplotlib interactive with nbagg
I use an ipython kernel on a remote machine via:
user#remote_machine$ ipython kernel
[IPKernelApp] To connect another client to this kernel, use:
[IPKernelApp] --existing kernel-24970.json
and then through manual ssh tunneling (see here) connect a qtconsole on my local machine to it:
user#local_machine$ for port in $(cat kernel-24970.json | grep '_port' | grep -o '[0-9]\+'); do ssh remote_machine -Y -f -N -L $port:127.0.0.1:$port; done
user#local_machine$ ipython qtconsole --existing kernel-24970.json
This works fine. However, to visualize my data while debugging, I want to use matplotlib.pyplot. Although I have enabled X11 forwarding on my SSH tunnel (through -Y), when I try plotting something, I get the following error:
TclError: no display name and no $DISPLAY environment variable
as if X11 forwarding does not have any effect.
Furthermore, once when I had access to the remote machine, I started the remote kernel with:
user#remote_machine$ ipython qtconsole
and repeated the same process from my local machine. This time, I wasn't getting any errors. But the figures were being plotted on the remote machine instead of my local machine.
So, does anyone know if it's possible to connect to a remote ipython kernel, and display plots locally? (please note that inline mode works, and shows the plots in the local qtconsole, but that's not useful for me as I frequently need to zoom in).
A simpler and more robust approach is to run IPython remotely as you did, but instead of trying to display the figures remotely, save them to files on the remote machine. At the same time, mount the remote directory using sftp and open it in your local file browser.
Make sure to refresh your directory view if images saved remotely are not visible yet (it can otherwise take some time for them to appear). One simple way to refresh the remote directory's view is noted here.
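As a minimal sketch of the save-remotely approach (the output file name is an assumption), on the remote kernel you can render figures to files with a headless backend:
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, no $DISPLAY needed
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([1, 2, 3], [4, 5, 6])
fig.savefig("figure_001.png", dpi=150)  # open it locally from the sftp-mounted directory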