What disk image should I choose for my Google Cloud VM so that pandas will work just as it does on my Mac?

I followed a handy tutorial to set up a Google Compute Engine VM instance with data science libraries and a Debian GNU/Linux 9 disk image. I ran a data exploration notebook I had put together on my local machine and found that pandas.read_csv() screws up the import of my training data.
Correctly imported, the dataset is a pandas DataFrame with one column ('text'). Each of the 3000 entries in that column is an article from a biomedical literature corpus. On the VM, though, some length threshold seems to be applied and pandas shunts part of a given article to a new row of the DataFrame. It does this to most but not all of the articles, and the DataFrame ends up with close to 6000 entries. More importantly, it's useless for training a model.
I cloned my local environment using Vagrant, but it looks like it might be difficult to get my disk image into Google Cloud and optimized. So I thought I would check here first whether anyone knows a simpler solution, like perhaps choosing a different disk image than Debian GNU/Linux for my Compute Engine instance so that pandas functions work properly. Thanks for your input!
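A quick way to check whether quoted newlines are being mishandled (a common cause of rows splitting like this) is to count logical CSV records with Python's standard csv module, which respects quoted fields, and compare that with what pandas loads. A minimal sketch, assuming a hypothetical filename train.csv:
import csv
import pandas as pd

# csv.reader keeps quoted fields with embedded newlines intact,
# so this counts logical records rather than raw lines
with open("train.csv", newline="") as f:
    n_records = sum(1 for _ in csv.reader(f))

df = pd.read_csv("train.csv")
print(n_records - 1, len(df))  # minus 1 for the header row; the two numbers should match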

After you log in to the Google Cloud VM instance running the default Debian GNU/Linux image, you can use the usual:
sudo apt-get update
sudo apt-get install python-pandas
Otherwise, if you prefer the pip installer, that works too:
sudo apt-get update
sudo apt-get install python-pip
Then you can install other PyPI libraries, such as pandas, with sudo pip install pandas.
Remember that if you want to install libraries for Python 3.x, use python3 instead of python in the above snippets.
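Once installed, it's worth confirming that the interpreter you actually run sees the pandas version you expect, since an old pandas can parse CSVs differently. A minimal check, assuming a hypothetical script check_pandas.py run as python3 check_pandas.py:
# check_pandas.py: confirm which Python and pandas the VM is really using
import sys
import pandas as pd

print(sys.version.split()[0])  # Python version on the VM
print(pd.__version__)          # compare with the pandas version on your Mac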

Related

Getting GRIB2 Lat/Lon Information from GDAL

I am attempting to plot fields from a GRIB2 file of GFS model data (example file: https://nomads.ncep.noaa.gov/pub/data/nccf/com/gfs/prod/gfs.20220202/12/atmos/gfs.t12z.pgrb2.0p25.f006 ). Normally I would just use PyGRIB and I'd have had this problem solved yesterday, but I am on Windows (because it's what my employer uses, so I'm stuck with it and have to make this work in a Windows environment), and Windows and PyGRIB don't play nice. I am able to open the GRIB2 file and even plot variables over the entire domain using GDAL. The only problem is I need a way to get an array of the latitude and longitude values at each grid point (similar to doing .latlons() on a GRIB message in PyGRIB) so I can plot a subset of the domain.
Basically, I'm trying to replicate what is being done in this video, and need the data (got it using dataset.GetRasterBand(269).ReadAsArray()), then the lat/lon information.
I also tried using xarray, but Windows doesn't play nice with xarray either.
Given your comfort with PyGRIB, I'd say the solution is to use Conda and install PyGRIB on Windows. You can use conda-forge's Miniforge to get Conda. Then, however you get Conda, install pygrib with:
conda install -c conda-forge pygrib
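If you do want to stay with GDAL instead, the per-gridpoint lat/lon can usually be reconstructed from the dataset's geotransform. A hedged sketch, assuming your GDAL build includes the GRIB driver and the file is on a regular lat/lon grid (as GFS 0.25-degree output is):
import numpy as np
from osgeo import gdal

ds = gdal.Open("gfs.t12z.pgrb2.0p25.f006")  # filename from the question
gt = ds.GetGeoTransform()  # (x_origin, dx, row_rot, y_origin, col_rot, dy)

nx, ny = ds.RasterXSize, ds.RasterYSize
# +0.5 moves from cell corners to cell centers, i.e. the original grid points
lons = gt[0] + gt[1] * (np.arange(nx) + 0.5)
lats = gt[3] + gt[5] * (np.arange(ny) + 0.5)
lon2d, lat2d = np.meshgrid(lons, lats)  # analogous to pygrib's .latlons()

data = ds.GetRasterBand(269).ReadAsArray()  # band number from the question
print(lon2d.shape == data.shape)  # sanity check: coordinate and data grids agree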

Which one is better for installing TensorFlow?

I followed the instructions on the official website to download TensorFlow. I chose to create a virtual environment, as the instructions show for macOS. My question is whether I need to activate the virtual environment each time before I use TensorFlow.
For example, I want to use TensorFlow in a Jupyter notebook, and that means I need to install Jupyter and other required packages like Seaborn/pandas in the virtual environment as well. However, I already downloaded Anaconda, and it basically has all the packages I need.
Besides, will it make a difference if I install it with conda?
Well, if you installed the packages (like you said, TensorFlow and Seaborn) in the base Conda environment, which is the default environment Anaconda provides on installation, then to use what it has you need to launch whatever program/IDE, like Jupyter Lab, from it. So you would open Anaconda Prompt, type in jupyter lab, and it would start a local server where you can work with the Python libraries installed through Conda.
Otherwise, in IDEs like VSCode, you can simply set the Python interpreter to the one from Conda.
However, if you install the libraries and packages you need using pip on your actual Python installation, not Conda, then there is no need for any activation. Everything will run right out of the box, and you don't need to select the interpreter in IDEs like VSCode.
Bottom line: if you know what libraries you need and don't mind running pip install package-name every time you need a package, stick with pip.
If you don't like that sort of 'low-level' stuff, then use Anaconda or Miniconda.
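Whichever route you pick, a quick way to tell whether a notebook or script is seeing the environment you intended is to print the interpreter path and try the import. A minimal sketch:
import sys
print(sys.executable)  # should point inside the environment where you installed TensorFlow

import tensorflow as tf  # raises ImportError if this interpreter can't see your install
print(tf.__version__)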

Conda and Jupyter Notebook Environment Confusion

I am using Jupyter Notebook to help debug some issues I'm having moving between JSON and pandas. The specific application isn't important.
The important part is that I needed to use pandas.json_normalize() which apparently first showed up in pandas version 1.0.3. I was confused when Jupyter said it doesn't exist. I did a version check and got:
In[]: pd.__version__
Out[]: 0.25.2
This is not the pandas version installed in either my base environment or the conda environment that Jupyter Notebook (and the app) is running in. Version checks in both environments in Anaconda Prompt (outside of Jupyter Notebook) confirm this.
What is going on here? Looking around I haven't seen a good answer, but it does appear that other people have had the same issue: Jupyter defaulting to pandas 0.25.2 for some reason.
It seems that your notebook is using a different kernel/environment than the one you want.
Run this in the notebook to see which environment you are using:
! which python
or try
import sys
print(sys.executable)
Either will show you which interpreter it's using; if you have an env named venv, you will get something like:
/home/your_home_directory/anaconda3/envs/venv/bin/python
If you don't care about all of that and just want to update the pandas the notebook is using, run pip through that same interpreter so the upgrade lands in the right environment:
import sys
!{sys.executable} -m pip install --upgrade pandas
Note that the pandas version you can get will also depend on which version of Python you are using.
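If the kernel keeps pointing at the wrong environment, another hedged option is to register the environment you actually want as a named Jupyter kernel (my-env below is a hypothetical placeholder name). With that environment activated in Anaconda Prompt:
python -m pip install ipykernel
python -m ipykernel install --user --name my-env --display-name "Python (my-env)"
Then restart Jupyter and pick "Python (my-env)" from the kernel list.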

Installing Pandas for PyPy on Alpine Linux?

As documented in the following question, installing pandas and NumPy is slow on Alpine Linux. For those using normal Python, there are workarounds that involve adding prebuilt versions of pandas. However, these versions are for Python 3. What is the best way to handle this with PyPy?
The solution would be to provide prebuilt versions for Alpine Linux. Someone has to do the work of building them and uploading them to a public site. It seems the distro provides these for CPython; perhaps they could be convinced to do so for PyPy as well.
Newer versions of PyPy already support pandas and NumPy:
https://doc.pypy.org/en/latest/release-v5.9.0.html
So the official PyPy image on Docker Hub should support them, and there's no need to build your Dockerfile from Alpine:
https://hub.docker.com/_/pypy
Update first:
apk add --update py-pip
Or:
apk update
apk add py-pip
Or install Anaconda Navigator.

User folder name with a blank causes failure for Anaconda

I have been trying to install Keras for about a week now. I installed Anaconda and then TensorFlow with Python 3.5 and Jupyter. When I start up the Anaconda3 prompt it always gives me the message
>was unexpected at this time
C:\Users\Ray Van>#IF NOT "==" #chcp > NUL
C:\Users\Ray Van>
I used to be able to just type jupyter notebook, but it doesn't like this.
I also want to run activate tensorflow, then jupyter notebook, and then run a Python program with Keras (for neural networks), but no matter what I tried, nothing works. I read somewhere that having the blank in the name \Ray Van can be a problem, but I didn't set that up. Somehow it was just set up by Windows 10, and from reading various posts it seems very difficult to change without risking having to install Windows 10 again.
Various places say that it is very easy to install Keras, but I have found the opposite after trying for several days, 3 hours at a time. I am not good at installing things like this and don't really understand how everything is connected. Maybe I have to start over: install Anaconda, then TensorFlow, and then from within the TensorFlow environment install Keras and Jupyter. I know the pip command or the conda command is used for this, but I don't really understand that either. So, a total newbie who just wants to run some Python programs for my neural network research using Keras.