Should we use pandas.compat.StringIO or Python 2/3 StringIO? - pandas

StringIO is the file-like string buffer object we use when reading a pandas DataFrame from text, e.g. "How to create a Pandas DataFrame from a string?"
Which of these two imports should we use for StringIO (within pandas)? This is a long-running question that has gone unresolved for over four years.
StringIO.StringIO (Python 2) / io.StringIO (Python 3)
Advantages: more stable for future-proofing code, but forces us to version-fork; e.g. see the code at the bottom from EmilH.
pandas.compat.StringIO
pandas.compat is a 2/3 compatibility package ("without the need for 2to3") introduced back in 0.13.0 (Jan 2014)
The pandas.compat package is still marked 'private' as of 0.22, with no plans to make it 'public'. The docs say: "Warning: The pandas.core, pandas.compat, and pandas.util top-level modules are considered to be PRIVATE. Stability of functionality in those modules is not guaranteed." Although, in practice, they have essentially not broken since 0.13.
The pandas.compat source defines imports for builtins, StringIO/cStringIO, BytesIO, cPickle, httplib, iterator versions of range, filter, map and zip, plus other elements necessary for Python 3 compatibility - see the 0.13.0 whatsnew.
Version 2/3 forking code for imports from standard (from EmilH):
import sys

if sys.version_info[0] < 3:
    from StringIO import StringIO   # Python 2: StringIO lives in its own module
else:
    from io import StringIO         # Python 3: StringIO lives in io
# Note: but this is very much a poor-man's version of pandas.compat, which contains much, much more
Note:
pandas.compat has existed since pandas 0.13.0 (Jan 2014) as a subpackage within pandas
it also seems to have been released as a standalone package: 0.1.0 (Jun 10, 2017) and 0.1.1 (Jun 10, 2017)

I know this is an old question, but I followed breadcrumbs here, so it's perhaps still worth answering. It's not totally definitive, but the current pandas documentation suggests using the built-in StringIO rather than its own internal methods:
For examples that use the StringIO class, make sure you import it with from io import StringIO for Python 3.

FYI, as of pandas 0.25, StringIO was removed from pandas.compat (PR #25954), so you'll now see:
from pandas.compat import StringIO
ImportError: cannot import name 'StringIO' from 'pandas.compat'
This means the only option now is to import it from the io module.
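So, on any recent pandas, reading a DataFrame from a string looks like this (a minimal sketch using only io.StringIO and pd.read_csv):
from io import StringIO
import pandas as pd

data = "col1,col2\n1,a\n2,b"          # CSV text held in a plain string
df = pd.read_csv(StringIO(data))      # StringIO makes the string file-like for read_csv
print(df)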

Related

Plot a subset of data from a grib file on google colab

I'm trying to plot a subset of a field from a grib file on Google Colab. The issue I am finding is that, because Google Colab uses an older version of Python, I can't get enough libraries to work together to 1.) get a field from the grib file, 2.) extract a subset of that field by lat/lon, and then 3.) plot it with matplotlib/cartopy.
I've been able to do each of the above steps on my own PC, and there are numerous answers on this forum already that work outside of Colab, so the issue is about making it work in the Colab environment, which uses Python 3.7.
For simplicity, here are some assumptions that could be made for anybody who wants to help.
1.) Use this file, since it's the one I have been trying to use:
https://noaa-hrrr-bdp-pds.s3.amazonaws.com/hrrr.20221113/conus/hrrr.t18z.wrfnatf00.grib2
2.) You could use any field, but I've been extracting this one (output from pygrib):
14:Temperature:K (instant):lambert:hybrid:level 1:fcst time 0 hrs:from 202211131800
3.) You can get this data in zarr format from AWS, but the grib format uploads to the AWS database faster so I need to use it.
Here are some notes on what I've tried:
Downloading the data isn't an issue; extracting the data (by lat/lon) is the main issue. I've tried using condacolab or pip to install pygrib, pupygrib, pinio, or cfgrib, and then using these to read the data above.
I could never get pupygrib or pinio to even install correctly. cfgrib I was able to get working with conda, but then xarray fails when trying to extract fields due to a library conflict. pygrib worked the best: I was able to extract fields from the grib file. However, the function grb.data(lat1=30,lat2=40,lon1=-100,lon2=-90) fails. It dumps the data into 1d arrays instead of the 2d arrays it is supposed to return per the documentation found here: https://jswhit.github.io/pygrib/api.html#example-usage
Here is some code I used for the pygrib set up in case that is useful:
!pip install pyproj
!pip install pygrib
# Uninstall existing shapely
!pip uninstall --yes shapely
!apt-get install -qq libgdal-dev libgeos-dev
!pip install shapely --no-binary shapely
!pip install cartopy==0.19.0.post1
!pip install metpy
!pip install wget
!pip install s3fs
import time
from matplotlib import pyplot as plt
import numpy as np
import scipy
import pygrib
import fsspec
import xarray as xr
import metpy.calc as mpcalc
from metpy.interpolate import cross_section
from metpy.units import units
from metpy.plots import USCOUNTIES
import cartopy.crs as ccrs
import cartopy.feature as cfeature
!wget https://noaa-hrrr-bdp-pds.s3.amazonaws.com/hrrr.20221113/conus/hrrr.t18z.wrfnatf00.grib2
grbs = pygrib.open('/content/hrrr.t18z.wrfnatf00.grib2')
grb2 = grbs.message(1)
data, lats, lons = grb2.data(lat1=30,lat2=40,lon1=-100,lon2=-90)
data.shape
This will output a 1d array for data, as well as for lats and lons. That is as far as I can get here, because existing options like meshgrid don't work on big datasets (I tried it).
The other option is to get data this way:
grb_t = grbs.select(name='Temperature')[0]
This is plottable, but I don't know of a way to extract a subset of the data from here using lat/lons.
If you can help, feel free to ask; I can add more details. Since I've tried about 10 different ways, there's probably no sense in listing every failure. Really, I am open to any way to accomplish this task. Thank you.
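For illustration, a minimal sketch of one possible approach (assuming grb_t.values and grb_t.latlons() return matching 2-D arrays on the native grid, and that this file may store longitudes in the 0-360 range):
import numpy as np
import pygrib

grbs = pygrib.open('/content/hrrr.t18z.wrfnatf00.grib2')
grb_t = grbs.select(name='Temperature')[0]

vals = grb_t.values              # 2-D field on the native Lambert grid
lats, lons = grb_t.latlons()     # matching 2-D latitude / longitude arrays

# Normalise longitudes to -180..180 in case the file stores them as 0..360
lons = np.where(lons > 180, lons - 360, lons)

# Boolean mask for the 30-40N / 100-90W box; masking keeps the 2-D shape
box = (lats >= 30) & (lats <= 40) & (lons >= -100) & (lons <= -90)
subset = np.where(box, vals, np.nan)  # NaN outside the box, still plottable as 2-D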

To_datetime in pandas module

This is a follow-up to a question which I had asked earlier: File size increased after imported Pandas
I have the following code:
pd.to_datetime(xl_file.index, format='%Y-%m-%d')
To make this code work I have to use import pandas as pd. Is there a way I can get this code to work without having to import the entire pandas package? I need to do it this way because importing all of pandas increases the size of the .exe file dramatically.
Just import the to_datetime function, rather than the entire package.
from pandas import to_datetime
val = to_datetime("2020-01-01", format='%Y-%m-%d')
print(val)
Output:
2020-01-01 00:00:00

Seaborn renders plots slowly in apache zeppelin notebooks

I am currently trying to generate visualizations in zeppelin (0.8.1) notebooks using the pyspark interpreter with python 3.7.3.
Generating the following simple plot with seaborn (0.9.0) takes around 5 minutes (with very high CPU usage throughout the duration):
%pyspark
import seaborn as sns
import numpy as np
import pandas as pd
data = pd.DataFrame(np.random.rand(100,3))
sns.pairplot(data)
This behavior is rather inconsistent, as the following (much more data-intensive) plot is rendered instantly:
%pyspark
import seaborn as sns
import numpy as np
import pandas as pd
df = pd.DataFrame(data = np.random.rand(10000,2))
sns.lineplot(x = 0, y = 1, data = df)
I noticed that using matplotlib (3.1.0) directly is generally much faster and almost as snappy as I am used to from Jupyter notebook environments.
I have already read about issue ZEPPELIN-1894 but I can render the mentioned scatterplot instantly as well.
OK, after posting here I found the solution: use the %spark.ipyspark interpreter. This might require installing additional packages:
pip install protobuf grpcio
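For example, the slow pairplot paragraph from the question then becomes (a sketch assuming the ipyspark interpreter is configured in Zeppelin after installing the packages above):
%spark.ipyspark
import seaborn as sns
import numpy as np
import pandas as pd

data = pd.DataFrame(np.random.rand(100, 3))
sns.pairplot(data)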

from pyramid.arima import auto_arima not working

I am doing some time series forecasting, and while at it I am trying to import auto_arima from pyramid, but it throws a "module not found" error: "No module named 'pyramid.arima'"
from pyramid.arima import auto_arima
I also tried importing auto_arima from pmdarima:
from pmdarima.arima import auto_arima
but this throws an error:
"type object 'pmdarima.arima._arima.array' has no attribute '__reduce_cython__'"
What am I doing wrong?...
I'm using the pmdarima package without any issues, but your error is most probably related to your numpy version. I would recommend upgrading it (in case you use pip):
pip install --upgrade numpy
You can also try importing the numpy package before importing auto_arima (some people experience strange behavior otherwise).
You can follow the discussion in the GitHub issues - https://github.com/tgsmith61591/pmdarima/issues/91 (there are similar reports in other issues as well). You're definitely not the first one with that problem.
If that doesn't help, please paste your pmdarima and numpy versions.
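A quick way to check both suggestions together (just a sketch, fitting a dummy series):
# first, in your shell or notebook: pip install --upgrade numpy
import numpy as np                     # import numpy before pmdarima
from pmdarima.arima import auto_arima

y = np.random.rand(100)                # dummy time series
model = auto_arima(y, suppress_warnings=True)
print(model.summary())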

Numpy and Scipy matrix inversion functions differences

My question is rather simple: what is the difference between the numpy.linalg.inv and scipy.linalg.inv functions for matrix inversion?
Is the SciPy function just a wrapper of the NumPy one?
Efficiency, numerical stability, speed... which one should I prefer?
Thanks!
From the SciPy Documentation you get the following information:
scipy.linalg vs numpy.linalg
scipy.linalg contains all the functions in numpy.linalg, plus some other more advanced ones not contained in numpy.linalg.
Another advantage of using scipy.linalg over numpy.linalg is that it is always compiled with BLAS/LAPACK support, while for numpy this is optional. Therefore, the scipy version might be faster depending on how numpy was installed.
Therefore, unless you don’t want to add scipy as a dependency to your numpy program, use scipy.linalg instead of numpy.linalg
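For instance, a minimal comparison on a small, well-conditioned matrix (both calls should agree to floating-point precision):
import numpy as np
from scipy import linalg as sla

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])

inv_np = np.linalg.inv(A)    # NumPy's inverse
inv_sp = sla.inv(A)          # SciPy's inverse, always backed by BLAS/LAPACK

print(np.allclose(inv_np, inv_sp))          # True: the results match
print(np.allclose(A @ inv_np, np.eye(2)))   # sanity check: A @ inv(A) == I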
I hope this helps!