I have a pandas DataFrame with two columns. One of the columns contains one NaN value. Creating a histogram gives no warnings, but creating a boxplot gives a numpy VisibleDeprecationWarning. I use this in class and it worked fine for the last couple of years; an advantage of pandas has always been that hist and boxplot work on data containing NaNs. Current versions that throw the warning: numpy 1.19.1, pandas 1.1.0. Is this the intended behavior, or a mismatch between versions? Example code:
%matplotlib inline
import numpy as np
import pandas as pd
data = pd.DataFrame()
data['test1'] = np.random.normal(size=100)
data['test2'] = np.random.normal(size=100)
data.loc[5, 'test1'] = np.nan  # set one value to NaN (avoids chained assignment)
data.boxplot() # throws VisibleDeprecationWarning
Warning:
/anaconda3/lib/python3.8/site-packages/numpy/core/_asarray.py:83: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray
  return array(a, dtype, copy=False, order=order)
The combination of numpy 1.19.1 and pandas 1.1.0 gives the warning. The warning disappears after updating to the latest versions (in this case numpy 1.19.2 and pandas 1.1.3).
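If updating isn't immediately possible (e.g. on a locked-down classroom install), the warning can be silenced just around the plotting call. A minimal sketch, assuming matplotlib is installed; upgrading remains the real fix:

```python
import warnings

import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside a notebook
import numpy as np
import pandas as pd

data = pd.DataFrame()
data['test1'] = np.random.normal(size=100)
data['test2'] = np.random.normal(size=100)
data.loc[5, 'test1'] = np.nan  # set one value to NaN

# Suppress warnings only for this call; the numpy 1.19.1 / pandas 1.1.0
# combination triggers the VisibleDeprecationWarning inside boxplot().
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    ax = data.boxplot()
```

Note this hides every warning raised during the call, so it is best kept as narrow as shown.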
I'm trying to convert a pandas DataFrame to a Polars one.
I simply used the function result_polars = pl.from_pandas(result). The conversion proceeds without errors, but when I check the shapes of the two dataframes, the Polars one has half the size of the original pandas DataFrame.
I believe a length of 4172903059 is close to the maximum that the Polars dataframe allows.
Does anyone have suggestions?
Here is a screenshot of the shapes of the two dataframes.
Here is a minimal working example:
import polars as pl
import pandas as pd
import numpy as np
df = pd.DataFrame(np.zeros((4292903069,1), dtype=np.uint8))
df_polars = pl.from_pandas(df)
Using these dimensions the two dataframes have the same size. If instead I use the following:
import polars as pl
import pandas as pd
import numpy as np
df = pd.DataFrame(np.zeros((4392903069,1), dtype=np.uint8))
df_polars = pl.from_pandas(df)
the Polars dataframe ends up much smaller (97935773 rows).
The default polars wheel retrieved with pip install polars "only" allows for 2^32, i.e. ~4.2 billion rows.
If you need more than that, install polars-u64-idx (pip install polars-u64-idx) and uninstall the previous installation.
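The numbers in the question are consistent with the row count wrapping around a 32-bit index. A quick check, using the lengths from the example above:

```python
# Length used in the failing example and the default 32-bit index limit
length = 4392903069
limit = 2**32          # 4294967296 rows with the default polars wheel

# The reported Polars size equals the length modulo the 32-bit limit
print(length % limit)  # 97935773, matching the observed dataframe size
```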
I use numpy broadcasting to get a differences matrix from a pandas dataframe. I find that when dealing with a large dataframe it reports a "'bool' object has no attribute 'sum'" error, while with a small dataframe it runs fine.
I post the two csv files in the following links:
large file
small file
import numpy as np
import pandas as pd
df_small = pd.read_csv(r'test_small.csv',index_col='Key')
df_small.fillna(0,inplace=True)
a_small = df_small.to_numpy()
matrix = pd.DataFrame((a_small != a_small[:, None]).sum(2), index=df_small.index, columns=df_small.index)
print(matrix)
When running this, I get the difference matrix.
When switching to the large file, it reports the following error. Does anybody know why this happens?
EDIT: The numpy version is 1.19.5:
np.__version__
'1.19.5'
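For reference, the broadcasting trick in the question can be illustrated on a tiny made-up array: diff[i, j] counts the columns in which rows i and j differ. A minimal sketch:

```python
import numpy as np
import pandas as pd

a = np.array([[1, 2],
              [1, 3],
              [4, 2]])

# a[:, None] has shape (3, 1, 2); comparing against a (shape (3, 2))
# broadcasts to (3, 3, 2), and summing over axis 2 counts, for each
# pair of rows, how many columns differ.
diff = (a != a[:, None]).sum(2)
matrix = pd.DataFrame(diff)
print(diff)
# [[0 1 1]
#  [1 0 2]
#  [1 2 0]]
```

Note that the intermediate boolean array has shape (n, n, k), so memory use grows quadratically with the number of rows.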
I would like to know why I would need to convert my dataframe to an ndarray when doing a regression, since I get the same result for the intercept and coefficients when I do not convert it.
import matplotlib.pyplot as plt
import pandas as pd
import pylab as pl
import numpy as np
from sklearn import linear_model
%matplotlib inline
# import data and create dataframe
!wget -O FuelConsumption.csv https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv3/labs/FuelConsumptionCo2.csv
df = pd.read_csv("FuelConsumption.csv")
cdf = df[['ENGINESIZE','CYLINDERS','FUELCONSUMPTION_COMB','CO2EMISSIONS']]
# Split train/ test data
msk = np.random.rand(len(df)) < 0.8
train = cdf[msk]
test = cdf[~msk]
# Modeling
regr = linear_model.LinearRegression()
train_x = np.asanyarray(train[['ENGINESIZE']])
train_y = np.asanyarray(train[['CO2EMISSIONS']])
# if I use the dataframe train[['ENGINESIZE']] for 'x' and train[['CO2EMISSIONS']] for 'y'
# below, I get the same result
regr.fit (train_x, train_y)
# The coefficients
print ('Coefficients: ', regr.coef_)
print ('Intercept: ',regr.intercept_)
Thank you very much!
So df is the loaded dataframe, cdf is another frame with selected columns, and train is selected rows.
train[['ENGINESIZE']] is a 1 column dataframe (I believe train['ENGINESIZE'] would be a pandas Series).
I believe the preferred syntax for getting an array from the dataframe is:
train[['ENGINESIZE']].values # or
train[['ENGINESIZE']].to_numpy()
though
np.asanyarray(train[['ENGINESIZE']])
is supposed to do the same thing.
Digging down through the regr.fit code, I see that it calls sklearn.utils.check_X_y, which in turn calls sklearn.utils.check_array. That takes care of converting the inputs to numpy arrays, with some awareness of pandas dataframe peculiarities (such as multiple dtypes).
So it appears that if fit accepts your dataframes, you don't need to convert them ahead of time. But if you can get a nice array from the dataframe, there's no harm in doing that either. Either way, the fit is done with arrays derived from the dataframe.
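To see the equivalence concretely, here is a minimal sketch with made-up numbers standing in for the fuel-consumption data: fitting on the 1-column dataframes and on the arrays extracted from them gives the same coefficients.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical stand-in for the FuelConsumption data
df = pd.DataFrame({'ENGINESIZE': [1.5, 2.0, 3.0, 4.0],
                   'CO2EMISSIONS': [150.0, 200.0, 300.0, 400.0]})

# Fit directly on the 1-column dataframes ...
m_df = LinearRegression().fit(df[['ENGINESIZE']], df[['CO2EMISSIONS']])
# ... and on the arrays extracted from them
m_arr = LinearRegression().fit(df[['ENGINESIZE']].to_numpy(),
                               df[['CO2EMISSIONS']].to_numpy())

print(np.allclose(m_df.coef_, m_arr.coef_))            # True
print(np.allclose(m_df.intercept_, m_arr.intercept_))  # True
```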
Looking for confirmation or correction. It appears to me that, as long as I do this:
from pandas.plotting import register_matplotlib_converters
register_matplotlib_converters()
then I can pass a Pandas DatetimeIndex (which contains Pandas dates of type timestamps.Timestamp) directly in as the x-coordinate for Axes.plot like this:
In [4]: df
Out[4]:
Open High Low Close Volume AdjClose
Date
2015-12-24 2063.52 2067.36 2058.73 2060.99 1411860000 2060.99
2015-12-23 2042.20 2064.73 2042.20 2064.29 3484090000 2064.29
2015-12-22 2023.15 2042.74 2020.49 2038.97 3520860000 2038.97
2015-12-21 2010.27 2022.90 2005.93 2021.15 3760280000 2021.15
2015-12-18 2040.81 2040.81 2005.33 2005.55 6683070000 2005.55
In [5]: type(df.index)
Out[5]: pandas.core.indexes.datetimes.DatetimeIndex
In [6]: type(df.index[0])
Out[6]: pandas._libs.tslibs.timestamps.Timestamp
...
In [11]: ax.plot( df.index, df['Close'] )
and the x-axis dates work fine. But when building a plot directly using
ax.add_line(lines)
where lines[] is a list of
Line2D(xdata,ydata)
items (for example, as can be seen here: https://github.com/matplotlib/mpl_finance/blob/master/mpl_finance.py#L133-L154), then the xdata must already be converted to matplotlib dates (floats counting days since 01/01/01), doing something like this:
xdata = mdates.date2num(df.index.to_pydatetime())
Is this correct that the ax.plot() automatically converts Pandas dates, but the lower level APIs do not? Or am I missing something?
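A small sketch of the conversion step, using made-up dates. (The epoch that date2num counts from has changed across matplotlib versions, so only day differences are checked here, not absolute values.)

```python
import matplotlib.dates as mdates
import pandas as pd

idx = pd.DatetimeIndex(['2015-12-21', '2015-12-22', '2015-12-24'])

# Convert pandas Timestamps to matplotlib's float day numbers
xdata = mdates.date2num(idx.to_pydatetime())

# Consecutive calendar days are exactly 1.0 apart
print(xdata[1] - xdata[0])  # 1.0
print(xdata[2] - xdata[1])  # 2.0
```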
Also, to add something to this (based on the first couple comments) ...
If I don't register the converters, I get this warning:
In [7]: ax.plot( df.index, df['Close'])
/anaconda3/lib/python3.6/site-packages/pandas/plotting/_converter.py:129: FutureWarning: Using an implicitly registered datetime converter for a matplotlib plotting method. The converter was registered by pandas on import. Future versions of pandas will require you to explicitly register matplotlib converters.
To register the converters:
>>> from pandas.plotting import register_matplotlib_converters
>>> register_matplotlib_converters()
warnings.warn(msg, FutureWarning)
Out[7]: [<matplotlib.lines.Line2D at 0x12a437f98>]
I'm a little confused by the last line (Out[7]) ... does that mean that the code was inside Line2D when this warning was printed?
As described here, pandas' sort_index() sometimes emits a FutureWarning when sorting on a DatetimeIndex. That question isn't actionable, since it contains no MCVE. Here's one:
import pandas as pd
idx = pd.DatetimeIndex(['2017-07-05 07:00:00', '2018-07-05 07:15:00','2017-07-05 07:30:00'])
df = pd.DataFrame({'C1':['a','b','c']},index=idx)
df = df.tz_localize('UTC')
df.sort_index()
The warning looks like:
FutureWarning: Converting timezone-aware DatetimeArray to
timezone-naive ndarray with 'datetime64[ns]' dtype
The stack (Pandas 0.24.1) is:
__array__, datetimes.py:358
asanyarray, numeric.py:544
nargsort, sorting.py:257
sort_index, frame.py:4795
The warning is emitted from datetimes.py, requesting that it be called with a dtype argument. However, there's no way to force that all the way up through nargsort; it looks like obeying datetimes.py's request would require changes to both pandas and numpy.
Reported here. In the meantime, can you think of a workaround that I've missed?
Issue confirmed for the 0.24.2 milestone. The workaround is to filter the warning, thus:
import warnings

with warnings.catch_warnings():
    # pandas 0.24.1 emits a useless warning when sorting a tz-aware index
    warnings.simplefilter("ignore")
    ds = df.sort_index()
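Putting the MCVE and the filter together as one runnable sketch:

```python
import warnings

import pandas as pd

idx = pd.DatetimeIndex(['2017-07-05 07:00:00', '2018-07-05 07:15:00',
                        '2017-07-05 07:30:00'])
df = pd.DataFrame({'C1': ['a', 'b', 'c']}, index=idx).tz_localize('UTC')

with warnings.catch_warnings():
    # pandas 0.24.1 emits a useless warning when sorting a tz-aware index
    warnings.simplefilter("ignore")
    ds = df.sort_index()

print(ds.index.is_monotonic_increasing)  # True
```

The sort itself is correct either way; the filter only keeps the spurious warning out of the output.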