Pandas ValueError: buffer source array is read-only - pandas

I am trying to read a Parquet file into a Pandas dataframe. Using the APIs below (or even the pd.read_parquet() wrapper), I am hit by ValueError: buffer source array is read-only.
Having searched around online, it seems to relate to Cython not supporting read-only buffers, but I couldn't find any solution to this problem.
How can I read a Parquet file into a Pandas dataframe when the API throws ValueError: buffer source array is read-only?
In [1]: import pandas as pd
...: import numpy as np
...: import pyarrow as pa
...: import pyarrow.parquet as pq
In [2]: table = pq.read_table('Parquet/Journal.parquet', columns=['SOURCE_CODE','YEAR','MONTH','AMOUNT'])
In [3]: df = table.to_pandas()
In [4]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 85326489 entries, 0 to 85326488
Data columns (total 4 columns):
AMOUNT float64
SOURCE_CODE category
YEAR category
MONTH category
dtypes: category(3), float64(1)
memory usage: 895.1 MB
In [5]: df.groupby(['SOURCE_CODE','YEAR','MONTH'])['AMOUNT'].sum()

This is a bug in the latest release of pandas (0.23.x) and will be solved in pandas 0.24+. This issue was already reported by other users: https://github.com/pandas-dev/pandas/issues/23276 and is fixed through the following pull request: https://github.com/pandas-dev/pandas/pull/21688
For the sane fix, you need to wait for a new pandas release or manually install the git master. As a workaround you might be able to fix this by adding a dummy float column via df['__dummy__'] = np.nan. This will force pandas' BlockManager to reorder the float columns and should turn AMOUNT into a writable column.

I resolved this by adding a single line before applying groupby.
df = df.copy() #add this line
df.groupby(['SOURCE_CODE','YEAR','MONTH'])['AMOUNT'].sum()
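As a rough illustration of why a copy can help, here is a minimal NumPy-only sketch. It assumes that the column data handed over by pyarrow is backed by a read-only buffer, which is what the error message suggests but which I have not verified against the Parquet file in the question. The array below is made read-only by hand purely for demonstration.

```python
import numpy as np

# Stand-in for a column whose buffer is read-only (assumption: this mirrors
# what table.to_pandas() produced in the question).
values = np.arange(5, dtype=np.float64)
values.setflags(write=False)

# Writing into a read-only array raises ValueError.
write_failed = False
try:
    values[0] = 1.0
except ValueError:
    write_failed = True

# A copy owns fresh memory and is writable again.
writable = values.copy()
writable[0] = 1.0
```

df.copy() plays the same role at the DataFrame level: it allocates new, writable blocks, at the cost of temporarily doubling the memory used.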

Related

Dataframe conversion from pandas to polars -- difference in the final dimensions

I'm trying to convert a Pandas DataFrame to a Polars one.
I simply used result_polars = pl.from_pandas(result). The conversion runs without errors, but when I check the shape of the two dataframes, the Polars one has half the size of the original Pandas DataFrame.
I believe that 4172903059 in length is almost the maximum dimension that the polars dataframe allows.
Does anyone have suggestions?
Here is a screenshot of the shape of the two dataframes.
Here is a minimal working example:
import polars as pl
import pandas as pd
import numpy as np
df = pd.DataFrame(np.zeros((4292903069,1), dtype=np.uint8))
df_polars = pl.from_pandas(df)
Using these dimensions the two dataframes have the same size. If instead I put the following:
import polars as pl
import pandas as pd
import numpy as np
df = pd.DataFrame(np.zeros((4392903069,1), dtype=np.uint8))
df_polars = pl.from_pandas(df)
The Polars dataframe has a much smaller number of rows (97935773).
The default polars wheel retrieved with pip install polars "only" allows for 2^32, i.e. ~4.3 billion, rows.
If you need more than that, install the 64-bit index build with pip install polars-u64-idx and uninstall the previous installation.
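The row counts in the question are consistent with a 32-bit row index wrapping around. The short check below is only arithmetic on the numbers quoted above; reading the result as an overflow of the index type is my inference, not something the answer states explicitly.

```python
# Number of distinct values of an unsigned 32-bit row index.
U32_LIMIT = 2**32  # 4294967296

fits = 4292903069       # first example: shapes matched
too_large = 4392903069  # second example: Polars reported 97935773 rows

# The first size is below the limit, the second is above it.
fits_in_u32 = fits < U32_LIMIT
overflows_u32 = too_large >= U32_LIMIT

# Wrapping the larger size modulo 2**32 reproduces the reported row count.
wrapped = too_large % U32_LIMIT
print(wrapped)  # 97935773
```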

MATLAB .mat in Pandas DataFrame to be used in Tensorflow

I have gone days trying to figure this out, hopefully someone can help.
I am uploading a .mat file into python using scipy.io, placing the struct into a dataframe, which will then be used in Tensorflow.
from scipy.io import loadmat
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
#import TF
path = '/home/anthony/PycharmProjects/Deep_Learning_MATLAB/circuit-data/for tinghao/template1-lib5-eqns-CR-RESULTS-SET1-FINAL.mat'
raw_data = loadmat(path, squeeze_me=True)
data = raw_data['Graphs']
df = pd.DataFrame(data, dtype=int)
df.pop('transferFunc')
print(df.dtypes)
The output is:
A object
Ln object
types object
nz int64
np int64
dtype: object
Process finished with exit code 0
The struct is (43249x6). Each cell in the 'A' column is a different-sized matrix, e.g. 18x18 or 16x16. Each cell in 'Ln' is a row of letters, each in its own separate cell. Each cell in 'types' contains 12 columns of numbers. I have no issues with 'nz' and 'np'.
I want to put all columns into a dataframe and use column 'A', 'Ln' or 'types' as the labels and 'nz' and 'np' as the features; again, I do not have issues with the latter. Can anyone help with this or suggest some kind of workaround?
The end goal is to have TensorFlow train on nz and np and give me either a matrix, Ln, or types.
What type of data does your .mat file contain? Is your application very time critical?
If you can collect all your data in a struct you could give jsonencode a try, make the struct a json file and load it back into python via json (see json documentation on loading data).
Then you can create a pandas dataframe via
pd.DataFrame.from_dict()
Of course this would only be a workaround. You would still have to ensure that the data in the MATLAB struct is correctly ordered before it is imported and transferred to a df.
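A minimal sketch of the Python side of that workaround follows. The JSON text is a made-up stand-in for what jsonencode might write for a struct with two numeric fields; the field values are hypothetical, and in practice you would read the text from the file MATLAB produced instead of an inline string.

```python
import json

import pandas as pd

# Hypothetical JSON, standing in for the output of MATLAB's jsonencode
# on a struct whose fields are equal-length numeric vectors.
json_text = '{"nz": [3, 5, 2], "np": [2, 4, 1]}'

# json.loads turns the text into a plain dict of lists.
columns = json.loads(json_text)

# Each key becomes a column, each list a column's values.
df = pd.DataFrame.from_dict(columns)
print(df.dtypes)
```

This only covers flat, equal-length fields. Nested or variable-sized entries such as the matrices in 'A' would end up as object columns and need separate handling.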
raw_data = loadmat(path, squeeze_me=True)
data = raw_data['Graphs']
graph_labels = pd.DataFrame()
graph_labels['perf'] = raw_data['Objective'][0:1000]
graph_labels['np'] = data['np'][0:1000]
The code above helped out. It's very simple and drawn out, but it got the job done. However, it does not work in TensorFlow because TensorFlow does not accept this format, and that was my main issue. I have to convert the adjacency matrices to networkx graphs, then load them into stellargraph.

warning when creating boxplot with nans in pandas

I have a pandas dataframe with two columns. One of the columns contains one nan value. Creating a histogram gives no warnings, but creating a boxplot gives a numpy VisibleDeprecationWarning. I use this in class and it worked fine the last couple of years. The advantage of pandas was always that hist and boxplot worked on data with nans. Current versions that emit the warning: numpy 1.19.1, pandas 1.1.0. Is this the intended behavior? A mismatch between versions? Example code:
%matplotlib inline
import numpy as np
import pandas as pd
data = pd.DataFrame()
data['test1'] = np.random.normal(size=100)
data['test2'] = np.random.normal(size=100)
data.loc[5, 'test1'] = np.nan # set one value to nan
data.boxplot() # throws VisibleDeprecationWarning
Warning:
/anaconda3/lib/python3.8/site-packages/numpy/core/_asarray.py:83: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray
return array(a, dtype, copy=False, order=order)
The combination of numpy 1.19.1 and pandas 1.1.0 gives the warning. The warning disappears after updating to the latest versions (in this case numpy 1.19.2 and pandas 1.1.3).

create a dask dataframe from a dictionary

I have a dictionary like this:
d = {'Caps': 'cap_list', 'Term': 'unique_tokens', 'LocalFreq': 'local_freq_list','CorpusFreq': 'corpus_freq_list'}
I want to create a dask dataframe from it. How do I do it? Normally, in Pandas, it can be easily imported into a Pandas df by:
df = pd.DataFrame({'Caps': cap_list, 'Term': unique_tokens, 'LocalFreq': local_freq_list,
'CorpusFreq': corpus_freq_list})
Should I first load into a bag and then convert from bag to ddf?
If your data fits in memory then I encourage you to use Pandas instead of Dask Dataframe.
If for some reason you still want to use Dask dataframe then I would convert things to a Pandas dataframe and then use the dask.dataframe.from_pandas function.
import dask.dataframe as dd
import pandas as pd
df = pd.DataFrame(...)
ddf = dd.from_pandas(df, npartitions=20)
But there are many cases where this will be slower than just using Pandas well.

When does matplotlib (or which matplotlib APIs) automatically convert Pandas timestamps to matplotlib dates?

Looking for confirmation or correction. It appears to me, that as long as I do this:
from pandas.plotting import register_matplotlib_converters
register_matplotlib_converters()
then I can pass a Pandas DatetimeIndex (which contains Pandas dates of type timestamps.Timestamp) directly in as the x-coordinate for Axes.plot like this:
In [4]: df
Out[4]:
Open High Low Close Volume AdjClose
Date
2015-12-24 2063.52 2067.36 2058.73 2060.99 1411860000 2060.99
2015-12-23 2042.20 2064.73 2042.20 2064.29 3484090000 2064.29
2015-12-22 2023.15 2042.74 2020.49 2038.97 3520860000 2038.97
2015-12-21 2010.27 2022.90 2005.93 2021.15 3760280000 2021.15
2015-12-18 2040.81 2040.81 2005.33 2005.55 6683070000 2005.55
In [5]: type(df.index)
Out[5]: pandas.core.indexes.datetimes.DatetimeIndex
In [6]: type(df.index[0])
Out[6]: pandas._libs.tslibs.timestamps.Timestamp
...
In [11]: ax.plot( df.index, df['Close'] )
and the x-axis dates work fine. But when building a plot directly using
ax.add_line(lines)
where lines[] is a list of
Line2D(xdata,ydata)
items, (for example, as can be seen here: https://github.com/matplotlib/mpl_finance/blob/master/mpl_finance.py#L133-L154 ) then the xdata must already be converted to matplotlib dates (floats as number of days since 01/01/01) doing something like this:
xdata = mdates.date2num(df.index.to_pydatetime())
Is this correct that the ax.plot() automatically converts Pandas dates, but the lower level APIs do not? Or am I missing something?
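For reference, the manual conversion I am describing looks roughly like the sketch below. The dates and closing prices are taken from the frame above; the Agg backend and the calls to xaxis_date() and autoscale_view() are my own additions so that the snippet runs headless and shows date ticks, not something taken from mpl_finance.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display

import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import pandas as pd
from matplotlib.lines import Line2D

index = pd.to_datetime(["2015-12-21", "2015-12-22", "2015-12-23", "2015-12-24"])
close = [2021.15, 2038.97, 2064.29, 2060.99]

fig, ax = plt.subplots()

# Convert the DatetimeIndex to matplotlib date numbers (floats, in days).
xdata = mdates.date2num(index.to_pydatetime())

# Build the line by hand and attach it, instead of calling ax.plot().
line = Line2D(xdata, close)
ax.add_line(line)

ax.xaxis_date()       # format the float x values as dates on the axis
ax.autoscale_view()   # add_line() does not rescale the view by itself
```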
Also, to add something to this (based on the first couple comments) ...
If I don't register the converters, I get this warning:
In [7]: ax.plot( df.index, df['Close'])
/anaconda3/lib/python3.6/site-packages/pandas/plotting/_converter.py:129: FutureWarning: Using an implicitly registered datetime converter for a matplotlib plotting method. The converter was registered by pandas on import. Future versions of pandas will require you to explicitly register matplotlib converters.
To register the converters:
>>> from pandas.plotting import register_matplotlib_converters
>>> register_matplotlib_converters()
warnings.warn(msg, FutureWarning)
Out[7]: [<matplotlib.lines.Line2D at 0x12a437f98>]
I'm a little confused by the last line (Out[7]) ... does that mean that the code was inside Line2D when this warning was printed?