Pandas 0.24.0 breaks my pandas dataframe with special column identifiers

I had code that worked fine until I tried to run it on a coworker's machine, whereupon I discovered that while it worked using pandas 0.22.0, it broke on pandas 0.24.0. For the moment, we've solved this problem by downgrading their copy of pandas, but I would like to find a better solution if one exists.
The problem seems to be that I am using instances of a user-defined class as column identifiers in the dataframe. When comparing two dataframes, pandas for some reason tries to call my column labels as functions, and then throws an exception because they aren't callable.
Here's some example code:
import pandas as pd
import numpy as np

class label(object):
    def __init__(self, var):
        self.var = var

    def __eq__(self, other):
        return self.var == other.var

df = pd.DataFrame(np.eye(5), columns=[label(ii) for ii in range(5)])
df == df
This produces the following stack trace:
Traceback (most recent call last):
File "<ipython-input-4-496e4ab3f9d9>", line 1, in <module>
df == df
File "C:\...\site-packages\pandas\core\ops.py", line 2098, in f
return dispatch_to_series(self, other, func, str_rep)
File "C:\...\site-packages\pandas\core\ops.py", line 1157, in dispatch_to_series
new_data = expressions.evaluate(column_op, str_rep, left, right)
File "C:\...\site-packages\pandas\core\computation\expressions.py", line 208, in evaluate
return _evaluate(op, op_str, a, b, **eval_kwargs)
File "C:\...\site-packages\pandas\core\computation\expressions.py", line 68, in _evaluate_standard
return op(a, b)
File "C:\...\site-packages\pandas\core\ops.py", line 1135, in column_op
for i in range(len(a.columns))}
File "C:\...\site-packages\pandas\core\ops.py", line 1135, in <dictcomp>
for i in range(len(a.columns))}
File "C:\...\site-packages\pandas\core\ops.py", line 1739, in wrapper
name=res_name).rename(res_name)
File "C:\...\site-packages\pandas\core\series.py", line 3733, in rename
return super(Series, self).rename(index=index, **kwargs)
File "C:\...\site-packages\pandas\core\generic.py", line 1091, in rename
level=level)
File "C:\...\site-packages\pandas\core\internals\managers.py", line 171, in rename_axis
obj.set_axis(axis, _transform_index(self.axes[axis], mapper, level))
File "C:\...\site-packages\pandas\core\internals\managers.py", line 2004, in _transform_index
items = [func(x) for x in index]
TypeError: 'label' object is not callable
I've found I can fix the problem by making my class callable, taking a single argument and returning it unchanged, but that breaks .loc indexing, which then defaults to treating my objects as callables.
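For reference, that workaround is just an identity __call__ on the label class (a sketch of the approach described above):
class label(object):
    def __init__(self, var):
        self.var = var

    def __eq__(self, other):
        return self.var == other.var

    def __call__(self, x):
        # Identity mapping: satisfies pandas when it invokes the label as a
        # mapper during rename, but it also makes .loc treat these objects
        # as callable indexers.
        return x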
This problem only occurs when the custom objects are in the columns - the index can handle them just fine.
Is this a bug or a change in usage, and is there any way I can work around it without giving up my custom labels?

pandas value_counts() with IntEnum raises RecursionError

I wrote the following code to illustrate my problem. I'm using Python 3.6 with pandas==0.25.3.
import pandas as pd
from enum import Enum, IntEnum

class BookType(Enum):
    DRAMA = 5
    ROMAN = 3

class AuthorType(IntEnum):
    UNKNOWN = 0
    GROUP = 1
    MAN = 2

def print_num_type(df: pd.DataFrame, col_name: str, enum_type: Enum) -> None:
    counts = df[col_name].value_counts()
    val = counts[enum_type]
    print('value counts:', counts)
    print(f'Found "{val}" of type {enum_type}')

d = {'title': ['Charly Morry', 'James', 'Watson', 'Marry L.'],
     'isbn': [21412412, 334764712, 12471021, 124141111],
     'book_type': [BookType.DRAMA, BookType.ROMAN, BookType.ROMAN, BookType.ROMAN],
     'author_type': [AuthorType.UNKNOWN, AuthorType.UNKNOWN, AuthorType.MAN, AuthorType.UNKNOWN]}
df = pd.DataFrame(data=d)
df.set_index(['title', 'isbn'], inplace=True)
df['book_type'] = df['book_type'].astype('category')
df['author_type'] = df['author_type'].astype('category')
print(df)
print(df.dtypes)
print_num_type(df, 'book_type', BookType.DRAMA)
print_num_type(df, 'author_type', AuthorType.UNKNOWN)
My pandas.DataFrame consists of two categorical columns, book_type and author_type.
The book_type values are members of a class inheriting from Enum, and the author_type values of a class inheriting from IntEnum. When calling print_num_type(df, 'book_type', BookType.DRAMA), everything works as expected and the number of books of that type is printed, whereas print_num_type(df, 'author_type', AuthorType.UNKNOWN) raises the error:
Traceback (most recent call last):
File "C:\Users\User\AppData\Local\Programs\Python\Python36-32\lib\abc.py", line 182, in __instancecheck__
if subclass in cls._abc_cache:
File "C:\Users\User\AppData\Local\Programs\Python\Python36-32\lib\_weakrefset.py", line 72, in __contains__
wr = ref(item)
RecursionError: maximum recursion depth exceeded while calling a Python object
Exception ignored in: 'pandas._libs.lib.c_is_list_like'
Traceback (most recent call last):
File "C:\Users\User\AppData\Local\Programs\Python\Python36-32\lib\abc.py", line 182, in __instancecheck__
if subclass in cls._abc_cache:
File "C:\Users\User\AppData\Local\Programs\Python\Python36-32\lib\_weakrefset.py", line 72, in __contains__
wr = ref(item)
RecursionError: maximum recursion depth exceeded while calling a Python object
What am I doing wrong here?
Is there a workaround to fix this error? I can't change AuthorType's IntEnum base, since it's provided by another library.
Thanks in advance!
See answer here
The main idea is that since x.value_counts() (counts in your function) is itself a pandas Series, it's best to index it positionally with .iat or .iloc; see the iat docs.
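A minimal sketch of that approach, using the author_type column from your example (note that position 0 is simply the most frequent value in the descending counts, not AuthorType.UNKNOWN specifically):
counts = df['author_type'].value_counts()
# Positional access never indexes the Series with an IntEnum member,
# which is what sets off the RecursionError.
val = counts.iat[0]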
I think the easiest solution is to just use (x==0).sum(), or in your syntax:
val = (df[col_name]==enum_type).sum()
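With the sample data above, this returns 3 for AuthorType.UNKNOWN and never indexes the Series by the enum member at all.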
I put a minimal working example in the comments under your question so you can reproduce the problem/fix easily with the "x" notation.
What version of Pandas are you using? I realized after reproducing the error that upgrading Pandas (now on pandas-1.4.2) fixes it, and value_counts()[0] then worked as expected.
Run pip install --upgrade pandas

TypeError: 1st argument must be a real sequence 2 (signal.spectrogram)

I'm trying to take a signal from an electrical reading and decompose it into its spectrogram, but I keep getting a weird error. Here is the code:
import matplotlib.pyplot as plt
from scipy import signal

f, t, Sxx = signal.spectrogram(i_data.values, 130)
plt.pcolormesh(t, f, Sxx)
plt.ylabel('Frequency [Hz]')
plt.xlabel('Time [sec]')
plt.show()
And here is the error:
convert_to_spectrogram(i_data.iloc[1000,:10020].dropna().values)
Traceback (most recent call last):
File "<ipython-input-140-e5951b2d2d97>", line 1, in <module>
convert_to_spectrogram(i_data.iloc[1000,:10020].dropna().values)
File "<ipython-input-137-5d63a96c8889>", line 2, in convert_to_spectrogram
f, t, Sxx = signal.spectrogram(wf, 130)
File "//anaconda3/lib/python3.7/site-packages/scipy/signal/spectral.py", line 750, in spectrogram
mode='psd')
File "//anaconda3/lib/python3.7/site-packages/scipy/signal/spectral.py", line 1836, in _spectral_helper
result = _fft_helper(x, win, detrend_func, nperseg, noverlap, nfft, sides)
File "//anaconda3/lib/python3.7/site-packages/scipy/signal/spectral.py", line 1921, in _fft_helper
result = func(result, n=nfft)
File "//anaconda3/lib/python3.7/site-packages/mkl_fft/_numpy_fft.py", line 335, in rfft
output = mkl_fft.rfft_numpy(x, n=n, axis=axis)
File "mkl_fft/_pydfti.pyx", line 609, in mkl_fft._pydfti.rfft_numpy
File "mkl_fft/_pydfti.pyx", line 502, in mkl_fft._pydfti._rc_fft1d_impl
TypeError: 1st argument must be a real sequence 2
My reading has a full cycle of 130 observations, and it's stored as individual values of a pandas df. The particular wave I am using can be found here. Does anyone have any ideas what this error means?
(Small disclaimer: I do not know much about signal processing, so please forgive me if this is a naive question.)
Python 3.6.9, scipy 1.3.3
After downloading your file and reading it with pandas.read_csv, I could generate the following spectrogram:
import matplotlib.pyplot as plt
import pandas as pd
from scipy.signal import spectrogram
i_data = pd.read_csv('wave.csv')
f, t, Sxx = spectrogram(i_data.values[:, 1], 130)
plt.pcolormesh(t, f, Sxx)
plt.ylabel('Frequency [Hz]')
plt.xlabel('Time [sec]')
plt.show()
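Note the values[:, 1] selection: only the numeric reading column is handed to spectrogram. Presumably, passing i_data.values wholesale (which would still include the timestamp column, making the array object-dtype rather than real-valued) is one way to hit the "must be a real sequence" TypeError from the FFT backend.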

Pandas to_hdf fails on dataframes containing nullable int dtypes (e.g. Int8Dtype)

I'm trying to reduce the memory consumption of some large data that we work with, so that more data can be appended to it without throwing memory errors. Downcasting floats where possible helps a little, but the major savings I've found come from casting float64s to Int8 and Int16 where possible. This data contains NaNs. That is unavoidable, and in context there is no value I can replace the NaNs with that doesn't change the meaning of the data. The new nullable dtypes are great for this, but I get ValueError: cannot convert float NaN to integer when trying to save the resulting frames to hdf.
I've tried using to_hdf with and without specifying table format, and get different errors (without specifying table format the error is AttributeError: 'NoneType' object has no attribute 'names').
import numpy as np
import pandas as pd

df = pd.DataFrame([1, 2, 3, np.nan, 5], columns=['A'])
df.to_hdf('Z:/test.hd5', 'data')
# This works

df['A'] = df.A.astype(pd.Int8Dtype())
df.to_hdf('Z:/test.hd5', 'data')
Traceback (most recent call last):
File "<ipython-input-51-6b0f3ad26286>", line 1, in <module>
df.to_hdf('Z:/test.hd5', 'data', complevel=9, complib='blosc:zlib')
File "C:\Users\marnoch.hamilton-jon\AppData\Local\Continuum\anaconda3 \lib\site-packages\pandas\core\generic.py", line 2377, in to_hdf
return pytables.to_hdf(path_or_buf, key, self, **kwargs)
File "C:\Users\marnoch.hamilton-jon\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\io\pytables.py", line 274, in to_hdf
f(store)
File "C:\Users\marnoch.hamilton-jon\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\io\pytables.py", line 268, in <lambda>
f = lambda store: store.put(key, value, **kwargs)
File "C:\Users\marnoch.hamilton-jon\AppData\Local\Continuum\anaconda3 \lib\site-packages\pandas\io\pytables.py", line 889, in put
self._write_to_group(key, value, append=append, **kwargs)
File "C:\Users\marnoch.hamilton-jon\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\io\pytables.py", line 1415, in _write_to_group
s.write(obj=value, append=append, complib=complib, **kwargs)
File "C:\Users\marnoch.hamilton-jon\AppData\Local\Continuum\anaconda3 \lib\site-packages\pandas\io\pytables.py", line 3022, in write
blk.values, items=blk_items)
File "C:\Users\marnoch.hamilton-jon\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\io\pytables.py", line 2750, in write_array
atom = _tables().Atom.from_dtype(value.dtype)
File "C:\Users\marnoch.hamilton-jon\AppData\Local\Continuum\anaconda3\lib\site-packages\tables\atom.py", line 381, in from_dtype
if basedtype.names:
AttributeError: 'NoneType' object has no attribute 'names'
Is this a bug? An intentional limitation? Or have I done something dumb?
This is a bug. See GitHub Issue #26144 for the status.
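Until that fix lands, a possible workaround (my own suggestion, not from the issue thread) is to cast the nullable integer columns back to float64 just for storage, since, as the first example above shows, a float64 column containing NaN writes fine:
import numpy as np
import pandas as pd

df = pd.DataFrame([1, 2, 3, np.nan, 5], columns=['A'])
df['A'] = df.A.astype(pd.Int8Dtype())

# Write a float64 copy; the NaNs survive, and the column can be cast
# back with .astype(pd.Int8Dtype()) after reading.
df.astype({'A': 'float64'}).to_hdf('Z:/test.hd5', 'data')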

Pandas Group Example Errors

I am trying to replicate an example from Wes McKinney's book on pandas; the code is below (it assumes all the names data files are under a names folder).
# -*- coding: utf-8 -*-
import numpy as np
import pandas as pd

years = range(1880, 2011)
pieces = []
columns = ['name', 'sex', 'births']

for year in years:
    path = 'names/yob%d.txt' % year
    frame = pd.read_csv(path, names=columns)
    frame['year'] = year
    pieces.append(frame)

names = pd.concat(pieces, ignore_index=True)
names

def get_tops(group):
    return group.sort_index(by='births', ascending=False)[:1000]

grouped = names.groupby(['year', 'sex'])
grouped.apply(get_tops)
I am using Pandas 0.10 and Python 2.7. The error I am seeing is this:
Traceback (most recent call last):
File "names.py", line 21, in <module>
grouped.apply(get_tops)
File "/usr/local/lib/python2.7/dist-packages/pandas-0.10.0-py2.7-linux-i686.egg/pandas/core/groupby.py", line 321, in apply
return self._python_apply_general(f)
File "/usr/local/lib/python2.7/dist-packages/pandas-0.10.0-py2.7-linux-i686.egg/pandas/core/groupby.py", line 324, in _python_apply_general
keys, values, mutated = self.grouper.apply(f, self.obj, self.axis)
File "/usr/local/lib/python2.7/dist-packages/pandas-0.10.0-py2.7-linux-i686.egg/pandas/core/groupby.py", line 585, in apply
values, mutated = splitter.fast_apply(f, group_keys)
File "/usr/local/lib/python2.7/dist-packages/pandas-0.10.0-py2.7-linux-i686.egg/pandas/core/groupby.py", line 2127, in fast_apply
results, mutated = lib.apply_frame_axis0(sdata, f, names, starts, ends)
File "reduce.pyx", line 421, in pandas.lib.apply_frame_axis0 (pandas/lib.c:24934)
File "/usr/local/lib/python2.7/dist-packages/pandas-0.10.0-py2.7-linux-i686.egg/pandas/core/frame.py", line 2028, in __setattr__
self[name] = value
File "/usr/local/lib/python2.7/dist-packages/pandas-0.10.0-py2.7-linux-i686.egg/pandas/core/frame.py", line 2043, in __setitem__
self._set_item(key, value)
File "/usr/local/lib/python2.7/dist-packages/pandas-0.10.0-py2.7-linux-i686.egg/pandas/core/frame.py", line 2078, in _set_item
value = self._sanitize_column(key, value)
File "/usr/local/lib/python2.7/dist-packages/pandas-0.10.0-py2.7-linux-i686.egg/pandas/core/frame.py", line 2112, in _sanitize_column
raise AssertionError('Length of values does not match '
AssertionError: Length of values does not match length of index
Any ideas?
I think this was a bug introduced in 0.10, namely issue #2605, "AssertionError when using apply after GroupBy". It has since been fixed.
You can either wait for the 0.10.1 release, which shouldn't be too long from now, or you can upgrade to the development version (either via git or simply by downloading the zip of master).

TypeError on read_csv, working in pandas 0.7, error in 0.8.0rc2, possible dependency error?

I am attempting to execute the following within python:
from pandas import *
tickdata = read_csv('/home/user/sourcefile.csv',index_col=0,parse_dates='TRUE')
The csv files has rows that look like:
2011/11/23 23:56:00.554389,1165.2500
2011/11/23 23:56:02.310943,1165.5000
2011/11/23 23:56:05.564009,1165.2500
On pandas 0.7, this executes fine. On pandas 0.8.0rc2, I get the error below. Because I have 0.7 and 0.8 installed on two different systems, I have not ruled out a dependency or Python version difference. Any ideas on how to get this to execute under 0.8 are appreciated.
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python2.7/dist-packages/pandas-0.8.0rc2-py2.7-linux-x86_64.egg/pandas/io/parsers.py", line 225, in read_csv
return _read(TextParser, filepath_or_buffer, kwds)
File "/usr/local/lib/python2.7/dist-packages/pandas-0.8.0rc2-py2.7-linux-x86_64.egg/pandas/io/parsers.py", line 192, in _read
return parser.get_chunk()
File "/usr/local/lib/python2.7/dist-packages/pandas-0.8.0rc2-py2.7-linux-x86_64.egg/pandas/io/parsers.py", line 728, in get_chunk
index = self._agg_index(index)
File "/usr/local/lib/python2.7/dist-packages/pandas-0.8.0rc2-py2.7-linux-x86_64.egg/pandas/io/parsers.py", line 846, in _agg_index
if try_parse_dates and self._should_parse_dates(self.index_col):
File "/usr/local/lib/python2.7/dist-packages/pandas-0.8.0rc2-py2.7-linux-x86_64.egg/pandas/io/parsers.py", line 874, in _should_parse_dates
return i in to_parse or name in to_parse
TypeError: 'in <string>' requires string as left operand, not int
I fixed the parser bug shown in the stack trace you pasted. However, I'm wondering whether your date column is actually named "TRUE", or did you mean to just pass a boolean? I haven't dug through the pandas history, but I know that in 0.8 we support much more complex date-parsing behavior as part of the time series API, so here we're interpreting the string value as a column name.
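If a boolean was the intent, the call would presumably look like this (a sketch, assuming the first column holds the timestamps):
from pandas import read_csv

# parse_dates expects a boolean (or a list of column names/indices),
# not the string 'TRUE'; True parses the index column as dates.
tickdata = read_csv('/home/user/sourcefile.csv', index_col=0, parse_dates=True)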
I've reported the bug on GitHub (best place for bug reports):
https://github.com/pydata/pandas/issues/1544
Should have a resolution today or tomorrow.