Pandas Group Example Errors

I am trying to replicate an example from Wes McKinney's book on pandas. The code is below (it assumes all the names data files are under a names/ folder):
# -*- coding: utf-8 -*-
import numpy as np
import pandas as pd

years = range(1880, 2011)
pieces = []
columns = ['name', 'sex', 'births']
for year in years:
    path = 'names/yob%d.txt' % year
    frame = pd.read_csv(path, names=columns)
    frame['year'] = year
    pieces.append(frame)
names = pd.concat(pieces, ignore_index=True)
names

def get_tops(group):
    return group.sort_index(by='births', ascending=False)[:1000]

grouped = names.groupby(['year', 'sex'])
grouped.apply(get_tops)
I am using Pandas 0.10 and Python 2.7. The error I am seeing is this:
Traceback (most recent call last):
  File "names.py", line 21, in <module>
    grouped.apply(get_tops)
  File "/usr/local/lib/python2.7/dist-packages/pandas-0.10.0-py2.7-linux-i686.egg/pandas/core/groupby.py", line 321, in apply
    return self._python_apply_general(f)
  File "/usr/local/lib/python2.7/dist-packages/pandas-0.10.0-py2.7-linux-i686.egg/pandas/core/groupby.py", line 324, in _python_apply_general
    keys, values, mutated = self.grouper.apply(f, self.obj, self.axis)
  File "/usr/local/lib/python2.7/dist-packages/pandas-0.10.0-py2.7-linux-i686.egg/pandas/core/groupby.py", line 585, in apply
    values, mutated = splitter.fast_apply(f, group_keys)
  File "/usr/local/lib/python2.7/dist-packages/pandas-0.10.0-py2.7-linux-i686.egg/pandas/core/groupby.py", line 2127, in fast_apply
    results, mutated = lib.apply_frame_axis0(sdata, f, names, starts, ends)
  File "reduce.pyx", line 421, in pandas.lib.apply_frame_axis0 (pandas/lib.c:24934)
  File "/usr/local/lib/python2.7/dist-packages/pandas-0.10.0-py2.7-linux-i686.egg/pandas/core/frame.py", line 2028, in __setattr__
    self[name] = value
  File "/usr/local/lib/python2.7/dist-packages/pandas-0.10.0-py2.7-linux-i686.egg/pandas/core/frame.py", line 2043, in __setitem__
    self._set_item(key, value)
  File "/usr/local/lib/python2.7/dist-packages/pandas-0.10.0-py2.7-linux-i686.egg/pandas/core/frame.py", line 2078, in _set_item
    value = self._sanitize_column(key, value)
  File "/usr/local/lib/python2.7/dist-packages/pandas-0.10.0-py2.7-linux-i686.egg/pandas/core/frame.py", line 2112, in _sanitize_column
    raise AssertionError('Length of values does not match '
AssertionError: Length of values does not match length of index
Any ideas?

I think this was a bug introduced in 0.10, namely issue #2605,
"AssertionError when using apply after GroupBy". It has since been fixed.
You can either wait for the 0.10.1 release, which shouldn't be too long from now, or you can upgrade to the development version (either via git or simply by downloading the zip of master).
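For reference, on modern pandas the same top-1000-per-group result can be computed without groupby.apply at all, which also sidesteps the fast-apply path shown in the traceback. A minimal sketch (sort_values and GroupBy.head postdate 0.10, so this assumes a newer pandas):
# Sort the whole frame once, then take the first 1000 rows of each group;
# head() preserves the sorted order within each (year, sex) group.
tops = (names.sort_values('births', ascending=False)
             .groupby(['year', 'sex'])
             .head(1000))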

Related

Python script that used to work is now getting automatically killed in Ubuntu

I was once able to run the Python script below on my Ubuntu machine without the memory errors I was getting on Windows.
import pandas as pd
import numpy as np

# create a pandas dataframe for each input file
dfs1 = pd.read_csv('s1.csv', encoding='utf-8', names=list(range(0, 107)), dtype='string', na_filter=False)
dfs2 = pd.read_csv('s2.csv', encoding='utf-8', names=list(range(0, 107)), dtype='string', na_filter=False)
dfr  = pd.read_csv('r.csv',  encoding='utf-8', names=list(range(0, 107)), dtype='string', na_filter=False)
# combine them into one dataframe
dfs12r = pd.concat([dfs1, dfs2, dfr], ignore_index=True)  # without ignore_index the row numbers are not adjusted
# the BOW (bag of words) is coming
wordlist = []
for line in range(8052):
    for row in range(106):
        #print(line, row, dfs12r[row][line])
        if dfs12r[row][line] not in wordlist:
            wordlist.append(dfs12r[row][line])
wordlist.sort()
#print(wordlist)
print(len(wordlist))  # 12350
dfBOW = pd.DataFrame(np.zeros((len(dfs12r.index), len(wordlist))), dtype='int')
# create the dictionary
wordDict = dict.fromkeys(wordlist, 'default')
counter = 0
for word in wordlist:
    wordDict[word] = counter
    counter += 1
#print(wordDict)
# will start scanning every word from dfs12r and +1 the respective cell in dfBOW
for line in range(8052):
    for row in range(107):
        dfBOW[wordDict[dfs12r[row][line]]][line] += 1
Unfortunately, probably after some automatic Ubuntu updates, I am now getting the simple message "Killed" when trying to run the script, without any further explanation.
Through simple print statements I know that the script is interrupted inside the final for loop.
I understand that I should be able to make the script more memory efficient, but I am also hoping for guidance on how to get Ubuntu to run the same script again the way it used to. (Through the top command I can see that all of my memory, including swap, is used up while inside this loop.)
Could paging have been disabled somehow after the updates? Any advice is welcome.
I still have 16GB of RAM and use Ubuntu 20.04 (the specs are the same before and after the script stopped working). I dual boot from the same SSD.
Below is the error I am getting from the same script on Windows:
Traceback (most recent call last):
  File "D:\sharedfiles\Organised\WorkSpace\ptixiaki\github\ptixiaki\code\makingthedata\2.1 Approach (Same as 2 but turning all words to lowercase)\2.1_CSVtoDataframe\CSVtoBOW.py", line 60, in <module>
    dfBOW[wordDict[dfs12r[row][line]]][line]+=1
  File "D:\wandowsoftware\anaconda\envs\ptixiaki\lib\site-packages\pandas\core\series.py", line 1143, in __setitem__
    self._maybe_update_cacher()
  File "D:\wandowsoftware\anaconda\envs\ptixiaki\lib\site-packages\pandas\core\series.py", line 1279, in _maybe_update_cacher
    ref._maybe_cache_changed(cacher[0], self, inplace=inplace)
  File "D:\wandowsoftware\anaconda\envs\ptixiaki\lib\site-packages\pandas\core\frame.py", line 3950, in _maybe_cache_changed
    self._mgr.iset(loc, arraylike, inplace=inplace)
  File "D:\wandowsoftware\anaconda\envs\ptixiaki\lib\site-packages\pandas\core\internals\managers.py", line 1141, in iset
    blk.delete(blk_locs)
  File "D:\wandowsoftware\anaconda\envs\ptixiaki\lib\site-packages\pandas\core\internals\blocks.py", line 388, in delete
    self.values = np.delete(self.values, loc, 0)  # type: ignore[arg-type]
  File "<__array_function__ internals>", line 5, in delete
  File "D:\wandowsoftware\anaconda\envs\ptixiaki\lib\site-packages\numpy\lib\function_base.py", line 4555, in delete
    new = arr[tuple(slobj)]
MemoryError: Unable to allocate 501. MiB for an array with shape (12234, 10736) and data type int32
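On the memory-efficiency side, a common fix for this kind of bag-of-words loop is to accumulate counts in a scipy sparse matrix instead of a dense 8052 x 12350 DataFrame, since most cells stay zero. A minimal sketch, assuming dfs12r, wordlist, and wordDict as built above (and scipy installed):
from scipy.sparse import lil_matrix

# Sparse bag-of-words: only the nonzero cells are stored, so the full
# dense array is never allocated (or repeatedly copied by pandas).
n_lines, n_words = len(dfs12r.index), len(wordlist)
bow = lil_matrix((n_lines, n_words), dtype='int32')
for line in range(n_lines):
    for row in range(107):
        bow[line, wordDict[dfs12r[row][line]]] += 1
bow = bow.tocsr()  # CSR form is compact and fast for arithmetic afterwards
Building wordlist as a set first (and sorting it at the end) would also replace the quadratic membership test in the first loop.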

Python matplotlib/pandas fails with "OverflowError: value too large to convert to npy_uint32"

I am trying to run this variant call pipeline with 144 samples, so the resulting files are quite big. I managed to get almost to the end, but the last rule (plots_stats) fails with OverflowError: value too large to convert to npy_uint32. This is a Python script that plots from a gzipped tsv file. I guess I just have too many rows in my calls.tsv.gz to be handled. The complete error log is:
Traceback (most recent call last):
  File "/[PATH]/workflow_var_calling/.snakemake/scripts/tmp10j_ba31.plot-depths.py", line 16, in <module>
    sample_info = calls.loc[:, samples].stack([0, 1]).unstack().reset_index(1, drop=False)
  File "/[PATH]/workflow_var_calling/.snakemake/conda/5e32b1f022a698680d2667be14f8a58a/lib/python3.6/site-packages/pandas/core/series.py", line 2899, in unstack
    return unstack(self, level, fill_value)
  File "/[PATH]/workflow_var_calling/.snakemake/conda/5e32b1f022a698680d2667be14f8a58a/lib/python3.6/site-packages/pandas/core/reshape/reshape.py", line 501, in unstack
    constructor=obj._constructor_expanddim)
  File "/[PATH]/workflow_var_calling/.snakemake/conda/5e32b1f022a698680d2667be14f8a58a/lib/python3.6/site-packages/pandas/core/reshape/reshape.py", line 116, in __init__
    self.index = index.remove_unused_levels()
  File "/[PATH]/workflow_var_calling/.snakemake/conda/5e32b1f022a698680d2667be14f8a58a/lib/python3.6/site-packages/pandas/core/indexes/multi.py", line 1494, in remove_unused_levels
    uniques = algos.unique(lab)
  File "/[PATH]/workflow_var_calling/.snakemake/conda/5e32b1f022a698680d2667be14f8a58a/lib/python3.6/site-packages/pandas/core/algorithms.py", line 367, in unique
    table = htable(len(values))
  File "pandas/_libs/hashtable_class_helper.pxi", line 937, in pandas._libs.hashtable.Int64HashTable.__cinit__
OverflowError: value too large to convert to npy_uint32
Any ideas?
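One workaround worth trying, sketched below: reshape one sample at a time instead of stacking and unstacking the whole frame, so no single reshape has to hash an index of that size. This assumes calls has a two-level (sample, field) column MultiIndex, as the failing line implies; it is a sketch of the idea, not a drop-in replacement for the script's exact output:
import pandas as pd

# Hypothetical per-sample reshape: builds sample_info piecewise, keeping
# each intermediate index far below the uint32 limit hit above.
pieces = []
for sample in samples:
    piece = calls[sample].copy()       # the per-sample sub-frame
    piece.insert(0, 'sample', sample)  # record which sample each row came from
    pieces.append(piece)
sample_info = pd.concat(pieces, ignore_index=True)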

TypeError: 1st argument must be a real sequence 2 signal.spectrogram

I'm trying to take a signal from an electrical reading and decompose it into its spectrogram, but I keep getting a weird error. Here is the code:
import matplotlib.pyplot as plt
from scipy import signal

f, t, Sxx = signal.spectrogram(i_data.values, 130)
plt.pcolormesh(t, f, Sxx)
plt.ylabel('Frequency [Hz]')
plt.xlabel('Time [sec]')
plt.show()
And here is the error:
convert_to_spectrogram(i_data.iloc[1000,:10020].dropna().values)
Traceback (most recent call last):
  File "<ipython-input-140-e5951b2d2d97>", line 1, in <module>
    convert_to_spectrogram(i_data.iloc[1000,:10020].dropna().values)
  File "<ipython-input-137-5d63a96c8889>", line 2, in convert_to_spectrogram
    f, t, Sxx = signal.spectrogram(wf, 130)
  File "//anaconda3/lib/python3.7/site-packages/scipy/signal/spectral.py", line 750, in spectrogram
    mode='psd')
  File "//anaconda3/lib/python3.7/site-packages/scipy/signal/spectral.py", line 1836, in _spectral_helper
    result = _fft_helper(x, win, detrend_func, nperseg, noverlap, nfft, sides)
  File "//anaconda3/lib/python3.7/site-packages/scipy/signal/spectral.py", line 1921, in _fft_helper
    result = func(result, n=nfft)
  File "//anaconda3/lib/python3.7/site-packages/mkl_fft/_numpy_fft.py", line 335, in rfft
    output = mkl_fft.rfft_numpy(x, n=n, axis=axis)
  File "mkl_fft/_pydfti.pyx", line 609, in mkl_fft._pydfti.rfft_numpy
  File "mkl_fft/_pydfti.pyx", line 502, in mkl_fft._pydfti._rc_fft1d_impl
TypeError: 1st argument must be a real sequence 2
My reading has a full cycle of 130 observations, and it's stored as individual values of a pandas df. The wave I am using in particular can be found here. Does anyone have any ideas what this error means?
(Small disclaimer: I do not know much about signal processing, so please forgive me if this is a naive question.)
Python 3.6.9, scipy 1.3.3
Downloading your file and reading it with pandas.read_csv, I was able to generate the following spectrogram:
import matplotlib.pyplot as plt
import pandas as pd
from scipy.signal import spectrogram
i_data = pd.read_csv('wave.csv')
f, t, Sxx = spectrogram(i_data.values[:, 1], 130)
plt.pcolormesh(t, f, Sxx)
plt.ylabel('Frequency [Hz]')
plt.xlabel('Time [sec]')
plt.show()
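If the error persists with your own data, one guess (not confirmed) is that the column arrives as object dtype (strings or NaNs mixed in), which mkl_fft rejects as not being a real sequence. Coercing to real floats first is a cheap guard:
import numpy as np
import pandas as pd
from scipy.signal import spectrogram

# Non-numeric entries become NaN and are dropped before the FFT.
sig = pd.to_numeric(i_data.iloc[:, 1], errors='coerce').dropna().to_numpy(dtype=np.float64)
f, t, Sxx = spectrogram(sig, 130)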

Pandas 0.24.0 breaks my pandas dataframe with special column identifiers

I had code that worked fine until I tried to run it on a coworker's machine, whereupon I discovered that while it worked using pandas 0.22.0, it broke on pandas 0.24.0. For the moment, we've solved this problem by downgrading their copy of pandas, but I would like to find a better solution if one exists.
The problem seems to be that I am using a user-defined class as the identifiers for my columns in the dataframe. When comparing two dataframes, pandas for some reason tries to call my column labels as functions, and then throws an exception because they aren't callable.
Here's some example code:
import pandas as pd
import numpy as np

class label(object):
    def __init__(self, var):
        self.var = var
    def __eq__(self, other):
        return self.var == other.var

df = pd.DataFrame(np.eye(5), columns=[label(ii) for ii in range(5)])
df == df
This produces the following stack trace:
Traceback (most recent call last):
  File "<ipython-input-4-496e4ab3f9d9>", line 1, in <module>
    df==df1
  File "C:\...\site-packages\pandas\core\ops.py", line 2098, in f
    return dispatch_to_series(self, other, func, str_rep)
  File "C:\...\site-packages\pandas\core\ops.py", line 1157, in dispatch_to_series
    new_data = expressions.evaluate(column_op, str_rep, left, right)
  File "C:\...\site-packages\pandas\core\computation\expressions.py", line 208, in evaluate
    return _evaluate(op, op_str, a, b, **eval_kwargs)
  File "C:\...\site-packages\pandas\core\computation\expressions.py", line 68, in _evaluate_standard
    return op(a, b)
  File "C:\...\site-packages\pandas\core\ops.py", line 1135, in column_op
    for i in range(len(a.columns))}
  File "C:\...\site-packages\pandas\core\ops.py", line 1135, in <dictcomp>
    for i in range(len(a.columns))}
  File "C:\...\site-packages\pandas\core\ops.py", line 1739, in wrapper
    name=res_name).rename(res_name)
  File "C:\...\site-packages\pandas\core\series.py", line 3733, in rename
    return super(Series, self).rename(index=index, **kwargs)
  File "C:\...\site-packages\pandas\core\generic.py", line 1091, in rename
    level=level)
  File "C:\...\site-packages\pandas\core\internals\managers.py", line 171, in rename_axis
    obj.set_axis(axis, _transform_index(self.axes[axis], mapper, level))
  File "C:\...\site-packages\pandas\core\internals\managers.py", line 2004, in _transform_index
    items = [func(x) for x in index]
TypeError: 'label' object is not callable
I've found I can fix the problem by making my class callable with a single argument and returning that argument, but that breaks .loc indexing, which will default to treating my objects as callables.
This problem only occurs when the custom objects are in the columns - the index can handle them just fine.
Is this a bug or a change in usage, and is there any way I can work around it without giving up my custom labels?
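In the meantime, one workaround that avoids the rename machinery entirely is to compare the underlying arrays and rebuild the frame. A sketch (this sidesteps the failing code path rather than fixing it):
import numpy as np
import pandas as pd

# Elementwise comparison on the raw values, then rewrap with the original
# custom labels; pandas never has to rename a Series after a custom label.
result = pd.DataFrame(df.values == df.values, index=df.index, columns=df.columns)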

TypeError on read_csv, working in pandas 0.7, error in 0.8.0rc2, possible dependency error?

I am attempting to execute the following within python:
from pandas import *
tickdata = read_csv('/home/user/sourcefile.csv',index_col=0,parse_dates='TRUE')
The csv file has rows that look like:
2011/11/23 23:56:00.554389,1165.2500
2011/11/23 23:56:02.310943,1165.5000
2011/11/23 23:56:05.564009,1165.2500
On pandas 0.7, this executes fine. On pandas 0.8.0rc2, I get the error below. Because I have 0.7 and 0.8 installed on two different systems, I have not ruled out a dependency or Python version difference. Any ideas on how to get this to execute under 0.8 are appreciated.
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/pandas-0.8.0rc2-py2.7-linux-x86_64.egg/pandas/io/parsers.py", line 225, in read_csv
    return _read(TextParser, filepath_or_buffer, kwds)
  File "/usr/local/lib/python2.7/dist-packages/pandas-0.8.0rc2-py2.7-linux-x86_64.egg/pandas/io/parsers.py", line 192, in _read
    return parser.get_chunk()
  File "/usr/local/lib/python2.7/dist-packages/pandas-0.8.0rc2-py2.7-linux-x86_64.egg/pandas/io/parsers.py", line 728, in get_chunk
    index = self._agg_index(index)
  File "/usr/local/lib/python2.7/dist-packages/pandas-0.8.0rc2-py2.7-linux-x86_64.egg/pandas/io/parsers.py", line 846, in _agg_index
    if try_parse_dates and self._should_parse_dates(self.index_col):
  File "/usr/local/lib/python2.7/dist-packages/pandas-0.8.0rc2-py2.7-linux-x86_64.egg/pandas/io/parsers.py", line 874, in _should_parse_dates
    return i in to_parse or name in to_parse
TypeError: 'in <string>' requires string as left operand, not int
I fixed the parser bug shown in the stack trace that you pasted. However, I'm wondering whether your date column is actually named "TRUE", or did you mean to just pass a boolean? I haven't dug through pandas history, but I know that in 0.8 we support much more complex date-parsing behavior as part of the time series API, so here we're interpreting the string value as a column name.
I've reported the bug on GitHub (best place for bug reports):
https://github.com/pydata/pandas/issues/1544
Should have a resolution today or tomorrow.
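For anyone hitting the same thing: if a boolean was intended, the fix on the user side is simply to pass True rather than the string 'TRUE':
from pandas import read_csv

# parse_dates expects True/False or a list of column names/indices;
# the string 'TRUE' is read in 0.8 as the name of a column to parse.
tickdata = read_csv('/home/user/sourcefile.csv', index_col=0, parse_dates=True)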