numpy returning a record in a where query for an already deleted index in a pandas df

I am trying to run this command:
ipums = ipums.drop(np.where(ipums['wkswork1'] == 0)[0])
but I am getting an error:
raise ValueError('labels %s not contained in axis' % labels[mask])
I check the ipums dataset for a value returned in the array:
ipums[207]
and I get:
File "index.pyx", line 128, in pandas.index.IndexEngine.get_loc (pandas/index.c:3542)
File "index.pyx", line 138, in pandas.index.IndexEngine.get_loc (pandas/index.c:3322)
KeyError: 207
Which I assume means the row was deleted earlier. (And it was, by a similar earlier command that addressed a different field.)
Am I missing something here?

The usual way you would do this in pandas is to use a boolean mask:
ipums = ipums[ipums['wkswork1'] != 0]
You can also use a ~ to negate the mask.
The error is raised because numpy's where returns the integer locations of the rows rather than their labels, which means you can't pass the result to drop (since drop works with labels).
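If you really do want to keep the np.where approach, you can translate the integer positions it returns into index labels before calling drop. A minimal sketch (the small frame and its index values here are made up for illustration):
import numpy as np
import pandas as pd

# Hypothetical stand-in for the ipums data, with a non-default index
ipums = pd.DataFrame({'wkswork1': [0, 12, 0, 40]}, index=[10, 11, 12, 13])

# Preferred: a boolean mask keeps the rows where the condition holds
kept = ipums[ipums['wkswork1'] != 0]

# Equivalent with drop: convert the positions from np.where into labels first
positions = np.where(ipums['wkswork1'] == 0)[0]
dropped = ipums.drop(ipums.index[positions])

assert kept.equals(dropped)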

Related

how to compress lists/nested lists in hdf5

I recently learned about HDF5 compression and have been working with it; it has some advantages over .npz/.npy when working with gigantic files.
I tried it out on a small list, since I sometimes work with lists that contain strings, as follows:
def write():
    test_array = ['a1','a2','a1','a2','a1','a2', 'a1','a2', 'a1','a2','a1','a2','a1','a2', 'a1','a2', 'a1','a2','a1','a2','a1','a2', 'a1','a2']
    with h5py.File('example_file.h5', 'w') as f:
        f.create_dataset('test3', data=repr(test_array), dtype='S', compression='gzip', compression_opts=9)
        f.close()
However I got this error:
f.create_dataset('test3', data=repr(test_array), dtype='S', compression='gzip', compression_opts=9)
File "/usr/local/lib/python3.6/dist-packages/h5py/_hl/group.py", line 136, in create_dataset
dsid = dataset.make_new_dset(self, shape, dtype, data, **kwds)
File "/usr/local/lib/python3.6/dist-packages/h5py/_hl/dataset.py", line 118, in make_new_dset
tid = h5t.py_create(dtype, logical=1)
File "h5py/h5t.pyx", line 1634, in h5py.h5t.py_create
File "h5py/h5t.pyx", line 1656, in h5py.h5t.py_create
File "h5py/h5t.pyx", line 1689, in h5py.h5t.py_create
File "h5py/h5t.pyx", line 1508, in h5py.h5t._c_string
ValueError: Size must be positive (size must be positive)
After searching the net for hours for a better way to do this, I couldn't find one.
Is there a better way to compress lists with H5?
This is a more general answer for nested lists where each nested list is a different length. It also works for the simpler case when the nested lists are of equal length. There are two solutions: one with h5py and one with PyTables.
h5py example
h5py does not support ragged arrays, so you have to create a dataset sized for the longest sublist and pad the "short" sublists.
You will get 'None' (or a truncated piece of it, depending on the string length) at each array position that doesn't have a corresponding value in the nested list. Take care with the dtype= entry. This shows how to find the longest string in the list (as slen=##) and uses it to create dtype='S##'.
import h5py
import numpy as np

test_list = [['a01','a02','a03','a04','a05','a06'],
             ['a11','a12','a13','a14','a15','a16','a17'],
             ['a21','a22','a23','a24','a25','a26','a27','a28']]

# arrlen and test_array from answer to SO #10346336 - Option 3:
# Ref: https://stackoverflow.com/a/26224619/10462884
slen = max(len(item) for sublist in test_list for item in sublist)
arrlen = max(map(len, test_list))
test_array = np.array([tl + [None]*(arrlen - len(tl)) for tl in test_list], dtype='S'+str(slen))

with h5py.File('example_nested.h5', 'w') as f:
    f.create_dataset('test3', data=test_array, compression='gzip')
PyTables example
PyTables supports ragged 2-d arrays as VLArrays (variable length). This avoids the complication of adding 'None' values for "short" substrings. Also, you don't have to determine the array length in advance, as the number of rows is not defined when VLArray is created (rows are added after creation). Again, take care with the dtype= entry. This uses the same method as above.
import tables as tb
import numpy as np

test_list = [['a01','a02','a03','a04','a05','a06'],
             ['a11','a12','a13','a14','a15','a16','a17'],
             ['a21','a22','a23','a24','a25','a26','a27','a28']]

slen = max(len(item) for sublist in test_list for item in sublist)

with tb.File('example_nested_tb.h5', 'w') as h5f:
    vlarray = h5f.create_vlarray('/', 'vla_test', tb.StringAtom(slen))
    for slist in test_list:
        arr = np.array(slist, dtype='S'+str(slen))
        vlarray.append(arr)
    print('-->', vlarray.name)
    for row in vlarray:
        print('%s[%d]--> %s' % (vlarray.name, vlarray.nrow, row))
You are close. The data= argument is designed to work with an existing NumPy array. When you use a List, behind the scenes it is converted to an Array. It works for a List of numbers. (Note that Lists and Arrays are different Python object classes.)
You ran into an issue converting a list of strings. By default, the dtype is set to NumPy's Unicode type ('<U2' in your case). That is a problem for h5py (and HDF5). Per the h5py documentation: "HDF5 has no support for wide characters. Rather than trying to hack around this and “pretend” to support it, h5py will raise an error if you try to store data of this type." Complete details about NumPy and strings at this link: h5py doc: Strings in HDF5
I modified your example slightly to show how you can get it to work. Note that I explicitly created the NumPy array of strings, and declared dtype='S2' to get the desired string dtype. I added an example using a list of integers to show how a list works for numbers. However, NumPy arrays are the preferred data object.
I removed the f.close() statement, as this is not required when using a context manager (with / as: structure)
Also, be careful with the compression level. You will get (slightly) more compression with compression_opts=9 compared to compression_opts=1, but you will pay in I/O processing time each time you access the dataset. I suggest starting with 1.
import h5py
import numpy as np

test_array = np.array(['a1','a2','a1','a2','a1','a2', 'a1','a2',
                       'a1','a2','a1','a2','a1','a2', 'a1','a2',
                       'a1','a2','a1','a2','a1','a2', 'a1','a2'], dtype='S2')
data_list = [1, 2, 3, 4, 5, 6, 7, 8, 9]

with h5py.File('example_file.h5', 'w') as f:
    f.create_dataset('test3', data=test_array, compression='gzip', compression_opts=9)
    f.create_dataset('test4', data=data_list, compression='gzip', compression_opts=1)
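For completeness, this is roughly how the data comes back out. Fixed-width 'S' datasets are returned as NumPy byte strings, so a decode step is needed if you want plain str objects again (a sketch, assuming the file written above):
import h5py

with h5py.File('example_file.h5', 'r') as f:
    raw = f['test3'][:]                     # numpy array with dtype 'S2' (bytes)
    strings = [b.decode('ascii') for b in raw]
    numbers = f['test4'][:]                 # plain integer array
print(strings[:4], numbers[:4])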

Pandas to_hdf fails on dataframes containing nullable int dtypes (e.g. Int8Dtype)

I'm trying to reduce the memory consumption of some large data that we work with, so that more data can be appended to it without throwing memory errors. Downcasting floats where possible helps a little, but the major savings I've found come from casting float64s to Int8 and Int16 where possible. This data contains NaNs. That is unavoidable, and in context there is no value I can replace NaNs with that doesn't change the meaning of the data. The new nullable dtypes are great for this, but I get ValueError: cannot convert float NaN to integer when trying to save the resulting frames to hdf.
I've tried using to_hdf with and without specifying table format, and get different errors (without specifying table format the error is AttributeError: 'NoneType' object has no attribute 'names').
df = pd.DataFrame([1, 2, 3, np.nan, 5], columns=['A'])
df.to_hdf('Z:/test.hd5', 'data')
# This works
df['A'] = df.A.astype(pd.Int8Dtype())
df.to_hdf('Z:/test.hd5', 'data')
Traceback (most recent call last):
File "<ipython-input-51-6b0f3ad26286>", line 1, in <module>
df.to_hdf('Z:/test.hd5', 'data', complevel=9, complib='blosc:zlib')
File "C:\Users\marnoch.hamilton-jon\AppData\Local\Continuum\anaconda3 \lib\site-packages\pandas\core\generic.py", line 2377, in to_hdf
return pytables.to_hdf(path_or_buf, key, self, **kwargs)
File "C:\Users\marnoch.hamilton-jon\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\io\pytables.py", line 274, in to_hdf
f(store)
File "C:\Users\marnoch.hamilton-jon\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\io\pytables.py", line 268, in <lambda>
f = lambda store: store.put(key, value, **kwargs)
File "C:\Users\marnoch.hamilton-jon\AppData\Local\Continuum\anaconda3 \lib\site-packages\pandas\io\pytables.py", line 889, in put
self._write_to_group(key, value, append=append, **kwargs)
File "C:\Users\marnoch.hamilton-jon\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\io\pytables.py", line 1415, in _write_to_group
s.write(obj=value, append=append, complib=complib, **kwargs)
File "C:\Users\marnoch.hamilton-jon\AppData\Local\Continuum\anaconda3 \lib\site-packages\pandas\io\pytables.py", line 3022, in write
blk.values, items=blk_items)
File "C:\Users\marnoch.hamilton-jon\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\io\pytables.py", line 2750, in write_array
atom = _tables().Atom.from_dtype(value.dtype)
File "C:\Users\marnoch.hamilton-jon\AppData\Local\Continuum\anaconda3\lib\site-packages\tables\atom.py", line 381, in from_dtype
if basedtype.names:
AttributeError: 'NoneType' object has no attribute 'names'
Is this a bug? An intentional limitation? Or have I done something dumb?
This is a bug. See GitHub Issue #26144 for the status.
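Until the fix lands, a common workaround is to cast the nullable integer columns back to float64 just for the write, so the in-memory frame keeps the smaller dtype. A sketch, assuming the same toy frame as above; the exact behaviour of the nullable-to-float cast can vary a little between pandas versions:
import numpy as np
import pandas as pd

df = pd.DataFrame([1, 2, 3, np.nan, 5], columns=['A'])
df['A'] = df.A.astype(pd.Int8Dtype())

# Cast the nullable column back to float64 only for the write;
# missing values become NaN again in the saved file
df.astype({'A': 'float64'}).to_hdf('test.hd5', 'data')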

Pandas 0.24.0 breaks my pandas dataframe with special column identifiers

I had code that worked fine until I tried to run it on a coworker's machine, whereupon I discovered that while it worked using pandas 0.22.0, it broke on pandas 0.24.0. For the moment, we've solved this problem by downgrading their copy of pandas, but I would like to find a better solution if one exists.
The problem seems to be that I am creating a user-defined class to use as identifiers for my columns in the dataframe. When I try to compare two dataframes, pandas for some reason tries to call my column labels as functions, and then throws an exception because they aren't callable.
Here's some example code:
import pandas as pd
import numpy as np

class label(object):
    def __init__(self, var):
        self.var = var
    def __eq__(self, other):
        return self.var == other.var

df = pd.DataFrame(np.eye(5), columns=[label(ii) for ii in range(5)])
df == df
This produces the following stack trace:
Traceback (most recent call last):
File "<ipython-input-4-496e4ab3f9d9>", line 1, in <module>
df==df1
File "C:\...\site-packages\pandas\core\ops.py", line 2098, in f
return dispatch_to_series(self, other, func, str_rep)
File "C:\...\site-packages\pandas\core\ops.py", line 1157, in dispatch_to_series
new_data = expressions.evaluate(column_op, str_rep, left, right)
File "C:\...\site-packages\pandas\core\computation\expressions.py", line 208, in evaluate
return _evaluate(op, op_str, a, b, **eval_kwargs)
File "C:\...\site-packages\pandas\core\computation\expressions.py", line 68, in _evaluate_standard
return op(a, b)
File "C:\...\site-packages\pandas\core\ops.py", line 1135, in column_op
for i in range(len(a.columns))}
File "C:\...\site-packages\pandas\core\ops.py", line 1135, in <dictcomp>
for i in range(len(a.columns))}
File "C:\...\site-packages\pandas\core\ops.py", line 1739, in wrapper
name=res_name).rename(res_name)
File "C:\...\site-packages\pandas\core\series.py", line 3733, in rename
return super(Series, self).rename(index=index, **kwargs)
File "C:\...\site-packages\pandas\core\generic.py", line 1091, in rename
level=level)
File "C:\...\site-packages\pandas\core\internals\managers.py", line 171, in rename_axis
obj.set_axis(axis, _transform_index(self.axes[axis], mapper, level))
File "C:\...\site-packages\pandas\core\internals\managers.py", line 2004, in _transform_index
items = [func(x) for x in index]
TypeError: 'label' object is not callable
I've found I can fix the problem by making my class callable with a single argument and returning that argument, but that breaks .loc indexing, which will default to treating my objects as callables.
This problem only occurs when the custom objects are in the columns - the index can handle them just fine.
Is this a bug or a change in usage, and is there any way I can work around it without giving up my custom labels?
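For reference, a minimal sketch of the workaround mentioned above (making the label class callable and returning its argument unchanged), with the stated caveat that it breaks .loc indexing, which treats callable labels as indexer functions:
import numpy as np
import pandas as pd

class label(object):
    def __init__(self, var):
        self.var = var
    def __eq__(self, other):
        return self.var == other.var
    # Workaround from the question: pandas' internal rename/map step calls
    # each column label, so returning the argument unchanged makes it a no-op
    def __call__(self, x):
        return x

df = pd.DataFrame(np.eye(5), columns=[label(ii) for ii in range(5)])
df == df  # no longer raises the TypeError on pandas 0.24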

DataFrame.apply(func, raw=True) doesn't seem to take effect?

I am trying to hash together only a few columns of my dataframe df so I do
temp = df[['field1', 'field2']]
df["hash"] = temp.apply(lambda x: hash(x), raw=True, axis=1)
I set raw to True because the docs (I am using 0.22) say it will pass a numpy array instead of a mutable Series, but even with raw=True I am getting a Series. Why?
File "/nix/store/9ampki9dbq0imhhm7i27qkh56788cjpz-python3.6-pandas-0.22.0/lib/python3.6/site-packages/pandas/core/frame.py", line 4877, in apply
ignore_failures=ignore_failures)
File "/nix/store/9ampki9dbq0imhhm7i27qkh56788cjpz-python3.6-pandas-0.22.0/lib/python3.6/site-packages/pandas/core/frame.py", line 4973, in _apply_standard
results[i] = func(v)
File "/home/teto/mptcpanalyzer/mptcpanalyzer/data.py", line 190, in _hash_row
return hash(x)
File "/nix/store/9ampki9dbq0imhhm7i27qkh56788cjpz-python3.6-pandas-0.22.0/lib/python3.6/site-packages/pandas/core/generic.py", line 1045, in __hash__
' hashed'.format(self.__class__.__name__))
TypeError: ("'Series' objects are mutable, thus they cannot be hashed", 'occurred at index 1')
It's strange, as I can't reproduce your exact error (for me, raw=True indeed results in an np.ndarray being passed). In any case, neither a Series nor an np.ndarray is hashable. The following works, though:
temp.apply(lambda x: hash(tuple(x)), axis=1)
A tuple is hashable.
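If you prefer a vectorised alternative to the row-wise apply, pandas also ships a hashing utility; a sketch with a made-up two-column frame (note the double brackets for selecting the columns, and that it returns uint64 values rather than Python's built-in hash):
import pandas as pd

df = pd.DataFrame({'field1': [1, 2, 3], 'field2': ['a', 'b', 'c']})
temp = df[['field1', 'field2']]

# Row-wise tuple hashing, as in the answer above
df['hash'] = temp.apply(lambda x: hash(tuple(x)), axis=1)

# Vectorised alternative: one uint64 hash per row
df['hash2'] = pd.util.hash_pandas_object(temp, index=False)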

Pandas: Location of a row with error

I am pretty new to Pandas and trying to find out where my code breaks. Say, I am doing a type conversion:
df['x']=df['x'].astype('int')
...and I get an error: "ValueError: invalid literal for long() with base 10: '1.0692e+06'".
In general, if I have 1000 entries in the dataframe, how can I find out which entry causes the break? Is there anything in ipdb to output the current location (i.e. where the code broke)? Basically, I am trying to pinpoint which value cannot be converted to int.
The error you are seeing might be due to the value(s) in the x column being strings:
In [15]: df = pd.DataFrame({'x':['1.0692e+06']})
In [16]: df['x'].astype('int')
ValueError: invalid literal for long() with base 10: '1.0692e+06'
Ideally, the problem can be avoided by making sure the values stored in the DataFrame are already ints, not strings, when the DataFrame is built. How to do that depends, of course, on how you are building the DataFrame.
After the fact, the DataFrame could be fixed using applymap:
import ast
df = df.applymap(ast.literal_eval).astype('int')
but calling ast.literal_eval on each value in the DataFrame could be slow, which is why fixing the problem from the beginning is the best alternative.
Usually you could drop into a debugger when an exception is raised to inspect the problematic row's value.
However, in this case the exception is happening inside the call to astype, which is a thin wrapper around C-compiled code. The C-compiled code is doing the looping through the values in df['x'], so the Python debugger is not helpful here -- it won't allow you to introspect on what value the exception is being raised from within the C-compiled code.
There are many important parts of Pandas and NumPy written in C, C++, Cython or Fortran, and the Python debugger will not take you inside those non-Python pieces of code where the fast loops are handled.
So instead I would revert to a low-brow solution: iterate through the values in a Python loop and use try...except to catch the first error:
df = pd.DataFrame({'x':['1.0692e+06']})
for i, item in enumerate(df['x']):
    try:
        int(item)
    except ValueError:
        print('ERROR at index {}: {!r}'.format(i, item))
yields
ERROR at index 0: '1.0692e+06'
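Another option worth a mention (not from the answers above, just a sketch): pd.to_numeric parses float-formatted strings such as '1.0692e+06' directly, and with errors='coerce' it turns anything unparseable into NaN, so the offending entries can be located in one vectorised pass:
import pandas as pd

df = pd.DataFrame({'x': ['1.0692e+06', '42', 'oops']})

# Anything that is not numeric at all becomes NaN with errors='coerce'
numeric = pd.to_numeric(df['x'], errors='coerce')

# Locate the offending rows in one pass
print(df['x'][numeric.isna()])   # index 2: 'oops'

# The rows that did parse can then be cast to int safely
ints = numeric.dropna().astype('int')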
I hit the same problem, and as I have a big input file (3 million rows), enumerating all rows would take a long time. Therefore I wrote a binary search to locate the offending row.
import pandas as pd
import sys

def binarySearch(df, l, r, func):
    while l <= r:
        mid = l + (r - l) // 2
        result = func(df, mid, mid + 1)
        if result:
            # Check if we hit the exception at mid
            return mid, result
        result = func(df, l, mid)
        if result is None:
            # If no exception in the left half, ignore the left half
            l = mid + 1
        else:
            r = mid - 1
    # If we reach here, then the element was not present
    return -1, None

def check(df, start, end):
    result = None
    try:
        # In my case, I want to find out which row causes the failure
        df.iloc[start:end].uid.astype(int)
    except Exception as e:
        result = str(e)
    return result

df = pd.read_csv(sys.argv[1])
index, result = binarySearch(df, 0, len(df), check)
print("index: {}".format(index))
print(result)
To report all rows which fail to map due to any exception:
df.apply(my_function)  # throws various exceptions at unknown rows

# print exceptions, index, and row content
for i, row in df.iterrows():
    try:
        my_function(row)
    except Exception as e:
        print('Error at index {}: {!r}'.format(i, row))
        print(e)