Querying a column with lists in it - pandas

I have a dataframe with columns with lists in them. How can I query these?
>>> df1.shape
(1812871, 7)
>>> df1.dtypes
CHROM object
POS int32
ID object
REF object
ALT object
QUAL int8
FILTER object
dtype: object
>>> df1.head()
CHROM POS ID REF ALT QUAL FILTER
0 20 60343 rs527639301 G [A] 100 [PASS]
1 20 60419 rs538242240 A [G] 100 [PASS]
2 20 60479 rs149529999 C [T] 100 [PASS]
3 20 60522 rs150241001 T [TC] 100 [PASS]
4 20 60568 rs533509214 A [C] 100 [PASS]
>>> df2 = df1.head(30)
>>> df3 = df1.head(3000)
I found a previous question, but the solutions do not quite work for me. The accepted solution does not work:
>>> df2[df2.ALT.apply(lambda x: x == ['TC'])]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/user/miniconda3/lib/python3.6/site-packages/pandas/core/frame.py", line 2682, in __getitem__
return self._getitem_array(key)
File "/home/user/miniconda3/lib/python3.6/site-packages/pandas/core/frame.py", line 2726, in _getitem_array
indexer = self.loc._convert_to_indexer(key, axis=1)
File "/home/user/miniconda3/lib/python3.6/site-packages/pandas/core/indexing.py", line 1314, in _convert_to_indexer
indexer = check = labels.get_indexer(objarr)
File "/home/user/miniconda3/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 3259, in get_indexer
indexer = self._engine.get_indexer(target._ndarray_values)
File "pandas/_libs/index.pyx", line 301, in pandas._libs.index.IndexEngine.get_indexer
File "pandas/_libs/hashtable_class_helper.pxi", line 1544, in pandas._libs.hashtable.PyObjectHashTable.lookup
TypeError: unhashable type: 'numpy.ndarray'
The reason being, the booleans get nested:
>>> df2.ALT.apply(lambda x: x == ['TC']).head()
0 [False]
1 [False]
2 [False]
3 [True]
4 [False]
Name: ALT, dtype: object
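The 'numpy.ndarray' in the TypeError above suggests that each ALT cell is a NumPy array rather than a plain list. That is an assumption, since the question does not show how df1 was built. Under that assumption, here is a minimal sketch of why the Booleans come out nested, and of one way to get a single Boolean per cell with np.array_equal:

```python
import numpy as np

# Hypothetical stand-in for one cell of the ALT column (assumed to be an array).
cell = np.array(["TC"])

# == between an array and a list is elementwise, so the result is itself an array.
nested = cell == ["TC"]
print(type(nested), nested)

# np.array_equal compares shape and contents and returns a single Boolean.
flat = np.array_equal(cell, ["TC"])
other = np.array_equal(np.array(["A", "T"]), ["TC"])
print(flat, other)
```

A Series of such one-element arrays cannot be used as a row mask, which would be consistent with the traceback in the question.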
So I tried the second answer, which seemed to work:
>>> c = np.empty(1, object)
>>> c[0] = ['TC']
>>> df2[df2.ALT.values == c]
CHROM POS ID REF ALT QUAL FILTER
3 20 60522 rs150241001 T [TC] 100 [PASS]
But strangely, it doesn't work when I try it on the larger dataframe:
>>> df3[df3.ALT.values == c]
Traceback (most recent call last):
File "/home/user/miniconda3/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 3078, in get_loc
return self._engine.get_loc(key)
File "pandas/_libs/index.pyx", line 140, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/index.pyx", line 162, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/hashtable_class_helper.pxi", line 1492, in pandas._libs.hashtable.PyObjectHashTable.get_item
File "pandas/_libs/hashtable_class_helper.pxi", line 1500, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: False
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/user/miniconda3/lib/python3.6/site-packages/pandas/core/frame.py", line 2688, in __getitem__
return self._getitem_column(key)
File "/home/user/miniconda3/lib/python3.6/site-packages/pandas/core/frame.py", line 2695, in _getitem_column
return self._get_item_cache(key)
File "/home/user/miniconda3/lib/python3.6/site-packages/pandas/core/generic.py", line 2489, in _get_item_cache
values = self._data.get(item)
File "/home/user/miniconda3/lib/python3.6/site-packages/pandas/core/internals.py", line 4115, in get
loc = self.items.get_loc(item)
File "/home/user/miniconda3/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 3080, in get_loc
return self._engine.get_loc(self._maybe_cast_indexer(key))
File "pandas/_libs/index.pyx", line 140, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/index.pyx", line 162, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/hashtable_class_helper.pxi", line 1492, in pandas._libs.hashtable.PyObjectHashTable.get_item
File "pandas/_libs/hashtable_class_helper.pxi", line 1500, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: False
Which is probably because the result of the boolean comparison is different!
>>> df3.ALT.values == c
False
>>> df2.ALT.values == c
array([False, False, False, True, False, False, False, False, False,
False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False,
False, False, False])
This is completely baffling to me.

I found that a hacky solution, casting the lists to tuples, works for me:
import pandas as pd

df = pd.DataFrame({'CHROM': [20] * 5,
                   'POS': [60343, 60419, 60479, 60522, 60568],
                   'ID': ['rs527639301', 'rs538242240', 'rs149529999', 'rs150241001', 'rs533509214'],
                   'REF': ['G', 'A', 'C', 'T', 'A'],
                   'ALT': [['A'], ['G'], ['T'], ['TC'], ['C']],
                   'QUAL': [100] * 5,
                   'FILTER': [['PASS']] * 5})
# Convert each list to a tuple so that a cell can be compared as a whole
df['ALT'] = df['ALT'].apply(tuple)
df[df['ALT'] == ('C',)]
This method works because a tuple is compared against each cell as a whole value, whereas the comparison in the question was applied elementwise inside each cell and produced the nested Boolean values. Tuples are also immutable and hashable, which lists are not.
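A variation that avoids converting the column, offered as a sketch rather than as the approach of the answer above: a membership test inside apply returns one plain Boolean per row, so the result can be used directly as a mask. The small frame below is made up for illustration, and it assumes you want rows where 'TC' is among the ALT alleles, which is slightly broader than matching the whole list.

```python
import pandas as pd

# Made-up sample with the same shape as the ALT column in the question.
df = pd.DataFrame({
    "POS": [60343, 60419, 60479, 60522, 60568],
    "ALT": [["A"], ["G"], ["T"], ["TC"], ["C"]],
})

# One plain Boolean per row: True where 'TC' is among the alleles in the cell.
mask = df["ALT"].apply(lambda alleles: "TC" in alleles)

# A flat Boolean Series can be used as a row mask.
matches = df[mask]
print(matches)
```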

Related

Not able to extract a column name using Panda data frame

KeyError: 'Name'
>>> df=pd.read_csv(text_file)
>>> print(df)
Name Age
0 Ritesh 32
1 Priyanka 29
>>> print(df['Name'].where(df['Name'] == 'Ritesh'))
Traceback (most recent call last):
File "/Users/reyansh/venv/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 2646, in get_loc
return self._engine.get_loc(key)
File "pandas/_libs/index.pyx", line 111, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/index.pyx", line 138, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/hashtable_class_helper.pxi", line 1619, in pandas._libs.hashtable.PyObjectHashTable.get_item
File "pandas/_libs/hashtable_class_helper.pxi", line 1627, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'Name'
During handling of the above exception, another exception occurred:
read_csv didn't split your file into two columns because the values are separated by a space and the default separator is a comma. Specify the space:
df = pd.read_csv(text_file, sep=" ")
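To see the difference without a file on disk, here is a small sketch that reads the same kind of content from an in-memory string. The text variable is a made-up stand-in for the contents of text_file.

```python
import io
import pandas as pd

# Made-up stand-in for the contents of text_file.
text = "Name Age\nRitesh 32\nPriyanka 29\n"

# The default separator is a comma, so each line ends up in a single column.
default_df = pd.read_csv(io.StringIO(text))
print(list(default_df.columns))

# With sep=" " the header splits into the two expected columns.
spaced_df = pd.read_csv(io.StringIO(text), sep=" ")
print(list(spaced_df.columns))
print(spaced_df["Name"].where(spaced_df["Name"] == "Ritesh"))
```

With the default separator the only column is named 'Name Age', which is why looking up 'Name' raises a KeyError.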

AttributeError: 'bool' object has no attribute 'strftime'

I am inheriting the 'account.partner.ledger' module. When we select a customer, we can print the report of that customer's ledger. In the partner ledger menu I want the 'include Initial Balances' checkbox to be checked by default if the filter is by date/period. I tried to override the method in my custom module, but I am unable to solve the error I am getting.
Code,
@api.multi
def onchange_filter(self, filter='filter_no', fiscalyear_id=False):
    res = super(account_partner_ledger, self).onchange_filter(filter=filter, fiscalyear_id=fiscalyear_id)
    if filter in ['filter_no', 'unreconciled']:
        if filter == 'unreconciled':
            res['value'].update({'fiscalyear_id': False})
        res['value'].update({'initial_balance': False, 'period_from': False, 'period_to': False, 'date_from': False, 'date_to': False})
    if filter in ['filter_date', 'filter_period']:
        res['value'].update({'initial_balance': True, 'period_from': True, 'period_to': True, 'date_from': True, 'date_to': True})
    return res
Error,
Traceback (most recent call last):
File "C:\Users\zendynamix\odooGit\odoo8\openerp\http.py", line 544, in _handle_exception
return super(JsonRequest, self)._handle_exception(exception)
File "C:\Users\zendynamix\odooGit\odoo8\openerp\http.py", line 581, in dispatch
result = self._call_function(**self.params)
File "C:\Users\zendynamix\odooGit\odoo8\openerp\http.py", line 317, in _call_function
return checked_call(self.db, *args, **kwargs)
File "C:\Users\zendynamix\odooGit\odoo8\openerp\service\model.py", line 118, in wrapper
return f(dbname, *args, **kwargs)
File "C:\Users\zendynamix\odooGit\odoo8\openerp\http.py", line 314, in checked_call
return self.endpoint(*a, **kw)
File "C:\Users\zendynamix\odooGit\odoo8\openerp\http.py", line 810, in __call__
return self.method(*args, **kw)
File "C:\Users\zendynamix\odooGit\odoo8\openerp\http.py", line 410, in response_wrap
response = f(*args, **kw)
File "C:\Users\zendynamix\odooGit\odoo8\addons\web\controllers\main.py", line 944, in call_kw
return self._call_kw(model, method, args, kwargs)
File "C:\Users\zendynamix\odooGit\odoo8\addons\web\controllers\main.py", line 936, in _call_kw
return getattr(request.registry.get(model), method)(request.cr, request.uid, *args, **kwargs)
File "C:\Users\zendynamix\odooGit\odoo8\openerp\api.py", line 268, in wrapper
return old_api(self, *args, **kwargs)
File "C:\Users\zendynamix\odooGit\odoo8\openerp\api.py", line 399, in old_api
result = method(recs, *args, **kwargs)
File "C:\Users\zendynamix\odooGit\odoo8\openerp\models.py", line 5985, in onchange
record._onchange_eval(name, field_onchange[name], result)
File "C:\Users\zendynamix\odooGit\odoo8\openerp\models.py", line 5883, in _onchange_eval
self.update(self._convert_to_cache(method_res['value'], validate=False))
File "C:\Users\zendynamix\odooGit\odoo8\openerp\models.py", line 5391, in _convert_to_cache
for name, value in values.iteritems()
File "C:\Users\zendynamix\odooGit\odoo8\openerp\models.py", line 5392, in <dictcomp>
if name in fields
File "C:\Users\zendynamix\odooGit\odoo8\openerp\fields.py", line 1250, in convert_to_cache
return self.to_string(value)
File "C:\Users\zendynamix\odooGit\odoo8\openerp\fields.py", line 1240, in to_string
return value.strftime(DATE_FORMAT) if value else False
AttributeError: 'bool' object has no attribute 'strftime'
Sometimes you have to look at the underlying code to understand what is going on. You are getting the error because Odoo is trying to convert a Boolean back to the string representation of a date (it expects a Python date object).
You can fire up a terminal and reproduce your error:
>>> True.strftime
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'bool' object has no attribute 'strftime'
>>>
This is the to_string method from Odoo:
@staticmethod
def to_string(value):
    """ Convert a :class:`date` value into the format expected by the ORM. """
    return value.strftime(DATE_FORMAT) if value else False
The condition if value tests whether value evaluates to True. Testing from the terminal:
>>> x = ''
>>> if x: print('Yeah')
...
>>>
>>> x = True
>>> if x: print('Yeah')
...
Yeah
>>> x = False
>>> if x: print('Yeah')
...
>>>
>>>
From the output we can conclude that an empty string or False evaluates to False, while True evaluates to True. So instead of setting the date values to True, set all of them to empty strings.
@api.multi
def onchange_filter(self, filter='filter_no', fiscalyear_id=False):
    res = super(account_partner_ledger, self).onchange_filter(filter=filter, fiscalyear_id=fiscalyear_id)
    if filter in ['filter_no', 'unreconciled']:
        if filter == 'unreconciled':
            res['value'].update({'fiscalyear_id': False})
        res['value'].update({'initial_balance': False, 'period_from': False, 'period_to': False, 'date_from': False, 'date_to': False})
    if filter in ['filter_date', 'filter_period']:
        # Only the checkbox is set to True; the date and period fields stay empty
        res['value'].update({'initial_balance': True, 'period_from': '', 'period_to': '', 'date_from': '', 'date_to': ''})
    return res
When you look at your code you'll see:
'date_from': True ,'date_to': True
This causes your error.
You should set those fields to a date not to a Boolean.
The value False is valid, since you should be able to not fill in a date.
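The behaviour can be reproduced outside Odoo with a small stand-alone sketch. The function below copies the shape of the to_string method quoted above; the DATE_FORMAT value is an assumption made for illustration, not taken from the Odoo source.

```python
from datetime import date

DATE_FORMAT = "%Y-%m-%d"  # assumed value, for illustration only


def to_string(value):
    """Format a date, and pass falsy values through as False."""
    return value.strftime(DATE_FORMAT) if value else False


# A real date is formatted; False and '' are falsy, so strftime is never called.
print(to_string(date(2015, 1, 31)))
print(to_string(False))
print(to_string(""))

# True is truthy, so strftime is called on a bool and fails.
error_message = None
try:
    to_string(True)
except AttributeError as exc:
    error_message = str(exc)
print(error_message)
```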
Try using strptime instead of strftime and see if it solves the problem.
For example, you can use strptime as follows:
from datetime import datetime
from openerp.tools import DEFAULT_SERVER_DATETIME_FORMAT
my_date = datetime.strptime(self.date_column, DEFAULT_SERVER_DATETIME_FORMAT)

Pandas and timeseries

I have a dictionary of dataframes. I want to convert each dataframe in it to its respective timeseries. I am able to convert one nicely. But, if I do it within an iterator, it complains. Eg:
This works:
df = dfDict[4]
df['start_date'] = pd.to_datetime(df['start_date'])
df.set_index('start_date', inplace = True)
df.sort_index(inplace = True)
print df.head() works nicely.
But, this doesn't work:
tsDict = {}
for id, df in dfDict.iteritems():
    df['start_date'] = pd.to_datetime(df['start_date'])
    df.set_index('start_date', inplace = True)
    df.sort_index(inplace = True)
    tsDict[id] = df
It gives the following error message:
Traceback (most recent call last):
File "tsa.py", line 105, in <module>
main()
File "tsa.py", line 84, in main
df['start_date'] = pd.to_datetime(df['start_date'])
File "/usr/local/lib/python2.7/dist-packages/pandas/core/frame.py", line 1997, in __getitem__
return self._getitem_column(key)
File "/usr/local/lib/python2.7/dist-packages/pandas/core/frame.py", line 2004, in _getitem_column
return self._get_item_cache(key)
File "/usr/local/lib/python2.7/dist-packages/pandas/core/generic.py", line 1350, in _get_item_cache
values = self._data.get(item)
File "/usr/local/lib/python2.7/dist-packages/pandas/core/internals.py", line 3290, in get
loc = self.items.get_loc(item)
File "/usr/local/lib/python2.7/dist-packages/pandas/indexes/base.py", line 1947, in get_loc
return self._engine.get_loc(self._maybe_cast_indexer(key))
File "pandas/index.pyx", line 137, in pandas.index.IndexEngine.get_loc (pandas/index.c:4154)
File "pandas/index.pyx", line 159, in pandas.index.IndexEngine.get_loc (pandas/index.c:4018)
File "pandas/hashtable.pyx", line 675, in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12368)
File "pandas/hashtable.pyx", line 683, in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12322)
KeyError: 'start_date'
I am unable to see the subtle problem here...
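The traceback says that 'start_date' is not a column of the frame the loop is processing at that point. One possible cause is that set_index(..., inplace=True) already moved the column into the index of that frame in an earlier run, for example in the df = dfDict[4] snippet above. This is an assumption, since the question does not show how dfDict was built. A sketch of a non-mutating version that can be run repeatedly, written for Python 3 with made-up sample data and a hypothetical helper name:

```python
import pandas as pd


def to_time_series(df):
    """Return a new frame indexed by start_date; the input is left untouched."""
    out = df.copy()
    out["start_date"] = pd.to_datetime(out["start_date"])
    return out.set_index("start_date").sort_index()


# Made-up stand-in for dfDict.
df_dict = {
    1: pd.DataFrame({"start_date": ["2016-03-02", "2016-03-01"], "value": [2, 1]}),
    4: pd.DataFrame({"start_date": ["2016-01-05", "2016-01-04"], "value": [20, 10]}),
}

ts_dict = {key: to_time_series(df) for key, df in df_dict.items()}

# The originals still have the column, so a second pass does not raise KeyError.
ts_dict_again = {key: to_time_series(df) for key, df in df_dict.items()}
print(ts_dict[4])
```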

Arithmetic in pandas HDF5 queries

Why am I getting an error when I try to do simple arithmetic on constants in an HDF5 where clause? Here's an example:
>>> import pandas
>>> import numpy as np
>>> d = pandas.DataFrame({"A": np.arange(10), "B": np.random.randint(1, 100, 10)})
>>> store = pandas.HDFStore('teststore.h5', mode='w')
>>> store.append('thingy', d, format='table', data_columns=True, append=False)
>>> store.select('thingy', where="B>50")
A B
0 0 61
1 1 63
6 6 80
7 7 79
8 8 52
9 9 82
>>> store.select('thingy', where="B>40+10")
Traceback (most recent call last):
File "<pyshell#26>", line 1, in <module>
store.select('thingy', where="B>40+10")
File "c:\users\brenbarn\documents\python\extensions\pandas\pandas\io\pytables.py", line 682, in select
return it.get_result()
File "c:\users\brenbarn\documents\python\extensions\pandas\pandas\io\pytables.py", line 1365, in get_result
results = self.func(self.start, self.stop, where)
File "c:\users\brenbarn\documents\python\extensions\pandas\pandas\io\pytables.py", line 675, in func
columns=columns, **kwargs)
File "c:\users\brenbarn\documents\python\extensions\pandas\pandas\io\pytables.py", line 4006, in read
if not self.read_axes(where=where, **kwargs):
File "c:\users\brenbarn\documents\python\extensions\pandas\pandas\io\pytables.py", line 3212, in read_axes
self.selection = Selection(self, where=where, **kwargs)
File "c:\users\brenbarn\documents\python\extensions\pandas\pandas\io\pytables.py", line 4527, in __init__
self.condition, self.filter = self.terms.evaluate()
File "c:\users\brenbarn\documents\python\extensions\pandas\pandas\computation\pytables.py", line 580, in evaluate
self.condition = self.terms.prune(ConditionBinOp)
File "c:\users\brenbarn\documents\python\extensions\pandas\pandas\computation\pytables.py", line 122, in prune
res = pr(left.value, right.prune(klass))
File "c:\users\brenbarn\documents\python\extensions\pandas\pandas\computation\pytables.py", line 118, in prune
res = pr(left.value, right.value)
File "c:\users\brenbarn\documents\python\extensions\pandas\pandas\computation\pytables.py", line 113, in pr
encoding=self.encoding).evaluate()
File "c:\users\brenbarn\documents\python\extensions\pandas\pandas\computation\pytables.py", line 317, in evaluate
raise ValueError("query term is not valid [%s]" % self)
ValueError: query term is not valid [[Condition : [None]]]
Querying directly on the underlying pytables object seems to work:
>>> for row in store.get_storer('thingy').table.where("B>40+10"):
... print(row[:])
(0L, 0, 61)
(1L, 1, 63)
(6L, 6, 80)
(7L, 7, 79)
(8L, 8, 52)
(9L, 9, 82)
So what is going on here?
This is simply not supported, though I suppose it could fail with a slightly better message. It is trying to AND the two nodes (the comparison and the +10) and doesn't know how to deal with that, as the second one is not a comparison operation.
I suppose it could be implemented, but IMHO it is needlessly complex.
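A simple way around this, sketched here as a suggestion rather than something the answer above prescribes, is to do the arithmetic in Python first and pass the store a plain comparison. The names threshold and where are made up for illustration.

```python
# Do the arithmetic in Python, then build a where clause with a single constant.
threshold = 40 + 10
where = "B > {}".format(threshold)
print(where)

# With the store from the question (not opened in this sketch):
# store.select('thingy', where=where)
```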

How to avoid error "Cannot compare type 'Timestamp' with type 'str'" pandas 0.16.0

I have various dataframes with this format
df.index
<class 'pandas.tseries.index.DatetimeIndex'>
[2009-10-23, ..., 2010-06-15]
Length: 161, Freq: None, Timezone: None
df.columns
Index(['A', 'B', 'C', 'D', 'E', 'F'], dtype='object')
Every now and then, when this line executes:
zeros_idx = df[ (df.A==0) | (df.B==0) | (df.C==0) | (df.D==0) ].index
I get the following error with this stack trace:
zeros_idx = df[ (df.A==0) | (df.B==0) | (df.C==0) | (df.D==0) ].index
File "/usr/lib64/python3.4/site-packages/pandas/core/ops.py", line 811, in f
return self._combine_series(other, na_op, fill_value, axis, level)
File "/usr/lib64/python3.4/site-packages/pandas/core/frame.py", line 3158, in _combine_series
return self._combine_match_columns(other, func, level=level, fill_value=fill_value)
File "/usr/lib64/python3.4/site-packages/pandas/core/frame.py", line 3191, in _combine_match_columns
left, right = self.align(other, join='outer', axis=1, level=level, copy=False)
File "/usr/lib64/python3.4/site-packages/pandas/core/generic.py", line 3143, in align
fill_axis=fill_axis)
File "/usr/lib64/python3.4/site-packages/pandas/core/generic.py", line 3225, in _align_series
return_indexers=True)
File "/usr/lib64/python3.4/site-packages/pandas/core/index.py", line 1810, in join
return_indexers=return_indexers)
File "/usr/lib64/python3.4/site-packages/pandas/tseries/index.py", line 904, in join
return_indexers=return_indexers)
File "/usr/lib64/python3.4/site-packages/pandas/core/index.py", line 1820, in join
return_indexers=return_indexers)
File "/usr/lib64/python3.4/site-packages/pandas/core/index.py", line 1830, in join
return_indexers=return_indexers)
File "/usr/lib64/python3.4/site-packages/pandas/core/index.py", line 2083, in _join_monotonic
join_index, lidx, ridx = self._outer_indexer(sv, ov)
File "pandas/src/generated.pyx", line 8558, in pandas.algos.outer_join_indexer_object (pandas/algos.c:157803)
File "pandas/tslib.pyx", line 823, in pandas.tslib._Timestamp.__richcmp__ (pandas/tslib.c:15585)
TypeError: Cannot compare type 'Timestamp' with type 'str'