Ignore pandas KeyError: value not in index

Is there a way to suppress the pandas KeyError: '[x]' not in index? For example, if I have a DataFrame with columns A, B, C, and I call df[['A','B','C','D']], is it possible to have it just return A, B, C and ignore D if it does not exist?
Example code
import pandas as pd
import numpy as np
a = np.array([[1, 4, 5], [1, 2, 2], [9, 7, 5]])
df = pd.DataFrame(a,columns=['A','B','C'])
df[['A','B','C','D']]
Here's the error message
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3/dist-packages/pandas/core/frame.py", line 2133, in __getitem__
    return self._getitem_array(key)
  File "/usr/lib/python3/dist-packages/pandas/core/frame.py", line 2177, in _getitem_array
    indexer = self.loc._convert_to_indexer(key, axis=1)
  File "/usr/lib/python3/dist-packages/pandas/core/indexing.py", line 1269, in _convert_to_indexer
    .format(mask=objarr[mask]))
KeyError: "['D'] not in index"

Use the intersection of your desired list with the existing columns when selecting. You get every requested column that exists, the missing ones are silently dropped, and no error is raised.
l = ['A', 'B', 'C', 'D']
df[df.columns.intersection(l)]
A B C
0 1 4 5
1 1 2 2
2 9 7 5
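DataFrame.filter behaves the same forgiving way: with items=, labels that don't exist in the frame are simply ignored:
df.filter(items=l)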

If you definitely want a column D (filled with NaN where it's missing), you can reindex() on axis=1:
l = ['A', 'B', 'C', 'D']
df.reindex(l, axis=1)
A B C D
0 1 4 5 NaN
1 1 2 2 NaN
2 9 7 5 NaN
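One caveat: intersection follows the DataFrame's column order rather than your list's. If you want the missing labels dropped but the order of your list preserved, a plain list comprehension also does the job:
df[[c for c in l if c in df.columns]]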

Related

Confusing indexer of pandas

I find the bracket indexer ([]) very confusing.
import pandas as pd
import numpy as np
aa = np.asarray([[1,2,3],[4,5,6],[7,8,9]])
df = pd.DataFrame(aa)
df
output
0 1 2
0 1 2 3
1 4 5 6
2 7 8 9
Then I tried to index it with []
df[1]
The output is below; it seems to get me the values of a column:
0 2
1 5
2 8
but... when I do
df[1:3]
it gets me the rows...
0 1 2
1 4 5 6
2 7 8 9
Besides that, it does not allow me to do
df[1,2]
it gives me an error:
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
Untitled-1.ipynb Cell 19' in <cell line: 1>()
----> 1 df[1,2]
File d:\ProgramData\Miniconda3\lib\site-packages\pandas\core\frame.py:3458, in DataFrame.__getitem__(self, key)
3456 if self.columns.nlevels > 1:
3457 return self._getitem_multilevel(key)
-> 3458 indexer = self.columns.get_loc(key)
3459 if is_integer(indexer):
3460 indexer = [indexer]
File d:\ProgramData\Miniconda3\lib\site-packages\pandas\core\indexes\range.py:388, in RangeIndex.get_loc(self, key, method, tolerance)
386 except ValueError as err:
387 raise KeyError(key) from err
--> 388 raise KeyError(key)
389 return super().get_loc(key, method=method, tolerance=tolerance)
KeyError: (1, 2)
Should I avoid using [] and always use loc and iloc instead?
In pandas, if you want to select values by numeric position, use iloc. A DataFrame has 2 axes, so to select a specific cell you have to specify both axes (row and column); see the code below. As for the confusion itself: the bare [] indexer is overloaded. A single key is looked up as a column label, so df[1] returns the column named 1; a slice selects rows, so df[1:3] returns rows; and df[1,2] fails because pandas looks for a single column named the tuple (1, 2).
df.iloc[0,0] # this should return the value 1
df.iloc[0,:] # this returns the first row
df.iloc[:,0] # first column
df.iloc[:2,:2] # this returns a slice of the dataframe which is the first two rows with the first two columns
To select values by labels (column names and index labels), use loc.
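For instance, with the same df (its row and column labels happen to be the integers 0 to 2, so loc takes integer labels here):
df.loc[0, 1]      # row label 0, column label 1: the value 2
df.loc[0, :]      # the row labeled 0
df.loc[:, 1]      # the column labeled 1
df.loc[0:2, 0:1]  # label slices include both endpoints, unlike iloc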

TypeError when using chunksize argument to pandas method pd.read_csv()

I have a csv file like this:
1 1.1 0 0.1 13.1494 32.7957 2.27266 0.2 3 5.4 ... \
0 2 2 0 8.17680 4.76726 25.6957 1.13633 0 3 4.8 ...
1 3 0 0 8.22718 2.35340 15.2934 1.13633 0 3 4.8 ...
I read the file using pandas.read_csv:
data_raw = pd.read_csv(filename, chunksize=chunksize)
Now, I want to make a dataframe:
df = pd.DataFrame(data_raw, columns=['id', 'colNam1', 'colNam2', 'colNam3',...])
But I ran into a problem:
  File "test.py", line 143, in <module>
    data = load_frame(csvfile)
  File "test.py", line 53, in load_frame
    'id', 'colNam1', 'colNam2', 'colNam3',...])
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/frame.py", line 325, in __init__
    raise TypeError("data argument can't be an iterator")
TypeError: data argument can't be an iterator
I don't know why.
This is because what is returned when you pass chunksize as a param to read_csv is an iterator of DataFrames rather than a DataFrame as such.
To demonstrate:
In [67]:
import io
import pandas as pd
t="""a b
0 -0.278303 -1.625377
1 -1.954218 0.843397
2 1.213572 -0.098594"""
df = pd.read_csv(io.StringIO(t), chunksize=1)
df
Out[67]:
<pandas.io.parsers.TextFileReader at 0x7e9e8d0>
You can see that df here is in this case not a DataFrame but a TextFileReader object.
It's unclear to me what you're really trying to achieve, but if you want to read a specific number of rows you can pass nrows instead:
In [69]:
t="""a b
0 -0.278303 -1.625377
1 -1.954218 0.843397
2 1.213572 -0.098594"""
df = pd.read_csv(io.StringIO(t), nrows=1)
df
Out[69]:
a b
0 0 -0.278303 -1.625377
The idea here with your original problem is that you need to iterate over it in order to get the chunks:
In [73]:
for r in df:
    print(r)
a b
0 0 -0.278303 -1.625377
a b
1 1 -1.954218 0.843397
a b
2 2 1.213572 -0.098594
If you want to generate a df from the chunks you need to append to a list and then call concat:
In [77]:
df_list = []
for r in df:
    df_list.append(r)
pd.concat(df_list)
Out[77]:
a b
0 0 -0.278303 -1.625377
1 1 -1.954218 0.843397
2 2 1.213572 -0.098594
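As a side note, pd.concat accepts any iterable of DataFrames, so you can also feed it the reader directly. Just note that a TextFileReader can only be consumed once, so this assumes a freshly created one:
pd.concat(pd.read_csv(io.StringIO(t), chunksize=1))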

In Python Pandas using cumsum with groupby

I am trying to do a pandas cumsum() where I want the value to reset to 0 every time the group changes.
Say I have the DataFrame below, where after grouping I have col2 (Group) and expect col3 (Cumsum) from the function:
Value Group Cumsum
a 1 0
a 1 1
a 1 2
b 2 0
b 2 1
b 2 2
b 2 3
c 3 0
c 3 1
d 4 0
This doesn't work:
df['Cumsum'] = df['Group'].cumsum()
Please advise.
Thanks!
Hmm, this turned out more complicated than I imagined, due to getting the groups' keys back in. Perhaps someone else will find something shorter.
First, imports
import pandas as pd
import itertools
Now a DataFrame:
df = pd.DataFrame({
    'a': ['a', 'b', 'a', 'b'],
    'b': [0, 1, 2, 3]})
So now we separately do a groupby-cumsum, some itertools stuff for finding the keys, and combine both:
>>> pd.DataFrame({
...     'keys': list(itertools.chain.from_iterable(
...         [len(g) * [k] for k, g in df.b.groupby(df.a)])),
...     'cumsum': df.b.groupby(df.a).cumsum()})
cumsum keys
0 0 a
1 1 a
2 2 b
3 4 b
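A word of caution about the above: the itertools step emits the keys in grouped order while the cumsum stays aligned to the original index, so on unsorted input the two columns disagree (in the output above, rows 1 and 2 carry the wrong key, since df.a is ['a', 'b', 'a', 'b']). Because df.a already holds the key for every row, a shorter and safer version is:
pd.DataFrame({'keys': df.a, 'cumsum': df.b.groupby(df.a).cumsum()})
And if the goal is, as in the question's expected output, a counter that restarts at 0 within each group, groupby followed by cumcount() gives exactly that:
df['Cumsum'] = df.groupby('Group').cumcount()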

rolling majority on non-numeric data

Given a dataframe:
df = pd.DataFrame({'a' : [1,1,1,1,1,2,1,2,2,2,2]})
I'd like to replace every value in column 'a' by the majority of values around 'a'. For numerical data, I can do this:
import scipy.stats

def majority(window):
    freqs = scipy.stats.itemfreq(window)
    max_votes = freqs[:, 1].argmax()
    return freqs[max_votes, 0]
df['a'] = pd.rolling_apply(df['a'], 3, majority)
And I get:
In [43]: df
Out[43]:
a
0 NaN
1 NaN
2 1
3 1
4 1
5 1
6 1
7 2
8 2
9 2
10 2
I'll have to deal with the NaNs, but apart from that, this is more or less what I want... Except, I'd like to do the same thing with non-numerical columns, but Pandas does not seem to support this:
In [47]: df['b'] = list('aaaababbbba')
In [49]: df['b'] = pd.rolling_apply(df['b'], 3, majority)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-49-507f45aab92c> in <module>()
----> 1 df['b'] = pd.rolling_apply(df['b'], 3, majority)
/usr/local/lib/python2.7/dist-packages/pandas/stats/moments.pyc in rolling_apply(arg, window, func, min_periods, freq, center, args, kwargs)
751 return algos.roll_generic(arg, window, minp, offset, func, args, kwargs)
752 return _rolling_moment(arg, window, call_cython, min_periods, freq=freq,
--> 753 center=False, args=args, kwargs=kwargs)
754
755
/usr/local/lib/python2.7/dist-packages/pandas/stats/moments.pyc in _rolling_moment(arg, window, func, minp, axis, freq, center, how, args, kwargs, **kwds)
382 arg = _conv_timerule(arg, freq, how)
383
--> 384 return_hook, values = _process_data_structure(arg)
385
386 if values.size == 0:
/usr/local/lib/python2.7/dist-packages/pandas/stats/moments.pyc in _process_data_structure(arg, kill_inf)
433
434 if not issubclass(values.dtype.type, float):
--> 435 values = values.astype(float)
436
437 if kill_inf:
ValueError: could not convert string to float: a
I've tried converting the column to a Categorical, but even then I get the same error. I can first convert to a Categorical, work on the codes and finally convert back from codes to labels, but that seems really convoluted.
Is there an easier/more natural solution?
(BTW: I'm limited to NumPy 1.8.2 so I have to use itemfreq instead of unique, see here.)
Here is a way, using pd.Categorical:
import scipy.stats as stats
import pandas as pd
def majority(window):
    freqs = stats.itemfreq(window)
    max_votes = freqs[:, 1].argmax()
    return freqs[max_votes, 0]
df = pd.DataFrame({'a' : [1,1,1,1,1,2,1,2,2,2,2]})
df['a'] = pd.rolling_apply(df['a'], 3, majority)
df['b'] = list('aaaababbbba')
cat = pd.Categorical(df['b'])
df['b'] = pd.rolling_apply(cat.codes, 3, majority)
df['b'] = df['b'].map(pd.Series(cat.categories))
print(df)
yields
a b
0 NaN NaN
1 NaN NaN
2 1 a
3 1 a
4 1 a
5 1 a
6 1 b
7 2 b
8 2 b
9 2 b
10 2 b
Here is one way to do it by defining your own rolling apply function.
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 1, 1, 1, 2, 1, 2, 2, 2, 2]})
df['b'] = np.where(df.a == 1, 'A', 'B')
print(df)
Out[60]:
a b
0 1 A
1 1 A
2 1 A
3 1 A
4 1 A
5 2 B
6 1 A
7 2 B
8 2 B
9 2 B
10 2 B
def get_mode_from_Series(series):
    return series.value_counts().index[0]

def my_rolling_apply_char(frame, window, func):
    index = frame.index[window-1:]
    values = [func(frame.iloc[i:i+window]) for i in range(len(frame)-window+1)]
    return pd.Series(data=values, index=index).reindex(frame.index)
my_rolling_apply_char(df.b, 3, get_mode_from_Series)
Out[61]:
0 NaN
1 NaN
2 A
3 A
4 A
5 A
6 A
7 B
8 B
9 B
10 B
dtype: object
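For readers on current pandas: pd.rolling_apply was later removed in favor of Series.rolling(...).apply(...), which still insists on numeric data, so the Categorical round-trip from the first answer remains the trick. A rough sketch, assuming the same string column b as above:
cat = pd.Categorical(df['b'])
codes = pd.Series(cat.codes, dtype=float)  # rolling only works on numbers
maj = codes.rolling(3).apply(lambda w: w.value_counts().idxmax(), raw=False)
df['b'] = maj.dropna().astype(int).map(dict(enumerate(cat.categories)))  # back to labels; leading windows stay NaN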

How to assign a Series to a DataFrame from a Panel?

I have a Panel
quotes_cc_returns
<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 132 (major_axis) x 7 (minor_axis)
Items axis: VFINX to VWESX
Major_axis axis: 2001-01-31 00:00:00 to 2011-12-31 00:00:00
Minor_axis axis: Open to CC_Returns
and I can add a column computed by a subtraction:
quotes_premiums = quotes_cc_returns.transpose(2, 1, 0)
quotes_premiums['RiskPremium'] = quotes_premiums.CC_Returns.sub(ff_data_factors_subset.RF, axis=0)
but I'm unable to add a column with a simple assignment
quotes_premiums['MktRiskPremium'] = ff_data_factors_subset.MktMinusRF
because it raises this error:
Traceback (most recent call last):
  File "D:\Program Files (x86)\Wing IDE 101 5.0\src\debug\tserver\_sandbox.py", line 411, in <module>
  File "D:\Program Files (x86)\Wing IDE 101 5.0\src\debug\tserver\_sandbox.py", line 243, in calcRiskPremiums
  File "D:\Python27\Lib\site-packages\pandas\core\panel.py", line 668, in __setitem__
    raise AssertionError()
AssertionError:
ff_data_factors_subset.MktMinusRF is a Series with the same length and index as quotes_premiums['MktRiskPremium'].
Thanks,
JM
The key is to use .loc to select the items and major axes where the new series should go. Here's an example that might help you sort it out.
In [16]: df = pd.DataFrame({"A": np.arange(6), 'B': ['one', 'one', 'two', 'two', 'one', 'one']})
In [17]: df
Out[17]:
A B
0 0 one
1 1 one
2 2 two
3 3 two
4 4 one
5 5 one
[6 rows x 2 columns]
In [18]: wp = pd.Panel({'L1': df, 'L2': df})
In [19]: other = pd.Series(np.arange(1, 7))
So to set it in just item 'L1':
In [20]: wp.loc['L1', :, 'other'] = other
In [22]: wp['L1']
Out[22]:
A B other
0 0 one 1
1 1 one 2
2 2 two 3
3 3 two 4
4 4 one 5
5 5 one 6
[6 rows x 3 columns]
I've solved it by first initializing the item with a scalar (which Panel.__setitem__ accepts) and then adding the Series with axis alignment:
quotes_premiums['MktRiskPremium'] = 0.0
quotes_premiums['MktRiskPremium'] = quotes_premiums.MktRiskPremium.add(ff_data_factors_subset.MktMinusRF, axis=0)
JM
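A closing note for readers on current pandas: Panel was deprecated in 0.20 and removed in 0.25, and the usual replacement is a DataFrame with a MultiIndex. The same "assign into a single item" idea from the answer above then looks roughly like this sketch:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': np.arange(6), 'B': ['one', 'one', 'two', 'two', 'one', 'one']})
wp = pd.concat({'L1': df, 'L2': df}, names=['item'])  # the items become the outer index level
other = pd.Series(np.arange(1, 7))

wp['other'] = np.nan                  # create the column first
wp.loc['L1', 'other'] = other.values  # then fill it for item 'L1' only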