Confusing indexer of pandas - pandas

I find the bracket indexer ([]) very confusing.
import pandas as pd
import numpy as np
aa = np.asarray([[1,2,3],[4,5,6],[7,8,9]])
df = pd.DataFrame(aa)
df
output
0 1 2
0 1 2 3
1 4 5 6
2 7 8 9
Then I tried to index it with []:
df[1]
The output is below; it seems to give me the values of a column:
0 2
1 5
2 8
But when I do
df[1:3]
it gives me rows instead:
0 1 2
1 4 5 6
2 7 8 9
Besides that, it does not allow me to do
df[1,2]
It raises an error:
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
Untitled-1.ipynb Cell 19' in <cell line: 1>()
----> 1 df[1,2]
File d:\ProgramData\Miniconda3\lib\site-packages\pandas\core\frame.py:3458, in DataFrame.__getitem__(self, key)
3456 if self.columns.nlevels > 1:
3457 return self._getitem_multilevel(key)
-> 3458 indexer = self.columns.get_loc(key)
3459 if is_integer(indexer):
3460 indexer = [indexer]
File d:\ProgramData\Miniconda3\lib\site-packages\pandas\core\indexes\range.py:388, in RangeIndex.get_loc(self, key, method, tolerance)
386 except ValueError as err:
387 raise KeyError(key) from err
--> 388 raise KeyError(key)
389 return super().get_loc(key, method=method, tolerance=tolerance)
KeyError: (1, 2)
Should I avoid using [] and always use loc and iloc instead?

In pandas, to select values by integer position, use iloc. A DataFrame has two axes, so to select a specific cell you have to specify both of them (row and column). See the code:
df.iloc[0,0] # this should return the value 1
df.iloc[0,:] # this returns the first row
df.iloc[:,0] # first column
df.iloc[:2,:2] # this returns a slice of the dataframe which is the first two rows with the first two columns
To select values by labels (column names and index labels), use loc.
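To make the three cases from the question concrete, here is a minimal sketch rebuilt from the question's df (the variable names are only illustrative). With [], a single label is looked up among the columns, a slice is applied to the rows, and 1,2 is treated as one tuple-valued column label, which does not exist here. loc and iloc take the row and column selectors explicitly.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.asarray([[1, 2, 3], [4, 5, 6], [7, 8, 9]]))

# A single label inside [] is looked up among the column labels.
col = df[1]

# A slice inside [] selects rows instead.
rows = df[1:3]

# df[1, 2] passes the tuple (1, 2) as ONE column label, hence the KeyError.
try:
    df[1, 2]
    tuple_key_failed = False
except KeyError:
    tuple_key_failed = True

# Explicit row/column selection, by label and by position.
cell_by_label = df.loc[1, 2]
cell_by_position = df.iloc[1, 2]

print(col.tolist(), rows.index.tolist(), tuple_key_failed)
print(cell_by_label, cell_by_position)
```

Because this frame uses the default integer labels, loc and iloc happen to agree; with string labels or a reordered index they would not.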

Related

Why does .loc not always match column names?

I noticed this today and wanted to ask because I am a little confused about this.
Let's say we have two DataFrames:
df = pd.DataFrame(np.random.randint(0,9,size=(5,3)),columns = list('ABC'))
A B C
0 3 1 6
1 2 4 0
2 8 8 0
3 8 6 7
4 4 5 0
df2 = pd.DataFrame(np.random.randint(0,9,size=(5,3)),columns = list('CBA'))
C B A
0 3 5 5
1 7 4 6
2 0 7 7
3 6 6 5
4 4 0 6
If we wanted to conditionally overwrite rows in the first df with values from the second, we could do this:
df.loc[df['A'].gt(3)] = df2
I would expect the columns to be aligned and, if there were missing columns, the corresponding values in the first df to be filled with NaN. However, when the above code is run, it replaces the data without taking the column names into account (it does take the index labels into account, however):
A B C
0 3 1 6
1 2 4 0
2 0 7 7
3 6 6 5
4 4 0 6
At index 2, instead of [7,7,0] we have [0,7,7].
However, if we pass the names of the columns into the loc statement, without changing the order of the columns in df2, it aligns with the columns.
df.loc[df['A'].gt(3),['A','B','C']] = df2
A B C
0 3 1 6
1 2 4 0
2 7 7 0
3 5 6 6
4 6 0 4
Why does this happen?
Interestingly, loc performs a number of optimizations to improve performance. One of them is checking the type of the index passed in.
Both Row and Column Indexes Included
When passing both a row index and a column index the __setitem__ function:
def __setitem__(self, key, value):
if isinstance(key, tuple):
key = tuple(com.apply_if_callable(x, self.obj) for x in key)
else:
key = com.apply_if_callable(key, self.obj)
indexer = self._get_setitem_indexer(key)
self._has_valid_setitem_indexer(key)
iloc = self if self.name == "iloc" else self.obj.iloc
iloc._setitem_with_indexer(indexer, value, self.name)
Interprets the key as a tuple.
key:
(0 False
1 False
2 True
3 True
4 True
Name: A, dtype: bool,
['A', 'B', 'C'])
This is then passed to _get_setitem_indexer to convert to a positional indexer from label-based:
indexer = self._get_setitem_indexer(key)
def _get_setitem_indexer(self, key):
"""
Convert a potentially-label-based key into a positional indexer.
"""
if self.name == "loc":
self._ensure_listlike_indexer(key)
if self.axis is not None:
return self._convert_tuple(key, is_setter=True)
ax = self.obj._get_axis(0)
if isinstance(ax, ABCMultiIndex) and self.name != "iloc":
with suppress(TypeError, KeyError, InvalidIndexError):
# TypeError e.g. passed a bool
return ax.get_loc(key)
if isinstance(key, tuple):
with suppress(IndexingError):
return self._convert_tuple(key, is_setter=True)
if isinstance(key, range):
return list(key)
try:
return self._convert_to_indexer(key, axis=0, is_setter=True)
except TypeError as e:
# invalid indexer type vs 'other' indexing errors
if "cannot do" in str(e):
raise
elif "unhashable type" in str(e):
raise
raise IndexingError(key) from e
This generates a tuple indexer (both rows and columns are converted):
if isinstance(key, tuple):
with suppress(IndexingError):
return self._convert_tuple(key, is_setter=True)
returns
(array([2, 3, 4], dtype=int64), array([0, 1, 2], dtype=int64))
Only Row Index Included
However, when only a row index is passed to loc, the key is not a tuple (and a boolean Series is not a range either), so only a single dimension is converted from label to positional:
try:
return self._convert_to_indexer(key, axis=0, is_setter=True)
returns
[2 3 4]
For this reason, no alignment happens among columns when only a single value is passed to loc, as no parsing is done to align the columns.
That is why an empty slice is often used:
df.loc[df['A'].gt(3), :] = df2
This is sufficient to align the columns appropriately.
import numpy as np
import pandas as pd
np.random.seed(5)
df = pd.DataFrame(np.random.randint(0, 9, size=(5, 3)), columns=list('ABC'))
df2 = pd.DataFrame(np.random.randint(0, 9, size=(5, 3)), columns=list('CBA'))
print(df)
print(df2)
df.loc[df['A'].gt(3), :] = df2
print(df)
Example:
df:
A B C
0 3 6 6
1 0 8 4
2 7 0 0
3 7 1 5
4 7 0 1
df2:
C B A
0 4 6 2
1 1 2 7
2 0 5 0
3 0 4 4
4 3 2 4
df.loc[df['A'].gt(3), :] = df2:
A B C
0 3 6 6
1 0 8 4
2 0 5 0
3 4 4 0 # Aligned as expected
4 4 2 3
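If you would rather not depend on how a particular pandas version aligns the right-hand side, one defensive option is to reorder it yourself before assigning. The following is only a sketch, using the literal values printed in the example above instead of the seeded random data; selecting df2 with the same mask and df's own column order makes the labels line up whichever way the assignment is interpreted.

```python
import pandas as pd

# Same values as the printed example, written out literally.
df = pd.DataFrame({"A": [3, 0, 7, 7, 7],
                   "B": [6, 8, 0, 1, 0],
                   "C": [6, 4, 0, 5, 1]})
df2 = pd.DataFrame({"C": [4, 1, 0, 0, 3],
                    "B": [6, 2, 5, 4, 2],
                    "A": [2, 7, 0, 4, 4]})

mask = df["A"].gt(3)

# Put the right-hand side in df's row selection and column order first,
# so a positional and a label-based assignment give the same result.
df.loc[mask, df.columns] = df2.loc[mask, df.columns]

print(df)
```

Rows 0 and 1 are untouched, and rows 2 to 4 take df2's values matched by column name.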

Ignore warning Pandas KeyError: value not in index

Is there a way to suppress the pandas KeyError: '[x]' not in index? For example, if I have a data frame with columns A B C, and I call df[['A','B','C','D']], is it possible to have it just return A,B,C and ignore D if it does not exist?
Example code
import pandas as pd
import numpy as np
a = np.matrix('[1,4,5];[1,2,2];[9,7,5]')
df = pd.DataFrame(a,columns=['A','B','C'])
df[['A','B','C','D']]
Here's the error message
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python3/dist-packages/pandas/core/frame.py", line 2133, in __getitem__
return self._getitem_array(key)
File "/usr/lib/python3/dist-packages/pandas/core/frame.py", line 2177, in _getitem_array
indexer = self.loc._convert_to_indexer(key, axis=1)
File "/usr/lib/python3/dist-packages/pandas/core/indexing.py", line 1269, in _convert_to_indexer
.format(mask=objarr[mask]))
KeyError: "['D'] not in index"
Select with the intersection of the DataFrame's columns and your desired list. You get every requested column that exists, and the missing ones are skipped instead of raising an error.
l = ['A', 'B', 'C', 'D']
df[df.columns.intersection(l)]
A B C
0 1 4 5
1 1 2 2
2 9 7 5
If you definitely want D, you can reindex() on axis=1:
l = ['A', 'B', 'C', 'D']
df.reindex(l, axis=1)
A B C D
0 1 4 5 NaN
1 1 2 2 NaN
2 9 7 5 NaN
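Both options can be put side by side in a small self-contained sketch. The names wanted, existing_only, in_wanted_order and with_missing are only illustrative; the list comprehension is an extra variant, not taken from the answer above, for the case where the order of the requested list should be kept.

```python
import pandas as pd

df = pd.DataFrame([[1, 4, 5], [1, 2, 2], [9, 7, 5]], columns=["A", "B", "C"])
wanted = ["A", "B", "C", "D"]

# Option 1: keep only the requested columns that actually exist.
existing_only = df[df.columns.intersection(wanted)]

# Variant: a plain membership test keeps the order of `wanted`.
in_wanted_order = df[[c for c in wanted if c in df.columns]]

# Option 2: keep every requested column; missing ones are filled with NaN.
with_missing = df.reindex(columns=wanted)

print(existing_only.columns.tolist())
print(with_missing)
```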

TypeError when using chunksize argument to pandas method pd.read_csv()

I have a csv file like this:
1 1.1 0 0.1 13.1494 32.7957 2.27266 0.2 3 5.4 ... \
0 2 2 0 8.17680 4.76726 25.6957 1.13633 0 3 4.8 ...
1 3 0 0 8.22718 2.35340 15.2934 1.13633 0 3 4.8 ...
I read the file using pandas.read_csv:
data_raw = pd.read_csv(filename, chunksize=chunksize)
Now, I want to make a dataframe:
df = pd.DataFrame(data_raw, columns=['id', 'colNam1', 'colNam2', 'colNam3',...])
But I ran into a problem:
File "test.py", line 143, in <module>
data = load_frame(csvfile)
File "test.py", line 53, in load_frame
'id', 'colNam1', 'colNam2', 'colNam3',...])
File "/usr/local/lib/python2.7/dist-packages/pandas/core/frame.py", line 325, in __init__
raise TypeError("data argument can't be an iterator")
TypeError: data argument can't be an iterator
I don't know why.
This is because passing chunksize to read_csv returns an iterable rather than a DataFrame.
To demonstrate:
In [67]:
import io
import pandas as pd
t="""a b
0 -0.278303 -1.625377
1 -1.954218 0.843397
2 1.213572 -0.098594"""
df = pd.read_csv(io.StringIO(t), chunksize=1)
df
Out[67]:
<pandas.io.parsers.TextFileReader at 0x7e9e8d0>
You can see that df in this case is not a DataFrame but a TextFileReader object.
It's unclear to me what you're really trying to achieve but if you want to read a specific number of rows you can pass nrows instead:
In [69]:
t="""a b
0 -0.278303 -1.625377
1 -1.954218 0.843397
2 1.213572 -0.098594"""
df = pd.read_csv(io.StringIO(t), nrows=1)
df
Out[69]:
a b
0 0 -0.278303 -1.625377
The idea with your original problem is that you need to iterate over the reader in order to get the chunks:
In [73]:
for r in df:
print(r)
a b
0 0 -0.278303 -1.625377
a b
1 1 -1.954218 0.843397
a b
2 2 1.213572 -0.098594
If you want to build a single df from the chunks, append them to a list and then call concat:
In [77]:
df_list=[]
for r in df:
df_list.append(r)
pd.concat(df_list)
Out[77]:
a b
0 0 -0.278303 -1.625377
1 1 -1.954218 0.843397
2 2 1.213572 -0.098594
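The whole flow can be condensed into one runnable sketch. The CSV text and the names csv_text, reader, chunks and combined are made up for illustration; the per-chunk step is where filtering or aggregation would normally go.

```python
import io

import pandas as pd

# Hypothetical in-memory CSV standing in for the real file.
csv_text = "a,b\n1,2\n3,4\n5,6\n"

# With chunksize, read_csv returns an iterator of DataFrames, not a DataFrame.
reader = pd.read_csv(io.StringIO(csv_text), chunksize=2)
reader_is_frame = isinstance(reader, pd.DataFrame)

chunks = []
for chunk in reader:
    # Each chunk is an ordinary DataFrame with at most `chunksize` rows.
    chunks.append(chunk)

# Stitch the pieces back together into one DataFrame.
combined = pd.concat(chunks, ignore_index=True)

print(reader_is_frame)
print(combined)
```

If the file fits in memory anyway, dropping chunksize and calling read_csv once gives the same DataFrame directly.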

rolling majority on non-numeric data

Given a dataframe:
df = pd.DataFrame({'a' : [1,1,1,1,1,2,1,2,2,2,2]})
I'd like to replace every value in column 'a' by the majority of values around 'a'. For numerical data, I can do this:
import scipy.stats
def majority(window):
freqs = scipy.stats.itemfreq(window)
max_votes = freqs[:,1].argmax()
return freqs[max_votes,0]
df['a'] = pd.rolling_apply(df['a'], 3, majority)
And I get:
In [43]: df
Out[43]:
a
0 NaN
1 NaN
2 1
3 1
4 1
5 1
6 1
7 2
8 2
9 2
10 2
I'll have to deal with the NaNs, but apart from that, this is more or less what I want... Except, I'd like to do the same thing with non-numerical columns, but Pandas does not seem to support this:
In [47]: df['b'] = list('aaaababbbba')
In [49]: df['b'] = pd.rolling_apply(df['b'], 3, majority)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-49-507f45aab92c> in <module>()
----> 1 df['b'] = pd.rolling_apply(df['b'], 3, majority)
/usr/local/lib/python2.7/dist-packages/pandas/stats/moments.pyc in rolling_apply(arg, window, func, min_periods, freq, center, args, kwargs)
751 return algos.roll_generic(arg, window, minp, offset, func, args, kwargs)
752 return _rolling_moment(arg, window, call_cython, min_periods, freq=freq,
--> 753 center=False, args=args, kwargs=kwargs)
754
755
/usr/local/lib/python2.7/dist-packages/pandas/stats/moments.pyc in _rolling_moment(arg, window, func, minp, axis, freq, center, how, args, kwargs, **kwds)
382 arg = _conv_timerule(arg, freq, how)
383
--> 384 return_hook, values = _process_data_structure(arg)
385
386 if values.size == 0:
/usr/local/lib/python2.7/dist-packages/pandas/stats/moments.pyc in _process_data_structure(arg, kill_inf)
433
434 if not issubclass(values.dtype.type, float):
--> 435 values = values.astype(float)
436
437 if kill_inf:
ValueError: could not convert string to float: a
I've tried converting a to a Categorical, but even then I get the same error. I can first convert to a Categorical, work on the codes and finally convert back from codes to labels, but that seems really convoluted.
Is there an easier/more natural solution?
(BTW: I'm limited to NumPy 1.8.2 so I have to use itemfreq instead of unique, see here.)
Here is a way, using pd.Categorical:
import scipy.stats as stats
import pandas as pd
def majority(window):
freqs = stats.itemfreq(window)
max_votes = freqs[:,1].argmax()
return freqs[max_votes,0]
df = pd.DataFrame({'a' : [1,1,1,1,1,2,1,2,2,2,2]})
df['a'] = pd.rolling_apply(df['a'], 3, majority)
df['b'] = list('aaaababbbba')
cat = pd.Categorical(df['b'])
df['b'] = pd.rolling_apply(cat.codes, 3, majority)
df['b'] = df['b'].map(pd.Series(cat.categories))
print(df)
yields
a b
0 NaN NaN
1 NaN NaN
2 1 a
3 1 a
4 1 a
5 1 a
6 1 b
7 2 b
8 2 b
9 2 b
10 2 b
Here is one way to do it by defining your own rolling apply function.
import numpy as np
import pandas as pd
df = pd.DataFrame({'a' : [1,1,1,1,1,2,1,2,2,2,2]})
df['b'] = np.where(df.a == 1, 'A', 'B')
print(df)
Out[60]:
a b
0 1 A
1 1 A
2 1 A
3 1 A
4 1 A
5 2 B
6 1 A
7 2 B
8 2 B
9 2 B
10 2 B
def get_mode_from_Series(series):
return series.value_counts().index[0]
def my_rolling_apply_char(frame, window, func):
index = frame.index[window-1:]
values = [func(frame.iloc[i:i+window]) for i in range(len(frame)-window+1)]
return pd.Series(data=values, index=index).reindex(frame.index)
my_rolling_apply_char(df.b, 3, get_mode_from_Series)
Out[61]:
0 NaN
1 NaN
2 A
3 A
4 A
5 A
6 A
7 B
8 B
9 B
10 B
dtype: object
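Note that pd.rolling_apply and scipy.stats.itemfreq are no longer available in recent pandas and SciPy releases. As a sketch of the same "work on the codes, then map back" idea with the newer Series.rolling interface (the names majority_code, codes, labels and rolled are only illustrative, and np.unique replaces itemfreq):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"b": list("aaaababbbba")})

def majority_code(window):
    # window arrives as a float ndarray of integer codes (raw=True).
    values, counts = np.unique(window, return_counts=True)
    # On a tie, argmax picks the smallest code.
    return values[counts.argmax()]

# Encode the labels as integers so rolling() can work on numbers.
codes, labels = pd.factorize(df["b"])

rolled = pd.Series(codes, index=df.index).rolling(3).apply(majority_code, raw=True)

# Map the winning codes back to labels; the leading NaNs are left alone.
df["b_majority"] = rolled.map(lambda c: labels[int(c)], na_action="ignore")

print(df)
```

The first window - 1 rows stay NaN, as in the outputs above.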

How to assign a Series to a DataFrame from a Panel?

I have a Panel
quotes_cc_returns
<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 132 (major_axis) x 7 (minor_axis)
Items axis: VFINX to VWESX
Major_axis axis: 2001-01-31 00:00:00 to 2011-12-31 00:00:00
Minor_axis axis: Open to CC_Returns
and I can add a column which has a subtraction
quotes_premiums = quotes_cc_returns.transpose(2, 1, 0)
quotes_premiums['RiskPremium'] = quotes_premiums.CC_Returns.sub(ff_data_factors_subset.RF, axis=0)
but I'm unable to add a column with a simple assignment
quotes_premiums['MktRiskPremium'] = ff_data_factors_subset.MktMinusRF
because it returns this error
Traceback (most recent call last):
File "D:\Program Files (x86)\Wing IDE 101 5.0\src\debug\tserver\_sandbox.py", line 411, in <module>
File "D:\Program Files (x86)\Wing IDE 101 5.0\src\debug\tserver\_sandbox.py", line 243, in calcRiskPremiums
File "D:\Python27\Lib\site-packages\pandas\core\panel.py", line 668, in __setitem__
raise AssertionError()
AssertionError:
ff_data_factors_subset.MktMinusRF is a Series with the same length and index as quotes_premiums['MktRiskPremium'].
Thanks,
JM
The key is to use .loc to select the items and major axes where the new series should go. Here's an example that might help you sort it out.
In [16]: df = pd.DataFrame({"A": np.arange(6), 'B': ['one', 'one', 'two', 'two', 'one', 'one']})
In [17]: df
Out[17]:
A B
0 0 one
1 1 one
2 2 two
3 3 two
4 4 one
5 5 one
[6 rows x 2 columns]
In [18]: wp = pd.Panel({'L1': df, 'L2': df})
In [19]: other = pd.Series(np.arange(1, 7))
So to set it in just item 'L1':
In [20]: wp.loc['L1', :, 'other'] = other
In [22]: wp['L1']
Out[22]:
A B other
0 0 one 1
1 1 one 2
2 2 two 3
3 3 two 4
4 4 one 5
5 5 one 6
[6 rows x 3 columns]
I've solved it by using this code:
quotes_premiums['MktRiskPremium'] = 0.0
quotes_premiums['MktRiskPremium'] = quotes_premiums.MktRiskPremium.add(ff_data_factors_subset.MktMinusRF, axis=0)
JM
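Panel was removed from pandas in version 0.25, so the snippets above only run on older releases. On a current version, one way to approximate the same per-item assignment is a dict of DataFrames combined with concat. This is only a sketch under that assumption, reusing the example data from the answer; frames and wide are hypothetical names.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": np.arange(6),
                   "B": ["one", "one", "two", "two", "one", "one"]})

# One DataFrame per item, standing in for the Panel's items axis.
frames = {"L1": df.copy(), "L2": df.copy()}

other = pd.Series(np.arange(1, 7))

# Plain column assignment aligns the Series on the row index.
frames["L1"]["other"] = other

# Optional: combine into one frame with (item, column) MultiIndex columns.
wide = pd.concat(frames, axis=1)

print(wide)
```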