rolling majority on non-numeric data - pandas

Given a dataframe:
df = pd.DataFrame({'a' : [1,1,1,1,1,2,1,2,2,2,2]})
I'd like to replace every value in column 'a' by the majority of values around 'a'. For numerical data, I can do this:
def majority(window):
freqs = scipy.stats.itemfreq(window)
max_votes = freqs[:,1].argmax()
return freqs[max_votes,0]
df['a'] = pd.rolling_apply(df['a'], 3, majority)
And I get:
In [43]: df
Out[43]:
a
0 NaN
1 NaN
2 1
3 1
4 1
5 1
6 1
7 2
8 2
9 2
10 2
I'll have to deal with the NaNs, but apart from that, this is more or less what I want... Except, I'd like to do the same thing with non-numerical columns, but Pandas does not seem to support this:
In [47]: df['b'] = list('aaaababbbba')
In [49]: df['b'] = pd.rolling_apply(df['b'], 3, majority)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-49-507f45aab92c> in <module>()
----> 1 df['b'] = pd.rolling_apply(df['b'], 3, majority)
/usr/local/lib/python2.7/dist-packages/pandas/stats/moments.pyc in rolling_apply(arg, window, func, min_periods, freq, center, args, kwargs)
751 return algos.roll_generic(arg, window, minp, offset, func, args, kwargs)
752 return _rolling_moment(arg, window, call_cython, min_periods, freq=freq,
--> 753 center=False, args=args, kwargs=kwargs)
754
755
/usr/local/lib/python2.7/dist-packages/pandas/stats/moments.pyc in _rolling_moment(arg, window, func, minp, axis, freq, center, how, args, kwargs, **kwds)
382 arg = _conv_timerule(arg, freq, how)
383
--> 384 return_hook, values = _process_data_structure(arg)
385
386 if values.size == 0:
/usr/local/lib/python2.7/dist-packages/pandas/stats/moments.pyc in _process_data_structure(arg, kill_inf)
433
434 if not issubclass(values.dtype.type, float):
--> 435 values = values.astype(float)
436
437 if kill_inf:
ValueError: could not convert string to float: a
I've tried converting a to a Categorical, but even then I get the same error. I can first convert to a Categorical, work on the codes and finally convert back from codes to labels, but that seems really convoluted.
Is there an easier/more natural solution?
(BTW: I'm limited to NumPy 1.8.2 so I have to use itemfreq instead of unique, see here.)

Here is a way, using pd.Categorical:
import scipy.stats as stats
import pandas as pd
def majority(window):
freqs = stats.itemfreq(window)
max_votes = freqs[:,1].argmax()
return freqs[max_votes,0]
df = pd.DataFrame({'a' : [1,1,1,1,1,2,1,2,2,2,2]})
df['a'] = pd.rolling_apply(df['a'], 3, majority)
df['b'] = list('aaaababbbba')
cat = pd.Categorical(df['b'])
df['b'] = pd.rolling_apply(cat.codes, 3, majority)
df['b'] = df['b'].map(pd.Series(cat.categories))
print(df)
yields
a b
0 NaN NaN
1 NaN NaN
2 1 a
3 1 a
4 1 a
5 1 a
6 1 b
7 2 b
8 2 b
9 2 b
10 2 b

Here is one way to do it by defining your own rolling apply function.
import pandas as pd
df = pd.DataFrame({'a' : [1,1,1,1,1,2,1,2,2,2,2]})
df['b'] = np.where(df.a == 1, 'A', 'B')
print(df)
Out[60]:
a b
0 1 A
1 1 A
2 1 A
3 1 A
4 1 A
5 2 B
6 1 A
7 2 B
8 2 B
9 2 B
10 2 B
def get_mode_from_Series(series):
return series.value_counts().index[0]
def my_rolling_apply_char(frame, window, func):
index = frame.index[window-1:]
values = [func(frame.iloc[i:i+window]) for i in range(len(frame)-window+1)]
return pd.Series(data=values, index=index).reindex(frame.index)
my_rolling_apply_char(df.b, 3, get_mode_from_Series)
Out[61]:
0 NaN
1 NaN
2 A
3 A
4 A
5 A
6 A
7 B
8 B
9 B
10 B
dtype: object

Related

Why does .loc not always match column names?

I noticed this today and wanted to ask because I am a little confused about this.
Lets say we have two df's
df = pd.DataFrame(np.random.randint(0,9,size=(5,3)),columns = list('ABC'))
A B C
0 3 1 6
1 2 4 0
2 8 8 0
3 8 6 7
4 4 5 0
df2 = pd.DataFrame(np.random.randint(0,9,size=(5,3)),columns = list('CBA'))
C B A
0 3 5 5
1 7 4 6
2 0 7 7
3 6 6 5
4 4 0 6
If we wanted to conditionally assign new values in the first df with values, we could do this:
df.loc[df['A'].gt(3)] = df2
I would expect the columns to be aligned, and if there were missing columns, for the values in the first df to be populated with nan. However when the above code is run, it replaces the data and does not take into account the column names. (it does take the index names into account however)
A B C
0 3 1 6
1 2 4 0
2 0 7 7
3 6 6 5
4 4 0 6
on index 2 instead of [7,7,0] we have [0,7,7].
However, if we pass the names of the columns into the loc statement, without changing the order of the columns in df2, it aligns with the columns.
df.loc[df['A'].gt(3),['A','B','C']] = df2
A B C
0 3 1 6
1 2 4 0
2 7 7 0
3 5 6 6
4 6 0 4
Why does this happen?
Interestingly, loc performs a number of optimizations to improve performance, one of those optimizations is checking the type of the index passed in.
Both Row and Column Indexes Included
When passing both a row index and a column index the __setitem__ function:
def __setitem__(self, key, value):
if isinstance(key, tuple):
key = tuple(com.apply_if_callable(x, self.obj) for x in key)
else:
key = com.apply_if_callable(key, self.obj)
indexer = self._get_setitem_indexer(key)
self._has_valid_setitem_indexer(key)
iloc = self if self.name == "iloc" else self.obj.iloc
iloc._setitem_with_indexer(indexer, value, self.name)
Interprets the key as a tuple.
key:
(0 False
1 False
2 True
3 True
4 True
Name: A, dtype: bool,
['A', 'B', 'C'])
This is then passed to _get_setitem_indexer to convert to a positional indexer from label-based:
indexer = self._get_setitem_indexer(key)
def _get_setitem_indexer(self, key):
"""
Convert a potentially-label-based key into a positional indexer.
"""
if self.name == "loc":
self._ensure_listlike_indexer(key)
if self.axis is not None:
return self._convert_tuple(key, is_setter=True)
ax = self.obj._get_axis(0)
if isinstance(ax, ABCMultiIndex) and self.name != "iloc":
with suppress(TypeError, KeyError, InvalidIndexError):
# TypeError e.g. passed a bool
return ax.get_loc(key)
if isinstance(key, tuple):
with suppress(IndexingError):
return self._convert_tuple(key, is_setter=True)
if isinstance(key, range):
return list(key)
try:
return self._convert_to_indexer(key, axis=0, is_setter=True)
except TypeError as e:
# invalid indexer type vs 'other' indexing errors
if "cannot do" in str(e):
raise
elif "unhashable type" in str(e):
raise
raise IndexingError(key) from e
This generates a tuple indexer (both rows and columns are converted):
if isinstance(key, tuple):
with suppress(IndexingError):
return self._convert_tuple(key, is_setter=True)
returns
(array([2, 3, 4], dtype=int64), array([0, 1, 2], dtype=int64))
Only Row Index Included
However, when only a row index is passed to loc the indexer is not a tuple and, as such, only a single dimension is converted from label to positional:
if isinstance(key, range):
return list(key)
returns
[2 3 4]
For this reason, no alignment happens among columns when only a single value is passed to loc, as no parsing is done to align the columns.
That is why an empty slice is often used:
df.loc[df['A'].gt(3), :] = df2
As this is sufficient to align the columns appropriately.
import numpy as np
import pandas as pd
np.random.seed(5)
df = pd.DataFrame(np.random.randint(0, 9, size=(5, 3)), columns=list('ABC'))
df2 = pd.DataFrame(np.random.randint(0, 9, size=(5, 3)), columns=list('CBA'))
print(df)
print(df2)
df.loc[df['A'].gt(3), :] = df2
print(df)
Example:
df:
A B C
0 3 6 6
1 0 8 4
2 7 0 0
3 7 1 5
4 7 0 1
df2:
C B A
0 4 6 2
1 1 2 7
2 0 5 0
3 0 4 4
4 3 2 4
df.loc[df['A'].gt(3), :] = df2:
A B C
0 3 6 6
1 0 8 4
2 0 5 0
3 4 4 0 # Aligned as expected
4 4 2 3

comverting the numpy array to proper dataframe

I have numpy array as data below
data = np.array([[1,2],[4,5],[7,8]])
i want to split it and change to dataframe with column name as below to get the first value of each array as below
df_main:
value_items excluded_items
1 2
4 5
7 8
from which later I can take like
df:
value_items
1
4
7
df2:
excluded_items
2
5
8
I tried to convert to dataframe with command
df = pd.DataFrame(data)
it resulted in still array of int32
so, the splitting is failure for me
Use reshape for 2d array and also add columns parameter:
df = pd.DataFrame(data.reshape(-1,2), columns=['value_items','excluded_items'])
Sample:
data = np.arange(785*2).reshape(1, 785, 2)
print (data)
[[[ 0 1]
[ 2 3]
[ 4 5]
...
[1564 1565]
[1566 1567]
[1568 1569]]]
print (data.shape)
(1, 785, 2)
df = pd.DataFrame(data.reshape(-1,2), columns=['value_items','excluded_items'])
print (df)
value_items excluded_items
0 0 1
1 2 3
2 4 5
3 6 7
4 8 9
.. ... ...
780 1560 1561
781 1562 1563
782 1564 1565
783 1566 1567
784 1568 1569
[785 rows x 2 columns]

Check whether a column in a dataframe is an integer or not, and perform operation

Check whether a column in a dataframe is an integer or not, and if it is an integer, it must be multiplied by 10
import numpy as np
import pandas as pd
df = pd.dataframe(....)
#function to check and multiply if a column is integer
def xtimes(x):
for col in x:
if type(x[col]) == np.int64:
return x[col]*10
else:
return x[col]
#using apply to apply that function on df
df.apply(xtimes).head(10)
I am getting an error like ('GP', 'occurred at index school')
You could use select_dtypes to get numeric columns and then multiply.
In [1284]: df[df.select_dtypes(include=['int', 'int64', np.number]).columns] *= 10
You could have your specific check list for include=[... np.int64, ..., etc]
You can use the dtypes attribute and loc.
df.loc[:, df.dtypes <= np.integer] *= 10
Explanation
pd.DataFrame.dtypes returns a pd.Series of numpy dtype objects. We can use the comparison operators to determine subdtype status. See this document for the numpy.dtype hierarchy.
Demo
Consider the dataframe df
df = pd.DataFrame([
[1, 2, 3, 4, 5, 6],
[1, 2, 3, 4, 5, 6]
]).astype(pd.Series([np.int32, np.int16, np.int64, float, object, str]))
df
0 1 2 3 4 5
0 1 2 3 4.0 5 6
1 1 2 3 4.0 5 6
The dtypes are
df.dtypes
0 int32
1 int16
2 int64
3 float64
4 object
5 object
dtype: object
We'd like to change columns 0, 1, and 2
Conveniently
df.dtypes <= np.integer
0 True
1 True
2 True
3 False
4 False
5 False
dtype: bool
And that is what enables us to use this within a loc assignment.
df.loc[:, df.dtypes <= np.integer] *= 10
df
0 1 2 3 4 5
0 10 20 30 4.0 5 6
1 10 20 30 4.0 5 6

Why the column name is missing in pandas output in the group by result?

Update
if use to_frame() the column name seems not in the same row
重量
型号
HG-R2075 2040
HG220 680
This is my code, it groups the "型号"(which means type), and get the sum of the "重量"(weight) and exclude the column("是否发送") with a value in it.
import pandas as pd
import numpy as np
import sys
import os
script_dir = os.path.dirname(os.path.abspath(__file__))
os.chdir(script_dir ) # change to the path that you already know
try:
ClientName = sys.argv[1]
except :
print(u'没有输入或者错误的客户名称!')
df = pd.read_excel("Summary.xlsm")
df = df[df['客户'].str.contains(ClientName)][pd.isnull(df[u"是否已经发送"])].groupby([ u'型号'])[u'重量'].sum()
print('[CQ:face,id=21] ' + '*' * 10 + u'以下是' + ClientName + u'未发送的重量' + '*' * 10 + '[CQ:face,id=21]')
print(str(df))
Output is this :
[CQ:face,id=21] **********以下是KATUN未发送的重量**********[CQ:face,id=
21]
型号 (****the column name is missing here*****)
HG-R2075 2040
HG220 680
Name: 重量, dtype: int64
I don't know why the column name is missing?
The output I want is this: how to make it?
型号 重量
HG-R2075 2040
HG220 680
Name: 重量, dtype: int64
The result df of your groupby operation is actually a Series, not a DataFrame. That's why it is printed with a different format.
print(df.to_frame()) should to the trick.
EDIT: Actually in such a dataframe index name and column name will not be printed on the same row. To get a cleaner output, use reset_index to get 2 proper columns:
print(df.reset_index().to_string(index=False))
First use boolean indexing with chaining by &.
If need 2 column DataFrame add as_index=False or Series.reset_index:
mask = df['客户'].str.contains(ClientName) & df[u"是否已经发送"].isnull()
df = df[mask].groupby([ u'型号'], as_index=False)[u'重量'].sum()
Or:
df = df[mask].groupby([ u'型号'])[u'重量'].sum().reset_index()
For one column DataFrame use Series.to_frame - first column is index:
df = df[mask].groupby([ u'型号'])[u'重量'].sum().to_frame()
Sample:
np.random.seed(345)
N = 10
df = pd.DataFrame({'客户':np.random.choice(list('abc'), size=N),
u"是否已经发送":np.random.choice([np.nan,0], size=N),
u'型号':np.random.randint(2, size=N),
u'重量':np.random.randint(10, size=N)})
print (df)
型号 客户 是否已经发送 重量
0 0 a 0.0 4
1 0 a 0.0 0
2 1 b NaN 8
3 1 b NaN 5
4 1 c 0.0 6
5 1 a NaN 3
6 1 a NaN 3
7 1 b 0.0 4
8 0 a NaN 2
9 1 c NaN 8
ClientName = 'a'
mask = df['客户'].str.contains(ClientName) & df[u"是否已经发送"].isnull()
df1 = df[mask].groupby([ u'型号'], as_index=False)[u'重量'].sum()
print(df1)
型号 重量
0 0 2
1 1 6
df1 = df[mask].groupby([ u'型号'])[u'重量'].sum().reset_index()
print(df1)
型号 重量
0 0 2
1 1 6
df2 = df[mask].groupby([ u'型号'])[u'重量'].sum().to_frame()
print (df2)
重量
型号
0 2
1 6

How to assign a Series to a DataFrame from a Panel?

I have a Panel
quotes_cc_returns
<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 132 (major_axis) x 7 (minor_axis)
Items axis: VFINX to VWESX
Major_axis axis: 2001-01-31 00:00:00 to 2011-12-31 00:00:00
Minor_axis axis: Open to CC_Returns
and I can add a column which has a subtraction
quotes_premiums = quotes_cc_returns.transpose(2, 1, 0)
quotes_premiums['RiskPremium'] = quotes_premiums.CC_Returns.sub(ff_data_factors_subset.RF, axis=0)
but I'm unable to add a column with a simple assignment
quotes_premiums['MktRiskPremium'] = ff_data_factors_subset.MktMinusRF
because it returns this error
Traceback (most recent call last):
File "D:\Program Files (x86)\Wing IDE 101 5.0\src\debug\tserver\_sandbox.py", line 411, in <module>
File "D:\Program Files (x86)\Wing IDE 101 5.0\src\debug\tserver\_sandbox.py", line 243, in calcRiskPremiums
File "D:\Python27\Lib\site-packages\pandas\core\panel.py", line 668, in __setitem__
raise AssertionError()
AssertionError:
ff_data_factors_subset.MktMinusRF is a Series with the same length and index as quotes_premiums['MktRiskPremium'].
Thanks,
JM
The key is to use .loc to select the items and major axes where the new series should go. Here's an example that might help you sort it out.
In [16]: df = pd.DataFrame({"A": np.arange(6), 'B': ['one', 'one', 'two', 'two', 'one', 'one']})
In [17]: df
Out[17]:
A B
0 0 one
1 1 one
2 2 two
3 3 two
4 4 one
5 5 one
[6 rows x 2 columns]
In [18]: wp = pd.Panel({'L1': df, 'L2': df})
In [19]: other = pd.Series(np.arange(1, 7))
So to it in just item 'L1':
In [20]: wp.loc['L1', :, 'other'] = other
In [22]: wp['L1']
Out[22]:
A B other
0 0 one 1
1 1 one 2
2 2 two 3
3 3 two 4
4 4 one 5
5 5 one 6
[6 rows x 3 columns]
I've solved it by using this code:
quotes_premiums['MktRiskPremium'] = 0.0
quotes_premiums['MktRiskPremium'] = quotes_premiums.MktRiskPremium.add(ff_data_factors_subset.MktMinusRF, axis=0)
JM