pd.Categorical.from_codes with missing values - pandas

Assume I have:
df = pd.DataFrame({'gender': np.random.choice([1, 2], 10), 'height': np.random.randint(150, 210, 10)})
I'd like to make the gender column categorical. If I try:
df['gender'] = pd.Categorical.from_codes(df['gender'], ['female', 'male'])
it'll fail, because from_codes expects the codes to be 0-indexed.
I can pad the categories:
df['gender'] = pd.Categorical.from_codes(df['gender'], ['N/A', 'female', 'male'])
But then 'N/A' is returned in some methods:
In [67]: df['gender'].value_counts()
Out[67]:
female 5
male 5
N/A 0
Name: gender, dtype: int64
I thought about using None as the padding value. It works as intended in value_counts; however, I get a warning:
opt/anaconda3/bin/ipython:1: FutureWarning:
Setting NaNs in `categories` is deprecated and will be removed in a future version of pandas.
#!/opt/anaconda3/bin/python
Any better way to do this? Also is there a way to give a mapping from code to category explicitly?

You can use the rename_categories() method:
Demo:
In [33]: df
Out[33]:
gender height
0 1 203
1 2 169
2 2 181
3 1 172
4 2 174
5 1 166
6 2 187
7 2 200
8 1 208
9 1 201
In [34]: df['gender'] = df['gender'].astype('category').cat.rename_categories(['female', 'male'])
In [35]: df
Out[35]:
gender height
0 female 203
1 male 169
2 male 181
3 female 172
4 male 174
5 female 166
6 male 187
7 male 200
8 female 208
9 female 201
In [36]: df.dtypes
Out[36]:
gender category
height int32
dtype: object
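In newer pandas versions, rename_categories also accepts a dict-like, which doubles as the explicit code-to-category mapping the question asks for. A minimal sketch, assuming such a version:
# A sketch: the dict keys are the existing categories (here the original
# codes 1 and 2), the values are the new labels.
mapping = {1: 'female', 2: 'male'}
df['gender'] = df['gender'].astype('category').cat.rename_categories(mapping)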

Assign the new categories directly to its .cat.categories attribute and the values will then be renamed accordingly (note that newer pandas versions deprecate this in-place assignment, so prefer rename_categories there):
df['gender'] = df['gender'].astype('category')
df['gender'].cat.categories = ['female', 'male']
df['gender'].value_counts()
Out[23]:
female 7
male 3
Name: gender, dtype: int64
df.dtypes
Out[24]:
gender category
height int32
dtype: object
If you want a mapper dict from each code to its respective category, then:
old = df['gender'].cat.categories
new = ['female', 'male']
dict(zip(old, new))
Out[28]:
{1: 'female', 2: 'male'}
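If you prefer to apply that mapper directly rather than renaming categories positionally, a sketch using Series.map:
# A sketch: map the raw integer codes through the dict, then make the
# result categorical in one go.
mapper = {1: 'female', 2: 'male'}
df['gender'] = df['gender'].map(mapper).astype('category')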

The error you get from pd.Categorical.from_codes(df['gender'], ['female', 'male']) should alert you that your codes need to be 0-indexed.
So you can simply make it so in your DataFrame declaration:
df = pd.DataFrame({'gender': np.random.choice([0, 1], 10), 'height': np.random.randint(150, 210, 10)})
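Alternatively, if the data already comes with 1-based codes, a sketch of shifting them down instead of regenerating the data:
# A sketch: shift the 1-based codes down to 0-based before calling
# from_codes, so 1 -> 'female' and 2 -> 'male'.
df['gender'] = pd.Categorical.from_codes(df['gender'] - 1, ['female', 'male'])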

Related

Improper broadcasting(?) on DataFrame multiply() operation on two MultiIndex slices

I'm trying to multiply() two multilevel slices of a DataFrame; however, I'm unable to coerce the multiply operation to broadcast properly, so I just end up with lots of NaNs. It's like somehow I'm not specifying the indexing properly.
I've tried all variations of both axis and level, but it either throws an exception or gives me a 6x6 grid of NaN.
import numpy as np
import pandas as pd
np.random.seed(0)
idx = pd.IndexSlice
df_a = pd.DataFrame(index=range(6),
                    columns=pd.MultiIndex.from_product([['weight'], ['alice', 'bob', 'sue']],
                                                       names=['measure', 'person']),
                    data=np.random.randint(70, high=120, size=(6, 3), dtype=int))
df_a.index.name = "m"
df_b = pd.DataFrame(index=range(6),
                    columns=pd.MultiIndex.from_product([['coef'], ['alice', 'bob', 'sue']],
                                                       names=['measure', 'person']),
                    data=np.random.rand(6, 3))
df_b.index.name = "m"
df_c = pd.DataFrame(index=range(6),
                    columns=pd.MultiIndex.from_product([['extraneous'], ['alice', 'bob', 'sue']],
                                                       names=['measure', 'person']),
                    data=np.random.rand(6, 3))
df = df_a.join([df_b, df_c])
# What I'm wanting:
# new column = coef*weight
#measure NewCol
#person alice bob sue
#m
#0 30.2 48.1 88.9
#...
#5 18.3 32.2 103
# all of these variations generate a 6x6 grid of NaNs
df.loc[:, idx['weight', :]].multiply(df.loc[:, idx['coef', :]], axis="rows")
df.loc[:, idx['weight', :]].multiply(df.loc[:, idx['coef', :]], axis="columns")
Here is an approach using pandas.concat:
df = pd.concat([df,
                pd.concat({'NewCol': df['coef'].mul(df['weight'])},
                          axis=1)],
               axis=1)
output:
measure weight coef extraneous NewCol
person alice bob sue alice bob sue alice bob sue alice bob sue
m
0 107 98 89 0.906243 0.761173 0.754762 0.889252 0.140435 0.708203 96.968045 74.594927 67.173827
1 106 77 117 0.193279 0.138338 0.699014 0.826331 0.087769 0.242337 20.487623 10.652021 81.784634
2 104 77 101 0.340416 0.131111 0.394653 0.465670 0.825667 0.624923 35.403258 10.095575 39.859948
3 80 92 116 0.329999 0.144878 0.794014 0.539082 0.968411 0.588952 26.399889 13.328731 92.105674
4 75 76 100 0.024841 0.083313 0.113684 0.160948 0.003354 0.246954 1.863067 6.331802 11.368357
5 115 99 71 0.662492 0.755795 0.123242 0.144265 0.993883 0.513367 76.186541 74.823720 8.750217
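For context on why the original attempts return NaN: pandas aligns on the full column labels, and ('weight', person) never matches ('coef', person), so every cell comes up empty. Selecting a single top level (df['weight'], df['coef']) strips the 'measure' level, leaving only the shared 'person' columns to align on. The same idea as an explicit sketch:
# A sketch of the alignment fix: strip the non-matching 'measure' level,
# multiply on the shared 'person' level, then restore a top level.
w = df['weight']                               # columns: alice, bob, sue
c = df['coef']                                 # columns: alice, bob, sue
new_col = pd.concat({'NewCol': w * c}, axis=1) # re-adds the top level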
If you want to assign the changes back to the DataFrame, you can go via to_numpy():
df.loc[:, idx['weight', :]] = df.loc[:, idx['weight', :]].to_numpy() * df.loc[:, idx['coef', :]].to_numpy()
# you can also use the .values attribute
OR
If you want to create a new MultiIndexed column, then use concat() + join():
df = df.join(pd.concat([df['coef'].mul(df['weight'])], keys=['NewCol'], axis=1))
# OR
# df = df.join(pd.concat({'NewCol': df['coef'].mul(df['weight'])}, axis=1))

Pandas way to separate a DataFrame based on previous groupby() explorations without losing the non-grouped columns

I tried to translate the problem with my real data into the example data presented in my question. Maybe I just have a simple technical problem. Or maybe my whole approach and workflow is not the best?
The objective
There are persons (column name) who have eaten different fruits on different days. And there is some more data (columns foo and bar) I do not want to lose.
I want to separate/split the original data without losing the additional data (in foo and bar).
The condition for separating is the number of unique fruits eaten on the specific days.
That is the initial data
>>> df
name day fruit foo bar
0 Tim 1 Apple 708 20
1 Tim 1 Apple 135 743
2 Tim 2 Apple 228 562
3 Anna 1 Banana 495 924
4 Anna 1 Strawberry 236 542
5 Bob 1 Strawberry 420 894
6 Bob 2 Apple 27 192
7 Bob 2 Kiwi 671 145
The separated interim result should look like these two DataFrames:
>>> two
name day fruit foo bar
0 Anna 1 Banana 495 924
1 Anna 1 Strawberry 236 542
2 Bob 2 Apple 27 192
3 Bob 2 Kiwi 671 145
>>> non_two
name day fruit foo bar
0 Tim 1 Apple 708 20
1 Tim 1 Apple 135 743
2 Tim 2 Apple 228 562
3 Bob 1 Strawberry 420 894
Example explanation in words: Tim ate just apples on days 1 and 2. It does not matter how many apples; it just matters that it is one unique fruit.
What I have done so far
I did some groupby() magic to find out who ate two (or fewer/more than two) unique fruits and when.
import pandas as pd
import random as rd
data = {'name': ['Tim', 'Tim', 'Tim', 'Anna', 'Anna', 'Bob', 'Bob', 'Bob'],
        'day': [1, 1, 2, 1, 1, 1, 2, 2],
        'fruit': ['Apple', 'Apple', 'Apple', 'Banana', 'Strawberry',
                  'Strawberry', 'Apple', 'Kiwi'],
        'foo': rd.sample(range(1000), 8),
        'bar': rd.sample(range(1000), 8)}
# That is the primary DataFrame
df = pd.DataFrame(data)
# Explore the data
a = df[['name', 'day', 'fruit']].groupby(['name', 'day', 'fruit']).count().reset_index()
b = a.groupby(['name', 'day']).count()
# People who ate 2 fruits on specific days
two = b[(b.fruit == 2)].reset_index()
print(two)
# People who ate fewer or more than 2 fruits on specific days
non_two = b[(b.fruit != 2)].reset_index()
print(non_two)
Here is my roadblock
With the DataFrames two and non_two I have the information I want. Now I want to separate the initial DataFrame based on that information. I think name and day are the columns I should use to select and separate rows in the initial DataFrame.
# filter mask
mymask = (df.name == two.name) & (df.day == two.day)
df_two = df[mymask]
df_non_two = df[~mymask]
But this does not work. The first line raises ValueError: Can only compare identically-labeled Series objects, because the comparison aligns the two Series by index, and df (8 rows) and two (2 rows) have different indexes.
Use DataFrameGroupBy.nunique with GroupBy.transform, which makes it possible to filter the original DataFrame:
mymask = df.groupby(['name', 'day'])['fruit'].transform('nunique').eq(2)
df_two = df[mymask]
df_non_two = df[~mymask]
print (df_two)
name day fruit foo bar
3 Anna 1 Banana 335 62
4 Anna 1 Strawberry 286 694
6 Bob 2 Apple 822 738
7 Bob 2 Kiwi 793 449
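A less efficient but arguably more readable alternative, as a sketch, is GroupBy.filter, which keeps or drops whole groups:
# A sketch using GroupBy.filter: keep the (name, day) groups with exactly
# two unique fruits; the complement is everything else.
df_two = df.groupby(['name', 'day']).filter(lambda g: g['fruit'].nunique() == 2)
df_non_two = df.drop(df_two.index)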

Converting the numpy array to a proper dataframe

I have a numpy array as the data below
data = np.array([[1,2],[4,5],[7,8]])
I want to split it and change it to a DataFrame with the column names below, so I can take the first value of each pair:
df_main:
value_items excluded_items
1 2
4 5
7 8
from which I can later take, for example:
df:
value_items
1
4
7
df2:
excluded_items
2
5
8
I tried to convert it to a DataFrame with the command
df = pd.DataFrame(data)
but it still resulted in an array of int32, so the splitting failed for me.
Use reshape to get a 2d array and also add the columns parameter:
df = pd.DataFrame(data.reshape(-1,2), columns=['value_items','excluded_items'])
Sample:
data = np.arange(785*2).reshape(1, 785, 2)
print (data)
[[[ 0 1]
[ 2 3]
[ 4 5]
...
[1564 1565]
[1566 1567]
[1568 1569]]]
print (data.shape)
(1, 785, 2)
df = pd.DataFrame(data.reshape(-1,2), columns=['value_items','excluded_items'])
print (df)
value_items excluded_items
0 0 1
1 2 3
2 4 5
3 6 7
4 8 9
.. ... ...
780 1560 1561
781 1562 1563
782 1564 1565
783 1566 1567
784 1568 1569
[785 rows x 2 columns]
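To then split off the single-column frames the question mentions, a sketch (double brackets keep the result a DataFrame rather than a Series):
# A sketch of the follow-up split.
df1 = df[['value_items']]
df2 = df[['excluded_items']]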

Sort column of DataFrame by similarity, first row should be fixed (Python)

I want to order the frame depending on the first row of B. So the first row of B is always fixed, and the second, third, ... rows are sorted by similarity to B's first row. It should also be flexible: B could contain 2-20 or even more rows.
I expect a result like this
Any idea how to do this?
If you sort the values by the difference from the first value in b, you can just use that index into the original DataFrame:
In [35]: df = pd.DataFrame({'a': range(6), 'b': [483, 479, 503, 479, 485, 495]})
In [36]: df
Out[36]:
a b
0 0 483
1 1 479
2 2 503
3 3 479
4 4 485
5 5 495
In [37]: idx = df['b'].sub(df.loc[0, 'b']).abs().sort_values().index
In [38]: df.loc[idx]
Out[38]:
a b
0 0 483
4 4 485
1 1 479
3 3 479
5 5 495
2 2 503
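In pandas 1.1+, the same idea can be folded into sort_values via its key parameter. A sketch, assuming such a version:
# A sketch: key receives column 'b' as a Series; sorting by absolute
# distance from the first value keeps that row first (distance 0).
anchor = df['b'].iloc[0]
result = df.sort_values('b', key=lambda s: (s - anchor).abs())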

rolling majority on non-numeric data

Given a dataframe:
df = pd.DataFrame({'a' : [1,1,1,1,1,2,1,2,2,2,2]})
I'd like to replace every value in column 'a' by the majority of values around it. For numerical data, I can do this:
def majority(window):
    freqs = scipy.stats.itemfreq(window)
    max_votes = freqs[:,1].argmax()
    return freqs[max_votes,0]

df['a'] = pd.rolling_apply(df['a'], 3, majority)
And I get:
In [43]: df
Out[43]:
a
0 NaN
1 NaN
2 1
3 1
4 1
5 1
6 1
7 2
8 2
9 2
10 2
I'll have to deal with the NaNs, but apart from that, this is more or less what I want... Except, I'd like to do the same thing with non-numerical columns, but Pandas does not seem to support this:
In [47]: df['b'] = list('aaaababbbba')
In [49]: df['b'] = pd.rolling_apply(df['b'], 3, majority)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-49-507f45aab92c> in <module>()
----> 1 df['b'] = pd.rolling_apply(df['b'], 3, majority)
/usr/local/lib/python2.7/dist-packages/pandas/stats/moments.pyc in rolling_apply(arg, window, func, min_periods, freq, center, args, kwargs)
751 return algos.roll_generic(arg, window, minp, offset, func, args, kwargs)
752 return _rolling_moment(arg, window, call_cython, min_periods, freq=freq,
--> 753 center=False, args=args, kwargs=kwargs)
754
755
/usr/local/lib/python2.7/dist-packages/pandas/stats/moments.pyc in _rolling_moment(arg, window, func, minp, axis, freq, center, how, args, kwargs, **kwds)
382 arg = _conv_timerule(arg, freq, how)
383
--> 384 return_hook, values = _process_data_structure(arg)
385
386 if values.size == 0:
/usr/local/lib/python2.7/dist-packages/pandas/stats/moments.pyc in _process_data_structure(arg, kill_inf)
433
434 if not issubclass(values.dtype.type, float):
--> 435 values = values.astype(float)
436
437 if kill_inf:
ValueError: could not convert string to float: a
I've tried converting the column to a Categorical, but even then I get the same error. I can first convert to a Categorical, work on the codes and finally convert back from codes to labels, but that seems really convoluted.
Is there an easier/more natural solution?
(BTW: I'm limited to NumPy 1.8.2 so I have to use itemfreq instead of unique, see here.)
Here is a way, using pd.Categorical:
import scipy.stats as stats
import pandas as pd
def majority(window):
    freqs = stats.itemfreq(window)
    max_votes = freqs[:,1].argmax()
    return freqs[max_votes,0]
df = pd.DataFrame({'a' : [1,1,1,1,1,2,1,2,2,2,2]})
df['a'] = pd.rolling_apply(df['a'], 3, majority)
df['b'] = list('aaaababbbba')
cat = pd.Categorical(df['b'])
df['b'] = pd.rolling_apply(cat.codes, 3, majority)
df['b'] = df['b'].map(pd.Series(cat.categories))
print(df)
yields
a b
0 NaN NaN
1 NaN NaN
2 1 a
3 1 a
4 1 a
5 1 a
6 1 b
7 2 b
8 2 b
9 2 b
10 2 b
Here is one way to do it, by defining your own rolling apply function:
import numpy as np
import pandas as pd
df = pd.DataFrame({'a' : [1,1,1,1,1,2,1,2,2,2,2]})
df['b'] = np.where(df.a == 1, 'A', 'B')
print(df)
Out[60]:
a b
0 1 A
1 1 A
2 1 A
3 1 A
4 1 A
5 2 B
6 1 A
7 2 B
8 2 B
9 2 B
10 2 B
def get_mode_from_Series(series):
    return series.value_counts().index[0]

def my_rolling_apply_char(frame, window, func):
    index = frame.index[window-1:]
    values = [func(frame.iloc[i:i+window]) for i in range(len(frame)-window+1)]
    return pd.Series(data=values, index=index).reindex(frame.index)
my_rolling_apply_char(df.b, 3, get_mode_from_Series)
Out[61]:
0 NaN
1 NaN
2 A
3 A
4 A
5 A
6 A
7 B
8 B
9 B
10 B
dtype: object
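A forward-compatibility note: pd.rolling_apply and scipy.stats.itemfreq were both removed in later releases. A sketch of the first answer's codes-based trick with the modern Series.rolling API (tie-breaking differs slightly, since Series.mode returns the smallest code on ties):
import pandas as pd

df = pd.DataFrame({'b': list('aaaababbbba')})
cat = pd.Categorical(df['b'])
codes = pd.Series(cat.codes)

# rolling().apply() must return a number, so work on the integer codes
# and map the winning code back to its label afterwards.
out = codes.rolling(3).apply(lambda w: w.mode().iloc[0], raw=False)
df['maj'] = out.map(pd.Series(cat.categories))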