Find the index of one dataframe column's values in another column

I have a dataframe
df = pd.DataFrame({'Event1':['Music', 'Poetry', 'Theatre', 'Comedy'],
                   'Event2':['Poetry', 'Music', 'Dance', 'Theatre']})
I need to create a new column called 'Val' that holds, for each row, the index at which the Event2 value occurs in Event1. For this example, Val would be
'Val':[1, 0, NaN, 2].

Here are two ways you can do it:
Solution 1
This assumes every Event2 value occurs somewhere in Event1; np.where returns an empty array for a value with no match, so indexing with [0][0] would raise an IndexError (the output below was produced with data where every value matches):
import numpy as np
df['val'] = df['Event2'].apply(lambda x: np.where(x == df['Event1'])[0][0])
print(df)
Event1 Event2 val
0 Music Poetry 1
1 Poetry Music 0
2 Theatre Comedy 3
3 Comedy Theatre 2
Solution 2
This one handles the unmatched value ('Dance') and returns NaN for it:
df = pd.DataFrame({'Event1':['Music', 'Poetry', 'Theatre', 'Comedy'],
                   'Event2':['Poetry', 'Music', 'Dance', 'Theatre']})
df['val'] = (df['Event2']
             .apply(lambda x: np.argwhere(x == df['Event1']))  # positions of matches
             .apply(lambda x: x[0][0] if len(x) > 0 else x)    # first match, else empty array
            )
df['val'] = pd.to_numeric(df['val'], errors='coerce')  # empty arrays coerce to NaN
print(df)
Event1 Event2 val
0 Music Poetry 1.0
1 Poetry Music 0.0
2 Theatre Dance NaN
3 Comedy Theatre 2.0
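For completeness, a vectorized alternative (a sketch of my own, not from the original answers) uses pd.Index.get_indexer, which returns -1 for values that are absent, avoiding the per-row apply:
import numpy as np
import pandas as pd

df = pd.DataFrame({'Event1': ['Music', 'Poetry', 'Theatre', 'Comedy'],
                   'Event2': ['Poetry', 'Music', 'Dance', 'Theatre']})
pos = pd.Index(df['Event1']).get_indexer(df['Event2'])  # position in Event1, -1 if absent
df['val'] = np.where(pos >= 0, pos, np.nan)             # turn -1 into NaN
print(df)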

Related

python pandas divide dataframe in method chain

I want to divide a dataframe by a number:
df = df/10
Is there a way to do this in a method chain?
# idea:
df = df.filter(['a','b']).query("a>100").assign(**divide by 10)
We can use DataFrame.div here:
df = df[['a','b']].query("a>100").div(10)
a b
0 40.0 0.7
1 50.0 0.8
5 70.0 0.3
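DataFrame.div is also more general than the / operator: its axis argument lets you divide by a Series aligned along either axis. A small sketch with made-up data:
import pandas as pd

df = pd.DataFrame({'a': [400, 500, 700], 'b': [7, 8, 3]})
print(df.div(df['a'], axis=0))  # divide each column by column 'a', aligned on the index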
Use DataFrame.pipe with a lambda to apply a function to the whole DataFrame at once:
df = pd.DataFrame({
    'a':[400,500,40,50,5,700],
    'b':[7,8,9,4,2,3],
    'c':[1,3,5,7,1,0],
    'd':[5,3,6,9,2,4]
})
df = df.filter(['a','b']).query("a>100").pipe(lambda x: x / 10)
print (df)
a b
0 40.0 0.7
1 50.0 0.8
5 70.0 0.3
If you use apply instead, each column is divided separately:
df = df.filter(['a','b']).query("a>100").apply(lambda x: x / 10)
You can see the difference by printing inside the function:
df1 = df.filter(['a','b']).query("a>100").pipe(lambda x: print (x))
a b
0 400 7
1 500 8
5 700 3
df2 = df.filter(['a','b']).query("a>100").apply(lambda x: print (x))
0 400
1 500
5 700
Name: a, dtype: int64
0 7
1 8
5 3
Name: b, dtype: int64
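In other words, pipe hands the function the whole DataFrame, while apply hands it one column at a time. A minimal sketch (same data as above) where that matters, because the function needs frame-wide state:
import pandas as pd

df = pd.DataFrame({
    'a':[400,500,40,50,5,700],
    'b':[7,8,9,4,2,3]
})
# global maximum of the selected frame - only visible when pipe passes the whole frame
print(df.filter(['a','b']).query("a>100").pipe(lambda x: x / x.to_numpy().max()))
# with apply, each column would be divided by its own maximum instead
print(df.filter(['a','b']).query("a>100").apply(lambda x: x / x.max()))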

How to replace pd.NamedAgg to a code compliant with pandas 0.24.2?

Hello, I am obliged to downgrade pandas to version 0.24.2.
As a result, pd.NamedAgg is no longer recognized.
import pandas as pd
import numpy as np
agg_cols = ['A', 'B', 'C']
agg_df = df.groupby(agg_cols).agg(
    max_foo=pd.NamedAgg(column='Foo', aggfunc=np.max),
    min_foo=pd.NamedAgg(column='Foo', aggfunc=np.min)
).reset_index()
Can you please help me change my code to make it compliant with version 0.24.2?
Thank you a lot.
In 0.24.2 you can select the column after groupby and pass a list of (name, aggfunc) tuples:
agg_df = df.groupby(agg_cols)['Foo'].agg(
    [('max_foo', np.max), ('min_foo', np.min)]
).reset_index()
Sample:
df = pd.DataFrame({
    'A':list('a')*6,
    'B':[4,5,4,5,5,4],
    'C':[7]*6,
    'Foo':[1,3,5,7,1,0],
    'E':[5,3,6,9,2,4],
    'F':list('aaabbb')
})
agg_cols = ['A', 'B', 'C']
Under pandas 0.25+, the named aggregation gives:
agg_df = df.groupby(agg_cols).agg(
    max_foo=pd.NamedAgg(column='Foo', aggfunc=np.max),
    min_foo=pd.NamedAgg(column='Foo', aggfunc=np.min)
).reset_index()
print (agg_df)
A B C max_foo min_foo
0 a 4 7 5 0
1 a 5 7 7 1
Because only one column, Foo, is processed, select it after groupby and pass tuples of new column names with aggregate functions:
agg_df = df.groupby(agg_cols)['Foo'].agg(
    [('max_foo', np.max), ('min_foo', np.min)]
).reset_index()
print (agg_df)
A B C max_foo min_foo
0 a 4 7 5 0
1 a 5 7 7 1
Another idea is to pass a dictionary of lists of aggregate functions, then flatten the resulting MultiIndex columns:
agg_df = df.groupby(agg_cols).agg({'Foo':['max', 'min']})
agg_df.columns = [f'{b}_{a}' for a, b in agg_df.columns]
agg_df = agg_df.reset_index()
print (agg_df)
A B C max_Foo min_Foo
0 a 4 7 5 0
1 a 5 7 7 1
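If the same code has to run on both versions, one option (my sketch, not from the answer) is to branch on whether pd.NamedAgg exists, since it was added in pandas 0.25:
import pandas as pd
import numpy as np

def min_max_foo(df, agg_cols):
    # hypothetical helper: named aggregation where available, tuple syntax otherwise
    if hasattr(pd, 'NamedAgg'):  # pandas >= 0.25
        return df.groupby(agg_cols).agg(
            max_foo=pd.NamedAgg(column='Foo', aggfunc=np.max),
            min_foo=pd.NamedAgg(column='Foo', aggfunc=np.min)
        ).reset_index()
    # pandas 0.24.x: list of (name, func) tuples on the selected column
    return df.groupby(agg_cols)['Foo'].agg(
        [('max_foo', np.max), ('min_foo', np.min)]
    ).reset_index()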

Lookup into dataframe from another with iloc

I have a dataframe with an animal column and a location column to look up, as follows:
import pandas as pd
import numpy as np
i = ['dog', 'cat', 'bird', 'donkey'] * 100000
df1 = pd.DataFrame(np.random.randint(1, high=380, size=len(i)),
['cat', 'bird', 'donkey', 'dog'] * 100000).reset_index()
df1.columns = ['animal', 'locn']
df1.head()
The dataframe to be looked up is as follows:
df = pd.DataFrame(np.random.randn(len(i), 2), index=i,
columns=list('AB')).rename_axis('animal').sort_index(0).reset_index()
df
I'm looking for a faster way to assign a column with the value of B for every record in df1. This:
df1.assign(val=[df[df.animal == a].iloc[b].B for a, b in zip(df1.animal, df1['locn'])])
...is pretty slow.
Use GroupBy.cumcount to build a per-animal position counter, which makes a left merge possible:
df['locn'] = df.groupby('animal').cumcount()
df1['new'] = df1.merge(df.reset_index(), on=['animal','locn'], how='left')['B']
Verify on a smaller dataframe:
np.random.seed(2019)
i = ['dog', 'cat', 'bird', 'donkey'] * 100
df1 = pd.DataFrame(np.random.randint(1, high=10, size=len(i)),
['cat', 'bird', 'donkey', 'dog'] * 100).reset_index()
df1.columns = ['animal', 'locn']
print (df1)
df = pd.DataFrame(np.random.randn(len(i), 2), index=i,
columns=list('AB')).rename_axis('animal').sort_index(0).reset_index()
df1 = df1.assign(val=[df[df.animal == a].iloc[b].B for a, b in zip(df1.animal, df1['locn'])])
df['locn'] = df.groupby('animal').cumcount()
df1['new'] = df1.merge(df.reset_index(), on=['animal','locn'], how='left')['B']
locn = df.groupby('animal').cumcount()
df1 = df1.assign(new1 = df1.merge(df.reset_index().assign(locn = locn),
on=['animal','locn'], how='left')['B'])
print (df1.head(10))
animal locn val new new1
0 cat 9 -0.535465 -0.535465 -0.535465
1 bird 3 0.296240 0.296240 0.296240
2 donkey 6 0.222638 0.222638 0.222638
3 dog 9 1.115175 1.115175 1.115175
4 cat 7 0.608889 0.608889 0.608889
5 bird 9 -0.025648 -0.025648 -0.025648
6 donkey 1 0.324736 0.324736 0.324736
7 dog 1 0.533579 0.533579 0.533579
8 cat 8 -1.818238 -1.818238 -1.818238
9 bird 9 -0.025648 -0.025648 -0.025648
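An equivalent merge-free sketch (my variant, not part of the original answer): index B by (animal, position) once, then look the pairs up with reindex:
pos = df.groupby('animal').cumcount()
s = df.set_index([df['animal'], pos])['B']             # Series keyed by (animal, position)
lookup = pd.MultiIndex.from_arrays([df1['animal'], df1['locn']])
df1['new2'] = s.reindex(lookup).to_numpy()             # NaN where the pair doesn't exist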

df.loc[rows, [col]] vs df.loc[rows, col] in assignment

Why do the following assignments behave differently?
df.loc[rows, [col]] = ...
df.loc[rows, col] = ...
For example:
r = pd.DataFrame({"response": [1, 1, 1]}, index=[1, 2, 3])
df = pd.DataFrame({"x": [999, 99, 9]}, index=[3, 4, 5])
df = pd.merge(df, r, how="left", left_index=True, right_index=True)
df.loc[df["response"].isnull(), "response"] = 0
print(df)
x response
3 999 0.0
4 99 0.0
5 9 0.0
but
df.loc[df["response"].isnull(), ["response"]] = 0
print(df)
x response
3 999 1.0
4 99 0.0
5 9 0.0
Why should I expect the first to behave differently from the second?
df.loc[df["response"].isnull(), ["response"]]
returns a DataFrame, so anything you assign to it must be aligned by both index and columns.
Demo:
In [79]: df.loc[df["response"].isnull(), ["response"]] = \
pd.DataFrame([11,12], columns=['response'], index=[4,5])
In [80]: df
Out[80]:
x response
3 999 1.0
4 99 11.0
5 9 12.0
Alternatively, you can assign an array of the same shape:
In [83]: df.loc[df["response"].isnull(), ["response"]] = [11, 12]
In [84]: df
Out[84]:
x response
3 999 1.0
4 99 11.0
5 9 12.0
I'd also consider using the fillna() method:
In [88]: df.response = df.response.fillna(0)
In [89]: df
Out[89]:
x response
3 999 1.0
4 99 0.0
5 9 0.0
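By contrast, the Series form df.loc[rows, col] aligns an assigned Series on the index alone. A minimal sketch of that one-axis alignment, with hypothetical data mirroring the example:
import pandas as pd

df = pd.DataFrame({"x": [999, 99, 9], "response": [1.0, None, None]}, index=[3, 4, 5])
# Series selection: the assigned value is aligned on index only
df.loc[df["response"].isnull(), "response"] = pd.Series([11, 12], index=[4, 5])
print(df)
#      x  response
# 3  999       1.0
# 4   99      11.0
# 5    9      12.0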

pandas groupby by a dictionary of groups

I've encountered a problem:
import pandas
df=pandas.DataFrame({"code":['a','a','b','c','d'],
'data':[3,4,3,6,7],})
mat={'group1':['a','b'],'group2':['a','c'],'group3':{'a','b','c','d'}}
the df like this
code data
0 a 3
1 a 4
2 b 3
3 c 6
4 d 7
I want the mean of group1, group2, and group3. In this example the key group1 maps to the values a and b, so I find the rows where code equals a or b in df; the mean of group1 is (3+4+3)/3.
group2 -> 'a','c' -> (3+4+6)/3
group3 -> 'a','b','c','d' -> (3+4+3+6+7)/5
I tried to use groupby, but it doesn't work.
thx!
IIUC, you can do something like the following:
In [133]: rules = {
...: 'grp1': ['a','b'],
...: 'grp2': ['a','c'],
...: 'grp3': list('abcd')
...: }
...:
...: r = pd.DataFrame(
...: [{r:df.loc[df.code.isin(rules[r]), 'data'].mean()}
...: for r in rules
...: ]
...: ).stack()
...:
In [134]: r
Out[134]:
0 grp1 3.333333
1 grp2 4.333333
2 grp3 4.600000
dtype: float64
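A slightly tidier variant of the same idea (my sketch, not from the answer): build a plain Series keyed by group name, which avoids the extra integer level that stack() leaves behind:
means = pd.Series({g: df.loc[df['code'].isin(codes), 'data'].mean()
                   for g, codes in rules.items()})
print(means)
# grp1    3.333333
# grp2    4.333333
# grp3    4.600000
# dtype: float64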