Please help me with Pandas, I can't find a good solution.
I've tried map, assign, merge, join, set_index.
Maybe I'm just too tired :)
df:
m_num A B
0 1 0 9
1 1 1 8
2 2 2 7
3 2 3 6
4 3 4 5
5 3 5 4
df1:
m_num C
0 2 99
1 2 88
df_final:
m_num A B C
0 1 0 9 NaN
1 1 1 8 NaN
2 2 2 7 99
3 2 3 6 88
4 3 4 5 NaN
5 3 5 4 NaN
Try:
df2 = df[df['m_num'].isin(df1['m_num'])].reset_index(drop=True)
df2 = pd.merge(df2, df1, on=[df1.index, 'm_num']).drop('key_0', axis=1)
df2 = pd.merge(df, df2, on=['m_num', 'A', 'B'], how='left')
print(df2)
Prints:
m_num A B C
0 1 0 9 NaN
1 1 1 8 NaN
2 2 2 7 99.0
3 2 3 6 88.0
4 3 4 5 NaN
5 3 5 4 NaN
Explanation:
There may be better solutions out there, but this was my thought process. The problem is slightly tricky because 'm_num' is the only common key and it has repeating values.
So first I created a dataframe matching df and df1, so that I can use the index as an additional key for the subsequent merge.
df2 = df[df['m_num'].isin(df1['m_num'])].reset_index(drop=True)
This prints:
m_num A B
0 2 2 7
1 2 3 6
As you can see above, now we have the index 0 and 1 in addition to the m_num as key which we can use to match with df1.
df2 = pd.merge(df2, df1, on=[df1.index, 'm_num']).drop('key_0', axis=1)
This prints:
m_num A B C
0 2 2 7 99
1 2 3 6 88
Finally, merge the above resultant dataframe back onto the original df with a left join to get the output:
df2 = pd.merge(df, df2, on=['m_num', 'A', 'B'], how='left')
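An arguably simpler alternative (a sketch, not part of the original answer): use groupby().cumcount() to number the repeated m_num values on both sides, so that (m_num, occurrence) becomes a unique merge key and no intermediate frame is needed:
import pandas as pd

df = pd.DataFrame({'m_num': [1, 1, 2, 2, 3, 3],
                   'A': [0, 1, 2, 3, 4, 5],
                   'B': [9, 8, 7, 6, 5, 4]})
df1 = pd.DataFrame({'m_num': [2, 2], 'C': [99, 88]})

# cumcount numbers each occurrence of a repeated m_num (0, 1, ...),
# so (m_num, occ) is unique on both sides and a plain left merge works
df_final = (df.assign(occ=df.groupby('m_num').cumcount())
              .merge(df1.assign(occ=df1.groupby('m_num').cumcount()),
                     on=['m_num', 'occ'], how='left')
              .drop(columns='occ'))
print(df_final)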
Related
I have a dataframe where I want to create a new column ("NewValue") which, within each "Group", takes the "Value" from the row with SubGroup = A.
Group SubGroup Value NewValue
0 1 A 1 1
1 1 B 2 1
2 2 A 3 3
3 2 C 4 3
4 3 B 5 NaN
5 3 C 6 NaN
Can this be achieved using a groupby / transform function?
Use Series.map with a DataFrame filtered by boolean indexing:
df['NewValue'] = df['Group'].map(df[df.SubGroup.eq('A')].set_index('Group')['Value'])
print (df)
Group SubGroup Value NewValue
0 1 A 1 1.0
1 1 B 2 1.0
2 2 A 3 3.0
3 2 C 4 3.0
4 3 B 5 NaN
5 3 C 6 NaN
Alternative: a left join via DataFrame.merge after renaming the column:
df1 = df.loc[df.SubGroup.eq('A'),['Group','Value']].rename(columns={'Value':'NewValue'})
df = df.merge(df1, how='left')
print (df)
Group SubGroup Value NewValue
0 1 A 1 1.0
1 1 B 2 1.0
2 2 A 3 3.0
3 2 C 4 3.0
4 3 B 5 NaN
5 3 C 6 NaN
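Since the question explicitly asks about groupby / transform, here is a sketch of that route (assuming it is acceptable to first mask the non-A values), reusing the df defined above:
df['NewValue'] = (df['Value']
                  .where(df['SubGroup'].eq('A'))  # keep Value only where SubGroup is 'A'
                  .groupby(df['Group'])
                  .transform('first'))            # broadcast each group's first non-NaN value
Groups without any SubGroup 'A' row (here Group 3) stay NaN, matching the expected output.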
I need to compute lagged means per groups in my dataframe. This is how my df looks like:
name value round
0 a 5 3
1 b 4 3
2 c 3 2
3 d 1 2
4 a 2 1
5 c 1 1
0 c 1 3
1 d 4 3
2 b 3 2
3 a 1 2
4 b 5 1
5 d 2 1
I would like to compute lagged means for column value per name and round. That is, for name a in round 3 I need to have value_mean = 1.5 (because (1+2)/2). And of course, there will be nan values when round = 1.
I tried this:
df['value_mean'] = df.groupby('name').expanding().mean().groupby('name').shift(1)['value'].values
but it gives nonsense:
name value round value_mean
0 a 5 3 NaN
1 b 4 3 5.0
2 c 3 2 3.5
3 d 1 2 NaN
4 a 2 1 4.0
5 c 1 1 3.5
0 c 1 3 NaN
1 d 4 3 3.0
2 b 3 2 2.0
3 a 1 2 NaN
4 b 5 1 1.0
5 d 2 1 2.5
Any idea, how can I do this, please? I found this, but it seems not relevant for my problem: Calculate the mean value using two columns in pandas
You can do that as follows:
import numpy as np
import pandas as pd

# sort the values so they are accumulated in round order
df.sort_values(['name', 'round'], inplace=True)
df.reset_index(drop=True, inplace=True)
# create a grouper to calculate the running count
# and running sum as the basis of the average
grouper = df.groupby('name')
ser_sum = grouper['value'].cumsum()
ser_count = grouper['value'].cumcount() + 1
ser_mean = ser_sum.div(ser_count)
ser_same_name = df['name'] == df['name'].shift(1)
# finally you just have to set the first entry
# in each name-group to NaN (this usually would
# set the entries for each name and round=1 to NaN)
df['value_mean'] = ser_mean.shift(1).where(ser_same_name, np.nan)
# if you want to see the intermediate products,
# you can uncomment the following lines
#df['sum'] = ser_sum
#df['count'] = ser_count
df
Output:
name value round value_mean
0 a 2 1 NaN
1 a 1 2 2.0
2 a 5 3 1.5
3 b 5 1 NaN
4 b 3 2 5.0
5 b 4 3 4.0
6 c 1 1 NaN
7 c 3 2 1.0
8 c 1 3 2.0
9 d 2 1 NaN
10 d 1 2 2.0
11 d 4 3 1.5
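For completeness, the expanding/shift idea from the question can be made to work as well; the original attempt failed mainly on index alignment. A sketch, reusing the sort/reset step from above:
df = df.sort_values(['name', 'round']).reset_index(drop=True)
# expanding().mean() returns a (name, row) MultiIndex; shift within each
# name, then drop the name level so the result aligns back with df's index
df['value_mean'] = (df.groupby('name')['value']
                      .expanding().mean()
                      .groupby(level=0).shift(1)
                      .droplevel(0))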
import numpy as np
import pandas as pd

df5 = pd.DataFrame({"A": np.random.randint(0, 7, size=50),
                    "B": np.random.randint(-10, 15, size=50)})
df5.mode()
A B
0 1.0 -9
1 NaN 10
2 NaN 13
Where does the NaN come from here?
The reason is explained in DataFrame.mode:
Get the mode(s) of each element along the selected axis.
The mode of a set of values is the value that appears most often. It can be multiple values.
So the missing values appear because column A has only one mode while column B has three; the shorter column is padded with NaN so both fit in one DataFrame.
In my seeded sample data it is the other way around - A has two modes and B only one, because 2 and 3 each appear 11 times:
np.random.seed(20)
df5 = pd.DataFrame({"A": np.random.randint(0, 7, size=50),
"B": np.random.randint(-10, 15, size=50)})
print (df5.mode())
A B
0 2 8.0
1 3 NaN
print (df5.A.value_counts())
3 11 <- both top1
2 11 <- both top1
6 9
5 8
0 5
1 4
4 2
Name: A, dtype: int64
print (df5.B.value_counts())
8 6 <- only one top1
0 4
4 4
-4 3
10 3
-2 3
1 3
12 3
6 3
7 2
3 2
5 2
-9 2
-6 2
14 2
9 2
-1 1
11 1
-3 1
-7 1
Name: B, dtype: int64
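If the NaN padding is unwanted, one option (an assumption about the desired output, not part of the question) is to collect each column's modes separately instead of calling DataFrame.mode on the whole frame:
# dict of modes per column, no NaN padding needed
modes = {col: df5[col].mode().tolist() for col in df5.columns}
print(modes)
# {'A': [2, 3], 'B': [8]}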
I have a dataframe that looks like this:
A B C
1 1 8 3
2 5 4 3
3 5 8 1
and I want to count the values across all columns, to get a df like this:
total
1 2
3 2
4 1
5 2
8 2
Is it possible with pandas?
With np.unique -
In [332]: df
Out[332]:
A B C
1 1 8 3
2 5 4 3
3 5 8 1
In [333]: ids, c = np.unique(df.values.ravel(), return_counts=True)
In [334]: pd.DataFrame({'total':c}, index=ids)
Out[334]:
total
1 2
3 2
4 1
5 2
8 2
With pandas-series -
In [357]: pd.Series(np.ravel(df)).value_counts().sort_index()
Out[357]:
1 2
3 2
4 1
5 2
8 2
dtype: int64
You can also use stack() and groupby()
df = pd.DataFrame({'A':[1,8,3],'B':[5,4,3],'C':[5,8,1]})
print(df)
A B C
0 1 5 5
1 8 4 8
2 3 3 1
df1 = df.stack().reset_index(1)
df1.groupby(0).count()
level_1
0
1 2
3 2
4 1
5 2
8 2
Another alternative is to use stack, followed by value_counts, then convert the result to a frame and finally sort the index:
count_df = df.stack().value_counts().to_frame('total').sort_index()
count_df
Result:
total
1 2
3 2
4 1
5 2
8 2
Using np.unique(..., return_counts=True) and np.column_stack():
pd.DataFrame(np.column_stack(np.unique(df, return_counts=True)))
returns:
0 1
0 1 2
1 3 2
2 4 1
3 5 2
4 8 2
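To match the desired output exactly, the two columns can also be named and indexed (a small extension of the same idea, reusing df from above):
out = (pd.DataFrame(np.column_stack(np.unique(df, return_counts=True)),
                    columns=['value', 'total'])
         .set_index('value'))
print(out)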
I managed to solve this using if statements and for loops, but I'm looking for a less computationally expensive way to do it, e.g. using apply, map, or any other technique.
d = {1:10, 2:20, 3:30}
df
a b
1 35
1 nan
1 nan
2 nan
2 47
2 nan
3 56
3 nan
I want to fill missing values of column b according to dict d, i.e. output should be
a b
1 35
1 10
1 10
2 20
2 47
2 20
3 56
3 30
You can use fillna or combine_first with the mapped column:
print (df['a'].map(d))
0 10
1 10
2 10
3 20
4 20
5 20
6 30
7 30
Name: a, dtype: int64
df['b'] = df['b'].fillna(df['a'].map(d))
print (df)
a b
0 1 35.0
1 1 10.0
2 1 10.0
3 2 20.0
4 2 47.0
5 2 20.0
6 3 56.0
7 3 30.0
df['b'] = df['b'].combine_first(df['a'].map(d))
print (df)
a b
0 1 35.0
1 1 10.0
2 1 10.0
3 2 20.0
4 2 47.0
5 2 20.0
6 3 56.0
7 3 30.0
And if all values are ints, add astype:
df['b'] = df['b'].fillna(df['a'].map(d)).astype(int)
print (df)
a b
0 1 35
1 1 10
2 1 10
3 2 20
4 2 47
5 2 20
6 3 56
7 3 30
If all values in column a are keys of the dict, it is also possible to use replace:
df['b'] = df['b'].fillna(df['a'].replace(d))
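The distinction matters because map returns NaN for values missing from d, while replace leaves them unchanged. A minimal illustration with a hypothetical key 4 that is not in the dict:
s = pd.Series([1, 4])
print(s.map({1: 10}))      # 4 has no key -> NaN
print(s.replace({1: 10}))  # 4 is left unchanged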