Pandas groupby calculation using values from different rows based on other column - pandas

I have the following dataframe, where observations are grouped in pairs by transaction. A NaN in name stands for some other product traded against A. I want to group by transaction and compute
A / NaN (the value of A divided by the value of the other product), so that the value of every NaN row is expressed in units of A.
transaction name value ...many other columns
1 A 3
1 NaN 5
2 NaN 7
2 A 6
3 A 4
3 NaN 3
4 A 10
4 NaN 9
5 C 8
5 A 6
..
Thus the desired df would be
transaction name value new_column ...many other columns
1 A 3 NaN
1 NaN 6 0.5
2 NaN 7 0.8571
2 A 6 NaN
3 A 4 1.333
3 NaN 3 NaN
4 A 10 1.111
4 NaN 9 NaN
5 C 8 0.75
5 A 6 NaN
...

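First filter the rows where name is A and set transaction as the index, which gives a Series of A values keyed by transaction. Then map that Series onto the transaction column of the remaining rows with Series.map and divide by their value: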
m = df['name'].ne('A')
s = df[~m].set_index('transaction')['value']
df.loc[m, 'new_column'] = df.loc[m, 'transaction'].map(s) / df.loc[m, 'value']
print (df)
transaction name value new_column
0 1 A 3 NaN
1 1 NaN 5 0.600000
2 2 NaN 7 0.857143
3 2 A 6 NaN
4 3 A 4 NaN
5 3 NaN 3 1.333333
6 4 A 10 NaN
7 4 NaN 9 1.111111
8 5 NaN 8 0.750000
9 5 A 6 NaN
EDIT: If there are multiple A values per group rather than only one, a possible solution is to remove the duplicates first:
print (df)
transaction name value
0 1 A 3
1 1 A 4
2 1 NaN 5
3 2 NaN 7
4 2 A 6
5 3 A 4
6 3 NaN 3
7 4 A 10
8 4 NaN 9
9 5 C 8
10 5 A 6
# s = df[~m].set_index('transaction')['value']
# df.loc[m, 'new_column'] = df.loc[m, 'transaction'].map(s) / df.loc[m, 'value']
# print (df)
#InvalidIndexError: Reindexing only valid with uniquely valued Index objects
m = df['name'].ne('A')
print (df[~m].drop_duplicates(['transaction','name']))
transaction name value
0 1 A 3
4 2 A 6
5 3 A 4
7 4 A 10
10 5 A 6
s = df[~m].drop_duplicates(['transaction','name']).set_index('transaction')['value']
df.loc[m, 'new_column'] = df.loc[m, 'transaction'].map(s) / df.loc[m, 'value']
print (df)
transaction name value new_column
0 1 A 3 NaN <- 2 times a per 1 group
1 1 A 4 NaN <- 2 times a per 1 group
2 1 NaN 5 0.600000
3 2 NaN 7 0.857143
4 2 A 6 NaN
5 3 A 4 NaN
6 3 NaN 3 1.333333
7 4 A 10 NaN
8 4 NaN 9 1.111111
9 5 C 8 0.750000
10 5 A 6 NaN
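
For reference, a self-contained sketch of the approach above that can be run as is. The frame is rebuilt by hand from the EDIT sample, and keeping the first A row per transaction is an assumption about which duplicate should win:

```python
import numpy as np
import pandas as pd

# Sample data rebuilt by hand from the EDIT example (two A rows in transaction 1).
df = pd.DataFrame({
    "transaction": [1, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5],
    "name": ["A", "A", np.nan, np.nan, "A", "A", np.nan, "A", np.nan, "C", "A"],
    "value": [3, 4, 5, 7, 6, 4, 3, 10, 9, 8, 6],
})

# True for every row that is not the reference product A (NaN and C alike).
m = df["name"].ne("A")

# One A value per transaction: keep only the first A row of each transaction
# (assumption: the first duplicate is the one to use).
s = df[~m].drop_duplicates("transaction").set_index("transaction")["value"]

# Express each non-A row in units of A; A rows are left as NaN.
df.loc[m, "new_column"] = df.loc[m, "transaction"].map(s) / df.loc[m, "value"]
print(df)
```

If a different A row should be used (for example the last one, or the mean), only the line that builds s needs to change.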

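Assuming there are only two values per transaction, you can use agg to take the first and last element of each group and divide one by the other. The result is indexed by transaction, so map it back through the transaction column rather than assigning it directly:
pairs = df.sort_values(by='name').groupby('transaction')['value'].agg(f='first', l='last')
mask = df['name'].isna()
df.loc[mask, 'new_column'] = df.loc[mask, 'transaction'].map(pairs['f'] / pairs['l'])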
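
A runnable sketch of this first/last idea, assuming exactly one A row and one NaN row per transaction (the small frame is made up from the question's first three pairs). The aggregated ratio is indexed by transaction, so it is looked up through the transaction column instead of being aligned on the row index:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "transaction": [1, 1, 2, 2, 3, 3],
    "name": ["A", np.nan, np.nan, "A", "A", np.nan],
    "value": [3, 5, 7, 6, 4, 3],
})

# Sorting by name puts the A row first and the NaN row last in each group.
pairs = (
    df.sort_values(by="name")
      .groupby("transaction")["value"]
      .agg(f="first", l="last")
)
ratio = pairs["f"] / pairs["l"]  # indexed by transaction, not by row

# Look the ratio up per row via the transaction column.
mask = df["name"].isna()
df.loc[mask, "new_column"] = df.loc[mask, "transaction"].map(ratio)
print(df)
```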

Related

pandas dataframe auto fill values if have same value on specific column [duplicate]

I have the data below. Recent pandas versions no longer keep the grouping columns in the result of a groupby fillna/ffill/bfill. Is there a way to keep them?
import io
import pandas as pd
data = """one;two;three
1;1;10
1;1;nan
1;1;nan
1;2;nan
1;2;20
1;2;nan
1;3;nan
1;3;nan"""
df = pd.read_csv(io.StringIO(data), sep=";")
print(df)
one two three
0 1 1 10.0
1 1 1 NaN
2 1 1 NaN
3 1 2 NaN
4 1 2 20.0
5 1 2 NaN
6 1 3 NaN
7 1 3 NaN
print(df.groupby(['one','two']).ffill())
three
0 10.0
1 10.0
2 10.0
3 NaN
4 20.0
5 20.0
6 NaN
7 NaN
With the most recent pandas, if we would like to keep the groupby columns, we need to add apply here (group_keys=False keeps the original flat index shown below):
out = df.groupby(['one','two'], group_keys=False).apply(lambda x : x.ffill())
Out[219]:
one two three
0 1 1 10.0
1 1 1 10.0
2 1 1 10.0
3 1 2 NaN
4 1 2 20.0
5 1 2 20.0
6 1 3 NaN
7 1 3 NaN
Does this do what you expect?
df['three']= df.groupby(['one','two'])['three'].ffill()
print(df)
# Output:
one two three
0 1 1 10.0
1 1 1 10.0
2 1 1 10.0
3 1 2 NaN
4 1 2 20.0
5 1 2 20.0
6 1 3 NaN
7 1 3 NaN
Yes. Set the grouping columns as the index first and then group by them, so that they are preserved in the result, as shown here:
df = pd.read_csv(io.StringIO(data), sep=";")
df.set_index(['one','two'], inplace=True)
df.groupby(['one','two']).ffill()
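
A minimal sketch of the column-assignment variant, using a shortened copy of the sample data (the name filled is only illustrative). Because only the value column is forward-filled and written back, the key columns are never dropped:

```python
import io
import pandas as pd

data = """one;two;three
1;1;10
1;1;nan
1;2;nan
1;2;20
1;2;nan
1;3;nan"""
df = pd.read_csv(io.StringIO(data), sep=";")

# Forward-fill only the value column within each (one, two) group and
# put it back next to the untouched key columns.
filled = df.assign(three=df.groupby(["one", "two"])["three"].ffill())
print(filled)
```

The fill never crosses a group boundary, so a group that starts with NaN keeps that NaN.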

Any function/attribute in dataframe similar to attribute 'remove' or 'pop'

Is there any attribute/function for a dataframe similar to the 'remove' attribute of a series, to remove the first occurrence of duplicated index labels in a dataframe?
Dataframe:
a b c d
100 1 2 3 NaN
200 4 5 6 NaN
100 7 9 10 NaN
Desired output:(after the desired command)
a b c d
200 4 5 6 NaN
100 7 9 10 NaN
Try boolean indexing with Index.duplicated and keep='last':
>>> df[~df.index.duplicated(keep='last')]
a b c d
200 4 5 6 NaN
100 7 9 10 NaN
>>>
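Edit: to select the removed rows (the earlier duplicates) by position instead, with numpy imported as np: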
df.iloc[np.where(df.index.duplicated(keep='last'))]
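
A small sketch that rebuilds the sample frame and separates the kept rows from the dropped ones. The variable names are illustrative, not part of any pandas API:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"a": [1, 4, 7], "b": [2, 5, 9], "c": [3, 6, 10], "d": [np.nan] * 3},
    index=[100, 200, 100],
)

# True for every row whose index label appears again later in the frame.
earlier_dupes = df.index.duplicated(keep="last")

kept = df[~earlier_dupes]     # frame without the earlier occurrences
removed = df[earlier_dupes]   # the rows that were dropped, similar to a "pop"
print(kept)
print(removed)
```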

Applying multiple functions to a pivot table (grouped) dataframe

I currently have a dataframe which looks like this:
df:
store item sales
0 1 1 10
1 1 2 20
2 2 1 10
3 3 2 20
4 4 3 10
5 3 4 15
...
I wanted to view the total sales of each item for each store, so I used a pivot table to create this:
p_table = pd.pivot_table(df, index='store', values='sales', columns='item', aggfunc='sum')
which gives something like:
sales
item 1 2 3 4
store
1 20 30 10 8
2 10 14 12 13
3 1 23 29 10
....
What I want to do now is apply a function so that each item's total sales is expressed as a percentage of the total sales for that store. For example, the value for item 1 at store 1 would become:
20/(20+30+10+8) * 100
I am struggling to do this for a stacked dataframe. Any suggestions would be much appreciated.
Thanks
I think you need to divide with div by the Series of row totals created by sum (multiply the result by 100 to get percentages):
print (p_table)
item 1 2 3 4
store
1 10.0 20.0 NaN NaN
2 10.0 NaN NaN NaN
3 NaN 20.0 NaN 15.0
4 NaN NaN 10.0 NaN
print (p_table.sum(axis=1))
store
1 30.0
2 10.0
3 35.0
4 10.0
dtype: float64
out = p_table.div(p_table.sum(axis=1), axis=0)
print (out)
item 1 2 3 4
store
1 0.333333 0.666667 NaN NaN
2 1.000000 NaN NaN NaN
3 NaN 0.571429 NaN 0.428571
4 NaN NaN 1.0 NaN
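
A runnable sketch of the same division scaled to percentages, using the six sample rows from the question (the name pct is illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "store": [1, 1, 2, 3, 4, 3],
    "item": [1, 2, 1, 2, 3, 4],
    "sales": [10, 20, 10, 20, 10, 15],
})

p_table = pd.pivot_table(df, index="store", columns="item",
                         values="sales", aggfunc="sum")

# Divide each row by its row total (axis=0 aligns the totals on the store
# index), then scale to percent. Cells that were NaN stay NaN.
pct = p_table.div(p_table.sum(axis=1), axis=0).mul(100)
print(pct.round(2))
```

Each row of pct sums to 100, ignoring the missing cells.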

How to map missing values of a df's column according to another column's values (of the same df) using a dictionary? Python

I managed to solve this using if statements and for loops, but I'm looking for a less computationally expensive way to do it, e.g. using apply, map, or any other technique.
d = {1:10, 2:20, 3:30}
df
a b
1 35
1 nan
1 nan
2 nan
2 47
2 nan
3 56
3 nan
I want to fill missing values of column b according to dict d, i.e. output should be
a b
1 35
1 10
1 10
2 20
2 47
2 20
3 56
3 30
You can use fillna or combine_first with column a mapped through the dictionary:
print (df['a'].map(d))
0 10
1 10
2 10
3 20
4 20
5 20
6 30
7 30
Name: a, dtype: int64
df['b'] = df['b'].fillna(df['a'].map(d))
print (df)
a b
0 1 35.0
1 1 10.0
2 1 10.0
3 2 20.0
4 2 47.0
5 2 20.0
6 3 56.0
7 3 30.0
df['b'] = df['b'].combine_first(df['a'].map(d))
print (df)
a b
0 1 35.0
1 1 10.0
2 1 10.0
3 2 20.0
4 2 47.0
5 2 20.0
6 3 56.0
7 3 30.0
And if all values are ints, add astype:
df['b'] = df['b'].fillna(df['a'].map(d)).astype(int)
print (df)
a b
0 1 35
1 1 10
2 1 10
3 2 20
4 2 47
5 2 20
6 3 56
7 3 30
If all values in column a are keys of the dict, it is also possible to use replace:
df['b'] = df['b'].fillna(df['a'].replace(d))
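
A short sketch of why the replace variant needs every value of a to be a key of the dict. The extra row with a == 4 is made up for illustration and is not in the original data:

```python
import numpy as np
import pandas as pd

d = {1: 10, 2: 20, 3: 30}
df = pd.DataFrame({
    "a": [1, 1, 2, 2, 3, 4],   # 4 is deliberately missing from d
    "b": [35, np.nan, np.nan, 47, np.nan, np.nan],
})

# map() returns NaN for keys absent from d, so that row stays missing.
via_map = df["b"].fillna(df["a"].map(d))

# replace() leaves unmatched values unchanged, so the row with a == 4
# would be filled with 4 itself.
via_replace = df["b"].fillna(df["a"].replace(d))

print(pd.DataFrame({"via_map": via_map, "via_replace": via_replace}))
```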

based on a value in column A, shift the values in columns C and D to the right in a pandas dataframe

How can I achieve the desired result based on the following dataset?
A B C D E
1 apple 5 2 20 NaN
2 orange 2 6 30 NaN
3 apple 6 1 40 NaN
4 apple 10 3 50 NaN
5 banana 8 9 60 NaN
Desired Result :
A B C D E
1 apple 5 NaN 2 20
2 orange 2 6 30 NaN
3 apple 6 NaN 1 40
4 apple 10 NaN 3 50
5 banana 8 9 60 NaN
IIUC you can use np.roll on the rows of interest, here we need to select only the rows where 'A' is 'apple' and then roll these by a single column row-wise and assign back:
In [14]:
df.loc[df['A']=='apple', 'C':] = np.roll(df.loc[df['A']=='apple', 'C':], 1,axis=1)
df
Out[14]:
A B C D E
1 apple 5 NaN 2 20.0
2 orange 2 6.0 30 NaN
3 apple 6 NaN 1 40.0
4 apple 10 NaN 3 50.0
5 banana 8 9.0 60 NaN
Note that because you introduce NaN values, the dtype changes to float to allow this.
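
A self-contained sketch of the roll on a three-row version of the data. It assumes columns C to E are created as floats up front, which sidesteps the dtype change mentioned above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": ["apple", "orange", "apple"],
    "B": [5, 2, 6],
    "C": [2.0, 6.0, 1.0],
    "D": [20.0, 30.0, 40.0],
    "E": [np.nan, np.nan, np.nan],
}, index=[1, 2, 3])

rows = df["A"] == "apple"
block = df.loc[rows, "C":"E"].to_numpy()

# Shift each selected row one column to the right; the trailing NaN in E
# wraps around into C.
df.loc[rows, "C":"E"] = np.roll(block, 1, axis=1)
print(df)
```

Only the apple rows change; the orange row keeps its original C, D and E values.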