How to get column name based on multiple columns in pandas?

Goal:
Create columns
fst_imp: the name of the column holding the smallest value in each row.
snd_imp: the name of the column holding the second-smallest value in each row.
trd_imp: the name of the column holding the third-smallest value in each row.
Example result:
A B C fst_imp snd_imp trd_imp
0 1 2 3 A B C
1 6 5 4 C B A
2 7 9 8 A C B

Here is one potential solution using numpy.argsort, the pandas.DataFrame constructor and DataFrame.join:
# Setup
import numpy as np
import pandas as pd
df = pd.DataFrame({'A': {0: 1, 1: 6, 2: 7}, 'B': {0: 2, 1: 5, 2: 9}, 'C': {0: 3, 1: 4, 2: 8}})

df.join(pd.DataFrame([df.columns.values[x] for x in np.argsort(df.values)],
                     columns=['fst_imp', 'snd_imp', 'trd_imp']))
[out]
A B C fst_imp snd_imp trd_imp
0 1 2 3 A B C
1 6 5 4 C B A
2 7 9 8 A C B
Or, a bit more scalable (without hard-coding the new column names)...
df.join(pd.DataFrame([df.columns.values[x] for x in np.argsort(df.values)]))
[out]
A B C 0 1 2
0 1 2 3 A B C
1 6 5 4 C B A
2 7 9 8 A C B
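A side note not in the original answer: the list comprehension can be dropped entirely, since NumPy fancy indexing accepts the whole 2-D argsort result at once. A minimal sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 6, 7], 'B': [2, 5, 9], 'C': [3, 4, 8]})

# Index the 1-D array of column names with the 2-D argsort result directly
ranked = pd.DataFrame(df.columns.to_numpy()[np.argsort(df.to_numpy())],
                      columns=['fst_imp', 'snd_imp', 'trd_imp'],
                      index=df.index)
result = df.join(ranked)
```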

Related

Iterate over rows and subtract values in pandas df

I have the following table:
ID  Qty_1  Qty_2
A   1      10
A   2      0
A   3      0
B   3      29
B   2      0
B   1      0
I want to iterate based on the ID, and subtract Qty_2 - Qty_1 and update the next row with that result.
The result would be:
ID  Qty_1  Qty_2
A   1      10
A   2      8
A   3      5
B   3      29
B   2      27
B   1      26
Ideally, I would also like the subtraction to start already on the first row whenever a new ID appears, and only after that continue the loop:
ID  Qty_1  Qty_2
A   1      9
A   2      7
A   3      4
B   3      26
B   2      24
B   1      23
A solution to either variant is fine. Thank you!
First compute the row-wise difference Qty_2 - Qty_1, then group by 'ID' and take the cumulative sum:
df['Qty_2'] = df.assign(Qty_2=df['Qty_2'].sub(df['Qty_1'])) \
                .groupby('ID')['Qty_2'].cumsum()
print(df)
# Output:
ID Qty_1 Qty_2
0 A 1 9
1 A 2 7
2 A 3 4
3 B 3 26
4 B 2 24
5 B 1 23
Setup:
data = {'ID': ['A', 'A', 'A', 'B', 'B', 'B'],
        'Qty_1': [1, 2, 3, 3, 2, 1],
        'Qty_2': [10, 0, 0, 29, 0, 0]}
df = pd.DataFrame(data)
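The cumsum answer above matches the second desired output, where the subtraction applies from the first row of each ID. For the first variant, where the first row of each ID keeps its original Qty_2, one possible sketch (an assumption, not from the original answer: zero out the subtraction on each group's first row via GroupBy.cumcount):

```python
import pandas as pd

data = {'ID': ['A', 'A', 'A', 'B', 'B', 'B'],
        'Qty_1': [1, 2, 3, 3, 2, 1],
        'Qty_2': [10, 0, 0, 29, 0, 0]}
df = pd.DataFrame(data)

# Subtract Qty_1 only on rows that are NOT the first of their ID group
sub = df['Qty_1'].where(df.groupby('ID').cumcount().ne(0), 0)
df['Qty_2'] = df['Qty_2'].sub(sub).groupby(df['ID']).cumsum()
```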

Append dataframe in specific row

I have dataframe in the following format
a b label
1 5 A
2 6 A
3 7 A
4 8 B
1 5 B
2 6 B
5 6 C
3 2 C
I want to append a new dataframe:
a b label
3 4 A
The result becomes this:
a b label
1 5 A
2 6 A
3 7 A
4 8 B
1 5 B
2 6 B
5 6 C
3 2 C
3 4 A <-- New Data
My question is: how can the new data be placed in order like this on every append?
a b label
1 5 A
2 6 A
3 7 A
3 4 A <-- New Data
4 8 B
1 5 B
2 6 B
5 6 C
3 2 C
This is my code:
import pandas as pd

df1 = pd.DataFrame({"a": [1, 2, 3, 4, 1, 2, 5, 3],
                    "b": [5, 6, 7, 8, 5, 6, 6, 2],
                    "label": ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C']})
new_data = pd.DataFrame({"a": [3],
                         "b": [4],
                         "label": ['A']})
df1 = df1.append(new_data, ignore_index=True)
You can simply sort on the label column after the append. Use a stable sort (kind='stable'), which guarantees that rows with equal labels keep their relative order, so the appended row lands after the existing 'A' rows:
import pandas as pd

df1 = pd.DataFrame({"a": [1, 2, 3, 4, 1, 2, 5, 3],
                    "b": [5, 6, 7, 8, 5, 6, 6, 2],
                    "label": ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C']})
new_data = pd.DataFrame({"a": [3],
                         "b": [4],
                         "label": ['A']})
df1 = df1.append(new_data, ignore_index=True).sort_values(by='label', kind='stable')
Result:
a b label
1 5 A
2 6 A
3 7 A
3 4 A <-- new data here
4 8 B
1 5 B
2 6 B
5 6 C
3 2 C
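Note that DataFrame.append was removed in pandas 2.0. A sketch of the same approach using pd.concat instead, again with a stable sort so the appended row stays after the existing 'A' rows:

```python
import pandas as pd

df1 = pd.DataFrame({"a": [1, 2, 3, 4, 1, 2, 5, 3],
                    "b": [5, 6, 7, 8, 5, 6, 6, 2],
                    "label": ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C']})
new_data = pd.DataFrame({"a": [3], "b": [4], "label": ['A']})

# pd.concat replaces the removed append; kind='stable' keeps equal labels in order
df1 = pd.concat([df1, new_data], ignore_index=True).sort_values(by='label', kind='stable')
```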

Get group counts of level 1 after doing a group by on two columns

I am doing a group-by on two columns and need the count of values in the first level of the resulting index.
I tried the following:
>>> import pandas as pd
>>> df = pd.DataFrame({'A': ['one', 'one', 'two', 'three', 'three', 'one'], 'B': [1, 2, 0, 4, 3, 4], 'C': [3,3,3,3,4,8]})
>>> print(df)
A B C
0 one 1 3
1 one 2 3
2 two 0 3
3 three 4 3
4 three 3 4
5 one 4 8
>>> aggregator = {'C': {'sC' : 'sum','cC':'count'}}
>>> df.groupby(["A", "B"]).agg(aggregator)
/envs/pandas/lib/python3.7/site-packages/pandas/core/groupby/generic.py:1315: FutureWarning: using a dict with renaming is deprecated and will be removed in a future version
return super(DataFrameGroupBy, self).aggregate(arg, *args, **kwargs)
C
sC cC
A B
one 1 3 1
2 3 1
4 8 1
three 3 4 1
4 3 1
two 0 3 1
I want an output something like this where the last column tC gives me the count corresponding to group one, two and three.
C
sC cC tC
A B
one 1 3 1 3
2 3 1
4 8 1
three 3 4 1 2
4 3 1
two 0 3 1 1
If there is only one column to aggregate, pass a list of tuples:
aggregator = [('sC' , 'sum'),('cC', 'count')]
df = df.groupby(["A", "B"])['C'].agg(aggregator)
For the last column, convert the first level of the MultiIndex to a Series, get the group counts with GroupBy.transform and GroupBy.size, and keep the count only on each group's first row with numpy.where:
s = df.index.get_level_values(0).to_series()
df['tC'] = np.where(s.duplicated(), np.nan, s.groupby(s).transform('size'))
print(df)
sC cC tC
A B
one 1 3 1 3.0
2 3 1 NaN
4 8 1 NaN
three 3 4 1 2.0
4 3 1 NaN
two 0 3 1 1.0
You can also set the duplicated values to an empty string in the tC column, but then any later numeric operation on this column will fail, because it mixes numbers with strings:
df['tC'] = np.where(s.duplicated(), '', s.groupby(s).transform('size'))
print(df)
sC cC tC
A B
one 1 3 1 3
2 3 1
4 8 1
three 3 4 1 2
4 3 1
two 0 3 1 1
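The dict-of-dict renaming that triggers the FutureWarning in the question was removed in pandas 0.25, which introduced named aggregation. The whole answer can also be sketched with that syntax (same logic, modern agg keywords):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['one', 'one', 'two', 'three', 'three', 'one'],
                   'B': [1, 2, 0, 4, 3, 4],
                   'C': [3, 3, 3, 3, 4, 8]})

# Named aggregation: new column name = (source column, aggregation function)
out = df.groupby(['A', 'B']).agg(sC=('C', 'sum'), cC=('C', 'count'))

# Group size of the first index level, shown only on each group's first row
s = out.index.get_level_values('A').to_series()
out['tC'] = np.where(s.duplicated(), np.nan, s.groupby(s).transform('size').to_numpy())
```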

Separate aggregated data in different rows [duplicate]

This question already has answers here:
How can I replicate rows of a Pandas DataFrame?
(10 answers)
I want to replicate rows in a Pandas Dataframe. Each row should be repeated n times, where n is a field of each row.
import pandas as pd
what_i_have = pd.DataFrame(data={
'id': ['A', 'B', 'C'],
'n' : [ 1, 2, 3],
'v' : [ 10, 13, 8]
})
what_i_want = pd.DataFrame(data={
'id': ['A', 'B', 'B', 'C', 'C', 'C'],
'v' : [ 10, 13, 13, 8, 8, 8]
})
Is this possible?
You can use Index.repeat to build repeated index values based on the column, then select the rows from the DataFrame with loc:
df2 = df.loc[df.index.repeat(df.n)]
id n v
0 A 1 10
1 B 2 13
1 B 2 13
2 C 3 8
2 C 3 8
2 C 3 8
Or you could use np.repeat to get the repeated indices and then use that to index into the frame:
df2 = df.loc[np.repeat(df.index.values, df.n)]
id n v
0 A 1 10
1 B 2 13
1 B 2 13
2 C 3 8
2 C 3 8
2 C 3 8
After which there's only a bit of cleaning up to do:
df2 = df2.drop("n", axis=1).reset_index(drop=True)
id v
0 A 10
1 B 13
2 B 13
3 C 8
4 C 8
5 C 8
Note that if you might have duplicate indices to worry about, you could use .iloc instead:
df.iloc[np.repeat(np.arange(len(df)), df["n"])].drop("n", axis=1).reset_index(drop=True)
id v
0 A 10
1 B 13
2 B 13
3 C 8
4 C 8
5 C 8
which uses the positions, and not the index labels.
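To illustrate the difference, a sketch with deliberately duplicated index labels (an assumed frame, not from the question): loc would over-select under duplicate labels, while the positional iloc version stays correct:

```python
import numpy as np
import pandas as pd

# Same data as the question, but with duplicate index labels 0, 0, 1
df = pd.DataFrame({'id': ['A', 'B', 'C'], 'n': [1, 2, 3], 'v': [10, 13, 8]},
                  index=[0, 0, 1])

# Positions are unambiguous even when labels repeat
out = df.iloc[np.repeat(np.arange(len(df)), df['n'])] \
        .drop('n', axis=1).reset_index(drop=True)
```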
You could use set_index and repeat
In [1057]: df.set_index(['id'])['v'].repeat(df['n']).reset_index()
Out[1057]:
id v
0 A 10
1 B 13
2 B 13
3 C 8
4 C 8
5 C 8
Details
In [1058]: df
Out[1058]:
id n v
0 A 1 10
1 B 2 13
2 C 3 8
It's something like the uncount in tidyr:
https://tidyr.tidyverse.org/reference/uncount.html
I wrote a package (https://github.com/pwwang/datar) that implements this API:
from datar import f
from datar.tibble import tribble
from datar.tidyr import uncount
what_i_have = tribble(
    f.id, f.n, f.v,
    'A',  1,   10,
    'B',  2,   13,
    'C',  3,   8
)
what_i_have >> uncount(f.n)
Output:
id v
0 A 10
1 B 13
1 B 13
2 C 8
2 C 8
2 C 8
Not the best solution, but I want to share this: you could also use DataFrame.reindex() together with Index.repeat():
df.reindex(df.index.repeat(df.n)).drop('n', axis=1)
Output:
id v
0 A 10
1 B 13
1 B 13
2 C 8
2 C 8
2 C 8
You can further append .reset_index(drop=True) to reset the .index.

how to selectively filter elements in pandas group

I want to selectively remove elements of a pandas group based on their properties within the group.
Here's an example: remove all elements except the row with the highest value in the 'A' column
>>> dff = pd.DataFrame({'A': [0, 2, 4, 1, 9, 2, 3, 10], 'B': list('aabbbbcc'), 'C': list('lmnopqrt')})
>>> dff
A B C
0 0 a l
1 2 a m
2 4 b n
3 1 b o
4 9 b p
5 2 b q
6 3 c r
7 10 c t
>>> grped = dff.groupby('B')
>>> grped.groups
{'a': [0, 1], 'c': [6, 7], 'b': [2, 3, 4, 5]}
Apply a custom function/method to the groups (sort within each group on column 'A', then filter elements).
>>> yourGenius(grped,'A').reset_index()
returns dataframe:
A B C
0 2 a m
1 9 b p
2 10 c t
maybe there is a compact way to do this with a lambda function or .filter()? thanks
If you want to select one row per group, you could use groupby/agg
to return index values and select the rows using loc.
For example, to group by B and then select the row with the highest A value:
In [171]: dff
Out[171]:
A B C
0 0 a l
1 2 a m
2 4 b n
3 1 b o
4 9 b p
5 2 b q
6 3 c r
7 10 c t
[8 rows x 3 columns]
In [172]: dff.loc[dff.groupby('B')['A'].idxmax()]
Out[172]:
A B C
1 2 a m
4 9 b p
7 10 c t
another option (suggested by jezrael) which in practice is faster for a wide range of DataFrames is
dff.sort_values(by=['A'], ascending=False).drop_duplicates('B')
If you wish to select many rows per group, you could use groupby/apply with a function that returns sub-DataFrames for
each group. apply will then try to merge these sub-DataFrames for you.
For example, to select every row except the last from each group:
In [216]: df = pd.DataFrame(np.arange(15).reshape(5,3), columns=list('ABC'), index=list('vwxyz')); df['A'] %= 2; df
Out[216]:
A B C
v 0 1 2
w 1 4 5
x 0 7 8
y 1 10 11
z 0 13 14
In [217]: df.groupby(['A']).apply(lambda grp: grp.iloc[:-1]).reset_index(drop=True, level=0)
Out[217]:
A B C
v 0 1 2
x 0 7 8
w 1 4 5
Another way is to use groupby/apply to return a Series of index values. Again apply will try to join the Series into one Series. You could then use df.loc to select rows by index value:
In [218]: df.loc[df.groupby(['A']).apply(lambda grp: pd.Series(grp.index[:-1]))]
Out[218]:
A B C
v 0 1 2
x 0 7 8
w 1 4 5
I don't think groupby/filter will do what you wish, since
groupby/filter filters whole groups. It doesn't allow you to select particular rows from each group.
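For contrast, a short sketch of what groupby/filter does do: it keeps or drops entire groups based on a predicate (here, an assumed criterion of keeping groups of 'B' whose maximum 'A' exceeds 3):

```python
import pandas as pd

dff = pd.DataFrame({'A': [0, 2, 4, 1, 9, 2, 3, 10],
                    'B': list('aabbbbcc'),
                    'C': list('lmnopqrt')})

# filter evaluates the predicate once per group and keeps whole groups:
# group 'a' (max A = 2) is dropped; 'b' and 'c' are kept in full
kept = dff.groupby('B').filter(lambda g: g['A'].max() > 3)
```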