python3: How to sum column values for every three rows of a column - pandas

There is a dataframe like following:
id  t_id  y1  y2
1   1     2   1
1   2     2   1
1   3     2   1
1   4     2   1
1   ...
1   15    2   1
2   1     2   8
2   2     5   6
2   3     5   7
2   4     5   5
2   ...
2   15    5   10
...
There are several ids (1, 2, ...), and for each id there are t_id values 1 through 15, each with y1 and y2 values. I want to sum y1 and y2 for each id over every three t_id values (1,2,3; 4,5,6; 7,8,9; 10,11,12; 13,14,15), like the following dataframe (the numbers are just an example):
id  t_id      y1  y2
1   1,2,3     6   3
1   4,5,6     6   3
1   7,8,9     6   3
1   10,11,12  6   3
1   13,14,15  6   3
......
Thanks in advance!

You can map t_id and groupby:
mapped_t_id = (df['t_id'] - 1) // 3
(df.groupby(['id', mapped_t_id])
   .agg({'t_id': set, 'y1': 'sum', 'y2': 'sum'})
)
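If you want t_id shown as a comma-separated string and a flat index like the desired output, here is a minimal follow-up sketch (column names assumed from the question):
grp = (df['t_id'] - 1) // 3   # 0 for t_id 1-3, 1 for 4-6, ...
out = (df.groupby(['id', grp.rename('t_group')])
         .agg(t_id=('t_id', lambda s: ','.join(s.astype(int).astype(str))),
              y1=('y1', 'sum'),
              y2=('y2', 'sum'))
         .reset_index()
         .drop(columns='t_group'))
print(out)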

You could process the entire dataframe by grouping it three rows at a time (this relies on the default RangeIndex and on each id covering a multiple of three rows). Here's how to do it:
import pandas as pd

df = pd.DataFrame({'id': [1]*15 + [2]*15,
                   't_id': [i for i in range(1, 16)]*2,
                   'y1': [2]*15 + [5]*15,
                   'y2': [1]*15 + [i for i in range(15, 0, -1)]})
print(df)

df1 = pd.DataFrame(data=None, columns=df.columns)
for index, g in df.groupby(df.index // 3):
    df_id = g.iloc[0, 0]
    df_tid = ','.join(str(g.iloc[i, 1]) for i in range(3))
    df_y1 = g.sum(axis=0)['y1']
    df_y2 = g.sum(axis=0)['y2']
    df1.loc[index] = [df_id, df_tid, df_y1, df_y2]
print(df1)
The output of this will be:
id t_id y1 y2
0 1 1,2,3 6 3
1 1 4,5,6 6 3
2 1 7,8,9 6 3
3 1 10,11,12 6 3
4 1 13,14,15 6 3
5 2 1,2,3 15 42
6 2 4,5,6 15 33
7 2 7,8,9 15 24
8 2 10,11,12 15 15
9 2 13,14,15 15 6
While this is a good response, I do like how @Quang Hoang solved it; it is concise.

Related

Append new column to DF after sum?

I have a sample dataframe below:
sn C1-1 C1-2 C1-3 H2-1 H2-2 K3-1 K3-2
1 4 3 5 4 1 4 2
2 2 2 0 2 0 1 2
3 1 2 0 0 2 1 2
I would like to sum based on the prefixes C1, H2, K3 and output three new columns with the total sums. The final result is this:
sn total_c1 total_h2 total_k3
1 12 5 6
2 4 2 3
3 3 2 3
What I have tried on my original df:
lst = ["C1", "H2", "K3"]
lst2 = ["total_c1", "total_h2", "total_k3"]
for k in lst:
idx = df.columns.str.startswith(i)
for j in lst2:
df[j] = df.iloc[:,idx].sum(axis=1)
df1 = df.append(df, sort=False)
But I kept getting this error:
IndexError: Item wrong length 35 instead of 36.
I can't figure out how to append the new total column to produce my end result in the loop.
Any help will be appreciated (or better suggestion as oppose to loop). Thank you.
You can use groupby:
# columns of interest
cols = df.columns[1:]
col_groups = cols.str.split('-').str[0]
out_df = df[['sn']].join(df[cols].groupby(col_groups, axis=1)
                                 .sum()
                                 .add_prefix('total_')
                        )
Output:
sn total_C1 total_H2 total_K3
0 1 12 5 6
1 2 4 2 3
2 3 3 2 3
Let us try split, then groupby with axis=1:
out = (df.groupby(df.columns.str.split('-').str[0], axis=1)
         .sum()
         .set_index('sn')
         .add_prefix('Total_')
         .reset_index())
Out[84]:
sn Total_C1 Total_H2 Total_K3
0 1 12 5 6
1 2 4 2 3
2 3 3 2 3
Another option, where we create a dictionary to groupby the columns:
mapping = {entry: f"total_{entry[:2]}" for entry in df.columns[1:]}
result = df.groupby(mapping, axis=1).sum()
result.insert(0, "sn", df.sn)
result
sn total_C1 total_H2 total_K3
0 1 12 5 6
1 2 4 2 3
2 3 3 2 3
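Note that the answers above all group along axis=1, which is deprecated in newer pandas releases. A transpose-based sketch of the first answer that avoids it (assuming the same frame):
cols = df.columns[1:]
col_groups = cols.str.split('-').str[0]
sums = df[cols].T.groupby(col_groups).sum().T.add_prefix('total_')
out_df = df[['sn']].join(sums)
print(out_df)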

pandas: idxmax for k-th largest

Having a df of probability distributions, I get the max-probability column for each row with df.idxmax(axis=1) like this:
df['1k-th'] = df.idxmax(axis=1)
and get the following result:
0 1 2 3 4 5 6 1k-th
0 0.114869 0.020708 0.025587 0.028741 0.031257 0.031619 0.747219 6
1 0.020206 0.012710 0.010341 0.012196 0.812495 0.113863 0.018190 4
2 0.023585 0.735475 0.091795 0.021683 0.027581 0.054217 0.045664 1
3 0.009834 0.009175 0.013165 0.016014 0.015507 0.899115 0.037190 5
4 0.023357 0.736059 0.088721 0.021626 0.027341 0.056289 0.046607 1
The question is how to get the 2nd, 3rd, etc. most probable columns, so that I get the following result:
0 1 2 3 4 5 6 1k-th 2-th
0 0.114869 0.020708 0.025587 0.028741 0.031257 0.031619 0.747219 6 0
1 0.020206 0.012710 0.010341 0.012196 0.812495 0.113863 0.018190 4 3
2 0.023585 0.735475 0.091795 0.021683 0.027581 0.054217 0.045664 1 4
3 0.009834 0.009175 0.013165 0.016014 0.015507 0.899115 0.037190 5 4
4 0.023357 0.736059 0.088721 0.021626 0.027341 0.056289 0.046607 1 2
Thank you!
My own solution is not the prettiest, but it does its job and works fast:
import numpy as np

for i in range(7):
    p[f'{i}k'] = p[[0, 1, 2, 3, 4, 5, 6]].idxmax(axis=1)
    p[f'{i}k_v'] = p[[0, 1, 2, 3, 4, 5, 6]].max(axis=1)
    for x in range(7):
        p[x] = np.where(p[x] == p[f'{i}k_v'], np.nan, p[x])
The loop does the following:
finds the largest value and its column index
drops the found value (sets it to NaN)
then again:
finds the 2nd largest value
drops the found value
etc ...
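A vectorised alternative sketch with argsort (assuming p holds only the seven probability columns 0-6): sort each row's columns by descending probability once and pick the labels directly.
import numpy as np

order = np.argsort(-p[[0, 1, 2, 3, 4, 5, 6]].to_numpy(), axis=1)  # column positions, largest first
labels = np.array([0, 1, 2, 3, 4, 5, 6])
p['1k-th'] = labels[order[:, 0]]   # most probable column
p['2-th'] = labels[order[:, 1]]    # second most probable
p['3-th'] = labels[order[:, 2]]    # third most probable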

Pandas : Get a column value where another column is the minimum in a sub-grouping [duplicate]

I'm using groupby on a pandas dataframe to drop all rows that don't have the minimum of a specific column. Something like this:
df1 = df.groupby("item", as_index=False)["diff"].min()
However, if I have more than those two columns, the other columns (e.g. otherstuff in my example) get dropped. Can I keep those columns using groupby, or am I going to have to find a different way to drop the rows?
My data looks like:
item diff otherstuff
0 1 2 1
1 1 1 2
2 1 3 7
3 2 -1 0
4 2 1 3
5 2 4 9
6 2 -6 2
7 3 0 0
8 3 2 9
and should end up like:
item diff otherstuff
0 1 1 2
1 2 -6 2
2 3 0 0
but what I'm getting is:
item diff
0 1 1
1 2 -6
2 3 0
I've been looking through the documentation and can't find anything. I tried:
df1 = df.groupby(["item", "otherstuff"], as_index=false)["diff"].min()
df1 = df.groupby("item", as_index=false)["diff"].min()["otherstuff"]
df1 = df.groupby("item", as_index=false)["otherstuff", "diff"].min()
But none of those work (I realized with the last one that the syntax is meant for aggregating after a group is created).
Method #1: use idxmin() to get the indices of the elements of minimum diff, and then select those:
>>> df.loc[df.groupby("item")["diff"].idxmin()]
item diff otherstuff
1 1 1 2
6 2 -6 2
7 3 0 0
[3 rows x 3 columns]
Method #2: sort by diff, and then take the first element in each item group:
>>> df.sort_values("diff").groupby("item", as_index=False).first()
item diff otherstuff
0 1 1 2
1 2 -6 2
2 3 0 0
[3 rows x 3 columns]
Note that the resulting indices are different even though the row content is the same.
You can use DataFrame.sort_values with DataFrame.drop_duplicates:
df = df.sort_values(by='diff').drop_duplicates(subset='item')
print (df)
item diff otherstuff
6 2 -6 2
7 3 0 0
1 1 1 2
If multiple minimal values per group are possible and you want all of the min rows, use boolean indexing with transform to get the minimum per group:
print (df)
item diff otherstuff
0 1 2 1
1 1 1 2 <-multiple min
2 1 1 7 <-multiple min
3 2 -1 0
4 2 1 3
5 2 4 9
6 2 -6 2
7 3 0 0
8 3 2 9
print (df.groupby("item")["diff"].transform('min'))
0 1
1 1
2 1
3 -6
4 -6
5 -6
6 -6
7 0
8 0
Name: diff, dtype: int64
df = df[df.groupby("item")["diff"].transform('min') == df['diff']]
print (df)
item diff otherstuff
1 1 1 2
2 1 1 7
6 2 -6 2
7 3 0 0
The above answer works great if there is (or you want) a single min. In my case there could be multiple mins and I wanted all rows equal to the min, which .idxmin() doesn't give you. This worked:
def filter_group(dfg, col):
    return dfg[dfg[col] == dfg[col].min()]
df = pd.DataFrame({'g': ['a'] * 6 + ['b'] * 6, 'v1': (list(range(3)) + list(range(3))) * 2, 'v2': range(12)})
df.groupby('g',group_keys=False).apply(lambda x: filter_group(x,'v1'))
As an aside, .filter() is also relevant to this question but didn't work for me.
I tried everyone's method and I couldn't get it to work properly. Instead I did the process step-by-step and ended up with the correct result.
df.sort_values(by='item', inplace=True, ignore_index=True)
df.drop_duplicates(subset='diff', inplace=True, ignore_index=True)
df.sort_values(by=['diff'], inplace=True, ignore_index=True)
For a little more explanation:
Sort items by the minimum value you want
Drop the duplicates of the column you want to sort with
Resort the data because the data is still sorted by the minimum values
If you know that all of your "items" have more than one record, you can sort and then use duplicated:
df.sort_values(by='diff').duplicated(subset='item', keep='first')
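Note that duplicated returns a boolean mask, not a filtered frame; a small usage sketch (assuming the frame from the question):
srt = df.sort_values(by='diff')
mask = srt.duplicated(subset='item', keep='first')  # True for every row except the first (smallest diff) per item
result = srt[~mask]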

Pandas: keep the first three rows containing a value for each unique value [duplicate]

Suppose I have pandas DataFrame like this:
df = pd.DataFrame({'id':[1,1,1,2,2,2,2,3,4], 'value':[1,2,3,1,2,3,4,1,1]})
which looks like:
id value
0 1 1
1 1 2
2 1 3
3 2 1
4 2 2
5 2 3
6 2 4
7 3 1
8 4 1
I want to get a new DataFrame with top 2 records for each id, like this:
id value
0 1 1
1 1 2
3 2 1
4 2 2
7 3 1
8 4 1
I can do it by numbering records within each group after groupby:
dfN = df.groupby('id').apply(lambda x:x['value'].reset_index()).reset_index()
which looks like:
id level_1 index value
0 1 0 0 1
1 1 1 1 2
2 1 2 2 3
3 2 0 3 1
4 2 1 4 2
5 2 2 5 3
6 2 3 6 4
7 3 0 7 1
8 4 0 8 1
then for the desired output:
dfN[dfN['level_1'] <= 1][['id', 'value']]
Output:
id value
0 1 1
1 1 2
3 2 1
4 2 2
7 3 1
8 4 1
But is there a more efficient/elegant approach to do this? And is there a more elegant approach to number records within each group (like the SQL window function row_number())?
Did you try
df.groupby('id').head(2)
Output generated:
id value
id
1 0 1 1
1 1 2
2 3 2 1
4 2 2
3 7 3 1
4 8 4 1
(Keep in mind that you might need to order/sort before, depending on your data)
EDIT: As mentioned by the questioner, use
df.groupby('id').head(2).reset_index(drop=True)
to remove the MultiIndex and flatten the results:
id value
0 1 1
1 1 2
2 2 1
3 2 2
4 3 1
5 4 1
Since 0.14.1, you can now do nlargest and nsmallest on a groupby object:
In [23]: df.groupby('id')['value'].nlargest(2)
Out[23]:
id
1 2 3
1 2
2 6 4
5 3
3 7 1
4 8 1
dtype: int64
There's a slight weirdness that you get the original index in there as well, but this might be really useful depending on what your original index was.
If you're not interested in it, you can do .reset_index(level=1, drop=True) to get rid of it altogether.
(Note: From 0.17.1 you'll be able to do this on a DataFrameGroupBy too but for now it only works with Series and SeriesGroupBy.)
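If you want the full rows rather than just the value column, a small sketch (assuming the df from the question) is to use the second level of that MultiIndex to index back into the original frame:
top2 = df.groupby('id')['value'].nlargest(2)
rows = df.loc[top2.index.get_level_values(1)]   # full rows for the selected values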
Sometimes sorting all the data up front is very time-consuming. We can groupby first and take the top k within each group:
topk = 2  # number of rows to keep per group
g = df.groupby(['id']).apply(lambda x: x.nlargest(topk, ['value'])).reset_index(drop=True)
df.groupby('id').apply(lambda x: x.sort_values(by='value', ascending=False).head(2).reset_index(drop=True))
Here, sort_values with ascending=False behaves like nlargest, and ascending=True behaves like nsmallest.
The value passed to head is the same number we would pass to nlargest: the number of rows to keep for each group.
reset_index is optional and not strictly necessary.
This works for duplicated values
If you have duplicated values among the top-n values and want only unique values, you can do it like this:
import pandas as pd
ifile = "https://raw.githubusercontent.com/bhishanpdl/Shared/master/data/twitter_employee.tsv"
df = pd.read_csv(ifile,delimiter='\t')
print(df.query("department == 'Audit'")[['id','first_name','last_name','department','salary']])
id first_name last_name department salary
24 12 Shandler Bing Audit 110000
25 14 Jason Tom Audit 100000
26 16 Celine Anston Audit 100000
27 15 Michale Jackson Audit 70000
If we do not remove duplicates, for the Audit department we get the top 3 salaries as 110k, 100k and 100k.
If we want non-duplicated salaries per department, we can do this:
(df.groupby('department')['salary']
.apply(lambda ser: ser.drop_duplicates().nlargest(3))
.droplevel(level=1)
.sort_index()
.reset_index()
)
This gives
department salary
0 Audit 110000
1 Audit 100000
2 Audit 70000
3 Management 250000
4 Management 200000
5 Management 150000
6 Sales 220000
7 Sales 200000
8 Sales 150000
To get the first N rows of each group, another way is via groupby().nth[:N]. The outcome of this call is the same as groupby().head(N). For example, for the top-2 rows for each id, call:
N = 2
df1 = df.groupby('id', as_index=False).nth[:N]
To get the largest N values of each group, I suggest two approaches.
First sort by "id" and "value" (make sure to sort "id" in ascending order and "value" in descending order by using the ascending parameter appropriately) and then call groupby().nth[].
N = 2
df1 = df.sort_values(by=['id', 'value'], ascending=[True, False])
df1 = df1.groupby('id', as_index=False).nth[:N]
Another approach is to rank the values of each group and filter using these ranks.
# for the entire rows
N = 2
msk = df.groupby('id')['value'].rank(method='first', ascending=False) <= N
df1 = df[msk]
# for specific column rows
df1 = df.loc[msk, 'value']
Both of these are much faster than the groupby().apply() and groupby().nlargest() calls suggested in the other answers here (1, 2, 3). On a sample with 100k rows and 8,000 groups, a %timeit test showed that they were 24-150 times faster than those solutions.
Instead of slicing, you can also pass a list/tuple/range to a .nth() call:
df.groupby('id', as_index=False).nth([0,1])
# doesn't even have to be consecutive
# the following returns 1st and 3rd row of each id
df.groupby('id', as_index=False).nth([0,2])

Replace values in pandas dataframe with other on condition

I have a dataframe:
id main_value
1 10
2 3
4 1
6 10
I want to change the main_value where id == 4, such that it is decremented by 2.
I know a method using .loc:
freq = 3
if freq == 3:
    df.loc[df.id==4, ['main_value']] = df.main_value.loc[df.id==4] - 2
But this seems very lengthy, is there a better way to do this?
I think you can use:
df.loc[df.id==4, 'main_value'] -= 2
print (df)
id main_value
0 1 10
1 2 3
2 4 -1
3 6 10
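An equivalent sketch with numpy.where, in case you prefer a whole-column expression (assuming the same frame):
import numpy as np

df['main_value'] = np.where(df['id'] == 4, df['main_value'] - 2, df['main_value'])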