Using Pandas groupby methods, find largest values in each group - pandas

Using Pandas groupby, I have data on how much activity certain users have, on average, on any given day of the week. Grouping by user and weekday, I compute the max and mean activity for several users over the last 30 days.
Now I want to find, for every user, which day of the week corresponds to their daily max activity, and what is the average magnitude of that activity.
What is the method in pandas to perform such a task?
The original data looks something like this:
   userID  countActivity  weekday
0       3             25        5
1       3             58        6
2       3            778        0
3       3          78208        1
4       3           6672        2
The object that has these groups is created from the following:
aggregations = {
    'countActivity': {
        'maxDaily': 'max',
        'meanDaily': 'mean'
    }
}
dailyAggs = df.groupby(['userID','weekday']).agg(aggregations)
The groupby object looks something like this:
                countActivity
                     maxDaily  meanDaily
userID weekday
3      0                84066   18275.6
       1                78208   20698.5
       2               172579   64930.75
       3                89535   25443
       4                 6152    2809
The Pandas groupby filter method seems like it might be needed here, but I'm stumped on how to proceed.

I'd first do a groupby on 'userID', and then write an apply function to do the rest. The apply function will take a 'userID' group, perform another groupby on 'weekday' to do your aggregations, and then only return the row that contains the maximum value for maxDaily, which can be found with argmax.
def get_max_daily(grp):
    # aggregate this user's rows per weekday, then keep the weekday with the largest max
    aggregations = {'countActivity': {'maxDaily': 'max', 'meanDaily': 'mean'}}
    grp = grp.groupby('weekday').agg(aggregations).reset_index()
    return grp.loc[grp[('countActivity', 'maxDaily')].argmax()]
result = df.groupby('userID').apply(get_max_daily)
I've added a row to your sample data to make sure the daily aggregations were working correctly, since your sample data only contains one entry per weekday:
   userID  countActivity  weekday
0       3             25        5
1       3             58        6
2       3            778        0
3       3          78208        1
4       3           6672        2
5       3          78210        1
The resulting output:
        weekday  countActivity
                     meanDaily  maxDaily
userID
3             1          78209     78210
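For newer pandas versions, where the nested-dict aggregation used above has been removed, a rough sketch of the same idea with named aggregation and idxmax (assuming the same df as above) could look like this:
# aggregate per user and weekday, then keep each user's weekday with the largest max
daily = (df.groupby(['userID', 'weekday'], as_index=False)
           .agg(maxDaily=('countActivity', 'max'),
                meanDaily=('countActivity', 'mean')))
result = daily.loc[daily.groupby('userID')['maxDaily'].idxmax()]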

Related

How to split pandas dataframe into multiple dataframes (holding together rows) based upon a column's value

My problem is similar to the "split a dataframe into chunks of N rows" problem, except that the number of rows in each chunk will be different. I have a dataframe as such:
A  B  C
1  2  0
1  2  1
1  2  2
1  2  0
1  2  1
1  2  2
1  2  3
1  2  4
1  2  0
A and B are just filler values; don't pay attention to them. Column C, though, starts at 0 and increments with each row until it suddenly resets to 0. So in the dataframe above, the first 3 rows form one new dataframe, the next 5 rows form a second new dataframe, and this continues as my dataframe adds more and more rows.
To finish off the question,
df = [x for _, x in df.groupby(df['C'].eq(0).cumsum())]
allows me to split off all the subgroups, and with this groupby I can select each subgroup as a separate dataframe.
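A small sketch of what that one-liner does, assuming the sample frame above:
import pandas as pd

df = pd.DataFrame({'A': [1]*9, 'B': [2]*9,
                   'C': [0, 1, 2, 0, 1, 2, 3, 4, 0]})

# each time C resets to 0, eq(0) is True and cumsum() starts a new group id,
# so the comprehension yields one sub-dataframe per "run" of C
chunks = [x for _, x in df.groupby(df['C'].eq(0).cumsum())]
len(chunks)   # 3
chunks[0]     # the first three rows (C == 0, 1, 2)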

Pandas: Move Same Index Names to Unique Column Names

I have a pandas df "lty" of shape (366, 1). The first "time" index level is the month (Jan, i.e. 1, in this case) and the second "time" level is the day of the month. It is the result of
lty = ylt.groupby([ylt.index.month, ylt.index.day]).mean()
and looks like this:
              LT Mean
time time
1    1      11.263604
     2      11.971495
     3      11.989080
     4      12.558736
     5      11.850899
I need to reset the index, or rename the columns, so that "lty" has shape (366, 3) and looks like this:
month  day     LT Mean
    1    1   11.263604
         2   11.971495
         3   11.989080
         4   12.558736
         5   11.850899
I have tried to reset the index and I get this error:
lty.reset_index()
ValueError: cannot insert time, already exists
Thank you since I am still learning how to manipulate columns, indexing.
When you groupby, rename the grouping Index attributes so that they don't collide after the aggregation:
lty = (ylt.groupby([ylt.index.month.rename('month'), ylt.index.day.rename('day')])
          .mean()
          .reset_index())

#  month  day    LT Mean
#      1    1  11.263604
#      1    2  11.971495
#      1    3  11.989080
#      1    4  12.558736
#      1    5  11.850899
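If the groupby result already exists with the duplicate 'time' names, renaming the index levels afterwards should also work; a small sketch, assuming lty still carries the two-level 'time'/'time' index:
lty = lty.rename_axis(['month', 'day']).reset_index()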

Pandas search lower and upper neighbour within group

I have the following dataframe df, which specifies latitudes and longitudes for a certain groupnumber:
     latitude  longitude  group
0   51.822231   4.700267      1
1   51.822617   4.801417      1
2   51.823235   4.903300      1
3   51.823433   5.003917      1
4   51.823616   5.504467      1
5   51.822231   3.900267      2
6   51.822617   3.901417      2
7   51.823235   3.903300      2
8   51.823433   6.903917      2
9   51.823616   8.904467      2
10  51.822231   1.900267      3
11  51.822617   2.901417      3
12  51.823235  11.903300      3
13  51.823433  12.903917      3
14  51.823616  13.904467      3
Within each group number I try to find the lower and upper neighbour of the column 'longitude' for a specified value longitude_value = 5.00. All longitudes within each group ('trips') are sorted in df (they ascend within each group).
Per row I want to have the upper and lower neighbour values of longitude=5.000000. The desired output looks like:
     latitude  longitude  trip
2   51.823235   4.903300     1
3   51.823433   5.003917     1
7   51.823235   3.903300     2
8   51.823433   6.903917     2
11  51.822617   2.901417     3
12  51.823235  11.903300     3
From this result I want to rearrange the data a little bit as:
      lat_lo     lat_up    lon_lo     lon_up
0  51.823235  51.823433  4.903300   5.003917
1  51.823235  51.823433  3.903300   6.903917
2  51.822617  51.823235  2.901417  11.903300
Hope I got your question right. See my attempt below. Made it long to be explicit in my approach. I could have easily introduced a longitude value of 5.00 and sliced on index but that would have complicated answering part 2 of your question. If I missed something, let me know.
Data
df=pd.read_clipboard()
df
Input point and calculate difference with longitude
fn=5.00
df['dif']=(df['longitude']-fn)
df
Find the minimum positive difference in each group
df1=df[df['dif'] > 0].groupby('group').min().reset_index().reindex()
Find the minimum negative difference in each group
df2=df[df['dif'] < 0].groupby('group').max().reset_index().reindex()
Append the second group above to the first into one df. This answers your question 1
df3=df1.append(df2, ignore_index=True).sort_values(['group','longitude'])
df3
Question 2
Introduce a column called Status and fill it with an alternating pattern: 3 for the lower neighbour and 4 for the upper neighbour
df3['Status']=0
np.put(df3['Status'], np.arange(len(df3)), ['3','4'])
df3.drop(columns=['dif'], inplace=True)
df3
Rename the neighbours to lon_lo and lon_up
df3['Status']=np.where(df3['Status']==3,'lon_lo', (np.where(df3['Status']==4,'lon_up',df3['Status'] )))
Using pivot, break the dataframe up into lon_lo plus latitude, and do the same for lon_up. The rationale here is to break the latitudes into two groups, lo and up
first group break
df4=df3[df3['Status']=='lon_lo']
result=df4.pivot_table('longitude',['latitude','group'],'Status').reset_index().set_index('group')
second group break
df4=df3[df3['Status']=='lon_up']
result1=df4.pivot_table('longitude',['latitude','group'],'Status').reset_index().set_index('group')
Merge on index the two groups while renaming the latitudes to lo and up
final=result1.merge(result, left_index=True, right_index=True, suffixes=('_lo','_up'))
final
Output: final holds, per group, the lower and upper longitudes (lon_lo, lon_up) together with their latitudes.
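A more compact sketch of the same idea, separate from the answer above, using idxmax/idxmin per group (assuming the original df and a target longitude of 5.00):
value = 5.00

# index of the closest longitude below and above `value` within each group
lower = df[df['longitude'] < value].groupby('group')['longitude'].idxmax()
upper = df[df['longitude'] > value].groupby('group')['longitude'].idxmin()

# question 1: the neighbouring rows themselves
neighbours = df.loc[sorted(lower.tolist() + upper.tolist())]

# question 2: one row per group with the lower/upper latitude and longitude
lo = (df.loc[lower].set_index('group')[['latitude', 'longitude']]
        .rename(columns={'latitude': 'lat_lo', 'longitude': 'lon_lo'}))
up = (df.loc[upper].set_index('group')[['latitude', 'longitude']]
        .rename(columns={'latitude': 'lat_up', 'longitude': 'lon_up'}))
final = lo.join(up).reset_index(drop=True)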

Pandas: keep the first three rows containing a value for each unique value [duplicate]

Suppose I have pandas DataFrame like this:
df = pd.DataFrame({'id':[1,1,1,2,2,2,2,3,4], 'value':[1,2,3,1,2,3,4,1,1]})
which looks like:
   id  value
0   1      1
1   1      2
2   1      3
3   2      1
4   2      2
5   2      3
6   2      4
7   3      1
8   4      1
I want to get a new DataFrame with top 2 records for each id, like this:
   id  value
0   1      1
1   1      2
3   2      1
4   2      2
7   3      1
8   4      1
I can do it by numbering records within each group after groupby:
dfN = df.groupby('id').apply(lambda x:x['value'].reset_index()).reset_index()
which looks like:
   id  level_1  index  value
0   1        0      0      1
1   1        1      1      2
2   1        2      2      3
3   2        0      3      1
4   2        1      4      2
5   2        2      5      3
6   2        3      6      4
7   3        0      7      1
8   4        0      8      1
then for the desired output:
dfN[dfN['level_1'] <= 1][['id', 'value']]
Output:
   id  value
0   1      1
1   1      2
3   2      1
4   2      2
7   3      1
8   4      1
But is there a more effective/elegant approach to do this? And is there a more elegant way to number records within each group (like the SQL window function row_number())?
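As an aside on the row_number() part, the pandas analogue is groupby().cumcount(), which numbers the rows within each group; a small sketch assuming the df above:
# 0-based position of each row within its id group, i.e. row_number() - 1
df['rn'] = df.groupby('id').cumcount()
df[df['rn'] <= 1][['id', 'value']]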
Did you try
df.groupby('id').head(2)
Output generated:
      id  value
id
1  0   1      1
   1   1      2
2  3   2      1
   4   2      2
3  7   3      1
4  8   4      1
(Keep in mind that you might need to order/sort before, depending on your data)
EDIT: As mentioned by the questioner, use
df.groupby('id').head(2).reset_index(drop=True)
to remove the MultiIndex and flatten the results:
   id  value
0   1      1
1   1      2
2   2      1
3   2      2
4   3      1
5   4      1
Since 0.14.1, you can now do nlargest and nsmallest on a groupby object:
In [23]: df.groupby('id')['value'].nlargest(2)
Out[23]:
id
1  2    3
   1    2
2  6    4
   5    3
3  7    1
4  8    1
dtype: int64
There's a slight weirdness that you get the original index in there as well, but this might be really useful depending on what your original index was.
If you're not interested in it, you can do .reset_index(level=1, drop=True) to get rid of it altogether.
(Note: From 0.17.1 you'll be able to do this on a DataFrameGroupBy too but for now it only works with Series and SeriesGroupBy.)
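For illustration, dropping that extra index level could look like this (a sketch, not part of the original answer):
df.groupby('id')['value'].nlargest(2).reset_index(level=1, drop=True)
# id
# 1    3
# 1    2
# 2    4
# 2    3
# 3    1
# 4    1
# Name: value, dtype: int64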
Sometimes sorting the whole data ahead of time is very time consuming.
We can group by first and do the top-k selection within each group instead:
topk = 2
g = df.groupby(['id']).apply(lambda x: x.nlargest(topk, ['value'])).reset_index(drop=True)
df.groupby('id').apply(lambda x: x.sort_values(by='value', ascending=False).head(2).reset_index(drop=True))
Here sort_values with ascending=False behaves like nlargest, and ascending=True behaves like nsmallest.
The value passed to head() plays the same role as the n given to nlargest: the number of rows to keep per group.
The reset_index call is optional and not strictly necessary.
This works for duplicated values
If the top-n values contain duplicates and you only want unique values, you can do it like this:
import pandas as pd
ifile = "https://raw.githubusercontent.com/bhishanpdl/Shared/master/data/twitter_employee.tsv"
df = pd.read_csv(ifile,delimiter='\t')
print(df.query("department == 'Audit'")[['id','first_name','last_name','department','salary']])
    id first_name last_name department  salary
24  12   Shandler      Bing      Audit  110000
25  14      Jason       Tom      Audit  100000
26  16     Celine    Anston      Audit  100000
27  15    Michale   Jackson      Audit   70000
If we do not remove duplicates, for the Audit department we get the top 3 salaries as 110k, 100k and 100k.
If we want non-duplicated salaries per department, we can do this:
(df.groupby('department')['salary']
.apply(lambda ser: ser.drop_duplicates().nlargest(3))
.droplevel(level=1)
.sort_index()
.reset_index()
)
This gives
   department  salary
0       Audit  110000
1       Audit  100000
2       Audit   70000
3  Management  250000
4  Management  200000
5  Management  150000
6       Sales  220000
7       Sales  200000
8       Sales  150000
To get the first N rows of each group, another way is via groupby().nth[:N]. The outcome of this call is the same as groupby().head(N). For example, for the top-2 rows for each id, call:
N = 2
df1 = df.groupby('id', as_index=False).nth[:N]
To get the largest N values of each group, I suggest two approaches.
First sort by "id" and "value" (make sure to sort "id" in ascending order and "value" in descending order by using the ascending parameter appropriately) and then call groupby().nth[].
N = 2
df1 = df.sort_values(by=['id', 'value'], ascending=[True, False])
df1 = df1.groupby('id', as_index=False).nth[:N]
Another approach is to rank the values of each group and filter using these ranks.
# for the entire rows
N = 2
msk = df.groupby('id')['value'].rank(method='first', ascending=False) <= N
df1 = df[msk]
# for specific column rows
df1 = df.loc[msk, 'value']
Both of these are much faster than the groupby().apply() and groupby().nlargest() calls suggested in the other answers here (1, 2, 3). On a sample with 100k rows and 8000 groups, a %timeit test showed them to be 24-150 times faster than those solutions.
Also, instead of slicing, you can pass a list/tuple/range to a .nth() call:
df.groupby('id', as_index=False).nth([0,1])
# doesn't even have to be consecutive
# the following returns 1st and 3rd row of each id
df.groupby('id', as_index=False).nth([0,2])

get second largest value in row in selected columns in dataframe in pandas

I have a dataframe with a subset of it shown below. There are more columns to the right and left of the ones I am showing you.
M_cols  10D_MA     30D_MA     50D_MA     100D_MA    200D_MA  Max    Min    2nd smallest
        68.58      70.89      69.37      **68.24**  64.41    70.89  64.41  68.24
        **68.32**  71.00      69.47      68.50      64.49    71.00  64.49  68.32
        68.57      **68.40**  69.57      71.07      64.57    71.07  64.57  68.40
(the ** marks the 2nd-smallest value in each row)
I can get the min (and the max is easy as well) with the following code
df2['MIN'] = df2[['10D_MA','30D_MA','50D_MA','100D_MA','200D_MA']].min(axis=1)
But how do I get the 2nd smallest? I tried this and got the following error:
df2['2nd SMALLEST'] = df2[['10D_MA','30D_MA','50D_MA','100D_MA','200D_MA']].nsmallest(2)
TypeError: nsmallest() missing 1 required positional argument: 'columns'
Seems like this should have a simple answer, but I am stuck.
For example, say you have the following df:
df=pd.DataFrame({'V1':[1,2,3],'V2':[3,2,1],'V3':[3,4,9]})
After picking out the values we need to compare, we just need to sort the values within each row (np.sort sorts along the last axis by default):
sortdf = pd.DataFrame(np.sort(df[['V1','V2','V3']].values))
sortdf
Out[419]:
   0  1  2
0  1  3  3
1  2  2  4
2  1  3  9
1st max:
sortdf.iloc[:,-1]
Out[421]:
0    3
1    4
2    9
Name: 2, dtype: int64
2nd max:
sortdf.iloc[:,-2]
Out[422]:
0    3
1    2
2    3
Name: 1, dtype: int64
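To get the second smallest directly from the original columns while keeping their names, a row-wise nsmallest is another option (a sketch assuming the df2 and the five MA columns from the question; slower than the np.sort approach but self-describing):
cols = ['10D_MA', '30D_MA', '50D_MA', '100D_MA', '200D_MA']
# nsmallest(2) per row, then take the larger of those two values
df2['2nd SMALLEST'] = df2[cols].apply(lambda row: row.nsmallest(2).iloc[-1], axis=1)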