Here is my dataframe:
my_df = pd.DataFrame({'group':['a','a', 'a','b','b'], 'date':['2017-01-02', '2017-01-02','2017-03-01', '2018-02-05', '2018-04-06']})
my_df['date']= pd.to_datetime(my_df['date'], format = '%Y-%m-%d')
I would like to add a rank per group, where equal values are assigned the same rank.
Here is what I would like as output:
date group rank
0 2017-01-02 a 1
1 2017-01-02 a 1
2 2017-03-01 a 2
3 2018-02-05 b 1
4 2018-04-06 b 2
I guess I could do it by grouping, ranking, and joining back to the original dataframe, but I wonder if there is a faster way to do it.
Just use rank with method='dense':
my_df.groupby(['group'])['date'].rank(method='dense')
Out[6]:
0 1.0
1 1.0
2 2.0
3 1.0
4 2.0
Name: date, dtype: float64
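To get the integer rank column shown in the desired output, you could assign the result back and cast it (a small sketch; the column name rank is taken from the question):
my_df['rank'] = my_df.groupby('group')['date'].rank(method='dense').astype(int)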
You could use transform with factorize:
my_df['group_rank'] = my_df.groupby(['group'])['date'].transform(lambda x: x.factorize()[0])
>>> my_df
date group group_rank
0 2017-01-02 a 0
1 2017-01-02 a 0
2 2017-03-01 a 1
3 2018-02-05 b 0
4 2018-04-06 b 1
If you add + 1 to the end of that, it gives ranks of 1 and 2 as in your desired output, but I thought this might not be important (since the rows are properly binned together in any case). Note that factorize numbers values by order of first appearance rather than by sorted order, so it matches rank only when the dates within each group already appear in ascending order, as they do here.
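For example, the + 1 variant would look like this:
my_df['group_rank'] = my_df.groupby(['group'])['date'].transform(lambda x: x.factorize()[0] + 1)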
My data frame looks like this:
import pandas as pd
import numpy as np
df = pd.DataFrame({'ID': [0, 1, 2, 3, 4],
                   'date1': ['2020-12-01', '2020-12-01', np.nan, '2018-12-01', np.nan],
                   'date2': ['2015-04-01', '2015-04-01', '2018-12-01', '2018-12-01', np.nan],
                   'date3': [np.nan, '2013-12-01', '2018-12-01', '2018-12-01', np.nan]
                   })
I'm trying to apply a function like nunique() over each of the date columns for each ID to obtain the number of distinct dates. I have tried using the agg() function with groupby.
The resulting data frame would have, for each ID, the count of distinct dates across the date columns.
Use nunique on axis=1 after filtering out the ID column:
out = df[['ID']].copy()
out['unique_sum'] = df.drop(columns='ID').nunique(axis=1)
Or with filter:
out = df[['ID']].copy()
out['unique_sum'] = df.filter(like='date').nunique(axis=1)
Or, as chained commands:
out = (
    df[['ID']]
    .assign(unique_sum=df.drop(columns='ID').nunique(axis=1))
)
Output:
ID unique_sum
0 0 2
1 1 3
2 2 1
3 3 1
4 4 0
Apply DataFrame.nunique to the selected date columns:
df['uniq_sum'] = df.filter(like='date').nunique(axis=1)
ID date1 date2 date3 uniq_sum
0 0 2020-12-01 2015-04-01 NaN 2
1 1 2020-12-01 2015-04-01 2013-12-01 3
2 2 NaN 2018-12-01 2018-12-01 1
3 3 2018-12-01 2018-12-01 2018-12-01 1
4 4 NaN NaN NaN 0
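If ID could repeat across rows (it does not in this sample), one way to count distinct dates per ID is to melt the date columns into long form and then use groupby with nunique; a sketch under that assumption:
out = (
    df.melt(id_vars='ID', value_name='date')   # long form: one date per row
      .groupby('ID')['date'].nunique()         # nunique ignores NaN
      .rename('unique_sum')
      .reset_index()
)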
I have a dataset that looks something like the following:
df = pd.DataFrame({"Date":['2021-01-01','2021-01-02','2021-01-03','2021-01-04','2021-01-05','2021-01-06','2021-01-07'],'Value':[0,0,14,0,0,0,9]})
df['Date']=pd.to_datetime(df['Date'])
df
Date Value
2021-01-01 0
2021-01-02 0
2021-01-03 14
2021-01-04 0
2021-01-05 0
2021-01-06 0
2021-01-07 9
I know the zero values are due to a lack of reporting, so a row with a value represents that day plus the sum of the values from the preceding missing days. I want to randomly distribute the data backwards over those missing days, based on the existing values; example below:
df2 = pd.DataFrame({"Date":['2021-01-01','2021-01-02','2021-01-03','2021-01-04','2021-01-05','2021-01-06','2021-01-07'],'Value':[2,8,4,3,1,4,1]})
df2['Date']=pd.to_datetime(df2['Date'])
df2
Date Value
2021-01-01 2
2021-01-02 8
2021-01-03 4
2021-01-04 3
2021-01-05 1
2021-01-06 4
2021-01-07 1
(The local 'totals' on 2021-01-03 and 2021-01-07 remain the same)
I know part of the problem is that the intervals of missing/present data aren't consistent...
Any ideas on how to get this done? All advice appreciated.
You can create a group running up to each non-zero Value: compare Value to 0 with ne (not equal), shift so that the non-zero row stays in the same group as the zeros before it, and take the cumsum (the resulting grouping key is illustrated after the first output below).
Then, to split each group into parts of varying lengths that sum to the group's non-zero value, you can refer to this question for example.
So, using numpy.random.multinomial as in this answer, you get:
import numpy as np
np.random.seed(1)
df['new_value'] = (
    df.groupby(df['Value'].ne(0).shift(1, fill_value=False).cumsum())
      ['Value'].apply(lambda x: np.random.multinomial(x.max(), [1./len(x)]*len(x)))
      .explode()     # create a Series
      .to_numpy()    # because index alignment is not possible
)
print(df)
Date Value new_value
0 2021-01-01 0 4
1 2021-01-02 0 6
2 2021-01-03 14 4
3 2021-01-04 0 0
4 2021-01-05 0 2
5 2021-01-06 0 2
6 2021-01-07 9 5
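For reference, the intermediate grouping key built with ne/shift/cumsum looks like this (the name key is just for illustration):
key = df['Value'].ne(0).shift(1, fill_value=False).cumsum()
print(key.tolist())  # [0, 0, 0, 1, 1, 1, 1] -> one group per reported value, ending on the reporting day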
Or you can use this answer, which seems a bit more popular:
import random
def constrained_sum_sample_pos(n, total):
    """Return a randomly chosen list of n positive integers summing to total.
    Each such list is equally likely to occur."""
    dividers = sorted(random.sample(range(1, total), n - 1))
    return [a - b for a, b in zip(dividers + [total], [0] + dividers)]
df['new_value_r'] = (
    df.groupby(df['Value'].ne(0).shift(1, fill_value=False).cumsum())
      ['Value'].apply(lambda x: constrained_sum_sample_pos(len(x), x.max()))
      .explode()
      .to_numpy()
)
print(df)
Date Value new_value new_value_r
0 2021-01-01 0 4 5
1 2021-01-02 0 6 2
2 2021-01-03 14 4 7
3 2021-01-04 0 0 1
4 2021-01-05 0 2 2
5 2021-01-06 0 2 5
6 2021-01-07 9 5 1
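As a quick sanity check (illustrative), the per-group sums of the redistributed columns should still match the originally reported totals of 14 and 9:
key = df['Value'].ne(0).shift(1, fill_value=False).cumsum()
print(df[['Value', 'new_value', 'new_value_r']].astype(int).groupby(key).sum())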
I have a dataframe:
# create example df
df = pd.DataFrame(index=[1,2,3,4,5,6,7,8])
df['ID'] = [1,1,1,1,2,2,2,2]
df['election_date'] = pd.date_range("01/01/2010", periods=8, freq="M")
df['status'] = ['b','a','b','c','a','d','d','b']
# sort values
df.sort_values(['election_date'], inplace=True, ascending=False)
df.reset_index(drop=True, inplace=True)
df
ID election_date status
0 2 2010-08-31 b
1 2 2010-07-31 d
2 2 2010-06-30 d
3 2 2010-05-31 a
4 1 2010-04-30 c
5 1 2010-03-31 b
6 1 2010-02-28 a
7 1 2010-01-31 b
I would like to get the cumulative most frequent status for column status for each ID. This is what I would expect:
ID election_date status cum_most_freq_status
0 2 2010-08-31 b d
1 2 2010-07-31 d d
2 2 2010-06-30 d a
3 2 2010-05-31 a NaN
4 1 2010-04-30 c b
5 1 2010-03-31 b a
6 1 2010-02-28 a b
7 1 2010-01-31 b NaN
Interpretation:
For 2010-01-31 the value is NaN because there was no status value in the past. The same applies to 2010-05-31.
For 2010-03-31 the most frequent statuses in the past were a and b (a tie), so we take the most recently seen one, which is a.
How would you do it?
You can first make a DataFrame with ID and election_date as its index, and one-hot-encoded status values, then calculate cumsum.
We want to pick the most recent status if there is a tie in counts, so I'm adding a small number (less than 1) to cumsum for the current status, so when we apply idxmax it will pick up the most recent status in case there's a tie.
After finding the most frequent cumulative status with idxmax we can merge with the original DataFrame:
# make one-hot-encoded status dataframe
z = (df
     .groupby(['ID', 'election_date', 'status'])
     .size().unstack().fillna(0))

# break ties to choose most recent
z = z.groupby(level=0).cumsum() + (z * 1e-4)

# shift by 1 row, since we only count previous status occurrences
z = z.groupby(level=0).shift()

# merge
df.merge(z.idxmax(axis=1).to_frame('cum_most_freq_status').reset_index())
Output:
ID election_date status cum_most_freq_status
0 2 2010-08-31 b d
1 2 2010-07-31 d d
2 2 2010-06-30 d a
3 2 2010-05-31 a NaN
4 1 2010-04-30 c b
5 1 2010-03-31 b a
6 1 2010-02-28 a b
7 1 2010-01-31 b NaN
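If you want to double-check the result against the interpretation spelled out in the question, a brute-force sketch could look like this (the helper most_freq_before is just made up for illustration):
import numpy as np
import pandas as pd

def most_freq_before(g):
    # for each row, look only at strictly earlier elections of the same ID,
    # count the statuses, and break ties by the most recently seen status
    g = g.sort_values('election_date')
    out = []
    for i in range(len(g)):
        past = g.iloc[:i]
        if past.empty:
            out.append(np.nan)
            continue
        counts = past['status'].value_counts()
        tied = counts[counts == counts.max()].index
        out.append(past[past['status'].isin(tied)].iloc[-1]['status'])
    return pd.Series(out, index=g.index)

df['cross_check'] = df.groupby('ID', group_keys=False).apply(most_freq_before)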
I have a DataFrame, df, like:
id date
a 2019-07-11
a 2019-07-16
b 2018-04-01
c 2019-08-10
c 2019-07-11
c 2018-05-15
I want to add a count column that shows, for each row, how many rows with the same id have a date earlier than that row's date. Meaning:
id date count
a 2019-07-11 0
a 2019-07-16 1
b 2018-04-01 0
c 2019-08-10 2
c 2019-07-11 1
c 2018-05-15 0
If you believe it is easier in SQL and know how to do it, that works for me too.
Do this:
In [1688]: df.sort_values('date').groupby('id').cumcount()
Out[1688]:
2 0
5 0
0 0
4 1
1 1
3 2
dtype: int64
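To attach it as the count column, you can rely on index alignment when assigning back (a small sketch; if date is stored as strings, convert with pd.to_datetime first):
df['count'] = df.sort_values('date').groupby('id').cumcount()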
I have a dataframe df:
store date invoice_count
A 2018-04-03 2
A 2018-04-06 5
A 2018-06-15 5
B 2018-05-05 2
B 2018-04-09 5
C 2018-02-16 6
which contains the invoice counts (number of invoices generated) of stores for given dates.
I am trying to group them so that I get a month-wise invoice_count for every store.
Expected final dataframe in this format:
store jan_18 feb_18 mar_18 apr_18 may_18 june_18
A 0 0 0 7 0 5
B 0 0 0 5 2 0
C 0 6 0 0 0 0
Is there any way to group the dates month-wise?
Note: this is a dummy dataframe; the final monthly column names can be in any other appropriate format.
Use groupby with DataFrameGroupBy.resample and aggregate with sum, then reshape with unstack. If necessary, add the missing months filled with 0 via reindex, and finally change the format of the datetime columns with DatetimeIndex.strftime:
df = (df.set_index('date')
        .groupby('store')
        .resample('M')['invoice_count']
        .sum()
        .unstack(fill_value=0))
df = df.reindex(columns=pd.date_range('2018-01-01', df.columns.max(), freq='M'), fill_value=0)
df.columns = df.columns.strftime('%b_%y')
print(df)
Jan_18 Feb_18 Mar_18 Apr_18 May_18 Jun_18
store
A 0 0 0 7 0 5
B 0 0 0 5 2 0
C 0 6 0 0 0 0
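As an alternative sketch (starting again from the original store/date/invoice_count frame, and assuming date is already a datetime column), pd.Grouper groups the dates into months in a similar way; you may still need the reindex step above to add months with no invoices at all:
out = (df.groupby(['store', pd.Grouper(key='date', freq='M')])['invoice_count']
         .sum()
         .unstack(fill_value=0))
out.columns = out.columns.strftime('%b_%y')
print(out)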