pandas: get a single rank from a groupby on multiple columns

Is it possible to do something like this
df = pd.DataFrame({
"sort_by": ["a","a","a","a","b","b","b", "a"],
"x": [100.5,200,200,500,1,2,3, 200],
"y": [4000,2000,2000,1000,500.5,600.5,600.5, 100.5]
})
df = df.sort_values(by=["x","y"], ascending=False)
where I can group by the sort_by column and use x and y to determine the rank within each group (using y to break ties)?
The ideal output would be:
sort_by x y rank
a 500 1000 1
a 200 2000 2
a 200 2000 2
a 200 100.5 3
a 100.5 4000 4
b 3 600.5 1
b 2 600.5 2
b 1 500.5 3

You can get the rank with factorize after sort_values:
df = df.sort_values(by=["x", "y"], ascending=False)
df['rank'] = tuple(zip(df.x, df.y))  # pair (x, y) so ties on both columns factorize together
df['rank'] = df.groupby('sort_by', sort=False)['rank'].apply(
    lambda x: pd.Series(pd.factorize(x)[0] + 1)).values  # factorize is 0-based, so add 1
df
Out[615]:
sort_by x y rank
3 a 500.0 1000.0 1
1 a 200.0 2000.0 2
2 a 200.0 2000.0 2
7 a 200.0 100.5 3
0 a 100.5 4000.0 4
6 b 3.0 600.5 1
5 b 2.0 600.5 2
4 b 1.0 500.5 3

Related

Pandas finding and replacing outliers based on a group of two columns

I'm having a bit of trouble finding outliers in a df based on groups and dates.
For example, I have a df like the one below, and I would like to find and replace the outlier values (10 for group A on date 2022-06-27 and 20 for group B on 2022-06-27) with the median of the respective group (3 for the first outlier and 4 for the second).
However, I'm having some trouble filtering the data, isolating the outliers, and replacing them.
index = [0,1,2,3,4,5,6,7,8,9,10,11]
s = pd.Series(['A','A','A','A','A','A','B','B','B','B','B','B'], index=index)
t = pd.Series(['2022-06-28','2022-06-28','2022-06-28','2022-06-27','2022-06-27','2022-06-27',
               '2022-06-28','2022-06-28','2022-06-28','2022-06-27','2022-06-27','2022-06-27'], index=index)
r = pd.Series([1,2,1,2,3,10,2,3,2,3,4,20], index=index)
df = pd.DataFrame(s, columns=['group'])
df['date'] = t
df['value'] = r
print (df)
group date value
0 A 2022-06-28 1
1 A 2022-06-28 2
2 A 2022-06-28 1
3 A 2022-06-27 2
4 A 2022-06-27 3
5 A 2022-06-27 10
6 B 2022-06-28 2
7 B 2022-06-28 3
8 B 2022-06-28 2
9 B 2022-06-27 3
10 B 2022-06-27 4
11 B 2022-06-27 20
Thanks for the help!
First, you can identify the outliers. This code flags any value that is more than one standard deviation from the mean:
outliers = df.loc[(df.value - df.value.mean()).abs() > df.value.std() * 1].index
Then you can determine the median of each group:
medians = df.groupby('group')['value'].median()
Finally, locate the outliers and replace with the medians:
df.loc[outliers, 'value'] = medians.loc[df.loc[outliers, 'group']].to_list()
All together it looks like:
import pandas as pd
index = [0,1,2,3,4,5,6,7,8,9,10,11]
s = pd.Series(['A','A','A','A','A','A','B','B','B','B','B','B'],index= index)
t = pd.Series(['2022-06-28','2022-06-28','2022-06-28','2022-06-27','2022-06-27','2022-06-27',
'2022-06-28','2022-06-28','2022-06-28','2022-06-27','2022-06-27','2022-06-27'],index= index)
r = pd.Series([1,2,1,2,3,10,2,3,2,3,4,20],index= index)
df = pd.DataFrame(s,columns = ['group'])
df['date'] = t
df['value'] = r
outliers = df.loc[(df.value - df.value.mean()).abs() > df.value.std() * 1].index
medians = df.groupby('group')['value'].median()
df.loc[outliers, 'value'] = medians.loc[df.loc[outliers, 'group']].values
print(df)
Output:
group date value
0 A 2022-06-28 1
1 A 2022-06-28 2
2 A 2022-06-28 1
3 A 2022-06-27 2
4 A 2022-06-27 3
5 A 2022-06-27 2
6 B 2022-06-28 2
7 B 2022-06-28 3
8 B 2022-06-28 2
9 B 2022-06-27 3
10 B 2022-06-27 4
11 B 2022-06-27 3
You can use a combination of .groupby/transform to obtain the medians for each grouping, and then mask your original data against the outliers, filling with those medians.
medians = df.groupby('group')['value'].transform('median')
df['new_value'] = df['value'].mask(lambda s: (s - s.mean()).abs() > s.std(), medians)
print(df)
group date value new_value
0 A 2022-06-28 1 1.0
1 A 2022-06-28 2 2.0
2 A 2022-06-28 1 1.0
3 A 2022-06-27 2 2.0
4 A 2022-06-27 3 3.0
5 A 2022-06-27 10 2.0
6 B 2022-06-28 2 2.0
7 B 2022-06-28 3 3.0
8 B 2022-06-28 2 2.0
9 B 2022-06-27 3 3.0
10 B 2022-06-27 4 4.0
11 B 2022-06-27 20 3.0

Clean the data based on condition pandas

I have a data frame as shown below
ID Unit_ID Price Duration
1 A 200 2
2 B 1000 3
2 C 1000 3
2 D 1000 3
2 F 1000 3
2 G 200 1
3 A 500 2
3 B 200 2
In the above data frame, if ID, Price and Duration are the same across rows, replace the Price by the average (the Price divided by the count of such combinations).
For example, rows 2 to 5 above share the same ID, Price and Duration, so the count is 4 and the new Price = 1000 / 4 = 250.
Expected Output:
ID Unit_ID Price Duration
1 A 200 2
2 B 250 3
2 C 250 3
2 D 250 3
2 F 250 3
2 G 200 1
3 A 500 2
3 B 200 2
Use GroupBy.transform with 'size' to get a Series the same length as the original, filled with the group counts, then divide with Series.div:
df['Price'] = df['Price'].div(df.groupby(['ID','Price','Duration'])['Price'].transform('size'))
print (df)
ID Unit_ID Price Duration
0 1 A 200.0 2
1 2 B 250.0 3
2 2 C 250.0 3
3 2 D 250.0 3
4 2 F 250.0 3
5 2 G 200.0 1
6 3 A 500.0 2
7 3 B 200.0 2
Detail:
print (df.groupby(['ID','Price','Duration'])['Price'].transform('size'))
0 1
1 4
2 4
3 4
4 4
5 1
6 1
7 1
Name: Price, dtype: int64

Complete an incomplete dataframe in pandas

Good morning.
I have a dataframe that can look either like this:
df1 =
zone date p1 p2
0 A 1 154 2
1 B 1 2647 7
2 C 1 0 0
3 A 2 1280 3
4 B 2 6809 20
5 C 2 288 5
6 A 3 2000 4
or like this:
df2 =
zone date p1 p2
0 A 1 154 2
1 B 1 2647 7
2 C 1 0 0
3 A 2 1280 3
4 B 2 6809 20
5 C 2 288 5
The only difference between the two is that sometimes one, or several but not all, zones have data for the highest time period (the date column). My desired result is to complete the dataframe up to a certain period (3 in the example), in the following way in each case:
df1_result =
zone date p1 p2
0 A 1 154 2
1 B 1 2647 7
2 C 1 0 0
3 A 2 1280 3
4 B 2 6809 20
5 C 2 288 5
6 A 3 2000 4
7 B 3 6809 20
8 C 3 288 5
df2_result =
zone date p1 p2
0 A 1 154 2
1 B 1 2647 7
2 C 1 0 0
3 A 2 1280 3
4 B 2 6809 20
5 C 2 288 5
6 A 3 1280 3
7 B 3 6809 20
8 C 3 288 5
I've tried different combinations of pivot and fillna with different methods, but I can't achieve the previous result.
I hope my explanation was understood.
Many thanks in advance.
You can use reindex to create entries for all dates in the range, and then forward-fill the last known value into them.
import pandas as pd
df1 = pd.DataFrame([['A', 1, 154, 2],
                    ['B', 1, 2647, 7],
                    ['C', 1, 0, 0],
                    ['A', 2, 1280, 3],
                    ['B', 2, 6809, 20],
                    ['C', 2, 288, 5],
                    ['A', 3, 2000, 4]],
                   columns=['zone', 'date', 'p1', 'p2'])
result = df1.groupby("zone").apply(
    lambda x: x.set_index("date").reindex(range(1, 4), method='ffill'))
print(result)
To get
zone p1 p2
zone date
A 1 A 154 2
2 A 1280 3
3 A 2000 4
B 1 B 2647 7
2 B 6809 20
3 B 6809 20
C 1 C 0 0
2 C 288 5
3 C 288 5
IIUC, you can reconstruct a pd.MultiIndex from your original df and use fillna to fill the missing rows with the max from each zone subgroup.
First, build your index:
import numpy as np

ind = df1.set_index(['zone', 'date']).index
levels = ind.levels
n = len(levels[0])
# works here because there are 3 zones and 3 dates
codes = [np.tile(np.arange(n), n), np.repeat(np.arange(n), n)]
Then use the pd.MultiIndex constructor to reindex (the codes argument was named labels before pandas 0.24):
df1.set_index(['zone', 'date'])\
   .reindex(pd.MultiIndex(levels=levels, codes=codes))\
   .fillna(df1.groupby(['zone']).max())
p1 p2
zone date
A 1 154.0 2.0
B 1 2647.0 7.0
C 1 0.0 0.0
A 2 1280.0 3.0
B 2 6809.0 20.0
C 2 288.0 5.0
A 3 2000.0 4.0
B 3 6809.0 20.0
C 3 288.0 5.0
To fill df2 instead, change df1 to df2 in that last chained expression, and you get:
p1 p2
zone date
A 1 154.0 2.0
B 1 2647.0 7.0
C 1 0.0 0.0
A 2 1280.0 3.0
B 2 6809.0 20.0
C 2 288.0 5.0
A 3 1280.0 3.0
B 3 6809.0 20.0
C 3 288.0 5.0
I suggest not copying and pasting this code directly, but rather trying to understand the process and making slight changes as needed, depending on how different your original data frame is from what you posted.

Merging 3 dataframes on condition

I have a dataframe df:
id value
1 100
2 200
3 500
4 600
5 700
6 800
I have another dataframe df2:
c_id flag
2 Y
3 Y
5 Y
Similarly, df3:
c_id flag
1 N
3 Y
4 Y
I want to merge these 3 dataframes and create a column in df
such that df looks like:
id value flag
1 100 N
2 200 Y
3 500 Y
4 600 Y
5 700 Y
6 800 NaN
I DON'T WANT to concatenate df2 and df3, for example:
final = pd.concat([df2, df3], ignore_index=False)
final.drop_duplicates(inplace=True)
I don't want to use this method; is there any other way?
Using pd.merge between df and the combined df2 + df3 (id 3 appears in both df2 and df3, which is why the output below contains a duplicated row for it):
In [1150]: df.merge(df2.append(df3), left_on=['id'], right_on=['c_id'], how='left')
Out[1150]:
id value c_id flag
0 1 100 1.0 N
1 2 200 2.0 Y
2 3 500 3.0 Y
3 3 500 3.0 Y
4 4 600 4.0 Y
5 5 700 5.0 Y
6 6 800 NaN NaN
Details
In [1151]: df2.append(df3)
Out[1151]:
c_id flag
0 2 Y
1 3 Y
2 5 Y
0 1 N
1 3 Y
2 4 Y
Using map, you could:
In [1140]: df.assign(flag=df.id.map(
df2.set_index('c_id')['flag'].combine_first(
df3.set_index('c_id')['flag']))
)
Out[1140]:
id value flag
0 1 100 N
1 2 200 Y
2 3 500 Y
3 4 600 Y
4 5 700 Y
5 6 800 NaN
Let me explain: using set_index and combine_first, create a mapping from id to flag:
In [1141]: mapping = df2.set_index('c_id')['flag'].combine_first(
df3.set_index('c_id')['flag'])
In [1142]: mapping
Out[1142]:
c_id
1 N
2 Y
3 Y
4 Y
5 Y
Name: flag, dtype: object
In [1143]: df.assign(flag=df.id.map(mapping))
Out[1143]:
id value flag
0 1 100 N
1 2 200 Y
2 3 500 Y
3 4 600 Y
4 5 700 Y
5 6 800 NaN
Merge on both df2 and df3:
df = (df.merge(df2, left_on='id', right_on='c_id', how='left')
        .merge(df3, left_on='id', right_on='c_id', how='left'))
Fill the nulls (the flag from df2 arrives as flag_x, the one from df3 as flag_y):
df['flag'] = df['flag_x'].fillna(df['flag_y'])
Delete the helper columns:
df = df.drop(columns=['flag_x', 'flag_y', 'c_id_x', 'c_id_y'])
Or you could just append:
df4 = df2.append(df3)
pd.merge(df, df4, how='left', left_on='id', right_on='c_id')

Pandas Dataframe merge 2 columns

I have a datatable like this:
Run, test1, test2
1, 100, 102.
2, 110, 100.
3, 108, 105.
I would like to have the 2 columns merged together like this:
Run, results
1, 100
1, 102
2, 110
2, 100
3, 108
3, 105
How do I do it in Pandas? Thanks a lot!
Use stack, then move the resulting MultiIndex back to columns with a double reset_index:
df = df.set_index('Run').stack().reset_index(drop=True, level=1).reset_index(name='results')
print (df)
Run results
0 1 100.0
1 1 102.0
2 2 110.0
3 2 100.0
4 3 108.0
5 3 105.0
Or melt:
df = df.melt('Run', value_name='results').drop('variable', axis=1).sort_values('Run')
print (df)
Run results
0 1 100.0
3 1 102.0
1 2 110.0
4 2 100.0
2 3 108.0
5 3 105.0
Numpy solution with numpy.repeat:
import numpy as np

a = np.repeat(df['Run'].values, 2)
b = df[['test1', 'test2']].values.flatten()
df = pd.DataFrame({'Run': a, 'results': b}, columns=['Run', 'results'])
print (df)
Run results
0 1 100.0
1 1 102.0
2 2 110.0
3 2 100.0
4 3 108.0
5 3 105.0
This is how I would achieve this.
Option 1
wide_to_long
pd.wide_to_long(df, stubnames='test', i='Run', j='LOL').reset_index().drop(columns='LOL')
Out[776]:
Run test
0 1 100.0
1 2 110.0
2 3 108.0
3 1 102.0
4 2 100.0
5 3 105.0
Note: here I did not rename the new column from test to results; I think keeping test as the new column name is better in your situation.
Option 2
pd.concat
df = df.set_index('Run')
pd.concat([df[col] for col in df.columns], axis=0).reset_index().rename(columns={0: 'results'})
Out[786]:
Run results
0 1 100.0
1 2 110.0
2 3 108.0
3 1 102.0
4 2 100.0
5 3 105.0