Pandas finding and replacing outliers based on a group of two columns

Pandas finding and replacing outliers based on a group of two columns - pandas

I'm having a bit of trouble finding outliers in a df based on groups and dates.
For exampe I have a df like and I would like to find and replace the outlier values (10 for the group A on date 2022-06-27 and 20 for the group B on 2022-06-27) with the median of the respective group (3 for the first outliers and 4 for the second).
However I'm having some trouble filtering the data and isolating the outliers and replacing them.
index = [0,1,2,3,4,5,6,7,8,9,10,11]
s = pd.Series(['A','A','A','A','A','A','B','B','B','B','B','B'],index= index)
t = pd.Series(['2022-06-28','2022-06-28','2022-06-28','2022-06-27','2022-06-27','2022-06-27',
'2022-06-28','2022-06-28','2022-06-28','2022-06-27','2022-06-27','2022-06-27'],index= index)
r = pd.Series([1,2,1,2,3,10,2,3,2,3,4,20],index= index)
df = pd.DataFrame(s,columns = ['group'])
df['date'] = t
df['vale'] = r
print (df)
group date val
0 A 2022-06-28 1
1 A 2022-06-28 2
2 A 2022-06-28 1
3 A 2022-06-27 2
4 A 2022-06-27 3
5 A 2022-06-27 10
6 B 2022-06-28 2
7 B 2022-06-28 3
8 B 2022-06-28 2
9 B 2022-06-27 3
10 B 2022-06-27 4
11 B 2022-06-27 20
Thanks for the help!

First you can identify outliers. This code identifies any values that are greater than one standard deviation away from the mean.
outliers = df.loc[(df.value - df.value.mean()).abs() > df.value.std() * 1].index
Then you can determine the median of each group:
medians = df.groupby('group')['value'].median()
Finally, locate the outliers and replace with the medians:
df.loc[outliers, 'value'] = medians.loc[df.loc[outliers, 'group']].to_list()
All together it looks like:
import pandas as pd
index = [0,1,2,3,4,5,6,7,8,9,10,11]
s = pd.Series(['A','A','A','A','A','A','B','B','B','B','B','B'],index= index)
t = pd.Series(['2022-06-28','2022-06-28','2022-06-28','2022-06-27','2022-06-27','2022-06-27',
'2022-06-28','2022-06-28','2022-06-28','2022-06-27','2022-06-27','2022-06-27'],index= index)
r = pd.Series([1,2,1,2,3,10,2,3,2,3,4,20],index= index)
df = pd.DataFrame(s,columns = ['group'])
df['date'] = t
df['value'] = r
outliers = df.loc[(df.value - df.value.mean()).abs() > df.value.std() * 1].index
medians = df.groupby('group')['value'].median()
df.loc[outliers, 'value'] = medians.loc[df.loc[outliers, 'group']].values
Output:
group date value
0 A 2022-06-28 1
1 A 2022-06-28 2
2 A 2022-06-28 1
3 A 2022-06-27 2
4 A 2022-06-27 3
5 A 2022-06-27 2
6 B 2022-06-28 2
7 B 2022-06-28 3
8 B 2022-06-28 2
9 B 2022-06-27 3
10 B 2022-06-27 4
11 B 2022-06-27 3

You can use a combination of .groupby/transform to obtain the medians for each grouping, and then mask your original data against the outliers, filling with those medians.
medians = df.groupby('group')['value'].transform('median')
df['new_value'] = df['value'].mask(lambda s: (s - s.mean()).abs() > s.std(), medians)
print(df)
group date value new_value
0 A 2022-06-28 1 1.0
1 A 2022-06-28 2 2.0
2 A 2022-06-28 1 1.0
3 A 2022-06-27 2 2.0
4 A 2022-06-27 3 3.0
5 A 2022-06-27 10 2.0
6 B 2022-06-28 2 2.0
7 B 2022-06-28 3 3.0
8 B 2022-06-28 2 2.0
9 B 2022-06-27 3 3.0
10 B 2022-06-27 4 4.0
11 B 2022-06-27 20 3.0

Related

Get the cumulative most frequent status for specific column in pandas dataframe

I have a dataframe:
# create example df
df = pd.DataFrame(index=[1,2,3,4,5,6,7,8])
df['ID'] = [1,1,1,1,2,2,2,2]
df['election_date'] = pd.date_range("01/01/2010", periods=8, freq="M")
df['status'] = ['b','a','b','c','a','d','d','b']
# sort values
df.sort_values(['election_date'], inplace=True, ascending=False)
df.reset_index(drop=True, inplace=True)
df
ID election_date status
0 2 2010-08-31 b
1 2 2010-07-31 d
2 2 2010-06-30 d
3 2 2010-05-31 a
4 1 2010-04-30 c
5 1 2010-03-31 b
6 1 2010-02-28 a
7 1 2010-01-31 b
I would like to get the cumulative most frequent status for column status for each ID. This is what I would expect:
ID election_date status cum_most_freq_status
0 2 2010-08-31 b d
1 2 2010-07-31 d d
2 2 2010-06-30 d a
3 2 2010-05-31 a NaN
4 1 2010-04-30 c b
5 1 2010-03-31 b a
6 1 2010-02-28 a b
7 1 2010-01-31 b NaN
Interpretation:
for 2010-01-31 the value is NaN because there was no status value in the past. The same works for 2010-05-31.
for 2010-03-31 the most frequent status in the past was a and b. Therefore we take the most recent value, which was a.
How would you do it?

You can first make a DataFrame with ID and election_date as its index, and one-hot-encoded status values, then calculate cumsum.
We want to pick the most recent status if there is a tie in counts, so I'm adding a small number (less than 1) to cumsum for the current status, so when we apply idxmax it will pick up the most recent status in case there's a tie.
After finding the most frequent cumulative status with idxmax we can merge with the original DataFrame:
# make one-hot-encoded status dataframe
z = (df
.groupby(['ID', 'election_date', 'status'])
.size().unstack().fillna(0))
# break ties to choose most recent
z = z.groupby(level=0).cumsum() + (z * 1e-4)
# shift by 1 row, since we only count previous status occurrences
z = z.groupby(level=0).shift()
# merge
df.merge(z.idxmax(axis=1).to_frame('cum_most_freq_status').reset_index())
Output:
ID election_date status cum_most_freq_status
0 2 2010-08-31 b d
1 2 2010-07-31 d d
2 2 2010-06-30 d a
3 2 2010-05-31 a NaN
4 1 2010-04-30 c b
5 1 2010-03-31 b a
6 1 2010-02-28 a b
7 1 2010-01-31 b NaN

Return the length of the longest consecutive months by group in pandas

df
id date
0 a 202007
1 a 202008
2 a 202009
3 a 202010
4 a 202011
5 b 202011
6 c 202011
7 c 202012
8 c 202101
9 c 202102
10 d 202101
11 d 202102
12 d 202103
13 d 202105
14 e 202012
15 e 202101
16 e 202102
17 e 202104
18 e 202105
14 f 202012
15 f 202101
16 f 202103
17 f 202104
18 f 202105
The second column type is int.
Expected
a 5
b 1
c 4
d 3
e 3
f 3
Try and Ref
Ref： Get longest streak of consecutive weeks by group in pandas
I refer above post but still could get the results.
Note: For each id, the value of date is unique.
Pandas version: 1.1.5

Convert datetimes to month periods by Series.dt.to_period and then to integers and then instead transform use GroupBy.size with max per first level, here id:
per = pd.to_datetime(df['date'], format='%Y%m').dt.to_period('m').astype('int')
g = per.diff(-1).ne(-1).shift().bfill().cumsum()
df = df.groupby(['id',g]).size().max(level=0).reset_index(name='count')
print (df)
id count
0 a 5
1 b 1
2 c 4
3 d 3
4 e 3
5 f 3
For oldier pandas version is possible get attribute n from MonthEnd object if not missing value by custom function after diff:
f = lambda x: x.n if pd.notna(x) else None
df['date'] = pd.to_datetime(df['date'], format='%Y%m').dt.to_period('m')
g = df['date'].diff(-1).apply(f).ne(-1).shift().bfill().cumsum()
df = df.groupby(['id',g]).size().max(level=0).reset_index(name='count')
print (df)
id count
0 a 5
1 b 1
2 c 4
3 d 3
4 e 3
5 f 3

How to populate monthly data to weekly data in pandas?

I want to populate data in dataframe which consists of monthly data like the followings
M A B C
2020-1 2 30 5
2020-2 8 50 9
How can I do this easily using pandas api?
M A B C
2020-1-01 2 30 5
2020-1-08 3 35 6
2020-1-15 5 40 7
2020-1-22 7 45 8
2020-2-01 8 50 9
Thanks in advance

Use DataFrame.resample by months with W for weekends and ffill for forward filling values and then some processing with Grouper and GroupBy.cumcount, multiple and add values to columns:
df['M'] = pd.to_datetime(df['M'])
df = df.set_index('M').resample('W').ffill()
s = df.groupby(pd.Grouper(freq='MS')).cumcount().to_numpy()
df['B'] = df['B'].add(df.C.mul(s))
df[['A','C']] = df[['A','C']].add(s, axis=0)
df['B'] = df['B']
print (df)
A B C
M
2020-01-05 2 30 5
2020-01-12 3 35 6
2020-01-19 4 40 7
2020-01-26 5 45 8
2020-02-02 8 50 9

pandas get 1 rank from groupby multiple columns

Is it possible to do something like this
df = pd.DataFrame({
"sort_by": ["a","a","a","a","b","b","b", "a"],
"x": [100.5,200,200,500,1,2,3, 200],
"y": [4000,2000,2000,1000,500.5,600.5,600.5, 100.5]
})
df = df.sort_values(by=["x","y"], ascending=False)
where I can sort by the sort_by column and use x and y to find the rank (using y to break ties)
so ideal outlook will be
sort_by x y rank
a 500 1000 1
a 200 2000 2
a 200 2000 2
a 200 100.5 3
a 100.5 4000 4
b 3 600.5 1
b 2 600.5 2
b 1 500.5 3

Check with factorize after sort_values
df = df.sort_values(by=["x","y"], ascending=False)
df['rank']=tuple(zip(df.x,df.y))
df['rank']=df.groupby('sort_by',sort=False)['rank'].apply(lambda x : pd.Series(pd.factorize(x)[0])).values
df
Out[615]:
sort_by x y rank
3 a 500.0 1000.0 1
1 a 200.0 2000.0 2
2 a 200.0 2000.0 2
7 a 200.0 100.5 3
0 a 100.5 4000.0 4
6 b 3.0 600.5 1
5 b 2.0 600.5 2
4 b 1.0 500.5 3

Complete an incomplete dataframe in pandas

Good morning.
I have a dataframe that can be both like this:
df1 =
zone date p1 p2
0 A 1 154 2
1 B 1 2647 7
2 C 1 0 0
3 A 2 1280 3
4 B 2 6809 20
5 C 2 288 5
6 A 3 2000 4
and like this:
df2 =
zone date p1 p2
0 A 1 154 2
1 B 1 2647 7
2 C 1 0 0
3 A 2 1280 3
4 B 2 6809 20
5 C 2 288 5
The difference between the two is only that the case may arise in which one, or several but not all, zones do have data for the highest of the time periods (column date). My desired result is to be able to complete the dataframe until a certain period of time (3 in the example), in the following way in each of the cases:
df1_result =
zone date p1 p2
0 A 1 154 2
1 B 1 2647 7
2 C 1 0 0
3 A 2 1280 3
4 B 2 6809 20
5 C 2 288 5
6 A 3 2000 4
7 B 3 6809 20
8 C 3 288 5
df2_result =
zone date p1 p2
0 A 1 154 2
1 B 1 2647 7
2 C 1 0 0
3 A 2 1280 3
4 B 2 6809 20
5 C 2 288 5
6 A 3 1280 3
7 B 3 6809 20
8 C 3 288 5
I've tried different combinations of pivot and fillna with different methods, but I can't achieve the previous result.
I hope my explanation was understood.
Many thanks in advance.

You can use reindex to create entries for all dates in the range, and then forward fill the last value into it.
import pandas as pd
df1 = pd.DataFrame([['A', 1,154, 2],
['B', 1,2647, 7],
['C', 1,0, 0],
['A', 2,1280, 3],
['B', 2,6809, 20],
['C', 2,288, 5],
['A', 3,2000, 4]],
columns=['zone', 'date', 'p1', 'p2'])
result = df1.groupby("zone").apply(lambda x: x.set_index("date").reindex(range(1, 4), method='ffill'))
print(result)
To get
zone p1 p2
zone date
A 1 A 154 2
2 A 1280 3
3 A 2000 4
B 1 B 2647 7
2 B 6809 20
3 B 6809 20
C 1 C 0 0
2 C 288 5
3 C 288 5

IIUC, you can reconstruct a pd.MultiIndex from your original df and use fillna to get the max from each subgroup of zone you have.
first, build your index
ind = df1.set_index(['zone', 'date']).index
levels = ind.levels
n = len(levels[0])
labels = [np.tile(np.arange(n), n), np.repeat(np.arange(0, n), n)]
Then, use pd.MultiIndex constructor to reindex
df1.set_index(['zone', 'date'])\
.reindex(pd.MultiIndex(levels= levels, labels= labels))\
.fillna(df1.groupby(['zone']).max())
p1 p2
zone date
A 1 154.0 2.0
B 1 2647.0 7.0
C 1 0.0 0.0
A 2 1280.0 3.0
B 2 6809.0 20.0
C 2 288.0 5.0
A 3 2000.0 4.0
B 3 6809.0 20.0
C 3 288.0 5.0
To fill df2, just change from df1 in this last line of code to df2 and you get
p1 p2
zone date
A 1 154.0 2.0
B 1 2647.0 7.0
C 1 0.0 0.0
A 2 1280.0 3.0
B 2 6809.0 20.0
C 2 288.0 5.0
A 3 2000.0 4.0
B 3 6809.0 20.0
C 3 288.0 5.0
I suggest not to copy/paste directly the code and try to run, but rather try to understand the process and make slight changes if needed depending on how different your original data frame is from what you posted.

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

Pandas finding and replacing outliers based on a group of two columns - pandas

Related

Get the cumulative most frequent status for specific column in pandas dataframe

Return the length of the longest consecutive months by group in pandas

How to populate monthly data to weekly data in pandas?

pandas get 1 rank from groupby multiple columns

Complete an incomplete dataframe in pandas

Categories

Resources