Multi Level pivoting of dataframe - pandas

I have this dataframe:

Group   Feature 1  Feature 2  Class
First   5          4          1
Second  5          5          0
First   1          2          0
I want to do a multi level pivot in pandas to have something like this:
Group | Feature 1 (Class 1) | Feature 1 (Class 2) | Feature 2 (Class 1) | Feature 2 (Class 2)
What if I want to select only one feature to work with?

Like this?
out = (df.assign(Class=df["Class"] + 1)
         .pivot(index="Group", columns="Class"))
print(out)

       Feature 1      Feature 2
Class          1    2         1    2
Group
First        1.0  5.0       2.0  4.0
Second       5.0  NaN       5.0  NaN
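If you want flat headers like in your mock-up, you can join the two column levels afterwards, and if you only need a single feature you can pass it as values. A minimal sketch building on the out and df above (the header format is just one possible choice):
out.columns = [f"{feat} (Class {cls})" for feat, cls in out.columns]

# only one feature
one = (df.assign(Class=df["Class"] + 1)
         .pivot(index="Group", columns="Class", values="Feature 1"))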

Related

pandas dataframe how to replace extreme outliers for all columns

I have a pandas dataframe with some very extreme values - more than 5 std away from the mean.
I want to replace, per column, each such value with the maximum of the other values in that column.
For example,
df = A B
1 2
1 6
2 8
1 115
191 1
Will become:
df = A B
1 2
1 6
2 8
1 8
2 1
What is the best way to do it without a for loop over the columns?
s = df.mask((df - df.apply(lambda x: x.std())).gt(5))  # mask where the condition applies
s = s.assign(A=s.A.fillna(s.A.max()), B=s.B.fillna(s.B.max())).sort_index(axis=0)  # fill with the max per column and resort the frame
A B
0 1.0 2.0
1 1.0 6.0
2 2.0 8.0
3 1.0 8.0
4 2.0 1.0
Per the discussion in the comments, you need to decide what your threshold is. Say it is q = 100; then you can do
q = 100
df.loc[df['A'] > q,'A'] = max(df.loc[df['A'] < q,'A'] )
df
this fixes column A:
A B
0 1 2
1 1 6
2 2 8
3 1 115
4 2 1
Do the same for B.
Calculate a column-wise z-score (if you deem a value an outlier when it lies more than a given number of standard deviations from the column mean), then build a boolean mask of the values outside your desired range:
def calc_zscore(col):
    return (col - col.mean()) / col.std()

zscores = df.apply(calc_zscore, axis=0)
outlier_mask = zscores > 5
After that it's up to you to fill the values marked with the boolean mask.
df[outlier_mask] = something
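For instance, if "something" should be the per-column maximum of the remaining values, as the question asks, one possible sketch (this fill rule is taken from the question, not from the original answer) is to mask the outliers to NaN and fill them back column by column:
masked = df.mask(outlier_mask)           # outlier cells become NaN
df_filled = masked.fillna(masked.max())  # fill each column's NaN with that column's max remaining value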

Groupby with conditions

df = pd.DataFrame({'Category': ['A','B','B','B','C','C'],
'Subcategory': ['X','X','Y','Y','Z','Z'],
'Values': [1,2,3,4,5,6]})
which I summarize with groupby -
`df.groupby('Category')['Values'].agg({np.size, np.mean, np.median})`
size mean median
Category
A 1 1.0 1.0
B 3 3.0 3.0
C 2 5.5 5.5
Objective: in addition to the above, add a groupby restricted to Subcategory 'X' to create the output below:
ALL Subcategory Only Subcategory 'X'
size mean median size mean median
Category
A 1 1.0 1.0 1 1 1
B 3 3.0 3.0 1 2 2
C 2 5.5 5.5 0 0 0
My current solution is to create two groupbys, to_frame() them, then pd.merge them. Is there a better way? Thanks!
df.groupby('Category')['Values'].agg({np.size, np.mean, np.median})
df[df['Subcategory']=='X'].groupby('Category')['Values'].agg({np.size, np.mean, np.median})
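One possible alternative (a sketch, not an answer from the original thread): compute both aggregations and glue them side by side with pd.concat and keys, which produces the two-level header directly; filling categories with no 'X' rows with 0 is an assumption taken from the desired output:
import pandas as pd

df = pd.DataFrame({'Category': ['A', 'B', 'B', 'B', 'C', 'C'],
                   'Subcategory': ['X', 'X', 'Y', 'Y', 'Z', 'Z'],
                   'Values': [1, 2, 3, 4, 5, 6]})

agg_funcs = ['size', 'mean', 'median']
all_sub = df.groupby('Category')['Values'].agg(agg_funcs)
only_x = df[df['Subcategory'] == 'X'].groupby('Category')['Values'].agg(agg_funcs)

# side by side with a two-level column header; categories missing from only_x become NaN
out = (pd.concat({'ALL Subcategory': all_sub, "Only Subcategory 'X'": only_x}, axis=1)
         .fillna(0))
print(out)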

aggregate dataframe horizontally

I have the following data:
from pandas import DataFrame

inputdata = [[1, 'long', 30.2, 'Win'], [1, 'long', -12.4, 'Loss'],
             [2, 'short', -12.3, 'Loss'], [1, 'long', 3.2, 'Win'],
             [3, 'short', 0.0, 'B/E'], [3, 'short', 23.2, 'Win'],
             [3, 'long', 3.2, 'Win'], [4, 'short', -4.2, 'Loss']]
datadf = DataFrame(columns=['AssetId', 'Direction', 'PnL', 'W_L'], data=inputdata)
datadf
AssetId Direction PnL W_L
0 1 long 30.2 Win
1 1 long -12.4 Loss
2 2 short -12.3 Loss
3 1 long 3.2 Win
4 3 short 0.0 B/E
5 3 short 23.2 Win
6 3 long 3.2 Win
7 4 short -4.2 Loss
Now I want to aggregate this further into a new dataframe that looks like this mock-up (a few sample rows added, more stats to be added):
Stat Long Short Total
0 Trades 4 4 8
1 Won 3 1 4
2 Lost 1 2 3
(...)
I tried this:
datadf.groupby(['Direction'])['PnL'].count()
Direction
long 4
short 4
Name: PnL, dtype: int64
This produces the necessary data, but I would have to fill my aggregation data frame field by field, which seems cumbersome and I am not even sure how to get the exact value into each row/column. Based on this example, is there a better way to achieve this goal?
You can do crosstab:
pd.crosstab(df['W_L'], df['Direction'],margins=True, margins_name='Total')
Output:
Direction long short Total
W_L
B/E 0 1 1
Loss 1 2 3
Win 3 1 4
Total 4 4 8
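If you want the exact 'Stat' layout from the mock-up (Trades/Won/Lost rows), the crosstab result above can be reshaped; a sketch, with the row labels and their order assumed from the question:
ct = pd.crosstab(df['W_L'], df['Direction'], margins=True, margins_name='Total')
stat = (ct.rename(index={'Total': 'Trades', 'Win': 'Won', 'Loss': 'Lost'})
          .reindex(['Trades', 'Won', 'Lost', 'B/E'])  # row order from the mock-up, plus B/E
          .rename_axis('Stat')
          .reset_index())
print(stat)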
Use pivot_table:
res = pd.pivot_table(df.iloc[:,1:], index=["W_L"], columns=["Direction"], aggfunc="count").droplevel(0, 1)
res["total"] = res.sum(1)
print (res.append(res.sum().rename(index="Trades")))
Direction long short total
W_L
B/E NaN 1.0 1.0
Loss 1.0 2.0 3.0
Win 3.0 1.0 4.0
Trades 4.0 4.0 8.0

Resampling a dataframe based on depth column

I have two dataframes whose key is depth. One has more than 2k values, the other only 100, but the min and max depths are the same. I would like to upsample the small dataframe (which has only one extra column) to the same size as the bigger one, repeating the value of that column between two depths.
I've tried using concatenate and resampling, but I'm stuck when I want to match on depth, since the two dataframes' depths do not have exactly the same values.
I have this:
df_small:
depth Litholog
0 38.076 2.0
1 39.546 2.0
2 41.034 4.0
3 55.133 3.0
4 69.928 2.0
and this:
df_big:
depth
0 21.3360
1 35.2044
2 37.6428
3 41.7576
4 41.9100
5 48.7680
6 53.1876
7 56.0832
8 58.3692
9 62.1792
I would like this:
df_result:
depth Litholog
0 21.3360 2
1 35.2044 2
2 37.6428 2
3 41.7576 4
4 41.9100 4
5 48.7680 4
6 53.1876 4
7 56.0832 3
8 58.3692 3
9 62.1792 2
I tried several approaches but without success. Many thanks to all.
If it is possible to change the sample data so both frames share the same min and max depth, you can use merge_asof:
#change sample data for same min,max by df_big
print (df_small)
depth Litholog
0 21.3360 2.0
1 39.5460 2.0
2 41.0340 4.0
3 55.1330 3.0
4 62.1792 2.0
df = pd.merge_asof(df_big, df_small, on='depth')
print (df)
depth Litholog
0 21.3360 2.0
1 35.2044 2.0
2 37.6428 2.0
3 41.7576 4.0
4 41.9100 4.0
5 48.7680 4.0
6 53.1876 4.0
7 56.0832 3.0
8 58.3692 3.0
9 62.1792 2.0
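If changing df_small is not an option, one possible workaround (a sketch, not part of the original answer) is to run merge_asof on the original frames and back-fill the rows whose depth lies above the first key; every other df_big depth is matched to the last df_small depth at or below it:
res = pd.merge_asof(df_big, df_small, on='depth')  # both frames must be sorted by depth
res['Litholog'] = res['Litholog'].bfill()          # depths above the first key get the first Litholog value
print(res)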

Return Value Based on Conditional Lookup on Different Pandas DataFrame

Objective: to look up a value from one dataframe (conditionally) and place the results in a different dataframe under a new column name.
df_1 = pd.DataFrame({'user_id': [1, 2, 1, 4, 5],
                     'name': ['abc', 'def', 'ghi', 'abc', 'abc'],
                     'rank': [6, 7, 8, 9, 10]})
df_2 = pd.DataFrame({'user_id': [1, 2, 3, 4, 5]})
df_1 # original data
df_2 # new dataframe
In this general example, I am trying to create a new column named "priority_rank" and fill it only based on a conditional lookup against df_1, namely:
user_id must match between df_1 and df_2
I am interested only in rows where df_1['name'] == 'abc'; everything else should be blank
df_2 should end up looking like this:
|user_id|priority_rank|
1 6
2
3
4 9
5 10
One way to do this:
In []:
df_2['priority_rank'] = np.where((df_1.name=='abc') & (df_1.user_id==df_2.user_id), df_1['rank'], '')
df_2
Out[]:
user_id priority_rank
0 1 6
1 2
2 3
3 4 9
4 5 10
Note: In your example df_1.name=='abc' is a sufficient condition because all values for user_id are identical when df_1.name=='abc'. I'm assuming this is not always going to be the case.
Using merge
df_2.merge(df_1.loc[df_1.name == 'abc', :], how='left').drop('name', axis=1)
Out[932]:
user_id rank
0 1 6.0
1 2 NaN
2 3 NaN
3 4 9.0
4 5 10.0
You're looking for map:
df_2.assign(priority_rank=df_2['user_id'].map(
    df_1.query("name == 'abc'").set_index('user_id')['rank']))
user_id priority_rank
0 1 6.0
1 2 NaN
2 3 NaN
3 4 9.0
4 5 10.0