Obtaining a subset of the correlation matrix of a dataframe having only features that are less correlated - pandas

If I have a correlation matrix of features for a given target, like this:
       feat1 feat2 feat3 feat4 feat5
feat1    1    ....
feat2          1
feat3                1
feat4                      1
feat5   ....                     1
how can I end up with a subset of the original correlation matrix containing only some features that are less correlated? Let's say:
       feat2 feat3 feat5
feat2    1    ....
feat3          1
feat5   ....         1

In order to subset, you just need to use loc on both axes, i.e.:
In [105]: df
Out[105]:
     0    1    2     3    4
0  0.4  0.0  0.0  0.00  0.0
1  0.0  1.0  0.0  0.00  0.0
2  0.0  0.0  1.0  0.00  0.0
3  0.0  0.0  0.0  0.45  0.0
4  0.0  0.0  0.0  0.00  1.0
target = [0, 2, 3] # ['featX', 'featY', 'etc']
subset = df.loc[target, target]
Or if you want to filter by some logic, do it in steps:
corr = pd.Series(np.diag(df), index=df.index)
high_corr = corr[corr > 0.7].index
subset = df.loc[high_corr, high_corr]
In [114]: subset
Out[114]:
     1    2    4
1  1.0  0.0  0.0
2  0.0  1.0  0.0
4  0.0  0.0  1.0
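If by "less correlated" you mean features whose pairwise correlation with every other feature stays below some threshold, a minimal sketch of picking them straight from the correlation matrix could look like this (the 0.5 cutoff and the corr_df name are only assumptions for illustration):
import numpy as np
c = corr_df.abs()
c = c.mask(np.eye(len(c), dtype=bool), 0)   # ignore the 1s on the diagonal
keep = c.index[c.max(axis=1) < 0.5]         # features not strongly correlated with any other
subset = corr_df.loc[keep, keep]            # subset of the original correlation matrix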

Pandas Cumsum in expanding rows

I'm looking to learn how to code this solution in a more elegant way. I need to split a set of rows into smaller pieces, control the utilization, and calculate the balance. My current solution is not generating the balance properly.
import pandas as pd
import numpy as np

box_list = [['Box0', 0.2],
            ['Box1', 1.0],
            ['Box2', 1.8],
            ['Box4', 2.0],
            ['Box8', 4.01]]
sdf = pd.DataFrame(box_list, columns=['Name', 'Size'])
print(sdf)
   Name  Size
0  Box0  0.20
1  Box1  1.00
2  Box2  1.80
3  Box4  2.00
4  Box8  4.01
df = pd.DataFrame({'Name': np.repeat(sdf['Name'], sdf['Size'].apply(np.ceil)),
                   'Size': np.repeat(sdf['Size'], sdf['Size'].apply(np.ceil))})
df['Max_Units'] = df['Size'].apply(lambda x: np.ceil(x) if x > 1.0 else 1.0)
df = df.reset_index()
df['Utilization'] = df['Size'].apply(lambda x: x - int(x) if x > 1.0 else (x if x < 1.0 else 1.0))
df['Balance'] = df['Max_Units']
g = df.groupby(['index'], as_index=False, group_keys=False)
df['Utilization'] = g.apply(lambda x: pd.Series(np.where(x.Balance.shift(1) >= 1.0,
                                                         1.0,
                                                         x.Utilization))).values
df.loc[(df.Utilization == 0.0), ['Utilization']] = 1.0
df['Balance'] = g.apply(lambda x: pd.Series(np.where(x.Balance.shift(1) >= 1.0,
                                                     x.Max_Units - x.Utilization,
                                                     0))).values
print(df)
    index  Name  Size  Max_Units  Utilization  Balance
0       0  Box0  0.20        1.0         0.20      0.0
1       1  Box1  1.00        1.0         1.00      0.0
2       2  Box2  1.80        2.0         0.80      0.0
3       2  Box2  1.80        2.0         1.00      1.0
4       3  Box4  2.00        2.0         1.00      0.0
5       3  Box4  2.00        2.0         1.00      1.0
6       4  Box8  4.01        5.0         0.01      0.0
7       4  Box8  4.01        5.0         1.00      4.0
8       4  Box8  4.01        5.0         1.00      4.0
9       4  Box8  4.01        5.0         1.00      4.0
10      4  Box8  4.01        5.0         1.00      4.0
I'm not sure if I completely understand what all of these values are supposed to be representing.
However, I've achieved the correct desired output for your sample set in more direct ways:
import pandas as pd
import numpy as np

box_list = [['Box0', 0.2],
            ['Box1', 1.0],
            ['Box2', 1.8],
            ['Box4', 2.0],
            ['Box8', 4.01]]
df = pd.DataFrame(box_list, columns=['Name', 'Size'])
# Set ceil column to ceil of size since it's used more than once
df['ceil'] = df['Size'].apply(np.ceil)
# Duplicate Rows based on Ceil of Size
df = df.loc[df.index.repeat(df['ceil'])]
# Get Max Units by comparing it to the ceil column
df['Max_Units'] = df.apply(lambda s: max(s['ceil'], 1), axis=1)
# Extract Decimal Portion By Using % 1 (Catch Special Case of x == 1)
df['Utilization'] = df['Size'].apply(lambda x: 1 if x == 1 else x % 1)
# Everywhere Max_Units cumcount is not 0 set Utilization to 1
df.loc[df.groupby(df['Max_Units']).cumcount().ne(0), 'Utilization'] = 1
# Set Balance to index cumcount as float
df['Balance'] = df.groupby(df.index).cumcount().astype(float)
# Drop Unnecessary Column and reset index for output
df = df.drop(columns=['ceil']).reset_index()
# For Display
print(df)
Output:
    index  Name  Size  Max_Units  Utilization  Balance
0       0  Box0  0.20        1.0         0.20      0.0
1       1  Box1  1.00        1.0         1.00      0.0
2       2  Box2  1.80        2.0         0.80      0.0
3       2  Box2  1.80        2.0         1.00      1.0
4       3  Box4  2.00        2.0         1.00      0.0
5       3  Box4  2.00        2.0         1.00      1.0
6       4  Box8  4.01        5.0         0.01      0.0
7       4  Box8  4.01        5.0         1.00      1.0
8       4  Box8  4.01        5.0         1.00      2.0
9       4  Box8  4.01        5.0         1.00      3.0
10      4  Box8  4.01        5.0         1.00      4.0
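For reference, the two building blocks above are the row duplication via index.repeat and the per-group running counter; a stripped-down sketch of just those steps (the two-row sample frame and the n column are purely illustrative):
import numpy as np
import pandas as pd
sdf = pd.DataFrame({'Name': ['Box2', 'Box8'], 'Size': [1.8, 4.01]})
out = sdf.loc[sdf.index.repeat(np.ceil(sdf['Size']).astype(int))]  # duplicate each row ceil(Size) times
out['n'] = out.groupby(level=0).cumcount()                         # 0, 1, 2, ... within each duplicated block
print(out)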

groupby shows unobserved values of non-categorical columns

I created this simple example to illustrate my issue:
x = pd.DataFrame({"int_var1": range(3),
                  "int_var2": range(3, 6),
                  "cat_var": pd.Categorical(["a", "b", "a"]),
                  "value": [0.1, 0.2, 0.3]})
it yields this DataFrame:
   int_var1  int_var2 cat_var  value
0         0         3       a    0.1
1         1         4       b    0.2
2         2         5       a    0.3
where the first two columns are integers, the third column is categorical with two levels, and the fourth column contains floats. The issue is that when I use groupby followed by agg, I seem to have only two options. Either I can show no unobserved values, like so:
x.groupby(['int_var1', 'int_var2', 'cat_var'], observed = True).agg({"value": "sum"}).fillna(0)
                           value
int_var1 int_var2 cat_var       
0        3        a          0.1
1        4        b          0.2
2        5        a          0.3
or I can show unobserved values for all grouping variables like so:
x.groupby(['int_var1', 'int_var2', 'cat_var'], observed = False).agg({"value": "sum"}).fillna(0)
                           value
int_var1 int_var2 cat_var       
0        3        a          0.1
                  b          0.0
         4        a          0.0
                  b          0.0
         5        a          0.0
                  b          0.0
1        3        a          0.0
                  b          0.0
         4        a          0.0
                  b          0.2
         5        a          0.0
                  b          0.0
2        3        a          0.0
                  b          0.0
         4        a          0.0
                  b          0.0
         5        a          0.3
                  b          0.0
Is there a way to show unobserved values for the categorical variables only and not every possible permutation of all grouping variables?
You can unstack the level of interest, cat_var in this case:
(x.groupby(['int_var1', 'int_var2', 'cat_var'], observed=True)
  .agg({'value': 'sum'})
  .unstack('cat_var', fill_value=0)
)
Output:
                   value     
cat_var                a    b
int_var1 int_var2            
0        3           0.1  0.0
1        4           0.0  0.2
2        5           0.3  0.0
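If you prefer the result back in long form, with only the categorical level expanded, you could stack the unstacked level again; a sketch (how stack handles the fill value can differ between pandas versions):
out = (x.groupby(['int_var1', 'int_var2', 'cat_var'], observed=True)
        .agg({'value': 'sum'})
        .unstack('cat_var', fill_value=0)
        .stack('cat_var'))  # both 'a' and 'b' now appear for each observed (int_var1, int_var2) pair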

How to get the difference between a column from two dataframes by getting their index from another dataframe?

I have two dataframes, one for groundtruth trajectories and one for predicted trajectories, plus a dataframe that gives the matching between groundtruth and predicted tracks at each frame. The groundtruth and predicted tracks look like this:
df_pred_batch =
CENTER_X CENTER_Y LENGTH SPEED ACCELERATION HEADING
FrameId HId
0 0 -1.870000 -0.41 1.51 1.280 1.670 0.39
1 0 -1.730000 -0.36 1.51 1.440 1.660 0.40
2 0 -1.180000 -1.57 2.05 2.220 0.390 0.61
0 1 -1.540000 -1.83 2.05 2.140 0.390 0.61
1 1 -1.370000 -1.70 2.05 2.180 0.390 0.61
2 1 -1.590000 -0.29 1.51 1.610 1.630 0.41
1 2 -1.910000 -1.12 1.04 0.870 1.440 0.30
2 2 -1.810000 -1.09 1.04 1.010 1.440 0.27
0 3 17.190001 -3.15 1.80 2.178 -0.028 3.36
1 3 15.000000 -3.60 1.80 2.170 -0.020 3.38
df_gt_batch =
CENTER_X CENTER_Y LENGTH SPEED ACCELERATION HEADING
FrameId OId
1 0 -1.91 -1.12 1.040 0.87 1.44 0.30
2 0 -1.81 -1.09 1.040 1.01 1.44 0.27
0 1 -1.87 -0.41 1.510 1.28 1.67 0.39
1 1 -1.73 -0.36 1.510 1.44 1.66 0.40
2 1 -1.59 -0.29 1.510 1.61 1.63 0.41
0 2 -1.54 -1.83 2.056 2.14 0.39 0.61
1 2 -1.37 -1.70 2.050 2.18 0.39 0.61
2 2 -1.18 -1.57 2.050 2.22 0.39 0.61
0 3 1.71 -0.31 1.800 2.17 -0.02 3.36
1 3 1.50 -0.36 1.800 2.17 -0.02 3.38
2 3 1.29 -0.41 1.800 2.17 -0.01 3.40
Also, I know their matching at each timestamp:
matched_gt_pred =
FrameId Type OId HId
0 0 MATCH 1.0 0.0
1 0 MATCH 2.0 1.0
4 1 MATCH 1.0 0.0
5 1 MATCH 2.0 1.0
6 1 MATCH 0.0 2.0
9 2 MATCH 0.0 2.0
I would like to look at each row of matched_gt_pred and get the corresponding CENTER_X from df_pred_batch and df_gt_batch and calculate the error.
For instance, looking at the first row of matched_gt_pred, I know that at FrameId == 0, OId == 1 and HId == 0 are matched. I should get gt_center_x = df_gt_batch["FrameId==0" and "OId == 1"].CENTER_X and pred_center_x = df_pred_batch["FrameId==0" and "HId == 0"].CENTER_X, and compute error = abs(gt_center_x - pred_center_x).
IIUC, I would reshape your df_gt_batch and df_pred_batch and use lookup:
gt_x = df_gt_batch['CENTER_X'].unstack().lookup(matched_gt_pred['FrameId'], matched_gt_pred['OId'])
pred_x = df_pred_batch['CENTER_X'].unstack().lookup(matched_gt_pred['FrameId'], matched_gt_pred['HId'])
matched_gt_pred['X Error'] = np.abs(gt_x - pred_x)
Output:
FrameId Type OId HId X Error
0 0 MATCH 1.0 0.0 0.0
1 0 MATCH 2.0 1.0 0.0
4 1 MATCH 1.0 0.0 0.0
5 1 MATCH 2.0 1.0 0.0
6 1 MATCH 0.0 2.0 0.0
9 2 MATCH 0.0 2.0 0.0
Another option is to use reindex with pd.MultiIndex:
matched_gt_pred['X Error'] = (
    df_pred_batch.reindex(pd.MultiIndex.from_arrays([matched_gt_pred['FrameId'], matched_gt_pred['HId']]))['CENTER_X'].to_numpy()
    - df_gt_batch.reindex(pd.MultiIndex.from_arrays([matched_gt_pred['FrameId'], matched_gt_pred['OId']]))['CENTER_X'].to_numpy()
)
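Note that DataFrame.lookup is deprecated in recent pandas versions (and removed in 2.0), so the reindex route, or an explicit .loc with (FrameId, id) tuples, is more future-proof. A sketch of the latter, where casting the float ids to int is an assumption about your data:
gt_keys = list(zip(matched_gt_pred['FrameId'], matched_gt_pred['OId'].astype(int)))
pred_keys = list(zip(matched_gt_pred['FrameId'], matched_gt_pred['HId'].astype(int)))
gt_x = df_gt_batch.loc[gt_keys, 'CENTER_X'].to_numpy()
pred_x = df_pred_batch.loc[pred_keys, 'CENTER_X'].to_numpy()
matched_gt_pred['X Error'] = np.abs(gt_x - pred_x)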

Python: How to replace non-zero values in a Pandas dataframe with values from a series

I have a dataframe 'A' with 3 columns and 4 rows (X1..X4). Some of the elements in 'A' are non-zero. I have another dataframe 'B' with 1 column and 4 rows (X1..X4). I would like to create a dataframe 'C' so that wherever 'A' has a non-zero value, it takes the value from the equivalent row in 'B'.
I've tried a.where(a != 0, c) ... obviously wrong, as c is not a scalar.
A = pd.DataFrame({'A':[1,6,0,0],'B':[0,0,1,0],'C':[1,0,3,0]},index=['X1','X2','X3','X4'])
B = pd.DataFrame({'A':{'X1':1.5,'X2':0.4,'X3':-1.1,'X4':5.2}})
These are the expected results:
C = pd.DataFrame({'A':[1.5,0.4,0,0],'B':[0,0,-1.1,0],'C':[1.5,0,-1.1,0]},index=['X1','X2','X3','X4'])
np.where():
If you want to assign back to A:
A[:] = np.where(A.ne(0), B, A)
For a new df:
final = pd.DataFrame(np.where(A.ne(0), B, A), columns=A.columns)
A B C
0 1.5 0.0 1.5
1 0.4 0.0 0.0
2 0.0 -1.1 -1.1
3 0.0 0.0 0.0
Usage of fillna
A=A.mask(A.ne(0)).T.fillna(B.A).T
A
Out[105]:
A B C
X1 1.5 0.0 1.5
X2 0.4 0.0 0.0
X3 0.0 -1.1 -1.1
X4 0.0 0.0 0.0
Or
A=A.mask(A!=0,B.A,axis=0)
Out[111]:
A B C
X1 1.5 0.0 1.5
X2 0.4 0.0 0.0
X3 0.0 -1.1 -1.1
X4 0.0 0.0 0.0
Use:
A.mask(A!=0,B['A'],axis=0,inplace=True)
print(A)
A B C
X1 1.5 0.0 1.5
X2 0.4 0.0 0.0
X3 0.0 -1.1 -1.1
X4 0.0 0.0 0.0
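Your original where attempt can also be made to work by inverting the condition and passing B's column with axis=0; a sketch that builds C without modifying A:
C = A.where(A.eq(0), B['A'], axis=0)  # keep zeros, take B's value wherever A is non-zero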

How to select and calculate with value from specific variable in dataframe with pandas

I am running the code below and get this:
import pandas as pd
pf=pd.read_csv("https://www.dropbox.com/s/08kuxi50d0xqnfc/demo.csv?dl=1")
x=pf[pf['fuv1'] == 0].count()*100/1892
x
id 0.528541
date 0.528541
count 0.528541
idade 0.528541
site 0.528541
baseline 0.528541
fuv1 0.528541
fuv2 0.475687
fuv3 0.528541
fuv4 0.475687
dtype: float64
What I want is just to get the single result 0.528541 and discard all the other results.
What should I do?
Thanks.
If you want to count the number of 0 values in column fuv1, use sum to count the True values, which are treated as 1s:
print ((pf['fuv1'] == 0).sum())
10
x = (pf['fuv1'] == 0).sum()*100/1892
print (x)
0.528541226216
Explanation of why the outputs differ - count excludes NaNs:
pf=pd.read_csv("https://www.dropbox.com/s/08kuxi50d0xqnfc/demo.csv?dl=1")
x=pf[pf['fuv1'] == 0]
print (x)
id date count idade site baseline fuv1 fuv2 fuv3 fuv4
0 0 4/1/2016 10 13 A 1 0.0 1.0 0.0 1.0
2 2 4/3/2016 9 5 C 1 0.0 NaN 0.0 1.0
3 3 4/4/2016 108 96 D 1 0.0 1.0 0.0 NaN
11 11 4/12/2016 6 13 C 1 0.0 1.0 1.0 0.0
13 13 4/14/2016 12 4 C 1 0.0 1.0 1.0 0.0
40 40 5/11/2016 14 7 C 1 0.0 1.0 1.0 1.0
41 41 5/12/2016 0 26 C 1 0.0 1.0 1.0 1.0
42 42 5/13/2016 10 15 C 1 0.0 1.0 1.0 1.0
60 60 5/31/2016 13 3 D 1 0.0 1.0 1.0 1.0
74 74 6/14/2016 15 7 B 1 0.0 1.0 1.0 1.0
print (x.count())
id 10
date 10
count 10
idade 10
site 10
baseline 10
fuv1 10
fuv2 9
fuv3 10
fuv4 9
dtype: int64
In [282]: pf.loc[pf['fuv1'] == 0, 'id'].count()*100/1892
Out[282]: 0.5285412262156448
import pandas as pd

pf = pd.read_csv("https://www.dropbox.com/s/08kuxi50d0xqnfc/demo.csv?dl=1")
x = (pf['fuv1'] == 0).sum()*100/1892
y = pf["idade"].mean()
l = "Performance"
k = "LTFU"

def test(l1, k1):
    return pd.DataFrame({'a': [l1, k1], 'b': [x, y]})

df1 = test(l, k)
df1.columns = [''] * len(df1.columns)
df1.index = [''] * len(df1.index)
print(round(df1, 2))
Performance 0.53
LTFU 14.13
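As a side note, if 1892 is simply the total number of rows in pf, the same percentage can be computed directly from the boolean mean; a sketch, assuming len(pf) == 1892:
x = (pf['fuv1'] == 0).mean() * 100  # percentage of rows where fuv1 equals 0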