Python: How to replace non-zero values in a Pandas DataFrame with values from a Series

I have a dataframe 'A' with 3 columns and 4 rows (X1..X4). Some of the elements in 'A' are non-zero. I have another dataframe 'B' with 1 column and 4 rows (X1..X4). I would like to create a dataframe 'C' so that wherever 'A' has a non-zero value, it takes the value from the corresponding row of 'B'.
I've tried a.where(a!=0, c), which is obviously wrong since c is not a scalar.
A = pd.DataFrame({'A':[1,6,0,0],'B':[0,0,1,0],'C':[1,0,3,0]},index=['X1','X2','X3','X4'])
B = pd.DataFrame({'A':{'X1':1.5,'X2':0.4,'X3':-1.1,'X4':5.2}})
These are the expected results:
C = pd.DataFrame({'A':[1.5,0.4,0,0],'B':[0,0,-1.1,0],'C':[1.5,0,-1.1,0]},index=['X1','X2','X3','X4'])

np.where():
If you want to assign back to A:
A[:]=np.where(A.ne(0),B,A)
For a new df:
final=pd.DataFrame(np.where(A.ne(0),B,A),columns=A.columns)
A B C
0 1.5 0.0 1.5
1 0.4 0.0 0.0
2 0.0 -1.1 -1.1
3 0.0 0.0 0.0
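Note that building the new frame this way drops A's X1..X4 row labels. A minimal self-contained sketch of the same approach that also carries the index over (assuming you want to keep those labels):
import numpy as np
import pandas as pd

A = pd.DataFrame({'A': [1, 6, 0, 0],
                  'B': [0, 0, 1, 0],
                  'C': [1, 0, 3, 0]},
                 index=['X1', 'X2', 'X3', 'X4'])
B = pd.DataFrame({'A': {'X1': 1.5, 'X2': 0.4, 'X3': -1.1, 'X4': 5.2}})

# np.where broadcasts B's single column across A's three columns;
# index=A.index keeps the X1..X4 labels on the result.
C = pd.DataFrame(np.where(A.ne(0), B, A), index=A.index, columns=A.columns)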

Usage of fillna
A=A.mask(A.ne(0)).T.fillna(B.A).T
A
Out[105]:
A B C
X1 1.5 0.0 1.5
X2 0.4 0.0 0.0
X3 0.0 -1.1 -1.1
X4 0.0 0.0 0.0
Or
A=A.mask(A!=0,B.A,axis=0)
Out[111]:
A B C
X1 1.5 0.0 1.5
X2 0.4 0.0 0.0
X3 0.0 -1.1 -1.1
X4 0.0 0.0 0.0

Use:
A.mask(A!=0,B['A'],axis=0,inplace=True)
print(A)
A B C
X1 1.5 0.0 1.5
X2 0.4 0.0 0.0
X3 0.0 -1.1 -1.1
X4 0.0 0.0 0.0
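For reference, mask(cond, other) is the mirror image of where(cond, other), so the same result can be written with where by inverting the condition (a small sketch, not part of the original answers):
# Keep A where it is zero, otherwise take the aligned value from B's column.
C = A.where(A == 0, B['A'], axis=0)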

Related

How to create a column with the index of the largest value among other columns AND some condition

I have a dataset with some columns, and I want to create another column whose values are the name of the column with the highest value, excluding values equal to 1.
For example:
df = pd.DataFrame({'A': [1, 0.2, 0.1, 0],
                   'B': [0.2, 1, 0, 0.5],
                   'C': [1, 0.4, 0.3, 1]},
                  index=['1', '2', '3', '4'])
df
     A    B    C
1  1.0  0.2  1.0
2  0.2  1.0  0.4
3  0.1  0.0  0.3
4  0.0  0.5  1.0
Should give an output like:
     A    B    C NEWCOL
1  1.0  0.2  1.0      B
2  0.2  0.3  0.1      C
3  0.1  0.4  0.2      B
4  0.0  0.5  1.0      B
df2['newcol'] = df2.idxmax(axis=1) if df2.max(index=1) != 1
but it didn't work.
Here is one way to do it:
# filter out the data that is 1 and find the id of the max value using idxmax
df['newcol']=df[~df.isin([1])].idxmax(axis=1)
df
A B C newcol
1 1.0 0.2 1.0 B
2 0.2 1.0 0.4 C
3 0.1 0.0 0.3 C
4 0.0 0.5 1.0 B
PS: Your input data and your expected output don't match. The above is based on the input DataFrame.
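If the values are floats that may only be approximately equal to 1 (an assumption on my part, not something the question states), the exact isin([1]) check can be swapped for a tolerance-based mask over the value columns:
import numpy as np

vals = df[['A', 'B', 'C']]
# Mask entries close to 1 (within float tolerance), then take the column name of the row maximum.
df['newcol'] = vals.mask(np.isclose(vals, 1)).idxmax(axis=1)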

Obtaining a subset of a dataframe's correlation matrix containing only the less-correlated features

If I have a correlation matrix of features for a given target, like this:
feat1 feat2 feat3 feat4 feat5
feat1 1 ....
feat2 1
feat3 1
feat4 1
feat5 .... 1
how can I end up with a subset of the original correlation matrix that contains only some features that are less correlated? Let's say:
feat2 feat3 feat5
feat2 1 ....
feat3 1
feat5 .... 1
In order to subset, you just need to use loc on both axes, i.e.:
In [105]: df
Out[105]:
0 1 2 3 4
0 0.4 0.0 0.0 0.00 0.0
1 0.0 1.0 0.0 0.00 0.0
2 0.0 0.0 1.0 0.00 0.0
3 0.0 0.0 0.0 0.45 0.0
4 0.0 0.0 0.0 0.00 1.0
target = [0, 2, 3] # ['featX', 'featY', 'etc']
subset = df.loc[target, target]
Or if you want to filter by some logic, do it in steps:
corr = pd.Series(np.diag(df), index=df.index)
high_corr = corr[corr > 0.7].index
subset = df.loc[high_corr, high_corr]
In [114]: subset
Out[114]:
1 2 4
1 1.0 0.0 0.0
2 0.0 1.0 0.0
4 0.0 0.0 1.0
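If "less correlated" means features whose largest absolute off-diagonal correlation stays below some cutoff, one possible sketch of the filtering step (corr is assumed to be a square correlation matrix with 1s on the diagonal, and 0.7 is an arbitrary threshold):
import numpy as np

threshold = 0.7
off_diag = corr.mask(np.eye(len(corr), dtype=bool))   # hide the diagonal 1s
low_corr = off_diag.abs().max().loc[lambda s: s < threshold].index
subset = corr.loc[low_corr, low_corr]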

groupby shows unobserved values of non-categorical columns

I created this simple example to illustrate my issue:
x = pd.DataFrame({"int_var1": range(3), "int_var2": range(3, 6), "cat_var": pd.Categorical(["a", "b", "a"]), "value": [0.1, 0.2, 0.3]})
it yields this DataFrame:
   int_var1  int_var2 cat_var  value
0         0         3       a    0.1
1         1         4       b    0.2
2         2         5       a    0.3
where the first two columns are integers, the third column is categorical with two levels, and the fourth column is floats. The issue is that when I try to use groupby followed by agg, it seems I only have two options: either I show no unobserved values, like so:
x.groupby(['int_var1', 'int_var2', 'cat_var'], observed = True).agg({"value": "sum"}).fillna(0)
int_var1 int_var2 cat_var value
0 3 a 0.1
1 4 b 0.2
2 5 a 0.3
or I can show unobserved values for all grouping variables like so:
x.groupby(['int_var1', 'int_var2', 'cat_var'], observed = False).agg({"value": "sum"}).fillna(0)
int_var1 int_var2 cat_var value
0 3 a 0.1
b 0.0
4 a 0.0
b 0.0
5 a 0.0
b 0.0
1 3 a 0.0
b 0.0
4 a 0.0
b 0.2
5 a 0.0
b 0.0
2 3 a 0.0
b 0.0
4 a 0.0
b 0.0
5 a 0.3
b 0.0
Is there a way to show unobserved values for the categorical variables only and not every possible permutation of all grouping variables?
You can unstack the level of interest, cat_var in this case:
(x.groupby(['int_var1', 'int_var2', 'cat_var'], observed=True)
   .agg({'value': 'sum'})
   .unstack('cat_var', fill_value=0)
)
Output:
value
cat_var a b
int_var1 int_var2
0 3 0.1 0.0
1 4 0.0 0.2
2 5 0.3 0.0
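If you later need the long format back, with every cat_var level present for each observed (int_var1, int_var2) pair, the column level can be stacked again (a follow-up sketch, not something the question asks for):
out = (x.groupby(['int_var1', 'int_var2', 'cat_var'], observed=True)
         .agg({'value': 'sum'})
         .unstack('cat_var', fill_value=0)
         .stack('cat_var')        # bring cat_var back into the index
         .reset_index())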

Python: group by with a sum over specific columns while keeping the initial rows too

I have a df:
ID Car Jan17 Jun18 Dec18 Apr19
0 Nissan 0.0 1.7 3.7 0.0
1 Porsche 10.0 0.0 2.8 3.5
2 Golf 0.0 1.7 3.0 2.0
3 Tiguan 1.0 0.0 3.0 5.2
4 Touareg 0.0 0.0 3.0 4.2
5 Mercedes 0.0 0.0 0.0 7.2
6 Passat 0.0 3.0 0.0 0.0
I would like to change the values in row #6 (Passat in the Car column) by adding the values from rows #2, #3, and #4 (Golf, Tiguan, Touareg), while also keeping rows #2, #3, and #4 as they are.
Because Passat includes Golf, Touareg, and Tiguan, I need to add the values of the Golf, Touareg, and Tiguan rows to the Passat row.
I tried to do it with the following code:
car_list = ['Golf', 'Tiguan', 'Touareg']
for car in car_list:
    df['Car'][df['Car'] == car] = 'Passat'
and after that I used groupby on Car with the sum() function:
df1 = df.groupby(['Car'])['Jan17', 'Jun18', 'Dec18', 'Apr19'].sum().reset_index()
As a result, df1 doesn't have the initial (Golf, Tiguan, Touareg) rows, so this way is wrong.
Expected result is df1:
ID Car Jan17 Jun18 Dec18 Apr19
0 Nissan 0.0 1.7 3.7 0.0
1 Porsche 10.0 0.0 2.8 3.5
2 Golf 0.0 1.7 3.0 2.0
3 Tiguan 1.0 0.0 3.0 5.2
4 Touareg 0.0 0.0 3.0 4.2
5 Mercedes 0.0 0.0 0.0 7.2
6 Passat 1.0 4.7 9.0 11.4
I'd appreciate any ideas. Thanks!
First we use .isin to get the correct cars, then we use .filter to get the value columns, and finally we sum the values and store them in the variable sums.
Then we select the Passat row and add the values to that row:
sums = df[df['Car'].isin(car_list)].filter(regex=r'\w{3}\d{2}').sum()
df.loc[df['Car'].eq('Passat'), 'Jan17':] += sums
Output
ID Car Jan17 Jun18 Dec18 Apr19
0 0 Nissan 0.0 1.7 3.7 0.0
1 1 Porsche 10.0 0.0 2.8 3.5
2 2 Golf 0.0 1.7 3.0 2.0
3 3 Tiguan 1.0 0.0 3.0 5.2
4 4 Touareg 0.0 0.0 3.0 4.2
5 5 Mercedes 0.0 0.0 0.0 7.2
6 6 Passat 1.0 4.7 9.0 11.4
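The same idea without the regex, naming the month columns explicitly (just an alternative spelling of the answer above; the column list is taken from the question's df):
month_cols = ['Jan17', 'Jun18', 'Dec18', 'Apr19']
sums = df.loc[df['Car'].isin(car_list), month_cols].sum()
df.loc[df['Car'].eq('Passat'), month_cols] += sums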
An alternative solution, in the form of a function:
car_list = ['Golf', 'Tiguan', 'Touareg', 'Passat']

def updateCarInfoBySum(df, car_list, name, id):
    # Work on a copy of the relevant rows to avoid chained-assignment warnings
    req = df[df['Car'].isin(car_list)].copy()
    req.set_index(['Car', 'ID'], inplace=True)
    req.loc[('new_value', '000'), :] = req.sum(axis=0)
    req.reset_index(inplace=True)
    req = req[req.Car != name]
    req.loc[req['Car'] == 'new_value', 'Car'] = name
    req.loc[req['ID'] == '000', 'ID'] = id
    req.set_index(['Car', 'ID'], inplace=True)
    df_final = df.copy()
    df_final.set_index(['Car', 'ID'], inplace=True)
    df_final.update(req)
    return df_final
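A hypothetical call, with argument values taken from the question's data (6 is the Passat row's ID):
df_final = updateCarInfoBySum(df, car_list, 'Passat', 6)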

How to return a column value based on integer index and another column value in pandas?

I have a df that I want to group by Date and apply a function to.
Date Symbol Shares
0 1990-01-01 A 0.0
1 1990-01-01 B 0.0
2 1990-01-01 C 0.0
3 1990-01-01 D 0.0
4 1990-01-02 A 50.0
5 1990-01-02 B 100.0
6 1990-01-02 C 66.0
7 1990-01-02 D 7.0
8 1990-01-03 A 11.0
9 1990-01-03 B 123.0
10 1990-01-03 C 11.0
11 1990-01-03 D 11.0
I should be able to access, inside the function, the Shares value for a Symbol from the previous Date. How can I do that? Creating df['prev_shares'] with df.groupby('Symbol')['Shares'].shift(1) before applying the function is not an option because the function calculates Shares row by row. It should look like:
def calcs(x):
    x.loc[some_condition, 'Shares'] = ...
    x.loc[other_condition, 'Shares'] = # return 'Shares' from previous 'Date' for this 'Symbol'

df = df.groupby('Date').apply(calcs)
Any help appreciated.
EDIT:
Here is the function that I created.
Equity = 10000

def calcs(x):
    global Equity
    if x.index[0] == 0:
        return x
    x.loc[x['condition'] == True, 'Shares'] = np.floor((Equity * 0.02 / x['ATR']).astype(float))
    x.loc[x['condition'] == False, 'Shares'] = # locate Symbol for the previous Date and return its Shares value
    x['Closed_P/L'] = x['Shares'] * x['Close']
    Equity += x['Closed_P/L'].sum()
    return x

data = data.groupby('Date').apply(calcs)
IIUC something like this might help, although what you're trying to accomplish in the end is unclear. Below you will find the data grouped by 'Symbol' and 'Date'. Column 'diff' shows the incremental P/L.
Please advise what you need additionally.
df = data.set_index(['Symbol', 'Date']).sort_index()[['Shares']]
df['diff'] = np.nan
idx = pd.IndexSlice
for ix in df.index.levels[0]:
    df.loc[idx[ix, :], 'diff'] = df.loc[idx[ix, :], 'Shares'].diff()
df.fillna(0)
Shares diff
Symbol Date
A 1990-01-01 0.0 0.0
1990-01-02 50.0 50.0
1990-01-03 11.0 -39.0
B 1990-01-01 0.0 0.0
1990-01-02 100.0 100.0
1990-01-03 123.0 23.0
C 1990-01-01 0.0 0.0
1990-01-02 66.0 66.0
1990-01-03 11.0 -55.0
D 1990-01-01 0.0 0.0
1990-01-02 7.0 7.0
1990-01-03 11.0 4.0
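The explicit loop over the first index level can also be written as a grouped diff on the same 'Symbol'/'Date' MultiIndex, which gives the same result:
# Group on the 'Symbol' level and take the difference within each group.
df['diff'] = df.groupby(level='Symbol')['Shares'].diff().fillna(0)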