Replace value in pandas dataframe based on where condition [duplicate] - pandas

This question already has answers here:
Efficiently replace values from a column to another column Pandas DataFrame
(5 answers)
Closed 10 months ago.
I have created a dataframe called df with this code:
import numpy as np
import pandas as pd
# initialize data of lists.
data = {'Feature1': [1, 2, -9999999, 4, 5],
        'Age': [20, 21, 19, 18, 34]}
# Create DataFrame
df = pd.DataFrame(data)
print(df)
The dataframe looks like this:
   Feature1  Age
0         1   20
1         2   21
2  -9999999   19
3         4   18
4         5   34
Every time there is a value of -9999999 in column Feature1, I need to replace it with the corresponding value from column Age. So the output dataframe would look like this:
   Feature1  Age
0         1   20
1         2   21
2        19   19
3         4   18
4         5   34
Bear in mind that the actual dataframe that I am using has 200K records (the one I have shown above is just an example).
How do I do that in pandas?

You can use np.where or Series.mask
df['Feature1'] = df['Feature1'].mask(df['Feature1'].eq(-9999999), df['Age'])
# or
df['Feature1'] = np.where(df['Feature1'].eq(-9999999), df['Age'], df['Feature1'])
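An equivalent alternative, if you prefer boolean indexing with .loc (a sketch using the sample data from the question):

```python
import pandas as pd

df = pd.DataFrame({'Feature1': [1, 2, -9999999, 4, 5],
                   'Age': [20, 21, 19, 18, 34]})

# boolean indexing with .loc: overwrite only the rows holding the sentinel value
mask = df['Feature1'].eq(-9999999)
df.loc[mask, 'Feature1'] = df.loc[mask, 'Age']
print(df)
```

All three approaches are vectorized, so any of them will handle 200K rows quickly.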

Related

Merge Dataframes and apply different math function by column

I have 3 DataFrames like below.
A =
ops lat
0 9,453 13,536
1 8,666 14,768
2 8,377 15,278
3 8,236 15,536
4 8,167 15,668
5 8,099 15,799
6 8,066 15,867
7 8,029 15,936
8 7,997 16,004
9 7,969 16,058
10 7,962 16,073
B =
ops lat
0 9,865 12,967
1 8,908 14,366
2 8,546 14,976
3 8,368 15,294
4 8,289 15,439
5 8,217 15,571
6 8,171 15,662
7 8,130 15,741
8 8,093 15,809
9 8,072 15,855
10 8,058 15,882
C =
ops lat
0 9,594 13,332
1 8,718 14,670
2 8,396 15,242
3 8,229 15,553
4 8,137 15,725
5 8,062 15,875
6 8,008 15,982
7 7,963 16,070
8 7,919 16,159
9 7,892 16,218
10 7,874 16,255
How do I merge them into a single dataframe where the ops column is the sum and the lat column is the average of these three dataframes?
pd.concat() - seems to append the dataframes.
There are likely many ways, but to keep it on the same line of thinking you had with pd.concat, the approach below will work.
First, concat your dataframes together; then calculate .sum() and .mean() on the newly created dataframe and construct the final table from those two fields.
Dummy Data and Example Below:
import pandas as pd
data = {'Name': ['node1', 'node1', 'node1', 'node2', 'node2', 'node3'],
        'Value': [1000, 20000, 40000, 30000, 589, 682],
        'Value2': [303, 2084, 494, 2028, 4049, 112]}
df1 = pd.DataFrame(data)
data2 = {'Name': ['node1', 'node1', 'node1', 'node2', 'node2', 'node3'],
         'Value': [1000, 20000, 40000, 30000, 589, 682],
         'Value2': [8, 234, 75, 123, 689, 1256]}
df2 = pd.DataFrame(data2)
joined = pd.concat([df1, df2])
final = pd.DataFrame({'Sum_Col': [joined["Value"].sum()],
                      'Mean_Col': [joined["Value2"].mean()]})
display(final)
display(final)
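Note the above collapses each column to a single scalar. If the goal is instead a row-wise sum of ops and a row-wise mean of lat across the three frames (matching them by index position), one sketch, using small dummy frames in place of the full A/B/C, is:

```python
import pandas as pd

# dummy stand-ins for the question's A, B, C (first two rows only)
A = pd.DataFrame({'ops': [9453, 8666], 'lat': [13536, 14768]})
B = pd.DataFrame({'ops': [9865, 8908], 'lat': [12967, 14366]})
C = pd.DataFrame({'ops': [9594, 8718], 'lat': [13332, 14670]})

# stack the frames, then aggregate rows that share the same original index label
merged = (pd.concat([A, B, C])
            .groupby(level=0)
            .agg({'ops': 'sum', 'lat': 'mean'}))
print(merged)
```

groupby(level=0) groups by the (repeated) index labels that pd.concat preserves, so each output row combines the matching rows of A, B, and C.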

Using apply for multiple columns

I need to create 2 new columns based on 2 existing columns. I am trying to do it with a single apply function instead of 2 separate apply calls.
Initial Df for example is as follows:
ID1 ID2
0 1 11
1 2 12
2 3 13
3 4 14
4 5 15
5 6 16
6 7 17
7 8 18
8 9 19
9 10 20
Next I try to create 2 new columns using the below method:
def funct(row):
    list1 = row.values
    print(list1[0])
    return row

df[['s1','s2']] = df[['ID1',"ID2"]].apply(lambda row: funct(row))
The issue is that I want to access the values individually, which I am unable to do. Here I tried converting to a list, but when I do list1[0] I get
1
11
How do I access 1 and 11 above? How should I index to access individual values when I send two series together using apply?
NOTE: funct() just returns the same row for now, because I still don't know how to access the values in order to do something with them.
Add the parameter axis=1 to your apply call, like so:
import pandas as pd
from io import StringIO
s = """
,ID1,ID2
0,1,11
1,2,12
2,3,13
3,4,14
4,5,15
5,6,16
6,7,17
7,8,18
8,9,19
9,10,20
"""
df = pd.read_csv(StringIO(s),index_col=0)
def funct(row):
    # with axis=1, apply passes each row as a Series, so each value
    # is accessible by column name (row.ID1, row.ID2)
    return pd.Series([row.ID1 + 100, row.ID2 + 20])

df[['s1','s2']] = df[['ID1',"ID2"]].apply(funct, axis=1)
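Worth noting: apply with axis=1 runs a Python-level loop over rows, which gets slow on large frames. When the new columns are simple arithmetic on existing ones, plain column-wise expressions give the same result (a sketch, reusing the +100/+20 example):

```python
import pandas as pd

df = pd.DataFrame({'ID1': range(1, 11), 'ID2': range(11, 21)})

# vectorized column arithmetic: no per-row Python function calls
df['s1'] = df['ID1'] + 100
df['s2'] = df['ID2'] + 20
print(df.head())
```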

Multi-indexed series into DataFrame and reformat

I have a correlation matrix of stock returns in a Pandas DataFrame and I want to extract the top/bottom 10 correlated pairs from the matrix.
Sample DataFrame:
import pandas as pd
import numpy as np
data = np.random.randint(5,30,size=500)
df = pd.DataFrame(data.reshape((50,10)))
corr = df.corr()
This is my function to get the top/bottom 10 correlated pairs by 1) first returning a multi-indexed series (high) for highest correlated pairs, and then 2) unstacking back into a DataFrame (high_df):
def get_rankings(corr_matrix):
    # the matrix is symmetric, so extract the upper triangle without the diagonal (k=1)
    # note: use the builtin bool here; np.bool is deprecated and removed in NumPy >= 1.24
    ranked_corr = (corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
                   .stack()
                   .sort_values(ascending=False))
    high = ranked_corr[:10]
    high_df = high.unstack().fillna("")
    return high_df
get_rankings(corr)
My current DF output looks something like this:
6 4 5 7 8 3 9
3 0.359 0.198
1 0.275
4 0.257
2 0.176 0.154
0 0.153 0.164
5 0.156
But I want it to look like this, in either 2 or 3 columns:
ID1 ID2 Corr
0 9 0.304471
2 8 0.271009
2 3 0.147702
7 9 0.146176
0 7 0.144549
7 8 0.111888
4 6 0.098619
1 7 0.092338
1 4 0.09091
3 6 0.079688
It needs to be in a DataFrame so I can pass it to a grid widget, which only accepts DataFrames. Can anyone help me rehash the shape of the unstacked DF?
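One possible approach (a sketch, since the key step is simply to skip the unstack): the stacked Series already holds one row per pair, so reset_index converts its MultiIndex into ordinary columns:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(5, 30, size=500).reshape(50, 10))
corr = df.corr()

# keep only the upper triangle (k=1 drops the diagonal), stack, and sort
ranked = (corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
              .stack()
              .sort_values(ascending=False))

# reset_index turns the MultiIndex pair into two ordinary columns
top10 = ranked.head(10).reset_index()
top10.columns = ['ID1', 'ID2', 'Corr']
print(top10)
```

The result is a plain three-column DataFrame, which a grid widget should accept; for the bottom 10 pairs, use ranked.tail(10) instead.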

the column in csv that comes from the index of DataFrame does not have a header name

here is a pandas DataFrame
>>> print(df)
A B C
0 0 1 2
1 3 4 5
2 6 7 8
With df.to_csv('df.csv') I get a file in which the column that comes from the index of the DataFrame does not have a header name. Is it possible to specify a column name with pandas?
Try with rename_axis
df.rename_axis('index').to_csv('df.csv')
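Alternatively, to_csv accepts an index_label argument directly, so the rename step isn't strictly needed. A sketch, writing to an in-memory buffer just to show the resulting header:

```python
import pandas as pd
from io import StringIO

df = pd.DataFrame({'A': [0, 3, 6], 'B': [1, 4, 7], 'C': [2, 5, 8]})

# index_label names the index column in the CSV header
buf = StringIO()
df.to_csv(buf, index_label='index')
print(buf.getvalue().splitlines()[0])  # header line: index,A,B,C
```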

How to check dependency of one column to another in a pandas dataframe

I have the following dataframe:
import pandas as pd
df=pd.DataFrame([[1,11,'a'],[1,12,'a'],[1,11,'a'],[1,12,'a'],[1,7,'a'],
[1,12,'a']])
df.columns=['id','code','name']
df
id code name
0 1 11 a
1 1 12 a
2 1 11 a
3 1 12 a
4 1 7 a
5 1 12 a
As shown in the above dataframe, the value of column "id" is directly related to the value of column "name". If I have, say, a million records, how can I know that one column is totally dependent on another column in a dataframe?
If they are totally dependent, then their factorizations will be the same
(df.id.factorize()[0] == df.name.factorize()[0]).all()
True
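Note that comparing factorizations tests a one-to-one correspondence between the columns. If you only need "name is determined by id" (but not necessarily the reverse), a per-group uniqueness check also works. A small sketch of both:

```python
import pandas as pd

df = pd.DataFrame({'id':   [1, 1, 2, 2],
                   'name': ['a', 'a', 'b', 'b']})

# factorize assigns codes in order of first appearance, so equal code
# sequences mean the columns partition the rows identically
same = (df['id'].factorize()[0] == df['name'].factorize()[0]).all()

# one-directional dependence: every id maps to exactly one name
dependent = df.groupby('id')['name'].nunique().eq(1).all()
print(same, dependent)
```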