Pandas condition across 2 dataframes - pandas

I have 2 dataframes df1 and df2
df1;
A B C
0 11 22 55
1 66 34 54
2 0 34 66
df2;
A B C
0 11 33 455
1 0 0 54
2 0 34 766
Both dataframes have the same dimensions. Wherever df2 holds a 0, I want to set the value at the same index and column in df1 to 0.
So df1 will be
df1;
A B C
0 11 22 55
1 0 0 54
2 0 34 66

Use DataFrame.mask:
df1 = df1.mask(df2 == 0, 0)
For better performance use numpy.where:
df1 = pd.DataFrame(np.where(df2 == 0, 0, df1),
                   index=df1.index,
                   columns=df1.columns)
print(df1)
A B C
0 11 22 55
1 0 0 54
2 0 34 66
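As a self-contained check, the snippet below rebuilds the two frames from the question and applies both approaches. The names masked and via_numpy are only illustrative.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"A": [11, 66, 0], "B": [22, 34, 34], "C": [55, 54, 66]})
df2 = pd.DataFrame({"A": [11, 0, 0], "B": [33, 0, 34], "C": [455, 54, 766]})

# mask replaces the values of df1 wherever the condition is True
masked = df1.mask(df2 == 0, 0)

# np.where works on the underlying arrays, so the labels must be restored
via_numpy = pd.DataFrame(np.where(df2 == 0, 0, df1),
                         index=df1.index,
                         columns=df1.columns)

print(masked)
print(via_numpy)
```

Both frames should hold the same values; the NumPy version skips label alignment, which is where any speed-up would come from.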

Using where:
df1 = df1.where(df2.ne(0), 0)
print(df1)
A B C
0 11 22 55
1 0 0 54
2 0 34 66

Another way -
df1 = df1[~df2.eq(0)].fillna(0)
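One caveat with this last form: boolean indexing first turns the masked cells into NaN, so integer columns can be upcast to float before fillna runs. A small sketch on the question's data; the trailing astype(int) is an optional addition, not part of the answer above.

```python
import pandas as pd

df1 = pd.DataFrame({"A": [11, 66, 0], "B": [22, 34, 34], "C": [55, 54, 66]})
df2 = pd.DataFrame({"A": [11, 0, 0], "B": [33, 0, 34], "C": [455, 54, 766]})

# cells where df2 is 0 become NaN, then fillna puts 0 back
out = df1[~df2.eq(0)].fillna(0)
print(out.dtypes)  # some columns may now be float64

# cast back if integer columns are needed
out = out.astype(int)
print(out)
```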

Related

iteration calculation based on another dataframe

How can I do the iterative calculation shown in df2 below, which is my desired output?
Any reference links would be appreciated. Many thanks for helping.
df1
a b c
0 1 0 5
1 9 9 2
2 2 2 8
3 6 3 0
4 6 1 7
df2 :
a b c
0 1 0 5 >> values from df1
1 19 18 9 >> values from (df1.iloc[1] * 2) + (df2.iloc[0] * 1)
2 23 22 25 >> values from (df1.iloc[2] * 2) + (df2.iloc[1] * 1)
3 35 28 25 >> values from (df1.iloc[3] * 2) + (df2.iloc[2] * 1)
4 47 30 39 >> values from (df1.iloc[4] * 2) + (df2.iloc[3] * 1)
IIUC, you can try:
df2 = df1.mul(2).cumsum().sub(df1.iloc[0])
Output:
a b c
0 1 0 5
1 19 18 9
2 23 22 25
3 35 28 25
4 47 30 39
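To see why the one-liner works: unrolling df2[n] = 2*df1[n] + df2[n-1] with df2[0] = df1[0] gives df2[n] = 2*(df1[0] + ... + df1[n]) - df1[0]. The sketch below compares that closed form against an explicit loop; loop_rows and expected are illustrative names.

```python
import pandas as pd

df1 = pd.DataFrame({"a": [1, 9, 2, 6, 6],
                    "b": [0, 9, 2, 3, 1],
                    "c": [5, 2, 8, 0, 7]})

# closed form from the answer
df2 = df1.mul(2).cumsum().sub(df1.iloc[0])

# explicit recurrence, one row at a time
loop_rows = [df1.iloc[0]]
for i in range(1, len(df1)):
    loop_rows.append(df1.iloc[i] * 2 + loop_rows[-1])
expected = pd.DataFrame(loop_rows).reset_index(drop=True)

print(df2)
print(expected)
```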
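More complex operation
If you want x[n] = x[n]*2 + x[n-1]*3, where x[n-1] is the previously computed output, you need to iterate:
def process(s):
    # start from the first value, taken by position
    out = [s.iloc[0]]
    for x in s.iloc[1:]:
        out.append(x*2 + out[-1]*3)
    return out
df1.apply(process)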
Output:
a b c
0 1 0 5
1 21 18 19
2 67 58 73
3 213 180 219
4 651 542 671
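The same loop can also be written with itertools.accumulate from the standard library. This is an alternative sketch rather than the answer's code, and the helper name recur is hypothetical.

```python
from itertools import accumulate

import pandas as pd

df1 = pd.DataFrame({"a": [1, 9, 2, 6, 6],
                    "b": [0, 9, 2, 3, 1],
                    "c": [5, 2, 8, 0, 7]})

def recur(s):
    # out[0] = s[0]; out[n] = 2*s[n] + 3*out[n-1]
    values = accumulate(s.tolist(), lambda prev, x: 2 * x + 3 * prev)
    return pd.Series(list(values), index=s.index)

result = df1.apply(recur)
print(result)
```

Returning a Series with the original index keeps the row labels intact when apply assembles the columns.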

How to create and populate pandas columns based on cell values

I have created a dataframe called df as follows:
import pandas as pd
d = {'feature1': [1, 22,45,78,78], 'feature2': [33, 2,2,65,65], 'feature3': [100, 2,359,87,2],}
df = pd.DataFrame(data=d)
print(df)
The dataframe looks like this:
feature1 feature2 feature3
0 1 33 100
1 22 2 2
2 45 2 359
3 78 65 87
4 78 65 2
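I want to create two new columns called Freq_1 and Freq_2 that count, for each record, how many times the number 1 and number 2 appear respectively. So, I'd like the resulting dataframe to look like this:
feature1 feature2 feature3 Freq_1 Freq_2
0 1 33 100 1 0
1 22 2 2 0 2
2 45 2 359 0 1
3 78 65 87 0 0
4 78 65 2 0 1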
So, let's take a look at the column called Freq_1:
for the first record, it's equal to 1 because the number 1 appears only once across the whole first record;
for the other records, it's equal to 0 because the number 1 never appears.
Let's take a look now at the column called Freq_2:
for the first record, Freq_2 is equal to 0 because number 2 doesn't appear;
for second record, Freq_2 is equal to 2 because the number 2 appears twice;
and so on ...
How do I create the columns Freq_1 and Freq_2 in pandas?
Try this:
freq = {
    i: df.eq(i).sum(axis=1) for i in range(10)
}
pd.concat([df, pd.DataFrame(freq).add_prefix("Freq_")], axis=1)
Result:
feature1 feature2 feature3 Freq_0 Freq_1 Freq_2 Freq_3 Freq_4 Freq_5 Freq_6 Freq_7 Freq_8 Freq_9
1 33 100 0 1 0 0 0 0 0 0 0 0
22 2 2 0 0 2 0 0 0 0 0 0 0
45 2 359 0 0 1 0 0 0 0 0 0 0
78 65 87 0 0 0 0 0 0 0 0 0 0
78 65 2 0 0 1 0 0 0 0 0 0 0
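Since the question only asks about the values 1 and 2, the same idea can be restricted to those two. A minimal sketch; targets and features are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({"feature1": [1, 22, 45, 78, 78],
                   "feature2": [33, 2, 2, 65, 65],
                   "feature3": [100, 2, 359, 87, 2]})

# capture the original columns before any Freq_ column is added
features = df.columns
targets = [1, 2]

for value in targets:
    # eq gives a boolean frame; summing across a row counts the matches
    df[f"Freq_{value}"] = df[features].eq(value).sum(axis=1)

print(df)
```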
String pattern matching can be performed once the columns are cast to strings.
d = {'feature1': [1, 22,45,78,78], 'feature2': [33, 2,2,65,65], 'feature3': [100, 2,359,87,2],}
df = pd.DataFrame(data=d)
df = df.stack().astype(str).unstack()
Now we can iterate over each pattern we are looking for. str.fullmatch matches whole values only (so '100' does not count as a '1'), and summing the matches per row gives the frequency:
useful_columns = df.columns
for pattern in ['1', '2']:
    df[f'freq_{pattern}'] = df[useful_columns].stack().str.fullmatch(pattern).unstack().sum(axis=1)
Printing the output:
feature1 feature2 feature3 freq_1 freq_2
0 1 33 100 1 0
1 22 2 2 0 2
2 45 2 359 0 1
3 78 65 87 0 0
4 78 65 2 0 1
We can do
s = df.where(df.isin([1,2])).stack()
out = df.join(pd.crosstab(s.index.get_level_values(0),s).add_prefix('Freq_')).fillna(0)
Out[299]:
feature1 feature2 feature3 Freq_1.0 Freq_2.0
0 1 33 100 1.0 0.0
1 22 2 2 0.0 2.0
2 45 2 359 0.0 1.0
3 78 65 87 0.0 0.0
4 78 65 2 0.0 1.0

how to conditionally create new column in dataframe based on other column values in julia

I have a julia dataframe that looks like this:
time data
0 34
1 34
2 30
3 37
4 32
5 35
How do I create a new binary column that is 0 if time is less than 2 or greater than 4, and 1 otherwise?
Like this:
time data x
0 34 0
1 34 0
2 30 1
3 37 1
4 32 1
5 35 0
In python, I would do something like:
def func(df):
    if df.time < 2 or df.time > 4:
        return 0
    else:
        return 1
df['x'] = df.apply(func, axis=1)
In Julia we have the beautiful Dot Syntax which can be gracefully applied here:
julia> df[!, :x] = 2 .<= df[!, :time] .<= 4
6-element BitVector:
0
0
1
1
1
0
or alternatively
df.x = 2 .<= df.time .<= 4
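For readers coming from the pandas snippet in the question, the same range test can be written without apply. This is a pandas sketch for comparison only, not Julia code.

```python
import pandas as pd

df = pd.DataFrame({"time": [0, 1, 2, 3, 4, 5],
                   "data": [34, 34, 30, 37, 32, 35]})

# between is inclusive on both ends by default, like 2 .<= time .<= 4
df["x"] = df["time"].between(2, 4).astype(int)
print(df)
```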

applying defined function over different ranges of rows in pandas

I have a df like this with tons of rows :
BB AA FF
2 5 0
3 7 A
6 5 A
9 6 A
8 3 0
And a function like this :
def test(a,b):
    # a=array col AA
    # b=array col BB
    return (a*b)+a
For the rows where column FF is not 0, I would like to apply test to the corresponding slices (arrays) of columns BB and AA, producing the following output in a new column ZZ:
BB AA FF ZZ
2 5 0 0
3 7 A 28
6 5 A 35
9 6 A 60
8 3 0 0
I was thinking of something like:
df['zz']= df.apply(lambda x: test(df.AA,df.BB) for the range of values among zero)
But I am not sure how to select the slices marked in column FF and apply the function to them.
You can use DataFrame.apply + mask:
def test(x):
    # x is one row holding the AA and BB values; index it by label
    return (x['AA']*x['BB'])+x['AA']
df['ZZ']=df[['AA','BB']].apply(test,axis=1).mask(df['FF'].eq('0'),0)
print(df)
BB AA FF ZZ
0 2 5 0 0
1 3 7 A 28
2 6 5 A 35
3 9 6 A 60
4 8 3 0 0
Or you can use a lambda function, returning a numeric 0 for the excluded rows:
df['ZZ']=df.apply(lambda x: x[['BB','AA']].prod()+ x['AA'] if x['FF'] != '0' else 0,axis=1)
print(df)
BB AA FF ZZ
0 2 5 0 0
1 3 7 A 28
2 6 5 A 35
3 9 6 A 60
4 8 3 0 0

adding one to all the values in a dataframe

I have a dataframe like the one below, and I would like to add one to every value in each row. I am new to this forum and to Python, so I can't work out how to do this. I intend to use Bayes probability, and without the added 1 the posterior probability will be 0 when I multiply the values. PS: I am also new to probability, but others have applied the same method. I am using pandas. Thanks in advance for your help.
Disease Gene1 Gene2 Gene3 Gene4
D1 0 0 25 0
D2 0 0 0 0
D3 0 17 0 16
D4 24 0 0 0
D5 0 0 0 0
D6 0 32 0 11
D7 0 0 0 0
D8 4 0 0 0
With this being your dataframe:
df = pd.DataFrame({
    "Disease":[f"D{i}" for i in range(1,9)],
    "Gene1":[0,0,0,24,0,0,0,4],
    "Gene2":[0,0,17,0,0,32,0,0],
    "Gene3":[25,0,0,0,0,0,0,0],
    "Gene4":[0,0,16,0,0,11,0,0]})
Disease Gene1 Gene2 Gene3 Gene4
0 D1 0 0 25 0
1 D2 0 0 0 0
2 D3 0 17 0 16
3 D4 24 0 0 0
4 D5 0 0 0 0
5 D6 0 32 0 11
6 D7 0 0 0 0
7 D8 4 0 0 0
The easiest way to do this is
df += 1
However, since the Disease column holds strings, this will not work.
But we can conveniently set the Disease column to be the index, like this:
df.set_index('Disease', inplace=True)
Now your dataframe looks like this:
Gene1 Gene2 Gene3 Gene4
Disease
D1 0 0 25 0
D2 0 0 0 0
D3 0 17 0 16
D4 24 0 0 0
D5 0 0 0 0
D6 0 32 0 11
D7 0 0 0 0
D8 4 0 0 0
And if we do df += 1 now, we get:
Gene1 Gene2 Gene3 Gene4
Disease
D1 1 1 26 1
D2 1 1 1 1
D3 1 18 1 17
D4 25 1 1 1
D5 1 1 1 1
D6 1 33 1 12
D7 1 1 1 1
D8 5 1 1 1
because the plus operation only acts on the data columns, not on the index.
You can also do this on column basis, like this:
df["Gene1"] = df["Gene1"] + 1
Alternatively, you can select the columns whose underlying dtype is not 'object':
In [110]:
numeric_cols = [col for col in df if df[col].dtype.kind != 'O']
numeric_cols
Out[110]:
['Gene1', 'Gene2', 'Gene3', 'Gene4']
In [111]:
df[numeric_cols] += 1
df
Out[111]:
Disease Gene1 Gene2 Gene3 Gene4
0 D1 1 1 26 1
1 D2 1 1 1 1
2 D3 1 18 1 17
3 D4 25 1 1 1
4 D5 1 1 1 1
5 D6 1 33 1 12
6 D7 1 1 1 1
7 D8 5 1 1 1
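A related option is DataFrame.select_dtypes, which picks out the numeric columns without inspecting dtype.kind by hand. A short sketch on a trimmed, illustrative copy of the frame:

```python
import pandas as pd

df = pd.DataFrame({"Disease": ["D1", "D2", "D3"],
                   "Gene1": [0, 0, 0],
                   "Gene2": [0, 0, 17]})

# "number" selects all integer and float columns
numeric_cols = df.select_dtypes(include="number").columns
df[numeric_cols] += 1
print(df)
```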
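EDIT
It looks like your df possibly holds strings instead of numeric types. convert_objects has been removed from pandas, so convert the gene columns to numeric with pd.to_numeric:
gene_cols = df.columns.drop('Disease')
df[gene_cols] = df[gene_cols].apply(pd.to_numeric, errors='coerce')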