Pandas condition across 2 dataframes

Pandas condition across 2 dataframes - pandas

I have 2 dataframes df1 and df2
df1;
A B C
0 11 22 55
1 66 34 54
2 0 34 66
df2;
A B C
0 11 33 455
1 0 0 54
2 0 34 766
Both dataframes have the same dimensions. I want to say if value is 0 in df2 then give that value (based on column and index) a 0 in df1.
So df1 will be
df1;
A B C
0 11 22 55
1 0 0 54
2 0 34 66

Use DataFrame.mask:
df1 = df1.mask(df2 == 0, 0)
For better performance use numpy.where:
df1 = pd.DataFrame(np.where(df2 == 0, 0, df1),
index=df1.index,
columns=df1.columns)
print (df1)
A B C
0 11 22 55
1 0 0 54
2 0 34 66

Using where:
df1 = df1.where(df2.ne(0), 0)
print(df1)
A B C
0 11 22 55
1 0 0 54
2 0 34 66

Another way -
df1 = df1[~df2.eq(0)].fillna(0)

Related

iteration calculation based on another dataframe

How to do iteration calculation as shown in df2 as desired output ?
any reference links for this > many thanks for helping
df1
a b c
0 1 0 5
1 9 9 2
2 2 2 8
3 6 3 0
4 6 1 7
df2 :
a b c
0 1 0 5 >> values from df1
1 19 18 9 >> values from (df1.iloc[1] * 2) + df2.iloc[0] *1)
2 23 22 25 >> values from (df1.iloc[2] * 2) + df2.iloc[1] *1)
3 35 28 25 >> values from (df1.iloc[3] * 2) + df2.iloc[2] *1)
4 47 30 39 >> values from (df1.iloc[4] * 2) + df2.iloc[3] *1)

IIUC, you can try:
df2 = df1.mul(2).cumsum().sub(df1.iloc[0])
Output:
a b c
0 1 0 5
1 19 18 9
2 23 22 25
3 35 28 25
4 47 30 39
more complex operation
If you want x[n] = x[n]*2 + x[n-1]*2, you need to iterate:
def process(s):
out = [s[0]]
for x in s[1:]:
out.append(x*2+out[-1]*3)
return out
df1.apply(process)
Output:
a b c
0 1 0 5
1 21 18 19
2 67 58 73
3 213 180 219
4 651 542 671

How to create and populate pandas columns based on cell values

I have created a dataframe called df as follows:
import pandas as pd
d = {'feature1': [1, 22,45,78,78], 'feature2': [33, 2,2,65,65], 'feature3': [100, 2,359,87,2],}
df = pd.DataFrame(data=d)
print(df)
The dataframe looks like this:
I want to create two new columns called Freq_1 and Freq_2 that count, for each record, how many times the number 1 and number 2 appear respectively. So, I'd like the resulting dataframe to look like this:
So, let's take a look at the column called Freq_1:
for the first record, it's equal to 1 because the number 1 appears only once across the whole first record;
for the other records, it's equal to 0 because the number 1 never appears.
Let's take a look now at the column called Freq_2:
for the first record, Freq_2 is equal to 0 because number 2 doesn't appear;
for second record, Freq_2 is equal to 2 because the number 2 appears twice;
and so on ...
How do I create the columns Freq_1 and Freq_2 in pandas?

Try this:
freq = {
i: df.eq(i).sum(axis=1) for i in range(10)
}
pd.concat([df, pd.DataFrame(freq).add_prefix("Freq_")], axis=1)
Result:
feature1 feature2 feature3 Freq_0 Freq_1 Freq_2 Freq_3 Freq_4 Freq_5 Freq_6 Freq_7 Freq_8 Freq_9
1 33 100 0 1 0 0 0 0 0 0 0 0
22 2 2 0 0 2 0 0 0 0 0 0 0
45 2 359 0 0 1 0 0 0 0 0 0 0
78 65 87 0 0 0 0 0 0 0 0 0 0
78 65 2 0 0 1 0 0 0 0 0 0 0

String pattern matching can be performed when the columns are casted to string columns.
d = {'feature1': [1, 22,45,78,78], 'feature2': [33, 2,2,65,65], 'feature3': [100, 2,359,87,2],}
df = pd.DataFrame(data=d)
df = df.stack().astype(str).unstack()
Now we can iterate for each pattern that we are looking for:
usefull_columns = df.columns
for pattern in ['1', '2']:
df[f'freq_{pattern}'] = df[usefull_columns].stack().str.count(pattern).unstack().max(axis=1)
Printing the output:
feature1 feature2 feature3 freq_1 freq_2
0 1 33 100 1.0 0.0
1 22 2 2 0.0 2.0
2 45 2 359 0.0 1.0
3 78 65 87 0.0 0.0
4 78 65 2 0.0 1.0

We can do
s = df.where(df.isin([1,2])).stack()
out = df.join(pd.crosstab(s.index.get_level_values(0),s).add_prefix('Freq_')).fillna(0)
Out[299]:
feature1 feature2 feature3 Freq_1.0 Freq_2.0
0 1 33 100 1.0 0.0
1 22 2 2 0.0 2.0
2 45 2 359 0.0 1.0
3 78 65 87 0.0 0.0
4 78 65 2 0.0 1.0

how to conditionally create new column in dataframe based on other column values in julia

I have a julia dataframe that looks like this:
time data
0 34
1 34
2 30
3 37
4 32
5 35
How do I create a new binary column that is 0 if time is less than 2 and greater than 4, and 1 if not either condition?
Like this:
time data x
0 34 0
1 34 0
2 30 1
3 37 1
4 32 1
5 35 0
In python, I would do something like:
def func(df):
if df.time < 2 or df.time > 4:
return 0
else:
return 1
df['x'] = df.apply(func, axis=1)

In Julia we have the beautiful Dot Syntax which can be gracefully applied here:
julia> df[!, :x] = 2 .<= df[!, :time] .<= 4
6-element BitVector:
0
0
1
1
1
0
or alternatively
df.x = 2 .<= df.time .<= 4

applying defined function over different ranges of rows in pandas

I have a df like this with tons of rows :
BB AA FF
2 5 0
3 7 A
6 5 A
9 6 A
8 3 0
And a function like this :
def test(a,b):
# a=array col AA
# b=array col BB
return (a*b)+a
I would like that for the rows in column FF where values are != 0 to apply the function test over that slice (array) of the df that involves column BB and AA to generate the following output in the new column ZZ:
BB AA FF ZZ
2 5 0 0
3 7 A 28
6 5 A 35
9 6 A 51
8 3 0 0
I was thinking in something like:
df['zz']= df.apply(lambda x: test(df.AA,df.BB) for the range of values among zero)
But my issue is that I am not sure on how to specify de arrays in column FF to apply the column

You can use DataFrame.apply + mask:
def test(x):
return (x[0]*x[1])+x[0]
df['ZZ']=df[['AA','BB']].apply(test,axis=1).mask(df['FF'].eq('0'),0)
print(df)
BB AA FF ZZ
0 2 5 0 0
1 3 7 A 28
2 6 5 A 35
3 9 6 A 60
4 8 3 0 0
or you can use lambda function:
df['ZZ']=df.apply(lambda x: x[['BB','AA']].prod()+ x['AA'] if x['FF'] != '0' else x['FF'],axis=1)
print(df)
BB AA FF ZZ
0 2 5 0 0
1 3 7 A 28
2 6 5 A 35
3 9 6 A 60
4 8 3 0 0

adding one to all the values in a dataframe

I have a dataframe like the one below. I would like to add one to all of the values in each row. I am new to this forum and python so i can't conceptualise how to do this. I need to add 1 to each value. I intend to use bayes probability and the posterior probability will be 0 when i multiply them. PS. I am also new to probability but others have applied the same method. Thanks for your help in advance. I am using pandas to do this.
Disease Gene1 Gene2 Gene3 Gene4
D1 0 0 25 0
D2 0 0 0 0
D3 0 17 0 16
D4 24 0 0 0
D5 0 0 0 0
D6 0 32 0 11
D7 0 0 0 0
D8 4 0 0 0

With this being your dataframe:
df = pd.DataFrame({
"Disease":[f"D{i}" for i in range(1,9)],
"Gene1":[0,0,0,24,0,0,0,4],
"Gene2":[0,0,17,0,0,32,0,0],
"Gene3":[25,0,0,0,0,0,0,0],
"Gene4":[0,0,16,0,0,11,0,0]})
Disease Gene1 Gene2 Gene3 Gene4
0 D1 0 0 25 0
1 D2 0 0 0 0
2 D3 0 17 0 16
3 D4 24 0 0 0
4 D5 0 0 0 0
5 D6 0 32 0 11
6 D7 0 0 0 0
7 D8 4 0 0 0
The easiest way to do this is to do
df += 1
However, since you have a column which is string (The Disease column)
This will not work.
But we can conveniently set the Disease column to be the index, like this:
df.set_index('Disease', inplace=True)
Now your dataframe looks like this:
Gene1 Gene2 Gene3 Gene4
Disease
D1 0 0 25 0
D2 0 0 0 0
D3 0 17 0 16
D4 24 0 0 0
D5 0 0 0 0
D6 0 32 0 11
D7 0 0 0 0
D8 4 0 0 0
And if we do df += 1 now, we get:
Gene1 Gene2 Gene3 Gene4
Disease
D1 1 1 26 1
D2 1 1 1 1
D3 1 18 1 17
D4 25 1 1 1
D5 1 1 1 1
D6 1 33 1 12
D7 1 1 1 1
D8 5 1 1 1
because the plus operation only acts on the data columns, not on the index.
You can also do this on column basis, like this:
df["Gene1"] = df["Gene1"] + 1

You can filter the df whether the underlying dtype is not 'object':
In [110]:
numeric_cols = [col for col in df if df[col].dtype.kind != 'O']
numeric_cols
Out[110]:
['Gene1', 'Gene2', 'Gene3', 'Gene4']
In [111]:
df[numeric_cols] += 1
df
Out[111]:
Disease Gene1 Gene2 Gene3 Gene4
0 D1 1 1 26 1
1 D2 1 1 1 1
2 D3 1 18 1 17
3 D4 25 1 1 1
4 D5 1 1 1 1
5 D6 1 33 1 12
6 D7 1 1 1 1
7 D8 5 1 1 1
EDIT
It looks like your df possibly has strings instead of numeric types, you can convert the dtype to numeric using convert_objects:
df = df.convert_objects(convert_numeric=True)

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

Pandas condition across 2 dataframes - pandas

Use DataFrame.mask: df1 = df1.mask(df2 == 0, 0) For better performance use numpy.where: df1 = pd.DataFrame(np.where(df2 == 0, 0, df1), index=df1.index, columns=df1.columns) print (df1) A B C 0 11 22 55 1 0 0 54 2 0 34 66

Using where: df1 = df1.where(df2.ne(0), 0) print(df1) A B C 0 11 22 55 1 0 0 54 2 0 34 66

Another way - df1 = df1[~df2.eq(0)].fillna(0)

Related

iteration calculation based on another dataframe

How to create and populate pandas columns based on cell values

how to conditionally create new column in dataframe based on other column values in julia

applying defined function over different ranges of rows in pandas

adding one to all the values in a dataframe

Categories

Resources