How to compare 2 dataframes columns and add a value to a new dataframe based on the result - pandas

I have 2 dataframes of the same length, and I'd like to compare specific columns between them. If the value in the first column of one of the dataframes is bigger, I'd like to take the value of that dataframe's second column and assign it to a new dataframe.
See example. The first dataframe:
0 class
0 1.9 0
1 9.8 0
2 4.5 0
3 8.1 0
4 1.9 0
The second dataframe:
0 class
0 1.4 1
1 7.8 1
2 8.5 1
3 9.1 1
4 3.9 1
The new dataframe should look like:
class
0 0
1 0
2 1
3 1
4 1
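For reference, a minimal construction of these frames (assuming the first column is literally named 0, which is how the answers below index it):
import numpy as np
import pandas as pd

df1 = pd.DataFrame({0: [1.9, 9.8, 4.5, 8.1, 1.9], 'class': [0] * 5})
df2 = pd.DataFrame({0: [1.4, 7.8, 8.5, 9.1, 3.9], 'class': [1] * 5})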

Use numpy.where with DataFrame constructor:
df = pd.DataFrame({'class': np.where(df1[0] > df2[0], df1['class'], df2['class'])})
Or DataFrame.where:
df = df1[['class']].where(df1[0] > df2[0], df2[['class']])
print (df)
class
0 0
1 0
2 1
3 1
4 1
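A note on the difference: numpy.where returns a plain array, so the first form builds a frame with a fresh default index, while DataFrame.where keeps df1's index and column label. A quick check, assuming df1 and df2 as constructed above:
out_np = pd.DataFrame({'class': np.where(df1[0] > df2[0], df1['class'], df2['class'])})
out_pd = df1[['class']].where(df1[0] > df2[0], df2[['class']])
print(out_np['class'].tolist() == out_pd['class'].tolist())  # True: same values either way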
EDIT:
If there is another condition, use numpy.select, and if you need to compare floats with some tolerance, numpy.isclose:
print (df2)
0 class
0 1.4 1
1 7.8 1
2 8.5 1
3 9.1 1
4 1.9 1
masks = [df1[0] == df2[0], df1[0] > df2[0]]
# if the floats need comparing with some tolerance:
# masks = [np.isclose(df1[0], df2[0]), df1[0] > df2[0]]
vals = ['not_determined', df1['class']]
df = pd.DataFrame({'class': np.select(masks, vals, df2['class'])})
print (df)
class
0 0
1 0
2 1
3 1
4 not_determined
Or:
masks = [df1[0] == df2[0], df1[0] > df2[0]]
vals = ['not_determined', 0]
df = pd.DataFrame({'class': np.select(masks, vals, 1)})
print (df)
class
0 0
1 0
2 1
3 1
4 not_determined
An 'out of the box' alternative using numpy.sign:
df = np.sign(df1[0].sub(df2[0])).map({1:0, -1:1, 0:'not_determined'}).to_frame('class')
print (df)
class
0 0
1 0
2 1
3 1
4 not_determined

Since class is 0 and 1, you could try:
df1[0].lt(df2[0]).astype(int)
For generic solutions, check jezrael's answer.
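A quick check, assuming df1 and df2 as constructed in the question's setup above:
new_df = df1[0].lt(df2[0]).astype(int).to_frame('class')
print(new_df)  # 0, 0, 1, 1, 1, matching the expected output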

Try this one:
>>> import numpy as np
>>> import pandas as pd
>>> df_1
0 class
0 1.9 0
1 9.8 0
2 4.5 0
3 8.1 0
4 1.9 0
>>> df_2
0 class
0 1.4 1
1 7.8 1
2 8.5 1
3 9.1 1
4 3.9 1
>>> df_3=pd.DataFrame()
>>> df_3["class"] = np.where(df_1[0] > df_2[0], df_1["class"], df_2["class"])
>>> df_3
class
0 0
1 0
2 1
3 1
4 1

Related

Pandas iloc and conditional sum

This is my dataframe:
0 1 0 1 1
1 0 1 0 1
I generate the sum for each column as below:
data.iloc[:,1:] = data.iloc[:,1:].sum(axis=0)
The result is:
0 1 1 1 2
1 1 1 1 2
But I only want to update values that are not zero:
0 1 0 1 2
1 0 1 0 2
As it is a large dataframe and I don't know which columns will contain zeros, I am having trouble getting the condition to work together with iloc.
Assuming the following input:
0 1 2 3 4
0 0 1 0 1 1
1 1 0 1 0 1
you can use the underlying numpy array and numpy.where:
import numpy as np
a = data.values[:, 1:]
data.iloc[:,1:] = np.where(a!=0, a.sum(0), a)
output:
0 1 2 3 4
0 0 1 0 1 2
1 1 0 1 0 2
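If you prefer to stay in pandas, a sketch of the same idea, using the fact that DataFrame * Series aligns the Series with the columns (same assumed input as above):
sub = data.iloc[:, 1:]
# (sub != 0) is a 0/1 mask, so multiplying by the column sums keeps zeros
# and replaces each non-zero cell with its column's total
data.iloc[:, 1:] = (sub != 0) * sub.sum()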

Pandas groupby with MultiIndex columns and different levels

I want to do a groupby on a MultiIndex dataframe, counting the occurrences for each column for every user2 in df:
>>> df
  user1 user2 count
                  0     1  2
                  a  x  d  a
0     2     6     0  1  0  0
1     4     6     0  0  0  3
2    21    76     2  0  1  0
3     5    18     0  0  0  0
Note that user1 and user2 are at the same level as count (side effect of merging).
Desired output:
  user2 count
            0     1  2
            a  x  d  a
0     6     0  1  0  1
1    76     1  0  0  0
3    18     0  0  0  0
I've tried
>>> df.groupby(['user2','count'])
but I get
ValueError: Grouper for 'count' not 1-dimensional
GENERATOR CODE:
df = pd.DataFrame({'user1':[2,4,21,21],'user2':[6,6,76,76],'param1':[0,2,0,1],'param2':['x','a','a','d'],'count':[1,3,2,1]}, columns=['user1','user2','param1','param2','count'])
df = df.set_index(['user1','user2','param1','param2'])
df = df.unstack([2,3]).sort_index(axis=1).reset_index()
df2 = pd.DataFrame({'user1':[2,5,21],'user2':[6,18,76]})
df2.columns = pd.MultiIndex.from_product([df2.columns, [''],['']])
final_df = df2.merge(df, on=['user1','user2'], how='outer').fillna(0)
IIUC, you want the following ('count' spans several columns of the MultiIndex, which is why groupby rejects it as a grouper, while 'user2' maps to a single column):
final_df.where(final_df>0).groupby('user2').count().drop('user1', axis=1).reset_index()
Output:
  user2 count
            0     1  2
            a  x  d  a
0     6     0  1  0  1
1    18     0  0  0  0
2    76     1  0  1  0
To avoid dropping columns, select only 'count' and change the aggregation function to sum:
final_df.where(final_df>0).groupby('user2').sum()[['count']].reset_index()
Output:
  user2 count
              0         1    2
              a    x    d    a
0     6     0.0  1.0  0.0  3.0
1    18     0.0  0.0  0.0  0.0
2    76     2.0  0.0  1.0  0.0
To also avoid masking rows where user2 itself equals zero, apply the mask only to the 'count' columns and group by the untouched user2 column:
final_df[['count']].where(final_df[['count']]>0)\
.groupby(final_df.user2).sum().reset_index()

How to find average of two tables in pandas?

I have one table with 1000s of rows that looks like this:
file1:
apples1 + hate 0 0 0 2 4 6 0 1
apples2 + hate 0 2 0 4 4 6 0 2
apples4 + hate 0 2 0 4 4 6 0 2
and another file, file2, with the same columns; note that some rows (e.g. apples3) are missing from file1:
apples1 + hate 0 0 0 1 4 6 0 2
apples2 + hate 0 1 0 6 4 6 0 2
apples3 + hate 0 2 0 4 4 6 0 2
apples4 + hate 0 1 0 3 4 3 0 1
I want to compare the two files in pandas and average the values of the rows they have in common. I do not want to output rows that appear in only one file. So the resulting file would look like:
apples1 + hate 0 0 0 1.5 4 6 0 1.5
apples2 + hate 0 1.5 0 5 4 6 0 2
apples4 + hate 0 1.5 0 3.5 4 4.5 0 1.5
There are two steps in this solution:
1. Concatenate your dataframes by stacking them vertically (axis=0, the default) using pandas.concat(...), specifying join='inner' to keep only the columns that appear in all the dataframes.
2. Call the mean(...) function on the resulting dataframe.
Example:
In [1]: df1 = pd.DataFrame([[1,2,3], [4,5,6]], columns=['a','b','c'])
In [2]: df2 = pd.DataFrame([[1,2],[3,4]], columns=['a','c'])
In [3]: df1
Out[3]:
a b c
0 1 2 3
1 4 5 6
In [4]: df2
Out[4]:
a c
0 1 2
1 3 4
In [5]: df3 = pd.concat([df1, df2], join='inner')
In [6]: df3
Out[6]:
a c
0 1 3
1 4 6
0 1 2
1 3 4
In [7]: df3.mean()
Out[7]:
a 2.25
c 3.75
dtype: float64
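Applied to the question's files, you would additionally group by the three label columns so the mean is taken per row label, and keep only labels present in both files. A sketch, assuming whitespace-separated files with no header row:
import pandas as pd

df1 = pd.read_csv('file1', sep=r'\s+', header=None)
df2 = pd.read_csv('file2', sep=r'\s+', header=None)

both = pd.concat([df1, df2], join='inner')
# keep only labels that occur in both files, then average per label
sizes = both.groupby([0, 1, 2])[3].transform('size')
result = both[sizes == 2].groupby([0, 1, 2]).mean().reset_index()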
Let's try this:
df1 = pd.read_csv('file1', sep=r'\s+', header=None)
df2 = pd.read_csv('file2', sep=r'\s+', header=None)
Set the index to the first three columns, i.e. "apples1 + hate":
df1 = df1.set_index([0,1,2])
df2 = df2.set_index([0,1,2])
Let's use merge to inner-join the dataframes on their indexes, then group columns that share the same name (stripping the _x/_y merge suffixes) and aggregate with mean:
df1.merge(df2, right_index=True, left_index=True)\
   .pipe(lambda x: x.groupby(x.columns.str.extract(r'(\w+)_[xy]', expand=False),
                             axis=1, sort=False).mean()).reset_index()
Output:
         0  1     2    3    4    5    6    7    8    9   10
0  apples1  +  hate  0.0  0.0  0.0  1.5  4.0  6.0  0.0  1.5
1  apples2  +  hate  0.0  1.5  0.0  5.0  4.0  6.0  0.0  2.0
2  apples4  +  hate  0.0  1.5  0.0  3.5  4.0  4.5  0.0  1.5

pandas rename_axis not taking inplace argument

I have the following code. I thought df would end up with an index named INDEX, given that I set the inplace argument, but that's not the case. What am I missing? Or is it a bug in Pandas?
>>> df = pd.DataFrame([[1,2],[3,4]])
>>> df
0 1
0 1 2
1 3 4
>>> df.rename_axis('INDEX', inplace=True)
0 1
INDEX
0 1 2
1 3 4
>>> df
0 1
0 1 2
1 3 4
>>>
It should be straightforward. I tried it myself and experienced the same issue, so congratulations, you've found a bug in Pandas.
Here is what you can do with your code for now:
>>> df = df.rename_axis('INDEX')
0 1
INDEX
0 1 2
1 3 4
>>> df
0 1
INDEX
0 1 2
1 3 4
>>>
In case you're interested to contribute to Pandas: https://pandas.pydata.org/pandas-docs/stable/contributing.html#bug-reports-and-enhancement-requests
It seems to be a bug. Instead of rename_axis you can use index.rename with the inplace=True parameter:
df.index.rename('INDEX', inplace=True)
Output of df:
0 1
INDEX
0 1 2
1 3 4
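For what it's worth, the underlying bug has been fixed in later pandas releases, so on a recent version the original call behaves as expected:
df = pd.DataFrame([[1, 2], [3, 4]])
df.rename_axis('INDEX', inplace=True)  # returns None; df.index.name is now 'INDEX'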

create new column based on other columns in pandas dataframe

What is the best way to create a set of new columns based on two other columns? (similar to a crosstab or SQL case statement)
This works but performance is very slow on large dataframes:
for label in labels:
    df[label + '_amt'] = df.apply(lambda row: row['amount'] if row['product'] == label else 0, axis=1)
You can use pivot_table
>>> df
amount product
0 6 b
1 3 c
2 3 a
3 7 a
4 7 a
>>> df.pivot_table(index=df.index, values='amount',
... columns='product', fill_value=0)
product a b c
0 0 6 0
1 0 0 3
2 3 0 0
3 7 0 0
4 7 0 0
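Note that pivot_table produces columns named after the product values (a, b, c); if you need the a_amt style names from your loop, you can rename afterwards, e.g.:
out = df.pivot_table(index=df.index, values='amount', columns='product', fill_value=0)
out.columns = [c + '_amt' for c in out.columns]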
Or, as an alternative to pivot_table:
>>> for label in df['product'].unique():
... df[label + '_amt'] = (df['product'] == label) * df['amount']
...
>>> df
amount product b_amt c_amt a_amt
0 6 b 6 0 0
1 3 c 0 3 0
2 3 a 0 0 3
3 7 a 0 0 7
4 7 a 0 0 7
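The loop works because the boolean mask (df['product'] == label) behaves as 0/1 when multiplied by df['amount'], so everything stays vectorized. If you'd rather avoid the Python loop entirely, the same columns can be built in one step; a sketch with get_dummies:
amt = pd.get_dummies(df['product']).mul(df['amount'], axis=0).add_suffix('_amt')
df = df.join(amt)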