How to compare 2 dataframes columns and add a value to a new dataframe based on the result - pandas

I have 2 dataframes of the same length, and I'd like to compare specific columns between them. If the value in the first column of one of the dataframes is bigger, I'd like to take the value of that dataframe's second column and assign it to a new dataframe.
See example. The first dataframe:
0 class
0 1.9 0
1 9.8 0
2 4.5 0
3 8.1 0
4 1.9 0
The second dataframe:
0 class
0 1.4 1
1 7.8 1
2 8.5 1
3 9.1 1
4 3.9 1
The new dataframe should look like:
class
0 0
1 0
2 1
3 1
4 1
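For reference, a minimal construction of these frames (assuming the first column is literally named 0, which is how the answers below index it):
import numpy as np
import pandas as pd

df1 = pd.DataFrame({0: [1.9, 9.8, 4.5, 8.1, 1.9], 'class': [0] * 5})
df2 = pd.DataFrame({0: [1.4, 7.8, 8.5, 9.1, 3.9], 'class': [1] * 5})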

Use numpy.where with DataFrame constructor:
df = pd.DataFrame({'class': np.where(df1[0] > df2[0], df1['class'], df2['class'])})
Or DataFrame.where:
df = df1[['class']].where(df1[0] > df2[0], df2[['class']])
print (df)
class
0 0
1 0
2 1
3 1
4 1
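A note on the difference: numpy.where returns a plain array, so the first form builds a frame with a fresh default index, while DataFrame.where keeps df1's index and column label. A quick check, assuming df1 and df2 as constructed above:
out_np = pd.DataFrame({'class': np.where(df1[0] > df2[0], df1['class'], df2['class'])})
out_pd = df1[['class']].where(df1[0] > df2[0], df2[['class']])
print(out_np['class'].tolist() == out_pd['class'].tolist())  # True: same values either way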
EDIT:
If there is another condition, use numpy.select, and if you need to compare floats with some tolerance, numpy.isclose:
print (df2)
0 class
0 1.4 1
1 7.8 1
2 8.5 1
3 9.1 1
4 1.9 1
masks = [df1[0] == df2[0], df1[0] > df2[0]]
# if the floats need comparing with some tolerance:
# masks = [np.isclose(df1[0], df2[0]), df1[0] > df2[0]]
vals = ['not_determined', df1['class']]
df = pd.DataFrame({'class': np.select(masks, vals, df2['class'])})
print (df)
class
0 0
1 0
2 1
3 1
4 not_determined
Or:
masks = [df1[0] == df2[0], df1[0] > df2[0]]
vals = ['not_determined', 0]
df = pd.DataFrame({'class': np.select(masks, vals, 1)})
print (df)
class
0 0
1 0
2 1
3 1
4 not_determined
An 'out of the box' alternative using numpy.sign:
df = np.sign(df1[0].sub(df2[0])).map({1:0, -1:1, 0:'not_determined'}).to_frame('class')
print (df)
class
0 0
1 0
2 1
3 1
4 not_determined

Since class is 0 and 1, you could try:
df1[0].lt(df2[0]).astype(int)
For generic solutions, check jezrael's answer.
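A quick check, assuming df1 and df2 as constructed in the question's setup above:
new_df = df1[0].lt(df2[0]).astype(int).to_frame('class')
print(new_df)  # 0, 0, 1, 1, 1, matching the expected output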

Try this one:
>>> import numpy as np
>>> import pandas as pd
>>> df_1
0 class
0 1.9 0
1 9.8 0
2 4.5 0
3 8.1 0
4 1.9 0
>>> df_2
0 class
0 1.4 1
1 7.8 1
2 8.5 1
3 9.1 1
4 3.9 1
>>> df_3=pd.DataFrame()
>>> df_3["class"] = np.where(df_1[0] > df_2[0], df_1["class"], df_2["class"])
>>> df_3
class
0 0
1 0
2 1
3 1
4 1

Related

Pandas iloc and conditional sum

This is my dataframe:
0 1 0 1 1
1 0 1 0 1
I generate the sum for each column as below:
data.iloc[:,1:] = data.iloc[:,1:].sum(axis=0)
The result is:
0 1 1 1 2
1 1 1 1 2
But I only want to update values that are not zero:
0 1 0 1 2
1 0 1 0 2
As it is a large dataframe and I don't know which columns will contain zeros, I am having trouble getting the condition to work together with iloc.
Assuming the following input:
0 1 2 3 4
0 0 1 0 1 1
1 1 0 1 0 1
you can use the underlying numpy array and numpy.where:
import numpy as np
a = data.values[:, 1:]
data.iloc[:,1:] = np.where(a!=0, a.sum(0), a)
output:
0 1 2 3 4
0 0 1 0 1 2
1 1 0 1 0 2
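If you prefer to stay in pandas, a sketch of the same idea, using the fact that DataFrame * Series aligns the Series with the columns (same assumed input as above):
sub = data.iloc[:, 1:]
# (sub != 0) is a 0/1 mask, so multiplying by the column sums keeps zeros
# and replaces each non-zero cell with its column's total
data.iloc[:, 1:] = (sub != 0) * sub.sum()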

Pandas groupby with MultiIndex columns and different levels

I want to do a groupby on a MultiIndex dataframe, counting the occurrences for each column for every user2 in df:
>>> df
  user1 user2 count
                  0     1  2
                  a  x  d  a
0     2     6     0  1  0  0
1     4     6     0  0  0  3
2    21    76     2  0  1  0
3     5    18     0  0  0  0
Note that user1 and user2 are at the same level as count (side effect of merging).
Desired output:
  user2 count
            0     1  2
            a  x  d  a
0     6     0  1  0  1
1    76     1  0  0  0
3    18     0  0  0  0
I've tried
>>> df.groupby(['user2','count'])
but I get
ValueError: Grouper for 'count' not 1-dimensional
GENERATOR CODE:
df = pd.DataFrame({'user1':[2,4,21,21],'user2':[6,6,76,76],'param1':[0,2,0,1],'param2':['x','a','a','d'],'count':[1,3,2,1]}, columns=['user1','user2','param1','param2','count'])
df = df.set_index(['user1','user2','param1','param2'])
df = df.unstack([2,3]).sort_index(axis=1).reset_index()
df2 = pd.DataFrame({'user1':[2,5,21],'user2':[6,18,76]})
df2.columns = pd.MultiIndex.from_product([df2.columns, [''],['']])
final_df = df2.merge(df, on=['user1','user2'], how='outer').fillna(0)
IIUC, you want the following ('count' spans several columns of the MultiIndex, which is why groupby rejects it as a grouper, while 'user2' maps to a single column):
final_df.where(final_df>0).groupby('user2').count().drop('user1', axis=1).reset_index()
Output:
  user2 count
            0     1  2
            a  x  d  a
0     6     0  1  0  1
1    18     0  0  0  0
2    76     1  0  1  0
To avoid dropping columns, select only 'count' and change the aggregation function to sum:
final_df.where(final_df>0).groupby('user2').sum()[['count']].reset_index()
Output:
  user2 count
              0         1    2
              a    x    d    a
0     6     0.0  1.0  0.0  3.0
1    18     0.0  0.0  0.0  0.0
2    76     2.0  0.0  1.0  0.0
To also avoid masking rows where user2 itself equals zero, apply the mask only to the 'count' columns and group by the untouched user2 column:
final_df[['count']].where(final_df[['count']]>0)\
.groupby(final_df.user2).sum().reset_index()

How to find average of two tables in pandas?

I have one table with 1000s of rows that looks like this:
file1:
apples1 + hate 0 0 0 2 4 6 0 1
apples2 + hate 0 2 0 4 4 6 0 2
apples4 + hate 0 2 0 4 4 6 0 2
and another file, file2, with the same columns; note that some rows (e.g. apples3) are missing from file1:
apples1 + hate 0 0 0 1 4 6 0 2
apples2 + hate 0 1 0 6 4 6 0 2
apples3 + hate 0 2 0 4 4 6 0 2
apples4 + hate 0 1 0 3 4 3 0 1
I want to compare the two files in pandas and average the values of the rows they have in common. I do not want to output rows that appear in only one file. So the resulting file would look like:
apples1 + hate 0 0 0 1.5 4 6 0 1.5
apples2 + hate 0 1.5 0 5 4 6 0 2
apples4 + hate 0 1.5 0 3.5 4 4.5 0 1.5
There are two steps in this solution:
1. Concatenate your dataframes by stacking them vertically (axis=0, the default) using pandas.concat(...), specifying join='inner' to keep only the columns that appear in all the dataframes.
2. Call the mean(...) function on the resulting dataframe.
Example:
In [1]: df1 = pd.DataFrame([[1,2,3], [4,5,6]], columns=['a','b','c'])
In [2]: df2 = pd.DataFrame([[1,2],[3,4]], columns=['a','c'])
In [3]: df1
Out[3]:
a b c
0 1 2 3
1 4 5 6
In [4]: df2
Out[4]:
a c
0 1 2
1 3 4
In [5]: df3 = pd.concat([df1, df2], join='inner')
In [6]: df3
Out[6]:
a c
0 1 3
1 4 6
0 1 2
1 3 4
In [7]: df3.mean()
Out[7]:
a 2.25
c 3.75
dtype: float64
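Applied to the question's files, you would additionally group by the three label columns so the mean is taken per row label, and keep only labels present in both files. A sketch, assuming whitespace-separated files with no header row:
import pandas as pd

df1 = pd.read_csv('file1', sep=r'\s+', header=None)
df2 = pd.read_csv('file2', sep=r'\s+', header=None)

both = pd.concat([df1, df2], join='inner')
# keep only labels that occur in both files, then average per label
sizes = both.groupby([0, 1, 2])[3].transform('size')
result = both[sizes == 2].groupby([0, 1, 2]).mean().reset_index()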
Let's try this:
df1 = pd.read_csv('file1', sep=r'\s+', header=None)
df2 = pd.read_csv('file2', sep=r'\s+', header=None)
Set the index to the first three columns, i.e. "apples1 + hate":
df1 = df1.set_index([0,1,2])
df2 = df2.set_index([0,1,2])
Let's use merge to inner-join the dataframes on their indexes, then group columns that share the same name (stripping the _x/_y merge suffixes) and aggregate with mean:
df1.merge(df2, right_index=True, left_index=True)\
   .pipe(lambda x: x.groupby(x.columns.str.extract(r'(\w+)_[xy]', expand=False),
                             axis=1, sort=False).mean()).reset_index()
Output:
         0  1     2    3    4    5    6    7    8    9   10
0  apples1  +  hate  0.0  0.0  0.0  1.5  4.0  6.0  0.0  1.5
1  apples2  +  hate  0.0  1.5  0.0  5.0  4.0  6.0  0.0  2.0
2  apples4  +  hate  0.0  1.5  0.0  3.5  4.0  4.5  0.0  1.5

pandas rename_axis not taking inplace argument

I have the following code. I thought df would end up with an index named INDEX, given that I set the inplace argument, but that's not the case. What am I missing? Or is it a bug in Pandas?
>>> df = pd.DataFrame([[1,2],[3,4]])
>>> df
0 1
0 1 2
1 3 4
>>> df.rename_axis('INDEX', inplace=True)
0 1
INDEX
0 1 2
1 3 4
>>> df
0 1
0 1 2
1 3 4
>>>
It should be straightforward. I tried it myself and experienced the same issue, so congratulations, you've found a bug in Pandas.
Here is what you can do with your code for now:
>>> df = df.rename_axis('INDEX')
0 1
INDEX
0 1 2
1 3 4
>>> df
0 1
INDEX
0 1 2
1 3 4
>>>
In case you're interested to contribute to Pandas: https://pandas.pydata.org/pandas-docs/stable/contributing.html#bug-reports-and-enhancement-requests
It seems to be a bug. Instead of rename_axis you can use index.rename with the inplace=True parameter:
df.index.rename('INDEX', inplace=True)
Output of df:
0 1
INDEX
0 1 2
1 3 4
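For what it's worth, the underlying bug has been fixed in later pandas releases, so on a recent version the original call behaves as expected:
df = pd.DataFrame([[1, 2], [3, 4]])
df.rename_axis('INDEX', inplace=True)  # returns None; df.index.name is now 'INDEX'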

create new column based on other columns in pandas dataframe

What is the best way to create a set of new columns based on two other columns? (similar to a crosstab or SQL case statement)
This works but performance is very slow on large dataframes:
for label in labels:
    df[label + '_amt'] = df.apply(lambda row: row['amount'] if row['product'] == label else 0, axis=1)
You can use pivot_table
>>> df
amount product
0 6 b
1 3 c
2 3 a
3 7 a
4 7 a
>>> df.pivot_table(index=df.index, values='amount',
... columns='product', fill_value=0)
product a b c
0 0 6 0
1 0 0 3
2 3 0 0
3 7 0 0
4 7 0 0
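Note that pivot_table produces columns named after the product values (a, b, c); if you need the a_amt style names from your loop, you can rename afterwards, e.g.:
out = df.pivot_table(index=df.index, values='amount', columns='product', fill_value=0)
out.columns = [c + '_amt' for c in out.columns]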
Or, as an alternative to pivot_table:
>>> for label in df['product'].unique():
... df[label + '_amt'] = (df['product'] == label) * df['amount']
...
>>> df
amount product b_amt c_amt a_amt
0 6 b 6 0 0
1 3 c 0 3 0
2 3 a 0 0 3
3 7 a 0 0 7
4 7 a 0 0 7
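The loop works because the boolean mask (df['product'] == label) behaves as 0/1 when multiplied by df['amount'], so everything stays vectorized. If you'd rather avoid the Python loop entirely, the same columns can be built in one step; a sketch with get_dummies:
amt = pd.get_dummies(df['product']).mul(df['amount'], axis=0).add_suffix('_amt')
df = df.join(amt)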