How to subtract one dataframe from another? - pandas

First, let me set the stage.
I start with a pandas dataframe klmn, that looks like this:
In [15]: klmn
Out[15]:
K L M N
0 0 a -1.374201 35
1 0 b 1.415697 29
2 0 a 0.233841 18
3 0 b 1.550599 30
4 0 a -0.178370 63
5 0 b -1.235956 42
6 0 a 0.088046 2
7 0 b 0.074238 84
8 1 a 0.469924 44
9 1 b 1.231064 68
10 2 a -0.979462 73
11 2 b 0.322454 97
Next I split klmn into two dataframes, klmn0 and klmn1, according to the value in the 'K' column:
In [16]: k0 = klmn.groupby(klmn['K'] == 0)
In [17]: klmn0, klmn1 = [klmn.iloc[k0.indices[tf]] for tf in (True, False)]
In [18]: klmn0, klmn1
Out[18]:
( K L M N
0 0 a -1.374201 35
1 0 b 1.415697 29
2 0 a 0.233841 18
3 0 b 1.550599 30
4 0 a -0.178370 63
5 0 b -1.235956 42
6 0 a 0.088046 2
7 0 b 0.074238 84,
K L M N
8 1 a 0.469924 44
9 1 b 1.231064 68
10 2 a -0.979462 73
11 2 b 0.322454 97)
Finally, I compute the mean of the M column in klmn0, grouped by the value in the L column:
In [19]: m0 = klmn0.groupby('L')['M'].mean(); m0
Out[19]:
L
a -0.307671
b 0.451144
Name: M
Now, my question is, how can I subtract m0 from the M column of the klmn1 sub-dataframe, respecting the value in the L column? (By this I mean that m0['a'] gets subtracted from the M column of each row in klmn1 that has 'a' in the L column, and likewise for m0['b'].)
One could imagine doing this in a way that replaces the values in the M column of klmn1 with the new values (after subtracting the value from m0). Alternatively, one could imagine doing this in a way that leaves klmn1 unchanged, and instead produces a new dataframe klmn11 with an updated M column. I'm interested in both approaches.

If you set the index of your klmn1 dataframe to the column L, pandas will automatically align that index with the index of any series you subtract from it:
In [1]: klmn1.set_index('L')['M'] - m0
Out[1]:
L
a 0.777595
a -0.671791
b 0.779920
b -0.128690
Name: M
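The index-aligned subtraction above returns a Series indexed by L, so the original row labels of klmn1 are lost. A minimal sketch of both variants the question asks for (a new dataframe and an in-place update), assuming the values shown in the question; the names offsets and klmn1_inplace are made up for illustration:

```python
import pandas as pd

# Rebuild the K != 0 rows and the group means from the question.
klmn1 = pd.DataFrame(
    {"K": [1, 1, 2, 2],
     "L": ["a", "b", "a", "b"],
     "M": [0.469924, 1.231064, -0.979462, 0.322454],
     "N": [44, 68, 73, 97]},
    index=[8, 9, 10, 11],
)
m0 = pd.Series({"a": -0.307671, "b": 0.451144}, name="M")

# Look up the mean for each row's L value; the result keeps klmn1's index.
offsets = klmn1["L"].map(m0)

# Variant 1: new dataframe, klmn1 is left unchanged.
klmn11 = klmn1.assign(M=klmn1["M"] - offsets)

# Variant 2: update the M column in place
# (done on a copy here so both results can be compared).
klmn1_inplace = klmn1.copy()
klmn1_inplace["M"] -= offsets

print(klmn11)
```

Mapping L through m0 keeps the row labels 8 to 11, which is what makes the column assignment line up.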

Option #1:
df1.subtract(df2, fill_value=0)
Option #2:
df1.subtract(df2, fill_value=None)
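The two options differ only in how labels present in one frame but not the other are treated. A small sketch with made-up frames (df1, df2 and their values are assumptions, not data from the question):

```python
import pandas as pd

# Two frames that overlap only on label "b".
df1 = pd.DataFrame({"x": [10.0, 20.0]}, index=["a", "b"])
df2 = pd.DataFrame({"x": [1.0, 2.0]}, index=["b", "c"])

# fill_value=0: a value missing on one side is treated as 0 before subtracting.
filled = df1.subtract(df2, fill_value=0)

# fill_value=None (the default): a label missing on either side yields NaN.
strict = df1.subtract(df2, fill_value=None)

print(filled)
print(strict)
```

Note that both options align on index and column labels, so neither one subtracts a per-group value as the question asks unless the frames are indexed accordingly.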


Pandas column merging on condition

This is my pandas df:
Id Protein A_Egg B_Meat C_Milk Category
A 10 10 20 0 egg
B 20 10 0 10 milk
C 20 10 10 10 meat
D 25 20 10 0 egg
I wish to combine the Protein column with one of the other columns, chosen by "Category".
My desired output is
Id Protein_final
A 20
B 30
C 30
D 45
Ideally, I would show how I am approaching this, but I am frankly clueless!
EDIT: Also, how should I handle the case where the category is blank or does not match any of the columns? (In that case the final value should be the same as the initial value in the Protein column.)
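Use DataFrame.lookup after some preprocessing: strip everything before the _ in the column names and lowercase what remains, then add the looked-up values to the column. (Note that DataFrame.lookup was deprecated in pandas 1.2 and removed in pandas 2.0.)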
arr = df.rename(columns=lambda x: x.split('_')[-1].lower()).lookup(df.index, df['Category'])
df['Protein'] += arr
print (df)
Id Protein A_Egg B_Meat C_Milk Category
0 A 20 10 20 0 egg
1 B 30 10 0 10 milk
2 C 30 10 10 10 meat
3 D 45 20 10 0 egg
If you need only the two columns in the end:
df = df[['Id','Protein']]
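Since DataFrame.lookup is not available in pandas 2.0 and later, here is a sketch of the same idea using NumPy positional indexing. It also covers the blank or unmatched category case from the question's edit. Row E and the names values, col_pos and extra are assumptions added for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Id": ["A", "B", "C", "D", "E"],
    "Protein": [10, 20, 20, 25, 5],
    "A_Egg": [10, 10, 10, 20, 1],
    "B_Meat": [20, 0, 10, 10, 1],
    "C_Milk": [0, 10, 10, 0, 1],
    "Category": ["egg", "milk", "meat", "egg", None],  # row E: made-up blank category
})

# Value columns renamed to the lowercase text after the underscore.
values = df[["A_Egg", "B_Meat", "C_Milk"]].rename(
    columns=lambda c: c.split("_")[-1].lower()
)

# Column position of each row's category; -1 marks blank or unknown categories.
col_pos = values.columns.get_indexer(df["Category"].fillna(""))
row_pos = np.arange(len(df))

# Pick one value per row, then zero out the rows that had no matching column.
picked = values.to_numpy()[row_pos, col_pos]
extra = np.where(col_pos >= 0, picked, 0)

df["Protein_final"] = df["Protein"] + extra
print(df[["Id", "Protein_final"]])
```

Rows without a matching column keep their original Protein value because the added amount is 0.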
You can melt the dataframe, add the Protein and value columns, and keep only the rows where Category equals the (cleaned) variable column:
(
    df
    .melt(["Id", "Protein", "Category"])
    .assign(variable=lambda x: x.variable.str[2:].str.lower(),
            Protein_final=lambda x: x.Protein + x.value)
    .query("Category == variable")
    .filter(["Id", "Protein_final"])
)
Id Protein_final
0 A 20
3 D 45
6 C 30
9 B 30

'Series' objects are mutable, thus they cannot be hashed trying to sum columns and datatype is float

I am trying to sum all values in a range of columns, from the third to the last of several thousand columns, using:
day3prep['D3counts'] = day3prep.sum(day3prep.iloc[:, 2:].sum(axis=1))
The dataframe is formatted as:
ID G1 Z1 Z2 ...ZN
0 50 13 12 ...62
1 51 62 23 ...19
dataframe with summed column:
ID G1 Z1 Z2 ...ZN D3counts
0 50 13 12 ...62 sum(Z1:ZN in row 0)
1 51 62 23 ...19 sum(Z1:ZN in row 1)
I've changed the NaNs to 0's. The datatype is float but I am getting the error:
'Series' objects are mutable, thus they cannot be hashed
You only need this part:
day3prep['D3counts'] = day3prep.iloc[:, 2:].sum(axis=1)
With some random numbers:
import pandas as pd
import random
random.seed(42)
day3prep = pd.DataFrame({'ID': random.sample(range(10), 5), 'G1': random.sample(range(10), 5),
'Z1': random.sample(range(10), 5), 'Z2': random.sample(range(10), 5), 'Z3': random.sample(range(10), 5)})
day3prep['D3counts'] = day3prep.iloc[:, 2:].sum(axis=1)
Output:
> day3prep
ID G1 Z1 Z2 Z3 D3counts
0 1 2 0 8 8 16
1 0 1 9 0 6 15
2 4 8 1 3 3 7
3 9 4 7 5 7 19
4 6 3 6 6 4 16
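The original call most likely failed because the outer day3prep.sum(...) received the already-summed Series as its axis argument. If the columns to add up are easier to identify by name than by position, a label-based slice does the same job. A small sketch with made-up values (including one NaN to show the default behaviour); z_cols is a name introduced here:

```python
import pandas as pd

day3prep = pd.DataFrame({
    "ID": [0, 1],
    "G1": [50, 51],
    "Z1": [13.0, 62.0],
    "Z2": [12.0, None],
    "Z3": [62.0, 19.0],
})

# Label-based slice: every column from Z1 through the last one.
z_cols = day3prep.loc[:, "Z1":]

# sum(axis=1) adds across the columns of each row; NaN is skipped by default.
day3prep["D3counts"] = z_cols.sum(axis=1)
print(day3prep)
```

Because NaN values are skipped by default, replacing them with 0 beforehand is not strictly required for the sum.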

How to *multiply* (for lack of a better term) two dataframes [duplicate]

The contents of this post were originally meant to be a part of
Pandas Merging 101,
but due to the nature and size of the content required to fully do
justice to this topic, it has been moved to its own QnA.
Given two simple DataFrames;
import numpy as np
import pandas as pd

left = pd.DataFrame({'col1' : ['A', 'B', 'C'], 'col2' : [1, 2, 3]})
right = pd.DataFrame({'col1' : ['X', 'Y', 'Z'], 'col2' : [20, 30, 50]})
left
col1 col2
0 A 1
1 B 2
2 C 3
right
col1 col2
0 X 20
1 Y 30
2 Z 50
The cross product of these frames can be computed, and will look something like:
A 1 X 20
A 1 Y 30
A 1 Z 50
B 2 X 20
B 2 Y 30
B 2 Z 50
C 3 X 20
C 3 Y 30
C 3 Z 50
What is the most performant method of computing this result?
Let's start by establishing a benchmark. The easiest method for solving this is using a temporary "key" column:
pandas <= 1.1.X
def cartesian_product_basic(left, right):
    # Give both frames the same constant key, join on it, then drop it.
    return (
        left.assign(key=1).merge(right.assign(key=1), on='key').drop(columns='key'))
cartesian_product_basic(left, right)
pandas >= 1.2
left.merge(right, how="cross") # implements the technique above
col1_x col2_x col1_y col2_y
0 A 1 X 20
1 A 1 Y 30
2 A 1 Z 50
3 B 2 X 20
4 B 2 Y 30
5 B 2 Z 50
6 C 3 X 20
7 C 3 Y 30
8 C 3 Z 50
How this works is that both DataFrames are assigned a temporary "key" column with the same value (say, 1). merge then performs a many-to-many JOIN on "key".
While the many-to-many JOIN trick works for reasonably sized DataFrames, you will see relatively lower performance on larger data.
A faster implementation will require NumPy. There are several well-known NumPy implementations of the 1D cartesian product, and we can build on some of these performant solutions to get our desired output. My favourite, however, is @senderle's first implementation.
def cartesian_product(*arrays):
    la = len(arrays)
    dtype = np.result_type(*arrays)
    arr = np.empty([len(a) for a in arrays] + [la], dtype=dtype)
    for i, a in enumerate(np.ix_(*arrays)):
        arr[...,i] = a
    return arr.reshape(-1, la)
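To make the helper's output concrete, here is a small usage sketch on two integer ranges. The function body is repeated so the snippet runs on its own; the variable pairs is introduced for illustration:

```python
import numpy as np

def cartesian_product(*arrays):
    la = len(arrays)
    dtype = np.result_type(*arrays)
    # One axis per input array, plus a trailing axis holding the tuple entries.
    arr = np.empty([len(a) for a in arrays] + [la], dtype=dtype)
    for i, a in enumerate(np.ix_(*arrays)):
        arr[..., i] = a  # broadcast the i-th input along all other axes
    return arr.reshape(-1, la)

# Every (i, j) pair with i in 0..2 and j in 0..1; the last column varies fastest.
pairs = cartesian_product(np.arange(3), np.arange(2))
print(pairs.tolist())
# [[0, 0], [0, 1], [1, 0], [1, 1], [2, 0], [2, 1]]
```

Each row of pairs is a tuple of positions, which is why it can be used to index the rows of the two DataFrames.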
Generalizing: CROSS JOIN on Unique or Non-Unique Indexed DataFrames
Disclaimer
These solutions are optimised for DataFrames with non-mixed scalar dtypes. If dealing with mixed dtypes, use at your own risk!
This trick will work on any kind of DataFrame. We compute the cartesian product of the DataFrames' numeric indices using the aforementioned cartesian_product, use this to reindex the DataFrames, and column-stack the resulting values into a new DataFrame:
def cartesian_product_generalized(left, right):
    la, lb = len(left), len(right)
    idx = cartesian_product(np.ogrid[:la], np.ogrid[:lb])
    return pd.DataFrame(
        np.column_stack([left.values[idx[:,0]], right.values[idx[:,1]]]))
cartesian_product_generalized(left, right)
0 1 2 3
0 A 1 X 20
1 A 1 Y 30
2 A 1 Z 50
3 B 2 X 20
4 B 2 Y 30
5 B 2 Z 50
6 C 3 X 20
7 C 3 Y 30
8 C 3 Z 50
np.array_equal(cartesian_product_generalized(left, right),
cartesian_product_basic(left, right))
True
And, along similar lines,
left2 = left.copy()
left2.index = ['s1', 's2', 's1']
right2 = right.copy()
right2.index = ['x', 'y', 'y']
left2
col1 col2
s1 A 1
s2 B 2
s1 C 3
right2
col1 col2
x X 20
y Y 30
y Z 50
np.array_equal(cartesian_product_generalized(left2, right2),
               cartesian_product_basic(left2, right2))
True
This solution can generalise to multiple DataFrames. For example,
def cartesian_product_multi(*dfs):
    idx = cartesian_product(*[np.ogrid[:len(df)] for df in dfs])
    return pd.DataFrame(
        np.column_stack([df.values[idx[:,i]] for i,df in enumerate(dfs)]))
cartesian_product_multi(*[left, right, left]).head()
0 1 2 3 4 5
0 A 1 X 20 A 1
1 A 1 X 20 B 2
2 A 1 X 20 C 3
3 A 1 Y 30 A 1
4 A 1 Y 30 B 2
Further Simplification
A simpler solution not involving @senderle's cartesian_product is possible when dealing with just two DataFrames. Using np.broadcast_arrays, we can achieve almost the same level of performance.
def cartesian_product_simplified(left, right):
    la, lb = len(left), len(right)
    ia2, ib2 = np.broadcast_arrays(*np.ogrid[:la,:lb])
    return pd.DataFrame(
        np.column_stack([left.values[ia2.ravel()], right.values[ib2.ravel()]]))
np.array_equal(cartesian_product_simplified(left, right),
cartesian_product_basic(left2, right2))
True
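The np.ogrid and np.broadcast_arrays step is what generates the row positions in the simplified version. A short sketch of the intermediate arrays for a 3-row left and a 2-row right frame (the sizes and the names rows, cols, left_pos and right_pos are assumptions for illustration):

```python
import numpy as np

la, lb = 3, 2

# Open grids: a column vector of left positions and a row vector of right positions.
rows, cols = np.ogrid[:la, :lb]              # shapes (3, 1) and (1, 2)

# Broadcasting stretches both to the full (3, 2) grid without a Python loop.
ia2, ib2 = np.broadcast_arrays(rows, cols)

left_pos = ia2.ravel()    # each left position repeated lb times
right_pos = ib2.ravel()   # right positions cycled la times

print(left_pos.tolist())   # [0, 0, 1, 1, 2, 2]
print(right_pos.tolist())  # [0, 1, 0, 1, 0, 1]
```

Indexing left.values with left_pos and right.values with right_pos then pairs every left row with every right row.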
Performance Comparison
Benchmarking these solutions on some contrived DataFrames with unique indices, we have
[Figure: log-log plot of relative time against N for each function]
Do note that timings may vary based on your setup, data, and choice of cartesian_product helper function as applicable.
Performance Benchmarking Code
This is the timing script. All functions called here are defined above.
from timeit import timeit
import pandas as pd
import matplotlib.pyplot as plt
res = pd.DataFrame(
index=['cartesian_product_basic', 'cartesian_product_generalized',
'cartesian_product_multi', 'cartesian_product_simplified'],
columns=[1, 10, 50, 100, 200, 300, 400, 500, 600, 800, 1000, 2000],
dtype=float
)
for f in res.index:
    for c in res.columns:
        # print(f,c)
        left2 = pd.concat([left] * c, ignore_index=True)
        right2 = pd.concat([right] * c, ignore_index=True)
        stmt = '{}(left2, right2)'.format(f)
        setp = 'from __main__ import left2, right2, {}'.format(f)
        res.at[f, c] = timeit(stmt, setp, number=5)

ax = res.div(res.min()).T.plot(loglog=True)
ax.set_xlabel("N")
ax.set_ylabel("time (relative)")
plt.show()
Since pandas 1.2.0, merge has a cross option:
left.merge(right, how='cross')
Use itertools.product and rebuild the rows in a DataFrame:
import itertools
l=list(itertools.product(left.values.tolist(),right.values.tolist()))
pd.DataFrame(list(map(lambda x : sum(x,[]),l)))
0 1 2 3
0 A 1 X 20
1 A 1 Y 30
2 A 1 Z 50
3 B 2 X 20
4 B 2 Y 30
5 B 2 Z 50
6 C 3 X 20
7 C 3 Y 30
8 C 3 Z 50
Here's an approach with triple concat:
m = pd.concat([pd.concat([left]*len(right)).sort_index().reset_index(drop=True),
               pd.concat([right]*len(left)).reset_index(drop=True)], axis=1)
col1 col2 col1 col2
0 A 1 X 20
1 A 1 Y 30
2 A 1 Z 50
3 B 2 X 20
4 B 2 Y 30
5 B 2 Z 50
6 C 3 X 20
7 C 3 Y 30
8 C 3 Z 50
One option is with expand_grid from pyjanitor:
# pip install pyjanitor
import pandas as pd
import janitor as jn
others = {'left':left, 'right':right}
jn.expand_grid(others = others)
left right
col1 col2 col1 col2
0 A 1 X 20
1 A 1 Y 30
2 A 1 Z 50
3 B 2 X 20
4 B 2 Y 30
5 B 2 Z 50
6 C 3 X 20
7 C 3 Y 30
8 C 3 Z 50
I think the simplest way would be to add a dummy column to each data frame, do an inner merge on it and then drop that dummy column from the resulting cartesian dataframe:
left['dummy'] = 'a'
right['dummy'] = 'a'
cartesian = left.merge(right, how='inner', on='dummy')
del cartesian['dummy']

Pandas groupby sort each group values and order dataframe groups based on max of each group

I have a dataset containing 3 columns, I’m trying to group them and print each group in sorted fashion (based on highest value in each group). The records in each group also have to be in sorted fashion.
Dataset looks like below.
key1,key2,val
b,y,21
c,y,25
c,z,10
b,x,20
b,z,5
c,x,17
a,x,15
a,y,18
a,z,100
df=pd.read_csv('/tmp/hello.csv')
df['max'] = df.groupby(['key1'])['val'].transform('max')
dff=df.sort_values(['max', 'val'], ascending=False).drop('max', axis=1)
I'm applying transform as it works per group basis and then sorting the values.
Above code results in my desired dataframe:
a,z,100
a,y,18
a,x,15
c,y,25
c,x,17
c,z,10
b,y,21
b,x,20
b,z,5
But the same code fails for the dataset below.
key1,key2,val
b,y,10
c,y,10
c,z,10
b,x,2
b,z,2
c,x,2
a,x,2
a,y,2
a,z,2
Below is the desired output
key1,key2,val
c,y,10
c,z,10
c,x,2
b,y,10
b,x,2
b,z,2
a,x,2
a,y,2
a,z,2
Please help me in properly grouping and sorting the dataframe for my scenario.
Add the key1 column to sort_values: in the second DataFrame several groups share the same maximum value (10), so sorting by max alone cannot distinguish the groups:
df['max'] = df.groupby(['key1'])['val'].transform('max')
dff=df.sort_values(['max','key1', 'val'], ascending=False).drop('max', axis=1)
print (dff)
key1 key2 val
8 a z 100
7 a y 18
6 a x 15
1 c y 25
5 c x 17
2 c z 10
0 b y 21
3 b x 20
4 b z 5
df['max'] = df.groupby(['key1'])['val'].transform('max')
dff=df.sort_values(['max','key1', 'val'], ascending=False).drop('max', axis=1)
print (dff)
key1 key2 val
1 c y 10
2 c z 10
5 c x 2
0 b y 10
3 b x 2
4 b z 2
6 a x 2
7 a y 2
8 a z 2
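The fix can be checked end to end on the second dataset. This is a self-contained sketch that reads the CSV text from a string instead of /tmp/hello.csv; the names csv_text and group_max are introduced here:

```python
import io
import pandas as pd

csv_text = """key1,key2,val
b,y,10
c,y,10
c,z,10
b,x,2
b,z,2
c,x,2
a,x,2
a,y,2
a,z,2
"""
df = pd.read_csv(io.StringIO(csv_text))

# Per-row copy of each group's maximum, used only as a sort key.
group_max = df.groupby("key1")["val"].transform("max")

dff = (
    df.assign(max=group_max)
      # max orders the groups, key1 keeps tied groups together, val orders rows.
      .sort_values(["max", "key1", "val"], ascending=False)
      .drop(columns="max")
)
print(dff)
```

One consequence of using key1 as the tie-breaker with ascending=False is that groups with equal maxima come out in reverse alphabetical order (c before b here), which happens to match the desired output.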

Combine two columns of numbers in dataframe into single column using pandas/python

I'm very new to Pandas and Python.
I have a 3226 x 61 dataframe and I would like to combine two columns into a single one.
The two columns I would like to combine are both integers - one has either one or two digits (1 through 52) while the other has three digits (e.g., 1 or 001, 23 or 023). I need the output to be a five digit integer (e.g., 01001 or 52023). There will be no mathematical operations with the resulting integers - I will need them only for look-up purposes.
Based on some of the other posts on this fantastic site, I tried the following:
df['YZ'] = df['Y'].map(str) + df['Z'].map(str)
But that returns "1.00001" for a first column of "1" and a second column of "001". I believe this is because converting "1" to a str turns it into "1.0", to which "001" is then appended.
I've also tried:
df['YZ'] = df['Y'].join(df['Z'])
Getting the following error:
AttributeError: 'Series' object has no attribute 'join'
I've also tried:
df['Y'] = df['Y'].astype(int)
df['Z'] = df['Z'].astype(int)
df['YZ'] = df[['Y','Z']].apply(lambda x: ''.join(x), axis=1)
Getting the following error:
TypeError: ('sequence item 0: expected str instance, numpy.int32 found', 'occurred at index 0')
A copy of the columns is below:
1 1
1 3
1 5
1 7
1 9
1 11
1 13
I understand there are two issues here:
Combining the two columns
Getting the correct format (five digits)
Frankly, I need help with both, but I would most appreciate help with the column combining problem.
I think you need to convert the columns to strings, pad them with zeros using zfill, and simply concatenate them with +:
df['YZ'] = df['Y'].astype(str).str.zfill(2) + df['Z'].astype(str).str.zfill(3)
Sample:
df=pd.DataFrame({'Y':[1,3,5,7], 'Z':[10,30,51,74]})
print (df)
Y Z
0 1 10
1 3 30
2 5 51
3 7 74
df['YZ'] = df['Y'].astype(str).str.zfill(2) + df['Z'].astype(str).str.zfill(3)
print (df)
Y Z YZ
0 1 10 01010
1 3 30 03030
2 5 51 05051
3 7 74 07074
If you also need to change the original columns:
df['Y'] = df['Y'].astype(str).str.zfill(2)
df['Z'] = df['Z'].astype(str).str.zfill(3)
df['YZ'] = df['Y'] + df['Z']
print (df)
Y Z YZ
0 01 010 01010
1 03 030 03030
2 05 051 05051
3 07 074 07074
Solution with join:
df['Y'] = df['Y'].astype(str).str.zfill(2)
df['Z'] = df['Z'].astype(str).str.zfill(3)
df['YZ'] = df[['Y','Z']].apply('-'.join, axis=1)
print (df)
Y Z YZ
0 01 010 01-010
1 03 030 03-030
2 05 051 05-051
3 07 074 07-074
And without changing the original columns:
df['YZ'] = df['Y'].astype(str).str.zfill(2) + '-' + df['Z'].astype(str).str.zfill(3)
print (df)
Y Z YZ
0 1 10 01-010
1 3 30 03-030
2 5 51 05-051
3 7 74 07-074
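The "1.0" in the question suggests the columns were stored as floats, in which case astype(str) alone would still produce "1.0". A sketch of the same zfill approach with an integer cast first, assuming the columns contain no missing values (astype(int) raises on NaN); the sample values are made up:

```python
import pandas as pd

# Float columns, as the "1.0" in the question suggests.
df = pd.DataFrame({"Y": [1.0, 1.0, 52.0], "Z": [1.0, 23.0, 23.0]})

# Cast to int before str so that 1.0 becomes "1" rather than "1.0",
# then left-pad with zeros to fixed widths of 2 and 3.
y = df["Y"].astype(int).astype(str).str.zfill(2)
z = df["Z"].astype(int).astype(str).str.zfill(3)

df["YZ"] = y + z
print(df["YZ"].tolist())
# ['01001', '01023', '52023']
```

The result is kept as a string column, since an integer column would drop the leading zero of codes such as 01001.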