Pandas join does not retain key fields from both dataframes - pandas

I have two dataframes and I am joining them like this:
merged = prvmthfile.merge(curmthfile, how='outer', on=['CUSTID', 'CTYPE'], suffixes=['_prv', '_cur'], indicator=True)
This adds the _prv and _cur suffixes to the fields common to both dataframes, except the key fields CUSTID and CTYPE.
In the final output I only see one set of CUSTID and CTYPE. Is there a way to have CUSTID_prv/CUSTID_cur and CTYPE_prv/CTYPE_cur?

You can add the suffixes before merging, then use the suffixed names as the merge keys and drop the suffixes argument:
prvmthfile.add_suffix('_prv').merge(
    curmthfile.add_suffix('_cur'),
    how='outer',
    left_on=['CUSTID_prv', 'CTYPE_prv'],
    right_on=['CUSTID_cur', 'CTYPE_cur'],
    indicator=True)
Example:
import pandas as pd
df = pd.DataFrame({'id': [1, 2, 3, 4, 5],
                   'val': [1, 2, 3, 4, 5]})
df2 = pd.DataFrame({'id': [1, 2, 4, 5, 6],
                    'val': [11, 22, 33, 44, 55]})
df.add_suffix('_prv').merge(df2.add_suffix('_cur'),
                            how='outer',
                            left_on=['id_prv'],
                            right_on=['id_cur'],
                            indicator=True)
Output:
id_prv val_prv id_cur val_cur _merge
0 1.0 1.0 1.0 11.0 both
1 2.0 2.0 2.0 22.0 both
2 3.0 3.0 NaN NaN left_only
3 4.0 4.0 4.0 33.0 both
4 5.0 5.0 5.0 44.0 both
5 NaN NaN 6.0 55.0 right_only
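If you also want a single coalesced key column back afterwards, a small follow-up sketch (using the example's merged result; the coalesced column name id is my own choice) is to fillna across the suffixed pair, which works because the two sides agree wherever the row matched:
merged = df.add_suffix('_prv').merge(df2.add_suffix('_cur'),
                                     how='outer',
                                     left_on=['id_prv'],
                                     right_on=['id_cur'],
                                     indicator=True)
merged['id'] = merged['id_prv'].fillna(merged['id_cur'])  # coalesce the key pair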

Related

How do I append an uneven column to an existing one?

I am having trouble appending later values from column c to column a within the same df using pandas. I have tried .append and .concat with ignore_index=True, but it is still not working.
import pandas as pd
d = {'a':[1,2,3,None, None], 'b':[7,8,9, None, None], 'c':[None, None, None, 5, 6]}
df = pd.DataFrame(d)
df['a'] = df['a'].append(df['c'], ignore_index=True)
print(df)
a b c
0 1.0 7.0 NaN
1 2.0 8.0 NaN
2 3.0 9.0 NaN
3 NaN NaN 5.0
4 NaN NaN 6.0
Desired:
a b c
0 1.0 7.0 NaN
1 2.0 8.0 NaN
2 3.0 9.0 NaN
3 5.0 NaN 5.0
4 6.0 NaN 6.0
Thank you for updating that, this is what I would do:
df['a'] = df['a'].fillna(df['c'])
print(df)
Output:
a b c
0 1.0 7.0 NaN
1 2.0 8.0 NaN
2 3.0 9.0 NaN
3 5.0 NaN 5.0
4 6.0 NaN 6.0
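As an aside, combine_first should behave the same way here (a variant of the answer above, not the answer itself), since it also takes values from df['a'] first and falls back to df['c']:
df['a'] = df['a'].combine_first(df['c'])  # same effect as fillna in this case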

Pandas element-wise min max against a series along one axis

I have a Dataframe:
df =
A B C D
DATA_DATE
20170103 5.0 3.0 NaN NaN
20170104 NaN NaN NaN 1.0
20170105 1.0 NaN 2.0 3.0
And I have a series
s =
DATA_DATE
20170103 4.0
20170104 0.0
20170105 2.2
I'd like to run an element-wise max() function and align s along the columns of df. In other words, I want to get
result =
A B C D
DATA_DATE
20170103 5.0 4.0 NaN NaN
20170104 NaN NaN NaN 1.0
20170105 2.2 NaN 2.2 3.0
What is the best way to do this? I've checked single-column comparison and series-to-series comparison but haven't found an efficient way to compare a dataframe against a series.
Bonus: Not sure if the answer will be self-evident from above, but how to do it if I want to align s along the rows of df (assume dimensions match)?
Data:
In [135]: df
Out[135]:
A B C D
DATA_DATE
20170103 5.0 3.0 NaN NaN
20170104 NaN NaN NaN 1.0
20170105 1.0 NaN 2.0 3.0
In [136]: s
Out[136]:
20170103 4.0
20170104 0.0
20170105 2.2
Name: DATA_DATE, dtype: float64
Solution:
In [66]: df.clip_lower(s, axis=0)
C:\Users\Max\Anaconda4\lib\site-packages\pandas\core\ops.py:1247: RuntimeWarning: invalid value encountered in greater_equal
result = op(x, y)
Out[66]:
A B C D
DATA_DATE
20170103 5.0 4.0 NaN NaN
20170104 NaN NaN NaN 1.0
20170105 2.2 NaN 2.2 3.0
We can use the following hack to get rid of the RuntimeWarning:
In [134]: df.fillna(np.inf).clip_lower(s, axis=0).replace(np.inf, np.nan)
Out[134]:
A B C D
DATA_DATE
20170103 5.0 4.0 NaN NaN
20170104 NaN NaN NaN 1.0
20170105 2.2 NaN 2.2 3.0
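A note for readers on pandas 1.0 or later: clip_lower has been removed there. A minimal equivalent sketch (assuming s aligns on the index, as above) uses clip with the lower argument, which leaves NaN cells untouched:
df.clip(lower=s, axis=0)  # per-row lower bound taken from s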
This is called broadcasting and can be done as follows:
import numpy as np
np.maximum(df, s[:, None])
Out:
A B C D
DATA_DATE
20170103 5.0 4.0 NaN NaN
20170104 NaN NaN NaN 1.0
20170105 2.2 NaN 2.2 3.0
Here, s[:, None] will add a new axis to s. The same can be achieved by s[:, np.newaxis]. When you do this, the two can be broadcast together because the shapes (3, 4) and (3, 1) are compatible: the axis of length 1 is stretched to match.
Note the difference between s and s[:, None]:
s.values
Out: array([ 4. , 0. , 2.2])
s[:, None]
Out:
array([[ 4. ],
[ 0. ],
[ 2.2]])
s.shape
Out: (3,)
s[:, None].shape
Out: (3, 1)
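One caveat: recent pandas versions no longer support multi-dimensional indexing such as s[:, None] on a Series, so that expression raises an error there. Converting to a NumPy array first is a safe equivalent sketch:
np.maximum(df, s.to_numpy()[:, None])  # explicit (3, 1) array, same broadcasting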
An alternative would be:
df.mask(df.le(s, axis=0), s, axis=0)
Out:
A B C D
DATA_DATE
20170103 5.0 4.0 NaN NaN
20170104 NaN NaN NaN 1.0
20170105 2.2 NaN 2.2 3.0
This reads: wherever df is less than or equal to s, replace the value with s; otherwise keep df.
While there may be better solutions for your problem, this column-by-column loop should give you what you need (note that max skips NaN by default, so cells that are NaN in df end up filled with the value from s rather than staying NaN):
for c in df.columns:
    df[c] = pd.concat([df[c], s], axis=1).max(axis=1)
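For the bonus question (aligning a series along the other axis, one value per column), a minimal sketch, assuming a hypothetical series t indexed by df's columns, passes axis=1 so pandas aligns t with the columns and broadcasts it down the rows:
import pandas as pd
t = pd.Series([4.0, 0.0, 2.2, 1.0], index=['A', 'B', 'C', 'D'])  # hypothetical: one bound per column
df.clip(lower=t, axis=1)  # each column clipped against its own value in t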

Pandas Dataframe multiply with only the right dataframe taking fill_value

The fill_value argument of pandas.DataFrame.multiply() fills missing values in both dataframes. However, I'd like to have only missing values filled in the 2nd DataFrame. What would be a good way beyond my hacky solution below?
>>> df1 = pd.DataFrame({'a':[1, np.nan, 2], 'b':[np.nan, 3, 4]}, index = [1, 2, 3])
>>> df1
a b
1 1.0 NaN
2 NaN 3.0
3 2.0 4.0
>>> df2 = pd.DataFrame({'a':[2, np.nan], 'b':[3, np.nan], 'c':[1, 1]}, index = [1, 2])
>>> df2
a b c
1 2.0 3.0 1.0
2 NaN NaN 1.0
I would like to multiply the two DataFrames element-wise, keeping df1 as the dominant one so that the resulting shape and NaN entries match df1, while NaNs in df2 are filled with the value 1, to get
a b
1 2.0 NaN
2 NaN 3.0
3 2.0 4.0
The naive solution doesn't work:
>>> df1.multiply(df2, fill_value=1)
a b c
1 2.0 3.0 1.0
2 NaN 3.0 1.0
3 2.0 4.0 NaN
My hacky solution is to create a matrix of 1s wherever df1 has a value, and then update it with df2:
>>> df3 = df1/df1
>>> df3.update(df2)
>>> df3
a b
1 2.0 3.0
2 NaN 1.0
3 1.0 1.0
>>> df1.multiply(df3)
a b
1 2.0 NaN
2 NaN 3.0
3 2.0 4.0
It just doesn't feel very elegant. Is there a cool way to manipulate df1 and df2 directly, hopefully as a one-liner?
You can use reindex and fillna on df2:
df1.multiply(df2.reindex(df1.index).fillna(1))
a b
1 2.0 NaN
2 NaN 3.0
3 2.0 4.0
You don't need to explicitly call multiply in this case, and can just use * for multiplication:
df1 * df2.reindex(df1.index).fillna(1)
a b
1 2.0 NaN
2 NaN 3.0
3 2.0 4.0
Additionally, if you need to align the columns of df2 with df1, use the columns parameter of reindex:
df1 * df2.reindex(index=df1.index, columns=df1.columns).fillna(1)
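A related variant (my own, not from the original answer): reindex_like aligns both the index and the columns in one call:
df1 * df2.reindex_like(df1).fillna(1)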
One alternative would be to filter the result based on the nulls in df1 (note that this keeps df2's extra column c, now all-NaN, which you may want to drop afterwards):
df1.multiply(df2, fill_value=1)[df1.notnull()]
Out:
a b c
1 2.0 NaN NaN
2 NaN 3.0 NaN
3 2.0 4.0 NaN

Pandas add new second level column to column multiindex based on other columns

I have a DataFrame with column multi-index:
System     A         B
Trial   Exp1 Exp2 Exp1 Exp2
1        NaN    1    2    3
2          4    5  NaN  NaN
3          6  NaN    7    8
It turns out that for each system (A, B) and each measurement (1, 2, 3 in the index), the result from Exp1 is always superior to Exp2. So I want to generate a third column for each system, call it Final, that should take Exp1 whenever available and default to Exp2 otherwise. The desired result is
System     A                 B
Trial   Exp1 Exp2 Final  Exp1 Exp2 Final
1        NaN    1     1     2    3     2
2          4    5     4   NaN  NaN   NaN
3          6  NaN     6     7    8     7
What is the best way to do this?
I've tried to use groupby on the columns:
grp = df.groupby(level=0, axis=1)
I was thinking of using either transform or apply combined with assign to achieve it, but I have not been able to find a working or efficient way of doing it. Specifically, I am avoiding native Python for loops for efficiency reasons (otherwise the problem is trivial).
Use stack to reshape, add the Final column with fillna, and then reshape back with unstack plus swaplevel and sort_index:
df = df.stack(level=0)
df['Final'] = df['Exp1'].fillna(df['Exp2'])
df = df.unstack().swaplevel(0, 1, axis=1).sort_index(axis=1)
print (df)
System    A                B
Trial  Exp1 Exp2 Final  Exp1 Exp2 Final
1       NaN  1.0   1.0   2.0  3.0   2.0
2       4.0  5.0   4.0   NaN  NaN   NaN
3       6.0  NaN   6.0   7.0  8.0   7.0
Another solution: use xs to select each sub-DataFrame and build a new DataFrame with combine_first. The second column level is lost in the process, so it is added back with MultiIndex.from_product, and finally concat joins both DataFrames together:
a = df.xs('Exp1', axis=1, level=1)
b = df.xs('Exp2', axis=1, level=1)
df1 = a.combine_first(b)
df1.columns = pd.MultiIndex.from_product([df1.columns, ['Final']])
df = pd.concat([df, df1], axis=1).sort_index(axis=1)
print (df)
System    A                B
Trial  Exp1 Exp2 Final  Exp1 Exp2 Final
1       NaN  1.0   1.0   2.0  3.0   2.0
2       4.0  5.0   4.0   NaN  NaN   NaN
3       6.0  NaN   6.0   7.0  8.0   7.0
A similar solution with rename:
a = df.xs('Exp1', axis=1, level=1, drop_level=False)
b = df.xs('Exp2', axis=1, level=1, drop_level=False)
df1 = a.rename(columns={'Exp1':'Final'}).combine_first(b.rename(columns={'Exp2':'Final'}))
df = pd.concat([df, df1], axis=1).sort_index(axis=1)
print (df)
System    A                B
Trial  Exp1 Exp2 Final  Exp1 Exp2 Final
1       NaN  1.0   1.0   2.0  3.0   2.0
2       4.0  5.0   4.0   NaN  NaN   NaN
3       6.0  NaN   6.0   7.0  8.0   7.0
stack the first level of the column index with stack(0), leaving ['Exp1', 'Exp2'] in the column index.
Use a lambda function that gets applied to the whole dataframe within an assign call.
Finally, unstack, swaplevel, and sort_index to clean it up and put everything where it belongs.
f = lambda x: x.Exp1.fillna(x.Exp2)
df.stack(0).assign(Final=f).unstack() \
.swaplevel(0, 1, 1).sort_index(1)
         A                B
      Exp1 Exp2 Final  Exp1 Exp2 Final
1      NaN  1.0   1.0   2.0  3.0   2.0
2      4.0  5.0   4.0   NaN  NaN   NaN
3      6.0  NaN   6.0   7.0  8.0   7.0
Another concept using xs:
d1 = df.xs('Exp1', 1, 1).fillna(df.xs('Exp2', 1, 1))
d1.columns = [d1.columns, ['Final'] * len(d1.columns)]
pd.concat([df, d1], axis=1).sort_index(1)
         A                B
      Exp1 Exp2 Final  Exp1 Exp2 Final
1      NaN  1.0   1.0   2.0  3.0   2.0
2      4.0  5.0   4.0   NaN  NaN   NaN
3      6.0  NaN   6.0   7.0  8.0   7.0
This doesn't feel super optimal, but try this:
for system in df.columns.levels[0]:
    df[(system, 'Final')] = df[(system, 'Exp1')].fillna(df[(system, 'Exp2')])
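One follow-up worth noting: the loop appends each (system, 'Final') column at the end of the frame. If you want Final interleaved next to each system's Exp columns, as in the desired output, sorting the column index afterwards should do it:
df = df.sort_index(axis=1)  # group each system's Exp1/Exp2/Final together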

In pandas, how can all columns that do not contain at least one NaN be dropped from a DataFrame?

I have a DataFrame in which some columns have NaN values. I want to drop all columns that do not have at least one NaN value in them.
I am able to identify the NaN values by creating a DataFrame filled with Boolean values (True in place of NaN values, False otherwise):
data.isnull()
Then, I am able to identify the columns that contain at least one NaN value by creating a series of column names with associated Boolean values (True if the column contains at least one NaN value, False otherwise):
data.isnull().any(axis = 0)
When I attempt to use this series to drop the columns that do not contain at least one NaN value, I run into a problem: the columns that do not contain NaN values are dropped:
data = data.loc[:, data.isnull().any(axis = 0)]
How should I do this?
Consider the dataframe df
df = pd.DataFrame([
[1, 2, None],
[3, None, 4],
[5, 6, None]
], columns=list('ABC'))
df
A B C
0 1 2.0 NaN
1 3 NaN 4.0
2 5 6.0 NaN
IIUC:
dropna with the thresh parameter (it keeps columns having at least thresh non-null values):
df.dropna(axis=1, thresh=2)
A B
0 1 2.0
1 3 NaN
2 5 6.0
loc + boolean indexing (it keeps columns with fewer than two null values):
df.loc[:, df.isnull().sum() < 2]
A B
0 1 2.0
1 3 NaN
2 5 6.0
I used the sample DF from #piRSquared's answer.
If you want to "drop the columns that do not contain at least one NaN value":
In [19]: df
Out[19]:
A B C
0 1 2.0 NaN
1 3 NaN 4.0
2 5 6.0 NaN
In [26]: df.loc[:, df.isnull().any()]
Out[26]:
B C
0 2.0 NaN
1 NaN 4.0
2 6.0 NaN
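An equivalent inverse formulation (my own variant, not from the answers above) drops the columns that are entirely non-null instead of selecting the ones containing NaN:
df.drop(columns=df.columns[df.notna().all()])  # drop columns with no NaN at all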