Merging Pandas DataFrames with a rule - pandas

I am trying to merge two DataFrames with common row and column indexes, however, I am expecting entries with similar row and column indexes to exist in both DataFrames.
Is there a way to make a rule to keep the entries in df1 if they are present but if there is no value then to use the value in df2?
so df3 = some operation on df1,df21
Example:
df1 = [[[a],[b],[c]],
[[ ],[e],[ ]],
[[g],[h],[i]]]
df2 = [[[ ],[ ],[ ]],
[[d],[x],[f]],
[[y],[z],[z]]]
df3 = [[[a],[b],[c]],
[[d],[e],[f]],
[[g],[h],[i]]]

I think you're looking for pandas df.fillna()::
d1 = [['a','b','c'],[None,'e',None],['g','h','i']]
d2 = [[None,None,None],['d','x','f'],['y','z','z']]
df1,df2 = pd.DataFrame(d1),pd.DataFrame(d2)
df1.fillna(df2)
0 1 2
0 a b c
1 d e f
2 g h i

You can also use this:
df1[df1.isnull()] = df2.values

Related

drop rows from a Pandas dataframe based on which rows have missing values in another dataframe

I'm trying to drop rows with missing values in any of several dataframes.
They all have the same number of rows, so I tried this:
model_data_with_NA = pd.concat([other_df,
standardized_numerical_data,
encode_categorical_data], axis=1)
ok_rows = ~(model_data_with_NA.isna().all(axis=1))
model_data = model_data_with_NA.dropna()
assert(sum(ok_rows) == len(model_data))
False!
As a newbie in Python, I wonder why this doesn't work? Also, is it better to use hierarchical indexing? Then I can extract the original columns from model_data.
In Short
I believe the all in ~(model_data_with_NA.isna().all(axis=1)) should be replaced with any.
The reason is that all checks here if every value in a row is missing, and any checks if one of the values is missing.
Full Example
import pandas as pd
import numpy as np
df1 = pd.DataFrame({'a':[1, 2, 3]})
df2 = pd.DataFrame({'b':[1, np.nan]})
df3 = pd.DataFrame({'c': [1, 2, np.nan]})
model_data_with_na = pd.concat([df1, df2, df3], axis=1)
ok_rows = ~(model_data_with_na.isna().any(axis=1))
model_data = model_data_with_na.dropna()
assert(sum(ok_rows) == len(model_data))
model_data_with_na
a
b
c
0
1
1
1
1
2
nan
2
2
3
nan
nan
model_data
a
b
c
0
1
1
1

Is there a way to use .loc on column names instead of the values inside the columns?

I am wondering if there is a way to use .loc to check to sort df with certain column names == something else on another df. I know you can usually use it to check if the value is == to something, but what about the actual column name itself?
ex.
df1 = [ 0, 1, 2, 3]
df2 .columns = [2,4,6]
Is there a way to only display df2 values where the column name is == df1 without hardcoding it and saying df2.loc[:, ==2]?
IIUC, you can use df2.columns.intersection to get columns only present in df1:
>>> df1
A B D F
0 0.431332 0.663717 0.922112 0.562524
1 0.467159 0.549023 0.139306 0.168273
>>> df2
A B C D E F
0 0.451493 0.916861 0.257252 0.600656 0.354882 0.109236
1 0.676851 0.585368 0.467432 0.594848 0.962177 0.714365
>>> df2[df2.columns.intersection(df1.columns)]
A B D F
0 0.451493 0.916861 0.600656 0.109236
1 0.676851 0.585368 0.594848 0.714365
One solution:
df3 = df2[[c for c in df2.columns if c in df1]]

Lookup row in pandas dataframe

I have two dataframes (A & B). For each row in A I would like to look up some information that is in B. I tried:
A = pd.DataFrame({'X' : [1,2]}, index=[4,5])
B = pd.DataFrame({'Y' : [3,4,5]}, index=[4,5,6])
C = pd .DataFrame(A.index)
C .columns = ['I']
C['Y'] = B .loc[C.I, 'Y']
I wanted '3, 4' but I got 'NaN', 'NaN'.
Use A.join(B).
The result is:
X Y
4 1 3
5 2 4
Joining is by index and value from B for key 5 is absent, since A does
not contain this key.
What you should do is make the index same , pandas is index sensitive , which mean they will check the index when do assignment
C = pd .DataFrame(A.index,index=A.index) # change here
C .columns = ['I']
C['Y'] = B .loc[C.I, 'Y']
C
Out[770]:
I Y
4 4 3
5 5 4
Or just modify your code adding .values at the end
C['Y'] = B .loc[C.I, 'Y'].values
Since you mentioned lookup let us using lookup
C['Y']=B.lookup(C.I,['Y']*len(C))
#Out[779]: array([3, 4], dtype=int64)

Pandas: Selecting rows by list

I tried following code to select columns from a dataframe. My dataframe has about 50 values. At the end, I want to create the sum of selected columns, create a new column with these sum values and then delete the selected columns.
I started with
columns_selected = ['A','B','C','D','E']
df = df[df.column.isin(columns_selected)]
but it said AttributeError: 'DataFrame' object has no attribute 'column'
Regarding the sum: As I don't want to write for the sum
df['sum_1'] = df['A']+df['B']+df['C']+df['D']+df['E']
I also thought that something like
df['sum_1'] = df[columns_selected].sum(axis=1)
would be more convenient.
You want df[columns_selected] to sub-select the df by a list of columns
you can then do df['sum_1'] = df[columns_selected].sum(axis=1)
To filter the df to just the cols of interest pass a list of the columns, df = df[columns_selected] note that it's a common error to just a list of strings: df = df['a','b','c'] which will raise a KeyError.
Note that you had a typo in your original attempt:
df = df.loc[:,df.columns.isin(columns_selected)]
The above would've worked, firstly you needed columns not column, secondly you can use the boolean mask as a mask against the columns by passing to loc or ix as the column selection arg:
In [49]:
df = pd.DataFrame(np.random.randn(5,5), columns=list('abcde'))
df
Out[49]:
a b c d e
0 -0.778207 0.480142 0.537778 -1.889803 -0.851594
1 2.095032 1.121238 1.076626 -0.476918 -0.282883
2 0.974032 0.595543 -0.628023 0.491030 0.171819
3 0.983545 -0.870126 1.100803 0.139678 0.919193
4 -1.854717 -2.151808 1.124028 0.581945 -0.412732
In [50]:
cols = ['a','b','c']
df.ix[:, df.columns.isin(cols)]
Out[50]:
a b c
0 -0.778207 0.480142 0.537778
1 2.095032 1.121238 1.076626
2 0.974032 0.595543 -0.628023
3 0.983545 -0.870126 1.100803
4 -1.854717 -2.151808 1.124028

pd.dataframe.apply() create multiple new columns

I have a bunch of files where I want to open, read the first line, parse it into several expected pieces of information, and then put the filenames and those data as rows in a dataframe. My question concerns the recommended syntax to build the dataframe in a pandanic/pythonic way (the file-opening and parsing I already have figured out).
For a dumbed-down example, the following seems to be the recommended thing to do when you want to create one new column:
df = pd.DataFrame(files, columns=['filename'])
df['first_letter'] = df.apply(lambda x: x['filename'][:1], axis=1)
but I can't, say, do this:
df['first_letter'], df['second_letter'] = df.apply(lambda x: (x['filename'][:1], x['filename'][1:2]), axis=1)
as the apply function creates only one column with tuples in it.
Keep in mind that, in place of the lambda function I will place a function that will open the file and read and parse the first line.
You can put the two values in a Series, and then it will be returned as a dataframe from the apply (where each series is a row in that dataframe). With a dummy example:
In [29]: df = pd.DataFrame(['Aa', 'Bb', 'Cc'], columns=['filenames'])
In [30]: df
Out[30]:
filenames
0 Aa
1 Bb
2 Cc
In [31]: df['filenames'].apply(lambda x : pd.Series([x[0], x[1]]))
Out[31]:
0 1
0 A a
1 B b
2 C c
This you can then assign to two new columns:
In [33]: df[['first', 'second']] = df['filenames'].apply(lambda x : pd.Series([x[0], x[1]]))
In [34]: df
Out[34]:
filenames first second
0 Aa A a
1 Bb B b
2 Cc C c