concatenate 3 pandas dataframes on index and one column - pandas

I'd like to concatenate 3 dataframes on index and on the 'type' column, where some index values are missing (dfb and dfc have incomplete indexes, while dfa has a complete index). When I do concat, some columns disappear, as shown below. (I'd like the final dataframe to have a MultiIndex so I can pick out parts of the concatenated dataframe by type, and df['type2'] should have a sorted index.)
I tried concat with various parameters but it did not work.
import pandas as pd
import numpy as np

dfa=pd.DataFrame({'type':['type1','type1','type2'],'a':[10,20,30]},index=[1,2,3])
dfb=pd.DataFrame({'type':['type1','type2'],'b':[11,21]},index=[2,3])
dfc=pd.DataFrame({'type':['type3'],'c':[33]},index=[3])
dfa
dfb
dfc
pd.concat([dfa,dfb,dfc],axis=0,keys=['type']) #wrong. columns b and c disappear!
I'd like an efficient solution, as I have 5 dataframes with 2000 "types" and the index size of each is around 10K.
Example of the desired dataframe:
pd.DataFrame({'a':[10,20,30,np.nan],'b':[np.nan,11,21,np.nan],
              'c':[np.nan,np.nan,np.nan,33],
              'type':['type1','type1','type2','type3']},
             index=[1,2,3,3])

After creating df:
dfa=pd.DataFrame({'type':['type1','type1','type2'],'a':[10,20,30]},index=[1,2,3])
dfb=pd.DataFrame({'type':['type1','type2'],'b':[11,21]},index=[2,3])
dfc=pd.DataFrame({'type':['type3'],'c':[33]},index=[3])
You can use merge and reset_index like this:
dfs = [dfa, dfb, dfc]  # ... add as many df as you wish
res = dfs[0].reset_index()
for i in range(1, len(dfs)):
    res = res.merge(dfs[i].reset_index(), how='outer',
                    left_on=['index', 'type'], right_on=['index', 'type'])
res = res.set_index('index')
print(res)
The result will be:
type a b c
index
1 type1 10.0 NaN NaN
2 type1 20.0 11.0 NaN
3 type2 30.0 21.0 NaN
3 type3 NaN NaN 33.0
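If you also want the type-based MultiIndex mentioned in the question, a minimal sketch building on res from above (the variable name res_mi and the swaplevel step are my choices, not from the question):
# move 'type' into the index so parts of the frame can be selected by type
res_mi = res.set_index('type', append=True).swaplevel().sort_index()
print(res_mi.loc['type2'])  # rows for a single type, with a sorted index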

The problem is that you aren't defining enough keys to match the number of dataframes being concatenated.
Try this:
pd.concat([dfa, dfb, dfc], axis=0, keys=['type_a', 'type_b', 'type_c'])
Output:
a b c type
type_a 1 10.0 NaN NaN type1
2 20.0 NaN NaN type1
3 30.0 NaN NaN type2
type_b 2 NaN 11.0 NaN type1
3 NaN 21.0 NaN type2
type_c 3 NaN NaN 33.0 type3
Or leave the keys parameter out altogether:
pd.concat([dfa, dfb, dfc], axis=0)
Output:
a b c type
1 10.0 NaN NaN type1
2 20.0 NaN NaN type1
3 30.0 NaN NaN type2
2 NaN 11.0 NaN type1
3 NaN 21.0 NaN type2
3 NaN NaN 33.0 type3
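If the goal is simply to pick out the rows for one type afterwards, a quick sketch on the plain concat result (the variable name out is just for illustration):
out = pd.concat([dfa, dfb, dfc], axis=0)
# boolean selection on the 'type' column
print(out[out['type'] == 'type2'])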

Related

How to keep all values from a dataframe except where NaN is present in another dataframe?

I am new to Pandas and I am stuck on this specific problem where I have 2 DataFrames, e.g.
>>> df1
A B
0 1 9
1 2 6
2 3 11
3 4 8
>>> df2
A B
0 NaN 0.05
1 NaN 0.05
2 0.16 NaN
3 0.16 NaN
What I am trying to achieve is to retain all values from df1 except where there is a NaN in df2 i.e.
>>> df3
A B
0 NaN 9
1 NaN 6
2 3 NaN
3 4 NaN
I am talking about dfs with 10,000 rows each, so I can't do this manually. Also, the indices and columns are exactly the same in each case, and I have no NaN values in df1.
As far as I understand df.update() will either overwrite all values including NaN or update only those that are NaN.
You can use boolean masking with DataFrame.notna:
# df2 = df2.astype(float)  # This is needed if your dtypes are not floats.
m = df2.notna()
df1[m]
A B
0 NaN 9.0
1 NaN 6.0
2 3.0 NaN
3 4.0 NaN
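An equivalent spelling, if you prefer a method call to boolean indexing, is DataFrame.where (a sketch assuming the same df1 and df2; df3 is just an illustrative name):
# keep df1's values where df2 has a value, NaN elsewhere
df3 = df1.where(df2.notna())
print(df3)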

How to select the rows having the same ID and all missing values in another column

I have the following dataframe:
ID col_1
1 NaN
2 NaN
3 4.0
2 NaN
2 NaN
3 NaN
3 3.0
1 NaN
I need the following output:
ID col_1
1 NaN
1 NaN
2 NaN
2 NaN
2 NaN
How can I do this in pandas?
You can create a boolean mask with isna, then group this mask by ID and transform with all; you can then filter the rows with the help of this mask:
mask = df['col_1'].isna().groupby(df['ID']).transform('all')
df[mask].sort_values('ID')
Alternatively, you can use groupby + filter to keep the groups where all values in col_1 are NaN, but this method should be slower than the above:
df.groupby('ID').filter(lambda g: g['col_1'].isna().all()).sort_values('ID')
ID col_1
0 1 NaN
7 1 NaN
1 2 NaN
3 2 NaN
4 2 NaN
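For reference, a self-contained sketch that rebuilds the frame from the question and applies the mask:
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'ID':    [1, 2, 3, 2, 2, 3, 3, 1],
    'col_1': [np.nan, np.nan, 4.0, np.nan, np.nan, np.nan, 3.0, np.nan],
})
# True for every row whose ID group contains only NaN in col_1
mask = df['col_1'].isna().groupby(df['ID']).transform('all')
print(df[mask].sort_values('ID'))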
Let us try isin after groupby with all:
s = df['col_1'].isna().groupby(df['ID']).all()
df = df.loc[df.ID.isin(s[s].index.tolist())]
df
Out[73]:
ID col_1
0 1 NaN
1 2 NaN
3 2 NaN
4 2 NaN
7 1 NaN
import pandas as pd
import numpy as np
df=pd.read_excel(r"D:\Stack_overflow\test12.xlsx")
df1=(df[df['col_1'].isnull()]).sort_values(by=['ID'])
I think we can simply take out the null values.

Forward-fill dataframe based on mask. Fill with last valid value

I have a dataframe like the following:
index,col1,col2
1,NaN,NaN
2,NaN,NaN
3,NaN,20
4,NaN,21
5,10,22
6,11,23
7,12,24
8,13,NaN
9,NaN,NaN
And a boolean mask dataframe like the following:
index,col1,col2
1,False,False
2,False,False
3,False,False
4,False,True
5,False,False
6,False,False
7,True,True
8,True,False
9,False,False
I would like to convert them to this final dataframe:
index,col1,col2
1,NaN,NaN
2,NaN,NaN
3,NaN,20
4,NaN,20
5,10,22
6,11,23
7,11,23
8,11,NaN
9,NaN,NaN
That is: forward-filling the values that are True in the mask with the last value in the column that is False in the mask.
How can I get this?
Let's try:
df.mask(mask).ffill().where(df.notna())
Output:
col1 col2
index
1 NaN NaN
2 NaN NaN
3 NaN 20.0
4 NaN 20.0
5 10.0 22.0
6 11.0 23.0
7 11.0 23.0
8 11.0 NaN
9 NaN NaN
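For completeness, a self-contained sketch of the same chain, step by step, with the data transcribed from the question (intermediate names are just for illustration):
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'col1': [np.nan, np.nan, np.nan, np.nan, 10, 11, 12, 13, np.nan],
    'col2': [np.nan, np.nan, 20, 21, 22, 23, 24, np.nan, np.nan],
}, index=range(1, 10))
mask = pd.DataFrame({
    'col1': [False, False, False, False, False, False, True, True, False],
    'col2': [False, False, False, True, False, False, True, False, False],
}, index=range(1, 10))

hidden = df.mask(mask)             # blank out the masked cells
filled = hidden.ffill()            # forward-fill from the last unmasked value
result = filled.where(df.notna())  # restore NaN where the original was NaN
print(result)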

For every row in pandas, do until sample ID changes

How can I iterate over rows in a dataframe until the sample ID changes?
my_df:
ID loc_start
sample1 10
sample1 15
sample2 10
sample2 20
sample3 5
Something like:
samples = ["sample1", "sample2" ,"sample3"]
out = pd.DataFrame()
for sample in samples:
if my_df["ID"] == sample:
my_list = []
for index, row in my_df.iterrows():
other_list = [row.loc_start]
my_list.append(other_list)
my_list = pd.DataFrame(my_list)
out = pd.merge(out, my_list)
Expected output:
sample1 sample2 sample3
10 10 5
15 20
I realize, of course, that this could be done more easily if my_df really looked like this. However, what I'm after is the principle of iterating over rows until a certain column value changes.
Based on the input & output provided, this should work.
You need to provide more info if you are looking for something else.
df.pivot(columns='ID', values = 'loc_start').rename_axis(None, axis=1).apply(lambda x: pd.Series(x.dropna().values))
output
sample1 sample2 sample3
0 10.0 10.0 5.0
1 15.0 20.0 NaN
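A reproducible sketch, with my_df built from the question's data:
import pandas as pd

my_df = pd.DataFrame({
    'ID': ['sample1', 'sample1', 'sample2', 'sample2', 'sample3'],
    'loc_start': [10, 15, 10, 20, 5],
})
out = (my_df.pivot(columns='ID', values='loc_start')
            .rename_axis(None, axis=1)
            .apply(lambda x: pd.Series(x.dropna().values)))
print(out)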
Ben.T is correct that a pivot works here. Here is an example:
import pandas as pd
import numpy as np
df = pd.DataFrame(data=np.random.randint(0, 5, (10, 2)), columns=list("AB"))
# what does the df look like? Here, I consider column A to be analogous to your "ID" column
In [5]: df
Out[5]:
A B
0 3 1
1 2 1
2 4 2
3 4 1
4 0 4
5 4 2
6 4 1
7 3 1
8 1 1
9 4 0
# now do a pivot and see what it looks like
df2 = df.pivot(columns="A", values="B")
In [8]: df2
Out[8]:
A 0 1 2 3 4
0 NaN NaN NaN 1.0 NaN
1 NaN NaN 1.0 NaN NaN
2 NaN NaN NaN NaN 2.0
3 NaN NaN NaN NaN 1.0
4 4.0 NaN NaN NaN NaN
5 NaN NaN NaN NaN 2.0
6 NaN NaN NaN NaN 1.0
7 NaN NaN NaN 1.0 NaN
8 NaN 1.0 NaN NaN NaN
9 NaN NaN NaN NaN 0.0
Not quite what you wanted. With a little help from Jezreal's answer:
df3 = df2.apply(lambda x: pd.Series(x.dropna().values))
In [20]: df3
Out[20]:
A 0 1 2 3 4
0 4.0 1.0 1.0 1.0 2.0
1 NaN NaN NaN 1.0 1.0
2 NaN NaN NaN NaN 2.0
3 NaN NaN NaN NaN 1.0
4 NaN NaN NaN NaN 0.0
The empty spots in the dataframe have to be filled with something, and NaN is used by default. Is this what you wanted?
If, on the other hand, you wanted to perform an operation on your data you would use the groupby instead.
df2 = df.groupby(by="A", as_index=False).mean()

Compare 2 columns and replace to None if found equal

The following command will replace all values of a matching row with None.
ndf.iloc[np.where(ndf.path3=='sys_bck_20190101.tar.gz')] = np.nan
What I really need to do is to replace the value of a single column called path4 if it matches with column path3. This does not work:
ndf.iloc[np.where(ndf.path3==ndf.path4), ndf.path3] = np.nan
Update:
There is a pandas method "fillna" that can be used with axis = 'columns'.
Is there a similar method to write "NA" values to the duplicate columns?
I can do this, but it does not look pythonic.
ndf.loc[ndf.path1==ndf.path2, 'path1'] = np.nan
ndf.loc[ndf.path2==ndf.path3, 'path2'] = np.nan
ndf.loc[ndf.path3==ndf.path4, 'path3'] = np.nan
ndf.loc[ndf.path4==ndf.filename, 'path4'] = np.nan
Update 2
Let me explain the issue:
Assuming this dataframe:
ndf = pd.DataFrame({
    'path1': [4, 5, 4, 5, 5, 4],
    'path2': [4, 5, 4, 5, 5, 4],
    'path3': list('abcdef'),
    'path4': list('aaabef'),
    'col': list('aaabef')
})
The expected result:
path1 path2 path3 path4 col
0 NaN 4.0 NaN NaN a
1 NaN 5.0 b NaN a
2 NaN 4.0 c NaN a
3 NaN 5.0 d NaN b
4 NaN 5.0 NaN NaN e
5 NaN 4.0 NaN NaN f
As you can see, this is the reverse of fillna, and I guess there is no easy way to do this in pandas. I have already mentioned the commands I can use; I would like to know if there is a better way to achieve this.
Use:
for c1, c2 in zip(ndf.columns, ndf.columns[1:]):
    ndf.loc[ndf[c1]==ndf[c2], c1] = np.nan
print(ndf)
path1 path2 path3 path4 col
0 NaN 4.0 NaN NaN a
1 NaN 5.0 b NaN a
2 NaN 4.0 c NaN a
3 NaN 5.0 d NaN b
4 NaN 5.0 NaN NaN e
5 NaN 4.0 NaN NaN f
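The same idea can also be written with Series.mask, if that reads more clearly (a sketch over the same ndf; the behaviour is identical):
# blank out each column where it equals the column to its right
for c1, c2 in zip(ndf.columns, ndf.columns[1:]):
    ndf[c1] = ndf[c1].mask(ndf[c1].eq(ndf[c2]))
print(ndf)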