How to merge two differently multi-indexed dataframes - pandas

I have two multi-indexed dataframes: df1 with three index levels and df2 with two. The indices resulted from df1.groupby([col_1, col_2, col_3]) and df2.groupby([col_1, col_2]). col_1 and col_2 are the same in both dataframes, but because of the third level the lengths differ: df1 has 2425 rows and df2 has 783.
What I'm trying to do is merge both dataframes so that df2's values get spread out over df1's third level, i.e. index levels 0 and 1 line up between df1 and df2 and the resulting dataframe also has 2425 rows.
I used df3 = df1.merge(df2, left_index=True, right_index=True), but the resulting dataframe has only 2385 rows.
I used df3 = pd.concat([df1, df2], axis=1), but it raised ValueError: operands could not be broadcast together with shapes.
Is there an elegant way to solve this? I appreciate any help.
EDIT: data sample
df1:
Areaclccat1990 ... Areaclccat2012
FID_Weser_Catchments_134_WQ_Stations_FINAL_LAEA... SNR1 gridcode_1 ...
0 3152 1 0.002764 ... 0.007248
2 0.980105 ... 0.972941
3 0.005049 ... 0.017166
4 0.012082 ... 0.002645
3155 1 NaN ... 0.000003
2 1.000000 ... 0.996788
3 NaN ... 0.003209
3255 1 NaN ... 0.058950
2 0.989654 ... 0.941050
4 0.010346 ... NaN
5958 1 NaN ... 0.004463
2 0.955098 ... 0.958452
3 0.014408 ... 0.027835
4 0.030494 ... 0.009250
5966 1 0.007184 ... 0.011448
2 0.955668 ... 0.949824
3 0.037148 ... 0.038728
5970 1 NaN ... 0.001141
2 0.979750 ... 0.930495
3 0.011281 ... 0.068364
df2:
Areaclccat1990 ... Areaclccat2012
FID_Weser_Catchments_134_WQ_Stations_FINAL_LAEA... SNR1 ...
0 3152 1654.636456 ... 1550.415658
3155 1820.433231 ... 1758.125539
3255 43.056576 ... 39.436385
5958 2306.806057 ... 2120.791289
5966 7.444977 ... 5.763853
5970 3087.717009 ... 2615.253450
6435 240.342745 ... 255.033888
6534 647.293171 ... 621.116222
6535 9929.136397 ... 9653.021903
6611 947.912232 ... 754.783147
6631 13528.073523 ... 13545.356498
6632 14023.097062 ... 13897.394309
6633 5913.895620 ... 5398.585720
6634 17463.795952 ... 17159.138628
6635 10791.618411 ... 10306.725199
6636 9664.138661 ... 9742.442935
9473 131.268559 ... 128.477078
9672 107.831005 ... 102.464959
9673 13.044806 ... 29.566828
16051 443.810802 ... 428.493495

Convert the third index level to a column before the merge, and use how='left' for a left join:
df3 = df1.reset_index(level=2).merge(df2, left_index=True, right_index=True, how='left')
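For illustration, a minimal sketch with made-up frames (the names col_1, col_2, col_3, share and area are placeholders, not the original columns). It moves the third level into a column, left-joins so df2's rows are broadcast over df1's third level, and then pushes the third level back into the index:
import pandas as pd

# Hypothetical stand-ins: df1 grouped by three keys, df2 by the first two only.
df1 = pd.DataFrame({'col_1': [0, 0, 0, 0], 'col_2': [3152, 3152, 3155, 3155],
                    'col_3': [1, 2, 1, 2], 'share': [0.1, 0.9, 0.4, 0.6]}
                   ).set_index(['col_1', 'col_2', 'col_3'])
df2 = pd.DataFrame({'col_1': [0, 0], 'col_2': [3152, 3155],
                    'area': [1654.6, 1820.4]}).set_index(['col_1', 'col_2'])

df3 = (df1.reset_index(level=2)                      # third level becomes a column
          .merge(df2, left_index=True, right_index=True, how='left')
          .set_index('col_3', append=True))          # optional: restore the MultiIndex
print (df3)                                          # 4 rows; df2 values repeated per col_3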

Related

Data manipulation with time series

My dataframe looks like this:
ID date var1 var2
0 1100289299522 2020-12-01 109.046450 8.0125
1 1100289299522 2020-12-02 104.494946 6.1500
2 1100289299522 2020-12-03 117.011582 5.9375
3 1100289299522 2020-12-04 109.615388 5.4750
4 1100289299522 2020-12-05 142.803438 3.8500
... ... ... ... ...
960045 21380318319578 2021-05-27 7.524261 15.4875
960046 21380318319578 2021-05-28 3.256770 17.3625
960047 21380318319578 2021-05-29 0.561512 18.3250
960048 21380318319578 2021-05-30 1.347629 18.7625
960049 21380318319578 2021-05-31 0.112302 20.0750
Is there a simple way in pandas to have one ID per row and set columns like this:
ID 2020-12-01_var1 2020-12-02_var1 ... 2021-05-31_var1 2020-12-01_var2 2020-12-02_var2 ... 2021-05-31_var2
1100289299522 109.046450 104.494946 ... ___ 8.0125 6.1500 ... ___
Then I can use a dimensionality reduction algorithm (like TSNE) and maybe classify each time series (and ID).
Do you think this is the correct way to proceed?
Try:
out = df.pivot(index='ID', columns='date', values=['var1', 'var2'])
out.columns = out.columns.to_flat_index().str.join('_')
For your sample:
>>> out
var1_2020-12-01 var1_2020-12-02 var1_2020-12-03 ... var2_2021-05-29 var2_2021-05-30 var2_2021-05-31
ID ...
1100289299522 109.04645 104.494946 117.011582 ... NaN NaN NaN
21380318319578 NaN NaN NaN ... 18.325 18.7625 20.075
[2 rows x 20 columns]
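Note that this gives var_date style names (e.g. var1_2020-12-01). If you prefer the date_var style asked for in the question, a small variation (my own sketch, not part of the answer above, and assuming the date column holds strings as printed in the sample) is to join the tuples in the other order:
out = df.pivot(index='ID', columns='date', values=['var1', 'var2'])
out.columns = [f'{date}_{var}' for var, date in out.columns]  # e.g. '2020-12-01_var1'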

selecting nan values in a pandas dataframe using loc [duplicate]

Given this dataframe, how to select only those rows that have "Col2" equal to NaN?
df = pd.DataFrame([range(3), [0, np.NaN, 0], [0, 0, np.NaN], range(3), range(3)], columns=["Col1", "Col2", "Col3"])
which looks like:
0 1 2
0 0 1 2
1 0 NaN 0
2 0 0 NaN
3 0 1 2
4 0 1 2
The result should be this one:
0 1 2
1 0 NaN 0
Try the following:
df[df['Col2'].isnull()]
@qbzenker provided the most idiomatic method IMO
Here are a few alternatives:
In [28]: df.query('Col2 != Col2') # Using the fact that: np.nan != np.nan
Out[28]:
Col1 Col2 Col3
1 0 NaN 0.0
In [29]: df[np.isnan(df.Col2)]
Out[29]:
Col1 Col2 Col3
1 0 NaN 0.0
If you want to select rows with at least one NaN value, then you could use isna + any on axis=1:
df[df.isna().any(axis=1)]
If you want to select rows with a certain number of NaN values, then you could use isna + sum on axis=1 + gt. For example, the following will fetch rows with at least 2 NaN values:
df[df.isna().sum(axis=1)>1]
If you want to limit the check to specific columns, you could select them first, then check:
df[df[['Col1', 'Col2']].isna().any(axis=1)]
If you want to select rows with all NaN values, you could use isna + all on axis=1:
df[df.isna().all(axis=1)]
If you want to select rows with no NaN values, you could use notna + all on axis=1:
df[df.notna().all(axis=1)]
This is equivalent to:
df[df['Col1'].notna() & df['Col2'].notna() & df['Col3'].notna()]
which could become tedious if there are many columns. Instead, you could use functools.reduce to chain & operators:
import functools, operator
df[functools.reduce(operator.and_, (df[i].notna() for i in df.columns))]
or numpy.logical_and.reduce:
import numpy as np
df[np.logical_and.reduce([df[i].notna() for i in df.columns])]
If you're looking to filter the rows where there is no NaN in a given column using query, you can do so with the engine='python' parameter:
df.query('Col2.notna()', engine='python')
or use the fact that NaN != NaN, as @MaxU - stop WAR against UA did above:
df.query('Col2==Col2')
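Putting a few of these together on the sample frame from the question (a quick runnable recap, nothing beyond the answers above):
import numpy as np
import pandas as pd

df = pd.DataFrame([range(3), [0, np.nan, 0], [0, 0, np.nan], range(3), range(3)],
                  columns=['Col1', 'Col2', 'Col3'])

print (df[df['Col2'].isnull()])      # rows where Col2 is NaN
print (df[df.isna().any(axis=1)])    # rows with at least one NaN
print (df[df.notna().all(axis=1)])   # rows with no NaN at all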

adding lists with different length to a new dataframe

I have two lists with different lengths, like a=[1,2,3] and b=[2,3]. I would like to generate a pd.DataFrame from them by padding NaN at the beginning of the shorter list, like this:
a b
1 1 nan
2 2 2
3 3 3
I would appreciate a clean way of doing this.
Use itertools.zip_longest, reversing the lists so the padding ends up at the front:
import numpy as np
import pandas as pd
from itertools import zip_longest

a = [1, 2, 3]
b = [2, 3]

L = [a, b]
iterables = (reversed(it) for it in L)
out = list(reversed(list(zip_longest(*iterables, fillvalue=np.nan))))
df = pd.DataFrame(out, columns=['a', 'b'])
print (df)
a b
0 1 NaN
1 2 2.0
2 3 3.0
Alternatively, if b is known to be the shorter list:
df = pd.DataFrame(list(zip(a, ([np.nan]*(len(a)-len(b)))+b)), columns=['a','b'])
print (df)
a b
0 1 NaN
1 2 2.0
2 3 3.0
b.append(np.nan)  # append NaN
b = list(set(b))  # use a set to reorder, then convert back to a list
df = pd.DataFrame(list(zip(a, b)), columns=['a', 'b'])
Alternatively
b.append(np.nan)  # append NaN
b = list(dict.fromkeys(b))  # dict.fromkeys keeps one of each value, in insertion order
df = pd.DataFrame(list(zip(a, b)), columns=['a', 'b'])
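If you need this for more than two lists, here is a small generic helper (my own sketch; pad_front is a hypothetical name, not from the answers above) that left-pads every list with NaN up to the length of the longest one:
import numpy as np
import pandas as pd

def pad_front(columns):
    # left-pad each list with NaN so all columns end up the same length
    longest = max(len(v) for v in columns.values())
    return pd.DataFrame({name: [np.nan] * (longest - len(v)) + list(v)
                         for name, v in columns.items()})

df = pad_front({'a': [1, 2, 3], 'b': [2, 3]})
print (df)
   a    b
0  1  NaN
1  2  2.0
2  3  3.0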

Pandas columns headers split

I have a dataframe with column headers made up of 3 tags separated by '__', e.g.
A__2__66 B__4__45
0
1
2
3
4
5
I know I can split the header and just use the first tag with this code: df.columns = df.columns.str.split('__').str[0]
giving:
A B
0
1
2
3
4
5
Is there a way I can use a combination of the tags, for example tags 1 and 3,
giving:
A__66 B__45
0
1
2
3
4
5
I've tried the below but it's not working:
df.columns=df.columns.str.split('__').str[0]+'__'+df.columns.str.split('__').str[2]
With specific regex substitution:
In [124]: df.columns.str.replace(r'__[^_]+__', '__', regex=True)
Out[124]: Index(['A__66', 'B__45'], dtype='object')
Use Index.map with an f-string to select the first and third elements of the split lists:
df.columns = df.columns.str.split('__').map(lambda x: f'{x[0]}__{x[2]}')
print (df)
A__66 B__45
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 NaN NaN
4 NaN NaN
5 NaN NaN
You can also try split and join:
df.columns=['__'.join((i[0],i[-1])) for i in df.columns.str.split('__')]
#Columns: [A__66, B__45]
I found your own solution perfectly fine, and probably the most readable. It just needs a little adjustment:
df.columns = df.columns.str.split('__').str[0] + '__' + df.columns.str.split('__').str[-1]
Index(['A__66', 'B__45'], dtype='object')
Or for the sake of efficiency, we do not want to call str.split twice:
lst_split = df.columns.str.split('__')
df.columns = lst_split.str[0] + '__' + lst_split.str[-1]
Index(['A__66', 'B__45'], dtype='object')
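For completeness, here is the regex answer above on a reproducible frame (the empty frame below is just a stand-in for the question's data):
import pandas as pd

df = pd.DataFrame(columns=['A__2__66', 'B__4__45'], index=range(6))

# drop the middle '__tag__' chunk in a single substitution
df.columns = df.columns.str.replace(r'__[^_]+__', '__', regex=True)
print (df.columns)
Index(['A__66', 'B__45'], dtype='object')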

Merge two DataFrames on multiple columns

Hope you can help me. I have two pretty big datasets.
DF1 Example:
id   A_Workflow_Type_ID   B_Workflow_Type_ID   ...
1    123                  456                  ...
2    789                  222                  ...
3    333                  NULL                 ...
DF2 Example:
Workflow| Operation | Profile | Type | Name | ...
123 1 2 Low_Cost xyz ...
456 2 5 High_Cost z ...
I need to merge the two datasets without creating many NaNs and duplicate columns, so I want to merge A_Workflow_Type_ID and B_Workflow_Type_ID from DF1 onto Workflow from DF2.
I tried several join operations in pandas as well as merge, but they fail.
My last try:
all_Data = pd.merge(left=DF1,right=DF2, how='inner', left_on =['A_Workflow_Type_ID ','B_Workflow_Type_ID '], right_on=['Workflow'])
But that returns an error saying both sides have to be of equal length.
Thanks for the help!
You need to reshape first with melt and then merge:
# select every column whose name does not contain 'Workflow'
cols = DF1.columns[~DF1.columns.str.contains('Workflow')]
print (cols)
Index(['id'], dtype='object')
df = DF1.melt(cols, value_name='Workflow', var_name='type')
print (df)
id type Workflow
0 1 A_Workflow_Type_ID 123.0
1 2 A_Workflow_Type_ID 789.0
2 3 A_Workflow_Type_ID 333.0
3 1 B_Workflow_Type_ID 456.0
4 2 B_Workflow_Type_ID 222.0
5 3 B_Workflow_Type_ID NaN
all_Data = pd.merge(left=df,right=DF2, on ='Workflow')
print (all_Data)
id type Workflow Operation Profile Type Name
0 1 A_Workflow_Type_ID 123 1 2 Low_Cost xyz
1 1 B_Workflow_Type_ID 456 2 5 High_Cost z
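A minimal reproduction of that answer, with DF1 and DF2 rebuilt from the samples above (the dtype cast is my own addition, since the NaN in B_Workflow_Type_ID turns the melted key into floats):
import numpy as np
import pandas as pd

DF1 = pd.DataFrame({'id': [1, 2, 3],
                    'A_Workflow_Type_ID': [123, 789, 333],
                    'B_Workflow_Type_ID': [456, 222, np.nan]})
DF2 = pd.DataFrame({'Workflow': [123, 456], 'Operation': [1, 2], 'Profile': [2, 5],
                    'Type': ['Low_Cost', 'High_Cost'], 'Name': ['xyz', 'z']})

cols = DF1.columns[~DF1.columns.str.contains('Workflow')]
df = DF1.melt(cols, value_name='Workflow', var_name='type')
DF2['Workflow'] = DF2['Workflow'].astype(float)   # match the melted key's dtype
all_Data = pd.merge(left=df, right=DF2, on='Workflow')
print (all_Data)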