Joining two DataFrames with Pandas, one with 1 row per key, and the other with several rows per key - pandas

First, I want to point out that I didn't find the answer to my question here on Stack Overflow or in the pandas documentation, so if this question has been asked before, I'd appreciate a link to that thread.
I want to join two DataFrames as follows.
df1 =
key x y z
0 x0 y0 z0
1 x1 y1 z1
...
10 x10 y10 z10
df2 =
key w v u
0 w0 v0 u0
0 w0 v0 u0
0 w0 v0 u0
1 w1 v1 u1
1 w1 v1 u1
2 w2 v2 u2
3 w3 v3 u3
...
10 w10 v10 u10
10 w10 v10 u10
desired_df_output =
key x y z w v u
0 x0 y0 z0 w0 v0 u0
1 x1 y1 z1 w1 v1 u1
...
10 x10 y10 z10 w10 v10 u10
I've tried df1.join(df2, how='inner', on='key'), but I get this error: TypeError: object of type 'NoneType' has no len().
Thanks

It seems df2 has duplicate values, so if you drop them using the drop_duplicates method and merge the result with df1, you get the desired outcome.
out = df1.merge(df2.drop_duplicates(), on='key')
Output:
key x y z w v u
0 0 x0 y0 z0 w0 v0 u0
1 1 x1 y1 z1 w1 v1 u1
2 10 x10 y10 z10 w10 v10 u10
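For what it's worth, the join error likely stems from join aligning on df2's index rather than on its key column. A join-based sketch, assuming key is a plain column in both frames, would move key into df2's index first:
# join aligns on the other frame's index, so set df2's key as its index
out = df1.join(df2.drop_duplicates().set_index('key'), on='key', how='inner')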

import pandas as pd

df1 = pd.DataFrame({'k': [0, 1, 2, 3],
                    'x': ['x0', 'x1', 'x2', 'x3'],
                    'y': ['y0', 'y1', 'y2', 'y3'],
                    'z': ['z0', 'z1', 'z2', 'z3']
                    })
df1.set_index('k', inplace=True)
df2 = pd.DataFrame({'k': [0, 0, 0, 1, 1, 1],
                    'v': ['v0', 'v0', 'v0', 'v1', 'v1', 'v1'],
                    'w': ['w0', 'w0', 'w0', 'w1', 'w1', 'w1'],
                    'u': ['u0', 'u0', 'u0', 'u1', 'u1', 'u1']
                    })
df2.set_index('k', inplace=True)
df_merged = pd.merge(df1, df2.drop_duplicates(), how='inner',
                     left_index=True, right_index=True)
df_merged
x y z v w u
k
0 x0 y0 z0 v0 w0 u0
1 x1 y1 z1 v1 w1 u1
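One safeguard worth adding: if the duplicates in df2 were ever inconsistent (same key, different values), drop_duplicates would silently leave several rows per key. merge's validate argument guards against that by raising a MergeError when either side still has duplicate keys:
# raises MergeError if either frame still has more than one row per key
df_merged = pd.merge(df1, df2.drop_duplicates(), how='inner',
                     left_index=True, right_index=True, validate='one_to_one')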

Related

Iterating over all columns of dataframe to find list of strings

Suppose I have the following df:
df = pd.DataFrame({
    'col1': ['x1', 'x2', 'x3'],
    'col2': ['y1', 'y2', 'y3'],
    'col3': ['z1', 'z2', 'z3'],
    'col4': ['a1', 'b2', 'c3']
})
and a list of elements:
l = ['x1','x2','y3']
I want to search for the elements of l in all the columns of my df; as it stands, from my list, x1 and x2 appear in col1 and y3 is in col2, so I did:
df.loc[df['col1'].apply(lambda x: True if any(i in x for i in l) else False)|
df['col2'].apply(lambda x: True if any(i in x for i in l) else False)]
which gives me
col1 col2 col3 col4
0 x1 y1 z1 a1
1 x2 y2 z2 b2
2 x3 y3 z3 c3
as desired, but the above method requires me to chain a | operator for each column. So I wonder: how can I do this iteration over all columns efficiently without using | for every column?
A much, much more efficient way of doing this would be to use numpy broadcasting. Note that l has to be a numpy array for the [:, None, None] indexing to work:
import numpy as np

# compare every cell against every search term, then keep rows with any match
row_mask = (df.to_numpy() == np.asarray(l)[:, None, None]).sum(axis=0).any(axis=1)
filtered = df[row_mask]
Output:
>>> filtered
col1 col2 col3 col4
0 x1 y1 z1 a1
1 x2 y2 z2 b2
2 x3 y3 z3 c3
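If raw speed matters less than readability, DataFrame.isin gives the same row mask for exact cell matches without manual broadcasting:
# isin does an exact elementwise membership test against the list
filtered = df[df.isin(l).any(axis=1)]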

Pandas Dataframe transformation - Understanding problems with functions I should use and logic I should opt for

I've got a hard problem with transforming a dataframe into another one.
I don't know what functions I should use to do what I want. I had some ideas that didn't work at all.
For example, I do not understand how I should use append (or if I should use it or something else) to do what I want.
Here is my original dataframe:
df1 = pd.DataFrame({
    'Key': ['K0', 'K1', 'K2'],
    'X0': ['a', 'b', 'a'],
    'Y0': ['c', 'd', 'c'],
    'X1': ['e', 'f', 'f'],
    'Y1': ['g', 'h', 'h']
})
Key X0 Y0 X1 Y1
0 K0 a c e g
1 K1 b d f h
2 K2 a c f h
This dataframe represents all the links attached to an ID in the Key column. Links follow each other: X0->Y0 is the parent of X1->Y1.
It's easy to read, and the real dataframe I'm working with has 6500 rows by 21 columns and represents a tree of links. So this dataframe follows an end-to-end link logic.
I want to transform it into another one that follows a unitary-link and ID logic (because it's a tree of links, some unitary links may be part of multiple end-to-end links).
So I want to get each individual link as X->Y, and I need the list of the Keys attached to each unitary link in Key.
And here is what I want:
df3 = pd.DataFrame({
    'Key': [['K0', 'K2'], 'K1', 'K0', ['K1', 'K2']],
    'X': ['a', 'b', 'e', 'f'],
    'Y': ['c', 'd', 'g', 'h']
})
Key X Y
0 [K0, K2] a c
1 K1 b d
2 K0 e g
3 [K1, K2] f h
To do this, I first have to combine X0 and X1 into a single X column, and likewise Y0 and Y1 into a single Y column. At the same time I need to keep the keys attached to the links. This first transformation leads to a new dataframe containing all the original information, with duplicates, which I will deal with afterwards to obtain df3.
Here is the transition dataframe:
df2 = pd.DataFrame({
    'Key': ['K0', 'K1', 'K2', 'K0', 'K1', 'K2'],
    'X': ['a', 'b', 'a', 'e', 'f', 'f'],
    'Y': ['c', 'd', 'c', 'g', 'h', 'h']
})
Key X Y
0 K0 a c
1 K1 b d
2 K2 a c
3 K0 e g
4 K1 f h
5 K2 f h
Transition from df1 to df2
For now, I did this to put X0, X1 and Y0, Y1 into X and Y:
Key = pd.Series(dtype=str)
X = pd.Series(dtype=str)
Y = pd.Series(dtype=str)
for i in df1.columns:
    if 'K' in i:
        Key = Key.append(df1[i], ignore_index=True)
    elif 'X' in i:
        X = X.append(df1[i], ignore_index=True)
    elif 'Y' in i:
        Y = Y.append(df1[i], ignore_index=True)
0 K0
1 K1
2 K2
dtype: object
0 a
1 b
2 a
3 e
4 f
5 f
dtype: object
0 c
1 d
2 c
3 g
4 h
5 h
dtype: object
But I do not know how to get the keys so that they stay in front of the right links.
Also, I do this to construct df2, but it feels awkward and I am not sure it is the right way:
df2 = pd.DataFrame({
    'Key': Key,
    'X': X,
    'Y': Y
})
Key X Y
0 K0 a c
1 K1 b d
2 K2 a c
3 NaN e g
4 NaN f h
5 NaN f h
I tried to use append to combine the X0, X1 and Y0, Y1 columns directly into df2, but it turned into a complete mess, not filling the df2 columns with the df1 columns' content. I also tried to use append to put the Series Key, X and Y directly into df2, but it gave me X and Y as rows instead of columns.
In short, I'm quite lost with it. I know there may be a lot to program to take df1, turn it into df2 and then into df3. I'm not asking you to solve the problem for me, but I really need help with the functions I should use or the logic I should put in place to achieve my goal.
To convert df1 to df2 you want to look into pandas.wide_to_long.
https://pandas.pydata.org/docs/reference/api/pandas.wide_to_long.html
>>> df2 = pd.wide_to_long(df1, stubnames=['X','Y'], i='Key', j='num')
>>> df2
X Y
Key num
K0 0 a c
K1 0 b d
K2 0 a c
K0 1 e g
K1 1 f h
K2 1 f h
You can drop the unwanted level "num" from the index using droplevel and turn the index level "Key" into a column using reset_index. Chaining everything:
>>> df2 = (
...     pd.wide_to_long(df1, stubnames=['X', 'Y'], i='Key', j='num')
...       .droplevel(level='num')
...       .reset_index()
... )
>>> df2
Key X Y
0 K0 a c
1 K1 b d
2 K2 a c
3 K0 e g
4 K1 f h
5 K2 f h
Finally, to get df3 you just need to group df2 by "X" and "Y", and aggregate the "Key" groups into lists.
>>> df3 = df2.groupby(['X','Y'], as_index=False).agg(list)
>>> df3
X Y Key
0 a c [K0, K2]
1 b d [K1]
2 e g [K0]
3 f h [K1, K2]
If you don't want single keys to be wrapped in lists, you can do this instead:
>>> df3 = (
...     df2.groupby(['X', 'Y'], as_index=False)
...        .agg(lambda g: g.iloc[0] if len(g) == 1 else list(g))
... )
>>> df3
X Y Key
0 a c [K0, K2]
1 b d K1
2 e g K0
3 f h [K1, K2]
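One caveat if you are on a recent pandas: the Series.append used in the question's loop was deprecated in pandas 1.4 and removed in 2.0. A sketch of the same column-stacking with pd.concat (which also repeats the keys, fixing the NaN issue above):
# pd.concat replaces the removed Series.append (pandas >= 2.0)
X = pd.concat([df1['X0'], df1['X1']], ignore_index=True)
Y = pd.concat([df1['Y0'], df1['Y1']], ignore_index=True)
Key = pd.concat([df1['Key'], df1['Key']], ignore_index=True)  # keys repeated per link level
df2 = pd.DataFrame({'Key': Key, 'X': X, 'Y': Y})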

Pandas: Convert a row into a column and make all other entries the second column

I have a pandas DataFrame like this:
Col1 Col2 Col3 Col4
control x1 x2 x3 x4
obs1 o11 o12 o13 o14
obs2 o21 o22 o23 o24
...
obsn on1 on2 on3 on4
I want to reshape it as follows (column headers are not needed):
control Observation
1 x1 o11
2 x1 o12
3 x1 o13
...
m xk ok1
m+1 xk ok2
...
How do I go about this?
You can select your "control" row and use that to set your columns via set_axis; from there it's a simple melt.
The sort_values and reset_index aren't functionally necessary, but they align the dataframe with your expected output, so I've included them here:
control = df.loc["control", :]
observations = df.drop("control")
out = (observations.set_axis(control, axis=1)
                   .melt(value_name="observation")
                   .sort_values("observation")
                   .reset_index(drop=True))
print(out)
print(out)
control observation
0 x1 o11
1 x2 o12
2 x3 o13
3 x4 o14
4 x1 o21
5 x2 o22
6 x3 o23
7 x4 o24
I think I have a crude solution but it's not elegant. Say df is my data frame.
mdf = df.melt()
for col in df.columns:
    mdf.loc[mdf['variable'] == col, 'variable'] = df.loc['control', col]
mdf.drop(mdf[mdf['variable'] == mdf['value']].index, inplace=True)
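An equivalent route to the melt-based answer, sketched here against the same df, is to stack the observation rows into long form and map each column label back to its control value:
# stack gives a MultiIndex Series keyed by (row, column);
# mapping the column level through the control row pairs each value with its control
obs = df.drop('control').stack()
out = pd.DataFrame({
    'control': obs.index.get_level_values(1).map(df.loc['control']),
    'observation': obs.to_numpy(),
})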

In a pandas dataframe with a MultiIndex, how to conditionally fill missing values with group means?

Setup:
# create a MultiIndex
dfx = pd.MultiIndex.from_product([
    list('ab'),
    list('cd'),
    list('xyz'),
], names=['idx1', 'idx2', 'idx3'])

# create a dataframe that fits the index
df = pd.DataFrame([None, .9, -.08, -2.11, 1.09, .38,
                   None, None, -.37, -.86, 1.51, -.49],
                  columns=['random_data'])
df.set_index(dfx, inplace=True)
Output:
random_data
idx1 idx2 idx3
a c x NaN
y 0.90
z -0.08
d x -2.11
y 1.09
z 0.38
b c x NaN
y NaN
z -0.37
d x -0.86
y 1.51
z -0.49
Within this index hierarchy, I am trying to accomplish the following:
When a value is missing within [idx1, idx2, idx3], fill NaN with the group mean of [idx1, idx2]
When multiple values are missing within [idx1, idx2, idx3], fill NaN with the group mean of [idx1]
I have tried df.apply(lambda col: col.fillna(col.groupby(by='idx1').mean())) as a way to solve #2, but I haven't been able to get it to work.
UPDATE
OK, so I have this solved in parts, but still at a loss about how to apply these conditionally:
For case #1:
df.unstack().apply(lambda col: col.fillna(col.mean()), axis=1).stack()
I verified that the correct value was filled by comparing against df.groupby(by=['idx1', 'idx2']).mean(), but it also replaces the missing values that I am trying to handle differently in case #2.
Similarly for #2:
df.unstack().unstack().apply(lambda col: col.fillna(col.mean()), axis=1).stack().stack()
I verified the replaced values were correct by looking at df.groupby(by=['idx1']).mean(), but it also applies to case #1, which I don't want.
I'm sure there is a more elegant way of doing this, but the following should achieve your desired result:
import numpy as np

def get_null_count(df, group_levels, column):
    # count of missing values per group, broadcast back to every row
    result = (
        df.loc[:, column]
        .groupby(group_levels)
        .transform(lambda x: x.isnull().sum())
    ).astype("int")
    return result

def fill_groups(df, count_group_levels, column, missing_count_idx_map):
    null_counts = get_null_count(df, count_group_levels, column)
    # one boolean mask per missing-count case, restricted to the NaN cells
    condition_masks = {
        count: ((null_counts == count) & df[column].isnull()).to_numpy()
        for count in missing_count_idx_map.keys()
    }
    # the group means each case should be filled with
    condition_values = {
        count: df.loc[:, column]
                 .groupby(indices)
                 .transform("mean")
                 .to_numpy()
        for count, indices in missing_count_idx_map.items()
    }
    # Defaults: non-null cells keep their original values
    condition_masks[0] = (~df[column].isnull()).to_numpy()
    condition_values[0] = df[column].to_numpy()
    sorted_keys = sorted(missing_count_idx_map.keys()) + [0]
    conditions = [condition_masks[count] for count in sorted_keys]
    values = [condition_values[count] for count in sorted_keys]
    result = np.select(conditions, values)
    return result

col = "random_data"
missing_count_idx_map = {
    1: ['idx1', 'idx2'],
    2: ['idx1']
}
df["filled"] = fill_groups(df, ['idx1', 'idx2'], col, missing_count_idx_map)
df then looks like:
random_data filled
idx1 idx2 idx3
a c x NaN -0.20
y 1.16 1.16
z -1.56 -1.56
d x 0.47 0.47
y -0.54 -0.54
z -0.30 -0.30
b c x NaN -0.40
y NaN -0.40
z 0.29 0.29
d x 0.98 0.98
y -0.41 -0.41
z -2.46 -2.46
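Note that np.select returns the value of the first condition that matches, so each NaN row is resolved by its missing-count mask before the default mask, which simply carries the original non-null values through.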
IIUC, you may try this. Get the mean of level idx1 and the mean of levels [idx1, idx2]. Fill NaN using the mean of [idx1, idx2]. Next, use mask to assign to the rows of groups having more than one NaN the mean of idx1.
Sample `df`:
random_data
idx1 idx2 idx3
a c x NaN
y -0.09
z -0.01
d x -1.30
y -0.11
z 1.33
b c x NaN
y NaN
z 0.74
d x -1.44
y 0.50
z -0.61
df1_m = df.mean(level='idx1')
df12_m = df.mean(level=['idx1', 'idx2'])
m = df.isna().groupby(level=['idx1', 'idx2']).transform('sum').gt(1)
df_filled = df.fillna(df12_m).mask(m & df.isna(), df1_m)
Out[110]:
random_data
idx1 idx2 idx3
a c x -0.0500
y -0.0900
z -0.0100
d x -1.3000
y -0.1100
z 1.3300
b c x -0.2025
y -0.2025
z 0.7400
d x -1.4400
y 0.5000
z -0.6100
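A compatibility note: DataFrame.mean(level=...) used above was deprecated in pandas 1.3 and removed in 2.0. On current pandas the same two means come from a groupby over the index levels:
# groupby(level=...) replaces the removed mean(level=...) (pandas >= 2.0)
df1_m = df.groupby(level='idx1').mean()
df12_m = df.groupby(level=['idx1', 'idx2']).mean()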
OK, solved it.
First, I made a series containing counts of non-missing values by group:
truth_table = df.apply(lambda row: row.count(), axis=1).groupby(by=['idx1', 'idx2']).sum()
>> truth_table
idx1 idx2
a c 2
d 3
b c 1
d 3
dtype: int64
Then set up a dataframe (one for each case I'm trying to resolve) containing the group means:
means_ab = df.groupby(by=['idx1']).mean()
>> means_ab
idx1
a 0.0360
b -0.0525
means_abcd = df.groupby(by=['idx1', 'idx2']).mean()
>> means_abcd
idx1 idx2
a c 0.410000
d -0.213333
b c -0.370000
d 0.053333
Given the structure of my data, I know:
Case #1 corresponds to truth_table having exactly one missing value in a given index grouping of [idx1, idx2] (i.e., these are the NaN values I want to replace with values from means_abcd)
Case #2 corresponds to truth_table having more than one missing value in a given index grouping of [idx1, idx2] (i.e., these are the NaN values I want to replace with values from means_ab)
fix_case_2 = df.combine_first(df[truth_table > 1].fillna(means_ab, axis=1))
>> fix_case_2
idx1 idx2 idx3
a c x NaN
y 0.9000
z -0.0800
d x -2.1100
y 1.0900
z 0.3800
b c x -0.0525 *
y -0.0525 *
z -0.3700
d x -0.8600
y 1.5100
z -0.4900
df = fix_case_2.combine_first(df[truth_table == 1].fillna(means_abcd, axis=1))
>> df
idx1 idx2 idx3
a c x 0.4100 *
y 0.9000
z -0.0800
d x -2.1100
y 1.0900
z 0.3800
b c x -0.0525 *
y -0.0525 *
z -0.3700
d x -0.8600
y 1.5100
z -0.4900
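For comparison, the two cases can also be consolidated into a single conditional fill with groupby(...).transform; a sketch under the same setup (column random_data, levels idx1/idx2):
# per-row group means at the two granularities
g12 = df['random_data'].groupby(level=['idx1', 'idx2']).transform('mean')
g1 = df['random_data'].groupby(level='idx1').transform('mean')
# NaN count per (idx1, idx2) group, broadcast to every row
n_missing = df['random_data'].isna().groupby(level=['idx1', 'idx2']).transform('sum')
# exactly one missing -> fine-grained mean; several missing -> coarser mean
df['filled'] = df['random_data'].fillna(g12.where(n_missing <= 1, g1))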

Pandas: set order of newly generated column

I am working on some data and wrote code that splits the values of a column (COL) on commas and writes the split data into new columns. Now, what I want is for the code to generate the new columns in the given manner (desired output). The code is attached below. Thank you in advance.
Input
X1 COL Y1
----------------
A X,Y,Z 146#12
B Z 223#13
C Y,X 725#14
Current output:
X1 Y1 COL-0 COL-1 COL-2
-----------------------------
A 146#12 X Y Z
B 223#13 Z NaN NaN
C 725#14 Y X NaN
Desired output:
X1 COL-1 COL-2 COL-3 Y1
------------------------------
A X Y Z 146#12
B Z - - 223#13
C Y X - 725#14
Script
import pandas as pd
import numpy as np

df = pd.read_csv(r"<PATH TO YOUR CSV>")
for row, item in enumerate(df["COL"]):
    l = item.split(",")
    for idx, elem in enumerate(l):
        col = "COL-%s" % idx
        if col not in df.columns:
            df[col] = np.nan
        df.loc[row, col] = elem  # .loc avoids chained-assignment pitfalls
df = df.drop(columns=["COL"])
print(df)
Use DataFrame.pop:
df['Y1'] = df.pop('Y1')
Also, the solution should be changed to use Series.str.split:
df = (df.join(df.pop('COL').str.split(',', expand=True)
                .fillna('-')
                .rename(columns=lambda x: f'COL-{x+1}')))
df['Y1'] = df.pop('Y1')
print(df)
X1 COL-1 COL-2 COL-3 Y1
0 A X Y Z 146#12
1 B Z - - 223#13
2 C Y X - 725#14
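If you'd rather not mutate df in place with pop, a single-expression sketch (assuming the same column names) assembles the output with pd.concat:
# split COL, pad with dashes, and reassemble the columns in the desired order
out = pd.concat(
    [df[['X1']],
     df['COL'].str.split(',', expand=True)
              .fillna('-')
              .rename(columns=lambda x: f'COL-{x+1}'),
     df[['Y1']]],
    axis=1)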
If you wish to replace the NaN values with dashes you can use fillna(), and to keep the columns in the order you specified you can simply define a dataframe with that column order.
df_output = df[['X1','COL-1','COL-2','COL-3','Y1']].fillna(value='-')
Not the most elegant of methods, but this should handle your real data and intended result:
import re

cols = df.filter(like='COL').columns.tolist()
pat = r'(\w+)'  # raw string so \w isn't treated as an invalid escape
new_cols = [f'{re.match(pat, col).group(0)} {i}' for i, col in enumerate(cols, 1)]
df.rename(columns=dict(zip(cols, new_cols)), inplace=True)
df['Y1'] = df.pop('Y1')
out:
X1 COL 1 COL 2 COL 3 Y1
0 A X Y Z 146#12
1 B Z NaN NaN 223#13
2 C Y X NaN 725#14