I am trying to use Pandas for data analysis. I need to merge two dataframes according to their indexes. However, their indexes are totally different. The rule is that if an index value of df2 is a substring of an index value of df1, then those rows should be merged. For example, df1.index == ['a/aa/aaa', 'b/bb/bbb', 'c/cc/ccc'] and df2.index == ['bb/bbb', 'ccc', 'hello']. Then df1 and df2 have two matching index pairs, and we should merge them based on those. What should I do?
Starting with your DataFrames:
>>> df1 = pd.DataFrame({'col_a': [1, 2, 3]}, index=['a/aa/aaa','b/bb/bbb', 'c/cc/ccc'])
>>> df2 = pd.DataFrame({'col_b': [4, 5, 6]}, index=['bb/bbb', 'ccc', 'hello'])
First, convert the index of each frame to a column:
>>> df1 = df1.reset_index(drop=False)
>>> df1 = df1.rename(columns={'index': 'value_df1'})
>>> df1
value_df1 col_a
0 a/aa/aaa 1
1 b/bb/bbb 2
2 c/cc/ccc 3
>>> df2 = df2.reset_index(drop=False)
>>> df2 = df2.rename(columns={'index': 'value_df2'})
>>> df2
value_df2 col_b
0 bb/bbb 4
1 ccc 5
2 hello 6
We cross-join both DataFrames on a temporary join column:
>>> df1['join'] = 1
>>> df2['join'] = 1
>>> dfFull = df1.merge(df2, on='join').drop('join', axis=1)
>>> dfFull
value_df1 col_a value_df2 col_b
0 a/aa/aaa 1 bb/bbb 4
1 a/aa/aaa 1 ccc 5
2 a/aa/aaa 1 hello 6
3 b/bb/bbb 2 bb/bbb 4
4 b/bb/bbb 2 ccc 5
5 b/bb/bbb 2 hello 6
6 c/cc/ccc 3 bb/bbb 4
7 c/cc/ccc 3 ccc 5
8 c/cc/ccc 3 hello 6
Then we use an apply to test whether each df2 value is a substring of the corresponding df1 value:
>>> df2.drop('join', axis=1, inplace=True)
>>> dfFull['match'] = dfFull.apply(lambda x: x['value_df1'].find(x['value_df2']), axis=1).ge(0)
>>> dfFull
value_df1 col_a value_df2 col_b match
0 a/aa/aaa 1 bb/bbb 4 False
1 a/aa/aaa 1 ccc 5 False
2 a/aa/aaa 1 hello 6 False
3 b/bb/bbb 2 bb/bbb 4 True
4 b/bb/bbb 2 ccc 5 False
5 b/bb/bbb 2 hello 6 False
6 c/cc/ccc 3 bb/bbb 4 False
7 c/cc/ccc 3 ccc 5 True
8 c/cc/ccc 3 hello 6 False
Filtering on the rows where the match column is True and dropping the match column, we get the expected result:
>>> dfFull[dfFull['match']].drop(['match'], axis=1)
value_df1 col_a value_df2 col_b
3 b/bb/bbb 2 bb/bbb 4
7 c/cc/ccc 3 ccc 5
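As an aside, on pandas 1.2 or newer the temporary join column is unnecessary: a cross merge produces the same nine-row product directly, so the df1['join'] = 1 / df2['join'] = 1 step can be skipped. A minimal variant:
>>> dfFull = df1.merge(df2, how='cross')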
This solution is inspired by this post.
Since you have a known delimiter, you can split on that delimiter, do some merging, and then add back in the original data.
# sample data
df1 = pd.DataFrame({'ColumnA': [1,2,3]}, index=['a/aa/aaa','b/bb/bbb', 'c/cc/ccc'])
df2 = pd.DataFrame({'ColumnB': [4,5,6]}, index=['bb/bbb', 'ccc', 'hello'])
# set original index as column
# make a copy of each dataframe to preserve original data
# reset index of copy to keep track of original row number
df1 = df1.reset_index()
copy_df1 = df1.copy()
copy_df1.index.name = 'row_df1'
copy_df1 = copy_df1.reset_index()
df2 = df2.reset_index()
copy_df2 = df2.copy()
copy_df2.index.name = 'row_df2'
copy_df2 = copy_df2.reset_index()
# split on known delimiter and explode into rows for each substring
copy_df1['index'] = copy_df1['index'].str.split('/')
copy_df1 = copy_df1.explode('index')
copy_df2['index'] = copy_df2['index'].str.split('/')
copy_df2 = copy_df2.explode('index')
# merge based on substrings, drop duplicates in case of multiple substring matches
mrg = copy_df1[['row_df1','index']].merge(copy_df2[['row_df2','index']]).drop(columns='index')
mrg = mrg.drop_duplicates()
# merge back in original details
mrg = mrg.merge(df1, left_on='row_df1', right_index=True)
mrg = mrg.merge(df2, left_on='row_df2', right_index=True, suffixes=('_df1','_df2'))
The final output would be:
row_df1 row_df2 index_df1 ColumnA index_df2 ColumnB
0 1 0 b/bb/bbb 2 bb/bbb 4
2 2 1 c/cc/ccc 3 ccc 5
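For reuse, the explode-and-merge recipe above can be wrapped in a small helper. This is only a sketch: the function name, the sep parameter, and the piece column are my own, and like the answer above it matches rows that share any delimiter-separated piece rather than testing true substring containment.
import pandas as pd

def merge_on_substring(df1, df2, sep='/'):
    # keep the original index values as regular columns
    left = df1.rename_axis('index_df1').reset_index()
    right = df2.rename_axis('index_df2').reset_index()
    # one exploded row per delimiter-separated piece of each index
    left['piece'] = left['index_df1'].str.split(sep)
    right['piece'] = right['index_df2'].str.split(sep)
    # rows sharing any piece count as a match; several shared pieces
    # produce duplicate pairs, so drop them at the end
    merged = (left.explode('piece')
                  .merge(right.explode('piece'), on='piece')
                  .drop(columns='piece'))
    return merged.drop_duplicates()
Called on the sample frames as first defined, merge_on_substring(df1, df2) should return the two matching row pairs.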
Assume we have 3 dataframes named df1, df2, df3. Each of these dataframes has 100 rows and 15 columns. I want to create a new dataframe that will have the first column of df1, then the first column of df2, then the first column of df3. Then it will have the second column of df1, then the second column of df2, then the second column of df3, and so on until all 15 columns of each of the three dataframes are included. For example
df1
A B C ... O
1 1 1 1
1 1 1 1
... ... ... ...
df2
A B C ... O
2 2 2 2
2 2 2 2
... ... ... ...
df3
A B C ... O
3 3 3 3
3 3 3 3
... ... ... ...
The expected output should be something like the following
dfnew
A_df1 A_df2 A_df3 B_df1 B_df2 B_df3 ... O_df1 O_df2 O_df3
1 2 3 1 2 3 1 2 3
1 2 3 1 2 3 1 2 3
... ... ... ...
My issue is that I cannot use the names of the columns to specify them. For example I know how to do it like this
# create a list of the dataframes
dfs = [df1, df2, df3]
# concatenate the dataframes along the columns axis (axis=1)
dfnew = pd.concat(dfs, axis=1)
# specify the column names for the new dataframe
column_names = ["column1", "column2", ..., "column15"]
# concatenate the dataframes along the columns axis (axis=1)
# and specify the column names for the new dataframe
dfnew = pd.concat(dfs, axis=1, columns=column_names)
but I cannot use the column names because they will change every time. Plus, it seems like there could be a faster way than hard-coding them by using the .loc function
Example
data1 = {'A': {0: 1, 1: 1}, 'B': {0: 1, 1: 1}, 'C': {0: 1, 1: 1}}
df1 = pd.DataFrame(data1)
df2 = df1.replace(1, 2).copy()
df3 = df1.replace(1, 3).copy()
df1
A B C
0 1 1 1
1 1 1 1
df2
A B C
0 2 2 2
1 2 2 2
df3
A B C
0 3 3 3
1 3 3 3
Code
dfs = (pd.concat([df1, df2, df3], axis=1, keys=['df1', 'df2', 'df3'])
.sort_index(level=1, axis=1).swaplevel(0, 1, axis=1))
dfs
A B C
df1 df2 df3 df1 df2 df3 df1 df2 df3
0 1 2 3 1 2 3 1 2 3
1 1 2 3 1 2 3 1 2 3
dfs.set_axis(dfs.columns.map('_'.join), axis=1)
A_df1 A_df2 A_df3 B_df1 B_df2 B_df3 C_df1 C_df2 C_df3
0 1 2 3 1 2 3 1 2 3
1 1 2 3 1 2 3 1 2 3
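Since the column names change every time, nothing here needs to be hard-coded: the keys can be generated from the list of frames. A small sketch (the frames list and the key format are illustrative):
frames = [df1, df2, df3]
keys = [f'df{i}' for i in range(1, len(frames) + 1)]
out = (pd.concat(frames, axis=1, keys=keys)
         .sort_index(level=1, axis=1)
         .swaplevel(0, 1, axis=1))
out = out.set_axis(out.columns.map('_'.join), axis=1)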
I'm actually trying to figure out how to drop a column based on the existence of another column. Here is my problem:
I start with this DataFrame. Each "X" column is associated with a "Y" column using a number. (X_1,Y_1 / X_2,Y_2 ...)
Index X_1 X_2 Y_1 Y_2
1 4 0 A NaN
2 7 0 A NaN
3 6 0 B NaN
4 2 0 B NaN
5 8 0 A NaN
I drop the NaN values using df.dropna(axis=1). The result I get is this DataFrame:
Index X_1 X_2 Y_1
1 4 0 A
2 7 0 A
3 6 0 B
4 2 0 B
5 8 0 A
The problem is that I want to delete the "X" column associated with the "Y" column that just got dropped. I would like to use a condition that basically says:
"If Y_2 is not in the DataFrame, drop the X_2 column"
I used a for loop combined with an if, but it doesn't seem to work. Any ideas?
Thanks and have a good day.
Setup
>>> df
CHA_COEXPM1_COR CHA_COEXPM2_COR CHA_COFMAT1_COR CHA_COFMAT2_COR
Index
1 4 0 A NaN
2 7 0 A NaN
3 6 0 B NaN
4 2 0 B NaN
5 8 0 A NaN
Solution
Identify the columns having NaN values in any row
Group the identified columns using the numeric identifier and transform using any
Filter the columns using the boolean mask created in the previous step
m = df.isna().any()
m = m.groupby(m.index.str.extract(r'(\d+)_', expand=False)).transform('any')  # expand=False returns an Index, so the grouper aligns positionally
Result
>>> df.loc[:, ~m]
CHA_COEXPM1_COR CHA_COFMAT1_COR
Index
1 4 A
2 7 A
3 6 B
4 2 B
5 8 A
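The regex depends on where the number sits in the name. For the X_1/Y_2 columns from the question, a trailing-number pattern would be needed instead; an assumed adaptation:
m = df.isna().any()
m = m.groupby(m.index.str.extract(r'_(\d+)$', expand=False)).transform('any')
df.loc[:, ~m]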
Slightly modified example to be closer to actual DataFrame:
import numpy as np
df = pd.DataFrame({
    'Index': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5},
    'X_V1_C': {0: 4, 1: 7, 2: 6, 3: 2, 4: 8},
    'X_V2_C': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0},
    'Y_V1_C': {0: 'A', 1: 'A', 2: 'B', 3: 'B', 4: 'A'},
    'Y_V2_C': {0: np.nan, 1: np.nan, 2: np.nan, 3: np.nan, 4: np.nan}
})
Index X_V1_C X_V2_C Y_V1_C Y_V2_C
0 1 4 0 A NaN
1 2 7 0 A NaN
2 3 6 0 B NaN
3 4 2 0 B NaN
4 5 8 0 A NaN
1. set_index on any columns to be "saved"
2. Extract the numbers from the columns and create a MultiIndex:
df.columns = pd.MultiIndex.from_arrays([df.columns.str.extract(r'(\d+)')[0],
df.columns])
0 1 2 1 2 # Numbers Extracted From Columns
X_V1_C X_V2_C Y_V1_C Y_V2_C
Index
1 4 0 A NaN
2 7 0 A NaN
3 6 0 B NaN
4 2 0 B NaN
5 8 0 A NaN
3. Check which number groups consist entirely of NaN columns: DataFrame.isna, then all on axis=0 (down each column), then any within each level=0 group (the extracted number). Series.any(level=0) was removed in pandas 2.0, so the equivalent groupby(level=0).any() is used here:
col_mask = ~df.isna().all(axis=0).groupby(level=0).any()
0
1 True # Keep 1 Group
2 False # Don't Keep 2 Group
dtype: bool
4. Filter the DataFrame with the mask using loc, then droplevel on the added number level:
df = df.loc[:, col_mask.index[col_mask]].droplevel(axis=1, level=0)
X_V1_C Y_V1_C
Index
1 4 A
2 7 A
3 6 B
4 2 B
5 8 A
All Together
df = df.set_index('Index')
df.columns = pd.MultiIndex.from_arrays([df.columns.str.extract(r'(\d+)')[0],
df.columns])
col_mask = ~df.isna().all(axis=0).groupby(level=0).any()
df = df.loc[:, col_mask.index[col_mask]].droplevel(axis=1, level=0)
df:
X_V1_C Y_V1_C
Index
1 4 A
2 7 A
3 6 B
4 2 B
5 8 A
Drop the NaN columns:
df.dropna(axis=1, inplace=True)
Compute the suffixes, and keep the columns whose suffix appears twice:
suffixes = [i[2:] for i in df.columns]
cols = [c for c in df.columns if suffixes.count(c[2:]) == 2]
Filter the columns:
df[cols]
Full code:
df = df.set_index('Index').dropna(axis=1)
suffixes = [i[2:] for i in df.columns]
df[[c for c in df.columns if suffixes.count(c[2:]) == 2]]
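Note that i[2:] assumes every column name carries a fixed two-character prefix such as X_ or Y_. If the prefix length can vary, splitting on the first underscore is a safer way to isolate the suffix (an assumed variant):
suffixes = [c.split('_', 1)[1] for c in df.columns]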
I have the following:
df1 = pd.DataFrame({'data': [1,2,3]})
df2 = pd.DataFrame({'data': [4,5,6]})
df = pd.concat([df1,df2], keys=['hello','world'], axis=1)
df[('hello','new_col')] = df[('world','data')]*2
print (df)
hello world hello
data data new_col
0 1 4 8
1 2 5 10
2 3 6 12
When I add a new nested column as above, it is kept separate from the existing hello column. How do I add a new nested column so that new_col sits beneath the existing hello group? Can this be done during assignment, or only afterwards? I.e. I want the below
hello world
data new_col data
0 1 8 4
1 2 10 5
2 3 12 6
You can do this after assignment by reselecting the top-level columns, which regroups the sublevels:
df = df[['hello', 'world']]
print(df)
hello world
data new_col data
0 1 8 4
1 2 10 5
2 3 12 6
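Alternatively, if plain lexicographic order of the top level is acceptable, sorting the column axis regroups the sublevels in one call:
df = df.sort_index(axis=1)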
I have a DataFrame where each cell of col1 holds a list:
l = {'col1': [[1,2,3], [4,5,6]]}
df = pd.DataFrame(data=l)
col1
0 [1, 2, 3]
1 [4, 5, 6]
Desired output:
col1
0 1
1 2
2 3
3 4
4 5
5 6
Here is explode:
df.explode('col1')
col1
0 1
0 2
0 3
1 4
1 5
1 6
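Note the exploded rows keep their original labels (0, 0, 0, 1, 1, 1 above). To get the clean 0-5 index from the desired output, reset it, or on pandas 1.1+ pass ignore_index:
df.explode('col1', ignore_index=True)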
You can use np.ravel to flatten the list of lists:
import numpy as np, pandas as pd
l = {'col1': [[1,2,3], [4,5,6]]}
df = pd.DataFrame(np.ravel(*l.values()), columns=l.keys())
>>> df
col1
0 1
1 2
2 3
3 4
4 5
5 6
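One caveat: np.ravel flattens cleanly here because the sublists have equal length; ragged sublists would yield an object array instead. An alternative that handles either case:
from itertools import chain
df = pd.DataFrame({'col1': list(chain.from_iterable(l['col1']))})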
I have a pandas DataFrame (shown in the original post as an Excel screenshot; the sample in the answer below reproduces it).
Now I would like to delete all duplicates (1) in a specific column (B).
How can I do it?
For this example, the result would look like the first output in the answer below (the screenshot is not reproduced here).
You can use duplicated to build a boolean mask and then set NaNs with loc, mask, or numpy.where:
df.loc[df['B'].duplicated(), 'B'] = np.nan
df['B'] = df['B'].mask(df['B'].duplicated())
df['B'] = np.where(df['B'].duplicated(), np.nan, df['B'])
Alternatively, if you need to remove the duplicated rows by column B:
df = df.drop_duplicates(subset=['B'])
Sample:
df = pd.DataFrame({
'A': [1,5,7,9],
'B': [1,2,1,3]
})
print (df)
A B
0 1 1
1 5 2
2 7 1
3 9 3
df.loc[df['B'].duplicated(), 'B'] = np.nan
print (df)
A B
0 1 1.0
1 5 2.0
2 7 NaN
3 9 3.0
df = df.drop_duplicates(subset=['B'])
print (df)
A B
0 1 1
1 5 2
3 9 3
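By default duplicated marks every occurrence after the first. Its keep parameter controls which occurrences count: keep='last' marks everything but the last, and keep=False marks all of them, e.g.:
df.loc[df['B'].duplicated(keep=False), 'B'] = np.nan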