creating a new dataframe from 3 other dataframes but columns must have specific order without specifying the name of the columns - pandas

Assume we have 3 dataframes named df1, df2, and df3. Each of these dataframes has 100 rows and 15 columns. I want to create a new dataframe that will have the first column of df1, then the first column of df2, then the first column of df3. Then it will have the second column of df1, then the second column of df2, then the second column of df3, and so on until all 15 columns of each of the three dataframes are included. For example
df1
A B C ... O
1 1 1 1
1 1 1 1
... ... ... ...
df2
A B C ... O
2 2 2 2
2 2 2 2
... ... ... ...
df3
A B C ... O
3 3 3 3
3 3 3 3
... ... ... ...
The expected output should be something like the following
dfnew
A_df1 A_df2 A_df3 B_df1 B_df2 B_df3 ... O_df1 O_df2 O_df3
1 2 3 1 2 3 1 2 3
1 2 3 1 2 3 1 2 3
... ... ... ...
My issue is that I cannot use the names of the columns to specify them. For example, I know how to do it like this:
# create a list of the dataframes
dfs = [df1, df2, df3]
# concatenate the dataframes along the columns axis (axis=1)
dfnew = pd.concat(dfs, axis=1)
# specify the column names for the new dataframe
column_names = ["column1", "column2", ..., "column15"]
# concatenate the dataframes along the columns axis (axis=1)
# and assign the column names to the new dataframe
dfnew = pd.concat(dfs, axis=1)
dfnew.columns = column_names
but I cannot use the column names because they will change every time. Plus, it seems like there should be a faster way than hard-coding them, perhaps by using .loc.

Example
data1 = {'A': {0: 1, 1: 1}, 'B': {0: 1, 1: 1}, 'C': {0: 1, 1: 1}}
df1 = pd.DataFrame(data1)
df2 = df1.replace(1, 2).copy()
df3 = df1.replace(1, 3).copy()
df1
A B C
0 1 1 1
1 1 1 1
df2
A B C
0 2 2 2
1 2 2 2
df3
A B C
0 3 3 3
1 3 3 3
Code
dfs = (pd.concat([df1, df2, df3], axis=1, keys=['df1', 'df2', 'df3'])
.sort_index(level=1, axis=1).swaplevel(0, 1, axis=1))
dfs
A B C
df1 df2 df3 df1 df2 df3 df1 df2 df3
0 1 2 3 1 2 3 1 2 3
1 1 2 3 1 2 3 1 2 3
dfs.set_axis(dfs.columns.map('_'.join), axis=1)
A_df1 A_df2 A_df3 B_df1 B_df2 B_df3 C_df1 C_df2 C_df3
0 1 2 3 1 2 3 1 2 3
1 1 2 3 1 2 3 1 2 3
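Putting the answer's steps together, here is a minimal end-to-end sketch using the small example frames from the question (nothing here depends on the actual column names):

```python
import pandas as pd

# small example frames from the question
data1 = {'A': {0: 1, 1: 1}, 'B': {0: 1, 1: 1}, 'C': {0: 1, 1: 1}}
df1 = pd.DataFrame(data1)
df2 = df1.replace(1, 2)
df3 = df1.replace(1, 3)

# concat with keys, sort by the original column names (level 1),
# then swap levels so the original name comes first
out = (pd.concat([df1, df2, df3], axis=1, keys=['df1', 'df2', 'df3'])
         .sort_index(level=1, axis=1)
         .swaplevel(0, 1, axis=1))

# flatten the MultiIndex columns into 'A_df1', 'A_df2', ...
out.columns = out.columns.map('_'.join)
```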

Related

Keys as Rows (Pandas Dataframe from dictionary)

I have this dictionary:
d = {'a': (1,2,3), 'b': (4,5,6)}
I would like it to be formed as a dataframe where the key is shown as row along with its corresponding values, like the table below:
Keys Values
a    1
a    2
a    3
b    4
b    5
b    6
Any ideas?
Here is my suggestion.
Create your dataframe with the following command:
df = pd.DataFrame({'Keys': list(d.keys()), 'Values': list(d.values())})
Explode your dataframe on column of 'Values' with the following command:
df = df.explode(column='Values').reset_index(drop=True)
The output result is something like this:
Keys Values
0 a 1
1 a 2
2 a 3
3 b 4
4 b 5
5 b 6
d = {'a': (1,2,3), 'b': (4,5,6)}
df = pd.DataFrame(d).unstack().droplevel(1).reset_index().rename({'index':'Keys', 0:'Values'}, axis=1)
Output:
>>> df
Keys Values
0 a 1
1 a 2
2 a 3
3 b 4
4 b 5
5 b 6
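For completeness, a self-contained version of the explode approach, with the dictionary named d as in the question (rather than shadowing the built-in dict):

```python
import pandas as pd

d = {'a': (1, 2, 3), 'b': (4, 5, 6)}

# one row per key, with the whole tuple of values in a single cell
df = pd.DataFrame({'Keys': list(d.keys()), 'Values': list(d.values())})

# explode the tuples into one row per value
df = df.explode(column='Values').reset_index(drop=True)
```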

How to merge two dataframes according to their indexes?

I am trying to use Pandas for data analysis. I need to merge two dataframes according to their indexes. However, their indexes are totally different. The rule is that if the index of df2 is a substring of the index of df1, then I should merge them. For example, df1.index == ['a/aa/aaa','b/bb/bbb', 'c/cc/ccc'], and df2.index == ['bb/bbb', 'ccc', 'hello']. Then df1 and df2 have two indexes in common, and we should merge them based on these indexes. What should I do?
Starting from your DataFrames:
>>> df1 = pd.DataFrame({'col_a': [1, 2, 3]}, index=['a/aa/aaa','b/bb/bbb', 'c/cc/ccc'])
>>> df2 = pd.DataFrame({'col_b': [4, 5, 6]}, index=['bb/bbb', 'ccc', 'hello'])
And changing the index into a column:
>>> df1=df1.reset_index(drop=False)
>>> df1 = df1.rename(columns={'index': 'value_df1'})
>>> df1
value_df1 col_a
0 a/aa/aaa 1
1 b/bb/bbb 2
2 c/cc/ccc 3
>>> df2=df2.reset_index(drop=False)
>>> df2 = df2.rename(columns={'index': 'value_df2'})
>>> df2
value_df2 col_b
0 bb/bbb 4
1 ccc 5
2 hello 6
We merge both DataFrames on the join column:
>>> df1['join'] = 1
>>> df2['join'] = 1
>>> dfFull = df1.merge(df2, on='join').drop('join', axis=1)
>>> dfFull
value_df1 col_a value_df2 col_b
0 a/aa/aaa 1 bb/bbb 4
1 a/aa/aaa 1 ccc 5
2 a/aa/aaa 1 hello 6
3 b/bb/bbb 2 bb/bbb 4
4 b/bb/bbb 2 ccc 5
5 b/bb/bbb 2 hello 6
6 c/cc/ccc 3 bb/bbb 4
7 c/cc/ccc 3 ccc 5
8 c/cc/ccc 3 hello 6
Then we use apply to match the initial index values:
>>> df2.drop('join', axis=1, inplace=True)
>>> dfFull['match'] = dfFull.apply(lambda x: x['value_df1'].find(x['value_df2']), axis=1).ge(0)
>>> dfFull
value_df1 col_a value_df2 col_b match
0 a/aa/aaa 1 bb/bbb 4 False
1 a/aa/aaa 1 ccc 5 False
2 a/aa/aaa 1 hello 6 False
3 b/bb/bbb 2 bb/bbb 4 True
4 b/bb/bbb 2 ccc 5 False
5 b/bb/bbb 2 hello 6 False
6 c/cc/ccc 3 bb/bbb 4 False
7 c/cc/ccc 3 ccc 5 True
8 c/cc/ccc 3 hello 6 False
Filtering on the rows where the match column is True and dropping the match column, we get the expected result:
>>> dfFull[dfFull['match']].drop(['match'], axis=1)
value_df1 col_a value_df2 col_b
3 b/bb/bbb 2 bb/bbb 4
7 c/cc/ccc 3 ccc 5
This solution is inspired by this post.
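On pandas 1.2+ the helper join column can be replaced by merge(..., how='cross'); a compact sketch of the same cross-join-then-filter idea:

```python
import pandas as pd

df1 = pd.DataFrame({'col_a': [1, 2, 3]}, index=['a/aa/aaa', 'b/bb/bbb', 'c/cc/ccc'])
df2 = pd.DataFrame({'col_b': [4, 5, 6]}, index=['bb/bbb', 'ccc', 'hello'])

a = df1.reset_index().rename(columns={'index': 'value_df1'})
b = df2.reset_index().rename(columns={'index': 'value_df2'})

# cartesian product of the two frames (requires pandas >= 1.2)
full = a.merge(b, how='cross')

# keep the rows where the df2 index is a substring of the df1 index
result = full[full.apply(lambda r: r['value_df2'] in r['value_df1'], axis=1)]
```

Note that, like the original answer, this builds all len(df1) * len(df2) row pairs before filtering, so it can be memory-hungry on large frames.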
Since you have a known delimiter, you can split on that delimiter, do some merging, and then add back in the original data.
# sample data
df1 = pd.DataFrame({'ColumnA': [1,2,3]}, index=['a/aa/aaa','b/bb/bbb', 'c/cc/ccc'])
df2 = pd.DataFrame({'ColumnB': [4,5,6]}, index=['bb/bbb', 'ccc', 'hello'])
# set original index as column
# make a copy of each dataframe to preserve original data
# reset index of copy to keep track of original row number
df1 = df1.reset_index()
copy_df1 = df1.copy()
copy_df1.index.name = 'row_df1'
copy_df1 = copy_df1.reset_index()
df2 = df2.reset_index()
copy_df2 = df2.copy()
copy_df2.index.name = 'row_df2'
copy_df2 = copy_df2.reset_index()
# split on known delimiter and explode into rows for each substring
copy_df1['index'] = copy_df1['index'].str.split('/')
copy_df1 = copy_df1.explode('index')
copy_df2['index'] = copy_df2['index'].str.split('/')
copy_df2 = copy_df2.explode('index')
# merge based on substrings, drop duplicates in case of multiple substring matches
mrg = copy_df1[['row_df1','index']].merge(copy_df2[['row_df2','index']]).drop(columns='index')
mrg = mrg.drop_duplicates()
# merge back in original details
mrg = mrg.merge(df1, left_on='row_df1', right_index=True)
mrg = mrg.merge(df2, left_on='row_df2', right_index=True, suffixes=('_df1','_df2'))
The final output would be:
row_df1 row_df2 index_df1 ColumnA index_df2 ColumnB
0 1 0 b/bb/bbb 2 bb/bbb 4
2 2 1 c/cc/ccc 3 ccc 5

Re-index to insert missing rows in a multi-indexed dataframe

I have a MultiIndexed DataFrame with three levels of indices. I would like to expand my third level to contain all values in a given range, but only for the existing values in the two upper levels.
For example, assume the first level is name, the second level is date and the third level is hour. I would like to have rows for all 24 possible hours (even if some are currently missing), but only for the already existing names and dates. The values in new rows can be filled with zeros.
So a simple example input would be:
>>> import pandas as pd
>>> df = pd.DataFrame([[1,1,1,3],[2,2,1,4], [3,3,2,5]], columns=['A', 'B', 'C','val'])
>>> df.set_index(['A', 'B', 'C'], inplace=True)
>>> df
val
A B C
1 1 1 3
2 2 1 4
3 3 2 5
If the required values for C are [1, 2, 3], the desired output would be:
val
A B C
1 1 1 3
2 0
3 0
2 2 1 4
2 0
3 0
3 3 1 0
2 5
3 0
I know how to achieve this using groupby and applying a defined function for each group, but I was wondering if there was a cleaner way of doing this with reindex (I couldn't make this one work for a MultiIndex case, but perhaps I'm missing something)
Use:
partial_indices = [ i[0:2] for i in df.index.values ]
C_reqd = [1, 2, 3]
final_indices = [j+(i,) for j in partial_indices for i in C_reqd]
index = pd.MultiIndex.from_tuples(final_indices, names=['A', 'B', 'C'])
df2 = pd.DataFrame(pd.Series(0, index), columns=['val'])
df2.update(df)
Output
df2
val
A B C
1 1 1 3.0
2 0.0
3 0.0
2 2 1 4.0
2 0.0
3 0.0
3 3 1 0.0
2 5.0
3 0.0
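As for the asker's reindex question: one way to make reindex itself work is to build the full MultiIndex from the existing (A, B) pairs and the required C values, then reindex with fill_value=0. A sketch of that alternative (not from the original answer):

```python
import pandas as pd

df = (pd.DataFrame([[1, 1, 1, 3], [2, 2, 1, 4], [3, 3, 2, 5]],
                   columns=['A', 'B', 'C', 'val'])
        .set_index(['A', 'B', 'C']))

C_reqd = [1, 2, 3]

# existing (A, B) pairs, each combined with every required C value
pairs = df.index.droplevel('C').unique()
full = pd.MultiIndex.from_tuples(
    [ab + (c,) for ab in pairs for c in C_reqd], names=['A', 'B', 'C'])

# missing rows are created and filled with 0
out = df.reindex(full, fill_value=0)
```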

Pandas: Delete duplicated items in a specific column

I have a pandas dataframe (here represented using Excel):
Now I would like to delete all duplicates (1) in a specific column (B).
How can I do it?
For this example, the result would look like this:
You can use duplicated for a boolean mask and then set NaNs with loc, mask, or numpy.where:
df.loc[df['B'].duplicated(), 'B'] = np.nan
df['B'] = df['B'].mask(df['B'].duplicated())
df['B'] = np.where(df['B'].duplicated(), np.nan,df['B'])
Alternatively, if you need to remove the duplicate rows by the B column:
df = df.drop_duplicates(subset=['B'])
Sample:
df = pd.DataFrame({
    'A': [1, 5, 7, 9],
    'B': [1, 2, 1, 3]
})
print (df)
A B
0 1 1
1 5 2
2 7 1
3 9 3
df.loc[df['B'].duplicated(), 'B'] = np.nan
print (df)
A B
0 1 1.0
1 5 2.0
2 7 NaN
3 9 3.0
df = df.drop_duplicates(subset=['B'])
print (df)
A B
0 1 1
1 5 2
3 9 3
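As a side note, duplicated accepts a keep parameter, so you can choose which occurrence survives; e.g. keep='last' flags the earlier occurrences instead. A small sketch on the same sample:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 5, 7, 9], 'B': [1, 2, 1, 3]})

# keep='last': the *first* 1 is now the one flagged as a duplicate
out = df['B'].mask(df['B'].duplicated(keep='last'))
```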

issue merging two dataframes with pandas by summing element by element [duplicate]

I'm trying to merge two DataFrames summing columns value.
>>> print(df1)
id name weight
0 1 A 0
1 2 B 10
2 3 C 10
>>> print(df2)
id name weight
0 2 B 15
1 3 C 10
I need to sum weight values during merging for similar values in the common column.
merge = pd.merge(df1, df2, how='inner')
So the output will be something like the following.
id name weight
1 2 B 25
2 3 C 20
This solution also works if you want to sum more than one column. Assume the data frames:
>>> df1
id name weight height
0 1 A 0 5
1 2 B 10 10
2 3 C 10 15
>>> df2
id name weight height
0 2 B 25 20
1 3 C 20 30
You can concatenate them and group by index columns.
>>> pd.concat([df1, df2]).groupby(['id', 'name']).sum().reset_index()
id name weight height
0 1 A 0 5
1 2 B 35 30
2 3 C 30 45
In [41]: pd.merge(df1, df2, on=['id', 'name']).set_index(['id', 'name']).sum(axis=1)
Out[41]:
id name
2 B 25
3 C 20
dtype: int64
If you set the common columns as the index, you can just sum the two dataframes, much simpler than merging:
In [30]: df1 = df1.set_index(['id', 'name'])
In [31]: df2 = df2.set_index(['id', 'name'])
In [32]: df1 + df2
Out[32]:
weight
id name
1 A NaN
2 B 25
3 C 20
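If rows present in only one frame should keep their value rather than become NaN, DataFrame.add with fill_value does that; a sketch on the same data:

```python
import pandas as pd

df1 = (pd.DataFrame({'id': [1, 2, 3], 'name': ['A', 'B', 'C'],
                     'weight': [0, 10, 10]})
         .set_index(['id', 'name']))
df2 = (pd.DataFrame({'id': [2, 3], 'name': ['B', 'C'],
                     'weight': [15, 10]})
         .set_index(['id', 'name']))

# labels missing from one frame are treated as 0 instead of producing NaN
out = df1.add(df2, fill_value=0)
```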