Merge two DataFrames on multiple columns - pandas

I hope you can help me.
I have two pretty big Datasets.
DF1 Example:
|id| A_Workflow_Type_ID | B_Workflow_Type_ID | ...
1 123 456
2 789 222 ...
3 333 NULL ...
DF2 Example:
Workflow| Operation | Profile | Type | Name | ...
123 1 2 Low_Cost xyz ...
456 2 5 High_Cost z ...
I need to merge the two datasets without creating many NaNs or duplicated columns. So I want to match both A_Workflow_Type_ID and B_Workflow_Type_ID from DF1 against Workflow from DF2.
I tried several join operations in pandas as well as merge, but they all failed.
My last try:
all_Data = pd.merge(left=DF1,right=DF2, how='inner', left_on =['A_Workflow_Type_ID ','B_Workflow_Type_ID '], right_on=['Workflow'])
But that returns an error saying that the key lists have to be of equal length on both sides.
Thanks for the help!

You need to reshape first with melt and then merge:
# select all columns whose names do not contain the string 'Workflow'
cols = DF1.columns[~DF1.columns.str.contains('Workflow')]
print (cols)
Index(['id'], dtype='object')
df = DF1.melt(cols, value_name='Workflow', var_name='type')
print (df)
id type Workflow
0 1 A_Workflow_Type_ID 123.0
1 2 A_Workflow_Type_ID 789.0
2 3 A_Workflow_Type_ID 333.0
3 1 B_Workflow_Type_ID 456.0
4 2 B_Workflow_Type_ID 222.0
5 3 B_Workflow_Type_ID NaN
all_Data = pd.merge(left=df,right=DF2, on ='Workflow')
print (all_Data)
id type Workflow Operation Profile Type Name
0 1 A_Workflow_Type_ID 123 1 2 Low_Cost xyz
1 1 B_Workflow_Type_ID 456 2 5 High_Cost z
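For reference, the melt-then-merge steps can be put together into one self-contained sketch. The sample frames are rebuilt from the tables in the question. The names id_cols, long_df and all_data, and the dropna/astype step (added so that the missing B_Workflow_Type_ID does not leave Workflow as float), are assumptions of this sketch and not part of the original answer.

```python
import numpy as np
import pandas as pd

# Sample data rebuilt from the question's tables
DF1 = pd.DataFrame({
    "id": [1, 2, 3],
    "A_Workflow_Type_ID": [123, 789, 333],
    "B_Workflow_Type_ID": [456, 222, np.nan],
})
DF2 = pd.DataFrame({
    "Workflow": [123, 456],
    "Operation": [1, 2],
    "Profile": [2, 5],
    "Type": ["Low_Cost", "High_Cost"],
    "Name": ["xyz", "z"],
})

# Columns that are not workflow IDs are kept as identifiers
id_cols = [c for c in DF1.columns if "Workflow" not in c]

# Wide -> long: one row per (id, workflow column) pair
long_df = (
    DF1.melt(id_vars=id_cols, var_name="type", value_name="Workflow")
       .dropna(subset=["Workflow"])          # drop the missing workflow ID
       .astype({"Workflow": "int64"})        # match the dtype of DF2.Workflow
)

# A single key column on both sides, so an ordinary merge works
all_data = long_df.merge(DF2, on="Workflow", how="inner")
print(all_data)
```

Because both workflow columns end up in one Workflow column, the merge needs only one key on each side, which avoids the "equal length" error from the question.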

Pivoting and transposing using pandas dataframe

Suppose that I have a pandas dataframe like the one below:
import pandas as pd
df = pd.DataFrame({'fk ID': [1,1,2,2],
'value': [3,3,4,5],
'valID': [1,2,1,2]})
The above would give me the following output:
print(df)
fk ID value valID
0 1 3 1
1 1 3 2
2 2 4 1
3 2 5 2
or
|fk ID| value | valId |
| 1 | 3 | 1 |
| 1 | 3 | 2 |
| 2 | 4 | 1 |
| 2 | 5 | 2 |
and I would like to transpose and pivot it in such a way that I get the following table and the same order of column names:
fk ID value valID fkID value valID
| 1 | 3 | 1 | 1 | 3 | 2 |
| 2 | 4 | 1 | 2 | 5 | 2 |
The most straightforward solution I can think of is
df = pd.DataFrame({'fk ID': [1,1,2,2],
'value': [3,3,4,5],
'valID': [1,2,1,2]})
# concatenate the rows (Series) of each 'fk ID' group side by side
def flatten_group(g):
    return pd.concat(row for _, row in g.iterrows())
res = df.groupby('fk ID', as_index=False).apply(flatten_group)
However, using DataFrame.iterrows is not ideal, and can be very slow if the size of each group is large.
Furthermore, the above solution doesn't work if the 'fk ID' groups have different sizes. To see that, we can add a third group to the DataFrame
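>>> # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
>>> df2 = pd.concat([df, pd.DataFrame([{'fk ID': 3, 'value': 10, 'valID': 4}])],
...                 ignore_index=True)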
>>> df2
fk ID value valID
0 1 3 1
1 1 3 2
2 2 4 1
3 2 5 2
4 3 10 4
>>> df2.groupby('fk ID', as_index=False).apply(flatten_group)
0 fk ID 1
value 3
valID 1
fk ID 1
value 3
valID 2
1 fk ID 2
value 4
valID 1
fk ID 2
value 5
valID 2
2 fk ID 3
value 10
valID 4
dtype: int64
The result is not a DataFrame as one could expect, because pandas can't align the columns of the groups.
To solve this I suggest the following solution. It should work for any group size, and should be faster for large DataFrames.
import numpy as np
def flatten_group(g):
    # flatten each group's data into a single row
    flat_data = g.to_numpy().reshape(1,-1)
    return pd.DataFrame(flat_data)
# group the rows by 'fk ID'
groups = df.groupby('fk ID', group_keys=False)
# get the maximum group size
max_group_size = groups.size().max()
# construct the new columns by repeating the
# original columns 'max_group_size' times
new_cols = np.tile(df.columns, max_group_size)
# aggregate the flattened rows
res = groups.apply(flatten_group).reset_index(drop=True)
# update the columns
res.columns = new_cols
Output:
# df
>>> res
fk ID value valID fk ID value valID
0 1 3 1 1 3 2
1 2 4 1 2 5 2
# df2
>>> res
fk ID value valID fk ID value valID
0 1 3 1 1.0 3.0 2.0
1 2 4 1 2.0 5.0 2.0
2 3 10 4 NaN NaN NaN
You can convert df to a NumPy array, reshape it, convert it back to a DataFrame, and then rename the columns (0..5).
This also works if the values are strings rather than numbers.
import pandas as pd
df = pd.DataFrame({'fk ID': [1,1,2,2],
'value': [3,3,4,5],
'valID': [1,2,1,2]})
nrows = 2
array = df.to_numpy().reshape((nrows, -1))
pd.DataFrame(array).rename(mapper=lambda x: df.columns[x % len(df.columns)], axis=1)
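The reshape approach hard-codes nrows. A small variation, sketched here under the assumption that the rows are already sorted by 'fk ID' and that every group has the same number of rows, derives it from the data. The names n_groups and wide are introduced only for this illustration.

```python
import pandas as pd

df = pd.DataFrame({'fk ID': [1, 1, 2, 2],
                   'value': [3, 3, 4, 5],
                   'valID': [1, 2, 1, 2]})

# One output row per distinct 'fk ID' (assumes equal-sized, contiguous groups)
n_groups = df['fk ID'].nunique()

# Lay the rows of each group side by side
wide = pd.DataFrame(df.to_numpy().reshape(n_groups, -1))

# Repeat the original column names once per row in a group
wide.columns = list(df.columns) * (wide.shape[1] // len(df.columns))
print(wide)
```

If the groups are not all the same size, the reshape either raises an error or silently misaligns rows, so this shortcut is only safe when that assumption holds.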
If your group sizes are guaranteed to be the same, you could merge your odd and even rows:
import pandas as pd
df = pd.DataFrame({'fk ID': [1,1,2,2],
'value': [3,3,4,5],
'valID': [1,2,1,2]})
df_even = df[df.index%2==0].reset_index(drop=True)
df_odd = df[df.index%2==1].reset_index(drop=True)
df_odd.join(df_even, rsuffix='_2')
Yields
fk ID value valID fk ID_2 value_2 valID_2
0 1 3 2 1 3 1
1 2 5 2 2 4 1
I'd expect this to be pretty performant, and this could be generalized for any number of rows in each group (vs assuming odd/even for two rows per group), but will require that you have the same number of rows per fk ID.
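One possible way to generalize the odd/even idea to any number of rows per group is to number the rows inside each group with groupby().cumcount() and concatenate one block of columns per position. This is a different technique from the join above, not the answerer's code; the names pos, blocks and wide are assumptions of this sketch.

```python
import pandas as pd

df = pd.DataFrame({'fk ID': [1, 1, 2, 2, 3],
                   'value': [3, 3, 4, 5, 10],
                   'valID': [1, 2, 1, 2, 4]})

# Position of each row inside its 'fk ID' group: 0, 1, 0, 1, 0
pos = df.groupby('fk ID').cumcount()

# One block per within-group position, indexed by the group key so that
# pd.concat can align the blocks; drop=False keeps 'fk ID' as a column too
blocks = [
    df[pos == k].set_index('fk ID', drop=False)
    for k in range(pos.max() + 1)
]

# Place the blocks side by side; groups with fewer rows get NaN
wide = pd.concat(blocks, axis=1).reset_index(drop=True)
print(wide)
```

Unlike the odd/even join, this does not require every 'fk ID' to have the same number of rows; shorter groups are padded with NaN, which also turns the affected columns into floats.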

most efficient way to set dataframe column indexing to other columns

I have a large DataFrame. One of my columns contains the names of other columns. I want to evaluate this column and set, in each row, the value of the referenced column:
|A|B|C|Column|
|:|:|:|:-----|
|1|3|4| B |
|2|5|3| A |
|3|5|9| C |
Desired output:
|A|B|C|Column|
|:|:|:|:-----|
|1|3|4| 3 |
|2|5|3| 2 |
|3|5|9| 9 |
I am achieving this result using:
df.apply(lambda d: eval("d." + d['Column']), axis=1)
But it is very slow, even using swifter. Is there a more efficient way of performing this?
For better performance, use df.to_numpy():
In [365]: df['Column'] = df.to_numpy()[df.index, df.columns.get_indexer(df.Column)]
In [366]: df
Out[366]:
A B C Column
0 1 3 4 3
1 2 5 3 2
2 3 5 9 9
For Pandas < 1.2.0, use lookup:
df['Column'] = df.lookup(df.index, df['Column'])
From 1.2.0 onwards, lookup is deprecated; you can use a list comprehension instead:
df['Column'] = [df.at[idx, r['Column']] for idx, r in df.iterrows()]
Output:
A B C Column
0 1 3 4 3
1 2 5 3 2
2 3 5 9 9
Since lookup is deprecated, try the NumPy method with get_indexer:
df['new'] = df.values[df.index,df.columns.get_indexer(df.Column)]
df
Out[75]:
A B C Column new
0 1 3 4 B 3
1 2 5 3 A 2
2 3 5 9 C 9
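Note that the NumPy snippets above pass df.index as row positions, which only works when the index is the default 0..n-1 range. A slightly more defensive sketch, assuming the referenced names are limited to a known list (called value_cols here, a name introduced for this example), uses explicit positions instead:

```python
import numpy as np
import pandas as pd

# Non-default index on purpose, to show that positions are used, not labels
df = pd.DataFrame({'A': [1, 2, 3], 'B': [3, 5, 5], 'C': [4, 3, 9],
                   'Column': ['B', 'A', 'C']}, index=[10, 20, 30])

value_cols = ['A', 'B', 'C']  # columns that 'Column' may refer to

# Column position of each referenced name; get_indexer returns -1 for
# unknown names, which would silently select the last column
col_pos = pd.Index(value_cols).get_indexer(df['Column'])
row_pos = np.arange(len(df))

# Fancy indexing picks one value per row
df['new'] = df[value_cols].to_numpy()[row_pos, col_pos]
print(df)
```

Restricting the array to value_cols also keeps it numeric, rather than the object array that results from including the string column.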

In Dask, how would I remove data that is not repeated across all values of another column?

I'm trying to find a set of data that exists across multiple instances of a column's value.
As an example, let's say I have a DataFrame with the following values:
+-------------+------------+----------+
| hardware_id | model_name | data_v |
+-------------+------------+----------+
| a | 1 | 0.595150 |
+-------------+------------+----------+
| b | 1 | 0.285757 |
+-------------+------------+----------+
| c | 1 | 0.278061 |
+-------------+------------+----------+
| d | 1 | 0.578061 |
+-------------+------------+----------+
| a | 2 | 0.246565 |
+-------------+------------+----------+
| b | 2 | 0.942299 |
+-------------+------------+----------+
| c | 2 | 0.658126 |
+-------------+------------+----------+
| a | 3 | 0.160283 |
+-------------+------------+----------+
| b | 3 | 0.180021 |
+-------------+------------+----------+
| c | 3 | 0.093628 |
+-------------+------------+----------+
| d | 3 | 0.033813 |
+-------------+------------+----------+
What I'm trying to get would be a DataFrame with all elements except the rows that contain a hardware_id of d, since they do not occur at least once per model_name.
I'm using Dask as my original data size is on the order of 7 GB, but if I need to drop down to Pandas that is also feasible. I'm very happy to hear any suggestions.
I have tried splitting the dataframe into individual dataframes based on the model_name attribute, then running a loop:
import numpy as np
import pandas as pd
import dask.dataframe as dd
models = ['1','1','1','2','2','2','3','3','3','3']
frame_1 = dd.from_pandas(pd.DataFrame( {'hardware_id':['a','b','c','a','b','c','a','b','c','d'], 'model_name':models,'data_v':np.random.rand(len(models))} ), npartitions=2)
model_splits = []
for i in range(1,4):
    model_splits.append(frame_1[frame_1['model_name'].eq(str(i))])
aggregate_list = []
while len(model_splits) > 0:
    data = model_splits.pop()
    for other_models in aggregate_list:
        data = data[data.hardware_id.isin(other_models.hardware_id.compute())]
    aggregate_list.append(data)
final_data = dd.concat(aggregate_list)
However, this is beyond inefficient, and I'm not entirely sure that my logic is sound.
Any suggestions on how to achieve this?
Thanks!
One way to accomplish this is to treat it as a groupby-aggregation problem.
Pandas
First, we set up the data:
import pandas as pd
import numpy as np
np.random.seed(12)
models = ['1','1','1','2','2','2','3','3','3','3']
df = pd.DataFrame(
{'hardware_id':['a','b','c','a','b','c','a','b','c','d'],
'model_name': models,
'data_v': np.random.rand(len(models))
}
)
Then, collect the unique values of your model_name column.
unique_model_names = df.model_name.unique()
unique_model_names
array(['1', '2', '3'], dtype=object)
Next, we'll do several related steps at once. Our goal is to figure out which hardware_ids co-occur with the entire unique set of model_names. First, we do a groupby aggregation to get the unique model_names per hardware_id. This returns an array for each group, which we convert to a tuple so that it is hashable and works with isin in the next step. At this point, every hardware ID is associated with a tuple of its unique models. Next, we use isin to check whether that tuple exactly matches the tuple of all unique model names. If it doesn't, the mask is False (exactly what we get for d).
agged = df.groupby("hardware_id", as_index=False).agg({"model_name": "unique"})
agged["model_name"] = agged["model_name"].map(tuple)
agged["all_present_mask"] = agged["model_name"].isin([tuple(unique_model_names)])
agged
hardware_id model_name all_present_mask
0 a (1, 2, 3) True
1 b (1, 2, 3) True
2 c (1, 2, 3) True
3 d (3,) False
Finally, we can use this to get our list of "valid" hardware IDs, and then filter our initial dataframe.
relevant_ids = agged.loc[
    agged.all_present_mask
].hardware_id

result = df.loc[
    df.hardware_id.isin(relevant_ids)
]
result
hardware_id model_name data_v
0 a 1 0.154163
1 b 1 0.740050
2 c 1 0.263315
3 a 2 0.533739
4 b 2 0.014575
5 c 2 0.918747
6 a 3 0.900715
7 b 3 0.033421
8 c 3 0.956949
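The tuple comparison above depends on each hardware_id seeing the models in the same order. As an alternative sketch (a different technique, not the answerer's code), one could compare counts of distinct models instead, which is order-independent. The names n_models and models_per_id are assumptions of this example.

```python
import numpy as np
import pandas as pd

models = ['1', '1', '1', '2', '2', '2', '3', '3', '3', '3']
df = pd.DataFrame({
    'hardware_id': ['a', 'b', 'c', 'a', 'b', 'c', 'a', 'b', 'c', 'd'],
    'model_name': models,
    'data_v': np.random.default_rng(12).random(len(models)),
})

# Total number of distinct models in the data
n_models = df['model_name'].nunique()

# For every row, the number of distinct models its hardware_id appears with
models_per_id = df.groupby('hardware_id')['model_name'].transform('nunique')

# Keep only hardware_ids that appear with every model
result = df[models_per_id == n_models]
print(result)
```

This keeps everything in a single pass over the original frame and avoids building the intermediate agged table.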
Dask
We can do essentially the same thing, but we need to be a little clever with our calls to compute.
import dask.dataframe as dd

ddf = dd.from_pandas(df, 2)
unique_model_names = ddf.model_name.unique()

agged = ddf.groupby("hardware_id").model_name.unique().reset_index()
agged["model_name"] = agged["model_name"].map(tuple)
agged["all_present_mask"] = agged["model_name"].isin([tuple(unique_model_names)])

relevant_ids = agged.loc[
    agged.all_present_mask
].hardware_id

result = ddf.loc[
    ddf.hardware_id.isin(relevant_ids.compute()) # can't pass a dask Series to `ddf.isin`
]
result.compute()
hardware_id model_name data_v
0 a 1 0.154163
1 b 1 0.740050
2 c 1 0.263315
3 a 2 0.533739
4 b 2 0.014575
5 c 2 0.918747
6 a 3 0.900715
7 b 3 0.033421
8 c 3 0.956949
Note that you would probably want to persist agged and relevant_ids if you have the memory available, to avoid some redundant calculation.

Python/Pandas: Transformation of column within a list of columns

I'd like to select a subset of columns from a DataFrame while applying a transformation to some of those columns at the same time. Is it possible to transform a column when that column is selected as one in a list of columns?
For example, I have a column StartDate of type np.datetime64 that I'd like to extract the month from.
When dealing with that Series on its own, I'd do something like
print(df['StartDate'].transform(lambda x: x.month))
to see the transformed data. Can I accomplish the same thing when the above expression is part of a list of columns? Something like:
print(df[['ColumnA', 'ColumnB', 'StartDate'.transform(lambda x: x.month)]])
Of course the above gives the error
AttributeError: 'str' object has no attribute 'month'
So, if my data looks like:
Metadata | Metadata | 2020-01-01
Metadata | Metadata | 2020-02-06
Metadata | Metadata | 2020-02-25
I'd like to see:
Metadata | Metadata | 1
Metadata | Metadata | 2
Metadata | Metadata | 2
Without appending a new separate "Month" column to the DataFrame. Is this possible?
If you have some data like below
df = pd.DataFrame({'col1' : np.random.randint(10, size = 366), 'col2': np.random.randint(10, size = 366),'StartDate' : pd.date_range('2018', '2019')})
which looks like
col1 col2 StartDate
0 0 2 2018-01-01
1 8 0 2018-01-02
2 0 5 2018-01-03
3 3 4 2018-01-04
4 8 6 2018-01-05
... ... ... ...
361 8 8 2018-12-28
362 9 9 2018-12-29
363 4 1 2018-12-30
364 2 4 2018-12-31
365 0 9 2019-01-01
You could redefine the column, or you could use assign, which returns a new DataFrame with the transformed column:
df.assign(StartDate = df['StartDate'].dt.month)
which outputs:
col1 col2 StartDate
0 0 2 1
1 8 0 1
2 0 5 1
3 3 4 1
4 8 6 1
... ... ... ...
361 8 8 12
362 9 9 12
363 4 1 12
364 2 4 12
365 0 9 1
This doesn't change the original dataframe. If you want to keep the result, just reassign:
df = df.assign(StartDate = df['StartDate'].dt.month)
You could also take this further, for example:
df.assign(StartDate = df['StartDate'].dt.month, col1 = df['col1'] + 100)[['col1', 'StartDate']]
You can apply whatever transform you need and then access any columns you want after assigning these transforms.
col1 StartDate
0 105 1
1 109 1
2 108 1
3 101 1
4 108 1
... ... ...
361 104 12
362 102 12
363 109 12
364 102 12
365 100 1
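Applied to the shape of data in the question, the same idea might look like the sketch below. The sample values and the name view are made up for illustration; passing a lambda to assign is an optional variation that lets the selection and the transformation be chained.

```python
import pandas as pd

# Small stand-in for the question's data (values are illustrative)
df = pd.DataFrame({
    'ColumnA': ['x', 'y', 'z'],
    'ColumnB': ['p', 'q', 'r'],
    'StartDate': pd.to_datetime(['2020-01-01', '2020-02-06', '2020-02-25']),
})

# Select the columns, then replace StartDate by its month in the copy only
view = df[['ColumnA', 'ColumnB', 'StartDate']].assign(
    StartDate=lambda d: d['StartDate'].dt.month
)
print(view)

# The original frame still holds full datetimes
print(df['StartDate'].dtype)
```

Because assign returns a new DataFrame, no extra "Month" column is added to df and its StartDate column stays untouched.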
I guess you could use the name attribute of the Series.
Something like:
dt_to_month = lambda x: [d.month for d in x] if x.name == 'StartDate' else x
df[['ColumnA', 'ColumnB', 'StartDate']].apply(dt_to_month)
will do the trick.

python pandas - set column value of column based on index and or ID of concatenated dataframes

I have a concatenated dataframe of at least two concatenated dataframes:
i.e.
df1
Name | Type | ID
0 Joe A 1
1 Fred B 2
2 Mike Both 3
3 Frank Both 4
df2
Name | Type | ID
0 Bill Both 1
1 Jill Both 2
2 Mill B 3
3 Hill A 4
ConcatDf:
Name | Type | ID
0 Joe A 1
1 Fred B 2
2 Mike Both 3
3 Frank Both 4
0 Bill Both 1
1 Jill Both 2
2 Mill B 3
3 Hill A 4
Suppose after they are concatenated, I'd like to set Type for all records from df1 to C and all records from df2 to B. Is this possible?
The indices of the dataframes can be vastly different sizes.
Thanks in advance.
df3 = pd.concat([df1,df2], keys = (1,2))
df3.loc[(1), 'Type'] = 'C'
When you concat, you can assign keys to the DataFrames. This creates a MultiIndex in which the keys separate the concatenated DataFrames. You can then pass a key (in parentheses) to .loc to select that group. In the code above, we change the Type of all rows from df1 (which has a key of 1) to C.
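A fuller, runnable sketch of the keys approach, using the frames from the question, could look like this. The string keys 'df1' and 'df2' and the name flat are choices made for this example; droplevel is used at the end only to restore the original row labels.

```python
import pandas as pd

df1 = pd.DataFrame({'Name': ['Joe', 'Fred', 'Mike', 'Frank'],
                    'Type': ['A', 'B', 'Both', 'Both'],
                    'ID': [1, 2, 3, 4]})
df2 = pd.DataFrame({'Name': ['Bill', 'Jill', 'Mill', 'Hill'],
                    'Type': ['Both', 'Both', 'B', 'A'],
                    'ID': [1, 2, 3, 4]})

# The keys become the outer level of a MultiIndex and record each row's source
df3 = pd.concat([df1, df2], keys=['df1', 'df2'])

# Assign per source frame by selecting on the outer index level
df3.loc['df1', 'Type'] = 'C'
df3.loc['df2', 'Type'] = 'B'

# Optionally drop the outer level to get back the original row labels
flat = df3.droplevel(0)
print(flat)
```

This works regardless of how many rows each frame has, since the selection is by key and not by position.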
Use merge with indicator=True to find which rows belong to df1 or df2. Next, use np.where to assign B or C.
t = concatdf.merge(df1, how='left', on=concatdf.columns.tolist(), indicator=True)
concatdf['Type'] = np.where(t._merge.eq('left_only'), 'B', 'C')
Out[2185]:
Name Type ID
0 Joe C 1
1 Fred C 2
2 Mike C 3
3 Frank C 4
0 Bill B 1
1 Jill B 2
2 Mill B 3
3 Hill B 4