I would like to join two dataframes (df1, df2), but I can't figure out how.
import pandas as pd

data = {'Step_ID': ["Step1", "Step1", "Step1", "Step2", "Step2", "Step3", "Step3"],
        'value_01': [2, 2.3, 2.2, 0, 0, 5, 5.2]}
df1 = pd.DataFrame(data)

data = {'Step_ID': ["Step1", "Step1", "Step1", "Step1", "Step2", "Step2", "Step2", "Step3", "Step3", "Step3"],
        'value_02': [2.3, 2.5, 2.1, 2.5, 0, 0, 0, 5.1, 5.6, 5.8]}
df2 = pd.DataFrame(data)
I would like to merge them on the column "Step_ID", pairing the rows within each group positionally. I tried several merge variants and settings, but without any success:
pd.merge(df1, df2, left_on = ['Step_ID'], right_on = ['Step_ID'], how = 'outer')
The closest I have come is with the following code, but it is still not what is required:
df1.combine_first(df2)
Is there any way to join those two dataframes in the required way?
We can try cumcount + merge: groupby.cumcount numbers the rows within each Step_ID group, which gives both frames a positional key to merge on.
new_df = df1.assign(index_count=df1.groupby('Step_ID').cumcount())\
.merge(df2.assign(index_count=df2.groupby('Step_ID').cumcount()),
on=['Step_ID', 'index_count'], how='outer')\
.sort_values(['Step_ID', 'index_count'])\
.drop('index_count', axis=1)
print(new_df)
Step_ID value_01 value_02
0 Step1 2.0 2.3
1 Step1 2.3 2.5
2 Step1 2.2 2.1
7 Step1 NaN 2.5
3 Step2 0.0 0.0
4 Step2 0.0 0.0
8 Step2 NaN 0.0
5 Step3 5.0 5.1
6 Step3 5.2 5.6
9 Step3 NaN 5.8
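To see why this works, here is what the helper key looks like on df1 (a quick inspection step, not part of the solution):
print(df1.assign(index_count=df1.groupby('Step_ID').cumcount()))
  Step_ID  value_01  index_count
0   Step1       2.0            0
1   Step1       2.3            1
2   Step1       2.2            2
3   Step2       0.0            0
4   Step2       0.0            1
5   Step3       5.0            0
6   Step3       5.2            1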
Follow-up from my last question:
regex column selection inside pd.melt function
I used the following code to pivot_longer columns that match the string "rating".
import pandas as pd
import janitor
(df
 .pivot_longer(
     column_names="*rating",
     names_to=("category", ".value"))
)
id category rating
0 1 first 1.0
1 1 first 2.0
2 1 first 3.0
3 2 first 1.0
4 2 first 2.0
5 2 first 3.0
6 3 first 1.0
7 3 first 2.0
8 3 first 3.0
9 1 second 2.8
10 1 second 2.9
11 1 second 2.2
12 2 second 2.0
13 2 second 1.8
14 2 second 1.9
15 3 second 2.2
16 3 second 2.3
17 3 second 2.1
18 1 third 3.4
19 1 third 3.8
20 1 third 2.9
21 2 third 3.2
22 2 third 2.8
23 2 third 2.4
24 3 third 3.3
25 3 third 3.4
26 3 third 2.9
Now, I also have columns that match the strings "type" and "estimate", each with the prefix "first", "second", or "third".
df = pd.DataFrame({
    'id': [1, 1, 1],
    'first_rating': [1, 2, 3],
    'second_rating': [2.8, 2.9, 2.2],
    'third_rating': [3.4, 3.8, 2.9],
    'first_estimate': [1.2, 2.4, 2.8],
    'second_estimate': [2.4, 3, 2.4],
    'third_estimate': [3.4, 3.8, 2.9],
    'first_type': ['red', 'green', 'blue'],
    'second_type': ['red', 'green', 'yellow'],
    'third_type': ['red', 'red', 'blue'],
})
The header (and first row) of my desired output is the following:
   id category  rating  estimate type
0   1    first     1.0       1.2  red
I want to pivot longer the type and estimate columns too, while having only one 'id' and 'category' column. I know I could pivot longer for ratings, estimates, and types separately and then join the dfs, but I imagine there is a more efficient solution.
I tried the following, but it doesn't work:
(df
 .pivot_longer(
     column_names="*rating",
     names_to="category",
     values_to="rating")
 .pivot_longer(
     column_names="*type",
     names_to="category",
     values_to="type")
 .pivot_longer(
     column_names="*estimate",
     names_to="category",
     values_to="estimate"))
The same concept as in the previous solution applies; you just need to adapt the column-name selection, using either a regex or a shell glob (under the hood it uses fnmatch):
# pip install pyjanitor
import pandas as pd
import janitor
# import re - if you are using regex to select the column names
(df
 .pivot_longer(
     # you can use a regex here instead:
     # column_names=re.compile(".+_.+"),
     column_names="*_*",
     names_to=("category", ".value"),
     names_sep="_")
)
id category rating estimate type
0 1 first 1.0 1.2 red
1 1 first 2.0 2.4 green
2 1 first 3.0 2.8 blue
3 1 second 2.8 2.4 red
4 1 second 2.9 3.0 green
5 1 second 2.2 2.4 yellow
6 1 third 3.4 3.4 red
7 1 third 3.8 3.8 red
8 1 third 2.9 2.9 blue
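If you prefer to stay in plain pandas, a rough equivalent (a sketch assuming the same df) is to split the column names into a MultiIndex and stack the category level:
# split "first_rating" into ("first", "rating"), then stack the first level
tmp = df.set_index('id')
tmp.columns = tmp.columns.str.split('_', expand=True)
out = tmp.stack(level=0).rename_axis(['id', 'category']).reset_index()
# note: the value columns come back in alphabetical order (estimate, rating, type)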
I have a dataframe:
df =
  C1  A1  A2  A3 Type
   A   1   5   2   AG
   A   7   3   8   SC

And I want to create:

df =
  C1  A1_AG  A1_SC  A2_AG  A2_SC
   A      1      7      5      3

How can it be done?
You can instead use melt and a transpose:
(df.melt('Type')
   .assign(col=lambda d: d['Type'] + '_' + d['variable'])
   .set_index('col')[['value']].T
)
Output:
col    AG_A1  SC_A1  AG_A2  SC_A2  AG_A3  SC_A3
value      1      7      5      3      2      8
With additional column(s):
(df.melt(['C1', 'Type'])
   .assign(col=lambda d: d['Type'] + '_' + d['variable'])
   .pivot(index=['C1'], columns='col', values='value')
   .reset_index()
)
Output:
col C1 AG_A1 AG_A2 AG_A3 SC_A1 SC_A2 SC_A3
0 A 1 5 2 7 3 8
Use DataFrame.set_index with DataFrame.unstack:
df = df.set_index(['C1','Type']).unstack()
df.columns = df.columns.map(lambda x: f'{x[0]}_{x[1]}')
df = df.reset_index()
print (df)
C1 A1_AG A1_SC A2_AG A2_SC A3_AG A3_SC
0 A 1.0 7.0 5.0 3.0 2.0 8.0
One convenience option with pivot_wider from pyjanitor:
# pip install pyjanitor
import pandas as pd
import janitor
df.pivot_wider(index = 'C1', names_from = 'Type')
C1 A1_AG A1_SC A2_AG A2_SC A3_AG A3_SC
0 A 1.0 7.0 5.0 3.0 2.0 8.0
Of course, you can skip the convenience function and use pivot directly:
result = df.pivot(index='C1', columns='Type')
result.columns = result.columns.map('_'.join)
result.reset_index()
C1 A1_AG A1_SC A2_AG A2_SC A3_AG A3_SC
0 A 1.0 7.0 5.0 3.0 2.0 8.0
I'm dealing with incomplete data and would like to assign scoring to different rows.
For example:
Bluetooth and WLAN are non-numeric columns, but I would like to assign a value of 1 if data is available and 0 if there is no data (NaN).
Samsung's score would be 1 + 1 + 4 = 6
Nokia's score would be 0 + 0 + 5 = 5
         Bluetooth   WLAN  Rating  Score
Apple      Class-A  USB-A     NaN
Samsung    Class-B  USB-B       4
Nokia          NaN    NaN       5
I'm using pandas at the moment, but I'm not sure if pandas alone is capable of this without NumPy.
Thanks a lot!
import pandas as pd
import numpy as np

data = {'Bluetooth': ['class-A', 'class-B', np.nan],
        'WLAN': ['usb-A', 'usb-B', np.nan],
        'Rating': [np.nan, 4, 5]}
df = pd.DataFrame(data)

# missing entries count as 0
df = df.replace(np.nan, 0)
# coerce the remaining strings to NaN, then count any available data as 1
df = df.apply(lambda x: pd.to_numeric(x, errors='coerce')).fillna(1)
df['score'] = df.sum(axis=1)
print(df.head())
Output:
Bluetooth WLAN Rating score
0 1.0 1.0 0.0 2.0
1 1.0 1.0 4.0 6.0
2 0.0 0.0 5.0 5.0
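Alternatively, a more direct sketch that skips the replace/coerce round-trip (assuming Bluetooth and WLAN are the only non-numeric columns, and starting from the original df):
# notna() marks available data as True (1) and missing data as False (0)
df['score'] = df[['Bluetooth', 'WLAN']].notna().sum(axis=1) + df['Rating'].fillna(0)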
Try this:
import pandas as pd
import numpy as np
# count missing entries in the non-numeric columns only,
# otherwise a missing Rating would be subtracted twice
df['Nan_count'] = df[['Bluetooth', 'WLAN']].isnull().sum(axis=1)
df['score'] = -df['Nan_count'] + df['Rating'].replace(np.nan, 0) + 2
With this solution we don't need to change the NaN values in the dataframe, and the computation stays cheap.
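On the sample data above this gives:
print(df[['Nan_count', 'score']])
   Nan_count  score
0          0    2.0
1          0    6.0
2          2    5.0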
I have two data frames that both contain coordinates. One of them, df1, has coordinates at a finer resolution (with decimals), and I would like to join it to df2, which has a coarser resolution:
import pandas as pd
df1 = pd.DataFrame({'x': [1.1, 2.2, 3.3],
                    'y': [2.3, 3.3, 4.1],
                    'val': [10, 11, 12]})
df2 = pd.DataFrame({'x': [1, 2, 3, 5.5],
                    'y': [2, 3, 4, 5.6]})
df1['x_org'] = df1['x']
df1['y_org'] = df1['y']
df1[['x', 'y']] = df1[['x', 'y']].round()
df1 = pd.merge(df1, df2, how='left', on=['x', 'y'])
df1.drop(['x', 'y'], axis=1)
# rename...
The code above does exactly what I want, but it is a bit cumbersome. Is there an easier way to achieve this?
Use:
df1.merge(df2,
          how='left',
          left_on=[df1['x'].round(), df1['y'].round()],
          right_on=['x', 'y'],
          suffixes=('', '_')).drop(['x_', 'y_'], axis=1)
It is also possible to remove the columns ending with _ dynamically:
df = df1.merge(df2,
               how='left',
               left_on=[df1['x'].round(), df1['y'].round()],
               right_on=['x', 'y'],
               suffixes=('', '_')).filter(regex='.*[^_]$')
print (df)
x y val
0 1.1 2.3 10
1 2.2 3.3 11
2 3.3 4.1 12
df = df1.merge(df2,
               how='left',
               left_on=[df1['x'].round(), df1['y'].round()],
               right_on=['x', 'y'],
               suffixes=('', '_end')).filter(regex='.*(?<!_end)$')
print (df)
x y val
0 1.1 2.3 10
1 2.2 3.3 11
2 3.3 4.1 12
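The filter trick works because DataFrame.filter matches the regex against each column label, and .*(?<!_end)$ keeps only the labels that do not end with the suffix. A quick check (hypothetical labels, just to illustrate):
import re
print([c for c in ['x', 'y', 'val', 'x_end', 'y_end'] if re.match('.*(?<!_end)$', c)])
# ['x', 'y', 'val']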
Or:
df = (df1.set_index(['x', 'y'], drop=False).rename(lambda x: round(x))
         .merge(df2.set_index(['x', 'y']),
                left_index=True,
                right_index=True,
                how='left').reset_index(drop=True))
print (df)
x y val
0 1.1 2.3 10
1 2.2 3.3 11
2 3.3 4.1 12
IIUC, you could pass the rounded values as joining keys:
pd.merge(df1.rename(columns={'x': 'x_org', 'y': 'y_org'}),
         df2,
         how='left',
         left_on=[df1['x'].round(), df1['y'].round()],
         right_on=['x', 'y'])  # .drop(['x', 'y'], axis=1) if x/y are unwanted
output:
   x_org  y_org  val    x    y
0    1.1    2.3   10  1.0  2.0
1    2.2    3.3   11  2.0  3.0
2    3.3    4.1   12  3.0  4.0
I have a dataframe with a Response value, a Time column that marks one row per group as "pre", and a group column IDs. I am not sure if it is possible to use pandas to compute, for each group:
difference = Response[Time == "pre"] - min(Response)
If pre is always the first row per group and the values in the output should be repeated:
df['diff'] = df.groupby('IDs')['Response'].transform(lambda x: x.iat[0] - x.min())
To keep the value only on the first row per group, it is possible to replace the repeated values with empty strings, but this creates mixed values (numeric with strings), so further processing may be a problem:
df['diff'] = df['diff'].mask(df['diff'].duplicated(), '')
EDIT:
df = pd.DataFrame({
    'Response': [2, 5, 0.4, 2, 1, 4],
    'Time': [7, 'pre', 9, 4, 2, 'pre'],
    'IDs': list('aaabbb')
})
#print (df)
d = df[df.Time=="pre"].set_index('IDs')['Response'].to_dict()
print (d)
{'a': 5.0, 'b': 4.0}
df['diff'] = df.groupby('IDs')['Response'].transform(lambda x: d[x.name] - x.min())
print (df)
Response Time IDs diff
0 2.0 7 a 4.6
1 5.0 pre a 4.6
2 0.4 9 a 4.6
3 2.0 4 b 3.0
4 1.0 2 b 3.0
5 4.0 pre b 3.0
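A variant without the lambda, mapping the pre value of each group back onto its rows (a sketch on the same sample df):
# pre value per group, then subtract the per-group minimum
pre = df.loc[df['Time'].eq('pre')].set_index('IDs')['Response']
df['diff'] = df['IDs'].map(pre) - df.groupby('IDs')['Response'].transform('min')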