Merge columns and assign the result to a column in pandas

I have 2 pandas dataframes in the below formats:
df1:
Code    Temp  tmp_Code  tmp_Age
ABCDFG  NaN   ABCDF     NaN
ABCDEF  15    ABCDE     NaN

df2:
Code   Temp
ABCDF  18
ABCDL  21
I am trying to merge the two dataframes on tmp_Code in df1 and Code in df2. Where there is a match, the value from df2['Temp'] should be filled into df1['tmp_Age']. I was able to do the join, but I am not sure how to assign the result to df1['tmp_Age'].
Code I tried:
df['tmp_Age'] = pd.merge(df[['tmp_Code', 'Temp']], df2[['Code', 'Temp']], left_on='tmp_Code', right_on='Code', how='left')
Desired output:
Code    Temp  tmp_Code  tmp_Age
ABCDFG  NaN   ABCDF     18
ABCDEF  15    ABCDE     NaN
Any suggestions would be appreciated.

Select the column Temp_y from the merge result and assign it as the new column:
df['tmp_Age'] = pd.merge(df[['tmp_Code','Temp']], df2[['Code','Temp']],
left_on='tmp_Code', right_on='Code',
how='left')['Temp_y'] # <- HERE
print(df)
# Output:
     Code  Temp tmp_Code  tmp_Age
0  ABCDFG   NaN    ABCDF     18.0
1  ABCDEF  15.0    ABCDE      NaN
An alternative way to bring one column over from another dataframe is Series.map:
df['tmp_Age'] = df['tmp_Code'].map(df2.set_index('Code')['Temp'])
print(df)
# Output:
     Code  Temp tmp_Code  tmp_Age
0  ABCDFG   NaN    ABCDF     18.0
1  ABCDEF  15.0    ABCDE      NaN
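For reference, a self-contained version of the map approach, with both dataframes rebuilt from the question:

```python
import numpy as np
import pandas as pd

# DataFrames rebuilt from the question.
df = pd.DataFrame({'Code': ['ABCDFG', 'ABCDEF'],
                   'Temp': [np.nan, 15],
                   'tmp_Code': ['ABCDF', 'ABCDE'],
                   'tmp_Age': [np.nan, np.nan]})
df2 = pd.DataFrame({'Code': ['ABCDF', 'ABCDL'],
                    'Temp': [18, 21]})

# set_index('Code')['Temp'] turns df2 into a Code -> Temp lookup Series;
# map looks each tmp_Code up in it, giving NaN where there is no match.
df['tmp_Age'] = df['tmp_Code'].map(df2.set_index('Code')['Temp'])
print(df)
```

The map version also sidesteps the extra Temp_x/Temp_y suffixed columns that merge produces when both frames have a Temp column.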

Related

pandas: generate a dataframe, column a: start till end date (months) and two more columns

My question was too generic, so here is another try.
I want a dataframe with monthly dates in the first column a.
Then I want to go through the dates and fill in the values in columns b and c.
import pandas as pd

# Try to generate a dataframe with dates.
# This is the dataframe, but how can I fill the dates?
dfa = pd.DataFrame(columns=['date', '1G', '10G'])
print(dfa)

# These are the dates, but how do I get them into the dataframe,
# and how do I add values in the empty cells?
idx = pd.date_range("2016-01-01", periods=55, freq="M")
ts = pd.Series(range(len(idx)), index=idx)
print(ts)
If you need the column filled with datetimes:
import numpy as np
import pandas as pd

dfa = pd.DataFrame({'date': pd.date_range("2016-01-01", periods=55, freq="M"),
                    '1G': np.nan,
                    '10G': np.nan})
print(dfa.head())
        date  1G  10G
0 2016-01-31 NaN  NaN
1 2016-02-29 NaN  NaN
2 2016-03-31 NaN  NaN
3 2016-04-30 NaN  NaN
4 2016-05-31 NaN  NaN
Or, if you need a DatetimeIndex:
dfa = pd.DataFrame({'1G': np.nan,
                    '10G': np.nan},
                   index=pd.date_range("2016-01-01", periods=55, freq="M"))
print(dfa.head())
            1G  10G
2016-01-31 NaN  NaN
2016-02-29 NaN  NaN
2016-03-31 NaN  NaN
2016-04-30 NaN  NaN
2016-05-31 NaN  NaN
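To then fill values into the empty cells, one option is label-based assignment with .loc; the particular dates and values below are made up for illustration:

```python
import numpy as np
import pandas as pd

dfa = pd.DataFrame({'date': pd.date_range("2016-01-01", periods=55, freq="M"),
                    '1G': np.nan,
                    '10G': np.nan})

# Assign by boolean mask on the date column...
dfa.loc[dfa['date'] == '2016-01-31', '1G'] = 5.0
# ...or by row label on the default RangeIndex.
dfa.loc[1, '10G'] = 7.5
print(dfa.head())
```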

How to keep all values from a dataframe except where NaN is present in another dataframe?

I am new to Pandas and I am stuck on this specific problem where I have 2 DataFrames, e.g.
>>> df1
   A   B
0  1   9
1  2   6
2  3  11
3  4   8
>>> df2
      A     B
0   NaN  0.05
1   NaN  0.05
2  0.16   NaN
3  0.16   NaN
What I am trying to achieve is to retain all values from df1 except where there is a NaN in df2, i.e.
>>> df3
     A    B
0  NaN    9
1  NaN    6
2    3  NaN
3    4  NaN
I am talking about dfs with 10,000 rows each so I can't do this manually. Also indices and columns are the exact same in each case. I also have no NaN values in df1.
As far as I understand df.update() will either overwrite all values including NaN or update only those that are NaN.
You can use boolean masking with DataFrame.notna:
# df2 = df2.astype(float)  # Needed if your dtypes are not floats.
m = df2.notna()
df1[m]
     A    B
0  NaN  9.0
1  NaN  6.0
2  3.0  NaN
3  4.0  NaN
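Equivalently, DataFrame.where applies the same mask in a single call; a self-contained sketch with the data rebuilt from the question:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [9, 6, 11, 8]})
df2 = pd.DataFrame({'A': [np.nan, np.nan, 0.16, 0.16],
                    'B': [0.05, 0.05, np.nan, np.nan]})

# where keeps df1's values wherever the condition is True
# and replaces everything else with NaN.
df3 = df1.where(df2.notna())
print(df3)
```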

regroup uneven number of rows pandas df

I need to regroup a df from a long format (several rows per ID) into a wide one (one row per ID), but my attempt fails and the output shape is (unique number of IDs, 2). Is there a more obvious solution?
You can use groupby and pivot:
(df.assign(n=df.groupby('ID').cumcount().add(1))
   .pivot(index='ID', columns='n', values='Value')
   .add_prefix('val_')
   .reset_index()
)
Example input:
df = pd.DataFrame({'ID': [7, 7, 8, 11, 12, 18, 22, 22, 22],
                   'Value': list('abcdefghi')})
Output:
n  ID val_1 val_2 val_3
0   7     a     b   NaN
1   8     c   NaN   NaN
2  11     d   NaN   NaN
3  12     e   NaN   NaN
4  18     f   NaN   NaN
5  22     g     h     i
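For reference, the steps above assembled into a runnable whole:

```python
import pandas as pd

df = pd.DataFrame({'ID': [7, 7, 8, 11, 12, 18, 22, 22, 22],
                   'Value': list('abcdefghi')})

# cumcount numbers the rows within each ID (0, 1, 2, ...); adding 1 gives
# the val_1, val_2, ... slots, and pivot spreads those slots into columns.
out = (df.assign(n=df.groupby('ID').cumcount().add(1))
         .pivot(index='ID', columns='n', values='Value')
         .add_prefix('val_')
         .reset_index())
print(out)
```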

Split one column of dataframe to new columns in pandas

I need to change the following data frame, in which one column contains a list of tuples:
df = pd.DataFrame({'columns1': list('AB'), 'columns2': [1, 2],
                   'columns3': [[(122, 0.5), (104, 0)], [(104, 0.6)]]})
print(df)
  columns1  columns2                columns3
0        A         1  [(122, 0.5), (104, 0)]
1        B         2            [(104, 0.6)]
into this, in which the first element of each tuple becomes a column header:
  columns1  columns2  104  122
0        A         1  0.0  0.5
1        B         2  0.6  NaN
How can I do this using pandas in a Jupyter notebook?
Use a list comprehension to convert the values to dictionaries, sort the columns, and add them back to the original with DataFrame.join:
df = pd.read_csv('Sample - Sample.csv.csv')
print(df)
  column1 column2                                            column3
0       A      U1                       [(187, 0.674), (111, 0.738)]
1       B      U2                                        [(54, 1.0)]
2       C      U3  [(169, 0.474), (107, 0.424), (88, 0.519), (57,...
import ast

df1 = pd.DataFrame([dict(ast.literal_eval(x)) for x in df.pop('column3')],
                   index=df.index).sort_index(axis=1)
df = df.join(df1)
print(df)
  column1 column2   54     57     64     88    107    111    169    187
0       A      U1  NaN    NaN    NaN    NaN    NaN  0.738    NaN  0.674
1       B      U2  1.0    NaN    NaN    NaN    NaN    NaN    NaN    NaN
2       C      U3  NaN  0.526  0.217  0.519  0.424    NaN  0.474    NaN
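If the column already holds real tuples, as in the dataframe at the top of this question, rather than strings read from a CSV, ast.literal_eval can be dropped; a minimal sketch using the question's own data:

```python
import pandas as pd

df = pd.DataFrame({'columns1': list('AB'), 'columns2': [1, 2],
                   'columns3': [[(122, 0.5), (104, 0)], [(104, 0.6)]]})

# Each list of (key, value) tuples becomes a dict, each dict a row;
# pop removes columns3 from df so join does not duplicate it.
df1 = pd.DataFrame([dict(x) for x in df.pop('columns3')],
                   index=df.index).sort_index(axis=1)
df = df.join(df1)
print(df)
```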

Pandas subtract column values only when non nan

I have a dataframe df as follows with about 200 columns:
Date      Run_1  Run_295  Prc
2/1/2020                  3
2/2/2020  2               6
2/3/2020         5        2
I want to subtract column Prc from the columns Run_1, Run_295 and Run_300, but only where they are non-NaN and non-empty, to get the following:
Date      Run_1  Run_295
2/1/2020
2/2/2020  -4
2/3/2020         3
I am not sure how to proceed with the above.
Code to reproduce the dataframe:
import pandas as pd
from io import StringIO
s = """Date,Run_1,Run_295,Prc
2/1/2020,,,3
2/2/2020,2,,6
2/3/2020,,5,2"""
df = pd.read_csv(StringIO(s))
print(df)
You can simply subtract; NaN minus anything stays NaN, which does exactly what you want:
df.Run_1 - df.Prc
Here is the complete code for your output:
df.Run_1 = df.Run_1 - df.Prc
df.Run_295 = df.Run_295 - df.Prc
df.drop('Prc', axis=1, inplace=True)
df
       Date  Run_1  Run_295
0  2/1/2020    NaN      NaN
1  2/2/2020   -4.0      NaN
2  2/3/2020    NaN      3.0
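With around 200 Run_* columns, repeating the assignment per column gets unwieldy. A sketch of a vectorized variant using DataFrame.sub with axis=0; selecting the columns by the Run_ prefix is an assumption about the naming:

```python
import pandas as pd
from io import StringIO

s = """Date,Run_1,Run_295,Prc
2/1/2020,,,3
2/2/2020,2,,6
2/3/2020,,5,2"""
df = pd.read_csv(StringIO(s))

# Subtract Prc row-wise from every Run_* column at once;
# NaN entries simply stay NaN.
run_cols = [c for c in df.columns if c.startswith('Run_')]
df[run_cols] = df[run_cols].sub(df['Prc'], axis=0)
df = df.drop(columns='Prc')
print(df)
```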
Three steps: melt to unpivot your dataframe, then loc to handle the assignment, and GroupBy to remake your original df.
I am sure there is a better way to do this, but it avoids loops and apply:
cols = df.columns
s = pd.melt(df, id_vars=['Date', 'Prc'], value_name='Run Rate')
s.loc[s['Run Rate'].notna(), 'Run Rate'] = s['Run Rate'] - s['Prc']
df_new = (s.groupby([s["Date"], s["Prc"], s["variable"]])["Run Rate"]
           .first().unstack(-1).reset_index())
print(df_new[cols])
variable      Date  Run_1  Run_295  Prc
0         2/1/2020    NaN      NaN    3
1         2/2/2020   -4.0      NaN    6
2         2/3/2020    NaN      3.0    2