Pandas: subtract column values only when non-NaN

I have a dataframe df as follows with about 200 columns:
Date      Run_1  Run_295  Prc
2/1/2020                    3
2/2/2020      2             6
2/3/2020             5      2
I want to subtract column Prc from columns Run_1, Run_295, Run_300, ... only when they are non-NaN or non-empty, to get the following:
Date      Run_1  Run_295
2/1/2020
2/2/2020     -4
2/3/2020              3
I am not sure how to proceed with the above.
Code to reproduce the dataframe:
import pandas as pd
from io import StringIO
s = """Date,Run_1,Run_295,Prc
2/1/2020,,,3
2/2/2020,2,,6
2/3/2020,,5,2"""
df = pd.read_csv(StringIO(s))
print(df)

You can simply subtract: pandas arithmetic is NaN-aware, so NaN minus anything stays NaN, which does exactly what you want:
df.Run_1 - df.Prc
Here is the complete code for your output:
df.Run_1 = df.Run_1 - df.Prc
df.Run_295 = df.Run_295 - df.Prc
df.drop('Prc', axis=1, inplace=True)
df
       Date  Run_1  Run_295
0  2/1/2020    NaN      NaN
1  2/2/2020   -4.0      NaN
2  2/3/2020    NaN      3.0
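With roughly 200 Run_* columns, assigning them one by one gets tedious. A minimal vectorized sketch, assuming all target columns share the Run_ name prefix used in the question:
run_cols = df.filter(like='Run_').columns
# NaN cells stay NaN through the subtraction, so no mask is needed
df[run_cols] = df[run_cols].sub(df['Prc'], axis=0)
df = df.drop(columns='Prc')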

Three steps: melt to unpivot your dataframe,
then loc to handle the assignment,
and GroupBy to remake your original df.
I'm sure there is a better way to do this, but it avoids loops and apply:
cols = df.columns
s = pd.melt(df,id_vars=['Date','Prc'],value_name='Run Rate')
s.loc[s['Run Rate'].notna(), 'Run Rate'] = s['Run Rate'] - s['Prc']
df_new = s.groupby([s["Date"], s["Prc"], s["variable"]])["Run Rate"].first().unstack(-1).reset_index()
print(df_new[cols])
variable      Date  Run_1  Run_295  Prc
0         2/1/2020    NaN      NaN    3
1         2/2/2020   -4.0      NaN    6
2         2/3/2020    NaN      3.0    2
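As a side note, the notna mask in the loc step is optional: NaN propagates through subtraction anyway, so the shorter form gives the same result:
s['Run Rate'] = s['Run Rate'] - s['Prc']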

Related

How to get each column merged from a separate file to line up next to each other rather than under each df?

Trying to append single-column dfs to one csv, each column representing the old dfs. I do not know how to stop the dfs from stacking in the csv file.
import os
import pandas as pd

master_df = pd.DataFrame()
for file in os.listdir('TotalDailyMCUSDEachPool'):
    if file.endswith('.csv'):
        master_df = master_df.append(pd.read_csv(file))
master_df.to_csv('MasterFile.csv', index=False)
You should use pd.concat with axis=1 for your purpose (note that DataFrame.append has no axis argument, and append was removed entirely in pandas 2.0). Check this for more info: https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html.
An example is as follows:
df1 = pd.DataFrame(
    {
        "A": ["A0", "A1", "A2", "A3"],
    }
)
df2 = pd.DataFrame(
    {
        "B": ["B2", "B3", "B6", "B7"],
    }
)
result = pd.concat([df1, df2], axis=1)
# A B
# 0 A0 B2
# 1 A1 B3
# 2 A2 B6
# 3 A3 B7
append works fine "out of the box" only if the column names of the datasets are the same and you just want to append rows. In this example:
master_df = pd.DataFrame()
df = pd.DataFrame({'Numbers': [1010, 2020, 3030, 2020, 1515, 3030, 4545]})
df2 = pd.DataFrame({'Others': [1015, 2132, 3030, 2020, 1515, 3030, 4545]})
master_df = master_df.append(df)
master_df = master_df.append(df2)
master_df
You would get this output:
Numbers Others
0 1010.0 NaN
1 2020.0 NaN
2 3030.0 NaN
3 2020.0 NaN
4 1515.0 NaN
5 3030.0 NaN
6 4545.0 NaN
0 NaN 1015.0
1 NaN 2132.0
2 NaN 3030.0
3 NaN 2020.0
4 NaN 1515.0
5 NaN 3030.0
6 NaN 4545.0
To prevent it, use pd.concat() with axis=1:
master_df = pd.concat([df,df2], axis = 1)
master_df
This results in the desired dataset:
Numbers Others
0 1010 1015
1 2020 2132
2 3030 3030
3 2020 2020
4 1515 1515
5 3030 3030
6 4545 4545
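Applied back to the original loop, a hedged sketch (the directory and file names come from the question; collecting the frames in a list and concatenating once also avoids re-copying master_df on every iteration):
import os
import pandas as pd

frames = []
folder = 'TotalDailyMCUSDEachPool'
for file in os.listdir(folder):
    if file.endswith('.csv'):
        frames.append(pd.read_csv(os.path.join(folder, file)))
# axis=1 lines the single-column frames up side by side
master_df = pd.concat(frames, axis=1)
master_df.to_csv('MasterFile.csv', index=False)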

Merge columns and assign it to a column in pandas

I have 2 pandas dataframes in the below formats:
df1:
Code Temp tmp_Code tmp_Age
ABCDFG NaN ABCDF NaN
ABCDEF 15 ABCDE NaN
df2
Code Temp
ABCDF 18
ABCDL 21
I am trying to merge 2 pandas dataframes based on tmp_Code in df1 with Code in df2. If there is a match, value in df2['Temp'] has to be filled in df1['tmp_Age']. I was able to do the join but not sure how to assign it to df1['tmp_Age'].
Code I tried:
df['tmp_Age'] = pd.merge(df[['tmp_Code','Temp']], df2[['Code','Temp']],left_on='tmp_Code',right_on='Code',how='left')
Desired output:
Code Temp tmp_Code tmp_Age
ABCDFG NaN ABCDF 18
ABCDEF 15 ABCDE NaN
Any suggestions would be appreciated.
Select the column Temp_y (pandas adds the _x/_y suffixes because both frames have a Temp column) and set it as a new column:
df['tmp_Age'] = pd.merge(df[['tmp_Code','Temp']], df2[['Code','Temp']],
left_on='tmp_Code', right_on='Code',
how='left')['Temp_y'] # <- HERE
print(df)
# Output:
Code Temp tmp_Code tmp_Age
0 ABCDFG NaN ABCDF 18.0
1 ABCDEF 15.0 ABCDE NaN
An alternative for bringing a single column over from another dataframe is Series.map:
df['tmp_Age'] = df['tmp_Code'].map(df2.set_index('Code')['Temp'])
print(df)
# Output:
Code Temp tmp_Code tmp_Age
0 ABCDFG NaN ABCDF 18.0
1 ABCDEF 15.0 ABCDE NaN
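If the default _x/_y suffixes feel opaque, merge also lets you name them explicitly. A small sketch (the '_df2' suffix is an arbitrary choice of mine, not from the original answer):
merged = pd.merge(df[['tmp_Code','Temp']], df2[['Code','Temp']],
                  left_on='tmp_Code', right_on='Code',
                  how='left', suffixes=('', '_df2'))
df['tmp_Age'] = merged['Temp_df2']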

Insert multiple dates at the start of every group in pandas

I have a dataframe with millions of groups. I am trying to, for each group, add 3 months of dates (month end dates) at the top of every group. So if the first observation of a group is December 2019, I want to fill 3 rows prior to that observation with dates from September 2019 to November 2019. I also want to fill the group column with the relevant group ID and the other columns can remain as null values.
Would like to avoid looping if possible as this is a very large dataset
This is my before DataFrame:
import pandas as pd

before = pd.DataFrame({'Group': [1,1,1,1,1,2,2,2,2,2],
                       'Date': ['31/10/2018','30/11/2018','31/12/2018','31/01/2019','28/02/2019',
                                '30/03/2001','30/04/2001','31/05/2001','30/06/2001','31/07/2001'],
                       'value': [1.1,1.7,1.9,2.3,1.5,2.8,2,2,2,2]})
This is my after DataFrame
import numpy as np
import pandas as pd

after = pd.DataFrame({'Group': [1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2],
                      'Date': ['31/07/2018','31/08/2018','30/09/2018','31/10/2018','30/11/2018','31/12/2018','31/01/2019','28/02/2019',
                               '31/12/2000','31/01/2001','28/02/2001','30/03/2001','30/04/2001','31/05/2001','30/06/2001','31/07/2001'],
                      'value': [np.nan,np.nan,np.nan,1.1,1.7,1.9,2.3,1.5,np.nan,np.nan,np.nan,2.8,2,2,2,2]})
Because processing each group separately cannot be fast when there are many groups, the idea is to get the first row of each Group with DataFrame.drop_duplicates, shift the months with offsets.MonthOffset, join everything together, and add all the missing dates in between:
before['Date'] = pd.to_datetime(before['Date'], dayfirst=True)
df1 = before.drop_duplicates('Group')
# first and last shifted months - by 3 and by 1 months
# (MonthOffset is gone from recent pandas; pd.DateOffset(months=n) is the portable spelling)
df11 = df1[['Group','Date']].assign(Date = lambda x: x['Date'] - pd.offsets.MonthOffset(3))
df12 = df1[['Group','Date']].assign(Date = lambda x: x['Date'] - pd.offsets.MonthOffset(1))
# resample('M') materializes every month end between the two boundary dates per group
df = (pd.concat([df11, df12], sort=False, ignore_index=True)
        .set_index('Date')
        .groupby('Group')
        .resample('M')
        .size()
        .reset_index(name='value')
        .assign(value = np.nan))
print (df)
Group Date value
0 1 2018-07-31 NaN
1 1 2018-08-31 NaN
2 1 2018-09-30 NaN
3 2 2000-12-31 NaN
4 2 2001-01-31 NaN
5 2 2001-02-28 NaN
Last, append to the original and sort:
df = pd.concat([before, df], ignore_index=True).sort_values(['Group','Date'])
print (df)
Group Date value
10 1 2018-07-31 NaN
11 1 2018-08-31 NaN
12 1 2018-09-30 NaN
0 1 2018-10-31 1.1
1 1 2018-11-30 1.7
2 1 2018-12-31 1.9
3 1 2019-01-31 2.3
4 1 2019-02-28 1.5
13 2 2000-12-31 NaN
14 2 2001-01-31 NaN
15 2 2001-02-28 NaN
5 2 2001-03-30 2.8
6 2 2001-04-30 2.0
7 2 2001-05-31 2.0
8 2 2001-06-30 2.0
9 2 2001-07-31 2.0
If there are only a few new months, you can omit the groupby part:
before['Date'] = pd.to_datetime(before['Date'], dayfirst=True)
df1 = before.drop_duplicates('Group')
df11 = df1[['Group','Date']].assign(Date = lambda x: x['Date'] - pd.offsets.MonthOffset(3))
df12 = df1[['Group','Date']].assign(Date = lambda x: x['Date'] - pd.offsets.MonthOffset(2))
df13 = df1[['Group','Date']].assign(Date = lambda x: x['Date'] - pd.offsets.MonthOffset(1))
df = (pd.concat([df11, df12, df13, before], ignore_index=True, sort=False)
        .sort_values(['Group','Date']))
print (df)
Group Date value
0 1 2018-07-31 NaN
2 1 2018-08-31 NaN
4 1 2018-09-30 NaN
6 1 2018-10-31 1.1
7 1 2018-11-30 1.7
8 1 2018-12-31 1.9
9 1 2019-01-31 2.3
10 1 2019-02-28 1.5
1 2 2000-12-30 NaN
3 2 2001-01-30 NaN
5 2 2001-02-28 NaN
11 2 2001-03-30 2.8
12 2 2001-04-30 2.0
13 2 2001-05-31 2.0
14 2 2001-06-30 2.0
15 2 2001-07-31 2.0
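On recent pandas versions, where MonthOffset is no longer available, the second approach can be written with pd.DateOffset; a minimal sketch of the same logic (my rewrite, not the original answer):
first = before.drop_duplicates('Group')[['Group','Date']]
# one padding row per group for each of the 3 preceding months
pad = pd.concat([first.assign(Date=first['Date'] - pd.DateOffset(months=n)) for n in (3, 2, 1)],
                ignore_index=True)
df = pd.concat([pad, before], ignore_index=True, sort=False).sort_values(['Group','Date'])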

subtract each value in column by entire column

I have the following df1 :
prueba
12-03-2018 7
08-03-2018 1
06-03-2018 9
05-03-2018 5
I would like to take each value in the column, beginning with the last (5), and subtract the entire column by that value, then iterate upwards and do the same with the remaining values. For each subtraction I would like to generate a column, building a df with the results of each subtraction:
The desired output would be something like this:
05-03-2018 06-03-2018 08-03-2018 12-03-2018
12-03-2018 2 -2 6 0
08-03-2018 -4 -8 0 NaN
06-03-2018 4 0 NaN NaN
05-03-2018 0 NaN NaN NaN
What I tried to obtain the desired output was: first, take df1 and sort it:
df2=df1.sort_index(ascending=True)
create an empty df:
main_df=pd.DataFrame()
and then iterate over the values in df2 and subtract each one from the df1 column:
for index, row in df2.iterrows():
    datos = df1 - row['prueba']
    df = pd.DataFrame(data=datos, index=index)
    if main_df.empty:
        main_df = df
    else:
        main_df = main_df.join(df)
print(main_df)
However the following error outputs:
TypeError: Index(...) must be called with a collection of some kind, '05-03-2018' was passed
You can use np.triu with an array subtraction:
s = df.prueba.values.astype(float)
s = np.triu((s - s[:, None]).T)               # pairwise differences, keep the upper triangle
s[np.tril_indices(s.shape[0], -1)] = np.nan   # NaN below the diagonal
pd.DataFrame(s, columns=df.index, index=df.index).reindex(columns=df.index[::-1])
Out[482]:
05-03-2018 06-03-2018 08-03-2018 12-03-2018
12-03-2018 2.0 -2.0 6.0 0.0
08-03-2018 -4.0 -8.0 0.0 NaN
06-03-2018 4.0 0.0 NaN NaN
05-03-2018 0.0 NaN NaN NaN
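If the index bookkeeping in the NumPy version feels opaque, here is a pandas-flavoured sketch of the same idea (my rewrite, assuming the same df as above):
vals = df['prueba'].to_numpy(dtype=float)
# broadcasting: out[i, j] = vals[i] - vals[j]
out = pd.DataFrame(vals[:, None] - vals[None, :], index=df.index, columns=df.index)
# keep only the entries on or above the diagonal, as in the desired output
out = out.where(np.triu(np.ones(out.shape, dtype=bool)))
out = out[df.index[::-1]]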
Kind of messy, but it does the work:
count = 0
df_new = pd.DataFrame()
for i, v, date in zip(df.index, df["prueba"][::-1], df.index[::-1]):
    new_val = df["prueba"] - v        # subtract this value from the whole column
    if count > 0:
        new_val[-count:] = np.nan     # blank out the rows below the diagonal
    df_new[date] = new_val
    count += 1
df_new

How to work with 'NA' in pandas?

I am merging two data frames in pandas. When the joining fields contain 'NA', pandas automatically excludes those records. How can I keep the records that have the value 'NA'?
For me it works fine:
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'A':[np.nan,2,1],
                    'B':[5,7,8]})
print (df1)
A B
0 NaN 5
1 2.0 7
2 1.0 8
df2 = pd.DataFrame({'A':[np.nan,2,3],
                    'C':[4,5,6]})
print (df2)
A C
0 NaN 4
1 2.0 5
2 3.0 6
print (pd.merge(df1, df2, on=['A']))
A B C
0 NaN 5 4
1 2.0 7 5
print (pd.__version__)
0.19.2
EDIT:
It seems there is another problem - your NA values are converted to NaN.
You can use pandas.read_excel, where it is possible to define which values are converted to NaN with the parameters keep_default_na and na_values:
df = pd.read_excel('test.xlsx',keep_default_na=False,na_values=['NaN'])
print (df)
a b
0 NaN NA
1 20.0 40
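The same parameters exist in pandas.read_csv, which makes a self-contained demonstration easier; a hedged sketch with made-up data:
from io import StringIO
import pandas as pd

s = 'a,b\nNaN,NA\n20,40'
df = pd.read_csv(StringIO(s), keep_default_na=False, na_values=['NaN'])
print(df)
#       a   b
# 0   NaN  NA
# 1  20.0  40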