Adding NA data for future dates in pandas dataframe - pandas

I have a pandas dataframe with a monthly date index up to the current month. I would like to add NA values for n periods into the future (in my case, 1 year). I tried appending future dates to the existing index in the following manner:
recentDate = inputFileDf.index[-1]
outputFileDf.index = outputFileDf.index.append(pd.date_range(recentDate, periods=12, freq="M"))
This throws ValueError: Length mismatch: Expected axis has 396 elements, new values have 408 elements.
Would appreciate any help to "extend" the dataframe by adding the dates and NA values.

You can use df.reindex here.
Example data:
df = pd.DataFrame(
{'num': [*range(5)]},
index=pd.date_range('2022-10-10', periods=5, freq='D'))
print(df)
num
2022-10-10 0
2022-10-11 1
2022-10-12 2
2022-10-13 3
2022-10-14 4
recentDate = df.index[-1]
new_data = pd.date_range(recentDate, periods=4, freq="M")
new_idx = df.index.append(new_data)
new_df = df.reindex(new_idx)
print(new_df)
num
2022-10-10 0.0
2022-10-11 1.0
2022-10-12 2.0
2022-10-13 3.0
2022-10-14 4.0
2022-10-31 NaN
2022-11-30 NaN
2022-12-31 NaN
2023-01-31 NaN
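Putting the pieces together, here is a runnable sketch of the same idea (daily frequency here; the future range starts one period after the last index value so the last existing date is not duplicated):

```python
import pandas as pd

# Toy frame with a 5-day DatetimeIndex, as in the answer above
df = pd.DataFrame({'num': range(5)},
                  index=pd.date_range('2022-10-10', periods=5, freq='D'))

# Start the future range one day after the last index value so the
# last existing date is not duplicated, then reindex: new rows get NaN
future = pd.date_range(df.index[-1] + pd.Timedelta(days=1), periods=4, freq='D')
new_df = df.reindex(df.index.append(future))

print(new_df)
```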

Use concat, which works whether or not the indices are unique:
recentDate = inputFileDf.index[-1]
df = pd.DataFrame(index=pd.date_range(recentDate, periods=12, freq="M"))
outputFileDf = pd.concat([inputFileDf, df])
If the index values in idx are unique, use DataFrame.reindex:
recentDate = inputFileDf.index[-1]
idx = inputFileDf.index.append(pd.date_range(recentDate, periods=12, freq="M"))
outputFileDf = inputFileDf.reindex(idx)
EDIT: If the original DataFrame already has a month-end index and you append a new monthly range, add 1 month to the last index value; otherwise the last date of the original DataFrame is duplicated:
inputFileDf = pd.DataFrame(columns=['col'],
index=pd.date_range('2022-10-31', periods=4, freq='M'))
print(inputFileDf)
col
2022-10-31 NaN
2022-11-30 NaN
2022-12-31 NaN
2023-01-31 NaN
recentDate = inputFileDf.index[-1]
idx = inputFileDf.index.append(pd.date_range(recentDate + pd.DateOffset(months=1), periods=12, freq="M"))
outputFileDf = inputFileDf.reindex(idx)
print (outputFileDf)
col
2022-10-31 NaN
2022-11-30 NaN
2022-12-31 NaN
2023-01-31 NaN
2023-02-28 NaN
2023-03-31 NaN
2023-04-30 NaN
2023-05-31 NaN
2023-06-30 NaN
2023-07-31 NaN
2023-08-31 NaN
2023-09-30 NaN
2023-10-31 NaN
2023-11-30 NaN
2023-12-31 NaN
2024-01-31 NaN
Or use first index value:
firstDate = inputFileDf.index[0]
idx = pd.date_range(firstDate, periods=12 + len(inputFileDf), freq="M")
outputFileDf = inputFileDf.reindex(idx)
print (outputFileDf)
col
2022-10-31 NaN
2022-11-30 NaN
2022-12-31 NaN
2023-01-31 NaN
2023-02-28 NaN
2023-03-31 NaN
2023-04-30 NaN
2023-05-31 NaN
2023-06-30 NaN
2023-07-31 NaN
2023-08-31 NaN
2023-09-30 NaN
2023-10-31 NaN
2023-11-30 NaN
2023-12-31 NaN
2024-01-31 NaN
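One caveat about the freq="M" alias used throughout: pandas 2.2 renamed the month-end alias to "ME" and deprecated "M" in date_range. A version-tolerant sketch:

```python
import pandas as pd

# Newer pandas (>= 2.2) prefers "ME" for month-end; older versions
# raise ValueError for it, so fall back to the legacy "M" alias.
try:
    idx = pd.date_range('2022-10-31', periods=4, freq='ME')
except ValueError:
    idx = pd.date_range('2022-10-31', periods=4, freq='M')

print(idx)
```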

Related

Plot stacked (100%) bar chart for multiple categories on multiple dates

I have the following initial dataframe:
  Post ID Submission_Date             Flair
0    row1      01.12.2020               NaN
1    row2      03.12.2020        Discussion
2    row3      03.12.2020              News
3    row4      03.12.2020        Discussion
4    row5      06.12.2020     Due Diligence
5    row6      07.12.2020        Discussion
6    row7      31.12.2020        Discussion
7    row8      01.01.2021  Hedge Fund Tears
There are multiple dates (with missing dates in between) and multiple categories per date.
I grouped the dataframe with:
import pandas as pd
import numpy as np # for test data
data = {'Post ID': ['row1', 'row2', 'row3', 'row4', 'row5', 'row6', 'row7', 'row8'], 'Submission_Date': ['01.12.2020', '03.12.2020', '03.12.2020', '03.12.2020', '06.12.2020', '07.12.2020', '31.12.2020', '01.01.2021'], 'Flair': [np.nan, 'Discussion', 'News', 'Discussion', 'Due Diligence', 'Discussion', 'Discussion', 'Hedge Fund Tears']}
df = pd.DataFrame(data)
df['Submission_Date'] = pd.to_datetime(df['Submission_Date'], format='%d.%m.%Y').dt.strftime('%Y-%m-%d')
df = df.groupby('Submission_Date')['Flair'].value_counts(normalize=True).unstack()
The result is a frame indexed by Submission_Date with one column per Flair.
I want the missing dates to be filled in and shown as "empty" bars.
I already tried this:
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(20, 10))
df.plot(kind='bar', ax=ax, stacked=True, width=1)
plt.xlabel('Submission_Date', fontsize=16)
plt.ylabel('Ratio of Flairs used', fontsize=16)
But the dates are incorrect, since the empty days are not displayed.
Assuming this input as df2 (the output of your groupby operation):
Flair Discussion Due Diligence Hedge Fund Tears News
Submission_Date
01.01.2021 NaN NaN 1.0 NaN
03.12.2020 0.666667 NaN NaN 0.333333
06.12.2020 NaN 1.0 NaN NaN
07.12.2020 1.000000 NaN NaN NaN
31.12.2020 1.000000 NaN NaN NaN
You can reindex from pd.date_range:
df2.index = pd.to_datetime(df2.index, format='%d.%m.%Y')
df2 = df2.reindex(pd.date_range(df2.index.min(), df2.index.max()))
df2.index = df2.index.strftime('%Y-%m-%d')
Flair Discussion Due Diligence Hedge Fund Tears News
2020-12-03 0.666667 NaN NaN 0.333333
2020-12-04 NaN NaN NaN NaN
2020-12-05 NaN NaN NaN NaN
2020-12-06 NaN 1.0 NaN NaN
2020-12-07 1.000000 NaN NaN NaN
...
2020-12-30 NaN NaN NaN NaN
2020-12-31 1.000000 NaN NaN NaN
2021-01-01 NaN NaN 1.0 NaN
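As a self-contained sketch of that reindex pattern, with a tiny made-up frame:

```python
import pandas as pd

# Made-up ratios for two dates in dd.mm.yyyy format
df2 = pd.DataFrame({'Discussion': [0.5, 1.0]},
                   index=['03.12.2020', '06.12.2020'])

# Parse the strings, insert every missing calendar day, then format
# back to plain strings so the bar plot shows one tick per day
df2.index = pd.to_datetime(df2.index, format='%d.%m.%Y')
df2 = df2.reindex(pd.date_range(df2.index.min(), df2.index.max()))
df2.index = df2.index.strftime('%Y-%m-%d')

print(df2)
```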

Is it possible to turn quarterly data to monthly?

I'm struggling with this problem and I'm not sure if I'm approaching it correctly.
I have this dataset:
        ticker        date filing_date_x  researchdevelopment     netincome  totalrevenue  ...  commonstocksharesoutstanding
116638  JNJ.US  2019-12-31    2020-02-18         3.232000e+09  4.010000e+09  2.074700e+10  ...                  2.632507e+09
116569  JNJ.US  2020-03-31    2020-04-29         2.580000e+09  5.796000e+09  2.069100e+10  ...                  2.632392e+09
116420  JNJ.US  2020-06-30    2020-07-24         2.707000e+09  3.626000e+09  1.833600e+10  ...                  2.632377e+09
116235  JNJ.US  2020-09-30    2020-10-23         2.840000e+09  3.554000e+09  2.108200e+10  ...                  2.632167e+09
116135  JNJ.US  2020-12-31    2021-02-22         4.032000e+09  1.738000e+09  2.247500e+10  ...                  2.632512e+09
(columns abbreviated; the frame contains many more income-statement and balance-sheet fields)
Then I have this dataframe of daily prices:
ticker date open high low close adjusted_close volume
0 JNJ.US 2021-08-02 172.470 172.840 171.300 172.270 172.2700 3620659
1 JNJ.US 2021-07-30 172.540 172.980 171.840 172.200 172.2000 5346400
2 JNJ.US 2021-07-29 172.740 173.340 171.090 172.180 172.1800 4214100
3 JNJ.US 2021-07-28 172.730 173.380 172.080 172.180 172.1800 5750700
4 JNJ.US 2021-07-27 171.800 172.720 170.670 172.660 172.6600 7089300
I have daily data in the prices dataframe but quarterly data in the first dataframe. I want to merge them so that all the prices between Jan-01-2020 and Mar-01-2020 are matched with the correct quarterly row.
I'm not sure exactly how to do this. I thought of extracting the date to month-year, but I still don't know how to merge based on a range of values.
Any suggestions would be welcomed; if I'm not clear, please let me know and I can clarify.
If I understand correctly, you could create common year and quarter columns in each DataFrame and merge on those columns. I did a left merge, which only keeps rows from the left dataset (the daily data).
If this is not what you are looking for, could you please clarify with a sample input/output?
# importing pandas as pd
import pandas as pd
# Creating dummy data of daily values
dt = pd.Series(['2020-08-02', '2020-07-30', '2020-07-29',
'2020-07-28', '2020-07-27'])
# Convert the underlying data to datetime
dt = pd.to_datetime(dt)
dt_df = pd.DataFrame(dt, columns=['date'])
dt_df['quarter_1'] = dt_df['date'].dt.quarter
dt_df['year_1'] = dt_df['date'].dt.year
print(dt_df)
date quarter_1 year_1
0 2020-08-02 3 2020
1 2020-07-30 3 2020
2 2020-07-29 3 2020
3 2020-07-28 3 2020
4 2020-07-27 3 2020
# Creating dummy data of quarterly values
dt2 = pd.Series(['2019-12-31', '2020-03-31', '2020-06-30',
'2020-09-30', '2020-12-31'])
# Convert the underlying data to datetime
dt2 = pd.to_datetime(dt2)
dt2_df = pd.DataFrame(dt2, columns=['date2'])
dt2_df['quarter_2'] = dt2_df['date2'].dt.quarter
dt2_df['year_2'] = dt2_df['date2'].dt.year
print(dt2_df)
date2 quarter_2 year_2
0 2019-12-31 4 2019
1 2020-03-31 1 2020
2 2020-06-30 2 2020
3 2020-09-30 3 2020
4 2020-12-31 4 2020
Then you can merge however you want.
dt_df.merge(dt2_df, how='left', left_on=['quarter_1', 'year_1'], right_on=['quarter_2', 'year_2'] , validate="many_to_many")
OUTPUT:
date quarter_1 year_1 date2 quarter_2 year_2
0 2020-08-02 3 2020 2020-09-30 3 2020
1 2020-07-30 3 2020 2020-09-30 3 2020
2 2020-07-29 3 2020 2020-09-30 3 2020
3 2020-07-28 3 2020 2020-09-30 3 2020
4 2020-07-27 3 2020 2020-09-30 3 2020
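An alternative worth knowing for this kind of daily-to-quarterly alignment is pd.merge_asof, which matches each daily row to the nearest quarterly date at or before it (both frames must be sorted by the merge key). A sketch with made-up values:

```python
import pandas as pd

daily = pd.DataFrame({'date': pd.to_datetime(['2020-07-27', '2020-07-28', '2020-08-02']),
                      'close': [172.66, 172.18, 172.27]})
quarterly = pd.DataFrame({'date': pd.to_datetime(['2020-03-31', '2020-06-30', '2020-09-30']),
                          'netincome': [5.796e9, 3.626e9, 3.554e9]})

# Each daily row picks up the most recent quarterly report on or
# before its date (merge_asof's default "backward" direction)
merged = pd.merge_asof(daily.sort_values('date'), quarterly, on='date')
print(merged)
```

Note that this matches each price to the *previous* quarter-end, which is often what you want to avoid look-ahead bias; the year/quarter merge above instead matches prices to the quarter they fall inside.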

How to get a df with the first non-NaN value onwards?

I have this df:
CODE DATE TMAX TMIN PP
0 000130 1991-01-01 NaN NaN 0.0
1 000130 1991-01-02 31.2 NaN 0.0
2 000130 1991-01-03 32.0 21.2 0.0
3 000130 1991-01-04 NaN NaN 0.0
4 000130 1991-01-05 NaN 22.0 0.0
... ... ... ... ...
34995 000135 1997-04-24 NaN NaN 0.0
34996 000135 1997-04-25 NaN NaN 4.0
34997 000135 1997-04-26 NaN 22.1 0.0
34998 000135 1997-04-27 31.0 NaN 5.0
34999 000135 1997-04-28 28.8 24.0 0.0
I'm counting the NaN values per CODE in columns TMAX, TMIN, and PP, using this code:
dfna=df[['TMAX','TMIN','PP']].isna().groupby(df.CODE).sum()
But I want to start counting NaN values only from the first non-NaN value onward.
Expected df:
CODE TMAX TMIN PP
000130 2 1 0
000135 0 1 0
...
...
How can I do this?
Thanks in advance.
Think in terms of the whole frame: ffill fills in the NaNs that come after the first valid value, so you can use it to detect exactly those NaNs:
df.isna() & df.ffill().notna()
Now, apply this per group with groupby.apply:
(df[['TMAX','TMIN','PP']].groupby(df['CODE'])
.apply(lambda d: (d.isna() & d.ffill().notna()).sum())
)
Output:
TMAX TMIN PP
CODE
130 2 1 0
135 0 1 0
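The same result can be had without apply: forward-fill within each group first, then build the mask on the whole frame and sum per group. A sketch on a trimmed version of the sample data:

```python
import pandas as pd
import numpy as np

# Trimmed version of the sample data
df = pd.DataFrame({
    'CODE': ['000130'] * 5 + ['000135'] * 5,
    'TMAX': [np.nan, 31.2, 32.0, np.nan, np.nan,
             np.nan, np.nan, np.nan, 31.0, 28.8],
    'TMIN': [np.nan, np.nan, 21.2, np.nan, 22.0,
             np.nan, np.nan, 22.1, np.nan, 24.0],
    'PP':   [0.0, 0.0, 0.0, 0.0, 0.0,
             0.0, 4.0, 0.0, 5.0, 0.0],
})

cols = ['TMAX', 'TMIN', 'PP']
# A cell counts only if it is NaN *and* some valid value precedes it
# within its own CODE group (ffill is done per group, so values never
# leak from one station into the next)
mask = df[cols].isna() & df.groupby('CODE')[cols].ffill().notna()
out = mask.groupby(df['CODE']).sum()
print(out)
```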

data manipulation with na values

My code builds the following df2 DataFrame (toy data):
import pandas as pd
import numpy as np

end = '20201117'
np.random.seed(107)
df = pd.DataFrame()
for i in range(10):
    start = np.random.choice(['20000101', '20100101', '20160101', '20121010'])
    df_tempo = pd.DataFrame({'product': 'p' + str(i),
                             'category': 'cat' + str(np.random.choice(4))},
                            index=pd.bdate_range(start=pd.to_datetime(start),
                                                 end=end,
                                                 freq=np.random.choice(['D', 'W'])))
    df_tempo['valeur1'] = np.random.randint(50, 100, df_tempo.shape[0])
    df_tempo['valeur2'] = np.random.randint(100, 200, df_tempo.shape[0])
    df = pd.concat([df, df_tempo])
df = df.reset_index().rename(columns={'index': 'date'})
df2 = df.pivot(index='date', columns=['category', 'product'], values=['valeur1', 'valeur2'])
From this I would like to calculate the growth rate for 2020, 2019, and earlier years. I tried:
df2.apply([
lambda x: 100*x.valeur1[x.valeur1['2020'].last_valid_index()]/x.valeur1[x.valeur1['2019'].last_valid_index()]-100,
])
but I get an error (I would like to calculate different statistics for valeur1 and valeur2):
AttributeError: 'Series' object has no attribute 'valeur1'
So I did the following instead (not what I actually want, because, as I said, I would like to apply different statistics to valeur1 and valeur2):
df2['valeur1'].apply([
lambda x: 100*x[x['2020'].last_valid_index()]/x[x['2019'].last_valid_index()]-100,
lambda x: 100*x[x['2010'].last_valid_index()]/x[x['2009'].last_valid_index()]-100,
])
It works, except for
lambda x: 100*x[x['2010'].last_valid_index()]/x[x['2009'].last_valid_index()]-100
because some series have no data for 2010, so it raises an error.
I then tried dropna(), but that did not help.
Any ideas on these two problems?
Since date is already the index of df2, pd.Grouper(freq="Y").last() can be used to retrieve the value on the last date. Is this what you are expecting?
Note: df2 = df2.groupby(pd.Grouper(freq="Y")).ffill() will fill the last date of each year with previous valid value(s). For this sample dataset, however, no difference was introduced.
Code
df_last = df2.groupby(pd.Grouper(freq="Y")).last()
out = df_last.diff() / df_last.shift() * 100
Result
print(out)
valeur1 ... valeur2
category cat0 cat2 ... cat0 cat2
product p0 p1 p2 ... p7 p8 p9
date ...
2000-12-31 NaN NaN NaN ... NaN NaN NaN
2001-12-31 NaN 62.000000 -35.869565 ... NaN -37.113402 NaN
2002-12-31 NaN 8.641975 28.813559 ... NaN 13.934426 NaN
2003-12-31 NaN -40.909091 19.736842 ... NaN 30.935252 NaN
2004-12-31 NaN 25.000000 -30.769231 ... NaN -18.131868 NaN
2005-12-31 NaN -20.000000 57.142857 ... NaN -18.791946 NaN
2006-12-31 NaN 19.230769 -15.151515 ... NaN 14.049587 NaN
2007-12-31 NaN 25.806452 -40.476190 ... NaN -5.072464 NaN
2008-12-31 NaN -1.282051 4.000000 ... NaN 32.824427 NaN
2009-12-31 NaN 7.792208 38.461538 ... NaN -9.770115 NaN
2010-12-31 NaN -33.734940 -26.388889 ... NaN 12.738854 NaN
2011-12-31 NaN 29.090909 83.018868 ... NaN -12.994350 43.442623
2012-12-31 NaN -19.718310 -40.206186 ... NaN -5.194805 13.142857
2013-12-31 NaN 17.543860 34.482759 ... NaN 12.328767 -19.696970
2014-12-31 NaN 8.955224 23.076923 ... NaN -22.560976 -18.867925
2015-12-31 NaN -6.849315 -14.583333 ... NaN 43.307087 23.255814
2016-12-31 NaN 33.823529 -7.317073 ... NaN -22.527473 -6.289308
2017-12-31 -22.580645 -35.164835 -2.631579 ... -8.176101 10.638298 -22.818792
2018-12-31 -6.944444 55.932203 5.405405 ... -30.821918 1.923077 35.652174
2019-12-31 14.925373 -8.695652 -1.282051 ... 70.297030 -6.918239 19.230769
2020-12-31 2.597403 8.333333 -20.779221 ... -28.488372 -6.081081 -17.204301
[21 rows x 20 columns]
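Since df_last.diff() / df_last.shift() is exactly the quantity DataFrame.pct_change computes, the growth rate can also be written with the built-in (a small standalone check on made-up year-end values):

```python
import pandas as pd

# Made-up year-end values
df_last = pd.DataFrame({'valeur1': [80.0, 100.0, 90.0]},
                       index=pd.to_datetime(['2018-12-31', '2019-12-31', '2020-12-31']))

out_a = df_last.diff() / df_last.shift() * 100
out_b = df_last.pct_change() * 100  # same growth rate via the built-in

print(out_b)
```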

change dataframe values based on next column value

I have the following problem to solve:
Let's consider a Dataframe like:
df0: A B C
2013-12-31 NaN 8 10
2014-01-31 NaN NaN NaN
2014-02-28 NaN NaN NaN
I want to fill column A with 0 only where the corresponding value in column B is not NaN:
df1: A B C
2013-12-31 0 8 10
2014-01-31 NaN NaN NaN
2014-02-28 NaN NaN NaN
Use loc to find where column B values are not null:
In [160]:
df.loc[df.B.notnull(),'A']=0
df
Out[160]:
A B C
2013-12-31 0 8 10
2014-01-31 NaN NaN NaN
2014-02-28 NaN NaN NaN
[3 rows x 3 columns]
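Equivalently, Series.mask expresses the same conditional fill without .loc indexing (a sketch on the same toy data):

```python
import pandas as pd
import numpy as np

# Same toy frame as above
df = pd.DataFrame({'A': [np.nan, np.nan, np.nan],
                   'B': [8, np.nan, np.nan],
                   'C': [10, np.nan, np.nan]},
                  index=pd.to_datetime(['2013-12-31', '2014-01-31', '2014-02-28']))

# mask replaces values where the condition is True, so this writes 0
# into A exactly where B has a value, leaving the other rows NaN
df['A'] = df['A'].mask(df['B'].notna(), 0)
print(df)
```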