Insert 0 in pandas series for timeseries gaps - pandas

To plot the data properly, I need missing values to be shown as 0. I do not want a 0 value for every missing day, as that bloats the storage. How do I insert a 0 value, per type, on the first and last day of each gap? I do not need a 0 inserted before or after the whole sequence. Bonus: what if the timeseries is monthly or weekly data (date set to the first of the month, or to every Monday)?
For example, this timeseries contains one gap between the 3rd and the 9th of January for type A. I need to insert a 0 value on the 4th and the 8th of January.
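# assumes: import numpy as np; from pandas import DataFrame; from datetime import datetime, timedelta
df = DataFrame({"date":[datetime(2015,1,1) + timedelta(days=x) for x in list(range(0, 3))+list(range(8, 13))+list(range(2, 9))], "type": ['A']*8+['B']*7, "value": np.random.randint(10, 100, size=15)})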
date type value
0 2015-01-01 A 97
1 2015-01-02 A 11
2 2015-01-03 A 89 <-- last date before the gap
3 2015-01-09 A 31 <-- first day after the gap
4 2015-01-10 A 64
5 2015-01-11 A 82
6 2015-01-12 A 75
7 2015-01-13 A 24
8 2015-01-03 B 72
9 2015-01-04 B 46
10 2015-01-05 B 26
11 2015-01-06 B 91
12 2015-01-07 B 36
13 2015-01-08 B 53
14 2015-01-09 B 85
Desired result (the row indexes would be different):
date type value
0 2015-01-01 A 97
1 2015-01-02 A 11
2 2015-01-03 A 89
. 2015-01-04 A 0 <-- gap starts - new value
<-- do NOT insert any more values for 05--07
. 2015-01-08 A 0 <-- gap ends - new value
3 2015-01-09 A 31
4 2015-01-10 A 64
5 2015-01-11 A 82
6 2015-01-12 A 75
7 2015-01-13 A 24
8 2015-01-03 B 72
9 2015-01-04 B 46
10 2015-01-05 B 26
11 2015-01-06 B 91
12 2015-01-07 B 36
13 2015-01-08 B 53
14 2015-01-09 B 85

Maybe an inelegant solution, but the easiest way seems to be to split the dataframe up, fill in the missing dates, and recombine. Note that this handles a single gap in type A only:
# with pandas imported as pd
dfA = df[df['type'] == 'A'].set_index('date')
# full daily axis spanning type A's own first and last date
new_axis = pd.date_range(dfA.index.min(), dfA.index.max())
missing_dates = new_axis.difference(dfA.index)
# add a 0 row on the first and last missing day
dfA.loc[missing_dates.min()] = 'A', 0
dfA.loc[missing_dates.max()] = 'A', 0
df = pd.concat([dfA.sort_index(), df[df['type'] == 'B'].set_index('date')])
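The snippet above covers one gap in one type. A possible generalization is sketched below, under stated assumptions: the helper name `pad_gaps` and its `freq` parameter are made up for illustration (they are not pandas API), and the columns are assumed to be named `date`, `type` and `value` as in the question. For each type it builds the full date axis, finds the missing dates, and keeps only those that have a present neighbor, which are the first and last day of every gap.

```python
import pandas as pd

def pad_gaps(df, freq="D"):
    # pad_gaps and freq are illustrative names, not part of pandas.
    pieces = []
    for type_, grp in df.groupby("type"):
        present = pd.DatetimeIndex(grp["date"])
        # Full axis between this type's first and last observation.
        full = pd.date_range(present.min(), present.max(), freq=freq)
        missing = full.difference(present)
        if len(missing):
            # Missing dates are strictly inside the axis, so pos - 1 and
            # pos + 1 are always valid positions.
            pos = full.get_indexer(missing)
            is_edge = full[pos - 1].isin(present) | full[pos + 1].isin(present)
            zeros = pd.DataFrame({"date": missing[is_edge], "type": type_, "value": 0})
            grp = pd.concat([grp, zeros])
        pieces.append(grp.sort_values("date"))
    return pd.concat(pieces, ignore_index=True)

# Same dates as the question, with fixed values instead of random ones.
days = list(range(0, 3)) + list(range(8, 13)) + list(range(2, 9))
df = pd.DataFrame({
    "date": [pd.Timestamp(2015, 1, 1) + pd.Timedelta(days=x) for x in days],
    "type": ["A"] * 8 + ["B"] * 7,
    "value": range(10, 25),
})
result = pad_gaps(df)
print(result[result["value"] == 0])
```

For the bonus case, passing `freq="W-MON"` or `freq="MS"` should work in the same way, provided every existing date already falls on that weekly or monthly grid.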

Related

how to transpose m x n into k x 2 form dataframe in pandas

I have an m x n dataframe such as the one below.
date1 amt1 date2 amt2
2021-01-02 120 1991-01-02 90
2021-01-03 100 1991-01-03 95
2021-01-04 110 1991-01-04 95
....
Is there any way to transpose it into a k x 2 dataframe like this?
date amt
2021-01-02 120
2021-01-03 100
2021-01-04 110
...
1991-01-02 90
1991-01-03 95
1991-01-04 95
...
This can be done easily with reshape, although the rows come out in a different order:
pd.DataFrame(df.to_numpy().reshape(-1, 2), columns=['date', 'amt'])
Output:
date amt
0 2021-01-02 120
1 1991-01-02 90
2 2021-01-03 100
3 1991-01-03 95
4 2021-01-04 110
5 1991-01-04 95
Reset the index, then use pd.wide_to_long:
df.reset_index(inplace=True)
pd.wide_to_long(df, stubnames=['date', 'amt'], i=['index'], j='id').reset_index(drop=True)
date amt
0 2021-01-02 120
1 2021-01-03 100
2 2021-01-04 110
3 1991-01-02 90
4 1991-01-03 95
5 1991-01-04 95
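Another way to get the same pair-by-pair order, shown here only as a sketch: stack each (date, amt) column pair explicitly with `pd.concat`. The `pairs` list is an assumption based on the column names in the question and would need extending for more pairs.

```python
import pandas as pd

df = pd.DataFrame({
    "date1": ["2021-01-02", "2021-01-03", "2021-01-04"],
    "amt1": [120, 100, 110],
    "date2": ["1991-01-02", "1991-01-03", "1991-01-04"],
    "amt2": [90, 95, 95],
})

# Column pairs to stack; assumed from the question's naming scheme.
pairs = [("date1", "amt1"), ("date2", "amt2")]

# Rename each pair to the common names, then append the pieces vertically.
long_df = pd.concat(
    [df[[d, a]].rename(columns={d: "date", a: "amt"}) for d, a in pairs],
    ignore_index=True,
)
print(long_df)
```

Unlike the reshape approach, this keeps each column's original dtype, since the values never pass through a single mixed NumPy array.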

Calculate the minimum value of a certain column for all the groups and subtract the value from all the values of a certain column of that group

df = pd.DataFrame([['2018-02-03',42],
['2018-02-03',22],
['2018-02-03',10],
['2018-02-03',32],
['2018-02-03',10],
['2018-02-04',8],
['2018-02-04',2],
['2018-02-04',12],
['2018-02-03',20],
['2018-02-05',30],
['2018-02-05',5],
['2018-02-05',15]])
df.columns = ['date','quantity']
I want to create groups by date and calculate the minimum value of a 'quantity' column for all the groups respectively and subtract the value from all the values of a 'quantity' column of that group. The desired output is:
day value
2018-02-03 32 #(because, 42-10 = 32), 10 is minimum for date 2018-02-03.
2018-02-03 12
2018-02-03 0
2018-02-03 22
2018-02-03 0
2018-02-04 6
2018-02-04 0
2018-02-04 10
2018-02-03 10
2018-02-05 25
2018-02-05 0
2018-02-05 10
Now, this is what I tried:
df = df.groupby('Date', as_index = True)
datamin = df.groupby('Date')['quantity'].min()
But this creates a dataframe with the first quantity by date, and I do not know how to proceed after this.
Try groupby() with transform():
df['value']=df.groupby('date')['quantity'].transform(lambda x:x-x.min())
output of df:
date quantity value
0 2018-02-03 42 32
1 2018-02-03 22 12
2 2018-02-03 10 0
3 2018-02-03 32 22
4 2018-02-03 10 0
5 2018-02-04 8 6
6 2018-02-04 2 0
7 2018-02-04 12 10
8 2018-02-03 20 10
9 2018-02-05 30 25
10 2018-02-05 5 0
11 2018-02-05 15 10
To improve performance, use GroupBy.transform without a lambda function and subtract the per-group minimum from the whole column:
df['value'] = df['quantity'].sub(df.groupby('date')['quantity'].transform('min'))
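As a quick check that the two variants agree, here is a small self-contained sketch on a shortened version of the question's data (the variable names `with_lambda` and `vectorized` are just for illustration):

```python
import pandas as pd

df = pd.DataFrame(
    [["2018-02-03", 42], ["2018-02-03", 22], ["2018-02-03", 10],
     ["2018-02-04", 8], ["2018-02-04", 2],
     ["2018-02-05", 30], ["2018-02-05", 5]],
    columns=["date", "quantity"],
)

# Variant 1: a Python lambda is called once per group.
with_lambda = df.groupby("date")["quantity"].transform(lambda x: x - x.min())

# Variant 2: broadcast the group minimum back to every row, then subtract once.
vectorized = df["quantity"] - df.groupby("date")["quantity"].transform("min")

print(with_lambda.equals(vectorized))
```

The second form avoids the per-group Python call, which is why it tends to be faster when there are many groups.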

How to merge two dataframes based on dates that differ by one day?

Input
df1
id A
2020-01-01 10
2020-02-07 20
2020-04-09 30
df2
id B
2019-12-31 50
2020-02-06 20
2020-02-07 70
2020-04-08 34
2020-04-09 44
Goal
df
id A B
2020-01-01 10 50
2020-02-07 20 20
2020-04-09 30 34
The details are as follows:
df1 is merged with df2 on id, adding the columns from df2.
The type of id is datetime.
Merge rule: each df1 row is matched with the df2 row dated one day earlier (yesterday).
Could you simply add 1 day to df2's ID column before merging?
df1.merge(df2.assign(id=df2['id'] + pd.Timedelta(days=1)), on='id')
id A B
0 2020-01-01 10 50
1 2020-02-07 20 20
2 2020-04-09 30 34
Try pd.merge_asof (both frames must be sorted by id):
df = pd.merge_asof(df1,df2,on='id',tolerance=pd.Timedelta('1 day'),allow_exact_matches=False)
id A B
0 2020-01-01 10 50
1 2020-02-07 20 20
2 2020-04-09 30 34
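The two answers can be compared side by side on the question's data. This is a sketch, not a definitive recipe: `how="left"` and the explicit `sort_values` calls are additions of mine, so that unmatched df1 rows would be kept (with NaN in B) and the sorted-key requirement of `merge_asof` is met even if the input is unordered.

```python
import pandas as pd

df1 = pd.DataFrame({
    "id": pd.to_datetime(["2020-01-01", "2020-02-07", "2020-04-09"]),
    "A": [10, 20, 30],
})
df2 = pd.DataFrame({
    "id": pd.to_datetime(["2019-12-31", "2020-02-06", "2020-02-07",
                          "2020-04-08", "2020-04-09"]),
    "B": [50, 20, 70, 34, 44],
})

# Approach 1: shift df2 forward by one day, then do an exact merge.
shifted = df1.merge(
    df2.assign(id=df2["id"] + pd.Timedelta(days=1)), on="id", how="left"
)

# Approach 2: take the closest strictly earlier df2 row within one day.
asof = pd.merge_asof(
    df1.sort_values("id"),
    df2.sort_values("id"),
    on="id",
    tolerance=pd.Timedelta("1 day"),
    allow_exact_matches=False,
)

print(shifted)
print(asof)
```

The approaches agree here because every id is a whole date. If the ids carried a time of day, the shift would still require an exact match 24 hours earlier, while `merge_asof` would accept any earlier row within the tolerance.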

From 10 years of data, I want to select only calendar days with max or min value

Ok, so I have a dataset of temperatures for each day of the year, over a period of ten years. Index is date converted to datetime.
I want to get a dataset with only the min and max value for each calendar day throughout the 10-year period.
I can convert the index to a string, remove the year and get the dataset that way, but I'm guessing there is a smarter way to do it.
Use DatetimeIndex.strftime to build a month-day key, then aggregate with GroupBy.agg using min and max:
np.random.seed(2020)
d = pd.date_range('2000-01-01', '2010-12-31')
df = pd.DataFrame({"temp": np.random.randint(0, 30, size=len(d))}, index=d)
print(df)
temp
2000-01-01 0
2000-01-02 8
2000-01-03 3
2000-01-04 22
2000-01-05 3
...
2010-12-27 16
2010-12-28 10
2010-12-29 28
2010-12-30 1
2010-12-31 28
[4018 rows x 1 columns]
df = df.groupby(df.index.strftime('%m-%d'))['temp'].agg(['min','max'])
print (df)
min max
01-01 0 28
01-02 0 29
01-03 3 21
01-04 1 28
01-05 0 26
... ...
12-27 3 29
12-28 4 27
12-29 0 29
12-30 1 29
12-31 2 28
[366 rows x 2 columns]
Finally, to get datetimes back, prepend a year (be careful with leap years: pick a leap year such as 2000 so that 02-29 parses):
df.index = pd.to_datetime('2000-' + df.index, format='%Y-%m-%d')
print (df)
min max
2000-01-01 0 28
2000-01-02 0 29
2000-01-03 3 21
2000-01-04 1 28
2000-01-05 0 26
... ...
2000-12-27 3 29
2000-12-28 4 27
2000-12-29 0 29
2000-12-30 1 29
2000-12-31 2 28
[366 rows x 2 columns]
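A variant that avoids the string round trip, given here as a sketch of an alternative rather than the answer's method: group directly on the index's `month` and `day` attributes. The result then has a two-level (month, day) index; the level names set with `rename_axis` are my own choice.

```python
import numpy as np
import pandas as pd

# Same synthetic data as in the answer above.
np.random.seed(2020)
d = pd.date_range("2000-01-01", "2010-12-31")
df = pd.DataFrame({"temp": np.random.randint(0, 30, size=len(d))}, index=d)

# Group by calendar month and day, ignoring the year.
by_day = (
    df.groupby([df.index.month, df.index.day])["temp"]
    .agg(["min", "max"])
    .rename_axis(["month", "day"])
)

# February 29 is present because the range contains leap years.
print(by_day.loc[(2, 29)])
```

Individual calendar days can then be looked up with a tuple, for example `by_day.loc[(12, 31)]`.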

expand year values to month in pandas

I have sales by year:
df = pd.DataFrame({'year':[2015,2016,2017],'value':[12,24,36]})
year value
0 2015 12
1 2016 24
2 2017 36
I want to extrapolate to months:
yyyymm value
201501 1 (ie 12/12, etc)
201502 1
...
201512 1
201601 2
...
201712 3
Any suggestions?
One idea is to use a cross join with a helper DataFrame, convert the columns to strings, and zero-pad the month with Series.str.zfill:
df1 = pd.DataFrame({'m': range(1, 13), 'a' : 1})
df = df.assign(a = 1).merge(df1).drop(columns='a')
df['year'] = df['year'].astype(str) + df.pop('m').astype(str).str.zfill(2)
df = df.rename(columns={'year':'yyyymm'})
Another solution is to create a MultiIndex and use DataFrame.reindex:
mux = pd.MultiIndex.from_product([df['year'], range(1, 13)], names=['yyyymm','m'])
df = df.set_index('year').reindex(mux, level=0).reset_index()
df['yyyymm'] = df['yyyymm'].astype(str) + df.pop('m').astype(str).str.zfill(2)
print (df.head(15))
yyyymm value
0 201501 12
1 201502 12
2 201503 12
3 201504 12
4 201505 12
5 201506 12
6 201507 12
7 201508 12
8 201509 12
9 201510 12
10 201511 12
11 201512 12
12 201601 24
13 201602 24
14 201603 24
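The output above repeats the yearly value on every month. To get the monthly share shown in the question (12/12 = 1 and so on), one extra division is needed. The following is a minimal sketch, assuming numeric values and pandas 1.2 or later for `how="cross"`; it builds `yyyymm` as an integer rather than a string, which is my own simplification.

```python
import pandas as pd

df = pd.DataFrame({"year": [2015, 2016, 2017], "value": [12, 24, 36]})
months = pd.DataFrame({"m": range(1, 13)})

# Cross join: every year is paired with every month (pandas >= 1.2).
out = df.merge(months, how="cross")

# 2015 and 1 become 201501; integer arithmetic avoids string padding.
out["yyyymm"] = out["year"] * 100 + out["m"]

# Spread the yearly total evenly over the twelve months.
out["value"] = out["value"] / 12

out = out[["yyyymm", "value"]]
print(out.head(13))
```

If a string key is preferred, `out["yyyymm"].astype(str)` converts it afterwards.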