Pandas cycling through a period of 28 days using timestamp (e.g.) - pandas

I am trying to work out how to get a 28-day cycle in my df. I have a dataset spanning more than a couple of years with full dates, and there doesn't seem to be a strftime code that covers a 28-day (lunar) month.
How can I have the minimum date be 1 (say, for the 1st of January 2018) and 28 for the 28th of January 2018,
so that the 29th of January becomes 1 again, the 30th of January becomes 2, and so on?
How can I achieve this cycle?

Are you looking for something like this?
Input data (the worst case):
Date
0 2018-07-12
1 2018-07-13
2 2018-07-14
3 2018-07-14 # <- duplicate day
4 2018-07-15
5 2018-07-30 # <- missing days
6 2018-07-31
7 2018-08-08 # <- end cycle day, missing days
8 2018-08-08 # <- duplicate day
9 2018-08-09 # <- new cycle day
10 2018-08-10
Compute each date's offset from the minimum date, wrap it with modulo 28, and shift so the cycle starts at 1 (using a timedelta rather than day_of_year, so it also works across year boundaries):
df['Day'] = (df['Date'] - df['Date'].min()).dt.days % 28 + 1
Result output:
>>> df
Date Day
0 2018-07-12 1
1 2018-07-13 2
2 2018-07-14 3
3 2018-07-14 3
4 2018-07-15 4
5 2018-07-30 19
6 2018-07-31 20
7 2018-08-08 28
8 2018-08-08 28
9 2018-08-09 1
10 2018-08-10 2

Related

Pandas take daily mean within resampled date

I have a dataframe with trip counts every 20 minutes during a whole month, let's say:
Date Trip count
0 2019-08-01 00:00:00 3
1 2019-08-01 00:20:00 2
2 2019-08-01 00:40:00 4
3 2019-08-02 00:00:00 6
4 2019-08-02 00:20:00 4
5 2019-08-02 00:40:00 2
I want to take daily mean of all trip counts every 20 minutes. Desired output (for above values) looks like:
Date mean
0 00:00:00 4.5
1 00:20:00 3
2 00:40:00 3
..
72 23:40:00 ..
You can group by the time component via Series.dt.time, since the timestamps always fall on 00, 20 or 40 minutes with no seconds:
df['Date'] = pd.to_datetime(df['Date'])
df1 = df.groupby(df['Date'].dt.time)[['Trip count']].mean()
#alternative
#df1 = df.groupby(df['Date'].dt.strftime('%H:%M:%S'))[['Trip count']].mean()
print (df1)
Trip count
Date
00:00:00 4.5
00:20:00 3.0
00:40:00 3.0
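A self-contained sketch of the same idea (sample data reduced to two days and two slots; column names as in the question):

```python
import pandas as pd

df = pd.DataFrame({
    "Date": pd.to_datetime(["2019-08-01 00:00:00", "2019-08-01 00:20:00",
                            "2019-08-02 00:00:00", "2019-08-02 00:20:00"]),
    "Trip count": [3, 2, 6, 4],
})

# average each 20-minute slot across all days
out = df.groupby(df["Date"].dt.time)["Trip count"].mean()
```

Selecting the column explicitly avoids trying to average the datetime column itself, which newer pandas versions refuse to do.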

From 10 years of data, I want to select only calendar days with max or min value

Ok, so I have a dataset of temperatures for each day of the year, over a period of ten years. Index is date converted to datetime.
I want to get a dataset with only the min and max value for each calendar day throughout the 10-year period.
I can convert the index to a string, remove the year and get the dataset that way, but I'm guessing there is a smarter way to do it.
Group by the month-day string from strftime and aggregate with GroupBy.agg using min and max:
np.random.seed(2020)
d = pd.date_range('2000-01-01', '2010-12-31')
df = pd.DataFrame({"temp": np.random.randint(0, 30, size=len(d))}, index=d)
print(df)
temp
2000-01-01 0
2000-01-02 8
2000-01-03 3
2000-01-04 22
2000-01-05 3
...
2010-12-27 16
2010-12-28 10
2010-12-29 28
2010-12-30 1
2010-12-31 28
[4018 rows x 1 columns]
df = df.groupby(df.index.strftime('%m-%d'))['temp'].agg(['min','max'])
print (df)
min max
01-01 0 28
01-02 0 29
01-03 3 21
01-04 1 28
01-05 0 26
... ...
12-27 3 29
12-28 4 27
12-29 0 29
12-30 1 29
12-31 2 28
[366 rows x 2 columns]
Finally, if you need real datetimes, you can prepend a year (use a leap year such as 2000 so 02-29 still parses):
df.index = pd.to_datetime('2000-' + df.index, format='%Y-%m-%d')
print (df)
min max
2000-01-01 0 28
2000-01-02 0 29
2000-01-03 3 21
2000-01-04 1 28
2000-01-05 0 26
... ...
2000-12-27 3 29
2000-12-28 4 27
2000-12-29 0 29
2000-12-30 1 29
2000-12-31 2 28
[366 rows x 2 columns]
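A compact runnable sketch of the grouping step (two years of random data are enough to show one row per calendar day, including 02-29 from the leap year):

```python
import numpy as np
import pandas as pd

np.random.seed(0)
d = pd.date_range("2000-01-01", "2001-12-31")
df = pd.DataFrame({"temp": np.random.randint(0, 30, size=len(d))}, index=d)

# one row per calendar day (month-day), min and max across all years
out = df.groupby(df.index.strftime("%m-%d"))["temp"].agg(["min", "max"])
```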

Segregating data based on last 3 months and this time last year

I need to filter my data into two different indexes:
(1) the last three months (December, as the current month, minus three)
(2) the current month (December 2019) plus the same month's values from the year before
pDate Name Date Year Month
11/17/2019 12:18 A 2019/11 2019 11
12/23/2018 11:52 B 2018/12 2018 12
12/1/2019 11:42 C 2019/12 2019 12
12/10/2018 14:31 D 2018/12 2018 12
12/14/2018 12:42 E 2018/12 2018 12
10/15/2019 15:19 F 2019/10 2019 10
10/23/2019 10:50 G 2019/10 2019 10
12/2/2018 15:14 H 2018/12 2018 12
I was able to select the last 3 months relatively quickly with:
df1 = df.sort_values(by="pDate",ascending=True).set_index("pDate").last("3M")
How do I get a dataframe which covers December 2019 (current month) and December 2018 only?
The idea is to create month periods with Series.dt.to_period; you can then subtract from the current period and filter with Series.between and boolean indexing:
# changed sample datetimes
df['pDate'] = pd.to_datetime(df['pDate'])
df = df.sort_values(by="pDate")
print (df)
pDate Name Date Year Month
7 2018-12-02 15:14:00 H 2018/12 2018 12
4 2018-12-14 12:42:00 E 2018/12 2018 12
3 2019-10-10 14:31:00 D 2018/12 2018 12
5 2019-10-15 15:19:00 F 2019/10 2019 10
6 2019-10-23 10:50:00 G 2019/10 2019 10
2 2019-11-01 11:42:00 C 2019/12 2019 12
1 2019-12-23 11:52:00 B 2018/12 2018 12
0 2020-01-17 12:18:00 A 2019/11 2019 11
nowp = pd.Timestamp.now().to_period('m')
print (nowp)
2020-01
df['per'] = df['pDate'].dt.to_period('m')
df = df[df['per'].between(nowp-4, nowp-1) | df['per'].eq(nowp-13)]
print (df)
pDate Name Date Year Month per
7 2018-12-02 15:14:00 H 2018/12 2018 12 2018-12
4 2018-12-14 12:42:00 E 2018/12 2018 12 2018-12
3 2019-10-10 14:31:00 D 2018/12 2018 12 2019-10
5 2019-10-15 15:19:00 F 2019/10 2019 10 2019-10
6 2019-10-23 10:50:00 G 2019/10 2019 10 2019-10
2 2019-11-01 11:42:00 C 2019/12 2019 12 2019-11
1 2019-12-23 11:52:00 B 2018/12 2018 12 2019-12
Detail:
print (nowp)
2020-01
print (nowp-1)
2019-12
print (nowp-13)
2018-12
print (nowp-4)
2019-09
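The period arithmetic can be sketched with a fixed "current" month so the result is reproducible (here `nowp - 3` is assumed for exactly three past months, where the answer above uses `nowp - 4`; sample dates condensed):

```python
import pandas as pd

df = pd.DataFrame({"pDate": pd.to_datetime(
    ["2018-12-02", "2019-10-15", "2019-11-01", "2019-12-23", "2020-01-17"])})

# a fixed "current" month instead of pd.Timestamp.now()
nowp = pd.Period("2020-01", freq="M")
per = df["pDate"].dt.to_period("M")

# last three full months, plus the current month one year earlier
out = df[per.between(nowp - 3, nowp - 1) | per.eq(nowp - 13)]
```

Subtracting an integer from a monthly Period steps back that many months, so `nowp - 13` lands on the same calendar month of the previous year.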

expand year values to month in pandas

I have sales by year:
pd.DataFrame({'year':[2015,2016,2017],'value':['12','24','36']})
year value
0 2015 12
1 2016 24
2 2017 36
I want to extrapolate to months:
yyyymm value
201501 1 (ie 12/12, etc)
201502 1
...
201512 1
201601 2
...
201712 3
any suggestions?
One idea is a cross join with a helper DataFrame, then convert the columns to strings and zero-pad with Series.str.zfill:
df1 = pd.DataFrame({'m': range(1, 13), 'a' : 1})
df = df.assign(a = 1).merge(df1).drop('a', axis=1)
df['year'] = df['year'].astype(str) + df.pop('m').astype(str).str.zfill(2)
df = df.rename(columns={'year':'yyyymm'})
Another solution is to create a MultiIndex and use DataFrame.reindex:
mux = pd.MultiIndex.from_product([df['year'], range(1, 13)], names=['yyyymm','m'])
df = df.set_index('year').reindex(mux, level=0).reset_index()
df['yyyymm'] = df['yyyymm'].astype(str) + df.pop('m').astype(str).str.zfill(2)
print (df.head(15))
yyyymm value
0 201501 12
1 201502 12
2 201503 12
3 201504 12
4 201505 12
5 201506 12
6 201507 12
7 201508 12
8 201509 12
9 201510 12
10 201511 12
11 201512 12
12 201601 24
13 201602 24
14 201603 24
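On pandas >= 1.2 the dummy-column trick can be replaced by `merge(..., how="cross")`; a minimal sketch (keeping the yearly value as in the answer above; divide by 12 if a monthly share is wanted):

```python
import pandas as pd

df = pd.DataFrame({"year": [2015, 2016], "value": [12, 24]})

# pair every year with months 1..12, then build the yyyymm string
months = pd.DataFrame({"m": range(1, 13)})
out = df.merge(months, how="cross")
out["yyyymm"] = out["year"].astype(str) + out["m"].astype(str).str.zfill(2)
out = out[["yyyymm", "value"]]
```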

Is there a way to group by month in Pandas starting at a specific day number?

I'm trying to group some data by month in Python, but I need each month to start on the 25th. Is there a way to do that in Pandas?
For weeks there is a way of starting on Monday, Tuesday, ...; but for months it's always the full month:
pd.Grouper(key='date', freq='M')
You could offset the dates by 24 days and groupby:
np.random.seed(1)
dates = pd.date_range('2019-01-01', '2019-04-30', freq='D')
df = pd.DataFrame({'date':dates,
'val': np.random.uniform(0,1,len(dates))})
# for groupby
s = df['date'].sub(pd.DateOffset(24))
(df.groupby([s.dt.year, s.dt.month], as_index=False)
.agg({'date':'min', 'val':'sum'})
)
gives
date val
0 2019-01-01 10.120368
1 2019-01-25 14.895363
2 2019-02-25 14.544506
3 2019-03-25 17.228734
4 2019-04-25 3.334160
Another example:
np.random.seed(1)
dates = pd.date_range('2019-01-20', '2019-01-30', freq='D')
df = pd.DataFrame({'date':dates,
'val': np.random.uniform(0,1,len(dates))})
s = df['date'].sub(pd.DateOffset(24))
df['groups'] = df.groupby([s.dt.year, s.dt.month]).cumcount()
gives
date val groups
0 2019-01-20 0.417022 0
1 2019-01-21 0.720324 1
2 2019-01-22 0.000114 2
3 2019-01-23 0.302333 3
4 2019-01-24 0.146756 4
5 2019-01-25 0.092339 0
6 2019-01-26 0.186260 1
7 2019-01-27 0.345561 2
8 2019-01-28 0.396767 3
9 2019-01-29 0.538817 4
10 2019-01-30 0.419195 5
And you can see how the cumcount restarts at day 25.
I prepared the following test DataFrame:
Dat Val
0 2017-03-24 0
1 2017-03-25 0
2 2017-03-26 1
3 2017-03-27 0
4 2017-04-24 0
5 2017-04-25 0
6 2017-05-24 0
7 2017-05-25 2
8 2017-05-26 0
The first step is to compute a "shifted date" column:
df['Dat2'] = df.Dat + pd.DateOffset(days=-24)
The result is:
Dat Val Dat2
0 2017-03-24 0 2017-02-28
1 2017-03-25 0 2017-03-01
2 2017-03-26 1 2017-03-02
3 2017-03-27 0 2017-03-03
4 2017-04-24 0 2017-03-31
5 2017-04-25 0 2017-04-01
6 2017-05-24 0 2017-04-30
7 2017-05-25 2 2017-05-01
8 2017-05-26 0 2017-05-02
As you can see, March dates in Dat2 correspond to original dates starting from 2017-03-25, and so on.
The value of 1 is in March (Dat2) and the value of 2 is in May (also Dat2).
Then, to compute e.g. a sum by month, we can run:
df.groupby(pd.Grouper(key='Dat2', freq='MS')).sum()
getting:
Val
Dat2
2017-02-01 0
2017-03-01 1
2017-04-01 0
2017-05-01 2
So the grouping is correct:
1 is in March,
2 is in May.
The advantage over the other answer is that all dates land on the first
day of a month, bearing in mind that e.g. 2017-03-01 in the
result means the period from 2017-03-25 to 2017-04-24 (inclusive).
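The shift-and-group approach condensed into a runnable sketch (sample rows reduced to the ones carrying non-zero values; column names as in the answer above):

```python
import pandas as pd

df = pd.DataFrame({
    "Dat": pd.to_datetime(["2017-03-24", "2017-03-26", "2017-04-25", "2017-05-25"]),
    "Val": [0, 1, 0, 2],
})

# shift dates back 24 days so the 25th becomes the 1st,
# then group by the start of the shifted month
shifted = df["Dat"] - pd.DateOffset(days=24)
out = df.groupby(shifted.dt.to_period("M").dt.to_timestamp())["Val"].sum()
```

Each label in the result is the month start of the shifted date, i.e. 2017-03-01 stands for the original window 2017-03-25 through 2017-04-24.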