Convert decimal Day-of-year dataframe to datetime with HH:MM - pandas

Is it possible to convert an entire column of decimal day-of-year (DOY) values into datetime format YYYY-mm-dd HH:MM? I tried counting the number of seconds and minutes in a day, but decimal DOY is different from decimal hours.
Example:
DOY = 181.82015046296297
Converted to:
Timestamp('2021-06-30 19:41:00')
Here the date would be a datetime object appearing only as 2021-06-30 19:41:00 in my dataframe. The year I am interested in is 2021.

Use Timedelta to create an offset from the first day of the year. Subtract 1 from the day number first, because day 1 is 1 January itself.
Input data:
>>> df
DayOfYear
0 254
1 156
2 303
3 32
4 100
5 8
6 329
7 82
8 218
9 293
df['Date'] = pd.to_datetime('2021') \
+ df['DayOfYear'].sub(1).apply(pd.Timedelta, unit='D')
Output result:
>>> df
DayOfYear Date
0 254 2021-09-11
1 156 2021-06-05
2 303 2021-10-30
3 32 2021-02-01
4 100 2021-04-10
5 8 2021-01-08
6 329 2021-11-25
7 82 2021-03-23
8 218 2021-08-06
9 293 2021-10-20
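The frame above holds whole days, while the question asks about decimal days. Here is a minimal sketch for that case, assuming the fractional part is the fraction of the day elapsed and that day 1.0 is midnight on 1 January. The column name `DOY` and the sample values are made up for illustration. `pd.to_timedelta` with `unit="D"` accepts floats, so the whole column converts in one vectorised call. With this convention, 181.82015046296297 falls on 30 June 2021 at about 19:41.

```python
import pandas as pd

# Hypothetical sample: decimal day-of-year values for 2021 (day 1.0 = 2021-01-01 00:00)
df = pd.DataFrame({"DOY": [1.5, 156.0, 181.82015046296297]})

start_of_year = pd.Timestamp("2021-01-01")

# Subtract 1 so that day 1 maps to 1 January, then add the offset in (fractional) days
df["Date"] = start_of_year + pd.to_timedelta(df["DOY"] - 1, unit="D")

# Floating-point days rarely land exactly on a minute, so round to the nearest minute
df["Date"] = df["Date"].dt.round("min")

print(df)
```

The column keeps the datetime64 dtype, so it displays as YYYY-mm-dd HH:MM:SS; use `df["Date"].dt.strftime("%Y-%m-%d %H:%M")` only if a string representation is needed.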

Related

Grouping data month-wise with Categorical data in pandas

How can I group data into months by date when a data frame has both categorical and numerical data in pandas? I tried the groupby function but I think it won't work with categorical data. There are multiple values in the categorical column. Sample data:
Date        Campaign_Name    No_of_Male_Viewers  No_of_Female_Viewers
2021-06-12  Dove_birds       1268                7656
2021-02-05  Pantene_winner   657                 8964
2021-09-15  Budweiser_wazap  7642                76
2021-05-13  Pantene_winner   425                 6578
2021-12-12  Budweiser_wazap  9867                111
2021-09-09  Dove_birds       1578                11456
2021-05-24  Pantene_winner   678                 7475
2021-09-27  Budweiser_wazap  8742                96
2021-09-09  Dove_soft        1175                15486
Now I need to group the data month-wise and show, for example, that Budweiser_wazap gained a total audience of xxxx in September and xxxx in December, and so on for the other campaigns as well.
Expected output sample:
Month      Campaign_Name    No_of_Male_Viewers  No_of_Female_Viewers
February   Pantene_winner   657                 8964
September  Budweiser_wazap  16384               172
Since Budweiser_wazap campaign ran twice in September, the resulting output for No_of_Male_Viewers is: 7642 + 8742 = 16384, and for No_of_Female_Viewers is: 76 + 96 = 172.
Use:
# Get the month name for each date (the Date column must already be datetime)
df['Month'] = df['Date'].dt.month_name()
# Group by `Month` and `Campaign_Name`, then sum the viewer columns
df = df.groupby(['Month', 'Campaign_Name'])[['No_of_Male_Viewers', 'No_of_Female_Viewers']].sum().reset_index()
df
Sample Reproducible code-
import pandas as pd
import numpy as np
from pandas import DataFrame
df = pd.DataFrame({
'Date' : ['2015-06-08', '2015-08-05', '2015-05-06', '2015-05-05', '2015-07-08', '2015-05-07', '2015-06-05', '2015-07-05'],
'Sym' : ['aapl', 'aapl', 'aapl', 'aapl', 'aaww', 'aaww', 'aaww', 'aaww'],
'Data2': [11, 8, 10, 15, 110, 60, 100, 40],
'Data3': [5, 8, 6, 1, 50, 100, 60, 120]
})
df['Date'] = pd.to_datetime(df['Date'])
df['Month'] = df['Date'].dt.month_name()
df
df output-
Date Sym Data2 Data3 Month
0 2015-06-08 aapl 11 5 June
1 2015-08-05 aapl 8 8 August
2 2015-05-06 aapl 10 6 May
3 2015-05-05 aapl 15 1 May
4 2015-07-08 aaww 110 50 July
5 2015-05-07 aaww 60 100 May
6 2015-06-05 aaww 100 60 June
7 2015-07-05 aaww 40 120 July
Groupby Condition-
df.groupby(['Month', 'Sym'])[['Data2', 'Data3']].sum().reset_index()
Output-
Month Sym Data2 Data3
0 August aapl 8 8
1 July aaww 150 170
2 June aapl 11 5
3 June aaww 100 60
4 May aapl 25 7
5 May aaww 60 100
Ref link- Pandas - dataframe groupby - how to get sum of multiple columns
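For reference, the same approach applied to the sample data from the question, as a sketch. One assumption goes beyond the answer: the month number is added as a temporary grouping key (the name `Month_No` is made up here) so that the months come out in calendar order rather than alphabetically.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "Date": ["2021-06-12", "2021-02-05", "2021-09-15", "2021-05-13", "2021-12-12",
             "2021-09-09", "2021-05-24", "2021-09-27", "2021-09-09"],
    "Campaign_Name": ["Dove_birds", "Pantene_winner", "Budweiser_wazap", "Pantene_winner",
                      "Budweiser_wazap", "Dove_birds", "Pantene_winner", "Budweiser_wazap",
                      "Dove_soft"],
    "No_of_Male_Viewers": [1268, 657, 7642, 425, 9867, 1578, 678, 8742, 1175],
    "No_of_Female_Viewers": [7656, 8964, 76, 6578, 111, 11456, 7475, 96, 15486],
})
df["Date"] = pd.to_datetime(df["Date"])

viewer_cols = ["No_of_Male_Viewers", "No_of_Female_Viewers"]

monthly = (
    # Month_No is a hypothetical helper column, used only to sort months chronologically
    df.assign(Month_No=df["Date"].dt.month, Month=df["Date"].dt.month_name())
      .groupby(["Month_No", "Month", "Campaign_Name"], as_index=False)[viewer_cols]
      .sum()
      .drop(columns="Month_No")
)
print(monthly)
```

A string column such as Campaign_Name works as a grouping key like any other; only the columns being summed need to be numeric.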
If you use strftime('%B'), which extracts the full month name, you can reach the same result with one line of code :)
# copy the dataframe from Stack Overflow and convert the column to datetime
df = pd.read_clipboard()
df['Date'] = pd.to_datetime(df['Date'])
# '%B' returns the full month name; '%b' returns the 3-letter abbreviation, like Dec or Sep
df.groupby([df['Date'].dt.strftime('%B'), 'Campaign_Name'])[['No_of_Male_Viewers', 'No_of_Female_Viewers']].sum()

From 10 years of data, I want to select only calendar days with max or min value

Ok, so I have a dataset of temperatures for each day of the year, over a period of ten years. Index is date converted to datetime.
I want to get a dataset with only the min and max value for each calendar day throughout the 10-year period.
I can convert the index to a string, remove the year and get the dataset that way, but I'm guessing there is a smarter way to do it.
Use DatetimeIndex.strftime to build a month-day key, then aggregate with GroupBy.agg using min and max:
import numpy as np
import pandas as pd
np.random.seed(2020)
d = pd.date_range('2000-01-01', '2010-12-31')
df = pd.DataFrame({"temp": np.random.randint(0, 30, size=len(d))}, index=d)
print(df)
temp
2000-01-01 0
2000-01-02 8
2000-01-03 3
2000-01-04 22
2000-01-05 3
...
2010-12-27 16
2010-12-28 10
2010-12-29 28
2010-12-30 1
2010-12-31 28
[4018 rows x 1 columns]
df = df.groupby(df.index.strftime('%m-%d'))['temp'].agg(['min','max'])
print (df)
min max
01-01 0 28
01-02 0 29
01-03 3 21
01-04 1 28
01-05 0 26
... ...
12-27 3 29
12-28 4 27
12-29 0 29
12-30 1 29
12-31 2 28
[366 rows x 2 columns]
Finally, if you need datetimes, you can add a year back to the index (be careful with leap years: pick a leap year such as 2000 so that 02-29 can be parsed):
df.index = pd.to_datetime('2000-' + df.index, format='%Y-%m-%d')
print (df)
min max
2000-01-01 0 28
2000-01-02 0 29
2000-01-03 3 21
2000-01-04 1 28
2000-01-05 0 26
... ...
2000-12-27 3 29
2000-12-28 4 27
2000-12-29 0 29
2000-12-30 1 29
2000-12-31 2 28
[366 rows x 2 columns]
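A string-free variant, sketched under the same setup: group directly on the index's `month` and `day` attributes. The variable names and the random data are illustrative only. The result has a (month, day) MultiIndex, which sorts numerically and needs no placeholder year.

```python
import numpy as np
import pandas as pd

# Illustrative data: one random temperature per day, 2000-2010
rng = np.random.default_rng(0)
days = pd.date_range("2000-01-01", "2010-12-31")
temps = pd.DataFrame({"temp": rng.integers(0, 30, size=len(days))}, index=days)

# Group on (month, day) so every calendar day collects its values from all years
extremes = temps.groupby([temps.index.month, temps.index.day])["temp"].agg(["min", "max"])
extremes.index.names = ["month", "day"]

print(extremes.loc[(2, 29)])  # 29 February only has data from the leap years
```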

Handle Perpetual Maturity Bonds with Maturity date of 31-12-9999 12:00:00 AM

I have a number of records in a dataframe where the maturity date
column is 31-12-9999 12:00:00 AM as the bonds never mature. This
naturally raises the error:
Out of bounds nanosecond timestamp: 9999-12-31 00:00:00
I see the max date is:
pd.Timestamp.max
Timestamp('2262-04-11 23:47:16.854775807')
I just wanted to clarify what the best approach is to clean all date columns in the dataframe and fix my bug. My code, modelled on the docs:
df_Fix_Date = df_Date['maturity_date'].head(8)
display(df_Fix_Date)
display(df_Fix_Date.dtypes)
0 2020-08-15 00:00:00.000
1 2022-11-06 00:00:00.000
2 2019-03-15 00:00:00.000
3 2025-01-15 00:00:00.000
4 2035-05-29 00:00:00.000
5 2027-06-01 00:00:00.000
6 2021-04-01 00:00:00.000
7 2022-04-03 00:00:00.000
Name: maturity_date, dtype: object
def conv(x):
    return pd.Period(day = x%100, month = x//100 % 100, year = x // 10000, freq='D')
df_Fix_Date['maturity_date'] = pd.to_datetime(df_Fix_Date['maturity_date']) # convert to datetype
df_Fix_Date['maturity_date'] = pd.PeriodIndex(df_Fix_Date['maturity_date'].apply(conv)) # fix error
display(df_Fix_Date)
Output:
KeyError: 'maturity_date'
The problem is that you cannot convert out-of-bounds dates to datetimes.
One solution is to replace 9999 with 2261:
df_Fix_Date['maturity_date'] = df_Fix_Date['maturity_date'].replace('^9999','2261',regex=True)
df_Fix_Date['maturity_date'] = pd.to_datetime(df_Fix_Date['maturity_date'])
print (df_Fix_Date)
maturity_date
0 2020-08-15
1 2022-11-06
2 2019-03-15
3 2025-01-15
4 2035-05-29
5 2027-06-01
6 2021-04-01
7 2261-04-03
Another solution is to replace the year with 2261 in every date whose year is later than 2261:
m = df_Fix_Date['maturity_date'].str[:4].astype(int) > 2261
df_Fix_Date['maturity_date'] = df_Fix_Date['maturity_date'].mask(m, '2261' + df_Fix_Date['maturity_date'].str[4:])
df_Fix_Date['maturity_date'] = pd.to_datetime(df_Fix_Date['maturity_date'])
print (df_Fix_Date)
maturity_date
0 2020-08-15
1 2022-11-06
2 2019-03-15
3 2025-01-15
4 2035-05-29
5 2027-06-01
6 2021-04-01
7 2261-04-03
Or convert problematic dates to NaT with the parameter errors='coerce':
df_Fix_Date['maturity_date'] = pd.to_datetime(df_Fix_Date['maturity_date'], errors='coerce')
print (df_Fix_Date)
maturity_date
0 2020-08-15
1 2022-11-06
2 2019-03-15
3 2025-01-15
4 2035-05-29
5 2027-06-01
6 2021-04-01
7 NaT
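With NaT alone you can no longer tell a perpetual bond from a malformed date. A minimal sketch of one way to keep that information, assuming the raw strings start with the year as in the sample above: record a boolean flag before converting. The column name `is_perpetual` and the three sample rows are made up for illustration.

```python
import pandas as pd

# Hypothetical sample in the same string format as the question's column
bonds = pd.DataFrame({
    "maturity_date": [
        "2020-08-15 00:00:00.000",
        "2022-11-06 00:00:00.000",
        "9999-12-31 00:00:00.000",  # perpetual bond marker
    ]
})

# Remember which rows carry the perpetual marker before the value is lost
bonds["is_perpetual"] = bonds["maturity_date"].str.startswith("9999")

# Blank out the marker rows, then parse; errors="coerce" turns any other
# unparseable value into NaT as well
bonds["maturity_date"] = pd.to_datetime(
    bonds["maturity_date"].mask(bonds["is_perpetual"]), errors="coerce"
)

print(bonds)
```

Rows with NaT and `is_perpetual == True` are then the bonds that never mature, and no artificial 2261 date enters later calculations.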

Insert 0 in pandas series for timeseries gaps

In order to plot the data properly, I need the missing values to be shown as 0. I do not want a 0 value for every missing day, as that bloats the storage. How do I insert a 0 value, for each type, on the first and last day of each gap? I do not need 0 inserted before and after the whole sequence. Bonus: what if the time series is monthly or weekly data (date set to the first of the month, or to every Monday)?
For example, this time series contains one gap between the 3rd and the 9th of January for type A. I need to insert a 0 value on the 4th and the 8th of January.
df = DataFrame({"date":[datetime(2015,1,1) + timedelta(days=x) for x in list(range(0, 3))+list(range(8, 13))+list(range(2, 9))], "type": ['A']*8+['B']*7, "value": np.random.randint(10, 100, size=15)})
date type value
0 2015-01-01 A 97
1 2015-01-02 A 11
2 2015-01-03 A 89 <-- last date before the gap
3 2015-01-09 A 31 <-- first day after the gap
4 2015-01-10 A 64
5 2015-01-11 A 82
6 2015-01-12 A 75
7 2015-01-13 A 24
8 2015-01-03 B 72
9 2015-01-04 B 46
10 2015-01-05 B 26
11 2015-01-06 B 91
12 2015-01-07 B 36
13 2015-01-08 B 53
14 2015-01-09 B 85
Desired result (the row indexes would be different):
date type value
0 2015-01-01 A 97
1 2015-01-02 A 11
2 2015-01-03 A 89
. 2015-01-04 A 0 <-- gap starts - new value
<-- do NOT insert any more values for 05--07
. 2015-01-08 A 0 <-- gap ends - new value
3 2015-01-09 A 31
4 2015-01-10 A 64
5 2015-01-11 A 82
6 2015-01-12 A 75
7 2015-01-13 A 24
8 2015-01-03 B 72
9 2015-01-04 B 46
10 2015-01-05 B 26
11 2015-01-06 B 91
12 2015-01-07 B 36
13 2015-01-08 B 53
14 2015-01-09 B 85
Maybe an inelegant solution, but it seems to be easiest to split the dataframe up, fill in the missing dates, and recombine, like so:
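# with pandas imported as pd
dfA = df[df.type == 'A'].set_index('date')
# every calendar day between the first and the last type-A date
new_axis = pd.date_range(dfA.index.min(), dfA.index.max())
missing_dates = new_axis.difference(dfA.index)
# add a zero row on the first and the last missing day (this assumes a single gap)
dfA.loc[missing_dates.min()] = 'A', 0
dfA.loc[missing_dates.max()] = 'A', 0
# recombine with type B, keeping the type-A rows in date order
df = pd.concat([df[df.type == 'B'].set_index('date'), dfA.sort_index()])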
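The split-and-recombine approach handles one gap in one type. A possible generalisation, sketched here with a hypothetical helper `pad_gaps_with_zeros`: within each type, compare consecutive dates, and wherever the difference exceeds the expected step, add a zero row one step after the earlier date and one step before the later date. The fixed values in the example are taken from the question's printed frame.

```python
from datetime import datetime, timedelta

import pandas as pd


def pad_gaps_with_zeros(df, step=pd.Timedelta(days=1)):
    """Add a zero row on the first and last missing date of every gap, per type."""
    pieces = [df]
    for type_name, group in df.sort_values("date").groupby("type"):
        dates = group["date"]
        gap_before = dates.diff() > step   # this row is preceded by a gap
        gap_after = gap_before.shift(-1, fill_value=False)  # this row is followed by one
        # A one-day gap yields the same date twice, hence drop_duplicates
        edges = pd.concat([dates[gap_after] + step, dates[gap_before] - step]).drop_duplicates()
        if not edges.empty:
            pieces.append(pd.DataFrame({"date": edges, "type": type_name, "value": 0}))
    return pd.concat(pieces).sort_values(["type", "date"]).reset_index(drop=True)


offsets = list(range(0, 3)) + list(range(8, 13)) + list(range(2, 9))
example = pd.DataFrame({
    "date": [datetime(2015, 1, 1) + timedelta(days=x) for x in offsets],
    "type": ["A"] * 8 + ["B"] * 7,
    "value": [97, 11, 89, 31, 64, 82, 75, 24, 72, 46, 26, 91, 36, 53, 85],
})

padded = pad_gaps_with_zeros(example)
print(padded[padded["type"] == "A"])
```

For weekly data, passing `step=pd.Timedelta(weeks=1)` should work the same way. Monthly data has no fixed step length, so it would need a different comparison (for example on monthly periods), which this sketch does not cover.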

convert hourly time period in 15-minute time period

I have a dataframe like this:
df = pd.read_csv("fileA.csv", dtype=str, delimiter=";", skiprows = None, parse_dates=['Date'])
Date Buy Sell
0 01.08.2009 01:00 15 25
1 01.08.2009 02:00 0 30
2 01.08.2009 03:00 10 18
But I need this one (in 15-minute periods):
Date Buy Sell
0 01.08.2009 01:00 15 25
1 01.08.2009 01:15 15 25
2 01.08.2009 01:30 15 25
3 01.08.2009 01:45 15 25
4 01.08.2009 02:00 0 30
5 01.08.2009 02:15 0 30
6 01.08.2009 02:30 0 30
7 01.08.2009 02:45 0 30
8 01.08.2009 03:00 10 18
....and so on.
I have tried df.resample(), but it did not work. Does someone know a nice pandas method?
If fileA.csv looks like this:
Date;Buy;Sell
01.08.2009 01:00;15;25
01.08.2009 02:00;0;30
01.08.2009 03:00;10;18
then you could parse the data with
df = pd.read_csv("fileA.csv", delimiter=";", parse_dates=['Date'])
so that df will look like this:
In [41]: df
Out[41]:
Date Buy Sell
0 2009-01-08 01:00:00 15 25
1 2009-01-08 02:00:00 0 30
2 2009-01-08 03:00:00 10 18
You might want to check df.info() to make sure you successfully parsed your data into a DataFrame with three columns, and that the Date column has dtype datetime64[ns]. Since the repr(df) you posted prints the date in a different format and the column headers do not align with the data, there is a good chance that the data has not yet been parsed properly. If that's true and you post some sample lines from the csv, we should be able to help you parse the data into a DataFrame.
In [51]: df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 3 entries, 0 to 2
Data columns (total 3 columns):
Date 3 non-null datetime64[ns]
Buy 3 non-null int64
Sell 3 non-null int64
dtypes: datetime64[ns](1), int64(2)
memory usage: 96.0 bytes
Once you have the DataFrame correctly parsed, resampling to 15 minute time periods can be done with asfreq with forward-filling the missing values:
In [50]: df.set_index('Date').asfreq('15min', method='ffill')
Out[50]:
Buy Sell
2009-01-08 01:00:00 15 25
2009-01-08 01:15:00 15 25
2009-01-08 01:30:00 15 25
2009-01-08 01:45:00 15 25
2009-01-08 02:00:00 0 30
2009-01-08 02:15:00 0 30
2009-01-08 02:30:00 0 30
2009-01-08 02:45:00 0 30
2009-01-08 03:00:00 10 18
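A self-contained sketch of the same steps, with one assumption that differs from the session above: the dates are read as day-first (01.08.2009 = 1 August 2009), which is made explicit with a `format` string instead of relying on the parser's default. The CSV text is inlined through `io.StringIO` so the example runs without a file.

```python
import io

import pandas as pd

# Inline stand-in for fileA.csv
csv_text = """Date;Buy;Sell
01.08.2009 01:00;15;25
01.08.2009 02:00;0;30
01.08.2009 03:00;10;18
"""

df = pd.read_csv(io.StringIO(csv_text), sep=";")

# Assumption: day.month.year, so parse with an explicit format
df["Date"] = pd.to_datetime(df["Date"], format="%d.%m.%Y %H:%M")

# Upsample to 15-minute steps, repeating each hourly row forward
quarter_hourly = df.set_index("Date").asfreq("15min", method="ffill")

print(quarter_hourly)
```

Note that `asfreq` stops at the last timestamp in the index (03:00 here), so the final hour is not expanded to 03:15 through 03:45.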