How to convert days to hours in pandas

Example:

coupon         expiration
Restaurant     1d
College        2d
Coffee House   2h

Expected output:

coupon         expiration
Restaurant     24h
College        48h
Coffee House   2h

You can use pd.to_timedelta, but the values in the expiration column must be valid timedelta strings:
import pandas as pd

df = pd.read_clipboard()  # Your df here
tds = pd.to_timedelta(df["expiration"])
# 0   1 days 00:00:00
# 1   2 days 00:00:00
# 2   0 days 02:00:00
# Name: expiration, dtype: timedelta64[ns]

I would recommend stopping here, but you can reformat this into a string of hours:

df["expiration"] = tds.dt.total_seconds().div(3600).apply("{:g}h".format)
#          coupon expiration
# 0    Restaurant        24h
# 1       College        48h
# 2  Coffee House         2h

You can use str.replace on the expiration column with a regex pattern that selects the entries that have a day (d) suffix. The repl parameter also accepts a function, which is where I chose to do the conversion to hours.
Code:
import pandas as pd
df = pd.DataFrame({"coupon":['Restaurant','College','Coffee House'], "expiration":['1d','2d','2h']})
def replacement(m):
    x = int(m.group(0).split('d')[0]) * 24
    return f"{x}h"
df.expiration = df.expiration.str.replace(pat=r'^\d+d$', repl=replacement, regex=True)
print(df)
Output:
         coupon expiration
0    Restaurant        24h
1       College        48h
2  Coffee House         2h
Regex Pattern:
r'^\d+d$'
^ : start of string
\d+ : one or more digits [0-9]
d : followed by the letter d
$ : end of string
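As a quick sanity check, the pattern above can be exercised on its own with the re module (the sample strings here are only illustrative):

```python
import re

pattern = re.compile(r'^\d+d$')

print(bool(pattern.match('12d')))  # True: digits followed by 'd'
print(bool(pattern.match('2h')))   # False: wrong suffix
print(bool(pattern.match('3d4')))  # False: '$' rejects trailing characters
```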
Note:
If you would rather use a one-liner with a lambda function instead:
df.expiration = df.expiration.str.replace(pat=r'^\d+d$', repl=lambda m: f"{int(m.group(0).split('d')[0]) * 24}h", regex=True)

A simple .apply can help here:
def convert(x):
    if 'd' in x:
        return f"{int(x.replace('d',''))*24}h"
    return x

df['expiration'] = df['expiration'].apply(convert)
df
Out[57]:
         coupon expiration
0    Restaurant        24h
1       College        48h
2  Coffee House         2h

Another possible solution, based on eval:
df['expiration'] = [str(eval(x)) + 'h' for x in
                    df['expiration'].str.replace('d', '*24').str.replace('h', '')]
Output:
         coupon expiration
0    Restaurant        24h
1       College        48h
2  Coffee House         2h
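Since eval executes arbitrary expressions, a safer sketch (assuming the column only ever contains 'd' or 'h' suffixes, which pd.to_timedelta understands natively) converts through timedeltas instead:

```python
import pandas as pd

df = pd.DataFrame({"coupon": ["Restaurant", "College", "Coffee House"],
                   "expiration": ["1d", "2d", "2h"]})

# pd.to_timedelta parses both "1d" and "2h" directly, so no eval is needed
hours = pd.to_timedelta(df["expiration"]).dt.total_seconds() / 3600
df["expiration"] = hours.astype(int).astype(str) + "h"
print(df)
```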

Related

Inconsistent output for pandas groupby-resample with missing values in first time bin

I am finding an inconsistent output with pandas groupby-resample behavior.
Take this dataframe, in which category A has samples on the first and second day and category B has a sample only on the second day:
df1 = pd.DataFrame(
    index=pd.DatetimeIndex(['2022-1-1 1:00', '2022-1-2 1:00', '2022-1-2 1:00']),
    data={'category': ['A', 'A', 'B']})
# Output:
#                     category
# 2022-01-01 01:00:00        A
# 2022-01-02 01:00:00        A
# 2022-01-02 01:00:00        B
When I groupby-resample I get a Series with multiindex on category and time:
res1 = df1.groupby('category').resample('1D').size()
# Output:
# category
# A         2022-01-01    1
#           2022-01-02    1
# B         2022-01-02    1
# dtype: int64
But if I add one more data point so that B has a sample on day 1, the return value is a dataframe with single-index in category and columns corresponding to the time bins:
df2 = pd.DataFrame(
    index=pd.DatetimeIndex(['2022-1-1 1:00', '2022-1-2 1:00',
                            '2022-1-2 1:00', '2022-1-1 1:00']),
    data={'category': ['A', 'A', 'B', 'B']})
res2 = df2.groupby('category').resample('1D').size()
# Output:
#           2022-01-01  2022-01-02
# category
# A                  1           1
# B                  1           1
Is this expected behavior? I reproduced this behavior in pandas 1.4.2 and was unable to find a bug report.
I submitted bug report 46826 to pandas.
The result should be a Series with a MultiIndex in both cases. There was a bug which caused df.groupby.resample.size to return a wide DataFrame for cases in which all groups had the same index. This has been fixed on the master branch. Thank you for opening the issue.
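If you need a shape-stable result on an affected pandas version, one workaround sketch is to group by the category and the day-floored timestamp, which always yields a MultiIndex Series (note this does not insert empty bins for missing days the way resample does):

```python
import pandas as pd

df2 = pd.DataFrame(
    index=pd.DatetimeIndex(['2022-1-1 1:00', '2022-1-2 1:00',
                            '2022-1-2 1:00', '2022-1-1 1:00']),
    data={'category': ['A', 'A', 'B', 'B']})

# Grouping by category and the floored index sidesteps resample entirely
res = df2.groupby(['category', df2.index.floor('D')]).size()
print(res)
```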

Facebook NeuralProphet - adding holidays

I have one common data set for my prediction that includes data across the globe.
ds                   y    country_id
01/01/2021 09:00:00  5.0  1
01/01/2021 09:10:00  5.2  1
01/01/2021 09:20:00  5.4  1
01/01/2021 09:30:00  6.1  1
01/01/2021 09:00:00  2.0  2
01/01/2021 09:10:00  2.2  2
01/01/2021 09:20:00  2.4  2
01/01/2021 09:30:00  3.1  2
playoffs = pd.DataFrame({
    'holiday': 'playoff',
    'ds': pd.to_datetime(['2008-01-13', '2009-01-03', '2010-01-16',
                          '2010-01-24', '2010-02-07', '2011-01-08',
                          '2013-01-12', '2014-01-12', '2014-01-19',
                          '2014-02-02', '2015-01-11', '2016-01-17',
                          '2016-01-24', '2016-02-07']),
    'lower_window': 0,
    'upper_window': 1,
})
superbowls = pd.DataFrame({
    'holiday': 'superbowl',
    'ds': pd.to_datetime(['2010-02-07', '2014-02-02', '2016-02-07']),
    'lower_window': 0,
    'upper_window': 1,
})
holidays = pd.concat((playoffs, superbowls))
Now, I would like to add holidays to the model.
m = NeuralProphet(holidays=holidays)
m.add_country_holidays(country_name='US')
m.fit(df)
How can I add multiple countries' holidays with add_country_holidays (m.add_country_holidays)?
How do I add country-specific holidays to the holidays data?
Do I need to build a separate model per country, or is one model for the entire dataset fine, to which I can then add the regressors? What is the recommended approach?
Here is a possible solution:
The program:
# NOTE 1: tested on google colab
# Un-comment the following (!pip) line if you need to install the libraries
# on a google colab notebook:
#!pip install neuralprophet pandas numpy holidays
import pandas as pd
import numpy as np
import holidays
from neuralprophet import NeuralProphet
import datetime

# NOTE 2: Most of the code comes from:
# https://neuralprophet.com/html/events_holidays_peyton_manning.html
# Context:
# We will use the time series of the log daily page views of the Wikipedia
# page for Peyton Manning (American former football quarterback) as an example.
# During playoffs and super bowls, Peyton Manning's wiki page is viewed more
# frequently. We would like to see if country-specific holidays also have an
# influence.

# First, we load the data:
data_location = "https://raw.githubusercontent.com/ourownstory/neuralprophet-data/main/datasets/"
df = pd.read_csv(data_location + "wp_log_peyton_manning.csv")

# To simulate your case, we add a country_id column filled with random values {1,2}.
# Let's assume US=1 and Canada=2.
np.random.seed(0)
df['country_id'] = np.random.randint(1, 2+1, df['ds'].count())
print("The dataframe we are working on:")
print(df.head())

# We would like to add holidays for the US and Canada to see if holidays have an
# influence on the number of daily views of Manning's wiki page.
# The data in df starts in 2007 and ends in 2016:
StartingYear = 2007
LastYear = 2016

# Holidays for the US and Canada:
US_holidays = holidays.US(years=[year for year in range(StartingYear, LastYear+1)])
CA_holidays = holidays.CA(years=[year for year in range(StartingYear, LastYear+1)])

holidays_US = pd.DataFrame()
holidays_US['ds'] = []
holidays_US['event'] = []
holidays_CA = pd.DataFrame()
holidays_CA['ds'] = []
holidays_CA['event'] = []

for i in df.index:
    # Convert the date string to datetime components:
    datetimeobj = [int(x) for x in df['ds'][i].split('-')]
    # Check if the corresponding day is a holiday in the US;
    # if yes, add it to holidays_US:
    if df['country_id'][i] == 1 and (datetime.datetime(*datetimeobj) in US_holidays):
        d = {'ds': [df['ds'][i]], 'event': ['holiday_US']}
        df1 = pd.DataFrame(data=d)
        holidays_US = pd.concat([holidays_US, df1], ignore_index=True)
    # Check if the corresponding day is a holiday in Canada;
    # if yes, add it to holidays_CA:
    if df['country_id'][i] == 2 and (datetime.datetime(*datetimeobj) in CA_holidays):
        d = {'ds': [df['ds'][i]], 'event': ['holiday_CA']}
        df1 = pd.DataFrame(data=d)
        holidays_CA = pd.concat([holidays_CA, df1], ignore_index=True)

# Now we can drop the country_id in df:
df.drop('country_id', axis=1, inplace=True)

print("Days in df that are holidays in the US:")
print(holidays_US.head())
print()
print("Days in df that are holidays in Canada:")
print(holidays_CA.head())

# User-specified events (historical events):
playoffs = pd.DataFrame({
    'event': 'playoff',
    'ds': pd.to_datetime([
        '2008-01-13', '2009-01-03', '2010-01-16',
        '2010-01-24', '2010-02-07', '2011-01-08',
        '2013-01-12', '2014-01-12', '2014-01-19',
        '2014-02-02', '2015-01-11', '2016-01-17',
        '2016-01-24', '2016-02-07',
    ]),
})
superbowls = pd.DataFrame({
    'event': 'superbowl',
    'ds': pd.to_datetime([
        '2010-02-07', '2012-02-05', '2014-02-02',
        '2016-02-07',
    ]),
})

# Create the events_df:
events_df = pd.concat((playoffs, superbowls, holidays_US, holidays_CA))

# Create the NeuralProphet object and fit:
m = NeuralProphet(loss_func="MSE")
m = m.add_events("playoff")
m = m.add_events("superbowl")
m = m.add_events("holiday_US")
m = m.add_events("holiday_CA")

# Create the data df with events:
history_df = m.create_df_with_events(df, events_df)
# Fit the model:
metrics = m.fit(history_df, freq="D")
# Forecast with events known ahead:
future = m.make_future_dataframe(df=history_df, events_df=events_df, periods=365, n_historic_predictions=len(df))
forecast = m.predict(df=future)
fig = m.plot(forecast)
fig_param = m.plot_parameters()
fig_comp = m.plot_components(forecast)
RESULT:
The results (see the PARAMETERS figure) seem to show that when a day is a holiday, there are fewer views in both the US and Canada. Does it make sense? Maybe... It looks plausible that people on holiday have more interesting things to do than browsing Manning's wiki page :-) I don't know.
PROGRAM'S OUTPUT:
The dataframe we are working on:
ds y country_id
0 2007-12-10 9.5908 1
1 2007-12-11 8.5196 2
2 2007-12-12 8.1837 2
3 2007-12-13 8.0725 1
4 2007-12-14 7.8936 2
Days in df that are holidays in the US:
ds event
0 2007-12-25 holiday_US
1 2008-01-21 holiday_US
2 2008-07-04 holiday_US
3 2008-11-27 holiday_US
4 2008-12-25 holiday_US
Days in df that are holidays in Canada:
ds event
0 2008-01-01 holiday_CA
1 2008-02-18 holiday_CA
2 2008-08-04 holiday_CA
3 2008-09-01 holiday_CA
4 2008-10-13 holiday_CA
INFO - (NP.utils.set_auto_seasonalities) - Disabling daily seasonality. Run NeuralProphet with daily_seasonality=True to override this.
INFO - (NP.config.set_auto_batch_epoch) - Auto-set batch_size to 32
INFO - (NP.config.set_auto_batch_epoch) - Auto-set epochs to 138
88% 241/273 [00:02<00:00, 121.69it/s]
INFO - (NP.utils_torch.lr_range_test) - lr-range-test results: steep: 3.36E-02, min: 1.51E+00
88% 241/273 [00:02<00:00, 123.87it/s]
INFO - (NP.utils_torch.lr_range_test) - lr-range-test results: steep: 3.36E-02, min: 1.63E+00
89% 242/273 [00:02<00:00, 121.58it/s]
INFO - (NP.utils_torch.lr_range_test) - lr-range-test results: steep: 3.62E-02, min: 2.58E+00
INFO - (NP.forecaster._init_train_loader) - lr-range-test selected learning rate: 3.44E-02
Epoch[138/138]: 100%|██████████| 138/138 [00:29<00:00, 4.74it/s, MSELoss=0.012, MAE=0.344, RMSE=0.478, RegLoss=0]
The figures (not reproduced here): FORECASTS, PARAMETERS, COMPONENTS.

Add hours to a timestamp that is formatted as a string

I have a dataframe (df) that has employee start and end times formatted as strings:
emp_id|Start|End
001|07:00:00|04:00:00
002|07:30:00|04:30:00
I want to add two hours to the Start and 2 hours to the End on a set of employees, not all employees. I do this by taking a slice of the main dataframe into a separate dataframe (df2). I then update the values and need to merge the updated values back into the main dataframe (df1) where I will coerce back to a string, as there is a method later in the code expecting these values to be strings.
I tried doing this:
df1['Start'] = pd.to_datetime(df1.Start)
df1['End'] = pd.to_datetime(df1.End)
df2 = df1.sample(frac=0.1, replace=False, random_state=1)  # takes a random 10% slice
df2['Start'] = df2['Start'] + timedelta(hours=2)
df2['End'] = df2['End'] + timedelta(hours=2)
df1.loc[df1.emp_id.isin(df2.emp_id), ['Start', 'End']] = df2[['Start', 'End']]
df1['Start'] = str(df1['Start'])
df1['End'] = str(df1['End'])
I'm getting a TypeError: addition/subtraction of integers and integer arrays with DateTimeArray is no longer supported. How do I do this in Python3?
You can use .applymap() on the Start and End columns of your selected subset. Hour addition can be done by string extraction and substitution.
Code
df1 = pd.DataFrame({
    "emp_id": ['001', '002'],
    "Start": ['07:00:00', '07:30:00'],
    "End": ['04:00:00', '04:30:00'],
})
# a subset of employee id
set_id = set(['002'])
# locate the subset
mask = df1["emp_id"].isin(set_id)
# apply hour addition
df1.loc[mask, ["Start", "End"]] = df1.loc[mask, ["Start", "End"]].applymap(lambda el: f"{int(el[:2])+2:02}{el[2:]}")
Result
print(df1)
  emp_id     Start       End
0    001  07:00:00  04:00:00
1    002  09:30:00  06:30:00   <- 2 hrs were added
Note: f-strings require python 3.6+. For earlier versions, replace the f-string with
"%02d%s" % (int(el[:2])+2, el[2:])
Note: mind corner cases (time later than 22:00) if they exist.
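If wrap-around past midnight matters, one sketch (assuming times should wrap within a 24-hour clock rather than roll over to the next day) converts through pd.to_timedelta and takes the sum modulo 24 hours:

```python
import pandas as pd

df1 = pd.DataFrame({
    "emp_id": ['001', '002'],
    "Start": ['07:00:00', '23:30:00'],
    "End": ['04:00:00', '22:00:00'],
})

mask = df1["emp_id"].isin({'002'})

def add_hours(col, hours=2):
    # Parse "HH:MM:SS" as timedeltas, add the offset, wrap at 24h,
    # then format back into "HH:MM:SS" strings.
    shifted = (pd.to_timedelta(col) + pd.Timedelta(hours=hours)) % pd.Timedelta(days=1)
    return shifted.dt.components.apply(
        lambda c: f"{c.hours:02}:{c.minutes:02}:{c.seconds:02}", axis=1)

df1.loc[mask, ["Start", "End"]] = df1.loc[mask, ["Start", "End"]].apply(add_hours)
print(df1)
```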

Finding the longest sequence of dates in a dataframe

I'd like to know how to find the longest unbroken sequence of dates (formatted as 2016-11-27) in a publish_date column (dates are not the index, though I suppose they could be).
There are a number of stack overflow questions which are similar, but AFAICT all proposed answers return the size of the longest sequence, which is not what I'm after.
I want to know e.g. that the stretch from 2017-01-01 to 2017-06-01 had no missing dates and was the longest such streak.
Here is an example of how you can do this:
import pandas as pd
import datetime
# initialize data
data = {'a': [1, 2, 3, 4, 5, 6, 7],
        'date': ['2017-01-01', '2017-01-03', '2017-01-05', '2017-01-06',
                 '2017-01-07', '2017-01-09', '2017-01-31']}
df = pd.DataFrame(data)
df['date'] = pd.to_datetime(df['date'])
# create a mask marking the start of each sequence: 1 for a new sequence,
# 0 where the date is exactly one day after the previous row's date
df['mask'] = 1
df.loc[df['date'] - datetime.timedelta(days=1) == df['date'].shift(), 'mask'] = 0
# convert mask to numbers - each sequence have its own number
df['mask'] = df['mask'].cumsum()
# find largest sequence number and get this sequence
res = df.loc[df['mask'] == df['mask'].value_counts().idxmax(), 'date']
# extract min and max dates if you need
min_date = res.min()
max_date = res.max()
# print result
print('min_date: {}'.format(min_date))
print('max_date: {}'.format(max_date))
print('result:')
print(res)
The result will be:
min_date: 2017-01-05 00:00:00
max_date: 2017-01-07 00:00:00
result:
2 2017-01-05
3 2017-01-06
4 2017-01-07
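The same idea can be condensed with a cumulative gap counter; this sketch (using the sample dates from above) returns the start, end, and length of the longest run in one shot:

```python
import pandas as pd

dates = pd.to_datetime(pd.Series(
    ['2017-01-01', '2017-01-03', '2017-01-05', '2017-01-06',
     '2017-01-07', '2017-01-09', '2017-01-31'])).sort_values()

# A new run starts whenever the gap to the previous date is not exactly one day
group_id = (dates.diff() != pd.Timedelta(days=1)).cumsum()

# Aggregate each run and keep the longest one
longest = dates.groupby(group_id).agg(['min', 'max', 'count']).nlargest(1, 'count')
print(longest)
```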

Create datetime from columns in a DataFrame

I got a DataFrame with these columns :
year month day gender births
I'd like to create a new column type "Date" based on the column year, month and day as : "yyyy-mm-dd"
I'm just beginning in Python and I just can't figure out how to proceed...
Assuming you are using pandas to create your dataframe, you can try:
>>> import pandas as pd
>>> df = pd.DataFrame({'year':[2015,2016],'month':[2,3],'day':[4,5],'gender':['m','f'],'births':[0,2]})
>>> df['dates'] = pd.to_datetime(df.iloc[:,0:3])
>>> df
   year  month  day gender  births      dates
0  2015      2    4      m       0 2015-02-04
1  2016      3    5      f       2 2016-03-05
Taken from the example here and the slicing (iloc use) "Selection" section of "10 minutes to pandas" here.
You can use .assign.
For example:
df2 = df.assign(ColumnDate = df.Column1.astype(str) + '-' + df.Column2.astype(str) + '-' + df.Column3.astype(str))
It is simple and much faster than a lambda if you have tonnes of data.
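If you want a real datetime column rather than a concatenated string, pd.to_datetime also accepts a DataFrame whose columns are named year, month, and day; selecting by name is more robust than the positional iloc[:, 0:3] slice:

```python
import pandas as pd

df = pd.DataFrame({'year': [2015, 2016], 'month': [2, 3], 'day': [4, 5],
                   'gender': ['m', 'f'], 'births': [0, 2]})

# pd.to_datetime assembles a datetime from the year/month/day columns
df['dates'] = pd.to_datetime(df[['year', 'month', 'day']])
print(df['dates'])
```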