Add random datetimes to timestamps - pandas

I have a column of timestamps that spans more than 24 hours. I want to convert these so that they differentiate between days, which I've done by converting to timedelta; the result is displayed below.
My question is: can these be converted or re-arranged again to produce full datetimes with an arbitrary (dummy) date, e.g. dd/mm/yyyy hh:mm:ss?
import pandas as pd

df = pd.DataFrame({
    'Time': ['8:00', '18:00', '28:00'],
})
df['Time'] = [x + ':00' for x in df['Time']]  # append seconds so the strings parse as HH:MM:SS
df['Time'] = pd.to_timedelta(df['Time'])
Out:
Time
0 0 days 08:00:00
1 0 days 18:00:00
2 1 days 04:00:00
Intended Output:
Time
0 1/01/1904 08:00:00 AM
1 1/01/1904 06:00:00 PM
2 2/01/1904 04:00:00 AM
The input timestamps will never span more than 2 days. Is there a package that can achieve this, or would dummy start and end dates be needed?

After you convert Time to a timedelta, just add the date part:
df.Time+pd.to_datetime('1904-01-01')
0 1904-01-01 08:00:00
1 1904-01-01 18:00:00
2 1904-01-02 04:00:00
Name: Time, dtype: datetime64[ns]
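If you also want the day-first, 12-hour AM/PM display from the intended output, one option (a minimal sketch, assuming the same 1904-01-01 anchor as above) is to format the resulting datetimes with dt.strftime:
s = df.Time + pd.to_datetime('1904-01-01')
# %d/%m/%Y gives day-first dates; %I:%M:%S %p gives 12-hour time with AM/PM
print(s.dt.strftime('%d/%m/%Y %I:%M:%S %p'))
0    01/01/1904 08:00:00 AM
1    01/01/1904 06:00:00 PM
2    02/01/1904 04:00:00 AM
Name: Time, dtype: object
Note that strftime returns plain strings, so keep the datetime64 column around if you still need date arithmetic.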


Summarize rows from a Pandas dataframe B that fall in certain time periods from another dataframe A

I am looking for an efficient way to summarize rows (in groupby-style) that fall in a certain time period, using Pandas in Python. Specifically:
The time period is given in dataframe A: there is a column for "start_timestamp" and a column for "end_timestamp", specifying the start and end time of the time period that is to be summarized. Hence, every row represents one time period that is meant to be summarized.
The rows to be summarized are given in dataframe B: there is a column for "timestamp" and a column "metric" with the values to be aggregated (with mean, max, min etc.). In reality, there might be more than just 1 "metric" column.
For every row's time period from dataframe A, I want to summarize the values of the "metric" column in dataframe B that fall in the given time period. Hence, the number of rows of the output dataframe will be exactly the same as the number of rows of dataframe A.
Any hints would be much appreciated.
Additional Requirements
The number of rows in dataframe A and dataframe B may be large (several thousand rows).
There may be many metrics to summarize in dataframe B (~100).
I want to avoid solving this problem with a for loop (as in the reproducible example below).
Reproducible Example
Input Dataframe A
import pandas as pd

# Input dataframe A
df_a = pd.DataFrame({
    "start_timestamp": ["2022-08-09 00:30", "2022-08-09 01:00", "2022-08-09 01:15"],
    "end_timestamp": ["2022-08-09 03:30", "2022-08-09 04:00", "2022-08-09 08:15"]
})
df_a.loc[:, "start_timestamp"] = pd.to_datetime(df_a["start_timestamp"])
df_a.loc[:, "end_timestamp"] = pd.to_datetime(df_a["end_timestamp"])
print(df_a)
      start_timestamp        end_timestamp
0 2022-08-09 00:30:00  2022-08-09 03:30:00
1 2022-08-09 01:00:00  2022-08-09 04:00:00
2 2022-08-09 01:15:00  2022-08-09 08:15:00
Input Dataframe B
# Input dataframe B
df_b = pd.DataFrame({
    "timestamp": [
        "2022-08-09 01:00",
        "2022-08-09 02:00",
        "2022-08-09 03:00",
        "2022-08-09 04:00",
        "2022-08-09 05:00",
        "2022-08-09 06:00",
        "2022-08-09 07:00",
        "2022-08-09 08:00",
    ],
    "metric": [1, 2, 3, 4, 5, 6, 7, 8],
})
df_b.loc[:, "timestamp"] = pd.to_datetime(df_b["timestamp"])
print(df_b)
            timestamp  metric
0 2022-08-09 01:00:00       1
1 2022-08-09 02:00:00       2
2 2022-08-09 03:00:00       3
3 2022-08-09 04:00:00       4
4 2022-08-09 05:00:00       5
5 2022-08-09 06:00:00       6
6 2022-08-09 07:00:00       7
7 2022-08-09 08:00:00       8
Expected Output Dataframe
# Expected output dataframe
df_target = df_a.copy()
for i, row in df_target.iterrows():
    condition = (df_b["timestamp"] >= row["start_timestamp"]) & (df_b["timestamp"] <= row["end_timestamp"])
    df_b_sub = df_b.loc[condition, :]
    df_target.loc[i, "metric_mean"] = df_b_sub["metric"].mean()
    df_target.loc[i, "metric_max"] = df_b_sub["metric"].max()
    df_target.loc[i, "metric_min"] = df_b_sub["metric"].min()
print(df_target)
      start_timestamp        end_timestamp  metric_mean  metric_max  metric_min
0 2022-08-09 00:30:00  2022-08-09 03:30:00          2.0         3.0         1.0
1 2022-08-09 01:00:00  2022-08-09 04:00:00          2.5         4.0         1.0
2 2022-08-09 01:15:00  2022-08-09 08:15:00          5.0         8.0         2.0
You can use pd.IntervalIndex and contains to create a dataframe with selected metric values and then compute the mean, max, min:
ai = pd.IntervalIndex.from_arrays(
    df_a["start_timestamp"], df_a["end_timestamp"], closed="both"
)

# For each row of df_b, ai.contains(...) flags the intervals containing the
# timestamp; multiplying by the metric keeps the value where it matched, 0 elsewhere
t = df_b.apply(
    lambda x: pd.Series(ai.contains(x["timestamp"]) * x["metric"]), axis=1
)

# Drop the zeros (non-matches) and aggregate each column (one column per interval)
df_a[["metric_mean", "metric_max", "metric_min"]] = t[t.ne(0)].agg(
    ["mean", "max", "min"]
).T.values
print(df_a)
start_timestamp end_timestamp metric_mean metric_max metric_min
0 2022-08-09 00:30:00 2022-08-09 03:30:00 2.0 3.0 1.0
1 2022-08-09 01:00:00 2022-08-09 04:00:00 2.5 4.0 1.0
2 2022-08-09 01:15:00 2022-08-09 08:15:00 5.0 8.0 2.0
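Note that the t.ne(0) filter treats a zero as "outside the interval", so the trick above assumes no metric value is exactly 0. A NaN-masking variant avoids that pitfall; a minimal sketch, assuming the df_a/df_b from the question with datetime64 columns:

import numpy as np

# Broadcast every timestamp of df_b against every [start, end] interval of df_a;
# in_range has shape (len(df_a), len(df_b))
ts = df_b["timestamp"].to_numpy()
in_range = (ts >= df_a["start_timestamp"].to_numpy()[:, None]) & (ts <= df_a["end_timestamp"].to_numpy()[:, None])

# Replace out-of-interval metrics with NaN, then aggregate along each row
vals = np.where(in_range, df_b["metric"].to_numpy(dtype=float), np.nan)
df_a["metric_mean"] = np.nanmean(vals, axis=1)
df_a["metric_max"] = np.nanmax(vals, axis=1)
df_a["metric_min"] = np.nanmin(vals, axis=1)

The mask is len(df_a) x len(df_b) in memory, which is still modest for several thousand rows on each side.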
Check the below code using sqlite3:
import sqlite3
conn = sqlite3.connect(':memory:')
df_a.to_sql('df_a',con=conn, index=False)
df_b.to_sql('df_b',con=conn, index=False)
pd.read_sql("""SELECT df_a.start_timestamp, df_a.end_timestamp
, AVG(df_b.metric) as metric_mean
, MAX(df_b.metric) as metric_max
, MIN(df_b.metric) as metric_min
FROM
df_a INNER JOIN df_b
ON df_b.timestamp BETWEEN df_a.start_timestamp AND df_a.end_timestamp
GROUP BY df_a.start_timestamp, df_a.end_timestamp""", con=conn)
Output (the same aggregates as the expected dataframe above):
      start_timestamp        end_timestamp  metric_mean  metric_max  metric_min
0 2022-08-09 00:30:00  2022-08-09 03:30:00          2.0           3           1
1 2022-08-09 01:00:00  2022-08-09 04:00:00          2.5           4           1
2 2022-08-09 01:15:00  2022-08-09 08:15:00          5.0           8           2

Pandas add row to datetime indexed dataframe

I cannot find a solution for this problem. I would like to add future dates to a datetime indexed Pandas dataframe for model prediction purposes.
Here is where I am right now:
new_datetime = df2.index[-1:] # current end of datetime index
increment = '1 days' # string for increment - eventually will be in a for loop to add add'l days
new_datetime = new_datetime+pd.Timedelta(increment)
And this is where I am stuck. The append examples online always seem to show ignore_index=True, and in my case I want to keep the proper datetime indexing.
Suppose you have this df:
date value
0 2020-01-31 00:00:00 1
1 2020-02-01 00:00:00 2
2 2020-02-02 00:00:00 3
then an alternative for adding future days is
df.append(pd.DataFrame({'date': pd.date_range(start=df.date.iloc[-1], periods=6, freq='D', closed='right')}))
which returns
date value
0 2020-01-31 00:00:00 1.0
1 2020-02-01 00:00:00 2.0
2 2020-02-02 00:00:00 3.0
0 2020-02-03 00:00:00 NaN
1 2020-02-04 00:00:00 NaN
2 2020-02-05 00:00:00 NaN
3 2020-02-06 00:00:00 NaN
4 2020-02-07 00:00:00 NaN
where the frequency is D (days) and periods=6; since closed='right' drops the start date itself, 5 future rows are appended.
I think I was making this more difficult than necessary because I was using a datetime index instead of the typical integer index. By leaving the 'date' field as a regular column instead of an index adding the rows is straightforward.
One thing I did do was add a reset_index call so I did not end up with wonky duplicate index values:
df = df.append(pd.DataFrame({'date': pd.date_range(start=df.date.iloc[-1], periods=21, freq='D', closed='right')}))
df = df.reset_index() # resets index
I also needed this, and I solved it by merging the code shared here with the code from this other response ("add to a dataframe as I go with datetime index"), ending up with the following code, which works for me.
data=raw.copy()
new_datetime = data.index[-1:] # current end of datetime index
increment = '1 days' # string for increment - eventually will be in a for loop to add add'l days
new_datetime = new_datetime+pd.Timedelta(increment)
today_df = pd.DataFrame({'value': 301.124},index=new_datetime)
data = data.append(today_df)
data.tail()
Here, 'value' is the column header of your own dataframe.
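Note that DataFrame.append was removed in pandas 2.0, and the closed= argument of date_range became inclusive=, so on current pandas the same idea needs pd.concat. A minimal sketch, assuming a plain 'date' column as in the first answer:

import pandas as pd

df = pd.DataFrame({'date': pd.to_datetime(['2020-01-31', '2020-02-01', '2020-02-02']),
                   'value': [1, 2, 3]})

# inclusive='right' drops the start date itself, so only future days are appended
future = pd.DataFrame({'date': pd.date_range(start=df['date'].iloc[-1],
                                             periods=6, freq='D', inclusive='right')})
df = pd.concat([df, future], ignore_index=True)  # ignore_index avoids duplicate labels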

pandas to_datetime convert 6PM to 18

Is there a nice way to convert Series data represented like 1PM or 11AM to 13 and 11 accordingly, with to_datetime or similar (other than re)?
data:
series
1PM
11AM
2PM
6PM
6AM
desired output:
series
13
11
14
18
6
pd.to_datetime(df['series']) gives the following error:
OutOfBoundsDatetime: Out of bounds nanosecond timestamp: 1-01-01 11:00:00
You can provide the format you want to use, here %I%p (12-hour clock plus AM/PM):
pd.to_datetime(df['series'], format='%I%p').dt.hour
The .dt.hour accessor will thus obtain the hour for that timestamp. This gives us:
>>> df = pd.DataFrame({'series': ['1PM', '11AM', '2PM', '6PM', '6AM']})
>>> pd.to_datetime(df['series'], format='%I%p').dt.hour
0 13
1 11
2 14
3 18
4 6
Name: series, dtype: int64
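If you'd rather skip datetime parsing altogether, the hour can also be computed arithmetically from the string; a sketch, assuming every value looks like '6AM'/'6PM' (the % 12 handles 12AM and 12PM correctly):

h = df['series'].str[:-2].astype(int) % 12   # '12' wraps to 0
df['series'] = h + df['series'].str[-2:].eq('PM') * 12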

pandas to_datetime does not accept '24' as time

The time is in the YYYYMMDDHH format. The first time is 2010010101; it increases by 1 hour, reaches 2010010124, and then continues with 2010010201.
date
0 2010010101
1 2010010124
2 2010010201
df['date'] = pd.to_datetime(df['date'], format ='%Y%m%d%H')
I am getting the error:
'int' object is unsliceable
If I run:
df2['date'] = pd.to_datetime(df2['date'], format ='%Y%m%d%H', errors = 'coerce')
all the '24' hours are labeled as NaT.
Hours run from 00 (midnight) to 23, so the time 24 in your date is 00 of the next day. One way is to define a custom to_datetime wrapper to handle this date format.
df = pd.DataFrame({'date': ['2010010101', '2010010124', '2010010201']})

def custom_to_datetime(date):
    # If the hour is 24, parse the date part only and increment the day by 1
    # (note: this expects strings; cast an integer column with .astype(str) first)
    if date[8:10] == '24':
        return pd.to_datetime(date[:-2], format='%Y%m%d') + pd.Timedelta(days=1)
    else:
        return pd.to_datetime(date, format='%Y%m%d%H')

df['date'] = df['date'].apply(custom_to_datetime)
date
0 2010-01-01 01:00:00
1 2010-01-02 00:00:00
2 2010-01-02 01:00:00
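A vectorized alternative (a sketch, assuming the 'date' column holds ten-character strings) is to parse only the date part and add the hour part as a timedelta, which makes hour 24 roll over to the next day automatically:

s = df['date'].astype(str)
df['date'] = (pd.to_datetime(s.str[:8], format='%Y%m%d')
              + pd.to_timedelta(s.str[8:].astype(int), unit='h'))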

Combine date column and time column into datetime

I have two columns (both text objects), one date, the other hour-ending.
df = pd.DataFrame({'Date': ['2018-10-01', '2018-10-01', '2018-10-01'],
                   'Hour_Ending': ['1.0', '2.0', '3.0']})
How do I add the two columns together to get a datetime object that looks like this?
2018-10-01 01:00
As a bonus, how do I change Hour_Ending to Hour_Starting?
Using to_datetime and to_timedelta:
pd.to_datetime(df.Date)+pd.to_timedelta(df.Hour_Ending.astype('float'), unit='h')
Out[122]:
0 2018-10-01 01:00:00
1 2018-10-01 02:00:00
2 2018-10-01 03:00:00
dtype: datetime64[ns]
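For the bonus question, one reading of "hour starting" is simply the hour-ending value shifted back by one hour; a sketch under that assumption, adding a hypothetical Hour_Starting column:

# Hour_Starting is assumed to be hour-ending minus one hour
df['Hour_Starting'] = (pd.to_datetime(df.Date)
                       + pd.to_timedelta(df.Hour_Ending.astype('float') - 1, unit='h'))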