Difference between datetime object returns only days - pandas

I'm trying to calculate the difference between two datetime objects but it only returns the difference between days and not between hours/minutes/seconds.
This is my code:
import pandas as pd
import datetime as dt
df = pd.read_csv(r'recorridos-realizados-2020.csv')
df.head(2)
Id_start start_date end_date Id_end ID_cyclist
75 2020-09-14 11:52:21 2020-09-14 11:58:10 186.0 155721
210 2020-09-14 11:51:41 2020-09-14 11:53:06 210.0 191320
df['start_date'] = pd.to_datetime(df['start_date'], format='%Y-%m-%d %H:%M:%S')
df['end_date'] = pd.to_datetime(df['end_date'], format='%Y-%m-%d %H:%M:%S')
df['timelapse'] = df['end_date'] - df['start_date']
df['timelapse'].head()
0 0 days
1 0 days
The result should be:
0 days, 00:05:49
0 days, 00:01:25
What am I doing wrong?

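Please look at pandas time deltas. Subtracting two pandas datetimes gives a Timedelta that keeps the full difference, and its attributes (for example .seconds) expose the part below one day.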
d1 = pd.to_datetime('2020-09-14 11:52:21')
d2 = pd.to_datetime('2020-09-14 11:58:10')
delta = (d2-d1)
print('seconds: ', delta.seconds)
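The same idea applies to a whole column through the .dt accessor. The following is a minimal sketch that rebuilds the two rows from the question by hand (the CSV itself is not available here, so the small frame is an assumption) and reads the elapsed time in seconds with .dt.total_seconds():

```python
import pandas as pd

# Two rows copied from the question's df.head(2) output
df = pd.DataFrame({
    "start_date": ["2020-09-14 11:52:21", "2020-09-14 11:51:41"],
    "end_date": ["2020-09-14 11:58:10", "2020-09-14 11:53:06"],
})

fmt = "%Y-%m-%d %H:%M:%S"
df["start_date"] = pd.to_datetime(df["start_date"], format=fmt)
df["end_date"] = pd.to_datetime(df["end_date"], format=fmt)

# Subtracting two datetime columns yields a timedelta column
df["timelapse"] = df["end_date"] - df["start_date"]

# Full elapsed time in seconds (349.0 and 85.0 for these two rows)
df["seconds"] = df["timelapse"].dt.total_seconds()

print(df[["timelapse", "seconds"]])
```

Unlike .seconds, which only covers the part of the difference below one day, total_seconds() also counts whole days, so it is the safer choice when trips can span midnight.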

Related

how to create monthly and season 24 hours average table using pandas

I have a dataframe with 2 columns: Date and LMP and there are totals of 8760 rows. This is the dummy dataframe:
import pandas as pd
import numpy as np
df = pd.DataFrame({'Date': pd.date_range('2023-01-01 00:00', '2023-12-31 23:00', freq='1H'), 'LMP': np.random.randint(10, 20, 8760)})
I extracted the month from the date and then created a season column from it, like this:
df['month'] = pd.DatetimeIndex(df['Date']).month
season = []
for i in df['month']:
    if i <= 2 or i == 12:
        season.append('Winter')
    elif 2 < i <= 5:
        season.append('Spring')
    elif 5 < i <= 8:
        season.append('Summer')
    else:
        season.append('Autumn')
df['Season'] = season
df2 = df.groupby(['month']).mean(numeric_only=True)
df3 = df.groupby(['Season']).mean(numeric_only=True)
print(df2['LMP'])
print(df3['LMP'])
Output:
month
1 20.655113
2 20.885532
3 19.416946
4 22.025248
5 26.040606
6 19.323863
7 51.117965
8 51.434093
9 21.404680
10 14.701989
11 20.009590
12 38.706160
Season
Autumn 18.661426
Spring 22.499365
Summer 40.856845
Winter 26.944382
But I want the output to be a 24-hour average (one value per hour of the day) for both the monthly and the seasonal grouping.
Desired output:
For the seasonal 24-hour average
For the monthly 24-hour average
Note: in the monthly 24-hour average, the columns are the months (1 to 12) and the rows are the hours (starting from 0).
Can anyone help?
Try:
df['hour'] = pd.DatetimeIndex(df['Date']).hour
dft = df[['Season', 'hour', 'LMP']]
dftg = dft.groupby(['hour', 'Season'])['LMP'].mean()
dftg.reset_index().pivot(index='hour', columns='Season', values='LMP')
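The monthly table asked for in the question (hours as rows, months as columns) can be built the same way. This is a sketch only: it regenerates dummy data with a seeded random generator, which is an assumption and not the asker's data, and it uses pivot_table to group and reshape in one step.

```python
import numpy as np
import pandas as pd

# Dummy hourly data for 2023, as in the question (values are random)
dates = pd.date_range("2023-01-01 00:00", "2023-12-31 23:00", freq="h")
rng = np.random.default_rng(0)
df = pd.DataFrame({"Date": dates, "LMP": rng.integers(10, 20, size=len(dates))})

df["month"] = df["Date"].dt.month
df["hour"] = df["Date"].dt.hour

# Rows are hours 0..23, columns are months 1..12, cells are the mean LMP
monthly_by_hour = df.pivot_table(
    index="hour", columns="month", values="LMP", aggfunc="mean"
)

print(monthly_by_hour.round(2))
```

Replacing columns="month" with columns="Season" gives the seasonal table from the same call, provided the Season column has been created first.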

window function for moving average

I am trying to replicate SQL's window function in pandas.
SELECT avg(totalprice) OVER (
PARTITION BY custkey
ORDER BY orderdate
RANGE BETWEEN interval '1' month PRECEDING AND CURRENT ROW)
FROM orders
I have this dataframe:
from io import StringIO
import pandas as pd
myst="""cust_1,2020-10-10,100
cust_2,2020-10-10,15
cust_1,2020-10-15,200
cust_1,2020-10-16,240
cust_2,2020-12-20,25
cust_1,2020-12-25,140
cust_2,2021-01-01,5
"""
u_cols = ['customer_id', 'date', 'price']
df = pd.read_csv(StringIO(myst), sep=',', names=u_cols)
df = df.sort_values(list(df.columns))
And after calculating moving average restricted to last 1 month, it will look like this...
from io import StringIO
import pandas as pd
myst="""cust_1,2020-10-10,100,100
cust_2,2020-10-10,15,15
cust_1,2020-10-15,200,150
cust_1,2020-10-16,240,180
cust_2,2020-12-20,25,25
cust_1,2020-12-25,140,140
cust_2,2021-01-01,5,15
"""
u_cols = ['customer_id', 'date', 'price', 'my_average']
my_df = pd.read_csv(StringIO(myst), sep=',', names=u_cols)
my_df = my_df.sort_values(list(my_df.columns))
As shown in this image:
https://trino.io/assets/blog/window-features/running-average-range.svg
I tried to write a function like this...
import numpy as np
def mylogic(myro):
    mylist = list()
    mydate = myro['date'][0]
    for i in range(len(myro)):
        if myro['date'][i] > mydate:
            mylist.append(myro['price'][i])
            mydate = myro['date'][i]
    return np.mean(mylist)
But that raised a KeyError.
You can use rolling with a 30-day time window inside each customer group. Note that '30D' is a fixed 30-day span, which only approximates the calendar month of the SQL query:
df['date'] = pd.to_datetime(df['date'])
df['my_average'] = (df.groupby('customer_id')
                      .apply(lambda d: d.rolling('30D', on='date')['price'].mean())
                      .reset_index(level=0, drop=True)
                      .astype(int)
                   )
output:
customer_id date price my_average
0 cust_1 2020-10-10 100 100
2 cust_1 2020-10-15 200 150
3 cust_1 2020-10-16 240 180
5 cust_1 2020-12-25 140 140
1 cust_2 2020-10-10 15 15
4 cust_2 2020-12-20 25 25
6 cust_2 2021-01-01 5 15
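One detail worth checking against the SQL: RANGE BETWEEN ... PRECEDING AND CURRENT ROW includes both ends of the interval, while a pandas time-based rolling window is closed on the right only by default. The sketch below passes closed='both' to include the left edge as well. It loops over the groups explicitly instead of using groupby.apply, which is a stylistic choice made here and not part of the answer above, and it still uses a fixed 30-day span as a stand-in for one month.

```python
from io import StringIO

import pandas as pd

raw = """cust_1,2020-10-10,100
cust_2,2020-10-10,15
cust_1,2020-10-15,200
cust_1,2020-10-16,240
cust_2,2020-12-20,25
cust_1,2020-12-25,140
cust_2,2021-01-01,5
"""

df = pd.read_csv(StringIO(raw), names=["customer_id", "date", "price"],
                 parse_dates=["date"])

# Time-based rolling needs the dates in increasing order within each group
df = df.sort_values(["customer_id", "date"])

# One rolling mean per customer; each piece keeps the original row index
pieces = [
    group.rolling("30D", on="date", closed="both")["price"].mean()
    for _, group in df.groupby("customer_id")
]

# Assignment aligns on the row index, so every value lands on its own row
df["my_average"] = pd.concat(pieces)

print(df)
```

For this sample the averages are the same as in the expected table (100, 150, 180, 140 for cust_1 and 15, 25, 15 for cust_2); the closed setting only matters when a row falls exactly 30 days before the current one.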

Pandas: drop out of sequence row

My Pandas df:
import pandas as pd
import io
data = """date value
"2015-09-01" 71.925000
"2015-09-06" 71.625000
"2015-09-11" 71.333333
"2015-09-12" 64.571429
"2015-09-21" 72.285714
"""
df = pd.read_table(io.StringIO(data), sep=r'\s+')
df.date = pd.to_datetime(df.date)
Given a user input date (01-09-2015), I would like to keep only those rows where the difference between the date and the input date is a multiple of 5 days.
Expected output:
input = 01-09-2015
df:
date value
0 2015-09-01 71.925000
1 2015-09-06 71.625000
2 2015-09-11 71.333333
3 2015-09-21 72.285714
My Approach so far:
I am taking the delta between input_date and date in pandas and saving this delta in separate column.
If delta%5 == 0, keep the row else drop. Is this the best that can be done?
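Use boolean indexing to filter by a mask: convert the input to a datetime, subtract it from the column, and turn the resulting timedeltas into whole days with Series.dt.days. Pass the format explicitly, because pd.to_datetime would otherwise read '01-09-2015' month-first, as 9 January 2015:
input1 = '01-09-2015'
df = df[df.date.sub(pd.to_datetime(input1, format='%d-%m-%Y')).dt.days % 5 == 0]
print(df)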
date value
0 2015-09-01 71.925000
1 2015-09-06 71.625000
2 2015-09-11 71.333333
4 2015-09-21 72.285714

pandas to_datetime does not accept '24' as time

The time is in YYYYMMDDHH format. It starts at 2010010101, increases by 1 hour up to 2010010124, and then continues with 2010010201.
date
0 2010010101
1 2010010124
2 2010010201
df['date'] = pd.to_datetime(df['date'], format ='%Y%m%d%H')
I am getting error:
'int' object is unsliceable
If I run:
df2['date'] = pd.to_datetime(df2['date'], format ='%Y%m%d%H', errors = 'coerce')
every row with hour '24' becomes NaT.
Hours run from 00 (midnight) to 23, so hour 24 in your data is really 00 of the next day. One way to handle this is to define a custom conversion function that treats hour 24 specially.
df = pd.DataFrame({'date':['2010010101', '2010010124', '2010010201']})
def custom_to_datetime(date):
    # If the time is 24, set it to 0 and increment day by 1
    if date[8:10] == '24':
        return pd.to_datetime(date[:-2], format = '%Y%m%d') + pd.Timedelta(days=1)
    else:
        return pd.to_datetime(date, format = '%Y%m%d%H')
df['date'] = df['date'].apply(custom_to_datetime)
date
0 2010-01-01 01:00:00
1 2010-01-02 00:00:00
2 2010-01-02 01:00:00
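A vectorized variant avoids the row-by-row apply: parse the YYYYMMDD part as a date and add the hour as a timedelta, so hour 24 rolls over to the next day on its own. This is a sketch under the assumption that the column holds integers (as the 'int' error in the question suggests), which is why it casts to str first.

```python
import pandas as pd

# Integer-typed column, as suggested by the error message in the question
df = pd.DataFrame({"date": [2010010101, 2010010124, 2010010201]})

text = df["date"].astype(str)

# First eight characters are the calendar date, the last two are the hour
day = pd.to_datetime(text.str[:8], format="%Y%m%d")
hour = pd.to_timedelta(text.str[8:10].astype(int), unit="h")

# Adding 24 hours to a date lands on midnight of the following day
df["date"] = day + hour

print(df)
```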

Pandas Dataframe merging columns

I have a pandas dataframe like the following
Year Month Day Securtiy Trade Value NewDate
2011 1 10 AAPL Buy 1500 0
My question is, how can I merge the columns Year, Month, Day into column NewDate
so that the newDate column looks like the following
2011-1-10
The best way is to parse it while reading the CSV:
In [1]: df = pd.read_csv('foo.csv', sep=r'\s+', parse_dates=[['Year', 'Month', 'Day']])
In [2]: df
Out[2]:
Year_Month_Day Securtiy Trade Value NewDate
0 2011-01-10 00:00:00 AAPL Buy 1500 0
If the file has no header row, define the column names while reading:
pd.read_csv(input_file, header=None, names=['Year', 'Month', 'Day', 'Security', 'Trade', 'Value'], parse_dates=[['Year', 'Month', 'Day']])
If it's already in your DataFrame, you could use an apply:
In [11]: df['Date'] = df.apply(lambda s: pd.Timestamp('%s-%s-%s' % (s['Year'], s['Month'], s['Day'])), 1)
In [12]: df
Out[12]:
Year Month Day Securtiy Trade Value NewDate Date
0 2011 1 10 AAPL Buy 1500 0 2011-01-10 00:00:00
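Alternatively, build the string by concatenating the columns (cast them to str first if they are integers):
df['Year'].astype(str) + '-' + df['Month'].astype(str) + '-' + df['Day'].astype(str)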
You can create a new Timestamp as follows:
df['newDate'] = df.apply(lambda x: pd.Timestamp('{0}-{1}-{2}'
                                                .format(x.Year, x.Month, x.Day)),
                         axis=1)
>>> df
Year Month Day Securtiy Trade Value NewDate newDate
0 2011 1 10 AAPL Buy 1500 0 2011-01-10
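When the columns are already in the DataFrame, pd.to_datetime can also assemble dates directly from a frame of year, month and day columns, which avoids the per-row apply. The sketch below rebuilds the single row from the question by hand and lowercases the three column names before the call; the rename is a precaution taken here, not something the answers above require.

```python
import pandas as pd

# The single row from the question ('Securtiy' is spelled as in the original)
df = pd.DataFrame({
    "Year": [2011], "Month": [1], "Day": [10],
    "Securtiy": ["AAPL"], "Trade": ["Buy"], "Value": [1500],
})

# to_datetime accepts a frame with year/month/day columns
parts = df[["Year", "Month", "Day"]].rename(columns=str.lower)
df["NewDate"] = pd.to_datetime(parts)

print(df)
```

The result is a proper datetime column, so the display format (for example 2011-01-10) can be chosen later with .dt.strftime if a string is needed.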