Difference between two date columns in pandas

I am trying to get the difference between two date columns. The script and the data used in it are below, but I am getting the same result for all three rows.
df = pd.read_csv(r'Book1.csv',encoding='cp1252')
df
Out[36]:
Start End DifferenceinDays DifferenceinHrs
0 10/26/2013 12:43 12/15/2014 0:04 409 9816
1 2/3/2014 12:43 3/25/2015 0:04 412 9888
2 5/14/2014 12:43 7/3/2015 0:04 409 9816
I am expecting results as in the DifferenceinDays column, which was calculated in Excel, but in Python I am getting the same value for every row. Please refer to the code below. Can anyone tell me how to calculate the difference between two date columns? Ultimately I am trying to get the number of hours between them.
df["Start"] = pd.to_datetime(df['Start'])
df["End"] = pd.to_datetime(df['End'])
df['hrs'] = df.End - df.Start
df['hrs']
Out[38]:
0 414 days 11:21:00
1 414 days 11:21:00
2 414 days 11:21:00
Name: hrs, dtype: timedelta64[ns]

IIUC, divide the timedelta by np.timedelta64(1, 'h') to express it in hours.
Additionally, it looks like Excel calculates the hours differently, unsure why. In fact, all three Start/End pairs differ by exactly 414 days 11:21, so pandas returning identical values is correct.
import numpy as np
df['hrs'] = (df['End'] - df['Start']) / np.timedelta64(1,'h')
print(df)
Start End DifferenceinHrs hrs
0 2013-10-26 12:43:00 2014-12-15 00:04:00 9816 9947.35
1 2014-02-03 12:43:00 2015-03-25 00:04:00 9888 9947.35
2 2014-05-14 12:43:00 2015-07-03 00:04:00 9816 9947.35
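A pure-pandas alternative, assuming Start and End have already been converted with pd.to_datetime as above, is to take total seconds and divide by 3600:

# Equivalent without numpy: express the timedelta directly in hours.
df['hrs'] = (df['End'] - df['Start']).dt.total_seconds() / 3600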

Related

Date dependent calculation from 2 dataframes - average 6-month return

I am working with the following dataframe. I have data for multiple companies, each row associated with a specific datadate, so there are many rows per company, with IPO dates ranging from 2009 to 2022.
index ID price daily_return datadate daily_market_return mean_daily_market_return ipodate
0 1 27.50 0.008 01-09-2010 0.0023 0.03345 01-12-2009
1 2 33.75 0.0745 05-02-2017 0.00458 0.0895 06-12-2012
2 3 29.20 0.00006 08-06-2020 0.0582 0.0045 01-05-2013
3 4 20.54 0.00486 09-06-2018 0.0009 0.0006 27-11-2013
4 1 21.50 0.009 02-09-2021 0.0846 0.04345 04-05-2009
5 4 22.75 0.00539 06-12-2019 0.0003 0.0006 21-09-2012
...
26074 rows
I also have a dataframe containing the market yield on US Treasury securities at 10-year constant maturity, measured daily. Each row represents the yield on a specific day, covering every day from 2009 to 2022.
date dgs10
1 2009-01-02 2.46
2 2009-01-05 2.49
3 2009-01-06 2.51
4 2009-01-07 2.52
5 2009-01-08 2.47
6 2009-01-09 2.43
7 2009-01-12 2.34
8 2009-01-13 2.33
...
date dgs10
3570 2022-09-08 3.29
3571 2022-09-09 3.33
3572 2022-09-12 3.37
3573 2022-09-13 3.42
3574 2022-09-14 3.41
My goal is to calculate, for each ipodate (from dataframe 1), the average of the previous 6 months of the market yield on US Treasury securities at 10-year constant maturity (from dataframe 2). The result should go either in a new dataframe or in an additional column in dataframe 1. The two dataframes have different lengths. I tried using rolling(), but it doesn't seem to be working. Does anyone know how to fix this?
import numpy as np
import pandas as pd

# Make sure that all date columns are of type Timestamp. They are a lot
# easier to work with.
df1["ipodate"] = pd.to_datetime(df1["ipodate"], dayfirst=True)
df2["date"] = pd.to_datetime(df2["date"])
# Calculate the mean market yield of the previous 6 months. Six month is not a
# fixed length of time so I replaced it with 180 days.
tmp = df2.rolling("180D", on="date").mean()
# The values of the first 180 days are invalid, because we have insufficient
# data to calculate the rolling mean. You may consider extending df2 further
# back to 2008. (You may come up with other rules for this period.)
is_invalid = (tmp["date"] - tmp["date"].min()) / pd.Timedelta(1, "D") < 180
tmp.loc[is_invalid, "dgs10"] = np.nan
# Result
df1.merge(tmp, left_on="ipodate", right_on="date", how="left")
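Note that the plain merge above only matches ipodates that fall exactly on a date present in df2; IPOs dated on weekends or holidays would come back NaN. A sketch of a backward-looking alternative with pd.merge_asof, which picks the most recent preceding trading day instead (both frames must be sorted on their keys):

# Match each ipodate to the latest trading day on or before it.
df1 = df1.sort_values("ipodate")
tmp = tmp.sort_values("date")
result = pd.merge_asof(df1, tmp, left_on="ipodate", right_on="date",
                       direction="backward")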

How to calculate slope of a dataframe, upto a specific row number?

I have this data frame that looks like this:
PE CE time
0 362.30 304.70 09:42
1 365.30 303.60 09:43
2 367.20 302.30 09:44
3 360.30 309.80 09:45
4 356.70 310.25 09:46
5 355.30 311.70 09:47
6 354.40 312.98 09:48
7 350.80 316.70 09:49
8 349.10 318.95 09:50
9 350.05 317.45 09:51
10 352.05 315.95 09:52
11 350.25 316.65 09:53
12 348.63 318.35 09:54
13 349.05 315.95 09:55
14 345.65 320.15 09:56
15 346.85 319.95 09:57
16 348.55 317.20 09:58
17 349.55 316.26 09:59
18 348.25 317.10 10:00
19 347.30 318.50 10:01
In this dataframe, I would like to calculate the slope of the first and second columns separately, over the period starting from the first time in the frame (09:42 in this case, though the start is not fixed and can vary) up to 12:00.
Please help me write it.
Computing the slope can be accomplished using the equation:
Slope = Rise / Run
Given that you want to compute the slope between two time entries, all you need to find is:
the Run: the distance between the start and end rows (the data here is one row per minute, so the run is measured in minutes);
the Rise: the difference between the cell entries at the start and end times.
The tricky part of these calculations is making sure you handle the time values properly:
import pandas as pd
from datetime import datetime
Thus you can define a function:
def computeSelectedSlope(df: pd.DataFrame, start: str, end: str,
                         timecol: str, datacol: str) -> float:
    assert timecol in df.columns  # prove timecol exists
    assert datacol in df.columns  # prove datacol exists
    # Rise: difference between the data values at the end and start times
    rise = (df[datacol][df[timecol] == datetime.strptime(end, '%H:%M:%S').time()].values[0] -
            df[datacol][df[timecol] == datetime.strptime(start, '%H:%M:%S').time()].values[0])
    # Run: number of rows between the two times (one row per minute here)
    run = (int(df.index[df[timecol] == datetime.strptime(end, '%H:%M:%S').time()].values[0]) -
           int(df.index[df[timecol] == datetime.strptime(start, '%H:%M:%S').time()].values[0]))
    return rise / run
Now given a dataframe df of the form:
A B T
0 2.632 231.229 00:00:00
1 2.732 239.026 00:01:00
2 2.748 251.310 00:02:00
3 3.018 285.330 00:03:00
4 3.090 308.925 00:04:00
5 3.366 312.702 00:05:00
6 3.369 326.912 00:06:00
7 3.562 330.703 00:07:00
8 3.590 379.575 00:08:00
9 3.867 422.262 00:09:00
10 4.030 428.148 00:10:00
11 4.210 442.521 00:11:00
12 4.266 443.631 00:12:00
13 4.335 444.991 00:13:00
14 4.380 453.531 00:14:00
15 4.402 462.531 00:15:00
16 4.499 464.170 00:16:00
17 4.553 471.770 00:17:00
18 4.572 495.285 00:18:00
19 4.665 513.009 00:19:00
You can find the slope for any time difference by:
computeSelectedSlope(df, '00:01:00', '00:15:00', 'T', 'B')
Which yields 15.964642857142858
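One detail worth making explicit: the comparisons above require the T column to hold datetime.time objects rather than strings. A small sketch of building such a column (the frame itself is assumed to exist already):

# Build a 'T' column of datetime.time values at one-minute intervals.
df['T'] = pd.date_range('00:00', periods=len(df), freq='1min').time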

Python: Convert string to datetime, calculate time difference, and select rows with time difference more than 3 days

I have a dataframe that contains two string date columns. First I would like to convert the two columns to datetime and calculate the time difference. Then I would like to select the rows with a time difference of more than 3 days.
simple df
ID Start End
234 2020-11-16 20:25 2020-11-18 00:10
62 2020-11-02 02:50 2020-11-15 21:56
771 2020-11-17 03:03 2020-11-18 00:10
desired df
ID Start End Time difference
62 2020-11-02 02:50:00 2020-11-15 21:56:00 13 days 19:06:00
Current input
df['End'] = pd.to_datetime(df['End'])
df['Start'] = pd.to_datetime(df['Start'])
df['Time difference'] = df['End'] - df['Start']
How can I select rows that has a time difference of more than 3 days?
Thanks in advance! I appreciate any help on this!!
You're just missing one line: convert to days, then query.
df[df['Time difference'].dt.days > 3]
ID Start End Time difference
62 2020-11-02 02:50:00 2020-11-15 21:56:00 13 days 19:06:00
# Set ID as the index to allow coercing the date columns to datetime
df = df.set_index('ID').apply(lambda x: pd.to_datetime(x))
# Calculate the time difference and reset the index
df = df.assign(Timedifference=df['End'].sub(df['Start'])).reset_index()
# Apply a boolean mask to filter the desired rows
df[df['Timedifference'].dt.days.gt(3)]
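One caveat: .dt.days truncates, so a gap of 3 days 19:06 has .dt.days == 3 and would be excluded. If "more than 3 days" should count partial days too, comparing the full timedelta works (using the first answer's column name):

# Compare against a Timedelta so partial days past the 3-day mark count too.
df[df['Time difference'] > pd.Timedelta(days=3)]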

Pandas adding row to categorical index

I have a scenario where I would like to group my dataset by personally defined week indexes, average each group, and then aggregate the averages into a "Total" row. I am able to achieve the first half, but when I try to append/insert a new "Total" row that sums these rows, I receive error messages.
I attempted to create this row via two different methods:
Method 1:
week_index_avg_unit.loc['Total'] = week_index_avg_unit.sum()
TypeError: cannot append a non-category item to a CategoricalIndex
Method 2:
week_index_avg_unit.index.insert(['Total'], week_index_avg_unit.sum())
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
I have used the first approach in scenarios like this multiple times, but this is the first time I'm cutting the data into multiple categories, and I can clearly see that the CategoricalIndex type is the problem.
Here is the format of my data:
date organic ppc oa other content_partnership total \
0 2018-01-01 379 251 197 51 0 878
1 2018-01-02 880 527 405 217 0 2029
2 2018-01-03 859 589 403 323 0 2174
3 2018-01-04 835 533 409 335 0 2112
4 2018-01-05 760 449 355 272 0 1836
year_month day weekday weekday_name week_index
0 2018-01 1 0 Monday Week 1
1 2018-01 2 1 Tuesday Week 1
2 2018-01 3 2 Wednesday Week 1
3 2018-01 4 3 Thursday Week 1
4 2018-01 5 4 Friday Week 1
Here is the code:
import pandas as pd
import numpy as np
from datetime import datetime
import matplotlib.pyplot as plt
historicals = pd.read_csv("2018-2019_plants.csv")
# Capture dates for additional date columns
date_col = pd.to_datetime(historicals['date'])
historicals['year_month'] = date_col.dt.strftime("%Y-%m")
historicals['day'] = date_col.dt.day
historicals['weekday'] = date_col.dt.dayofweek
historicals['weekday_name'] = date_col.dt.day_name()
# create week ranges segment (7 day range)
historicals['week_index'] = pd.cut(historicals['day'],[0,7,14,21,28,32], labels=['Week 1','Week 2','Week 3','Week 4','Week 5'])
# Week Index Average (Units); df_monthly_average is assumed to be a list of
# the numeric columns, defined earlier in the original script
week_index_avg_unit = historicals[df_monthly_average].groupby(['week_index']).mean().astype(int)
type(week_index_avg_unit.index)
pandas.core.indexes.category.CategoricalIndex
Here is the week_index_avg_unit table:
organic ppc oa other content_partnership total day weekday
week_index
Week 1 755 361 505 405 22 2027 4 3
Week 2 787 360 473 337 19 1959 11 3
Week 3 781 382 490 352 18 2006 18 3
...
pd.CategoricalIndex is a special animal. It is immutable, so to do the trick you may need to use something like pd.CategoricalIndex.set_categories to add a new category.
See pandas docs: https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.CategoricalIndex.html
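A minimal sketch of that approach, assuming week_index_avg_unit as built above (add_categories is my substitution for the set_categories suggestion; both exist on CategoricalIndex):

# Register 'Total' as a valid category first; then the .loc append works.
week_index_avg_unit.index = week_index_avg_unit.index.add_categories(['Total'])
week_index_avg_unit.loc['Total'] = week_index_avg_unit.sum()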

Generate Pandas DF OHLC data with Numpy

I would like to generate the following test data in my dataframe in a way similar to this:
df = pd.DataFrame(data=np.linspace(1800, 100, 400), index=pd.date_range(end='2015-07-02', periods=400), columns=['close'])
df
close
2014-05-29 1800.000000
2014-05-30 1795.739348
2014-05-31 1791.478697
2014-06-01 1787.218045
But using the following criteria:
intervals of 1 minute
increments of 0.25
prices moving up and down around 1800.00
maximum 2100.00, minimum 1700.00
parse_dates="Timestamp"
Volume column values ranging from min = 50 to max = 300
day start 09:30, day end 16:29:59
Please see desired output:
Open High Low Last Volume
Timestamp
2014-03-04 09:30:00 1783.50 1784.50 1783.50 1784.50 171
2014-03-04 09:31:00 1784.75 1785.75 1784.50 1785.25 28
2014-03-04 09:32:00 1785.00 1786.50 1785.00 1786.50 81
2014-03-04 09:33:00 1786.00
I have limited Python experience and find the NumPy examples hard to follow, as they seem to be aimed at academia. Is it possible to assist with this?
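No answer is recorded here, but a minimal sketch satisfying those criteria might look like the following (the random-walk scheme, seed, and tick padding are illustrative assumptions, not a canonical method):

import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=42)

# One trading day of 1-minute bars, 09:30:00 through 16:29:00 inclusive.
idx = pd.date_range("2014-03-04 09:30", "2014-03-04 16:29",
                    freq="1min", name="Timestamp")
n = len(idx)

# Random walk in 0.25 increments around 1800.00, clipped to [1700, 2100].
steps = rng.choice([-0.25, 0.0, 0.25], size=n)
last = np.clip(1800.00 + steps.cumsum(), 1700.00, 2100.00)

df = pd.DataFrame({"Last": last}, index=idx)
df["Open"] = df["Last"].shift(1).fillna(1800.00)
# High/Low bracket the open and last prices, padded by a few random ticks.
df["High"] = df[["Open", "Last"]].max(axis=1) + rng.choice([0.0, 0.25, 0.5], size=n)
df["Low"] = df[["Open", "Last"]].min(axis=1) - rng.choice([0.0, 0.25, 0.5], size=n)
df["Volume"] = rng.integers(50, 301, size=n)  # 50..300 inclusive
df = df[["Open", "High", "Low", "Last", "Volume"]]
print(df.head())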