Generate Pandas DF OHLC data with Numpy

I would like to generate the following test data in my dataframe in a way similar to this:
df = pd.DataFrame(data=np.linspace(1800, 100, 400), index=pd.date_range(end='2015-07-02', periods=400), columns=['close'])
df
close
2014-05-29 1800.000000
2014-05-30 1795.739348
2014-05-31 1791.478697
2014-06-01 1787.218045
But using the following criteria:
intervals of 1 minute
increments of 0.25
prices moving up and down around 1800.00
maximum 2100.00, minimum 1700.00
parse_dates="Timestamp"
Volume column values ranging from 50 to 300
day start 09:30, day end 16:29:59
Please see desired output:
Open High Low Last Volume
Timestamp
2014-03-04 09:30:00 1783.50 1784.50 1783.50 1784.50 171
2014-03-04 09:31:00 1784.75 1785.75 1784.50 1785.25 28
2014-03-04 09:32:00 1785.00 1786.50 1785.00 1786.50 81
2014-03-04 09:33:00 1786.00
I have limited Python experience and find the examples for NumPy etc. hard to follow, as they seem to be aimed at academia. Is it possible to assist with this?
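A minimal sketch of one way to do this (my own approach, not from the original post): simulate a tick every second as a random walk in 0.25 increments around 1800.00, clip it to the 1700.00-2100.00 range, then resample to 1-minute OHLC bars and add a random Volume column. The date, seed, and column names are illustrative.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# One tick per second across a single trading day.
ticks = pd.date_range("2014-03-04 09:30:00", "2014-03-04 16:29:59", freq="s")

# Random walk in 0.25 increments around 1800.00, kept inside 1700.00-2100.00
# by a simple clip (not a true reflecting boundary).
steps = rng.choice([-0.25, 0.0, 0.25], size=len(ticks))
price = pd.Series(np.clip(1800.00 + steps.cumsum(), 1700.00, 2100.00),
                  index=ticks, name="price")

# Resample the per-second walk to 1-minute OHLC bars.
ohlc = price.resample("1min").ohlc()
ohlc.columns = ["Open", "High", "Low", "Last"]

# Random volume between 50 and 300 per minute, plus a named index for parse_dates.
ohlc["Volume"] = rng.integers(50, 301, size=len(ohlc))
ohlc.index.name = "Timestamp"
print(ohlc.head())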

Related

Create extrapolated time series in Pandas from dataset

I have a time series of daily potential evaporation [mm/day] going back 11 years, but I need a time series going back to 1975. What I would like to do is calculate a "normal"/mean year from the data I have, and fill that into a time series with daily values all the way back to 1975.
I tried reindexing and resampling into that df, but it didn't do the trick.
Below are some sample data:
epot [mm]
tid
2011-01-01 00:00:00+00:00 0.3
2011-01-02 00:00:00+00:00 0.2
2011-01-03 00:00:00+00:00 0.1
2011-01-04 00:00:00+00:00 0.1
2011-01-05 00:00:00+00:00 0.1
...
2021-12-27 00:00:00+00:00 0.1
2021-12-28 00:00:00+00:00 0.1
2021-12-29 00:00:00+00:00 0.1
2021-12-30 00:00:00+00:00 0.1
2021-12-31 00:00:00+00:00 0.1
epot [mm]
count 4018.000000
mean 1.688477
std 1.504749
min 0.000000
25% 0.300000
50% 1.300000
75% 2.800000
max 5.900000
The plot shows the daily values; there isn't a lot of difference from year to year, so using a mean year for all the prior years would probably be just fine.
EDIT:
I have managed to calculate a normalised year from all my data, using min, mean, the 0.9 quantile and max, which is really useful. But I still struggle to take these values and put them into a time series stretching over several years.
I used the groupby function to get this far.
df1 = E_pot_d.groupby([E_pot_d.index.month, E_pot_d.index.day]).agg(f)
df2 = df1.rolling(30, center=True, min_periods=10).mean().fillna(method='bfill')
df2
Out[75]:
epot [mm]
min mean q0.90 max
tid tid
1 1 0.046667 0.161818 0.280000 0.333333
2 0.043750 0.165341 0.281250 0.337500
3 0.047059 0.165775 0.282353 0.341176
4 0.044444 0.169697 0.288889 0.344444
5 0.042105 0.172249 0.300000 0.352632
... ... ... ...
12 27 0.020000 0.137273 0.240000 0.290000
28 0.021053 0.138278 0.236842 0.289474
29 0.022222 0.138889 0.238889 0.288889
30 0.017647 0.139572 0.241176 0.294118
31 0.018750 0.140909 0.237500 0.293750
[366 rows x 4 columns]
If you want to take the daily average of the current years and project it back to 1975, you can try this:
s = pd.date_range("1975-01-01", "2010-12-31")
extrapolated = (
    df.groupby(df.index.dayofyear)
    .mean()
    .join(pd.Series(s, index=s.dayofyear, name="tid"), how="outer")
    .set_index("tid")
    .sort_index()
)
# Combine the two data sets
result = pd.concat([extrapolated, df])
Note that this algorithm will give you the same value for Jan 1, 1975, Jan 1, 1976, Jan 1, 1977, etc., since they are all the average of the Jan 1s from 2011 to 2021.
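If you would rather reuse the (month, day) table df2 from the edit above, here is a variant I'm adding (not part of the answer): build the target daily index and look each day up by its month and day.
# Broadcast the (month, day) "normal year" onto every day from 1975 to 2010.
dates = pd.date_range("1975-01-01", "2010-12-31", freq="D")
keys = pd.MultiIndex.from_arrays([dates.month, dates.day])
normals = df2.reindex(keys)   # repeats each calendar day's normals for every year
normals.index = dates         # replace the (month, day) keys with real dates
The result can then be concatenated with the observed data from 2011 onwards, just like extrapolated above.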

Date dependent calculation from 2 dataframes - average 6-month return

I am working with the following dataframe. I have data for multiple companies, each row associated with a specific datadate, so I have many rows related to many companies, with IPO dates from 2009 to 2022.
index ID price daily_return datadate daily_market_return mean_daily_market_return ipodate
0 1 27.50 0.008 01-09-2010 0.0023 0.03345 01-12-2009
1 2 33.75 0.0745 05-02-2017 0.00458 0.0895 06-12-2012
2 3 29,20 0.00006 08-06-2020 0.0582 0.0045 01-05-2013
3 4 20.54 0.00486 09-06-2018 0.0009 0.0006 27-11-2013
4 1 21.50 0.009 02-09-2021 0.0846 0.04345 04-05-2009
5 4 22.75 0.00539 06-12-2019 0.0003 0.0006 21-09-2012
...
26074 rows
I also have a dataframe containing the Market yield on US Treasury securities at 10-year constant maturity - measured daily. Each row represents the return associated with a specific day, each day from 2009 to 2022.
date dgs10
1 2009-01-02 2.46
2 2009-01-05 2.49
3 2009-01-06 2.51
4 2009-01-07 2.52
5 2009-01-08 2.47
6 2009-01-09 2.43
7 2009-01-12 2.34
8 2009-01-13 2.33
...
date dgs10
3570 2022-09-08 3.29
3571 2022-09-09 3.33
3572 2022-09-12 3.37
3573 2022-09-13 3.42
3574 2022-09-14 3.41
My goal is to calculate, for each ipodate (from dataframe 1), the average of the previous 6 months of the Market yield on US Treasury securities at 10-year constant maturity (from dataframe 2). The result should either be in a new dataframe or in an additional column in dataframe 1. Both dataframes are not the same length. I tried using rolling(), but it doesn't seem to be working. Does anyone know how to fix this?
# Make sure that all date columns are of type Timestamp. They are a lot easier
# to work with
df1["ipodate"] = pd.to_datetime(df1["ipodate"], dayfirst=True)
df2["date"] = pd.to_datetime(df2["date"])
# Calculate the mean market yield of the previous 6 months. Six months is not
# a fixed length of time, so I replaced it with 180 days.
tmp = df2.rolling("180D", on="date").mean()
# The values of the first 180 days are invalid, because we have insufficient
# data to calculate the rolling mean. You may consider extending df2 further
# back to 2008. (You may come up with other rules for this period.)
is_invalid = (tmp["date"] - tmp["date"].min()) / pd.Timedelta(1, "D") < 180
tmp.loc[is_invalid, "dgs10"] = np.nan
# Result
df1.merge(tmp, left_on="ipodate", right_on="date", how="left")
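One caveat worth noting (my addition, not part of the answer above): the final merge requires an exact date match, so an ipodate that falls on a weekend or holiday gets NaN. A possible refinement is pd.merge_asof, which pairs each ipodate with the most recent available date; both frames must be sorted on their keys.
result = pd.merge_asof(
    df1.sort_values("ipodate"),
    tmp.sort_values("date"),
    left_on="ipodate",
    right_on="date",
    direction="backward",  # take the latest date on or before the ipodate
)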

How to get that graph in plotly?

I have an hourly time series for 2016 through 2020 and I want to get a graph that looks like the attached picture.
datetime Demand
2016-1-1 01:00:00. 500
2016-1-1 02:00:00. 450
2016-1-1 03:00:00. 650
.........................
2017-1-1 01:00:00. 570
2017-1-1 02:00:00. 470
2017-1-1 03:00:00. 600
.........................
.........................
2020-1-1 01:00:00. 900
2020-1-1 02:00:00. 800
2020-1-1 03:00:00. 950
My dataframe looks like the one above.
Basically you'll need to create two new columns in your dataframe: Year and Hour.
You can use datetime in order to do that.
With those two columns you can now create a px.line graph where x is your Hour column, y is your Demand column, and color is your Year column.
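A minimal sketch of that approach, assuming the dataframe is named df with the datetime and Demand columns shown above (the per-hour averaging is my assumption about what the picture shows):
import pandas as pd
import plotly.express as px

df["datetime"] = pd.to_datetime(df["datetime"])
df["Year"] = df["datetime"].dt.year.astype(str)  # string so each year gets its own colour
df["Hour"] = df["datetime"].dt.hour

# Average demand per hour of day for each year.
hourly = df.groupby(["Year", "Hour"], as_index=False)["Demand"].mean()

fig = px.line(hourly, x="Hour", y="Demand", color="Year")
fig.show()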
References:
Datetime
Line Charts With Plotly

Python: Convert string to datetime, calculate time difference, and select rows with time difference more than 3 days

I have a dataframe that contains two string date columns. First I would like to convert the two columns into datetime and calculate the time difference. Then I would like to select rows with a time difference of more than 3 days.
simple df
ID Start End
234 2020-11-16 20:25 2020-11-18 00:10
62 2020-11-02 02:50 2020-11-15 21:56
771 2020-11-17 03:03 2020-11-18 00:10
desired df
ID Start End Time difference
62 2020-11-02 02:50:00 2020-11-15 21:56:00 13 days 19:06:00
Current input
df['End'] = pd.to_datetime(df['End'])
df['Start'] = pd.to_datetime(df['Start'])
df['Time difference'] = df['End'] - df['Start']
How can I select rows that has a time difference of more than 3 days?
Thanks in advance! I appreciate any help on this!!
You're just missing one line: convert to days, then query.
df[df['Time difference'].dt.days > 3]
ID Start End Time difference
62 2020-11-02 02:50:00 2020-11-15 21:56:00 13 days 19:06:00
# Set ID as the index so pd.to_datetime can be applied to both date columns
df = df.set_index('ID').apply(lambda x: pd.to_datetime(x))
# Calculate the time difference and reset the index
df = df.assign(Timedifference=df['End'].sub(df['Start'])).reset_index()
# Boolean mask to filter the desired rows
df[df['Timedifference'].dt.days.gt(3)]
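A small side note I'm adding (not from either answer above): .dt.days truncates, so a row whose difference is, say, 3 days 5 hours would be excluded. To count partial days as "more than 3 days", compare against a Timedelta instead (using the column name from the first answer).
df[df['Time difference'] > pd.Timedelta(days=3)]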

Difference between two date columns in Pandas

I am trying to get the difference between two date columns with the script and data below, but I am getting the same result for all three rows.
df = pd.read_csv(r'Book1.csv',encoding='cp1252')
df
Out[36]:
Start End DifferenceinDays DifferenceinHrs
0 10/26/2013 12:43 12/15/2014 0:04 409 9816
1 2/3/2014 12:43 3/25/2015 0:04 412 9888
2 5/14/2014 12:43 7/3/2015 0:04 409 9816
I am expecting results as in the DifferenceinDays column, which was calculated in Excel, but in Python I get the same value for all three rows. Please refer to the code below. Can anyone let me know how to calculate the difference between two date columns? I am trying to get the number of hours between them.
df["Start"] = pd.to_datetime(df['Start'])
df["End"] = pd.to_datetime(df['End'])
df['hrs']=(df.End-df.Start)
df['hrs']
Out[38]:
0 414 days 11:21:00
1 414 days 11:21:00
2 414 days 11:21:00
Name: hrs, dtype: timedelta64[ns]
IIUC, divide the timedelta by np.timedelta64(1, 'h') to get hours.
Additionally, it looks like Excel calculates the hours differently; I'm unsure why.
import numpy as np
df['hrs'] = (df['End'] - df['Start']) / np.timedelta64(1,'h')
print(df)
Start End DifferenceinHrs hrs
0 2013-10-26 12:43:00 2014-12-15 00:04:00 9816 9947.35
1 2014-02-03 12:43:00 2015-03-25 00:04:00 9888 9947.35
2 2014-05-14 12:43:00 2015-07-03 00:04:00 9816 9947.35
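An alternative I'm adding (not part of the original answer) that avoids the extra NumPy import: timedeltas expose total_seconds(), which converts to hours directly.
df['hrs'] = (df['End'] - df['Start']).dt.total_seconds() / 3600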