I have a time series of daily potential evaporation [mm/day] going back 11 years, but I need a time series going back to 1975. What I would like to do is calculate a "normal"/mean year from the data I have, and fill that into a time series with daily values all the way back to 1975.
I tried reindexing and resampling into that DataFrame, but it didn't do the trick.
Below are some sample data:
epot [mm]
tid
2011-01-01 00:00:00+00:00 0.3
2011-01-02 00:00:00+00:00 0.2
2011-01-03 00:00:00+00:00 0.1
2011-01-04 00:00:00+00:00 0.1
2011-01-05 00:00:00+00:00 0.1
...
2021-12-27 00:00:00+00:00 0.1
2021-12-28 00:00:00+00:00 0.1
2021-12-29 00:00:00+00:00 0.1
2021-12-30 00:00:00+00:00 0.1
2021-12-31 00:00:00+00:00 0.1
epot [mm]
count 4018.000000
mean 1.688477
std 1.504749
min 0.000000
25% 0.300000
50% 1.300000
75% 2.800000
max 5.900000
The plot of the daily values shows that there isn't much difference from year to year, so using a mean year for all the prior years would probably be fine.
EDIT:
I have managed to calculate a normalised year from all my data, using the min, mean, 0.9 quantile and max, which is really useful. But I still struggle to take these values and put them into a time series stretching over several years.
I used the groupby function to get this far.
# f is the aggregation spec (defined earlier, not shown) that produces the
# min/mean/q0.90/max columns in the output below
df1 = E_pot_d.groupby([E_pot_d.index.month, E_pot_d.index.day]).agg(f)
# Smooth with a centred 30-day rolling mean and backfill any leading NaNs
df2 = df1.rolling(30, center=True, min_periods=10).mean().bfill()
df2
Out[75]:
epot [mm]
min mean q0.90 max
tid tid
1 1 0.046667 0.161818 0.280000 0.333333
2 0.043750 0.165341 0.281250 0.337500
3 0.047059 0.165775 0.282353 0.341176
4 0.044444 0.169697 0.288889 0.344444
5 0.042105 0.172249 0.300000 0.352632
... ... ... ...
12 27 0.020000 0.137273 0.240000 0.290000
28 0.021053 0.138278 0.236842 0.289474
29 0.022222 0.138889 0.238889 0.288889
30 0.017647 0.139572 0.241176 0.294118
31 0.018750 0.140909 0.237500 0.293750
[366 rows x 4 columns]
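One way to stretch this normalised year back over 1975-2010 (a minimal sketch, assuming df2 is the (month, day)-indexed frame shown above and that the original index is tz-aware UTC) is to build the full daily range and look each date up by its (month, day) pair:
import pandas as pd

full_range = pd.date_range("1975-01-01", "2010-12-31", freq="D", tz="UTC", name="tid")
# Look up each date's (month, day) pair in df2's MultiIndex
keys = pd.MultiIndex.from_arrays([full_range.month, full_range.day])
backfilled = df2.reindex(keys).set_index(full_range)
Since df2 has 366 rows, Feb 29 is available for the leap years in that range.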
If you want to take the day-of-year average of the years you have and project it back to 1975, you can try this:
s = pd.date_range("1975-01-01", "2010-12-31")
extrapolated = (
    df.groupby(df.index.dayofyear)
    .mean()
    .join(pd.Series(s, index=s.dayofyear, name="tid"), how="outer")
    .set_index("tid")
    .sort_index()
)
# Combine the two data sets
result = pd.concat([extrapolated, df])
Note that this algorithm gives the same value for Jan 1, 1975, Jan 1, 1976, Jan 1, 1977, etc., since each is the average of all the Jan 1s from 2011 to 2021.
Related
I am working with the following dataframe. I have data for multiple companies, each row associated with a specific datadate, so there are many rows for many companies, with IPO dates from 2009 to 2022.
index ID price daily_return datadate daily_market_return mean_daily_market_return ipodate
0 1 27.50 0.008 01-09-2010 0.0023 0.03345 01-12-2009
1 2 33.75 0.0745 05-02-2017 0.00458 0.0895 06-12-2012
2 3 29.20 0.00006 08-06-2020 0.0582 0.0045 01-05-2013
3 4 20.54 0.00486 09-06-2018 0.0009 0.0006 27-11-2013
4 1 21.50 0.009 02-09-2021 0.0846 0.04345 04-05-2009
5 4 22.75 0.00539 06-12-2019 0.0003 0.0006 21-09-2012
...
26074 rows
I also have a dataframe containing the Market yield on US Treasury securities at 10-year constant maturity - measured daily. Each row represents the return associated with a specific day, each day from 2009 to 2022.
date dgs10
1 2009-01-02 2.46
2 2009-01-05 2.49
3 2009-01-06 2.51
4 2009-01-07 2.52
5 2009-01-08 2.47
6 2009-01-09 2.43
7 2009-01-12 2.34
8 2009-01-13 2.33
...
date dgs10
3570 2022-09-08 3.29
3571 2022-09-09 3.33
3572 2022-09-12 3.37
3573 2022-09-13 3.42
3574 2022-09-14 3.41
My goal is to calculate, for each ipodate (from dataframe 1), the average of the Market yield on US Treasury securities at 10-year constant maturity (from dataframe 2) over the previous 6 months. The result should either be a new dataframe or an additional column in dataframe 1. The two dataframes are not the same length. I tried using rolling(), but it doesn't seem to be working. Does anyone know how to fix this?
import numpy as np
import pandas as pd

# Make sure that all date columns are of type Timestamp. They are a lot easier
# to work with.
df1["ipodate"] = pd.to_datetime(df1["ipodate"], dayfirst=True)
df2["date"] = pd.to_datetime(df2["date"])

# Calculate the mean market yield over the previous 6 months. Six months is not
# a fixed length of time, so it is replaced with 180 days here.
tmp = df2.rolling("180D", on="date").mean()
# The values of the first 180 days are invalid, because we have insufficient
# data to calculate the rolling mean. You may consider extending df2 further
# back to 2008. (You may come up with other rules for this period.)
is_invalid = (tmp["date"] - tmp["date"].min()) / pd.Timedelta(1, "D") < 180
tmp.loc[is_invalid, "dgs10"] = np.nan
# Result
df1.merge(tmp, left_on="ipodate", right_on="date", how="left")
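This merge only matches ipodates that appear exactly in df2's date column, so an IPO date that falls on a non-trading day would come back as NaN. One possible workaround (a sketch, not part of the original answer) is merge_asof, which picks the most recent trading day at or before each ipodate:
df1_sorted = df1.sort_values("ipodate")
result = pd.merge_asof(df1_sorted, tmp.sort_values("date"),
                       left_on="ipodate", right_on="date", direction="backward")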
I have this data frame that looks like this:
PE CE time
0 362.30 304.70 09:42
1 365.30 303.60 09:43
2 367.20 302.30 09:44
3 360.30 309.80 09:45
4 356.70 310.25 09:46
5 355.30 311.70 09:47
6 354.40 312.98 09:48
7 350.80 316.70 09:49
8 349.10 318.95 09:50
9 350.05 317.45 09:51
10 352.05 315.95 09:52
11 350.25 316.65 09:53
12 348.63 318.35 09:54
13 349.05 315.95 09:55
14 345.65 320.15 09:56
15 346.85 319.95 09:57
16 348.55 317.20 09:58
17 349.55 316.26 09:59
18 348.25 317.10 10:00
19 347.30 318.50 10:01
In this data frame, I would like to calculate the slope of the first and the second columns separately, over the time period starting at the first timestamp (09:42 in this case, but it is not fixed and can vary) and ending at 12:00.
Please help me write it.
Computing the slope can be accomplished with the equation:
Slope = Rise / Run
Given that you want to compute the slope between two time entries, all you need to do is find:
the Run = the timedelta between the start and end times
the Rise = the difference between the cell entries at the start and end.
The tricky part of these calculations is making sure you properly handle the time functions:
import pandas as pd
from datetime import datetime
Thus you can define a function:
def computeSelectedSlope(df: pd.DataFrame, start: str, end: str, timecol: str, datacol: str) -> float:
    assert timecol in df.columns   # prove timecol exists
    assert datacol in df.columns   # prove datacol exists
    rise = (df[datacol][df[timecol] == datetime.strptime(end, '%H:%M:%S').time()].values[0] -
            df[datacol][df[timecol] == datetime.strptime(start, '%H:%M:%S').time()].values[0])
    run = (int(df.index[df[timecol] == datetime.strptime(end, '%H:%M:%S').time()].values[0]) -
           int(df.index[df[timecol] == datetime.strptime(start, '%H:%M:%S').time()].values[0]))
    return rise / run
Now given a dataframe df of the form:
A B T
0 2.632 231.229 00:00:00
1 2.732 239.026 00:01:00
2 2.748 251.310 00:02:00
3 3.018 285.330 00:03:00
4 3.090 308.925 00:04:00
5 3.366 312.702 00:05:00
6 3.369 326.912 00:06:00
7 3.562 330.703 00:07:00
8 3.590 379.575 00:08:00
9 3.867 422.262 00:09:00
10 4.030 428.148 00:10:00
11 4.210 442.521 00:11:00
12 4.266 443.631 00:12:00
13 4.335 444.991 00:13:00
14 4.380 453.531 00:14:00
15 4.402 462.531 00:15:00
16 4.499 464.170 00:16:00
17 4.553 471.770 00:17:00
18 4.572 495.285 00:18:00
19 4.665 513.009 00:19:00
You can find the slope for any time difference by:
computeSelectedSlope(df, '00:01:00', '00:15:00', 'T', 'B')
Which yields 15.964642857142858
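Note that run in the function above is a difference of row positions, which equals elapsed minutes only because the sample rows are spaced one minute apart. If the sampling can be irregular, a variant (a sketch under that assumption, not part of the original answer) can measure the Run as an actual timedelta in minutes:
from datetime import datetime, date

def computeSlopePerMinute(df: pd.DataFrame, start: str, end: str, timecol: str, datacol: str) -> float:
    t0 = datetime.strptime(start, '%H:%M:%S').time()
    t1 = datetime.strptime(end, '%H:%M:%S').time()
    # Rise: difference between the data values at the two times
    rise = (df.loc[df[timecol] == t1, datacol].values[0] -
            df.loc[df[timecol] == t0, datacol].values[0])
    # Run: elapsed time in minutes (the times are attached to an arbitrary common date)
    run = (datetime.combine(date.today(), t1) - datetime.combine(date.today(), t0)).total_seconds() / 60
    return rise / run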
I'm trying to calculate event recurrence over the last 7 days. An event is defined by a specific amount deducted from my bank account (see the dataframe example below). I tried using various tools such as rolling, groupby, resample, etc., but couldn't integrate them into a working solution. Two main problems I encountered:
I need to perform rolling.count() only on rows where the amount is equal
I need a full 7-day window and NOT a 7-row window (on some days there are no transactions)
Any ideas? I would really appreciate an explanation as well. Thank you!!
date description amount desired column (amount count in the last 7 days)
9/5/2019 asdkfjlskd 500 1
9/6/2019 dfoais 1200 1
9/7/2019 sadlfuhasd\ -12.99 1
9/8/2019 sdaf 500 2
9/9/2019 sdaf -267.01 1
9/10/2019 sdaf -39.11 1
9/11/2019 sdaf -18 1
9/11/2019 sdaf 500 3
9/13/2019 sdaf 500 1
9/14/2019 sdaf -450 1
9/15/2019 sdaf -140 1
9/16/2019 sdaf -6.8 1
The right way to do this in pandas is to use groupby-rolling, with the rolling window equal to seven days ('7D'), like this:
df["date"] = pd.to_datetime(df.date)
df.set_index("date").groupby("amount").rolling("7D").count()
This results in:
amount date
-450.00 2019-09-14 1.0
-267.01 2019-09-09 1.0
-140.00 2019-09-15 1.0
-39.11 2019-09-10 1.0
-18.00 2019-09-11 1.0
-12.99 2019-09-07 1.0
-6.80 2019-09-16 1.0
500.00 2019-09-05 1.0
2019-09-08 2.0
2019-09-11 3.0
2019-09-13 3.0
1200.00 2019-09-06 1.0
Note that the date in this time frame relates to the end of the 7-day period. That is, in the 7 days ending on 2019-09-13, you had 3 transactions of 500.
And if you want to 'flatten' it back to one row per transaction:
tx_count = df.set_index("date").groupby("amount").rolling("7D").count()
tx_count.columns=["similar_tx_count_prev_7_days"]
tx_count = tx_count.reset_index()
tx_count
results in:
amount date similar_tx_count_prev_7_days
0 -450.00 2019-09-14 1.0
1 -267.01 2019-09-09 1.0
2 -140.00 2019-09-15 1.0
3 -39.11 2019-09-10 1.0
4 -18.00 2019-09-11 1.0
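If you want these counts attached to the original transactions, one option (a sketch, assuming df still holds the raw rows with its date column already converted by pd.to_datetime) is to merge on amount and date:
df_with_counts = df.merge(tx_count, on=["amount", "date"], how="left")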
I eventually used the following method. Is it less efficient?
df3['test'] = df3.apply(
    lambda x: df3[(df3['amount'] == x.amount)
                  & (df3['date'] < x.date)
                  & (df3['date'] >= (x.date - pd.DateOffset(days=7)))]['date'].count(),
    axis=1)
My dataset consists of a date column in 'datetime64[ns]' dtype; it also has a price and a no. of sales column.
I want to calculate the monthly VWAP (Volume Weighted Average Price ) of the stock.
(VWAP = sum(price * no. of sales) / sum(no. of sales))
What I have done so far:
created new dataframe columns for month and year using pandas functions.
Now I want the monthly VWAP from this modified dataset, and it should be distinct by year.
For example, March 2016 and March 2017 should have separate monthly VWAP values.
Start by defining a function to compute the VWAP for the current month (a group of rows):
def vwap(grp):
    return (grp.price * grp.salesNo).sum() / grp.salesNo.sum()
Then apply it to monthly groups:
df.groupby(df.dat.dt.to_period('M')).apply(vwap)
Using the following test DataFrame:
dat price salesNo
0 2018-05-14 120.5 10
1 2018-05-16 80.0 22
2 2018-05-20 30.2 12
3 2018-08-10 75.1 41
4 2018-08-20 92.3 18
5 2019-05-10 10.0 33
6 2019-05-20 20.0 41
(containing data from the same months in different years), I got:
dat
2018-05 75.622727
2018-08 80.347458
2019-05 15.540541
Freq: M, dtype: float64
As you can see, the result contains separate entries for May in both
years from the source data.
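The same result can also be obtained without apply (a sketch using the same column names), by computing the weighted sums directly; this tends to be faster on large frames:
g = df.assign(pv=df.price * df.salesNo).groupby(df.dat.dt.to_period('M'))
monthly_vwap = g['pv'].sum() / g['salesNo'].sum()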
I would like to generate test data in my dataframe, in a way similar to this:
df = pd.DataFrame(data=np.linspace(1800, 100, 400), index=pd.date_range(end='2015-07-02', periods=400), columns=['close'])
df
close
2014-05-29 1800.000000
2014-05-30 1795.739348
2014-05-31 1791.478697
2014-06-01 1787.218045
But using the following criteria:
intervals of 1 minute
increments of .25
prices moving up and down around 1800.00
maximum 2100.00, minimum 1700.00
parse_dates= "Timestamp"
Volume column rows have a range of min = 50 - max = 300
Day start 09:30 Day End 16:29:59
Please see desired output:
Open High Low Last Volume
Timestamp
2014-03-04 09:30:00 1783.50 1784.50 1783.50 1784.50 171
2014-03-04 09:31:00 1784.75 1785.75 1784.50 1785.25 28
2014-03-04 09:32:00 1785.00 1786.50 1785.00 1786.50 81
2014-03-04 09:33:00 1786.00
I have limited Python experience and find the NumPy examples etc. hard to follow, as they seem to be aimed at academia. Is it possible to assist with this?
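A minimal sketch along these lines might be a starting point (the seed, the number of simulated ticks per minute, and the example date are assumptions; prices random-walk in 0.25 increments around 1800.00, clipped to the 1700.00-2100.00 range, Volume is drawn from 50-300, and minute bars run from 09:30 up to 16:29:59):
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)   # fixed seed so the data is reproducible (assumption)

# One trading day of 1-minute bars, 09:30:00 through 16:29:00 (covering up to 16:29:59)
idx = pd.date_range("2014-03-04 09:30", "2014-03-04 16:29", freq="1min", name="Timestamp")

# Random walk in 0.25 increments around 1800.00, clipped to the allowed range;
# 4 simulated ticks per minute is an arbitrary choice
ticks = np.clip(1800.00 + rng.choice([-0.25, 0.0, 0.25], size=len(idx) * 4).cumsum(),
                1700.00, 2100.00).reshape(len(idx), 4)

df = pd.DataFrame({
    "Open": ticks[:, 0],
    "High": ticks.max(axis=1),
    "Low": ticks.min(axis=1),
    "Last": ticks[:, -1],
    "Volume": rng.integers(50, 301, size=len(idx)),   # 50-300 inclusive
}, index=idx)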