I have a dataset of energy demand for each user profile, sampled every half hour on different day types (for example, a weekday in summer or a Saturday in winter). I would like to extend this to every day of a year using the existing data, with some randomness added here and there.
What methods do you recommend for time series extrapolation?
The dataset I'm working on is here:
https://data.ukedc.rl.ac.uk/browse/edc/efficiency/residential/LoadProfile/data
You may use imputeTS or pandas interpolate to fill missing values between the days (i.e., imputation/interpolation), and forecast to fill in future values (i.e., forecasting/extrapolation).
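A minimal sketch of the interpolation route with pandas, assuming the profiles are arranged with one row per known day and one column per half-hour slot. The dates, the random profile values, and the 5% noise level are all made up for illustration and are not taken from the linked dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Hypothetical input: one row per known day, one column per half-hour slot (48 per day).
known_days = pd.to_datetime(["2021-01-15", "2021-04-15", "2021-07-15", "2021-10-15"])
profiles = pd.DataFrame(
    rng.uniform(0.2, 1.5, size=(len(known_days), 48)),
    index=known_days,
    columns=range(48),
)

# Reindex to every day of the year; days without a profile become NaN rows.
all_days = pd.date_range("2021-01-01", "2021-12-31", freq="D")
daily = profiles.reindex(all_days)

# Interpolate each half-hour slot across days, then pad the edges of the year
# with the nearest known profile.
daily = daily.interpolate(method="time").ffill().bfill()

# Assumed noise model: multiplicative Gaussian noise with a 5% standard deviation.
noisy = daily * rng.normal(1.0, 0.05, size=daily.shape)

# Flatten back to one half-hourly series for the whole year.
stamps = pd.date_range("2021-01-01", periods=noisy.size, freq="30min")
series = pd.Series(noisy.to_numpy().ravel(), index=stamps)
```

Interpolating per slot keeps the daily shape of each profile, which a straight interpolation along the flattened half-hourly series would not.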
I have a DataFrame like the one in the photo (https://i.stack.imgur.com/5u3WR.png), and I would like every grid point to have the same time series index (repeated for each point), namely:
t_index_np = np.arange('2013-01-01', '2022-12-31 23:00:00', dtype='datetime64[h]')
The frequency is hourly.
You have to take into account that for many grid points there is only one associated date.
What I have tried so far is a for loop with resample and pd.merge, but it doesn't work for the points that have only one date. The Total Power column must be forward-filled.
Thanks in advance!
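One possible approach, sketched under assumptions: the column names grid_point and date are hypothetical (the real ones are only visible in the photo), and the time range is shortened to two days to keep the example small. Reindexing against the full product of grid points and hours avoids resample altogether, so points with a single date are handled the same way as the others.

```python
import pandas as pd

# Hypothetical stand-in for the DataFrame in the photo; grid point 2 has only one date.
df = pd.DataFrame({
    "grid_point": [1, 1, 2],
    "date": pd.to_datetime(
        ["2013-01-01 00:00", "2013-01-01 05:00", "2013-01-01 02:00"]
    ),
    "Total Power": [10.0, 20.0, 5.0],
})

# Shortened hourly index; substitute the full 2013-2022 range from the question.
t_index = pd.date_range("2013-01-01", "2013-01-02 23:00", freq="h")

# Every (grid point, hour) combination.
full_index = pd.MultiIndex.from_product(
    [df["grid_point"].unique(), t_index], names=["grid_point", "date"]
)

# Reindexing requires each (grid_point, date) pair to be unique in df.
out = df.set_index(["grid_point", "date"]).reindex(full_index)

# Forward-fill within each grid point only, so values never leak between points.
out["Total Power"] = out.groupby(level="grid_point")["Total Power"].ffill()
out = out.reset_index()
```

Hours before a grid point's first observation stay NaN, because forward-filling has nothing to carry there.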
I have data reflecting the daily count of events related to users - it's structured roughly as
date, user, count.
The data is somewhat sparse in that some users have no events on a given day, some days have no events from any user, etc.
I'd like to generate data that reflects on a given day which users have had events on at least three days in the previous seven days.
I was thinking of creating a date range, using a window function of some sort, but find myself going around in circles a bit.
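The date range plus window idea could be sketched in pandas as follows. The sample data is invented, and "the previous seven days" is assumed here to mean the seven days strictly before the given day; drop the shift(1) if the day itself should count.

```python
import pandas as pd

# Invented sample in the date, user, count layout described above.
events = pd.DataFrame({
    "date": pd.to_datetime(
        ["2023-01-01", "2023-01-02", "2023-01-04", "2023-01-03", "2023-01-09"]
    ),
    "user": ["a", "a", "a", "b", "b"],
    "count": [2, 1, 5, 1, 3],
})

# One row per date, one column per user; missing combinations become 0.
daily = events.groupby(["date", "user"])["count"].sum().unstack(fill_value=0)

# Insert the days on which no user had any event.
all_days = pd.date_range(daily.index.min(), daily.index.max(), freq="D")
active = daily.reindex(all_days, fill_value=0).gt(0).astype(int)

# Number of active days in the 7 days before each date (current day excluded).
days_active = active.rolling(7, min_periods=1).sum().shift(1)

# True where the user had events on at least three of those days.
qualifies = days_active >= 3
```

For a given day, qualifies.loc[day] is a boolean row, and the qualifying users are the columns where it is True.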
I am creating a city-by-day panel from scratch, but I'm having trouble balancing and filling in the data. Every city needs to have an observation for every day between 01jan2000 and 31dec2019. My variable of interest is a dummy variable recording whether or not an event took place on that day in that city.
My original dataset only recorded observations if event == 1. I managed to fill in time gaps using tsfill, but I can't figure out how to balance the data or extend it so that it starts on 01jan2000 and ends on 31dec2019. I need every date and city because eventually it will be merged with data that uses that sample period.
My current approach is to create a balanced and filled-in panel and then merge in the event data using the date each event took place. I have a Stata dataset containing the 7,305 dates, and another containing the 273 cityids I'm observing. Is it possible to generate a new dataset that combines these two so that all 273 cities are observed every day? Essentially there will be 273 x 7,305 observations and no variables of interest.
Any help figuring out how to solve the unbalanced issue using either of these approaches is hugely appreciated.
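The second approach (build the full city-by-day grid first, then merge the events in) can be illustrated in pandas; this is a sketch of the logic rather than Stata code. The city ids 1 to 273 and the two sample events are made up.

```python
import pandas as pd

# All days in the sample period: 7,305 dates.
dates = pd.date_range("2000-01-01", "2019-12-31", freq="D")

# Hypothetical city identifiers; replace with the real 273 cityid values.
city_ids = range(1, 274)

# Balanced panel: every city paired with every date.
panel = pd.MultiIndex.from_product(
    [city_ids, dates], names=["cityid", "date"]
).to_frame(index=False)

# Made-up event records, present only where event == 1 as in the original data.
events = pd.DataFrame({
    "cityid": [1, 5],
    "date": pd.to_datetime(["2000-03-01", "2019-12-31"]),
    "event": [1, 1],
})

# Left merge keeps every city-day; unmatched rows get event = 0.
panel = panel.merge(events, on=["cityid", "date"], how="left")
panel["event"] = panel["event"].fillna(0).astype(int)
```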
This is my first time asking a question here, so if I'm doing something wrong please guide me to the right place. I have a big, clean dataset with shape (29000+, 24). I have to calculate the churn rate based on 4 different categorical columns, and I'm given just 1 column that contains the subs for a given period. I have a date column too. My idea for calculating the churn is to do
churn_rate= (Sub_start_period-Sub_end_period)/Sub_start_period*100
The Problem
I don't know how to group the data using these 4 different categorical variables. Also, if I manage to do so, I would end up with more than 200 different tables, so I don't believe this would be a good approach.
My goal is to be able to predict the churn rate using the information in the table, but first I need to determine the churn rate based on these variables. The churn is not given; it has to be calculated, and I can't think of a way of working through this.
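A hedged sketch of how the formula above could be applied per group in pandas without producing hundreds of separate tables. The four categorical column names and the sample rows are hypothetical, and each row's starting subs is assumed to be the previous period's subs within the same group.

```python
import pandas as pd

# Hypothetical data: 4 categorical columns, a date column, and the subs column.
df = pd.DataFrame({
    "date": pd.to_datetime(
        ["2022-01-01", "2022-02-01", "2022-01-01", "2022-02-01"]
    ),
    "region": ["N", "N", "S", "S"],
    "plan": ["basic"] * 4,
    "channel": ["web"] * 4,
    "segment": ["retail"] * 4,
    "subs": [100, 90, 200, 210],
})

cats = ["region", "plan", "channel", "segment"]

# Order each group chronologically so shift(1) returns the previous period.
df = df.sort_values(cats + ["date"]).reset_index(drop=True)

# Subs at the start of the period = subs of the previous period in the same group.
sub_start = df.groupby(cats)["subs"].shift(1)

# churn_rate = (Sub_start_period - Sub_end_period) / Sub_start_period * 100
df["churn_rate"] = (sub_start - df["subs"]) / sub_start * 100
```

The result stays in one long table with a churn_rate column, which can later serve as the prediction target; the first period of each group is NaN because it has no earlier period to compare against.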
I am looking for an algorithm to extract data from one system into another, but on a sliding scale. Here are the details:
Every two weeks, 80 weeks of data needs to be extracted.
Extracts take a long time and are resource intensive so we would like to distribute the load of the extract over time.
The first 8-12 weeks are the most important and need to be updated more often over the two week window. Data further out can be updated less frequently, to the point where the last 40+ weeks could even be extracted just once every two weeks.
Every two weeks, the start date shifts two weeks ahead and so two new weeks are extracted.
Extract procedure takes a start and end date (this is already made and should be treated like a black box). The procedure could be run for multiple date spans in a day if required but contiguous dates are faster than multiple blocks of dates.
Extract blocks should be no smaller than 2 weeks and probably no greater than 16 weeks. Longer blocks are possible, but 16 weeks is already a significant load on the system.
Extracting 4 contiguous weeks of data takes approximately 1 hour. It takes a long time because the data needs to be generated/calculated.
Data that is newly extracted replaces the old data for the timespan. No need to merge or diff the data, it is just replaced.
This algorithm needs to be built into a SQL job which will handle the daily process (triggered once a day only).
My initial thought was to create a sliding schedule: rotate the first 4 week block every second day and the second 4 week block every 3 to 4 days. The rest of the data would be extracted in smaller chunks spread over the two week period.
What I am going to do will work but I wanted to spend some time seeing if there might be a better way to approach the problem. Mainly looking for an algorithm to do the start/end date schedule for the daily extract.
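One way to express such a start/end date schedule is a small tier table, sketched here in Python to show the logic before porting it to a SQL job. The tier boundaries, refresh intervals, and day offsets are illustrative assumptions rather than tuned values, and blocks_for and anchor are hypothetical names. Tiers that fall due on the same day and touch each other are merged, since contiguous dates extract faster.

```python
from datetime import date, timedelta

CYCLE_DAYS = 14

# Assumed tiers: (first_week, n_weeks, refresh_every_n_days, first_cycle_day).
TIERS = [
    (0, 4, 2, 0),      # weeks 0-4: every second day
    (4, 4, 3, 1),      # weeks 4-8: every third day
    (8, 4, 7, 3),      # weeks 8-12: twice per cycle
    (12, 12, 14, 3),   # weeks 12-24: once per cycle, contiguous with the tier above
    (24, 16, 14, 5),   # the remaining tiers run once per cycle on quiet days
    (40, 16, 14, 9),
    (56, 16, 14, 11),
    (72, 8, 14, 13),
]

def blocks_for(today, anchor):
    """Return the (start_date, end_date) spans to extract today, end date inclusive.

    anchor is the start date of any past two week cycle.
    """
    elapsed = (today - anchor).days
    cycle_day = elapsed % CYCLE_DAYS
    window_start = anchor + timedelta(days=elapsed - cycle_day)

    # Week ranges of the tiers that are due on this day of the cycle.
    due = sorted(
        (first_week, first_week + n_weeks)
        for first_week, n_weeks, every, first_day in TIERS
        if cycle_day >= first_day and (cycle_day - first_day) % every == 0
    )

    # Merge ranges that touch, so each one becomes a single contiguous extract.
    merged = []
    for start_w, end_w in due:
        if merged and merged[-1][1] == start_w:
            merged[-1] = (merged[-1][0], end_w)
        else:
            merged.append((start_w, end_w))

    return [
        (window_start + timedelta(weeks=s),
         window_start + timedelta(weeks=e) - timedelta(days=1))
        for s, e in merged
    ]
```

With these assumed tiers, no single day extracts more than 16 weeks of data (roughly 4 hours at the stated rate), every block is between 4 and 16 weeks long, and all 80 weeks are refreshed at least once per cycle. The daily job would call blocks_for with the current date and pass each span to the existing extract procedure.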