I have a DataFrame like the one in the photo (https://i.stack.imgur.com/5u3WR.png), and I would like each grid point to have the same time series (repeated over and over again), namely:
t_index_np = np.arange('2013-01-01', '2022-12-31 23:00:00', dtype='datetime64[h]')
The frequency is hourly.
Take into account that many grid points have only one associated date.
What I have tried so far is a for loop with resample and pd.merge, but that doesn't work for such points (those with data for only one date). The Total Power column must be forward-filled.
Thanks in advance!
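A minimal sketch of one way to do this, assuming the DataFrame is called df and has columns 'grid_point', 'time' and 'Total Power' (names guessed from the screenshot, so rename them to match your data): reindex each grid point onto the full hourly index and forward-fill.

import numpy as np
import pandas as pd

# The full hourly index from the question.
t_index = pd.DatetimeIndex(
    np.arange('2013-01-01', '2022-12-31 23:00:00', dtype='datetime64[h]'),
    name='time',
)

# For every grid point, put 'Total Power' onto the full hourly index and
# forward-fill it; this also works for points that have only a single date.
out = (df.set_index('time')
         .groupby('grid_point')['Total Power']
         .apply(lambda s: s.reindex(t_index).ffill())
         .reset_index())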
Related
I am creating a city-by-day panel from scratch, but I'm having trouble balancing and filling in the data. Every city needs to have an observation every day between 01jan2000 and 31dec2019; my variable of interest is a dummy recording whether or not an event took place on that day in that city.
My original dataset only recorded observations where event == 1, and I managed to fill in time gaps using tsfill, but I can't figure out how to balance the data or extend it to span 01jan2000 to 31dec2019. I need every date and city because eventually it will be merged with data that uses that sample period.
My current approach is to create a balanced and filled-in panel and then merge the event data on the date it took place. I have a Stata dataset containing the 7,305 dates, and another containing the 273 cityids I'm observing. Is it possible to generate a new dataset that combines these two so that all 273 cities are observed every day? Essentially there would be 273 x 7,305 observations, with no variables of interest yet.
Any help figuring out how to solve the unbalanced issue using either of these approaches is hugely appreciated.
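The question itself is about Stata, but as an illustration of the cross-join idea (pair every city with every date, then left-merge the event data), here is a rough pandas sketch; all column and variable names in it are assumptions.

import pandas as pd

# 7,305 days and 273 city ids, as in the question.
dates = pd.DataFrame({'date': pd.date_range('2000-01-01', '2019-12-31', freq='D')})
cities = pd.DataFrame({'cityid': range(1, 274)})

# Cartesian product: every city observed on every date (273 x 7,305 rows).
panel = cities.merge(dates, how='cross')

# Bring in the event data (assumed to contain only event == 1 rows, with
# columns 'cityid', 'date', 'event') and treat missing matches as no event.
panel = panel.merge(events, on=['cityid', 'date'], how='left')
panel['event'] = panel['event'].fillna(0).astype(int)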
This is my first time asking a question here, so if I'm doing something wrong please guide me to the right place. I have a big, clean dataset (29,000+ rows, 24 columns). I have to calculate the churn rate based on 4 different categorical columns, and I'm given just 1 column that contains the subs for a given period. I have a date column too. My idea for calculating the churn is:
churn_rate = (Sub_start_period - Sub_end_period) / Sub_start_period * 100
The Problem
I don't know how to group the data using these 4 different categorical variables. Even if I manage to do so, I would end up with more than 200 different tables, so I don't believe that would be a good approach.
My goal is to be able to predict the churn rate using the information in the table, but first I need to determine the churn rate based on these variables. The churn is not given; it has to be calculated, and I can't think of a way to work through this.
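One way to avoid 200+ separate tables is to keep everything in a single grouped table, grouping by the four categorical columns plus a period column, and then compare the start and end periods. A rough sketch, where 'cat1' to 'cat4', 'date' and 'subs' are assumed column names and the two periods are hypothetical:

import pandas as pd

# Derive a period column from the date column (assumed to be datetime already).
df['period'] = df['date'].dt.to_period('M')

# Subscribers per category combination and period, all in one table.
subs = df.groupby(['cat1', 'cat2', 'cat3', 'cat4', 'period'])['subs'].sum()

start = subs.xs(pd.Period('2021-01', 'M'), level='period')   # hypothetical start period
end = subs.xs(pd.Period('2021-12', 'M'), level='period')     # hypothetical end period

# churn_rate = (Sub_start_period - Sub_end_period) / Sub_start_period * 100
churn = ((start - end) / start * 100).rename('churn_rate').reset_index()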
I have a dataset of energy demand for each user profile, sampled every half hour on different day types (for example, Weekday in Summer or Saturday in Winter). What I would like to do is extend this to every day of a year using the existing data, with some randomness here and there.
What methods do you recommend for time series extrapolation?
The dataset I'm working on is here:
https://data.ukedc.rl.ac.uk/browse/edc/efficiency/residential/LoadProfile/data
You may use imputeTS or pandas interpolate to fill missing values in between the days (i.e., imputation/interpolation) and a forecasting method to fill future values (forecasting/extrapolation).
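As a minimal pandas sketch of the interpolation part (the column names 'timestamp' and 'demand' are assumptions, not from the dataset above):

import pandas as pd

# Half-hourly demand series with gaps between the observed day types.
s = df.set_index('timestamp')['demand']

# Put it on a regular 30-minute grid; slots with no observation become NaN.
s = s.resample('30min').mean()

# Interpolate the gaps in between the observed days.
s = s.interpolate(method='time')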
Please see the attached image below.
I have a data set which records a value for each time instance (Time Stamp).
Now, I want a single record (for each column) for any given time. For example, the red box I have drawn (for 'Time Stamp' 23054350) should be summed into a single 'SMS in', 'SMS out', 'Call in', etc.
A similar example can be seen for the other 'Time Stamp' values. Note that all instances of the same time should be summed together.
I know I can run a loop to solve this problem, but my data is very large and I have multiple files (with a lot of data per file), so a loop is very inefficient. Can I do it in a quicker, vectorized way?
This should work
df.groupby(['Grid ID', 'Time Stamp'], as_index=False).sum()
Try this
df.groupby('Time Stamp').sum()
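For a self-contained illustration of the groupby-and-sum approach above, here is a toy example (values made up, column names taken from the question):

import pandas as pd

# Toy data mimicking the structure in the question (repeated 'Time Stamp' values).
df = pd.DataFrame({
    'Grid ID':    [1, 1, 1, 2],
    'Time Stamp': [23054350, 23054350, 23054360, 23054350],
    'SMS in':     [1.0, 2.0, 3.0, 4.0],
    'SMS out':    [0.5, 0.5, 1.0, 2.0],
    'Call in':    [2.0, 1.0, 0.0, 3.0],
})

# One row per (Grid ID, Time Stamp) with the numeric columns summed,
# a vectorized alternative to looping over the rows.
out = df.groupby(['Grid ID', 'Time Stamp'], as_index=False).sum()
print(out)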
Please see the code example and output below. Using groupby I am able to identify the max "Ask Volume Bid Volume Total" value for each day.
I also want to see what time of day this happened for each day. Basically I don't want to lose the time from the timestamp. How do I do this please?
Use idxmax on the groupby object to index back into your original df so you can see the full resolution timestamps associated with those max values:
data.loc[grouped['Ask Volume Bid Volume Total'].idxmax()]
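For a self-contained illustration of the pattern, with toy intraday data and the column name taken from the question:

import pandas as pd

# Toy intraday data with a full timestamp per row.
data = pd.DataFrame({
    'timestamp': pd.to_datetime(['2023-01-02 09:30', '2023-01-02 14:00',
                                 '2023-01-03 10:15', '2023-01-03 16:45']),
    'Ask Volume Bid Volume Total': [100, 250, 300, 120],
})

# Group by calendar day; idxmax returns the row label of each day's maximum,
# and .loc pulls those full rows back out, keeping the exact timestamps.
grouped = data.groupby(data['timestamp'].dt.date)
print(data.loc[grouped['Ask Volume Bid Volume Total'].idxmax()])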