Averaging over specified time period - pandas

I opened multiple netCDF files (each one corresponds to an hour, 0-30 h) with xr.open_mfdataset('*.nc'), so now I have an extra dimension (time). Considering one of my variables (u, v, w), I want to average u (time: 30, z: 200, y: 100, x: 100) over 24 hours instead of over the whole time period I have. How can I do that?

To select the first 24 observations along the time dim, you can use .isel, e.g.:
ds24 = ds.isel(time=range(24))
See the xarray docs on indexing and selecting data for more options and examples.
Now, you can average over the time dim:
ds24.mean(dim='time')
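If the time coordinate holds datetimes, label-based selection is an alternative. A sketch, assuming hourly data starting at a hypothetical timestamp and a variable named u as in the question:
ds24 = ds.sel(time=slice('2022-10-23T00', '2022-10-23T23'))  # first 24 hours by label (assumed timestamps)
u_daily = ds24['u'].mean(dim='time')                         # average u over that day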

Here is a simple test case:
#!/usr/bin/env ipython
# ---------------------
import numpy as np
import xarray as xr
import datetime
# ---------------------
# let us make a test series:
N = 30
datesout = np.array([datetime.datetime(2022, 10, 23, 0) + datetime.timedelta(seconds=nn*3600) for nn in range(N)])
serieout = np.array([10 + nn for nn in range(N)])
# let us save netcdf:
dsout = xr.Dataset(
    data_vars=dict(someserie=(["time"], serieout)),
    coords=dict(time=datesout),
    attrs=dict(description="Test data."),
)
dsout.to_netcdf('test.nc')
# ---------------------
# let us compute the daily averages:
with xr.open_dataset('test.nc') as ncin:
    # calculate daily means:
    dfmeanout = ncin.groupby("time.day").mean()
    # keep only the first day:
    dfmeanout = dfmeanout.isel(day=0)
    print('Got value: ', dfmeanout.someserie.values)
# --------------------------------------------
# simple test: the first day covers the first 24 values (10..33)
expected_mean = np.mean([10 + vv for vv in range(24)])
print('Correct mean is: ', expected_mean)
# ---------------------------------------------
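For reference, resample produces the same daily means while keeping a datetime axis; a small sketch reusing the test file written above:
# equivalent daily mean via resample (labels are timestamps, not day-of-month)
with xr.open_dataset('test.nc') as ncin:
    daily = ncin.resample(time='1D').mean()
    print(daily.someserie.values)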


Visualising frequency of events from time series

I have a series like this:
00:00:08,00:00:24,00:00:27,00:00:36,00:00:36,00:00:37,00:00:42,00:00:43,00:00:44,00:00:47,00:00:54,00:00:57,00:00:57,00:01:09,00:01:16,00:01:18,00:01:21,00:01:25,00:01:26,00:01:33,00:01:33,00:01:33,00:01:38,00:01:44,00:01:45,00:01:53,00:01:57,00:02:01,00:02:03,00:02:19,00:02:20,00:02:33,00:02:33,00:02:34,00:02:48,00:02:50,00:03:12,00:03:21,00:03:23,00:03:24,00:03:28,00:03:34,00:03:34,00:03:35,00:03:38,00:03:39,00:03:40,00:03:40,00:03:42,00:03:42,00:03:48,00:03:49,00:03:54,00:03:55,00:04:03,00:04:06,00:04:07,00:04:10,00:04:11,00:04:16,00:04:21,00:04:26,00:04:27,00:04:27,00:04:28,00:04:30,00:04:33,00:04:41,00:04:49,00:04:50,00:04:51,00:04:54,00:04:55,00:04:59,00:05:16,00:05:16,00:05:27,00:05:34,00:05:37,00:05:46,00:05:50,00:05:53,00:06:07,00:06:16,00:06:24,00:06:25,00:06:26,00:06:30,00:06:38,00:06:38,00:06:42,00:06:44,00:06:46,00:06:53,00:07:00,00:07:00
These are times in HH:MM:SS format (a Series in a DataFrame).
I'm interested in finding / visualising the number of data points in (for example) a 10-second window and plotting it as a histogram/barplot.
import datetime
import pandas as pd

# make it a list
time_series = "00:00:08,00:00:24,00:00:27,00:00:36,00:00:36,00:00:37,00:00:42,00:00:43,00:00:44,00:00:47,00:00:54,00:00:57,00:00:57,00:01:09,00:01:16,00:01:18,00:01:21,00:01:25,00:01:26,00:01:33,00:01:33,00:01:33,00:01:38,00:01:44,00:01:45,00:01:53,00:01:57,00:02:01,00:02:03,00:02:19,00:02:20,00:02:33,00:02:33,00:02:34,00:02:48,00:02:50,00:03:12,00:03:21,00:03:23,00:03:24,00:03:28,00:03:34,00:03:34,00:03:35,00:03:38,00:03:39,00:03:40,00:03:40,00:03:42,00:03:42,00:03:48,00:03:49,00:03:54,00:03:55,00:04:03,00:04:06,00:04:07,00:04:10,00:04:11,00:04:16,00:04:21,00:04:26,00:04:27,00:04:27,00:04:28,00:04:30,00:04:33,00:04:41,00:04:49,00:04:50,00:04:51,00:04:54,00:04:55,00:04:59,00:05:16,00:05:16,00:05:27,00:05:34,00:05:37,00:05:46,00:05:50,00:05:53,00:06:07,00:06:16,00:06:24,00:06:25,00:06:26,00:06:30,00:06:38,00:06:38,00:06:42,00:06:44,00:06:46,00:06:53,00:07:00,00:07:00"
time_series = time_series.split(',')
def histogram_from_time_series(_time_series):
    time_list = []
    for item in _time_series:
        t = item.split(":")
        try:
            # has to be a datetime for pd.Grouper to work
            time_list.append(datetime.datetime(year=1970, month=1, day=1,
                                               hour=int(t[-3]), minute=int(t[-2]), second=int(t[-1])))
        except IndexError:
            # entry has no hour part
            time_list.append(datetime.datetime(year=1970, month=1, day=1,
                                               minute=int(t[-2]), second=int(t[-1])))
    _dict = {'Timestamp': time_list}  # make a dict from the list
    _df = pd.DataFrame(_dict)         # make a df from the dict
    _df.insert(0, 'A', 1)             # create a column and fill it with int(1)
    # group at the given frequency and sum the 'A' column, see:
    # https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases
    grouped = _df.groupby(pd.Grouper(key="Timestamp", freq='30S')).sum()
    grouped.plot.bar(grid=True, figsize=(9, 9))  # plot as a bar chart

histogram_from_time_series(time_series)
OK, so I did it myself. Sharing in case someone needs it later.
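A shorter alternative (a sketch, not part of the original answer): parse the strings with pd.to_timedelta and let resample do the binning, e.g. for the 10-second window mentioned in the question:
import pandas as pd

# time_series is the list of 'HH:MM:SS' strings from above
counts = pd.Series(1, index=pd.to_timedelta(time_series)).resample('10s').sum()
counts.plot.bar(grid=True, figsize=(9, 9))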

How to store the dataframe into CSV files based on group - pandas

I have a dataframe like below
id   B   C
 1   2   3
 1   3   4
 2   4   2
 3  12  32
Finally I want to store CSV files 1.csv, 2.csv, 3.csv, each containing all the rows specific to one value of the id column.
Can I do this efficiently? I know it can be done with a for loop, but that is time consuming.
According to the pandas docs, the DataFrame method for writing content to a CSV file is to_csv, and it has no parameter that would do this splitting or speed it up for you.
You can, however, solve this in a single O(n) pass over the groups, since the entire DataFrame is already in memory. By saving the pieces to individual files you can also free memory along the way, dropping each slice once it has been written.
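A minimal sketch of that single pass (assuming the files should land in an export/ directory, as in the benchmark below):
import os

os.makedirs('export', exist_ok=True)
# groupby visits every id group exactly once
for name, subdf in df.groupby('id'):
    subdf.to_csv(f'export/{name}.csv', index=False)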
As suggested by @Lazyer, you can use multiprocessing:
import multiprocessing as mp
import os
import time

import numpy as np
import pandas as pd

def to_csv(name, df):
    df.to_csv(f'export/{name}.csv', index=False)

if __name__ == '__main__':  # Do not remove this line! Mandatory for multiprocessing
    # Set up a minimal reproducible example
    os.makedirs('export', exist_ok=True)
    N = 10_000_000
    rng = np.random.default_rng(2022)
    df = pd.DataFrame(rng.integers(1, 10000, (N, 3)),
                      columns=['id', 'B', 'C'])

    # Multiprocessing
    start = time.time()
    with mp.Pool(mp.cpu_count()) as pool:
        pool.starmap(to_csv, df.groupby('id'))
    end = time.time()
    print(f"[MP] Elapsed time: {end - start:.2f} seconds")

    # Single processing
    start = time.time()
    for name, subdf in df.groupby('id'):
        subdf.to_csv(f'export/{name}.csv', index=False)
    end = time.time()
    print(f"[SP] Elapsed time: {end - start:.2f} seconds")
Test for 10,000,000 records:
[...]$ python mp.py
[MP] Elapsed time: 2.99 seconds
[SP] Elapsed time: 12.97 seconds

Pandas get max delta in a timeseries for a specified period

Given a dataframe with a non-regular time series as an index, I'd like to find the max delta between the values within any period of 10 secs. Here is some plain-numpy code that does the job:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(0)
xs = np.cumsum(np.random.rand(200))
# This function is to create a general situation where the max is not always at the end or beginning
ys = xs**1.2 + 10 * np.sin(xs)
plt.plot(xs, ys, '+-')

threshold = 10
xs_thresh_ind = np.zeros_like(xs, dtype=int)
deltas = np.zeros_like(ys)
for i, x in enumerate(xs):
    # Find indices that lie within the time threshold
    period_end_ind = np.argmax(xs > x + threshold)
    # Only operate when the window is wide enough (this can be treated differently)
    if period_end_ind > 0:
        xs_thresh_ind[i] = period_end_ind
        # Find extrema in the period
        period_min = np.min(ys[i:period_end_ind + 1])
        period_max = np.max(ys[i:period_end_ind + 1])
        deltas[i] = period_max - period_min

max_ind_low = np.argmax(deltas)
max_ind_high = xs_thresh_ind[max_ind_low]
max_delta = deltas[max_ind_low]
print('Max delta {:.2f} is in period x[{}]={:.2f},{:.2f} and x[{}]={:.2f},{:.2f}'
      .format(max_delta, max_ind_low, xs[max_ind_low], ys[max_ind_low],
              max_ind_high, xs[max_ind_high], ys[max_ind_high]))
df = pd.DataFrame(ys, index=xs)
OUTPUT:
Max delta 48.76 is in period x[167]=86.10,200.32 and x[189]=96.14,249.09
Is there an efficient, idiomatic pandas way to achieve something similar?
Create a Series from ys values, indexed by xs - but convert xs to be actual timedelta elements, rather than the float equivalent.
ts = pd.Series(ys, index=pd.to_timedelta(xs, unit="s"))
We want to apply a leading, 10 second window in which we calculate the difference between max and min. Because we want it to be leading, we'll sort the Series in descending order and apply a trailing window.
deltas = ts.sort_index(ascending=False).rolling("10s").agg(lambda s: s.max() - s.min())
Find the maximum delta with deltas[deltas == deltas.max()], which gives
0 days 00:01:26.104797298 48.354851
meaning a delta of 48.35 was found in the interval [86.1, 96.1)
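To read the window start back out in plain seconds, a small sketch on top of the deltas Series from above:
best_start = deltas.idxmax()  # timedelta label where the max delta occurs
print(f'max delta {deltas.max():.2f} in the 10 s window starting at t = {best_start.total_seconds():.1f} s')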

Facebook NeuralProphet - adding holidays

I have one common dataset for my predictions that includes data from across the globe.
ds y country_id
01/01/2021 09:00:00 5.0 1
01/01/2021 09:10:00 5.2 1
01/01/2021 09:20:00 5.4 1
01/01/2021 09:30:00 6.1 1
01/01/2021 09:00:00 2.0 2
01/01/2021 09:10:00 2.2 2
01/01/2021 09:20:00 2.4 2
01/01/2021 09:30:00 3.1 2
playoffs = pd.DataFrame({
    'holiday': 'playoff',
    'ds': pd.to_datetime(['2008-01-13', '2009-01-03', '2010-01-16',
                          '2010-01-24', '2010-02-07', '2011-01-08',
                          '2013-01-12', '2014-01-12', '2014-01-19',
                          '2014-02-02', '2015-01-11', '2016-01-17',
                          '2016-01-24', '2016-02-07']),
    'lower_window': 0,
    'upper_window': 1,
})
superbowls = pd.DataFrame({
    'holiday': 'superbowl',
    'ds': pd.to_datetime(['2010-02-07', '2014-02-02', '2016-02-07']),
    'lower_window': 0,
    'upper_window': 1,
})
holidays = pd.concat((playoffs, superbowls))
Now, I would like to add holidays to the model.
m = NeuralProphet(holidays=holidays)
m.add_country_holidays(country_name='US')
m.fit(df)
How can I add holidays for multiple countries with add_country_holidays (m.add_country_holidays)?
How do I add country-specific holidays to the holidays data?
Do I need a separate model per country, or is one model for the entire dataset fine, to which I then add the regressors? What is the recommendation?
Here is a possible solution:
The program:
# NOTE 1: tested on google colab
# Un-comment the following (!pip) line if you need to install the libraries
# on google colab notebook:
#!pip install neuralprophet pandas numpy holidays
import pandas as pd
import numpy as np
import holidays
from neuralprophet import NeuralProphet
import datetime
# NOTE 2: Most of the code comes from:
# https://neuralprophet.com/html/events_holidays_peyton_manning.html
# Context:
# We will use the time series of the log daily page views for the Wikipedia
# page for Peyton Manning (a former American football quarterback) as an example.
# During playoffs and super bowls, Peyton Manning's wiki page is viewed more
# frequently. We would like to see if country-specific holidays also have an
# influence.
# First, we load the data:
data_location = "https://raw.githubusercontent.com/ourownstory/neuralprophet-data/main/datasets/"
df = pd.read_csv(data_location + "wp_log_peyton_manning.csv")
# To simulate your case, we add a country_id column filled with random values {1,2}
# Let's assume US=1 and Canada=2
np.random.seed(0)
df['country_id'] = np.random.randint(1, 2+1, df['ds'].count())
print("The dataframe we are working on:")
print(df.head())
# We would like to add holidays for US and Canada to see if holidays have an
# influence on the # of daily's views on Manning's wiki page.
# The data in df starts in 2007 and ends in 2016:
StartingYear=2007
LastYear=2016
# Holidays for US and Canada:
US_holidays = holidays.US(years=[year for year in range(StartingYear, LastYear+1)])
CA_holidays = holidays.CA(years=[year for year in range(StartingYear, LastYear+1)])
# Collect the matching rows first (DataFrame.append was removed in pandas 2.0),
# then build the two event DataFrames:
rows_US = []
rows_CA = []
for i in df.index:
    # Convert the date string to a datetime object:
    datetimeobj = [int(x) for x in df['ds'][i].split('-')]
    # Check if the corresponding day is a holiday in the US:
    if df['country_id'][i] == 1 and (datetime.datetime(*datetimeobj) in US_holidays):
        rows_US.append({'ds': df['ds'][i], 'event': 'holiday_US'})
    # Check if the corresponding day is a holiday in Canada:
    if df['country_id'][i] == 2 and (datetime.datetime(*datetimeobj) in CA_holidays):
        rows_CA.append({'ds': df['ds'][i], 'event': 'holiday_CA'})
holidays_US = pd.DataFrame(rows_US, columns=['ds', 'event'])
holidays_CA = pd.DataFrame(rows_CA, columns=['ds', 'event'])
# Now we can drop the country_id in df:
df.drop('country_id', axis=1, inplace=True)
print("Days in df that are holidays in the US:")
print(holidays_US.head())
print()
print("Days in df that are holidays in Canada:")
print(holidays_CA.head())
# user specified events
# history events
playoffs = pd.DataFrame({
    'event': 'playoff',
    'ds': pd.to_datetime([
        '2008-01-13', '2009-01-03', '2010-01-16',
        '2010-01-24', '2010-02-07', '2011-01-08',
        '2013-01-12', '2014-01-12', '2014-01-19',
        '2014-02-02', '2015-01-11', '2016-01-17',
        '2016-01-24', '2016-02-07',
    ]),
})
superbowls = pd.DataFrame({
    'event': 'superbowl',
    'ds': pd.to_datetime([
        '2010-02-07', '2012-02-05', '2014-02-02',
        '2016-02-07',
    ]),
})
# Create the events_df:
events_df = pd.concat((playoffs, superbowls, holidays_US, holidays_CA))
# Create neural network and fit:
# NeuralProphet Object
m = NeuralProphet(loss_func="MSE")
m = m.add_events("playoff")
m = m.add_events("superbowl")
m = m.add_events("holiday_US")
m = m.add_events("holiday_CA")
# create the data df with events
history_df = m.create_df_with_events(df, events_df)
# fit the model
metrics = m.fit(history_df, freq="D")
# forecast with events known ahead
future = m.make_future_dataframe(df=history_df, events_df=events_df, periods=365, n_historic_predictions=len(df))
forecast = m.predict(df=future)
fig = m.plot(forecast)
fig_param = m.plot_parameters()
fig_comp = m.plot_components(forecast)
RESULT:
The results (see the PARAMETERS figure) seem to show that when a day is a holiday, there are fewer views in both the US and Canada. Does it make sense? Maybe... it looks plausible that people on holiday have more interesting things to do than browse Manning's wiki page :-) I don't know.
PROGRAM'S OUTPUT:
The dataframe we are working on:
ds y country_id
0 2007-12-10 9.5908 1
1 2007-12-11 8.5196 2
2 2007-12-12 8.1837 2
3 2007-12-13 8.0725 1
4 2007-12-14 7.8936 2
Days in df that are holidays in the US:
ds event
0 2007-12-25 holiday_US
1 2008-01-21 holiday_US
2 2008-07-04 holiday_US
3 2008-11-27 holiday_US
4 2008-12-25 holiday_US
Days in df that are holidays in Canada:
ds event
0 2008-01-01 holiday_CA
1 2008-02-18 holiday_CA
2 2008-08-04 holiday_CA
3 2008-09-01 holiday_CA
4 2008-10-13 holiday_CA
INFO - (NP.utils.set_auto_seasonalities) - Disabling daily seasonality. Run NeuralProphet with daily_seasonality=True to override this.
INFO - (NP.config.set_auto_batch_epoch) - Auto-set batch_size to 32
INFO - (NP.config.set_auto_batch_epoch) - Auto-set epochs to 138
88%
241/273 [00:02<00:00, 121.69it/s]
INFO - (NP.utils_torch.lr_range_test) - lr-range-test results: steep: 3.36E-02, min: 1.51E+00
88%
241/273 [00:02<00:00, 123.87it/s]
INFO - (NP.utils_torch.lr_range_test) - lr-range-test results: steep: 3.36E-02, min: 1.63E+00
89%
242/273 [00:02<00:00, 121.58it/s]
INFO - (NP.utils_torch.lr_range_test) - lr-range-test results: steep: 3.62E-02, min: 2.58E+00
INFO - (NP.forecaster._init_train_loader) - lr-range-test selected learning rate: 3.44E-02
Epoch[138/138]: 100%|██████████| 138/138 [00:29<00:00, 4.74it/s, MSELoss=0.012, MAE=0.344, RMSE=0.478, RegLoss=0]
The figures produced by the program: FORECASTS, PARAMETERS, and COMPONENTS (plots not reproduced here).

Financial time series: python Matplotlib "specgram" y-axis displaying Period instead of Frequency

Python Matplotlib's specgram displays a heatmap of frequency (y-axis) vs. time (x-axis), which is useful for time series analysis, but I would like the y-axis expressed as period (= 1/frequency) rather than frequency. Does anyone have a complete working solution to achieve this?
The Python code immediately below generates my original plot using specgram, together with (currently commented out) a comparison against the suggested solution based on mlab.specgram. That suggestion makes the conversion from frequency to period = 1/frequency easy, but it does not produce a viable plot for my example.
from __future__ import division
from datetime import datetime
import numpy as np
from pandas import DataFrame, Series
# pandas.io.data has been removed from pandas; the replacement lives in the
# separate pandas-datareader package (pip install pandas-datareader):
import pandas_datareader.data as web
import pandas as pd
from pylab import plot, show, subplot, specgram
import matplotlib.mlab as mlab
import matplotlib.pyplot as plt
################################################
# obtain data:
ticker = "SPY"
source = "google"  # the Google feed has since been discontinued; another source, e.g. "stooq", may work instead
start_date = datetime(1999, 1, 1)
end_date = datetime(2012, 1, 1)
qt = web.DataReader(ticker, source, start_date, end_date)
qtC = qt.Close
################################################
data = qtC
fs = 1 # 1 sample / day
nfft = 128
# display the time-series data
fig = plt.figure()
ax1 = fig.add_subplot(311)
ax1.plot(range(len(data)),data)
#----------------
# Original version
##################
# specgram (NOT mlab.specgram) --> plots directly, but in frequency space
# (we want the plot in period, not frequency).
ax2 = fig.add_subplot(212)
# pyplot's specgram returns four values (the last one is the image):
spec, freq, t, im = specgram(data, NFFT=nfft, Fs=fs, noverlap=0)
#----------------
"""
# StackOverflow version (with minor changes to axis titles)
########################
# calculate the spectrogram
spec, freq, t = mlab.specgram(data, NFFT=nfft, Fs=fs, noverlap=0)
# calculate the bin limits in time (x dir)
# note that there are n+1 fence posts
dt = t[1] - t[0]
t_edge = np.empty(len(t) + 1)
t_edge[:-1] = t - dt / 2.
# however, due to the way the spectrogram is calculated, the first and last bins
# are a bit different:
t_edge[0] = 0
t_edge[-1] = t_edge[0] + len(data) / fs
# calculate the frequency bin limits:
df = freq[1] - freq[0]
freq_edge = np.empty(len(freq) + 1)
freq_edge[:-1] = freq - df / 2.
freq_edge[-1] = freq_edge[-2] + df
# calculate the period bin limits, omit the zero frequency bin
p_edge = 1. / freq_edge[1:]
# we'll plot both
ax2 = fig.add_subplot(312)
ax2.pcolormesh(t_edge, freq_edge, spec)
ax2.set_ylim(0, fs/2)
ax2.set_ylabel('freq.[day^-1]')
ax3 = fig.add_subplot(313)
# note that the period has to be inverted both in the vector and the spectrum,
# as pcolormesh wants to have a positive difference between samples
ax3.pcolormesh(t_edge, p_edge[::-1], spec[:0:-1])
#ax3.set_ylim(0, 100/fs)
ax3.set_ylim(0, nfft)
ax3.set_xlabel('t [days]')
ax3.set_ylabel('period [days]')
"""
If you are only asking how to display the spectrogram differently, then it is actually rather straightforward.
One thing to note is that there are two functions called specgram: matplotlib.pyplot.specgram and matplotlib.mlab.specgram. The difference between these two is that the former draws a spectrogram whereas the latter only calculates one (and that's what we want).
The only slightly tricky thing is to calculate the colour mesh rectangle edge positions. We get the following from the specgram:
t: centerpoints in time
freq: frequency centers of the bins
For the time dimension it is easy to calculate the bin limits from the centers:
t_edge[n] = t[0] + (n - .5) * dt, where dt is the time difference of two consecutive bins
It would be similarly simple for frequencies:
f_edge[n] = freq[0] + (n - .5) * df
but we want to use the period instead of frequency. This makes the first bin unusable, and we'll have to toss the DC component away.
A bit of code:
import matplotlib.mlab as mlab
import matplotlib.pyplot as plt
import numpy as np
# create some data: (fs = sampling frequency)
fs = 2000.
ts = np.arange(10000) / fs
sig = np.sin(500 * np.pi * ts)
sig[5000:8000] += np.sin(200 * np.pi * (ts[5000:8000] + 0.0005 * np.random.random(3000)))
# calculate the spectrogram
spec, freq, t = mlab.specgram(sig, Fs=fs)
# calculate the bin limits in time (x dir)
# note that there are n+1 fence posts
dt = t[1] - t[0]
t_edge = np.empty(len(t) + 1)
t_edge[:-1] = t - dt / 2.
# however, due to the way the spectrogram is calculated, the first and last bins
# are a bit different:
t_edge[0] = 0
t_edge[-1] = t_edge[0] + len(sig) / fs
# calculate the frequency bin limits:
df = freq[1] - freq[0]
freq_edge = np.empty(len(freq) + 1)
freq_edge[:-1] = freq - df / 2.
freq_edge[-1] = freq_edge[-2] + df
# calculate the period bin limits, omit the zero frequency bin
p_edge = 1. / freq_edge[1:]
# we'll plot both
fig = plt.figure()
ax1 = fig.add_subplot(211)
ax1.pcolormesh(t_edge, freq_edge, spec)
ax1.set_ylim(0, fs/2)
ax1.set_ylabel('frequency [Hz]')
ax2 = fig.add_subplot(212)
# note that the period has to be inverted both in the vector and the spectrum,
# as pcolormesh wants to have a positive difference between samples
ax2.pcolormesh(t_edge, p_edge[::-1], spec[:0:-1])
ax2.set_ylim(0, 100/fs)
ax2.set_xlabel('t [s]')
ax2.set_ylabel('period [s]')
This gives a figure with the spectrogram plotted against frequency (top panel) and against period (bottom panel).