resample time series on uniform interval in numpy/scipy? - numpy

I have a random variable X sampled at random times T similar to this toy data:
import numpy as np
T = np.random.exponential(size=1000).cumsum()
X = np.random.normal(size=1000)
This timeseries looks like this:
A key point is that the sampling interval is non-uniform: by this I mean that all elements of np.diff(T) are not equal. I need to resample the timeseries T,X on uniform intervals with a specified width dt, meaning (np.diff(T)==dt).all() should return True.
I can resample the timeseries on uniform intervals using scipy.interpolate.interp1d, but this method does not allow me to specify the interval size dt:
from scipy.interpolate import interp1d
T = np.linspace(T.min(),T.max(),T.size) # same range and size with a uniform interval
F = interp1d(T,X,fill_value='extrapolate') # resample the series on uniform interval
X = F(T) # Now it's resampled.
The essential issue is that interp1d does not accept an array T unless T.size==X.size.
Is there another method I can try to resample the time series T,X on uniform intervals of width dt?

dt = ...
from scipy.interpolate import interp1d
Told = np.arange(T.min(),T.max(),T.size)
F = interp1d(Told,X,fill_value='extrapolate')
Tnew = np.linspace(T.min(), T.max(), dt)
Xnew = F(Tnew)

Related

Pandas get max delta in a timeseries for a specified period

Given a dataframe with a non-regular time series as an index, I'd like to find the max delta between the values for a period of 10 secs. Here is some code that does the same thing:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(0)
xs = np.cumsum(np.random.rand(200))
# This function is to create a general situation where the max is not aways at the end or beginning
ys = xs**1.2 + 10 * np.sin(xs)
plt.plot(xs, ys, '+-')
threshold = 10
xs_thresh_ind = np.zeros_like(xs, dtype=int)
deltas = np.zeros_like(ys)
for i, x in enumerate(xs):
# Find indices that lie within the time threshold
period_end_ind = np.argmax(xs > x + threshold)
# Only operate when the window is wide enough (this can be treated differently)
if period_end_ind > 0:
xs_thresh_ind[i] = period_end_ind
# Find extrema in the period
period_min = np.min(ys[i:period_end_ind + 1])
period_max = np.max(ys[i:period_end_ind + 1])
deltas[i] = period_max - period_min
max_ind_low = np.argmax(deltas)
max_ind_high = xs_thresh_ind[max_ind_low]
max_delta = deltas[max_ind_low]
print(
'Max delta {:.2f} is in period x[{}]={:.2f},{:.2f} and x[{}]={:.2f},{:.2f}'
.format(max_delta, max_ind_low, xs[max_ind_low], ys[max_ind_low],
max_ind_high, xs[max_ind_high], ys[max_ind_high]))
df = pd.DataFrame(ys, index=xs)
OUTPUT:
Max delta 48.76 is in period x[167]=86.10,200.32 and x[189]=96.14,249.09
Is there an efficient pandaic way to achieve something similar?
Create a Series from ys values, indexed by xs - but convert xs to be actual timedelta elements, rather than the float equivalent.
ts = pd.Series(ys, index=pd.to_timedelta(xs, unit="s"))
We want to apply a leading, 10 second window in which we calculate the difference between max and min. Because we want it to be leading, we'll sort the Series in descending order and apply a trailing window.
deltas = ts.sort_index(ascending=False).rolling("10s").agg(lambda s: s.max() - s.min())
Find the maximum delta with deltas[deltas == deltas.max()], which gives
0 days 00:01:26.104797298 48.354851
meaning a delta of 48.35 was found in the interval [86.1, 96.1)

How to change a seaborn histogram plot to work for hours of the day?

I have a pandas dataframe with lots of time intervals of varying start times and lengths. I am interested in the distribution of start times over 24hours. I therefore have another column entitled Hour with just that in. I have plotted a histogram using seaborn to look at the distribution but obviously the x axis starts at 0 and runs to 24. I wonder if there is a way to change so it runs from 8 to 8 and loops over at 23 to 0 so it provides a better visualisation of my data from a time perspective. Thanks in advance.
sns.distplot(df2['Hour'], bins = 24, kde = False).set(xlim=(0,23))
If you want to have a custom order of x-values on your bar plot, I'd suggest using matplotlib directly and plot your histogram simply as a bar plot with width=1 to get rid of padding between bars.
import pandas as pd
import numpy as np
from datetime import datetime
import matplotlib.pyplot as plt
# prepare sample data
dates = pd.date_range(
start=datetime(2020, 1, 1),
end=datetime(2020, 1, 7),
freq="H")
random_dates = np.random.choice(dates, 1000)
df = pd.DataFrame(data={"date":random_dates})
df["hour"] = df["date"].dt.hour
# set your preferred order of hours
hour_order = [8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,0,1,2,3,4,5,6,7]
# calculate frequencies of each hour and sort them
plot_df = (
df["hour"]
.value_counts()
.rename_axis("hour", axis=0)
.reset_index(name="freq")
.set_index("hour")
.loc[hour_order]
.reset_index())
# day / night colour split
day_mask = ((8 <= plot_df["hour"]) & (plot_df["hour"] <= 20))
plot_df["color"] = np.where(day_mask, "skyblue", "midnightblue")
# actual plotting - note that you have to cast hours as strings
fig = plt.figure(figsize=(8,4))
ax = fig.add_subplot(111)
ax.bar(
x=plot_df["hour"].astype(str),
height=plot_df["freq"],
color=plot_df["color"], width=1)
ax.set_xlabel('Hour')
ax.set_ylabel('Frequency')
plt.show()

What is fastest way to compute quantile over grouped dataframe?

I am creating monthly diurnal plots from pandas dataframe. I need to plot mean, median or any quantile. I am able to achieve it correctly, but with large data, quantile computation is way slower than mean or median computation. Is there any faster way to achieve this?
import pandas as pd
import numpy as np
import datetime as dt
date_range = pd.date_range(start=dt.datetime(2018,1,1,00,00), end=dt.datetime(2018,12,31,23,59), freq='1min')
N = len(date_range)
df = pd.DataFrame({'Test': np.random.rand(N)}, index=date_range)
df['Time'] = df.index.time
df['Month'] = df.index.month
time_mean_median = dt.datetime(2019,1,1,0,0,0)
time_qunatiles = dt.datetime(2019,1,1,0,0,0)
for i in range(12):
df_month = df[['Test', 'Time']].loc[df['Month'] == i + 1]
start_time = dt.datetime.now()
df1_group = df[['Test', 'Time']].groupby('Time').agg([np.mean, np.median])
time_mean_median += dt.datetime.now()-start_time
quantiles = [0.23, 0.72]
start_time = dt.datetime.now()
df2_group = df[['Test', 'Time']].groupby('Time').quantile(q=quantiles).unstack()
time_qunatiles += dt.datetime.now() - start_time
print('Mean/median computation time {}'.format(time_mean_median.time()))
print('Quantile computation time {}'.format(time_qunatiles.time()))
In this example I get mean/median total computation time around 0.7 seconds, compare to almost 12 second with quantile computation.

Financial time series: python Matplotlib "specgram" y-axis displaying Period instead of Frequency

python Matplotlib's "specgram" display of a heatmap showing frequency (y-axis) vs. time (x-axis) is useful for time series analysis, but I would like to have the y-axis displayed in terms of Period (= 1/frequency), rather than frequency. I am still asking if anyone has a complete working solution to achieve this?
The immediately following python code generates the author's original plot using "specgram" and (currently commented out) a comparison with the suggested solution that was offered using "mlab.specgram". This suggested solution succeeds with the easy conversion from frequency to period = 1/frequency, but does not generate a viable plot for the authors example.
from __future__ import division
from datetime import datetime
import numpy as np
from pandas import DataFrame, Series
import pandas.io.data as web
import pandas as pd
from pylab import plot,show,subplot,specgram
import matplotlib.mlab as mlab
import matplotlib.pyplot as plt
################################################
# obtain data:
ticker = "SPY"
source = "google"
start_date = datetime(1999,1,1)
end_date = datetime(2012,1,1)
qt = web.DataReader(ticker, source, start_date, end_date)
qtC = qt.Close
################################################
data = qtC
fs = 1 # 1 sample / day
nfft = 128
# display the time-series data
fig = plt.figure()
ax1 = fig.add_subplot(311)
ax1.plot(range(len(data)),data)
#----------------
# Original version
##################
# specgram (NOT mlab.specgram) --> gives direct plot, but in Frequency space (want plot in Period, not freq).
ax2 = fig.add_subplot(212)
spec, freq, t = specgram(data, NFFT=nfft, Fs=fs, noverlap=0)
#----------------
"""
# StackOverflow version (with minor changes to axis titles)
########################
# calcuate the spectrogram
spec, freq, t = mlab.specgram(data, NFFT=nfft, Fs=fs, noverlap=0)
# calculate the bin limits in time (x dir)
# note that there are n+1 fence posts
dt = t[1] - t[0]
t_edge = np.empty(len(t) + 1)
t_edge[:-1] = t - dt / 2.
# however, due to the way the spectrogram is calculates, the first and last bins
# a bit different:
t_edge[0] = 0
t_edge[-1] = t_edge[0] + len(data) / fs
# calculate the frequency bin limits:
df = freq[1] - freq[0]
freq_edge = np.empty(len(freq) + 1)
freq_edge[:-1] = freq - df / 2.
freq_edge[-1] = freq_edge[-2] + df
# calculate the period bin limits, omit the zero frequency bin
p_edge = 1. / freq_edge[1:]
# we'll plot both
ax2 = fig.add_subplot(312)
ax2.pcolormesh(t_edge, freq_edge, spec)
ax2.set_ylim(0, fs/2)
ax2.set_ylabel('freq.[day^-1]')
ax3 = fig.add_subplot(313)
# note that the period has to be inverted both in the vector and the spectrum,
# as pcolormesh wants to have a positive difference between samples
ax3.pcolormesh(t_edge, p_edge[::-1], spec[:0:-1])
#ax3.set_ylim(0, 100/fs)
ax3.set_ylim(0, nfft)
ax3.set_xlabel('t [days]')
ax3.set_ylabel('period [days]')
"""
If you are only asking how to display the spectrogram differently, then it is actually rather straightforward.
One thing to note is that there are two functions called specgram: matplotlib.pyplot.specgram and matplotlib.mlab.specgram. The difference between these two is that the former draws a spectrogram wheras the latter only calculates one (and that's what we want).
The only slightly tricky thing is to calculate the colour mesh rectangle edge positions. We get the following from the specgram:
t: centerpoints in time
freq: frequency centers of the bins
For the time dimension it is easy to calculate the bin limits by the centers:
t_edge[n] = t[0] + (n - .5) * dt, where dt is the time difference of two consecutive bins
It would be similarly simple for frequencies:
f_edge[n] = freq[0] + (n - .5) * df
but we want to use the period instead of frequency. This makes the first bin unusable, and we'll have to toss the DC component away.
A bit of code:
import matplotlib.mlab as mlab
import matplotlib.pyplot as plt
import numpy as np
# create some data: (fs = sampling frequency)
fs = 2000.
ts = np.arange(10000) / fs
sig = np.sin(500 * np.pi * ts)
sig[5000:8000] += np.sin(200 * np.pi * (ts[5000:8000] + 0.0005 * np.random.random(3000)))
# calcuate the spectrogram
spec, freq, t = mlab.specgram(sig, Fs=fs)
# calculate the bin limits in time (x dir)
# note that there are n+1 fence posts
dt = t[1] - t[0]
t_edge = np.empty(len(t) + 1)
t_edge[:-1] = t - dt / 2.
# however, due to the way the spectrogram is calculates, the first and last bins
# a bit different:
t_edge[0] = 0
t_edge[-1] = t_edge[0] + len(sig) / fs
# calculate the frequency bin limits:
df = freq[1] - freq[0]
freq_edge = np.empty(len(freq) + 1)
freq_edge[:-1] = freq - df / 2.
freq_edge[-1] = freq_edge[-2] + df
# calculate the period bin limits, omit the zero frequency bin
p_edge = 1. / freq_edge[1:]
# we'll plot both
fig = plt.figure()
ax1 = fig.add_subplot(211)
ax1.pcolormesh(t_edge, freq_edge, spec)
ax1.set_ylim(0, fs/2)
ax1.set_ylabel('frequency [Hz]')
ax2 = fig.add_subplot(212)
# note that the period has to be inverted both in the vector and the spectrum,
# as pcolormesh wants to have a positive difference between samples
ax2.pcolormesh(t_edge, p_edge[::-1], spec[:0:-1])
ax2.set_ylim(0, 100/fs)
ax2.set_xlabel('t [s]')
ax2.set_ylabel('period [s]')
This gives:

Getting usable dates from Axes.get_xlim() in a pandas time series plot

I'm trying to get the xlimits of a plot as a python datetime object from a time series plot created with pandas. Using ax.get_xlim() returns the axis limits as a numpy.float64, and I can't figure out how to convert the numbers to a usable datetime.
import pandas
from matplotlib import dates
import matplotlib.pyplot as plt
from datetime import datetime
from numpy.random import randn
ts = pandas.Series(randn(10000), index=pandas.date_range('1/1/2000',
periods=10000, freq='H'))
ts.plot()
ax = plt.gca()
ax.set_xlim(datetime(2000,1,1))
d1, d2 = ax.get_xlim()
print "%s(%s) to %s(%s)" % (d1, type(d1), d2, type(d2))
print "Using matplotlib: %s" % dates.num2date(d1)
print "Using datetime: %s" % datetime.fromtimestamp(d1)
which returns:
262968.0 (<type 'numpy.float64'>) to 272967.0 (<type 'numpy.float64'>)
Using matplotlib: 0720-12-25 00:00:00+00:00
Using datetime: 1970-01-03 19:02:48
According to the pandas timeseries docs, pandas uses the numpy.datetime64 dtype. I'm using pandas version '0.9.0'.
I am using get_xlim() instead directly accessing the pandas series because I am using the xlim_changed callback to do other things when the user moves around in the plot area.
Hack to get usable values
For the above example, the limits are returned in hours since the Epoch. So I can convert to seconds since the Epoch and use time.gmtime() to get somewhere usable, but this still doesn't feel right.
In [66]: d1, d2 = ax.get_xlim()
In [67]: time.gmtime(d1*60*60)
Out[67]: time.struct_time(tm_year=2000, tm_mon=1, tm_mday=1, tm_hour=0, tm_min=0, tm_sec=0, tm_wday=5, tm_yday=1, tm_isdst=0)
The current behavior of matplotlib.dates:
datetime objects are converted to floating point numbers which represent time in days since 0001-01-01 UTC, plus 1. For example, 0001-01-01, 06:00 is 1.25, not 0.25. The helper functions date2num(), num2date() and drange() are used to facilitate easy conversion to and from datetime and numeric ranges.
pandas.tseries.converter.PandasAutoDateFormatter() seems to build on this, so:
x = pandas.date_range(start='01/01/2000', end='01/02/2000')
plt.plot(x, x)
matplotlib.dates.num2date(plt.gca().get_xlim()[0])
gives:
datetime.datetime(2000, 1, 1, 0, 0, tzinfo=<matplotlib.dates._UTC object at 0x7ff73a60f290>)
# First convert to pandas Period
period = pandas.tseries.period.Period(ordinal=int(d1), freq=ax.freq)
# Then convert to pandas timestamp
ts = period.to_timestamp()
# Then convert to date object
dt = ts.to_datetime()