Apply Butterworth filters to time series data - frequency

I have time series data measured in milliseconds (ms) (69300 rows), and I want to apply low-pass, high-pass, and band-pass Butterworth filters.
My approach is the following:
1. Convert ms into Hz.
2. Find the Nyquist frequency, which is sample rate / 2 (as the sample rate I take the converted Hz value).
3. Calculate the sinusoid + noise.
4. Calculate the cutoff frequencies for the low-pass and high-pass filters (take 0.1 of the total Hz divided by the Nyquist value for the low-pass, and 0.25 of the total Hz divided by the Nyquist value for the high-pass).
5. For the band-pass filter, use the low and high cutoff frequencies as the band edges.
6. Apply the nth-order filters.
7. Pass the sinusoid + noise through the filters.
Below is a code snippet I wrote in R:
library(signal)

# 69300 ms = 69.3 s, so one cycle over the record is 1/69.3 ≈ 0.014430014430014 Hz
x <- 1:69300
fs <- 0.014430014430014
nyquist <- fs/2  # sampling rate / 2
# RF is the time series metric (a numeric vector of length 69300)
x1 <- sin(2*pi*RF*fs) + 0.25*rnorm(length(RF))  # 0.014430014430014 Hz sinusoid + noise
f_low <- 0.001443001/nyquist   # 0.1 of total Hz divided by Nyquist
f_high <- 0.003607504/nyquist  # 0.25 of total Hz divided by Nyquist
bf_low <- butter(4, f_low, type = "low")
bf_high <- butter(4, f_high, type = "high")
bf_pass <- butter(4, c(f_low, f_high), type = "pass")  # band edges c(low, high), not their difference
b <- filter(bf_low, x1)
b1 <- filter(bf_high, x1)
b2 <- filter(bf_pass, x1)
Is this the correct approach? And should I apply the filters to the metric itself instead of to the sinusoid + noise?
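For comparison, here is a minimal sketch of the same three filters in Python with scipy.signal. The sampling rate (1000 Hz, i.e. one sample per millisecond), the 3 Hz test sinusoid, and the 1 Hz / 5 Hz cutoffs are all illustrative assumptions, not values from the question:
import numpy as np
from scipy.signal import butter, filtfilt

fs = 1000.0               # assumed sampling rate: one sample per millisecond
nyquist = fs / 2.0
f_low, f_high = 1.0, 5.0  # hypothetical cutoff frequencies in Hz

t = np.arange(69300) / fs
x1 = np.sin(2 * np.pi * 3.0 * t) + 0.25 * np.random.randn(len(t))  # 3 Hz sinusoid + noise

b, a = butter(4, f_low / nyquist, btype="low")
low = filtfilt(b, a, x1)
b, a = butter(4, f_high / nyquist, btype="high")
high = filtfilt(b, a, x1)
b, a = butter(4, [f_low / nyquist, f_high / nyquist], btype="band")  # band-pass takes both edges
band = filtfilt(b, a, x1)
Note that butter normalizes cutoffs by the Nyquist frequency, and that the band-pass variant takes the pair of band edges, not their difference.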

Related

How to calculate approximate Fourier coefficients using np.trapz

I have a dataset which looks roughly as follows (and is sinusoidal in nature):
TW-240-run1.txt
Point Number Temperature
0 51.504781
1 51.487722
2 51.487722
3 51.828893
4 51.828893
5 51.436547
6 51.368312
7 51.726542
8 51.368312
9 51.317137
10 51.317137
11 51.283020
12 51.590073
.
.
.
9599 51.675366
I am tasked with finding the fundamental/first Fourier coefficients, a_n and b_n, for this dataset by means of a numerical integration technique. In this case, I am simply using numpy.trapz, which implements the trapezium rule. The Fourier coefficients a_n and b_n can be calculated with the following formulae:
a_n = (2/tau) * integral from 0 to tau of T(t)*cos(2*pi*n*t/tau) dt
b_n = (2/tau) * integral from 0 to tau of T(t)*sin(2*pi*n*t/tau) dt
where tau (𝛕) is the time period of the sine function. For my case, 𝛕 = 240 seconds (referring to point number 240 on the data sheet), so the bounds of integration are 0 to 240. T(t) in the formulae is the dataset, and n = 1.
My current code for trying to calculate the fourier coefficients is as follows:
# Packages
import numpy as np
import matplotlib.pyplot as plt

# Input data from the datasheet; the loadtxt below reads the data from t = 0 s to t = 240 s
x1, y1 = np.loadtxt(r'C:\Users\Sidharth\Documents\y2python\y2python\thermal_4min_a.txt', unpack=True, skiprows=3)
tau_4min = 240.0

def cosine(period, t, n):
    return np.cos((2*np.pi*n*t)/period)  # cos term for the a_n formula

def sine(period, t, n):
    return np.sin((2*np.pi*n*t)/period)  # sin term for the b_n formula

cos_term_4min = cosine(tau_4min, x1, 1)
sin_term_4min = sine(tau_4min, x1, 1)

a_1_4min = (2/tau_4min)*np.trapz(y1*cos_term_4min, x1)  # a_n formula (trapezium rule for T(t)*cos)
print('a_1 is', a_1_4min)
b_1_4min = (2/tau_4min)*np.trapz(y1*sin_term_4min, x1)  # b_n formula (trapezium rule for T(t)*sin)
print('b_1 is', b_1_4min)
Essentially what this does is take in the data, but only up to row index 241 (point number 240), and multiply it by the sine/cosine term from each of the above formulae. However, I realise this isn't calculating the Fourier coefficients properly.
My question is:
Will my code work if I can find a way to set limits of integration for np.trapz and then import the entire dataset, instead of importing only the data points from 0 to 240, multiplying them by the cos or sine term, and then using np.trapz on that product, as I am currently doing? (0 and 240 are supposed to be my limits of integration.)
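np.trapz itself has no integration-limit arguments, so one way to do this is to load the full dataset and slice the arrays to the window [0, 240] before integrating. A minimal sketch, using synthetic stand-ins for x1 and y1 (the real ones come from the loadtxt call above) and assuming one point number corresponds to one second:
import numpy as np

# Synthetic stand-ins for the full dataset loaded in the question
x1 = np.arange(0, 9600, dtype=float)    # point number, interpreted as seconds
y1 = 51.5 + 0.2*np.sin(2*np.pi*x1/240)  # sinusoidal temperature data

tau, n = 240.0, 1
mask = (x1 >= 0) & (x1 <= tau)  # keep only samples inside the integration window
t, T = x1[mask], y1[mask]

a_1 = (2/tau)*np.trapz(T*np.cos(2*np.pi*n*t/tau), t)
b_1 = (2/tau)*np.trapz(T*np.sin(2*np.pi*n*t/tau), t)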

Why are window_length/hop_length multiplied by the sample rate in librosa.core.stft in this example?

I'm new to voice recognition and I'm going through the details of this implementation of speaker verification. In data_preprocess.py the authors use the librosa library. Here is a simplified version of the code:
def preprocess_data(data_dir, res_dir, N, M, tdsv_frame, sample_rate, nfft, window_len, hop_len):
    os.makedirs(res_dir, exist_ok=True)
    batch_frames = N * M * tdsv_frame
    batch_number = 0
    batch = []
    batch_len = 0
    for i, path in enumerate(tqdm(os.listdir(data_dir))):
        data, sr = librosa.core.load(os.path.join(data_dir, path), sr=sample_rate)
        S = librosa.core.stft(y=data, n_fft=nfft, win_length=int(window_len * sample_rate), hop_length=int(hop_len * sample_rate))
        batch.append(S)
        batch_len += S.shape[1]
        if batch_len < batch_frames: continue
        batch = np.concatenate(batch, axis=1)[:, :batch_frames]
        np.save(os.path.join(res_dir, "voice_%d.npy" % batch_number), batch)
        batch_number += 1
        batch = []
        batch_len = 0
N = 2 # number of speakers per batch
M = 400 # number of utterances per speaker
tdsv_frame = 80 # feature size
sample_rate = 8000 # sampling rate
nfft = 512 # fft kernel size
window_len = 0.025 # window length in seconds (25 ms)
hop_len = 0.01 # hop size in seconds (10 ms)
data_dir = "./data/clean_testset_wav/"
res_dir = "./data/clean_testset_wav_prep/"
Based on a figure in the paper, they want to create a batch of features of size (N*M)*tdsv_frame.
I think I understand the concept of window_length and hop_length, but what is unclear to me is how the authors set these parameters. Why should we multiply these lengths by sample_rate, as is done here:
S = librosa.core.stft(y=data, n_fft=nfft, win_length=int(window_len * sample_rate), hop_length=int(hop_len * sample_rate))
Thank you.
librosa.core.stft takes win_length/hop_length as a number of samples. This is typical for digital signal processing: the systems are fundamentally discrete, built on the number of samples per second (the sample rate).
For humans, however, it is easier to think of these lengths as times in seconds or milliseconds, as in your example:
window_len = 0.025 # window length in seconds (25 ms)
hop_len = 0.01 # hop size in seconds (10 ms)
So to go from a time in seconds to a time in number of samples, one has to multiply by the sample rate.
In other words, window_len and hop_len are specified in seconds, but librosa expects a number of samples:
number of samples = sampling_rate * time in seconds
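Concretely, with the values above:
sample_rate = 8000  # samples per second
window_len = 0.025  # 25 ms
hop_len = 0.01      # 10 ms

win_length = int(window_len * sample_rate)  # 0.025 s * 8000 samples/s = 200 samples
hop_length = int(hop_len * sample_rate)     # 0.010 s * 8000 samples/s = 80 samples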

How to calculate the slope of the line

I am trying to calculate the slope of the line for a 50-day EMA I created from the adjusted closing price of a few stocks I downloaded using the getSymbols function.
My EMA looks like this:
getSymbols("COLUM.CO")
COLUM.CO$EMA <- EMA(COLUM.CO[,6],n=50)
This gives me an extra column that contains the 50-day EMA of the adjusted closing price. Now I would like to add a column that contains the slope of this line. I'm sure it's a fairly easy answer, but I would really appreciate some help with this. Thank you in advance.
A good way to do this is with rolling least-squares regression. rollSFM does a fast and efficient job of computing the slope of a series. It usually makes sense to look at the slope in relation to units of price activity in time (bars), so x can simply be equally spaced points.
The only tricky part is working out an effective value of n, the length of the window over which you fit the slope.
library(quantmod)
getSymbols("AAPL")
AAPL$EMA <- EMA(Ad(AAPL),n=50)
# Compute the slope over a 50-bar lookback:
AAPL <- merge(AAPL, rollSFM(Ra = AAPL[, "EMA"], Rb = 1:nrow(AAPL), n = 50))
The column labeled beta contains the rolling-window value of the slope (alpha contains the intercept, r.squared the R^2 value).
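For anyone who wants the same rolling-slope idea outside R, here is a rough Python/pandas sketch of the same computation; the series ema is a hypothetical stand-in for the 50-day EMA column:
import numpy as np
import pandas as pd

# Hypothetical stand-in for the 50-day EMA column
ema = pd.Series(np.cumsum(np.random.randn(500)))

# Slope of a least-squares fit over a rolling 50-bar window;
# x is just equally spaced bar numbers, as in the R code above
slope = ema.rolling(50).apply(lambda y: np.polyfit(np.arange(len(y)), y, 1)[0], raw=True)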

Stata - Multiple rotated plots on graph (including distributions on sides of axes)

I would like to produce a single graph containing both: (1) a scatter plot, and (2) either histograms or kernel density estimates of the Y and X variables to the left of the Y axis and below the X axis.
I found a graph that does this in MATLAB -- I would just like to produce something similar in Stata.
That graph was produced using the following MATLAB code:
n = 1000;
rho = .7;
Z = mvnrnd([0 0], [1 rho; rho 1], n);
U = normcdf(Z);
X = [gaminv(U(:,1),2,1) tinv(U(:,2),5)];
[n1,ctr1] = hist(X(:,1),20);
[n2,ctr2] = hist(X(:,2),20);
subplot(2,2,2); plot(X(:,1),X(:,2),'.'); axis([0 12 -8 8]); h1 = gca;
title('1000 Simulated Dependent t and Gamma Values');
xlabel('X1 ~ Gamma(2,1)'); ylabel('X2 ~ t(5)');
subplot(2,2,4); bar(ctr1,-n1,1); axis([0 12 -max(n1)*1.1 0]); axis('off'); h2 = gca;
subplot(2,2,1); barh(ctr2,-n2,1); axis([-max(n2)*1.1 0 -8 8]); axis('off'); h3 = gca;
set(h1,'Position',[0.35 0.35 0.55 0.55]);
set(h2,'Position',[.35 .1 .55 .15]);
set(h3,'Position',[.1 .35 .15 .55]);
colormap([.8 .8 1]);
UPDATE: The Stata 13 manual entry for "graph combine" has precisely this example (http://www.stata.com/manuals13/g-2graphcombine.pdf). Here is the code:
use http://www.stata-press.com/data/r13/lifeexp, clear
generate loggnp = log10(gnppc)
label var loggnp "Log base 10 of GNP per capita"
scatter lexp loggnp, ysca(alt) xsca(alt) xlabel(, grid gmax) fysize(25) saving(yx)
twoway histogram lexp, fraction xsca(alt reverse) horiz fxsize(25) saving(hy)
twoway histogram loggnp, fraction ysca(alt reverse) ylabel(,nogrid) xlabel(,grid gmax) saving(hx)
graph combine hy.gph yx.gph hx.gph, hole(3) imargin(0 0 0 0) graphregion(margin(l=22 r=22)) title("Life expectancy at birth vs. GNP per capita") note("Source: 1998 data from The World Bank Group")
There's probably a better way to do it, but this is my quick attempt to take up the challenge.
sysuse auto,clear
set obs 1000
twoway scatter mpg price, saving(sct,replace) ///
xsc(r(0(5000)20000) off ) ysc(r(10(10)50) off) ///
xti("") yti("") xlab(,nolab) ylab(,nolab)
kdensity mpg, n(1000) k(gauss) gen(x0 d0)
line x0 d0, xsc(rev off) ysc(alt) xlab(,nolab) xtick(,notick) saving(hist0, replace)
kdensity price, n(1000) k(gauss) gen(x1 d1)
line d1 x1, xsc(alt) ysc(rev off) ylab(,nolab) ytick(,notick) saving(hist1, replace)
graph combine hist0.gph sct.gph hist1.gph, cols(2) holes(3)
I'd also like to know if there are ways to improve on it. The code is not very neat, and I had trouble properly aligning the line plot and the scatter plot without removing the ticks and labels of the scatter plot (xcommon and ycommon did not really do the job for me).
Credit to this post on Statalist.

How do I bandpass-filter a signal using a Gaussian function in Python (NumPy/SciPy)?

I have a time series (more specifically, a correlation function). I want to bandpass-filter this signal using a Gaussian function H:
H(w) = e^(-alpha*((w - wn)/wn)^2),
where wn is the central frequency of my bandpass filter and alpha is a constant value that I know.
I apply an (inverse) FFT to my H function:
H = np.e ** (-alfa * ((w - wn) / wn) ** 2)
H = np.fft.ifft(H)
# swap the two halves of the impulse response (equivalent to np.fft.fftshift)
HH = np.asarray(list(itertools.chain(H[len(H)//2:], H[:len(H)//2])))
And what I do then is use fftconvolve:
filtered = fftconvolve(data, HH.real, mode='same')
but the "filtered signal" that I see seems to be filtering frequencies centered at 2 times wn.
What is the correct way of doing this? Is there a restriction on the length of my filter with respect to the length of my time series?
Perhaps what you are looking for is the Gaussian filter from SciPy:
from scipy.ndimage import gaussian_filter
output = gaussian_filter(input, sigma)
where sigma is the standard deviation of the Gaussian kernel. See the SciPy documentation for more details: https://docs.scipy.org/doc/scipy/reference/generated/scipy.ndimage.gaussian_filter.html
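Note, though, that gaussian_filter smooths in the time domain, which acts as a low-pass rather than a band-pass. If the goal is the band-pass H(w) from the question, one alternative is to apply H directly in the frequency domain instead of convolving in time. A minimal sketch, assuming uniformly sampled real-valued data with sample spacing dt (in seconds) and wn given in Hz:
import numpy as np

def gaussian_bandpass(x, dt, wn, alpha):
    # One-sided spectrum of the real signal and its matching frequency axis (Hz)
    X = np.fft.rfft(x)
    f = np.fft.rfftfreq(len(x), d=dt)
    # H(w) = exp(-alpha*((w - wn)/wn)^2), evaluated on the signal's own frequency grid
    H = np.exp(-alpha * ((f - wn) / wn) ** 2)
    return np.fft.irfft(X * H, n=len(x))
Because H is evaluated on the exact frequency grid of the data, there is no separate impulse-response length to worry about and no risk of a mismatch between the filter's frequency axis and the signal's.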