Scipy, Numpy: Audio classifier,Voice/Speech Activity Detection - numpy

I am writting a program to automatically classify recorded audio phone calls files (wav files) which contain atleast some Human Voice or not (only DTMF, Dialtones, ringtones, noise).
My first approach was implementing simple VAD (voice activity detector) using ZCR (zero crossing rate) & calculating Energy, but both of these paramters confuse DTMF, Dialtones with Voice. This techquie failed so I implemented a trivial method to calculate variance of FFT inbetween 200Hz and 300Hz. My numpy code is as follows
wavefft = np.abs(fft(frame))
n = len(frame)
fx = np.arange(0,fs,float(fs)/float(n))
stx = np.where(fx>=200)
stx = stx[0][0]
endx = np.where(fx>=300)
endx = endx[0][0]
return np.sqrt(np.var(wavefft[stx:endx]))/1000
This resulted in 60% accuracy.
Next, I tried implementing a machine learning based approach using SVM (Support Vector Machine) and MFCC (Mel-frequency cepstral coefficients). The results were totally incorrect, almost all samples were incorrectly marked. How should one train a SVM with MFCC feature vectors? My rough code using scikit-learn is as follows
[samplerate, sample] = wavfile.read ('profiles/noise.wav')
noiseProfile = MFCC(samplerate, sample)
[samplerate, sample] = wavfile.read ('profiles/ring.wav')
ringProfile = MFCC(samplerate, sample)
[samplerate, sample] = wavfile.read ('profiles/voice.wav')
voiceProfile = MFCC(samplerate, sample)
machineData = []
for noise in noiseProfile:
machineData.append(noise)
for voice in voiceProfile:
machineData.append(voice)
dataLabel = []
for i in range(0, len(noiseProfile)):
dataLabel.append (0)
for i in range(0, len(voiceProfile)):
dataLabel.append (1)
clf = svm.SVC()
clf.fit(machineData, dataLabel)
I want to know what alternative approach I could implement?

If you don't have to use scipy/numpy, you might checkout webrtvad, which is a Python wrapper around Google's excellent WebRTC Voice Activity Detection code. WebRTC uses Gaussian Mixture Models (GMMs), works well, and is very fast.
Here's an example of how you might use it:
import webrtcvad
# audio must be 16 bit PCM, at 8 KHz, 16 KHz or 32 KHz.
def audio_contains_voice(audio, sample_rate, aggressiveness=0, threshold=0.5):
# Frames must be 10, 20 or 30 ms.
frame_duration_ms = 30
# Assuming split_audio is a function that will split audio into
# frames of the correct size.
frames = split_audio(audio, sample_rate, frame_duration)
# aggressiveness tells the VAD how aggressively to filter out non-speech.
# 0 will have the most false-positives for speech, 3 the least.
vad = webrtc.Vad(aggressiveness)
num_voiced = len([f for f in frames if vad.is_voiced(f, sample_rate)])
return float(num_voiced) / len(frames) > threshold

Related

Is it possible to train YOLO (any version) for a single class where the image has text data. (find region of equations)

I am wondering if YOLO (any version, specially the one with accuracy, not speed) can be trained on the text data. What I am trying to do is to find the Region in the text image where any equation is present.
For example, I want to find the 2 of the Gray regions of interest in this image so that I can outline and eventually, crop the equations separately.
I am asking this questions because :
First of all I have not found a place where the YOLO is used for text data.
Secondly, how can we customise for low resolution unlike the (416,416) as all the images are either cropped or horizontal mostly in (W=2H) format.
I have implemented the YOLO-V3 version for text data but using OpenCv which is basically for CPU. I want to train the model from scratch.
Please help. Any of the Keras, Tensorflow or PyTorch would do.
Here is the code I used for implementing in OpenCv.
net = cv2.dnn.readNet(PATH+"yolov3.weights", PATH+"yolov3.cfg") # build the model. NOTE: This will only use CPU
layer_names = net.getLayerNames() # get all the layer names from the network 254 layers in the network
output_layers = [layer_names[i[0] - 1] for i in net.getUnconnectedOutLayers()] # output layer is the
# 3 output layers in otal
blob = cv2.dnn.blobFromImage(image=img, scalefactor=0.00392, size=(416,416), mean=(0, 0, 0), swapRB=True,)
# output as numpy array of (1,3,416,416). If you need to change the shape, change it in the config file too
# swap BGR to RGB, scale it to a threshold, resize, subtract it from the mean of 0 for all the RGB values
net.setInput(blob)
outs = net.forward(output_layers) # list of 3 elements for each channel
class_ids = [] # id of classes
confidences = [] # to store all the confidence score of objects present in bounding boxes if 0, no object is present
boxes = [] # to store all the boxes
for out in outs: # get all channels one by one
for detection in out: # get detection one by one
scores = detection[5:] # prob of 80 elements if the object(s) is/are inside the box and if yes, with what prob
class_id = np.argmax(scores) # Which class is dominating inside the list
confidence = scores[class_id]
if confidence > 0.1: # consider only those boxes which have a prob of having an object > 0.55
# grid coordinates
center_x = int(detection[0] * width) # centre X of grid
center_y = int(detection[1] * height) # Center Y of grid
w = int(detection[2] * width) # width
h = int(detection[3] * height) # height
# Rectangle coordinates
x = int(center_x - w / 2)
y = int(center_y - h / 2)
boxes.append([x, y, w, h]) # get all the bounding boxes
confidences.append(float(confidence)) # get all the confidence score
class_ids.append(class_id) # get all the clas ids
Being an object detector Yolo can be used for specific text detection only, not for detecting any text that might be present in the image.
For example Yolo can be trained to do text based logo detection like this:
I want to find the 2 of the Gray regions of interest in this image so
that I can outline and eventually, crop the equations separately.
Your problem statement talks about detecting any equation (math formula) that's present in the image so it can't be done using Yolo alone. I think mathpix is similar to your use-case. They will be using OCR (Optical Character Recognition) system trained and fine tuned towards their use-case.
Eventually to do something like mathpix, OCR system customised for your use case is what you need. There won't be any ready ready made solution out there for this. You'll have to build one.
Proposed Methods:
Mathematical Formula Detection in Heterogeneous Document Images
A Simple Equation Region Detector for Printed Document Images in Tesseract
Note: Tesseract as it is can't be used because it is a pre-trained model trained for reading any character. You can refer 2nd paper to train tesseract towards fitting your use case.
To get some idea about OCR, you can read about it here.
EDIT:
So idea is to build your own OCR to detect something that constitutes equation/math formula rather than detecting every character. You need to have data set where equations are marked. Basically you look for region with math symbols(say summation, integration etc.).
Some Tutorials to train your own OCR:
Tesseract training guide
Creating OCR pipeline using CV and DL
Build OCR pipeline
Build Your OCR
Attention OCR
So idea is that you follow these tutorials to get to know how to train
and build your OCR for any use case and then you read research papers
I mentioned above and also some of the basic ideas I gave above to
build OCR towards your use case.

After quantisation in neural network, will the output need to be scaled with the inverse of the weight scaling

I'm currently writing a script to quantise a Keras model down to 8 bits. I'm doing a fairly basic linear scaling on the weights, by assuming a normal distribution of weights and biases, and then interpolating all the values within 2 standard deviations of the mean, to the range [-128, 127].
This all works, and I run the model through inference, but my image out is crazy bad. I know there will be a small performance hit, but I'm seeing roughly 10x performance degradation.
My question is, after this scaling of the weights, do I need to do the inverse scaling operation to my output? None of the papers I've been reading seem to mention this, but I'm unsure why else my results would be so bad.
The network is for image demosaicing. It takes in a RAW image, and is meant to output an image with very low noise, and no demosaicing artefacts. My full precision model is very good, with image PSNRs of around 40-43dB, but after quantisation, I'm getting 4-8dB, and incredibly bad looking images.
Code for anyone who's bothered to read it
for i in layer_index:
count = count+1
layer = model.get_layer(index = i);
weights = layer.get_weights();
weights_act = weights[0];
bias_act = weights[1];
std = np.std(weights_act)
if (std > max_std):
max_std = std
mean = np.mean(weights_act)
mean_of_mean = mean_of_mean + mean
mean_of_mean = mean_of_mean / count
max_bound = mean_of_mean + 2*max_std
min_bound = mean_of_mean - 2*max_std
print(max_bound, min_bound)
for i in layer_index:
layer = model.get_layer(index = i);
weights = layer.get_weights();
weights_act = weights[0];
bias_act = weights[1];
weights_shape = weights_act.shape;
bias_shape = bias_act.shape;
new_weights = np.empty(weights_shape, dtype = np.int8)
print(new_weights.dtype)
new_biass = np.empty(bias_shape, dtype = np.int8)
for a in range(weights_shape[0]):
for b in range(weights_shape[1]):
for c in range(weights_shape[2]):
for d in range(weights_shape[3]):
new_weight = (((weights_act[a,b,c,d] - min_bound) * (127 - (-128)) / (max_bound - min_bound)) + (-128))
new_weights[a,b,c,d] = np.int8(new_weight)
#print(new_weights[a,b,c,d], weights_act[a,b,c,d])
for e in range(bias_shape[0]):
new_bias = (((bias_act[e] - min_bound) * (127 - (-128)) / (max_bound - min_bound)) + (-128))
new_biass[e] = np.int8(new_bias)
new_weight_layer = (new_weights, new_biass)
layer.set_weights(new_weight_layer)
You dont do what you think you are doing, I'll explain.
If you wish to take pre-trained model and quantize it you have to add scales after each operation that involves weights, lets take for example the convolution operation.
As we know convolution operation is linear in my explantion i will ignore the bias for the sake of simplicity (adding him is relatively easy), Let's assume X is our input Y is our output and W is the weights, convolution can be written as:
Y=W*X
where '*' represent the convolution operation, what you are basically doing is taking the weights and multiple them by some scalar (lets call it 'a') and shift them by some other scalar (let's call it 'b') so in your model you use W' where: W'= Wa+b
So if we return to the convolution operation we get that in your quantized network you basically do the next operation: Y' = W'*X = (Wa+b)*X
Because convolution is linear we get: Y' = a(W*X) + b*X'
Don't forget that in your network you want to receive Y not Y' at the output of the convolution therefore you must do shift + re scale to get the correct answer.
So after that explanation (which i hope was clear enough) i hope you can understand what is the problem in your network, you do this scale and shift to all of weights and you never compensate for it, I think your confusion is because your read papers that trained models in quantized mode from the beginning and didn't take pretrained model quantized it.
For you problem i think tensorflow graph transform tool might help, take a look at:
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/tools/graph_transforms/README.md
If you wish to read more about quantizing pre trained model you can find more information in (for more academic info just go to scholar.google.com:
https://www.tensorflow.org/lite/performance/post_training_quantization

When forward using MXNet, how to do with varying 'batch size' in data_shapes?

Hi,I have a question that, how can I make predict with unfixed input data? I will try to describe in detail clear:
I use MTCNN for face detection(it's ok unfamiliar with that), and it employs 3 networks: PNet, RNet, ONet. PNet detects a mass of proposal face bounding boxes, then these boxes are coarse-to-fine by the rest net one after another, finally get precise face bbox(s). When taking an image as input to PNet, image's size is unfixed, and the output proposal box number from PNet is also unfixed, so as RNet, ONet. Reference to another MTCNN code I set a large data_shapes(e.g., image size, batch size) when I bind the module, and initialize all to zero,then make predict. That works though, Isn't that a redundant calculation? (Question 1)
PNet:
max_img_w=1000
max_img_h=1000
sym, arg_params, aux_params = mx.model.load_checkpoint(‘det1’, 0)
self.PNets = mx.mod.Module(symbol=sym, context=ctx,label_names=None)
self.PNets.bind(data_shapes=[(‘data’, (1, 3, max_img_w, max_img_h))],for_training=False)
self.PNets.set_params(arg_params,aux_params)
RNet
sym, arg_params, aux_params = mx.model.load_checkpoint(‘det2’, 0)
self.RNet = mx.mod.Module(symbol=sym, context=ctx,label_names=None)
self.RNet.bind(data_shapes=[(‘data’, (2048,3, 24, 24))],for_training=False)
self.RNet.set_params(arg_params,aux_params,allow_missing=True)
ONet
sym, arg_params, aux_params = mx.model.load_checkpoint(‘det3’, 0)
self.ONet = mx.mod.Module(symbol=sym, context=ctx,label_names=None)
self.ONet.bind(data_shapes=[(‘data’, (256, 3, 48, 48))],for_training=False)
self.ONet.set_params(arg_params,aux_params,allow_missing=True)
And I try mx.mod.Module.reshape before predict, which will adjust data'shape according to last network's output, but I get this error:(Question 2)
AssertionError: Shape of unspecified array arg:prob1_label changed. This can cause the new executor to not share parameters with the old one. Please check for error in the network. If this is intended, set partial_shaping=True to suppress this warning.
One more thing is that The MTCNN code (https://github.com/pangyupo/mxnet_mtcnn_face_detection) primary use deprecated function to load models:
self.PNet = mx.model.FeedForward.load(‘det1’,0)
One single line to work with arbitrary data_shapes, why this function be deprecated..?(Question 3)
I found a little difference that after load model, FeedFroward takes 0MB memory before make one predict, but mx.mod.Module takes up memory once loaded, and increase obviously after making one prediction.
You can use MXNet imperative API Gluon and that will let you use different batch-sizes.
If like in this case, your model was trained using the symbolic API or has been exported in the serialized MXNet format ('-0001.params', '-symbol.json' for e.g), you can load it in Gluon that way:
ctx = mx.cpu()
sym = mx.sym.load_json(open('det1-symbol.json', 'r').read())
PNet = gluon.nn.SymbolBlock(outputs=sym, inputs=mx.sym.var('data'))
PNet.load_params('det1-0001.params', ctx=ctx)
Then you can use it the following way:
# a given batch size (1)
data1 = mx.nd.ones((1, C, W, H))
output1 = PNet(data1)
# a different batch size (5)
data2 = mx.nd.ones((5, C, W, H))
output2 = PNet(data2)
And it would work.
You can get started with MXNet Gluon with the official 60 minutes crash course

word2vec - get nearest words

Reading the tensorflow word2vec model output how can I output the words related to a specific word ?
Reading the src : https://github.com/tensorflow/tensorflow/blob/r0.11/tensorflow/examples/tutorials/word2vec/word2vec_basic.py can view how the image is plotted.
But is there a data structure (e.g dictionary) created as part of training the model that allows to access nearest n words closest to given word ?
For example if word2vec generated image :
image src: https://www.tensorflow.org/versions/r0.11/tutorials/word2vec/index.html
In this image the words 'to , he , it' are contained in same cluster, is there a function which takes as input 'to' and outputs 'he , it' (in this case n=2) ?
This approach apply to word2vec in general. If you can save the word2vec in text/binary file like google/GloVe word vector. Then what you need is just the gensim.
To install:
Via github
Python code:
from gensim.models import Word2Vec
gmodel=Word2Vec.load_word2vec_format(fname)
ms=gmodel.most_similar('good',10)
for x in ms:
print x[0],x[1]
However this will search all the words to give the results, there are approximate nearest neighbor (ANN) which will give you the result faster but with a trade off in accuracy.
In the latest gensim, annoy is used to perform the ANN, see this notebooks for more information.
Flann is another library for Approximate Nearest Neighbors.
I will assume that you don't want to use gensim, and would prefer to stick with tensorflow. In that case, I'll offer two options
Option 1 - Tensorboard:
If you are just trying to do this from an exploratory standpoint, I would suggest using Tensorboard's embedding visualizer to search for the closest embeddings. It provides a cool interface and you can use both cosine and euclidian distances with a set number of neighbors.
Link to Tensorflow documentation
Option 2 - Direct Calculation
Within the word2vec_basic.py file, there is an example of how they are calculating closest words, and you could go ahead and use that if you mess with the function a little bit. The following is found in the graph itself:
# Compute the cosine similarity between minibatch examples and all embeddings.
norm = tf.sqrt(tf.reduce_sum(tf.square(embeddings), 1, keep_dims=True))
normalized_embeddings = embeddings / norm
valid_embeddings = tf.nn.embedding_lookup(
normalized_embeddings, valid_dataset)
similarity = tf.matmul(
valid_embeddings, normalized_embeddings, transpose_b=True)
Then, during training (every 10000 steps) they run this next bit of code (while the session is active). When they call similarity.eval() it is getting the literal numpy array evaluation of the similarity tensor in the graph.
# Note that this is expensive (~20% slowdown if computed every 500 steps)
if step % 10000 == 0:
sim = similarity.eval()
for i in xrange(valid_size):
valid_word = reverse_dictionary[valid_examples[i]]
top_k = 8 # number of nearest neighbors
nearest = (-sim[i, :]).argsort()[1:top_k+1]
log_str = "Nearest to %s:" % valid_word
for k in xrange(top_k):
close_word = reverse_dictionary[nearest[k]]
log_str = "%s %s," % (log_str, close_word)
print(log_str)
If you want to adapt this for yourself, you will have to do some finessing with changing reverse_dictionary[valid_examples[i]] to be the word/words idxs that you want to get the k-closest words for.
Get gensim and use similar_by_word method on gensim.models.Word2Vec model.
similar_by_word takes 3 parameters,
The input word
n - for top n similar words (optional, default=10)
restrict_vocab (optional, default=None)
Example
import gensim, nltk
class FileToSent(object):
"""A class to load a text file efficiently """
def __init__(self, filename):
self.filename = filename
# To remove stop words (optional)
self.stop = set(nltk.corpus.stopwords.words('english'))
def __iter__(self):
for line in open(self.filename, 'r'):
ll = [i for i in unicode(line, 'utf-8').lower().split() if i not in self.stop]
yield ll
Then depending on your input sentences (sentence_file.txt),
sentences = FileToSent('sentence_file.txt')
model = gensim.models.Word2Vec(sentences=sentences, min_count=2, hs=1)
print model.similar_by_word('hack', 2) # Get two most similar words to 'hack'
# [(u'debug', 0.967338502407074), (u'patch', 0.952264130115509)] (Output specific to my dataset)

parameters for low pass fir filter using scipy

I am trying to write a simple low pass filter using scipy, but I need help defining the parameters.
I have 3.5 million records in the time series data that needs to be filtered, and the data is sampled at 1000 hz.
I am using signal.firwin and signal.lfilter from the scipy library.
The parameters I am choosing in the code below do not filter my data at all. Instead, the code below simply produces something that graphically looks like the same exact data except for a time phase distortion that shifts the graph to the right by slightly less than 1000 data points (1 second).
In another software program, running a low pass fir filter through graphical user interface commands produces output that has similar means for each 10 second (10,000 data point) segment, but that has drastically lower standard deviations so that we essentially lose the noise in this particular data file and replace it with something that retains the mean value while showing longer term trends that are not polluted by higher frequency noise. The other software's parameters dialog box contains a check box that allows you to select the number of coefficients so that it "optimizes based on sample size and sampling frequency." (Mine are 3.5 million samples collected at 1000 hz, but I would like a function that uses these inputs as variables.)
*Can anyone show me how to adjust the code below so that it removes all frequencies above 0.05 hz?* I would like to see smooth waves in the graph rather than just the time distortion of the same identical graph that I am getting from the code below now.
class FilterTheZ0():
def __init__(self,ZSmoothedPylab):
#------------------------------------------------------
# Set the order and cutoff of the filter
#------------------------------------------------------
self.n = 1000
self.ZSmoothedPylab=ZSmoothedPylab
self.l = len(ZSmoothedPylab)
self.x = arange(0,self.l)
self.cutoffFreq = 0.05
#------------------------------------------------------
# Run the filter
#------------------------------------------------------
self.RunLowPassFIR_Filter(self.ZSmoothedPylab, self.n, self.l
, self.x, self.cutoffFreq)
def RunLowPassFIR_Filter(self,data, order, l, x, cutoffFreq):
#------------------------------------------------------
# Set a to be the denominator coefficient vector
#------------------------------------------------------
a = 1
#----------------------------------------------------
# Create the low pass FIR filter
#----------------------------------------------------
b = signal.firwin(self.n, cutoff = self.cutoffFreq, window = "hamming")
#---------------------------------------------------
# Run the same data set through each of the various
# filters that were created above.
#---------------------------------------------------
response = signal.lfilter(b,a,data)
responsePylab=p.array(response)
#--------------------------------------------------
# Plot the input and the various outputs that are
# produced by running each of the various filters
# on the same inputs.
#--------------------------------------------------
plot(x[10000:20000],data[10000:20000])
plot(x[10000:20000],responsePylab[10000:20000])
show()
return
Cutoff is normalized to the Nyquist frequency, which is half the sampling rate. So with FS = 1000 and FC = 0.05, you want cutoff = 0.05/500 = 1e-4.
from scipy import signal
FS = 1000.0 # sampling rate
FC = 0.05/(0.5*FS) # cutoff frequency at 0.05 Hz
N = 1001 # number of filter taps
a = 1 # filter denominator
b = signal.firwin(N, cutoff=FC, window='hamming') # filter numerator
M = FS*60 # number of samples (60 seconds)
n = arange(M) # time index
x1 = cos(2*pi*n*0.025/FS) # signal at 0.025 Hz
x = x1 + 2*rand(M) # signal + noise
y = signal.lfilter(b, a, x) # filtered output
plot(n/FS, x); plot(n/FS, y, 'r') # output in red
grid()
The filter output is delayed half a second (the filter is centered on tap 500). Note that the DC offset added by the noise is preserved by the low-pass filter. Also, 0.025 Hz is well within the pass range, so the output swing from peak to peak is approximately 2.
The units of cutoff freq are probably [0,1) where 1.0 is equivalent to FS (the sampling frequency). So if you really mean 0.05Hz and FS=1000Hz, you'd want to pass cutoffFreq / 1000. You may need a longer filter to get such a low cutoff.
(BTW you are passing some arguments but then using the object attributes instead, but I don't see that introducing any obvious bugs yet...)