In ml5 pitch detection using crepe model, how to detect pitch above ±2kHz

In ml5 pitch detection using crepe model, how to detect pitch above ±2kHz - tensorflow

I'm successfully using pitch detection features of ml5:
tutorial: https://ml5js.org/reference/api-PitchDetection/
model: https://cdn.jsdelivr.net/gh/ml5js/ml5-data-and-models/models/pitch-detection/crepe/
The issue:
No pitch above ±2000Hz is detected. I tried multiple devices and checked that the sounds are visible on sonograms so it's does not seem to be a mic issue.
I assumed it may be a result of sampling rate limitations / resampling done by the library, as the Nyquist frequency (max "recordable" frequency) is that of half of the sampling rate.
I hosted the ml5 sources localy and tried modifying the PitchDetection class
There I see the sampling rate seems to be resampled to 1024Hz for performance reasons. This does not sound right though as if I'm not mistaken, this would only allow detection of frequencies up to 512hz. I am definitely missing something (or a lot).
I tried fiddling with the rates, but increasing it to, say 2048 causes an error:
Error when checking : expected crepe_input to have shape [null,1024] but got array with shape [1,2048].
My question is:
Is there something in ml5 PitchDetection class I can modify, configure (perhaps a different model) to detect frequencies higher than 2000Hz using crepe model?

After more investigation, turns out the CREPE model itself supports up to ~1997Hz (seen in code) or 1975.5 Hz (seen in paper)
The paper about CREPE:
https://arxiv.org/abs/1802.06182
States:
The 360 pitch values are denoted as c1, c2..., 360 are selected so that they cover six octaves with 20-cent intervals between C1 and B7, corresponding to 32.70 Hz and 1975.5 Hz
The JS implementation has this mapping which maps the 360 intervals to 0 - 1997Hz range:
const cent_mapping = tf.add(tf.linspace(0, 7180, 360), tf.tensor(1997.3794084376191))
This means, short of retraining the model I'm probably out of luck at using it for now.
Edit:
After a good nights sleep I found a simple solution which works for my simple application.
In it's essence, it is to resample my audio buffer so it has 2 times lower pitch. CREPE than detects a pitch of 440Hz as 220Hz, and I just need to multiply it by 2.
The result is still more consistently correct than YIN algorithm for my real time, noisy application.

Related

Simple QPSK transmiter, large sidelobes pulsation

I have a simple flowgraph for QPSK transmitter with USRP.
After execution, there is lage sidelobes, that pulsate.
During the periods of large sidelobes, there is a drop in amplutude of main lobe.
There is no such pulsations if I make similar transmitter with Matlab.
I suscpect discontinues in sorce.
Comments and advice are appreciated.

Your pool of random data is far too short; you'll see data periodicity in spectrum very quickly; it might be that this is exactly what happens. So, try with num_samples 2**20 instead.
You can observe your transmit spectrum yourself before even transmitting it: use the Qt GUI frequency sink or waterfall sink with an FFT length that corresponds to the FFT length you use in gqrx.
Your sample rate is at the least end of all possible sampling rates. Here, the roll-off of the interpolation filters inside the USRP will definitely show. Don't do that to yourself. Use sps = 16, samp_rate = 1e6 instead.
Make sure you're not getting any underruns in your tranmitter, nor overruns in your receiver. If that happens at these incredibly low sampling rates, something is wrong with your computer setup

Changes make no difference. The following is # 2**20 number of samples, 1 MHz sample rate and 20 samples per symbol. There is no underrun.
# 5 Mhz sample rate I start receiving underrun.

I found the problem and a solution.
The problem is that the level of the signal after modulator is too strong for the USRP input. After modulator the abs value of the signal reach 9. I don't know the maximum level of the signal that USRP expects. I presume something like 1 peak to peak
The solution is to restrict the level by multiplication with a constant. With constant=0.5, there is still distortions. Value of 0.2 is ok.
Here is the new flowgraph:

GNURadio Companion Blocks for Z-Wave using RTL-SDR dongle

I'm using RTL-SDR generic dongle for receiving frames of Z-Wave protocol. I use real Z-Wave devices. I'm using scapy-radio and I've also downloaded EZ-Wave. However, none of them implements blocks for all Z-Wave data rates, modulations and codings. I've received some frames using original solution of EZ-Wave, however I assume I can't receive frames at all data rates, codings and modulations. Now I'm trying to implement solution according to their blocks to implement all of them.
Z-Wave procotol uses these modulations, data rates and coding:
9.6 kbps - FSK - Manchester
40 kbps - FSK - NRZ
100 kbps - GFSK - NRZ
These are my actual blocks (not able receving anything at all right now):
For example, I will explain my view on blocks for receiving at
9.6 kbps - FSK - Manchester
RTL-SDR Source
variable center_freq = 869500000
variable r1_freq_offset = 800e3
Ch0: Frequency: center_freq_3-r1_freq_offset, so I've got 868.7 Mhz on RTL-SDR Source block.
Frequency Xlating FIR Filter
Center frequency = - 800Khz to get frequency 868.95 Mhz (Europe). To be honest, I'm not sure why I do this and I need an explanation. I'm trying to implement those blocks according to EZ-Wave implementation of blocks for 40 kbps-FSK-NRZ (as I assume). They use sample rate 2M and different configurations, which I did not understand.
Taps = firdes.low_pass(1,samp_rate_1,samp_rate_1/2,5e3,firdes.WIN_HAMMING). I don't understand, what should be transition bw (5e3 in my case)
Sample rate = 19.2e3, because data rate/baud is 9.6 Kbps and according to Nyquist–Shannon sampling theorem, sampling rate should be at least double to data rate, so 2*9.6=19.2. So I'm trying to resample default 2M from source to 19.2 Kbps.
Simple squelch
I use default value (-40) and I'm not sure, if I should change this or not.
Quadrature Demod
should do the FSK demodulation and I use default value of gain. I'm not sure if this is a right way to do FSK demodulation.
Gain = 2(samp_rate_1)/(2*math.pi*20e3/8.0)*
Low Pass Filter
Sample rate = 19.2k to use the same new sample rate
Cuttoff Freq = 9.6k, I assume this according to https://nccgroup.github.io/RFTM/fsk_receiver.html
Transition width = 4.8 which is also sample_rate/2
Clock Recovery MM
Most of the parameters are default.
Omega = 2, because samp_rate/baud
Binary Slicer
is for getting binary code of signal
Zwave PacketSink 9.6
should the the Manchester decoding.
I would like to ask, what should I change on my blocks to achieve proper receiving of Z-Wave frames at all data rates, modulation and coding. When I start receiving, I'm able to see messages from my devices at FFT sink and Waterfall sink. The message debug doesn't print packets (like from original EZ-Wave solution) but only
Looking for sync : 575555aa
Looking for sync : 565555aa
Looking for sync : aa5555aa
what should be value in frame_shift_register, according to C code for Manchester decoding (ZWave PacketSink 9.6). I've seen similar post, however this is a bit different and to be honest, I'm stuck here.
I will be grateful for any help.

Let's look at the GFSK case. First of all, the sampling rate of the RTL source, 2M Baud is pretty high. For the maximum data rate, 100 kbps - GFSK, a sample rate of say 400 ~ 500kbaud will do just fine. There is also the power squelch block. This block prevents signals below a certain threshold to pass. This is not good because it filters low power signals that may contain information. There is also the sample rate issue between the lowpass filter and the MM clock recovery block. The output of the symbol recovery block should be 100kbaud (because for GFSK, sample rate = symbol rate). Using the omega value of 2 and working backward, the input to the MM block should be 200kbaud. But, the lowpass filter produces samples at 2Mbaud, 10 times than expected. You have to do proper decimation.
I implemented a GFSK receiver once for our CubeSat. Timing recovery was done by the PFB block, which is more reliable than the MM one. You can find the paper here:
https://www.researchgate.net/publication/309149646_Software-defined_radio_transceiver_for_QB50_CubeSat_telemetry_and_telecommand?_sg=HvZBpQBp8nIFh6mIqm4yksaAwTpx1V6QvJY0EfvyPMIz_IEXuLv2pODOnMToUAXMYDmInec76zviSg.ukZBHrLrmEbJlO6nZbF4X0eyhFjxFqVW2Q50cSbr0OHLt5vRUCTpaHi9CR7UBNMkwc_KJc1PO_TiGkdigaSXZA&_sgd%5Bnc%5D=1&_sgd%5Bncwor%5D=0
Some more details on the receiver could also be found here:
GFSK modulation/demodulation with GNU Radio and USRP
M.

I appreciate your answer, I've changed my sample rates. Now I'm still working on 9.6Kbps, FSK demodulation and Manchester decoding. Currently, output from my M&M clock recovery looks like this:
I would like to ask you what do think about this signal. As I said, it should be FSK demodulation and then I should use Manchester decoding. Do I still need usage of PCB block? Primary, I have to do 9.6kbps, FSK and Manchester, so I will look at 100Kbps GFSK NRZ if there will be some time left.

Sample rate is 1M because of RTL-SDR dongle limitations (225001 to 300000 and 900001 to 3200000).
Current blocks:
I don't understand :
Taps of Frequency Xlating FIR Filter firdes.low_pass(1,samp_rate_1,40e3,20e3,firdes.WIN_HAMMING)
Cuttoff Freq and Transition Width of Low Pass filter
Clock Recovery M&M aswell, so consider its values "random".
ClockRecovery Output:
I was trying to use PCB block according to your work at ResearchGate. However, I was unsuccessful because I still don't understand all that science behind the clock recovery.
Doing Low-pass filtering twice is because original Z-Wave blocks from scapy-radio for 40Kbps, FSK and NRZ coding are made like this (and it works):
So I thought I will be just about changing few parameters and decoder (Zwave PacketSink9.6).
I also uploaded my current blocks here.

Moses Browne Mwakyanjala, I'm also trying to implement that thing according to your work.
Maybe there is a problem with a clock recovery and Manchester decoding. Manchester decoding use transitions 0->1 and 1->0 to encode 0s and 1s. How can I properly configure clock recovery to achieve correct sample rate and transitions for Manchester decoding? Manchester decoder (Z-Wave PacketSink 9.6) is able to find the preamble and ends only with looking for sync.
I would like to also ask you, where can I find my modulation index "h" mentioned in your work?
Thank you

Calculating walking distance for user over time

I'm trying to track the distance a user has moved over time in my application using the GPS. I have the basic idea in place, so I store the previous location and when a new GPS location is sent I calculate the distance between them, and add that to the total distance. So far so good.
There are two big issues with this simple implementation:
Since the GPS is inacurate, when the user moves, the GPS points will not be a straight line but more of a "zig zag" pattern making it look like the user has moved longer than he actually have moved.
Also a accuracy problem. If the phone just lays on the table and polls GPS possitions, the answer is usually a couple of meters different every time, so you see the meters start accumulating even when the phone is laying still.
Both of these makes the tracking useless of coruse, since the number I'm providing is nowwhere near accurate enough.
But I guess that this problem is solvable since there are a lot of fitness trackers and similar out there that does track distance from GPS. I guess they do some kind of interpolation between the GPS values or something like that? I guess that won't be 100% accurate either, but probably good enough for my usage.
So what I'm after is basically a algorithm where I can put in my GPS positions, and get as good approximation of distance travelled as possible.
Note that I cannot presume that the user will follow roads, so I cannot use the Google Distance Matrix API or similar for this.

This is a common problem with the position data that is produced by GPS receivers. A typical consumer grade receiver that I have used has a position accuracy defined as a CEP of 2.5 metres. This means that for a stationary receiver in a "perfect" sky view environment over time 50% of the position fixes will lie within a circle with a radius of 2.5 metres. If you look at the position that the receiver reports it appears to wander at random around the true position sometimes moving a number of metres away from its true location. If you simply integrate the distance moved between samples then you will get a very large apparent distance travelled.for a stationary device.
A simple algorithm that I have used quite successfully for a vehicle odometer function is as follows
for(;;)
{
Stored_Position = Current_Position ;
do
{
Distance_Moved = Distance_Between( Current_Position, Stored_Position ) ;
} while ( Distance_Moved < MOVEMENT_THRESHOLD ) ;
Cumulative_Distance += Distance_Moved ;
}
The value of MOVEMENT_THRESHOLD will have an effect on the accuracy of the final result. If the value is too small then some of the random wandering performed by the stationary receiver will be included in the final result. If the value is too large then the path taken will be approximated to a series of straight lines each of which is as long as the threshold value. The extra distance travelled by the receiver as its path deviates from this straight line segment will be missed.
The accuracy of this approach, when compared with the vehicle odometer, was pretty good. How well it works with a pedestrian would have to be tested. The problem with people is that they can make much sharper turns than a vehicle resulting in larger errors from the straight line approximation. There is also the perennial problem with sky view obscuration and signal multipath caused by buildings, vehicles etc. that can induce positional errors of 10s of metres.

TensorFlow Object Detection API: evaluation mAP behaves weirdly?

I am training an object detector for my own data using Tensorflow Object Detection API. I am following the (great) tutorial by Dat Tran https://towardsdatascience.com/how-to-train-your-own-object-detector-with-tensorflows-object-detector-api-bec72ecfe1d9. I am using the provided ssd_mobilenet_v1_coco-model pre-trained model checkpoint as the starting point for the training. I have only one object class.
I exported the trained model, ran it on the evaluation data and looked at the resulted bounding boxes. The trained model worked nicely; I would say that if there was 20 objects, typically there were 13 objects with spot on predicted bounding boxes ("true positives"); 7 where the objects were not detected ("false negatives"); 2 cases where problems occur were two or more objects are close to each other: the bounding boxes get drawn between the objects in some of these cases ("false positives"<-of course, calling these "false positives" etc. is inaccurate, but this is just for me to understand the concept of precision here). There are almost no other "false positives". This seems much better result than what I was hoping to get, and while this kind of visual inspection does not give the actual mAP (which is calculated based on overlap of the predicted and tagged bounding boxes?), I would roughly estimate the mAP as something like 13/(13+2) >80%.
However, when I run the evaluation (eval.py) (on two different evaluation sets), I get the following mAP graph (0.7 smoothed):
mAP during training
This would indicate a huge variation in mAP, and level of about 0.3 at the end of the training, which is way worse than what I would assume based on how well the boundary boxes are drawn when I use the exported output_inference_graph.pb on the evaluation set.
Here is the total loss graph for the training:
total loss during training
My training data consist of 200 images with about 20 labeled objects each (I labeled them using the labelImg app); the images are extracted from a video and the objects are small and kind of blurry. The original image size is 1200x900, so I reduced it to 600x450 for the training data. Evaluation data (which I used both as the evaluation data set for eval.pyand to visually check what the predictions look like) is similar, consists of 50 images with 20 object each, but is still in the original size (the training data is extracted from the first 30 min of the video and evaluation data from the last 30 min).
Question 1: Why is the mAP so low in evaluation when the model appears to work so well? Is it normal for the mAP graph fluctuate so much? I did not touch the default values for how many images the tensorboard uses to draw the graph (I read this question: Tensorflow object detection api validation data size and have some vague idea that there is some default value that can be changed?)
Question 2: Can this be related to different size of the training data and the evaluation data (1200x700 vs 600x450)? If so, should I resize the evaluation data, too? (I did not want to do this as my application uses the original image size, and I want to evaluate how well the model does on that data).
Question 3: Is it a problem to form the training and evaluation data from images where there are multiple tagged objects per image (i.e. surely the evaluation routine compares all the predicted bounding boxes in one image to all the tagged bounding boxes in one image, and not all the predicted boxes in one image to one tagged box which would preduce many "false false positives"?)
(Question 4: it seems to me the model training could have been stopped after around 10000 timesteps were the mAP kind of leveled out, is it now overtrained? it's kind of hard to tell when it fluctuates so much.)
I am a newbie with object detection so I very much appreciate any insight anyone can offer! :)

Question 1: This is the tough one... First, I think you don't understand correctly what mAP is, since your rough calculation is false. Here is, briefly, how it is computed:
For each class of object, using the overlap between the real objects and the detected ones, the detections are tagged as "True positive" or "False positive"; all the real objects with no "True positive" associated to them are labelled "False Negative".
Then, iterate through all your detections (on all images of the dataset) in decreasing order of confidence. Compute the accuracy (TP/(TP+FP)) and recall (TP/(TP+FN)), only counting the detections that you've already seen ( with confidence bigger than the current one) for TP and FP. This gives you a point (acc, recc), that you can put on a precision-recall graph.
Once you've added all possible points to your graph, you compute the area under the curve: this is the Average Precision for this category
if you have multiple categories, the mAP is the standard mean of all APs.
Applying that to your case: in the best case your true positive are the detections with the best confidence. In that case your acc/rec curve will look like a rectangle: you'd have 100% accuracy up to (13/20) recall, and then points with 13/20 recall and <100% accuracy; this gives you mAP=AP(category 1)=13/20=0.65. And this is the best case, you can expect less in practice due to false positives which higher confidence.
Other reasons why yours could be lower:
maybe among the bounding boxes that appear to be good, some are still rejected in the calculations because the overlap between the detection and the real object is not quite big enough. The criterion is that Intersection over Union (IoU) of the two bounding boxes (real one and detection) should be over 0.5. While it seems like a gentle threshold, it's not really; you should probably try and write a script to display the detected bounding boxes with a different color depending on whether they're accepted or not (if not, you'll get both a FP and a FN).
maybe you're only visualizing the first 10 images of the evaluation. If so, change that, for 2 reasons: 1. maybe you're just very lucky on these images, and they're not representative of what follows, just by luck. 2. Actually, more than luck, if these images are the first from the evaluation set, they come right after the end of the training set in your video, so they are probably quite similar to some images in the training set, so they are easier to predict, so they're not representative of your evaluation set.
Question 2: if you have not changed that part in the config file mobilenet_v1_coco-model, all your images (both for training and testing) are rescaled to 300x300 pixels at the start of the network, so your preprocessings don't matter.
Question 3: no it's not a problem at all, all these algorithms were designed to detect multiple objects in images.
Question 4: Given the fluctuations, I'd actually keep training it until you can see improvement or clear overtraining. 10k steps is actually quite small, maybe it's enough because your task is relatively easy, maybe it's not enough and you need to wait ten times that to have significant improvement...

Mathematical analysis of a sound sample (as an array of numbers)

I need to find the frequency of a sample, stored (in vb) as an array of byte. Sample is a sine wave, known frequency, so I can check), but the numbers are a bit odd, and my maths-foo is weak.
Full range of values 0-255. 99% of numbers are in range 235 to 245, but there are some outliers down to 0 and 1, and up to 255 in the remaining 1%.
How do I normalise this to remove outliers, (calculating the 235-245 interval as it may change with different samples), and how do I then calculate zero-crossings to get the frequency?
Apologies if this description is rubbish!

The FFT is probably the best answer, but if you really want to do it by your method, try this:
To normalize, first make a histogram to count how many occurrances of each value from 0 to 255. Then throw out X percent of the values from each end with something like:
for (i=lower=0;i< N*(X/100); lower++)
i+=count[lower];
//repeat in other direction for upper
Now normalize with
A[i] = 255*(A[i]-lower)/(upper-lower)-128
Throw away results outside the -128..127 range.
Now you can count zero crossings. To make sure you are not fooled by noise, you might want to keep track of the slope over the last several points, and only count crossings when the average slope is going the right way.

The standard method to attack this problem is to consider one block of data, hopefully at least twice the actual frequency (taking more data isn't bad, so it's good to overestimate a bit), then take the FFT and guess that the frequency corresponds to the largest number in the resulting FFT spectrum.
By the way, very similar problems have been asked here before - you could search for those answers as well.

Use the Fourier transform, it's much more noise insensitive than counting zero crossings
Edit: #WaveyDavey
I found an F# library to do an FFT: From here
As it turns out, the best free
implementation that I've found for F#
users so far is still the fantastic
FFTW library. Their site has a
precompiled Windows DLL. I've written
minimal bindings that allow
thread-safe access to FFTW from F#,
with both guru and simple interfaces.
Performance is excellent, 32-bit
Windows XP Pro is only up to 35%
slower than 64-bit Linux.
Now I'm sure you can call F# lib from VB.net, C# etc, that should be in their docs

If I understood well from your description, what you have is a signal which is a combination of a sine plus a constant plus some random glitches. Say, like
x[n] = A*sin(f*n + phi) + B + N[n]
where N[n] is the "glitch" noise you want to get rid of.
If the glitches are one-sample long, you can remove them using a median filter which has to be bigger than the glitch length. On both sides of the glitch. Glitches of length 1, mean you will have enough with a median of 3 samples of length.
y[n] = median3(x[n])
The median is computed so: Take the samples of x you want to filter (x[n-1],x[n],x[n+1]), sort them, and your output is the middle one.
Now that the noise signal is away, get rid of the constant signal. I understand the buffer is of a limited and known length, so you can just compute the mean of the whole buffer. Substract it.
Now you have your single sinus signal. You can now compute the fundamental frequency by counting zero crossings. Count the amount of samples above 0 in which the former sample was below 0. The period is the total amount of samples of your buffer divided by this, and the frequency is the oposite (1/x) of the period.

Although I would go with the majority and say that it seems like what you want is an fft solution (fft algorithm is pretty quick), if fft is not the answer for whatever reason you may want to try fitting a sine curve to the data using a fitting program and reading off the fitted frequency.
Using Fityk, you can load the data, and fit to a*sin(b*x-c) where 2*pi/b will give you the frequency after fitting.
Fityk can be used from a gui, from a command-line for scripting and has a C++ API so could be included in your programs directly.

I googled for "basic fft". Visual Basic FFT Your question screams FFT, but be careful, using FFT without understanding even a little bit about DSP can lead results that you don't understand or don't know where they come from.

get the Frequency Analyzer at http://www.relisoft.com/Freeware/index.htm and run it and look at the code.

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas