Basic rules for custom cluster configuration when using distributed learning in Cloud ML - tensorflow

I am investigating the use of custom scale tiers in the Cloud Machine Learning API.
Now, I don't know precisely how to design my custom tiers! I basically use a CIFAR-type model, and I decided to use:
import yaml

# 'args' and 'config' are defined earlier in the script
if args.distributed:
    config['trainingInput']['scaleTier'] = 'CUSTOM'
    config['trainingInput']['masterType'] = 'complex_model_m'
    config['trainingInput']['workerType'] = 'complex_model_m'
    config['trainingInput']['parameterServerType'] = 'large_model'
    config['trainingInput']['workerCount'] = 12
    config['trainingInput']['parameterServerCount'] = 4

with open('custom_config.yaml', 'w') as f:
    yaml.dump(config, f)
But I can hardly find any information on how to properly dimension the cluster. Are there any "rules of thumb" out there, or do we have to try and test?
Many thanks in advance!

I have done a couple of small experiments, which might be worth sharing. My setup wasn't 100% clean, but I think the rough idea is correct.
The model looks like the CIFAR example, but with a lot of training data. I use averaging and a decaying gradient, as well as dropout.
The "config" naming is (hopefully) self-explanatory: basically 'M{masterCost}_PS{nParameterServer}x{parameterServerCost}_W{nWorker}x{workerCost}'. For parameter servers, I always use "large_model".
The "speed" is the 'global_step/s'.
The "cost" is the total number of ML units.
And I call "efficiency" the number of 'global_step/s' per ML unit.
Here are some partial results:
   config          cost  speed  efficiency
0  M1                 1    0.5        0.50
1  M6_PS1x3_W2x6     21   10.0        0.48
2  M6_PS2x3_W2x6     24   10.0        0.42
3  M3_PS1x3_W3x3     15   11.0        0.73
4  M3_PS1x3_W5x3     21   15.9        0.76
5  M3_PS2x3_W4x3     21   15.1        0.72
6  M2_PS1x3_W5x2     15    7.7        0.51
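Just to make the metric explicit, here is a minimal sketch of the efficiency computation (efficiency = speed / cost, with two rows copied from the table above):

# efficiency = global_step/s per ML unit; values copied from the table above
results = {
    "M1":            {"cost": 1,  "speed": 0.5},
    "M3_PS1x3_W5x3": {"cost": 21, "speed": 15.9},
}
for name, r in results.items():
    print(name, round(r["speed"] / r["cost"], 2))   # 0.5 and 0.76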
I know that I should run many more experiments, but I don't have time for this. If I find the time, I will dig deeper.
The main conclusions are:
It might be worth trying a few setups over a small number of iterations, just to decide which config to use before going on to hyperparameter tuning.
What is good is that the variation is quite limited. From 0.5 to 0.75 is a 50% efficiency increase, which is significant but not explosive.
For my specific problem, large and expensive units are basically overkill. The best value I can get is using "complex_model_m".


Understanding spacy textcat_multilabel scorer output

I'm trying to understand the output of my textcat_multilabel job. I have 4 text categories and I'm using spaCy version 3.2.0 (the methodologies have changed a lot recently and I don't really understand the documentation).
E    #     LOSS TEXTC...  CATS_SCORE  SCORE
0    0     1.00           51.86       0.52
0    200   122.15         52.90       0.53
This is what I have in my config file. (By the way, what is v1?)
scorer = {"@scorers":"spacy.textcat_multilabel_scorer.v1"}
threshold = 0.5
In fact, everything in the standard config file is unchanged from the suggestions, except the dropout, which I increased to 0.5.
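For orientation, here is a minimal sketch (not my project; the label names are placeholders) of the same threshold setting passed through the Python API rather than the config file:

import spacy

# build a blank pipeline and add a multilabel text categorizer with the same threshold
nlp = spacy.blank("en")
textcat = nlp.add_pipe("textcat_multilabel", config={"threshold": 0.5})
for label in ["label_a", "label_b", "label_c", "label_d"]:   # placeholder names
    textcat.add_label(label)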
The final row of my job shows these values: 0 8400 2.59 87.29 0.87
I am very impressed with the results that I'm getting with this job. Just need to understand what I'm doing.
E is epochs
# is training iterations / batches (see here)
LOSS_TEXTCAT is the loss of your textcat component. Loss normally fluctuates the first few iterations and then trends downward. The exact values are meaningless.
CATS_SCORE is the score of your textcat component on your dev set; see the docs for some details on that.
SCORE is the overall score of your pipeline, a weighted average of the components you have. But you only have a textcat, so it's basically the same as that score.
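As an illustrative sketch only (the weights and values here are assumptions, with just the textcat contributing), the weighted average works out like this:

# only the textcat contributes, so SCORE ends up equal to CATS_SCORE / 100
score_weights = {"cats_score": 1.0}
component_scores = {"cats_score": 52.90 / 100}   # CATS_SCORE from the log above

overall = (sum(score_weights[k] * component_scores[k] for k in score_weights)
           / sum(score_weights.values()))
print(round(overall, 2))   # 0.53, matching the SCORE column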
v1 is just version 1; components are versioned in case they are updated later, so you can use older versions with newer code.

Why does Optaplanner not allow configuring the decay rate in simulated annealing?

I want to use Simulated Annealing in OptaPlanner, but I am a little baffled by the fact that there is only a setting for the initial temperature and not one for the decay rate. What is the reason for this choice?
The cooldown rate is automatically derived from the timeGradient, which is, simply put, 0.0 at the start, 0.5 at half of the spent time, and 1.0 when all of the spent time is used.
But yes, the classic Simulated Annealing method has 2 parameters (starting temperature and cooldown rate). One could implement such an SA pretty easily by copy-pasting the SimulatedAnnealingAcceptor and configuring it in the AcceptorConfig.
That being said, tuning 2 parameters is a pain for users. That's why OptaPlanner's default SA has only 1 parameter which, together with the termination, is translated into the 2 parameters that SA needs.
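This is not OptaPlanner code, just a sketch of the difference (the linear time-gradient decay below is an assumption for illustration; the exact formula may differ):

def classic_temperature(starting_temperature, decay_rate, step):
    # classic SA: T_k = T_0 * decay_rate ** k  (two parameters to tune)
    return starting_temperature * decay_rate ** step

def time_gradient_temperature(starting_temperature, time_gradient):
    # timeGradient is 0.0 at the start, 0.5 at half the spent time, 1.0 at the end,
    # so only the starting temperature (plus the termination) has to be configured
    return starting_temperature * (1.0 - time_gradient)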

Initial jump in loss with TensorFlow

Suppose I have a saved model that is nearly at the minimum, but with some room for improvement. For example, the loss (as reported by tf.keras.Model.evaluate()) might be 11.390, and I know that the model can go down to 11.300.
The problem is that attempts to refine this model (using tf.keras.Model.fit()) consistently result in the weights receiving an initial 'jolt' during the first epoch, which sends the loss way upwards. After that, it starts to decrease, but it does not always converge on the correct minimum (and may not even get back to where it started).
It looks like this:
tf.train.RMSPropOptimizer(0.0002):
0 11.982
1 11.864
2 11.836
3 11.822
4 11.809
5 11.791
(...)
15 11.732
tf.train.AdamOptimizer(0.001):
0 14.667
1 11.483
2 11.400
3 11.380
4 11.371
5 11.365
tf.keras.optimizers.SGD(0.00001):
0 12.288
1 11.760
2 11.699
3 11.650
4 11.666
5 11.601
Dataset with 30M observations, batch size 500K in all cases.
I can mitigate this by turning the learning rate way down, but then it takes forever to converge.
Is there any way to prevent training from going "wild" at the beginning, without impacting the long-term convergence rate?
As you tried, decreasing the learning rate is the way to go.
E.g. learning rate = 0.00001
tf.train.AdamOptimizer(0.00001)
Especially with Adam that should be promising, since the learning rate is at the same time an upper bound for the step size.
On top of that, you could try learning rate scheduling, where you set the learning rate according to a predefined schedule.
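A minimal sketch of what that could look like (toy model and data, not your setup), starting from a small learning rate with exponential decay:

import numpy as np
import tensorflow as tf

# toy data and model standing in for the saved model being fine-tuned
x = np.random.rand(1000, 8).astype("float32")
y = np.random.rand(1000, 1).astype("float32")
model = tf.keras.Sequential([tf.keras.layers.Dense(16, activation="relu"),
                             tf.keras.layers.Dense(1)])

lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=1e-5,   # start low so the first updates stay small
    decay_steps=1000,
    decay_rate=0.96)

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=lr_schedule),
              loss="mse")
model.fit(x, y, batch_size=100, epochs=5)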
Also, from what you show for the decreased learning rate, the convergence rate does not seem to be too bad.
Another hyperparameter you could tune in your case is the batch size; reducing it decreases the computation cost per update.
Note:
I find the term "not the right minimum" rather misleading. To further understand nonconvex optimization for artificial neural networks, I would point to the Deep Learning book by Ian Goodfellow et al.

Simple voice recognition when whispering

I'm trying to do simple voice-to-text mapping using pocketsphinx. The grammar is very simple, such as:
public <grammar> = (Matt, Anna, Tom, Christine)+ (One | Two | Three | Four | Five | Six | Seven | Eight | Nine | Zero)+ ;
e.g:
Tom Anna Three Three
yields
Tom Anna 33
I adapted the acoustic model (to take into account my foreign accent) and after that I received decent performance (~94% accuracy). I used a training dataset of ~3 minutes.
Right now I'm trying to do the same, but by whispering to the microphone. The accuracy dropped significantly, to ~50% without training. With training for the accent I got ~60%. I tried other things, including denoising and boosting the volume. I read the whole docs, but I was wondering if anyone could answer some questions so I can better understand in which direction I should go to improve performance.
1) In the tutorial you adapt the hub4wsj_sc_8k acoustic model. I guess "8k" is a sampling parameter. When using sphinx_fe you use "-samprate 16000". Was it deliberate to train an 8k model using data with a 16k sampling rate? Why wasn't data with 8k sampling used? Does it have an influence on performance?
2) In Sphinx 4.1 (in comparison to pocketsphinx) there are different acoustic models, e.g. WSJ_8gau_13dCep_16k_40mel_130Hz_6800Hz.jar. Can those models be used with pocketsphinx? Will an acoustic model with 16k sampling typically have better performance with data having a 16k sampling rate?
3) When using data for training, should I use recordings in normal speaking mode (to adapt only to my accent) or in whispering mode (to adapt to whispering and my accent)? I think I tried both scenarios and didn't notice any difference to draw a conclusion from, but I don't know the pocketsphinx internals, so I might be doing something wrong.
4) I used the following script to record the adaptation training and testing data from the tutorial:
for i in `seq 1 20`; do
fn=`printf arctic_%04d $i`;
read sent; echo $sent;
rec -r 16000 -e signed-integer -b 16 -c 1 $fn.wav 2>/dev/null;
done < arctic20.txt
I noticed that each time I hit Control-C, the keypress is audible in the recorded audio, which led to errors. Trimming the audio sometimes helped to correct them, or led to other errors instead. Is there any requirement that each recording has a few seconds of quiet before and after speaking?
5) When accumulating observation counts, are there any settings I can tinker with to improve performance?
6) What's the difference between semi-continuous and continuous models? Can pocketsphinx use continuous models?
7) I noticed that the 'mixture_weights' file from Sphinx4 is much smaller compared to the one you get in pocketsphinx-extra. Does it make any difference?
8) I tried different combinations of removing white noise (using the 'sox' toolkit, e.g. sox noisy.wav filtered.wav noisered profile.nfo 0.1). Depending on the last parameter, sometimes it improved things a little bit (~3%) and sometimes it made them worse. Is it good to remove noise, or is it something pocketsphinx does as well? My environment is quiet; there is only white noise, which I guess can have more impact when the audio is recorded while whispering.
9) I noticed that boosting the volume (gain) alone most of the time only made the performance a little bit worse, even though for humans it was easier to distinguish the words. Should I avoid it?
10) Overall I tried different combinations, and the best result I got is ~65% when only removing noise, so only a slight (5%) improvement. Below are some stats:
//ORIGINAL UNPROCESSED TESTING FILES
TOTAL Words: 111 Correct: 72 Errors: 43
TOTAL Percent correct = 64.86% Error = 38.74% Accuracy = 61.26%
TOTAL Insertions: 4 Deletions: 13 Substitutions: 26
//DENOISED + VOLUME UP
TOTAL Words: 111 Correct: 76 Errors: 42
TOTAL Percent correct = 68.47% Error = 37.84% Accuracy = 62.16%
TOTAL Insertions: 7 Deletions: 4 Substitutions: 31
//VOLUME UP
TOTAL Words: 111 Correct: 69 Errors: 47
TOTAL Percent correct = 62.16% Error = 42.34% Accuracy = 57.66%
TOTAL Insertions: 5 Deletions: 12 Substitutions: 30
//DENOISE, threshold 0.1
TOTAL Words: 111 Correct: 77 Errors: 41
TOTAL Percent correct = 69.37% Error = 36.94% Accuracy = 63.06%
TOTAL Insertions: 7 Deletions: 3 Substitutions: 31
//DENOISE, threshold 0.21
TOTAL Words: 111 Correct: 80 Errors: 38
TOTAL Percent correct = 72.07% Error = 34.23% Accuracy = 65.77%
TOTAL Insertions: 7 Deletions: 3 Substitutions: 28
This processing was done only on the testing data. Should the training data be processed in the same way? I think I tried that, but there was barely any difference.
11) In all these tests I used the ARPA language model. When using JSGF, results were usually much worse (I have the latest pocketsphinx branch). Why is that?
12) Because in each sentence the maximum number would be '999' and there are no more than 3 names, I modified the JSGF and replaced the repetition sign '+' by repeating the content in the parentheses manually. This time the results were much closer to ARPA. Is there any way in the grammar to specify the maximum number of repetitions, like in regular expressions?
13) When using the ARPA model, I generated it by using all possible combinations (since the dictionary is fixed and really small: ~15 words; see the sketch after this list of questions), but during testing I was still sometimes receiving illegal results, e.g. Tom Anna (without any required number). Is there any way to enforce some structure using the ARPA model?
14) Should the dictionary be limited to only those ~15 words, or will a full dictionary only affect speed but not performance?
15) Is modifying the dictionary (phonemes) the way to go to improve recognition when whispering? (I'm not an expert, but when we whisper I guess some words might sound different?)
16) Any other tips on how to improve accuracy would be really helpful!
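For reference, a minimal sketch of what "all possible combinations" might look like as LM training text: up to 3 names followed by up to 3 digit words, one sentence per line. The exact limits and the file name are assumptions.

from itertools import product

names = ["MATT", "ANNA", "TOM", "CHRISTINE"]
digits = ["ONE", "TWO", "THREE", "FOUR", "FIVE",
          "SIX", "SEVEN", "EIGHT", "NINE", "ZERO"]

# write every allowed sentence (1-3 names, then 1-3 digit words) with <s> markers
with open("lm_corpus.txt", "w") as out:
    for n_names in range(1, 4):
        for n_digits in range(1, 4):
            for name_combo in product(names, repeat=n_names):
                for digit_combo in product(digits, repeat=n_digits):
                    out.write("<s> " + " ".join(name_combo + digit_combo) + " </s>\n")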
Regarding whispering: when you whisper, the sound waves don't have meaningful periodic parts (the vibrations that result from your vocal cords resonating during normal speech, but not when whispering). You can try this by putting your finger on your throat while loudly saying 'aaaaaa', and then just whispering it.
AFAIR, acoustic modeling relies a lot on taking the frequency spectrum of the sound to detect peaks (formants) and relate them to phones (like vowels).
Educated guess: when whispering, the spectrum is mostly white noise, slightly shaped by the oral position (tongue, openness of the mouth, etc.), which is enough for humans but not nearly enough to make the peaks distinguishable by a computer.
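To see this for yourself, here is a minimal sketch (not pocketsphinx itself; the file names are placeholders) that compares the magnitude spectrum of a voiced and a whispered recording:

import numpy as np
from scipy.io import wavfile

def magnitude_spectrum(path):
    rate, samples = wavfile.read(path)
    if samples.ndim > 1:                 # keep a single channel if stereo
        samples = samples[:, 0]
    samples = samples.astype(np.float64)
    freqs = np.fft.rfftfreq(len(samples), d=1.0 / rate)
    return freqs, np.abs(np.fft.rfft(samples))

# A voiced vowel shows sharp harmonic peaks (formants stand out clearly);
# a whispered one looks much closer to shaped white noise.
freqs_voiced, mag_voiced = magnitude_spectrum("voiced_a.wav")
freqs_whisper, mag_whisper = magnitude_spectrum("whispered_a.wav")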

Moving beyond R's optim function

I am trying to use R to estimate a multinomial logit model with a manual specification. I have found a few packages that allow you to estimate MNL models here or here.
I've found some other writings on "rolling" your own MLE function here. However, from my digging around, all of these functions and packages rely on the internal optim function.
In my benchmark tests, optim is the bottleneck. Using a simulated dataset with ~16000 observations and 7 parameters, R takes around 90 seconds on my machine. The equivalent model in Biogeme takes ~10 seconds. A colleague who writes his own code in Ox reports around 4 seconds for this same model.
Does anyone have experience with writing their own MLE function or can point me in the direction of something that is optimized beyond the default optim function (no pun intended)?
If anyone wants the R code to recreate the model, let me know - I'll gladly provide it. I haven't provided it since it isn't directly relevant to the problem of optimizing the optim function, and to preserve space...
EDIT: Thanks to everyone for your thoughts. Based on a myriad of comments below, we were able to get R in the same ballpark as Biogeme for more complicated models, and R was actually faster for several smaller / simpler models that we ran. I think the long term solution to this problem is going to involve writing a separate maximization function that relies on a fortran or C library, but am certainly open to other approaches.
Have you tried the nlm() function already? I don't know if it's much faster, but it does improve speed. Also check the options. optim uses a slow algorithm by default. You can gain a > 5-fold speedup by using the Quasi-Newton algorithm (method="BFGS") instead of the default. If you're not too concerned about the last digits, you can also set the tolerance levels of nlm() higher to gain extra speed.
f <- function(x) sum((x-1:length(x))^2)
a <- 1:5
system.time(replicate(500,
optim(a,f)
))
user system elapsed
0.78 0.00 0.79
system.time(replicate(500,
optim(a,f,method="BFGS")
))
user system elapsed
0.11 0.00 0.11
system.time(replicate(500,
nlm(f,a)
))
user system elapsed
0.10 0.00 0.09
system.time(replicate(500,
nlm(f,a,steptol=1e-4,gradtol=1e-4)
))
user system elapsed
0.03 0.00 0.03
Did you consider the material on the CRAN Task View for Optimization?
I am the author of the R package optimParallel, which could be helpful in your case. The package provides parallel versions of the gradient-based optimization methods of optim(). The main function of the package is optimParallel(), which has the same usage and output as optim(). Using optimParallel() can significantly reduce optimization times, as illustrated in the following figure (p is the number of parameters).
See https://cran.r-project.org/package=optimParallel and http://arxiv.org/abs/1804.11058 for more information.
FWIW, I've done this in C-ish, using OPTIF9. You'd be hard-pressed to go faster than that. There are plenty of ways for something to go slower, such as by running an interpreter like R.
Added: From the comments, it's clear that OPTIF9 is used as the optimizing engine. That means that most likely the bulk of the time is spent in evaluating the objective function in R. While it is possible that C functions are being used underneath for some of the operations, there still is interpreter overhead. There is a quick way to determine which lines of code and function calls in R are responsible for most of the time, and that is to pause it with the Escape key and examine the stack. If a statement costs X% of time, it is on the stack X% of the time. You may find that there are operations that are not going to C and should be. Any speedup factor you get this way will be preserved when you find a way to parallelize the R execution.
Added: From the comments, it's clear that OPTIF9 is used as the optimizing engine. That means that most likely the bulk of the time is spent in evaluating the objective function in R. While it is possible that C functions are being used underneath for some of the operations, there still is interpreter overhead. There is a quick way to determine which lines of code and function calls in R are responsible for most of the time, and that is to pause it with the Escape key and examine the stack. If a statement costs X% of time, it is on the stack X% of the time. You may find that there are operations that are not going to C and should be. Any speedup factor you get this way will be preserved when you find a way to parallelize the R execution.