Understanding spacy textcat_multilabel scorer output

I'm trying to understand the output of my textcat_multilabel job. I have 4 text categories and I'm using spacy version 3.2.0 (The methodologies have changed a lot recently and I don't really understand the documentation).
E    #      LOSS TEXTC...  CATS_SCORE  SCORE
0    0      1.00           51.86       0.52
0    200    122.15         52.90       0.53
This is what I have in my config file. (btw. What is v1?)
scorer = {"@scorers":"spacy.textcat_multilabel_scorer.v1"}
threshold = 0.5
In fact, everything in the standard config file is unchanged from the suggested defaults, except the dropout, which I increased to 0.5.
The final row of my job shows these values: 0 8400 2.59 87.29 0.87
I am very impressed with the results that I'm getting with this job. Just need to understand what I'm doing.

E is epochs
# is training iterations / batches (see here)
LOSS_TEXTCAT is the loss of your textcat component. Loss normally fluctuates the first few iterations and then trends downward. The exact values are meaningless.
CATS_SCORE is the score of your textcat component on your dev set; see the docs for details on which metric is reported there.
SCORE is the overall score of your pipeline, a weighted average over the components you have. Since you only have a textcat, it's basically the same as that score (just on a 0-1 scale).
v1 is just version 1; registered functions and components are versioned in case they are updated later, so older versions keep working with newer code.
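If you want to recompute these metrics yourself after training, here is a minimal Python sketch; the model path, the four labels and the dev text below are placeholders for your own data, not anything taken from your project:

import spacy
from spacy.training import Example

nlp = spacy.load("output/model-best")  # path to your trained pipeline (placeholder)

# Placeholder dev data: text plus gold flags for your 4 categories.
dev_data = [
    ("some dev text", {"cats": {"LABEL_A": 1.0, "LABEL_B": 0.0,
                                "LABEL_C": 1.0, "LABEL_D": 0.0}}),
]
examples = [Example.from_dict(nlp.make_doc(text), annots)
            for text, annots in dev_data]

scores = nlp.evaluate(examples)
print(scores["cats_score"])       # the metric behind CATS_SCORE, on a 0-1 scale
print(scores["cats_score_desc"])  # tells you which metric that is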

Related

How to set a minimum number of epoch in Optuna SuccessiveHalvingPruner()?

I'm using Optuna 2.5 to optimize a couple of hyperparameters on a tf.keras CNN model. I want to use pruning so that the optimization skips the less promising corners of the hyperparameter space. I'm using something like this:
import optuna
from optuna.samplers import TPESampler

study0 = optuna.create_study(
    study_name=study_name,
    storage=storage_name,
    direction='minimize',
    sampler=TPESampler(n_startup_trials=25, multivariate=True, seed=123),
    pruner=optuna.pruners.SuccessiveHalvingPruner(
        min_resource='auto',
        reduction_factor=4,
        min_early_stopping_rate=0),
    load_if_exists=True)
Sometimes the model stops after 2 epochs, other times after 12 epochs, 48, and so forth. What I want is to ensure that the model always trains for at least 30 epochs before being pruned. I guess that the parameter min_early_stopping_rate might have some control over this, but when I changed it from 0 to 30, the models never got pruned at all. Can someone explain, a bit better than the Optuna documentation does, what the parameters of SuccessiveHalvingPruner() really do (especially min_early_stopping_rate)?
Thanks
The documentation's explanation of min_resource says:
A trial is never pruned until it executes min_resource * reduction_factor ** min_early_stopping_rate steps.
So I suppose we need to replace min_resource's 'auto' value with a specific number, chosen together with reduction_factor and min_early_stopping_rate.
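For example, under that reading of the docs: with reduction_factor=4 and min_early_stopping_rate=0, setting min_resource=30 means no trial can be pruned before 30 * 4**0 = 30 reported steps, i.e. 30 epochs if you report once per epoch. A sketch (not tested against your setup):

import optuna
from optuna.pruners import SuccessiveHalvingPruner
from optuna.samplers import TPESampler

# min_resource * reduction_factor ** min_early_stopping_rate = 30 * 4**0 = 30,
# so no trial is pruned before 30 reported steps (epochs).
pruner = SuccessiveHalvingPruner(min_resource=30,
                                 reduction_factor=4,
                                 min_early_stopping_rate=0)
study = optuna.create_study(direction='minimize',
                            sampler=TPESampler(seed=123),
                            pruner=pruner)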

Initial jump in loss with TensorFlow

Suppose I have a saved model that is nearly at the minimum, but with some room for improvement. For example, the loss (as reported by tf.keras.Model.evaluate()) might be 11.390, and I know that the model can go down to 11.300.
The problem is that attempts to refine this model (using tf.keras.Model.fit()) consistently result in the weights receiving an initial 'jolt' during the first epoch, which sends the loss way upwards. After that it starts to decrease, but it does not always converge on the correct minimum (and may not even get back to where it started).
It looks like this:
tf.train.RMSPropOptimizer(0.0002):
epoch  loss
0      11.982
1      11.864
2      11.836
3      11.822
4      11.809
5      11.791
(...)
15     11.732

tf.train.AdamOptimizer(0.001):
epoch  loss
0      14.667
1      11.483
2      11.400
3      11.380
4      11.371
5      11.365

tf.keras.optimizers.SGD(0.00001):
epoch  loss
0      12.288
1      11.760
2      11.699
3      11.650
4      11.666
5      11.601
Dataset with 30M observations, batch size 500K in all cases.
I can mitigate this by turning the learning rate way down, but then it takes forever to converge.
Is there any way to prevent training from going "wild" at the beginning, without impacting the long-term convergence rate?
As you tried, decreasing the learning rate is the way to go.
E.g. learning rate = 0.00001
tf.train.AdamOptimizer(0.00001)
Especially with Adam that should be promising, since the learning rate is at the same time an upper bound on the step size.
On top of that you could try learning rate scheduling, where you set the learning rate according to a predefined schedule (see the sketch after this answer).
Also, judging from the numbers you show with the decreased learning rate, the convergence rate does not look too bad.
Another hyperparameter you could tune in your case is the batch size; reducing it decreases the computational cost per update.
Note:
I find the term "not the right minimum" rather misleading. To further understand non-convex optimization for artificial neural networks, I would like to point to the Deep Learning book by Ian Goodfellow et al.
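A minimal sketch of the scheduling idea in tf.keras; the decay numbers are illustrative, not tuned for this model:

import tensorflow as tf

# Start from a small learning rate and decay it further, so the first updates
# can't "jolt" the nearly converged weights. All numbers are illustrative.
schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=1e-5,
    decay_steps=10_000,
    decay_rate=0.9,
)
optimizer = tf.keras.optimizers.Adam(learning_rate=schedule)

# Then recompile the saved model with this optimizer before fitting, e.g.
#   model.compile(optimizer=optimizer, loss=your_loss)
#   model.fit(dataset, epochs=15)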

Basic rules for custom cluster configuration when using distributed learning in Cloud ML

I am investigating the use of custom scale tiers in the Cloud Machine Learning API.
Now, I don't know precisely how to design my custom tiers! I basically use a CIFAR-type model, and I decided to use:
import yaml

if args.distributed:
    config['trainingInput']['scaleTier'] = 'CUSTOM'
    config['trainingInput']['masterType'] = 'complex_model_m'
    config['trainingInput']['workerType'] = 'complex_model_m'
    config['trainingInput']['parameterServerType'] = 'large_model'
    config['trainingInput']['workerCount'] = 12
    config['trainingInput']['parameterServerCount'] = 4

with open('custom_config.yaml', 'w') as f:
    yaml.dump(config, f)
But I can hardly find any information on how to properly dimension the cluster. Are there any "rules of thumb" out there? Or do we have to try and test?
Many thanks in advance!
I have done a couple of small experiments, which might be worth sharing. My setup wasn't 100% clean, but I think the rough idea is correct.
The model looks like the cifar example, but with a lot of training data. I use averaging and decaying gradients, as well as dropout.
The "config" naming is (hopefully) explicit: basically 'M{masterCost}_PS{nParameterServer}x{parameterServerCost}_W{nWorker}x{workerCost}'. For parameter servers, I always use "large_model".
The "speed" is the 'global_step/s'.
The "cost" is the total number of ML units.
And I call "efficiency" the number of 'global_step/s' per ML unit.
Here are some partial results:
   config          cost  speed  efficiency
0  M1                 1    0.5        0.50
1  M6_PS1x3_W2x6     21   10.0        0.48
2  M6_PS2x3_W2x6     24   10.0        0.42
3  M3_PS1x3_W3x3     15   11.0        0.73
4  M3_PS1x3_W5x3     21   15.9        0.76
5  M3_PS2x3_W4x3     21   15.1        0.72
6  M2_PS1x3_W5x2     15    7.7        0.51
I know that I should run many more experiments, but I have no time for this.
If I have time, I will dig in deeper.
The main conclusions are:
it might be worth trying a few setups for a small number of iterations, just to decide which config to use before moving on to hyperparameter tuning.
what is good is that the variation is quite limited: from 0.5 to 0.75 is a 50% efficiency increase, which is significant but not explosive.
for my specific problem, large and expensive units are overkill; the best value I can get is with "complex_model_m".
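To make the "efficiency" metric concrete, this is how the column above is computed from cost and speed (it just reproduces the partial results, nothing new):

# efficiency = global_step/s per ML unit, from the partial results above
results = {
    "M1": (1, 0.5),
    "M6_PS1x3_W2x6": (21, 10.0),
    "M6_PS2x3_W2x6": (24, 10.0),
    "M3_PS1x3_W3x3": (15, 11.0),
    "M3_PS1x3_W5x3": (21, 15.9),
    "M3_PS2x3_W4x3": (21, 15.1),
    "M2_PS1x3_W5x2": (15, 7.7),
}
for name, (cost, speed) in results.items():
    print(f"{name:16s} efficiency = {speed / cost:.2f}")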

In Tensorflow, how to convert scores from neural net into discrete values as a part of learning process

Hello fellow tensorflowians!
I have a following schema:
I input some continuous variables (actually, word embeddings I took from Google word2vec), and I am trying to predict an output that can be considered continuous as well as discrete (sorry, mathematicians! it really depends on one's training goal).
The output takes values from 0 to 1000 with an interval of 0.25 (or a precision hyperparameter), so: 0, 0.25, 0.50, ..., 100.0.
I know that it is not possible to include something like tf.to_int (I can omit fractions if necessary) or tf.round, because these are not differentiable, so we can't backpropagate through them. However, I feel that there is some solution that lets the network "know" that it is searching for rounded solutions: small fractions of integers like 0.25, 5.75. But I actually don't even know where to look. I looked at quantization, but that seems to be a bit of an overkill.
So my question is:
How do I inform the graph that we don't accept values below 0.0? Would taking abs of the network output "logits" (regression predictions) be worth considering? If not, can I modify the loss term to severely punish scores below 0 and use absolute error instead of squared error? I may not be aware of the full consequences of doing that.
I don't care whether a prediction of 4.5 comes out as 4.49999 or 4.4, because I round predictions to the nearest .25 to compute accuracy, and that's my final model evaluation metric. If so, can I use something like this?
precision = 0.01  # so that sqrt(precision) == 0.1
loss = tf.reduce_mean(tf.maximum(0.0, tf.square(tf.subtract(logits, targets)) - precision))
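As a sketch of the evaluation-time rounding described above (the tensors here are toy stand-ins for the network outputs and targets):

import tensorflow as tf

step = 0.25
logits = tf.constant([4.49999, 0.30, -0.10])   # toy predictions
targets = tf.constant([4.50, 0.25, 0.00])      # toy gold values on the 0.25 grid

# Snap predictions to the 0.25 grid and clip negatives; this happens only at
# evaluation time, outside the differentiable loss.
rounded = tf.round(logits / step) * step
clipped = tf.maximum(rounded, 0.0)
accuracy = tf.reduce_mean(tf.cast(tf.equal(clipped, targets), tf.float32))
print(accuracy.numpy())  # 1.0 for this toy example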

What is meaning of "parameter optimization of SVM by PSO"?

I can change the parameters C and epsilon manually to obtain an optimised result, but I found that there is "parameter optimization of SVM by PSO" (or any other optimization algorithm), with no algorithm actually given. What does it mean: how can PSO automatically optimize the SVM parameters? I read several papers on this topic, but I'm still not sure.
Particle Swarm Optimization is a technique that uses the ML parameters (SVM parameters, in your case) as its features.
Each "particle" in the swarm is characterized by those parameter values. For instance, you might have initial coordinates of
     degree  epsilon  gamma  C
p1   3       0.001    0.25   1.0
p2   3       0.003    0.20   0.9
p3   2       0.0003   0.30   1.2
p4   4       0.010    0.25   0.5
...
pn   ...
The "fitness" of each particle (p1-p4 shown here out of a population of n particles) is measured by the accuracy of the resulting model: the PSO algorithm trains and tests a model for each particle, returning that model's error rate as the value analogous to that from the training loss function (which it how the value is computed).
On each iteration, particles move toward the fittest neighbours. The process repeats until a maximum (hopefully the global one) appears as a convergence point. This process is simply one from the familiar gradient descent family.
There are two basic PSO variants. In gbest (global best), every particle affects every other particle, sort of a universal gravitation principle. It converges quickly, but may well miss a global max in favor of a local max that happened to be nearer to the swarm's original center. In lbest (local best), a particle responds to only its k closest neighbors. This can form localized clusters; it converges more slowly, but is more likely to find the global max in a non-convex space.
I'll try to briefly explain enough to answer your clarification questions. If that doesn't work, I'm afraid you'll probably have to find someone to discuss this in front of a white board.
To use PSO, you have to decide which SVM parameters you'll try to optimize, and how many particles you want to use. PSO is a meta-algorithm, so its features are the SVM parameters. The PSO parameters are the population (how many particles you want to use), the update neighbourhood (the lbest size and a distance function; gbest is the all-inclusive case), and the velocity (a learning rate for the SVM parameters).
For a bit of illustration, let's assume the particle table above, extended to a population of 20 particles. We'll use lbest with a neighbourhood of 4, and a velocity of 0.1. We choose (randomly, in a grid, or however we think might give us nice results) the initial values of degree, epsilon, gamma, and C for each of the 20 particles.
Each iteration of PSO works like this:
# Train the model described by each particle's "position"
for each of the 20 particles:
    train an SVM with the SVM input and that particle's parameters
    test the SVM; return the error rate as the PSO loss-function value
# Update the particle positions
for each of the 20 particles:
    find the nearest 4 neighbours (using the PSO distance function)
    identify the neighbour with the lowest loss (the SVM's error rate)
    adjust this particle's features (degree, epsilon, gamma, C) 0.1 of the way
    toward that neighbour's features; 0.1 is our learning rate / velocity.
    (Yes, I realize that changing degree, a discrete value, won't really work
    without a special case in the update routine.)
Continue iterating through PSO until the particles have converged to your liking.
gbest is simply lbest with an infinite neighbourhood; in that case, you don't need a distance function on the particle space.
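To make this concrete, here is a deliberately simplified gbest-style sketch in Python that follows the update rule above (move a fraction of the way toward the best particle). The dataset, parameter bounds and constants are illustrative assumptions, and scikit-learn's SVC stands in for whatever SVM you are tuning:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X, y = load_iris(return_X_y=True)

n_particles, n_iters, velocity = 10, 20, 0.1
# Each particle's "position" is (log10(C), log10(gamma)), drawn in a box.
pos = rng.uniform([-2.0, -4.0], [2.0, 0.0], size=(n_particles, 2))

def loss(p):
    # The PSO "loss" is the model's error rate: 1 - cross-validated accuracy.
    C, gamma = 10 ** p[0], 10 ** p[1]
    return 1.0 - cross_val_score(SVC(C=C, gamma=gamma), X, y, cv=3).mean()

for _ in range(n_iters):
    losses = np.array([loss(p) for p in pos])
    best = pos[losses.argmin()].copy()      # gbest: every particle sees the best one
    pos += velocity * (best - pos)          # move 0.1 of the way toward it

print("best C = %.3g, gamma = %.3g, error = %.3f"
      % (10 ** best[0], 10 ** best[1], losses.min()))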