Conflicts in the training data for Microsoft Custom Translator

I am using Microsoft Custom Translator and providing the training data in TMX format. My training data has some conflicts. For example, I have English-to-German training data where I have duplicate English strings, but the German translations for these duplicate English strings are different. In such cases, how does it affect the model?

As long as one side is different, they are merely alternative translations, which happen all the time. The alternatives will be kept, and influence the probabilities in the resulting model.

I'll expand on the official and approved answer from our esteemed colleague at Microsoft Translator.
Yes, it happens a lot, and yes it will influence the probabilities in the resulting model.
Is that good? It depends.
Yes, some target-side conflicts are legitimate alternatives due to different contexts, especially on short strings, but just as often there are other reasons, and unjustifiable inconsistencies.
It's best to actually look at the target-side conflicts and make an executive decision based on the type of the conflicts and the scenario - the overall dataset, the desired behaviour and the behaviour of the generic system.
There are cases where target-side conflicts in training data are desirable or harmless, but at least as often, they're harmful or strike trade-offs.
For example, missing accent marks, bad encodings, nasty hidden characters or other non-human-readable differences like double-width parentheses, conflicting locales, untranslated segments, changing style guidelines... these are mostly harmful conflicts. One variant could localise units while the other does not. And, often enough, one variant is just a bad translation.
Very often, these direct conflicts - that is, conflicts between segments that have the exact same source, which can be found with a simple script - are a clue about conflicts in the wider dataset, which are harder to find unless you know what you're looking for.
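For illustration, one such simple script, as a minimal Python sketch: it assumes a typical en-to-de TMX layout and a file named training.tmx (both assumptions; adjust to your data), and prints every source segment that has more than one distinct target.

    import xml.etree.ElementTree as ET
    from collections import defaultdict

    # xml:lang lives in the XML namespace
    XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

    targets_by_source = defaultdict(set)
    for tu in ET.parse("training.tmx").getroot().iter("tu"):
        segs = {}
        for tuv in tu.iter("tuv"):
            lang = tuv.get(XML_LANG, "").lower()[:2]  # "en-US" -> "en"
            seg = tuv.find("seg")
            if seg is not None:
                segs[lang] = "".join(seg.itertext()).strip()
        if "en" in segs and "de" in segs:
            targets_by_source[segs["en"]].add(segs["de"])

    # report the direct conflicts: same source, more than one distinct target
    for source, targets in sorted(targets_by_source.items()):
        if len(targets) > 1:
            print(source)
            for target in sorted(targets):
                print("  ->", target)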
Trade-offs exist between more 1:1 translationese and transcreation, between accuracy and fluency. The former has a bad name, but it's less risky and more robust.
The decision could be to drop, resolve or normalise the conflicts, or to go debug the dataset and data pipeline.
Just throwing it all into the black box and mumbling "In Deep Learning We Trust" over Manning and Schütze (1999) three times only makes sense if the scale - the frequency with which you train custom models, not the amount of training data - is so high that basic due diligence is not feasible.
To really know, you may need to train the system with and without the conflicts, and evaluate and compare.
Source-side noise and conflicts, on the other hand, are not even really conflicts and are usually safe and even beneficial to include. And they're still worth peeking at.

Related

Is there a standard way to break down huge optimization models to ensure the submodules are working correctly?

Apologies, as this may be a general question about optimization:
For truly large-scale optimization models, it feels as if the model becomes quite complex and cumbersome before it is even testable. For small-scale optimization problems, up to even 10-20 constraints, it's feasible to code the entire program and just stare at it to debug.
However, for large-scale models with potentially tens to hundreds of constraint equations, it feels as if there should be a way to test subsections of the optimization model before putting the entire thing together.
Imagine you are writing an optimization model for a rocket that needs to land on the moon: the model tells the rocket how to fly itself and land safely. There might be one piece of the model that dictates the gravitational effects of the earth and moon and their orbits, which would influence how the rocket should position itself; another module that dictates how the thrusters should fire in order to maneuver the rocket into the correct positions; and perhaps a final module that dictates how to optimally use the various fuel sources.
Is there a good practice for ensuring that one small section (e.g. the gravitational module) works well independently of the whole model, then iteratively testing the rocket-thruster piece, then the optimal fuel use, and so on? Once you put all the pieces together and the model doesn't solve (perhaps due to missing constraints or variables), it quickly becomes a nightmare to debug.
What are the best practices, if any, for iteratively building and testing large-scale optimization models?
I regularly work on models with millions of variables and equations ("tens to hundreds of constraint equations" is considered small-scale). Luckily, they all have fewer than, say, 50 blocks of similar equations (indexed equations). Obviously, just eyeballing solutions is impossible, so we add a lot of checks (also on the data, which can contain errors). For debugging, it is a very good idea to have a very small data set around. Finally, it helps to have good tools, such as modeling systems with type/domain checking, automatic differentiation, etc.
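To illustrate what such checks can look like, a minimal sketch in plain Python; the field names (demand, capacity, cost) are invented for this example, and a real model would have many more checks:

    def check_data(demand, capacity, cost):
        # reject obviously bad inputs before they ever reach the solver
        for node, d in demand.items():
            assert d >= 0, f"negative demand at {node}"
        for node, c in capacity.items():
            assert c >= 0, f"negative capacity at {node}"
        # a cheap necessary condition for feasibility
        assert sum(capacity.values()) >= sum(demand.values()), \
            "total capacity below total demand: the model will be infeasible"
        for (i, j), c in cost.items():
            assert c >= 0, f"negative cost on arc {i}->{j}"

The same idea scales up: run the checks on the tiny debugging data set first, then on the full data.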
Often we cannot really check equations in isolation because we are dealing with simultaneous equations. The model only makes sense when all equations are present. So "iterative building and testing" is usually not possible for me. Sometimes we keep small stylized models around for documentation and education.

How does 'negative sampling' improve word representation quality in word2vec?

Negative sampling in word2vec improves the training speed; that's obvious!
But why does it "make the word representations significantly more accurate"?
I didn't find any relevant discussion or details. Can you help me?
It's hard to describe what the author of that claim may have meant, without the full context of where it appeared. For example, word-vectors can be optimized for different tasks, and the same options that make word-vectors better for one task might make them worse for another.
One popular way to evaluate word-vectors since Google's original paper & code release is a set of word-analogy problems. These give a nice repeatable summary 'accuracy' percentage, so the author might have meant that for a particular training corpus, on that particular problem, holding other things constant, the negative-sampling mode had a higher 'accuracy' score.
But that wouldn't mean it's always better, with any corpus, or for any other downstream evaluation of quality or accuracy-on-some-task.
Projects with larger corpuses, and especially larger vocabularies (more unique words), tend to prefer the negative-sampling mode. The hierarchical-softmax alternative mode becomes slower as the vocabulary becomes larger, while the negative-sampling mode does not.
And, having a large, diverse corpus, with many subtly-different usage examples of all interesting words, is the most important contributor to really good word-vectors.
So, simply by making larger corpuses manageable, within a limited amount of training time, negative-sampling could be seen as indirectly enabling improved word-vectors - because corpus size is such an important factor.
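For concreteness, a hedged sketch with the gensim library (assuming gensim 4.x, where the dimensionality parameter is called vector_size; older versions call it size), training the same toy corpus once per mode so the resulting vectors can be compared on whatever evaluation you care about:

    from gensim.models import Word2Vec

    sentences = [["the", "quick", "brown", "fox"],
                 ["the", "lazy", "dog"]]  # replace with your tokenized corpus

    # negative sampling: 5 noise words per positive example
    neg_model = Word2Vec(sentences, vector_size=100, min_count=1,
                         negative=5, hs=0)

    # hierarchical softmax: no noise words, a binary tree over the vocabulary
    hs_model = Word2Vec(sentences, vector_size=100, min_count=1,
                        negative=0, hs=1)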

Definitions of Phenotype and Genotype

Can someone help me understand the definitions of phenotype and genotype in relation to evolutionary algorithms?
Am I right in thinking that the genotype is a representation of the solution, and the phenotype is the solution itself?
Thanks
Summary: For simple systems, yes, you are completely right. As you get into more complex systems, things get messier.
That is probably all most people reading this question need to know. However, for those who care, there are some weird subtleties:
People who study evolutionary computation use the words "genotype" and "phenotype" frustratingly inconsistently. The only rule that holds true across all systems is that the genotype is a lower-level (i.e. less abstracted) encoding than the phenotype. A consequence of this rule is that there can generally be multiple genotypes that map to the same phenotype, but not the other way around. In some systems, there are really only the two levels of abstraction that you mention: the representation of a solution and the solution itself. In these cases, you are entirely correct that the former is the genotype and the latter is the phenotype.
This holds true for:
Simple genetic algorithms where the solution is encoded as a bitstring (see the sketch after this list).
Simple evolution strategies, where a real-valued vector is evolved and the numbers are plugged directly into the function being optimized.
A variety of other systems where there is a direct mapping between solution encodings and solutions.
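As a toy illustration of those two levels (all names here are invented, not from any particular library): the genotype is a bitstring, and the phenotype is the number it decodes to, which is what fitness, and hence selection, acts on.

    def decode(genotype: str) -> float:
        """Map a bitstring genotype to a real number in [0, 1] (the phenotype)."""
        return int(genotype, 2) / (2 ** len(genotype) - 1)

    def fitness(phenotype: float) -> float:
        # selection acts on the phenotype; here we maximize f(x) = x * (1 - x)
        return phenotype * (1 - phenotype)

    genotype = "101101"
    phenotype = decode(genotype)  # 45 / 63, about 0.714
    print(fitness(phenotype))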
But as we get to more complex algorithms, this starts to break down. Consider a simple genetic program, in which we are evolving a mathematical expression tree. The number that the tree evaluates to depends on the input that it receives. So, while the genotype is clear (it's the series of nodes in the tree), the phenotype can only be defined with respect to specific inputs. That isn't really a big problem - we just select a set of inputs and define phenotype based on the set of corresponding outputs. But it gets worse.
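A toy sketch of that expression-tree case before moving on (the tuple encoding is invented for illustration): the genotype is the tree itself; the phenotype only materializes once we fix a set of inputs.

    # genotype: an expression tree, encoded as nested tuples
    def evaluate(node, x):
        if node == "x":
            return x
        if isinstance(node, (int, float)):
            return node
        op, left, right = node
        a, b = evaluate(left, x), evaluate(right, x)
        return a + b if op == "+" else a * b

    genotype = ("+", ("*", "x", "x"), 1)  # encodes x*x + 1
    inputs = [0, 1, 2, 3]
    phenotype = [evaluate(genotype, x) for x in inputs]  # [1, 2, 5, 10]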
As we continue to look at more complex algorithms, we reach cases where there are no longer just two levels of abstraction. Evolutionary algorithms are often used to evolve simple "brains" for autonomous agents. For instance, say we are evolving a neural network with NEAT. NEAT very clearly defines what the genotype is: a series of rules for constructing the neural network. And this makes sense - that is the lowest-level encoding of an individual in this system. Stanley, the creator of NEAT, goes on to define the phenotype as the neural network encoded by the genotype. Fair enough - that is indeed a more abstract representation. However, there are others who study evolved brain models who classify the neural network as the genotype and the behavior as the phenotype. That is also completely reasonable - the behavior is perhaps an even better phenotype, because it's the thing selection actually acts on.
Finally, we arrive at the systems with the least definable genotypes and phenotypes: open-ended artificial life systems. The goal of these systems is basically to create a rich world that will foster interesting evolutionary dynamics. Usually the genotype in these systems is fairly easy to define - it's the lowest level at which members of the population are defined. Perhaps it's a ring of assembly code, as in Avida, or a neural network, or some set of rules, as in Geb. Intuitively, the phenotype should capture something about what a member of the population does over its lifetime. But each member of the population does a lot of different things. So ultimately, in these systems, phenotypes tend to be defined differently based on what is being studied in a given experiment. While this may seem questionable at first, it is essentially how phenotypes are discussed in evolutionary biology as well. At some point, a system is complex enough that you just need to focus on the part you care about.

Optimizing assignments based on many variables

I was recently talking with someone in Resource Management and we discussed the problem of assigning developers to projects when there are many variables to consider (of possibly different weights), e.g.:
The developer's skills & the technology/domain of the project
The developer's travel preferences & the location of the project
The developer's interests and the nature of the project
The basic problem the RM person had to deal with on a regular basis was this: given X developers, where each developer has a unique set of attributes/preferences, assign them to Y projects, where each project has its own set of unique attributes/requirements.
It seems clear to me that this is a very mathematical problem; it reminds me of old optimization problems from algebra and/or calculus (I don't remember which) back in high school: you know, find the optimal dimensions for a container to hold the maximum volume given this amount of material—that sort of thing.
My question isn't about the math, but rather whether there are any software projects/libraries out there designed to address this kind of problem. Does anyone know of any?
In my humble opinion, I think that this is putting the cart before the horse. You first need to figure out what problem you want to solve. Then, you can look for solutions.
For example, if you formulate the problem by assigning some kind of numerical compatibility score to every developer/project pair with the goal of maximizing the total sum of compatibility scores, then you have a maximum-weight matching problem which can be solved with the Hungarian algorithm. Conveniently, this algorithm is implemented as part of Google's or-tools library.
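As a sketch of that first formulation, here it is with SciPy's linear_sum_assignment solver (an alternative if you don't want the full or-tools dependency); the compatibility scores are made up:

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    # score[i][j] = compatibility of developer i with project j (invented data)
    score = np.array([[7, 2, 5],
                      [3, 8, 1],
                      [4, 4, 6]])

    dev_idx, proj_idx = linear_sum_assignment(score, maximize=True)
    for d, p in zip(dev_idx, proj_idx):
        print(f"developer {d} -> project {p} (score {score[d, p]})")
    print("total score:", score[dev_idx, proj_idx].sum())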
On the other hand, let's say that you find computing compatibility scores to be infeasible or unreasonable. Instead, suppose that each developer ranks all the projects from best to worst (e.g. in terms of preference) and, similarly, each project ranks each developer from best to worst (e.g. in terms of suitability to the project). In this case, you have an instance of the Stable Marriage problem, which is solved by the Gale-Shapley algorithm. I don't have a pointer to an established library for G-S, but it's simple enough that lots of people just code their own.
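In that spirit, a minimal Gale-Shapley sketch (developers propose; projects accept or reject). It assumes equal numbers on both sides and complete preference lists, and all identifiers are illustrative:

    def gale_shapley(dev_prefs, proj_prefs):
        # dev_prefs[d] = d's projects, best first; proj_prefs[p] = p's developers, best first
        rank = {p: {d: i for i, d in enumerate(prefs)}
                for p, prefs in proj_prefs.items()}
        free = list(dev_prefs)              # developers still unmatched
        next_choice = {d: 0 for d in dev_prefs}
        match = {}                          # project -> developer
        while free:
            d = free.pop()
            p = dev_prefs[d][next_choice[d]]
            next_choice[d] += 1
            if p not in match:
                match[p] = d
            elif rank[p][d] < rank[p][match[p]]:  # p prefers d to its current developer
                free.append(match[p])
                match[p] = d
            else:
                free.append(d)              # rejected; d proposes elsewhere next time
        return {d: p for p, d in match.items()}

    pairs = gale_shapley(
        dev_prefs={"ann": ["web", "etl"], "bob": ["web", "etl"]},
        proj_prefs={"web": ["bob", "ann"], "etl": ["ann", "bob"]},
    )
    print(pairs)  # {'bob': 'web', 'ann': 'etl'}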
Yes, there are mathematical methods for solving a type of problem which this problem can be shoehorned into. It is the natural consequence of thinking of developers as "resources", like machine parts, largely interchangeable, their individuality easily reduced to simple numerical parameters. You can make up rules such as
The fitness value is equal to the subject skill parameter multiplied by the square root of the reliability index.
and never worry about them again. The same rules can be applied to different developers, different subjects, different scales of projects (with a SLOC scaling factor of, say, 1.5). No insight or real leadership is needed, the equations make everything precise and "assured". The best thing about this approach is that when the resources fail to perform the way your equations say they should, you can just reduce their performance scores to make them fit. And if someone has already written the tool, then you don't even have to worry about the math.
(It is interesting to note that Resource Management people always seem to impose such metrics on others in an organization -- thereby making their own jobs easier -- and never on themselves...)

Looking for ideas/references/keywords: adaptive-parameter-control of a search algorithm (online-learning)

I'm looking for ideas/experiences/references/keywords regarding adaptive parameter control of search algorithm parameters (online learning) in combinatorial optimization.
A bit more detail:
I have a framework which is responsible for optimizing a hard combinatorial optimization problem. This is done with the help of some "small heuristics" which are used in an iterative manner (large neighborhood search; ruin-and-recreate approach). Each of these "small heuristics" takes some external parameters, which control the heuristic's logic to some extent (at the moment: just random values; some kind of noise; to diversify the search).
Now I want to have a control framework for choosing these parameters in a convergence-improving way, as general as possible, so that new heuristics can later be added without changing the parameter control.
There are at least two general decisions to make:
A: Choose the algorithm pair (one destroy and one rebuild algorithm) to be used in the next iteration.
B: Choose the random parameters of the algorithms.
The only feedback is an evaluation function of the newly found solution. That leads me to the topic of reinforcement learning. Is that the right direction?
Not really learning-like behavior, but my simplistic ideas at the moment are:
A: Roulette-wheel selection according to some performance value collected during the iterations (the near past is valued more than the older past); a sketch follows below.
So if heuristic 1 found all the new global best solutions -> high probability of choosing this one.
B: No idea yet. Maybe it's possible to use some non-uniform random values in the range (0,1) while collecting some momentum from the changes.
So if heuristic 1 last time used alpha = 0.3 and found no new best solution, then used 0.6 and found a new best solution -> there is momentum towards 1
-> the next random value is likely to be bigger than 0.3. Possible problem: oscillation!
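A hypothetical sketch of idea A, with an exponentially decayed score so the near past counts more than the older past (the decay and floor constants are arbitrary choices):

    import random

    class HeuristicSelector:
        def __init__(self, names, decay=0.9, floor=0.05):
            self.scores = {n: 1.0 for n in names}  # optimistic start for every heuristic
            self.decay, self.floor = decay, floor

        def pick(self):
            # roulette wheel: probability proportional to current score
            r = random.uniform(0, sum(self.scores.values()))
            for name, s in self.scores.items():
                r -= s
                if r <= 0:
                    return name
            return name  # guard against floating-point leftovers

        def feedback(self, name, reward):
            # decay all scores, then credit the heuristic that just ran;
            # the floor keeps every heuristic selectable (diversification)
            for n in self.scores:
                self.scores[n] = max(self.floor, self.decay * self.scores[n])
            self.scores[name] += reward

Each iteration would call pick(), run the chosen heuristic, and pass the observed improvement (e.g. 1.0 for a new global best, 0 otherwise) to feedback(). This is essentially the adaptive weight adjustment used in adaptive large neighborhood search (ALNS), a keyword worth searching for.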
Some things to note:
- The parameters needed for good convergence of one specific algorithm can change dramatically -> maybe more diversifying operations are needed at the beginning, more intensifying operations at the end.
- There is a possibility of good synergistic effects in a specific pair of destroy/rebuild algorithms (sometimes called coupled neighborhoods). How would one recognize something like that? Is that still in the reinforcement-learning area?
- The different algorithms are controlled by a different number of parameters (some taking 1, some taking 3).
Any ideas, experiences, references (papers), or keywords (ML topics)?
If there are ideas regarding decision (B) in an offline-learning manner, don't hesitate to mention them.
Thanks for all your input.
Sascha
You have a set of parameter variables which you use to control your set of algorithms. Selection of your algorithms is just another variable.
One approach you might like to consider is to evolve your 'parameter space' using a genetic algorithm. In short, a GA uses an analogue of the processes of natural selection to successively breed ever better solutions.
You will need to develop an encoding scheme to represent your parameter space as a string, and then create a large population of candidate solutions as your starting generation. The genetic algorithm itself takes the fittest solutions in your set and then applies various genetic operators to them (mutation, reproduction etc.) to breed a better set which then become the next generation.
The most difficult part of this process is developing an appropriate fitness function: something to quantitatively measure the quality of a given parameter space. Your search problem may be too complex to measure for each candidate in the population, so you will need a proxy model function which might be as hard to develop as the ideal solution itself.
Without understanding more of what you've written, it's hard to say whether this approach is viable or not. GAs are usually well suited to multi-variable optimisation problems like this, but they're not a silver bullet. For a reference, start with Wikipedia.
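To make that concrete, a rough sketch under stated assumptions: the genome is a fixed-length vector of parameters in (0,1), run_search is your hypothetical (proxy) fitness function, and the genetic operators are the simplest possible choices.

    import random

    def random_genome(n=3):
        return [random.random() for _ in range(n)]  # n parameters in (0, 1)

    def mutate(g, sigma=0.1):
        return [min(1.0, max(0.0, x + random.gauss(0, sigma))) for x in g]

    def crossover(a, b):
        return [random.choice(pair) for pair in zip(a, b)]  # uniform crossover

    def evolve(fitness, pop_size=20, generations=30):
        pop = [random_genome() for _ in range(pop_size)]
        for _ in range(generations):
            pop.sort(key=fitness, reverse=True)      # best first
            parents = pop[: pop_size // 2]           # truncation selection
            children = [mutate(crossover(random.choice(parents),
                                         random.choice(parents)))
                        for _ in range(pop_size - len(parents))]
            pop = parents + children
        return max(pop, key=fitness)

    # best = evolve(lambda g: run_search(*g))  # run_search: your proxy fitness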
What you're trying to do sounds like hyper-heuristics. Try searching for that keyword.
In Drools Planner (open source, Java) I have support for tabu search and simulated annealing out of the box.
I haven't implemented the ruin-and-recreate approach (yet), but that should be easy, although I am not expecting better results. Challenge: prove me wrong, fork it, add it, and beat me in the examples.
Hyper-heuristics are on my TODO list.