In which scenario do you use chunking instead of full parsing? - text-mining

Chunking or shallow parsing segments a sentence into a sequence of syntactic constituents or chunks, i.e. sequences of adjacent words grouped on the basis of linguistic properties. It is often referred to as an efficient and robust approach to parsing natural language and a popular alternative to full parsing. But in which scenario would chunking be the more appropriate technique over full parsing?

This is nothing more than my own personal bias, but if for some reason you only need to detect noun and/or verb phrases, you might often be better off with chunking. E.g., for document clustering, topic tagging, or simply identifying keywords, NP or VP chunking can be more than sufficient. Also, if you need to work with a language for which no tree-banks exist, you might want to fall back to chunking.
Chunking typically has the advantage of being orders of magnitude faster than deep parsing, although modern (perceptron/neural) parsers are much faster than deep parsers used to be five or ten years ago. However, even today, deep parsing can choke on long sentences. And, obviously, annotating tree-banks to train a deep parser is more costly than annotating NP/VP phrases or even just building a rule-based chunker - particularly if you need to detect phrases in non-English texts.
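To make that concrete, here is a minimal sketch of rule-based NP chunking with NLTK's RegexpParser (assuming the standard NLTK tokenizer and tagger data are installed); this kind of lightweight setup is often enough for keyword or phrase extraction:

    # Minimal NP chunking sketch with NLTK's rule-based RegexpParser.
    # Assumes the 'punkt' and 'averaged_perceptron_tagger' NLTK data are installed.
    import nltk

    # A simple noun-phrase pattern: optional determiner, any adjectives, one or more nouns.
    grammar = "NP: {<DT>?<JJ>*<NN.*>+}"
    chunker = nltk.RegexpParser(grammar)

    sentence = "The quick brown fox jumped over the lazy dog."
    tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
    tree = chunker.parse(tagged)

    # Print every NP chunk found in the sentence.
    for subtree in tree.subtrees(filter=lambda t: t.label() == "NP"):
        print(" ".join(word for word, tag in subtree.leaves()))

Swapping the grammar for a VP pattern, or training a statistical chunker on the CoNLL-2000 chunking data, works the same way - without ever building full parse trees.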


Conflicts in the training data for Microsoft Custom Translator

I am using Microsoft Custom Translator and providing the training data in TMX format. My training data has some conflicts. For example, I have English to German training data where I have duplicate English strings but the German translations are different for these duplicate English strings. In such cases, how does it affect the model?
As long as one side is different, they are merely alternative translations, which happen all the time. The alternatives will be kept, and influence the probabilities in the resulting model.
I'll expand on the official and approved answer from our esteemed colleague at Microsoft Translator.
Yes, it happens a lot, and yes it will influence the probabilities in the resulting model.
Is that good? It depends.
Yes, there are target-side conflicts due to different contexts, especially on short strings, but just as often there are other reasons, and unjustifiable inconsistencies.
It's best to actually look at the target-side conflicts and make an executive decision based on the type of the conflicts and the scenario - the overall dataset, the desired behaviour and the behaviour of the generic system.
There are cases where target-side conflicts in training data are desirable or harmless, but at least as often, they're harmful or strike trade-offs.
For example, missing accent marks, bad encodings, nasty hidden characters or other non-human readable differences like double-width parentheses, conflicting locales, untranslated segments, updating style guidelines... are mostly harmful conflicts. One variant could be localising units while the other does not. And, often enough, one variant is just a bad translation.
Very often, these direct conflicts - that is conflicts between segments that have the same exact source, which can be found with a simple script - are a clue about conflicts in the wider dataset - which are harder to find unless you know what you're looking for.
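For illustration, a minimal sketch of such a script (assuming a TMX file with English source and German target segments, and using only the Python standard library; the file name is a placeholder):

    # Sketch: find source segments that appear more than once with different targets in a TMX file.
    # Language codes and the 'training.tmx' path are placeholders; adjust to your data.
    import xml.etree.ElementTree as ET
    from collections import defaultdict

    XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

    def find_conflicts(tmx_path, src_lang="en", tgt_lang="de"):
        targets_by_source = defaultdict(set)
        root = ET.parse(tmx_path).getroot()
        for tu in root.iter("tu"):
            segs = {}
            for tuv in tu.findall("tuv"):
                lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
                seg = tuv.find("seg")
                if seg is not None:
                    segs[lang.split("-")[0]] = "".join(seg.itertext()).strip()
            if src_lang in segs and tgt_lang in segs:
                targets_by_source[segs[src_lang]].add(segs[tgt_lang])
        # Keep only sources that occur with more than one distinct target.
        return {src: tgts for src, tgts in targets_by_source.items() if len(tgts) > 1}

    for source, targets in find_conflicts("training.tmx").items():
        print(source)
        for t in sorted(targets):
            print("   ->", t)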
Trade-offs exist between more 1:1 translationese and transcreation, between accuracy and fluency. The former has a bad name but it's less risky and more robust.
The decision could be to drop, resolve or to normalise, or to go debug the dataset and data pipeline.
Just throwing it all in the black box and mumbling "In Deep Learning We Trust" over Manning and Schütze 1999 three times only makes sense if the scale - the frequency with which you train custom models, not the amount of training data - is so high that basic due diligence is not feasible.
To really know, you may need to train the system with and without the conflicts, and evaluate and compare.
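As a minimal comparison sketch, assuming you have a held-out test set and the outputs of both systems in plain text files (the file names are placeholders; Custom Translator reports its own BLEU scores, but an external check with sacrebleu on your own test set is often worth it):

    # Sketch: compare two trained systems (with and without the conflicting segments)
    # on the same held-out test set. File names are placeholders.
    import sacrebleu

    def read_lines(path):
        with open(path, encoding="utf-8") as f:
            return [line.rstrip("\n") for line in f]

    references = read_lines("test.de")                      # human reference translations
    system_with = read_lines("output_with_conflicts.de")
    system_without = read_lines("output_without_conflicts.de")

    print("with conflicts:   ", sacrebleu.corpus_bleu(system_with, [references]).score)
    print("without conflicts:", sacrebleu.corpus_bleu(system_without, [references]).score)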
Source-side noise and conflicts, on the other hand, are not even really conflicts and are usually safe and even beneficial to include. And they're still worth peeking at.

How to build short sentences with a small letter set restriction?

I'm looking for a way to write a program that creates short German sentences with a restricted letter set. The sentences can be nonsense but should be grammatically correct. The following examples only contain the letters "aeilmnost":
"Antonia ist mit Tina im Tal."
"Tamina malt mit lila Tinte Enten."
"Tina nimmt alle Tomaten mit."
For this task I need a dictionary like this one (found in the answer to "Where can I find a parsable list of German words?"). The research area for programmatically creating text is NLG - Natural Language Generation. On the NLG-Wiki I found a large table of NLG systems. I picked two from the list which could be appropriate:
SimpleNLG - a Java API, which also has an adaptation for the German language
KOMET - multilingual generation, from the University of Bremen
Have you worked with an NLG library, and do you have advice on which one to use for building short sentences with a letter set restriction?
Can you recommend a paper to this topic?
Grammatically correct is a pretty fuzzy area, since grammar is not as strictly defined as one might think. What you really want here, though, is a part-of-speech tagger and a Markov chain.
Specifically, a Markov chain says that given a certain state (the first word, for instance) there's just a certain chance of moving on to another state (the next word). They are relatively easy to write from scratch, but I've got a gist here in Python that shows how they work if you want an example.
Once you've got that, I would suggest a part-of-speech-based Markov chain, combined with just checking to see if words are constructed from your desired character set. In general the algorithm would go something like this (a sketch follows the steps below):
Pick first word at random, checking that it is constructed solely from your desired set of characters
Use the Markov Chain to predict the next word
Check if that word is an appropriate part of speech, and that it conforms to the desired character set.
If not, predict another word until one fits.
If so, repeat from step 2 until the sentence is complete.
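A minimal sketch of steps 1, 2 and 4 (a plain word-bigram Markov chain without the part-of-speech check; the tiny training corpus and the allowed-letter set are placeholders):

    # Sketch: word-bigram Markov chain that only emits words built from a restricted letter set.
    # The training corpus is a placeholder; plug in any German text you have.
    import random
    from collections import defaultdict

    ALLOWED = set("aeilmnost")
    corpus = "tina nimmt alle tomaten mit . antonia ist mit tina im tal .".split()

    def fits(word):
        return all(ch in ALLOWED for ch in word.lower()) or word == "."

    # Step 1 prerequisite: build the transition table from adjacent word pairs.
    transitions = defaultdict(list)
    for current, nxt in zip(corpus, corpus[1:]):
        transitions[current].append(nxt)

    # Steps 1, 2, 4: start from a random allowed word and keep predicting allowed successors.
    def generate(max_words=8):
        word = random.choice([w for w in transitions if fits(w) and w != "."])
        sentence = [word]
        while len(sentence) < max_words:
            candidates = [w for w in transitions.get(word, []) if fits(w)]
            if not candidates:
                break
            word = random.choice(candidates)
            if word == ".":
                break
            sentence.append(word)
        return " ".join(sentence)

    print(generate())

Adding the part-of-speech check from step 3 just means tagging the corpus first and keying the transition table on (word, tag) pairs instead of bare words.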
Hope that's what you're looking for. Let me know if you have any more questions.
As Slater Tyranus already said, Markov chains certainly form the basis of this task. I am going to suggest a more heavy-duty approach. It is considerably more work, but is likely to give much better results in terms of grammatical correctness.
Language Model based on PCFG parse trees: A language model works by assigning a probability to a sequence of words. It requires training data, however, in order to be built first. In your case, the training process should disregard words containing letters outside the limited set.
While theoretically a language model based on parse trees is much more likely to serve your purpose, there is one caveat: due to the kind of letter-based restriction you have, data sparsity will certainly raise its ugly head. Backoff techniques (e.g. Katz's backoff model) can help a bit, but it will essentially depend on whether or not you can train on enough data.
As far as readily available parsers are concerned, the Stanford NLP group provides a German parser based on the Negra corpus, as mentioned on their home page.
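To make the idea concrete, here is a toy sketch with NLTK's PCFG machinery; the grammar rules and probabilities are invented for illustration, and in a real setup you would induce them from a treebank such as Negra, restricted to the allowed words:

    # Toy sketch: a hand-written PCFG assigns a probability to a sentence via its best parse.
    # The rules and probabilities are invented; in practice, induce them from a treebank.
    import nltk

    grammar = nltk.PCFG.fromstring("""
        S   -> NP VP     [1.0]
        NP  -> N         [0.6]
        NP  -> ADJ N     [0.4]
        VP  -> V NP      [0.7]
        VP  -> V         [0.3]
        N   -> 'Tina'    [0.3]
        N   -> 'Tomaten' [0.4]
        N   -> 'Tinte'   [0.3]
        ADJ -> 'lila'    [1.0]
        V   -> 'malt'    [0.5]
        V   -> 'nimmt'   [0.5]
    """)

    parser = nltk.ViterbiParser(grammar)
    for tree in parser.parse("Tina malt lila Tinte".split()):
        tree.pretty_print()
        print("probability of best parse:", tree.prob())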

Haskell: list/vector/array performance tuning

I am trying out Haskell to compute partition functions of models in statistical physics. This involves traversing quite large lists of configurations and summing various observables - which I would like to do as efficiently as possible.
The current version of my code is here: https://gist.github.com/2420539
Some strange things happen when trying to choose between lists and vectors to enumerate the configurations; in particular, to truncate the list, using V.toList . V.take (3^n) . V.fromList (where V is Data.Vector) is faster than just using take, which feels a bit counter-intuitive. In both cases the list is evaluated lazily.
The list itself is built using iterate; if instead I use Vectors as much as possible and build the list by using V.iterateN, again it becomes slower ...
My question is, is there a way (other than splicing V.toList and V.fromList at random places in the code) to predict which one will be the quickest? (BTW, I compile everything using ghc -O2 with the current stable version.)
Vectors are strict, and have O(1) subsets (e.g. take). They also have an optimized insert and delete. So you will sometimes see performance improvements by switching data structures on the fly. However, it is usually the wrong approach -- keeping all data in either one form or the other is better. (And you're using UArrays as well -- further confusing the issue).
General rules:
If the data is large and being transformed only in bulk fashion, using dense, efficient structures like vectors makes sense.
If the data is small and traversed linearly and rarely, then lists make sense.
Remember that operations on lists and vectors have different complexity, so while iterate . replicate on lists is O(n) and lazy, the same on vectors will not necessarily be as efficient (you should prefer the built-in methods in vector for generating arrays).
Generally, vectors should always be better for numerical operations. It might be that you have to use different functions than you do with lists.
I would stick to vectors only. Avoid UArrays, and avoid lists except as generators.

How to extract semantic relatedness from a text corpus

The goal is to assess semantic relatedness between terms in a large text corpus, e.g. 'police' and 'crime' should have a stronger semantic relatedness than 'police' and 'mountain' as they tend to co-occur in the same context.
The simplest approach I've read about consists of extracting TF-IDF information from the corpus.
A lot of people use Latent Semantic Analysis to find semantic correlations.
I've come across the Lucene search engine: http://lucene.apache.org/
Do you think it is suitable for extracting TF-IDF?
What would you recommend to do what I'm trying to do, both in terms of technique and software tools (with a preference for Java)?
Thanks in advance!
Mulone
Yes, Lucene gets TF-IDF data. The Carrot^2 algorithm is an example of a semantic extraction program built on Lucene. I mention it since, as a first step, they create a correlation matrix. Of course, you probably can build this matrix yourself easily.
If you deal with a ton of data, you may want to use Mahout for the harder linear algebra parts.
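Just to make the pipeline concrete, here is a compact sketch in Python rather than Java, with scikit-learn standing in for the Lucene/Mahout roles; the three-document corpus is a placeholder:

    # Sketch: term-term relatedness via TF-IDF plus LSA (truncated SVD) with scikit-learn.
    # The tiny corpus is a placeholder for your real document collection.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import TruncatedSVD
    from sklearn.metrics.pairwise import cosine_similarity

    docs = [
        "the police investigated the crime downtown",
        "police arrested a suspect after the crime",
        "we hiked up the mountain at dawn",
    ]

    vectorizer = TfidfVectorizer()
    doc_term = vectorizer.fit_transform(docs)          # documents x terms, TF-IDF weighted

    # Project terms into a low-dimensional latent space (LSA) and compare them there.
    svd = TruncatedSVD(n_components=2)
    term_vectors = svd.fit_transform(doc_term.T)       # terms x latent dimensions

    vocab = vectorizer.vocabulary_
    def relatedness(t1, t2):
        return cosine_similarity(term_vectors[vocab[t1]].reshape(1, -1),
                                 term_vectors[vocab[t2]].reshape(1, -1))[0, 0]

    print(relatedness("police", "crime"), relatedness("police", "mountain"))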
It is very easy if you have a Lucene index. For example, to get a correlation score you can use the simple formula count(term1 AND term2) / (count(term1) * count(term2)), where count is the number of hits from your search results. Moreover, you can easily calculate other semantic metrics such as chi^2 or information gain. All you need is to take a formula and express it in terms of counts obtained from Lucene queries.
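As a minimal sketch of that formula (the three hit counts would come from three Lucene queries; here they are just function arguments, and the example numbers are hypothetical):

    # Sketch: co-occurrence based relatedness score from the counts described above.
    # hits_both  = number of hits for the query "term1 AND term2"
    # hits_term1 = number of hits for term1 alone
    # hits_term2 = number of hits for term2 alone
    def cooccurrence_score(hits_both, hits_term1, hits_term2):
        if hits_term1 == 0 or hits_term2 == 0:
            return 0.0
        return hits_both / (hits_term1 * hits_term2)

    # Hypothetical hit counts from an index of news articles:
    print(cooccurrence_score(hits_both=120, hits_term1=400, hits_term2=350))  # police / crime
    print(cooccurrence_score(hits_both=3,   hits_term1=400, hits_term2=500))  # police / mountain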

VoiceXML: how many words in a grammar?

I want to have a dynamic grammar in my VoiceXML file (read single products and create the grammar with PHP).
My question is whether there is any advice or experience on how many words should be written into the source from which I read the products.
I don't know much about the structure or pronunciation of the words, so let's say
a) the words are rather different from each other
b) the words rather have the same structure or pronunciation
c) a mix of a) and b)
thanks in advance
I'm assuming you mean SRGS grammars when you indicate a dynamic grammar for VoiceXML.
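For reference, a dynamically generated SRGS grammar for a product list is essentially a single one-of rule. Here is a sketch that builds the XML in Python (the question does it in PHP, but the generated grammar is the same; product names and the language code are placeholders):

    # Sketch: generate a minimal SRGS (XML form) grammar with one <item> per product.
    # Product names and the xml:lang value are placeholders.
    from xml.sax.saxutils import escape

    def build_grammar(products, lang="de-DE"):
        items = "\n".join(f"      <item>{escape(p)}</item>" for p in products)
        return f"""<?xml version="1.0" encoding="UTF-8"?>
    <grammar xmlns="http://www.w3.org/2001/06/grammar" version="1.0"
             xml:lang="{lang}" root="product">
      <rule id="product" scope="public">
        <one-of>
    {items}
        </one-of>
      </rule>
    </grammar>"""

    print(build_grammar(["Kaffee", "Tee", "Mineralwasser"]))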
Unfortunately, you're going to have to do performance testing under a reasonable load to really know for sure. I've successfully transmitted 1M+ entry grammars under certain conditions. I've also done 10,000-name lists. I've also come across platforms that can only utilize a few dozen entries.
The speech recognition (ASR) and VoiceXML platform are going to have a significant impact on your results. And, the number of concurrent recognitions with this grammar will also be relevant along with the overall recognition load.
The factors you mention do have an impact on recognition performance and CPU load, but I've typically found the size of the grammar and the length/variability of entries to matter more. For example, yes/no grammars typically have a much higher CPU load than complex menu grammars (short phrases tend to require more passes and leave open a larger number of possibilities during processing). I've seen some horrible numbers from wide-ranging digit grammars (9-31 digit grammars). The sounds are short and difficult to disambiguate. The variability in components, again, creates a large number of paths that have to be continuously checked for a solution. Most menu or natural speech phrases have longer words that sound significantly different, so that many paths can be quickly excluded.
Some tips:
Most enterprise class ASR systems support a cache. If you can identify grammars with URL parameters and set any HTTP header information the ASR needs (don't assume they follow the standards), you may see a significant performance boost.
Prompts can often hide grammar loading/compiling phases. If you have a relatively long prompt where people will tend to barge in, you'll find that you can hide some fairly large grammar fetches. Again, not all platforms do a good job of processing these tasks in parallel. Note, most ASR engines can collect audio and perform end-pointing, while still fetching and compiling the grammar. This buys you more time, but you'll see the impact in longer latencies.
Most ASR engines provide tools that let you analyze a grammar with sample audio. The tools will usually give you CPU resource indicators. I've rarely found that you can calculate/predict overall performance due to the complexities around recognition concurrency, but they can give you a comparative impact relative to other grammars. I have yet to find an engine that makes it easy to track grammar processing times; it can be difficult to even roughly guess concurrency challenges. In most cases, large-scale testing has been necessary.
After grammar load/compile times, recognition concurrency is the most significant performance impact. I've seen a few applications that have highly complex grammars near the beginning of the call. There were high levels of recognition concurrency without an opportunity to cache (a platform issue at the time), which led to scaling challenges (intermittent, large latencies in recognition processing).