I am interested in statistical machine translation. Can anyone suggest where I can find more information about state-of-the-art implementations like Google Translate or Microsoft Translator?
I would like to know about the following stuff:
1) The size of training data for different languages.
2) The quality of the translations for different languages.
and any other interesting points about engine implementation.
If you want more details about the state of the art, you should have a look at this book: http://www.statmt.org/book/
It is probably a little outdated, but it is interesting.
The best-known tool in MT is Moses: http://statmt.org/moses/
You'll find an overview of this tool, and you can try a tutorial if you're brave.
With this documentation you'll get a more detailed understanding of the task.
Although I have an EE background, I never got the chance to attend Natural Language Processing classes.
I would like to build a sentiment analysis tool for Turkish. I think it is better to create a Turkish WordNet database than to translate the text to English and analyze the buggy translated text with existing tools. (Is it?)
So what do you recommend I do? Should I start by taking NLP classes on an open courseware site? I really don't know where to start. Could you help me and maybe provide a step-by-step guide? I know this is an academic project, but I am interested in building skills in this area as a hobby.
Thanks in advance.
Here is the process I have used before (making Japanese, Chinese, German and Arabic semantic networks):
1) Gather at least two English/Turkish dictionaries. They must be independent, not derived from each other. You can use Wikipedia to auto-generate one of your dictionaries. If you need to publish your network, then you may need open-source dictionaries, or license fees, or a lawyer.
2) Use those dictionaries to translate English WordNet, producing a confidence rating for each synset (see the sketch after this list).
3) Keep those with strong confidence, and manually approve or fix those with medium or low confidence.
4) Finish it off manually.
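To make step 2 concrete, here is a minimal Java sketch of the sort of confidence scoring I mean; the map-based dictionaries, the scoring rule and the thresholds are illustrative assumptions, not the exact system I used:

    import java.util.*;

    // Rough sketch: score a candidate Turkish translation of one WordNet synset
    // by how many (dictionary, synset word) pairs propose it. The Map-based
    // "dictionaries" and the thresholds below are illustrative assumptions.
    public class SynsetTranslationScorer {

        // Each dictionary maps an English word to its set of Turkish translations.
        private final List<Map<String, Set<String>>> dictionaries;

        public SynsetTranslationScorer(List<Map<String, Set<String>>> dictionaries) {
            this.dictionaries = dictionaries;
        }

        // Confidence = fraction of (dictionary, synset word) pairs that agree on the candidate.
        public double confidence(Set<String> englishSynsetWords, String turkishCandidate) {
            int votes = 0, possible = 0;
            for (Map<String, Set<String>> dict : dictionaries) {
                for (String word : englishSynsetWords) {
                    possible++;
                    Set<String> translations = dict.getOrDefault(word, Collections.emptySet());
                    if (translations.contains(turkishCandidate)) {
                        votes++;
                    }
                }
            }
            return possible == 0 ? 0.0 : (double) votes / possible;
        }

        // Bucket the score so medium/low candidates can be routed to manual review (step 3).
        public String bucket(double score) {
            if (score >= 0.5)  return "HIGH";    // keep automatically
            if (score >= 0.25) return "MEDIUM";  // manual approval
            return "LOW";                        // manual fix or discard
        }
    }

In practice the thresholds come out of inspecting a sample of synsets by hand; the point is only that agreement between independent dictionaries is what drives the rating.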
I expanded on this in the "Automatic Translation Of WordNet" section of my 2008 paper: http://dcook.org/mlsn/about/papers/nlp2008.MLSN_A_Multilingual_Semantic_Network.pdf
(For your stated goal of a Turkish sentiment dictionary, there are other approaches that do not involve a semantic network. E.g. "Sentiment Analysis and Opinion Mining" by Bing Liu is a good round-up of the research. But a semantic network approach will, IMHO, always give better results in the long run, and it has so many other uses.)
I'm currently investigating options for extracting person names, locations, tech words and categories from text (a lot of articles from the web), which will then be fed into a Lucene/ElasticSearch index. The additional information is then added as metadata and should increase the precision of the search.
E.g. when someone queries 'wicket', he should be able to decide whether he means the cricket sport or the Apache project. I have tried to implement this on my own with minor success so far. Now I have found a lot of tools, but I'm not sure whether they are suited for this task, which of them integrate well with Lucene, or whether the precision of entity extraction is high enough.
DBpedia Spotlight; the demo looks very promising
OpenNLP requires training. Which training data to use?
OpenNLP tools
Stanbol
NLTK
balie
UIMA
GATE -> example code
Apache Mahout
Stanford CRF-NER
maui-indexer
Mallet
Illinois Named Entity Tagger (not open source, but free)
wikipedianer data
My questions:
Does anyone have experience with some of the tools listed above and their precision/recall? Or do you know whether training data is required and available for them?
Are there articles or tutorials where I can get started with entity extraction (NER) for each of these tools?
How can they be integrated with Lucene?
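For context, this is roughly what I mean by adding the extracted entities as metadata; a minimal Lucene sketch where the field names (entity_person, entity_location, category) are placeholders I made up, and the exact field classes vary slightly between Lucene versions:

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.document.TextField;

    // Sketch: attach extracted entities/categories as extra fields so that queries
    // can filter on them (e.g. wicket AND category:cricket vs. category:software).
    // The resulting Document would then be added through an IndexWriter as usual.
    public class ArticleDocumentBuilder {

        public Document build(String body, Iterable<String> persons,
                              Iterable<String> locations, Iterable<String> categories) {
            Document doc = new Document();
            doc.add(new TextField("body", body, Field.Store.YES)); // analyzed full text
            for (String person : persons) {
                doc.add(new StringField("entity_person", person, Field.Store.YES));
            }
            for (String location : locations) {
                doc.add(new StringField("entity_location", location, Field.Store.YES));
            }
            for (String category : categories) {
                doc.add(new StringField("category", category, Field.Store.YES));
            }
            return doc;
        }
    }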
Here are some questions related to that subject:
Does an algorithm exist to help detect the "primary topic" of an English sentence?
Named Entity Recognition Libraries for Java
Named entity recognition with Java
The problem you are facing in the 'wicket' example is called entity disambiguation, not entity extraction/recognition (NER). NER can be useful, but only when the categories are specific enough. Most NER systems don't have enough granularity to distinguish between a sport and a software project (both types would fall outside the typically recognized types: person, organization, location).
For disambiguation, you need a knowledge base against which entities are disambiguated. DBpedia is a typical choice due to its broad coverage. See my answer to How to use DBPedia to extract Tags/Keywords from content?, where I provide more explanation and mention several tools for disambiguation, including:
Zemanta
Maui-indexer
Dbpedia Spotlight
Extractiv (my company)
These tools often expose a language-independent API such as REST, and I do not know whether they provide direct Lucene support, but I hope my answer has been beneficial for the problem you are trying to solve.
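As an illustration of the REST style, here is a rough Java sketch of calling DBpedia Spotlight's public annotate endpoint; the URL and parameter names reflect the public demo service and may have changed, so treat them as assumptions:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.net.URLEncoder;

    // Sketch of a plain REST call to DBpedia Spotlight's annotate service.
    // The endpoint and parameters are taken from the public demo and may
    // differ for your own deployment.
    public class SpotlightClient {

        public static String annotate(String text) throws Exception {
            String url = "https://api.dbpedia-spotlight.org/en/annotate"
                    + "?text=" + URLEncoder.encode(text, "UTF-8")
                    + "&confidence=0.4";
            HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
            conn.setRequestProperty("Accept", "application/json");

            StringBuilder json = new StringBuilder();
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
                String line;
                while ((line = in.readLine()) != null) {
                    json.append(line);
                }
            }
            return json.toString(); // JSON listing the disambiguated DBpedia resources
        }

        public static void main(String[] args) throws Exception {
            System.out.println(annotate("The batsman guarded the wicket."));
        }
    }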
You can use OpenNLP to extract names of people, places and organisations without training. You just use pre-existing models, which can be downloaded from here: http://opennlp.sourceforge.net/models-1.5/
For an example of how to use one of these models, see: http://opennlp.apache.org/documentation/1.5.3/manual/opennlp.html#tools.namefind
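For example, a minimal Java sketch using the pre-trained person-name model (the model file name comes from the download page above; the sample sentence is made up):

    import java.io.FileInputStream;
    import java.io.InputStream;

    import opennlp.tools.namefind.NameFinderME;
    import opennlp.tools.namefind.TokenNameFinderModel;
    import opennlp.tools.tokenize.SimpleTokenizer;
    import opennlp.tools.util.Span;

    // Minimal sketch: run the pre-trained OpenNLP person-name model over one sentence.
    public class OpenNlpNameFinderExample {

        public static void main(String[] args) throws Exception {
            try (InputStream modelIn = new FileInputStream("en-ner-person.bin")) {
                TokenNameFinderModel model = new TokenNameFinderModel(modelIn);
                NameFinderME nameFinder = new NameFinderME(model);

                String[] tokens = SimpleTokenizer.INSTANCE
                        .tokenize("Pierre Vinken will join the board as a nonexecutive director.");
                Span[] names = nameFinder.find(tokens);

                for (Span span : names) {
                    StringBuilder name = new StringBuilder();
                    for (int i = span.getStart(); i < span.getEnd(); i++) {
                        name.append(tokens[i]).append(' ');
                    }
                    System.out.println(span.getType() + ": " + name.toString().trim());
                }
                nameFinder.clearAdaptiveData();
            }
        }
    }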
Rosoka is a commercial product that provides a computation of "salience", which measures the importance of a term or entity to the document. Salience is based on linguistic usage rather than frequency. Using the salience values, you can determine the primary topic of the document as a whole.
The output is in your choice of XML or JSON, which makes it very easy to use with Lucene.
It is written in Java.
There is an Amazon Cloud version available at https://aws.amazon.com/marketplace/pp/B00E6FGJZ0. The cost to try it out is $0.99/hour. The Rosoka Cloud version does not have all of the Java API features that the full Rosoka does.
Yes, both versions perform entity and term disambiguation based on linguistic usage.
Disambiguation, whether by a human or by software, requires enough contextual information to tell the difference. The context may be contained within the document, within a corpus constraint, or within the context of the user; the former is more specific, while the latter has the greater potential ambiguity. For instance, typing the keyword "wicket" into a Google search could refer to cricket, the Apache software project, or the Star Wars Ewok character (i.e. an entity). The sentence "The wicket is guarded by the batsman" has contextual clues within the sentence that let it be interpreted as an object. "Wicket Wystri Warrick was a male Ewok scout" should interpret "Wicket" as the given name of the person entity "Wicket Wystri Warrick". "Welcome to Apache Wicket" has the contextual clues that "Wicket" is part of a product name, etc.
Lately I have been fiddling with Stanford CRF NER. They have released quite a few versions: http://nlp.stanford.edu/software/CRF-NER.shtml
The good thing is you can train your own classifier. You should follow this link, which has guidelines on how to train your own NER: http://nlp.stanford.edu/software/crf-faq.shtml#a
Unfortunately, in my case, the named entities are not extracted efficiently from the documents. Most of the entities go undetected.
Just in case you find it useful.
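For reference, basic usage of a pre-trained or self-trained classifier looks roughly like this in Java; the model file name is a placeholder for whichever serialized classifier you download or train:

    import edu.stanford.nlp.ie.AbstractSequenceClassifier;
    import edu.stanford.nlp.ie.crf.CRFClassifier;
    import edu.stanford.nlp.ling.CoreLabel;

    // Minimal sketch: load a serialized Stanford NER model and tag some text.
    public class StanfordNerExample {

        public static void main(String[] args) {
            AbstractSequenceClassifier<CoreLabel> classifier =
                    CRFClassifier.getClassifierNoExceptions("english.all.3class.distsim.crf.ser.gz");

            String text = "Jim bought 300 shares of Acme Corp. in 2006.";

            // Prints the text with each token annotated with its entity label.
            System.out.println(classifier.classifyToString(text));
        }
    }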
This answer shows a nice example of using a parser generator to look through text for some patterns of interest. In that example, it's product prices.
Does anyone know of tools to generate the grammars given training examples (a document plus the info I want from it)? I found a couple of papers, but no tools. I looked through the ANTLR docs a bit, but ANTLR deals with grammars; a "recognizer" takes a grammar as input, not training examples.
This is a machine learning problem. You can at best get an approximation, and I don't think anybody has done this well, let alone released a tool. (I actively track what people do to build grammars for computer languages; this idea has been proposed many times, but I have yet to see a useful implementation.)
The problem is that for any fixed set of examples, there is a huge number of possible grammars. It is easy to construct a naive one: for the fixed set of examples, simply propose a grammar that has one rule to recognize each example. That works, but is hardly helpful. The question then is: in how many ways can you generalize this, and which one is the best? In fact you can't know, because your next example may be a total surprise in terms of structure. (Theory definition: a language is the set of sentences that comprise it.)
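To see how unhelpful the naive construction is, here is a toy sketch that "induces" such a grammar: it just builds an alternation over the training strings (the rule syntax is only illustrative), so it accepts exactly the examples it has seen and nothing else:

    import java.util.Arrays;
    import java.util.List;
    import java.util.stream.Collectors;

    // Degenerate "grammar induction": one alternative per training example.
    // The resulting grammar memorizes the examples and generalizes to nothing else.
    public class NaiveGrammar {

        public static String induce(List<String> examples) {
            return "start : "
                    + examples.stream()
                              .map(example -> "'" + example + "'")
                              .collect(Collectors.joining(" | "))
                    + " ;";
        }

        public static void main(String[] args) {
            List<String> prices = Arrays.asList("$9.99", "$10.50", "$1,200.00");
            // Prints: start : '$9.99' | '$10.50' | '$1,200.00' ;
            System.out.println(induce(prices));
        }
    }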
We haven't even talked about the simpler problem of learning the lexemes of the language. How would you propose to learn what the legal strings for floating-point numbers are?
One tool that does this is NLTK. I highly recommend it, and the O'Reilly book that covers it is available free online. There are tools for parsing, learning grammars, etc. The only downside is that it is mainly a research tool rather than a production tool, so the emphasis isn't on performance.
NLTK is able to construct a grammar from labeled training samples, which is exactly what you are asking for. Have a look at the great docs and the book. (In my last experience with it, it also worked on the JVM through Jython without any issues.)
I'm looking for code, a product, or a service to do semantic analysis of text (sentences and/or paragraphs) to categorize the text by general topic, e.g.
Finance
Entertainment
Technology
Business
Art
etc...
If you have a bunch of examples that have already been categorised, you can use them to train a classifier.
This is a very simple document classification problem, and any suite of machine learning tools will have the algorithms and tutorials for it. For instance, check out Weka: http://www.cs.waikato.ac.nz/ml/weka/
or rapidminer: http://rapid-i.com/content/blogcategory/38/69/
If your needs are limited, and you just want a simple API, you cannot go wrong with this Naive Bayes library: https://ci-bayes.dev.java.net/
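If you do try Weka, a minimal sketch of training a Naive Bayes text classifier looks roughly like this; the ARFF file name and the attribute layout (a string "text" attribute plus a nominal "topic" class) are assumptions for illustration:

    import weka.classifiers.Classifier;
    import weka.classifiers.bayes.NaiveBayes;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.filters.Filter;
    import weka.filters.unsupervised.attribute.StringToWordVector;

    // Sketch: train a simple topic classifier in Weka. Assumes docs.arff has a
    // string attribute "text" and a nominal class "topic" (Finance, Technology, ...).
    public class TopicClassifierSketch {

        public static void main(String[] args) throws Exception {
            Instances raw = new DataSource("docs.arff").getDataSet();
            raw.setClassIndex(raw.numAttributes() - 1); // "topic" is the last attribute

            // Turn the free text into bag-of-words features.
            StringToWordVector bagOfWords = new StringToWordVector();
            bagOfWords.setInputFormat(raw);
            Instances data = Filter.useFilter(raw, bagOfWords);

            Classifier classifier = new NaiveBayes();
            classifier.buildClassifier(data);

            // Classify the first document and print the predicted topic.
            double predicted = classifier.classifyInstance(data.instance(0));
            System.out.println(data.classAttribute().value((int) predicted));
        }
    }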
Good luck!
If you want to evaluate a commercial service API, check out the VIKI engine APIs:
http://www.softwareevolution.it/en/products/viki-core-api.html
It is an easy-to-use JSON service API with specific semantic features.
Would this be of any help to you?
http://en.wikipedia.org/wiki/Document_classification
It's not a finished product or service, nor code, but it describes the various algorithms that can be used for semantic analysis. Googling a bit further, I believe this is not really out of the laboratory yet. People are experimenting mostly with KNN algorithms, resulting in cool stuff, but not really what you need:
http://www.ebi.ac.uk/webservices/whatizit/info.jsf
But if there is some software that will do what you ask, it would be in this list:
http://www.kdnuggets.com/software/text.html
For example, the LPU program seems to be able to learn if you feed it enough training documents.
http://www.cs.uic.edu/~liub/LPU/LPU-download.html
If you're into Python/interpreted languages, check out the excellent NLTK framework at nltk.org. It has an excellent how-to page and a recently published O'Reilly book.
If you're into Java and/or require a more mature but harder-to-grasp framework, try GATE instead.
I have lots and lots of data in various structures. Are there any better platforms than Excel charts that can help me?
Thanks
http://services.alphaworks.ibm.com/manyeyes/browse/visualizations
Here you can upload data sets and get different online visualizations, though your data will be made public.
What about Google Charts?
A starting point
The field of data visualisation is growing rapidly at the moment. Traditional toolchains such as Microsoft Excel were augmented by powerful visualisation solutions as part of the dashboarding craze that came with the last wave of ERPs. We're even more spoiled now that the programming community has joined with traditional analytics to explore Java, JavaScript, and any other language you can think of.
The story gets even better with open-source and cloud-based solutions. Keeping up is hard work, but I've found some great jumping-off points in a recent round of research. If you take an evening to spend a few minutes with each of the tools listed in this great Computerworld article, you will surely find one that immediately appeals to your preferences and skills.
22 Free Tools for Data Visualization and Analysis
If this is a little much to digest in one sitting, take a glance over the handy chart first to get an overview of some of what is out there.
Bonus
A great one not on that list is d3.js, the currently maintained successor to the protovis project, which I believe is no longer active. You can find d3.js on GitHub, which again shows how lucky we are to have such great community efforts open-sourcing these kinds of powerful visualisation solutions.
It depends a bit on what your objectives are and how technical you are willing to get.
Incanter is a great toolset that I can heartily recommend (I use it for visualisation in my own projects). It's a statistical computing and visualisation library for Clojure, which in turn is a very flexible and dynamic language, good for interactive experiments.
I particularly like the DSL for creating charts. For example, to create a histogram of 1000 samples from the normal distribution you can just do:
(view (histogram (sample-normal 1000)))
Take a look at R. It has a strong community and ecosystem. If you enjoy working from a console, you'll probably enjoy how easy it is to go from a CSV, for example, to various data visualizations.
I found this interactive tutorial from Code School to be very helpful in getting started.