Resources for programs teaching natural languages - API

What APIs and data sets are available for use in programs to teach natural languages, e.g. to aid in learning to read/write/listen/speak a second language? These could be web or traditional APIs to dictionaries, translation services, associations of words/concepts with images or sounds (e.g. spoken words or phrases), movies, or sets of flashcard decks. Also of interest are websites that could be spidered to obtain local data sets for offline use.
As a start, I note that the Google Translate API can be accessed programmatically.
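For example, a call against the v2 REST endpoint looks roughly like this (a minimal sketch in Java; the endpoint URL, parameters, and key handling should be checked against the current documentation, and YOUR_API_KEY is a placeholder):

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

public class TranslateCall {
    public static void main(String[] args) throws Exception {
        String apiKey = "YOUR_API_KEY";   // placeholder; requires a Google Cloud project
        String text = "Hej, hur mår du?"; // Swedish source text

        String url = "https://translation.googleapis.com/language/translate/v2"
                + "?key=" + apiKey
                + "&source=sv&target=en"
                + "&q=" + URLEncoder.encode(text, StandardCharsets.UTF_8);

        HttpResponse<String> response = HttpClient.newHttpClient().send(
                HttpRequest.newBuilder(URI.create(url)).GET().build(),
                HttpResponse.BodyHandlers.ofString());

        // The body is JSON along the lines of
        // {"data":{"translations":[{"translatedText":"..."}]}}
        System.out.println(response.body());
    }
}
```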
There is an online web course for Swedish with sound files.
There are online texts and MP3 files for many languages including Swedish from the Foreign Service Institute.
I am especially interested in resources for Swedish, but feel free to add resources for other languages. Please tag any answers with the relevant language or languages.

I think spidering sites that provide language learning content is against the ToU for most of them.
If you're interested in attempting to analyze pronunciation and grade its correctness, this API may be worth referencing. I don't have personal experience with it, however; algorithms for this are still in their infancy and don't provide a lot of value to the average consumer at this point.

Captcha alternative

In order to implement a CAPTCHA for my login page, I would like to understand how a translation test can be considered secure compared to popular image recognition patterns.
All customers will be bilingual speakers of a Polynesian language that is learnt and used orally, i.e. one with no formal spelling conventions (hence translating into English rather than the reverse). So instead of asking them to read distorted letters, I would like to ask them to translate a simple sentence into English, to be validated on the PHP server side.
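To make the idea concrete, the server-side check I have in mind is roughly the following (a sketch in Java purely for illustration, since my actual implementation would be PHP; the challenge ID and the accepted English translations are placeholders):

```java
import java.text.Normalizer;
import java.util.List;
import java.util.Map;

public class TranslationCaptcha {

    // Hypothetical whitelist: each challenge ID maps to the English
    // translations we are willing to accept for that sentence.
    private static final Map<String, List<String>> ACCEPTED = Map.of(
            "challenge-1", List.of("this is the house", "this is a house"));

    // Collapse case, punctuation and extra whitespace so harmless variations
    // in the user's answer still match.
    private static String normalize(String s) {
        String folded = Normalizer.normalize(s, Normalizer.Form.NFKC).toLowerCase();
        return folded.replaceAll("[^\\p{L}\\p{N} ]", "").replaceAll("\\s+", " ").trim();
    }

    public static boolean check(String challengeId, String userAnswer) {
        List<String> accepted = ACCEPTED.get(challengeId);
        if (accepted == null) {
            return false;
        }
        String answer = normalize(userAnswer);
        return accepted.stream().map(TranslationCaptcha::normalize).anyMatch(answer::equals);
    }

    public static void main(String[] args) {
        System.out.println(check("challenge-1", "This is the house!"));  // true
    }
}
```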
Is this secure/accurate?
The basic reason this kind of CAPTCHA ("Completely Automated Public Turing test to tell Computers and Humans Apart") is insecure is that, while the OP states that Google Translate does not "currently" support the Polynesian language in question, it cannot be ruled out that it will do so in the future.
More generally, translation is not a valid CAPTCHA test because of the following considerations:
If you validate by comparing a random sentence against its automated translation from a public engine (e.g. a future version of Google or Bing Translate), an attacker can simply submit the same phrase to that engine and pass.
A whitelist of sentences and their hand-made translations will eventually be overtaken by the growing accuracy of public machine translators.
In other words, public machine translators keep improving. If you assume that a public translator cannot do an accurate job today and challenge the user with a known phrase the translator cannot process, the technology will eventually learn that translation and the challenge sentence will be easily solved by bots.
This is the same principle behind reCAPTCHA being used for OCR, just seen from the opposite side. I suggest you read this paper; briefly, the researchers argue that reCAPTCHA is bound to become far more accurate than automated OCR precisely because of user input.
Since Google and Bing Translate make heavy use of user-submitted data to improve their translations, they are subject to the same human-aided machine learning that will eventually break the Turing test for this kind of challenge (just as reCAPTCHA will read like a human, Translate will translate like a human).
After reading the comments, it seems the only danger I face is a vague future Google Translate one, which is unlikely to eventuate. So I'm going to stick my neck out and say that this is indeed a good security measure, one which could conceivably be useful to many businesses or organisations that have such a customer base. Thanks for the assist.
A major point in its favor is ease of use for the customers, all of whom so far prefer it to trying to read a CAPTCHA. I put it on a live system, so 80+ people used it today.
I presume they all speak English too, then? It's unusual to require your users to be bilingual. Even if this is the case today, is it possible that with future growth you might be excluding certain users? What if someone moves into the area who wants to sign up but only speaks English?
Language is a funny, imprecise thing. You could take a sentence and probably translate it a number of different ways. Computers deal in precision, so you need a question to which there can only be one answer.
Also, the whole idea of a CAPTCHA is to make sure it's a real person, but it may not be too hard to write a program that uses Google Translate or something similar. It may not always get it right, but it would probably get through some of the time.

How to optimise Google Translate API calls to translate multiple words in a single request

Hi everyone. I recently integrated Google Translate into my project, where it translates product names, product descriptions, and product-related category names. But because there are plenty of products in my database (and the number grows quickly), the Google Translate API would cost considerable money.
I want to call Google Translate as little as possible. Many words are the same across many products, for example: 阿迪达斯 - Adidas, 苹果 - iPhone, 篮球 - basketball, and so on. I want to apply some tricks here, but I have no idea how.
Has anyone encountered a similar problem?
Any help would be appreciated.
It sounds like what you need is actually the ability to reuse translations at the string or substring level (in other words, per database entry). As far as I know, you can't really do that with Google. You've got a few options, as I see it:
You could switch over to Microsoft Translator and use their methods that allow you to place translations yourself, such as their Collaborative Translation feature, which lets you override the MT with a preferred translation and even vote translations up/down. Quality here will be broadly comparable to Google (I often find it better), and you have methods at your disposal that allow this override. Also, unlike Google, the Microsoft API is free up to a certain volume. Take a look:
http://www.microsoft.com/en-us/translator/developers.aspx
Microsoft also has a unique feature called the Microsoft Translator Hub, which can use your terminology, for example, for translations. However, depending on how you implement any solution with Microsoft, you might still have the problem that you are making more calls out to Microsoft than you'd like; moreover, "matching" only takes place at the level of a whole record or string, so it would not cover the case of shared linguistic elements being concatenated into one string.
There's a commercial offering called GeoFluent (full disclosure: I am the product manager for this product, so I'm clearly biased :)) that works with Microsoft Translator but provides pre- and post-translation processing that can handle sub-segment matching and may therefore reduce the volume you put through translation each time. It could make sense if, as you mention, you are rapidly adding to your database. Of course, this is a commercial offering too, so you'd have to balance the costs.
Let me know if this helps, and happy to answer any other questions you have.
Marcus
There is a PHP sample here: http://weblite.ca/svn/dataface/modules/tm/trunk/lib/googleTranslatePlugin.php
It allows you to send an array and get an array back: its getTranslations() method translates all of the user-provided strings into the target language using the Google Translate API and returns an array of source=>target strings.
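The same idea can be taken further in Java: keep a local cache (a string-level translation memory) of everything you have already paid to translate, and batch only the cache misses into a single request. A rough sketch; the in-memory map stands in for a database table, and callTranslateApi() is a placeholder for the actual HTTP call (check the API documentation for per-request limits):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TranslationMemory {

    // Cache of source string -> target string. In a real system this would be
    // a database table keyed by (source text, source language, target language).
    private final Map<String, String> cache = new HashMap<>();

    public Map<String, String> translateAll(List<String> sources) {
        Map<String, String> result = new HashMap<>();
        List<String> misses = new ArrayList<>();

        for (String s : sources) {
            String cached = cache.get(s);
            if (cached != null) {
                result.put(s, cached);   // reuse: no API call, no cost
            } else {
                misses.add(s);
            }
        }

        if (!misses.isEmpty()) {
            // One batched call for everything we have not seen before.
            List<String> translated = callTranslateApi(misses);
            for (int i = 0; i < misses.size(); i++) {
                cache.put(misses.get(i), translated.get(i));
                result.put(misses.get(i), translated.get(i));
            }
        }
        return result;
    }

    // Placeholder for the actual HTTP call to the translation service.
    private List<String> callTranslateApi(List<String> texts) {
        throw new UnsupportedOperationException("wire this to the Translate API");
    }
}
```

Since brand and category terms repeat across many products, the cache hit rate should climb quickly as the database grows.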

Why embed Lua into a game engine?

I've been looking into building a basic game engine from the ground up, and after making a list of features that are common to other engines, one of the bigger ones is the fact that they have an embedded scripting language like Lua or Python.
My question is: how is an embedded scripting language superior to just providing a header file (or something similar) that the user can include in a C++ file, giving them access to many of the engine's functions and states? I'm sure there's a very good answer out there; I just haven't stumbled on it yet.
Also, beyond why it's needed, what are languages like Lua used for in game engines?
Lua is a far simpler language than C++, and all you need to edit it is a text editor. This puts the ability to script events and/or high-level game logic in the hands of your designers and end users. Dynamic typing and garbage collection allow them to write very succinct code that focuses on game logic rather than all the systems-level housekeeping chores you get in a language like C++. It's also far easier to sandbox.
Lua is a popular choice because it has a small, portable, hackable ANSI C code base; it is easy to embed and extend; and, most importantly for game developers, it has a minimal runtime footprint (it is one of the fastest interpreted languages). It's also a great combination of easy-to-learn/read/write syntax with powerful features like coroutines, which can be very useful in games.
The reason for including a scripting language is to allow users to customize the behavior without having to recompile the code.
I'm not sure what you are asking in the second part of the question. Are you asking what other languages are used, or are you asking in what ways languages like Lua are used?
If you are asking what other languages are good for this, one such language is Tcl. Tcl was designed from the ground up to be an embedded scripting language; it is very mature and robust, and is easily learned by non-technical people.
As for what scripting languages are good for: configuration files are one use. By using a programming language rather than a text file with name/value pairs, you allow users to add logic to their start-up files. For example, maybe you allow users to assign different functions to keys on the keyboard; with a programming language they can assign different functions on different computers. Or, if you're creating a game like an RPG, perhaps you can assign different keys for different character classes. If playing as a mage, F12 might cast a spell, but if playing as a warrior, F12 might perform a finishing blow.
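As a rough illustration of that key-binding idea, here is how an engine written in Java might evaluate a user-editable Lua script with the LuaJ library (a sketch that assumes LuaJ is on the classpath; a C++ engine would do the same thing through the Lua C API, and the script contents are made up):

```java
import org.luaj.vm2.Globals;
import org.luaj.vm2.LuaValue;
import org.luaj.vm2.lib.jse.JsePlatform;

public class LuaConfigDemo {
    public static void main(String[] args) {
        // A user-editable config script: plain text, no recompilation needed.
        String script =
                "function action_for(class, key)\n" +
                "  if key == 'F12' then\n" +
                "    if class == 'mage' then return 'cast_spell' end\n" +
                "    if class == 'warrior' then return 'finishing_blow' end\n" +
                "  end\n" +
                "  return 'none'\n" +
                "end\n";

        Globals globals = JsePlatform.standardGlobals();
        globals.load(script).call();              // run the script, which defines action_for

        LuaValue actionFor = globals.get("action_for");
        String action = actionFor.call(LuaValue.valueOf("mage"), LuaValue.valueOf("F12")).tojstring();
        System.out.println(action);               // prints: cast_spell
    }
}
```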
There are many ways to use scripting languages, and many different languages to choose from. It all boils down to allowing your users to customize the behavior of the game without having to recompile the code.
You might find this article by a game developer useful in understanding why embedded languages are used.
http://www.grimrock.net/2012/07/25/making-of-grimrock-rapid-programming/
Another good reason: unless you are sharing your source code with your game's users and they are all C programmers, languages such as Lua make it possible for users to extend the game. For example, look at World of Warcraft.

How can I start building a WordNet for the Turkish language to use in sentiment analysis

Although I have an EE background, I never got the chance to attend natural language processing classes.
I would like to build a sentiment analysis tool for the Turkish language. I think it is better to create a Turkish WordNet database than to translate the text to English and analyze the buggy translated text with existing tools. (Is it?)
So what do you recommend I do? First of all, take NLP classes on an open courseware website? I really don't know where to start. Could you help me, and maybe provide a step-by-step guide? I know this is an academic project, but I am interested in building skills in that area as a hobby.
Thanks in advance.
Here is the process I have used before (making Japanese, Chinese, German and Arabic semantic networks):
Gather at least two English/Turkish dictionaries. They must be independent, not derived from each other. You can use Wikipedia to auto-generate one of your dictionaries. If you need to publish your network, then you may need open source dictionaries, or license fees, or a lawyer.
Use those dictionaries to translate the English WordNet, producing a confidence rating for each synset.
Keep those with strong confidence, and manually approve or fix those with medium or low confidence.
Finish it off manually.
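A rough sketch of how steps 2 and 3 can be scored (the tiny in-memory dictionaries are toy placeholders; the confidence rule used here, counting how many independent dictionaries agree on a candidate translation, is just one simple way to rate a synset):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class SynsetTranslator {

    // Each dictionary maps an English word to its candidate Turkish translations.
    public static Map<String, Integer> scoreCandidates(
            List<Map<String, Set<String>>> dictionaries, List<String> synsetWords) {

        Map<String, Integer> votes = new HashMap<>();
        for (Map<String, Set<String>> dict : dictionaries) {
            for (String word : synsetWords) {
                for (String candidate : dict.getOrDefault(word, Set.of())) {
                    votes.merge(candidate, 1, Integer::sum);  // one vote per dictionary hit
                }
            }
        }
        return votes;  // higher vote count = higher confidence
    }

    public static void main(String[] args) {
        // Toy example: two independent EN->TR dictionaries and the synset {dog, domestic dog}.
        Map<String, Set<String>> dictA = Map.of("dog", Set.of("köpek", "it"));
        Map<String, Set<String>> dictB = Map.of(
                "dog", Set.of("köpek"),
                "domestic dog", Set.of("evcil köpek"));

        Map<String, Integer> scores =
                scoreCandidates(List.of(dictA, dictB), List.of("dog", "domestic dog"));
        // "köpek" gets two votes -> keep; single-vote candidates go to manual review.
        System.out.println(scores);
    }
}
```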
I expanded on this in the "Automatic Translation Of WordNet" section of my 2008 paper: http://dcook.org/mlsn/about/papers/nlp2008.MLSN_A_Multilingual_Semantic_Network.pdf
(For your stated goal of a Turkish sentiment dictionary, there are other approaches that do not involve a semantic network. E.g. "Sentiment Analysis and Opinion Mining" by Bing Liu is a good round-up of the research. But a semantic network approach will, IMHO, always give better results in the long run, and has so many other uses.)

Entity Extraction/Recognition with free tools while feeding Lucene Index

I'm currently investigating the options to extract person names, locations, tech words, and categories from text (a lot of articles from the web), which will then be fed into a Lucene/ElasticSearch index. The additional information is then added as metadata and should increase the precision of the search.
E.g. when someone queries 'wicket', they should be able to decide whether they mean the cricket term or the Apache project. I tried to implement this on my own with minor success so far. Now I have found a lot of tools, but I'm not sure whether they are suited to this task, which of them integrate well with Lucene, or whether the precision of the entity extraction is high enough.
DBpedia Spotlight: the demo looks very promising
OpenNLP requires training. Which training data to use?
OpenNLP tools
Stanbol
NLTK
balie
UIMA
GATE -> example code
Apache Mahout
Stanford CRF-NER
maui-indexer
Mallet
Illinois Named Entity Tagger: not open source but free
wikipedianer data
My questions:
Does anyone have experience with some of the tools listed above and their precision/recall? Or with whether training data is required and available?
Are there articles or tutorials where I can get started with entity extraction (NER) for each tool?
How can they be integrated with Lucene?
Here are some questions related to that subject:
Does an algorithm exist to help detect the "primary topic" of an English sentence?
Named Entity Recognition Libraries for Java
Named entity recognition with Java
The problem you are facing in the 'wicket' example is called entity disambiguation, not entity extraction/recognition (NER). NER can be useful, but only when the categories are specific enough. Most NER systems don't have enough granularity to distinguish between a sport and a software project (both types would fall outside the typically recognized types: person, org, location).
For disambiguation, you need a knowledge base against which entities are disambiguated. DBpedia is a typical choice due to its broad coverage. See my answer to "How to use DBPedia to extract Tags/Keywords from content?", where I provide more explanation and mention several tools for disambiguation, including:
Zemanta
Maui-indexer
Dbpedia Spotlight
Extractiv (my company)
These tools often use a language-independent API like REST, and I do not know that they directly provide Lucene support, but I hope my answer has been beneficial for the problem you are trying to solve.
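That said, once you have the extracted entities for a document, getting them into Lucene is just a matter of adding extra fields. A minimal sketch, assuming Lucene 4+ field classes and a hypothetical extractEntities() call standing in for whichever tool you pick:

```java
import java.util.List;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;

public class EntityIndexer {

    public void index(IndexWriter writer, String articleText) throws Exception {
        Document doc = new Document();
        doc.add(new TextField("body", articleText, Field.Store.YES));

        // extractEntities() is a placeholder for DBpedia Spotlight, OpenNLP, etc.
        for (String entity : extractEntities(articleText)) {
            // StringField keeps the entity as a single untokenized term,
            // so you can filter or facet on it (e.g. entity:"Apache Wicket").
            doc.add(new StringField("entity", entity, Field.Store.YES));
        }
        writer.addDocument(doc);
    }

    private List<String> extractEntities(String text) {
        throw new UnsupportedOperationException("call your NER/disambiguation tool here");
    }
}
```

Queries can then restrict or boost on the entity field while the body field is searched as usual.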
You can use OpenNLP to extract names of people, places, and organisations without training. You just use pre-existing models, which can be downloaded from here: http://opennlp.sourceforge.net/models-1.5/
For an example of how to use one of these models, see: http://opennlp.apache.org/documentation/1.5.3/manual/opennlp.html#tools.namefind
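For example, with the pre-trained person-name model, the calling code looks roughly like this (a sketch; en-ner-person.bin is the model file from the models page above, and the sample sentence is made up). The same pattern works for the location and organisation models:

```java
import java.io.FileInputStream;
import java.io.InputStream;
import java.util.Arrays;

import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.tokenize.SimpleTokenizer;
import opennlp.tools.util.Span;

public class OpenNlpPersonFinder {
    public static void main(String[] args) throws Exception {
        // Load the pre-trained English person-name model.
        try (InputStream modelIn = new FileInputStream("en-ner-person.bin")) {
            TokenNameFinderModel model = new TokenNameFinderModel(modelIn);
            NameFinderME finder = new NameFinderME(model);

            String[] tokens = SimpleTokenizer.INSTANCE.tokenize(
                    "Pierre Vinken joined the board as a nonexecutive director.");
            for (Span span : finder.find(tokens)) {
                String name = String.join(" ",
                        Arrays.copyOfRange(tokens, span.getStart(), span.getEnd()));
                System.out.println(span.getType() + ": " + name);
            }
            finder.clearAdaptiveData();  // reset document-level context between documents
        }
    }
}
```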
Rosoka is a commercial product that provides a computation of "salience", which measures the importance of a term or entity to the document. Salience is based on linguistic usage rather than frequency. Using the salience values, you can determine the primary topic of the document as a whole.
The output is in your choice of XML or JSON, which makes it very easy to use with Lucene.
It is written in Java.
There is an Amazon Cloud version available at https://aws.amazon.com/marketplace/pp/B00E6FGJZ0. The cost to try it out is $0.99/hour. The Rosoka Cloud version does not have all of the Java API features available to it that the full Rosoka does.
Yes, both versions perform entity and term disambiguation based on linguistic usage.
Disambiguation, whether done by a human or by software, requires enough contextual information to tell the difference. The context may be contained within the document, within a corpus constraint, or within the context of the users; the former is more specific, and the latter has the greater potential for ambiguity. For example, typing the keyword "wicket" into a Google search could refer to cricket, the Apache software, or the Star Wars Ewok character (i.e. an entity). The sentence "The wicket is guarded by the batsman" has contextual clues within the sentence to interpret it as an object. "Wicket Wystri Warrick was a male Ewok scout" should interpret "Wicket" as the given name of the person entity "Wicket Wystri Warrick". "Welcome to Apache Wicket" has the contextual clues that "Wicket" is part of a product name, etc.
Lately I have been fiddling with Stanford CRF NER. They have released quite a few versions: http://nlp.stanford.edu/software/CRF-NER.shtml
The good thing is that you can train your own classifier. You should follow this link, which has guidelines on how to train your own NER: http://nlp.stanford.edu/software/crf-faq.shtml#a
Unfortunately, in my case, the named entities are not efficiently extracted from the document. Most of the entities go undetected.
Just in case you find it useful.
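For reference, running one of the pre-trained models takes only a few lines (a sketch; the classifier path points at the usual 3-class model shipped with the download, and, as noted above, how well it does will depend heavily on your domain):

```java
import edu.stanford.nlp.ie.AbstractSequenceClassifier;
import edu.stanford.nlp.ie.crf.CRFClassifier;
import edu.stanford.nlp.ling.CoreLabel;

public class StanfordNerDemo {
    public static void main(String[] args) throws Exception {
        // Path to a pre-trained model bundled with the Stanford NER download.
        String model = "classifiers/english.all.3class.distsim.crf.ser.gz";
        AbstractSequenceClassifier<CoreLabel> classifier = CRFClassifier.getClassifier(model);

        String text = "Apache Wicket was presented at a conference in London by John Smith.";
        // Inline-XML output wraps recognized entities in tags like <LOCATION>...</LOCATION>.
        System.out.println(classifier.classifyWithInlineXML(text));
    }
}
```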