How to use CFStringTokenizer with Chinese and Japanese? - objective-c

I'm using the code here to split text into individual words, and it's working great for all languages I have tried, except for Japanese and Chinese.
Is there a way that code can be tweaked to properly tokenize Japanese and Chinese as well? The documentation says those languages are supported, but it does not seem to be breaking words in the proper places. For example, when it tokenizes "新しい" it breaks it into two words "新し" and "い" when it should be one (I don't speak Japanese, so I don't know if that is actually correct, but the sample I have says that those should all be one word). Other times it skips over words.
I did try creating Chinese and Japanese locales, while using kCFStringTokenizerUnitWordBoundary. The results improved, but are still not good enough for what I'm doing (adding hyperlinks to vocabulary words).
I am aware of some other tokenizers that are available, but would rather avoid them if I can just stick with Core Foundation.
[UPDATE] We ended up using mecab with a specific user dictionary for Japanese for some time, and have now moved over to just doing all of this on the server side. It may not be perfect there, but we have consistent results across all platforms.

If you know that you're parsing a particular language, you should create your CFStringTokenizer with the correct CFLocale (or, at the very least, the guess from CFStringTokenizerCopyBestStringLanguage) and use kCFStringTokenizerUnitWordBoundary.
Unfortunately, perfect word segmentation of Chinese and Japanese text remains an open and complex problem, so any segmentation library you use will have some failings. For Japanese, CFStringTokenizer uses the MeCab library internally together with ICU's boundary analysis (the latter only when you use kCFStringTokenizerUnitWordBoundary, which is why you get the odd break in "新しい" without it).
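For illustration, here is a minimal sketch of that setup: word-boundary units plus an explicit locale. The sample string and the "ja_JP" locale identifier are just placeholders; in practice you might derive the locale from whatever CFStringTokenizerCopyBestStringLanguage guesses.

    #import <CoreFoundation/CoreFoundation.h>

    int main(void) {
        CFStringRef text = CFSTR("新しい言葉を学ぶ");  // placeholder sample text
        CFLocaleRef locale = CFLocaleCreate(kCFAllocatorDefault, CFSTR("ja_JP"));

        CFStringTokenizerRef tokenizer =
            CFStringTokenizerCreate(kCFAllocatorDefault, text,
                                    CFRangeMake(0, CFStringGetLength(text)),
                                    kCFStringTokenizerUnitWordBoundary, locale);

        CFStringTokenizerTokenType type;
        while ((type = CFStringTokenizerAdvanceToNextToken(tokenizer)) != kCFStringTokenizerTokenNone) {
            CFRange range = CFStringTokenizerGetCurrentTokenRange(tokenizer);
            CFStringRef token = CFStringCreateWithSubstring(kCFAllocatorDefault, text, range);
            CFShow(token);   // prints each token on its own line
            CFRelease(token);
        }

        CFRelease(tokenizer);
        CFRelease(locale);
        return 0;
    }
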

Also have a look at NSLinguisticTagger.
But by itself it won't give you much more.
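For example, a rough sketch of word tokenization with NSLinguisticTagger (the sample string is a placeholder; this gives you tokens, not readings or deeper analysis):

    #import <Foundation/Foundation.h>

    int main(void) {
        @autoreleasepool {
            NSString *text = @"新しい言葉を学ぶ";  // placeholder sample text

            NSLinguisticTagger *tagger =
                [[NSLinguisticTagger alloc] initWithTagSchemes:@[NSLinguisticTagSchemeTokenType]
                                                       options:0];
            tagger.string = text;

            NSLinguisticTaggerOptions opts =
                NSLinguisticTaggerOmitWhitespace | NSLinguisticTaggerOmitPunctuation;
            [tagger enumerateTagsInRange:NSMakeRange(0, text.length)
                                  scheme:NSLinguisticTagSchemeTokenType
                                 options:opts
                              usingBlock:^(NSString *tag, NSRange tokenRange,
                                           NSRange sentenceRange, BOOL *stop) {
                // Keep only ranges the tagger classifies as words.
                if ([tag isEqualToString:NSLinguisticTagWord]) {
                    NSLog(@"word: %@", [text substringWithRange:tokenRange]);
                }
            }];
        }
        return 0;
    }
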
Truth be told, these two languages (and some others) are really hard to programmatically tokenize accurately.
You should also see the WWDC videos on LSM (Latent Semantic Mapping). They cover the topic of stemming and lemmas, which is the art and science of more accurately determining how to tokenize meaningfully.
What you want to do is hard. Finding word boundaries alone does not give you enough context to convey accurate meaning. It requires looking at the context and also identifying idioms and phrases that should not be broken apart by word (not to mention grammatical forms).
After that, look again at the available libraries, then pick up a book on the Python NLTK to learn enough about NLP to understand how far you really want to pursue this.
Larger bodies of text inherently yield better results. There's no accounting for typos and bad grammar. Much of the context needed to drive the analysis is implicit and not directly written as words. You get to build rules and train the thing.
Japanese is a particularly tough one, and many libraries developed outside of Japan don't come close. You need some knowledge of the language to know whether the analysis is working. Even native Japanese speakers can have a hard time doing the natural analysis without the proper context. There are common scenarios where the language presents two mutually intelligible, correct word boundaries.
To give an analogy, it's like doing lots of look ahead and look behind in regular expressions.

Related

Is there a description of the mecab (Japanese word parser) algorithm?

Is there a document somewhere that describes the Mecab algorithm?
Or could someone give a simple one-paragraph or one-page description?
I'm finding it too hard to understand the existing code, and what the databases contain.
I need this functionality in my free website and phone apps for teaching languages (www.jtlanguage.com). I also want to generalize it for other languages, make use of the conjugation detection mechanism I've already implemented, and avoid license encumbrance. Therefore I want to create my own implementation (in C#).
I already have a dictionary database derived from EDICT. What else is needed? A frequency-of-usage database?
Thank you.
Some thoughts that are too long to fit in a comment.
§ What license encumbrances? MeCab is dual-licensed including BSD, so that's about as unencumbered as you can get.
§ There's also a Java rewrite of Mecab called Kuromoji that's Apache licensed, also very commercial-friendly.
§ MeCab implements a machine learning technique called conditional random fields for morphological parsing (separating free text into morphemes) and part-of-speech tagging (labeling those morphemes) of Japanese text. It is able to use various dictionaries as training data, which you've seen: IPADIC, UniDic, etc. Those dictionaries are compilations of morphemes and parts of speech, and are the work of many human-years' worth of linguistic research. The linked paper is by the authors of MeCab.
§ Others have applied other powerful machine learning algorithms to the problem of Japanese parsing.
Kytea can apply both support vector machines and logistic regression to the same problem. It's C++, Apache licensed, and the papers are there to read.
Rakuten MA is in JavaScript, also liberally licensed (Apache again), and comes with a regular dictionary and a light-weight one for constrained apps—it won't give you readings of kanji though. You can find the academic papers describing the algorithm there.
§ Given the above, I think you can see that simple dictionaries like EDICT and JMDICT are insufficient to do the advanced analysis that these morphological parsers do. And these algorithms are likely way overkill for other, easier-to-parse languages (i.e., languages with spaces).
If you need the power of these libraries, you're probably better off writing a microservice that runs one of these systems (I wrote a REST frontend to Kuromoji called clj-kuromoji-jmdictfurigana) instead of trying to reimplement them in C#.
Though note that it appears C# bindings to MeCab exist: see this answer.
In several small projects I just shell out to MeCab, then read and parse its output. See my TypeScript example using UniDic for Node.js.
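For what it's worth, here is a rough sketch of that shell-out in Objective-C (the same idea ports to any language, including C#). The mecab path is an assumption, and the parsing relies on MeCab's default output of one "surface<TAB>features" line per token, terminated by "EOS".

    #import <Foundation/Foundation.h>

    // Runs the mecab command-line tool on `text` and returns each token's surface form.
    // Assumes mecab is installed at /usr/local/bin/mecab; adjust the path for your setup.
    NSArray<NSString *> *TokenizeWithMeCab(NSString *text) {
        NSTask *task = [[NSTask alloc] init];
        task.launchPath = @"/usr/local/bin/mecab";

        NSPipe *inPipe = [NSPipe pipe], *outPipe = [NSPipe pipe];
        task.standardInput = inPipe;
        task.standardOutput = outPipe;
        [task launch];

        [[inPipe fileHandleForWriting] writeData:[text dataUsingEncoding:NSUTF8StringEncoding]];
        [[inPipe fileHandleForWriting] closeFile];

        NSData *data = [[outPipe fileHandleForReading] readDataToEndOfFile];
        [task waitUntilExit];

        NSString *output = [[NSString alloc] initWithData:data encoding:NSUTF8StringEncoding];
        NSMutableArray<NSString *> *tokens = [NSMutableArray array];
        for (NSString *line in [output componentsSeparatedByString:@"\n"]) {
            if (line.length == 0 || [line isEqualToString:@"EOS"]) continue;
            // Default format: surface form, a tab, then comma-separated features.
            [tokens addObject:[line componentsSeparatedByString:@"\t"].firstObject];
        }
        return tokens;
    }
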
§ But maybe you don't need full morphological parsing and part-of-speech tagging? Have you ever used Rikaichamp, the Firefox add-on that uses JMDICT and other low-weight publicly-available resources to put glosses on website text? (A Chrome version also exists.) It uses a much simpler deinflector that quite frankly is awful compared to MeCab et al. but can often get the job done.
§ You had a question about the structure of the dictionaries (you called them "databases"). This note from Kimtaro (the author of Jisho.org) on how to add custom vocabulary to IPADIC may clarify at least how IPADIC works: https://gist.github.com/Kimtaro/ab137870ad4a385b2d79. Other more modern dictionaries (I tend to use UniDic) use different formats, which is why the output of MeCab differs depending on which dictionary you're using.

Antlr and PL/I grammar

Right now we would like to have grammars for PL/I and COBOL based on ANTLR4. Does anyone provide these grammars?
If not, can you please share your thoughts/experience on developing these grammars from scratch?
Thanks
I assume you mean IBM PL/I and COBOL. (Not many other PL/Is around, but I don't think that really changes the answer much).
The obvious place to look for mature ANTLR grammars is the ANTLR3 grammar library; there are no PL/1 or COBOL grammars there. The main page for ANTLR V4 (a very new, radical, backwards-incompatible reengineering of ANTLR3) talks about Java and C#; no hint of PL/1 or COBOL there either; given its newness, that's no surprise. If you are really lucky, somebody may have one they will give you and will speak up.
Developing such grammars is difficult for several reasons (based on personal experience building production-quality parsers for these two specific languages, using a very strong parser system different from ANTLR [see my bio for more details]):
The character set and column layout rules (columns 1-5, 6 and 72-80 are special) may be an issue: these languages were historically written in EBCDIC, in punch-card 80-column format, without line-break characters between lines. Translation to ASCII sometimes produces nasty glitches; the ASCII end-of-line character is occasionally found in the middle of COBOL literal strings as a binary value, and because it has exactly the same code in EBCDIC and ASCII, after translation it both is and appears to be an ASCII newline character. Character strings can also be long yet split across multiple lines, while columns 72-80 by definition have to be ignored. Column 6 may contain a "D" character, which affects interpretation of the following source lines as "debug" or "not". This means you need to get 80-column processing right. I don't know what ANTLR has to support processing characters-in-column-areas. You'll also need to worry about DBCS encoding of string literals, and variations of that if the source code is used in non-English-speaking countries, such as Japan.
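To make the column handling concrete, here is a rough sketch (in C-flavored Objective-C, just to show the shape of the problem) of normalizing one fixed-format line before lexing; the columns follow the description above and would have to be adapted to the dialect you actually target:

    #import <Foundation/Foundation.h>

    // Sketch only: extract the code area from one 80-column fixed-format line,
    // using the column conventions described above. Real dialects differ.
    NSString *CodeAreaForLine(NSString *rawLine, BOOL *isDebugLine) {
        // Pad short lines to 80 columns so column positions stay meaningful.
        NSString *padded = [rawLine stringByPaddingToLength:80 withString:@" " startingAtIndex:0];

        // Indicator column: 'D' marks a debugging line.
        unichar indicator = [padded characterAtIndex:5];
        if (isDebugLine) *isDebugLine = (indicator == 'D' || indicator == 'd');

        // Drop the sequence-number area and everything from column 72 onward.
        return [padded substringWithRange:NSMakeRange(6, 65)];
    }
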
These languages are large and complex; IBM has had 40 years to decorate them with cruft. The IBM COBOL manual is some 600 pages ... then you discover that COBOL also includes a Report Writer, which is another 600-page document. Capturing all the nuances of the lexical tokens and the grammar rules will take effort, and you have to do that from the IBM manuals, which don't contain nice BNF-style descriptions, which means guessing from the textual description and some examples. For COBOL, expect several thousand grammar rules; PL/1 is less complicated in the abstract. Expect a certain amount of "lies": we've encountered a number of places where the reference documentation clearly says certain things are not legal, and yet the IBM compilers (based on real, running source code) accept them, and vice versa. The only way you find these is by empirical experiments.
Both languages have constructs that are difficult to parse, e.g., requiring arbitrary lookahead and/or local ambiguity. From my understanding, ANTLR4 is much better than ANTLR3 on these, but that doesn't mean these aspects will be easy. PL/1 is particularly nasty in this regard: it has no reserved keywords, but hundreds of keywords-in-context. To resolve these, one has to get the lexer and the parser to cooperate, and even then there may be many locally ambiguous parses. ANTLR3 doesn't do these well; ANTLR4 is supposed to be better, but I don't know how it handles this, if it does at all.
To verify these parsers are right, you will need to run them on millions of lines of code (which means you have to have access to such code samples), and correct any errors you find. This takes a long time (in our case, several years of more or less continuous work/improvement to get production quality grammars that work on large code bases). You might be miraculously faster than this; good luck.
You need to build a preprocessor for COBOL (COPY ... REPLACING), whose details are poorly documented, and eventually another one for PL/1 (which I understand to be fully Turing capable).
After you build a parser, you need to capture a syntax tree; here ANTLR4 is supposed to be pretty good in that it will capture one for the grammar you give it. That may or may not be the AST you want; with several thousand grammar rules, I'd expect not. ANTLR3 requires you to add, manually, indications of where and how to form the AST.
After you get the AST, you'll want to do something with it. This means you will need to build at least symbol tables (mappings from identifier instances to their declarations and any related type information). ANTLR provides nothing special to support this AFAIK, except for support in walking the ASTs. This, too, is hard to get right; COBOL has crazy rules about how an unqualified identifier reference can be interpreted as referring to a specific data field if there are no other conflicting interpretations. (There's lots more to Life After Parsing if you want to have good semantic information about the program; see my bio for more details. For each of these semantic aspects you have to develop them and then, for validation, go back and run them on large code bases again.)
TL;DR
Building parsers (well, "front ends") for these languages is a lot of work no matter what parsing engine you choose. That likely explains why they aren't already in ANTLR's grammar zoo.
Have a look at the open-source COBOL-85 parser from ProLeap, based on ANTLR4, which creates ASTs and ASGs as well.
And, best of all, it really works!
https://github.com/uwol/proleap-cobol-parser
I am not aware of a comparable PL/I grammar, but a very good start is the EBNF definition from Ralf Lämmel (CWI, Amsterdam) & Chris Verhoef (WINS, Universiteit van Amsterdam):
http://www.cs.vu.nl/grammarware/browsable/os-pli-v2r3/

Is there a repository of grammars for CMU Sphinx?

I'm writing an (offline) voice recognition app. I have CMU Sphinx4 set up and working using some of the included demo dictionaries. However, they're of limited scope (e.g., numbers, cities, etc.).
Is there a more comprehensive grammar available? Or maybe a repository of more of these limited grammars? I'm trying to exhaust any other options before creating my own.
Thank you
Grammars are always specific to your particular goal, so it does not make sense to share them. Even such a simple subject as digits can vary between concrete applications: we use "zero" and "oh" to denote "0" in regular speech, whilst scientists also use "nought" for the same purpose.
Sphinx4 supports JSGF and GRXML formats, you can easily find specifications of both.
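For instance, a minimal JSGF grammar covering the digits example above might look something like this (the grammar name, rule names and coverage are just illustrative):

    #JSGF V1.0;

    grammar digits;

    // "zero" and "oh" are both accepted for the digit 0, as discussed above.
    public <digit> = zero | oh | one | two | three | four |
                     five | six | seven | eight | nine;

    public <digits> = <digit>+;
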
You seem to be mistaking grammars for dictionaries. They are completely different things.
Sphinx supports not only grammars but also n-gram language models, which you may find more versatile. Such a model can be generated automatically and will work well if given a large corpus that reflects real usage sentences.
As for dictionaries: creating them for English is relatively simple. One could even think about a tool which reads a phonetic word representation from an online dictionary and converts it to the Sphinx format. The only input would be a word list.
I believe this paper will come in handy for your effort. It describes creating a grammar and dictionary for a new language, Swahili.

How Do I Design Abstract Semantic Graphs?

Can someone direct me to online resources for designing and implementing abstract semantic graphs (ASG)? I want to create an ASG editor for my language. Being able to edit the ASG directly has a number of advantages:
Only identifiers and literals need to be typed in and identifiers are written only once, when they're defined. Everything else is selected via the mouse.
Since the editor knows the language's grammar, there are no more syntax errors. The editor prevents them from being created in the first place.
Since the editor knows the language's semantics, there are no more semantic errors.
There are some secondary advantages:
Since all the reserved words are easily separable, a program can be written in one locale and viewed in another. On-the-fly changes of locale are possible.
All the text literals are easily separable, so changes of locale are easily made, including on-the-fly changes.
I'm not aware of a book on the matter, but you'll find the topic discussed in portions of various books on computer language. You'll also find discussions of this surrounding various projects which implement what you describe. For instance, you'll find quite a bit of discussion regarding the design of Scratch. Most workflow engines are also based on scripting in semantic graphs.
Allow me to opine... We've had the technology to manipulate language structurally for basically as long as we've had programming languages. I believe that the reason we still use textual language is a combination of the fact that it is more natural for us as humans, who communicate in natural language, to wield, and the fact that it is sometimes difficult to compose and refactor code when proper structure has to be maintained. If you're not sure what I mean, try building complex expressions in Scratch. Text is easier and a decent IDE gives virtually as much verification of correct structure.*
*I don't mean to take anything away from Scratch, it's a thing of beauty and is perfect for its intended purpose.

How can I start building a WordNet for Turkish to use in sentiment analysis?

Although I have an EE background, I didn't get a chance to attend natural language processing classes.
I would like to build a sentiment analysis tool for Turkish. I think it is best to create a Turkish WordNet database rather than translating the text to English and analyzing the buggy translated text with existing tools. (Is it?)
So what do you guys recommend I do? First of all, taking NLP classes from an open courseware website? I really don't know where to start. Could you help me and maybe provide a step-by-step guide? I know this is an academic project, but I am interested in building skills in that area as a hobby.
Thanks in advance.
Here is the process I have used before (making Japanese, Chinese, German and Arabic semantic networks):
Gather at least two English/Turkish dictionaries. They must be independent, not derived from each other. You can use Wikipedia to auto-generate one of your dictionaries. If you need to publish your network, then you may need open source dictionaries, or license fees, or a lawyer.
Use those dictionaries to translate English WordNet, producing a confidence rating for each synset (a rough sketch of this scoring step follows the list).
Keep those with strong confidence, and manually approve or fix those with medium or low confidence.
Finish it off manually.
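To make step 2 a bit more concrete, here is a rough Objective-C sketch of one way to score candidate translations for a synset. The data shapes, scoring and threshold are placeholder assumptions for illustration, not the exact method from the paper below:

    #import <Foundation/Foundation.h>

    // Sketch: a Turkish candidate earns one point for every (English word, dictionary)
    // pair that proposes it, so agreement across independent dictionaries and across
    // several members of the synset raises its confidence.
    NSDictionary<NSString *, NSNumber *> *
    CandidateScoresForSynset(NSArray<NSString *> *englishSynsetWords,
                             NSArray<NSDictionary<NSString *, NSArray<NSString *> *> *> *dictionaries) {
        NSMutableDictionary<NSString *, NSNumber *> *scores = [NSMutableDictionary dictionary];
        for (NSString *english in englishSynsetWords) {
            for (NSDictionary<NSString *, NSArray<NSString *> *> *dict in dictionaries) {
                for (NSString *turkish in dict[english] ?: @[]) {
                    scores[turkish] = @([scores[turkish] doubleValue] + 1.0);
                }
            }
        }
        return scores;
    }

    // Usage sketch: candidates scoring 2 or more (i.e., two sources agree) go straight in;
    // the rest are queued for the manual review in steps 3 and 4.
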
I expanded on this in the "Automatic Translation Of WordNet" section of my 2008 paper: http://dcook.org/mlsn/about/papers/nlp2008.MLSN_A_Multilingual_Semantic_Network.pdf
(For your stated goal of a Turkish sentiment dictionary, there are other approaches that don't involve a semantic network. E.g., "Sentiment Analysis and Opinion Mining" by Bing Liu is a good round-up of the research. But a semantic network approach will, IMHO, always give better results in the long run, and has so many other uses.)