Does Chinese need WordPiece? - tensorflow

I want to use the Chinese BERT model. In tokenization.py (https://github.com/google-research/bert/blob/master/tokenization.py) I found the WordpieceTokenizer class, but I don't think WordPiece is needed for Chinese, because the minimal unit of Chinese is the character.
WordpieceTokenizer is just for English text, am I right?

From the README:
We use character-based tokenization for Chinese, and WordPiece tokenization for all other languages.
However, from the Multilingual README (emphasis added):
Because Chinese (and Japanese Kanji and Korean Hanja) does not have whitespace characters, we add spaces around every character in the CJK Unicode range before applying WordPiece.
So WordPiece is presumably run on the whole sentence, though it would only matter for sentences that contain non-Chinese characters. So to run the code as-is, you would want WordPiece.
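As a rough illustration of that preprocessing step, here is a minimal Python sketch (not the actual BERT code; the real tokenization.py checks several CJK Unicode ranges, while this sketch checks only the core CJK Unified Ideographs block):

def _is_cjk(cp):
    # Core CJK Unified Ideographs block only; BERT's tokenization.py also
    # covers the extension and compatibility blocks.
    return 0x4E00 <= cp <= 0x9FFF

def add_spaces_around_cjk(text):
    out = []
    for ch in text:
        out.append(" " + ch + " " if _is_cjk(ord(ch)) else ch)
    return "".join(out)

print(add_spaces_around_cjk("我爱BERT"))
# -> " 我  爱 BERT": each Chinese character becomes its own token,
# while "BERT" is left for WordPiece to split.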
However, to clarify:
WordPiece is not just for English; it can be used on any language, and in practice it is used on many.
Whether single-character tokenization is the best choice for Chinese is debated.
WordPiece is not available outside Google; SentencePiece could be used as a replacement (though I think the BERT code might come with a pretrained model).


simple input of diacritical marks and superscripts

There are times when you need to input modified variables with diacritical marks or superscripts.
It seems that declare_index_properties allows doing this at the display-printing stage.
But it is neither simple, nor very useful in formulas.
Is there a simple way of adding hats, umlauts, primes, or strokes on top of a symbol, making it distinguishable from the unmarked symbol both to the interpreter and to the human eye?
Maxima doesn't have a notion of declaring a symbol to have diacritical marks or other combining marks on it. However, Maxima allows Unicode characters in symbol names if the underlying Lisp implementation allows Unicode; almost all of them allow Unicode. GCL is the only Lisp implementation, so far as I know, which doesn't handle Unicode correctly.
wxMaxima appears to allow Unicode characters to be input. At least, it worked that way when I tried some examples. Command-line Maxima allows Unicode if the terminal it is running in allows Unicode.
I think any Unicode character should be OK in a string. For symbols, any character which passes ALPHA-CHAR-P (a built-in Lisp function) can be part of a symbol name. Also, any character which is declared to be alphabetic (via declare("x", alphabetic), where x is the character in question) can be part of a symbol name.
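For example, here is a minimal sketch of that mechanism (assuming your Lisp and terminal handle Unicode; how well the combining mark renders depends on your fonts):

declare("̂", alphabetic);  /* make the combining circumflex (U+0302) alphabetic */
x̂ : 3;                    /* x̂ is now an ordinary symbol, distinct from x */
x̂ + x;                    /* => x + 3 */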
I think wxMaxima has some capability to allow the user to select characters with diacritical marks from a menu; I haven't tried it. When I want to use Unicode characters, I end up just pasting them from a web page or something. I have used https://www.w3.org/2001/06/utf-8-test/UTF-8-demo.html as a source of characters in the past.

asking for the unicode of letter conjunctions

I occasionally encounter some special characters while parsing PDF documents. They are actually two English letters, like 'fi', 'tt', or 'ti', but visually they look joined, and they exist in the PDF string as a single character.
I checked the 'ToUnicode' CMaps for these characters, but I found the tables are corrupted, so I cannot find the characters' Unicode values.
For example, <012E> Tj will print fi as in the attached picture. But in the corresponding font's ToUnicode CMap the entry is <012E> <0001>, which is meaningless.
Could anybody tell me their Unicode code points? Is it possible to find them in the corresponding font program?
Thanks for any advice.
(Screenshots of the fi, tt, and ti ligatures omitted.)
First of all, what you call letter conjunctions is usually known as ligatures. Thus, I will use that term from now on.
Unicode discourages the use of specific code points for ligatures:
The existing ligatures exist basically for compatibility and round-tripping with non-Unicode character sets. Their use is discouraged. No more will be encoded in any circumstances.
Ligaturing is a behavior encoded in fonts: if a modern font is asked to display “h” followed by “r”, and the font has an “hr” ligature in it, it can display the ligature. Some fonts have no ligatures, while others (especially fonts for non-Latin scripts) have hundreds of ligatures. It does not make sense to assign Unicode code points to all these font-specific possibilities.
(Unicode FAQ on ligatures)
Thus, you should not use the existing ligature code points.
You appear to be attempting to find the correct ToUnicode mapping for ligature glyphs. For this, simply remember that the values of ToUnicode mappings need not be single code points but may be multiple ones:
n beginbfchar
srcCode dstString
endbfchar
where dstString may be a string of up to 512 bytes.
(ISO 32000-1, section 9.10.3 ToUnicode CMaps)
Concerning your example, therefore:
For example, <012E> Tj will print fi as in the attached picture. But in the corresponding font's ToUnicode CMap the entry is <012E> <0001>, which is meaningless.
Simply use
<012E> <00660069>
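Dropped into the bfchar syntax quoted above, the relevant section of the ToUnicode CMap would then read (a minimal sketch showing only this one entry):

1 beginbfchar
<012E> <00660069>
endbfchar

i.e. the source code <012E> maps to the two UTF-16BE values U+0066 ("f") and U+0069 ("i").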
If you want to use ligature code points nonetheless, consult the Wikipedia article on orthographic ligatures, which lists some ligature code points, in particular U+FB01 for fi. So for your example:
<012E> <FB01>
But remember, their use is discouraged.

StanfordNLP Spanish Tokenizer

I want to tokenize a text in Spanish with StanfordNLP, and my problem is that the model splits any word matching the pattern "\d*s " (a word composed of digits and ending with an "s") into two tokens. If the word ends with another letter, such as "e", the tokenizer returns only one token.
For instance, given the sentence:
"Vendo iPhone 5s es libre de fabrica esta nuevo sin usar."
For the text "iPhone 5s", the tokenizer returns three tokens: "iPhone", "5", and "s".
Does anyone have an idea how I could avoid this behaviour?
I suppose you are working with the SpanishTokenizer rather than PTBTokenizer.
SpanishTokenizer is heavily based on the FrenchTokenizer, which in turn also comes from the PTBTokenizer (English).
I've run all three on your sentence, and it seems that the PTBTokenizer gives you the results you need, but the other two do not.
As all of them are deterministic tokenizers, I think you can't avoid the problem, because it seems to me that the issue is in the deterministic part rather than in the heuristic part, which runs later.
A possible workaround may be to use the WhitespaceTokenizer, as long as you don't mind having punctuation attached to tokens or losing some other grammar rules.
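To see what that trade-off looks like, here is a plain-Python illustration of whitespace-only tokenization (not the Stanford class itself, just the same splitting rule; in CoreNLP proper, the tokenize.whitespace property enables the equivalent behaviour):

text = "Vendo iPhone 5s es libre de fabrica esta nuevo sin usar."
print(text.split())
# ['Vendo', 'iPhone', '5s', 'es', 'libre', 'de', 'fabrica',
#  'esta', 'nuevo', 'sin', 'usar.']

"5s" now survives as a single token, but the final period stays attached to "usar.".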

conversion of MathematicalPI symbol names to Unicode

I am processing PDF files and wish to convert characters to Unicode as far as possible. The MathematicalPI family of character sets appears to use its own symbol names (e.g. "H11001"). By exploration I have constructed a table (for MathematicalPI-One) like:
<chars>
<char charname="H11001" codepoint16="0x2B" codepoint="43" unicodeName="PLUS"/>
<char charname="H11002" codepoint16="0x2D" codepoint="45" unicodeName="MINUS"/>
<char charname="H11003" codepoint16="0xD7" codepoint="215" unicodeName="MULTIPLICATION SIGN"/>
<char charname="H11005" codepoint16="0x3D" codepoint="61" unicodeName="EQUALS"/>
</chars>
Can anyone point me to an existing translation table like this (ideally for all MathematicalPI sets)? [I don't want a graphical display of glyphs, as that means each has to be looked up for a Unicode equivalent.]
Also there seems to be a similar symbol resource where the charnames are of the form C223 (for copyright). Any information on this will be appreciated.
UPDATE:
I need something well beyond @user1808924's answer - I have already compiled my own (partial) translation table, so it's certainly possible to construct one. It is possible to download and display a list of glyphs in MathematicalPI (many hundreds) and to go through the Unicode spec making equivalences (and for the majority I think there are clear equivalences). A satisfactory answer would either include a table with hundreds of equivalences or a definitive statement that this would violate the copyright of the font creator.
UPDATE: Between @minopret and @Miguel it is certainly possible to construct a mapping. The MathPi sets are well defined - a few hundred glyphs - and shapecatcher makes it easy to find the best matches pictorially. The mapping won't be definitive (i.e. with Adobe's stamp) but it will be worthwhile. And I suspect there will be cases where two different glyphs are essentially identical, so a visual mapping won't work - e.g. is an equilateral triangle INCREMENT or GREEK CAPITAL LETTER DELTA?
I doubt that I personally will complete a full table - I don't know what some of the symbols mean. But I hope to produce a subset used in scientific, technical, and medical (STM) publishing.
@user1808924 I notice you answered this on your first day on SO. Bounties are normally offered (as in this case) for difficult questions where there is a definitive answer but it is difficult to find. It's not normally useful to offer opinions or guesses unless you have expert knowledge of the area.
I do not think that such a translation table is available at all.
It looks to me as if the MathematicalPI font family is a synthetic one, created ad hoc by selecting a subset of elements from some larger unknown set. The raison d'être of the MathematicalPI font family seems to be the representation of simple algebraic operators (plus, minus, multiplication, division) and the equals sign. The charnames (i.e. H1100X) appear to be artifacts, because they are not ordered by codepoint value (e.g. the equals sign comes last).
By looking at the available data, I can suggest that the missing H11004 charname should correspond to the division operator. However, it is impossible to predict whether it should be represented by the Unicode "solidus" character (i.e. U+002F), the "division sign" character (i.e. U+00F7), or something else.
Here's what I published on the Adobe Forums site:
I could be wrong, but I don't think there's an official correspondence table.
Using the six Type 1 fonts and the OpenType font that was made out of them, I've assembled two PDFs which show all the glyphs. Next to them are the glyph names (for the Type 1 fonts) and the Unicode value(s) (for the OpenType font). If you cross reference these two PDFs, you should be able to assemble the correlation list you're looking for.
Mathematical Pi
Hope this helps.
Miguel
Here is the best information as provided by Miguel Sousa of Adobe in his Typography forum message there:
Mathematical Pi 1-6 PDF / Mathematical Pi 1-6 InDesign IDML
Mathematical Pi Std PDF / Mathematical Pi Std IDML
For what it's worth, and to summarize information that I added in comments on this answer, here is what I was able to find before and apart from that.
Michael Sharpe, creator of the package "mathalfa" on CTAN and a member of UCSD mathematics, has TeX definitions for Mathematical Pi in this archive file. I successfully guessed that the obsolete documented location at me.com has moved to his university site. The ".vf" files map the characters of Mathematical Pi to TeX math codepoints. They are binary; the mapping data can be dumped to readable text using the tool "vftovp", which is part of TeX distributions. After performing that dump, we find that the mapped characters are:
mathpibb: 'hyphen-minus' 0-9 A-Z a-z
mathpical: percent 'hyphen-minus' A-Z
mathpifrak: 'hyphen-minus' 0-9 A-Z a-z
mh2s: A-Z
So that explains the package name "mathalfa". He took on only the task of mapping the alphabetics and digits, but hardly anything more. We must look at the files above for mappings for the symbols.
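For example, one of those virtual fonts can be dumped to readable text like this (assuming the .vf and .tfm files are in the current directory; vftovp writes the VPL to standard output):

vftovp mathpical.vf mathpical.tfm > mathpical.vpl

The MAPFONT and CHARACTER entries in the resulting .vpl file contain the mapping data described above.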
I think that parts of MathPi, such as the Greek letters of MathPi 1, use the same encoding as Adobe Symbol, which is documented here: http://unicode.org/Public/MAPPINGS/VENDORS/ADOBE/symbol.txt
When attempting to map symbols to Unicode oneself, a good way to find the Unicode point is by drawing the glyph on the screen here: http://shapecatcher.com
FWIW, my current mapping table (from reading documents created using MathPi) is:
<codePoint name="H9251" unicode="U+03B1" unicodeName="GREEK LOWERCASE LETTER ALPHA"/>
<codePoint name="H9252" unicode="U+03B2" unicodeName="GREEK LOWERCASE LETTER BETA"/>
<codePoint name="H9253" unicode="U+03B3" unicodeName="GREEK SMALL LETTER GAMMA"/>
<codePoint name="H9254" unicode="U+03B4" unicodeName="GREEK SMALL LETTER DELTA"/>
<codePoint name="H9255" unicode="U+03B5" unicodeName="GREEK SMALL LETTER EPSILON"/>
<codePoint name="H9256" unicode="U+03B6" unicodeName="GREEK SMALL LETTER ZETA"/>
<codePoint name="H9257" unicode="U+03B7" unicodeName="GREEK SMALL LETTER ETA"/>
<codePoint name="H9258" unicode="U+03B8" unicodeName="GREEK SMALL LETTER THETA"/>
<codePoint name="H9259" unicode="U+03B9" unicodeName="GREEK SMALL LETTER IOTA"/>
<codePoint name="H9260" unicode="U+03BA" unicodeName="GREEK SMALL LETTER KAPPA"/>
<codePoint name="H9261" unicode="U+03BB" unicodeName="GREEK SMALL LETTER LAMBDA"/>
<codePoint name="H9262" unicode="U+03BC" unicodeName="GREEK LOWERCASE LETTER MU"/>
<codePoint name="H11001" unicode="U+002B" decimal="43" unicodeName="PLUS"/>
<codePoint name="H11002" unicode="U+002D" decimal="45" unicodeName="MINUS"/>
<codePoint name="H11003" unicode="U+00D7" decimal="215" unicodeName="MULTIPLICATION SIGN"/>
<codePoint name="H11005" unicode="U+003D" decimal="61" unicodeName="EQUALS"/>
<codePoint name="H11011" unicode="U+007E" decimal="126" unicodeName="TILDE"/>
<codePoint name="H11021" unicode="U+003C" decimal="60" unicodeName="LESS" htmlName="lt"/>
<codePoint name="H11022" unicode="U+003E" decimal="62" unicodeName="" htmlName="gt"/>
<codePoint name="H11032" unicode="U+0027" decimal="39" unicodeName="APOSTROPHE" htmlName="apos"/>
<codePoint name="H11034" unicode="U+00B0" decimal="176" unicodeName="DEGREE SIGN" htmlName="deg"/>
<codePoint name="H11554" unicode="U+00B7" decimal="183" unicodeName="MIDDLE DOT"/>

Xcode - Display Vietnamese: Unicode problem

I need to display Vietnamese in my app, but right now I cannot show the words in the correct format. For example, a word written as "&#code;" is not converted to Vietnamese; it just displays "&#code;" literally.
Can anyone help me with how to handle these words in Unicode?
Thanks a lot!
Tisa
Just write the Unicode string inside @"..." without any escaping. Strictly speaking, that's non-portable, but as long as you use it just for Objective-C, it should be OK. It works on a modern Xcode toolchain.
In general, you need to understand that &#...; is a way to quote a Unicode character in HTML, not in a C string. In C, if you want to be most portable, you need to use \x escapes. Some newer compilers accept \u... and \U... escapes for Unicode code points.
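Here is a minimal Objective-C sketch of both forms (assuming a UTF-8 source file, which modern Xcode uses by default; "Tiếng Việt" simply means "Vietnamese"):

#import <Foundation/Foundation.h>

int main(void) {
    @autoreleasepool {
        NSString *direct  = @"Tiếng Việt";            // Unicode typed directly
        NSString *escaped = @"Ti\u1EBFng Vi\u1EC7t";  // same text via \u escapes
        NSLog(@"%@ / %@", direct, escaped);           // both print "Tiếng Việt"
    }
    return 0;
}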