iOS CFStringTransform and Đ - objective-c

I'm working on an iOS app in which I have to list and sort people's names. I have some problems with special characters.
I need some clarification on Martin R's answer at https://stackoverflow.com/a/15154823/2148377
You could use the CoreFoundation CFStringTransform function which does almost all transformations from your list. Only "đ" and "Đ" have to be handled separately:
Why this particular letter? Where does this come from? Where can I find the documentation?
Thanks a lot.

I am not 100% sure, but I think the answer can be seen in the Unicode Character Database
http://www.unicode.org/Public/6.2.0/ucd/UnicodeData.txt.
For example, the entry for "à" is
00E0;LATIN SMALL LETTER A WITH GRAVE;Ll;0;L;0061 0300;;;;N;LATIN SMALL LETTER A GRAVE;;00C0;;00C0
where field #6 is the "Decomposition mapping" into "a" + U+0300 (COMBINING GRAVE ACCENT),
therefore
CFStringTransform(..., kCFStringTransformStripCombiningMarks, ...)
transforms "à" into "a".
The entries for "Đ" and "đ" are
0110;LATIN CAPITAL LETTER D WITH STROKE;Lu;0;L;;;;;N;LATIN CAPITAL LETTER D BAR;;;0111;
0111;LATIN SMALL LETTER D WITH STROKE;Ll;0;L;;;;;N;LATIN SMALL LETTER D BAR;;0110;;0110
where field #6 is empty, so these characters do not have a decomposition into a "base character" and a "combining mark".
So the question remains: Which standard determines that a "normalized form" of "đ / Đ" is "d / D"?
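The difference between the two entries can be checked outside of CoreFoundation as well. The following is a minimal sketch in Python, assuming that decomposing to NFD and dropping nonspacing marks is a fair approximation of kCFStringTransformStripCombiningMarks; the helper name strip_combining_marks is made up for this illustration and is not part of any Apple API.

```python
import unicodedata

def strip_combining_marks(text):
    """Decompose to NFD, then drop nonspacing combining marks (category Mn)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

# U+00E0 (a with grave) has a decomposition mapping, U+0111 (d with stroke) has none.
print(repr(unicodedata.decomposition("\u00e0")))  # '0061 0300'
print(repr(unicodedata.decomposition("\u0111")))  # ''

# The accent is stripped, but the stroke letters pass through unchanged.
print(strip_combining_marks("\u00e0\u0111\u0110"))
```

Because "đ" and "Đ" come out unchanged, they would still need an explicit replacement step, which matches the remark in the quoted answer.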

What are terminal and nonterminal symbols?

I am reading the Rebol Wikipedia page.
"Parse expressions are written in the parse dialect, which, like the do dialect, is an expression-oriented sublanguage of the data exchange dialect. Unlike the do dialect, the parse dialect uses keywords representing operators and the most important nonterminals"
Can you explain what terminals and nonterminals are? I have read a lot about grammars, but I did not understand what these terms mean. Here is another link where these words are used very often.
Definitions of terminal and non-terminal symbols are not Parse-specific, but are concerned with grammars in general. Things like this wiki page or the intro in Grune's book explain them quite well. OTOH, if you're interested in how Red Parse works and want simple examples and guidance, I suggest dropping by our dedicated chat room.
"Parsing" has slightly different meanings, but the one I prefer is conversion of a linear structure (string of symbols, in a broad sense) to a hierarchical structure (derivation tree) via a formal recipe (grammar), or checking if a given string has a tree-like structure specified by a grammar (i.e. if a "string" belongs to a "language").
All symbols in a string are terminals, in a sense that tree derivation "terminates" on them (i.e. they are leaves in a tree). Non-terminals, in turn, are a form of abstraction that is used in grammar rules - they group terminals and non-terminals together (i.e. they are nodes in a tree).
For example, in the following Parse grammar:
greeting: ['hi | 'hello | 'howdy]
person: [name surname]
name: ['john | 'jane]
surname: ['doe | 'smith]
sentence: [greeting person]
greeting, person, name, surname and sentence are non-terminals (because they never actually appear in the linear input sequence, only in grammar rules);
hi, hello, howdy, john, jane, doe and smith are terminals (because the parser cannot "expand" them into a set of terminals and non-terminals as it does with non-terminals, hence it "terminates" by reaching the bottom).
>> parse [hi jane doe] sentence
== true
>> parse [howdy john smith] sentence
== true
>> parse [wazzup bubba ?] sentence
== false
As you can see, terminals and non-terminals are disjoint sets, i.e. a symbol can be in one of them, but not in both; moreover, inside grammar rules, only non-terminals can be written on the left side.
One grammar can match different strings, and one string can be matched by different grammars (in the example above, it could be [greeting name surname], or [exclamation 2 noun], or even [some noun], provided that exclamation and noun non-terminals are defined).
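For readers who do not have Red or Rebol at hand, the same grammar can be sketched in Python. This is only an illustration of the terminal/non-terminal distinction, not how Parse is implemented; the names GRAMMAR, match and parse are invented for this sketch.

```python
# Non-terminals are the dictionary keys (the "left side" of the rules).
# Every symbol that is not a key is a terminal.
GRAMMAR = {
    "greeting": [["hi"], ["hello"], ["howdy"]],
    "person": [["name", "surname"]],
    "name": [["john"], ["jane"]],
    "surname": [["doe"], ["smith"]],
    "sentence": [["greeting", "person"]],
}

def match(symbol, tokens, pos):
    """Return the set of positions reachable after matching symbol at pos."""
    if symbol not in GRAMMAR:
        # Terminal: derivation stops here, compare with the input directly.
        if pos < len(tokens) and tokens[pos] == symbol:
            return {pos + 1}
        return set()
    ends = set()
    for alternative in GRAMMAR[symbol]:
        positions = {pos}
        for part in alternative:
            # Non-terminal: expand into its parts, one after another.
            positions = {end for p in positions for end in match(part, tokens, p)}
        ends |= positions
    return ends

def parse(tokens, start="sentence"):
    """True if the whole token list can be derived from the start symbol."""
    return len(tokens) in match(start, tokens, 0)

print(parse("hi jane doe".split()))       # True
print(parse("howdy john smith".split()))  # True
print(parse("wazzup bubba ?".split()))    # False
```

Only the terminals ever appear in the token lists passed to parse; the non-terminals exist only as keys of the grammar.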
And, as usual, one picture is worth a thousand words:
Hope that helps.
Think of it like this.
A digit can be 1-9.
Now I tell you to write down a digit on a page.
So you know that you can write down 1, 2, 3, 4, 5, 6, 7, 8 or 9.
Basically, the nonterminal symbol is "digit",
and the terminal symbols are 1, 2, 3, 4, 5, 6, 7, 8, 9.
When I told you to write down a digit on a page, you wrote down 1 or 2 or 3 or 4 or 5 or 6 or 7 or 8 or 9.
You didn't write down the word "digit"; you wrote down 1 or 2 or 3...
Do you see where I'm going?
Let's try to make our own "rules".
Let's "create" a nonterminal symbol and call it "Olaf".
Olaf can be a dog (NOTE: dog is a terminal)
Olaf can be a cat (NOTE: cat is a terminal)
Olaf can be a digit (NOTE: digit is a nonterminal)
Now I'm telling you that you can write down an Olaf on a page.
That means that you can write down "dog".
You can also write down "cat".
You can also write down a digit, which means you can write down 1 or 2 or 3...
Because digit is a nonterminal symbol, you don't write down "digit"; you write down
the symbols that digit refers to, which are 1 or 2 or 3 etc.
In the end, only terminal symbols are written on the "page".
One more thing you may encounter one day: when you say "a nonterminal can be something",
there is a special term for that. It is called a "production rule" (or simply a "production").
For example:
Olaf can be "dog"
Olaf can be "cat"
Olaf can be digit
We have 3 productions here; in other words, 3 definitions of Olaf.
Specifications of programming languages use these ideas quite a lot when defining the syntax of a language.
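The "Olaf" rules can also be written down as data. Here is a small sketch in Python, where the names PRODUCTIONS and writable are made up for this example: every non-terminal is a key, each list entry is one production, and expanding a symbol always ends in terminals.

```python
# Each non-terminal maps to its productions.
# Anything that is not a key of this dictionary is a terminal.
PRODUCTIONS = {
    "digit": ["1", "2", "3", "4", "5", "6", "7", "8", "9"],
    "Olaf": ["dog", "cat", "digit"],
}

def writable(symbol):
    """Return every terminal that can end up on the "page" for this symbol."""
    if symbol not in PRODUCTIONS:
        # Terminal: it is written down as-is.
        return [symbol]
    results = []
    for alternative in PRODUCTIONS[symbol]:
        # Non-terminal: replace it by whatever its productions allow.
        results.extend(writable(alternative))
    return results

print(writable("Olaf"))
```

The printed list contains dog, cat and the digits 1 to 9, but never the words "Olaf" or "digit" themselves.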

What characters are allowed in function names etc?

As the title says, which characters are allowed in identifiers (function, variable, and record field names)? aöø all seem to be fine, as do '_9 if not the first character. <$;% do not. Is it documented somewhere which ranges/blocks of unicode characters and symbols are allowed?
Follow-up question: which characters are allowed in infix operators?
So, after reading the Haskell spec (which presumably influenced Elm), the JavaScript spec, and some trial and error, I have arrived at the following rules:
An identifier must begin with a character from one of the following Unicode categories:
Uppercase letter (Lu) (modules, types)
Lowercase letter (Ll) (functions, variables)
Titlecase letter (Lt) (modules, types)
The rest of the characters must belong to any of the following categories:
Uppercase letter (Lu)
Lowercase letter (Ll)
Titlecase letter (Lt)
Modifier letter (Lm)
Other letter (Lo)
Decimal digit number (Nd)
Letter number (Nl)
Or be _ (except for in module names).
Technically "Other number" (No) also seems to be valid in Elm, but the program crashes once it has been compiled to JavaScript.
I used this tool to get the ranges for each category.
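The rules above can be tried out with Python's unicodedata module. This is only a sketch of the empirically derived rules in this answer, not the Elm compiler's actual lexer; the function name looks_like_identifier is invented, and the apostrophe mentioned in the question is deliberately not covered because the rules above do not list it.

```python
import unicodedata

# Categories allowed for the first character and for the remaining ones,
# copied from the lists in this answer.
FIRST = {"Lu", "Ll", "Lt"}
REST = FIRST | {"Lm", "Lo", "Nd", "Nl"}

def looks_like_identifier(name):
    """Check a name against the category rules listed above."""
    if not name:
        return False
    if unicodedata.category(name[0]) not in FIRST:
        return False
    return all(ch == "_" or unicodedata.category(ch) in REST for ch in name[1:])

for candidate in ["a\u00f6\u00f8", "x_9", "_x", "9x", "a$", "a<"]:
    print(candidate, looks_like_identifier(candidate))
```

The first two candidates pass; the others fail because they start with an underscore or digit, or contain a symbol character.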

conversion of MathematicalPI symbol names to Unicode

I am processing PDF files and wish to convert characters to Unicode as far as possible. The MathematicalPI family of character sets appear to use their own symbol names (e.g. "H11001"). By exploration I have constructed a table (for MathematicalPI-One) like:
<chars>
<char charname="H11001" codepoint16="0x2B" codepoint="43" unicodeName="PLUS SIGN"/>
<char charname="H11002" codepoint16="0x2D" codepoint="45" unicodeName="HYPHEN-MINUS"/>
<char charname="H11003" codepoint16="0xD7" codepoint="215" unicodeName="MULTIPLICATION SIGN"/>
<char charname="H11005" codepoint16="0x3D" codepoint="61" unicodeName="EQUALS SIGN"/>
</chars>
Can anyone point me to an existing translation table like this (ideally for all MathematicalPI sets)? [I don't want a graphical display of glyphs, as that means each one has to be looked up for its Unicode equivalent.]
Also there seems to be a similar symbol resource where the charnames are of the form C223 (for copyright). Any information on this will be appreciated.
UPDATE:
I need something well beyond @user1808924's answer - I have already compiled my own (partial) translation table so it's certainly possible to construct one. It is possible to download and display a list of glyphs in MathematicalPI (many hundreds) and to go through the Unicode spec making equivalences (and for the majority I think there are clear equivalences). A satisfactory answer would either include a table with hundreds of equivalences or a definitive statement that this would violate the copyright of the font creator.
UPDATE: Between @minopret and @Miguel it is certainly possible to construct a mapping. The MathPi sets are well defined - a few hundred - and shapecatcher makes it easy to find the best glyphs pictorially. The mapping won't be definitive (i.e. with Adobe's stamp) but it will be worthwhile. And I suspect there will be cases where two different glyphs are essentially identical and so a visual mapping won't work - e.g. is an equilateral triangle INCREMENT or GREEK CAPITAL LETTER DELTA?
I doubt that I personally will complete a full table - I don't know what some of the symbols mean. But I hope to produce a subset used in scientific, technical and medical (STM) publishing.
@user1808924 I notice you answered this on your first day on SO. Bounty questions are normally offered (as in this case) for difficult questions where there is a definitive answer but it is difficult to find. It's not normally useful to offer opinions or guesses unless you have expert knowledge of the area.
I do not think that there is such translation table available at all.
It looks to me that the MathematicalPI font family is a synthetic one, which has been created ad hoc by selecting a subset of elements from some larger unknown set. The raison d'être of the MathematicalPI font family seems to be the representation of simple algebraic operators (plus, minus, multiplication, division) and the equals sign. The charnames (i.e. H1100X) appear to be artifacts, because they are not ordered by codepoint value (e.g. the equals sign is the last one).
By looking at the available data, I can suggest that the missing H11004 charname should correspond to the division operator. However, it is impossible to predict if it should be represented by the Unicode "solidus" character (i.e. U+002F), the "division sign" character (i.e. U+00F7), or something else.
Here's what I published on the Adobe Forums site:
I could be wrong, but I don't think there's an official correspondence table.
Using the six Type 1 fonts and the OpenType font that was made out of them, I've assembled two PDFs which show all the glyphs. Next to them are the glyph names (for the Type 1 fonts) and the Unicode value(s) (for the OpenType font). If you cross reference these two PDFs, you should be able to assemble the correlation list you're looking for.
Mathematical Pi
Hope this helps.
Miguel
Here is the best information as provided by Miguel Sousa of Adobe in his Typography forum message there:
Mathematical Pi 1-6 PDF / Mathematical Pi 1-6 InDesign IDML
Mathematical Pi Std PDF / Mathematical Pi Std IDML
For what it's worth and to summarize information that I had added in comments on this answer, here is what I was able to find before and apart from that.
Michael Sharpe, creator of package "mathalfa" at CTAN and member of UCSD mathematics, has TeX definitions for Mathematical Pi in this archive file. I successfully guessed that the obsolete documented location at me.com has moved to his university site. The ".vf" files map the characters of Mathematical Pi to TeX math codepoints. They are binary. The mapping data is part of the dump to readable text using the tool "vftovp" that is part of TeX distributions. After performing that dump, we find that the mapped characters are:
mathpibb: 'hyphen-minus' 0-9 A-Z a-z
mathpical: percent 'hyphen-minus' A-Z
mathpifrak: 'hyphen-minus' 0-9 A-Z a-z
mh2s: A-Z
So that explains the package name "mathalfa". He took on only the task of employing the alphabetics and digits but hardly anything more. We must look at the files above for mappings for the symbols.
I think that parts of MathPi, such as the Greek letters of MathPi 1, use the same encoding as Adobe Symbol, which is documented here: http://unicode.org/Public/MAPPINGS/VENDORS/ADOBE/symbol.txt
When attempting to map symbols to Unicode oneself, a good way to find the Unicode point is by drawing the glyph on the screen here: http://shapecatcher.com
FWIW my current mapping table (from reading documents created using MathPI) is:
<codePoint name="H9251" unicode="U+03B1" unicodeName="GREEK SMALL LETTER ALPHA"/>
<codePoint name="H9252" unicode="U+03B2" unicodeName="GREEK SMALL LETTER BETA"/>
<codePoint name="H9253" unicode="U+03B3" unicodeName="GREEK SMALL LETTER GAMMA"/>
<codePoint name="H9254" unicode="U+03B4" unicodeName="GREEK SMALL LETTER DELTA"/>
<codePoint name="H9255" unicode="U+03B5" unicodeName="GREEK SMALL LETTER EPSILON"/>
<codePoint name="H9256" unicode="U+03B6" unicodeName="GREEK SMALL LETTER ZETA"/>
<codePoint name="H9257" unicode="U+03B7" unicodeName="GREEK SMALL LETTER ETA"/>
<codePoint name="H9258" unicode="U+03B8" unicodeName="GREEK SMALL LETTER THETA"/>
<codePoint name="H9259" unicode="U+03B9" unicodeName="GREEK SMALL LETTER IOTA"/>
<codePoint name="H9260" unicode="U+03BA" unicodeName="GREEK SMALL LETTER KAPPA"/>
<codePoint name="H9261" unicode="U+03BB" unicodeName="GREEK SMALL LETTER LAMBDA"/>
<codePoint name="H9262" unicode="U+03BC" unicodeName="GREEK SMALL LETTER MU"/>
<codePoint name="H11001" unicode="U+002B" decimal="43" unicodeName="PLUS SIGN"/>
<codePoint name="H11002" unicode="U+002D" decimal="45" unicodeName="HYPHEN-MINUS"/>
<codePoint name="H11003" unicode="U+00D7" decimal="215" unicodeName="MULTIPLICATION SIGN"/>
<codePoint name="H11005" unicode="U+003D" decimal="61" unicodeName="EQUALS SIGN"/>
<codePoint name="H11011" unicode="U+007E" decimal="126" unicodeName="TILDE"/>
<codePoint name="H11021" unicode="U+003C" decimal="60" unicodeName="LESS-THAN SIGN" htmlName="lt"/>
<codePoint name="H11022" unicode="U+003E" decimal="62" unicodeName="GREATER-THAN SIGN" htmlName="gt"/>
<codePoint name="H11032" unicode="U+0027" decimal="39" unicodeName="APOSTROPHE" htmlName="apos"/>
<codePoint name="H11034" unicode="U+00B0" decimal="176" unicodeName="DEGREE SIGN" htmlName="deg"/>
<codePoint name="H11554" unicode="U+00B7" decimal="183" unicodeName="MIDDLE DOT"/>
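A table like this can be put to work with a small lookup. The sketch below is in Python and uses only the charname/codepoint pairs listed above; the names MATHPI_TO_CODEPOINT, translate and official_name are invented for the example, and unknown charnames are returned as None instead of being guessed.

```python
import unicodedata

# Charname -> code point, copied from the table above (nothing else is assumed).
MATHPI_TO_CODEPOINT = {
    "H9251": 0x03B1, "H9252": 0x03B2, "H9253": 0x03B3, "H9254": 0x03B4,
    "H9255": 0x03B5, "H9256": 0x03B6, "H9257": 0x03B7, "H9258": 0x03B8,
    "H9259": 0x03B9, "H9260": 0x03BA, "H9261": 0x03BB, "H9262": 0x03BC,
    "H11001": 0x002B, "H11002": 0x002D, "H11003": 0x00D7, "H11005": 0x003D,
    "H11011": 0x007E, "H11021": 0x003C, "H11022": 0x003E, "H11032": 0x0027,
    "H11034": 0x00B0, "H11554": 0x00B7,
}

def translate(charname):
    """Return the Unicode character for a MathPi charname, or None if unmapped."""
    codepoint = MATHPI_TO_CODEPOINT.get(charname)
    return None if codepoint is None else chr(codepoint)

def official_name(charname):
    """Return the character name from the Unicode database, or None if unmapped."""
    character = translate(charname)
    return None if character is None else unicodedata.name(character)

print(translate("H11003"), official_name("H11003"))
print(translate("H11004"))  # not in the table, so None
```

Taking the names from unicodedata rather than typing them by hand keeps the unicodeName column consistent with the Unicode Character Database.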

RegexKitLite: Match Expression --> Match anything except ] --> Match ]

I am essentially attempting to replace all of the footnotes in a large text. There are various reasons I am doing this in Objective-C, so please assume that constraint.
Every footnote begins with this: [Footnote
Every footnote ends with this: ]
There can be absolutely anything between those two markers, including line breaks. However, there will never be ] between them.
So, essentially I want to match [Footnote, then match anything except ], until ] is matched.
This is the closest I have been able to get to identifying all of the footnotes:
NSString *regexString = @"[\\[][F][o][o][t][n][o][t][e][^\\]\n]*[\\]]";
Using this regular expression manages to identify 780/889 footnotes. It also appears that none of those 780 are false alarms. The only ones it appears to miss are those footnotes that have line breaks in them.
I have spent a long time on www.regular-expressions.info, specifically on the page about dots (http://www.regular-expressions.info/dot.html). This has helped me to create the above regular expression, but I have not figured out how to match any character or line break except a right bracket.
Using the following regular expression instead manages to capture all of the footnotes, but it captures way too much text, because * is greedy: (?s)[\\[][F][o][o][t][n][o][t][e].*[\\]]
Here is some sample text that the regular expression is run on:
<p id="id00082">[Footnote 1: In the history of Florence in the early part of the XVIth century <i>Piero di Braccio Martelli</i> is frequently mentioned as <i>Commissario della Signoria</i>. He was famous for his learning and at his death left four books on Mathematics ready for the press; comp. LITTA, <i>Famiglie celebri Italiane</i>, <i>Famiglia Martelli di Firenze</i>.—In the Official Catalogue of MSS. in the Brit. Mus., New Series Vol. I., where this passage is printed, <i>Barto</i> has been wrongly given for Braccio.</p>
<p id="id00083">2. <i>addi 22 di marzo 1508</i>. The Christian era was computed in Florence at that time from the Incarnation (Lady day, March 25th). Hence this should be 1509 by our reckoning.</p>
<p id="id00084">3. <i>racolto tratto di molte carte le quali io ho qui copiate</i>. We must suppose that Leonardo means that he has copied out his own MSS. and not those of others. The first thirteen leaves of the MS. in the Brit. Mus. are a fair copy of some notes on physics.]</p>
<p id="id00085">Suggestions for the arrangement of MSS treating of particular subjects.(5-8).</p>
When you put together the science of the motions of water, remember to include under each proposition its application and use, in order that this science may not be useless.--
[Footnote 2: A comparatively small portion of Leonardo's notes on water-power was published at Bologna in 1828, under the title: "_Del moto e misura dell'Acqua, di L. da Vinci_".]
In this example there are two footnotes and some non-footnote text. The first footnote, as you can see, contains two line breaks inside it. The second one contains no line breaks.
The first regular expression I mentioned above will manage to capture Footnote 2 in this example text, but it will not capture Footnote 1 because it contains line breaks.
Any improvements on my regular expression would be most appreciated.
Try
@"\\[Footnote[^\\]]*\\]";
This should match across newlines. No need to put a single character into a character class, either.
As a commented, multiline regex (without string escapes):
\[ # match a literal [
Footnote # match literal "Footnote"
[^\]]* # match zero or more characters except ]
\] # match ]
Inside a character class ([...]), the caret ^ takes on a different meaning; it negates the contents of the class. So [ab] matches a or b, whereas [^ab] matches any character except a or b.
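To see the effect without building an Objective-C project, the same pattern can be tried in Python's re module. This is a sketch only: RegexKitLite uses the ICU engine, and the assumption here is that a negated character class behaves the same way in both (it matches line breaks too). The sample text is shortened and made up.

```python
import re

# Same pattern as above, written as a raw string (no doubled backslashes).
FOOTNOTE = re.compile(r"\[Footnote[^\]]*\]")

text = (
    "Some text.\n"
    "[Footnote 1: first line\n"
    "second line\n"
    "third line.]\n"
    "More text.\n"
    "[Footnote 2: a single-line note.]\n"
)

matches = FOOTNOTE.findall(text)
print(len(matches))  # 2
print(matches[1])    # [Footnote 2: a single-line note.]

# For comparison, the greedy dot-all variant swallows everything up to the last ].
greedy = re.findall(r"(?s)\[Footnote.*\]", text)
print(len(greedy))   # 1
```

The negated class stops at the first ], so each footnote is matched on its own, including the one that spans several lines.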
Of course, if you have nested footnotes, this will malfunction. A text like [Footnote foo [Footnote bar] foo] will be matched from the beginning up to bar]. To avoid this, change the regex to
@"\\[Footnote[^\\]\\[]*\\]";
so neither opening nor closing brackets are allowed. Then of course, you only match the innermost Footnotes and will have to apply the same regex twice (or more, depending on the maximum level of nesting) to the entire text, "peeling back" layer by layer.
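The "peeling" idea can be sketched as a loop that keeps removing innermost footnotes until nothing matches any more. Again this is Python rather than RegexKitLite, and strip_footnotes is a hypothetical helper name; it removes the footnotes, so replace the empty string with whatever substitution you actually need.

```python
import re

# Innermost footnotes only: neither [ nor ] may occur inside.
INNERMOST = re.compile(r"\[Footnote[^\]\[]*\]")

def strip_footnotes(text):
    """Remove footnotes layer by layer, innermost first."""
    while True:
        text, count = INNERMOST.subn("", text)
        if count == 0:
            return text

print(strip_footnotes("a [Footnote x [Footnote y] z] b"))
```

The loop runs once per nesting level plus one final pass that finds nothing, so the maximum nesting depth does not have to be known in advance.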

ANTLR Parser Question

I'm trying to parse a number of text records where elements in a record are separated by a '+' char, and where the entire record is terminated by a '#' char. For example E1+E2+E3+E4+E5+E6#
Individual elements can be required or optional. If an element is optional, its value is simply missing. For example, if E2 were missing, the input string would be: E1++E3+E4+E5+E6#.
When dealing with empty trailing elements, however, the separator char ('+') may be missing as well. If, for example, the last 3 elements were missing, the string could be: E1+E2+E3#, but it could also be:
E1+E2+E3+++#
I have tried the following rule in Antlr:
'R1' 'E1 + E2 + E3' '+'? 'E4'? '+'? 'E5'? '+'? 'E6'? '#'
but Antlr complains that it's ambiguous, which of course is correct (every token following E3 could be E4, E5 or E6). The input syntax is fixed (it's from a legacy mainframe system), so I was wondering if anybody has a solution to this problem?
An alternative would be to specify all the different permutations in the rule, but that would be a major task.
Best regards and thanks,
Michael
That task sounds like overkill for ANTLR. Is there any reason you're not just splitting the string into an array using '+' as the separator?
If it's coming from a mainframe, it most likely was intended to be processed in a trivial way.
e.g.,
C++ : http://www.cplusplus.com/reference/clibrary/cstring/strtok/
PHP : http://us3.php.net/manual/en/function.explode.php
Java: http://java.sun.com/javase/6/docs/api/java/lang/String.html#split%28java.lang.String%29
C# : http://msdn.microsoft.com/en-us/library/system.string.split%28VS.71%29.aspx
Just a thought.
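As an illustration of the splitting approach, here is a minimal sketch in Python. The function name parse_record and the fixed count of six elements are assumptions based on the example records in the question; missing trailing elements are padded with empty strings.

```python
def parse_record(record, field_count=6):
    """Split one '#'-terminated record into exactly field_count elements."""
    if not record.endswith("#"):
        raise ValueError("record must end with '#'")
    fields = record[:-1].split("+")
    if len(fields) > field_count:
        raise ValueError("too many elements in record")
    # Trailing separators may be missing, so pad with empty elements.
    return fields + [""] * (field_count - len(fields))

print(parse_record("E1+E2+E3#"))
print(parse_record("E1+E2+E3+++#"))
print(parse_record("E1++E3+E4+E5+E6#"))
```

Both spellings of the record with three trailing empty elements give the same list, and a missing E2 simply shows up as an empty string in the second position. Checking which elements are required can then be done on the list.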
If this is ambiguous, it's likely because your Es all have the same format (a more complicated case would be that your Es all just start with the same k characters where k is your lookahead, but I'm going to assume that's not the case. If it is, this will still work; it will just require an extra step.)
So it looks like you can have up to 6 Es and up to 5 +s. We'll say a "segment" is an optional E followed by a + - you can have 5 segments, and an optional trailing E.
This grammar can be represented roughly like this (imperfect ANTLR syntax since I'm not very familiar with it):
r : (e_opt? PLUS){1,5} e_opt? END
e_opt : E // whatever your E is
PLUS : '+'
END : '#'
If ANTLR doesn't support anything like {1,5} then this is the same as:
(e_opt? PLUS) ((e_opt? PLUS) ((e_opt? PLUS) ((e_opt? PLUS) (e_opt? PLUS)?)?)?)?
which is not that clean, so maybe there is a nicer way to do it.
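The shape of that rule (one to five "optional E followed by +" segments, then an optional E and the terminator) can be tested quickly as a regular expression. This Python sketch is not ANTLR syntax, and it assumes that an element is any run of characters other than '+' and '#', which the question does not state.

```python
import re

# (e_opt? PLUS){1,5} e_opt? END, with an element assumed to be [^+#]+
RECORD = re.compile(r"(?:[^+#]*\+){1,5}[^+#]*#")

def is_record(line):
    """True if the whole line has the shape described by the rule above."""
    return RECORD.fullmatch(line) is not None

for line in ["E1+E2+E3#", "E1+E2+E3+++#", "E1++E3+E4+E5+E6#", "E1+E2+E3+E4+E5+E6+E7#"]:
    print(line, is_record(line))
```

The first three lines are accepted; the last one is rejected because it contains six separators.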