I am processing PDF files and wish to convert characters to Unicode as far as possible. The MathematicalPI family of character sets appears to use its own symbol names (e.g. "H11001"). By exploration I have constructed a table (for MathematicalPI-One) like:
<chars>
<char charname="H11001" codepoint16="0x2B" codepoint="43" unicodeName="PLUS SIGN"/>
<char charname="H11002" codepoint16="0x2D" codepoint="45" unicodeName="HYPHEN-MINUS"/>
<char charname="H11003" codepoint16="0xD7" codepoint="215" unicodeName="MULTIPLICATION SIGN"/>
<char charname="H11005" codepoint16="0x3D" codepoint="61" unicodeName="EQUALS SIGN"/>
</chars>
Can anyone point me to an existing translation table like this (ideally for all MathematicalPI sets)? [I don't want a graphical display of glyphs, as that means each one has to be looked up to find its Unicode equivalent.]
Also there seems to be a similar symbol resource where the charnames are of the form C223 (for copyright). Any information on this will be appreciated.
UPDATE:
I need something well beyond #user1808924's answer - I have already compiled my own (partial) translation table, so it's certainly possible to construct one. It is possible to download and display a list of glyphs in MathematicalPI (many hundreds) and to go through the Unicode spec making equivalences (and for the majority I think there are clear equivalences). A satisfactory answer would either include a table with hundreds of equivalences or a definitive statement that this would violate the copyright of the font creator.
UPDATE: Between #minopret and #Miguel it is certainly possible to construct a mapping. The MathPi sets are well defined - a few hundred - and shapecatcher makes it easy to find the best glyphs pictorially. The mapping won't be definitive (i.e. with Adobe's stamp) but it will be worthwhile. And I suspect there will be cases where two different glyphs are essentially identical and so a visual mapping won't work - e.g. is an equilateral triangle INCREMENT or GREEK CAPITAL LETTER DELTA?
I doubt that I personally will complete a full table - I don't know what some of the symbols mean. But I hope to produce a subset used in scientific, technical, and medical (STM) publishing.
#user1808924 I notice you answered this on your first day on SO. Bounty questions are normally offered (as in this case) for difficult questions where there is a definitive answer but it is difficult to find. It's not normally useful to offer opinions or guesses unless you have expert knowledge of the area.
I do not think that there is such translation table available at all.
It looks to me as though the MathematicalPI font family is a synthetic one, created ad hoc by selecting a subset of elements from some larger unknown set. The raison d'être of the MathematicalPI font family seems to be the representation of simple algebraic operators (plus, minus, multiplication, division) and the equals sign. The charnames (i.e. H1100X) appear to be artifacts, because they are not ordered by codepoint value (e.g. the equals sign is the last one).
Looking at the available data, I can suggest that the missing H11004 charname should correspond to the division operator. However, it is impossible to predict whether it should be represented by the Unicode "solidus" character (i.e. U+002F), the "division sign" character (i.e. U+00F7), or something else.
Here's what I published on the Adobe Forums site:
I could be wrong, but I don't think there's an official correspondence table.
Using the six Type 1 fonts and the OpenType font that was made out of them, I've assembled two PDFs which show all the glyphs. Next to them are the glyph names (for the Type 1 fonts) and the Unicode value(s) (for the OpenType font). If you cross reference these two PDFs, you should be able to assemble the correlation list you're looking for.
Mathematical Pi
Hope this helps.
Miguel
Here is the best information as provided by Miguel Sousa of Adobe in his Typography forum message there:
Mathematical Pi 1-6 PDF / Mathematical Pi 1-6 InDesign IDML
Mathematical Pi Std PDF / Mathematical Pi Std IDML
For what it's worth and to summarize information that I had added in comments on this answer, here is what I was able to find before and apart from that.
Michael Sharpe, creator of package "mathalfa" at CTAN and member of UCSD mathematics, has TeX definitions for Mathematical Pi in this archive file. I successfully guessed that the obsolete documented location at me.com has moved to his university site. The ".vf" files map the characters of Mathematical Pi to TeX math codepoints. They are binary. The mapping data is part of the dump to readable text using the tool "vftovp" that is part of TeX distributions. After performing that dump, we find that the mapped characters are:
mathpibb: 'hyphen-minus' 0-9 A-Z a-z
mathpical: percent 'hyphen-minus' A-Z
mathpifrak: 'hyphen-minus' 0-9 A-Z a-z
mh2s: A-Z
So that explains the package name "mathalfa". He took on only the task of employing the alphabetics and digits but hardly anything more. We must look at the files above for mappings for the symbols.
I think that parts of MathPi, such as the Greek letters of MathPi 1, use the same encoding as Adobe Symbol, which is documented here: http://unicode.org/Public/MAPPINGS/VENDORS/ADOBE/symbol.txt
When attempting to map symbols to Unicode oneself, a good way to find the Unicode point is by drawing the glyph on the screen here: http://shapecatcher.com
FWIW my current mapping table (from reading documents created using MathPI) is:
<codePoint name="H9251" unicode="U+03B1" unicodeName="GREEK SMALL LETTER ALPHA"/>
<codePoint name="H9252" unicode="U+03B2" unicodeName="GREEK SMALL LETTER BETA"/>
<codePoint name="H9253" unicode="U+03B3" unicodeName="GREEK SMALL LETTER GAMMA"/>
<codePoint name="H9254" unicode="U+03B4" unicodeName="GREEK SMALL LETTER DELTA"/>
<codePoint name="H9255" unicode="U+03B5" unicodeName="GREEK SMALL LETTER EPSILON"/>
<codePoint name="H9256" unicode="U+03B6" unicodeName="GREEK SMALL LETTER ZETA"/>
<codePoint name="H9257" unicode="U+03B7" unicodeName="GREEK SMALL LETTER ETA"/>
<codePoint name="H9258" unicode="U+03B8" unicodeName="GREEK SMALL LETTER THETA"/>
<codePoint name="H9259" unicode="U+03B9" unicodeName="GREEK SMALL LETTER IOTA"/>
<codePoint name="H9260" unicode="U+03BA" unicodeName="GREEK SMALL LETTER KAPPA"/>
<codePoint name="H9261" unicode="U+03BB" unicodeName="GREEK SMALL LETTER LAMBDA"/>
<codePoint name="H9262" unicode="U+03BC" unicodeName="GREEK SMALL LETTER MU"/>
<codePoint name="H11001" unicode="U+002B" decimal="43" unicodeName="PLUS SIGN"/>
<codePoint name="H11002" unicode="U+002D" decimal="45" unicodeName="HYPHEN-MINUS"/>
<codePoint name="H11003" unicode="U+00D7" decimal="215" unicodeName="MULTIPLICATION SIGN"/>
<codePoint name="H11005" unicode="U+003D" decimal="61" unicodeName="EQUALS SIGN"/>
<codePoint name="H11011" unicode="U+007E" decimal="126" unicodeName="TILDE"/>
<codePoint name="H11021" unicode="U+003C" decimal="60" unicodeName="LESS-THAN SIGN" htmlName="lt"/>
<codePoint name="H11022" unicode="U+003E" decimal="62" unicodeName="GREATER-THAN SIGN" htmlName="gt"/>
<codePoint name="H11032" unicode="U+0027" decimal="39" unicodeName="APOSTROPHE" htmlName="apos"/>
<codePoint name="H11034" unicode="U+00B0" decimal="176" unicodeName="DEGREE SIGN" htmlName="deg"/>
<codePoint name="H11554" unicode="U+00B7" decimal="183" unicodeName="MIDDLE DOT"/>
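A table in this form can be turned into a lookup with very little code. The following is only a minimal sketch in Python: the `<codePoints>` wrapper element and the function names `load_table` and `translate` are assumptions made for illustration, and the two sample entries use just the `name` and `unicode` attributes from the table above.

```python
import xml.etree.ElementTree as ET

# Two sample entries in the format of the table above. The <codePoints>
# wrapper is an assumption, added so the fragment parses as one document.
TABLE_XML = """
<codePoints>
  <codePoint name="H9251" unicode="U+03B1"/>
  <codePoint name="H11001" unicode="U+002B"/>
</codePoints>
"""

def load_table(xml_text):
    """Return a dict mapping a glyph name (e.g. 'H11001') to a Unicode string."""
    mapping = {}
    for entry in ET.fromstring(xml_text).iter("codePoint"):
        # "U+002B" -> 0x2B -> "+"
        code = int(entry.get("unicode")[2:], 16)
        mapping[entry.get("name")] = chr(code)
    return mapping

def translate(glyph_names, mapping, unknown="\uFFFD"):
    """Translate glyph names to text; unmapped names become U+FFFD."""
    return "".join(mapping.get(name, unknown) for name in glyph_names)

table = load_table(TABLE_XML)
print(translate(["H9251", "H11001", "H9251"], table))
```

Emitting U+FFFD (REPLACEMENT CHARACTER) for unmapped names keeps the gaps visible, which helps when the table is still partial.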
Related
I occasionally encounter some special characters while parsing PDF documents. They are actually two English letters, like 'fi', 'tt', or 'ti', but visually they look joined together, and they exist in the PDF string as a single character.
I checked the 'ToUnicode' entries for these characters, but I found that the 'ToUnicode' CMap table is corrupted, so I cannot find their Unicode values.
For example, <012E> Tj will print fi like attached picture. But in its corresponding Font's ToUnicode CMap: <012E> <0001>, which is meaningless.
Could anybody let me know their Unicode code points? Is it possible to find them from the corresponding font program?
Thanks for any advice.
fi:
tt:
ti:
First of all, what you call letter conjunctions usually is known as ligatures. Thus, I will use that term here from now on.
Unicode discourages the use of specific code points for ligatures:
The existing ligatures exist basically for compatibility and round-tripping with non-Unicode character sets. Their use is discouraged. No more will be encoded in any circumstances.
Ligaturing is a behavior encoded in fonts: if a modern font is asked to display “h” followed by “r”, and the font has an “hr” ligature in it, it can display the ligature. Some fonts have no ligatures, while others (especially fonts for non-Latin scripts) have hundreds of ligatures. It does not make sense to assign Unicode code points to all these font-specific possibilities.
(Unicode FAQ on ligatures)
Thus, you should not use the existing ligature code points.
You appear to attempt to find the correct ToUnicode mapping for ligature glyphs. For this simply remember that the values of ToUnicode mappings do not need to be single code points but may be multiple ones:
n beginbfchar
srcCode dstString
endbfchar
where dstString may be a string of up to 512 bytes.
(ISO 32000-1, section 9.10.3 ToUnicode CMaps)
Concerning your example, therefore:
For example, <012E> Tj will print fi like attached picture. But in its corresponding Font's ToUnicode CMap: <012E> <0001>, which is meaningless.
Simply use
<012E> <00660069>
If you want to use ligature code points nonetheless, consult the Wikipedia article on Orthographic Ligatures, which lists some ligature code points. In particular, it gives <FB01> for fi, so for your example:
<012E> <FB01>
But remember, their use is discouraged.
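To see what the two alternatives decode to, here is a small Python sketch. It assumes the destination string of a `bfchar` entry is UTF-16BE, as ISO 32000-1 specifies for ToUnicode CMaps; the helper name `decode_dst_string` is made up for this illustration.

```python
import unicodedata

def decode_dst_string(hex_digits):
    """Decode the hex digits of a ToUnicode destination string (UTF-16BE)."""
    return bytes.fromhex(hex_digits).decode("utf-16-be")

two_letters = decode_dst_string("00660069")  # 'f' followed by 'i'
ligature = decode_dst_string("FB01")         # single code point U+FB01

print(len(two_letters), len(ligature))

# Compatibility normalization (NFKC) expands the ligature code point
# back into the two plain letters.
print(unicodedata.normalize("NFKC", ligature) == two_letters)
```

A text extractor that receives U+FB01 can therefore still recover "fi" by normalizing, but mapping directly to the two letters avoids that extra step.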
I'm working on an iOS app in which I have to list and sort people's names. I have some problems with special characters.
I need some clarification on Martin R's answer at https://stackoverflow.com/a/15154823/2148377
You could use the CoreFoundation CFStringTransform function which does almost all transformations from your list. Only "đ" and "Đ" have to be handled separately:
Why this particular letter? Where does this come from? Where can I find the documentation?
Thanks a lot.
I am not 100% sure, but I think it can be seen from the Unicode Character Database
http://www.unicode.org/Public/6.2.0/ucd/UnicodeData.txt.
For example, the entry for "à" is
00E0;LATIN SMALL LETTER A WITH GRAVE;Ll;0;L;0061 0300;;;;N;LATIN SMALL LETTER A GRAVE;;00C0;;00C0
where field #6 is the "Decomposition mapping" into "a" + U+0300 (COMBINING GRAVE ACCENT),
therefore
CFStringTransform(..., kCFStringTransformStripCombiningMarks, ...)
transforms "à" into "a".
The entries for "Đ" and "đ" are
0110;LATIN CAPITAL LETTER D WITH STROKE;Lu;0;L;;;;;N;LATIN CAPITAL LETTER D BAR;;;0111;
0111;LATIN SMALL LETTER D WITH STROKE;Ll;0;L;;;;;N;LATIN SMALL LETTER D BAR;;0110;;0110
where field #6 is empty, so these characters do not have a decomposition into a "base character" and a "combining mark".
So the question remains: Which standard determines that a "normalized form" of "đ / Đ" is "d / D"?
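The difference between the two kinds of entries can be checked programmatically. This is a sketch in Python rather than CoreFoundation: the standard `unicodedata` module exposes the same decomposition field (from whichever Unicode version the Python build ships), and `strip_marks` is a hypothetical helper that only approximates what `kCFStringTransformStripCombiningMarks` does.

```python
import unicodedata

def strip_marks(text):
    """Decompose to NFD, then drop nonspacing marks (category Mn)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if unicodedata.category(c) != "Mn")

a_grave = "\u00E0"   # LATIN SMALL LETTER A WITH GRAVE
d_stroke = "\u0111"  # LATIN SMALL LETTER D WITH STROKE

# The decomposition field: "0061 0300" for a-grave, empty for d-stroke.
print(repr(unicodedata.decomposition(a_grave)))
print(repr(unicodedata.decomposition(d_stroke)))

# a-grave loses its accent; d-stroke passes through unchanged.
print(strip_marks(a_grave + d_stroke))
```

Because "đ" has no decomposition, any mapping of "đ" to "d" has to come from an explicit replacement table rather than from normalization.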
Problem: Some English words are translated to symbols
Greek letters as English words are translated to symbols:
example lambda is converted to the equivalent small Greek letter.
Logic and math words are translated to symbols.
examples: and, or, in, exists, sum, div, top, int, pm converts to symbols
or small empty square if the symbol is not recognized.
Scope: Windows XP 32-bit, Windows 7 64-bit with jEdit 4.5.2
This problem acts like an abbreviation expansion. As I type a-l-p-h-a then a space,
jEdit converts alpha to the small Greek letter alpha.
I have learned to live with this but would like to find a solution to the problem.
Any help would be appreciated. I don't know if this is a customization problem or a feature or a bug.
To turn off all abbreviations, go into Utilities > Global Options, then Abbreviations. Uncheck "Space bar expands abbrevs".
EDIT: I didn't realize you wanted to use abbreviations but not those specific ones.
To take out the abbreviations for lambda, alpha, etc., go into that same dialog, pick "global" if it isn't already selected, then select each one from the list and hit the minus button under the list. Unfortunately (at least in jEdit 4.5) you'll have to select each one and delete it individually; you can't select multiple entries.
I am using PDFBox to extract text from PDF.
The PDF has a tabular structure, which is quite simple, and the columns are very widely spaced from each other.
This works really well, except that all kinds of horizontal space gets converted into a single space character, so that I cannot tell columns apart anymore (space within words in a column looks just like space between columns).
I appreciate that a general solution is very hard, but in this case the columns are really far apart so that having a simple differentiation between "long spaces" and "space between words" would be enough.
Is there a way to tell PDFBox to turn horizontal whitespace of more than x inches into something other than a single space? A proportional approach (x inches become y spaces) would also work.
The pdftotext C library/tool has a '-layout' switch that tries to preserve the layout. Basically, if I can emulate that with PDFBox, that would be perfect.
There does not seem to be a setting for this, but I was able to modify the source for the PDFTextStripper tool to output a column separator (|) when a "long" space was encountered. In the code where it was building the output line it is possible to look at the x positions of the current and previous letter, and if it is large enough, do something special. PDFTextStripper has lots of protected methods, but turned out to be not really all that extensible. I ended up having to copy the whole class to change a private method.
Looking at the code in there, I call myself lucky that with the particular PDF, this simple approach was successful. A more general solution seems very tricky.
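The gap test itself is simple once the x positions are available. Below is a language-neutral sketch written in Python, not PDFBox code: the `(x_start, x_end, text)` tuples, the function name, and both thresholds are assumptions for illustration, and obtaining real positions from PDFBox is the part that required modifying the class.

```python
def join_with_separators(pieces, word_gap, column_gap, separator="|"):
    """Join the text pieces of one line, ordered by x position.

    pieces: list of (x_start, x_end, text) tuples (hypothetical input).
    A gap wider than column_gap yields `separator`; a gap wider than
    word_gap yields a single space; smaller gaps yield nothing.
    """
    parts = []
    previous_end = None
    for x_start, x_end, text in sorted(pieces):
        if previous_end is not None:
            gap = x_start - previous_end
            if gap > column_gap:
                parts.append(separator)
            elif gap > word_gap:
                parts.append(" ")
        parts.append(text)
        previous_end = x_end
    return "".join(parts)

# Made-up coordinates: two words close together, then a distant column.
line = [(100, 130, "Name"), (135, 160, "one"), (300, 330, "42.0")]
print(join_with_separators(line, word_gap=2, column_gap=50))
```

The thresholds depend on the font size and page layout, so they would need tuning for each kind of document.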
PDF text extraction is difficult.
If the text was output as one big string separated by spaces, such as:
PDFTextOut(" Column 1 Column 2 Column 3");
and you are using a fixed-width font such as Courier, then you could theoretically calculate the number of spaces between items of text because each character is the same width. If the font is proportional, such as Arial, then the calculation is harder.
In reality, most PDFs are generated by individually placing each piece of text directly into its position. Therefore, there is technically no space character or any other character between columns. The text is just placed at an absolute position on the page.
PDFMoveTo(100,100);
PDFTextOut("Column 1");
PDFMoveTo(250,100);
PDFTextOut("Column 2");
In order to perform data extraction on PDF documents you have to do a little bit more work to find and match column data by using pixel locations as you have mentioned and by making some assumptions and having a little bit of luck.
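As a rough illustration of matching by position, the sketch below (Python, with made-up data) groups absolutely placed text fragments into rows by their y coordinate and orders each row by x. The `(x, y, text)` tuples, the function name, and the tolerance are assumptions; real coordinates would have to come from whatever PDF library is in use.

```python
from collections import defaultdict

def fragments_to_rows(fragments, y_tolerance=2):
    """Group (x, y, text) fragments into rows.

    Fragments whose y values fall into the same bucket of size
    y_tolerance are treated as one row. Rows are returned in order of
    ascending y; within a row, texts are ordered left to right.
    """
    rows = defaultdict(list)
    for x, y, text in fragments:
        # Bucket y so that slightly misaligned baselines share a row.
        rows[round(y / y_tolerance)].append((x, text))
    return [[text for _, text in sorted(cells)]
            for _, cells in sorted(rows.items())]

# Fragments in arbitrary order, as they might appear in a content stream.
fragments = [
    (250, 100, "Column 2"),
    (100, 100, "Column 1"),
    (100, 120, "a"),
    (250, 120, "b"),
]
print(fragments_to_rows(fragments))
```

Simple bucketing can split a row whose y values straddle a bucket boundary, which is one of the places where "a little bit of luck" comes in.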
According to TkDocs:
The "1.0" here represents where to insert the text, and can be read as "line 1, character 0". This refers to the first character of the first line; for historical conventions related to how programmers normally refer to lines and characters, line numbers are 1-based, and character numbers are 0-based.
I hadn't heard of this convention before, and I can't find anything relevant on Google. Can anyone explain this to me please?
I think you're referring to Tk's text widget. The man page says:
Lines are numbered from 1 for consistency with other UNIX programs that use this numbering scheme.
Although, I'm not sure which Unix tools it's talking about.
Update:
As mentioned in the comments, it looks like many Unix text manipulation tools start line numbering at 1. Since Tcl/Tk has a Unix origin, it makes sense for it to be as compatible as possible with the underlying OS environment.
It really is nothing more than convention, but here is a suggestion.
Character positions are generally thought of in the same way as a Java iterator, which is a "pointer" to a position between two characters. Thus the first character is the one after index position 0. Substrings are taken between two inter-character positions, for instance.
Line positions on the other hand are generally thought of more in the way of a .NET enumerator, which is a "pointer" to the item itself, not to a position in between. Thus the first line is the line at position 1.
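To make the mixed convention concrete, here is a small Python sketch that converts a "line.char" index into a plain 0-based string offset. `index_to_offset` is a hypothetical helper written for this explanation, not part of Tk or tkinter, and it handles only the purely numeric index form.

```python
def index_to_offset(text, index):
    """Convert a Tk-style "line.char" index to a 0-based string offset."""
    line_str, char_str = index.split(".")
    line, char = int(line_str), int(char_str)
    lines = text.split("\n")
    # Lines are 1-based, so line N is lines[N - 1]; characters are 0-based.
    # Each earlier line contributes its length plus one for the newline.
    offset = sum(len(earlier) + 1 for earlier in lines[:line - 1])
    return offset + char

text = "hello\nworld"
print(index_to_offset(text, "1.0"))  # offset of 'h'
print(index_to_offset(text, "2.0"))  # offset of 'w'
print(index_to_offset(text, "2.3"))  # offset of 'l' in "world"
```

The only adjustment needed is the `line - 1` on the line number; the character part is already a 0-based offset within its line.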