I'm just beginning to learn about encoding issues, and I've learned just enough to know that it's far more complicated (on Windows, at least) than I had imagined, and that I have a lot more to learn.
I have an XML document that I believe is encoded in UTF-8. I'm using a VB.net app (XslCompiledTransform and XmlTextWriter) to transform the XML into a column-specific text file. Some characters in the XML come out wrong in the output text file. For example, an em-dash (—) is being turned into the three characters "—", and when that happens the columns in the file are thrown off.
As I understand it, an em-dash is not even a "Unicode character". I wouldn't expect to have issues with it. But, I can make the problem come and go by changing the encoding specified in the VB.net app.
If I use this, the em-dash is preserved:
Using writer = New XmlTextWriter(strOutputPath, New UTF8Encoding(True))
If I use this, the em-dash is corrupted into "—":
Using writer = New XmlTextWriter(strOutputPath, New UTF8Encoding(False))
The True/False simply tells VB whether to write the BOM at the beginning of the file. As I understand it, the BOM is neither necessary nor recommended for UTF-8. So, my preference is False - but then I get the weird characters.
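For context, the setup looks roughly like this (the stylesheet and file names are placeholders, not my real paths). As far as I can tell, with New UTF8Encoding(False) the em-dash should still be written as the UTF-8 bytes E2 80 94; the only difference from UTF8Encoding(True) is that the 3-byte BOM (EF BB BF) is not written at the start of the file.

' Rough sketch of the transform setup (paths are hypothetical placeholders).
Imports System.Text
Imports System.Xml
Imports System.Xml.Xsl

Module TransformSketch
    Sub Main()
        Dim xslt As New XslCompiledTransform()
        xslt.Load("transform.xslt")                        ' hypothetical stylesheet

        ' False = no BOM; the em-dash itself is still written as E2 80 94.
        Using writer As New XmlTextWriter("output.txt", New UTF8Encoding(False))
            xslt.Transform("input.xml", Nothing, writer)   ' hypothetical input file
        End Using
    End Sub
End Module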
I have several questions:
How can I be certain that the xml file is UTF-8? Is there a Windows tool that can tell me that?
How can I be certain that the transformed file is actually bad? Could it be that the real problem is the editor I'm using to look at it? Both EmEditor and UltraEdit show the same thing.
I've tried using the XVI32 hex editor to look at the file. I want to know what is actually written to disk, rather than what some GUI program is displaying to me. But, even on a file that looks good in EmEditor, XVI32 shows me the bad characters. Could it be that XVI32 just doesn't understand non-ASCII characters? What Windows hex editor would you recommend for this purpose?
The XML file is 650 MB, and the final text file is 380 MB - so that limits the list of useful tools somewhat.
You say 'As I understand it, an em-dash is not even a "Unicode character".' What do you mean by that? The Unicode character set definitely contains a code point for the em dash: U+2014. In the UTF-8 encoding, it is 3 bytes: E2, 80, 94.
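You can confirm that for yourself with a tiny snippet (just a sketch; any .NET console project will do):

' Print the UTF-8 bytes of the em dash (U+2014); this should output "E2-80-94".
Imports System.Text

Module EmDashBytes
    Sub Main()
        Dim emDash As String = ChrW(&H2014)               ' the em dash character
        Dim bytes = Encoding.UTF8.GetBytes(emDash)
        Console.WriteLine(BitConverter.ToString(bytes))   ' E2-80-94
    End Sub
End Module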
I suspect Martin Honnen is right that your editor is simply not showing the file properly. A couple of suggestions:
I'm not familiar with the editors you mention, but editors that handle different encodings will often silently choose an encoding by which to interpret the file (based on the BOM if there is one, and sometimes based on the character codes they see). They also typically have some way of showing which encoding they are interpreting the file as, and a way of telling them to load (or reload) the file as a particular encoding. If your editor does not have these features, I suggest you get one that does, such as EditPlus or Notepad++.
As for the hex editor, again I'm not familiar with the one you mention, but the whole point of a hex editor is to see the raw bytes. Such editors often offer a text view as well (often side by side with the hex view), and if they do, I would not rely on their handling of encoding. Just use them to view the hex bytes and see if the bytes for your em dash are the same in both files.
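If you'd rather not trust any editor at all, a small sketch like the following (the file name is a placeholder) counts the raw E2 80 94 sequences on disk, so neither display fonts nor encoding detection can get in the way:

' Count occurrences of the UTF-8 em-dash byte sequence (E2 80 94) in a file on disk.
' Note: ReadAllBytes loads the whole file into memory; for a 380 MB file a streamed
' scan would be gentler, but this keeps the sketch short.
Imports System.IO

Module FindEmDashBytes
    Sub Main()
        Dim data = File.ReadAllBytes("output.txt")        ' placeholder path
        Dim count = 0
        For i = 0 To data.Length - 3
            If data(i) = &HE2 AndAlso data(i + 1) = &H80 AndAlso data(i + 2) = &H94 Then
                count += 1
            End If
        Next
        Console.WriteLine("UTF-8 em-dash sequences found: {0}", count)
    End Sub
End Module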
Another way viewing the file can go wrong: even if your editor is interpreting the file as UTF-8, not all fonts will have all Unicode characters in them, and for characters not in the font they may display a little square or nothing at all. Try a few different fonts, or find one that purports to support Unicode (though no font supports ALL of Unicode, and there are several revisions of the Unicode spec which add more characters). Lucida Sans Unicode, I think, is one that will be on most Windows systems.
Another trick: I highly recommend the utility BabelMap. You can look up any unicode character there and see what the unicode value is, and you can copy the character from there and paste it into the file in your text editor and see how it displays it.
UltraEdit offers several configuration settings for working with UTF-8 encoded files. There is an Auto detect UTF-8 files setting in the File Handling - Unicode/UTF-8 Detection configuration dialog, which is enabled by default.
With this setting enabled, UltraEdit searches for the UTF-8 BOM. If it is not present, it searches the first few KB for a UTF-8 declaration, as is usually present in the head of HTML/XHTML files or in the first line of an XML file. If there is no BOM and no standardized encoding information at the top of the file, UltraEdit searches within the first 64 KB for byte sequences which look like UTF-8 encoded characters. If such a byte sequence can be found, the file is interpreted by UltraEdit as a UTF-8 encoded file. For example, a file containing only the 3 bytes E2 80 94 is interpreted as a UTF-8 encoded file.
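To illustrate the general idea of such a heuristic (this is not UltraEdit's actual code, just a rough sketch of what "looks like a UTF-8 encoded character" means):

' Rough sketch of a "looks like UTF-8" check: scan a buffer for at least one
' well-formed multibyte sequence (a lead byte followed by the right number of
' continuation bytes). Not UltraEdit's real algorithm, only the general idea.
Module Utf8Sniff
    Function LooksLikeUtf8(data As Byte()) As Boolean
        Dim i = 0
        While i < data.Length
            Dim continuations = 0
            If data(i) >= &HC2 AndAlso data(i) <= &HDF Then
                continuations = 1                 ' 2-byte sequence
            ElseIf data(i) >= &HE0 AndAlso data(i) <= &HEF Then
                continuations = 2                 ' 3-byte sequence (e.g. E2 80 94)
            ElseIf data(i) >= &HF0 AndAlso data(i) <= &HF4 Then
                continuations = 3                 ' 4-byte sequence
            Else
                i += 1
                Continue While                    ' ASCII or invalid lead byte
            End If

            If i + continuations < data.Length Then
                Dim allContinuationBytes = True
                For j = 1 To continuations
                    If (data(i + j) And &HC0) <> &H80 Then allContinuationBytes = False
                Next
                If allContinuationBytes Then Return True
            End If
            i += 1
        End While
        Return False                              ' no plausible UTF-8 character found
    End Function
End Module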
UltraEdit indicates in the status bar at the bottom of the main window which encoding is detected and active (on save) for the active file. The status bar shows either UTF-8 or U8-, depending on which status bar is used (advanced or basic) and which version of UltraEdit is used, since older versions have only the basic status bar.
Only files encoded in UTF-8 with no BOM, no UTF-8 character set or encoding declaration, and no UTF-8 encoded character within the first 64 KB are wrongly opened as ANSI files. In such cases the user can use the enhanced File - Open command of UltraEdit and explicitly select UTF-8 encoding before opening the file with the Open button.
For completeness, there is also a configuration setting which can be manually added to uedit32.ini and which results in all files not detected as UTF-16 being opened as UTF-8 encoded files. This setting is useful for those who want to work only with UTF-8 encoded files, even though files very often contain no characters with a code value greater than 127.
For more information about working with UTF-8 encoded files take a look in the UltraEdit forums. There are a few topics with lots of information about editing UTF-8 encoded files in UltraEdit.
I have hand-crafted one PDF (https://media.24usoftware.com/d/PDFwithOnlyLink.pdf) and modified another one, generated from FileMaker, in order to add web link annotations (https://media.24usoftware.com/d/PDFwithLinks.pdf). They both pass validation at https://www.pdf-online.com/osa/validate.aspx, but for some reason when I open them in Adobe Reader they get immediately modified, so when I try to close them Adobe Reader asks me if I want to save changes. But I have no clue what the changes are or why they were made. Any ideas? Adobe's support claims there are syntax errors, but without providing any details about what's syntactically wrong with them.
For your hand-crafted PDF file, Ghostscript says your xref table is wrong: some of the entries are not exactly 20 bytes. That is a requirement for an entry in the xref table; if you use \r or \n instead of \r\n to end an xref entry (as you have done), you must pad the entry out with white space.
I get the same warning on your modified file too.
See page 94 of the PDF 1.7 Reference Manual, where it says:
Each entry is exactly 20 bytes long, including the end-of-line marker.
and then later:
If the file’s end-of-line marker is a single character (either a carriage return or a line feed), it is preceded by a single space; if the marker is 2 characters (both a carriage return and a line feed), it is not preceded by a space. Thus, the overall length of the entry is always exactly 20 bytes.
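To make the padding rule concrete, here is a small sketch that builds a conformant in-use entry (the offset and generation values are made up for illustration):

' Build a cross-reference table entry that is exactly 20 bytes, per the rule quoted above:
' 10-digit offset, space, 5-digit generation number, space, entry type ('n' or 'f'),
' then a two-character end-of-line marker (CR LF).
Module XrefEntrySketch
    Function XrefEntry(offset As Long, generation As Integer, inUse As Boolean) As String
        Dim kind As String = If(inUse, "n", "f")
        Return offset.ToString("D10") & " " & generation.ToString("D5") & " " & kind & vbCrLf
    End Function

    Sub Main()
        Dim entry = XrefEntry(1234, 0, True)
        Console.WriteLine(entry.Length)   ' 20, exactly as the specification requires
        ' If a single-character EOL were used instead, the entry would need an extra
        ' space before it to stay at 20 bytes.
    End Sub
End Module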
You might want to think about using a different validation tool; this is a basic and very common error, and if that validator can't find it, that doesn't speak well for its quality.
Though I'm disappointed to see that Acrobat X Pro's own validation analyser can't find it either.....
I have a PDF which was produced by Ghostscript 8.15. I need to process this PDF with my software, which extracts font names from the PDF file and then performs some operations. But when I extract font names from this PDF file, the names are not the same as they should be. For example: the original font name is 'NOORIN05', but the PDF file contains 'TTE25A5F90t00'. How can I decode these font names back to the original names? All fonts are TTF.
NOTE:
Why I need to extract the fonts.
Actually, there is a piece of software named InPage which was very popular in India and Pakistan for writing documents in the Urdu language, because before word processors had Unicode support it was the only practical way to type Urdu on a computer. Due to the complexity of the Urdu language, this software uses 89 font files named NOORIN01 to NOORIN89. The reason for using so many font files is to cover all the Urdu ligatures, of which there are more than 19 thousand; since each font file can contain only 255 ligatures, that was the technique they used before Unicode. Copying and pasting text from a PDF generated by this software into MS Word therefore produces garbage, for the reason given above: the 89 font files. So there was no way to extract text from such old PDF files. (Nowadays this software supports Unicode, but I am talking about old files.)

So I developed a program in C# to extract text from such old PDF files. The algorithm I am using: I created a database file which contains the names of all 89 font files along with all the character codes, and in the next column I typed the corresponding Urdu ligature in Unicode. I process the PDF file character by character together with its font, match the font name and character code against my database file, get the Unicode ligature from the database, and then display it in a text box. In this way I get the Unicode text successfully.

My software was working fine on many PDF files. But a few days ago I got a complaint from a person that my software fails to extract text from their PDF. When I tested it, I found that the PDF file doesn't contain the original font names, so my software is unable to do any further processing. When I checked the properties of this PDF file, it showed the PDF producer as GPL Ghostscript 8.15. I searched the net and studied the documentation related to fonts, but still couldn't find any clue for decoding and getting back the original font names.
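To show the idea of the lookup (my actual tool is in C#; this is just a simplified sketch with made-up names and a single sample entry):

' Simplified sketch of the lookup: map "font name + character code" to a Unicode
' Urdu ligature. The real database covers NOORIN01..NOORIN89 and thousands of entries;
' the names and values below are placeholders.
Imports System.Collections.Generic

Module LigatureLookupSketch
    Private ReadOnly LigatureMap As New Dictionary(Of String, String)

    Sub Main()
        LigatureMap("NOORIN05:65") = "لا"                 ' placeholder entry

        ' Pretend these pairs came from walking the PDF text run by run.
        Dim extracted = {Tuple.Create("NOORIN05", 65), Tuple.Create("TTE25A5F90t00", 65)}

        For Each glyph In extracted
            Dim key = glyph.Item1 & ":" & glyph.Item2.ToString()
            Dim ligature As String = Nothing
            If LigatureMap.TryGetValue(key, ligature) Then
                Console.Write(ligature)                   ' known font: emit the ligature
            Else
                Console.Write("?")                        ' unknown font name: lookup fails
            End If
        Next
    End Sub
End Module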
The first thing you should do is try a more recent version of Ghostscript. 8.15 is 14 years old... The current version is 9.21.
If that does not preserve the original names (potentially including the usual subset prefix), then we'll need to see an example input file which exhibits the problem.
It might also be helpful if you were to explain why you need to extract the font names, possibly you are attempting something which simply isn't possible.
[EDIT]
OK, so now I understand the problem. I'm afraid the answer to your question is 'you can't get the original font name'.
The PDF file was created from the output of the (Adobe-created) Windows PostScript printer driver. When that embeds TrueType fonts into the PostScript stream as type 42 fonts, it gives them a pseudo-random name which is composed of 'TT' followed by some additional characters that may look like hex, but aren't.
Old versions of the Ghostscript pdfwrite device (and 8.15 is very old) simply used that name verbatim, and that's what has been used for the font names in the PDF file you supplied.
Newer versions are capable of digging further into the font and picking up the original font name which is present in the PostScript. Unfortunately the old versions didn't preserve that. Once you've thrown the information away there is no way to get it back again.
So if the only thing you have is this PDF file, then it's simply not possible to get the font names back. If the person who supplied you with the PDF file can remake it using a more recent version of Ghostscript, then it will work. But I presume they don't have the PostScript program used to create a 14-year-old file.
I have a problem related to PDF export in the Pentaho BI platform. I'm not able to produce a correct PDF file encoded in UTF-8 that contains Spanish characters. The procedure works properly neither in the local Report Designer nor on the BI server. Special characters like 'ñ' or 'ç' are skipped in the PDF file. Generation in other formats works just fine (HTML, Excel, etc.).
I've been struggling with this issue for a few days and have been unable to find any solution; I would be grateful for any clue.
Thanks in advance
P.S. Report Designer and BI platform version 6.1.0.1
Seems like a font issue. Your font needs to know how to work with unicode and it needs to specify how to "draw" the characters you want.
Office programs (at least MS Office) by default automatically select a font which can render any given character (if font substitution is enabled); PDF readers, however, don't do this: they always use the exact font you've specified.
When selecting an appropriate font, you have to pay attention to the Unicode characters it supports and to the font's license: some fonts don't allow embedding, and Pentaho embeds the used subset of the font into the generated PDF file if the encoding is UTF-8 or Identity-H.
To install fonts on a Linux server, you need to copy the font files either to your java/lib/fonts/ folder or to /usr/share/fonts/, grant read rights to the server's user, and restart the server application.
I am trying to extract text from a PDF. The PDF contains text in Hindi (Unicode). The utility I am using for extraction is Apache PDFBox (http://pdfbox.apache.org/). The extractor extracts the text, but the text is not recognizable. I tried switching between many encodings and fonts, but the expected text is still not recognized.
Here is an example:
Say text in PDF is : पवार
What it looks after extraction is: ̄Ö3⁄4ÖÖ ̧ü
Are there any suggestions?
PDF is – at its heart – a print format, and thus records text as a series of visual glyphs, not as actual text. Originally it was never intended as a digital archive format, and that still shows in many documents. With complex scripts, such as Arabic or Indic scripts that require glyph substitution, ligation and reordering, you basically often get a mess. What you usually get are the glyph IDs used in the embedded fonts, which do not have any resemblance to Unicode or an actual text encoding (fonts represent glyphs, some of which may be mapped to Unicode code points, but some are just needed for font-internal use, such as glyph variants based on context or ligatures). You can see the same with PDFs produced by LaTeX, especially with non-ASCII characters and math.
PDF also has facilities to embed the text as text alongside the visual representation, but that's solely at the discretion of the generating application. I have heard Word tries very hard to retain that information when producing PDFs, but many PDF generators do not (it usually works somewhat for Latin scripts, which is probably why nearly no one bothers).
I think the best bet for you, if the PDF doesn't have the plain text available, is OCR on the PDF as an image.
So I have some Spanish content saved in Excel that I am exporting to .csv format so I can import it, via the Firefox SQL manager add-on, into a SQL db. The problem is that when I import it, whenever there is an accent mark (or whatever the technical name for those things is), Firefox doesn't recognize it, and accordingly produces a big black diamond with a white ?. Is there a better way to do this? Is there something I can do to have my Spanish content readable in a SQL db? Maybe a program preferable to the Firefox extension? Please let me know if you have any thoughts or ideas. Thanks!
You need to follow the chain and make sure nothing gets lost "in translation".
Specifically:
assert which encoding is used in the CSV file; ensure that the special characters are effectively in there, and see how they are encoded (UTF-8, a particular code page, ...) — see the sketch after this list
ensure that the SQL server can
a) read these characters and
b) store them in an encoding which will preserve their integrity. (BTW, the encoding used in the CSV can of course be remapped to some other encoding of your choosing, i.e. one that you know will be suitable for consumption by your target application)
ensure that the database actually stored these characters correctly.
see if Firefox (or whichever "consumer" of this text) properly handles characters in this particular encoding.
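As a concrete example of the first step and of the remapping mentioned above, a small sketch like this reads the CSV in its assumed source encoding and rewrites it as UTF-8 with a BOM so downstream tools can detect the encoding. The file names are placeholders, and the assumption that Excel exported in Windows-1252 is just that, an assumption to verify against the raw bytes.

' Re-encode the CSV: read it in the assumed source code page (Windows-1252 is a common
' Excel default on Western-locale systems, but verify against the raw bytes first),
' then write it back out as UTF-8 with a BOM. File names are placeholders.
Imports System.IO
Imports System.Text

Module ReencodeCsv
    Sub Main()
        Dim source = Encoding.GetEncoding(1252)            ' assumed source encoding
        Dim text = File.ReadAllText("contenido.csv", source)

        File.WriteAllText("contenido-utf8.csv", text, New UTF8Encoding(True))
    End Sub
End Module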
It is commonplace, but useful for this type of inquiry, to recommend the following reading assignment:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky