Pentaho doesn't generate PDF in UTF-8 encoding

I have a problem related to PDF export in the Pentaho BI platform. I'm not able to produce a correct PDF file encoded in UTF-8 that contains Spanish characters. It doesn't work properly either in the local Report Designer or on the BI server. Special characters like 'ñ' or 'ç' are skipped in the PDF file. Generation in other formats works just fine (HTML, Excel, etc.).
I've been struggling with this issue for a few days, unable to find any solution, and would be grateful for any clue.
Thanks in advance
P.S. Report Designer and BI platform version 6.1.0.1

Seems like a font issue. Your font needs to know how to work with Unicode, and it needs to specify how to "draw" the characters you want.
Office programs (at least MS Office) by default automatically select a font which can render any character (if font substitution is enabled); PDF readers, however, don't do that: they always use the exact font you've specified.
When selecting an appropriate font, you have to pay attention to the Unicode characters it supports and to the font's license: some fonts don't allow embedding, and Pentaho embeds the used subset of the font into the generated PDF files if the encoding is UTF-8 or Identity-H.
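If you are not sure whether a candidate font actually covers the characters you need, you can check its cmap table before wiring it into the report; here is a small sketch using the Python fontTools library (the font path is only a placeholder):
from fontTools.ttLib import TTFont
# Placeholder path; point this at the TTF you intend to use in the report.
font = TTFont("/usr/share/fonts/truetype/dejavu/DejaVuSans.ttf")
cmap = font["cmap"].getBestCmap()   # maps Unicode code points to glyph names
for ch in "ñç":
    print(ch, "covered" if ord(ch) in cmap else "missing")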
To install fonts on a Linux server, copy the font files either to your java/lib/fonts/ folder or to /usr/share/fonts/, grant read permission to the user the server runs as, and restart the server application.

Related

problem with unlocking user-password secured PDF

I have a PDF file protected by a password, and I know the correct user password. The problem is that I am only able to open it in Adobe Reader on Windows. Every other PDF viewer (and also the Linux command-line tools for removing passwords) reports that the password is wrong.
Potential cause: the password is long (30 characters) and contains non-ASCII (Polish) characters (like ł ó ę ć ź ą). I tried things like a Unicode-to-ASCII converter, but it does not work.
Does anybody have an idea why it works only in Acrobat? I just want to open this document on Linux. The best outcome would be to remove the password.
EDIT: the document is secured with 128-bit AES; Acrobat mentions that it "can be opened by Acrobat 7.0 or newer". Printing, copying, etc. are not allowed.
EDIT2: thanks for the help in the comments. I tried SumatraPDF and it works, but it only allows printing to a non-searchable, image-only PDF.
I checked that it is based on the MuPDF engine, but MuPDF on Linux cannot deal with this file; it crashes.
Sumatra is open source; does anybody know how to modify it to print to PDF in the normal way?
SumatraPDF uses MuPDF as its engine for several formats such as ePub, HTML and, of course, PDF. It can store (not remove) a known password as a hash, so there is no need to keep entering it for everyday reading or for adding comments to a PDF.
So if, as suggested by @mkl, using the password with local characters on a local PC may work in SumatraPDF, it should also work in MuPDF-GL, which is a more basic viewer. Spoiler: I can certainly remove the protection from my own simple 9-character encrypted challenge.pdf (the 8 sequentially alphabetic characters are a known semi-random sequence) and save it in MuPDF as unprotected.pdf, but nobody has cracked it yet :-)
However, MuPDF-GL has many more powerful abilities hidden under the surface.
Using MuPDF-GL you should be able to open the file when it prompts for the password. Then press A, which starts the annotator (you don't need to add anything), and simply change the "save as" settings.
So in this case, if there were errors, it will have fixed whatever was needed to re-save, but first switch OFF "incremental" and set encryption to "none". There is no guarantee this will work in all cases, but it is worth a try.
If MuPDF-GL does not work for you on Linux, you can try MuTool:
mutool draw -p password -o unprotected.pdf protected.pdf
or qpdf, which can also rebuild a PDF with different restrictions given the correct input password(s):
qpdf --password='myverylongstring!"^$%' --decrypt protected.pdf unprotected.pdf
or, if the password may cause command-line UTF problems, save it as the first line of a text file and use
qpdf --password-file=password.txt --decrypt protected.pdf unprotected.pdf
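If the Polish characters are what trips the other tools up, one possible reason is that each tool hands a different byte sequence for the same typed password to the decryption code. Purely as an illustration (a Python sketch with a made-up password), the same text gives different bytes under different encodings, and you can write whichever variant you want to test into password.txt for the --password-file route:
# Made-up password containing Polish characters, for illustration only.
password = "zażółć1234567890"
print(password.encode("utf-8"))    # one byte sequence
print(password.encode("cp1250"))   # a different byte sequence for the same text
# Write one candidate as the first line of password.txt for qpdf;
# which encoding matches the original protection may take some trial and error.
with open("password.txt", "wb") as f:
    f.write(password.encode("cp1250") + b"\n")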
Lastly, if you wish to print a PDF file on Linux, you have two potential options as readers: old Evince works for me on 32-bit Windows, but for 64-bit I prefer the nightly, cutting-edge Okular.

How can I get the original font names from a PDF generated by Ghostscript?

I have a PDF which was produced by Ghostscript 8.15. I need to process this PDF with my software, which extracts font names from the PDF file and then performs some operations. But when I extract font names from this PDF file, the names are not what they should be. For example, the original font name is 'NOORIN05', but the PDF file contains 'TTE25A5F90t00'. How can I decode these font names back to the original names? All fonts are TTF.
NOTE:
Why I need to extract the fonts:
There is a piece of software named InPage which was very popular in India and Pakistan for writing documents in the Urdu language, because before word processors had Unicode support it was the only way to type Urdu on a computer. Due to the complexity of Urdu, this software uses 89 font files named NOORIN01 to NOORIN89. The reason for using so many font files is to cover all the Urdu ligatures, of which there are more than 19 thousand; each font file can contain only 255 ligatures, so this technique was used before Unicode. Copying and pasting text from a PDF file generated by this software produces garbage in MS Word, for the reason I mentioned above: the 89 font files. So there was no way to extract text from such old PDF files. (Nowadays this software supports Unicode, but I am talking about old files.)
So I developed a program in C# to extract text from such old PDF files. The algorithm I am using builds a database file which contains the names of all 89 font files with all the ASCII codes, and in the next column I typed the corresponding Urdu ligature in Unicode. I process the PDF file character by character together with its font, match the font name against my database file, get the Unicode ligature from the database and then display it in a text box. In this way I get the Unicode text successfully.
My software was working fine on many PDF files, but a few days ago I got a complaint from a person that it fails to extract text from a particular PDF. When I tested it, I found that the PDF file doesn't contain the original font names, which is why my software is unable to do any further processing. When I checked the properties of this PDF file, it shows the PDF producer as GPL Ghostscript 8.15. So I searched the net and studied the documentation related to fonts, but still couldn't find any clue to decode the names and get the original font names back.
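Roughly, the lookup works like this (just a sketch; the font names, codes and ligatures below are made up for illustration):
# Hypothetical entries: (font file name, character code) -> Urdu ligature
ligature_table = {
    ("NOORIN05", 0x41): "\u0644\u0627",
    ("NOORIN07", 0x42): "\u06A9\u06CC",
}
def decode_char(font_name, char_code):
    # Look the (font, code) pair up in the database; "?" if unknown.
    return ligature_table.get((font_name, char_code), "?")
# This only works while the PDF still reports the original NOORINxx font names;
# with the renamed 'TTE...' fonts the lookup key can no longer be built.
print(decode_char("NOORIN05", 0x41))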
The first thing you should do is try a more recent version of Ghostscript. 8.15 is 14 years old... The current version is 9.21.
If that does not preserve the original names (potentially including the usual subset prefix), then we'll need to see an example input file which exhibits the problem.
It might also be helpful if you were to explain why you need to extract the font names; possibly you are attempting something which simply isn't possible.
[EDIT]
OK, so now I understand the problem. I'm afraid the answer to your question is 'you can't get the original font name'.
The PDF file was created from the output of the (Adobe-created) Windows PostScript printer driver. When that embeds TrueType fonts into the PostScript stream as type 42 fonts, it gives them a pseudo-random name which is composed of 'TT' followed by some additional characters that may look like hex, but aren't.
Old versions of the Ghostscript pdfwrite device (and 8.15 is very old) simply used that name verbatim, and that's what has been used for the font names in the PDF file you supplied.
Newer versions are capable of digging further into the font and picking up the original font name which is present in the PostScript. Unfortunately the old versions didn't preserve that. Once you've thrown the information away there is no way to get it back again.
So if the only thing you have is this PDF file, then it's simply not possible to get the font names back. If the person who supplied you with the PDF file can remake it using a more recent version of Ghostscript, then it will work. But I presume they don't have the PostScript program used to create a 14-year-old file.

What is a pdf bcmap file?

I am using the pdf.js viewer in my web application, and it comes with all these bcmap files. I traced the network traffic and they are never requested.
I don't really want to add these files to version control or the issue tracking system, because there are so many of them, if they will not be needed.
What is a bcmap file?
The word "bcmap" stands for "binary cmap".
CMaps (Character Maps) are text files that are used in PostScript and other Adobe products to map character codes to character glyphs in CID fonts.
See this document by Adobe for what CID fonts are good for. They are mostly used when dealing with East Asian writing systems. (This is a legacy technology, so it should not be used in PDFs created by modern tools.)
pdfjs needs the CMap file when it wants to display such CID fonts. For that, you need to provide the CMaps.
You specify the url to the folder where the CMaps are stored via settings on the PDFJS global object.
PDFJS.cMapUrl = '../web/cmaps/';
By default, pdfjs will try to load a file with the name of the required CMap and no extension, for example "../web/cmaps/Hankaku".
If you enable the setting cMapPacked like this:
PDFJS.cMapPacked = true;
pdfjs will instead try to read a compressed version of the CMap file with the extension ".bcmap", for example "../web/cmaps/Hankaku.bcmap".
The compression itself is done with the tool at https://github.com/mozilla/pdf.js/tree/master/external/cmapscompress.
Conclusion: Include the files and set the PDFJS options correctly if there is a possibility that you will need to display PDFs with East Asian text created by legacy PDF creation tools. Don't include the files if you are certain you won't need to display such files.

bad character encoding after xsl 1.0 transform

I'm just beginning to learn about encoding issues, and I've learned just enough to know that it's far more complicated (on Windows, at least) than I had imagined, and that I have a lot more to learn.
I have an XML document that I believe is encoded with UTF-8. I'm using a VB.net app to transform the XML (with XslCompiledTransform and XmlTextWriter) into a column-specific text file. Some characters in the XML are coming out bad in the output text file. Example: an em-dash (—) is being turned into the three characters "â€”". When that happens, the columns in the file are thrown off.
As I understand it, an em-dash is not even a "Unicode character". I wouldn't expect to have issues with it. But, I can make the problem come and go by changing the encoding specified in the VB.net app.
If I use this, the em-dash is preserved:
Using writer = New XmlTextWriter(strOutputPath, New UTF8Encoding(True))
If I use this, the em-dash is corrupted into "â€”":
Using writer = New XmlTextWriter(strOutputPath, New UTF8Encoding(False))
The True/False simply tells VB whether to write the BOM at the beginning of the file. As I understand it, the BOM is neither necessary nor recommended for UTF-8. So my preference is False, but then I get the weird characters.
I have several questions:
How can I be certain that the xml file is UTF-8? Is there a Windows tool that can tell me that?
How can I be certain that the transformed file is actually bad? Could it be that the real problem is the editor I'm using to look at it? Both EmEditor and UltraEdit show the same thing.
I've tried using the XVI32 hex editor to look at the file. I want to know what is actually written to disk, rather than what some GUI program is displaying to me. But, even on a file that looks good in EmEditor, XVI32 shows me the bad characters. Could it be that XVI32 just doesn't understand non-ASCII characters? What Windows hex editor would you recommend for this purpose?
The XML file is 650 MB, and the final text file is 380 MB - so that limits the list of useful tools somewhat.
You say 'As I understand it, an em-dash is not even a "Unicode character".' What do you mean by that? The Unicode character set definitely contains a code for em dash: 2014 hex. In the UTF-8 encoding, it will be 3 bytes: E2, 80, 94.
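You can see both the encoding and the apparent corruption with a couple of lines of Python (just an illustration):
data = "\u2014".encode("utf-8")   # U+2014 EM DASH as UTF-8
print(data)                       # b'\xe2\x80\x94'
print(data.decode("cp1252"))      # â€” : the same three bytes misread as Windows-1252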
I suspect Martin Honnen is right that your editor is simply not showing the file properly. A couple of suggestions:
I'm not familiar with the editors you mention, but editors that handle different encodings will often silently choose an encoding by which to interpret the file (based on the BOM if there is one, and sometimes based on the character codes they see). They also typically have some way of showing what encoding they are interpreting the file as, and a way of telling them to load (or reload) the file as a particular encoding. If your editor does not have these features, I suggest you get one that does, such as EditPlus or Notepad++.
As far as the hex editor, again I'm not familiar with the one you mention, but the whole point of a hex editor is to see the raw bytes. Such editors often offer a text view also (often side by side with the hex view) and if they do, I would not rely on their handling of encoding. Just use them to view the hex bytes and see if the bytes for your em dash are the same in both files.
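If you want to take the editors out of the picture entirely, you can check the raw bytes with a small script; a sketch in Python (the file name is a placeholder for your transformed output):
path = "output.txt"
with open(path, "rb") as f:       # raw bytes, no editor or decoding involved
    data = f.read()
print("UTF-8 BOM present:", data.startswith(b"\xef\xbb\xbf"))
print("em-dash sequences (E2 80 94):", data.count(b"\xe2\x80\x94"))
# If the no-BOM output still contains E2 80 94, the bytes on disk are correct
# UTF-8 and the "corruption" is just the editor guessing the wrong encoding.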
Another way viewing the file can go wrong: even if your editor is interpreting the file as UTF-8, not all fonts will have all Unicode characters in them, and for those characters not in the font it may display a little square or nothing at all. Try a few different fonts, or find one that purports to support Unicode (though no font supports ALL of Unicode, and there are several revisions of the Unicode spec which add more characters). Lucida Sans Unicode, I think, is one that will be on most Windows systems.
Another trick: I highly recommend the utility BabelMap. You can look up any Unicode character there and see what its Unicode value is, and you can copy the character from there, paste it into the file in your text editor, and see how it is displayed.
UltraEdit offers several configuration settings for working with UTF-8 encoded files. There is a setting Auto detect UTF-8 files in the File Handling - Unicode/UTF-8 Detection configuration dialog, which is enabled by default.
With this setting enabled, UltraEdit searches for the UTF-8 BOM. If it is not present, it searches the first few KB for a UTF-8 declaration, as is usually present in the head of HTML/XHTML files or in the first line of an XML file. If there is no BOM and no standardized encoding information at the top of the file, UltraEdit searches within the first 64 KB for byte sequences which look like UTF-8 encoded characters. If such a byte sequence can be found, the file is interpreted by UltraEdit as a UTF-8 encoded file. For example, a file containing only the 3 bytes E2 80 94 is interpreted as a UTF-8 encoded file.
UltraEdit indicates in the status bar at the bottom of the main window which encoding is detected and active (on save) for the active file. In the status bar either UTF-8 or U8- is displayed, depending on which status bar is used (advanced or basic) and which version of UltraEdit is used, as the older versions have only the basic status bar.
Only files encoded in UTF-8 with no BOM, no UTF-8 character set or encoding declaration, and no UTF-8 encoded character within the first 64 KB are wrongly opened as ANSI files. In such cases the user can use the enhanced File - Open command of UltraEdit and explicitly select UTF-8 encoding before opening the file with the Open button.
For completeness, there is also a configuration setting which can be added manually to uedit32.ini that results in all files not detected as UTF-16 being opened as UTF-8 encoded files. This setting is useful for those who want to work only with UTF-8 encoded files, even when a file very often contains no characters with a code value greater than 127.
For more information about working with UTF-8 encoded files take a look in the UltraEdit forums. There are a few topics with lots of information about editing UTF-8 encoded files in UltraEdit.

Big PDF file when language is PL (Polish)

I converted a Smart Form output into PDF using the function module SX_OBJECT_CONVERT_OTF_PDF.
My problem is that when the language is PL (Polish) the PDF file is 10 times bigger compared to the EN language. Why?
Gunstick's answer is probably right.
SAP Note 843480 discusses this issue.
From release 620 onward, there are support patches that enable PDF elements (such as fonts) to be compressed. The resulting PDF will still be larger than the English-only one, but it will probably be less than 10 times larger.
This may be because Polish uses a specific font (for the special characters) which is not installed by default on an OS, so the PDF converter embeds the complete font in the document in order to render it correctly at the destination.
This is just speculation though.
You may try this one: http://lucattelli.com/blog/?page_id=478
This FM can take the binary PDF, convert it to Base64, and send it as a mail attachment.
See if it helps