What is a pdf bcmap file? - pdf

I am using a pdfjs viewer in my web application and it comes with all these bcmap files. I traced the network traffic and they are not being called for.
I don't really want to add these files to version control or the issue tracking system b/c there are so many of them, if they will not be needed.
What is a bcmap file?

The word "bcmap" stands for "binary cmap".
CMaps (Character Maps) are text files that are used in PostScript and other Adobe products to map character codes to character glyphs in CID fonts.
See this document by Adobe to see what CID fonts are good for. They are mostly used when dealing with East Asian writing systems. (This technology is a legacy technology, so it should not be used in pdfs created by modern tools)
pdfjs needs the CMap file when it wants to display such CID fonts. For that, you need to provide the CMaps.
You specify the url to the folder where the CMaps are stored via settings on the PDFJS global object.
PDFJS.cMapUrl = '../web/cmaps/';
By default, pdfjs will try to load a file with the name of the required CMap and no extension, for example "../web/cmaps/Hankaku".
If you enable the setting cMapPacked like this:
PDFJS.cMapPacked = true;
pdfjs will instead try to read a compressed version of the CMap file with the extension ".bcmap", for example "../web/cmaps/Hankaku.bcmap".
The compression itself is done with the tool at https://github.com/mozilla/pdf.js/tree/master/external/cmapscompress.
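The file-naming rule described above can be sketched as a small helper. This only illustrates how the two settings combine; it is not pdf.js's actual loader code, and the function name cMapFileUrl is made up for this example.

```javascript
// Hypothetical helper: mirrors the naming rule described in the answer.
// It is NOT part of the pdf.js API.
function cMapFileUrl(cMapUrl, cMapName, cMapPacked) {
  // Packed (binary) CMaps carry the ".bcmap" extension; plain ones have none.
  return cMapUrl + cMapName + (cMapPacked ? '.bcmap' : '');
}

console.log(cMapFileUrl('../web/cmaps/', 'Hankaku', false)); // ../web/cmaps/Hankaku
console.log(cMapFileUrl('../web/cmaps/', 'Hankaku', true));  // ../web/cmaps/Hankaku.bcmap
```

In more recent pdf.js releases the same two settings are, as far as I know, passed as cMapUrl and cMapPacked options to getDocument instead of being set on a global PDFJS object.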
Conclusion: Include the files and set the PDFJS options correctly if there is a possibility that you need to display pdfs with East Asian texts that were created by legacy pdf creation tools. Don't include the files if you are certain you won't need to display such files.

Related

How can I get original font names from a PDF generated by Ghostscript?

I have a PDF produced by Ghostscript 8.15. I need to process it with my software, which extracts the font names from the PDF file and then performs some operations. But the font names I extract from this file are not what they should be. For example, the original font name is 'NOORIN05', but the PDF file contains 'TTE25A5F90t00'. How can I decode these font names back to the original names? All fonts are TTF.
NOTE:
Why I need to extract the fonts.
There is a program named InPage that was the most popular way to write documents in Urdu in India and Pakistan, because before word processors supported Unicode it was the only way to type Urdu on a computer. Because of the complexity of Urdu script, this software uses 89 font files named NOORIN01 to NOORIN89. So many font files are needed to hold all the Urdu ligatures, of which there are more than 19 thousand; each file can contain only 255 ligatures, so this was the technique used before Unicode. Copying and pasting text from a PDF file generated by this software gives garbage in MS Word, because of the 89 font files described above, so there was no way to extract text from these old PDF files. (Nowadays the software supports Unicode, but I am talking about old files.)
So I developed a program in C# to extract text from such old PDF files. My algorithm uses a database file that contains the names of all 89 font files with all their ASCII codes, and in the next column the corresponding Urdu ligature in Unicode. I process the PDF file character by character together with its font, match the font name against my database file, get the Unicode ligature from the database and display it in a text box. In this way I get the Unicode text successfully.
My software worked fine on many PDF files, but a few days ago someone complained that it fails to extract text from this PDF. When I tested it, I found that the PDF file doesn't contain the original font names, which is why my software cannot process it further. The properties of this PDF file show the PDF producer as GPL Ghostscript 8.15. I searched the net and studied the documentation related to fonts, but still couldn't find any clue how to decode the names and get the original font names back.
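The lookup described above might be sketched like this. The table entries, the ligatureTable name and the decodeRun helper are all made up for illustration (the real tool is written in C# and reads its mapping from a database file). The point is only that the font name is part of the key, so a generated name such as 'TTE25A5F90t00' finds nothing.

```javascript
// Hypothetical mapping: font name -> (character code -> Unicode ligature).
// The entries below are placeholders, not real InPage data.
const ligatureTable = new Map([
  ['NOORIN05', new Map([[65, '\u0645\u06CC\u06BA'], [66, '\u06A9\u06CC']])],
]);

// Decode a run of character codes that were drawn with one font.
// Returns null when the font name is unknown, which is what happens
// when the PDF only carries a generated name instead of the original.
function decodeRun(fontName, charCodes) {
  const table = ligatureTable.get(fontName);
  if (!table) return null;
  // Unknown codes are kept visible as U+FFFD rather than silently dropped.
  return charCodes
    .map((code) => (table.has(code) ? table.get(code) : '\uFFFD'))
    .join('');
}

console.log(decodeRun('NOORIN05', [65, 66]));
console.log(decodeRun('TTE25A5F90t00', [65, 66])); // null
```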
The first thing you should do is try a more recent version of Ghostscript. 8.15 is 14 years old. The current version is 9.21.
If that does not preserve the original names (potentially including the usual subset prefix) then we'll need to see an example input file which exhibits the problem.
It might also be helpful if you were to explain why you need to extract the font names, possibly you are attempting something which simply isn't possible.
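If a newer Ghostscript does keep the names, they may arrive with a subset prefix. In PDF, a subset font name is a tag of six uppercase letters followed by a plus sign, so stripping it could look roughly like this (the function name and the sample tag are hypothetical):

```javascript
// Remove a PDF subset tag such as "ABCDEF+" from the front of a font name.
// Names without such a tag are returned unchanged.
function stripSubsetPrefix(fontName) {
  return fontName.replace(/^[A-Z]{6}\+/, '');
}

console.log(stripSubsetPrefix('ABCDEF+NOORIN05')); // NOORIN05
console.log(stripSubsetPrefix('NOORIN05'));        // NOORIN05
```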
[EDIT]
OK so now I understand the problem, I'm afraid the answer to your question is 'you can't get the original font name'.
The PDF file was created from the output of the (Adobe-created) Windows PostScript printer driver. When that embeds TrueType fonts into the PostScript stream as type 42 fonts, it gives them a pseudo-random name which is composed of 'TT' followed by some additional characters that may look like hex, but aren't.
Old versions of the Ghostscript pdfwrite device (and 8.15 is very old) simply used that name verbatim, and that's what has been used for the font names in the PDF file you supplied.
Newer versions are capable of digging further into the font and picking up the original font name which is present in the PostScript. Unfortunately the old versions didn't preserve that. Once you've thrown the information away there is no way to get it back again.
So if the only thing you have is this PDF file then it's simply not possible to get the font names back. If the person who supplied you with the PDF file can remake it, using a more recent version of Ghostscript, then it will work. But I presume they don't have the PostScript program used to create a 14-year-old file.

Pentaho doesn't generate PDF in UTF-8 encoding

I have a problem with PDF export in the Pentaho BI platform. I'm not able to produce a correct PDF file that is encoded in UTF-8 and contains Spanish characters. The export works properly neither in the local Report Designer nor on the BI server. Special characters like 'ñ' or 'ç' are skipped in the PDF file. Generation in other formats (HTML, Excel, etc.) works just fine.
I've been struggling with this issue for a few days without finding a solution and would be grateful for any clue.
Thanks in advance
P.S. Report Designer and BI platform version 6.1.0.1
Seems like a font issue. Your font needs to support Unicode, and it needs to contain glyphs for the characters you want to draw.
Office programs (at least MS Office) by default automatically select a font that can render any character (if font substitution is enabled); PDF readers, however, don't: they always use the exact font you've specified.
When selecting an appropriate font, you have to pay attention to the Unicode characters it supports and to its license: some fonts don't allow embedding, and Pentaho embeds the subset of the font that was used into the generated PDF file if the encoding is UTF-8 or Identity-H.
To install fonts on a Linux server, copy the font files either to your java/lib/fonts/ folder or to /usr/share/fonts/, grant read rights to the server's user and restart the server application.
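To see which characters the chosen font has to cover, one option is to list the non-ASCII characters that occur in the report text. A minimal sketch, with a made-up function name and sample string; it only collects the characters and does not inspect any font file:

```javascript
// List the distinct non-ASCII characters in a text, with their code points.
function nonAsciiCharacters(text) {
  const found = new Set();
  // Normalize first so that "n" + combining tilde is counted as one "ñ".
  for (const ch of text.normalize('NFC')) { // for...of iterates by code point
    if (ch.codePointAt(0) > 0x7f) found.add(ch);
  }
  return [...found].map((ch) => ({
    char: ch,
    codePoint: 'U+' + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, '0'),
  }));
}

console.log(nonAsciiCharacters('Espa\u00F1a, fa\u00E7ade, a\u00F1o'));
```

Each reported code point can then be checked against the character coverage listed for the font you intend to embed.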

Link to pages of PDFs created with GhostScript

We have created PDFs by converting individual PostScript pages into a single PDF (and embedding appropriate fonts) using GhostScript.
We've found that an individual page of the PDF cannot be linked to; for example, through the usage of
http://xxxx/yyy.pdf#page=24
There must be something within the PDF that makes this not possible. Are there any specific GhostScript options that should be passed when creating the PDF that would allow this type of page-destination link to work?
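For reference, the kind of link in question can be built like this. It is a sketch only: pdfPageLink is a made-up helper and the example.com URL is a placeholder. Whether the page fragment is honoured is up to the viewer that opens the file.

```javascript
// Build a link that asks the PDF viewer to open at a given page,
// using the "#page=N" fragment shown in the question.
function pdfPageLink(pdfUrl, pageNumber) {
  if (!Number.isInteger(pageNumber) || pageNumber < 1) {
    throw new RangeError('pageNumber must be a positive integer');
  }
  const url = new URL(pdfUrl);
  url.hash = 'page=' + pageNumber; // replaces any existing fragment
  return url.toString();
}

console.log(pdfPageLink('https://example.com/docs/report.pdf', 24));
// https://example.com/docs/report.pdf#page=24
```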
There are no specific pdfwrite (the Ghostscript device which actually produces PDF) options to do this. Without knowing why the (presumably) web browser or plugin won't open the file at the specified page it's a little difficult to offer any more guidance.
What are you using to view the PDF files?
Can you make a very simple file that fails? Can you make that file public?
If I can reproduce the problem, and the file is sufficiently simple, it may be possible to determine the problem. By the way, which version of Ghostscript are you using?

What is a "packed PDF", and how can it be read?

I have been sent versions of "packed PDF" files where the top-level PDF contains child PDFs.
The top-level PDF acts primarily as a container. The packing is not always evident in Adobe Reader (e.g. when pdftk is used to pack, the link does not show). I can find little by Googling for this term, or in my 2012 book ("Whittington", "PDF Explained", O'Reilly).
Is this a standard part of PDF? If so I'd be grateful for pointers. And can PDFBox analyze it?
Concerning your question whether using PDF as a container file format is a standard part of PDF:
Yes, it is. ISO 32000-1:2008 describes it in section 7.11.4 Embedded file streams.
Most prominent are files associated to some document page, see 12.5.6.15, File Attachment Annotations, and those associated with the document as a whole through the EmbeddedFiles entry (PDF 1.4) in the PDF document’s name dictionary (see 7.7.4, Name Dictionary).
@JesseGood's link to PDF File Specification on the PDFBox site explains how to deal with the latter ones.
I'm not very knowledgeable concerning PDFBox and, therefore, don't know whether it allows easy access to the other kind of attachments, too. If it does not, you will essentially have to iterate the annotations of all pages to find the file attachment annotations and handle the contents according to the PDF specification.
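The iteration described above might look roughly like this. It assumes the PDF has already been parsed into plain objects; the pages, annotations, subtype and fileSpec property names are invented for this sketch and do not correspond to the PDFBox API. Only the annotation subtype name FileAttachment comes from the PDF specification.

```javascript
// Walk all pages and collect the file attachment annotations.
// "pages" is a hypothetical array of already-parsed page objects.
function findFileAttachments(pages) {
  const attachments = [];
  pages.forEach((page, index) => {
    // A page may have no annotations at all.
    for (const annot of page.annotations || []) {
      if (annot.subtype === 'FileAttachment') {
        attachments.push({ pageNumber: index + 1, fileSpec: annot.fileSpec });
      }
    }
  });
  return attachments;
}

// Made-up sample data: only page 2 carries an attached file.
const samplePages = [
  { annotations: [{ subtype: 'Link' }] },
  { annotations: [{ subtype: 'FileAttachment', fileSpec: { name: 'child.pdf' } }] },
  {},
];

console.log(findFileAttachments(samplePages));
```

Files attached to the document as a whole (the EmbeddedFiles name tree) are not page annotations and would need a separate lookup.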

Web served image in PDF?

Does PDF and/or Adobe Reader support including an image by URL so that you can insert a dynamic images from a web server into a document?
The answer to your question is both yes and no. If you look in the PDF spec (I'm going by version 1.7) in section 7.11.5, you'll see that a stream within a PDF document can be represented by a URL. So yes, you can go ahead and specify that a PDF has, say, its image content at the specified URL.
The problem will be that when you specify an image within PDF, you are specifying a PARTICULAR image that must have a particular data length and encoding. Simply specifying dimensions, dct compression (aka jpg), and URL is not enough. Images are contained in streams of a particular length. If the stream is too long or too short, it is considered an error.
So you can have images dynamically served up, provided that they are always exactly the same byte length. I think. And I say this because the specification is somewhat ambiguous as to what happens when you set the length to 0 in the stream dictionary.
Now, is doing this practical? Maybe - you'll need a fairly strong PDF toolkit in order to be able to author these documents. And if you have that, I think you'd be better off authoring the entire PDF document that your clients want on the fly rather than trying to substitute an image at read time.
I don't believe you can place a dynamic image in a PDF document in this manner. It's possible to dynamically create an entire PDF document using web-hosted content (using PHP, Coldfusion, etc.) but changing that content later on the web server will not dynamically update previously generated PDF documents, which is what it sounds like you want to do.
As PDFs are meant to be portable by nature (PORTABLE Document Format) and thus, not always viewed online, this goes against the very principle of the document format, and is not supported as far as I know.
You could include a reference to an image at the time of generation of the PDF, but said image will be embedded into the PDF, not linked.
You could use pdf.js and modify the rendering methods slightly so that you inject your image. You can find pdf.js here: https://github.com/mozilla/pdf.js
You can also use FlexPaper, which has an API that allows you to overlay your document with images:
http://flexpaper.devaldi.com/