Is there a limit to how many metadata keyword characters are saved in a PDF? I am saving part numbers under work instructions designed in Microsoft Publisher. It will retain well past 255 characters, however when I export the document to PDF it gets cut off.
Thank you for all your help.
According to this accepted answer in the Adobe Forums
In the Acrobat Family, all fields on the main Document Properties tab are limited to 1999 characters or less. Fields on the Custom Properties tab are limited to 255 characters.
However, there is no size limit mentioned in the PDF Specification, so how much is saved and displayed depends on the implementation of your PDF application.
Related
Here is issue screenshot
Here is the sample code.
Dim rawData As Byte() = "sample data"
Response.ContentType = "application/pdf"
Response.ContentEncoding = System.Text.Encoding.UTF8
Response.BinaryWrite(rawData)
Response.End()
Space getting added between characters while writing to PDF using binary write
The underlying issue here is that you actually are not writing a PDF at all!
Your code essentially returns pure text data and then claims that it is a PDF. Such a claim doesn't change the text data in any way, though, they remain text and don't become a PDF.
The PDF viewer you use apparently attempts to somehow display what it got nonetheless but the result thereof turns out to be very unsatisfying (a proportional font seems to be used in a monospaced manner).
If you actually want to return a PDF, you have to explicitly create one. PDF is a complicated binary format best to create using a dedicated library.
Look for pdf libraries for your environment. You can find some that have explicit ways to add table or paragraph structures to the pdf, and some that create content by conversion from another structured format e.g. html.
The output of these libraries is a binary in pdf format which you can return from your code using Response.BinaryWrite.
Recently one can read in a number of questions that people have data in text or html format, return it setting some binary content type (PDF in this question, MS office formats in other ones), and then assume they so have generated a file in that format.
This is wrong, claiming a format doesn't transform into that format!
All this setting of the content type does, is informing the client what kind of viewer to use to open the data.
Probably this anti-pattern came up because MS Word (and most likely other word processors, too) can also open plain text and html text files and display them fairly properly. Thus, this anti-pattern at first glance appears to work somehow.
If you promised your client, though, that your application returns MS Office documents, don't return HTML or plain text claiming it to be an Office document, instead do create actual MS Office documents! Otherwise knowledgeable clients will not accept your implementation and clients who did accept it will eventually be informed by knowledgeable users that you cheated them which will at least lessen your renown.
I tried to copy text from a PDF file but get some weird characters. Strangely, Okular can recoqnize the text, but not with Sumatra PDF or Adobe, all three applications are installed in Windows 10 64 bit. To better explain my issue, here is the video https://streamable.com/sw1hc. The "text layer workaround file" is one solution I got. Any help is greatly appreciated. Regards
In short: The (original) PDF does not contain the information required for regular text extraction as described in the PDF specification. Depending on the exact nature of your task, you might try to add the required information to the existing text objects and fonts or you might go for OCR.
Mapping character codes to Unicode as described in the PDF specification
The PDF specification ISO 32000-1 (and similarly ISO 32000-2, too) describes an algorithm for mapping character codes to Unicode values using information available directly inside the PDF.
It has been quoted very often in other stack overflow answers (see here, here, here, here, here, or here), so I won't quote it here again.
Essentially this is the algorithm used by Adobe Acrobat during copy&paste and also by many other text extractors.
In PDFs which don't contain the information required for text extraction, you eventually get to this point in the algorithm:
If these methods fail to produce a Unicode value, there is no way to determine what the character code represents in which case a conforming reader may choose a character code of their choosing.
What happens if the algorithm above fails to produce a Unicode value
This is where the text extraction implementations differ, they try to determine the matching Unicode value by using heuristics or information from beyond the PDF or applying OCR to the glyph in question.
That the different programs you tried returned so different results shows that
your PDF does not contain the information required for the algorithm above from the PDF specification and
the heuristics used by those programs differ relevantly and Okular's heuristics work best for your document.
What to do in such a case
There are multiple options, more or less feasible depending on your concrete case:
Ask the source of the PDF for a version that contains proper information for text extraction.
Unless you have a contract with that source that requires them to supply the PDFs in a machine readable form or the source is otherwise obligated to do so, they usually will decline, though...
Apply OCR to the PDF in question.
Depending on the quality of the OCR software and the glyphs in the PDF, the results can be of a questionable quality; e.g. in your "PDF copy text issue-Text layer workaround.pdf" the header "Chapter 1: Derivative Securities" has been recognized as "Chapter1: Deratve Securites"...
You can try to interactively add manually created ToUnicode maps to the PDF, e.g. as described by Tilman Hausherr in his answer to "how to add unicode in truetype0font on pdfbox 2.0.0".
Depending on the number of different fonts you have to create the mappings for, this approach might easily require way too much time and effort...
We are developing a printing server that allows user to upload a DOC and print it out via HP ePrint. It needs to support page extraction.
I tried to use macro (with the help of Adobe Acrobat Reader Pro and MS Word) to extract pages into PDF. But it turns out that the size of PDF may be larger in size than expected.
Is there any way to extract pages (without lossing format - E.g. Table in DOC) from DOC to DOC, so that the size can be approximately the size?
This is a difficult requirement. It sounds like you have run into 2 problems (large PDFs and format loss) at the outset. You should probably say more about what you mean by "extraction" and why PDF is part of your solution because that's quite different from "upload and print" and "doc to doc". That way readers will have more suggestions for you.
I would suggest you try to approach the problem from a different angle if possible, because I suspect that you are unlikely to achive a good, efficent, stable result. One possible approach is to turn the DOC into PDF and then use iText or some other PDF library to manipulate the PDF before printing. It really depends on what you are trying to achieve - the specifics of your extract/merge/convert.
I am writing an application that has to read and interpret data stored in some PDF files. The reading part is done but I am only able to get a dump of all the words on a page and not the format of the words. What I mean is that if I have to extract a table, I am getting the numbers in the table but not the markup which defines the table.
Further, there is some formatting used which displays a few of these numbers within parentheses (meaning that those numbers are negative) but the parentheses themselves are not part of the text. Hence, I am not able to distinguish between positive and negative numbers present in the PDF table!
How do you get the PDF markup along with the text? Is a PDF similar in structure to an XML with tags used to markup tables etc.? If not, then, is there a resource which describes the salient features of the PDF DOM?
I am using VBA and the Acrobat library (AcroExch etc.)
There is no such thing as "PDF markup" in the sense of HTML etc. A table in PDF cannot be distinguished from line art, other than by using OCR, which can be error-prone if the layout is complex. It is simply drawn using geometrical shapes, like in a vector-based graphics program.
"Is a PDF similar in structure to an XML with tags used to markup tables etc.?"
No, not at all.
And there is no such thing as a 'DOM' either. Google for a file named *PDF32000_2008.pdf*. The current PDF specification for v1.7 (ISO spec) is that file. You should be able to locate it on the Adobe website.
As omz stated, text inside PDF does not really have a structure. You can take a look on the specification here. However, for some very specific files, there is something called PDF Tags, or PDF Marked Content, which is fairly new, and it aims to give PDF documents some kind of structure. If you target this kind of files specifically, you might be able to achieve something. Take a look on chapter 10 (Document Interchange) of the Adobe's specification for further details.
Maybe what you want to achieve can be done with less effort and faster by using TET, the Text Extraction Toolkit made by the fine folks from pdflib.com ( http://www.pdflib.com/products/tet/ ) ??
AFAIR, the TET has some (limited) support for table detection as well....
I converted a Smart Form output into PDF using the function module SX_OBJECT_CONVERT_OTF_PDF.
My problem is that when the language is PL (Polish) the PDF file is 10 times bigger comparing to EN language. Why?
Gunstick answer is probably right.
Sap note: 843480 discuss this issue.
As of release 620 onward, there is support patches that enable pdf elements( such as fonts) to be compressed. The resulting pdf will be larger then the only English one, but it will probably be less than 10 times larger.
This may be that polish uses a specific font (special characters) which is not installed by default on an OS. So the pdf converter includes the complete font into the document in order to render it correctly at the destination.
This is just speculation though.
You may try this one: http://lucattelli.com/blog/?page_id=478
This FM can take the binary PDF and convert it to BASE 64 and send it as a mail attachment.
See if it helps