I have an application that uses scanned form images as a background to a PDF, then paints fields and renders data in those fields over that background to provide a virtual form of a physical form.
The problem we are facing is the size of the PDF is too large (15-30mb) and we need to communicate several PDFs to an API that has a hard limit of ~20mb. The PDFs need to be 1-2Mb in size.
I am hoping to be able to solve this problem by stripping fonts and background form-images from the PDF itself, leaving the content of the PDF only the text data and fields. I imagine this could work as long as the PDF could load the fonts and fields from an external URL (our content delivery network would do quite nicely here).
The PDFs will be downloaded and rendered on a variety of devices (phones, tablets and PCs). They need to render properly, no different from having the images and fonts embedded in the PDFs.
Can I achieve this using PDFs?
No, what I wanted is not possible in PDFs. It may be possible with Javascript embedded in PDFs but I did not try this.
Instead, we resolved the problem by using vector graphics for lines in the form (tables and section separators), embedding common fonts and using only font sub-sets for fonts not used in data entry.
I don't think it is necessary to embed the fonts for input fields, but you will want to ensure that special input field characters like the solid center for radio buttons or checkmarks for checkboxes are embedded.
Related
Does anyone know if there is a function in PDF's to allow them to auto-adjust the view depending on whether it is on a desktop or mobile? Or even by screen size?
I am looking to prepare PDF material for distribution, however, on the user group includes a mix of desktop and mobile, so instead of creating two PDFs I would like to have a single PDF which adapts to the users screen?
This is not possible with PDF files up to PDF version 1.7, the most commonly used on out there.
PDF 2.0 which was released three years ago has such a feature but it depends on the viewer implementing it and the PDF writer correctly annotating the PDF. I guess there are PDF viewers out there that can already do this but I'm not specifically aware of any.
If I were you, I would write the document in a format like LaTeX that can easily be converted to both kinds of PDFs, one for desktop and one for mobile.
// EDIT 26.03.2018 - Who wants to continue my work can have a look on my source-files https://github.com/n0l0cale/ocr-sampledata
I'm actually looking for some details about PDF Files. It's most important for me that the files will be usable for a very long time and if possible the OCR should be automatically applied for new files (which seems to be not really possible with Adobe Acrobat...).
For that I've been looking for different solutions how to OCR my PDF Files. I found three candidates which seems to be doing what they should do... (more or less). But all three variants have their pro&cons... But there seem to be different approaches how to store data in PDF Files.... for all three Variants... Let me explain:
a File OCRed with Adobe Acrobat:
https://github.com/n0l0cale/ocr-sampledata/blob/master/A4%20sample_ACROBAT.pdf
results in a file that Acrobat is able to open in one step (no preloading of any background layer) and after a preflight-script I'm able to see the text which is stored hidden:
a File OCRed with Abby Finereader:
https://github.com/n0l0cale/ocr-sampledata/blob/master/A4%20sample_ABBY.pdf
does not seem suitable for the default adobe preflight-script as it does not display any additional layers:
But far as I was able to reproduce these Files seems to have a Background-Text-Layer, which contains the OCRed Text, which is the underlying layer for the Image that is shown to the user at the end. Unfortunately this seems to be loaded separately and this is confusing while opening the file with Adobe Acrobat...
a File OCRed with Tesseract 4 (Alpha):
https://github.com/n0l0cale/ocr-sampledata/blob/master/A4%20sample_TESSERACT_oem2.pdf
is also doing some weird magic with the hidden text part:
But in all three cases I'm able to search for words in the files and see the text using "Remove hidden information" and selecting "hidden text":
I'm seriously confused.... Does anyone know how these programs are storing their hidden text information really?
S.
P.S.: For those wondering what this ominous preflight script is: https://theblog.adobe.com/hidden-gems-in-acrobat-dc-how-to-optimize-hidden-ocr-text/
Does anyone know how these programs are storing their hidden text information really?
You correctly have found out that the approach of Abby Finereader is different from that of Adobe Acrobat and of Tesseract:
Abby creates a page content stream in which first the text is drawn normally on the page and eventually covered by the scanned image.
Acrobat and Tesseract create content streams in which first the image is drawn and then the text is drawn invisibly (using text rendering mode 3 which draws nothing).
The difference between the latter two results is the choice of font used:
Acrobat uses regular standard 14 fonts for which a PDF viewer has a font program to render them as normal glyphs.
Tesseract uses a font GlyphLessFont it embeds a font program for into the result file. When rendered the glyphs in this font do not show as our normal Latin glyphs but merely as empty space.
Considering the visual effect you observed for the Abby result, the approach used by Acrobat or Tesseract might be preferable.
Whether one prefers fonts with visually recognizable glyphs (as used by Acrobat) or without (as used by Tesseract), is mostly a mere matter of taste. They are used only in the invisible rendering mode anyways.
I am trying to figure out a way to successfully embed full fonts (not just the subset) within an editable PDF to create a PDF that allows the end-user to customize the information within each of the editable fields without losing the font styling and without having to separately download the font themselves. Some fonts will allow it, while others that are downloaded/purchased will not. However, I have seen others do this successfully and I know that the particular font that I am working with was purchased with a license for commercial use so there should not be any permission issues with embedding it. I have also seen this particular font in use for editable PDFs. I am working on Illustrator CC and Adobe Pro DC. I've used the Preflight Embed Missing Fonts option to no avail, as well as changing the Illustrator Setting for substituting fonts to 0% and neither has worked successfully. Any suggestions? Is this an additional plug in?
I made a PDF file from Latex (using TexMaker).
Acrobat Reader is able to display BOTH the text and the table of contents in Linux.
But Acrobat Reader is unable to display the table of contents in Windows XP (the Chinese characters came out as boxes). However, the text is displayed correctly.
I tried to embed the fonts into the PDF but the various methods are not 100% successful, so I'm not sure if the fonts are embedded correctly or not. Anyway, the table of contents remain unreadable in Windows.
I wonder if it is really an font embedding problem? Or do I need to install these "Adobe Reader X Font Packs":
https://www.adobe.com/support/downloads/detail.jsp?ftpID=4883
My concern is that I'd like my PDF to be readable in Windows, including the table of contents (and preferably without further installations). If this is possible...
I suspect you are talking about "bookmarks" and not saying part of the text in the document is ok and part is not. PDF Bookmarks are part of the UI of the application and are not selected from embedded fonts. Therefore, the system you are running on needs to know how to handle fonts in the language(s) of choice.
See https://forums.adobe.com/thread/1144972?start=0&tstart=0
Embedding the fonts will have no effect on the bookmarks.
We are using Blackberries to display PDF reports. Here are background details on the problem:
The PDF reports are created using JasperReports.
Report format can be changed.
Different report formats are available (as per the feature set of JasperReports).
The PDF reports are on a website, too, so retaining a single source is ideal.
The page setup is in Landscape.
Here are the issues we have encountered:
Users cannot see a full line of text on the Blackberry.
The size of the PDF and UI makes reading difficult, at best.
The menu option to convert the PDF to text loses too much formatting to be useful.
The text is blurry (and too small).
Here are solutions we have thought about:
Create a second report (not ideal) in text or HTML format.
Simplify the original report format (not really an option, given the amount of data).
What other options are there for making a report available on the Blackberry, given the constraints of JaserReports, such that the report:
Is legible?
Is formatted for readability?
Displays quickly?
Essentially, we'd like to make sure there are no simple solutions we have overlooked for displaying legible PDFs on Blackberries.
We convert TIFFs to PDF for one of our applications, and have had mixed results with BlackBerry PDF viewers. These were our results.
Working
The following PDF readers worked for our purposes:
RepliGo Reader v1.1.1.1 - $19.95
Works fine.
DataViz Documents To Go Premium Edition v1.003.001 - $49.99
Works and includes a word wrap option to get the current zoom level to fit the available screen width, by moving text onto subsequent lines. Might fit your needs.
Non-Working
The following PDF readers did not work for our purposes:
BeamReader v1.0.8 - $17.99
BeamSuite v3.0.2 - $49.99
These couldn't open our PDF files ("Unsupported document format"). In addition they did not register as a PDF content handler, required for our application.
MasterDoc - $19.95
eOffice - $29.95
These also did not register as a PDF content handler. We had a range of problems with these, including installation issues, and not being able to open any PDFs at all.
Try BeamReader http://www.slgmobile.com/beamreader.html
I hear it's the best at reading PDFs for BlackBerry
How about outputting the file to an RTF or an image file (JPG/GIF), and then viewing them in your web browser?
If that doesn't work well on the native browser, I would focus on viewing the file via some other web browser - for example, Opera Mini. I know for images it's easier to navigate "big" images in Opera Mini than the native browser.
If your blackberries are on a BES server, couldn't you display the reports as HTML on your corporate intranet? - Then you could email a link to the blackberry and simply browse the report.
You can convert pdf to image via xpdf and than show image. xpdf is a BEST renderer of pdf.