What is a format for static documents like PDF but not divided into pages? - pdf

PDF is for static documents, so a document is displayed the same way in different applications, even if it has an unusual layout. But PDF documents are divided into pages, because the format is designed for documents that will be printed.
I would like to have a document with static content but with no page breaks. Which document format can do that? I guess it could be achieved with a PDF whose single page is as long as it needs to be, but I don't know of any software that can do that, and it seems like an abuse of PDF.
I create PDF documents in LaTeX, and they are almost never printed; the page layout only gets in the way when they are read on a screen. So I'm looking for a way to produce documents whose layout is fixed (because of hyphenation, mathematics and graphics) but which are more suitable for reading on screens.

Related

Using PDFBox or something else, is it possible to know if a PDF contains no scanned pages?

I'm looking for a solution to detect whether a PDF document contains non-searchable text. I'm thinking of a scenario where a multi-page PDF contains some plain-text pages (with or without images, it doesn't matter) and one or more pages containing non-searchable text.
So I would like a method returning true/false that detects whether a PDF contains non-searchable text (or vice versa). In your opinion, is this possible with PDFBox or something else?
Thx
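One possible heuristic, as a minimal sketch assuming PDFBox 2.x (the class name, the helper method and the 10-character threshold are my own, not an established API): walk the pages, treat any page that draws an image but yields almost no extractable text as non-searchable, and return true as soon as such a page is found.

    import java.io.File;
    import java.io.IOException;

    import org.apache.pdfbox.cos.COSName;
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.pdmodel.PDPage;
    import org.apache.pdfbox.pdmodel.PDResources;
    import org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject;
    import org.apache.pdfbox.text.PDFTextStripper;

    public class NonSearchableCheck {

        // Returns true if at least one page draws an image but yields
        // (almost) no extractable text, i.e. a scanned page that was never OCRed.
        static boolean hasNonSearchablePage(File file) throws IOException {
            try (PDDocument doc = PDDocument.load(file)) {
                PDFTextStripper stripper = new PDFTextStripper();
                int pageNo = 0;
                for (PDPage page : doc.getPages()) {
                    pageNo++;
                    stripper.setStartPage(pageNo);
                    stripper.setEndPage(pageNo);
                    String text = stripper.getText(doc).trim();

                    boolean hasImage = false;
                    PDResources res = page.getResources();
                    if (res != null) {
                        for (COSName name : res.getXObjectNames()) {
                            if (res.getXObject(name) instanceof PDImageXObject) {
                                hasImage = true;
                                break;
                            }
                        }
                    }
                    // An image-bearing page with (almost) no text layer counts
                    // as non-searchable; the 10-character threshold is arbitrary.
                    if (hasImage && text.length() < 10) {
                        return true;
                    }
                }
            }
            return false;
        }
    }

This is only a heuristic: it will miss pages whose text layer exists but contains garbage, and the threshold has to be tuned for your documents.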

How is hidden text stored in OCR-enhanced PDF files

// EDIT 26.03.2018 - Anyone who wants to continue my work can have a look at my source files: https://github.com/n0l0cale/ocr-sampledata
I'm actually looking for some details about PDF files. It's most important to me that the files remain usable for a very long time, and if possible the OCR should be applied automatically to new files (which doesn't seem to be really possible with Adobe Acrobat...).
For that I've been looking at different solutions for how to OCR my PDF files. I found three candidates which seem to do what they should do... (more or less). But all three variants have their pros and cons, and there seem to be different approaches to how the data is stored in the PDF file for the three variants... Let me explain:
a File OCRed with Adobe Acrobat:
https://github.com/n0l0cale/ocr-sampledata/blob/master/A4%20sample_ACROBAT.pdf
results in a file that Acrobat is able to open in one step (no preloading of any background layer), and after running a preflight script I'm able to see the text that is stored hidden.
a File OCRed with Abby Finereader:
https://github.com/n0l0cale/ocr-sampledata/blob/master/A4%20sample_ABBY.pdf
does not seem suitable for the default Adobe preflight script, as it does not display any additional layers.
But as far as I was able to reproduce, these files seem to have a background text layer containing the OCRed text, which lies underneath the image that is shown to the user. Unfortunately this layer seems to be loaded separately, which is confusing when opening the file with Adobe Acrobat...
a File OCRed with Tesseract 4 (Alpha):
https://github.com/n0l0cale/ocr-sampledata/blob/master/A4%20sample_TESSERACT_oem2.pdf
also does some weird magic with the hidden text part.
But in all three cases I'm able to search for words in the files and see the text using "Remove hidden information" and selecting "hidden text".
I'm seriously confused... Does anyone know how these programs really store their hidden text information?
S.
P.S.: For those wondering what this ominous preflight script is: https://theblog.adobe.com/hidden-gems-in-acrobat-dc-how-to-optimize-hidden-ocr-text/
Does anyone know how these programs really store their hidden text information?
You have correctly found out that the approach of Abby Finereader differs from that of Adobe Acrobat and Tesseract:
Abby creates a page content stream in which the text is first drawn normally on the page and then covered by the scanned image.
Acrobat and Tesseract create content streams in which the image is drawn first and the text is then drawn invisibly (using text rendering mode 3, which draws nothing).
The difference between the latter two results is the choice of font used:
Acrobat uses the regular standard 14 fonts, for which a PDF viewer has font programs to render them as normal glyphs.
Tesseract uses a font called GlyphLessFont, for which it embeds a font program into the result file. When rendered, the glyphs in this font do not show as normal Latin glyphs but merely as empty space.
Considering the visual effect you observed for the Abby result, the approach used by Acrobat or Tesseract might be preferable.
Whether one prefers fonts with visually recognizable glyphs (as used by Acrobat) or without (as used by Tesseract) is mostly a matter of taste; they are only used in the invisible rendering mode anyway.
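For anyone who wants to reproduce the Acrobat/Tesseract variant programmatically, here is a minimal sketch assuming PDFBox 2.x (file names and coordinates are placeholders): the scan is drawn first and the recognized text is then written over it in text rendering mode 3, i.e. invisibly.

    import java.io.File;
    import java.io.IOException;

    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.pdmodel.PDPage;
    import org.apache.pdfbox.pdmodel.PDPageContentStream;
    import org.apache.pdfbox.pdmodel.common.PDRectangle;
    import org.apache.pdfbox.pdmodel.font.PDType1Font;
    import org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject;
    import org.apache.pdfbox.pdmodel.graphics.state.RenderingMode;

    public class InvisibleTextSample {
        public static void main(String[] args) throws IOException {
            try (PDDocument doc = new PDDocument()) {
                PDPage page = new PDPage(PDRectangle.A4);
                doc.addPage(page);

                // "scan.png" stands in for the scanned page image.
                PDImageXObject scan = PDImageXObject.createFromFile("scan.png", doc);

                try (PDPageContentStream cs = new PDPageContentStream(doc, page)) {
                    // 1. Draw the scan so the reader sees the original page.
                    cs.drawImage(scan, 0, 0,
                            page.getMediaBox().getWidth(),
                            page.getMediaBox().getHeight());

                    // 2. Write the recognized text invisibly: rendering mode 3
                    //    ("neither fill nor stroke") puts it into the content
                    //    stream without producing visible glyphs.
                    cs.beginText();
                    cs.setRenderingMode(RenderingMode.NEITHER);
                    cs.setFont(PDType1Font.HELVETICA, 12); // a standard 14 font, as Acrobat uses
                    cs.newLineAtOffset(72, 700);           // where the word sits in the scan
                    cs.showText("recognized text");
                    cs.endText();
                }
                doc.save(new File("searchable-scan.pdf"));
            }
        }
    }

A real OCR pipeline would place each recognized word at its own position and size; the point here is only the drawing order (image first, text second) and the rendering mode.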

Get size of single pages in a multi-page PDF document in Objective-C

I am writing an app in Objective-C that splits multi-page PDF files into several files.
In order to control the size of the resulting split files, I am looking for a way to programmatically get the size of the single pages in the multi-page PDF file. I could not find anything in the APIs. Any help is appreciated.
There is no such information in the iOS/OS X PDF-related APIs. Due to the structure of PDF files, computing the size a PDF page occupies on disk is quite complicated. Also, because pages can share resources (fonts, images, etc.), the sum of the separate page sizes will not equal the size of the final file.
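If an estimate is good enough, one workaround is to split the document into single-page files and measure those. This is not an iOS API; the sketch below uses PDFBox (Java, assuming version 2.x) purely to illustrate why the per-page sizes do not add up to the original file size.

    import java.io.ByteArrayOutputStream;
    import java.io.File;
    import java.io.IOException;
    import java.util.List;

    import org.apache.pdfbox.multipdf.Splitter;
    import org.apache.pdfbox.pdmodel.PDDocument;

    public class PageSizeEstimate {
        public static void main(String[] args) throws IOException {
            File input = new File("input.pdf"); // placeholder path
            long total = 0;
            try (PDDocument doc = PDDocument.load(input)) {
                List<PDDocument> pages = new Splitter().split(doc);
                for (int i = 0; i < pages.size(); i++) {
                    ByteArrayOutputStream buf = new ByteArrayOutputStream();
                    pages.get(i).save(buf);
                    pages.get(i).close();
                    System.out.printf("page %d: ~%d bytes%n", i + 1, buf.size());
                    total += buf.size();
                }
            }
            // Fonts and images shared between pages are copied into every
            // split file, so this total is usually larger than the original.
            System.out.printf("sum of per-page files: %d bytes, original: %d bytes%n",
                    total, input.length());
        }
    }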

Load auto-paged PDF in iOS like iBooks

In iBooks, when you open a PDF, it is automatically formatted and paginated; e.g. on an iPhone there are 5 pages, but when you view it on an iPad it only contains 2 pages.
When you change the text size, the pages are also updated automatically.
How can I do this using CGPDFDocumentRef?
I'm assuming you are talking about Apple iBooks on the iPad? Are you sure you are observing the behavior of a PDF and not an ePub file?
The native format of iBooks is either ePub or the format created by iBooks Author.
PDF files are usually (in the vast majority of cases) used in a non-reflowing way. Reproducing the exact visual appearance of pages - explicitly without reflow - is exactly why PDF was invented.
There are constructs you can add to PDF files to make them a little more akin to formats like HTML and ePub; these constructs can tag text with styles, logically define paragraphs, columns, tables and so on. Usually they are used to make a PDF file suitable for long-term archiving (according to the ISO PDF/A standard) or accessible (suitable for reading by screen-reader software for vision-impaired people, for example). Such a PDF file is commonly referred to as a tagged PDF.
As far as I know iBooks doesn't actually support tagged PDFs (meaning, it doesn't use the information in such a PDF file to reflow the file). And as far as I know you cannot create the necessary tags and structure with the built-in iOS library.
If your target app is iBooks, you'd probably be better off looking into generating ePub...

Is it possible to break a PDF file into pieces smaller than page-wise breaking?

I found that there are a lot of tools available for breaking big PDF files into smaller ones by splitting the original PDF file page-wise. For example, if I have a 10-page PDF document, then I can break the original PDF file into 10 pieces by page-wise splitting.
But I want a similar kind of tool that breaks the PDF file into pieces smaller than page-wise splitting. That means I need to split a PDF page into different documents based on some parameter like paragraph, section, element...
For example,
if my PDF file has 2 pages with 10 paragraphs, then I would like to split the PDF file into 10 separate PDF files based on the paragraph parameter...
Also, I strongly believe PDF does not contain any structure like Open XML. But I'm also wondering:
how are the tools able to break PDF files into smaller PDF files by splitting page-wise? What kind of mechanism do they use for page-wise splitting of a PDF file?
So, is there any way to do what I want? Please give me your suggestions on this.
PDF is a vector-based document description language. It's page-based, so in a way every page is independent of the next one; splitting page-wise is therefore pretty easy. In contrast to a raster image, where you can extract small subsets independently, in a PDF you have to render the whole page to know what a small subset looks like.
Say you have a page that contains a complex shaped object (it might be a line, but it could be any text, shape, image, etc.) and you want to extract a small region of it. You would first have to find all the objects that produce visible output in the region of interest. Then you would have to modify them so they are rendered correctly (in this case, recalculating the points of the object where it crosses the region boundary while preserving its shape).
An easier approach would be to include the whole page and clip the viewing area to the dimensions of the region.
You could do this with pdfjam. Check the --trim/--offset/--delta options in conjunction with a custom paper size (examples 6 and 7 on the pdfjam website). You would still have to somehow calculate the coordinates of the region of interest, though.
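If you prefer to do the clipping step in code rather than with pdfjam, a minimal sketch assuming PDFBox 2.x (file names and coordinates are placeholders) is to set the page's crop box to the region of interest and save the result:

    import java.io.File;
    import java.io.IOException;

    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.pdmodel.PDPage;
    import org.apache.pdfbox.pdmodel.common.PDRectangle;

    public class ClipRegion {
        public static void main(String[] args) throws IOException {
            try (PDDocument doc = PDDocument.load(new File("input.pdf"))) {
                PDPage page = doc.getPage(0);

                // Region of interest in PDF points (1 pt = 1/72 inch), measured
                // from the lower-left corner: lowerLeftX, lowerLeftY, width, height.
                // Working out where a paragraph actually sits is still up to you.
                page.setCropBox(new PDRectangle(72, 400, 450, 300));

                doc.save("clipped.pdf");
            }
        }
    }

The whole page content is still present in the file; viewers simply display only the cropped area, which is exactly the "include the whole page and clip the viewing area" approach described above.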