Avoid multiple pages - pdf

Is it possible to generate a PDF from HTML source as one single long page? Currently, if the content exceeds the paper height (A4, for instance), the overflow moves to the next page.

I'm 90% sure that you can't do anything preemptive except guesstimate the size of the resulting PDF, which is not reliable.
I would generate the PDF and then use pdftk (or a similar tool) to check how many pages were generated. Based on that, you could rerun wkhtmltopdf with the --page-height option set to something greater than the A4 height, and loop this process until you get the desired one-page output.
Tedious, but the only reliable way.
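The loop described above could be sketched roughly as follows. This is only an illustration under some assumptions: wkhtmltopdf and pdftk are on the PATH, the page count is read from the `NumberOfPages:` line of `pdftk file.pdf dump_data`, and the names `fit_on_one_page`, `render_and_count` and the 5000 mm ceiling are made up for this example.

```python
import re
import subprocess

A4_WIDTH_MM = 210
A4_HEIGHT_MM = 297


def parse_page_count(dump_data: str) -> int:
    """Extract the page count from the text printed by `pdftk file.pdf dump_data`."""
    match = re.search(r"^NumberOfPages:\s*(\d+)", dump_data, re.MULTILINE)
    if match is None:
        raise ValueError("no NumberOfPages line found")
    return int(match.group(1))


def render_and_count(html: str, pdf: str, height_mm: int) -> int:
    """Render html to pdf at the given page height and return the page count."""
    subprocess.run(
        ["wkhtmltopdf",
         "--page-width", f"{A4_WIDTH_MM}mm",
         "--page-height", f"{height_mm}mm",
         html, pdf],
        check=True,
    )
    dump = subprocess.run(
        ["pdftk", pdf, "dump_data"],
        check=True, capture_output=True, text=True,
    ).stdout
    return parse_page_count(dump)


def fit_on_one_page(html, pdf, render=render_and_count, max_height_mm=5000):
    """Grow the page height until the render comes out as a single page.

    Returns the height (in mm) that produced one page. The ceiling is an
    arbitrary safety limit chosen for this sketch, not a documented value.
    """
    height = A4_HEIGHT_MM
    while height <= max_height_mm:
        pages = render(html, pdf, height)
        if pages == 1:
            return height
        # N pages at the current height suggests about N times the height.
        height *= pages
    raise RuntimeError("content did not fit within the height limit")
```

Calling `fit_on_one_page("input.html", "output.pdf")` would leave the single-page result in output.pdf. Multiplying the height by the page count usually converges in two or three renders, at the cost of some blank space at the bottom of the page.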

Related

How to customize the PDF generated from MathML using jEuclid?

I can generate PDF output from MathML using jEuclid ("mml2xxx.bat"), but what I get is a PDF with an A4 page size.
What I need is a PDF whose page size exactly matches the equation it contains.
What is the intended destination of the PDF equation/expression? The reason I ask is that some applications are smart enough to know not to use the whole A4 size, and display just the math part. If you're using InDesign, for example, and you use the Place command to place this PDF into a document, it should appear at the proper size, and likely even with the correct baseline adjustment.

Multipage background in PDF, using pdftk or other tool

How can I add a multipage background (e.g. separate odd and even backgrounds) to a PDF of some ten thousand pages while keeping the output file as small as possible?
I produce very long documents (e.g. 10,000 pages in one document). Each page has a background, which I apply as follows:
I have a lot of .dvi documents, which I join using dviconcat;
next I run dvipdf on the joined .dvi;
finally I use pdftk to apply the background: pdftk infile.pdf background bg.pdf output outfile.pdf
This gives me a fairly small file, e.g. 200 MB, compared with producing many .pdf files that already contain the background and joining them with pdftk, where the resulting file is e.g. 2 GB.
I think this is because the background is not repeated on every page; a single copy is stored in the PDF and each page refers to it.
Unfortunately, I now need a two-page (two-sided) background: one background for odd pages and a different one for even pages. pdftk doesn't know how to do this. I could prepare a 10,000-page background, but it would be huge (e.g. 1 GB).
Any suggestions on how I could accomplish this without handling multi-gigabyte files? Is it doable at all? If so, with pdftk or with a different tool?
One solution would be to apply the background when you convert PostScript to PDF. Using a BeginPage procedure you can paint the background before you paint the page contents. By checking the page count in BeginPage you can choose which background to paint, so you can have different ones for even/odd/whatever pages.
If you specify each background as a PostScript form, then your BeginPage procedure can be small. More importantly, the current version of Ghostscript, 9.14, will attempt to pass PostScript forms into a PDF file as PDF forms, and it can identify and consolidate duplicates, so it 'should' embed each form only once. This should result in the minimum possible file size.
However, this code is at an early stage of development and might not work for you, also you'll need to do some PostScript programming.
I'm not familiar with pdftk, but would it be possible to produce all the even pages and add a background to them, produce all the odd pages and add a different background, and then use pdftk to merge and interleave the pages?
NB Ghostscript doesn't handle .dvi files, so I'm rather at a loss to know how you use Ghostscript to 'join' them. Also, if you are somehow creating the PostScript files using Ghostscript, you would almost certainly be better off using Ghostscript to produce the PDF file directly. (I'm assuming here that you are using Ghostscript's ps2pdf, but even if you aren't it'll still be quicker to produce the PDF in one step, and almost certainly produce better output too)
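The split, background and interleave idea suggested above might look like the following with pdftk. This is a sketch, not a tested recipe: it assumes a pdftk version that supports the `odd`/`even` page qualifiers and the `shuffle` operation, the intermediate file names are invented for the example, and whether the interleaved output still stores each background only once would need to be checked by looking at the resulting file size.

```python
import subprocess


def odd_even_background_commands(infile, odd_bg, even_bg, outfile):
    """Build the pdftk command lines for separate odd and even backgrounds.

    The intermediate names (odd.pdf, even.pdf, ...) are hypothetical.
    """
    return [
        # 1. Split the input into its odd and its even pages.
        ["pdftk", infile, "cat", "1-endodd", "output", "odd.pdf"],
        ["pdftk", infile, "cat", "1-endeven", "output", "even.pdf"],
        # 2. Apply a single-page background to each half.
        ["pdftk", "odd.pdf", "background", odd_bg, "output", "odd_bg.pdf"],
        ["pdftk", "even.pdf", "background", even_bg, "output", "even_bg.pdf"],
        # 3. Interleave the halves again: A1, B1, A2, B2, ...
        ["pdftk", "A=odd_bg.pdf", "B=even_bg.pdf",
         "shuffle", "A", "B", "output", outfile],
    ]


def apply_odd_even_backgrounds(infile, odd_bg, even_bg, outfile):
    """Run the commands in order, stopping at the first failure."""
    for command in odd_even_background_commands(infile, odd_bg, even_bg, outfile):
        subprocess.run(command, check=True)
```

A call such as `apply_odd_even_backgrounds("infile.pdf", "bg_odd.pdf", "bg_even.pdf", "outfile.pdf")` would run the five steps. Keeping the command construction separate from execution makes it easy to print and inspect the commands before running them on a 10,000-page file.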

Possible to control PDF layout with iText?

I'm writing some logic to build a large single PDF file that our users can print at their convenience. I'm using Java's iText library (through Clojure's clj-pdf).
I'm trying to have the PDF show the same exact template form on every single page, however I can't seem to find any documentation or indication that one can have PDF content "fit to a page".
The text in these forms varies a little bit, so there's a chance it might require more or fewer text lines per page. This means that the content has a chance of spilling over to the next page, or being too short, making the next page creep up into the previous one, breaking the requirement of "one form per page" for the rest of the document.
I'm trying to figure out if my only option is to manually check the length of the text on each page and potentially crop it by hand if it goes over n lines, or if the PDF format somehow supports a smart way of having paragraphs+tables+headings all fit in one page. Some UI systems allow you to control how spill-over is handled, anywhere from cropping to resizing the font, so I'm curious if PDF supports anything of that sort.
Edit: ended up going with pagebreaks for simplicity, wasn't aware of that option when I wrote this question.
If you want to take control over the space taken by text, for instance to fit it on a single page, the way to go would be to create a ColumnText object and to add the content in simulation mode. If the text fits the page, add it for real. If it doesn't, use a smaller font size. This is demonstrated in the MovieAds example where snippets of text are fitted into AcroForm fields.
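The "simulate first, then add for real" idea can be illustrated independently of iText. The sketch below is plain Python, not the iText API: it replaces ColumnText's simulation mode with a deliberately crude height estimate (an assumed average glyph width of half the font size and a line height of 1.2 times the font size), and all function names are hypothetical.

```python
import textwrap


def simulated_height(text, font_size, box_width):
    """Estimate the height the text would take without drawing anything.

    Assumes an average glyph width of 0.5 * font_size and a line height
    of 1.2 * font_size; a real layout engine would measure actual glyphs.
    """
    chars_per_line = max(1, int(box_width / (0.5 * font_size)))
    lines = textwrap.wrap(text, width=chars_per_line) or [""]
    return len(lines) * 1.2 * font_size


def largest_fitting_size(text, box_width, box_height,
                         start=12.0, minimum=6.0, step=0.5):
    """Shrink the font until the simulated layout fits the box.

    Returns the first size that fits, or None if even the minimum size
    overflows (the caller then has to crop or add a page break).
    """
    size = start
    while size >= minimum:
        if simulated_height(text, size, box_width) <= box_height:
            return size
        size -= step
    return None


if __name__ == "__main__":
    print(largest_fitting_size("word " * 60, box_width=200, box_height=100))
```

The `None` result matters for the one-form-per-page requirement: when no acceptable font size fits, the content has to be truncated or moved rather than allowed to spill onto the next page.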

Is it possible to split a PDF file into pieces smaller than a page?

I found that there are a lot of tools available for breaking big PDF files into smaller ones by splitting the original PDF file page-wise. For example, if I have a 10-page PDF document, I can split the original file into 10 pieces, one per page.
But I want a similar kind of tool that splits the PDF file into pieces smaller than a page. That is, I need to split a PDF page into different documents based on some parameter such as paragraph, section, element...
For example,
if my PDF file has 2 pages with 10 paragraphs, I would like to split it into 10 separate PDF files, one per paragraph.
Also, I strongly believe PDF does not contain any structure like Open XML does. But I am also wondering:
how are the tools able to split PDF files into smaller PDF files page-wise? What kind of mechanism do they use for page-wise splitting?
So, is there any way to do what I need? Please give me your suggestions.
PDF is a vector-based document description language. It is page-based, so in a way every page is independent of the next one, and splitting page-wise is therefore pretty easy. Unlike a raster image, where you can extract small subsets independently, in a PDF you have to render the whole page to know what a small subset looks like.
Say you have a Page (black) which contains a complex shaped object (here it is a line but it could be any text, shape, image, etc.) and you want to extract a subset (red). You would have to first find all the objects that produce visible output in the region of interest. Then you would have to modify them so they are rendered correctly (in this case calculate the green points from the blue points while preserving the shape of the object).
An easier approach would be to include the whole page and clip the viewing area to the dimensions of the region.
You could do this with pdfjam. Check the --trim, --offset and --delta options in conjunction with a custom paper size (examples 6 and 7 on the pdfjam website). You would still have to calculate the coordinates of the region of interest somehow.
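Once the region of interest is known, turning it into trim values is simple arithmetic. The sketch below assumes PDF-style coordinates (origin at the bottom-left corner, units in points) and builds a pdfjam command line from them; the helper names are invented, and the exact combination of `--trim`, `--clip` and `--papersize` shown here is an assumption that should be checked against the pdfjam documentation.

```python
def trim_for_region(page_w, page_h, x0, y0, x1, y1):
    """Return (left, bottom, right, top) trim amounts for a region.

    (x0, y0) is the lower-left and (x1, y1) the upper-right corner of the
    region, measured from the page's bottom-left corner.
    """
    if not (0 <= x0 < x1 <= page_w and 0 <= y0 < y1 <= page_h):
        raise ValueError("region must lie inside the page")
    return (x0, y0, page_w - x1, page_h - y1)


def pdfjam_command(infile, outfile, page, page_w, page_h, region):
    """Build a pdfjam call that clips one page down to the given region."""
    x0, y0, x1, y1 = region
    left, bottom, right, top = trim_for_region(page_w, page_h, x0, y0, x1, y1)
    trim = f"{left}bp {bottom}bp {right}bp {top}bp"
    # Custom paper size matching the region, written as {width,height}.
    papersize = f"{{{x1 - x0}bp,{y1 - y0}bp}}"
    return ["pdfjam",
            "--trim", trim,
            "--clip", "true",
            "--papersize", papersize,
            "--outfile", outfile,
            infile, str(page)]
```

For an A4 page of 595 x 842 points and a paragraph occupying the box from (100, 600) to (300, 700), `pdfjam_command("in.pdf", "para1.pdf", 1, 595, 842, (100, 600, 300, 700))` yields a trim of `100bp 600bp 295bp 142bp`. Note that the clipped-away content is hidden rather than removed, as the answer points out.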

Is there a way to programmatically remove all blank pages from a PDF file?

Nowadays it is more practical to purchase an ebook than the dead-tree version. But the PDFs frequently contain the blank pages used by the print edition. I typically see between 10-30 blank pages (or pages with text "This page intentionally left blank.") per ebook. Is it possible to programmatically remove these blank pages? Currently I manually identify the blank pages and then run it through this:
pdftops orig.pdf - | psselect "$range_of_non_blank_pages" | ps2pdf - new.pdf
So the hard part is identifying the blank pages. pdftotext would work for the most part, except where the page has only images and no text.
Also, even after removing many pages and seeing the resulting file size is smaller, after shrinking both the original file and the new version (using various methods found on the internets), the original file is usually smaller by several hundred KB or more. So it appears the method I'm using to remove the blank pages doesn't create an optimal pdf. I've also tried various gui programs and see the same results in this respect.
Partial answer: you don't need to go via postscript (this is probably the reason why you get a bigger file). One possibility is
pdftk orig.pdf cat "$range_of_non_blank_pages" output new.pdf
To identify blank pages, you'd need to use a tool that can go beyond selecting and reassembling pages. Try a library for a scripting language, for example CAM::PDF or PDF::API2 in Perl.
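One possible way to find candidate blank pages, including image-only pages that pdftotext would miss, is to measure ink coverage per page. The sketch below assumes Ghostscript's `inkcov` device is available and prints one line per page of the form ` 0.00000  0.00000  0.00000  0.00000 CMYK OK`; that output format, the function names and the threshold parameter are assumptions of this example. The result is turned into page ranges suitable for the `pdftk ... cat` command shown above.

```python
import re
import subprocess

# One coverage line per page: four fractions followed by "CMYK".
INK_LINE = re.compile(r"^\s*([\d.]+)\s+([\d.]+)\s+([\d.]+)\s+([\d.]+)\s+CMYK")


def blank_pages_from_inkcov(output, threshold=0.0):
    """Return (blank_page_numbers, total_pages) from inkcov output text.

    A page counts as blank when every channel is at or below threshold.
    Pages saying "This page intentionally left blank" carry some ink, so
    they would need a small non-zero threshold and a manual check.
    """
    blank, page_no = [], 0
    for line in output.splitlines():
        match = INK_LINE.match(line)
        if match:
            page_no += 1
            if all(float(v) <= threshold for v in match.groups()):
                blank.append(page_no)
    return blank, page_no


def non_blank_ranges(total, blank):
    """Compress the pages to keep into pdftk-style ranges like '4-6'."""
    blank = set(blank)
    ranges, start = [], None
    for page in range(1, total + 2):      # one extra step closes the last run
        keep = page <= total and page not in blank
        if keep and start is None:
            start = page
        elif not keep and start is not None:
            end = page - 1
            ranges.append(str(start) if start == end else f"{start}-{end}")
            start = None
    return ranges


def remove_blank_pages(src, dst, threshold=0.0):
    """Measure ink with Ghostscript, then reassemble the PDF with pdftk."""
    out = subprocess.run(
        ["gs", "-q", "-o", "-", "-sDEVICE=inkcov", src],
        check=True, capture_output=True, text=True,
    ).stdout
    blank, total = blank_pages_from_inkcov(out, threshold)
    subprocess.run(["pdftk", src, "cat", *non_blank_ranges(total, blank),
                    "output", dst], check=True)
```

Because the pages are reassembled with pdftk rather than round-tripped through PostScript, this should also avoid the file-size growth described in the question.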
I don't know of an open source solution that can detect and remove blank pages. However, Apago's commercial PDF Enhancer can automatically remove blank pages, both vector and scanned. For scanned pages, it can remove scan artifacts such as black edges, hole punches and noise before determining whether a page is blank.