In the context of my studies I often receive PDF files written in LaTeX, with big margins.
When I have to print those files, I like to print them with 2 pages per sheet to spare paper. But I then have a lot of white-space and the text is quite small.
So I'm looking for a way to scale the page contents first and only then print them 2 pages per sheet, to avoid losing space and to have the text as big and readable as possible.
Does anyone have an idea how I could do that, whether programmatically, with a script, or as a series of step-by-step commands?
(Note that I have no access to the LaTeX code, otherwise I would just change the margins...)
I used FinePrint to do this on Windows. But there are some alternatives, which I haven't tried:
https://superuser.com/questions/190869/fineprint-alternative-on-linux
https://superuser.com/questions/107687/good-virtual-printers-with-cropping-for-windows-and-linux
Here are previous answers (all mine) which provide building blocks that will help you construct your own programmatic or scripted or "some step-by-step commands" solution:
PDF Manipulation: "2-Up" page layout (SuperUser)
Linux-based tool to chop PDFs into multiple pages (SuperUser)
Convert PDF 2 sides per page to 1 side per page (SuperUser)
How can I split a PDF's pages down the middle? (SuperUser)
Cropping a PDF using Ghostscript 9.01 (StackOverflow)
Split one PDF page into two (StackOverflow)
PDF - Remove White Margins (StackOverflow)
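To get a feeling for how much cropping helps before going 2-up, here is a minimal Python sketch of the arithmetic only. It assumes A4 sheets (595 x 842 points), a landscape sheet split into two side-by-side cells, and margin values you have measured yourself; the function names are made up for illustration and nothing here touches a PDF file.

```python
def crop_box(page_w, page_h, left, bottom, right, top):
    """Return (llx, lly, urx, ury) in points after cutting off the given margins."""
    box = (left, bottom, page_w - right, page_h - top)
    if box[2] <= box[0] or box[3] <= box[1]:
        raise ValueError("margins are larger than the page")
    return box


def two_up_scale(page_w, page_h, sheet_w=595.0, sheet_h=842.0):
    """Scale factor that fits one page into half of a landscape sheet.

    Assumption: the portrait sheet (sheet_w x sheet_h) is rotated to landscape
    and split into two cells of size (sheet_h / 2) x sheet_w.
    """
    cell_w, cell_h = sheet_h / 2.0, sheet_w
    return min(cell_w / page_w, cell_h / page_h)


def gain_from_cropping(page_w, page_h, margins):
    """How much larger the text prints 2-up if the margins are cropped first."""
    llx, lly, urx, ury = crop_box(page_w, page_h, *margins)
    before = two_up_scale(page_w, page_h)
    after = two_up_scale(urx - llx, ury - lly)
    return after / before


if __name__ == "__main__":
    # Hypothetical A4 page with 100 pt margins on every side.
    print(round(gain_from_cropping(595, 842, (100, 100, 100, 100)), 2))
```

The crop box computed this way is the kind of value the cropping answers linked above need as input; the actual cropping and 2-up imposition would still be done with a PDF tool.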
I'm using pdftotext to extract info from a pdf. Currently using the -raw option. I do have a few problems with the PDFs I'm working with. If I select the text from top to bottom it selects in the following fashion.
PDF content:
A
B
C
It selects A then C and then B. So when I extract the text it is presented in the same way. Is there a way to reformat the PDF so I can select the content from top to bottom?
NOTE: I'm aware that if I omit the "raw" option the layout will be preserved, but it seems to be buggy when the document includes tables so raw works better for me.
Yes, you can reformat the PDF so that the content is returned from top to bottom. However, this is not something that can easily be done using Adobe Acrobat or any other viewer that I am aware of, and here is why.
From the documentation of pdftotext, the -raw option is defined as
Keep the text in content stream order. This is a hack which often "undoes" column formatting, etc. Use of raw mode is no longer recommended.
"content stream order" is the important piece in the description.
In PDFs, the content on the page does not have to be written in the content stream (the instructions that are interpreted to display the page) in the order that a human would read the content when the page is rendered. The internals of PDF do not care about the ordering; the format was designed to reproduce the same visualization of a document on a variety of platforms. Since all that matters to PDF is the visualization, applications or libraries that write PDF tend not to order the content stream in any meaningful way.
You can reorder the instructions in a content stream so that they are in the order a human would read them, but it is not an easy task to do by hand. Using a library that understands PDF to manipulate the content stream would be one way of doing this. Another way is to use a more advanced tool to extract text from the PDF (there are a number of tools that look at the placement of the content on the page rather than just where it appears in the content stream).
I am not aware of anything that will reorder the content stream in the PDF based on where the content appears on the page automatically though.
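To illustrate what "looking at the placement of the content" means, here is a minimal Python sketch that orders text fragments by their position instead of their content stream order. It assumes you already have each fragment's coordinates from some extraction library (that step is not shown), that coordinates follow the PDF convention with the origin at the bottom left, and that the page has a single column. The names `reading_order` and `line_tolerance` are made up for this example.

```python
from typing import List, Tuple

# (x, y, text) with the PDF convention: origin at the bottom-left of the page.
Fragment = Tuple[float, float, str]


def reading_order(fragments: List[Fragment], line_tolerance: float = 2.0) -> List[str]:
    """Group fragments into lines by y position and return the lines top to bottom."""
    # Higher y means higher on the page, so sort by descending y first.
    ordered = sorted(fragments, key=lambda f: (-f[1], f[0]))
    lines = []  # each entry: (y of the first fragment on the line, [(x, text), ...])
    for x, y, text in ordered:
        if lines and abs(lines[-1][0] - y) <= line_tolerance:
            lines[-1][1].append((x, text))
        else:
            lines.append((y, [(x, text)]))
    # Within a line, read left to right.
    return [" ".join(text for _, text in sorted(items)) for _, items in lines]


if __name__ == "__main__":
    # Content stream order is A, C, B, but on the page B sits between A and C.
    stream = [(72.0, 700.0, "A"), (72.0, 500.0, "C"), (72.0, 600.0, "B")]
    print(reading_order(stream))
```

Multi-column layouts and tables need more than this simple sort, which is part of why layout-aware extraction tools are more involved.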
How can I add a multipage background (e.g. odd and even backgrounds) to a 10,000-page PDF while keeping the output file as small as possible?
I produce massively multipage documents (e.g. 10,000 pages in one document). Each page has a background, which I apply as follows:
I have a lot of .dvi documents, which I join using dviconcat;
next I run dvipdf on the joined .dvi;
and then I use pdftk to apply the background, using pdftk infile.pdf background bg.pdf output outfile.pdf
This way I get a fairly small file, e.g. 200 MB, compared to the situation where I produce a lot of .pdf files with backgrounds and join them using pdftk, in which case the resulting file is e.g. 2 GB.
I think this is because the background is not repeated on every page: its copy is stored in the PDF only once, and the pages merely reference it.
Unfortunately, I now need a two-sided background: one background for odd pages and a different one for even pages. pdftk doesn't know how to do that. I could prepare a 10,000-page background, but it would be huge (e.g. 1 GB).
Any suggestions on how I could accomplish this without handling multi-gigabyte files? Is it doable at all? If yes, with pdftk or with some other tool?
One solution would be to add the background when you convert PostScript to PDF. Using a BeginPage procedure you can paint the background before you paint the page contents. By checking the page count in BeginPage you can choose which background to paint, so you can have different ones for even/odd/whatever pages.
If you specify each background as a PostScript form, then your BeginPage can be small. Also (and rather more importantly) the current version of Ghostscript, 9.14, will attempt to pass PostScript forms into a PDF file as PDF forms, and it can identify and consolidate duplicates, so it 'should' only embed each form once. This should result in the minimum possible file size.
However, this code is at an early stage of development and might not work for you. Also, you'll need to do some PostScript programming.
I'm not familiar with pdftk, but would it be possible to produce all the even pages and add a background to them, produce all the odd pages and add a different background, and then use pdftk to merge and interleave the pages?
NB Ghostscript doesn't handle .dvi files, so I'm rather at a loss to know how you use Ghostscript to 'join' them. Also, if you are somehow creating the PostScript files using Ghostscript, you would almost certainly be better off using Ghostscript to produce the PDF file directly. (I'm assuming here that you are using Ghostscript's ps2pdf, but even if you aren't it'll still be quicker to produce the PDF in one step, and almost certainly produce better output too)
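The split, background, and interleave idea could be sketched like this in Python. The sketch only builds the pdftk command lines; it assumes a pdftk version that supports the odd/even page qualifiers and the shuffle operation, and the intermediate file names are invented for illustration. Whether the result stays as small as the single-background case has not been verified here.

```python
import os


def odd_even_background_commands(infile, bg_odd, bg_even, outfile, workdir="."):
    """Return the pdftk command lines for applying different odd/even backgrounds.

    The intermediate file names (odd.pdf, even.pdf, ...) are hypothetical.
    """
    odd = os.path.join(workdir, "odd.pdf")
    even = os.path.join(workdir, "even.pdf")
    odd_bg = os.path.join(workdir, "odd_bg.pdf")
    even_bg = os.path.join(workdir, "even_bg.pdf")
    return [
        # 1. Split the document into its odd and even pages.
        ["pdftk", infile, "cat", "1-endodd", "output", odd],
        ["pdftk", infile, "cat", "1-endeven", "output", even],
        # 2. Apply a single-page background to each half.
        ["pdftk", odd, "background", bg_odd, "output", odd_bg],
        ["pdftk", even, "background", bg_even, "output", even_bg],
        # 3. Interleave the halves again: A1, B1, A2, B2, ...
        ["pdftk", "A=" + odd_bg, "B=" + even_bg, "shuffle", "A", "B", "output", outfile],
    ]


if __name__ == "__main__":
    for cmd in odd_even_background_commands("infile.pdf", "bg_odd.pdf", "bg_even.pdf", "outfile.pdf"):
        print(" ".join(cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)` once you have checked the commands against a small test document.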
I have a PDF file which contains several slides per page, including text (not only images).
This PDF was probably created using pdfnup.
Can I revert the pdfnup operation so that each slide is shown on one page?
As far as I know, there is no simple, ready-to-use 'undo' operation.
However, the following answers show the basic approach for achieving the equivalent of an undo using Ghostscript:
Convert PDF 2 sides per page to 1 side per page (Superuser)
How can I split a PDF's pages down the middle? (Superuser)
Cropping a PDF using Ghostscript 9.01 (Stackoverflow)
PDF - Remove White Margins (Stackoverflow)
(If these do not help you find the final solution, ask again. But to come up with a fully working command line, I'd first need the complete output of the following command: pdfinfo -f 1 -l 100 -box your.pdf.)
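The common core of those answers is to crop each sheet once per slide. A minimal Python sketch of the box arithmetic follows; it assumes the slides sit in a uniform grid that fills the whole MediaBox with no gaps or frames (real pdfnup output may differ), and the function name `nup_cells` is made up. The MediaBox values would come from the pdfinfo output mentioned above.

```python
def nup_cells(llx, lly, urx, ury, cols, rows):
    """Return one crop box (llx, lly, urx, ury) per grid cell.

    Cells are listed in reading order: row by row from the top,
    left to right within a row. Coordinates are in PDF points with
    the origin at the bottom-left corner.
    """
    cell_w = (urx - llx) / cols
    cell_h = (ury - lly) / rows
    boxes = []
    for row in range(rows):
        top = ury - row * cell_h
        for col in range(cols):
            left = llx + col * cell_w
            boxes.append((left, top - cell_h, left + cell_w, top))
    return boxes


if __name__ == "__main__":
    # Hypothetical landscape A4 sheet holding two slides side by side.
    for box in nup_cells(0, 0, 842, 595, cols=2, rows=1):
        print(box)
```

Each box can then be used as the crop region for one output page, producing one page per slide.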
I found that there are a lot of tools available for breaking big PDF files into smaller ones by splitting the original PDF file page-wise. For example, if I have a 10-page PDF document, I can break the original PDF file into 10 pieces, one per page.
But I want a similar kind of tool that splits the PDF file into units smaller than a page. That means I need to split a PDF page into different documents based on some parameter like paragraph, section, element...
For example:
If my PDF file has 2 pages with 10 paragraphs, then I would like to split it into 10 separate PDF files, one per paragraph.
Also, I strongly believe PDF does not contain any structure like Open XML does. But I am also wondering:
How are the tools able to break PDF files into small PDF files by splitting page-wise? What kind of mechanism are they using for page-wise splitting?
So, is there any way to do what I need? Please give me your suggestions.
PDF is a vector-based document description language. It's page-based, so in a way every page is independent of the next one. Splitting page-wise is therefore pretty easy. In contrast to a raster image, where you can extract small subsets independently, in a PDF you have to render the whole page to know what a small subset looks like.
Say you have a page (black) which contains a complex-shaped object (here it is a line, but it could be any text, shape, image, etc.) and you want to extract a subset (red). You would first have to find all the objects that produce visible output in the region of interest. Then you would have to modify them so that they are rendered correctly (in this case, calculate the green points from the blue points while preserving the shape of the object).
An easier approach would be to include the whole page and clip the viewing area to the dimensions of the region.
You could do this with pdfjam. Check the --trim/--offset/--delta options in conjunction with a custom paper size (Examples 6 and 7 on the pdfjam website). You would still have to calculate the coordinates of the region of interest somehow, though.
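Once you know the region of interest, turning it into trim values is simple arithmetic. Here is a minimal Python sketch; it assumes the region is given in PDF points with the origin at the bottom left, that trim values are ordered left, bottom, right, top, and that pdfjam accepts `bp` (big point) lengths for --trim and --papersize. Treat the exact command line as an assumption to check against the pdfjam examples, and the function names as invented for illustration.

```python
def trim_for_region(page_w, page_h, region):
    """Amounts to trim from (left, bottom, right, top) so only `region` remains."""
    llx, lly, urx, ury = region
    if not (0 <= llx < urx <= page_w and 0 <= lly < ury <= page_h):
        raise ValueError("region must lie inside the page")
    return (llx, lly, page_w - urx, page_h - ury)


def pdfjam_command(infile, outfile, page_w, page_h, region):
    """Build a pdfjam command line that clips the page to `region` (assumed syntax)."""
    trim = " ".join("{:g}bp".format(v) for v in trim_for_region(page_w, page_h, region))
    width = region[2] - region[0]
    height = region[3] - region[1]
    papersize = "{{{:g}bp,{:g}bp}}".format(width, height)
    return ["pdfjam", "--trim", trim, "--clip", "true",
            "--papersize", papersize, infile, "--outfile", outfile]


if __name__ == "__main__":
    # Hypothetical paragraph occupying a 300 x 200 pt area on an A4 page.
    print(pdfjam_command("in.pdf", "paragraph.pdf", 595, 842, (100, 500, 400, 700)))
```

Finding the region for each paragraph automatically is the hard part and is not covered by this sketch.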
Nowadays it is more practical to purchase an ebook than the dead-tree version. But the PDFs frequently contain the blank pages used by the print edition. I typically see between 10-30 blank pages (or pages with text "This page intentionally left blank.") per ebook. Is it possible to programmatically remove these blank pages? Currently I manually identify the blank pages and then run it through this:
pdftops orig.pdf - | psselect "$range_of_non_blank_pages" | ps2pdf - new.pdf
So the hard part is identifying the blank pages. pdftotext would work for the most part, except where the page has only images and no text.
Also, even after removing many pages and seeing the resulting file size is smaller, after shrinking both the original file and the new version (using various methods found on the internets), the original file is usually smaller by several hundred KB or more. So it appears the method I'm using to remove the blank pages doesn't create an optimal pdf. I've also tried various gui programs and see the same results in this respect.
Partial answer: you don't need to go via PostScript (this is probably the reason why you get a bigger file). One possibility is
pdftk orig.pdf cat "$range_of_non_blank_pages" output new.pdf
To identify blank pages, you'd need to use a tool that can go beyond selecting and reassembling pages. Try a library for a scripting language, for example CAM::PDF or PDF::API2 in Perl.
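As a rough illustration of the text-based part of the detection, here is a minimal Python sketch. It assumes you feed it the output of pdftotext, which separates pages with form feed characters, and it treats a page as blank if it has no text or only an "intentionally left blank" notice. As noted in the question, pages that contain only images would wrongly be reported as blank, so this is not a complete solution. The function names are made up for the example.

```python
import re

# Matches pages whose only text is the usual "intentionally left blank" notice.
BLANK_NOTICE = re.compile(r"^this page (is )?intentionally left blank\.?$", re.IGNORECASE)


def non_blank_pages(text):
    """Return the 1-based numbers of pages that contain real text.

    `text` is assumed to be pdftotext output, with a form feed after each page.
    Caveat: image-only pages have no text and are therefore treated as blank.
    """
    pages = text.split("\f")
    if text.endswith("\f"):
        pages = pages[:-1]  # drop the empty piece after the final form feed
    keep = []
    for number, page in enumerate(pages, start=1):
        content = " ".join(page.split())  # collapse all whitespace
        if content and not BLANK_NOTICE.match(content):
            keep.append(number)
    return keep


def to_ranges(numbers):
    """Compress sorted page numbers into pdftk-style ranges, e.g. '1 3 5-6'."""
    ranges = []
    start = prev = None
    for n in numbers:
        if start is None:
            start = prev = n
        elif n == prev + 1:
            prev = n
        else:
            ranges.append((start, prev))
            start = prev = n
    if start is not None:
        ranges.append((start, prev))
    return " ".join(str(a) if a == b else "{}-{}".format(a, b) for a, b in ranges)


if __name__ == "__main__":
    sample = "Title\f\fChapter 1\fThis page intentionally left blank.\fMore\fEnd\f"
    print(to_ranges(non_blank_pages(sample)))
```

The resulting ranges can be passed to pdftk as separate arguments, for example `["pdftk", "orig.pdf", "cat", *ranges.split(), "output", "new.pdf"]`.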
I don't know of an open source solution that can detect and remove blank pages. However, Apago's commercial PDF Enhancer can automatically remove blank pages, both vector and scanned. For scanned pages, it can remove scan artifacts such as black edges, hole punches and noise before determining whether a page is blank.