Is this possible to break the pdf file smaller than page wise breaking? - pdf

I found there is a lot of tools available for breaking the Big PDF files into smaller one by splitting the original PDF file PAGE WISE.for example, if i have a 10 page PDF Document,then we can able to break the original pdf file into 10 pieces in page wise splitting.
But i want similar kind of tool that breaks the PDF file smaller than the Page wise splitting.That means,i need to split the PDF page into different documents based on any parameter like paragraph,section,element...
for example,
If my PDF file having 2 pages with 10 paragraphs then i would like to split the pdf file into 10 separate Pdf file based on paragraph parameter...
Also, I strongly believe pdf does not contain any structure like Open XML.But i also Suspecting
How the tools can able to break the pdf files in to small pdf files by splitting page wise? What kind of mechanism they are using for page wise splitting PDF File?
So, Is there any way to do my work? Please give me your valuable suggestion on this?

PDF is a vector based document description language. It's page based so in a way every page is independent from the next one. Splitting page wise is therefore pretty easy. Contrary to a raster image where you can extract small subsets independently in a pdf you have to render the whole page to know how a small subset looks like.
Say you have a Page (black) which contains a complex shaped object (here it is a line but it could be any text, shape, image, etc.) and you want to extract a subset (red). You would have to first find all the objects that produce visible output in the region of interest. Then you would have to modify them so they are rendered correctly (in this case calculate the green points from the blue points while preserving the shape of the object).
An easier approach would be to include the whole page and clip the viewing area to the dimensions of the region.
You could do this with pdfjam. Check the --trim/--offset/--delta command in conjunction with a custom paper size (Example 6,7 on the pdfjam website). You would still have to somehow calculate the coordinates of the region of interest though.

Related

"Re-paginate" PDF using iText

Disclaimer:
I am using iText 5. I know this is generally frowned upon (vs. using iText 7), but I am working with considerable legacy code that uses iText 5, and upgrading does not fall under my control.
Requirements:
A "simple" PDF/A is received as input (text only, these are generated from RTF), as well as a float value corresponding to a desired first page length in inches.
A PDF/A must be output that is identical to the input PDF, except it is paginated as follows: first page length = input value; each subsequent (not first or last) page will fill a standard page length; the last page will be truncated a constant number of points below the content nearest the bottom of the page. Note that input and output width will be identical and constant.
Progress / Approach:
I have extended the SimpleTextExtractionStrategy to generate XML containing font information (size and family, bold or italics, etc.) as well as location information (relative an absolute coordinate system where the origin is at the top left corner of the first page of the input PDF) for each "span" of text extracted from the input PDF.
I then generate a new PDF page by page (where each page is the desired length according to the requirements outlined above), filtering the extracted XML info with LINQ based on the bounds of each new page, and adding appropriately formatted text at the appropriate location using ColumnText.ShowTextAligned(...).
Problem:
The approach outlined above does fine. It generates PDFs with the desired page structure, but some information is lost in translation, namely colored text and underlined text. While colored text shouldn't be seen in these PDFs, underlined text absolutely must be detected.
This set of requirements should also include PDFs with tables. I originally planned on implementing a different module that adheres to the same interface for table PDFs, as these are generated and used separately from the PDFs generated from RTF, and iText has relatively strong table functionality built in.
The two concerns outlined above, coupled with the fact that my described approach was born out of an attempt to reuse existing code leads me to believe that an entirely different approach may be necessary or at least much better. It seems to me that there should be a way to capture content byte info and clip it as necessary to "re-paginate" the input PDF, only worrying about moving content that falls along a page boundary.
Essentially, I am looking for (iText based) recommendations for a better approach. Pseudo-code type answers or simply recommendations for classes / interfaces that may help are acceptable. While it would be nice to handle text and tables together, any advice pertinent to one or the other would also be appreciated. I have perused much of the available documentation on the iText website and other SO questions, but have not found quite what I'm looking for.
Note that no code is included in this question as I am looking for a high-level approach that is entirely different from what I have tried.
Edit:
I didn't notice it before, but the way in which I was reusing fonts (similar to this) resulted in some unexpected (but documented as such) behavior. It seems that I will need to avoid extracting information for re pagination at the text level, as it will be difficult to ensure continuity of fonts between input and output.
I solved this problem a while ago, but figured I would post my solution. I'm sure it's not the most efficient solution, but it works well for my purposes. Note that this will re-paginate a PDF as described in the question containing text only. Table PDF's are handled separately.
The basic process is this:
Use a custom TextExtractionStrategy to extract XML containing information regarding ascent and descent lines for all text in the input PDF, as well as what page it originally appears on.
Given the page length requirements as described in the question (first page = input value, subsequent = standard length, last page = fit content) and the XML info regarding text positions, determine what content will fit on each page of the output PDF. Create a map of where each input page will need to be cropped (top and bottom, note that each input page may be cropped more than once), as well as a map of which cropped pages will need to be "concatenated" together in the final output.
Copy the input PDF page by page to an intermediate temporary PDF (using PdfCopier). If an input page must be cropped more than once (ex: first 2 inches of input page 1 = page 1 output, next 6 inches of input page 1 = page 2 output, final 0.5 inch of input page 1 = top of page 3 output), ensure that it is copied the appropriate number of times (1 time per crop).
Crop each page of the intermediate copied PDF appropriately. This is done by modifying the MediaBox and / or CropBox.
Concatenate the appropriate cropped pages together into the final output PDF's pages. I used a PdfWriter to first create a new page of the appropriate height, then add each appropriate cropped page at the appropriate position in the output PDF page's byte content usingcontentByte.AddTemplate(inputCroppedPage, 0, bottomOfLastAddedCroppedPage).
To anyone who managed to read and understand all of that, congratulations. To anyone else, please let me know what you if you are confused. The solution described above is a little twisted and tough to put into words. While there is too much code to post here (and I am not at liberty to share the code on GitHub or similar), I would be happy to answer any questions that will help someone else implement something similar.
The TextExtractionStrategy mentioned in step 1 was inspired by this answer. Essentially, I used System.Xml.Linq to create an XML document rather than concatentating strings to form HTML, and I ignored any font information, storing only information regarding where text is located in the page (you'll see that this information is available in the linked answer, just isn't written into the final HTML).

Adjust PDF scale to print

In the context of my studies I often receive PDF files written in LaTeX, with big margins.
When I have to print those files, I like to print them with 2 pages per sheet to spare paper. But I then have a lot of white-space and the text is quite small.
So I'm looking for a way to scale the page contents first and only then print them 2 pages per sheet, to avoid losing space and to have the text as big and readable as possible.
Has anyone an idea of how I could do that either programmatically, or scripted, or on a "step-by-step commands" basis ?
(Note that I have no access to the LaTeX code, otherwise I would just change the margins...)
I used FinePrint to do this on windows. But there are some alternatives, which I haven't try:
https://superuser.com/questions/190869/fineprint-alternative-on-linux
https://superuser.com/questions/107687/good-virtual-printers-with-cropping-for-windows-and-linux
Here are previous answers (all mine) which provide building blocks that will help you construct your own programmatic or scripted or "some step-by-step commands" solution:
PDF Manipulation: "2-Up" page layout (SuperUser)
Linux-based tool to chop PDFs into multiple pages (SuperUser)
Convert PDF 2 sides per page to 1 side per page (SuperUser)
How can I split a PDF's pages down the middle? (SuperUser)
Cropping a PDF using Ghostscript 9.01 (StackOverflow)
Split one PDF page into two (StackOverflow)
PDF - Remove White Margins (StackOverflow)

Possible to control PDF layout with iText?

I'm writing some logic to build a large single PDF file that our users can print at their convenience. I'm using Java's iText library (through Clojure's clj-pdf).
I'm trying to have the PDF show the same exact template form on every single page, however I can't seem to find any documentation or indication that one can have PDF content "fit to a page".
The text in these forms varies a little bit, so there's a chance it might require more of fewer text lines per page. This means that the content has a chance of spilling over to the next page, or being too short, making the next page creep up into the previous one, breaking the requirement of "one form per page" for the rest of the document.
I'm trying to figure out if my option is pretty much only to manually check the length of the text on each page and potentially crop it by hand if I goes over n lines, or if the PDF format somehow supports a smart way of having paragraphs+tables+headings all fit in one page. Some UI systems allow you to control how spill-over is handled, anywhere from cropping to resizing the font, so I'm curious if PDF supports anything of that sort.
Edit: ended up going with pagebreaks for simplicity, wasn't aware of that option when I wrote this question.
If you want to take control over the space taken by text, for instance to fit it on a single page, the way to go would be to create a ColumnText object and to add the content in simulation mode. If the text fits the page, add it for real. If it doesn't, use a smaller font size. This is demonstrated in the MovieAds example where snippets of text are fitted into AcroForm fields.

Extract titles from each page of a PDF?

I am working on a project, SIGGRAPH Image Wall.
My first challenge is to figure out how to extract titles of each page in a PDF, SIGGRAPH 2013 Technical Papers First Pages (44 MB PDF).
This PDF is a compilation of the first page of each papers.
Therefore, there is a paper title for each page, a little different from
the traditional scholar paper.
Does anyone have any idea for this?
I think you can accomplish this using any of a number of text extraction approaches, though I will caution that getting to 100% accuracy will be tricky...
Some possible tools to use:
pdftotext or pdf2txt - Simple and easy cross-platform extraction utilities.
PDFNet - Robust SDK for digging into PDFs and pulling out exactly the data you want.
Perl modules: PDF::API2, CAM::PDF - I'm a Perl guy so I'd go this route, but I'm sure similar libraries exist in Python, Ruby, etc.
Your source pages look reasonably consistent - I feel like you'll be able to make some smart guesses about where on the page your content will be and what it'll look like. I'd try this out:
Inspect the PDF manually to figure out the title font name and size.
Extract text information for the top portion of the page (something like the top 150 pixels). Make sure to extract font info.
This should get all of your title text and maybe some author names. Parse this data (either within the script you write, or in the XML output files from pdftotext, etc.), keeping only the words that match your title font info.
If the title font varies, you'll need to guess what the title font is for each page and differentiate it from author names (the only other content you should get from the top of the page) which you can probably do simply by comparing font sizes.

PDF Colo(u)r Analysis (without Acrobat itself ?)

Is there a library/tool which would list all colours used in a PDF document ?
I'm sure Acrobat itself would do this but I would like an alternative (ideally something that could be scripted).
So the idea is if you have a very simple PDF document with four colours in it the output might say :
RGB(100,0,0)
RGB(105,0,0)
CMYK(0,0,0,1)
CMYK(1,1,1,1)
You could explore the insides with pdfbox, but you would have to write some code to find and catalog all those colors.
Most PDF tools have access to this information but no api to access it. You could take any tool and add it in
Apago PDFspy generates an XML file containing all kinds of metadata extracted from PDF files. It reports color usage including spot colors.
We recently added a function called GetPageColorSpaces(0) to the Quick PDF Library - www.quickpdflibrary.com to retrieve much of the ColorSpace info used in the document.
Here is some sample output.
Resource,\"QuickPDFCS2eb0f578\",Separation,\"HKS 52 E\",DeviceCMYK,0.95,0,0.55,0
Resource,\"QuickPDFCSb7b05308\",Separation,\"Black\",DeviceCMYK,0,0,0,1
Resource,\"QuickPDFCSd9f10810\",Separation,\"Pantone 117 C\",DeviceCMYK,0,0.18,1,0.15
Resource,\"QuickPDFCS9314518c\",Separation,\"All\",DeviceCMYK,0,1,0,0.5
Resource,\"QuickPDFCS333d463d\",Separation,\"noplate\",DeviceCMYK,1,0,0,0
Resource,\"QuickPDFCSb41cafc4\",Separation,\"noprint\",DeviceCMYK,0,1,0,0
Resource,\"Cs10\",DeviceN,Black,Colorant,-1,-1,-1,-1
Resource,\"Cs10\",DeviceN,P1495,Colorant,-1,-1,-1,-1
Resource,\"Cs10\",DeviceN,CalRGB,Colorant,-1,-1,-1,-1
Resource,\"Cs10\",Separation,\"P1495\",DeviceCMYK,0,0.31,0.69,0
XObject,\"R29\",Image,,DeviceRGB,-1,-1,-1,-1
Disclaimer: I work at Atalasoft.
Our product, DotImage with the PDF Reader add-on, can do this. The easiest way is to rasterize the page and then just use any of our image analysis tools to get the colors.
This example shows how to do it if you want to group similar colors -- the deployed example will only work for PNG and JPEG, but if you download the code, it's trivial to include the add-on and get PDF as well (let me know if you need help)
Source here:
http://www.atalasoft.com/cs/blogs/31appsin31days/archive/2008/05/30/color-scheme-generator.aspx
Run it here:
http://www.atalasoft.com/31apps/ColorSchemeGenerator
If you are working with specific and simple PDF documents from a constrained source then you may be able to find the colors by reading through the content stream. However this cannot be a generic solution.
For example PDF documents can contain gradients or transparency. If your document contains this type of construct then you are likely to end up with a wide range of colors rather than a specific set.
Similarly many PDF documents contain bitmapped images. Given that these will need to be interpolated to be displayed at different resolutions, the set of colors in a displayed PDF may be bigger or different to (though obviously broadly similar to) the embedded bitmap.
Similarly many PDF documents contain constructs in multiple color spaces that are rendered into different color spaces. For example a PDF might contain a DeviceRGB bitmap, a line in an ICC based CMYK color and a Lab based rectangle. The displayed version might be in sRGB for display or CMYK for print. Each of these will influence the precise set of colors.
So the only 100% valid answer is going to be related to a particular render of a PDF at a particular resolution to a particular color space. From the resultant bitmap you can determine the colors that have been used.
There are a variety of PDF libraries that will do this type of render including DotImage (referenced in another answer) and ABCpdf .NET (on which I work).