Split by bookmark in middle of PDF page - pdf

For a little open-source project I am trying to split PDFs based on their bookmark information. I found several ways to split them by page, but sadly the PDFs I need to split have bookmarks in the middle of a page. So if I have 3 pages and two bookmarks, one on the first page and one in the middle of the second page, I need to extract 2 PDFs with 2 pages each and no overlapping text between the two PDFs.
Does something like that exist?
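As a rough illustration of the example above (3 pages, one bookmark on page 1 and one in the middle of page 2), the split can be planned as a list of vertical page bands per bookmark. This is only a sketch: `Bookmark`, `Slice` and `plan_sections` are hypothetical names, it assumes every bookmark carries an explicit vertical target, that all pages share one height, and that bookmark titles are unique. The actual cropping is left to whatever PDF library is used. Note that setting a crop box usually only hides the content outside the band; it does not remove it from the page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bookmark:
    title: str
    page: int    # zero-based page index of the bookmark target
    top: float   # y coordinate of the target in PDF units (origin at the bottom)


@dataclass(frozen=True)
class Slice:
    page: int        # zero-based page index
    y_bottom: float  # lower edge of the band to keep
    y_top: float     # upper edge of the band to keep


def plan_sections(bookmarks, page_count, page_height):
    """Map each bookmark title to the page bands that belong to its section."""
    # Reading order: earlier pages first, higher positions on a page first.
    ordered = sorted(bookmarks, key=lambda b: (b.page, -b.top))
    plan = {}
    for i, bm in enumerate(ordered):
        if i + 1 < len(ordered):
            # The section ends where the next bookmark starts.
            end_page, end_y = ordered[i + 1].page, ordered[i + 1].top
        else:
            # The last section runs to the bottom of the last page.
            end_page, end_y = page_count - 1, 0.0
        slices = []
        for page in range(bm.page, end_page + 1):
            y_top = bm.top if page == bm.page else page_height
            y_bottom = end_y if page == end_page else 0.0
            if y_top > y_bottom:  # skip empty bands
                slices.append(Slice(page, y_bottom, y_top))
        plan[bm.title] = slices
    return plan


if __name__ == "__main__":
    marks = [Bookmark("A", 0, 792.0), Bookmark("B", 1, 400.0)]
    for title, slices in plan_sections(marks, page_count=3, page_height=792.0).items():
        print(title, slices)
```

With these inputs each section gets two bands, which matches the "2 PDFs with 2 pages each" expectation: the second page is shared, but the bands do not overlap.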

Related

Adding a pdf footer conditionally on certain pages in a multi-page pdf document

I have a web application that generates a PDF for each request. The data in these PDFs differs based on user information, and the number of pages can vary from 6 to 9. To construct the PDFs, I have multiple PdfPTables, each with its own cells. Once all the PdfPTables are constructed, I add them to the document as a final step.
Recently I got a requirement that whenever a particular text occurs, we need to add a footer indicating the occurrence of this text on the respective pages. This can be on page 3, on page 6, or on both. I have been trying to figure out a way to do this.
One approach is to detect this text at the time it is added to the PdfPCell and then generate a footer. But at that stage I don't know which page of the document the cell will end up on, because I let the table flow onto the next page if it doesn't fit on the current one.
Another approach is to parse the entire PDF before sending the response back: take the pages one by one, extract the text, compare it against the search text and, if it is found, add a footer. Somehow I feel this is a costly operation.
Please let me know if any of you have a suggestion for this.
Any help would be highly appreciated.
Thanks,
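The second approach from the question (scan each page's text after the document is built) reduces to a small lookup once the text of every page has been extracted. The sketch below assumes the per-page text is already available as a list of strings; `pages_needing_footer` is a hypothetical helper, and the extraction itself depends on the PDF library in use.

```python
def pages_needing_footer(page_texts, marker):
    """Return the 1-based numbers of the pages whose text contains `marker`.

    Matching is case-insensitive and ignores differences in whitespace,
    because extracted PDF text often contains stray line breaks.
    """
    def normalize(text):
        return " ".join(text.split()).casefold()

    needle = normalize(marker)
    return [
        number
        for number, text in enumerate(page_texts, start=1)
        if needle in normalize(text)
    ]


if __name__ == "__main__":
    texts = ["Intro", "Terms: Special\nOffer applies", "Nothing here"]
    print(pages_needing_footer(texts, "special offer"))
```

The list of page numbers can then drive a second pass that stamps the footer only on those pages.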

Adjust PDF scale to print

In the context of my studies I often receive PDF files written in LaTeX, with big margins.
When I have to print those files, I like to print them with 2 pages per sheet to save paper. But I then have a lot of white space and the text is quite small.
So I'm looking for a way to scale the page contents first and only then print them 2 pages per sheet, to avoid losing space and to have the text as big and readable as possible.
Does anyone have an idea of how I could do that either programmatically, or scripted, or on a "step-by-step commands" basis?
(Note that I have no access to the LaTeX code, otherwise I would just change the margins...)
I used FinePrint to do this on Windows. But there are some alternatives, which I haven't tried:
https://superuser.com/questions/190869/fineprint-alternative-on-linux
https://superuser.com/questions/107687/good-virtual-printers-with-cropping-for-windows-and-linux
Here are previous answers (all mine) which provide building blocks that will help you construct your own programmatic or scripted or "some step-by-step commands" solution:
PDF Manipulation: "2-Up" page layout (SuperUser)
Linux-based tool to chop PDFs into multiple pages (SuperUser)
Convert PDF 2 sides per page to 1 side per page (SuperUser)
How can I split a PDF's pages down the middle? (SuperUser)
Cropping a PDF using Ghostscript 9.01 (StackOverflow)
Split one PDF page into two (StackOverflow)
PDF - Remove White Margins (StackOverflow)
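The "scale first, then print 2-up" idea from the question comes down to one scale factor and one offset per page. A minimal sketch, assuming the bounding box of the actual content is already known (for example from a margin-detection tool); `fit_scale` is a hypothetical helper and all coordinates are in PDF units with the origin at the bottom left.

```python
def fit_scale(page_w, page_h, content_box):
    """Return (scale, dx, dy) that enlarges `content_box` to fill the page.

    `content_box` is (left, bottom, right, top). The aspect ratio is kept,
    and the scaled content is centered on the page. A point (x, y) of the
    original page ends up at (scale * x + dx, scale * y + dy).
    """
    left, bottom, right, top = content_box
    content_w = right - left
    content_h = top - bottom
    if content_w <= 0 or content_h <= 0:
        raise ValueError("content box must have a positive width and height")
    # Use the smaller ratio so the content fits in both directions.
    scale = min(page_w / content_w, page_h / content_h)
    dx = (page_w - scale * content_w) / 2 - scale * left
    dy = (page_h - scale * content_h) / 2 - scale * bottom
    return scale, dx, dy


if __name__ == "__main__":
    print(fit_scale(200, 300, (50, 50, 150, 250)))
```

The resulting values can be fed into whichever tool applies the page transformation before the 2-up step.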

Creating acrobat links in source document

I've used Acrobat Pro for many years but have only scratched the surface of its capabilities.
Here's the problem I'm facing. I'm sure that Acrobat can deal with it, I just can't figure out how...
So... I have a wiki-based content management system from which I need to regularly produce packs of content (consisting of up to 250 articles), packaged as a single PDF.
I have the WebKit HTML-to-PDF conversion engine up and running and have written code (JavaScript/jQuery) that selects the relevant articles (based on user input), sorts them alphabetically and PDFs them into a single Acrobat document. The process creates simple bookmarks, one for each article, based on its title.
So far, so good...
I also dynamically create a second document that contains a Table of Contents and index.
An example... The wiki contains pages:
All about Foo
All about Bar
All about Baz
These are exported into a single PDF, in the order
All about Bar
All about Baz
All about Foo
And bookmarks are created that link to each 'page'
The ToC and index are then created as temp wiki pages (using JavaScript and jQuery) and exported to PDF, e.g.:
1 All about Bar
2 All about Baz
3 All about Foo
And
Bar 1
Baz 2
Foo 3
Systems 1,2,3
Kittens 2,3
Etc etc
This has worked fine and everyone has been happy. However, in the nature of users everywhere, they want more. Specifically, they want to merge the ToC and Index into the PDF (that part's easy) and link on the article numbers.
That's where my knowledge and the results I can pull out of Google come to a halt. I can do it manually of course but with three or four of these packs per week, each containing between 20 and 350 articles and well over a thousand index entries - that would be a world of pain.
As the ToC and Index are generated by scripts, I can wrap the numbers in any sort of html markup.
I wondered about something like
4
But I don't see how to pick up on that on merge so that the HTML links are converted to PDF ones.
Any ideas or pointers?

Quartz-2D: spotting text other than the main text in PDF book pages

I would like to know if it is possible (with Quartz 2D) to programmatically recognize and handle the text at the top (or bottom) of a PDF page that shows the page number, the paragraph title or other information telling you where you are in the book. Is it just text like the main text on the page, or can it somehow be distinguished?
The page number (if printed on the page) is no different to any other text on the page (there are other kinds of page numbers in a PDF file however).
Some kinds of PDF (PDF/A-1a, 'tagged' PDF) do have things like page numbers and titles marked in a separate way, but in the general case PDF files are neither of these and the page number or titles are indistinguishable from the remainder of the text.

Is it possible to split a PDF file into pieces smaller than a page?

I found that there are a lot of tools available for breaking big PDF files into smaller ones by splitting the original PDF file page-wise. For example, if I have a 10-page PDF document, I can break the original PDF file into 10 pieces with page-wise splitting.
But I want a similar kind of tool that breaks the PDF file into pieces smaller than a page. That means I need to split a PDF page into different documents based on some parameter like paragraph, section, element...
For example, if my PDF file has 2 pages with 10 paragraphs, I would like to split it into 10 separate PDF files based on the paragraph parameter.
Also, I strongly believe PDF does not contain any structure like Open XML. But I am also wondering how the tools are able to break PDF files into smaller PDF files by splitting page-wise. What kind of mechanism are they using for page-wise splitting?
So, is there any way to do this? Please give me your valuable suggestions.
PDF is a vector-based document description language. It is page-based, so in a way every page is independent of the next one. Splitting page-wise is therefore pretty easy. Unlike a raster image, where you can extract small subsets independently, in a PDF you have to render the whole page to know what a small subset looks like.
Say you have a Page (black) which contains a complex shaped object (here it is a line but it could be any text, shape, image, etc.) and you want to extract a subset (red). You would have to first find all the objects that produce visible output in the region of interest. Then you would have to modify them so they are rendered correctly (in this case calculate the green points from the blue points while preserving the shape of the object).
An easier approach would be to include the whole page and clip the viewing area to the dimensions of the region.
You could do this with pdfjam. Check the --trim/--offset/--delta options in conjunction with a custom paper size (Examples 6 and 7 on the pdfjam website). You would still have to somehow calculate the coordinates of the region of interest, though.
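Once the region of interest is known, turning it into trim amounts is simple arithmetic. The sketch below assumes the trim value is given as four amounts in the order left, bottom, right, top (the order used by the LaTeX graphics options that pdfjam builds on) and that the region is expressed in PDF points, written as `bp` in TeX units. `trim_for_region` and `format_trim` are hypothetical helper names.

```python
def trim_for_region(page_w, page_h, region):
    """Return (left, bottom, right, top) amounts to trim off a page.

    `region` is the area to keep as (left, bottom, right, top) in page
    coordinates with the origin at the bottom-left corner.
    """
    left, bottom, right, top = region
    if not (0 <= left < right <= page_w and 0 <= bottom < top <= page_h):
        raise ValueError("region must lie inside the page and be non-empty")
    return (left, bottom, page_w - right, page_h - top)


def format_trim(trim, unit="bp"):
    """Format the four trim amounts as one space-separated string."""
    return " ".join(f"{value:g}{unit}" for value in trim)


if __name__ == "__main__":
    # Keep a band from the upper half of an A4-sized page (595 x 842 points).
    trim = trim_for_region(595, 842, (50, 400, 545, 800))
    print(format_trim(trim))
```

The printed string is meant to be passed as the value of --trim, together with clipping turned on (--clip true) so that content outside the region is not shown. The clipped content is still present in the output file; it is only hidden.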