Split long PDF page into multiple pages - pdf

What is the best method for splitting one very long PDF page into separate pages? In this case, the one-page image is made up of what were originally multiple letter-size pages, with a black line where each page should be divided. To be clear, it is a single PDF document with a single page. That single page is an image of hundreds of pages, so it is a very long image.
https://filebin.net/h2wiqckndsugnr1o/sample_pdf_long3.pdf
The pages within the image are not all the same size because white space was removed from some of the letter-size pages, so some are longer than others.
This explains the issue: https://dustinfreeman.org/blog/pdf-splitting/ However, it does not offer a solution for page breaks that are not aligned correctly.
Is there software, or a way to programmatically split the single image into multiple pages within the single PDF document?

I would suggest this approach:
Create XObject from the contents of the first page.
Create a number of smaller pages.
Draw the XObject on each page using negative top offset.
Different parts of the XObject will be visible on different pages. Size of the file won't increase much because the image will be reused.
You will need to calculate the top offset and size for each page. You can do this manually, of course, or you can use a computer vision algorithm to find the horizontal black lines. You will have to extract the image first. Given the array of coordinates for these lines, you will be able to calculate the page bounds.
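The calculation described above can be sketched in plain Python. This is a minimal illustration, not a complete tool: it assumes you have already extracted the image and reduced it to one number per pixel row (the fraction of dark pixels in that row), and the function names, the 0.9 threshold, and the sample data are all hypothetical. It also assumes PDF's native coordinate system, where y grows upward from the bottom-left corner; a library with a top-left origin would use the slice's top edge instead.

```python
def find_separator_rows(row_darkness, threshold=0.9):
    """Return (start, end) row ranges (end exclusive) of each dark separator line.

    row_darkness[i] is the fraction of dark pixels in image row i (0.0 to 1.0).
    The threshold is an assumed value; tune it for the actual scan.
    """
    runs = []
    start = None
    for i, value in enumerate(row_darkness):
        if value >= threshold:
            if start is None:
                start = i
        elif start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(row_darkness)))
    return runs


def page_bounds(separators, image_height):
    """Turn separator runs into (top, bottom) pixel ranges of the pages between them."""
    bounds = []
    top = 0
    for sep_start, sep_end in separators:
        if sep_start > top:
            bounds.append((top, sep_start))
        top = sep_end
    if top < image_height:
        bounds.append((top, image_height))
    return bounds


def pdf_placements(bounds, image_height, page_height_pt):
    """For each slice, return (new_page_height_pt, y_offset_pt).

    The y offset moves the slice's bottom edge to y = 0 on the new page,
    so it is zero or negative.
    """
    scale = page_height_pt / image_height
    placements = []
    for top, bottom in bounds:
        height = (bottom - top) * scale
        y_offset = bottom * scale - page_height_pt
        placements.append((height, y_offset))
    return placements


# Hypothetical data: a 300-row image with two 2-row black lines.
rows = [0.0] * 100 + [1.0] * 2 + [0.0] * 80 + [1.0] * 2 + [0.0] * 116
separators = find_separator_rows(rows)
bounds = page_bounds(separators, len(rows))
for height, y_offset in pdf_placements(bounds, len(rows), 300.0):
    print(height, y_offset)
```

Each (height, offset) pair gives the size of one new page and the vertical shift to apply when drawing the shared XObject on it. The separator rows themselves are left out of every slice.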

Related

Printing multiple pages on one page in landscape orientation

I'm trying to print four landscape-oriented pages of a document in a grid on one page in landscape-orientation using VBA with:
ActiveDocument.PageSetup.Orientation = wdOrientationLandscape
ActiveDocument.PrintOut PrintZoomRow:=2, PrintZoomColumn:=2
This however is printing the four small landscape-oriented pages in a grid on a portrait-oriented page, which leaves them too small and with too much free space between them vertically.
I looked at the documentation for PrintOut, but didn't find anything concerning orientation.
I tried reversing the order of the PrintZooms.
I also tried manually configuring the width and height of the printed paper with PrintZoomPaperWidth and -Height, which led to the small pages being cut off and the printed page still being in portrait mode.
This just doesn't seem to be possible in the current version of Office (2019), neither with code nor the UI.
As a workaround, one could take screenshots, change the orientation to portrait and paste them in rotated 90° or use rotated textboxes in Word.
Alternatively and probably much easier, create a PDF and use a PDF reader capable of printing this way, e.g. Adobe Reader.

How to get rid of whitespace when printing documents?

A PDF with lecture slides contains huge whitespaces (ca. 50% of actual slide dimensions on each border). How can I get rid of those in a printout?
NOTE: Printer settings are not useful. Zooming is not possible, as this immediately cuts into one side of the slides and toggling automatic centering does not solve this issue either. Need document level solution!
Is there a function for this in common word processing programs? I have imported the PDF into LibreOffice Draw. The slides are imported as images, but I do not want to rescale 60 images on 30 pages by hand.
Source: http://www.cs.toronto.edu/~kyros/courses/418/Lectures/lecture.2010f.02.pdf

PhantomJS image captures images of different dimensions despite constant page content

I am trying to use PhantomJS image capture to capture an image of the browser.
Each time I run the image capture function, the dimensions of the image are slightly different. For example, one run gives 1400x5185; if I open the same URL a few hours later, I get 1399x5185 or 1400x5186.
I have tried cropping from the top-left corner, but then the pixels are slightly skewed.
Note: The content of the page is always constant.
How do I ensure that I always get the same image dimensions without cropping the pixels?
Something probably changes on the page, otherwise there is no reason for PhantomJS to render different images.
You should check the differences of the images in detail. Ads are probably the culprit when they are not uniformly formatted. If you identified the changing DOM elements, you can use casper.evaluate() to access the DOM and remove/hide those elements before capturing the screenshot.
You could also set a fixed viewport size, for example 1920x1080, using casper.viewport(). If the page scrolls vertically, then only the y-dimension might change. If you want to be sure, then change the viewport size to 1400x5187.

Is it possible to remove the background of a text block in pdf using ghostscript

I am trying to convert a PDF into TIFF using Ghostscript. Is it possible to remove the background (grey color) of a text block (black font color) in a PDF using Ghostscript? I would like to replace the grey background with white.
Appreciate your help!!
I don't think you'll get a generic solution to your problem because there are many different ways such a background may be coded in your PDF and there is no sure way to distinguish such a background from a rectangular form of some vector image.
PDF essentially offers a set of tools for positioning glyphs and vector graphics in some rectangle (page) to display, and some additional tools to add interactivity (e.g. forms). Thus, a colored background in a PDF is generally created by drawing a line along the edge of the background area, filling this form with the desired color, and positioning glyphs and graphics (text and images) atop it. There are other operators that can be used, though, and many variants of their use, and generally the form created is not marked as background.
In the answer Dingo refers to in his comment, a rectangle covering the whole page, actually even a bit more (in case of a fairly common choice of media box), is drawn (m: move to a corner; 4*l: draw the 4 edge lines; h: close the path; f: fill the form).
Thus, please make the PDF in question available for inspection, maybe there is some specific solution for your file.
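To make the operator sequence mentioned above concrete, here is a small Python sketch that builds such a content-stream fragment. It is only an illustration of the m / l / h / f pattern: the function name, the gray level, and the page size are assumptions, and a real file may code its background quite differently (for example with the `re` rectangle operator or an RGB fill).

```python
def background_rect_ops(width, height, gray=0.8):
    """Build content-stream operators that fill a page-sized gray rectangle.

    All values are illustrative; nothing here is taken from a real file.
    """
    lines = [
        f"{gray} g",            # set the fill (non-stroking) gray level
        "0 0 m",                # m: move to the lower-left corner
        f"{width} 0 l",         # l: bottom edge
        f"{width} {height} l",  # l: right edge
        f"0 {height} l",        # l: top edge
        "0 0 l",                # l: left edge, back to the start
        "h",                    # h: close the path
        "f",                    # f: fill the enclosed form
    ]
    return "\n".join(lines)


print(background_rect_ops(612, 792))
```

Nothing in this sequence says "background". It is an ordinary filled path, which is why it cannot be told apart from any other rectangular vector graphic in general.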

Parse PDF to get the boundaries of an image in the PDF page in Objective-C

I have an iPad app that displays PDF pages. I need to add annotations on the image (if one exists on the PDF page), for which I need the coordinates at which the image is situated in the PDF page. I am able to get the image data from the XObject and the image width and height, but I also need the x and y coordinates of the image. Any idea how to obtain the coordinates of the image by parsing the PDF page?
I'm assuming you have seen this Apple developer page describing how to parse XObjects: http://developer.apple.com/library/mac/#documentation/GraphicsImaging/Conceptual/drawingwithquartz2d/dq_pdf_scan/dq_pdf_scan.html
XObjects do not contain any position data, as they just describe image data that can be reused throughout the PDF.
From http://itext-general.2136553.n4.nabble.com/finding-the-position-of-xobject-in-an-existing-pdf-td2157152.html
"An XObject is a stream that can be reused in many different
other streams. For instance: you could have an image XObject
of a logo that appears on every page in the document.
Suppose that you have some pages in landscape and some in portrait.
Then the logo will have different coordinates on these different
pages. Therefore the position of the XObject IS NEVER STORED with
the XObject, the position can be found in the stream that refers
to the XObject.
Maybe your reaction is: "Oh right, then it's simple: I have to
look in the content stream of the pages using the XObject."
Yes and no. That's indeed where you should look, but it's not
simple. Because the actual position depends on the current
transformation matrix of the state at the moment the image is
added. It's quite some programming work to parse the content
stream and calculate the position of an XObject."
I think you should find another option and avoid this altogether.
If you're still determined, you will have to use CGPDFScanner and track the transforms through the page.
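The bookkeeping involved can be sketched in Python to show the logic, independent of the iOS API. This is a simplified illustration under stated assumptions: the content stream is already decoded to text, tokens are split on whitespace (so strings, inline images, and arrays are not handled), and every `Do` is treated as an image even though it can also draw a form XObject. The function names are hypothetical. It relies on the PDF rule that an image is painted into the unit square, which the current transformation matrix (CTM) maps onto the page.

```python
def multiply(m1, m2):
    """Return m1 x m2 for PDF matrices (a, b, c, d, e, f): apply m1 first, then m2."""
    a1, b1, c1, d1, e1, f1 = m1
    a2, b2, c2, d2, e2, f2 = m2
    return (
        a1 * a2 + b1 * c2,
        a1 * b2 + b1 * d2,
        c1 * a2 + d1 * c2,
        c1 * b2 + d1 * d2,
        e1 * a2 + f1 * c2 + e2,
        e1 * b2 + f1 * d2 + f2,
    )


def transform(m, x, y):
    """Map the point (x, y) through the matrix m."""
    a, b, c, d, e, f = m
    return (a * x + c * y + e, b * x + d * y + f)


def image_boxes(content):
    """Map each XObject name drawn with Do to a list of (x0, y0, x1, y1) page boxes."""
    ctm = (1.0, 0.0, 0.0, 1.0, 0.0, 0.0)  # identity matrix
    stack = []      # graphics states saved by q
    operands = []   # tokens seen since the last handled operator
    boxes = {}
    for token in content.split():
        if token == "q":
            stack.append(ctm)
        elif token == "Q":
            ctm = stack.pop()
        elif token == "cm":
            new = tuple(float(v) for v in operands[-6:])
            ctm = multiply(new, ctm)  # the new matrix is applied before the old CTM
        elif token == "Do":
            corners = [transform(ctm, x, y) for x, y in ((0, 0), (1, 0), (0, 1), (1, 1))]
            xs = [p[0] for p in corners]
            ys = [p[1] for p in corners]
            name = operands[-1].lstrip("/")
            boxes.setdefault(name, []).append((min(xs), min(ys), max(xs), max(ys)))
        else:
            operands.append(token)
            continue
        operands = []
    return boxes


# Hypothetical content stream: a 200x100 pt image placed at (50, 400).
print(image_boxes("q 200 0 0 100 50 400 cm /Im1 Do Q"))
```

With CGPDFScanner, the same state (a CTM plus a stack) would be kept in callbacks registered for the q, Q, cm, and Do operators, for example via CGPDFOperatorTableSetCallback.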