pdfposter: crop / tile / posterize long PDF to multi pages from Safari Export as PDF - pdf

When I save a webpage with Safari > File > Export as PDF...,
Safari renders one long PDF split into several (very long) pages.
Here is a screenshot of Preview's Crop Inspector:
The 200 inch height appears to be a distiller’s limit for PostScript, based on the Windows printer driver limitation.
Before saving I set Safari > Develop > Show Responsive Design Mode
for my iPad mini with a resolution of 768 x 1024 (portrait)
The beauty of this feature (unlike File > Print) is that it can be used with Safari in Responsive Design Mode, so an exact snapshot of the webpage (responsive layout, images and even dark modes) gets exported to PDF, without any print margins and such.
--> Now I want to cut / tile / crop / posterize / de-impose (or whatever one should call it) these long pages (200 inches, or 14400 pt) into more manageable page sizes.
So with Responsive Design Mode set to iPad mini (768 x 1024) I would like to cut to the same dimensions; a mediabox / cropbox of 768pt x 1024pt
I have already tried various command line tools like BRISS, PDFTILECUT, PLAKATIV, MUPDF etc.
Some libraries, like the Python binding PYMUPDF, seem to convert the PDF to an image first to get it cut, thus losing all the hyperlinks = NO go
So far I get a decent result with PDFPOSTER using the following command line; I have set the height of the --poster-size box to something really long (1000000pt):
pdfposter \
-v \
-m 768x1024pt \
-p 768x1000000pt \
Safari-Export-as-PDF-IN.pdf \
Safari-Export-as-PDF-OUT.pdf
That works for all the pages, one after the other, but I can’t find a solution to set the Y coordinates of the first page to 0
The pages always seem to start from the bottom of the poster size, leaving empty space at the top of the first page.
---------        =========
|       |        | xxxxx |
=========        | xxxxx |
| xxxxx |        | xxxxx |
---------        ---------
| xxxxx |        | xxxxx |
| xxxxx |   ->   | xxxxx |
| xxxxx |        | xxxxx |
---------        ---------
| xxxxx |        | xxxxx |
| xxxxx |        =========
| xxxxx |        |       |
=========        ---------
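
The right-hand layout in the diagram amounts to measuring the tiles from the top edge of the long page instead of from the bottom. Here is a minimal sketch of that arithmetic in plain Python. It assumes PDF user space with the origin at the bottom-left corner, and `tile_boxes` is a hypothetical helper name, not something pdfposter provides:

```python
def tile_boxes(page_h, tile_h):
    """Return (bottom, top) pairs in points, first tile at the top of the page.

    Assumes PDF coordinates: y = 0 is the bottom edge, y = page_h the top.
    Only the last tile may be shorter than tile_h.
    """
    boxes = []
    top = page_h
    while top > 0:
        bottom = max(top - tile_h, 0)
        boxes.append((bottom, top))
        top = bottom
    return boxes


# A 14400 pt Safari page cut into 1024 pt tiles
boxes = tile_boxes(14400, 1024)
print(len(boxes))   # number of output pages
print(boxes[0])     # first tile, flush with the top edge
print(boxes[-1])    # last tile, the short remainder at the bottom
```

Each pair could then serve as the bottom and top of a mediabox for one output page, so that any leftover space ends up on the last page rather than the first.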

OK, with a lot of testing I found out something: PDFPOSTER does not like PDFs generated from HTML
I first made a 100x200px box in Illustrator and exported that to a PDF.
Then I ran:
pdfposter -m 100x80pt -p 100x99999pt in-100x200.pdf out-100x200.pdf
This gives me a very nice result, the first page has a Crop Box of 100x40px and a Media Box of 100x80px, the rest of the pages Crop & Media Boxes of 100x80px
Then I made a very basic HTML file (I even left out the doctype)
<html>
<body style="background-color:white;margin:0;padding:0">
<div style="background-color:gold;width:100%;height:1500px"></div>
</body>
</html>
and ran:
pdfposter -m 767x1024pt -p 767x99999pt cleanHTML-IN.pdf cleanHTML-OUT.pdf
And get the first page with a white margin in the top, like in my initial problem.
So this is actually the Crop Box which does not seem to be set when using a PDF generated from HTML?
UPDATE:
Thanks to PDFPOSTER I have found my way to PYPDF.
Basically you define:
reader = PdfReader('in.pdf')
writer = PdfWriter()
I then loop over the pages page_x = reader.pages[i] from the input file, set mediaboxes for each "new" page (like photocopying) and append it to the writer with writer.add_page(page_x)
Finally write out with writer.write()
Regarding corrupt PDF files: PIKEPDF, a Python wrapper around QPDF, features automatic repairs just by opening and saving the file.
# pikepdf / pikepdf:
# https://github.com/pikepdf/pikepdf
# https://pikepdf.readthedocs.io/en/latest/
#
# py-pdf / pypdf:
# https://github.com/py-pdf/pypdf
# https://pypdf.readthedocs.io/en/latest/
import math
import os

import pikepdf
from pypdf import PdfReader, PdfWriter

# define, could become arguments
pagecut_h = 1024
inputfile = 'in.pdf'
outputfile = 'out.pdf'

# repair with PikePDF (note: this overwrites the input file)
print("repairing {0} .....".format(inputfile))
pdf = pikepdf.Pdf.open(inputfile)
pdf.save(inputfile + '.tmp')
pdf.close()
os.unlink(inputfile)
os.rename(inputfile + '.tmp', inputfile)

reader = PdfReader(inputfile)
writer = PdfWriter()
pages_n = len(reader.pages)
print('reading ..... {} input pages'
      .format(pages_n))

for i in range(pages_n):
    page = reader.pages[i]
    page_w = page.mediabox.width
    page_h = page.mediabox.height
    print('input page {}/{} [w:{}, h:{}]'
          .format(i + 1, pages_n, page_w, page_h))
    if page_h <= pagecut_h:
        print('> input page height is smaller than the cut height')
        print('appending original input page [w:{}, h:{}]'
              .format(page_w, page_h))
        writer.add_page(page)
    else:
        pagesfull_n = math.floor(page_h / pagecut_h)
        # height of the remainder; 0 when page_h is an exact multiple
        pagelast_h = page_h - (pagecut_h * pagesfull_n)
        pagesout_n = pagesfull_n + (1 if pagelast_h > 0 else 0)
        print('calculating .......... {} output pages'
              .format(pagesout_n))
        # first FULL page, measured from the top of the input page
        page.mediabox.left = 0
        page.mediabox.right = page_w
        page.mediabox.top = page_h
        page.mediabox.bottom = page_h - pagecut_h
        print('appending output page 1/{} [w:{}, h:{}]'
              .format(pagesout_n, page_w, pagecut_h))
        writer.add_page(page)
        # other FULL pages: re-use the same page object with a
        # shifted mediabox (like photocopying)
        for j in range(pagesfull_n - 1):
            page.mediabox.top -= pagecut_h
            page.mediabox.bottom -= pagecut_h
            print('appending output page {}/{} [w:{}, h:{}]'
                  .format(j + 2, pagesout_n, page_w, pagecut_h))
            writer.add_page(page)
        # LAST (not full) page, skipped when nothing is left over
        if pagelast_h > 0:
            page.mediabox.top = pagelast_h
            page.mediabox.bottom = 0
            print('appending last output page {}/{} [w:{}, h:{}]'
                  .format(pagesout_n, pagesout_n, page_w, pagelast_h))
            writer.add_page(page)

with open(outputfile, 'wb') as fp:
    writer.write(fp)

Related

Read a file from a position in Robot Framework

How can I read a file from a specific byte position in Robot Framework?
Let's say I have a process running for a long time writing a long log file. I want to get the current file size, then I execute something that affects the behaviour of the process and I wait until some message appears in the log file. I want to read only the portion of the file starting from the previous file size.
I am new to Robot Framework. I think this is a very common scenario, but I haven't found how to do it.
There are no built-in keywords to do this, but writing one in python is pretty simple.
For example, create a file named "readmore.py" with the following:
from robot.libraries.BuiltIn import BuiltIn


class readmore(object):
    ROBOT_LIBRARY_SCOPE = "TEST SUITE"

    def __init__(self):
        self.fp = {}

    def read_more(self, path):
        # if we don't already know about this file,
        # set the file pointer to zero
        if path not in self.fp:
            BuiltIn().log("setting fp to zero", "DEBUG")
            self.fp[path] = 0

        # open the file, move the pointer to the stored
        # position, read the file, and reset the pointer
        with open(path) as f:
            BuiltIn().log("seeking to %s" % self.fp[path], "DEBUG")
            f.seek(self.fp[path])
            data = f.read()
            self.fp[path] = f.tell()
            BuiltIn().log("resetting fp to %s" % self.fp[path], "DEBUG")

        return data
You can then use it like this:
*** Settings ***
| Library | readmore.py
| Library | OperatingSystem
*** test cases ***
| Example of "tail-like" reading of a file
| | # read the current contents of the file
| | ${original}= | read more | /tmp/junk.txt
| | # do something to add more data to the file
| | Append to file | /tmp/junk.txt | this is new content\n
| | # read the new data
| | ${new}= | Read more | /tmp/junk.txt
| | Should be equal | ${new.strip()} | this is new content
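
Since the question asks about byte positions, here is a minimal sketch of the same idea without the Robot Framework dependency, opening the file in binary mode so that seek() and tell() work in plain bytes. `OffsetReader` and the temporary file are illustrative names only, not part of any library:

```python
import os
import tempfile


class OffsetReader:
    """Remember a byte offset per file and return only what was appended."""

    def __init__(self):
        self.offsets = {}

    def read_more(self, path):
        # Binary mode, so seek()/tell() are plain byte offsets
        with open(path, "rb") as f:
            f.seek(self.offsets.get(path, 0))
            data = f.read()
            self.offsets[path] = f.tell()
        return data


# Demonstration with a throwaway file standing in for the log
fd, log_path = tempfile.mkstemp()
os.close(fd)
reader = OffsetReader()

with open(log_path, "ab") as f:
    f.write(b"old line\n")
first = reader.read_more(log_path)

with open(log_path, "ab") as f:
    f.write(b"this is new content\n")
second = reader.read_more(log_path)

os.remove(log_path)
print(first)
print(second)
```

The second call returns only the bytes written after the first call, which is the "tail-like" behaviour the keyword above provides.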

Ghostscript, Duplex print same side twice when NumCopies is bigger than 1

I am using the command:
-q -dBATCH -dNOPAUSE -dNODISPLAY -dPDFFitPage \
-c "mark /BitsPerPixel 1 \
/NoCancel true \
/NumCopies 2 \
/Duplex true \
/OutputFile (%printer%Ricoh c2051) \
/UserSettings << /DocumentName (Arquivo Teste) \
/MaxResolution 500 >> \
(mswinpr2)finddevice putdeviceprops setdevice" -f "C:\Test123.pdf"
The PDF file has 3 pages. When I set NumCopies to 2, for example, the result is 3 sheets:
page 1 = the text of page one, on both sides;
page 2 = the text of page two, on both sides;
page 3 = the text of page three, on both sides;
But when I set just one copy, the result is 2 sheets:
page 1: the text of page one, and on the other side the text of page two
page 2: the text of page three, with the other side blank
which is how duplex is supposed to work.
Does anyone know why this happens?
It's a consequence of the way mswinpr2 works: it doesn't care about you setting /Duplex true because the device is not a duplexing device (your printer obviously is, but that's not the same thing). In fact the majority of the command line will have no effect on the output.
When you set NumCopies, it prints each page 'NumCopies' times to the printer so if your printer is set to do duplexing then it prints the first copy of page 1 on the first side, then the second copy of page 1 on the second side (ie the back of page 1) then the first copy of page 2 on the third side (the front of page 2) and so on.
You cannot achieve multiple collated copies using the mswinpr2 device.
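
To make the effect concrete, here is a small simulation of the behaviour described above: every page is sent NumCopies times in a row and the printer simply fills sheet sides in the order it receives them. This is only an illustrative model in Python (`duplex_sheets` is a made-up name), not anything Ghostscript or the driver exposes:

```python
def duplex_sheets(page_count, num_copies):
    """Pair up sides the way a duplexing printer would when each page
    arrives num_copies times in a row (uncollated copies)."""
    sides = [p for p in range(1, page_count + 1) for _ in range(num_copies)]
    if len(sides) % 2:
        sides.append(None)  # blank back side on the last sheet
    return [(sides[i], sides[i + 1]) for i in range(0, len(sides), 2)]


print(duplex_sheets(3, 2))  # [(1, 1), (2, 2), (3, 3)]
print(duplex_sheets(3, 1))  # [(1, 2), (3, None)]
```

With two copies each sheet carries the same page on both sides, and with one copy the pages alternate front and back, matching the two results in the question.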
The command line you have set suggests that you have the PostScript option for your printer. You could instead use the ps2write device to convert the PDF to PostScript and send the PostScript to the printer. The latest version of Ghostscript allows for the injection of device-specific options into the output PostScript, so you could easily add NumCopies and Duplex there, assuming your printer has sufficient memory to do duplexing and NumCopies at the same time.

I want to put two landscape A5 pages (.ps or .pdf) on one portrait A4 page (.ps or .pdf)

I created a document in A5 size and managed to reshuffle the pages of the produced .pdf output with psbook, so that the pages have the perfect order for a booklet.
There are lots of hints that the next step would work with psnup, but that's not true. I also tried a2ps and pstops with various options. The last thing I found was bookletimposer (Ubuntu), but it has failed as well.
It seems so easy, because no scaling and no rotating is involved. Just put one page at position 0,0 and the following one at 0,14.85cm (half the height of the A4 page).
input:
+----------+
| this is |
| page one |
+----------+
+----------+
| this is |
| page two |
+----------+
output:
+----------+
| this is |
| page one |
| |
| this is |
| page two |
+----------+
Assuming you have a multipage PDF file, let's say consisting of 16 sequentially ordered pages:
1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16
and you already reordered the pages to have this new sequence:
16,1,2,15,14,3,4,13,12,5,6,11,10,7,8,9
and these pages are oriented in landscape mode
you can use, on this last file,
Multivalent.jar (last free version with pdf tools included)
https://rg.to/file/c6bd7f31bf8885bcaa69b50ffab7e355
with this syntax:
java -cp /path/to.../Multivalent.jar tool.pdf.Impose -dim 1x2 -paper A4 A5-already-huffled_for_imposing.pdf
since you have already the pages rightly shuffled for imposing:
16,1,2,15,14,3,4,13,12,5,6,11,10,7,8,9
Multivalent will pick
pages 16,1 ...and put on same page (front 1)
pages 2,15 ...and put on same page (back 1)
and so on... achieving the goal to create a perfectly imposed booklet
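
For reference, the shuffled sequence above follows a simple pattern that can be generated for any page count that is a multiple of four. A short Python sketch (`booklet_order` is a hypothetical helper, not part of Multivalent or psbook):

```python
def booklet_order(n):
    """Page order for a folded booklet; n must be a multiple of 4."""
    if n % 4:
        raise ValueError("page count must be a multiple of 4")
    order = []
    lo, hi = 1, n
    while lo < hi:
        # front of the sheet: (hi, lo); back of the sheet: (lo + 1, hi - 1)
        order += [hi, lo, lo + 1, hi - 1]
        lo += 2
        hi -= 2
    return order


order = booklet_order(16)
print(order)

# Consecutive pairs are what ends up together on one side of a sheet
sides = [tuple(order[i:i + 2]) for i in range(0, len(order), 2)]
print(sides[0], sides[1])
```

For 16 pages this reproduces the sequence shown above, and the first two pairs are (16, 1) for front 1 and (2, 15) for back 1.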

Extract area from pdf

I want to extract an area given by x-y coordinates from a pdf page. The extracted area may be stored as a page in a new pdf document. This needs to be done several times and so I would want the process to be scripted. Are there any tools / libraries that can help do this?
If iText (for Java) or iText(Sharp) (for .Net) are acceptable libraries for you, you can use them to import an existing page from some PDF as a template of which sections can be displayed in another PDF.
Have a look at the example TilingHero.java / TilingHero.cs from chapter 6 of iText in Action — 2nd Edition. The central code is:
PdfImportedPage page = writer.getImportedPage(reader, 1);
// adding the same page 16 times with a different offset
float x, y;
for (int i = 0; i < 16; i++) {
x = -pagesize.getWidth() * (i % 4);
y = pagesize.getHeight() * (i / 4 - 3);
content.addTemplate(page, 4, 0, 0, 4, x, y);
document.newPage();
}
As you see, the original page is imported once and different sections of it are displayed on different pages.
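
The offset arithmetic in that loop can be checked on its own. The following Python sketch only reproduces the x/y calculation for a 4x4 grid, assuming an A4-sized page of 595 x 842 pt as the example size; `tiling_offsets` is an illustrative name and does not call iText:

```python
def tiling_offsets(width, height, n=4):
    """Offsets at which an n-times enlarged page is placed so that each
    of the n*n output pages shows a different tile, top row first."""
    offsets = []
    for i in range(n * n):
        x = -width * (i % n)               # shift left by whole columns
        y = height * (i // n - (n - 1))    # shift down, starting at the top row
        offsets.append((x, y))
    return offsets


offsets = tiling_offsets(595, 842)
print(offsets[0])    # first tile: top-left corner of the enlarged page
print(offsets[15])   # last tile: bottom-right corner
```

Note that `i / 4` in the Java code is integer division, which is why the sketch uses `//`.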
(iText and iTextSharp are available either for free --- subject to the AGPL --- or commercially)
You may use 'pdftoppm' to do this task:
pdftoppm -f <first page> -l <last page> -jpeg -x <start x> -y <start y> -W <width> -H <height> <in file> > <out file>
For example, to crop an area of the first PDF page whose upper left corner is at (x,y) = (100,200), with a width of 50 and a height of 80 (all in pixels), and save it to a JPEG file, use:
pdftoppm -f 1 -l 1 -jpeg -x 100 -y 200 -W 50 -H 80 'my.pdf' > 'crop.jpg'
If you run into trouble with your document's resolution, you can use the '-r' option of 'pdftoppm' (see the man page of 'pdftoppm' for more).
Certainly, you can easily convert the JPEG file into a PDF, if needed.
Using ghostscript, you can crop the pdf the following way:
gs -o final.pdf -sDEVICE=pdfwrite \
-c "[/CropBox [x-left y-bottom x-right y-top] /PAGES pdfmark" \
-f original.pdf
x-left, y-bottom, etc., coordinates may be substituted with the required coordinates. Note that for gs, coordinates (0, 0) are at the left-bottom of the page.
This can then be easily scripted.
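
When scripting this, the two coordinate conventions are easy to mix up: pdftoppm takes the upper left corner in pixels at the render resolution, while a CropBox is given in points measured from the bottom left. A minimal conversion sketch in Python, assuming the page's MediaBox starts at (0, 0) and that the pixel values were measured at `dpi` (150 is pdftoppm's default); `cropbox_from_pixels` is a hypothetical helper:

```python
def cropbox_from_pixels(x, y, w, h, page_height_pt, dpi=150):
    """Convert a top-left pixel rectangle into [left, bottom, right, top]
    in PDF points (1 pt = 1/72 inch, origin at the bottom-left)."""
    scale = 72 / dpi
    left = x * scale
    right = (x + w) * scale
    top = page_height_pt - y * scale
    bottom = page_height_pt - (y + h) * scale
    return [left, bottom, right, top]


# Rectangle from the pdftoppm example, on an 842 pt high page rendered at 72 dpi
print(cropbox_from_pixels(100, 200, 50, 80, 842, dpi=72))
```

The four numbers can be substituted for x-left, y-bottom, x-right and y-top in the pdfmark.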

Convert subset of postscript file to pdf documents

I have a system that generates large quantities of PostScript files that each contain multiple, multi-page documents. I want to write a script that takes these large PostScript documents and outputs multiple PDF documents from each.
For example one postscript file contains 200 letters to customers, each of which is 10 pages long. This postscript file contains 2000 pages. I want to output from this 1 ps document, 200x 10 page PDFs, one for each customer.
I'm thinking GhostScript is the way to go for this level of document manipulation but I'm not sure the best way to go - Is there a function in GhostScript to take 'pages 1-10' of the input ps file? Do I have to output the entire ps file as 2000 separate ps files (1 per page) then combine them back together again?
Or is there a much simpler way of achieving my goal with something other than GhostScript?
Many Thanks,
Ben
Technically this will be possible in the next release of Ghostscript, or using the HEAD code in the Git repository. It is now possible to switch devices when using pdfwrite which will cause the device to close and complete the current PDF file. Switching back again will start a new one.
Combine this with a BeginPage and/or EndPage procedure in the page device dictionary, and you should be able to do something like what you want.
Caveat; I haven't tried any of this, and it will take some PostScript programming to get it to work.
Because of the nature of PostScript, there is no way to extract the 'N'th page from a file, so there is no way to specify a range of pages.
As lsemi suggests you could first convert to one large PDF file and then extract the ranges you want. Ghostscript is able to use the FirstPage and LastPage switches to do this (unlike PostScript, it is possible to extract a specific page from a PDF file).
Well, you might first make the PS into a PDF object collection (or directly generate a PDF from GhostScript by printing to the PDFWriter device), and then "cut" from the big PDF using pdftk, which would be quite fast.
Create the complete PDF file first with the help of Ghostscript:
gs \
-o 2000p.pdf \
-sDEVICE=pdfwrite \
-dPDFSETTINGS=/prepress \
2000p.ps
Use pdftk to extract PDF files with 10 pages each:
for i in $(seq 0 10 1999); do \
export start=$(( ${i} * 1 + 1 )); \
export end=$(( ${start} + 9 )); \
pdftk \
2000p.pdf \
cat ${start}-${end} \
output pages---${start}..${end}.pdf; \
done
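The page ranges for such a split can also be computed up front and checked before anything is run. A small Python sketch, assuming 2000 pages split into documents of 10 pages each as in the question; `page_ranges` is an illustrative helper, and the output file names simply mirror the pattern used in the loop:

```python
def page_ranges(total_pages, pages_per_doc):
    """1-based, inclusive (start, end) page ranges covering the whole file."""
    return [(start, min(start + pages_per_doc - 1, total_pages))
            for start in range(1, total_pages + 1, pages_per_doc)]


ranges = page_ranges(2000, 10)

# One pdftk argument list per customer document
commands = [
    ["pdftk", "2000p.pdf", "cat", f"{s}-{e}",
     "output", f"pages---{s}..{e}.pdf"]
    for s, e in ranges
]

print(len(commands))
print(" ".join(commands[0]))
print(" ".join(commands[-1]))
```

Each list could be handed to `subprocess.run` once pdftk is installed; here they are only printed so the ranges can be inspected first.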
You can have Ghostscript generate a 2000page sample+test PDF for you by first creating a sample PostScript file named '2000p.ps' with these contents:
%!PS
/H1 {/Helvetica findfont 48 scalefont setfont .2 .2 1 setrgbcolor} def
/pageframe {1 0 0 setrgbcolor 2 setlinewidth 10 10 575 822 rectstroke} def
/gopageno {H1 300 700 moveto } def
1 1 2000 {pageframe gopageno
4 string cvs
dup stringwidth pop
-1 mul 0 rmoveto
show
showpage} for
and then run this command:
gs -o 2000p.pdf -sDEVICE=pdfwrite -g5950x8420 2000p.ps