How to wget a website and then export all the pages into a single pdf? - pdf

IS there any way i can do that?

I did this once. It was almost 10 years ago, however, so I don't remember the details.
I used:
wget to download the pages
html2ps to convert the individual pages to PostScript
ps2ps to splice the individual PostScript files together
ps2ps again, to put 4 pages on 1
Then I sent the PostScript file to the printer. Since you want PDF you could add an additional step of ps2pdf.

Yes, you could write an AppleScript to automate Safari to do the printing (and the fetching, if you'd like, though you can drive wget from AppleScript using do shell script). If you don't fancy using AppleScript, you could use Automator instead; the result would be pretty much the same.
BTW if you aren't using Mac OS X, it wasn't clear from the question you asked :-)

Related

Replace words/phrases in existing PDF or docx with other words

I am trying to make a dynamic PDF generator as an .NET Core API. I want to take an existing PDF, or .docx file, and edit it so it replaces the current name (John Doe) with something that can be replaced like #NAME_PLACEHOLDER.
I then want to transform #NAME_PLACEHOLDER -> John Doe (or whatever is in the KeyValuePair or Dictionary<string, string>).
I am running this on a Docker environment, so I can easily execute commands and I am willing to do that as well.
So far I have tried a few things:
1) pdf2htmlEX
Executes as pdf2htmlEX file.pdf
Does the job pretty well
Can be converted back to PDF using Google Chrome headless or similar
Problem: Only the characters used in the PDF can be used to replace. So if I only use A, B, C as characters, it will make D into Times New Roman (or default font)
2) LibreOffice ODT to PDF
This was pretty nice, because I could simply unzip the .odt file, open content.xml, search and replace, then save it as an .odt file again
Could be converted into PDF rather easily using soffice --convert-to pdf
LibreOffice is quite nice
Problem 1: Microsoft Word -> Save as ODT tends to break the formatting, so we have to use LibreOffice to go and change it back again
Problem 2: We don't want to move away from Microsoft's Office suite
3) HTML to PDF using Chrome Headless
What you see is what you get
By far the best option, if we're all developers aaand have unlimited time
Problem 1: Only our developers can make changes, since our marketing department do not know HTML
Problem 2: Our existing PDFs would have to be rewritten in HTML
As you can see, I have tried a bunch of things. None of them, except Chrome Headless, has lived up to my expectations. What I really like about #3 is what you see is what you get. I can make the whole thing in HTML, press CTRL+P and see what it looks like as a finished PDF, basically.
I am looking for a better solution, though. It can be paid. It can be free. All I need is to change out words/phrases with other words dynamically, which apparently seems like a tough thing to do.
Thanks for specifying what you've already found clearly. It helps a lot providing a succinct answer.
The conversion is always tricky - I'm sure you know Word has trouble displaying/editing some Word documents itself.
I have experience regarding point #2 "LibreOffice ODT to PDF" and can suggest a few things to test:
Don't use Microsoft to do the docx->odt conversion. It's not good as you know. Use LibreOffice itself to do this step. The rest of your process remains the same.
For some documents, Libre Office does doc->odt much better. So, you can instead work with DOC format and get a better result without any other changes.
You won't be able to remove the devs from the process, but you can certainly reduce their role allowing your business/marketing teams to have more direct input simply by:
get the starting point document to the devs to run through the conversion process. The devs can "clean up" the document to make it convert nicely.
make this version of the document the "official" starting point. The business or technical teams can load it, adjust it, and put it back into the process.
if possible, expose a test-platform to the business teams so they can download, adjust, upload and render to PDF. This cycle means they will be able to achieve more and if they're good, do impressive stuff without any dev input.
the above steps simply mean don't expect perfect conversion of arbitrary complex documents. Starting from a (even complex) working baseline is great.
Some of that might show you that your #2 is actually going to get the best overall results.
I hope that helps.

Concatenate PDFs without ruining accessibility or PDF tags

Would appreciate your help with the following:
I have 2 partially accessible PDFs (containing tags), and I want to concatenate them using some command line tool (as PDFtk or Ghostscript, or any Perl module):
I've tried doing this with PDFtk and Ghostscript and both output a non accessible PDF without the original tags (each of the concatenated PDFs had tags).
Do you know of any way to implement this with one of the mentioned tools or some other command line tool for Linux?
(Not necessarily freeware)
Perl modules are also an option.
Thanks!
pdfunite in-1.pdf in-2.pdf in-n.pdf out.pdf
You can read more in a similar question
Solved - new version of iText works (the former, which was the newest when writing the message didn't work- only since 5.4.4 it works).
It's important to mention (was missing in documentation in the past) that
when concatenating documents in tagged mode, you must keep all readers
opened until the resultant document is closed, i.e.:
first:
document.close();
and only after this:
reader.close();

How to detect image in a document

How can I detect images in a document say doc,xls,ppt or pdf ?
I came across with Apache Tika, I am trying its command line option.
http://tika.apache.org/1.2/gettingstarted.html
But not quite sure how it will detect images.
Any help is appreciated.
Thanks
You've said you want to use a command line solution, and not write any Java code, so it's not going to be the prettiest way to do it... If you are happy to write a little bit of Java, and create a new program to call from Python, then you can do it much nicer!
The first thing to do is to have the Tika App extract out any embedded resources within your file. Use the --extract option for this, and have the extraction occur in a special temp directory you app controls, eg
$ java -jar tika.jar --extract ../testWORD_embedded_pdf.doc
Extracting 'image1.emf' (application/x-emf)
Extracting '_1402837031.pdf' (application/pdf)
Grab the output of the extraction if you can, and parse that looking for images (but be aware that some images have an application/ prefix on their canconical mimetype!). You might need to run a second --detect step on a few, I'm not sure, test how the parsers get on with the extraction.
Now, if there were images, they'll be in your test dir. Process them as you want. Finally, zap the temp dir when you're done with the file!
Having used Tika in the past I can't see how Tika can help with images embedded within Office documents or PDFs I was wrong to answer No. You will have may still try to resolve to native APIs like Apache POI and Apache PDFBox. Tika does use both libraries to parse text and metadata but no embedded image support.
Using Tika makes these APIs automatically available (side effect of using Tika).
UPDATE:
Since Tika 0.8: look for EmbeddedResourceHandler and examples - thanks to Gagravarr.

Can we extract pdf pages using lua scripts

Our application is receiving PDF file based on 150 pages from business line, I want to extract pages from this pdf file using lua scripts.
Any body share his experience.
Thanks
Sure, you can do this. As long as you write a Lua module that can read PDF files.
There are some Lua modules for writing PDFs, but none for reading them. No public ones, at any rate. You may want to switch to Python for this, as there are quite a few Python modules for dealing with PDFs.
You could write a Lua wrapper calling something like pdftk.

Gluing (Imposition) PDF documents

I have several A4 PDF documents which I would like (two into one) "glue" together into A3 format PDF document. So I will get from 2PDFs A4 a single one sided PDF A3.
I have found the excellent utility PDFToolkit and some others but none of them can be used to "glue" side by side two documents.
I just came across a nice tool on superuser.com called PDFjam that can do all of the above in a single command:
pdfjam --nup 2x1 file1.pdf file2.pdf --outfile DONESKI.pdf
It has other standard features like page size plus a nice syntax for more sophisticated collations of pages (the tricky page re-ordering necessary for true booklet-style page imposition).
It's built on top of TeX which is, whatever it is. Installing is a breeze on Ubuntu: you can just apt-get install pdfjam. On Mac OS, I recommend getting BasicTeX (google "mactex basictex"; SO thinks I'm a spammer and won't let me post the link).
This is a lot easier and more maintanable than installing both pdftk and Multivalent (on both Mac OS for dev and Ubuntu for deploy), which wasn't going so well for me anyway...!
Found the following (free and open-source) tool for doing Imposition called Impose (thanks danio for the tip). This solved my problem perfectly.
EDIT:
Here is how it's done:
Use PDF Toolkit to joint two PDF files into one (two A4)
pdftk File1.pdf File2.pdf cat output OutputFile.pdf
Create from this a single page (one A3):
java -cp Multivalent.jar tool.pdf.Impose -dim 2x1 -verbose -paper-size "42.2x29.9cm" -layout "1,2" OutputFile.pdf
I would like to advertise my pdftools
It's written in Python so should run on any platform. It's a wrapper to Latex (the pdfpages packages) but can do lot of things with a single command line: merge pdf files, nup them (multiple input pages per output page) and number the pages of the output file (you specify the location and the format of the number)
It still needs some work but I think it's quite stable to be usable right now :)
This puts two landscape letter pages onto a single portrait letter sheet, to be "bound" (i.e., folded) along the top.
pdftops $1 - |
psbook |
pstops -w11in -h8.5in '4:1#.65(.5in,0in)+0#.65(.5in,5.5in),2U#.65(8in,5.5in)+3#.65U(8in,11in)' |
ps2pdf - $(basename $1 .pdf).psbook.pdf
By the way, I do this often, so I'll probably submit more "answers" to this question just to keep track of successful pstops pagespecs. Let me know if this is an inappropriate use of SO.
A nice, powerful, open-source imposition tool is included
in the PoDoFo package:
http://podofo.sourceforge.net/
It works for me. Some imposition plans can be found at:
http://www.av8n.com/computer/prepress/
PoDoFo can do lots of other stuff, not just imposition.
Another useful imposition tool is Bookbinder (on the
quantumelephant site). It has a GUI that appeals to non-experts.
It is not as flexible or powerful as PoDoFo, but it can do
imposition.
pdftk is more-or-less essential to have, but it will not
do imposition.
pdfjam is useless to me, because there are a wide range of
valid pdf files that it cannot handle.
I've never been able to get multivalent to work, either.
What you want to do is imposition. There are commercial tools to impose PDFs such as ARTS crackerjack and Quite imposing but they are pretty expensive (US$500), require a copy of acrobat professional and are overkill for imposing 2 A4 pages to an A3 sheet.
On the Postscript side, a tool named pstops is able to rearrange pages of a Postscript file in any way you could imagine. I've not heard of such a tool for PDF. But pdf2ps and ps2pdf exist. So a not-so-ideal solution may be a combination of pdf2ps, pstops and ps2pdf.
I would combine the two A4 pages into one 2-page PDF using pdftk. Then Print to PDF using something like PrimoPDF, and tell it to print to A3 format, two pages per side.
I just tested this printing some slides from PowerPoint. It worked great. I selected A3 as my paper size in PowerPoint, and then chose to print 2 pages per side. Printed to Primo and voila, I have two A4 slides per A3.
You can put multiple input pages on one output page using BookletImposer.
And you can change page orders and combine multiple pdf files using PDF Mod.
With these two tools, you can do almost everything you want with pdf files (except editing their content).
I had a similar problem. I tried Impose but it was giving me an
Exception in thread "main" java.lang.NoClassDefFoundError: tool/pdf/Impose
Caused by: java.lang.ClassNotFoundException: tool.pdf.Impose
(...)
Could not find the main class: tool.pdf.Impose. Program will exit.
I then tried PDF Snake which isn't free or open source, but has a completely unrestricted 30-day trial version. It worked perfectly, after tweaking the parameters to achieve what I wanted. It's a great tool. I would definitely buy it if it wasn't so expensive! Anyway, I thought I'd leave my 2 cents in case anyone had the same problem I had with Impose.
look at this
http://sourceforge.net/projects/proposition/
It needs laTex to run,
but when it does, works really fine
Regards