We've just moved hosting to a new server and hosting company. After the move we noticed there's a function we missed during testing. The function creates PDFs with pdftk:
exec('pdftk mypdf.pdf fill_form mydata.fdf output myoutput.pdf flatten', $output);
The problem is that there's no pdftk on the new server, and we couldn't get it installed either. Only three PDF-related tools are available.
PDF-Tools and HTMLDoc don't work due to a lack of functions.
Is there a way to replace the command line we have at the moment with Ghostscript, or is there maybe a PHP library I haven't found yet?
I am happily converting docx files to PDF via the command line (controlled via C# process calls) out of my service.
Unfortunately I could not find any internet search results on how to set the options for the output PDF that the GUI offers me. I am specifically looking for generating PDF/A and tagged PDF via the command line.
Anyone ever done this and knows how to do that?
Obviously getting a PDF/A can be done by using unoconv instead.
On Windows, one would use the following command line in a checked-out unoconv repository:
python.exe .\unoconv -f pdf -eSelectPdfVersion=1 C:\temp\libre\renderingtest.docx
I did not find further information on how to select other things (tagged PDF etc.) and where to get a complete list of the options that are available.
EDIT: It seems one can try the different options in the GUI. The settings get saved to C:\Users\<userName>\AppData\Roaming\LibreOffice\4\user\registrymodifications.xcu. One can then look up the changed setting and pass it to unoconv like this:
python.exe .\unoconv -f pdf -eUseTaggedPDF=1 -eSelectPdfVersion=1 C:\temp\libre\renderingtest.docx
Still not sure if I am doing this correctly though.
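As a rough sketch of how the command above could be assembled from a service, the Python below builds the same unoconv argument list from a dictionary of export filter options. The helper names build_unoconv_command and convert, and the default script path, are made up for illustration; only the -f and -e switches and the two option names come from the post, and nothing here checks that LibreOffice actually honours them.

```python
import subprocess
import sys


def build_unoconv_command(source, export_options, unoconv_script="unoconv"):
    """Return the argument list for a unoconv PDF conversion.

    export_options maps export filter option names to values, e.g.
    {"SelectPdfVersion": 1}; each entry becomes one -eNAME=VALUE switch.
    unoconv_script is an assumed path to the checked-out unoconv script.
    """
    command = [sys.executable, unoconv_script, "-f", "pdf"]
    for name, value in export_options.items():
        command.append(f"-e{name}={value}")
    command.append(source)
    return command


def convert(source, export_options, unoconv_script="unoconv"):
    """Run the conversion; raises CalledProcessError if unoconv fails."""
    subprocess.run(
        build_unoconv_command(source, export_options, unoconv_script),
        check=True,
    )


if __name__ == "__main__":
    # Same options as the command line in the post.
    print(build_unoconv_command(
        r"C:\temp\libre\renderingtest.docx",
        {"UseTaggedPDF": 1, "SelectPdfVersion": 1},
    ))
```

Keeping the options in a dictionary makes it easy to try one setting at a time after looking it up in registrymodifications.xcu.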
The Gotenberg project shows how that can be done using unoconv.
$ curl --request POST 'http://localhost:3000/forms/libreoffice/convert' --form 'files=@"doc.docx"' --form 'nativePdfFormat="PDF/A-1a"' -o pdfA.pdf
The following question has to do with installation of the MaTeX package to Mathematica, and the difficulties I encounter in making it compatible with Inkscape’s Textext (LaTeX addon).
I first summarize my problem in Long story short (I have the detailed series of events in Long story). I then present my questions in Questions and supply some additional information regarding versions of the various programs in Supplementary Information.
Long story short
I am having issues with using both Textext add-on in Inkscape and MaTeX package in Wolfram's Mathematica. I have tried uninstalling and re-installing all related Inkscape programs but nothing seems to change.
Long story
I am using Inkscape to produce figures with LaTeX code (using textext according to this guide https://people.orie.cornell.edu/jmd388/design/guides/textext.pdf). I have previously installed Textext, and Inkscape was working well – allowing me to include LaTeX text in my figures.
I am also using Wolfram Mathematica. To include LaTeX text in Mathematica I needed to install the MaTeX package (from here https://github.com/szhorvat/MaTeX). However, once I did this, Textext stopped working.
I have uninstalled and reinstalled all of Inkscape's related programs (pstoedit, Ghostscript, GSview, ImageMagick, Textext and Inkscape itself), but MaTeX still won't work. Textext seems to be working now, but MaTeX does not.
The error Mathematica gives when running the MaTeX package is the following:
MaTeX::gserr: Error while running Ghostscript.
After examining this issue, I realized that the problem might originate from the Ghostscript version. I ran the following line in the command prompt:
gswin64c.exe -o mt-gs.pdf -dNoOutputFonts -sDEVICE=pdfwrite mt.pdf
and the outcome I obtain is
**** Could not open temporary file ''
****Unable to open the initial device, quitting.
But when I only put
gswin64c.exe -o mt-gs.pdf -dNoOutputFonts mt.pdf
Ghostscript seems to operate (that is, a PDF window pops up and immediately closes).
Additionally, when I try to run GS on a different PDF file, I get the following error:
Could not open the scratch file encoded_file_ptr_0.
+ c:\users\cjl\artifex\gs-release'9.21\ghostpdl-9.21\base\gdevp14.c:6044: gs_pdf14_devide_push<>: Fatal
GPL Ghostscript 9.21: Unrecoverable error, exit code 255
where the same file works on a different computer's GS (so the file should be OK).
Overall, I cannot use MaTeX at the moment since I get this error, which forces me to produce figures in Mathematica and move them into Inkscape to include axis labels and other notations (such that the fonts are consistent).
What is wrong with my Ghostscript? How can I fix it?
Has anyone encountered such difficulties before (making the Textext and MaTeX packages work at the same time)?
Does anyone have an idea of how to fix MaTeX/Textext such that both would work?
Supplementary information
Here are the specs of my OS, as well as the versions of the different involved programs:
Windows 7 64-bit OS.
Mathematica version for Windows 64-bit.
Inkscape version 0.48
Ghostscript version 9.21
pstoedit and importps version 3.7
ImageMagick version 7.0.7 - Q16
Textext version 0.4.4
MiKTeX 2.9 (updated today).
I would really appreciate any comments and ideas.
Thanks in advance
There is nothing wrong with your Ghostscript command, but the pdfwrite device requires the ability to write temporary files to the system temporary directory. (Other devices, such as the default display device, do not always require the ability to write temporary files.)
The empty filename in the error is suspicious; that should not be possible.
There is clearly some kind of problem, because the file 'encoded_file_ptr.0' apparently couldn't be created either, and that's a valid filename.
I'd have to guess at some kind of permissions problem. I note that you are running a Windows Ghostscript; are you running this under some kind of Linux-alike? If so, I'd suspect some kind of permissions or access problem on the temp partition.
Have you tried running Ghostscript from the Windows command shell?
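One way to test the temporary-directory theory is to try creating a file there yourself. The sketch below is not part of Ghostscript: the helper name can_write_temp is made up, and it assumes Ghostscript picks its temporary directory from the TEMP or TMPDIR environment variable.

```python
import os
import tempfile


def can_write_temp(directory):
    """Return True if a scratch file can be created and removed in directory."""
    try:
        fd, path = tempfile.mkstemp(prefix="gs_probe_", dir=directory)
    except OSError:
        # Missing directory, no permission, read-only volume, and so on.
        return False
    os.close(fd)
    os.remove(path)
    return True


if __name__ == "__main__":
    # Assumption: these are the variables Ghostscript consults for temp files.
    for variable in ("TEMP", "TMPDIR"):
        directory = os.environ.get(variable)
        if directory is None:
            print(f"{variable} is not set")
        else:
            print(f"{variable}={directory!r} writable: {can_write_temp(directory)}")
```

If this reports a directory as not writable from the same shell that fails to run pdfwrite, the problem is with the environment rather than with Ghostscript.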
I am having a lot of trouble finding code to convert a PDF file to PNG with Python 3.6 on Windows 10.
I know what you are going to say: google it!
But nearly everything I've found was for Python 2.7, and some packages haven't been updated.
From what I've seen so far, the best way to do it is with Wand, right? (I installed ImageMagick beforehand.)
from wand.image import Image
# Converting first page into JPG
with Image(filename='0.pdf') as img:
# Resizing this image
Here was my second error :
wand.exceptions.DelegateError: PDFDelegateFailed
`The system cannot find the file specified.' # error/pdf.c/ReadPDFImage/809
So I read that I need Ghostscript. I installed it, but the package is for Python 2.7 and it doesn't work. I found python3-ghostscript 0.5.0: https://pypi.python.org/pypi/python3-ghostscript/0.5.0
New error :
RuntimeError: Can not find Ghostscript DLL in registry
So here I needed to install Ghostscript 9:
First of all, it's not a GPL license... It's not even a package but a program. I don't know how I can use it in my future Python code...
and there is still an error :
RuntimeError: Can not find Ghostscript DLL in registry
and I can't find anything about it.
Ghostscript is licensed under the AGPL. The licence can be found in /Program Files (x86)/gs/gs9.21/doc, and if you want the sources they are available from the Ghostscript Git repository. Note that I'm assuming you are running on Windows, since you refer to the Registry.
If you install the prebuilt binary, it will create an entry in the Windows Registry. I assume that's what your Python code is looking for, but I can't be sure. You should make sure you install the version with the word size (32 or 64 bit) required by Python, if it cares.
You can, of course, simply run Ghostscript to render a PDF file and produce PNG output.
gswin32c -sDEVICE=png16m -sOutputFile=out%d.png input.pdf
This will create one file per page of the input PDF file. Use gswin64c for the 64-bit version.
You can alter the resolution of the output with the -r switch, e.g. -r300.
I presume you can simply fork a process from Python. Otherwise you'll have to get someone to tell you what the Python script is looking for in the Registry. Perhaps it's looking for a specific version of Ghostscript, or the 32-bit version, or something.
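Forking a process from Python could look roughly like the sketch below, which wraps the command line above with the standard subprocess module. The function names and the default executable are illustrative assumptions. The -dBATCH and -dNOPAUSE switches are added so that Ghostscript exits when it is done instead of waiting at its prompt.

```python
import subprocess


def build_gs_command(pdf_path, output_pattern="out%d.png",
                     resolution=None, executable="gswin64c"):
    """Return the Ghostscript argument list for rendering a PDF to PNG files.

    output_pattern should contain %d so that each page gets its own file.
    executable is an assumption: use gswin32c for a 32-bit install, or a
    full path if Ghostscript is not on PATH.
    """
    command = [executable, "-dBATCH", "-dNOPAUSE", "-sDEVICE=png16m"]
    if resolution is not None:
        command.append(f"-r{resolution}")
    command.append(f"-sOutputFile={output_pattern}")
    command.append(pdf_path)
    return command


def pdf_to_png(pdf_path, output_pattern="out%d.png",
               resolution=None, executable="gswin64c"):
    """Run Ghostscript; raises CalledProcessError if it exits with an error."""
    subprocess.run(
        build_gs_command(pdf_path, output_pattern, resolution, executable),
        check=True,
    )


if __name__ == "__main__":
    print(build_gs_command("input.pdf", resolution=300))
```

Calling pdf_to_png("input.pdf", resolution=300) would then write out1.png, out2.png and so on, with no Registry lookup involved.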
I can create a PDF file with wkhtmltopdf on Ubuntu Linux: "wkhtmltopdf www.stackoverflow.com file.pdf"
How to take webpage screenshots?
You can use wkhtmltoimage as suggested by Mathew. You can go to the first URL and download the file suitable for your architecture (or its source code). Usage is simple:
./wkhtmltoimage-i386 http://www.google.com google.png
wkhtmltoimage - this also uses the webkit render engine for excellent results and is available as standalone binary with no install problems anticipated, just trial and error to get the results you need.
If I have 10,000 PDFs, some of which have been OCRed, some of which have 1 page that has been OCRed but the rest of the pages have not, how can I go through all the PDFs and only OCR the pages that haven't already been done?
This is exactly what I was looking for, I have thousands of scanned PDF files, where some were already OCR'ed and some are not.
So, I combined information I found on fora and Stack Overflow, and made my own solution that does EXACTLY that, which I have summarized for you here:
scan through all subdirectories recursively for PDF files;
check if the PDF was already OCR'ed, and if not, process the PDF with OCR with high quality, in the language(s) you can specify;
save the OCR PDF in-place, as PDF/A, and overwriting the old (not-OCR'ed) one.
I am on Windows 10, and could not find the definitive answer. I tried doing this with Acrobat Pro, but that gave me many errors, and Acrobat's batch processing stops on every error or password-protected file. I also tried many other batch-OCR tools on Windows, but none worked well.
I spent countless hours manually checking which files already had a text-layer "under" the image.
UNTIL! Microsoft announced that it was now very easy to run Linux under Windows, on the same machine, on the same filesystem.
There are many more tools and utilities available on Linux than Windows, so I thought I would give that a try.
So, here it is, step by step:
Enable the Windows subsystem for Linux in the Windows Control Panel; there are many guides. Google it. It's a couple of minutes.
Install Linux from the Windows Store. Open the Windows Store, search for Ubuntu, and install. Takes around 5 minutes.
Now you have the "Ubuntu app". Run it. It shows you the linux bash, and with file access to your Windows files through /mnt/c. It's magic!
You need some Linux "apps", namely pdffonts and ocrmypdf. pdffonts is part of the poppler-utils package, so install them with sudo apt install poppler-utils and sudo apt install ocrmypdf. We will use these apps to check whether there is an embedded font in a PDF and, if not, OCR the PDF (see note below).
Install the very small bash script (below) to your home directory ~.
Go to (cd) the directory where all your PDF's are saved. For example: /mnt/c/Users/name/OneDrive/Documents.
Run the command: find . -type f -name "*.pdf" -exec /your/homedir/pdf-ocr.sh '{}' \;
Running this might, of course, take a long time, depending on how many PDF's you have, and how many of those are not OCR'ed yet.
Here is the sh-script. You should save it somewhere in your home folder so that it is easy to call from anywhere. Like so:
type cd ~. This will bring you to your home folder.
type pico pdf-ocr.sh. This will bring up an editor. Paste the below script code. Then press Ctrl+X, and press Y. Your file is now saved.
type sudo chmod +x pdf-ocr.sh. This will give the script permission to be run.
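#!/bin/bash
# List the font names found in the first 5 pages (pdffonts prints two header lines first).
MYFONTS=$(pdffonts -l 5 "$1" | tail -n +3 | cut -d' ' -f1 | sort | uniq)
if [ "$MYFONTS" = '' ] || [ "$MYFONTS" = '[none]' ]; then
    echo "Not yet OCR'ed: $1 -------- Processing...."
    echo " "
    # -s skips pages that already contain text; the file is overwritten in place.
    ocrmypdf -l eng+deu+nld -s "$1" "$1"
    echo " "
else
    echo "Already OCR'ed: $1"
    echo " "
fi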
What does this do?
Well, the find command looks up all PDF files in the current directory including subdirectories. It then "sends" these files to the script, in which pdffonts checks if there are embedded fonts. If so, skip the file and try the next one. If no embedded fonts are found, use ocrmypdf to do the OCR-ing.
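For anyone who would rather drive this from Python, the font check can be sketched as below. This is only an illustration of the same logic, not a replacement for the script: the function names are made up, and it assumes pdffonts is on the PATH and prints two header lines before the font list.

```python
import subprocess


def needs_ocr(pdffonts_output):
    """Decide from pdffonts output whether a PDF still needs OCR.

    pdffonts prints two header lines followed by one line per font.
    Mirrors the shell script: no fonts at all, or only '[none]', means
    the file is treated as not yet OCR'ed.
    """
    font_lines = pdffonts_output.splitlines()[2:]
    names = {line.split(" ", 1)[0] for line in font_lines if line.strip()}
    return names <= {"[none]"}


def pdf_needs_ocr(path):
    """Run pdffonts on the first 5 pages of path and apply needs_ocr."""
    result = subprocess.run(
        ["pdffonts", "-l", "5", path],
        capture_output=True, text=True, check=True,
    )
    return needs_ocr(result.stdout)
```

Separating the parsing from the subprocess call means the decision rule can be tried on saved pdffonts output without touching any PDF.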
I found the quality of OCR from ocrmypdf VERY good, even better than Acrobat's. You can of course tweak the settings. I can imagine for example that you might want to use other languages for OCR than eng+deu+nld. You can look up all options here: https://ocrmypdf.readthedocs.io/en/latest/
Note: I am making the assumption here that if a PDF file has no embedded fonts (so it's basically an image (scan) in a PDF file), it has not been OCR'ed. I know that this might not always be accurate, but for me it is enough to determine which files to put through OCR, so that it is not necessary to re-do hundreds or thousands of PDF files.
I know that it is a bit more hassle to install Linux under Windows, but it is very easy to do if you have basic Linux skills. For me it was worth the effort because I have now made a "one click" batch processor that works. I could not find a solution for that with Windows tools.
I hope someone finds this and finds this useful. If anyone has improvements, please post them here.
Jos Jonkeren
Why don't you re-OCR everything? The amount of time you spend agonizing over repeated work probably exceeds the time taken for the work itself.
If by OCRed you mean that they contain the text in machine-readable form, you could use a library like Apache PDFBox to try to extract the text from the second page of the document. If it throws an error or returns garbage, it's most likely not OCRed.
Unburying this thread.
You can know which PDF files have already been OCRed by testing them with pdffonts. If there are embedded fonts, it's very probable that the PDF is already OCRed.
As for the batch processing, I wrote a little script that can batch OCR to pdf/word/excel/csv output format.
You may find it at https://github.com/deajan/pmOCR
pmOCR (poor man's OCR) is a wrapper for the Abbyy OCR CLI for Linux or the open-source Tesseract 3.