Converting docx to PDF/A with LibreOffice Writer

I am happily converting docx files to PDF via the command line (controlled through C# process calls) from my service.
Unfortunately I could not find anything on how to set, from the command line, the output PDF options that the GUI offers. I am specifically looking to generate PDF/A and tagged PDF via the command line.
Anyone ever done this and knows how to do that?
EDIT:
Obviously getting a PDF/A can be done by using unoconv instead.
On windows one would use the following command line in a checked out unoconv repository:
python.exe .\unoconv -f pdf -eSelectPdfVersion=1 C:\temp\libre\renderingtest.docx
I did not find further information on how to select other things (tagged PDF etc.) and where to get a complete list of the options that are available.
EDIT: It seems one can try the different options in the GUI. The settings get saved to C:\Users\<userName>\AppData\Roaming\LibreOffice\4\user\registrymodifications.xcu. One can then look up the changed setting there and pass it to unoconv like this:
python.exe .\unoconv -f pdf -eUseTaggedPDF=1 -eSelectPdfVersion=1 C:\temp\libre\renderingtest.docx
Still not sure if I am doing this correctly though.
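For what it's worth, newer LibreOffice versions (7.4 and later, to my knowledge; treat the exact version as an assumption) accept the same export filter options as JSON directly on the soffice command line, which would remove the need for unoconv entirely. A sketch:
soffice --headless --convert-to 'pdf:writer_pdf_Export:{"SelectPdfVersion":{"type":"long","value":1},"UseTaggedPDF":{"type":"boolean","value":true}}' C:\temp\libre\renderingtest.docx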

The Gotenberg project shows how this can be done using unoconv.
$ curl --request POST 'http://localhost:3000/forms/libreoffice/convert' --form 'files=@"doc.docx"' --form 'nativePdfFormat="PDF/A-1a"' -o pdfA.pdf

Related

Replacing pdftk with Ghostscript

We've just moved hosting to a new server and hosting company. After the move we noticed there's a function we missed during testing. The function creates PDFs with pdftk:
exec('pdftk mypdf.pdf fill_form mydata.fdf output myoutput.pdf flatten', $output);
Problem is, there's no pdftk on the new server and we couldn't get it installed either. Only three PDF-related tools are available:
PDF-Tools
HTMLDoc
Ghostscript
PDF-Tools and HTMLDoc don't work due to missing functionality.
Is there a way to replace the command line we currently have with Ghostscript, or is there maybe a PHP library I haven't found yet?

Running tesseract 4.1 with openjpeg2 - cannot produce pdf output

I have installed on my RedHat machine:
(py36_maw) [rvp@lib-archcoll box]$ tesseract -v
tesseract 4.1.0
leptonica-1.78.0
libjpeg 6b (libjpeg-turbo 1.2.90) : libpng 1.5.13 : libtiff 4.0.3 : zlib 1.2.7 : libopenjp2 2.3.1
Found SSE
I try to run, per what docs I can find, to produce pdf output:
(py36_maw) [rvp@lib-archcoll box]$ time tesseract test.jp2 out -l eng PDF
read_params_file: Can't open PDF
Tesseract Open Source OCR Engine v4.1.0 with Leptonica
Warning: Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 275
That takes 10 seconds and produces file out.txt with fine OCR to text conversion evident.
However, it tries to read a config file called PDF, and I cannot figure out how to get PDF output.
I have read various docs; the most promising ones advised editing the config file. But the only docs I could find by googling 'tesseract 4.1 config' list many 'config' variable names for older versions of tesseract, and none of them seems to indicate how to request PDF output, much less specifically for tesseract 4.1.
How can I invoke tesseract 4.1 (using libopenjp2 2.3.1) via CLI to produce pdf output from my jp2 input file? Bonus question: how can I get it to produce both txt and pdf output in one run?
Robert
After more surfing and digging, and assuming the reader has also done some and knows what TESSDATA_PREFIX is used for by tesseract, here are the steps that worked for me:
Download the pdf.ttf file from: https://github.com/tesseract-ocr/tesseract/blob/master/tessdata/pdf.ttf
Copy pdf.ttf to your directory $TESSDATA_PREFIX and make sure that variable is exported to your shell.
TIP: Use command: tesseract --print-parameters # to discover defined variable names you can use in your own config file
Go to your dir with the test.jp2 file and create file config with these lines.
tessedit_create_pdf 1 Write .pdf output file
tessedit_create_txt 1 Write .txt output file
(Note: or you may be able to put the config file in the TESSDATA_PREFIX directory as well and let it always be the default. Not tested.)
Run in that dir:
$ tesseract test.jp2 outputbase -l eng config
Verify your success: it runs and produces the files outputbase.txt and outputbase.pdf. The txt file looks good, and the searchable pdf looks and works OK in a pdf viewer; that is, you can search for and find text strings.
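Side note on the bonus question: recent tesseract releases also ship stock txt and pdf configs in tessdata/configs, so, assuming those files are present in your tessdata, both outputs should be obtainable in one run without a hand-written config file:
tesseract test.jp2 outputbase -l eng txt pdf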
Hope this helps someone else!

Specify default substitution font when converting pdf to image using imagemagick and font is missing

I am using Spatie/pdfToImage, which builds on Ghostscript and ImageMagick, to do the following on my server:
Take a multiple page pdf from an email using mailgun routing.
Save the pdf in folder /docs_pdf like file.pdf
Use a foreach to loop through each page and save each page as a png to /docs like file_#.png
Locally, where I use Laravel Valet, everything works fine.
On my server (Digital Ocean, provisioned through Laravel Forge), the text in a multi-page PDF that is in Swedish is transformed from normal Swedish into a bunch of random letters and signs.
The left is correct (yes, it's true, it's Swedish) and the right is wrong:
Someone suggested to me that this is probably a matter of a font missing on the server. These are the fonts used in the pdf:
<</StemV 68/FontName/PSQHMO+FoundrySans-Normal/FontFile2 216 0 R/FontStretch/Normal/FontWeight 400/Flags 32/Descent -240/FontBBox[-40 -240 960 916]/Ascent 916/FontFamily(FoundrySans-Normal)/CapHeight 667/XHeight 465/Type/FontDescriptor/ItalicAngle 0>>
<</StemV 100/FontName/MLHPWU+FoundrySans-Medium/FontFile2 217 0 R/FontStretch/Normal/FontWeight 400/Flags 32/Descent -241/FontBBox[-42 -241 1008 916]/Ascent 916/FontFamily(FoundrySans-Medium)/CapHeight 667/XHeight 470/Type/FontDescriptor/ItalicAngle 0>>
<</StemV 68/FontName/SUEECI+FoundrySans-Normal/FontFile2 218 0 R/FontStretch/Normal/FontWeight 400/Flags 4/Descent -240/FontBBox[-40 -240 960 916]/Ascent 916/FontFamily(FoundrySans-Normal)/CapHeight 667/XHeight 465/Type/FontDescriptor/ItalicAngle 0>>
<</StemV 48/FontName/KIDDUY+FoundrySans-Light/FontFile2 9 0 R/FontStretch/Normal/FontWeight 400/Flags 32/Descent -248/FontBBox[-28 -248 978 924]/Ascent 924/FontFamily(FoundrySans-Light)/CapHeight 667/XHeight 458/Type/FontDescriptor/ItalicAngle 0>>
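A quick way to check whether these fonts really are embedded on the server is pdffonts from poppler-utils (look at the emb column of its output); a sketch, using file.pdf as the file name from the update below:
pdffonts file.pdf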
Here is the configuration of fonts in ImageMagick and Ghostscript:
https://www.imagemagick.org/script/resources.php
How can this be solved?
Update:
I have now made a clean install on a new server.
Installed Imagick and spatie/pdfToImage
As suggested by KenS I ran
gs -sDEVICE=png16m -o out%d.png file.pdf
terminal output
forge@Server:~/app/storage/app/public/files$ gs -sDEVICE=png16m -o test_out%d.png file.pdf
GPL Ghostscript 9.22 (2017-10-04)
Copyright (C) 2017 Artifex Software, Inc. All rights reserved.
This software comes with NO WARRANTY: see the file PUBLIC for details.
Processing pages 1 through 2.
Page 1
Page 2
The document rendered the same, i.e. wrong.
I am at a complete loss. I don't know what the next step might be.
Update 2:
I also ran the ImageMagick convert command, and the image rendered the same way.
So whether I use Ghostscript solo, ImageMagick, or spatie/pdfToImage, I get the same output.
Well, the current version of Ghostscript (9.25) renders this acceptably for me; that is, the text appears to be correct. All the fonts are embedded, so there shouldn't be any problems.
And this means that even if you did replace the default font substitution, it wouldn't help, because Ghostscript shouldn't be using the default font; it will be using the fonts embedded in the PDF file.
Without knowing what version of Ghostscript you are using (I see from a later comment that it's 9.25), or the command line that is used to start it, I can't really do a like-for-like comparison. It's hard for me to see how you could be getting such a different result, though. It looks like Ghostscript has failed to find the embedded fonts.
It's possible that whatever package you are using has done something 'unfortunate'. The various package maintainers on Linux add their own patches and sometimes modify the way that Ghostscript is built. Possibly that has broken something.
If you are able to build Ghostscript yourself you could try cloning our Git repository and doing that. You could also try downloading the Linux binaries off our website. They won't work with every Linux distribution (different ABI) but you can try, you might be lucky.
You could also try running Ghostscript directly on the PDF file. Something like:
gs -sDEVICE=png16m -o out%d.png file.pdf
should produce 2 PNG files, out1.png and out2.png. It will also produce a bunch of stuff on the terminal. That back-channel output is valuable information for me, so if you can reproduce the problem, I'd like to see it too.
One last thought: it's possible to have more than one version of Ghostscript installed simultaneously, and it's possible that your current setup is using an old version of Ghostscript.
I can't help you with ImageMagick or Spatie, but if you can debug those to the point where you can reproduce the problem with a plain Ghostscript command line then I can look further at it.
Finally got it to work. I first want to give kudos to KenS, who really helped me; without him it would not have worked.
This is what I did:
1 - I removed Ghostscript:
sudo apt-get purge --auto-remove ghostscript
then
wget https://github.com/ArtifexSoftware/ghostpdl-downloads/releases/download/gs925/ghostscript-9.25.tar.gz
tar xvf ghostscript-9.25.tar.gz
Enter the unpacked folder and do
./configure
make
make install
then
sudo ln -s /usr/local/bin/gs /usr/bin/gs
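To verify that the shell now picks up the newly built binary rather than a leftover packaged one (KenS's last thought above), something like this should help; hash -r just clears bash's cached command locations:
hash -r
which -a gs
gs --version
The last command should report 9.25.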
On top of the above I did:
sudo add-apt-repository ppa:glasen/freetype2
and then:
sudo apt update && sudo apt install freetype2-demos

wget downloading only PDFs from website

I am trying to download all PDFs from http://www.fayette-pva.com/.
I believe the problem is that, when hovering over the link to download the PDF, Chrome shows the URL in the bottom left-hand corner without a .pdf file extension. I saw and used another forum answer similar to this, but there the .pdf extension was present in the URL when hovering over the link. I have tried the same code from the link below, but it doesn't pick up the PDF files.
Here is the code I have been testing with:
wget --no-directories -e robots=off -A.pdf -r -l1 \
http://www.fayette-pva.com/sales-reports/salesreport03-feb-09feb2015/
I am using this on a single page which I know has a PDF on it.
The complete code should be something like
wget --no-directories -e robots=off -A.pdf -r http://www.fayette-pva.com/
Related answer: WGET problem downloading pdfs from website
I am not sure whether downloading the entire website would work, or whether it would take forever. How do I get around this and download only the PDFs?
Yes, the problem is precisely what you stated: The URLs do not contain regular or absolute filenames, but are calls to a script/servlet/... which hands out the actual files.
The solution is to use the --content-disposition option, which tells wget to honor the Content-Disposition field in the HTTP response, which carries the actual filename:
HTTP/1.1 200 OK
(...)
Content-Disposition: attachment; filename="SalesIndexThru09Feb2015.pdf"
(...)
Connection: close
This option is supported in wget at least since version 1.11.4, which is already 7 years old.
So you would do the following:
wget --no-directories --content-disposition -e robots=off -A.pdf -r \
http://www.fayette-pva.com/

Batch OCRing PDFs that haven't already been OCR'd

If I have 10,000 PDFs, some of which have been OCRed, some of which have 1 page that has been OCRed but the rest of the pages have not, how can I go through all the PDFs and only OCR the pages that haven't already been done?
This is exactly what I was looking for. I have thousands of scanned PDF files, some of which were already OCR'ed and some not.
So I combined information I found on forums and Stack Overflow, and made my own solution that does EXACTLY that, which I have summarized for you here:
scan through all subdirectories recursively for PDF files;
check if the PDF was already OCR'ed, and if not, process the PDF with OCR with high quality, in the language(s) you can specify;
save the OCR'ed PDF in place, as PDF/A, overwriting the old (non-OCR'ed) one.
I am on Windows 10, and could not find the definitive answer. I tried doing this with Acrobat Pro, but that gave me many errors, and Acrobat's batch processing stops on every error or password-protected file. I also tried many other batch-OCR tools on Windows, but none worked well.
I spent countless hours manually checking which files already had a text-layer "under" the image.
UNTIL! Microsoft announced that it was now very easy to run Linux under Windows, on the same machine, on the same filesystem.
There are many more tools and utilities available on Linux than Windows, so I thought I would give that a try.
So, here it is, step by step:
Enable the Windows subsystem for Linux in the Windows Control Panel; there are many guides. Google it. It's a couple of minutes.
Install Linux from the Windows Store. Open the Windows Store, search for Ubuntu, and install. Takes around 5 minutes.
Now you have the "Ubuntu app". Run it. It shows you the Linux bash, with access to your Windows files through /mnt/c. It's magic!
You need two Linux tools, namely pdffonts and ocrmypdf, which you can install with sudo apt install poppler-utils (the package that provides pdffonts) and sudo apt install ocrmypdf. We will use these tools to check whether there is an embedded font in a PDF, and if not, OCR the PDF (see note below).
Install the very small bash script (below) to your home directory ~.
Go to (cd) the directory where all your PDF's are saved. For example: /mnt/c/Users/name/OneDrive/Documents.
Run the command: find . -type f -name "*.pdf" -exec /your/homedir/pdf-ocr.sh '{}' \;
Done!
Running this might, of course, take a long time, depending on how many PDF's you have, and how many of those are not OCR'ed yet.
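If the run takes too long, the per-file find/-exec invocation can be parallelized; a sketch, assuming GNU find and xargs (for the -P flag) and the script path from step 7:
find . -type f -name "*.pdf" -print0 | xargs -0 -n1 -P4 /your/homedir/pdf-ocr.sh
This runs up to four OCR jobs at once; adjust -P4 to your CPU count.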
Here is the sh-script. You should save it somewhere in your home folder so that it is easy to call from anywhere. Like so:
Type cd ~. This will bring you to your home folder.
Type pico pdf-ocr.sh. This will bring up an editor. Paste the script code below, then press Ctrl+X and press Y. Your file is now saved.
Type sudo chmod +x pdf-ocr.sh. This will make the script executable.
#!/bin/bash
# List the fonts on the first 5 pages; tail strips the two header lines of pdffonts output.
MYFONTS=$(pdffonts -l 5 "$1" | tail -n +3 | cut -d' ' -f1 | sort | uniq)
if [ "$MYFONTS" = '' ] || [ "$MYFONTS" = '[none]' ]; then
echo "Not yet OCR'ed: $1 -------- Processing...."
echo " "
# -s (--skip-text) tells ocrmypdf to leave pages that already contain text alone.
ocrmypdf -l eng+deu+nld -s "$1" "$1"
echo " "
else
echo "Already OCR'ed: $1"
echo " "
fi
What does this do?
Well, the find command looks up all PDF files in the current directory, including subdirectories. It then "sends" each file to the script, in which pdffonts checks whether there are embedded fonts. If there are, the file is skipped and the next one is tried. If no embedded fonts are found, ocrmypdf does the OCR-ing.
I found the quality of OCR from ocrmypdf VERY good, even better than Acrobat's. You can of course tweak the settings. I can imagine for example that you might want to use other languages for OCR than eng+deu+nld. You can look up all options here: https://ocrmypdf.readthedocs.io/en/latest/
Note: I am making the assumption here that if a PDF file has no embedded fonts (so it's basically an image/scan inside a PDF file), it has not been OCR'ed. I know that this might not always be accurate and/or true, but for me it is enough to determine which files to put through OCR, so that it is not necessary to re-do hundreds or thousands of PDF files.
I know that it is a bit of a hassle to install Linux under Windows, but it is very easy to do if you have basic Linux skills. For me it was worth the effort, because I now have a "one click" batch processor that works. I could not find a solution for that with Windows tools.
I hope someone finds this and finds this useful. If anyone has improvements, please post them here.
Thanks.
Jos Jonkeren
Why don't you re-OCR everything? The amount of time you spend agonizing over repeated work probably exceeds the time taken for the work itself.
If by OCRed you mean that they contain the text in machine-readable form, you could use a library like Apache PDFBox to try to extract the text from the second page of the document. If it throws an error or returns garbage, it's most likely not OCRed.
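A similar page-two check can also be scripted from the shell with pdftotext from poppler-utils, instead of the PDFBox route named above; a sketch, with file.pdf as a stand-in name:
pdftotext -f 2 -l 2 file.pdf - | tr -d '[:space:]' | wc -c
A character count of (near) zero suggests page 2 has no extractable text.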
Unburying this thread.
You can know which PDF files have already been OCRed by testing them with pdffonts. If there are embedded fonts, it's very probable that the PDF is already OCRed.
As for the batch processing, I wrote a little script that can batch-OCR to pdf/word/excel/csv output formats.
You may find it at https://github.com/deajan/pmOCR
pmOCR (poor man's OCR) is a wrapper for the ABBYY OCR CLI for Linux or the Tesseract 3 open source solution.