Batch convert SVG to PDF page size

I have a number of SVG files created with Inkscape that contain text in non-standard fonts. As far as I understand, in order to have them printed I need to convert the text to paths. It seems that if I just use
convert input.svg output.pdf
the text is automatically converted to paths. Is this correct?
However, my problem is with the page size. The input SVGs have a page size of A5, landscape, but the converted PDFs seem to be cut off at the right and bottom of the image by about 5% of the image width/height.
Why is that? How do I fix it?

As long as you have Inkscape on your system, ImageMagick convert actually delegates the PDF export to Inkscape. You can use Inkscape directly on the command line:
inkscape -zA output.pdf input.svg
Quote from the man page:
Used fonts are subset and embedded.
There are some options to manipulate the export area: -C explicitly sets the page area, -D the drawing bounding box.
You could even preserve the SVG format by using
inkscape -Tl output.svg input.svg
which converts the text to paths.
Lastly, since you have to batch-process multiple files, you should open a shell with
inkscape --shell
and process all files in one go. Otherwise, startup time of inkscape would be 1-3 seconds for every file. Something like:
ls -1 *.svg | awk -F. \
'{ print "-AC " $1 ".pdf " $0 }
END { print "quit" }' | \
inkscape --shell
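Note that Inkscape 1.x changed the command-line interface, so the options above apply to the 0.x series. On a recent version, a rough equivalent is a simple loop (a sketch, assuming the 1.x --export-type/-o options; the per-file startup cost remains):
for f in *.svg; do
    # export the page area of each SVG to a PDF with the same base name
    inkscape "$f" --export-area-page --export-type=pdf -o "${f%.svg}.pdf"
done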

Related

How to make a vector PDF searchable?

My workflow includes making figures in Inkscape, which are then converted to PDF and included into LaTeX documents. In these figures, I often have to include mathematical formulas. For that, I use TexText. For font consistency and simplicity, when I want to add some plain text to my figure, I also use TexText. When the resulting SVG is converted to PDF, the TexText-generated text is not searchable.
How can I make a PDF from the SVG such that it is searchable while remaining a vector PDF?
I know I could rasterize the figure and then use e.g. Tesseract to create a searchable PDF. But the resulting PDF will of course contain a rasterized version of my figure. I would like the figure itself to remain vector graphics.
I am guessing there has to be a way that would go something like this: indeed rasterize the PDF and use Tesseract to extract the text. But then take the output of Tesseract and somehow add it to the original vector PDF. Unfortunately, I don't know how to do this.
It turns out a question that is directly relevant to mine was answered on another StackExchange, here. The script that answers my actual question is svgToSearchablePDF.sh, and I post it below. It uses, as the key element, the script pdf-merge-text.sh from the accepted answer to that other question. For completeness, I will repost pdf-merge-text.sh in this answer.
The solution
Note that perhaps you will need to magnify the SVG file before converting it to a searchable PDF: larger image sizes help the OCR process. To magnify, in Inkscape, select the entire image, then go to Object -> Transform… . In the Transform tab, select Scale. Then select 'Scale proportionally', and finally, in either 'Width' or 'Height', enter something like '300' (make sure % is selected in the menu immediately to the right of the 'Width' field). Next, File->Document Properties…. In the 'Document Properties' tab, under Custom size, click on Resize page to content. Make sure that either nothing is selected or else that the entire image is selected. Then click the button 'Resize page to drawing or selection'. Save the SVG.
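If you prefer the command line, rasterizing at a higher resolution should achieve the same goal as magnifying the SVG, since larger images help the OCR. A sketch, assuming Inkscape 1.x syntax and an illustrative 300 dpi value:
# export the PNG used for OCR at 300 dpi instead of the default 96 dpi
inkscape input.svg --export-dpi=300 -o input_auxfile.png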
The script svgToSearchablePDF.sh uses an SVG file as input and produces a searchable vector PDF file as output. It is assumed that all of the following are installed: Tesseract, Inkscape, and Ghostscript.
For example, assume that we used Inkscape to create the file mygraphics.svg. Then the following command will produce a searchable PDF file mygraphics.pdf:
svgToSearchablePDF.sh mygraphics.svg
The scripts
First, svgToSearchablePDF.sh:
#!/bin/bash
filename="$1"
base="${filename%.*}"
# Export a raster copy (for OCR) and a vector PDF copy of the SVG (Inkscape 1.x syntax)
inkscape "$base.svg" -o "${base}_auxfile.png"
inkscape "$base.svg" -o "${base}_auxfileVCT.pdf"
# OCR the raster image into a PDF whose text layer we will reuse
tesseract "${base}_auxfile.png" "${base}_auxfileTXT" -l eng pdf
# Overlay the vector PDF on the OCR'd text layer
pdf-merge-text.sh "${base}_auxfileTXT.pdf" "${base}_auxfileVCT.pdf" "$base.pdf"
# Clean up the auxiliary files
rm -f "${base}_auxfile.png" "${base}_auxfileVCT.pdf" "${base}_auxfileTXT.pdf"
As I said, that script uses the script pdf-merge-text.sh from here. For completeness, here it is:
#!/usr/bin/env bash
set -eu

pdf_merge_text() {
  local txtpdf; txtpdf="$1"
  local imgpdf; imgpdf="$2"
  local outpdf; outpdf="${3--}"
  if [ "-" != "${txtpdf}" ] && [ ! -f "${txtpdf}" ]; then echo "error: text PDF does not exist: ${txtpdf}" 1>&2; return 1; fi
  if [ "-" != "${imgpdf}" ] && [ ! -f "${imgpdf}" ]; then echo "error: image PDF does not exist: ${imgpdf}" 1>&2; return 1; fi
  if [ "-" != "${outpdf}" ] && [ -e "${outpdf}" ]; then echo "error: not overwriting existing output file: ${outpdf}" 1>&2; return 1; fi
  (
    # Strip the images from the OCR'd PDF, leaving only its (invisible) text layer
    local txtonlypdf; txtonlypdf="$(TMPDIR=. mktemp --suffix=.pdf)"
    trap "rm -f -- '${txtonlypdf//'/'\\''}'" EXIT
    gs -o "${txtonlypdf}" -sDEVICE=pdfwrite -dFILTERIMAGE "${txtpdf}"
    # Stamp the vector pages on top of the text-only pages
    pdftk "${txtonlypdf}" multistamp "${imgpdf}" output "${outpdf}"
  )
}
pdf_merge_text "$@"
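As written, pdf_merge_text also accepts - for the output argument (it defaults to -, which pdftk writes to stdout), so besides the three-argument form used by svgToSearchablePDF.sh you could stream the result; the file names below are purely illustrative:
# merge an OCR'd text-layer PDF with a vector PDF, writing the merged file to stdout
pdf-merge-text.sh ocr_text.pdf vector.pdf - > merged.pdf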

Reverse white and black colors in a PDF

Given a black and white PDF, how do I reverse the colors such that the background is black and everything else is white?
Adobe Reader can do it (Preferences -> Accessibility), but only for viewing purposes inside the program; it does not change the document itself, so the colors are not also reversed in other PDF readers.
How to reverse colors permanently?
You can run the following Ghostscript command:
gs -o inverted.pdf \
-sDEVICE=pdfwrite \
-c "{1 exch sub}{1 exch sub}{1 exch sub}{1 exch sub} setcolortransfer" \
-f input.pdf
Acrobat will show the colors inverted.
The four identical parts {1 exch sub} are meant for CMYK color spaces and are applied to C(yan), M(agenta), Y(ellow) and (blac)K color channels in the order of appearance.
You may use only three of them -- then it is meant for RGB color spaces and is applied to R(ed), G(reen) and B(lue).
Of course you can "invent" your own transfer functions too, instead of the simple 1 exch sub one: for example, {0.5 mul} will just use 50% of the original color values for each color channel.
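For example, a 50% darkening of every channel would look like this (an untested sketch following the same pattern as the inversion command above):
# darken all four channels to 50% of their original values
gs -o darkened.pdf \
   -sDEVICE=pdfwrite \
   -c "{0.5 mul}{0.5 mul}{0.5 mul}{0.5 mul} setcolortransfer" \
   -f input.pdf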
Note: the inversion shown above applies to ALL colors, not just black and white!
Caveats:
Some PDF viewers won't display the inverted colors: notably, Preview.app on Mac OS X, Evince, MuPDF and PDF.js (the Firefox PDF viewer) won't. But Chrome's native PDF viewer, PDFium, will, as will Ghostscript and Adobe Reader.
It will not work with all PDFs (or for all pages of the PDF), because it is also dependent on how exactly the document's colors are defined.
Update
The command above has been updated with the added -f parameter (required) before the input.pdf. Sorry for not noticing this flaw in my command line before; I only became aware of it again because some good soul gave the answer its first upvote today...
Additional update: the most recent versions of Ghostscript no longer require the added -f parameter. Verified with v9.26 (it may also be true for v9.25 or earlier versions).
The best method would be to use pdf2ps, the "Ghostscript PDF to PostScript translator", which converts the PDF to a PS file.
Once the PS file is created, open it with any text editor and add {1 exch sub} settransfer before the first line.
Now "re-convert" the PS file back to PDF with the same software used above (ps2pdf).
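A minimal sketch of that round trip (assuming the pdf2ps/ps2pdf wrappers bundled with Ghostscript and GNU sed):
pdf2ps input.pdf input.ps
# prepend the inversion, as suggested above, before the first line of the PS file
sed -i '1i {1 exch sub} settransfer' input.ps
ps2pdf input.ps inverted.pdf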
If you have the Adobe PDF printer installed, you can go to Print -> Adobe PDF -> Advanced... -> Output area and select the "Invert" checkbox. Your printed PDF file will then be inverted permanently.
None of the previously posted solutions worked for me, so I wrote this simple bash script. It depends on pdftk and awk. Just copy the code into a file, make it executable, and run it like:
$ /path/to/this_script.sh /path/to/mypdf.pdf
The script:
#!/bin/bash
# Uncompress the PDF streams, swap white (1 1 1) and black (0 0 0) color
# operands, then recompress into a *_inverted.pdf next to the original.
pdftk "$1" output - uncompress | \
awk '
/^1 1 1 / {
  sub(/1 1 1 /, "0 0 0 ", $0);
  print;
  next;
}
/^0 0 0 / {
  sub(/0 0 0 /, "1 1 1 ", $0);
  print;
  next;
}
{ print }' | \
pdftk - output "${1/%.pdf/_inverted.pdf}" compress
This script works for me, but your mileage may vary. In particular, sometimes the colors are listed in the form 1.000 1.000 1.000 instead of 1 1 1. The script can easily be modified as needed, and additional color conversions could be added as well.
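For instance, the matching could be broadened with a regex so that both forms are caught (an untested sketch; how colors appear in the uncompressed streams varies between PDFs):
pdftk "$1" output - uncompress | \
awk '
# match "1 1 1 " as well as "1.000 1.000 1.000 ", and the black equivalents
/^1(\.0+)? 1(\.0+)? 1(\.0+)? / { sub(/^1(\.0+)? 1(\.0+)? 1(\.0+)? /, "0 0 0 "); print; next }
/^0(\.0+)? 0(\.0+)? 0(\.0+)? / { sub(/^0(\.0+)? 0(\.0+)? 0(\.0+)? /, "1 1 1 "); print; next }
{ print }' | \
pdftk - output "${1/%.pdf/_inverted.pdf}" compress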
For me, the pdf2ps -> edit -> ps2pdf solution did not work: the intermediate .ps file is inverted correctly, but the final .pdf is the same as the original. The final .pdf from the suggested gs solution was also the same as the original.
Cross-platform: try MuPDF.
mutool draw -I -o out.pdf in.pdf [range of pages]
It should permanently change the colours in many viewers.
Later Edit
A sample file that did not reverse was one with linework only (no images). The method needed there was to save the graphics as an inverted image, then reuse that to build a replacement PDF. However, beware: converting whole pages to images will turn any searchable text into plain unsearchable pixels, so the rebuild would need to be run with OCR active.
The two commands needed will be something like the following (the %4d placeholder makes the image numbering start at output0001):
mutool draw -o output%4d.png -I input.pdf
For Linux users, the following second pass should work easily:
mutool convert -O compress -o output.pdf output*.png
For Windows users, you will for now (v1.19) need to combine the files by scripting or in groups:
mutool convert -O compress -o output.pdf output0001.png output0002.png output0003.png
The next version may include an #filelist option; see https://bugs.ghostscript.com/show_bug.cgi?id=703163
This is probably just a frontend for the Ghostscript command Kurt Pfeifle posted, but you could also use ImageMagick with something like:
convert -density 300 -colorspace RGB -channel RGB -negate input.pdf output.pdf

Inkscape "PDF + Latex" export

I'm using Inkscape to produce vector figures, saving them in SVG format to export them later as "PDF + Latex", much in the vein of the TUG inkscape+pdflatex guide.
Trying to produce a simple figure, however, turns out to be extremely frustrating.
The first figure is an example of the figure I would like to export in the form of "PDF + Latex" (shown here in PNG format).
If I export this to a PDF figure without LaTeX macros, the PDF produced looks exactly the same, except for some minor differences in the fonts used to render the text.
When I try to export this using the "PDF + Latex" option, the PDF file produced consists of a PDF document of 2 pages (again shown as .png here):
This, of course, does not look good when compiling my LaTeX document. So far the guide at TUG has been very helpful, but I still can't produce a working "PDF + Latex" export from Inkscape.
What am I doing wrong?
I worked around this by putting all the text in my drawing at the top
select text and then Object -> Raise to top
Inkscape only generates the separate pages if the text is below another object.
I asked this question on the Inkscape online discussion page and got some very helpful guidance from one of the users there.
This is a known bug https://bugs.launchpad.net/ubuntu/+bug/1417470 which was inadvertently introduced in Inkscape 0.91 in an attempt to fix a previous bug https://bugs.launchpad.net/inkscape/+bug/771957.
It seems this bug does two things:
The *.pdf_tex file will have an extra \includegraphics statement which needs to be deleted manually as described in the link to the bug above.
The *.pdf file may be split into multiple pages, regardless of the size of the image. In my case the line objects were split off onto their own page. I worked around this by turning off the text objects (opacity to zero) and then doing a standard PDF export.
If you can execute Linux commands, this works:
# Generate the .pdf and .pdf_tex files
inkscape -z -D --file="$SVGFILE" --export-pdf="$PDFFILE" --export-latex
# Fix the number of pages
sed -i 's/\\\\/\n/g' "${PDFFILE}_tex";
MAXPAGE=$(pdfinfo "$PDFFILE" | grep -oP "(?<=Pages:)\s*[0-9]+" | tr -d " ");
sed -i "/page=$(($MAXPAGE+1))/,\${/page=/d}" "${PDFFILE}_tex";
with:
$SVGFILE: path of the SVG file
$PDFFILE: path of the PDF file
It is possible to include these commands in a script and execute it automatically when compiling your TeX file (so that you don't have to export manually from Inkscape each time you modify your SVG), as in the sketch below.
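Packaged as a standalone script, this could look like the following (the file name svg2pdftex.sh is an assumption; the Inkscape 0.92 option syntax from above is kept):
#!/bin/bash
# usage: ./svg2pdftex.sh figure.svg   (hypothetical wrapper; produces figure.pdf and figure.pdf_tex)
SVGFILE="$1"
PDFFILE="${SVGFILE%.svg}.pdf"
inkscape -z -D --file="$SVGFILE" --export-pdf="$PDFFILE" --export-latex
# put each \includegraphics line on its own line, then drop references
# to pages beyond the real page count
sed -i 's/\\\\/\n/g' "${PDFFILE}_tex"
MAXPAGE=$(pdfinfo "$PDFFILE" | grep -oP "(?<=Pages:)\s*[0-9]+" | tr -d " ")
sed -i "/page=$(($MAXPAGE+1))/,\${/page=/d}" "${PDFFILE}_tex"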
Try it with an illustration that is less wide.
Alternatively, use a wider paperwidth setting.

JPG to PDF Conversion, How to Fit Full Page

I have page scans of various sizes in JPG format which I convert to a single PDF using ImageMagick. However, I noticed that every PDF page for each type of scan comes out a different size, even if I use the -page A4 option in ImageMagick. I would like every JPG, whatever its size, to "fill" its PDF page, and every PDF page to be the same size. I also have access to tools like pdftk and pdfjam.
Any ideas?
As a hack you can use pdflatex and the wallpaper package. It does the trick and has the advantage over most other methods of not altering the image content (resolution, compression, pixel content), adding only about 1.2 kB of overhead.
To keep the aspect ratio, use:
filename=test.jpg;
echo "\documentclass[a4paper]{article}\
\usepackage{wallpaper}\usepackage{grffile}\
\begin{document}\
\thispagestyle{empty}\
\ThisCenterWallPaper{1}{$filename}~\
\end{document}"\
| pdflatex --jobname "$filename";
rm "$filename".aux "$filename".log
To fill the page completely, use:
filename=test.jpg;
echo "\documentclass[a4paper]{article}\
\usepackage{wallpaper}\usepackage{grffile}\
\begin{document}\
\thispagestyle{empty}\
\ThisTileWallPaper{\paperwidth}{\paperheight}{$filename}~\
\end{document}"\
| pdflatex --jobname "$filename";
rm "$filename".aux "$filename".log
Finally, you can concatenate your pages using pdftk:
pdftk page1.pdf ... page2.pdf cat output final_document.pdf
When I used -density 50% with ImageMagick convert, it managed to zoom lower-resolution images to bigger PDF pages.
This should do the trick, assuming your PDF output should have A4 sized pages (portrait):
convert -scale 595x842\! *.jpg output.pdf
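If you would rather preserve the aspect ratio than stretch each scan (the \! flag above forces the exact size, distorting the image), a variant along these lines should work; a sketch using standard ImageMagick options, with the white background being an assumption:
# fit each JPG inside an A4-sized (595x842 pt) canvas, centered, without distortion
convert *.jpg -resize 595x842 -background white -gravity center -extent 595x842 output.pdf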

PDF text extraction from given coordinates

I would like to extract text from a portion (using coordinates) of PDF using Ghostscript.
Can anyone help me out?
Yes, with Ghostscript you can extract text from PDFs. But no, it is not the best tool for the job, and no, you cannot do it in "portions" (parts of single pages). What you can do: extract the text of a certain range of pages only.
First: Ghostscript's txtwrite output device (not so good)
gs \
-dBATCH \
-dNOPAUSE \
-sDEVICE=txtwrite \
-dFirstPage=3 \
-dLastPage=5 \
-sOutputFile=- \
/path/to/your/pdf
This will output all text contained on pages 3-5 to stdout. If you want output to a text file, use
-sOutputFile=textfilename.txt
gs Update:
Recent versions of Ghostscript have seen major improvements in the txtwrite device and bug fixes. See recent Ghostscript changelogs (search for txtwrite on that page) for details.
Second: Ghostscript's ps2ascii.ps PostScript utility (better)
This one requires you to download the latest version of the file ps2ascii.ps from the Ghostscript Git source code repository. You'd have to convert your PDF to PostScript, then run this command on the PS file:
gs \
-q \
-dNODISPLAY \
-P- \
-dSAFER \
-dDELAYBIND \
-dWRITESYSTEMDICT \
-dSIMPLE \
/path/to/ps2ascii.ps \
input.ps \
-c quit
If the -dSIMPLE parameter is not defined, each output line contains some additional info beyond the pure text content, about the fonts and font sizes used.
If you replace that parameter by -dCOMPLEX, you'll get additional info about the colors and images used.
Read the comments inside ps2ascii.ps to learn more about this utility. It's not comfortable to use, but it worked for me in most cases where I needed it...
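For the preliminary PDF-to-PostScript step, the pdf2ps wrapper that ships with Ghostscript can be used, e.g.:
# convert the PDF to PostScript before running ps2ascii.ps on the result
pdf2ps input.pdf input.ps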
Third: XPDF's pdftotext CLI utility (more comfortable than Ghostscript)
A more comfortable way to do text extraction: use pdftotext (available for Windows as well as Linux/Unix or Mac OS X). This utility is based either on Poppler or on XPDF. This is a command you could try:
pdftotext \
-f 13 \
-l 17 \
-layout \
-opw supersecret \
-upw secret \
-eol unix \
-nopgbrk \
/path/to/your/pdf \
- | less
This will display the page range from page 13 (first page) to page 17 (last page) of a double-password-protected PDF file (with user password secret and owner password supersecret), preserving the layout, with the Unix EOL convention, without inserting page breaks between PDF pages, piped through less...
pdftotext -h displays all available commandline options.
Of course, both tools only work for the text parts of PDFs (if they have any). Oh, and mathematical formulas also won't work too well... ;-)
pdftotext Update:
Recent versions of Poppler's pdftotext now have options to extract "a portion (using coordinates) of PDF" pages, like the OP asked for. The parameters are:
-x <int> : top left corner's x-coordinate of crop area
-y <int> : top left corner's y-coordinate of crop area
-W <int> : crop area's width in pixels (defaults to 0)
-H <int> : crop area's height in pixels (defaults to 0)
These are best used together with the -layout parameter.
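For example, to grab only the text inside a 400x200 pt region whose top-left corner sits at (72, 100) on page 3 (all numbers are made up for illustration):
# extract text from the given crop area on page 3 only, writing to stdout
pdftotext -f 3 -l 3 -layout -x 72 -y 100 -W 400 -H 200 input.pdf -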
Fourth: MuPDF's mutool draw command can also extract text
The cross-platform, open source MuPDF application (made by the same company that also develops Ghostscript) has bundled a command line tool, mutool. To extract text from a PDF with this tool, use:
mutool draw -F txt the.pdf
This will emit the extracted text to <stdout>. Use -o filename.txt to write it into a file.
Fifth: PDFLib's Text Extraction Toolkit (TET) (best of all... but it is PayWare)
TET, the Text Extraction Toolkit from the PDFlib family of products, can find the x-y coordinates of text content in a PDF file (and much more). TET has a command line interface, and it's the most powerful of all the text extraction tools I'm aware of. (It can even handle ligatures...) Quote from their website:
Geometry
TET provides precise metrics for the text, such as the position on the page, glyph widths, and text direction. Specific areas on the page can be excluded or included in the text extraction, e.g. to ignore headers and footers or margins.
In my experience, while it does not sport the most straightforward CLI interface you can imagine, after you get used to it, it will do what it promises to do, for most PDFs you throw at it...
And there are even more options:
podofotxtextract (CLI tool) from the PoDoFo project (Open Source)
calibre (normally a GUI program to handle eBooks, Open Source) has a commandline option that can extract text from PDFs
AbiWord (a GUI word processor, Open Source) can import PDFs and save its files as .txt: abiword --to=txt --to-name=output.txt input.pdf
I'm not sure Ghostscript can accept coordinates, but you can convert the PDF to an image and send it to an OCR engine, either as a sub-image cropped at the given coordinates or as the whole image along with the coordinates. Some OCR APIs accept a rectangle parameter to narrow the region for OCR.
Look at VietOCR for a working example: it uses Tesseract as its OCR engine and Ghostscript as the PDF-to-image converter.
Debenu Quick PDF Library can extract text from a defined area on a page. The SetTextExtractionArea function lets you specify the x and y coordinates and then you can also specify the width and height of the area.
Left = The horizontal coordinate of the left edge of the area
Top = The vertical coordinate of the top edge of the area
Width = The width of the area
Height = The height of the area
Then the GetPageText function can be called immediately after this to extract the text from that defined area.
Here's an example using C# (though the library is multi-platform and can be used with many different programming languages):
DPL.LoadFromFile(@"Sample.pdf", "");
DPL.SetOrigin(1); // Sets 0,0 coordinate position to top left of page, default is bottom left
DPL.SetTextExtractionArea(35, 35, 229, 30); // Left, Top, Width, Height
string ExtractedContent = DPL.GetPageText(8);
Console.WriteLine(ExtractedContent);
Using GetPageText it is also possible to return just the text located in that area, or the text located in that area along with information about the text's font, such as its name, color and size.