convert pdf to svg - pdf

I want to convert PDF to SVG. Please suggest some libraries or executables that can do this efficiently. I have written my own Java program using the Apache PDFBox and Batik libraries:
PDDocument document = PDDocument.load( pdfFile );
DOMImplementation domImpl =
GenericDOMImplementation.getDOMImplementation();
// Create an instance of org.w3c.dom.Document.
String svgNS = "http://www.w3.org/2000/svg";
Document svgDocument = domImpl.createDocument(svgNS, "svg", null);
SVGGeneratorContext ctx = SVGGeneratorContext.createDefault(svgDocument);
ctx.setEmbeddedFontsOn(true);
// Ask the test to render into the SVG Graphics2D implementation.
for(int i = 0 ; i < document.getNumberOfPages() ; i++){
String svgFName = svgDir+"page"+i+".svg";
(new File(svgFName)).createNewFile();
// Create an instance of the SVG Generator.
SVGGraphics2D svgGenerator = new SVGGraphics2D(ctx,false);
Printable page = document.getPrintable(i);
page.print(svgGenerator, document.getPageFormat(i), i);
svgGenerator.stream(svgFName);
}
This solution works great, but the resulting SVG files are huge (many times larger than the PDF). I have figured out where the problem is by looking at the SVG in a text editor: it encloses every character of the original document in its own block, even if the font properties of the characters are the same. For example, the word hello will appear as 5 different text blocks. Is there a way to fix the above code? Or please suggest another solution that will work more efficiently.

Inkscape can also be used to convert PDF to SVG. It's actually remarkably good at this, and although the code that it generates is a bit bloated, at the very least, it doesn't seem to have the particular issue that you are encountering in your program. I think it would be challenging to integrate it directly into Java, but inkscape provides a convenient command-line interface to this functionality, so probably the easiest way to access it would be via a system call.
To use Inkscape's command-line interface to convert a PDF to an SVG, use:
inkscape -l out.svg in.pdf
Which you can then probably call using:
Runtime.getRuntime().exec("inkscape -l out.svg in.pdf")
http://download.oracle.com/javase/1.4.2/docs/api/java/lang/Runtime.html#exec%28java.lang.String%29
Note that exec() starts the process and returns immediately, so call waitFor() on the returned Process before reading "out.svg". In any case, Googling "java system call" will yield more info on how to do that part correctly.

Take a look at pdf2svg (also on GitHub):
To use
pdf2svg <input.pdf> <output.svg> [<pdf page no. or "all" >]
When using all give a filename with %d in it (which will be replaced by the page number).
pdf2svg input.pdf output_page%d.svg all
And for some troubleshooting see:
http://www.calcmaster.net/personal_projects/pdf2svg/
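As a rough illustration of how the %d placeholder behaves with "all", the sketch below builds the same command in Python and lists the file names one would expect. The helper names and the three-page count are made up for the example, and the %d expansion is mimicked with Python's % formatting under the assumption that pages are numbered from 1; it is not taken from pdf2svg itself.

```python
def pdf2svg_all_command(pdf_path, pattern):
    """Build the argument list for converting every page with pdf2svg."""
    if "%d" not in pattern:
        raise ValueError("pattern needs a %d placeholder when using 'all'")
    return ["pdf2svg", pdf_path, pattern, "all"]


def expected_outputs(pattern, page_count):
    """File names we would expect, assuming pages are numbered from 1."""
    return [pattern % page for page in range(1, page_count + 1)]


print(pdf2svg_all_command("input.pdf", "output_page%d.svg"))
print(expected_outputs("output_page%d.svg", 3))
```

The returned list can be handed to subprocess.run(command, check=True) once pdf2svg is installed.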

pdftocairo can be used to convert PDF to SVG. pdftocairo is part of poppler-utils.
For example, to convert the first page of a PDF (pages are numbered from 1), run the following command:
pdftocairo -svg -f 1 -l 1 input.pdf

pdftk 82page.pdf burst
sh to-svg.sh
contents of to-svg.sh
#!/bin/bash
FILES=burst/*
for f in $FILES
do
inkscape -l "$f.svg" "$f"
done

I have encountered issues with the suggested inkscape, pdf2svg, pdftocairo, as well as the not suggested convert and mutool when trying to convert large and complex PDFs such as some of the topographical maps from the USGS. Sometimes they would crash, other times they would produce massively inflated files. The only PDF to SVG conversion tool that was able to handle all of them correctly for my use case was dvisvgm. Using it is very simple:
dvisvgm --pdf --output=file.svg file.pdf
It has various extra options for handling how elements are converted, as well as for optimization. Its resulting files can further be compacted by svgcleaner if necessary without perceptual quality loss.

inkscape (@jbeard4) for me produced SVGs with no text in them at all, but I was able to make it work by going to PostScript as an intermediary using Ghostscript.
for page in $(seq 1 `pdfinfo $1.pdf | awk '/^Pages:/ {print $2}'`)
do
pdf2ps -dFirstPage=$page -dLastPage=$page -dNoOutputFonts $1.pdf $1_$page.ps
inkscape -z -l $1_$page.svg $1_$page.ps
rm $1_$page.ps
done
However, this is a bit cumbersome, and the winner for ease of use has to go to pdf2svg (@Koen.), since it has that all flag so you don't need to loop.
However, pdf2svg isn't available on CentOS 8, and to install it you need to do the following:
git clone https://github.com/dawbarton/pdf2svg.git && cd pdf2svg
#if you dont have development stuff specific to this project
sudo dnf config-manager --set-enabled powertools
sudo dnf install cairo-devel poppler-glib-devel
#git repo isn't quite ready to ./configure
touch README
autoreconf -f -i
./configure && make && sudo make install
It produces svgs that actually look nicer than the ghostscript-inkscape one above, the font seems to raster better.
pdf2svg $1.pdf $1_%d.svg all
But that installation is a bit much, too much even if you don't have sudo. On top of that, pdf2svg doesn't support stdin/stdout, so the readily available pdftocairo (@SuperNova) worked a treat in these regards, and here's an example of "advanced" use below:
for page in $(seq 1 `pdfinfo $1.pdf | awk '/^Pages:/ {print $2}'`)
do
pdftocairo -svg -f $page -l $page $1.pdf - | gzip -9 >$1_$page.svg.gz
done
Which produces files of the same quality and size (before compression) as pdf2svg, although not binary-identical (and even visually, jumping between output of the two some pixels of letters shift, but neither looks wrong/bad like inkscape did).
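For anyone who would rather drive the same loop from a script, here is a minimal Python sketch of the pdfinfo + pdftocairo + gzip pipeline above. The function names are invented for this example, and it assumes pdfinfo and pdftocairo are on the PATH; only the command-line flags already shown in this answer are used.

```python
import gzip
import re
import subprocess


def parse_page_count(pdfinfo_output):
    """Extract the number after 'Pages:' from pdfinfo's text output."""
    match = re.search(r"^Pages:\s+(\d+)", pdfinfo_output, re.MULTILINE)
    if match is None:
        raise ValueError("no 'Pages:' line found")
    return int(match.group(1))


def page_command(pdf_path, page):
    """pdftocairo arguments that write one page as SVG to stdout ('-')."""
    return ["pdftocairo", "-svg", "-f", str(page), "-l", str(page), pdf_path, "-"]


def convert_all(stem):
    """Write <stem>_<page>.svg.gz for every page of <stem>.pdf."""
    pdf_path = stem + ".pdf"
    info = subprocess.run(["pdfinfo", pdf_path],
                          capture_output=True, text=True, check=True)
    for page in range(1, parse_page_count(info.stdout) + 1):
        svg = subprocess.run(page_command(pdf_path, page),
                             capture_output=True, check=True).stdout
        with gzip.open(f"{stem}_{page}.svg.gz", "wb", compresslevel=9) as out:
            out.write(svg)
```

Calling convert_all("report") would then mirror running the shell loop with $1 set to report.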

Inkscape does not work with the -l option any more. It said "Can't open file: /out.svg (doesn't exist)". The long form of that option is in the man page as --export-plain-svg; it works but shows a deprecation warning. I was able to fix and update the command by using the -o option on Inkscape 1.1.2-3ubuntu4:
inkscape in.pdf -o out.svg
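If a script has to cope with both old and new Inkscape installations, one option is to choose the flag by version. The sketch below is only an illustration: the function name is hypothetical, the caller is assumed to already know the major version, and the cut-over at version 1 is an assumption based on the -l behaviour reported in this thread rather than something checked against every release.

```python
def inkscape_svg_command(pdf_path, svg_path, major_version):
    """Build an Inkscape PDF-to-SVG command for the given major version."""
    if major_version >= 1:
        # Newer releases: -o names the output file (as used on 1.1.2 above).
        return ["inkscape", pdf_path, "-o", svg_path]
    # Older releases: -l took the output file name directly.
    return ["inkscape", "-l", svg_path, pdf_path]


print(inkscape_svg_command("in.pdf", "out.svg", 1))
print(inkscape_svg_command("in.pdf", "out.svg", 0))
```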

Related

How to make "pandoc" use a specific font when converting a plaintext file to PDF?

After a lot of trouble, I was finally able to run the command without errors:
pandoc -i 1.txt -o 1.pdf
The result is a PDF with completely messed up text because it uses some other font than Courier[ New]. Some varying-width, default font.
After reading and searching for a long time, I found this: https://pandoc.org/MANUAL.html#creating-a-pdf
The option "fontfamily" is mentioned, so I tried to do:
pandoc -i 1.txt -o 1.pdf --fontfamily=Courier
However, this results in:
Unknown option --fontfamily.
Try pandoc --help for more information.
I have looked through the entire "pandoc --help" output without finding any mention of fonts.
How do I set the font to be used?
(I'm trying my very best to not also add: "and why is it so incredibly difficult/cryptic/undocumented to do the most basic imaginable thing?"...)
I'm not even sure that this will fix all the problems. I just assume that the document is all messed up because the font isn't using fixed-width letters.

Inkscape "PDF + Latex" export

I'm using inkscape to produce vector figures, save them in SVG format to export them later as "PDF + Latex" much in the vein of TUG inkscape+pdflatex guide.
Trying to produce a simple figure, however, turns out to be extremely frustrating.
The first figure is an example of the figure I would like to export in the form of "PDF + Latex" (shown here in PNG format).
If I export this to a PDF figure without latex macros the PDF produced looks exactly the same, except for some minor differences with the fonts used to render the text.
When I try to export this using the "PDF + Latex" option, the PDF file produced consists of a 2-page PDF document (again as .png here):
This, of course, does not look good when compiling my latex document. So far the guide at TUG has been very helpful, but I still can't produce a working "PDF + Latex" export from inkscape.
What am I doing wrong?
I worked around this by putting all the text in my drawing at the top
select text and then Object -> Raise to top
Inkscape only generates the separate pages if the text is below another object.
I asked this question on the Inkscape online discussion page and got some very helpful guidance from one of the users there.
This is a known bug https://bugs.launchpad.net/ubuntu/+bug/1417470 which was inadvertently introduced in Inkscape 0.91 in an attempt to fix a previous bug https://bugs.launchpad.net/inkscape/+bug/771957.
It seems this bug does two things:
The *.pdf_tex file will have an extra \includegraphics statement which needs to be deleted manually as described in the link to the bug above.
The *.pdf file may be split into multiple pages, regardless of the size of the image. In my case the line objects were split off onto their own page. I worked around this by turning off the text objects (opacity to zero) and then doing a standard PDF export.
If you can execute linux commands, this works:
# Generate the .pdf and .pdf_tex files
inkscape -z -D --file="$SVGFILE" --export-pdf="$PDFFILE" --export-latex
# Fix the number of pages
sed -i 's/\\\\/\n/g' ${PDFFILE}_tex;
MAXPAGE=$(pdfinfo $PDFFILE | grep -oP "(?<=Pages:)\s*[0-9]+" | tr -d " ");
sed -i "/page=$(($MAXPAGE+1))/,\${/page=/d}" ${PDFFILE}_tex;
with:
$SVGFILE: path of the svg
$PDFFILE: path of the pdf
It is possible to include these commands in a script and execute it automatically when compiling your tex file (so that you don't have to manually export from inkscape each time you modify your svg).
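The page-fixing step can also be sketched in Python. This is a variation on the sed commands above rather than a line-for-line port: it drops every line whose page= number exceeds the real page count. The function name is hypothetical, and the sample lines are hand-written to resemble a .pdf_tex file, not copied from actual Inkscape output.

```python
import re


def drop_extra_pages(pdf_tex_text, max_page):
    """Remove lines that reference a page number beyond max_page."""
    kept = []
    for line in pdf_tex_text.splitlines(keepends=True):
        match = re.search(r"page=(\d+)", line)
        if match and int(match.group(1)) > max_page:
            continue  # this \includegraphics points at a page the PDF lacks
        kept.append(line)
    return "".join(kept)


# Illustrative input: the PDF has one page, but page=2 is referenced too.
sample = (
    "\\put(0,0){\\includegraphics[width=\\unitlength,page=1]{fig.pdf}}%\n"
    "\\put(0.5,0.5){\\makebox(0,0)[lb]{label}}%\n"
    "\\put(0,0){\\includegraphics[width=\\unitlength,page=2]{fig.pdf}}%\n"
)
fixed = drop_extra_pages(sample, 1)
print(fixed, end="")
```

In a real workflow max_page would come from pdfinfo, as in the shell version.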
Try it with an illustration that is less wide.
Alternatively, use a wider paperwidth setting.

text to pdf with utf8 encoding (alternative to a2ps)

The program a2ps does not support UTF-8. At least my version only
supports the latin-X encodings:
a2ps --list=encoding
Version:
GNU a2ps 4.14
How can I convert a simple utf-8 text to postscript or pdf?
If what you actually want is to use a2ps or enscript (which is a similar tool), and if your single need is to use them with some UTF-8 document, you only have to convert your document to ISO-8859-1 or some supported encoding. Various tools allow this. For instance, here is a workflow for enscript (but you can surely do the same with a2ps):
cat document.txt | iconv -c -f utf-8 -t ISO-8859-1 | enscript -o document.ps
But you may lose some characters during the conversion because such encodings have a smaller range than UTF-8.
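To see what kind of loss this means in practice, here is a small Python sketch. It uses errors="ignore" as a stand-in for iconv's -c flag (both silently drop characters the target encoding cannot represent); the sample string is made up for the demonstration.

```python
text = "naïve café costs 5 €"

# Characters outside ISO-8859-1 are silently dropped, like iconv -c.
latin1_bytes = text.encode("iso-8859-1", errors="ignore")
round_trip = latin1_bytes.decode("iso-8859-1")

# Which characters did not survive the conversion?
lost = sorted(set(text) - set(round_trip))
print(repr(round_trip))
print(lost)
```

Accented Western European letters survive, but the euro sign does not, since it is not part of ISO-8859-1.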
On the other hand, if UTF-8 is a requirement, you may rather have to look for some recent tool allowing to convert UTF-8 to PDF. I wrote myself a Python program called txt2pdf; you may find it here. Have also a look at tools like pandoc, gimli, rst2pdf or wkhtmltopdf.
You can use Vim. Open the file and execute the command :hardcopy > output.ps in normal mode. You can also do this directly from the shell. Executing
$ vim -c ":hardcopy > output.ps" -c ":quit" input.txt
in your shell will open Vim, generate the output.ps, and then close Vim.
Use paps! For instance, I use it as follows:
paps --font="Monospace 10" input.txt > output.ps
and I have no problem with utf encoding.
If you need a PDF file, then convert the PostScript output with ps2pdf:
ps2pdf output.ps
I've gotten acceptable results (for printing code listings) from https://github.com/arsv/u2ps
https://gitlab.com/gnomify/u2ps is the replacement of gnome-u2ps.
If the text file is small, paps converts the text to ps, which can then be fed to ps2pdf. The problem is that the ps file from paps causes ps2pdf to create a very big PDF file. If that is OK, this is possible. Currently, I am getting large PDF files from paps.
There's a utility based on gnome libraries and named gnome-u2ps. It has less functionality than a2ps, and it seems that it is not maintained anymore.

How to create large PDF files (10MB, 50MB, 100MB, 200MB, 500MB, 1GB, etc.) for testing purposes?

I tried this:
for ((i=1; i<=10; i++)); do convert 100MB.pdf 10MB.pdf 100MB.pdf; done
to create a 100MB file, but very quickly ran out of RAM.
The most simple tool: use pdftk (or pdftk.exe, if you are on Windows):
pdftk 10_MB.pdf 100_MB.pdf cat output 110_MB.pdf
This will be a valid PDF. Download pdftk here.
Update: if you want really large (and valid!), non-optimized PDFs, use this command:
pdftk 100MB.pdf 100MB.pdf 100MB.pdf 100MB.pdf 100MB.pdf cat output 500_MB.pdf
or even (if you are on Linux, Unix or Mac OS X):
pdftk $(for i in $(seq 1 100); do echo -n "100MB.pdf "; done) cat output 10_GB.pdf
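The repetition trick generalises to any target size. The sketch below only builds the pdftk argument list; the function name and the byte counts are hypothetical, and it assumes the output grows roughly in proportion to the number of copies, which matches the "non-optimized" behaviour described above but is not guaranteed.

```python
import math


def pdftk_repeat_command(source_pdf, source_bytes, target_bytes, output_pdf):
    """Repeat source_pdf often enough to reach about target_bytes."""
    copies = max(1, math.ceil(target_bytes / source_bytes))
    return ["pdftk"] + [source_pdf] * copies + ["cat", "output", output_pdf]


command = pdftk_repeat_command("100MB.pdf", 100 * 10**6, 500 * 10**6, "500_MB.pdf")
print(" ".join(command))
```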
Windows: fsutil
Usage:
fsutil file createnew [filename].[extension] [# of bytes]
Source: https://www.windows-commandline.com/how-to-create-large-dummy-file/
Linux: fallocate
Usage:
fallocate -l 10G [filename].[extension]
Source: Quickly create a large file on a Linux system?
For those using macOS mkfile might be a good alternative to fallocate or dd
mkfile 100m some100mfile.pdf
reference -
https://stackoverflow.com/a/33478049/711401
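A platform-independent way to get the same kind of dummy file as fsutil, fallocate or mkfile is a few lines of Python. Like those tools, this produces a file of the requested size, not a valid PDF; the function name and the 10 MiB size are just for the example.

```python
import os
import tempfile


def create_dummy_file(path, size_bytes):
    """Create a file of exactly size_bytes (contents are not a valid PDF)."""
    with open(path, "wb") as handle:
        handle.truncate(size_bytes)  # extends the empty file to the given size


with tempfile.TemporaryDirectory() as workdir:
    target = os.path.join(workdir, "dummy.pdf")
    create_dummy_file(target, 10 * 1024 * 1024)
    size = os.path.getsize(target)
    print(size)
```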
according to http://www.maketecheasier.com/combine-multiple-pdf-files-with-pdftk/ the command should be
pdftk file1.pdf file2.pdf file3.pdf cat output newfile.pdf
note that you should download windows version of pdftk
I had problems using pdftk with the cat parameter and had better success with output.
The following command worked for me:
pdftk file_1.pdf file_1.pdf file_1.pdf file_1.pdf cat output.pdf
Using cat produced the following error:
Error: Unexpected text in page range end, here:
output.pdf
Exiting.
Acceptable keywords, for example: "even" or "odd".
To rotate pages, use: "north" "south" "east"
"west" "left" "right" or "down"
Errors encountered. No output created.
Done. Input errors, so no output created.
http://www.pdflabs.com/docs/pdftk-cli-examples/.
I created a 172 MB PDF in no time at all.
If you want a really big valid PDF file, then
take the biggest valid PDFs you can find
and merge them with a tool like PDF24Creator.
This worked for me to create a big file (140MB) after a few minutes.
Under Linux there is pdfunite (part of poppler) that can concatenate the same pdf files to get one large pdf file:
pdfunite in.pdf in.pdf in.pdf out.pdf
see manpage
Partly it depends on what you are trying to increase the size of... number of pages, number of images, size of a single image. In my experience, the vast bulk (90%+) of any given 'large' PDF file will be the images.
You could try using a pro product like Adobe InDesign to quickly build a large project and export it as a PDF.
Adobe Acrobat Pro has built-in tools to optimize PDF files -- you could try using the tools to 'un-optimize' your file. :)
One possibility is, if you are familiar with PDF format:
Create a simple PDF with one page (the page should be contained within one object)
Copy the object multiple times
Add references to the copied objects to the page catalog
Fix the xref table
You get a valid document of any size, and the entire file will be processed by a reader.
Have you tried using cat to combine the files?
cat 10MB.pdf 10MB.pdf > 20MB.pdf
That should result in a 20MB file.

Does grep work properly on pdf files?

Is it possible to search multiple PDF files using the 'grep' command? It doesn't seem to work. How do people search the content of multiple PDF files?
Well, PDF is a binary format, and grep can search binary files as if they were text
grep -a
or you can just use pdftotext (which comes with xpdf) like this:
pdftotext whee.pdf - | grep pattern
You don't mention which OS you're using, but under Mac OS X you can use mdfind from the command line:
mdfind -onlyin search/directory/path "kind:pdf search text"
Use something like Solr or CLucene; I think they can do what you want.
PDF is a binary format; that's why searching it with grep is not that helpful. You can search the strings in a PDF with grep like this:
ls dir_with_pdfs/*.pdf|xargs strings|grep "keyword"
Or you can use the pdftotext command on the PDFs and then search the result with grep.
The tool pdfgrep will do the work. It has a syntax similar to grep. To search in several files, a simple shell one-liner is enough. For example:
$> ls Documents/*.pdf | xargs pdfgrep -n -H "system"
Documents/2005-DoddGutierrezRO-MAN1.pdf:1: designed episodic memory system
Documents/2005-DoddGutierrezRO-MAN1.pdf:1: how ISAC's episodic memory system is
Documents/2005-DoddGutierrezRO-MAN1.pdf:1: cognitive system employs a combination
....
PDF is a binary dump of objects used to display the pages. There may be some metadata you can grep, but the actual page text is in a PostScript-like stream and may be encoded in a variety of ways. It's also not guaranteed to be in any order. You need to think of PDF as more like a vector image file than a text file.
There is a short article explaining text in PDFs in more detail at http://pdf.jpedal.org/java-pdf-blog/bid/27187/Understanding-the-PDF-file-format-text-streams
If you have pdftotext installed via the poppler package, then try this Perl script:
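#!/usr/bin/perl
use strict;
use warnings;
# Usage: script.pl PATTERN file1.pdf [file2.pdf ...]
my $p = shift;
foreach my $fn (@ARGV) {
    # List-form open avoids shell quoting problems with odd file names.
    open(my $fh, '-|', 'pdftotext', $fn, '-') or die "cannot run pdftotext on $fn: $!";
    while (<$fh>) { print "$fn:$_" if /$p/; }
    close($fh);
}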
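The same idea as the Perl script above can be sketched in Python: extract the text with pdftotext and print matching lines prefixed by the file name. The function names are invented for this example, and it assumes pdftotext is on the PATH.

```python
import re
import subprocess


def matching_lines(file_name, text, pattern):
    """Return 'file:line' strings for every line of text matching pattern."""
    regex = re.compile(pattern)
    return [f"{file_name}:{line}" for line in text.splitlines() if regex.search(line)]


def pdf_text(file_name):
    """Run pdftotext with '-' as the output so the text arrives on stdout."""
    result = subprocess.run(["pdftotext", file_name, "-"],
                            capture_output=True, text=True, check=True)
    return result.stdout


def grep_pdfs(pattern, file_names):
    """Print every matching line from each PDF, grep -H style."""
    for name in file_names:
        for hit in matching_lines(name, pdf_text(name), pattern):
            print(hit)
```

For example, grep_pdfs("system", ["a.pdf", "b.pdf"]) would behave much like pdfgrep -H "system" a.pdf b.pdf.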