My overall goal is to make some PDF files conform to the PDF/A standard for archival purposes. They fail one requirement, namely that some glyph mappings map to 0, which they should not.
My usual strategy was to use an old program called "Pdfedit" to decode the PDF files so that all byte streams became human-readable, edit the part of the PDF containing the glyph mappings, and then open the file with Adobe Acrobat, which automatically re-encoded it.
Now I have some PDFs that cause "Pdfedit" to crash upon opening. I tried using PDF-Parser but its output cannot be re-encoded by Adobe Acrobat.
When decoded, the relevant parts used to look like this:
/CMapType 2 def
1 begincodespacerange
<00><04>
endcodespacerange
5 beginbfchar
<00><0000>
<01><0000>
<02><263A>
<03><0000>
<04><0000>
endbfchar
endcmap
But now I use the following command:
python3 pdf-parser.py -f -n /path/to/file.pdf > dump.txt
Inside dump.txt the relevant part looks like this:
b'/CMapType 2 def\n1 begincodespacerange\n<00><04>\nendcodespacerange\n5 beginbfchar\n<00><0000>\n<01><0000>\n<02><263A>\n<03><0000>\n<04><0000>\nendbfchar\nendcmap\nCMapName currentdict/CMap defineresource pop end end'
So it is a bytestring and any linebreak is rendered literally as \n. The txt file that contains this cannot be interpreted as a PDF by Adobe Acrobat.
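For anyone who only wants to read such a dumped stream, here is a minimal sketch of turning the b'...' text back into real lines. It assumes the dumped line is a valid Python bytes literal; decode_bytes_repr is a made-up helper name, not part of pdf-parser. It does not produce a PDF that Acrobat can open, it only makes the stream content legible.

```python
import ast


def decode_bytes_repr(line: str) -> str:
    """Turn the text "b'...'" (a Python bytes repr) back into readable text."""
    value = ast.literal_eval(line.strip())
    if not isinstance(value, bytes):
        raise ValueError("expected a Python bytes literal such as b'...'")
    # latin-1 maps every byte to exactly one character, so decoding cannot fail
    return value.decode("latin-1")


# Shortened sample in the same shape as the dump shown above
sample = r"b'5 beginbfchar\n<00><0000>\nendbfchar'"
print(decode_bytes_repr(sample))
```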
I have now also realized that many elements such as %%EOF are delimited by ''.
The true issue is how to get an Acrobat-readable output from pdf-parser.py, as the shell redirection > does not work and stdout in the shell is also faulty.
I will try out a few things but could really use some help on this!
Answering my own question in case this is relevant for someone down the line.
Didier Stevens, the dev behind the pdf-parser, answered that his tool is not made for this. He recommended qpdf instead.
That was indeed the solution. Make sure you use the flag --stream-data=uncompress so that compressed parts are also accessible in the output. The command to use with qpdf is:
qpdf old_file.pdf --stream-data=uncompress --decode-level=all new_file.txt
You can also output new_file as .pdf. In either case you will be able to open it in a text editor. Once you're done applying the changes you wish to apply, you can change the extension to .pdf and process it further with Acrobat or any other conversion program.
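Once the file is uncompressed, the offending entries still have to be found. Below is a rough sketch of locating them, assuming the CMap looks like the decoded example in the question; zero_mappings is a hypothetical helper name, and the script only reports the source codes whose destination is all zeros, it does not edit the file.

```python
import re
from typing import List

# One bfchar entry: <source code> followed by <destination as hex>
BFCHAR_ENTRY = re.compile(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>")


def zero_mappings(cmap_text: str) -> List[str]:
    """Return the source codes inside beginbfchar blocks that map to zero."""
    hits = []
    # Only look between beginbfchar/endbfchar so codespacerange lines are ignored
    for block in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, flags=re.S):
        for src, dst in BFCHAR_ENTRY.findall(block):
            if int(dst, 16) == 0:
                hits.append(src)
    return hits


cmap = """1 begincodespacerange
<00><04>
endcodespacerange
5 beginbfchar
<00><0000>
<01><0000>
<02><263A>
<03><0000>
<04><0000>
endbfchar
endcmap"""
print(zero_mappings(cmap))
```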
I am using a shell script to modify many pdfs and would like to create a script that adds the page number (1 of X format) to the bottom of PDFs in a directory along with the text of the filename.
I tried using pdfjam with this format:
pdfjam --pagenumbering true
but it fails, saying that pagenumbering is undefined.
Any other recommendations how to do this? I am OK installing other tools but would like this to all be within a shell script.
Thank you
tl;dr: pdfjam --pagecommand '' input.pdf
By default, pdfjam adds the following LaTeX command to every page: \thispagestyle{empty}. By changing the command to an empty command, the default plain page style is used, which consists of a page number at the bottom. Of course you may want to play with other styles or layout options to position the page number differently.
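To run this over a whole directory, something like the following sketch could build one pdfjam call per file. The function name and the "-numbered" suffix are my own assumptions; the only pdfjam options used are --pagecommand from above and --outfile. Note that this only yields the plain page number, not the "1 of X" format or the filename text.

```python
from pathlib import Path
from typing import List


def pdfjam_commands(directory: str, out_suffix: str = "-numbered") -> List[List[str]]:
    """Build one pdfjam argument list per PDF found in the directory."""
    commands = []
    for pdf in sorted(Path(directory).glob("*.pdf")):
        if pdf.stem.endswith(out_suffix):
            continue  # skip files produced by an earlier run
        out = pdf.with_name(pdf.stem + out_suffix + ".pdf")
        # The empty page command replaces pdfjam's default \thispagestyle{empty}
        commands.append(
            ["pdfjam", "--pagecommand", "", "--outfile", str(out), str(pdf)]
        )
    return commands


# Print the commands for the current directory; to actually run them,
# pass each list to subprocess.run(cmd, check=True).
for cmd in pdfjam_commands("."):
    print(cmd)
```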
I have a number of svg files created with inkscape that contain text in non-standard fonts. As far as I understand, in order to have them printed I need to convert the text to paths. It seems that if I just use
convert input.svg output.pdf
the text is automatically converted to paths. Is this correct?
However, my problem is with the page size. The input SVGs have a page size of A5, landscape, but the converted PDFs seem to be cut off at the right and bottom by about 5% of the image width/height.
Why is that? How do I fix it?
As long as you have Inkscape on your system, ImageMagick convert actually delegates the PDF export to Inkscape. You can use it directly on the command line as
inkscape -zA output.pdf input.svg
Quote from man:
Used fonts are subset and embedded.
There are some options to manipulate the export area. -C explicitly sets the page area, -D the drawing bounding box.
You could even preserve the SVG format by using
inkscape -Tl output.svg input.svg
which would convert text to path.
Lastly, since you have to batch-process multiple files, you should open a shell with
inkscape --shell
and process all files in one go. Otherwise, the startup time of Inkscape would be 1-3 seconds for every file. Something like:
ls -1 *.svg | awk -F. \
'{ print "-CA " $1 ".pdf " $0 }
END { print "quit" }' | \
inkscape --shell
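If file names contain more than one dot, the awk field split above gets the base name wrong. Here is a small sketch of generating the same command lines more carefully; inkscape_shell_lines is a made-up name, the options are the old 0.9x ones used in this answer, and names containing spaces are still not handled.

```python
import os
from typing import Iterable, List


def inkscape_shell_lines(svg_files: Iterable[str]) -> List[str]:
    """Build the lines to feed to `inkscape --shell` (0.9x option syntax)."""
    lines = []
    for svg in svg_files:
        # splitext only strips the last extension, so "map.v2.svg" -> "map.v2"
        base, _ext = os.path.splitext(svg)
        lines.append(f"-CA {base}.pdf {svg}")
    lines.append("quit")
    return lines


print("\n".join(inkscape_shell_lines(["a.svg", "map.v2.svg"])))
```

The printed lines can then be piped into inkscape --shell in the same way as the awk output.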
I'm using TexStudio 2.8.4 to create a pdf containing knitr output and I'm running into issues with symbols showing up incorrectly either in the pdf or when copied and pasted from the pdf. Here's a minimal working example.
\documentclass{beamer}
\begin{document}
\begin{frame}[fragile]
<<>>=
#dollar$sign
if(2+2 == 4){print("math")}
@
\end{frame}
\end{document}
In my pdf output, the $ in the commented-out line shows up as the pound (currency) sign, but when copied and pasted it shows up correctly as a dollar sign. This does not occur when the line is not commented out.
More problematically, while the braces {} appear correct in the pdf output, when copied and pasted they become f and g. This confusion does not affect R's interpretation of the braces, however.
Do you have any thoughts/suggestions for fixing this? As a workaround, I'm just using a non-echoed knitr block and a LaTeX verbatim environment for the displayed code, though this is not ideal.
The command I'm using in my custom build is:
"C:/Program Files/R/R-3.2.2/bin/Rscript.exe" -e "library(knitr); knit2pdf('%.Rnw')" | pdflatex -synctex=1 -interaction=nonstopmode %.tex | "C:/Program Files (x86)/Adobe/Reader 11.0/Reader/AcroRd32.exe" "?am.pdf"
Cheers!
This seems to be a font encoding problem in LaTeX. The solution is to add \usepackage[T1]{fontenc} to your preamble.
I want to convert PDF to SVG. Please suggest some libraries/executables that can do this efficiently. I have written my own Java program using the Apache PDFBox and Batik libraries:
PDDocument document = PDDocument.load( pdfFile );
DOMImplementation domImpl =
GenericDOMImplementation.getDOMImplementation();
// Create an instance of org.w3c.dom.Document.
String svgNS = "http://www.w3.org/2000/svg";
Document svgDocument = domImpl.createDocument(svgNS, "svg", null);
SVGGeneratorContext ctx = SVGGeneratorContext.createDefault(svgDocument);
ctx.setEmbeddedFontsOn(true);
// Ask the test to render into the SVG Graphics2D implementation.
for(int i = 0 ; i < document.getNumberOfPages() ; i++){
String svgFName = svgDir+"page"+i+".svg";
(new File(svgFName)).createNewFile();
// Create an instance of the SVG Generator.
SVGGraphics2D svgGenerator = new SVGGraphics2D(ctx,false);
Printable page = document.getPrintable(i);
page.print(svgGenerator, document.getPageFormat(i), i);
svgGenerator.stream(svgFName);
}
This solution works great, but the size of the resulting SVG files is huge (many times greater than the PDF). I have figured out where the problem is by looking at the SVG in a text editor: it encloses every character of the original document in its own block, even if the font properties of the characters are the same. For example, the word hello will appear as 5 different text blocks. Is there a way to fix the above code? Or please suggest another solution that will work more efficiently.
Inkscape can also be used to convert PDF to SVG. It's actually remarkably good at this, and although the code that it generates is a bit bloated, at the very least, it doesn't seem to have the particular issue that you are encountering in your program. I think it would be challenging to integrate it directly into Java, but inkscape provides a convenient command-line interface to this functionality, so probably the easiest way to access it would be via a system call.
To use Inkscape's command-line interface to convert a PDF to an SVG, use:
inkscape -l out.svg in.pdf
Which you can then probably call using:
Runtime.getRuntime().exec("inkscape -l out.svg in.pdf")
http://download.oracle.com/javase/1.4.2/docs/api/java/lang/Runtime.html#exec%28java.lang.String%29
Note that exec() does not wait for the process to finish: it returns a Process object immediately, so you should call waitFor() on that object before reading "out.svg". In any case, Googling "java system call" will yield more info on how to do that part correctly.
Take a look at pdf2svg (also on GitHub):
Usage:
pdf2svg <input.pdf> <output.svg> [<pdf page no. or "all" >]
When using all give a filename with %d in it (which will be replaced by the page number).
pdf2svg input.pdf output_page%d.svg all
And for some troubleshooting see:
http://www.calcmaster.net/personal_projects/pdf2svg/
pdftocairo can be used to convert PDF to SVG. pdftocairo is part of poppler-utils.
For example, to convert the 2nd page of a PDF, the following command can be run:
pdftocairo -svg -f 2 -l 2 input.pdf
pdftk 82page.pdf burst
sh to-svg.sh
Contents of to-svg.sh:
#!/bin/bash
# Convert every file in burst/ to SVG with Inkscape
for f in burst/*
do
inkscape -l "$f.svg" "$f"
done
I have encountered issues with the suggested inkscape, pdf2svg, pdftocairo, as well as the not suggested convert and mutool when trying to convert large and complex PDFs such as some of the topographical maps from the USGS. Sometimes they would crash, other times they would produce massively inflated files. The only PDF to SVG conversion tool that was able to handle all of them correctly for my use case was dvisvgm. Using it is very simple:
dvisvgm --pdf --output=file.svg file.pdf
It has various extra options for handling how elements are converted, as well as for optimization. Its resulting files can further be compacted by svgcleaner if necessary without perceptual quality loss.
inkscape (@jbeard4) for me produced svgs with no text in them at all, but I was able to make it work by going to postscript as an intermediary using ghostscript.
for page in $(seq 1 `pdfinfo $1.pdf | awk '/^Pages:/ {print $2}'`)
do
pdf2ps -dFirstPage=$page -dLastPage=$page -dNoOutputFonts $1.pdf $1_$page.ps
inkscape -z -l $1_$page.svg $1_$page.ps
rm $1_$page.ps
done
However, this is a bit cumbersome, and the winner for ease of use has to go to pdf2svg (@Koen.) since it has that all flag so you don't need to loop.
However, pdf2svg isn't available on CentOS 8, and to install it you need to do the following:
git clone https://github.com/dawbarton/pdf2svg.git && cd pdf2svg
#if you dont have development stuff specific to this project
sudo dnf config-manager --set-enabled powertools
sudo dnf install cairo-devel poppler-glib-devel
#git repo isn't quite ready to ./configure
touch README
autoreconf -f -i
./configure && make && sudo make install
It produces svgs that actually look nicer than the ghostscript-inkscape one above, the font seems to raster better.
pdf2svg $1.pdf $1_%d.svg all
But that installation is a bit much, too much even if you don't have sudo. On top of that, pdf2svg doesn't support stdin/stdout, so the readily available pdftocairo (@SuperNova) worked a treat in these regards, and here's an example of "advanced" use below:
for page in $(seq 1 `pdfinfo $1.pdf | awk '/^Pages:/ {print $2}'`)
do
pdftocairo -svg -f $page -l $page $1.pdf - | gzip -9 >$1_$page.svg.gz
done
Which produces files of the same quality and size (before compression) as pdf2svg, although not binary-identical (and even visually, jumping between output of the two some pixels of letters shift, but neither looks wrong/bad like inkscape did).
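For what it's worth, the same per-page loop can be sketched in Python. The helper names are my own, and the pdfinfo text in the sample is a made-up stand-in for real output; the pdftocairo options are the same -svg, -f and -l used above, with an explicit output file instead of stdout.

```python
import re
from typing import List


def page_count(pdfinfo_output: str) -> int:
    """Read the page count from the 'Pages:' line of pdfinfo's output."""
    match = re.search(r"^Pages:\s+(\d+)", pdfinfo_output, flags=re.M)
    if match is None:
        raise ValueError("no 'Pages:' line found")
    return int(match.group(1))


def pdftocairo_commands(stem: str, pages: int) -> List[List[str]]:
    """Build one pdftocairo call per page, writing <stem>_<page>.svg."""
    return [
        ["pdftocairo", "-svg", "-f", str(p), "-l", str(p),
         f"{stem}.pdf", f"{stem}_{p}.svg"]
        for p in range(1, pages + 1)
    ]


# Made-up sample text standing in for the output of `pdfinfo doc.pdf`
sample = "Title:          demo\nPages:          3\nEncrypted:      no\n"
for cmd in pdftocairo_commands("doc", page_count(sample)):
    print(" ".join(cmd))
```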
Inkscape does not work with the -l option any more. It said "Can't open file: /out.svg (doesn't exist)". The long form of that option is in the man page as --export-plain-svg and works but shows a deprecation warning. I was able to fix and update the command by using the -o option on Inkscape 1.1.2-3ubuntu4:
inkscape in.pdf -o out.svg
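If a script has to run on machines with either the old or the new Inkscape, a small sketch like this could pick the right form. The function name and the major_version parameter are assumptions of mine; how you find out the installed version is left open.

```python
from typing import List


def inkscape_pdf_to_svg(pdf: str, svg: str, major_version: int) -> List[str]:
    """Return the Inkscape argument list for converting one PDF to SVG."""
    if major_version >= 1:
        # Inkscape 1.x: output file is given with -o
        return ["inkscape", pdf, "-o", svg]
    # Inkscape 0.9x: -l exports plain SVG to the given file
    return ["inkscape", "-l", svg, pdf]


print(inkscape_pdf_to_svg("in.pdf", "out.svg", 1))
print(inkscape_pdf_to_svg("in.pdf", "out.svg", 0))
```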