How can I programmatically convert SVG files containing text to PDF files (specifically on CentOS 5.3 x86_64)? - pdf

I would like to programmatically convert SVG files to PDF files. However, the SVG files contain text that must be searchable in the generated PDF files. Also, it has to work on Red Hat Enterprise Linux 5.3 or CentOS 5.3 for the x86_64 architecture. It would be nice if it were Open Source or at least not very expensive.
Here is what I've tried. All of these, except Batik, work fine on Debian Lenny.
Inkscape
I can get it installed using autopackages from http://inkscape.modevia.com/ap, but when I use it from the command line, the text is not searchable.
Batik rasterizer [sic]
When it converts SVG files to PDF files, the text is no longer searchable.
svg2pdf
The source for this and several of its dependencies are available to download. I have been trying to get it to compile on CentOS, but haven't had success yet. I found a precompiled version for Debian x86_64, but it doesn't work on CentOS.
rsvg-convert
Generated PDF isn't searchable on CentOS 5.3. Perhaps installing a newer version of cairo would help. Thanks to DaveParillo for mentioning rsvg-convert (on superuser).
SOLUTION (but perhaps some of the above will still be useful to the reader)
princeXML
It works fine on CentOS when installed from source. For some reason it doesn't work when installed from the .rpm. Thanks Erik Dahlström!
Cross posted on superuser

You could try princexml, it's free for non-commercial use.
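A minimal sketch of driving Prince from a script, assuming the `prince` binary is on the PATH and accepts the SVG file directly as input (the question reports that this worked on CentOS when Prince was installed from source). The function names are hypothetical helpers for illustration; only the `-o` output option is taken from Prince's command line.

```python
import shutil
import subprocess
from pathlib import Path


def prince_command(svg_path, pdf_path, prince_bin="prince"):
    """Build the argument list for one SVG -> PDF conversion."""
    return [prince_bin, str(svg_path), "-o", str(pdf_path)]


def convert_svg_to_pdf(svg_path, pdf_path=None, prince_bin="prince"):
    """Run Prince on svg_path and return the path of the PDF it wrote."""
    svg_path = Path(svg_path)
    # Default to the same file name with a .pdf suffix.
    pdf_path = Path(pdf_path) if pdf_path else svg_path.with_suffix(".pdf")
    if shutil.which(prince_bin) is None:
        raise FileNotFoundError(f"{prince_bin!r} not found on PATH")
    # check=True raises CalledProcessError if Prince exits with an error.
    subprocess.run(prince_command(svg_path, pdf_path, prince_bin), check=True)
    return pdf_path
```

Calling `convert_svg_to_pdf("drawing.svg")` would write `drawing.pdf` next to the input. Whether the text in the result is searchable still has to be verified on the target machine, for example by opening the PDF and searching for a known word.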

Related

Tess4J - Native library (linux-x86-64/libtesseract.so) not found in resource path

I'm using Tess4J (JNA wrapper around tesseract), and trying to call tess.doOCR(myFile) to OCR text from a single-page PDF.
I have GhostScript installed (by using yum install ghostscript), gs -h works correctly.
My app server is using a 64-bit JVM, and I have gsdll64.dll and the 64-bit Tesseract DLLs, liblept168.dll and libtesseract302.dll, in the class path.
When tess.doOCR(myFile) is called, this is logged:
GPL Ghostscript 8.70 (2014-09-22)
Copyright (C) 2014 Artifex Software, Inc. All rights reserved.
This software comes with NO WARRANTY: see the file PUBLIC for details.
Processing pages 1 through 1.
Page 1
But then it just stops there. The program doesn't go any further.
UPDATE --
It looks like the real issue is from this error:
java.lang.UnsatisfiedLinkError: Unable to load library 'tesseract': Native library (linux-x86-64/libtesseract.so) not found in resource path
After looking around a lot, I don't see a convenient place to find this libtesseract.so file, and I'm not sure what it takes to get this onto my Linux app server. I read that maybe I need to download some C++ runtime, but I don't see a Linux download for that. Any advice would be much appreciated.
Or is this something to do with a symbolic link?
The fix was simple for me: just run sudo apt-get install tesseract-ocr from the command line. On Linux you don't need to worry about the DLLs or the JVM version; installing Tesseract with apt-get will do the trick.
Those DLLs are for Windows. For Linux, you'll need to install or build from Tesseract source.
That GS version, 8.70, is quite old. The latest Ghost4J library that Tess4J uses is not compatible with that.
Tess4J should include required libraries. However, you need to extract them first.
This should do the trick:
File tmpFolder = LoadLibs.extractTessResources("win32-x86-64"); // replace platform
System.setProperty("java.library.path", tmpFolder.getPath());
Replace the argument of extractTessResources(..) with your platform. You can find the possible options by looking inside the Tess4J jar file.
This way you do not need to install Tesseract on your system.
Recently I wrote a blog post about Tess4J in which I used this technique. Maybe it can help if you need further information or a running example project.
sudo apt-get update
sudo apt-get install tesseract-ocr
Download the language data (tessdata) with git:
https://github.com/tesseract-ocr/tessdata
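After installing Tesseract, it can help to confirm that the shared library named in the UnsatisfiedLinkError is visible to the system at all. The following is a small diagnostic sketch using Python's standard `ctypes.util.find_library`; the helper name is hypothetical. It only reflects the system's own library lookup, which is an assumption about where JNA will look and not a description of JNA's exact search rules.

```python
import ctypes.util


def find_native_library(name="tesseract"):
    """Return the library file name the system resolves for `name`, or None."""
    return ctypes.util.find_library(name)


if __name__ == "__main__":
    found = find_native_library("tesseract")
    if found is None:
        print("libtesseract not found; install the distribution package "
              "or build Tesseract from source")
    else:
        print("found:", found)
```

If this prints that the library is missing, the Windows DLLs on the class path are not being used and the Linux package (or a source build) is still needed.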

LibreOffice 4.3 does not convert to PDF (command line), but no errors reported

We are running Fedora on a dedicated server:
Linux host.**obscured**.<tld> 2.6.18-348.6.1.el5 #1 SMP Tue May 21
15:29:55 EDT 2013 x86_64 x86_64 x86_64 GNU/Linux
One important aspect of our web application is the ability to upload all types of documents, such as Open Office or Microsoft, and convert them dynamically to PDF, so they can be displayed on the site, and also using a JPG thumbnail created from the PDF asset.
Until recently, this function worked great, using LibreOffice 4.0. We used the soffice binary to dynamically convert uploaded files in a background shell command.
Then suddenly, LibreOffice stopped working, and we could not restore it, so we downloaded and installed LibreOffice 4.3.
The program now works, in the sense that it no longer bombs when forking off the process, but the conversion no longer works and produces neither output nor errors.
We essentially use the same syntax as from LibreOffice 4.0, which used to work correctly:
/opt/libreoffice4.3/program/soffice --headless --convert-to pdf --nofirststartwizard
--outdir **obscured** --nofirststartwizard **obscured**.docx
(I have obscured certain information here, intentionally, for the privacy of our users)
Again, this same syntax used to work with LibreOffice 4.0, until it broke, presumably due to an update of Java JRE on the server (we're not 100% sure...)
I cross-checked the syntax against online resources.
There was also mention of not being able to convert when another LibreOffice instance is running, and I checked that this was the only process!
Any thoughts or ideas will be appreciated, as this function is an important part of the application user experience
I have the same problem. After running strace, I could see that fonts are missing.
http://ask.libreoffice.org/en/question/30069/pdf-font-embedding-in-libreoffice-42/
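Because the conversion fails silently, it can be useful to wrap the call so that a missing output file is treated as an error and whatever soffice wrote to stdout or stderr is preserved. This is a minimal sketch, assuming the install path from the question and that the PDF is named after the input file's stem; the function names are hypothetical.

```python
import subprocess
from pathlib import Path

# Install path taken from the question; adjust for your system.
SOFFICE = "/opt/libreoffice4.3/program/soffice"


def soffice_command(source, outdir, soffice=SOFFICE):
    """Build the headless conversion command for one document."""
    return [soffice, "--headless", "--convert-to", "pdf",
            "--outdir", str(outdir), str(source)]


def convert_to_pdf(source, outdir, soffice=SOFFICE, timeout=120):
    """Convert `source` to PDF and fail loudly if no file was produced."""
    source, outdir = Path(source), Path(outdir)
    result = subprocess.run(soffice_command(source, outdir, soffice),
                            capture_output=True, text=True, timeout=timeout)
    # Assumption: the output is written as <input stem>.pdf inside outdir.
    expected = outdir / (source.stem + ".pdf")
    if result.returncode != 0 or not expected.exists():
        raise RuntimeError(
            f"conversion failed (exit {result.returncode})\n"
            f"stdout: {result.stdout}\nstderr: {result.stderr}")
    return expected
```

If the wrapper raises but both streams are empty, tracing the process with strace, as the answer above did, is the next step for spotting missing fonts or other files.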

Jasper generated pdf report character rendering issue for language like Hindi

A report generated by Jasper does not render Hindi text properly in PDF output.
For example, the correct rendering should look like this:
५ दिन
but it is rendered incorrectly.
The issue affects only some characters.
My application is running on Ubuntu 12.04.2 LTS (GNU/Linux 3.5.0-23-generic x86_64).
I used the Mangal TTF file to render Hindi.
I have tried the following:
iReport PDF encoding setup:
fontName="Mangal" size="9" isStrikeThrough="false" pdfFontName="akshar.ttf" pdfEncoding="Identity-H" isPdfEmbedded="true"
Creating a font extension and including its jar file in my application.
Upgrading my application's Jasper library.
Upgrading the Mangal TTF file.
Upgrading Ubuntu packages, specifically mscorefonts.
Help needed ASAP.
Thanks.

I've installed ImageMagick, added the proper line to development.rb, but Paperclip still won't let me upload photos

Using Windows 7. I've installed ImageMagick in C:\Program Files\, and I've tried adding the line
Paperclip.options[:command_path] = "C:\Program Files"
as well as
Paperclip.options[:command_path] = "C:\Program Files\ImageMagick-6.6.7-Q16"
to config/environment/development.rb. I still get the following error when trying to upload an image:
/Local/Temp/stream20110212-5000-s69b6a.png is not recognized by the 'identify' command.
Any ideas?
I had the same issue with a rails app using paperclip. In my case I'm on a Mac, but the error was:
/var/folders/es/es6TeBjk2RasAk+1YvsY6++++TI/-Tmp-/stream20110214-45420-usux9f-0.png is not recognized by the 'identify' command.
It seems to occur if ImageMagick (IM) is not compiled with PNG support. The Windows binaries should come with PNG as one of the default delegates in the installation files.
Try reinstalling the latest version of IM. The alternative is to recompile ImageMagick with the PNG library available. This is how I fixed the issue on my Mac.
On Windows, recompilation is not so simple. ImageMagick offers a download of the libpng library from its delegates folder, along with instructions for an advanced Windows install using delegates. It requires the MS Visual Studio IDE to compile.

Server side solution to convert PDF to SWF?

I've found some tools that can convert PDF to SWF, but I'm hoping to find a tool or library that I can run on the server, so that the resulting SWF can be stored in a database.
Have you tried SWFTools? On Debian/Ubuntu:
sudo apt-get install swftools
pdf2swf filename.pdf
http://wiki.swftools.org/index.php/Examples#pdf2swf
There is also a Windows version, and the source is available so you can compile it on other Linux distributions.
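For the server-side case, pdf2swf can be called from application code and the result read back as bytes, ready to be stored in a database column. This is a minimal sketch, assuming `pdf2swf` from SWFTools is on the PATH; the function names are hypothetical and the database insert itself is left to the caller.

```python
import subprocess
import tempfile
from pathlib import Path


def pdf2swf_command(pdf_path, swf_path, pdf2swf="pdf2swf"):
    """Build the argument list for one PDF -> SWF conversion."""
    return [pdf2swf, str(pdf_path), "-o", str(swf_path)]


def pdf_to_swf_bytes(pdf_path, pdf2swf="pdf2swf"):
    """Convert a PDF and return the SWF contents as bytes."""
    with tempfile.TemporaryDirectory() as tmp:
        swf_path = Path(tmp) / (Path(pdf_path).stem + ".swf")
        # check=True raises CalledProcessError if pdf2swf reports a failure.
        subprocess.run(pdf2swf_command(pdf_path, swf_path, pdf2swf), check=True)
        # Read the file before the temporary directory is removed.
        return swf_path.read_bytes()
```

The returned bytes can then be bound to a BLOB parameter with whichever database driver the application already uses.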