I need to convert DOC/TXT files to PDF in large batches

We are changing systems and the new system only outputs .DOC or .TXT files for reports. Several of the reports that come out need to be converted to PDF so they are available for our web users on a daily basis. Currently I am testing with about 1500 files of a single report type, and before the system is ready I will need to support at least 10 types of reports, each possibly with 1500 or so files to convert.
So far I have not found a way to convert this many reports effectively. Part of the problem is that the reports must be converted to a specific PDF page size for them to be read easily. I have tested some software solutions, but so far I have not been able to find one that works.
I really like Batch Document Converter Pro. We have used software from this company before and it worked very well for our needs. Whenever I try it, though, it gives the error
Problem with conversion: word to pdf, check word 2007 or greater is installed and the MS PDF Addon pack for office 2007
I have tried installing different versions of Office (including 2007) on the machine and installed the addon pack with no change.

One tool to try is LibreOffice, since:
it can run on multiple platforms
it can be driven from the command line or programmatic API
you can use it manually to confirm whether it will do what you need before doing any scripting/programming
it does pretty good conversions
the docx files' page format will carry over naturally to the PDF
the text files will be converted into a "normal" page layout
I would suggest you first install LibreOffice, open some of your documents by hand, and export them to PDF. If the results are good enough, then you can automate this to run in batches.
If the first step is promising, then the simplest automation is to use the command line, e.g.:
"c:\Program Files\LibreOffice 5\program\soffice" --headless --convert-to pdf myDoc.docx
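For the batch case, a minimal Python sketch could wrap that same command and feed it the files in chunks. The function names, the chunk size of 50, and the install path are assumptions for illustration, not something LibreOffice prescribes; --outdir is the soffice option that sets the output folder.

```python
import subprocess
from pathlib import Path

# Assumed install location; adjust to wherever soffice lives on your machine.
SOFFICE = r"c:\Program Files\LibreOffice 5\program\soffice"


def build_convert_command(sources, out_dir, soffice=SOFFICE):
    """Return the argument list for one headless LibreOffice conversion run."""
    return [
        str(soffice),
        "--headless",
        "--convert-to", "pdf",
        "--outdir", str(out_dir),
        *[str(s) for s in sources],
    ]


def convert_folder(in_dir, out_dir, chunk_size=50, soffice=SOFFICE):
    """Convert every .doc/.docx/.txt file in in_dir, chunk_size files per run."""
    in_dir, out_dir = Path(in_dir), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    sources = sorted(
        p for p in in_dir.iterdir()
        if p.suffix.lower() in {".doc", ".docx", ".txt"}
    )
    for start in range(0, len(sources), chunk_size):
        chunk = sources[start:start + chunk_size]
        # check=True raises CalledProcessError if soffice exits non-zero.
        subprocess.run(build_convert_command(chunk, out_dir, soffice), check=True)
    return len(sources)
```

Passing the arguments as a list avoids the quoting problems that the space in "Program Files" causes on the command line. Converting in chunks keeps each command line short while still avoiding one LibreOffice start-up per file.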
I hope that helps.

Related

I'd like to recognize the text of all pdfs on my computer and save them without moving them from their locations. Is it possible?

I've tried using Adobe Acrobat X Pro to "recognize text in multiple files."
When I started this process and it asked for the directory, I chose C:, my main hard drive.
It took hours to load and when it did, the list of files it generated included word documents as well. Adobe said I couldn't proceed until I removed the problem files.
Once I removed all the pdfs Adobe flagged as having errors (like password protection) and the prompt remained, I assumed it meant the word documents in the list.
So I manually removed those too. But Adobe still said that I couldn't proceed until problem files were removed and there weren't any remaining files in the list that adobe had flagged as having issues.
My firm is trying to make sure all PDFs we have are searchable. Currently, some are and some aren't. Our goal is to make them all searchable without removing them from their varied locations.
I think you can do this using a combination of
regular Java: to list all files in a directory that match a given criterion (e.g. their name ends with '.pdf')
iText: to iterate over the PDF document and extract all images
Tess4J: a Java wrapper for Tesseract (Google's OCR engine), to turn the extracted images back into text
Unless I am much mistaken, Tesseract even offers a crude version of this workflow for you, but only for one PDF at a time. So you'd still need some Windows/Linux scripting to pipe in all files of a given directory.
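The file-listing half of that workflow, plus the "leave every file where it is" requirement, can be sketched in Python as below. This is only an outline: ocr_step is a hypothetical callable standing in for whatever OCR tool you settle on, and both function names are made up for the example.

```python
import os
import tempfile


def find_pdfs(root):
    """Yield the full path of every file under root whose name ends in .pdf."""
    for folder, _subdirs, names in os.walk(root):
        for name in names:
            if name.lower().endswith(".pdf"):
                yield os.path.join(folder, name)


def ocr_in_place(pdf_path, ocr_step):
    """Run ocr_step(src, dst) and put the result back at pdf_path.

    ocr_step is a hypothetical callable you supply; it must write the
    searchable PDF to dst.
    """
    folder = os.path.dirname(pdf_path) or "."
    # Write next to the original so the final replace stays on one volume.
    fd, tmp_path = tempfile.mkstemp(suffix=".tmp", dir=folder)
    os.close(fd)
    try:
        ocr_step(pdf_path, tmp_path)
        os.replace(tmp_path, pdf_path)
    except Exception:
        # Leave the original untouched and clean up the partial output.
        os.remove(tmp_path)
        raise
```

Because the original is only overwritten after the OCR step has finished, a file that fails (password protected, corrupt) is simply left as it was, and the loop over find_pdfs can log it and move on.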

How can I convert old xif image files (Pagis, Xerox) to PDF (or another format.)

Years ago I began scanning my "important" documents using Pagis, software that came with my HP scanner. Eventually I began to scan to PDF (as the scanner software became able), but I still had many old XIF files. The Pagis software would run only on a 32-bit OS (Windows), which is now becoming less and less common. In fact I have a Win32 system I've kept alive just to retain access to the XIF files.
I can convert these files using Adobe Acrobat (or equivalent) "simply" by opening the XIF viewer, then printing the doc to the Adobe PDF "printer". Unfortunately I have enough files that this manual process would take many years.
So, what's the best way to convert a large number of XIF files to PDF?
I recently found SikuliX, a scripting tool intended mainly for GUI testing. It differs from most such tools I have seen (e.g., Selenium) in that it is purely image based and does not care what the underlying technology is (HTML, XAML, etc.).
It took me about an hour to learn enough to write a script to open the XIF viewer, select the PDF "printer", click the button to print, fill in the desired output file name (XIF viewer truncated to short name if left alone), and then wait for the print to complete. The script then moved to the next XIF file. (I fed the script a file listing all of the XIF file paths on the drive.) I was using Nitro PDF rather than Adobe.
The script ran for a couple of days (I didn't say it was fast!), but converted all but a few of the files. From time to time it would stall and I'd have to modify the script a bit (increase wait time for UI to change, etc.)
There are probably not many folks facing this particular conversion problem, but I've been looking for a good solution literally for years. So, if you're in the same boat then this is a way to get to shore!
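Since a run like this can stall partway through, it helps if the script can pick up where it left off. The sketch below is not the SikuliX script itself, only a hypothetical helper around it: it reads the listing file of XIF paths and returns the ones that do not yet have a PDF in the output folder. The function name and the "same name with .pdf" convention are assumptions.

```python
from pathlib import Path


def pending_jobs(listing_file, out_dir):
    """Pair each XIF path in the listing with its target PDF, skipping finished ones."""
    out_dir = Path(out_dir)
    jobs = []
    for line in Path(listing_file).read_text().splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines in the listing
        src = Path(line)
        dst = out_dir / (src.stem + ".pdf")
        if not dst.exists():
            jobs.append((src, dst))
    return jobs
```

The GUI-driving loop would then iterate over the returned pairs. Note that two XIF files with the same name in different folders would map to the same PDF here, so a real run needs a naming scheme that keeps them apart.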

Rule based PDF text extraction for various bills and invoices

I have to extract text from PDF files of invoices and bills.
The file layouts can get complex, though it's mostly filled with tables.
I've already read a few dozen articles about the PDF format, how easy it is for our brain to grasp it and how hard it is for a machine to understand its structure.
I have also downloaded a few tools, such as Python's pdfminer and some Java tools. Some even offer rule-based layout extraction, like LA-PDFText. These are all great libraries, but they leave the final step to you.
Adobe also has an online service called ExportPDF, but it can't be customized.
Bottom line, I understand that in order to extract text from structured PDF files and convert it to XML, for example, there has to be some level of manual work.
I also found A-PDF Form Data Extractor, a non-free tool that lets you set extraction rules and claims to do the job, though it's hard to find a proper manual and it runs only on Windows.
I thought I might even try to convert those files to images and run tesseract-ocr on them, but decided to ask for advice here before I spend more time on it.
I'll be very grateful if someone with such experience can give me a hint.
I've done a lot of PDF extraction and I can confirm, as you've already discovered, that it can be a painful process to start. One of the important things to understand is that there is no concept of "tables" within a PDF, just text that happens to have lines around it. Also, there's no guarantee that the linear order of text within the PDF code actually matches the visual order when printed. In other words, there's no guarantee that "hello world" is written in that order; it could be draw 'world' at coord 20 then draw 'hello' at coord 10. Most PDF creators don't do this but still there's no guarantee. The more creative a PDF creator is (InDesign, Illustrator, etc.) the more likely the text is going to be harder to get out. And actually, once a designer starts messing with fonts too much, some programs will sometimes actually output words one character at a time, changing the font just slightly each time.
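To make the ordering point concrete, here is a small Python sketch that restores reading order from positioned text fragments. It assumes you already have (x, y, text) tuples from some extraction library; the function name, the tolerance value, and the sample data are made up for illustration.

```python
def reading_order(fragments, line_tolerance=2.0):
    """Join (x, y, text) fragments top-to-bottom, then left-to-right.

    PDF coordinates grow upward, so a larger y is higher on the page.
    Fragments whose y differs by at most line_tolerance share a line.
    """
    lines = []  # each entry: [y of the line's first fragment, [(x, text), ...]]
    for x, y, text in sorted(fragments, key=lambda f: -f[1]):
        if lines and abs(lines[-1][0] - y) <= line_tolerance:
            lines[-1][1].append((x, text))
        else:
            lines.append([y, [(x, text)]])
    return "\n".join(
        " ".join(text for _x, text in sorted(parts))
        for _y, parts in lines
    )


# 'world' is drawn before 'hello' in the file, but sits to its right on the page.
fragments = [(20, 700, "world"), (10, 700, "hello"), (10, 680, "next line")]
print(reading_order(fragments))
```

This prints "hello world" on the first line and "next line" on the second. Real invoices need more than this (column detection, per-character fragments), which is why rule-based tools exist, but the same sort-by-position idea sits underneath them.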
That said, I'd recommend the first one that you looked at, LA-PDFText. You can run it in discovery mode (blockify) from which you can create rules. I don't have Java installed anymore so I can't test it but it seems very promising.
Your second one, A-PDF Form Data Extractor, only really works with actual PDF forms. If this is your case I'd recommend just using an open source solution like iText/iTextSharp.
The last OCR one makes me cringe. I just can't imagine going through those hoops would get you better text representation than parsing the PDF. But then again, PDF is a visual format so maybe it would.
Personally I use iText/iTextSharp for this kind of thing but I also like to do things the hard way.
It is not clear whether you are looking for a development tool to automate data extraction from bills and invoices, or just for a one-time tool (utility) that a non-developer can use.
Anyway here are some specialized tools including engines they use:
Tabula (open-source, especially designed to extract data from tables in PDF. Can export shell scripts for batch processing, runs as the localhost web service, powered by JRuby Tabula engine)
Viet OCR (open-source .NET desktop utility for text extraction from PDF and images, based on the Tesseract OCR engine)
Bytescout PDF Viewer (freeware closed source .NET utility, detects and extracts tables, including scanned invoices, powered by PDF Extractor SDK)
DISCLAIMER: I work for ByteScout.

I need a (preferably free) PDF/Word generator .Net component that can work from a document template

I'm looking for a .Net component that will allow me to generate Word and/or PDF documents.
This must work on the server without MS Office installation. Preferably free. Also, it needs to be able to generate the documents based on an existing template of some sort i.e. I don't want to generate the whole document from scratch but allow a number of different templates that all have similar content that comes from elsewhere (e.g. database, XML files etc).
My initial investigations have turned up iTextSharp (but not sure if it can work from templates).
Any help that can expedite my investigation time will be much appreciated.
Thanks
I use ActivePDF at work with .NET: give it some HTML and it will output a PDF document. However, it isn't free. We did look at a few other options, and this was one of them:
http://pdfcrowd.com/html-to-pdf-api/
It doesn't do word documents but converts html (your template) to pdf
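The "HTML as template" idea is language independent: keep the layout in an HTML file with placeholders, fill in the data, and hand the result to whichever HTML-to-PDF converter you choose. A minimal sketch in Python (the template text and the field names customer and total are invented for the example; the PDF conversion step itself is not shown):

```python
from html import escape
from string import Template

# Hypothetical template; in practice this would be loaded from a file.
TEMPLATE = Template(
    "<html><body><h1>Invoice for $customer</h1>"
    "<p>Total due: $total</p></body></html>"
)


def render(template, values):
    """Fill the template, HTML-escaping every value first."""
    safe = {key: escape(str(value)) for key, value in values.items()}
    return template.substitute(safe)
```

Escaping the values keeps characters such as & or < in the data from breaking the markup that the converter receives.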

How can I create a PDF file in classic ASP?

Is there any way to generate PDF files from classic ASP? I have a bunch of user-entered data that needs to be turned into a PDF that the user can download. How can I do this? OpenOffice allows exporting documents to PDF, so could this somehow be leveraged?
I played around a bit with this (Persits ASPPDF): http://www.asppdf.com/
Maybe run an external application that uses Crystal Reports, and just pass it the data as XML?
That's how I would do it... (lazy mode)
See a full list of PDF components here: http://www.aspin.com/home/components/document/pdf Many of them are free.
It is also possible to use XSLT to output PDF, but I am not sure if this is supported by the Microsoft XML Parser. I remember there was something stopping me when I tried to do this 3-4 years ago. It might be worth checking out now, depending on the type of data you have as a source.
However if these are static files or a one time job consider using a PDF converter on your computer and just upload the files to the server. There are heaps of tools for this, including Adobe Acrobat.