Converting PCL to PDF

I am looking to create (as a proof-of-concept) an OCaml (preferably) program that converts PCL code to PDF format. I am not sure where to start. Is there a standardized algorithm for doing so? Is there any other advice available for accomplishing this task?
Thanks!

Conversion of PCL to PDF can be incredibly complex (assuming you need it to be generic and not just for simple PCL). We've investigated this many times and in the end always revert to using other tools. We keep investigating because we are a development shop that uses and understands all elements of PCL in great detail. If you are not really familiar with PCL, it will be a daunting task. One of the major issues is that, over time, printers have become, for the most part, tolerant of malformed PCL, so creating something that follows the rules to the letter is not always sufficient. If, however, you have control over the PCL, you may be able to work it out with some amount of success.
I don't mean to turn you off of this and I realize that you've come here looking for a programming answer, but I have to say, this is far from a simple task and there are no 'standardized algorithms' for this (that I'm aware of).
If this is designed to be a tool to work alongside something else you are building, I'd highly recommend looking at these guys:
PageTech
This is by far the most complete set of tools (Windows) for handling this. There are a few others but, based on our extensive use of PCL and conversion tools over the years, this is the only one that works all the time.
EDIT: Most recently we've been working with LincPDF (http://www.lincolnco.com/). This is also an excellent product, which has one big benefit: deployment is simple. Some of the other tools have complex software installations. This solution is very easy for us to deploy as a feature in an application. It's also faster than any tool we've tested to date (at least with the PCL that we generate from our apps, which is quite complex as it includes specialized fonts and macros).

Ghostscript developers have recently integrated their sister products GhostXPS, GhostPCL and GhostSVG into their Ghostscript source code tree. (It's now called GhostPDL.) So all of these additional functionalities (load, render and convert XPS, PCL and SVG) are now available from there.
This means you could build their language switching binary from their sources. This, in theory, can consume PCL, PDF and PostScript and convert this to a host of other formats. While it worked for me whenever I needed it, Ghostscript developers recommend to stop using the language switching binary (since it's 'almost non-supported' -- see KenS' comment to this answer) and instead switch to using the explicit binaries pcl6.exe (PCL input), gsvg.exe (SVG input, also 'almost non-supported') and gxps.exe (support status unclear to me).
So to 'convert PCL code to PDF format' as the request reads, you could use the pcl6 command line utility, a sister product to Ghostscript's gs/gswin32c.exe.
Sample commandline:
pcl6.exe \
-o output.pdf \
-sDEVICE=pdfwrite \
[...more parameters as required (optional)...] \
-f input.pcl
Updated as per KenS' hints in the comment....
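If the conversion has to be driven from a program rather than typed at a shell, the command line above can be wrapped in a few lines. This is only a sketch in Python (not the OCaml the question asked for), and it assumes a pcl6 binary is on the PATH; the function names and the default executable name are made up for illustration. It passes only the flags shown in the sample commandline.

```python
import subprocess
from pathlib import Path


def build_pcl6_command(pcl_path, pdf_path, executable="pcl6"):
    """Return the argument list for the pcl6 invocation shown above."""
    return [
        executable,              # assumed name; pcl6.exe on Windows
        "-o", str(pdf_path),     # output file
        "-sDEVICE=pdfwrite",     # select the PDF output device
        "-f", str(pcl_path),     # input file
    ]


def convert_pcl_to_pdf(pcl_path, pdf_path, executable="pcl6"):
    """Run pcl6; raises CalledProcessError if it exits with an error."""
    subprocess.run(build_pcl6_command(pcl_path, pdf_path, executable), check=True)
    return Path(pdf_path)
```

Calling convert_pcl_to_pdf("input.pcl", "output.pdf") would then run the same conversion as the sample; any extra parameters would have to be added to the list by hand.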

There is a series of reference books from HP; you could re-implement a PCL parser and output corresponding PDF.
You might start with the "PCL 5 Printer Language Technical Reference Manual" (http://h20000.www2.hp.com/bc/docs/support/SupportManual/bpl13210/bpl13210.pdf) . Search HP for more (http://search.hp.com/query.html?qt=PCL+reference).
Or you could steal code or ideas from GhostPCL (http://www.ghostscript.com/GhostPCL.html)
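To give a feel for what "re-implement a PCL parser" starts with, here is a rough sketch of a tokenizer for PCL 5 escape sequences, written in Python for brevity rather than OCaml. It assumes the general sequence layout from the PCL 5 reference manual: ESC plus one character in the range 48-126 for two-character sequences, or ESC, a parameterized character (33-47), an optional group character (96-126), and then value/letter pairs where a lowercase letter means more parameters follow and an uppercase letter (64-94) ends the sequence. The token shape and function name are invented for this example. It deliberately ignores commands that are followed by binary data (raster rows, font downloads), macros, and HP-GL/2, all of which a real converter has to handle.

```python
ESC = 0x1B


def tokenize_pcl(data: bytes):
    """Split bytes into ("cmd", prefix, value, final) and ("text", bytes) tokens."""
    tokens = []
    text = bytearray()
    i, n = 0, len(data)

    def flush_text():
        if text:
            tokens.append(("text", bytes(text)))
            text.clear()

    while i < n:
        if data[i] != ESC or i + 1 >= n:
            text.append(data[i])
            i += 1
            continue
        flush_text()
        lead = data[i + 1]
        if 48 <= lead <= 126:
            # Two-character sequence such as ESC E (printer reset).
            tokens.append(("cmd", "", "", chr(lead)))
            i += 2
            continue
        if not 33 <= lead <= 47:
            # Not a sequence this sketch recognises: skip the ESC and carry on.
            i += 1
            continue
        prefix = chr(lead)
        i += 2
        if i < n and 96 <= data[i] <= 126:
            prefix += chr(data[i])  # optional group character
            i += 1
        while i < n:
            start = i
            while i < n and chr(data[i]) in "+-.0123456789":
                i += 1
            if i >= n:
                break
            value = data[start:i].decode("ascii")
            final = data[i]
            if 96 <= final <= 126:
                # Lowercase parameter character: more parameters follow.
                tokens.append(("cmd", prefix, value, chr(final).upper()))
                i += 1
            elif 64 <= final <= 94:
                # Uppercase termination character ends the sequence.
                tokens.append(("cmd", prefix, value, chr(final)))
                i += 1
                break
            else:
                break  # malformed input; resume scanning as text
    flush_text()
    return tokens
```

For example, tokenize_pcl(b"\x1bE\x1b&l1o2AHello") splits the combined sequence into two separate "&l" commands followed by the text run. Mapping each command to PDF drawing operations is where the real work (and the tolerance for malformed input mentioned in the other answer) comes in.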

Related

Rule based PDF text extraction for various bills and invoices

I have to extract text from PDF files of invoices and bills.
The file layouts can get complex, though they are mostly filled with tables.
I've read a few dozen articles already about the PDF format, how easy it is for our brain to grasp it and how hard it is for a machine to understand its structure.
I've also downloaded a few tools like Python's pdfminer and some Java tools; some even have rule-based layout extraction, like LA-PDFText. These are all great libraries, but they leave the final step to you.
Adobe also has an online service called exportPdf but it can't be customized
Bottom line, I understand that in order to extract text from structured pdf files and convert it to XML for example, there should be some level of manual work.
I also found A-PDF Form Data Extractor, a non-free tool with the ability to set extraction rules that claims to do the job, though it's hard to find a proper manual and it runs only on Windows.
I thought I might even try to convert those files to images and try tesseract-ocr, but decided to ask for advice here before I spend more time on it.
I'll be very grateful if someone with such experience can give me a hint.
I've done a lot of PDF extraction and I can confirm, as you've already discovered, that it can be a painful process to start. One of the important things to understand is that there is no concept of "tables" within a PDF, just text that happens to have lines around it. Also, there's no guarantee that the linear order of text within the PDF code actually matches the visual order when printed. In other words, there's no guarantee that "hello world" is written in that order; it could be: draw 'world' at coord 20, then draw 'hello' at coord 10. Most PDF creators don't do this, but still there's no guarantee. The more creative a PDF creator is (InDesign, Illustrator, etc.), the more likely the text is going to be harder to get out. And actually, once a designer starts messing with fonts too much, some programs will sometimes output words one character at a time, changing the font just slightly each time.
That said, I'd recommend the first one that you looked at, LA-PDFText. You can run it in discovery mode (blockify) from which you can create rules. I don't have Java installed anymore so I can't test it but it seems very promising.
Your second one, A-PDF Form Data Extractor, only really works with actual PDF forms. If this is your case I'd recommend just using an open source solution like iText/iTextSharp.
The last OCR one makes me cringe. I just can't imagine going through those hoops would get you better text representation than parsing the PDF. But then again, PDF is a visual format so maybe it would.
Personally I use iText/iTextSharp for this kind of thing but I also like to do things the hard way.
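To illustrate the point above about draw order versus visual order: whatever library you use to pull text fragments and their coordinates out of the PDF, you usually have to rebuild the reading order yourself. The sketch below is plain Python with no PDF library involved; it assumes you already have (x, y, text) tuples, that y grows upward as in PDF user space, and that fragments within a couple of units vertically belong to the same line. The function name and the tolerance value are arbitrary choices for the example.

```python
def reading_order(fragments, line_tolerance=2.0):
    """Join (x, y, text) fragments into lines, top to bottom, left to right."""
    lines = []  # each entry: [y of first fragment on the line, [(x, text), ...]]
    # Highest y first, because PDF coordinates start at the bottom of the page.
    for x, y, text in sorted(fragments, key=lambda f: -f[1]):
        if lines and abs(lines[-1][0] - y) <= line_tolerance:
            lines[-1][1].append((x, text))
        else:
            lines.append([y, [(x, text)]])
    # Within each line, order the pieces by their x position.
    return [" ".join(t for _, t in sorted(parts)) for _, parts in lines]


fragments = [
    (20, 700, "world"),   # drawn first, but visually second
    (10, 700, "hello"),
    (10, 680, "total"),
    (60, 681, "42.00"),   # slightly different baseline, same visual row
]
print(reading_order(fragments))  # ['hello world', 'total 42.00']
```

Table extraction is the same idea taken further: cluster by row, then cluster the x positions into columns. Character-by-character output from design tools needs an extra step that merges fragments whose x gap is smaller than a space.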
It is not clear if you are looking for the development tool to automate the data extraction from bills and invoices or just for the one time tool (utility) that can be used by the non-developer?
Anyway here are some specialized tools including engines they use:
Tabula (open-source, especially designed to extract data from tables in PDF. Can export shell scripts for batch processing, runs as the localhost web service, powered by JRuby Tabula engine)
Viet OCR (open-source .NET desktop utility for text extraction from PDF and images, based on the tesseract OCR engine)
Bytescout PDF Viewer (freeware closed source .NET utility, detects and extracts tables, including scanned invoices, powered by PDF Extractor SDK)
DISCLAIMER: I work for ByteScout.

Creating ODT and PDF files as end result

I've been working on an app to create various document formats for a while now, and I've had limited success.
Ideally, I'd like to dynamically create a fairly simple ODT/PDF/DOC file. I've been focusing my efforts on ODT, because it is editable, and open enough that there are several tools which will convert it to any of the other formats I need.
The problem is that the ODT XML files are NOT simple, and there aren't any good-quality APIs I could find (especially in Python). So far, I've had the most success creating a template ODT file, and then manipulating the DOM in Python as needed. This is OK generally, but is quickly becoming inadequate and requires too much tweaking every single time I need to alter one of the templates.
The requirements are:
1) Produce a simple document that will have lists, paragraphs, and the ability to draw simple graphics on the page (boxes, circles, etc...)
2) The ability to specify page size, and the different formats should generally print the exact same output when sent to a printer
My questions:
1) Are there any other ways I can produce ODT/PDF/DOC files?
2) Would LaTeX be acceptable? I've never really used it, does anyone have experience converting LaTeX files into other formats?
3) Would it be possible to use HTML? There are a lot of converters online. Technically you can specify dimensions in mm/cm, etc..., but I am worried that the printed output will differ between browsers/converters....
Any other ideas?
Have you tried pandoc? I've been using it with good success for converting different formats into each other. Why reinvent the wheel?
I suppose to be successful, you'd have to define how you want to input everything. Why don't you just use openoffice? it will save to ODT (duh...), PDF, and HTML (though it's not clean HTML, it's actually quite ugly).
In my recent experience, I've had success going from latex -> xhtml via LaTeXML (I had to compile from source). LaTeX is seeming more and more like a terminal format. It's great for PDF, but once you need some flexibility, it kind of fails. I should also note that there is no latex -> dvi in my workflow, so I can't comment on things like tex4ht that read out of a DVI file (I have too many graphics that don't work with DVI to switch them now).
Shortly I'll be moving everything into DocBook 4.5 -- I like the docbook-utils package, which supports latex, html, and I even saw a converter to ODT. DocBook is super-heavy on the markup, which is annoying, but it will provide me with the flexibility I need going forward.
Since you're using python, have you just considered using ReStructured Text?
I've also really enjoyed publishing from emacs' orgmode, which is a super light weight markup that goes into a bunch of different formats.

Autodocumentation type functionality for Fortran?

In the past I've used Doxygen for C and C++, but now I've been thrown onto a Fortran project and I would like to get a quick, all-encompassing look at the architecture.
In the past I've found reverse engineering tools to be useful where no documentation of the architecture exists.
So, is there a tool out there that will reverse engineer Fortran code?
I tried to use Doxygen, but didn't have any luck. I will be working with two different projects: one is in Fortran 90 and the other, I think, is in Fortran 77.
Thanks for any insights and feedback.
Tools which may help with reverse engineering:
SciTools Understand
Link with some more tools (search "fortran")
Also, maybe some of these unit testing frameworks will be helpful (I haven't used them, so I cannot comment on the pros and cons of any of them):
FUnit
FRUIT
Ftnunit
(these links link to fortranwiki, where you can find a tidbit on every one of them, and from there there are links to their home sites).
Doxygen 1.6.1 will generate documentation, call graphs, etc. for Fortran source code in free-format (F90) format. You are out of luck for auto-documenting fixed-format (F77) code with doxygen.
All is not lost, however. The conversion from fixed to free format is straightforward and can be automated to a great degree - change comment characters to '!', change continuation characters to '&', and append '&' to lines to be continued. In fact, if the appended continuation character is placed in column 73, it should be ignored by standard F77 compilers (which still only recognize code in columns 1 through 72) but will be recognized by F9x/F2003/F2008 compilers. This allows the same code to be recognized as both in fixed and free format, which lets you gracefully migrate from one format to the other.
Conveniently, there are about a thousand small programs that will do this format adjustment to some degree or another. Realistically, if you're going to be maintaining the code, you might as well move it away from the 1928 spec for Hollerith (IBM) punched cards. :)
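As an illustration of the fixed-to-free adjustment described above, here is a minimal Python sketch of the three steps (comment characters, continuation characters, and the '&' in column 73). It is an assumption-laden toy, not one of the existing converters: the function name is made up, it only treats C, c and * in column 1 as comment markers, it expects space-padded (not tab-formatted) source, and it discards anything already sitting in columns 73 and beyond of a continued line.

```python
def fixed_to_dual_format(lines):
    """Rewrite fixed-format lines so they also parse as free format."""
    out = []
    last_code = None  # index in `out` of the most recent statement line
    for raw in lines:
        line = raw.rstrip("\n")
        if line[:1] in ("C", "c", "*"):
            # Fixed-format comment marker in column 1 becomes '!'.
            out.append("!" + line[1:])
            continue
        if not line.strip() or line.lstrip().startswith("!"):
            out.append(line)  # blank line or already a '!' comment
            continue
        # A continuation line has blanks in columns 1-5 and any character
        # other than blank or '0' in column 6.
        is_continuation = (
            len(line) > 5 and line[:5].strip() == "" and line[5] not in " 0"
        )
        if is_continuation and last_code is not None:
            line = line[:5] + "&" + line[6:]
            # Put '&' in column 73 of the line being continued, where
            # fixed-format compilers ignore it.
            out[last_code] = out[last_code][:72].ljust(72) + "&"
        out.append(line)
        last_code = len(out) - 1
    return out
```

Run over a whole file, this gives you something Doxygen can be pointed at while the original compilers keep working; check the result by compiling in both modes before trusting it.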

What are the relative merits of pdflatex?

Not sure this is a programming question, but we use LaTeX for all our API documentation and user documentation, so I hope it will go through.
Can someone please explain what are the relative merits of using pdflatex as opposed to the "classic" technique of
latex foo
dvips -Ppdf foo
ps2pdf foo.ps
From time to time I run into people who have difficulty because things don't work in pdflatex, and I know that using pdflatex gives up two things I have grown to value:
Can't use the very speedy xdvi viewer
Can't use the PStricks package
I should add that I typically get PDF with hyperlinks by using something on the order of
\usepackage[ps2pdf,colorlinks=true]{hyperref}
so it's not necessary to use pdflatex to get good PDF.
So
What are the advantages of pdflatex that I don't know about?
What are the disadvantages of the old tools that I've overlooked?
My favorite pdflatex feature is the microtype package, which is available only when using pdflatex to go directly to PDF, and really produces stunning results with no effort on my part. Apart from that, the only caveats I run into are image formats:
pdflatex supports PDF, PNG, and JPG images.
the postscript drivers support (at least) EPS.
Also, if you want to install fonts, the procedures are slightly different depending on what fonts that driver supports. (Hint: use XeTeX to instantly enable OpenType fonts.)
As it turns out, I recently read a post that shows the difference directly. Any document that uses tables or narrow columns will be improved automatically. I also find the inter-word spacing to be far more pleasing with pdflatex.
Is xdvi much faster than xpdf? I find the edit, TeX, view cycle to be very quick with pdflatex.
Have you tried MetaPost or MetaFun for graphics? I tend to put graphics creation in the hands of the capable, but MetaFun would likely be the package I'd use. Just reading the manuals is a pleasure.
Also pdftex is the engine under development (towards luatex) and maintenance. I'm not sure the DVI counterparts are as actively maintained.
PStricks is supplanted by Tikz.
I haven't used xdvi in years, so pardon the trollish rhetorical questions: Does xdvi display vector fonts? Does it support synctex (jumping to and from code)? Does it have the comfort of use of PDF readers like Skim?
Taco Hoekwater is working on Escrito, a Postscript interpreter written in Lua, which would allow you to use pstricks in Luatex. He has an impressive project completion record: maybe I should have used "will" rather than "would" in the previous sentence.
I used pdflatex to generate the PDF for my ICFP 2009 paper. (I still needed to use standard latex to generate the PostScript file.) I did so for two reasons:
I couldn't seem to get ps2pdf to generate Letter, rather than A4 output, no matter what command line options I used.
For the printers, I needed to produce a version 1.3 PDF file, not 1.4. pdflatex made this easy to do. I set the PDF author and title information while I was at it.
Both of these problems may be fixable in some way, but as a first-time latex user, I didn't find any obvious solutions, nor did more experienced users whom I'd asked.

Which is it Perl or perl, TIF or TIFF, ant or Ant, ClearCase or Clear Case?

In one sentence I have managed to create 16 possible variations on how I present information. Does it matter as long as the context is clear? Do any common mistakes irritate you?
regarding Perl: How should I capitalize Perl?
TIFF stands for Tagged Image File Format, whereas the extension of files using that format is often ".tif".
That is for the purpose of compatibility with 8.3 filenames, I believe.
I generally like the Perl way of capitalizing when used as a proper noun, but lowercasing when referring to the command itself (assuming the command is lowercase to begin with).
Well, Perl and TIFF have already been answered, so I'll add the last two
the Apache Foundation writes "Apache Ant".
Rational ClearCase (or sometimes "IBM Rational ClearCase") is written as such at its web site.
Even though Perl was originally an acronym for Practical Extraction and Report Language, it is written Perl.
These things don't 'bother' me as much as they provide insight into the level of knowledge of the speaker/author. You see, we work in an industry that requires precision, so precision in language does matter as it affects the understanding of the consumer.
The one that really seems to bother me is when people fully upper case JAVA as though it were an acronym.