Scribus, IText reduce file size - pdf

I'm generating PDFs using iText within a webapp. The PDF templates come from Scribus and are huge (2MB for one page). The current approach is that the PDF has lots of form fields, which then get populated by iText (AcroForms, etc.).
A single page generated by Scribus is 2MB, but it could be as small as 150K. I know that because I have run Ghostscript on it (see below).
For large files (some could be 150 pages), the server bogs down, and often no PDF results.
Ghostscript will reduce the file to 150K per page. But I can't run that as a post-process if the PDF generation never completes. If I run Ghostscript on the initial PDF that gets fed to iText, the form fields go away, and the result is an empty form.
So I need either a way to run Ghostscript without losing the form fields (or another external tool that does the same thing), or a way for iText to populate a PDF by some means other than form fields. Is there any iText feature equivalent to good old JavaScript's document.getElementById('xyz').innerHTML = "new text";?
Of course, the absolute best solution would be an export option in Scribus that would simply not do the "place one glyph at a time" that they are so proud of.


Tables or images too wide in Pandoc output as DOCX or PDF/LaTeX

I am writing a quick and dirty report using pandoc and markdown.
I need to generate a PDF or a DOCX with minimum hassle, I don't care much about which (best would be both, of course). Also, I am somewhat constrained regarding the figures and tables -- they have been generated a priori with another program and I would rather insert them as they are than convert them to suit pandoc's needs.
However, the main constraint is that I don't want to edit the resulting document manually, be that LaTeX or DOCX. I want to do all editing in markdown.
Here is the problem:
In DOCX, the tables are displayed fine: they have the width of the document. However, the figures are much too wide. I can either convert the images to a lower resolution (which doesn't look nice), or manually resize the images in Word (which is out of the question).
In PDF, the generated figures are fine (more or less), however another two problems appear:
The tables are too wide, because there are no line breaks, and
LaTeX being LaTeX, the order of figures and tables is "reorganized", that is, they are not consecutive.
Thus, none of the documents generated are usable for my purposes.
All I wanted to do is to slap together some results and generate a file that I can send to another scientist.
Question: what is the best solution to generate a quick and dirty report in pandoc with minimum effort and at least all results visible?
Update: Upgrading pandoc to 1.4 or later solves the issue -- the figures now have the correct sizes in docx documents.
Control over image size
Currently you cannot control that feature directly from Markdown. For LaTeX/PDF output, this is automatically handled by LaTeX/pdflatex itself.
In recent months there have been some discussions going on in the Pandoc developer + user community about how to best implement it and create an easy-to-use syntax, for example
![Image Caption](./path/to/image.jpg "Image Comment"){width="60%", height="150px"}
(Warning: Example only, made up on the spot + extracted from thin air by myself -- can't remember the latest state of the discussion...) This is designed to then transfer to all the supported output formats which can contain images, not just to LaTeX/PDF.
So something along these lines is planned to be a major new feature for the next major release of Pandoc, and will start to be working better in ODT/DOCX output as well.
Control over table/cell widths and line breaks within cells
How exactly do you specify your tables in Markdown syntax?
Are you aware that Pandoc supports several variations like grid_tables, pipe_tables, simple_tables and multiline_tables?
You should look into using pandoc --from=markdown+multiline_tables ... as your command and write the critical tables as multiline_tables in your Markdown.
Read all about the details via man pandoc_markdown...
Multiline tables give you a limited control over the width of individual columns in the output, just by widening or narrowing the column widths in the markdown source itself.
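As a rough illustration of how the dash widths in the Markdown source steer column widths, here is a minimal Python sketch that renders a multiline table from a list of rows. The function name multiline_table and its parameters are made up for this example; it only handles left-aligned columns and assumes the layout described in the Pandoc manual (a full-width row of dashes, header, per-column dashes, blank lines between rows, closing row of dashes).

```python
import textwrap


def multiline_table(headers, rows, widths):
    """Render a pandoc multiline table (hypothetical helper, left-aligned only).

    widths[i] is the number of dashes under column i; pandoc sizes the
    output columns from the relative lengths of these dashed lines.
    """
    def render_row(cells):
        # Wrap every cell to its column width, then emit the row line by line.
        wrapped = [textwrap.wrap(str(c), w) or [""] for c, w in zip(cells, widths)]
        height = max(len(col) for col in wrapped)
        lines = []
        for i in range(height):
            parts = [
                (col[i] if i < len(col) else "").ljust(w)
                for col, w in zip(wrapped, widths)
            ]
            lines.append(" ".join(parts).rstrip())
        return lines

    total = sum(widths) + len(widths) - 1
    out = ["-" * total]
    out += render_row(headers)
    out.append(" ".join("-" * w for w in widths))
    for n, row in enumerate(rows):
        if n:
            out.append("")  # a blank line separates rows
        out += render_row(row)
    if len(rows) == 1:
        out.append("")  # a single-row table needs a trailing blank line
    out.append("-" * total)
    return "\n".join(out)


if __name__ == "__main__":
    print(multiline_table(
        ["Name", "Notes"],
        [["a", "one two three four"], ["b", "x"]],
        [6, 10],
    ))
```

Changing the widths argument (say, from [6, 10] to [6, 30]) changes the relative column widths pandoc assigns, without touching the cell contents.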
Order of figures and tables when outputting LaTeX/PDF
Pandoc supports the insertion of raw_tex lines and environments into the Markdown source file. When it encounters such lines, it passes them unchanged into its LaTeX output. (They are ignored for all other output formats.)
So you can insert lines like
\newpage{}
into the Markdown to enforce a page break. This already gives you some limited control over keeping the order of mis-behaving figures or tables. (After all, you said you look for a "quick and dirty" method, not a sophisticated typeset document...)
Of course, if you know LaTeX better, you can also use stuff like
\FloatBarrier inside your Markdown.
Going down that road (mixing LaTeX code into Markdown) gives you a few disadvantages:
The Markdown will not look as pretty any more.
The Markdown will not work fully with other output formats (should you need them).
But the advantages remain:
You will be writing and modifying the document text much faster in Markdown than authoring it in LaTeX.
You have some additional control over the final look of your PDF:
order of tables + figures
look + width of tables + figures (because, you can of course insert a complete LaTeX 'figure' or 'table' environment).
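If you want the raw LaTeX added mechanically rather than by hand, a small pre-processing step can do it. The sketch below is only an illustration: the function name add_float_barriers is invented, it assumes each figure sits on its own line in the usual ![caption](path) form, and it assumes the placeins package (which provides \FloatBarrier) gets loaded in the LaTeX preamble, for example via a header file passed with pandoc's -H option.

```python
import re

# Matches a line that consists of nothing but one Markdown image.
IMAGE_LINE = re.compile(r"^!\[.*\]\(.*\)\s*$")


def add_float_barriers(markdown):
    """Insert a raw \\FloatBarrier paragraph after every standalone image line.

    Hypothetical helper; the barrier only has an effect in LaTeX/PDF output.
    """
    out = []
    for line in markdown.split("\n"):
        out.append(line)
        if IMAGE_LINE.match(line):
            # A blank line keeps the raw TeX in its own paragraph.
            out += ["", r"\FloatBarrier"]
    return "\n".join(out)


if __name__ == "__main__":
    source = "Results:\n\n![Figure 1](fig1.png)\n\nDiscussion follows."
    print(add_float_barriers(source))
```

The Markdown file you edit stays clean this way, and the LaTeX-only lines exist just in the intermediate text that gets piped to pandoc.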

Get selected "PostScript" from PDF

I wasn't able to find anything on the internet and I get the feeling that what I want is not such a trivial thing. To make a long story short: I'd like to get my hands on the underlying code that describes the PDF document of a selected area from a .pdf file. I've been looking for libraries or open source readers but couldn't find anything useful yet.
Does there exist something that might be able to accomplish my needs here or anything that might be reused (like an open source reader) to get there a little faster and not having to write everything from scratch?
You can convert a whole PDF document to PostScript using pdftops, one of the utilities from the poppler PDF rendering library.
This utility enables you to convert individual pages, which is at least a start.
If you just want to extract bitmapped images, try pdfimages from the same package. This extraction can also be restricted to individual pages.
The poppler library was originally written for UNIX-like systems, but there are a couple of windows builds available.
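To make the page restriction concrete, here is a small Python sketch that builds the two command lines. It assumes the poppler versions of pdftops and pdfimages, which take -f and -l for the first and last page; the helper names and file names are placeholders for this example, and nothing is executed unless you pass the resulting list to subprocess.run yourself.

```python
def _page_args(first=None, last=None):
    """Translate an optional page range into poppler's -f / -l options."""
    args = []
    if first is not None:
        args += ["-f", str(first)]
    if last is not None:
        args += ["-l", str(last)]
    return args


def pdftops_cmd(pdf, ps, first=None, last=None):
    """Command line converting (part of) a PDF to PostScript."""
    return ["pdftops", *_page_args(first, last), pdf, ps]


def pdfimages_cmd(pdf, image_root, first=None, last=None):
    """Command line extracting bitmapped images; image_root is the output prefix."""
    return ["pdfimages", *_page_args(first, last), pdf, image_root]


if __name__ == "__main__":
    # Convert only page 3, and pull images from pages 3 to 5.
    print(" ".join(pdftops_cmd("input.pdf", "page3.ps", first=3, last=3)))
    print(" ".join(pdfimages_cmd("input.pdf", "img", first=3, last=5)))
```

Note that this still works per page, not per selected area; cropping to a region would need further processing of the PostScript output.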
The open source tool from iText called iText RUPS does what you want, showing you all the PDF commands for a particular PDF and allowing you to visualize the structure and relationships.
http://sourceforge.net/projects/itextrups/

Use ghostscript to delete a page (not extracting a range)

I know Ghostscript can use -dFirstPage and -dLastPage to make a file from a range of pages only, but I need to make it (or another command line program) delete the 2nd page of any PDF without being told the page range explicitly. I thought this would be far easier because most printers let you specify "1,3-end" and I have been using PDFCreator to do it that way.
The one way I can think of doing it (very very messy) is to extract page 1, extract pages 3 to end, and then merge the two pdfs. But I also don't know how to have GS determine the number of pages.
Use the right tool for the job!
For reasons outlined by KenS, Ghostscript is not the best tool for what you want to achieve. A better tool for this task is pdftk. To remove the 2nd page from input.pdf, you should run this command line:
pdftk input.pdf cat 1 3-end output output.pdf
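If the page to drop varies, the same command can be generated. This is a minimal Python sketch; the function name pdftk_delete_page is made up, and it assumes the page being removed is not the last page of the document (otherwise the trailing range would point past the end).

```python
def pdftk_delete_page(src, dst, page):
    """Build a pdftk command that copies every page of src except `page`.

    Hypothetical helper; assumes `page` is not the last page of the PDF.
    """
    if page < 1:
        raise ValueError("page numbers start at 1")
    ranges = []
    if page > 1:
        # Pages before the deleted one: "1" alone, or "1-(page-1)".
        ranges.append("1" if page == 2 else f"1-{page - 1}")
    ranges.append(f"{page + 1}-end")
    return ["pdftk", src, "cat", *ranges, "output", dst]


if __name__ == "__main__":
    import subprocess

    cmd = pdftk_delete_page("input.pdf", "output.pdf", 2)
    print(" ".join(cmd))  # the command shown above
    # subprocess.run(cmd, check=True)  # uncomment to run pdftk for real
```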
OK, first things first: if you use Ghostscript's pdfwrite device you are NOT extracting, deleting, or performing any other 'manipulation' operation on your source PDF file. I keep reiterating this, but I'm going to say it again.
When you pass an input file through Ghostscript, it is completely interpreted into a series of graphical primitives, which are passed to the device. In general the device renders the primitives to a bitmap. In the case of the 'high level' devices such as pdfwrite, the primitives are reassembled into a brand new file (for pdfwrite, a PDF file).
This flexibility allows for input in a number of different page description languages (PostScript, PDF, PCL, PCL-XL, XPS) and then output in a few different high level formats (PostScript, EPS, flavours of PDF, XPS, PCL, PCL-XL).
But the new file bears no relation to the original, other than its appearance.
Now, having got that out of the way... You can use the pdf_info.ps PostScript program, supplied in the 'toolbin' directory of the Ghostscript installation, to get a variety of information about PDF files; one of the things you can get is the number of pages in the PDF. But you don't need to bother: run the file once with -dLastPage=1, then run it again with -dFirstPage=3 (don't set LastPage), then run both resulting files through Ghostscript together to create a single file with the pages from each combined.
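The three runs can be sketched as generated command lines. This is only an illustration in Python: the helper name, the intermediate file names part1.pdf and part2.pdf, and the executable name gs (it is gswin32c or gswin64c on Windows) are assumptions. To skip page 2, the first pass stops at page 1 and the second pass starts at page 3.

```python
def gs_delete_page(src, dst, page, part1="part1.pdf", part2="part2.pdf"):
    """Return the Ghostscript command lines that rebuild src without `page`.

    Hypothetical helper. Each pass re-creates a new PDF via pdfwrite, so the
    result matches the original in appearance only.
    """
    base = ["gs", "-q", "-dNOPAUSE", "-dBATCH", "-sDEVICE=pdfwrite"]
    commands, parts = [], []
    if page > 1:
        # Pass 1: everything before the unwanted page.
        commands.append(base + [f"-dLastPage={page - 1}", f"-sOutputFile={part1}", src])
        parts.append(part1)
    # Pass 2: everything after it; LastPage is left unset on purpose.
    commands.append(base + [f"-dFirstPage={page + 1}", f"-sOutputFile={part2}", src])
    parts.append(part2)
    # Pass 3: feed the partial files back in to produce one combined file.
    commands.append(base + [f"-sOutputFile={dst}", *parts])
    return commands


if __name__ == "__main__":
    for cmd in gs_delete_page("input.pdf", "output.pdf", 2):
        print(" ".join(cmd))
```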

Save data as editable Pdf

We have software that creates user reports and saves them as PDF documents. We're using Ghostscript for this.
I'm aware that PDF is "normally" an export format which is not editable, but one of our customers needs to be able to edit these files (for legal reasons).
I thought it might be possible to put the text into fillable form fields (like Adobe Acrobat offers) and save it that way. Is it possible to create text within a fillable form in a PDF and save it (with free tools like Ghostscript), so that the user can edit it later?
I read the Ghostscript documentation, but I didn't find anything.
Ghostscript isn't really a terrific tool for this. You'd be better off with a PDF generation library which can add the appropriate annotations to the page - if you're wedded to using annotations.
If the "content" must be edited by end users, using widget annotations is not a horribly bad way of doing things, except that every end user needs to have a copy of Acrobat, and if only some people are allowed to edit, you will likely have to play with owner password protection and permissions in order to prevent anyone else from changing field contents.
As for free tools, depending on the usage you could use iText or iTextSharp.
If you are required to be able to take the content of the document and make changes to it on the fly, that's a trickier beast. If you can afford it (and it's certainly not free), my company, Atalasoft, publishes a product that I wrote that lets you build PDF documents from scratch or from templates and embed the .NET objects that create the content into the PDF itself, which means that you can read those objects back out and change the content with a site-specific application, for example.

Rule based PDF text extraction for various bills and invoices

I have to extract text from PDF files of invoices and bills.
The file layouts can get complex, though they are mostly filled with tables.
I've already read a few dozen articles about the PDF format, how easy it is for our brain to grasp it and how hard it is for a machine to understand its structure.
I've also downloaded a few tools like Python's pdfminer and some Java tools. Some even have rule based layout extraction, like LA-PDFText. These are all great libraries, but they leave the final step to you.
Adobe also has an online service called ExportPDF, but it can't be customized.
Bottom line, I understand that in order to extract text from structured PDF files and convert it to XML, for example, there has to be some level of manual work.
I also found A-PDF Form Data Extractor, a non-free tool with the ability to set extraction rules that claims to do the job, though it's hard to find a proper manual and it runs only on Windows.
I thought I might even try to convert those files to images and run tesseract-ocr on them, but decided to ask for advice here before I spend more time on it.
I'll be very grateful if someone with such experience gives me a hint.
I've done a lot of PDF extraction and I can confirm, as you've already discovered, that it can be a painful process to start. One of the important things to understand is that there is no concept of "tables" within a PDF, just text that happens to have lines around it. Also, there's no guarantee that the linear order of text within the PDF code actually matches the visual order when printed. In other words, there's no guarantee that "hello world" is written in that order; it could be draw 'world' at coord 20, then draw 'hello' at coord 10. Most PDF creators don't do this, but still there's no guarantee. The more creative a PDF creator is (InDesign, Illustrator, etc.), the harder the text is likely to be to get out. And once a designer starts messing with fonts too much, some programs will actually output words one character at a time, changing the font just slightly each time.
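To illustrate the ordering problem, here is a minimal Python sketch that restores reading order from positioned text fragments. It is not tied to any particular PDF library: the Fragment type, the reading_order function and the line_tolerance value are assumptions for the example, and it relies on the usual PDF convention that y grows upward from the bottom of the page.

```python
from typing import NamedTuple


class Fragment(NamedTuple):
    """One piece of text with the coordinates it was drawn at (hypothetical type)."""
    x: float
    y: float
    text: str


def reading_order(fragments, line_tolerance=2.0):
    """Group fragments into lines by baseline, then sort each line left to right."""
    lines = []  # each entry is [baseline_y, [fragments on that line]]
    # Larger y is higher on the page, so visit fragments from top to bottom.
    for frag in sorted(fragments, key=lambda f: -f.y):
        if lines and abs(lines[-1][0] - frag.y) <= line_tolerance:
            lines[-1][1].append(frag)
        else:
            lines.append([frag.y, [frag]])
    return [
        " ".join(f.text for f in sorted(frags, key=lambda f: f.x))
        for _, frags in lines
    ]


if __name__ == "__main__":
    # Stored order is "world", "hello", "next"; visual order differs.
    stored = [
        Fragment(20, 700, "world"),
        Fragment(10, 700, "hello"),
        Fragment(10, 680, "next"),
    ]
    print(reading_order(stored))
```

Real invoices need more than this (column detection, rotated text, per-character output), which is where rule based tools earn their keep.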
That said, I'd recommend the first one that you looked at, LA-PDFText. You can run it in discovery mode (blockify) from which you can create rules. I don't have Java installed anymore so I can't test it but it seems very promising.
Your second one, A-PDF Form Data Extractor, only really works with actual PDF forms. If this is your case I'd recommend just using an open source solution like iText/iTextSharp.
The last one, OCR, makes me cringe. I just can't imagine that jumping through those hoops would get you a better text representation than parsing the PDF. But then again, PDF is a visual format, so maybe it would.
Personally I use iText/iTextSharp for this kind of thing but I also like to do things the hard way.
It is not clear if you are looking for the development tool to automate the data extraction from bills and invoices or just for the one time tool (utility) that can be used by the non-developer?
Anyway here are some specialized tools including engines they use:
Tabula (open-source, especially designed to extract data from tables in PDF. Can export shell scripts for batch processing, runs as the localhost web service, powered by JRuby Tabula engine)
Viet OCR (open-source .NET desktop utility for text extraction from PDF and images, based on the Tesseract OCR engine)
Bytescout PDF Viewer (freeware closed source .NET utility, detects and extracts tables, including scanned invoices, powered by PDF Extractor SDK)
DISCLAIMER: I work for ByteScout.