Append text to PDF in ColdFusion 8 - pdf

I have a PDF that I want to append some text to. The addFooter() that is available in CF9 would work perfectly, but I only have access to CF8.
Does anyone have a workaround for this feature in CF8?
Thanks

Yes, even in ColdFusion 8 you can use DDX to add footers and headers to a PDF. See the Adobe ColdFusion 8 LiveDocs for how to do this. I also have a couple of blog posts, 1 and 2, that might help. Although I tested on CF9, they contain information that is valid for CF8 as well. You might also want to get the almost-impossible-to-find DDX reference. Also check out ColdFusion Jedi's 8-part series on PDF manipulation in CF8.
UPDATE (Added information below on combining text):
To take PDF1 and PDF2 and put their text on a single page in the resulting PDF, the first thing that comes to mind is to use cfpdf with the getinfo action to get the text (if you don't already have it in plain text or HTML format). Then you could cfoutput the text into a cfdocument element with format="pdf". That way you get a new merged PDF with the contents combined.

Related

How to merge PDFs into a PDFA1b with watermarks using iText5

Here is what I need to do:
Merge several PDF documents (which may or may not be PDFA) into one PDFA1b.
Add a watermark (a simple text label) on each page of the resulting PDF.
It has to be with iText 5
I have looked at this official merging example: http://developers.itextpdf.com/examples/merging-pdf-documents/adding-cover-page-existing-pdf
But can this method be used to create a PDFA, and also add watermarks?
Or am I stuck with using this other method which he specifically says not to use: http://developers.itextpdf.com/examples/merging-pdf-documents-itext5/how-not-merge-documents
You can create files that conform to PDF/A-1b with just about any PDF library including iText. PDF/A, in general, is a subset of ISO 32000 (PDF) so it's really just a matter of using the tool to do what you need to with the files but not adding anything that is forbidden by PDF/A-1b (in your case).
The thing to be aware of is that iText, or any of the other libraries that "support" PDF/A, will not prevent you from modifying a PDF in a way that is forbidden by PDF/A. You just need to know what those things are.
So... before merging, you'll want to be sure that the input files don't have any annotations or form fields or any other interactive content.
After merging, add your watermark as page content and be sure your XMP metadata is conforming and you should be OK.

Cannot select PDF from top to bottom

I'm using pdftotext to extract info from a pdf. Currently using the -raw option. I do have a few problems with the PDFs I'm working with. If I select the text from top to bottom it selects in the following fashion.
PDF content:
A
B
C
It selects A then C and then B. So when I extract the text it is presented in the same way. Is there a way to reformat the PDF so I can select the content from top to bottom?
NOTE: I'm aware that if I omit the "raw" option the layout will be preserved, but it seems to be buggy when the document includes tables so raw works better for me.
Yes, you can reformat the PDF so that the content is returned from top to bottom. This is not something that can be easily done using Adobe Acrobat or any other viewer that I am aware of and here is why.
From the documentation of pdftotext, the -raw option is defined as
Keep the text in content stream order. This is a hack which often "undoes" column formatting, etc. Use of raw mode is no longer recommended.
"content stream order" is the important piece in the description.
In PDFs, the content on a page does not have to be written in the content stream (the instructions that are interpreted to display the page) in the order that a human would read it when the page is rendered. The internals of PDF do not care about the ordering; the format was designed to reproduce the same visual appearance of a document on a variety of platforms. Since all that matters to PDF is the visual result, applications and libraries that write PDF tend not to order the content stream in any meaningful way.
So you can reorder the instructions in a content stream so that they are in the order a human would read them, but it is not an easy task to do by hand. Using a library that understands PDF to manipulate the content stream would be one way of doing this. Another way is to look for a more advanced tool to extract text from the PDF (there are a number of tools that look at the placement of the content on a page rather than just where it appears in the content stream).
I am not aware of anything that will reorder the content stream in the PDF based on where the content appears on the page automatically though.
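To illustrate what "looking at the placement of the content on a page" means, here is a minimal Python sketch. It assumes you already have text fragments with their page coordinates from some extraction tool; the (x, y, text) tuple format and the reading_order name are made up for this example and are not the output of pdftotext or any particular library. It also assumes the default PDF coordinate system, where y grows upward from the bottom of the page.

```python
from typing import List, Tuple

# Hypothetical fragment format: (x, y, text), with y growing upward
# as in default PDF user space.
Fragment = Tuple[float, float, str]


def reading_order(fragments: List[Fragment], line_tolerance: float = 2.0) -> List[str]:
    """Group fragments into lines by y position, then return the lines
    top to bottom with the fragments of each line ordered left to right."""
    lines: List[List[Fragment]] = []
    # Largest y first, because the top of the page has the highest y value.
    for frag in sorted(fragments, key=lambda f: -f[1]):
        if lines and abs(lines[-1][0][1] - frag[1]) <= line_tolerance:
            lines[-1].append(frag)  # close enough in y: same visual line
        else:
            lines.append([frag])    # start a new visual line
    return [" ".join(text for _, _, text in sorted(line, key=lambda f: f[0]))
            for line in lines]


# Fragments listed in content stream order (A, C, B), as in the question.
stream_order = [(72.0, 700.0, "A"), (72.0, 650.0, "C"), (72.0, 675.0, "B")]
print(reading_order(stream_order))
```

This only handles simple single-column pages. Multi-column layouts and tables need more than a sort on coordinates.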

iTextSharp - when extracting a page it fails to carry over Adobe rectangle highlighting important info

Per the following site...
http://forums.asp.net/t/1630140.aspx?extracting+pdf+pages+using+itextsharp
...I use the function ExtractPages to produce a new PDF based on a range of page numbers. My problem is that I noticed a rectangle on the 2nd page of a PDF was not carried over along with the page. This makes me fear that Adobe comments are perhaps not being carried over either when the pages get extracted.
Is there a way I can adjust this code so that comments and objects like rectangles are brought over to the new PDF when ExtractPages is called? Am I missing some syntax, or is that not available in version 5.5.0 of iTextSharp?
Your use of the verb extract in the context of extracting pages is confusing. People will think you want to extract text from a page. In reality, you want to import or copy pages.
The example you refer to uses PdfWriter. That's wrong: you should use PdfStamper (if only one existing PDF is involved) or PdfCopy (if multiple existing PDFs are involved). See my answer to the question How to keep original rotate page in itextSharp (dll) to find out why the example on forums.asp.net is a really, really bad example.
The fact that a page has "a rectangle" is irrelevant. Maybe the rectangle is an annotation. In that case, you're throwing that rectangle away by using the wrong example. Maybe the origin of the page is different from 0,0...
If your purpose is to create a new PDF containing only a selection of pages of the original PDF, please read my answer to Function that I can use to remove a single page from a PDF using iText

How to delete first page from multiple PDFs

I have a collection of PDFs that sometimes have an info page as the first page of the document, which I want to remove.
Is there a quick way to delete this info page from all of my PDFs, or at least a way to show all PDFs that have more than one page so I can better find the ones that need to be fixed?
Do you know of any program that can do this? Or a way to do this with Python?
Note: The info page has text on it that always remains the same: "LAND TITLE OFFICE"
Using Windows 7 OS
Thanks
Some Research turned up the following:
http://www.python.org/workshops/2002-02/papers/17/index.htm
http://www.unixuser.org/~euske/python/pdfminer/index.html
https://pypi.org/project/pypdf/
You can try these two ways:
PDFtk is a utility to manipulate PDFs. Check this link; they are doing something similar to what you need (in the comments someone also posted a script for Windows).
PDFsam is a powerful graphical tool to manipulate PDFs in bulk. The split and merge sections should do the trick.
Both of them are free. I'd suggest studying the first if you want to write a "recipe" that you can use often, and the latter if you only have to do it once.
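If you go the pdftk route, a "recipe" could look roughly like the sketch below. It relies on two pdftk operations: "cat 2-end" to copy everything from page 2 onward, and "dump_data", which prints a "NumberOfPages: N" line you can use to find multi-page files. The Python function names are my own, and the sketch assumes pdftk is installed and on the PATH.

```python
import re
import subprocess


def page_count_from_dump(dump_text):
    """Parse the 'NumberOfPages: N' line from `pdftk file.pdf dump_data` output."""
    match = re.search(r"^NumberOfPages:\s*(\d+)", dump_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def drop_first_page_cmd(src, dst):
    """Argument list for: pdftk src cat 2-end output dst"""
    return ["pdftk", src, "cat", "2-end", "output", dst]


def page_count(src):
    """Ask pdftk for the page count of src (requires pdftk on the PATH)."""
    result = subprocess.run(["pdftk", src, "dump_data"],
                            capture_output=True, text=True, check=True)
    return page_count_from_dump(result.stdout)


def drop_first_page(src, dst):
    """Write a copy of src without its first page; raises if pdftk fails."""
    subprocess.run(drop_first_page_cmd(src, dst), check=True)
```

Only call drop_first_page on files where page_count reports more than one page, and write to a new file name rather than over the original.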
You can use the open-source PDFBox as a command-line utility to split PDFs.
The link for PDFBox is here: link
The documentation for splitting a PDF using PDFBox is here: link
You could use the PDFBox extract text functionality from a batch script and combine with grep to identify pages that contain the text you are looking for. The extract text documentation is here: link
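Since the question also asks about Python, here is a rough sketch of the same idea using pypdf (one of the links in the question): extract the text of the first page, check it for the fixed "LAND TITLE OFFICE" text, and write a copy without that page. The function names and the output folder layout are assumptions for illustration, and text extraction will not work on scanned PDFs that contain only images.

```python
from pathlib import Path

MARKER = "LAND TITLE OFFICE"  # fixed text on the info page, per the question


def is_info_page(page_text, marker=MARKER):
    """True if the page text contains the marker, ignoring case and
    differences in whitespace or line breaks."""
    normalized = " ".join((page_text or "").split()).upper()
    return marker.upper() in normalized


def strip_info_page(src, dst, marker=MARKER):
    """Copy src to dst without its first page if that page is the info page.
    Returns True if a page was dropped. Requires pypdf (pip install pypdf)."""
    from pypdf import PdfReader, PdfWriter  # third-party, imported lazily

    reader = PdfReader(str(src))
    pages = list(reader.pages)
    # Never drop the only page of a single-page document.
    if len(pages) < 2 or not is_info_page(pages[0].extract_text(), marker):
        return False
    writer = PdfWriter()
    for page in pages[1:]:
        writer.add_page(page)
    with open(dst, "wb") as fh:
        writer.write(fh)
    return True


def process_folder(folder, out_folder):
    """Write trimmed copies into out_folder; the originals are left untouched."""
    out = Path(out_folder)
    out.mkdir(parents=True, exist_ok=True)
    for pdf in sorted(Path(folder).glob("*.pdf")):
        if strip_info_page(pdf, out / pdf.name):
            print("trimmed:", pdf.name)
```

Checking the text rather than just the page count avoids removing the first page from multi-page documents that never had an info page.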

Parse Body Text from PDF

I have just recently been experimenting with parsing the text data from a PDF document using iTextSharp in a VB2010 app. The document doesn't contain any images or other fancy elements, just text. I've read some articles and used some code snippets, and it looks promising. However, what I've been trying to do is parse out just the body of each page, minus the header and footer. I haven't found any guidance for that particular function.
Currently using the snippet found here Reading PDF content with itextsharp dll in VB.NET or C# but it parses all text in a page. There's got to be a way to just get the body. Or at least I hope so.
PDFs generally do not contain information about the logical structure of the text they contain.
So there are no headers, footers, body, paragraphs, or anything like that in a PDF. There is only a bunch of operations like "draw this glyph here" or "move to this position and draw that group of glyphs there". I wrote glyph and not character because PDFs are not required to contain readable text. Only the visual appearance is required to be specified.
One exception is Tagged PDF, but most PDFs in the wild are not tagged.
Given all of the above, you are probably left with the following approach:
Extract all text from each page
Analyze text and find similar parts at the beginning / end of each page
Remove similar parts
This is a heuristic-based detection, so it probably won't always give excellent results.
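The three steps above could be sketched roughly as follows. This is Python rather than VB.NET and is independent of iTextSharp: it assumes you already have one extracted text string per page. The function name, the number of edge lines inspected, and the 60% threshold are arbitrary choices for illustration, not part of any library.

```python
import re
from collections import Counter


def _key(line):
    # Replace digit runs so that "Page 1" and "Page 2" count as the same line.
    return re.sub(r"\d+", "#", line.strip())


def strip_repeated_edges(pages, edge_lines=2, min_share=0.6):
    """Remove lines that recur within the first or last `edge_lines` lines
    of at least `min_share` of the pages. Blank lines are dropped."""
    split = [[ln for ln in page.splitlines() if ln.strip()] for page in pages]

    # Count each candidate line once per page.
    counts = Counter()
    for lines in split:
        counts.update({_key(ln) for ln in lines[:edge_lines] + lines[-edge_lines:]})

    # A line must appear on at least two pages to count as a header or footer.
    threshold = max(2, round(min_share * len(pages)))
    repeated = {key for key, n in counts.items() if n >= threshold}

    cleaned = []
    for lines in split:
        last = len(lines) - edge_lines
        body = [ln for i, ln in enumerate(lines)
                if not ((i < edge_lines or i >= last) and _key(ln) in repeated)]
        cleaned.append("\n".join(body))
    return cleaned


pages = [
    "ACME Report\nFirst body line\nMore text\nPage 1",
    "ACME Report\nSecond body line\nPage 2",
    "ACME Report\nThird body line\nOther\nPage 3",
]
for body in strip_repeated_edges(pages):
    print(repr(body))
```

Body text that happens to repeat at the top or bottom of most pages would be removed too, so check the results against a few real documents before relying on it.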