Coordinates of an element in a PDF file using iText


I'm creating a PDF file using the BIRT reporting library. Later I need to digitally sign these files, and I'm using iText to do so.
The issue I'm facing is that I need to place the signature in different places in different reports. I already have the code to digitally sign the document, but right now I always place the signature at the bottom of the last page of every report.
Eventually I need each report to specify where the signature should go. I then have to read that location using iText and place the signature there.
Is this possible to achieve using BIRT and iText?
Thanks

If you're willing to cheat a bit, you can use a link... which BIRT supports according to my little dive into their docs just now.
A link is an annotation. Sadly, iText doesn't support examining annotations at a high level, only generating them, so you'll have to use the low-level object calls.
The code to extract it might look something like this:
// getPageN is looking for a page number, not a page index
PdfDictionary lastPageDict = myReader.getPageN(myReader.getNumberOfPages());
PdfArray annotations = lastPageDict.getAsArray( PdfName.ANNOTS );
PdfArray linkRect = null;
if (annotations != null) {
    int numAnnots = annotations.size();
    for (int i = 0; i < numAnnots; ++i) {
        PdfDictionary annotDict = annotations.getAsDict( i );
        if (annotDict == null)
            continue; // it'll never happen, unless you're dealing with a Really Messed Up PDF.
        if (PdfName.LINK.equals( annotDict.getAsName( PdfName.SUBTYPE ) )) {
            // if this isn't the only link on the last page, you'll have to check the URL, which
            // is a tad more work.
            linkRect = annotDict.getAsArray( PdfName.RECT );
            // a little sanity check here wouldn't hurt, but I have yet to come across a PDF
            // that was THAT screwed up, and I've seen some Really Messed Up PDFs over the years.
            // and kill the link, it's just there for a placeholder anyway.
            // iText doesn't maintain any extra info on links, so no need for other calls.
            annotations.remove( i );
            break;
        }
    }
}
if (linkRect != null) {
    // linkRect is an array, thusly: [ llx, lly, urx, ury ].
    // you could use floats instead, but I wouldn't go with integers.
    double llx = linkRect.getAsNumber( 0 ).getDoubleValue();
    double lly = linkRect.getAsNumber( 1 ).getDoubleValue();
    double urx = linkRect.getAsNumber( 2 ).getDoubleValue();
    double ury = linkRect.getAsNumber( 3 ).getDoubleValue();
    // make your signature
    magic();
}
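The "little sanity check" mentioned in the comments could look roughly like the sketch below. It is plain Java with no iText dependency, and the class and method names (RectSanity, normalize) are made up for illustration. It relies on the fact that the PDF specification allows a rectangle to be given by any two diagonally opposite corners, so the four numbers read from /Rect are not guaranteed to arrive in [ llx, lly, urx, ury ] order.

```java
import java.util.Arrays;

// Hypothetical helper, not part of iText: orders the corners of a /Rect array.
final class RectSanity {

    /**
     * Returns {llx, lly, urx, ury} with the corners ordered, or null if the
     * input is not a usable rectangle (wrong length, NaN/infinite, zero area).
     */
    static double[] normalize(double[] rect) {
        if (rect == null || rect.length != 4) {
            return null;
        }
        for (double v : rect) {
            if (Double.isNaN(v) || Double.isInfinite(v)) {
                return null;
            }
        }
        double llx = Math.min(rect[0], rect[2]);
        double lly = Math.min(rect[1], rect[3]);
        double urx = Math.max(rect[0], rect[2]);
        double ury = Math.max(rect[1], rect[3]);
        if (urx - llx <= 0 || ury - lly <= 0) {
            return null; // degenerate: nothing to place a signature in
        }
        return new double[] { llx, lly, urx, ury };
    }

    public static void main(String[] args) {
        // Corners given as upper-right first, then lower-left.
        System.out.println(Arrays.toString(normalize(new double[] { 300, 120, 100, 50 })));
    }
}
```

You would feed it the four doubles read from linkRect and skip the signature placement if it returns null.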
If BIRT generates some text in the page contents under the link for its visual representation, that's only a minor issue. Your signature should cover it completely.
You're definitely better off if you can generate the signature directly from BIRT in the first place, but my little inspection of their docs didn't exactly fill me with confidence in their PDF customization abilities... despite sitting on top of iText themselves. It's a report generator that happens to be able to produce PDFs... I shouldn't expect too much.
Edit: If you need to look for the specific URL, you'll want to look at section "12.5.6.5 Link Annotations" of the PDF Reference, which can be found here:
http://www.adobe.com/content/dam/Adobe/en/devnet/pdf/pdfs/PDF32000_2008.pdf

I don't know anything about BIRT, and have only a little familiarity with iText. But maybe this works...
Can BIRT generate the signature box's outline as a regular form field with a given field name? If so, then you should be able to:
Look up that field by name in iText's AcroFields hashmap, using getField;
Create a new signature using the pdf stamper, and set its geometry based on the values of the old field object; and
Delete the old field using removeField.

Related

how to get pdf origin contents using itext

Let me make the problem concrete. I currently have three PDFs.
The first PDF is a plain PDF without any signature. The link is as follows:
https://drive.google.com/file/d/14gPZaL2AClRlPb5R2FQob4BBw31vvqYk/view?usp=sharing
The second PDF is the first PDF digitally signed using Adobe Acrobat DC. The link is here:
https://drive.google.com/file/d/1CSrWV7SKrWUAJAf2uhwRZ8ephGa_uYYs/view?usp=sharing
The third PDF was generated with the code you once provided, shown below:
com.itextpdf.kernel.pdf.PdfReader pdfReader = new com.itextpdf.kernel.pdf.PdfReader(
        new FileInputStream("C:\\Users\\Dell\\Desktop\\test2.pdf"));
com.itextpdf.kernel.pdf.PdfDocument pdfDocument = new com.itextpdf.kernel.pdf.PdfDocument(pdfReader);
SignatureUtil signatureUtil = new SignatureUtil(pdfDocument);
for (String name : signatureUtil.getSignatureNames()) {
    System.out.println(name);
    PdfSignature signature = signatureUtil.getSignature(name);
    PdfArray b = signature.getByteRange();
    long[] longs = b.asLongArray();
    RandomAccessFileOrArray rf = pdfReader.getSafeFile();
    try (InputStream rg = new RASInputStream(new RandomAccessSourceFactory().createRanged(rf.createSourceView(), longs));
         ByteArrayOutputStream byteArrayOutputStream = new com.itextpdf.io.source.ByteArrayOutputStream()) {
        byte[] buf = new byte[8192];
        int rd;
        while ((rd = rg.read(buf, 0, buf.length)) > 0) {
            byteArrayOutputStream.write(buf, 0, rd);
        }
        byte[] bytes1 = byteArrayOutputStream.toByteArray();
        String s2 = DatatypeConverter.printBase64Binary(bytes1);
    }
}
Processing the second PDF this way yields the third PDF in base64-encoded form. The third PDF is here: https://drive.google.com/file/d/1LSbZpaVT9GrfotXplmKWl6HaCvxmaoH9/view?usp=sharing
My question is: is there a method that takes the first PDF as input and produces the third PDF as output?

If I understand you correctly, you start with an unsigned PDF document test1.pdf. You sign it using Adobe Acrobat and get a signed PDF document test2.pdf. Then you apply your code to that signed PDF and get a file test3.pdf.
And now you wonder whether you can get test3.pdf immediately from test1.pdf some other way, independent from the specific signing step done in Adobe Acrobat.
This is not possible in practice.
Signing a PDF does not merely append a few signature related attributes, it can completely re-organize the PDF internally!
For example, your original test1.pdf is a normally saved PDF with cross reference tables. Adobe Acrobat saved the signed document as a linearized PDF with object streams and cross reference streams. Also all the PDF objects are renumbered. This causes a byte-wise comparison of test1.pdf and test2.pdf to hardly find any similarities.
All these changes are not necessary for signing but merely represent Acrobat's preferred way of saving a hitherto unsigned PDF. Thus, after the next program update Acrobat may or may not change this behavior completely without prior notice.
But even if Acrobat only saved necessary changes (whenever it saves as an incremental update, it forgoes most unnecessary changes), there would still be multiple valid ways to format them.
Additionally there are multiple date and version information pieces. E.g. signing, creation, and modification time; also the signature in test2.pdf claims to have been created by Adobe Acrobat Pro DC version 2018.011.20038. A small change in the software used or in the timing of the use will create different information in the result file.
And as the output of your code, your third file, contains everything of test2.pdf except the embedded signature container, all the changes mentioned above are also in your third file.
Concerning the terms you use:
You call the output of the code you posted original content or original text (in your previous question here). This is a bit of a misnomer because that output does contain all the changes introduced by the signing program, in your example all the re-organization of the objects in the PDF by Adobe Acrobat, so it is not really original. This output merely is the signed bytes, i.e. the signed byte ranges, of the signed PDF.
Furthermore, you call that output a pdf. Strictly speaking it is not a PDF anymore, at least not a valid one. By removal of (the placeholder for) the signature container, the signature dictionary is broken and all offsets in the file after that missing value have shifted.
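To illustrate what "signed byte ranges" means and why the offsets shift, here is a minimal sketch in plain Java, without iText. The class name SignedBytes and the toy file content are made up for illustration; a real /ByteRange array holds offset/length pairs in the same way, with the gap between the two ranges covering the hex string that holds the signature container.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: concatenates the byte ranges named by a /ByteRange array.
final class SignedBytes {

    /** byteRange holds (offset, length) pairs, e.g. {0, 4, 14, 4}. */
    static byte[] extract(byte[] file, long[] byteRange) {
        if (byteRange.length % 2 != 0) {
            throw new IllegalArgumentException("ByteRange must consist of offset/length pairs");
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < byteRange.length; i += 2) {
            int offset = Math.toIntExact(byteRange[i]);
            int length = Math.toIntExact(byteRange[i + 1]);
            out.write(file, offset, length);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // Toy stand-in for a signed file: "<00000000>" plays the signature placeholder.
        byte[] file = "AAAA<00000000>BBBB".getBytes(StandardCharsets.US_ASCII);
        byte[] signed = extract(file, new long[] { 0, 4, 14, 4 });
        System.out.println(new String(signed, StandardCharsets.US_ASCII));
        System.out.println(file.length - signed.length + " bytes dropped");
    }
}
```

Everything after the gap ends up at a smaller offset in the output than in the signed file, which is why cross-reference entries pointing there no longer match.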

iText's Alt-Text adding sample code not working for PDFs tagged using Acrobat

I'm working on a PDF accessibility assignment, which is to add alternative text in a tagged PDF. I got the sample code for the same at: Add alternative text for an image in tagged pdf (PDF/UA) using iText
I was excited that my task would be finished quickly, without much R&D.
I created a Java project based on that code, and when I ran it, it worked perfectly for the input PDF used in the iText example.
Unfortunately, the same source code does not work with PDFs tagged using Acrobat.
Sample Inputs: iText PDF: no_alt_attribute.pdf & My PDF: SARO_Sample_v1.7.pdf
Issue:
// This line works and returns RootElement
PdfDictionary structTreeRoot = catalog.getAsDict(PdfName.STRUCTTREEROOT);
// --> This line always returns NULL,
// Instead of returning the child elements of RootElement
PdfArray kids = structTreeRoot.getAsArray(PdfName.K);
// --> As per the structure Kids are present
I compared the structure of both PDFs; these are my observations:
Tagging Structure - exactly the same in both PDFs.
Content Structure - almost the same, but the PDF I created has a few additions.
Tag Tree Structure - almost the same with respect to tags, but with one major difference: iText's PDF tags are marked with /T:StructElem, whereas that is not found in my PDF. Even re-tagging doesn't help.
I verified this with various tagged PDFs available to us, and all are similar (without /T:StructElem). These PDFs are validated and have passed accessibility compliance.
I need some thoughts on how to make this source code work with the PDFs we have. Alternatively, I need a way to add the missing /T:StructElem automatically to the PDFs while tagging in Acrobat.
Any help will be much appreciated!
Please do let me know if any further information is needed.
Note: I'm still not sure adding this /T:StructElem will work, since the PDFs passed validation in PAC.
If this were really an issue, those PDFs wouldn't pass validation, right? But this is the only difference I found between the two PDFs.
PS: The Acrobat version I'm using is "Adobe Acrobat (Pro) DC."
-- Thanks,SaRaVaNaN

Bruno's code in the referenced answer does not walk the whole structure tree because he did not implement all cases of the K contents. The structure element K entry is specified like this:
The children of this structure element. The value of this entry may be one of the following objects or an array consisting of one or more of the following objects in any combination: [...]
(ISO 32000-2, Table 355 — Entries in a structure element dictionary)
Bruno's code, though, always assumes the value to be an array:
PdfArray kids = element.getAsArray(PdfName.K);
(Most likely he implemented that code with just the structure tree of the PDF in question there in mind.)
Thus, replace
PdfArray kids = element.getAsArray(PdfName.K);
if (kids == null) return;
for (int i = 0; i < kids.size(); i++)
    manipulate(kids.getAsDict(i));
with something like
PdfObject kid = element.getDirectObject(PdfName.K);
if (kid instanceof PdfDictionary) {
    manipulate((PdfDictionary)kid);
} else if (kid instanceof PdfArray) {
    PdfArray kids = (PdfArray)kid;
    for (int i = 0; i < kids.size(); i++)
        manipulate(kids.getAsDict(i));
}
As you did not share an example document, I could not test the code. If there are problems, please share an example PDF.

pdfOutline of PDFKit in Objective-C

I'm trying (I'm a beginner with Objective-C) to extend a PDF viewer so that it shows the outlines included in a PDF. The viewer is based on Apple's PDFKit. (https://developer.apple.com/documentation/pdfkit/pdfoutline)
That's what I have done so far:
PDFPage *page = [_pdfDocument pageAtIndex:_pdfDocument.pageCount-1];
PDFOutline *pdfOutline = [_pdfDocument outlineRoot];
NSLog(@"LOG of pdfOutline");
NSLog(@"%@", pdfOutline);
NSLog(@"%i", pdfOutline.numberOfChildren);
That gives me the following output:
[3685:9776989] LOG of pdfOutline
[3685:9776989] <PDFOutline: 0x60c000203370>
[3685:9776989] 4
So far so good, but I somehow need the labels and the page numbers in a JSON object (this is necessary because I'll use it later in a react-native callback). I'm not even sure what the output of "pdfOutline" is.
I really have no idea how to start. The goal is clear: generate a JSON object from the outlines.

That log line just shows you a pointer to the object. You need to use the pdfOutline.label property to get the text of the outline's label.
Outlines don't contain a page number, but a Destination object, which you can read using the .destination property, or an Action object. A Destination is a page, coordinates on that page, and an optional zoom level. An Action may be "Go to page", or a URL, or something else.
Don't forget that page indexes in PDFKit start at 0, not 1!

Detecting Headers and Borders in PDF Tables using PDF Clown

I am using PDF Clown's TextInfoExtractionSample to extract a PDF table into Excel, and I was able to do it except for merged cells. In the code below, for the object "content", I see the scanned content as text, XObject, and ContainerObject, but nothing for borders. Does anyone know which object represents borders in a PDF table, or how to detect whether a piece of text is a table header?
private void Extract(ContentScanner level, PrimitiveComposer composer)
{
    if(level == null)
        return;
    while(level.MoveNext())
    {
        ContentObject content = level.Current;
    }
}

I am using PDF Clown's TextInfoExtractionSample...
In the below code, for object, "content" I see the scanned content as text, XObject, ContainerObject but nothing for borders.
while(level.MoveNext())
{
ContentObject content = level.Current;
}
A) Visit all content
In your loop code you removed very important blocks from the original example,
if(content is XObject)
{
    // Scan the external level!
    Extract(((XObject)content).GetScanner(level), composer);
}
and
if(content is ContainerObject)
{
    // Scan the inner level!
    Extract(level.ChildLevel, composer);
}
These blocks make the sample recurse into complex objects (the XObject, ContainerObject you mention) which in turn contain their own simple content.
B) Inspect all content
Anyone know what object represents borders in PDF table
Unfortunately there is nothing like a border attribute in PDF content. Instead, borders are independent objects, usually vector graphics, either lines or very thin rectangles.
Thus, while scanning the page content (recursively, as indicated in A) you will have to look for Path instances (namespace org.pdfclown.documents.contents.objects) containing
moveTo m, lineTo l, and stroke S operations or
rectangle re and fill f operations.
(This answer may help)
When you come across such lines, you will have to interpret them. These lines may be borders, but they may also be used as underlines, page decorations, ...
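As a rough illustration of that interpretation step, the sketch below classifies the width and height operands of a rectangle (re) operation as a horizontal rule, a vertical rule, or something else. It is plain Java rather than PDF Clown code, the names (BorderCandidate, classify) are made up, and the thickness threshold is an assumption you would have to tune for your documents.

```java
// Hypothetical heuristic, not part of PDF Clown: decides whether a rectangle
// drawn with "re" is thin enough to be a candidate for a table border.
final class BorderCandidate {

    enum Kind { HORIZONTAL_RULE, VERTICAL_RULE, OTHER }

    /**
     * width and height are the last two operands of the "re" operator and may
     * be negative; maxThickness is an assumed cut-off in user space units.
     */
    static Kind classify(double width, double height, double maxThickness) {
        double w = Math.abs(width);
        double h = Math.abs(height);
        if (h <= maxThickness && w > maxThickness) {
            return Kind.HORIZONTAL_RULE;
        }
        if (w <= maxThickness && h > maxThickness) {
            return Kind.VERTICAL_RULE;
        }
        return Kind.OTHER; // a real box, a dot, a background fill, ...
    }

    public static void main(String[] args) {
        System.out.println(classify(200, 0.5, 1.0));
        System.out.println(classify(-0.5, 80, 1.0));
        System.out.println(classify(200, 80, 1.0));
    }
}
```

A rule found this way is still only a candidate: whether it is a cell border, an underline, or page decoration depends on how it lines up with other rules and with the text positions.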
If the PDF happens to be tagged, things may be a bit easier insofar as you have to interpret less. Instead you can read the tagging information, which may tell you where a cell starts and ends, so you do not need to interpret graphical lines. Unfortunately, tagged PDFs are still less common than untagged ones.
OR how to detect if a text is a header of the table?
Just as above, unless you happen to inspect a tagged PDF, there is nothing immediately telling you some text is a table header. You have to interpret again. Is that text outside of lines you determined to form a table? Is it inside at the top? Or just anywhere inside? Is it drawn in a specific font? Or larger? Different color? Etc.

iText throws ClassCastException: PdfNumber cannot be cast to PdfLiteral

I am using iText v5.5.1 to read a PDF and render plain text from it:
pdfReader = new PdfReader(new CloseShieldInputStream(is));
pdfParser = new PdfReaderContentParser(pdfReader);
int maxPageNumber = pdfReader.getNumberOfPages();
int pageNumber = 1;
StringBuilder sb = new StringBuilder();
SimpleTextExtractionStrategy extractionStrategy = new SimpleTextExtractionStrategy();
while (pageNumber <= maxPageNumber) {
    pdfParser.processContent(pageNumber, extractionStrategy);
    sb.append(extractionStrategy.getText());
    pageNumber++;
}
On one PDF file the following exception is thrown:
java.lang.ClassCastException: com.itextpdf.text.pdf.PdfNumber cannot be cast to com.itextpdf.text.pdf.PdfLiteral
at com.itextpdf.text.pdf.parser.PdfContentStreamProcessor.processContent(PdfContentStreamProcessor.java:382)
at com.itextpdf.text.pdf.parser.PdfReaderContentParser.processContent(PdfReaderContentParser.java:80)
That PDF file seems to be broken, but maybe its contents still makes sense...

Indeed
That PDF file seems to be broken
The content streams of all pages look like this:
/GS1 gs
q
595.00 0 0
It looks like they are all cut off early, as the last line is not a complete operation. This certainly can make a parser hiccup, as iText does.
Furthermore the content should be longer because even the size of their compressed stream is a bit larger than the length of this. This indicates streams broken on the byte level.
Looking at the bytes of the PDF file one cannot help but notice that
even inside binary streams the codes 13 and 10 only occur together and
cross-reference offset values are less than the actual positions.
So I assume that this PDF has been transmitted using a transport method handling it as textual data, especially replacing any kind of assumed line break (CR or LF or CR LF) with the CR LF now omnipresent in the file (CR = Carriage Return = 13; LF = Line Feed = 10). Such replacements will automatically break any compressed data stream like the content streams in your file.
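The first observation can be checked mechanically. The sketch below is plain Java with a made-up class name (CrLfCheck); it reports whether every CR (13) in a byte array is immediately followed by an LF (10) and every LF is preceded by a CR. Applied to the raw bytes of a compressed stream of some length, a positive result is a strong hint, though not proof, that the file went through a text-mode transfer, because in genuinely binary data the two byte values would be expected to show up on their own as well.

```java
// Hypothetical check, not part of iText: detects data in which CR and LF
// occur only as CR LF pairs.
final class CrLfCheck {

    /**
     * Returns true if the data contains at least one CR LF pair and no CR or
     * LF byte outside such a pair.
     */
    static boolean onlyPairedCrLf(byte[] data) {
        boolean sawPair = false;
        for (int i = 0; i < data.length; i++) {
            if (data[i] == 13) {
                if (i + 1 >= data.length || data[i + 1] != 10) {
                    return false; // lone CR
                }
                sawPair = true;
                i++; // skip the LF that belongs to this pair
            } else if (data[i] == 10) {
                return false; // lone LF
            }
        }
        return sawPair;
    }

    public static void main(String[] args) {
        System.out.println(onlyPairedCrLf(new byte[] { 1, 13, 10, 2, 13, 10 }));
        System.out.println(onlyPairedCrLf(new byte[] { 1, 10, 2, 13, 3 }));
    }
}
```

On textual parts of a PDF a positive result means nothing, since CR LF is a legitimate end-of-line marker there; the check is only informative on stream data.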
Unfortunately, though...
but maybe its contents still makes sense
Not much. There is one big image associated to each page respectively. Considering the small size of the content streams and the large image size I would assume that the PDF only contains scanned pages. But the images also are broken due to the replacements mentioned above.

This isn't the best solution, but I had this exact problem and unfortunately can't share the exact PDFs I was having issues with.
I made a fork of itextpdf that catches the ClassCastException and just skips PdfObjects that it takes issue with. It prints to System.out what the text contained and what type itextpdf thinks it was. I haven't been able to map this out to some systemic problem with my PDFs (someone smarter than me will need to do that), and this exception only happens once in a blue moon. Anyway, in case it helps anyone, this fork at least doesn't crash your code, lets you parse the majority of your PDFs, and gives you a bit of info on what types of bytestrings seem to give itextpdf indigestion.
https://github.com/njhwang/itextpdf
