Use PDFBox to create page numbers marked as ARTIFACT for correct accessibility - pdfbox

How can I add accessible page numbers tagged as artifacts to a PDF using PDFBox?
https://www.pdfa.org/wp-content/uploads/2019/06/TaggedPDFBestPracticeGuideSyntax.pdf
Section 3.7: Artifacts
The process of laying out and paginating content for display can lead to the introduction of additional display items (e.g. page numbers on each page or table borders). These items are not part of what ISO 32000-1 defines as “real content”; they are considered artifacts of layout (see 14.8.2.2, “Real Content and Artifacts” in ISO 32000-1). A requirement for tagged PDF is to clearly distinguish “real” content from artifacts.

See the second answer to question 16581471, by Imal, for how to add plain page numbers to a document using PDFBox. His code is copied below with my changes for accessibility. I added the lines
contentStream.beginMarkedContent(COSName.ARTIFACT)
and
contentStream.endMarkedContent()
A snippet of Imal's excellent code with my additions is:
PDDocument document = PDDocument.load(new File("Input.pdf")); // PDFBox 2.x loads from a File, not a String path
int page_counter = 1;
int numberOfPages = document.getNumberOfPages();
for(PDPage page : document.getPages()){
PDPageContentStream contentStream = new PDPageContentStream(document, page, PDPageContentStream.AppendMode.APPEND, true, false);
contentStream.beginMarkedContent(COSName.ARTIFACT);
contentStream.beginText();
contentStream.setFont(PDType1Font.TIMES_ITALIC, 10);
PDRectangle pageSize = page.getMediaBox();
float x = pageSize.getLowerLeftX();
float y = pageSize.getLowerLeftY();
contentStream.newLineAtOffset(x + pageSize.getWidth()- 100, y + 20);
String text = "Page " + page_counter + " of " + numberOfPages;
contentStream.showText(text);
contentStream.endText();
contentStream.endMarkedContent();
contentStream.close();
++page_counter;
}
document.save("Output.pdf");
document.close(); // release the resources held by the document
As far as we can tell the page numbers are not in the Acrobat Pro accessibility tree, which seems correct. The page numbers seem to have been marked as artifacts.

Related

iText5.x Setting pushbutton appearance without breaking Seal

Here is the context:
We add two empty pages to an existing pdf, each containing an empty pushbutton field
We apply a PAdES B-B seal with all modification rights on the document
We modify a pushbutton to insert an image in it
When we try to modify the pushbutton appearance to set an image, the seal validity breaks with "unauthorized modification" no matter what we try.
Here is a code sample:
PdfReader pdfReader = new PdfReader("test.pdf");
PdfStamper pdfStamper = new PdfStamper(pdfReader, output, pdfReader.getPdfVersion(), true);
AcroFields acroFields = pdfStamper.getAcroFields();
String imageFieldId = "imageField1";
acroFields.setField(imageFieldId, Base64.encodeBytes(consentImage));
pdfStamper.close();
pdfReader.close();
We also tried the way recommended in the documentation, without success:
PushbuttonField pbField = acroFields.getNewPushbuttonFromField(imageFieldId);
pbField.setImage(Image.getInstance("image1.jpg"));
acroFields.replacePushbuttonField(imageFieldId, pbField.getField());
The problem is that I don't know whether this type of modification is unsupported by iText or whether our way of modifying the button is wrong.
Update:
If the certification is replaced by a simple signature, we can set the pushbutton appearance without breaking it.
Why the certification signature is broken
You say
We apply a PAdES B-B seal with all modification rights on the document
which does not mean that all imaginable modifications of the document are allowed but instead that all allowable modifications are allowed. According to the PDF specification the choices are:
1. No changes to the document shall be permitted; any change to the document shall invalidate the signature.
2. Permitted changes shall be filling in forms, instantiating page templates, and signing; other changes shall invalidate the signature.
3. Permitted changes shall be the same as for 2, as well as annotation creation, deletion, and modification; other changes shall invalidate the signature.
Thus, in case of your document the allowed changes include form fill-ins and arbitrary annotation manipulation.
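As a rough illustration of the three levels quoted above, here is a plain-Java sketch that models them as a simple lookup. The class `DocMdpPermissions`, the enum `ChangeKind`, and the method `isPermitted` are hypothetical names made up for this illustration; they are not part of iText or of the PDF specification, and a real validator such as Adobe Reader applies far more detailed checks.

```java
final class DocMdpPermissions {

    /** Coarse categories of change, following the wording of the specification quote. */
    enum ChangeKind {
        FORM_FILL_IN,
        PAGE_TEMPLATE_INSTANTIATION,
        SIGNING,
        ANNOTATION_CHANGE, // creation, deletion, or modification of annotations
        OTHER              // anything else, e.g. replacing a form field
    }

    /** Returns true if a change of the given kind is permitted at the given level (1, 2 or 3). */
    static boolean isPermitted(int level, ChangeKind change) {
        if (level < 1 || level > 3) {
            throw new IllegalArgumentException("level must be 1, 2 or 3");
        }
        switch (change) {
            case FORM_FILL_IN:
            case PAGE_TEMPLATE_INSTANTIATION:
            case SIGNING:
                return level >= 2; // permitted at levels 2 and 3
            case ANNOTATION_CHANGE:
                return level == 3; // permitted only at level 3
            default:
                return false;      // other changes invalidate the signature at every level
        }
    }
}
```

In this model, `isPermitted(3, ChangeKind.OTHER)` is false, which mirrors the point made in this answer: even with the most permissive level, a change that goes beyond form fill-in and annotation manipulation is not allowed.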
Unfortunately, iText 5, when setting "the value" of an AcroForm push button, does not merely change the appearance of the button but instead does the following:
PushbuttonField pb = getNewPushbuttonFromField(name);
pb.setImage(img);
replacePushbuttonField(name, pb.getField());
I.e. it essentially replaces the former push button with a similar one. This as such is not allowed.
Why a mere approval signature is not broken
The PDF specification does not restrict the changes allowed to a document signed by a mere approval signature (unless restrictions explicitly are given in a FieldMDP transform).
Adobe once claimed that they do restrict changes allowed to signed but not certified documents like those to a certified document with restriction value 3 plus "Adding signature fields", cf. this answer, but apparently they are a bit laxer in other respects, too. In particular current Adobe Reader versions only warn about "Form Fields with Property Changes" in the case at hand.
An additional complication
The PDF in question actually does not have only the AcroForm form definition, instead it has a similar XFA form definition, it is a hybrid form document. Thus, to change the image in both form definitions, one has to consider the filling of the XFA form, too.
Fortunately, the way iText 5 fills in the image into the XFA form does not make Adobe Reader assume the seal broken.
How to set the button image instead to not break the seal
To not break the seal, we have to set the button image without changing the underlying form, merely the widget. Thus, the following code attempts to only change the appearance of the button:
PdfReader pdfReader = new PdfReader(SOURCE);
PdfStamper pdfStamper = new PdfStamper(pdfReader, TARGET, pdfReader.getPdfVersion(), true);
byte[] bytes = IMAGE_BYTES;
AcroFields acroFields = pdfStamper.getAcroFields();
String name = "mainform[0].subform_0[0].image_0_0[0]";
String value = Base64.getEncoder().encodeToString(bytes);
Image image = Image.getInstance(bytes);
XfaForm xfa = acroFields.getXfa();
if (xfa.isXfaPresent()) {
name = xfa.findFieldName(name, acroFields);
if (name != null) {
String shortName = XfaForm.Xml2Som.getShortName(name);
Node xn = xfa.findDatasetsNode(shortName);
if (xn == null) {
xn = xfa.getDatasetsSom().insertNode(xfa.getDatasetsNode(), shortName);
}
xfa.setNodeText(xn, value);
}
}
PdfDictionary widget = acroFields.getFieldItem(name).getWidget(0);
PdfArray boxArray = widget.getAsArray(PdfName.RECT);
Rectangle box = new Rectangle(boxArray.getAsNumber(0).floatValue(), boxArray.getAsNumber(1).floatValue(), boxArray.getAsNumber(2).floatValue(), boxArray.getAsNumber(3).floatValue());
float ratioImage = image.getWidth() / image.getHeight();
float ratioBox = box.getWidth() / box.getHeight();
boolean fillHorizontally = ratioImage > ratioBox;
float width = fillHorizontally ? 1 : ratioBox / ratioImage;
float height = fillHorizontally ? ratioImage / ratioBox : 1;
float xOffset = 0; // centered: (width - 1) / 2;
float yOffset = height - 1; // centered: (height - 1) / 2;
PdfAppearance app = PdfAppearance.createAppearance(pdfStamper.getWriter(), width, height);
app.addImage(image, 1, 0, 0, 1, xOffset, yOffset);
PdfDictionary dic = (PdfDictionary)widget.get(PdfName.AP);
if (dic == null)
dic = new PdfDictionary();
dic.put(PdfAnnotation.APPEARANCE_NORMAL, app.getIndirectReference());
widget.put(PdfName.AP, dic);
pdfStamper.markUsed(widget);
pdfStamper.close();
pdfReader.close();
(SetImageInSignedPdf test testSetInXfaAndAppearanceSampleCert)
In my tests this results in the image being visible both in viewers that support XFA forms and those that don't, and the seal not being considered broken by Adobe Reader.
Beware, though: I only developed and tested this with your sample document, so some edge cases may not be covered.
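The ratio arithmetic in the code above can be tried in isolation. The following sketch repeats the same formulas in plain Java; the class `AppearanceFit` and the method `fit` are hypothetical names used only here. It relies on the fact that the image is drawn into a 1 x 1 unit square and that the viewer stretches the appearance bounding box (width x height) over the widget rectangle.

```java
final class AppearanceFit {

    /**
     * Returns {width, height, xOffset, yOffset} of the appearance bounding box,
     * measured in units of the 1 x 1 square the image is drawn into.
     */
    static float[] fit(float imageWidth, float imageHeight, float boxWidth, float boxHeight) {
        float ratioImage = imageWidth / imageHeight;
        float ratioBox = boxWidth / boxHeight;
        // If the image is relatively wider than the widget, it fills the widget horizontally.
        boolean fillHorizontally = ratioImage > ratioBox;
        float width = fillHorizontally ? 1 : ratioBox / ratioImage;
        float height = fillHorizontally ? ratioImage / ratioBox : 1;
        float xOffset = 0;          // left aligned; centered would be (width - 1) / 2
        float yOffset = height - 1; // top aligned; centered would be (height - 1) / 2
        return new float[] { width, height, xOffset, yOffset };
    }
}
```

For a 200 x 100 image in a 100 x 100 widget, `fit(200, 100, 100, 100)` gives a bounding box of 1 x 2 with the image at offset (0, 1): once that box is stretched over the square widget, the image spans the full width and occupies the upper half, keeping its 2:1 aspect ratio.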

How to convert existing pdf files to A4 size using pdfbox?

I want to set a size (A4) on an existing document.
I am using pdfbox for watermarking. I used the following link to add the watermark. Here I am using another file that contains the watermark text. Later we simply add this layer as an overlay to the original file.
The problem arises when the file with the watermark text has a different size than the original document to which the watermark is to be added. In those cases the watermark is not positioned properly.
Version: I am using pdfbox 1.8. I tried with 2.0 but I am more comfortable with this version.
Here is the code
PDDocument originalPdfFile = PDDocument.load(filename);
PDRectangle pdRect=new PDRectangle(595, 842);//Here I am setting height and width in terms of points
List PageList = originalPdfFile.getDocumentCatalog().getAllPages();
int noOfPages=PageList.size();
System.out.println("No of pages in original document="+noOfPages);
PDPage page=new PDPage();
//PDPage page=new PDPage(PDPage.PAGE_SIZE_A4);
//Here also I tried to add page size
for (int i = 0; i < PageList.size(); i++) {
page=(PDPage)PageList.get(i);
System.out.println("Original Document size in page before cropping: "+(i+1)+", Page Resolution: "+page.getMediaBox());
page.setMediaBox(pdRect);
System.out.println("Original Document size in page after cropping: "+(i+1)+", Page Resolution: "+page.getMediaBox());
//System.out.println("Original Document size in page: "+i+", Height: "+page.getMediaBox().getHeight()+",Width: "+page.getMediaBox().getWidth());
PDRectangle rec=page.getMediaBox();
generateWatermarkText(organisationName,rec);
}
HashMap<Integer, String> overlayGuide = new HashMap<Integer, String>();
for(int i=0; i<originalPdfFile.getNumberOfPages(); i++)
{
overlayGuide.put(i+1, "C:/drm/final/final.pdf");
//watermarktext.pdf is the document which is a one page PDF with your watermark image in it.
}
Overlay overlay = new Overlay();
overlay.setInputPDF(originalPdfFile);
overlay.setOutputFile(filename);
overlay.setOverlayPosition(Overlay.Position.FOREGROUND);
overlay.overlay(overlayGuide,false);
//pdf will have the original PDF with watermarks.
The above code adds the watermark successfully, but I am not able to shrink the page.
This line
PDRectangle pdRect=new PDRectangle(595, 842);
crops the page but cuts off the contents of the page, which I don't want. I want the contents to be scaled to fit the page, and the page to be of the specified size (A4 in my case).
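Changing the media box alone only changes the visible region; fitting the contents requires scaling them as well. As a minimal sketch of just the arithmetic involved (not of how to apply it with pdfbox), the following plain-Java code computes a uniform scale factor and centering offsets for a page of arbitrary size. The class `FitToA4` and the method `fit` are hypothetical names, and the sketch assumes an unrotated page whose media box starts at (0, 0).

```java
final class FitToA4 {

    // A4 in points (1/72 inch), the same values used in the question.
    static final double A4_WIDTH = 595;
    static final double A4_HEIGHT = 842;

    /** Returns {scale, xOffset, yOffset} that fit a page of the given size into A4. */
    static double[] fit(double pageWidth, double pageHeight) {
        // A single factor for both axes keeps the aspect ratio of the contents.
        double scale = Math.min(A4_WIDTH / pageWidth, A4_HEIGHT / pageHeight);
        // Center the scaled contents on the A4 page.
        double xOffset = (A4_WIDTH - pageWidth * scale) / 2;
        double yOffset = (A4_HEIGHT - pageHeight * scale) / 2;
        return new double[] { scale, xOffset, yOffset };
    }
}
```

For a US Letter page (612 x 792 points) this yields a scale of about 0.972, no horizontal offset, and a vertical offset of about 36 points.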

Increase left margin of an existing pdf using iTextSharp [duplicate]

My web application signs PDF documents. I would like to let users download the original PDF document (not signed) but adding an image and the signers in the left margin of the pdf document.
I've seen this idea in another web application, and I would like to do the same. Of course I would like to do it using itext library.
I have attached two images, the original PDF document (not signed) and the modified PDF document.
First this: it is important to change the document before you digitally sign it. Once digitally signed, these changes will break the signature.
I will break up the question in two parts and I'll skip the part about the actual watermarking as this is already explained here: How to watermark PDFs using text or images?
This question is not a duplicate of that question, because of the extra requirement to add an extra margin to the left.
Take a look at the primes.pdf document. This is the source file we are going to use in the AddExtraMargin example with the following result: primes_extra_margin.pdf. As you can see, a half an inch margin was added to the left of each page.
This is how it's done:
public void manipulatePdf(String src, String dest) throws IOException, DocumentException {
PdfReader reader = new PdfReader(src);
int n = reader.getNumberOfPages();
PdfStamper stamper = new PdfStamper(reader, new FileOutputStream(dest));
// properties
PdfContentByte over;
PdfDictionary pageDict;
PdfArray mediabox;
float llx, lly, ury;
// loop over every page
for (int i = 1; i <= n; i++) {
pageDict = reader.getPageN(i);
mediabox = pageDict.getAsArray(PdfName.MEDIABOX);
llx = mediabox.getAsNumber(0).floatValue();
lly = mediabox.getAsNumber(1).floatValue();
ury = mediabox.getAsNumber(3).floatValue();
mediabox.set(0, new PdfNumber(llx - 36));
over = stamper.getOverContent(i);
over.saveState();
over.setColorFill(new GrayColor(0.5f));
over.rectangle(llx - 36, lly, 36, ury - lly); // height is upper y minus lower y
over.fill();
over.restoreState();
}
stamper.close();
reader.close();
}
The PdfDictionary we get with the getPageN() method is called the page dictionary. It has plenty of information about a specific page in the PDF. We are only looking at one entry: the /MediaBox. This is only a proof of concept. If you want to write a more robust application, you should also look at the /CropBox and the /Rotate entry. Incidentally, I know that these entries don't exist in primes.pdf, so I am omitting them here.
The media box of a page is an array with four values that represent a rectangle defined by the coordinates of its lower-left and upper-right corner (usually, I refer to them as llx, lly, urx and ury).
In my code sample, I change the value of llx by subtracting 36 user units. If you compare the page size of both PDFs, you'll see that we've added half an inch.
We also use these coordinates to draw a rectangle that covers the extra half inch. Now switch to the other watermark examples to find out how to add text or other content to each page.
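The media box arithmetic described above can be checked without any PDF library. In this sketch the class `MediaBoxMargin` and its methods are hypothetical names; the box is a plain array in the order llx, lly, urx, ury, and the default user unit of 1/72 inch is assumed.

```java
final class MediaBoxMargin {

    static final float POINTS_PER_INCH = 72f; // default user unit: 1/72 inch

    /** Returns a copy of the media box {llx, lly, urx, ury} widened to the left by the given margin. */
    static float[] addLeftMargin(float[] mediaBox, float margin) {
        float[] result = mediaBox.clone();
        result[0] = mediaBox[0] - margin; // moving llx to the left widens the page
        return result;
    }

    /** Width of the media box in inches. */
    static float widthInInches(float[] mediaBox) {
        return (mediaBox[2] - mediaBox[0]) / POINTS_PER_INCH;
    }
}
```

Calling `addLeftMargin(new float[] {0, 0, 595, 842}, 36)` returns the box -36, 0, 595, 842, which is half an inch wider than the original because 36 / 72 = 0.5.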
Update:
if you need to scale down the existing pages, please read Fix the orientation of a PDF in order to scale it

Some pdf file watermark does not show using iText

Our company uses iText to stamp watermark text (not an image) on some PDF forms. About 95% of the forms show the watermark correctly, but about 5% do not. To test, I copied two original PDF files, one that had been marked correctly and one that had not, and ran them through a small program, with the same result: one got marked, the other did not. I then tried the latest version of the iText jar (version 5.0.6), with the same outcome. I checked the PDF file properties, security settings, etc., but nothing gives any hint. After the program runs, the result file does change in size and is marked "changed by iText version....".
Here is the sample watermark code (using iText jar version 2.1.7). Note that the topText, mainText, and bottomText parameters passed in produce three lines of watermark text in the PDF.
Any help appreciated !!
public class WatermarkGenerator {
private static int TEXT_TILT_ANGLE = 25;
private static Color MEDIUM_GRAY = new Color(160, 160, 160);
private static int SUPPORT_FONT_SIZE = 42;
private static int PRIMARY_FONT_SIZE = 54;
public static void addWaterMark(InputStream pdfInputStream,
OutputStream outputStream, String topText,
String mainText, String bottomText) throws Exception {
PdfReader reader = new PdfReader(pdfInputStream);
int numPages = reader.getNumberOfPages();
// Create a stamper that will copy the document to the output
// stream.
PdfStamper stamp = new PdfStamper(reader, outputStream);
int page=1;
BaseFont baseFont =
BaseFont.createFont(BaseFont.HELVETICA_BOLDOBLIQUE,
BaseFont.WINANSI, BaseFont.EMBEDDED);
float width;
float height;
while (page <= numPages) {
PdfContentByte cb = stamp.getOverContent(page);
height = reader.getPageSizeWithRotation(page).getHeight() / 2;
width = reader.getPageSizeWithRotation(page).getWidth() / 2;
cb = stamp.getUnderContent(page);
cb.saveState();
cb.setColorFill(MEDIUM_GRAY);
// Top Text
cb.beginText();
cb.setFontAndSize(baseFont, SUPPORT_FONT_SIZE);
cb.showTextAligned(Element.ALIGN_CENTER, topText, width,
height+PRIMARY_FONT_SIZE+16, TEXT_TILT_ANGLE);
cb.endText();
// Primary Text
cb.beginText();
cb.setFontAndSize(baseFont, PRIMARY_FONT_SIZE);
cb.showTextAligned(Element.ALIGN_CENTER, mainText, width,
height, TEXT_TILT_ANGLE);
cb.endText();
// Bottom Text
cb.beginText();
cb.setFontAndSize(baseFont, SUPPORT_FONT_SIZE);
cb.showTextAligned(Element.ALIGN_CENTER, bottomText, width,
height-PRIMARY_FONT_SIZE-6, TEXT_TILT_ANGLE);
cb.endText();
cb.restoreState();
page++;
}
stamp.close();
}
}
We solved the problem by changing the save option in Adobe LiveCycle. Go to File->Save->properties->Save as, then look at "Save as type": the default is Acrobat 7.0.5 Dynamic PDF Form File, and we changed it to 7.0.5 Static PDF Form File (actually any static one will work). Files saved as static forms do not have this disappearing watermark problem. Thanks, Mark, for pointing in the right direction.
You're using the underContent rather than the overContent. Don't do that. It leaves you at the mercy of big, white-filled rectangles that some folks insist on drawing first thing. It's a hold over from less-than-good PostScript interpreters and hasn't been necessary for Many Years.
Okay, having viewed your PDF, I can see the problem is that this is an XFA-based form (from LiveCycle Designer). Acrobat can (and often does) rebuild the entire file based on the XFA (a type of xml) it contains. That's how your changes are lost. When Acrobat rebuilds the PDF from the XFA, all the existing PDF information is pitched, including your watermark.
The only way to get this to work would be to define the watermark as part of the XFA file contained in the PDF.
Detecting these forms isn't all that hard:
PdfReader reader = new PdfReader(...);
AcroFields acFields = reader.getAcroFields();
XfaForm xfaForm = acFields.getXfa(); // AcroFields exposes the XFA form via getXfa()
if (xfaForm != null && xfaForm.isXfaPresent()) {
// Ohs nose.
throw new ItsATrapException("We can't repel XML of that magnitude!");
}
Modifying them on the other hand could be Quite Challenging, but here's the specs.
Once you've figured out what needs to be changed, it's a simple matter of XML manipulation... but that "figure it out" part could be interesting.
Good hunting.

How do I figure out the font family and the font size of the words in a pdf document?

How do I figure out the font family and the font size of the words in a PDF document? We are trying to generate a PDF document programmatically using iText, but we are not sure how to find out the font family and the font size of the original document that needs to be generated. The document properties don't seem to contain this information.
Fonts are described in font dictionaries (dictionaries whose /Type is /Font). If you open a PDF as a text file, you should be able to find dictionary entries (they begin and end with "<<" and ">>" respectively).
On a simple PDF file, I found the following:
<</Type/Font/BaseFont/Helvetica-Bold/Subtype/Type1/Encoding/WinAnsiEncoding>>
so searching for the prefix should help you (in some PDF files there are spaces between the components, but '/Type /Font' should be OK).
Of course this is a manual process, while you would probably prefer an automatic one.
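The manual search described above can be automated in a rough way with regular expressions. This is only a sketch under strong assumptions: it looks at the raw file text, so it only sees font dictionaries that are stored uncompressed, and it only handles dictionaries without nested "<<" ... ">>" parts. The class `RawFontScan` and the method `baseFontNames` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RawFontScan {

    // A dictionary without nested dictionaries or hex strings inside.
    private static final Pattern DICTIONARY = Pattern.compile("<<([^<>]*)>>");
    // "/Type/Font" with optional whitespace, but not "/Type/FontDescriptor".
    private static final Pattern TYPE_FONT = Pattern.compile("/Type\\s*/Font(?![A-Za-z])");
    // The name object that follows /BaseFont.
    private static final Pattern BASE_FONT = Pattern.compile("/BaseFont\\s*/([^\\s/<>\\[\\]()]+)");

    /** Returns the /BaseFont names of all simple font dictionaries found in the raw text. */
    static List<String> baseFontNames(String rawPdfText) {
        List<String> names = new ArrayList<String>();
        Matcher dictionary = DICTIONARY.matcher(rawPdfText);
        while (dictionary.find()) {
            String body = dictionary.group(1);
            if (!TYPE_FONT.matcher(body).find()) {
                continue; // not a font dictionary
            }
            Matcher baseFont = BASE_FONT.matcher(body);
            if (baseFont.find()) {
                names.add(baseFont.group(1));
            }
        }
        return names;
    }
}
```

Applied to the dictionary shown above, `baseFontNames` returns a list containing `Helvetica-Bold`.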
On another note, we sometimes use Identifont or WhatTheFont to identify uncommon fonts that give us problems (logo fonts).
regards
Guillaume
Edit: the following code will find all fonts in the pages. In short, you search the dictionary of each page for the subdictionary "resources" and then the subdictionary "font". Each entry in the latter is a font dictionary, describing a font.
PdfReader reader = new PdfReader(
new FileInputStream(new File("file.pdf")));
int nbmax = reader.getNumberOfPages();
System.out.println("nb pages " + nbmax);
for (int i = 1; i <= nbmax; i++) {
System.out.println("----------------------------------------");
System.out.println("Page " + i);
PdfDictionary dico = reader.getPageN(i);
PdfDictionary ressource = dico.getAsDict(PdfName.RESOURCES);
PdfDictionary font = ressource.getAsDict(PdfName.FONT);
// we got the page fonts
Set keys = font.getKeys();
Iterator it = keys.iterator();
while (it.hasNext()) {
PdfName name = (PdfName) it.next();
PdfDictionary fontdict = font.getAsDict(name);
PdfObject typeFont = fontdict.getDirectObject(PdfName.SUBTYPE);
PdfObject baseFont = fontdict.getDirectObject(PdfName.BASEFONT);
System.out.println(String.valueOf(baseFont)); // BaseFont may be absent, e.g. for Type 3 fonts
}
}
The name (the variable "name" in the code above) is what is used in the content stream to select the font. In the PDF, you'll have to find it next to the text. The number following it is the size. Here, for example, it's size 12 (sorry, still no code for this part).
BT
/F13 12 Tf
288 720 Td
(the text to find) Tj
ET
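As a rough sketch of the missing part, the font selections can be pulled out of such a content stream with a regular expression that looks for "name size Tf". This assumes the content stream has already been decoded to text (in most PDFs it is compressed), and it ignores any scaling applied by the text matrix, so the number found is not always the effective size on the page. The class `TfScan`, its nested class `FontUse`, and the method `fontSelections` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TfScan {

    /** One "name size Tf" occurrence: the font resource name and the size operand. */
    static final class FontUse {
        final String resourceName;
        final float size;

        FontUse(String resourceName, float size) {
            this.resourceName = resourceName;
            this.size = size;
        }
    }

    // A name object, a number, then the Tf operator.
    private static final Pattern TF = Pattern.compile(
            "/([^\\s/<>\\[\\]()]+)\\s+([-+]?(?:\\d+\\.?\\d*|\\.\\d+))\\s+Tf(?![A-Za-z])");

    /** Returns every font selection found in an already decoded content stream. */
    static List<FontUse> fontSelections(String contentStream) {
        List<FontUse> result = new ArrayList<FontUse>();
        Matcher matcher = TF.matcher(contentStream);
        while (matcher.find()) {
            result.add(new FontUse(matcher.group(1), Float.parseFloat(matcher.group(2))));
        }
        return result;
    }
}
```

For the example above, `fontSelections` returns one entry with resource name `F13` and size 12. The resource name can then be looked up in the page's font dictionary (the "name" keys printed by the earlier loop) to get the base font.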
Depending on the PDF, if it hasn't been outlined you may be able to open it in Adobe Illustrator, double click the text and select some of it to see its font family, size, etc.
If the text is outlined then use one of those online tools that PATRY suggests to find out the font.
Good luck
If you have Adobe Acrobat you can see the fonts inside and examine the objects and text streams. I wrote a blog post on this at http://pdf.jpedal.org/java-pdf-blog/bid/10479/Viewing-PDF-objects