Apache PDFBox and PDF/A-3

Is it possible to use Apache PDFBox to process PDF/A-3 documents? (Especially for changing field values?)
The PDFBox 1.8 Cookbook says that it is possible to create PDF/A-1 documents with pdfaid.setPart(1);
Can I apply pdfaid.setPart(3) for a PDF/A-3 document?
If not: is it possible to read in a PDF/A-3 document, change some field values and save it, so that I have no need for creation/conversion to PDF/A-3 and the document is still PDF/A-3?

How to create a valid PDF/A-{2,3}{B,U,A}: in this example I render a page of the source PDF to an image, then create a valid PDF/A-x-y document from that image (PDFBox 2.0.x).
public static void main(String[] args) throws IOException, TransformerException
{
String resultFile = "result/PDFA-x.PDF";
FileInputStream in = new FileInputStream("src/PDFOrigin.PDF");
PDDocument doc = new PDDocument();
try
{
PDPage page = new PDPage();
doc.addPage(page);
doc.setVersion(1.7f);
/*
// A PDF/A file needs to have the font embedded if the font is used for text rendering
// in rendering modes other than text rendering mode 3.
//
// This requirement includes the PDF standard fonts, so don't use their static PDFType1Font classes such as
// PDFType1Font.HELVETICA.
//
// As there are many different font licenses it is up to the developer to check if the license terms for the
// font loaded allows embedding in the PDF.
String fontfile = "/org/apache/pdfbox/resources/ttf/ArialMT.ttf";
PDFont font = PDType0Font.load(doc, new File(fontfile));
if (!font.isEmbedded())
{
throw new IllegalStateException("PDF/A compliance requires that all fonts used for"
+ " text rendering in rendering modes other than rendering mode 3 are embedded.");
}
*/
PDPageContentStream contents = new PDPageContentStream(doc, page);
try
{
PDDocument docSource = PDDocument.load(in);
PDFRenderer pdfRenderer = new PDFRenderer(docSource);
int numPage = 0;
BufferedImage imagePage = pdfRenderer.renderImageWithDPI(numPage, 200);
PDImageXObject pdfXOImage = LosslessFactory.createFromImage(doc, imagePage);
contents.drawImage(pdfXOImage, 0,0, page.getMediaBox().getWidth(), page.getMediaBox().getHeight());
contents.close();
}catch (Exception e) {
// TODO: handle exception
}
// add XMP metadata
XMPMetadata xmp = XMPMetadata.createXMPMetadata();
PDDocumentCatalog catalogue = doc.getDocumentCatalog();
Calendar cal = Calendar.getInstance();
try
{
DublinCoreSchema dc = xmp.createAndAddDublinCoreSchema();
// dc.setTitle(file);
dc.addCreator("My APPLICATION Creator");
dc.addDate(cal);
PDFAIdentificationSchema id = xmp.createAndAddPFAIdentificationSchema();
id.setPart(3); //value => 2|3
id.setConformance("A"); // value => A|B|U
XmpSerializer serializer = new XmpSerializer();
ByteArrayOutputStream baos = new ByteArrayOutputStream();
serializer.serialize(xmp, baos, true);
PDMetadata metadata = new PDMetadata(doc);
metadata.importXMPMetadata(baos.toByteArray());
catalogue.setMetadata(metadata);
}
catch(BadFieldValueException e)
{
throw new IllegalArgumentException(e);
}
// sRGB output intent
InputStream colorProfile = CreatePDFA.class.getResourceAsStream(
"../../../pdmodel/sRGB.icc");
PDOutputIntent intent = new PDOutputIntent(doc, colorProfile);
intent.setInfo("sRGB IEC61966-2.1");
intent.setOutputCondition("sRGB IEC61966-2.1");
intent.setOutputConditionIdentifier("sRGB IEC61966-2.1");
intent.setRegistryName("http://www.color.org");
catalogue.addOutputIntent(intent);
catalogue.setLanguage("en-US");
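// Viewer preferences get their own dictionary (org.apache.pdfbox.cos.COSDictionary);
// reusing the page dictionary would write /DisplayDocTitle into the page itself.
PDViewerPreferences pdViewer = new PDViewerPreferences(new COSDictionary());
pdViewer.setDisplayDocTitle(true);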
catalogue.setViewerPreferences(pdViewer);
PDMarkInfo mark = new PDMarkInfo(); // new PDMarkInfo(page.getCOSObject());
PDStructureTreeRoot treeRoot = new PDStructureTreeRoot();
catalogue.setMarkInfo(mark);
catalogue.setStructureTreeRoot(treeRoot);
catalogue.getMarkInfo().setMarked(true);
PDDocumentInformation info = doc.getDocumentInformation();
info.setCreationDate(cal);
info.setModificationDate(cal);
info.setAuthor("My APPLICATION Author");
info.setProducer("My APPLICATION Producer");
info.setCreator("My APPLICATION Creator");
info.setTitle("PDF title");
info.setSubject("PDF to PDF/A{2,3}-{A,U,B}");
doc.save(resultFile);
}catch (Exception e) {
throw new IllegalArgumentException(e);
}
}

PDFBox supports that, but be aware that PDFBox is a low-level library, so you have to ensure conformance yourself; there is no 'Save as PDF/A-3'. You might want to take a look at http://www.mustangproject.org, which uses PDFBox to support ZUGFeRD (electronic invoicing), a format that also requires PDF/A-3.
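To make the part and conformance values concrete: the PDF/A identification schema boils down to two XMP properties, pdfaid:part and pdfaid:conformance. The following is a minimal JDK-only sketch, not PDFBox code. The class and method names are invented for illustration, and it only covers parts 1 to 3 (part 1 defines levels A and B, parts 2 and 3 additionally define U).

```java
// Illustrative sketch only: PdfaIdSketch and its methods are hypothetical helpers, not PDFBox API.
final class PdfaIdSketch {
    // Namespace of the PDF/A identification schema (prefix pdfaid).
    static final String PDFAID_NS = "http://www.aiim.org/pdfa/ns/id/";

    // Returns true if the conformance level is defined for the given part (parts 1 to 3 only).
    static boolean isValid(int part, String conformance) {
        boolean aOrB = "A".equals(conformance) || "B".equals(conformance);
        switch (part) {
            case 1:
                return aOrB;
            case 2:
            case 3:
                return aOrB || "U".equals(conformance);
            default:
                return false;
        }
    }

    // Builds the rdf:Description element that carries the two identification properties.
    static String rdfDescription(int part, String conformance) {
        if (!isValid(part, conformance)) {
            throw new IllegalArgumentException("Unsupported combination: PDF/A-" + part + conformance);
        }
        return "<rdf:Description rdf:about=\"\" xmlns:pdfaid=\"" + PDFAID_NS + "\">\n"
                + "  <pdfaid:part>" + part + "</pdfaid:part>\n"
                + "  <pdfaid:conformance>" + conformance + "</pdfaid:conformance>\n"
                + "</rdf:Description>";
    }

    public static void main(String[] args) {
        // Prints the identification fragment for PDF/A-3A.
        System.out.println(rdfDescription(3, "A"));
    }
}
```

Writing these values only declares conformance. It does not make the document conformant, so fonts, output intent and the rest still have to be checked with a validator.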

Related

Adding images to a DOCX

I have a template DOCX file that I am working with. The template file contains two placeholders for images (a logo and a barcode image). How can I replace these images using a BufferedImage or just by getting an image from a URL? There don't seem to be many resources on this.
I finally got it to work using bookmarks. Apparently I didn't dig deep enough before posting the question. The code is below. I did not find the methods to control the width and height of the image, which is important, but the code below does answer my question.
public void addLogoAndBarCode(WordprocessingMLPackage pack, String agencyID)
{
MainDocumentPart documentPart = pack.getMainDocumentPart();
Document wmlDoc = (Document) documentPart.getJaxbElement();
Body body = wmlDoc.getBody();
List<Object> paragraphs = body.getContent();
RangeFinder rt = new RangeFinder("CTBookmark", "CTMarkupRange");
new TraversalUtil(paragraphs, rt);
for(CTBookmark bm:rt.getStarts())
{
if(bm.getName().equals("agencyLogo"))
{
logger.info("i found bookmark");
try
{
InputStream is = new FileInputStream(agencyLogoPath+agencyID+".jpg");
byte[] bytes = IOUtils.toByteArray(is);
BinaryPartAbstractImage imagePart = BinaryPartAbstractImage.createImagePart(pack, bytes);
Inline inline = imagePart.createImageInline(null, null, 0,1, false, 800);
P p = (P)(bm.getParent());
ObjectFactory factory = new ObjectFactory();
R run = factory.createR();
Drawing drawing = factory.createDrawing();
drawing.getAnchorOrInline().add(inline);
run.getContent().add(drawing);
p.getContent().add(run);
}
catch(Exception er)
{
er.printStackTrace();
}
}
}
}

Trying to replace graphics resources in a PDF - PDFBox 2.0.8

I'm trying to manipulate image resources in some PDF files; the workflow is: extract image resources -> process each -> replace old ones with the new.
Simple task really, I have working code for extracting and replacing, but when I replace, the new file size is nearly twice the original.
To replace the images, I use PDResources.put(COSName, PDXObject). Any ideas what would cause the size increase in the resulting document? It happens even if I completely omit the middle step in the workflow to process each image resource.
public static void PDFBoxReplaceImages() throws Exception {
PDDocument document = PDDocument.load(new File("C:\\Users\\Markus\\workspace\\pdf-test\\book.pdf"));
PDPageTree list = document.getPages();
for (PDPage page : list) {
PDResources pdResources = page.getResources();
for (COSName c : pdResources.getXObjectNames()) {
PDXObject o = pdResources.getXObject(c);
if (o instanceof PDImageXObject) {
counter++;
String path = "C:\\Users\\Markus\\workspace\\pdf-test\\images\\"+counter+".png";
PDImageXObject newImg =
PDImageXObject.createFromFile(path, document);
pdResources.put(c, newImg);
}
}
}
document.save("C:\\Users\\Markus\\workspace\\pdf-test\\book.pdf");
}

PDFBox Signature Field not well recognized

I'm having trouble using PDFBox 2.0.0-RC3 to add a digital signature field to a PDF.
This is the piece of code I use:
public static void main(String[] args) throws IOException, URISyntaxException
{
PDDocument document;
document = new PDDocument();
PDPage page = new PDPage(PDRectangle.A4);
document.addPage(page);
PDAcroForm acroForm = new PDAcroForm(document);
document.getDocumentCatalog().setAcroForm(acroForm);
PDSignatureField signatureBox = new PDSignatureField(acroForm);
signatureBox.setPartialName("ENSGN-MY_SIGNATURE_FIELD-001");
acroForm.getFields().add(signatureBox);
PDAnnotationWidget widget = signatureBox.getWidgets().get(0);
PDRectangle rect = new PDRectangle();
rect.setLowerLeftX(50);
rect.setLowerLeftY(750);
rect.setUpperRightX(250);
rect.setUpperRightY(800);
widget.setRectangle(rect);
page.getAnnotations().add(widget);
try {
document.save("/tmp/mySignatureFieldGEN_PDFBOX.pdf");
document.close();
} catch (Exception io) {
System.out.println(io);
}
}
The code generates a PDF document. I open it with Acrobat Reader and this is the result:
[Screenshot: PDFBox-generated document]
As you can see, the signature panel on the left is empty, but the signature field is present and works.
I generated the same PDF with PDFTron. This is the result:
[Screenshot: PDFTron-generated document]
In this case the signature panel on the left correctly shows the presence of the signature field.
I would like to obtain this second (correct) result, but I don't understand why PDFBox doesn't do this.
Many thanks
Add this:
widget.setPage(page);
This sets the widget's /P entry, i.e. the reference back to the page the annotation belongs to.
Now the panel on the left appears. How did I get the idea? I got a document with such an empty signature field (from here), and compared it with yours with PDFDebugger.

Text Extraction, Not Image Extraction

Please help me understand if my solution is correct.
I'm trying to extract text from a PDF file with a LocationTextExtractionStrategy parser. I'm getting exceptions because the ParseContentMethod tries to parse inline images? The code is simple and looks similar to this:
RenderFilter[] filter = { new RegionTextRenderFilter(cropBox) };
ITextExtractionStrategy strategy = new FilteredTextRenderListener(new LocationTextExtractionStrategy(), filter);
PdfTextExtractor.GetTextFromPage(pdfReader, pageNumber, strategy);
I realize the images are in the content stream, but I have a PDF file that fails text extraction because of inline images. It throws an UnsupportedPdfException of "The filter /DCTDECODE is not supported" and then finally fails with an InlineImageParseException of "Could not find image data or EI", when all I really care about is the text. The BI/EI exists in my file, so I assume this failure is caused by the /DCTDECODE exception. But again, I don't care about images, I'm looking for text.
My current solution for this is to add a filterHandler in the InlineImageUtils class that assigns the Filter_DoNothing() filter to the DCTDECODE filterHandler dictionary. This way I don't get exceptions when I have InlineImages with DCTDECODE. Like this:
private static bool InlineImageStreamBytesAreComplete(byte[] samples, PdfDictionary imageDictionary) {
try {
IDictionary<PdfName, FilterHandlers.IFilterHandler> handlers = new Dictionary<PdfName, FilterHandlers.IFilterHandler>(FilterHandlers.GetDefaultFilterHandlers());
handlers[PdfName.DCTDECODE] = new Filter_DoNothing();
PdfReader.DecodeBytes(samples, imageDictionary, handlers);
return true;
} catch (IOException e) {
return false;
}
}
public class Filter_DoNothing : FilterHandlers.IFilterHandler
{
public byte[] Decode(byte[] b, PdfName filterName, PdfObject decodeParams, PdfDictionary streamDictionary)
{
return b;
}
}
My problem with this "fix" is that I had to change the iTextSharp library. I'd rather not do that so I can try to stay compatible with future versions.
Here's the PDF in question:
https://app.box.com/s/7eaewzu4mnby9ogpl2frzjswgqxn9rz5

Issues with iTextsharp and pdf manipulation

I am getting a PDF document (no password) which is generated by third-party software, with JavaScript and a few editable fields in it. If I load this document with the PdfReader class, the NumberOfPages property is always 1, although the document has 17 pages. Oddly enough, the document has 17 pages if I save the stream afterwards. When I then try to open the document, Acrobat Reader shows an extended feature warning and the fields are no longer fillable (I haven't flattened the document). Does anyone know about such a problem?
Background Info:
My job is to remove the javascript code, fill out some fields and save the document afterwards.
I am using the iTextsharp version 5.5.3.0.
Unfortunately I can't upload a sample file because it contains some confidential data.
private byte[] GetDocumentData(string documentName)
{
var document = String.Format("{0}{1}\\{2}.pdf", _component.OutputDirectory, _component.OutputFileName.Replace(".xml", ".pdf"), documentName);
if (File.Exists(document))
{
PdfReader.unethicalreading = true;
using (var originalData = new MemoryStream(File.ReadAllBytes(document)))
{
using (var updatedData = new MemoryStream())
{
var pdfTool = new PdfInserter(originalData, updatedData) {FormFlattening = false};
pdfTool.RemoveJavascript();
pdfTool.Save();
return updatedData.ToArray();
}
}
}
return null;
}
//Old version that wasn't working
public PdfInserter(Stream pdfInputStream, Stream pdfOutputStream)
{
_pdfInputStream = pdfInputStream;
_pdfOutputStream = pdfOutputStream;
_pdfReader = new PdfReader(_pdfInputStream);
_pdfStamper = new PdfStamper(_pdfReader, _pdfOutputStream);
}
//Solution
public PdfInserter(Stream pdfInputStream, Stream pdfOutputStream, char pdfVersion = '\0', bool append = true)
{
_pdfInputStream = pdfInputStream;
_pdfOutputStream = pdfOutputStream;
_pdfReader = new PdfReader(_pdfInputStream);
_pdfStamper = new PdfStamper(_pdfReader, _pdfOutputStream, pdfVersion, append);
}
public void RemoveJavascript()
{
for (int i = 0; i <= _pdfReader.XrefSize; i++)
{
PdfDictionary dictionary = _pdfReader.GetPdfObject(i) as PdfDictionary;
if (dictionary != null)
{
dictionary.Remove(PdfName.AA);
dictionary.Remove(PdfName.JS);
dictionary.Remove(PdfName.JAVASCRIPT);
}
}
}
The extended feature warning is a hint that the original PDF had been signed using a usage rights signature to "Reader-enable" it, i.e. to tell the Adobe Reader to activate some additional features when opening it, and the OP's operation on it has invalidated the signature.
Indeed, he operated using
_pdfStamper = new PdfStamper(_pdfReader, _pdfOutputStream);
which creates a PdfStamper which completely re-generates the document. To not invalidate the signature, though, one has to use append mode as in the OP's fixed code (for char pdfVersion = '\0', bool append = true):
_pdfStamper = new PdfStamper(_pdfReader, _pdfOutputStream, pdfVersion, append);
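One way to check whether a save really ran in append mode is to compare bytes: an incremental update only appends to the file, so the original should be an exact prefix of the result. The following is a minimal sketch of that check in Java using only the JDK (the thread's own code is C#). AppendModeCheck and isPrefix are names invented for illustration, not part of iText.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

// Illustrative sketch only: a hypothetical helper, not an iText or iTextSharp API.
final class AppendModeCheck {
    // Returns true if 'updated' starts with every byte of 'original', in order.
    static boolean isPrefix(byte[] original, byte[] updated) {
        if (updated.length < original.length) {
            return false;
        }
        for (int i = 0; i < original.length; i++) {
            if (original[i] != updated[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: AppendModeCheck <original.pdf> <updated.pdf>");
            return;
        }
        byte[] original = Files.readAllBytes(Paths.get(args[0]));
        byte[] updated = Files.readAllBytes(Paths.get(args[1]));
        System.out.println(isPrefix(original, updated)
                ? "original bytes preserved (consistent with an incremental update)"
                : "original bytes changed (the document was rewritten)");
    }
}
```

A preserved prefix is necessary for the usage rights signature to stay intact, but it is not proof on its own: changes that the signature does not permit can still trigger a warning in Adobe Reader.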
Regarding "If I load this PDF document with the PdfReader class, the NumberOfPages property is always 1, although the document has 17 pages. Oddly enough, the document has 17 pages":
Quite likely it is a PDF with an XFA form, i.e. the PDF is only a carrier of some XFA data from which Adobe Reader builds those 17 pages. The actual PDF in that case usually contains only one page saying something like "if you see this, your viewer does not support XFA."
For a final verdict, though, one has to inspect the PDF.
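As a quick first look before a proper inspection, one can search the raw file for the /XFA key, which is the AcroForm dictionary entry that references XFA data. This is a crude, JDK-only Java sketch, and the class and method names are made up for illustration. It misses the key when the form dictionary sits inside a compressed object stream, and it can be fooled if the same bytes occur elsewhere in the file.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Illustrative heuristic only: a hypothetical helper, not an iText or PDFBox API.
final class XfaHint {
    // Returns true if the raw bytes contain the name token "/XFA".
    static boolean mentionsXfa(byte[] pdfBytes) {
        // ISO-8859-1 maps every byte to exactly one char, so binary data survives the conversion.
        String raw = new String(pdfBytes, StandardCharsets.ISO_8859_1);
        return raw.contains("/XFA");
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: XfaHint <file.pdf>");
            return;
        }
        byte[] pdfBytes = Files.readAllBytes(Paths.get(args[0]));
        System.out.println(mentionsXfa(pdfBytes)
                ? "found /XFA: possibly an XFA form"
                : "no uncompressed /XFA key found (inconclusive)");
    }
}
```

A negative result is inconclusive for the reasons above; opening the file in a PDF object browser remains the reliable way to confirm.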