XFA missing filled fields? - pdf

I am using PDFBox 1.8.12 to read the XFA content from PDF files.
I have been able to get the XFA for most of the files successfully, without missing any field values.
The trouble is with some files like error.pdf. Many of the fields, such as CIN, come back without values, yet when I open the file in a PDF viewer such as Foxit or Acrobat, the values are displayed.
public static byte[] getParsableXFAForm(File file) {
    if (file == null)
        return null;
    PDDocument doc = null;
    try {
        doc = PDDocument.load(file);
        PDDocumentCatalog catalog = doc.getDocumentCatalog();
        PDAcroForm acroForm = catalog.getAcroForm();
        // a PDF without a form has no XFA to return
        if (acroForm == null)
            return null;
        PDXFA xfa = acroForm.getXFA();
        return xfa == null ? null : xfa.getBytes();
    } catch (IOException e) {
        // handle IOException
        // happens when the file is corrupt.
        System.out.println("IOException");
        return null;
    } finally {
        // close the document on every path, including errors
        if (doc != null) {
            try {
                doc.close();
            } catch (IOException ignored) {
                // nothing useful to do if closing fails
            }
        }
    }
}
Then the byte[] is converted to String.
This is the XFA for this file; if you search it for 'U72300DL1996PLC075672', the value is missing.
This is a normal file, which gives all fields.
Any ideas? I have tried everything I could think of, and my guess is that since readers can see that value, I should be able to as well.
EDIT :
You will have to download the files, you might not be able to view them in the browser.

There are multiple entries of XFA content within the form representing the different states the form had prior to applying the different signatures. As you are using
PDDocument.load(file)
the PDF is parsed sequentially and the most current XFA content is not picked up. If you change that to
PDDocument.loadNonSeq(file,null)
the Xref information is used and the most current XFA is extracted containing the information you are looking for.
Note that for PDFBox 1.8.x one should always use PDDocument.loadNonSeq in order to parse the PDF in line with the specification, i.e. by following the Xref information. PDDocument.load should only be used to handle files with (Xref related) parsing errors, where sequential parsing can serve as a fallback.
For PDFBox 2.x, PDDocument.load parses by following the Xref, i.e. like PDDocument.loadNonSeq in 1.8, and sequential parsing is done behind the scenes in case there are errors.
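To check whether a file is likely to carry several revisions (and therefore several copies of the XFA content), one rough heuristic is to count the %%EOF markers in the raw bytes, since each incremental update appends its own trailer. The sketch below uses only the JDK; the class and method names are made up for illustration. Treat the result as a hint only: linearized files can contain two markers in a single revision, and the marker bytes could in principle occur inside an uncompressed stream.

```java
import java.nio.charset.StandardCharsets;

final class PdfRevisionCounter {
    private static final byte[] EOF_MARKER = "%%EOF".getBytes(StandardCharsets.US_ASCII);

    // Counts occurrences of the "%%EOF" marker in the raw file bytes.
    // More than one marker suggests the file was saved with incremental updates.
    static int countEofMarkers(byte[] pdfBytes) {
        int count = 0;
        for (int i = 0; i + EOF_MARKER.length <= pdfBytes.length; i++) {
            if (matchesAt(pdfBytes, i)) {
                count++;
                i += EOF_MARKER.length - 1; // continue after this marker
            }
        }
        return count;
    }

    // Compares the marker against the data starting at the given offset.
    private static boolean matchesAt(byte[] data, int offset) {
        for (int j = 0; j < EOF_MARKER.length; j++) {
            if (data[offset + j] != EOF_MARKER[j]) {
                return false;
            }
        }
        return true;
    }
}
```

The bytes can be read with java.nio.file.Files.readAllBytes; a count above one for a signed form is a good reason to make sure the parser follows the Xref information.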

Related

How to move XFA xml data into PDF/A-2 conforming File with iText/XFA Worker

Adobe's supplement to the ISO 32000 spec (ExtensionLevel 3) states that XFA data can be stored in a special place in a PDF/A-2 conforming PDF. Here is the text of that section.
Incorporation of XFA Datasets into a PDF/A-2 Conforming File
To support PDF/A-2 conforming files, ExtensionLevel 3 adds support for XML form data (XFA datasets)
through the XFAResources name tree, which is part of the name dictionary of the document catalog.
(See “TABLE 3.28 Entries in the name dictionary” on page 23.) While Acrobat forms (and form data) are
permitted in a PDF/A-2 conforming file, XML forms are not. Such XML forms are specified as XDP streams
referenced from interactive form dictionaries. XDP streams can contain XFA datasets.
For applications that convert PDF documents to PDF/A-2, the XFAResources name tree supports
relocation of XML form data from XDP streams in a PDF document into the XFAResources name tree.
The XFAResources name tree consists of a string name and an indirect reference to a stream. The string
name is created at the time the document is converted to a PDF/A-2 conforming file. The stream contains
the <xfa:datasets> element of the XFA, comprised of <xfa:data> elements.
In addition to data values for XML form fields, the <xfa:datasets> elements enable the storage and retrieval
of other types of information that may be useful for other workflows, including data that is not bound to
form fields, and one or more XML signature(s).
See the XML Architecture, XML Forms Architecture (XFA) Specification, version 2.6 in the Bibliography
We have an XFA form that we pass XML to, and we now need to convert that document to PDF/A-2.
We are currently testing XFA Worker to see if it will allow us to do this, but I have been unable to find an XFA Worker sample that does it.
I first tried to flatten the form with XFA Worker, but that removes the data completely, so it can no longer be extracted.
How do you get the XFA XML data into the place that Adobe says to put it, using XFA Worker?
UPDATE: Thanks Bruno, my code isn't allowing me to convert the XFA Form to PDF/A-2. Here is the code I used.
xfa.fillXfaForm(new ByteArrayInputStream(xmlSchemaStream.toByteArray()));
stamper.close();
reader.close();
try (ByteArrayOutputStream outputStreamDest = new ByteArrayOutputStream()) {
PdfReader pdfAReader = new PdfReader(output.toByteArray());
PdfAStamper pdfAStamper = new PdfAStamper(pdfAReader, outputStreamDest, PdfAConformanceLevel.PDF_A_2A);
....
and I get an error com.itextpdf.text.pdf.PdfAConformanceException: Only PDF/A documents can be opened in PdfAStamper.
So I am now assuming the new PdfAStamper isn't a converter but just reading in the byte array of the XFA PDF.
Allow me to start with some fatherly advice. XFA will be deprecated in ISO-32000-2 (PDF 2.0) and it is great that you are turning your XFA documents into PDF/A documents. However, why would you choose PDF/A-2? PDF/A-3 is identical to PDF/A-2 with one exception: in PDF/A-3, you are allowed to embed files of any format, such as XML files. You can even indicate the relationship between the attached XML and the PDF. Wouldn't it be smarter to create a PDF/A-3 file and to attach the original data (not the XFA file) as an attachment?
Suppose that you'd ignore this fatherly advice, what could you do?
Annex D of ISO-19005-2 (and -3) tells you that you have to add an entry to the Names dictionary of the document catalog. Unfortunately, iText 5 doesn't allow you to add your own entries to this names dictionary while creating a file, so you will have to post-process the document.
Suppose that you have a file located in filePath, then you can get the Catalog entry and the Names entry of the Catalog entry like this:
PdfReader reader = new PdfReader(filePath);
PdfDictionary catalog = reader.getCatalog();
PdfDictionary names = catalog.getAsDict(PdfName.NAMES);
You can add entries to this names dictionary. For instance, suppose that I want to add a stream with the content "Some bytes" as a custom entry. I would use this code:
public void manipulatePdf(String src, String dest) throws IOException, DocumentException {
    PdfReader reader = new PdfReader(src);
    PdfStamper stamper = new PdfStamper(reader, new FileOutputStream(dest));
    PdfDictionary catalog = reader.getCatalog();
    PdfDictionary names = catalog.getAsDict(PdfName.NAMES);
    if (names == null) {
        names = new PdfDictionary();
    }
    PdfStream stream = new PdfStream("Some bytes".getBytes());
    PdfIndirectObject objref = stamper.getWriter().addToBody(stream);
    names.put(new PdfName("ITXT_Custom"), objref.getIndirectReference());
    catalog.put(PdfName.NAMES, names);
    stamper.close();
    reader.close();
}
The result is a Names dictionary in the catalog that contains an ITXT_Custom entry referring to the new stream.
In your case, you don't want an entry named ITXT_Custom. You want to add an entry called XFAResources, and the value of that entry should be a name tree consisting of a string name and an indirect reference to a stream. It should be fairly easy to adapt my example to achieve this.
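The stream behind XFAResources is supposed to hold the datasets element rather than the whole XFA. As a rough sketch of that preparatory step, the following JDK-only helper isolates the datasets element from the full XFA XML. It does not use iText, the class and method names are hypothetical, and it assumes the datasets packet uses the usual XFA data namespace http://www.xfa.org/schema/xfa-data/1.0/.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

final class XfaDatasetsExtractor {
    // Assumed namespace of the XFA datasets packet.
    static final String XFA_DATA_NS = "http://www.xfa.org/schema/xfa-data/1.0/";

    // Returns the serialized datasets element, or null if the XML contains none.
    static String extractDatasets(String xfaXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // required for the namespace lookup below
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xfaXml)));
            NodeList matches = doc.getElementsByTagNameNS(XFA_DATA_NS, "datasets");
            if (matches.getLength() == 0) {
                return null;
            }
            // Serialize only the first datasets element, without an XML declaration.
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(matches.item(0)), new StreamResult(out));
            return out.toString();
        } catch (ParserConfigurationException | SAXException | IOException | TransformerException e) {
            throw new IllegalArgumentException("Could not read the XFA XML", e);
        }
    }
}
```

The returned string, encoded as UTF-8 bytes, could take the place of "Some bytes" in the example above. Building the name tree itself is still left to the PDF library.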
Note: All code provided by me on Stack Overflow can be used under the CC-BY-SA as defined in the Stack Exchange Network Terms of Service. If you do not like the CC-BY-SA, I also provide this code under the same license as used for iText, more specifically the AGPL.

Renaming a pdf file (scanned document) using OCR. It should read 3 zones and rename accordingly. e.g. Streetname_LastName_Date.pdf

I am running into trouble with tons of paperwork. I want to have it digitized in order to simplify searching and cut down the huge amount of time spent looking through the paperwork.
The task is rather simple: I want to scan documents that share the same layout and rename each file according to 3 areas within the document. In my case these are a reference number, a last name and the date listed on the document. It would be even better if it could move the files to folders named after an area in the document.
Here is an image, basically this but with hundreds of pdfs in batch.
http://i.imgur.com/8vwwyEb.png
I haven't been able to find a solution for days, and yet the technology is there. Have you ever come across a problem like this and found a solution? I would really appreciate your help.
The closest thing I have found is a program called FileCenter, but you need to click a button for each scan. Using OCR on existing files requires you to go through a 3-click menu for each file. I wonder if there is an easy batch program where you just select the rectangles and it does the renaming part.
I will edit this OP if any solution can be found, for anyone googling.
You may do this with the commercial component ByteScout PDF Extractor SDK, which is designed specifically for this purpose. It can extract text from a given region by coordinates in a batch, with optional OCR (which also works within the selected extraction region). The coordinates of the region to extract text from can be measured in the base document with the free PDF Multitool utility (assuming all your PDF files use the same layout).
You can extract text from a given region in C# using OCR (English language) like this:
using System;
using System.IO;
using System.Text;
using Bytescout.PDFExtractor;
using System.Drawing;
using System.Diagnostics;
namespace Example
{
    class Program
    {
        static void Main(string[] args)
        {
            // Create Bytescout.PDFExtractor.TextExtractor instance
            TextExtractor extractor = new TextExtractor();
            extractor.RegistrationName = "demo";
            extractor.RegistrationKey = "demo";
            // enable OCR auto mode, will use English by default
            extractor.OCRMode = OCRMode.Auto;
            string sourceFile = "sample.pdf";
            // Load source PDF file
            extractor.LoadDocumentFromFile(sourceFile);
            // extract from given area (measured from base typical file using PDF Multitool utility), assuming we have a reference string there
            extractor.SetExtractionArea(Rectangle.FromLTRB(10, 10, 100, 100));
            string extractedReference = extractor.GetTextFromPage(0).Trim();
            // dispose the extractor and release the original file
            extractor.Dispose();
            // Copy the original file into the file with filename based on the original reference so it will be like "1234-sample.pdf"
            string outputFile = extractedReference + "-" + sourceFile;
            File.Copy(sourceFile, outputFile);
            Console.WriteLine();
            Console.WriteLine(sourceFile + " has been copied to " + outputFile);
            Console.WriteLine("Press any key to continue...");
            Console.ReadKey();
        }
    }
}
Disclosure: I'm connected with ByteScout
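Whichever tool does the extraction, the renaming step itself is small. Here is a minimal JDK-only sketch in Java of turning three extracted zone texts into a safe file name and moving the scan into a folder named after the reference. All names are hypothetical, and the set of characters treated as unsafe is an assumption based on common Windows file name restrictions.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

final class ScanRenamer {
    // Trims the OCR text and replaces runs of whitespace or characters that are
    // unsafe in file names with a single dash.
    static String sanitize(String raw) {
        return raw.trim().replaceAll("[\\\\/:*?\"<>|\\s]+", "-");
    }

    // Builds a name of the form Reference_LastName_Date.pdf from the three zones.
    static String buildFileName(String reference, String lastName, String date) {
        return sanitize(reference) + "_" + sanitize(lastName) + "_" + sanitize(date) + ".pdf";
    }

    // Moves the scan into baseDir/<reference>/, creating the folder if needed.
    // Files.move fails if a file with the target name already exists.
    static Path moveToFolder(Path source, Path baseDir, String reference,
                             String lastName, String date) throws IOException {
        Path targetDir = Files.createDirectories(baseDir.resolve(sanitize(reference)));
        return Files.move(source, targetDir.resolve(buildFileName(reference, lastName, date)));
    }
}
```

For example, buildFileName("REF 123", "Smith", "01/03/2015") yields "REF-123_Smith_01-03-2015.pdf". OCR output can be noisy, so checking the extracted values before moving files is probably worthwhile.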

Add a cover page to a PDF document

I create a PDF document with EVO PDF library from a HTML page using the code below:
HtmlToPdfConverter htmlToPdfConverter = new HtmlToPdfConverter();
byte[] outPdfBuffer = htmlToPdfConverter.ConvertUrl(url);
Response.AddHeader("Content-Type", "application/pdf");
Response.AddHeader("Content-Disposition", String.Format("attachment; filename=Merge_HTML_with_Existing_PDF.pdf; size={0}", outPdfBuffer.Length.ToString()));
Response.BinaryWrite(outPdfBuffer);
Response.End();
This produces a PDF document, but I have another PDF document that I would like to use as the cover page of the final PDF document.
One possibility I was thinking about was to create the PDF document and then merge my cover page PDF with the PDF produced by the converter, but this looks like an inefficient solution. Saving the PDF and loading it back for the merge seems to introduce unnecessary overhead. I would like to merge the cover page while the PDF document produced by the converter is still in memory.
The following line added in your code right after you create the HTML to PDF converter object should do the trick:
// Set the PDF file to be inserted before conversion result
htmlToPdfConverter.PdfDocumentOptions.AddStartDocument("CoverPage.pdf");

PDF acroform fields become non editable in Adobe reader after writing to it using Pdfbox APIs

I am reading a PDF which has editable fields and the fields can be edited by opening it through Adobe Reader. I am using PDFBox APIs to generate an output PDF with data filled for the editable fields in input PDF. The output PDF can be opened using Adobe Reader and I am able to see the field values but I am unable to edit those fields directly from Adobe reader.
There is also a JIRA ticket for this issue and it is unresolved according to this link :
https://issues.apache.org/jira/browse/PDFBOX-1121
Can anybody please tell me if this got resolved? Also, if possible please answer the following questions related to my question:
Is there any protection policy or access permission that I need to explicitly set in order to edit the output PDF from Adobe reader?
Every time I open the PDF that was written to using pdfbox APIs, I get this message prompt:
" The document has been changed since it was created and use of extended features is no longer available...."
I am using PdfBox 1.8.6 jar and Adobe reader 11.0.8. I would really appreciate if anybody could help me with this issue.
Code snippet added to aid responders in debugging :
String outputFileNameWithPath = "C:\\myfolder\\testop.pdf";
PDDocument pdf = PDDocument.load(outputFileNameWithPath);
PDDocumentCatalog docCatalog = pdf.getDocumentCatalog();
PDAcroForm acroForm = docCatalog.getAcroForm();
// The map pdfValues is a collection of the data that I need to set in the PDF.
// I am unable to go into the details of my data source.
// The key in the data map corresponds to the PDField's expanded name and the value
// corresponds to the data that I am trying to set.
Iterator<Entry<String, String>> iter = pdfValues.entrySet().iterator();
// Iterate over all data and see if the PDF has a matching field.
while (iter.hasNext()) {
    Map.Entry<String, String> currentEntry = iter.next();
    String name = currentEntry.getKey();
    String value = currentEntry.getValue();
    PDField field = null; // reset per entry so a stale field is never reused
    if (name != null) {
        name = CommonUtils.fromSchemaNameToPdfName(name);
        field = acroForm.getField(name);
    }
    if (field != null && value != null) {
        field.setValue(value); // setting the value once the field is found
    }
}
// Set access permissions / encryption here before saving
pdf.save(outputFileNameWithPath);
Thanks.
The document has been changed since it was created and use of extended features is no longer available....
This indicates that the original form has been Reader-enabled, i.e. an integrated Usage-Rights digital signature has been applied to the document using a private key held by Adobe which tells the Adobe Reader that it shall make some extra functionality available to the user viewing that form.
If you don't want to break that signature during form fill-ins with PDFBox, you need to make sure that you make no changes other than form fill-ins and that you save the changes as an incremental update.
If you provided your form fill-in code and your source PDF, this could be analyzed in more detail.
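An incremental update only appends to the file, so the original bytes stay untouched at the start of the output. A quick way to sanity-check whether a save was incremental is to compare the two files byte by byte. This is a minimal JDK-only sketch with hypothetical names; it checks only the append-only property and says nothing about whether the appended changes are acceptable to Adobe Reader.

```java
final class IncrementalUpdateCheck {
    // Returns true if 'updated' begins with every byte of 'original' and is
    // strictly longer, which is what an append-only (incremental) save produces.
    static boolean looksIncremental(byte[] original, byte[] updated) {
        if (updated.length <= original.length) {
            return false;
        }
        for (int i = 0; i < original.length; i++) {
            if (original[i] != updated[i]) {
                return false; // an original byte was rewritten
            }
        }
        return true;
    }
}
```

Both arrays can be read with java.nio.file.Files.readAllBytes. If the check fails for the filled form, the file was rewritten as a whole, which would break the usage rights signature.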

How to count specific words in locked pdfs

How can I count specific words within a PDF file that is locked?
I am talking about annual reports here. You can search within them, but you can't copy text out of them (for whatever reason; it doesn't make sense).
After googling forever, I still haven't found a solution.
If your file contains text (and not just scanned images) and the fonts used contain information about the mapping from glyphs to characters, then you should be able to extract text from the file using any PDF library that provides text extraction capabilities.
Copying of text is usually forbidden by setting usage rights. Many PDF libraries ignore these settings and allow text extraction from locked PDFs.
Depending on the library, you might try extracting the whole text and splitting it into words yourself, or extracting the text as a collection of words (if the library can split text into words for you).
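For the first approach, once a library has given you the page text as a plain string, counting one specific word is straightforward. A minimal Java sketch using only the JDK, with hypothetical names, and assuming that case-insensitive, whole-word matching is what you want:

```java
import java.util.Locale;

final class WordCounter {
    // Counts whole-word, case-insensitive occurrences of 'target' in 'text'.
    // Words are split on any run of characters that are not letters or digits.
    static int countWord(String text, String target) {
        String wanted = target.toLowerCase(Locale.ROOT);
        int count = 0;
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (token.equals(wanted)) {
                count++;
            }
        }
        return count;
    }
}
```

With this splitting rule, "risky" does not count as an occurrence of "risk". Hyphenated words are split into their parts, which may or may not suit annual reports.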
Here is sample code for the Docotic.Pdf library that shows how to build a dictionary of the words found in a PDF document and how many times each one is used.
public static Dictionary<string, int> countWords(string file)
{
    Dictionary<string, int> wordCounts = new Dictionary<string, int>();
    using (PdfDocument pdf = new PdfDocument(file))
    {
        foreach (PdfPage page in pdf.Pages)
        {
            PdfCollection<PdfTextData> words = page.GetWords();
            foreach (PdfTextData word in words)
            {
                int count = 0;
                wordCounts.TryGetValue(word.Text, out count);
                // store the incremented value; count++ would store the old one
                wordCounts[word.Text] = count + 1;
            }
        }
    }
    return wordCounts;
}
Disclaimer: I work for the vendor of Docotic.Pdf.