iTextSharp PdfCopy makes read-only fields editable

I'm working on some code that concatenates PDF files using iTextSharp. I'm having a problem with a particular PDF that contains some read-only fields and a field that is editable (I believe they're AcroFields). In the output file all of the fields are editable.
Here is the code that I use (I've simplified it to read only one PDF):
using System.IO;
using iTextSharp.text;
using iTextSharp.text.pdf;

public static void Concat(string outputFilePath, string inputFilePath)
{
    using (var document = new Document())
    {
        using (var fileStream = new FileStream(outputFilePath, FileMode.Create, FileAccess.ReadWrite))
        using (var copier = new PdfCopy(document, fileStream))
        {
            // Merge the AcroForm fields of the added documents into one form
            copier.SetMergeFields();
            document.Open();
            var reader = new PdfReader(inputFilePath);
            copier.AddDocument(reader);
            copier.AddJavaScript(reader.JavaScript);
            copier.Close();
        }
        document.Close();
    }
}
Any ideas on how to preserve the attributes of the fields?

It looks like iText and Adobe Reader interpret the form field structure differently. E.g. look at this parent field with one child:
(Object 24 is referenced from the AcroForm dictionary Fields array. Object 130 is referenced from the Page dictionary ANNOTS array.)
So we have two field objects named PageDataCollection1[0].txtCity, the objects 24 and 130, the widget annotation being merged into the latter.
iText considers the terminal field object (object 130) to be completely in charge, using its Ff value 0 which among other things means not read-only.
Adobe Reader, on the other hand, considers the terminal field object (object 130) only to be partially in charge, using its DA value but not its Ff value. Instead the parent Ff value 1 is used which among other things means read-only.
In the course of copying the document pages, the hierarchies are flattened making the different interpretation visible.
Ad hoc I would say the behavior of iText is correct here.
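The two readings can be put side by side in a small model. This is only a sketch: plain Python dicts stand in for objects 24 and 130, the partial name on the parent and the DA string are placeholders, and the second function encodes the hypothesis about Adobe Reader made in this answer, not documented Adobe behavior. Neither function is iText or Adobe code.

```python
READ_ONLY = 1  # bit position 1 of the Ff field flags means read-only

# Plain dicts standing in for the two PDF objects (values are placeholders).
parent = {"T": "txtCity", "Ff": 1, "Kids": []}        # object 24
kid = {"FT": "Tx", "Ff": 0, "DA": "/Helv 0 Tf 0 g",    # object 130, no T entry
       "Subtype": "Widget", "Parent": parent}
parent["Kids"].append(kid)


def flags_terminal_wins(field):
    """Use the first Ff found when walking up from the terminal field."""
    node = field
    while node is not None:
        if "Ff" in node:
            return node["Ff"]
        node = node.get("Parent")
    return 0


def flags_ignoring_unnamed_kid(field):
    """Ignore Ff on nodes without a partial name T, then walk up as before."""
    node = field
    while node is not None:
        if "Ff" in node and "T" in node:
            return node["Ff"]
        node = node.get("Parent")
    return 0


print(bool(flags_terminal_wins(kid) & READ_ONLY))         # first reading
print(bool(flags_ignoring_unnamed_kid(kid) & READ_ONLY))  # second reading
```

Under the first reading the kid's own Ff value 0 applies, under the second the parent's Ff value 1 applies, which matches the difference observed between the copied file and the original.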
The behavior of Adobe Reader might be justified with this section from the specification ISO 32000-1:
It is possible for different field dictionaries to have the same fully qualified field name if they are descendants of a common ancestor with that name and have no partial field names (T entries) of their own. Such field dictionaries are different representations of the same underlying field; they should differ only in properties that specify their visual appearance. In particular, field dictionaries with the same fully qualified field name shall have the same field type (FT), value (V), and default value (DV).
(section 12.7.3.2 Field Names)
Maybe Adobe Reader tries to enforce that different representations of the same field only differ in properties that specify their visual appearance, by ignoring other properties in descendant fields without partial field names.
As there are no different representations of the field, though, there is no need for this measure here.
There is an alternative interpretation of the object structure here, which @rhens proposed:
There aren't 2 fields with the same name: object 24 is the field dictionary, object 130 is the widget annotation.
IMO this interpretation does not match the PDF specification even though it would explain the behavior of Adobe Reader.
While indeed the Kids array of a form field may contain either child fields or widgets, the object 130 in my opinion has to be considered a field (which has its own widget merged into itself) rather than a widget of field object 24.
To check whether some kid dictionary object is a child field or merely a widget, it does not suffice to find widget-specific entries in the kid: such entries can also be in a child field which has its single widget merged into itself. Thus, one instead has to check for field-specific entries in the kid.
In the case at hand the kid object 130 does have field-specific entries (foremost the field type FT but also the field flags Ff) and, therefore, should be considered a child field.
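The check described above could be sketched as follows. The dicts and the function name is_child_field are hypothetical, and the set of keys treated as field-specific is an assumption chosen for illustration (Parent and AA are left out because they can also appear on annotations).

```python
# Keys that describe a field rather than a widget annotation.
# This selection is an assumption made for the sketch.
FIELD_ONLY_KEYS = {"FT", "T", "Ff", "V", "DV"}


def is_child_field(kid):
    """Treat a kid as a child field if it carries any field-specific entry.

    Widget-specific entries such as Subtype or Rect are deliberately not
    consulted, because a field with a merged widget has them too.
    """
    return any(key in kid for key in FIELD_ONLY_KEYS)


# A kid like object 130: widget entries plus FT and Ff.
merged_kid = {"Subtype": "Widget", "Rect": [0, 0, 100, 20], "FT": "Tx", "Ff": 0}
# A plain widget: no field-specific entries at all.
plain_widget = {"Subtype": "Widget", "Rect": [0, 0, 100, 20], "Parent": "24 0 R"}

print(is_child_field(merged_kid))
print(is_child_field(plain_widget))
```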
That all being said, it is indeed possible that Adobe considers that object a mere widget (which, as mentioned above, would explain the behavior). This interpretation would not be inspired by the specification, though, as explained above. But it might be inspired by a non-negligible number of documents from the wild which erroneously have additional field-specific entries in their plain widgets and require this interpretation to be displayed as designed.

Related

PDF form fields: Separate/Extract widget dictionary from field dictionary

According to the PDF spec it is possible to merge the widget dictionary and the field dictionary when there is only one associated widget annotation. Is there some support by iText / openPDF to separate the two again? (Low level API would suffice).
Update: OK, so there seems to be no convenient method for it. But what about the following entries, which exist in both dictionaries:
AA (additional actions) are defined in (widget) annotation dictionary and in the field dictionary - so when separating where to put it?
Parent - both field and annotation define a parent - so when separating where to put it?

Why is PDF form information stored on both 'Root.AcroForm.Fields' & 'Root.Pages.Kids[0].Annots'

If I update the value of a form in either of these locations, both are affected. Why are they stored twice?
When updating these forms, is one preferred to be used over the other one?
(I'm using Python library pdfrw)
'/Root':{
'/AcroForm': {'/Fields': [(10, 0), (11, 0)] },
'/Pages': { '/Kids': [ {'/Annots': [(10, 0), (11, 0)] }] }
}
EDIT
The AcroForm dictionary references all abstract form fields (directly or indirectly) to allow immediate access to all fields of a document.
Each abstract form field may have any number of widget annotations (except signature fields with at most one annotation).
Widget annotations are for displaying the form field contents. Thus, they must be attached to the page they respectively are displayed upon. So they are referenced from the Annots of the respective page.
If a form field has no widget annotation, you cannot find it from any page.
If a form field has exactly one widget annotation, you can usually find it from exactly one page, the page that annotation is on. In this case the form field object and the widget annotation object may be merged into a single object.
If a form field has more widget annotations, you can usually find it from one or more pages, depending on whether all those annotations are on the same page or on different pages.
Thus,
Why are they stored twice?
They are not stored twice; each form field is stored only once, in one PDF object. But that form field object can usually be reached from multiple locations in the object model: from the global AcroForm object and from the Annots of each page that form field has a widget on.
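A toy model may make this concrete. It is not pdfrw code: nested Python dicts stand in for the PDF object model, and one dict object plays the merged field and widget object that both lists reference.

```python
# One merged field/widget object, as in the single-widget case above.
field = {"T": "name", "FT": "Tx", "Subtype": "Widget", "V": "old"}

# Both structures hold a reference to the same object, not a copy.
catalog = {
    "AcroForm": {"Fields": [field]},
    "Pages": {"Kids": [{"Annots": [field]}]},
}

# Updating the value via the AcroForm route...
catalog["AcroForm"]["Fields"][0]["V"] = "new"

# ...is visible via the page route, because it is the same object.
via_page = catalog["Pages"]["Kids"][0]["Annots"][0]
print(via_page["V"])
print(via_page is catalog["AcroForm"]["Fields"][0])
```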

Can indirect object inside PDF be nested?

I'm trying to understand whether indirect objects (declared with the obj/endobj keywords) can reside inside e.g. array, dictionary entries or other indirect objects.
For example
[ 3 0 0 obj (something) end ] would parse an array of [3, <indirect object>] if this was allowed.
From what I can see, all indirect objects are always at the top level in a PDF, and the fact that object streams exist suggests to me that this isn't possible, but I can't find a definite answer in the ISO standard.
EDIT:
It turns out that the ISO standard was not that clear, but the latest spec from Adobe is a bit clearer:
Note: In the data structures that make up a PDF document, certain values are required to be specified as indirect object references. Except where this is explicitly called out, any object (other than a stream) may be specified either directly or as an indirect object reference; the semantics are entirely equivalent.
Even though further up it says
Any object in a PDF file may be labeled as an indirect object.
So I'm still not 100% sure.

What is the proper way to encode an AMF0 StrictArray

After reading through the AMF0 specification, I find that I cannot work out the proper way to encode the StrictArray type.
Here is the most immediate section of the specification:
array-count = U32
strict-array-type = array-count *(value-type)
which describes the StrictArray type with Augmented Backus-Naur Form (ABNF) syntax (See RFC2234)
Does the StrictArray type have ordinal indices or simply encoded objects (without ordinal keys) in order of their appearance in the StrictArray object graph?
Also, as an additional question, does the serialization table (from which object reference IDs are generated) contain all objects in the object graph, or only objects which can be potentially encoded via reference (ECMAArray,StrictArray,TypedObject,AnonymousObject)?
See https://github.com/silexlabs/amfphp-2.0/blob/master/Amfphp/Core/Amf/Serializer.php, lines 329 to 336.
You write the number of objects, then each object.
As for the additional question: in the same code, look for Amf0StoredObjects.
Reference IDs are only assigned to referenceable objects. Which types those are differs between AMF0 and AMF3, though.
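As an illustration of "the number of objects, then each object", here is a minimal Python sketch. It is not taken from the amfphp code linked above, the function name encode_amf0 is made up, and it only covers numbers, booleans, strings, null and lists, with no reference handling. The type markers used (0x00 number, 0x01 boolean, 0x02 string, 0x05 null, 0x0A strict array) are the AMF0 ones.

```python
import struct


def encode_amf0(value):
    """Encode a small subset of Python values as AMF0 bytes."""
    if isinstance(value, bool):  # test bool first: bool is a subclass of int
        return b"\x01" + (b"\x01" if value else b"\x00")
    if isinstance(value, (int, float)):
        # Number: marker 0x00, then an 8-byte big-endian double
        return b"\x00" + struct.pack(">d", float(value))
    if isinstance(value, str):
        data = value.encode("utf-8")
        if len(data) > 0xFFFF:
            raise ValueError("long strings are not handled in this sketch")
        # String: marker 0x02, U16 byte length, UTF-8 bytes
        return b"\x02" + struct.pack(">H", len(data)) + data
    if value is None:
        return b"\x05"
    if isinstance(value, list):
        # StrictArray: marker 0x0A, U32 count, then the values in order.
        # No indices or keys are written; the position is the index.
        out = b"\x0a" + struct.pack(">I", len(value))
        for item in value:
            out += encode_amf0(item)
        return out
    raise TypeError("unsupported type: %r" % type(value))


encoded = encode_amf0([1.0, "a"])
print(encoded.hex())
```

The count is written once up front and the elements follow back to back, so ordinal indices are implied by order rather than encoded.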

How to use GtkTreeView correctly

I am using a TreeView with a ListStore as model. When the user clicks on a row I want to take some action but not using the values in the cells, but using the data I created the row from...
Currently I have the TreeView, the TreeModel (ListStore) and my own data (which I ironically call model)..
So the Questions are:
Is it "right" to have a model - an object representation of the data I want to display and fill a ListStore with that data to display in a TreeView, or would it be better to implement an own version of TreeModel (wrapping my data-model) to display the data?
And also:
If someone double-clicks on a row, I get the RowActivated event (using C#/Gtk#), which provides a Path to the activated row. With that I can get a TreeIter, and using that I can get the value of a cell. But what is the best practice for finding the data object from which the row was constructed in the first place?
(Somehow this question led me to the first one: I wondered whether getting the data object would be easier if I implemented my own TreeModel...)
It's quite awkward/difficult to implement TreeModel, so most people simply synch the data from their "real" model into a TreeStore or ListStore.
The columns in the store do not have to match the columns in the view in any way. For example, you can have a column that contains your real managed data objects.
When you add a cellrenderer to a TreeView (visual) column, you can add mappings between its properties and the columns of the store. For example, you could map one store column to the font of a text cellrenderer, and another store column to the text property of the same cellrenderer. Each time the cellrenderer is used to render a particular cell, the mappings will be used to retrieve the values from the store and apply them to the properties of the renderer before it renders.
Here's an example of a mapping:
treeView.AppendColumn ("Title", renderer, "text", 0, "editable", 4);
This maps store column 0 to the renderer's text GTK property and maps store column 4 to the editable property. For GTK property names you can check the GTK docs. Note that the example above uses a convenience method that adds a column, adds a renderer to it and adds an arbitrary number of mappings via params. To add mappings directly to a column, for example a column with multiple renderers, pack the renderers into the column, then use TreeViewColumn.AddAttribute or TreeViewColumn.SetAttributes.
You can also set up a custom data function that will be used instead of mappings. This allows you to set the properties of the renderer directly, given a TreeIter and the store - so, if all the data you want to display is trivially derived from your real data objects, you could even have your store only contain a single column of these objects, and use data funcs for all the view columns.
Here's an example of a data func that does exactly what the mapping example above does:
treeColumn.SetCellDataFunc (renderer, delegate (TreeViewColumn col,
    CellRenderer cell, TreeModel model, TreeIter iter)
{
    var textCell = (CellRendererText) cell;
    textCell.Text = (string) model.GetValue (iter, 0);
    textCell.Editable = (bool) model.GetValue (iter, 4);
});
Obviously data functions are much more powerful because they enable you not only to use properties of more complex GTK objects, but also to implement more complex display logic - for example, lazily processing derived values only when the cell is actually rendered.
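The single-column idea mentioned above can be modelled without GTK at all. The following is a plain Python stand-in, not Gtk# or PyGObject code: a list of rows plays the ListStore, and Book, title_data_func and on_row_activated are hypothetical names chosen for the sketch.

```python
from dataclasses import dataclass


@dataclass
class Book:  # hypothetical application data object
    title: str
    editable: bool


# Stand-in for a one-column store: each row holds only the data object.
store = [[Book("Dune", True)], [Book("Emma", False)]]


def title_data_func(row):
    """Plays the role of a cell data func: derive renderer properties."""
    book = row[0]
    return {"text": book.title, "editable": book.editable}


def on_row_activated(path):
    """Plays the role of a RowActivated handler: path -> row -> object."""
    return store[path][0]


print(title_data_func(store[0]))
print(on_row_activated(1))
```

With this layout the activated row leads straight back to the original data object, so no lookup by cell value is needed.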