Using PDFBox, how can I extract a PDF's tab order from each field? - pdfbox

I am trying to convert a PDF to HTML. The PDF I have has a tab order configured on several fields. Using PDFBox, how can I extract the tab order value set on each field so that I can then set the tab index within the HTML?
I've tried iterating over each field (PDField) via PDAcroForm.getFields(), but that gives me the fields in random order. Then I thought that maybe I could extract tab order information from the field itself, but PDField does not hold any tab order information.
Any other ideas?

As already mentioned in a comment: beware, you iterate over the fields in the abstract form definition in the PDF. Each of these fields may have appearances (widget annotations) on any number of pages. Thus, a field does not have a tab order value; its annotations do. These order values can be derived from the order in the annotation collections of the document pages or determined by their position on the page, depending on page settings.
The page setting in question is the Tabs entry of the respective page object:
Key: Tabs
Type: name
Value: (Optional; PDF 1.5) A name specifying the tab order that shall be used for annotations on the page. The possible values shall be R (row order), C (column order), and S (structure order). See 12.5, "Annotations" for details.
(ISO 32000-1, Table 30 – Entries in a page object)
In ISO 32000-2 (now Table 31) the following has been added to the value description:
Beginning with PDF 2.0, additional values also include A (annotations array order) and W (widget order). Annotations array order refers to the order of the annotation enumerated in the Annots entry of the Page dictionary (see "Table 31 — Entries in a page object"). Widget order means using the same array ordering but making two passes, the first only picking the widget annotations and the second picking all other annotations.
Interestingly no default is defined for this optional entry, so it appears to be implementation dependent.
For the example document accompanying your parallel PDFBox Jira issue, the page has a Tabs value of W, so the widgets-first annotation array order applies on that page.
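As a rough illustration of the two PDF 2.0 orderings quoted above, here is a minimal sketch in plain Java. The Annot class and the tabOrder method are hypothetical stand-ins and not PDFBox API; it assumes you have already read the page's annotations in Annots array order, together with the page's Tabs value. The values R, C and S are not covered.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a page annotation; not a PDFBox class.
final class Annot {
    final String name;     // e.g. the field name, for illustration only
    final String subtype;  // value of the annotation's Subtype entry, e.g. "Widget"

    Annot(String name, String subtype) {
        this.name = name;
        this.subtype = subtype;
    }
}

final class TabOrderSketch {
    // Returns the annotations in tab order for the Tabs values "A" and "W".
    // annots must be given in the order of the page's Annots array.
    static List<Annot> tabOrder(List<Annot> annots, String tabs) {
        if ("A".equals(tabs)) {
            return new ArrayList<>(annots);          // annotations array order
        }
        if ("W".equals(tabs)) {
            List<Annot> ordered = new ArrayList<>();
            for (Annot a : annots) {                 // first pass: widgets only
                if ("Widget".equals(a.subtype)) ordered.add(a);
            }
            for (Annot a : annots) {                 // second pass: all other annotations
                if (!"Widget".equals(a.subtype)) ordered.add(a);
            }
            return ordered;
        }
        // R, C and S depend on page geometry or the structure tree; not handled here.
        throw new IllegalArgumentException("Unsupported Tabs value: " + tabs);
    }
}
```

The position of a widget in the returned list could then be used as the tabindex of the corresponding HTML element.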

Related

Get annotation from a pdf to add to another document

I am using iTextSharp version 5.0.
For my project, I need to copy my PDF document into another PDF document using PdfWriter. I can't use PdfCopy or PdfStamper.
So all the annotations get lost during this operation.
To begin, I tried to find out how to get the annotations created with the "pencil comment drawing markup" tool of the Adobe Reader UI.
For my tests, I am using this PDF document with a drawing markup I added myself: https://easyupload.io/3c6i1g
I found how to get the annotation dictionary:
Dim pdfReader As New PdfReader(pdfPath)
Dim page As PdfDictionary = pdfReader.GetPageN(0)
Dim annots As PdfArray = page.GetAsArray(PdfName.ANNOTS)
If annots IsNot Nothing Then
    For i = 0 To annots.Size - 1
        Dim annotDict As PdfDictionary = annots.GetAsDict(i)
        Dim annotContents As PdfString = annotDict.GetAsString(PdfName.CONTENT)
        Dim annotSubtype As PdfString = annotDict.GetAsString(PdfName.SUBTYPE)
        Dim annotName As PdfString = annotDict.GetAsString(PdfName.T)
    Next
End If
When the loop reaches my comment, the annotName variable returns my name, so I am sure I am parsing the annotation I am looking for, but annotSubtype is Nothing. How is that possible? According to the PDF specification, section 12.5.2, table 1666 (https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf), the Subtype entry is required, so shouldn't it be something other than Nothing?
Also, how can I get the image related to this annotation? I thought it would be stored in the content of the annotation dictionary but this is also returning nothing in the code above...
About why I can't use PdfStamper in the first place: one of the pages of my PDF document must be resized (downscaled) in order to add some text at the bottom of the page, so I must use PdfWriter for that.
Question: How can I get the drawn line of a comment annotation with iTextSharp 5.0?
There are a lot of single questions in your post...
When the loop reaches my comment, the annotName variable returns my name, so I am sure I am parsing the annotation I am looking for, but annotSubtype is Nothing. How is that possible?
According to the PDF specification, section 12.5.2, table 1666 (https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf), the Subtype entry is required, so shouldn't it be something other than Nothing?
According to table 164 in section 12.5.2 of ISO 32000-1, the Subtype entry is indeed required, but it is also specified to be a name, while you try to retrieve a string instead:
Dim annotSubtype As PdfString = annotDict.GetAsString(PdfName.SUBTYPE)
As the Subtype entry of that annotation in your PDF correctly is a name, GetAsString returns Nothing.
Thus, call GetAsName instead and expect a PdfName return type.
Also, how can I get the image related to this annotation? I thought it would be stored in the content of the annotation dictionary but this is also returning nothing in the code above...
The Contents entry is specified in the same table as above to be optional and (if present) to have a text string value containing "Text that shall be displayed for the annotation or, if this type of annotation does not display text, an alternate description of the annotation’s contents in human-readable form." As the annotation merely is a scribble, what should the annotation have as Contents value?
As your annotation actually is an Ink annotation, you can find the representation of the scribble in the required InkList and optional BS entries of the annotation, see table 182 of section 12.5.6.13 of ISO 32000-1.
The value of InkList is "An array of n arrays, each representing a stroked path. Each array shall be a series of alternating horizontal and vertical coordinates in default user space, specifying points along the path. When drawn, the points shall be connected by straight lines or curves in an implementation-dependent way."
The value of BS (if present) is "A border style dictionary (see Table 166) specifying the line width and dash pattern that shall be used in drawing the paths."
Beware, though: The annotation dictionary’s AP entry, if present, takes precedence over the InkList and BS entries. And in your PDF the annotation has an appearance entry. So the actually displayed content is that of the Normal appearance stream which contains vector graphics instructions drawing your scribble.
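To make the InkList layout described above concrete, here is a minimal sketch in plain Java that turns one inner array of alternating horizontal and vertical coordinates into a list of points. The class and method names are hypothetical, and reading the numbers out of the annotation dictionary with iText is assumed to have happened already.

```java
import java.util.ArrayList;
import java.util.List;

final class InkListSketch {
    // Converts one InkList entry [x0 y0 x1 y1 ...] into a list of {x, y} points.
    static List<double[]> toPoints(double[] path) {
        if (path.length % 2 != 0) {
            throw new IllegalArgumentException("Expected alternating x/y coordinates");
        }
        List<double[]> points = new ArrayList<>();
        for (int i = 0; i < path.length; i += 2) {
            points.add(new double[] { path[i], path[i + 1] });
        }
        return points;
    }

    // Converts the whole InkList (an array of paths) into a list of point lists.
    static List<List<double[]>> toPaths(double[][] inkList) {
        List<List<double[]>> paths = new ArrayList<>();
        for (double[] path : inkList) {
            paths.add(toPoints(path));
        }
        return paths;
    }
}
```

Each resulting point list is one stroked path; how the points are joined (straight lines or curves) is left to the viewer, as the quoted description says.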
About why I can't use PdfStamper in the first place: one of the pages of my PDF document must be resized (downscaled) in order to add some text at the bottom of the page, so I must use PdfWriter for that.
First of all, this only means that you have to do something special to that special page, there is no need to damage all pages by copying them with a PdfWriter. You could manipulate that single page in a separate document, then use PdfCopy to copy the pages before that page from the original PDF, then that page from the separate PDF, and then all pages after that page from the original again.
Thus, you'd only have to fix the annotations of that special page, the annotations on the other pages could remain untouched.
Furthermore, you can even use the PdfStamper if you are ready to use low-level iText routines. In particular, before stamping you can apply the static PdfReader method GetPageContent to the page dictionary of the special page to retrieve the page content as a byte array, build a new byte array from it in which you prepend an affine transformation which does the downscaling, and set the new byte array as the content of the page in question using the SetPageContent method of the underlying PdfReader.
Even in this scenario, though, you'd have to adjust the annotation coordinates (both of their rectangles and of other coordinates like the InkList in your case)...
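The coordinate adjustment mentioned above can be sketched in plain Java as follows. This assumes a uniform downscaling by a factor s plus a translation (tx, ty), i.e. the same transformation one would prepend to the page content as "s 0 0 s tx ty cm". The class and method names are hypothetical; writing the adjusted numbers back into the annotation dictionary is not shown.

```java
final class ScaleAnnotationSketch {
    // Applies x' = s*x + tx and y' = s*y + ty to an array of alternating
    // x/y coordinates. This fits both a Rect [llx lly urx ury] and one
    // InkList path [x0 y0 x1 y1 ...].
    static double[] transform(double[] coords, double s, double tx, double ty) {
        double[] out = new double[coords.length];
        for (int i = 0; i < coords.length; i++) {
            boolean isX = (i % 2 == 0);   // even index: horizontal, odd index: vertical
            out[i] = s * coords[i] + (isX ? tx : ty);
        }
        return out;
    }
}
```

For example, scaling a page by 0.5 and shifting it up by 50 units to make room for text at the bottom maps the rectangle [100 200 300 400] to [50 150 150 250].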
Question: How can I get the drawn line of a comment annotation with iTextSharp 5.0?
See above, the annotation of the scribble is an Ink annotation and the drawn path is specified in the InkList and BS entries of its dictionary and additionally instantiated in its normal appearance stream.

Lucene.vectors: how to set the label field

I am trying to use Mahout's lucene.vectors to pull data from a Lucene index. The index contains web page content crawled by Nutch. Some of the indexed fields are: title, url, id, text and category.
I know I can use lucene.vectors to fetch the data from the index and convert it to vectors. However, what I could not understand is how to tell this tool which field in Lucene contains the label. For my scenario, the category field is the label field.
I am using Mahout 0.9.
Thanks in advance,
Ameer
You might need an intermediate step to first convert the Lucene index into a sequence file of key/value pairs in which the key represents your label. SequenceFilesFromLuceneStorage.java allows you to do this. Its description says the following:
/**
 * Generates a sequence file from a Lucene index with a specified id field as the key and a content field as the value.
 * Configure this class with a {@link LuceneStorageConfiguration} bean.
 */
I believe lucene.vector simply places all the text into a vector (reference: https://mahout.apache.org/users/basics/creating-vectors-from-text.html). You need a sequence file of the format <Text, VectorWritable> in order to have a vector along with a label.
Then you can simply read through the sequence file and obtain the vector and label. If you want to calculate TF-IDF, you can use seq2sparse or SparseVectorsFromSequenceFiles.java.
Alternatively you can also do this manually by extracting the label first and sending the rest to lucene.vector.

PDFBox: is the reading order guaranteed with PDFTextStripper's processTextPosition?

I am using PDFTextStripper (PDFBox 1.8.2) to process every TextPosition in a PDF file. I have tested with a lot of files and noticed that it processes text in reading order. However, this does not hold if a PDF has footers (for example, a docx which I exported as PDF): PDFTextStripper processes the footer first and then the body of the file.
Is this expected behavior? Is there a way I can specify the order? Or is there any way I can identify that it's a footer so that I can make the adjustment in my code?
PDFTextStripper has an attribute SortByPosition (getSortByPosition & setSortByPosition). It's false by default.
If this attribute is false, the PDFTextStripper essentially extracts the text in the order in which it appears in the PDF page content stream.
This order can be totally mangled (because in the content stream you use operators which can position the next printed text anywhere on the page) but often text sections belonging together are kept together (because the operations required for such sections often are inserted in that stream as a block).
Headers and footers, though, often are added at the same time and, therefore, appear together before or after the main body text.
If this attribute is true, though, the PDFTextStripper essentially extracts the text from top to bottom, from left to right (unless the reading order is defined to be right to left). (OK, it also respects article beads, but you can hardly count on them being used in general.)
This order is good for one-column text, with headers coming first and footers last, but unless proper article beads are used, multi-column pages get mangled.
BTW, you can switch off the use of article beads using the attribute ShouldSeparateByBeads (getSeparateByBeads & setShouldSeparateByBeads).
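To illustrate the difference between content stream order and the top-to-bottom, left-to-right order described above, here is a simplified sketch in plain Java. It is not PDFBox's actual sorting algorithm (which also has to deal with overlapping lines, text direction and beads); the Fragment class is a hypothetical stand-in for a piece of text with a position.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for a positioned piece of text; not a PDFBox class.
final class Fragment {
    final String text;
    final double x;         // distance from the left page edge
    final double yFromTop;  // distance from the top page edge

    Fragment(String text, double x, double yFromTop) {
        this.text = text;
        this.x = x;
        this.yFromTop = yFromTop;
    }
}

final class SortByPositionSketch {
    // Returns a copy of the fragments ordered top to bottom, then left to right.
    static List<Fragment> sorted(List<Fragment> inStreamOrder) {
        List<Fragment> result = new ArrayList<>(inStreamOrder);
        result.sort(Comparator.comparingDouble((Fragment f) -> f.yFromTop)
                .thenComparingDouble(f -> f.x));
        return result;
    }
}
```

With a footer that comes first in the content stream, the unsorted list starts with the footer, while the sorted copy puts the body text first and the footer last.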

How to arrange PDF layers (optional content groups) in a tree structure

I recently posted a question about how to use optional content groups in PDF. Now I have a new question: how can I arrange these optional content groups in a tree structure?
For example, I have 4 different layers, all of them OCG layers. 3 layers contain text labels and 1 layer contains vector graphics. I want them shown like this:
Alllayers
---labels
--layer1
--layer2
--layer3
---layer4
I use a PDF document as an example.
It is in Chinese; the Chinese characters are the layer names, with the same meaning as above.
The answer to this question will depend on which pdf library you are using to generate your files. In general, you will need to produce a file that has an Order array in the optional content configuration dictionary that represents the tree that you want to display.
From the PDF Reference Document:
Key: Order
Type: array
Description: (Optional) An array specifying the order for presentation of optional content groups in a conforming reader’s user interface. The array elements may include the following objects:
- Optional content group dictionaries, whose Name entry shall be displayed in the user interface by the conforming reader.
- Arrays of optional content groups which may be displayed by a conforming reader in a tree or outline structure. Each nested array may optionally have as its first element a text string to be used as a non-selectable label in a conforming reader’s user interface.
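As an illustration of how the tree from the question might map onto such an Order array, here is a small sketch in plain Java that serializes a nested structure to PDF array syntax. It reflects my reading of the description above, assumes "Alllayers" and "labels" are plain labels and that layer4 is a sibling of "labels", and uses made-up object numbers (11 to 14) for the four OCG dictionaries. It does not use any PDF library.

```java
import java.util.Arrays;
import java.util.List;

final class OrderArraySketch {
    // Serializes a nested structure to PDF array syntax:
    //   String  -> text string used as a label
    //   Integer -> indirect reference to an OCG dictionary (hypothetical object number)
    //   List    -> nested array
    // Labels containing parentheses or backslashes would need escaping; not handled here.
    static String toPdf(Object node) {
        if (node instanceof String) {
            return "(" + node + ")";
        }
        if (node instanceof Integer) {
            return node + " 0 R";
        }
        if (node instanceof List) {
            StringBuilder sb = new StringBuilder("[");
            for (Object child : (List<?>) node) {
                if (sb.length() > 1) sb.append(' ');
                sb.append(toPdf(child));
            }
            return sb.append(']').toString();
        }
        throw new IllegalArgumentException("Unsupported node: " + node);
    }

    // The tree from the question: Alllayers > (labels > layer1..3), layer4.
    static String exampleOrder() {
        Object labels = Arrays.<Object>asList("labels", 11, 12, 13);
        Object all = Arrays.<Object>asList("Alllayers", labels, 14);
        return toPdf(Arrays.<Object>asList(all));
    }
}
```

Here exampleOrder() yields [[(Alllayers) [(labels) 11 0 R 12 0 R 13 0 R] 14 0 R]], which would be the value of the Order entry; how to set it depends on the PDF library in use.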

Displaying/Formatting Tabular Data (web)

In my example I have a table where each row is a user. Columns could include their name, address, email address, etc. I now need to add a column for (hypothetical example) their cats' names. While most people will have no cats and some people will have 1-2 cats, there will be the occasional person with 20 cats, which creates one very long row in the table. This causes problems for presentation and for filtering/searching for cat names. Is there a good solution for displaying this type of data?
Have the first 50 (or whatever) characters of the field displayed as normal then put the rest in a block with its visibility set to hidden through CSS. Include a link / button / icon that will allow the user to toggle the visibility so they can see the entire value.
Several options:
1. Set a maximum width for the cell and allow the data to wrap.
2. Place the content inside a wrapper tag (such as a div) and give the div a fixed width/height and a style of overflow:hidden to ensure that a particularly long word doesn't force out the width of the cell.
3. Truncate the output text on the server side.
For cases #2 and #3, set the title attribute of the TD tag to contain the full non-truncated text. This will present itself as a tooltip when hovering over the cell.
I would mention other CSS-based solutions but they're very sparsely supported right now, so not worth mentioning.
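A minimal sketch of the server-side truncation option combined with the title attribute, written in plain Java. The class and method names are hypothetical, the escaping helper covers only the basic HTML special characters, and cutting at a fixed character count is an assumption (it may split a name in the middle).

```java
final class TruncatedCellSketch {
    // Escapes the characters that are unsafe in HTML text and attribute values.
    static String escapeHtml(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    // Renders a table cell; long values are cut and the full text goes into title.
    static String cell(String full, int maxChars) {
        if (full.length() <= maxChars) {
            return "<td>" + escapeHtml(full) + "</td>";
        }
        String shown = full.substring(0, maxChars) + "...";
        return "<td title=\"" + escapeHtml(full) + "\">" + escapeHtml(shown) + "</td>";
    }
}
```

Searching and filtering would still have to run against the full value on the server, since only the shortened text is visible in the cell.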
You might want to try doing something like what SO does. Namely, once someone's rep reaches a certain point, it abbreviates the number with a suffix and approximates it, e.g. 10k instead of 10,236.
That way the numbers don't get out of hand.
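A small sketch of that kind of abbreviation in plain Java. The thresholds and the choice to cut off rather than round are assumptions for illustration, not a description of what SO actually does.

```java
final class AbbreviateSketch {
    // Abbreviates large numbers with a suffix; digits below the unit are cut off.
    static String abbreviate(long n) {
        if (n < 1_000) {
            return Long.toString(n);        // small values are shown in full
        }
        if (n < 1_000_000) {
            return (n / 1_000) + "k";       // thousands
        }
        return (n / 1_000_000) + "M";       // millions and above
    }
}
```

With this sketch, abbreviate(10236) gives "10k", matching the example above.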