PDFBox getAcroForm() returns null although page annotations hold field values

For one particular PDF file, getAcroForm() returns null.
Inspecting the file with the PDFBox debug options shows that
the pages have Annots entries with proper values inside kid(0);
at the Root the fields are missing, so getAcroForm() returns null.
Is there any way to fetch the field / annotation values that are bound to a page rather than to the document root?

TYPO3 9.5 - how to get rid of the gray header box?

When I use my h1 and h2 styles in a text content element in TYPO3 9.5, they are displayed exactly as I want.
However, when I use the header field of the element, I get this grey box instead of my h1 format.
How can I configure TYPO3 to apply the h1 style there?
If you use fluid styled content (FSC), or a package built on FSC such as bootstrap_package, the templates of your content elements (CE) live in those extensions. Copy the template you want to change into your site extension and add your path to the list of template paths; your modified copy is then used to render that CE.
This is the TypoScript configuration to override the rendering of the extension bootstrap_package:
lib {
  contentElement {
    layoutRootPaths {
      // 0 = EXT:bootstrap_package/Resources/Private/Layouts/ContentElements/
      10 = EXT:my_site_extension/Resources/Private/Layouts/ContentElements/
    }
    partialRootPaths {
      // 0 = EXT:bootstrap_package/Resources/Private/Partials/ContentElements/
      10 = EXT:my_site_extension/Resources/Private/Partials/ContentElements/
    }
    templateRootPaths {
      // 0 = EXT:bootstrap_package/Resources/Private/Templates/ContentElements/
      10 = EXT:my_site_extension/Resources/Private/Templates/ContentElements/
    }
  }
}
The entries with 0 = are set by ext:bootstrap_package (or similarly by ext:fluid_styled_content) and show the paths to the templates that are used without your override.
The entries with 10 = (any number higher than 0 gives preference to your templates) should point to the folders in your site extension (ext:my_site_extension) where you keep your modified copies.
You only need to copy the templates you modify, because the original paths remain the fallback for any file referenced as a template, layout or partial. Keep an eye on the paths, as those files can be referenced with a (relative) path.
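The fallback described above can be pictured as a lookup that walks the numbered root paths from the highest key down and returns the first file that exists. The following is only a sketch of that idea in plain Java, not TYPO3 or Fluid code: the class TemplatePathResolver, the method resolve and the in-memory set of "existing" files are hypothetical names introduced for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Map;
import java.util.NavigableMap;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical illustration of "higher number wins, lower number is the fallback".
class TemplatePathResolver {

    /** Returns the first existing candidate, checking the highest-numbered root path first. */
    static String resolve(NavigableMap<Integer, String> rootPaths, String fileName, Set<String> existingFiles) {
        for (Map.Entry<Integer, String> entry : rootPaths.descendingMap().entrySet()) {
            String candidate = entry.getValue() + fileName;
            if (existingFiles.contains(candidate)) {
                return candidate;
            }
        }
        return null; // no configured root path contains the file
    }

    public static void main(String[] args) {
        NavigableMap<Integer, String> partialRootPaths = new TreeMap<>();
        partialRootPaths.put(0, "EXT:bootstrap_package/Resources/Private/Partials/ContentElements/");
        partialRootPaths.put(10, "EXT:my_site_extension/Resources/Private/Partials/ContentElements/");

        // Assumed file listing: only Header/All.html was copied into the site extension.
        Set<String> existingFiles = new HashSet<>(Arrays.asList(
                "EXT:bootstrap_package/Resources/Private/Partials/ContentElements/Header/All.html",
                "EXT:bootstrap_package/Resources/Private/Partials/ContentElements/Header/Date.html",
                "EXT:my_site_extension/Resources/Private/Partials/ContentElements/Header/All.html"));

        // The modified copy is preferred; the untouched partial falls back to the original path.
        System.out.println(resolve(partialRootPaths, "Header/All.html", existingFiles));
        System.out.println(resolve(partialRootPaths, "Header/Date.html", existingFiles));
    }
}
```

This is why copying a single partial into the site extension is enough: every other file still resolves through the lower-numbered path.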
EDIT:
For FSC, a specific CE is rendered with the template of the same name in the template folder configured in TypoScript (see above).
These files normally call the same layout file (Layouts/Default.html), which renders the header with the partial Header/All plus some additional HTML for spacing and anchors.
The partial Header/All in turn calls further partials that render the fields header, subheader and date, if they are set.
Note the additional arguments passed to these partials (layout, positionClass, link, default), which influence the appearance of the header.
Your unusual appearance may be caused by a special header_layout in your records.
Alternatively, another extension may already have overridden the default templates (partials) to produce those boxed headers instead of the h1-h6 HTML tags used in the FSC extension.

The method getKids() is undefined for the type PDField

https://issues.apache.org/jira/browse/PDFBOX-2148
When there are multiple copies with the same field name, the getFullyQualifiedName for each kid in the list of PDField objects returns the name of the parent, followed by .null. So if the parent field is called Button2 and it has 4 instances the result of printing out all the names will be:
Button2.null
Button2.null
Button2.null
Button2.null
According to the comments to the question, the OP refers to PDFBox 2.0.x versions, in particular 2.0.6.
getKids()
The method getKids() is undefined for the type PDField
In PDFBox 2.0.6 there are two immediate subclasses of PDField, each implementing its own variant of the former (1.8.x) getKids() method:
PDNonTerminalField - the method retrieving the kids in this class is getChildren(); it returns a List<PDField>, a list of form fields.
PDTerminalField - the method retrieving the kids in this class is getWidgets(); it returns a List<PDAnnotationWidget>, a list of widget annotations.
name of the parent, followed by .null
When there are multiple copies with the same field name, the getFullyQualifiedName for each kid in the list of PDField objects returns the name of the parent, followed by .null
This is not the case in PDFBox 2.0.x.
In the sample document attached to the PDFBox issue PDFBOX-2148 PDFBox now correctly finds only a single field which appropriately is named "Button2". This field is a PDTerminalField and has 4 widget annotations. The class of the latter, PDAnnotationWidget, has no getFullyQualifiedName method, so there are no ".null" names.
Thus, this problem is gone.
FQN of duplicate fields
(from the OP's comment responding to "What exactly is your question?")
how to get Fully Qualified Name of duplicate fields in pdfbox
There are no duplicate fields in (valid) PDFs; for a given name there is at most a single field, which may have multiple widgets. Widgets do not have individual FQNs.
Thus, what you call "duplicate fields" in your example document actually is a single field with multiple widgets; the name of that field is "Button2" and can be retrieved using getFullyQualifiedName().
which page which form field
(from the OP's comments to this answer)
but how to get current page no in pdfbox.. for example there are 3 page and in page 2 there is a form field so how can i get which page which form field ?
All PDAnnotation classes, among them PDAnnotationWidget, have a getPage() method returning a PDPage instance.
BUT: As specified in ISO 32000-1, annotations (in particular form field widgets) are not required to have a link to the page on which they are drawn (except for screen annotations associated with rendition actions).
Thus, the above mentioned method getPage() may return null (probably more often than not).
So to determine the respective pages of your widgets, you have to approach the problem the other way around: Iterate over all pages and look for the annotation widgets in the respective annotation array.
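The "other way around" lookup can be sketched independently of PDFBox: walk the pages in order, read each page's annotation list, and record the page number of every widget found. The sketch below uses plain Java collections as stand-ins. The class WidgetPageIndex and the string widget ids are hypothetical; in a real document the inner lists would come from each page's annotation array, and widgets would presumably be matched by their underlying annotation dictionaries rather than by strings.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Stand-in model: one inner list per page, holding ids of the widgets in that page's annotation array.
class WidgetPageIndex {

    /** Maps each widget id to the 1-based number of the first page whose annotation list contains it. */
    static Map<String, Integer> index(List<List<String>> annotationsPerPage) {
        Map<String, Integer> pageOfWidget = new LinkedHashMap<>();
        for (int i = 0; i < annotationsPerPage.size(); i++) {
            for (String widgetId : annotationsPerPage.get(i)) {
                pageOfWidget.putIfAbsent(widgetId, i + 1);
            }
        }
        return pageOfWidget;
    }

    public static void main(String[] args) {
        // Three pages; only page 2 carries a form field widget (ids are made up).
        List<List<String>> pages = Arrays.asList(
                Collections.<String>emptyList(),
                Arrays.asList("Button2#widget0"),
                Collections.<String>emptyList());
        System.out.println(index(pages));
    }
}
```

Building the index once and then looking up each widget avoids rescanning all pages for every field.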
For PDFBox 1.8.x you can find example code in this Stack Overflow answer. With the information given in the previous parts of this answer it should be easy to port the code to PDFBox 2.0.x.
checkbox and radio button
(also from the OP's comments to this answer)
one more issue if i am using checkbox and radio button both then field.getFieldType() output is Btn for both. how to identify it?
You can identify them by inspecting the field flags, which you retrieve via field.getFieldFlags():
If the Pushbutton flag is set (PDButton.FLAG_PUSHBUTTON), the field is a regular push button.
Otherwise, if the Radio flag is set (FLAG_RADIO), the field is a radio button.
Otherwise, the field is a check box.
Alternatively you can check the class of the field object which for Btn may be PDPushButton, PDRadioButton, or PDCheckBox.
Beware: If a check box field has multiple widgets with differently named on states, this check box field and its widgets act like a radio button group! And not only in theory, I've seen PDFs with such check box fields in the wild.
To really be sure concerning the behavior of the fields, you therefore also should compare the names of the on states of all the widgets of a given check box.
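The decision order above, including the caveat about check boxes with differently named on states, can be written down as a small classifier. This is a sketch in plain Java, not PDFBox code: ButtonKind and classify are hypothetical names, the two masks are defined locally from the button field flag bit positions in ISO 32000-1 (Radio is bit 16, Pushbutton is bit 17), and the list of on-state names is assumed to have been collected from the field's widgets beforehand.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;

// Hypothetical helper; the flags value would come from the field's flags, the names from its widgets.
class ButtonKind {
    // ISO 32000-1 numbers flag bits from 1, so bit 16 is 1 << 15 and bit 17 is 1 << 16.
    static final int FLAG_RADIO = 1 << 15;
    static final int FLAG_PUSHBUTTON = 1 << 16;

    static String classify(int fieldFlags, List<String> onStateNames) {
        if ((fieldFlags & FLAG_PUSHBUTTON) != 0) {
            return "push button";
        }
        if ((fieldFlags & FLAG_RADIO) != 0) {
            return "radio button";
        }
        // A check box whose widgets use different on-state names behaves like a radio group.
        if (new HashSet<>(onStateNames).size() > 1) {
            return "check box acting as radio group";
        }
        return "check box";
    }

    public static void main(String[] args) {
        System.out.println(classify(FLAG_PUSHBUTTON, Arrays.<String>asList()));
        System.out.println(classify(FLAG_RADIO, Arrays.asList("A", "B")));
        System.out.println(classify(0, Arrays.asList("Yes", "Yes")));
        System.out.println(classify(0, Arrays.asList("Option1", "Option2")));
    }
}
```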

Word to HTML fields in header and footer

I'm using docx4j to convert a Word template to several HTML files, one per chapter.
The Word template has several custom properties mapped by several fields (DOCPROPERTY ...) represented as both simple and complex fields. I populate those properties to obtain Freemarker code when the word document is converted to HTML (like ${...} or [#... /] directives).
In a later step I look for "heading 1" paragraphs to identify chapters and then split the document into several Word documents before conversion; these documents are then converted to HTML and written to temporary files.
Each document is successfully converted to HTML and the fields are correctly replaced with my markers, but the header and footer parts come out wrong: field codes are written before field values (e.g. DOCPROPERTY "PROPERTY_NAME" \* MERGEFORMAT ${constants['PROPERTY_NAME']} ) instead of field values only (e.g. ${constants['PROPERTY_NAME']} ).
If I write the updated document to a docx file instead, nothing seems wrong in the generated document.
If it's useful to solve the problem, this is what I do to split the document (per chapter):
clone the updated WordprocessingMLPackage (clone method)
delete every root element before the chapter's "heading 1" element
delete every root element from the "heading 1" element of the next chapter
convert the cloned and cleaned document
(actually I don't use the clone method every time, but I write the updated document to a ByteArrayOutputStream and then read it for every chapter, inspired by the source of the clone method).
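For readers who want to see the split itself, the per-chapter trimming described in the list above amounts to cutting the body's root elements at every "heading 1". The sketch below shows only that slicing on a stand-in list; it does not use docx4j. ChapterSplitter, Block and the style/text fields are hypothetical names, and anything before the first heading is dropped, as in the steps above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Stand-in for the document body: each Block represents one root element with its paragraph style.
class ChapterSplitter {

    static final class Block {
        final String style;
        final String text;

        Block(String style, String text) {
            this.style = style;
            this.text = text;
        }
    }

    /** Returns one list per chapter, each starting at a "heading 1" block and ending before the next one. */
    static List<List<Block>> split(List<Block> body) {
        List<Integer> starts = new ArrayList<>();
        for (int i = 0; i < body.size(); i++) {
            if ("heading 1".equals(body.get(i).style)) {
                starts.add(i);
            }
        }
        List<List<Block>> chapters = new ArrayList<>();
        for (int c = 0; c < starts.size(); c++) {
            int from = starts.get(c);
            int to = (c + 1 < starts.size()) ? starts.get(c + 1) : body.size();
            chapters.add(new ArrayList<>(body.subList(from, to)));
        }
        return chapters;
    }

    public static void main(String[] args) {
        List<Block> body = Arrays.asList(
                new Block("Normal", "cover text"),
                new Block("heading 1", "Chapter A"),
                new Block("Normal", "a1"),
                new Block("heading 1", "Chapter B"),
                new Block("Normal", "b1"));
        for (List<Block> chapter : split(body)) {
            System.out.println(chapter.get(0).text + ": " + chapter.size() + " blocks");
        }
    }
}
```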
I suspect a docx4j bug. Has anybody else tried something similar?
Finally these are my platform details:
JDK 1.6
Docx4J v3.2.2
Thanks in advance for any help
EDIT
To produce freemarker markers in place of Word fields, I set document property values as follows:
traverse the document looking for simple or complex fields with new TraversalUtil(wordMLPackage.getMainDocumentPart().getContent(), visitor);, where visitor is my custom callback for looking for fields and set properties
traversing the document I look for
FldChar elements with type BEGIN and parse them using FieldsPreprocessor.canonicalise((P) ((R) fc.getParent()).getParent(), fields); (I don't use the return value of canonicalise) where fc is the found FldChar and fields is an empty ArrayList<FieldRef>; then I extract and parse the field's instrText attribute
CTSimpleField elements and parse them using FldSimpleModel fldSimpleModel = new FldSimpleModel(); fldSimpleModel.build((CTSimpleField) o, null);; then I use fldSimpleModel.getFldArgument() to get the property name
I look for the freemarker code to show in place of the current field and set it as property value using wordMLPackage.getDocPropsCustomPart().setProperty(propertyName, finalValue);
finally I do the same from step 1 for headers and footers as follows:
List<Relationship> rels = wordMLPackage.getMainDocumentPart().getRelationshipsPart().getRelationships().getRelationship();
for (Relationship rel : rels) {
    Part p = wordMLPackage.getMainDocumentPart().getRelationshipsPart().getPart(rel);
    if (p == null) {
        continue;
    }
    if (p instanceof ContentAccessor) {
        new TraversalUtil(((ContentAccessor) p).getContent(), visitor);
    }
}
Finally I update fields as follows
FieldUpdater updater = new FieldUpdater(wordMLPackage);
try {
    updater.update(true);
} catch (Docx4JException ex) {
    Logger.getLogger(WorkerDocx4J.class.getName()).log(Level.SEVERE, null, ex);
}
After filling all field properties, I clone the document as previously described and convert filtered cloned instances using
HTMLSettings settings = Docx4J.createHTMLSettings();
settings.setWmlPackage(wordDoc);
settings.setImageHandler(new InlineImageHandler(myDataModel));
Docx4jProperties.setProperty("docx4j.Convert.Out.HTML.OutputMethodXML", true);
ByteArrayOutputStream os = new ByteArrayOutputStream();
os.write("[#ftl]\r\n".getBytes("UTF-8"));
Docx4J.toHTML(settings, os, Docx4J.FLAG_EXPORT_PREFER_XSL);
String template = new String(os.toByteArray(), "UTF-8");
The template variable then holds the resulting Freemarker template.
The following XML is the content of footer1.xml part of the document generated after updating the document properties as described: footer1.xml after field updates
The very strange thing (in my opinion) is that if some properties are not found, step 5 throws an exception (as expected), field updating stops at the offending field (as expected) and all fields in the header and footer are rendered correctly. In this case, this is the content of footer1.xml.
In the latter case the fields are defined in a different way. I think the HTML converter handles the latter case well and does something wrong in the former.
Is there something I am doing wrong, or something I could do better?

How can I focus on a specific item at the bottom of the page in Selenium IDE

I am trying to select a specific item at the bottom of a page. I want to verify that the element is present and, at the same time, move the focus to that item.
How can I do this in the Selenium IDE?
I tried storeEval, but it relies on specific coordinates, which I don't want; I am looking for a dynamic command. I also tried css:.groupTile:contains("Concentrated"), but the focus does not move to that particular item (Concentrated).
Can someone help me with the Command, Target and Value, please?
CSS Selectors have many formats
i) Using id. Put this in Target: css=tag#id
tag = the HTML tag of the element being accessed,
id = the ID of the element being accessed
ii) Using class. Put this in Target: css=tag.class
tag = the HTML tag of the element being accessed,
class = the class of the element being accessed
In Value, enter the name of the item.

How to generate a report for particular XHTML tag/attributes?

I want to check all <img> elements on a whole site for alt text. I want a report that shows, for every image used on every page of the site, whether alt is defined and what its text is.
Is it possible to get a report like this? Once I have the report, I will add the missing alt attributes and write a description wherever alt is present but blank.
Otherwise, on a big site it will take a huge amount of time to go and check each page.
The site is on an intranet and accessible with a username and password.
This isn't a direct answer, but since it seems like your motivation here is just to know which img elements don't have alt attributes, I wanted to add that not all img elements need alt attributes.
The HTML5 spec mentions which img elements should have alt attributes:
What an img element represents depends on the src attribute and the alt attribute.
If the src attribute is set and the alt attribute is set to the empty string
The image is either decorative or supplemental to the rest of the content, redundant with some other information in the document.
If the image is available and the user agent is configured to display that image, then the element represents the image specified by the src attribute.
Otherwise, the element represents nothing, and may be omitted completely from the rendering. User agents may provide the user with a notification that an image is present but has been omitted from the rendering.
If the src attribute is set and the alt attribute is set to a value that isn't empty
The image is a key part of the content; the alt attribute gives a textual equivalent or replacement for the image.
If the image is available and the user agent is configured to display that image, then the element represents the image specified by the src attribute.
Otherwise, the element represents the text given by the alt attribute. User agents may provide the user with a notification that an image is present but has been omitted from the rendering.
If the src attribute is set and the alt attribute is not
The image might be a key part of the content, and there is no textual equivalent of the image available.
Note: In a conforming document, the absence of the alt attribute indicates that the image is a key part of the content but that a textual replacement for the image was not available when the image was generated.
If the image is available, the element represents the image specified by the src attribute.
If the src attribute is not set and either the alt attribute is set to the empty string or the alt attribute is not set at all
The element represents nothing.
Otherwise
The element represents the text given by the alt attribute.
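If you wanted to do this with Selenium, it would be something like this:
// Collect src -> alt for every <img> on the current page (Selenium RC C# client).
Dictionary<string, string> details = new Dictionary<string, string>();
int imgCount = (int) selenium.GetXpathCount("//img");
// XPath positions are 1-based; an attribute is addressed as locator@attribute.
// Note: GetAttribute may throw for an image without any alt attribute; catch that to report it.
for (int i = 1; i <= imgCount; i++)
{
    string locator = "xpath=(//img)[" + i + "]";
    // The indexer overwrites instead of throwing when the same src occurs twice.
    details[selenium.GetAttribute(locator + "@src")] = selenium.GetAttribute(locator + "@alt");
}
foreach (KeyValuePair<string, string> kvp in details)
{
    Console.WriteLine("key " + kvp.Key);
    Console.WriteLine("Value " + kvp.Value);
}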
That will print the src of the image and its ALT text.
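If the page source is already available as a string (for example, saved after logging in to the intranet), the same report can be sketched without a browser. The following is a rough illustration in plain Java, not a robust HTML parser: the class AltReport is a hypothetical name, the regular expressions assume double-quoted attribute values, and the three labels simply separate "no alt attribute", "empty alt" and "alt with text", which are the cases the spec excerpt above distinguishes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: scans an HTML string and reports the alt status of every <img> tag.
class AltReport {
    private static final Pattern IMG = Pattern.compile("<img\\b[^>]*>", Pattern.CASE_INSENSITIVE);
    // Assumption: attribute values are double-quoted; \s keeps data-src / data-alt from matching.
    private static final Pattern SRC = Pattern.compile("\\ssrc\\s*=\\s*\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);
    private static final Pattern ALT = Pattern.compile("\\salt\\s*=\\s*\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    static List<String> report(String html) {
        List<String> lines = new ArrayList<>();
        Matcher img = IMG.matcher(html);
        while (img.find()) {
            String tag = img.group();
            Matcher src = SRC.matcher(tag);
            Matcher alt = ALT.matcher(tag);
            String srcValue = src.find() ? src.group(1) : "(no src)";
            String status;
            if (!alt.find()) {
                status = "ALT MISSING";
            } else if (alt.group(1).trim().isEmpty()) {
                status = "ALT EMPTY";
            } else {
                status = alt.group(1).trim();
            }
            lines.add(srcValue + " : " + status);
        }
        return lines;
    }

    public static void main(String[] args) {
        String html = "<p><img src=\"logo.png\" alt=\"Company logo\">"
                + "<img src=\"spacer.gif\" alt=\"\"><img src=\"chart.png\"></p>";
        for (String line : report(html)) {
            System.out.println(line);
        }
    }
}
```

Keeping "missing" and "empty" apart matters here because, per the quoted spec text, an empty alt marks a decorative image while a missing alt does not.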
Using TestPlan I came up with this quick script:
GotoURL http://stackoverflow.com/questions/2570421/how-to-generate-a-report-for-particular-xhtml-tag-attributes
foreach %Image% in (response //img)
  set %src% as combineCurrentURL (selectIn %Image% #src)
  set %alt% as trim (selectIn %Image% #alt)
  if numComp 0 = (length %alt%)
    Notice %src% ALT IS EMPTY
  else
    Notice %src% : %alt%
  end
end
The output looks like below (a CSV report can also be generated if desired)
00000001-00 NOTICE http://sstatic.net/so/img/logo.png : Stack Overflow
00000002-00 NOTICE http://ads.stackoverflow.com/ads/ladywhobig.jpg ALT IS EMPTY
00000003-00 NOTICE http://sstatic.net/so/img/vote-arrow-up.png : vote up
00000004-00 NOTICE http://sstatic.net/so/img/vote-arrow-down.png : vote down
This works in both the HTMLUnit and Selenium backend to TestPlan.