Apache FOP - Extra content at the end - pdf

I have a simple task that has turned out to be quite complex to implement with Apache FOP.
I have already created a layout, so I have nice first-page, only-page, rest-page, and last-page definitions with well-distributed content. Now I sometimes need to add extra content to the end of the document (such as terms and conditions or agreement conditions), which can run to several pages. That content shouldn't have any header, footer, page number, and so on: just a text flow with paragraphs.
Thank you.
Kind regards.

As I am new to FOP, I had not fully understood its capabilities when I posted this question.
I have since solved the problem by defining a new page-master and a new page-sequence for the extra content.
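For anyone landing here, the fix might look roughly like the fragment below. It is only a sketch: the master name "terms-page", the page dimensions, and the sample paragraphs are made up for illustration, and in a real stylesheet the new simple-page-master would sit inside the existing layout-master-set, with the new page-sequence placed after the existing one. The Python wrapper just parses the fragment to check that it is well-formed and that the page-sequence refers to a master that is actually defined.

```python
import xml.etree.ElementTree as ET

FO_NS = "http://www.w3.org/1999/XSL/Format"

# Hypothetical fragment: "terms-page" and the sample text are illustrative only.
FO_DOCUMENT = f"""<fo:root xmlns:fo="{FO_NS}">
  <fo:layout-master-set>
    <!-- Plain page for the extra content: only a body region, so no header or footer -->
    <fo:simple-page-master master-name="terms-page"
        page-height="29.7cm" page-width="21cm" margin="2cm">
      <fo:region-body/>
    </fo:simple-page-master>
  </fo:layout-master-set>
  <!-- Separate page-sequence with no fo:static-content, so no page numbers either -->
  <fo:page-sequence master-reference="terms-page">
    <fo:flow flow-name="xsl-region-body">
      <fo:block space-after="6pt">Terms and conditions, first paragraph.</fo:block>
      <fo:block space-after="6pt">Terms and conditions, second paragraph.</fo:block>
    </fo:flow>
  </fo:page-sequence>
</fo:root>"""


def check_master_references(fo_text):
    """Return True if every page-sequence refers to a master defined in the document."""
    root = ET.fromstring(fo_text)  # raises ParseError if the XML is not well-formed
    ns = {"fo": FO_NS}
    masters = {m.get("master-name") for m in root.findall("fo:layout-master-set/*", ns)}
    used = [s.get("master-reference") for s in root.findall("fo:page-sequence", ns)]
    return all(ref in masters for ref in used)


print(check_master_references(FO_DOCUMENT))
```

Because the new page-sequence declares no static content and its page master has only a body region, the extra pages carry nothing but the text flow.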

Related

Apache Tika extract only field names from PDF XFA forms but not the text content

I have tried to use Apache Tika v1.14 to parse text contained in PDF XFA forms. However, after trying different configurations in PDFParserConfig, I can only get the field names but not the text content of those fields.
For example, if there's a field called "Telephone", part of Tika's output may be <li fieldName="Telephone">Telephone: </li> (the same repeats for other fields as well). However, if I use pdfbox API to traverse the DOM tree to reach the Node named "Telephone", then I can use getNodeValue() to obtain the text content I want.
I am aware of the settings setExtractAcroFormContent() and setIfXFAExtractOnlyXFA() in PDFParserConfig and also experimented with them but I still didn't get the text content.
So my questions are
Did I misconfigure Tika so that it does not give the right output? Or,
Is this what Apache Tika's implementation is intended to do? Or,
Is the implementation still under development?
I am sorry that I cannot attach the forms as an example, because they contain patients' medical information.
Thank you very much.
p.s. I am also aware of Tika's Jira issues https://issues.apache.org/jira/browse/TIKA-973 and https://issues.apache.org/jira/browse/TIKA-1857 so I thought this feature has been implemented.

How do I include a picture in a page element in Inquisit?

I'd like to include a picture in a page element in an Inquisit script: is this possible?
If so, how would you do it?
I know this question was asked 8 years ago...but I recently had the same question. So I thought maybe I could put something here in case someone in the future would have a similar question.
You cannot add a picture to the page element in Inquisit 5, but it would be possible in Inquisit 6. For Inquisit 5, you'll have to use a different element instead.
Here's some discussion on this: https://forums.millisecond.com/Topic34836.aspx
There is some discussion here.
In general, it seems that the page element only allows for simple text based instructions.
If you want to present images in instructions, there are a few options.
htmlpage element
You can use the htmlpage element, which allows for instructions to be a complete formatted HTML file that can include images.
The htmlpage element is used to define pages of text to be displayed
as instructions using the preinstructions or postinstructions
attribute. The htmlpage element is useful when complete control over
formatting and content of instruction pages is required, otherwise the
page element provides an easier way to display text with basic
formatting. The actual content of the page is contained in a separate
HTML file located on the local machine or the web.
Picture or picture and text in a normal trial
The other option is to present instructions as normal stimuli in the main trials of a block.
See, for example, the instructions in the sample script for the Iowa Gambling Task.
This can be done either as one integrated picture that includes the text, or with each image positioned as its own stimulus.

Combining a page map with a PDF so that annotations move with pages in updated PDFs

Has anyone managed to provide an end user with an updated PDF in a way that lets the user transfer their local annotations to the new PDF and keeps the annotations on the correct pages, even when pages have been inserted into the updated PDF at a point earlier than the annotations?
I thought there might be a page map or page guid approach that someone has used.
Sorry - I hope that is clear.
Instead of using the page index as ID of the page, you can use the page content stream instead (after decoding/decompression). Most PDF libraries will give you access to that, so you could compute an MD5 hash from the page content and search for that instead on your "updated" file in order to know where to transfer your annotations.
This is assuming that the page content will be indeed identical, which is not a common scenario.
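A minimal sketch of that matching step, assuming the decoded page content streams have already been extracted as bytes by whatever PDF library is in use (the extraction itself is not shown, and the function names are made up for illustration):

```python
import hashlib


def page_digest(content: bytes) -> str:
    """MD5 hex digest of one decoded page content stream."""
    return hashlib.md5(content).hexdigest()


def map_pages(old_pages, new_pages):
    """Return {old_index: new_index} for pages whose content is byte-identical."""
    new_index_by_digest = {}
    for i, content in enumerate(new_pages):
        # If two new pages share identical content, keep the first one
        new_index_by_digest.setdefault(page_digest(content), i)

    mapping = {}
    for i, content in enumerate(old_pages):
        j = new_index_by_digest.get(page_digest(content))
        if j is not None:
            mapping[i] = j
    return mapping


# Stand-in byte strings instead of real content streams
old = [b"page A", b"page B", b"page C"]
new = [b"page A", b"inserted page", b"page B", b"page C (edited)"]
print(map_pages(old, new))  # {0: 0, 1: 2}
```

Old pages that do not appear in the mapping (page C in this example, because its content changed) have no exact match, so their annotations would need some fallback such as manual placement.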

Get page numbers of searchresult of a pdf in solr

I'm building a web application where users can search for PDF documents and view them with pdf.js. I would like to display each search result with a short snippet of the paragraph where the search term was found and a link that opens the document at the right page.
So what I need is the page number and a short text snippet of every search result.
I'm using SOLR 4.1 to index pdf documents. The indexing itself works fine but I don't know how to get the page number and paragraph of a search result.
I found "Indexing PDF with page numbers with Solr", but it wasn't really helpful.
I'm now splitting the PDF and sending each page separately to SOLR.
So every page is its own document with an id <id_of_document>_<page_number> and an additional field doc_id, which contains only the <id_of_document>, for grouping the results.
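The per-page documents could be built roughly like this. It is a sketch only: the id and doc_id fields follow the scheme above, while the "page" and "content" field names and the build_page_docs helper are assumptions, and the page texts are taken as already extracted.

```python
def build_page_docs(document_id, page_texts):
    """Build one Solr document (as a dict) per PDF page."""
    docs = []
    for page_number, text in enumerate(page_texts, start=1):
        docs.append({
            "id": f"{document_id}_{page_number}",  # <id_of_document>_<page_number>
            "doc_id": document_id,                 # shared by all pages, used for grouping
            "page": page_number,                   # assumed field name
            "content": text,                       # assumed field name
        })
    return docs


docs = build_page_docs("manual42", ["Intro text", "Install steps", "Troubleshooting"])
for doc in docs:
    print(doc["id"], doc["doc_id"], doc["page"])
```

At query time, Solr's result grouping parameters (group=true with group.field=doc_id) can collapse the page hits back into one entry per PDF.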
There is a JIRA issue, SOLR-380, with a patch that you can take a look at.
I also tried getting the results with page number but could not do it. I used Apache PDFBox for splitting all the PDFs present in a directory and sending the files to Solr server.
I have not tried it myself.
Approach:
Write a custom Solr connector that integrates with the Apache Tika parser for indexing PDFs.
Create multiple attributes in Solr such as page1, page2, page3, …, pageN; alternatively, use dynamic attributes in Solr.
In the custom connector, read the PDFs page by page and index each page into its respective page attribute/dynamic attribute.
Enable search on all the "page" attributes.
When a user searches, use the "highlighter/Summary/Teaser" component to retrieve only the "page" attributes that have hits.
The "page" attributes that have a hit (as reported by the highlighter/Summary/Teaser) for a given record are the pages that contain the searched phrase.
Link to the PDF with the "#PageNumber" of the page and pop up that page on click.
This is a far better approach than splitting the PDFs and indexing the pages as separate Solr docs.
If you find a flaw in this design, respond to my thread. I will attempt to resolve it.
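The step of turning highlighter output into page numbers might be sketched as follows. This assumes the highlighting section of a Solr response has already been parsed into a dict keyed by document id and then field name, and that the per-page fields are named page_1, page_2, and so on. That naming, the helper names, and the "#page=N" link fragment are assumptions for illustration.

```python
import re

# Assumed naming convention for the per-page dynamic fields
PAGE_FIELD = re.compile(r"^page_(\d+)$")


def pages_with_hits(highlighting):
    """Map each document id to the sorted page numbers whose fields were highlighted."""
    result = {}
    for doc_id, fields in highlighting.items():
        pages = []
        for name, snippets in fields.items():
            match = PAGE_FIELD.match(name)
            if match and snippets:  # skip non-page fields and empty snippet lists
                pages.append(int(match.group(1)))
        if pages:
            result[doc_id] = sorted(pages)
    return result


def page_link(pdf_url, page_number):
    """Build a link that asks the viewer to open the PDF at the given page."""
    return f"{pdf_url}#page={page_number}"


highlighting = {
    "report.pdf": {
        "page_7": ["<em>solr</em> rocks"],
        "page_2": ["indexing with <em>solr</em>"],
        "title": ["<em>Solr</em> report"],
    },
    "notes.pdf": {"title": ["<em>Solr</em> notes"]},
}
print(pages_with_hits(highlighting))  # {'report.pdf': [2, 7]}
print(page_link("/files/report.pdf", 2))  # /files/report.pdf#page=2
```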

Does changing the order of HTML with Javascript help SEO

On my website, I have a booking widget at the top of each page to allow visitors to enter our booking engine. The code behind it uses quite a bit of HTML, pushing down the content on each page in the source. In an attempt to better my SEO, I decided to have the code placed in a DIV tag at the bottom of the page, and, when the DOM is ready, I use JQuery to physically move the DIV from the bottom of the DOM to the top where it needs to be to render correctly.
My question is whether this really helps SEO. Does Google look at the DOM/source after all JavaScript has run, or before? Does moving these few hundred lines of HTML to the bottom of the HTML source gain me any advantage?
Spiders do not process javascript. So any content that appears/moves or is created by javascript will appear as if it hasn't been moved or created at all.
I'd be really surprised if web crawlers execute the scripts on the page. They probably scan the raw response.
That does not have any effect on SEO.
But placing the JavaScript at the bottom will definitely help your pages load faster.
There is no harm to SEO either, so you can definitely proceed with your approach.
There is a distinction between JavaScript executed on load and JavaScript executed during the user session. Content produced by on-load JavaScript is, more often than not, indexed by Google. Dynamic content or alterations made on the client side afterwards are not well indexed.
So, it can't be ignored.