Apache Tika extracts only field names from PDF XFA forms but not the text content

I have tried to use Apache Tika v1.14 to parse the text contained in PDF XFA forms. However, after trying different configurations in PDFParserConfig, I can only get the field names but not the text content of those fields.
For example, if there's a field called "Telephone", part of Tika's output may be <li fieldName="Telephone">Telephone: </li> (the same repeats for other fields as well). However, if I use the PDFBox API to traverse the DOM tree to reach the Node named "Telephone", then I can use getNodeValue() to obtain the text content I want.
I am aware of the settings setExtractAcroFormContent() and setIfXFAExtractOnlyXFA() in PDFParserConfig and also experimented with them but I still didn't get the text content.
So my questions are:
1. Did I misconfigure Tika so that it does not give the right output? Or,
2. Is this what Apache Tika's implementation is intended to do? Or,
3. Is the implementation still under development?
I am sorry, but the forms contain patients' medical information, so I am not able to attach them as examples.
Thank you very much.
P.S. I am also aware of Tika's Jira issues https://issues.apache.org/jira/browse/TIKA-973 and https://issues.apache.org/jira/browse/TIKA-1857, so I thought this feature had been implemented.
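For reference, the DOM traversal mentioned above can be sketched roughly as follows. This is a minimal sketch using only the JDK's DOM API: the class name XfaFieldReader, the sample XML and the phone number are made up for illustration, and obtaining the Document from the PDF through PDFBox is not shown here. It assumes the field value is stored as text inside an element named after the field.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class XfaFieldReader {

    /** Parses an XFA-style XML string into a namespace-aware DOM document. */
    static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    /** Returns the trimmed text of the first element with this local name, or null if absent. */
    static String fieldValue(Document doc, String fieldName) {
        NodeList matches = doc.getElementsByTagNameNS("*", fieldName);
        if (matches.getLength() == 0) {
            return null;
        }
        // getNodeValue() on an Element is null; the text lives in its child text nodes.
        return ((Element) matches.item(0)).getTextContent().trim();
    }

    public static void main(String[] args) {
        // Hypothetical data packet; a real one would come from the PDF's XFA resource.
        String xml = "<data><form1><Telephone>555-0100</Telephone></form1></data>";
        System.out.println(fieldValue(parse(xml), "Telephone"));
    }
}
```

A field whose element is missing yields null, which makes it easy to tell "field not present" apart from "field present but empty".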

Related

Apache FOP - Extra content at the end

I have a simple task that I've found quite complex to implement with Apache FOP.
I have already created a layout, so I have page definitions for the first page, a single ("only") page, the remaining pages and the last page, with the content well distributed on them. Now I sometimes need to add extra content to the end of the document (such as Terms and Conditions or agreement conditions), which can take up several pages. That content shouldn't have any header, footer, page number, etc. It should be just a text flow with paragraphs.
Thank you.
Kind regards.
As I am new to FOP, I did not fully understand its capabilities when I posted my question here.
I have since sorted out the issue by defining a new page-master and a new page-sequence.

How to use html tables within tumblr posts

For a public blog we are currently using the default rich text editor.
This is because our content editors do not write HTML.
Now, on a certain post / custom page, we want to add tabular data using an HTML table.
The rich text editor in Tumblr has no support for creating tables, so we switched to the HTML editor. There we created the table tag and some rows. Everything looks fine until we save the post. On saving, the table element seems to be gone...
The Tumblr dashboard strips out many different HTML tags. They're visible in your theme, but on the dash they're either removed without a trace or replaced with a small clickthrough "embedded content" symbol.
If you NEED the table on the dash, you may need to use an image. Personally I'd include the Read More break to direct people into viewing the whole thing on your blog, but that won't work inside the app or slide-in bar, etc.

How to control the way Nutch parses and Solr indexes a URL when its HTML structure is unknown?

I am trying to crawl some sites which have a poorly maintained HTML structure that I have no control over. When I look at the Nutch-crawled data indexed by Solr, the field 'title' looks okay, whereas the 'content' field includes a lot of junk, because it grabbed all the text from the HTML banner with its drop-down menu and worked down into the left side menu, navigation, footer, etc.
In my case, I am only interested in grabbing the "Description:" information, which is defined in a paragraph on the HTML page, into the 'content' field.
Example: (raw html):
<p><strong>Description:</strong> Apache Nutch is an open source Web crawler written in Java. By using it, we can find Web page hyperlinks in an automated manner, reduce lots of maintenance work, for example checking broken links, and create a copy of all the visited pages for searching over.
How can I filter the junk out of the 'content' field and only have the information I am interested in?
You can use the plugin below to extract content based on XPath queries.
If your content is in a specific div, you can use this plugin to extract the content you want from that specific section.
Filter xpath
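As an illustration of the kind of XPath query such a plugin would be given, here is a minimal sketch using only the JDK's XPath API. It does not reproduce the plugin's own configuration; the class name DescriptionExtractor and the query are assumptions based on the sample paragraph in the question. It also assumes the markup is well-formed XML, which poorly maintained HTML usually is not, so a real crawl would need a tolerant HTML parser in front of it.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

class DescriptionExtractor {

    private static final String LABEL = "Description:";

    // Selects the <p> whose <strong> child reads "Description:".
    static final String DESCRIPTION_XPATH =
            "//p[strong[normalize-space(.)='" + LABEL + "']]";

    /** Returns the description text without its label, or "" if no such paragraph exists. */
    static String extract(String xhtml) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            // normalize-space() collapses whitespace in the paragraph's text content.
            String text = xpath.evaluate("normalize-space(" + DESCRIPTION_XPATH + ")",
                    new InputSource(new StringReader(xhtml)));
            return text.startsWith(LABEL) ? text.substring(LABEL.length()).trim() : text;
        } catch (Exception e) {
            throw new IllegalStateException("Could not evaluate XPath", e);
        }
    }
}
```

Text in menus, navigation and footers never matches the query, so only the description paragraph ends up in the extracted value.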

Get page numbers of searchresult of a pdf in solr

I'm building a web application where users can search for PDF documents and view them with pdf.js. I would like to display the search results with a short snippet of the paragraph where the search term was found and a link to open the document at the right page.
So what I need is the page number and a short text snippet of every search result.
I'm using SOLR 4.1 to index pdf documents. The indexing itself works fine but I don't know how to get the page number and paragraph of a search result.
I found "Indexing PDF with page numbers with Solr" here, but it wasn't really helpful.
I'm now splitting the PDF and sending each page separately to SOLR.
So every page is its own document with an id <id_of_document>_<page_number> and an additional field doc_id, which contains only the <id_of_document> for grouping the results.
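The per-page scheme described above could be sketched like this. It is only a sketch of how the field values might be assembled before they are sent to Solr: the class name PageDocuments and the "page" and "content" field names are assumptions, while "id" and "doc_id" follow the description. Splitting the PDF and posting the documents are not shown.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class PageDocuments {

    /** Builds one field map per page; pageTexts.get(0) is the text of page 1. */
    static List<Map<String, Object>> build(String docId, List<String> pageTexts) {
        List<Map<String, Object>> docs = new ArrayList<>();
        for (int i = 0; i < pageTexts.size(); i++) {
            int pageNumber = i + 1;
            Map<String, Object> doc = new LinkedHashMap<>();
            doc.put("id", docId + "_" + pageNumber); // <id_of_document>_<page_number>
            doc.put("doc_id", docId);                // shared by all pages, used for grouping
            doc.put("page", pageNumber);             // assumed field, handy for building links
            doc.put("content", pageTexts.get(i));    // assumed name of the searchable text field
            docs.add(doc);
        }
        return docs;
    }
}
```

Grouping the results on doc_id then collapses the page hits back into one entry per PDF, while each hit still carries its page number.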
There is JIRA SOLR-380 with a Patch, which you can check upon.
I also tried getting the results with page number but could not do it. I used Apache PDFBox for splitting all the PDFs present in a directory and sending the files to Solr server.
I have not tried it myself.
Approach:
Write a custom Solr connector that integrates with the Apache Tika parser for indexing PDFs
Create multiple attributes in Solr like page1, page2, page3…,pageN; alternatively, use dynamic attributes in Solr
In the custom connector, read the PDFs page by page and index each page into the respective page attribute/dynamic attribute
Enable search on all the “page” attributes
When the user searches, use the “highlighter/Summary/Teaser” component to retrieve only the “page” attributes that have hits
The “page” attributes that have a hit (found via the highlighter/Summary/Teaser) for a given record are the pages that contain the searched phrase.
Link the PDF with the “#PageNumber” of the PDF and pop up the page on click
I consider this a far better approach than splitting the PDFs and indexing them as separate Solr docs.
If you find a flaw in this design, respond to my thread. I will attempt to resolve it.
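The step that turns highlighter hits into page numbers could look roughly like the sketch below. It assumes the highlighting for one record has already been read into a map from field name to snippets, and that the page attributes are named page1, page2, and so on as in the steps above. The class name PageHits and the "#page=N" fragment form are assumptions; adjust the latter to whatever the viewer in use expects.

```java
import java.util.List;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PageHits {

    // Matches field names of the form page<N>, e.g. "page12".
    private static final Pattern PAGE_FIELD = Pattern.compile("page(\\d+)");

    /** Returns the page numbers whose page attribute has at least one snippet, in ascending order. */
    static SortedSet<Integer> pagesWithHits(Map<String, List<String>> highlightsByField) {
        SortedSet<Integer> pages = new TreeSet<>();
        for (Map.Entry<String, List<String>> entry : highlightsByField.entrySet()) {
            Matcher matcher = PAGE_FIELD.matcher(entry.getKey());
            if (matcher.matches() && !entry.getValue().isEmpty()) {
                pages.add(Integer.parseInt(matcher.group(1)));
            }
        }
        return pages;
    }

    /** Builds a link that opens the PDF at the given page (assumed fragment form). */
    static String pageLink(String pdfUrl, int page) {
        return pdfUrl + "#page=" + page;
    }
}
```

Fields that are not page attributes, and page attributes without snippets, are ignored, so the result lists only the pages that actually contain the search phrase.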

File uploading from within a custom form tag in Spring MVC

Context
Part of the administrator side of our application requires the user to edit various types of content, which involves using a rich text editor or using files to generate content that can be seen by the 'client side' users of the application. It's kind of a domain-specific CMS lite.
Because this 'content' can be used in various parts of the application, it is included as a separate relation in some of our domain entities. We decided to make our own tag library that defines some form fields that can be used to edit this content when an administrator edits an entity that includes a piece of content.
Question
What we'd like to be able to do is the following.
<form:form modelAttribute=...>
<olo:content-editor path="content"/>
<!-- Other form fields for this entity -->
<form:.../>
<form:.../>
</form:form>
The olo:content-editor tag then generates a number of form fields based on what type of content is needed. This means it may (or, depending on the type of content, may not!) generate the filebased-content tag which contains:
<input type="file" name="file"/>
Which can be used to replace the file associated with the file based content.
The problem is that, according to the Spring docs, a file upload requires the form's enctype to declare that it is sending multipart form data. As the file upload is part of the tag and not the form itself, we find this undesirable. We would like to be able to use our olo:content-editor tag in forms without having to change the form's enctype attribute. Is this possible?
Possible solutions
We can think of two client-side hacks that may resolve our problem, but both seem to be rather ugly solutions:
Include a script in the filebased-content tag that changes the form enctype when it's loaded, so that it is always set to the appropriate type. (Very ugly.)
Submit the file data as a regular hidden form field, whose data is set by using the HTML5 File API. (Administrators use a compliant browser.) This seems far less ugly, but it is still not an optimal solution.