I'm building a web application where users can search for PDF documents and view them with pdf.js. I would like to display the search results with a short snippet of the paragraph where the search term was found and a link that opens the document at the right page.
So what I need is the page number and a short text snippet for every search result.
I'm using Solr 4.1 to index the PDF documents. The indexing itself works fine, but I don't know how to get the page number and paragraph of a search result.
I found the question "Indexing PDF with page numbers with Solr", but it wasn't really helpful.
I'm now splitting the PDF and sending each page separately to Solr.
So every page is its own document with an id <id_of_document>_<page_number> and an additional field doc_id, which contains only the <id_of_document>, for grouping the results.
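For anyone interested, here is a minimal sketch of that per-page indexing with Apache PDFBox and SolrJ. It uses current APIs (PDFBox 2.x and a recent SolrJ; Solr 4.1's own SolrJ used HttpSolrServer instead), and the file name, core URL, and the page/content field names are placeholders for my actual setup:

    import java.io.File;
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.text.PDFTextStripper;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class PageIndexer {
        public static void main(String[] args) throws Exception {
            String docId = "mydoc";
            try (PDDocument pdf = PDDocument.load(new File("mydoc.pdf"));
                 HttpSolrClient solr = new HttpSolrClient.Builder(
                         "http://localhost:8983/solr/documents").build()) {
                PDFTextStripper stripper = new PDFTextStripper();
                for (int page = 1; page <= pdf.getNumberOfPages(); page++) {
                    // extract the text of exactly one page
                    stripper.setStartPage(page);
                    stripper.setEndPage(page);
                    SolrInputDocument d = new SolrInputDocument();
                    d.addField("id", docId + "_" + page); // <id_of_document>_<page_number>
                    d.addField("doc_id", docId);          // for grouping results per PDF
                    d.addField("page", page);
                    d.addField("content", stripper.getText(pdf));
                    solr.add(d);
                }
                solr.commit();
            }
        }
    }

Grouping the results per PDF then works with group=true&group.field=doc_id, highlighting on content gives the snippet, and the stored page field gives the page number for the pdf.js link.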
There is the JIRA issue SOLR-380 with a patch that you can check out.
I also tried getting the results with page numbers but could not manage it. I used Apache PDFBox to split all the PDFs in a directory and send the files to the Solr server.
I have not tried it myself.
Approach:
A custom Solr connector integrating with the Apache Tika parser for indexing PDFs
Create multiple attributes in Solr like page1, page2, page3, …, pageN; alternatively, use dynamic attributes in Solr
In the custom connector, read the PDFs page by page and index each page into the respective page attribute/dynamic attribute
Enable search on all the “page” attributes
When the user searches, use the “highlighter/summary/teaser” component to retrieve only the “page” attributes that have hits
The “page” attributes that have a hit (found via the highlighter/summary/teaser) for a given record are the pages that contain the searched phrase
Link the PDF with the “#page=<PageNumber>” open parameter and pop up that page on click
This is a far better approach than splitting the PDFs and indexing each page as a separate Solr doc. A query-side sketch follows.
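A minimal SolrJ sketch of the highlighter step (hedged: it assumes a stored dynamic text field page_* in the schema; the core URL and search term are placeholders). A highlight hit on page_7 means the phrase is on page 7:

    import java.util.List;
    import java.util.Map;
    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class PageHitSearch {
        public static void main(String[] args) throws Exception {
            try (HttpSolrClient solr = new HttpSolrClient.Builder(
                    "http://localhost:8983/solr/documents").build()) {
                SolrQuery q = new SolrQuery("nutch");  // the user's search term
                q.setHighlight(true);
                q.set("hl.fl", "page_*");              // highlight only the per-page fields
                QueryResponse rsp = solr.query(q);
                // doc id -> (field name -> snippets)
                for (Map.Entry<String, Map<String, List<String>>> doc
                        : rsp.getHighlighting().entrySet()) {
                    for (Map.Entry<String, List<String>> f : doc.getValue().entrySet()) {
                        String pageNo = f.getKey().substring("page_".length());
                        System.out.println(doc.getKey() + ".pdf#page=" + pageNo
                                + " : " + f.getValue().get(0));
                    }
                }
            }
        }
    }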
If you find a flaw in this design, respond to my thread. I will attempt to resolve it.
I have tried to use Apache Tika v1.14 to parse the text contained in PDF XFA forms. However, after trying different configurations in PDFParserConfig, I can only get the field names but not the text content of those fields.
For example, if there's a field called "Telephone", part of Tika's output may be <li fieldName="Telephone">Telephone: </li> (the same repeats for the other fields as well). However, if I use the PDFBox API to traverse the DOM tree to reach the node named "Telephone", then I can use getNodeValue() to obtain the text content I want.
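For reference, this is the kind of PDFBox traversal I mean (a sketch against PDFBox 2.x; the file name is a placeholder, and I use getElementsByTagName() as a shortcut for walking the tree):

    import java.io.File;
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm;
    import org.w3c.dom.Document;
    import org.w3c.dom.Node;
    import org.w3c.dom.NodeList;

    public class XfaDomDump {
        public static void main(String[] args) throws Exception {
            try (PDDocument pdf = PDDocument.load(new File("form.pdf"))) {
                PDAcroForm acroForm = pdf.getDocumentCatalog().getAcroForm();
                // the decoded XFA packet as a W3C DOM document
                Document xfa = acroForm.getXFA().getDocument();
                NodeList nodes = xfa.getElementsByTagName("Telephone");
                for (int i = 0; i < nodes.getLength(); i++) {
                    Node text = nodes.item(i).getFirstChild(); // text node with the value
                    if (text != null) {
                        System.out.println("Telephone = " + text.getNodeValue());
                    }
                }
            }
        }
    }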
I am aware of the settings setExtractAcroFormContent() and setIfXFAExtractOnlyXFA() in PDFParserConfig and have experimented with them, but I still didn't get the text content.
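In case I wired them up wrongly, this is how I pass those settings to the parser (a sketch; the file name is a placeholder):

    import java.io.FileInputStream;
    import java.io.InputStream;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.parser.pdf.PDFParserConfig;
    import org.apache.tika.sax.BodyContentHandler;

    public class XfaExtract {
        public static void main(String[] args) throws Exception {
            PDFParserConfig config = new PDFParserConfig();
            config.setExtractAcroFormContent(true);   // the settings mentioned above
            config.setIfXFAExtractOnlyXFA(true);
            ParseContext context = new ParseContext();
            context.set(PDFParserConfig.class, config);
            AutoDetectParser parser = new AutoDetectParser();
            BodyContentHandler handler = new BodyContentHandler(-1); // no write limit
            Metadata metadata = new Metadata();
            try (InputStream in = new FileInputStream("form.pdf")) {
                parser.parse(in, handler, metadata, context);
            }
            System.out.println(handler.toString());
        }
    }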
So my questions are:
Did I misconfigure Tika so that it does not give the right output? Or,
Is this what Apache Tika's implementation is intended to do? Or,
Is the implementation still under development?
I am sorry that the forms contain medical information of patients, so I am not able to attach them as examples.
Thank you very much.
P.S. I am also aware of Tika's JIRA issues https://issues.apache.org/jira/browse/TIKA-973 and https://issues.apache.org/jira/browse/TIKA-1857, so I thought this feature had been implemented.
I have a simple task, which I've found quite complex to implement with Apache FOP.
I have already created the layout, so I have nice first-page, only-page, rest-pages and last-page definitions with well-distributed content. But now I sometimes need to add extra content at the end of the document (like terms and conditions, or agreement conditions), which can take even a few pages. That content shouldn't have any header, footer, page number, etc.; just a text flow with paragraphs.
Thank you.
Kind regards.
As I am new to FOP, I hadn't fully recognized FOP's capabilities when posting my question here.
So I've sorted out the issue with a new page-master and a new page-sequence definition.
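For anyone with the same problem, this is roughly what it looks like (a sketch; the master name and dimensions are illustrative). The extra simple-page-master goes into the existing fo:layout-master-set, and the extra content gets its own fo:page-sequence at the end of fo:root:

    <!-- a page master with only a body region: no region-before/region-after
         means no header, footer or page-number area -->
    <fo:simple-page-master master-name="plain"
        page-height="29.7cm" page-width="21cm" margin="2cm">
      <fo:region-body/>
    </fo:simple-page-master>

    <!-- a separate sequence for the trailing content -->
    <fo:page-sequence master-reference="plain">
      <fo:flow flow-name="xsl-region-body">
        <fo:block space-after="6pt">Terms and conditions</fo:block>
        <fo:block>Plain paragraphs flow here and may span several pages.</fo:block>
      </fo:flow>
    </fo:page-sequence>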
Has anyone managed to provide an end user with an updated PDF in a way that lets that user transfer his local annotations to the new PDF while keeping the annotations on the correct page, even when pages are inserted into the updated PDF at a point earlier in the PDF than the annotations?
I thought there might be a page-map or page-GUID approach that someone has used.
Sorry - I hope that is clear.
Instead of using the page index as the ID of the page, you can use the page content stream (after decoding/decompression). Most PDF libraries will give you access to that, so you could compute an MD5 hash of the page content and search for that hash in your "updated" file in order to know where to transfer your annotations.
This assumes that the page content is indeed identical, which is not necessarily the case.
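A PDFBox 2.x sketch of that fingerprinting idea (the file name is a placeholder; PDPage.getContents() returns the already decoded stream):

    import java.io.File;
    import java.security.MessageDigest;
    import java.util.Base64;
    import org.apache.pdfbox.io.IOUtils;
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.pdmodel.PDPage;

    public class PageFingerprints {
        public static void main(String[] args) throws Exception {
            try (PDDocument doc = PDDocument.load(new File("updated.pdf"))) {
                int pageNo = 1;
                for (PDPage page : doc.getPages()) {
                    // a page may have no content stream at all
                    byte[] content = page.getContents() == null
                            ? new byte[0] : IOUtils.toByteArray(page.getContents());
                    String hash = Base64.getEncoder().encodeToString(
                            MessageDigest.getInstance("MD5").digest(content));
                    System.out.println("page " + pageNo++ + " -> " + hash);
                }
            }
        }
    }

Matching a page in the old file to a page in the new file is then just a lookup of the old page's hash in the new file's hash list.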
I am trying to crawl some sites that have a poorly maintained HTML structure, and I have no control over them to change it. When I look at the Nutch-crawled data indexed by Solr, the 'title' field looks okay, whereas the 'content' field includes a lot of junk: it grabbed all the text from the HTML banner with its drop-down menu and worked its way down through the left-side menu, navigation, footer, etc.
In my case, I am interested in grabbing just the "Description:" information, which is defined in a paragraph on the HTML page, into the 'content' field.
Example (raw HTML):

    <p><strong>Description:</strong> Apache Nutch is an open source Web crawler written in Java. By using it, we can find Web page hyperlinks in an automated manner, reduce lots of maintenance work, for example checking broken links, and create a copy of all the visited pages for searching over.</p>
How can I filter the junk out of the 'content' field and only have the information I am interested in?
You can use the plugin below to extract content based on XPath queries.
If your content is in a specific div, you can use this plugin to extract the content you want from that specific section.
Filter xpath
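Independent of the plugin, this is the kind of XPath expression you would configure for that paragraph (a sketch; it parses the example fragment above, which happens to be well-formed XML once the <p> is closed, whereas real-world HTML needs an HTML-aware parser):

    import java.io.ByteArrayInputStream;
    import javax.xml.parsers.DocumentBuilderFactory;
    import javax.xml.xpath.XPath;
    import javax.xml.xpath.XPathFactory;
    import org.w3c.dom.Document;

    public class DescriptionExtractor {
        public static void main(String[] args) throws Exception {
            String html = "<p><strong>Description:</strong> Apache Nutch is an open"
                    + " source Web crawler written in Java.</p>";
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(html.getBytes("UTF-8")));
            XPath xp = XPathFactory.newInstance().newXPath();
            // select the <p> whose <strong> child starts with "Description:"
            String text = xp.evaluate(
                    "//p[strong[starts-with(normalize-space(.),'Description:')]]", doc);
            System.out.println(text.trim());
        }
    }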
Do you know how Google derives the description of a website in its search results? Is it the meta description? The first paragraph?
Their algorithms aren't officially released to the public, but if there is a meta description tag, Google takes that. Otherwise it generally depends on where the keywords lie within the body of the webpage. If someone is searching for "foo", a paragraph with "foo" in it will likely appear, with "foo" highlighted in bold.
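For reference, the tag in question looks like this (the content value is illustrative):

    <meta name="description" content="Apache Nutch is an open source Web crawler written in Java.">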
Search engines (including Google) crawl the first introductory paragraph of a page or post and take that excerpt as the description shown in search results. But there is a precaution to take to stay SEO-friendly: if you start your page or post with an image, it negatively affects that page's SEO, because search results are text and search engines cannot derive a text description from the image. In the case of WordPress, you can use the All in One SEO Pack plugin to set the description yourself if you are starting your post or page with an image.