Umbraco PDF searcher result ranking - Lucene

We use the PDF searcher (a NuGet package) in one of our Umbraco applications. The PDF search results do not look entirely correct.
The top two PDFs in the results contain the search term, but the third, fourth and remaining PDFs do not. I am not sure why PDFs that lack the search term are included in the results.
Can anyone explain how the Umbraco PDF searcher works and how it ranks the result items?
Is there any way to remove PDFs that do not contain the search term at all from the results?

Download LUKE (https://code.google.com/archive/p/luke/). It is a tool that lets you look inside Lucene indexes and see exactly what has been indexed.
You can get Umbraco Examine to output the raw Lucene query string it searches with by calling .ToString() on the criteria object. Paste that string into LUKE to run the same search, and you will see useful details such as the matched terms and the ranking.
:)

Related

CSE - how to get partial-word results?

How can I make CSE also return results for partial words? For example, a search for "out of reach" works fine, but "out of r" shows 0 results. I don't want to use synonyms: that would mean thousands of synonyms, which is not feasible to implement. It must be possible to get partial matches by default somehow.
Can you give me some hints?
Thanks.
Several possible solutions:
Turn autocomplete on so that it suggests full terms for partial input; this is fast and cuts down on spelling mistakes. You can add common misspellings as synonyms or autocomplete suggestions. You can add the full autocomplete terms manually or upload them from a file. Autocomplete is under Edit search engine, Search features, Autocomplete tab.
Searching with a wildcard, e.g. out of r*, is supported just as in normal Google search. You could add code so that a query returning no matches is resubmitted with a wildcard at the end of each word, or you could amend what is typed into the input box to append a wildcard to each word before searching. To do this you can upload a refinements file that does simple things like adding wildcards to one-letter search terms. Alternatively, you could create an extra input box for the user to type in (hiding the original) and use an event listener or click event to capture whatever is typed, pass it to your script and then pass it on to the search engine.
Create an XML file of autocomplete terms using a script. Something as simple as a short piece of code, or even an old-fashioned spreadsheet macro, can generate a set of nearly identical lines, each with a wildcard at the end. You could add autocomplete terms for all one- or two-character words (though it would probably be best to exclude very short words like a, it, do from this list). The user will see the wildcards when they type the initial letters, but they would need to recognize and click on them, which is not much good if they don't know what r* means.
A better approach to generating an XML file to upload for autocomplete would be to use Google Webmaster Tools (or even a common-word finder tool) to analyze the top 100 (or top 1000) words in your website's text. Then write a short piece of code to transform this list into the XML, which you then upload. This won't catch every partial match, but it will quickly get the most commonly searched-for words in there. It will cut down on misspellings too.
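Two of the suggestions above are easy to prototype before wiring them into a page: appending a wildcard to each word of a query, and finding the most common words in your site's text. The following is only a rough sketch in Python. The function names and the stop-word list are made up for illustration, and it does not produce the CSE autocomplete XML itself, since that format should be taken from Google's documentation.

```python
import re
from collections import Counter

# Illustrative stop-word list only; extend it for your own content.
STOP_WORDS = {"a", "an", "it", "do", "of", "the", "to", "in", "is", "and"}


def add_trailing_wildcards(query):
    """Append '*' to every word of the query that does not already end with one."""
    words = query.split()
    return " ".join(w if w.endswith("*") else w + "*" for w in words)


def top_terms(text, n=100):
    """Return the n most frequent words in text, skipping stop words and very short words."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS and len(w) > 2)
    return [word for word, _ in counts.most_common(n)]
```

The list returned by top_terms could then be fed to whatever script writes your autocomplete file, and add_trailing_wildcards could be applied to a query before it is resubmitted after a search with no matches.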

Google CSE limit indexing of single file?

I have been using Google CSE to index several long PDF files for searching (some 500+ pages long). I am noticing that the search will find terms close to the beginning of some of these documents, but not terms that are near the end of the document. Is there a limit to how much of a single file Google will index?
Since no one seems to know, I will share my experience. We have requested a manual index of the PDF files several times and still cannot get the search to pick up any terms past pages 10-15. It seems there is a character limit on how much of a single PDF gets indexed. Google support is not available to confirm this unless the business version is purchased, which we won't be doing.

How to boost linked documents in Lucene?

Is it possible to boost found documents based on other found documents?
E.g. if document A has a link to document B and both are found independently, can both be boosted? By link I mean a field holding the ID of another document.
Currently I'm doing it "manually" i.e. I post-process the TopDocs looking for documents that have links to other documents in the same result and move those to the top. This is not the best solution as the TopDocs itself is already limited without taking my custom boosting into account.
I would suggest implementing a custom Lucene Collector or extending an existing one. That way you can store all the doc IDs that are retrieved and post-process them all at the end. Depending on the links between your documents, you may be able to throw away some of the docs during the "collecting" phase, which will save memory.
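The post-processing step could look roughly like the sketch below. It is written in plain Python rather than against the Lucene API, so Hit, boost_linked and the link_boost factor are all hypothetical names; the point is only that the boost is applied to every collected hit before the list is cut down to the top N.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Hit:
    doc_id: str
    score: float
    linked_id: Optional[str] = None  # ID of another document, if any


def boost_linked(hits, link_boost=2.0, top_n=10):
    """Boost hits that link to, or are linked from, another hit in the same result set."""
    found = {h.doc_id for h in hits}
    # A link counts only when both ends were matched by the query.
    linked = set()
    for h in hits:
        if h.linked_id in found and h.linked_id != h.doc_id:
            linked.add(h.doc_id)
            linked.add(h.linked_id)
    rescored = [
        Hit(h.doc_id, h.score * link_boost if h.doc_id in linked else h.score, h.linked_id)
        for h in hits
    ]
    # Truncate only after rescoring, so boosted documents are not cut off early.
    rescored.sort(key=lambda h: h.score, reverse=True)
    return rescored[:top_n]
```

In a real collector the hits list would be filled during collection, with the linked ID read from the stored field of each matching document.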

Tagging documents with predefined labels

I am working with a large number of documents and have a set of predefined categories/tags (which could be phrases) that will be present in the text of the documents, either in exact or inexact form.
I want to assign each document exactly one tag: the one that is closest to its text.
Please give me some pointers on how to approach this problem.
You can look at the Lucene search engine, which can tag documents while indexing. The Northern Light search engine used to do something similar to what you describe in its search methodology. You can have a look at its implementation to get an idea.
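As a starting point before bringing in a search engine, the "closest tag" idea can be tried with a simple word-overlap score. This is only a sketch of one possible scoring rule, not what Lucene or Northern Light do; the function names and the scoring formula are assumptions chosen for illustration.

```python
import re


def tokens(text):
    """Lowercase word tokens of a text."""
    return re.findall(r"[a-z0-9]+", text.lower())


def closest_tag(document, tags):
    """Return the single tag whose words overlap most with the document.

    Score = fraction of the tag's distinct words that occur in the document,
    so a tag whose words all appear scores 1.0 and a partial match scores less.
    Ties go to the tag listed first.
    """
    doc_words = set(tokens(document))
    best_tag, best_score = None, -1.0
    for tag in tags:
        tag_words = set(tokens(tag))
        if not tag_words:
            continue  # skip empty tags
        score = len(tag_words & doc_words) / len(tag_words)
        if score > best_score:
            best_tag, best_score = tag, score
    return best_tag
```

This ignores word order and word forms (so "designing" does not match "design"); stemming or fuzzy matching would be the natural next refinement for inexact tags.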

Increasing the weight of particular terms (e.g. headings) when indexing documents in Lucene

I have documents which I am indexing with Lucene. These documents basically have a title (text) and body (text). Currently I am creating an index out of Lucene Documents with (amongst other fields) a single searchable field, which is basically title+" "+body. In this way, if you search for anything which occurs in the title or in the body, you will find the document.
However, now I have learned of the new requirement that matches in the title should cause the document to be "more relevant" than matches in the body. Thus, if there is a document with the title "Software design", and the user searches for "Software design", then that document should be placed higher up in the search results than a document called something else, which mentions software design a lot in the body.
I don't really have any idea how to begin implementing this requirement. I know that Google, for example, treats certain parts of a document as "more relevant" (e.g. text within <h1> tags), and everyone here assumes Lucene supports something similar.
However,
The Javadoc for the Document class clearly states that fields contain text, i.e. not structured text where some parts are "more important" than other parts.
This blog post states "With Lucene, it is impossible to increase or decrease the weight of individual terms in a document."
I'm not really sure where to look. What would you suggest?
Any specific information (e.g. links to Lucene documentation) stating flatly that such a thing is not possible would also be helpful, then I needn't spend any further time looking for how to do it. (The software is already written with Lucene, so we won't re-write it now, so if Lucene doesn't support it, then there's nothing anyone (my boss) can do about that.)
Just use two fields, title and body, and boost the 'title' field while indexing:
title.setBoost(float)
see here
You should probably split the combined field into separate title and body fields, then use a query-time boost to give matches in the title field more relevance.
The run-time query will look like:
title:apache^20 body:apache
See http://lucene.apache.org/java/2_4_0/queryparsersyntax.html#Boosting%20a%20Term
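If the user's input has several words, the same query string can be assembled clause by clause. Below is a small, language-neutral sketch in Python of building that string; boosted_query is a hypothetical helper, the field names title and body come from the answer above, and the boost of 20 is just the example value, not a recommendation.

```python
def boosted_query(terms, title_boost=20):
    """Build a Lucene query-parser string that weights title matches over body matches.

    Note: this sketch does not escape query-parser special characters
    (such as ':' or '*') in the user's input.
    """
    clauses = []
    for term in terms.split():
        clauses.append(f"title:{term}^{title_boost}")
        clauses.append(f"body:{term}")
    return " ".join(clauses)
```

For the input "apache" this yields the query shown above, title:apache^20 body:apache. The resulting string would then be handed to the query parser in the application.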