Is Lucene capable of finding the location of matches within a document?

Say I have 100 documents indexed in Lucene. I want to search for the term "American Airlines". Lucene runs the search and gives me back 10 documents that contain the term "American Airlines". I now want to be able to go through each of these 10 documents in my UI, and highlight/scroll to each of the matches automatically. These are all HTML documents with uniquely id-ed paragraph tags, so I can use a URL like http://docurl#p_120 to scroll to <p id="p_120">American Airlines is a big company.</p>. But how do I get Lucene to tell me which paragraph the term is in, and exactly where it is, so I can highlight it?

Your question is about highlighting: you ask how to index a text with subdocuments so that you know the id of the subdocument for highlighting.
In my opinion you have three possibilities. But first of all, let me remind you that Lucene can use the offset (the position in the original text) for highlighting:
https://lucene.apache.org/core/6_4_0/highlighter/org/apache/lucene/search/highlight/package-summary.html
and that Lucene knows the concept of sub-documents as "blocked child documents", "nested documents", or "embedded documents".
The three possibilities:
1. Use payloads to store the id of the corresponding subdocument for each occurrence of a term.
2. Store the offset of each occurrence of a term and keep track of the offset at which each new subdocument begins. Store the ids together with the corresponding offsets in an extra field and use this to look up the id for each hit.
3. Index the document together with all subdocuments as child documents in a block and search with http://lucene.apache.org/core/6_4_0/join/index.html?org/apache/lucene/search/join/ToParentBlockJoinCollector.html
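As a rough sketch of the third possibility, the Java fragment below indexes each paragraph as a child document inside a block and joins matching paragraphs up to their parent page. The field names (type, para_id, content, url), the RAMDirectory, and the phrase query are illustrative assumptions, not something taken from the question.
```java
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.join.QueryBitSetProducer;
import org.apache.lucene.search.join.ScoreMode;
import org.apache.lucene.search.join.ToParentBlockJoinQuery;
import org.apache.lucene.store.RAMDirectory;

public class BlockJoinSketch {
  public static void main(String[] args) throws Exception {
    RAMDirectory dir = new RAMDirectory();
    IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()));

    // One block per page: all paragraph (child) documents first, the page (parent) document last.
    List<Document> block = new ArrayList<>();
    Document para = new Document();
    para.add(new StringField("para_id", "p_120", Store.YES));
    para.add(new TextField("content", "American Airlines is a big company.", Store.YES));
    block.add(para);

    Document page = new Document();
    page.add(new StringField("type", "parent", Store.YES));
    page.add(new StringField("url", "http://docurl", Store.YES));
    block.add(page);

    writer.addDocuments(block);
    writer.close();

    IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(dir));

    // The child query matches paragraphs; the join query returns their parent pages.
    Query childQuery = new PhraseQuery("content", "american", "airlines");
    QueryBitSetProducer parents = new QueryBitSetProducer(new TermQuery(new Term("type", "parent")));
    Query joinQuery = new ToParentBlockJoinQuery(childQuery, parents, ScoreMode.Max);

    for (ScoreDoc hit : searcher.search(joinQuery, 10).scoreDocs) {
      System.out.println("matching page: " + searcher.doc(hit.doc).get("url"));
    }

    // Running the child query directly tells you which paragraphs matched.
    for (ScoreDoc hit : searcher.search(childQuery, 10).scoreDocs) {
      System.out.println("matching paragraph: " + searcher.doc(hit.doc).get("para_id"));
    }
  }
}
```
The matching para_id values are exactly what you need to build URLs like http://docurl#p_120; for highlighting within the paragraph text you would additionally run the stored content through the highlighter package linked above.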

Related

Configure SOLR query to find the Plurals word along with Singular word while forming Query String

I'm using a Solr query to sort the search results based on the entered search text. Currently my query only works on singular words like filter, car, floor. If I search for the word filter it only returns results for filter, but I want the query to also return results for filters, cars, floors. At the moment it returns all the results containing the words filter, car, floor, but not their plurals.
Below is the Solr query I'm using for the sorted result:
https://searchg2.crownpeak.net/NEI-Blogs-Dev/select/?q=custom_s_brand:nei&fl=custom_s_heading,custom_s_article_summary_Image_url,custom_t_content_summary_Image_url_alt,custom_t_content_summary_Desc,custom_s_local_url,custom_s_local_dba,custom_t_heading,termfreq(custom_t_heading,filter*),sum(termfreq(custom_t_heading,filter*)),termfreq(custom_t_content,filter*),sum(termfreq(custom_t_content,filter*))&qf=custom_t_heading&fl=custom_t_content,termfreq(custom_t_content,filter*),sum(termfreq(custom_t_content,filter*))&qf=custom_t_content&sort=sum(termfreq(custom_t_heading,filter*))%20desc,sum(termfreq(custom_t_content,filter*))%20desc&defType=edismax&fq=custom_s_status:Active
Solr does different kinds of searches depending on the type of the field that you are searching on.
For string fields the search performs an exact match, hence if the value stored is "car" it will find "car" but not "cars", not even "CAR".
If the field you search on is a tokenized text field, then the search will match certain variations depending on how the value was tokenized and which filters were applied to it. For example, if you use a built-in text_en field, it performs certain transformations that are typical for English values, so a search for "cars" or "CAR" will match a stored value of "car" because the text_en field stores the stem of the word (e.g. "car" for "cars"). This seems to be what you are after.
It looks like the field that you are searching on (custom_s_brand) is a string field, perhaps you want to create additional tokenized fields for the brand so that your searches capture a wider range of matches rather than only identical matches.
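If you want to see what such a tokenized field actually does with plurals, here is a small Lucene sketch in Java using the EnglishAnalyzer, which applies roughly the same kind of English stemming a stock text_en field is configured with; the field name custom_t_heading is only borrowed from your query for illustration.
```java
import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class StemDemo {
  public static void main(String[] args) throws IOException {
    EnglishAnalyzer analyzer = new EnglishAnalyzer();
    // Analyze the plural forms the same way an English text field would at index/query time.
    try (TokenStream ts = analyzer.tokenStream("custom_t_heading", new StringReader("filters cars floors"))) {
      CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
      ts.reset();
      while (ts.incrementToken()) {
        System.out.println(term.toString());
      }
      ts.end();
    }
    analyzer.close();
  }
}
```
Running it prints the stems filter, car, floor, which is why a query for "filters" can match a document containing "filter" once both sides go through the same analysis.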

Lucene - exclude fields from being searched

I have a search index and require a lucene query which will conditionally search specified fields. The end result will be that if you're logged into the website, all fields will be searched, or if you're logged out, specified fields will be skipped by modifying the lucene query.
The closest I have at the moment is:
+(term1~ term2~) +_culture:([en-gb TO en-gb] [invariantifieldivaluei TO invariantifieldivaluei]) -FieldToIgnore1:(term1 term2) -FieldToIgnore2:(term1 term2)
The problem with this, however, is that if one of the search terms exists in one of the fields to ignore (FieldToIgnore1 or FieldToIgnore2), then the document is dropped entirely, because it has been excluded once one of the fields to ignore is matched.
How can this be modified so lucene doesn't even match against the fields to ignore?
Instead of qualifying your search via Lucene and the Smart Search Results webpart, have you tried modifying the searchability of the document fields themselves? You can set search parameters on the Page Type or on the index itself.
Go to Page Types --> [your doc type] --> Search fields, and set what fields are and aren't exposed to searching.
Version 9 gives you these settings in the Smart Search app. See these docs for details.
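If you do want to solve it at the query level instead, one option is to build the query only over the fields the current user is allowed to search, rather than adding negative clauses for the fields to ignore; a query that never mentions a field can neither match nor exclude on it. A minimal Lucene sketch in Java with hypothetical field names (the Smart Search API may assemble its query differently):
```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryparser.classic.MultiFieldQueryParser;
import org.apache.lucene.search.Query;

public class FieldScopedQuery {
  public static void main(String[] args) throws Exception {
    // Logged-in users search every field; anonymous users only the public ones.
    boolean loggedIn = false;
    String[] allFields    = { "Title", "Content", "FieldToIgnore1", "FieldToIgnore2" };
    String[] publicFields = { "Title", "Content" };

    MultiFieldQueryParser parser =
        new MultiFieldQueryParser(loggedIn ? allFields : publicFields, new StandardAnalyzer());
    Query query = parser.parse("term1~ term2~");
    System.out.println(query);
    // The ignored fields never appear in the parsed query, so a match in them
    // can neither include nor exclude a document.
  }
}
```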

Automate adding bookmarks to tables and then create an index

I have a program which outputs a collection of tables in a Word document, which I eventually want to post as an HTML file with bookmarks and an index. The tables are grouped by "Name:": there is a 3-row table that contains detailed header information for a section of data, then a second table, which can span multiple pages, that contains the data for that section. There is then a page break so that the next section's header table starts on a new page. This can occur for a variable number of sections, numbering in the hundreds. I need to write a script that:
1. searches my document for "Name:", which is unique and would not appear anywhere but the header table,
2. grabs the text that follows "Name:" within that table cell (for example "Name: Line 1234"),
3. replaces all the blanks in that text string with an underscore to make it a suitable bookmark name,
4. creates a bookmark with that name,
5. goes back and creates an index at the front of the document,
6. saves the file as HTML.
I have a passing familiarity with VBA for Word (I have used it a bit in Excel), but am by no means an expert. I would appreciate any advice on the functions and objects that I should be using for this script.
Hey MikeV, from what I can gather, your problem seems more conceptual than specific. What I mean is, have you started yet? Or are you looking at a blank script page?
I'm relatively new to coding, so I get that myself. What I do is make a list of what I need to do (which you have). Then think of the code or pseudo-code that would go with each step. Then you can start to build your script. You don't have to start with step one (as steps 2/3 are often the more interesting bit), but let's do that.
Now, you need to search for a text string containing "Name:". I am proficient with VBA in Excel, but haven't done anything for Word, so I'd look it up. Googling "VBA find word in word document" will bring you to this page, which shows you how to approach step one. So steal their code, alter it to fit your needs, and move on to step 2. Repeat the process, and that's how you build your algorithm! :)
Just an FYI: typically Stack Overflow is for specific questions with an answer that can be confirmed, whereas you asked for help building an algorithm. I'd reserve those questions for your programming professor or a friend who can help.
cheers

How to allow only one find per document searched on Lucene

I only want my Lucene search to give the highest-scoring highlighted fragment per document. So say I have 5 documents, each containing the word "performance" three times; I still only want 5 results to be printed and highlighted on the results page. How can I go about doing that? Thanks!
You get only one fragment per document returned from the search by calling getBestFragment, rather than getBestFragments.
If your call to search is returning the same documents more than once, you very likely have more than one copy of the same document in your index. Make sure that if you intend to create a new index, you open your IndexWriter with its OpenMode set to IndexWriterConfig.OpenMode.CREATE.
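A minimal sketch of both points, assuming an illustrative field called content and an index directory called index:
```java
import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.QueryScorer;
import org.apache.lucene.store.FSDirectory;

public class OneFragmentPerDoc {
  public static void main(String[] args) throws Exception {
    StandardAnalyzer analyzer = new StandardAnalyzer();

    // Rebuild the index from scratch so no document ends up in it twice.
    IndexWriterConfig config = new IndexWriterConfig(analyzer)
        .setOpenMode(IndexWriterConfig.OpenMode.CREATE);
    IndexWriter writer = new IndexWriter(FSDirectory.open(Paths.get("index")), config);
    // ... add each document exactly once, then close the writer ...
    writer.close();

    // One highlighted fragment per hit: getBestFragment, not getBestFragments.
    Query query = new TermQuery(new Term("content", "performance"));
    Highlighter highlighter = new Highlighter(new QueryScorer(query));
    String storedText = "Performance matters. We measure performance. Performance again.";
    String fragment = highlighter.getBestFragment(analyzer, "content", storedText);
    System.out.println(fragment);
  }
}
```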

Extract MS Word document chapters to SQL database records?

I have a 300+ page Word document containing hundreds of "chapters" (as defined by heading formats), currently indexed by Word. Each chapter contains a moderate amount of text (typically less than a page) and perhaps an associated graphic or two. I would like to split the document up into database records for use in an iPhone program: each chapter would be a record consisting of title, id #, and content fields. I haven't decided yet whether I would want the pictures to be a separate field (probably just containing a file name), or HTML or similar style links in the content text. In any case, the end result would be that I could display a searchable table of titles that the user could click on to pull up any given entry.
The difficulty I am having at the moment is getting from the word document to the database. How can I most easily split the document up into records by chapter, while keeping the image associations? I thought of inserting some unique character between each chapter, saving to text format, and then writing a script to parse the document into a database based on that character, but I'm not sure that I can handle the graphics in this scenario. Other options?
To answer my own question: given a fairly simply formatted Word document,
1. convert it to an Open Office XML document,
2. write a Python script to parse the document into a database using the xml.sax Python module.
Images are inserted into the records as HTML, to be displayed using a web interface.
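For illustration, here is the same SAX idea sketched in Java instead of Python (Java's built-in org.xml.sax exposes the same event model as the xml.sax module): stream through the exported XML, start a new chapter record at every heading element, and collect the character data that follows. The heading element name text:h is an assumption about an OpenOffice-style export, and the actual database insert is left out.
```java
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class ChapterSplitter {

  static class Chapter {
    String title = "";
    StringBuilder content = new StringBuilder();
  }

  public static void main(String[] args) throws Exception {
    List<Chapter> chapters = new ArrayList<>();

    DefaultHandler handler = new DefaultHandler() {
      private boolean inHeading = false;
      private Chapter current;

      @Override
      public void startElement(String uri, String localName, String qName, Attributes attrs) {
        if (qName.equals("text:h")) {       // each heading starts a new chapter record
          current = new Chapter();
          chapters.add(current);
          inHeading = true;
        }
      }

      @Override
      public void endElement(String uri, String localName, String qName) {
        if (qName.equals("text:h")) {
          inHeading = false;
        }
      }

      @Override
      public void characters(char[] ch, int start, int length) {
        if (current == null) {
          return;                           // text before the first heading
        }
        String text = new String(ch, start, length);
        if (inHeading) {
          current.title += text;
        } else {
          current.content.append(text);
        }
      }
    };

    SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
    parser.parse("content.xml", handler);   // the XML file exported from the document

    // Each chapter could now be written to the database as (id, title, content).
    for (int i = 0; i < chapters.size(); i++) {
      System.out.println(i + ": " + chapters.get(i).title);
    }
  }
}
```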