How to allow only one find per document searched on Lucene

I only want my Lucene search to give the highest scoring highlighted fragment per document. So say I have 5 documents with the word "performance" on each one three times, I still only want 5 results to be printed and highlighted to the results page. How can I go about doing that? Thanks!

You get only one fragment per document returned from the search by calling getBestFragment, rather than getBestFragments.
If your call to search is returning the same documents more than once, you very likely have more than one copy of the same document in your index. If you intend to create a new index each time, make sure you open your IndexWriter with its OpenMode set to IndexWriterConfig.OpenMode.CREATE.
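To picture what "one fragment per document" means, here is a small library-free Python sketch. It is not Lucene code and not how the Highlighter scores fragments; the (doc_id, score, text) tuples and the function name are assumptions made purely for illustration.

```python
def best_fragment_per_document(fragments):
    """Keep only the highest-scoring fragment for each document.

    `fragments` is an iterable of (doc_id, score, text) tuples. That shape is
    an assumption for this sketch, not Lucene's data model.
    """
    best = {}
    for doc_id, score, text in fragments:
        # Store the fragment if the document is new or this one scores higher.
        if doc_id not in best or score > best[doc_id][0]:
            best[doc_id] = (score, text)
    return {doc_id: text for doc_id, (_, text) in best.items()}


hits = [
    ("doc1", 0.4, "... <B>performance</B> tuning ..."),
    ("doc1", 0.9, "... <B>performance</B> under load ..."),
    ("doc2", 0.7, "... measuring <B>performance</B> ..."),
]
result = best_fragment_per_document(hits)
```

Keying the result on the document id also hides duplicate hits, but if the duplicates come from a doubled index it is better to fix the index than to filter afterwards.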

Related

Is Lucene capable of finding the location of matches within a document?

Say I have 100 documents indexed in Lucene. I want to search for the term "American Airlines". Lucene runs the search and gives me back 10 documents that contain the term "American Airlines". I now want to be able to go through each of these 10 documents in my UI, and highlight/scroll to each of the matches automatically. These are all html documents with uniquely id-ed paragraph tags, so I can scroll using something like http://docurl#p_120 to scroll to <p id="p_120">American Airlines is a big company.</p>. But how do I get Lucene to tell me what paragraph the term is in, and exactly where it is so I can highlight it?
Your question is about highlighting. You ask how to index a text with subdocuments so that you know the id of the subdocument for highlighting.
IMHO you have three possibilities. But first of all, let me remind you that Lucene can use offsets (positions in the original text) for highlighting
https://lucene.apache.org/core/6_4_0/highlighter/org/apache/lucene/search/highlight/package-summary.html
and that Lucene knows the concept of sub-documents as "blocked child documents", "nested documents" or "embedded documents".
The three possibilities:
Use payloads to store the id of the corresponding subdocument for each occurrence of a term.
Store the offset of each occurrence of a term and be aware of the offset at which each new subdocument begins. Store the ids together with the corresponding offsets in an extra field and use this to look up the id for each hit.
Index the document together with all its subdocuments as extra child documents in a block. Search with http://lucene.apache.org/core/6_4_0/join/index.html?org/apache/lucene/search/join/ToParentBlockJoinCollector.html
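The look-up step of the second possibility (offset to subdocument id) can be sketched without Lucene. The Python below is only an illustration: it assumes the paragraphs were concatenated with a one-character separator before indexing, and the names build_offset_table and paragraph_for_offset are made up for this sketch.

```python
from bisect import bisect_right


def build_offset_table(paragraphs):
    """Return (starts, ids) for paragraphs joined with a 1-character separator."""
    starts, ids, position = [], [], 0
    for paragraph_id, text in paragraphs:
        starts.append(position)
        ids.append(paragraph_id)
        position += len(text) + 1  # +1 accounts for the separator
    return starts, ids


def paragraph_for_offset(starts, ids, hit_offset):
    """Find the id of the paragraph that contains the given character offset."""
    if hit_offset < 0:
        raise ValueError("offset must be non-negative")
    # The last paragraph whose start is <= hit_offset contains the hit.
    return ids[bisect_right(starts, hit_offset) - 1]


paragraphs = [
    ("p_119", "Delta is an airline."),
    ("p_120", "American Airlines is a big company."),
]
indexed_text = "\n".join(text for _, text in paragraphs)
starts, ids = build_offset_table(paragraphs)

# In a real setup the offset would come from the highlighter's term offsets.
hit_offset = indexed_text.find("American Airlines")
paragraph_id = paragraph_for_offset(starts, ids, hit_offset)
```

With the paragraph id in hand, the UI can build a link such as http://docurl#p_120 and highlight the match inside that paragraph.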

Lucene - exclude fields from being searched

I have a search index and require a lucene query which will conditionally search specified fields. The end result will be that if you're logged into the website, all fields will be searched, or if you're logged out, specified fields will be skipped by modifying the lucene query.
The closest I have at the moment is:
+(term1~ term2~) +_culture:([en-gb TO en-gb] [invariantifieldivaluei TO invariantifieldivaluei]) **-FieldToIgnore1:(term1 term2) -FieldToIgnore2:(term1 term2)**
The problem with this, however, is that if one of the search terms exists in one of the fields to ignore (FieldToIgnore1 or FieldToIgnore2), the whole document is dropped from the results, because the exclusion clause for that field was matched.
How can this be modified so lucene doesn't even match against the fields to ignore?
Instead of qualifying your search via Lucene and the Smart Search Results webpart, have you tried modifying the searchability of the document fields themselves? You can set search parameters on the Page Type or on the index itself.
Go to Page Types --> [your doc type] --> Search fields, and set what fields are and aren't exposed to searching.
Version 9 gives you these settings in the Smart Search app. See these docs for details.
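If you do need to handle it at the query level, one option is to name the fields you want searched instead of excluding the ones you don't, so a match in an ignored field can never drop a document. A rough Python sketch that only builds the query string follows. It assumes each field is indexed separately and that the terms are already escaped; the field names Title and Summary are hypothetical, and whether the Smart Search webpart accepts such a query is something you would have to check.

```python
# Hypothetical field lists; replace with the fields in your own index.
PUBLIC_FIELDS = ["Title", "Summary"]
MEMBER_ONLY_FIELDS = ["FieldToIgnore1", "FieldToIgnore2"]


def build_query(terms, logged_in):
    """Build a Lucene query string that only targets the allowed fields."""
    fields = PUBLIC_FIELDS + MEMBER_ONLY_FIELDS if logged_in else PUBLIC_FIELDS
    # Fuzzy-match every term, as in the original query (term1~ term2~).
    fuzzy_terms = " ".join(f"{term}~" for term in terms)
    clauses = " ".join(f"{field}:({fuzzy_terms})" for field in fields)
    return f"+({clauses})"


logged_out_query = build_query(["term1", "term2"], logged_in=False)
logged_in_query = build_query(["term1", "term2"], logged_in=True)
```

The +_culture:(...) clause from the original query can be appended to the result unchanged.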

Create multi-page document from a single page template using doc4j

I am planning to use doc4j for search and replace in a template. I'd like to create a page for each member in a list. Basically, I need to replicate the same page from the template. I have done simple search and replace before. However, this one is a little more complex, and I need some sample examples. Here is my requirement:
I have a docx template which has the content with place holders.
There is a table with 3 columns in it, and I need to replace the placeholders with different values for each column, like first name, last name, etc. The number of rows may vary anywhere from one to 200, so the table may run beyond one page. If it does, I need the table header to repeat on the next page too.
I want to copy the same template onto every page and replace the placeholders. Basically, create a single document with multiple pages, one page for each member.
Please provide me with the example.
Appreciate the help.
Thanks.

Automate adding bookmarks to tables and then create an index

I have a program which outputs a collection of tables in a Word document, which I eventually want to post as an HTML file with bookmarks and an index. The tables are grouped by "Name:": there is a 3-row table that contains detailed header information for a section of data, then a second table, which can span multiple pages, that contains the data for that section. There is then a page break so that the next section's header table is on a new page. This repeats for a variable number of sections, numbering in the hundreds. I need to write a script that
searches my document for "Name:", which is unique and would not appear anywhere but the header table,
grabs the text that follows "Name:" within that table cell (for example "Name: Line 1234"),
replaces all the blanks in that text string with underscores to make it a suitable bookmark name,
creates a bookmark with that name,
goes back and creates an index at the front of the document,
saves the file as HTML.
I have a passing familiarity with VB for Word and have used it a bit in Excel, but I am by no means an expert. I would appreciate any advice on functions and objects that I should be using for this script.
Hey MikeV from what I can gather, your problem seems more conceptual, less specific. What I mean is, have you started yet? Or looking at a blank script page?
I'm relatively new to coding, so I get that myself. What I do is make a list of what I need to do (which you already have). Then think of the code or pseudo-code that would go with each step. Then you can start to build your script. You don't have to start with step one (as step 2/3 is often the more interesting bit), but let's do that.
Now, you need to search for a text string containing "Name:". I am proficient with VBA in Excel, but haven't done anything for Word. So I'd look it up. Googling "VBA find word in word document" will bring you to this page, which shows you how to approach step one. So steal their code, alter it to fit your needs and move on to step 2. Repeat the process, and that's how you build your algorithm! :)
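To show what the pseudo-code stage for steps 2 and 3 might look like, here is a rough sketch in Python rather than VBA (the same logic should port to VBA's Replace function). The function names are made up, and the bookmark rules it enforces (starts with a letter, only letters, digits and underscores, at most 40 characters) are my understanding of Word's limits, so double-check them.

```python
import re


def extract_name(cell_text):
    """Return the text following "Name:" in a header cell, or None if absent."""
    match = re.search(r"Name:\s*(.+)", cell_text)
    return match.group(1).strip() if match else None


def to_bookmark_name(name):
    """Turn free text into a candidate bookmark name (assumed Word rules)."""
    candidate = re.sub(r"\s+", "_", name.strip())  # blanks -> underscores
    candidate = re.sub(r"\W", "", candidate)       # drop other punctuation
    if not candidate or not candidate[0].isalpha():
        candidate = "B_" + candidate               # must start with a letter
    return candidate[:40]                          # assumed length limit


bookmark = to_bookmark_name(extract_name("Name: Line 1234"))
```

Truncating to 40 characters could make two long names collide, so a real script would also need to check for duplicates.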
Just an FYI, StackOverflow is typically for specific questions with an answer that can be confirmed, whereas you asked for help building an algorithm. I'd reserve those questions for a programming professor or a friend who can help.
cheers

Any way to make Word table first row repeat when inside other table?

I'm automating a word document in vb.net. My problem is I need a table within another table to repeat the first row. Is there any way to do this?
The table's text wrapping is set to none, and the first row is the only one that has the "repeat as header row" property set.
I CAN'T take the table out of its containing table. That solution is not an option.
This has nothing to do with the fact that the document is automated too.
Using Word 2010.
I just did a quick test with Word 2010.
I created a table, and checked "repeat as header row at the top of each page".
Sure enough, that worked.
Then I created another table, and cut/pasted the first one into it.
The header was not repeated in the Word UI, although the property remained set.
I had a quick look around for properties which might affect the behaviour of nested tables, but couldn't see any.
Then I googled "word nested table repeat header row", which returns quite a few relevant results.
Conclusion: you can't make a header row in a nested table repeat on subsequent pages. Tricks like putting it inside a content control didn't seem to work either.