Extract terms from query for highlighting - lucene

I'm extracting terms from the query calling ExtractTerms() on the Query object that I get as the result of QueryParser.Parse(). I get a HashTable, but each item present as:
Key - term:term
Value - term:term
Why are the key and the value the same? And more why is term value duplicated and separated by colon?
Do highlighters only insert tags or to do anything else? I want not only to get text fragments but to highlight the source text (it's big enough). I try to get terms and by offsets to insert tags by hand. But I worry if this is the right solution.

I think the answer to this question may help.

It is because .Net 2.0 doesnt have an equivalent to java's HashSet. The conversion to .Net uses Hashtables with the same value in key/value. The colon you see is just the result of Term.ToString(), a Term is a fieldname + the term text, your field name is probably "term".
To highlight an entire document using the Highlighter contrib, use the NullFragmenter

Related

PostgreSQL full text search doesn't work in some case (Django)

I notice that in django when there is a sentence containing PLAZA/MASTERPIECE then when we search masterpiece I can't find this sentence. Is this a limitation of PostgreSQL full text search. Or how to solve this?
finalquery = SearchQuery("keyword")
vector = SearchVector('thefieldIwanttosearch')
self.search_results = self.search_results.annotate(search=vector).filter(search=finalquery).annotate(rank=SearchRank(vector, finalquery))
Is there any document about this? Thanks!
Yes, this is all documented.
When you write filter(search=finalquery) you're not specifying a lookup type.
As a convenience when no lookup type is provided (like in Entry.objects.get(id=14)) the lookup type is assumed to be exact.
So you're filtering on an exact match for "masterpiece". What you probably want is contains or icontains.

Get Text Symbol Programmatically With ID

Is there any way of programmatically getting the value of a Text Symbol at runtime?
The scenario is that I have a simple report that calls a function module. I receive an exported parameter in variable LV_MSG of type CHAR1. This indicates a certain status message created in the program, for instance F (Fail), X (Match) or E (Error). I currently use a CASE statement to switch on LV_MSG and fill another variable with a short description of the message. These descriptions are maintained as text symbols that I retrieve at compile time with text-MS# where # is the same as the possible returns of LV_MSG, for instance text-MSX has the value "Exact Match Found".
Now it seems to me that the entire CASE statement is unnecessary as I could just assign to my description variable the value of the text symbol with ID 'MS' + LV_MSG (pseudocode, would use CONCATENATE). Now my issue is how I can find a text symbol based on the String representation of its ID at runtime. Is this even possible?
If it is, my code would look cleaner and I wouldn't have to update my actual code when new messages are added in the function module, as I would simply have to add a new text symbol. But would this approach be any faster or would it in fact degrade the report's performance?
Personally, I would probably define a domain and use the fixed values of the domain to represent the values. This way, you would even get around the string concatenation. You can use the function module DD_DOMVALUE_TEXT_GET to easily access the language-dependent text of a domain value.
To access the text elements of a program, use a function module like READ_TEXT_ELEMENTS.
Be aware that generic programming like this will definitely slow down your program. Whether it would make your code look cleaner is in the eye of the beholder - if the values change rarely, I don't see why a simple CASE statement should be inferior to some generic text access.
Hope I understand you correctly but here goes. This is possible with a little trickery, all the text symbols in a report are defined as variables in the program (with the name text-abc where abc is the text ID). So you can use the following:
data: lt_all_text type standard table of textpool with default key,
lsr_text type ref to textpool.
"Load texts - you will only want to do this once
read textpool sy-repid into lt_all_text language sy-langu.
sort lt_all_Text by entry.
"Find a text, the field KEY is the text ID without TEXT-
read table lt_all_text with key entry = i_wanted_text
reference into lsr_text binary search.
If you want the address you can add:
field-symbols: <l_text> type any.
data l_name type string.
data lr_address type ref to data.
concatenate 'TEXT-' lsr_text->key into l_name.
assign (l_name) to <l_text>.
if sy-subrc = 0.
get reference of <l_text> into lr_address.
endif.
As vwegert pointed out this is probably not the best solution, for error handling rather use message classes or exception objects. This is useful in other cases though so now you know how.

Find All in a Textbox

I am working on an application to search for and build a list of all the times a string (or variable of) is in a text file. Kind of like a Find All function in a text editor that I can build a list with the info that is found, such as
S350
S250
S270
S5000
What can I use to do this search? It will have one value that does not change (The S in this case) followed by up to 4 digits
RegEx seems like a good choice.
Something like.. S(\d{1,4})? might work for you.
Expresso is my preferred regular expression composer.

Lucene Indexing to ignore apostrophes

I have a field that might have apostrophes in it.
I want to be able to:
1. store the value as is in the index
2. search based on the value ignoring any apostrophes.
I am thinking of using:
doc.add(new Field("name", value, Store.YES, Index.NO));
doc.add(new Field("name", value.replaceAll("['‘’`]",""), Store.NO, Index.ANALYZED));
if I then do the same replace when searching I guess it should work and use the cleared value to index/search and the value as is for display.
am I missing any other considerations here ?
Performing replaceAll directly on the value its a bad practice in Lucene, since it would a much better practice to encapsulate your tokenization recipe in an Analyzer. Also I don't see the benefit of appending fields in your use case (See Document.add).
If you want to Store the original value and yet be able to search without the apostrophes simply declare your field like this:
doc.add(new Field("name", value, Store.YES, Index.ANALYZED);
Then simply hook up a custom Tokenizer that will replace apostrophes (I think the Lucene's StandardAnalyzer already includes this transformation).
If you are storing the field with the aim of using highlighting you should also consider using Field.TermVector.WITH_POSITIONS_OFFSETS.

Change Url using Regex

I have url, for example:
http://i.myhost.com/myimage.jpg
I want to change this url to
http://i.myhost.com/myimageD.jpg.
(Add D after image name and before point)
i.e I want add some words after image name and before point using regex.
What is the best way do it using regex?
Try using ^(.*)\.([a-zA-Z]{3,5}) and replacing with \1D\2. I'm assuming the extension is 3-5 alphanumeric numbers but you can modify it to suit. E.g. if it's just jpg images then you can put that instead of the [a-zA-Z]{3,5}.
Sounds like a homework question given the solution must use a regex, on that assumption here is an outline to get you going.
If all you have is a URL then #mathematical.coffee's solution will suit. However if you have a chunk of text within which is one or more URLs and you have to locate and change just those then you'll need something a little more involved.
Look at the structure of a URL: {protocol}{address}{item}; where
{protocol} is "http://", "ftp://" etc.;
{address} is a name, e.g. "www.google.com", or a number, e.g. "74.125.237.116" - there will always be at least one dot in the address; and
{item} is "/name" where name is quite flexible - there will be zero or more items, you can think of them as directories and a file but this isn't strictly true. Also the sequence of items can end in a "/" (including when there are zero of them).
To make a regex which matches a URL start by matching each part. In the case of the items you'll want to match the last in the sequence separately - you'll have zero or more "directories" and one "file", the latter must be of the form "name.extension".
Once you have regexes for each part you just concatenate them to produce a regex for the whole. To form the replacement pattern you can surround parts of your regex with parentheses and refer to those parts using \number in the replacement string - see #mathematical.coffee's solution for an example.
The best way to learn regexs is to use an editor which supports them and just experiment. The exact syntax may not be the same as NSRegularExpression but they are mostly pretty similar for the basic stuff and you can translate from one to another easily.