Lucene 5.0.0 - search string with special characters

I am using Lucene version 5.0.0.
In my search string, there is a minus character like “test-”.
I read that the minus sign is a special character in Lucene. So I have to escape that sign, as in the queryparser documentation:
Escaping Special Characters:
Lucene supports escaping special characters that are part of the query syntax. The current list special characters are:
+ - && || ! ( ) { } [ ] ^ " ~ * ? : \ /
To escape these character use the \ before the character. For example to search for (1+1):2 use the query:
\(1\+1\)\:2
To do that I use the QueryParser.escape method:
query = parser.parse(QueryParser.escape(searchString));
I use the classic Analyzer because I noticed that the standard Analyzer has some problems with escaping special characters.
The problem is that the Parser deletes the special characters and so the Query has the term
content:test
How can I set up the parser and searcher to search for the real value “test-”?
I also created my own query with the content test- but that didn't work either. I received 0 results, although my index has entries like:
Test-VRF
Test-IPLS
I am really confused about this problem.

While escaping special characters for the queryparser deals with part of the problem, it doesn't help with analysis.
Neither classic nor standard analyzer will keep punctuation in the indexed form of the field. For each of these examples, the indexed form will be in two terms:
test and vrf
test and ipls
This is why a manually constructed query for "test-" finds nothing. That term does not exist in the index.
The goal of these analyzers is to index words. As such, punctuation is mostly eliminated and is not searchable. Phrase queries for "test vrf", "test-vrf" and "test_vrf" are all effectively identical. If that is not what you need, you'll need to look at other analyzers.
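To make the effect of analysis concrete, here is a rough Python sketch. It is not Lucene's tokenizer: the regex split and the names simulate_word_analyzer and simulate_keyword_analyzer are assumptions made up for illustration. It only mimics the behaviour described above (punctuation dropped, terms lowercased) and contrasts it with a field whose whole value is kept as one term.

```python
import re


def simulate_word_analyzer(text):
    """Rough stand-in for a word-oriented analyzer (NOT Lucene's real code):
    keep runs of ASCII letters/digits, drop everything else, lowercase."""
    return [token.lower() for token in re.findall(r"[A-Za-z0-9]+", text)]


def simulate_keyword_analyzer(text):
    """Stand-in for a field that is not analyzed: the whole value is one term."""
    return [text]


# What ends up in the (simulated) index for an analyzed field
indexed_terms = simulate_word_analyzer("Test-VRF")      # ['test', 'vrf']

# A hand-built term "test-" is compared against those terms as-is
manual_term_found = "test-" in indexed_terms            # False

# With a not-analyzed field the hyphen survives in the single indexed term
keyword_terms = simulate_keyword_analyzer("Test-VRF")   # ['Test-VRF']
```

The term test- never appears among the analyzed terms, which matches the 0 results reported in the question.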

The way to fix this issue is to store the field value as NOT_ANALYZED, so that it is indexed as a single term with its punctuation intact:
Field fieldType = new Field(key.toLowerCase(),value, Field.Store.YES, Field.Index.NOT_ANALYZED);
Anyone with the same problem needs to pay attention to how the content is stored in the index.
To query for the value, escape the search string first:
searchString = QueryParser.escape(searchString);
and parse it with an analyzer that keeps punctuation, for example a WhitespaceAnalyzer.
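For readers who want to see what the escaping step does to a string, here is a minimal Python sketch. It is an illustration only, not the library code: the function name escape_query_text and the constant LUCENE_SPECIAL_CHARS are hypothetical, and the character set is an assumption taken from the special-character list quoted in the question.

```python
# Characters treated as query syntax, per the list quoted in the question.
# (Assumption: single & and | are escaped too, so that && and || are covered.)
LUCENE_SPECIAL_CHARS = set('\\+-!():^[]"{}~*?|&/')


def escape_query_text(text):
    """Put a backslash in front of every query-syntax character."""
    return "".join("\\" + ch if ch in LUCENE_SPECIAL_CHARS else ch for ch in text)


escaped_example = escape_query_text("(1+1):2")   # the documentation example
escaped_search = escape_query_text("test-")      # the search string from the question
```

Escaping only stops the parser from reading the minus sign as an operator. Which term is finally looked up still depends on the analyzer handed to the parser.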

Related

Lucene query syntax in Kibana

I'm struggling to search for a simple phrase using Lucene syntax in Kibana.
We have logs that look like the following lines:
API :: GetStatus :: MP181210.1524.O47211 :: Not found.
API :: GetStatus :: MP181210.1144.V12345 :: Found - some random stuff here.
I want to find all the lines that have "Found - " in them, so I figured (since hyphen is a reserved symbol) that I should search for:
"API :: GetStatus ::" AND "Found \-"
However, that for some reason just ignores the trailing hyphen and these are the results I get
Can anyone point me in the right direction?

The problem isn't really your query syntax (hyphens are not reserved characters when quoted in a phrase, by the way, so escaping wouldn't be necessary). Lucene analyzes its input into tokens, or terms in Lucene parlance, which it indexes and makes searchable. The default analyzer (and most analyzers, really) tries to tokenize the input into words. The hyphen is treated as punctuation, so it is not indexed and is not searchable. In order to search for it, you would need to change your analyzer and reindex.

Issue with using wildcards with lucene .net QueryParser

I have following code for Lucene .Net search:
If I use query like:
AccountId:1 AND CompanyId:1 AND CreatedOn:[636288660000000000 TO 636315443990000000] AND AuditProperties.FriendlyName.NewValue:CustomerId|235
It works fine with exact match with CustomerId = 235.
However, if I try to search for a wildcard match like for example:
AccountId:1 AND CompanyId:1 AND CreatedOn:[636288660000000000 TO 636315443990000000] AND AuditProperties.FriendlyName.NewValue:CustomerId|*235*
it doesn't return any results. I think it is still going for an exact match with the value "*235*". Am I missing anything here?
Thanks!

As per the QueryParser syntax documentation, the character | is not supported. However, it is not very clear whether you intended it to be a logical OR or a literal character.
Logical OR
The correct syntax for logical OR is either CustomerId OR *235*, CustomerId *235* or CustomerId||*235*.
Also, if this is meant to be a logical OR, you have to allow for a leading wildcard character as pointed out in Howto perform a 'contains' search rather than 'starts with' using Lucene.Net.
parser.AllowLeadingWildcard = true;
Literal |
To search for a literal pipe character, you should escape the character so the parser doesn't confuse it with a logical OR.
CustomerId\|*235*

howto cut text from specific character in sqlite query

SQLITE Query question:
I have a query which returns string with the character '#' in it.
I would like to remove all characters after this specific character '#':
select field from mytable;
result :
text#othertext
text2#othertext
text3#othertext
So in my sample I would like to create a query which only returns :
text
text2
text3
I tried something with instr() to get the index, but instr() was not recognized as a function -> SQL Error: no such function: instr (probably old version of db . sqlite_version()-> 3.7.5).
Any hints on how to achieve this?

There are two approaches:
You can rtrim the string of all characters other than the # character.
This assumes, of course, that (a) there is only one # in the string; and (b) that you're dealing with simple strings (e.g. 7-bit ASCII) in which it is easy to list all the characters to be stripped.
You can use sqlite3_create_function to create your own rendition of INSTR. The specifics here will vary a bit upon how you're using
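The rtrim approach can be sketched as follows, using Python's built-in sqlite3 module only as a convenient way to run the SQL. The helper name strip_after_hash and the chosen character set (ASCII letters, digits and space) are assumptions for illustration; the table and column names come from the question. The inner rtrim strips every listed trailing character and stops at the #, and the outer rtrim then removes the # itself.

```python
import sqlite3
import string


def strip_after_hash(values):
    """Return each value cut at '#', using only SQLite's two-argument rtrim()."""
    # Every character that may follow '#', but NOT '#' itself (assumed simple ASCII)
    strip_chars = string.ascii_letters + string.digits + " "
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE mytable (field TEXT)")
        conn.executemany("INSERT INTO mytable (field) VALUES (?)",
                         [(v,) for v in values])
        rows = conn.execute(
            "SELECT rtrim(rtrim(field, ?), '#') FROM mytable ORDER BY rowid",
            (strip_chars,),
        ).fetchall()
    finally:
        conn.close()
    return [row[0] for row in rows]


trimmed = strip_after_hash(["text#othertext", "text2#othertext", "text3#othertext"])
```

Note the caveats stated above: a value with no # at all would be trimmed away entirely if it consists only of the listed characters, and a second # in the string changes where the trimming stops.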

Zend Lucene and hyphens - searching part numbers

I have set up Zend Lucene to search products_name and part_number.
This works well, however there are issues with hyphenated part numbers.
For example, if I have the part number: 5130193-00
This will return any part number with '00' at the end.
How can I make Lucene only return the exact part number?
I am using Zend_Search_Lucene_Analysis_Analyzer::setDefault(new Zend_Search_Lucene_Analysis_Analyzer_Common_TextNum_CaseInsensitive()); when indexing and searching (CaseInsensitive does not work, but that's another issue) and the part numbers are indexed as Text.

Try escaping the dash with a backslash: part_number:5130193\-00.
More information is available here (see Escaping Special Characters).

Approximate search with openldap

I am trying to write a search that queries our directory server running openldap.
The users are going to be searching using the first or last name of the person they're interested in.
I found a problem with accented characters (like áéíóú), because first and last names are written in Spanish, so while the proper way is Pérez it can be written for the sake of the search as Perez, without the accent.
If I use '(cn=*Perez*)' I get only the non-accented results.
If I use '(cn=*Pérez*)' I get only accented results.
If I use '(cn=~Perez)' I get weird results (or at least nothing I can use, because while the results contain both Perez and Pérez occurrences, I also get some results that apparently have nothing to do with the query).
In Spanish this happens quite a lot. Be it laziness or whatever you want to call it, the fact is that for this kind of thing people tend NOT to write the accents, because it's assumed all these searches work with both options (I guess since Google allows it, everybody assumes it's supposed to work that way).
Other than updating the database and removing all accents and trimming them on the query... can you think of another solution?

You have your ~ and = swapped above. It should be (cn~=Perez). I still don't know how well that will work. Soundex has always been strange. Since many attributes are multi-valued including cn you could store a second value on the attribute that has the extended characters converted to their base versions. You would at least have the original value to still go off of when you needed it. You could also get real fancy and prefix the converted value with something and use the valuesReturnFilter to filter it out from your results.
#Sample object
dn:cn=Pérez,ou=x,dc=y
cn:Pérez
cn:{stripped}Perez
sn:Pérez
#etc.
Then modify your query to use an or expression.
(|(cn=Pérez)(cn={stripped}Perez))
And you would include a valuesReturnFilter that looked like
(!(cn={stripped}*))
See RFC3876 http://www.networksorcery.com/enp/rfc/rfc3876.txt for details. The method for adding a request control varies by what platform/library you are using to access the directory.
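As a sketch of how the second, accent-free value and the OR filter could be produced, here is a small Python example using only the standard library. The function names are hypothetical, and the {stripped} marker is simply the prefix from the sample object above. It assumes that removing Unicode combining marks is an acceptable definition of "base version", which also turns ñ into n.

```python
import unicodedata

STRIPPED_PREFIX = "{stripped}"  # marker taken from the sample object above


def strip_accents(text):
    """Decompose characters (NFD) and drop the combining accent marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def stripped_cn_value(name):
    """Second cn value to store next to the original one."""
    return STRIPPED_PREFIX + strip_accents(name)


def build_name_filter(name):
    """OR filter matching the original value or the stored stripped value.
    NOTE: 'name' is not escaped here; escape user input before real use."""
    return "(|(cn=" + name + ")(cn=" + stripped_cn_value(name) + "))"
```

Because the stripped value is computed the same way at write time and at query time, a user typing Perez or Pérez ends up matching the same stored value.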

Search filters ("queries") were originally specified by RFC 2254, which has since been replaced by RFC 4515.
Encoding:
RFC 2254 (indirectly) requires filters to be an OCTET STRING, i.e. a string of 8-bit bytes: AttributeValue is an OCTET STRING, and MatchingRuleId and AttributeDescription are LDAPString, which is itself an OCTET STRING.
The standard on escaping: use "\<ASCII HEX NUMBER>" to replace special characters
(https://www.rfc-editor.org/rfc/rfc4515#page-4, examples https://www.rfc-editor.org/rfc/rfc4515#page-5).
Quote:
The <valueencoding> rule ensures that the entire filter string is a
valid UTF-8 string and provides that the octets that represent the
ASCII characters "*" (ASCII 0x2a), "(" (ASCII 0x28), ")" (ASCII
0x29), "\" (ASCII 0x5c), and NUL (ASCII 0x00) are
represented as a backslash "\" (ASCII 0x5c) followed by the two hexadecimal digits
representing the value of the encoded octet.
Additionally, you should probably replace all characters that semantically modify the filter (RFC 4515's grammar gives a list), and do a Regex replace of non-ASCII characters with wildcards (*) to be sure. This will also help you with characters like "é".
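The escaping rule quoted above is small enough to sketch directly. This is a minimal Python version covering only the five characters named in the quote; the function name escape_filter_value is hypothetical, and many LDAP client libraries ship their own equivalent that may be preferable.

```python
# Replacements for the characters named in the RFC 4515 quote:
# backslash followed by the two hex digits of the octet.
_FILTER_ESCAPES = {
    "\\": "\\5c",
    "*": "\\2a",
    "(": "\\28",
    ")": "\\29",
    "\x00": "\\00",
}


def escape_filter_value(value):
    """Escape a user-supplied value before placing it inside a search filter."""
    return "".join(_FILTER_ESCAPES.get(ch, ch) for ch in value)


# Wildcards that are meant as wildcards are added AFTER escaping the user input
user_input = "Perez (Jr)*"
search_filter = "(cn=*" + escape_filter_value(user_input) + "*)"
```

Escaping first and adding the intended wildcards afterwards keeps a * typed by the user from being interpreted as a wildcard.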
