Special Characters that can't be indexed using lucene

I know the list of special characters that can be indexed using Apache Lucene. Can someone tell me whether there are any special characters that cannot be indexed with the Apache Lucene library?

From: https://lucene.apache.org/core/2_9_4/queryparsersyntax.html#Escaping%20Special%20Characters
Lucene supports escaping special characters that are part of the query syntax. The current list special characters are
+ - && || ! ( ) { } [ ] ^ " ~ * ? : \
So basically it looks like you can index anything; you just have to escape these characters when they appear in a query.
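As a rough illustration of what escaping means here, the following is a minimal Python sketch that prefixes each character from the list quoted above with a backslash. The helper name escape_lucene_query is made up for this example; in Java, Lucene's own QueryParser.escape does this job, and newer Lucene versions also treat / as special.

```python
# Characters treated as query syntax, taken from the list quoted above.
# "&&" and "||" are handled by escaping each "&" and "|" individually.
LUCENE_SPECIAL_CHARS = set('+-&|!(){}[]^"~*?:\\')


def escape_lucene_query(text):
    """Hypothetical helper: put a backslash before every query-syntax character."""
    return "".join("\\" + ch if ch in LUCENE_SPECIAL_CHARS else ch for ch in text)


# The documentation's own example: searching for (1+1):2
print(escape_lucene_query("(1+1):2"))  # prints \(1\+1\)\:2
```

Note that this only protects the text from the query parser. Whether a character actually ends up in the index depends on the analyzer used at index time.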

Related

Include certain escapement symbols into ANTLR Lexer rules

I'm creating a parser in ANTLR4 and Python. Below are the lexer rules I created in ANTLR.
VARIABLE_ID : [$][a-zA-Z][a-zA-Z0-9_]*;
ARRAY_ID : [*][a-zA-Z][a-zA-Z0-9_]*;
STRINGCONST : ["][/|:.a-zA-Z0-9 ]+["];
WS : [ \r\t\f\n]+ -> skip;
I am looking at the STRINGCONST rule and trying to add symbols such as - and ~. However, since they are escape characters, ANTLR just throws errors. I've tried escaping them with themselves and haven't been able to get that to work.
Is there a way to include them in the STRINGCONST rule? The basic idea is that a string should be any characters between two " marks; however, I'm happy to limit it to what's currently in the rule as long as I can get - and ~ in there as well.

You can escape chars by adding a \ in front of them:
STRINGCONST : ["] [/|:.a-zA-Z0-9 \-~]+ ["];
And note that ~ has no special meaning inside a char class (only outside of them), so ~ doesn't need to be escaped.
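Character classes in Python's re module follow the same convention for the hyphen, so the corrected set can be tried out quickly outside of ANTLR. This is only an analogue written for illustration (the names STRINGCONST and is_stringconst are assumptions here), not ANTLR code:

```python
import re

# Python analogue of the corrected STRINGCONST rule.
# "\-" is a literal hyphen inside the set; "~" needs no escape there.
STRINGCONST = re.compile(r'"[/|:.a-zA-Z0-9 \-~]+"')


def is_stringconst(text):
    """Return True if the whole text is one quoted string made of allowed characters."""
    return STRINGCONST.fullmatch(text) is not None


print(is_stringconst('"a-b ~/tmp"'))  # hyphen and tilde are accepted
print(is_stringconst('"a_b"'))        # underscore is not in the set
```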

Remove special characters and alphabets from a string except number in sql query in db2

Hi, I tried using REGEXP_REPLACE and it is still not working.
select CASE WHEN sbbb <> ' ' THEN regexp_replace(sbbb,'[a-zA-Z _-#]','')
ELSE sbbb
END AS ABCDF
from Table where sccc=1;
This is the query I am using to remove letters and special characters from the string and keep only the numbers, but it does not work. The query returns the complete string with numbers, letters, and special characters. What is wrong with the above query?
I am working on a SQL query. There is a column in the database that contains letters, special characters, and numbers. I want to keep only the numbers and remove all the special characters and letters. How can I do this in a DB2 query? If I use PATINDEX, it does not work. Please help.

The allowed regular expression patterns are listed on this page
Regular expression control characters
Outside of a set, the following must be preceded with a backslash to be treated as a literal
* ? + [ ( ) { } ^ $ | \ . /
Inside a set, the following must be preceded with a backslash to be treated as a literal
Characters that must be quoted to be treated as literals are [ ] \
Characters that might need to be quoted, depending on the context are - &
So for you, this should work
regexp_replace(sbbb,'[a-zA-Z _\-#]','')
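Note that this pattern removes only the characters it lists. If the goal is to keep nothing but digits, a negated set may be simpler. The difference can be sketched in Python, whose re module treats these two sets the same way; the function names are invented for this example, and the DB2 equivalent of the second pattern would be an assumption to verify against your DB2 version:

```python
import re


def strip_listed(value):
    """Remove letters, space, '_', '-' and '#', as in the corrected pattern."""
    return re.sub(r"[a-zA-Z _\-#]", "", value)


def keep_digits(value):
    """Remove everything that is not a digit, using a negated set."""
    return re.sub(r"[^0-9]", "", value)


print(strip_listed("a1@2"))  # "@" is not in the listed set, so it survives
print(keep_digits("a1@2"))   # only the digits remain
```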

Solr regular expression query error for starting with u

I am using Solr 3. I can search starting with attributeValue:\hin* but it fails for attributeValue:\uo*
The error is
"error": {
  "msg": "org.apache.solr.search.SyntaxError: Non-hex character in Unicode escape sequence: o",
  "code": 400
}
The issue is the \u. I cannot exclude u, because the user can search for anything from type+search.

When you search for something starting with \u, the query parser treats it as the start of a Unicode escape sequence (\u followed by four hexadecimal digits). Since o is not a hexadecimal digit, the parser rejects the query. If you want to search for a literal \ you need to escape it. More info on this:
Lucene/Solr supports escaping special characters that are part of the query
syntax. The current list special characters are
+ - && || ! ( ) { } [ ] ^ " ~ * ? : \ /
To escape these character use the \ before the character
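For a type-ahead style search, one way to apply this is to escape whatever the user typed on the client side and only then append the wildcard. The sketch below assumes the query string is built in Python before being sent to Solr; build_prefix_query and SOLR_SPECIAL_CHARS are names made up for this example, and whitespace in the input is not handled:

```python
# Special characters from the list quoted above, including "/".
SOLR_SPECIAL_CHARS = set('+-&|!(){}[]^"~*?:\\/')


def build_prefix_query(field, user_input):
    """Hypothetical helper: escape the raw user input, then add the trailing wildcard."""
    escaped = "".join(
        "\\" + ch if ch in SOLR_SPECIAL_CHARS else ch for ch in user_input
    )
    return field + ":" + escaped + "*"


# The user typed a backslash followed by "uo"; the backslash gets doubled,
# so the parser no longer sees the start of a Unicode escape.
print(build_prefix_query("attributeValue", "\\uo"))  # prints attributeValue:\\uo*
```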

Lucene 5.0.0 - search string with special characters

I am using Lucene version 5.0.0.
In my search string, there is a minus character like “test-”.
I read that the minus sign is a special character in Lucene. So I have to escape that sign, as in the queryparser documentation:
Escaping Special Characters:
Lucene supports escaping special characters that are part of the query syntax. The current list special characters are:
+ - && || ! ( ) { } [ ] ^ " ~ * ? : \ /
To escape these character use the \ before the character. For example to search for (1+1):2 use the query:
\(1\+1\)\:2
To do that I use the QueryParser.escape method:
query = parser.parse(QueryParser.escape(searchString));
I use the classic Analyzer because I noticed that the standard Analyzer has some problems with escaping special characters.
The problem is that the Parser deletes the special characters and so the Query has the term
content:test
How can I set up the parser and searcher to search for the real value “test-”?
I also created my own query with the content test- but that also didn’t work. I received 0 results, but my index has entries like:
Test-VRF
Test-IPLS
I am really confused about this problem.

While escaping special characters for the queryparser deals with part of the problem, it doesn't help with analysis.
Neither the classic nor the standard analyzer will keep punctuation in the indexed form of the field. For each of these examples, the indexed form will be two terms:
test and vrf
test and ipls
This is why a manually constructed query for "test-" finds nothing. That term does not exist in the index.
The goal of these analyzers is to attempt to index words. As such, punctuation is mostly eliminated, and is not searchable. Phrase queries for "test vrf", "test-vrf", or "test_vrf" are all effectively identical. If that is not what you need, you'll need to look to other analyzers.
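To make the effect concrete, here is a deliberately simplified Python model of a word-oriented analyzer: it splits on anything that is not a letter or digit and lowercases the result. The real Lucene analyzers use more elaborate tokenization rules, so treat simple_analyze as an illustrative assumption rather than a description of their actual behavior:

```python
import re


def simple_analyze(text):
    """Rough stand-in for a word-oriented analyzer: split on non-alphanumerics, lowercase."""
    return [token.lower() for token in re.findall(r"[A-Za-z0-9]+", text)]


print(simple_analyze("Test-VRF"))  # ['test', 'vrf']
print(simple_analyze("test-"))     # ['test']
```

In this model the hyphen never reaches the index, which is why a term such as test- has nothing to match against.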

The fix for this issue is to store the value as NOT_ANALYZED.
Field fieldType = new Field(key.toLowerCase(),value, Field.Store.YES, Field.Index.NOT_ANALYZED);
Anyone who has the same problem needs to take care with how the content is stored in the index.
To query for the result, escape the search string like this
searchString = QueryParser.escape(searchString);
and parse it with, for example, a WhitespaceAnalyzer.
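What a non-analyzed field implies for matching can be sketched with a small pure-Python model. It does not use any Lucene API; indexed_terms, exact_matches and prefix_matches are illustrative names. Each value is kept as a single term with its case and punctuation intact, so lookups are exact and case-sensitive:

```python
# Terms as they would sit in the index when the field is not analyzed:
# the whole value is one term, with case and punctuation preserved.
indexed_terms = ["Test-VRF", "Test-IPLS"]


def whitespace_tokens(text):
    """Split on whitespace only, so hyphens and case in the query text survive."""
    return text.split()


def exact_matches(term, terms):
    """Model of a term query: only an identical term matches."""
    return [t for t in terms if t == term]


def prefix_matches(prefix, terms):
    """Model of a prefix query: terms that start with the given text."""
    return [t for t in terms if t.startswith(prefix)]


print(exact_matches("Test-VRF", indexed_terms))  # ['Test-VRF']
print(exact_matches("test-", indexed_terms))     # []
print(prefix_matches("Test-", indexed_terms))    # ['Test-VRF', 'Test-IPLS']
```

In this model, finding both entries from the input Test- takes a prefix-style lookup with matching case; an exact lookup for test- still returns nothing.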

Regex in Postgres - not doing what I'm trying to do (newbie question)

I have this regex in a query in Postgres and I cannot figure out why it is not matching anything after the text specified in the regex.
The idea is to remove the last part, including the separator characters in between.
I have records like these to match:
Villa hermosa, Pilar, PCIA. BS. AS.
Esmeralda - Pilar - BUENOS AIRES.
San Martin, BUENOS AIRES.-
and I'm using this expression:
regexp_replace(location,
'([,\s\.-]*PCIA. BS. AS[,\s\.-]*|
[,\s\.-]*BUENOS. AIRES[,\s\.-]*$|
[,\s\.-]*BS. AS[,\s\.-]*$|
[,\s\.-]*P.B.A[,\s\.-]*$)', '' )
This matches the text itself (PCIA, BUENOS, ...), but it does not match the ',', '.', '-', or spaces after the word. I need help finding where the problem is.

Double your backslashes. \ => \\
Postgres thinks you're doing escapes on the string itself.
In newer PostgreSQL versions where standard_conforming_strings is on by default it is no longer necessary to double backslashes unless you're using an E'string' or have explicitly set standard_conforming_strings to off.
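To check the pattern itself independently of Postgres string escaping, it can be tried in Python, where a raw string passes the backslashes through to the regex engine unchanged. This is a sketch under a few assumptions: the alternatives are written on one line and share a single trailing $, the dots in the abbreviations are escaped, and the names TRAILING_PROVINCE and strip_province are invented for the example:

```python
import re

# Separators, then one of the province spellings, then separators up to the end.
TRAILING_PROVINCE = re.compile(
    r"[,\s.\-]*(?:PCIA\. BS\. AS|BUENOS AIRES|BS\. AS|P\.B\.A)[,\s.\-]*$"
)


def strip_province(location):
    """Remove the trailing province text together with the separators around it."""
    return TRAILING_PROVINCE.sub("", location)


for record in (
    "Villa hermosa, Pilar, PCIA. BS. AS.",
    "Esmeralda - Pilar - BUENOS AIRES.",
    "San Martin, BUENOS AIRES.-",
):
    print(strip_province(record))
```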
