I am using following tokenizer and filters while indexing my data.
Problem: i have some field which have just a single special character as content like $. Now standard tokenizer stripped it out. But if i use white space tokenizer it doesn't split the sentence at any punctuation characters.
solr.StandardTokenizerFactory
solr.StopFilterFactory
solr.ASCIIFoldingFilterFactory
solr.StandardFilterFactory
solr.LowerCaseFilterFactory
solr.PatternReplaceFilterFactory
solr.SnowballPorterFilterFactory
solr.EdgeNGramFilterFactory
Related
The SPARQL function ENCODE_FOR_URI escapes all except unreserved URI characters in the input. How do I change it to ignore certain (non-ASCII characters for use in IRI for example) characters?
This is a non-standard solution, as it requires additional regex support (lookahead) beyond what the SPARQL specification mandates, but it works for some triple stores/SPARQL engines (e.g. Wikidata). Here's the full solution: it also requires to pick a character that should not (and cannot) be replaced (_ in this case) and a character not present in the input (\u0000 cannot be stored in RDF so this is a good pick)
BIND("0/1&2]3%4#5_" AS ?text)
BIND(REPLACE(?text, "[^\u0001-\u005E\u0060-\u007F]+", "") AS ?filtered) # the characters to keep
BIND(REPLACE(?filtered, "(.)(?=.*\\1)", "", "s") AS ?shortened) # leaves only one of each character
BIND(REPLACE(?shortened, "(.)", "_$1", "s") AS ?separated) # separates the characters via _
BIND(CONCAT(?separated, ENCODE_FOR_URI(?separated)) AS ?encoded) # appends the encoded variant after it
BIND(CONCAT("_([^_]*)(?=(?:_[^_]*){", STR(STRLEN(?shortened) - 1), "}_([^_]*))?") AS ?regex)
BIND(REPLACE(?encoded, ?regex, "$1$2\u0000", "s") AS ?replaced) # groups the character and replacement together, separated by \u0000
BIND(REPLACE(?shortened, "([-\\]\\[])", "\\\\$1") AS ?class) # converts the remaining characters to a valid regex class
BIND(CONCAT(?text, "\u0000", ?replaced) AS ?prepared) # appends the replacement groups after the original text
BIND(CONCAT("([", ?class, "])(?=.*?\u0000\\1([^\u0000]*))|\u0000.*") AS ?regex2)
BIND(REPLACE(?prepared, ?regex2, "$2", "s") AS ?result) # replaces each occurrence of the character by its replacement in the group at the end
If you know the precise replacements beforehand, only the last 3 lines are necessary, to form the string.
I have some special characters (e.g #) between my text and I don't want that these character treated as separators. I have added these characters to charset_table:
charset_table = 0..9, english, _, #
I also used this format (U+23) but it didn't work. How can I index this characters?
I am trying to write a query which will tell me if certain record is having only the special characters. e.g- "%^&%&^%&" will error however "%HH678*(*))" is fine (as it's having alphanumeric values as well. I have written following query however, it's working fine only for English alphabets and numbers, if column is having some other characters like mandarin then also it's not giving expected value.Any help is highly appreciated.
SELECT * FROM test WHERE REGEXP_LIKE(sampletext, '[^]^A-Z^a-z^0-9^[^.^{^}^ ]' );
You may try this,
regexp_like(text, '^[^A-Za-z0-9]+$')
This would match the text only if the input text contains special chars ie, only chars which are not of letters or digits.
To detect strings containing only characters other than unaccented alphabetic, numeric, or white space characters try this:
regexp_like(text,'^[^a-zA-Z0-9[:space:]]+$')
If you don't think punctuation characters are special than add [:punct:] to the class of characters to ignore.
If you are looking for a specific set of characters you can use a character class of just those characters of interest, for example some common accented characters (note the lack of a leading ^ in the character class []:
regexp_like(text,'^[àèìòùÀÈÌÒÙáéíóúýÁÉÍÓÚÝâêîôûÂÊÎÔÛãñõÃÑÕäëïöüÿÄËÏÖÜŸçÇßØøÅåÆæœ]+$')
I am using Lucene version 5.0.0.
In my search string, there is a minus character like “test-”.
I read that the minus sign is a special character in Lucene. So I have to escape that sign, as in the queryparser documentation:
Escaping Special Characters:
Lucene supports escaping special characters that are part of the query syntax. The current list special characters are:
- + - && || ! ( ) { } [ ] ^ " ~ * ? : \ /`
To escape these character use the \ before the character. For example to search for (1+1):2 use the query:
\(1\+1\)\:2
To do that I use the QueryParser.escape method:
query = parser.parse(QueryParser.escape(searchString));
I use the classic Analyzer because I noticed that the standard Analyzer has some problems with escaping special characters.
The problem is that the Parser deletes the special characters and so the Query has the term
content:test
How can I set up the parser and searcher to search for the real value “test-“?
I also created my own query with the content test- but that also didn’t work. I recieved 0 results but my index has entries like:
Test-VRF
Test-IPLS
I am really confused about this problem.
While escaping special characters for the queryparser deals with part of the problem, it doesn't help with analysis.
Neither classic nor standard analyzer will keep punctuation in the indexed form of the field. For each of these examples, the indexed form will be in two terms:
test and vrf
test and ipls
This is why a manually constructed query for "test-" finds nothing. That term does not exist in the index.
The goal of these analyzers is to attempt to index words. As such, punctuation is mostly eliminated, and is not searchable. A phrase query for "test vrf" or "test-vrf" or "test_vrf" are all effectively identical. If that is not what you need, you'll need to look to other analyzers.
The goal to fix this issue is to store the value content in an NOT_ANALYZED way.
Field fieldType = new Field(key.toLowerCase(),value, Field.Store.YES, Field.Index.NOT_ANALYZED);
Someone who has the same problem has to take care how to store the contents in the index.
To request the result create a query in this way
searchString = QueryParser.escape(searchString);
and use for example a WhitespaceAnalyzer.
I need to create an XPath query to select a JCR node whose name contains a whitespace character.
For instance: /jcr:root/foo bar/
But that results in an invalid query.
How should whitespaces be encoded in an XPath query?
Try using something like this XPath query:
/jcr:root/foo_x0020_bar/
The JSR-170 (JCR 1.0) specification defines how XPath can be used to query a JCR repository, and even though JSR-283 (or JCR 2.0) deprecated XPath as a query language, many of the implementations still support XPath along with the other query languages (including the more powerful JCR-SQL2).
Now, regarding the rules for escaping characters in XPath, JSR-170 states the following in Section 6.6.4.9:
The names of elements and attributes (corresponding to nodes and properties, respectively) within an XPath statement must correspond to the form in which they (notionally) appear in the document view. This means that spaces (and any other non-XML characters) within names must be encoded according to the rules described in 6.4.3 Escaping of Names.
Section 6.4.3 defines how such characters are escaped in names:
The escape character is the underscore (“_”). Any invalid character is escaped as _xHHHH_, where HHHH is the four-digit hexadecimal UTF-16 code for the character. When producing escape sequences the implementation should use lowercase letters for the hex digits a-f. When unescaping, however, both upper and lowercase alphabetic hexadecimal characters must be recognized.
Although you didn't ask about it, you can easily do the same query in JCR-SQL2:
SELECT * FROM [nt:base] WHERE ISSAMENODE('/foo_x0020_bar')