I have some special characters (e.g. #) in my text and I don't want these characters to be treated as separators. I have added these characters to charset_table:
charset_table = 0..9, english, _, #
I also tried the U+23 format, but it didn't work. How can I index these characters?
The SPARQL function ENCODE_FOR_URI escapes everything except unreserved URI characters in the input. How do I change it to ignore certain characters (non-ASCII characters, for use in IRIs, for example)?
This is a non-standard solution, as it requires additional regex support (lookahead) beyond what the SPARQL specification mandates, but it works on some triple stores/SPARQL engines (e.g. Wikidata's). Here's the full solution. It also requires picking a character that should not (and cannot) be replaced (_ in this case) and a character not present in the input (\u0000 cannot be stored in RDF, so it's a good pick):
BIND("0/1&2]3%4#5_" AS ?text)
BIND(REPLACE(?text, "[^\u0001-\u005E\u0060-\u007F]+", "") AS ?filtered) # the characters to keep
BIND(REPLACE(?filtered, "(.)(?=.*\\1)", "", "s") AS ?shortened) # leaves only one of each character
BIND(REPLACE(?shortened, "(.)", "_$1", "s") AS ?separated) # separates the characters via _
BIND(CONCAT(?separated, ENCODE_FOR_URI(?separated)) AS ?encoded) # appends the encoded variant after it
BIND(CONCAT("_([^_]*)(?=(?:_[^_]*){", STR(STRLEN(?shortened) - 1), "}_([^_]*))?") AS ?regex)
BIND(REPLACE(?encoded, ?regex, "$1$2\u0000", "s") AS ?replaced) # groups the character and replacement together, separated by \u0000
BIND(REPLACE(?shortened, "([-\\]\\[])", "\\\\$1") AS ?class) # converts the remaining characters to a valid regex class
BIND(CONCAT(?text, "\u0000", ?replaced) AS ?prepared) # appends the replacement groups after the original text
BIND(CONCAT("([", ?class, "])(?=.*?\u0000\\1([^\u0000]*))|\u0000.*") AS ?regex2)
BIND(REPLACE(?prepared, ?regex2, "$2", "s") AS ?result) # replaces each occurrence of the character by its replacement in the group at the end
If you know the precise replacements beforehand, only the last three lines are necessary to form the string.
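For readers more comfortable outside SPARQL, the net effect of the BIND chain above can be sketched in Python. This is only an illustration of the idea, not the SPARQL mechanics: urllib's quote plays the role of ENCODE_FOR_URI, and the keep set is an assumption standing in for "characters to ignore":

```python
from urllib.parse import quote

def encode_like_sparql(text, keep="0123456789_"):
    # Percent-encode every character not in `keep`, leaving the
    # "ignored" characters untouched (the goal of the question).
    return "".join(
        c if c in keep else quote(c, safe="")
        for c in text
    )

# The sample input from the first BIND:
print(encode_like_sparql("0/1&2]3%4#5_"))  # → 0%2F1%262%5D3%254%235_
```

The SPARQL version has to achieve this without per-character iteration, which is why it builds a character-to-replacement lookup table inside the string itself and applies it with a lookahead regex.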
Can anyone help with a query on how to replace special/non-numeric/hidden characters from a phone number column.
I've tried
LTRIM(RTRIM(REGEXP_REPLACE(
PHONE_NBR,
'[^[:digit:]][:cntrl:][:alpha:][:graph:][:blank:][:print:][:punct:][:space:]~',
'')))
but no luck, there are still a few records which contain non-numeric values.
Your regex is saying to ONLY replace a string consisting of: a non-numeric character followed by a control character, an alpha, a graph, a blank, a print, a punct, a space, and then a tilde.
You should be able to just use '[^[:digit:]]' as your regex, to remove all non-numeric characters.
I am trying to write a query which will tell me if a certain record contains only special characters. E.g. "%^&%&^%&" should error, but "%HH678*(*))" is fine (as it also has alphanumeric values). I have written the following query; however, it works only for English letters and numbers. If the column has other characters, such as Mandarin, it doesn't give the expected value. Any help is highly appreciated.
SELECT * FROM test WHERE REGEXP_LIKE(sampletext, '[^]^A-Z^a-z^0-9^[^.^{^}^ ]' );
You may try this,
regexp_like(text, '^[^A-Za-z0-9]+$')
This matches only if the input text consists entirely of special characters, i.e. characters that are neither letters nor digits.
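The behavior of that anchored pattern can be checked quickly in Python (the caveat in the comment relates to the asker's Mandarin case):

```python
import re

only_special = re.compile(r"^[^A-Za-z0-9]+$")

print(bool(only_special.match("%^&%&^%&")))     # True: special chars only
print(bool(only_special.match("%HH678*(*))")))  # False: contains alphanumerics

# Caveat: the class is ASCII-only, so non-ASCII letters such as
# Mandarin characters are still counted as "special" here.
print(bool(only_special.match("中文")))          # True
```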
To detect strings containing only characters other than unaccented alphabetic, numeric, or white space characters try this:
regexp_like(text,'^[^a-zA-Z0-9[:space:]]+$')
If you don't think punctuation characters are special, then add [:punct:] to the class of characters to ignore.
If you are looking for a specific set of characters, you can use a character class of just those characters of interest, for example some common accented characters (note the lack of a leading ^ inside the character class):
regexp_like(text,'^[àèìòùÀÈÌÒÙáéíóúýÁÉÍÓÚÝâêîôûÂÊÎÔÛãñõÃÑÕäëïöüÿÄËÏÖÜŸçÇßØøÅåÆæœ]+$')
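Python's re module has no POSIX classes, but \s can stand in for [:space:] when sketching the whitespace-aware variant above:

```python
import re

# Equivalent of '^[^a-zA-Z0-9[:space:]]+$', with \s in place of [:space:]
only_special = re.compile(r"^[^a-zA-Z0-9\s]+$")

print(bool(only_special.match("%^&*")))   # True: only special characters
print(bool(only_special.match("a%^&*")))  # False: contains a letter
print(bool(only_special.match("% ^")))    # False: contains whitespace
```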
I am using the following tokenizer and filters while indexing my data.
Problem: I have some fields whose content is just a single special character, like $. The standard tokenizer strips it out, but if I use the whitespace tokenizer, it doesn't split the sentence at punctuation characters.
solr.StandardTokenizerFactory
solr.StopFilterFactory
solr.ASCIIFoldingFilterFactory
solr.StandardFilterFactory
solr.LowerCaseFilterFactory
solr.PatternReplaceFilterFactory
solr.SnowballPorterFilterFactory
solr.EdgeNGramFilterFactory
If a field is defined as alphanumeric, are spaces and underscores (_) allowed?
I hope they are not.
Can anyone confirm?
Alphanumeric characters by definition only comprise the letters A to Z and the digits 0 to 9. Spaces and underscores are usually considered punctuation characters, so no, they shouldn't be allowed.
If a field specifically says "alphanumeric characters, space and underscore", then they're included. Otherwise, in most cases, you can assume they're not.
I came here wondering why \w in regex includes the underscore; I had assumed \w meant alphanumeric, [A-Za-z0-9], but that is not the case in regex.
In most regex engines, \w is shorthand for [A-Za-z0-9_].
However, in Python's re module, besides including the underscore, \w also matches letters with diacritics, letters from other scripts, etc., such as the German letter "ö" in "schön".
So now I've learned to use the long form [A-Za-z0-9] if I want to be specifically alphanumeric in regex.
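The difference is easy to see in Python, comparing the default Unicode-aware \w, the re.ASCII variant, and the explicit class:

```python
import re

# Default (Unicode) mode: \w matches the underscore and non-ASCII letters
print(re.findall(r"\w", "a_ö"))            # → ['a', '_', 'ö']

# re.ASCII restricts \w to [A-Za-z0-9_], so "ö" no longer matches
print(re.findall(r"\w", "a_ö", re.ASCII))  # → ['a', '_']

# Spelling the class out excludes the underscore as well
print(re.findall(r"[A-Za-z0-9]", "a_ö"))   # → ['a']
```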
Alphanumeric characters are A to Z, a to z and 0 to 9