How to percent-encode only some characters - sparql

The SPARQL function ENCODE_FOR_URI escapes every character in its input except unreserved URI characters. How do I make it leave certain characters alone (for example, non-ASCII characters, for use in an IRI)?

This is a non-standard solution, as it requires regex support (lookahead) beyond what the SPARQL specification mandates, but it works in some triple stores/SPARQL engines (e.g. Wikidata). Here's the full solution. It also requires picking a character that should not (and cannot) be replaced (_ in this case) and a character not present in the input (\u0000 cannot be stored in RDF, so it is a good pick):
BIND("0/1&2]3%4#5_" AS ?text)
BIND(REPLACE(?text, "[^\u0001-\u005E\u0060-\u007F]+", "") AS ?filtered) # the characters to keep
BIND(REPLACE(?filtered, "(.)(?=.*\\1)", "", "s") AS ?shortened) # leaves only one of each character
BIND(REPLACE(?shortened, "(.)", "_$1", "s") AS ?separated) # separates the characters via _
BIND(CONCAT(?separated, ENCODE_FOR_URI(?separated)) AS ?encoded) # appends the encoded variant after it
BIND(CONCAT("_([^_]*)(?=(?:_[^_]*){", STR(STRLEN(?shortened) - 1), "}_([^_]*))?") AS ?regex)
BIND(REPLACE(?encoded, ?regex, "$1$2\u0000", "s") AS ?replaced) # groups the character and replacement together, separated by \u0000
BIND(REPLACE(?shortened, "([-\\]\\[])", "\\\\$1") AS ?class) # converts the remaining characters to a valid regex class
BIND(CONCAT(?text, "\u0000", ?replaced) AS ?prepared) # appends the replacement groups after the original text
BIND(CONCAT("([", ?class, "])(?=.*?\u0000\\1([^\u0000]*))|\u0000.*") AS ?regex2)
BIND(REPLACE(?prepared, ?regex2, "$2", "s") AS ?result) # replaces each occurrence of the character by its replacement in the group at the end
If you know the precise replacements beforehand, only the last three lines are necessary; ?replaced and ?class can then be written by hand.
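For comparison, outside SPARQL the same effect (percent-encode ASCII characters the way ENCODE_FOR_URI would, but pass non-ASCII characters through) takes only a few lines. The following is a minimal Python sketch, assuming the rule "leave everything above U+007F untouched"; the function name encode_ascii_only is made up for illustration.

```python
from urllib.parse import quote


def encode_ascii_only(text: str) -> str:
    """Percent-encode ASCII characters that are not URI-unreserved; keep non-ASCII as-is."""
    # quote(ch, safe="") leaves only ASCII letters, digits and "-._~" unescaped.
    # Assumption: every character above U+007F should be passed through unchanged.
    return "".join(quote(ch, safe="") if ord(ch) < 128 else ch for ch in text)


if __name__ == "__main__":
    print(encode_ascii_only("0/1&2]3%4#5_"))   # same sample text as in the query above
    print(encode_ascii_only("caf\u00e9/menu"))  # the non-ASCII letter is left alone
```

The condition ord(ch) < 128 plays the role of the first REPLACE in the query: it decides which characters are candidates for encoding at all.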

Related

How to index special characters in sphinx

I have some special characters (e.g. #) in my text and I don't want these characters to be treated as separators. I have added these characters to charset_table:
charset_table = 0..9, english, _, #
I also tried the code point format (U+23), but it didn't work. How can I index these characters?

regex capture middle of url

I'm trying to figure out the base regex to capture the middle of a google url out of a sql database.
For example, a few links:
https://www.google.com/cars/?year=2016&model=dodge+durango&id=1234
https://www.google.com/cars/?year=2014&model=jeep+cherokee+crossover&id=6789
What would be the regex to capture the text dodge+durango or jeep+cherokee+crossover? (It's alright if the + is still in there.)
My Attempts:
1)
\b[=.]\W\b\w{5}\b[+.]?\w{7}
, but this clearly does not work, as it is hard-coded and would only work for something like the dodge durango example (it would extract "dodge+durango").
2) Using positive lookahead,
[^+]( ?=&id )
but I am not fully sure how to use this, as this only grabs one character behind the & symbol.
How can I extract a string of (potentially) any length, with any number of + delimiters, between the "model=" and "&id" boundaries?
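It seems you could use regexp_replace and reference the match group. The pattern has to match the whole string, so that only the captured model remains after the substitution:
regexp_replace(input, '^.*model=([^&]*).*$', E'\\1')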
From the PostgreSQL documentation:
The regexp_replace function provides substitution of new text for
substrings that match POSIX regular expression patterns. It has the
syntax regexp_replace(source, pattern, replacement [, flags ]). The
source string is returned unchanged if there is no match to the
pattern. If there is a match, the source string is returned with the
replacement string substituted for the matching substring. The
replacement string can contain \n, where n is 1 through 9, to indicate
that the source substring matching the n'th parenthesized
subexpression of the pattern should be inserted, and it can contain \&
to indicate that the substring matching the entire pattern should be
inserted. Write \\ if you need to put a literal backslash in the
replacement text. The flags parameter is an optional text string
containing zero or more single-letter flags that change the function's
behavior. Flag i specifies case-insensitive matching, while flag g
specifies replacement of each matching substring rather than only the
first one.
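The quoted behaviour (only the matching substring is replaced, the rest of the source is returned unchanged) is easy to see in a quick experiment. This sketch uses Python's re.sub as a stand-in for regexp_replace; that is an assumption for illustration only, and the two regex dialects differ in details.

```python
import re

url = "https://www.google.com/cars/?year=2016&model=dodge+durango&id=1234"

# A pattern that matches only part of the string: everything outside the
# match is kept, so the result is still a (mangled) URL.
partial = re.sub(r"model=([^&]*)", r"\1", url, count=1)

# A pattern that matches the whole string: only the captured group remains.
whole = re.sub(r"^.*model=([^&]*).*$", r"\1", url)

if __name__ == "__main__":
    print(partial)
    print(whole)
```

If substitution is used for extraction, the pattern therefore has to cover the entire input; otherwise a function that returns the capture directly is the simpler choice.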
I may be misunderstanding, but if you want to get the model, just select everything between model= and the ampersand (&).
regexp_matches(input, 'model=([^&]*)')
model=: Match literally
([^&]*): Capture
[^&]*: Anything that isn't an ampersand
*: Unlimited times
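The capture-group pattern above can be tried out on the two sample links. This is a small Python sketch using the re module rather than the database function (an assumption for illustration); extract_model is a hypothetical helper name.

```python
import re
from typing import Optional

MODEL_RE = re.compile(r"model=([^&]*)")


def extract_model(url: str) -> Optional[str]:
    """Return the text between 'model=' and the next '&' (or end of string)."""
    match = MODEL_RE.search(url)
    return match.group(1) if match else None


if __name__ == "__main__":
    links = [
        "https://www.google.com/cars/?year=2016&model=dodge+durango&id=1234",
        "https://www.google.com/cars/?year=2014&model=jeep+cherokee+crossover&id=6789",
    ]
    for link in links:
        print(extract_model(link))
```

Because [^&]* stops at the first ampersand, the pattern also works when model is the last parameter and no &id follows.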

Regular Expression for alphanumeric and some special characters not adjacent

I would like to have a regular expression to make an Oracle SQL REGEXP_LIKE query that checks
if a string starts with one alphanumeric character
if the string ends with one alphanumeric character
if the "body" of the string contains only alphanumeric characters OR these authorized characters (written out): hyphen (dash), dot, apostrophe, space
if the authorised characters are NOT adjacent (to avoid something like "he--'''l..'-lo")
I started with this :
^[a-zA-Z0-9]+(a-zA-Z0-9\-\.'|([^\-\.'])\1)*[a-zA-Z0-9]$
I used backslash to escape assuming that dot and hyphen are metacharacters
I think this is what you want:
^[a-zA-Z0-9]+([-.' ][a-zA-Z0-9]|[a-zA-Z0-9])*[a-zA-Z0-9]?$
It looks for
at least 1 alphanumeric (alnum),
followed by
either an authorized character followed by an alphanumeric or just an alphanumeric, repeated any number of times (including 0).
optionally followed by
an alnum
This meets your specification. I'm not sure whether "starts with one alnum and ends with one alnum" means that there must be at least 2 alnums, or whether they can be the same character. If there must be at least 2 of them, remove the last ? (which makes the last alnum optional).
Regards
Assuming you meant "authorised characters are NOT adjacent to each other",
try something along these lines:
^([a-zA-Z0-9]+[-.' ]?)*[a-zA-Z0-9]$
so that the repeating middle part always has one or more alphanumeric characters followed by zero or one special character.
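Patterns like these are easiest to judge by running them against a handful of strings taken from the four requirements. Below is a minimal Python test harness. The pattern in it is one possible formulation written for this sketch (an assumption, not the pattern from either answer), and Python's re dialect is not identical to Oracle's, so treat it as a way to reason about the logic rather than a drop-in REGEXP_LIKE argument.

```python
import re

# Assumed formulation: runs of alphanumerics, separated by exactly one
# authorized character (hyphen, dot, apostrophe or space).
CANDIDATE = re.compile(r"[a-zA-Z0-9]+(?:[-.' ][a-zA-Z0-9]+)*")


def is_valid(text: str) -> bool:
    """True if the whole string satisfies the candidate pattern."""
    return CANDIDATE.fullmatch(text) is not None


if __name__ == "__main__":
    samples = ["hello", "o'neil", "a-b", "a", "he--'''l..'-lo", "-ab", "ab-", "a_b"]
    for sample in samples:
        print(repr(sample), is_valid(sample))
```

Short inputs such as "a-b" and "a" are worth including, since patterns with a mandatory first and last character class often reject them by accident.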

Import format to intellij idea from JSMin/JSFormat

Does anybody know which formatting rules the jsmin/jsformatter plugin for Notepad++ uses? I need this because we are required to use this formatter, but I write my JS code in IntelliJ IDEA. With these rules I could import them somehow or, at least, apply them manually.
Thanks everyone in advance!
The minimising rules applied are listed here:
http://www.crockford.com/javascript/jsmin.html
JSMin is a filter that omits or modifies some characters. This does
not change the behavior of the program that it is minifying. The
result may be harder to debug. It will definitely be harder to read.
JSMin first replaces carriage returns ('\r') with linefeeds ('\n'). It
replaces all other control characters (including tab) with spaces. It
replaces comments in the // form with linefeeds. It replaces comments
in the /* */ form with spaces. All runs of spaces are replaced with a
single space. All runs of linefeeds are replaced with a single
linefeed.
It omits spaces except when a space is preceded and followed by a
non-ASCII character or by an ASCII letter or digit, or by one of these
characters:
\ $ _
It is more conservative in omitting linefeeds, because linefeeds are
sometimes treated as semicolons. A linefeed is not omitted if it
precedes a non-ASCII character or an ASCII letter or digit or one of
these characters:
\ $ _ { [ ( + -
and if it follows a non-ASCII character or an ASCII letter or digit or
one of these characters:
\ $ _ } ] ) + - " '
No other characters are omitted or modified.
There are other custom formatting rules applied according to the plugin developer's page:
http://www.sunjw.us/jsminnpp/
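To make the space rule quoted from the JSMin page concrete, here is a rough Python sketch of just that one rule: a space survives only when both neighbours are identifier-like characters. It is an illustration written for this answer, not JSMin's actual code, and it deliberately ignores string literals, regex literals, comments and linefeeds. The function names are made up.

```python
import re


def is_ident_char(ch: str) -> bool:
    """Non-ASCII character, ASCII letter or digit, or one of: backslash, $, _."""
    if ord(ch) > 127:
        return True
    return ch.isalnum() or ch in "\\$_"


def squeeze_spaces(line: str) -> str:
    """Collapse runs of spaces, then drop every space that the rule allows omitting."""
    collapsed = re.sub(r" +", " ", line).strip(" ")
    out = []
    for i, ch in enumerate(collapsed):
        if ch == " ":
            # After collapsing and stripping, every space has a non-space
            # character on both sides.
            prev, nxt = collapsed[i - 1], collapsed[i + 1]
            if not (is_ident_char(prev) and is_ident_char(nxt)):
                continue
        out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    print(squeeze_spaces("var  x = a + b ;"))
```

The space between var and x is kept because both neighbours are letters; all other spaces in the sample touch an operator or a semicolon and are dropped.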

Why does an LDAP search return all results when using %?

When I search an LDAP server using the following filter
(cn=%*)
it returns all results under the base DN. Does LDAP treat '%' specially? I haven't found any description of this.
What is your directory server?
Are you sure that '%' is not being replaced by your command line interpreter or your compiler?
According to RFC 2254, % is not a special character:
If a value should contain any of the following characters
Character       ASCII value
---------------------------
*               0x2a
(               0x28
)               0x29
\               0x5c
NUL             0x00
the character must be encoded as the backslash '\' character (ASCII
0x5c) followed by the two hexadecimal digits representing the ASCII
value of the encoded character. The case of the two hexadecimal
digits is not significant.
This simple escaping mechanism eliminates filter-parsing ambiguities
and allows any filter that can be represented in LDAP to be
represented as a NUL-terminated string. Other characters besides the
ones listed above may be escaped using this mechanism, for example,
non-printing characters.
For example, the filter checking whether the "cn" attribute contained
a value with the character "*" anywhere in it would be represented as
"(cn=*\2a*)".
Note that although both the substring and present productions in the
grammar above can produce the "attr=*" construct, this construct is
used only to denote a presence filter.
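The escaping rule from the RFC excerpt can be written down directly. The following Python sketch escapes a value before it is placed into a filter string; escape_filter_value is a hypothetical helper written for this illustration, not a function from any LDAP library.

```python
# The five characters RFC 2254 lists, mapped to backslash + two hex digits.
_ESCAPES = {
    "*": r"\2a",
    "(": r"\28",
    ")": r"\29",
    "\\": r"\5c",
    "\x00": r"\00",
}


def escape_filter_value(value: str) -> str:
    """Escape the characters that are special inside an LDAP search filter value."""
    return "".join(_ESCAPES.get(ch, ch) for ch in value)


if __name__ == "__main__":
    # "cn contains a literal asterisk": wildcards stay raw, the value is escaped.
    print("(cn=*" + escape_filter_value("*") + "*)")
    # '%' is not in the table, so it passes through unchanged.
    print("(cn=" + escape_filter_value("%") + "*)")
```

Since % is left as it is, a filter such as (cn=%*) should only match entries whose cn starts with a percent sign; if it matches everything, the % is probably being altered before the filter reaches the server.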