Without using plpgsql, I'm trying to urlencode a given text within a pgsql SELECT statement.
The problem with this approach:
select regexp_replace('héllo there','([^A-Za-z0-9])','%' || encode(E'\\1','hex'),'g')
...is that the encode function is not passed the regexp parameter, unless there's another way to call functions from within the replacement expression that actually works. So I'm wondering if there's a replacement expression that, by itself, can encode matches into hex values.
There may be other combinations of functions. I thought there would be a clever regex (and that may still be the answer) out there, but I'm having trouble finding it.
select regexp_replace(encode('héllo there','hex'),'(..)',E'%\\1','g');
This doesn't leave the alphanumeric characters human-readable, though.
Here is a pretty short version, and it's even a "pure SQL" function, not plpgsql. Multibyte characters (including 3- and 4-byte emoji) are supported.
create or replace function urlencode(in_str text, OUT _result text) returns text as $$
select
string_agg(
case
when ol>1 or ch !~ '[0-9a-zA-Z:/#._?#-]+'
then regexp_replace(upper(substring(ch::bytea::text, 3)), '(..)', E'%\\1', 'g')
else ch
end,
''
)
from (
select ch, octet_length(ch) as ol
from regexp_split_to_table($1, '') as ch
) as s;
$$ language sql immutable strict;
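To check the SQL function's output outside the database, the same per-character logic can be sketched in Python. This is only an illustration under assumptions: the name `urlencode_sketch` and the `SAFE` set are mine, with the safe set assumed to be ASCII letters, digits and the punctuation `: / # . _ ? -`.

```python
import string

# Assumed safe set: ASCII letters, digits and the punctuation : / # . _ ? -
SAFE = set(string.ascii_letters + string.digits + ":/#._?-")


def urlencode_sketch(text: str) -> str:
    """Percent-encode every character outside SAFE, one %XX per UTF-8 byte."""
    out = []
    for ch in text:
        if ch in SAFE:
            out.append(ch)
        else:
            # A multi-byte character yields one %XX group per byte.
            out.append("".join(f"%{b:02X}" for b in ch.encode("utf-8")))
    return "".join(out)


print(urlencode_sketch("héllo there"))  # h%C3%A9llo%20there
```

Comparing this against the SQL function on a handful of inputs is a cheap way to spot a wrong character class.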
Here's a function I wrote that handles encoding using built in functions while preserving the readability of the URL.
Regex matches to capture pairs of (optional) safe characters and (at most one) non-safe character. Nested selects allow those pairs to be encoded and re-combined returning a fully encoded string.
I've run through a test suite with all sorts of permutations (leading/trailing/only/repeated encoded characters), and thus far it seems to encode correctly.
The safe special characters are _ ~ . - and /. My inclusion of "/" on that list is probably non-standard, but fits the use case I have where the input text may be a path and I want that to remain.
CREATE OR REPLACE FUNCTION oseberg.encode_uri(input text)
RETURNS text
LANGUAGE plpgsql
IMMUTABLE STRICT
AS $function$
DECLARE
parsed text;
safePattern text;
BEGIN
safePattern = 'a-zA-Z0-9_~/\-\.';
IF input ~ ('[^' || safePattern || ']') THEN
SELECT STRING_AGG(fragment, '')
INTO parsed
FROM (
SELECT prefix || encoded AS fragment
FROM (
SELECT COALESCE(match[1], '') AS prefix,
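-- convert_to() avoids bytea input parsing; each UTF-8 byte gets its own %XX group
COALESCE(regexp_replace(encode(convert_to(match[2], 'UTF8'), 'hex'), '(..)', E'%\\1', 'g'), '') AS encoded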
FROM (
SELECT regexp_matches(
input,
'([' || safePattern || ']*)([^' || safePattern || '])?',
'g') AS match
) matches
) parsed
) fragments;
RETURN parsed;
ELSE
RETURN input;
END IF;
END;
$function$
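The pairing idea described above (a run of safe characters followed by at most one unsafe character) can be sketched in Python to see how the fragments recombine. The names `PAIR` and `encode_uri_sketch` are hypothetical, and this sketch encodes every UTF-8 byte of the unsafe character separately, which is an assumption about the intended output.

```python
import re

# Same character-class body as safePattern in the function above.
SAFE = r"a-zA-Z0-9_~/\-\."
PAIR = re.compile(rf"([{SAFE}]*)([^{SAFE}])?")


def encode_uri_sketch(text: str) -> str:
    """Encode text pair by pair: safe prefix kept, unsafe character hex-encoded."""
    parts = []
    for m in PAIR.finditer(text):
        prefix, unsafe = m.group(1), m.group(2)
        parts.append(prefix)
        if unsafe is not None:
            # Lowercase hex, like encode(..., 'hex') in PostgreSQL.
            parts.append("".join(f"%{b:02x}" for b in unsafe.encode("utf-8")))
    return "".join(parts)


print(encode_uri_sketch("my dir/file name.txt"))  # my%20dir/file%20name.txt
```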
You can use CLR and import the namespace, or use the function shown at this link, which creates a T-SQL function that does the encoding:
http://www.sqljunkies.com/WebLog/peter_debetta/archive/2007/03/09/28987.aspx
I'm trying to replace accented characters from a column to "normal" characters.
select 'áááããã'
I'd like some operation which would return 'aaaaaa'.
There is a more general way that uses a built-in JavaScript function to replace them:
Remove Diacritics from string in Snowflake
create or replace function REPLACE_DIACRITICS("str" string)
returns string
language javascript
strict immutable
as
$$
return str.normalize("NFD").replace(/\p{Diacritic}/gu, "");
$$;
select REPLACE_DIACRITICS('ö, é, č => a, o e, c');
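For comparison, the same normalize-then-strip idea can be sketched in Python with the standard `unicodedata` module. Dropping characters of general category `Mn` (nonspacing marks) is an approximation of `\p{Diacritic}`, not an exact equivalent, and the function name here is just for illustration.

```python
import unicodedata


def replace_diacritics(s: str) -> str:
    """Decompose to NFD, then drop the combining marks left behind."""
    decomposed = unicodedata.normalize("NFD", s)
    # Category "Mn" covers the combining accents produced by NFD.
    return "".join(c for c in decomposed if unicodedata.category(c) != "Mn")


print(replace_diacritics("áááããã"))  # aaaaaa
```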
Just found a solution with one of my colleagues.
select translate('áááããã','áéíóúãõâêôàç','aeiouaoaeoac')
We can also add lower() to cover uppercase input, though the result then comes back in lowercase:
select translate(lower('ÁÁÁÃÃÃ'),'áéíóúãõâêôàç','aeiouaoaeoac')
I'm trying to write a PLPGSQL function which obfuscates/censors/redacts text.
-- Obfuscate a body of text by replacing lowercase letters and numbers with # symbols.
CREATE OR REPLACE FUNCTION obfuscate(str text) RETURNS text AS $$
BEGIN
str := replace(str, '\r', E'\r');
str := replace(str, '\n', E'\n');
str := translate(str, 'abcdefghijklmnopqrstuvwxyz0123456789', rpad('#',36,'#'));
str := replace(str, E'\r', '\r');
str := replace(str, E'\n', '\n');
RETURN str;
END
$$ LANGUAGE plpgsql;
This works, but note the dance to convert escaped newlines and carriage returns to their respective byte, and then back again. This is because my dataset contains strings that have been escaped (data which has been serialized to JSON/YAML), and I don't want to clobber those values.
Is there another more convenient way to unescape a string? It would be great to handle other escaped values, like unicode escape sequences, too.
To "unescape" a string, you have to "execute" it - literally. Use the EXECUTE command in plpgsql.
You can wrap this into a function. Naive approach:
CREATE OR REPLACE FUNCTION f_unescape(text, OUT _t text)
LANGUAGE plpgsql STABLE AS
$func$
BEGIN
EXECUTE 'SELECT E''' || $1 || ''''
INTO _t;
END
$func$;
Call:
SELECT f_unescape('\r\nabcdef\t0123\x123\n');
This naive function is vulnerable to single quotes in the original string, which need to be escaped. But that's a bit tricky. Single quotes can be escaped in two ways in a Posix escape string syntax: \' or ''. But we could also have \\' etc. Basics:
Insert text with single quotes in PostgreSQL
We could enclose the string in dollar quoting, but that does not work for Posix escape string syntax. E'\'' cannot be replaced with E$$\'$$. We could add SET standard_conforming_strings = off to the function, then we wouldn't have to prepend strings with E. But that would disable function inlining and interpret escapes everywhere in the function body.
Instead, escape all ' and all (optionally) leading \ with regexp_replace():
regexp_replace($1, '(\\*)(\''+)', '\1\1\2\2', 'g')
(\\*) .. 0 or more leading \
(\''+) .. capture 1 or more '
'\1\1\2\2' .. double up each match
'g' .. replace all occurrences, not just the first
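The effect of that replacement is easy to check in isolation. Here is a small Python sketch of the same pattern (the function name is hypothetical); Python's `re.sub` replaces all occurrences by default, which corresponds to the 'g' flag.

```python
import re


def double_quotes_and_backslashes(s: str) -> str:
    """Double every run of quotes together with the backslashes leading it."""
    return re.sub(r"(\\*)('+)", r"\1\1\2\2", s)


print(double_quotes_and_backslashes("it's"))  # it''s
```

Backslashes that are not directly followed by a single quote, as in \r or \n, are left alone, so the escape sequences the function is meant to interpret survive.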
Safe function
CREATE OR REPLACE FUNCTION f_unescape(IN text, OUT _t text)
RETURNS text
LANGUAGE plpgsql STABLE AS
$func$
BEGIN
EXECUTE $$SELECT E'$$ || regexp_replace($1, '(\\*)(\''+)', '\1\1\2\2', 'g') || $$'$$
INTO _t;
END
$func$;
The operation cannot be reversed reliably. There is no way to tell which special character was escaped before and which wasn't. You can escape all or none. Or do it manually like before. But if the same character was included in literal and escape form, you cannot tell them apart any more.
Test case:
SELECT t, f_unescape(t)
FROM (
VALUES
($$'$$)
, ($$''$$)
, ($$'''$$)
, ($$\'$$)
, ($$\\'$$)
, ($$\\\'$$)
, ($$\\\'''$$)
, ($$\r\\'nabcdef\\\t0123\x123\\\\\'''\n$$)
) v(t);
I'm trying to understand the meaning of this regular expression function and its purpose in the select statement.
create or replace FUNCTION REPS_MTCH(string_orig IN VARCHAR2 , string_new IN VARCHAR2, score IN NUMBER)
RETURN PLS_INTEGER AS
BEGIN
IF string_orig IS NULL AND string_new IS NULL THEN
RETURN 0;
ELSIF utl_match.jaro_winkler_similarity(replace(REGEXP_REPLACE(UPPER(string_orig), '[^a-z|A-Z|0-9]+', ''),' ',''),replace(REGEXP_REPLACE(UPPER(string_new), '[^a-z|A-Z|0-9]+', ''),' ','')) >= score THEN
RETURN 1;
ELSE
RETURN 0;
END IF;
END;
The REPS_MTCH function is called in this select statement, which matches names in the temp table REPS_MTCH_D_STDNT_TMP against the master table REPS_MTCH_D_STDNT_MSTR:
SELECT
REPS_MTCH(REPS_MTCH_D_STDNT_TMP.FIRST_NAME,REPS_MTCH_D_STDNT_MSTR.FIRST_NAME,85) AS first_match_score,
What is the purpose of the REPS_MTCH function in this select statement?
In the above function, REGEXP_REPLACE removes every character that is not alphanumeric or a pipe (|). Each REGEXP_REPLACE is then wrapped in a redundant call to the regular REPLACE function, which removes spaces that the REGEXP_REPLACE calls have already removed. The test could be rewritten as follows and still behave identically, since the inputs are uppercased before the replace operations occur:
ELSIF utl_match.jaro_winkler_similarity(
REGEXP_REPLACE(UPPER(string_orig), '[^A-Z|0-9]+', '')
,REGEXP_REPLACE(UPPER(string_new) , '[^A-Z|0-9]+', '')
) >= score
THEN RETURN 1;
I simply removed the extra replace operation, the unnecessary lower case a-z and the extra pipe (|) character from the regular expression's character classes.
The JARO_WINKLER_SIMILARITY function computes a score from 0 (not similar) to 100 (identical) for the remaining alphanumeric and pipe characters. You can check out the Wikipedia entry on Jaro-Winkler distance if you want to know more.
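To convince yourself that the simplified character class is equivalent, the two normalization steps can be compared side by side. This is a Python sketch with hypothetical function names; it only models the cleanup before the similarity call, not Jaro-Winkler itself, and it ignores Oracle's NULL handling of empty strings.

```python
import re


def normalize_original(s: str) -> str:
    """Original form: uppercase, strip by regex, then strip spaces again."""
    return re.sub(r"[^a-z|A-Z|0-9]+", "", s.upper()).replace(" ", "")


def normalize_simplified(s: str) -> str:
    """Simplified form: the regex alone already removes spaces."""
    return re.sub(r"[^A-Z|0-9]+", "", s.upper())


samples = ["O'Brien-Smith jr.", "  Mary  Ann ", "a|b 42"]
for name in samples:
    assert normalize_original(name) == normalize_simplified(name)

print(normalize_simplified("O'Brien-Smith jr."))  # OBRIENSMITHJR
```

Note that the pipe survives in both versions, because it is still listed inside the character class.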
I have a table with a structure like this...
the_geom data
geom1 data1+3000||data2+1000||data3+222
geom2 data1+500||data2+900||data3+22232
I want to create a function that returns the records by user request.
Example: for data2, retrieve geom1,1000 and geom2, 900
So far I have created the function below, which works well, but I am facing a parameter substitution problem: I cannot substitute $1 for 'data2' in the following expression, although I can use $1 later in the function:
regexp_matches(t::text, E'(data2[\+])([0-9]+)'::text)::text)[2]::integer
MY FUNCTION
create or replace function get_counts(taxa varchar(100))
returns setof record
as $$
SELECT t2.counter,t2.the_geom
FROM (
SELECT (regexp_matches(t.data::text, E'(data2[\+])([0-9]+)'::text)::text)[2]::integer as counter,the_geom
from (select the_geom,data from simple_inpn2 where data ~ $1::text) as t
) t2
$$
language sql;
SELECT get_counts('data2') will work, but we should be able to make this substitution:
regexp_matches(t::text, E'($1... instead of E'(data2....
I think it's more of a syntax issue, as the function execution gives no error; it just interprets $1 as a string and gives no result.
thanks in advance,
A E'$1' is a string literal (using the escape string syntax) containing a dollar sign followed by a one. An unquoted $1 is the first parameter to your function. So this:
(regexp_matches(t, E'($1[\+])([0-9]+)'))[2]::integer
as you've found, won't interpolate the $1 with the function's first parameter.
The regex is just a string, a string with an internal structure but still just a string. If you know that $1 will be a normal word then you could say:
(regexp_matches(t, E'(' || $1 || E'[\+])([0-9]+)'))[2]::integer
to paste your strings together into a suitable regex. However, it is better to be a little paranoid: sooner or later someone is going to call your function with a string like 'ha ha (', so you should be prepared for it. The easiest way that I can think of to add an arbitrary string to a regex is to escape all the non-word characters:
-- Don't forget to escape the escaped escapes! Hence all the backslashes.
str := regexp_replace($1, E'(\\W)', E'\\\\\\1', 'g');
and then paste str into the regex as above:
(regexp_matches(t, E'(' || str || E'[\+])([0-9]+)'))[2]::integer
or better, build the regex outside the regexp_matches to cut down on the nested parentheses:
re := E'(' || str || E'[\+])([0-9]+)';
-- ...
select (regexp_matches(t, re))[2]::integer ...
PostgreSQL doesn't have Perl's \Q...\E and the (?q) metasyntax applies until the end of the regex so I can't think of any better way to paste an arbitrary string into the middle of a regex as a non-regex literal value than to escape everything and let PostgreSQL sort it out.
Using this technique, we can do things like:
=> do $$
declare
m text[];
s text;
r text;
begin
s = E'''{ha)?';
r = regexp_replace(s, E'(\\W)', E'\\\\\\1', 'g');
r = '(ha' || r || ')';
raise notice '%', r;
select regexp_matches(E'ha''{ha)?', r) into m;
raise notice '%', m[1];
end$$;
and get the expected
NOTICE: ha'{ha)?
output. But if you leave out the regexp_replace escaping step, you'll just get an
invalid regular expression: parentheses () not balanced
error.
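The same escape-everything-non-word trick can be tried outside the database. Below is a Python sketch of it with a hypothetical helper name; Python also ships `re.escape` for this purpose, so the hand-rolled version is only there to mirror the regexp_replace call.

```python
import re


def escape_non_word(s: str) -> str:
    """Prefix every non-word character with a backslash."""
    return re.sub(r"(\W)", r"\\\1", s)


literal = "'{ha)?"
pattern = "(ha" + escape_non_word(literal) + ")"
match = re.search(pattern, "ha'{ha)?")
print(match.group(1))  # ha'{ha)?

# Without the escaping step the pattern does not even compile.
try:
    re.compile("(ha" + literal + ")")
except re.error as exc:
    print("invalid regular expression:", exc)
```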
As an aside, I don't think you need all that casting so I removed it. The regexes and escaping are noisy enough, there's no need to throw a bunch of colons into the mix. Also, I don't know what your standard_conforming_strings is set to or which version of PostgreSQL you're using so I've gone with E'' strings everywhere. You'll also want to switch your procedure to PL/pgSQL (language plpgsql) to make the escaping easier.
How can I parse the value of "request" in the following string in Oracle?
<!-- accountId="123" activity="add" request="add user" -->
The size and the position of the request is random.
You can use regular expressions to find this:
regexp_replace(str, '.*request="([^"]*)".*', '\1')
Use INSTR(givenstring, stringchartosearch,start_position) to find the position of 'request="' and to find the position of the closing '"'.
Then use substr(string, starting_position, length).
You'd use a combination of instr and substr
THIS EXAMPLE IS FOR EXAMPLE PURPOSES ONLY. DO NOT USE IT IN PRODUCTION CODE AS IT IS NOT VERY CLEAN.
substr(my_str,
-- find request=" then get index of next char.
instr(my_str, 'request="') + 9,
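-- Length: position of the second " in the tail starting at request=", minus 10
-- (9 characters for request=" plus the closing " itself). It does not allow for escapes.
instr(substr(my_str, instr(my_str, 'request="')), '"', 1, 2) - 10)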
Below are my tested variations of the answers from cwallenpoole and Craig. For the regexp, note that if "request=" does not exist, the result will be the entire string. user349433 was partly there too; a space before "request=" in the search works just as well:
SET serveroutput ON
DECLARE
l_string VARCHAR2(100) := '<!-- accountId="123" activity="add" request="add user" -->';
l_result_from_substr VARCHAR2(50);
l_result_from_regexp VARCHAR2(50);
BEGIN
SELECT SUBSTR(l_string, instr(l_string, 'request="') + 9, instr(SUBSTR(l_string,instr(l_string, 'request="')), '"', 1, 2)-10),
regexp_replace(l_string, '.* request="([^"]*)".*', '\1')
INTO l_result_from_substr,
l_result_from_regexp
FROM dual;
dbms_output.put_line('Result from substr: '||l_result_from_substr);
dbms_output.put_line('Result from regexp: '||l_result_from_regexp);
END;
/
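For readers who want to compare the two approaches outside Oracle, here is a minimal Python sketch of both. The function names are hypothetical. Unlike the regexp_replace version, which returns the entire string when "request=" is missing, both functions here return None in that case.

```python
import re


def request_by_find(s: str):
    """Position-based extraction, analogous to INSTR plus SUBSTR."""
    marker = 'request="'
    start = s.find(marker)
    if start == -1:
        return None
    start += len(marker)
    end = s.find('"', start)  # closing quote, escapes not handled
    return None if end == -1 else s[start:end]


def request_by_regex(s: str):
    """Regex-based extraction, analogous to the capture group approach."""
    m = re.search(r'request="([^"]*)"', s)
    return m.group(1) if m else None


sample = '<!-- accountId="123" activity="add" request="add user" -->'
print(request_by_find(sample))   # add user
print(request_by_regex(sample))  # add user
```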
Please note the equal sign "=" does not necessarily have to come immediately after the request variable in the assignment. As such, it is not entirely correct to search for "request=". You should create a basic finite state machine using INSTR to first find "request", then find "=", ...
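A minimal sketch of such a scan, written in Python rather than with INSTR so the steps are easy to follow: find "request", skip whitespace, expect "=", skip whitespace, expect an opening quote, then read up to the closing quote. The function name is hypothetical, and the sketch assumes the value contains no escaped quotes and that "request" does not occur as the tail of another attribute name.

```python
def find_request_value(s: str):
    """Scan for request, optional spaces, =, optional spaces, quoted value."""
    key = "request"
    pos = s.find(key)
    while pos != -1:
        i = pos + len(key)
        while i < len(s) and s[i].isspace():  # spaces before "="
            i += 1
        if i < len(s) and s[i] == "=":
            i += 1
            while i < len(s) and s[i].isspace():  # spaces after "="
                i += 1
            if i < len(s) and s[i] == '"':
                end = s.find('"', i + 1)
                if end != -1:
                    return s[i + 1:end]
        pos = s.find(key, pos + 1)  # try the next occurrence
    return None


print(find_request_value('<!-- activity="add" request = "add user" -->'))  # add user
```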