Snowflake SQL Regex ~ Extracting Multiple Values

I am trying to identify a value that is nested in a string using Snowflake's regexp_substr().
The values that I want to access are in quotes:
...
Type:
a:
- !<string>
val: "A"
- !<string>
val: "B"
- !<string>
val: "C"
...
*There is a lot of text above and below this.
I want to extract A, B, and C for all columns, but I am unsure how. I have tried using regexp_substr(), but haven't been able to isolate anything past the first value.
I have tried:
REGEXP_SUBSTR(col, 'Type\\W+(\\w+)\\W+\\w.+\\W+\\w.+')
which yields:
Type: a: - !<string> val: "A"
and while that gives the first portion of the string with "A", I just want a way to access "A", "B", and "C" individually.

This select statement will give you what you want... sorta. You should notice that it will look for a specific occurrence of "val" and then give you the next word characters after that.
Regex, to my knowledge, evaluates to the first occurrence of the expression, so once the pattern is found it's done. You may want to look at Snowflake JavaScript stored procedures to see if you can take the example below and iterate through, incrementing the occurrence argument to produce the expected output.
SELECT REGEXP_SUBSTR('Type: a:- !<string>val: "A" - !<string> val: "B" - !<string> val: "C"','val\\W+(\\w+)', 1, 1, 'e', 1) as A,
REGEXP_SUBSTR('Type: a:- !<string>val: "A" - !<string> val: "B" - !<string> val: "C"','val\\W+(\\w+)', 1, 2, 'e', 1) as B,
REGEXP_SUBSTR('Type: a:- !<string>val: "A" - !<string> val: "B" - !<string> val: "C"','val\\W+(\\w+)', 1, 3, 'e', 1) as C;
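For comparison, the same "one pattern, every occurrence" idea in an engine with a global find-all; this is an illustrative Python sketch (not Snowflake code), using the same string and pattern as the SQL above:

```python
import re

s = 'Type: a: - !<string> val: "A" - !<string> val: "B" - !<string> val: "C"'

# Same pattern as the SQL above: \W+ eats the ': "' between 'val'
# and the value, and the group captures the value itself.
print(re.findall(r'val\W+(\w+)', s))  # ['A', 'B', 'C']
```

This is effectively what incrementing the occurrence argument of REGEXP_SUBSTR does, one call per match.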

You have to extract the values in two stages:
Extract the section of the document below Type: a: containing all the val: "data".
Extract the "data" as an array, or use REGEXP_SUBSTR() + index n to extract the nth element:
SELECT
    'Type:\\s+\\w+:((\\s+- !<string>\\s+val:\\s+"[^"]+")+)' type_section_rx,
    REGEXP_SUBSTR(col, type_section_rx, 1, 1, 'i', 1) vals,
    PARSE_JSON('[0' || REPLACE(vals, REGEXP_SUBSTR(vals, '[^"]+'), ', ') || ']') raw_array,
    ARRAY_SLICE(raw_array, 1, ARRAY_SIZE(raw_array)) val_array,
    val_array[1] B
FROM INPUT_STRING
The result is an array where you can access the first value with the index [0] etc.
The first regexp can be shortened down to a "least effort" 'Type:\\s+\\w+:(([^"]+"[^"]+")+)'.
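The two-stage logic (isolate the Type: section, then pull out every quoted value) can also be sketched outside Snowflake; here is a hypothetical Python illustration of the same idea, with a made-up doc string standing in for the real column:

```python
import re

# Hypothetical document; the real one has "a lot of text above and below".
doc = '''...
Type:
  a:
  - !<string>
    val: "A"
  - !<string>
    val: "B"
  - !<string>
    val: "C"
...'''

# Stage 1: isolate the section under "Type: a:".
section = re.search(r'Type:\s+\w+:((?:\s+- !<string>\s+val:\s+"[^"]+")+)', doc)

# Stage 2: pull every quoted value out of that section.
vals = re.findall(r'"([^"]+)"', section.group(1)) if section else []
print(vals)  # ['A', 'B', 'C']
```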

One more angle: use JavaScript regex capabilities in a UDF.
For example:
create or replace function my_regexp(S text)
returns array
language javascript
as
$$
const re = /(\w+)/g
return [...S.match(re)]
$$
;
Invoked this way:
set S = '
Type:
a:
- !<string>
val: "A"
- !<string>
val: "B"
- !<string>
val: "C"
';
select my_regexp($S);
Yields:
[ "Type", "a", "string", "val", "A", "string", "val", "B", "string", "val", "C" ]
Implementing your full regex is a little more work, but as you can see, this gets around the single-value limitation.
That said, if performance is your priority, I would expect Snowflake's native regex support to outperform this, even though you have to specify the regex multiple times; I haven't tested that, though.

Related

How can I get the type of a JSON value in Presto?

I have a JSON string that I am querying with Presto and I want to aggregate by the types of values. To do this I need to get the value type. Specifically, for JSON like:
{
  "a": 1,
  "b": "a",
  "c": true,
  "d": [ 1 ],
  "e": { "f": "g" }
}
I would like to learn that the value at $.a is a number, the value at $.b is a string, etc. (The information doesn't need to be nested, so it would be good enough to know that $.d is an array and $.e is an object.)
typeof appears to return only varchar or json, depending on how you extract JSON from a string:
SELECT
typeof(json_extract(j, '$.a')),
typeof(json_extract_scalar(j, '$.a'))
FROM (SELECT '{"a":1,"b":"a","c":true,"d":[1]}' AS j);
gives me:
_col0 _col1
json varchar(32)
How can I determine the JSON value type for one of these fields?
json_extract gets a value in a JSON string as the json type.
json_format takes a json value and converts it back to a JSON string.
I think you can hack these two things together to get the type of a value at a position by examining the formatted string.
CASE
WHEN substr(value, 1, 1) = '{' THEN 'object'
WHEN substr(value, 1, 1) = '[' THEN 'array'
WHEN substr(value, 1, 1) = '"' THEN 'string'
WHEN substr(value, 1, 1) = 'n' THEN 'null'
WHEN substr(value, 1, 1) = 't' THEN 'boolean'
WHEN substr(value, 1, 1) = 'f' THEN 'boolean'
ELSE 'number'
END AS t
(where value is json_format(json_extract(json_string, '$.your_json_path')))
Here's an example:
SELECT obj, value,
CASE
WHEN substr(value, 1, 1) = '{' THEN 'object'
WHEN substr(value, 1, 1) = '[' THEN 'array'
WHEN substr(value, 1, 1) = '"' THEN 'string'
WHEN substr(value, 1, 1) = 'n' THEN 'null'
WHEN substr(value, 1, 1) = 't' THEN 'boolean'
WHEN substr(value, 1, 1) = 'f' THEN 'boolean'
ELSE 'number'
END AS t
FROM (
SELECT obj, json_format(json_extract(obj, '$.a')) AS value
FROM (
VALUES
('{"a":1}'),
('{"a":2.5}'),
('{"a":"abc"}'),
('{"a":true}'),
('{"a":false}'),
('{"a":null}'),
('{"a":[1]}'),
('{"a":{"h":"w"}}')
) AS t (obj)
)
Which produces this result:
obj value t
{"a":1} 1 number
{"a":2.5} 2.5 number
{"a":"abc"} "abc" string
{"a":true} true boolean
{"a":false} false boolean
{"a":null} null null
{"a":[1]} [1] array
{"a":{"h":"w"}} {"h":"w"} object
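The first-character trick works because a serialized JSON value always starts with a character that identifies its type. A small Python sketch of the same heuristic (illustrative only, not Presto code):

```python
import json

def json_type(value):
    """Classify a value by the first character of its JSON
    serialization, mirroring the CASE expression above."""
    first = json.dumps(value)[0]
    if first == '{':
        return 'object'
    if first == '[':
        return 'array'
    if first == '"':
        return 'string'
    if first == 'n':
        return 'null'
    if first in 'tf':
        return 'boolean'
    return 'number'

for v in [1, 2.5, "abc", True, False, None, [1], {"h": "w"}]:
    print(json_type(v))
```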

How to compare two json objects using Karate where order of element is to be retained?

I need to compare two JSON objects such that the order of elements is retained while comparing. As Karate's match ignores the order of elements, I am curious to know whether there is a way to do this in Karate.
Not directly; it is normally never needed, since JSON keys can be in any order, like a Map.
But you can do an exact match after converting to a (normalized) string:
* def foo = { a: 1, b: 2 }
* string str1 = foo
* string str2 = { "a": 1, "b": 2 }
* assert str1 == str2
You can also get an ordered list of keys / values at any time:
* def vals = karate.valuesOf(foo)
* match vals == [1, 2]
* def keys = karate.keysOf(foo)
* match keys == ['a', 'b']
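The "exact match after converting to a normalized string" idea can be mimicked in other stacks too; for example, this Python sketch (illustrative, not Karate; it relies on dicts preserving insertion order):

```python
import json

def same_json_same_order(s1, s2):
    # json.loads keeps key order (Python dicts preserve insertion
    # order), so re-serializing normalizes whitespace but not order.
    return json.dumps(json.loads(s1)) == json.dumps(json.loads(s2))

print(same_json_same_order('{"a": 1, "b": 2}', '{ "a": 1, "b": 2 }'))  # True
print(same_json_same_order('{"a": 1, "b": 2}', '{"b": 2, "a": 1}'))    # False
```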

Snowflake SQL, how to lookahead until a certain occurrence of a value

Below is a sample of the text that I am working with.
---
info1:
* val: "A"
---
Type:
* answers:
* - !<string>
* val: "B"
* - !<string>
* val: "C"
---
info2:
* val: "D"
---
And I am trying to select the following text:
Type:
* answers:
* - !<string>
* val: "B"
* - !<string>
* val: "C"
I was trying to use a lookahead, but haven't had much success:
REGEXP_SUBSTR(col, 'Type:(.*---)')
Here I am trying to match up until the next occurrence of '---', but I think I am misunderstanding how it works.
REGEXP_SUBSTR is rather limited in Snowflake's native SQL, so when you tell it to match multiple lines and match newlines via REGEXP_SUBSTR(t, '(Type:.*)---',1,1,'mes',1), the regexp is greedy, thus:
SELECT '---
info1:
* val: "A"
---
Type:
* answers:
* - !<string>
* val: "B"
* - !<string>
* val: "C"
---
info2:
* val: "D"
---' as t
,REGEXP_SUBSTR(t, '(Type:.*)',1,1,'mes',1) as r1
,REGEXP_SUBSTR(t, '(Type:.*)---',1,1,'mes',1) as r2;
gives you too much data:
Type:
* answers:
* - !<string>
* val: "B"
* - !<string>
* val: "C"
---
info2:
* val: "D"
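The greediness is easy to reproduce in any engine. A Python sketch for illustration (Python has the lazy .*? quantifier that would fix it directly; Snowflake's POSIX-style engine does not, which is why splitting first helps):

```python
import re

t = '''---
info1:
* val: "A"
---
Type:
* answers:
* val: "B"
---
info2:
* val: "D"
---'''

# Greedy .* runs to the LAST '---', swallowing the info2 section.
greedy = re.search(r'(Type:.*)---', t, re.S).group(1)

# Lazy .*? stops at the FIRST '---' after 'Type:'.
lazy = re.search(r'(Type:.*?)---', t, re.S).group(1)

print('---' in greedy)  # True: too much data
print('---' in lazy)    # False: just the Type section
```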
So one idea, if --- is always a section marker, is to split the string on that first and then apply the regex:
WITH input as (
select '---
info1:
* val: "A"
---
Type:
* answers:
* - !<string>
* val: "B"
* - !<string>
* val: "C"
---
info2:
* val: "D"
---' as t
)
select t, c.value::string as part, REGEXP_SUBSTR(part, 'Type:.*',1,1,'mes') as r1
from input,
lateral flatten(input=>split(t, '---')) c;
gives
T PART R1
--- info1: * val: "A" --- Type: * answers: * - !<string> * val: "B" * - !<string> * val: "C" --- info2: * val: "D" ---
--- info1: * val: "A" --- Type: * answers: * - !<string> * val: "B" * - !<string> * val: "C" --- info2: * val: "D" --- info1: * val: "A"
--- info1: * val: "A" --- Type: * answers: * - !<string> * val: "B" * - !<string> * val: "C" --- info2: * val: "D" --- Type: * answers: * - !<string> * val: "B" * - !<string> * val: "C" Type: * answers: * - !<string> * val: "B" * - !<string> * val: "C"
--- info1: * val: "A" --- Type: * answers: * - !<string> * val: "B" * - !<string> * val: "C" --- info2: * val: "D" --- info2: * val: "D"
--- info1: * val: "A" --- Type: * answers: * - !<string> * val: "B" * - !<string> * val: "C" --- info2: * val: "D" ---
from which you should be able to progress.
Also, if you need really complex regexps, you can write a JavaScript UDF and use the JavaScript regexp engine.
You don't need regexp lookahead to get the string you want; it's just e.g.
REGEXP_SUBSTR(col, '(^Type:\\s+(^[*].*$\\s+)*)^---', 1, 1, 'm', 1)
If you need regexps with lookahead, etc, use JavaScript RegExps via a function wrapper, eg.
CREATE OR REPLACE FUNCTION RegExp_Match("STRING" VARCHAR, "REGEXP" VARCHAR)
RETURNS VARIANT LANGUAGE JAVASCRIPT STRICT IMMUTABLE AS
'return STRING.match(REGEXP);';
CREATE OR REPLACE FUNCTION RegExp_Match("STRING" VARCHAR, "RX" VARCHAR, "FLAGS" VARCHAR)
RETURNS VARIANT LANGUAGE JAVASCRIPT STRICT IMMUTABLE AS
'return STRING.match(new RegExp(RX, FLAGS));';
SELECT RegExp_Match('<aA>', '(?<=<)(.)\\1(?=>)', 'i');
-- RegExp with lookback, back reference and lookahead ignoring case
=> [ "aA", "a" ]
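For what it's worth, the same pattern also runs under other modern regex engines; here it is under Python's re module (sketch for illustration, same lookbehind, back-reference, and lookahead):

```python
import re

# Same pattern as the UDF call above: lookbehind for '<', one
# captured character, a case-insensitive back-reference to it,
# and a lookahead for '>'.
m = re.search(r'(?<=<)(.)\1(?=>)', '<aA>', re.I)
print(m.group(0), m.group(1))  # aA a
```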
Snowflake does not support look-aheads or look-behinds, but it does support group extraction (and nested group extraction). These can be used within REGEXP_REPLACE or REGEXP_SUBSTR. In this case, I prefer REGEXP_SUBSTR as you are looking to extract from, as opposed to replace within. In my example below you will see both.
You have three dashes (---) that work as your delimiter; the issue is that you also have a dash within your data. I would advise replacing the delimiter with something that will NOT exist within your data; I chose a tilde (~).
The code sample below will work.
Notes:
Column C2 operates on the KEEPS_LEADING_SPACE column which does not remove the leading space. In this case, the leading space is the first capture group.
Column C3 operates on the actual data column but assumes that the leading space is one or more of a space, carriage return, newline, or vertical space.
Capture Groups:
(~([^~]+))? - captures something that starts with a tilde (~), zero or one time, and within it
([^~]+) - captures anything that is not a tilde (~), one or more times
USE ROLE SYSADMIN;
USE WAREHOUSE PUBLIC_WH;
USE UTIL_DB.PUBLIC;
CREATE OR REPLACE TEMP TABLE REGEXP_TEST
AS
SELECT $1::VARIANT AS C1
FROM VALUES
($$
---
info1:
* val: "A"
---
Type:
* answers:
* - !<string>
* val: "B"
* - !<string>
* val: "C"
---
info2:
* val: "D"
---
$$);
SELECT C1
,REPLACE(C1,'---','~') AS KEEPS_LEADING_SPACE
,REGEXP_SUBSTR(KEEPS_LEADING_SPACE,'(~([^~]+))?',1,4,'is') AS C2
,REGEXP_SUBSTR(REGEXP_REPLACE(C1,'[\s\r\n\v]?-{3}','~'),'(~([^~]+))?',1,3,'is') AS C3
FROM REGEXP_TEST
;
Regex in Snowflake does support negated character classes, but I tend to find them hard to work with when negating more than one character; in this case we have only one character to negate, the tilde: [^~].
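The replace-the-delimiter idea generalizes beyond regex entirely: swap the delimiter, then take the nth chunk. A plain-string Python sketch (illustrative, not Snowflake code):

```python
# Hypothetical input mirroring the sample above.
text = '''---
info1:
* val: "A"
---
Type:
* answers:
* val: "B"
---
info2:
* val: "D"
---'''

# Replace the three-dash delimiter with a character that cannot
# occur in the data, then take the nth non-empty chunk.
tidied = text.replace('---', '~')
chunks = [c for c in tidied.split('~') if c.strip()]
print(chunks[1])  # the Type: section
```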

JSONB output from object.query() results in syntax error on raw sql

I am quite new to Django and JSONB, and I use the following syntax to execute a search on JSONB data fields:
obj=SomeModel.objects.filter(data__0__fieldX__contains=search_term)
... and it works as intended. Now, I print out obj.query for the above statement and I get:
SELECT * FROM "somemodel_some_model"
WHERE ("somemodel_some_model"."data"
#> ['0', 'fieldX']) #> '"some lane"'
However, when I excecute the above using:
obj=SomeModel.objects.raw(`query statement above`)
I get an error:
django.db.utils.ProgrammingError: syntax error at or near "["
LINE 3: #> ['0', 'fieldX']) #> '"some lane"'
I presume I am not escaping the "[" correctly; I have tried putting a backslash before it, but that does not seem to help.
What you are doing is something like:
with c(j) as (values('
[
{
"fieldX": -20,
"fieldY": 40
},
{
"fieldX":10,
"fieldY": 0
}
]
'::jsonb))
select j #> ['0', 'fieldX'] from c;
ERROR: syntax error at or near "["
LINE 13: select j #> ['0', 'fieldX'] from c;
What you have to do is something like:
t=# with c(j) as (values('
[
{
"fieldX": -20,
"fieldY": 40
},
{
"fieldX":10,
"fieldY": 0
}
]
'::jsonb))
select j #> '{0,fieldX}' from c;
?column?
----------
-20
(1 row)
https://www.postgresql.org/docs/9.5/static/functions-json.html
text[] is an array, but in Postgres arrays are written as '{...}' or array[...], not just [...].
So
j #> array['0','fieldX'] from c
would also work.

Inverse of `split` function: `join` a string using a delimiter

In Red and Rebol(3), you can use the split function to split a string into a list of items:
>> items: split {1, 2, 3, 4} {,}
== ["1" " 2" " 3" " 4"]
What is the corresponding inverse function to join a list of items into a string? It should work similar to the following:
>> join items {, }
== "1, 2, 3, 4"
There's no built-in function yet; you have to implement it yourself:
>> join: function [series delimiter][length: either char? delimiter [1][length? delimiter] out: collect/into [foreach value series [keep rejoin [value delimiter]]] copy {} remove/part skip tail out negate length length out]
== func [series delimiter /local length out value][length: either char? delimiter [1] [length? delimiter] out: collect/into [foreach value series [keep rejoin [value delimiter]]] copy "" remove/part skip tail out negate length length out]
>> join [1 2 3] #","
== "1,2,3"
>> join [1 2 3] {, }
== "1, 2, 3"
Per request, here is the function split into more lines:
join: function [
series
delimiter
][
length: either char? delimiter [1][length? delimiter]
out: collect/into [
foreach value series [keep rejoin [value delimiter]]
] copy {}
remove/part skip tail out negate length length
out
]
There is an old modification of rejoin doing that
rejoin: func [
"Reduces and joins a block of values - allows /with refinement."
block [block!] "Values to reduce and join"
/with join-thing "Value to place in between each element"
][
block: reduce block
if with [
while [not tail? block: next block][
insert block join-thing
block: next block
]
block: head block
]
append either series? first block [
copy first block
] [
form first block
]
next block
]
Call it like this: rejoin/with [..] delimiter
But I am pretty sure there are other, even older solutions.
The following function works (note the copy: a bare "" literal would be the same series on every call and would accumulate results across invocations):
myjoin: function [blk [block!] delim [string!]][
    outstr: copy ""
    repeat i ((length? blk) - 1)[
        append outstr blk/1
        append outstr delim
        blk: next blk
    ]
    append outstr blk
]
probe myjoin ["A" "B" "C" "D" "E"] ", "
Output:
"A, B, C, D, E"
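For reference, the behavior these hand-rolled functions implement is what str.join does natively in Python (an illustrative comparison, not Red/Rebol code):

```python
# Split keeps the stray spaces, just like the Red example above;
# strip them before joining back with ', '.
items = "1, 2, 3, 4".split(",")
print(items)  # ['1', ' 2', ' 3', ' 4']
print(", ".join(s.strip() for s in items))  # 1, 2, 3, 4
```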