Parse string as JSON with Snowflake SQL

I have a field in one of our database tables that acts as an event-like payload, where all changes to different entities are gathered. See the example below for a single value of that field:
'---\nfield_one: 1\nfield_two: 20\nfield_three: 4\nid: 1234\nanother_id: 5678\nsome_text: Hey you\na_date: 2022-11-29\nutc: this_utc\nanother_date: 2022-11-30\nutc: another_utc'
Since accessing this field with pure SQL is a pain, I was thinking of parsing it as JSON so that it would look like this:
{
"field_one":"1",
"field_two": "20",
"field_three": "4",
"id": "1234",
"another_id": "5678",
"some_text": "Hey you",
"a_date": "2022-11-29",
"utc": "2022-11-29 15:29:28.159296000 Z",
"another_date": "2022-11-30",
"utc": "2022-11-30 13:34:59.000000000 Z"
}
And then just use a Snowflake-native approach to access the values I need.
As you can see, though, there are two keys called utc: the first refers to the first date (a_date) and the second to the second date (another_date). I believe these are nested in the source object, but it's hard to tell from the format of the field.
This is a problem because, after giving the string the format I need and running parse_json(), I can't differentiate one utc from the other, since both keys share the same name.
My SQL so far looks like the following:
select
object,
replace(object, '---\n', '{"') || '"}' as first,
replace(first, '\n', '","') as second_,
replace(second_, ': ', '":"') as third,
replace(third, ' ', '') as fourth,
replace(fourth, ' ', '') as last
from my_table
(Steps third and fourth are needed because I have some fields that have extra spaces in them)
And this actually gives me the format I need, but due to what I mentioned about the utc keys, I cannot parse the string as JSON.
Also note that the structure of the string might change from row to row: some rows might contain two utc keys, while others might have one, and others even five.
Any ideas on how to overcome that?

You can replace just one occurrence with regexp_replace(), which takes optional position and occurrence arguments:
with data as (
    select '---\nfield_one: 1\nfield_two: 20\nfield_three: 4\nid: 1234\nanother_id: 5678\nsome_text: Hey you\na_date: 2022-11-29\nutc: this_utc\nanother_date: 2022-11-30\nutc: another_utc' o
)
select parse_json(last2)
from (
    select o,
           replace(o, '---\n', '{"') || '"}' as first,             -- wrap in braces, open the first key
           replace(first, '\n', '","') as second_,                 -- newlines separate the key/value pairs
           replace(second_, ': ', '":"') as third,                 -- quote the key/value boundaries
           replace(third, ' ', '') as fourth,                      -- strip spaces (as in the question's steps)
           replace(fourth, ' ', '') as last,
           regexp_replace(last, '"utc"', '"utc2"', 1, 2) as last2  -- rename only the 2nd occurrence of "utc"
    from data
)
;
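The occurrence argument fixes exactly one duplicate, though; since some rows can carry up to five utc keys, a more general route is to explode the lines, number repeated keys, and rebuild the object with object_agg(). A sketch (my own, assuming the column is called object as in your query and that values never contain ': '):

with lines as (
    select
        t.object,
        trim(split_part(l.value, ': ', 1)) as key_name,
        trim(split_part(l.value, ': ', 2)) as val,
        row_number() over (
            partition by t.object, trim(split_part(l.value, ': ', 1))
            order by l.index
        ) as occurrence                                            -- 1 for the first utc, 2 for the next, ...
    from my_table t,
         lateral split_to_table(replace(t.object, '---\n', ''), '\n') l
)
select object,
       object_agg(
           iff(occurrence = 1, key_name, key_name || '_' || occurrence),  -- utc, utc_2, utc_3, ...
           val::variant
       ) as parsed
from lines
group by object;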

This may not be what you want, but it seems your problem could be solved if each UTC timestamp replaced the date preceding it, since those keys are not duplicated. You can always derive the dates again once you have the timestamps. If this makes sense, see if you can apply your parse_json solution to this output instead. Note that regexp_replace() with no replacement argument substitutes the empty string, so each <date>\nutc: match is simply deleted:
set str='---\nfield_one: 1\nfield_two: 20\nfield_three: 4\nid: 1234\nanother_id: 5678\nsome_text: Hey you\na_date: 2022-11-29\nutc: 2022-11-29 15:29:28.159296000 Z\nanother_date: 2022-11-30\nutc: 2022-11-30 13:34:59.000000000 Z';
select regexp_replace($str,'[0-9]{4}-[0-9]{2}-[0-9]{2}\nutc:')
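From there the collapsed string can flow through your own replace() pipeline; a condensed sketch of mine (it skips the space-stripping steps so values like Hey you survive; each removed date leaves a stray leading space in the timestamp value, which trim() could clean up):

select parse_json(
           '{"' ||
           replace(
               replace(
                   regexp_replace(ltrim($str, '-\n'), '[0-9]{4}-[0-9]{2}-[0-9]{2}\nutc:'),
                   '\n', '","'),   -- newlines become pair separators
               ': ', '":"')        -- quote the key/value boundaries
           || '"}'
       ) as parsed;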

Related

Pattern match using regexp_extract_all

I am trying to build an array from this string and need help with the pattern for REGEXP_EXTRACT_ALL.
Here is my input string, containing keyword/value pairs:
BEGIN
DECLARE p_JSON STRING DEFAULT """
{
"instances": [{
"LT_20MN_SalesContrctCnt": 388.0,
"Pyramid_Index": '',
"MARKET": "'Growth Markets','Europe'",
"SERVICE_DIM": "'S&C','F&M'",
"SG_MD": "'All Service Group'"
}]}
""";
SELECT split(x,":")[OFFSET(0)] as keyword, split(x,":")[OFFSET(1)] keyword_value
FROM unnest(split(REGEXP_REPLACE(JSON_EXTRACT(p_JSON, '$.instances'),r'([\'\"\[\]{}])', ''))) as x
END;
The above SQL fails at SPLIT due to commas within the data.
All I am trying to do here is build two columns, keyword and value.
The idea is that if I can extract each row using REGEXP_EXTRACT_ALL without the trailing ",", then I should be able to split into keyword and keyword_value columns. By the way, the names and number of keywords/values are not fixed.
Intended output from REGEXP_EXTRACT_ALL:
"LT_20MN_SalesContrctCnt": 388.0
"Pyramid_Index": ''
"MARKET": "'Growth Markets','Europe'"
"SERVICE_DIM": "'S&C','F&M'"
"SG_MD": "'All Service Group'"
I'd appreciate it if you can suggest a better way to handle this.
Thanks in advance.
Using your sample data, I just added an extra REGEXP_REPLACE to turn ," into #" so we can split on # and avoid splitting on ,. See the approach below:
SELECT
SPLIT(arr,":")[OFFSET(0)] as keyword,
SPLIT(arr,":")[OFFSET(1)] as keyword_value,
FROM sample_data,
UNNEST(SPLIT(REGEXP_REPLACE(REGEXP_REPLACE(JSON_EXTRACT(p_JSON, '$.instances'),r'[\[\]{}]',''),r',"','#"'),'#')) arr
Output: (screenshot omitted; the result has one keyword column and one keyword_value column, one row per pair)
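If you'd rather stay closer to the REGEXP_EXTRACT_ALL idea from the question, a sketch (my own pattern, assuming values are either double-quoted strings or bare tokens without commas or brackets, and splitting each extracted pair at its first colon):

SELECT
  TRIM(SUBSTR(pair, 1, STRPOS(pair, ':') - 1), '" ') AS keyword,
  TRIM(SUBSTR(pair, STRPOS(pair, ':') + 1)) AS keyword_value
FROM sample_data,
UNNEST(REGEXP_EXTRACT_ALL(
  JSON_EXTRACT(p_JSON, '$.instances'),
  r'"[^"]+"\s*:\s*(?:"[^"]*"|[^,{}\[\]]+)'  -- one "key": value pair per match
)) AS pair;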

How can I convert a string to JSON in Snowflake?

I have this string {id: evt_1jopsdgqxhp78yqp7pujesee, created: 2021-08-14t16:38:17z} and would like to convert it to JSON. I tried parse_json but got an error; to_variant just converted it to the string "{id: evt_1jopsdgqxhp78yqp7pujesee, created: 2021-08-14t16:38:17z}".
To Gokhan & Simon's point, the original data isn't valid JSON.
If you're 100% (1000%) certain it'll "ALWAYS" come that way, you can treat it as a string parsing exercise and do something like this, but once someone changes the format a bit it'll have an issue.
create temporary table abc (str varchar);
insert into abc values ('{id: evt_1jopsdgqxhp78yqp7pujesee, created: 2021-08-14t16:38:17z}');

select to_json(parse_json(json_str)) as json_json
from (
    select split_part(ltrim(str, '{'), ',', 1) as part_a,   -- first key: value pair
           split_part(rtrim(str, '}'), ',', 2) as part_b,   -- second key: value pair
           split_part(trim(part_a), ': ', 1) as part_a_name,
           split_part(trim(part_a), ': ', 2) as part_a_val,
           split_part(trim(part_b), ': ', 1) as part_b_name,
           split_part(trim(part_b), ': ', 2) as part_b_val,
           '{"'||part_a_name||'":"'||part_a_val||'", "'||part_b_name||'":"'||part_b_val||'"}' as json_str
    from abc);
which returns a valid JSON
{"created":"2021-08-14t16:38:17z","id":"evt_1jopsdgqxhp78yqp7pujesee"}
Overall this is very fragile, but if you must do it, feel free to.
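Under the same it-always-looks-exactly-like-this caveat, two regexp_replace() calls can do the quoting in one pass instead of splitting parts. A sketch (the key pattern [a-z_]+ and the assumption that values never contain a colon followed by a space are mine):

select parse_json(
           regexp_replace(
               regexp_replace(str, '([a-z_]+): ', '"\\1": "'),  -- quote keys, open each value quote
               '(, |})', '"\\1'                                 -- close each value quote before , or }
           )
       ) as parsed
from abc;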
Your JSON is not valid, as you can verify with any online tool:
https://jsonlint.com/
This is a valid JSON version of your data:
{
"id": "evt_1jopsdgqxhp78yqp7pujesee",
"created": "2021-08-14t16:38:17z"
}
And you can parse it successfully using parse_json:
select parse_json('{ "id": "evt_1jopsdgqxhp78yqp7pujesee", "created": "2021-08-14t16:38:17z"}');
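As a side note, if you just need parsing not to blow up on rows that aren't valid JSON, Snowflake also has try_parse_json(), which returns NULL instead of raising an error:

select try_parse_json('{id: evt_1jopsdgqxhp78yqp7pujesee, created: 2021-08-14t16:38:17z}') as parsed;  -- NULL, no error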

Fetching attribute from JSON string with JSON_VAL cause "<attribute> is invalid in the used context" error

A proprietary third-party application stores JSON strings in its database, like this one:
{"state":"complete","timestamp":1614776473000}
I need the timestamp, and I found out that DB2 offers JSON functions. Since the value is stored as a string in the PROF_VALUE column, I guess converting with SYSTOOLS.JSON2BSON is required before I can use JSON_VAL to fetch the timestamp:
SELECT SYSTOOLS.JSON_VAL(SYSTOOLS.JSON2BSON(PROF_VALUE), "timestamp", "f")
FROM EMPINST.PROFILE_EXTENSIONS ext
WHERE PROF_PROPERTY_ID = 'touchpointState'
This causes an error that timestamp is invalid in the used context (SQLCODE=-206, SQLSTATE=42703, DRIVER=4.26.14). The same error is thrown when I remove the JSON2BSON call like this:
SELECT SYSTOOLS.JSON_VAL(PROF_VALUE, "timestamp", "f")
The following also fail with the same error (using different data types):
SELECT SYSTOOLS.JSON_VAL(SYSTOOLS.JSON2BSON(PROF_VALUE), "state", "s:1000")
SELECT SYSTOOLS.JSON_VAL(PROF_VALUE, "state", "s:1000")
I don't understand this error. My syntax follows the documented JSON_VAL ( json-value , search-string , result-type ) and matches the examples, which show how to fetch the name field of an object.
I also played around a bit with JSON_TABLE, using raw input data for testing (instead of the database data), but it does not seem suitable for that.
SELECT *
FROM TABLE(SYSTOOLS.JSON_TABLE( SYSTOOLS.JSON2BSON('{"state":"complete","timestamp":1614776473000}'), 'state','s:32')) DATA
This gave me a table with one row: Type = 2 and Value = complete.
I had two problems in my query. First, it turns out that double quotes " are for object references (identifiers). I wasn't aware there was any difference, because in most databases I had used so far, single quotes ' and double quotes " are interchangeable.
The second problem is that JSON_VAL needs to be called without the SYSTOOLS prefix, while SYSTOOLS.JSON2BSON(PROF_VALUE) still needs it.
With those changes, the following query worked:
SELECT JSON_VAL(SYSTOOLS.JSON2BSON(PROF_VALUE), 'timestamp', 'f')
FROM EMPINST.PROFILE_EXTENSIONS ext
WHERE PROF_PROPERTY_ID = 'touchpointState'
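As a side note, recent Db2 versions (11.1.4 and later) ship the ISO SQL/JSON functions, so JSON_VALUE can read the string directly and SYSTOOLS is not needed at all; a sketch against the same table, assuming such a version is available:

SELECT JSON_VALUE(PROF_VALUE, '$.timestamp' RETURNING BIGINT)
FROM EMPINST.PROFILE_EXTENSIONS ext
WHERE PROF_PROPERTY_ID = 'touchpointState'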

How do I Process This String?

I have some results in one of my tables and the results vary; each ; represents multiple entries in one column, which I need to split out.
Here is my SQL and the results:
select REGEXP_COUNT(value,';') as cnt,
description
from mytable;
1 {Managed By|xBoss}{xBoss xBoss Number|X0910505569}{Time Requested|2009-04-15 20:47:11.0}{Time Arrived|2009-04-15 21:46:11.0};
1 {Managed By|Modern Management}{xBoss Number|}{Time Requested|2009-04-16 14:01:29.0}{Time Arrived|2009-04-16 14:44:11.0};
2 {Managed By|xBoss}{xBoss Number|X091480092}{Time Requested|2009-05-28 08:58:41.0}{Time Arrived|};{Managed By|Jims Allocation}{xBoss xBoss Number|}{Time Requested|}{Time Arrived|};
Desired output:
R1:
Managed By: xBoss
Time Requested:2009-10-19 07:53:45.0
Time Arrived: 2009-10-19 07:54:46.0
R2:
Managed By:Own Arrangements
Number: x5876523
Time Requested: 2009-10-19 07:57:46.0
Time Arrived:
R3:
Managed By: xBoss
Time Requested:2009-10-19 08:07:27.0
select
    SPLIT_PART(description, '}', 1),
    SPLIT_PART(description, '}', 2),
    SPLIT_PART(description, '}', 3),
    SPLIT_PART(description, '}', 4),
    SPLIT_PART(description, '}', 5) as description_with_tag
from mytable;
This is OK when the count is 1, but when there are multiple ; separators in the description it doesn't give me the results I need.
Is it possible to put this into an array based on the count?
First, it's worth pointing out that data in this type of format cannot take advantage of all the benefits that Redshift can offer. Amazon Redshift is a columnar database that can provide amazing performance when data is stored in appropriate columns. However, selecting specific text from a text field will always perform poorly.
Therefore, my main advice would be to pre-process the data into normal rows and columns so that Redshift can provide you the best capabilities.
However, to answer your question, I would recommend making a Scalar User-Defined Function:
CREATE FUNCTION f_extract_curly (s TEXT, key TEXT)
RETURNS TEXT
STABLE
AS $$
# List of items in {brackets} (strip the trailing semi-colon before slicing off the outer braces)
items = s.strip(';')[1:-1].split('}{')
# Dictionary of Key|Value from items
entries = {i.split('|')[0]: i.split('|')[1] for i in items}
# Return desired value
return entries.get(key, None)
$$ LANGUAGE plpythonu;
I loaded sample data with:
CREATE TABLE foo (
description TEXT
);
INSERT INTO foo values('{Managed By|xBoss}{xBoss xBoss Number|X0910505569}{Time Requested|2009-04-15 20:47:11.0}{Time Arrived|2009-04-15 21:46:11.0};');
INSERT INTO foo values('{Managed By|Modern Management}{xBoss Number|}{Time Requested|2009-04-16 14:01:29.0}{Time Arrived|2009-04-16 14:44:11.0};');
INSERT INTO foo values('{Managed By|xBoss}{xBoss Number|X091480092}{Time Requested|2009-05-28 08:58:41.0}{Time Arrived|};{Managed By|Jims Allocation}{xBoss xBoss Number|}{Time Requested|}{Time Arrived|};');
Then I tested it with:
SELECT
f_extract_curly(description, 'Managed By'),
f_extract_curly(description, 'Time Requested')
FROM foo
and got the result:
xBoss 2009-04-15 20:47:11.0
Modern Management 2009-04-16 14:01:29.0
xBoss
It doesn't know how to handle lines that have the same field specified twice (with semi-colons between). You did not provide enough sample input and output lines for me to figure out what you wanted in such situations, but feel free to tweak the code for your requirements.
There is no array data type in Redshift. There are a few options:
1) First split_part by ';', then union the results separately for every index of the first split_part output, then split_part those results by '}' and finally get what you need (sketched below).
2) Create a Python UDF and process these strings with Python. I guess this is the best solution for your use case.
3) Transform your data outside Redshift. From your data structure it seems much better to process it before copying to Redshift, unnesting the arrays into rows and extracting keys from your objects into columns.
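For completeness, a sketch of option 1, under the assumption that no row holds more than two ';'-separated records (add a UNION ALL branch per extra record):

SELECT 1 AS record_no, SPLIT_PART(description, ';', 1) AS record
FROM mytable
WHERE SPLIT_PART(description, ';', 1) <> ''
UNION ALL
SELECT 2, SPLIT_PART(description, ';', 2)
FROM mytable
WHERE SPLIT_PART(description, ';', 2) <> '';

Each resulting record can then go through the SPLIT_PART(record, '}', n) logic from the question.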

Calculating total value of Numeric values assigned to string tokens in SQL column

This is kind of complicated, so bear with me.
I've got the basic concept figured out thanks to THIS QUESTION
SELECT LENGTH(col) - LENGTH(REPLACE(col, 'Y', ''))
Very clever solution.
Problem is, I'm trying to count the number of instances of a string token, then take that counter and multiply it by a modifier that represents the string's numeric value. Oh, and I've got a list of 50-ish tokens, each with a different value.
So, take the string "{5}{X}{W}{b/r}{2/u}{pg}"
Looking up the list of tokens and their numeric value, we get this:
{5} 5
{X} 0
{W} 1
{b/r} 1
{2/u} 2
{pg} 1
.... ....
Therefore, the sum value of the string above is 5+0+1+1+2+1 = 10
Now, what I'm really looking for is a way to do a join and perform the aforementioned replace-token-get-length trick for each row of the TokenValue table.
Does that make sense?
Pseudo-SQL example:
SELECT StringColumn, TotalTokenValue
???
FROM TableWithString, TokenValueTable
Perhaps this would work better as a custom Function?
EDIT
I think I'm about halfway there, but it's kind of ugly.
SELECT StringColumn, LEN(StringColumn) AS TotalLen, Token,
    { fn LENGTH(Token) } AS TokenLength, TokenValue,
    { fn REPLACE(StringColumn, Token, '') } AS Replaced,
    { fn LENGTH(Replaced) } AS RepLen,
    { (TotalLen - RepLen) / TokenLength } AS TokenCount,
    { TokenCount * TokenValue } AS CalculatedTokenValue
FROM StringTable CROSS JOIN TokenTable
Then I need to wrap that in some kind of Group By and get SUM(CalculatedTokenValue)
I can picture it in my head, having trouble getting the SQL to work.
If you create a view like this one:
Create or replace view ColumnsTokens as
select StringColumn, Token, TokenValue,
(length(StringColumn) - length(replace(StringColumn, token, ''))) / length(Token) TokenCount
from StringTable
join TokenTable on replace(StringColumn, Token, '') <> StringColumn ;
that will act as a many-to-many relationship table between columns and tokens, and I think you can easily write any query you need. For example, this will give you the total score:
select StringColumn, sum(TokenCount * TokenValue) TotalTokenScore
from ColumnsTokens
group by StringColumn ;
(You're only missing the StringColumns with no tokens)
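If you need those too, a left-join variant of the same counting trick keeps token-less strings with a zero score; a sketch with the same table and column names:

select s.StringColumn,
       coalesce(sum((length(s.StringColumn) - length(replace(s.StringColumn, t.Token, '')))
                    / length(t.Token) * t.TokenValue), 0) as TotalTokenScore
from StringTable s
left join TokenTable t
       on replace(s.StringColumn, t.Token, '') <> s.StringColumn
group by s.StringColumn;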