Parsing JSON file in Snowflake Database - sql

Database :SNOWFLAKE
My table contains JSON data for example:
{
"bucket":"IN_Apps",
"bySeqno":56,
"cas":1527639206906626048,
"content":"eyJoaWdoQmluIjoiNTQ4NTA4MDkiLCJkb2N1bWVudFR5cGUiOiJJSU5ETyIsImNhcmRUeXBlIyayI6Ik1BU1RFUkNBUkQifQ==",
"event":"mutation",
"expiration":0,
"flags":33554432,
"key":"iin54850809",
"lockTime":0,
"partition":948,
"revSeqno":1,
"vBucketUuid":137987627737694
}
when i tried to parse it.
select
parse_json:bucket::string as bucket ,
parse_json:bySeqno::string as bySeqno ,
parse_json:cas::INT as cas ,
parse_json:content::string as content ,
parse_json:event::string as event
,parse_json:expiration::string as expiration
,parse_json:flags::string as flags
,parse_json:key::string as key
,parse_json:lockTime::string as lockTime
,parse_json:partition::string as partition
,parse_json:revSeqno::string as revSeqno
,parse_json:vBucketUuid::string as vBucketUuid
from STG_YS_APPS v
but it is throwing error like.
SQL compilation error: error line 2 at position 0 invalid identifier > >'PARSE_JSON'
may someone please help me.

Answer with a known schema
Update: Since you provided schema, which shows a VAR column of VARIANT type, here's what you need, couldn't be simpler:
select
var:bucket::string as bucket,
var:bySeqno::string as bySeqno,
var:cas::int as cas
...
from STG_YS_APPS v
Below the answer before the schema was known
I'll assume you have a VARCHAR (or similar) column in your table that is called json, and stores the values you presented. You didn't provide the schema, so please adjust the column name as necessary.
You're not using PARSE_JSON as a function in your SQL. You should write something like
select
parse_json(json):bucket::string as bucket,
parse_json(json):bySeqno::string as bySeqno,
parse_json(json):cas::int as cas
...
from STG_YS_APPS v

Related

JSON stored in SUPER type fails to select camelcase element. Too long to be serialized. How can I select?

Summary:
I am working with a large JSON that is stored in a redshift SUPER type.
Context
This issue is near identical to the question posted here for TSQL. My schema:
chainId BIGINT
properties SUPER
Sample data:
{
"chainId": 5,
"$browser": "Chrome",
"token": "123x5"
}
I have this as a column in my table called properties.
Desired behavior
I want to be able to retrieve the value 5 from the chainId key and store it in a BIGINT column.
What I've tried
I have referenced the following aws docs:
https://docs.aws.amazon.com/redshift/latest/dg/JSON_EXTRACT_PATH_TEXT.html
https://docs.aws.amazon.com/redshift/latest/dg/r_SUPER_type.html
https://docs.aws.amazon.com/redshift/latest/dg/super-overview.html
I have tried the following which haven't worked for me:
SELECT
properties.chainId::varchar as test1
, properties.chainId as test2
, properties.chainid as test3
, properties."chainId" as test4
, properties."chainid" as test5
, json_extract_path_text(json_serialize(properties), 'chainId') serial_then_extract
, properties[0].chainId as testval1
, properties[0]."chainId" as testval2
, properties[0].chainid as testval3
, properties[0]."chainid" as testval4
, properties[1].chainId as testval5
, properties[1]."chainId" as testval6
FROM clean
Of these, the attempt, serial_then_extract returned a not null, correct value, but not all of the values in my properties field are short enough to serialize, so this only works on some of the rows.
All others return null.
Referencing the following docs: https://docs.aws.amazon.com/redshift/latest/dg/query-super.html#unnest I have also attempted to iterate over the super type using partisql:
SELECT ps.*
, p.chainId
from clean ps, ps.properties p
where 1=1
But this returns no rows.
I also tried the following:
select
properties
, properties.token
, properties."$os"
from base
And this returned rows with values. I know that there is a chainId value as I've checked the corresponding key and am working with sample data.
What am I missing? What else should I be trying?
Does anyone know if this has to do with the way that the JSON key is formatted? [camelcase]
You need to enable case sensitive identifiers. By default Redshift maps everything to lower case for table and column names. If you have mixed case identifiers like in your super field you need to enable case sensitivity with
SET enable_case_sensitive_identifier TO true;
See: https://docs.aws.amazon.com/redshift/latest/dg/r_enable_case_sensitive_identifier.html

timestamp VS TIMESTAMP_NTZ in snowflake sql

I am using an sql script to parse a json into a snowflake table using dbt.
One of the cols contain this datetime value: '2022-02-09T20:28:59+0000'.
What's the correct way to define ISO datetime's data type in Snowflake?
I tried date, timestamp and TIMESTAMP_NTZ like this in my dbt sql script:
JSON_DATA:",my_date"::TIMESTAMP_NTZ AS MY_DATE
but clearly, these aren't the correct one because later on when I test it in snowflake with select * , I get this error:
SQL Error [100040] [22007]: Date '2022-02-09T20:28:59+0000' is not recognized
or
SQL Error [100035] [22007]: Timestamp '2022-02-13T03:32:55+0100' is not recognized
so I need to know which Snowflake time/date data type suits the best for this one
EDIT:
This is what I am trying now.
SELECT
JSON_DATA:"date_transmission" AS DATE_TRANSMISSION
, TO_TIMESTAMP(DATE_TRANSMISSION:text, 'YYYY-MM-DDTHH24:MI:SS.FFTZH:TZM') AS DATE_TRANSMISSION_TS_UTC
, JSON_DATA:"authorizerClientId"::text AS AUTHORIZER_CLIENT_ID
, JSON_DATA:"apiPath"::text API_PATH
, MASTERCLIENT_ID
, META_FILENAME
, META_LOAD_TS_UTC
, META_FILE_TS_UTC
FROM {{ source('INGEST_DATA', 'TABLENAME') }}
I get this error:
000939 (22023): SQL compilation error: error line 6 at position 4
10:21:46 too many arguments for function [TO_TIMESTAMP(GET(DATE_TRANSMISSION, 'text'), 'YYYY-MM-DDTHH24:MI:SS.FFTZH:TZM')] expected 1, g
However, if I comment out the the first 2 lines(related to timpstamp types), the other two work perfectly fine. What's the correct syntax of parsing json with TO_TIMESTAMP?
Not that JSON_DATA:"apiPath"::text API_PATH gives the correct value for it in my snowflake tables.
Did some testing and it seems you have 2 options.
You can either get rid of the +0000 at the end: left(column_date, len(column_date)-5)
or try_to_timestamp with format
try_to_timestamp('2022-02-09T20:28:59+0000','YYYY-MM-DD"T"HH24:MI:SS+TZHTZM')
TZH and TZM are TimeZone Offset Hours and Minutes
So there are 2 main points here.
when getting data from JSON to pass to any of the timestamp functions that want a ::TEXT object, but the values to get from JSON are still ::VARIANT so they need to be cast. This is the cause of the error you quote
(22023): SQL compilation error: error line 6 at position 4
10:21:46 too many arguments for function [TO_TIMESTAMP(GET(DATE_TRANSMISSION, 'text'), 'YYYY-MM-DDTHH24:MI:SS.FFTZH:TZM')] expected 1, g
also your SQL is wrong there it should have been
TO_TIMESTAMP(DATE_TRANSMISSION::text,
How you handle the timezone format.As other have noted you (as I did in your last question) do you want to ignore the timezone values or read them. I forgot about the TZHTZM formatting. Given you have timezone data, you should use the TO_TIMESTAMP_TZ`TRY_TO_TIMESTAMP_TZto make sure the time zone data is keep, given you second example shows+0100`
putting those together (assuming you didn't want an extra date_transmission as a variant in you data) :
SELECT
TO_TIMESTAMP_TZ(JSON_DATA:"date_transmission"::text, 'YYYY-MM-DDTHH24:MI:SS+TZHTZM') AS DATE_TRANSMISSION_TS_UTC
, JSON_DATA:"authorizerClientId"::text AS AUTHORIZER_CLIENT_ID
, JSON_DATA:"apiPath"::text AS API_PATH
, MASTERCLIENT_ID
, META_FILENAME
, META_LOAD_TS_UTC
, META_FILE_TS_UTC
FROM {{ source('INGEST_DATA', 'TABLENAME') }}
You should use timestamp (not date which does not store the time information), but probably the format you are using is not autodetected. You can specify the input format as YYYY-MM-DD"T"HH24:MI:SSTZHTZM as shown here. The autodetected one has a : between the TZHTZM.

How to resolve this sql error of schema_of_json

I need to find out the schema of a given JSON file, I see sql has schema_of_json function
and something like this works flawlessly
> SELECT schema_of_json('[{"col":0}]');
ARRAY<STRUCT<`col`: BIGINT>>
But if I query for my table name, it gives me the following error
>SELECT schema_of_json(Transaction) as json_data from table_name;
Error in SQL statement: AnalysisException: cannot resolve 'schemaofjson(`Transaction`)' due to data type mismatch: The input json should be a string literal and not null; however, got `Transaction`.; line 1 pos 7;
The Transaction is one of the columns in my table and after checking it manually I can attest that it is of String type(json).
The SQL statement has it to give me the schema of the JSON, how to do it?
after looking further into the documentation that it is clear that the word foldable means that of the static one, and a column from a table JSON won't work
for minimal reroducible example here you go:
SELECT schema_of_json(CAST('{ "a": "b" }' AS STRING))
As soon as the cast is introduced in the above statement, the schema_of_json will fail......... It needs a static JSON as it's input

Fetching attribute from JSON string with JSON_VAL cause "<attribute> is invalid in the used context" error

A proprietary third-party application stores JSON strings in it's database like this one:
{"state":"complete","timestamp":1614776473000}
I need the timestamp and found out that
DB2 offers JSON functions. Since it's stored as string in the PROF_VALUE column, I guess that converting with SYSTOOLS.JSON2BSON is required, before I can use JSON_VAL to fetch the timestamp:
SELECT SYSTOOLS.JSON_VAL(SYSTOOLS.JSON2BSON(PROF_VALUE), "timestamp", "f")
FROM EMPINST.PROFILE_EXTENSIONS ext
WHERE PROF_PROPERTY_ID = 'touchpointState'
This causes an error that timestamp is invalid in the used context ( SQLCODE=-206, SQLSTATE=42703, DRIVER=4.26.14). The same error is thown when I remove the JSON2BSON call like this
SELECT SYSTOOLS.JSON_VAL(PROF_VALUE, "timestamp", "f")
Also not working with the same error (different data-types):
SELECT SYSTOOLS.JSON_VAL(SYSTOOLS.JSON2BSON(PROF_VALUE), "state", "s:1000")
SELECT SYSTOOLS.JSON_VAL(PROF_VALUE) "state", "s:1000")
I don't understand this error. My syntax is like the documented JSON_VAL ( json-value , search-string , result-type) and it is the same like in the examples, where they show how to fetch the name field of an object.
I also played around a bit with JSON_TABLE to use raw input data for testing (instead of the database data), but it seems not suiteable for that.
SELECT *
FROM TABLE(SYSTOOLS.JSON_TABLE( SYSTOOLS.JSON2BSON('{"state":"complete","timestamp":1614776473000}'), 'state','s:32')) DATA
This gave me a table with one row: Type = 2 and Value = complete.
I had two problems in my query: First it seems that double quotes " are for object references. I wasn't aware that there is any difference, because in most databases I used yet, both single ' and double quotes " are equal.
The second problem is, that JSON_VAL needs to be called without SYSTOOLS, but the reference is still needed on SYSTOOLS.JSON2BSON(PROF_VALUE).
With those changes, the following query worked:
SELECT JSON_VAL(SYSTOOLS.JSON2BSON(PROF_VALUE), 'timestamp', 'f')
FROM EMPINST.PROFILE_EXTENSIONS ext
WHERE PROF_PROPERTY_ID = 'touchpointState'

Regex comparison in Oracle between 2 varchar columns (from different tables)

I am trying to find a way to capture relevant errors from oracle alertlog. I have one table (ORA_BLACKLIST) with column values as below (these are the values which I want to ignore from
V$DIAG_ALERT_EXT)
Below are sample data in ORA_BLACKLIST table. This table can grow based on additional error to ignore from alertlog.
ORA-07445%[kkqctdrvJPPD
ORA-07445%[kxsPurgeCursor
ORA-01013%
ORA-27037%
ORA-01110
ORA-2154
V$DIAG_ALERT_EXT contains a MESSAGE_TEXT column which contains sample text like below.
ORA-01013: user requested cancel of current operation
ORA-07445: exception encountered: core dump [kxtogboh()+22] [SIGSEGV] [ADDR:0x87] [PC:0x12292A56]
ORA-07445: exception encountered: core dump [java_util_HashMap__get()] [SIGSEGV]
ORA-00600: internal error code arguments: [qercoRopRowsets:anumrows]
I want to write a query something like below to ignore the black listed errors and only capture relevant info like below.
select
dae.instance_id,
dae.container_name,
err_count,
dae.message_level
from
ORA_BLACKLIST ob,
V$DIAG_ALERT_EXT dae
where
group by .....;
Can someone suggest a way or sample code to achieve it?
I should have provided the exact contents of blacklist table. It currently contains some regex (perl) and I want to convert it to oracle like regex and compare with v$diag_alert_ext message_text column. Below are sample perl regex in my blacklist table.
ORA-0(,|$| )
ORA-48913
ORA-00060
ORA-609(,|$| )
ORA-65011
ORA-65020 ORA-31(,|$| )
ORA-7452 ORA-959(,|$| )
ORA-3136(,|)|$| )
ORA-07445.[kkqctdrvJPPD
ORA-07445.[kxsPurgeCursor –
Your blacklist table looks like like patterns, not regular expressions.
You can write a query like this:
select dae.* -- or whatever columns you want
from V$DIAG_ALERT_EXT dae
where not exists (select 1
from ORA_BLACKLIST ob
where dae.message_text like ob.<column name>
);
This will not have particularly good performance if the tables are large.