Convert Big Query DDL to Snowflake DDL - google-bigquery

There is a Big Query table with nested columns, the DDL for the table is:
CREATE TABLE stg.raw_data.user_events
(
pipeline_metadata STRUCT<uuid STRING, timestamp TIMESTAMP, restream_count INT64, pubsub_subscription_name STRING>,
event_name STRING,
insertId STRING,
eventID STRING,
timestamp TIMESTAMP,
metadata STRUCT<userID INT64, uuid STRING, sessionID STRING, platform STRING, prime_status STRING, created_at TIMESTAMP, city STRING, country STRING, embed_url STRING, embed_domain STRING, deviceID STRING, device_type STRING, state STRING, metadata_schema_version STRING, schema_version STRING, ipAddress STRING>,
properties STRING
)
PARTITION BY DATE(timestamp);
Sample data from this table looks like this:
[{
"pipeline_metadata": {
"uuid": "d2aae738-ddbe-43fc-b1a0-a0a3d7f4a66c",
"timestamp": "2022-10-06 00:44:53.804000 UTC",
"restream_count": "0",
"pubsub_subscription_name": "raw-user-events-data-ingestor-subscription"
},
"event_name": "search",
"insertId": "5751196922008325",
"eventID": "rCz_ZG70T-CYlJ7uAgWhvxOeRKqoevLb",
"timestamp": "2022-10-06 00:44:52.792000 UTC",
"metadata": {
"userID": "2235338",
"uuid": "3cf13e3f499339f45bc1a344bd5c83866ebae85b6c89255ff203467552157aaa",
"sessionID": "140b56af96c6331e8e5cc88721143d788a30b3c35fe28dac2c2e3ceeb1ee4948",
"platform": "web-app",
"prime_status": "subscription",
"created_at": "2022-10-06 00:44:52.767000 UTC",
"city": "Woodside",
"country": "United States",
"embed_url": null,
"embed_domain": null,
"deviceID": "db0f3955624d5541b587ac4661a25dfd",
"device_type": "desktop",
"state": "New York",
"metadata_schema_version": "latest",
"schema_version": "latest",
"ipAddress": null
},
"properties": "{\"action\":\"focus-on-input\",\"source\":\"dashboard\",\"page_path\":\"/home/dashboard\"}"
}]
I am trying to create a similar table in Snowflake. The nested column structure should be retained, because this is a migration from Big Query to Snowflake and the structure should stay the same in both warehouses.
I was able to prepare a table DDL for Snowflake, but the columns are not nested the way they are in the Big Query table.
CREATE or replace TABLE test.dbt."USER_EVENTS"
(
pipeline_metadata variant,
event_name STRING,
insertId STRING,
eventID STRING,
time_stamp TIMESTAMP,
metadata variant,
properties STRING
)
CLUSTER BY (DATE(time_stamp));
How can I build a DDL for the Snowflake table so that its structure is similar to that of the Big Query table?

Depending on how you define "similar" (between DBMSes that have no direct correspondence for a given feature), you could simply flatten out the STRUCTs:
CREATE TABLE stg.raw_data.user_events
(
uuid STRING,
meta_timestamp TIMESTAMP,
restream_count INT,
pubsub_subscription_name STRING,
event_name STRING,
insertId STRING,
eventID STRING,
time_stamp TIMESTAMP,
userID INT,
user_uuid STRING,
sessionID STRING,
platform STRING,
prime_status STRING,
created_at TIMESTAMP,
city STRING,
country STRING,
embed_url STRING,
embed_domain STRING,
deviceID STRING,
device_type STRING,
state STRING,
metadata_schema_version STRING,
schema_version STRING,
ipAddress STRING,
properties VARIANT
)
CLUSTER BY (time_stamp)
That is certainly somewhat similar; whether it is similar enough is not for us to decide.
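If "similar" should instead mean keeping the nesting itself, another option (closer to your own attempt) is to keep the two STRUCT columns as VARIANT and reach into them with Snowflake's colon/path notation. A minimal sketch, reusing the VARIANT-based DDL from the question:
CREATE OR REPLACE TABLE test.dbt.user_events
(
pipeline_metadata VARIANT, -- nested pipeline_metadata object
event_name STRING,
insertId STRING,
eventID STRING,
time_stamp TIMESTAMP,
metadata VARIANT, -- nested metadata object
properties STRING
)
CLUSTER BY (DATE(time_stamp));
-- nested fields remain addressable without flattening, e.g.:
SELECT
metadata:userID::NUMBER AS userID,
metadata:sessionID::STRING AS sessionID,
pipeline_metadata:uuid::STRING AS pipeline_uuid
FROM test.dbt.user_events;
The rows would then be loaded as JSON (e.g. via PARSE_JSON or COPY INTO with a JSON file format), which is a different trade-off than the flattened layout above.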

Related

How do I use BigQuery DML to transform some fields of a struct nested within an array, within a struct, within an array?

I think this is a more complex version of the question in Update values in struct arrays in BigQuery.
I'm trying to update some of the fields in a struct, where the struct is heavily nested. I'm having trouble creating the SQL to do it. Here's my table schema:
CREATE TABLE `my_dataset.test_data_for_so`
(
date DATE,
hits ARRAY<STRUCT<search STRUCT<query STRING, other_column STRING>, metadata ARRAY<STRUCT<key STRING, value STRING>>>>
);
Here's the data I've inserted:
INSERT INTO `my_dataset.test_data_for_so` (date, hits)
VALUES (
CAST('2021-01-01' AS date),
[
STRUCT(
STRUCT<query STRING, other_column STRING>('foo bar', 'foo bar'),
[
STRUCT<key STRING, value STRING>('foo bar', 'foo bar')
]
)
]
)
My goal is to transform the "search.query" and "metadata.value" fields. For example, uppercasing them, leaving every other column (and every other struct field) in the row unchanged.
I'm looking for a solution involving either manually specifying each column in the SQL, or preferably, one where I can only mention the columns/fields I want to transform in the SQL, omitting all other columns/fields. This is a minimal example. The table I'm working on in production has hundreds of columns and fields.
For example, that row, when transformed this way, would change from:
[
{
"date": "2021-01-01",
"hits": [
{
"search": {
"query": "foo bar",
"other_column": "foo bar"
},
"metadata": [
{
"key": "foo bar",
"value": "foo bar"
}
]
}
]
}
]
to:
[
{
"date": "2021-01-01",
"hits": [
{
"search": {
"query": "FOO BAR",
"other_column": "foo bar"
},
"metadata": [
{
"key": "foo bar",
"value": "FOO BAR"
}
]
}
]
}
]
preferably, one where I can only mention the columns/fields I want to transform in the SQL ...
Use the approach below. It does exactly what you wish: only those fields that are to be updated are referenced, and all others (tens or hundreds ...) are preserved as is.
update your_table
set hits = array(
select as struct *
replace(
(select as struct * replace (upper(query) as query) from unnest([search])) as search,
array(select as struct * replace(upper(value) as value) from unnest(metadata)) as metadata
)
from unnest(hits)
)
where true;
If applied to the sample data in your question, this produces the uppercased result shown above.
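If you want to preview the change before running the UPDATE, the same REPLACE expressions can be used in a plain SELECT first; a minimal sketch of that idea:
SELECT
date,
ARRAY(
SELECT AS STRUCT *
REPLACE(
(SELECT AS STRUCT * REPLACE(UPPER(query) AS query) FROM UNNEST([search])) AS search,
ARRAY(SELECT AS STRUCT * REPLACE(UPPER(value) AS value) FROM UNNEST(metadata)) AS metadata
)
FROM UNNEST(hits)
) AS hits
FROM `my_dataset.test_data_for_so`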

postgresql filter data from bytea column

I have a table where I am saving data in a column of type bytea; the data is actually a JSON object.
I need to implement a filter on the JSON data.
SELECT cast(job_data::TEXT as jsonb) FROM job_details where job_data ->> "organization" = "ABC";
This query does not work.
The JSON Object looks like
{
"uid": "FdUR4SB0h7",
"Type": "Reference Data Service",
"user": "hk#ss.com",
"SubType": "Reference Data Task",
"_version": 1,
"Frequency": "Once",
"Parameters": "sdfsdfsdfds",
"organization": "ABC",
"StartDateTime": "2020-01-20T10:30:00Z"
}
You need to filter on the converted column. Also, that conversion may not necessarily work, depending on the encoding. Try something like this:
SELECT
*
FROM
job_details
WHERE
convert_from(job_data, 'UTF-8')::json ->> 'organization' = 'ABC';
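If you also need the JSON itself in the result set (not only in the WHERE clause), the same conversion can be repeated in the select list; a small sketch, assuming the bytea column really holds UTF-8 encoded JSON:
SELECT
convert_from(job_data, 'UTF-8')::jsonb AS job_data_json
FROM
job_details
WHERE
convert_from(job_data, 'UTF-8')::jsonb ->> 'organization' = 'ABC';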

Can I convert a stringified JSON array back to a BigQuery structure?

I'm trying to take a STRING field that contains a nested JSON structure from a table called my_old_table, extract a nested array called "alerts" from it, then insert it into a column in a new table called my_new_table. The new column is defined as:
ARRAY<STRUCT<cuid STRING, title STRING, created TIMESTAMP>>
I'm using this SQL:
INSERT INTO my_dataset.my_table(
id, alerts)
SELECT id, JSON_EXTRACT(extra, "$.alerts") AS content_alerts
FROM my_dataset.my_old_table
This gives me:
Query column 2 has type STRING which cannot be inserted into column content_alerts, which has type ARRAY<STRUCT<cuid STRING, title STRING, created TIMESTAMP>> at [4:1]
I don't see a way of parsing the extracted string back to a structure... Is there another way to do this?
Edit:
The original value is a json string that looks like this:
{
"id": "bar123",
"value": "Test",
"title": "Test",
"alerts": [
{
"id": "abc123",
"title": "Foo",
"created": "2020-01-17T23:18:59.769908Z"
},
{
"id": "abc124",
"title": "Accepting/Denying Claims",
"created": "2020-01-17T23:18:59.769908Z"
}
]
}
I want to extract $.alerts and insert it into the ARRAY<STRUCT<cuid STRING, title STRING, created TIMESTAMP>> somehow.
Edit #2
To clarify, this reproduces the issue:
CREATE TABLE insights.my_table
(
id string,
alerts ARRAY<STRUCT<cuid STRING, title STRING, created TIMESTAMP>>
);
CREATE TABLE insights.my_old_table
(
id string,
field STRING
);
INSERT INTO insights.my_old_table(id, field)
VALUES("1", "{\"id\": \"bar123\",\"value\": \"Test\",\"title\": \"Test\",\"alerts\":[{\"id\": \"abc123\",\"title\": \"Foo\",\"created\": \"2020-01-17T23:18:59.769908Z\"},{\"id\": \"abc124\",\"title\": \"Accepting/Denying Claims\",\"created\": \"2020-01-17T23:18:59.769908Z\"}]}");
Based on the above setup, I don't know how to extract "alerts" from the STRING field and insert it into the STRUCT field. I thought I could add a JSON parse step in there, but I don't see any BigQuery feature for that. Alternatively, there might be a way to manipulate JSON as a STRUCT, but I don't see that either. As a result, this is as close as I could get:
INSERT INTO insights.my_table(id, alerts)
SELECT id, JSON_EXTRACT(field, "$.alerts") AS alerts FROM insights.my_old_table
I'm sure there's something I'm missing here.
Below for BigQuery Standard SQL
#standardSQL
CREATE TEMP FUNCTION JsonToItems(input STRING)
RETURNS ARRAY<STRING>
LANGUAGE js AS """
return JSON.parse(input).map(x=>JSON.stringify(x));
""";
SELECT
JSON_EXTRACT_SCALAR(extra, "$.id") AS id,
ARRAY(
SELECT AS STRUCT
JSON_EXTRACT_SCALAR(alert, "$.id") AS cuid,
JSON_EXTRACT_SCALAR(alert, "$.title") AS title,
TIMESTAMP(JSON_EXTRACT_SCALAR(alert, "$.created")) AS created
FROM UNNEST(JsonToItems(JSON_EXTRACT(extra, "$.alerts"))) alert
) AS alerts,
FROM `project.dataset.my_old_table`
You can test and play with the above using the sample data from your question, as in the example below:
#standardSQL
CREATE TEMP FUNCTION JsonToItems(input STRING)
RETURNS ARRAY<STRING>
LANGUAGE js AS """
return JSON.parse(input).map(x=>JSON.stringify(x));
""";
WITH `project.dataset.my_old_table` AS (
SELECT '''
{
"id": "bar123",
"value": "Test",
"title": "Test",
"alerts": [
{
"id": "abc123",
"title": "Foo",
"created": "2020-01-17T23:18:59.769908Z"
},
{
"id": "abc124",
"title": "Accepting/Denying Claims",
"created": "2020-01-17T23:18:59.769908Z"
}
]
}
''' extra
)
SELECT
JSON_EXTRACT_SCALAR(extra, "$.id") AS id,
ARRAY(
SELECT AS STRUCT
JSON_EXTRACT_SCALAR(alert, "$.id") AS cuid,
JSON_EXTRACT_SCALAR(alert, "$.title") AS title,
TIMESTAMP(JSON_EXTRACT_SCALAR(alert, "$.created")) AS created
FROM UNNEST(JsonToItems(JSON_EXTRACT(extra, "$.alerts"))) alert
) AS alerts,
FROM `project.dataset.my_old_table`
with the result being the alerts parsed into the expected ARRAY<STRUCT<cuid STRING, title STRING, created TIMESTAMP>> shape.
Obviously, you can then use this in your INSERT INTO my_dataset.my_table statement
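As a side note, if JSON_EXTRACT_ARRAY is available in your project, the JS UDF can be skipped entirely; a minimal sketch under that assumption:
#standardSQL
SELECT
JSON_EXTRACT_SCALAR(extra, "$.id") AS id,
ARRAY(
SELECT AS STRUCT
JSON_EXTRACT_SCALAR(alert, "$.id") AS cuid,
JSON_EXTRACT_SCALAR(alert, "$.title") AS title,
TIMESTAMP(JSON_EXTRACT_SCALAR(alert, "$.created")) AS created
FROM UNNEST(JSON_EXTRACT_ARRAY(extra, "$.alerts")) alert
) AS alerts
FROM `project.dataset.my_old_table`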
You can parse the extracted string back to a BigQuery structure like so:
SELECT STRUCT(ARRAY<STRUCT<cuid STRING, title STRING, created TIMESTAMP>>
[('Rick', 'Scientist', '2020-01-17')]) FROM my_dataset.my_old_table;
I just tried it with your data. I inserted your data into a BigQuery table:
INSERT INTO dataset.table
VALUES('{"id": "bar123", "value": "Test", "title": "Test", "alerts":
[{ "id": "abc123", "title": "Foo", "created": "2020-01-17T23:18:59.769908Z"},
{"id": "abc124", "title": "Accepting/Denying Claims", "created": "2020-01-17T23:18:59.769908Z"}]}');
and queried it, converting it back to a BigQuery structure:
SELECT STRUCT<cuid STRING, title STRING, created TIMESTAMP>("abc123", "Foo", "2020-01-17T23:18:59.769908Z"),
("abc124", "Accepting/Denying Claims", "2020-01-17T23:18:59.769908Z")
FROM blabla.testingjson;
Output:
Row | f0_.cuid | f0_.title | f0_.created
----------------------------------------
1 | abc123 | Foo | 2020-01-17 23:18:59.769908 UTC

Array of JSON in Athena is read incorrectly and can't be unnested

I have a column called uf that contains an array of JSON objects. Here is a mockup:
[
{"type": "browserId", "name": "", "value": "unknown"},
{"type": "campaign", "name": "", "value": "om_227dec0082a5"},
{"type": "custom", "name": "2351350529", "value": "10148"},
{"type": "custom", "name": "9501713387", "value": "true"},
{"type": "custom", "name": "9517735577", "value": "true"},
{"type": "custom", "name": "9507402548", "value": "true"},
{"type": "custom", "name": "9733902068", "value": "true"}
]
I'm trying to get this as child records, but for some reason I can't find the right way to unnest it first. Then I noticed that my whole array is wrapped into another JSON object.
This is where I'm at:
I tried simple select and noticed that the result is:
[{type=[{"type": "browserId", "name": "", "value": "ff"}, name=null, value=null}]
The definition for this column is as follows:
{
"Name": "uf",
"Type": "array<struct<type:string,name:string,value:string>>"
}
Is the definition incorrect and that's why I get my whole array wrapped in another json array?
-- edit
Here is an example of my csv file (tab delimited). I spent the last two days trying to see if it's something about the JSON that makes Glue not recognise the column as an array of JSON, but when I created a new column with a simple array of JSON that was correctly assigned as array<struct, querying it gave exactly the same problem as above.
timestamp project_id campaign_id experiment_id variation_id layer_holdback audience_names end_user_id uuid session_id snippet_revision user_ip user_agent user_engine user_engine_version referer global_holdback event_type event_name uf active_views event_features event_metrics event_uuid
1570326511 74971132 11089500404 11097730080 11078120202 false [] oeu1535997971348r0.4399811351004357 AUTO 6540 5.91.170.0 Mozilla/5.0 (Linux; Android 7.0; SAMSUNG SM-G925F Build/NRD90M) AppleWebKit/537.36 (KHTML, like Gecko) SamsungBrowser/9.2 Chrome/67.0.3396.87 Mobile Safari/537.36 js 0.128.0 https://www.zavamed.com/uk/account/ false view_activated 10832783364 [{"type": "browserId", "name": "", "value": "unknown"}, {"type": "device", "name": "", "value": "mobile"}, {"type": "device_type", "name": "", "value": "phone"}, {"type": "referrer", "name": "", "value": "https:\/\/www.google.co.uk\/"}, {"type": "source_type", "name": "", "value": "campaign"}, {"type": "currentTimestamp", "name": "", "value": "-1631518596"}, {"type": "offset", "name": "", "value": "-60"}] [] [] [] 4926a5f1-bbb5-4553-9d0b-b26f773fa0f4
I uploaded a sample csv file to S3 with the content you provided, then ran a Glue crawler on it. Here is the table schema I ended up with:
CREATE EXTERNAL TABLE `question_58765672`(
`timestamp` bigint,
`project_id` bigint,
`campaign_id` bigint,
`experiment_id` bigint,
`variation_id` bigint,
`layer_holdback` boolean,
`audience_names` array<string>,
`end_user_id` string,
`uuid` string,
`session_id` string,
`snippet_revision` bigint,
`user_ip` string,
`user_agent` string,
`user_engine` string,
`user_engine_version` string,
`referer` string,
`global_holdback` boolean,
`event_type` string,
`event_name` bigint,
`uf` string,
`active_views` array<string>,
`event_features` array<string>,
`event_metrics` array<string>,
`event_uuid` string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://__S3_PATH_IN_MY_BUCKET__/'
TBLPROPERTIES (
'CrawlerSchemaDeserializerVersion'='1.0',
'CrawlerSchemaSerializerVersion'='1.0',
'UPDATED_BY_CRAWLER'='some-crawler',
'areColumnsQuoted'='false',
'averageRecordSize'='553',
'classification'='csv',
'columnsOrdered'='true',
'compressionType'='none',
'delimiter'='\t',
'objectCount'='1',
'recordCount'='2',
'sizeKey'='1109',
'skip.header.line.count'='1',
'typeOfData'='file')
As you can see, it identified column uf as string, which I wasn't really surprised about. In order to unnest this column, I had to manually cast it to the correct type, ARRAY(JSON):
SELECT
"timestamp",
_unnested_column
FROM
"stackoverflow"."question_58765672",
UNNEST( CAST(json_parse(uf) AS ARRAY(JSON)) ) AS t(_unnested_column)
Result:
timestamp _unnested_column
1 1570326511 {"name":"","type":"browserId","value":"unknown"}
2 1570326511 {"name":"","type":"device","value":"mobile"}
3 1570326511 {"name":"","type":"device_type","value":"phone"}
4 1570326511 {"name":"","type":"referrer","value":"https://www.google.co.uk/"}
5 1570326511 {"name":"","type":"source_type","value":"campaign"}
6 1570326511 {"name":"","type":"currentTimestamp","value":"-1631518596"}
7 1570326511 {"name":"","type":"offset","value":"-60"}
Then I thought of creating an Athena view where column uf would be cast correctly:
CREATE OR REPLACE VIEW question_58765672_v1_json AS
SELECT
CAST(json_parse(uf) AS ARRAY(JSON)) as uf
-- ALL other columns from your table
FROM
"stackoverflow"."question_58765672"
However, I got the following error:
Invalid column type for column uf: Unsupported Hive type: json
My guess is that the schema for column uf is either too complicated for the Glue crawler to identify correctly, or simply not supported by the SerDe in use, i.e. 'org.apache.hadoop.mapred.TextInputFormat' or 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'.
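One possible workaround for the view error (a sketch, assuming Presto's JSON-to-ROW cast can match your keys by name) is to cast to ARRAY(ROW(...)) instead of ARRAY(JSON), since ROW maps to a Hive struct type that views can represent:
CREATE OR REPLACE VIEW question_58765672_v2_row AS
SELECT
-- cast to a named ROW type so the view column maps to a Hive struct
CAST(json_parse(uf) AS ARRAY(ROW(type VARCHAR, name VARCHAR, value VARCHAR))) AS uf
-- ALL other columns from your table
FROM
"stackoverflow"."question_58765672"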

Timestamptz, same point in time but different representation for same query with 'set local time zone'

I am building a web app that stores its data on a PostgreSQL database server running in one location, with users connecting from other locations, so probably with different time zones and offsets than my server's.
I need to show the dates and times of actions (posts created, posts edited, comments submitted, etc.) according to each connecting user, just like Stack Exchange does. However, I am running into problems with time zones and offsets, as described below.
Everything seems to work correctly in my pgAdmin3 SQL Editor. When I run the query below there with set local time zone 'Europe/Oslo', for example, I get both the posts and tags created_at fields correct, with a +2 offset in the output. In the output row, the created_at field of the posts table is 2016-08-29 19:15:53.758+02 and, for the same row, created_at of the tags table is 2016-08-29T19:15:53.758+02:00.
However, when I put the same query in a route function in my Node.js Express.js server with pg-promise as the connection library, only the tags table's created_at field comes back correct, with the Oslo time and time zone offset appended as expected; the posts table's created_at field comes back in UTC, which is not what I expect.
All timestamps are defined as timestamp(3) with time zone NOT NULL DEFAULT CURRENT_TIMESTAMP, as shown below. Also, without the set local time zone setting I get the same behaviour: for the first table I get UTC time, and for the latter I get the timestamp with the server's offset appended.
Doesn't the set local time zone directive apply to the whole query? What am I missing in my approach?
An example query I use:
select
q.*, -- created_at timestamp (with time zone) is one of those columns
u.name as author,
u.reputation,
case when count(t.*)=0 then '[]' else json_agg(t.*) end as tags
from posts q
-- authors
join users u
on q.author_id = u.id
-- tags
left join post_has_tag p_h_t
on q.id = p_h_t.post_id
left join tags t
on p_h_t.tag_id = t.id
where q.post_type = 'question'
group by q.id, u.id;
An example express.js route function:
trialRoutes.get('/x', function (req, res) {
db.query(
`
--begin;
SET LOCAL TIME ZONE 'Europe/Oslo';
SELECT
q.*, -- created_at timestamp (with time zone) is already in here
u.name AS author,
u.reputation,
CASE WHEN count(t.*)=0 THEN '[]' ELSE json_agg(t.*) END as tags
FROM posts q
-- authors
JOIN users u
ON q.author_id = u.id
-- tags
left join post_has_tag p_h_t
on q.id = p_h_t.post_id
left join tags t
on p_h_t.tag_id = t.id
WHERE q.post_type = 'question'
group by q.id, u.id;
--commit;
`
)
.then(function (data) {
res.json(data)
})
.catch(function (error) {
console.log("/login, database quesry error.", error);
});
})
The result I get from the Express.js HTTP server with pg-promise is below. Note the created_at timestamps: they refer to the same point in time (which is correct), but their representations differ (which is not correct):
[
{
"id": "7",
"created_at": "2016-08-29T21:02:04.153Z", // same point in time, different representation
"title": "AAAAAAAAAAA",
"text": "aaaaa aaaaaaa aaaaaa",
"post_url": "AAAAAAAAAAA",
"score": 0,
"author_id": 1,
"parent_post_id": null,
"post_type": "question",
"is_accepted": false,
"acceptor_id": null,
"timezone": "2016-08-29T20:02:04.153Z",
"author": "Faruk",
"reputation": 0,
"tags": [
{
"id": 4,
"created_at": "2016-08-29T23:02:04.153+02:00", // same point in time, different representation
"label": "physics",
"description": null,
"category": null
}
]
},
{
"id": "6",
"created_at": "2016-08-29T17:24:10.151Z",
"title": "Ignoring timezones altogether in Rails and PostgreSQL",
"text": "Ignoring timezones altogether in Rails and PostgreSQL",
"post_url": "Ignoring-timezones-altogether-in-Rails-and-PostgreSQL",
"score": 0,
"author_id": 2,
"parent_post_id": null,
"post_type": "question",
"is_accepted": false,
"acceptor_id": null,
"timezone": "2016-08-29T16:24:10.151Z",
"author": "Selçuk",
"reputation": 0,
"tags": [
{
"id": 3,
"created_at": "2016-08-29T19:24:10.151+02:00",
"label": "sql",
"description": null,
"category": null
}
]
}
]
The definition of the posts and tags tables used here:
-- questions and answers
CREATE TABLE posts
(
id bigserial PRIMARY KEY,
created_at timestamp(3) with time zone NOT NULL DEFAULT CURRENT_TIMESTAMP,
title character varying(100),
text text,
post_url character varying(100),
score integer DEFAULT 0,
author_id integer NOT NULL REFERENCES users (id),
parent_post_id integer REFERENCES posts (id),
post_type varchar(30),
is_accepted boolean DEFAULT FALSE,
acceptor_id integer REFERENCES users (id) DEFAULT NULL
--seen_by_parent_post_author boolean DEFAULT false
--view_count
--accepted_answer_id
--answer_count
);
CREATE TABLE tags
(
id bigserial PRIMARY KEY,
created_at timestamp(3) with time zone NOT NULL DEFAULT CURRENT_TIMESTAMP,
label character varying(30) NOT NULL,
description character varying(200),
category character varying(50)
);