BigQuery UPDATE of column within nest from Google Analytics export - google-bigquery

I have a website's Google Analytics tracking data exported to BigQuery in a ga_sessions_* table. It's the standard setup of any GA-to-BQ export I've come across.
For simplicity's sake I'll refer only to ga_sessions_20220801.
This table has the following fields:
- visitorId INTEGER NULLABLE
- visitNumber INTEGER NULLABLE
- visitId INTEGER NULLABLE
- visitStartTime INTEGER NULLABLE
- date STRING NULLABLE
- totals RECORD NULLABLE
- trafficSource RECORD NULLABLE
- device RECORD NULLABLE
- geoNetwork RECORD NULLABLE
- customDimensions RECORD REPEATED
- hits RECORD REPEATED
- fullVisitorId STRING NULLABLE
- userId STRING NULLABLE
- clientId STRING NULLABLE
- channelGrouping STRING NULLABLE
- socialEngagementType STRING NULLABLE
- privacyInfo RECORD NULLABLE
Field hits is a repeated field, i.e. it contains multiple records for each table row.
My question
How can I execute an UPDATE statement on field hits.sourcePropertyInfo.sourcePropertyTrackingId, where sourcePropertyInfo is another nest inside hits (but not a repeated record)?
My attempts
update `my_project.my_dataset.ga_sessions_20220801`
set hits.sourcePropertyInfo.sourcePropertyTrackingId = 'some value'
where 1=1
Cannot access field sourcePropertyInfo on a value with type ARRAY<STRUCT<hitNumber INT64, time INT64, hour INT64, ...>> at [2:10]
update `my_project.my_dataset.ga_sessions_20220801`
set (select sourcePropertyInfo.sourcePropertyDisplayName from unnest(hits)) = 'some value'
where 1=1
Syntax error: Expected keyword DELETE or keyword INSERT or keyword UPDATE but got keyword SELECT at [2:6]
The following is my only attempt that was not rejected for syntax. I ended up trying to re-create the whole nest by listing each field, but it still fails at runtime because the record is repeated.
update `my_project.my_dataset.ga_sessions_20220801`
set hits = ARRAY(
SELECT AS STRUCT
hits.hitNumber, hits.time, hits.hour, hits.minute, hits.isSecure, hits.isInteraction, hits.isEntrance, hits.isExit, hits.referer, hits.page,
(SELECT AS STRUCT hits.transaction.transactionId, hits.transaction.transactionRevenue, hits.transaction.transactionTax, hits.transaction.transactionShipping, hits.transaction.affiliation,
hits.transaction.currencyCode,
hits.transaction.localTransactionRevenue, hits.transaction.localTransactionTax, hits.transaction.localTransactionShipping, hits.transaction.transactionCoupon)
as transaction,
hits.item, hits.contentInfo, hits.appInfo, hits.exceptionInfo, hits.eventInfo, hits.product, hits.promotion, hits.promotionActionInfo, hits.refund, hits.eCommerceAction, hits.experiment, hits.publisher, hits.customVariables, hits.customDimensions, hits.customMetrics, hits.type, hits.social, hits.latencyTracking,
(SELECT AS STRUCT 'some value' as sourcePropertyDisplayName, hits.sourcePropertyInfo.sourcePropertyTrackingId)
as sourcePropertyInfo,
hits.contentGroup, hits.dataSource, hits.publisher_infos, hits.uses_transient_token
FROM gas.hits
)
where 1=1
Scalar subquery produced more than one element

You're so close with the third query. You should recreate the hits column, but while preserving the original data structure.
In the query below, I take all the hit rows and replace their sourcePropertyInfo key in the struct.
Then, since the hits were unnested, I use array_agg to gather them back into an array.
update `my_project.my_dataset.ga_sessions_20220801`
set hits =
(
  select array_agg(t)
  from (
    select
      hit.* replace(
        struct(
          hit.sourcePropertyInfo.sourcePropertyDisplayName,
          'some value' as sourcePropertyTrackingId
        ) as sourcePropertyInfo
      )
    from unnest(hits) as hit
  ) as t
)
where true
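If it helps, a quick way to sanity-check the result afterwards is to unnest the hits and look at the rewritten value per hit (a sketch against the same table, using the column names from the export schema above):
-- Inspect sourcePropertyTrackingId per hit after the UPDATE
SELECT
  fullVisitorId,
  visitId,
  hit.hitNumber,
  hit.sourcePropertyInfo.sourcePropertyTrackingId
FROM `my_project.my_dataset.ga_sessions_20220801`,
  UNNEST(hits) AS hit
ORDER BY fullVisitorId, visitId, hit.hitNumber
LIMIT 100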

Related

Flink Window Aggregation using TUMBLE failing on TIMESTAMP

We have one table A in database. We are loading that table into flink using Flink SQL JdbcCatalog.
Here is how we are loading the data
val catalog = new JdbcCatalog("my_catalog", "database_name", username, password, url)
streamTableEnvironment.registerCatalog("my_catalog", catalog)
streamTableEnvironment.useCatalog("my_catalog")
val query = "select timestamp, count from A"
val sourceTable = streamTableEnvironment.sqlQuery(query)
streamTableEnvironment.createTemporaryView("innerTable", sourceTable)
val aggregationQuery = "select window_end, sum(count) from TABLE(TUMBLE(TABLE innerTable, DESCRIPTOR(timestamp), INTERVAL '10' minutes)) group by window_end"
It throws following error
Exception in thread "main" org.apache.flink.table.api.ValidationException: SQL validation failed. The window function TUMBLE(TABLE table_name, DESCRIPTOR(timecol), datetime interval[, datetime interval]) requires the timecol is a time attribute type, but is TIMESTAMP(6).
In short, we want to apply a windowing aggregation on an already existing column. How can we do that?
Note: this is batch processing.
Timestamp columns used as time attributes in Flink SQL must be either TIMESTAMP(3) or TIMESTAMP_LTZ(3), and the column must also be marked as ROWTIME.
Type this line in your code
sourceTable.printSchema();
and check the result. The column should be marked as ROWTIME as shown below.
(
`deviceId` STRING,
`dataStart` BIGINT,
`recordCount` INT,
`time_Insert` BIGINT,
`time_Insert_ts` TIMESTAMP(3) *ROWTIME*
)
You can find my sample below.
Table tableCpuDataCalculatedTemp = tableEnv.fromDataStream(streamCPUDataCalculated, Schema.newBuilder()
        .column("deviceId", DataTypes.STRING())
        .column("dataStart", DataTypes.BIGINT())
        .column("recordCount", DataTypes.INT())
        .column("time_Insert", DataTypes.BIGINT())
        .column("time_Insert_ts", DataTypes.TIMESTAMP(3))
        .watermark("time_Insert_ts", "time_Insert_ts")
        .build());
The watermark method is what marks the column as ROWTIME.
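For comparison, a pure Flink SQL version of the same idea might look like this (a sketch; the cpu_data table name and the datagen connector are placeholders used only for illustration):
-- Declaring a WATERMARK on time_Insert_ts marks it as an event-time (ROWTIME)
-- attribute, so the TUMBLE windowing TVF can use it.
CREATE TABLE cpu_data (
  deviceId STRING,
  dataStart BIGINT,
  recordCount INT,
  time_Insert BIGINT,
  time_Insert_ts TIMESTAMP(3),
  WATERMARK FOR time_Insert_ts AS time_Insert_ts
) WITH (
  'connector' = 'datagen'
);

SELECT window_end, SUM(recordCount)
FROM TABLE(
  TUMBLE(TABLE cpu_data, DESCRIPTOR(time_Insert_ts), INTERVAL '10' MINUTES))
GROUP BY window_end;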

How to write my ClickHouse SQL query correctly?

I have written a sql query:
WITH
type = "id" AND match(value, '*** ID WAS ACTIVATED ***') as id_activ
SELECT
multiIf(
id_activ, 'id_activ',
NULL) as case,
date,
value
FROM my.data.base
However it doesn't work. A problem occurs because of this part: match(value, '*** ID WAS ACTIVATED ***'). When I try this instead, it works perfectly:
WITH
type = "id" AND value LIKE '%*** ID WAS ACTIVATED ***%' as id_activ
SELECT
multiIf(
id_activ, 'id_activ',
NULL) as case,
date,
value
FROM my.data.base
Here is a full example of a message containing the text I use in match:
*** ID WAS ACTIVATED *** : Values expired
How can I fix this and write the query with match correctly?
The same thing happens with a query that has square brackets in it:
WITH
type = "id" AND match(value, 'Values processed [NUMBERS]') as id_number
SELECT
multiIf(
id_number, 'id_number',
NULL) as case,
date,
value
FROM my.data.base
match() matches against a regular expression.
"Equivalent" to
LIKE '%*** ID WAS ACTIVATED ***%'
is
match(value, '.*\*\*\* ID WAS ACTIVATED \*\*\*.*')
https://regex101.com/r/XkRmEY/1
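Putting this together, both filters from the question might be written like this (a sketch against the same table; since match() searches anywhere in the string, the leading and trailing .* are optional, and the square brackets need escaping just like the asterisks; depending on how the client handles string-literal escaping, the backslashes may need to be doubled):
WITH
    type = "id" AND match(value, '\*\*\* ID WAS ACTIVATED \*\*\*') AS id_activ,
    type = "id" AND match(value, 'Values processed \[NUMBERS\]') AS id_number
SELECT
    multiIf(
        id_activ, 'id_activ',
        id_number, 'id_number',
        NULL) AS case,
    date,
    value
FROM my.data.base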

Checking against value in a STRING_TABLE in a WHERE clause

I have a procedure with the parameter IT_ATINN:
IMPORTING
REFERENCE(IT_ATINN) TYPE STRING_TABLE
IT_ATINN contains a list of characteristics.
I have the following code:
LOOP AT values_tab INTO DATA(value).
  SELECT ( @value-INSTANCE ) AS CUOBJ
    FROM IBSYMBOL
    WHERE SYMBOL_ID = @value-SYMBOL_ID
    AND ATINN ??? "<======== HERE ???
    APPENDING TABLE @DATA(ibsymbol_tab).
ENDLOOP.
How can I check if ATINN (in the WHERE clause) is equal to any entry in IT_ATINN?
To achieve what you want (and I assume you want dynamic SELECT fields), you cannot use inline declarations here, either in the LOOP or in the SELECT:
The structure of the results set must be statically identifiable. The SELECT list and the FROM clause must be specified statically and host variables in the SELECT list must not be generic.
So you either use inline declarations or dynamic specifications, not both.
Here is a snippet that illustrates Sandra's good suggestion:
TYPES: BEGIN OF ty_value_tab,
         instance  TYPE char18,
         symbol_id TYPE id,
       END OF ty_value_tab.

DATA: it_atinn TYPE string_table.
DATA: rt_atinn     TYPE RANGE OF atinn,
      value        TYPE ty_value_tab,
      values_tab   TYPE TABLE OF ty_value_tab,
      ibsymbol_tab TYPE TABLE OF ibsymbol.

rt_atinn = VALUE #( FOR value_atinn IN it_atinn ( sign = 'I' option = 'EQ' low = value_atinn ) ).

APPEND VALUE ty_value_tab( instance = 'ATWRT' ) TO values_tab.

LOOP AT values_tab INTO value.
  SELECT (value-instance)
    FROM ibsymbol
    WHERE symbol_id = @value-symbol_id
      AND atinn IN @rt_atinn
    APPENDING CORRESPONDING FIELDS OF TABLE @ibsymbol_tab.
ENDLOOP.
Overall, it makes no sense to select from ibsymbol in a loop, because it has only 8 fields, so you can easily collect all the necessary fields from values_tab and pass them as one dynamic field string.
If you want to use the alias CUOBJ for your dynamic field, add it like this:
LOOP AT values_tab INTO value.
DATA(aliased_value) = value-instance && ` AS cuobj `.
SELECT (aliased_value)
...
Remember that your alias must exist among the ibsymbol fields; otherwise, with a static ibsymbol_tab declaration, this statement will throw a short dump.

Using Array(Tuple(LowCardinality(String), Int32)) in ClickHouse

I have a table
CREATE TABLE table (
id Int32,
values Array(Tuple(LowCardinality(String), Int32)),
date Date
) ENGINE MergeTree()
PARTITION BY toYYYYMM(date)
ORDER BY (id, date)
but when executing the request
SELECT count(*)
FROM table
WHERE (arrayExists(x -> ((x.1) = toLowCardinality('pattern')), values) = 1)
I get an error
Code: 49. DB::Exception: Received from clickhouse:9000. DB::Exception: Cannot capture column 3 because it has incompatible type: got String, but LowCardinality(String) is expected..
If I replace the column 'values'
values Array(Tuple(String, Int32))
then the request is executed without errors.
What could be the problem when using Array(Tuple(LowCardinality(String), Int32))?
Until this is fixed (see bug 7815), the following workaround can be used:
SELECT uniqExact((id, date)) AS count
FROM table
ARRAY JOIN values
WHERE values.1 = 'pattern'
For the case when there is more than one Array column, this approach can be used:
SELECT uniqExact((id, date)) AS count
FROM
(
SELECT
id,
date,
arrayJoin(values) AS v,
arrayJoin(values2) AS v2
FROM table
WHERE v.1 = 'pattern' AND v2.1 = 'pattern2'
)
values Array(Tuple(LowCardinality(String), Int32)),
Do not use Tuple. It brings only cons.
It's still 2 files on disk.
It gives a twofold slowdown when you extract only one tuple element.
https://gist.github.com/den-crane/f20a2dce94a2926a1e7cfec7cdd12f6d
valuesS Array(LowCardinality(String)),
valuesI Array(Int32)
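With the two parallel arrays, the original filter can be written without Tuple at all (a sketch, assuming valuesS and valuesI are kept index-aligned; the i > 0 condition in the second variant is only there to illustrate pairing the two arrays):
SELECT count(*)
FROM table
WHERE arrayExists(x -> x = 'pattern', valuesS)

-- if both sides are needed in the same predicate, the lambda can take both arrays
SELECT count(*)
FROM table
WHERE arrayExists((s, i) -> s = 'pattern' AND i > 0, valuesS, valuesI)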

Select rows from table with jsonb column based on arbitrary jsonb filter expression

Test data
DROP TABLE t;
CREATE TABLE t(_id serial PRIMARY KEY, data jsonb);
INSERT INTO t(data) VALUES
('{"a":1,"b":2, "c":3}')
, ('{"a":11,"b":12, "c":13}')
, ('{"a":21,"b":22, "c":23}')
Problem statement: I want to receive an arbitrary JSONB parameter which acts as a filter on column t.data, such as
{ "b":{ "from":0, "to":20 }, "c":13 }
and use this to select matching rows from my test table t.
In this example, I want rows where b is between 0 and 20 and c = 13.
No error is required if the filter specifies a "column" (or "tag") which does not exist in t.data - it just fails to find a match.
I've used numeric values for simplicity but would like an approach which generalises to text as well.
What I have tried so far: I looked at the containment approach, which works for equality conditions, but I am stumped on a generic way of handling range conditions:
select * from t
where t.data @> '{"c":13}'::jsonb;
Background: This problem arose when building a generic table-preview page on a website (for Admin users).
The page displays a filter based on various columns in whichever table is selected for preview.
The filter is then passed to a function in Postgres DB which applies this dynamic filter condition to the table.
It returns a jsonb array of the rows matching the filter specified by the user.
This jsonb array is then used to populate the Preview resultset.
The columns which make up the filter may change.
My Postgres version is 9.6 - thanks.
If you want to parse { "b":{ "from":0, "to":20 }, "c":13 }, you need a parser. That is out of scope for the JSON functions, but you can write a "generic" query using AND and OR to filter by such JSON, e.g.:
https://www.db-fiddle.com/f/jAPBQggG3p7CxqbKLMbPKw/0
with filt(f) as (values('{ "b":{ "from":0, "to":20 }, "c":13 }'::json))
select *
from t
join filt on
(f->'b'->>'from')::int < (data->>'b')::int
and
(f->'b'->>'to')::int > (data->>'b')::int
and
(data->>'c')::int = (f->>'c')::int
;
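With the sample filter above, only the second test row ({"a":11,"b":12, "c":13}) satisfies both b between 0 and 20 and c = 13, so that is the single row the join returns.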
Thanks for the comments/suggestions.
I will definitely look at GraphQL when I have more time - I'm working under a tight deadline at the moment.
It seems the consensus is that a fully generic solution is not achievable without a parser.
However, I got a workable first draft - it's far from ideal but we can work with it. Any comments/improvements are welcome ...
Test data (expanded to include dates & text fields)
DROP TABLE t;
CREATE TABLE t(_id serial PRIMARY KEY, data jsonb);
INSERT INTO t(data) VALUES
('{"a":1,"b":2, "c":3, "d":"2018-03-10", "e":"2018-03-10", "f":"Blah blah" }')
, ('{"a":11,"b":12, "c":13, "d":"2018-03-14", "e":"2018-03-14", "f":"Howzat!"}')
, ('{"a":21,"b":22, "c":23, "d":"2018-03-14", "e":"2018-03-14", "f":"Blah blah"}')
Here is a first draft of code to apply a jsonb filter dynamically, but with restrictions on what syntax is supported.
Also, it just fails silently if the filter supplied does not match the syntax it expects.
Timestamp handling is a bit kludgey, too.
-- Handle timestamp & text types as well as int
-- See is_timestamp(text) function at bottom
with cte as (
  select t.data, f.filt, fk.key
  from t
     , ( values ('{ "a":11, "b":{ "from":0, "to":20 }, "c":13, "d":"2018-03-14", "e":{ "from":"2018-03-11", "to": "2018-03-14" }, "f":"Howzat!" }'::jsonb ) ) as f(filt) -- equiv to cross join
     , lateral (select * from jsonb_each(f.filt)) as fk
)
select data, filt --, key, jsonb_typeof(filt->key), jsonb_typeof(filt->key->'from'), is_timestamp((filt->key)::text), is_timestamp((filt->key->'from')::text)
from cte
where
  case when (filt->key->>'from') is null then
    case jsonb_typeof(filt->key)
      when 'number' then (data->>key)::numeric = (filt->>key)::numeric
      when 'string' then
        case is_timestamp( (filt->key)::text )
          when true then (data->>key)::timestamp = (filt->>key)::timestamp
          else (data->>key)::text = (filt->>key)::text
        end
      when 'boolean' then (data->>key)::boolean = (filt->>key)::boolean
      else false
    end
  else
    case jsonb_typeof(filt->key->'from')
      when 'number' then (data->>key)::numeric between (filt->key->>'from')::numeric and (filt->key->>'to')::numeric
      when 'string' then
        case is_timestamp( (filt->key->'from')::text )
          when true then (data->>key)::timestamp between (filt->key->>'from')::timestamp and (filt->key->>'to')::timestamp
          else (data->>key)::text between (filt->key->>'from')::text and (filt->key->>'to')::text
        end
      when 'boolean' then false
      else false
    end
  end
group by data, filt
having count(*) = ( select count(distinct key) from cte ) -- must match on all filter elements
;
create or replace function is_timestamp(s text) returns boolean as $$
begin
  perform s::timestamp;
  return true;
exception when others then
  return false;
end;
$$ strict language plpgsql immutable;
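For reference, a quick check of how is_timestamp behaves on values from the test data above (a sketch):
SELECT is_timestamp('2018-03-14') AS valid_ts,   -- true: the cast succeeds
       is_timestamp('Howzat!')    AS invalid_ts; -- false: the cast raises and is caught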