BQ - Materialised Views and ARRAY_AGG - google-bigquery

I am trialling materialised views in our BQ eventing system but have hit a roadblock.
For context:
Our source event ingest tables use streaming inserts only (append only), partitioned by event time (more or less true time, but always in order with respect to the entity involved in the event stream), and we extract a given entity's 'latest' / most recent full state. I feel that with data being append only and history immutable there could be real benefits here, but I currently cannot get it to work (yet).
A lot of our base BQ code is spent determining the 'latest' state of an entity. This latest state is baked into the payload of the most recent event ingested in that table, e.g. an OrderAccepted event then a later OrderItemsDespatched event (for the same OrderId): the OrderItemsDespatched event will have the most up-to-date snapshot of the order (post processing an item's dispatch).
Thus in BQ, for BI, we need to surface the most current state of that order, e.g. we need to extract the order struct from the OrderItemsDespatched event since it is the most recent event.
This could involve an analytic function:
ROW_NUMBER() OVER (PARTITION BY entityId ORDER BY EventOccurredTimestamp DESC)
and picking the row = 1 results. However, analytic functions are not supported by MVs, and that approach is not as efficient anyway as the ARRAY_AGG below.
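For reference, the full ROW_NUMBER form (written against the same schema as the materialized view attempt below, so the field names are taken from it, not verified) would be roughly:

SELECT * EXCEPT(rn)
FROM (
  SELECT
    oe.*,
    ROW_NUMBER() OVER (
      PARTITION BY orderEvent.latestOrder.id
      ORDER BY PARSE_TIMESTAMP("%Y-%m-%dT%H:%M:%E*S%Ez", event.eventOccurredTime) DESC
    ) AS rn  -- rn = 1 marks the most recent event per order
  FROM `project.dataset.order_events` oe
)
WHERE rn = 1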
CREATE MATERIALIZED VIEW dataset.order_events_latest_mv
PARTITION BY EventOccurredDate
CLUSTER BY OrderId
AS
WITH ord_events AS (
  SELECT
    oe.*,
    orderEvent.latestOrder.id AS OrderId,
    PARSE_TIMESTAMP("%Y-%m-%dT%H:%M:%E*S%Ez", event.eventOccurredTime) AS EventOccurredTimestamp,
    EXTRACT(DATE FROM PARSE_TIMESTAMP("%Y-%m-%dT%H:%M:%E*S%Ez", event.eventOccurredTime)) AS EventOccurredDate
  FROM
    `project.dataset.order_events` oe
),
ord_events_latest AS (
  SELECT
    ARRAY_AGG(
      e ORDER BY EventOccurredTimestamp DESC LIMIT 1
    )[OFFSET(0)].*
  FROM
    ord_events e
  GROUP BY
    e.OrderId
)
SELECT
  *
FROM
  ord_events_latest
However, this fails with an error:
Materialized view query contains unsupported feature.
Fundamentally, we could save a heck of a lot of current processing and cost by only processing changed data, rather than scanning all the data every time, which, given it's an append-only, partitioned source table, seems feasible?
The logic would be quite similar for deduping our events, which we also do a lot, with a slightly different query but using ARRAY_AGG as well, as sketched below.
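For illustration, the dedup variant would look roughly like this (eventId is a hypothetical unique message identifier, not part of the schema shown above):

SELECT
  ARRAY_AGG(e ORDER BY EventOccurredTimestamp DESC LIMIT 1)[OFFSET(0)].*
FROM
  ord_events e
GROUP BY
  e.eventId  -- keep exactly one row per ingested event message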
Any advice welcome; hopefully support for the feature the error message mentions is not far off. Thanks!

I hope it works. The trick is to pack each row into a single delimited string that starts with the timestamp, so a plain MAX() (an aggregate that materialized views do support) picks the latest row per entity; the pieces are then split back out and cast to their original types:
WITH
latest_records AS
(
  SELECT
    entityId,
    SPLIT(MAX(CONCAT(CAST(EventOccurredTimestamp AS STRING), '||', Col1, '||', CAST(Col2 AS STRING), '||', CAST(Col3 AS STRING))), '||') AS values
  FROM `project.dataset.order_events`
  GROUP BY entityId
)
SELECT
  entityId,
  CAST(values[OFFSET(0)] AS TIMESTAMP) AS EventOccurredTimestamp,
  values[OFFSET(1)] AS Col1,               -- let's say it's a string
  CAST(values[OFFSET(2)] AS BOOL) AS Col2, -- it's a bool
  CAST(values[OFFSET(3)] AS INT64) AS Col3 -- it's an int64
FROM latest_records

Related

What SQL query operations can change their value dependent on order?

[see addendum below]
Recently I was going through an SQL script as part of a task to check the functionality of a data science process. I had a copy of a section of the script which had multiple sub queries, and I refactored it to put the sub queries at the top in a with-clause. I usually think of this as an essentially syntactic refactoring operation that is semantically neutral. However, the operation of the script changed.
Investigation showed that it was due to the use of a row number over a partition in which the ordering within the partition was not complete. Changing the structure of the code changed something in the execution plan, which changed the order within the slack left by the incomplete ordering.
I made a note of this point and became less confident of this refactoring, although I hold the position that order should not affect the semantics, at least as long as it can be avoided.
My question is ... other than assigning a row number, what operations have a value that is changed by the ordering?
I realize now that the question was a bit too open - both answers below were useful to me, but I cannot pick one over the other as THE right answer. I have up-voted both. Thanks. [I rethought that, and will pick one of the answers, rather than none. The one I pick was a bit more on target].
I also realize that the core of the problem was my not having strongly enough in mind that any refactoring can potentially change the contingent order in which the rows are returned. From now on, if I refactor and it changes the result - I will look for issues with ordering.
When windowed functions are involved, especially ROW_NUMBER(), the first thing to check is whether the columns used for ordering produce a stable sort.
For instance:
CREATE TABLE t(id INT, grp VARCHAR(100), d DATE, val VARCHAR(100));
INSERT INTO t(id, grp, d, val)
VALUES (1, 'grpA', '2021-10-16', 'b')
,(2, 'grpA', '2021-10-16', 'a')
,(3, 'grpA', '2021-10-15', 'c')
,(4, 'grpA', '2021-10-14', 'd')
,(5, 'grpB', '2021-10-13', 'a')
,(6, 'grpB', '2021-10-13', 'g')
,(7, 'grpB', '2021-10-12', 'h');
-- the sort is not stable, d column has a tie
SELECT *
FROM (
SELECT t.*, ROW_NUMBER() OVER(PARTITION BY grp ORDER BY d DESC) AS rn
FROM t) sub
WHERE sub.rn = 1 AND sub.val = 'a';
Depending on the order of operations it could return:
0 rows
1 row (id: 2)
1 row (id: 5)
2 rows (id: 2 and 5)
When the query is refactored, it could cause the engine to choose a different path to access the data, and thus a different result.
To check whether the sort is stable, a windowed COUNT over the partitioning and ordering columns can be used:
SELECT *
FROM (
SELECT t.*, COUNT(*) OVER(PARTITION BY grp, d ) AS cnt
FROM t) sub
WHERE cnt > 1;
db<>fiddle demo
So are you saying it had gaps in the row numbers? Or duplicate row numbers? Or did the row numbers just jump around (unstable)?
Which functions are altered by incomplete/unstable ordering? All the ones where you put an ORDER BY in the window function: thus ROW_NUMBER, LAG, or LEAD.
But in general a sub-select and a CTE (WITH clause) are the same; the primary difference is that multiple things can JOIN the same CTE (thus the Common part). This can be good or bad: you might save on some expensive calculation, but you might also slow down a critical path and make the whole execution time slower.
Or the data might be a little more processed (due to JOINs etc.) and then the incomplete ORDER BY / instability might be exposed.
SQL is not based on set theory but on list theory. It is true that many join-based operations have an output such that the underlying bag of elements is a function of the underlying bag of elements in the input, but there are operations, such as ROW_NUMBER() as mentioned, in which this is not the case.
I would like to add a more obscure effect not mentioned in the other answers so far. Floating point arithmetic. Since the order of adding up floating point numbers does actually make a difference, it is possible that using a different ordering clause can produce different floating point values.
In the case mentioned in the posted question, this did actually happen - although only in the 10th decimal place. But that can be enough to change which value is bigger than another, and so make a discrete and significant change to the result of the outermost query.
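A minimal illustration of order-dependent floating point addition (these are IEEE 754 double results, as used by most SQL engines for FLOAT types, and can be run as a bare SELECT in engines that allow one):

SELECT
  (0.1 + 0.2) + 0.3 AS left_to_right,  -- 0.6000000000000001
  0.1 + (0.2 + 0.3) AS right_to_left   -- 0.6
-- The same three addends summed in a different order give different results,
-- so a SUM() over rows delivered in a different order can drift the same way.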
Another example would be LISTAGG. I inherited some code that used LISTAGG, but didn't give consistent answers when I tweaked it, because it didn't include the ordering clause: WITHIN GROUP ( ORDER BY ...).
From the Snowflake docs:
If you do not specify the WITHIN GROUP (<orderby_clause>), the order of elements within each list is unpredictable. (An ORDER BY clause outside the WITHIN GROUP clause applies to the order of the output rows, not to the order of the list elements within a row.)
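For example, against the t table defined in the earlier answer, the fix is to pin the element order with WITHIN GROUP (Snowflake syntax):

SELECT grp,
       LISTAGG(val, ',') WITHIN GROUP (ORDER BY d, id) AS vals
FROM t
GROUP BY grp;
-- Without WITHIN GROUP (ORDER BY ...), the order of elements
-- inside vals is unpredictable and can change between runs.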

Why aren't FIRST_VALUE and LAST_VALUE aggregation functions in SQL?

Is there any special reason that SQL only implements FIRST_VALUE and LAST_VALUE as windowing functions instead of aggregation functions? I find it quite common to encounter problems such as "find the item with the highest price in each category". While other languages (such as Python) provide MIN/MAX functions with keyword arguments such that
MAX(item_name, key=lambda x: revenue[x])
is possible, in SQL the only way to tackle this problem seems to be:
WITH temp as(
SELECT *, FIRST_VALUE(item_name) OVER(PARTITION BY category ORDER BY revenue DESC) as fv
FROM catalog)
SELECT category, MAX(fv) -- MIN(fv) also OK
FROM temp
GROUP BY category;
Is there a special reason that there is no "aggregation version" of FIRST_VALUE such that
SELECT category, FIRST_VALUE(item_name, revenue)
FROM catalog
GROUP BY
category
or is it just the way it is?
That’s just the way it is, as far as I’m concerned. I suspect the only real answer would be “because it’s not in the SQL spec”, and the only people who could really answer as to why it’s not in the spec are the people who write it. Questions of the form “what was (name of relevant external authority) thinking when they mandated that (name of product) should operate like this” are typically off topic here, because very few people can reliably and factually answer. I don’t even like my own answer here, as it feels like an extended comment on a question that cannot realistically be answered.
Aggregate functions work on sets of data and while some of them might require some implied ordering operation such as median, the functions are always about the column they’re operating on, not a “give me the value of this column based on the ordering of that column”.
There are plenty of window/analytic functions that don't have a corresponding aggregation version, and window functions have a different end-use intent than aggregation. You could conceive that some of them perform aggregation and then join the aggregation result back to the main data in order to relate the aggregate result to the particular row, but I wouldn't assume the two facilities (aggregate vs window) are related at all.
As far as I understand the Python (not a Python dev), it is not doing any aggregation: it's searching a list of item_name strings, looking each up in a dictionary that returns the revenue for that item, and returning the item_name that has the largest revenue. There isn't any grouping there; it's much more like a SELECT TOP 1 item_name ORDER BY revenue DESC and is only really good for returning a single item, rather than a load of items that are each the max within their group, unless it's used within a loop that processes a different list of item names each time.
I know your question wasn't exactly about this particular SQL query but it may be helpful for you if I mention a couple of things on it. I'm not really sure what:
WITH temp as(
SELECT *, FIRST_VALUE(item_name) OVER(PARTITION BY category ORDER BY revenue DESC) as fv
FROM catalog
)
SELECT category, MAX(fv) -- MIN(fv) also OK
FROM temp
GROUP BY category;
Gives you over something like:
SELECT DISTINCT category, FIRST_VALUE(item_name) OVER(PARTITION BY category ORDER BY revenue DESC) as fv
FROM catalog
The analytic/window function will produce the same value for every row in a category (the partition), so it seems that really all the extra GROUP BY is doing is reducing the repeated values, which could be more simply achieved by just getting the values you want and using DISTINCT to quash the duplicates (one of the few cases where I would advocate such).
In the more general sense of "I want the entire most X row as determined by highest/lowest Y" we typically use row number for that:
WITH temp as(
SELECT *, ROW_NUMBER() OVER(PARTITION BY category ORDER BY revenue DESC) as rn
FROM catalog)
SELECT *
FROM temp
WHERE rn = 1;
Though I find it more compact/readable to dispense with the CTE and just use a subquery; YMMV.
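For completeness, a sketch of that subquery form (same hypothetical catalog table as above):

SELECT *
FROM (
  SELECT *, ROW_NUMBER() OVER(PARTITION BY category ORDER BY revenue DESC) AS rn
  FROM catalog) t
WHERE t.rn = 1;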

How do I query GA RealtimeView with Standard SQL in BigQuery?

When exporting Google Analytics data to Google BigQuery you can set up a realtime table that is populated with Google Analytics data in real time. However, this table will contain duplicates due to the eventually consistent nature of distributed computing.
To overcome this, Google has provided a view where the duplicates are filtered out. However, this view is not queryable with Standard SQL.
If I try querying it with Standard SQL, I get:
Cannot reference a legacy SQL view in a standard SQL query
We have standardized on Standard SQL, and I am hesitant to rewrite all our batch queries in legacy SQL for when we want to use them on realtime data. Is there a way to switch the realtime view to be a standard view?
EDIT:
This is the view definition (which is recreated every day by Google):
SELECT *
FROM [111111.ga_realtime_sessions_20190625]
WHERE exportKey IN (SELECT exportKey
FROM
(SELECT
exportKey, exportTimeUsec,
MAX(exportTimeUsec) OVER (PARTITION BY visitKey) AS maxexportTimeUsec
FROM
[111111.ga_realtime_sessions_20190625])
WHERE exportTimeUsec >= maxexportTimeUsec );
You can create a logical view like this using standard SQL:
CREATE VIEW dataset.realtime_view_20190625 AS
SELECT
visitKey,
ARRAY_AGG(
(SELECT AS STRUCT t.* EXCEPT (visitKey))
ORDER BY exportTimeUsec DESC LIMIT 1)[OFFSET(0)].*
FROM dataset.ga_realtime_sessions_20190625 AS t
GROUP BY visitKey
This selects the most recent row for each visitKey. If you want to generalize this across days, you can do something like this:
CREATE VIEW dataset.realtime_view AS
SELECT
CONCAT('20', _TABLE_SUFFIX) AS date,
visitKey,
ARRAY_AGG(
(SELECT AS STRUCT t.* EXCEPT (visitKey))
ORDER BY exportTimeUsec DESC LIMIT 1)[OFFSET(0)].*
FROM `dataset.ga_realtime_sessions_20*` AS t
GROUP BY date, visitKey
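Querying the view is then plain Standard SQL; for example, counting deduplicated sessions per day (a hypothetical usage sketch):

SELECT date, COUNT(*) AS sessions
FROM dataset.realtime_view
GROUP BY date
ORDER BY date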

SQL Eliminate Duplicates with NO ID

I have a table with the following Columns...
Node, Date_Time, Market, Price
I would like to delete all but 1 record for each Node, Date_Time pair.
SELECT Node, Date_Time, MAX(Price)
FROM Hourly_Data
Group BY Node, Date_Time
That gets the results I would like to see, but I can't figure out how to remove the other records.
Note - There is no ID for this table
Here are steps that are more of a workaround than a simple one-command solution, but they will work in any relational database:
Create new table that looks just like the one you already have
Insert the data computed by your group-by query to newly created table
Drop the old table
Rename new table to the name the old one used to have
Just remember that locking takes place and you need to have some maintenance time to perform this action.
There are simpler ways to achieve this, but they are DBMS specific; a generic sketch of the steps above follows.
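A minimal sketch of those steps, assuming a DBMS that supports CREATE TABLE ... AS and ALTER TABLE ... RENAME (e.g. PostgreSQL); Hourly_Data_Dedup is a hypothetical name:

-- Steps 1 + 2: build the new table from the de-duplicating query
-- (note: Market is not carried over, matching the question's query)
CREATE TABLE Hourly_Data_Dedup AS
SELECT Node, Date_Time, MAX(Price) AS Price
FROM Hourly_Data
GROUP BY Node, Date_Time;

-- Steps 3 + 4: drop the old table and rename the new one into place
DROP TABLE Hourly_Data;
ALTER TABLE Hourly_Data_Dedup RENAME TO Hourly_Data;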
Here is an easy SQL Server method that computes a row number within a CTE and deletes from it. I believe this method also works for most RDBMSs that support window functions and common table expressions.
;WITH cte AS (
  SELECT
    *,
    RowNum = ROW_NUMBER() OVER (PARTITION BY Node, Date_Time ORDER BY Price DESC)
  FROM
    Hourly_Data
)
DELETE FROM cte
WHERE RowNum > 1

bigquery resource limit exceeded due to order by

When I run the following query, I get a 'resources exceeded' error. If I remove the last line (the ORDER BY clause) it works:
SELECT
id,
INTEGER(-position / (CASE WHEN fallback = 0 THEN 2 ELSE 1 END)) AS major_sort
FROM (
SELECT
id,
fallback,
ROW_NUMBER() OVER(PARTITION BY fallback) AS position
FROM
[table] AS r
ORDER BY
r.score DESC ) AS r
ORDER BY major_sort DESC
Actually the entire last line would be:
ORDER BY major_sort DESC, r.score DESC
But that would probably make things even worse.
Any idea how I could change the query to circumvent this problem?
((If you wonder what this query does: the table contains a 'ranking' with multiple fallback strategies and I want to create an ordering like this: 'AABAABAABAAB' with 'A' and 'B' being the fallback strategies. If you have a better idea how to achieve this; please feel free to tell me :D))
A top-level ORDER BY will always serialize execution of your query: it will force all computation onto a single node for the purpose of sorting. That's the cause of the resources exceeded error.
I'm not sure I fully understand your goal with the query, so it's hard to suggest alternatives, but you might consider putting an ORDER BY clause within the OVER(PARTITION BY ...) clause. Sorting a single partition can be done in parallel and may be closer to what you want.
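Applied to the query above, that suggestion would look something like this (still legacy SQL, as in the question; the subquery's outer ORDER BY r.score DESC is replaced by an ORDER BY inside the OVER clause):

SELECT
  id,
  INTEGER(-position / (CASE WHEN fallback = 0 THEN 2 ELSE 1 END)) AS major_sort
FROM (
  SELECT
    id,
    fallback,
    ROW_NUMBER() OVER(PARTITION BY fallback ORDER BY score DESC) AS position
  FROM
    [table]) AS r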
More general advice on ordering:
Order is not preserved during BQ queries, so if there's an ordering that you want to preserve on the input rows, make sure it's encoded in your data as an extra field.
The use cases for large amounts of globally-sorted data are somewhat limited. Often when users run into resource limitations with ORDER BY, we find that they're actually looking for something slightly different (locally ordered data, or "top N"), and that it's possible to get rid of the global ORDER BY completely.
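As a concrete example of the "top N" case: if only, say, the 1000 largest major_sort values are actually needed, an ORDER BY combined with LIMIT avoids the single-node global sort (a hypothetical sketch; [intermediate_result] stands in for the inner query from the question):

SELECT id, major_sort
FROM [intermediate_result]
ORDER BY major_sort DESC
LIMIT 1000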