Using "match_recognize" in a Common Table Expression in Snowflake - sql

Update: This was answered here.
I am putting together a somewhat complex query to do event detection, join(s), and time-based binning with a large time-series dataset in Snowflake. I recently noticed that match_recognize lets me elegantly detect time-series events, but whenever I try to use a match_recognize expression within a common table expression (with .. as ..), I receive the following error:
SQL compilation error: MATCH_RECOGNIZE not supported in this context.
I've done a lot of searching/reading, but haven't found any documented limitations on match_recognize in CTEs. Here's my query:
with clean_data as (
-- Remove duplicate entries
select distinct id, timestamp, measurement
from dataset
),
label_events as (
select *
from clean_data
match_recognize (
partition by id
order by timestamp
measures
match_number() as event_number
all rows per match
after match skip past last row
pattern(any_row row_between_gaps+)
define
-- Classify contiguous sections of datapoints with < 20min between adjacent points.
row_between_gaps as datediff(minute, lag(timestamp), timestamp) < 20
)
)
-- Do binning with width_bucket/etc. here
select id, timestamp, measurement, event_number
from label_events;
And I get the same error as above with this.
Is this a limitation that I'm not seeing, or am I doing something wrong?

A non-recursive CTE can always be rewritten as an inline view:
--select ...
--from (
select id, timestamp, measurement, event_number
from (select distinct id, timestamp, measurement
from dataset) clean_data
match_recognize (
partition by id
order by timestamp
measures
match_number() as event_number
all rows per match
after match skip past last row
pattern(any_row row_between_gaps+)
define
-- Classify contiguous sections of datapoints with < 20min between adjacent points.
row_between_gaps as datediff(minute, lag(timestamp), timestamp) < 20
) mr
-- ) -- if other transformations are required
It is not ideal, but at least it will allow the query to run.

Per a comment by Filipe Hoffa on this thread: MATCH_RECOGNIZE with CTE in Snowflake
This seemed to be an undocumented limitation of Snowflake at the time. A two- or three-step solution has worked well for me:
with clean_data as (
-- Remove duplicate entries
select distinct id, timestamp, measurement
from dataset
)
select *
from clean_data
match_recognize (
partition by id
order by timestamp
measures
match_number() as event_number
all rows per match
after match skip past last row
pattern(any_row row_between_gaps+)
define
-- Classify contiguous sections of datapoints with < 20min between adjacent points.
row_between_gaps as datediff(minute, lag(timestamp), timestamp) < 20
);
set quid=last_query_id();
with label_events as (
select *
from table(result_scan($quid))
)
-- Do binning with width_bucket/etc. here
select id, timestamp, measurement, event_number
from label_events;
I prefer to use a variable here, because I can re-run the second query multiple times during development/debugging without having to re-run the first query.
It is also important to note that cached GEOGRAPHY objects in Snowflake are converted to GEOJSON, so when retrieving these with result_scan, you must typecast them back to the GEOGRAPHY type.
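For example, a minimal sketch of that cast, where shape is a hypothetical GEOGRAPHY column cached by the first query (not a column from the dataset above):
-- Re-cast the GeoJSON text returned by result_scan back to GEOGRAPHY
select id, to_geography(shape) as shape
from table(result_scan($quid));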

Related

What else do I need to add to my SQL query to bring related information in other columns if using MIN() GROUP BY

There is a table with the following column headers: indi_cod, ries_cod, date, time and level. Each ries_cod contains more than one indi_cod, and these indi_cod are random consecutive numbers.
Which SQL query would be appropriate to build if the aim is to find the smallest ID of each ries_cod, and at the same time bring its related information corresponding to date, time and level?
I tried the following query:
SELECT MIN (indi_cod) AS min_indi_cod
FROM my-project-01-354113.indi_cod.second_step
GROUP BY ries_cod
ORDER BY ries_cod
And, indeed, it presented me with the minimum value of indi_cod for each group of ries_cod, but I couldn't write the appropriate query to bring me the information from the date, time and level columns corresponding to each indi_cod.
I usually use some kind of ranking for this type of thing. You can use ROW_NUMBER, RANK, or DENSE_RANK depending on your RDBMS. Here is an example:
with t as (
select a.*,
row_number() over (partition by ries_cod order by indi_cod) as rn
from mytable a
)
select * from t where rn = 1
In addition, if you are using Oracle, you can do this without two queries by using KEEP.
https://renenyffenegger.ch/notes/development/databases/SQL/select/group-by/keep-dense_rank/index
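For example, a rough sketch of the KEEP approach (column names are taken from the question; date, time and level clash with Oracle reserved words and would likely need quoting, and I have not tested this against Oracle):
SELECT ries_cod,
MIN(indi_cod) AS min_indi_cod,
MIN(date) KEEP (DENSE_RANK FIRST ORDER BY indi_cod) AS date,
MIN(time) KEEP (DENSE_RANK FIRST ORDER BY indi_cod) AS time,
MIN(level) KEEP (DENSE_RANK FIRST ORDER BY indi_cod) AS level
FROM mytable
GROUP BY ries_cod
ORDER BY ries_cod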
I think you just need to group by the other columns:
SELECT MIN(indi_cod) AS min_indi_cod, ries_cod, date, time, level
FROM mytable p
GROUP BY ries_cod, date, time, level
ORDER BY ries_cod

Partition by rearranges table on each query run

The below query always rearranges my table (2021-01-01 is not followed by 2021-01-02 but by some other random date) at each run and messes up the average calculation. If I remove the partition by, the table is ordered by EventTime (date) correctly... but I have 6 kinds of Symbols I would like the average of. What am I doing wrong?
select ClosePrice, Symbol, EventTime, AVG(ClosePrice) over(
partition by Symbol
order by EventTime
ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) [SMA]
from ClosePrices
The query is missing an ORDER BY clause for the final results. The ORDER BY inside the OVER() expression only applies to that window.
The SQL language is based on relational set theory concepts, which explicitly deny any built-in order for tables. That is, there is no guaranteed order for your tables or queries unless you deliberately set one via an ORDER BY clause.
In the past it may have seemed like you always get rows back in a certain order, but if so it's because you've been lucky. There are lots of things that can cause a database to return results in a different order, sometimes even for different runs of the same query.
If you care about the order of the results, you MUST use ORDER BY:
select ClosePrice, Symbol, EventTime, AVG(ClosePrice) over(
partition by Symbol
order by EventTime
ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) [SMA]
from ClosePrices
ORDER BY EventTime

Creating a partitioned table from query in Big Query does not yield same as without partitioning

When creating a table, let's say "orders", with partitioning in the following way, my result gets truncated in comparison to creating it without partitioning (commenting and uncommenting lines five and six).
I suspect that it might have something to do with the BQ limits (found here) but I can't figure out what. The ts is a timestamp field and order_id is a UUID string.
i.e. the count distinct on the last row will yield very different results: when partitioned it will return far fewer order_ids than without partitioning.
DROP TABLE IF EXISTS
`project.dataset.orders`;
CREATE OR REPLACE TABLE
`project.dataset.orders`
-- PARTITION BY
-- DATE(ts)
AS
SELECT
ts,
order_id,
SUM(order_value) AS order_value
FROM
`project.dataset.raw_orders`
GROUP BY
1, 2;
SELECT COUNT(DISTINCT order_id) FROM `project.dataset.orders`;
(This is not a valid 'answer', I just need a better place to write SQL than the comment box. I don't mind if a moderator converts this answer into a comment AFTER it serves its purpose.)
What is the number you'd get if you run the query below, and which one does it align with (partitioned or non-partitioned)?
SELECT COUNT(DISTINCT order_id) FROM (
SELECT
ts,
order_id,
SUM(order_value) AS order_value
FROM
`project.dataset.raw_orders`
GROUP BY
1, 2
) t;
It turns out that there's a 60-day partition expiration!
https://cloud.google.com/bigquery/docs/managing-partitioned-tables#partition-expiration
So by updating the partition expiration I could get the full range.
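For anyone hitting the same thing, the expiration can be changed or removed on the table itself; something like the following should work (NULL removes the table-level partition expiration):
ALTER TABLE `project.dataset.orders`
SET OPTIONS (partition_expiration_days = NULL);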

I want to generate unique ids while inserting into Bigquery table.

I want to generate unique ids while inserting into a BigQuery table. ROW_NUMBER() OVER() fails with resources exceeded. Forums recommend using ROW_NUMBER() OVER(PARTITION BY). Unfortunately, partition by can't be used as it may produce same row_numbers for the partition by key. Please note that the data that I am trying to insert is at least a few hundred million rows every day.
Unfortunately, partition by can't be used as it may produce same row_numbers for the partition by key
Yes, you will get the same numbers for different partitions, so you can just use a compound key, as in the much simplified example below. It is just to show the approach; you should be able to tweak it to your specific case.
#standardSQL
WITH `project.dataset.table` AS (
SELECT value, CAST(10*RAND() AS INT64) partitionid
FROM UNNEST(GENERATE_ARRAY(1, 100)) value
)
SELECT
partitionid,
value,
CONCAT(
CAST(1000 + partitionid AS STRING),
CAST(10000 + ROW_NUMBER() OVER(PARTITION BY partitionid ORDER BY value) AS STRING)
) id
FROM `project.dataset.table`
-- ORDER BY id
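To use this while actually inserting, a rough sketch might look like the below (the target and staging table names, and the choice of partition key, are placeholders for your own schema):
INSERT INTO `project.dataset.target` (id, value)
SELECT
CONCAT(
CAST(1000 + partitionid AS STRING),
CAST(10000 + ROW_NUMBER() OVER(PARTITION BY partitionid ORDER BY value) AS STRING)
) AS id,
value
FROM `project.dataset.staging`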

Filtering out duplicate subsequent records in a SELECT

(PostgreSQL 8.4)
Table "trackingMessages" stores tracking events between mobile devices (tm_nl_mobileid) and fixed devices (tm_nl_fixedId).
CREATE TABLE trackingMessages
(
tm_id SERIAL PRIMARY KEY, -- PK
tm_nl_mobileId INTEGER, -- FK to mobile
tm_nl_fixedId INTEGER, -- FK to fixed
tm_date INTEGER, -- Network time
tm_messageType INTEGER, -- 0=disconnect, 1=connect
CONSTRAINT tm_unique_row
UNIQUE (tm_nl_mobileId, tm_nl_fixedId, tm_date, tm_messageType)
);
The problem here is that it's possible for the same mobile to connect to the same fixed device twice (or more times) in a row. I don't want to see the subsequent ones, but it's OK to see a mobile connected to the same fixed device at a later date, provided there was a connection to a different fixed device in between.
I think I'm close but not quite. I've been using the following CTE (found here on Stack Overflow)
WITH cte AS
(
SELECT tm_nl_fixedid, tm_date, Row_number() OVER (
partition BY tm_nl_fixedid
ORDER BY tm_date ASC
) RN
FROM trackingMessages
)
SELECT * FROM cte
WHERE tm_nl_mobileid = 150 AND tm_messagetype = 1
ORDER BY tm_date;
Gives me the following results
32;1316538756;1
21;1316539069;1
32;1316539194;2
32;1316539221;3
21;1316539235;2
The problem here is that the last column should be 1, 1, 1, 2, 1, because that third "32" is in fact a duplicate tracking event (twice in a row at the same fixed) and that last connection to "21" is OK because "32" was in between.
Please don't suggest a cursor, this is what I am currently trying to move away from. The cursor solution does work, but it's too slow given the amount of records I have to deal with. I'd much rather fix the CTE and only select where RN = 1 ... unless you have a better idea!
Well, you're not that close because row_number() cannot track sequences by two groups at the same time. PARTITION BY tm_nl_fixedid ORDER BY date RESTART ON GAP does not exist, there's no such thing.
Itzik Ben-Gan has a solution for the islands and gaps problem you are facing (several solutions, actually). The idea is to order rows by the main criteria (date) and then by the partitioning criteria + main criteria. The difference between the two ordinals remains the same for rows that belong to the same partitioning criteria and date series.
with cte as
(
select *,
-- While order by date and order by something-else, date
-- run along, they belong to the same sequence
row_number() over (order by tm_date)
- row_number() over (order by tm_nl_fixedid, tm_date) grp
from trackingMessages
)
select *,
-- Now we can get ordinal number grouped by each sequence
row_number() over (partition by tm_nl_fixedid, grp
order by tm_date) rn
from cte
order by tm_date
Here is a SQL Fiddle with an example.
And here is chapter 5 of SQL Server MVP Deep Dives, with several solutions to the islands and gaps problem.
This should be simpler with the window function lag():
WITH cte AS (
SELECT *
,lag(tm_nl_fixedId) OVER (PARTITION BY tm_nl_mobileId
ORDER BY tm_date) AS last_fixed
FROM trackingmessages
)
SELECT *
FROM cte
WHERE last_fixed IS DISTINCT FROM tm_nl_fixedId
ORDER BY tm_date
Explanation
In the CTE, lag() gets the last fixed device to which a mobile connected (NULL for the first row per mobile - that's why I use IS DISTINCT FROM later, see a different approach here).
Then simply exclude all rows where the last fixed device was the same as this one, thereby excluding all "subsequent ones". All done.