Creating a windowed materialized view in ClickHouse - sql

Events from a MQTT server are inserted into this table:
CREATE TABLE IF NOT EXISTS mqtt_float(
topic LowCardinality( String ) NOT NULL ,
ts DateTime64(2,'Europe/Paris') NOT NULL DEFAULT now() ,
value Float64 NOT NULL
) ENGINE=ReplacingMergeTree -- eliminates duplicates with same (topic,ts) which should not occur
ORDER BY (topic,ts)
PRIMARY KEY (topic,ts);
Rows contain sensor data, topic is the name of a sensor, ts is the timestamp, and value is what was measured by the sensor at this time. To speed things up when plotting data over long periods of time, I have created a materialized view to summarize per minute. This works well:
CREATE MATERIALIZED VIEW IF NOT EXISTS mqtt_float_minute
ENGINE = AggregatingMergeTree
ORDER BY (topic, ts)
POPULATE
AS SELECT
topic,
toStartOfMinute(ts) AS ts,
avg(value) AS favg,
min(value) AS fmin,
max(value) AS fmax
FROM mqtt_float
GROUP BY topic,ts;
Now I want to learn how to use the Window View feature, which supposedly does the same thing in a more efficient manner. Here's my attempt:
CREATE WINDOW VIEW IF NOT EXISTS mqtt_float_minute_w
ENGINE = AggregatingMergeTree
ORDER BY (topic, ts)
POPULATE
AS SELECT
topic,
toStartOfMinute(ts) AS ts,
avg(value) OVER window1 AS favg,
min(value) OVER window1 AS fmin,
max(value) OVER window1 AS fmax
FROM mqtt_float
GROUP BY topic,ts
WINDOW window1 AS (PARTITION BY topic, toStartOfMinute(ts));
It fails:
Received exception from server (version 22.12.3):
Code: 215. DB::Exception: Received from localhost:9000. DB::Exception: Column value is not under aggregate function and not in GROUP BY. Have columns: ['toStartOfMinute(ts)','topic']: While processing topic, toStartOfMinute(ts) AS ts, avg(value) OVER window1 AS favg, min(value) OVER window1 AS fmin, max(value) OVER window1 AS fmax. (NOT_AN_AGGREGATE)
I'm quite puzzled by this error message. Any ideas how to make it work?
The WINDOW clause seems mandatory for a window view, which makes sense. Basically, what I want is to compute the aggregates of column value, grouped by (topic, toStartOfMinute(ts)). Once the current one-minute window closes, no more records will be inserted with timestamps belonging to that window, so the point of this type of view is that ClickHouse can discard the aggregation states that will never be used again and save some storage.
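For reference, the window view examples in the ClickHouse documentation aggregate with a plain GROUP BY over a time-window function such as tumble(), rather than with an OVER/WINDOW clause. A minimal, untested sketch along those lines (it assumes the experimental flag is enabled and that tumble() accepts the DateTime64 column; otherwise cast ts to DateTime first):
SET allow_experimental_window_view = 1;
CREATE WINDOW VIEW IF NOT EXISTS mqtt_float_minute_w
ENGINE = AggregatingMergeTree
ORDER BY (topic, ts)
AS SELECT
    topic,
    tumbleStart(w_id) AS ts, -- start of the one-minute window
    avg(value) AS favg,
    min(value) AS fmin,
    max(value) AS fmax
FROM mqtt_float
GROUP BY topic, tumble(ts, INTERVAL '1' MINUTE) AS w_id;
Whether POPULATE and the AggregatingMergeTree engine behave here exactly as they do for the materialized view is something to verify against the window view documentation.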

Related

Results within Bigquery do not remain the same as in GA4

I'm in BigQuery running the query below to see how many users I had from August 1st to August 14th, but the number does not match what GA4 shows me.
WITH event AS (
  SELECT
    user_id,
    event_name,
    PARSE_DATE('%Y%m%d', event_date) AS event_date,
    TIMESTAMP_MICROS(event_timestamp) AS event_timestamp,
    ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY TIMESTAMP_MICROS(event_timestamp) DESC) AS rn
  FROM `events_*`
  WHERE event_name = 'push_received')
SELECT COUNT(DISTINCT user_id)
FROM event
WHERE event_date >= '2022-08-01'
Result in GA4 (screenshot)
Result BQ = 37024
There are quite a few reasons why your GA4 data in the web will not match when compared to the BigQuery export and the Data API.
In this case, I believe you are running into the time zone issue: event_date is the date the event was logged, in the registered timezone of your property, whereas event_timestamp is the UTC time at which the event was logged by the client.
To resolve this, simply update your query with:
EXTRACT(DATETIME FROM TIMESTAMP_MICROS(`event_timestamp`) at TIME ZONE 'TIMEZONE OF YOUR PROPERTY' )
Your data should then match the WebUI and the GA4 Data API. This post that I co-authored goes into more detail on this and other reasons why your data doesn't match: https://analyticscanvas.com/3-reasons-your-ga4-data-doesnt-match/
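For concreteness, here is a sketch of the original query with that fix applied (the timezone 'America/Sao_Paulo' is only a placeholder, substitute your property's registered timezone; the unused ROW_NUMBER column is dropped):
WITH event AS (
  SELECT
    user_id,
    event_name,
    -- derive the date in the property's timezone instead of relying on UTC
    EXTRACT(DATE FROM TIMESTAMP_MICROS(event_timestamp) AT TIME ZONE 'America/Sao_Paulo') AS event_date
  FROM `events_*`
  WHERE event_name = 'push_received')
SELECT COUNT(DISTINCT user_id)
FROM event
WHERE event_date >= '2022-08-01'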
You cannot simply compare totals. Divide it into daily comparisons and look at details.

Azure Stream Analytics - Find Most Recent `n` Events Within Time Interval

I am working with Azure Stream Analytics and, to illustrate my situation, I have streaming events corresponding to buy(+)/sell(-) orders from users of a certain amount. So, key fields in an individual event look like: {UserId: 'u12345', Type: 'Buy', Amt: 14.0}.
I want to write a query which outputs UserId's and the sum of Amt for the most recent (up to) 5 events within a sliding 24 hr period partitioned by UserId.
To clarify:
If there are more than 5 events for a given UserId in the last 24 hours, I only want the sum of Amt for the most recent 5.
If there are fewer than 5 events, I either want the UserId to be omitted or the sum of the Amt of the events that do exist.
I've tried looking at LIMIT DURATION predicates, but there doesn't seem to be a way to limit the number of events as well as filter on time while PARTITION'ing by UserId. Has anyone done something like this?
Considering the comments, I think this should work:
WITH Last5 AS (
SELECT
UserId,
System.Timestamp() AS windowEnd,
COLLECTTOP(5) OVER (ORDER BY CAST(EventEnqueuedUtcTime AS DATETIME) DESC) AS Top5
FROM input1
TIMESTAMP BY EventEnqueuedUtcTime
GROUP BY
SlidingWindow(hour,24),
UserId
HAVING COUNT(*) >= 5 --We want at least 5
)
SELECT
L.UserId,
System.Timestamp() AS ts,
SUM(C.ArrayValue.value.Amt) AS sumAmt
INTO myOutput
FROM Last5 AS L
CROSS APPLY GetArrayElements(L.Top5) AS C
GROUP BY
System.Timestamp(), --Snapshot window
L.UserId
We use a CTE to first build the sliding 24-hour window. In there we both filter to only retain windows of at least 5 records (HAVING COUNT(*) >= 5), and collect only the most recent 5 of them (CollectTop(5) OVER ...). Note that I had to TIMESTAMP BY and CAST my own timestamp when testing the query; you may not need that in your case.
Next we need to unpack the collected records, which is done via CROSS APPLY GetArrayElements, and sum them. I use a snapshot window for that, as I don't need time grouping at that stage.
Please let me know if you need more details.

Stuck on what seems like a simple SQL dense_rank task

Been stuck on this issue and could really use a suggestion or help.
What I have in a table is basic user flow on a website. For every Session ID, there's a page visited from start (lands on homepage) to finish (purchase). This has been ordered by timestamp to get a count of pages visited during this process. This 'page count' has also been partitioned by Session ID to go back to 1 every time the ID changes.
What I need to do now is assign a step count (highlighted is what I'm trying to achieve). This should assign a similar count but doesn't continue counting at duplicate steps (i.e., someone visited multiple product pages: it's multiple pages but still only one 'product view' step).
You'd think this would be done using a dense rank, partitioned by session id - but that's where I get stuck. You can't order on page count because that'll assign a unique number to each step count. You can't order by Step because that orders it alphabetically.
What could I do to achieve this?
Screenshot of desired outcome:
Many thanks!
Use LAG to see whether two consecutive values are the same, then take a cumulative sum:
select t.*,
sum(case when prev_cs = custom_step then 0 else 1 end) over (partition by session_id order by timestamp) as steps_count
from (select t.*,
lag(custom_step) over (partition by session_id order by timestamp) as prev_cs
from t
) t
Below is for BigQuery Standard SQL
#standardSQL
SELECT * EXCEPT(flag),
COUNTIF(IFNULL(flag, TRUE)) OVER(PARTITION BY session_id ORDER BY timestamp) AS steps_count
FROM (
SELECT *,
custom_step != LAG(custom_step) OVER(PARTITION BY session_id ORDER BY timestamp) AS flag
FROM `project.dataset.table`
)
-- ORDER BY timestamp

Obtain latest record for a given second Postgres

I have data with millisecond-precision timestamps. I want to filter for only the most recent timestamp within a given second. I.e. records (2020-07-13 5:05:38.009, event1) and (2020-07-13 5:05:38.012, event2) should only retrieve the latter.
I've tried the following:
SELECT
timestamp as time, event as value, event_type as metric
FROM
table
GROUP BY
date_trunc('second', time)
But then I'm asked to group by event as well and I see all the data (as if no group by was provided)
In Postgres, you can use distinct on:
select distinct on (date_trunc('second', time)) t.*
from t
order by date_trunc('second', time), time desc;
The distinct on expression has to be the leading expression in order by; sorting by time desc within each second then keeps the most recent row per second.

BigQuery - counting number of events within a sliding time frame

I would like to count the number of events within a sliding time frame.
For example, say I would like to know how many bids were in the last 1000 seconds for the Google stock (GOOG).
I'm trying the following query:
SELECT
symbol,
start_date,
start_time,
bid_price,
count(if(max(start_time)-start_time<1000,1,null)) over (partition by symbol order by start_time asc) cnt
FROM [bigquery-samples:nasdaq_stock_quotes.quotes]
where symbol = 'GOOG'
The logic is as follows: the partition window (by symbol) is ordered by the bid time (leaving the bid date aside for the sake of simplicity).
For each window (defined by the row at the "head" of the window), I would like to count the number of rows whose start_time is less than 1000 seconds before the "head" row's time.
I'm trying to use max(start_time) to get the top row in the window. This doesn't seem to work and I get an error:
Error: MAX is an analytic function and must be accompanied by an OVER clause.
Is it possible to have two analytic functions in one column (both count and max in this case)? Is there a different solution to the problem presented?
Try using a RANGE window frame.
SELECT
symbol,
start_date,
start_time,
bid_price,
count(market_center) over (partition by symbol order by start_time RANGE 1000 PRECEDING) cnt
FROM [bigquery-samples:nasdaq_stock_quotes.quotes]
where symbol = 'GOOG'
order by 2, 3
I used market_center just as a column to count; additional fields can be used as well.
Note: the RANGE clause is not documented in the BigQuery Query Reference, but it is standard SQL window framing and appears to work in this case.
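For comparison, the same idea can be expressed in BigQuery Standard SQL with an explicit window frame. A sketch, assuming start_time is a numeric value in seconds (as the original answer does) and using the Standard SQL spelling of the same sample table:
SELECT
  symbol,
  start_date,
  start_time,
  bid_price,
  COUNT(*) OVER (
    PARTITION BY symbol
    ORDER BY start_time
    RANGE BETWEEN 1000 PRECEDING AND CURRENT ROW -- bids within the last 1000 seconds
  ) AS cnt
FROM `bigquery-samples.nasdaq_stock_quotes.quotes`
WHERE symbol = 'GOOG'
ORDER BY start_date, start_time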