Select unique IDs and divide the result into X-minute intervals based on a given timespan - SQL

I'm trying to knock some dust off my good old SQL queries, but I'm afraid I need a push in the right direction to turn those dusty skills into something useful when it comes to BigQuery statements.
I'm currently working with a single table whose relevant columns are eventId, userId and creationTime (a timestamp).
In the query I would like to be able to supply the following in my WHERE clause:
1. The date the results should stem from.
2. A time range - in the example below this range would be from 20:00 to 21:00. If 1. and 2. in this list were merged together, that's also fine.
3. The eventId I would like to find records for.
4. Optionally, the interval frequency - whether the result should be divided into e.g. 5-, 10- or 15-minute intervals.
Also, I would like to count the unique userIds for each interval. If one user is present during the entire session, he/she should be included in the count for every interval.
So think of it as the following:
How many unique users did we have every 5 minutes at event X, between 20:00 and 21:00 on day Y?
How should my query look if I want a result looking (something) like the following pseudo result:
time_interval number_of_unique_userIds
1 2022-03-16 20:00:00 10
2 2022-03-16 20:05:00 12
3 2022-03-16 20:10:00 15
4 2022-03-16 20:15:00 20
5 2022-03-16 20:20:00 30
6 ... etc.
If the time of the query is before the provided end time in the timespan, it should fill out the rest of the interval rows with 0 unique userIds.
In the following result, we've executed the mentioned query earlier than the provided end time - let's say it's executed at 20:49:
time_interval number_of_unique_userIds
X 2022-03-16 20:50:00 0
X 2022-03-16 20:55:00 0
X 2022-03-16 21:00:00 0
Here's what I have so far, but it gives me several records for the same interval, with what looks like one per userId:
SELECT
  TIMESTAMP_SECONDS(5*60 * DIV(UNIX_SECONDS(creationTime), 5*60)) AS time_interval,
  COUNT(DISTINCT userId) AS number_of_unique_userIds
FROM `bigquery.table`
WHERE eventId = 'xyz'
  AND creationTime > '2022-03-16 20:00:00' AND creationTime < '2022-03-16 21:00:00'
GROUP BY time_interval
ORDER BY time_interval DESC
This gives me somewhat what I expect - but the number_of_unique_userIds seems too low, so I'm a little worried that I'm not getting unique userIds for each interval. What I'm thinking is that userIds counted in the first 5-minute interval are not counted in the next. So I'm not sure this query is sufficient for my needs. Also, it's not filling the blanks with 0 number_of_unique_userIds.
I hope you can help me out here.
Thanks!
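For reference, one way to get both the per-interval distinct counts and the zero-filled tail is to join the data against a generated list of intervals - a minimal sketch, not a definitive answer, assuming the column names and filter values from the query above:

WITH intervals AS (
  -- one row per 5-minute interval boundary in the requested timespan
  SELECT ts AS time_interval
  FROM UNNEST(GENERATE_TIMESTAMP_ARRAY(
         TIMESTAMP '2022-03-16 20:00:00',
         TIMESTAMP '2022-03-16 21:00:00',
         INTERVAL 5 MINUTE)) AS ts
)
SELECT
  i.time_interval,
  -- COUNT(DISTINCT ...) ignores the NULLs produced by the LEFT JOIN,
  -- so intervals without data come out as 0
  COUNT(DISTINCT t.userId) AS number_of_unique_userIds
FROM intervals i
LEFT JOIN `bigquery.table` t
       ON t.eventId = 'xyz'
      AND t.creationTime >= i.time_interval
      AND t.creationTime < TIMESTAMP_ADD(i.time_interval, INTERVAL 5 MINUTE)
GROUP BY i.time_interval
ORDER BY i.time_interval;

Because the distinct count is taken per interval, a user who is present for the whole session is counted once in every interval he or she overlaps.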

Related

Running a Count on an Interval

I'm trying to do an alert of sorts for customers joining. The alert needs to run on an interval of one hour, which is possible with an integration we have.
The sample data is this:
Name    Time
John    2022-04-21T13:49:51
Mary    2022-04-23T13:49:51
Dave    2022-04-25T13:49:51
Gregg   2022-04-27T13:49:51
So the problem with the query below is that it only captures the count within the current hour, and it will yield no results until the HAVING condition is met. But I'm trying to determine the moment (well, to within the hour) the count crosses above a threshold of 3. Is there something I'm missing?
SELECT COUNT(name)
FROM Table
WHERE Time >= TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL -60 MINUTE)
HAVING COUNT(name) > 3
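One way to catch the moment the threshold is crossed, rather than only checking the count for the current hour, is a sliding window count - a minimal sketch, assuming BigQuery and that Time is a TIMESTAMP column (the table name is a placeholder):

SELECT name, Time AS crossed_at
FROM (
  SELECT
    name,
    Time,
    -- number of joins in the 60 minutes ending at this row's Time, inclusive
    COUNT(*) OVER (
      ORDER BY UNIX_SECONDS(Time)
      RANGE BETWEEN 3600 PRECEDING AND CURRENT ROW
    ) AS joins_last_hour
  FROM `Table`
) t
WHERE t.joins_last_hour > 3;

The first row returned marks the earliest join at which more than 3 customers have joined within the trailing hour.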

Convert single row into multiple rows Bigquery SQL

Convert a row into multiple rows in BigQuery SQL.
The number of rows depends on a particular column value (in this case, the value of delta_unit/60):
Source table:
ID time delta_unit
101 2019-06-18 01:00:00 60
102 2019-06-18 01:01:00 60
103 2019-06-18 01:03:00 120
ID 102 recorded a time at 01:01:00, and the next record was at 01:03:00.
So we are missing a record that should have been at 01:02:00 with delta_unit = 60.
Expected table:
ID time delta_unit
101 2019-06-18 01:00:00 60
102 2019-06-18 01:01:00 60
104 2019-06-18 01:02:00 60
103 2019-06-18 01:03:00 60
A new row is created based on the delta_unit. The number of rows a record expands to depends on the value delta_unit/60 (in this case, 120/60 = 2, i.e. one new row is inserted).
I have found a solution to your problem. First, run
SELECT max(delta/60) AS max_a FROM `<projectid>.<dataset>.<table>`
to compute the maximum number of steps. Then run the following loop:
DECLARE a INT64 DEFAULT 1;
WHILE a <= 2 DO  -- 2 = max_a (change accordingly)
  INSERT INTO `<projectid>.<dataset>.<table>` (id, time, delta)
  SELECT id + 1, TIMESTAMP_ADD(time, INTERVAL a MINUTE), delta - 60*a
  FROM `<projectid>.<dataset>.<table>`
  WHERE delta > 60*a;
  SET a = a + 1;
END WHILE;
Of course this is not very efficient, but it gets the job done. The IDs and deltas do not end up at the right values, but they should not be needed: the deltas would all end up at 60 (the column can be deleted) and the IDs can be recreated from the timestamp to keep them ordered.
You might try using a conditional expression here to avoid the loop and go through the table only once.
I have tried
INSERT INTO `<projectid>.<dataset>.<table>` (id, time, delta)
SELECT id + 1,
       CASE
         WHEN delta > 80 THEN TIMESTAMP_ADD(time, INTERVAL 1 MINUTE)
         WHEN delta > 150 THEN TIMESTAMP_ADD(time, INTERVAL 2 MINUTE)
       END,
       60
FROM `<projectid>.<dataset>.<table>`
WHERE delta > 60;
but it fails because CASE only returns the first condition whose WHEN is true. So I am not sure it is possible to do it all at once. If you have small tables, I would stick to the first approach, which works fine.
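A loop-free alternative that goes through the table only once is to expand each row with an array of minute offsets instead of a CASE - a minimal sketch, assuming BigQuery and the same placeholder table:

SELECT
  id,  -- ids repeat across the expanded rows; as noted above, they can be recreated from the timestamp
  TIMESTAMP_ADD(time, INTERVAL step MINUTE) AS time,
  60 AS delta
FROM `<projectid>.<dataset>.<table>`,
UNNEST(GENERATE_ARRAY(0, DIV(delta, 60) - 1)) AS step;

Each source row becomes delta/60 rows (a row with delta = 120 yields offsets 0 and 1), producing the expected table directly as a SELECT rather than by mutating the source table.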

Exasol: Extract Hours AND Minutes from Timestamp

I have an Exasol database with login values of datatype TIMESTAMP like:
2015-10-01 13:00:34.0
2015-11-02 13:10:10.0
2015-10-06 13:20:03.0
2016-02-01 14:15:34.0
2016-04-03 14:08:10.0
2016-07-01 11:05:07.0
2016-09-03 10:08:12.0
2016-11-15 09:03:30.0
and many many more. I want to do a SQL query where I get the logins from 09:00:00 to 09:15:00, the logins from 09:15:00 to 09:30:00, and so on in separate tables (no matter what date it is). I already had success selecting on a 1-hour interval with:
... EXTRACT(HOUR FROM entryTime) BETWEEN 8 AND 8
That way I get entries of my database (no matter what date it is) within 1 hour, but I need smaller intervals, like every 09:00:00 - 09:15:00. Any ideas how to solve this in Exasol?
You can simply convert the time part of your timestamp to a string and do a BETWEEN, something like:
WHERE TO_CHAR(entryTime, 'HH24MI') BETWEEN '0900' AND '0915'
If you want to use extract and numeric values, I suggest this:
WHERE (EXTRACT(HOUR FROM entryTime) * 100) + EXTRACT(MINUTE FROM entryTime)
      BETWEEN 900 AND 915
I'm not in front of my computer now, but this (or something pretty similar) should work.
But I suspect that in both cases Exasol will create an expression index for the first part of the WHERE clause. Because you presumably use Exasol for a huge amount of data and fast performance, my suggestion is to add a column to your table that stores the time part of entryTime as a numeric value; that will create a proper index and give you better performance.
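That suggestion could look something like the following - a minimal sketch, assuming Exasol, the entryTime column from above, and a hypothetical table name MY_SCHEMA.LOGINS:

-- add a numeric time-of-day column (the table name MY_SCHEMA.LOGINS is hypothetical)
ALTER TABLE MY_SCHEMA.LOGINS ADD COLUMN entry_hhmi DECIMAL(4,0);

-- backfill it from the existing timestamp
UPDATE MY_SCHEMA.LOGINS
SET entry_hhmi = EXTRACT(HOUR FROM entryTime) * 100 + EXTRACT(MINUTE FROM entryTime);

-- queries can then filter on the plain numeric column
SELECT * FROM MY_SCHEMA.LOGINS WHERE entry_hhmi BETWEEN 900 AND 915;

Keep in mind the column has to be maintained for newly inserted rows as well.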
I found a workaround. The solution is to nest the SQL statements: in the first step you select on the hours, and around that SELECT you wrap another one where you specify the minutes.
SELECT * FROM
  (SELECT * FROM MY_SCHEMA.EXA_LOCAL
   WHERE EXTRACT(HOUR FROM TIMESTMP) BETWEEN 9 AND 9)
WHERE EXTRACT(MINUTE FROM TIMESTMP) BETWEEN 0 AND 15;

Creating a Time Range in SQL

I want to make a table (Table A) in Hive that has three columns. The table has times starting at 5AM and ending at 2AM the next day, and each row is a 5-minute increment from the previous row.
The first two columns look like this (and I don't know how to generate them):
start_time | end_time
5:00:00 | 5:05:00
5:05:01 | 5:10:00
...
23:55:01 | 00:00:00
...
1:55:01 | 02:00:00
Does anyone know how to do the above?
To give some background:
Once I have Table A created, I want to use another table (Table B) that I have, with epoch times for each record representing a customer visit, extract the necessary hour/minute/second information, and then provide a count of visitors for each time interval in a third column of Table A, say, "customer_count".
I think I know how to do the calculation for the "customer_count" column; what I need help with is making the first two columns of Table A.
You could do it the other way around:
Crop from table B the dates you are interested in
Group by 5-minute increments (calculated as (time - start_time) / 60 / 5, assuming the epoch times are in seconds)
Then turn the increments back into dates and calculate the second end_time column
Something like this:
select from_unixtime(<start time> + period*60*5),
from_unixtime(<start time> + (period+1)*60*5),
count from
(select (time-<start time>)/(60*5) as period,count(*) as count from tableB
where time >= <start time> and time <= <end time>
group by (time-<start time>)/(60*5) ) inner
Note that you won't receive rows for periods with a zero count (no visits during the period).
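If Table A itself is still needed, the 252 five-minute slots between 5AM and 2AM can be generated in Hive without a source table - a minimal sketch, assuming a Hive version that allows a FROM-less SELECT and a session time zone in which epoch 18000 formats as 05:00:00 (i.e. UTC):

select from_unixtime(18000 + pos*300 + if(pos = 0, 0, 1), 'HH:mm:ss') as start_time,
       from_unixtime(18000 + (pos + 1)*300, 'HH:mm:ss') as end_time
from (select split(space(251), ' ') as arr) dummy  -- 252 elements -> 252 rows
lateral view posexplode(arr) pe as pos, x;
-- 18000 = 5*3600 (5AM); 21 hours of 5-minute slots = 252 rows

The first row comes out as 05:00:00 | 05:05:00, and later rows get the +1 second start offset shown in the question.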

Time averaging non-continuous data with PostgreSQL9.2

I have multiple datasets of real data at 1-second time resolution. The data will often have gaps in the time series where the instrument dropped data or was turned off, resulting in a patchy (albeit still very useful) dataset. The resulting data might look like the following:
Timestamp [timestamp] : datastream1 [double precision] : datastream2 [double precision] : etc
2011-01-01 00:00:01 153.256 1255.325
2011-01-01 00:00:02 152.954 1254.288
2011-01-01 00:00:03 151.738 1248.951
2011-01-01 00:00:04 150.015 1249.185
2011-01-01 00:10:08 179.132 1328.115
2011-01-01 00:10:09 178.051 1323.125
2011-01-01 00:10:10 180.870 1336.983
2011-01-04 09:19:02 152.198 1462.814
2011-01-04 09:19:03 158.014 1458.122
2011-01-04 09:19:04 156.070 1464.174
Please note: the data are generally continuous but will have random gaps that must be dealt with.
I need to write code that takes the average and stdev over a given time interval, "timeInt", and is able to deal with these gaps. For example, if I wanted a 10-minute average of the data, my required output would be:
Timestamp_10min : avg_data1 : med_data1 : count_data1
where avg_data1 would be the average of all the data points within a given 10-minute period, and count_data1 would be the number of points used in the calculation of that average (i.e. 600 if there is no missing data, 300 if every second point is missing, etc.).
This code needs to work with any desired input interval (i.e. x minutes, y days, z weeks, months, years, etc).
Currently I am only able to output minute averages, using the following code:
CREATE OR REPLACE VIEW "DATATABLE_MIN" AS
SELECT MIN("DATATABLE"."Timestamp") AS "Timestamp_min",
       avg("DATATABLE"."datastream1") AS "datastream1_avg_min",
       stddev("DATATABLE"."datastream1") AS "datastream1_stdev_min",
       count("DATATABLE"."datastream1") AS "datastream1_count_min"
FROM "DATATABLE"
GROUP BY to_char("DATATABLE"."Timestamp", 'YYYY-MM-DD HH24:MI'::text);
Thanks in advance for the help!
To group by 10 minutes, you can use the "epoch":
SELECT MIN(dt."Timestamp") AS "Timestamp_min",
       avg(dt."datastream1") AS "datastream1_avg_min",
       stddev(dt."datastream1") AS "datastream1_stdev_min",
       count(dt."datastream1") AS "datastream1_count_min"
FROM "DATATABLE" dt
GROUP BY trunc(extract(epoch from dt."Timestamp") / (60*10));
The epoch is the number of seconds since a fixed time in the past. If you divide it by 600, you get the number of 10-minute intervals -- which is what you need for the aggregation.
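To make the interval configurable and return the bucket start as a timestamp, the same epoch trick can be parameterized - a minimal sketch, assuming PostgreSQL and an interval length given in seconds:

SELECT to_timestamp(trunc(extract(epoch FROM dt."Timestamp") / 600) * 600) AS "bucket_start",
       avg(dt."datastream1")    AS "datastream1_avg",
       stddev(dt."datastream1") AS "datastream1_stdev",
       count(dt."datastream1")  AS "datastream1_count"
FROM "DATATABLE" dt
GROUP BY 1
ORDER BY 1;
-- replace 600 with any interval in seconds (3600 = hourly, 86400 = daily, ...)

Intervals with no data still won't appear in the output; to include them you would have to join against a series of bucket timestamps (e.g. from generate_series()).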