Correcting SQL query to exclude sessions under 2 min

I am trying to write a query that will remove any entries that are within 2 minutes of each other on the same day for each user_id.
For example, here is the table:
user_id  day  time
x        1    00:55:54
x        1    00:55:55
x        1    00:56:01
x        2    16:11:43
x        2    16:12:01
x        2    16:15:02
x        2    16:30:07
x        2    16:31:08
x        2    16:40:09
x        2    16:41:02
So if, within the same day, there are entries that don't last more than 2 minutes, I would like to exclude those entries.
Note: day and time were obtained by using day() and time() on a datetime column called timestamp.
The code I have is:
WITH frames AS (
    SELECT
        user_id,
        day(timestamp),
        time(timestamp) AS starttime,
        COALESCE(
            LEAD(time(timestamp)) OVER (PARTITION BY user_id, day(timestamp)),
            '23:59:59'
        ) AS final
    FROM events
)
SELECT user_id, day(timestamp), starttime, final, TIMEDIFF(final, starttime) AS duration
FROM frames
WHERE TIMEDIFF(final, starttime) < 120;
But I get this error: Error Code: 1054. Unknown column 'timestamp' in 'field list'

I am guessing that you want rows that are more than 2 minutes from the previous row:
select e.*
from (select e.*,
             lag(timestamp) over (partition by date(timestamp) order by timestamp) as prev_timestamp
      from events e
     ) e
where prev_timestamp is null or
      prev_timestamp < timestamp - interval '2 minute';
You do not specify the database you are using, so this uses reasonable database syntax that might need to be modified for your database.
Also note that in databases that support the function, day() typically returns the day of the month. You want a function that removes the time component, which is usually date() or cast(... as date).
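For completeness, here is a sketch of the original CTE with the immediate error fixed (MySQL 8.0 syntax is assumed, since error code 1054 is a MySQL error; the aliases dt and duration_seconds are only illustrative). The derived day needs an alias inside the CTE because the outer query can only see the CTE's output columns, not timestamp; LEAD() should also get an ORDER BY, and TIMEDIFF() returns a TIME value, so compare seconds via TIME_TO_SEC() rather than against the bare number 120:
WITH frames AS (
    SELECT
        user_id,
        DATE(`timestamp`) AS dt,        -- DATE() rather than DAY(), per the note above
        TIME(`timestamp`) AS starttime,
        COALESCE(
            LEAD(TIME(`timestamp`)) OVER (PARTITION BY user_id, DATE(`timestamp`) ORDER BY `timestamp`),
            '23:59:59'
        ) AS final
    FROM events
)
SELECT user_id, dt, starttime, final,
       TIME_TO_SEC(TIMEDIFF(final, starttime)) AS duration_seconds
FROM frames
WHERE TIME_TO_SEC(TIMEDIFF(final, starttime)) < 120;   -- the short gaps you want to exclude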

Related

SQL Big Query - How to write a COUNTIF statement applied to an INTERVAL column

I have a trip_duration column in interval format. I want to remove all observations less than 90 seconds and count how many observations match this condition.
My current SQL query is
WITH
org_table AS (
SELECT
ended_at - started_at as trip_duration
FROM `cyclistic-328701.12_month_user_data_cyclistic.20*`
)
SELECT
COUNTIF(x < 1:30) AS false_start
FROM trip_duration AS x;
It returns Syntax error: Expected ")" but got ":" at [8:16]
I have also tried
SELECT
COUNTIF(x < "0-0 0 0:1:30") AS false_start
FROM trip_duration AS x
It returns Table name "trip_duration" missing dataset while no default dataset is set in the request.
I've read through other questions and have not been able to write a solution.
My first thought is to cast the trip_duration from INTERVAL to TIME format so COUNTIF statements can reference a TIME formatted column instead of INTERVAL.
~ Marcus
The example below shows you the way to handle intervals:
with trip_duration as (
select interval 120 second as x union all
select interval 10 second union all
select interval 2 minute union all
select interval 50 second
)
select
count(*) as all_starts,
countif(x < interval 90 second) as false_starts
from trip_duration
with this output: all_starts = 4, false_starts = 2
To filter out the rows with durations less than 90 seconds:
SELECT
* # here is whatever field(s) you want to return
FROM
`cyclistic-328701.12_month_user_data_cyclistic.20*`
WHERE
TIMESTAMP_DIFF(ended_at, started_at, SECOND) >= 90
You can read about the TIMESTAMP_DIFF function here.
To count the number of occurrences:
SELECT
COUNTIF(TIMESTAMP_DIFF(ended_at, started_at,SECOND) < 90) AS false_start,
COUNTIF(TIMESTAMP_DIFF(ended_at, started_at,SECOND) >= 90) AS non_false_start
FROM
`cyclistic-328701.12_month_user_data_cyclistic.20*`
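Combining the two answers with the CTE from the question, the INTERVAL column can also be filtered directly, with no casting to TIME at all. This is only a sketch assuming the wildcard table and the started_at/ended_at columns from the question; note that the FROM clause has to reference the CTE (org_table), not the column name:
WITH org_table AS (
  SELECT
    ended_at - started_at AS trip_duration
  FROM `cyclistic-328701.12_month_user_data_cyclistic.20*`
)
SELECT
  COUNT(*) AS all_starts,
  COUNTIF(trip_duration < INTERVAL 90 SECOND) AS false_start
FROM org_table;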

Is there a way to change this BigQuery self-join to use a window function?

Let's say I have a BigQuery table "events" (in reality this is a slow sub-query) that stores the count of events per day, by event type. There are many types of events and most of them don't occur on most days, so there is only a row for day/event type combinations with a non-zero count.
I have a query that returns the count for each event type and day and the count for that event from N days ago, which looks like this:
WITH events AS (
SELECT DATE('2019-06-08') AS day, 'a' AS type, 1 AS count
UNION ALL SELECT '2019-06-09', 'a', 2
UNION ALL SELECT '2019-06-10', 'a', 3
UNION ALL SELECT '2019-06-07', 'b', 4
UNION ALL SELECT '2019-06-09', 'b', 5
)
SELECT e1.type, e1.day, e1.count, COALESCE(e2.count, 0) AS prev_count
FROM events e1
LEFT JOIN events e2 ON e1.type = e2.type AND e1.day = DATE_ADD(e2.day, INTERVAL 2 DAY) -- LEFT JOIN, because the event may not have occurred at all 2 days ago
ORDER BY 1, 2
The query is slow. BigQuery best practices recommend using window functions instead of self-joins. Is there a way to do this here? I could use the LAG function if there was a row for each day, but there isn't. Can I "pad" it somehow? (There isn't a short list of possible event types. I could of course join to SELECT DISTINCT type FROM events, but that probably won't be faster than the self-join.)
A brute force method is:
select e.*,
       (case when lag(day) over (partition by type order by day) = date_sub(e.day, interval 2 day)
             then lag(count) over (partition by type order by day)
             when lag(day, 2) over (partition by type order by day) = date_sub(e.day, interval 2 day)
             then lag(count, 2) over (partition by type order by day)
        end) as prev_day2_count
from events e;
This works fine for a two day lag. It gets more cumbersome for longer lags.
EDIT:
A more general form uses window frames. Unfortunately, these have to be numeric so there is an additional step:
select e.*,
       (case when min(day) over (partition by type order by diff range between 2 preceding and current row) = date_sub(day, interval 2 day)
             then first_value(count) over (partition by type order by diff range between 2 preceding and current row)
        end)
from (select e.*,
             date_diff(day, max(day) over (partition by type), day) as diff  -- day is a bad name for a column because it is a date part
      from events e
     ) e;
And duh! The case expression is not necessary:
select e.*,
       first_value(count) over (partition by type order by diff range between 2 preceding and 2 preceding)
from (select e.*,
             date_diff(day, max(day) over (partition by type), day) as diff  -- day is a bad name for a column because it is a date part
      from events e
     ) e;
Below is for BigQuery Standard SQL
#standardSQL
SELECT *, IFNULL(FIRST_VALUE(count) OVER (win), 0) prev_count
FROM `project.dataset.events`
WINDOW win AS (PARTITION BY type ORDER BY UNIX_DATE(day) RANGE BETWEEN 2 PRECEDING AND 2 PRECEDING)
If applied to the sample data from your question, the result is:
Row day type count prev_count
1 2019-06-08 a 1 0
2 2019-06-09 a 2 0
3 2019-06-10 a 3 1
4 2019-06-07 b 4 0
5 2019-06-09 b 5 4
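For a quick check, the same window can be run against a self-contained version of the sample data. This is just a reproducibility sketch: the explicit DATE literals and the final ORDER BY are added only so the snippet runs on its own and matches the table above:
WITH events AS (
  SELECT DATE '2019-06-08' AS day, 'a' AS type, 1 AS count
  UNION ALL SELECT DATE '2019-06-09', 'a', 2
  UNION ALL SELECT DATE '2019-06-10', 'a', 3
  UNION ALL SELECT DATE '2019-06-07', 'b', 4
  UNION ALL SELECT DATE '2019-06-09', 'b', 5
)
SELECT *, IFNULL(FIRST_VALUE(count) OVER (win), 0) AS prev_count
FROM events
WINDOW win AS (
  PARTITION BY type
  ORDER BY UNIX_DATE(day)
  RANGE BETWEEN 2 PRECEDING AND 2 PRECEDING
)
ORDER BY type, day;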

Grouping data in SQL by difference in column values

I have the following data in my logs table in Postgres:
logid => int (auto increment)
start_time => bigint (stores epoch value)
inserted_value => int
Following is the data stored in the table (where "start time actual" is not a column; it just shows the start_time value in UTC, 24-hour format):
logid user_id start_time inserted_value start time actual
1 1 1518416562 15 12-Feb-2018 06:22:42
2 1 1518416622 8 12-Feb-2018 06:23:42
3 1 1518417342 9 12-Feb-2018 06:35:42
4 1 1518417402 12 12-Feb-2018 06:36:42
5 1 1518417462 18 12-Feb-2018 06:37:42
6 1 1518418757 6 12-Feb-2018 06:59:17
7 1 1518418808 11 12-Feb-2018 07:00:08
I want to group and sum values according to the difference in start_time.
For the above data, the sum should be calculated in three groups:
user_id sum
1 15 + 8
1 9 + 12 + 18
1 6 + 11
So, consecutive values in each group are at most 1 minute apart. This 1 minute can be taken as any x minutes difference.
I was also trying the LAG function but could not understand it fully. I hope I was able to explain my question.
You can use a plain GROUP BY to achieve what you want. Just make all start_time values that belong to the same minute equal. For example:
select user_id, start_time/60, sum(inserted_value)
from log_table
group by user_id, start_time/60
I assume your start_time column contains integers representing epoch seconds, so integer division by 60 will properly truncate them to minutes. If the values are floats, you should use floor(start_time/60).
If you also want to select a human-readable date for the minute you're grouping by, you can add to_timestamp((start_time/60)*60) to the select list.
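Put together, that variant might look like this (a sketch only: it keeps the log_table name used above, minute_start and sum_inserted are illustrative aliases, and start_time is assumed to hold epoch seconds):
SELECT user_id,
       to_timestamp((start_time / 60) * 60) AS minute_start,   -- start of the minute, human readable
       sum(inserted_value) AS sum_inserted
FROM log_table
GROUP BY user_id, minute_start
ORDER BY user_id, minute_start;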
You can use LAG to check whether the current row is more than 60 seconds past the previous row and set group_changed (a virtual column) each time this happens.
In the next step, use a running sum over that column. This creates a group_number which you can use to group the results in the third step.
WITH cte1 AS (
SELECT
testdata.*,
CASE WHEN start_time - LAG(start_time, 1, start_time) OVER (PARTITION BY user_id ORDER BY start_time) > 60 THEN 1 ELSE 0 END AS group_changed
FROM testdata
), cte2 AS (
SELECT
cte1.*,
SUM(group_changed) OVER (PARTITION BY user_id ORDER BY start_time) AS group_number
FROM cte1
)
SELECT user_id, SUM(inserted_value)
FROM cte2
GROUP BY user_id, group_number
SQL Fiddle
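For reference, here is a self-contained sketch of the same approach with the sample rows from the question inlined as epoch seconds (testdata mirrors the table above; group_sum is an illustrative alias). With the > 60 second cutoff it reproduces the three desired groups, with sums 23, 39 and 17:
WITH testdata (logid, user_id, start_time, inserted_value) AS (
    VALUES (1, 1, 1518416562, 15),
           (2, 1, 1518416622,  8),
           (3, 1, 1518417342,  9),
           (4, 1, 1518417402, 12),
           (5, 1, 1518417462, 18),
           (6, 1, 1518418757,  6),
           (7, 1, 1518418808, 11)
), cte1 AS (
    SELECT testdata.*,
           CASE WHEN start_time - LAG(start_time, 1, start_time)
                     OVER (PARTITION BY user_id ORDER BY start_time) > 60
                THEN 1 ELSE 0 END AS group_changed
    FROM testdata
), cte2 AS (
    SELECT cte1.*,
           SUM(group_changed) OVER (PARTITION BY user_id ORDER BY start_time) AS group_number
    FROM cte1
)
SELECT user_id, group_number, SUM(inserted_value) AS group_sum
FROM cte2
GROUP BY user_id, group_number
ORDER BY user_id, group_number;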

Standard deviation of a set of dates

I have a table of transactions with columns id | client_id | datetime and I have calculated the mean of days between transactions to know how often these transactions are made by each client:
SELECT *, ((date_last_transaction - date_first_transaction)/total_transactions) AS frequency
FROM (
SELECT client_id, COUNT(id) AS total_transactions, MIN(datetime) AS date_first_transaction, MAX(datetime) AS date_last_transaction
FROM transactions
GROUP BY client_id
) AS t;
What would be the existing methods to calculate the standard deviation (in days) of a set of dates with PostgreSQL? Preferably with only one query, if it is possible :-)
I have found this way:
SELECT extract(day from date_trunc('day', (
CASE WHEN COUNT(*) <= 1 THEN
0
ELSE
SUM(time_since_last_invoice)/(COUNT(*)-1)
END
) * '1 day'::interval)) AS days_between_purchases,
extract(day from date_trunc('day', (
CASE WHEN COUNT(*) <= 2 THEN
0
ELSE
STDDEV(time_since_last_invoice)
END
) * '1 day'::interval)) AS range_of_days
FROM (
SELECT client_id, datetime, COALESCE(datetime - lag(datetime)
OVER (PARTITION BY client_id ORDER BY client_id, datetime
ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING
), 0
) AS time_since_last_invoice
FROM my_table
GROUP BY client_id, datetime
ORDER BY client_id, datetime
)
Explanation:
This query groups by client and date and then calculates the difference between each pair of transaction dates (datetime) by client_id and returns a table with these results. After this, the external query processes the table and calculates the average time between differences greater than 0 (the first value in each group is excluded because it is the first transaction and therefore the interval is 0).
The standard deviation is calculated only when there are 2 or more transaction dates for the same client, to avoid division by zero errors.
All differences are returned in PostgreSQL interval format.
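A more direct variant is also possible (a sketch only, assuming the transactions table from the question with client_id and datetime columns; gap_days, mean_days_between and stddev_days_between are illustrative names): express each gap as a number of days with EXTRACT(EPOCH FROM ...), then aggregate with avg() and stddev() per client, so no interval/zero juggling is needed:
SELECT client_id,
       avg(gap_days)    AS mean_days_between,
       stddev(gap_days) AS stddev_days_between
FROM (
    SELECT client_id,
           EXTRACT(EPOCH FROM datetime - lag(datetime) OVER (PARTITION BY client_id ORDER BY datetime)) / 86400.0 AS gap_days
    FROM transactions
) AS gaps
WHERE gap_days IS NOT NULL      -- the first transaction per client has no previous gap
GROUP BY client_id;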

Count occurrences of combinations of columns

I have daily time series (actually business days) for different companies and I work with PostgreSQL. There is also an indicator variable (called flag) taking the value 0 most of the time, and 1 on some rare event days. If the indicator variable takes the value 1 for a company, I want to further investigate the entries from two days before to one day after that event for the corresponding company. Let me refer to that as [-2,1] window with the event day being day 0.
I am using the following query
CREATE TABLE test AS
WITH cte AS (
SELECT *
, MAX(flag) OVER(PARTITION BY company ORDER BY day
ROWS BETWEEN 1 preceding AND 2 following) Lead1
FROM mytable)
SELECT *
FROM cte
WHERE Lead1 = 1
ORDER BY day,company
The query takes the entries ranging from 2 days before the event to one day after the event, for the company experiencing the event.
The query does that for all events.
This is a small section of the resulting table.
day company flag
2012-01-23 A 0
2012-01-24 A 0
2012-01-25 A 1
2012-01-25 B 0
2012-01-26 A 0
2012-01-26 B 0
2012-01-27 B 1
2012-01-30 B 0
2013-01-10 A 0
2013-01-11 A 0
2013-01-14 A 1
Now I want to do further calculations for every [-2,1] window separately. So I need a variable that allows me to identify each [-2,1] window. The idea is that I count the number of windows for every company with the variable "occur", so that in further calculations I can use the clause
GROUP BY company, occur
Therefore my desired output looks like that:
day company flag occur
2012-01-23 A 0 1
2012-01-24 A 0 1
2012-01-25 A 1 1
2012-01-25 B 0 1
2012-01-26 A 0 1
2012-01-26 B 0 1
2012-01-27 B 1 1
2012-01-30 B 0 1
2013-01-10 A 0 2
2013-01-11 A 0 2
2013-01-14 A 1 2
In the example, company B occurs only once (occur = 1), but company A occurs two times: the first time from 2012-01-23 to 2012-01-26, and the second time from 2013-01-10 to 2013-01-14. The second time range of company A does not consist of all four days surrounding the event day (-2, -1, 0, 1) since the company leaves the dataset before the end of that time range.
As I said, I am working with business days. I don't care about holidays; I have data from Monday to Friday. Earlier I wrote the following function:
CREATE OR REPLACE FUNCTION addbusinessdays(date, integer)
RETURNS date AS
$BODY$
WITH alldates AS (
SELECT i,
$1 + (i * CASE WHEN $2 < 0 THEN -1 ELSE 1 END) AS date
FROM generate_series(0,(ABS($2) + 5)*2) i
),
days AS (
SELECT i, date, EXTRACT('dow' FROM date) AS dow
FROM alldates
),
businessdays AS (
SELECT i, date, d.dow FROM days d
WHERE d.dow BETWEEN 1 AND 5
ORDER BY i
)
-- adding business days to a date --
SELECT date FROM businessdays WHERE
CASE WHEN $2 > 0 THEN date >=$1 WHEN $2 < 0
THEN date <=$1 ELSE date =$1 END
LIMIT 1
offset ABS($2)
$BODY$
LANGUAGE 'sql' VOLATILE;
It can add/subtract business days from a given date and works like this:
select * from addbusinessdays('2013-01-14',-2)
delivers the result 2013-01-10. So in Jakub's approach we can change the second- and third-to-last lines to
w.day BETWEEN addbusinessdays(t1.day, -2) AND addbusinessdays(t1.day, 1)
and can deal with the business days.
Function
If you are going to use the function addbusinessdays(), consider this version instead:
CREATE OR REPLACE FUNCTION addbusinessdays(date, integer)
RETURNS date AS
$func$
SELECT day
FROM (
SELECT i, $1 + i * sign($2)::int AS day
FROM generate_series(0, ((abs($2) * 7) / 5) + 3) i
) sub
WHERE EXTRACT(ISODOW FROM day) < 6 -- truncate weekend
ORDER BY i
OFFSET abs($2)
LIMIT 1
$func$ LANGUAGE sql IMMUTABLE;
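Assuming the function has been created as above, a quick sanity check against the example used earlier:
select addbusinessdays('2013-01-14', -2);  -- returns 2013-01-10, same as the original function (the start date counts as offset 0, the weekend is skipped)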
Major points
Never quote the language name sql. It's an identifier, not a string.
Why was the function VOLATILE? Make it IMMUTABLE for better performance in repeated use and more options (like using it in a functional index).
(ABS($2) + 5)*2 is way too much padding. Replace with ((abs($2) * 7) / 5) + 3.
Multiple levels of CTEs were useless cruft.
ORDER BY in last CTE was useless, too.
As mentioned in my previous answer, extract(ISODOW FROM ...) is more convenient to truncate weekends.
Query
That said, I wouldn't use the above function for this query at all. Build a complete grid of relevant days once, instead of calculating the range of days for every single row.
Based on this assertion in a comment (should be in the question, really!):
two subsequent windows of the same firm can never overlap.
WITH range AS ( -- only with flag
SELECT company
, min(day) - 2 AS r_start
, max(day) + 1 AS r_stop
FROM tbl t
WHERE flag <> 0
GROUP BY 1
)
, grid AS (
SELECT company, day::date
FROM range r
,generate_series(r.r_start, r.r_stop, interval '1d') d(day)
WHERE extract('ISODOW' FROM d.day) < 6
)
SELECT *, sum(flag) OVER(PARTITION BY company ORDER BY day
ROWS BETWEEN UNBOUNDED PRECEDING
AND 2 following) AS window_nr
FROM (
SELECT t.*, max(t.flag) OVER(PARTITION BY g.company ORDER BY g.day
ROWS BETWEEN 1 preceding
AND 2 following) in_window
FROM grid g
LEFT JOIN tbl t USING (company, day)
) sub
WHERE in_window > 0 -- only rows in [-2,1] window
AND day IS NOT NULL -- exclude missing days in [-2,1] window
ORDER BY company, day;
How?
Build a grid of all business days: CTE grid.
To keep the grid to its smallest possible size, extract minimum and maximum (plus buffer) day per company: CTE range.
LEFT JOIN actual rows to it. Now the frames for the ensuing window functions work with static numbers.
To get distinct numbers per flag and company (window_nr), just count flags from the start of the grid (taking buffers into account).
Only keep days inside your [-2,1] windows (in_window > 0).
Only keep days with actual rows in the table.
Voilà.
SQL Fiddle.
Basically the strategy is to first enumerate the flag days and then join the others with them:
WITH windows AS (
    SELECT t1.day
         , t1.company
         , rank() OVER (PARTITION BY company ORDER BY day) AS rank
    FROM table1 t1
    WHERE flag = 1
)
SELECT t1.day
     , t1.company
     , t1.flag
     , w.rank
FROM table1 AS t1
JOIN windows AS w
    ON t1.company = w.company
   AND w.day BETWEEN t1.day - interval '2 day' AND t1.day + interval '1 day'
ORDER BY t1.day, t1.company;
Fiddle.
However, there is a problem with working days, as those can mean different things (do holidays count?).