Grouping data in SQL by difference in column values - sql

I have the following columns in my logs table in Postgres:
logid => int (auto increment)
start_time => bigint (stores epoch value)
inserted_value => int
The following data is stored in the table ("start time actual" is not a column; it just shows the start_time value as UTC in 24-hour format):
logid | user_id | start_time | inserted_value | start time actual
1 | 1 | 1518416562 | 15 | 12-Feb-2018 06:22:42
2 | 1 | 1518416622 | 8 | 12-Feb-2018 06:23:42
3 | 1 | 1518417342 | 9 | 12-Feb-2018 06:35:42
4 | 1 | 1518417402 | 12 | 12-Feb-2018 06:36:42
5 | 1 | 1518417462 | 18 | 12-Feb-2018 06:37:42
6 | 1 | 1518418757 | 6 | 12-Feb-2018 06:59:17
7 | 1 | 1518418808 | 11 | 12-Feb-2018 07:00:08
I want to group and sum the values according to the difference in start_time.
For the data above, the sum should be calculated in three groups:
user_id sum
1 15 + 8
1 9 + 12 + 18
1 6 + 11
So, consecutive values in each group are at most 1 minute apart. This 1 can be replaced by any x minutes.
I also tried the LAG function but could not understand it fully. I hope I have explained my question clearly.

You can use a plain group by to achieve what you want. Just make all start_time values that belong to the same minute equal. For example:
select user_id, start_time/60, sum(inserted_value)
from log_table
group by user_id, start_time/60
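I assume your start_time column contains integers representing seconds since the epoch, so /60 truncates them to minutes through integer division. If the values are floats, you should use floor(start_time/60). Note that this buckets rows by clock minute rather than by the gap between consecutive rows, so two rows exactly 60 seconds apart (such as 06:22:42 and 06:23:42 in the sample) fall into different buckets.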
If you also want to select a human readable date of the minute you're grouping, you can add to_timestamp((start_time/60)*60) to the select list.
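A quick way to see what the minute-bucket query does with the question's sample rows is to run it against an in-memory database. This is only a sketch: SQLite (via Python's sqlite3 module) stands in for Postgres here, which is an assumption, and integer division behaves the same way in both for this purpose. Every sample row happens to fall in a different clock minute, so the query returns seven one-row buckets.

```python
import sqlite3

# Sample rows from the question: (logid, user_id, start_time, inserted_value)
rows = [
    (1, 1, 1518416562, 15),
    (2, 1, 1518416622, 8),
    (3, 1, 1518417342, 9),
    (4, 1, 1518417402, 12),
    (5, 1, 1518417462, 18),
    (6, 1, 1518418757, 6),
    (7, 1, 1518418808, 11),
]

# Assumption: SQLite is used only as a convenient stand-in for Postgres.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE log_table (logid INTEGER, user_id INTEGER, "
    "start_time INTEGER, inserted_value INTEGER)"
)
conn.executemany("INSERT INTO log_table VALUES (?, ?, ?, ?)", rows)

# Integer division by 60 maps every epoch second to its clock minute.
minute_buckets = conn.execute(
    """
    SELECT user_id, start_time / 60 AS minute_bucket, SUM(inserted_value)
    FROM log_table
    GROUP BY user_id, start_time / 60
    ORDER BY minute_bucket
    """
).fetchall()

for bucket in minute_buckets:
    print(bucket)
```

Fixed buckets like these suit "totals per clock minute" reports; grouping by the gap between consecutive rows needs a window function instead.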

You can use LAG to check whether the current row's start_time is more than 60 seconds after the previous row's, and set group_changed (a computed column) to 1 each time this happens.
In the next step, take a running sum over that column. This creates a group_number that you can use to group the results in the third step.
WITH cte1 AS (
    SELECT
        testdata.*,
        CASE WHEN start_time - LAG(start_time, 1, start_time) OVER (PARTITION BY user_id ORDER BY start_time) > 60 THEN 1 ELSE 0 END AS group_changed
    FROM testdata
), cte2 AS (
    SELECT
        cte1.*,
        SUM(group_changed) OVER (PARTITION BY user_id ORDER BY start_time) AS group_number
    FROM cte1
)
SELECT user_id, SUM(inserted_value)
FROM cte2
GROUP BY user_id, group_number
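To check the LAG plus running-sum approach end to end, here is a small sketch that loads the question's sample rows and runs the same three-step query. SQLite (through Python's sqlite3 module) is used as a stand-in for Postgres, which is an assumption; it needs SQLite 3.25 or later for window functions. The :gap parameter is a hypothetical addition so that the "x minutes" from the question can be varied.

```python
import sqlite3

# Sample rows from the question: (logid, user_id, start_time, inserted_value)
rows = [
    (1, 1, 1518416562, 15),
    (2, 1, 1518416622, 8),
    (3, 1, 1518417342, 9),
    (4, 1, 1518417402, 12),
    (5, 1, 1518417462, 18),
    (6, 1, 1518418757, 6),
    (7, 1, 1518418808, 11),
]

conn = sqlite3.connect(":memory:")  # assumption: SQLite instead of Postgres
conn.execute(
    "CREATE TABLE testdata (logid INTEGER, user_id INTEGER, "
    "start_time INTEGER, inserted_value INTEGER)"
)
conn.executemany("INSERT INTO testdata VALUES (?, ?, ?, ?)", rows)

query = """
WITH cte1 AS (
    -- Step 1: flag rows that start a new group (gap larger than :gap seconds).
    SELECT testdata.*,
           CASE WHEN start_time - LAG(start_time, 1, start_time)
                     OVER (PARTITION BY user_id ORDER BY start_time) > :gap
                THEN 1 ELSE 0 END AS group_changed
    FROM testdata
), cte2 AS (
    -- Step 2: a running sum of the flags numbers the groups 0, 1, 2, ...
    SELECT cte1.*,
           SUM(group_changed)
               OVER (PARTITION BY user_id ORDER BY start_time) AS group_number
    FROM cte1
)
-- Step 3: aggregate per group.
SELECT user_id, group_number, SUM(inserted_value) AS total
FROM cte2
GROUP BY user_id, group_number
ORDER BY user_id, group_number
"""

group_sums = conn.execute(query, {"gap": 60}).fetchall()
print(group_sums)  # [(1, 0, 23), (1, 1, 39), (1, 2, 17)]
```

The three totals correspond to 15 + 8, 9 + 12 + 18 and 6 + 11 from the question.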

Related

How do I use SQL to perform a cumulative sum where the increments have an expiration?

Say the scenario is this:
I have a database of student infractions. When a student is late to class, or misses a homework assignment they get an infraction.
student_id | infraction_type | day
1 | tardy | 0
2 | missed_assignment | 0
1 | tardy | 29
2 | missed_assignment | 15
1 | tardy | 99
2 | missed_assignment | 29
The school has a three-strike system: at each infraction, disciplinary action is taken. Call the actions D0, D1, D2.
Infractions expire after 30 days.
I want to be able to perform a query to calculate the total counts of disciplinary actions taken in a given time period.
So the number of disciplinary actions taken in the last 100 days (at day 99) would be
disciplinary_action | count
D0 | 3
D1 | 2
D2 | 1
A table generated showing the disciplinary actions taken would look like:
student_id | infraction_type | day | disciplinary_action_gen
1 | tardy | 0 | D0
2 | missed_assignment | 0 | D0
1 | tardy | 29 | D1
2 | missed_assignment | 15 | D1
1 | tardy | 99 | D0
2 | missed_assignment | 29 | D2
What SQL query could I use to do such a cumulative sum?
You can solve your problem by checking in the following order:
if fewer than 30 days have passed since the infraction before the previous one, assign D2
if fewer than 30 days have passed since the previous infraction, assign D1
otherwise assign D0 (it's the first infraction, or the earlier ones have expired)
This will work assuming your DBMS supports the tools used for this solution, namely:
the CASE expression, to conditionally assign infraction values
the LAG window function, to retrieve the previous "day" values
SELECT *,
       CASE WHEN day - LAG(day,2) OVER(PARTITION BY student_id
                                       ORDER BY day) < 30 THEN 'D2'
            WHEN day - LAG(day,1) OVER(PARTITION BY student_id
                                       ORDER BY day) < 30 THEN 'D1'
            ELSE 'D0'
       END AS disciplinary_action_gen
FROM tab
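As a sanity check, the CASE plus LAG query can be run against the six sample infractions from the question. The sketch below uses SQLite through Python's sqlite3 module as a stand-in for MySQL (an assumption; window functions need SQLite 3.25 or later) and keeps the table name tab from the query above.

```python
import sqlite3

# Sample infractions from the question: (student_id, infraction_type, day)
infractions = [
    (1, "tardy", 0),
    (2, "missed_assignment", 0),
    (1, "tardy", 29),
    (2, "missed_assignment", 15),
    (1, "tardy", 99),
    (2, "missed_assignment", 29),
]

conn = sqlite3.connect(":memory:")  # assumption: SQLite instead of MySQL
conn.execute("CREATE TABLE tab (student_id INTEGER, infraction_type TEXT, day INTEGER)")
conn.executemany("INSERT INTO tab VALUES (?, ?, ?)", infractions)

# LAG(day, 2) is NULL when fewer than two earlier rows exist, so the
# comparison is not true and evaluation falls through to the next branch.
actions = conn.execute(
    """
    SELECT student_id, day,
           CASE WHEN day - LAG(day, 2) OVER (PARTITION BY student_id
                                             ORDER BY day) < 30 THEN 'D2'
                WHEN day - LAG(day, 1) OVER (PARTITION BY student_id
                                             ORDER BY day) < 30 THEN 'D1'
                ELSE 'D0'
           END AS disciplinary_action_gen
    FROM tab
    ORDER BY student_id, day
    """
).fetchall()

for row in actions:
    print(row)
```

Student 1 gets D0, D1, D0 (the day-99 infraction comes after the earlier ones have expired) and student 2 gets D0, D1, D2, matching the expected table in the question.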
A similar approach using COUNT() as a window function and a frame definition -
SELECT
    *,
    CONCAT(
        'D',
        LEAST(
            3,
            COUNT(*) OVER (
                PARTITION BY student_id
                ORDER BY day ASC
                RANGE BETWEEN 30 PRECEDING AND CURRENT ROW
            )
        ) - 1
    ) AS disciplinary_action_gen
FROM infractions;
The frame definition (RANGE BETWEEN 30 PRECEDING AND CURRENT ROW) tells the server that we want to include all rows with a day value between (current row's value of day - 30) and (the current row's value of day). So, if the current row has a day value of 99, the count will be for all rows in the partition with a day value between 69 and 99.
To get the disciplinary counts, we can simply wrap this in a normal GROUP BY -
SELECT disciplinary_action, COUNT(*) AS count
FROM (
    SELECT
        CONCAT(
            'D',
            LEAST(
                3,
                COUNT(*) OVER (
                    PARTITION BY student_id
                    ORDER BY day ASC
                    RANGE BETWEEN 30 PRECEDING AND CURRENT ROW
                )
            ) - 1
        ) AS disciplinary_action
    FROM infractions
) t
GROUP BY disciplinary_action;
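The windowed COUNT with a RANGE frame can be tried out on the question's sample data as follows. This is a sketch under a few assumptions: SQLite (via Python's sqlite3) replaces MySQL, numeric RANGE offsets need SQLite 3.28 or later, and because older SQLite versions lack CONCAT and LEAST the sketch uses the || operator and the two-argument MIN instead.

```python
import sqlite3

# Sample infractions from the question: (student_id, infraction_type, day)
sample = [
    (1, "tardy", 0),
    (2, "missed_assignment", 0),
    (1, "tardy", 29),
    (2, "missed_assignment", 15),
    (1, "tardy", 99),
    (2, "missed_assignment", 29),
]

conn = sqlite3.connect(":memory:")  # assumption: SQLite instead of MySQL
conn.execute(
    "CREATE TABLE infractions (student_id INTEGER, infraction_type TEXT, day INTEGER)"
)
conn.executemany("INSERT INTO infractions VALUES (?, ?, ?)", sample)

action_counts = conn.execute(
    """
    SELECT disciplinary_action, COUNT(*) AS count
    FROM (
        -- Cap the count at 3 and shift it to 0..2 to build 'D0'..'D2'.
        SELECT 'D' || (MIN(3, recent) - 1) AS disciplinary_action
        FROM (
            -- Count this student's infractions with day in [day - 30, day].
            SELECT COUNT(*) OVER (
                       PARTITION BY student_id
                       ORDER BY day
                       RANGE BETWEEN 30 PRECEDING AND CURRENT ROW
                   ) AS recent
            FROM infractions
        )
    )
    GROUP BY disciplinary_action
    ORDER BY disciplinary_action
    """
).fetchall()

print(action_counts)  # [('D0', 3), ('D1', 2), ('D2', 1)]
```

Unlike a ROWS frame, the RANGE frame is defined on the values of day, so it keeps working when a student has several infractions close together or long stretches with none.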
If your infractions are stored with a date, as opposed to the days in your example, this can be easily updated to use a date interval in the frame definition. And, if looking at counts of disciplinary actions in the last 100 days we need to include the previous 30 days, as these could impact the action (D0, D1 or D2) on the first day we are interested in.
SELECT disciplinary_action, COUNT(*) AS count
FROM (
    SELECT
        `date`,
        CONCAT(
            'D',
            LEAST(
                3,
                COUNT(*) OVER (
                    PARTITION BY student_id
                    ORDER BY `date` ASC
                    RANGE BETWEEN INTERVAL 30 DAY PRECEDING AND CURRENT ROW
                )
            ) - 1
        ) AS disciplinary_action
    FROM infractions
    WHERE `date` >= CURRENT_DATE - INTERVAL 130 DAY
) t
WHERE `date` >= CURRENT_DATE - INTERVAL 100 DAY
GROUP BY disciplinary_action;

Correcting sql query to exclude sessions under 2 min

I am trying to write code that will remove any entries that are within 2 minutes of each other on the same day for each user_id.
For example, here is the table:
user_id | day | time
x | 1 | 00:55:54
x | 1 | 00:55:55
x | 1 | 00:56:01
x | 2 | 16:11:43
x | 2 | 16:12:01
x | 2 | 16:15:02
x | 2 | 16:30:07
x | 2 | 16:31:08
x | 2 | 16:40:09
x | 2 | 16:41:02
So if, within the same day, there are entries that are less than 2 minutes apart, I would like to exclude those entries.
Note: day and time were obtained by applying day() and time() to a datetime column called timestamp.
The code I have is:
WITH frames AS (
    SELECT
        user_id, day(timestamp), time(timestamp) AS starttime, COALESCE(
            LEAD(time(timestamp)) OVER(PARTITION BY user_id, day(timestamp)),
            '23:59:59'
        ) AS final
    FROM events
)
SELECT user_id, day(timestamp), starttime, final, TIMEDIFF(final, starttime) AS duration
FROM frames
WHERE TIMEDIFF(final, starttime) < 120;
But I get this error: Error Code: 1054. Unknown column 'timestamp' in 'field list'
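The error occurs because the outer query refers to timestamp, but the frames CTE only exposes user_id, day(timestamp), starttime and final; give day(timestamp) an alias inside the CTE and select that alias instead. Beyond that, I am guessing that you want rows that are more than 2 minutes from the previous row: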
select e.*
from (select e.*,
             lag(timestamp) over (partition by user_id, date(timestamp) order by timestamp) as prev_timestamp
      from events e
     ) e
where prev_timestamp is null or
      prev_timestamp < timestamp - interval '2 minute';
You do not specify the database you are using, so this uses reasonable database syntax that might need to be modified for your database.
Also note that in databases that support the function, day() typically returns the day of the month. You want a function that removes the time component, which is usually date() or cast(timestamp as date).
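To see which rows such a LAG filter keeps, here is a sketch run on the times from the question. Several things are assumptions: SQLite (via Python's sqlite3) stands in for the asker's database, the column is called ts instead of timestamp, the two calendar dates are made up because the question only shows day numbers, and the window is partitioned by user_id as well as by date.

```python
import sqlite3

# Times from the question; the dates 2021-01-01 / 2021-01-02 are hypothetical.
events = [
    ("x", "2021-01-01 00:55:54"),
    ("x", "2021-01-01 00:55:55"),
    ("x", "2021-01-01 00:56:01"),
    ("x", "2021-01-02 16:11:43"),
    ("x", "2021-01-02 16:12:01"),
    ("x", "2021-01-02 16:15:02"),
    ("x", "2021-01-02 16:30:07"),
    ("x", "2021-01-02 16:31:08"),
    ("x", "2021-01-02 16:40:09"),
    ("x", "2021-01-02 16:41:02"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id TEXT, ts TEXT)")
conn.executemany("INSERT INTO events VALUES (?, ?)", events)

# Keep a row if it is the first of its (user, day) partition or if it comes
# more than 120 seconds after the previous row in that partition.
kept = conn.execute(
    """
    SELECT user_id, ts
    FROM (
        SELECT user_id, ts,
               LAG(ts) OVER (PARTITION BY user_id, date(ts)
                             ORDER BY ts) AS prev_ts
        FROM events
    )
    WHERE prev_ts IS NULL
       OR CAST(strftime('%s', ts) AS INTEGER)
          - CAST(strftime('%s', prev_ts) AS INTEGER) > 120
    ORDER BY ts
    """
).fetchall()

for row in kept:
    print(row)
```

With this data five rows survive: the first row of each day plus 16:15:02, 16:30:07 and 16:40:09, each of which is more than two minutes after its predecessor.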

Count the number of the time records appears in 48 hrs- SQL

How do we select the count of records that appear more than once within 48 hours?
For example:
ID DATE
1 9/24/2018
1 9/23/2018
1 9/20/2018
2 9/20/2018
ID 1 appeared more than once in 48 hours.
Please let me know how to write a SQL query to do this.
There are lots of ways, but I'd start with LAG() and a date comparison, assuming your DATE column has a date data type:
WITH
  entity_summary AS
(
  SELECT
    ID,
    CASE
      WHEN LAG("DATE") OVER (PARTITION BY ID ORDER BY "DATE") >= "DATE" - INTERVAL '2' DAY
      THEN 1
      ELSE 0
    END
      AS occurence_within_2_day
  FROM
    Table1
)
SELECT
  ID,
  SUM(occurence_within_2_day)
FROM
  entity_summary
GROUP BY
  ID
HAVING
  SUM(occurence_within_2_day) >= 1
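The same idea can be tried on the four sample rows. The sketch below is written for SQLite (through Python's sqlite3), which is an assumption; since SQLite has no INTERVAL syntax, it compares julianday() values, where a difference of 2 means two days. The column name seen_on is hypothetical and replaces the quoted "DATE" column.

```python
import sqlite3

# Sample rows from the question, with dates rewritten in ISO format.
rows = [
    (1, "2018-09-24"),
    (1, "2018-09-23"),
    (1, "2018-09-20"),
    (2, "2018-09-20"),
]

conn = sqlite3.connect(":memory:")  # assumption: SQLite as a stand-in
conn.execute("CREATE TABLE table1 (id INTEGER, seen_on TEXT)")  # seen_on is a made-up name
conn.executemany("INSERT INTO table1 VALUES (?, ?)", rows)

repeaters = conn.execute(
    """
    WITH entity_summary AS (
        SELECT id,
               -- 1 when the previous record for this id is at most 2 days older
               CASE WHEN LAG(julianday(seen_on))
                         OVER (PARTITION BY id ORDER BY seen_on)
                         >= julianday(seen_on) - 2
                    THEN 1 ELSE 0 END AS occurrence_within_2_day
        FROM table1
    )
    SELECT id, SUM(occurrence_within_2_day) AS close_pairs
    FROM entity_summary
    GROUP BY id
    HAVING SUM(occurrence_within_2_day) >= 1
    """
).fetchall()

print(repeaters)  # [(1, 1)]
```

Only ID 1 is returned: its records on 9/23 and 9/24 are one day apart, while 9/20 and 9/23 are three days apart and do not count.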

Get moving average over time frame in PostgreSQL with inconsistent data

I have a table called answers with columns created_at and response, response being an integer 0 (for 'no'), 1 (for 'yes'), or 2 (for 'don't know'). I want to get a moving average of the response values for each day, filtering out 2s and only taking into account the previous 30 days. I know you can do ROWS BETWEEN 29 PRECEDING AND CURRENT ROW, but that only works if you have data for each day, and in my case there might be no data for a week or more.
My current query is this:
SELECT answers.created_at, answers.response,
       AVG(answers.response)
           OVER(ORDER BY answers.created_at::date ROWS
                BETWEEN 29 PRECEDING AND CURRENT ROW) AS rolling_average
FROM answers
WHERE answers.user_id = 'insert_user_id'
  AND (answers.response = 0 OR answers.response = 1)
GROUP BY answers.created_at, answers.response
ORDER BY answers.created_at::date
But this returns an average based on the previous rows regardless of their dates: if a user responded with a 1 on 2018-3-30 and a 0 on 2018-5-15, the rolling average on 2018-5-15 would be 0.5 instead of the 0 I want. How can I write a query that only takes into account the responses created within the last 30 days for the rolling average?
Since Postgres 11 you can do this:
SELECT created_at,
       response,
       AVG(response) OVER (ORDER BY created_at
                           RANGE BETWEEN '29 day' PRECEDING AND current row) AS rolling_average
FROM answers
WHERE user_id = 1
  AND response in (0,1)
ORDER BY created_at;
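To illustrate how a RANGE frame ignores rows outside the time window, here is a sketch on a few rows. It runs on SQLite through Python's sqlite3 (an assumption; numeric RANGE offsets need SQLite 3.28 or later), so the interval '29 day' is expressed as 29 on julianday(created_at). The rows dated 2018-05-10 and 2018-05-12 are made up for illustration; the other two come from the question.

```python
import sqlite3

# (user_id, created_at, response); the May 10 and May 12 rows are hypothetical.
answers = [
    (1, "2018-03-30", 1),
    (1, "2018-05-10", 1),
    (1, "2018-05-12", 2),  # "don't know", filtered out below
    (1, "2018-05-15", 0),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE answers (user_id INTEGER, created_at TEXT, response INTEGER)")
conn.executemany("INSERT INTO answers VALUES (?, ?, ?)", answers)

# The frame covers rows whose date lies within the 29 days before the
# current row's date (inclusive), no matter how many rows that is.
rolling = conn.execute(
    """
    SELECT created_at,
           response,
           AVG(response) OVER (
               ORDER BY julianday(created_at)
               RANGE BETWEEN 29 PRECEDING AND CURRENT ROW
           ) AS rolling_average
    FROM answers
    WHERE user_id = 1
      AND response IN (0, 1)
    ORDER BY created_at
    """
).fetchall()

for row in rolling:
    print(row)
```

The 2018-05-15 row averages only itself and the 2018-05-10 row (0.5); the 2018-03-30 response is more than 29 days old and no longer contributes.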
Try something like this:
SELECT * FROM (
    SELECT
        d.created_at, d.response,
        Avg(d.response) OVER(ORDER BY d.created_at::date rows BETWEEN 29 PRECEDING AND CURRENT row) AS rolling_average
    FROM (
        SELECT
            COALESCE(a.created_at, d.dates) AS created_at, response, a.user_id
        FROM
            (SELECT generate_series('2018-01-01'::date, '2018-05-31'::date, '1day'::interval)::date AS dates) d
        LEFT JOIN
            (SELECT * FROM answers WHERE answers.user_id = 'insert_user_id' AND ( answers.response = 0 OR answers.response = 1)) a
            ON d.dates = a.created_at::date
    ) d
    GROUP BY d.created_at, d.response
) agg WHERE agg.response IS NOT NULL
ORDER BY agg.created_at::date
generate_series creates a list of days (you have to set reasonable boundaries)
this list of days is LEFT JOINed with the preselected answers
the result is used for the rolling average calculation
after that I select only the records that have a response, and I get:
created_at | response | rolling_average
2018-03-30 | 1 | 1.00000000000000000000
2018-05-15 | 0 | 0.00000000000000000000

How to update date column in partitions of n rows?

I am trying to update my date field.
Table structure is like:
date id
2016-11-14 1
2016-11-14 2
2016-11-14 3
2016-11-14 4
-
-
-
2016-11-14 100
How can I update the first ten records with one date, the second ten records with a different date, and so on?
UPDATE tbl t
SET    "date" = date '2016-11-14' + (sub.rn::int - 1) / 10  -- integer division; rn starts at 1, so rows 1-10 get +0
FROM  (
   SELECT id, row_number() OVER (ORDER BY id) AS rn
   FROM   tbl
   ) sub
WHERE  t.id = sub.id;
The subquery computes a gapless row number, since nothing in your question says the id is actually guaranteed to be without gaps.
You can just add an integer to an actual date to add days. (Forgot the cast to int in my first version.)
For timestamp use instead:
timestamp '2016-11-14' + interval '1 day' * ((sub.rn - 1) / 10)
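The row-number arithmetic is easy to get off by one, so here is a small sketch that only computes the new dates rather than updating anything. It uses SQLite via Python's sqlite3 as a stand-in for Postgres (an assumption), with SQLite's date() modifiers in place of date + integer, and ids that deliberately contain gaps. Subtracting 1 before the integer division makes rows 1 to 10 share offset 0, rows 11 to 20 share offset 1, and so on.

```python
import sqlite3
from collections import Counter

conn = sqlite3.connect(":memory:")  # assumption: SQLite instead of Postgres
conn.execute("CREATE TABLE tbl (id INTEGER)")
# 100 ids with gaps (1, 3, 5, ...), to show that row_number() ignores the gaps.
conn.executemany("INSERT INTO tbl VALUES (?)", [(i,) for i in range(1, 201, 2)])

assigned = conn.execute(
    """
    SELECT id,
           date('2016-11-14', '+' || ((rn - 1) / 10) || ' days') AS new_date
    FROM (
        SELECT id, row_number() OVER (ORDER BY id) AS rn
        FROM tbl
    )
    ORDER BY id
    """
).fetchall()

# Count how many rows landed on each date.
rows_per_date = Counter(new_date for _, new_date in assigned)
for new_date in sorted(rows_per_date):
    print(new_date, rows_per_date[new_date])
```

Each of the ten dates from 2016-11-14 to 2016-11-23 receives exactly ten rows.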
You could use a CASE:
UPDATE yourTable
SET "date" = CASE WHEN id <= 10 THEN '2016-11-01'::timestamp
                  WHEN id <= 20 THEN '2016-11-02'::timestamp
                  ....
                  WHEN id <= 100 THEN '2016-11-10'::timestamp
             END;