Count distinct trip_ids by day when timestamps cross midnight - sql

I have a time series of data with a trip_id and a timestamp, and I'm trying to write a SQL query that gives me the number of unique trip_ids that occur on each day.
The problem is that trips extend across midnight: once the next day starts, the trip is treated as a new distinct value and counted twice by select date(Timestamp), COUNT(DISTINCT trip_id). Any help, or a pointer in the right direction, would be much appreciated.
Data:
trip_id Timestamp
47585 "2015-11-05 09:22:23"
16935 "2015-11-05 12:34:28"
16935 "2015-11-05 20:40:28"
16935 "2015-11-05 23:09:24"
16935 "2015-11-05 23:21:58"
16935 "2015-11-06 00:22:05"
15434 "2015-11-06 21:23:28"
Desired Outcome
date count
2015-11-05 2
2015-11-06 1

Use the minimum of the timestamp for each trip:
select dte, count(*)
from (select trip_id, min(date_trunc('day', timestamp)) as dte
      from t
      group by trip_id
     ) t
group by dte
order by dte;
That is, count the day when the trip begins.
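A runnable sketch of this approach, using Python's sqlite3 (SQLite has no date_trunc, so date() stands in for it; table name and data follow the question):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (trip_id INTEGER, Timestamp TEXT)")
con.executemany("INSERT INTO t VALUES (?, ?)", [
    (47585, "2015-11-05 09:22:23"),
    (16935, "2015-11-05 12:34:28"),
    (16935, "2015-11-05 20:40:28"),
    (16935, "2015-11-05 23:09:24"),
    (16935, "2015-11-05 23:21:58"),
    (16935, "2015-11-06 00:22:05"),
    (15434, "2015-11-06 21:23:28"),
])

# Each trip is counted once, on the day it begins (its minimum timestamp).
rows = con.execute("""
    SELECT dte, COUNT(*) AS cnt
    FROM (SELECT trip_id, MIN(date(Timestamp)) AS dte
          FROM t
          GROUP BY trip_id) sub
    GROUP BY dte
    ORDER BY dte
""").fetchall()
print(rows)  # [('2015-11-05', 2), ('2015-11-06', 1)]
```

The inner GROUP BY collapses trip 16935's six rows to a single 2015-11-05 start date, which is exactly what keeps it from being double-counted.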


SQL Server : count distinct every 30 minutes or more

We have an activity database that records user interactions with a website, storing log rows with values such as Time1, session_id and customer_id, e.g.
2022-05-12 08:00:00|11|1
2022-05-12 08:20:00|11|1
2022-05-12 08:30:01|11|1
2022-05-12 08:14:00|22|2
2022-05-12 08:18:00|22|2
2022-05-12 08:16:00|33|1
2022-05-12 08:50:00|33|1
I need to have two separate queries:
Query #1: I need to count, per day, the sessions whose logged activity spans 30 minutes or more.
For example: Initially count=0
For session_id = 11, it starts at 08:00 and the last time with the same session_id is 08:30 -- count=1
For session_id = 22, it starts at 08:14 and the last time with the same session is 08:18 -- the count stays at 1, since the span was less than 30 min
I tried this query, but it didn't work
select
count(session_id)
from
table1
where
#datetime between Time1 and dateadd(minute, 30, Time1);
Expected result:
Query #2: it's an extension of the above query where I need the unique customers on daily basis whose sessions were 30 min or more.
For example: from the above table I will have two unique customers on May 12th
Expected result
The Time1 column is a timestamp; when I show it in the output I will group it on a daily basis.
This is a two-level aggregation (GROUP BY) problem. You need to start with a subquery to get the first and last timestamp of each session.
SELECT MIN(Time1) start_time,
       MAX(Time1) end_time,
       session_id, customer_id
FROM table1
GROUP BY session_id, customer_id
Next you need to use the subquery like this:
SELECT COUNT(session_id),
       COUNT(DISTINCT customer_id),
       CAST(start_time AS DATE)
FROM (
    SELECT MIN(Time1) start_time,
           MAX(Time1) end_time,
           session_id, customer_id
    FROM table1
    GROUP BY session_id, customer_id
) a
WHERE DATEDIFF(MINUTE, start_time, end_time) >= 30
GROUP BY CAST(start_time AS DATE);
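A runnable sketch of this two-level aggregation in SQLite via Python's sqlite3 (SQLite has no DATEDIFF, so a julianday() difference in minutes stands in for it; data is from the question):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE table1 (Time1 TEXT, session_id INTEGER, customer_id INTEGER)")
con.executemany("INSERT INTO table1 VALUES (?, ?, ?)", [
    ("2022-05-12 08:00:00", 11, 1),
    ("2022-05-12 08:20:00", 11, 1),
    ("2022-05-12 08:30:01", 11, 1),
    ("2022-05-12 08:14:00", 22, 2),
    ("2022-05-12 08:18:00", 22, 2),
    ("2022-05-12 08:16:00", 33, 1),
    ("2022-05-12 08:50:00", 33, 1),
])

# Inner query: first/last timestamp per session.  Outer query: keep only
# sessions spanning >= 30 minutes, then count sessions and distinct customers.
rows = con.execute("""
    SELECT date(start_time),
           COUNT(session_id)           AS sessions,
           COUNT(DISTINCT customer_id) AS customers
    FROM (SELECT MIN(Time1) AS start_time,
                 MAX(Time1) AS end_time,
                 session_id, customer_id
          FROM table1
          GROUP BY session_id, customer_id) a
    WHERE (julianday(end_time) - julianday(start_time)) * 24 * 60 >= 30
    GROUP BY date(start_time)
""").fetchall()
print(rows)  # [('2022-05-12', 2, 1)]
```

Note that on this sample only sessions 11 and 33 qualify, and both belong to customer 1, so the distinct-customer count is 1.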

How to get average of earliest datetime per day?

I need to get the average check-in time based on the earliest check-in for each day.
For example, these are my check in datetime.
Person | Checkin
-----------------
Jack 2022-01-06 16:42:34.000
Jack 2022-01-06 17:30:34.000
Jack 2022-01-07 10:22:34.000
Jack 2022-01-07 12:12:54.000
Jack 2022-01-08 11:08:53.000
When I want to calculate my average check in time based on earliest check in, I should only consider these datetime.
2022-01-06 16:42:34.000
2022-01-07 10:22:34.000
2022-01-08 11:08:53.000
This is my current SQL to get the average check-in time. But it considers all of the datetimes above when computing the average.
SELECT Person,
       CAST(DATEADD(SS, AVG(CAST(DATEDIFF(SS, '00:00:00', CAST(Checkin AS TIME)) AS BIGINT)), '00:00:00') AS TIME) AS 'AvgCheckInTime'
FROM #Tab_CheckIn
GROUP BY Person
How can I only consider the earliest datetime for each day and get the average time?
This is a little messy.
Firstly you need to group the data to get the earliest check-in by day. That's quite simple, using CONVERT to date and MIN.
Then you need to get the "average" of those times. You can't average a time in SQL Server, as time represents a point in time during the day rather than an interval, so averaging it directly doesn't make sense the way it would for a timespan. As a result you need to "convert" the values to something numerical.
In this case I use DATEDIFF to get the number of seconds from midnight to the check-in converted to a time. Then I can average those values, and finally add that many seconds back to midnight:
WITH Earliest AS(
SELECT Person,
MIN(Checkin) AS EarliestCheckin
FROM dbo.YourTable
GROUP BY Person,
CONVERT(date,Checkin))
SELECT Person,
DATEADD(SECOND,AVG(DATEDIFF(SECOND,'00:00',CONVERT(time,EarliestCheckin))),CONVERT(time,'00:00')) AS AverageCheckin
FROM Earliest
GROUP BY Person;
Your existing query already produces the desired result except that it doesn't filter out the later check-ins; feed it only the earliest check-in per day:
SELECT t.Person,
       CAST(DATEADD(SS, AVG(CAST(DATEDIFF(SS, '00:00:00', CAST(t.Checkin AS TIME)) AS BIGINT)), '00:00:00') AS TIME) AS 'AvgCheckInTime'
FROM (SELECT Person, MIN(Checkin) AS Checkin
      FROM #Tab_CheckIn
      GROUP BY Person, CAST(Checkin AS DATE)) t
GROUP BY t.Person
Alternatively, use ROW_NUMBER() to pick the first check-in of each day:
SELECT Person,(CAST(DATEADD(SS, AVG(CAST(DATEDIFF(SS, '00:00:00', CAST(Checkin AS TIME)) AS BIGINT)), '00:00:00' ) AS TIME)) AS 'AvgCheckInTime'
from (
SELECT person, checkin, row_number() over(PARTITION by person, CAST(checkin as DATE) order by checkin asc) as row
FROM #Tab_CheckIn) as p
where p.row =1
group by p.person
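The earliest-per-day-then-average idea can be checked end to end in SQLite via Python's sqlite3 (DATEADD/DATEDIFF have no SQLite equivalents, so strftime('%s', ...) arithmetic stands in; data is from the question):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE checkins (Person TEXT, Checkin TEXT)")
con.executemany("INSERT INTO checkins VALUES (?, ?)", [
    ("Jack", "2022-01-06 16:42:34"),
    ("Jack", "2022-01-06 17:30:34"),
    ("Jack", "2022-01-07 10:22:34"),
    ("Jack", "2022-01-07 12:12:54"),
    ("Jack", "2022-01-08 11:08:53"),
])

# CTE: earliest check-in per person per day.  Outer query: average the
# seconds-since-midnight of those check-ins, then render back as a time.
rows = con.execute("""
    WITH earliest AS (
        SELECT Person, MIN(Checkin) AS ec
        FROM checkins
        GROUP BY Person, date(Checkin)
    )
    SELECT Person,
           time(AVG(strftime('%s', ec) - strftime('%s', date(ec))), 'unixepoch') AS avg_checkin
    FROM earliest
    GROUP BY Person
""").fetchall()
print(rows)  # [('Jack', '12:44:40')]
```

Only 16:42:34, 10:22:34 and 11:08:53 enter the average, giving 12:44:40, which matches a hand calculation of (60154 + 37354 + 40133) / 3 seconds after midnight.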

Getting counts for overlapping time periods

I have a table data in PostgreSQL with this structure:
created_at customer_email status
2020-12-31 xxx@gmail.com opened
...
2020-12-24 yyy@gmail.com delivered
2020-12-24 xxx@gmail.com opened
...
2020-12-17 zzz@gmail.com opened
2020-12-10 xxx@gmail.com opened
2020-12-03 hhh@gmail.com enqueued
2020-11-27 xxx@gmail.com opened
...
2020-11-20 rrr@gmail.com opened
2020-11-13 ttt@gmail.com opened
There are many rows for each day.
Basically, for the current week (2021-W01) I need the count of unique emails with status "opened" within the last 90 days, and likewise for every week before that.
Desired output:
period active
2021-W01 1539
2020-W53 1480
2020-W52 1630
2020-W51 1820
2020-W50 1910
2020-W49 1890
2020-W48 2000
How can I do that?
Window functions would come to mind. Alas, those don't allow DISTINCT aggregations.
Instead, get distinct counts from a LATERAL subquery:
WITH weekly_dist AS (
SELECT DISTINCT date_trunc('week', created_at) AS wk, customer_email
FROM tbl
WHERE status = 'opened'
)
SELECT to_char(t.wk, 'YYYY"-W"IW') AS period, ct.active
FROM (
SELECT generate_series(date_trunc('week', min(created_at) + interval '1 week')
, date_trunc('week', now()::timestamp)
, interval '1 week') AS wk
FROM tbl
) t
LEFT JOIN LATERAL (
SELECT count(DISTINCT customer_email) AS active
FROM weekly_dist d
WHERE d.wk >= t.wk - interval '91 days'
AND d.wk < t.wk
) ct ON true;
I operate with timestamp, not timestamptz, which might make a corner-case difference.
The CTE weekly_dist reduces the set to distinct "opened" emails. This step is strictly optional, but increases performance significantly if there can be more than a few duplicates per week.
The derived table t generates a timestamp for the beginning of each week, from the earliest entry in the table up to "now". This way I make sure no week is skipped, even if there are no rows for it. See:
PostgreSQL: running count of rows for a query 'by minute'
Generating time series between two dates in PostgreSQL
But I do skip the first week since I count active emails before each start of the week.
Then LEFT JOIN LATERAL to a subquery computing the distinct count for the 90-day time-range. To be precise, I deduct 91 days, and exclude the start of the current week. This happens to fall in line with the weekly pre-aggregated data from the CTE. Be wary of that if you shift bounds.
Finally, to_char(t.wk, 'YYYY"-W"IW') is a compact expression to get your desired format for week numbers. Details in the manual here.
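For readers without Postgres at hand, here is a rough SQLite sketch (via Python's sqlite3) of the same trailing-window logic: a recursive CTE stands in for generate_series, and a correlated subquery stands in for the LATERAL join. The 91-day window and the excluded current week start follow the answer; the to_char formatting is omitted, so weeks appear as their Monday dates.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE tbl (created_at TEXT, customer_email TEXT, status TEXT)")
con.executemany("INSERT INTO tbl VALUES (?, ?, ?)", [
    ("2020-12-31", "xxx@gmail.com", "opened"),
    ("2020-12-24", "yyy@gmail.com", "delivered"),
    ("2020-12-24", "xxx@gmail.com", "opened"),
    ("2020-12-17", "zzz@gmail.com", "opened"),
    ("2020-12-10", "xxx@gmail.com", "opened"),
])

# Week bounds: Monday of the week after the earliest row (the first week is
# skipped, as in the answer) through Monday of the latest row's week.
lo, hi = con.execute("""
    SELECT date(MIN(created_at), '-6 days', 'weekday 1', '+7 days'),
           date(MAX(created_at), '-6 days', 'weekday 1')
    FROM tbl
""").fetchone()

# One row per week, each with a correlated distinct count over the trailing
# 91 days, excluding the week's own start.
rows = con.execute("""
    WITH RECURSIVE weeks(wk) AS (
        SELECT :lo
        UNION ALL
        SELECT date(wk, '+7 days') FROM weeks WHERE wk < :hi
    )
    SELECT wk,
           (SELECT COUNT(DISTINCT customer_email)
            FROM tbl
            WHERE status = 'opened'
              AND created_at >= date(wk, '-91 days')
              AND created_at <  wk) AS active
    FROM weeks
    ORDER BY wk
""", {"lo": lo, "hi": hi}).fetchall()
print(rows)  # [('2020-12-14', 1), ('2020-12-21', 2), ('2020-12-28', 2)]
```

The correlated subquery is re-evaluated per week, which is exactly the cost the weekly_dist pre-aggregation in the Postgres answer is there to reduce.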
You can combine the date_part() function with a GROUP BY, though note that this counts each week's own "opened" rows rather than distinct emails over a rolling 90-day window:
SELECT
    DATE_PART('year', created_at)::varchar || '-W' || DATE_PART('week', created_at)::varchar,
    SUM(CASE WHEN status = 'opened' THEN 1 ELSE 0 END)
FROM
    your_table
GROUP BY 1
ORDER BY 1 DESC

Average over rolling date period

I have 4 dimensions, one of which is date. I need to calculate, for each date, the average over the last 30 days for each combination of dimension values.
I have tried a windowed average partitioned by the 4 dimensions, in the form of:
SELECT
Date, Produce,Company, Song, Revenues,
Average(case when Date between Date -Interval '31' day and Date - Interval '1' Day then Revenues else null End) over (partition by Date,Company,Song,Revenues order by Date) as "Running Average"
From
Base_Table
I get only nulls with every aggregation I tried.
Help is appreciated. Thanks
You can try below - note the ORDER BY inside the window, which the ROWS frame needs, and that a rows-based frame equals 30 days only if there is exactly one row per date per partition:
SELECT
    Date, Produce, Company, Song, Revenues,
    AVG(Revenues) OVER (PARTITION BY Company, Song ORDER BY Date ROWS BETWEEN 30 PRECEDING AND CURRENT ROW) AS "Running Average"
FROM
    Base_Table
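A minimal runnable illustration of a rows-based running average, sketched in SQLite via Python's sqlite3 (window functions need SQLite 3.25+; 3 preceding rows stand in for the 30, and the table is a toy version of Base_Table):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Base_Table (Company TEXT, Song TEXT, Date TEXT, Revenues REAL)")
# One partition (Company 'A', Song 's1'), one row per date.
con.executemany("INSERT INTO Base_Table VALUES ('A', 's1', ?, ?)", [
    ("2021-01-01", 10.0),
    ("2021-01-02", 20.0),
    ("2021-01-03", 30.0),
    ("2021-01-04", 40.0),
    ("2021-01-05", 50.0),
])

# Average of each row plus the 3 rows before it within the partition.
rows = con.execute("""
    SELECT Date, Revenues,
           AVG(Revenues) OVER (PARTITION BY Company, Song
                               ORDER BY Date
                               ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS run_avg
    FROM Base_Table
    ORDER BY Date
""").fetchall()
print([r[2] for r in rows])  # [10.0, 15.0, 20.0, 25.0, 35.0]
```

The final value, 35.0, averages the last four revenues (20, 30, 40, 50), showing how the frame slides once it is full.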

How to calculate moving average value using every nth row (e.g. 24th,48th and 72nd) in sql?

Here is a snippet of my database. I want to calculate the average energy consumption for the last three days at the same hour. So if I have consumption at 24.10.2016. 10h, I want to add a column with the average consumption at 10h on 23.10.2016., 22.10.2016. and 21.10.2016. My records are measured every hour, so to calculate this average I have to look at every 24th row, and I haven't found a way to do that. How can I modify my query to get what I want:
select avg(consumption) over (order by entry_date rows between 72
preceding and 24 preceding) from my_data;
Or is there some other way?
Maybe try this one - once you PARTITION BY the hour, the three previous days at that hour are simply the 3 preceding rows:
select entry_date, EXTRACT(HOUR FROM entry_date),
       avg(consumption) over (PARTITION BY EXTRACT(HOUR FROM entry_date)
                              order by entry_date rows between 3 preceding and 1 preceding)
from my_data;
And you may use RANGE BETWEEN INTERVAL '72' HOUR PRECEDING AND INTERVAL '24' HOUR PRECEDING instead of ROWS. This covers situations where you have gaps or duplicate time values.
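As a runnable sanity check of the hour-partitioned idea, here is a SQLite sketch via Python's sqlite3 (window functions need SQLite 3.25+): for each reading it averages the three previous readings taken at the same hour of day, which, with one row per day in each hour partition, are the 3 preceding rows. For brevity the toy data holds only the 10:00 slot of four days, where a real table would have every hour.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE my_data (entry_date TEXT, consumption REAL)")
con.executemany("INSERT INTO my_data VALUES (?, ?)", [
    ("2016-10-21 10:00:00", 1.0),
    ("2016-10-22 10:00:00", 2.0),
    ("2016-10-23 10:00:00", 3.0),
    ("2016-10-24 10:00:00", 4.0),
])

# strftime('%H', ...) stands in for EXTRACT(HOUR FROM ...); the frame
# excludes the current row, so each average covers only earlier days.
rows = con.execute("""
    SELECT entry_date,
           AVG(consumption) OVER (
               PARTITION BY strftime('%H', entry_date)
               ORDER BY entry_date
               ROWS BETWEEN 3 PRECEDING AND 1 PRECEDING) AS avg_3d
    FROM my_data
    ORDER BY entry_date
""").fetchall()
print(rows)
```

The first reading has no history, so its average is NULL; the 24.10. row averages the 21.-23. readings (1, 2, 3) to 2.0.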
I think you can do this another way by using filters.
Select avg(consumption) from my_data
where
entry_date between #StartDate and #EndDate
and datepart(HOUR, entry_date)=#hour
If you're on MySQL
Select avg(consumption) from my_data
where
entry_date between #StartDate and #EndDate
and HOUR(entry_date)=#hour