I have two tables:
user journeys
id timestamp bus
1 00:10 12
1 16:10 12
2 14:00 23
bus
id timestamp price
12 00:00 1.3
12 00:10 1.5
12 00:20 1.7
12 18:00 2.0
13 00:00 3.0
My goal is to find how much each user spent on travel today.
In our case, the user took bus number 12 at 00:10 and paid 1.5, then took it again at 16:10, by which time the price had increased to 1.7. In total, this person paid 3.2 today. We always take the most recently updated price.
I've done this using a massive subquery and it looks inefficient. Does anyone have a slick solution?
Sample Data For Reproduction:
Please see http://sqlfiddle.com/#!17/10ad6/2
Or Build Schema:
drop table if exists journeys;
create table journeys(
id numeric,
timestamp timestamp without time zone,
bus numeric
);
truncate table journeys;
insert into journeys
values
(1, '2018-08-22 00:10:00', 12),
(1, '2018-08-22 16:10:00', 12),
(2, '2018-08-22 14:00:00', 23);
-- Bus Prices
drop table if exists bus;
create table bus (
bus_id int,
timestamp timestamp without time zone,
price numeric
);
truncate table bus;
insert into bus
values
(12, '2018-08-22 00:00:00', 1.3),
(12, '2018-08-22 00:10:00', 1.5),
(12, '2018-08-22 00:20:00', 1.7),
(12, '2018-08-22 18:00:00', 2.0),
(13, '2018-08-22 00:00:00', 3.0);
I don't know that this is faster than your solution (which you don't show). A correlated subquery seems like a reasonable solution.
But another method is:
SELECT j.*, b.price
FROM journeys j LEFT JOIN
(SELECT b.*, LEAD(timestamp) OVER (PARTITION BY bus_id ORDER BY timestamp) as next_timestamp
FROM bus b
) b
ON b.bus_id = j.bus AND
j.timestamp >= b.timestamp AND
(j.timestamp < b.next_timestamp OR b.next_timestamp IS NULL);
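For comparison, the correlated approach the first paragraph alludes to (which the question does not show) can be written compactly with a LATERAL subquery; this is a sketch against the PostgreSQL schema built above:

```sql
-- Sketch: for each journey, pick the latest price at or before the
-- journey time, then total per user.
SELECT j.id, SUM(p.price) AS total_spent
FROM journeys j
CROSS JOIN LATERAL (
    SELECT b.price
    FROM bus b
    WHERE b.bus_id = j.bus
      AND b.timestamp <= j.timestamp
    ORDER BY b.timestamp DESC
    LIMIT 1
) p
GROUP BY j.id;
```

Note that CROSS JOIN LATERAL drops journeys with no matching price row (user 2's bus 23 in the sample data); use LEFT JOIN LATERAL ... ON true if those should be kept with a NULL price.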
You may also do this using an inner join and windowing functions:
SELECT user_id, SUM(price)
FROM
(
SELECT user_id, journey_timestamp, bus_id, price_change_timestamp,
COALESCE(LEAD(price_change_timestamp) OVER(PARTITION BY bus_id ORDER BY price_change_timestamp), CAST('2100-01-01 00:00:00' AS TIMESTAMP)) AS next_price_timestamp, price
FROM
(
SELECT a.id AS user_id, a.timestamp AS journey_timestamp, a.bus AS bus_id, b.timestamp AS price_change_timestamp, b.price
FROM journeys a
INNER JOIN bus b
ON a.bus = b.bus_id
) a1
) a2
WHERE journey_timestamp >= price_change_timestamp AND journey_timestamp < next_price_timestamp
GROUP BY user_id
This is essentially what is happening:
1) The inner query joins the tables, so each journey is matched to every fare the bus has ever had.
2) The LEAD function, partitioned by bus_id and ordered by the times the fare changed, creates a "window" during which each fare is valid. The COALESCE is a workaround for the NULL that LEAD produces for the latest fare.
3) We keep only the rows where the journey timestamp falls within that "window", and total the fares per user with a GROUP BY.
I am trying to calculate the number of hours of operation per week for each facility in a region. The part I am struggling with is that there are multiple programs each day that overlap which contribute to the total hours.
Here is a sample of the table I am working with:
location  program  date      start_time  end_time
a         1        09-22-21  14:45:00    15:45:00
a         2        09-22-21  15:30:00    16:30:00
b         88       09-22-21  10:45:00    12:45:00
b         89       09-22-21  10:45:00    14:45:00
I am hoping to get:
location  hours of operation
a         1.75
b         4
I've tried using SUM with DATEDIFF and some WHERE clauses but couldn't get them to work. What I have found is how to identify the overlapping ranges (Detect overlapping date ranges from the same table), but not how to sum the differences to get the desired outcome of total non-overlapping hours of operation.
I believe you are trying to identify the total hours of operation for each location. Because some programs can overlap, you want to avoid counting the same time twice. To do this, I generate a tally table of every possible 15-minute increment in the day and then count the increments during which a program is operating.
Identify Total Hours of Operation per Date
DROP TABLE IF EXISTS #OperationSchedule
CREATE TABLE #OperationSchedule (ID INT IDENTITY(1,1),Location CHAR(1),Program INT,OpDate DATE,OpStart TIME(0),OpEnd TIME(0))
INSERT INTO #OperationSchedule
VALUES ('a',1,'09-22-21','14:45:00','15:45:00')
,('a',2,'09-22-21','15:30:00','16:30:00')
,('b',88,'09-22-21','10:45:00','12:45:00')
,('b',89,'09-22-21','10:45:00','14:45:00');
/*1 row per 15 minute increment in a day*/
;WITH cte_TimeIncrement AS (
SELECT StartTime = CAST('00:00' AS TIME(0))
UNION ALL
SELECT DATEADD(minute,15,StartTime)
FROM cte_TimeIncrement
WHERE StartTime < '23:45'
),
/*1 row per date in data*/
cte_DistinctDate AS (
SELECT OpDate
FROM #OperationSchedule
GROUP BY OpDate
),
/*Cross join to generate 1 row for each time increment*/
cte_DatetimeIncrement AS (
SELECT *
FROM cte_DistinctDate
CROSS JOIN cte_TimeIncrement
)
/*Join and count each time interval that has a match to identify times when location is operating*/
SELECT Location
,A.OpDate
,HoursOfOperation = CAST(COUNT(DISTINCT StartTime) * 15/60.0 AS Decimal(4,2))
FROM cte_DatetimeIncrement AS A
INNER JOIN #OperationSchedule AS B
ON A.OpDate = B.OpDate
AND A.StartTime >= B.OpStart
AND A.StartTime < B.OpEnd
GROUP BY Location,A.OpDate
Here is an alternative method that does not require rounding to the nearest 15-minute increment:
Declare #OperationSchedule table (
ID int Identity(1, 1)
, Location char(1)
, Program int
, OpDate date
, OpStart time(0)
, OpEnd time(0)
);
Insert Into @OperationSchedule (Location, Program, OpDate, OpStart, OpEnd)
Values ('a', 1, '09-22-21', '14:45:00', '15:45:00')
, ('a', 2, '09-22-21', '15:30:00', '16:30:00')
, ('b', 88, '09-22-21', '10:45:00', '12:45:00')
, ('b', 89, '09-22-21', '10:45:00', '14:45:00')
, ('c', 23, '09-22-21', '12:45:00', '13:45:00')
, ('c', 24, '09-22-21', '14:45:00', '15:15:00')
, ('3', 48, '09-22-21', '09:05:00', '13:55:00')
, ('3', 49, '09-22-21', '14:25:00', '15:38:00')
;
With overlappedData
As (
Select *
, overlap_op = lead(os.OpStart, 1, os.OpEnd) Over(Partition By os.Location Order By os.ID)
From @OperationSchedule os
)
Select od.Location
, start_date = min(od.OpStart)
, end_date = max(iif(od.OpEnd < od.overlap_op, od.OpEnd, od.overlap_op))
, hours_of_operation = sum(datediff(minute, od.OpStart, iif(od.OpEnd < od.overlap_op, od.OpEnd, od.overlap_op)) / 60.0)
From overlappedData od
Group By
od.Location;
I have monthly data in BigQuery in the following form:
CREATE TABLE IF NOT EXISTS spend (
id INT64,
created_at DATE,
value FLOAT64
);
INSERT INTO spend VALUES
(1, '2020-01-01', 100),
(2, '2020-02-01', 200),
(3, '2020-03-01', 100),
(4, '2020-04-01', 100),
(5, '2020-05-01', 50);
I would like a query to translate it into daily data in the following way:
One row per day.
The value of each day should be the monthly value divided by the number of days of the month.
What's the simplest way of doing this in BigQuery?
You can make use of GENERATE_DATE_ARRAY() to get an array between the desired dates (in your case, between 2020-01-01 and 2020-05-31) and build a calendar table, then divide the value of a given month among the days of that month :)
Try this and let me know if it works:
with calendar_table as (
select
calendar_date
from
unnest(generate_date_array('2020-01-01', '2020-05-31', interval 1 day)) as calendar_date
),
final as (
select
ct.calendar_date,
s.value,
s.value / extract(day from last_day(ct.calendar_date)) as daily_value
from
spend as s
cross join
calendar_table as ct
where
format_date('%Y-%m', date(ct.calendar_date)) = format_date('%Y-%m', date(s.created_at))
)
select * from final
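A variant of the same query that matches months with DATE_TRUNC instead of comparing formatted strings; this is a sketch that assumes created_at is always the first day of the month, as in the sample data:

```sql
-- Sketch: same calendar-table approach, joining on the truncated month.
with calendar_table as (
  select calendar_date
  from unnest(generate_date_array('2020-01-01', '2020-05-31')) as calendar_date
)
select
  ct.calendar_date,
  s.value / extract(day from last_day(ct.calendar_date)) as daily_value
from spend as s
join calendar_table as ct
  on date_trunc(ct.calendar_date, month) = s.created_at
order by ct.calendar_date
```

This avoids formatting dates as strings just to compare them, which also tends to be cheaper.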
My recommendation is to do this "locally". That is, run generate_date_array() for each row in the original table. This is much faster than a join across rows. BigQuery also makes this easy with the last_day() function:
select t.id, dt,
       t.value / extract(day from last_day(t.created_at)) as daily_value
from `table` t cross join
     unnest(generate_date_array(t.created_at,
                                last_day(t.created_at, month))) as dt;
I'm relatively new to working with PostgreSQL and I could use some help with this.
Suppose I have a table where forecasted values (let's say temperature) are stored, indicated by a dump_date_time. This dump_date_time is the date/time at which the values were stored in the table. Each temperature forecast is also indicated by the date_time to which the forecast corresponds. Let's say a forecast is published every 6 hours.
Example:
At 06:00 today the temperature for tomorrow at 16:00 is published and stored in the table. Then at 12:00 today the temperature for tomorrow at 16:00 is published and also stored in the table. I now have two forecasts for the same date_time (16:00 tomorrow) which are published at two different times (06:00 and 12:00 today), indicated by the dump_date_time.
All these values are stored in the same table, with three columns: dump_date_time, date_time and value. My goal is to SELECT from this table the difference between the temperatures of the two forecasts. How do I do this?
One option uses a join:
select date_time, t1.value - t2.value as value_diff
from mytable t1
inner join mytable t2 using (date_time)
where t1.dump_date_time = '2020-01-01 06:00:00'::timestamp
  and t2.dump_date_time = '2020-01-01 12:00:00'::timestamp;
Something like:
create table forecast(dump_date_time timestamptz, date_time timestamptz, value numeric)
insert into forecast values ('09/24/2020 06:00', '09/25/2020 16:00', 50), ('09/24/2020 12:00', '09/25/2020 16:00', 52);
select max(value) - min(value) from forecast where date_time = '09/25/2020 16:00';
?column?
----------
2
--Specifying dump_date_time range
select
max(value) - min(value)
from
forecast
where
date_time = '09/25/2020 16:00'
and
dump_date_time <@
tstzrange(current_date + interval '6 hours',
current_date + interval '12 hours', '[]');
?column?
----------
2
This is a very simple case. If you need something else you will need to provide more information.
UPDATE
Added an example that uses a timestamptz range to select dump_date_time within a range.
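If more than two forecasts accumulate for the same target date_time, a window-function sketch (same forecast table as above) shows the change from each forecast to the next, rather than just max minus min:

```sql
-- Sketch: difference between each forecast and the previous one
-- published for the same target date_time (PostgreSQL).
SELECT date_time,
       dump_date_time,
       value - LAG(value) OVER (PARTITION BY date_time
                                ORDER BY dump_date_time) AS diff_from_prev
FROM forecast
ORDER BY date_time, dump_date_time;
```

The LAG returns NULL for the first forecast of each date_time, which makes it easy to spot where each series starts.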
I need to calculate the day-1 retention by user registration date. Day-1 retention is defined as the number of users who return 1 day after the registration date divided by the number of users who registered on the registration date.
Here's the user table
CREATE TABLE registration (
user_id SERIAL PRIMARY KEY,
user_name VARCHAR(255) NOT NULL,
registrationDate TIMESTAMP NOT NULL
);
INSERT INTO registration (user_id, user_name, registrationDate)
VALUES
(0, 'John', '2018-01-01 00:01:00'),
(1, 'David', '2018-01-01 00:04:30'),
(2, 'Cassy', '2018-01-02 10:00:00'),
(3, 'Winka', '2018-01-02 14:30:00')
;
CREATE TABLE log (
user_id INTEGER,
eventDate TIMESTAMP
);
INSERT INTO log (user_id, eventDate)
VALUES
(0, '2018-01-01 01:00:00'),
(0, '2018-01-02 04:00:00'),
(0, '2018-01-04 06:00:00'),
(1, '2018-01-01 00:30:00'),
(3, '2018-01-02 14:40:00'),
(3, '2018-01-04 12:20:00'),
(3, '2018-01-06 13:30:00'),
(2, '2018-01-12 10:10:00'),
(2, '2018-01-13 09:00:00')
I tried to join the registration table to log table, so I can compare the date difference.
select registration.user_id, registrationDate, log.eventDate,
(log.eventDate - registration.registrationDate) as datediff
from log left join registration ON log.user_id = registration.user_id
I think I somehow need to perform below tasks.
select the users with datediff = 1 and count them.
I added a WHERE clause, but I'm getting an error saying column "datediff" does not exist:
where datediff = 1
do the Group By registrationDate.
This also gave me an error: "ERROR: column "registration.user_id" must appear in the GROUP BY clause or be used in an aggregate function"
I am new to SQL and learning it as I am solving the problem. Any help/advice will be appreciated
The expected outcome should return a table with two columns (registrationDate and retention) with rows for each date any user registered.
Day-1 retention is defined as the number of users who return 1 day after the registration date divided by the number of users who registered on the registration date.
This interprets the definition as being based on calendar days. I would express this as:
What ratio of users come back on the day after they register?
I think this is the simplest method:
select count(distinct l.user_id) * 1.0 / count(distinct r.user_id)
from registration r left join
log l
on l.user_id = r.user_id and
l.eventDate::date = r.registrationDate::date + interval '1 day';
The count(distinct) is only needed if multiple events can happen on a single day.
Here is a db<>fiddle.
I'm not sure the definition is 100% useful. If you have another definition in mind, I would suggest that you ask a new question, with appropriate sample data and desired results.
I am not quite sure if this is your expected result:
For registrationdate = 2018-01-01, both users logged an event within the first day, so the result is 1. For registrationdate = 2018-01-02, only one of the two users logged an event within this range, so the result is 0.5.
Step-by-step demo: db<>fiddle
SELECT
registrationdate,
COUNT(*) FILTER (WHERE is_in_one_day) / daily_regs::decimal -- 6
FROM (
SELECT DISTINCT ON (l.user_id) -- 4
l.user_id,
eventdate::date AS eventdate,
registrationdate::date AS registrationdate,
daily_regs,
eventdate - registrationdate < interval '1 day' AS is_in_one_day -- 3
FROM log l
JOIN ( -- 2
SELECT
*,
COUNT(user_id) OVER (PARTITION BY registrationdate::date) AS daily_regs --1
FROM
registration
) r
ON l.user_id = r.user_id
ORDER BY l.user_id, eventdate
) s
GROUP BY registrationdate, daily_regs -- 5
Count the total number of registrations per registration date. This can be done with a partitioned window function, which adds a column with the count.
Join both tables (with the one extra column on registration) on their user_id.
Calculate the difference between the current eventdate and the registrationdate, and check whether it is less than one day.
Do not count a user twice (it does not happen in your example data, but one user could log two events within this range; that user should not be counted twice).
Group by the date of registration.
Count all records where the difference is under one day (using the FILTER clause) and divide by the total number of registrations calculated in (1).
I have entities with config info in one table. If a vendor doesn't do something within reminder_days of the last time it did it, it becomes overdue.
CREATE TABLE t_vendors
(
vendor_id NUMBER,
vendor_name VARCHAR2 (250),
reminder_days NUMBER
);
Insert into T_VENDORS (vendor_id, vendor_name, reminder_days)
Values (12, 'sanity-test', 7);
and an app records what they do whenever they do it into this table with this sort of data:
CREATE TABLE t_vendor_events
(
vendor_event_id NUMBER,
vendor_id NUMBER (19,0),
description VARCHAR2 (250),
event_date DATE
);
Insert into t_vendor_events (vendor_event_id, vendor_id, description, event_date)
Values (10015, 12, 'one', TO_DATE('11/9/2015 21:22:55', 'MM/DD/YYYY HH24:MI:SS'));
Insert into t_vendor_events (vendor_event_id, vendor_id, description, event_date)
Values (10016, 12, 'two', TO_DATE('11/16/2015 21:23:55', 'MM/DD/YYYY HH24:MI:SS'));
Insert into t_vendor_events (vendor_event_id, vendor_id, description, event_date)
Values (10017, 12, 'three', TO_DATE('11/30/2015 21:24:55', 'MM/DD/YYYY HH24:MI:SS'));
Insert into t_vendor_events (vendor_event_id, vendor_id, description, event_date)
Values (10018, 12, 'four', TO_DATE('12/01/2015 21:25:55', 'MM/DD/YYYY HH24:MI:SS'));
Once I've got the comparative values, I need to aggregate the data to quantify the lateness:
how many events occurred
how often they were overdue
what was expected (the reminder days value)
how much they were late on average
how much they were late at worst (max)
I need to see all the vendors in the result, including those that failed to produce an event at all.
All the solutions I can think of involve adding extra columns and storing some kind of 'lateness' value on every event. That strikes me as redundant, since I already know the required interval (reminder_days), but I don't know what kind of nested selects would produce what I need.
I would prefer to stick to standard SQL and I'm not using PL/SQL, but I am able to use Oracle-specific syntax in selects where necessary.
The result would look something like this (Expected Days is the 'reminder days' column):
Vendor Event Overdue Expected Avg Max
Count Count Days Elapsed Elapsed
Mega1 5 2 10 12 20
Ole! 6 0 10 9 10
GoPunk 0 0 0 0 0
X-Dan 0 0 0 0 0
RetroB 1 1 30 60 60
You can use LAG to get the previous event_date and calculate the difference from the current event_date, then select the rows where the difference exceeds the vendor's reminder_days. Finally, aggregate that result to see how often each vendor was late.
with prev as
(select lag(event_date) over(partition by vendor_id order by event_date) prevdt
, t.* from t_vendor_events t)
select v.vendor_id, v.vendor_name, event_date - nvl(prevdt, event_date) diff
from prev p
join t_vendors v on p.vendor_id = v.vendor_id
where event_date - nvl(prevdt, event_date) > v.reminder_days
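The aggregation step the answer mentions but does not show could be sketched like this (Oracle syntax; the output column names mirror the expected result table and are assumptions):

```sql
-- Sketch: per-vendor summary, keeping vendors with no events via LEFT JOIN.
WITH gaps AS (
    SELECT e.vendor_id,
           e.event_date - LAG(e.event_date)
               OVER (PARTITION BY e.vendor_id ORDER BY e.event_date) AS gap_days
    FROM t_vendor_events e
)
SELECT v.vendor_name,
       COUNT(g.vendor_id)                                        AS event_count,
       COUNT(CASE WHEN g.gap_days > v.reminder_days THEN 1 END)  AS overdue_count,
       v.reminder_days                                           AS expected_days,
       NVL(AVG(CASE WHEN g.gap_days > v.reminder_days
                    THEN g.gap_days END), 0)                     AS avg_elapsed,
       NVL(MAX(CASE WHEN g.gap_days > v.reminder_days
                    THEN g.gap_days END), 0)                     AS max_elapsed
FROM t_vendors v
LEFT JOIN gaps g ON g.vendor_id = v.vendor_id
GROUP BY v.vendor_name, v.reminder_days;
```

The CASE expressions restrict the average and maximum to overdue gaps only, and the LEFT JOIN ensures vendors that never produced an event still appear with zero counts.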