How do I use calendar exceptions to generate accurate schedules using GTFS? - sql

I'm having trouble figuring out the GTFS query to obtain the next 20 schedules for a given stop ID and a given direction.
I know the stop ID, the trip direction ID, the time (now) and the date (today)
I wrote
SELECT DISTINCT ST.departure_time FROM stop_times ST
JOIN trips T ON T._id = ST.trip_id
JOIN calendar C ON C._id = T.service_id
JOIN calendar_dates CD on CD.service_id = T.service_id
WHERE ST.stop_id = 3377699724118483
AND T.direction_id = 0
AND ST.departure_time >= "16:00:00"
AND
(
( C.start_date <= 20140607 AND C.end_date >= 20140607 AND C.saturday= 1 ) // regular service today
AND ( ( CD.date != 20140607 ) // no exception today
OR ( CD.date = 20140607 AND CD.exception_type = 1 ) // or ADDED exception today
)
)
ORDER BY stopTimes.departure_time LIMIT 20
This results in no record being found.
If a remove the last part, dealgin with the CD tables (i.e. the removed or added exceptions), it works perfectly fine.
So I think I'm miswriting the check on the exceptions.
As written above with // comments, I want to check that
today is in a regular service (from checking the calendar table)
there is no removal exception for today (or in this case the trips corresponding to this service id are not included in the computation)
if there is added exception for today, the corresponding trips shall be included in the computation
can you help me with that ?

I'm fairly certain it's not possible to do what you're trying to do with only a single SELECT statement, due to the design of the calendar and calendar_dates tables.
What I do is use a second, inner query to build the set of active service IDs on the requested date, then join the outer query against this set to include only results relevant for that date. Try this:
SELECT DISTINCT ST.departure_time FROM stop_times ST
JOIN trips T ON T._id = ST.trip_id
JOIN (SELECT _id FROM calendar
WHERE start_date <= 20140607
AND end_date >= 20140607
AND saturday = 1
UNION
SELECT service_id FROM calendar_dates
WHERE date = 20140607
AND exception_type = 1
EXCEPT
SELECT service_id FROM calendar_dates
WHERE date = 20140607
AND exception_type = 2
) ASI ON ASI._id = T.service_id
WHERE ST.stop_id = 3377699724118483
AND T.direction_id = 0
AND ST.departure_time >= "16:00:00"
ORDER BY ST.departure_time
LIMIT 20

Related

SQL Rowwise comparison between groups

Question
The following is a snippet of my data:
Create Table Emps(person VARCHAR(50), started DATE, stopped DATE);
Insert Into Emps Values
('p1','2015-10-10','2016-10-10'),
('p1','2016-10-11','2017-10-11'),
('p1','2017-10-12','2018-10-13'),
('p2','2019-11-13','2019-11-13'),
('p2','2019-11-14','2020-10-14'),
('p3','2020-07-15','2021-08-15'),
('p3','2021-08-16','2022-08-16');
db<>fiddle.
I want to use T-SQL to get a count of how many persons fulfil the following criteria at least once - multiples should also count as one:
For a person:
One of the dates in 'started' (say s1) is larger than at least one of the dates in 'ended' (say e1)
s1 and e1 are in the same year, to be set manually - e.g. '2021-01-01' until '2022-01-01'
Example expected response
If I put the date range '2016-01-01' until '2017-01-01' somewhere in a WHERE / HAVING clause, the output should be 1 as only p1 has both a start date and an end date that fall in 2016 where the start date is larger than the end date:
s1 = '2016-10-11', and e1 = '2016-10-10'.
Why can't I do this myself
The reason I'm stuck is that I don't know how to do this rowwise comparison between groups. The question requires comparing values across columns (start with end) across rows, within a person ID.
Use conditional aggregation to get the maximum start date and the minimum stop date in the given range.
select person
from emps
group by person
having max(case when started >= '2016-01-01' and started < '2017-01-01'
then started end) >
min(case when stopped >= '2016-01-01' and stopped < '2017-01-01'
then stopped end);
Demo: https://dbfiddle.uk/?rdbms=sqlserver_2019&fiddle=45adb153fcac9ce72708f1283cac7833
I would choose to use a self-outer-join with an exists correlation, it should be pretty much the most performant, all things being equal.
select Count(*)
from emps e
where exists (
select * from emps e2
where e2.person = e.person
and e2.stopped > e.started
and e.started between '20160101' and '20170101'
and e2.started between '20160101' and '20170101'
);
You said you plan to set the dates manually, so this works where we set the start date in one CTE, and the end date in another CTE. Then we calculate the min/max for each, and use that criteria in the query where statement.
with min_max_start as (
select person,
min(started) as min_start, --obsolete
max(started) as max_start
from emps
where started >= '2016-01-01'
group by person
),
min_max_end as (
select person,
min(stopped) as min_stop,
max(stopped) as max_stop --obsolete
from emps
where stopped < '2017-01-01'
group by person
)
select count(distinct e.person)
from emps e
join min_max_start mms
on e.person = mms.person
join min_max_end mme
on e.person = mme.person
where mms.max_start> mme.min_stop
Output: 1
Try the following:
With CTE as
(
Select D.person, D.started, T.stopped,
case
when Year(D.started) = Year(T.stopped) and D.started > T.stopped
then 1
else 0
end as chk
From
(Select person, started From Emps Where started >= '2016-01-01') D
Join
(Select person, stopped From Emps Where stopped <= '2017-01-01') T
On D.person = T.person
)
Select Count(Distinct person) as CNT
From CTE
Where chk = 1;
To get the employee list who met the criteria use the following on the CTE instead of the above Select Count... query:
Select person, started, stopped
From CTE
Where chk = 1;
See a demo from db<>fiddle.

SUM with left outer join gets inflated result

The following query gives me the MRR (monthly recurring revenue) for my customer:
with dims as (
select distinct subscription_id, country_name, product_name from revenue
where site_id = '18XLsHIVSJg' and subscription_id is not null
)
select to_date('2022-07-01') as occurred_date,
count(distinct srm.subscription_id) as subscriptions,
count(distinct srm.receiver_contact) as subscribers,
sum(srm.baseline_mrr) as mrr_srm
from subscription_revenue_mart srm
join dims d on d.subscription_id = srm.subscription_id
where srm.site_id = '18XLsHIVSJg'
-- MRR as of the day before ie June 30th
and to_date(srm.creation_date) < '2022-07-01'
-- Counting the subscriptions active after July 1st
and ((srm.subscription_status = 'SUBL.A') or
-- Counting the subscriptions canceled/deactivated after July 1st
(srm.subscription_status = 'SUBL.C' and (srm.deactivation_date >= '2022-07-01') or (srm.canceled_date >= '2022-07-01')) ) group by 1;
I get a total of $5922.15 but I need to add data from another table to capture upgrades/downgrades a customer makes on a product subscription. Using the same approach as above, I can query my "change" table thusly:
select subscription_id, sum(mrr_change_amount) mrr_change_amount,max(subscription_event_date) subscription_event_date from subscription_revenue_mart_change srmc
where site_id = '18XLsHIVSJg'
and to_date(srmc.creation_date) < '2022-07-01'
and ((srmc.subscription_status = 'SUBL.A')
or (srmc.subscription_status = 'SUBL.C' and (srmc.deactivation_date >= '2022-07-01') or (srmc.canceled_date >= '2022-07-01')))
group by 1;
I get a total of $3635.47
When I combine both queries into one, I get an inflated result:
with dims as (
select distinct subscription_id, country_name, product_name from revenue
where site_id = '18XLsHIVSJg' and subscription_id is not null
),
change as (
select subscription_id, sum(mrr_change_amount) mrr_change_amount,
-- there can be multiple changes per subscription
max(subscription_event_date) subscription_event_date from subscription_revenue_mart_change srmc
where site_id = '18XLsHIVSJg'
and to_date(srmc.creation_date) < '2022-07-01'
and ((srmc.subscription_status = 'SUBL.A')
or (srmc.subscription_status = 'SUBL.C' and (srmc.deactivation_date >= '2022-07-01') or (srmc.canceled_date >= '2022-07-01')))
group by 1
)
select to_date('2022-07-01') as occurred_date,
count(distinct srm.subscription_id) as subscriptions,
count(distinct srm.receiver_contact) as subscribers,
-- See comment RE: LEFT OUTER join
sum(coalesce(c.mrr_change_amount,srm.baseline_mrr)) as mrr
from subscription_revenue_mart srm
join dims d
on d.subscription_id = srm.subscription_id
-- LEFT OUTER join required for customers that never made a change
left outer join change c
on srm.subscription_id = c.subscription_id
where srm.site_id = '18XLsHIVSJg'
and to_date(srm.creation_date) < '2022-07-01'
and ((srm.subscription_status = 'SUBL.A')
or (srm.subscription_status = 'SUBL.C' and (srm.deactivation_date >= '2022-07-01') or (srm.canceled_date >= '2022-07-01'))) group by 1;
It should be $9557.62 ie (5922.15 + $3635.47) but the query outputs $16116.91, which is wrong.
I think the explode-implode syndrome may cause this.
I had designed my "change" CTE to prevent this by aggregating all the relevant fields but it's not working.
Can someone provide pointers on the best way to work around this issue?
It would help if you gave us sample data too, but I see a problem here:
sum(coalesce(c.mrr_change_amount,srm.baseline_mrr)) as mrr
Why COALESCE? That will give you one of the 2 numbers, but I guess what you want is:
sum(ifnull(c.mrr_change_amount, 0) + srm.baseline_mrr) as mrr
That's the best I can offer with what you've given us.

Aggregate SQL Query, GROUP BY Causing Issues

Everything in this query works except for the second LEFT JOIN, where BEGIN_DATE and END_DATE are. Because I have to group by the additional columns, so they can be used in the "on join", I am getting false numbers. Is there any way to do this without having to group by. I hope this makes sense. Basically because I have to include BEGIN_DATE AND END_DATE in the group by, everything gets lost.
SELECT
to_char(T1.CALL_TIMESTAMP,'YYYY-IW') AS OMONTH
,COUNT(T1.HOUSE) AS NODECALLS
,T3.NODE_CODE
,T5.NODECUSTCOUNT
,T1.CALL_CATEGORY_LVL_3
,sum((CASE WHEN T1.TC_WIP_TRANSACTION_ID IS NOT NULL THEN 1 ELSE 0 END )) AS TC
,sum((CASE WHEN T1.TC_WIP_TRANSACTION_ID IS NOT NULL THEN 1 ELSE 0 END ))/nullif(COUNT(T1.HOUSE), 0) AS SVRATEPERCALL
,COUNT(T1.HOUSE)/ nullif(T5.NODECUSTCOUNT, 0) AS CALLRATE
FROM CVKOMNZP.NZKOMUSER.NFOV_INBD_REMEDY_CALL_DETAILS T1
LEFT JOIN
(
SELECT T2.NODE_CODE,T2.BEGIN_DATE,T2.END_DATE,T2.HOUSE,T2.CORP
FROM CVKOMNZP.NZKOMUSER.D_HOUSEHOLD_CH_HIST T2
) T3
ON T1.CORP = T3.CORP AND T1.HOUSE = T3.HOUSE AND (T1.CALL_TIMESTAMP BETWEEN T3.BEGIN_DATE AND T3.END_DATE)
LEFT JOIN
(
SELECT count(ADM_HOUSEHOLD_ID) AS NODECUSTCOUNT,NODE_CODE,BEGIN_DATE, END_DATE
FROM CVKOMNZP.NZKOMUSER.D_HOUSEHOLD_CH_HIST
WHERE HOUSE_STATUS_CODE = 2
AND END_DATE = '2999-12-31 00:00:00'
AND T1.CALL_TIMESTAMP BETWEEN BEGIN_DATE AND END_DATE
GROUP BY NODE_CODE,BEGIN_DATE,END_DATE
) T5
ON T5.NODE_CODE = T3.NODE_CODE AND T1.CALL_TIMESTAMP BETWEEN T5.BEGIN_DATE AND T5.END_DATE
WHERE T1.EXCLUSION_FLAG = 'N'
AND T1.CALL_TIMESTAMP >= To_Date ('07-29-2017', 'MM-DD-YYYY' ) AND T1.CALL_TIMESTAMP <= To_Date ('07-31-2017', 'MM-DD-YYYY' )
GROUP BY
to_char(T1.CALL_TIMESTAMP,'YYYY-IW')
,T3.NODE_CODE
,T5.NODECUSTCOUNT
,T1.CALL_CATEGORY_LVL_3
If I am understanding this right, you want to get a COUNT without grouping by BEGIN and END DATE. However, because your Subquery (2nd LEFT JOIN) needs to include the BEGIN and NEED, you do not know how to group without it.
If this is the case, you'll need a subquery for your count and JOIN it back to the same table.
FYI: Your T1.CALL_TIMESTAMP does not make sense in this subquery since you don't have a table called T1. I renamed it to "a". Feel free to change it to what you want.
See if this make sense
LEFT JOIN
(
SELECT a.BEGIN_DATE,
a.END_DATE,
node.NODECUSTCOUNT,
a.node_code
FROM CVKOMNZP.NZKOMUSER.D_HOUSEHOLD_CH_HIST a
/**Subquery to get a COUNT of all the Node based on NODE_CODE.
You link this back to your query above using the NODE CODE**/
JOIN ( SELECT count(ADM_HOUSEHOLD_ID) AS NODECUSTCOUNT,
NODE_CODE
FROM CVKOMNZP.NZKOMUSER.D_HOUSEHOLD_CH_HIST
GROUP BY NODE_CODE ) node on node.node_code = a.node_code
WHERE a.HOUSE_STATUS_CODE = 2
AND a.END_DATE = '2999-12-31 00:00:00'
AND a.CALL_TIMESTAMP BETWEEN BEGIN_DATE AND END_DATE
) ..JOIN THIS BACK TO YOUR MAIN TABLE

SQL Query to show order of work orders

First off sorry for the poor subject line.
EDIT: The Query here duplicates OrderNumbers I am needing the query to NOT duplicate OrderNumbers
EDIT: Shortened the question and provided a much cleaner question
I have a table that has a record of all of the work orders that have been performed. there are two types of orders. Installs and Trouble Calls. My query is to find all of the trouble calls that have taken place within 30 days of an install and match that trouble call (TC) to the proper Install (IN). So the Trouble Call date has to happen after the install but no more than 30 days after. Additionally if there are two installs and two trouble calls for the same account all within 30 days and they happen in order the results have to reflect that. The problem I am having is I am getting an Install order matching to two different Trouble Calls (TC) and a Trouble Call(TC) that is matching to two different Installs(IN)
In the example on SQL Fiddle pay close attention to the install order number 1234567810 and the Trouble Call order number 1234567890 and you will see the issue I am having.
http://sqlfiddle.com/#!3/811df/8
select b.accountnumber,
MAX(b.scheduleddate) as OriginalDate,
b.workordernumber as OriginalOrder,
b.jobtype as OriginalType,
MIN(a.scheduleddate) as NewDate,
a.workordernumber as NewOrder,
a.jobtype as NewType
from (
select workordernumber,accountnumber,jobtype,scheduleddate
from workorders
where jobtype = 'TC'
) a join
(
select workordernumber,accountnumber,jobtype,scheduleddate
from workorders
where jobtype = 'IN'
) b
on a.accountnumber = b.accountnumber
group by b.accountnumber,
b.scheduleddate,
b.workordernumber,
b.jobtype,
a.accountnumber,
a.scheduleddate,
a.workordernumber,
a.jobtype
having MIN(a.scheduleddate) > MAX(b.scheduleddate) and
DATEDIFF(day,MAX(b.scheduleddate),MIN(a.scheduleddate)) < 31
Example of what I am looking for the results to look like.
Thank you for any assistance you can provide in setting me on the correct path.
You were actually very close. I realized that what you really want is the MIN() TC date that is greater than each install date for that account number so long as they are 30 days or less apart.
So really you need to group by the install dates from your result set excluding WorkOrderNumbers still. Something like:
SELECT a.AccountNumber, MIN(a.scheduleddate) TCDate, b.scheduleddate INDate
FROM
(
SELECT WorkOrderNumber, ScheduledDate, JobType, AccountNumber
FROM workorders
WHERE JobType = 'TC'
) a
INNER JOIN
(
SELECT WorkOrderNumber, ScheduledDate, JobType, AccountNumber
FROM workorders
WHERE JobType = 'IN'
) b
ON a.AccountNumber = b.AccountNumber
WHERE b.ScheduledDate < a.ScheduledDate
AND DATEDIFF(DAY, b.ScheduledDate, a.ScheduledDate) <= 30
GROUP BY a.AccountNumber, b.AccountNumber, b.ScheduledDate
This takes care of the dates and AccountNumbers, but you still need the WorkOrderNumbers, so I joined the workorders table back twice, once for each type.
NOTE: I assume that each workorder has a unique date for each account number. So, if you have workorder 1 ('TC') for account 1 done on '1/1/2015' and you also have workorder 2 ('TC') for account 1 done on '1/1/2015' then I can't guarantee that you will have the correct WorkOrderNumber in your result set.
My final query looked like this:
SELECT
aggdata.AccountNumber, inst.workordernumber OriginalWorkOrderNumber, inst.JobType OriginalJobType, inst.ScheduledDate OriginalScheduledDate,
tc.WorkOrderNumber NewWorkOrderNumber, tc.JobType NewJobType, tc.ScheduledDate NewScheduledDate
FROM (
SELECT a.AccountNumber, MIN(a.scheduleddate) TCDate, b.scheduleddate INDate
FROM
(
SELECT WorkOrderNumber, ScheduledDate, JobType, AccountNumber
FROM workorders
WHERE JobType = 'TC'
) a
INNER JOIN
(
SELECT WorkOrderNumber, ScheduledDate, JobType, AccountNumber
FROM workorders
WHERE JobType = 'IN'
) b
ON a.AccountNumber = b.AccountNumber
WHERE b.ScheduledDate < a.ScheduledDate
AND DATEDIFF(DAY, b.ScheduledDate, a.ScheduledDate) <= 30
GROUP BY a.AccountNumber, b.AccountNumber, b.ScheduledDate
) aggdata
LEFT OUTER JOIN workorders tc
ON aggdata.TCDate = tc.ScheduledDate
AND aggdata.AccountNumber = tc.AccountNumber
AND tc.JobType = 'TC'
LEFT OUTER JOIN workorders inst
ON aggdata.INDate = inst.ScheduledDate
AND aggdata.AccountNumber = inst.AccountNumber
AND inst.JobType = 'IN'
select in1.accountnumber,
in1.scheduleddate as OriginalDate,
in1.workordernumber as OriginalOrder,
'IN' as OriginalType,
tc.scheduleddate as NewDate,
tc.workordernumber as NewOrder,
'TC' as NewType
from
workorders in1
out apply (Select min(in2.scheduleddate) as scheduleddate from workorders in2 Where in2.jobtype = 'IN' and in1.accountnumber=in2.accountnumber and in2.scheduleddate>in1.scheduleddate) ins
join workorders tc on tc.jobtype = 'TC' and tc.accountnumber=in1.accountnumber and tc.scheduleddate>in1.scheduleddate and (ins.scheduleddate is null or tc.scheduleddate<ins.scheduleddate) and DATEDIFF(day,in1.scheduleddate,tc.scheduleddate) < 31
Where in1.jobtype = 'IN'

How can I include in schedules today's departures after midnight using GTFS?

I began with GTFS and offhand ran into big problem with my SQL query:
SELECT *, ( some columns AS shortcuts )
FROM stop_times
LEFT JOIN trips ON stop_times.trip_id = trips.trip_id
WHERE trips.max_sequence != stop_times.stop_sequence
AND stop_id IN( $incodes )
AND trips.service_id IN ( $service_ids )
AND ( departure_time >= $time )
AND ( trips.end_time >= $time )
AND ( trips.start_time <= $time_plus_3hrs )
GROUP BY t,l,sm
ORDER BY t ASC, l DESC
LIMIT 14
This should show departures from some stop in next 3 hours.
It works but with approaching midnight (e.g. 23:50) it catch only "today's departure". After midnight it catch only "new day departures" and departures from previous day are missing, because they have departure_time e.g. "24:05" (=not bigger than $time 00:05).
Is possible to use something lighter than UNION same query for next day?
If UNION is using, how can I ORDER departures for trimming by LIMIT?
Trips.start_time and end_time are my auxiliary variables for accelerate SQL query execution, it means sequence1-arrival_time and MAXsequence-departure_time of any trip.
Using UNION to link together a query for each day is going to be your best bet, unless perhaps you want to issue two completely separate queries and then merge the results together in your application. The contortionism required to do all this with a single SELECT statement (assuming it's even possible) would not be worth the effort.
Part of the complexity here is that the set of active service IDs can vary between consecutive days, so a distinct set must be used for each one. (For a suggestion of how to build this set in SQL using a subquery and table join, see my answer to "How do I use calendar exceptions to generate accurate schedules using GTFS?".)
More complexity arises from the fact the results for each day must be treated differently: For the result set to be ordered correctly, we need to subtract twenty-four hours from all of (and only) yesterday's times.
Try a query like this, following the "pseudo-SQL" in your question and assuming you are using MySQL/MariaDB:
SELECT *, SUBTIME(departure_time, '24:00:00') AS t, ...
FROM stop_times
LEFT JOIN trips ON stop_times.trip_id = trips.trip_id
WHERE trips.max_sequence != stop_times.stop_sequence
AND stop_id IN ( $incodes )
AND trips.service_id IN ( $yesterdays_service_ids )
AND ( departure_time >= ADDTIME($time, '24:00:00') )
AND ( trips.end_time >= ADDTIME($time, '24:00:00') )
AND ( trips.start_time <= ADDTIME($time_plus_3hrs, '24:00:00') )
UNION
SELECT *, departure_time AS t, ...
FROM stop_times
LEFT JOIN trips ON stop_times.trip_id = trips.trip_id
WHERE trips.max_sequence != stop_times.stop_sequence
AND stop_id IN ( $incodes )
AND trips.service_id IN ( $todays_service_ids )
AND ( departure_time >= $time )
AND ( trips.end_time >= $time )
AND ( trips.start_time <= $time_plus_3hrs )
GROUP BY t, l, sm
ORDER BY t ASC, l DESC
LIMIT 14