Remove overlapping days And arrange dates in SQL - sql

I have a table with the following data
Start End
===== ===
12/21/2011 12/20/2012
05/05/2012 10/20/2013
12/21/2012 12/20/2013
12/21/2013 12/20/2014
12/21/2014 12/20/2015
And want to get the following results
Start End
===== ===
12/21/2011 05/04/2012
05/05/2012 10/20/2013
10/21/2013 12/20/2013
12/21/2013 12/20/2014
12/21/2014 12/20/2015
Any ideas on where to start? A lot of the reading I've done suggests I need to create an entry for every single day, remove the overlapping days, and then rebuild the date ranges accordingly. Is this the only way?

I think this kind of problem is better solved with a procedural approach, not least for the sake of readability. Nevertheless, for fun, I worked out an SQL statement that does the trick with the aid of the rownum pseudocolumn (in Oracle syntax, as I had no SQL Server database at hand).
Let's assume your table is called DATE_TABLE with columns START_DATE and END_DATE. Then the statement is as follows:
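-- Sort in an inline view first, then number the rows:
-- rownum is assigned before an ORDER BY in the same query block is applied.
select start_date,
coalesce(
(select case when tbl_inner.start_date < tbl_outer.end_date
then tbl_inner.start_date - 1
else tbl_outer.end_date end
from (select rownum row_num, start_date, end_date
from (select start_date, end_date from date_table order by start_date)) tbl_inner
where tbl_inner.row_num = tbl_outer.row_num + 1),
tbl_outer.end_date) end_date
from (select rownum row_num, start_date, end_date
from (select start_date, end_date from date_table order by start_date)) tbl_outer;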
The inner select provides the rows of the table DATE_TABLE with row numbers that can be referenced by the outer select. Without the COALESCE clause, the statement would not work for the last row in the DATE_TABLE.
I presume that the statement does not scale too well.
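The "end each range the day before the next one starts" rule from the statement above can be tried out with a small sketch. This is not the Oracle statement itself: it assumes SQLite (3.25 or later, for window functions) driven from Python, ISO-formatted dates stored as text, and a hypothetical helper named trim_overlaps. Note that on the question's sample data this rule trims the second row to 2012-12-20 instead of moving the third row's start to 10/21/2013, so it does not reproduce the desired output exactly.

```python
import sqlite3

# Sample rows from the question, stored as ISO-8601 text so that
# string order matches date order in SQLite.
SAMPLE_ROWS = [
    ("2011-12-21", "2012-12-20"),
    ("2012-05-05", "2013-10-20"),
    ("2012-12-21", "2013-12-20"),
    ("2013-12-21", "2014-12-20"),
    ("2014-12-21", "2015-12-20"),
]

TRIM_QUERY = """
WITH numbered AS (
    SELECT ROW_NUMBER() OVER (ORDER BY start_date) AS row_num,
           start_date, end_date
    FROM date_table
)
SELECT cur.start_date,
       CASE WHEN nxt.start_date < cur.end_date
            THEN date(nxt.start_date, '-1 day')  -- cut off before the next range
            ELSE cur.end_date                    -- no overlap, or last row (nxt is NULL)
       END AS end_date
FROM numbered AS cur
LEFT JOIN numbered AS nxt ON nxt.row_num = cur.row_num + 1
ORDER BY cur.start_date
"""


def trim_overlaps(rows):
    """Return (start, end) pairs with each end trimmed against the next start."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE date_table (start_date TEXT, end_date TEXT)")
    conn.executemany("INSERT INTO date_table VALUES (?, ?)", rows)
    result = conn.execute(TRIM_QUERY).fetchall()
    conn.close()
    return result


for start, end in trim_overlaps(SAMPLE_ROWS):
    print(start, end)
```

The LEFT JOIN plays the role of the COALESCE in the original statement: the last row has no successor, so the CASE falls through to its own end date.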

Related

Where clause in a calculation

Say I have this table:
month num_of_fruits harvested
===== ============= =========
2022-01-01 133 3
2022-02-01 145 12
2022-03-01 123 5
2022-04-01 111 4
2022-05-01 164 9
.. .. ..
I want to add a new column called lost, based on the month and num_of_fruits columns. It is calculated as harvested - (num_of_fruits - num_of_fruits(last_month)).
I'm having trouble with the part in parentheses: getting last month's num_of_fruits. I have this to start:
select
id,
"month",
num_of_fruits,
harvested,
harvested - (num_of_fruits - num_of_fruits WHERE date_trunc('month', "month" - interval '1' month)) as lost,
selecting other columns..
It's giving me an error in the where clause.
Can you have a where clause inside a select list? How would I take last month's num_of_fruits and subtract it from this month's num_of_fruits, all inside the select statement?
Any help or advice will greatly help me! Thank you so much in advance!
If you want to check other rows in the table, you will likely want either a subquery in your SELECT or to join the table to itself.
I think you are probably trying to do:
SELECT
harvested - (num_of_fruits - (SELECT num_of_fruits FROM mytable t2 WHERE t2.month = date_trunc('month', t1."month" - interval '1' month))) as lost
FROM mytable t1
Note that I created a whole new subquery (SELECT/FROM/WHERE) within your existing SELECT statement, instead of just adding a stray WHERE clause.
I also changed your condition so that it actually compares the result of date_trunc with something.
It's not clear to me that you actually need date_trunc here (and, if you do, you might want it on both sides of the comparison), but you can use the basic idea above and fix the condition to match your needs.
An alternative (joining to self) to consider might be:
SELECT
t1.harvested - (t1.num_of_fruits - t2.num_of_fruits) as lost
FROM mytable t1 LEFT OUTER JOIN mytable t2
ON t2.month = date_trunc('month', t1."month" - interval '1' month)
If you know that you always have one row per month, so the previous row (ordered by month) is also the previous month, you could just use LAG:
SELECT
harvested - (num_of_fruits - LAG(num_of_fruits, 1) OVER (ORDER BY month)) as lost
FROM mytable
LAG(num_of_fruits, 1) OVER (ORDER BY month) means "the num_of_fruits from the previous row in the table when the table is ordered by month".
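To see what the LAG version returns, here is a minimal sketch that runs it against the sample rows from the question. It assumes SQLite (3.25 or later) via Python in place of Postgres, and the table name mytable and the helper name lost_per_month are only placeholders. The first month has no previous row, so LAG yields NULL and lost is NULL (None in Python) for it.

```python
import sqlite3

# Sample data from the question: (month, num_of_fruits, harvested).
FRUIT_ROWS = [
    ("2022-01-01", 133, 3),
    ("2022-02-01", 145, 12),
    ("2022-03-01", 123, 5),
    ("2022-04-01", 111, 4),
    ("2022-05-01", 164, 9),
]

LOST_QUERY = """
SELECT month,
       harvested - (num_of_fruits
                    - LAG(num_of_fruits, 1) OVER (ORDER BY month)) AS lost
FROM mytable
ORDER BY month
"""


def lost_per_month(rows):
    """Return (month, lost) pairs; lost is None when there is no previous month."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE mytable (month TEXT, num_of_fruits INTEGER, harvested INTEGER)"
    )
    conn.executemany("INSERT INTO mytable VALUES (?, ?, ?)", rows)
    result = conn.execute(LOST_QUERY).fetchall()
    conn.close()
    return result


for month, lost in lost_per_month(FRUIT_ROWS):
    print(month, lost)
```

If the first month should count as "nothing lost" instead of NULL, LAG's third argument (a default value) or a COALESCE around the expression could supply one; which default is right depends on the data.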

sql query to get today new records compared with yesterday

I have this table:
COD (Integer) (PK)
ID (Varchar)
DATE (Date)
I just want to get today's new IDs compared with yesterday (the IDs from today that were not present yesterday).
This needs to be done with a single query and be as efficient as possible, because the table will have 4-5 million records.
As a Java developer I can do this with two queries, but doing it with just one is beyond my knowledge, so any help would be much appreciated.
EDIT: the date format is dd/mm/yyyy, and each ID may appear 0 or 1 times per day.
Here is a solution that will go over the base data one time only. It selects the id and the date where the date is either yesterday or today (or both). Then it GROUPS BY id - each group will have either one or two rows. Then it filters by the condition that the MIN date in the group is "today". Those are the id's that exist today but did not exist yesterday.
DATE is an Oracle keyword, best not used as a column name. I changed that to DT. I also assume that your "dt" field is a pure date (as pure as it can be in Oracle, meaning: time of day, which is always present, is 00:00:00).
select id
from your_table
where dt in (trunc(sysdate), trunc(sysdate) - 1)
group by id
having min(dt) = trunc(sysdate)
;
Edit: Gordon makes a good point: perhaps you may have more than one such row per ID, in the same day? In that case the time-of-day may also be different from 00:00:00.
If so, the solution can be adapted:
select id
from your_table
where dt >= trunc(sysdate) - 1 and dt < trunc(sysdate) + 1
group by id
having min(dt) >= trunc(sysdate)
;
Either way: (1) the base table is read just once; (2) the column DT is not wrapped within any function, so if there is an index on that column, it can be used to access just the needed rows.
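The GROUP BY / HAVING MIN idea can be checked on a few rows with a short sketch. It assumes SQLite via Python, not Oracle, dates stored as ISO text, and "today" and "yesterday" passed in as parameters in place of trunc(sysdate); the helper name new_ids_today is hypothetical.

```python
import sqlite3


def new_ids_today(rows, today, yesterday):
    """Return ids that have a row for `today` but none for `yesterday`.

    rows: iterable of (id, dt) with dt as 'YYYY-MM-DD' text.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE your_table (cod INTEGER PRIMARY KEY, id TEXT, dt TEXT)"
    )
    conn.executemany("INSERT INTO your_table (id, dt) VALUES (?, ?)", rows)
    result = conn.execute(
        """
        SELECT id
        FROM your_table
        WHERE dt IN (?, ?)       -- only yesterday's and today's rows
        GROUP BY id
        HAVING MIN(dt) = ?       -- earliest of those rows is today
        ORDER BY id
        """,
        (today, yesterday, today),
    ).fetchall()
    conn.close()
    return [row[0] for row in result]


SAMPLE = [
    ("A", "2024-03-09"), ("A", "2024-03-10"),  # present on both days
    ("B", "2024-03-10"),                       # new today
    ("C", "2024-03-09"),                       # only yesterday
    ("D", "2024-03-08"), ("D", "2024-03-10"),  # older row is ignored
]
print(new_ids_today(SAMPLE, "2024-03-10", "2024-03-09"))
```

An id that appeared two days ago but not yesterday (D in the sample) still counts as new, because only the two most recent days are compared.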
The typical method would use not exists:
select t.*
from t
where t.date >= trunc(sysdate) and t.date < trunc(sysdate + 1) and
not exists (select 1
from t t2
where t2.id = t.id and
t2.date >= trunc(sysdate - 1) and t2.date < trunc(sysdate)
);
This is a general solution. If you know that there is at most one record per day, there are better solutions, such as using lag().
Use MINUS. I suppose your date column has a time part, so you need to truncate it.
select id from mytable where trunc(date) = trunc(sysdate)
minus
select id from mytable where trunc(date) = trunc(sysdate) - 1;
I suggest the following function-based index. Without it, the query would have to do a full scan of the table, which would probably be quite slow.
create index idx on mytable (trunc(date), id);

Given a single column of effective dates, is there a SQL statement that can transform that into date ranges?

Similar to another question I've posted, given the following table...
Promo EffectiveDate
------ -------------
PromoA 1/1/2016
PromoB 4/1/2016
PromoC 7/1/2016
PromoD 10/1/2016
PromoE 1/1/2017
What is the easiest way to transform it into start and end dates, like so...
Promo StartDate EndDate
------ --------- ---------
PromoA 1/1/2016 4/1/2016
PromoB 4/1/2016 7/1/2016
PromoC 7/1/2016 10/1/2016
PromoD 10/1/2016 1/1/2017
PromoE 1/1/2017 null (ongoing until a new Effective Date is added)
Update
Correlated queries seem to be the simplest solution, but as I understand it, they are extremely inefficient since the subquery has to run once per row of the outer select.
What I was thinking as a potential solution was something along the lines of selecting the values from the table a second time, but eliminating the first result, then pairing them up with the first select by ordinal index with a simple outer left join.
As an example, substituting letters for dates above, the first select would be like A,B,C,D,E and second would be B,C,D,E (which is the first select minus the first record 'A') then pairing them up by ordinal index with a simple outer left join, resulting in A-B, B-C, C-D, D-E, E-null. However I couldn't figure out the syntax to make that work.
A correlated sub-query can lookup the additional field you need.
SELECT
yourTable.*,
(
SELECT MIN(lookup.EffectiveDate)
FROM yourTable AS lookup
WHERE lookup.EffectiveDate > yourTable.EffectiveDate
) AS EndDate
FROM
yourTable
EDIT
The notion of "has to run once per row" is a misunderstanding of how SQL generates the execution plan that actually runs. The same could be said of joining one table to another: the join has to be evaluated at least once per row. There is indeed a larger cost to a correlated sub-query, but with appropriate indexes it won't be "extremely high", and the functionality described does warrant it.
If you had another field that was guaranteed to be sequential, then it would be trivial, but do not try to re-use the existing Promo field for that additional purpose.
SELECT
this.*,
next.EffectiveDate AS EndDate
FROM
yourTable this
LEFT JOIN
yourTable next
ON next.sequential_id = this.sequential_id + 1
Yes, you can use a correlated query with LIMIT :
SELECT t.promo,t.effectiveDate as start_date,
(SELECT s.effectiveDate FROM YourTable s
WHERE s.effectiveDate > t.effectiveDate
ORDER BY s.effectiveDate
LIMIT 1) as end_date
FROM YourTable t
EDIT: Here is a solution with a join :
SELECT t.promo,t.effectiveDate as start_date,
MIN(s.effectiveDate) as end_date
FROM YourTable t
LEFT JOIN YourTable s
ON(t.effectiveDate < s.effectiveDate)
GROUP BY t.promo,t.effectiveDate
Try this, using a subquery:
select
p.promo,
p.EffectiveDate as "Start",
(select n.EffectiveDate from table_promo n where n.EffectiveDate >
p.EffectiveDate order by n.EffectiveDate limit 1) as "End"
from table_promo p
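The correlated-subquery answers above all boil down to "the end date is the smallest effective date later than this row's". A minimal sketch of that, assuming SQLite via Python, ISO-formatted dates stored as text, and placeholder names (table promos, column effective_date, helper promo_ranges):

```python
import sqlite3

# Sample data from the question, with dates rewritten as ISO text.
PROMO_ROWS = [
    ("PromoA", "2016-01-01"),
    ("PromoB", "2016-04-01"),
    ("PromoC", "2016-07-01"),
    ("PromoD", "2016-10-01"),
    ("PromoE", "2017-01-01"),
]

RANGE_QUERY = """
SELECT p.promo,
       p.effective_date AS start_date,
       (SELECT MIN(n.effective_date)
        FROM promos AS n
        WHERE n.effective_date > p.effective_date) AS end_date
FROM promos AS p
ORDER BY p.effective_date
"""


def promo_ranges(rows):
    """Return (promo, start_date, end_date); end_date is None for the latest promo."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE promos (promo TEXT, effective_date TEXT)")
    # An index on the date column lets the subquery seek instead of scanning.
    conn.execute("CREATE INDEX promos_effective_date ON promos (effective_date)")
    conn.executemany("INSERT INTO promos VALUES (?, ?)", rows)
    result = conn.execute(RANGE_QUERY).fetchall()
    conn.close()
    return result


for promo, start, end in promo_ranges(PROMO_ROWS):
    print(promo, start, end)
```

MIN over an empty set is NULL, which gives the open-ended last range without any extra handling.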

Get average interval between pairs of rows in a table

I have a table with the following data (paypal transactions):
txn_type | date | subscription_id
----------------+----------------------------+---------------------
subscr_signup | 2014-01-01 07:53:20 | S-XXX01
subscr_signup | 2014-01-05 10:37:26 | S-XXX02
subscr_signup | 2014-01-08 08:54:00 | S-XXX03
subscr_eot | 2014-03-01 08:53:57 | S-XXX01
subscr_eot | 2014-03-05 08:58:02 | S-XXX02
I want to get the average subscription length overall for a given time period (subscr_eot is the end of a subscription). A subscription that is still ongoing ('S-XXX03') should be included in the average, counted from its start date until now.
How would I go about doing this with an SQL statement in Postgres?
SQL Fiddle. Subscription length for each subscription:
select
subscription_id,
coalesce(t2.date, current_timestamp) - t1.date as subscription_length
from
(
select *
from t
where txn_type = 'subscr_signup'
) t1
left join
(
select *
from t
where txn_type = 'subscr_eot'
) t2 using (subscription_id)
order by t1.subscription_id
The average:
select
avg(coalesce(t2.date, current_timestamp) - t1.date) as subscription_length_avg
from
(
select *
from t
where txn_type = 'subscr_signup'
) t1
left join
(
select *
from t
where txn_type = 'subscr_eot'
) t2 using (subscription_id)
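The left-join-plus-coalesce pattern above can be exercised with a small sketch. It assumes SQLite via Python, not Postgres, so the interval arithmetic is replaced by julianday() differences (fractional days), the timestamp column is called ts, and "now" is passed in as a parameter to keep the result reproducible. The helper name average_subscription_days is hypothetical.

```python
import sqlite3

AVG_QUERY = """
SELECT AVG(julianday(COALESCE(e.ts, :now)) - julianday(s.ts))
FROM t AS s
LEFT JOIN t AS e
       ON e.subscription_id = s.subscription_id
      AND e.txn_type = 'subscr_eot'
WHERE s.txn_type = 'subscr_signup'
"""


def average_subscription_days(rows, now):
    """Average length in days; ongoing subscriptions are counted up to `now`.

    rows: iterable of (txn_type, ts, subscription_id), ts as 'YYYY-MM-DD HH:MM:SS'.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (txn_type TEXT, ts TEXT, subscription_id TEXT)")
    conn.executemany("INSERT INTO t VALUES (?, ?, ?)", rows)
    (avg_days,) = conn.execute(AVG_QUERY, {"now": now}).fetchone()
    conn.close()
    return avg_days


# Sample data from the question.
PAYPAL_ROWS = [
    ("subscr_signup", "2014-01-01 07:53:20", "S-XXX01"),
    ("subscr_signup", "2014-01-05 10:37:26", "S-XXX02"),
    ("subscr_signup", "2014-01-08 08:54:00", "S-XXX03"),
    ("subscr_eot", "2014-03-01 08:53:57", "S-XXX01"),
    ("subscr_eot", "2014-03-05 08:58:02", "S-XXX02"),
]
print(average_subscription_days(PAYPAL_ROWS, now="2014-03-10 00:00:00"))
```

Restricting the average to "a given time period" would mean adding a condition on s.ts (or on the end timestamp) to the WHERE clause; which one is meant is not stated in the question.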
I used a couple of common table expressions; you can take the pieces apart pretty easily to see what they do.
One of the reasons this SQL is complicated is because you're storing column names as data. (subscr_signup and subscr_eot are actually column names, not data.) This is a SQL anti-pattern; expect it to cause you much pain.
with subscription_dates as (
select
p1.subscription_id,
p1.date as subscr_start,
coalesce((select min(p2.date)
from paypal_transactions p2
where p2.subscription_id = p1.subscription_id
and p2.txn_type = 'subscr_eot'
and p2.date > p1.date), current_date) as subscr_end
from paypal_transactions p1
where txn_type = 'subscr_signup'
), subscription_days as (
select subscription_id, subscr_start, subscr_end, (subscr_end - subscr_start) + 1 as subscr_days
from subscription_dates
)
select avg(subscr_days) as avg_days
from subscription_days
-- add your date range here.
avg_days
--
75.6666666666666667
I didn't add your date range as a WHERE clause, because it's not clear to me what you mean by "a given time period".
Using the window function lag(), this becomes considerably shorter:
SELECT avg(ts_end - ts) AS avg_subscr
FROM (
SELECT txn_type, ts, lag(ts, 1, localtimestamp)
OVER (PARTITION BY subscription_id ORDER BY txn_type) AS ts_end
FROM t
) sub
WHERE txn_type = 'subscr_signup';
SQL Fiddle.
lag() conveniently takes a default value for missing rows. Exactly what we need here, so we don't need COALESCE in addition.
The query builds on the fact that subscr_eot sorts before subscr_signup.
Probably faster than presented alternatives so far because it only needs a single sequential scan - even though the window functions add some cost.
Using the column ts instead of date for three reasons:
Your "date" is actually a timestamp.
"date" is a reserved word in standard SQL (even if it's allowed in Postgres).
Never use basic type names as identifiers.
Using localtimestamp instead of now() or current_timestamp since you are obviously operating with timestamp [without time zone].
Also, your columns txn_type and subscription_id should not be text.
Maybe an enum for txn_type and integer for subscription_id. That would make table and indexes considerably smaller and faster.
For the query at hand, the whole table has to be read and indexes won't help - except for a covering index in Postgres 9.2+, if you need the read performance:
CREATE INDEX t_foo_idx ON t (subscription_id, txn_type, ts);

Calculating working days including holidays between dates without a calendar table in oracle SQL

Okay, so I've done quite a lot of reading on the possibility of emulating the networkdays function of excel in sql, and have come to the conclusion that by far the easiest solution is to have a calendar table which will flag working days or non working days. However, due to circumstances out of my control, we don't have access to such a luxury and it's unlikely that we will any time in the near future.
Currently I have managed to bodge together what is undoubtedly a horribly inefficient query in SQL that does work - the catch is, it will only work for a single client record at a time.
SELECT O_ASSESSMENTS.ASM_ID,
O_ASSESSMENTS.ASM_START_DATE,
O_ASSESSMENTS.ASM_END_DATE,
sum(CASE
When TO_CHAR(O_ASSESSMENTS.ASM_START_DATE + rownum -1,'Day')
= 'Sunday ' THEN 0
When TO_CHAR(O_ASSESSMENTS.ASM_START_DATE + rownum -1,'Day')
= 'Saturday ' THEN 0
WHEN O_ASSESSMENTS.ASM_START_DATE + rownum - 1
IN ('03-01-2000','21-04-2000','24-04-2000','01-05-2000','29-05-2000','28-08-2000','25-12-2000','26-12-2000','01-01-2001','13-04-2001','16-04-2001','07-05-2001','28-05-2001','27-08-2001','25-12-2001','26-12-2001','01-01-2002','29-03-2002','01-04-2002','06-04-2002','03-06-2002','04-06-2002','26-08-2002','25-12-2002','26-12-2002','01-01-2003','18-04-2003','21-04-2003','05-05-2003','26-05-2003','25-08-2003','25-12-2003','26-12-2003','01-01-2004','09-04-2004','12-04-2004','03-05-2004','31-05-2004','30-08-2004','25-12-2004','26-12-2004','27-12-2004','28-12-2004','01-01-2005','03-01-2005','25-03-2005','28-03-2005','02-05-2005','30-05-2005','29-08-2005','27-12-2005','28-12-2005','02-01-2006','14-04-2006','17-04-2006','01-05-2006','29-05-2006','28-08-2006','25-12-2006','26-12-2006','02-01-2007','06-04-2007','09-04-2007','07-05-2007','28-05-2007','27-08-2007','25-12-2007','26-12-2007','01-01-2008','21-03-2008','24-03-2008','05-05-2008','26-05-2008','25-08-2008','25-12-2008','26-12-2008','01-01-2009','10-04-2009','13-04-2009','04-05-2009','25-05-2009','31-08-2009','25-12-2009','28-12-2009','01-01-2010','02-04-2010','05-04-2010','03-05-2010','31-05-2010','30-08-2010','24-12-2010','27-12-2010','28-12-2010','31-12-2010','03-01-2011','22-04-2011','25-04-2011','29-04-2011','02-05-2011','30-05-2011','29-08-2011','26-12-2011','27-12-2011')
THEN 0
ELSE 1
END)-1 AS Week_Day
From O_ASSESSMENTS,
ALL_OBJECTS
WHERE O_ASSESSMENTS.ASM_QSA_ID IN ('TYPE1')
AND O_ASSESSMENTS.ASM_END_DATE >= '01/01/2012'
AND O_ASSESSMENTS.ASM_ID = 'A00000'
AND ROWNUM <= O_ASSESSMENTS.ASM_END_DATE-O_ASSESSMENTS.ASM_START_DATE+1
GROUP BY
O_ASSESSMENTS.ASM_ID,
O_ASSESSMENTS.ASM_START_DATE,
O_ASSESSMENTS.ASM_END_DATE
Basically, I'm wondering if a) I should stop wasting my time on this or b) is it possible to get this to work for multiple clients? Any pointers appreciated thanks!
Edit: Further clarification - I already work out timescales using excel, but it would be ideal if we could do it in the report as the report in question is something that we would like end users to be able to run without any further manipulation.
Edit:
MarkBannister's answer works perfectly albeit slowly (though I had expected as much given it's not the preferred solution) - the challenge now lies in me integrating this into an existing report!
with
calendar_cte as (select
to_date('01-01-2000')+level-1 calendar_date,
case when to_char(to_date('01-01-2000')+level-1, 'day') in ('sunday ','saturday ') then 0 when to_date('01-01-2000')+level-1 in ('03-01-2000','21-04-2000','24-04-2000','01-05-2000','29-05-2000','28-08-2000','25-12-2000','26-12-2000','01-01-2001','13-04-2001','16-04-2001','07-05-2001','28-05-2001','27-08-2001','25-12-2001','26-12-2001','01-01-2002','29-03-2002','01-04-2002','06-04-2002','03-06-2002','04-06-2002','26-08-2002','25-12-2002','26-12-2002','01-01-2003','18-04-2003','21-04-2003','05-05-2003','26-05-2003','25-08-2003','25-12-2003','26-12-2003','01-01-2004','09-04-2004','12-04-2004','03-05-2004','31-05-2004','30-08-2004','25-12-2004','26-12-2004','27-12-2004','28-12-2004','01-01-2005','03-01-2005','25-03-2005','28-03-2005','02-05-2005','30-05-2005','29-08-2005','27-12-2005','28-12-2005','02-01-2006','14-04-2006','17-04-2006','01-05-2006','29-05-2006','28-08-2006','25-12-2006','26-12-2006','02-01-2007','06-04-2007','09-04-2007','07-05-2007','28-05-2007','27-08-2007','25-12-2007','26-12-2007','01-01-2008','21-03-2008','24-03-2008','05-05-2008','26-05-2008','25-08-2008','25-12-2008','26-12-2008','01-01-2009','10-04-2009','13-04-2009','04-05-2009','25-05-2009','31-08-2009','25-12-2009','28-12-2009','01-01-2010','02-04-2010','05-04-2010','03-05-2010','31-05-2010','30-08-2010','24-12-2010','27-12-2010','28-12-2010','31-12-2010','03-01-2011','22-04-2011','25-04-2011','29-04-2011','02-05-2011','30-05-2011','29-08-2011','26-12-2011','27-12-2011','01-01-2012','02-01-2012') then 0 else 1 end working_day
from dual
connect by level <= 1825 + sysdate - to_date('01-01-2000') )
SELECT
a.ASM_ID,
a.ASM_START_DATE,
a.ASM_END_DATE,
sum(c.working_day)-1 AS Week_Day
From
O_ASSESSMENTS a
join calendar_cte c
on c.calendar_date between a.ASM_START_DATE and a.ASM_END_DATE
WHERE a.ASM_QSA_ID IN ('TYPE1')
and a.ASM_END_DATE >= '01/01/2012'
GROUP BY
a.ASM_ID,
a.ASM_START_DATE,
a.ASM_END_DATE
There are a few ways to do this. Perhaps the simplest might be to create a CTE that produces a virtual calendar table, based on Oracle's connect by syntax, and then join it to the Assessments table, like so:
with calendar_cte as (
select to_date('01-01-2000')+level-1 calendar_date,
case when to_char(to_date('01-01-2000')+level-1, 'Day')
in ('Sunday ','Saturday ') then 0
when to_date('01-01-2000')+level-1
in ('03-01-2000','21-04-2000','24-04-2000','01-05-2000','29-05-2000','28-08-2000','25-12-2000','26-12-2000','01-01-2001','13-04-2001','16-04-2001','07-05-2001','28-05-2001','27-08-2001','25-12-2001','26-12-2001','01-01-2002','29-03-2002','01-04-2002','06-04-2002','03-06-2002','04-06-2002','26-08-2002','25-12-2002','26-12-2002','01-01-2003','18-04-2003','21-04-2003','05-05-2003','26-05-2003','25-08-2003','25-12-2003','26-12-2003','01-01-2004','09-04-2004','12-04-2004','03-05-2004','31-05-2004','30-08-2004','25-12-2004','26-12-2004','27-12-2004','28-12-2004','01-01-2005','03-01-2005','25-03-2005','28-03-2005','02-05-2005','30-05-2005','29-08-2005','27-12-2005','28-12-2005','02-01-2006','14-04-2006','17-04-2006','01-05-2006','29-05-2006','28-08-2006','25-12-2006','26-12-2006','02-01-2007','06-04-2007','09-04-2007','07-05-2007','28-05-2007','27-08-2007','25-12-2007','26-12-2007','01-01-2008','21-03-2008','24-03-2008','05-05-2008','26-05-2008','25-08-2008','25-12-2008','26-12-2008','01-01-2009','10-04-2009','13-04-2009','04-05-2009','25-05-2009','31-08-2009','25-12-2009','28-12-2009','01-01-2010','02-04-2010','05-04-2010','03-05-2010','31-05-2010','30-08-2010','24-12-2010','27-12-2010','28-12-2010','31-12-2010','03-01-2011','22-04-2011','25-04-2011','29-04-2011','02-05-2011','30-05-2011','29-08-2011','26-12-2011','27-12-2011')
then 0
else 1
end working_day
from dual
connect by level <= 36525 + sysdate - to_date('01-01-2000') )
SELECT a.ASM_ID,
a.ASM_START_DATE,
a.ASM_END_DATE,
sum(c.working_day) AS Week_Day
From O_ASSESSMENTS a
join calendar_cte c
on c.calendar_date between a.ASM_START_DATE and a.ASM_END_DATE
WHERE a.ASM_QSA_ID IN ('TYPE1') and
a.ASM_END_DATE >= '01/01/2012' -- and a.ASM_ID = 'A00000'
GROUP BY
a.ASM_ID,
a.ASM_START_DATE,
a.ASM_END_DATE
This will produce a virtual table populated with dates from 01 January 2000 to roughly 100 years (36525 days) after the current date, with all weekends marked as non-working days and all days specified in the second in clause (i.e. up to 27 December 2011) also marked as non-working days. Use a smaller offset (for example 3653 for about 10 years) if you don't need to look that far ahead.
The drawback of this method (or any method where the holiday dates are hardcoded into the query) is that each time new holiday dates are defined, every single query that uses this approach will have to have those dates added.
If you can't use a calendar table in Oracle, you might be better off exporting to Excel. Brute force always works.
Networkdays() "returns the number of whole working days between start_date and end_date. Working days exclude weekends and any dates identified in holidays."
Excluding weekends seems fairly straightforward. Every 7-day period will contain two weekend days. You'll just need to take some care with the leftover days.
Holidays are a different story. You have to either store them or pass them as an argument. If you could store them, you'd store them in a calendar table, and your problem would be over. But you can't do that.
So you're looking at passing them as an argument. Off the top of my head--and I haven't had any tea yet this morning--I'd consider a common table expression or a wrapper for a stored procedure.
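The weekend arithmetic described above (five working days per full week, plus some care with the leftover days) and the idea of passing holidays in as an argument can be sketched outside the database. This is an illustration in Python, not an Oracle solution: the function name working_days is hypothetical, and it assumes both endpoints count, so the "-1" used in the question's queries is not applied.

```python
from datetime import date


def working_days(start, end, holidays=()):
    """Count Monday-Friday days in [start, end] inclusive, minus weekday holidays."""
    if end < start:
        return 0
    total_days = (end - start).days + 1
    full_weeks, leftover = divmod(total_days, 7)
    count = full_weeks * 5  # every full 7-day block holds exactly 5 weekdays

    # The leftover days begin on the same weekday as `start`,
    # because full weeks cycle back to it. Monday is 0, Sunday is 6.
    first_weekday = start.weekday()
    for offset in range(leftover):
        if (first_weekday + offset) % 7 < 5:
            count += 1

    # Holidays are supplied by the caller; only those that fall on a
    # weekday inside the range reduce the count.
    for holiday in set(holidays):
        if start <= holiday <= end and holiday.weekday() < 5:
            count -= 1
    return count


# Two of the holidays hardcoded in the question's list.
HOLIDAYS_2011 = [date(2011, 12, 26), date(2011, 12, 27)]
print(working_days(date(2011, 12, 19), date(2011, 12, 30), HOLIDAYS_2011))  # 8
```

The same split into full weeks and leftover days could be expressed in SQL with TRUNC and MOD on the day difference, but the holiday list still has to come from somewhere, which is the point the answer makes.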