SQL find min & max range within dataset - sql

I have a table with the following columns:
contactId (int)
interval (int)
date (smalldate)
small sample data:
1,120,'12/02/2010'
1,121,'12/02/2010'
1,122,'12/02/2010'
1,123,'12/02/2010'
1,145,'12/02/2010'
1,146,'12/02/2010'
1,147,'12/02/2010'
2,122,'12/02/2010'
2,123,'12/02/2010'
2,124,'12/02/2010'
2,320,'12/02/2010'
2,321,'12/02/2010'
2,322,'12/02/2010'
2,450,'12/02/2010'
2,451,'12/02/2010'
how/is it possible - to get sql to return columns "contactId, minInterval, maxInterval, date", e.g
1,120,123,'12/02/2010'
1,145,147,'12/02/2010'
2,122,124,'12/02/2010'
2,320,322,'12/02/2010'
2,450,451,'12/02/2010'
hopefully this makes sense, basically i'm looking to figure out the min/max range of the intervals by provider & date for the range where they increment by one... once there is a break in the interval incrementer (e.g. more than one) then it would indicate a new min/max range...
any help is greatly appreciated :)
here is my exact SQL table setup:
create table availability
(
Id (int)
ProviderId (int)
IntervalId (int)
Date (date)
)
sample data
providerid,intervalid,date
1128,108,2010-12-27
1128,109,2010-12-27
1128,110,2010-12-27
1128,111,2010-12-27
1128,112,2010-12-27
1128,113,2010-12-27
1128,114,2010-12-27
1128,120,2010-12-27
1128,121,2010-12-27
1128,122,2010-12-27
1128,123,2010-12-27
1128,124,2010-12-27
1128,125,2010-12-27
1213,108,2010-12-27
1213,109,2010-12-27
1213,110,2010-12-27
1213,111,2010-12-27
1213,112,2010-12-27
1213,113,2010-12-27
1213,114,2010-12-27
1213,115,2010-12-27
1213,232,2010-12-27
1213,233,2010-12-27
1213,234,2010-12-27
3954,198,2010-12-27
3954,199,2010-12-27
3954,200,2010-12-27
3954,201,2010-12-27
3954,202,2010-12-27
3954,203,2010-12-27
3954,204,2010-12-27
3954,205,2010-12-27
3954,206,2010-12-27
3954,207,2010-12-27
3954,208,2010-12-27
3954,209,2010-12-27
3954,210,2010-12-27
3954,211,2010-12-27
3954,212,2010-12-27
3954,213,2010-12-27
3954,214,2010-12-27
3954,215,2010-12-27
3954,216,2010-12-27
3954,217,2010-12-27
3954,218,2010-12-27
3954,229,2010-12-27
3954,230,2010-12-27
3954,231,2010-12-27
3954,232,2010-12-27
3954,233,2010-12-27
3954,234,2010-12-27
1128,108,2010-12-28
1128,109,2010-12-28
1128,110,2010-12-28
1128,111,2010-12-28
1128,112,2010-12-28
1128,113,2010-12-28
1128,114,2010-12-28
1128,115,2010-12-28
1128,116,2010-12-28
3954,186,2010-12-28
3954,187,2010-12-28
3954,188,2010-12-28
3954,189,2010-12-28
3954,190,2010-12-28
3954,213,2010-12-28
3954,214,2010-12-28
3954,215,2010-12-28
3954,216,2010-12-28
3954,217,2010-12-28
3954,218,2010-12-28
3954,219,2010-12-28
3954,220,2010-12-28
3954,221,2010-12-28
3954,222,2010-12-28
sample result using current sql within answers:
1062,180,180,2010-12-20
1062,179,179,2010-12-20
1062,178,178,2010-12-20
1062,177,177,2010-12-20
1062,176,176,2010-12-20
1062,175,175,2010-12-20
1062,174,174,2010-12-20
1062,173,173,2010-12-20
1062,172,172,2010-12-20
1062,171,171,2010-12-20
1062,170,170,2010-12-20
1062,169,169,2010-12-20
1062,168,168,2010-12-20
1062,167,167,2010-12-20
1062,166,166,2010-12-20
1062,165,165,2010-12-20
1062,164,164,2010-12-20
1062,163,163,2010-12-20
1062,162,162,2010-12-20
1062,161,161,2010-12-20
1062,160,160,2010-12-20
1062,159,159,2010-12-20
1062,158,158,2010-12-20
1062,157,157,2010-12-20
1062,156,156,2010-12-20
1062,155,155,2010-12-20
1062,154,154,2010-12-20
1062,153,153,2010-12-20
1062,152,152,2010-12-20
1062,151,151,2010-12-20
1062,150,150,2010-12-20
1062,149,149,2010-12-20
1062,148,148,2010-12-20
1062,147,147,2010-12-20
1062,146,146,2010-12-20
1062,145,145,2010-12-20
1062,144,144,2010-12-20
1062,143,143,2010-12-20
1062,142,142,2010-12-20
1062,141,141,2010-12-20
1062,140,140,2010-12-20
1062,139,139,2010-12-20
1062,138,138,2010-12-20
1062,137,137,2010-12-20
1062,136,136,2010-12-20
1062,135,135,2010-12-20
1062,134,134,2010-12-20
1062,133,133,2010-12-20
1062,132,132,2010-12-20
1062,131,131,2010-12-20
1062,130,130,2010-12-20
1062,129,129,2010-12-20
1062,128,128,2010-12-20
1062,127,127,2010-12-20
1062,126,126,2010-12-20
1062,125,125,2010-12-20
1062,124,124,2010-12-20
1062,123,123,2010-12-20
1062,122,122,2010-12-20
1062,121,121,2010-12-20
1062,120,120,2010-12-20
1062,119,119,2010-12-20
1062,118,118,2010-12-20
1062,117,117,2010-12-20
1062,116,116,2010-12-20
1062,115,115,2010-12-20
1062,114,114,2010-12-20
1062,113,113,2010-12-20
1062,112,112,2010-12-20

In SQL Server, Oracle and PostgreSQL:
WITH q AS
(
SELECT t.*, interval - ROW_NUMBER() OVER (PARTITION BY contactID, date ORDER BY interval) AS sr
FROM mytable t
)
SELECT contactID, date, MIN(interval), MAX(interval)
FROM q
GROUP BY
date, contactID, sr
ORDER BY
date, contactID, sr
Update:
With your test data I get this output:
WITH mytable (providerId, intervalId, date) AS
(
SELECT 1128,108,'2010-12-27' UNION ALL
SELECT 1128,109,'2010-12-27' UNION ALL
SELECT 1128,110,'2010-12-27' UNION ALL
SELECT 1128,111,'2010-12-27' UNION ALL
SELECT 1128,112,'2010-12-27' UNION ALL
SELECT 1128,113,'2010-12-27' UNION ALL
SELECT 1128,114,'2010-12-27' UNION ALL
SELECT 1128,120,'2010-12-27' UNION ALL
SELECT 1128,121,'2010-12-27' UNION ALL
SELECT 1128,122,'2010-12-27' UNION ALL
SELECT 1128,123,'2010-12-27' UNION ALL
SELECT 1128,124,'2010-12-27' UNION ALL
SELECT 1128,125,'2010-12-27'
),
q AS
(
SELECT t.*, intervalId - ROW_NUMBER() OVER (PARTITION BY providerId, date ORDER BY intervalId) AS sr
FROM mytable t
)
SELECT providerId, date, MIN(intervalId), MAX(intervalId)
FROM q
GROUP BY
date, providerId, sr
ORDER BY
date, providerId, sr
1128 2010-12-27 108 114
1128 2010-12-27 120 125
, i. e. exactly what you were after.
Are you sure you are using the query correctly? Are you having duplicates on (providerId, intervalId, date)?

it's probably possible to do this with a SQL query alone, but it will probably be a bit mind-boggling. Basically a subquery to find places where it increments by one, joined to the original dataset, with tons of other logic in there. That's my impression at least.
If I were you,
If this is a one-time deal, don't care about performance and just iterate over it and do the calculation 'manually'.
If this is a production dataset and you need to do this operation on a frequent / automated / performance-intensive setting, then rearrange the original dataset to make this kind of query easier.
Hope one of those options is available to you.

Related

Create a date range based on rows but if date skips a day create another row

I have a business requirement to show data from the following table.
I need a way thru SQL to show the data as...
So everytime the User_ID or SPOT or Date is skip in sequential order we create a new row.
Assuming SQL Server, a solution might be:
with MyTbl as (
select *
from ( values
('SomeOne1','A','2023-06-16')
,('SomeOne1','A','2023-06-17')
,('SomeOne1','A','2023-06-18')
,('SomeOne1','A','2023-06-19')
,('SomeOne1','B','2023-06-20')
,('SomeOne1','B','2023-06-21')
,('SomeOne1','B','2023-06-22')
,('SomeOne1','B','2023-06-23')
,('SomeOne1','B','2023-06-24')
,('SomeOne1','B','2023-06-25')
,('SomeOne1','B','2023-06-26')
,('SomeOne1','B','2023-06-27')
,('SomeOneB','A','2023-06-20')
,('SomeOneB','A','2023-06-21')
,('SomeOneB','A','2023-06-22')
,('SomeOneB','A','2023-06-23')
,('SomeOneB','A','2023-06-24')
,('SomeOneB','A','2023-06-25')
,('SomeOneB','A','2023-06-28')
,('SomeOneB','A','2023-06-29')
,('SomeOneB','A','2023-06-30')
,('SomeOneB','A','2023-07-01')
,('SomeOneB','A','2023-07-02')
,('SomeOneB','A','2023-07-03')
) T(UserId, Spot, this_Date)
), AssgnGrp as (
select UserId, Spot, this_date
, [Grp] = DATEADD(DAY,-1 * (DENSE_RANK() OVER (partition by UserId, Spot ORDER BY [this_date])-1), [this_date])
from MyTbl
)
select UserId, Spot, Grp, begin_date=min(this_date), end_date=max(this_date)
from AssgnGrp
group by UserId, Spot, Grp

Insert data from table into a new one with condition

Okay, so this has been bugging me the whole day. I have two tables (e.g original_table and new_table). The new table is empty and I need to populate it with records from original_table given the following conditions:
Trip duration must be at least 30 seconds
Include only stations which have at least 100 trips starting there
Include only stations which have at least 100 trips ending there
The duration part is easy, but I find it hard to filter the other two conditions.
I tried to make two temporary tables like so:
CREATE TEMP TABLE start_stations AS(
SELECT ARRAY(SELECT DISTINCT start_station_id FROM `dataset.original_table`
WHERE duration_sec >= 30
GROUP BY start_station_id
HAVING COUNT(start_station_id)>=100
AND COUNT(end_station_id)>=100) as arr
);
CREATE TEMP TABLE end_stations AS(
SELECT ARRAY(SELECT DISTINCT end_station_id FROM `dataset.original_table`
WHERE duration_sec >= 30
GROUP BY end_station_id
HAVING COUNT(end_station_id)>=100
AND COUNT(start_station_id)>=100) as arr
);
And then try to insert in the new_table like this:
INSERT INTO `dataset.new_table`
SELECT a.* FROM `dataset.original_table` as a, start_stations as ss,
end_stations as es
WHERE a.start_station_id IN UNNEST(ss.arr)
AND a.end_station_id IN UNNEST(es.arr)
However, this does not provide me the right answer. I tried to make a temprary function to clean up the data, but I didnt go far. :(
Here's a sample of the table:
trip_id|duration_sec|start_date|start_station_id| end_date|end_station_id|
--------------------------------------------------------------------------|
afad333| 231|2017-12-20| 210|2017-12-20| 355|
sffde56| 35|2017-12-12| 355|2017-12-12| 210|
af33445| 333|2018-10-27| 650|2018-10-27| 650|
dd1238d| 456|2017-09-15| 123|2017-09-15| 210|
dsa2223| 500|2017-09-15| 210|2017-09-15| 123|
...
I will be very thankful If you can help me.
Thanks in advance!
Approach should be
with major_stations as(
select start_station_id station_id
from trips
group by start_station_id
having count(*) > 100
union
select end_station_id station_id
from trips
group by end_station_id
having count(*) > 100
)
select *
from trips
where start_station_id in (select station_id from major_stations)
and trip_duration > 30
There may be some easy way, but this is first approach I think of.
So I found what my problem was. Since I must filter out stations where 100 trips started AND ended, doing it the way I did before was wrong.
The current answer for me was this:
INSERT INTO dataset.new_table
WITH stations AS (
SELECT start_station_id, end_station_id FROM dataset.original_table
GROUP BY start_station_id, end_station_id
HAVING count(start_station_id)>=100
AND count(end_station_id)>=100
)
SELECT a.* FROM dataset.original_table AS a, stations as s
WHERE a.start_station_id = s.start_station_id
AND a.end_station_id = s.end_station_id
AND a.duration_sec >= 30
This way I am creating only one WITH clause which filters only start AND end stations, by the given criteria.
As easy as it looks, obviously my brain needs a rest sometimes and a start with a new perspective.

Why would the query show data from the wrong month?

I have a query:
;with date_cte as(
SELECT r.starburst_dept_name,r.monthly_past_date as PrevDate,x.monthly_past_date as CurrDate,r.starburst_dept_average - x.starburst_dept_average as Average
FROM
(
SELECT *,ROW_NUMBER() OVER(PARTITION BY starburst_dept_name ORDER BY monthly_past_date) AS rowid
FROM intranet.dbo.cse_reports_month
) r
JOIN
(
SELECT *,ROW_NUMBER() OVER(PARTITION BY starburst_dept_name ORDER BY monthly_past_date) AS rowid
FROM intranet.dbo.cse_reports_month
Where month(monthly_past_date) > month(DATEADD(m,-2,monthly_past_date))
) x
ON r.starburst_dept_name = x.starburst_dept_name AND r.rowid = x.rowid+1
Where r.starburst_dept_name is NOT NULL
)
Select *
From date_cte
Order by Average DESC
So doing some testing, I have alter some columns data, to see why it gives me certain information. I don't know why when I run the query it gives my a date column that should not be there from "january" (row 4) like the picture below:
The database has more data that has the same exact date '2014-01-25 00:00:00.000', so I'm not sure why it would only get that row and compare the average?
I did before I run the query alter the column in that row and change the date? But I'm not sure if that would have something to do with it.
UPDATE:
I have added the sqlfinddle,
What I would like to get it subtract the average
from last_month - last 2 month ago.
It Was actually working until I made a change and alter the data.
I made the changes to test a certain situation, which obviously lead
to learning that there are flaws to the query.
Based on your SQL Fiddle, this eliminates joins from prior than month-2 from showing up.
SELECT
thismonth.starburst_dept_name
,lastmonth.monthtly_past_date [PrevDate]
,thismonth.monthtly_past_date [CurrDate]
,thismonth.starburst_dept_average - lastmonth.starburst_dept_average as Average
FROM dbo.cse_reports thismonth
inner join dbo.cse_reports lastmonth on
thismonth.starburst_dept_name = lastmonth.starburst_dept_name
AND month(DATEADD(MONTH,-1,thismonth.monthtly_past_date))=month(lastmonth.monthtly_past_date)
WHERE MONTH(thismonth.monthtly_past_date)=month(DATEADD(MONTH,-1,GETDATE()))
Order by thismonth.starburst_dept_average - lastmonth.starburst_dept_average DESC

Datediff between two tables

I have those two tables
1-Add to queue table
TransID , ADD date
10 , 10/10/2012
11 , 14/10/2012
11 , 18/11/2012
11 , 25/12/2012
12 , 1/1/2013
2-Removed from queue table
TransID , Removed Date
10 , 15/1/2013
11 , 12/12/2012
11 , 13/1/2013
11 , 20/1/2013
The TansID is the key between the two tables , and I can't modify those tables, what I want is to query the amount of time each transaction spent in the queue
It's easy when there is one item in each table , but when the item get queued more than once how do I calculate that?
Assuming the order TransIDs are entered into the Add table is the same order they are removed, you can use the following:
WITH OrderedAdds AS
( SELECT TransID,
AddDate,
[RowNumber] = ROW_NUMBER() OVER(PARTITION BY TransID ORDER BY AddDate)
FROM AddTable
), OrderedRemoves AS
( SELECT TransID,
RemovedDate,
[RowNumber] = ROW_NUMBER() OVER(PARTITION BY TransID ORDER BY RemovedDate)
FROM RemoveTable
)
SELECT OrderedAdds.TransID,
OrderedAdds.AddDate,
OrderedRemoves.RemovedDate,
[DaysInQueue] = DATEDIFF(DAY, OrderedAdds.AddDate, ISNULL(OrderedRemoves.RemovedDate, CURRENT_TIMESTAMP))
FROM OrderedAdds
LEFT JOIN OrderedRemoves
ON OrderedAdds.TransID = OrderedRemoves.TransID
AND OrderedAdds.RowNumber = OrderedRemoves.RowNumber;
The key part is that each record gets a rownumber based on the transaction id and the date it was entered, you can then join on both rownumber and transID to stop any cross joining.
Example on SQL Fiddle
DISCLAIMER: There is probably problem with this, but i hope to send you in one possible direction. Make sure to expect problems.
You can try in the following direction (which might work in some way depending on your system, version, etc) :
SELECT transId, (sum(add_date_sum) - sum(remove_date_sum)) / (1000*60*60*24)
FROM
(
SELECT transId, (SUM(UNIX_TIMESTAMP(add_date)) as add_date_sum, 0 as remove_date_sum
FROM add_to_queue
GROUP BY transId
UNION ALL
SELECT transId, 0 as add_date_sum, (SUM(UNIX_TIMESTAMP(remove_date)) as remove_date_sum
FROM remove_from_queue
GROUP BY transId
)
GROUP BY transId;
A bit of explanation: as far as I know, you cannot sum dates, but you can convert them to some sort of timestamps. Check if UNIX_TIMESTAMPS works for you, or figure out something else. Then you can sum in each table, create union by conveniently leaving the other one as zeto and then subtracting the union query.
As for that devision in the end of first SELECT, UNIT_TIMESTAMP throws out miliseconds, you devide to get days - or whatever it is that you want.
This all said - I would probably solve this using a stored procedure or some client script. SQL is not a weapon for every battle. Making two separate queries can be much simpler.
Answer 2: after your comments. (As a side note, some of your dates 15/1/2013,13/1/2013 do not represent proper date formats )
select transId, sum(numberOfDays) totalQueueTime
from (
select a.transId,
datediff(day,a.addDate,isnull(r.removeDate,a.addDate)) numberOfDays
from AddTable a left join RemoveTable r on a.transId = r.transId
order by a.transId, a.addDate, r.removeDate
) X
group by transId
Answer 1: before your comments
Assuming that there won't be a new record added unless it is being removed. Also note following query will bring numberOfDays as zero for unremoved records;
select a.transId, a.addDate, r.removeDate,
datediff(day,a.addDate,isnull(r.removeDate,a.addDate)) numberOfDays
from AddTable a left join RemoveTable r on a.transId = r.transId
order by a.transId, a.addDate, r.removeDate

Sorting twice on same column

I'm having a bit of a weird question, given to me by a client.
He has a list of data, with a date between parentheses like so:
Foo (14/08/2012)
Bar (15/08/2012)
Bar (16/09/2012)
Xyz (20/10/2012)
However, he wants the list to be displayed as follows:
Foo (14/08/2012)
Bar (16/09/2012)
Bar (15/08/2012)
Foot (20/10/2012)
(notice that the second Bar has moved up one position)
So, the logic behind it is, that the list has to be sorted by date ascending, EXCEPT when two rows have the same name ('Bar'). If they have the same name, it must be sorted with the LATEST date at the top, while staying in the other sorting order.
Is this even remotely possible? I've experimented with a lot of ORDER BY clauses, but couldn't find the right one. Does anyone have an idea?
I should have specified that this data comes from a table in a sql server database (the Name and the date are in two different columns). So I'm looking for a SQL-query that can do the sorting I want.
(I've dumbed this example down quite a bit, so if you need more context, don't hesitate to ask)
This works, I think
declare #t table (data varchar(50), date datetime)
insert #t
values
('Foo','2012-08-14'),
('Bar','2012-08-15'),
('Bar','2012-09-16'),
('Xyz','2012-10-20')
select t.*
from #t t
inner join (select data, COUNT(*) cg, MAX(date) as mg from #t group by data) tc
on t.data = tc.data
order by case when cg>1 then mg else date end, date desc
produces
data date
---------- -----------------------
Foo 2012-08-14 00:00:00.000
Bar 2012-09-16 00:00:00.000
Bar 2012-08-15 00:00:00.000
Xyz 2012-10-20 00:00:00.000
A way with better performance than any of the other posted answers is to just do it entirely with an ORDER BY and not a JOIN or using CTE:
DECLARE #t TABLE (myData varchar(50), myDate datetime)
INSERT INTO #t VALUES
('Foo','2012-08-14'),
('Bar','2012-08-15'),
('Bar','2012-09-16'),
('Xyz','2012-10-20')
SELECT *
FROM #t t1
ORDER BY (SELECT MIN(t2.myDate) FROM #t t2 WHERE t2.myData = t1.myData), T1.myDate DESC
This does exactly what you request and will work with any indexes and much better with larger amounts of data than any of the other answers.
Additionally it's much more clear what you're actually trying to do here, rather than masking the real logic with the complexity of a join and checking the count of joined items.
This one uses analytic functions to perform the sort, it only requires one SELECT from your table.
The inner query finds gaps, where the name changes. These gaps are used to identify groups in the next query, and the outer query does the final sorting by these groups.
I have tried it here (SQL Fiddle) with extended test-data.
SELECT name, dat
FROM (
SELECT name, dat, SUM(gap) over(ORDER BY dat, name) AS grp
FROM (
SELECT name, dat,
CASE WHEN LAG(name) OVER (ORDER BY dat, name) = name THEN 0 ELSE 1 END AS gap
FROM t
) x
) y
ORDER BY grp, dat DESC
Extended test-data
('Bar','2012-08-12'),
('Bar','2012-08-11'),
('Foo','2012-08-14'),
('Bar','2012-08-15'),
('Bar','2012-08-16'),
('Bar','2012-09-17'),
('Xyz','2012-10-20')
Result
Bar 2012-08-12
Bar 2012-08-11
Foo 2012-08-14
Bar 2012-09-17
Bar 2012-08-16
Bar 2012-08-15
Xyz 2012-10-20
I think that this works, including the case I asked about in the comments:
declare #t table (data varchar(50), [date] datetime)
insert #t
values
('Foo','20120814'),
('Bar','20120815'),
('Bar','20120916'),
('Xyz','20121020')
; With OuterSort as (
select *,ROW_NUMBER() OVER (ORDER BY [date] asc) as rn from #t
)
--Now we need to find contiguous ranges of the same data value, and the min and max row number for such a range
, Islands as (
select data,rn as rnMin,rn as rnMax from OuterSort os where not exists (select * from OuterSort os2 where os2.data = os.data and os2.rn = os.rn - 1)
union all
select i.data,rnMin,os.rn
from
Islands i
inner join
OuterSort os
on
i.data = os.data and
i.rnMax = os.rn-1
), FullIslands as (
select
data,rnMin,MAX(rnMax) as rnMax
from Islands
group by data,rnMin
)
select
*
from
OuterSort os
inner join
FullIslands fi
on
os.rn between fi.rnMin and fi.rnMax
order by
fi.rnMin asc,os.rn desc
It works by first computing the initial ordering in the OuterSort CTE. Then, using two CTEs (Islands and FullIslands), we compute the parts of that ordering in which the same data value appears in adjacent rows. Having done that, we can compute the final ordering by any value that all adjacent values will have (such as the lowest row number of the "island" that they belong to), and then within an "island", we use the reverse of the originally computed sort order.
Note that this may, though, not be too efficient for large data sets. On the sample data it shows up as requiring 4 table scans of the base table, as well as a spool.
Try something like...
ORDER BY CASE date
WHEN '14/08/2012' THEN 1
WHEN '16/09/2012' THEN 2
WHEN '15/08/2012' THEN 3
WHEN '20/10/2012' THEN 4
END
In MySQL, you can do:
ORDER BY FIELD(date, '14/08/2012', '16/09/2012', '15/08/2012', '20/10/2012')
In Postgres, you can create a function FIELD and do:
CREATE OR REPLACE FUNCTION field(anyelement, anyarray) RETURNS numeric AS $$
SELECT
COALESCE((SELECT i
FROM generate_series(1, array_upper($2, 1)) gs(i)
WHERE $2[i] = $1),
0);
$$ LANGUAGE SQL STABLE
If you do not want to use the CASE, you can try to find an implementation of the FIELD function to SQL Server.