T-SQL Fewest Sets of Common Dates that Includes All Row IDs - sql

My table (#MyTable) is a list of IDs with start dates and end dates (inclusive) that represent an interval of days when the ID appears in a file that is received once per day:
ID   Start_Date   End_Date
1    10/01/2014   12/15/2014
2    11/05/2014   03/03/2015
3    12/07/2014   12/09/2014
4    04/01/2015   04/15/2015
Each ID appears only once, i.e. has only one associated time interval, and the intervals between Start_Date and End_Date can (but need not) overlap across different IDs. I need a SQL query that finds the smallest possible set of dates such that every ID appears in at least one of the files from those dates. In the table above, one solution would be these 2 dates:
File_Date    ID(s)
12/07/2014   1,2,3
04/01/2015   4
But for this example, any one date between ID 3's Start_Date and End_Date, combined with any one date between ID 4's Start_Date and End_Date, would be a solution.
The actual data consists of 10,000 different IDs. The date range of possible file dates is 04/01/2014 - 07/01/2015. Each daily file is very large in size and must be downloaded manually, hence I want to minimize the number I must download to include all IDs.
So far I have a CTE that results in separate rows for all dates between the Start_Date and End_date of each ID:
;WITH cte (ID, d)
AS
(
    SELECT
        tbl.ID AS ID,
        tbl.Start_Date AS d
    FROM #MyTable tbl
    UNION ALL
    SELECT
        tbl.ID AS ID,
        DATEADD(DAY, 1, cte.d) AS d
    FROM cte
    INNER JOIN #MyTable tbl ON cte.ID = tbl.ID
    WHERE cte.d < tbl.End_Date
)
SELECT
    ID AS ID,
    d AS File_Date
FROM cte
ORDER BY ID, d
OPTION (MAXRECURSION 500)
Using #MyTable example results are:
ID File_Date
1 10/01/2014
1 10/02/2014
1 10/03/2014
1 etc...
My thinking was to determine the most common File_Date among all the IDs, then pick the next most common File_Date among all the IDs left, and so on...but I'm stuck. To put it in more mathy terms, I am trying to find the fewest sets (File_Dates) that contain all the items (IDs), similar to https://softwareengineering.stackexchange.com/questions/263095/finding-the-fewest-sets-which-contain-all-items, but I don't care about minimizing duplicates. The final results do not have to include which IDs appear in which File_Dates; I just need to know all the File_Dates.
I'm using MS SQL Server 2008.

Just go on with what you started. The result found by this greedy method is not guaranteed to be optimal, but it could be good enough for your purposes.
For each ID, generate a row for each day in its range. You already know how to do that, though I'd use a table of numbers rather than generating the rows on the fly with a CTE every time; it doesn't really matter, though.
Put the result into a temporary table. It will have 10,000 IDs * ~400 days = ~4M rows.
The temp table has two columns (ID, FileDate).
Create appropriate indexes. I'd start with two: on (ID, FileDate) and on (FileDate, ID). Make one of them the clustered primary key; I'd try (FileDate, ID) as the clustered primary key.
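The staging table described above could be sketched like this (a minimal sketch; the names #temp, FileDate, ID are illustrative, not from the original post):

```sql
-- Illustrative staging table for the exploded (FileDate, ID) pairs.
-- The clustered PK on (FileDate, ID) supports both the GROUP BY FileDate
-- counting step and the DELETE ... WHERE FileDate = @VarDate step.
CREATE TABLE #temp (
    FileDate date NOT NULL,
    ID       int  NOT NULL,
    PRIMARY KEY CLUSTERED (FileDate, ID)
);

-- Secondary index so deleting all rows of an already-covered ID is cheap.
CREATE NONCLUSTERED INDEX IX_temp_ID_FileDate ON #temp (ID, FileDate);
```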
Then process in a loop:
Find the date that covers the largest number of remaining IDs:
SELECT TOP (1) @VarDate = FileDate
FROM #temp
GROUP BY FileDate
ORDER BY COUNT(*) DESC;
Remember the found date (and optionally its IDs) in another temp table for the final result.
Then delete that date, and all IDs that appear on it, from the big table:
DELETE FROM #temp
WHERE FileDate = @VarDate
   OR ID IN
      (
          SELECT t2.ID
          FROM #temp AS t2
          WHERE t2.FileDate = @VarDate
      );
Repeat the loop until there are no rows in #temp.

Using Vladimir B.'s suggested approach and the answer from In SQL Server, how to create while loop in select as a model:
;WITH cte (ID, d)
AS
(
    SELECT
        tbl.ID AS ID,
        tbl.Start_Date AS d
    FROM #MyTable tbl
    UNION ALL
    SELECT
        tbl.ID AS ID,
        DATEADD(DAY, 1, cte.d) AS d
    FROM cte
    INNER JOIN #MyTable tbl ON cte.ID = tbl.ID
    WHERE cte.d < tbl.End_Date
)
SELECT
    ID AS ID,
    d AS File_Date
INTO #temp2
FROM cte
ORDER BY ID, d
OPTION (MAXRECURSION 500)
Create Table #FileDates
(
File_Date date
)
GO
DECLARE @VarDate date
WHILE EXISTS (SELECT * FROM #temp2)
BEGIN
    SELECT TOP (1)
        @VarDate = File_Date
    FROM #temp2
    GROUP BY File_Date
    ORDER BY COUNT(*) DESC;

    INSERT INTO #FileDates (File_Date)
    VALUES (@VarDate)

    DELETE FROM #temp2
    WHERE File_Date = @VarDate
       OR ID IN
          (
              SELECT t2.ID
              FROM #temp2 AS t2
              WHERE t2.File_Date = @VarDate
          )
END
SELECT *
FROM #FileDates
ORDER BY File_Date
Took 30 seconds to return 40 file dates for approx. 4,000 IDs. Thank you very much Mr. Baranov!


"Get the first row of each day" SQL query

I am the end-user of a SQL Server DB with multiple rows ordered by date.
Let's take this table as an example:
Amount  Date
23.5    20210512010220111
24      20210512020220111
30      20210512030220111
1.2     20210513011020111
1000    20210513020220111
24      20210514100220111
240     20210514100220111
Be advised that the date is just a long (bigint) that represents the date in the format yyyyMMddHHmmssfff.
I am trying to create a SQL query like this:
"Get the first row of each day"
So for the above example the result will be:
Amount  Date
23.5    20210512010220111
1.2     20210513011020111
24      20210514100220111
I saw this example in multiple sources:
https://learnsql.com/cookbook/how-to-select-the-first-row-in-each-group-by-group/
The problem is that when I tried it, it was way too slow for me; the DB stores hundreds of millions of rows (with 9 columns each).
A couple of weeks ago I used a similar(ish) query for a daily min, avg, max:
SELECT MIN(Amount), AVG(Amount), MAX(Amount)
FROM table
GROUP BY Date/1000000000
the "/1000000000" is for days.
That worked quickly enough, so if there is something similar to FIRST(Amount), that would be great.
Just to clarify, I am just an end-user, I have no saying over the overall structure of the DB.
Edit:
This is the query I tried and was too slow:
WITH added_row_number AS (
    SELECT
        *,
        ROW_NUMBER() OVER (PARTITION BY Date/1000000000 ORDER BY Date ASC) AS row_number
    FROM table
)
SELECT *
FROM added_row_number
WHERE row_number = 1;
EDIT 2:
I took inspiration from all the answers here and after some trial and error I found this query worked fast enough (adjusted query to suit the example DB, not the actual query I used):
SELECT OrgTable.*
FROM (
    SELECT *
    FROM Table
    -- WHERE statement on a unique key column
) AS OrgTable
INNER JOIN (
    SELECT MIN(Date) AS ModTimeMin
    FROM Table
    -- WHERE statement on a unique key column
    GROUP BY Date/1000000000
    -- this subquery gets the time of the first transaction of each day
) AS MinTable
    ON OrgTable.Date = MinTable.ModTimeMin -- this joins the relevant data to the times table
ORDER BY OrgTable.Date ASC
Thank you.
Try something to this effect:
SELECT t.yyyy_mm_dd_date, table.*
FROM table
JOIN (
    SELECT SUBSTRING(CAST(Date AS varchar(17)), 1, 8) AS yyyy_mm_dd_date, MIN(Date) AS min_date
    FROM table
    GROUP BY SUBSTRING(CAST(Date AS varchar(17)), 1, 8)
) t ON t.min_date = table.Date
In general I find SQL queries run fast with joins and aggregations (especially over their indexes), so if you can translate the query to use those, it should generally run fairly fast.
The following may perform better than your original query, as they use an existence check (or TOP clause) instead of reading all data, calculating ROW_NUMBER for every row, and then scanning the result. They will perform best if there is already an index on the [Date] column.
As you have duplicate [Date] values, the query could return different results each time it is executed unless a unique key column is included in the query.
create table #t(
    [Id] int
    ,[Amount] decimal(10,2)
    ,[Date] bigint
);
create index idx_Date on #t([Date]);

insert #t values
 (1, 23.5, 20210512010220111)
,(2, 24, 20210512020220111)
,(3, 30, 20210512030220111)
,(4, 1.2, 20210513011020111)
,(5, 1000, 20210513020220111)
,(6, 24, 20210514100220111)
,(7, 240, 20210514100220111);

-- Assuming that you have a unique key available
select
    *
    ,t1.[Date]/1000000000
from #t t1
where not exists (
    select *
    from #t t2
    where t1.[Date]/1000000000 = t2.[Date]/1000000000
    and (
        t2.[Date] < t1.[Date]
        or (
            t2.[Date] = t1.[Date]
            and t2.Id < t1.Id
        )
    )
);

-- This is a kludge if you don't have a unique key available and may perform
-- worse than your original query. Don't use this without testing it in a
-- non-production system first.
select
    *
    ,t1.[Date]/1000000000
from #t t1
where not exists (
    select *
    from #t t2
    where t1.[Date]/1000000000 = t2.[Date]/1000000000
    and (
        t2.[Date] < t1.[Date]
        or (
            t2.[Date] = t1.[Date]
            and t2.%%physloc%% < t1.%%physloc%% -- %%physloc%% is the File/Page/Slot for the row
        )
    )
);

-- Alternatively using top. Assumes a unique column is available
select
    t1.*
    ,t1.[Date]/1000000000
from #t t1
cross apply (
    select top 1 *
    from #t t2
    where t1.[Date]/1000000000 = t2.[Date]/1000000000
    order by [Date], Id
) t2
where t1.Id = t2.Id;

drop table #t;
It feels like you can use a CTE here and cast the date strings as actual DATE values (see the last query). Here is my sample using test data (you did not post the column types, so I guessed).
I'm not 100% clear about the "Date" column; if it is an actual datetime, you can just cast it as a date.
DECLARE #mytable TABLE (
Amount NUMERIC(10,2) NOT NULL,
[Date] VARCHAR(30) NOT NULL
);
INSERT INTO #mytable(Amount,[Date])
VALUES
(3.5, '20210512010220111'),
(24.0, '20210512020220111'),
(30.0,'20210512030220111'),
(1.2, '20210513011020111'),
(1000.0, '20210513020220111'),
(24.0, '20210514100220111'),
(240.0, '20210514100220111')
;
SELECT
[Amount],
MAX(CAST( LEFT([Date], 8) AS DATE)) AS NewDate
FROM #mytable
GROUP BY AMOUNT
ORDER BY MAX(CAST( LEFT([Date], 8) AS DATE)) DESC;
/* this is what we want perhaps: */
;
WITH cte AS (
    SELECT
        Amount,
        CAST(LEFT([Date], 8) AS DATE) AS MyDate,
        ROW_NUMBER() OVER (PARTITION BY CAST(LEFT([Date], 8) AS DATE) ORDER BY [Date] ASC) AS row_number
    FROM #mytable
)
SELECT *
FROM cte
WHERE row_number = 1;

SQL query for filtering duplicate rows of a column by the minimum DateTime of those corresponding rows

I have a SQL database table, "Helium_Test_Data", that has multiple entries based on the KeyID column (the KeyID represents a single tested part ). I need to query the entries and only show one entry per KeyID (part) based on the earliest creation date-time (format example is 2018-12-29 08:22:11.123). This is because the same part was tested several times but the first reading is the one I need to use. Here is the query currently tried:
SELECT mt.*
FROM Helium_Test_Data mt
INNER JOIN
(
    SELECT
        KeyID,
        MIN(DateTime) AS DateTime
    FROM Helium_Test_Data
    WHERE PSNo = '11166565'
    GROUP BY KeyID
) t ON mt.KeyID = t.KeyID AND mt.DateTime = t.DateTime
WHERE PSNo = '11167197'
  AND (mt.DateTime > '2018-12-29 07:00')
  AND (mt.DateTime < '2018-12-29 18:00')
  AND OK = 1
ORDER BY KeyID, DateTime
It returns only the rows that have no duplicate KeyID present in the table whereas I need one row per every single KeyID (duplicate or not). And for the duplicate ones, I need the earliest date.
Thanks in advance for the help.
Use the row_number() window function, which most DBMSs support:
select *
from
(
    select *, row_number() over (partition by KeyID order by DateTime) rn
    from Helium_Test_Data
) t
where t.rn = 1
Or you could use a correlated subquery:
select t1.*
from Helium_Test_Data t1
where t1.DateTime = (
    select min(DateTime)
    from Helium_Test_Data t2
    where t2.KeyID = t1.KeyID
)
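To illustrate the row_number() approach on its own, here is a self-contained sketch on made-up data (the real Helium_Test_Data table has more columns; Pressure here is a hypothetical stand-in):

```sql
-- Hypothetical sample data; only KeyID and DateTime come from the question.
DECLARE @Helium_Test_Data TABLE (
    KeyID int,
    [DateTime] datetime2(3),
    Pressure decimal(10,3)  -- stand-in for the other measurement columns
);
INSERT INTO @Helium_Test_Data VALUES
    (100, '2018-12-29 08:22:11.123', 1.001),  -- first reading for part 100 (kept)
    (100, '2018-12-29 09:15:42.500', 1.004),  -- retest of part 100 (discarded)
    (101, '2018-12-29 08:30:00.000', 0.998);  -- only reading for part 101 (kept)

-- One row per KeyID: the earliest reading wins.
SELECT KeyID, [DateTime], Pressure
FROM (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY KeyID ORDER BY [DateTime]) AS rn
    FROM @Helium_Test_Data
) t
WHERE t.rn = 1;
```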

Insert data into temporary Table

I'm using SQL Server and I'm trying to build a temporary table with a column retrieved from a currently existing table, inserting a date into each row.
For example I have
select NumberID into #Table0
from Table1
where numcue in [conditions]
In the new #Table0 I have the NumberID ordered by a certain criterion. But in that exact same order, I want to introduce a date for each row.
Is there any way to do it without using CREATE TABLE or INSERT? (I don't have permissions for those.)
Thanks in advance
-------EDIT--- (MORE INFO)
Maybe I wasn't clear about it. Long story short: I have IDNUMBER in TABLE1 on my data warehouse (10k+ rows), but it has 20 dates for each IDNUMBER.
In an Excel file I have the date I need to retrieve for each IDNUMBER, but I don't know how to retrieve that exact info directly with a query. The dates don't follow any criteria; it's just a random date for each IDNUMBER, so I can't compute them in code.
So what I was trying to do is put each IDNUMBER with its date from the Excel file in a temporary table, and then keep looking up info with that.
Hope this helps explain a little further.
Thanks in advance, and for all the current answers.
So you mean something like this?
select NumberID, GETDATE() AS DateColumn into #Table0
from Table1
where numcue in [conditions]
I think you can do it by using a CTE (Common Table Expression) to find the row number for your IDs.
I'm not sure if this is the case, but I've understood that you want to increment the date for each row, e.g.:
NumberID | Date
1        | 2018-01-01
3        | 2018-01-02
12       | 2018-01-03
25       | 2018-01-04
In that case, I've supplied some code that uses the sys_objects table as an example:
DECLARE @FirstDate DATE = '20180101'
;WITH CTE
AS
(
    SELECT TOP (100) PERCENT object_id, ROW_NUMBER() OVER (ORDER BY object_id ASC) AS RowNumber
    FROM master.sys.objects
    ORDER BY object_id
)
SELECT object_id, DATEADD(dd, RowNumber - 1, @FirstDate) AS Date, RowNumber
FROM CTE;
You can ignore the RowNumber column - I've just added for you to understand that it is a sequence.
For your case in particular, I think this code should work; remember to specify your initial date:
DECLARE @FirstDate DATE = '20180101'
;WITH CTE
AS
(
    SELECT TOP (100) PERCENT NumberID, ROW_NUMBER() OVER (ORDER BY NumberID ASC) AS RowNumber
    FROM Table1
    WHERE numcue IN [conditions]
)
SELECT NumberID, DATEADD(dd, RowNumber - 1, @FirstDate) AS Date
FROM CTE;

Get a single value where the latest date

I need to get the latest price of an item (as part of a larger select statement) and I can't quite figure it out.
Table:
ITEMID  DATE       SALEPRICE
1       1/1/2014   10
1       2/2/2014   20
2       3/3/2014   15
2       4/4/2014   13
I need the output of the select to be '20' when looking for item 1 and '13' when looking for item 2 as per the above example.
I am using Oracle SQL
The most readable/understandable SQL (in my opinion) would be this:
select saleprice
from prices t
where t.sale_date =
(
    select max(sale_date) from prices t2 where t2.itemid = t.itemid
)
and t.itemid = 1 -- change item id here
assuming your table is named prices and the date column sale_date (TABLE and DATE are reserved words in Oracle), and that you only have one price per day and item (otherwise the where condition would match more than one row per item). Alternatively, the subselect could be written as a self-join (it should not make a difference in performance).
I'm not sure about the OVER/PARTITION used by the other answers. Maybe they could be optimized to better performance depending on the DBMS.
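As a further sketch, assuming an illustrative table prices with columns itemid, sale_date, and saleprice (these names are not from the original post), Oracle's KEEP (DENSE_RANK LAST) aggregate returns the price at the latest date for every item in a single pass:

```sql
-- Latest price per item in one pass (Oracle syntax).
-- Table and column names here are illustrative assumptions.
SELECT itemid,
       MAX(saleprice) KEEP (DENSE_RANK LAST ORDER BY sale_date) AS latest_price
FROM prices
GROUP BY itemid;
```

For the sample data this would return one row per item (20 for item 1, 13 for item 2) without a self-join or correlated subquery.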
Maybe something like this:
Test data
DECLARE @tbl TABLE(ITEMID int, [DATE] DATETIME, SALEPRICE INT)
INSERT INTO @tbl
VALUES
    (1, '1/1/2014', 10),
    (1, '2/2/2014', 20),
    (2, '3/3/2014', 15),
    (2, '4/4/2014', 13)
Query
;WITH CTE
AS
(
    SELECT
        ROW_NUMBER() OVER (PARTITION BY ITEMID ORDER BY [DATE] DESC) AS rowNbr,
        tbl.*
    FROM @tbl AS tbl
)
SELECT *
FROM CTE
WHERE CTE.rowNbr = 1
Try this! It is SQL Server syntax, but a similar ROW_NUMBER query also works in Oracle:
select * from
(
    select *, rn = row_number() over (partition by ITEMID order by [DATE] desc)
    from table
) x
where x.rn = 1
You need ROW_NUMBER() to allocate a number to the records, partitioned by ITEMID so each group gets its own sequence; since you order by DATE desc, the latest record in each group gets row number 1.

SQL query - how to aggregate only contiguous periods without lag or lead?

I'm still a novice, so please excuse my English. As you can see, I have two persons with different periods of time, and I want to aggregate the periods if they are contiguous. I don't know how to use, for example, the min() and max() functions relative to the next or previous row to compare the dates. Or is there an easier way to solve this? I only have SQL Server 2008 R2, without the LAG and LEAD functions.
Sample data:
DECLARE @Table TABLE(
    PersonID INT,
    [FROM] date,
    [TO] date
)
INSERT INTO @Table SELECT 1,'2011-01-01','2011-04-30'
INSERT INTO @Table SELECT 1,'2011-05-01','2011-08-31'
INSERT INTO @Table SELECT 1,'2011-09-01','2011-12-31'
INSERT INTO @Table SELECT 1,'2012-01-01','2012-03-31'
INSERT INTO @Table SELECT 2,'2011-03-01','2011-06-30'
INSERT INTO @Table SELECT 2,'2011-07-01','2011-10-31'
INSERT INTO @Table SELECT 2,'2013-01-01','2013-04-30'
INSERT INTO @Table SELECT 2,'2013-05-01','2013-08-31'
I expect something like this; look especially at PersonID 2:
PersonID  FROM        TO
1         2011-01-01  2012-03-31
2         2011-03-01  2011-10-31
2         2013-01-01  2013-08-31
This is a hard problem that would be made easier with cumulative sums and lag() or lead(). You can still do the work, though; I prefer to express it using correlated subqueries.
The logic starts by identifying which records are connected to the previous record by an overlap. The following query uses this logic to define OverlapWithPrev.
select *
from (select t.*,
             (select top 1 1
              from t t2
              where t2.personid = t.personid and
                    t2.fromd < t.fromd and
                    t2.tod >= dateadd(d, -1, t.fromd)
              order by t2.fromd
             ) as OverlapWithPrev
      from t
     ) t
This takes on the value of 1 when there is a previous record and NULL when there is not one.
Then, with this information, the query finds for each record the next record that is not overlapped with its previous one (and is for the same person). All records in a sequence of overlapping records will have the same such next record, and that next record is used for the aggregation.
Here is the full query:
with tp as
(
    select *
    from (select t.*,
                 (select top 1 1
                  from t t2
                  where t2.personid = t.personid and
                        t2.fromd < t.fromd and
                        t2.tod >= dateadd(d, -1, t.fromd)
                  order by t2.fromd
                 ) as OverlapWithPrev
          from t
         ) t
)
select personid, min(fromd) as fromd, max(tod) as tod
from (select tp.*,
             (select top 1 fromd
              from tp tp2
              where tp2.OverlapWithPrev is null and
                    tp2.personid = tp.personid and
                    tp2.fromd > tp.fromd
             ) as NextFromD
      from tp
     ) tp
group by personid, NextFromD;
Here is a SQLFiddle to show how it works.
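The same result can also be reached with a classic "gaps and islands" pattern that runs on SQL Server 2008 without LAG/LEAD. Below is a minimal self-contained sketch; it assumes periods either abut exactly (next start = previous end + 1 day) or are disjoint, as in the sample data, and it renames the columns FromD/ToD to avoid the reserved words FROM/TO:

```sql
-- Self-contained "islands" sketch (table and column names are illustrative).
DECLARE @T TABLE (PersonID int, FromD date, ToD date);
INSERT INTO @T VALUES
    (1,'2011-01-01','2011-04-30'), (1,'2011-05-01','2011-08-31'),
    (1,'2011-09-01','2011-12-31'), (1,'2012-01-01','2012-03-31'),
    (2,'2011-03-01','2011-06-30'), (2,'2011-07-01','2011-10-31'),
    (2,'2013-01-01','2013-04-30'), (2,'2013-05-01','2013-08-31');

-- 1) A row starts an island when no row of the same person ends the day before.
-- 2) Each row belongs to the latest island start at or before its FromD.
;WITH starts AS (
    SELECT t.PersonID, t.FromD
    FROM @T t
    WHERE NOT EXISTS (
        SELECT 1 FROM @T p
        WHERE p.PersonID = t.PersonID
          AND p.ToD = DATEADD(DAY, -1, t.FromD)
    )
)
SELECT t.PersonID, g.IslandStart AS FromD, MAX(t.ToD) AS ToD
FROM @T t
CROSS APPLY (
    SELECT MAX(s.FromD) AS IslandStart
    FROM starts s
    WHERE s.PersonID = t.PersonID AND s.FromD <= t.FromD
) g
GROUP BY t.PersonID, g.IslandStart;
-- Expected: (1, 2011-01-01, 2012-03-31), (2, 2011-03-01, 2011-10-31),
--           (2, 2013-01-01, 2013-08-31)
```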