SQL group by selecting top rows with possible nulls

The example table:
id  name  create_time          group_id
--  ----  -------------------  --------
1   a     2022-01-01 12:00:00  group1
2   b     2022-01-01 13:00:00  group1
3   c     2022-01-01 12:00:00  NULL
4   d     2022-01-01 13:00:00  NULL
5   e     NULL                 group2
I need to get the top row (the one with the minimal create_time) for each group_id, under these conditions:
- create_time can be null, in which case it should be treated as the minimal value
- group_id can be null, and all rows with a null group_id should be returned (if that's not possible, we can use coalesce(group_id, id) or something like that, assuming that ids are unique and never collide with group ids)
- it should be possible to apply pagination to the query (so a join can be a problem)
- the query should be as universal as possible (so nothing vendor-specific). Again, if that's not possible, it should work in MySQL 5 and 8, PostgreSQL 9+ and H2
The expected output for the example:
id  name  create_time          group_id
--  ----  -------------------  --------
1   a     2022-01-01 12:00:00  group1
3   c     2022-01-01 12:00:00  NULL
4   d     2022-01-01 13:00:00  NULL
5   e     NULL                 group2
I've already read similar questions on SO but 90% of answers are with specific keywords (numerous answers with PARTITION BY like https://stackoverflow.com/a/6841644/5572007) and others don't honor null values in the group condition columns and probably pagination (like https://stackoverflow.com/a/14346780/5572007).

You can combine two queries with UNION ALL. E.g.:
select id, name, create_time, group_id
from mytable
where group_id is not null
  and not exists
  (
    select null
    from mytable older
    where older.group_id = mytable.group_id
      and older.create_time < mytable.create_time
  )
union all
select id, name, create_time, group_id
from mytable
where group_id is null
order by id;
This is standard SQL and very basic at that. It should work in about every RDBMS.
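To try the query above against the question's sample rows, here is a minimal sketch using Python's built-in sqlite3 module. SQLite is only a stand-in for the real DBMS (an assumption on my part), and the column types are guessed from the example data.

```python
import sqlite3

# In-memory SQLite stands in for the real DBMS. The table name "mytable"
# comes from the answer, the rows from the question's example.
conn = sqlite3.connect(":memory:")
conn.execute(
    "create table mytable (id integer primary key, name text, "
    "create_time text, group_id text)"
)
conn.executemany(
    "insert into mytable values (?, ?, ?, ?)",
    [
        (1, "a", "2022-01-01 12:00:00", "group1"),
        (2, "b", "2022-01-01 13:00:00", "group1"),
        (3, "c", "2022-01-01 12:00:00", None),
        (4, "d", "2022-01-01 13:00:00", None),
        (5, "e", None, "group2"),
    ],
)

QUERY = """
select id, name, create_time, group_id
from mytable
where group_id is not null
  and not exists (
        select null
        from mytable older
        where older.group_id = mytable.group_id
          and older.create_time < mytable.create_time
      )
union all
select id, name, create_time, group_id
from mytable
where group_id is null
order by id
"""

top_rows = conn.execute(QUERY).fetchall()
for row in top_rows:
    print(row)
```

On the sample data this returns ids 1, 3, 4 and 5. One caveat: if a group contained both a NULL and a non-NULL create_time, the comparison older.create_time < mytable.create_time would be unknown for the NULL row, so both rows would come back. Rows that tie on the minimal create_time are likewise all returned.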
As to pagination: This is usually costly, as you run the same query again and again in order to always pick the "next" part of the result, instead of running the query only once. The best approach is usually to use the primary key to get to the next part, so that an index on the key can be used. In the above query we'd ideally add where id > :last_biggest_id to both queries and limit the result, which would be fetch next <n> rows only in standard SQL. Every time we run the query, we use the last read ID as :last_biggest_id, so we read on from there.
Variables, however, are dealt with differently in the various DBMS; most commonly they are preceded by either a colon, a dollar sign or an at sign. And the standard fetch clause, too, is supported by only some DBMS, while others have a LIMIT or TOP clause instead.
If these little differences make it impossible to apply them, then you must find a workaround. For the variable this can be a one-row-table holding the last read maximum ID. For the fetch clause this can mean you simply fetch as many rows as you need and stop there. Of course this isn't ideal, as the DBMS doesn't know then that you only need the next n rows and cannot optimize the execution plan accordingly.
And then there is the option not to do the pagination in the DBMS, but read the complete result into your app and handle pagination there (which then becomes a mere display thing and allocates a lot of memory of course).
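The key-based pagination described above could look roughly like the following sketch. It assumes SQLite via Python's sqlite3 module, named parameters in the :name style, and LIMIT instead of the standard fetch clause (LIMIT is not standard SQL, but MySQL, PostgreSQL and H2 accept it as far as I know). The helper fetch_page is a hypothetical name used only for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table mytable (id integer primary key, name text, "
    "create_time text, group_id text)"
)
conn.executemany(
    "insert into mytable values (?, ?, ?, ?)",
    [
        (1, "a", "2022-01-01 12:00:00", "group1"),
        (2, "b", "2022-01-01 13:00:00", "group1"),
        (3, "c", "2022-01-01 12:00:00", None),
        (4, "d", "2022-01-01 13:00:00", None),
        (5, "e", None, "group2"),
    ],
)

# The id filter is applied to the outer rows of both branches only;
# the "older" subquery must still see the whole group.
PAGE_QUERY = """
select id, name, create_time, group_id
from mytable
where group_id is not null
  and id > :last_biggest_id
  and not exists (
        select null
        from mytable older
        where older.group_id = mytable.group_id
          and older.create_time < mytable.create_time
      )
union all
select id, name, create_time, group_id
from mytable
where group_id is null
  and id > :last_biggest_id
order by id
limit :page_size
"""


def fetch_page(connection, last_biggest_id, page_size):
    """Return the next page of rows whose id is above last_biggest_id."""
    params = {"last_biggest_id": last_biggest_id, "page_size": page_size}
    return connection.execute(PAGE_QUERY, params).fetchall()


pages = []
last_id = 0  # assumes all ids are positive
while True:
    page = fetch_page(conn, last_id, 2)
    if not page:
        break
    pages.append([row[0] for row in page])
    last_id = page[-1][0]  # remember the last id read for the next call

print(pages)
```

With a page size of 2, the sample data comes back as two pages, ids 1 and 3 followed by ids 4 and 5.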

select * from T t1
where coalesce(create_time, 0) = (
    select min(coalesce(create_time, 0)) from T t2
    where coalesce(t2.group_id, t2.id) = coalesce(t1.group_id, t1.id)
)
Not sure how you imagine "pagination" should work. Here's one way:
and (
    select count(distinct coalesce(t2.group_id, t2.id)) from T t2
    where coalesce(t2.group_id, t2.id) <= coalesce(t1.group_id, t1.id)
) between 2 and 5 /* for example */
order by coalesce(t1.group_id, t1.id)
I'm assuming there's an implicit cast from 0 to a date value with a resulting value lower than all those in your database. Not sure if that's reliable. (Try '19000101' instead?) Otherwise the rest should be universal. You could probably also parameterize that in the same way as the page range.
You've also got a potential complication with collisions between the group_id and id spaces. Yours don't appear to have that problem, though having mixed data types creates its own issues.
This all gets more difficult when you want to order by other columns like name:
select * from T t1
where coalesce(create_time, 0) = (
    select min(coalesce(create_time, 0)) from T t2
    where coalesce(t2.group_id, t2.id) = coalesce(t1.group_id, t1.id)
) and (
    select count(*) from (
        select * from T t1
        where coalesce(create_time, 0) = (
            select min(coalesce(create_time, 0)) from T t2
            where coalesce(t2.group_id, t2.id) = coalesce(t1.group_id, t1.id)
        )
    ) t3
    where t3.name < t1.name
       or (t3.name = t1.name
           and coalesce(t3.group_id, t3.id) <= coalesce(t1.group_id, t1.id))
) between 2 and 5
order by t1.name;
That does handle ties but also makes the simplifying assumption that name can't be null which would add yet another small twist. At least you can see that it's possible without CTEs and window functions but expect these to also be a lot less efficient to run.
https://dbfiddle.uk/?rdbms=mysql_5.5&fiddle=9697fd274e73f4fa7c1a3a48d2c78691

I would guess
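-- id and name are neither grouped nor aggregated here:
-- only MySQL without ONLY_FULL_GROUP_BY accepts this
SELECT id, name, MIN(create_time), group_id
FROM tb WHERE group_id IS NOT NULL GROUP BY group_id
UNION ALL
SELECT id, name, create_time, group_id
FROM tb WHERE group_id IS NULL
ORDER BY name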
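I should point out that 'name' is a keyword in several DBMS (though, as far as I know, not a reserved one in MySQL or PostgreSQL), so you may want to quote it or pick another column name.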

Related

Query Optimization, Issue

Using SQL Server 2012.
I am using a query to find deltas in a table.
I have an archive table that has all the records with Licenceno PK,FileID
I want to find out how many Licenceno values are in a given FileID but not in any previous FileID.
Code used:
Select count(*) from table where fileid = 123 and Licenceno not in (select Licenceno from table where fileid <123)
The code works fine, but the problem is that for some fileIds, which have the same number of records as the previous ones, it takes 4 hours and is still running.
Is it a table issue?
An index can't be the issue, as the whole table has a nonclustered index.
It generally happens when I am calculating deltas for the latest Licenceno.
Or is query planning the issue?
I have not been able to solve this for the past 5 days.

I would rewrite your query to use an exists clause, and also add an appropriate index:
SELECT COUNT(*)
FROM yourTable t1
WHERE
    fileid = 123 AND
    NOT EXISTS (SELECT 1 FROM yourTable t2
                WHERE t2.Licenceno = t1.Licenceno AND t2.fileid < 123);
An index on (Licenceno, fileid) might help here:
CREATE INDEX idx ON yourTable (Licenceno, fileid);
You may also try the same composite index in the reverse order:
CREATE INDEX idx2 ON yourTable (fileid, Licenceno);
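As a quick sanity check of the NOT EXISTS rewrite, here is a small sketch on made-up data, run through Python's sqlite3 module rather than SQL Server. The table name archive, the sample licence values and the index name are all assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical archive table: one row per (Licenceno, fileid).
conn.execute(
    "create table archive (Licenceno text, fileid integer, "
    "primary key (Licenceno, fileid))"
)
# Composite index in the reverse order, as suggested in the answer.
conn.execute("create index idx_fileid_licenceno on archive (fileid, Licenceno)")
conn.executemany(
    "insert into archive values (?, ?)",
    [
        ("A", 121), ("B", 121),
        ("A", 122), ("C", 122),
        ("A", 123), ("C", 123), ("D", 123), ("E", 123),
    ],
)

DELTA_QUERY = """
select count(*)
from archive t1
where t1.fileid = :fileid
  and not exists (
        select 1
        from archive t2
        where t2.Licenceno = t1.Licenceno
          and t2.fileid < :fileid
      )
"""

# Licences D and E appear for the first time in file 123.
new_in_123 = conn.execute(DELTA_QUERY, {"fileid": 123}).fetchone()[0]
print(new_in_123)
```

This only shows that the query counts the right rows. Whether it is faster than NOT IN on the real table depends on the indexes and the plan SQL Server picks.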

Why not use count(distinct)?
select count(distinct Licenceno)
from table
where fileid = 123;
For this query, you want an index on (fileid, Licenceno).
You are complicating the logic by thinking sequentially ("have I seen this Licenceno already?"). Instead, you just want to count the distinct values.
EDIT:
For this problem, you can try two levels of aggregation:
select count(*)
from (select Licenceno, min(fileid) as min_fileid
      from t
      where fileid <= 123
      group by Licenceno
     ) t
where min_fileid = 123;
How good the performance is relative to other approaches depends on how selective <= 123 is.

You could also use LAG for this
SELECT COUNT(*)
FROM (SELECT fileid,
             LAG(fileid) OVER (PARTITION BY Licenceno ORDER BY fileid) AS prevFileID
      FROM TABLE
      WHERE fileid <= 123) D
WHERE fileid = 123
  AND prevFileID IS NULL
... or an aggregation query ...
WITH T
AS (SELECT 1 AS Flag
    FROM TABLE
    WHERE fileid <= 123
    GROUP BY Licenceno
    HAVING MIN(fileid) = 123)
SELECT COUNT(*)
FROM T
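To see that the aggregation form counts the same rows as the original NOT IN query, here is a small comparison sketch. It runs on SQLite through Python's sqlite3 module, with an assumed table name archive and invented sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table archive (Licenceno text, fileid integer)")
conn.executemany(
    "insert into archive values (?, ?)",
    [
        ("A", 121), ("B", 121),
        ("A", 122), ("C", 122),
        ("A", 123), ("C", 123), ("D", 123), ("E", 123),
    ],
)

# The question's original formulation.
NOT_IN_QUERY = """
select count(*)
from archive
where fileid = :fileid
  and Licenceno not in (select Licenceno from archive where fileid < :fileid)
"""

# Aggregation form: a licence is new in a file when that file is the
# smallest fileid it has ever appeared in.
AGG_QUERY = """
select count(*)
from (
    select Licenceno
    from archive
    where fileid <= :fileid
    group by Licenceno
    having min(fileid) = :fileid
) firsts
"""

deltas_not_in = {}
deltas_agg = {}
for fileid in (121, 122, 123):
    params = {"fileid": fileid}
    deltas_not_in[fileid] = conn.execute(NOT_IN_QUERY, params).fetchone()[0]
    deltas_agg[fileid] = conn.execute(AGG_QUERY, params).fetchone()[0]

print(deltas_not_in, deltas_agg)
```

Both forms agree on this data. Note that NOT IN behaves differently if Licenceno can be NULL, which the sketch does not cover.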

Modify my SQL Server query -- returns too many rows sometimes

I need to update the following query so that it only returns one child record (remittance) per parent (claim).
Table Remit_To_Activate contains exactly one date/timestamp per claim, which is what I wanted.
But when I join the full Remittance table to it, since some claims have multiple remittances with the same date/timestamps, the outermost query returns more than 1 row per claim for those claim IDs.
SELECT * FROM REMITTANCE
WHERE BILLED_AMOUNT>0 AND ACTIVE=0
AND REMITTANCE_UUID IN (
    SELECT REMITTANCE_UUID FROM Claims_Group2 G2
    INNER JOIN Remit_To_Activate t ON (
        (t.ClaimID = G2.CLAIM_ID) AND
        (t.DATE_OF_LATEST_REGULAR_REMIT = G2.CREATE_DATETIME)
    )
    where ACTIVE=0 and BILLED_AMOUNT>0
)
I believe the problem would be resolved if I included REMITTANCE_UUID as a column in Remit_To_Activate. That's the REAL issue. This is how I created the Remit_To_Activate table (trying to get the most recent remittance for a claim):
SELECT MAX(create_datetime) as DATE_OF_LATEST_REMIT,
       MAX(claim_id) AS ClaimID
INTO Latest_Remit_To_Activate
FROM Claims_Group2
WHERE BILLED_AMOUNT>0
GROUP BY Claim_ID
ORDER BY Claim_ID
Claims_Group2 contains these fields:
REMITTANCE_UUID,
CLAIM_ID,
BILLED_AMOUNT,
CREATE_DATETIME
Here are the 2 rows that are currently giving me the problem. They're both remits for the SAME CLAIM, with the SAME TIMESTAMP. I only want one of them in the Remit_To_Activate table, so only ONE remittance will be "activated" per claim:
[screenshot of the two rows not included]

You can change your query like this:
SELECT
    p.*, latest_remit.DATE_OF_LATEST_REMIT
FROM
    Remittance AS p
    INNER JOIN
    (SELECT MAX(create_datetime) AS DATE_OF_LATEST_REMIT,
            claim_id
     FROM Claims_Group2
     WHERE BILLED_AMOUNT>0
     GROUP BY claim_id) AS latest_remit
    ON latest_remit.claim_id = p.claim_id;
The derived table returns only one row per claim. Untested (so please run and adjust as needed).

Without having more information on the structure of your database -- especially the structure of Claims_Group2 and REMITTANCE, and the relationship between them, it's not really possible to advise you on how to introduce a remittance UUID into DATE_OF_LATEST_REMIT.
Since you are using SQL Server, however, it is possible to use a window function to introduce a synthetic means to choose among remittances having the same timestamp. For example, it looks like you could approach the problem something like this:
select *
from (
    select
        r.*,
        row_number() over (partition by cg2.claim_id order by cg2.create_datetime desc) as rn
    from
        remittance r
        join claims_group2 cg2
            on r.remittance_uuid = cg2.remittance_uuid
    where
        r.active = 0
        and r.billed_amount > 0
        and cg2.active = 0
        and cg2.billed_amount > 0
) t
where t.rn = 1
Note that this does not depend on your DATE_OF_LATEST_REMIT table at all, it having been subsumed into the inline view. Note also that this will introduce one extra column into your results, though you could avoid that by enumerating the columns of table remittance in the outer select clause.
It also seems odd to be filtering on two sets of active and billed_amount columns, but that appears to follow from what you were doing in your original queries. In that vein, I urge you to check the results carefully, as lifting the filter conditions on cg2 columns up to the level of the join to remittance yields a result that may return rows that the original query did not (but never more than one per claim_id).
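To illustrate how row_number() picks a single row among remittances that share a timestamp, here is a reduced sketch on SQLite (version 3.25 or later is needed for window functions) via Python's sqlite3 module. It uses only a simplified claims_group2 table with invented rows. The extra remittance_uuid sort key is my own addition to make the choice repeatable, not part of the answer above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified stand-in for Claims_Group2; the values are invented.
conn.execute(
    "create table claims_group2 (remittance_uuid text primary key, "
    "claim_id integer, billed_amount real, create_datetime text)"
)
conn.executemany(
    "insert into claims_group2 values (?, ?, ?, ?)",
    [
        ("u1", 1, 10.0, "2018-08-15 13:07:50.933"),
        ("u2", 1, 10.0, "2018-08-15 13:07:50.933"),  # same claim, same timestamp
        ("u3", 1, 10.0, "2018-01-01 00:00:00.000"),
        ("u4", 2, 5.0, "2017-12-31 10:00:00.000"),
    ],
)

QUERY = """
select remittance_uuid, claim_id, create_datetime
from (
    select cg2.*,
           row_number() over (
               partition by cg2.claim_id
               -- second sort key (assumption): breaks timestamp ties
               order by cg2.create_datetime desc, cg2.remittance_uuid
           ) as rn
    from claims_group2 cg2
    where cg2.billed_amount > 0
) t
where t.rn = 1
order by claim_id
"""

latest = conn.execute(QUERY).fetchall()
for row in latest:
    print(row)
```

Without the second sort key the query still returns one row per claim, but which of the tied rows wins is up to the engine.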

A co-worker offered me this elegant demonstration of a solution. I'd never used "over" or "partition" before. Works great! Thank you John and Gaurasvsa for your input.
if OBJECT_ID('tempdb..#t') is not null
    drop table #t
select *, ROW_NUMBER() over (partition by CLAIM_ID order by CLAIM_ID) as ROW_NUM
into #t
from
(
    select '2018-08-15 13:07:50.933' as CREATE_DATE, 1 as CLAIM_ID, NEWID() as REMIT_UUID
    union select '2018-08-15 13:07:50.933', 1, NEWID()
    union select '2017-12-31 10:00:00.000', 2, NEWID()
) x
select *
from #t
order by CLAIM_ID, ROW_NUM
select CREATE_DATE, MAX(CLAIM_ID), MAX(REMIT_UUID)
from #t
where ROW_NUM = 1
group by CREATE_DATE

Order by data as per supplied Id in sql

Query:
SELECT *
FROM [MemberBackup].[dbo].[OriginalBackup]
where ration_card_id in
(
1247881,174772,
808454,2326154
)
Right now the data is ordered by the auto id or by whatever clause I pass in ORDER BY.
But I want the data to come back in the same sequence as the IDs I have passed.
Expected Output:
All Data for 1247881
All Data for 174772
All Data for 808454
All Data for 2326154
Note:
The number of IDs to be passed will be 300,000.

One option would be to create a CTE containing the ration_card_id values and the order which you are imposing, and then join to this table:
WITH cte AS (
    SELECT 1247881 AS ration_card_id, 1 AS position
    UNION ALL
    SELECT 174772, 2
    UNION ALL
    SELECT 808454, 3
    UNION ALL
    SELECT 2326154, 4
)
SELECT t1.*
FROM [MemberBackup].[dbo].[OriginalBackup] t1
INNER JOIN cte t2
    ON t1.ration_card_id = t2.ration_card_id
ORDER BY t2.position
Edit:
If you have many IDs, then neither the answer above nor the answer given using a CASE expression will suffice. In this case, your best bet would be to load the list of IDs into a table, containing an auto increment ID column. Then, each number would be labelled with a position as its record is being loaded into your database. After this, you can join as I have done above.
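The "load the IDs into a table" idea could be sketched like this, using SQLite through Python's sqlite3 module instead of SQL Server. The id_list table, its position column and the member column are hypothetical names for illustration. In SQL Server an IDENTITY column would play the role of the autoincrement key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified stand-in for [MemberBackup].[dbo].[OriginalBackup].
conn.execute(
    "create table OriginalBackup (auto_id integer primary key, "
    "ration_card_id integer, member text)"
)
conn.executemany(
    "insert into OriginalBackup (ration_card_id, member) values (?, ?)",
    [
        (174772, "m1"), (1247881, "m2"), (2326154, "m3"),
        (808454, "m4"), (1247881, "m5"), (999, "m6"),
    ],
)

# The auto-increment key records the order in which the ids were loaded.
conn.execute(
    "create table id_list (position integer primary key autoincrement, "
    "ration_card_id integer not null)"
)
wanted_ids = [1247881, 174772, 808454, 2326154]
conn.executemany(
    "insert into id_list (ration_card_id) values (?)",
    [(ration_card_id,) for ration_card_id in wanted_ids],
)

ordered = conn.execute(
    """
    select t1.ration_card_id, t1.member
    from OriginalBackup t1
    join id_list t2 on t1.ration_card_id = t2.ration_card_id
    order by t2.position, t1.auto_id
    """
).fetchall()

for row in ordered:
    print(row)
```

The join also replaces the long IN list, which matters once the list grows to hundreds of thousands of ids.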

If the desired order does not reflect a sequential ordering of some preexisting data, you will have to specify the ordering yourself. One way to do this is with a case statement:
SELECT *
FROM [MemberBackup].[dbo].[OriginalBackup]
where ration_card_id in
(
    1247881,174772,
    808454,2326154
)
ORDER BY CASE ration_card_id
    WHEN 1247881 THEN 0
    WHEN 174772 THEN 1
    WHEN 808454 THEN 2
    WHEN 2326154 THEN 3
END
Stating the obvious, but note that this ordering is most likely not represented by any index, so the sort cannot use one.

Insert your ration_card_ids into a #temp table that has an identity column.
Rewrite your SQL query as:
SELECT a.*
FROM [MemberBackup].[dbo].[OriginalBackup] a
JOIN #temp b
    ON a.ration_card_id = b.ration_card_id
ORDER BY b.id

SQL Server : Update Flag on Max Date on Foreign key

I'm trying to do this update but for some reason I cannot quite master SQL sub queries.
My table structure is as follows:
id fk date activeFlg
--- -- ------- ---------
1 1 04/10/11 0
2 1 02/05/99 0
3 2 09/10/11 0
4 3 11/28/11 0
5 3 12/25/98 0
Ideally I would like to set activeFlg to 1 for the row with the most recent date for each distinct foreign key. For instance, after running my query, ids 1, 3 and 4 will have the active flag set to 1.
The closest thing I came up with was a query returning all of the max dates for each distinct fk:
SELECT MAX(date)
FROM table
GROUP BY fk
But since I can't even come up with the subquery, there is no way I can proceed :/
Can somebody please give me some insight on this. I'm trying to really learn more about sub queries so an explanation would be greatly appreciated.
Thank you!

You need to select the fk too and then restrict by that, so
SELECT fk, MAX([date])
FROM yourTable
GROUP BY fk
becomes
WITH Ones2update AS
(
    SELECT fk, MAX([date]) AS maxdate
    FROM yourTable
    GROUP BY fk
)
UPDATE t
SET activeFlg = 1
FROM yourTable t
JOIN Ones2update u ON t.fk = u.fk AND t.[date] = u.maxdate
Also, I would test first, so run this query beforehand
WITH Ones2update AS
(
    SELECT fk, MAX([date]) AS maxdate
    FROM yourTable
    GROUP BY fk
)
SELECT t.fk, t.[date], t.activeFlg
FROM yourTable t
JOIN Ones2update u ON t.fk = u.fk AND t.[date] = u.maxdate
to make sure you are getting what you expect and that I did not make any typos.
Additional note: I use a join instead of a sub-query -- they are logically the same but I always find joins to be clearer (once I got used to using joins). Depending on the optimizer they can be faster.
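Since the question asked specifically about sub-queries, here is a sketch of the sub-query form of the same update, run on SQLite via Python's sqlite3 module. The table name records is hypothetical, and the dates are rewritten as ISO strings (an assumption) so that MAX compares them correctly as text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table records (id integer primary key, fk integer, "
    "date text, activeFlg integer default 0)"
)
# The question's sample rows, with dates converted to ISO format.
conn.executemany(
    "insert into records (id, fk, date) values (?, ?, ?)",
    [
        (1, 1, "2011-04-10"),
        (2, 1, "1999-02-05"),
        (3, 2, "2011-09-10"),
        (4, 3, "2011-11-28"),
        (5, 3, "1998-12-25"),
    ],
)

# For each row, the correlated sub-query finds the latest date among
# the rows sharing its fk; only the row holding that date is flagged.
conn.execute(
    """
    update records
    set activeFlg = 1
    where date = (
        select max(m.date)
        from records m
        where m.fk = records.fk
    )
    """
)

flagged = [
    row[0]
    for row in conn.execute(
        "select id from records where activeFlg = 1 order by id"
    )
]
print(flagged)
```

If two rows of the same fk share the latest date, both get flagged. That is also true of the join version.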

This is the general idea. You can flesh out the details.
update t
set activeFlg = 1
from yourTable t
join (
    select fk, max([date]) as maxdate
    from yourTable
    group by fk
) sq on t.fk = sq.fk and t.[date] = sq.maxdate

How do I SELECT TOP X where it INCLUDES records based on a criteria?

I have a table, with multiple columns, including a column named "PolicyNumber"
Here's a sample:
PolicyNumber
NYH1111
NYD2222
SCH3333
SCS4444
LUH5555
LUS6666
ALH7777
ALW8888
VAH9999
AKH0000
...
NYH1010
NYD2318
There are 1,000+ records in this table and records contain several of each policy number types. For example, multiple policies starting with "NYH" or multiple policies starting with "VAH."
The possible policy types are here:
NYH
NYD
SCH
SCS
LUH
LUS
ALH
ALW
VAH
AKH
How do I do a SELECT TOP 300 where it'll INCLUDE at least one of each Policy Type? Remember, a policy type is the first 3 letters of a policy number.
Is this even possible? The purpose of this is that I have to grab 300 records from production to dump into a test environment and I need to include at least 1 of each policy. After I have at least one of each, it can be completely randomized.

You can try this:
In this solution, newid() first generates a different random order on each run.
To achieve the "at least one from each policy" goal, I added the AtLeastOne column. It selects the first row from the randomized CTE for each unique three-letter prefix. If the current Policy equals this first selected value, the column is 1, otherwise 0. With this logic, you can select a randomized first row for each unique three-letter prefix.
Note: You can put this logic directly into the ORDER BY part too if you need the Policy field only. (I wrote the example this way to make the logic behind it visible.)
In the last step you just have to order by AtLeastOne DESC and then by the random ID.
WITH CTE_Policy
AS
(
    SELECT newid() as ID, Policy
    FROM Code
)
SELECT TOP 300
    Policy,
    CASE WHEN Policy = (SELECT TOP 1 Policy FROM CTE_Policy c
                        WHERE SUBSTRING(c.Policy,1,3) =
                              SUBSTRING(CTE_Policy.Policy,1,3))
         THEN 1 ELSE 0 END AS AtLeastOne
FROM CTE_Policy
ORDER BY AtLeastOne DESC, ID
Here is an SQLFiddle demo.

Off the top of my head, you could do:
SELECT TOP 30 Column1, Column2, Column3, PolicyNumber
FROM YourTable
WHERE PolicyNumber LIKE 'NYH%'
UNION
SELECT TOP 30 Column1, Column2, Column3, PolicyNumber
FROM YourTable
WHERE PolicyNumber LIKE 'NYD%'
UNION
/* ... remaining eight policy types go here */
ORDER BY PolicyNumber /* Or whatever sort order you want */
It will give you 30 of each type every time, instead of X of one type, and Y of another, however.

One quick way that comes to mind: the query below will grab just 1 record per policy type.
SELECT TOP 300 *
FROM (
    SELECT *, rank1 = ROW_NUMBER() OVER (PARTITION BY LEFT(PolicyNo, 3) ORDER BY GETDATE())
    FROM MyTable
) AS t1
WHERE t1.rank1 = 1

Try this for SQL Server 2005+:
;WITH CTE AS
(
    SELECT LEFT(PolicyNumber, 3) PolicyType, PolicyNumber,
           ROW_NUMBER() OVER(PARTITION BY LEFT(PolicyNumber, 3) ORDER BY NEWID()) RN
    FROM YourTable
)
SELECT TOP 300 PolicyNumber
FROM CTE
ORDER BY RN, NEWID()

Borrowed from ClearLogic (+1). Please give the check to ClearLogic if this works.
The problem with WHERE t1.rank1 = 1 is that it will stop short of 300 if there are fewer than 300 unique policy types.
SELECT TOP 300 t1.PolicyNo
FROM (
    SELECT PolicyNo, rank1 = ROW_NUMBER()
        OVER (PARTITION BY LEFT(PolicyNo, 3) ORDER BY NEWID())
    FROM MyTable
) AS t1
ORDER BY t1.rank1, t1.PolicyNo
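The "rank within each policy type, then take the lowest ranks first" idea can be tried out with the sketch below. It runs on SQLite (3.25 or later, for window functions) through Python's sqlite3 module, so substr(), random() and LIMIT stand in for T-SQL's LEFT(), NEWID() and TOP. The table contents are invented, and the sample size is shrunk from 300 to 5.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table MyTable (PolicyNo text primary key)")

# Deliberately skewed data: 20 NYH policies, 3 VAH, 1 AKH.
policies = ["NYH%04d" % n for n in range(1, 21)]
policies += ["VAH%04d" % n for n in range(1, 4)]
policies += ["AKH0001"]
conn.executemany(
    "insert into MyTable values (?)", [(policy,) for policy in policies]
)

SAMPLE_QUERY = """
select PolicyNo
from (
    select PolicyNo,
           row_number() over (
               partition by substr(PolicyNo, 1, 3)
               order by random()
           ) as rank1
    from MyTable
) t1
order by rank1, random()
limit :sample_size
"""

sample = [row[0] for row in conn.execute(SAMPLE_QUERY, {"sample_size": 5})]
types_in_sample = {policy[:3] for policy in sample}
print(sample, types_in_sample)
```

Every policy type has exactly one row with rank1 = 1, so those rows sort first and each type is covered as long as the sample size is at least the number of types. The remaining slots are filled from the higher ranks.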
