SQL - how to efficiently select distinct records - sql

I've got a very performance sensitive SQL Server DB. I need to make an efficient select on the following problem:
I've got a simple table with 4 fields:
ID [int, PK]
UserID [int, FK]
Active [bit]
GroupID [int, FK]
Each UserID can appear several times with a GroupID (and in several groupIDs) with Active='false' but only once with Active='true'.
Such as:
(id,userid,active,groupid)
1,2,false,10
2,2,false,10
3,2,false,10
4,2,true,10
I need to select all the distinct users from the table in a certain group, where it should hold the last active state of the user. If the user has an active state - it shouldn't return an inactive state of the user, if it has been such at some point in time.
The naive solution would be a double select - one to select all the active users and then one to select all the inactive users which don't appear in the first select statement (because each user could have had an inactive state at some point in time). But this would run the first select (with the active users) twice - which is very unwanted.
Is there any smart way to make only one select to get the needed query? Ideas?
Many thanks in advance!

What about a view such as this :
create view ACTIVE as select * from USERS where Active = 1
Then just one select from that view will be sufficient :
select user from ACTIVE where ID ....

Try this:
Select
ug.GroupId,
ug.UserId,
max(cast(ug.Active as int)) LastState -- MAX does not accept a bit column in SQL Server, so cast it first
from
UserGroup ug
group by
ug.GroupId,
ug.UserId
If the active field is set to 1 for a user / group combination you will get the 1, if not you will get a 0 for the last state.
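The grouping idea above can be tried out with a small, self-contained sketch. It uses Python's built-in sqlite3 module as a stand-in for SQL Server, so Active is a plain 0/1 integer and no cast is needed. The UserGroup table name comes from the answer; the rows for user 3 and the last_states helper are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE UserGroup (
        ID      INTEGER PRIMARY KEY,
        UserId  INTEGER,
        Active  INTEGER,   -- 0/1 stand-in for SQL Server's bit
        GroupId INTEGER
    );
    -- User 2 is the example from the question; user 3 (never active) is hypothetical.
    INSERT INTO UserGroup (ID, UserId, Active, GroupId) VALUES
        (1, 2, 0, 10), (2, 2, 0, 10), (3, 2, 0, 10), (4, 2, 1, 10),
        (5, 3, 0, 10), (6, 3, 0, 10);
""")

def last_states(group_id):
    """One row per user in the group: 1 if an active row exists, else 0."""
    return conn.execute(
        """
        SELECT UserId, MAX(Active) AS LastState
        FROM UserGroup
        WHERE GroupId = ?
        GROUP BY UserId
        ORDER BY UserId
        """,
        (group_id,),
    ).fetchall()

print(last_states(10))  # [(2, 1), (3, 0)]
```

Each user comes back exactly once, and the table is only scanned once for the group.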

I'm not a big fan of the use of an "isActive" column the way you're doing it. This requires two UPDATEs to change an active status and has the effect of storing the information about the active status several times in the different records.
Instead, I would remove the active field and do one of the following two things:
If you already have a table somewhere in which (userid, groupid) is (or could be) a PRIMARY KEY or UNIQUE INDEX then add the active column to that table. When a user becomes active or inactive with respect to a particular group, update only that single record with true or false.
If such a table does not already exist then create one with (userid, groupid) as the PRIMARY KEY and the field active, and then treat the table as above.
In either case, you only need to query this table (without aggregation) to determine the users' status with respect to the particular group. Equally importantly, you only store the true or false value one time and only need to UPDATE a single value to change the status. Finally, this table acts as the place in which you can store other information specific to that user's membership in that group that applies only once per membership, not once per change-in-status.
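A minimal sketch of that design, again run on SQLite through Python's sqlite3 module rather than SQL Server. The membership table name and the set_active and group_status helpers are hypothetical; only the (userid, groupid) key and the active column come from the description above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- One row per (user, group) pair; the status is stored exactly once.
    CREATE TABLE membership (
        userid  INTEGER NOT NULL,
        groupid INTEGER NOT NULL,
        active  INTEGER NOT NULL DEFAULT 0,
        PRIMARY KEY (userid, groupid)
    );
    INSERT INTO membership (userid, groupid, active) VALUES
        (2, 10, 0), (3, 10, 1), (2, 11, 1);
""")

def set_active(userid, groupid, active):
    """Changing the status is a single-row UPDATE."""
    with conn:
        conn.execute(
            "UPDATE membership SET active = ? WHERE userid = ? AND groupid = ?",
            (int(active), userid, groupid),
        )

def group_status(groupid):
    """No aggregation needed: every user appears once per group."""
    return conn.execute(
        "SELECT userid, active FROM membership WHERE groupid = ? ORDER BY userid",
        (groupid,),
    ).fetchall()

set_active(2, 10, True)
print(group_status(10))  # [(2, 1), (3, 1)]
```

If the history of status changes matters, it would have to live in a separate table; this sketch only keeps the current state.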

Try this:
SELECT t.* FROM tbl t
INNER JOIN (
SELECT MAX(id) id
FROM tbl
GROUP BY userid
) m
ON t.id = m.id

I'm not sure I understand what you want your query to return, but anyway: this query gives you the users in a group whose last entry is active. It uses row_number(), so you need at least SQL Server 2005.
Table definition:
create table YourTable
(
ID int identity primary key,
UserID int,
Active bit,
GroupID int
)
Index to support the query:
create index IX_YourTable_GroupID on YourTable(GroupID) include(UserID, Active)
Sample data:
insert into YourTable values
(1, 0, 10),
(1, 0, 10),
(1, 0, 10),
(1, 1, 10),
(2, 0, 10),
(2, 1, 10),
(2, 0, 10),
(3, 1, 10)
Query:
declare @GroupID int = 10
;with C as
(
select UserID,
Active,
row_number() over(partition by UserID order by ID desc) as rn
from YourTable as T
where T.GroupID = @GroupID
)
select UserID
from C
where rn = 1 and
Active = 1
Result:
UserID
-----------
1
3

Related

Remove clients who don't have 2 rows by their name in SQL

I'm trying to filter for the clients that registered at least twice in the DB. I need to know which of them came back, so I'm working with a table that records every registration in the system, as follows:
order #   client   date
-----------------------
One       Andrew   XX
Two       Andrew   XX+1
Three     Andrew   XX+2
One       David    YY
One       Marc     ZZ
Two       Marc     ZZ+1
In this case I want to delete David's record, as I only want people who have order numbers other than "one".
I tried this SQL:
select *
from table
where order_number > 1
However, this removes all the first-order rows, including those of the clients that came back.
Does somebody know an easy way to compare rows by client name and filter on that, or how I could delete the rows of clients that have only one entry?
You need something like this:
select * from yourtable t
where exists (select 1 from yourtable t2 where t2.client = t.client and t2.order_number > 1)
or:
select client
from tablename
group by client
having count(*) > 1
CREATE TABLE records (
ID INTEGER PRIMARY KEY,
order_number TEXT NOT NULL,
client TEXT NOT NULL,
date DateTime NOT NULL
);
INSERT INTO records VALUES (1, 'ONE', 'Andrew', '01.01.1999');
INSERT INTO records VALUES (2, 'TWO', 'Andrew', '02.02.1999');
INSERT INTO records VALUES (3, 'THREE', 'Andrew', '03.03.1999');
INSERT INTO records VALUES (4, 'ONE', 'David', '01.01.1999');
INSERT INTO records VALUES (5, 'ONE','Marc', '01.01.1999');
INSERT INTO records VALUES (6, 'TWO','Marc', '01.03.1999');
-- Delete every row that belongs to a client with only one entry
DELETE FROM records WHERE client IN
(
SELECT client FROM records
GROUP BY client HAVING COUNT(client) = 1
);
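To see the GROUP BY ... HAVING idea end to end, here is a small sketch on SQLite via Python's sqlite3 module. The records table mirrors the sample in the question (the date column is left out for brevity), and the returning and remaining_clients names are just for illustration. It first selects the full rows of clients with more than one order, then deletes the clients that have only a single row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE records (
        id           INTEGER PRIMARY KEY,
        order_number TEXT NOT NULL,
        client       TEXT NOT NULL
    );
    INSERT INTO records (id, order_number, client) VALUES
        (1, 'ONE', 'Andrew'), (2, 'TWO', 'Andrew'), (3, 'THREE', 'Andrew'),
        (4, 'ONE', 'David'),
        (5, 'ONE', 'Marc'), (6, 'TWO', 'Marc');
""")

# Full rows (first orders included) of clients that appear more than once.
returning = conn.execute("""
    SELECT id, order_number, client
    FROM records
    WHERE client IN (SELECT client FROM records
                     GROUP BY client HAVING COUNT(*) > 1)
    ORDER BY id
""").fetchall()

# Remove the clients that only ever registered once.
with conn:
    conn.execute("""
        DELETE FROM records
        WHERE client IN (SELECT client FROM records
                         GROUP BY client HAVING COUNT(*) = 1)
    """)

remaining_clients = [row[0] for row in conn.execute(
    "SELECT DISTINCT client FROM records ORDER BY client")]
print(remaining_clients)  # ['Andrew', 'Marc']
```

Filtering on the client column (not on a count or an ID) is what keeps the first orders of returning clients in the result.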

Merge multiple rows that share the same name into a new row holding the sum of a column that differs between them

I have table like this :
id name qt
----------------
0 mm 4
1 mm 5
2 xx 8
I want to update it, or get a new table, that will produce this kind of result:
id name qt
------------------
0 mm 9 (sum of the two or more rows with the same name)
1 xx 8
Including the id column will cause the GROUP BY to fail since multiple records are being summed that have different ids.
SELECT name, SUM(qt) as qt_sum
FROM table GROUP BY name
SELECT ROW_NUMBER() OVER (ORDER BY name) AS id
, name
, SUM(qt) AS qt
FROM YourTableName
GROUP BY name
ORDER BY name
I'm making the assumption that the id field doesn't actually mean anything, because the id of the record xx changes between your two visuals. That's why I'm setting it with ROW_NUMBER(), so it increments for each distinct name. If this isn't the case, remove the ROW_NUMBER() expression and add id to the GROUP BY clause. This does mean that the id assigned to a given name may change depending on the number of distinct names.
If you really need an id column you could create one like this...
create table Test (id int, name varchar(10), qt int)
insert into Test values (0, 'mm', 4)
insert into Test values (1, 'mm', 5)
insert into Test values (2, 'xx', 8)
select
row_number() over (order by name) - 1 as id
, name
, sum(qt) as qt
from Test
group by name
There may be some cases where this does not work for you, but with such limited sample data it is hard to tell.
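If the goal is to update the existing table in place rather than produce a new result set, one possible approach (not the one used in the answers above) is to write each name's total onto its lowest-id row and then delete the other rows. The sketch below runs on SQLite via Python's sqlite3 module; the items table name is hypothetical, and note that the surviving rows keep their original ids instead of being renumbered.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE items (id INTEGER, name TEXT, qt INTEGER);
    INSERT INTO items VALUES (0, 'mm', 4), (1, 'mm', 5), (2, 'xx', 8);
""")

with conn:  # both statements run in one transaction
    # Step 1: put the per-name total on the row with the lowest id.
    conn.execute("""
        UPDATE items
        SET qt = (SELECT SUM(i2.qt) FROM items i2 WHERE i2.name = items.name)
        WHERE id = (SELECT MIN(i3.id) FROM items i3 WHERE i3.name = items.name)
    """)
    # Step 2: drop every other row of the same name.
    conn.execute("""
        DELETE FROM items
        WHERE id <> (SELECT MIN(i3.id) FROM items i3 WHERE i3.name = items.name)
    """)

merged = conn.execute("SELECT id, name, qt FROM items ORDER BY id").fetchall()
print(merged)  # [(0, 'mm', 9), (2, 'xx', 8)]
```

If consecutive ids are required, the ROW_NUMBER() queries above are the simpler route.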

avoiding group by for column used in datediff?

As the database is currently constructed, I can only use a date field of a certain table in a DATEDIFF function when that table is also part of a COUNT aggregation (not the date field itself, but the entity whose date field is not null). The GROUP BY at the end messes up the counting, since that one entry is counted on its own, as its own group.
In some detail:
Our lead recruiter wants a report that shows the number of applications and conducted interviews per opening. So far no problem. Additionally, he would like to see the total duration per opening, from making it public to signing a new employee, and of course only if the opening has already been filled.
I have 4 tables to join:
table 1 holds the data of the opening
table 2 has the single applications
table 3 has the interview data of the applications
table 4 has the data regarding the publication of the openings (with the date when a certain opening was made public)
The problem is the duration requirement. table 4 holds the starting point, and in table 2 one (or no) applicant per opening has a date field filled with the time he returned a signed contract, at which point the opening counts as filled. When I use that field in a DATEDIFF I'm forced to also put that column in the GROUP BY clause, and that results in 2 rows per opening: one row has all the numbers as wanted, and the second row always holds that one person who has an entry in that date field...
So far I haven't got far in thinking of a way to avoid that problem, except for explaining to the colleague that he gets his time-to-fill number in another report.
SELECT
table1.col1 as NameOfProject,
table1.col2 as Company,
table1.col3 as OpeningType,
table1.col4 as ReasonForOpening,
count (table2.col2) as NumberOfApplications,
sum (case when table2.colSTATUS = 'withdrawn' then 1 else 0 end) as NumberOfApplicantsWhoWithdraw,
sum (case when table3.colTypeInterview = 'PhoneInterview' then 1 else 0 end) as NumberOfPhoneInterview,
...more sum columns...,
table1.finished, -- shows "1" if opening is occupied
DATEDIFF(day, table4.colValidFrom, table2.colContractReceived) as DaysToCompletion -- colContractReceived is the problematic column
FROM
table2 left join table3 on table2.REF_NR = table3.REF_NR
join table1 on table2.PROJEKT = table1.KBEZ
left join table4 on table1.REFNR = table4.PRJ_REFNR
GROUP BY
table2.colContractReceived, -- the problematic column
-- ...and all other columns except the ones in aggregate (sum and count) functions go in the GROUP BY section
ORDER BY table1.NameOfProject
Here is a short rebuild of what it looks like. First a row where the opening is not filled and all aggregations come out in one row as wanted. The next project/opening shows up double, because the field used in the datediff is grouped independently...
project   company  no_of_applications  no_of_phoneinterview  no_of_personalinterview  ...  time_to_fill_in_days  filled?
2018_312  comp a   27                  4                     2                        ...  null                  0
2018_313  comp b   54                  7                     4                        ...  null                  0
2018_313  comp b   1                   1                     1                        ...  42                    1
I'd be glad to get any idea how to solve this. Thanks for considering my request!
(During the 'translation' of all the specific column and table names I might have built in a syntax error here and there, but the query worked well except for that unwanted extra aggregation per filled opening.)
If I've understood your requirement properly, I believe the issue you are having is that you need to show the number of days between the starting point and the time at which an applicant responded to an opening; however, this must only show a single row per opening, based on whether or not the position was filled (if the position was filled, then show that row; if not, then show the unfilled row).
I've achieved this result by assuming that you count a position as filled using the "ContractsReceived" column. This may be wrong; however, the principle should still provide what you are looking for.
I've essentially wrapped your query in a subquery, performed a rank ordering by the ContractsReceived column descending, and partitioned by the project. Then in the outer query I filter for the first instance of this ranking.
Even if my assumption about the column structure and data types is wrong, this should provide you with a model to work with.
The only issue you might have with this ranking solution is if you want to aggregate over both rows within one (so include all of the summed columns for both the position filled and position not filled row per project). If this is the case let me know and we can work around that.
Please let me know if you have any questions.
declare @table1 table (
REFNR int,
NameOfProject nvarchar(20),
Company nvarchar(20),
OpeningType nvarchar(20),
ReasonForOpening nvarchar(20),
KBEZ int
);
declare @table2 table (
NumberOfApplications int,
Status nvarchar(15),
REF_NR int,
ReturnedApplicationDate datetime,
ContractsReceived bit,
PROJEKT int
);
declare @table3 table (
TypeInterview nvarchar(25),
REF_NR int
);
declare @table4 table (
PRJ_REFNR int,
StartingPoint datetime
);
insert into @table1 (REFNR, NameOfProject, Company, OpeningType, ReasonForOpening, KBEZ)
values (1, '2018_312', 'comp a' ,'Permanent', 'Business growth', 1),
(2, '2018_313', 'comp a', 'Permanent', 'Business growth', 2),
(3, '2018_313', 'comp a', 'Permanent', 'Business growth', 3);
insert into @table2 (NumberOfApplications, Status, REF_NR, ReturnedApplicationDate, ContractsReceived, PROJEKT)
values (27, 'Processed', 4, '2018-04-01 08:00', 0, 1),
(54, 'Withdrawn', 5, '2018-04-02 10:12', 0, 2),
(1, 'Processed', 6, '2018-04-15 15:00', 1, 3);
insert into @table3 (TypeInterview, REF_NR)
values ('Phone', 4),
('Phone', 5),
('Personal', 6);
insert into @table4 (PRJ_REFNR, StartingPoint)
values (1, '2018-02-25 08:00'),
(2, '2018-03-04 15:00'),
(3, '2018-03-04 15:00');
select * from
(
SELECT
-- rank the rows of each project so that a filled row (ContractsReceived = 1) comes first
RANK() OVER (Partition by NameOfProject, Company order by ContractsReceived desc) as rowno,
table1.NameOfProject,
table1.Company,
table1.OpeningType,
table1.ReasonForOpening,
case when ContractsReceived > 0 then datediff(DAY, StartingPoint, ReturnedApplicationDate) else null end as TimeToFillInDays,
ContractsReceived Filled
FROM
@table2 table2 left join @table3 table3 on table2.REF_NR = table3.REF_NR
join @table1 table1 on table2.PROJEKT = table1.KBEZ
left join @table4 table4 on table1.REFNR = table4.PRJ_REFNR
group by NameOfProject, Company, OpeningType, ReasonForOpening, ContractsReceived,
StartingPoint, ReturnedApplicationDate
) x where rowno=1
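A different way to get everything into one row per opening, without ranking, is to wrap the date column in an aggregate such as MAX so that it no longer has to appear in the GROUP BY. In T-SQL that would be something along the lines of DATEDIFF(day, MAX(table4.colValidFrom), MAX(table2.colContractReceived)). The sketch below shows the idea on SQLite via Python's sqlite3 module, where a julianday() difference stands in for DATEDIFF. The openings and applications tables, their columns, and the sample rows are simplified, hypothetical stand-ins for the four tables in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical schema: the publication date is kept on the opening itself.
    CREATE TABLE openings (id INTEGER PRIMARY KEY, name TEXT, valid_from TEXT);
    CREATE TABLE applications (
        id                INTEGER PRIMARY KEY,
        opening_id        INTEGER,
        status            TEXT,
        contract_received TEXT   -- NULL unless this applicant signed
    );
    INSERT INTO openings VALUES
        (1, '2018_312', '2018-02-25'),
        (2, '2018_313', '2018-03-04');
    INSERT INTO applications (opening_id, status, contract_received) VALUES
        (1, 'processed', NULL), (1, 'processed', NULL), (1, 'withdrawn', NULL),
        (2, 'processed', NULL), (2, 'withdrawn', NULL),
        (2, 'processed', '2018-04-15');
""")

report = conn.execute("""
    SELECT o.name,
           COUNT(a.id) AS applications,
           SUM(CASE WHEN a.status = 'withdrawn' THEN 1 ELSE 0 END) AS withdrawn,
           -- MAX ignores NULLs, so it picks the single signed contract (if any)
           -- and the date column does not need to be grouped on.
           CAST(julianday(MAX(a.contract_received)) - julianday(o.valid_from)
                AS INTEGER) AS days_to_fill
    FROM openings o
    LEFT JOIN applications a ON a.opening_id = o.id
    GROUP BY o.id, o.name, o.valid_from
    ORDER BY o.name
""").fetchall()

print(report)  # [('2018_312', 3, 1, None), ('2018_313', 3, 1, 42)]
```

This relies on there being at most one signed contract per opening, as described in the question; with several, MAX would silently pick the latest one.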

SQL Limit number of references to another table without locking

Is there a technique to avoid locking a row but still be able to limit the number of rows in another table that reference it?
For example:
create table accounts (
id integer,
name varchar,
max_users integer
);
create table users (
id integer,
account_id integer,
email varchar
);
I want to limit the number of users that are part of an account using the max_users value in accounts. Is there a way to ensure that concurrent calls won't create more users than permitted without locking the account row?
Something like this doesn't work: when it runs in two concurrent transactions, the select count(*)... condition can be true in both of them even when only one slot is left below the limit:
begin;
insert into users(id, account_id, email)
select 1, 1, 'john@abc.com' where (select count(*) from users where account_id = 1) < (select max_users from accounts where id = 1);
commit;
And the following works, but I'm having performance issues that are mostly caused by transactions waiting for locks:
begin;
select id from accounts where id = 1 for update;
insert into users(id, account_id, email)
select 1, 1, 'john@abc.com' where (select count(*) from users where account_id = 1) < (select max_users from accounts where id = 1);
commit;
EDIT: Bonus question: what if the value is not stored in the database, but is something you can set dynamically?

sql join using recursive cte

Edit: Added another case scenario in the notes and updated the sample attachment.
I am trying to write a SQL query that produces the output attached to this question, along with sample data.
There are two tables: one with distinct IDs (pk) and their current flag,
and another with an Active ID (fk to the pk of the first table) and an Inactive ID (fk to the pk of the first table).
The final output should return two columns: the first column consists of all distinct IDs from the first table, and the second column should contain the Active ID from the second table.
Below is the sql:
IF OBJECT_ID('tempdb..#main') IS NOT NULL DROP TABLE #main;
IF OBJECT_ID('tempdb..#merges') IS NOT NULL DROP TABLE #merges
IF OBJECT_ID('tempdb..#final') IS NOT NULL DROP TABLE #final
SELECT DISTINCT id,
current
INTO #main
FROM tb_ID t1
--get list of all active_id and inactive_id
SELECT DISTINCT active_id,
inactive_id,
Update_dt
INTO #merges
FROM tb_merges
-- Combine where the id from the main table matched to the inactive_id (should return all the rows from #main)
SELECT id,
active_id AS merged_to_id
INTO #final
FROM (SELECT t1.*,
t2.active_id,
Update_dt ,
Row_number()
OVER (
partition BY id, active_id
ORDER BY Update_dt DESC) AS rn
FROM #main t1
LEFT JOIN #merges t2
ON t1.id = t2.inactive_id) t3
WHERE rn = 1
SELECT *
FROM #final
This SQL partially works. It doesn't work where an ID was once active and then became inactive.
Please note:
the active ID should be the most recent active ID
an ID which doesn't have any active ID should return either null or the ID itself
for an ID where current = 0, the active ID should be the ID that is current in tb_ID
IDs may get interchanged. For example, there are two IDs, 6 and 7; when 6 is active 7 is inactive, and vice versa. The only way to know the most current active state is by the update date
The attached sample might be easier to understand.
It looks like I might have to use a recursive CTE to achieve the results. Can someone please help?
Thank you for your time!
I think you're correct that a recursive CTE looks like a good solution for this. I'm not entirely certain that I've understood exactly what you're asking for, particularly with regard to the update_dt column, just because the data is a little abstract as-is, but I've taken a stab at it, and it does seem to work with your sample data. The comments explain what's going on.
declare @tb_id table (id bigint, [current] bit);
declare @tb_merges table (active_id bigint, inactive_id bigint, update_dt datetime2);
insert @tb_id values
-- Sample data from the question.
(1, 1),
(2, 1),
(3, 1),
(4, 1),
(5, 0),
-- A few additional data to illustrate a deeper search.
(6, 1),
(7, 1),
(8, 1),
(9, 1),
(10, 1);
insert @tb_merges values
-- Sample data from the question.
(3, 1, '2017-01-11T13:09:00'),
(1, 2, '2017-01-11T13:07:00'),
(5, 4, '2013-12-31T14:37:00'),
(4, 5, '2013-01-18T15:43:00'),
-- A few additional data to illustrate a deeper search.
(6, 7, getdate()),
(7, 8, getdate()),
(8, 9, getdate()),
(9, 10, getdate());
if object_id('tempdb..#ValidMerge') is not null
drop table #ValidMerge;
-- Get the subset of merge records whose active_id identifies a "current" id and
-- rank by date so we can consider only the latest merge record for each active_id.
with ValidMergeCTE as
(
select
M.active_id,
M.inactive_id,
[Priority] = row_number() over (partition by M.active_id order by M.update_dt desc)
from
@tb_merges M
inner join @tb_id I on M.active_id = I.id
where
I.[current] = 1
)
select
active_id,
inactive_id
into
#ValidMerge
from
ValidMergeCTE
where
[Priority] = 1;
-- Here's the recursive CTE, which draws on the subset of merges identified above.
with SearchCTE as
(
-- Base case: any record whose active_id is not used as an inactive_id is an endpoint.
select
M.active_id,
M.inactive_id,
Depth = 0
from
#ValidMerge M
where
not exists (select 1 from #ValidMerge M2 where M.active_id = M2.inactive_id)
-- Recursive case: look for records whose active_id matches the inactive_id of a previously
-- identified record.
union all
select
S.active_id,
M.inactive_id,
Depth = S.Depth + 1
from
#ValidMerge M
inner join SearchCTE S on M.active_id = S.inactive_id
)
select
I.id,
S.active_id
from
@tb_id I
left join SearchCTE S on I.id = S.inactive_id;
Results:
id active_id
------------------
1 3
2 3
3 NULL
4 NULL
5 4
6 NULL
7 6
8 6
9 6
10 6
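For anyone who wants to experiment with just the chain-following part, here is a cut-down sketch on SQLite via Python's sqlite3 module. It deliberately leaves out the update_dt ranking and the current-flag filter from the answer, and the ids and merges tables with their few rows are hypothetical stand-ins for tb_ID and the already-filtered merge list.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ids (id INTEGER PRIMARY KEY);
    CREATE TABLE merges (active_id INTEGER, inactive_id INTEGER);
    INSERT INTO ids VALUES (1), (2), (3), (6), (7), (8);
    -- 2 was merged into 1, then 1 into 3; 8 into 7, then 7 into 6.
    INSERT INTO merges VALUES (3, 1), (1, 2), (6, 7), (7, 8);
""")

resolved = conn.execute("""
    WITH RECURSIVE search(active_id, inactive_id) AS (
        -- Base case: active ids that were never merged away are endpoints.
        SELECT m.active_id, m.inactive_id
        FROM merges m
        WHERE NOT EXISTS (SELECT 1 FROM merges m2
                          WHERE m2.inactive_id = m.active_id)
        UNION ALL
        -- Recursive case: carry the endpoint down to ids merged into an
        -- id that has itself been merged.
        SELECT s.active_id, m.inactive_id
        FROM merges m
        JOIN search s ON m.active_id = s.inactive_id
    )
    SELECT i.id, s.active_id
    FROM ids i
    LEFT JOIN search s ON s.inactive_id = i.id
    ORDER BY i.id
""").fetchall()

print(resolved)
# [(1, 3), (2, 3), (3, None), (6, None), (7, 6), (8, 6)]
```

Every merged id resolves to the end of its chain, while ids that are themselves endpoints come back as NULL (None), matching the shape of the results above.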