Access/SQL Select Query - Return "Most Like" Value Only
We have a chargeback process in an AccessDB where Departments must approve the expenses entered by another department. We only want a single 'default' approver, but the way the data has been set-up and the query we currently use to fill in the approver returns multiple results.
In the tUserSec table, for example, we have two columns: Name (UserIDX) and UserCode.
User1 - 550*
User2 - 55003*
The idea here being that User1 is the Director and so is a 'catchall' for everything in this department, while User2 is a Manager and is specifically assigned to a narrower division. Departments are always 7 characters total.
Say the Department is 5500309, the idea is that User2 should populate as the approver since their code is most closely matched to the Department ID. However, using the "Like" criteria returns both users and the form appears to select one of the two users at random with no rhyme or reason that I can determine. It always selects User1 for 5500309 but always selects User2 for 5500301, despite there being no further delineation - but ideally User1 shouldn't be populating at all unless no one else matches closer.
Below is a simplified version of the SQL; I cut out some other parts that muddy the situation:
SELECT TDepts.Dept, TDepts.DDescr, tUserSec.UserIDX
FROM tUserSec, TDepts
WHERE (((TDepts.Dept) Like [usercode] & "*"));
How can I change this up so that I only pull in the UserID who is most like the usercode? I tried to figure out a way to pull in the UserID based on the length or max of the usercode, etc. but I wasn't able to find a way that worked. It's a safe assumption that if two users have usercodes that are "like" the department that the usercode that is longest is the one we want.
(This is my first question on here and I struggled with how to best explain this issue. Please be gentle :) )
First, I have to say that the main problem here is when a developer thought that they would be clever and build a lot of logic into the department and user IDs. Hiding this sort of information within a column is a big source of headaches in general (as you're just starting to see).
I don't develop with Access, so I'm not certain of the syntax, but hopefully you'll get the general idea. Please let me know if the syntax needs to be tweaked for future users who find this question:
SELECT
    D.Dept,
    D.DDescr,
    U.UserIDX
FROM
    TDepts D
LEFT OUTER JOIN
(
    SELECT
        SQ_D.Dept,
        MAX(LEN(SQ_U.usercode)) AS max_len_usercode
    FROM
        TDepts SQ_D
    INNER JOIN tUserSec SQ_U ON SQ_D.Dept LIKE SQ_U.usercode & "*"
    GROUP BY
        SQ_D.Dept
) SQ ON SQ.Dept = D.Dept
LEFT OUTER JOIN tUserSec U ON
    D.Dept LIKE U.usercode & "*" AND
    LEN(U.usercode) = SQ.max_len_usercode
The subquery gets a list of all of the departments along with the length of the longest usercode that matches each department. The outer query then uses that length to pick the user whose usercode is "most like" the department.
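The longest-prefix idea can be tried outside Access. The following is only a sketch, assuming SQLite through Python's sqlite3 module, so it uses `%` and `||` where Access uses `*` and `&`, and `LENGTH` where Access uses `LEN`. The sample rows (including the extra departments 5501201 and 6600101) are made up for illustration, and UserCode is assumed to be stored without the trailing asterisk.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tUserSec (UserIDX TEXT, UserCode TEXT);
    CREATE TABLE TDepts (Dept TEXT, DDescr TEXT);
    -- Hypothetical sample data: User1 is the catch-all, User2 is narrower.
    INSERT INTO tUserSec VALUES ('User1', '550'), ('User2', '55003');
    INSERT INTO TDepts VALUES
        ('5500309', 'Covered by both users'),
        ('5501201', 'Covered by User1 only'),
        ('6600101', 'Covered by nobody');
""")

query = """
    SELECT D.Dept, U.UserIDX
    FROM TDepts D
    LEFT JOIN (
        -- Length of the longest matching usercode per department.
        SELECT SQ_D.Dept, MAX(LENGTH(SQ_U.UserCode)) AS max_len_usercode
        FROM TDepts SQ_D
        JOIN tUserSec SQ_U ON SQ_D.Dept LIKE SQ_U.UserCode || '%'
        GROUP BY SQ_D.Dept
    ) SQ ON SQ.Dept = D.Dept
    -- Keep only the matching user whose code has that maximum length.
    LEFT JOIN tUserSec U
        ON D.Dept LIKE U.UserCode || '%'
       AND LENGTH(U.UserCode) = SQ.max_len_usercode
    ORDER BY D.Dept
"""
rows = conn.execute(query).fetchall()
for dept, approver in rows:
    print(dept, approver)
```

With this sample data, 5500309 resolves to User2, 5501201 falls back to User1, and 6600101 has no approver. If two users share the same longest matching code, both rows are returned, so a tie-break rule would still be needed.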
Related
Selecting only such groups that contain certain value
First of all, even though the thread "SQL: How do you select only groups that do not contain a certain value?" is almost identical to my problem, it doesn't fully clear up my confusion about the problem.
Let's have a table "Contacts" like this one:
+------------+-----------+
| Department | FirstName |
+------------+-----------+
| 100        | Thomas    |
| 200        | Peter     |
| 100        | Jerry     |
+------------+-----------+
First, I want to group the rows by the department number and show the number of rows in each displayed group. This, I believe, can be easily done by the following query.
SELECT Department, Count(*) As "Rows_in_group"
FROM Contacts
GROUP BY Department
This outputs 2 groups: the first with department number 100, containing 2 rows, and the second with 200, containing only one row.
But then I want to extend the query to exclude any group that doesn't contain a certain value in a certain column (e.g. Thomas in FirstName). Here are my questions:
1) Reading the above-mentioned thread I was able to come up with this, which seems to work correctly:
SELECT Department, Count(*) As "Rows_in_group"
FROM Contacts
WHERE Department IN (SELECT Department FROM Contacts WHERE FirstName = "Thomas")
GROUP BY Department
Q: How does this work? I understand the "WHERE Department IN" part, but then I'd expect a value, and instead another nested query is included, which doesn't make much sense to me as I'm only a beginner with SQL.
2) By accident I was able to come up with another query that also seems to work, but it feels weird and I don't understand its workings either.
SELECT Department, Count(*) As "Rows_in_group"
FROM Contacts
GROUP BY Department
HAVING NOT SUM(FirstName = "Thomas") = 0
Q: How does this work? Why doesn't the alteration HAVING SUM(FirstName = "Thomas") > 0 work?
3) Q: Is there any simple and correct way to do this using the HAVING clause? I expected that a simple "HAVING FirstName='Thomas'" after the GROUP BY would do the trick, as it seems to follow common language, but it does not.
Note that I want whole groups to be chosen by the query, so "WHERE FirstName='Thomas'" isn't a solution for my problem, as it excludes all the rows that don't satisfy the condition before the grouping takes place (at least the way I understand it).
Q: How does this work? I understand the "WHERE Department IN" part, but then I'd expect a value, but instead another nested query is included, which to me doesn't make much sense as I'm only beginner with SQL.
The nested query returns the values that Department is matched against.
2) By accident I was able to come up with another query that also seems to work, but feels weird and I also don't understand its workings.
HAVING NOT SUM(FirstName = "Thomas") = 0
"Feels weird" because, well, it is. This is not a place for the SUM function.
EDIT: Why does this work? The expression FirstName = "Thomas" gets evaluated as true or false (known as a Boolean expression). True is numerically equal to 1 and False converts to 0 (zero). By including SUM you then calculate the total per group, so zero (still) means "no row matched" and "not zero" means "at least one row matched". Then, to make it weird(er), you included NOT. In engines that treat comparisons as numbers this way (MySQL and SQLite, for example), NOT binds more loosely than =, so the condition reads NOT (SUM(...) = 0): it is true exactly when the sum is not zero, that is, when the group contains at least one "Thomas".
EDIT: I think what could be more helpful to you is consideration of when to use WHERE and when to use HAVING (instead of the Boolean magic taking place). From this answer:
WHERE clause introduces a condition on individual rows; HAVING clause introduces a condition on aggregations, i.e. results of selection where a single result, such as count, average, min, max, or sum, has been produced from multiple rows.
WHERE was appropriate for your example because first you want to "only return rows WHERE Department IN (100)" and then you want to "group those rows by Department" and get a COUNT of how many rows had been selected.
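To see the WHERE ... IN and HAVING variants side by side, here is a minimal sketch that loads the three Contacts rows from the question into an in-memory SQLite database (an assumption; the question does not say which engine is used). The third query, with SUM(CASE ...), is an addition for illustration: it avoids relying on a Boolean being treated as 0 or 1, so it should carry over to engines that do not allow that.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Contacts (Department INTEGER, FirstName TEXT);
    INSERT INTO Contacts VALUES (100, 'Thomas'), (200, 'Peter'), (100, 'Jerry');
""")

# The nested query on its own: the list of departments that contain a Thomas.
departments_with_thomas = conn.execute(
    "SELECT Department FROM Contacts WHERE FirstName = 'Thomas'"
).fetchall()

# Variant 1: filter rows by that list, then group.
in_subquery = conn.execute("""
    SELECT Department, COUNT(*) AS Rows_in_group
    FROM Contacts
    WHERE Department IN (SELECT Department FROM Contacts WHERE FirstName = 'Thomas')
    GROUP BY Department
""").fetchall()

# Variant 2: group first, then keep groups where the comparison was true at least once.
having_sum = conn.execute("""
    SELECT Department, COUNT(*) AS Rows_in_group
    FROM Contacts
    GROUP BY Department
    HAVING SUM(FirstName = 'Thomas') > 0
""").fetchall()

# Variant 3: the same test without relying on Boolean-to-integer conversion.
having_case = conn.execute("""
    SELECT Department, COUNT(*) AS Rows_in_group
    FROM Contacts
    GROUP BY Department
    HAVING SUM(CASE WHEN FirstName = 'Thomas' THEN 1 ELSE 0 END) > 0
""").fetchall()

print(departments_with_thomas, in_subquery, having_sum, having_case)
```

All three grouped queries keep department 100 with both of its rows, while department 200 is dropped as a whole group.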
Sorting with many to many relationship
I have 3 tables: person, person_speaks_language and language.
person has 80 records
language has 2 records
I have the following records:
the first 10 persons speak one language
the first 70 persons (including the first group) speak 2 languages
the last 10 persons don't speak any language
Following the example, I want to sort the persons by language. How can I do it correctly? I'm trying to use the following SQL, but the result seems quite strange:
SELECT "person".*
FROM "person"
LEFT JOIN "person_speaks_language" ON "person"."id" = "person_speaks_language"."person_id"
LEFT JOIN "language" ON "person_speaks_language"."language_id" = "language"."id"
ORDER BY "language"."name" ASC
dataset
71,Catherine,Porter,male,NULL 72,Isabelle,Sharp,male,NULL 73,Scott,Chandler,male,NULL 74,Jean,Graham,male,NULL 75,Marc,Kennedy,male,NULL 76,Marion,Weaver,male,NULL 77,Melvin,Fitzgerald,male,NULL 78,Catherine,Guerrero,male,NULL 79,Linnie,Strickland,male,NULL 80,Ann,Henderson,male,NULL 11,Daniel,Boyd,female,English 12,Ora,Beck,female,English 13,Hulda,Lloyd,female,English 14,Jessie,McBride,female,English 15,Marguerite,Andrews,female,English 16,Maurice,Hamilton,female,English 17,Cecilia,Rhodes,female,English 18,Owen,Powers,female,English 19,Ivan,Butler,female,English 20,Rose,Bishop,female,English 21,Franklin,Mann,female,English 22,Martha,Hogan,female,English 23,Francis,Oliver,female,English 24,Catherine,Carlson,female,English 25,Rose,Sanchez,female,English 26,Danny,Bryant,female,English 27,Jim,Christensen,female,English 28,Eric,Banks,female,English 29,Tony,Dennis,female,English 30,Roy,Hoffman,female,English 31,Edgar,Hunter,female,English 32,Matilda,Gordon,female,English 33,Randall,Cruz,female,English 34,Allen,Brewer,female,English 35,Iva,Pittman,female,English 36,Garrett,Holland,female,English 37,Johnny,Russell,female,English 38,Nina,Richards,female,English 39,Mary,Ballard,female,English 40,Adrian,Sparks,female,English 41,Evelyn,Santos,female,English 42,Bess,Jackson,female,English
43,Nicholas,Love,female,English 44,Fred,Perkins,female,English 45,Cynthia,Dunn,female,English 46,Alan,Lamb,female,English 47,Ricardo,Sims,female,English 48,Rosie,Rogers,female,English 49,Susan,Sutton,female,English 50,Mary,Boone,female,English 51,Francis,Marshall,male,English 52,Carl,Olson,male,English 53,Mario,Becker,male,English 54,May,Hunt,male,English 55,Sophie,Neal,male,English 56,Frederick,Houston,male,English 57,Edwin,Allison,male,English 58,Florence,Wheeler,male,English 59,Julia,Rogers,male,English 60,Janie,Morgan,male,English 61,Louis,Hubbard,male,English 62,Lida,Wolfe,male,English 63,Alfred,Summers,male,English 64,Lina,Shaw,male,English 65,Landon,Carroll,male,English 66,Lilly,Harper,male,English 67,Lela,Gordon,male,English 68,Nina,Perry,male,English 69,Dean,Perez,male,English 70,Bertie,Hill,male,English 1,Nelle,Gill,female,Spanish 2,Lula,Wright,female,Spanish 3,Anthony,Jensen,female,Spanish 4,Rodney,Alvarez,female,Spanish 5,Scott,Holmes,female,Spanish 6,Daisy,Aguilar,female,Spanish 7,Elijah,Olson,female,Spanish 8,Alma,Henderson,female,Spanish 9,Willie,Barrett,female,Spanish 10,Ada,Huff,female,Spanish 11,Daniel,Boyd,female,Spanish 12,Ora,Beck,female,Spanish 13,Hulda,Lloyd,female,Spanish 14,Jessie,McBride,female,Spanish 15,Marguerite,Andrews,female,Spanish 16,Maurice,Hamilton,female,Spanish 17,Cecilia,Rhodes,female,Spanish 18,Owen,Powers,female,Spanish 19,Ivan,Butler,female,Spanish 20,Rose,Bishop,female,Spanish 21,Franklin,Mann,female,Spanish 22,Martha,Hogan,female,Spanish 23,Francis,Oliver,female,Spanish 24,Catherine,Carlson,female,Spanish 25,Rose,Sanchez,female,Spanish 26,Danny,Bryant,female,Spanish 27,Jim,Christensen,female,Spanish 28,Eric,Banks,female,Spanish 29,Tony,Dennis,female,Spanish 30,Roy,Hoffman,female,Spanish 31,Edgar,Hunter,female,Spanish 32,Matilda,Gordon,female,Spanish 33,Randall,Cruz,female,Spanish 34,Allen,Brewer,female,Spanish 35,Iva,Pittman,female,Spanish 36,Garrett,Holland,female,Spanish 37,Johnny,Russell,female,Spanish 
38,Nina,Richards,female,Spanish 39,Mary,Ballard,female,Spanish 40,Adrian,Sparks,female,Spanish 41,Evelyn,Santos,female,Spanish 42,Bess,Jackson,female,Spanish 43,Nicholas,Love,female,Spanish 44,Fred,Perkins,female,Spanish 45,Cynthia,Dunn,female,Spanish 46,Alan,Lamb,female,Spanish 47,Ricardo,Sims,female,Spanish 48,Rosie,Rogers,female,Spanish 49,Susan,Sutton,female,Spanish 50,Mary,Boone,female,Spanish 51,Francis,Marshall,male,Spanish 52,Carl,Olson,male,Spanish 53,Mario,Becker,male,Spanish 54,May,Hunt,male,Spanish 55,Sophie,Neal,male,Spanish 56,Frederick,Houston,male,Spanish 57,Edwin,Allison,male,Spanish 58,Florence,Wheeler,male,Spanish 59,Julia,Rogers,male,Spanish 60,Janie,Morgan,male,Spanish 61,Louis,Hubbard,male,Spanish 62,Lida,Wolfe,male,Spanish 63,Alfred,Summers,male,Spanish 64,Lina,Shaw,male,Spanish 65,Landon,Carroll,male,Spanish 66,Lilly,Harper,male,Spanish 67,Lela,Gordon,male,Spanish 68,Nina,Perry,male,Spanish 69,Dean,Perez,male,Spanish 70,Bertie,Hill,male,Spanish
Update
The expected results are: each person must appear only one time, using the language order.
To explain the case further, I'll take a new, small dataset, using only the person id and the language name:
1,English 2,English 3,English 4,English 19,English 1,Spanish 2,Spanish 3,Spanish 4,Spanish 5,Spanish 14,Spanish 15,Spanish 16,Spanish 19,Spanish 21,Spanish 25,Spanish
I'm using the same order, but if I use a limit, for example LIMIT 8, the results will be
1,English 2,English 3,English 4,English 19,English 1,Spanish 2,Spanish 3,Spanish
And the expected result is
1,English 2,English 3,English 4,English 19,English 5,Spanish 14,Spanish 15,Spanish
What I'm trying to do
What I'm trying to do is sort, paginate and filter a list of X that may have a many-to-many relationship with Y; in this case X is a person and Y is the language. I need to do it in a general way. I run into trouble if I want to order the list by some properties of Y.
The list will be shown in this way:
firstname, lastname, gender, languages
Daniel, Boyd, female, English Spanish
Ora, Beck, female, English
Anthony, Jensen, female, Spanish
....
I only need to return an array with the IDs in the correct order. The main reason I need each person to appear only once in the results is that the ORM I'm using tries to hydrate each result, and if I paginate the results using offset and limit, the results may not be the expected ones.
I'm making assumptions about many-to-many relationships. I can't use string_agg or group_concat because I don't know the real data; I don't know whether the values are integers or strings.
If you want each person to appear only once, then you need to aggregate by that person. If you then want the list of languages, you need to combine them in some way; concatenation comes to mind. The use of double quotes suggests Postgres or Oracle to me. Here is Postgres syntax for this (note that string_agg() takes a delimiter as its second argument):
SELECT p.id, string_agg(l.name, ',') as languages
FROM person p
LEFT JOIN person_speaks_language psl ON p.id = psl.person_id
LEFT JOIN language l ON psl.language_id = l.id
GROUP BY p.id
ORDER BY COUNT(l.name) DESC, languages;
Similar functionality to string_agg() exists in most databases.
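If only an ordered list of person IDs is needed for pagination, one variant of the aggregation above is to group by person and sort on MIN() of the language name instead of concatenating anything. This is a sketch under stated assumptions, not what the answer proposes: it uses SQLite through Python's sqlite3 module, the small dataset from the question's update, and "sort by the alphabetically first language" as the interpretation of "language order".

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE person (id INTEGER PRIMARY KEY);
    CREATE TABLE language (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE person_speaks_language (person_id INTEGER, language_id INTEGER);
    INSERT INTO language VALUES (1, 'English'), (2, 'Spanish');
""")

# Small dataset from the question's update: person ids per language.
english = [1, 2, 3, 4, 19]
spanish = [1, 2, 3, 4, 5, 14, 15, 16, 19, 21, 25]
conn.executemany("INSERT INTO person VALUES (?)", [(i,) for i in sorted(set(english + spanish))])
conn.executemany("INSERT INTO person_speaks_language VALUES (?, 1)", [(i,) for i in english])
conn.executemany("INSERT INTO person_speaks_language VALUES (?, 2)", [(i,) for i in spanish])

# One row per person, ordered by that person's alphabetically first language.
page_query = """
    SELECT p.id, MIN(l.name) AS first_language
    FROM person p
    LEFT JOIN person_speaks_language psl ON p.id = psl.person_id
    LEFT JOIN language l ON psl.language_id = l.id
    GROUP BY p.id
    ORDER BY first_language, p.id
    LIMIT ? OFFSET ?
"""
page_1 = [row[0] for row in conn.execute(page_query, (8, 0))]
page_2 = [row[0] for row in conn.execute(page_query, (8, 8))]
print(page_1, page_2)
```

The first page matches the expected result in the question (1, 2, 3, 4, 19, then 5, 14, 15). Persons with no language would get a NULL sort key, and where NULLs sort differs between databases, so that case needs an explicit rule.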
There is nothing wrong with Bertie Hill appearing in two rows, with one language each; that is the Tabular View of Data per the Relational Model. There are no dependencies on data values or number of data values. It is completely correct and un-confused.
But here, the requirement is confused, because you really want three separate lists:
speaks one language
speaks two languages [or the number of languages currently in the language file]
speaks no language [on file]
... But you want those three lists in one list.
Concatenating data values is never, ever a good idea. It is a breach of rudimentary standards, specifically 1NF. It may be common, but it is a gross error. It may be taught by the so-called "theoreticians", but it remains a gross error. Even in a result set, yes.
It creates confusion, such as I have detailed at the top.
With concatenated strings, as the number of languages changes, the width of that concatenated field will grow, and eventually exceed space, wherever it appears (eg. the width of the field on the screen).
Just two of the many reasons why it is incorrect, not expandable, sub-standard.
By the way, in your "dataset" (it isn't the result set produced by your code), the sexes appear to be nicely mixed up.
Therefore the answer, and the only correct one, even if it isn't popular, is that your code is correct (it can be cleaned up, sure), and you have to educate the user re the dangers of sub-standard code or reports.
You can sort by person.name (rather than by language.name) and then write smarter SQL such that (eg) the person.name is not repeated on the second and subsequent row for persons who speak more than one language, etc. That is just pretty printing.
The non-answer, for those who insist on sub-standard code that will break one day, is Gordon's response.
Response to Comments
In the Relational Model:
There is no order to the rows; that is deemed a physical or implementation aspect, which we have no control over, and which changes anyway, and which we are warned not to rely upon. If order is sought in the output result set, then we must use ORDER BY; that is its purpose in life.
The data has meaning, and that meaning is carried in Relational Keys. Meaning cannot be carried in surrogates (ie. ID columns).
Limiting myself to the files (they are not tables) that you have given, there is no such thing in the data as:
the first 10 persons who speaks one language
Obtaining persons who speak one language is simple, I believe you already understand that:
SELECT P.first_name,
       P.last_name
FROM person P, (
    SELECT person_id
    FROM person_speaks_language
    GROUP BY person_id
    HAVING COUNT(*) = 1  -- change this for 2 languages, etc
    ) AS PL
WHERE P.id = PL.person_id
But "first"? "First" by what criteria? Record creation date?
ORDER BY date_created  -- if it exists in the data
Record ID does not give first anything: as records are added and deleted, any "order" that may exist initially is completely lost.
You cannot extract meaning out of, or assign meaning to, something that, by definition, has no meaning. If the Record ID is relevant, ie. you are going to use it for some purpose, then it is not a Record ID; name the field for what it actually is.
I fail to see, I do not understand, the relevance of the difference between the "dataset" and the updated "small dataset". The "dataset" size is irrelevant, the field headings are irrelevant; what the result set means is relevant.
The problem is not some "limitation" in the Relational Model; the problem is (a) your fixed view of data values, and (b) your lack of understanding about what the Relational Model is and what it does, understanding of which makes this whole question disappear, and we are left with a simple SQL (as tagged) "how to" question.
Eg.
If I had a Relational Database, with persons and languages, with no ID columns, there is nothing that I cannot do with it, no report that I cannot produce from it, from the data.
Please try to use an example that conveys the meaning in the data, in what you are trying to do.
the expect results are: each person must be appear only one time
They already appear only once (for each language).
using the language order
Well, there is no order in the language file. We can give it some order, whatever order is meaningful to you, in the result set, based on the data. Eg. language.name. Of course, many persons speak each language, so what order would you like within language.name? How about last_name, first_name. The Record IDs are meaningless to the user, so I won't display them in the result set. NULL is also meaningless, and ambiguous, so I will make the meaning here explicit. This is pretty much what you have, tidied up:
SELECT [language] = CASE
           WHEN name IS NULL THEN '[None]'
           ELSE name
           END,
       last_name,
       first_name
FROM person P
LEFT JOIN person_speaks_language PL ON P.id = PL.person_id
LEFT JOIN language L ON PL.language_id = L.id
ORDER BY name, last_name, first_name
But then you have:
And the expected result is
The example data of which contradicts your textual descriptions:
the expect results are: each person must be appear only one time using the language order
So now, if I ignore the text, and examine the example data re what you want (which is a horrible thing to do, because I am joining you in the incorrect activity of focussing on the data values, rather than understanding the meaning), it appears you want the person to appear only once, full stop, regardless of how many languages they speak. Your example data is meaningless, so I cannot be asked to reproduce it. See if this has some meaning.
SELECT last_name,
       first_name,
       [language] = COALESCE((          -- make meaning of null explicit
           SELECT TOP 1                 -- get the "first" language
                  name
           FROM person_speaks_language PL
           JOIN language L ON PL.language_id = L.id
           WHERE P.id = PL.person_id    -- the subject person
           ORDER BY name                -- id would be meaningless
           ), '[None]')                 -- correlated subquery
FROM person P                           -- vector for person, once
ORDER BY last_name, first_name
Now if you wanted only persons who speak a language (on file):
SELECT last_name,
       first_name,
       [language] = (                   -- correlated subquery
           SELECT TOP 1                 -- get the "first" language
                  name
           FROM person_speaks_language PL
           JOIN language L ON PL.language_id = L.id
           WHERE P.id = PL.person_id    -- the subject person
           ORDER BY name                -- id would be meaningless
           )
FROM person P, (
    SELECT DISTINCT person_id           -- just one occ, thanks
    FROM person_speaks_language PL      -- vector for speakers
    ) AS PL_1
WHERE P.id = PL_1.person_id             -- join them to person fields
There, not an outer join anywhere to be seen, in either solution. LEFT or RIGHT will confuse you. Do not attempt to "get everything", so that you can "see" the data values, and then mangle, hack and chop away at the result set, in order to get what you want from that. No, forget about the data values and get only what you want from the record filing system.
Response to Update
I was trying to explain the case with a data set, I think I made things tougher than they actually were
Yes, you did. Reviewing the update then ...
The short answer is, get rid of the ORM. There is nothing in it of value: you can access the RDB from the queries that populate your objects directly. The way we did for decades before the flatulent beast came along. Especially if you understand and implement Open Architecture Standards.
Further, as evidenced, it creates masses of problems. Here, you are trying to work around the insane restrictions of the ORM. Pagination is a straightforward issue, if you have your data Normalised, and Relational Keys.
The long answer is ... please read this Answer. I trust you will understand that the approach you take to designing your app components, your design of windows, will change. All your queries will be simplified; you get only what you require for the specific window or object. The problem may well disappear entirely (except for possibly the pagination, you might need a method). Then please think about those architectural issues carefully, and make specific comments or questions.
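One detail worth checking when making NULL explicit in a result set: a simple CASE expression compares its operand with =, and nothing is ever equal to NULL, so a WHEN NULL branch is never taken. The sketch below contrasts the three spellings. It assumes SQLite through Python's sqlite3 module, and the two-row table `spoken` is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical flattened result: Ann speaks no language on file.
    CREATE TABLE spoken (person TEXT, name TEXT);
    INSERT INTO spoken VALUES ('Ann', NULL), ('Bertie', 'English');
""")

rows = conn.execute("""
    SELECT person,
           -- Simple CASE: NULL = NULL is not true, so this falls through to ELSE.
           CASE name WHEN NULL THEN '[None]' ELSE name END AS simple_case,
           -- Searched CASE with IS NULL behaves as intended.
           CASE WHEN name IS NULL THEN '[None]' ELSE name END AS searched_case,
           -- COALESCE is the shorter spelling of the same idea.
           COALESCE(name, '[None]') AS coalesced
    FROM spoken
    ORDER BY person
""").fetchall()

for row in rows:
    print(row)
```

For Ann, the simple CASE still yields NULL, while the searched CASE and COALESCE both yield '[None]'.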
Trouble getting newest record from grouped records
I'm pretty new to Access so I'm sure this is something simple. I'm not sure I even have the best subject. I have an Owner and a Names table that contain data like this:
Owner                  Names
TMKFK    NID    ...    NIDFK  Last   ModDate
7721011  45            45     Smith  1/18/15
7721011  137           137    Jones  2/1/15
7721012  45            45     Smith  1/18/15
I am trying to query them so that I get the TMKFK for the latest timestamped record in the Names table. This is used for a lookup from a form. So if I look up Smi* I expect to get 7721012. After a bunch of looking around on this site and elsewhere, and looking at partition over, I concluded the answer had to be a subquery, but I can't quite figure out what to put where. This is where I got stuck:
SELECT Owner.TMKFK
FROM Owner INNER JOIN Names ON Owner.NID = Names.NIDFK
GROUP BY Owner.TMKFK, [Owner Name].Last, [Owner Name].M
WHERE (Owner.TMKFK=7721011 Or Owner.TMKFK=7721012)
AND Names.Last Like "Smith"
AND Names.ModDate=(SELECT Max(Names.ModDate) FROM Names);
This fails because the subquery returns the Max date from the entire table and not just the two records with the same TMKFK. A HAVING clause doesn't seem to make a difference. Re-ordering the fields in the GROUP BY didn't make a difference either.
The subquery to get the max date would need to be restricted to the owner in question. Something along these lines:
SELECT Owner.TMKFK
FROM Owner INNER JOIN Names ON Owner.NID = Names.NIDFK
WHERE (Owner.TMKFK=7721011 Or Owner.TMKFK=7721012)
AND Names.Last Like 'Smith%'
AND Names.ModDate=(SELECT Max(Names.ModDate)
                   FROM Names
                   WHERE NIDFK = Owner.NID
                  )
I don't think you need the GROUP BY.
I don't know the Access syntax, but LIKE usually implies wildcards like %, and the string should be single quoted.
And if you want case-insensitive searching:
AND UPPER(Names.Last) LIKE UPPER('Smith%')
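A runnable variant of the correlated-subquery idea, as a sketch only: it assumes SQLite through Python's sqlite3 module, ISO-formatted dates stored as text, and that "latest" is meant per TMKFK, so the subquery is correlated on TMKFK rather than on NID as in the answer above. Under that reading, the lookup for Smi returns only 7721012, because the latest name on 7721011 is Jones.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Owner (TMKFK INTEGER, NID INTEGER);
    CREATE TABLE Names (NIDFK INTEGER, Last TEXT, ModDate TEXT);
    -- Rows from the question; dates rewritten as ISO text so MAX() sorts correctly.
    INSERT INTO Owner VALUES (7721011, 45), (7721011, 137), (7721012, 45);
    INSERT INTO Names VALUES (45, 'Smith', '2015-01-18'), (137, 'Jones', '2015-02-01');
""")

# Latest name per TMKFK: the subquery is restricted to the outer row's TMKFK.
latest_per_tmkfk = """
    SELECT o.TMKFK, n.Last
    FROM Owner o
    JOIN Names n ON o.NID = n.NIDFK
    WHERE n.ModDate = (
        SELECT MAX(n2.ModDate)
        FROM Owner o2
        JOIN Names n2 ON o2.NID = n2.NIDFK
        WHERE o2.TMKFK = o.TMKFK
    )
"""
latest = conn.execute(latest_per_tmkfk + " ORDER BY o.TMKFK").fetchall()

# Same query with the form's name lookup added (SQLite uses % as the wildcard).
smith_lookup = conn.execute(
    latest_per_tmkfk + " AND n.Last LIKE ? ORDER BY o.TMKFK", ("Smi%",)
).fetchall()

print(latest)
print(smith_lookup)
```

In Access the wildcard would be * rather than %, as in the first thread on this page.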
Using distinct and then an aggregate function in Postgresql?
This is a pretty basic problem and for whatever reason I can't find a reasonable solution. I'll do my best to explain.
Say you have an event ticket (section, row, seat #). Each ticket belongs to an attendee. Multiple tickets can belong to the same attendee. Each attendee has a worth (ex: Attendee #1 is worth $10,000). That said, here's what I want to do:
1. Group the tickets by their section
2. Get number of tickets (count)
3. Get total worth of the attendees in that section
Here's where I'm having problems: If Attendee #1 is worth $10,000 and is using 4 tickets, sum(attendees.worth) is returning $40,000, which is not accurate. The worth should be $10,000. Yet when I make the result distinct on the attendee, the count is not accurate. In an ideal world it'd be nice to do something like
select tickets.section,
       count(tickets.*) as count,
       sum(DISTINCT ON (attendees.id) attendees.worth) as total_worth
from tickets
INNER JOIN attendees ON attendees.id = tickets.attendee_id
GROUP BY tickets.section
Obviously this query doesn't work. How can I accomplish this same thing in a single query? Or is it even possible? I'd prefer to stay away from subqueries too, because this is part of a much larger solution where I would need to do this across multiple tables.
Also, the worth should follow the ticket, divided evenly. Ex: $10,000 / 4. Each ticket has an attendee worth of $2,500. So if the tickets are in different sections, they take their prorated worth with them.
Thanks for your help.
You need to aggregate the tickets before the attendees:
select ta.section,
       sum(ta.numtickets) as count,
       sum(a.worth) as total_worth
from (select attendee_id, section, count(*) as numtickets
      from tickets
      group by attendee_id, section
     ) ta
INNER JOIN attendees a ON a.id = ta.attendee_id
GROUP BY ta.section
You still have the problem of a single attendee having seats in multiple sections. However, you do not specify how to solve that (apportion the worth? randomly choose one section? attribute it to all sections? canonically choose a section?).
Using jsonb_object_agg:
select tickets.section,
       count(tickets.*) as count,
       (
           SELECT SUM(value::int4)
           FROM jsonb_each_text(jsonb_object_agg(attendees.id, attendees.worth))
       ) as total_worth
from tickets
INNER JOIN attendees ON attendees.id = tickets.attendee_id
GROUP BY tickets.section
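The prorated requirement from the question (each ticket carries worth divided by the attendee's total ticket count) can be sketched as follows. This is an illustration under assumptions, not one of the answers above: it runs on SQLite through Python's sqlite3 module rather than Postgres, it does use a derived table even though the asker hoped to avoid subqueries, and the sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE attendees (id INTEGER PRIMARY KEY, worth INTEGER);
    CREATE TABLE tickets (section TEXT, attendee_id INTEGER);
    -- Hypothetical data: attendee 1 holds 4 tickets split across two sections.
    INSERT INTO attendees VALUES (1, 10000), (2, 2000);
    INSERT INTO tickets VALUES ('A', 1), ('A', 1), ('A', 1), ('B', 1), ('A', 2);
""")

rows = conn.execute("""
    SELECT t.section,
           COUNT(*) AS ticket_count,
           -- Each ticket contributes its attendee's worth divided by
           -- the number of tickets that attendee holds overall.
           SUM(a.worth * 1.0 / tc.n) AS total_worth
    FROM tickets t
    JOIN attendees a ON a.id = t.attendee_id
    JOIN (
        SELECT attendee_id, COUNT(*) AS n
        FROM tickets
        GROUP BY attendee_id
    ) tc ON tc.attendee_id = t.attendee_id
    GROUP BY t.section
    ORDER BY t.section
""").fetchall()

for section, ticket_count, total_worth in rows:
    print(section, ticket_count, total_worth)
```

Here attendee 1's $10,000 is split into four $2,500 shares, so section A totals 3 x 2,500 + 2,000 = 9,500 and section B totals 2,500, and the per-section totals add up to the combined worth of all attendees.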
How to optimize group by in table with huge number of records
I have a Person table with a huge number of records (about 16 million), and a requirement to find all persons with the same lastname, first letter of firstname and birth year. In other words, I want to show presumed duplicate persons in the UI so that users can analyze them and decide whether they are the same person or not. Here is the query I wrote:
SELECT *
FROM Person
INNER JOIN (
    SELECT SUBSTRING(firstName, 1, 1) firstNameF, lastName, YEAR(birthDate) birthYear
    FROM Person
    GROUP BY SUBSTRING(firstName, 1, 1), lastName, YEAR(birthDate)
    HAVING count(*) > 1
) as dupPersons
ON SUBSTRING(Person.firstName, 1, 1) = dupPersons.firstNameF
and Person.lastName = dupPersons.lastName
and YEAR(Person.birthDate) = dupPersons.birthYear
order by Person.lastName, Person.firstName
But as I am not an SQL expert, I want to know: is this a good way to do that? Is there a more optimized way?
EDIT
Note that I can cut the data, which can contribute to optimization. For example, if I want to cut the data by 2, it could return two persons:
Johan Smith |
Jane Smith  | have same lastname and first name inita
Jack Smith  |
Mark Tween  | have same lastname and first name inita
Mac Tween   |
If the performance using a GROUP BY is not adequate, you could try using an INNER JOIN:
SELECT *
FROM Person p1
INNER JOIN Person p2 ON p2.PersonID > p1.PersonID
WHERE SUBSTRING(p2.Firstname, 1, 1) = SUBSTRING(p1.Firstname, 1, 1)
AND p2.LastName = p1.LastName
AND YEAR(p2.BirthDate) = YEAR(p1.BirthDate)
ORDER BY p1.LastName, p1.FirstName
Well, if you're not an expert, the query you wrote says to me that you're at least pretty competent.
When we look at whether a query is "optimized", there are two immediate parts to that:
1. The query just on its own has something notably wrong with it - a bad join, keyword misuse, exploding result set size, superstitions about NOT IN, etc.
2. The context that the query operates within - DB specifics, task specifics, etc.
Your query passes #1, no problem. I would have written it differently - aliased the Person table, used LEFT(P.FirstName, 1) instead of SUBSTRING, and used a CTE (WITH-clause) instead of a subquery. But these aren't optimization issues. Maybe I'd use WITH(READUNCOMMITTED) if the results weren't sensitive to dirty reads. Out of any further context, your query doesn't look like a bomb waiting to go off.
As for #2 - You should probably switch to specifics. Like "I have to run this every week. It takes 17 minutes. How can I get it down to under a minute?" Then people will ask you what your plan looks like, what indexes you have, etc.
Things I'd want to know:
How long does it already take to run?
What's your runtime window? (User & app tolerance for query time.)
Is this run once a day? Week? Month? Quarter?
Do you have the permission to create tables, change current tables, or alter indexes?
Maybe based on having run it, what's the ratio of duplicates you're expecting to find? 5%? 90%?
How stable is the matching criteria requirement?
Example scenario: If this were a run-on-command feature that will be in my app indefinitely, run weekly, with 10% or fewer records expected to be duplicates, with the ability to change the DB how I'd like, with duplicate matching criteria that are firm (not fluctuating), and I want to cut it from 90s to 5s, I'd create a dedicated BirthYear column (possibly a persisted computed column off of BirthDate), and an index on LastName ASC, BirthYear ASC, FirstName ASC.
If too many of those stipulations change, I might go in a different direction entirely.
You can try something like this and see the difference in the execution plans, or benchmark the results for performance:
;WITH DupPersons AS
(
    SELECT *,
           COUNT(1) OVER(PARTITION BY SUBSTRING(firstName, 1, 1), lastName, YEAR(birthDate)) Quant
    FROM Person
)
SELECT *
FROM DupPersons
WHERE Quant > 1
Of course, it would also help to know your table definition and the indexes you created. I think it may help to add a computed column with the year of birthDate and create an index on it, and the same for the first letter of firstName.
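The dedicated-column-plus-index suggestion from the answers above can be sketched on a small scale. This is an illustration under assumptions: it uses SQLite through Python's sqlite3 module instead of SQL Server, fills the hypothetical firstInitial and birthYear columns at insert time instead of using persisted computed columns, and runs on a handful of made-up rows, so it says nothing about timings on 16 million records.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Person (
        PersonID     INTEGER PRIMARY KEY,
        firstName    TEXT,
        lastName     TEXT,
        birthDate    TEXT,     -- ISO date text
        firstInitial TEXT,     -- hypothetical precomputed column
        birthYear    INTEGER   -- hypothetical precomputed column
    )
""")

# Made-up sample rows, loosely based on the names in the question.
people = [
    ("Johan", "Smith", "1980-03-01"),
    ("Jane", "Smith", "1980-11-20"),
    ("Jack", "Smith", "1975-06-15"),
    ("Mark", "Tween", "1990-01-05"),
    ("Mac", "Tween", "1990-09-30"),
    ("Ann", "Lee", "1985-04-04"),
]
conn.executemany(
    "INSERT INTO Person (firstName, lastName, birthDate, firstInitial, birthYear) "
    "VALUES (?, ?, ?, ?, ?)",
    [(first, last, born, first[:1], int(born[:4])) for first, last, born in people],
)

# Index matching the duplicate criteria, so grouping and joining need no per-row functions.
conn.execute("CREATE INDEX ix_person_dup ON Person (lastName, birthYear, firstInitial)")

rows = conn.execute("""
    SELECT p.firstName, p.lastName, p.birthYear
    FROM Person p
    JOIN (
        SELECT lastName, birthYear, firstInitial
        FROM Person
        GROUP BY lastName, birthYear, firstInitial
        HAVING COUNT(*) > 1
    ) d ON p.lastName = d.lastName
       AND p.birthYear = d.birthYear
       AND p.firstInitial = d.firstInitial
    ORDER BY p.lastName, p.firstName
""").fetchall()

for row in rows:
    print(row)
```

With this data, Jane and Johan Smith (1980) and Mac and Mark Tween (1990) are reported as candidate duplicates, while Jack Smith (1975) and Ann Lee are not. Whether the index actually helps on the real table is something only the execution plan and a benchmark can show.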