I am having a brain fart and could use some help. I hope my description of the issue is concise and clear enough.
Let's say I have an auto shop and I have contracts with insurance companies to repair vehicles at certain costs. These costs are based on the insurance company, the technician performing the job, and the job type. The contracts are a little weird, however. Only certain technicians have special rates for specific jobs.
For example: a contract with company 'A' states that they will pay $XX for a motor replacement if Sally replaces it, but $YY if it's anyone else.
Now, I am trying to estimate the cash flow of my company, so I want to calculate the expected payments based on the criteria above. Let's pretend I have a table (jobs) that contains the job information, with fields for job_type_id, technician_id, and insurance_id. There is another table (payment_schedule), which also contains job_type_id, technician_id, and insurance_id. If no technician is specified (e.g. anyone but Sally), then the technician_id is 0.
In the case that Sally (technician_id = 26) is the only one with special rates for all of the insurance companies, then it's fairly easy for me to create a join condition:
select
*
from jobs as j
join payment_schedule as s on s.insurance_id = j.insurance_id
and j.job_type_id = s.job_type_id
and s.technician_id = case when j.technician_id = 26 then j.technician_id else 0 end
However, reality is more complicated than that. Sally may be the only one at that particular location with special prices, but we have multiple locations with many technicians. These contracts change frequently, so I need to be able to create a case statement that checks the id 'dynamically'. I tried this, knowing it'd probably be slow. It is.
select
*
from jobs as j
join payment_schedule as s on s.insurance_id = j.insurance_id
and j.job_type_id = s.job_type_id
and s.technician_id = case when j.technician_id in (select distinct technician_id from payment_schedule) then j.technician_id else 0 end
It's definitely slow, and I'm not surprised, but I'm having trouble coming up with a good way of ensuring we don't get multiple payment_schedule records for each job record (1 to 1 relationship) without having to hard code IDs into a case statement. Does anyone have any suggestions for this?
I feel like I'm missing something wildly obvious: I've been staring at this for too long and need fresh eyes. Thanks
I believe you are looking for something like the query below, but sharing some sample data and your desired output would help:
select
*
from jobs as j
join payment_schedule as s on s.insurance_id = j.insurance_id
and j.job_type_id = s.job_type_id
and 1 = case when j.technician_id = s.technician_id then 1 else 0 end
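If the goal is a strict 1-to-1 match without hard-coding technician IDs, one common pattern (a sketch only; it assumes a job_id primary key on jobs and a payment amount column on payment_schedule, neither of which was posted) is to join each job to both its technician-specific row and the technician_id = 0 fallback, then keep the most specific match per job:
select job_id, insurance_id, job_type_id, technician_id, amount
from (
    select
        j.job_id,
        j.insurance_id,
        j.job_type_id,
        j.technician_id,
        s.amount, -- hypothetical payment column
        row_number() over (
            partition by j.job_id
            -- a technician-specific row (e.g. 26) sorts above the 0 fallback
            order by s.technician_id desc
        ) as rn
    from jobs as j
    join payment_schedule as s
        on s.insurance_id = j.insurance_id
        and s.job_type_id = j.job_type_id
        and s.technician_id in (j.technician_id, 0)
) ranked
where rn = 1
Because the contract rules live entirely in payment_schedule, nothing in the query needs to change when a technician gains or loses a special rate.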
Good evening. Could someone please help me with the following? I am trying to join two tables. The first is wbr_global.gl_ap_details, which stores historic GL information. The second table, sandbox.utr_fixed_mapping, is where account mapping is stored. For example, an account number 60820 is mapped as Employee Relations. The first table needs the mapping from the second table linked on the account number. The output I am getting is not right and way too big. Any help would be appreciated!
select sandbox.utr_fixed_mapping_na.new_mapping_1,sum(wbr_global.gl_ap_details.amount)
from wbr_global.gl_ap_details
LEFT JOIN sandbox.utr_fixed_mapping_na ON wbr_global.gl_ap_details.account_number = sandbox.utr_fixed_mapping_na.account_number
Where gl_ap_details.cost_center = '1172'
and gl_ap_details.period_name = 'JUL-21'
and gl_ap_details.ledger_name = 'Amazon.com, Inc.'
Group by 1;
I tried adding a CAST function, but after the query had been running for 5,000 seconds I cancelled it.
The query itself appears OK, but needs minor changes. Learn to use table aliases; that way you don't have to keep typing long database.table.column references all over, and the SQL is easier to read anyhow.
Notice the aliases "gl" and "fm" declared after the tables, then used to qualify the columns. Easier to read, would you agree?
I added the GL account number, as described below the query.
select
gl.account_number,
fm.new_mapping_1,
sum(gl.amount)
from
wbr_global.gl_ap_details gl
LEFT JOIN sandbox.utr_fixed_mapping_na fm
ON gl.account_number = fm.account_number
Where
gl.cost_center = '1172'
and gl.period_name = 'JUL-21'
and gl.ledger_name = 'Amazon.com, Inc.'
Group by
gl.account_number,
fm.new_mapping_1
Now, as for your query and getting null: this just means that there are records within the gl_ap_details table with an account number that is not found in the utr_fixed_mapping_na table. So, to see WHICH gl account numbers do NOT exist, I have added the account number to the query. It's possible there are MULTIPLE records in gl_ap_details that are not found in the mapping table. So, you may get
GLAccount Description SumOfAmount
glaccount1 null $someAmount
glaccount37 null $someAmount
glaccount49 null $someAmount
glaccount72 Depreciation $someAmount
glaccount87 Real Estate $someAmount
glaccount92 Building $someAmount
glaccount99 Salaries $someAmount
I obviously made up the glaccounts just to show the purpose. You may have multiple unmatched accounts, where the null row's total amount is masking how many different gl account numbers were NOT found.
Once you find which are missing, you can check / confirm they SHOULD be in the mapping table.
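If it helps, a focused sketch (my addition, reusing the same names and filters as above) that returns only the unmatched account numbers would be:
select distinct gl.account_number
from wbr_global.gl_ap_details gl
LEFT JOIN sandbox.utr_fixed_mapping_na fm
    ON gl.account_number = fm.account_number
where gl.cost_center = '1172'
    and gl.period_name = 'JUL-21'
    and gl.ledger_name = 'Amazon.com, Inc.'
    -- keep only the rows with no mapping match
    and fm.account_number is null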
FEEDBACK.
Since you do realize the missing numbers, let's consider a Cartesian result. If there are multiple entries in the mapping table for the same G/L account number, you will get a Cartesian result, thus bloating your numbers. To clarify, let's say your mapping table has
Mapping file.
GL Descr1 NewMapping
1 test Salaries
1 testView Buildings
1 Another Depreciation
And your GL_AP_Details has
GL Amount
1 $100
Your total for the query would result in $300 because the query is trying to join the AP Details GL #1 to EACH of the entries in the mapping file thus bloating the amount. You could also add a COUNT(*) as NumberOfEntries to the query to see how many transactions it THINKS it is processing. Is there some "unique ID" in the GL_AP_Details table? If so, then you could also do a count of DISTINCT ID values. If they are different (distinct is lower than # of entries), I think THAT is your culprit.
select
fm.new_mapping_1,
sum(gl.amount),
count(*) as NumberOfEntries,
count( distinct gl.UniqueIdField ) as DistinctTransactions
from
wbr_global.gl_ap_details gl
LEFT JOIN sandbox.utr_fixed_mapping_na fm
ON gl.account_number = fm.account_number
Where
gl.cost_center = '1172'
and gl.period_name = 'JUL-21'
and gl.ledger_name = 'Amazon.com, Inc.'
Group by
fm.new_mapping_1
Might you also need to limit the mapping table for a specific prophecy or mec view?
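To confirm the Cartesian suspicion directly, a quick check of the mapping table itself (my sketch, same table name as above) lists every account number that maps more than once:
select account_number, count(*) as mapping_rows
from sandbox.utr_fixed_mapping_na
group by account_number
having count(*) > 1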
If you "think" that the result of an aggregate is wrong, then the easiest way to verify this is to select the individual rows that correlate to 1 record in the aggregate output and inspect the records, looking for duplications.
For instance, pick 'Building Management':
SELECT fixed.new_mapping_1,details.amount,*
FROM wbr_global.gl_ap_details details
LEFT JOIN sandbox.utr_fixed_mapping_na fixed ON details.account_number = fixed.account_number
WHERE details.cost_center = '1172'
AND details.period_name = 'JUL-21'
AND details.ledger_name = 'Amazon.com, Inc.'
AND fixed.new_mapping_1 = 'Building Management'
Notice that we tack on a ,* to the end of the projection; this will show you everything that the query has access to. You should look for repeating sections of data that you were not expecting; then, depending on which table they originate from, you might add additional criteria to the JOIN or to the WHERE, or you might need to group by additional columns.
This type of issue is really hard to comment on in a forum like this because it is highly specific to your schema, and the data contained within it, making solutions highly subjective to criteria you are not likely to publish online.
Generally, if you think a calculation is wrong, you need to manually compute it to verify. The advice above helps you inspect the data your query is using; you should either construct your own query or use other tools to build the data set that lets you manually compute the correct values, then work them back into (or replace) your original query.
The speed issues are out of scope here; we can comment on the poor schema design, but I suspect you don't have a choice. In the utr_fixed_mapping_na table you should make account_number the same column type as the source data, or add a new column that has the data in the original type; then you can set up indexes on the columns to improve the speed of the join.
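As a rough illustration of that last point (a sketch only; exact DDL syntax varies by engine, and the INT type is an assumption since the real column types were not posted):
alter table sandbox.utr_fixed_mapping_na
    add account_number_int int null; -- hypothetical column matching the GL side's type

update sandbox.utr_fixed_mapping_na
set account_number_int = cast(account_number as int); -- will fail on non-numeric values

create index ix_utr_mapping_account
    on sandbox.utr_fixed_mapping_na (account_number_int);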
Schema
Question: List all paying customers with users who had 4 or 5 activities during the week of February 15, 2021; also include how many of the activities sent were paid, organic and/or app store. (i.e. include a column for each of the three source types).
My attempt so far:
SELECT source_type, COUNT(*)
FROM activities
WHERE activity_time BETWEEN '02-15-21' AND '02-19-21'
GROUP BY source_type
I would like to get a second opinion on it. I didn't include the accounts table because I don't believe that I need it for this query, but I could be wrong.
Have you tried to run this? It doesn't satisfy the brief on FOUR counts:
List all the ... customers (that match criteria)
There is no customer information included in the results at all, so this is an outright fail.
paying customers
This is the top level criteria, only customers that are not free should be included in the results.
Criteria: users who had 4 or 5 activities
There has been no attempt to evaluate this user criteria in the query, and the results do not provide enough information to deduce it.
There is further ambiguity in this requirement: does it mean that it should only include results if the account has individual users that have 4 or 5 activities, or is it simply that the account should have 4 or 5 activities overall?
If this is a test question (clearly this is contrived, if it is not please ask for help on how to design a better schema) then the use of the term User is usually very specific and would suggest that you need to group by or otherwise make specific use of this facet in your query.
Bonus: (i.e. include a column for each of the three source types).
This is the only element that was attempted, as the data is grouped by source_type but the information cannot be correlated back to any specific user or customer.
Next time please include example data and the expected outcome with your post. In preparing the data for this post you would have come across these issues yourself and may have been inspired to ask a different question, or through the process of writing the post up you may have resolved the issue yourself.
Without further clarification, we can still start to evolve this query. A good place to start is to exclude the criteria and focus on the format of the output. The requirement mentions the following output requirements:
List Customers
Include a column for each of the source types.
Firstly, even though you don't think you need to, the request clearly states that Customer is an important facet in the output, and in your schema account holds the customer information, so although we do not need to, it makes the data readable by humans if we do include information from the account table.
This is a standard PIVOT-style response then: we want a row for each customer, presenting a count that aggregates each of the values for source_type. Most RDBMS support some variant of a PIVOT operator or function; however, we can achieve the same thing with simple CASE expressions that conditionally put a value into projected columns matching the values we want to aggregate, then use GROUP BY to evaluate the aggregation, in this case a COUNT.
The following syntax is for MS SQL, however you can achieve something similar easily enough in other RDBMS.
OP please tag this question with your preferred database engine...
NOTE: there is NO filtering in this query... yet
SELECT accounts.company_id
, accounts.company_name
, paid = COUNT(st_paid)
, organic = COUNT(st_organic)
, app_store = COUNT(st_app_store)
FROM activities
INNER JOIN accounts ON activities.company_id = accounts.company_id
-- PIVOT the source_type
CROSS APPLY (SELECT st_paid = CASE source_type WHEN 'paid' THEN 1 END
,st_organic = CASE source_type WHEN 'organic' THEN 1 END
,st_app_store = CASE source_type WHEN 'app store' THEN 1 END
) as PVT
GROUP BY accounts.company_id, accounts.company_name
This results in the following shape of result:
company_id  company_name  paid  organic  app_store
apl01       apples        4     8        0
ora01       oranges       6     12       0
Criteria
When you are happy with the shape of the results and that all the relevant information is available, it is time to apply the criteria to filter this data.
From the requirement, the following criteria can be identified:
paying customers
The spec doesn't mention paying specifically, but it does include a note that (free customers have current_mrr = 0)
Now aren't we glad we did join on the account table :)
users who had 4 or 5 activities
This is very specific about explicitly 4 or 5 activities, no more, no less.
For the sake of simplicity, let's assume that the user facet of this requirement is not important and that it is simply a reference to all users on an account, not just users who have individually logged 4 or 5 activities on their own - this would require more demo data than I care to manufacture right now to prove.
during the week of February 15, 2021.
This one was correctly identified in the original post, but we need to call it out just the same.
OP has used Monday to Friday of that week; there is no mention that weeks start on a Monday or that they end on a Friday, but we'll go along, as it's only the syntax we need to explore today.
In the real world the actual values specified in the criteria should be parameterised, mainly because you don't want to manually re-construct the entire query every time, but also to sanitise input and prevent SQL injection attacks.
Even though it seems overkill for this post, using parameters even in simple queries helps to identify the variable elements, so I will use parameters for the 2nd criteria to demonstrate the concept.
DECLARE @from DateTime = '2021-02-15' -- Date in ISO format
DECLARE @to DateTime = DATEADD(d, 5, @from) -- 2021-02-20 00:00, so all of Friday 2021-02-19 is included
/* NOTE: requirement only mentioned the start date, not the end
so your code should also only rely on the single fixed start date */
SELECT accounts.company_id, accounts.company_name
, paid = COUNT(st_paid), organic = COUNT(st_organic), app_store = COUNT(st_app_store)
FROM activities
INNER JOIN accounts ON activities.company_id = accounts.company_id
-- PIVOT the source_type
CROSS APPLY (SELECT st_paid = CASE source_type WHEN 'paid' THEN 1 END
,st_organic = CASE source_type WHEN 'organic' THEN 1 END
,st_app_store = CASE source_type WHEN 'app store' THEN 1 END
) as PVT
WHERE -- paid accounts = exclude 'free' accounts
accounts.current_mrr > 0
-- Date range filter
AND activity_time BETWEEN @from AND @to
GROUP BY accounts.company_id, accounts.company_name
-- The fun bit, we use HAVING to apply a filter AFTER the grouping is evaluated
-- Wording was explicitly 4 OR 5, not BETWEEN so we use IN for that
HAVING COUNT(source_type) IN (4,5)
I believe you are missing some information there.
Without more information on the tables, I can only guess that you also have a customer table. I am going to assume there is a customer_id key that serves as the key between both tables.
I would take your query and do something like:
SELECT customer_id,
    COUNT(*) AS total,
    MAX(CASE WHEN source_type = 'app' THEN num_operations END) AS app_totals,
    MAX(CASE WHEN source_type = 'paid' THEN num_operations END) AS paid_totals,
    MAX(CASE WHEN source_type = 'organic' THEN num_operations END) AS organic_totals
FROM (
    SELECT customer_id, source_type, COUNT(*) AS num_operations
    FROM activities
    WHERE activity_time BETWEEN '2021-02-15' AND '2021-02-19'
    GROUP BY customer_id, source_type
) tb1
GROUP BY customer_id
This is the most generic case I can think of, but it does not scale very well. If you get new source types, you need to modify the query, and the structure of the output table also changes. Depending on the SQL engine you are using (e.g. MySQL vs Microsoft SQL Server), you could also use a pivot function.
The previous query is a little bit rough, but it will give you a general idea. You can add "ELSE" statements to the clause, to zero the fields when they have no values, and join with the customer table if you want only active customers, etc.
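For instance, a hedged variant of the above (my sketch; customer_id is still an assumed column name) that uses ELSE 0 so empty buckets show 0 instead of NULL:
SELECT customer_id,
    SUM(num_operations) AS total,
    SUM(CASE WHEN source_type = 'app' THEN num_operations ELSE 0 END) AS app_totals,
    SUM(CASE WHEN source_type = 'paid' THEN num_operations ELSE 0 END) AS paid_totals,
    SUM(CASE WHEN source_type = 'organic' THEN num_operations ELSE 0 END) AS organic_totals
FROM (
    SELECT customer_id, source_type, COUNT(*) AS num_operations
    FROM activities
    WHERE activity_time BETWEEN '2021-02-15' AND '2021-02-19'
    GROUP BY customer_id, source_type
) tb1
GROUP BY customer_id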
I am trying to write a query that looks for people who have a certain code with the latest period (year), but not if they have another code with that latest period (year). I'll be explicit just so my example makes sense.
I want people who have the code A1, A2, A3, A4, or A5, but not AG, AP, or AQ. There are people who have an A1 code for a period (like 2014) and an AG code for the same period; I'd like to exclude them. Not everyone has a code, so the field value could be NULL.
Is there a way to express this in a different way (i.e. less characters) than the way I did?
SELECT
people.firstName
FROM
people
WHERE EXISTS (
SELECT *
FROM codes
WHERE
codes.people_id = people.id
AND period = (SELECT MAX(period) FROM codes codes2 WHERE codes2.people_id = codes.people_id)
AND code LIKE 'A[1-5]'
)
AND NOT EXISTS (
SELECT *
FROM codes
WHERE
codes.people_id = people.id
AND period = (
SELECT MAX(period)
FROM codes codes2
WHERE codes2.people_id = codes.people_id
)
AND code LIKE 'A[GPQ]'
)
Schema is as follows:
People
id (PK)
firstName
Codes
people_id (FK) many to one relation with People table
code (e.g. "A1", "A2", "AG")
period (e.g. "2013", "2014")
There are so many ways you could do that. I'm not an SQL expert, but I can't see your query being too bad. If you want to try to reduce the number of sub-queries, you could consider using the GROUP BY clause along with a SUM aggregate function in a HAVING clause.
I started updating your code as follows:
SELECT
people.firstName
FROM
people
LEFT JOIN codes AS a15 ON a15.people_id = people.id AND a15.code LIKE 'A[1-5]'
LEFT JOIN codes AS agpq ON agpq.people_id = people.id AND agpq.code LIKE 'A[GPQ]'
GROUP BY
people.firstName
HAVING
SUM(CASE WHEN a15.code IS NULL THEN 0 ELSE 1 END) > 0
AND SUM(CASE WHEN agpq.code IS NULL THEN 0 ELSE 1 END) = 0
This however doesn't take into account anything to do with period specific requirements described. You could add the period to the GROUP BY clause or add it to a WHERE or one of the JOIN constraints but I'm not quite sure from your description exactly what you're after (I don't believe this is through any fault of your own, I just can't personally align the code provided to the description).
I would also like to point out that the SUM functions above will not give an accurate count of the number of matching codes. This is because if both A[GPQ] and A[1-5] return at least one row, the number returned by each constraint will be multiplied by the number returned for the other. It can, however, be used to determine whether there are "any" returned items: if the criteria is matched, it will have a SUM(...) > 0.
I'm sure a more experienced SQL Developer / DBA will be able to poke many holes in my proposed query but it might give them or someone else something to work from and hopefully gives you ideas for alternatives to using sub-queries.
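For what it's worth, one way to fold the latest-period requirement into the GROUP BY/HAVING approach (a sketch only, keeping the SQL Server-style LIKE character classes from the question) might be:
SELECT p.firstName
FROM people p
JOIN codes c ON c.people_id = p.id
-- restrict each person to their own latest period
WHERE c.period = (SELECT MAX(c2.period) FROM codes c2 WHERE c2.people_id = p.id)
GROUP BY p.id, p.firstName
HAVING SUM(CASE WHEN c.code LIKE 'A[1-5]' THEN 1 ELSE 0 END) > 0
   AND SUM(CASE WHEN c.code LIKE 'A[GPQ]' THEN 1 ELSE 0 END) = 0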
I'm not quite sure of the best way to phrase this particular query, so I hope the title is adequate; however, I will attempt to describe what it is I need to understand how to do. Just to clarify, this is for Oracle SQL.
We have a table called assessments. There are different kinds of assessments within this table, however, some assessments should follow others in a logical order and within set time frames. The problems come in when a client has multiple assessments of the same type, as we have to use a fairly inefficient array formula in excel to identify which 'full' assessment corresponds with the 'initial' assessment.
I have an earlier query that was resolved on this site (Returning relevant date from multiple tables including additional table info) which I believe includes a lot of the logic for what is required (particularly in identifying a corresponding event which has occurred within a specified timeframe). However, whilst that query pulls data from 3 separate tables (assessments, events, responsibilities), I now need to create a query that generates a similar outcome but pulls from 1 main table and a 2nd table to return worker information. I thought the most logical way would be to create a query that looks at the assessment table with one type of assessment, and then joins to the assessment table again (possibly a temporary table?) with the assessment type that would follow the initial one.
For example:
Table 1 (Assessments):
Client ID Assessment Type Start End
P1 1 Initial 01/01/2012 05/01/2012
Table 2 (Assessments temp?):
Client ID Assessment Type Start End
P1 2 Full 12/01/2012
Table 3:
ID Worker Team
1 Bob Team1
2 Lyn Team2
Result:
Client ID Initial Start Initial End Initial Worker Full Start Full End
P1 1 01/01/2012 05/01/2012 Bob 12/01/2012
So table 1 and table 2 draw from the same table, except each brings back a different assessment type. Ideally, there'd be a check to make sure that the 'full' assessment started within X days of the end of the 'initial' assessment (similar to the 'likely' check in the previous query mentioned earlier). If this can be achieved, it's probably worth mentioning that I'd also be interested in expanding this to look at multiple assessment types, as over a cycle a client could be expected to have 4 or 5 different types of assessment. Any pointers would be appreciated; I've already had a great deal of help from this community, which is very valuable.
Edit:
Edited to include the solution, following MB's advice.
Select
    *
From (
    Select
        I.ASM_SUBJECT_ID as PNo,
        I.ASM_ID as IAID,
        I.ASM_QSA_ID as IAType,
        I.ASM_START_DATE as IAStart,
        I.ASM_END_DATE as IAEnd,
        nvl(olm_bo.get_ref_desc(I.ASM_OUTCOME,'ASM_OUTCOME'),'') as IAOutcome,
        C.ASM_ID as CAID,
        C.ASM_QSA_ID as CAType,
        C.ASM_START_DATE as CAStart,
        C.ASM_END_DATE as CAEnd,
        nvl(olm_bo.get_ref_desc(C.ASM_OUTCOME,'ASM_OUTCOME'),'') as CAOutcome,
        ROUND(C.ASM_START_DATE - I.ASM_START_DATE, 0) as "Likely",
        row_number() over (
            PARTITION BY I.ASM_ID
            ORDER BY abs(I.ASM_START_DATE - C.ASM_START_DATE)
        ) as "Row Number"
    FROM O_ASSESSMENTS I
    left join O_ASSESSMENTS C
        on I.ASM_SUBJECT_ID = C.ASM_SUBJECT_ID
        and C.ASM_QSA_ID IN ('AA523','AA1326')
        and ROUND(C.ASM_START_DATE - I.ASM_START_DATE, 0) >= -2
        and ROUND(C.ASM_START_DATE - I.ASM_START_DATE, 0) <= 25
        and C.ASM_OUTCOME <> 'ABANDON'
    Where I.ASM_QSA_ID IN ('AA501','AA1323')
        AND I.ASM_OUTCOME <> 'ABANDON'
        AND I.ASM_END_DATE >= '01-04-2011'
)
WHERE "Row Number" = 1
You can access the same table multiple times in a given query in SQL, simply by using table aliases. So one way of doing this would be:
select i.client,
i.id initial_id,
i.start initial_start,
i.end initial_end,
w.worker initial_worker,
f.id full_id,
f.start full_start,
f.end full_end
from assessments i
join workers w on i.id = w.id
left join assessments f
on i.client = f.client and
f.assessment_type = 'Full' and
f.start between i.end and i.end + X
/* replace X with appropriate number of days */
where i.assessment_type = 'Initial'
Note: column names such as end (that are reserved words in Oracle SQL) should normally be double-quoted, but from the previous question it looks as though these are simplified versions of the actual column names.
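A quick hypothetical illustration of that quoting, in case the real columns do use reserved words:
-- assumes the column was created with an upper-case quoted name
select i.client, i."END" as initial_end
from assessments i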
From your post, I assume that you're using Oracle here (as I see "Oracle" in the question).
In terms of "temp" tables, Views come right to mind. An Oracle View can give you different looks of a table which is what it sounds like you're looking for with different kinds of assessments.
Don Burleson is a good source for anything Oracle related and he gives some tips on Oracle Views at http://www.dba-oracle.com/concepts/views.htm
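As a rough sketch of the views idea (table and column names assumed from the question above; this just pre-filters each assessment type so the later join reads more simply):
CREATE OR REPLACE VIEW initial_assessments AS
    SELECT * FROM assessments WHERE assessment_type = 'Initial';

CREATE OR REPLACE VIEW full_assessments AS
    SELECT * FROM assessments WHERE assessment_type = 'Full';
The self-join from the previous answer could then be written as a plain join between the two views.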
I have a Person table with a huge number of records (about 16 million), and have a requirement to find all persons with the same last name, first letter of first name, and birth year; in other words, I want to show presumed duplicate persons in the UI for users to analyze and decide whether they are the same person or not.
Here is the query I write
SELECT *
FROM Person INNER JOIN
(
SELECT SUBSTRING(firstName, 1, 1) firstNameF,lastName,YEAR(birthDate) birthYear
FROM Person
GROUP BY SUBSTRING(firstName, 1,1),lastName,YEAR(birthDate)
HAVING count(*) > 1
) as dupPersons
ON SUBSTRING(Person.firstName,1,1) = dupPersons.firstNameF and Person.lastName = dupPersons.lastName and YEAR(Person.birthDate) = dupPersons.birthYear
order by Person.lastName,Person.firstName
but as I am not an SQL expert, I want to know: is this a good way to do that? Is there a more optimized way?
EDIT
Note that I can partition the data, which could help with optimization;
for example, if I split the data in two it could return two groups of persons:
Johan Smith |
Jane Smith  | same last name and first-name initial
Jack Smith  |
Mark Tween  | same last name and first-name initial
Mac Tween |
If the performance using a GROUP BY is not adequate, You could try using an INNER JOIN
SELECT *
FROM Person p1
INNER JOIN Person p2 ON p2.PersonID > p1.PersonID
WHERE SUBSTRING(p2.Firstname, 1, 1) = SUBSTRING(p1.Firstname, 1, 1)
AND p2.LastName = p1.LastName
AND YEAR(p2.BirthDate) = YEAR(p1.BirthDate)
ORDER BY
p1.LastName, p1.FirstName
Well, if you're not an expert, the query you wrote says to me that you're at least pretty competent. When we look at whether a query is "optimized", there are two immediate parts to that: 1. The query just on its own has something notably wrong with it - a bad join, keyword misuse, exploding result set size, superstitions about NOT IN, etc. 2. The context that the query operates within - DB specifics, task specifics, etc.
Your query passes #1, no problem. I would have written it differently - aliased the Person table, used LEFT(P.FirstName, 1) instead of SUBSTRING, and used a CTE (WITH-clause) instead of a subquery. But these aren't optimization issues. Maybe I'd use WITH(READUNCOMMITTED) if the results weren't sensitive to dirty reads. Out of any further context, your query doesn't look like a bomb waiting to go off.
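To illustrate those stylistic preferences (a sketch only; same logic and results as your original query):
WITH DupKeys AS
(
    SELECT LEFT(firstName, 1) AS firstInitial, lastName, YEAR(birthDate) AS birthYear
    FROM Person
    GROUP BY LEFT(firstName, 1), lastName, YEAR(birthDate)
    HAVING COUNT(*) > 1
)
SELECT P.*
FROM Person P
INNER JOIN DupKeys D
    ON LEFT(P.firstName, 1) = D.firstInitial
    AND P.lastName = D.lastName
    AND YEAR(P.birthDate) = D.birthYear
ORDER BY P.lastName, P.firstName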
As for #2 - You should probably switch to specifics. Like "I have to run this every week. It takes 17 minutes. How can I get it down to under a minute?" Then people will ask you what your plan looks like, what indexes you have, etc.
Things I'd want to know:
How long does it already take to run?
What's your runtime window? (User & app tolerance for query time.)
Is this run once a day? Week? Month? Quarter?
Do you have the permission to create tables, change current tables, or alter indexes?
Maybe based on having run it, what's the ratio of duplicates you're expecting to find? 5%? 90%?
How stable is the matching criteria requirement?
Example scenario: If this were a run-on-command feature that will be in my app indefinitely, run weekly, with 10% or fewer records expected to be duplicates, with the ability to change the DB how I'd like, and if the duplicate matching criteria were firm (not fluctuating) and I wanted to cut it from 90s to 5s, I'd create a dedicated BirthYear column (possibly a persisted computed column off of BirthDate) and an index on LastName ASC, BirthYear ASC, FirstName ASC. If too many of those stipulations change, I might go in a different direction entirely.
You can try something like this and see the difference on the execution plans, or benchmark the results on performance:
;WITH DupPersons AS
(
SELECT *, COUNT(1) OVER(PARTITION BY SUBSTRING(firstName, 1, 1), lastName, YEAR(birthDate)) Quant
FROM Person
)
SELECT *
FROM DupPersons
WHERE Quant > 1
Of course, it would also help to know your table definition and the indexes you have created. It may help to add a computed column with the year of birthDate and create an index on it, and likewise for the first letter of firstName.
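A hedged sketch of that idea in SQL Server syntax (PERSISTED so the computed columns can be indexed; adapt for your engine):
ALTER TABLE Person ADD BirthYear AS YEAR(birthDate) PERSISTED;
ALTER TABLE Person ADD FirstInitial AS SUBSTRING(firstName, 1, 1) PERSISTED;
-- index column order matches the duplicate-detection grouping
CREATE INDEX IX_Person_DupCheck ON Person (lastName, BirthYear, FirstInitial);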