SQL index match to find duplicate data - sql

I have the following table
Code Name Task
aa jones DC
ab dave DC
aca james IF
aca james DC
ab trevor IF
aa jones IF
ag francis DC
ag francis IF
af derek SF
af derek DC
This is a very big table, above is just a quick example.
So, I would like some help finding the code and name that have completed a IF or SF task and a DC task.
I would like it to show where one person has touched both of these tasks. The hierarchy of the tasks is; it comes in as either a SF or IF then someone will do that, then off the back of that we receive a DC task, and I want the ones where it has been completed by the same person, with the same reference number.
I am able to do this in excel with an INDEX MATCH function, but this takes up a tremendous amount of calculation time due to the size of the table.

One way to approach this is using group by with a having. This is a flexible way of expressing these types of conditions:
select code, name
from table t
group by code, name
having sum(case when task = 'DC' then 1 else 0 end) > 0 and
sum(case when task in ('IF', 'SF') then 1 else 0 end) > 0;
Each condition in the having clause counts the number of rows that meet the particular condition. The first, for instance, counts the rows that match 'DC' and takes only the code, name pairs that have at least one such match.

SELECT code,name FROM YOUR_TABLE_NAME WHERE task = 'DC' AND (task = 'IF' OR task = 'SF') GROUP BY name
try this query

Gordon Linoff's query can be made easier under the hypothesys that IF and SF are synonym and cannot be both present for the same Code-Name couple, as the data provided by the OP suggests
SELECT code, name
FROM table t
GROUP BY code, name
HAVING SUM(CASE WHEN task IN ('IF', 'SF', 'DC') THEN 1 ELSE 0 END) = 2;

select code,name from (select distinct code,name from table1 where task='SF' or task='IF') as temp1 inner join (select distinct code as code2,name as name2 from table1 where task='DC') as temp2 on code=code2,name=name2;
I'm assuming that you have the table in table1. The code constructs two tables temp1 and temp2. temp1 contains those codes and names which have been assigned SF and IF. temp2 contains those codes and names which have been assigned DC. Finally, I join the two tables together to find code-name pairs in both tables. This is faster than in Excel because the database engine probably temporarily indexes the columns being joined on.
Actually, you can do this in Excel. You sort the table by code and name, then enter the following formulas (assuming "Code" is in A1):
D2=if(and(A2=A1,B2=B1,D1),true,or(C2="IF",C2="SF"))
E2=if(and(A2=A1,B2=B1,E1),true,C2="DC")
Select these two cells, and double-click the fill-handle (the little square at the bottom right of the selection). Then, with the two columns selected, copy, and then "Paste Special..." > "Values". Then, filter (Alt-D-F-F) for the rows with values in columns D and E being both true. That is the result you want. Select these rows and copy to a new sheet if desired.
Alternatively, you can follow the SQL "group by" solution given by Gordon, so that you do not need to sort: Create two new columns like the above, but:
D1: "D"
E1: "E"
D2=if(or(C2="IF",C2="SF"),1,0)
E2=if(C2="DC",1,0)
Then, "Insert" > "PivotTable", drag "Code" and "Name" to be row labels. Drag "D" to be under Values, click on it, "Value Field Settings...", and then select "Max". Do the same for "E", and then the rows with 1 in both D and E will be the result you want.

Related

Counting from different categories within the same query

I am trying to make a query from a table in Access that would give me totals for different types of product based off of 2 categories, all within one query. For example my Table looks as follows:
Type
Description 1
Description 2
Date
New
Shiny
Black
1/1/2022
New
Black
Dull
1/1/2022
Old
Shiny
Grey
1/1/2022
Old
Grey
Dull
1/1/2022
The query results that I want to receive are as follows:
Description
New
Old
Shiny
1
1
Black
2
0
Dull
1
1
Grey
0
2
The dataset that I am working with isn't as clean as my example shown here and is causing some of the issues. I never had an issue with the code running, but I just felt that there had to be an easier way that I was missing.
They way I was doing it originally just turned into a bunch of separate query's and was messy to get around. I essentially wrote a query to separate the table into new and old types. From there I used a bunch of
SUM(IIF( Description 1 = "x" OR Description 2 = "x") AS X
SUM(IIF( Description 1 = "y" OR Description 2 = "y") AS Y
expressions to count my totals for each of the objects. This would give me a query where all the totals were displayed in columns. Then I created a separate query to join these data sets together into a presentable manner, but it was turning into too much for how many different "types" I had.
I was just looking for a way to combine all of this into 1 query that would make pulling reports much easier.
Strongly advise not to use space in naming convention nor reserved words as names. Date is a reserved word.
Consider:
Query1
SELECT Type, Description1 AS D, [Date], 1 AS Category FROM Table1
UNION SELECT Type, Description2, [Date], 2 FROM Table1;
UNION will not allow duplicate rows. Use UNION ALL to include all records, even if there are duplicates. There is no query designer or wizard for UNION - must type or copy/paste in SQLView of query builder.
Query2
TRANSFORM Nz(Count(Query1.Category),0) AS CountOfCategory
SELECT Query1.D
FROM Query1
GROUP BY Query1.D
PIVOT Query1.Type;

How do I stop my query from pulling duplicates?

Yes, I know this seems simple:
SELECT DISTINCT(...)
Except, it apparently isn't
Here is my actual Query:
SELECT
DeclinationReasons.Reason,
EmployeeInformation.ID,
EmployeeInformation.Employee,
EmployeeInformation.Active,
CompletedTrainings.DecShotDate,
CompletedTrainings.DecShotLocation,
CompletedTrainings.DecReason,
CompletedTrainings.DecExplanation,
IIf([DecShotLocation]="MCS","Yes","No") AS YesMCS,
IIf([DecReason]=1,1,0) AS YesAllergy,
IIf([DecReason]=2,1,0) AS YesImmune,
IIf([DecReason]=3,1,0) AS YesAdverse,
IIf([DecReason]=4,1,0) AS YesMedical,
IIf([DecReason]=5,1,0) AS YesSpiritual,
IIf([DecReason]=6,1,0) AS YesOther,
IIf([DecReason]=7,1,0) AS YesAlready
FROM
EmployeeInformation
INNER JOIN (CompletedTrainings
LEFT JOIN DeclinationReasons ON CompletedTrainings.DecReason = DeclinationReasons.ReasonID)
ON EmployeeInformation.ID = CompletedTrainings.Employee
GROUP BY
DeclinationReasons.Reason,
EmployeeInformation.ID,
EmployeeInformation.Employee,
EmployeeInformation.Active,
CompletedTrainings.DecShotDate,
CompletedTrainings.DecShotLocation,
CompletedTrainings.DecReason,
CompletedTrainings.DecExplanation,
IIf([DecShotLocation]="MCS","Yes","No"),
IIf([DecReason]=1,1,0),
IIf([DecReason]=2,1,0),
IIf([DecReason]=3,1,0),
IIf([DecReason]=4,1,0),
IIf([DecReason]=5,1,0),
IIf([DecReason]=6,1,0),
IIf([DecReason]=7,1,0)
HAVING
((((EmployeeInformation.Active) Like -1)
AND ((CompletedTrainings.DecShotDate + 365 >= DATE())
OR (CompletedTrainings.DecShotDate IS NULL))));
This is Joining a few tables (obviously) in order to get a number of records. The problem is that if someone is duplicated on the table with a NULL in one of the date fields, and a date in another field, it pulls both the NULL and the DATE, or pulls multiple NULLS it might pull multiple dates but those are not present right at the moment.
I need the Nulls, they are actual data in this particular case, but if someone has a date and a NULL I need to pull only the newest record, I thought I could add MAX(RecordID) from the table, but that didn't change the results of the query either.
That code:
SELECT
DeclinationReasons.Reason,
EmployeeInformation.ID,
EmployeeInformation.Employee,
EmployeeInformation.Active,
MAX(CompletedTrainings.RecordID),
CompletedTrainings.DecShotDate
...
And it returned the same issue, Duplicated EmployeeInformation.ID with different DecShotDate values.
Currently it returns:
ID
Active
DecShotDate
etc. x a bunch
1
-1
date date
whatever goes
2
-1
in these
2
-1
date date
columns
These are being used in a report, that is to determine the total number of employees who fit the criteria of the report. The NULLs in DecShotDate are needed as they show people who did not refuse to get a flu vaccine in the current year, while the dates are people who did refuse.
Now I have come up with one simple solution, I could add a column to the CompletedTrainings Table that contains a date or other value, and add that to the HAVING statement. This might be the right solution as this is a yearly training questionnaire that employees have to fill out. But I am asking for advice before doing this.
Am I right in thinking I need to add a column to filter by so that older data isn't being pulled, or should I be able to do this by pulling recordID, and did I just bork that part of the query up?
Edited to add raw table views:
EmployeeInformation Table:
ID
Last
First
empID
Active
Termdate
DoH
Title
PT/FT/PD
PI
1
Doe
Jane
982
-1
date
Sr
PD
X
2
Roe
John
278
0
date
date
Jr
PD
X
3
Moe
Larry
1232
-1
date
Sr
FT
X
4
Zoe
Debbie
1424
-1
date
Sr
PT
X
DeclinationReasons Table:
ReasonID
Reason
1
Allergy
2
Already got it
3
Illness
CompletedTrainings Table:
RecordID
Employee
Training
...
DecShotdate
DecShotLocation
DecShotReason
DecExp
1
1
4
date
location
2
text
2
1
4
3
2
4
4
3
4
date
location
3
text
5
3
4
date
location
1
text
6
4
4
After some serious soul searching, I decided to use another column and filter by that.
In the end my query looks like this:
SELECT *
FROM (
(
SELECT RecordID, DecShotDate, DecShotLocation, DecReason, DecExplanation, Employee,
IIf([DecShotLocation]="MCS","Yes","No") AS YesMCS, IIf([DecReason]=1,1,0) AS YesAllergy,
IIf([DecReason]=2,1,0) AS YesImmune, IIf([DecReason]=3,1,0) AS YesAdverse,
IIf([DecReason]=4,1,0) AS YesMedical, IIf([DecReason]=5,1,0) AS YesSpiritual,
IIf([DecReason]=6,1,0) AS YesOther, IIf([DecReason]=7,1,0) AS YesAlready
FROM CompletedTrainings WHERE (CompletedDate > DATE() - 365 ) AND (Training = 69)) AS T1
LEFT JOIN
(
SELECT ID, Active FROM EmployeeInformation) AS T2 ON T1.Employee = T2.ID)
LEFT JOIN
(
SELECT Reason, ReasonID FROM DeclinationReasons) AS T3 ON T1.DecReason = T3.ReasonID;
This may not have been the best solution, but it did exactly what I needed. Which is to get the information by latest entry into the database.
Previously I had tried to use MAX(), DISTINCT(), etc. but always had a problem of multiple records being retrieved. In this case, I intentionally SELECT the most recent records first, then join them to the results of the next query, and so on. Until I have all the required data for my report.
I write this in hopes someone else finds it useful. Or even better if someone tells me why this is wrong, so as to improve my own skills.

SQL JOIN with CASE statement result

Is there any way of joining the result of a case statement with a reference table without creating a CTE, ect.
Result AFTER CASE statement:
ID Name Bonus Level (this is the result of a CASE statement)
01 John A
02 Jim B
01 John B
03 Jake C
Reference table
A 10%
B 20%
C 30%
I want to then get the % next to each employee, then the max %age using the MAX function and grouping by ID, then link it back again to the reference so that each employee has the single correct (highest) bonus level next to their name. (This is a totally fictitious scenario, but very similar to what I am looking for).
Just need help with joining the result of the CASE statement with the reference table.
Thanks in advance.
In place of a temporary value as the result of the case statement, you could use a select statement from the reference table.
So if your case statement looks like:
case when variable1=value then bonuslevel =A
Then, replacing it like this might help
case when variable1=value then (select percentage from ReferenceTable where variable2InReferenceTable=A)
Don't know if I am overly simplifying, but based on the results of your case result query, why not just join that to the reference table, and do a max grouped by ID/Name. Since the ID and persons name wont change anyhow since they are the same person, you are just getting the max you want. To complete the Bonus level, rejoin just that portion after the max percentage determined for the person.
select
lvl1.ID,
lvl1.Name,
lvl1.FinalBonus,
rt2.BonusLvl
from
( select
PQ.ID,
PQ.Name,
max( rt.PcntBonus ) as FinalBonus
from
(however you
got your
data query ) PQ
JOIN RefTbl rt
on PQ.BonusLvl = rt.BonusLvl
) lvl1
JOIN RefTbl rt2
on lvl1.FinalBonus = rt2.PcntBonus
Since the Bonus levels (A,B,C) do not guarantee corresponding % levels (10,20,30), I did it this way... OTHERWISE, you could have just used max() on both the bonus level and percent. But what if your bonus levels were listed as something like
Limited 10%
Aggressive 20%
Ace 30%
You could see that a max of the level above would have "Limited", but the max % = 30 is associated with an "Ace" sales rep... Get the 30% first, then see what the label that matched that is.

Unable to remove duplicates with subquery

I had a built a query previously that returned zero duplicates until I decided to join in a couple more tables. Now that I've joined them in, I'm unable to create the desired flag due to duplicates being returned. I've attached the scenario below as an example.
I only want one occurrence of each reference number (123456789). I want to create a flag when certain criteria are met. For example, I want to see when reference numbers for a certain account meet "X", but when I join the table I get every instance of that reference number in the joined table.
REFNO BEG END STATUS
123456789 123 456 E
123456789 456 789 E
123456789 789 012 A
I want to see all of the REFNO's based on other parameters set in the query, but I want a flag for anything where END = '012'. I can't left join to the table because I will get all three lines. If I do an inner join then I just get the 012 lines. I Tried the code below in my select statement to only pull when that scenario exists, but I'm getting wacky returns and don't know why. I feel like this should be fairly easy to accomplish, but I can't wrap my head around how to create a flag for just that scenario without getting duplicates or removing results with an inner join.
,(CASE WHEN EXISTS(SELECT 1
FROM QW.ABCD Z
WHERE Z.ABCD = P.ABCD
AND Z.END = '012'
AND Z.TIMESTAMP IS NULL
AND Z.STATUS IN ('A','E'))
THEN 'Y' ELSE 'N'
END)
AS "FLAG"
Please help as I'm not sure what I'm doing wrong to get the flag I want to see.
I am not sure id DB2 allows this combination of UNION and GROUP BY, but what you want is probably something like this:
SELECT 'N' AS FLAG, REF_NO,MIN(BEG),MAX(THEEND) FROM TAB_QW A
WHERE NOT EXIST (SELECT * FROM TAB_QW B WHERE B.REF_NO = A.REF_NO AND B.THEEND = '012')
GROUP BY REF_NO
UNION
SELECT 'Y' AS FLAG, REF_NO,MIN(BEG),MAX(THEEND) FROM TAB_QW C
WHERE EXISTS (SELECT * FROM TAB_QW D WHERE D.REF_NO = C.REF_NO AND D.THEEND = '012')
GROUP BY REF_NO

mysql query to dynamically convert row data to columns

I am working on a pivot table query.
The schema is as follows
Sno, Name, District
The same name may appear in many districts eg take the sample data for example
1 Mike CA
2 Mike CA
3 Proctor JB
4 Luke MN
5 Luke MN
6 Mike CA
7 Mike LP
8 Proctor MN
9 Proctor JB
10 Proctor MN
11 Luke MN
As you see i have a set of 4 distinct districts (CA, JB, MN, LP). Now i wanted to get the pivot table generated for it by mapping the name against districts
Name CA JB MN LP
Mike 3 0 0 1
Proctor 0 2 2 0
Luke 0 0 3 0
i wrote the following query for this
select name,sum(if(District="CA",1,0)) as "CA",sum(if(District="JB",1,0)) as "JB",sum(if(District="MN",1,0)) as "MN",sum(if(District="LP",1,0)) as "LP" from district_details group by name
However there is a possibility that the districts may increase, in that case i will have to manually edit the query again and add the new district to it.
I want to know if there is a query which can dynamically take the names of distinct districts and run the above query. I know i can do it with a procedure and generating the script on the fly, is there any other method too?
I ask so because the output of the query "select distinct(districts) from district_details" will return me a single column having district name on each row, which i will like to be transposed to the column.
You simply cannot have a static SQL statement returning a variable number of columns. You need to build such statement each time the number of different districts changes. To do that, you execute first a
SELECT DISTINCT District FROM district_details;
This will give you the list of districts where there are details. You then build a SQL statement iterating over the previous result (pseudocode)
statement = "SELECT name "
For each row returned in d = SELECT DISTINCT District FROM district_details
statement = statement & ", SUM(IF(District=""" & d.District & """,1 ,0)) AS """ & d.District & """"
statement = statement & " FROM district_details GROUP BY name;"
And execute that query. You'll then need have to handle in your code the processing of the variable number of columns
a) "For each " is not supported in MySQL stored procedures.
b) Stored procedures cannot execute prepared statements from concatenated strings using so called dynamic SQL statements, nor can it return results with more than One distinct row.
c) Stored functions cannot execute dynamic SQL at all.
It is a nightmare to keep track of once you got a good idea and everyone seems to debunk it before they think "Why would anyone wanna..."
I hope you find your solution, I am still searching for mine.
The closes I got was
(excuse the pseudo code)
-> to stored procedure, build function that...
1) create temp table
2) load data to temp table from columns using your if statements
3) load the temp table out to INOUT or OUT parameters in a stored procedure as you would a table call... IF you can get it to return more than one row
Also another tip...
Store your districts as a table conventional style, load this and iterate by looping through the districts marked active to dynamically concatenate out a querystring that could be plain text for all the system cares
Then use;
prepare stmName from #yourqyerstring;
execute stmName;
deallocate prepare stmName;
(find much more on the stored procedures part of the mysql forum too)
to run a different set of districts every time, without having to re-design your original proc
Maybe it's easier in numerical form.
I work on plain text content in my tables and have nothing to sum, count or add up
The following assumes you want matches of distinct (name/district) pairs. I.e. Luke/CA and Duke/CA would yield two results:
SELECT name, District, count(District) AS count
FROM district_details
GROUP BY District, name
If this is not the case simply remove name from the GROUP BY clause.
Lastly, notice that I switched sum() for count() as you are trying to count all of the grouped rows rather than getting a summation of values.
Via comment by #cballou above, I was able to perform this sort of function which is not exactly what OP asked for but suited my similar situation, so adding it here to help those who come after.
Normal select statement:
SELECT d.id ID,
q.field field,
q.quota quota
FROM defaults d
JOIN quotas q ON d.id=q.default_id
Vertical results:
ID field quota
1 male 25
1 female 25
2 male 50
Select statement using group_concat:
SELECT d.id ID,
GROUP_CONCAT(q.fields SEPARATOR ",") fields,
GROUP_CONCAT(q.quotas SEPARATOR ",") quotas
FROM defaults d
JOIN quotas q ON d.id=q.default_id
Then I get comma-separated fields of "fields" and "quotas" which I can then easily process programmatically later.
Horizontal results:
ID fields quotas
1 male,female 25,25
2 male 50
Magic!