Complex SQL query help needed

I'm not sure how to write this query in SQL. There are two tables:
**GroupRecords**
Id (int, primary key)
Name (nvarchar)
SchoolYear (datetime)
RecordDate (datetime)
IsUpdate (bit)
**People**
Id (int, primary key)
GroupRecordsId (int, foreign key to GroupRecords.Id)
Name (nvarchar)
Bio (nvarchar)
Location (nvarchar)
Return a distinct list of people who belong to GroupRecords that have a SchoolYear of '2000'. In the returned list, people.name should be unique (no duplicate People.Name); in case of a duplication, only the person who belongs to the GroupRecords with the later RecordDate should be returned.
It would probably be better to write a stored procedure for this, right?

This is untested, but it should do what the question asks.
It selects all details about the person.
The subquery makes it match only the latest RecordDate for a given name. It compares only GroupRecords from the same SchoolYear, so records from other years cannot win.
SELECT
    People.Id,
    People.GroupRecordsId,
    People.Name,
    People.Bio,
    People.Location
FROM
    People
    INNER JOIN GroupRecords ON GroupRecords.Id = People.GroupRecordsId
WHERE
    GroupRecords.SchoolYear = '2000/1/1' AND
    GroupRecords.RecordDate = (
        SELECT
            MAX(GR2.RecordDate)
        FROM
            People AS P2
            INNER JOIN GroupRecords AS GR2 ON P2.GroupRecordsId = GR2.Id
        WHERE
            P2.Name = People.Name AND
            GR2.SchoolYear = GroupRecords.SchoolYear
    )

Select Distinct ID
From People
Where GroupRecordsID In
(Select Id From GroupRecords
Where SchoolYear = '2000/1/1')
This will produce a distinct list of the individuals in the 2000 class,
but I don't understand what you're getting at with the comment about duplicates. Please elaborate.
It reads as though, when two different people happen to have the same name, you don't want them both listed. Is that really what you want?

MySQL specific:
SELECT *
FROM `People`
LEFT JOIN `GroupRecords` ON `GroupRecordsId` = `GroupRecords`.`Id`
WHERE `GroupRecords`.`SchoolYear` = '2000/1/1'
GROUP BY `People`.`Name`
ORDER BY `GroupRecords`.`RecordDate` DESC

"people.name should be unique (no duplicate People.Name)"
Surely you mean no duplicate People.ID?
"in case of a duplication only the person who belongs to the GroupRecords with the later RecordDate should be returned."
There's the rub: that's the bit that isn't obvious how to do in plain SQL. There are a number of approaches to the “For each X, select the row Y with maximum/minimum Z” question; which ones work, and which perform better, depends on which database software you're using.
http://kristiannielsen.livejournal.com/6745.html has some good discussion of some of the usual techniques for attacking this (in the context of MySQL, but widely applicable).
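One of the usual techniques for "for each X, keep the row with the maximum Z" is an anti-join: keep a row only if no other row with the same name has a later RecordDate. Below is a minimal sketch of that idea, run against SQLite from Python so it can be tried end to end. The sample rows and the ISO-formatted text dates are assumptions made for illustration, and date handling will differ on other databases.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE GroupRecords (Id INTEGER PRIMARY KEY, Name TEXT, SchoolYear TEXT, RecordDate TEXT);
CREATE TABLE People (Id INTEGER PRIMARY KEY,
                     GroupRecordsId INTEGER REFERENCES GroupRecords(Id),
                     Name TEXT);
-- Hypothetical sample data: 'Ann' appears in two groups of school year 2000.
INSERT INTO GroupRecords VALUES
    (1, 'A', '2000-01-01', '2000-03-01'),
    (2, 'B', '2000-01-01', '2000-09-01'),
    (3, 'C', '2001-01-01', '2001-12-01');
INSERT INTO People VALUES
    (10, 1, 'Ann'),
    (11, 2, 'Ann'),
    (12, 1, 'Bob'),
    (13, 3, 'Ann');
""")

# Keep a person only if no same-named person in the same school year
# belongs to a group with a later RecordDate.
latest_per_name = conn.execute("""
SELECT p.Id, p.Name, gr.RecordDate
FROM People AS p
JOIN GroupRecords AS gr ON gr.Id = p.GroupRecordsId
WHERE gr.SchoolYear = '2000-01-01'
  AND NOT EXISTS (
      SELECT 1
      FROM People AS p2
      JOIN GroupRecords AS gr2 ON gr2.Id = p2.GroupRecordsId
      WHERE p2.Name = p.Name
        AND gr2.SchoolYear = gr.SchoolYear
        AND gr2.RecordDate > gr.RecordDate
  )
ORDER BY p.Name
""").fetchall()

print(latest_per_name)
```

If two same-named people tie on the latest RecordDate, this form returns both, so a tie-breaker (for example on People.Id) would be needed to guarantee unique names.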

Related

SQL Query to return a table of specific matching values based on a criteria

I have 3 tables in PostgreSQL database:
person (id, first_name, last_name, age)
interest (id, title, person_id REFERENCES person)
location (id, city, state text NOT NULL, country, person_id REFERENCES person)
city can be null, but state and country cannot.
A person can have many interests but only one location. My challenge is to return a table of people who share the same interest and location.
All ID's are serialized and thus created automatically.
Let's say I have 4 people living in "TX". They each have two interests apiece, BUT only persons 1 and 3 share a similar interest, let's say "Guns" (because it's Texas, after all). I need to select all people from the person table where the person's interest title (because the id is auto-generated, two Guns interests would result in two different ID keys) equals that of another person's interest title AND the city or state is also equal.
I was looking at the answer to the question "Select Rows with matching columns from SQL Server" and I feel like the logic is sort of similar to my question. The difference is that he has two tables to join together, where I have three.
"return a table of people who share the same interest and location."
I'll interpret this as "all rows from table person where another row exists that shares at least one matching row in interest and a matching row in location. No particular order."
A simple solution with a window function in a subquery:
SELECT p.*
FROM (
SELECT person_id AS id, i.title, l.city, l.state, l.country
, count(*) OVER (PARTITION BY i.title, l.city, l.state, l.country) AS ct
FROM interest i
JOIN location l USING (person_id)
) x
JOIN person p USING (id)
WHERE x.ct > 1;
This treats NULL values as "equal". (You did not specify clearly.)
Depending on undisclosed cardinalities, there may be faster query styles. (Like reducing to duplicative interests and / or locations first.)
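To see the window-function approach on the four-person Texas example from the question, here is a small sketch run against SQLite from Python. It assumes SQLite 3.25 or later (for window functions). The names and interests other than "Guns" are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE person (id INTEGER PRIMARY KEY, first_name TEXT);
CREATE TABLE interest (id INTEGER PRIMARY KEY, title TEXT,
                       person_id INTEGER REFERENCES person(id));
CREATE TABLE location (id INTEGER PRIMARY KEY, city TEXT, state TEXT NOT NULL,
                       country TEXT NOT NULL,
                       person_id INTEGER REFERENCES person(id));
-- Hypothetical sample data: only persons 1 and 3 share an interest.
INSERT INTO person VALUES (1, 'Al'), (2, 'Bea'), (3, 'Cy'), (4, 'Di');
INSERT INTO interest (title, person_id) VALUES
    ('Guns', 1), ('Chess', 1),
    ('Golf', 2), ('Opera', 2),
    ('Guns', 3), ('Rodeo', 3),
    ('Yoga', 4), ('Darts', 4);
INSERT INTO location (city, state, country, person_id) VALUES
    (NULL, 'TX', 'US', 1), (NULL, 'TX', 'US', 2),
    (NULL, 'TX', 'US', 3), (NULL, 'TX', 'US', 4);
""")

# Count how many (interest, location) rows fall into each partition and
# keep only people whose partition holds more than one row.
rows = conn.execute("""
SELECT p.id, p.first_name
FROM (
    SELECT i.person_id AS id,
           count(*) OVER (PARTITION BY i.title, l.city, l.state, l.country) AS ct
    FROM interest i
    JOIN location l USING (person_id)
) x
JOIN person p USING (id)
WHERE x.ct > 1
ORDER BY p.id
""").fetchall()

print(rows)
```

The NULL cities land in the same partition here, which matches the "NULL values are treated as equal" behaviour described above. A person who shares more than one interest with someone would appear once per shared interest, so add DISTINCT if that matters.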
Asides 1:
It's almost always better to have a column birthday (or year_of_birth) than age, which starts to bit-rot immediately.
Asides 2:
A person can have [...] only one location.
You might at least add a UNIQUE constraint on location.person_id to enforce that. (If you cannot make it the PK or just append location columns to the person table.)
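A quick way to convince yourself of what the UNIQUE constraint buys you is to try inserting a second location for the same person. This is only a sketch using SQLite from Python, with a trimmed-down version of the tables. PostgreSQL would reject the second insert with its own unique-violation error.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE person (id INTEGER PRIMARY KEY);
CREATE TABLE location (
    id INTEGER PRIMARY KEY,
    state TEXT NOT NULL,
    country TEXT NOT NULL,
    -- UNIQUE enforces "a person can have only one location"
    person_id INTEGER NOT NULL UNIQUE REFERENCES person(id)
);
INSERT INTO person VALUES (1);
INSERT INTO location (state, country, person_id) VALUES ('TX', 'US', 1);
""")

try:
    conn.execute(
        "INSERT INTO location (state, country, person_id) VALUES ('CA', 'US', 1)"
    )
    second_insert_rejected = False
except sqlite3.IntegrityError:
    # The database refuses a second location row for person 1.
    second_insert_rejected = True

location_count = conn.execute("SELECT count(*) FROM location").fetchone()[0]
print(second_insert_rejected, location_count)
```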

Find potential duplicate names in database

I have two tables in a SQL Server Database:
Table: People
Columns: ID, FirstName, LastName
Table: StandardNames
Columns: Nickname, StandardName
Sample Nicknames would be Rick, Rich, Richie when StandardName is Richard.
I would like to find duplicate contacts in my People table but replace any of the nicknames with the standard name. IE: sometimes I have Rich Smith other times it is Richard Smith in the People table. Is this possible? I realize it might be multiple joins to the same table but can't figure out how to start.
Firstly, you need to determine how many duplicates you have in your People table...
SELECT p.FirstName, COUNT(*)
FROM People AS p
INNER JOIN StandardNames AS sn
ON CHARINDEX(sn.Nickname, p.FirstName) > 0 OR
CHARINDEX(sn.Nickname, p.LastName) > 0
GROUP BY p.FirstName
HAVING COUNT(*) > 1
That's just to get an idea of what data you're trying to find in relation to the Nicknames that may possibly exist inside (as a wildcard word search) the Firstname and Lastname columns.
If you are happy with the items found then expand on the query to update the values.
Let's say you wanted to change the Firstname to be the Standardname...
UPDATE p2
SET p2.FirstName = a.StandardName
FROM
(SELECT p.ID, sn.StandardName
FROM People AS p
INNER JOIN StandardNames AS sn
ON CHARINDEX(sn.Nickname, p.FirstName) > 0 OR
CHARINDEX(sn.Nickname, p.LastName) > 0) AS a
INNER JOIN People AS p2 ON p2.ID = a.ID
So this will obviously find all the People IDs that have a match based on the query above, and it will update the People table by replacing the FirstName with the StandardName.
However, there are issues with this due to the limitations of your question.
The StandardNames table should have its own ID field. All tables should have an ID column as their primary key. That's just my view.
This is only going to work for data it matches using the CHARINDEX() function. What you really need is something that matches based on a "sound" or similarity to the nicknames. Check out the SOUNDEX() function and apply your logic from there.
And this is assuming your IDs above are unique!
Good luck
You could standardize the names by joining, and count the number of occurrences. Extracting the ID is a bit fiddly, but also quite possible. I'd suggest the following - use a case expression to find the contact with the standard name, and if you don't have one, just take the id of the first duplicate:
SELECT COALESCE(MIN(CASE FirstName WHEN StandardName THEN id END), MIN(id)),
StandardName,
LastName,
COUNT(*)
FROM People p
LEFT JOIN StandardNames s ON FirstName = Nickname
GROUP BY StandardName, LastName
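As a rough illustration of the join-and-count idea, the sketch below normalizes each first name with COALESCE, so a name without a nickname entry falls back to itself, and then reports only the groups that occur more than once. It runs against SQLite from Python. The sample people are invented, and it assumes StandardNames holds no row mapping a standard name to itself.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE People (ID INTEGER PRIMARY KEY, FirstName TEXT, LastName TEXT);
CREATE TABLE StandardNames (Nickname TEXT PRIMARY KEY, StandardName TEXT);
INSERT INTO StandardNames VALUES
    ('Rick', 'Richard'), ('Rich', 'Richard'), ('Richie', 'Richard');
-- Hypothetical sample data: IDs 1 and 2 are the same contact.
INSERT INTO People VALUES
    (1, 'Rich', 'Smith'),
    (2, 'Richard', 'Smith'),
    (3, 'Rick', 'Jones'),
    (4, 'Mary', 'Smith');
""")

# Replace nicknames with the standard name where one exists, then group.
duplicates = conn.execute("""
SELECT COALESCE(s.StandardName, p.FirstName) AS NormalizedFirstName,
       p.LastName,
       COUNT(*) AS Occurrences,
       MIN(p.ID) AS FirstId
FROM People p
LEFT JOIN StandardNames s ON p.FirstName = s.Nickname
GROUP BY COALESCE(s.StandardName, p.FirstName), p.LastName
HAVING COUNT(*) > 1
""").fetchall()

print(duplicates)
```

Because this is an exact match on the nickname rather than a substring search, 'Richardson' would not be mistaken for 'Rich'. Case sensitivity depends on the database's collation.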

How can I order by a specific order?

It would be something like:
SELECT * FROM users ORDER BY id ORDER("abc","ghk","pqr"...);
In my order clause there might be 1000 records and all are dynamic.
A quick google search gave me below result:
SELECT * FROM users ORDER BY case id
when "abc" then 1
when "ghk" then 2
when "pqr" then 3 end;
As I said all my order clause values are dynamic. So is there any suggestion for me?
Your example isn't entirely clear, as it appears that a simple ORDER BY would suffice to order your ids alphabetically. However, it appears you are trying to create a dynamic ordering scheme that may not be alphabetical. In that case, my recommendation would be to use a lookup table for the values that you will be ordering by. This serves two purposes: first, it allows you to easily reorder the items without altering each entry in the users table, and second, it avoids (or at least reduces) problems with typos and other issues that can occur with "magic strings."
This would look something like:
Lookup Table:
CREATE TABLE LookupValues (
    Id CHAR(3) PRIMARY KEY,
    SortOrder INT -- ORDER is a reserved word, so the column needs another name
);
Query:
SELECT
    u.*
FROM
    users u
INNER JOIN
    LookupValues l
ON
    u.Id = l.Id
ORDER BY
    l.SortOrder
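Since the order values are dynamic, the lookup table can be filled at run time from whatever list the application has. Here is a minimal sketch of that, using SQLite from Python. The temp table name sort_keys, the sort_order column, and the sample users are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (id TEXT PRIMARY KEY, name TEXT);
INSERT INTO users VALUES ('abc', 'Ann'), ('ghk', 'Gus'), ('pqr', 'Pia');
-- Scratch table holding the desired position of each id.
CREATE TEMP TABLE sort_keys (id TEXT PRIMARY KEY, sort_order INTEGER NOT NULL);
""")

# The dynamic list, in the order the rows should come back.
desired_order = ["pqr", "abc", "ghk"]
conn.executemany(
    "INSERT INTO sort_keys (id, sort_order) VALUES (?, ?)",
    [(user_id, position) for position, user_id in enumerate(desired_order)],
)

ordered_ids = [row[0] for row in conn.execute("""
    SELECT u.id
    FROM users u
    JOIN sort_keys k ON k.id = u.id
    ORDER BY k.sort_order
""")]

print(ordered_ids)
```

With around 1000 values this avoids generating a very long CASE expression, and the parameterized inserts keep the ids out of the SQL text. Users missing from the list are dropped by the inner join, so use a LEFT JOIN if they should still appear.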

Efficient way to select records missing in another table

I have 3 tables. Below is the structure:
student (id int, name varchar(20))
course (course_id int, subject varchar(10))
student_course (st_id int, course_id int) -> contains name of students who enrolled for a course
Now, I want to write a query to find out students who did not enroll for any course. As I could figure out there are multiple ways to fetching this information. Could you please let me know which one of these is the most efficient and also, why. Also, if there could be any other better way of executing same, please let me know.
db2 => select distinct name from student inner join student_course on id not in (select st_id from student_course)
db2 => select name from student minus (select name from student inner join student_course on id=st_id)
db2 => select name from student where id not in (select st_id from student_course)
Thanks in advance!!
The subqueries you use, whether NOT IN, MINUS or whatever, are generally inefficient. The common way to do this is a left join:
select name
from student
left join student_course on id = st_id
where st_id is NULL
Using a join is the "normal" and preferred solution.
The canonical (maybe even synoptic) idiom is (IMHO) to use NOT EXISTS :
SELECT *
FROM student st
WHERE NOT EXISTS (
SELECT *
FROM student_course nx
WHERE st.id = nx.st_id
);
Advantages:
NOT EXISTS(...) is very old, and most optimisers will know how to handle it; thus it will probably be present on all platforms
the nx. correlation name is not leaked into the outer query: the select * in the outer query will only yield fields from the student table, and not the (null) rows from the student_course table, like in the LEFT JOIN ... WHERE ... IS NULL case. This is especially useful in queries with a large number of range table entries.
(NOT) IN is error-prone (NULLs), and it might perform badly on some implementations (duplicates and NULLs have to be removed from the result of the uncorrelated subquery)
Using "not in" is generally slow. That makes your second query the most efficient. You probably don't need the brackets though.
Just as a comment: I would suggest to select student Id (which are unique) and not names.
As another query option you might want to left join the two tables, group by student id, and keep the groups having count(course_id) = 0.
Also, I agree that indexes will be more important.
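The NULL pitfall with NOT IN is easy to reproduce. The sketch below runs the three styles against SQLite from Python on made-up data in which student_course contains one row with a NULL st_id. It only demonstrates correctness, not performance, which depends on the optimiser and indexes of the actual database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE student (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE student_course (st_id INTEGER, course_id INTEGER);
INSERT INTO student VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cid');
-- Hypothetical data: one enrollment row has a NULL student id.
INSERT INTO student_course VALUES (1, 100), (NULL, 200);
""")

def names(sql):
    """Run a query and return the first column of every row as a list."""
    return [row[0] for row in conn.execute(sql)]

# A NULL in the subquery makes "id NOT IN (...)" unknown for every
# non-matching id, so no rows survive.
not_in = names("""
    SELECT name FROM student
    WHERE id NOT IN (SELECT st_id FROM student_course)
    ORDER BY name
""")

not_exists = names("""
    SELECT name FROM student st
    WHERE NOT EXISTS (SELECT 1 FROM student_course nx WHERE nx.st_id = st.id)
    ORDER BY name
""")

left_join = names("""
    SELECT name FROM student
    LEFT JOIN student_course ON id = st_id
    WHERE st_id IS NULL
    ORDER BY name
""")

print(not_in, not_exists, left_join)
```

Both NOT EXISTS and the left join report Bob and Cid, while NOT IN silently returns nothing.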

Design : multiple visits per patient

Above is my schema. What you can't see in tblPatientVisits is the foreign key from tblPatient, which is patientid.
tblPatient contains a distinct copy of each patient in the dataset as well as their gender. tblPatientVisits contains their demographic information: where they lived at time of admission and which hospital they went to. I chose to put that information into a separate table because it changes throughout the data (a person can move from one visit to the next and go to a different hospital).
I don't get any strange numbers with my queries until I add tblPatientVisits. There are just under one million claims in tblClaims, but when I add tblPatientVisits so I can check out where that person was from, it returns over a million. I think this is due to the fact that in tblPatientVisits the same patientID shows up more than once (because they had different admission/discharge dates).
For the life of me I can't see where this is incorrect design, nor do I know how to rectify it beyond doing one query with count(tblPatientVisits.patientID) = 1 and then a union with count(tblPatientVisits.patientID) > 1.
Any insight into this type of design, or how I might more elegantly find a way to get the claimType from tblClaims to give me the correct number of rows when I associate a claim ID with a patientID?
EDIT: The biggest problem I'm having is the fact that if I include the admissionDate, dischargeDate or the patientState in the tblPatient table I can't use the patientID as a primary key.
It should be noted that tblClaims are NOT necessarily related to tblPatientVisits.admissionDate, tblPatientVisits.dischargeDate.
EDIT: sample queries to show that when tblPatientVisits is added, more rows are returned than claims
SELECT tblclaims.id, tblClaims.claimType
FROM tblClaims INNER JOIN
tblPatientClaims ON tblClaims.id = tblPatientClaims.id INNER JOIN
tblPatient ON tblPatientClaims.patientid = tblPatient.patientID INNER JOIN
tblPatientVisits ON tblPatient.patientID = tblPatientVisits.patientID
more than one million query rows returned
SELECT tblClaims.id, tblPatient.patientID
FROM tblClaims INNER JOIN
tblPatientClaims ON tblClaims.id = tblPatientClaims.id INNER JOIN
tblPatient ON tblPatientClaims.patientid = tblPatient.patientID
less than one million query rows returned
I think this is crying for a better design. I really think that a visit should be associated with a claim, and that a claim can only be associated with a single patient, so I think the design should be (and eliminating the needless tbl prefix, which is just clutter):
CREATE TABLE dbo.Patients
(
PatientID INT PRIMARY KEY
-- , ... other columns ...
);
CREATE TABLE dbo.Claims
(
ClaimID INT PRIMARY KEY,
PatientID INT NOT NULL FOREIGN KEY
REFERENCES dbo.Patients(PatientID)
-- , ... other columns ...
);
CREATE TABLE dbo.PatientVisits
(
PatientID INT NOT NULL FOREIGN KEY
REFERENCES dbo.Patients(PatientID),
ClaimID INT NULL FOREIGN KEY
REFERENCES dbo.Claims(ClaimID),
VisitDate DATE,
-- ... other columns ...
PRIMARY KEY (PatientID, ClaimID, VisitDate) -- not convinced on this one
);
There is some redundant information here, but it's not clear from your model whether a patient can have a visit that is not associated with a specific claim, or even whether you know that a visit belongs to a specific claim (this seems like crucial information given the type of query you're after).
In any case, given your current model, one query you might try is:
SELECT c.id, c.claimType
FROM dbo.tblClaims AS c
INNER JOIN dbo.tblPatientClaims AS pc
ON c.id = pc.id
INNER JOIN dbo.tblPatient AS p
ON pc.patientid = p.patientID
-- where exists tells SQL server you don't care how many
-- visits took place, as long as there was at least one:
WHERE EXISTS (SELECT 1 FROM dbo.tblPatientVisits AS pv
WHERE pv.patientID = p.patientID);
This will still return one row for every patient / claim combination, but it will no longer multiply those rows by the number of visits each patient has. Again, it really feels like the design isn't right here. You should also get in the habit of using table aliases - they make your query much easier to read, especially if you insist on the messy tbl prefix. You should also always use the dbo (or whatever schema you use) prefix when creating and referencing objects.
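To make the row multiplication concrete, here is a tiny reproduction using SQLite from Python. The table and column names follow the question, while the rows and the admissionDate values are invented: one patient has two claims and three visits, so the inner join produces six rows for that patient alone.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tblPatient (patientID INTEGER PRIMARY KEY);
CREATE TABLE tblClaims (id INTEGER PRIMARY KEY, claimType TEXT);
CREATE TABLE tblPatientClaims (id INTEGER REFERENCES tblClaims(id),
                               patientid INTEGER REFERENCES tblPatient(patientID));
CREATE TABLE tblPatientVisits (patientID INTEGER REFERENCES tblPatient(patientID),
                               admissionDate TEXT);
-- Hypothetical data: patient 1 has 2 claims and 3 visits, patient 2 has 1 of each.
INSERT INTO tblPatient VALUES (1), (2);
INSERT INTO tblClaims VALUES (100, 'inpatient'), (101, 'outpatient'), (102, 'inpatient');
INSERT INTO tblPatientClaims VALUES (100, 1), (101, 1), (102, 2);
INSERT INTO tblPatientVisits VALUES
    (1, '2010-01-05'), (1, '2010-06-20'), (1, '2011-02-11'), (2, '2010-03-03');
""")

# Joining to visits repeats each claim once per visit of its patient.
joined_rows = conn.execute("""
SELECT count(*)
FROM tblClaims AS c
JOIN tblPatientClaims AS pc ON c.id = pc.id
JOIN tblPatient AS p ON pc.patientid = p.patientID
JOIN tblPatientVisits AS pv ON p.patientID = pv.patientID
""").fetchone()[0]

# EXISTS only asks whether at least one visit exists, so no repetition.
exists_rows = conn.execute("""
SELECT count(*)
FROM tblClaims AS c
JOIN tblPatientClaims AS pc ON c.id = pc.id
JOIN tblPatient AS p ON pc.patientid = p.patientID
WHERE EXISTS (SELECT 1 FROM tblPatientVisits AS pv
              WHERE pv.patientID = p.patientID)
""").fetchone()[0]

print(joined_rows, exists_rows)
```

Three claims come back as seven rows with the plain join and as three rows with EXISTS, which is the same effect as the "more than one million" versus "less than one million" counts in the question.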
I'm not sure I understand the concept of a claim but I suspect you want to remove the link table between claims and patient and instead make the association between patient visit and a claim.
Would that work out better for you?