How to self join only a subset of rows in PostgreSQL? - sql

Given the following table:
CREATE TABLE people (
name TEXT PRIMARY KEY,
age INT NOT NULL
);
INSERT INTO people VALUES
('Lisa', 30),
('Marta', 27),
('John', 32),
('Sam', 41),
('Alex', 12),
('Aristides',43),
('Cindi', 1)
;
I am using a self join to compare each value of a specific column with all the other values of the same column. My query looks something like this:
SELECT DISTINCT A.name as child
FROM people A, people B
WHERE A.age + 16 < B.age;
This query aims to spot potential sons/daughters based on age difference. More specifically, my goal is to identify the set of people that may have stayed in the same house as one of their parents (ordered by name), assuming that there must be an age difference of at least 16 years between a child and their parents.
Now I would like to combine this kind of logic with the information that is in another table.
The other table looks something like that:
CREATE TABLE houses (
house_name TEXT NOT NULL,
house_member TEXT NOT NULL REFERENCES people(name)
);
INSERT INTO houses VALUES
('house Smith', 'Lisa'),
('house Smith', 'Marta'),
('house Smith', 'John'),
('house Doe', 'Lisa'),
('house Doe', 'Marta'),
('house Doe', 'Alex'),
('house Doe', 'Sam'),
('house McKenny', 'Aristides'),
('house McKenny', 'John'),
('house McKenny', 'Cindi')
;
The two tables can be joined ON houses.house_member = people.name.
More specifically I would like to spot the children only within the same house. It does not make sense to compare the age of each person with the age of all the others, but instead it would be more efficient to compare the age of each person with all the other people in the same house.
My idea is to perform the self join from above but only within a PARTITION BY household_name. However, I don't think this is a good idea since I do not have an aggregate function. Same applies for GROUP BY statements as well. What could I do here?
The expected output should be the following, ordered by house_member:
house_member
Alex
Cindi
For simplicity I have created a fiddle.

At first join two tables to build one table that has all three bits of info: house_name, house_member, age.
And then join it with itself just as you did originally and add one extra filter to look only at the same households.
WITH
CTE_All
AS
(
SELECT
houses.house_name
,houses.house_member
,people.age
FROM
houses
INNER JOIN people ON people.name = houses.house_member
)
SELECT DISTINCT
Children.house_name
,Children.house_member AS child_name
FROM
CTE_All AS Children
INNER JOIN CTE_All AS Parents
ON Children.age + 16 < Parents.age
-- this is our age difference
AND Children.house_name = Parents.house_name
-- within the same house
;
All this is one single query. You don't have to use CTE, you can inline it as a subquery, but it is more readable with CTE.
Result
house_name | child_name
:------------ | :---------
house Doe | Alex
house McKenny | Cindi

Related

Calculating weighted marks within weighted marks

I have a maths / SQL problem that I've been grappling with.
I have two tables with the following structure:
CREATE TABLE Exams
(
ExamID INT PRIMARY KEY,
ExamName VARCHAR(100),
CourseID INT,
RelatedExamID INT NULL,
Weighting DECIMAL (5,3)
)
CREATE TABLE ExamMarks
(
ExamMarkID INT IDENTITY PRIMARY KEY,
StudentID VARCHAR(8),
ExamID INT FOREIGN KEY REFERENCES Exams(ExamID),
ExamMark DECIMAL (5,4)
)
The exams table contains the following data:
INSERT INTO Exams (ExamID, ExamName, CourseID, RelatedExamID, Weighting)
VALUES (1, 'English',1,NULL,1),
(2, 'French',2,NULL,1),
(3, 'Maths',3,NULL,0.6),
(4, 'Statistics',3,NULL,0.4),
(5, 'Physics Part 1',4,NULL,0.5),
(6, 'Physics Part 2',4,NULL,0.5),
(7, 'Heat and Mass',4,6,0.25)
The Exam Marks table contains the following data:
INSERT INTO ExamMarks (StudentID, ExamID, ExamMark)
VALUES ('00112233', 1, 0.75),
('00112233', 2, 0.52),
('00112233', 3, 0.68),
('00112233', 4, 0.8),
('00112233', 5, 0.50),
('00112233', 6, 0.66),
('00112233', 7, 0.45)
The idea here is that a given course may have
A single exam (such as with English and French)
Multiple exams (such as with course 3, which has 2 exams called "Maths" and "Physics") which have independent weightings - in this case, the course is structured so that the Maths exam is 60% of the total, and the Physics exam contributes 40%
Exams with sub-exams, such as with Course 4, more on which shortly.
If I want to get the weighted total marks for each candidate for each exam - forgetting about course 4 for the moment - I do the following:
SELECT em.StudentID,e.CourseID, SUM(em.ExamMark * e.Weighting)/SUM(e.Weighting)
FROM Exams e
INNER JOIN ExamMarks em ON e.ExamID = em.ExamID
GROUP BY em.StudentID,e.CourseID
However, Course 4 is made up of 3 parts:
Physics Part 1 - 50% of the total, and
Physics Part 2 - also 50% of the total
Heat and Mass, which makes up 25% of Physics Part 2 (thus its ID in the 'RelatedExamID' column)
To be clear, Heat and Mass makes up 25% of Physics Part 2, which itself makes up 50% of the course.
I've put these figures into an Excel spreadsheet, and after a lot of head-scratching, was able to figure that our student should end up with a mark for Course 4 of 55.375%.
However, unfortunately, my SQL (and maths / logic) skills are not good enough to get to this result in a SQL query.
The data above represents something of a simplification. There are, in fact, around 10000 marks to be considered (around 500 students involved), for about 200 different exams, of which perhaps 30 are "sub-exams". Each year, these have to be totalled up to give the student their mark per course, given these weightings.
Okay, so I have managed to find a solution. I'd still be grateful for any others which might be more efficient or robust from those who know more than me.
--Get SubComponent marks
WITH SubComponents
AS
(SELECT
em.StudentID
,em.ExamID
,e.RelatedExamID
,e.Weighting
,e.Weighting * em.ExamMark AS WeightedMark
FROM Exams e
INNER JOIN ExamMarks em
ON e.ExamID = em.ExamID
WHERE e.RelatedExamID IS NOT NULL),
--Get marks for those components which have subcomponents
ParentComponents
AS
(SELECT
em.StudentID
,e.CourseID
,em.ExamID
,e.RelatedExamID
,e.Weighting
,((1 - SubComponents.Weighting) * em.ExamMark)
+ SubComponents.WeightedMark AS OverallComponentMark
FROM Exams e
INNER JOIN ExamMarks em
ON e.ExamID = em.ExamID
INNER JOIN SubComponents
ON SubComponents.RelatedExamID = e.ExamID
AND SubComponents.StudentID=em.StudentID),
--Get marks for those components which are neither parent nor child components
StandaloneComponents
AS
(SELECT
em.StudentID
,e2.CourseID
,em.ExamID
,e2.RelatedExamID
,e2.Weighting
,em.ExamMark
FROM Exams e2
INNER JOIN ExamMarks em
ON e2.ExamID = em.ExamID
WHERE NOT EXISTS (SELECT
*
FROM Exams
WHERE RelatedExamID = e2.ExamID)
AND e2.RelatedExamID IS NULL),
-- Bring all the above together
ComponentMarks
AS
(SELECT
StudentID
,CourseID
,ExamID
,Weighting
,ExamMark
FROM StandaloneComponents
UNION
SELECT
StudentID
,CourseID
,ExamID
,Weighting
,OverallComponentMark
FROM ParentComponents)
-- Finally group and combine marks at course level
SELECT
StudentID
,CourseID
,SUM(ExamMark * Weighting) / SUM(Weighting)
FROM ComponentMarks
GROUP BY StudentID
,CourseID

Query database for distinct values and aggregate data based on condition

I am trying to extract distinct items from a Postgres database pairing a column from a table with a column from another table based on a condition. Simplified version looks like this:
CREATE TABLE users
(
id SERIAL PRIMARY KEY,
name VARCHAR(255)
);
CREATE TABLE photos
(
id INT PRIMARY KEY,
user_id INTEGER REFERENCES users(id),
flag VARCHAR(255)
);
INSERT INTO users VALUES (1, 'Bob');
INSERT INTO users VALUES (2, 'Alice');
INSERT INTO users VALUES (3, 'John');
INSERT INTO photos VALUES (1001, 1, 'a');
INSERT INTO photos VALUES (1002, 1, 'b');
INSERT INTO photos VALUES (1003, 1, 'c');
INSERT INTO photos VALUES (1004, 2, 'a');
INSERT INTO photos VALUES (1004, 2, 'x');
What I need is to extract each user name, only once, and a flag value for each of them. The flag value should prioritize a specific one, let's say b. So, the result should look like:
Bob b
Alice a
Where Bob owns a photo having the b flag, while Alice does not and John has no photos. For Alice the output for the flag value is not important (a or x would be just as good) as long as she owns no photo flagged b.
The closest thing I found were some self-join queries where the flag value would have been aggregated using min() or max(), but I am looking for a particular value, which is not first, nor last. Moreover, I found out that you can define your own aggregate functions, but I wonder if there is an easier way of conditioning the query in order to obtain the required data.
Thank you!
Here is a method with aggregation:
select u.name,
coalesce(max(flag) filter (where flag = 'b'),
min(flag)
) as flag
from users u left join
photos p
on u.id = p.user_id
group by u.id, u.name;
That said, a more typical method would be a prioritization query. Perhaps:
select distinct on (u.id) u.name, p.flag
from users u left join
photos p
on u.id = p.user_id
order by u.id, (p.flag = 'b') desc;

Postgresql check for double entries

I searched for nearly one hour to solve my problem but i cant find anything.
So:
I created a table named s (Suppliers) where some Suppliers for Parts are listed, it looks like this:
insert into S(sno, sname, status, city)
values ('S1', 'Smith', 20, 'London'),
('S2', 'Jones', 10, 'Paris'),
('S3', 'Blake', 30, 'Paris'),
('S4', 'Clark', 20, 'London'),
('S5', 'Adams', 30, 'Athens');
Now i want to check this table for double entries in the column "city", so this would be London and Paris and i want to sort it by the sno and print it out.
I know that it's a bit harder in Postgres than in mySQL and i tried it like this:
SELECT sno, COUNT(city) AS NumOccurencies FROM s GROUP BY sno HAVING ( COUNT (city) > 1 );
But all i get is an empty table :(. I tried different ways but it's always the same, i don't know what to do to be honest. I hope some of you could help me out here :).
Greetings Max
You're thinking about it a little backwards. By grouping by the sno you're finding all of those rows with the same sno, not the same city. Try this instead:
SELECT
city
FROM
S
GROUP BY
city
HAVING
COUNT(*) > 1
You can then use that as a subquery to find the rows that you want:
SELECT
sno, sname, status, city
FROM
S
WHERE
city IN
(
SELECT
city
FROM
S
GROUP BY
city
HAVING
COUNT(*) > 1
)

Postgresql aggregate array

I have a two tables
Student
--------
Id Name
1 John
2 David
3 Will
Grade
---------
Student_id Mark
1 A
2 B
2 B+
3 C
3 A
Is it possible to make native Postgresql SELECT to get results like below:
Name Array of marks
-----------------------
'John', {'A'}
'David', {'B','B+'}
'Will', {'C','A'}
But not like below
Name Mark
----------------
'John', 'A'
'David', 'B'
'David', 'B+'
'Will', 'C'
'Will', 'A'
Use array_agg: http://www.sqlfiddle.com/#!1/5099e/1
SELECT s.name, array_agg(g.Mark) as marks
FROM student s
LEFT JOIN Grade g ON g.Student_id = s.Id
GROUP BY s.Id
By the way, if you are using Postgres 9.1, you don't need to repeat the columns on SELECT to GROUP BY, e.g. you don't need to repeat the student name on GROUP BY. You can merely GROUP BY on primary key. If you remove the primary key on student, you need to repeat the student name on GROUP BY.
CREATE TABLE grade
(Student_id int, Mark varchar(2));
INSERT INTO grade
(Student_id, Mark)
VALUES
(1, 'A'),
(2, 'B'),
(2, 'B+'),
(3, 'C'),
(3, 'A');
CREATE TABLE student
(Id int primary key, Name varchar(5));
INSERT INTO student
(Id, Name)
VALUES
(1, 'John'),
(2, 'David'),
(3, 'Will');
What I understand you can do something like this:
SELECT p.p_name,
STRING_AGG(Grade.Mark, ',' ORDER BY Grade.Mark) As marks
FROM Student
LEFT JOIN Grade ON Grade.Student_id = Student.Id
GROUP BY Student.Name;
EDIT
I am not sure. But maybe something like this then:
SELECT p.p_name, 
    array_to_string(ARRAY_AGG(Grade.Mark),';') As marks
FROM Student
LEFT JOIN Grade ON Grade.Student_id = Student.Id
GROUP BY Student.Name;
Reference here
You could use the following:
SELECT Student.Name as Name,
(SELECT array(SELECT Mark FROM Grade WHERE Grade.Student_id = Student.Id))
AS ArrayOfMarks
FROM Student
As described here: http://www.mkyong.com/database/convert-subquery-result-to-array/
Michael Buen got it right. I got what I needed using array_agg.
Here just a basic query example in case it helps someone:
SELECT directory, ARRAY_AGG(file_name)
FROM table
WHERE type = 'ZIP'
GROUP BY directory;
And the result was something like:
| parent_directory | array_agg |
+-------------------------+----------------------------------------+
| /home/postgresql/files | {zip_1.zip,zip_2.zip,zip_3.zip} |
| /home/postgresql/files2 | {file1.zip,file2.zip} |
This post also helped me a lot: "Group By" in SQL and Python Pandas.
It basically says that it is more convenient to use only PSQL when possible, but that Python Pandas can be useful to achieve extra functionalities in the filtering process.

SQL Server 2005 Full Text Search over multiple tables and columns

I'm looking for a good solution to use the containstable feature of the SQL Serve r2005 effectivly. Currently I have, e.g. an Employee and an Address table.
-Employee
Id
Name
-Address
Id
Street
City
EmployeeId
Now the user can enter search terms in only one textbox and I want this terms to be split and search with an "AND" operator. FREETEXTTABLE seems to work with "OR" automatically.
Now lets say the user entered "John Hamburg". This means he wants to find John in Hamburg.
So this is "John AND Hamburg".
So the following will contain no results since CONTAINSTABLE checks every column for "John AND Hamburg".
So my question is: What is the best way to perform a fulltext search with AND operators across multiple columns/tables?
SELECT *
FROM Employee emp
INNER JOIN
CONTAINSTABLE(Employee, *, '(JOHN AND Hamburg)', 1000) AS keyTblSp
ON sp.ServiceProviderId = keyTblSp.[KEY]
LEFT OUTER JOIN [Address] addr ON addr.EmployeeId = emp.EmployeeId
UNION ALL
SELECT *
FROM Employee emp
LEFT OUTER JOIN [Address] addr ON addr.EmployeeId = emp.EmployeeId
INNER JOIN
CONTAINSTABLE([Address], *, '(JOHN AND Hamburg)', 1000) AS keyTblAddr
ON addr.AddressId = keyTblAddr.[KEY]
...
This is more of a syntax problem. How do you divine the user's intent with just one input box?
Are they looking for "John Hamburg" the person?
Are they looking for "John Hamburg Street"?
Are they looking for "John" who lives on "Hamburg Street" in Springfield?
Are they looking for "John" who lives in the city of "Hamburg"?
Without knowing the user's intent, the best you can hope for is to OR the terms, and take the highest ranking hits.
Otherwise, you need to program in a ton of logic, depending on the number of words passed in:
2 words:
Search Employee data for term 1, Search Employee data for term 2, Search Address data for term 1, Search address data for term 2. Merge results by term, order by most hits.
3 words:
Search Employee data for term 1, Search Employee data for term 2, Search employee data for term 3, Search Address data for term 1, Search address data for term 2, Search address data for term 3. Merge results by term, order by most hits.
etc...
I guess I would redesign the GUI to separate the input into Name and Address, at a minimum. If that is not possible, enforce a syntax rule to the effect "First words will be considered a name until a comma appears, any words after that will be considered addresses"
EDIT:
Your best bet is still OR the terms, and take the highest ranking hits. Here's an example of that, and an example why this is not ideal without some pre-processing of the input to divine the user's intent:
insert into Employee (id, [name]) values (1, 'John Hamburg')
insert into Employee (id, [name]) values (2, 'John Smith')
insert into Employee (id, [name]) values (3, 'Bob Hamburg')
insert into Employee (id, [name]) values (4, 'Bob Smith')
insert into Employee (id, [name]) values (5, 'John Doe')
insert into Address (id, street, city, employeeid) values (1, 'Main St.', 'Springville', 1)
insert into Address (id, street, city, employeeid) values (2, 'Hamburg St.', 'Springville', 2)
insert into Address (id, street, city, employeeid) values (3, 'St. John Ave.', 'Springville', 3)
insert into Address (id, street, city, employeeid) values (4, '5th Ave.', 'Hamburg', 4)
insert into Address (id, street, city, employeeid) values (5, 'Oak Lane', 'Hamburg', 5)
Now since we don't know what keywords will apply to what table, we have to assume they could apply to either table, so we have to OR the terms against each table, UNION the results, Aggregate them, and compute the highest rank.
SELECT Id, [Name], Street, City, SUM([Rank])
FROM
(
SELECT emp.Id, [Name], Street, City, [Rank]
FROM Employee emp
JOIN [Address] addr ON emp.Id = addr.EmployeeId
JOIN CONTAINSTABLE(Employee, *, 'JOHN OR Hamburg') AS keyTblEmp ON emp.Id = keyTblEmp.[KEY]
UNION ALL
SELECT emp.Id, [Name], Street, City, [Rank]
FROM Employee emp
JOIN [Address] addr ON emp.Id = addr.EmployeeId
JOIN CONTAINSTABLE([Address], *, 'JOHN OR Hamburg') AS keyTblAdd ON addr.Id = keyTblAdd.[KEY]
) as tmp
GROUP BY Id, [Name], Street, City
ORDER BY SUM([Rank]) DESC
This is less than ideal, here's what you get for the example (in your case, you would have wanted John Doe from Hamburg to show up first):
Id Name Street City Rank
2 John Smith Hamburg St. Springville 112
3 Bob Hamburg St. John Ave. Springville 112
5 John Doe Oak Lane Hamburg 96
1 John Hamburg Main St. Springville 48
4 Bob Smith 5th Ave. Hamburg 48
But that is the best you can do without parsing the input before submitting it to SQL to make a "best guess" at what the user wants.
I had the same problem. Here is my solution, which worked for my case:
I created a view that returns the columns that I want. I added another extra column which aggregates all the columns I want to search among. So, in this case the view would be like
SELECT emp.*, addr.*, ISNULL(emp.Name,'') + ' ' + ISNULL(addr.City, '') AS SearchResult
FROM Employee emp
LEFT OUTER JOIN [Address] addr ON addr.EmployeeId = emp.EmployeeId
After this I created a full-text index on SearchResult column. Then, I search on this column
SELECT *
FROM vEmpAddr ea
INNER JOIN CONTAINSTABLE(vEmpAddr, *, 'John AND Hamburg') a ON ea.ID = a.[Key]