Counting multiple different one to many relationships - sql

I have the following SQL:
SELECT j.AssocJobKey
, COUNT(DISTINCT o.ID) AS SubjectsOrdered
, COUNT(DISTINCT s.ID) AS SubjectsShot
FROM Jobs j
LEFT JOIN Orders o ON o.AssocJobKey = j.AssocJobKey
LEFT JOIN Subjects s ON j.AssocJobKey = s.AssocJobKey
GROUP BY
j.AssocJobKey
,j.JobYear
The basic structure is that a Job is the parent, unique by AssocJobKey, and has one-to-many relationships with Subjects and Orders.
The query gives me what I want, the output looks like this:
| AssocJobKey | SubjectsOrdered | SubjectsShot |
|-------------|-----------------|--------------|
| BAT-H181    | 107             | 830          |
| BAT-H131    | 226             | 1287         |
The problem is that the query is way too heavy and my memory is spiking; there's no way I could run this on a large dataset. If I remove one of the LEFT JOINs and the corresponding count, the query executes instantly and there's no problem. So somehow things are bouncing around between the two left joins more than they should, but I don't understand why they would.
Really hoping to avoid joining on sub selects if at all possible.

Your query is generating a Cartesian product for each job: every order row is paired with every subject row of the same job. And this is big. The second row of your output alone comes from about 290,000 intermediate rows (226 × 1,287). COUNT(DISTINCT) then has to figure out the unique ids among this Cartesian product.
The solution is simple: pre-aggregate:
SELECT j.AssocJobKey, o.SubjectsOrdered, s.SubjectsShot
FROM Jobs j LEFT JOIN
(SELECT o.AssocJobKey, COUNT(*) as SubjectsOrdered
FROM Orders o
GROUP BY o.AssocJobKey
) o
ON o.AssocJobKey = j.AssocJobKey LEFT JOIN
(SELECT s.AssocJobKey, COUNT(s.ID) AS SubjectsShot
FROM Subjects s
GROUP BY s.AssocJobKey
) s
ON j.AssocJobKey = s.AssocJobKey;
This makes certain assumptions that I think are reasonable:
The ids in the orders and subjects table are unique and non-NULL.
jobs.AssocJobKey is unique.
The query can be easily adapted if either of these is not true, but they seem like reasonable assumptions.
Often for these types of joins over different dimensions, COUNT(DISTINCT) is a reasonable solution (the queries are certainly simpler). This is true when there are at most a handful of values.
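To make the row explosion concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table and column names come from the question; the data (one job with 3 orders and 4 subjects) is made up for illustration, and your real engine may plan these queries differently.

```python
import sqlite3

# Hypothetical toy data: one job with 3 orders and 4 subjects.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Jobs (AssocJobKey TEXT PRIMARY KEY);
    CREATE TABLE Orders (ID INTEGER PRIMARY KEY, AssocJobKey TEXT);
    CREATE TABLE Subjects (ID INTEGER PRIMARY KEY, AssocJobKey TEXT);
    INSERT INTO Jobs VALUES ('BAT-H181');
    INSERT INTO Orders (AssocJobKey)
        VALUES ('BAT-H181'), ('BAT-H181'), ('BAT-H181');
    INSERT INTO Subjects (AssocJobKey)
        VALUES ('BAT-H181'), ('BAT-H181'), ('BAT-H181'), ('BAT-H181');
""")

# Rows produced by the double LEFT JOIN before any aggregation: 3 * 4.
joined_rows = conn.execute("""
    SELECT COUNT(*)
    FROM Jobs j
    LEFT JOIN Orders o ON o.AssocJobKey = j.AssocJobKey
    LEFT JOIN Subjects s ON s.AssocJobKey = j.AssocJobKey
""").fetchone()[0]

# Pre-aggregated version: each child table is collapsed to one row per job
# before it is joined, so no row multiplication happens.
pre_aggregated = conn.execute("""
    SELECT j.AssocJobKey, o.SubjectsOrdered, s.SubjectsShot
    FROM Jobs j
    LEFT JOIN (SELECT AssocJobKey, COUNT(*) AS SubjectsOrdered
               FROM Orders GROUP BY AssocJobKey) o
           ON o.AssocJobKey = j.AssocJobKey
    LEFT JOIN (SELECT AssocJobKey, COUNT(*) AS SubjectsShot
               FROM Subjects GROUP BY AssocJobKey) s
           ON s.AssocJobKey = j.AssocJobKey
""").fetchall()

print(joined_rows)     # 12
print(pre_aggregated)  # [('BAT-H181', 3, 4)]
```

With 3 orders and 4 subjects the join already yields 12 rows for a single job, while the pre-aggregated query only ever joins one row per job from each side.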

Related

How to select rows by comparing two sums on child tables, without subqueries?

I have a schema that looks like the following:
Invoices: | id | ... |
InvoicePayments: | id | invoice_id | amount_cents | ... |
LineItems: | id | invoice_id | unit_price_cents | quantity | ... |
and I am looking to find unpaid invoices, that is, invoices whose sum of amount_cents from InvoicePayments is less than the sum of (quantity * unit_price_cents) from LineItems. I was able to accomplish this with two subqueries, like:
SELECT prices.id FROM (
SELECT invoices.id, sum(invoice_payments.amount_cents) as paid
FROM invoices
LEFT JOIN invoice_payments ON invoice_payments.invoice_id = invoices.id
GROUP BY invoices.id
) payments JOIN (
SELECT invoices.id, sum(line_items.quantity * line_items.unit_price_cents) as price
FROM invoices
LEFT JOIN line_items ON line_items.invoice_id = invoices.id
GROUP BY invoices.id
) prices
ON payments.id = prices.id
WHERE paid < price OR paid IS NULL;
However, I am using ActiveRecord and would like something simpler that could be translated into Arel statements; additionally, I would like to use this as a reusable scope, so I could apply additional constraints, such as finding Invoices that were unpaid on a certain date, by filtering out InvoicePayments that are newer than that date.
Is there a way to accomplish this without subqueries so that I can use this more easily with Rails and apply flexible filters?
One approach would be to define views to encapsulate the subqueries you have defined above, and to then define read only models on them.
They can then associate with the invoice model, and give you the opportunity to simplify your syntax considerably.
SELECT
i.id,
COALESCE(SUM(l.quantity * l.unit_price_cents),0) - COALESCE(SUM(p.amount_cents),0) as UnpaidBalance
FROM
invoices i
LEFT JOIN invoice_payments p
ON i.id = p.invoice_id
AND p.DateColumn <= '2016-01-30'
LEFT JOIN line_items l
ON i.id = l.invoice_id
GROUP BY
i.id
HAVING
COALESCE(SUM(p.amount_cents),0) < COALESCE(SUM(l.quantity * l.unit_price_cents),0)
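Especially with the addition of window functions to many relational database management systems, such as PostgreSQL, you rarely need subqueries like that anymore. With left joins, aggregate functions ignore the NULL values, so you can combine both left joins in the same query. Be aware, though, that joining two child tables at once pairs every payment with every line item of the same invoice, so the sums are only reliable when an invoice has at most one payment and at most one line item. With more rows, each sum is multiplied by the row count of the other table, and you need to aggregate each child table separately.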
You may notice the use of table aliases, which make the query a little easier to read and save you a lot of typing. You define an alias by typing it just after the table name in your query. I also included p.DateColumn <= some date. You can parameterize the query to pass a variable there, test whether it is NULL, and fall back to the current date and time if you want the balance as of today.
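It is worth checking how a combined two-join query behaves once an invoice has several payments and several line items. The sketch below uses Python's sqlite3 module with the schema from the question and made-up amounts (two payments of 500, three line items of 400); the date filter is left out for brevity. It compares the combined join with a version that sums each child table on its own first.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE invoices (id INTEGER PRIMARY KEY);
    CREATE TABLE invoice_payments (
        id INTEGER PRIMARY KEY, invoice_id INTEGER, amount_cents INTEGER);
    CREATE TABLE line_items (
        id INTEGER PRIMARY KEY, invoice_id INTEGER,
        unit_price_cents INTEGER, quantity INTEGER);
    INSERT INTO invoices VALUES (1);
    -- Hypothetical data: 1000 paid in total, 1200 owed in total.
    INSERT INTO invoice_payments (invoice_id, amount_cents)
        VALUES (1, 500), (1, 500);
    INSERT INTO line_items (invoice_id, unit_price_cents, quantity)
        VALUES (1, 400, 1), (1, 400, 1), (1, 400, 1);
""")

# Both child tables joined at once: 2 payments x 3 line items = 6 rows,
# so every payment is summed 3 times and every line item twice.
combined = conn.execute("""
    SELECT i.id,
           COALESCE(SUM(p.amount_cents), 0) AS paid,
           COALESCE(SUM(l.quantity * l.unit_price_cents), 0) AS price
    FROM invoices i
    LEFT JOIN invoice_payments p ON p.invoice_id = i.id
    LEFT JOIN line_items l ON l.invoice_id = i.id
    GROUP BY i.id
""").fetchall()

# Each child table summed per invoice before joining.
pre_aggregated = conn.execute("""
    SELECT i.id, COALESCE(p.paid, 0), COALESCE(l.price, 0)
    FROM invoices i
    LEFT JOIN (SELECT invoice_id, SUM(amount_cents) AS paid
               FROM invoice_payments GROUP BY invoice_id) p
           ON p.invoice_id = i.id
    LEFT JOIN (SELECT invoice_id, SUM(quantity * unit_price_cents) AS price
               FROM line_items GROUP BY invoice_id) l
           ON l.invoice_id = i.id
""").fetchall()

print(combined)        # [(1, 3000, 2400)]
print(pre_aggregated)  # [(1, 1000, 1200)]
```

In this toy case the combined join would report the invoice as paid (3000 >= 2400) even though only 1000 of 1200 has been received, which is why the pre-aggregated form is the safer shape when both child tables can hold several rows per invoice.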

Distinct Count with two tables

I have two tables on Access, Customer and Transaction. I'm trying to find the total # of transactions sorted by Dogs (0-3). The Transaction table has a line for each item bought, so multiple lines can be for one TransactionID.
Here's what I have so far:
SELECT Customer.Dogs, COUNT(Transaction.TransactionID) AS TotTrans
FROM Transaction, Customer
WHERE Transaction.CustomerID = Customer.CustomerID
GROUP BY Dogs
And I get
Dogs | TotTrans
0 | 130104
1 | 59132
2 | 17811
3 | 1401
Obviously this counts the total rows in the Transaction Table and sorts them by # of dogs. However, it is counting for the duplicates in the Transaction Table (e.g. There are three rows with TransactionID = 2, because in that transaction the customer bought 3 items. The Count is obviously including the extra 2 rows).
When I try COUNT(DISTINCT Transaction.TransactionID), it doesn't work, and I get the message:
"Syntax error (missing operator) in query expression 'COUNT(DISTINCT Transaction.TransactionID)'."
I have looked around, but can't seem to find the solution. I think part of the problem stems from the fact that I'm selecting two attributes.
If anyone could help explain what to do and the logic behind it, that would be great!
You should join the customer table with an already distinct-ed table (using inner query)
SELECT Customer.Dogs, COUNT(distinctTransactions.TransactionID) AS TotTrans
FROM (select distinct TransactionID,CustomerID from Transaction) as
distinctTransactions, Customer
WHERE distinctTransactions.CustomerID = Customer.CustomerID
GROUP BY Dogs
You should learn to use proper join syntax. Also, table aliases make the query easier to write and to read:
SELECT c.Dogs, COUNT(DISTINCT t.TransactionID) AS TotTrans
FROM Transaction t JOIN
Customer c
ON t.CustomerID = c.CustomerID
GROUP BY c.Dogs
ORDER BY c.Dogs;
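For engines that reject COUNT(DISTINCT ...), the derived-table approach can be tried out on a small sample. This is a sketch in Python's sqlite3 module, not in Access itself; the rows are invented, and the table name is quoted because TRANSACTION is an SQL keyword.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Customer (CustomerID INTEGER PRIMARY KEY, Dogs INTEGER);
    CREATE TABLE "Transaction" (TransactionID INTEGER, CustomerID INTEGER, Item TEXT);
    INSERT INTO Customer VALUES (1, 0), (2, 1);
    -- Hypothetical data: transaction 2 has three item rows, transaction 3 has two.
    INSERT INTO "Transaction" VALUES
        (1, 1, 'leash'),
        (2, 1, 'bowl'), (2, 1, 'toy'), (2, 1, 'food'),
        (3, 2, 'collar'), (3, 2, 'treats');
""")

# Counting rows counts every purchased item, not every transaction.
row_counts = conn.execute("""
    SELECT c.Dogs, COUNT(t.TransactionID)
    FROM "Transaction" t JOIN Customer c ON t.CustomerID = c.CustomerID
    GROUP BY c.Dogs ORDER BY c.Dogs
""").fetchall()

# De-duplicate in a derived table first, then count; no COUNT(DISTINCT) needed.
transaction_counts = conn.execute("""
    SELECT c.Dogs, COUNT(*) AS TotTrans
    FROM (SELECT DISTINCT TransactionID, CustomerID FROM "Transaction") AS t
    JOIN Customer c ON t.CustomerID = c.CustomerID
    GROUP BY c.Dogs ORDER BY c.Dogs
""").fetchall()

print(row_counts)          # [(0, 4), (1, 2)]
print(transaction_counts)  # [(0, 2), (1, 1)]
```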

self join vs inner join

What is the difference between a self join and an inner join?
I find it helpful to think of all of the tables in a SELECT statement as representing their own data sets.
Before you've applied any conditions you can think of each data set as being complete (the entire table, for instance).
A join is just one of several ways to begin refining those data sets to find the information that you really want.
Though a database schema may be designed with certain relationships in mind (Primary Key <-> Foreign Key) these relationships really only exist in the context of a particular query. The query writer can relate whatever they want to whatever they want. I'll give an example of this later...
An INNER JOIN relates two tables to each other. There are often multiple JOIN operations in one query to chain together multiple tables. It can get as complicated as it needs to. For a simple example, consider the following three tables...
STUDENT
| STUDENTID | LASTNAME | FIRSTNAME |
|-----------|----------|-----------|
| 1         | Smith    | John      |
| 2         | Patel    | Sanjay    |
| 3         | Lee      | Kevin     |
| 4         | Jackson  | Steven    |
ENROLLMENT
| ENROLLMENTID | STUDENTID | CLASSID |
|--------------|-----------|---------|
| 1            | 2         | 3       |
| 2            | 3         | 1       |
| 3            | 4         | 2       |
CLASS
| CLASSID | COURSE | PROFESSOR |
|---------|--------|-----------|
| 1       | CS 101 | Smith     |
| 2       | CS 201 | Ghandi    |
| 3       | CS 301 | McDavid   |
| 4       | CS 401 | Martinez  |
The STUDENT table and the CLASS table were designed to relate to each other through the ENROLLMENT table. This kind of table is called a Junction Table.
To write a query to display all students and the classes in which they are enrolled one would use two inner joins...
SELECT stud.LASTNAME, stud.FIRSTNAME, class.COURSE, class.PROFESSOR
FROM STUDENT stud
INNER JOIN ENROLLMENT enr
ON stud.STUDENTID = enr.STUDENTID
INNER JOIN CLASS class
ON class.CLASSID = enr.CLASSID;
Read the above closely and you should see what is happening. What you will get in return is the following data set...
| LASTNAME | FIRSTNAME | COURSE | PROFESSOR |
|----------|-----------|--------|-----------|
| Patel    | Sanjay    | CS 301 | McDavid   |
| Lee      | Kevin     | CS 101 | Smith     |
| Jackson  | Steven    | CS 201 | Ghandi    |
Using the JOIN clauses we've limited the data sets of all three tables to only those that match each other. The "matches" are defined using the ON clauses. Note that if you ran this query you would not see the CLASSID 4 row from the CLASS table or the STUDENTID 1 row from the STUDENT table because those IDs don't exist in the matches (in this case the ENROLLMENT table). Look into "LEFT"/"RIGHT"/"FULL OUTER" JOINs for more reading on how to make that work a little differently.
Please note, per my comments on "relationships" earlier, there is no reason why you couldn't run a query relating the STUDENT table and the CLASS table directly on the LASTNAME and PROFESSOR columns. Those two columns match in data type and, well look at that! They even have a value in common! This would probably be a weird data set to get in return. My point is it can be done and you never know what needs you might have in the future for interesting connections in your data. Understand the design of the database but don't think of "relationships" as being rules that can't be ignored.
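If you want to try both ideas yourself (the three-table join through the junction table, and the unconventional join on LASTNAME and PROFESSOR), here is a small sketch using Python's sqlite3 module with the sample rows from the tables above. The alias cls and the ORDER BY clause are additions for the sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE STUDENT (STUDENTID INTEGER PRIMARY KEY, LASTNAME TEXT, FIRSTNAME TEXT);
    CREATE TABLE CLASS (CLASSID INTEGER PRIMARY KEY, COURSE TEXT, PROFESSOR TEXT);
    CREATE TABLE ENROLLMENT (ENROLLMENTID INTEGER PRIMARY KEY, STUDENTID INTEGER, CLASSID INTEGER);
    INSERT INTO STUDENT VALUES
        (1, 'Smith', 'John'), (2, 'Patel', 'Sanjay'),
        (3, 'Lee', 'Kevin'), (4, 'Jackson', 'Steven');
    INSERT INTO CLASS VALUES
        (1, 'CS 101', 'Smith'), (2, 'CS 201', 'Ghandi'),
        (3, 'CS 301', 'McDavid'), (4, 'CS 401', 'Martinez');
    INSERT INTO ENROLLMENT VALUES (1, 2, 3), (2, 3, 1), (3, 4, 2);
""")

# Students and their classes, related through the junction table.
enrolled = conn.execute("""
    SELECT stud.LASTNAME, stud.FIRSTNAME, cls.COURSE, cls.PROFESSOR
    FROM STUDENT stud
    INNER JOIN ENROLLMENT enr ON stud.STUDENTID = enr.STUDENTID
    INNER JOIN CLASS cls ON cls.CLASSID = enr.CLASSID
    ORDER BY cls.COURSE
""").fetchall()

# Nothing stops you from relating columns the schema never intended to relate.
name_matches = conn.execute("""
    SELECT stud.FIRSTNAME, stud.LASTNAME, cls.COURSE
    FROM STUDENT stud
    INNER JOIN CLASS cls ON cls.PROFESSOR = stud.LASTNAME
""").fetchall()

print(enrolled)
# [('Lee', 'Kevin', 'CS 101', 'Smith'),
#  ('Jackson', 'Steven', 'CS 201', 'Ghandi'),
#  ('Patel', 'Sanjay', 'CS 301', 'McDavid')]
print(name_matches)  # [('John', 'Smith', 'CS 101')]
```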
In the meantime... SELF JOINS!
Consider the following table...
PERSON
| PERSONID | FAMILYID | NAME  |
|----------|----------|-------|
| 1        | 1        | John  |
| 2        | 1        | Brynn |
| 3        | 2        | Arpan |
| 4        | 2        | Steve |
| 5        | 2        | Tim   |
| 6        | 3        | Becca |
If you felt so inclined as to make a database of all the people you know and which ones are in the same family this might be what it looks like.
If you wanted to return one person, PERSONID 4, for instance, you would write...
SELECT * FROM PERSON WHERE PERSONID = 4;
You would learn that he is in the family with FAMILYID 2. Then to find all of the PERSONs in his family you would write...
SELECT * FROM PERSON WHERE FAMILYID = 2;
Done and done! SQL, of course, can accomplish this in one query using, you guessed it, a SELF JOIN.
What really triggers the need for a SELF JOIN here is that the table contains a unique column (PERSONID) and a column that serves as sort of a "Category" (FAMILYID). This concept is called Cardinality and in this case represents a one to many or 1:M relationship. There is only one of each PERSON but there are many PERSONs in a FAMILY.
So, what we want to return is all of the members of a family if one member of the family's PERSONID is known...
SELECT fam.*
FROM PERSON per
JOIN PERSON fam
ON per.FamilyID = fam.FamilyID
WHERE per.PERSONID = 4;
Here's what you would get...
| PERSONID | FAMILYID | NAME  |
|----------|----------|-------|
| 3        | 2        | Arpan |
| 4        | 2        | Steve |
| 5        | 2        | Tim   |
Let's note a couple of things. The words SELF JOIN don't occur anywhere. That's because a SELF JOIN is just a concept. The word JOIN in the query above could have been a LEFT JOIN instead and different things would have happened. The point of a SELF JOIN is that you are using the same table twice.
Consider my soapbox from before on data sets. Here we have started with the data set from the PERSON table twice. Neither instance of the data set affects the other one unless we say it does.
Let's start at the bottom of the query. The per data set is being limited to only those rows where PERSONID = 4. Knowing the table we know that will return exactly one row. The FAMILYID column in that row has a value of 2.
In the ON clause we are limiting the fam data set (which at this point is still the entire PERSON table) to only those rows where the value of FAMILYID matches one or more of the FAMILYIDs of the per data set. As we discussed we know the per data set only has one row, therefore one FAMILYID value. Therefore the fam data set now contains only rows where FAMILYID = 2.
Finally, at the top of the query we are SELECTing all of the rows in the fam data set.
Voila! Two queries in one.
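The walkthrough above can be reproduced with a short sketch in Python's sqlite3 module, using the PERSON rows from the example. The ORDER BY is an addition so the output order is predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE PERSON (PERSONID INTEGER PRIMARY KEY, FAMILYID INTEGER, NAME TEXT);
    INSERT INTO PERSON VALUES
        (1, 1, 'John'), (2, 1, 'Brynn'), (3, 2, 'Arpan'),
        (4, 2, 'Steve'), (5, 2, 'Tim'), (6, 3, 'Becca');
""")

# "per" is narrowed to the one known person; "fam" is the same table again,
# narrowed to everyone who shares that person's FAMILYID.
family = conn.execute("""
    SELECT fam.*
    FROM PERSON per
    JOIN PERSON fam ON per.FAMILYID = fam.FAMILYID
    WHERE per.PERSONID = ?
    ORDER BY fam.PERSONID
""", (4,)).fetchall()

print(family)  # [(3, 2, 'Arpan'), (4, 2, 'Steve'), (5, 2, 'Tim')]
```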
In conclusion, an INNER JOIN is one of several kinds of JOIN operations. I would strongly suggest reading further into LEFT, RIGHT and FULL OUTER JOINs (which are, collectively, called OUTER JOINs). I personally missed a job opportunity for having a weak knowledge of OUTER JOINs once and won't let it happen again!
A SELF JOIN is simply any JOIN operation where you are relating a table to itself. You can JOIN that table to itself with an INNER JOIN or an OUTER JOIN. Note that with a SELF JOIN you must use table aliases (fam and per above; choose whatever makes sense for your query), otherwise the SQL engine has no way to tell the two instances of the same table apart.
Now that you understand the difference open your mind nice and wide and realize that one single query could contain all different kinds of JOINs at once. It's just a matter of what data you want and how you have to twist and bend your query to get it. If you find yourself running one query and taking the result of that query and using it as the input of another query then you can probably use a JOIN to make it one query instead.
To play around with SQL, try visiting W3Schools.com. There is a locally stored database there with a bunch of tables that are designed to relate to each other in various ways, and it's filled with data! You can CREATE, DROP, INSERT, UPDATE and SELECT all you want and return the database to its default at any time. Try all sorts of SQL out to experiment with different tricks. I've learned a lot there, myself.
Sorry if this was a little wordy but I personally struggled with the concept of JOINs when I was starting to learn SQL and explaining a concept by using a bunch of other complex concepts bogged me down. Best to start at the bottom sometimes.
I hope it helps. If you can put JOINs in your back pocket you can work magic with SQL!
Happy querying!
A self join joins a table to itself. The employee table might be joined to itself in order to show the manager name and the employee name in the same row.
An inner join joins any two tables and returns rows where the key exists in both tables. A self join can be an inner join (most joins are inner joins and most self joins are inner joins). An inner join can be a self join but most inner joins involve joining two different tables (generally a parent table and a child table).
An inner join (sometimes called a simple join) is a join of two or more tables that returns only those rows that satisfy the join condition.
A self join is a join of a table to itself. This table appears twice in the FROM clause and is followed by table aliases that qualify column names in the join condition. To perform a self join, Oracle Database combines and returns rows of the table that satisfy the join condition.
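The employee and manager case mentioned above might look like the following sketch in Python's sqlite3 module. The employee table, its columns and the names in it are hypothetical; they only illustrate that a self join can be written as an inner join or as an outer join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical table: manager_id points back at employee.id.
    CREATE TABLE employee (id INTEGER PRIMARY KEY, name TEXT, manager_id INTEGER);
    INSERT INTO employee VALUES (1, 'Ada', NULL), (2, 'Ben', 1), (3, 'Cy', 1);
""")

# Self join as an INNER JOIN: employees without a manager drop out.
inner_rows = conn.execute("""
    SELECT e.name, m.name
    FROM employee e
    INNER JOIN employee m ON e.manager_id = m.id
    ORDER BY e.id
""").fetchall()

# Self join as a LEFT (outer) JOIN: every employee is kept.
left_rows = conn.execute("""
    SELECT e.name, m.name
    FROM employee e
    LEFT JOIN employee m ON e.manager_id = m.id
    ORDER BY e.id
""").fetchall()

print(inner_rows)  # [('Ben', 'Ada'), ('Cy', 'Ada')]
print(left_rows)   # [('Ada', None), ('Ben', 'Ada'), ('Cy', 'Ada')]
```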

SQLite3 - Complex WHERE expression that uses two tables

Sorry for my English
I have two tables:
Partners
ID | NAME | IS_FAVORITE
PartnerPoints
ID | PARTNER_ID | NAME
And I want to get all rows from PartnerPoints that are related to Partners (by PARTNER_ID) with the field IS_FAVORITE set to 1, i.e. I want to get all favorite partner points.
How can I do that?
You just need to use a WHERE clause:
SELECT PartnerPoints.*
FROM PartnerPoints
WHERE EXISTS ( SELECT *
FROM Partners
WHERE Partners.ID = PartnerPoints.PARTNER_ID
AND Partners.IS_FAVORITE = 1
)
You can do this by JOINING the tables.
SELECT PartnerPoints.*
FROM PartnerPoints JOIN Partners ON PartnerPoints.Partner_ID=Partners.ID
WHERE Partners.Is_favorite = 1
This is an INNER JOIN. Oscar Pérez’s answer, with the subquery, is called a SEMI-JOIN. The database may execute the same plan, or this INNER JOIN may be faster. In more complicated cases, you may have to use a semi-join.
You can do this by first computing the IDs of all favorite partners, and then searching for PartnerPoints that have such a partner ID:
SELECT *
FROM PartnerPoints
WHERE Partner_ID IN (SELECT ID
FROM Partners
WHERE Is_Favorite = 1)
Which type of query is fastest depends on the amount and distribution of data in the tables, and of which indexes you have; if the speed actually matters to you, you have to measure all three queries.
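To check that the three formulations agree before you time them, you can run them side by side. This is a sketch with Python's sqlite3 module and invented rows; it assumes Partners.ID is unique, which is what makes the INNER JOIN equivalent to the other two.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Partners (ID INTEGER PRIMARY KEY, NAME TEXT, IS_FAVORITE INTEGER);
    CREATE TABLE PartnerPoints (ID INTEGER PRIMARY KEY, PARTNER_ID INTEGER, NAME TEXT);
    -- Hypothetical data: partner 1 is a favorite, partner 2 is not.
    INSERT INTO Partners VALUES (1, 'A', 1), (2, 'B', 0);
    INSERT INTO PartnerPoints VALUES (1, 1, 'p1'), (2, 1, 'p2'), (3, 2, 'p3');
""")

queries = {
    "exists": """
        SELECT PartnerPoints.* FROM PartnerPoints
        WHERE EXISTS (SELECT * FROM Partners
                      WHERE Partners.ID = PartnerPoints.PARTNER_ID
                        AND Partners.IS_FAVORITE = 1)
        ORDER BY PartnerPoints.ID""",
    "join": """
        SELECT PartnerPoints.* FROM PartnerPoints
        JOIN Partners ON PartnerPoints.PARTNER_ID = Partners.ID
        WHERE Partners.IS_FAVORITE = 1
        ORDER BY PartnerPoints.ID""",
    "in": """
        SELECT * FROM PartnerPoints
        WHERE PARTNER_ID IN (SELECT ID FROM Partners WHERE IS_FAVORITE = 1)
        ORDER BY ID""",
}

results = {name: conn.execute(sql).fetchall() for name, sql in queries.items()}
for name, rows in results.items():
    print(name, rows)  # each prints [(1, 1, 'p1'), (2, 1, 'p2')]
```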

SQL JOIN returning multiple rows when I only want one row

I am having a slow brain day...
The tables I am joining:
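Policy_Office:
| PolicyNumber | OfficeCode |
|--------------|------------|
| 1            | A          |
| 2            | B          |
| 3            | C          |
| 4            | D          |
| 5            | A          |
Office_Info:
| OfficeCode | AgentCode | OfficeName |
|------------|-----------|------------|
| A          | 123       | Acme       |
| A          | 456       | Acme       |
| A          | 789       | Acme       |
| B          | 111       | Ace        |
| B          | 222       | Ace        |
| B          | 333       | Ace        |
| ...        | ...       | ...        |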
I want to perform a search to return all policies that are affiliated with an office name. For example, if I search for "Acme", I should get two policies: 1 & 5.
My current query looks like this:
SELECT
*
FROM
Policy_Office P
INNER JOIN Office_Info O ON P.OfficeCode = O.OfficeCode
WHERE
O.OfficeName = 'Acme'
But this query returns multiple rows, which I know is because there are multiple matches from the second table.
How do I write the query to only return two rows?
SELECT DISTINCT a.PolicyNumber
FROM Policy_Office a
INNER JOIN Office_Info b
ON a.OfficeCode = b.OfficeCode
WHERE b.officeName = 'Acme'
To learn more about joins, see the article below:
Visual Representation of SQL Joins
An inner join returns every matching pair of rows. You have 2 rows with office code A in the first table and 3 in the second, so you get 2 × 3 = 6 results. If you only want the policy number, select it with DISTINCT.
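The 2 × 3 arithmetic is easy to verify with a sketch in Python's sqlite3 module, loaded with the sample rows shown in the question (the elided rows are left out).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Policy_Office (PolicyNumber INTEGER, OfficeCode TEXT);
    CREATE TABLE Office_Info (OfficeCode TEXT, AgentCode INTEGER, OfficeName TEXT);
    INSERT INTO Policy_Office VALUES (1, 'A'), (2, 'B'), (3, 'C'), (4, 'D'), (5, 'A');
    INSERT INTO Office_Info VALUES
        ('A', 123, 'Acme'), ('A', 456, 'Acme'), ('A', 789, 'Acme'),
        ('B', 111, 'Ace'), ('B', 222, 'Ace'), ('B', 333, 'Ace');
""")

# Plain join: each of the 2 'A' policies matches each of the 3 Acme agents.
joined = conn.execute("""
    SELECT P.PolicyNumber, O.AgentCode
    FROM Policy_Office P
    INNER JOIN Office_Info O ON P.OfficeCode = O.OfficeCode
    WHERE O.OfficeName = 'Acme'
""").fetchall()

# DISTINCT on the policy number collapses the duplicates.
policies = [row[0] for row in conn.execute("""
    SELECT DISTINCT P.PolicyNumber
    FROM Policy_Office P
    INNER JOIN Office_Info O ON P.OfficeCode = O.OfficeCode
    WHERE O.OfficeName = 'Acme'
    ORDER BY P.PolicyNumber
""")]

print(len(joined))  # 6
print(policies)     # [1, 5]
```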
(using MS-Sqlserver)
I know this thread is 10 years old, but I don't like distinct (in my head it means that the engine gathers all possible data, computes every selected row in each record into a hash and adds it to a tree ordered by that hash; I may be wrong, but it seems inefficient).
Instead, I use CTE and the function row_number(). The solution may very well be a much slower approach, but it's pretty, easy to maintain and I like it:
Given are a person table and a telephone table tied together with a foreign key (in the telephone table). This construct means that a person can have several numbers, but I only want the first, so that each person appears only once in the result set (I ought to be able to concatenate multiple telephone numbers into one string (pivot, I think), but that's another issue).
; -- don't forget this one!
with telephonenumbers
as
(
select [id]
, [person_id]
, [number]
, row_number() over (partition by [person_id] order by [activestart] desc) as rowno
from [dbo].[telephone]
where ([activeuntil] is null or [activeuntil] > getdate())
)
select p.[id]
,p.[name]
,t.[number]
from [dbo].[person] p
left join telephonenumbers t on t.person_id = p.id
and t.rowno = 1
This does the trick (in fact, the last line does), and the syntax is readable and easy to expand. The example is simple, but when creating large scripts that join tables left and right (literally), it is difficult to keep unwanted duplicates out of the result, and difficult to identify which tables create them. CTEs work great for me.
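The same pattern can be tried outside SQL Server. Below is a sketch in Python's sqlite3 module; it assumes the bundled SQLite is version 3.25 or newer (needed for window functions), replaces getdate() with SQLite's date('now'), and uses invented people and numbers.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE telephone (
        id INTEGER PRIMARY KEY, person_id INTEGER, number TEXT,
        activestart TEXT, activeuntil TEXT);
    -- Hypothetical data: Ann has two active numbers, Bob has one expired
    -- and one active number, Cat has none.
    INSERT INTO person VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cat');
    INSERT INTO telephone VALUES
        (1, 1, '111', '2020-01-01', NULL),
        (2, 1, '222', '2022-01-01', NULL),
        (3, 2, '333', '2021-01-01', '2000-01-01'),
        (4, 2, '444', '2019-01-01', NULL);
""")

# Number the active phones per person, newest first, then keep only rowno = 1
# in the join condition so people without a phone still appear.
rows = conn.execute("""
    WITH telephonenumbers AS (
        SELECT id, person_id, number,
               ROW_NUMBER() OVER (PARTITION BY person_id
                                  ORDER BY activestart DESC) AS rowno
        FROM telephone
        WHERE activeuntil IS NULL OR activeuntil > date('now')
    )
    SELECT p.id, p.name, t.number
    FROM person p
    LEFT JOIN telephonenumbers t ON t.person_id = p.id AND t.rowno = 1
    ORDER BY p.id
""").fetchall()

print(rows)  # [(1, 'Ann', '222'), (2, 'Bob', '444'), (3, 'Cat', None)]
```

Putting the rowno = 1 test in the ON clause rather than in a WHERE clause is what keeps the person without any telephone in the result.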