Best way to write a query with too many joins - sql

I have a database table (let's call it project) and many other tables; most of these other tables have a foreign key (id_project) referencing the project table.
The goal of this query is to return which phase the project is in right now (a project develops little by little until it reaches its end). There are over 20 tables that a project may pass through. My solution was to join all of them and check which tables have NULL values, like this:
SELECT
p.id_project,
CASE
WHEN po.id IS NOT NULL THEN 'payment completed'
WHEN b.id IS NOT NULL THEN 'bill received'
WHEN e.id IS NOT NULL THEN 'project engaged'
--(and still many other cases)
ELSE 'start of the project'
END AS progress
FROM
project p
LEFT JOIN
decision d ON d.id_project = p.id_project
LEFT JOIN
engagement e ON e.id_project = p.id_project
LEFT JOIN
bill b ON b.id_project = p.id_project
LEFT JOIN
payment_order po ON po.id_project = p.id_project
LEFT JOIN
--..... (many other tables)
This query takes about 9 seconds at best to execute and it is used quite frequently (as a view called from other queries).
Is there a better solution or another approach?

How about another approach? A project can be in only one phase at a time, right? So you could alter the PROJECT table and add a new column, PROJECT_PHASE, which would contain the current phase. That column is to be updated as soon as the project moves to another phase; the way I understood it, that is when a new row is created in any of those 20 tables.
Another option is to create a new table, project_phase, which would contain the id_project and id_phase combination (along with e.g. a timestamp, whatever).
Either approach would mean that you'd quickly fetch the current project phase without outer joining 20 (large?) tables, which takes time.
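A minimal sketch of the stored-phase idea, run through Python's sqlite3 module so it is self-contained. The cut-down schema (only two of the roughly 20 phase tables), the trigger names and the use of triggers to do the update are all assumptions for illustration; the update could equally be done by the application.

```python
import sqlite3

# Hypothetical, cut-down schema: only two of the ~20 phase tables are modelled.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE project (
    id_project    INTEGER PRIMARY KEY,
    project_phase TEXT NOT NULL DEFAULT 'start of the project'
);
CREATE TABLE engagement (id INTEGER PRIMARY KEY, id_project INTEGER);
CREATE TABLE bill       (id INTEGER PRIMARY KEY, id_project INTEGER);

-- One trigger per phase table; each records the phase on the project row.
CREATE TRIGGER engagement_sets_phase AFTER INSERT ON engagement
BEGIN
    UPDATE project SET project_phase = 'project engaged'
    WHERE id_project = NEW.id_project;
END;
CREATE TRIGGER bill_sets_phase AFTER INSERT ON bill
BEGIN
    UPDATE project SET project_phase = 'bill received'
    WHERE id_project = NEW.id_project;
END;
""")

conn.execute("INSERT INTO project (id_project) VALUES (1)")
conn.execute("INSERT INTO engagement (id_project) VALUES (1)")
conn.execute("INSERT INTO bill (id_project) VALUES (1)")

# Reading the current phase is now a single-table lookup.
phase = conn.execute(
    "SELECT project_phase FROM project WHERE id_project = 1"
).fetchone()[0]
print(phase)  # bill received
```

Note that this sketch assumes rows arrive in phase order. A late insert into an earlier-phase table would move the stored phase backwards, so a real version would probably need to compare phase ranks before updating.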

We don't know your data, but your database design shows a 1:n relation for all tables, i.e. multiple decisions for one project, multiple engagements for one project, multiple bills, etc. Now let's assume there are three decisions, three engagements and four bills so far for a project. You join all rows on the project ID alone. This is called a cartesian product per project, creating all combinations (each row with each other row) producing 3 x 3 x 4 = 36 rows for this one project alone.
I am surprised you haven't noticed this yourself, as you say you are already using the query and there is no aggregation taking place. Or is this what you refer to with "too many joins"?
Instead of cross joining all those rows just look the tables up with EXISTS or IN.
SELECT
p.id_project,
CASE
WHEN p.id_project IN (SELECT po.id_project FROM payment_order po) THEN 'payment completed'
WHEN p.id_project IN (SELECT b.id_project FROM bill b) THEN 'bill received'
WHEN p.id_project IN (SELECT e.id_project FROM engagement e) THEN 'project engaged'
-- (and still many other cases)
ELSE 'start of the project'
END AS progress
FROM project p;
A faster alternative would be to store the status in the project table as suggested by Littlefoot (and ideally a table for the statuses) and then have triggers on all those tables that update that status.
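To make the row multiplication described above concrete, here is a small sketch using Python's sqlite3 module. The schema is a guess at the asker's tables reduced to three of them, and EXISTS is used in place of IN (the two are interchangeable here).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE project    (id_project INTEGER PRIMARY KEY);
CREATE TABLE decision   (id INTEGER PRIMARY KEY, id_project INTEGER);
CREATE TABLE engagement (id INTEGER PRIMARY KEY, id_project INTEGER);
CREATE TABLE bill       (id INTEGER PRIMARY KEY, id_project INTEGER);
INSERT INTO project VALUES (1);
""")
# Three decisions, three engagements and four bills for project 1.
for table, n in (("decision", 3), ("engagement", 3), ("bill", 4)):
    conn.executemany(
        f"INSERT INTO {table} (id_project) VALUES (?)", [(1,)] * n
    )

# Joining on the project ID alone multiplies the rows per project.
joined_rows = conn.execute("""
    SELECT COUNT(*)
    FROM project p
    LEFT JOIN decision   d ON d.id_project = p.id_project
    LEFT JOIN engagement e ON e.id_project = p.id_project
    LEFT JOIN bill       b ON b.id_project = p.id_project
""").fetchone()[0]

# Looking the tables up with EXISTS keeps exactly one row per project.
exists_rows = conn.execute("""
    SELECT p.id_project,
           CASE
               WHEN EXISTS (SELECT 1 FROM bill b WHERE b.id_project = p.id_project)
                   THEN 'bill received'
               WHEN EXISTS (SELECT 1 FROM engagement e WHERE e.id_project = p.id_project)
                   THEN 'project engaged'
               ELSE 'start of the project'
           END AS progress
    FROM project p
""").fetchall()

print(joined_rows)  # 36, i.e. 3 * 3 * 4
print(exists_rows)  # [(1, 'bill received')]
```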

Related

Simple.Data - How to apply WHERE clauses to joined tables

I'm trying to use Simple.Data as my ADO, but I've run into a problem trying to put together a query that joins a couple of tables, then filters the results based on values in the non-primary tables.
Scenario is a job application app (but jobs are like a specific task to be done on a given day). There are 3 relevant tables: jobs, applications and application_history. There can be many applications for each record in the jobs table, and many application_history records for each application. In the application_history table, there's a status column as each application gets sent, offered and finally accepted.
So I want a query that returns all the accepted applications that are for jobs in the future; i.e. where the date column in the jobs table is in the future and where there's an associated record in the application_history table where the status column is 3 (meaning accepted).
If this was plain old SQL, I'd use this query:
SELECT A.* FROM application AS A
INNER JOIN application_history AS AH ON AH.application_id = A.id
INNER JOIN job AS J ON J.id = A.job_id
WHERE AH.status_id = 3 AND J.date > date('now')
But I want to know how to achieve the same thing using Simple.Data. For bonus points, if you could start by ignoring the 'job must be in the future' step, that will help me understand what's going on.
As a reference: Simple.Data documentation especially the part about explicit joins.
You should be able to do something like this:
//db is your Simple.Data Database object
db.application
.Join(db.application_history)
.On(db.application.id == db.application_history.application_id)
.Join(db.job )
.On(db.application.job_id == db.job.id)
.Where(db.application_history.status_id == 3 && db.job.date > DateTime.Now);
I'm not sure whether or not Simple.Data knows how to handle the Date part.

Multiple JOIN statements returning multiple rows

I believe I need a fresh set of eyes, my attention has been pulled elsewhere at work and I have not had the time to figure this out. So I'm hoping someone may be kind enough to offer a suggestion.
Here is an abbreviated version of my SQL statement:
SELECT
PR.PROJECTNUM,
PR.PROJECTNUMBER,
PR.AMRNUM,
W.WONUM,
C.PONUM,
C.POLINENUM
FROM PROJECT PR
INNER JOIN WORKORDER W
ON PR.PROJECTNUM = W.PROJECTNUM
OR PR.PROJECTNUMBER = W.PROJECTNUMBER
OR PR.AMRNUM = W.AMRNUM
INNER JOIN
(SELECT PL.WONUM, P.PONUM, PL.POLINENUM FROM PO P
INNER JOIN POLINE PL ON P.PONUM = PL.PONUM) C
ON W.WONUM = C.WONUM;
As you can see, I'm joining 4 tables here: PO to POLINE to WORKORDER to PROJECT. The issue lies with the multiple joining attributes between the WORKORDER and PROJECT tables.
I do not know beforehand which attribute/field will be populated with a value in the WORKORDER table, but at least one will be...but sometimes all three. The duplication occurs when more than one of the joining attributes in the WORKORDER table is populated with a matching value in the PROJECT table.
It's almost as if I need to test for the presence of a value in the joining attribute from the WORKORDER table before I execute the above SQL....and if more than one is populated with a value, then I need to find which one of the PROJECT attributes has a matching value....geez...even typing it out is making my head spin...lol
I may need to come back in the morning and add a little more context, my brain is fried at the moment :)
Thanks for reading!

Full outer join joining together every record multiple times

Query below:
select
cu.course_id as 'bb_course_id',
cu.user_id as 'bb_user_id',
cu.role as 'bb_role',
cu.available_ind as 'bb_available_ind',
CASE cu.row_status WHEN 0 THEN 'ENABLED' ELSE 'DISABLED' END AS 'bb_row_status',
eff.course_id as 'registrar_course_id',
eff.user_id as 'registrar_user_id',
eff.role as 'registrar_role',
eff.available_ind as 'registrar_available_ind',
CASE eff.row_status WHEN 'DISABLE' THEN 'DISABLED' END as 'registrar_row_status'
into enrollments_comparison_temp
from narrowed_users_enrollments cu
full outer join enrollments_feed_file eff on cu.course_id = eff.course_id
Quick background: I'm taking the data from a replicated table and selecting it into narrowed_users_enrollments based on some criteria. In a script I'm taking a text feed file, with enrollment data, and inserting it into enrollments_feed_file. The purpose is to compare the most recent enrollment data with enrollments already in the database.
However the issue is that joining these tables results in about 160,000 rows when I'm really only expecting about 22,000. The point of doing this comparison is so that I can look for nulled values on either side of the join. For example, if the table on the right contains a null, then disable the enrollment record. If the table on the left contains a null, then add this student's enrollment.
I know it's a little off because I'm not using PKs or FKs. This is what is selected into the table:
Here's a screenshot showing a select * from the enrollments table on the left and a feed file on the right.
http://i.imgur.com/0ZPZ9HS.png
Here's a screenshot showing the newly created table from the full outer join.
http://i.imgur.com/89ssAkS.png
As you can see, even though there's only one matching enrollment (the matching jmartinez12 columns), there are 4 extra rows created for the same record on the left for the enrollments on the right. What I'm trying to get is 5 rows, with the first being how it is in the screenshot (matching pre-existing enrollment and enrollment in the feed file), BUT the next 4 rows should have NULL in the bb_* columns up to the registrar_course_id.
Am I overlooking something simple here? I've tried a select distinct and I've added a where clause specifying when the course_ids are equal however that ensures that I won't get null rows which I need. I have also joined the tables on the user_id however the results are still the same.
One quick suggestion is to add the DISTINCT clause. If the records you are getting are complete duplicates, that may cut it down to what you are expecting.
The fix was to also join on:
ON cu.course_id = eff.course_id AND cu.user_id = eff.user_id
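A small sketch of why the extra join column matters, using Python's sqlite3 module with made-up rows. Because SQLite only gained FULL OUTER JOIN in version 3.39, the helper below emulates it with a LEFT JOIN plus UNION ALL; on SQL Server the plain FULL OUTER JOIN from the question would be used instead. Only the two key columns are modelled.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE narrowed_users_enrollments (course_id TEXT, user_id TEXT);
CREATE TABLE enrollments_feed_file      (course_id TEXT, user_id TEXT);
INSERT INTO narrowed_users_enrollments VALUES
    ('C1', 'jmartinez12'), ('C1', 'old_user');
INSERT INTO enrollments_feed_file VALUES
    ('C1', 'jmartinez12'), ('C1', 'u2'), ('C1', 'u3'), ('C1', 'u4'), ('C1', 'u5');
""")

def full_outer(on_clause):
    """Emulate FULL OUTER JOIN: left join, plus feed rows that found no match."""
    sql = f"""
        SELECT cu.user_id, eff.user_id
        FROM narrowed_users_enrollments cu
        LEFT JOIN enrollments_feed_file eff ON {on_clause}
        UNION ALL
        SELECT cu.user_id, eff.user_id
        FROM enrollments_feed_file eff
        LEFT JOIN narrowed_users_enrollments cu ON {on_clause}
        WHERE cu.user_id IS NULL
    """
    return conn.execute(sql).fetchall()

course_only = full_outer("cu.course_id = eff.course_id")
both = full_outer("cu.course_id = eff.course_id AND cu.user_id = eff.user_id")

# NULL on the feed side: disable. NULL on the database side: add.
to_disable = sorted(bb for bb, reg in both if reg is None)
to_add = sorted(reg for bb, reg in both if bb is None)

print(len(course_only))  # 10: every enrollment pairs with every feed row
print(len(both))         # 6: one match, one left-only, four right-only
print(to_disable)        # ['old_user']
print(to_add)            # ['u2', 'u3', 'u4', 'u5']
```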

Derived column results

I'm an SQL novice and have to formulate an SQL query that lists all 5 columns from a QUALITY table and adds two more columns: ProductCode of the items produced in the batch, and a derived column BatchQuality that contains “Poor” if the batch is of poor quality (contains more than 1 defective item) and “Good” otherwise.
I'm pulling from 3 tables that I put in an Oracle database: Production table (contains serialno, batchno, and productcode), Quality table (batchno, test1, test2, test3, test4), and Defective table (defectiveid, serialno).
I'm able to get 6 out of 7 columns by using the following:
select q.batchno, q.test1, q.test2, q.test3, q.test4, p.productcode_id
from production p, defective d, quality q
where d.serialno = p.serialno
and p.batchno = q.batchno;
Any ideas on how to get the last column called batchquality that says if it's good or poor? I'm thinking that I need a count function, but once I have that, how would I go about getting a new column that would state poor or good?
Appreciate any help that can be provided.
Your current query is an inner join using an old, outdated implicit join in the where clause. I assume the defective table only contains a row for a product if there was a defect. Your inner join will always return defective parts only, never parts without defects. For that you need an outer join. Another reason to ditch the outdated implicit joins and use an explicit JOIN operator:
select q.batchno, q.test1, q.test2, q.test3, q.test4, p.productcode_id
from production p
JOIN quality q ON p.batchno = q.batchno
LEFT JOIN defective d ON d.serialno = p.serialno;
For products that do not have defects, the values for the columns from the defective table will be null. So to get a flag showing whether a product is "good" or "bad" you need to check if the value is null:
select q.batchno, q.test1, q.test2, q.test3, q.test4, p.productcode_id,
case
when d.serialno is null then 'good'
else 'bad'
end as batch_quality
from production p
JOIN quality q ON p.batchno = q.batchno
LEFT JOIN defective d ON d.serialno = p.serialno;
Due to the nature of joins, the above statement will however repeat each row from the production table for each row in the quality and defective table. It is not clear to me if you want that or not.
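If one row per batch is what is wanted, a possible sketch is to group and count the defective serial numbers, shown here through Python's sqlite3 module rather than Oracle. It assumes each batch has a single product code, uses the question's "more than 1 defective item" rule, and keeps only test1 from the quality table for brevity; the sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE production (serialno INTEGER PRIMARY KEY, batchno INTEGER, productcode_id TEXT);
CREATE TABLE quality    (batchno INTEGER PRIMARY KEY, test1 TEXT);
CREATE TABLE defective  (defectiveid INTEGER PRIMARY KEY, serialno INTEGER);
INSERT INTO quality VALUES (1, 'pass'), (2, 'pass');
INSERT INTO production VALUES
    (1, 1, 'A'), (2, 1, 'A'), (3, 1, 'A'), (4, 2, 'B'), (5, 2, 'B');
-- Batch 1 has two defective items, batch 2 has one.
INSERT INTO defective (serialno) VALUES (1), (2), (4);
""")

rows = conn.execute("""
    SELECT q.batchno, q.test1, p.productcode_id,
           CASE WHEN COUNT(DISTINCT d.serialno) > 1
                THEN 'Poor' ELSE 'Good' END AS batchquality
    FROM quality q
    JOIN production p ON p.batchno = q.batchno
    LEFT JOIN defective d ON d.serialno = p.serialno
    GROUP BY q.batchno, q.test1, p.productcode_id
    ORDER BY q.batchno
""").fetchall()

print(rows)  # [(1, 'pass', 'A', 'Poor'), (2, 'pass', 'B', 'Good')]
```

COUNT only counts non-null values, so items without a matching defective row contribute nothing to the count.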

Where are Cartesian Joins used in real life?

Where are Cartesian Joins used in real life?
Can someone please give examples of such a join in any SQL database?
Just a random example: you have a table of cities with columns Id, Lat, Lon, Name, and you want to show the user a table of distances from one city to another. You would write something like
SELECT c1.Name, c2.Name, SQRT( (c1.Lat - c2.Lat) * (c1.Lat - c2.Lat) + (c1.Lon - c2.Lon)*(c1.Lon - c2.Lon))
FROM City c1, City c2
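A runnable sketch of the same idea with Python's sqlite3 module and three invented cities. SQRT is registered from Python because not every SQLite build ships the math functions. The result is the plain Euclidean distance on the Lat/Lon values, as in the query above, not a real geographic distance.

```python
import math
import sqlite3

conn = sqlite3.connect(":memory:")
# Register SQRT in case this SQLite build lacks the built-in math functions.
conn.create_function("SQRT", 1, math.sqrt)
conn.executescript("""
CREATE TABLE City (Id INTEGER PRIMARY KEY, Lat REAL, Lon REAL, Name TEXT);
INSERT INTO City VALUES (1, 0.0, 0.0, 'A'), (2, 3.0, 4.0, 'B'), (3, 6.0, 8.0, 'C');
""")

rows = conn.execute("""
    SELECT c1.Name, c2.Name,
           SQRT((c1.Lat - c2.Lat) * (c1.Lat - c2.Lat)
              + (c1.Lon - c2.Lon) * (c1.Lon - c2.Lon))
    FROM City c1 CROSS JOIN City c2
    ORDER BY c1.Id, c2.Id
""").fetchall()

# Every city is paired with every city, including itself: 3 * 3 rows.
distances = {(a, b): d for a, b, d in rows}
print(len(rows))               # 9
print(distances[("A", "B")])   # 5.0
print(distances[("A", "C")])   # 10.0
```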
Here are two examples:
To create multiple copies of an invoice or other document you can populate a temporary table with names of the copies, then cartesian join that table to the actual invoice records. The result set will contain one record for each copy of the invoice, including the "name" of the copy to print in a bar at the top or bottom of the page or as a watermark. Using this technique the program can provide the user with checkboxes letting them choose what copies to print, or even allow them to print "special copies" in which the user inputs the copy name.
CREATE TEMP TABLE tDocCopies (CopyName TEXT(20))
INSERT INTO tDocCopies (CopyName) VALUES ('Customer Copy')
INSERT INTO tDocCopies (CopyName) VALUES ('Office Copy')
...
INSERT INTO tDocCopies (CopyName) VALUES ('File Copy')
SELECT * FROM InvoiceInfo, tDocCopies WHERE InvoiceDate = TODAY()
To create a calendar matrix, with one record per person per day, cartesian join the people table to another table containing all days in a week, month, or year.
SELECT People.PeopleID, People.Name, CalDates.CalDate
FROM People, CalDates
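The calendar matrix can be tried out quickly with Python's sqlite3 module; the table contents below are invented, and CROSS JOIN is written out explicitly in place of the comma.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE People   (PeopleID INTEGER PRIMARY KEY, Name TEXT);
CREATE TABLE CalDates (CalDate TEXT PRIMARY KEY);
INSERT INTO People VALUES (1, 'Ann'), (2, 'Bob');
INSERT INTO CalDates VALUES ('2024-01-01'), ('2024-01-02'), ('2024-01-03');
""")

# One row per person per day: 2 people * 3 days.
matrix = conn.execute("""
    SELECT People.PeopleID, People.Name, CalDates.CalDate
    FROM People CROSS JOIN CalDates
    ORDER BY People.PeopleID, CalDates.CalDate
""").fetchall()

print(len(matrix))  # 6
print(matrix[0])    # (1, 'Ann', '2024-01-01')
```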
I've noticed this being done to try to deliberately slow down the system either to perform a stress test or an excuse for missing development deliverables.
Usually, to generate a superset for the reports.
In PostgreSQL:
SELECT COALESCE(SUM(sales), 0)
FROM generate_series(1, 12) month
CROSS JOIN
department d
LEFT JOIN
sales s
ON s.department = d.id
AND s.month = month
GROUP BY
d.id, month
This is the only time in my life that I've found a legitimate use for a Cartesian product.
At the last company I worked at, there was a report that was requested on a quarterly basis to determine what FAQs were used at each geographic region for a national website we worked on.
Our database described geographic regions (markets) by a tuple (4, x), where 4 represented a level number in a hierarchy, and x represented a unique marketId.
Each FAQ is identified by an FaqId, and each association to an FAQ is defined by the composite key marketId tuple and FaqId. The associations are set through an admin application, but given that there are 1000 FAQs in the system and 120 markets, it was a hassle to set initial associations whenever a new FAQ was created. So, we created a default market selection, and overrode a marketId tuple of (-1,-1) to represent this.
Back to the report - the report needed to show every FAQ question/answer and the markets that displayed this FAQ in a 2D matrix (we used an Excel spreadsheet). I found that the easiest way to associate each FAQ to each market in the default market selection case was with this query, unioning the exploded result with all other direct FAQ-market associations.
The Faq2LevelDefault table holds all of the markets that are defined as being in the default selection (I believe it was just a list of marketIds).
SELECT FaqId, fld.LevelId, 1 [Exists]
FROM Faq2Levels fl
CROSS JOIN Faq2LevelDefault fld
WHERE fl.LevelId=-1 and fl.LevelNumber=-1 and fld.LevelNumber=4
UNION
SELECT Faqid, LevelId, 1 [Exists] from Faq2Levels WHERE LevelNumber=4
You might want to create a report using all of the possible combinations from two lookup tables, in order to create a report with a value for every possible result.
Consider bug tracking: you've got one table for severity and another for priority and you want to show the counts for each combination. You might end up with something like this:
select severity_name, priority_name, count(e.severity_id)
from (select severity_id, severity_name,
priority_id, priority_name
from severity, priority) sp
left outer join
errors e
on e.severity_id = sp.severity_id
and e.priority_id = sp.priority_id
group by severity_name, priority_name
In this case, the cartesian join between severity and priority provides a master list that you can create the later outer join against.
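A small sketch of this report with Python's sqlite3 module and invented lookup values. The error_id column is a hypothetical key on the errors table. Note that the count is taken over a column from errors rather than with COUNT(*), so that combinations with no errors report 0 instead of 1 (the outer join still produces one null-extended row for them).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE severity (severity_id INTEGER PRIMARY KEY, severity_name TEXT);
CREATE TABLE priority (priority_id INTEGER PRIMARY KEY, priority_name TEXT);
CREATE TABLE errors   (error_id INTEGER PRIMARY KEY,
                       severity_id INTEGER, priority_id INTEGER);
INSERT INTO severity VALUES (1, 'minor'), (2, 'major');
INSERT INTO priority VALUES (1, 'low'), (2, 'high');
INSERT INTO errors (severity_id, priority_id) VALUES (1, 1), (1, 1), (2, 2);
""")

rows = conn.execute("""
    SELECT sp.severity_name, sp.priority_name, COUNT(e.error_id) AS n
    FROM (SELECT severity_id, severity_name, priority_id, priority_name
          FROM severity CROSS JOIN priority) sp
    LEFT JOIN errors e
           ON e.severity_id = sp.severity_id
          AND e.priority_id = sp.priority_id
    GROUP BY sp.severity_name, sp.priority_name
    ORDER BY sp.severity_name, sp.priority_name
""").fetchall()

for row in rows:
    print(row)
# ('major', 'high', 1)
# ('major', 'low', 0)
# ('minor', 'high', 0)
# ('minor', 'low', 2)
```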
When running a query for each date in a given range. For example, for a website, you might want to know for each day, how many users were active in the last N days. You could run a query for each day in a loop, but it's simplest to keep all the logic in the same query, and in some cases the DB can optimize the Cartesian join away.
To create a list of related words in text mining, using similarity functions, e.g. Edit Distance