Simple.Data - How to apply WHERE clauses to joined tables - sql

I'm trying to use Simple.Data as my ADO, but I've run into a problem trying to put together a query that joins a couple of tables, then filters the results based on values in the non-primary tables.
The scenario is a job application app (but jobs are like a specific task to be done on a given day). There are 3 relevant tables: jobs, applications and application_history. There can be many applications for each record in the jobs table, and many application_history records for each application. In the application_history table, there's a status column that changes as each application gets sent, offered and finally accepted.
So I want a query that returns all the accepted applications that are for jobs in the future; i.e. where the date column in the jobs table is in the future and where there's an associated record in the application_history table where the status column is 3 (meaning accepted).
If this was plain old SQL, I'd use this query:
SELECT A.* FROM application AS A
INNER JOIN application_history AS AH ON AH.application_id = A.id
INNER JOIN job AS J ON J.id = A.job_id
WHERE AH.status_id = 3 AND J.date > date('now')
But I want to know how to achieve the same thing using Simple.Data. For bonus points, if you could start by ignoring the 'job must be in the future' step, that will help me understand what's going on.

As a reference: Simple.Data documentation especially the part about explicit joins.
You should be able to do something like this:
// db is your Simple.Data Database object
db.application
.Join(db.application_history)
.On(db.application.id == db.application_history.application_id)
.Join(db.job)
.On(db.application.job_id == db.job.id)
// DateTime.Now is a property, not a method
.Where(db.application_history.status_id == 3 && db.job.date > DateTime.Now);
I'm not sure whether or not Simple.Data knows how to handle the Date part.
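To see what the two filters do independently of Simple.Data, here is a minimal sketch that runs the question's SQL against an in-memory SQLite database from Python. The rows are made up for illustration, and status_id = 3 is assumed to mean "accepted", as in the query above. It first applies only the status filter (the "ignore the future" step), then adds the job date filter.

```python
import sqlite3

# Hypothetical in-memory data; table and column names follow the SQL in the question.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE job (id INTEGER PRIMARY KEY, date TEXT);
CREATE TABLE application (id INTEGER PRIMARY KEY, job_id INTEGER);
CREATE TABLE application_history (
    id INTEGER PRIMARY KEY, application_id INTEGER, status_id INTEGER);

INSERT INTO job VALUES (1, date('now', '+7 days')), (2, date('now', '-7 days'));
INSERT INTO application VALUES (10, 1), (11, 2), (12, 1);
INSERT INTO application_history (application_id, status_id) VALUES
    (10, 1), (10, 3),  -- sent, then accepted; job 1 is in the future
    (11, 3),           -- accepted; job 2 is in the past
    (12, 1);           -- sent only, never accepted
""")

# Step 1: filter on the joined history table only.
accepted = conn.execute("""
    SELECT A.id FROM application AS A
    INNER JOIN application_history AS AH ON AH.application_id = A.id
    WHERE AH.status_id = 3
    ORDER BY A.id
""").fetchall()

# Step 2: add the second join and the date filter on the job table.
accepted_future = conn.execute("""
    SELECT A.id FROM application AS A
    INNER JOIN application_history AS AH ON AH.application_id = A.id
    INNER JOIN job AS J ON J.id = A.job_id
    WHERE AH.status_id = 3 AND J.date > date('now')
    ORDER BY A.id
""").fetchall()

print(accepted)         # applications 10 and 11
print(accepted_future)  # only application 10
```

Each WHERE condition simply refers to a column of whichever joined table it needs; the Simple.Data expression above is meant to build the same kind of statement.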

Related

Best way to write a query with too many joins

I have a database table (let's call it project) and many other tables, most of these other table have a foreign key (id_project) referencing the table project.
The goal of this query is to return the phase the project is in right now (a project develops little by little until it reaches its end). There are over 20 tables that a project may pass through. My solution was to left join all of them and check which tables have non-null values, like this:
SELECT
p.id_project,
CASE
WHEN po.id IS NOT NULL THEN 'payment completed'
WHEN b.id IS NOT NULL THEN 'bill received'
WHEN e.id IS NOT NULL THEN 'project engaged'
--(and still many other cases)
ELSE 'start of the project'
END AS progress
FROM
project p
LEFT JOIN
decision d ON d.id_project = p.id_project
LEFT JOIN
engagement e ON e.id_project = p.id_project
LEFT JOIN
bill b ON b.id_project = p.id_project
LEFT JOIN
payment_order po ON po.id_project = p.id_project
LEFT JOIN
--..... (many other tables)
This query takes about 9 seconds at best to execute and it is used quite frequently (as a view called from other queries).
Is there a better way to write this query, or another approach altogether?
How about another approach? A project can be in only one phase at a time, right? So, you could alter the PROJECT table and add a new column - PROJECT_PHASE - which would contain the current phase. That column is to be updated as soon as the project moves to another phase; the way I understood it, that is when a new row is created in any of those 20 tables.
Another option is to create a new table, project_phase, which would contain the id_project and id_phase combination (along with e.g. a timestamp, whatever).
Any approach would mean that you'd quickly fetch current project phase, without outer joining 20 (large?) tables which takes time.
We don't know your data, but your database design shows a 1:n relation for all tables, i.e. multiple decisions for one project, multiple engagements for one project, multiple bills, etc. Now let's assume there are three decisions, three engagements and four bills so far for a project. You join all rows on the project ID alone. This is called a cartesian product per project, creating all combinations (each row with each other row) producing 3 x 3 x 4 = 36 rows for this one project alone.
I am surprised you haven't noticed this yourself, as you say you are already using the query and there is no aggregation taking place. Or is this what you refer to with "too many joins"?
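The row multiplication is easy to reproduce. The following is a small sketch using Python's sqlite3 with one project and the three, three and four child rows assumed above; the table layout is a guess based on the query in the question.

```python
import sqlite3

# Hypothetical schema modelled on the question: every child table has id_project.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE project (id_project INTEGER PRIMARY KEY);
CREATE TABLE decision (id INTEGER PRIMARY KEY, id_project INTEGER);
CREATE TABLE engagement (id INTEGER PRIMARY KEY, id_project INTEGER);
CREATE TABLE bill (id INTEGER PRIMARY KEY, id_project INTEGER);

INSERT INTO project VALUES (1);
INSERT INTO decision (id_project) VALUES (1), (1), (1);
INSERT INTO engagement (id_project) VALUES (1), (1), (1);
INSERT INTO bill (id_project) VALUES (1), (1), (1), (1);
""")

# Joining every child table on the project ID alone combines each row with each other row.
joined_rows = conn.execute("""
    SELECT COUNT(*)
    FROM project p
    LEFT JOIN decision d ON d.id_project = p.id_project
    LEFT JOIN engagement e ON e.id_project = p.id_project
    LEFT JOIN bill b ON b.id_project = p.id_project
""").fetchone()[0]

# A lookup only asks whether a matching row exists, so the project stays one row.
lookup_rows = conn.execute("""
    SELECT COUNT(*)
    FROM project p
    WHERE EXISTS (SELECT 1 FROM bill b WHERE b.id_project = p.id_project)
""").fetchone()[0]

print(joined_rows, lookup_rows)  # 36 1
```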
Instead of cross joining all those rows just look the tables up with EXISTS or IN.
SELECT
p.id_project,
CASE
WHEN p.id_project IN (SELECT po.id_project FROM payment_order po) THEN 'payment completed'
WHEN p.id_project IN (SELECT b.id_project FROM bill b) THEN 'bill received'
WHEN p.id_project IN (SELECT e.id_project FROM engagement e) THEN 'project engaged'
-- (and still many other cases)
ELSE 'start of the project'
END AS progress
FROM project p;
A faster alternative would be to store the status in the project table as suggested by Littlefoot (and ideally a table for the statuses) and then have triggers on all those tables that update that status.
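As a rough sketch of the trigger idea, assuming a hypothetical phase column on project and only two of the child tables, something like this works in SQLite (run here through Python's sqlite3). Trigger syntax differs between database systems, so treat it as an illustration rather than a drop-in solution.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- 'phase' is a hypothetical column holding the current phase of the project.
CREATE TABLE project (
    id_project INTEGER PRIMARY KEY,
    phase TEXT NOT NULL DEFAULT 'start of the project');
CREATE TABLE engagement (id INTEGER PRIMARY KEY, id_project INTEGER);
CREATE TABLE bill (id INTEGER PRIMARY KEY, id_project INTEGER);

-- One trigger per child table keeps the stored phase up to date.
CREATE TRIGGER engagement_sets_phase AFTER INSERT ON engagement
BEGIN
    UPDATE project SET phase = 'project engaged' WHERE id_project = NEW.id_project;
END;

CREATE TRIGGER bill_sets_phase AFTER INSERT ON bill
BEGIN
    UPDATE project SET phase = 'bill received' WHERE id_project = NEW.id_project;
END;
""")


def current_phase(project_id):
    """Read the stored phase: a single-row lookup, no joins."""
    row = conn.execute(
        "SELECT phase FROM project WHERE id_project = ?", (project_id,)
    ).fetchone()
    return row[0]


phases = []
conn.execute("INSERT INTO project (id_project) VALUES (1)")
phases.append(current_phase(1))
conn.execute("INSERT INTO engagement (id_project) VALUES (1)")
phases.append(current_phase(1))
conn.execute("INSERT INTO bill (id_project) VALUES (1)")
phases.append(current_phase(1))
print(phases)
```

Note that these triggers blindly overwrite the phase, so a late insert into an "earlier" table would move the project backwards. If that can happen in your data, the triggers would need to compare phases before updating.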

Need help wrapping head around joins

I have a database of a service that helps people sell things. If they fail a delivery of a sale, they get penalised. I am trying to extract the number of active listings each user had when a particular penalty was applied.
I have the equivalent of the following tables (and relevant fields):
user (id)
listing (id, user_id, status)
transaction (listing_id, seller_id)
listing_history (id, listing_status, date_created)
penalty (id, transaction_id, user_id, date_created)
The listing_history table saves an entry every time a listing is modified, saving a record of what the new state of the listing is.
My goal is to end with a result table with the field: penalty_id, and number of active listings the penalised user had when the penalty was applied.
So far I have the following:
SELECT s1.penalty_id,
COUNT(s1.record_id) 'active_listings'
FROM (
SELECT penalty.id AS 'penalty_id',
listing_history.id AS 'record_id'
FROM user
JOIN penalty ON penalty.user_id = user.id
JOIN transaction ON transaction.id = penalty.transaction_id
JOIN listing_history ON listing_history.listing_id = listing.id
WHERE listing_history.date_created < penalty.date_created
AND listing_history.status = 0
) s1
GROUP BY s1.penalty_id
Status = 0 means that the listing is active (or that the listing was active at the time the record was created). I got results similar to what I expected, but I fear I may be missing something or may be doing the JOINs wrong. Would this have your approval? (apart from the obvious non-use of aliases, for clarity problems).
UPDATE - As the comments on this answer indicate that changing the table structure isn't an option, here are more details on some queries you could use with the existing structure.
Note that I made a couple changes to the query before even modifying the logic.
As viki888 pointed out, there was a problem reference to listing.id; I've replaced it.
There was no real need for a subquery in the original query; I've simplified it out.
So the original query is rewritten as
SELECT penalty.id AS 'penalty_id'
, COUNT(listing_history.id) 'active_listings'
FROM user
JOIN penalty
ON penalty.user_id = user.id
JOIN transaction
ON transaction.id = penalty.transaction_id
JOIN listing_history
ON listing_history.listing_id = transaction.listing_id
WHERE listing_history.date_created < penalty.date_created
AND listing_history.status = 0
GROUP BY penalty.id
Now the most natural way, in my opinion, to write the corrected timeline constraint is with a NOT EXISTS condition that filters out all but the most recent listing_history record for a given id. This does require thinking about some edge cases:
Could two listing history records have the same create date? If so, how do you decide which happened first?
If a listing history record is created on the same day as the penalty, which is treated as happening first?
If date_created is really a timestamp, then this may not matter much (if at all); if it's really a date, it might be a bigger issue. Since your original query required that the listing history be created before the penalty, I'll continue in that style; but it's still ambiguous how to handle the case where two history records with matching status have the same date. You may need to adjust the date comparisons to get the desired behavior.
SELECT penalty.id AS 'penalty_id'
, COUNT(DISTINCT listing_history.id) 'active_listings'
FROM user
JOIN penalty
ON penalty.user_id = user.id
JOIN transaction
ON transaction.id = penalty.transaction_id
JOIN listing_history
ON listing_history.listing_id = transaction.listing_id
WHERE listing_history.date_created < penalty.date_created
AND listing_history.status = 0
AND NOT EXISTS (SELECT 1
FROM listing_history h2
WHERE listing_history.date_created < h2.date_created
AND h2.date_created < penalty.date_created
AND h2.id = listing_history.id)
GROUP BY penalty.id
Note that I switched from COUNT(...) to COUNT(DISTINCT ...); this helps with some edge cases where two active records for the same listing might be counted.
If you change the date comparisons to use <= instead of < - or, equivalently, if you use BETWEEN to combine the date comparisons - then you'd want to add AND h2.status != 0 (or AND h2.status <> 0, depending on your database) to the subquery so that two concurrent ACTIVE records don't cancel each other out.
There are several equivalent ways to write this, and unfortunately it's the kind of query that doesn't always cooperate with a database query optimizer, so some trial and error may be necessary to make it run well with large data volumes. Hopefully that gives enough insight into the intended logic that you could work out some equivalents if need be. You could consider using NOT IN instead of NOT EXISTS; or you could use an outer join to a second instance of LISTING_HISTORY... There are probably others I'm not thinking of offhand.
I don't know that we're in a position to sign off on a general statement that the query is, or is not, "correct". If there's a specific question about whether a query will include/exclude a record in a specific situation (or why it does/doesn't, or how to modify it so it won't/will), those might get more complete answers.
I can say that there are a couple likely issues:
The only glaring logic issue has to do with timeline management, which is something that causes a lot of trouble with SQL. The issue is, while your query demonstrates that the listing was active at some point before the penalty creation date, it doesn't demonstrate that the listing was still active on the penalty creation date. Consider
PENALTY
id transaction date
1 10 2016-02-01
TRANSACTION
id listing_id
10 100
LISTING_HISTORY
listing_id status date
100 0 2016-01-01
100 1 2016-01-15
The joins would create a single record, and the count for penalty 1 would include listing 100 even though its status had changed to something other than 0 before the penalty was created.
This is hard - but not impossible - to fix with your existing table structure. You could add a NOT EXISTS condition looking for another LISTING_HISTORY record matching the ID with a date between the first LISTING_HISTORY date and the PENALTY date, for one.
It would be more efficient to add an end date to the LISTING_HISTORY table, but that may not be so easy depending on how the data is maintained.
The second potential issue is the COUNT(RECORD_ID). This may not do what you mean - what COUNT(x) may intuitively seem like it should do, is what COUNT(DISTINCT RECORD_ID) actually does. As written, if the join produces two matches with the same LISTING_HISTORY.ID value - i.e. the listing became active at two different times before the penalty - the listing would be counted twice.
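To make the timeline issue concrete, here is a sketch that loads the sample data above (penalty 1, transaction 10, listing 100) into SQLite via Python. Two things are assumptions on my part: listing_history is given an explicit listing_id column that the NOT EXISTS condition matches on, and a second penalty for a listing that is still active is added for contrast.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE penalty (id INTEGER PRIMARY KEY, transaction_id INTEGER, date_created TEXT);
-- TRANSACTION is an SQL keyword, so the table name is quoted.
CREATE TABLE "transaction" (id INTEGER PRIMARY KEY, listing_id INTEGER);
CREATE TABLE listing_history (
    id INTEGER PRIMARY KEY, listing_id INTEGER, status INTEGER, date_created TEXT);

INSERT INTO penalty VALUES (1, 10, '2016-02-01'), (2, 20, '2016-02-01');
INSERT INTO "transaction" VALUES (10, 100), (20, 200);
INSERT INTO listing_history (listing_id, status, date_created) VALUES
    (100, 0, '2016-01-01'),  -- listing 100 becomes active
    (100, 1, '2016-01-15'),  -- ...and stops being active before the penalty
    (200, 0, '2016-01-10');  -- hypothetical listing 200 is still active
""")

# Original logic: "was active at some point before the penalty".
naive = conn.execute("""
    SELECT p.id, COUNT(lh.id)
    FROM penalty p
    JOIN "transaction" t ON t.id = p.transaction_id
    JOIN listing_history lh ON lh.listing_id = t.listing_id
    WHERE lh.date_created < p.date_created AND lh.status = 0
    GROUP BY p.id ORDER BY p.id
""").fetchall()

# Corrected logic: no later history record for the same listing before the penalty.
corrected = conn.execute("""
    SELECT p.id, COUNT(DISTINCT lh.listing_id)
    FROM penalty p
    JOIN "transaction" t ON t.id = p.transaction_id
    JOIN listing_history lh ON lh.listing_id = t.listing_id
    WHERE lh.date_created < p.date_created AND lh.status = 0
      AND NOT EXISTS (SELECT 1 FROM listing_history h2
                      WHERE h2.listing_id = lh.listing_id
                        AND h2.date_created > lh.date_created
                        AND h2.date_created < p.date_created)
    GROUP BY p.id ORDER BY p.id
""").fetchall()

print(naive)      # penalty 1 wrongly counts listing 100
print(corrected)  # only penalty 2 has an active listing
```

With inner joins, a penalty with no active listings drops out of the result entirely instead of showing a count of 0; starting from penalty with LEFT JOINs would be one way to keep it.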

What's the most efficient way to exclude possible results from an SQL query?

I have a subscription database containing Customers, Subscriptions and Publications tables.
The Subscriptions table contains ALL subscription records and each record has three flags to mark the status: isActive, isExpired and isPending. These are Booleans and only one flag can be True - this is handled by the application.
I need to identify all customers who have not renewed any magazines to which they have previously subscribed and I'm not sure that I've written the most efficient SQL query. If I find a lapsed subscription I need to ignore it if they already have an active or pending subscription for that particular magazine.
Here's what I have:
SELECT DISTINCT Customers.id, Subscriptions.publicationName
FROM Subscriptions
LEFT JOIN Customers
ON Subscriptions.id_Customer = Customers.id
LEFT JOIN Publications
ON Subscriptions.id_Publication = Publications.id
WHERE Subscriptions.isExpired = 1
AND NOT EXISTS
( SELECT * FROM Subscriptions s2
WHERE s2.id_Publication = Subscriptions.id_Publication
AND s2.id_Customer = Subscriptions.id_Customer
AND s2.isPending = 1 )
AND NOT EXISTS
( SELECT * FROM Subscriptions s3
WHERE s3.id_Publication = Subscriptions.id_Publication
AND s3.id_Customer = Subscriptions.id_Customer
AND s3.isActive = 1 )
I have just over 50,000 subscription records and this query takes almost an hour to run which tells me that there's a lot of looping or something going on where for each record the SQL engine is having to search again to find any 'isPending' and 'isActive' records.
This is my first post so please be gentle if I've missed out any information in my question :) Thanks.
I don't have your complete database structure, so I can't test the following query, but it may contain some optimizations. I will leave it to you to test, but I will explain what I changed and why.
-- Plain joins: a subscription without a matching customer or publication is of no use here
select Distinct Customers.id, Subscriptions.publicationName
from Subscriptions
join Customers on Subscriptions.id_Customer = Customers.id
join Publications
ON Subscriptions.id_Publication = Publications.id
Where Subscriptions.isExpired = 1
And Not Exists
-- One subquery, correlated to the outer row on customer and publication
(select * from Subscriptions s2
where s2.id_Customer = Subscriptions.id_Customer
and s2.id_Publication = Subscriptions.id_Publication
and (s2.isPending = 1 or s2.isActive = 1))
If a subscription has no matching row in Customers or Publications, then the subscription information isn't useful, so I replaced the LEFT joins with plain joins. I also combined the two Exists subqueries into one; these are pretty intensive if I recall, so the fewer the better. One last thing, which I did not do above but which may be worth looking into: can you have the subquery return specific data fields instead of Select * and still use it in an Exists clause? I believe most engines ignore the select list inside Exists, so this may make little difference, but I can't confirm it because I don't have an equivalent DB available to test on.
I suspect there are further optimizations that could be made on this query. Eliminating the Exists clause in favor of an 'IN' clause may help, but I can't think of a way right now, seeing how you've got to match two unique fields (customer id and the relevant subscription). Let me know if this helps at all.
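For what it's worth, here is a small self-contained check of the combined NOT EXISTS logic, run against SQLite from Python. It uses only a cut-down, hypothetical Subscriptions table with made-up rows, and the index is my own assumption about what might help the correlated lookup, not something taken from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Subscriptions (
    id INTEGER PRIMARY KEY,
    id_Customer INTEGER, id_Publication INTEGER,
    isActive INTEGER, isExpired INTEGER, isPending INTEGER);

-- Assumption: a composite index lets the subquery find matching rows without a full scan.
CREATE INDEX ix_sub_cust_pub ON Subscriptions (id_Customer, id_Publication);

INSERT INTO Subscriptions
    (id_Customer, id_Publication, isActive, isExpired, isPending) VALUES
    (1, 1, 0, 1, 0),  -- customer 1 let publication 1 lapse...
    (1, 1, 1, 0, 0),  -- ...but has since renewed it
    (1, 2, 0, 1, 0),  -- customer 1 let publication 2 lapse for good
    (2, 1, 0, 1, 0),  -- customer 2 let publication 1 lapse for good
    (3, 1, 0, 1, 0),  -- customer 3 let publication 1 lapse...
    (3, 1, 0, 0, 1);  -- ...but has a pending renewal
""")

lapsed = conn.execute("""
    SELECT DISTINCT s.id_Customer, s.id_Publication
    FROM Subscriptions s
    WHERE s.isExpired = 1
      AND NOT EXISTS (SELECT 1 FROM Subscriptions s2
                      WHERE s2.id_Customer = s.id_Customer
                        AND s2.id_Publication = s.id_Publication
                        AND (s2.isPending = 1 OR s2.isActive = 1))
    ORDER BY s.id_Customer, s.id_Publication
""").fetchall()

print(lapsed)  # (customer, publication) pairs that were never renewed
```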
With a table of 50k rows, you should be able to run a query like this in seconds.

SQL Join Statement based on date range

New SQL dev here. I'm writing a call log application for our Asterisk server. In one table (CDRLogs), I have the call logs from the phone system (src, dst, calldate, duration). In another table (Employees), I have (empName, empExt, extStartDate, extEndDate). I want to join the two together on src and empExt based on who was using a particular ext on the date of the call. There is one user per extension in a given time frame.
For example, we have had 3 different users sitting at x100 during the month of July. In the Employees table, I have recorded the dates each of these people started and ended their use of that ext. How do I get the join to reflect that?
Thanks in advance
Perhaps something like:
SELECT A.*, B.*
FROM CDRLOGS A
INNER JOIN Employees B
ON A.SRC = B.EmpExt
AND A.CallDate between B.extStartDate and coalesce(B.extEndDate,getdate())
Please replace the * with the relevant fields you need.
There may be a better way, as a join on a BETWEEN seems like it could cause some overhead, but I can't think of one at present.
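Here is a quick way to try the idea out, sketched with Python's sqlite3 and invented rows for three people sharing extension 100 in July. SQLite has no getdate(), so date('now') stands in for it; the names and dates are purely illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE CDRLogs (src TEXT, dst TEXT, calldate TEXT, duration INTEGER);
CREATE TABLE Employees (empName TEXT, empExt TEXT, extStartDate TEXT, extEndDate TEXT);

-- Three hypothetical users of extension 100; NULL end date means "still using it".
INSERT INTO Employees VALUES
    ('Alice', '100', '2023-07-01', '2023-07-10'),
    ('Bob',   '100', '2023-07-11', '2023-07-20'),
    ('Carol', '100', '2023-07-21', NULL);

INSERT INTO CDRLogs VALUES
    ('100', '555', '2023-07-05', 60),
    ('100', '555', '2023-07-15', 30),
    ('100', '555', '2023-07-25', 45);
""")

# Match on the extension AND on the call date falling inside the assignment period.
calls = conn.execute("""
    SELECT A.calldate, B.empName
    FROM CDRLogs A
    INNER JOIN Employees B
        ON A.src = B.empExt
       AND A.calldate BETWEEN B.extStartDate
                          AND COALESCE(B.extEndDate, date('now'))
    ORDER BY A.calldate
""").fetchall()

print(calls)
```

One thing to watch: if calldate carries a time of day while extEndDate is a plain date, calls made on the last day may fall outside the BETWEEN range, so you may need to compare against the day after the end date instead.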

What is the best way to reduce sql queries in my situation

Here is the situation: each page shows 30 topics, so I have to execute at least 1 SQL statement. Besides, I also want to show how many replies each topic has and who the author is, so
I have to use 30 statements to count the replies and another 30 statements to find the authors. In total that makes 61 statements, and I'm really worried about the efficiency.
My tables looks like this:
Topic        Reply        User
-------      ----------   ------------
id           id           id
title        topic_id     username
...          ...
author_id
You should look into joining tables during a query.
Joins in SQLServer http://msdn.microsoft.com/en-us/library/ms191517.aspx
Joins in MySQL http://dev.mysql.com/doc/refman/5.0/en/join.html
As an example, I could do the following:
SELECT reply.id, reply.authorid, reply.text, reply.topicid,
topic.title,
user.username
FROM reply
LEFT JOIN topic ON (topic.id = reply.topicid)
LEFT JOIN user ON (user.id = reply.authorid)
WHERE (reply.isactive = 1)
ORDER BY reply.postdate DESC
LIMIT 10
If I read your requirements correctly, you want the result of the following query:
SELECT Topic.title, User.username, COUNT(Reply.topic_id) Replies
FROM Topic, User, Reply
WHERE Topic.id = Reply.topic_id
AND Topic.author_id = User.id
GROUP BY Topic.title, User.username
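As a hedged variation on the query above: because it inner-joins Reply, topics without any replies drop out of the result. The sketch below (Python's sqlite3, made-up rows, schema taken from the question) compares it with a LEFT JOIN version that keeps such topics with a count of 0, so one statement can replace all 61.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE User (id INTEGER PRIMARY KEY, username TEXT);
CREATE TABLE Topic (id INTEGER PRIMARY KEY, title TEXT, author_id INTEGER);
CREATE TABLE Reply (id INTEGER PRIMARY KEY, topic_id INTEGER);

INSERT INTO User VALUES (1, 'ann'), (2, 'bob');
INSERT INTO Topic VALUES (1, 'First', 1), (2, 'Second', 2);
INSERT INTO Reply (topic_id) VALUES (1), (1);  -- topic 2 has no replies yet
""")

# Inner joins only: the topic without replies is missing.
inner_result = conn.execute("""
    SELECT Topic.title, User.username, COUNT(Reply.topic_id) AS Replies
    FROM Topic, User, Reply
    WHERE Topic.id = Reply.topic_id
      AND Topic.author_id = User.id
    GROUP BY Topic.title, User.username
""").fetchall()

# LEFT JOIN on Reply: every topic appears, with 0 where nothing matched.
left_result = conn.execute("""
    SELECT t.title, u.username, COUNT(r.id) AS replies
    FROM Topic t
    JOIN User u ON u.id = t.author_id
    LEFT JOIN Reply r ON r.topic_id = t.id
    GROUP BY t.id, t.title, u.username
    ORDER BY t.id
    LIMIT 30
""").fetchall()

print(inner_result)
print(left_result)
```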
When I was first starting out with database driven web applications I had similar problems. I then spent several years working in a database rich environment where I actually learned SQL. If you intend to continue developing web applications (which I find are very fun to create) it would be worth your time to pick up a book or checking out some how-to's on basic and advance SQL.
One thing to add, on top of JOINS
It may be that your groups of data do not match or relate, so JOINs won't work. Put another way: you may have 2 main chunks of data that are awkward to join.
Stored procedures can return multiple result sets.
For example, for a summary page you could return one aggregate result set and another "last 20" result set in one SQL call. To JOIN the 2 is awkward because it doesn't "fit" together.
You certainly can use some "left joins" on this one. However, since the output only changes if someone updates or adds to your tables, you could try to cache it in an XML/text file. Another way could be to build in some redundancy by adding extra columns to the topic table that keep the reply count, username etc., and update them only when changes occur.