Joining to a limited subquery? - sql

I have this releases table in a SQLite3 database, listing each released version of an application:
|release_id|release_date|app_id|
|==========|============|======|
| 1001| 2009-01-01 | 1|
| 1003| 2009-01-01 | 1|
| 1004| 2009-02-02 | 2|
| 1005| 2009-01-15 | 1|
So for each app_id, there will be multiple rows. I have another table, apps:
|app_id|name |
|======|========|
| 1|Everest |
| 2|Fuji |
I want to display the name of the application and the newest release, where "newest" means (a) newest release_date, and if there are duplicates, (b) highest release_id.
I can do this for an individual application:
SELECT apps.name,releases.release_id,releases.release_date
FROM apps
INNER JOIN releases
ON apps.app_id = releases.app_id
WHERE releases.release_id = 1003
ORDER BY releases.release_date,releases.release_id
LIMIT 1
but of course that ORDER BY applies to the whole SELECT query, and if I leave out the WHERE clause, it still returns only one row.
It's a one-shot query on a small database, so slow queries, temp tables, etc. are fine - I just can't get my brain around the SQL way to do this.

This is easy to do with the analytic function ROW_NUMBER(), which SQLite only supports from version 3.25 onward. But you can also do it without window functions, in a way that's a bit more flexible than what's given in the previous answers:
SELECT
apps.name,
releases.release_id,
releases.release_date
FROM apps INNER JOIN releases
ON apps.app_id = releases.app_id
WHERE NOT EXISTS (
-- // where there doesn't exist a more recent release for the same app
SELECT * FROM releases AS R
WHERE R.app_id = apps.app_id
AND R.release_date > releases.release_date
)
For example, if you had multiple ordering columns that define "latest," MAX wouldn't work for you, but you could modify the EXISTS subquery to capture the more complicated meaning of "latest."
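As a minimal sketch of that "more complicated meaning of latest", the NOT EXISTS subquery can carry the question's tie-break (newest release_date, then highest release_id). The example runs the query through Python's built-in sqlite3 module against the sample tables from the question; release 1006 is a hypothetical extra row, added only so that the tie-break is actually exercised.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE apps (app_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE releases (release_id INTEGER PRIMARY KEY, release_date TEXT, app_id INTEGER);
INSERT INTO apps VALUES (1, 'Everest'), (2, 'Fuji');
INSERT INTO releases VALUES
    (1001, '2009-01-01', 1),
    (1003, '2009-01-01', 1),
    (1004, '2009-02-02', 2),
    (1005, '2009-01-15', 1),
    (1006, '2009-02-02', 2);  -- hypothetical row: same date as 1004, higher id
""")

# Keep a release only if no other release of the same app is "newer",
# where "newer" means a later date, or the same date with a higher id.
newest = conn.execute("""
SELECT apps.name, releases.release_id, releases.release_date
FROM apps INNER JOIN releases ON apps.app_id = releases.app_id
WHERE NOT EXISTS (
    SELECT 1 FROM releases AS R
    WHERE R.app_id = apps.app_id
      AND (R.release_date > releases.release_date
           OR (R.release_date = releases.release_date
               AND R.release_id > releases.release_id))
)
ORDER BY apps.name
""").fetchall()

print(newest)
# [('Everest', 1005, '2009-01-15'), ('Fuji', 1006, '2009-02-02')]
```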

This is the "greatest N per group" problem. It comes up several times per week on StackOverflow.
I usually use a solution like the one in @Steve Kass's answer, but I do it without subqueries (I got into the habit years ago with MySQL 4.0, which didn't support subqueries):
SELECT a.name, r1.release_id, r1.release_date
FROM apps a
INNER JOIN releases r1 ON a.app_id = r1.app_id
LEFT OUTER JOIN releases r2 ON (r1.app_id = r2.app_id
AND (r1.release_date < r2.release_date
OR r1.release_date = r2.release_date AND r1.release_id < r2.release_id))
WHERE r2.release_id IS NULL;
Internally, this probably optimizes identically to the NOT EXISTS syntax. You can analyze the query with EXPLAIN to make sure.
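A small runnable sketch of the self-join form, again using Python's sqlite3 module and the question's sample data, with the join between apps and releases spelled out explicitly. Release 1006 is a hypothetical extra row that makes the release_id tie-break matter.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE apps (app_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE releases (release_id INTEGER PRIMARY KEY, release_date TEXT, app_id INTEGER);
INSERT INTO apps VALUES (1, 'Everest'), (2, 'Fuji');
INSERT INTO releases VALUES
    (1001, '2009-01-01', 1),
    (1003, '2009-01-01', 1),
    (1004, '2009-02-02', 2),
    (1005, '2009-01-15', 1),
    (1006, '2009-02-02', 2);  -- hypothetical row: same date as 1004, higher id
""")

# r2 matches every release of the same app that is newer than r1.
# If the LEFT JOIN finds no such row, r2.* is NULL and r1 is the newest.
latest = conn.execute("""
SELECT a.name, r1.release_id, r1.release_date
FROM apps a
INNER JOIN releases r1 ON a.app_id = r1.app_id
LEFT OUTER JOIN releases r2
  ON r1.app_id = r2.app_id
 AND (r1.release_date < r2.release_date
      OR (r1.release_date = r2.release_date
          AND r1.release_id < r2.release_id))
WHERE r2.release_id IS NULL
ORDER BY a.name
""").fetchall()

print(latest)
# [('Everest', 1005, '2009-01-15'), ('Fuji', 1006, '2009-02-02')]
```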
Re your comment, you could just skip the test for release_date because release_id is just as useful for establishing the chronological order of releases, and I assume it's guaranteed to be unique, so this simplifies the query:
SELECT a.name, r1.release_id, r1.release_date
FROM apps a
INNER JOIN releases r1 ON a.app_id = r1.app_id
LEFT OUTER JOIN releases r2 ON (r1.app_id = r2.app_id
AND r1.release_id < r2.release_id)
WHERE r2.release_id IS NULL;

It's ugly, but I think it'll work:
select apps.name,
(select releases.release_id from releases where releases.app_id=apps.app_id order by releases.release_date desc, releases.release_id desc limit 1),
(select releases.release_date from releases where releases.app_id=apps.app_id order by releases.release_date desc, releases.release_id desc limit 1)
from apps order by apps.app_id
I hope there's some way to get both of those columns in one embedded select, but I don't know it.

Try:
SELECT a.name,
t.max_release_id,
t.max_date
FROM APPS a
JOIN (SELECT r.app_id,
MAX(r.release_id) AS max_release_id,
r.release_date AS max_date
FROM RELEASES r
JOIN (SELECT app_id,
MAX(release_date) AS max_date
FROM RELEASES
GROUP BY app_id) m ON m.app_id = r.app_id
AND m.max_date = r.release_date
GROUP BY r.app_id, r.release_date) t ON t.app_id = a.app_id

Err, second attempt. Assuming that IDs are monotonically increasing and overflow is not a likely occurrence, you can ignore the date and just do:
SELECT apps.name, releases.release_id, releases.release_date
FROM apps INNER JOIN releases on apps.app_id = releases.app_id
WHERE releases.release_id IN
(SELECT Max(release_id) FROM releases
GROUP BY app_id);

Related

Django permissions using too much the database

We started investigating our database, as it is the least scalable component in our infrastructure.
I checked the table pg_stat_statements of our Postgresql database with the following query:
SELECT userid, calls, total_time, rows, 100.0 * shared_blks_hit /
nullif(shared_blks_hit + shared_blks_read, 0) AS hit_percent, query
FROM pg_stat_statements ORDER BY total_time DESC LIMIT 5;
Every time, the same query is first in the list:
16386 | 21564 | 4077324.749363 | 1423094 | 99.9960264252721535 |
SELECT DISTINCT "auth_user"."id", "auth_user"."password", "auth_user"."last_login",
"auth_user"."is_superuser", "auth_user"."username", "auth_user"."first_name",
"auth_user"."last_name", "auth_user"."email", "auth_user"."is_staff",
"auth_user"."is_active", "auth_user"."date_joined" FROM "auth_user"
LEFT OUTER JOIN "auth_user_groups" ON ("auth_user"."id" = "auth_user_groups"."user_id")
LEFT OUTER JOIN "auth_group" ON ("auth_user_groups"."group_id" = "auth_group"."id")
LEFT OUTER JOIN "auth_group_permissions" ON ("auth_group"."id" = "auth_group_permissions"."group_id")
LEFT OUTER JOIN "auth_user_user_permissions" ON ("auth_user"."id" = "auth_user_user_permissions"."user_id")
WHERE ("auth_group_permissions"."permission_id" = $1 OR "auth_user_user_permissions"."permission_id" = $2)
This sounds like a permission check and as I understand, it is cached at request level.
I wonder if someone has written a package to cache them in memcached, for instance, or found a way to reduce the number of queries made to check those permissions?
I checked all the indexes and they seem correct. The query is a bit slow, mostly because we have a lot of permissions, but still, the number of calls is crazy.

How to Group_Concat with a 3-table JOIN for genealogy

I am failing to grasp how I can get the following outcome. I thought perhaps via GROUP_CONCAT, but I am also joining on 3 tables, and unclear on the correct syntax or if this is even the best approach.
Generic table layout:
Table Users: user_id | first | last
Table Orgs org_id | org_name
Table Relationship user_id | org_id | start_year | end_year
The relationship table has MANY entries, that may be associated with that specific user_id.
I need to get the User columns: id, first, last. I'd like to try and group the org data into 1 concatenated, delimited field. Maybe a double group_concatenation is needed? Which would consist of the org_id, org_name, start_year & end_year for all records in the relationship table that match the user_id. I'm hoping for an output like this:
Each '|' represents a new column/piece of data.
If there was only 1 org_id associated with the user_id, the output would be (similar) to:
user_id | first | last | org_id-org_name-start_year-end_year
If there were more than 1 org found/associated with that user_id, the output would have more concatenated/delimited data in the same column:
user_id | first | last | org_id-org_name-start_year-end_year^org_id-org_name-start_year-end_year^org_id-org_name-start_year-end_year
(Notice the '-' delimiter between values and the '^' delimiter between new 'org-grouped' data.)
When I grab that data, I can then just break it up (on the backend/PHP side of things) into an array or whatever.
I'm not sure how I can GROUP_CONCAT (if that is even the best approach here?) while I have to JOIN on 3 separate tables.
This is not my REAL query. (I'm not sure if I should post it, as I do not want to cause any confusion as it does NOT match my dummy table/column names.)
I just wanted to show my attempt that gets me 3 individual rows, (using my JOINS) but no GROUP_CONCAT stuff:
SELECT genealogy_users.imis_id, genealogy_users.full_name,
genealogy_users.member_email, genealogy_orgs.org_id,
genealogy_orgs.org_name, genealogy_relations.user_id,
genealogy_relations.relation_type, genealogy_relations.start_year,
genealogy_relations.end_year
FROM genealogy_users
INNER JOIN genealogy_relations ON genealogy_users.imis_id = genealogy_relations.user_id
INNER JOIN genealogy_orgs ON genealogy_relations.org_id = genealogy_orgs.org_id
WHERE genealogy_users.imis_id = '00003';
UPDATE:
Well, I seem to have fudged my way through it, but I'm not sure how legit this is.
It's -ALMOST- there. I believe I still need a JOIN or something? Since the genealogy_orgs.org_id = '84864' is hardcoded, and it should NOT be. Maybe it needs to come from a JOIN or something?
SELECT genealogy_users.*,
(SELECT GROUP_CONCAT(org_id,'-',
(SELECT org_name FROM genealogy_orgs WHERE genealogy_orgs.org_id = '84864'),
'-',start_year,'-',end_year,'^')
FROM genealogy_relations WHERE genealogy_relations.user_id = genealogy_users.imis_id
) AS alumni_list
FROM genealogy_users
WHERE genealogy_users.imis_id = '00003';
UPDATE 2:
My final attempt, which I think is getting me what I need. (But it's late, and I'll check back tomorrow and look at things more closely.)
SELECT genealogy_users.imis_id, genealogy_users.full_name,
genealogy_users.member_email, genealogy_orgs.org_id,
genealogy_orgs.org_name, genealogy_relations.user_id,
genealogy_relations.relation_type, genealogy_relations.start_year,
genealogy_relations.end_year,
(SELECT GROUP_CONCAT(org_id,'-',org_name,'-',start_year,'-',end_year,'^')
FROM genealogy_relations
WHERE genealogy_relations.user_id = genealogy_users.imis_id
) AS alumni_list
FROM genealogy_users
INNER JOIN genealogy_relations ON genealogy_users.imis_id = genealogy_relations.user_id
INNER JOIN genealogy_orgs ON genealogy_relations.org_id = genealogy_orgs.org_id
WHERE genealogy_users.imis_id = '00003';
Is there anything to make note of in the above attempt? Or is there a better approach? Hopefully something easily readable so it makes sense?
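One possible shape for this, sketched against the dummy table layout from the top of the question rather than the real genealogy_* tables: join all three tables, group by the user columns, and let a single GROUP_CONCAT build the '^'-delimited list. The sketch runs on SQLite through Python's sqlite3 module purely so that it can be executed as-is; SQLite spells this group_concat(X, separator) with || for string concatenation, whereas in MySQL the same idea would be written as GROUP_CONCAT(CONCAT_WS('-', ...) SEPARATOR '^'). The sample names and years are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (user_id TEXT PRIMARY KEY, first TEXT, last TEXT);
CREATE TABLE orgs (org_id INTEGER PRIMARY KEY, org_name TEXT);
CREATE TABLE relationship (user_id TEXT, org_id INTEGER, start_year INTEGER, end_year INTEGER);
-- made-up sample rows
INSERT INTO users VALUES ('00003', 'Ann', 'Lee');
INSERT INTO orgs VALUES (10, 'Alpha'), (20, 'Beta');
INSERT INTO relationship VALUES ('00003', 10, 2001, 2003), ('00003', 20, 2004, 2008);
""")

# One output row per user; every related org is folded into one column.
rows = conn.execute("""
SELECT u.user_id, u.first, u.last,
       group_concat(o.org_id || '-' || o.org_name || '-' ||
                    r.start_year || '-' || r.end_year, '^') AS org_list
FROM users u
JOIN relationship r ON r.user_id = u.user_id
JOIN orgs o ON o.org_id = r.org_id
GROUP BY u.user_id, u.first, u.last
""").fetchall()

# Splitting the field again on the application side.
for user_id, first, last, org_list in rows:
    orgs = [entry.split("-") for entry in org_list.split("^")]
    print(user_id, first, last, orgs)
```

Note that this only splits cleanly if neither delimiter can appear inside org_name, and that the order of the concatenated entries is not guaranteed unless the database is told how to order them.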

Select query with max date

I have this query
SQL query: selecting by branch and machine code, order by branch and date
SELECT
mb.machine_id AS 'MachineId',
MAX(mb.date) AS 'Date',
mi.branch_id AS 'BranchId',
b.branch AS 'Branch',
b.branch_code AS 'BranchCode'
FROM
dbo.machine_beat mb
LEFT JOIN dbo.machine_ids mi
ON mb.machine_id = mi.machine_id
LEFT JOIN dbo.branches b
ON mi.branch_id = b.lookup_key
GROUP BY
mb.machine_id,
mi.branch_id,
b.branch,
b.branch_code
ORDER BY
b.branch, [Date] DESC
Query result:
|==========|=======================|=========|==========|==========|
|MachineId |Date |BranchId |Branch |BranchCode|
|==========|=======================|=========|==========|==========|
|SS10000005|2014-03-31 19:10:17.110|3 |Mamamama |MMMM |
|SS10000043|2014-03-31 17:16:32.760|3 |Mamamama |MMMM |
|SS10000005|2014-02-17 14:58:42.523|3 |Mamamama |MMMM |
|==================================================================|
My problem is how to select the updated machine code? Expected query result:
|==========|=======================|=========|==========|==========|
|MachineId |Date |BranchId |Branch |BranchCode|
|==========|=======================|=========|==========|==========|
|SS10000005|2014-03-31 19:10:17.110|3 |Mamamama |MMMM |
|==================================================================|
Update
I created sqlfiddle. I also added data, aside from MMMM. I need the updated date for each branch. So probably, my result will be:
|==========|=======================|=========|==========|==========|
|MachineId |Date |BranchId |Branch |BranchCode|
|==========|=======================|=========|==========|==========|
|SS10000343|2014-06-03 13:43:40.570|1 |Cacacaca |CCCC |
|SS30000033|2014-03-31 18:59:42.153|8 |Fafafafa |FFFF |
|SS10000005|2014-03-31 19:10:17.110|3 |Mamamama |MMMM |
|==================================================================|
Try using Row_number with partition by
select * from
(
SELECT
mb.machine_id AS 'MachineId',
mb.date AS 'Date',
mi.branch_id AS 'BranchId',
b.branch AS 'Branch',
b.branch_code AS 'BranchCode',rn=row_number()over(partition by mb.machine_id order by mb.date desc)
FROM
dbo.machine_beat mb
LEFT JOIN dbo.machine_ids mi
ON mb.machine_id = mi.machine_id
LEFT JOIN dbo.branches b
ON mi.branch_id = b.lookup_key
WHERE
branch_code = 'MMMM'
/*
GROUP BY
mb.machine_id,
mi.branch_id,
b.branch,
b.branch_code
*/
)x
where x.rn=1
@861051069712110711711710997114 is looking in the right direction: this is a greatest-n-per-group question. Yours is more complicated than usual because the "greatest" portion comes from a different table than the "group" portion. The only issue with his answer is that you hadn't provided sufficient information to finish it correctly.
The following solves the problem:
WITH Most_Recent_Beat AS (SELECT Machine.branch_id,
Beat.machine_id, Beat.date,
ROW_NUMBER() OVER(PARTITION BY Machine.branch_id
ORDER BY Beat.date DESC) AS rn
FROM machine_ids Machine
JOIN machine_beat Beat
ON Beat.machine_id = Machine.machine_id)
SELECT Beat.machine_id, Beat.date,
Branches.lookup_key, Branches.branch, Branches.branch_code
FROM Branches
JOIN Most_Recent_Beat Beat
ON Beat.branch_id = Branches.lookup_key
AND Beat.rn = 1
ORDER BY Branches.branch, Beat.date DESC
(and corrected SQL Fiddle for testing. You shouldn't be using a different RDBMS for the example, especially as there were syntax errors for the db you say you're using.)
Which yields your expected results.
So what's going on here? The key is the ROW_NUMBER()-function line. This function itself simply generates a number series. The OVER(...) clause defines what's known as a window, over which the function will be run. PARTITION BY is akin to GROUP BY: every time a new group occurs (a new Machine.branch_id value), the numbering restarts. The ORDER BY inside the parentheses says that, within each group, rows are numbered in that order. So the greatest date (most recent, assuming all dates are in the past) gets 1, the next gets 2, etc.
This is done in a CTE here (it could also be done as part of a subquery table-reference) because only the most recent date is required - where the generated row number is 1; as SQL Server doesn't allow you to put SELECT-clause aliases into the WHERE clause, it needs to be wrapped in another level to be able to reference it that way.
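To make the numbering visible, here is a small sketch that prints the rn value for every row before filtering. It uses Python's sqlite3 module instead of SQL Server (an assumption made only so the example is self-contained; it needs an SQLite build of 3.25 or later for window functions), with a few of the machine IDs and timestamps from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE machine_ids (machine_id TEXT, branch_id INTEGER);
CREATE TABLE machine_beat (machine_id TEXT, date TEXT);
INSERT INTO machine_ids VALUES
    ('SS10000005', 3), ('SS10000043', 3), ('SS10000343', 1);
INSERT INTO machine_beat VALUES
    ('SS10000005', '2014-03-31 19:10:17.110'),
    ('SS10000043', '2014-03-31 17:16:32.760'),
    ('SS10000005', '2014-02-17 14:58:42.523'),
    ('SS10000343', '2014-06-03 13:43:40.570');
""")

# Number the beats within each branch, most recent first.
numbered = conn.execute("""
SELECT m.branch_id, b.machine_id, b.date,
       ROW_NUMBER() OVER (PARTITION BY m.branch_id
                          ORDER BY b.date DESC) AS rn
FROM machine_ids m
JOIN machine_beat b ON b.machine_id = m.machine_id
ORDER BY m.branch_id, rn
""").fetchall()

for row in numbered:
    print(row)  # branch 1 gets rn 1; branch 3 gets rn 1, 2, 3

# Keeping rn = 1 leaves the most recent beat per branch.
latest = [row[:3] for row in numbered if row[3] == 1]
print(latest)
```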

Timeout running SQL query

I'm trying to use the aggregation features of the Django ORM to run a query on a MSSQL 2008R2 database, but I keep getting a timeout error. The query (generated by Django) which fails is below. I've tried running it directly in SQL Server Management Studio and it works, but it takes 3.5 minutes.
It does look like it's aggregating over a bunch of fields which it doesn't need to, but I wouldn't have thought that would cause it to take that long. The database isn't that big either: auth_user has 9 records, ticket_ticket has 1210, and ticket_watchers has 1876. Is there something I'm missing?
SELECT
[auth_user].[id],
[auth_user].[password],
[auth_user].[last_login],
[auth_user].[is_superuser],
[auth_user].[username],
[auth_user].[first_name],
[auth_user].[last_name],
[auth_user].[email],
[auth_user].[is_staff],
[auth_user].[is_active],
[auth_user].[date_joined],
COUNT([tickets_ticket].[id]) AS [tickets_captured__count],
COUNT(T3.[id]) AS [assigned_tickets__count],
COUNT([tickets_ticket_watchers].[ticket_id]) AS [tickets_watched__count]
FROM
[auth_user]
LEFT OUTER JOIN [tickets_ticket] ON ([auth_user].[id] = [tickets_ticket].[capturer_id])
LEFT OUTER JOIN [tickets_ticket] T3 ON ([auth_user].[id] = T3.[responsible_id])
LEFT OUTER JOIN [tickets_ticket_watchers] ON ([auth_user].[id] = [tickets_ticket_watchers].[user_id])
GROUP BY
[auth_user].[id],
[auth_user].[password],
[auth_user].[last_login],
[auth_user].[is_superuser],
[auth_user].[username],
[auth_user].[first_name],
[auth_user].[last_name],
[auth_user].[email],
[auth_user].[is_staff],
[auth_user].[is_active],
[auth_user].[date_joined]
HAVING
(COUNT([tickets_ticket].[id]) > 0 OR COUNT(T3.[id]) > 0 )
EDIT:
Here are the relevant indexes (excluding those not used in the query):
auth_user.id (PK)
auth_user.username (Unique)
tickets_ticket.id (PK)
tickets_ticket.capturer_id
tickets_ticket.responsible_id
tickets_ticket_watchers.id (PK)
tickets_ticket_watchers.user_id
tickets_ticket_watchers.ticket_id
EDIT 2:
After a bit of experimentation, I've found that the following query is the smallest that results in the slow execution:
SELECT
COUNT([tickets_ticket].[id]) AS [tickets_captured__count],
COUNT(T3.[id]) AS [assigned_tickets__count],
COUNT([tickets_ticket_watchers].[ticket_id]) AS [tickets_watched__count]
FROM
[auth_user]
LEFT OUTER JOIN [tickets_ticket] ON ([auth_user].[id] = [tickets_ticket].[capturer_id])
LEFT OUTER JOIN [tickets_ticket] T3 ON ([auth_user].[id] = T3.[responsible_id])
LEFT OUTER JOIN [tickets_ticket_watchers] ON ([auth_user].[id] = [tickets_ticket_watchers].[user_id])
GROUP BY
[auth_user].[id]
The weird thing is that if I comment out any two lines in the above, it runs in less than 1s, but it doesn't seem to matter which lines I remove (although obviously I can't remove a join without also removing the relevant SELECT line).
EDIT 3:
The python code which generated this is:
User.objects.annotate(
Count('tickets_captured'),
Count('assigned_tickets'),
Count('tickets_watched')
)
A look at the execution plan shows that SQL Server is first doing a cross join on all the tables, resulting in about 280 million rows and 6 GB of data. I assume that this is where the problem lies, but why is it happening?
SQL Server is doing exactly what it was asked to do. Unfortunately, Django is not generating the right query for what you want. It looks like you need to count distinct, instead of just count: Django annotate() multiple times causes wrong answers
As for why the query works that way: the query says to join the four tables together. So if a user has 2 captured tickets, 3 assigned tickets, and 4 watched tickets, the join returns 2*3*4 = 24 rows, one for each combination of tickets. A plain COUNT then counts all 24 rows in every column; the DISTINCT removes the duplicates.
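The 2*3*4 arithmetic can be checked with a tiny sketch. It uses Python's sqlite3 module rather than SQL Server, and invented rows chosen so that user 1 has exactly 2 captured, 3 assigned and 4 watched tickets; only the table and column names come from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE auth_user (id INTEGER PRIMARY KEY);
CREATE TABLE tickets_ticket (id INTEGER PRIMARY KEY, capturer_id INTEGER, responsible_id INTEGER);
CREATE TABLE tickets_ticket_watchers (id INTEGER PRIMARY KEY, ticket_id INTEGER, user_id INTEGER);
INSERT INTO auth_user VALUES (1), (2);
-- invented data: user 1 captured tickets 1-2 and is responsible for tickets 3-5
INSERT INTO tickets_ticket VALUES (1, 1, 2), (2, 1, 2), (3, 2, 1), (4, 2, 1), (5, 2, 1);
-- invented data: user 1 watches tickets 1-4
INSERT INTO tickets_ticket_watchers VALUES (1, 1, 1), (2, 2, 1), (3, 3, 1), (4, 4, 1);
""")

# Same join shape as the generated query; {d} is either empty or DISTINCT.
query = """
SELECT COUNT({d} t1.id), COUNT({d} t3.id), COUNT({d} w.ticket_id)
FROM auth_user u
LEFT OUTER JOIN tickets_ticket t1 ON u.id = t1.capturer_id
LEFT OUTER JOIN tickets_ticket t3 ON u.id = t3.responsible_id
LEFT OUTER JOIN tickets_ticket_watchers w ON u.id = w.user_id
WHERE u.id = 1
GROUP BY u.id
"""

plain = conn.execute(query.format(d="")).fetchone()
distinct = conn.execute(query.format(d="DISTINCT")).fetchone()

print(plain)     # (24, 24, 24): every column counts all 2*3*4 joined rows
print(distinct)  # (2, 3, 4): the counts that were actually wanted
```

On the Django side, Count() accepts a distinct=True argument, which should produce the COUNT(DISTINCT ...) form. Note that DISTINCT corrects the numbers but the database still has to build the multiplied intermediate rows, so it may not remove the slowness by itself.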
What about this?
SELECT auth_user.*,
C1.tickets_captured__count,
C2.assigned_tickets__count,
C3.tickets_watched__count
FROM
auth_user
LEFT JOIN
( SELECT capturer_id, COUNT(*) AS tickets_captured__count
FROM tickets_ticket GROUP BY capturer_id ) AS C1 ON auth_user.id = C1.capturer_id
LEFT JOIN
( SELECT responsible_id, COUNT(*) AS assigned_tickets__count
FROM tickets_ticket GROUP BY responsible_id ) AS C2 ON auth_user.id = C2.responsible_id
LEFT JOIN
( SELECT user_id, COUNT(*) AS tickets_watched__count
FROM tickets_ticket_watchers GROUP BY user_id ) AS C3 ON auth_user.id = C3.user_id
WHERE C1.tickets_captured__count > 0 OR C2.assigned_tickets__count > 0
--WHERE C1.tickets_captured__count is not null OR C2.assigned_tickets__count is not null -- also works (I think with better performance)

Complicated Calculation Using Oracle SQL

I have created a database for an imaginary solicitors' firm, and my last query is driving me insane. I need to work out the total a solicitor has made in their career with the company. I have time_spent and rate to multiply, and a special rate to add (the special rate is a one-off charge for corporate contracts, so not many cases have one). The best I could come up with is the code below. It does what I want, but it only displays the solicitors working on a case with a special rate applied to it.
I essentially want it to display the result of the query in a table even if the special rate is NULL.
I have ordered the view to show the highest amount first so I can use ROWNUM to show only the top 10% of earners.
CREATE VIEW rich_solicitors AS
SELECT notes.time_spent * rate.rate_amnt + special_rate.s_rate_amnt AS solicitor_made,
notes.case_id
FROM notes,
rate,
solicitor_rate,
solicitor,
case,
contract,
special_rate
WHERE notes.solicitor_id = solicitor.solicitor_id
AND solicitor.solicitor_id = solicitor_rate.solicitor_id
AND solicitor_rate.rate_id = rate.rate_id
AND notes.case_id = case.case_id
AND case.contract_id = contract.contract_id
AND contract.contract_id = special_rate.contract_id
ORDER BY -solicitor_made;
Query:
SELECT *
FROM rich_solicitors
WHERE ROWNUM <= (SELECT COUNT(*)/10
FROM rich_solicitors)
I'm suspicious of your use of ROWNUM in your example query...
Oracle9i+ supports analytic functions, like ROW_NUMBER and NTILE, to make queries like your example easier. Analytics are also ANSI, so the syntax is consistent when implemented (IE: Not on MySQL or SQLite). I re-wrote your query as:
SELECT x.*
FROM (SELECT n.time_spent * r.rate_amnt + COALESCE(spr.s_rate_amnt, 0) AS solicitor_made,
n.case_id,
NTILE(10) OVER (ORDER BY n.time_spent * r.rate_amnt + COALESCE(spr.s_rate_amnt, 0) DESC) AS rank
FROM NOTES n
JOIN SOLICITOR s ON s.solicitor_id = n.solicitor_id
JOIN SOLICITOR_RATE sr ON sr.solicitor_id = s.solicitor_id
JOIN RATE r ON r.rate_id = sr.rate_id
JOIN CASE c ON c.case_id = n.case_id
JOIN CONTRACT cntrct ON cntrct.contract_id = c.contract_id
LEFT JOIN SPECIAL_RATE spr ON spr.contract_id = cntrct.contract_id) x
WHERE x.rank = 1
If you're new to SQL, I recommend using ANSI-92 syntax. Your example uses ANSI-89, which doesn't support OUTER JOINs and is considered deprecated. I used a LEFT OUTER JOIN against the SPECIAL_RATE table because not all jobs are likely to have a special rate attached to them.
It's also not recommended to include an ORDER BY in views, because views encapsulate the query -- no one will know what the default ordering is, and will likely include their own (waste of resources potentially).
You need to left join the special rate table.
If I recall correctly, the Oracle syntax is like:
AND contract.contract_id = special_rate.contract_id (+)
but now special_rate.* can be null so:
+ special_rate.s_rate_amnt
will need to be:
+ coalesce(special_rate.s_rate_amnt,0)
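The effect of both changes can be seen in a reduced sketch. It runs on SQLite through Python's sqlite3 module rather than Oracle, so it uses LEFT JOIN instead of the (+) notation. The base_fee column is a hypothetical stand-in for time_spent * rate_amnt, and the amounts are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- base_fee is a hypothetical stand-in for time_spent * rate_amnt
CREATE TABLE contract (contract_id INTEGER PRIMARY KEY, base_fee INTEGER);
CREATE TABLE special_rate (contract_id INTEGER, s_rate_amnt INTEGER);
INSERT INTO contract VALUES (1, 100), (2, 200);
INSERT INTO special_rate VALUES (1, 50);  -- only contract 1 has a special rate
""")

# Inner join: contract 2 disappears because it has no special_rate row.
inner = conn.execute("""
SELECT c.contract_id, c.base_fee + s.s_rate_amnt
FROM contract c
JOIN special_rate s ON s.contract_id = c.contract_id
ORDER BY c.contract_id
""").fetchall()

# Left join: contract 2 is kept, but adding NULL yields NULL unless
# the missing special rate is replaced by 0 with COALESCE.
left = conn.execute("""
SELECT c.contract_id,
       c.base_fee + s.s_rate_amnt,
       c.base_fee + COALESCE(s.s_rate_amnt, 0)
FROM contract c
LEFT JOIN special_rate s ON s.contract_id = c.contract_id
ORDER BY c.contract_id
""").fetchall()

print(inner)  # [(1, 150)]
print(left)   # [(1, 150, 150), (2, None, 200)]
```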