Joining in SQLite to get desired records - sql

I am working on a mobile application which takes two inputs - Source Station Name and Destination Station Names. Upon receiving these two inputs, the application would then enlist the names of trains available for the given stations along with their source_arrival and destination_reach timings.
(Note: For now, I am only focusing on the unreserved local trains that operate in the state of West Bengal, India)
I am using SQLite as the RDBMS. I have following three tables as the sources of the data -
train_table (which has the details of the trains available):
station_table (which contains the details of the stations):
route_table (which contains the route details):
Now, my aim is to produce the output in the following manner as specified earlier (suppose I gave Baruipur Jn as source and Sealdah as destination):
I am unable to figure out the query needed for this. Initially, I was trying something like the following:
select r1.trainId, r1.arrival as SrcArrive, r2.arrival
as Reach from route_table r1 cross join route_table r2
where r1.trainId = r2.trainId and r1.stationId <> r2.stationId and
r1.arrival <> r2.arrival;
(Yes, without the trainName)
But I was unable to cut down the unintended source_arrival timings. However, I was able to retrieve the number of different trains available for given two stations with the following:
select _id, trainNO, trainName from train_table where _id in
(select trainId from route_table where stationId = 109
INTERSECT
select trainId from route_table where stationId = 21);
But with this, I am not able to get to the final result that I need.

This might work, try once.
select routeData.*, train_table.* from (select r1.trainId, r1.arrival as SrcArrive, r2.arrival
as Reach from route_table r1 cross join route_table r2
where r1.trainId = r2.trainId and r1.stationId <> r2.stationId and
r1.arrival <> r2.arrival) routeData inner join train_table on routeData.trainId=train_table._id;
I have redifined the selection from route table, try this updated one:
select trainName, SrcArrival, Destination from (select trainData.trainName, route.* from
(select A.trainId, A.arrival as SrcArrival, B.trainId, B.arrival as Destination from
route_table A inner join route_table B on A.trainId=B.trainId where A.stationId=109 and
B.stationId=259 and A.arrival<B.arrival) route inner join train_table trainData on
route.trainId=trainData._id) order by SrcArrival, Destination;

Related

SQL - join three tables based on (different) latest dates in two of them

Using Oracle SQL Developer, I have three tables with some common data that I need to join.
Appreciate any help on this!
Please refer to https://i.stack.imgur.com/f37Jh.png for the input and desired output (table formatting doesn't work on all tables).
These tables are made up in order to anonymize them, and in reality contain other data with millions of entries, but you could think of them as representing:
Product = Main product categories in a grocery store.
Subproduct = Subcategory products to the above. Each time the table is updated, the main product category may loses or get some new suproducts assigned to it. E.g. you can see that from May to June the Pulled pork entered while the Fishsoup was thrown out.
Issues = Status of the products, for example an apple is bad if it has brown spots on it..
What I need to find is: for each P_NAME, find the latest updated set of subproducts (SP_ID and SP_NAME), and append that information with the latest updated issue status (STATUS_FLAG).
Please note that each main product category gets its set of subproducts updated at individual occasions i.e. 1234 and 5678 might be "latest updated" on different dates.
I have tried multiple queries but failed each time. I am using combos of SELECT, LEFT OUTER JOIN, JOIN, MAX and GROUP BY.
Latest attempt, which gives me the combo of the first two tables, but missing the third:
SELECT
PRODUCT.P_NAME,
SUBPRODUCT.SP_PRODUCT_ID, SUBPRODUCT.SP_NAME, SUBPRODUCT.SP_ID, SUPPRODUCT.SP_VALUE_DATE
FROM SUBPRODUCT
LEFT OUTER JOIN PRODUCT ON PRODUCT.P_ID = SUBPRODUCT.SP_PRODUCT_ID
JOIN(SELECT SP_PRODUCT_ID, MAX(SP_VALUE_DATE) AS latestdate FROM SUBPRODUCT GROUP BY SP_PRODUCT_ID) sub ON
sub.SP_PRODUCT_ID = SUBPRODUCT.SP_PRODUCT_ID AND sub.latestDate = SUBPRODUCT.SP_VALUE_DATE;
Trying to find a row with a max value is a common SQL pattern - you can do it with a join, like your example, but it's usually more clear to use a subquery or a window function.
Correlated subquery example
select
PRODUCT.P_NAME,
SUBPRODUCT.SP_PRODUCT_ID, SUBPRODUCT.SP_NAME, SUBPRODUCT.SP_ID, SUPPRODUCT.SP_VALUE_DATE,
ISSUES.STATUS_FLAG, ISSUES.STATUS_LAST_UPDATED
from PRODUCT
join SUBPRODUCT
on PRODUCT.P_ID = SUBPRODUCT.SP_PRODUCT_ID
and SUBPRODUCT.SP_VALUE_DATE = (select max(S2.SP_VALUE_DATE) as latestDate
from SUBPRODUCT S2
where S2.SP_PRODUCT_ID = SUBPRODUCT.SP_PRODUCT_ID)
join ISSUES
on ISSUES.ISSUE_ID = SUBPRODUCT.SP_ID
and ISSUES.STATUS_LAST_UPDATED = (select max(I2.STATUS_LAST_UPDATED) as latestDate
from ISSUES I2
where I2.ISSUE_ID = ISSUES.ISSUE_ID)
Window function / inline view example
select
PRODUCT.P_NAME,
S.SP_PRODUCT_ID, S.SP_NAME, S.SP_ID, S.SP_VALUE_DATE,
I.STATUS_FLAG, I.STATUS_LAST_UPDATED
from PRODUCT
join (select SUBPRODUCT.*,
max(SP_VALUE_DATE) over (partition by SP_PRODUCT_ID) as latestDate
from SUBPRODUCT) S
on PRODUCT.P_ID = S.SP_PRODUCT_ID
and S.SP_VALUE_DATE = S.latestDate
join (select ISSUES.*,
max(STATUS_LAST_UPDATED) over (partition by ISSUE_ID) as latestDate
from ISSUES) I
on I.ISSUE_ID = S.SP_ID
and I.STATUS_LAST_UPDATED = I.latestDate
This often performs a bit better, but window functions can be tricky to understand.

SQL - getting data twice from the same table

Good people of the internet, I need your help!
I'm trying to put together some SQL code to create a report. Basically, I'm needing to look at one table - tbl_Schedules - and fetch the maximum of the field SchedDone, which is a regular date field.
This is fair enough. I manage this using GROUP BY and MAX.
The same table also contains SchedFrom and SchedTo fields, and I need to get this from another record, but run alongside this.
Basically, I need a "LatestDate" field (which shows the max SchedDone), and then a "next scheduled" field (or fields), showing when it is next to be done, pulled from a row where the SchedDone is null.
My current code, displayed below (ignore everything but the tbl_Schedules stuff) is working to an extent. It shows the LatestDate as described above, but also shows the maximum of that Task's SchedFrom and SchedTo. These are all from different rows in the tbl_Schedules, which is what I want. I just need to know how to set up rules for the SchedFrom and SchedTo, preferably without involving any other tables or setting up multiple views.
I do have this working, but it's taking up several views and the speed involved is not good. I'd hope I can get it working in a single chunk of SQL code.
PS - tbl_PhysicalAssets is a one-to-many relationship with tbl_Operations (one row in tbl_PhysicalAssets to many in tbl_Operations), and tbl_Operations is a one-to-many relationship with tbl_Schedules (one row in tbl_Operations to many in tbl_Schedules).
Current code below (again, please ignore the other tables!) -
SELECT
dbo.tbl_PhysicalAsset.FKID_Contract,
dbo.tbl_PhysicalAsset.MyLevel,
dbo.tbl_PhysicalAsset.L1_Name,
dbo.tbl_PhysicalAsset.L2_Name,
dbo.tbl_PhysicalAsset.L3_Name,
dbo.tbl_OpList.Operation_Name,
dbo.tbl_Teams.Team_Name,
MAX(tbl_Schedules_1.SchedDone) AS LatestDate,
MAX(tbl_Schedules_1.SchedFrom) AS Expr1,
MAX(tbl_Schedules_1.SchedTo) AS Expr2
FROM
dbo.tbl_Schedules AS tbl_Schedules_1
RIGHT OUTER JOIN
dbo.tbl_PhysicalAsset
INNER JOIN
dbo.tbl_Operations ON dbo.tbl_PhysicalAsset.PKID_PhysicalAsset = dbo.tbl_Operations.FKID_PhysicalAsset
INNER JOIN
dbo.tbl_OpList ON dbo.tbl_Operations.FKID_Operation = dbo.tbl_OpList.PKID_Op
INNER JOIN
dbo.tbl_Teams ON dbo.tbl_Operations.FKID_Team = dbo.tbl_Teams.PKID_Team ON tbl_Schedules_1.FKID_Operation = dbo.tbl_Operations.PKID_Operation
GROUP BY
dbo.tbl_PhysicalAsset.FKID_Contract,
dbo.tbl_PhysicalAsset.MyLevel,
dbo.tbl_PhysicalAsset.L1_Name,
dbo.tbl_PhysicalAsset.L2_Name,
dbo.tbl_PhysicalAsset.L3_Name,
dbo.tbl_OpList.Operation_Name,
dbo.tbl_Teams.Team_Name
HAVING
(dbo.tbl_PhysicalAsset.FKID_Contract = 6)
AND (dbo.tbl_PhysicalAsset.MyLevel = 3)
This is rough as we don't know the exact details of your table structure.
But basically you need to first write the queries that get the two separate bits of information, make sure they work in isolation. You can then just join them together.
So something like (I've assumed FKID_Operation is the 'common' piece of info):
select
a.FKID_Operation
, b.LatestDate
, c.NextToDate
, c.NextFromDate
from tbl_Schedules a
inner join
(
select
m.FKID_Operation
, Max(m.SchedDone) as LatestDate
from tbl_Schedules m
where SchedDone is not null
) b
on a.FKID_Operation = b.FKID_Operation
inner join
(
select
n.FKID_Operation
, n.SchedTo as NextToDate
, n.SchedFrom as NextFromDate
from tbl_Schedules n
where SchedDone is null
) c
on a.FKID_Operation = c.FKID_Operation
I'd also look into CTE's as they can make this kind of query much easier to understand.

subquery returning more than one value

SELECT CG.SITEID,
CR.COLLECTIONID,
CG.COLLECTIONNAME,
CASE
WHEN CR.ARCHITECTUREKEY = 5
THEN
N'vSMS_R_System'
WHEN CR.ARCHITECTUREKEY = 0
THEN
(SELECT BASETABLENAME
FROM DISCOVERYARCHITECTURES
JOIN
COLLECTION_RULES
ON DISCOVERYARCHITECTURES.DISCARCHKEY =
COLLECTION_RULES.ARCHITECTUREKEY
JOIN
COLLECTIONS_G
ON COLLECTION_RULES.COLLECTIONID =
COLLECTIONS_G.COLLECTIONID
WHERE COLLECTIONS_G.SITEID = (SELECT TOP 1 SOURCECOLLECTIONID FROM VCOLLECTIONDEPENDENCYCHAIN WHERE DEPENDENTCOLLECTIONID = CG.SITEID ORDER BY LEVEL DESC))
ELSE (SELECT DA.BASETABLENAME FROM DISCOVERYARCHITECTURES DA WHERE DA.DISCARCHKEY=CR.ARCHITECTUREKEY) END AS TABLENAME
FROM COLLECTIONS_G CG
JOIN COLLECTIONS_L CL ON CG.COLLECTIONID=CL.COLLECTIONID
JOIN COLLECTION_RULES CR ON CG.COLLECTIONID=CR.COLLECTIONID
WHERE (CG.FLAGS&4)=4 AND CL.CURRENTSTATUS!=5
I am having a problem with the code above, around the line:
when cr.ArchitectureKey=0 then...
The problem is that the sub-query returns more than one value, and I'm not too sure how to invert the query so that I get rid of the error.
To make matters worse, cr.ArchitectureKey would normally join with da.DiscArchKey, but while cr.ArchitectureKey can have a value of 0, that does not exist in da.DiscArchKey, meaning if I join the two directly I lose data.
EDIT
More information regarding the problem itself:
This is a stored procedure for a Microsoft product that has a 'bug' (probably considered a feature though) which I'm trying to fix. Don't worry, this is only in my own little test server.
Anyway, there's the concept of a Collection. All Collections must have a parent (determined through VCOLLECTIONDEPENDENCYCHAIN), with the exception of the very top level Collection that is a system collection and cannot be modified.
Each collection can have 0 or more rules, and each rule has a rule type, where the ID of the rule type is saved onto COLLECTION_RULES and the matching string for that ID is saved onto DISCOVERYARCHITECTURES.
In most cases, a rule is a WQL query, and the rule type is determined by what tables are queried on the WQL query.
However, and this is where the problem lies, collections can also have a query of type 'include' or 'exclude', which basically forces it to borrow the query of another Collection. So effectively you include the results of another Collection's query onto your own Collection, and that's the query.
As far as COLLECTION_RULES is concerned, when that happens, the ID of the rule type is 0, which is a value that doesn't exist in DISCOVERYARCHITECTURES.
What I was trying to modify was so that when the rule type is 0, get and use the rule type(s) of the highest up parent (not the direct parent since the parent Collection could also have a single include rule, in which case the rule type would still be 0).
The problem is that because each rule can have multiple rule types, it returns multiple rows in some instances.
I tried to invert the query to remove the SELECT and use joins only, but failed because I found I always needed to join it to DISCOVERYARCHITECTURES and I have nothing to join it on when the rule type = 0.
EDIT2
Sample data:
Collections_G
Collections_L
Collection_Rules
DiscoveryArchitectures
vCollectionDependencyChain
Original Query and Original Results
SELECT cg.SiteID,
CASE
WHEN da.DiscArchKey=5
THEN N'vSMS_R_System'
ELSE da.BaseTableName END AS TableName
FROM Collections_G cg
JOIN Collections_L cl ON cg.CollectionID=cl.CollectionID
JOIN Collection_Rules cr ON cg.CollectionID=cr.CollectionID
JOIN DiscoveryArchitectures da ON cr.ArchitectureKey=da.DiscArchKey
WHERE (cg.Flags&4)=4 AND cl.CurrentStatus!=5
As you can see from the results picture above, some collections appear multiple times but with different TableNames. This is because each collection have have several rules, and each rule has one cr.ArchitectureKey
Also, and more importantly, collections PS10000B and PS10000C do not show up because their cr.ArchitectureKey = 0 which is a value that doesn't exist in da.DiscArchKey.
My goal is to have collections that have a cr.ArchitectureKey appear, but I need to assign them a cr.ArchitectureKey
My thought (which is slightly flawed, but don't know enough SQL to make it better, so if someone could help with that it would be appreciated too) was to get use the da.DiscArchKey from the top level parent. But the top level parent can have multiple DiscArchKeys, which is what is causing the problem.
As mentioned above getting the top level parent is slightly flawed, and ideally I would get the top level cr.ReferencedCollectionID. In other words, if PS10000B has a cr.ReferencedCollectionID of PS10000C and PS10000C has a cr.ReferencedCollectionID of SMS00002 but because SMS00002 has no cr.ReferencedCollectionID then SMS00002 is the top level cr.ReferencedCollectionID and both PS10000B and PS10000C should have da.DiscArchKey(s) equal to those of SMS00002.
Please have a look at a wired solution that comes into mind. You may face some syntax errors(most probably in 2nd and 3rd CTE) but it just an idea.
Get each case values in separate CTEs and then combine them at the end.
;WITH CTE
AS
(
SELECT CG.SITEID,
CR.COLLECTIONID,
CG.COLLECTIONNAME
FROM COLLECTIONS_G CG
JOIN COLLECTIONS_L CL ON CG.COLLECTIONID=CL.COLLECTIONID
JOIN COLLECTION_RULES CR ON CG.COLLECTIONID=CR.COLLECTIONID
WHERE (CG.FLAGS&4)=4 AND CL.CURRENTSTATUS!=5
),
ARCHITECTUREKEY5
AS
(
SELECT C.SITEID,
C.COLLECTIONID,
C.COLLECTIONNAME,
N'vSMS_R_System' as TABLENAME
FROM CTE C WHERE C.ARCHITECTUREKEY = 5
),
ARCHITECTUREKEY0
AS
(
SELECT C.SITEID,
C.COLLECTIONID,
C.COLLECTIONNAME,
BASETABLENAME as TABLENAME
FROM CTE C,
DISCOVERYARCHITECTURES
JOIN
COLLECTION_RULES
ON DISCOVERYARCHITECTURES.DISCARCHKEY =
COLLECTION_RULES.ARCHITECTUREKEY
JOIN
COLLECTIONS_G
ON COLLECTION_RULES.COLLECTIONID =
COLLECTIONS_G.COLLECTIONID
WHERE COLLECTIONS_G.SITEID = (SELECT TOP 1 SOURCECOLLECTIONID FROM VCOLLECTIONDEPENDENCYCHAIN WHERE DEPENDENTCOLLECTIONID = C.SITEID ORDER BY LEVEL DESC))
and C.ARCHITECTUREKEY = 0
),
ARCHITECTUREKEYOTHER
AS
(
SELECT C.SITEID,
C.COLLECTIONID,
C.COLLECTIONNAME,
DA.BASETABLENAME as TABLENAME
FROM DISCOVERYARCHITECTURES DA, CTE C WHERE DA.DISCARCHKEY=CR.ARCHITECTUREKEY AND C.ARCHITECTUREKEY not in (0,1)
)
Select * from ARCHITECTUREKEY5
UNION
Select * from ARCHITECTUREKEY0
UNION
Select * from ARCHITECTUREKEYOTHER

Query Shows multiple listings of same info

I'm using this query to find all the machines on the network (using dell kace) that have an expired warranty according to their service tag.
However, when I run the query, some of the machines are listed twice but should only be listed once.
Here is an example of the output where machine example3 is correctly listed but example1 is listed twice.
# Machine Name Service Tag
1 example1 abcd123
2 example1 abcd123
3 example3 abcd124
Code:
SELECT
M.NAME AS MACHINE_NAME, M.CS_MODEL AS MODEL, DA.SERVICE_TAG,
DA.SHIP_DATE,M.USER_LOGGED AS LAST_LOGGED_IN_USER, DW.SERVICE_LEVEL_CODE,
DW.SERVICE_LEVEL_DESCRIPTION, DW.END_DATE AS EXPIRATION_DATE
FROM
DELL_WARRANTY DW
JOIN
DELL_ASSET DA ON (DW.SERVICE_TAG = DA.SERVICE_TAG)
JOIN
MACHINE M
ON (M.BIOS_SERIAL_NUMBER = DA.PARENT_SERVICE_TAG OR M.BIOS_SERIAL_NUMBER = DA.SERVICE_TAG)
LEFT JOIN
DELL_WARRANTY DW2 ON DW2.SERVICE_TAG=DW.SERVICE_TAG and DW2.END_DATE > NOW()
WHERE
M.CS_MANUFACTURER LIKE '%dell%'
AND
M.BIOS_SERIAL_NUMBER!=''
AND
DA.DISABLED != 1
AND
DW.END_DATE < NOW()
AND
DW2.SERVICE_TAG IS NULL;
Any ideas on how to make computers with the same machine name and service tags only output once? Thanks.
Let me make the assumption that you have a reasonable data model and reasonably populated data. That means that the duplicates are not coming from inappropriate data stored in the database.
Your query (formatted so I can read it) is:
SELECT M.NAME AS MACHINE_NAME, M.CS_MODEL AS MODEL, DA.SERVICE_TAG,
DA.SHIP_DATE, M.USER_LOGGED AS LAST_LOGGED_IN_USER, DW.SERVICE_LEVEL_CODE,
DW.SERVICE_LEVEL_DESCRIPTION, DW.END_DATE AS EXPIRATION_DATE
FROM DELL_WARRANTY DW JOIN
DELL_ASSET DA
ON DW.SERVICE_TAG = DA.SERVICE_TAG JOIN
MACHINE M
ON M.BIOS_SERIAL_NUMBER = DA.PARENT_SERVICE_TAG OR
M.BIOS_SERIAL_NUMBER = DA.SERVICE_TAG LEFT JOIN
DELL_WARRANTY DW2
ON DW2.SERVICE_TAG = DW.SERVICE_TAG and DW2.END_DATE > NOW()
WHERE M.CS_MANUFACTURER LIKE '%dell%' AND
M.BIOS_SERIAL_NUMBER <> '' AND
DA.DISABLED <> 1;
The suspect join is the one on Machine because it has an or. So, two machines might match different parts of the service tag, resulting in multiple very similar rows.
If your concern is specifically about machine names and service tags (the two columns you highlighted in the question), then you can fix those duplicates by ending the query with:
group by M.NAME, DA.SERVICE_TAG
(This assumes that you are using MySQL -- based on the syntax of your query. Most other databases would require aggregation functions around the rest of the columns in select.)
Try putting a DISTINCT infront of M.Name
OR playing with the joins like
SELECT
M.NAME AS MACHINE_NAME, M.CS_MODEL AS MODEL, DA.SERVICE_TAG,
DA.SHIP_DATE,M.USER_LOGGED AS LAST_LOGGED_IN_USER, DW.SERVICE_LEVEL_CODE,
DW.SERVICE_LEVEL_DESCRIPTION, DW.END_DATE AS EXPIRATION_DATE
FROM
DELL_WARRANTY DW
INNER JOIN
DELL_ASSET DA ON (DW.SERVICE_TAG = DA.SERVICE_TAG)
INNER JOIN
MACHINE M
ON (M.BIOS_SERIAL_NUMBER = DA.PARENT_SERVICE_TAG OR M.BIOS_SERIAL_NUMBER = DA.SERVICE_TAG)
INNER JOIN
DELL_WARRANTY DW2 ON DW2.SERVICE_TAG=DW.SERVICE_TAG and DW2.END_DATE > NOW()
WHERE
M.CS_MANUFACTURER LIKE '%dell%'
AND
M.BIOS_SERIAL_NUMBER!=''
AND
DA.DISABLED != 1
AND
DW.END_DATE < NOW()
AND
DW2.SERVICE_TAG IS NULL;

Timeout running SQL query

I'm trying to using the aggregation features of the django ORM to run a query on a MSSQL 2008R2 database, but I keep getting a timeout error. The query (generated by django) which fails is below. I've tried running it directs the SQL management studio and it works, but takes 3.5 min
It does look it's aggregating over a bunch of fields which it doesn't need to, but I wouldn't have though that should really cause it to take that long. The database isn't that big either, auth_user has 9 records, ticket_ticket has 1210, and ticket_watchers has 1876. Is there something I'm missing?
SELECT
[auth_user].[id],
[auth_user].[password],
[auth_user].[last_login],
[auth_user].[is_superuser],
[auth_user].[username],
[auth_user].[first_name],
[auth_user].[last_name],
[auth_user].[email],
[auth_user].[is_staff],
[auth_user].[is_active],
[auth_user].[date_joined],
COUNT([tickets_ticket].[id]) AS [tickets_captured__count],
COUNT(T3.[id]) AS [assigned_tickets__count],
COUNT([tickets_ticket_watchers].[ticket_id]) AS [tickets_watched__count]
FROM
[auth_user]
LEFT OUTER JOIN [tickets_ticket] ON ([auth_user].[id] = [tickets_ticket].[capturer_id])
LEFT OUTER JOIN [tickets_ticket] T3 ON ([auth_user].[id] = T3.[responsible_id])
LEFT OUTER JOIN [tickets_ticket_watchers] ON ([auth_user].[id] = [tickets_ticket_watchers].[user_id])
GROUP BY
[auth_user].[id],
[auth_user].[password],
[auth_user].[last_login],
[auth_user].[is_superuser],
[auth_user].[username],
[auth_user].[first_name],
[auth_user].[last_name],
[auth_user].[email],
[auth_user].[is_staff],
[auth_user].[is_active],
[auth_user].[date_joined]
HAVING
(COUNT([tickets_ticket].[id]) > 0 OR COUNT(T3.[id]) > 0 )
EDIT:
Here are the relevant indexes (excluding those not used in the query):
auth_user.id (PK)
auth_user.username (Unique)
tickets_ticket.id (PK)
tickets_ticket.capturer_id
tickets_ticket.responsible_id
tickets_ticket_watchers.id (PK)
tickets_ticket_watchers.user_id
tickets_ticket_watchers.ticket_id
EDIT 2:
After a bit of experimentation, I've found that the following query is the smallest that results in the slow execution:
SELECT
COUNT([tickets_ticket].[id]) AS [tickets_captured__count],
COUNT(T3.[id]) AS [assigned_tickets__count],
COUNT([tickets_ticket_watchers].[ticket_id]) AS [tickets_watched__count]
FROM
[auth_user]
LEFT OUTER JOIN [tickets_ticket] ON ([auth_user].[id] = [tickets_ticket].[capturer_id])
LEFT OUTER JOIN [tickets_ticket] T3 ON ([auth_user].[id] = T3.[responsible_id])
LEFT OUTER JOIN [tickets_ticket_watchers] ON ([auth_user].[id] = [tickets_ticket_watchers].[user_id])
GROUP BY
[auth_user].[id]
The weird thing is that if I comment out any two lines in the above, it runs in less that 1s, but it doesn't seem to matter which lines I remove (although obviously I can't remove a join without also removing the relevant SELECT line).
EDIT 3:
The python code which generated this is:
User.objects.annotate(
Count('tickets_captured'),
Count('assigned_tickets'),
Count('tickets_watched')
)
A look at the execution plan shows that SQL Server is first doing a cross-join on all the table, resulting in about 280 million rows, and 6Gb of data. I assume that this is where the problem lies, but why is it happening?
SQL Server is doing exactly what it was asked to do. Unfortunately, Django is not generating the right query for what you want. It looks like you need to count distinct, instead of just count: Django annotate() multiple times causes wrong answers
As for why the query works that way: The query says to join the four tables together. So say an author has 2 captured tickets, 3 assigned tickets, and 4 watched tickets, the join will return 2*3*4 tickets, one for each combination of tickets. The distinct part will remove all the duplicates.
what about this?
SELECT auth_user.*,
C1.tickets_captured__count
C2.assigned_tickets__count
C3.tickets_watched__count
FROM
auth_user
LEFT JOIN
( SELECT capturer_id, COUNT(*) AS tickets_captured__count
FROM tickets_ticket GROUP BY capturer_id ) AS C1 ON auth_user.id = C1.capturer_id
LEFT JOIN
( SELECT responsible_id, COUNT(*) AS assigned_tickets__count
FROM tickets_ticket GROUP BY responsible_id ) AS C2 ON auth_user.id = C2.responsible_id
LEFT JOIN
( SELECT user_id, COUNT(*) AS tickets_watched__count
FROM tickets_ticket_watchers GROUP BY user_id ) AS C3 ON auth_user.id = C3.user_id
WHERE C1.tickets_captured__count > 0 OR C2.assigned_tickets__count > 0
--WHERE C1.tickets_captured__count is not null OR C2.assigned_tickets__count is not null -- also works (I think with beter performance)