Could this SQL query be made more efficient? - sql

I have a very large table of nodes (cardinality of about 600,000), each record in this table can have one or more types associated with it. There is a node_types table that contains these (30 or so) type definitions.
To connect the two, I have a third table called node_type_relations that simply links node ids to type ids.
I am trying to clean up orphaned node_type_relation entries after a cull of the node table. My query to delete any type relations for which the node no longer exists is:
DELETE FROM node_type_relations WHERE node_id NOT IN (SELECT id FROM nodes)
But judging by the speed at which this is running (one record deleted every 10 seconds or so), it looks like Postgres is loading the entire nodes table once for every record in the node_type_relations table (which is about 1.4 million records).
I was about to dive in and write some code to do it more sensibly when I thought I'd ask here if the query could be turned inside-out somehow. Anything to avoid loading the nodes table more than once.
Thanks as always.
Edit with solution
Executing the query:
DELETE FROM node_type_relations WHERE NOT EXISTS (SELECT 1 FROM nodes WHERE nodes.id=node_type_relations.node_id)
appears to have had the desired effect and deleted all orphaned records (some 170,000) in a matter of seconds.
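For anyone curious, the difference between the two forms shows up if you compare their plans with EXPLAIN (which only plans the statement, it does not execute the delete):
EXPLAIN DELETE FROM node_type_relations WHERE node_id NOT IN (SELECT id FROM nodes);
EXPLAIN DELETE FROM node_type_relations WHERE NOT EXISTS (SELECT 1 FROM nodes WHERE nodes.id=node_type_relations.node_id);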

Maybe do a left join, and then delete where null.
So:
DELETE ntr
FROM node_type_relations ntr
LEFT JOIN nodes n
ON n.id = ntr.node_id
WHERE n.id IS NULL

@lynks found the optimal query for his case himself, with a NOT EXISTS anti-join:
DELETE FROM node_type_relations ntr
WHERE NOT EXISTS (
SELECT 1
FROM nodes n
WHERE n.id = ntr.node_id
);
A solution with JOIN syntax would have to be constructed like this in PostgreSQL:
DELETE FROM node_type_relations d
USING node_type_relations ntr
LEFT JOIN nodes n ON n.id = ntr.node_id
WHERE ntr.node_id = d.node_id
AND n.id IS NULL;

Related

Query optimisation

I have two tables. One contains all the trips that the buses make:
dbo.Courses_Bus
|ID|ID_Bus|ID_Line|DateHour_Start_Course|DateHour_End_Course|
The other contains all payments made on these buses:
dbo.Payments
|ID|ID_Bus|DateHour_Payment|
The goal is to add the notion of a line to the payments table, to get something like this:
dbo.Payments
|ID|ID_Bus|DateHour_Payment|Line|
So I tried to do this:
/** I first added a Line column to the dbo.Payments table**/
UPDATE Table_A
SET Table_A.Line = Table_B.ID_Line
FROM [dbo].[Payments] AS Table_A
INNER JOIN [dbo].[Courses_Bus] AS Table_B
    ON Table_A.ID_Bus = Table_B.ID_Bus
    AND Table_A.DateHour_Payment BETWEEN Table_B.DateHour_Start_Course AND Table_B.DateHour_End_Course
And this:
UPDATE Table_A
SET Table_A.Line = Table_B.ID_Line
FROM [dbo].[Payments] AS Table_A
INNER JOIN (
    SELECT
        P.*,
        CP.ID_Line AS ID_Line
    FROM [dbo].[Payments] AS P
    INNER JOIN [dbo].[Courses_Bus] CP ON CP.ID_Bus = P.ID_Bus
        AND CP.DateHour_Start_Course <= P.DateHour_Payment
        AND CP.DateHour_End_Course >= P.DateHour_Payment
) AS Table_B ON Table_A.ID_Bus = Table_B.ID_Bus
The main problem, apart from the fact that these queries do not seem to work properly, is that each table has several million rows and grows every day. Because of the date-hour filter (mandatory, since a single bus can run on several lines each day), SQL Server has to compare each row of one table against all rows of the other.
So it takes an extremely long time, and it will only get worse every day.
How can I make it work and optimise it?
Assuming that this is the logic you want:
UPDATE p
SET p.Line = cb.ID_Line
FROM [dbo].[Payments] p JOIN
[dbo].[Courses_Bus] cb
ON p.ID_Bus = cb.ID_Bus AND
p.DateHour_Payment BETWEEN cb.DateHour_Start_Course AND cb.DateHour_End_Course;
To optimize this query, you want an index on Courses_Bus(ID_Bus, DateHour_Start_Course, DateHour_End_Course).
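That could look something like this (the index name here is just illustrative):
CREATE INDEX ix_Courses_Bus_Bus_DateHours
    ON [dbo].[Courses_Bus] (ID_Bus, DateHour_Start_Course, DateHour_End_Course);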
There might be slightly more efficient ways to optimize the query, but your question doesn't have enough information -- is there always exactly one match, for instance?
Another big issue is that updating all the rows is quite expensive. You might find that it is better to do this in loops, one chunk at a time:
UPDATE TOP (10000) p
SET p.Line = cb.ID_Line
FROM [dbo].[Payments] p JOIN
[dbo].[Courses_Bus] cb
ON p.ID_Bus = cb.ID_Bus AND
p.DateHour_Payment BETWEEN cb.DateHour_Start_Course AND cb.DateHour_End_Course
WHERE p.Line IS NULL;
Once again, though, this structure depends on all the initial values being NULL and an exact match for all rows.
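A minimal sketch of such a loop (assuming Line starts out NULL, so each pass shrinks the remaining work):
DECLARE @rows INT = 1;
WHILE @rows > 0
BEGIN
    UPDATE TOP (10000) p
    SET p.Line = cb.ID_Line
    FROM [dbo].[Payments] p JOIN
         [dbo].[Courses_Bus] cb
         ON p.ID_Bus = cb.ID_Bus AND
            p.DateHour_Payment BETWEEN cb.DateHour_Start_Course AND cb.DateHour_End_Course
    WHERE p.Line IS NULL;
    SET @rows = @@ROWCOUNT;
END;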
Thank you Gordon for your answer.
I investigated and came up with this query:
MERGE [dbo].[Payments] AS p
USING [dbo].[Courses_Bus] AS cb
    ON p.ID_Bus = cb.ID_Bus AND
       p.DateHour_Payment >= cb.DateHour_Start_Course AND
       p.DateHour_Payment <= cb.DateHour_End_Course
WHEN MATCHED THEN
    UPDATE SET p.Line = cb.ID_Line;
It seems to be the most suitable approach in an MS SQL environment.
It came back with this error:
The MERGE statement attempted to UPDATE or DELETE the same row more than once. This happens when a target row matches more than one source row. A MERGE statement cannot UPDATE/DELETE the same row of the target table multiple times. Refine the ON clause to ensure a target row matches at most one source row, or use the GROUP BY clause to group the source rows.
I understood this to mean that it finds several rows with identical
[p.ID_Bus = cb.ID_Bus AND
p.DateHour_Payment >= cb.DateHour_Start_Course AND
p.DateHour_Payment <= cb.DateHour_End_Course]
Yes, this is a possible case, but the payment ID is different each time.
For example, two cards may be tapped at the same moment, or a loss of network may mean the equipment is updated later, recording the payments with the same timestamp. These are different rows that must be treated separately, and you can end up with, for example:
|ID|ID_Bus|DateHour_Payment|Line|
|56|204|2021-01-01 10:00:00|15|
|82|204|2021-01-01 10:00:00|15|
How can I improve this query so that it takes into account different payment IDs?
I can't figure out how to do this with the help I find online. Maybe this method is not the right one in this context.
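For reference, a query along these lines (a sketch reusing the same join condition) would show which payments match more than one course and therefore trigger the MERGE error:
SELECT p.ID, COUNT(*) AS matching_courses
FROM [dbo].[Payments] p
JOIN [dbo].[Courses_Bus] cb
    ON p.ID_Bus = cb.ID_Bus AND
       p.DateHour_Payment BETWEEN cb.DateHour_Start_Course AND cb.DateHour_End_Course
GROUP BY p.ID
HAVING COUNT(*) > 1;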

Clean up parent table rows that are no longer referenced by any child

The obvious queries are
delete from in_pipe
where id in (
select id
from in_pipe
where id not in
(select distinct inpipeid from out_pipe)
fetch first 1000 rows only
)
or
delete from in_pipe
where id in (
select i.id
from in_pipe i
left join out_pipe o on o.inpipeid = i.id
where o.id is null
fetch first 1000 rows only
)
There is a primary key index on in_pipe.id, and out_pipe.inpipeid has an index: CREATE INDEX ix_outpipe_inpipeid ON out_pipe(inpipeid)
Both of these queries would do the job and the execution plans look fine.
BUT I'm worried about the performance of these queries once we get to production and the tables have millions of rows (tens of millions). The performance of the clean-up is not critical, but I'm afraid these queries would never finish.
The clean-up should not affect the performance of deletes/inserts on out_pipe or in_pipe, thus I would not use a trigger for this. I'd rather have the clean-up done in the background during idle hours. It can (and should) be done little by little.
So I guess I'm looking for clever ideas...
Edit: I'm thinking of walking the in_pipe ids in batches, starting from the lowest and moving up, checking each batch for existence in out_pipe until I reach the end, and then starting from the beginning again.
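Something along these lines, perhaps (a sketch of that idea; :batch_start and :batch_end are placeholders advanced by the calling job):
delete from in_pipe i
where i.id between :batch_start and :batch_end
and not exists (
    select 1 from out_pipe o where o.inpipeid = i.id
);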
How about a two-and-a-half steps?
First step: table of IDs which aren't used:
create table not_used as
select id from in_pipe
minus
select inpipeid from out_pipe;
A half of a step: index:
create index i1nu on not_used (id);
The second step: delete IDs which aren't used:
delete from in_pipe a
where exists (select null
from not_used n
where n.id = a.id
);
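Once the delete is committed, the helper table can presumably be dropped again:
drop table not_used;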
I would recommend not exists:
delete from in_pipe i
where not exists (
select 1 from out_pipe o where o.inpipeid = i.id
)
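If the clean-up still has to run in small batches during idle hours, the same anti-join can be wrapped in the fetch first pattern from the question (a sketch):
delete from in_pipe
where id in (
    select i.id
    from in_pipe i
    where not exists (
        select 1 from out_pipe o where o.inpipeid = i.id
    )
    fetch first 1000 rows only
)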

SQL get leaves of directed acyclic graph

I am quite new to SQL and have a rather basic question. Suppose I'm dealing with the following table structure:
CREATE TABLE nodes (
id INTEGER NOT NULL PRIMARY KEY,
parent INTEGER REFERENCES nodes(id)
);
If we hold an invariant that says a node's parent cannot be one of its own (direct or indirect) children, then by definition we will not have any cycles in our graph. We are left with a (possibly disjoint) directed acyclic graph.
The two questions I have then are:
If we cannot change the structure of the database: What select statement would I have to write to efficiently get all of the leaves in my database? I.e. the ids that don't have any children.
If we can change the structure of the tables: What could we change or add to make this select statement more efficient?
For example, a graph with five nodes whose parent links are 3->2, 2->1, and 5->4 would output 3 and 5, because they are the only nodes that don't have children.
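To make the example concrete, that graph could be loaded like this (nodes 1 and 4 are roots with a NULL parent):
INSERT INTO nodes (id, parent) VALUES
    (1, NULL),
    (2, 1),
    (3, 2),
    (4, NULL),
    (5, 4);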
You can use NOT EXISTS with a correlated subquery that checks for nodes where the current node is the parent. For leaves, no such record can exist.
SELECT *
FROM nodes n1
WHERE NOT EXISTS (SELECT *
FROM nodes n2
WHERE n2.parent = n1.id);
Another option is a left join to the possible children of each node. If the id on the "children" side of the join is null, no child exists for the current node, so it's a leaf.
SELECT *
FROM nodes n1
LEFT JOIN nodes n2
ON n2.parent = n1.id
WHERE n2.id IS NULL;
Leaving denormalization aside, I don't think there's much to change in the table's structure. Indexes could help, though. One should be on id (already the case because of the primary key constraint) and one on parent (again, such an index already exists because MySQL creates indexes for foreign key columns).
For more complex graph queries, you may use Common Table Expressions (CTEs), standardized in SQL:99 and supported in MySQL since 8.0.1 (reference)
But as others pointed out, for the query you're interested in, a simple NOT EXISTS subquery or equivalent is enough. Yet another equivalent to those already posted would be using the EXCEPT set operation:
SELECT id FROM nodes
EXCEPT SELECT parent FROM nodes
I would do:
select *
from nodes
where id not in (select parent from nodes where parent is not null)

Hive: Can't select one random match on right table in left outer join

EDIT - I don't care about the skew or things being slow. I found out that the slowness was caused more by a many-to-many join producing many matches in my left outer join... Please skip down to the bottom.
I have an issue with a skewed table, that is, some keys have many more rows to join than others. My problem is that I have more than one key with many appearances in the rows.
Stats on this table and table I am joining with:
Larger table: totalSize=47431500000, numRows=509500000, rawDataSize=47022050000 21052 distinct keys
Smaller table: totalSize=1154984612, numRows=13780692, rawDataSize=1141203920 AND 39313 distinct keys
The smaller table also has repeated rows of keys. The other challenge is that I need to randomly select a matching key from the smaller table.
What I have tried so far:
set hive.auto.convert.join=true;
set hive.mapjoin.smalltable.filesize=1155mb;
and
CREATE TABLE joined_table AS SELECT * FROM (
select * from larger_table as a
LEFT OUTER JOIN smaller_table as b
on a.key = b.key
ORDER BY RAND()
)b;
It has been running for a day now.
I thought about handling it manually, but I have more than one key with a ton of rows, so I would have to make a bunch of tables and merge them. Which I can do if that is my only option :O
But I wanted to reach out to you all on SO.
Thanks for the help in advance
EDIT June 20th
I found these settings to try:
set hive.optimize.skewjoin = true;
set hive.skewjoin.key = 200000;
I had already created a few separate tables to separate and join up the highest appearing keys, such that now the highest appearing key in the rest was 200k. Running the query to join the rest now took 25 minutes, finished all tasks successfully, according to the job tracker on the web interface. On the command line in the Hive shell, it is still sitting there, and when I go to check, the table does not exist.
EDIT #2: After a lot of reading and trying out a lot of Hive SQL code... the one solution that should have worked in theory did not work; specifically, the order by rand() never even happened...
CREATE TABLE joined_table AS SELECT * FROM (
    select * from larger_table as a
    JOIN
    (SELECT *, row_number() over (partition by key order by rand()) as row_num
     from smaller_table) as b
    on a.key = b.key
    and b.row_num = 1
) b;
In the results it is matched with the first row, not a random row at all...
Any other options or anything I did incorrectly here?

Delete all records that have no foreign key constraints

I have a SQL 2005 table with millions of rows in it that is being hit by users all day and night. This table is referenced by 20 or so other tables with foreign key constraints. What I need to do on a regular basis is delete all records from this table where the "Active" field is set to false AND there are no records in any of the child tables that reference the parent record. What is the most efficient way of doing this, short of trying to delete each one at a time and letting it cause SQL errors on the ones that violate constraints? Also, disabling the constraints is not an option, and I cannot lock the parent table for any significant amount of time.
If it's not likely that inactive rows which are not linked will become linked, you can run (or even dynamically build, based on the foreign key metadata):
SELECT k.*
FROM k WITH(NOLOCK)
WHERE k.Active = 0
AND NOT EXISTS (SELECT * FROM f_1 WITH(NOLOCK) WHERE f_1.fk = k.pk)
AND NOT EXISTS (SELECT * FROM f_2 WITH(NOLOCK) WHERE f_2.fk = k.pk)
...
AND NOT EXISTS (SELECT * FROM f_n WITH(NOLOCK) WHERE f_n.fk = k.pk)
And you can turn it into a DELETE pretty easily. But a large delete could hold a lot of locks, so you might want to put this in a table and then delete in batches - a batch shouldn't fail unless a record got linked.
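Turned into a batched DELETE, that might look roughly like this (same placeholder names as above; TOP keeps each batch small):
DELETE TOP (1000) FROM k
WHERE k.Active = 0
AND NOT EXISTS (SELECT * FROM f_1 WHERE f_1.fk = k.pk)
-- ... one NOT EXISTS per referencing table ...
AND NOT EXISTS (SELECT * FROM f_n WHERE f_n.fk = k.pk)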
For this to be efficient, you really need to have indexes on the FK columns in the related tables.
You can also do this with left joins, but then you (sometimes) have to de-dupe with a DISTINCT or GROUP BY, the execution plan usually isn't any better, and it's not as conducive to code generation:
SELECT k.*
FROM k WITH(NOLOCK)
LEFT JOIN f_1 WITH(NOLOCK) ON f_1.fk = k.pk
LEFT JOIN f_2 WITH(NOLOCK) ON f_2.fk = k.pk
...
LEFT JOIN f_n WITH(NOLOCK) ON f_n.fk = k.pk
WHERE k.Active = 0
AND f_1.fk IS NULL
AND f_2.fk IS NULL
...
AND f_n.fk IS NULL
Suppose we have a parent table named Parent with an "id" field of any type and an "Active" field of type bit. We also have a Child table with its own "id" field and an "fk" field that references the "id" field of the Parent table. Then you can use the following statement:
DELETE Parent
FROM Parent AS p LEFT OUTER JOIN Child AS c ON p.id=c.fk
WHERE c.id IS NULL AND p.Active=0
I'm slightly confused about your question, but you can do a LEFT OUTER JOIN from your main table to each table that supposedly holds a foreign key to it. You can then use a WHERE clause to check for NULL values in the joined table.
Check here for outer joins : http://en.wikipedia.org/wiki/Join_%28SQL%29#Left_outer_join
You could also write triggers to do all of this for you when a record is deleted or set to false, etc.