Slow self-join delete query - sql

Does it get any simpler than this query?
delete a.* from matches a
inner join matches b ON (a.uid = b.matcheduid)
Yes, apparently it does... because the performance on the above query is really bad when the matches table is very large.
matches is about 220 million records. I am hoping that this DELETE query takes the size down to about 15,000 records. How can I improve the performance of the query? I have indexes on both columns. UID and MatchedUID are the only two columns in this InnoDB table, both are of type INT(10) unsigned. The query has been running for over 14 hours on my laptop (i7 processor).

Deleting so many records can take a while; I think this is as fast as it can get if you're doing it this way. If you don't want to invest in faster hardware, I suggest another approach:
If you really want to delete 220 million records so that the table then has only 15,000 records left, that's about 99.99% of all entries. Why not
Create a new table,
just insert all the records you want to survive,
and replace your old one with the new one?
Something like this might work a little bit faster:
/* creating the new table */
CREATE TABLE matches_new
SELECT a.* FROM matches a
LEFT JOIN matches b ON (a.uid = b.matcheduid)
WHERE b.matcheduid IS NULL;
/* renaming tables */
RENAME TABLE matches TO matches_old;
RENAME TABLE matches_new TO matches;
After this you just have to check and create your desired indexes, which should be rather fast when dealing with only 15,000 records.
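Recreating the two single-column indexes mentioned in the question might look roughly like this (the index names are just placeholders):
ALTER TABLE matches
    ADD INDEX idx_uid (uid),
    ADD INDEX idx_matcheduid (matcheduid);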

Running explain select a.* from matches a inner join matches b ON (a.uid = b.matcheduid) would show whether your indexes are present and being used.

I might be setting myself up to be roasted here, but when performing a delete operation like this in the midst of a self-join, isn't the query having to recompute the join index after each deletion?
While it is clunky and brute force, you might consider either:
A. Create a temp table to store the uids resulting from the inner join, then join to THAT, THEN perform the delete.
OR
B. Add a boolean (bit) typed column, use the join to flag each match (this operation should be FAST), and THEN use:
DELETE FROM matches WHERE YourBitFlagColumn = TRUE
Then delete the boolean column. A sketch of option B follows.
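In MySQL, option B might look roughly like this; the flag column name is an assumption, and TINYINT(1) stands in for the boolean type:
ALTER TABLE matches ADD COLUMN to_delete TINYINT(1) NOT NULL DEFAULT 0;

UPDATE matches a
INNER JOIN matches b ON (a.uid = b.matcheduid)
SET a.to_delete = 1;

DELETE FROM matches WHERE to_delete = 1;

ALTER TABLE matches DROP COLUMN to_delete;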

You probably need to batch your delete. You can do this with a recursive delete using a common table expression or just iterate it on some batch size.
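One way to iterate on a batch size in MySQL might be the following sketch; the 10,000 batch size is arbitrary, and the extra derived table works around MySQL's restriction on selecting from the table being deleted from. Run it repeatedly until it affects zero rows:
DELETE FROM matches
WHERE uid IN (
    SELECT uid FROM (
        SELECT a.uid
        FROM matches a
        INNER JOIN matches b ON (a.uid = b.matcheduid)
        LIMIT 10000
    ) AS batch
);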

Creating a view from JOIN two massive tables

Context
I have a big table, say table_A, with roughly 20 billion rows and 600 columns. I don't own this table but I can read from it.
For a fraction of these columns I produce a few extra columns (50), which I store in a separate table, say table_B, which is therefore roughly 20 billion rows by 50 columns.
Now I need to expose the join of table_A and table_B to users, which I tried as
CREATE VIEW table_AB
AS
SELECT *
FROM table_A AS ta
LEFT JOIN table_B AS tb ON (ta.tec_key = tb.tec_key)
The problem is that even a simple query like SELECT * FROM table_AB LIMIT 2 will fail because of memory issues: apparently Impala attempts to do the full join in memory first, which would result in a table of 0.5 petabytes. Hence the failure.
Question
What is the best way to create such a view?
How can one instruct SQL that filtering operations on table_AB are to be executed before the join?
Creating a new table is also suboptimal because it would mean duplicating the data in table_AB, using up hundreds of Terabytes.
I have also tried [...] SELECT STRAIGHT_JOIN * [...], but it did not help.
What is the best way to create such a view?
Since both tables are huge, there will be memory problems. Here are some points I would recommend (a sketch of the resulting view follows this list):
Assuming tables a and b share the same tec_key, do an inner join.
Keep the (smaller) table b as the driver: create vw as select ... from b join a on .... Impala keeps the driver table in memory, so this will require less memory.
Select only the columns required; do not select all of them.
Put the filters inside the view.
Partition table b if you can, on some date/year/region/anything that can evenly distribute the data.
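Putting those points together, the view might look something like this; the specific column names and the filter are assumptions, with only the columns users actually need selected:
CREATE VIEW table_AB AS
SELECT tb.tec_key,
       tb.derived_col_1,        -- a few of the derived columns from table_B
       ta.col_1,
       ta.col_2                 -- only the required columns from table_A
FROM table_B tb                 -- smaller table as the driver
INNER JOIN table_A ta ON (ta.tec_key = tb.tec_key)
WHERE ta.col_1 IS NOT NULL;     -- hypothetical filter kept inside the view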
How can one instruct SQL that e.g. filtering operations are to be performed on table_AB are to be executed before the join?
You cannot ensure whether the filter goes before or after the join. The only way to ensure a filter will improve performance is to have a partition on the filter column. Otherwise, you can try to filter first and then join, to see if it improves performance, like this:
select ... from b
join (select ... from a where region='Asia') a on ... -- won't improve much
Creating a new table is also suboptimal because it would mean duplicating the data in table_AB, using up hundreds of Terabytes.
Completely agree on this. Multiple smaller tables are way better than one giant table with 600 columns. So, create a few staging tables with only the required fields and then enrich that data. It is a difficult data set, but no one will change 20 billion rows every day, so some sort of incremental load is also possible to implement.

Does it make sense to index a table with just one column?

I wonder if it makes sense to index a table which contains just a single column. The table will be populated with hundreds or thousands of records and will be used in a JOIN to another (larger) table in order to filter its records.
Thank you!
Yes and no. An explicit index probably does not make sense. However, defining the single column as a primary key is often done (assuming it is never NULL and unique).
This is actually a common practice. It is not uncommon for me to create exclusion tables, with logic such as:
from . . .
where not exists (select 1 from exclusion_table et where et.id = ?.id)
A primary key index can speed up such a query.
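For instance, a minimal exclusion table along those lines might be declared like this (the table and column names mirror the snippet above and are purely illustrative):
CREATE TABLE exclusion_table (
    id INT NOT NULL PRIMARY KEY   -- the primary key provides the index used by the NOT EXISTS lookup
);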
In your case, it might not make a difference if the larger table has an index on the id used for the join. However, you can give the optimizer the option of choosing which index to use.
My vote is that it probably doesn't really make sense in your scenario. You're saying this table with a single column will be joined to another table to filter records in the other table, so why not just delete this table, index the other column in the other table, and filter that?
Essentially, why are you writing:
SELECT * FROM manycols m INNER JOIN singlecol s ON m.id = s.id WHERE s.id = 123
When that is this:
SELECT * FROM manycols m WHERE m.id = 123
Suppose the argument is that manycols has a million rows, and singlecol has a thousand. You want the thousand matching rows, it's manycols that would need to be indexed then for the benefit.
Suppose the argument is you want all rows except those in singlecol; you could index singlecol, but the optimiser might choose to just load the entire table into a hash anyway, so again, indexing it wouldn't necessarily help.
It feels like there's probably another way to do what you require that ditches this single-column table entirely.

SQL Query with multiple possible joins (or condition in join)

I have a problem where I have to try to find people who have old accounts with an outstanding balance, but who have created a new account. I need to match them by comparing SSNs. The problem is that we have primary and additional contacts, so two potential SSNs per account. I need to match them even if they were the primary at first but are now the secondary, etc.
Here was my first attempt, I'm just counting now to get the joins and conditions down. I'll select actual data later. Basically the personal table is joined once to active accounts, and another copy to delinquent accounts. The two references to the personal table are then compared based on the 4 possible ways SSNs could be related.
select count(*)
from personal pa
join consumer c
on c.cust_nbr = pa.cust_nbr
and c.per_acct = pa.acct
join personal pu
on pu.ssn = pa.ssn
or pu.ssn = pa.addl_ssn
or pu.addl_ssn = pa.ssn
or pu.addl_ssn = pa.addl_ssn
join uncol_acct u
on u.cust_nbr = pu.cust_nbr
and u.per_acct = pu.acct
where u.curr_bal > 0
This works, but it takes 20 minutes to run. I found this question Is having an 'OR' in an INNER JOIN condition a bad idea? so I tried re-writing it as 4 queries (one per ssn combination) and unioning them. This took 30 minutes to run.
Is there a better way to do this, or is it just a really inefficient process no matter how you do it?
Update: After playing with some options here, and some other experimenting I think I found the problem. Our software vendor encrypts the SSNs in the database and provides a view that decrypts them. Since I have to work from that view it takes a really long time to decrypt and then compare.
If you run separate joins and then union them, you might have problems. What if the same record pair fulfills at least two conditions? You will then have duplicates in your result.
I believe your first approach is feasible, but do not forget that you are joining four tables. If the number of rows is A, B, C, D in the respective tables, then the RDBMS will have to check a maximum of A * B * C * D records. If you have many records in your database, then this will take a lot of time.
Of course, you can optimize your query by adding indexes to some columns and that would be a good idea if they are not indexed already. But do not forget that if you add an index to a column, then the RDBMS will be quicker to read from there, but slower to write there. If your operations are mostly reads (select), then you should index your columns, but not blindly, study indexing a bit before you start doing it.
Also, if you are joining four tables, personal, consumer, personal (again) and uncol_acct, then you might do something like this:
Write a query that contains two subqueries, named t1 and t2. The first subquery (t1) joins personal and consumer. The second subquery (t2) joins the second occurrence of personal with uncol_acct, with the where clause inside it. Your main query then joins t1 and t2. This way you optimise, because the main query only has to consider the pairing of valid t1 and t2 rows.
Also, if your where clause is outside as in your example query, then the four-table join will be executed and only after that will the where be taken into consideration. This is why the where clause should be inside the second subquery, so that it runs before the main join. You can also add a subquery inside the second subquery to evaluate the where early if the condition is rarely fulfilled.
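Sketched out with the column names from the question (a rough outline rather than a tested query), that rewrite might look like:
select count(*)
from (
    select pa.ssn, pa.addl_ssn
    from personal pa
    join consumer c
    on c.cust_nbr = pa.cust_nbr
    and c.per_acct = pa.acct
) t1
join (
    select pu.ssn, pu.addl_ssn
    from personal pu
    join uncol_acct u
    on u.cust_nbr = pu.cust_nbr
    and u.per_acct = pu.acct
    where u.curr_bal > 0          -- filter applied before the main join
) t2
on t2.ssn = t1.ssn
or t2.ssn = t1.addl_ssn
or t2.addl_ssn = t1.ssn
or t2.addl_ssn = t1.addl_ssn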
Cheers!

Several layers of nested subqueries with Exists/In, best performance?

I'm working on some rather large queries for a search function. There are a number of different inputs and the queries are pretty big as a result. It's grown to where there are nested subqueries 2 layers deep. Performance has become an issue on the ones that will return a large dataset and likely have to sift through a massive load of records to do so. The ones that have less comparing to do perform fine, but some of these are getting pretty bad. The database is DB2 and has all of the necessary indexes, so that shouldn't be an issue. I'm wondering how to best write/rewrite these queries to perform as I'm not quite sure how the optimizer is going to handle it. I obviously can't dump the whole thing here, but here's an example:
Select A, B
from TableA
--A series of joins--
WHERE TableA.A IN (
    Select C
    from TableB
    --A few joins--
    WHERE TableB.C IN (
        Select D from TableC
        --More joins and conditionals--
    )
)
There are also plenty of conditionals sprinkled throughout, the vast majority of which are simple equality. You get the idea. The subqueries do not provide any data to the initial query. They exist only to filter the results. A problem I ran into early on is that the backend is written to contain a number of partial query strings that get assembled into the final query (with 100+ possible combinations due to the search options, it simply isn't feasible to write a query for each), which has complicated the overall method a bit. I'm wondering if EXISTS instead of IN might help at one or both levels, or another bunch of joins instead of subqueries, or perhaps using WITH above the initial query for TableC, etc. I'm definitely looking to remove bottlenecks and would appreciate any feedback that folks might have on how to handle this.
I should probably also add that there are potential unions within both subqueries.
It would probably help to use inner joins instead.
Select A, B
from TableA
inner join TableB on TableA.A = TableB.C
inner join TableC on TableB.C = TableC.D
Databases were designed for joins, but the optimizer might not figure out that it can use an index for a sub-query. Instead it will probably try to run the sub-query, hold the results in memory, and then do a linear search to evaluate the IN operator for every record.
Now, you say that you have all of the necessary indexes. Consider this for a moment.
If one optional condition is TableC.E = 'E' and another optional condition is TableC.F = 'F',
then a query with both would need an index on fields TableC.E AND TableC.F. Many young programmers today think they can have one index on TableC.E and one index on TableC.F, and that's all they need. In fact, if you have both fields in the query, you need an index on both fields.
So, for 100+ combinations, "all of the necessary indexes" could require 100+ indexes.
Now an index on TableC.E, TableC.F could be used in a query with a TableC.E condition and no TableC.F condition, but could not be used when there is a TableC.F condition and no TableC.E condition.
Hundreds of indexes? What am I going to do?
In practice it's not that bad. Let's say you have N optional conditions which are either in the where clause or not. The number of combinations is 2 to the Nth, so for hundreds of combinations N is log2 of the number of combinations, which is between 6 and 10. Also, those conditions are spread across three tables. Some databases support multiple-table indexes, but I'm not sure DB2 does, so I'd stick with single-table indexes.
So, what I am saying is, say for the TableC.E, and TableC.F example, it's not enough to have just the following indexes:
TableB ON C
TableC ON D
TableC ON E
TableC ON F
For one thing, the optimizer has to pick which one of the last three indexes to use. Better would be to include the D field in the last two indexes, which gives us:
TableB ON C
TableC ON D, E
TableC ON D, F
Here, if neither field E nor F is in the query, it can still index on D, but if either one is in the query, it can index on both D and one other field.
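In DDL terms (the index names here are just placeholders), that suggestion amounts to something like:
CREATE INDEX ix_tableb_c   ON TableB (C);
CREATE INDEX ix_tablec_d_e ON TableC (D, E);
CREATE INDEX ix_tablec_d_f ON TableC (D, F);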
Now suppose you have an index for 10 fields which may or may not be in the query. Why ever have just one field in the index? Why not add other fields in descending order of likelihood of being in the query?
Consider that when planning your indexes.
I have found that the IN predicate works well for small subqueries and EXISTS for large subqueries.
Try executing the query with an EXISTS predicate for the large ones:
SELECT A, B
FROM TableA
WHERE EXISTS (
    SELECT C
    FROM TableB
    WHERE TableB.C = TableA.A
)

INNER JOINs with where on the joined table

Let's say we have
SELECT * FROM A INNER JOIN B ON [....]
Assuming A has 2 rows and B contains 1M rows including 2 rows linked to A:
B will be scanned only once with "actual # of rows" of 2 right?
If I add a WHERE on table B:
SELECT * FROM A INNER JOIN B ON [....] WHERE B.Xyz > 10
The WHERE will actually be executed before the join... So if the where
returns 1000 rows, the "actual # of rows" of B will be 1000...
I don't get it... shouldn't it be <= 2???
What am I missing... why does the optimiser proceed that way?
(SQL 2008)
Thanks
The optimizer will proceed whichever way it thinks is faster. That means if the Xyz column is indexed but the join column is not, it will likely do the xyz filter first. Or if your statistics are bad so it doesn't know that the join filter would pare B down to just two rows, it would do the WHERE clause first.
It's based entirely on what indexes are available for the optimizer to use. Also, there is no reason to believe that the db engine will execute the WHERE before another part of the query. The query optimizer is free to execute the query in any order it likes as long as the correct results are returned. Again, the way to properly optimize this type of query is with strategically placed indexes.
The "scanned only once" is a bit misleading. A table scan is a horrendously expensive thing in SQL Server. At least up to SS2005, a table scan requires a read of all rows into a temporary table, then a read of the temporary table to find rows matching the join condition. So in the worst case, your query will read and write 1M rows, then try to match 2 rows to 1M rows, then delete the temporary table (that last bit is probably the cheapest part of the query). So if there are no usable indexes on B, you're just in a bad place.
In your second example, if B.Xyz is not indexed, the full table scan happens and there's a secondary match from 2 rows to 1000 rows - even less efficient. If B.Xyz is indexed, there should be an index lookup and a 2:1000 match - much faster & more efficient.
Of course, this assumes the table stats are relatively current and no options are in effect that change how the optimizer works.
EDIT: is it possible for you to "unroll" the A rows and use them as a static condition in a no-JOIN query on B? We've used this in a couple of places in our application where we're joining small tables (<100 rows) to large (> 100M rows) ones to great effect.
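For the example in the question, an unrolled version might look roughly like this; the join column name and the two literal values standing in for A's rows are assumptions:
SELECT *
FROM B
WHERE B.JoinCol IN (101, 202)   -- hypothetical join column and the two values taken from A
  AND B.Xyz > 10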