I have two tables having 50 million unique rows each.
Row number from one table corresponds to row number in the second table.
i.e. the 1st row in the 1st table joins with the 1st row in the second table, the 2nd row in the first table with the 2nd row in the second table, and so on. Doing an inner join is costly.
It takes more than 5 hours on clusters. Is there an efficient way to do this in SQL?
To start with: tables are just sets, so the row number of a record can be considered pure coincidence. You must not join two tables based on row numbers; join on IDs rather than on row numbers.
There is nothing more efficient than a simple inner join. As the whole tables must be read, you might not even gain anything from indexes (but as we are talking of IDs, there will be indexes anyhow, so there is nothing to ponder here).
Depending on the DBMS you may be able to parallelize the query. In Oracle for example you would use a hint such as /*+ parallel( tablename , parallel_factor ) */.
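For illustration, a minimal sketch of what that could look like here, assuming the two tables are called t1 and t2 and join on an id column (the parallel factor of 8 is arbitrary):
SELECT /*+ PARALLEL(t1, 8) PARALLEL(t2, 8) */
       t1.*, t2.*
  FROM t1
  JOIN t2
    ON t2.id = t1.id;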
Try to sort both tables by row order (if they aren't sorted already), then use a normal SELECT for both tables (you could use LIMIT to fetch the data part by part) and connect the data line by line wherever you want.
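If you go that route, a rough sketch of the part-by-part fetching (table and column names are illustrative; pairing the two result sets up row by row would happen in your client code):
-- fetch both tables in matching sorted chunks, then pair rows up in the client
SELECT * FROM table_one ORDER BY id LIMIT 1000000 OFFSET 0;
SELECT * FROM table_two ORDER BY id LIMIT 1000000 OFFSET 0;
-- repeat with OFFSET 1000000, 2000000, ... until all rows have been read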
I have to optimize the following query with the help of indexes.
SELECT f.*
FROM first f
JOIN second s on f.attributex_id = s.id
WHERE f.attributex_id IS NOT NULL AND f.attributey_id IS NULL
ORDER BY s.month ASC LIMIT 100;
Further info:
attributex_id is a foreign key pointing to second.id
attributey_id is a foreign key pointing to another table not used in the query
Changing the query is not an option
For most entries (98%) in first, f.attributex_id IS NOT NULL will be true. The same goes for the second condition, f.attributey_id IS NULL
I tried to add an index as follows.
CREATE INDEX index_for_first
ON first (attributex_id, attributey_id)
WHERE attributex_id IS NOT NULL AND (attributey_id IS NULL)
But the index is not used (checked via Explain Analyze) when executing the query. What kind of indexes would I need to optimize the query and what am I doing wrong with the above index?
Does an index on s.month make sense, too (month is unique)?
Based on the query text and the fact that nearly all records in first satisfy the where clause, what you're essentially trying to do is
identify the 100 second records with the lowest month value
output the contents of the related records in the first table.
To achieve that you can create indexes on
second.month
first.attributex_id
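Concretely, that could look like the following (index names are placeholders; the syntax is PostgreSQL-style, matching the partial index you already tried):
CREATE INDEX idx_second_month ON second (month);
CREATE INDEX idx_first_attributex_id ON first (attributex_id);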
Caveats
Since this query must be optimized, it's safe to say there are many rows in both tables. Since there are only 12 months in the year, the output of the query is probably not deterministic (i.e., it may return a different set of rows each time it's run, even if there is no activity in either table between runs) since many records likely share the same value for month. Adding "tie breaker" column(s) to the index on second may help, though your order by only includes month, so no guarantees. Also, if second.month can have null values, you'll need to decide whether those null values should collate first or last among values.
Also, this particular query is not the only one being run against your data. These indexes will take up disk space and incrementally slow down writes to the tables. If you have a dozen queries that perform poorly, you might fall into a trap of creating a couple indexes to help each one individually and that's not a solution that scales well.
Finally, you stated that
changing the query is not an option
Does that mean you're not allowed to change the text of the query, or the output of the query?
I personally feel like re-writing the query to select from second and then join first makes the goal of the query more obvious. The fact that your initial instinct was to add indexes to first lends credence to this idea. If the query were written as follows, it would have been more obvious that the thing to do is facilitate efficient access to the tiny set of rows in second that you're interested in:
...
from second s
join first f ...
where ...
order by s.month asc limit 100;
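Spelled out with the columns and conditions from your original query, that rewrite would presumably look like this (same semantics, since an inner join commutes):
SELECT f.*
FROM second s
JOIN first f ON f.attributex_id = s.id
WHERE f.attributex_id IS NOT NULL AND f.attributey_id IS NULL
ORDER BY s.month ASC LIMIT 100;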
I am looking at an application and I found this SQL:
DELETE FROM Phrase
WHERE Modified < (SELECT Modified FROM PhraseSource WHERE Id = Phrase.PhraseId)
The intention of the SQL is to delete rows from Phrase where there are more recent rows in the PhraseSource table.
Now I know the tables Phrase and PhraseSource have the same columns and that Modified holds the number of seconds since 1970, but I cannot understand how/why this works or what it is doing. When I look at it, it seems like on the left of the < there is just one column, while on the right side of the < there could be many rows. Does it even make sense?
The two tables are identical and have the following structure
Id - GUID primary key
...
...
...
Modified int
the ... columns are about ten columns containing text and numeric data. The PhraseSource table may or may not contain more recent rows with a higher number in the Modified column and different text and numeric data.
The SELECT statement in parentheses is a sub-query, or nested query.
What happens is that for each row, the Modified column value is compared with the result of the sub-query (which is run once for each of the rows in the Phrase table).
The sub-query has a WHERE clause, so it finds the row that has the same ID as the row from the Phrase table that we are currently evaluating and returns its Modified value (which, being for a single row, is actually a single scalar value).
The two Modified values are compared and in case the Phrase's row has been modified before the row in PhraseSource, it is deleted.
As you can see this approach is not efficient, because it requires the database to run a separate query for each of the rows in the Phrase table (although I imagine that some databases might be smart enough to optimize this a little bit).
A better solution
The more efficient solution would be to use INNER JOIN:
DELETE p FROM Phrase p
INNER JOIN PhraseSource ps
ON p.PhraseId=ps.Id
WHERE p.Modified < ps.Modified
This should do the exact same thing as your query, but using the efficient JOIN mechanism. INNER JOIN uses the ON clause to decide how to "match" rows in the two tables (which the DB does very efficiently) and then compares the Modified values of the matching rows.
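Note that the DELETE ... JOIN form above is MySQL/SQL Server syntax; if your database happens to be PostgreSQL, the same idea would be written with DELETE ... USING, roughly:
DELETE FROM Phrase p
USING PhraseSource ps
WHERE ps.Id = p.PhraseId
  AND p.Modified < ps.Modified;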
I have a problem where I have to try to find people who have old accounts with an outstanding balance, but who have created a new account. I need to match them by comparing SSNs. The problem is that we have primary and additional contacts, so 2 potential SSNs per account. I need to match them even if they were primary at first but are now secondary, etc.
Here was my first attempt; I'm just counting for now to get the joins and conditions down. I'll select the actual data later. Basically the personal table is joined once to active accounts, and another copy is joined to delinquent accounts. The two references to the personal table are then compared based on the 4 possible ways the SSNs could be related.
select count(*)
from personal pa
join consumer c
on c.cust_nbr = pa.cust_nbr
and c.per_acct = pa.acct
join personal pu
on pu.ssn = pa.ssn
or pu.ssn = pa.addl_ssn
or pu.addl_ssn = pa.ssn
or pu.addl_ssn = pa.addl_ssn
join uncol_acct u
on u.cust_nbr = pu.cust_nbr
and u.per_acct = pu.acct
where u.curr_bal > 0
This works, but it takes 20 minutes to run. I found this question Is having an 'OR' in an INNER JOIN condition a bad idea? so I tried re-writing it as 4 queries (one per ssn combination) and unioning them. This took 30 minutes to run.
Is there a better way to do this, or is it just a really inefficient process no matter how you do it?
Update: After playing with some options here, and some other experimenting I think I found the problem. Our software vendor encrypts the SSNs in the database and provides a view that decrypts them. Since I have to work from that view it takes a really long time to decrypt and then compare.
If you run separate joins and then union them, you might have problems. What if the same record pair fulfills at least two conditions? You would then have duplicates in your result.
I believe your first approach is feasible, but do not forget that you are joining four tables. If the number of rows is A, B, C, D in the respective tables, then the RDBMS will have to check a maximum of A * B * C * D records. If you have many records in your database, then this will take a lot of time.
Of course, you can optimize your query by adding indexes to some columns and that would be a good idea if they are not indexed already. But do not forget that if you add an index to a column, then the RDBMS will be quicker to read from there, but slower to write there. If your operations are mostly reads (select), then you should index your columns, but not blindly, study indexing a bit before you start doing it.
Also, if you are joining four tables, personal, consumer, personal (again) and uncol_acct, then you might do something like this:
Write a query which contains two subqueries, named t1 and t2 respectively. The first subquery joins personal and consumer; call the result t1. The second subquery joins the second occurrence of personal with uncol_acct, and the where clause goes inside this second subquery. Your main query then joins t1 and t2. This way you optimise, as the main query only has to consider the pairing of valid t1 and t2 rows.
Also, if your where clause is outside, as in your example query, then the four-way join will be executed and only after that will the where be taken into consideration. This is why the where clause should be inside the second sub-query, so that it runs before the main join. You can even add a subquery inside the second subquery to evaluate the where first if the condition is fulfilled only rarely.
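A sketch of that rewrite, assuming only the SSN columns are needed from each half (adjust the select lists as needed):
select count(*)
from (select pa.ssn, pa.addl_ssn
      from personal pa
      join consumer c
        on c.cust_nbr = pa.cust_nbr
       and c.per_acct = pa.acct) t1
join (select pu.ssn, pu.addl_ssn
      from personal pu
      join uncol_acct u
        on u.cust_nbr = pu.cust_nbr
       and u.per_acct = pu.acct
      where u.curr_bal > 0) t2
  on t2.ssn = t1.ssn
  or t2.ssn = t1.addl_ssn
  or t2.addl_ssn = t1.ssn
  or t2.addl_ssn = t1.addl_ssn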
Cheers!
Which will perform better?
Query 1:
select (select a from innertable i where i.val=o.val)
, val1, val2
from outertable o
Query 2:
select i.a
,o.val1
,o.val2
from outertable o
join innertable i on i.val=o.val
Why? Please advise.
As Ollie suggests, the only definitive way to determine which of two queries is more efficient is to benchmark the two approaches using your data since the performance of the two alternatives is likely to depend on data volumes, data structures, what indexes are present, etc.
In general, the two queries that you posted will return different results. Unless you are guaranteed that every row in outertable has exactly one corresponding row in innertable, the two queries will return a different number of rows. The first query will return a row for every row in outertable with a NULL as the first column if there is no matching row in innertable. The second query will not return anything if there is no matching row in innertable. Similarly, if there are multiple matching rows in innertable for any particular row in outertable, the first query will return an error while the second query will return multiple rows for that row in outertable.
If you are confident that the two queries return identical result sets in your particular case because you can guarantee that there is exactly one row in innertable for every row in outertable (in which case it is at least somewhat odd that your data model separates the tables), the second option would be the much more natural way to write the query and thus the one for which the optimizer is most likely to find the more efficient plan.
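If what you actually want are Query 1's semantics (one row per outertable row, with NULL when there is no match) while still writing it as a join, the usual rewrite is an outer join, assuming at most one matching row in innertable per outertable row:
select i.a
,o.val1
,o.val2
from outertable o
left join innertable i on i.val=o.val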
Let's say we have
SELECT * FROM A INNER JOIN B ON [....]
Assuming A has 2 rows and B contains 1M rows including 2 rows linked to A:
B will be scanned only once with "actual # of rows" of 2 right?
If I add a WHERE on table B:
SELECT * FROM A INNER JOIN B ON [....] WHERE B.Xyz > 10
The WHERE will actually be executed before the join... So if the where
returns 1000 rows, the "actual # of rows" of B will be 1000...
I don't get it.. shouldn't it be <= 2???
What am I missing... why does the optimiser proceed that way?
(SQL 2008)
Thanks
The optimizer will proceed whichever way it thinks is faster. That means if the Xyz column is indexed but the join column is not, it will likely do the xyz filter first. Or if your statistics are bad so it doesn't know that the join filter would pare B down to just two rows, it would do the WHERE clause first.
It's based entirely on what indexes are available for the optimizer to use. Also, there is no reason to believe that the db engine will execute the WHERE before another part of the query. The query optimizer is free to execute the query in any order it likes as long as the correct results are returned. Again, the way to properly optimize this type of query is with strategically placed indexes.
The "scanned only once" is a bit misleading. A table scan is a horrendously expensive thing in SQL Server. At least up to SS2005, a table scan requires a read of all rows into a temporary table, then a read of the temporary table to find rows matching the join condition. So in the worst case, your query will read and write 1M rows, then try to match 2 rows to 1M rows, then delete the temporary table (that last bit is probably the cheapest part of the query). So if there are no usable indexes on B, you're just in a bad place.
In your second example, if B.Xyz is not indexed, the full table scan happens and there's a secondary match from 2 rows to 1000 rows - even less efficient. If B.Xyz is indexed, there should be an index lookup and a 2:1000 match - much faster & more efficient.
'course, this assumes the table stats are relatively current and no options are in effect that change how the optimizer works.
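To make that concrete, here is a sketch of the indexes involved; "JoinCol" is a stand-in for whatever column the ON clause actually uses, and the index names are placeholders:
CREATE INDEX IX_B_JoinCol ON B (JoinCol);
CREATE INDEX IX_B_Xyz ON B (Xyz);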
EDIT: is it possible for you to "unroll" the A rows and use them as a static condition in a no-JOIN query on B? We've used this in a couple of places in our application where we're joining small tables (<100 rows) to large (> 100M rows) ones to great effect.
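A sketch of that unrolling, with hypothetical names and values (JoinCol and the two literals stand in for the real join column and the values read from A's two rows beforehand):
-- fetch the 2 values from A first, then query B with a literal IN list instead of a JOIN
SELECT *
FROM B
WHERE B.Xyz > 10
  AND B.JoinCol IN (42, 99);  -- the two values taken from A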