Postgres running parallel queries (cross joins on large tables) - sql

I need to run queries of the following type:
SELECT * FROM A CROSS JOIN B WHERE myfunction(A.x,B.y) = Z
Because the query is slow, I would like to use all available processors to speed it up.
I have only very basic knowledge of relational databases so even "obvious" comments are welcome.
Postgres v 9.4.4 (upgrade is not an option due to some constraints)
A has 3 million rows
B can have 100k rows (but could grow to around 10M rows in the future)
A and B have indexed columns
myfunction(A.x, B.y) takes advantage of the indexes on A.x and B.y; without them it is much slower.
What would be a reasonable solution?
At present, a 10k x 2M query using 50 processors with the naive split described below took about 20 minutes.
I am considering running cross joins on parts of B in parallel. B would be split by values of ID (integer primary key)
SELECT * FROM A CROSS JOIN B WHERE myfunction(A.x,B.y) = Z AND A.id BETWEEN N and M.
and then run multiple "psql -d mydatabase subqueryNumberX.sql" commands using gnu parallel.
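Following the query just above (which splits on A.id), one chunk file of that naive split might look like the following sketch; the range bounds and file name are illustrative and Z stands for the constant in the original query:
-- subqueryNumberX.sql: one chunk of the naive split over A.id
SELECT *
FROM A
CROSS JOIN B
WHERE myfunction(A.x, B.y) = Z
  AND A.id BETWEEN 1 AND 100000;   -- next chunk: 100001..200000, and so on
-- each chunk file is then run as in the question, e.g.
--   psql -d mydatabase -f subqueryNumberX.sql
-- with gnu parallel launching one psql process per chunk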
Some questions:
If I have an indexed table T and use a SELECT from it within another query, would the index on T be used in the search, or does the sub-SELECT prevent that?
In my query above, would selecting only part of the table (WHERE A.id BETWEEN N AND M) prevent the index from being used?
While a (slow) cross join on a table is in progress, is that table accessible for other operations (e.g. the next cross join)?

Your question is (still) rather vague.
For a cross join, indexes are not necessarily of much use, but it depends on which columns are indexed, which columns are referenced in the query, and the size of the rows in the table. If the index is on the relevant columns, then maybe the optimizer will do an 'index only' scan instead of a 'full table scan' and benefit from the smaller amount of I/O. However, since you have SELECT *, you are selecting all columns from A and B, so the full rows will need to be read (but see point 2).

There isn't a sub-select in the query, so it is mystifying to ask about the sub-select destroying anything.
Nominally, you might get some benefit from moving the WHERE clause into a sub-select such as:
SELECT *
FROM (SELECT * FROM A WHERE A.id BETWEEN N AND M) AS A1
CROSS JOIN B
WHERE myFunction(A1.x, B.y) = Z
However, it would be a feeble optimizer that would not do that automatically. The range condition might make an index on A.id attractive, especially if M and N represent a small fraction of the total range of values in A.id. So, the optimizer should use an index with A.id as the leading or sole component to speed up the query. The condition won't prevent the use of an index; if anything, without it indexes would almost certainly not be used at all.
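As a quick sanity check (a sketch only; N, M and Z are placeholders as above), EXPLAIN will show whether Postgres actually picks the primary-key index on A.id for one chunk:
-- show the chosen plan without running the query; use EXPLAIN ANALYZE to run it
EXPLAIN
SELECT *
FROM (SELECT * FROM A WHERE A.id BETWEEN N AND M) AS A1
CROSS JOIN B
WHERE myFunction(A1.x, B.y) = Z;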
A slow query does not inhibit other queries; it may inhibit updates while it is running, or it may stress the MVCC (multi-version concurrency control) mechanisms of the DBMS.


Is it true that switching the position of tables in a join query will increase query speed?

For example:
In table a we have 1000000 rows
In table b we have 5 rows
Is it faster if we use
select * from b inner join a on b.id = a.id
than
select * from a inner join b on a.id = b.id
No, the JOIN order doesn't matter; the query engine will rearrange the joins based on index statistics and other factors. The JOIN order is changed during optimization.
You can test this yourself: download a sample database such as AdventureWorks or Northwind, or try it on your own database, and do the following:
enable "Include Actual Execution Plan" and run the first query
change the JOIN order and run the query again
compare the execution plans
They should be identical, as the query engine will rearrange the joins according to other factors.
The only caveat is the OPTION (FORCE ORDER) hint, which forces the joins to happen in the exact order you specified.
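For reference, a minimal sketch of that hint in SQL Server syntax, reusing the tables from the question:
-- with this hint the optimizer keeps the join order exactly as written
select *
from b
inner join a on b.id = a.id
option (force order);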
It is unlikely. There are lots of factors on the speed of joining two tables. That is why database engines have an optimization phase, where they consider different ways of implementing the query.
There are many different options:
Nested loops, scanning b first and then a.
Nested loops, scanning a first and then b.
Sorting both tables and using a merge join.
Hashing both tables and using a hash join.
Using an index on b.id.
Using an index on a.id.
And these are just high level descriptions -- there are multiple ways to implement some of these methods. Tables can also be partitioned adding further complexity.
Join order is just one consideration.
In this case, the speed of the query is likely to depend on the size of the data being returned rather than on the actual algorithm used for fetching the data.

Several layers of nested subqueries with Exists/In, best performance?

I'm working on some rather large queries for a search function. There are a number of different inputs and the queries are pretty big as a result. It's grown to where there are nested subqueries 2 layers deep. Performance has become an issue on the ones that will return a large dataset and likely have to sift through a massive load of records to do so. The ones that have less comparing to do perform fine, but some of these are getting pretty bad. The database is DB2 and has all of the necessary indexes, so that shouldn't be an issue. I'm wondering how to best write/rewrite these queries to perform as I'm not quite sure how the optimizer is going to handle it. I obviously can't dump the whole thing here, but here's an example:
Select A, B
from TableA
--A series of joins--
WHERE TableA.A IN (
    Select C
    from TableB
    --A few joins--
    WHERE TableB.C IN (
        Select D from TableC
        --More joins and conditionals--
    )
)
There are also plenty of conditionals sprinkled throughout, the vast majority of which are simple equality. You get the idea. The subqueries do not provide any data to the initial query. They exist only to filter the results.

A problem I ran into early on is that the backend is written to contain a number of partial query strings that get assembled into the final query (with 100+ possible combinations due to the search options, it simply isn't feasible to write a query for each), which has complicated the overall method a bit.

I'm wondering if EXISTS instead of IN might help at one or both levels, or another bunch of joins instead of subqueries, or perhaps using WITH above the initial query for TableC, etc. I'm definitely looking to remove bottlenecks and would appreciate any feedback that folks might have on how to handle this.
I should probably also add that there are potential unions within both subqueries.
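For reference, the WITH variant mentioned in the question might look roughly like the following sketch (only the skeleton is shown; the joins and conditionals from the example are omitted, and the CTE name is made up):
-- filter TableC first in a common table expression, then reuse it
WITH filtered_c (D) AS (
    SELECT D
    FROM TableC
    --More joins and conditionals--
)
SELECT A, B
FROM TableA
--A series of joins--
WHERE TableA.A IN (
    SELECT C
    FROM TableB
    --A few joins--
    WHERE TableB.C IN (SELECT D FROM filtered_c)
)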
It would probably help to use inner joins instead.
Select A, B
from TableA
inner join TableB on TableA.A = TableB.C
inner join TableC on TableB.C = TableC.D
Databases were designed for joins, but the optimizer might not figure out that it can use an index for a sub-query. Instead it will probably try to run the sub-query, hold the results in memory, and then do a linear search to evaluate the IN operator for every record.
Now, you say that you have all of the necessary indexes. Consider this for a moment.
If one optional condition is TableC.E = 'E' and another optional condition is TableC.F = 'F',
then a query with both would need an index on fields TableC.E AND TableC.F. Many young programmers today think they can have one index on TableC.E and one index on TableC.F, and that's all they need. In fact, if you have both fields in the query, you need an index on both fields.
So, for 100+ combinations, "all of the necessary indexes" could require 100+ indexes.
Now an index on TableC.E, TableC.F could be used in a query with a TableC.E condition and no TableC.F condition, but could not be used when there is a TableC.F condition and no TableC.E condition.
Hundreds of indexes? What am I going to do?
In practice it's not that bad. Let's say you have N optional conditions which are either in the where clause or not. The number of combinations is 2 to the Nth, so for hundreds of combinations N is log2 of the number of combinations, which is between 6 and 10. Also, those N conditions are spread across three tables. Some databases support multi-table indexes, but I'm not sure DB2 does, so I'd stick with single-table indexes.
So, what I am saying is, for the TableC.E and TableC.F example, it's not enough to have just the following indexes:
TableB ON C
TableC ON D
TableC ON E
TableC ON F
For one thing, the optimizer has to pick which one of the last three indexes to use. Better would be to include the D field in the last two indexes, which gives us:
TableB ON C
TableC ON D, E
TableC ON D, F
Here, if neither field E nor F is in the query, it can still index on D, but if either one is in the query, it can index on both D and one other field.
Now suppose you have 10 fields which may or may not be in the query. Why have just one field in an index? Why not add other fields in descending order of how likely they are to appear in the query?
Consider that when planning your indexes.
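As a concrete sketch, the index layout described above could be written like this (index names are hypothetical):
-- single-column index for the join from TableA
CREATE INDEX tableb_c   ON TableB (C);
-- composite indexes that lead with D so they are usable with or without an E/F condition
CREATE INDEX tablec_d_e ON TableC (D, E);
CREATE INDEX tablec_d_f ON TableC (D, F);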
I have found that the IN predicate works well for small subqueries and EXISTS for large ones.
Try rewriting the query with an EXISTS predicate for the large subqueries:
SELECT A, B
FROM TableA
WHERE EXISTS (
    SELECT C
    FROM TableB
    WHERE TableB.C = TableA.A)
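Applying the same idea at both levels of the original query might look roughly like this sketch (the joins and extra conditionals are again left out):
SELECT A, B
FROM TableA
WHERE EXISTS (
    SELECT 1
    FROM TableB
    WHERE TableB.C = TableA.A
      AND EXISTS (
          SELECT 1
          FROM TableC
          WHERE TableC.D = TableB.C))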

Optimize Joins (more than 2 tables, where filter)

A couple of Sybase database query questions:
If I do a join and have a where clause, would the filter be applied prior to the actual join? In other words, is it faster than the same join without any where conditions?
I have an example involving 3 tables (with columns listed below):
A: O1,....
B: E1,E2,...
C: O1, E1, E2
So my join looks like:
select A.*, B.* from B,C,A
where
C.E1=B.E1 and C.E2=B.E2 and C.O1=A.O1
and A.O2 in (...)
and B.E3 in (...)
Would my joins be significantly faster if I eliminated C and added O1 to table B instead?
B:E1,E2,O1....
First, you should use proper join syntax:
select A.*, B.*
from B
join C on C.E1 = B.E1 and C.E2 = B.E2
join A on C.O1 = A.O1
where B.E3 in (...)
The "," means cross join and it is prone to problems since it is easily missed. Your query becomes much different if you say "from B C, A". Also, it gives you access to outer joins, and makes the statement much more readable.
The answer to your question is "yes". Your query will run faster if there are fewer tables being joined. If you are doing a join on a primary key, and the tables fit into memory, then the join is not expensive. However, it is still faster to just find the data in the record in B.
As I say this, there could be some boundary cases in some databases where this is not necessarily true. For instance, if there is only one value in the column, and the column is a long string, then adding the column onto pages could increase the number of pages needed for B. Such extreme cases are unlikely, and you should see performance improvements.
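For illustration, the denormalized version asked about in the question might look like the following sketch (assuming O1 has been added to B as described; the "(...)" lists are the same placeholders as in the question):
-- two-table join once O1 lives directly on B
select A.*, B.*
from B
join A on B.O1 = A.O1
where A.O2 in (...)
  and B.E3 in (...)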
Speed will generally depend on the number of rows SQL Server has to read.
I don't think it makes any difference whether you use a where clause or a join.
It depends on how many rows are in the eliminated table.
It can depend on the order in which you add the joins or where clauses: e.g., if there are only a few rows in C and you add it first as a table or where clause, it immediately cuts down the number of possible matches. If, however, there are millions of rows in C, then you have to read millions of rows to find the matches.
Modern optimizers can rearrange your query to be more efficient, but don't rely on it.
What can really cut down the number of rows read is adding indexes to the join columns - if you have an index on A.O1 and on C.O1, it can cut down massively on the number of reads.
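A minimal sketch of those indexes (index names are made up; the composite index on C's E1/E2 join columns is an additional assumption, not something stated above):
create index idx_A_O1    on A (O1);
create index idx_C_O1    on C (O1);
-- assumption: also index the other join columns used between B and C
create index idx_C_E1_E2 on C (E1, E2);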

SQL Server Logical Query Processing - how does it manage the huge datasets?

I'm doing some reading on SQL Server performance:
http://www.amazon.com/Inside-Microsoft-SQL-Server-2005/dp/0735623139/ref=sr_1_6?ie=UTF8&s=books&qid=1267032068&sr=8-6
One of the surprising things I came across was how it processes the "FROM" phase in its Logical Processing. From what I understand, SQL Server will do the following:
1) For the first two tables, it will create a virtual table (VT1) consisting of a Cartesian join of the two tables
2) For every additional table, it will create a Cartesian join of VT1 and the additional table, with the result becoming VT1
I'm sure there is a lot more to it under the covers, but at face value, this seems like it would involve a huge amount of processing/memory if you're dealing with big tables (and big queries).
I was just wondering whether anyone had a quick explanation of how SQL Server is able to do this in any sort of realistic time/space frame?
The Cartesian join is just a description of the result, not an actual result. After the full Cartesian join of tables A, B, C...X, the filter operators are applied (still as a definition): things like the ON clauses of the joins and the WHERE clauses of the query. In the end this definition is in turn transformed into an execution plan, which will contain physical operators like Nested Loops, Hash Join or Merge Join, and these operators, when iterated, will produce the results requested in the query definition.
So the big 100x100x100x100... Cartesian cube is never materialized; it is just a definition.
If you are really interested in how SQL Server does what it does, please read this book:
http://www.amazon.com/Microsoft-SQL-Server-2008-Internals/dp/0735626243/ref=sr_1_1?ie=UTF8&s=books&qid=1267033666&sr=8-1
In reality the optimiser will look at the whole query, estimated rows, statistics, constraints etc
Logically, it is in the order mentioned though
Contrived example:
SELECT
    BT.col1, LT.col2
FROM
    BigTable BT
JOIN
    LittleTable LT ON BT.FKCol = LT.PKCol
WHERE
    LT.PKCol = 2
ORDER BY
    BT.col1
The Cartesian product of BT and LT could be hundreds of millions of rows.
But the optimiser:
knows PKCol is unique so it expects only one row
can use statistics to estimate the number of rows from BT
looks for indexes (eg a covering index on BT for FKCol INCLUDE col1; see the sketch after this list)
will probably apply the WHERE first
will look ahead for an ORDER BY or GROUP BY, for example, to see if it can save some spooling (re-sorting)
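A sketch of that covering index in SQL Server syntax (the index name is made up):
-- FKCol is the key; col1 is carried along so the query can be answered from the index alone
CREATE INDEX IX_BigTable_FKCol
    ON BigTable (FKCol)
    INCLUDE (col1);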
I don't know the resource you are reading, but what you describe is the behavior of:
SELECT ... FROM tableA, tableB, tableC, ....
This uses a Cartesian join (also called a cross join) and is very expensive. With large enough datasets SQL Server (or any RDBMS) can't do this in any sort of realistic time/space frame.
Using an ON clause and specifying the JOIN type performs vastly better:
SELECT ... FROM tableA JOIN tableB on tableB.a_id = tableA.a_id
In real applications cross joins should be rare or at least limited to very small datasets. For many applications it's not uncommon to never have a cross join.

INNER JOINs with where on the joined table

Let's say we have
SELECT * FROM A INNER JOIN B ON [....]
Assuming A has 2 rows and B contains 1M rows including 2 rows linked to A:
B will be scanned only once with an "actual # of rows" of 2, right?
If I add a WHERE on table B:
SELECT * FROM A INNER JOIN B ON [....] WHERE B.Xyz > 10
The WHERE will actually be executed before the join... so if the WHERE returns 1000 rows, the "actual # of rows" of B will be 1000...
I don't get it... shouldn't it be <= 2???
What am I missing? Why does the optimiser proceed that way?
(SQL 2008)
Thanks
The optimizer will proceed whichever way it thinks is faster. That means if the Xyz column is indexed but the join column is not, it will likely do the xyz filter first. Or if your statistics are bad so it doesn't know that the join filter would pare B down to just two rows, it would do the WHERE clause first.
It's based entirely on what indexes are available for the optimizer to use. Also, there is no reason to believe that the db engine will execute the WHERE before another part of the query. The query optimizer is free to execute the query in any order it likes as long as the correct results are returned. Again, the way to properly optimize this type of query is with strategically placed indexes.
The "scanned only once" is a bit misleading. A table scan is a horrendously expensive thing in SQL Server. At least up to SS2005, a table scan requires a read of all rows into a temporary table, then a read of the temporary table to find rows matching the join condition. So in the worst case, your query will read and write 1M rows, then try to match 2 rows to 1M rows, then delete the temporary table (that last bit is probably the cheapest part of the query). So if there are no usable indexes on B, you're just in a bad place.
In your second example, if B.Xyz is not indexed, the full table scan happens and there's a secondary match from 2 rows to 1000 rows - even less efficient. If B.Xyz is indexed, there should be an index lookup and a 2:1000 match - much faster & more efficient.
Of course, this assumes the table stats are relatively current and no options are in effect that change how the optimizer works.
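As a hedged sketch only (the join column in the ON clause is hidden behind [....] in the question, so its name here is a placeholder), the kind of indexes that would let the optimizer seek rather than scan B are:
-- index for the WHERE filter from the second example
CREATE INDEX IX_B_Xyz ON B (Xyz);
-- placeholder: whatever column of B appears in the ON [....] join condition
CREATE INDEX IX_B_JoinCol ON B (JoinCol);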
EDIT: Is it possible for you to "unroll" the A rows and use them as a static condition in a no-JOIN query on B? We've used this in a couple of places in our application where we're joining small tables (<100 rows) to large (>100M rows) ones to great effect.
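A rough sketch of that "unroll" idea, with hypothetical column names and literal values standing in for the two rows pulled from A:
-- B.JoinCol and the two literal values are placeholders for illustration only
SELECT *
FROM B
WHERE B.Xyz > 10
  AND B.JoinCol IN (1, 2);   -- the two values read from A beforehand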