Left Join Lateral and array aggregates - sql

I'm using Postgres 9.3.
I have two tables T1 and T2 and an n:m relation T1_T2_rel between them. Now I'd like to create a view that, in addition to the columns of T1, provides a column that, for each record in T1, contains an array with the primary-key ids of all related records of T2. If there are no related entries in T2, the corresponding field of this column shall contain NULL.
An abstracted version of my schema would look like this:
CREATE TABLE T1 ( t1_id serial primary key, t1_data int );
CREATE TABLE T2 ( t2_id serial primary key );
CREATE TABLE T1_T2_rel (
t1_id int references T1( t1_id )
, t2_id int references T2( t2_id )
);
Corresponding sample data could be generated as follows:
INSERT INTO T1 (t1_data)
SELECT cast(random()*100 as int) FROM generate_series(0,9) c(i);
INSERT INTO T2 (t2_id) SELECT nextval('T2_t2_id_seq') FROM generate_series(0,99);
INSERT INTO T1_T2_rel
SELECT cast(random()*10 as int) % 10 + 1 as t1_id
, cast(random()*99+1 as int) as t2_id
FROM generate_series(0,99);
So far, I've come up with the following query:
SELECT T1.t1_id, T1.t1_data, agg
FROM T1
LEFT JOIN LATERAL (
SELECT t1_id, array_agg(t2_id) as agg
FROM T1_T2_rel
WHERE t1_id=T1.t1_id
GROUP BY t1_id
) as temp ON temp.t1_id=T1.t1_id;
This works. However, can it be simplified?
A corresponding fiddle can be found here: sql-fiddle. Unfortunately, sql-fiddle does not (yet) support Postgres 9.3, which is required for lateral joins.
[Update] As has been pointed out, a simple left join using a subquery is in principle enough. However, if I compare the query plans, Postgres resorts to sequential scans on the aggregated tables when using a left join, whereas index scans are used in the case of the left join lateral.

As @Denis already commented: no need for LATERAL.
Also, your subquery selected the wrong column. This works:
SELECT t1.t1_id, t1.t1_data, t2_ids
FROM t1
LEFT JOIN (
SELECT t1_id, array_agg(t2_id) AS t2_ids
FROM t1_t2_rel
GROUP BY 1
) sub USING (t1_id);
SQL fiddle.
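Since the goal is a view, you can simply wrap this query (the view name is illustrative):
CREATE VIEW t1_with_t2_ids AS
SELECT t1.t1_id, t1.t1_data, t2_ids
FROM t1
LEFT JOIN (
   SELECT t1_id, array_agg(t2_id) AS t2_ids
   FROM t1_t2_rel
   GROUP BY 1
) sub USING (t1_id);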
Performance and testing
Concerning the ensuing sequential scan you mention: if you query the whole table, a sequential scan is often faster. Whether it is depends on the version you are running, your hardware, your settings, and statistics on the cardinalities and distribution of your data. Experiment with selective WHERE clauses like WHERE t1.t1_id < 1000 or WHERE t1.t1_id = 1000 and combine them with planner settings to learn about the planner's choices:
SET enable_seqscan = off;
SET enable_indexscan = off;
To reset:
RESET enable_seqscan;
RESET enable_indexscan;
Only in your local session, mind you! This related answer on dba.SE has more instructions.
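For example, a minimal sketch of one such experiment, using the simplified query above:
SET enable_seqscan = off;  -- nudge the planner toward index scans
EXPLAIN ANALYZE
SELECT t1.t1_id, t1.t1_data, t2_ids
FROM t1
LEFT JOIN (
   SELECT t1_id, array_agg(t2_id) AS t2_ids
   FROM t1_t2_rel
   GROUP BY 1
) sub USING (t1_id)
WHERE t1.t1_id < 1000;
RESET enable_seqscan;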
Of course, your settings may be off, too:
Keep PostgreSQL from sometimes choosing a bad query plan

Related

Oracle database -- how can I combine these two tables so that the where clause looks at multiple values?

Really not sure how to word this. I have two tables, t1 and t2. t1 contains a very long history of messages, so it has [message, t1_id, t2_id]. t2, for the sake of the example, just has t2_id. I want to find the message correlated with the highest t1_id. That's easy:
select message
from t1
where t1_id = (
select MAX(t1_id) as mx
from t1
where t2_id = 1
);
But as you can see, I have hard-coded the t2_id. My intention is for this code (which currently just returns the highest message for the one id) to do so for ALL distinct ids. There are a few ways to do this, but I know you can do
select t2_id
from t2;
So how can I essentially replace the hard coding with the above select statement? I know I can't directly, but even using GROUP BY I was having no luck.
Clarification: the t2_id column in t1 correlates to the t2_id column in t2, so in t1 there will be many repeats of t2_ids from t2. I just want the message for EACH t2_id in t1 that has the highest t1_id. t1 has thousands of rows but only about 20 different t2_ids, so I should get a 20-row result.
A more efficient solution (probably the most efficient one), using slightly more advanced features.
select t2_id, -- you probably need this!
max(t1_id) as max_t1_id, -- if needed
max(message) keep (dense_rank last order by t1_id) as message
from t1
group by t2_id;
Use a correlated subquery:
select message
from t1
where t1_id = (
select MAX(t1_id) as mx
from t1 ttt
where t1.t2_id = ttt.t2_id
);
There are other ways to express this query -- using row_number(), keep, join are typical ways. But keeping with your form, a correlated subquery is one way to go.
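For illustration, a sketch of the row_number() variant (the standard pattern; not from the original answer):
select t2_id, message
from (
   select t2_id, message,
          row_number() over (partition by t2_id order by t1_id desc) as rn
   from t1
) ranked
where rn = 1;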

Delete from table A joining on table A in Redshift

I am trying to write the following MySQL query in PostgreSQL 8.0 (specifically, using Redshift):
DELETE t1 FROM table t1
LEFT JOIN table t2 ON (
t1.field = t2.field AND
t1.field2 = t2.field2
)
WHERE t1.field > 0
PostgreSQL 8.0 does not support DELETE FROM table USING. The examples in the docs say that you can reference columns in other tables in the where clause, but that doesn't work here as I'm joining on the same table I'm deleting from. The other example is a subselect query, but the primary key of the table I'm working with has four columns so I can't see a way to make that work either.
Amazon Redshift was forked from Postgres 8.0, but is a very different beast. The manual informs that the USING clause is supported in DELETE statements:
Just use the modern form:
DELETE FROM tbl
USING tbl t2
WHERE t2.field = tbl.field
AND t2.field2 = tbl.field2
AND t2.pkey <> tbl.pkey -- exclude self-join
AND tbl.field > 0;
This is assuming JOIN instead of LEFT JOIN in your MySQL statement, which would not make any sense. I also added the condition AND t2.pkey <> tbl.pkey to make it a useful query: it excludes rows joining themselves, pkey being the primary key column.
What this query does:
Delete all rows where at least one other row exists in the same table with the same not-null values in field and field2. All such duplicates are deleted without leaving a single row per set.
To keep (for example) the row with the smallest pkey per set of duplicates, use t2.pkey < tbl.pkey.
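Spelled out, with the same assumed columns as above, that variant would be:
DELETE FROM tbl
USING tbl t2
WHERE t2.field = tbl.field
AND t2.field2 = tbl.field2
AND t2.pkey < tbl.pkey  -- a duplicate with a smaller pkey exists
AND tbl.field > 0;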
An EXISTS semi-join (as @wilplasser already hinted) might be a better choice, especially if multiple rows could be joined (a row can only be deleted once anyway):
DELETE FROM tbl
WHERE field > 0
AND EXISTS (
SELECT 1
FROM tbl t2
WHERE t2.field = tbl.field
AND t2.field2 = tbl.field2
AND t2.pkey <> tbl.pkey
);
I don't understand the mysql syntax, but you probably want this:
DELETE FROM mytable t1
WHERE t1.field > 0
-- don't need this self-join if {field,field2}
-- are a candidate key for mytable
-- (in that case, the exists-subquery would detect _exactly_ the
-- same tuples as the ones to be deleted, which always succeeds)
-- AND EXISTS (
-- SELECT *
-- FROM mytable t2
-- WHERE t1.field = t2.field
-- AND t1.field2 = t2.field2
-- )
;
Note: for testing purposes, you can replace the DELETE keyword with SELECT * or SELECT count(*) to see which rows would be affected by the query.
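For instance, a sketch of that check for the EXISTS variant above:
SELECT count(*)  -- instead of DELETE
FROM tbl
WHERE field > 0
AND EXISTS (
   SELECT 1
   FROM tbl t2
   WHERE t2.field = tbl.field
   AND t2.field2 = tbl.field2
   AND t2.pkey <> tbl.pkey
);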

Poor performance on pagination using SQL Server ROW_NUMBER()

I am trying to build a pagination mechanism. I am using an ORM that creates SQL looking like this:
SELECT * FROM
(SELECT t1.colX, t2.colY,
ROW_NUMBER() OVER (ORDER BY t1.col3) AS row
FROM Table1 t1
INNER JOIN Table2 t2
ON t1.col1=t2.col2
)a
WHERE row >= n AND row <= m
Table1 has >500k rows and Table2 has >10k records
I execute the queries directly in SQL Server 2008 R2 Management Studio. The subquery takes 2-3 sec to execute, but the whole query takes > 2 min.
I know SQL Server 2012 supports the OFFSET ... FETCH option, but I cannot upgrade the software.
Can anyone help me improve the performance of the query, or suggest another pagination mechanism that can be implemented through the ORM software?
Update:
Testing Roman Pekar's solution (see comments on the solution) proved that ROW_NUMBER() might not be the cause of the performance problems. Unfortunately the problems persist.
Thanks
As I understand it from the comments, your table structure is:
create table Table2
(
col2 int identity primary key,
colY int
)
create table Table1
(
col3 int identity primary key,
col1 int not null references Table2(col2),
colX int
)
That means that the rows returned from Table1 can never be filtered out by the join to Table2, because Table1.col1 is not null. Neither can the join to Table2 add rows to the result, since Table2.col2 is the primary key.
You can then rewrite your query to generate row numbers on Table1 before the join to Table2. The where clause is then also applied before the join to Table2, meaning that you will only locate the rows in Table2 that are actually part of the result set.
select T1.colX,
T2.colY,
T1.row
from
(
select col1,
colX,
row_number() over(order by col3) as row
from Table1
) as T1
inner join Table2 as T2
on T1.col1 = T2.col2
where row >= @n and row <= @m
SQL Fiddle
I have no idea if you can make your ORM (Lightspeed by Mindscape) generate the paging query like this instead of what you have now.
Comparing the query plan from this answer with the plan for the query in the question (plan screenshots omitted here), there is a huge difference in reads between the two.
Insert just the primary key column(s) of the paginated table into a temp table with an identity column, ordering by the ordered-by columns. (You may have to include the ordered-by columns to ensure the ordering comes out right.) Then, join back to the main table using the temp table as a key for the rows you want. If the data is fairly static, you could save the ordering data to a session-keyed permanent table instead of a temp table, and reuse it for a short period of time (so subsequent page requests within a few minutes are nearly instant).
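A minimal T-SQL sketch of that approach, reusing the question's table and column names (the temp-table name is illustrative):
CREATE TABLE #page_keys (
    row_id int IDENTITY(1,1) PRIMARY KEY,
    col3 int NOT NULL  -- the ordered-by column; also the key of Table1 here
);

INSERT INTO #page_keys (col3)
SELECT col3
FROM Table1
ORDER BY col3;  -- ORDER BY fixes the identity assignment order

SELECT t1.colX, t2.colY, pk.row_id AS row
FROM #page_keys pk
JOIN Table1 t1 ON t1.col3 = pk.col3
JOIN Table2 t2 ON t1.col1 = t2.col2
WHERE pk.row_id BETWEEN @n AND @m;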
Row_Number() tends to perform well with small sets of data, but it can hit serious performance snags once you get some serious rows, as you have with 500k.
I suggest you check the indexes on your tables. I think it'll help your query if you at least have an index on col2 in Table2. You could also try to rewrite your query like
;with cte1 as (
select top (@m) t1.colX, t2.colY, t1.col3
from Table1 as t1
inner join Table2 as t2 on t1.col1=t2.col2
order by t1.col3 asc
),
cte2 as (
select top (@m - @n + 1) *
from cte1
order by col3 desc
)
select *
from cte2 as t1
but it could still be slow if you don't have indexes
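For reference, the indexes that suggestion implies would look roughly like this (names are illustrative; col2 is already covered if it is the primary key):
CREATE INDEX ix_table2_col2 ON Table2 (col2);
CREATE INDEX ix_table1_col3 ON Table1 (col3);  -- supports the ORDER BY on col3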

SQL Optimizer / Execution Plan - Repeated Subqueries

NOTE: I'm currently running the queries in question on a sqlite3 DB, though insight from expertise in any other DBMS is welcome...
I was wondering whether the query optimizer makes any attempt to identify repeated queries/subqueries and, if so, run them only once.
Here is my example query:
SELECT *
FROM table1 AS t1
WHERE t1.fk_id =
(
SELECT t2.fk_id
FROM table2 AS t2
WHERE t2.id = 1111
)
OR t1.fk_id =
(
SELECT local_id
FROM ID_MAP
WHERE remote_id =
(
SELECT t2.fk_id
FROM table2 AS t2
WHERE t2.id = 1111
)
);
Will the nested query
SELECT t2.fk_id
FROM table2 AS t2
WHERE t2.id = 1111
be run only once (and its results cached for further access) ?
It's not a big deal in this example, since it's a simple query that executes only twice. However, I need it to run about 4-5 more times (x2, twice for each child record, so 8-10 really) in my actual program. The program is grabbing all child records (table1) associated with a parent record (table2), bound by a foreign key. It's also checking an id-mapping table to make sure it queries for both a locally generated id and the real/updated/new key.
I really appreciate any help with this, thank you.
SQLite has a very simple query optimizer, and does not even try to detect identical subqueries:
> create table t(x);
> explain query plan
select * from t
where x in (select x from t) or
x in (select x from t);
0|0|0|SCAN TABLE t (~500000 rows)
0|0|0|EXECUTE LIST SUBQUERY 1
1|0|0|SCAN TABLE t (~1000000 rows)
0|0|0|EXECUTE LIST SUBQUERY 2
2|0|0|SCAN TABLE t (~1000000 rows)
The same applies to CTEs and views; if the performance actually matters, your best bet is to create a temporary table for the result of the subquery.
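For example, a sketch of that workaround for the query in the question (the temporary table name is illustrative):
CREATE TEMP TABLE fk_lookup AS
SELECT t2.fk_id
FROM table2 AS t2
WHERE t2.id = 1111;

SELECT *
FROM table1 AS t1
WHERE t1.fk_id IN (SELECT fk_id FROM fk_lookup)
OR t1.fk_id IN (
    SELECT local_id
    FROM ID_MAP
    WHERE remote_id IN (SELECT fk_id FROM fk_lookup)
);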
As you asked for insight from other DBs....
In Oracle DBMS, any independent subquery will be executed only once.
SELECT t2.fk_id
FROM table2 AS t2
WHERE t2.id = 1111 -- The result will be the same for any row in t1.
Dependent subqueries will need to be executed repeatedly, of course.
Example of dependent subquery:
SELECT t2.fk_id
FROM table2 AS t2
WHERE t2.id = t1.t2_id -- t1.t2_id will have different values for different rows in t1.

Should I avoid IN() because slower than EXISTS() [duplicate]

Possible Duplicate:
SQL Server IN vs. EXISTS Performance
Should I avoid IN() because it is slower than EXISTS()?
SELECT * FROM TABLE1 t1 WHERE EXISTS (SELECT 1 FROM TABLE2 t2 WHERE t1.ID = t2.ID)
VS
SELECT * FROM TABLE1 t1 WHERE t1.ID IN(SELECT t2.ID FROM TABLE2 t2)
From my investigation, I set SHOWPLAN_ALL and get the same execution plan and estimated cost. The index (PK) is used, with a seek, in both queries. No difference.
What other scenarios or cases would produce a big difference between the two queries? Or is the optimizer doing so much optimizing for me that I get the same execution plan?
Do neither. Do this:
SELECT DISTINCT T1.*
FROM TABLE1 t1
JOIN TABLE2 t2 ON t1.ID = t2.ID;
This will outperform anything else by orders of magnitude.
Both queries will produce the same execution plan (assuming no indexes were created): two table scans and one nested loop (join).
The join, suggested by Bohemian, will do a Hash Match instead of the loop, which I've always heard (and here is a proof: Link) is the worst kind of join.
Among IN and EXISTS (your actual question), EXISTS returns better performance (take a look at: Link).
If your table T2 has a lot of records, EXISTS is the better approach hands down, because when your database finds a record that matches your requirement, the condition is evaluated to true and the scan of T2 stops. With the IN clause, however, you're scanning Table2 for every row in Table1.
IN is better than EXISTS when you have a fixed list of values, or only a few values in the subquery.
To expand my answer a little, based on an Ask Tom answer:
In a select with IN, for example:
Select * from T1 where x in ( select y from T2 )
is usually processed as:
select *
from t1, ( select distinct y from t2 ) t2
where t1.x = t2.y;
The subquery is evaluated, distinct'ed, indexed (or hashed or sorted) and then joined to the original table (typically).
In an EXISTS query like:
select * from t1 where exists ( select null from t2 where y = x )
That is processed more like:
for x in ( select * from t1 )
loop
if ( exists ( select null from t2 where y = x.x ) )
then
OUTPUT THE RECORD
end if
end loop
It always results in a full scan of T1 whereas the first query can make use of an index on T1(x).
When is WHERE EXISTS appropriate, and when is IN appropriate?
Use EXISTS when... the subquery on T2 is huge and takes a long time, T1 is relatively small, and executing (select null from t2 where y = x.x) is very, very fast.
Use IN when... the result of the subquery is small -- then IN is typically more appropriate.
If both the subquery and the outer table are huge -- either might work as well as the other -- depends on the indexes and other factors.