Which method to use: UNION of JOINS vs LEFT JOINS - sql

So I currently have a pretty big table that I need to query frequently. This table has a lot of columns and rows and I need to filter the rows based on 2 lists that are placed in 2 virtual tables and I want the results that match with either one of them
I'm doing a procedure to get the results but our development environment doesn't have enough sample size to test the procedure performance. So now I have 2 options.
Use LEFT JOINS to do something like this
SELECT GT.* FROM GiantTable GT
LEFT JOIN List1 L1 ON GT.X = L1.X
LEFT JOIN List2 L2 ON GT.Y = L2.Y
WHERE L1.X IS NOT NULL OR L2.Y IS NOT NULL
Use 2 JOIN queries and UNION the results
SELECT GT.* FROM GiantTable GT JOIN List1 L1 ON GT.X = L1.X
UNION
SELECT GT.* FROM GiantTable GT JOIN List2 L2 ON GT.Y = L1.Y
I'm pretty sure that the 2nd should wield much better performance but I would like to know if I'm mistaken

You can probably do better with
select *
from gianttable
where x in (select x from L1) or y in (select y from L2)
Edit:
OP reports that this version is slower than his original attempts. This may be true, especially if x and y are indexed in gianttable (they are primary keys in their respective lists, so they are indexed in those smaller tables). What gets in the way is the OR operator in the where condition - that means neither condition can be used as an access predicate.
There's one more thing to try... where indexes should actually help. The query below is equivalent to the OP's first attempt, and also to my first attempt (above) and dasblinkenlight's solution. (That is so because x and y are PK in their respective lists, so we don't need to handle null there.)
select * from gianttable where x in (select x from L1)
union all
select * from gianttable where y in (select y from L2) and x not in (select x from L1)

UNION is problematic, because it requires filtering out duplicates from the results of two subqueries. You should be able to obtain a decent performance with EXISTS operator and a pair of correlated subqueries:
SELECT *
FROM GiantTable GT
WHERE (EXISTS (SELECT * FROM List1 L1 WHERE L1.X = GT.X))
OR (EXISTS (SELECT * FROM List2 L2 WHERE L2.Y = GT.Y))

Related

Postgres SQL slow when using OR statement

I have a query that take 12 seconds (journal table 1.2m rows), the root cause is below
select 1
from
myTable myTable
join myJournal Journal on (Journal.status=0 and myTable.id = Journal.myTableId)
join arrival Arrival on (myTable.id = Arrival.myTableId)
join calc Calc on (myTable.id = Calculated.myTableId)
join ms ms on (Parent.id = ms.myTableId)
join perf Perf on (myTable.id = Perf.myTableId)
join ref Ref on (myTable.id = Ref.myTableId)
where
((myTable.name like 'cheese%' or
Journal.algorithm like 'cheese%' or --if this is removed, its fine <1sec
myTable.client like 'cheese%' or
myTable.something like 'cheese%'))
However the journal table performs ok. Doing
select * from myJournal where algorithm like 'cheese%' --takes < 1 sec.
The query also performs fine, if I remove four of the joins (that are not used in the where clause).
When I join on more than 3 tables the performance degrades dramatically/exponentially.
I would start by writing the query as:
select 1
from myTable t join
myJournal j
on t.id = j.myTableId
where j.status = 0 and
(t.name like 'cheese%' or
j.algorithm like 'cheese%' or
t.client like 'cheese%' or
t.something like 'cheese%'
)
Then the optimal indexes for this are probably myJournal(status, myTableId, algorithm) and myTable(id, name, client, something).
These indexes are primarily for the join and the first filter condition. They won't help much with the string comparisons. But, those are hard to optimize for due to the or conditions.

Refactoring slow SQL query

I currently have this very very slow query:
SELECT generators.id AS generator_id, COUNT(*) AS cnt
FROM generator_rows
JOIN generators ON generators.id = generator_rows.generator_id
WHERE
generators.id IN (SELECT "generators"."id" FROM "generators" WHERE "generators"."client_id" = 5212 AND ("generators"."state" IN ('enabled'))) AND
(
generators.single_use = 'f' OR generators.single_use IS NULL OR
generator_rows.id NOT IN (SELECT run_generator_rows.generator_row_id FROM run_generator_rows)
)
GROUP BY generators.id;
An I'm trying to refactor it/improve it with this query:
SELECT g.id AS generator_id, COUNT(*) AS cnt
from generator_rows gr
join generators g on g.id = gr.generator_id
join lateral(select case when exists(select * from run_generator_rows rgr where rgr.generator_row_id = gr.id) then 0 else 1 end as noRows) has on true
where g.client_id = 5212 and "g"."state" IN ('enabled') AND
(g.single_use = 'f' OR g.single_use IS NULL OR has.norows = 1)
group by g.id
For reason it doesn't quite work as expected(It returns 0 rows). I think I'm pretty close to the end result but can't get it to work.
I'm running on PostgreSQL 9.6.1.
This appears to be the query, formatted so I can read it:
SELECT gr.generators_id, COUNT(*) AS cnt
FROM generators g JOIN
generator_rows gr
ON g.id = gr.generator_id
WHERE gr.generators_id IN (SELECT g.id
FROM generators g
WHERE g.client_id = 5212 AND
g.state = 'enabled'
) AND
(g.single_use = 'f' OR
g.single_use IS NULL OR
gr.id NOT IN (SELECT rgr.generator_row_id FROM run_generator_rows rgr)
)
GROUP BY gr.generators_id;
I would be inclined to do most of this work in the FROM clause:
SELECT gr.generators_id, COUNT(*) AS cnt
FROM generators g JOIN
generator_rows gr
ON g.id = gr.generator_id JOIN
generators gg
on g.id = gg.id AND
gg.client_id = 5212 AND gg.state = 'enabled' LEFT JOIN
run_generator_rows rgr
ON g.id = rgr.generator_row_id
WHERE g.single_use = 'f' OR
g.single_use IS NULL OR
rgr.generator_row_id IS NULL
GROUP BY gr.generators_id;
This does make two assumptions that I think are reasonable:
generators.id is unique
run_generator_rows.generator_row_id is unique
(It is easy to avoid these assumptions, but the duplicate elimination is more work.)
Then, some indexes could help:
generators(client_id, state, id)
run_generator_rows(id)
generator_rows(generators_id)
Generally avoid inner selects as in
WHERE ... IN (SELECT ...)
as they are usually slow.
As it was already shown for your problem it's a good idea to think of SQL as of set- theory.
You do NOT join tables on their sole identity:
In fact you take (SQL does take) the set (- that is: all rows) of the first table and "multiply" it with the set of the second table - thus ending up with n times m rows.
Then the ON- clause is used to (often strongly) reduce the result by simply selecting each one of those many combinations by evaluating this portion to either true (take) or false (drop). This way you can chose any arbitrary logic to select those combinations in favor.
Things get trickier with LEFT JOIN and RIGHT JOIN, but one can easily think of them as to take one side for granted:
output the combinations of that row IF the logic yields true (once at least) - exactly like JOIN does
output exactly ONE row, with 'the other side' (right side on LEFT JOIN and vice versa) consisting of ALL NULL for every column.
Count(*) is great either, but if things getting complicated don't stick to it: Use Sub- Selects for the keys only, and once all the hard word is done join the Fun- Stuff to it. Like in
SELECT SUM(VALID), ID
FROM SELECT
(
(1 IF X 0 ELSE) AS VALID, ID
FROM ...
)
GROUP BY ID) AS sub
JOIN ... AS details ON sub.id = details.id
Difference is: The inner query is executed only once. The outer query does usually have no indices left to work with and will be slow, but if the inner select here doesn't make the data explode this is usually many times faster than SELECT ... WHERE ... IN (SELECT..) constructs.

SQL, Self join on 3 joined tables

Tables and requested output
I'm using National Instrumets Teststand default database setup. I've tried to simplify the DB layout in the picture above.
I can manage to get what i want through some rather "complicated" sql, and it's very slow.
I think there is a better way, and then i stumbled over SELF JOIN. Basically what I want is to get data values from several different rows, from one "serial number".
My problem is to combine the self Join with the "general" join of my tables.
I'm using an Access Databdase at the moment.
This will give you the output you're aiming for with the sample data:
with x as (
select
row_number() over (partition by t1.Serial order by t1.Serial) as [RN],
t1.Serial,
case when t3.Sub_Test_Name = 'AAA' then t3.Value end as [AAA],
case when t3.Sub_Test_Name = 'BBB' then t3.Value end as [BBB],
case when t3.Sub_Test_Name = 'CCC' then t3.Value end as [CCC],
case when t3.Sub_Test_Name = 'DDD' then t3.Value end as [DDD]
from Table_1 t1
inner join Table_2 t2 on t2.Table_1_Id = t1.Id
inner join Table_3 t3 on t3.Table_2_Id = t2.Id
)
select
x.Serial,
AAA.AAA,
BBB.BBB,
CCC.CCC,
DDD.DDD
from x
left outer join x AAA on AAA.Serial = x.Serial and AAA.RN = x.rn + 0
left outer join x BBB on BBB.Serial = x.Serial and BBB.RN = x.rn + 1
left outer join x CCC on CCC.Serial = x.Serial and CCC.RN = x.rn + 2
left outer join x DDD on DDD.Serial = x.Serial and DDD.RN = x.rn + 3
where x.rn = 1
This uses self joins as you mentioned (where you see x being left joined to itself multiple times in the final select statement).
I've deliberately added extra columns CCC and DDD so it is easier to see how you would build this out for a larger data set, incrementing the row_number offset for each join.
I've tested this in SQL Fiddle and you're welcome to play around with it. If you need to apply additional filters, your where clause should be placed inside the CTE.
Note, you're effectively pivoting the data with this sort of query (except we're not aggregating anything, so we can't use the built in PIVOT option). The downside of both this method and real pivots is that you have to manually specify every column header with its own CASE statement in the CTE, and a left join in the final select statement. This can get unwieldy in medium - large data sets, so it best suited in cases where you will have a small number of known column headers in your results.

syntax error when joining tables

I am trying to merge multiple tables. This is how I am doing it:
CREATE TABLE big3 AS SELECT *
FROM trainSearchStream a
LEFT OUTER JOIN SearchInfo b ON b.SearchID=a.SearchID LIMIT 3
LEFT OUTER JOIN AdsInfo c ON c.AdID=a.AdID LIMIT 3;
However, I get this error:
Error: near "LEFT": syntax error
As jarlh already mentioned, there can be only one LIMIT per select statement.
So this is forbidden:
select *
from a limit 5
join b limit 6 an a.x = b.x;
But this would be allowed:
select *
from (select * from a limit 5) alim
join (select * from b limit 6) blim on alim.x = blim.x;
As you simply want to test your query however, I'd suggest, you take a sample from trainSearchStream to test it. The modulo operator % is great for taking samples:
CREATE TABLE big3 AS SELECT *
FROM (select * from trainSearchStream where searchid % 12345 = 6789) a
LEFT OUTER JOIN SearchInfo b ON b.SearchID = a.SearchID
LEFT OUTER JOIN AdsInfo c ON c.AdID = a.AdID;
Choose whatever numbers you like for the modulo operation. Above statement divides your trainSearchStream count by about 12345 (provided the IDs are evenly spread).

Alternatives to full outer join for logical OR in tree structure query

I hope the title is clear enough. I've been implementing logical AND/OR for tree structures which are kept in the database, using a simple nodes and a parent-child association table.
A sample tree has a structure like this:
A sample tree structure query is as follows:
The double lines in the query pattern mean that A has a child of type B (somewhere down its child nodes) OR C. I have implemented A -> HASCHILD -> C -> HASCHILD -> E with an inner join, and A -> HASCHILD -> B -> HASCHILD -> E is implemented like this.
The trick is joining these two branches on A. Since this is an OR operation, either B branch or C branch may not exist. The only method I could think of if to use full outer joins of two branches with A's node_id as the key. To avoid details, let me give this simplified snippet from my SQL query:
WITH A as (....),
B as (....),
C as (....),
......
SELECT *
from
A
INNER JOIN A_CONTAINS_B ON A.NODE_ID = A_CONTAINS_B.parent
INNER JOIN B ON A_CONTAINS_B.children #> ARRAY[B.NODE_ID]
INNER JOIN .....
full OUTER JOIN -- THIS IS WHERE TWO As ARE JOINED
(select
A2.NODE_ID AS A2_NODE_ID
from
A2
INNER JOIN A_CONTAINS_C ON A2.NODE_ID = C_CONTAINS_C.parent
INNER JOIN C ON A_CONTAINS_C.children #> ARRAY[C.NODE_ID]
INNER JOIN ....)
as other_branch
ON other_branch.A2_NODE_ID = A.NODE_ID
This query links two As which actually represent the same A using node_id, and if B or C does not exist, nothing breaks.
The result set has duplicates of course, but I can live with that. I can't however think of another way to implement OR in this context. ANDs are easy, they are inner joins, but left outer join is the only approach that lets me connect As. UNION ALL with dummy columns for both branches is not an option because I can't connect As in that case.
Do you have any alternatives to what I'm doing here?
UPDATE
TokenMacGuy's suggestion gives me a cleaner route than what I have at the moment. I should have remembered UNION.
Using the first approach he has suggested, I can apply a query pattern decomposition, which would be a consistent way of breaking down queries with logical operators. The following is a visual representation of what I'm going to do, just in case it helps someone else visualize the process:
This helps me do a lot of nice things, including creating a nice result set where query pattern components are linked to results.
I've deliberately avoided details of tables or other context, because my question is about how to join results of queries. How I handle the hierarchy in DB is a different topic which I'd like to avoid. I'll add more details into comments. This is basically an EAV table accomponied by a hierarchy table. Just in case someone would like to see it, here is the query I'm running without any simplifications, after following TokenMacGuy's suggestion:
WITH
COMPOSITION1 as (select comp1.* from temp_eav_table_global as comp1
WHERE
comp1.actualrmtypename = 'COMPOSITION'),
composition_contains_observation as (select * from parent_child_arr_based),
OBSERVATION as (select obs.* from temp_eav_table_global as obs
WHERE
obs.actualrmtypename = 'OBSERVATION'),
observation_cnt_element as (select * from parent_child_arr_based),
OBS_ELM as (select obs_elm.* from temp_eav_table_global as obs_elm
WHERE
obs_elm.actualrmtypename= 'ELEMENT'),
COMPOSITION2 as (select comp_node_tbl2.* from temp_eav_table_global as comp_node_tbl2
where
comp_node_tbl2.actualrmtypename = 'COMPOSITION'),
composition_contains_evaluation as (select * from parent_child_arr_based),
EVALUATION as (select eva_node_tbl.* from temp_eav_table_global as eva_node_tbl
where
eva_node_tbl.actualrmtypename = 'EVALUATION'),
eval_contains_element as (select * from parent_child_arr_based),
ELEMENT as (select el_node_tbl.* from temp_eav_table_global as el_node_tbl
where
el_node_tbl.actualrmtypename = 'ELEMENT')
select
'branch1' as branchid,
COMPOSITION1.featuremappingid as comprootid,
OBSERVATION.featuremappingid as obs_ftid,
OBSERVATION.actualrmtypename as obs_tn,
null as ev_ftid,
null as ev_tn,
OBS_ELM.featuremappingid as obs_elm_fid,
OBS_ELm.actualrmtypename as obs_elm_tn,
null as ev_el_ftid,
null as ev_el_tn
from
COMPOSITION1
INNER JOIN composition_contains_observation ON COMPOSITION1.featuremappingid = composition_contains_observation.parent
INNER JOIN OBSERVATION ON composition_contains_observation.children #> ARRAY[OBSERVATION.featuremappingid]
INNER JOIN observation_cnt_element on observation_cnt_element.parent = OBSERVATION.featuremappingid
INNER JOIN OBS_ELM ON observation_cnt_element.children #> ARRAY[obs_elm.featuremappingid]
UNION
SELECT
'branch2' as branchid,
COMPOSITION2.featuremappingid as comprootid,
null as obs_ftid,
null as obs_tn,
EVALUATION.featuremappingid as ev_ftid,
EVALUATION.actualrmtypename as ev_tn,
null as obs_elm_fid,
null as obs_elm_tn,
ELEMENT.featuremappingid as ev_el_ftid,
ELEMENT.actualrmtypename as ev_el_tn
from
COMPOSITION2
INNER JOIN composition_contains_evaluation ON COMPOSITION2.featuremappingid = composition_contains_evaluation.parent
INNER JOIN EVALUATION ON composition_contains_evaluation.children #> ARRAY[EVALUATION.featuremappingid]
INNER JOIN eval_contains_element ON EVALUATION.featuremappingid = eval_contains_element.parent
INNER JOIN ELEMENT on eval_contains_element.children #> ARRAY[ELEMENT.featuremappingid]
the relational equivalent to ∨ is &Union;. You could either use union to combine a JOIN b JOIN e with a JOIN c JOIN e or just use the union of b and c and join on the resulting, combined relation, something like a JOIN (b UNION c) JOIN e
More completely:
SELECT *
FROM a
JOIN (
SELECT
'B' source_relation,
parent,
b.child,
b_thing row_from_b,
NULL row_from_c
FROM a_contains_b JOIN b ON a_contains_b.child = b.node_id
UNION
SELECT
'C',
parent
c.child,
NULL,
c_thing
FROM a_contains_c JOIN c ON a_contains_c.child = c.node_id
) a_c ON A.NODE_ID = a_e.parent
JOIN e ON a_c.child = e.node_id;