One composite index or two separate indices - sql

I have three tables A, B and C that must be joined.
The columns in A are a1, a2, and a3.
Columns in B include b1, b2, b3, b4 and b5.
The columns in C are c1, c2 and c3.
The query is going to be
SELECT B.*
FROM A
INNER JOIN B ON a1 = b1 and a2 = b2
INNER JOIN C ON b3 = c1
WHERE <some conditions>
Should I create one composite index on (b1, b2, b3), or two indices, idx1 = (b1, b2) and idx2 = (b3)? Table A already has an index on (a1, a2), and table C has an index on (c1).

It all depends on your data volume and on how you will be querying the table. In many situations (for example, a report against a significant portion of your table's rows) it's better not to use indexes at all. Even if indexes are appropriate, you are using inner joins: if your WHERE clause has predicates that make Oracle decide to start with table B or C instead of A, that totally changes your indexing requirements. That's why nobody can answer your question without knowing a lot more about your query plans and data volumes.
But assuming indexes are appropriate because you're going after only a tiny percentage of your data, and assuming your predicate will get Oracle to start with table A every time (big assumptions here), you'll get the best query performance out of concatenated indexes on the tables you are joining to that line up with your predicates. So, you may want to index:
B(b1,b2) -- single, two-column index, to satisfy your join to B
C(c1) -- single, one-column index, to satisfy your join to C.
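To see how the suggested indexes line up with the joins, here is a minimal sketch that uses SQLite through Python's sqlite3 module as a stand-in for Oracle. The index names, the sample rows, the a3 filter, and the use of SQLite's CROSS JOIN to pin A as the driving table are all assumptions made for illustration; Oracle's optimizer may well choose differently on real data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE A (a1 INTEGER, a2 INTEGER, a3 TEXT);
    CREATE TABLE B (b1 INTEGER, b2 INTEGER, b3 INTEGER, b4 TEXT, b5 TEXT);
    CREATE TABLE C (c1 INTEGER, c2 TEXT, c3 TEXT);

    -- Indexes that already exist according to the question.
    CREATE INDEX idx_a_a1_a2 ON A (a1, a2);
    CREATE INDEX idx_c_c1    ON C (c1);
    -- The index suggested in the answer: it matches the join predicate on B.
    CREATE INDEX idx_b_b1_b2 ON B (b1, b2);

    INSERT INTO A VALUES (1, 10, 'keep'), (2, 20, 'skip');
    INSERT INTO B VALUES (1, 10, 100, 'x', 'y'), (2, 20, 200, 'p', 'q');
    INSERT INTO C VALUES (100, 'c', 'c'), (200, 'd', 'd');
""")

# In SQLite, CROSS JOIN stops the planner from swapping the two tables,
# which mimics the answer's assumption that the query starts with A.
QUERY = """
    SELECT B.*
    FROM A
    CROSS JOIN B ON a1 = b1 AND a2 = b2
    INNER JOIN C ON b3 = c1
    WHERE a3 = 'keep'
"""

rows = conn.execute(QUERY).fetchall()

# The last column of each EXPLAIN QUERY PLAN row is the readable plan step.
plan_steps = [r[3] for r in conn.execute("EXPLAIN QUERY PLAN " + QUERY)]
for step in plan_steps:
    print(step)
```

The printed plan is where you would check whether the lookup into B goes through the two-column index. On Oracle the equivalent check is the execution plan of the real query against real data volumes.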

Related

find difference between two tables with different keys

I have table A that looks like this:
id | name
a1 | A1
a2 | A2
b2 | B2
and table B that looks like this:
volume_id | volume_name
a1 | A1
b1 | B1
b2 | B2
I want to write a query (or several) that gives me the ids (or volume_ids, as they represent the same thing) that exist in table A but not in table B, and vice versa.
I am using psql as my postgres cli.
You can use full join:
select a.id, b.volume_id
from a full join
b
on a.id = b.volume_id
where a.id is null or b.volume_id is null;
This puts the results in separate columns, so you can see which is missing.
You can use a FULL JOIN, which returns every id from both tables and leaves NULL on the side where the id is missing.
select t1.id, t2.volume_id from a as t1 full join b as t2 on t1.id=t2.volume_id;
As a side note, a LEFT JOIN can do the same job in similar circumstances, but only if the left-hand table contains every value found in either table; otherwise values that exist only in the right-hand table will not be displayed.
That is not the case here: table a does not include the value b1, which is why you must use a full join in this particular example.
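As a runnable illustration with the sample rows from the question, the sketch below uses SQLite through Python's sqlite3 module. Because SQLite builds before 3.39 have no FULL JOIN, it swaps in an equivalent formulation: two LEFT JOIN anti-joins combined with UNION ALL. On PostgreSQL the FULL JOIN query with the IS NULL filter can be used directly; the table definitions here are assumptions based on the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id TEXT PRIMARY KEY, name TEXT);
    CREATE TABLE b (volume_id TEXT PRIMARY KEY, volume_name TEXT);
    INSERT INTO a VALUES ('a1', 'A1'), ('a2', 'A2'), ('b2', 'B2');
    INSERT INTO b VALUES ('a1', 'A1'), ('b1', 'B1'), ('b2', 'B2');
""")

# First branch: ids in a with no partner in b.
# Second branch: volume_ids in b with no partner in a.
# Together they produce the same rows as the filtered FULL JOIN.
DIFF = """
    SELECT a.id AS id, NULL AS volume_id
    FROM a LEFT JOIN b ON a.id = b.volume_id
    WHERE b.volume_id IS NULL
    UNION ALL
    SELECT NULL, b.volume_id
    FROM b LEFT JOIN a ON a.id = b.volume_id
    WHERE a.id IS NULL
"""

missing = conn.execute(DIFF).fetchall()
for only_in_a, only_in_b in missing:
    print("only in a:", only_in_a, "| only in b:", only_in_b)
```

Keeping the two ids in separate columns, as in the first answer, makes it obvious which table each unmatched value came from.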

How can I copy records and all its related records from one database to another database using the index of the highest parent table?

The database that I'm using is Informix; the version is 9.4.
I have a scenario where I'm trying to migrate some specific records from one database to another. Here's an example of what I'm trying to do:
Let's say I have three tables A, B, C in database D1. I need to copy some records from these three tables to database D2.
The relations between A, B, C are here below,
A - parent with primary key a1
B - child to A with primary key b1 and the reference key a1
C - child to B with primary key c1 and the reference key b1
I want to move some records from database D1 with a specific condition a1 = 'something'. Along with A, I need to copy records from B and C that are related to A directly (A<->B) or indirectly (A<->C through B).
What is the easiest and most reliable way to copy the data?
FYI. This is a one time job, not a continuous one.
On the face of it, if the volume of data to be transferred is small enough, then you could use:
BEGIN WORK;
INSERT INTO D2:A
SELECT * FROM A WHERE a1 = 'something';
INSERT INTO D2:B
SELECT B.* FROM B JOIN A ON B.a1 = A.a1
WHERE A.a1 = 'something';
INSERT INTO D2:C
SELECT C.*
FROM C
JOIN B ON C.b1 = B.b1
JOIN A ON B.a1 = A.a1
WHERE A.a1 = 'something';
COMMIT WORK;
It might be possible to simplify things if the condition on A is really as simple as a1 = 'something' so that there is only one record from A to transfer (since a1 is the primary key of A).
BEGIN WORK;
INSERT INTO D2:A
SELECT * FROM A WHERE A.a1 = 'something';
INSERT INTO D2:B SELECT B.* FROM B
WHERE B.a1 = 'something';
INSERT INTO D2:C
SELECT C.*
FROM C
JOIN B ON C.b1 = B.b1
WHERE B.a1 = 'something';
COMMIT WORK;
This avoids joins back to table A.
If the volume of data makes this preposterous, you're probably stuck with something like unloading and reloading the data. You'd be wise to lock the tables in share mode while unloading them.
What volume makes the triple-insert operation preposterous? That's hard to answer, but if the transferred data requires more logical log space than you've got on the server running D2, then you've got problems. Whether it is then best to split the transactions or whether to go for unload/reload is hard to decide. On the whole, unload/reload is probably better if the space required is too large.
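The parent-first copy order can be tried out with a small sketch. It uses SQLite through Python's sqlite3 module, where an attached in-memory database stands in for D2 and the `d2.` prefix plays the role of Informix's `D2:` notation. The schema details, the sample rows, and the helper name `copy_subtree` are assumptions for illustration; this is not Informix syntax.

```python
import sqlite3

# "main" plays the role of D1; the attached database plays D2.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS d2")

SCHEMA = """
    CREATE TABLE {db}.A (a1 TEXT PRIMARY KEY);
    CREATE TABLE {db}.B (b1 INTEGER PRIMARY KEY, a1 TEXT);
    CREATE TABLE {db}.C (c1 INTEGER PRIMARY KEY, b1 INTEGER);
"""
conn.executescript(SCHEMA.format(db="main") + SCHEMA.format(db="d2"))
conn.executescript("""
    INSERT INTO main.A VALUES ('something'), ('other');
    INSERT INTO main.B VALUES (1, 'something'), (2, 'something'), (3, 'other');
    INSERT INTO main.C VALUES (10, 1), (11, 2), (12, 3), (13, 1);
""")


def copy_subtree(conn, key):
    """Copy one A row plus its B and C descendants in a single transaction."""
    # "with conn" commits on success and rolls back if any INSERT fails.
    with conn:
        # Parents first, so every child row finds its parent in the target.
        conn.execute("INSERT INTO d2.A SELECT * FROM main.A WHERE a1 = ?", (key,))
        conn.execute("INSERT INTO d2.B SELECT * FROM main.B WHERE a1 = ?", (key,))
        conn.execute(
            """
            INSERT INTO d2.C
            SELECT C.*
            FROM main.C AS C
            JOIN main.B AS B ON C.b1 = B.b1
            WHERE B.a1 = ?
            """,
            (key,),
        )


copy_subtree(conn, "something")
copied_c = [r[0] for r in conn.execute("SELECT c1 FROM d2.C ORDER BY c1")]
print(copied_c)
```

Wrapping all three inserts in one transaction means a failure partway through leaves the target database unchanged, which is the same reason the answer brackets its statements with BEGIN WORK and COMMIT WORK.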

Choosing between values using left join or union

I have a table A with columns ABK and ACK. Each row can have a value in either ABK or ACK, but not in both at the same time.
ABK and ACK are keys used to fetch more detailed information from tables B and C, respectively.
B has columns named BK (key) and B1, and C has columns named CK (key) and C1.
When fetching information from B and C, I want to select between B1 and C1 depending on which column in A (ABK or ACK) is NOT null.
What would be better considering readability and performance:
1
select COALESCE(B.B1, C.C1) as X from A
left join B on A.ABK = B.BK
left join C on A.ACK = C.CK
OR
2
select B.B1 as X from A join B on A.ABK = B.BK
UNION
select C.C1 as X from A join C on A.ACK = C.CK
In other words should I do a left join with all the tables I want to use or do union?
I am guessing that, readability-wise, the UNION is better, but I am not sure about performance.
Also, B and C do not overlap, i.e. there are no duplicates between B and C.
I don't think the answer in the question marked as a duplicate of mine is correct for my case, since it focuses on the possibility of duplicates among tables B and C, but as stated, B and C are mutually exclusive.
Your two queries aren't necessarily accomplishing the same thing.
The LEFT JOIN version returns one row for each row of A, so repeated values stay in the result (and it returns extra rows if a key is duplicated in B or C), whereas UNION (as opposed to UNION ALL) removes duplicate values from the result. Therefore, before thinking about performance, I would decide which method to use based on whether you are interested in preserving duplicates in your results.
See here for more info: Performance of two left joins versus union.
First, you can test the performance on your database and your data. You can also look at the execution plans.
Second, the queries are not equivalent, because the union removes duplicates.
That said, I would go for the first version using left join for both readability and performance. Readability is obviously a matter of taste. I think having all the logic in the select makes it more apparent what the query is doing.
More importantly, the first will start with table A and be able to use indexes for the additional lookups. There is no additional phase for removing duplicates. My guess is that two joins is faster than two joins and duplicate removal.
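The difference between the two queries shows up as soon as two rows of A resolve to the same value. Below is a minimal sketch using SQLite through Python's sqlite3 module; the colour values and row counts are made-up sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE B (BK INTEGER PRIMARY KEY, B1 TEXT);
    CREATE TABLE C (CK INTEGER PRIMARY KEY, C1 TEXT);
    CREATE TABLE A (ABK INTEGER, ACK INTEGER);
    -- Two different B keys happen to carry the same B1 value.
    INSERT INTO B VALUES (1, 'red'), (2, 'red');
    INSERT INTO C VALUES (1, 'blue');
    INSERT INTO A VALUES (1, NULL), (2, NULL), (NULL, 1);
""")

LEFT_JOIN = """
    SELECT COALESCE(B.B1, C.C1) AS X FROM A
    LEFT JOIN B ON A.ABK = B.BK
    LEFT JOIN C ON A.ACK = C.CK
"""
UNION = """
    SELECT B.B1 AS X FROM A JOIN B ON A.ABK = B.BK
    UNION
    SELECT C.C1 AS X FROM A JOIN C ON A.ACK = C.CK
"""
# UNION ALL skips the duplicate-removal step that UNION performs.
UNION_ALL = UNION.replace("UNION", "UNION ALL")

left_join_rows = sorted(conn.execute(LEFT_JOIN).fetchall())
union_rows = sorted(conn.execute(UNION).fetchall())
union_all_rows = sorted(conn.execute(UNION_ALL).fetchall())

print("left join:", left_join_rows)
print("union    :", union_rows)
print("union all:", union_all_rows)
```

With this data the LEFT JOIN and UNION ALL versions keep one row per row of A, while UNION collapses the repeated value. The LEFT JOIN version would additionally keep a row of A whose key has no match at all (as NULL), which the inner joins in the UNION versions drop.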

Optimizing a two-level SQL query

Here's the layout of the relevant parts of my database:
(BTW, I made this diagram with wwwsqldesigner)
Now, I like to query all rows of C which match a particular row of A.
The query I came up with myself works. E.g, to look up rows in C matching A's row 123:
SELECT C.* FROM C
LEFT JOIN B1 ON (B1.id = C.id_B1)
LEFT JOIN B2 ON (B2.id = C.id_B2)
WHERE B1.id_A = 123 OR B2.id_A = 123
However, I believe the above query is rather inefficient as it collects all rows of B1 and B2 in a large set before reducing it down again, right?
I believe I should be able to first run a query against B1 and B2 each, selecting by their id_A values, and then join those results somehow to the matching C rows.
I've looked at sqlite.org's docs for the SELECT command but the possibilities overwhelm me.
How does one figure this out? A bit of explaining the thought process of solving this would be appreciated.
(Also, if you could suggest a better title for this question - I don't really know how to pinpoint this)
Your method is fine, although it seems like it might be returning duplicates.
You might see if one of these is faster:
SELECT C.*
FROM C
WHERE EXISTS (SELECT 1 FROM B1 WHERE B1.id = C.id_B1 AND B1.id_A = 123) OR
EXISTS (SELECT 1 FROM B2 WHERE B2.id = C.id_B2 AND B2.id_A = 123);
This will work best with indexes. An index on id in the "B" tables is fine, although (id, id_A) would be better.
OR:
SELECT DISTINCT C.*
FROM C JOIN
B1 ON B1.id = C.id_B1
WHERE B1.id_A = 123
UNION
SELECT DISTINCT C.*
FROM C JOIN
B2 ON B2.id = C.id_B2
WHERE B2.id_A = 123;
If you know there are no duplicates, then use union all instead of union.
I believe the above query is rather inefficient as it collects all rows of B1 and B2 in a large set before reducing it down again, right?
I may be wrong for SQLite, but any database engine worth its salt should be able to optimize the query by finding the rows in B1 and B2 that match your where clause, so no, it would not load the entire tables into memory.
You can see the plan that the query uses by prefacing the query with EXPLAIN QUERY PLAN. As long as the engine doesn't do a SCAN TABLE on B1 and/or B2 then the query should be fine.
Note that you can improve the performance of this query dramatically by adding indexes on B1.id_A and B2.id_A.
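To compare the variants side by side, here is a small sketch using Python's sqlite3 module. The original diagram is not available, so the table definitions are assumptions inferred from the queries, and the index names and sample rows are made up. The UNION variant joins B2 on B2.id.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE A  (id INTEGER PRIMARY KEY);
    CREATE TABLE B1 (id INTEGER PRIMARY KEY, id_A INTEGER);
    CREATE TABLE B2 (id INTEGER PRIMARY KEY, id_A INTEGER);
    CREATE TABLE C  (id INTEGER PRIMARY KEY, id_B1 INTEGER, id_B2 INTEGER);
    -- Indexes on the filter columns, as suggested in the answer.
    CREATE INDEX idx_b1_id_a ON B1 (id_A);
    CREATE INDEX idx_b2_id_a ON B2 (id_A);

    INSERT INTO A  VALUES (123), (456);
    INSERT INTO B1 VALUES (1, 123), (2, 456);
    INSERT INTO B2 VALUES (1, 123), (2, 456);
    INSERT INTO C  VALUES (10, 1, 2), (11, 2, 1), (12, 1, 1), (13, 2, 2);
""")

QUERIES = {
    "left join": """
        SELECT C.* FROM C
        LEFT JOIN B1 ON (B1.id = C.id_B1)
        LEFT JOIN B2 ON (B2.id = C.id_B2)
        WHERE B1.id_A = :a OR B2.id_A = :a
    """,
    "exists": """
        SELECT C.* FROM C
        WHERE EXISTS (SELECT 1 FROM B1 WHERE B1.id = C.id_B1 AND B1.id_A = :a)
           OR EXISTS (SELECT 1 FROM B2 WHERE B2.id = C.id_B2 AND B2.id_A = :a)
    """,
    "union": """
        SELECT C.* FROM C JOIN B1 ON B1.id = C.id_B1 WHERE B1.id_A = :a
        UNION
        SELECT C.* FROM C JOIN B2 ON B2.id = C.id_B2 WHERE B2.id_A = :a
    """,
}

params = {"a": 123}
results = {name: sorted(conn.execute(sql, params).fetchall())
           for name, sql in QUERIES.items()}

for name, sql in QUERIES.items():
    print(name, "->", results[name])
    # The fourth column of EXPLAIN QUERY PLAN holds the readable plan step.
    for row in conn.execute("EXPLAIN QUERY PLAN " + sql, params):
        print("   ", row[3])
```

On real data, the plan output is the place to look for full-table scans of B1 or B2; whether the indexes are actually chosen depends on the SQLite version and the table statistics.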

Distinct based on collection

Does anybody know how to distinct by items in child table?
I am not sure if it is possible but maybe somebody can clarify that.
For example I have 2 tables: A and B (1 to many).
I need to join them and select rows from A only once if B has the same items for corresponding A-row.
I need to do it on SQL server side because I want to have paging there using ROW_NUMBER.
UPDATE:
CREATE TABLE [dbo].[A] ([ID] int NOT NULL, [Name] varchar(Max) NOT NULL)
CREATE TABLE [dbo].[B] ([ID] int NOT NULL, [A_ID] int NOT NULL, [Name] varchar(Max) NOT NULL)
INSERT INTO A VALUES (1, 'A1')
INSERT INTO A VALUES (2, 'A2')
INSERT INTO B VALUES (1, 1, 'B1')
INSERT INTO B VALUES (2, 1, 'B2')
INSERT INTO B VALUES (3, 2, 'B1')
INSERT INTO B VALUES (4, 2, 'B2')
This should return only A1, B1, B2 thinking that A1 and A2 are equals by their B1 and B2.
Let me know if it is clear now.
http://sqlfiddle.com/#!3/897704/1/0
This should do the trick. I threw in a little extra CHECKSUM_AGG magic so that if you have a really big table, it will still perform very well.
WITH Grouped AS (
    SELECT
        A_ID,
        GroupID = Checksum_Agg(Checksum(B.Name))
    FROM
        B
    GROUP BY
        B.A_ID
), DistinctA AS (
    SELECT G1.A_ID
    FROM
        Grouped G1
    WHERE
        NOT EXISTS (
            SELECT *
            FROM Grouped G2
            WHERE
                G1.GroupID = G2.GroupID
                AND G1.A_ID > G2.A_ID
                AND NOT EXISTS (
                    SELECT *
                    FROM
                        (SELECT * FROM B B1 WHERE G1.A_ID = B1.A_ID) B1
                        FULL JOIN (SELECT * FROM B B2 WHERE G2.A_ID = B2.A_ID) B2
                            ON B1.Name = B2.Name
                    WHERE
                        B1.A_ID IS NULL
                        OR B2.A_ID IS NULL
                )
        )
)
SELECT
    A_ID = A.ID,
    AName = A.Name,
    B_ID = B.ID,
    BName = B.Name
FROM
    DistinctA DA
    INNER JOIN A
        ON DA.A_ID = A.ID
    INNER JOIN B
        ON A.ID = B.A_ID
;
See this query live in a SQL Fiddle
Please note that the CHECKSUM_AGG by itself will NOT guarantee correctness. Actual correctness is guaranteed by the FULL JOIN part of the query, that checks for any Names in the B table that don't match (one or the other is NULL for that group of A_IDs). But the CHECKSUM_AGG acts as a hashing function so that the FULL JOIN only has to compare a very few potential duplicates, instead of every other group of A_IDs in the entire B table.
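The correctness check described above boils down to this: two A rows are duplicates when the symmetric difference of their B names is empty. The sketch below shows only that check, run on SQLite through Python's sqlite3 module rather than on SQL Server. It swaps the FULL JOIN for a pair of EXCEPT subqueries and leaves out the CHECKSUM_AGG pre-filter, which SQLite does not have. Row A3 and its B rows are hypothetical additions to the question's sample data, included to show a non-duplicate being kept.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE A (ID INTEGER NOT NULL, Name TEXT NOT NULL);
    CREATE TABLE B (ID INTEGER NOT NULL, A_ID INTEGER NOT NULL, Name TEXT NOT NULL);
    INSERT INTO A VALUES (1, 'A1'), (2, 'A2'), (3, 'A3');
    INSERT INTO B VALUES (1, 1, 'B1'), (2, 1, 'B2'),
                         (3, 2, 'B1'), (4, 2, 'B2'),
                         (5, 3, 'B1'), (6, 3, 'B3');
""")

# Keep an A row only if no earlier A row has exactly the same set of B names.
# Two sets are equal when neither "cur minus prev" nor "prev minus cur"
# returns any row.
DISTINCT_A = """
    SELECT cur.ID
    FROM A AS cur
    WHERE NOT EXISTS (
        SELECT 1
        FROM A AS prev
        WHERE prev.ID < cur.ID
          AND NOT EXISTS (SELECT Name FROM B WHERE A_ID = cur.ID
                          EXCEPT
                          SELECT Name FROM B WHERE A_ID = prev.ID)
          AND NOT EXISTS (SELECT Name FROM B WHERE A_ID = prev.ID
                          EXCEPT
                          SELECT Name FROM B WHERE A_ID = cur.ID)
    )
    ORDER BY cur.ID
"""

distinct_ids = [row[0] for row in conn.execute(DISTINCT_A)]
print(distinct_ids)
```

Without a cheap pre-filter such as the checksum, every A row is compared against every earlier A row, so this form is only reasonable for small tables.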
A few things I see:
It is not good to name columns just ID. Then you have to alias them all over the place. The column A_ID should be the same in every table, including A.
I realize this was just an example, but underscores are the bane of professional SQL writers. They take just as many keys to type as PascalCase, but require moving the weakest finger off the home row and two keys to the right. WhateverID is far superior to WHATEVER_ID. Even WHATEVERID is tolerable.
It sounds like your B table could use another table to join to, to look up the Name value. If two rows in B with the same name mean the same thing, then the chance of one of them having a misspelling becomes a very real proposition. Instead, B may need to join to another table that has all the unique names and provides an ID.
Your B table may not need an ID column (as shown in this example). In many cases, even if individual B rows are going to be accessed via a key, it can be better for them to be reached via the A_ID and the Name (or better, an ID representing the name). Of course, I don't know what the data in these tables actually is, and there are many times when a surrogate key is appropriate. In the case that B is really a many-to-many join table, even if it has a few extra columns, most of the time such tables should not have a surrogate key.
One more thing: if multiple columns in the B table, besides Name, are needed to know if the row is a duplicate compared to another A's B row, then that can be accomplished. Please let me know if this is the case.
It would actually be helpful if your data was meaningful in the real world. Exposing a tiny bit about the business objects can hardly reveal anything truly private, and it helps people who are answering you to answer better when your example data is more concrete. It also actually helps you understand better and faster the answers you get--since my above query is 100% abstract, it will be hard to grok.
Another note: I sure wish SQL Server would provide DISUNION to complement UNION, INTERSECT, and EXCEPT. Then that would be another option for the part of the query where I used FULL JOIN.
Alternate query. Shorter than ErikE's, but may break when you try to add additional columns.
Essentially though, the inner query concatenates all the values of B.Name into a single string (ie "B1,B2," etc). If all the parts of B are the same, then you can GROUP BY in the outer query. As you mentioned above the choice of A.Name is arbitrary at this stage, so I have done MIN to get the first.
SELECT MIN(sub1.Name), sub1.conc
FROM (SELECT A.Name,
             (SELECT [Name] + ','
              FROM [B]
              WHERE B.A_ID = A.ID
              ORDER BY [Name]  -- fixed order so equal sets yield equal strings
              FOR XML PATH('')) conc
      FROM [A]) sub1
GROUP BY sub1.conc