DB2 GROUP BY and indexes with DISTINCT aggregate functions

I'm trying to flatten out a query result set with some COUNT functions and a GROUP BY clause. Basically, I have a query that can return over a dozen rows for what will essentially be processed as one object. Furthermore, the columns that cause this are only added up after the result set is processed, so this seems an ideal time for aggregation. For example:
SELECT
    A.ID, A.NAME, A.DETAILS,
    COUNT(DISTINCT CASE WHEN B.TYPE = 'ONE' THEN B.ID2 END) AS B1,
    COUNT(DISTINCT CASE WHEN B.TYPE = 'TWO' THEN B.ID2 END) AS B2,
    COUNT(DISTINCT CASE WHEN B.TYPE = 'THREE'
                    AND B.SUBTYPE = 'ONE-ONE' THEN B.ID2 END) AS B3
FROM A
LEFT JOIN B ON B.A_ID = A.ID
GROUP BY A.ID, A.NAME, A.DETAILS
The idea is to get a count of all unique associated B objects of various types/subtypes. The problem is that, due to the nature of the possible queries and the way the database is structured (this is greatly simplified; there would be numerous joins, subqueries, and some parameters, but it's enough to get the gist), I could get duplicate results for B.ID2 on each instance of A. That necessitates the DISTINCT; otherwise every B.ID2 for that value of A is counted and I get incorrect results.

Unfortunately, each aggregate function beyond the first then causes a tablescan and a TEMP table in the explain, and no amount of indexing seems to fix it. I'm not sure I can eliminate the duplicates without significant changes to the queries, which could themselves cause larger performance problems. Without the aggregation, I have to join to B once for every type/subtype I want, include B.ID2 in the SELECT, and then count the rows up once the query returns. That is going to seriously bloat the result set and I'd rather avoid it.
Is there a viable alternative here that I'm missing or perhaps some method of indexing those columns that might eliminate the tablescan and TEMP tables? Or is there really no good solution to this?
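For what it's worth, the conditional COUNT(DISTINCT CASE ...) pattern itself behaves the same way outside DB2. A minimal sketch in Python with sqlite3 (table contents invented; SQLite's plans won't match DB2's, this only demonstrates the counting semantics):

```python
import sqlite3

# Hypothetical tables A and B mirroring the simplified query in the
# question; SQLite stands in for DB2 here.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE A (ID INTEGER PRIMARY KEY, NAME TEXT, DETAILS TEXT);
    CREATE TABLE B (ID2 INTEGER, A_ID INTEGER, TYPE TEXT, SUBTYPE TEXT);
    INSERT INTO A VALUES (1, 'first', 'd1');
    -- duplicate (ID2=10, TYPE='ONE') rows: DISTINCT must collapse them
    INSERT INTO B VALUES (10, 1, 'ONE',   NULL),
                         (10, 1, 'ONE',   NULL),
                         (11, 1, 'ONE',   NULL),
                         (20, 1, 'TWO',   NULL),
                         (30, 1, 'THREE', 'ONE-ONE');
""")
row = con.execute("""
    SELECT A.ID,
           COUNT(DISTINCT CASE WHEN B.TYPE = 'ONE' THEN B.ID2 END) AS B1,
           COUNT(DISTINCT CASE WHEN B.TYPE = 'TWO' THEN B.ID2 END) AS B2,
           COUNT(DISTINCT CASE WHEN B.TYPE = 'THREE'
                               AND B.SUBTYPE = 'ONE-ONE' THEN B.ID2 END) AS B3
    FROM A
    LEFT JOIN B ON B.A_ID = A.ID
    GROUP BY A.ID, A.NAME, A.DETAILS
""").fetchone()
print(row)  # (1, 2, 1, 1): the duplicate ID2=10 rows are counted once
```

The CASE yields NULL for non-matching rows, and COUNT(DISTINCT ...) ignores NULLs, which is what makes the per-type counts independent.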

Related

Oracle: Use only a few tables in the WHERE clause but mention more tables in FROM in a join SQL

What will happen in an Oracle SQL join if I don't use all the tables in the WHERE clause that were mentioned in the FROM clause?
Example:
SELECT A.*
FROM A, B, C, D
WHERE A.col1 = B.col1;
Here I didn't use the C and D tables in the WHERE clause, even though I mentioned them in FROM. Is this OK? Are there any adverse performance issues?
It is poor practice to use that syntax at all. The FROM A,B,C,D syntax has been obsolete since 1992... more than 30 YEARS now. There's no excuse anymore. Instead, every join should always use the JOIN keyword, and specify any join conditions in the ON clause. The better way to write the query looks like this:
SELECT A.*
FROM A
INNER JOIN B ON A.col1 = B.col1
CROSS JOIN C
CROSS JOIN D;
Now we can also see what happens in the question. The query will still run if you fail to specify any conditions for certain tables, but it has the effect of using a CROSS JOIN: the results will include every possible combination of rows from every included relation (where the "A,B" part counts as one relation). If each of the three parts of those joins (A&B, C, D) have just 100 rows, the result set will have 1,000,000 rows (100 * 100 * 100). This is rarely going to give the results you expect or intend, and it's especially suspect when the SELECT clause isn't looking at any of the fields from the uncorrelated tables.
Any table lacking a join predicate will produce a Cartesian product: every row in the intermediate rowset before the join will match every row in the target table. So if you have 10,000 rows and you join, without any join predicate, to a table of 10,000 rows, you will get 100,000,000 rows as a result. There are only a few rare circumstances where this is what you want. At very large volumes it can cause havoc for the database, and DBAs are likely to lock your account.
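The row-count arithmetic above is easy to verify. A throwaway sketch using sqlite3 (table names and sizes invented): C and D appear in FROM with no predicate, so each of the 100 matching (A,B) pairs is multiplied by 100 * 100.

```python
import sqlite3

# Four tiny tables of 100 rows each; only A and B are joined on a
# predicate, so C and D contribute a pure Cartesian product.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE A (col1 INTEGER);
    CREATE TABLE B (col1 INTEGER);
    CREATE TABLE C (x INTEGER);
    CREATE TABLE D (x INTEGER);
""")
for t in ("A", "B", "C", "D"):
    con.executemany(f"INSERT INTO {t} VALUES (?)", [(i,) for i in range(100)])

n = con.execute("""
    SELECT COUNT(*)
    FROM A, B, C, D
    WHERE A.col1 = B.col1
""").fetchone()[0]
print(n)  # 100 matching (A,B) pairs * 100 * 100 = 1000000
```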
If you don't want to use a table, exclude it entirely from your SQL. If you can't, due to some constraint we don't know about, then include a proper join predicate for every table in your WHERE clause and simply don't list any of their columns in your SELECT clause. If there's a cost to the join, you don't need anything from it, and for some very strange reason you can't leave the table out of your SQL completely (this does occasionally happen in reusable code), then you can disable the joins by making their predicates always false. Remember to use outer joins if you do this.
Native Oracle method:
WITH data AS (SELECT ROWNUM col FROM dual CONNECT BY LEVEL < 10) -- test data
SELECT A.*
FROM data a,
data b,
data c,
data d
WHERE a.col = b.col
AND DECODE('Y','Y',NULL,a.col) = c.col(+)
AND DECODE('Y','Y',NULL,a.col) = d.col(+)
ANSI style:
WITH data AS (SELECT ROWNUM col FROM dual CONNECT BY LEVEL < 10)
SELECT A.*
FROM data a
INNER JOIN data b ON a.col = b.col
LEFT OUTER JOIN data c ON DECODE('Y','Y',NULL,a.col) = c.col
LEFT OUTER JOIN data d ON DECODE('Y','Y',NULL,a.col) = d.col
You can plug in a variable for the first 'Y' that you set to 'Y' or 'N' (e.g. var_disable_join). This will bypass the join and avoid both the associated performance penalty and the Cartesian product effect. But again, I want to reiterate: this is an advanced hack and is probably NOT what you need. Simply leaving out the unwanted tables is the right approach 95% of the time.
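To see the disable trick in action outside Oracle, here is a rough SQLite equivalent via Python's sqlite3. SQLite has no DECODE, so a CASE expression stands in, and the tables are invented for the demo:

```python
import sqlite3

# When the flag is 'Y', the join predicate evaluates to NULL = extra.col,
# which is never true, so the outer join passes each row through untouched.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE data (col INTEGER);
    CREATE TABLE extra (col INTEGER);
""")
con.executemany("INSERT INTO data VALUES (?)", [(i,) for i in range(1, 10)])
# two matching rows per value, so an enabled join doubles the row count
con.executemany("INSERT INTO extra VALUES (?)",
                [(i,) for i in range(1, 10) for _ in range(2)])

def run(disable_join):
    return con.execute("""
        SELECT COUNT(*)
        FROM data a
        LEFT JOIN extra c
          ON (CASE WHEN ? = 'Y' THEN NULL ELSE a.col END) = c.col
    """, (disable_join,)).fetchone()[0]

print(run('Y'))  # 9  -- join disabled: each data row passes through once
print(run('N'))  # 18 -- join enabled: each data row matches two extra rows
```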

SQL Method of checking that INNER / LEFT join doesn't duplicate rows

Is there a good or standard SQL method of asserting that a join does not duplicate any rows (produces 0 or 1 copies of the source table row)? Assert as in causes the query to fail or otherwise indicate that there are duplicate rows.
A common problem in a lot of queries is when a table is expected to be 1:1 with another table, but there might exist 2 rows that match the join criteria. This can cause errors that are hard to track down, especially for people not necessarily entirely familiar with the tables.
It seems like there should be something simple and elegant - this would be very easy for the SQL engine to detect (have I already joined this source row to a row in the other table? ok, error out) but I can't seem to find anything on this. I'm aware that there are long / intrusive solutions to this problem, but for many ad hoc queries those just aren't very fun to work out.
EDIT / CLARIFICATION: I'm looking for a one-step query-level fix. Not a verification step on the results of that query.
If you are only testing for linked rows rather than requiring output, then you'd use EXISTS.
More correctly, you need a "semi-join", but this isn't directly supported by most RDBMSs except through EXISTS:
SELECT a.*
FROM TableA a
WHERE EXISTS (SELECT * FROM TableB b WHERE a.id = b.id)
Also see:
Using 'IN' with a sub-query in SQL Statements
EXISTS vs JOIN and use of EXISTS clause
SELECT JoinField
FROM MyJoinTable
GROUP BY JoinField
HAVING COUNT(*) > 1
LIMIT 1
Is that simple enough? Don't have Postgres but I think it's valid syntax.
Something along the lines of
SELECT a.id, COUNT(b.id)
FROM TableA a
JOIN TableB b ON a.id = b.id
GROUP BY a.id
HAVING COUNT(b.id) > 1
Should return rows in TableA that have more than one associated row in TableB.
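As a quick sanity check, here is that HAVING-based duplicate probe run against tiny invented tables with sqlite3:

```python
import sqlite3

# TableB deliberately contains id 2 twice; the query flags every a.id
# that a join would duplicate.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE TableA (id INTEGER);
    CREATE TABLE TableB (id INTEGER);
    INSERT INTO TableA VALUES (1), (2), (3);
    INSERT INTO TableB VALUES (1), (2), (2);   -- id 2 is duplicated
""")
dupes = con.execute("""
    SELECT a.id, COUNT(b.id)
    FROM TableA a
    JOIN TableB b ON a.id = b.id
    GROUP BY a.id
    HAVING COUNT(b.id) > 1
""").fetchall()
print(dupes)  # [(2, 2)]: joining on id 2 would duplicate that source row
```

An empty result means the join is safely 1:1 (or 1:0) for every source row.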

SQLite join optimisation

If you have a query such as:
select a.Name, a.Description from a
inner join b on a.id1 = b.id1
inner join c on b.id2 = c.id2
group by a.Name, a.Description
What would be the most optimal columns to index for this query in SQLite if you consider that there are over 100,000 rows in each of the tables?
The reason that I ask is that I do not get the performance with the query with the group by that I would expect from another RDBMS (SQL Server) when I apply the same optimisation.
Would I be right in thinking that all columns referenced on a single table in a query in SQLite need to be included in a single composite index for best performance?
The problem is that you're expecting SQLite to have the same performance characteristics as a full RDBMS. It won't. SQLite doesn't have the luxury of caching as much in memory, has to rebuild its cache every time you run the application, is probably limited to a set number of cores, and so on. These are the tradeoffs for using an embedded RDBMS over a full one.
As far as optimizations go, try indexing the lookup columns and test. Then try creating a covering index. Be sure to test both selects and the code paths that update the database, since you're speeding up one at the expense of the other. Find the indexing that gives the best balance between the two for your needs and go with it.
From the SQLite query optimization overview:
When doing an indexed lookup of a row, the usual procedure is to do a binary search on the index to find the index entry, then extract the rowid from the index and use that rowid to do a binary search on the original table. Thus a typical indexed lookup involves two binary searches. If, however, all columns that were to be fetched from the table are already available in the index itself, SQLite will use the values contained in the index and will never look up the original table row. This saves one binary search for each row and can make many queries run twice as fast.
For any other RDBMS, I'd say to put a clustered index on b.id1 and c.id2. For SQLite, you might be better off including any columns from b and c that you want to lookup in those indexes too.
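The covering-index behaviour quoted from the docs is easy to observe with EXPLAIN QUERY PLAN. A minimal sqlite3 sketch (column names borrowed from the question, index name invented):

```python
import sqlite3

# With an index on (id1, id2), a query that only touches those two
# columns never has to visit the table itself.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE b (id1 INTEGER, id2 INTEGER);
    CREATE INDEX b_covering ON b (id1, id2);
""")
plan = con.execute(
    "EXPLAIN QUERY PLAN SELECT id2 FROM b WHERE id1 = 1"
).fetchall()
print(plan)
# The detail column should mention "COVERING INDEX": both id1 and id2
# come straight from b_covering, skipping the second binary search.
```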
Beware: I know nothing of possible intricacies of SQLite and its execution plans.
You definitely need indexes on a.id1, b.id1, b.id2 and c.id2. I think a composite index (b.id1, b.id2) could yield a small performance increase. The same goes for (a.id1, a.Name, a.Description).
Since you're not using the other tables for your return columns, perhaps this will be faster:
SELECT DISTINCT a.Name, a.Description
FROM a, b, c
WHERE a.id1 = b.id1
AND b.id2 = c.id2
Looking at the returned columns, since the criteria seems to be only that they must be linked from a to b to c, you could look for all unique a.Name and a.Description pairs.
SELECT DISTINCT a.Name, a.Description
FROM a
WHERE a.id1 IN (
SELECT b.id1
FROM b
WHERE b.id2 IN (
SELECT c.id2
FROM c
)
)
Or, depending on whether every pair of a.Name and a.Description is already unique, there should be some gain in first finding the unique ids and then fetching the other columns.
SELECT a.Name, a.Description
FROM a
WHERE a.id1 IN (
SELECT DISTINCT a.id1
FROM a
WHERE a.id1 IN (
SELECT b.id1
FROM b
WHERE b.id2 IN (
SELECT c.id2
FROM c
)
)
)
I think indexes on a.id1 and b.id2 would give you about as much benefit as you could get in terms of the JOINs. But SQLite offers EXPLAIN, and it might help you determine if there's an avoidable inefficiency in the current execution plan.

How To Create a SQL Index to Improve ORDER BY performance

I have some SQL similar to the following, which joins four tables and then orders the results by the "status" column of the first:
SELECT *
FROM a, b, c, d
WHERE b.aid=a.id AND c.id=a.cid AND a.did=d.id AND a.did='XXX'
ORDER BY a.status
It works. However, it's slow. I've worked out this is because of the ORDER BY clause and the lack of any index on table "a".
All four tables have the PRIMARY KEYs set on the "id" column.
So, I know I need to add an index to table a which includes the "status" column but what else does it need to include? Should "bid", "cid" and "did" be in there too?
I've tried to ask this in a general SQL sense but, if it's important, the target is SQLite for use with Gears.
Thanks in advance,
Jake (noob)
I would say it's slow because the engine is doing scans all over the place instead of seeks. Did you mean to do SELECT a.* instead? That would be faster as well, SELECT * here is equivalent to a.*, b.*, c.*, d.*.
You will probably get better results if you put a separate index on each of these columns:
a.did (so that a.did = 'XXX' is a seek instead of a scan, also helps a.did = d.id)
a.cid (for a.cid = c.id)
b.aid (for a.id = b.aid)
You could try adding Status to the first and second indexes with ASCENDING order, for additional performance - it doesn't hurt.
I'd be curious as to how you worked out that the problem is 'the ORDER BY clause and the lack of any index on table "a".' I find this a little suspicious because there is an index on table a, on the primary key, you later say.
Looking at the nature of the query and what I can guess about the nature of the data, I would think that this query would generally produce relatively few results compared to the size of the tables it's using, and that thus the ORDER BY would be extremely cheap. Of course, this is just a guess.
Whether an index will even help at all is dependent on the data in the table. What indices your query optimizer will use when doing a query is dependent on a lot of different factors, one of the big ones being the expected number of results produced from a lookup.
One thing that would help a lot is if you would post the output of EXPLAINing your query.
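In SQLite specifically, EXPLAIN QUERY PLAN shows the sort disappearing once a suitable index exists. A sketch of the single-table core of the query (schema reduced to just the columns involved, index name invented):

```python
import sqlite3

# An index on (did, status) satisfies both the equality filter and the
# ORDER BY in one pass, so SQLite drops its temp B-tree sort step.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE a (id INTEGER PRIMARY KEY, did TEXT, status INTEGER)")

def plan():
    rows = con.execute("""
        EXPLAIN QUERY PLAN
        SELECT * FROM a WHERE did = 'XXX' ORDER BY status
    """).fetchall()
    return " | ".join(r[3] for r in rows)

before = plan()                       # full scan plus a temp B-tree sort
con.execute("CREATE INDEX a_did_status ON a (did, status)")
after = plan()                        # index search, no sort step
print(before)
print(after)
```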
have you tried joins?
select *
from a
inner join b on a.id = b.aid
inner join c on a.cid = c.id
inner join d on a.did = d.id
where a.did = 'XXX'
order by a.status
the correct use of joins (left, right, inner, outer) depends on the structure of the tables
hope this helps

SQL (any) Request for insight on a query optimization

I have a particularly slow query due to the vast amount of information being joined together. However I needed to add a where clause in the shape of id in (select id from table).
I want to know if there is any gain from the following, and more pressing, will it even give the desired results.
select a.* from a where a.id in (select id from b where b.id = a.id)
as an alternative to:
select a.* from a where a.id in (select id from b)
Update:
MySQL
Can't be more specific sorry
table a is effectively a join between 7 different tables.
use of * is for examples
Edit, b doesn't get selected
Your question was about the difference between these two:
select a.* from a where a.id in (select id from b where b.id = a.id)
select a.* from a where a.id in (select id from b)
The former is a correlated subquery. It may cause MySQL to execute the subquery for each row of a.
The latter is a non-correlated subquery. MySQL should be able to execute it once and cache the results for comparison against each row of a.
I would use the latter.
Both queries you list are the equivalent of:
select a.*
from a
inner join b on b.id = a.id
Almost all optimizers will execute them in the same way.
You could post a real execution plan, and someone here might give you a way to speed it up. It helps if you specify what database server you are using.
YMMV, but I've often found using EXISTS instead of IN makes queries run faster.
SELECT a.* FROM a WHERE EXISTS (SELECT 1 FROM b WHERE b.id = a.id)
Of course, without seeing the rest of the query and the context, this may not make the query any faster.
JOINing may be a more preferable option, but if a.id appears more than once in the id column of b, you would have to throw a DISTINCT in there, and you'd more than likely go backwards in terms of optimization.
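That fan-out versus semi-join difference is easy to demonstrate with sqlite3 and invented data:

```python
import sqlite3

# b contains id 1 twice: a plain join duplicates the matching a row,
# while EXISTS acts as a semi-join and returns each a row at most once.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE a (id INTEGER);
    CREATE TABLE b (id INTEGER);
    INSERT INTO a VALUES (1), (2), (3);
    INSERT INTO b VALUES (1), (1), (2);   -- id 1 appears twice
""")
joined = con.execute(
    "SELECT a.id FROM a JOIN b ON a.id = b.id ORDER BY a.id").fetchall()
exists = con.execute("""
    SELECT a.id FROM a
    WHERE EXISTS (SELECT 1 FROM b WHERE b.id = a.id)
    ORDER BY a.id
""").fetchall()
print(joined)  # [(1,), (1,), (2,)] -- row for id 1 duplicated
print(exists)  # [(1,), (2,)]       -- one row per matched source row
```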
I would never use a subquery like this. A join would be much faster.
select a.*
from a
join b on a.id = b.id
Of course, don't use SELECT * either; never use it when doing a join, as at least one field is repeated, and it wastes network resources to send unneeded data.
Have you looked at the execution plan?
How about
select a.*
from a
inner join b
on a.id = b.id
presumably the id fields are primary keys?
Select a.* from a
inner join (Select distinct id from b) c
on a.id = c.id
I tried all 3 versions and they ran about the same. The execution plan was the same for each (inner join, IN with and without the where clause in the subquery, and EXISTS).
Since you are not selecting any other fields from B, I prefer to use the Where IN(Select...) Anyone would look at the query and know what you are trying to do (Only show in a if in b.).
Your problem is most likely in the seven tables within "a".
Make the FROM table the one that contains "a.id".
Make the next join: inner join b on a.id = b.id.
Then join in the other six tables.
You really need to show the entire query, list all indexes, and give approximate row counts for each table if you want real help.