SQL Join with GROUP BY query optimisation

I'm trying to optimise the following query.
SELECT C.name, COUNT(DISTINCT I.id), COUNT(B.id)
FROM Categories C, Items I, Bids B
WHERE C.id = I.category
AND I.id = B.item_id
GROUP BY C.name
ORDER BY 2 DESC, 3 DESC;
Categories is a small table with 20 records.
Items is a large table with over 50,000 records.
Bids is an even larger table with over 600,000 records.
I have an index on
Categories(name, id), Items(category), and Bids(item_id, id).
The PRIMARY KEY for each table is: Items(id), Categories(id), Bids(id)
Is there any way to optimise this query? Any help is much appreciated.

Without EXPLAIN (ANALYZE, BUFFERS) output this is guesswork.
The query is so simple that nothing can be optimized there.
Make sure that you have correct table statistics; check EXPLAIN (ANALYZE) to see if PostgreSQL's estimates are correct.
Increase shared_buffers so that the whole database fits into RAM (if you can).
Increase work_mem so that all hashes and sorts are performed in memory.
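For example, you could test both points in one session like this (a sketch; the work_mem value is only illustrative):
-- Illustrative value; size it to your hardware
SET work_mem = '256MB';
-- Check whether PostgreSQL's row-count estimates match reality
EXPLAIN (ANALYZE, BUFFERS)
SELECT C.name, COUNT(DISTINCT I.id), COUNT(B.id)
FROM Categories C, Items I, Bids B
WHERE C.id = I.category
AND I.id = B.item_id
GROUP BY C.name
ORDER BY 2 DESC, 3 DESC;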

Not really; you are scanning all records.
How many of the item records are hit by the data from bids? I would imagine all tables are fully scanned and hash joined, with the indexes disregarded.

Your query is really boilerplate, and I am sure that with tables of this size, any server that isn't really low on hardware can run it in a heartbeat. But you can always make things better. Here's a list of optimizations that should, theoretically, boost your query's performance:
Theoretically speaking, your biggest inefficiency here is that you are calculating the cross product of your tables instead of joining them. You can rewrite the query with explicit joins like:
...
FROM Items I
INNER JOIN Bids B
ON I.id = B.item_id
INNER JOIN Categories C
ON C.id = I.category
...
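Put together with the original select list and grouping, the full rewrite reads:
SELECT C.name, COUNT(DISTINCT I.id), COUNT(B.id)
FROM Items I
INNER JOIN Bids B
ON I.id = B.item_id
INNER JOIN Categories C
ON C.id = I.category
GROUP BY C.name
ORDER BY 2 DESC, 3 DESC;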
If we are considering everything performance-wise, your index on Items(category) is inefficient, since it maps only 20 distinct values onto 50K rows. Such a low-selectivity index can hurt, and you may even get better performance without it. However, from a practical point of view, there is a lot of other stuff to consider here, so this may not actually be a big deal.
You have no separate index on the id column of the Items table, and having an index on that column would speed up your first join. (However, PostgreSQL creates a default index on primary key columns, so this is not a big deal either.)
Also, adding EXPLAIN ANALYZE to the beginning of your query shows you the plan the PostgreSQL query planner uses to run your queries. If you know a thing or two about query plans, I suggest you take a look at those results too, to find any remaining inefficiencies.

Related

Is it true that switching the position of tables in a join query will increase data loading speed?

For example:
In table a we have 1,000,000 rows.
In table b we have 5 rows.
Is it faster to use
select * from b inner join a on b.id = a.id
than
select * from a inner join b on a.id = b.id
No, JOIN order doesn't matter: the query engine will reorder the joins based on index statistics and other factors. The join order is changed during optimization.
You can test it all by yourself. Download a sample database like AdventureWorks or Northwind, or try it on your own database:
enable "Include Actual Execution Plan" and run the first query
change the JOIN order and run the query again
compare the execution plans
They should be identical, as the query engine will reorder the joins according to other factors.
The only caveat is the OPTION (FORCE ORDER) hint, which will force the joins to happen in the exact order you have specified.
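For example (SQL Server syntax, using the hypothetical tables a and b from the question):
-- Forces the optimizer to keep the join order exactly as written
SELECT *
FROM b
INNER JOIN a ON b.id = a.id
OPTION (FORCE ORDER);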
It is unlikely. There are lots of factors affecting the speed of joining two tables. That is why database engines have an optimization phase, in which they consider different ways of implementing the query.
There are many different options:
Nested loops, scanning b first and then a.
Nested loops, scanning a first and then b.
Sorting both tables and using a merge join.
Hashing both tables and using a hash join.
Using an index on b.id.
Using an index on a.id.
And these are just high-level descriptions -- there are multiple ways to implement some of these methods. Tables can also be partitioned, adding further complexity.
Join order is just one consideration.
In this case, the runtime of the query is likely to depend on the size of the data being returned, rather than on the actual algorithm used for fetching the data.
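To see which of these strategies the optimizer actually picked for the two queries from the question, you can inspect the plans directly; in PostgreSQL, for instance (hypothetical tables a and b again):
EXPLAIN SELECT * FROM b INNER JOIN a ON b.id = a.id;
EXPLAIN SELECT * FROM a INNER JOIN b ON a.id = b.id;
-- Both will typically show the same plan, e.g. a hash join
-- with the tiny table b as the build side.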

Why does my query in Oracle take longer when I use temporary tables?

I am facing a peculiar issue when using an inner query in Oracle DB. I am fetching data from a table with a huge number of records.
The query I am using contains an inner query.
When I provide the values directly in the inner query, it is much faster.
But when I take exactly the same values from another (temporary) table, via either an inner query or a JOIN, it takes much longer.
Below is the query:
Faster performance
SELECT assembly_item_id menuItemId,
location_id restId,
bill_sequence_id,
bill_config_id
FROM zil_ibat_resolve_bmi_ai_max_v
WHERE assembly_item_id = 8321
AND location_id IN (82, 85, 116, .........)
Slower performance when a SELECT query is used in the inner section
Without JOIN
SELECT assembly_item_id menuItemId,
location_id restId,
bill_sequence_id,
bill_config_id
FROM zil_ibat_resolve_bmi_ai_max_v
WHERE assembly_item_id = 8321
AND location_id IN (SELECT temp_id FROM global_temp_ids)
With JOIN
SELECT assembly_item_id menuItemId, location_id restId, bill_sequence_id, bill_config_id
FROM zil_ibat_resolve_bmi_ai_max_v t1
join global_temp_ids t2
on t1.location_id = t2.temp_id
WHERE t1.assembly_item_id = 8321
Note: zil_ibat_resolve_bmi_ai_max_v is a view.
What is wrong with this query? Why does it take so much longer when I query the table instead of putting the IDs directly in the inner section? Is there an alternative?
Explain Plan
[screenshots: usedSelectQueryInInnerSection, usedJoin, enterNumbersInInnerQuery]
The second and third queries are slow because of the NESTED LOOPS join between the view results and the temporary table. Changing it to a HASH join, perhaps through better optimizer statistics or a USE_HASH hint, should speed up the query.
Problem
This part at the top of the execution plan:
NESTED LOOPS
zil_ibat_resolve_bmi_ai_max_v
global_temp_ids
is similar to this pseudo-code:
for each row of zil_ibat_resolve_bmi_ai_max_v
search index of global_temp_ids
Based on the images, the execution plan for the view does not change between queries, so that part must be relatively fast. The look-up of the temporary table uses a unique index search, which must also be fast. But it is only fast when done once, and we can tell from the Cardinality of 1 that the Oracle optimizer thinks it will only execute the inner part of the join once.
NESTED LOOPS joins are great when joining a small number of rows. HASH joins work much better when joining a large number of rows.
Solutions
There are many ways to change the join method, here are the two to try first:
1. Gather statistics. Better optimizer statistics will improve the cardinality estimates, which will usually improve execution plans. There are many ways to gather stats but usually the default settings are the best. In this case they can be gathered by running a procedure like this: exec dbms_stats.gather_schema_stats('SMART'); Repeat that for the schemas ZILADMIN and XCBAIRAG. If the statistics were missing or stale it would also be a good idea to investigate why the default statistics gathering job did not run.
2. Hint. Hints should generally be avoided in production code, but they can at least be helpful for diagnosing the problem. Run the query with the hint SELECT /*+ USE_HASH(t1 t2) */ ... and see if that improves things. If it does, you can either keep the hint or consider some other form of plan management. For example, a SQL Profile may solve this and other problems in a cleaner way. Check with other developers or DBAs to find out which plan management features are common in your system.
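Applied to the join form of the query above, the hinted version would look like this:
SELECT /*+ USE_HASH(t1 t2) */
assembly_item_id menuItemId,
location_id restId,
bill_sequence_id,
bill_config_id
FROM zil_ibat_resolve_bmi_ai_max_v t1
join global_temp_ids t2
on t1.location_id = t2.temp_id
WHERE t1.assembly_item_id = 8321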

Inefficient JOIN Method?

I'm trying to query two fairly large tables here to pull some results, and I'm having some trouble with efficiency.
Note: I've only included relevant columns to make this not look so messy!
TableA (Stock) has productID, ownerID, and count columns
TableB (Owners) has ID, accountHolderID, and name columns
What I'm trying to do is query TableA where productID = X and pull up Stock.productID, Stock.ownerID, and Owners.name. The relation between the two tables is Stock.ownerID = Owners.ID, so if the WHERE condition pulled up, say, five productIDs, then I'd want the name from TableB that matched up to the ownerID from TableA.
The only unique ID in this situation is Owners.ID from TableB
Just doing a basic SELECT query on TableA for those products takes 15 seconds; however, when I add an INNER JOIN to match things up with TableB, the query takes significantly longer, upwards of 10 minutes. I'm guessing I've designed this query inefficiently.
SELECT
Owners.name,
Stock.productID,
Stock.ownerID
FROM Stock
INNER JOIN
Owners
ON Stock.ownerID = Owners.ID
WHERE
Stock.productID = 42301679
How can I make this query more efficient?
Would adding ORs to the WHERE condition allow me to pull multiple productIDs at once?
Based on your comment, it looks like you're missing a very critical index on the owners.id field. Now, keep in mind this index will help this query, but you have to take into consideration all of the other queries that run against this table to determine if it is a good idea to add that index.
At 29M rows, having an index on a table that is frequently inserted to may have a noticeable effect on insert times.
This may be a situation where different applications need different indexes - namely your OLTP app and your reporting app (which may just be you running ad hoc queries). A common solution is to have a second server that runs your reporting/data warehouse queries that has indexes properly tuned to this function.
Best of luck.
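If you do decide the index is worth it, creating it might look like this (a sketch using the names from the question; the index names are made up):
CREATE INDEX idx_owners_id ON Owners (ID);
-- An index on the filter column helps the WHERE clause too:
CREATE INDEX idx_stock_productid ON Stock (productID);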
Your query looks right.
Perhaps we can see the schema?
To pull multiple productIDs at once, you can use the IN operator instead of OR:
SELECT
Owners.name,
Stock.productID,
Stock.ownerID
FROM Stock
INNER JOIN
Owners
ON Stock.ownerID = Owners.ID
WHERE
Stock.productID IN (42301679,123232,232324)
If productID is unique in the Stock table, it makes sense to make it the index key, and this can greatly improve performance, as others have mentioned.
Another performance gain comes from giving the Owners.name field a specific length. In MySQL, VARCHAR is used for strings of varied length, while a CHAR(32) column indicates that the name will always occupy 32 characters. The extra unused space is just padded, so you can really think of the (32) as a maximum length. The performance advantage comes from the fact that the database now knows exactly how many bytes each row occupies, and it can use this information to improve lookup time.
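In MySQL, that change might look like this (a sketch; it assumes 32 characters is enough for a name):
-- Fixed-width column: every row now occupies a known number of bytes
ALTER TABLE Owners MODIFY name CHAR(32);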

order of tables in FROM clause

For an SQL query like this:
Select * from TABLE_A a
JOIN TABLE_B b
ON a.propertyA = b.propertyA
JOIN TABLE_C c
ON b.propertyB = c.propertyB
Does the sequence of the tables matter? It won't affect the results, but does it affect the performance?
One can assume that the data in table C is much larger than that in a or b.
For each SQL statement, the engine will create a query plan. So no matter how you order the tables, the engine will choose a correct path to build the query.
More on query plans here: http://en.wikipedia.org/wiki/Query_plan
Depending on which RDBMS you are using, there are ways to enforce the query order and plan using hints, if you feel that the engine is not choosing the correct path.
Sometimes the order of the tables does make a difference here (when you are using different join types).
Joins effectively work on the cross-product principle: a query of the form A join B join C is treated as ((A * B) * C).
That means the first intermediate result comes from joining tables A and B, and that result is then joined with table C.
So if inner joining A (100 records) and B (200 records) gives 100 records,
those 100 records are then compared with the 1,000 records of C.
No.
Well, there is a very, very tiny chance of this happening; see this article by Jonathan Lewis. Basically, the number of possible join orders grows very quickly, and there's not enough time for the optimizer to check them all. The sequence of the tables may be used as a tie-breaker in some very rare cases, but I've never seen this happen, or even heard about it happening to anybody, in real life. You don't need to worry about it.

Is there an alternative to joining 3 or more tables?

Is it a good idea to join three or more tables together, as in the following example? I'm trying to focus on performance. Is there any way to rewrite this query that would be more efficient and faster? I've tried to make it as simple as possible.
select * from a
join b on a.id = b.id
join c on a.id = c.id
join d on c.id = d.id
where a.property1 = 50
and b.property2 = 4
and c.property3 = 9
and d.property4 = 'square'
If you want faster performance, make sure that all of the joins are covered by an index (either clustered or non-clustered). It looks like this could all be done in your query above by creating an index on the id and appropriate property columns of each table.
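For example, those indexes might look like this (a sketch; the index names are made up and the column order may need tuning for your workload):
CREATE INDEX ix_a_id_property1 ON a (id, property1);
CREATE INDEX ix_b_id_property2 ON b (id, property2);
CREATE INDEX ix_c_id_property3 ON c (id, property3);
CREATE INDEX ix_d_id_property4 ON d (id, property4);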
You could make it faster if you only selected a subset of the columns; at the moment you're selecting everything from all four tables.
Performance-wise, I think it really depends on the number of records in each table and on having the proper indexes defined. (I'm also assuming that SELECT * is a placeholder; you should avoid wildcards.)
I'd start off by checking out your execution plan and optimizing from there. If you're still getting suboptimal performance, you could try using temp tables to break the 4-table join into separate smaller joins, as sketched below.
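A sketch of that temp-table approach (T-SQL style temp table; adjust the syntax to your RDBMS):
-- Join and filter the first two tables, materializing a small intermediate result
SELECT a.id
INTO #ab
FROM a
JOIN b ON a.id = b.id
WHERE a.property1 = 50
AND b.property2 = 4;
-- Then join the intermediate result to the remaining tables
SELECT *
FROM #ab
JOIN c ON #ab.id = c.id
JOIN d ON c.id = d.id
WHERE c.property3 = 9
AND d.property4 = 'square';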
Assuming a normalized database, this is the best you can do, in terms of structuring a query and the joins in place.
There are other options to look at, including adding indexes on the join and where-clause columns, denormalizing the table structures, and narrowing the result set.
Adding indexes on the join columns (which appear to be primary keys, so they may already be indexed) will help with join performance; indexing the columns used in the where clause will help speed up the filtering on each table.
If you denormalize, you get a structure with duplicate data with all the implications of duplicate data (data maintenance issues mostly), but you gain performance as you no longer need to join.
When selecting columns, you should specify which ones you want; using * is generally a bad idea. That way you only transfer the data the application really needs.