Inefficient JOIN Method? - sql

I'm trying to query two fairly large tables here to pull some results, and I'm having some trouble with efficiency.
Note: I've only included relevant columns to make this not look so messy!
TableA (Stock) has productID, ownerID, and count columns
TableB (Owners) has ID, accountHolderID, and name columns
What I'm trying to do is query TableA and, where productID = X, pull up Stock.productID, Stock.ownerID and Owners.name. The relation between these two tables is Stock.ownerID = Owners.ID, so if the WHERE condition pulled up, say, five productIDs, then I'd want the name from TableB that matched up to the ownerID from TableA.
The only unique ID in this situation is Owners.ID from TableB
Just doing a basic SELECT query on TableA for those products takes 15 seconds; however, when I add an INNER JOIN to match things up to TableB, the query takes significantly longer, upwards of 10 minutes. I'm guessing I've designed this query inefficiently.
SELECT
    Owners.name,
    Stock.productID,
    Stock.ownerID
FROM Stock
INNER JOIN Owners
    ON Stock.ownerID = Owners.ID
WHERE Stock.productID = 42301679
How can I make this query more efficient?
Would adding ORs to the WHERE condition allow me to pull multiple productIDs at once?

Based on your comment, it looks like you're missing a very critical index on the owners.id field. Now, keep in mind this index will help this query, but you have to take into consideration all of the other queries that run against this table to determine if it is a good idea to add that index.
At 29M rows, having an index on a table that is frequently inserted to may have a noticeable effect on insert times.
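If the trade-off looks acceptable, adding the index itself is a one-liner. This is a minimal sketch using the table and column names from the question; if Owners.ID is already declared as the primary key, it is most likely indexed already and this step is unnecessary:

-- Sketch only: index the join column referenced by Stock.ownerID.
-- Skip this if Owners.ID is already the primary key (primary keys are indexed automatically).
CREATE INDEX idx_owners_id ON Owners (ID);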
This may be a situation where different applications need different indexes - namely your OLTP app and your reporting app (which may just be you running ad hoc queries). A common solution is to have a second server that runs your reporting/data warehouse queries that has indexes properly tuned to this function.
Best of luck.

Your query looks right.
Perhaps we can see the schema?
In order to pull multiple productIDs at once, you can use the IN operator instead of OR:
SELECT
    Owners.name,
    Stock.productID,
    Stock.ownerID
FROM Stock
INNER JOIN Owners
    ON Stock.ownerID = Owners.ID
WHERE Stock.productID IN (42301679, 123232, 232324)

If productID is unique in the Stock table, it makes sense to index it, and as others have mentioned this can greatly improve performance.
Another performance gain comes from giving the Owners.name field a fixed length. In MySQL, VARCHAR is used for strings of varying length, while a CHAR(32) column indicates that the name will always occupy 32 characters; the unused space is simply padded, so you can think of the (32) as a maximum length. The performance advantage comes from the fact that the database then knows exactly how many bytes each row occupies and can use that information to speed up row lookups.
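As a sketch of what that change might look like, assuming MySQL syntax and that 32 characters is long enough for every existing name (check your data first):

-- Illustrative only: convert the name column to a fixed-length type.
-- Assumes no existing value in Owners.name exceeds 32 characters.
ALTER TABLE Owners MODIFY name CHAR(32);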

Related

Performance over PostgreSQL conditional join - Query optimization

Let's assume I have three tables. The first, subscriptions, has a field called type, which can only have two values: FREE or PREMIUM.
The other two tables are called premium_users and free_users. I'd like to perform a LEFT JOIN starting from the subscriptions table, but the thing is that, depending on the value of the type field, I will ONLY find the matching row in one table or the other, i.e. if type equals 'FREE', then the matching row will ONLY be in the free_users table, and vice versa.
I'm thinking of some ways to do this, such as LEFT JOINing both tables and then using a COALESCE function to get the non-null value, or a UNION of two different queries, each using an INNER JOIN, but I'm not quite sure which would be the best way in terms of performance. Also, as you would guess, the free_users table is almost five times larger than the premium_users table. Another thing you should know is that I'm joining by the user_id field, which is the PK in both free_users and premium_users.
So, my question is: which would be the most performant way to do a JOIN that depending on the value of type column will match to one table or another. Would this solution be any different if instead of two tables there were three, or even more?
Disclaimer: This DB is a PostgreSQL and is already up and running in production and as much as I'd like to have a single users table it won't happen in the short term.
What is best in terms of performance? Well, you should try it on your data and your systems.
My recommendation is two left joins:
select s.*,
       coalesce(fu.name, pu.name) as name
from subscriptions s
     left join free_users fu
         on fu.free_id = s.subscription_id and s.type = 'FREE'
     left join premium_users pu
         on pu.premium_id = s.subscription_id and s.type = 'PREMIUM';
You want indexes on free_users(free_id) and premium_users(premium_id). These are probably "free" because those ids should be the primary keys in their tables.
If you use union all, then the optimizer may not use indexes for the joins. And not using indexes could have a dastardly impact on performance.
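For reference, the UNION ALL alternative mentioned in the question would look roughly like this (a sketch using the same assumed column names as above):

-- Sketch only, for comparison with the two-left-join version above.
select s.*, fu.name
from subscriptions s
     join free_users fu
         on fu.free_id = s.subscription_id
where s.type = 'FREE'
union all
select s.*, pu.name
from subscriptions s
     join premium_users pu
         on pu.premium_id = s.subscription_id
where s.type = 'PREMIUM';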

Performance of JOINS in SAP HANA Calculation View

For Example:
I have 4 columns (A,B,C,D).
I thought that instead of connecting each and every column in the join, I should make a concatenated column in both projections (CA_CONCAT -> A+B+C+D) and join on that, just to check which method performs better.
It was working faster earlier, but in a few CVs this method is sometimes slower, especially when filtering!
Can anyone suggest which is the more efficient method?
I don't think JOIN conditions on concatenated fields will perform better.
Although we generally say there is no need for indexes on column tables in a HANA database, column tables have a structure that effectively works like an index on every column.
So if you concatenate 4 columns and produce a new calculated field, you first lose the option to use those per-column structures on the 4 columns and the corresponding join columns.
I did not check the execution plan, but it will probably do a full scan on these columns.
In fact, I'm surprised you mention that it worked faster and that you experienced problems only in a few CVs.
Concatenation, or applying any function to a database column, is by itself extra workload on top of the SELECT process. It might involve an implicit type cast, which can add more overhead than expected.
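As a rough illustration of the two join styles being compared, in plain SQL and with hypothetical table and column names (in a calculation view these would be the join conditions between the two projections):

-- Multi-column join: each column can be matched and filtered individually.
SELECT t1.PRODUCT, t2.MATERIAL
FROM Tab1 t1
JOIN Tab2 t2
    ON  t1.A = t2.A
    AND t1.B = t2.B
    AND t1.C = t2.C
    AND t1.D = t2.D;

-- Concatenated-key join: the engine must compute the expression first,
-- which can prevent filters from being pushed down to both tables.
SELECT t1.PRODUCT, t2.MATERIAL
FROM Tab1 t1
JOIN Tab2 t2
    ON t1.A || t1.B || t1.C || t1.D = t2.A || t2.B || t2.C || t2.D;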
First, I would suggest setting your table to column store and checking the new performance.
After that, I would suggest splitting the JOIN into multiple JOINs if you are using an OR condition in your join.
Third, an INNER JOIN will give you better performance compared to a LEFT JOIN or LEFT OUTER JOIN.
Another thing about JOINs and performance: you are better off joining on PRIMARY KEYS rather than on every column.
For me, both times the join with multiple fields performed faster than the join with the concatenated field. For the filtering scenario, PlanViz shows that when I join with multiple fields, the filter gets pushed down to both tables. On the other hand, when I join with the concatenated field, only one table gets filtered.
However, if you put a filter on both fields (like PRODUCT from Tab1 and MATERIAL from Tab2), then you can push the filter down to both tables.
Like:
Select * from CalculationView where PRODUCT = 'A' and MATERIAL = 'A'

SQL Join with GROUP BY query optimisation

I'm trying to optimise the following query.
SELECT C.name, COUNT(DISTINCT I.id), COUNT(B.id)
FROM Categories C, Items I, Bids B
WHERE C.id = I.category
AND I.id = B.item_id
GROUP BY C.name
ORDER BY 2 DESC, 3 DESC;
Categories is a small table with 20 records.
Items is a large table with over 50,000 records.
Bids is an even larger table with over 600,000 records.
I have an index on
Categories(name, id), Items(category), and Bids(item_id, id).
The PRIMARY KEY for each table is: Items(id), Categories(id), Bids(id)
Is there any way to optimise the query? Much appreciated.
Without EXPLAIN (ANALYZE, BUFFERS) output this is guesswork.
The query is so simple that nothing can be optimized there.
Make sure that you have correct table statistics; check EXPLAIN (ANALYZE) to see if PostgreSQL's estimates are correct.
Increase shared_buffers so that the whole database fits into RAM (if you can).
Increase work_mem so that all hashes and sorts are performed in memory.
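For example, inspecting the plan and bumping work_mem for the session might look like this (the setting value is only an illustration; tune it to your hardware):

-- Illustrative: check the actual plan, row-count estimates and buffer usage.
EXPLAIN (ANALYZE, BUFFERS)
SELECT C.name, COUNT(DISTINCT I.id), COUNT(B.id)
FROM Categories C, Items I, Bids B
WHERE C.id = I.category
  AND I.id = B.item_id
GROUP BY C.name
ORDER BY 2 DESC, 3 DESC;

-- Illustrative session-level setting so hashes and sorts fit in memory.
SET work_mem = '256MB';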
Not really; you are scanning all records.
How many of the Items records are hit by the data from Bids? I would imagine all tables are fully scanned and hash joined, and the indexes disregarded.
Your query seems really boilerplate, and I am sure that with the size of your tables, any not-really-low-end server can run this query in a heartbeat. But you can always make things better. Here's a list of optimizations you can make that should, theoretically, boost your query's performance:
Theoretically speaking, your biggest inefficiency here is that you are calculating the cross product of your tables instead of joining them. You can rewrite the query with explicit joins, like:
...
FROM Items I
INNER JOIN Bids B
ON I.id = B.item_id
INNER JOIN Categories C
ON C.id = I.category
...
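Putting that together with the original SELECT list and grouping, the full rewritten query would be:

SELECT C.name, COUNT(DISTINCT I.id), COUNT(B.id)
FROM Items I
INNER JOIN Bids B
    ON I.id = B.item_id
INNER JOIN Categories C
    ON C.id = I.category
GROUP BY C.name
ORDER BY 2 DESC, 3 DESC;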
If we are considering everything performance-wise, your index on Items(category) is inefficient, since it has only 20 distinct values spread across 50K rows. Such a low-selectivity index may not help, and you may even get better performance without it. However, from a practical point of view there is a lot of other stuff to consider here, so this may not actually be a big deal.
You have no separate index on the id column of the Items table, and an index on that column speeds up your first join. (However, PostgreSQL creates an index on primary key columns by default, so this is not a big deal either.)
Also, adding explain analyze to the beginning of your query shows you the plan that the PostgreSQL query planner uses to run your queries. If you know a thing or two about query plans, I suggest you take a look at the results of that too to find any remaining inefficiencies.

order of tables in FROM clause

For an SQL query like this:
Select * from TABLE_A a
JOIN TABLE_B b
ON a.propertyA = b.propertyA
JOIN TABLE_C c
ON b.propertyB = c.propertyB
Does the sequence of the tables matter? It won't affect the results, but does it affect performance?
One can assume that the data in table C is much larger than in a or b.
For each SQL statement, the engine will create a query plan, so no matter how you order the tables, the engine will choose a correct path to build the query.
More on query plans here: http://en.wikipedia.org/wiki/Query_plan
There are ways, depending on which RDBMS you are using, to enforce the join order and plan using hints, if you feel that the engine does not choose the correct path.
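For instance, in MySQL such a hint might look like this (shown only as an illustration; hint names and syntax differ per RDBMS):

-- STRAIGHT_JOIN forces MySQL to join the tables in the order they are written.
SELECT STRAIGHT_JOIN *
FROM TABLE_A a
JOIN TABLE_B b ON a.propertyA = b.propertyA
JOIN TABLE_C c ON b.propertyB = c.propertyB;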
Sometimes the order of the tables does make a difference here (when you are using different joins).
Our joins actually work on the cross-product concept.
If you use a query like A JOIN B JOIN C, it will be treated as ((A * B) * C).
That means the first result comes from joining tables A and B, and that result is then joined with table C.
So if inner joining A (100 records) and B (200 records) gives 100 records, those 100 records are then compared with the 1,000 records of C.
No.
Well, there is a very, very tiny chance of this happening; see this article by Jonathan Lewis. Basically, the number of possible join orders grows very quickly, and there's not enough time for the Optimizer to check them all. The sequence of the tables may be used as a tie-breaker in some very rare cases. But I've never seen this happen, or even heard about it happening to anybody, in real life. You don't need to worry about it.

Is there an alternative to joining 3 or more tables?

Is it a good idea to join three or more tables together, as in the following example? I'm trying to focus on performance. Is there any way to rewrite this query so that it would be more efficient and faster? I've tried to make it as simple as possible.
select * from a
join b on a.id = b.id
join c on a.id = c.id
join d on c.id = d.id
where a.property1 = 50
and b.property2 = 4
and c.property3 = 9
and d.property4 = 'square'
If you want faster performance, make sure that all of the joins are covered by an index (either clustered or non-clustered). It looks like this could all be done for the query above by creating an index on the id and the appropriate property column of each table.
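A minimal sketch of what that might look like (index names are illustrative; adjust the column choices to your actual schema and workload):

-- Illustrative composite indexes covering the join key plus the filtered column of each table.
CREATE INDEX ix_a_id_property1 ON a (id, property1);
CREATE INDEX ix_b_id_property2 ON b (id, property2);
CREATE INDEX ix_c_id_property3 ON c (id, property3);
CREATE INDEX ix_d_id_property4 ON d (id, property4);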
You could make it faster if you only selected a subset of the columns; at the moment you're selecting everything from all four tables.
Performance-wise, I think it really depends on the number of records in each table and on making sure that you have the proper indexes defined. (I'm also assuming that SELECT * is a placeholder; you should avoid wildcards.)
I'd start off by checking your execution plan and optimizing from there. If you're still getting suboptimal performance, you could try using temp tables to break the four-table join up into separate, smaller joins.
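As a rough sketch of the temp-table idea (syntax varies by RDBMS; the CREATE TEMPORARY TABLE ... AS form shown here is MySQL/PostgreSQL-style, and table and column names follow the query above):

-- Pre-filter two of the tables into temporary tables, then join the smaller results.
CREATE TEMPORARY TABLE tmp_a AS
SELECT id, property1 FROM a WHERE property1 = 50;

CREATE TEMPORARY TABLE tmp_c AS
SELECT id, property3 FROM c WHERE property3 = 9;

SELECT *
FROM tmp_a ta
JOIN b ON ta.id = b.id
JOIN tmp_c tc ON ta.id = tc.id
JOIN d ON tc.id = d.id
WHERE b.property2 = 4
  AND d.property4 = 'square';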
Assuming a normalized database, this is about the best you can do in terms of structuring the query and its joins.
There are other options to look at, including adding indexes on the join and filter columns, denormalizing the table structures, and narrowing the result set.
Adding indexes on the join columns (which appear to be primary keys, so they may already be indexed) will help with join performance; indexing the columns used in the WHERE clause will help speed up the filtering on each table.
If you denormalize, you get a structure with duplicate data, with all the implications of duplicate data (mostly data maintenance issues), but you gain performance because you no longer need to join.
When selecting columns, you should specify which ones you want - using * is generally a bad idea. This way you only transfer the data that the application really needs.