Improving SQL Query Join - sql

I have a SQL query that is taking hours to run. My join is on the descriptions of products. Would it be more efficient to create a unique numerical id and join on this instead since the product description is a few sentences long?
Example:
SELECT A.*, B.something
FROM tableA A JOIN tableB B
ON A.product_details = B.product_details

For this query:
SELECT A.*, B.something
FROM tableA A JOIN
     tableB B
     ON A.product_details = B.product_details
The best index is on B(product_details, something) -- with product_details first, because it is the join key.
I generally recommend a numeric key. Numeric joins are a bit more efficient, and they reduce the number of things to worry about, such as trailing spaces and collation conflicts.
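If you do add a surrogate key, the change is fairly mechanical. A T-SQL-flavored sketch, assuming a hypothetical integer column product_id and that each description maps to exactly one row in tableB:

-- Add an integer key to tableB and copy it onto tableA (one-time backfill).
ALTER TABLE tableB ADD product_id int IDENTITY(1,1);
ALTER TABLE tableA ADD product_id int NULL;

UPDATE A
SET A.product_id = B.product_id
FROM tableA A
JOIN tableB B ON A.product_details = B.product_details;

-- The join then compares small integers instead of multi-sentence strings:
SELECT A.*, B.something
FROM tableA A
JOIN tableB B ON A.product_id = B.product_id;

An index (or primary key) on tableB(product_id) would then support the join directly.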


Best way to get distinct count from a query joining two tables

I have 2 tables, table A & table B.
Table A (has thousands of rows)
id
uuid
name
type
created_by
org_id
Table B (has a max of hundred rows)
org_id
org_name
I am trying to get the best join query to obtain a count with a WHERE clause. I need the count of distinct created_bys from table A with an org_name in Table B that contains 'myorg'. I currently have the below query (producing expected results) and wonder if this can be optimized further?
select count(distinct a.created_by)
from a left join
b
on a.org_id = b.org_id
where b.org_name like '%myorg%';
You don't need a left join:
select count(distinct a.created_by)
from a join
b
on a.org_id = b.org_id
where b.org_name like '%myorg%'
For this query, you want an index on b.org_id, which I assume you already have.
I would use exists for this:
select count(distinct a.created_by)
from a
where exists (select 1 from b where b.org_id = a.org_id and b.org_name like '%myorg%')
An index on b(org_id) would help. But in terms of performance, key points are:
searching using like with a wildcard on both sides is not good for performance (this cannot take advantage of an index); it would be far better to search for an exact match, or at least to not have a wildcard on the left side of the string.
count(distinct ...) is more expensive than a regular count(); if you don't really need distinct, then don't use it.
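For reference, the index both answers mention could be created like this (index name invented for the sketch):

create index b_org_id_idx on b (org_id);

And, if the data allows it, a predicate without a leading wildcard (for example b.org_name like 'myorg%') could actually make use of an index on org_name, unlike the double-wildcard version.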
Your query looks good already. Use a plain [INNER] JOIN instead of LEFT [OUTER] JOIN, like Gordon suggested. But that won't change much.
You mention that table B has only ...
a max of hundred rows
while table A has ...
thousands of rows
If there are many rows per created_by (which I'd expect), then there is potential for an emulated index skip scan.
(The need to emulate it might go away in one of the coming Postgres versions.)
Essential ingredient is this multicolumn index:
CREATE INDEX ON a (org_id, created_by);
It can replace a simple index on just (org_id) and works for your simple query as well. See:
Is a composite index also good for queries on the first field?
There are two complications for your case:
DISTINCT
0-n org_id resulting from org_name like '%myorg%'
So the optimization is harder to implement. But still possible with some fancy SQL:
SELECT count(DISTINCT created_by)  -- does not count NULL (as desired)
FROM   b
CROSS  JOIN LATERAL (
   WITH RECURSIVE t AS (
      (  -- parentheses required
      SELECT created_by
      FROM   a
      WHERE  org_id = b.org_id
      ORDER  BY created_by
      LIMIT  1
      )
      UNION ALL
      SELECT (SELECT created_by
              FROM   a
              WHERE  org_id = b.org_id
              AND    created_by > t.created_by
              ORDER  BY created_by
              LIMIT  1)
      FROM   t
      WHERE  t.created_by IS NOT NULL  -- stop recursion
      )
   TABLE t
   ) a
WHERE  b.org_name LIKE '%myorg%';
db<>fiddle here (Postgres 12, but works in Postgres 9.6 as well.)
That's a recursive CTE in a LATERAL subquery, using a correlated subquery.
It utilizes the multicolumn index from above to retrieve only a single row for every (org_id, created_by), with index-only scans if the table is vacuumed enough.
The main objective of the sophisticated SQL is to completely avoid a sequential scan (or even a bitmap index scan) on the big table and only read very few fast index tuples.
Due to the added overhead it can be a bit slower for an unfavorable data distribution (many org_id and/or only a few rows per created_by). But it's much faster under favorable conditions and scales excellently, even for millions of rows. You'll have to test to find the sweet spot.
Related:
Optimize GROUP BY query to retrieve latest row per user
What is the difference between LATERAL and a subquery in PostgreSQL?
Is there a shortcut for SELECT * FROM?

Which is more efficient "Int to Char" or "Char to Int"?

I am trying to join two tables A and B. The key variable is in integer format in Table A and in character format in Table B (It is all made up of numbers though). So, I can either convert the column from table A to varchar or the column from Table B to int.
Query1:
select a.*
from tableA a
inner join tableB b on cast(a.key as varchar(10)) = b.key
Query2:
select a.*
from tableA a
inner join tableB B on a.key=cast(b.key as int)
My question is which among these queries is the most efficient and why?
You should fix your data model!
This has nothing to do with the efficiency of the type conversion. Basically, if indexes cannot be used, SQL Server will use a nested loops join (I would love to say it would use a hash join, but I don't recall seeing that happen in this case).
So, if you have no indexes, it really makes little difference.
If you have an index on one table -- well, you want to avoid the conversion on that column. For example, this query:
select a.*
from tableA a inner join
tableB B
on a.key = cast(b.key as int)
could use an index on a(key). The execution plan would scan b and use the index to "lookup" the value. However, it would not use an index on b(key).
All that said, fix your data model. Foreign key relationships should be properly declared -- and that requires that the types match.
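If every value in tableB.key really is numeric, fixing the model might look like this sketch (constraint name invented; it assumes tableA.key is the primary key and that the data has been validated first):

-- Align the types, then declare the relationship so future joins need no conversion.
ALTER TABLE tableB ALTER COLUMN [key] int NOT NULL;

ALTER TABLE tableB
ADD CONSTRAINT FK_tableB_tableA FOREIGN KEY ([key]) REFERENCES tableA ([key]);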
It's better to convert from char to int, for these reasons:
1. Comparison operations in computers are fundamentally done on numeric values.
2. The comparison will be faster.
3. The results are more precise.

Improving SQL cartesian product performance by reducing columns

I have an SQL query which uses cartesian product on a large table. However, I only need one column from one of the tables. Would it actually perform better, if I selected only that one column before using the cartesian product?
So, in other words, would this:
SELECT A.Id, B.Id
FROM (SELECT Id FROM Table1) AS A , Table2 AS B;
be faster than this, given that Table1 has more columns than Id?:
SELECT A.Id, B.Id
FROM Table1 AS A , Table2 AS B;
Or does the number of columns not matter?
On most databases, the two forms would have the same execution plan.
The first form could be worse on a database (such as MySQL) that materializes subqueries.
The second should be better with indexes on the two tables, table1(id) and table2(id). The index would be used to get the values rather than reading the base data.
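If those indexes don't already exist, they would be nothing more than (names invented):

CREATE INDEX idx_table1_id ON Table1 (Id);
CREATE INDEX idx_table2_id ON Table2 (Id);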
Try it out yourself! But generally speaking, having a subquery reduce the number of rows will help improve performance. Your query should, however, be written differently:
select a.id as aid, b.id as bid
from (select id from table1 where id = <specific_id>) a, table2 b

SQL Server 2012 join performance features?

Consider a simple 3-table database in SQL Server 2012.
Table A
AId
Name
Other1
Other2
Table B
BId
Name
Table A_B
BId
AId
Simple example query:
SELECT TOP(20) A.Aid, A.Name, B.Bid, B.Name
FROM A
INNER JOIN A_B ON A.AId = A_B.Aid
INNER JOIN A as AA ON AA.Aid = A_B.Aid
INNER JOIN B ON B.BId = A_B.Bid
WHERE AA.Aid = @aid
AND A.Other1 = @other1
There are millions of rows in table A.
There are thousands of rows in table B.
There are ten times more rows in table A_B than A.
The Other1 and Other2 fields can be used to filter the queries.
Join queries using Top(20) could be done at a rate of 100 requests per second or more (specs are unclear).
The queries will almost always be using different parameters so result caching would not help that much.
What features in SQL Server 2012 can help to improve join query perfomance given the example above?
My initial thought is that since it's all PK int joins there isn't much that I could do. However I don't know if partitioned views could help.
I'm thinking that probably it's just about adding memory.
Well, the first thing to understand (well, maybe not the first) is that a performance model is built into all current versions which depends on head seek times vs. continuous reads; this may well change with solid-state drives. Your choice of clustered indexes is important for keeping frequently queried data together. Also, having a covering index for each part of the query means the data can be accessed without reading the table itself. Partitioning may help (but it's probably a long way down the list). Keeping stats up to date is essential; too often, poor performance comes from under-maintained indexes and stats. Actually, all of these things have been true right back to SQL 7 (except I don't think SQL 7 had partitioned views).
Having the right RAID structure can alter performance by a factor of 4. The number of tempdb files should match the number of processors (up to about 16), and the tempdb load balancing option should be set to true. Tempdb, logs and data should be distributed across different I/O paths. No auto-shrink - it's evil.
These are the more obvious ones. If you really want to get to grips with a large db, then "Inside SQL" by Kalen Delaney is almost mandatory reading, though it probably costs more than a few GB of RAM. And as you said - more RAM.
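To make the covering-index and statistics advice concrete for the example query, an illustrative sketch (index name invented; the exact key and included columns depend on the real workload):

-- Covers the filter on Other1 and the selected Name without touching the base table;
-- AId comes along for free if it is the clustered key.
CREATE NONCLUSTERED INDEX IX_A_Other1 ON A (Other1) INCLUDE (Name);

-- Keep optimizer statistics current so good plans are chosen.
UPDATE STATISTICS A WITH FULLSCAN;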
First yes have a clustered index for the PK
If Table B's key fits in an Int16 (smallint), use smallint
Not for disk space but for more rows in the same amount of memory
The interesting part is Table A_B
The order of that PK will probably affect performance
With just the single PK index, whichever column comes second will be the slower join
Try the order each way
Check the query plan
Check the tuning adviser
My thought is
PK AId, BId
Non-clustered index on BId, since that index is smaller
Then swap them around and compare
If the same then go with AId, BId for smaller index size and speed of insert
Then you can go into hints on the joins
Defrag on a regular basis
Insert in the order of the PK
If the data comes in natural order and insert speed is an issue then use that order for the PK
If insert speed is a problem then it may help to disable the non clustered index, insert, and then rebuild the non clustered index
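A concrete shape for A_B along those lines might be (the smallint for BId and all names are assumptions):

CREATE TABLE A_B (
    AId int      NOT NULL,
    BId smallint NOT NULL,
    CONSTRAINT PK_A_B PRIMARY KEY CLUSTERED (AId, BId)  -- try (BId, AId) as well and compare plans
);

CREATE NONCLUSTERED INDEX IX_A_B_BId ON A_B (BId);  -- smaller index for joins coming from the B side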
Millions and thousands of rows are still not enormous.
And I would not write the query like that
Keep the number of joins down
SELECT TOP(20) A.Aid, A.Name, B.Bid, B.Name
FROM A_B
JOIN A
ON A.Aid = A_B.Aid
JOIN B
ON B.BId = A_B.Bid
WHERE A.Aid = @aid
AND A.Other1 = @other1
That query is very wasteful
Why join on all A.Aid = A_B.Aid rows just to filter to a single Aid in the WHERE?
Get the filter to execute early
This may perform better
SELECT TOP(20) A.Aid, A.Name, B.Bid, B.Name
FROM A_B
JOIN A
ON A.Aid = A_B.Aid
AND A.Aid = @aid
AND A.Other1 = @other1
JOIN B
ON B.BId = A_B.Bid
If you can get it to filter before it joins then less work
Check the query plan
A CTE on A with the conditions may coerce it to perform the filter first.
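The CTE variant could be shaped like this (illustrative only):

WITH FilteredA AS (
    SELECT Aid, Name
    FROM A
    WHERE Aid = @aid
    AND Other1 = @other1
)
SELECT TOP(20) FA.Aid, FA.Name, B.Bid, B.Name
FROM FilteredA FA
JOIN A_B ON A_B.Aid = FA.Aid
JOIN B ON B.BId = A_B.Bid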
If you cannot get the filter to happen first with a single statement, then create a #tempA with Aid as a declared PK
(not a CTE - the purpose is to materialize)
INSERT INTO #tempA (Aid, Name)
SELECT Aid, Name
FROM A
WHERE A.Aid = @aid
AND A.Other1 = @other1
If Aid is the PK on Table A then that query returns 0 or 1 rows
The join to #tempA is trivial
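Put together, the temp-table route might look like this (column types are assumptions):

CREATE TABLE #tempA (Aid int NOT NULL PRIMARY KEY, Name nvarchar(100) NOT NULL);  -- types assumed

-- ...run the INSERT above, then the final join is trivial:
SELECT TOP(20) T.Aid, T.Name, B.Bid, B.Name
FROM #tempA T
JOIN A_B ON A_B.Aid = T.Aid
JOIN B ON B.BId = A_B.Bid;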

Query takes time on comparing non numeric data of two tables, how to optimize it?

I have two DBs. The 1st db has CallsRecords table and 2nd db has Contacts table, both are on SQL Server 2005.
Below is the sample of two tables.
Contact table has 150,000 records
CallsRecords has 75,000 records
Indexes on CallsRecords:
CallFrom
CallTo
PickUP
Indexes on Contacts:
PhoneNumber
(Screenshot of the sample data: http://img688.imageshack.us/img688/8422/calls.png)
I am using this query to find matches but it takes more than 7 minutes.
SELECT *
FROM CallsRecords r INNER JOIN Contact c ON r.CallFrom = c.PhoneNumber
OR r.CallTo = c.PhoneNumber OR r.PickUp = c.PhoneNumber
In the estimated execution plan, the inner join costs 95%.
Any help to optimize it would be appreciated.
You could try getting rid of the or in the join condition and replacing it with union all statements. Also NEVER, and I do mean NEVER, use select * in production code, especially when you have a join.
SELECT <Specify Fields here>
FROM CallsRecords r INNER JOIN Contact c ON r.CallFrom = c.PhoneNumber
UNION ALL
SELECT <Specify Fields here>
FROM CallsRecords r INNER JOIN Contact c ON r.CallTo = c.PhoneNumber
UNION ALL
SELECT <Specify Fields here>
FROM CallsRecords r INNER JOIN Contact c ON r.PickUp = c.PhoneNumber
Alternatively you could try not using phone number to join on. Instead create the contacts phone list with an identity field and store that in the call records instead of the phone number. An int field will likely be a faster join.
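That suggestion could be sketched roughly like this (the new column names are assumptions; a one-time backfill per phone column and code changes would follow):

ALTER TABLE Contact ADD ContactKey int IDENTITY(1,1) NOT NULL;  -- assumed new numeric key
ALTER TABLE CallsRecords ADD CallFromId int NULL, CallToId int NULL, PickUpId int NULL;  -- assumed

-- Backfill once by matching on the phone numbers, then join on the int columns instead:
UPDATE r
SET r.CallFromId = c.ContactKey
FROM CallsRecords r
JOIN Contact c ON r.CallFrom = c.PhoneNumber;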
Is there an index on the fields you are comparing? Is this index being used in the execution plan?
Your select * is probably causing SQL Server to ignore your indexes, and causing each table to be scanned. Instead, try listing out only the columns you need to select.
There is so much room for optimization
take out * (never use it, use column names)
specify the schema for tables (should be dbo.CallRecords and dbo.Contact)
Finally, the way the data is stored is also a problem. I see a lot of "1" values in CallID as well as ContactID. Is there a clustered index (primary key) on those two tables?
I would rather take out your joins and implement union all as suggested by HLGem. And I agree that it is better to search on IDs than on long strings like this.
HTH