Optimize subquery - SQL

Suppose there are three columns ename, city, salary in a table named emp, which has millions of rows:
ename      city      salary
ak         newyork   $5000
bk         abcd      $4000
ck         Delhi     $4000
...
Maverick   newyork   $8000
I want to retrieve all employees having the same city name as Maverick.
select * from emp
where city = (select city from emp where ename = 'Maverick');
I know it will work, but I am worried about performance, because this query has two WHERE clauses to evaluate.
I need a query having better performance than above query.

Oracle is probably going to do a good job getting the optimal execution plan for this query:
select *
from emp
where city = (select city from emp where ename = 'Maverick');
What would help the query are two indexes:
create index idx_emp_ename_city on emp(ename, city);
create index idx_emp_city on emp(city);
The first would be used for the subquery, the second to look up all the matching rows. Without indexes, Oracle is going to have to read the table at least once (I think at least twice), and that is going to affect performance on such a large table.
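The effect of the two indexes can be sketched in SQLite (not Oracle, but the same B-tree principle applies); the table, names, and data below are a toy reconstruction of the question's emp table, not the real schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE emp (ename TEXT, city TEXT, salary INTEGER);
INSERT INTO emp VALUES
    ('ak', 'newyork', 5000),
    ('bk', 'abcd', 4000),
    ('ck', 'Delhi', 4000),
    ('Maverick', 'newyork', 8000);
CREATE INDEX idx_emp_ename_city ON emp(ename, city);  -- covers the subquery
CREATE INDEX idx_emp_city ON emp(city);               -- covers the outer lookup
""")

rows = conn.execute("""
    SELECT ename FROM emp
    WHERE city = (SELECT city FROM emp WHERE ename = 'Maverick')
    ORDER BY ename
""").fetchall()
print([r[0] for r in rows])  # ['Maverick', 'ak']

# The plan (wording varies by SQLite version) shows both lookups served by
# the indexes instead of full table scans:
for step in conn.execute("""
    EXPLAIN QUERY PLAN
    SELECT * FROM emp
    WHERE city = (SELECT city FROM emp WHERE ename = 'Maverick')
"""):
    print(step)
```

The subquery can be answered entirely from idx_emp_ename_city (a covering index for `ename -> city`), and the outer filter seeks idx_emp_city.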

This would give you the same output but I doubt it will perform any better.
You could compare the plans though.
select x.*
from emp x
join (select city from emp where ename = 'Maverick') y
on x.city = y.city
You can also add 2 indexes, one on the ENAME column, and a separate one on the CITY column.
create index emp_idx_ename on emp(ename);
create index emp_idx_city on emp(city);
The first index will speed up the inline view whose results are being joined to, because it is searching the table on employee.
The second index will speed up the parent query, because it is searching the table for a given city.
You could create a composite index on emp(ename, city) as others have suggested: since you select only the city column where the ename is X, the query in the inline view can use only the index and not the table, which I didn't initially think of. It may provide an additional boost, more or less, depending on the size of the table, although the index itself will be larger.
To make sure the optimizer has up-to-date statistics for the table, I would also run the following after you create the above indexes, so that your query starts using them right away (on recent Oracle versions, DBMS_STATS.GATHER_TABLE_STATS is the preferred way to gather statistics):
analyze table emp compute statistics;

You could use a WITH clause; other users have already suggested many options:
WITH new_city_tab AS (
  SELECT city AS ncity
  FROM emp
  WHERE ename = 'Maverick'
  GROUP BY city
)
SELECT *
FROM emp e,
     new_city_tab c
WHERE e.city = c.ncity;

Sometimes the desire to narrow a query down further loses to its inherent complexity: it just isn't possible to optimize this query much further. You can, however, add indexes to get better performance, on city and on (ename, city).
Try this to create these indexes:
create index emp_city        -- for the outer where clause
  on emp (city);

create index emp_ename_city  -- for the sub query
  on emp (ename, city);

Related

Having multiple indexes with the same columns

I have 2 queries like this
select *
from dbo.employee
where employee.name = 'lucas'
and employee.age = 36
and employee.address = 'street 6'
and a second query like this
select *
from dbo.employee
where employee.name = 'lucas'
and employee.address = 'street 6'
I created an index with multiple columns like this
CREATE NONCLUSTERED INDEX [IX_EMPLOYEE]
ON dbo.employee (name, age, address)
This index works for the first query and performance is fast, but the second query takes longer.
How can I resolve this issue?
I expected that an index with multiple columns would improve the second query as well, but there is no difference; it still takes longer.
Use such an index:
CREATE NONCLUSTERED INDEX [IX_EMPLOYEE]
ON dbo.employee (name, address, age)
Note the order of columns and the fact that name + address covers the WHERE clause of the second query (therefore will make it seekable, that is fast), and that this index is usable in the first query as well.
It would work equally well if order of columns was (address, name, age). For those queries, select the one from those two that has the greatest amount of unique values (check it with SELECT COUNT(DISTINCT address) FROM dbo.employee or try to predict if you don't have the data yet).
You may consider removing the "age" column from the index if there are not many people with the same name at the same address in the worst case. It will seek to the first name + address, and then range scan through all the 'lucas'es on 'street 6' to find if any of them matches the age. If data model allows that, it'd be a reasonable change. However "age" column is probably narrow, so the savings won't be huge, in contrast to "address" column which contains more data (but needs to be first or second in order for those queries to be seekable).
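The leading-column rule above can be demonstrated in SQLite (the question is SQL Server, but B-tree seekability works the same way); the index name and data below are made up for the sketch:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employee (name TEXT, age INTEGER, address TEXT);
INSERT INTO employee VALUES
    ('lucas', 36, 'street 6'),
    ('lucas', 41, 'street 6'),
    ('maria', 36, 'street 2');
-- name + address lead the index, so a WHERE on just those two can seek
CREATE INDEX ix_employee ON employee (name, address, age);
""")

plan = conn.execute("""
    EXPLAIN QUERY PLAN
    SELECT * FROM employee
    WHERE name = 'lucas' AND address = 'street 6'
""").fetchall()
detail = plan[0][-1]
print(detail)  # a SEARCH (seek) using ix_employee, not a full SCAN
```

With the original column order (name, age, address), the same query could only seek on name and would have to filter address row by row.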

Postgres - How to find IDs that are not used in multiple different tables (inactive IDs) - badly written query

I have a table towns, which is the main table. This table contains so many rows and became so 'dirty' (someone inserted 5 million rows) that I would like to get rid of the unused towns.
There are 3 referencing tables that use town_id as a reference to towns.
I know there are many towns that are not used in these tables; only if a town_id is found in none of the 3 tables do I consider it inactive, and then I would like to remove that town (because it's not used).
As you can see, towns is used in these 2 different tables:
employees
offices
and for the vendors table there is a vendor_id column in towns, since one vendor can have multiple towns.
So if vendor_id in towns is null and town_id is not found in either of these 2 tables, it is safe to remove it :)
I created a query which might work, but it takes far too long to execute. It looks something like this:
select count(*)
from towns
where vendor_id is null
and id not in (select town_id from banks)
and id not in (select town_id from employees)
So basically I said: if vendor_id is null, this town is definitely not related to vendors, and if at the same time the town is not in banks or employees, then it is safe to remove it. But the query took too long and never executed successfully, since towns has 5 million rows, which is the reason it is so dirty.
In fact I'm not able to execute the given query at all, since the server terminated abnormally.
Here is full error message:
ERROR: server closed the connection unexpectedly This probably means
the server terminated abnormally before or while processing the
request.
Any kind of help would be awesome
Thanks!
You can join the tables using LEFT JOIN to identify, in the WHERE clause, the town_id values for which there is no row in tables banks and employees:
WITH list AS (
  SELECT t.town_id
  FROM tbl.towns AS t
  LEFT JOIN tbl.banks AS b ON b.town_id = t.town_id
  LEFT JOIN tbl.employees AS e ON e.town_id = t.town_id
  WHERE t.vendor_id IS NULL
    AND b.town_id IS NULL
    AND e.town_id IS NULL
  LIMIT 1000
)
DELETE FROM tbl.towns AS t
USING list AS l
WHERE t.town_id = l.town_id;
Before launching the DELETE, you can check the indexes on your tables.
Adding an index as follows can be useful:
CREATE INDEX town_id_nulls ON towns (town_id NULLS FIRST) ;
Last but not least, you can add a LIMIT clause in the CTE to limit the number of rows you delete per execution of the DELETE and avoid the unexpected termination. As a consequence, you will have to relaunch the DELETE several times until there are no more rows to delete.
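The batch-and-repeat idea can be sketched in SQLite driven from Python (the original is Postgres; table names and row counts below are invented for the demo). Each pass deletes at most 100 inactive towns until nothing is left:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE towns (id INTEGER PRIMARY KEY, vendor_id INTEGER);
CREATE TABLE banks (town_id INTEGER);
CREATE TABLE employees (town_id INTEGER);
""")
# 1000 towns, none linked to a vendor; only towns 1-3 are referenced anywhere
conn.executemany("INSERT INTO towns VALUES (?, NULL)", [(i,) for i in range(1, 1001)])
conn.execute("INSERT INTO banks VALUES (1), (2)")
conn.execute("INSERT INTO employees VALUES (2), (3)")

deleted_total = 0
while True:
    cur = conn.execute("""
        DELETE FROM towns
        WHERE id IN (
            SELECT t.id
            FROM towns t
            LEFT JOIN banks b ON b.town_id = t.id
            LEFT JOIN employees e ON e.town_id = t.id
            WHERE t.vendor_id IS NULL
              AND b.town_id IS NULL
              AND e.town_id IS NULL
            LIMIT 100                      -- the batch size
        )
    """)
    if cur.rowcount == 0:                  # no inactive towns left
        break
    deleted_total += cur.rowcount
    conn.commit()                          # release work done so far

print(deleted_total)  # 997 deleted, towns 1-3 survive
```

Committing between batches is what keeps any single transaction small, which is the point of the LIMIT in the answer above.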
You can try a JOIN on the big tables; it should be faster than the two INs.
You could also try UNION ALL and live with the duplicates, as it is faster than UNION.
Finally you can use a combined index on id and vendor_id to speed up the query.
CREATE TABLE towns (id int, vendor_id int);
CREATE TABLE banks (town_id int);
CREATE TABLE employees (town_id int);
Note that the join condition must be an anti-join, not an inequality join (joining on t1.id <> t2.town_id would compare every town against every non-matching id, which is effectively a cross join):
select count(*)
from towns t1
left join (select town_id from banks
           union
           select town_id from employees) t2 on t1.id = t2.town_id
where t1.vendor_id is null
  and t2.town_id is null;
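A quick SQLite check (toy data, not the asker's 5-million-row table) that the LEFT JOIN anti-join returns the same inactive towns as the two NOT IN subqueries. One caveat worth knowing: NOT IN returns no rows at all if the subquery yields a NULL, so the equivalence only holds while town_id is never NULL in the child tables:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE towns (id INTEGER PRIMARY KEY, vendor_id INTEGER);
CREATE TABLE banks (town_id INTEGER);
CREATE TABLE employees (town_id INTEGER);
-- town 1 is in banks, town 2 in employees, town 3 has a vendor,
-- town 4 is referenced nowhere -> the only inactive one
INSERT INTO towns VALUES (1, NULL), (2, NULL), (3, 7), (4, NULL);
INSERT INTO banks VALUES (1);
INSERT INTO employees VALUES (2);
""")

not_in = conn.execute("""
    SELECT count(*) FROM towns
    WHERE vendor_id IS NULL
      AND id NOT IN (SELECT town_id FROM banks)
      AND id NOT IN (SELECT town_id FROM employees)
""").fetchone()[0]

anti_join = conn.execute("""
    SELECT count(*) FROM towns t
    LEFT JOIN banks b ON b.town_id = t.id
    LEFT JOIN employees e ON e.town_id = t.id
    WHERE t.vendor_id IS NULL
      AND b.town_id IS NULL
      AND e.town_id IS NULL
""").fetchone()[0]

print(not_in, anti_join)  # 1 1 -- both find exactly town 4
```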
The trick is to first make a list of all the town_id's you want to keep and then start removing those that are not there.
By looking in 2 tables you're making life harder for the server so let's just create 1 single list first.
-- build empty temp-table
CREATE TEMPORARY TABLE TEMP_must_keep
AS
SELECT town_id
FROM tbl.towns
WHERE 1 = 2;
-- get id's from first table
INSERT INTO TEMP_must_keep (town_id)
SELECT DISTINCT town_id
FROM tbl.banks;
-- add index to speed up the EXCEPT below
CREATE UNIQUE INDEX idx_uq_must_keep_town_id ON TEMP_must_keep (town_id);
-- add new ones from second table
INSERT INTO TEMP_must_keep (town_id)
SELECT town_id
FROM tbl.employees
EXCEPT -- auto-distincts
SELECT town_id
FROM TEMP_must_keep;
-- rebuild index simply to ensure little fragmentation
REINDEX TABLE TEMP_must_keep;
-- optional, but might help: create a temporary index on the towns table to speed up the delete
CREATE INDEX idx_towns_town_id_where_vendor_null ON tbl.towns (town_id) WHERE vendor_id IS NULL;
-- Now do actual delete
-- You can do a `SELECT COUNT(*)` rather than a `DELETE` first if you feel like it, both will probably take some time depending on your hardware.
DELETE
FROM tbl.towns as del
WHERE vendor_id is null
AND NOT EXISTS ( SELECT *
FROM TEMP_must_keep mk
WHERE mk.town_id = del.town_id);
-- cleanup
DROP INDEX tbl.idx_towns_town_id_where_vendor_null;
DROP TABLE TEMP_must_keep;
The idx_towns_town_id_where_vendor_null index is optional and I'm not sure it will actually lower the total time, but IMHO it will help the DELETE operation, if only because the index should give the query optimizer a better view of what volumes to expect.

What indexes do I need to speed up AND/OR SQL queries

Let's assume I have a table named customer like this:
+----+------+----------+-----+
| id | name | lastname | age |
+----+------+----------+-----+
| .. | ... | .... | ... |
and I need to perform the following query:
SELECT * FROM customer WHERE ((name = 'john' OR lastname = 'doe') AND age = 21)
I'm aware of how single and multi-column indexes work, so I created these ones:
(name, age)
(lastname, age)
Is that all the indexes I need?
The above condition can be rephrased as:
... WHERE ((name = 'john' AND age = 21) OR (lastname = 'doe' AND age = 21))
but I'm not sure how smart RDBMSs are, and whether those indexes are the correct ones.
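The rephrasing is just the distributive law, (a OR b) AND c ≡ (a AND c) OR (b AND c), so both forms must return the same rows regardless of how the planner treats them. A small SQLite check (made-up rows, not the asker's table):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (id INTEGER, name TEXT, lastname TEXT, age INTEGER);
INSERT INTO customer VALUES
    (1, 'john', 'smith', 21),   -- matches via name + age
    (2, 'anna', 'doe',   21),   -- matches via lastname + age
    (3, 'john', 'doe',   30),   -- wrong age
    (4, 'mark', 'twain', 21);   -- wrong name and lastname
""")

q1 = conn.execute("""
    SELECT id FROM customer
    WHERE (name = 'john' OR lastname = 'doe') AND age = 21
    ORDER BY id
""").fetchall()

q2 = conn.execute("""
    SELECT id FROM customer
    WHERE (name = 'john' AND age = 21) OR (lastname = 'doe' AND age = 21)
    ORDER BY id
""").fetchall()

print(q1 == q2, [r[0] for r in q1])  # True [1, 2]
```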
Your approach is reasonable. Two factors are essential here:
Postgres can combine multiple indexes very efficiently with bitmap index scans.
PostgreSQL versus MySQL for EAV structures storage
B-tree index usage is by far most effective when only leading columns of the index are involved.
Is a composite index also good for queries on the first field?
Working of indexes in PostgreSQL
Test case
If you don't have enough data to measure tests, you can always whip up a quick test case like this:
CREATE TABLE customer (id int, name text, lastname text, age int);
INSERT INTO customer
SELECT g
, left(md5('foo'::text || g%500) , 3 + ((g%5)^2)::int)
, left(md5('bar'::text || g%1000), 5 + ((g%5)^2)::int)
, ((random()^2) * 100)::int
FROM generate_series(1, 30000) g; -- 30k rows for quick test case
For your query (reformatted):
SELECT *
FROM customer
WHERE (name = 'john' OR lastname = 'doe')
AND age = 21;
I would go with
CREATE INDEX customer_age_name_idx ON customer (age, name);
CREATE INDEX customer_age_lastname_idx ON customer (age, lastname);
However, depending on many factors, a single index with all three columns and age as first may be able to deliver similar performance. The rule of thumb is to create as few indexes as possible and as many as necessary.
CREATE INDEX customer_age_lastname_name_idx ON customer (age, lastname, name);
The check on (age, name) is potentially slower in this case, but depending on selectivity of the first column it may not matter much.
Updated SQL Fiddle.
Why age first in the index?
This is not very important and needs deeper understanding to explain. But since you ask ...
The order of columns doesn't matter for the 2-column indexes customer_age_name_idx and customer_age_lastname_idx. Details and a test-case:
Multicolumn index and performance
I still put age first to stay consistent with the 3rd index I suggested customer_age_lastname_name_idx, where the order of columns does matter in multiple ways:
Most importantly, both your predicates (age, name) and (age, lastname) share the column age. B-tree indexes are (by far) most effective on leading columns, so putting age first benefits both.
And, less importantly, but still relevant: the size of the index is smaller this way due to data type characteristics, alignment, padding and page layout of index pages.
age is a 4-byte integer and must be aligned at multiples of 4 bytes in the data page. text is of variable length and has no alignment restrictions. Putting the integer first or last is more efficient due to the rules of "column tetris". I added another index on (lastname, age, name) (age in the middle!) to the fiddle just to demonstrate it's ~ 10 % bigger. No space lost to additional padding, which results in a smaller index. And size matters.
For the same reasons it would be better to reorder columns in the demo table like this: (id, age, name, lastname). If you want to learn why, start here:
Making sense of Postgres row sizes
Calculating and saving space in PostgreSQL
Configuring PostgreSQL for read performance
Measure the size of a PostgreSQL table row
Everything I wrote is for the case at hand. If you have other queries / other requirements, the resulting strategy may change.
UNION query equivalent?
Note that a UNION query may or may not return the same result. It folds duplicate rows, which your original does not. Even if you don't have complete duplicates in your table, you may still see this effect with a subset of columns in the SELECT list. Do not blindly substitute with a UNION query. It's not going to be faster anyway.
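The folding caveat is easy to trigger; here is a minimal SQLite illustration (two hypothetical duplicate rows) showing the OR query and its UNION rewrite disagreeing:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (name TEXT, lastname TEXT, age INTEGER);
INSERT INTO customer VALUES
    ('john', 'doe', 21),   -- matches BOTH branches of the rewrite
    ('john', 'doe', 21);   -- exact duplicate row
""")

with_or = conn.execute("""
    SELECT name FROM customer
    WHERE (name = 'john' OR lastname = 'doe') AND age = 21
""").fetchall()

with_union = conn.execute("""
    SELECT name FROM customer WHERE name = 'john' AND age = 21
    UNION
    SELECT name FROM customer WHERE lastname = 'doe' AND age = 21
""").fetchall()

print(len(with_or), len(with_union))  # 2 1 -- UNION folded the duplicates
```

UNION ALL would keep the duplicates but then over-count the row that matches both branches, so neither rewrite is a drop-in replacement.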
Turn the OR into two queries UNIONed:
SELECT * FROM Customer WHERE Age = 21 AND Name = 'John'
UNION
SELECT * FROM Customer WHERE Age = 21 AND LastName = 'Doe'
Then create an index over (Age, Name) and another over (Age, LastName).

Efficient way of getting group ID without sorting

Imagine I have a denormalized table like so:
CREATE TABLE Persons
(
Id int identity primary key,
FirstName nvarchar(100),
CountryName nvarchar(100)
)
INSERT INTO Persons (FirstName, CountryName)
VALUES ('Mark', 'Germany'),
       ('Chris', 'France'),
       ('Grace', 'Italy'),
       ('Antonio', 'Italy'),
       ('Francis', 'France'),
       ('Amanda', 'Italy');
I need to construct a query that returns the name of each person, and a unique ID for their country. The IDs do not necessarily have to be contiguous; more importantly, they do not have to be in any order. What is the most efficient way of achieving this?
The simplest solution appears to be DENSE_RANK:
SELECT FirstName,
CountryName,
DENSE_RANK() OVER (ORDER BY CountryName) AS CountryId
FROM Persons
-- FirstName CountryName CountryId
-- Chris France 1
-- Francis France 1
-- Mark Germany 2
-- Amanda Italy 3
-- Grace Italy 3
-- Antonio Italy 3
However, this incurs a sort on my CountryName column, which is a wasteful performance hog. I came up with this alternative, which uses ROW_NUMBER with the well-known trick for suppressing its sort:
SELECT P.FirstName,
P.CountryName,
C.CountryId
FROM Persons P
JOIN (
SELECT CountryName,
ROW_NUMBER() OVER (ORDER BY (SELECT 1)) AS CountryId
FROM Persons
GROUP BY CountryName
) C
ON C.CountryName = P.CountryName
-- FirstName CountryName CountryId
-- Mark Germany 2
-- Chris France 1
-- Grace Italy 3
-- Antonio Italy 3
-- Francis France 1
-- Amanda Italy 3
Am I correct in assuming that the second query would perform better in general (not just on my contrived data set)? Are there factors that might make a difference either way (such as an index on CountryName)? Is there a more elegant way of expressing it?
Why would you think that an aggregation would be cheaper than a window function? I ask, because I have some experience with both, and don't have a strong opinion on the matter. If pressed, I would guess the window function is faster, because it does not have to aggregate all the data and then join the result back in.
The two queries will have very different execution paths. The right way to see which performs better is to try it out. Run both queries on large enough samples of data in your environment.
By the way, I don't think there is a right answer, because performance depends on several factors:
Which columns are indexed?
How large is the data? Does it fit in memory?
How many different countries are there?
If you are concerned about performance, and just want a unique number, you could consider using checksum() instead. This does run the risk of collisions. That risk is very, very small for 200 or so countries. Plus you can test for it and do something about it if it does occur. The query would be:
SELECT FirstName, CountryName, CheckSum(CountryName) AS CountryId
FROM Persons;
Your second query would most probably avoid sorting as it would use a hash match aggregate to build the inner query, then use a hash match join to map the ID to the actual records.
This does not sort indeed, but has to scan the original table twice.
Am I correct in assuming that the second query would perform better in general (not just on my contrived data set)?
Not necessarily. If you created a clustered index on CountryName, sorting would be a non-issue and everything would be done in a single pass.
Is there a more elegant way of expressing it?
A "correct" plan would be doing the hashing and hash lookups in one go.
Each record, as it's read, would have to be matched against the hash table. On a match, the stored ID would be returned; on a miss, the new country would be added into the hash table, assigned with new ID and that newly assigned ID would be returned.
But I can't think of a way to make SQL Server use such a plan in a single query.
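Outside of SQL the single-pass plan is trivial to write down; here is a Python sketch of the hash-table approach described above (data copied from the question, IDs assigned in first-seen order, which satisfies the "no particular order" requirement):

```python
# One pass over the rows, keeping a hash table from country name to its
# assigned ID: a hit returns the stored ID, a miss assigns the next one.
persons = [
    ('Mark', 'Germany'), ('Chris', 'France'), ('Grace', 'Italy'),
    ('Antonio', 'Italy'), ('Francis', 'France'), ('Amanda', 'Italy'),
]

country_ids = {}      # hash table: country name -> assigned ID
result = []
for first_name, country in persons:
    if country not in country_ids:
        country_ids[country] = len(country_ids) + 1   # miss: assign new ID
    result.append((first_name, country, country_ids[country]))

print(result)
# [('Mark', 'Germany', 1), ('Chris', 'France', 2), ('Grace', 'Italy', 3),
#  ('Antonio', 'Italy', 3), ('Francis', 'France', 2), ('Amanda', 'Italy', 3)]
```

No sort is needed anywhere, and the table is read exactly once, which is the plan shape the answer says SQL Server will not produce from a single query.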
Update:
If you have lots of records, few countries and, most importantly, a non-clustered index on CountryName, you could emulate loose scan to build a list of countries:
DECLARE @country TABLE
(
    id INT NOT NULL IDENTITY PRIMARY KEY,
    countryName VARCHAR(MAX)
);
WITH country AS
(
SELECT TOP 1
countryName
FROM persons
ORDER BY
countryName
UNION ALL
SELECT (
SELECT countryName
FROM (
SELECT countryName,
ROW_NUMBER() OVER (ORDER BY countryName) rn
FROM persons
WHERE countryName > country.countryName
) q
WHERE rn = 1
)
FROM country
WHERE countryName IS NOT NULL
)
INSERT INTO @country (countryName)
SELECT countryName
FROM country
WHERE countryName IS NOT NULL
OPTION (MAXRECURSION 0);

SELECT p.firstName, c.id
FROM persons p
JOIN @country c ON c.countryName = p.countryName;
GROUP BY also uses a sort operator in the background (grouping is based on 'sort and compare', like IComparable in C#).

Non-indexed column in a join

SQL> desc emp_1;
Name Type Nullable Default Comments
-------- ------------ -------- ------- --------
EMP_ID NUMBER
EMP_NAME VARCHAR2(20) Y
DEPTNO NUMBER(10) Y
SQL> desc dept
Name Type Nullable Default Comments
--------- ------------ -------- ------- --------
DEPT_ID NUMBER Y
DEPT_NAME VARCHAR2(20) Y
SQL> CREATE INDEX abc_idex ON emp_1 (deptno);
Index created
select /*+ index(emp_1 abc_idex) */ emp_name from emp_1
INNER JOIN dept ON emp_1.deptno = dept.dept_id
Explain plan:
SELECT STATEMENT, GOAL = ALL_ROWS 271 100000 800000
MERGE JOIN 271 100000 800000
TABLE ACCESS BY INDEX ROWID EXAMINBI EMP_1 267 100000 500000
INDEX FULL SCAN EXAMINBI ABC_IDEX 131 100000
SORT JOIN 4 4 12
TABLE ACCESS FULL EXAMINBI DEPT 3 4 12
select /*+ index(emp_1 abc_idex) */ emp_name from emp_1
INNER JOIN dept ON emp_1.deptno = dept.dept_id
and emp_1.emp_name=dept.dept_name
Explain plan:
SELECT STATEMENT, GOAL = ALL_ROWS 272 1 11
HASH JOIN 272 1 11
TABLE ACCESS FULL EXAMINBI DEPT 3 4 24
TABLE ACCESS BY INDEX ROWID EXAMINBI EMP_1 267 100000 500000
INDEX FULL SCAN EXAMINBI ABC_IDEX 131 100000
I'm trying to clear up my understanding of indexes with your help. I assumed Oracle would skip my index hint, since the join also needs another column (emp_name) that is not in the index, but emp_1 was still accessed via the index in the 2nd case. My question: will the hint help in a case like this, where the join uses another column that is not indexed (in our example emp_name)? Should we use an index hint in such a case?
Note: I know the join on emp_name and dept_name is not a logical join; I created it just for testing purposes.
I want to know if it's recommended to use an index hint when the join uses
non-indexed columns from the same table. Will it help?
Under most circumstances No
Under normal circumstances you simply do not use hints. As you can see here, you've used a hint, Oracle has followed it and done something dumb. You only use hints in very limited circumstances, usually only when you know something about the nature of the data that Oracle can not work out itself. Generally the only hint I use is the cardinality hint, as Oracle can sometimes genuinely not work out the cardinality correctly.
Do not assume that you need to regularly use hints. You don't. Even if a hint works now, it might stop working when the nature of the data changes.
In your case, using the index probably slows down the whole statement, because you are querying the whole of both DEPT and EMP_1. Because of the hint, Oracle has to read both full tables AND the index. Do you really want that?
In simple cases like this I prefer not to use hints. The optimizer does its job quite well.
If you use the statement for a specific department, the result would be better:
select emp_name
from emp_1 INNER JOIN dept ON emp_1.deptno = dept.dept_id
where dept.dept_name = 'any department'
and so:
select /*+ cardinality(0)*/ emp_name
from emp_1
INNER JOIN dept ON emp_1.deptno = dept.dept_id