Multiple indexes with the same columns - SQL

I have two queries like this:
select *
from dbo.employee
where employee.name = 'lucas'
and employee.age = 36
and employee.address = 'street 6'
and a second query like this:
select *
from dbo.employee
where employee.name = 'lucas'
and employee.address = 'street 6'
I created an index with multiple columns like this:
CREATE NONCLUSTERED INDEX [IX_EMPLOYEE]
ON dbo.employee (name, age, address)
This index works for the first query and performance is fast, but the second query takes longer.
How can I resolve this?
I expected that an index over the same columns would also improve the second query, but there is no difference; it still takes just as long.

Use an index like this instead:
CREATE NONCLUSTERED INDEX [IX_EMPLOYEE]
ON dbo.employee (name, address, age)
Note the order of the columns: name + address covers the WHERE clause of the second query (which makes it seekable, that is, fast), and the index is usable by the first query as well.
It would work equally well if the column order were (address, name, age). For these queries, put first whichever of the two columns has the greater number of unique values (check with SELECT COUNT(DISTINCT address) FROM dbo.employee, or estimate it if you don't have the data yet).
You may consider removing the age column from the index if, in the worst case, there are not many people with the same name at the same address. The index will then seek to the first name + address match and range-scan through all the 'lucas'es on 'street 6' to check whether any of them matches the age. If the data model allows that, it's a reasonable change. However, the age column is probably narrow, so the savings won't be huge, in contrast to the address column, which contains more data (but needs to be first or second in order for these queries to be seekable).
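If you go that route, a minimal sketch of the narrower index (the index name below is hypothetical):
CREATE NONCLUSTERED INDEX [IX_EMPLOYEE_NAME_ADDRESS]
ON dbo.employee (name, address)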

Related

SQLITE: How to make indexing work for you?

I have an SQLite db of employees with about a million entries.
company:
emp_id(primary) | first_name | last_name | company_name | job_title
The db contains only 10 distinct company names (i.e. let's say each company has about 100k employees)
I created an index on company name:
CREATE INDEX cmp_name ON company(company_name)
But I have not gained any speed when performing this query:
WITH INDEX:
select * from company INDEXED BY cmp_name where company_name = 'XYZ corp';
Time: 88.45 sec
WITHOUT INDEX:
select * from company where company_name = 'XYZ corp';
Time: 89.12 sec
What am I doing wrong?
A database is organized into pages. With only ten distinct companies, roughly one row in ten matches, so if more than ten rows fit into a page, then on average every page contains at least one "XYZ corp" row, and reading all the "XYZ corp" rows still requires reading most pages. Furthermore, because the index entries are not in the same order as the table rows, the table's pages are no longer read in order.
The only way to speed up this query would be to use a covering index. First, reduce the number of columns read to the absolute minimum that you actually need, then add all those columns to the company name index (the INTEGER PRIMARY KEY column is implicitly part of every index):
CREATE INDEX cmp_name_and_other_stuff ON company(company_name, last_name);
SELECT emp_id, last_name FROM company WHERE company_name = 'XYZ Corp';
Doing this for every query will waste lots of storage space.
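To check whether the covering index is actually used, a quick sanity check like this may help; SQLite's EXPLAIN QUERY PLAN output typically mentions a covering index once the query touches only indexed columns:
EXPLAIN QUERY PLAN
SELECT emp_id, last_name FROM company WHERE company_name = 'XYZ Corp';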

What indexes do I need to speed up AND/OR SQL queries

Let's assume I have a table named customer like this:
+----+------+----------+-----+
| id | name | lastname | age |
+----+------+----------+-----+
| .. | ... | .... | ... |
and I need to perform the following query:
SELECT * FROM customer WHERE ((name = 'john' OR lastname = 'doe') AND age = 21)
I'm aware of how single and multi-column indexes work, so I created these ones:
(name, age)
(lastname, age)
Is that all the indexes I need?
The above condition can be rephrased as:
... WHERE ((name = 'john' AND age = 21) OR (lastname = 'doe' AND age = 21))
but I'm not sure how smart the RDBMS is, and whether those indexes are the correct ones.
Your approach is reasonable. Two factors are essential here:
Postgres can combine multiple indexes very efficiently with bitmap index scans. (See: PostgreSQL versus MySQL for EAV structures storage.)
B-tree index usage is by far most effective when only the leading columns of the index are involved. (See: Is a composite index also good for queries on the first field? and Working of indexes in PostgreSQL.)
Test case
If you don't have enough data to run meaningful tests, you can always whip up a quick test case like this:
CREATE TABLE customer (id int, name text, lastname text, age int);
INSERT INTO customer
SELECT g
, left(md5('foo'::text || g%500) , 3 + ((g%5)^2)::int)
, left(md5('bar'::text || g%1000), 5 + ((g%5)^2)::int)
, ((random()^2) * 100)::int
FROM generate_series(1, 30000) g; -- 30k rows for quick test case
For your query (reformatted):
SELECT *
FROM customer
WHERE (name = 'john' OR lastname = 'doe')
AND age = 21;
I would go with
CREATE INDEX customer_age_name_idx ON customer (age, name);
CREATE INDEX customer_age_lastname_idx ON customer (age, lastname);
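To confirm that the planner combines both indexes with bitmap index scans as described above, a quick plan check is a useful sketch (the exact plan shape depends on data, statistics and settings):
EXPLAIN ANALYZE
SELECT *
FROM customer
WHERE (name = 'john' OR lastname = 'doe')
AND age = 21;
-- typically shows a BitmapOr over two Bitmap Index Scans on the indexes above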
However, depending on many factors, a single index with all three columns and age as first may be able to deliver similar performance. The rule of thumb is to create as few indexes as possible and as many as necessary.
CREATE INDEX customer_age_lastname_name_idx ON customer (age, lastname, name);
The check on (age, name) is potentially slower with this single index (name is not in a leading position), but depending on the selectivity of the first column it may not matter much.
Updated SQL Fiddle.
Why age first in the index?
This is not very important and needs deeper understanding to explain. But since you ask ...
The order of columns doesn't matter for the 2-column indexes customer_age_name_idx and customer_age_lastname_idx. Details and a test-case:
Multicolumn index and performance
I still put age first to stay consistent with the 3rd index I suggested customer_age_lastname_name_idx, where the order of columns does matter in multiple ways:
Most importantly, both your predicates (age, name) and (age, lastname) share the column age. B-tree indexes are (by far) most effective on leading columns, so putting age first benefits both.
And, less importantly, but still relevant: the size of the index is smaller this way due to data type characteristics, alignment, padding and page layout of index pages.
age is a 4-byte integer and must be aligned at multiples of 4 bytes in the data page. text is of variable length and has no alignment restrictions. Due to the rules of "column tetris", putting the integer first (or last) is more efficient: no space is lost to additional padding, which results in a smaller index. I added another index on (lastname, age, name) (age in the middle!) to the fiddle just to demonstrate that it's ~ 10 % bigger. And size matters.
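If you want to verify the size difference yourself, a small check along these lines should do (the second index name is a hypothetical one for the (lastname, age, name) variant):
SELECT pg_size_pretty(pg_relation_size('customer_age_lastname_name_idx'));
SELECT pg_size_pretty(pg_relation_size('customer_lastname_age_name_idx'));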
For the same reasons it would be better to reorder columns in the demo table like this: (id, age, name, lastname). If you want to learn why, start here:
Making sense of Postgres row sizes
Calculating and saving space in PostgreSQL
Configuring PostgreSQL for read performance
Measure the size of a PostgreSQL table row
Everything I wrote is for the case at hand. If you have other queries / other requirements, the resulting strategy may change.
UNION query equivalent?
Note that a UNION query may or may not return the same result. It folds duplicate rows, which your original does not. Even if you don't have complete duplicates in your table, you may still see this effect with a subset of columns in the SELECT list. Do not blindly substitute with a UNION query. It's not going to be faster anyway.
Turn the OR into two queries UNIONed:
SELECT * FROM Customer WHERE Age = 21 AND Name = 'John'
UNION
SELECT * FROM Customer WHERE Age = 21 AND LastName = 'Doe'
Then create an index over (Age, Name) and another over (Age, LastName).

Optimize sub query

Suppose there are three columns: ename, city, salary. There are millions of rows in this table, named emp.
ename city salary
ak newyork $5000
bk abcd $4000
ck Delhi $4000
....................
...................
Maverick newyork $8000
I want to retrieve all employees having the same city name as Maverick.
select * from emp where
city = (select city from emp where ename= 'maverick' )
I know it will work, but I am concerned about performance, because there are two WHERE clauses in this query.
I need a query having better performance than above query.
Oracle is probably going to do a good job getting the optimal execution plan for this query:
select *
from emp
where city = (select city from emp where ename= 'maverick' ) ;
What would help the query are two indexes:
create index idx_emp_ename_city on emp(ename, city);
create index idx_emp_city on emp(city);
The first would be used for the subquery, the second to look up all the matching rows. Without indexes, Oracle is going to have to scan the table at least once (probably twice), and that is going to affect performance on such a large table.
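If you want to check that both indexes are picked up, a quick look at the execution plan is a reasonable sketch (Oracle syntax):
EXPLAIN PLAN FOR
select * from emp where city = (select city from emp where ename = 'maverick');
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);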
This would give you the same output but I doubt it will perform any better.
You could compare the plans though.
select x.*
from emp x
join (select city from emp where ename = 'maverick') y
on x.city = y.city
You can also add 2 indexes, one on the ENAME column, and a separate one on the CITY column.
create index emp_idx_ename on emp(ename);
create index emp_idx_city on emp(city);
The first index will speed up the inline view whose results are being joined to, because it searches the table by employee name.
The second index will speed up the parent query, because it is searching the table for a given city.
You could also create a composite index on emp(ename, city) as others have suggested: since you select only the city column where the ename is X, the query in the inline view can then use only the index and not the table, which I didn't initially think of. It may provide an additional boost, more or less, depending on the size of the table, although the index will also be larger.
To make sure the optimizer has up-to-date statistics for the table, I would also run the following after creating the above indexes, so that your query can start using them right away:
analyze table emp compute statistics;
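As a side note, on current Oracle versions DBMS_STATS is the documented way to gather optimizer statistics; a minimal sketch from SQL*Plus (this assumes the table is in your own schema):
EXEC DBMS_STATS.GATHER_TABLE_STATS(ownname => USER, tabname => 'EMP');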
You could use a WITH clause. Other users have already suggested several options; here is one more:
WITH new_city_tab AS (
  SELECT city AS ncity
  FROM emp
  WHERE ename = 'Maverick'
  GROUP BY city)
SELECT *
FROM emp e,
     new_city_tab c
WHERE e.city = c.ncity;
Sometimes complexity wins over the desire to narrow the query down further; it just isn't possible to optimize this query itself any more.
You could, however, add indexes to get better performance. They should cover city and ename.
Try creating these indexes:
create index emp_city on emp (city);               -- for the outer where clause
create index emp_ename_city on emp (ename, city);  -- for the sub query

How to efficiently write DISTINCT query in Django with table having foreign keys

I want to show distinct cities of Users in a front-end dropdown. For that, I make a db query which fetches distinct city_name from the City table, but only for cities where users are present.
Something like the below works for a small User table, but takes a very long time if the User table has around 10 million rows. The number of distinct cities for these users is still only ~100 though.
class City(models.Model):
    city_code = models.IntegerField(unique=True)
    city_name = models.CharField(max_length=256)

class User(models.Model):
    city = models.ForeignKey('City', to_field='city_code')
Now I try to search distinct city names as:
City.objects.filter().values_list('city__city_name').distinct()
which translates to this on PostgreSQL:
SELECT DISTINCT "city"."city_name"
FROM "user"
LEFT OUTER JOIN "city"
ON ("user"."city_id" = "city"."city_code");
Time: 9760.302 ms
That clearly showed that PostgreSQL was not making use of the index on 'user'.'city_id'. I also read about a workaround solution which involved writing a custom SQL query that does utilize the index.
I tried to find the distinct 'user'.'city_id' values using the query below, and that actually turned out to be pretty fast.
WITH RECURSIVE t(n) AS (
    SELECT min(city_id) FROM "user"
    UNION
    SELECT (SELECT city_id
            FROM "user"
            WHERE city_id > n
            ORDER BY city_id
            LIMIT 1)
    FROM t
    WHERE n IS NOT NULL)
SELECT n
FROM t;
Time: 79.056 ms
But now I am finding it hard to incorporate this in my Django code. I still think it is a kind of hack, adding a custom query in the code for this. But a bigger concern for me is that the column name can be totally dynamic, and I cannot hardcode these column names (e.g. city_id, etc.) in the code.
#original_fields could be a list from input, like ['area_code__district_code__name']
dataset_klass.objects.filter().values_list(*original_fields).distinct()
Using the custom query would at least require splitting the field name on the '__' delimiter and processing the first part. But it looks like a bad hack to me.
How can I improve this?
PS: The City/User example is just shown to explain the scenario; the syntax might not be correct.
I finally arrived at this workaround solution.
from django.db import connection, transaction

original_field = 'city__city_name'
dataset_name = 'user'
dataset_klass = eval(camelize(dataset_name))  # camelize() is assumed to be defined elsewhere
split_arr = original_field.split("__", 1)

# If a foreign key relation is present
if len(split_arr) > 1:
    parent_field = dataset_klass._meta.get_field_by_name(split_arr[0])[0]
    cursor = connection.cursor()
    # This query will run fast only if parent_field is indexed (city_id)
    cursor.execute('WITH RECURSIVE t(n) AS ( select min({0}) from {1} '
                   'union select (select {0} from {1} where {0} > n'
                   ' order by {0} limit 1) from t where n is not null) '
                   'select n from t;'.format(parent_field.get_attname_column()[1], dataset_name))
    # Create a list of all distinct city_id's
    distinct_values = [single[0] for single in cursor.fetchall()]
    # Build a filter dict from the foreign key field to the above list,
    # to get the actual city_name's using _meta information
    filter_dict = {parent_field.rel.field_name + '__in': distinct_values}
    values = parent_field.rel.to.objects.filter(**filter_dict).values_list(split_arr[1])
else:
    values = dataset_klass.objects.filter().values_list(original_field).distinct()
This utilizes the index on city_id in the user table and runs pretty fast.

Fetch only part of your result set at a time?

I am fetching a huge result set of about 5 million rows (with 10-15 columns) with my query. There is no ID column and one cannot even be created (not my fault), so I cannot partition my data on the basis of an ID and then load it in parts. What makes it worse is that this is SQL Server 2000, so most of the convenient SQL coding features might not even be available for this DB. Is there any way I can do something like -
Select top 10000 column_list from myTable
then, select the next top 10000 column_list from myTable (i.e. rows 10001 to 20000)
and so on...
If you have a useful index, you can grab 10000 rows at a time by tracking the value based on the index.
Suppose the useful index is LastName + FirstName
Select top 10000 column_list from MyTable
order by LastName, FirstName
Then when you get the next 10000 rows, use the query
Select top 10000 column_list from MyTable
where LastName >= PreviousLastname AND FirstName > PreviousFirstname
order by LastName, FirstName
The pseudocode above assumes no duplicates on the combination; if you could have duplicates, the easiest method is to add another column (even if not indexed) that makes the combination unique. You would then need that third column in the ORDER BY clause as well.
PreviousLastname is the value from the 10,000th record of the previous query.
ADDED
A useful index in this context is any index with high cardinality -- mostly distinct values, or at most a minimal number of non-distinct values. An extremely non-useful index would be something like gender (M/F/null).
Since you are using this for data loading, the index selection is not important (ignoring performance considerations) as long as it has high cardinality. Note that the index and the ORDER BY clause must match, or you will put a heavy load on your database.
REVISION -- I saw an obvious mistake in the WHERE clause for the additional pages:
where LastName >= PreviousLastname AND FirstName > PreviousFirstname
This should have been:
where (LastName > PreviousLastname)
or (LastName = PreviousLastname AND FirstName > PreviousFirstname)
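Putting the revision together, a sketch of the full "next page" query; @PreviousLastName and @PreviousFirstName are hypothetical placeholders for the values taken from the last row of the previous batch:
SELECT TOP 10000 column_list
FROM MyTable
WHERE (LastName > @PreviousLastName)
   OR (LastName = @PreviousLastName AND FirstName > @PreviousFirstName)
ORDER BY LastName, FirstName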