SQLITE: How to make indexing work for you?

I have a sqlite db of employees with about a million entries.
company:
emp_id(primary) | first_name | last_name | company_name | job_title
The db contains only 10 distinct company names (i.e. let's say each company has about 100k employees)
I created an index on company name:
CREATE INDEX cmp_name ON company(company_name)
But I have not gained any speed while performing query:
WITH INDEX:
select * from company INDEXED BY cmp_name where company_name = 'XYZ corp';
Time: 88.45 sec
WITHOUT INDEX:
select * from company where company_name = 'XYZ corp';
Time: 89.12 sec
What am I doing wrong?

A database is organized into pages. If more than ten rows fit into a page, then on average, reading all the "XYZ Corp" rows still requires reading most pages. Furthermore, because the index entries are not in the same order as the table rows, the table's pages are no longer read in order.
The only way to speed up this query would be to use a covering index. First, reduce the number of columns read to the absolute minimum that you actually need, then add all those columns to the company name index (the INTEGER PRIMARY KEY column is implicitly part of every index):
CREATE INDEX cmp_name_and_other_stuff ON company(company_name, last_name);
SELECT emp_id, last_name FROM company WHERE company_name = 'XYZ Corp';
Doing this for every query will waste lots of storage space.
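You can confirm that SQLite actually uses the covering index by checking the query plan (a minimal check; the exact wording of the output varies by SQLite version):
EXPLAIN QUERY PLAN
SELECT emp_id, last_name FROM company WHERE company_name = 'XYZ Corp';
-- expect something like: SEARCH company USING COVERING INDEX cmp_name_and_other_stuff (company_name=?)
If the plan does not say COVERING INDEX, the query is still touching the table pages and you will not see the speedup.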

Related

Have multiple index with same column

I have 2 queries like this
select *
from dbo.employee
where employee.name = 'lucas'
and employee.age = 36
and employee.address = 'street 6'
and a second query like this
select *
from dbo.employee
where employee.name = 'lucas'
and employee.address = 'street 6'
I created an index with multiple columns like this
CREATE NONCLUSTERED INDEX [IX_EMPLOYEE]
ON dbo.employee (name, age, address)
This index works for the first query and performance is fast, but the second query takes longer.
How can I reproduce this issue?
I expected that an index over the same columns would improve the second query as well, but there is no difference; it still takes longer.
Use such an index:
CREATE NONCLUSTERED INDEX [IX_EMPLOYEE]
ON dbo.employee (name, address, age)
Note the order of columns and the fact that name + address covers the WHERE clause of the second query (therefore making it seekable, that is, fast), and that this index is usable by the first query as well.
It would work equally well if the order of columns were (address, name, age). Of those two orderings, pick the one whose leading column has the greater number of unique values (check it with SELECT COUNT(DISTINCT address) FROM dbo.employee, or try to predict it if you don't have the data yet).
You may consider removing the "age" column from the index if, in the worst case, there are not many people with the same name at the same address. The engine will seek to the first name + address, and then range-scan through all the 'lucas'es on 'street 6' to check whether any of them matches the age. If the data model allows that, it'd be a reasonable change. However, the "age" column is probably narrow, so the savings won't be huge, in contrast to the "address" column, which contains more data (but needs to be first or second in the index for those queries to be seekable).
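If you do decide to drop the age column, a sketch of the narrower index discussed above (whether this is worthwhile depends on your data distribution):
CREATE NONCLUSTERED INDEX [IX_EMPLOYEE]
ON dbo.employee (name, address)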

Count number of rows returned in a SQL statement

Are there any DB engines that allow you to run an EXPLAIN (or other function) where it will give you an approximate count of values that may be returned before an aggregation is run (not rows scanned but that actually would be returned)? For example, in the following query:
SELECT gender, COUNT(1) FROM sales JOIN (
SELECT id, person FROM sales2 WHERE country='US'
GROUP BY person_id
) USING (id)
WHERE sales.age > 20
GROUP BY gender
Let's say this query returns 3 rows after being aggregated, but would return 170M rows if unaggregated.
Are there any tools where you can run the query to get this '170M' number or does this have to do with complexity theory (or something similar) where it's almost just as expensive to run the query (without the final aggregation/having/sort/limit/etc) to get the count? In other words, doing a rewrite to:
SELECT COUNT(1) FROM sales JOIN (
SELECT id, person FROM sales2 WHERE country='US'
GROUP BY person_id
) USING (id)
WHERE sales.age > 20
But having to execute the query nonetheless.
As an example of how far off the current (MySQL) EXPLAIN estimate is from what I'm looking for:
explain select * from movies where title>'a';
# rows=147900
select count(1) from _tracktitle where title>'a';
# 144647 --> OK, pretty close
explain select * from movies where title>'u';
# rows=147900
select * from movies where title>'u';
# 11816 --> Not close at all
Assuming you can use MS SQL Server, you could tap into the same data the Optimiser is using for cardinality estimation: DBCC SHOW_STATISTICS (table, index) WITH HISTOGRAM
Part of the data set you get back is a per-column histogram, which is essentially the number of rows for each value range found in the table.
You probably want to query the data programmatically; one way to achieve this is to insert it into a temp table:
CREATE TABLE #histogram (
    RANGE_HI_KEY datetime PRIMARY KEY,
    RANGE_ROWS INT,
    EQ_ROWS INT,
    DISTINCT_RANGE_ROWS INT,
    AVG_RANGE_ROWS FLOAT
)
INSERT INTO #histogram
EXEC ('DBCC SHOW_STATISTICS (Users, CreationDate) WITH HISTOGRAM')

SELECT 'Estimate', SUM(RANGE_ROWS + EQ_ROWS)
FROM #histogram
WHERE RANGE_HI_KEY BETWEEN '2010-08-30 08:28:45.070' AND '2010-09-20 22:15:33.603'
UNION ALL
SELECT 'Actual', COUNT(1)
FROM Users u
WHERE u.CreationDate BETWEEN '2010-08-30 08:28:45.070' AND '2010-09-20 22:15:33.603'
For example, here is what that same query returns when run against the Stack Overflow database:
|          | Rows  |
| -------- | ----- |
| Estimate | 98092 |
| Actual   | 11715 |
That seems like a lot, but keep in mind that the whole table has almost 15 million records.
A note on precision and other gotchas
The maximum number of histogram steps is capped at 200, which is not a lot, so you are not getting a guaranteed 10% margin of error; but then, neither does SQL Server.
As you insert data into the table, histograms may get stale, so your results would get skewed even more.
There are different ways to update this data; some are reasonably quick, while others effectively require a full table scan.
Not all columns will have statistics. You can either create them manually or (I believe) they get created automatically if you run a search with the column as a predicate.
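For a column that lacks statistics, a minimal sketch of creating and refreshing them by hand (the statistics name is arbitrary; Users/CreationDate are the same example objects as above, and FULLSCAN is one of the slower, full-table-scan update options mentioned):
CREATE STATISTICS st_Users_CreationDate ON Users (CreationDate)
-- refresh later; FULLSCAN reads the whole table, sampled updates are quicker
UPDATE STATISTICS Users st_Users_CreationDate WITH FULLSCAN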
MS SQL Server offers "execution plans". In the picture below I have two queries, and I press Ctrl-L to see their plans.
In my queries I return all records in the first and just the count in the other, using the same table.
Look at the metric the red arrows point to: the estimated number of rows that WILL be scanned when the queries are run. In this case, that number is the same regardless of whether you ask for count(*) or *, which is exactly your point!

What indexes do I need to speed up AND/OR SQL queries

Let's assume I have a table named customer like this:
+----+------+----------+-----+
| id | name | lastname | age |
+----+------+----------+-----+
| .. | ... | .... | ... |
and I need to perform the following query:
SELECT * FROM customer WHERE ((name = 'john' OR lastname = 'doe') AND age = 21)
I'm aware of how single and multi-column indexes work, so I created these ones:
(name, age)
(lastname, age)
Is that all the indexes I need?
The above condition can be rephrased as:
... WHERE ((name = 'john' AND age = 21) OR (lastname = 'doe' AND age = 21))
but I'm not sure how smart RDBMS are, and if those indexes are the correct ones
Your approach is reasonable. Two factors are essential here:
Postgres can combine multiple indexes very efficiently with bitmap index scans. (See: PostgreSQL versus MySQL for EAV structures storage.)
B-tree index usage is by far most effective when only the leading columns of the index are involved. (See: Is a composite index also good for queries on the first field? and Working of indexes in PostgreSQL.)
Test case
If you don't have enough data to measure tests, you can always whip up a quick test case like this:
CREATE TABLE customer (id int, name text, lastname text, age int);
INSERT INTO customer
SELECT g
, left(md5('foo'::text || g%500) , 3 + ((g%5)^2)::int)
, left(md5('bar'::text || g%1000), 5 + ((g%5)^2)::int)
, ((random()^2) * 100)::int
FROM generate_series(1, 30000) g; -- 30k rows for quick test case
For your query (reformatted):
SELECT *
FROM customer
WHERE (name = 'john' OR lastname = 'doe')
AND age = 21;
I would go with
CREATE INDEX customer_age_name_idx ON customer (age, name);
CREATE INDEX customer_age_lastname_idx ON customer (age, lastname);
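To see whether the planner actually combines the two indexes, you can inspect the plan for the original query against the test data above (a minimal check; with these indexes the plan typically shows a BitmapOr over two bitmap index scans):
EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM customer
WHERE (name = 'john' OR lastname = 'doe')
  AND age = 21;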
However, depending on many factors, a single index with all three columns and age as first may be able to deliver similar performance. The rule of thumb is to create as few indexes as possible and as many as necessary.
CREATE INDEX customer_age_lastname_name_idx ON customer (age, lastname, name);
The check on (age, name) is potentially slower in this case, but depending on selectivity of the first column it may not matter much.
Updated SQL Fiddle.
Why age first in the index?
This is not very important and needs deeper understanding to explain. But since you ask ...
The order of columns doesn't matter for the 2-column indexes customer_age_name_idx and customer_age_lastname_idx. Details and a test-case:
Multicolumn index and performance
I still put age first to stay consistent with the 3rd index I suggested customer_age_lastname_name_idx, where the order of columns does matter in multiple ways:
Most importantly, both your predicates (age, name) and (age, lastname) share the column age. B-tree indexes are (by far) most effective on leading columns, so putting age first benefits both.
And, less importantly, but still relevant: the size of the index is smaller this way due to data type characteristics, alignment, padding and page layout of index pages.
age is a 4-byte integer and must be aligned at multiples of 4 bytes in the data page. text is of variable length and has no alignment restrictions. Putting the integer first or last is more efficient due to the rules of "column tetris": no space is lost to additional padding, which results in a smaller index. And size matters. I added another index on (lastname, age, name) (age in the middle!) to the fiddle just to demonstrate that it's ~10% bigger.
For the same reasons it would be better to reorder columns in the demo table like this: (id, age, name, lastname); a sketch of the reordered definition follows the links below. If you want to learn why, start here:
Making sense of Postgres row sizes
Calculating and saving space in PostgreSQL
Configuring PostgreSQL for read performance
Measure the size of a PostgreSQL table row
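As mentioned above, a minimal sketch of the demo table with the fixed-width integer columns first:
CREATE TABLE customer (id int, age int, name text, lastname text);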
Everything I wrote is for the case at hand. If you have other queries / other requirements, the resulting strategy may change.
UNION query equivalent?
Note that a UNION query may or may not return the same result. It folds duplicate rows, which your original does not. Even if you don't have complete duplicates in your table, you may still see this effect with a subset of columns in the SELECT list. Do not blindly substitute with a UNION query. It's not going to be faster anyway.
Turn the OR into two queries UNIONed:
SELECT * FROM Customer WHERE Age = 21 AND Name = 'John'
UNION
SELECT * FROM Customer WHERE Age = 21 AND LastName = 'Doe'
Then create an index over (Age, Name) and another over (Age, LastName).

SQL Huge Read Only Table Performance Filter and Ordering

I have a table with 1 billion rows that holds possible solutions to a goal setting program.
The combination of each column's value creates a successful goal path. I want to filter records to show the top 10 rows that are ordered by the choice of the user. Someone may want the lowest possible retirement age, then lowest deposit amount. Someone else may want the highest possible survival chance, then highest ending balance, ...
Here are my columns:
age tinyint
retirement_age tinyint
retirement_length tinyint
survival smallint
deposit int
balance_start int
balance_end int
SLOW 10 MIN QUERY:
select top(10) age,retirement_age,retirement_length,survival,deposit,balance_start,balance_end
from TABLE
where
age >= 30
and survival >= 8000 --OUT OF 10000
and balance_start <= 20000
and retirement_age >= 60
and retirement_age <= 75
and retirement_length >= 10
and retirement_length <= 25
and deposit >= 1000
and deposit <= 20000
ORDER BY -- (COLUMN ORDER PREFERENCES UNKNOWN)
retirement_age,
deposit,
retirement_length desc,
balance_end desc,
age desc,
survival desc
That query takes 10 min.
All of the records are generated once, so there is no more writing/updating to the database. I was thinking I should index each column, but have not done so. The database is 30GB right now, but space is not an issue.
I have run the Estimated Execution plan:
select: 0%
parallelism: 0%
sort: 23%
table scan: 77%
Have you tried creating an index like
CREATE INDEX IX_TABLE ON [TABLE]
(age,survival,balance_start,retirement_age,retirement_length,deposit)
INCLUDE (balance_end)
The order of the index fields (age,survival,balance_start,retirement_age,retirement_length,deposit) will make a difference if not all the fields are used in the WHERE clause, so make sure to put them in order of most used.
Also, the order of the included columns does not make any difference.
Seeing as the table values will not change, you can create more than one such index to improve the performance of other queries that do not use all the fields in the WHERE clause.
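For instance, a hypothetical second index for queries that only filter on retirement_age and deposit might look like this (the name and column choice are assumptions; tailor them to your actual queries):
CREATE INDEX IX_TABLE_RETIREMENT ON [TABLE]
(retirement_age, deposit)
INCLUDE (age, survival, balance_start, retirement_length, balance_end)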
I ended up creating separate indexes on each of the columns in my WHERE and ORDER BY clauses, matching the sort direction used in the ORDER BY where applicable:
CREATE INDEX IX_age ON TABLE (age desc)
CREATE INDEX IX_retirement_age ON TABLE (retirement_age)
CREATE INDEX IX_retirement_length ON TABLE (retirement_length desc)
CREATE INDEX IX_survival ON TABLE (survival desc)
CREATE INDEX IX_deposit ON TABLE (deposit)
CREATE INDEX IX_balance_start ON TABLE (balance_start)
CREATE INDEX IX_balance_end ON TABLE (balance_end desc)

Fetch only part of your result set at a time?

I am fetching a huge result set of about 5 million rows (with 10-15 columns) with my query. There is no ID column and one cannot even be created (not my fault), so I cannot partition my data by ID and load it in parts. What makes it worse is that this is SQL Server 2000, so most of the convenient SQL coding features might not even be available for this DB. Is there any way I can do something like:
Select top 10000 column_list from myTable
then, select next top 10000 column_list from myTable (ie 10001 to 20000)
and so on...
If you have a useful index, you can grab 10000 rows at a time by tracking the value based on the index.
Suppose the useful index is LastName + FirstName
Select top 10000 column_list from MyTable
order by LastName, FirstName
Then when you get the next 10000 rows, use the query
Select top 10000 column_list from MyTable
where LastName >= PreviousLastname && FirstName > PreviousFirstname
order by LastName, FirstName
The pseudocode above assumes no duplicates on that combination; if you could have duplicates, the easiest method is to add another column (even if not indexed) that makes it unique. You would need that third column in the ORDER BY clause as well.
PreviousLastname is the value from the 10,000th record of the previous query.
ADDED
A useful index in this context is any index with high cardinality -- mostly distinct values, or at most a minimal number of non-distinct values. An extremely non-useful index would be something like gender (M/F/null).
Since you are using this for data loading, the index selection is not important (ignoring performance considerations) as long as it has high cardinality. Note that the index and the ORDER BY clause must match, or you will put a heavy load on your database.
REVISION -- I spotted an obvious mistake in the WHERE clause for fetching the additional data:
where LastName >= PreviousLastname && FirstName > PreviousFirstname
This should have been
where (LastName > PreviousLastname)
or (LastName = PreviousLastname && FirstName > PreviousFirstname)
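Putting the pieces together, a minimal sketch of the corrected follow-up query in actual T-SQL (the @PreviousLastName/@PreviousFirstName values and the column list are placeholders; substitute the values from the last row of the previous batch and your real columns):
DECLARE @PreviousLastName varchar(100), @PreviousFirstName varchar(100)
-- values taken from the 10,000th row of the previous batch (sample values only)
SET @PreviousLastName = 'Smith'
SET @PreviousFirstName = 'John'

SELECT TOP 10000 LastName, FirstName -- plus the rest of your column_list
FROM MyTable
WHERE (LastName > @PreviousLastName)
   OR (LastName = @PreviousLastName AND FirstName > @PreviousFirstName)
ORDER BY LastName, FirstName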