What indexes do I need to speed up AND/OR SQL queries

Let's assume I have a table named customer like this:
+----+------+----------+-----+
| id | name | lastname | age |
+----+------+----------+-----+
| .. | ...  | ....     | ... |
and I need to perform the following query:
SELECT * FROM customer WHERE ((name = 'john' OR lastname = 'doe') AND age = 21)
I'm aware of how single and multi-column indexes work, so I created these ones:
(name, age)
(lastname, age)
Is that all the indexes I need?
The above condition can be rephrased as:
... WHERE ((name = 'john' AND age = 21) OR (lastname = 'doe' AND age = 21))
but I'm not sure how smart RDBMSs are, and whether those indexes are the correct ones.

Your approach is reasonable. Two factors are essential here:
Postgres can combine multiple indexes very efficiently with bitmap index scans.
PostgreSQL versus MySQL for EAV structures storage
B-tree index usage is by far most effective when only leading columns of the index are involved.
Is a composite index also good for queries on the first field?
Working of indexes in PostgreSQL
Test case
If you don't have enough data for meaningful tests, you can always whip up a quick test case like this:
CREATE TABLE customer (id int, name text, lastname text, age int);
INSERT INTO customer
SELECT g
, left(md5('foo'::text || g%500) , 3 + ((g%5)^2)::int)
, left(md5('bar'::text || g%1000), 5 + ((g%5)^2)::int)
, ((random()^2) * 100)::int
FROM generate_series(1, 30000) g; -- 30k rows for quick test case
For your query (reformatted):
SELECT *
FROM customer
WHERE (name = 'john' OR lastname = 'doe')
AND age = 21;
I would go with
CREATE INDEX customer_age_name_idx ON customer (age, name);
CREATE INDEX customer_age_lastname_idx ON customer (age, lastname);
However, depending on many factors, a single index with all three columns, with age first, may be able to deliver similar performance. The rule of thumb is to create as few indexes as possible and as many as necessary.
CREATE INDEX customer_age_lastname_name_idx ON customer (age, lastname, name);
The check on (age, name) is potentially slower in this case, but depending on the selectivity of the first column it may not matter much.
Updated SQL Fiddle.
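If in doubt, you can let Postgres show you which indexes it actually uses for the test data above. A quick sketch (the exact plan depends on your data, statistics and settings):
EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM   customer
WHERE  (name = 'john' OR lastname = 'doe')
AND    age = 21;
-- With the two 2-column indexes in place you will typically see a BitmapOr
-- combining two bitmap index scans; with only the 3-column index, expect
-- something like a scan on age = 21 with the OR condition applied as a filter.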
Why age first in the index?
This is not very important and needs deeper understanding to explain. But since you ask ...
The order of columns doesn't matter for the 2-column indexes customer_age_name_idx and customer_age_lastname_idx. Details and a test-case:
Multicolumn index and performance
I still put age first to stay consistent with the 3rd index I suggested customer_age_lastname_name_idx, where the order of columns does matter in multiple ways:
Most importantly, both your predicates (age, name) and (age, lastname) share the column age. B-tree indexes are (by far) most effective on leading columns, so putting age first benefits both.
And, less importantly, but still relevant: the size of the index is smaller this way due to data type characteristics, alignment, padding and page layout of index pages.
age is a 4-byte integer and must be aligned at multiples of 4 bytes in the data page. text is of variable length and has no alignment restrictions. Putting the integer first or last is more efficient due to the rules of "column tetris": no space is lost to additional padding, which results in a smaller index. I added another index on (lastname, age, name) (age in the middle!) to the fiddle just to demonstrate that it's ~10% bigger. And size matters.
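If you want to verify the size difference yourself rather than rely on the fiddle, a simple check (index names as created above):
SELECT relname AS index_name
     , pg_size_pretty(pg_relation_size(oid)) AS index_size
FROM   pg_class
WHERE  relname LIKE 'customer%idx'
ORDER  BY pg_relation_size(oid);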
For the same reasons it would be better to reorder columns in the demo table like this: (id, age, name, lastname). If you want to learn why, start here:
Making sense of Postgres row sizes
Calculating and saving space in PostgreSQL
Configuring PostgreSQL for read performance
Measure the size of a PostgreSQL table row
Everything I wrote is for the case at hand. If you have other queries / other requirements, the resulting strategy may change.
UNION query equivalent?
Note that a UNION query may or may not return the same result. It folds duplicate rows, which your original does not. Even if you don't have complete duplicates in your table, you may still see this effect with a subset of columns in the SELECT list. Do not blindly substitute with a UNION query. It's not going to be faster anyway.
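To illustrate the folding with a reduced SELECT list (purely illustrative):
-- Two different customers can collapse into a single output row here,
-- while the original OR query returns one row per matching customer:
SELECT age FROM customer WHERE name = 'john' AND age = 21
UNION
SELECT age FROM customer WHERE lastname = 'doe' AND age = 21;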

Turn the OR into two queries UNIONed:
SELECT * FROM Customer WHERE Age = 21 AND Name = 'John'
UNION
SELECT * FROM Customer WHERE Age = 21 AND LastName = 'Doe'
Then create an index over (Age, Name) and another over (Age, LastName).

Related

Have multiple index with same column

I have 2 queries like this
select *
from dbo.employee
where employee.name = 'lucas'
and employee.age = 36
and employee.address = 'street 6'
and a second query like this
select *
from dbo.employee
where employee.name = 'lucas'
and employee.address = 'street 6'
I created an index with multiple columns like this
CREATE NONCLUSTERED INDEX [IX_EMPLOYEE]
ON dbo.employee (name, age, address)
This index works for the first query and performance is fast, but the second query takes longer.
How can I resolve this issue?
I expected that an index on the same columns would improve the second query as well, but there is no difference; it still takes just as long.
Use such an index:
CREATE NONCLUSTERED INDEX [IX_EMPLOYEE]
ON dbo.employee (name, address, age)
Note the order of columns and the fact that name + address covers the WHERE clause of the second query (therefore will make it seekable, that is fast), and that this index is usable in the first query as well.
It would work equally well if the order of columns was (address, name, age). For those queries, pick whichever of those two columns has the greater number of unique values (check it with SELECT COUNT(DISTINCT address) FROM dbo.employee, or try to predict it if you don't have the data yet).
You may consider removing the "age" column from the index if, in the worst case, there are not many people with the same name at the same address. The query will seek to the first name + address, and then range scan through all the 'lucas'es on 'street 6' to check whether any of them matches the age. If the data model allows that, it'd be a reasonable change. However, the "age" column is probably narrow, so the savings won't be huge, in contrast to the "address" column, which contains more data (but needs to be first or second for those queries to be seekable).
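If age really is only ever checked after the seek on name + address, another variant worth testing (SQL Server 2005 and later; the index name is illustrative) keeps the key narrow and carries age as an included column:
CREATE NONCLUSTERED INDEX [IX_EMPLOYEE_NAME_ADDRESS]
ON dbo.employee (name, address)
INCLUDE (age);  -- age is stored only at the leaf level, so it does not widen the key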

SQLITE: How to make indexing work for you?

I have a sqlite db of employees with about a million entries.
company:
emp_id(primary) | first_name | last_name | company_name | job_title
The db contains only 10 distinct company names (i.e. let's say each company has about 100k employees)
I created an index on company name:
CREATE INDEX cmp_name ON company(company_name)
But I have not gained any speed when performing the query:
WITH INDEX:
select * from company INDEXED BY cmp_name where company_name = 'XYZ corp';
Time: 88.45 sec
WITHOUT INDEX:
select * from company where company_name = 'XYZ corp';
Time: 89.12 sec
What am I doing wrong?
A database is organized into pages. If more than ten rows fit into a page, then on average, reading all the "XYZ Corp" rows still requires reading most pages. Furthermore, with the index entries not having the same order as the table rows, the table's pages are no longer read in order.
The only way to speed up this query would be to use a covering index. First, reduce the number of columns read to the absolute minimum that you actually need, then add all those columns to the company name index (the INTEGER PRIMARY KEY column is implicitly part of every index):
CREATE INDEX cmp_name_and_other_stuff ON company(company_name, last_name);
SELECT emp_id, last_name FROM company WHERE company_name = 'XYZ Corp';
Doing this for every query will waste lots of storage space.
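You can confirm whether SQLite actually uses the covering index with EXPLAIN QUERY PLAN (the exact output wording varies by version):
EXPLAIN QUERY PLAN
SELECT emp_id, last_name FROM company WHERE company_name = 'XYZ Corp';
-- expect something like: SEARCH company USING COVERING INDEX cmp_name_and_other_stuff (company_name=?)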

Fetch only part of your result set at a time?

I am fetching a huge result set of about 5 million rows (with 10-15 columns) with my query. There is no ID column and one cannot even be created (not my fault), so I cannot even partition my data on the basis of ID and then load it in parts. What makes it worse is that this is SQL Server 2000, so most of the convenient SQL coding features might not even be available for this DB. Is there any way I can do something like this:
Select top 10000 column_list from myTable
then select the next top 10000 column_list from myTable (i.e. rows 10001 to 20000)
and so on...
If you have a useful index, you can grab 10000 rows at a time by tracking the value based on the index.
Suppose the useful index is LastName + FirstName
Select top 10000 column_list from MyTable
order by LastName, FirstName
Then when you get the next 10000 rows, use the query
Select top 10000 column_list from MyTable
where LastName >= PreviousLastname && FirstName > PreviousFirstname
order by LastName, FirstName
The pseudocode above assumes no duplicates on the combination; if you could have duplicates, the easiest method is to add another column (even if not indexed) that makes it unique. You would need that 3rd column in the ORDER BY clause.
PreviousLastname is the value from the 10,000th record of the previous query.
ADDED
A useful index in this context is any index that has high cardinality -- mostly distinct values, or at most a minimal number of non-distinct values. An extremely non-useful index would be something like gender (M/F/null).
Since you are using this for data loading, the index selection is not important (ignoring performance considerations) as long as it has a high cardinality. Note that the index and the ORDER BY clause must match or you will put a heavy load on your database.
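A quick way to gauge cardinality for candidate key columns (column names follow the LastName/FirstName example above and are purely illustrative):
SELECT COUNT(*)                                   AS total_rows
     , COUNT(DISTINCT LastName)                   AS distinct_lastnames
     , COUNT(DISTINCT LastName + '|' + FirstName) AS distinct_name_pairs
FROM MyTable;
-- the closer distinct_name_pairs is to total_rows, the better this approach works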
REVISION -- I saw an obvious mistake for the additional data where clause
where LastName >= PreviousLastname && FirstName > PreviousFirstname
This should have been
where (LastName > PreviousLastname)
or (LastName = PreviousLastname && FirstName > PreviousFirstname)
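Putting the correction together, each batch would look roughly like this (column_list is a placeholder, and @PrevLastName / @PrevFirstName stand for the last values returned by the previous batch):
SELECT TOP 10000 column_list
FROM MyTable
WHERE (LastName > @PrevLastName)
   OR (LastName = @PrevLastName AND FirstName > @PrevFirstName)
ORDER BY LastName, FirstName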

Maintaining logical consistency with a soft delete, whilst retaining the original information

I have a very simple table students, structure as below, where the primary key is id. This table is a stand-in for about 20 multi-million row tables that get joined together a lot.
+----+----------+------------+
| id | name     | dob        |
+----+----------+------------+
| 1  | Alice    | 01/12/1989 |
| 2  | Bob      | 04/06/1990 |
| 3  | Cuthbert | 23/01/1988 |
+----+----------+------------+
If Bob wants to change his date of birth, then I have a few options:
1. Update students with the new date of birth.
Positives: 1 DML operation; the table can always be accessed by a single primary key lookup.
Negatives: I lose the fact that Bob ever thought he was born on 04/06/1990
2. Add a column, created date default sysdate, to the table and change the primary key to id, created. Every update becomes:
insert into students(id, name, dob) values (:id, :name, :new_dob)
Then, whenever I want the most recent information do the following (Oracle but the question stands for every RDBMS):
select id, name, dob
from ( select a.*, rank() over ( partition by id
order by created desc ) as "rank"
from students a )
where "rank" = 1
Positives: I never lose any information.
Negatives: All queries over the entire database take that little bit longer. If the table were the size indicated this wouldn't matter, but once you're on your 5th left outer join, using range scans rather than unique scans begins to have an effect.
3. Add a different column, deleted date default to_date('2100/01/01','yyyy/mm/dd'), or whatever overly early or futuristic date takes my fancy. Change the primary key to id, deleted; then every update becomes:
update students x
set deleted = sysdate
where id = :id
and deleted = ( select max(deleted) from students where id = x.id );
insert into students(id, name, dob) values ( :id, :name, :new_dob );
and the query to get out the current information becomes:
select id, name, dob
from ( select a.*, rank() over ( partition by id
order by deleted desc ) as "rank"
from students a )
where "rank" = 1
Positives: I never lose any information.
Negatives: Two DML operations; I still have to use ranked queries, with the additional cost of a range scan rather than a unique index scan in every query.
4. Create a second table, say student_archive, and change every update into:
insert into student_archive select * from students where id = :id;
update students set dob = :newdob where id = :id;
Positives: Never lose any information.
Negatives: 2 DML operations; if you ever want to get all the information ever you have to use union or an extra left outer join.
5. For completeness, have a horribly de-normalised data-structure: id, name1, dob, name2, dob2... etc.
Number 1 is not an option if I never want to lose any information and always do a soft delete. Number 5 can be safely discarded as causing more trouble than it's worth.
I'm left with options 2, 3 and 4 with their attendant negative aspects. I usually end up using option 2 and the horrific 150 line (nicely-spaced) multiple sub-select joins that go along with it.
tl;dr I realise I'm skating close to the line on a "not constructive" vote here but:
What is the optimal (singular!) method of maintaining logical consistency while never deleting any data?
Is there a more efficient way than those I have documented? In this context I'll define efficient as "fewer DML operations" and / or "being able to remove the sub-queries". If you can think of a better definition when (if) answering please feel free.
I'd stick to #4 with some modifications. There is no need to delete data from the original table; it's enough to copy the old values to the archive table before updating (or before deleting) the original record. That can easily be done with a row-level trigger. Retrieving all the information is, in my opinion, not a frequent operation, and I don't see anything wrong with an extra join/union. Also, you can define a view, so all queries will be straightforward from the end user's perspective.
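A minimal sketch of that trigger-plus-view approach, assuming student_archive has the same columns as students (all names are illustrative):
create or replace trigger students_archive_trg
  before update or delete on students
  for each row
begin
  -- copy the old row before it is changed or removed
  insert into student_archive (id, name, dob)
  values (:old.id, :old.name, :old.dob);
end;
/
-- optional: a view for the rare "give me everything ever" queries
create or replace view students_all as
select id, name, dob from students
union all
select id, name, dob from student_archive;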

Query mySql with LIKE %...% and not pull false records

I have a database that contains two fields that collect multiple values. For instance, one is colors, where one row might be "red, blue, navyblue, lightblue, orange". The other field uses numbers, we'll call it colorID, where one row might be "1, 10, 23, 110, 239."
Now, let's say I want to SELECT * FROM my_table WHERE colors LIKE '%blue%'; That query will give me all the rows with "blue," but also rows with "navyblue" or "lightblue", which is not what I want. Likewise, with colorID, a query for WHERE colorID LIKE '%1%' will pull up a lot more rows than I want.
What's the correct syntax to properly query the database and only return correct results? FWIW, the fields are both set as TEXT (due to the commas). Is there a better way to store the data that would make searching easier and more accurate?
You really should look at changing your db schema. One option would be to create a table that holds colours with an INT as the primary key. You could then create a pivot table to link my_table to colours:
CREATE TABLE `colours` (
`id` INT NOT NULL ,
`colour` VARCHAR( 255 ) NOT NULL ,
PRIMARY KEY ( `id` )
) ENGINE = MYISAM
CREATE TABLE `mytable_to_colours` (
`mytable_id` INT NOT NULL ,
`colour_id` INT NOT NULL
) ENGINE = MYISAM
So your query could look like this, where '1' is the value for blue (and more likely how you would be referencing it):
SELECT *
FROM my_table
JOIN mytable_to_colours ON (my_table.id = mytable_to_colours.mytable_id)
WHERE colour_id = '1'
If you want to search in your existing table you can use the following query:
SELECT *
FROM my_table
WHERE colors LIKE 'blue,%'
OR colors LIKE '%,blue'
OR colors LIKE '%,blue,%'
OR colors = 'blue'
However, it is much better if you create tables for the colors and numbers and set up many-to-many relationships.
EDITED: Just like #seengee has written.
MySQL has a REGEXP function that will allow you to match something like "[^a-z]blue|^blue". But you should really consider not doing it this way at all. A single table containing one row for each color (with multiple rows groupable by a common ID) would be far more scalable.
The standard answer would be to normalize the data by putting a colorSelID (or whatever) in this table, then having another table with two columns, mapping from 'colorSelID' to the individual colorIDs, so your data above would turn into something like:
other columns | colorSelId
other data    | 1
Then in the colors table, you'd have:
colorSelId | ColorId
1 | 1
1 | 10
1 | 23
1 | 110
1 | 239
Then, when you want to find all the items that match colorID 10, you just search on colorID, and join that ColorSelId back to your main table to get all the items with a colorID of 10:
select *
from
main_table join color_table
on
main_table.ColorSelId=color_table.ColorSelId
where
color_table.colorId = 10
Edit: note that this will also probably speed up your searches a lot, at least assuming you index on ColorId in the color table, and ColorSelId in the main table. A search on '%x%' will (almost?) always do a full table scan, whereas this will use the index.
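The indexes mentioned would look something like this (index names are illustrative):
CREATE INDEX idx_color_table_colorid ON color_table (colorId);
CREATE INDEX idx_main_table_colorselid ON main_table (ColorSelId);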
Perhaps this will help you:
SELECT * FROM table WHERE column REGEXP "[X]"; // where X is a number. returns all rows containg X in your column
SELECT * FROM table WHERE column REGEXP "^[X]"; // where X is a number. returns all rows containg X as first number in your column
Good luck!
None of the solutions suggested so far seem likely to work, assuming I understand your question. Short of splitting the comma-delimited string into a table and joining, you can do this (using 'blue' as an example):
WHERE CONCAT(', ', myTable.ValueList, ',') LIKE '%, blue,%'
If you aren't meticulous about spaces after commas, you would need to replace spaces in ValueList with empty strings as part of this code (and remove the space in ', ').
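If the spacing is inconsistent, a sketch of the full query with REPLACE stripping the spaces first (using the colors column from the question):
SELECT *
FROM my_table
WHERE CONCAT(',', REPLACE(colors, ' ', ''), ',') LIKE '%,blue,%';
-- the same pattern works for the numeric list, e.g. ... LIKE '%,10,%' for colorID 10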