Avoid duplicate data in SQLite3 with a covering index - optimization

In our company we have a rather big SQLite3 database with, let's say, some points of interest (POI). The database is created once, and used in read-only mode in a mobile user application.
POI have names that can contain several words and letters with diacritics. To perform a quick search of POI in the application, there is an additional table with single uppercase ASCII words and the corresponding ID in the main table. And there is a covering index. The database looks like this (simplified):
CREATE TABLE poi(id INTEGER PRIMARY KEY, name TEXT, attributes TEXT);
CREATE TABLE poi_search (word TEXT, poi_id INTEGER);
CREATE INDEX poi_search_idx ON poi_search(word, poi_id);
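For illustration, here are hypothetical sample rows (not part of the original schema dump): a POI named "Café Foo Bar" gets one poi_search row per word of its name, uppercased and with diacritics folded to plain ASCII:
INSERT INTO poi(id, name, attributes) VALUES (42, 'Café Foo Bar', '...');
INSERT INTO poi_search(word, poi_id) VALUES ('CAFE', 42), ('FOO', 42), ('BAR', 42);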
Then you can query for POI whose name contains a word starting with "FOO" with a request like this:
SELECT * from poi INNER JOIN poi_search ON poi.id=poi_search.poi_id
WHERE poi_search.word >= 'FOO' AND poi_search.word < 'FOP';
The query is very quick and uses a covering index, so it doesn't need to access the poi_search table at all:
sqlite> EXPLAIN QUERY PLAN SELECT * from poi INNER JOIN poi_search ON poi.id=poi_search.poi_id WHERE poi_search.word >= 'FOO' AND poi_search.word < 'FOP';
0|0|1|SEARCH TABLE poi_search USING COVERING INDEX poi_search_idx (word>? AND word<?)
0|1|0|SEARCH TABLE poi USING INTEGER PRIMARY KEY (rowid=?)
I just realized that this is a big waste of space, since the covering index duplicates all the data of the poi_search table. In the application, the poi_search table itself is in fact never accessed.
Is there a way, even a tricky one, to remove or truncate the poi_search table while keeping all the data in the covering index? I know that such a database would be in an incoherent state, so there is probably no way to do such a hack with the official API.
I don't mind using a hacked version of SQLite3 to produce the database, but the DB has to return correct results for the given query in a vanilla SQLite3 client.

There is no tricky way, or a hack, to do what you want.
You'll have to make do with the documented way, which is guaranteed to keep the database consistent:
CREATE TABLE poi_search (
word TEXT,
poi_id INTEGER,
PRIMARY KEY (word, poi_id)
) WITHOUT ROWID;
-- no other index needed
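For illustration, the same prefix search against the WITHOUT ROWID table; such a table is stored as a single B-tree keyed on (word, poi_id), so the table itself now plays the role of the old covering index:
SELECT * from poi INNER JOIN poi_search ON poi.id=poi_search.poi_id
WHERE poi_search.word >= 'FOO' AND poi_search.word < 'FOP';
-- EXPLAIN QUERY PLAN should report a SEARCH of poi_search using its PRIMARY KEY
-- (word>? AND word<?), with no separate index and no duplicated data.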

Related

Do I have a redundant index in this SQLite schema?

I created an SQLite schema as follows:
CREATE TABLE tab1 (
year INTEGER,
tar_id TEXT,
content BLOB,
UNIQUE (year, tar_id) ON CONFLICT REPLACE);
CREATE INDEX tab1_ix1 ON tab1 (year, tar_id);
Then I looked at a query plan:
sqlite> explain query plan select * from tab1 where tar_id = 1 and year = (select max(year) from tab1 where year < 2019 and tar_id = 1);
QUERY PLAN
|--SEARCH TABLE tab1 USING COVERING INDEX sqlite_autoindex_tab1_1 (year=? AND tar_id=?)
`--SCALAR SUBQUERY
   `--SEARCH TABLE tab1 USING COVERING INDEX tab1_ix1 (year<?)
It seems to me that only one index would be sufficient to do this, but it uses both my explicit tab1_ix1 and the automatically generated sqlite_autoindex_tab1_1.
Is one of them redundant? If so, how do I get rid of one of them and get the same behaviour?
Yes, you have a redundant index. A unique constraint generates an index automatically, so you do not need to explicitly create another index on the same columns in the same order.
Note that an index on (tar_id, year) would be a different index, because the ordering of keys in the index matters.
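One option, assuming you keep the UNIQUE constraint as declared above, is simply to drop the explicit index:
DROP INDEX tab1_ix1;
-- sqlite_autoindex_tab1_1, created automatically for UNIQUE (year, tar_id),
-- still covers (year, tar_id) and can serve both parts of the query.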
Although both seem redundant, there is a detail worth noting. In the first part of the query, * is selected, so the automatically created index is the better candidate because it leads to all the other columns that have to be retrieved. In the second part, with the MAX() subquery, only the two columns (year, tar_id) are needed, and they are both present, in the needed order, in the smaller, manually created index tab1_ix1; the engine considers the smaller index more efficient, so it uses tab1_ix1 there.
So the automatically created index alone would be used by the engine if a slight degradation in performance is acceptable, and the second, smaller tab1_ix1 mostly looks like a maintenance burden.

SQL index for date range query

For a few days, I've been struggling to improve the performance of my database, and there are some issues that I'm still kind of confused about regarding indexing in a SQL Server database.
I'll try to be as informative as I can.
My database currently contains about 100k rows and will keep growing, therefore I'm trying to find a way to make it work faster.
I'm also writing to this table, so if your suggestion will drastically affect the writing time, please let me know.
The overall goal is to select all rows with a specific name that fall within a date range.
That will usually mean selecting over 3,000 rows out of a lot more.
Table schema:
CREATE TABLE [dbo].[reports]
(
[id] [int] IDENTITY(1,1) NOT NULL,
[IsDuplicate] [bit] NOT NULL,
[IsNotValid] [bit] NOT NULL,
[Time] [datetime] NOT NULL,
[ShortDate] [date] NOT NULL,
[Source] [nvarchar](350) NULL,
[Email] [nvarchar](350) NULL,
CONSTRAINT [PK_dbo.reports]
PRIMARY KEY CLUSTERED ([id] ASC)
) ON [PRIMARY]
This is the SQL query I'm using:
SELECT *
FROM [db].[dbo].[reports]
WHERE Source = 'name1'
AND ShortDate BETWEEN '2017-10-13' AND '2017-10-15'
As I understand it, my best approach to improve efficiency without hurting the writing time too much would be to create a nonclustered index on Source and ShortDate.
Which I did like such, index schema:
CREATE NONCLUSTERED INDEX [Source&Time]
ON [dbo].[reports]([Source] ASC, [ShortDate] ASC)
Now we get to the tricky part which got me completely lost: the index above sometimes works, sometimes half works, and sometimes doesn't work at all...
(Not sure if it matters, but currently 90% of the rows have the same Source, although this won't stay like that for long.)
With the query below, the index isn't used at all. I'm using SQL Server 2014, and the execution plan shows only a clustered index scan:
SELECT *
FROM [db].[dbo].[reports]
WHERE Source = 'name1'
AND ShortDate BETWEEN '2017-10-10' AND '2017-10-15'
With this query the index isn't used at all either, although SQL Server suggests creating an index with the date first and the source second... I read that the index should be made in the order the query filters? It also says to INCLUDE all the columns I'm selecting - is that a must?... Again, I read that I should only put in the index the columns I'm searching on.
SELECT *
FROM [db].[dbo].[reports]
WHERE Source = 'name1'
AND ShortDate = '2017-10-13'
SQL Server index suggestion -
/* The Query Processor estimates that implementing the following
index could improve the query cost by 86.2728%. */
/*
USE [db]
GO
CREATE NONCLUSTERED INDEX [<Name of Missing Index, sysname,>]
ON [dbo].[reports] ([ShortDate], [Source])
INCLUDE ([id], [IsDuplicate], [IsNotValid], [Time], [Email])
GO
*/
Now I tried the index SQL Server suggested, and it works; the plan seems to rely entirely on the nonclustered index for both of the queries above.
I tried the same index but without the included columns and it isn't used... it seems I must include all the columns I'm selecting in the index?
By the way, the index I made also works if I include all the columns.
To summarize: it seems the column order of the index didn't matter, as it worked both with Source + ShortDate and with ShortDate + Source.
But for some reason it's a must to include all the columns... (won't that drastically affect writes to this table?)
Thanks a lot for reading. My goal is to understand why this happens and what I should do differently (not just the solution, as I'll need to apply it to other projects as well).
Cheers :)
Indexing in SQL Server is part know-how from long experience (and many hours of frustration), and part black magic. Don't beat yourself up over that too much - that's what a place like SO is ideal for - lots of brains, lots of experience from many hours of optimizing, that you can tap into.
I read that the index should be made by the order the query is?
If you read this - it is absolutely NOT TRUE - the order of the columns is relevant - but in a different way: a compound index (made up from multiple columns) will only ever be considered if you specify the n left-most columns in the index definition in your query.
Classic example: a phone book with an index on (city, lastname, firstname). Such an index might be used:
in a query that specifies all three columns in its WHERE clause
in a query that uses city and lastname (find all "Miller" in "Detroit")
or in a query that only filters by city
but it can NEVER EVER be used if you want to search only for firstname ..... that's the trick about compound indexes you need to be aware of. But if you always use all columns from an index, their ordering is typically not really relevant - the query optimizer will handle this for you.
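To make the left-most-columns rule concrete, here is a sketch of the phone book example with hypothetical table and index names:
CREATE TABLE phonebook (city NVARCHAR(100), lastname NVARCHAR(100), firstname NVARCHAR(100));
CREATE INDEX IX_phonebook ON phonebook (city, lastname, firstname);
-- can seek on the index: the left-most column(s) are specified
SELECT * FROM phonebook WHERE city = 'Detroit' AND lastname = 'Miller';
SELECT * FROM phonebook WHERE city = 'Detroit';
-- cannot seek on the index: firstname is not a left-most column
SELECT * FROM phonebook WHERE firstname = 'John';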
As for the included columns - those are stored only in the leaf level of the nonclustered index - they are NOT part of the search structure of the index, and you cannot specify filter values for those included columns in your WHERE clause.
The main benefit of these included columns is this: if you search in a nonclustered index, and in the end, you actually find the value you're looking for - what do you have available at that point? The nonclustered index will store the columns in the non-clustered index definition (ShortDate and Source), and it will store the clustering key (if you have one - and you should!) - but nothing else.
So in this case, once a match is found, and your query wants everything from that table, SQL Server has to do what is called a Key lookup (often also referred to as a bookmark lookup) in which it takes the clustered key and then does a Seek operation against the clustered index, to get to the actual data page that contains all the values you're looking for.
If you have included columns in your index, then the leaf level page of your non-clustered index contains
the columns as defined in the nonclustered index
the clustering key column(s)
all those additional columns as defined in your INCLUDE statement
If those columns "cover" your query, i.e. provide all the values that your query needs, then SQL Server is done once it finds the value you searched for in the nonclustered index - it can take all the values it needs from that leaf-level page of the nonclustered index, and it does NOT need to do another (expensive) key lookup into the clustered index to get the actual values.
Because of this, always explicitly specifying only the columns you really need in your SELECT can be beneficial - in that case you might be able to create an efficient covering index that provides all the values for your SELECT; always using SELECT * makes that really hard or next to impossible.
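As a sketch of what that could look like for the question's table - assuming, purely for the example, that the application only needs Time and Email besides the filter columns (adjust the column lists to what you actually use):
CREATE NONCLUSTERED INDEX IX_reports_Source_ShortDate
ON [dbo].[reports] ([Source], [ShortDate])
INCLUDE ([Time], [Email]);
-- a matching query that lists only the covered columns instead of SELECT *
SELECT [Source], [ShortDate], [Time], [Email]
FROM [dbo].[reports]
WHERE [Source] = 'name1'
AND [ShortDate] BETWEEN '2017-10-13' AND '2017-10-15';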
In general, you want the index to be from most selective (i.e. filtering out the most possible records) to least selective; if a column has low cardinality, the query optimizer may ignore it.
That makes intuitive sense - if you have a phone book and you're looking for people called "Smith" with the initial "A", you want to search for "Smith" first and then the "A"s, rather than finding all people whose initial is "A" and then filtering for those called "Smith". After all, the odds are that one in 26 people has the initial "A".
So, in your example, I guess you have a wide range of values in ShortDate - so that's the first column the query optimizer tries to filter on. You say you have few different values in Source, so the query optimizer may decide to ignore it; in that case, the second column of that index is of no use either.
The order of the predicates in the WHERE clause is irrelevant - you can swap them around and get exactly the same results, so the query optimizer ignores it.
EDIT:
So, yes, make the index. Imagine you have a pile of cards to sort - on your first pass, you want to eliminate as many cards as possible. Assuming the values are evenly spread: if you have 1,000 distinct ShortDate values over a million rows, filtering on ShortDate first leaves you with about 1,000 rows; filtering on Source first (with far fewer distinct values) could leave you with 100,000 rows.
The included columns of an index are for the columns you are selecting.
Because you do SELECT * (which isn't good practice), the index won't be used, because SQL Server would have to look up the whole row in the table to get the values of the remaining columns.
For your scenario, I would drop the default clustered index (if there is one) and create a new clustered index with the following statement:
USE [db]
GO
CREATE CLUSTERED INDEX CIX_reports
ON [dbo].[reports] ([ShortDate],[Source])
GO

Is it advised to index the field if I envision retrieving all records corresponding to positive values in that field?

I have a table with definition somewhat like the following:
create table offset_table (
id serial primary key,
"offset" numeric NOT NULL, -- "offset" is a reserved word in PostgreSQL, so it has to be quoted
... other fields...
);
The table has about 70 million rows in it.
I envision doing the following query many times
select * from offset_table where "offset" > 0;
For speed reasons, I am wondering whether it would be advisable to create an index like:
create index on offset_table("offset");
I am trying to avoid creating unnecessary indexes on this table as it is pretty big already.
As you mentioned in the comments, roughly 70% of the rows match the offset > 0 predicate.
In that case the index would not be beneficial, since PostgreSQL (and basically every other DBMS) will prefer a full table scan instead. That is because a sequential scan is faster than constantly jumping back and forth between reading the index and reading random table pages.
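A quick way to see what the planner actually chooses is to create the index and compare EXPLAIN output; this is only a sketch against the question's table, with a hypothetical index name:
CREATE INDEX offset_table_offset_idx ON offset_table ("offset");
EXPLAIN SELECT * FROM offset_table WHERE "offset" > 0;
-- with ~70% of the rows matching, the plan will typically still show
-- "Seq Scan on offset_table" rather than an index or bitmap scan
DROP INDEX offset_table_offset_idx; -- drop it again if it turns out to be unused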

What key columns to use on filtered index with covering WHERE clause?

I'm creating a filtered index such that the WHERE filter includes the complete query criteria. With such an index, it seems that a key column would be unnecessary, though SQL Server requires me to add one. For example, consider the table:
CREATE TABLE Invoice
(
Id INT NOT NULL IDENTITY PRIMARY KEY,
Data VARCHAR(MAX) NOT NULL,
IsProcessed BIT NOT NULL DEFAULT 0,
IsInvalidated BIT NOT NULL DEFAULT 0
)
Queries on the table look for new invoices to process, i.e.:
SELECT *
FROM Invoice
WHERE IsProcessed = 0 AND IsInvalidated = 0
So, I can tune for these queries with a filtered index:
CREATE INDEX IX_Invoice_IsProcessed_IsInvalidated
ON Invoice (IsProcessed)
WHERE (IsProcessed = 0 AND IsInvalidated = 0)
GO
My question: What should the key column(s) for IX_Invoice_IsProcessed_IsInvalidated be? Presumably the key column isn't being used. My intuition leads me to pick a column that is small and will keep the index structure relatively flat. Should I pick the table primary key (Id)? One of the filter columns, or both of them?
Because you have a clustered index on that table it doesn't really matter what you put in the key columns of that index; meaning Id is there free of charge. The only thing you can do is include everything in the included section of the index to actually have data handy at the leaf level of the index to exclude key lookups to the table. Or, if the queue is huge, then, perhaps, some other column would be useful in the key section.
Now, if that table didn't have a primary key then you would have to include or specify as key columns all the columns that you need for joining or other purposes. Otherwise, RID lookups on heap would occur because on the leaf level of indexes you would have references to data pages.
What percentage of the table does this filtered index cover? If it's small, you may want to cover the entire table to handle the "SELECT *" from the index without hitting the table. If it's a large portion of the table though this would not be optimal. Then I'd recommend using the clustered index or primary key. I'd have to research more because I forget which is optimal right now but if they're the same you should be set.
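A quick way to check that percentage, as a sketch against the question's table:
SELECT
COUNT(*) AS total_rows,
SUM(CASE WHEN IsProcessed = 0 AND IsInvalidated = 0 THEN 1 ELSE 0 END) AS unprocessed_rows
FROM Invoice;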
I suggest you declare it as follows
CREATE INDEX IX_Invoice_IsProcessed_IsInvalidated
ON Invoice (Id)
INCLUDE (Data)
WHERE (IsProcessed = 0 AND IsInvalidated = 0)
The INCLUDE clause means that the values of the Data column will be stored as part of the index.
If you didn't have an INCLUDE clause then the query plan for
SELECT Id, Data
FROM Invoice
WHERE IsProcessed = 0 AND IsInvalidated = 0
would involve a two-step process:
use the index to find the list of primary key values that match the criteria
get the data from the table that matches those primary keys
If, on the other hand, the index includes the [Data] column, then it will properly cover the query, as there will be no need to look up the data using the primary keys.
You don't get something for nothing though
The downside is that you will be storing the varchar(MAX) data twice for these records, so more data has to be written to the database and more storage is used - although this isn't much of a problem if you're only talking about a small slice of the data.
As always the more time and effort you put into putting things away carefully the faster and easier it is to get them back.

Is it really worth it to normalize the "Toxi" way? (3NF)

I'm in the early stages of my database design, so nothing is final yet. I'm using the "Toxi" 3-table design for my threads, which have optional tags, but I can't help feeling that the joins aren't really necessary and that perhaps I should just rely on a simple tags column in my posts table, storing a varchar of something like <tag>, <secondTag>.
So to recap:
is it worth the trouble of the extra left joins on the two tag tables, instead of just having a tags column in my posts table?
is there a way I can optimize my query?
Schema
CREATE TABLE `posts` (
`post_id` INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
`post_name` VARCHAR(255)
) Engine=InnoDB;
CREATE TABLE `post_tags` (
`tag_id` INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
`tag_name` VARCHAR(255)
) Engine=InnoDB;
CREATE TABLE `post_tags_map` (
`map_id` INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
`post_id` INT UNSIGNED NOT NULL,
`tags_id` INT UNSIGNED NOT NULL,
FOREIGN KEY (`post_id`) REFERENCES `posts` (`post_id`),
FOREIGN KEY (`tags_id`) REFERENCES `post_tags` (`tag_id`)
) Engine=InnoDB;
Sample Data
INSERT INTO `posts` (`post_id`, `post_name`)
VALUES
(1, 'test');
INSERT INTO `post_tags` (`tag_id`, `tag_name`)
VALUES
(1, 'mma'),
(2, 'ufc');
INSERT INTO `post_tags_map` (`map_id`, `post_id`, `tags_id`)
VALUES
(1, 1, 1),
(2, 1, 2);
Current query
SELECT
posts.*,
GROUP_CONCAT( post_tags.tag_name order by post_tags.tag_name ) AS tags
FROM posts
LEFT JOIN post_tags_map
ON post_tags_map.post_id = posts.post_id
LEFT JOIN post_tags
ON post_tags_map.tags_id = post_tags.tag_id
WHERE posts.post_id = 1
GROUP BY posts.post_id
Result
If there are tags:
post_id  post_name  tags
1        test       mma, ufc
Having all tags in different records (normalized) means that you'll be able to rename the tags more easily should the need arise and track the tag name history.
Stack Overflow, for instance, has renamed its SQL Server related tags at least three times (mssql -> sqlserver -> sql-server).
Having all tags in one record (denormalized) means that you can index this column with a FULLTEXT index and search for posts having two or more tags at once:
SELECT *
FROM posts
WHERE MATCH(tags) AGAINST('+mma +ufc')
which is possible with the normalized design too, but less efficient.
(Don't forget to adjust ft_min_word_len so that tags of 3 characters or fewer get indexed, for this to work.)
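As a sketch, these are the relevant server settings in my.cnf/my.ini (a server restart and a rebuild of the FULLTEXT index are needed for them to take effect); ft_min_word_len applies to MyISAM full-text indexes, and innodb_ft_min_token_size is the InnoDB equivalent:
[mysqld]
ft_min_word_len = 2
innodb_ft_min_token_size = 2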
You can combine both designs: store both the map table and the denormalized column. This will require more maintenance, though.
You can also store the normalized design in your database and use the query you provided to feed the tags to Sphinx or Lucene.
This way, you can do history digging with MySQL, fulltext tag searches using Sphinx, and no extra maintenance will be required.
If you use the VARCHAR hack, it will be nearly impossible for you to query the data. It will be hell to write a query that accurately and efficiently shows all posts with a given tag (and let's face it, that's a pretty big aspect of a tagging system). The accuracy part is hard because you need to consider all the possible positions of the commas; the efficiency part is hard because searching inside a string is much, much slower than comparing the full value of a field (more so if you could use an integer).
So yes, it is most certainly worth it.
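For comparison, here is a sketch of the "all posts with a given tag" query against the normalized schema above; it is a plain join that can use the indexes on the map table, whereas the varchar hack forces a LIKE '%...%' scan:
SELECT posts.*
FROM posts
INNER JOIN post_tags_map ON post_tags_map.post_id = posts.post_id
INNER JOIN post_tags ON post_tags.tag_id = post_tags_map.tags_id
WHERE post_tags.tag_name = 'mma';
-- an index on post_tags.tag_name would make the tag lookup itself indexed as well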
As far as making your query faster is concerned - make sure you have the relevant indexes on your tables. Run an EXPLAIN on the query to see where any bottleneck is placed. I don't think it would be better to fetch the tags for each post as you process it, but it might be - I'm not sure how efficient MySQL really is at string manipulation, which is what it's doing when you do the GROUP_CONCAT.
Your query for a tag would be very slow if you had a varchar with a list of tags in it. You would be doing something along the lines of WHERE post.tags LIKE '%mytag%', which would not perform anywhere near as well as searching on an indexed key.
[edit]
This study shows the performance of various ways of implementing tagging systems (including FULLTEXT indexes) and suggests when you might want to use each one.
Joining (when you have the correct indexes) is generally much faster than trying to pull data out of the middle of a comma-delimited string in a field, even using full-text search. Or you could go with a bunch of separate tag columns (Tag1, Tag2, Tag3), but querying would still be harder (search five fields to find whether you have used that tag) and you would need to add a new column every time you run out of columns for a new tag. The normalized database design is the most performant way to go. Databases are designed to use joins; why you wouldn't want to use them is beyond me.