Consider two queries:
SELECT Log.Key, Time, Filter.Name, Text, Blob
FROM Log
JOIN Filter ON FilterKey = Filter.Key
WHERE FilterKey IN (1)
ORDER BY Log.Key
LIMIT #limit
OFFSET #offset
and
SELECT Log.Key, Time, Filter.Name, Text, Blob
FROM Log
JOIN Filter ON FilterKey = Filter.Key
WHERE FilterKey IN (1,2)
ORDER BY Log.Key
LIMIT #limit
OFFSET #offset
The only difference is IN (1) vs IN (1,2). The problem: the second query is ~50 times slower (on a 3 GB database it's 0.2 s vs 13.0 s)!
I know that WHERE FilterKey IN (1,2) is equivalent to WHERE FilterKey = 1 OR FilterKey = 2. It seems that only a single filter value works well with the index. Why?
How can I increase the performance of the second query (with multiple conditions)?
Structure:
CREATE TABLE Filter (Key INTEGER PRIMARY KEY AUTOINCREMENT, Name TEXT)
CREATE TABLE Log (Key INTEGER PRIMARY KEY AUTOINCREMENT, Time DATETIME, FilterKey INTEGER, Text TEXT, Blob BLOB)
CREATE INDEX FilterKeyIndex on Log(FilterKey)
The FilterKeyIndex stores not only the FilterKey values but also the rowid of the actual table to be able to find the corresponding row. The index is sorted over both columns.
In the first query, when reading all index entries whose FilterKey is 1, in order, the rowid values are also in order. That rowid is the same as Log.Key, so no further sorting is necessary.
In the second query, the Log.Key values come from two index runs, so there is no guarantee that they are sorted, and the database has to sort all result rows before it can return the first one.
To speed up the second query, you would have to read all the Log rows in the order of the Key column, i.e., scan the table without looking up any Log rows in the index. Either drop FilterKeyIndex, or use ... FROM Log NOT INDEXED JOIN ....
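For example, here is a sketch of that second option (the query is otherwise unchanged; #limit and #offset are the question's placeholders):
SELECT Log.Key, Time, Filter.Name, Text, Blob
FROM Log NOT INDEXED
JOIN Filter ON FilterKey = Filter.Key
WHERE FilterKey IN (1,2)
ORDER BY Log.Key
LIMIT #limit
OFFSET #offset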
I have a table account_config where I keep key-value configs for accounts with columns:
id - pk
account_id - fk
key
value
The table may have configs for thousands of accounts, but each account has 10-20 configs at most. I am using this query:
select id, key, value from account_config t where t.account_id = ? and t.key = ?;
I already have an index on the account_id field; do I need another index on the key field here? Will the second filter (key = ?) apply to the already-filtered result set (account_id = ?), or does it scan the whole table?
Indexes are used when only a small percentage of the table's rows are accessed and the index helps find those rows quickly.
You say there are thousands of accounts in your table, each with 10 to 20 rows.
Let's say there are 3000 accounts and 45,000 rows in your table. Then accessing data via an index on the account ID means we touch about 0.03% of the rows to find the rows in question. That makes it extremely likely that the index will be used.
Of course, if there were an index on (account_id, key), that index would be preferred, as we would only have to read the one table row the index points to.
So, yes, your index should suffice for the query shown, but if you want to get this faster, then provide the two-column index.
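For completeness, a sketch of that two-column index (the index name is just illustrative):
CREATE INDEX account_config_account_id_key ON account_config(account_id, key);
With it, both conditions of the WHERE clause are resolved inside the index, and only the single matching row has to be fetched from the table.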
Let's say we log events in a SQLite database with a Unix timestamp column ts:
CREATE TABLE data(ts INTEGER, text TEXT); -- more columns in reality
and that we want fast lookup for datetime ranges, for example:
SELECT text FROM data WHERE ts BETWEEN 1608710000 and 1608718654;
As it stands, EXPLAIN QUERY PLAN gives SCAN TABLE data, which is bad, so one obvious solution is to create an index with CREATE INDEX dt_idx ON data(ts).
Then the problem is solved, but it's a rather poor solution to have to maintain an index for an ever-increasing, already-sorted column ts on which we could do a B-tree search in O(log n) directly. Internally, the index will be:
ts rowid
1608000001 1
1608000002 2
1608000012 3
1608000077 4
which is a waste of DB space (and of CPU, since every query has to look in the index first).
To avoid this:
(1) we could use ts as the INTEGER PRIMARY KEY, so ts would be the rowid itself. But this fails because ts is not unique: two events can happen in the same second (or even in the same millisecond).
See for example the info given in SQLite Autoincrement.
(2) we could use rowid as timestamp ts concatenated with an increasing number. Example:
16087186540001
16087186540002
[--------][--]
ts increasing number
Then rowid is unique and strictly increasing (provided there are fewer than 10,000 events per second), and no index would be required. A query WHERE ts BETWEEN a AND b would simply become WHERE rowid BETWEEN a*10000 AND b*10000+9999.
But is there an easy way to ask SQLite to INSERT an item with a rowid greater than or equal to a given value? Let's say the current timestamp is 1608718654 and two events appear:
CREATE TABLE data(ts_and_incr INTEGER PRIMARY KEY AUTOINCREMENT, text TEXT);
INSERT INTO data VALUES (NEXT_UNUSED(1608718654), "hello") #16087186540001
INSERT INTO data VALUES (NEXT_UNUSED(1608718654), "hello") #16087186540002
More generally, how do you create time series optimally with SQLite, so that queries like WHERE timestamp BETWEEN a AND b are fast?
First solution
The method (2) detailed in the question seems to work well. In a benchmark, I obtained:
naive method, without index: 18 MB database, 86 ms query time
naive method, with index: 32 MB database, 12 ms query time
method (2): 18 MB database, 12 ms query time
The key point here is to use dt as an INTEGER PRIMARY KEY, so it will be the row id itself (see also Is an index needed for a primary key in SQLite?), stored in a B-tree, and there will not be another hidden rowid column. Thus we avoid an extra index maintaining a correspondence dt => rowid: here dt is the row id.
We also use AUTOINCREMENT, which internally creates a sqlite_sequence table that keeps track of the last added ID. This is useful when inserting: since two events can have the same timestamp in seconds (this would be possible even with millisecond or microsecond timestamps, since the OS could truncate the precision), we take the maximum of timestamp*10000 and last_added_ID + 1 to make sure it's unique:
MAX(?, COALESCE((SELECT seq FROM sqlite_sequence WHERE name = 'data'), 0) + 1)
(The COALESCE covers the very first insert, before sqlite_sequence has a row for the table; without it the subquery returns NULL and so does MAX.)
Code:
import sqlite3, random, time

db = sqlite3.connect('test.db')
db.execute("CREATE TABLE data(dt INTEGER PRIMARY KEY AUTOINCREMENT, label TEXT);")

t = 1600000000
for i in range(1000*1000):
    if random.randint(0, 100) == 0:  # timestamp increases by 1 second with probability 1%
        t += 1
    # COALESCE handles the very first insert, before sqlite_sequence has a row for 'data'
    db.execute("INSERT INTO data(dt, label) VALUES (MAX(?, COALESCE((SELECT seq FROM sqlite_sequence WHERE name = 'data'), 0) + 1), 'hello');", (t*10000,))
db.commit()

# t will range over a ~10,000 second window
t1, t2 = 1600005000*10000, 1600005100*10000  # time range of width 100 seconds (i.e. 1%)
start = time.time()
for _ in db.execute("SELECT 1 FROM data WHERE dt BETWEEN ? AND ?", (t1, t2)):
    pass
print(time.time() - start)
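To verify that the range query really runs on the rowid B-tree rather than on a separate index, we can check the plan (a sanity check in the sqlite3 shell; the exact wording varies between SQLite versions, but it should report a SEARCH on the integer primary key, not a SCAN):
EXPLAIN QUERY PLAN
SELECT 1 FROM data WHERE dt BETWEEN 16000050000000 AND 16000051000000;
-- e.g. SEARCH data USING INTEGER PRIMARY KEY (rowid>? AND rowid<?)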
Using a WITHOUT ROWID table
Here is another method, using WITHOUT ROWID, which gives an 8 ms query time. We have to implement an auto-incrementing id ourselves, since AUTOINCREMENT is not available for WITHOUT ROWID tables.
WITHOUT ROWID is useful when we want to use a PRIMARY KEY(dt, another_column1, another_column2, id) and avoid having an extra rowid column. Instead of one B-tree for rowid and another for (dt, another_column1, ...), we have just one.
db.executescript("""
CREATE TABLE autoinc(num INTEGER); INSERT INTO autoinc(num) VALUES(0);
CREATE TABLE data(dt INTEGER, id INTEGER, label TEXT, PRIMARY KEY(dt, id)) WITHOUT ROWID;
CREATE TRIGGER insert_trigger BEFORE INSERT ON data BEGIN UPDATE autoinc SET num=num+1; END;
""")
t = 1600000000
for i in range(1000*1000):
    if random.randint(0, 100) == 0:  # timestamp increases by 1 second with probability 1%
        t += 1
    # the BEFORE INSERT trigger bumps autoinc.num, so each insert reads a fresh counter value
    db.execute("INSERT INTO data(dt, id, label) VALUES (?, (SELECT num FROM autoinc), ?);", (t, 'hello'))
db.commit()

# t will range over a ~10,000 second window
t1, t2 = 1600005000, 1600005100  # time range of width 100 seconds (i.e. 1%)
start = time.time()
for _ in db.execute("SELECT 1 FROM data WHERE dt BETWEEN ? AND ?", (t1, t2)):
    pass
print(time.time() - start)
Roughly-sorted UUID
More generally, the problem is linked to having IDs that are "roughly-sorted" by datetime. More about this:
ULID (Universally Unique Lexicographically Sortable Identifier)
Snowflake
MongoDB ObjectId
All these methods use an ID which is:
[---- timestamp ----][---- random and/or incremental ----]
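Method (2) above follows exactly this layout. As a small sketch, assuming the first method's table (where dt = timestamp*10000 + counter), the two parts can be recovered with integer arithmetic:
SELECT dt / 10000 AS unix_ts, dt % 10000 AS counter
FROM data
LIMIT 3;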
I am not an expert in SQLite, but I have worked with databases and time series. I had a similar situation previously, and I will share my conceptual solution.
You already have part of the answer in your question, but not the way of implementing it.
The way I did it: create 2 tables. One table (main_logs) logs the time, in one-second increments, as an integer date used as the primary key; the other table (main_sub_logs) contains all logs made within that particular second, which in your case can be up to 10,000 logs per second. main_sub_logs references main_logs, and for each second it holds that second's logs with their own counter id, which starts over for every second.
This way you limit your time-series lookups to one-second event windows instead of searching all logs in one place.
You can join the two tables, and when you look up a range between two specific times in the first table, you get all logs in between.
So here is how I created my 2 tables:
CREATE TABLE IF NOT EXISTS main_logs (
    id INTEGER PRIMARY KEY
);
CREATE TABLE IF NOT EXISTS main_sub_logs (
    id INTEGER,
    ref INTEGER,
    log_counter INTEGER,
    log_text TEXT,
    PRIMARY KEY (id),
    FOREIGN KEY (ref) REFERENCES main_logs(id)
);
I inserted some dummy data.
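For illustration (these rows are made up), the dummy data could look like this:
INSERT INTO main_logs (id) VALUES (1608718655), (1608718656);
INSERT INTO main_sub_logs (id, ref, log_counter, log_text) VALUES
    (1, 1608718655, 1, 'first event'),
    (2, 1608718655, 2, 'second event'),
    (3, 1608718656, 1, 'third event');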
Now let's query all logs between 1608718655 and 1608718656:
SELECT * FROM main_logs AS A
JOIN main_sub_logs AS B ON A.id = B.ref
WHERE A.id >= 1608718655 AND A.id <= 1608718656
This returns every log row belonging to those two seconds.
I'm using PostgreSQL (CockroachDB) and I want to select a specific row. For example, there are thousands of records and I want to select row number 999.
In this case we would use LIMIT and OFFSET: SELECT * FROM table LIMIT 1 OFFSET 998;
However, using LIMIT and OFFSET can cause performance issues according to this post. So I'm wondering if there is a way to get a specific row without a full table scan.
I feel like it should be possible, because the database seems to sort data by primary key: when I do SELECT * FROM table; it always shows a sorted result. Since it is sorted by primary key, the database can use the index to access a specific row, right?
If you select rows based on the primary key (e.g. SELECT * FROM table WHERE <primary key> = <value>), no scans will be needed under the hood. The same is also true if you define a secondary index on the table and apply a WHERE clause that filters based on the column(s) in the secondary index.
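As a sketch (table and column names are hypothetical): if the primary-key values happen to be gapless, row number 999 is simply the row with id = 999; otherwise, keyset pagination steps through rows without OFFSET. Either way the engine can seek the index instead of scanning:
SELECT * FROM t WHERE id = 999;  -- direct primary-key lookup
-- keyset pagination: remember the last id you saw (here 998), then:
SELECT * FROM t WHERE id > 998 ORDER BY id LIMIT 1;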
Consider a query on a large table like:
select something from sometable limit somecount;
I know the LIMIT clause is useful to avoid getting too many rows back from a query.
But what about using it when the query matches only a few rows in a large table?
For example, there is a table created like this:
CREATE TABLE if not exists users (
    id integer primary key autoincrement,
    name varchar(80) unique not null,
    password varchar(20) not null,
    role integer default 1, -- 0 -> super admin; 1 -> user
    banned integer default 0
);
Case 1: I want to get the user where id=100. Here id is the primary key, so it can return 1 row at most. Which of the 2 statements below is faster?
select * from users where id=100;
select * from users where id=100 limit 1;
Case 2: I want to get the user where name='jhon'. Here name is unique, so it can also return 1 row at most. Which of the 2 statements below is faster?
select * from users where name='jhon';
select * from users where name='jhon' limit 1;
Case 3: I want to get the users where role=0. Here role is neither the primary key nor unique, but I know there are only 10 such rows at most. Which of the 2 statements below is faster?
select * from users where role=0;
select * from users where role=0 limit 10;
If you care about performance, then add indexes to handle all three queries. This requires an additional index on users(role). The id column already has an index as the primary key; name has an index because it is declared unique.
For the first two cases, the limit shouldn't make a difference. Without limit, the engine finds the matching row in the index and returns it. If the engine doesn't use the "unique" information, then it might need to peek at the next value in the index, just to see if it is the same.
The third case, with no index, is a bit different. Without an index, the engine will want to scan all the rows to find all matches. With an index, it can find all the matching rows there. Add a limit to that, and it will just stop at the first one.
The appropriate indexes will be a bigger boost to performance than using limit, on average.
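A sketch of that additional index (the name is just illustrative), which turns case 3 into an index lookup as well:
create index if not exists idx_users_role on users(role);
-- EXPLAIN QUERY PLAN select * from users where role=0;
-- should now report a SEARCH using idx_users_role instead of a SCAN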
I have a table named Workflow. There are 38M rows in the table. There is a PK on the following columns:
ID: Identity Int
ReadTime: dateTime
If I perform the following query, the PK is not used. The query plan shows an index scan being performed on one of the nonclustered indexes plus a sort. It takes a very long time with 38M rows.
Select TOP 100 ID From Workflow
Where ID > 1000
Order By ID
However, if I perform this query, a nonclustered index (on LastModifiedTime) is used. The query plan shows an index seek being performed. The query is very fast.
Select TOP 100 * From Workflow
Where LastModifiedTime > '6/12/2010'
Order By LastModifiedTime
So, my question is this: why isn't the PK used in the first query, while the nonclustered index is used in the second?
Without being able to fish around in your database, there are a few things that come to my mind.
Are you certain that the PK is (id, ReadTime) as opposed to (ReadTime, id)?
What execution plan does SELECT MAX(id) FROM WorkFlow yield?
What about if you create an index on (id, ReadTime) and then retry the test, or your query?
Since Id is an identity column, having ReadTime participate in the index is superfluous; the clustered key already points to the leaf data. I recommend you modify your indexes:
CREATE TABLE Workflow
(
    Id int IDENTITY,
    ReadTime datetime,
    -- ... other columns,
    CONSTRAINT PK_Workflow
        PRIMARY KEY CLUSTERED (Id)
)

CREATE INDEX idx_LastModifiedTime
    ON Workflow (LastModifiedTime)
Also, check that statistics are up to date.
Finally, if there are 38 million rows in this table, the optimizer may conclude that the criterion > 1000 on a unique column is not selective, because > 99.997% of the Ids are > 1000 (if your identity seed started at 1). For an index to be considered helpful, the optimizer must conclude that < 5% of the records would be selected. You can use an index hint to force the issue (as already stated by Dan Andrews). What is the structure of the non-clustered index that was scanned?
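If you do want to force the clustered primary key, an index hint is one option (a sketch assuming the PK index is named PK_Workflow, as in the DDL above):
SELECT TOP 100 ID
FROM Workflow WITH (INDEX(PK_Workflow))
WHERE ID > 1000
ORDER BY ID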