SQL index column whose values always run from 1 to N

I think my question is very simple, but every web search shows me results about SQL indexing.
I use the following SQL query to create a simple table:
CREATE TABLE SpeechOutputList
(
    ID int NOT NULL IDENTITY(1,1),
    SpeechConfigCode nvarchar(36) NOT NULL,
    OutputSentence nvarchar(500),
    IsPrimaryOutput bit DEFAULT 0,
    PRIMARY KEY(ID),
    FOREIGN KEY(SpeechConfigCode)
        REFERENCES SpeechConfig
        ON UPDATE CASCADE ON DELETE CASCADE
);
I would like to add an index column that increases automatically (not IDENTITY(1,1)) and always holds values from 1 to N (according to the number of rows).
IDENTITY(1,1) will not do, since in many cases the numbers are not contiguous from 1 to N; it is intended for the primary key.
Thanks

Trying to keep such an index field sequential and without gaps will not be efficient. If, for instance, a record is removed, you would need a trigger that renumbers the records that follow. This will not only take extra time, it will also reduce concurrency.
Furthermore, that index would not be a stable key for a record. If a client reads the index value of a record and later tries to locate that record by it, it might well get a different record back.
If you still believe such an index is useful, I would suggest creating a view that adds this index on-the-fly:
CREATE VIEW SpeechOutputListEx AS
    SELECT ID, SpeechConfigCode, OutputSentence, IsPrimaryOutput,
           ROW_NUMBER() OVER (ORDER BY ID ASC) AS idx
    FROM SpeechOutputList
This will make it possible to do selections, like:
SELECT * FROM SpeechOutputListEx WHERE idx = 5
To make an update with a condition on the index, you would join with the view:
UPDATE s
SET OutputSentence = 'sentence'
FROM SpeechOutputList s
INNER JOIN SpeechOutputListEx se
ON s.ID = se.ID
WHERE idx = 5
The issue of primary sentences:
You explained in comments that the order should indicate whether a sentence is primary.
For that purpose you don't need the view. You could add a column idx that allows gaps. Then just let the user determine the value of the idx column; even a negative value would not be an issue. You would select in order of idx and so get the primary sentence first.
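A sketch of adding such a column (T-SQL; the type and default are illustrative):

ALTER TABLE SpeechOutputList ADD idx int NOT NULL DEFAULT 0;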
If a sentence has to be made primary, you could issue this update:
update SpeechOutputList
set idx = (select min(idx) - 1 from SpeechOutputList)
where id = 123
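Selecting in idx order then returns the primary sentence first; a minimal sketch:

select top 1 OutputSentence
from SpeechOutputList
order by idx asc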

Related

How to use time-series with Sqlite, with fast time-range queries?

Let's say we log events in a Sqlite database with Unix timestamp column ts:
CREATE TABLE data(ts INTEGER, text TEXT); -- more columns in reality
and that we want fast lookup for datetime ranges, for example:
SELECT text FROM data WHERE ts BETWEEN 1608710000 and 1608718654;
As it stands, EXPLAIN QUERY PLAN gives SCAN TABLE data, which is bad, so one obvious solution is to create an index with CREATE INDEX dt_idx ON data(ts).
Then the problem is solved, but it's a rather poor solution to have to maintain an index for an already-increasing / already-sorted column ts on which we could do a B-tree search in O(log n) directly. Internally, the index will look like this:
ts          rowid
1608000001  1
1608000002  2
1608000012  3
1608000077  4
which is a waste of DB space (and CPU when a query has to look in the index first).
To avoid this:
(1) we could use ts as INTEGER PRIMARY KEY, so ts would be the rowid itself. But this fails because ts is not unique: 2 events can happen at the same second (or even at the same millisecond).
See for example the info given in SQLite Autoincrement.
(2) we could use rowid as timestamp ts concatenated with an increasing number. Example:
16087186540001
16087186540002
[--------][--]
    ts      increasing number
Then rowid is unique and strictly increasing (provided there are less than 10k events per second), and no index would be required. A query WHERE ts BETWEEN a AND b would simply become WHERE rowid BETWEEN a*10000 AND b*10000+9999.
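For example, the range query from the beginning would become (a sketch):

SELECT text FROM data WHERE rowid BETWEEN 1608710000*10000 AND 1608718654*10000+9999;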
But is there an easy way to ask Sqlite to INSERT an item with a rowid greater than or equal to a given value? Let's say the current timestamp is 1608718654 and two events appear:
CREATE TABLE data(ts_and_incr INTEGER PRIMARY KEY AUTOINCREMENT, text TEXT);
INSERT INTO data VALUES (NEXT_UNUSED(1608718654), "hello") #16087186540001
INSERT INTO data VALUES (NEXT_UNUSED(1608718654), "hello") #16087186540002
More generally, how to create time-series optimally with Sqlite, to have fast queries WHERE timestamp BETWEEN a AND b?
First solution
The method (2) detailed in the question seems to work well. In a benchmark, I obtained:
naive method, without index: 18 MB database, 86 ms query time
naive method, with index: 32 MB database, 12 ms query time
method (2): 18 MB database, 12 ms query time
The key point here is to use dt as an INTEGER PRIMARY KEY, so that it is the rowid itself (see also Is an index needed for a primary key in SQLite?), stored in a B-tree, with no other hidden rowid column. Thus we avoid an extra index that would map dt => rowid: here dt is the rowid.
We also use AUTOINCREMENT, which internally creates a sqlite_sequence table keeping track of the last added ID. This is useful when inserting: since two events can have the same timestamp in seconds (this would be possible even with millisecond or microsecond timestamps, as the OS could truncate the precision), we use the maximum of timestamp*10000 and last_added_ID + 1 to make sure it's unique:
MAX(?, (SELECT seq FROM sqlite_sequence) + 1)
Code:
import sqlite3, random, time

db = sqlite3.connect('test.db')
db.execute("CREATE TABLE data(dt INTEGER PRIMARY KEY AUTOINCREMENT, label TEXT);")

t = 1600000000
for i in range(1000*1000):
    if random.randint(0, 100) == 0:  # timestamp increases by 1 second with probability 1%
        t += 1
    db.execute("INSERT INTO data(dt, label) VALUES (MAX(?, (SELECT seq FROM sqlite_sequence) + 1), 'hello');", (t*10000,))
db.commit()

# t ranges over a ~10,000-second window
t1, t2 = 1600005000*10000, 1600005100*10000  # time range of width 100 seconds (i.e. 1%)
start = time.time()
for _ in db.execute("SELECT 1 FROM data WHERE dt BETWEEN ? AND ?", (t1, t2)):
    pass
print(time.time() - start)
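To check that the query really runs off the primary key and not a separate index, one can look at the plan (exact wording varies by SQLite version; this is the expected shape):

EXPLAIN QUERY PLAN SELECT 1 FROM data WHERE dt BETWEEN 16000050000000 AND 16000051000000;
-- SEARCH TABLE data USING INTEGER PRIMARY KEY (rowid>? AND rowid<?)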
Using a WITHOUT ROWID table
Here is another method, using WITHOUT ROWID, which gives an 8 ms query time. We have to implement an auto-incrementing id ourselves, since AUTOINCREMENT is not available for WITHOUT ROWID tables.
WITHOUT ROWID is useful when we want a PRIMARY KEY(dt, another_column1, another_column2, id) and to avoid an extra rowid column. Instead of one B-tree for rowid and one B-tree for (dt, another_column1, ...), we'll have just one.
db.executescript("""
    CREATE TABLE autoinc(num INTEGER);
    INSERT INTO autoinc(num) VALUES(0);
    CREATE TABLE data(dt INTEGER, id INTEGER, label TEXT, PRIMARY KEY(dt, id)) WITHOUT ROWID;
    CREATE TRIGGER insert_trigger BEFORE INSERT ON data BEGIN UPDATE autoinc SET num = num + 1; END;
""")

t = 1600000000
for i in range(1000*1000):
    if random.randint(0, 100) == 0:  # timestamp increases by 1 second with probability 1%
        t += 1
    db.execute("INSERT INTO data(dt, id, label) VALUES (?, (SELECT num FROM autoinc), ?);", (t, 'hello'))
db.commit()

# t ranges over a ~10,000-second window
t1, t2 = 1600005000, 1600005100  # time range of width 100 seconds (i.e. 1%)
start = time.time()
for _ in db.execute("SELECT 1 FROM data WHERE dt BETWEEN ? AND ?", (t1, t2)):
    pass
print(time.time() - start)
Roughly-sorted UUID
More generally, the problem is linked to having IDs that are "roughly-sorted" by datetime. More about this:
ULID (Universally Unique Lexicographically Sortable Identifier)
Snowflake
MongoDB ObjectId
All these methods use an ID which is:
[---- timestamp ----][---- random and/or incremental ----]
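A hedged SQLite sketch of composing such an ID, with a seconds timestamp in the high part and a random low part (the split between the parts is illustrative):

SELECT CAST(strftime('%s', 'now') AS INTEGER) * 1000000
       + abs(random() % 1000000) AS roughly_sorted_id;
-- [---- timestamp ----][---- random ----]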
I am not an expert in SQLite, but I have worked with databases and time series, and I had a similar situation previously, so I will share my conceptual solution.
You somehow have part of the answer in your question, but not the way of doing it.
The way I did it was to create two tables. One table (main_logs) logs the time, in whole seconds, stored as an integer date that serves as the primary key. The other table (main_sub_logs) contains all the logs made in that particular second, which in your case can be up to 10,000 logs per second. main_sub_logs references main_logs, and for each logged second it holds that second's logs, each with its own counter id that starts over for every second.
This way you limit your time-series lookup to windows of event seconds instead of searching all logs in one place.
You can then join those two tables, and when you filter the first table between two specific times, you get all the logs in between.
Here is how I created my two tables:
CREATE TABLE IF NOT EXISTS main_logs (
    id INTEGER PRIMARY KEY
);
CREATE TABLE IF NOT EXISTS main_sub_logs (
    id INTEGER,
    ref INTEGER,
    log_counter INTEGER,
    log_text TEXT,
    PRIMARY KEY (id),
    FOREIGN KEY (ref) REFERENCES main_logs(id)
);
I have inserted some dummy data:
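For illustration, rows along these lines (values invented):

INSERT INTO main_logs (id) VALUES (1608718655), (1608718656);
INSERT INTO main_sub_logs (id, ref, log_counter, log_text) VALUES
    (1, 1608718655, 1, 'first event this second'),
    (2, 1608718655, 2, 'second event this second'),
    (3, 1608718656, 1, 'first event next second');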
Now let's query all logs between 1608718655 and 1608718656:
SELECT * FROM main_logs AS A
JOIN main_sub_logs AS B ON A.id = B.ref
WHERE A.id >= 1608718655 AND A.id <= 1608718656
With the dummy data above, this returns each matching second joined to all of its sub-logs: here, the three rows logged in seconds 1608718655 and 1608718656.

Is it faster to use limit statement with known max count?

Consider a query on a large table like:
select something from sometable limit somecount;
I know the LIMIT clause is useful to avoid getting too many rows back from a query.
But what about using it when a query can only return a few rows, yet runs against a large table?
For example, there is a table created like this:
CREATE TABLE if not exists users (
    id integer primary key autoincrement,
    name varchar(80) unique not null,
    password varchar(20) not null,
    role integer default 1, -- 0 -> super admin; 1 -> user
    banned integer default 0
);
Case 1: I want to get the user where id=100. Here id is the primary key, so it can return at most one row. Which is faster between the two statements below?
select * from users where id=100;
select * from users where id=100 limit 1;
Case 2: I want to get the user where name='jhon'. Here name is unique, so it can also return at most one row. Which is faster between the two statements below?
select * from users where name='jhon';
select * from users where name='jhon' limit 1;
Case 3: I want to get the users where role=0. Here role is neither the primary key nor unique, but I know there are at most 10 such rows. Which is faster between the two statements below?
select * from users where role=0;
select * from users where role=0 limit 10;
If you care about performance, then add indexes to handle all three queries. This requires an additional index on: users(role). The id column already has an index as the primary key; name has an index because it is declared unique.
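For case 3 that means something like (the index name is arbitrary):

CREATE INDEX idx_users_role ON users(role);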
For the first two cases, the limit shouldn't make a difference. Without limit, the engine finds the matching row in the index and returns it. If the engine doesn't use the "unique" information, then it might need to peek at the next value in the index, just to see if it is the same.
The third case, with no index, is a bit different. Without an index, the engine will scan all the rows to find all matches. With an index, it can find all the matching rows there. Add a limit, and it will stop as soon as it has found that many.
The appropriate indexes will be a bigger boost to performance than using limit, on average.

What key columns to use on filtered index with covering WHERE clause?

I'm creating a filtered index such that the WHERE filter includes the complete query criteria. With such an index, it seems that a key column would be unnecessary, though SQL Server requires me to add one. For example, consider the table:
CREATE TABLE Invoice
(
    Id INT NOT NULL IDENTITY PRIMARY KEY,
    Data VARCHAR(MAX) NOT NULL,
    IsProcessed BIT NOT NULL DEFAULT 0,
    IsInvalidated BIT NOT NULL DEFAULT 0
)
Queries on the table look for new invoices to process, i.e.:
SELECT *
FROM Invoice
WHERE IsProcessed = 0 AND IsInvalidated = 0
So, I can tune for these queries with a filtered index:
CREATE INDEX IX_Invoice_IsProcessed_IsInvalidated
ON Invoice (IsProcessed)
WHERE (IsProcessed = 0 AND IsInvalidated = 0)
GO
My question: What should the key column(s) for IX_Invoice_IsProcessed_IsInvalidated be? Presumably the key column isn't being used. My intuition leads me to pick a column that is small and will keep the index structure relatively flat. Should I pick the table primary key (Id)? One of the filter columns, or both of them?
Because you have a clustered index on that table, it doesn't really matter what you put in the key columns of this index: Id is there free of charge. The only thing you can do is put everything else in the INCLUDE section of the index, so the data is at hand at the leaf level and key lookups against the table are avoided. Or, if the queue is huge, perhaps some other column would be useful in the key section.
Now, if that table didn't have a primary key (i.e. it were a heap), you would have to include, or specify as key columns, all the columns you need for joins or other purposes. Otherwise RID lookups on the heap would occur, because at the leaf level the index would only hold references to data pages.
What percentage of the table does this filtered index cover? If it's small, you may want to cover the entire table so the "SELECT *" can be served from the index without hitting the table. If it's a large portion of the table, though, this would not be optimal; then I'd recommend keying on the clustered index or primary key. I'd have to research more, because I forget which is optimal right now, but if they're the same you should be set.
I suggest you declare it as follows
CREATE INDEX IX_Invoice_IsProcessed_IsInvalidated
ON Invoice (Id)
INCLUDE (Data)
WHERE (IsProcessed = 0 AND IsInvalidated = 0)
The INCLUDE clause means that the values of the Data column will be stored as part of the index.
If you didn't have an INCLUDE clause then the query plan for
SELECT Id, Data
FROM Invoice
WHERE IsProcessed = 0 AND IsInvalidated = 0
would involve a two-step process:
1. use the index to find the list of primary key values that match the criteria
2. get the data from the table that matches those primary keys
If, on the other hand, the index includes the [Data] column, then it will properly cover the query, as there will be no need to look up the data using the primary keys.
You don't get something for nothing, though.
The downside is that you will be storing the varchar(MAX) data twice for these records, so more data has to be written to the database and more storage is used, although this isn't much of a problem if you're only talking about a small section of the data.
As always the more time and effort you put into putting things away carefully the faster and easier it is to get them back.

SQL get last rows in table WITHOUT primary ID

I have a table with 800,000 entries and no primary key. I am not allowed to add one, and I can't use TOP 1 ... ORDER BY ... DESC because it takes hours to complete. So I tried this workaround:
DECLARE @ROWCOUNT int, @OFFSET int
SELECT @ROWCOUNT = (SELECT COUNT(field) FROM TABLE)
SET @OFFSET = @ROWCOUNT - 1
SELECT TOP 1 FROM TABLE WHERE ?????NO PRIMARY KEY??? BETWEEN @OFFSET AND @ROWCOUNT
Of course this doesn't work.
Is there any way to use this code, or better code, to retrieve the last row in the table?
If your table has no primary key, or your primary key is not orderly, you can try the code below; if you want to see more than one last record, you can change the number in the code.
Select top (select COUNT(*) from table) * From table
EXCEPT
Select top ((select COUNT(*) from table)-(1)) * From table
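To see, say, the last five rows instead, change the offset (note that without an ORDER BY, "last" is just whatever order the engine returns rows in):

Select top (select COUNT(*) from table) * From table
EXCEPT
Select top ((select COUNT(*) from table)-(5)) * From table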
I assume that by 'last rows' you mean 'last created rows'.
Even if you had a primary key, it would still not be the best option for determining row creation order.
There is no guarantee that a row with a bigger primary key value was created after a row with a smaller one.
Even if the primary key is on an identity column, you can still override identity values on insert by using SET IDENTITY_INSERT ... ON.
It is a better idea to have a timestamp column, for example CreatedDateTime with a default constraint. You would have an index on this field. Then your query would be simple, efficient and correct:
select top 1 *
from MyTable
order by CreatedDateTime desc
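For completeness, a sketch of adding such a column and index (T-SQL; the names and the SYSUTCDATETIME default are illustrative):

alter table MyTable
    add CreatedDateTime datetime2 not null
        constraint DF_MyTable_CreatedDateTime default sysutcdatetime()
create index IX_MyTable_CreatedDateTime on MyTable (CreatedDateTime)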
If you don't have a timestamp column, you can't determine the 'last rows'.
If you need to select 1 column from a table of 800,000 rows where that column is the min or max possible value, and that column is not indexed, then the unassailable fact is that SQL will have to read every row in the table in order to identify that min or max value.
(An aside, on the face of it reading all the rows of an 800,000 row table shouldn't take all that long. How wide is the column? How often is the query run? Are there concurrency, locking, blocking, or deadlocking issues? These may be pain points that could be addressed. End of aside.)
There are any number of workarounds (indexes, views, indexed views, periodically indexed copies of the table, run once and store the result for a period of time before refreshing, etc.), but virtually all of them require making permanent modifications to the database. It sounds like you are not permitted to do this, and I don't think there's much you can do here without some such permanent change--and call it an improvement, when you discuss it with your project manager--to the database.
You need to add an index. Can you?
Even if you don't have a primary key, an index will speed up the query considerably.
You say you don't have a primary key, but from your question I assume you have some type of timestamp or something similar on the table. If you create an index on this column, you will be able to execute a query like:
SELECT *
FROM table_name
WHERE timestamp_column_name=(
SELECT max(timestamp_column_name)
FROM table_name
)
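Again, this relies on an index on that column, e.g. (name arbitrary):

CREATE INDEX IX_table_name_timestamp ON table_name (timestamp_column_name)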
If you're not allowed to edit this table, have you considered creating a view, or replicating the data in the table and moving it into one that has a primary key?
Sounds hacky, but then, your 800k row table doesn't have a primary key, so hacky seems to be the order of the day. :)
I believe you could write it simply as
SELECT * FROM table ORDER BY rowid DESC LIMIT 1;
Hope it helps.

Equivalent of a composite index across multiple tables?

I have a table structure similar to the following:
create table MAIL (
ID int,
FROM varchar,
SENT_DATE date
);
create table MAIL_TO (
ID int,
MAIL_ID int,
NAME varchar
);
and I need to run the following query:
select m.ID
from MAIL m
inner join MAIL_TO t on t.MAIL_ID = m.ID
where m.SENT_DATE between '07/01/2010' and '07/30/2010'
and t.NAME = 'someone#example.com'
Is there any way to design indexes such that both of the conditions can use an index? If I put an index on MAIL.SENT_DATE and an index on MAIL_TO.NAME, the database will choose to use either one of the indexes or the other, not both. After filtering by the first condition the database always has to do a full scan of the results for the second condition.
Oracle can use both indices. You just don't have the right two indices.
Consider: if the query plan uses your index on mail.sent_date first, what does it get from mail? It gets all the mail.ids where mail.sent_date is within the range you gave in your where clause, yes?
So it goes to mail_to with a list of mail.ids and the name you gave in your where clause. At this point, Oracle decides that it's better to scan the table for matching mail_to.mail_ids rather than use the index on mail_to.name.
Indices on varchars are always problematic, and Oracle really prefers full table scans. But if we give Oracle an index containing the columns it really wants to use, and depending on total table rows and statistics, we can get it to use it. This is the index:
create index mail_to_pid_name on mail_to( mail_id, name ) ;
This works where an index just on name doesn't, because Oracle's not looking just for a name, but for a mail_id and a name.
Conversely, if the cost-based analyzer determines it's cheaper to go to table mail_to first, and uses your index on mail_to.name, what does it get? A bunch of mail_to.mail_ids to look up in mail. It needs to find rows with those ids and certain sent_dates, so:
create index mail_id_sentdate on mail( sent_date, id ) ;
Note that in this case I've put sent_date first in the index, and id second. (This is more an intuitive thing.)
Again, the take home point is this: in creating indices, you have to consider not just the columns in your where clause, but also the columns in your join conditions.
Update
jthg: yes, it always depends on how the data is distributed, and on how many rows are in the table: if very many, Oracle will do table scans and a hash join; if very few, it will just do a table scan anyway, since the index only pays off in between. You might reverse the order of either of the two indices. By putting sent_date first in the second index, we eliminate most needs for an index solely on sent_date.
A materialized view would allow you to index the values, assuming the stringent materialized view criteria are met.
Which criterion is more selective? The date range or the addressee? I would guess the addressee. And if that is highly selective, don't bother with the date index; just let the database do the search based on the found mail ids. But index table MAIL on id, if it is not already.
On the other hand, some modern optimizers would even make use of both indexes, scanning both tables and then building a hash of the join columns to merge the results of both. I am not absolutely sure if and when Oracle would choose this strategy. I just realized that SQL Server tends to use hash joins rather often, compared to other engines.
In situations where the requirements aren't met for a materialized view, there are these two options:
1) You can create a cross reference table, and keep this updated with triggers.
The concepts would be the same in Oracle, but I only have SQL Server installed at the moment to run the test; see this setup:
create table MAIL (
    ID INT IDENTITY(1,1),
    [FROM] VARCHAR(200),
    SENT_DATE DATE,
    CONSTRAINT PK_MAIL PRIMARY KEY (ID)
);
create table MAIL_TO (
    ID INT IDENTITY(1,1),
    MAIL_ID INT,
    [NAME] VARCHAR(200),
    CONSTRAINT PK_MAIL_TO PRIMARY KEY (ID)
);
ALTER TABLE [dbo].[MAIL_TO] WITH CHECK ADD CONSTRAINT [FK_MAILTO_MAIL] FOREIGN KEY([MAIL_ID])
REFERENCES [dbo].[MAIL] ([ID])
GO
ALTER TABLE [dbo].[MAIL_TO] CHECK CONSTRAINT [FK_MAILTO_MAIL]
GO
CREATE TABLE CompositeIndex_MailSentDate_MailToName (
    [MAIL_ID] INT,
    [MAILTO_ID] INT,
    SENT_DATE DATE,
    MAILTO_NAME VARCHAR(200),
    CONSTRAINT PK_CompositeIndex_MailSentDate_MailToName PRIMARY KEY (MAILTO_ID, MAIL_ID)
)
GO
CREATE NONCLUSTERED INDEX IX_MailSent_MailTo ON dbo.CompositeIndex_MailSentDate_MailToName (SENT_DATE,MAILTO_NAME)
CREATE NONCLUSTERED INDEX IX_MailTo_MailSent ON dbo.CompositeIndex_MailSentDate_MailToName (MAILTO_NAME,SENT_DATE)
GO
CREATE TRIGGER dbo.trg_MAILTO_Insert
ON dbo.MAIL_TO
AFTER INSERT AS
BEGIN
    INSERT INTO dbo.CompositeIndex_MailSentDate_MailToName (MAIL_ID, MAILTO_ID, SENT_DATE, MAILTO_NAME)
    SELECT mailTo.MAIL_ID, mailTo.ID, m.SENT_DATE, mailTo.NAME
    FROM inserted mailTo
    INNER JOIN dbo.MAIL m ON m.ID = mailTo.MAIL_ID
END
GO
CREATE TRIGGER dbo.trg_MAILTO_Delete
ON dbo.MAIL_TO
AFTER DELETE AS
BEGIN
    DELETE compositeIndex
    FROM dbo.CompositeIndex_MailSentDate_MailToName compositeIndex
    INNER JOIN deleted ON compositeIndex.MAILTO_ID = deleted.ID
END
GO
CREATE TRIGGER dbo.trg_MAILTO_Update
ON dbo.MAIL_TO
AFTER UPDATE AS
BEGIN
    UPDATE compositeIndex
    SET compositeIndex.MAILTO_NAME = updates.NAME
    FROM dbo.CompositeIndex_MailSentDate_MailToName compositeIndex
    INNER JOIN inserted updates ON updates.ID = compositeIndex.MAILTO_ID
END
GO
CREATE TRIGGER dbo.trg_MAIL_Update
ON dbo.MAIL
AFTER UPDATE AS
BEGIN
    UPDATE compositeIndex
    SET compositeIndex.SENT_DATE = updates.SENT_DATE
    FROM dbo.CompositeIndex_MailSentDate_MailToName compositeIndex
    INNER JOIN inserted updates ON updates.ID = compositeIndex.MAIL_ID
END
GO
INSERT INTO dbo.MAIL ( [FROM], SENT_DATE )
SELECT 'SenderA','2018-10-01'
UNION ALL SELECT 'SenderA','2018-10-02'
INSERT INTO dbo.MAIL_TO ( MAIL_ID, NAME )
SELECT 1,'CustomerA'
UNION ALL SELECT 1,'CustomerB'
UNION ALL SELECT 2,'CustomerC'
UNION ALL SELECT 2,'CustomerD'
UNION ALL SELECT 2,'CustomerE'
SELECT * FROM dbo.MAIL
SELECT * FROM dbo.MAIL_TO
SELECT * FROM dbo.CompositeIndex_MailSentDate_MailToName
You can then use the dbo.CompositeIndex_MailSentDate_MailToName table to JOIN to the rest of your data. This is useful in environments where your rate of inserts and updates is low but your query needs are high, so the relative overhead of implementing the triggers is small.
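For example, the original query pattern can then be served from the cross-reference table alone (a sketch using the sample rows above):

SELECT c.MAIL_ID
FROM dbo.CompositeIndex_MailSentDate_MailToName c
WHERE c.SENT_DATE BETWEEN '2018-10-01' AND '2018-10-02'
  AND c.MAILTO_NAME = 'CustomerC'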
This has the advantage of being updated transactionally, in real time.
2) If you don't want the performance/management overhead of a trigger, and you only need this for next-day reporting, you can create a view, plus a nightly process that truncates the table and selects the entire view into a materialized table.
I've used this successfully to index flattened relational data requiring joins across a dozen or so tables, reducing report times from hours to seconds. While it's an expensive query, you can set the job to run off-hours if you have periods of reduced usage.
If your queries are generally for a particular month, then you could partition the data by month.
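A minimal T-SQL sketch of monthly partitioning (SQL Server syntax; the boundary dates and names are illustrative):

CREATE PARTITION FUNCTION pfMailMonthly (date)
    AS RANGE RIGHT FOR VALUES ('2010-07-01', '2010-08-01');
CREATE PARTITION SCHEME psMailMonthly
    AS PARTITION pfMailMonthly ALL TO ([PRIMARY]);
-- MAIL would then be created ON psMailMonthly(SENT_DATE)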