How to use time-series with SQLite, with fast time-range queries?

Let's say we log events in an SQLite database with a Unix timestamp column ts:
CREATE TABLE data(ts INTEGER, text TEXT); -- more columns in reality
and that we want fast lookup for datetime ranges, for example:
SELECT text FROM data WHERE ts BETWEEN 1608710000 and 1608718654;
As is, EXPLAIN QUERY PLAN gives SCAN TABLE data, which is bad, so one obvious solution is to create an index with CREATE INDEX dt_idx ON data(ts).
Then the problem is solved, but it's rather a poor solution to have to maintain an index for an already-increasing / already-sorted column ts, on which we could do a B-tree search in O(log n) directly. Internally, the index will be:
ts rowid
1608000001 1
1608000002 2
1608000012 3
1608000077 4
which is a waste of DB space (and CPU when a query has to look in the index first).
To avoid this:
(1) we could use ts as INTEGER PRIMARY KEY, so ts would be the rowid itself. But this fails because ts is not unique: 2 events can happen at the same second (or even at the same millisecond).
See for example the info given in SQLite Autoincrement.
(2) we could use rowid as timestamp ts concatenated with an increasing number. Example:
16087186540001
16087186540002
[--------][--]
ts increasing number
Then rowid is unique and strictly increasing (provided there are fewer than 10,000 events per second), and no index would be required. A query WHERE ts BETWEEN a AND b would simply become WHERE rowid BETWEEN a*10000 AND b*10000+9999.
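Using the timestamps from the example query above, the range lookup would become:
SELECT text FROM data WHERE rowid BETWEEN 1608710000*10000 AND 1608718654*10000 + 9999;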
But is there an easy way to ask SQLite to INSERT an item with a rowid greater than or equal to a given value? Let's say the current timestamp is 1608718654 and two events appear:
CREATE TABLE data(ts_and_incr INTEGER PRIMARY KEY AUTOINCREMENT, text TEXT);
INSERT INTO data VALUES (NEXT_UNUSED(1608718654), 'hello');  -- should get rowid 16087186540001
INSERT INTO data VALUES (NEXT_UNUSED(1608718654), 'hello');  -- should get rowid 16087186540002
More generally, how to create time-series optimally with Sqlite, to have fast queries WHERE timestamp BETWEEN a AND b?

First solution
The method (2) detailed in the question seems to work well. In a benchmark, I obtained:
naive method, without index: 18 MB database, 86 ms query time
naive method, with index: 32 MB database, 12 ms query time
method (2): 18 MB database, 12 ms query time
The key point here is to use dt as INTEGER PRIMARY KEY, so it will be the row id itself (see also Is an index needed for a primary key in SQLite?), stored in a B-tree, and there will not be another hidden rowid column. Thus we avoid an extra index which would make a correspondence dt => rowid: here dt is the row id.
We also use AUTOINCREMENT, which internally creates a sqlite_sequence table keeping track of the last added ID. This is useful when inserting: since two events can have the same timestamp in seconds (this would be possible even with millisecond or microsecond timestamps, as the OS could truncate the precision), we use the maximum of timestamp*10000 and last_added_ID + 1 to make sure it's unique:
MAX(?, (SELECT seq FROM sqlite_sequence) + 1)
Code:
import sqlite3, random, time

db = sqlite3.connect('test.db')
db.execute("CREATE TABLE data(dt INTEGER PRIMARY KEY AUTOINCREMENT, label TEXT);")

t = 1600000000
for i in range(1000*1000):
    if random.randint(0, 100) == 0:  # timestamp increases by 1 second with probability 1%
        t += 1
    # dt = max(timestamp*10000, last inserted ID + 1), so it stays unique and increasing
    db.execute("INSERT INTO data(dt, label) VALUES (MAX(?, (SELECT seq FROM sqlite_sequence) + 1), 'hello');", (t*10000, ))
db.commit()

# t will range in a ~ 10 000 seconds window
t1, t2 = 1600005000*10000, 1600005100*10000  # time range of width 100 seconds (i.e. 1%)
start = time.time()
for _ in db.execute("SELECT 1 FROM data WHERE dt BETWEEN ? AND ?", (t1, t2)):
    pass
print(time.time() - start)
Using a WITHOUT ROWID table
Here is another method, using WITHOUT ROWID, which gives an 8 ms query time. We have to implement an auto-incrementing id ourselves, since AUTOINCREMENT is not available when using WITHOUT ROWID.
WITHOUT ROWID is useful when we want to use a PRIMARY KEY(dt, another_column1, another_column2, id) and avoid having an extra rowid column. Instead of having one B-tree for rowid and one B-tree for (dt, another_column1, ...), we'll have just one.
db.executescript("""
CREATE TABLE autoinc(num INTEGER); INSERT INTO autoinc(num) VALUES(0);
CREATE TABLE data(dt INTEGER, id INTEGER, label TEXT, PRIMARY KEY(dt, id)) WITHOUT ROWID;
CREATE TRIGGER insert_trigger BEFORE INSERT ON data BEGIN UPDATE autoinc SET num=num+1; END;
""")

t = 1600000000
for i in range(1000*1000):
    if random.randint(0, 100) == 0:  # timestamp increases by 1 second with probability 1%
        t += 1
    # the BEFORE INSERT trigger bumps autoinc.num, which serves as the per-row id
    db.execute("INSERT INTO data(dt, id, label) VALUES (?, (SELECT num FROM autoinc), ?);", (t, 'hello'))
db.commit()

# t will range in a ~ 10 000 seconds window
t1, t2 = 1600005000, 1600005100  # time range of width 100 seconds (i.e. 1%)
start = time.time()
for _ in db.execute("SELECT 1 FROM data WHERE dt BETWEEN ? AND ?", (t1, t2)):
    pass
print(time.time() - start)
Roughly-sorted UUID
More generally, the problem is linked to having IDs that are "roughly-sorted" by datetime. More about this:
ULID (Universally Unique Lexicographically Sortable Identifier)
Snowflake
MongoDB ObjectId
All these methods use an ID which is:
[---- timestamp ----][---- random and/or incremental ----]
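As an illustration only (not exactly any of the formats above), a roughly-sorted 64-bit ID can be generated in plain SQLite by packing the Unix timestamp into the high bits and a random component into the low bits; the table and bit layout here are made up for the sketch, and collisions within the same second would still need handling:
-- sketch: Unix time in the high bits, 20 random low bits (hypothetical layout)
CREATE TABLE events(id INTEGER PRIMARY KEY, label TEXT);
INSERT INTO events(id, label)
VALUES ((CAST(strftime('%s','now') AS INTEGER) << 20) | (random() & 1048575), 'hello');
-- a time-range query then becomes a plain range query on id:
SELECT label FROM events
WHERE id BETWEEN (1608710000 << 20) AND ((1608718654 << 20) | 1048575);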

I am not an expert in SQLite, but I have worked with databases and time series. I have had a similar situation before, and I will share my conceptual solution.
You somehow have part of the answer in your question, but not the way of doing it.
The way I did it was to create two tables: one table (main_logs) logs the time in one-second increments, with the date stored as an integer primary key, and the other table (main_sub_logs) contains all the logs made during that particular second, which in your case can be up to 10,000 logs per second. main_sub_logs references main_logs, and for each second it contains the X logs belonging to that second, each with its own counter id that starts over again.
This way you limit your time-series lookup to one-second event windows instead of searching all the logs in one place.
You can then join the two tables, and when you look up a range between two specific times in the first table, you get all the logs in between.
Here is how I created my two tables:
CREATE TABLE IF NOT EXISTS main_logs (
id INTEGER PRIMARY KEY
);
CREATE TABLE IF NOT EXISTS main_sub_logs (
id INTEGER,
ref INTEGER,
log_counter INTEGER,
log_text text,
PRIMARY KEY (id),
FOREIGN KEY (ref) REFERENCES main_logs(id)
)
I have inserted some dummy data:
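For example (these values are made up, just to show the shape of the data):
INSERT INTO main_logs(id) VALUES (1608718655), (1608718656);
INSERT INTO main_sub_logs(id, ref, log_counter, log_text) VALUES
    (1, 1608718655, 1, 'first event in second 1608718655'),
    (2, 1608718655, 2, 'second event in second 1608718655'),
    (3, 1608718656, 1, 'first event in second 1608718656');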
Now let's query all logs between 1608718655 and 1608718656:
SELECT * FROM main_logs AS A
JOIN main_sub_logs AS B ON A.id = B.ref
WHERE A.id >= 1608718655 AND A.id <= 1608718656;
This returns every sub-log whose parent second falls within that range.

Related

Why does using the MAX function in a query cause a PostgreSQL performance issue?

I have a table with three columns, time_stamp, device_id and status, where status is of type json. The time_stamp and device_id columns also have indexes. I need to grab the latest value of status with id 1.3.6.1.4.1.34094.1.1.1.1.1 which is not null.
You can find the query execution times of the following commands, with and without MAX, below.
Query with MAX:
SELECT DISTINCT MAX(time_stamp) FROM device.status_events WHERE
(device_id = 7) AND
(status->'1.3.6.1.4.1.34094.1.1.1.1.1' != '{}');
Query without MAX:
SELECT DISTINCT time_stamp FROM device.status_events WHERE
(device_id = 7) AND
(status->'1.3.6.1.4.1.34094.1.1.1.1.1' != '{}');
The first query takes about 3 sec and the second one takes just 3 msec, with two different plans. I think both queries should have the same query plan. Why does it not use the index when it wants to calculate MAX? How can I improve the running time of the first query?
PS: I use Postgres 9.6 (dockerized version).
Also this is table definition.
-- Table: device.status_events
-- DROP TABLE device.status_events;
CREATE TABLE device.status_events
(
time_stamp timestamp with time zone NOT NULL,
device_id bigint,
status jsonb,
is_active boolean DEFAULT true,
CONSTRAINT status_events_device_id_fkey FOREIGN KEY (device_id)
REFERENCES device.devices (id) MATCH SIMPLE
ON UPDATE NO ACTION ON DELETE CASCADE
)
WITH (
OIDS=FALSE
);
ALTER TABLE device.status_events
OWNER TO monitoring;
-- Index: device.status_events__time_stamp
-- DROP INDEX device.status_events__time_stamp;
CREATE INDEX status_events__time_stamp
ON device.status_events
USING btree
(time_stamp);
The index you show us cannot produce the first plan you show us. With that index, the plan would have to be applying a filter for the jsonb column, which it isn't. So the index must be a partial index, with the filter being applied at the index level so that it is not needed in the plan.
PostgreSQL is using an index for the MAX query, it just isn't the index you want it to use.
All of your device_id=7 rows must have low timestamps, but PostgreSQL doesn't know this. It thinks that by walking down the timestamp index, it will quickly find a device_id=7 row and then be done. But instead it needs to walk a large chunk of the index before finding such a row.
You can force it away from the "wrong" index by changing the aggregate expression to something like:
MAX(time_stamp + interval '0')
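Applied to the first query, that tweak might look like this (the DISTINCT is redundant with MAX, so it is dropped here):
SELECT MAX(time_stamp + interval '0') FROM device.status_events WHERE
    (device_id = 7) AND
    (status->'1.3.6.1.4.1.34094.1.1.1.1.1' != '{}');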
Or you could instead build a more tailored index, which the planner will choose instead of the falsely attractive one:
create index on device.status_events (device_id , time_stamp)
where status->'1.3.6.1.4.1.34094.1.1.1.1.1' != '{}';
I believe this should generate a better plan:
SELECT time_stamp FROM device.status_events WHERE
(device_id = 7) AND
(status->'1.3.6.1.4.1.34094.1.1.1.1.1' != '{}')
ORDER BY time_stamp DESC
LIMIT 1
Let me know how that works for you.

Multiple filters on indexed field

Consider two queries:
SELECT Log.Key, Time, Filter.Name, Text, Blob
FROM Log
JOIN Filter ON FilterKey = Filter.Key
WHERE FilterKey IN (1)
ORDER BY Log.Key
LIMIT #limit
OFFSET #offset
and
SELECT Log.Key, Time, Filter.Name, Text, Blob
FROM Log
JOIN Filter ON FilterKey = Filter.Key
WHERE FilterKey IN (1,2)
ORDER BY Log.Key
LIMIT #limit
OFFSET #offset
The difference is IN (1) vs IN (1,2). Problem: the second query is ~50 times slower (on a 3 GB database it's 0.2 s vs 13.0 s)!
I know that WHERE FilterKey IN (1,2) is equivalent to WHERE FilterKey = 1 OR FilterKey = 2. It seems that only a single filter works well with the index. Why?
How can I increase the performance of the second query (with multiple conditions)?
Structure:
CREATE TABLE Filter (Key INTEGER PRIMARY KEY AUTOINCREMENT, Name TEXT)
CREATE TABLE Log (Key INTEGER PRIMARY KEY AUTOINCREMENT, Time DATETIME, FilterKey INTEGER, Text TEXT, Blob BLOB)
CREATE INDEX FilterKeyIndex on Log(FilterKey)
The FilterKeyIndex stores not only the FilterKey values but also the rowid of the actual table to be able to find the corresponding row. The index is sorted over both columns.
In the first query, when reading all index entries whose FilterKey is one, in order, the rowid values also are in order. That rowid is the same as Log.Key, so it is not necessary to do any further sorting.
In the second query, the Log.Key values come from two index runs, so there is no guarantee that they are sorted, so the database has to sort all results rows before it can return the first one.
To speed up the second query, you would have to read all the Log rows in the order of the Key column, i.e., scan the table without looking up any Log rows in the index. Either drop FilterKeyIndex, or use ... FROM Log NOT INDEXED JOIN ....
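For example, the second query rewritten to force a table scan in Key order (same placeholders as above):
SELECT Log.Key, Time, Filter.Name, Text, Blob
FROM Log NOT INDEXED
JOIN Filter ON FilterKey = Filter.Key
WHERE FilterKey IN (1,2)
ORDER BY Log.Key
LIMIT #limit
OFFSET #offset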

SQL index column always value from 1 to N

I think my question is very simple, but every search on the web shows me results about SQL indexing.
I use the following SQL query to create a simple table:
CREATE TABLE SpeechOutputList
(
ID int NOT NULL IDENTITY(1,1),
SpeechConfigCode nvarchar(36) NOT NULL,
OutputSentence nvarchar(500),
IsPrimaryOutput bit DEFAULT 0,
PRIMARY KEY(ID),
FOREIGN KEY(SpeechConfigCode)
REFERENCES SpeechConfig
ON UPDATE CASCADE ON DELETE CASCADE
);
I would like to add an index column that increases automatically (not identity(1,1)) which always has values from 1 to N (according to the number of rows).
identity(1,1) will not do, since in many cases there will not be continuous numbers from 1 to N, because it's intended for a primary key.
Thanks
Trying to keep such an index field sequential, and without gaps, will not be efficient. If, for instance, a record is removed, you would need a trigger that renumbers the records that follow. This will not only take extra time, it will also reduce concurrency.
Furthermore, that index will not be a stable key for a record. If a client were to get the index value of a record, and later tried to locate it again by that index, it might well get a different record as a result.
If you still believe such an index is useful, I would suggest creating a view that adds this index on-the-fly:
CREATE VIEW SpeechOutputListEx AS
SELECT ID, SpeechConfigCode, OutputSentence, IsPrimaryOutput,
ROW_NUMBER() OVER (ORDER BY ID ASC) AS idx
FROM SpeechOutputList
This will make it possible to do selections, like:
SELECT * FROM SpeechOutputListEx WHERE idx = 5
To make an update, with a condition on the index, you would take the join with the view:
UPDATE s
SET OutputSentence = 'sentence'
FROM SpeechOutputList s
INNER JOIN SpeechOutputListEx se
ON s.ID = se.ID
WHERE idx = 5
The issue of primary:
You explained in comments that the order should indicate whether a sentence is primary.
For that purpose you don't need the view. You could add a column idx, that would allow gaps. Then just let the user determine the value of the idx column. Even if negative, that would not be an issue. You would select in order of idx value and so get the primary sentence first.
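A minimal sketch of that approach (reusing the idx column name from above; the exact typing and defaults are up to you):
ALTER TABLE SpeechOutputList ADD idx int;

-- primary sentence first: smallest idx wins
SELECT SpeechConfigCode, OutputSentence, IsPrimaryOutput
FROM SpeechOutputList
ORDER BY idx ASC;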
If a sentence would have to be made primary, you could issue this update:
update SpeechOutputList
set idx = (select min(idx) - 1 from SpeechOutputList)
where id = 123

Fast search in a 10 million record table with a unique index column, SQL Server 2008 R2 on Win 7

I need to do a fast search on a column with floating point numbers in a table in SQL Server 2008 R2 on Win 7.
The table has 10 million records, e.g.:
Id value
532 937598.32421
873 501223.3452
741 9797327.231
ID is the primary key. I need to do a search on the "value" column for a given value, such that I can find the 5 closest points to the given point in the table.
The closeness is defined as the absolute value of the difference between the given value and column value.
The smaller value, the closer.
I would like to use binary search.
I want to set a unique index on the value column.
But I am not sure whether the table will be sorted every time I search for the given value in the column,
or whether it only sorts the table once because I have set a unique index on the value column.
Are there better ways to do this search?
Will a sort have to be done whenever I do a search? I need to search the table many times. I know the sorting time is O(n lg n). Can using an index really do the sort for me, or is the index associated with a sorted tree holding the column values?
When an index is set up, have the values already been sorted, so that I do not need to sort them every time I do a search?
Any help would be appreciated.
thanks
Sorry for my initial response. No, I would not even create an index: it won't be able to use it, because you're searching not on a given value but on the difference between that given value and the value column of the table. You could create a function-based index, but you would have to specify the number you're searching on, which is not constant.
Given that, I would look at getting enough RAM to hold the whole table, i.e. if the table is 10 GB, try to get 10 GB of RAM allocated for caching. And if possible do it on a machine with an SSD, or get an SSD.
The sql itself is not complicated, it's really just an issue of performance.
select top 5 id, abs(99 - val) as diff
from tbl
order by 2
If you don't mind some trial and error, you could create an index on the value column, and then search as follows -
select top 5 id, abs(99 - val) as diff
from tbl
where val between 99-30 and 99+30
order by 2
The above query WOULD utilize the index on the value column, because it is searching on a range of values in the value column, not the differences between the values in that column and X (2 very different things)
However, there is no guarantee it would return 5 rows, it would only return 5 rows if there actually existed 5 rows within 30 of 99 (69 to 129). If it returned 2, 3, etc. but not 5, you would have to run the query again and expand the range, and keep doing so until you determine your top 5. However, these queries should run quite a bit faster than having no index and firing against the table blind. So you could give it a shot. The index may take a while to create though, so you might want to do that part overnight.
You mention SQL Server and binary search. SQL Server does not work that way, but SQL Server (or another database) is a good solution for this problem.
Just to be concrete, I will assume:
create table mytable
(
    id int not null
  , value float not null
  , constraint mytable_pk primary key (id)
)
And you need an index on the value field.
Now get ten rows, 5 above and 5 below the search value, with these 2 selects:
SELECT TOP 5 id, value, abs(value - #searchval) as diff
FROM mytable
WHERE value >= #searchval
ORDER BY value asc
-- and
SELECT TOP 5 id, value, abs(value - #searchval) as diff
FROM mytable
WHERE value < #searchval
ORDER BY value desc
To combine the 2 selects into 1 result set you need:
SELECT *
FROM (SELECT TOP 5 id, value, abs(value - #searchval) as diff
      FROM mytable
      WHERE value >= #searchval
      ORDER BY value asc) as bigger
UNION ALL
SELECT *
FROM (SELECT TOP 5 id, value, abs(value - #searchval) as diff
      FROM mytable
      WHERE value < #searchval
      ORDER BY value desc) as smaller
But since you only want the smallest 5 differences, wrap it with one more layer:
SELECT TOP 5 *
FROM
(
    SELECT *
    FROM (SELECT TOP 5 id, value, abs(value - #searchval) as diff
          FROM mytable
          WHERE value >= #searchval
          ORDER BY value asc) as bigger
    UNION ALL
    SELECT *
    FROM (SELECT TOP 5 id, value, abs(value - #searchval) as diff
          FROM mytable
          WHERE value < #searchval
          ORDER BY value desc) as smaller
) as candidates
ORDER BY diff ASC
I have not tested any of this.
Creating the table's clustered index upon [value] will cause [value]'s values to be stored on disk in sorted order. The table's primary key (perhaps already defined on [Id]) might already be defined as the table's clustered index. There can only be one clustered index on a table. If a primary key on [Id] is already clustered, the primary key will need to be dropped, the clustered index on [value] will need to be created, and then the primary key on [Id] can be recreated (as a nonclustered primary key). A clustered index upon [value] should improve performance of this specific statement, but you must ultimately test all variety of T-SQL that will reference this table before making the final choice about this table's most useful clustered index column(s).
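A sketch of that reshuffle, reusing the mytable / mytable_pk names from the earlier answer (your actual table, constraint and index names will differ):
-- drop the existing clustered primary key, cluster on [value], recreate the PK as nonclustered
ALTER TABLE mytable DROP CONSTRAINT mytable_pk;
CREATE CLUSTERED INDEX ix_mytable_value ON mytable ([value]);
ALTER TABLE mytable ADD CONSTRAINT mytable_pk PRIMARY KEY NONCLUSTERED (id);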
Because the FLOAT data type is imprecise (subject to your system's FPU and its floating point rounding and truncation errors, while still in accordance with IEEE 754's specifications), it can be a fatal mistake to assume every [value] will be unique, even when the decimal number (being inserted into FLOAT) appears (in decimal) to be unique. Irrational numbers must always be truncated and rounded. In decimal, PI is an example of an irrational value, which can be truncated and rounded to an imprecise value of 3.142. Similarly, the decimal number 0.1 has no finite representation in binary, which means FLOAT will not store decimal 0.1 as a precise binary value.... You might want to consider whether the domain of acceptable values offered by the NUMERIC data type can accommodate [value] (thus gaining more precise answers when compared to a use of FLOAT).
While a NUMERIC data type might require more storage space than FLOAT, the performance of a given query is often controlled by the levels of the (perhaps clustered) index's B-Tree (assuming an index seek can be harnessed by the query, which for your specific need is a safe assumption). A NUMERIC data type with a precision greater than 28 will require 17 bytes of storage per value. The payload of SQL Server's 8KB page is approximately 8000 bytes. Such a NUMERIC data type will thus store approximately 470 values per page. A B-Tree will consume 2^(index_level_pages-1) * 470 rows/page to store the 10,000,000 rows. Dividing both sides by 470 rows/page: 2^(index_level_pages-1) = 10,000,000/470 pages. Simplifying: log(base2)10,000,000/470 = (index_level_pages-1). Solving: ~16 = index_level_pages (albeit this is back of napkin math, I think it close enough). Thus searching for a specific value in a 10,000,000 row table will require ~16*8KB = ~128 KB of reads. If a clustered index is created, the leaf level of a clustered index will contain the other NUMERIC values that are "close" to the one being sought. Since that leaf level page (and the 15 other index pages) are now cached in SQL Server's buffer pool and are "hot", the next search (for values that are "close" to the value being sought) is likely to be constrained by memory access speeds (as opposed to disk access speeds). This is why a clustered index can enhance performance for your desired statement.
If the [value]'s values are not unique (perhaps due to floating point truncation and rounding errors), and if [value] has been defined as the table's clustered index, SQL Server will (under the covers) add a 4-byte "uniqueifier" to each value. A uniqueifier adds overhead (per above math, it is less overhead than might be thought, when an index can be harnessed). That overhead is another (albeit less important) reason to test. If values can instead be stored as NUMERIC and if a use of NUMERIC would more precisely ensure persisted decimal values are indeed unique (just the way they look, in decimal), that 4 byte overhead can be eliminated by also declaring the clustered index as being unique (assuming value uniqueness is a business need). Using similar math, I am certain you will discover the index levels for a FLOAT data type are not all that different from NUMERIC.... An index B-Tree's exponential behavior is "the great leveler" :). Choosing FLOAT because it has smaller storage space than NUMERIC may not be as useful as can initially be thought (even when greatly more storage space for the table, as a whole, is needed).
You should also consider/test whether a Columnstore index would enhance performance and suit your business needs.
This is a common request coming from my clients.
It's better if you transform your float column into two integer columns (one for each part of the floating point number), and put the appropriate index on them for fast searching. For example: 12345.678 will become two columns 12345 and 678.
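A minimal sketch of that idea (table and column names are made up, and the fractional part is assumed to have a fixed 3 digits so that ordering and reconstruction work):
CREATE TABLE mytable_split
(
    id        int NOT NULL PRIMARY KEY,
    int_part  int NOT NULL,   -- 12345 from 12345.678
    frac_part int NOT NULL    -- 678 from 12345.678 (fixed 3 digits assumed)
);
CREATE INDEX ix_split_value ON mytable_split (int_part, frac_part);

-- range search around a given value, e.g. 12345.678 +/- 30, then pick the 5 closest
SELECT TOP 5 id, int_part, frac_part
FROM mytable_split
WHERE int_part BETWEEN 12345 - 30 AND 12345 + 30
ORDER BY ABS((int_part + frac_part / 1000.0) - 12345.678);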

Mysql improve SELECT speed

I'm currently trying to improve the speed of SELECTS for a MySQL table and would appreciate any suggestions on ways to improve it.
We have over 300 million records in the table, and the table has the structure tag, date, value. The primary key is a combined key of tag and date. The table contains information for about 600 unique tags, most containing an average of about 400,000 rows, but this can range from 2,000 to over 11 million rows.
The queries run against the table are:
SELECT date,
value
FROM table
WHERE tag = "a"
AND date BETWEEN 'x' and 'y'
ORDER BY date
....and there are very few if any INSERTS.
I have tried partitioning the data by tag into various numbers of partitions, but this seems to give little increase in speed.
Take the time to read my answer here (it has similar volumes to yours):
500 million rows, 15 million row range scan in 0.02 seconds.
MySQL and NoSQL: Help me to choose the right one
then amend your table engine to innodb as follows:
create table tag_date_value
(
tag_id smallint unsigned not null, -- i prefer ints to chars
tag_date datetime not null, -- can we make this date vs datetime ?
value int unsigned not null default 0, -- or whatever datatype you require
primary key (tag_id, tag_date) -- clustered composite PK
)
engine=innodb;
you might consider the following as the primary key instead:
primary key (tag_id, tag_date, value) -- adding value saves some I/O
but only if value isn't some LARGE varchar type!
query as before:
select
tag_date,
value
from
tag_date_value
where
tag_id = 1 and
tag_date between 'x' and 'y'
order by
tag_date;
hope this helps :)
EDIT
Oh, forgot to mention - don't use ALTER TABLE to change the engine type from MyISAM to InnoDB, but rather dump the data out into CSV files and re-import it into a newly created and empty InnoDB table.
Note I'm ordering the data during the export process - clustered indexes are the KEY!
Export
select * into outfile 'tag_dat_value_001.dat'
fields terminated by '|' optionally enclosed by '"'
lines terminated by '\r\n'
from
tag_date_value
where
tag_id between 1 and 50
order by
tag_id, tag_date;
select * into outfile 'tag_dat_value_002.dat'
fields terminated by '|' optionally enclosed by '"'
lines terminated by '\r\n'
from
tag_date_value
where
tag_id between 51 and 100
order by
tag_id, tag_date;
-- etc...
Import
import back into the table in correct order !
start transaction;
load data infile 'tag_dat_value_001.dat'
into table tag_date_value
fields terminated by '|' optionally enclosed by '"'
lines terminated by '\r\n'
(
tag_id,
tag_date,
value
);
commit;
-- etc...
What is the cardinality of the date field (that is, how many different values appear in that field)? If the date BETWEEN 'x' AND 'y' is more limiting than the tag = 'a' part of the WHERE clause, try making your primary key (date, tag) instead of (tag, date), allowing date to be used as an indexed value.
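A sketch of that primary-key swap (assuming the table really is named table, which needs backquotes in MySQL; adapt the names to your schema):
ALTER TABLE `table`
    DROP PRIMARY KEY,
    ADD PRIMARY KEY (date, tag);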
Also, be careful how you specify 'x' and 'y' in your WHERE clause. There are some circumstances in which MySQL will cast each date field to match the non-date implied type of the values you compare to.
I would do two things - first throw some indexes on there around tag and date as suggested above:
alter table table add index (tag, date);
Next break your query into a main query and sub-select in which you are narrowing your results down when you get into your main query:
SELECT date, value
FROM table
WHERE date BETWEEN 'x' and 'y'
AND tag IN ( SELECT tag FROM table WHERE tag = 'a' )
ORDER BY date
Your query is asking for a few things - and with that high # of rows, the look of the data can change what the best approach is.
SELECT date, value
FROM table
WHERE tag = "a"
AND date BETWEEN 'x' and 'y'
ORDER BY date
There are a few things that can slow down this select query.
A very large result set that has to be sorted (order by).
A very large result set. If tag and date are in the index (and let's assume that's as good as it gets) every result row will have to leave the index to lookup the value field. Think of this like needing the first sentence of each chapter of a book. If you only needed to know the chapter names, easy - you can get it from the table of contents, but since you need the first sentence you have to go to the actual chapter. In certain cases, the optimizer may choose just to flip through the entire book (table scan in query plan lingo) to get those first sentences.
Filtering by the wrong where clause first. If the index is in the order tag, date... then tag should (for a majority of your queries) be the more stringent of the two columns. So basically, unless you have more tags than dates (or maybe than dates in a typical date range), then dates should be the first of the two columns in your index.
A couple of recommendations:
Consider if it's possible to truncate some of that data if it's too old to care about most of the time.
Try playing with your current index - i.e. change the order of the items in it.
Do away with your current index and replace it with a covering index (has all 3 fields in it)
Run some EXPLAIN's and make sure it's using your index at all.
Switch to some other data store (mongo db?) or otherwise ensure this monster table is kept as much in memory as possible.
I'd say your only chance to further improve it is a covering index with all three columns (tag, date, value). That avoids the table access.
I don't think that partitioning can help with that.
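For example (again treating table as the literal table name, so it needs backquotes):
ALTER TABLE `table` ADD INDEX covering_idx (tag, date, value);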
I would guess that adding an index on (tag, date) would help:
alter table table add index (tag, date);
Please post the result of an explain on this query (EXPLAIN SELECT date, value FROM ......)
I think that the value column is at the root of your performance issues. It is not part of the index, so we will have table access. Further, I think that the ORDER BY is unlikely to impact the performance so severely, since date is part of your index and should already be ordered.
I will support my suspicion about the value column with the fact that partitioning does not really reduce the execution time of the query. Could you execute the query without value and give us some results, as well as the EXPLAIN? Do you really need it for each row, and what kind of column is it?
Cheers!
Try inserting just the needed dates into a temporary table and then finishing with a select on the temporary table for the tags and ordering. Note that tag has to be included in the temporary table so the second query can filter on it, and the index needs a name other than the reserved word index.
CREATE TEMPORARY TABLE foo
SELECT tag, date, value
FROM table
WHERE date BETWEEN 'x' and 'y';
ALTER TABLE foo ADD INDEX tag_idx( tag );
SELECT date, value
FROM foo
WHERE tag = "a"
ORDER BY date;
If that doesn't work, try creating foo off the tag selection instead.
CREATE TEMPORARY TABLE foo
SELECT date, value
FROM table
WHERE tag = "a";
ALTER TABLE foo ADD INDEX date_idx( date );
SELECT date, value
FROM foo
WHERE date BETWEEN 'x' and 'y'
ORDER BY date;