Big Table Optimisation - sql

I have a 12million row table, so not enormous, but I want to optimize it for reads as much as possible.
for example currently running
SELECT *
FROM hp.historicalposition
WHERE instrumentid = 1167 AND fundid = 'XXX'
ORDER BY date;
returns 4200 rows and is taking about 4 seconds the first time it is run and 1 second the second time.
What indices might help and and are there any other suggestions?
CREATE TABLE hp.historicalposition
(
date date NOT NULL,
fundid character(3) NOT NULL,
instrumentid integer NOT NULL,
quantityt0 double precision,
quantity double precision,
valuation character varying,
fxid character varying,
localt0 double precision,
localt double precision,
CONSTRAINT attrib_fund_fk FOREIGN KEY (fundid)
REFERENCES funds (fundid) MATCH SIMPLE
ON UPDATE NO ACTION ON DELETE NO ACTION,
CONSTRAINT attrib_instr_fk FOREIGN KEY (instrumentid)
REFERENCES instruments (instrumentid) MATCH SIMPLE
ON UPDATE NO ACTION ON DELETE NO ACTION
)

Here is your query:
SELECT *
FROM hp.historicalposition
WHERE instrumentid = 1167 AND fundid = 'XXX'
ORDER BY date;
The best index is a composite index:
create index idx_historicalposition_instrumentid_fundid_date) on historicalposition(instrumentid, fundid, date);
This satisfies the where clause and can also be used for the order by.

You definitely need `instrumentid, fundid` index:
create index hp.historicalposition_instrumentid_fundid_idx
on hp.historicalposition(instrumentid,fundid);
You can then organize your table data so it's order on the disk physically matches this index:
cluster hp.historicalposition using hp.historicalposition_instrumentid_fundid_idx;

General ideas, not necessarily all applicable to postgresql (in fact, they come from the Oracle world):
Partition by time (e.g. day/week/whatever seems most applicable)
If there is only one way of accessing the data and the table is of write-once type, then using index organised table could help (a.k.a. clustered index). Also tweak the write settings not to leave any space in the pages written to disk.
Consider using compression - to reduce the number of physical reads needed
Have a database job that regularly updates the statistics

Related

best way to store time series data with an identifier in sqlite3

Let's say there are a number of different sensors, all of which save data in a database as they measure it and each sensors can have more entries. I'm looking for the best way to save this data so that select queries could be done fastest possible later. Something like
"CREATE TABLE IF NOT EXISTS DataTable (sensor_id TEXT, measured_value REAL, time_of_measuring REAL)"
could basically work, but I imagine this wouldn't be very fast for selecting. I know about primary keys, but they prevent duplicates, so I can't just put sensor_id as a primary key. I'm basically looking for sqlite equivalent of saving data like this, but in a single table and as one measurement being one row:
data = {"sensor1":[x1,x2,x3], "sensor2":[z1,z2,z3]...}
I imagine something like ˇˇ would work for inserting more than a single value for each sensor, but would that help at all with selecting?
"CREATE TABLE IF NOT EXISTS DataTable (sensor_id TEXT NOT NULL, measured_value REAL, time_of_measuring REAL NOT NULL, PRIMARY KEY(sensor_id, time_of_measuring ))"
For this time-series data, the relevant primary (or unique) key is probably (time_of_measuring, sensor_id). This is close to what you suggested at the end of your question, but the columns are in reverse order.
Technically, this prevent a sensor from loging two measures at the same point in time, which seems like a relevant business rule for your data.
When it comes to the speed of queries: it highly depends on the query themselves. Say that you have query like:
select sensor_id, measured_val, time_of_measuring
from data_table
where
sensor_id = ?
and time_of_measuring >= ?
and time_of_measuring < ?
order by sensor_id, time_of_measuring
This query would take advantage of the primary key index, since the columns are the same as those of the where and order by clauses. You could add the measured_val to the index to make the query even more efficient:
create index data_table_idx1
on data_table(sensor_id, time_of_measuring, measured_val);
As another example, consider this where clause:
where time_of_measuring >= ? and time_of_measuring < ?
No predicate on sensor_id, but time_of_measuring is the first column in the index, so the primary key index can be used.
As typical counter-examples, the following where clauses would not benefit the index:
where sensor_id = ? -- need an index where `sensor_id` is first
where sensor_id = ? and measured_val >= ? -- needs an index on "(sensor_id, measured_val)"

Efficient inserts with duplicate checks for large tables in Postgres

I'm currently working on a project collecting a very large amount of data from a network of wireless modems out in the field. We have a table 'readings' that looks like this:
CREATE TABLE public.readings (
id INTEGER PRIMARY KEY NOT NULL DEFAULT nextval('readings_id_seq'::regclass),
created TIMESTAMP WITHOUT TIME ZONE NOT NULL DEFAULT now(),
timestamp TIMESTAMP WITHOUT TIME ZONE NOT NULL,
modem_serial CHARACTER VARYING(255) NOT NULL,
channel1 INTEGER NOT NULL,
channel2 INTEGER NOT NULL,
signal_strength INTEGER,
battery INTEGER,
excluded BOOLEAN NOT NULL DEFAULT false
);
CREATE UNIQUE INDEX _timestamp_modemserial_uc ON readings USING BTREE (timestamp, modem_serial);
CREATE INDEX ix_readings_timestamp ON readings USING BTREE (timestamp);
CREATE INDEX ix_readings_modem_serial ON readings USING BTREE (modem_serial);
It's important for the integrity of the system that we never have two readings from the same modem with the same timestamp, hence the unique index.
Our challenge at the moment is to find a performant way of inserting readings. We often have to insert millions of rows as we bring in historical data, and when adding to an existing base of 100 million plus readings, this can get kind of slow.
Our current approach is to import batches of 10,000 readings into a temporary_readings table, which is essentially an unindexed copy of readings. We then run the following SQL to merge it into the main table and remove duplicates:
INSERT INTO readings (created, timestamp, modem_serial, channel1, channel2, signal_strength, battery)
SELECT DISTINCT ON (timestamp, modem_serial) created, timestamp, modem_serial, channel1, channel2, signal_strength, battery
FROM temporary_readings
WHERE NOT EXISTS(
SELECT * FROM readings
WHERE timestamp=temporary_readings.timestamp
AND modem_serial=temporary_readings.modem_serial
)
ORDER BY timestamp, modem_serial ASC;
This works well, but takes ~20 seconds per 10,000 row block to insert. My question is twofold:
Is this the best way to approach the problem? I'm relatively new to projects with these sorts of performance demands, so I'm curious to know if there are better solutions.
What steps can I take to speed up the insert process?
Thanks in advance!
Your query idea is okay. I would try timing it for 100,000 rows in the batch, to start to get an idea of an optimal batch size.
However, the distinct on is slowing things down. Here are two ideas.
The first is to assume that duplicates in batches are quite rare. If this is true, try inserting the data without the distinct on. If that fails, then run the code again with the distinct on. This complicates the insertion logic, but it might make the average insertion much shorter.
The second is to build an index on temporary_readings(timestamp, modem_serial) (not a unique index). Postgres will take advantage of this index for the insertion logic -- and sometimes building an index and using it is faster than alternative execution plans. If this does work, you might try larger batch sizes.
There is a third solution which is to use on conflict. That would allow the insertion itself to ignore duplicate values. This is only available in Postgres 9.5, though.
Adding to a table that already contains 100 million indexed records will be slow no matter what! You can probably speed things up somewhat by taking a fresh look at your indexes.
CREATE UNIQUE INDEX _timestamp_modemserial_uc ON readings USING BTREE (timestamp, modem_serial);
CREATE INDEX ix_readings_timestamp ON readings USING BTREE (timestamp);
CREATE INDEX ix_readings_modem_serial ON readings USING BTREE (modem_serial);
At the moment you have three indexes but they are on the same combination of columns. Can't you manage with just the unique index?
I don't know what your other queries are like but your WHERE NOT EXISTS query can make use of this unique index.
If you have queries with the WHERE clause only filtering on the modem_serial field, your unique index is unlikely to be used. However if you flip the columns in that index it will be!
CREATE UNIQUE INDEX _timestamp_modemserial_uc ON readings USING BTREE (timestamp, modem_serial);
To quote from the manual:
A multicolumn B-tree index can be used with query conditions that
involve any subset of the index's columns, but the index is most
efficient when there are constraints on the leading (leftmost)
columns.
The order of the columns in the index matters.

Implement a ring buffer

We have a table logging data. It is logging at say 15K rows per second.
Question: How would we limit the table size to the 1bn newest rows?
i.e. once 1bn rows is reached, it becomes a ring buffer, deleting the oldest row when adding the newest.
Triggers might load the system too much. Here's a trigger example on SO.
We are already using a bunch of tweaks to keep the speed up (such as stored procedures, Table Parameters etc).
Edit (8 years on) :
My recent question/answer here addresses a similar issue using a time series database.
Unless there is something magic about 1 billion, I think you should consider other approaches.
The first that comes to mind is partitioning the data. Say, put one hour's worth of data into each partition. This will result in about 15,000*60*60 = 54 million records in a partition. About every 20 hours, you can remove a partition.
One big advantage of partitioning is that the insert performance should work well and you don't have to delete individual records. There can be additional overheads depending on the query load, indexes, and other factors. But, with no additional indexes and a query load that is primarily inserts, it should solve your problem better than trying to delete 15,000 records each second along with the inserts.
I don't have a complete answer but hopefully some ideas to help you get started.
I would add some sort of numeric column to the table. This value would increment by 1 until it reached the number of rows you wanted to keep. At that point the procedure would switch to update statements, overwriting the previous row instead of inserting new ones. You obviously won't be able to use this column to determine the order of the rows, so if you don't already I would also add a timestamp column so you can order them chronologically later.
In order to coordinate the counter value across transactions you could use a sequence, then perform a modulo division to get the counter value.
In order to handle any gaps in the table (e.g. someone deleted some of the rows) you may want to use a merge statement. This should perform an insert if the row is missing or an update if it exists.
Hope this helps.
Here's my suggestion:
Pre-populate the table with 1,000,000,000 rows, including a row number as the primary key.
Instead of inserting new rows, have the logger keep a counter variable that increments each time, and update the appropriate row according to the row number.
This is actually what you would do with a ring buffer in other contexts. You wouldn't keep allocating memory and deleting; you'd just overwrite the same array over and over.
Update: the update doesn't actually change the data in place, as I thought it did. So this may not be efficient.
Just an idea that is to complicated to write in a comment.
Create a few log tables, 3 as an example, Log1, Log2, Log3
CREATE TABLE Log1 (
Id int NOT NULL
CHECK (Id BETWEEN 0 AND 9)
,Message varchar(10) NOT NULL
,CONSTRAINT [PK_Log1] PRIMARY KEY CLUSTERED ([Id] ASC) ON [PRIMARY]
)
CREATE TABLE Log2 (
Id int NOT NULL
CHECK (Id BETWEEN 10 AND 19)
,Message varchar(10) NOT NULL
,CONSTRAINT [PK_Log2] PRIMARY KEY CLUSTERED ([Id] ASC) ON [PRIMARY]
)
CREATE TABLE Log3 (
Id int NOT NULL
CHECK (Id BETWEEN 20 AND 29)
,Message varchar(10) NOT NULL
,CONSTRAINT [PK_Log3] PRIMARY KEY CLUSTERED ([Id] ASC) ON [PRIMARY]
)
Then create a partitioned view
CREATE VIEW LogView AS (
SELECT * FROM Log1
UNION ALL
SELECT * FROM Log2
UNION ALL
SELECT * FROM Log3
)
If you are on SQL2012 you can use a sequence
CREATE SEQUENCE LogSequence AS int
START WITH 0
INCREMENT BY 1
MINVALUE 0
MAXVALUE 29
CYCLE
;
And then start to insert values
INSERT INTO LogView (Id, Message)
SELECT NEXT VALUE FOR LogSequence
,'SomeMessage'
Now you just have to truncate the logtables on some kind of schedule
If you don't have sql2012 you need to create the sequence some other way
I'm looking for something similar myself (using a table as a circular buffer) but it seems like a simpler approach (for me) will be just to periodically delete old entries (e.g. the lowest IDs or lowest create/lastmodified datetimes or entries over a certain age). It's not a circular buffer but perhaps it is a close enough approximation for some. ;)

SQL get last rows in table WITHOUT primary ID

I have a table with 800,000 entries without a primary key. I am not allowed to add a primary key and I cant sort by TOP 1 ....ORDER BY DESC because it takes hours to complete this task. So I tried this work around:
DECLARE #ROWCOUNT int, #OFFSET int
SELECT #ROWCOUNT = (SELECT COUNT(field) FROM TABLE)
SET #OFFSET = #ROWCOUNT-1
select TOP 1 FROM TABLE WHERE=?????NO PRIMARY KEY??? BETWEEN #Offset AND #ROWCOUNT
Of course this doesn't work.
Anyway to do use this code/or better code to retrieve the last row in table?
If your table has no primary key or your primary key is not orderly... you can try the code below... if you want see more last record, you can change the number in code
Select top (select COUNT(*) from table) * From table
EXCEPT
Select top ((select COUNT(*) from table)-(1)) * From table
I assume that when you are saying 'last rows', you mean 'last created rows'.
Even if you had primary key, it would still be not the best option to use it do determine rows creation order.
There is no guarantee that that the row with the bigger primary key value was created after the row with a smaller primary key value.
Even if primary key is on identity column, you can still always override identity values on insert by using
set identity_insert on.
It is a better idea to have timestamp column, for example CreatedDateTime with a default constraint.
You would have index on this field.Then your query would be simple, efficient and correct:
select top 1 *
from MyTable
order by CreatedDateTime desc
If you don't have timestamp column, you can't determine 'last rows'.
If you need to select 1 column from a table of 800,000 rows where that column is the min or max possible value, and that column is not indexed, then the unassailable fact is that SQL will have to read every row in the table in order to identify that min or max value.
(An aside, on the face of it reading all the rows of an 800,000 row table shouldn't take all that long. How wide is the column? How often is the query run? Are there concurrency, locking, blocking, or deadlocking issues? These may be pain points that could be addressed. End of aside.)
There are any number of work-arounds (indexes, views, indexed views, peridocially indexed copies of the talbe, run once store result use for T period of time before refreshing, etc.), but virtually all of them require making permanent modifications to the database. It sounds like you are not permitted to do this, and I don't think there's much you can do here without some such permanent change--and call it improvement, when you discuss it with your project manager--to the database.
You need to add an Index, can you?
Even if you don't have a primary key an Index will speed up considerably the query.
You say you don't have a primary key, but for your question I assume you have some type of timestamp or something similar on the table, if you create an Index using this column you will be able to execute a query like :
SELECT *
FROM table_name
WHERE timestamp_column_name=(
SELECT max(timestamp_column_name)
FROM table_name
)
If you're not allowed to edit this table, have you considered creating a view, or replicating the data in the table and moving it into one that has a primary key?
Sounds hacky, but then, your 800k row table doesn't have a primary key, so hacky seems to be the order of the day. :)
I believe you could write it simply as
SELECT * FROM table ORDER BY rowid DESC LIMIT 1;
Hope it helps.

SELECT query slow when flag column is a constraint

I have a fairly simple table called widgets. Each row holds an id, a description, and an is_visible flag:
CREATE TABLE `widgets` (
`id` int auto_increment primary key,
`description` varchar(255),
`is_visible` tinyint(1) default 1
);
I'd like to issue a query that selects the descriptions of a subset of visible widgets. The following simple query does the trick (where n and m are integers):
SELECT `description`
FROM `widgets`
WHERE (`is_visible`)
ORDER BY `id` DESC
LIMIT n, m;
Unfortunately this query, as written, has to scan at least n+m rows. Is there a way to make this query scan fewer rows, either by reworking the query or modifying the schema?
Use indexes for faster query result:
ALTER TABLE `widgets` ADD INDEX ( `is_visible` )
Is there a way to make this query scan fewer rows?
No, not really. Given that it's a binary flag, you wouldn't get much benefit from creating an index on that field.
I will elaborate, given the downvote.
You have to take into consideration the cardinality (# of unique values) of an index. From the MySQL Manual:
The higher the cardinality, the greater the chance that MySQL uses the index when doing joins.
On that field would be 2. It doesn't get much lower than that.
See also: Why does MySQL not use an index on a int field that's being used as a boolean?
Indexing boolean fields