I am working with a table with a "state" column, which typically holds only 2 or 3 different values. Sometimes, when this table holds several million rows, the following SQL statement becomes slow (I assume a full table scan is done):
SELECT state, count(*) FROM mytable GROUP BY state
I expect to get something like this:
disabled | 500000
enabled | 2000000
(basically I want to know how many items are "enabled" and how many items are "disabled" - actually that's a number instead of a text in my real application)
I guess adding an index for my state column is pretty useless, since only very few different values can be found there. What other options do I have?
There is also a "timestamp" column (with an index). Ideally the solution should also work well if I add:
WHERE timestamp BETWEEN x AND y
Right now I'm using an SQLite3 database, but it looks like other database engines are not too different, so solutions for other DB engines might be interesting as well.
Thank you!
I would put a covering index on timestamp,state (in that order). The rationale is:
the condition on the timestamp will be much more selective than the state
if the state is still in the index (i.e. a covering index), the engine only has to perform a range scan on the index itself (without having to pay for random I/Os to access the main data of the table).
Note: if the timestamp range is too wide, it will become slow despite the index. Because random I/Os are more expensive than sequential I/Os, there is a point where the index range scan becomes more expensive than the table scan. As a rule of thumb, if you need to scan more than 10% of the table, the engine should prefer the table scan and ignore the index. I'm not sure SQLite is smart enough to perform this kind of optimization, though.
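In SQLite that would look something like the sketch below (the index name is a placeholder; table and column names come from the question):
-- covering index: timestamp first for the range filter, state included so the
-- count can be computed from the index without touching the table rows
CREATE INDEX mytable_ts_state_idx ON mytable (timestamp, state);

SELECT state, count(*)
FROM mytable
WHERE timestamp BETWEEN x AND y
GROUP BY state;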
Say I have a database that holds information about books and their dates of publishing. (two attributes, bookName and publicationDate).
Say that the attribute publicationDate has a Hash Index.
If I wanted to display every book that was published in 2010 I would enter this query : select bookName from Books where publicationDate=2010.
In my lecture, it is explained that if there is a large volume of data and the publication dates are very diverse, the more efficient approach is to use the hash index to retrieve only the books published in 2010.
However, if the vast majority of the books in the database were published in 2010, it is better, performance-wise, to scan the table sequentially.
I really don't understand why. In which situations is using an index faster, and why?
It is surprising that you are learning about hash indexes without understanding this concept. Hash indexing is a pretty advanced database concept; most databases don't even support them.
The example is quite misleading, though: 2010 is not a DATE; it is a YEAR. This matters because a hash index only works for equality comparisons. So the natural way to get a year of data from dates:
where publicationDate >= date '2010-01-01' and
publicationDate < date '2011-01-01'
could not use a hash index because the comparisons are not equality comparisons.
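For illustration, here is a sketch in PostgreSQL syntax, one of the few engines that actually supports hash indexes (the index names are placeholders):
-- hash index: usable for equality comparisons only
CREATE INDEX books_pubdate_hash ON Books USING HASH (publicationDate);
-- default b-tree index: handles equality, ranges and prefixes
CREATE INDEX books_pubdate_btree ON Books (publicationDate);
Only the b-tree index can serve the range predicate above; the hash index would only help a query such as WHERE publicationDate = DATE '2010-06-01'.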
Indexes can be used for several purposes:
To quickly determine which rows match filtering conditions so fewer data pages need to be read.
To identify rows with common key values for aggregations.
To match rows between tables for joins.
To support unique constraints (via unique indexes).
And for b-tree indexes, to support order by.
Your query is about the first purpose: reducing the number of data pages that need to be read. Reading a data page is non-trivial work, because it needs to be fetched from disk. A sequential scan reads all data pages, regardless of whether or not they are needed.
If only one row matches the index conditions, then only one page needs to be read. That is a big win on performance. However, if every page has a row that matches the condition, then you are reading all the pages anyway. The index seems less useful.
And using an index is not free. The index itself needs to be loaded into memory. The keys need to be hashed and processed during the lookup operation. All of this overhead is unnecessary if you just scan the pages (although there is other overhead for the key comparisons for filtering).
Using an index has a performance cost. If the percentage of matches is a small fraction of the whole table, this cost is more than made up for by not having to scan the whole table. But if there's a large percentage of matches, it's faster to simply read the table.
There is the cost of reading the index. A small, frequently used index might be in memory, but a large or infrequently used one might be on disk. That means slow disk access to search the index and get the matching row numbers. If the query matches a small number of rows this overhead is a win over searching the whole table. If the query matches a large number of rows, this overhead is a waste; you're going to have to read the whole table anyway.
Then there is an IO cost. With disks it's much, much faster to read and write sequentially than randomly. We're talking 10 to 100 times faster.
A spinning disk has a physical part, the head, that must move around to read different parts of the disk. The time it takes to move is known as "seek time". When you skip around between rows in a table, possibly out of order, that is random access, and it incurs seek time. In contrast, reading the whole table is likely to be one long continuous read; the head does not have to jump around, so there is no seek time.
SSDs are much, much faster since there are no physical parts to move, but they are still much faster for sequential access than for random access.
In addition, random access has more overhead between the operating system and the disk; it requires more instructions.
So if the database decides a query is going to match most of the rows of a table, it can decide that it's faster to read them sequentially and weed out the non-matches than to look up rows via the index using slower random access.
Consider a bank of post office boxes, each numbered in a big grid. It's pretty fast to look up each box by number, but it's much faster to start at a box and open them in sequence. And we have an index of who owns which box and where they live.
You need to get the mail for South Northport. You look up in the index which boxes belong to someone from South Northport, see there's only a few of them, and grab the mail individually. That's an indexed query and random access. It's fast because there's only a few mailboxes to check.
Now I ask you to get the mail for everyone but South Northport. You could use the index in reverse: get the list of boxes for South Northport, subtract those from the list of every box, and then individually get the mail for each box. But this would be slow, random access. Instead, since you're going to have to open nearly every box anyway, it is faster to check every box in sequence and see if it's mail for South Northport.
More formally, the indexed vs table scan performance is something like this.
# Indexed query
C[index] + (C[random] * M)
# Full table scan
(C[sequential] + C[match]) * N
Where C are various constant costs (or near enough constant), M is the number of matching rows, and N is the number of rows in the table.
We know C[sequential] is 10 to 100 times faster than C[random]. Because disk access is so much slower than CPU or memory operations, C[match] (the cost of checking if a row matches) will be relatively small compared to C[sequential]. More formally...
C[random] >> C[sequential] >> C[match]
Using that, we can approximate C[sequential] + C[match] as simply C[sequential].
# Indexed query
C[index] + (C[random] * M)
# Full table scan
C[sequential] * N
When M << N the indexed query wins. As M approaches N, the full table scan wins.
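To make that concrete, here is a rough worked example, assuming C[random] is 100x C[sequential] and ignoring C[index]:
# N = 1,000,000 rows, C[sequential] = 1 unit, C[random] = 100 units
# Indexed query:   100 * M
# Full table scan: 1,000,000
The break-even point is around M = 10,000, i.e. about 1% of the table; match many more rows than that and the sequential scan is already cheaper.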
Note that the cost of using the index isn't really constant. C[index] is things like loading the index, looking up a key, and reading the row IDs. This can be quite variable depending on the size of the index, type of index, and whether it is on disk (cold) or in memory (hot). This is why the first few queries are often rather slow when you've first started a database server.
In the real world it's more complicated than that. In reality rows are broken up into data pages and databases have many tricks to optimize queries and disk access. But, generally, if you're matching most of the rows a full table scan will beat an indexed lookup.
Hash indexes are of limited use these days. A hash index is a simple key/value structure and can only be used for equality checks. Most databases use a B-Tree as their standard index. B-Trees are a little more costly, but can handle a broader range of operations, including equality, ranges, comparisons, and prefix searches such as LIKE 'foo%'.
The Postgres Index Types documentation is a pretty good high-level run-down of the advantages and disadvantages of the various index types.
While tracking index usage and analyzing the tables we add indexes to, we ran into some strange situations:
Some of our tables have an index, but when I execute a query with a WHERE clause on the indexed field, the corresponding idx_scan counter does not increase. The relname and schemaname match, so I can't be looking at the wrong row.
Testing further, I dropped and recreated the table; after that, the query was counted in idx_scan again.
This happened with other tables too: we executed queries that should use indexes, but only seq_scan increased, not idx_scan. Even if I add another indexed column to the same table, queries on the new column do not increment idx_scan either.
What is wrong with these tables? What are we doing wrong? Only newly created tables with indexes show activity in idx_scan; the old tables behave incorrectly.
We have migrated this database a few times; could that be the problem? It happens both on localhost and on the production server.
Another thing we saw: some indexes had been counted before (idx_scan > 0), but when we executed a SELECT that should use them, idx_scan did not increase again; the number stayed fixed and only seq_scan went up.
I believe these problems may be related.
I'd appreciate some help; this is a big mystery in our DB and we have no idea what the problem could be.
A couple of suggestions (and some things to add to your question).
The first is that index scans are not always favored over sequential scans. For example, if your table is small or the planner estimates that most pages will need to be fetched, an index scan will be skipped in favor of a sequential scan.
Remember: no plan beats retrieving a single page off disk and sequentially running through it.
Similarly, if you have to retrieve, say, 50% of the pages of a relation, doing an index scan is going to trade somewhat less total disk I/O for a great deal more random disk I/O. It might be a win if you use SSDs, but certainly not with conventional hard drives. After all, you don't really want to be waiting for platters to turn. If you are using SSDs, you can tweak planner settings accordingly.
So index vs sequential scan is not the end of the story. The question is how many rows are retrieved, how big the tables are, what percentage of disk pages are retrieved, etc.
If it really is picking a bad plan (rather than a good plan that you didn't consider!) then the question becomes why. There are ways of setting statistics targets but these may not be really helpful.
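If you do want to raise a statistics target, it is set per column; a sketch (t and c are placeholder names):
ALTER TABLE t ALTER COLUMN c SET STATISTICS 1000;
ANALYZE t;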
Finally the planner really can't choose an index in some cases where you might like it to. For example, suppose I have a 10 million row table with records spanning 5 years (approx 2 million rows per year on average). I would like to get the distinct years. I can't do this with a standard query and index, but I can build a WITH RECURSIVE CTE to essentially execute the same query once for each year and that will use an index. Of course you had better have an index in that case or WITH RECURSIVE will do a sequential scan for each year which is certainly not what you want!
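For reference, here is a sketch of that recursive pattern, assuming a table events with an indexed integer column pub_year (both names are made up):
WITH RECURSIVE distinct_years AS (
    (SELECT pub_year FROM events ORDER BY pub_year LIMIT 1)
    UNION ALL
    SELECT (SELECT e.pub_year
            FROM events e
            WHERE e.pub_year > d.pub_year
            ORDER BY e.pub_year
            LIMIT 1)
    FROM distinct_years d
    WHERE d.pub_year IS NOT NULL
)
SELECT pub_year FROM distinct_years WHERE pub_year IS NOT NULL;
Each recursive step is a single index probe for the next larger value, instead of a sequential scan over roughly 2 million rows per year.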
tl;dr: It's complicated. Make sure this really is a bad plan before jumping to conclusions, and then, if it is a bad plan, see what you can do about it depending on your configuration.
I have tables in PostgreSQL, each with millions of records and more than one hundred fields.
One of them is a date field, which we filter by in our queries. The creation of an index on this date field improved the performance of queries that read a small range of dates, but for large date ranges the performance decreased...
Must I prioritize one over the other? Can the performance for small ranges be improved without slowing down the large-range queries?
Queries in PostgreSQL cannot be answered just using the information in an index. Whether or not the row is visible, from the perspective of the query that is executing, is stored in the main row itself. So when you add an index to something, and execute a query that uses it, there are two steps involved:
Navigate the index to determine which data blocks are used
Retrieve those blocks and return the rows that match the query
It is therefore possible that answering a query with an index can take longer than just going directly to the data blocks and fetching the rows. The most common case where this happens is if you are actually grabbing a large portion of the data. Typically, if more than 20% of the table is used, it's considered faster to just access it sequentially. Sometimes the planner thinks less than 20% will be accessed, so the index is preferred, but that estimate turns out to be wrong; that's one way adding an index can slow a query. This may be the situation you're seeing, based on your description: if the large ranges are touching more of the table than the optimizer estimates, using an index can be a net slowdown.
To figure this out, the database collects statistics about each column in each table, to determine whether a particular WHERE condition is selective enough to use an index. The idea is that you need to have saved so many blocks by not reading the whole table that adding the index I/O on top of it is still a net win.
This computation can go wrong, such that you end up doing more I/O than if you had just read the table directly, in a couple of cases. The causes of most of them show up if you run the query using EXPLAIN ANALYZE. If the "expected" row counts are very different from the "actual" numbers, this suggests the optimizer has bad statistics for the table. Another possibility is that the optimizer simply made a mistake about how selective the query is: it thought it would only return a small number of rows, but it actually returns most of the table. Here, again, better statistics is the normal way to start working on that. If you're on PostgreSQL 8.3 or earlier, the amount of statistics collected is very low by default.
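For example, something like this (the query and names are made up):
EXPLAIN ANALYZE
SELECT * FROM mytable WHERE date_col BETWEEN '2011-01-01' AND '2011-06-30';
-- compare "rows=..." in the plan (the planner's estimate) against the
-- "actual ... rows=..." figures; a large mismatch usually points at stale or
-- insufficient statistics, which ANALYZE or a higher statistics target can help with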
Some workloads end up adjusting the random_page_cost tunable as well, which controls where this index vs. table scan trade-off happens. That's only something to consider after the stats information is checked, though. See Tuning Your PostgreSQL Server for an intro to several things you can adjust here.
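If you do get to the point of experimenting with it, random_page_cost defaults to 4.0 and can be lowered per session for testing (values closer to 1.0 suit SSDs or mostly-cached data):
SET random_page_cost = 1.5;
-- re-run EXPLAIN ANALYZE and compare plans before changing anything globally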
I'd try several things:
increase DB cache parameters
add the index on that date field
redesign/modify the application to work with smaller ranges (although this suggestion might seem obvious, it is usually the first to be thrown away)
The creation of an index on this date field improved the performance of queries that read a small range of dates, but for large date ranges the performance decreased...
Try clustering your table using that index. The performance decrease might be due to the entire table being read for large ranges. If so, clustering the table along that index would lead to fewer disk seeks.
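In PostgreSQL that is a one-off command; the table is rewritten (and locked) while it runs, and the ordering is not maintained for rows inserted later, so it needs to be repeated periodically. A sketch with placeholder names:
CLUSTER mytable USING mytable_date_idx;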
Two suggestions:
1) Investigate the use of table inheritance for time-series data. For example, create a child table per month and then INDEX the date on each table (a sketch of this setup follows below). PostgreSQL is smart enough to only perform index scans on the child tables that actually have data in the requested date range. Once a child table is "sealed" because a new month has started, run CLUSTER on it to sort the data by date.
2) Look at creating a bunch of indexes that use WHERE clauses (partial indexes).
Suggestion #1 is going to be the winner long term but will take some work to set up (but it will scale/run forever), while suggestion #2 may be a quick interim fix if you have a limited date range that you care about scanning. Remember, you can only use IMMUTABLE functions in an index's WHERE clause.
CREATE INDEX tbl_date_2011_05_idx ON tbl(date) WHERE date >= '2011-05-01' AND date <= '2011-06-01';
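And a sketch of the inheritance setup from suggestion #1 (table and column names are placeholders):
CREATE TABLE readings_2011_05 (
    CHECK (date >= '2011-05-01' AND date < '2011-06-01')
) INHERITS (readings);
CREATE INDEX readings_2011_05_date_idx ON readings_2011_05 (date);
With constraint exclusion enabled, queries over a date range only touch the child tables whose CHECK constraints can possibly match.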
I have a device I'm polling for lots of different fields every x milliseconds.
The device returns a list of ids and values, which I need to store with a timestamp in a DB of sorts.
Users of the system need to be able to query this DB for historic logs to create graphs, or query the last timestamp for each value.
A simple approach would be to define a MySQL table with
id,value_id,timestamp,value
and let users select
Select value from t where value_id=x order by timestamp desc limit 1
and just push everything there with an index on timestamp and id. But my question is: what's the best approach, performance- and size-wise, for designing the schema? Or should I use NoSQL? Can anyone comment on possible design trade-offs? Will such a design scale to millions of records?
When you say "... or query the last timestamp for each value" is this what you had in mind?
select max(timestamp) from T where value = ?
If you have millions of records, and the above is what you meant (i.e. value is alone in the WHERE clause), then you'd need an index on the value column, otherwise you'd have to do a full table scan. But if queries will ALWAYS have the [timestamp] column in the WHERE clause, you do not need an index on the [value] column as long as there's an index on timestamp.
You need an index on the timestamp column if your users will issue queries where the timestamp column appears alone in the WHERE clause:
select * from T where timestamp > x and timestamp < y
You could index all three columns, but you want to make sure the writes do not slow down because of the indexing overhead.
The rule of thumb when you have a very large database is that every query should be able to make use of an index, so you can avoid a full table scan.
EDIT:
Adding some additional remarks after your clarification.
I am wondering how you will know the id? Is [id] perhaps a product code?
A single simple index on id might not scale very well if there are not many different product codes, i.e. if it's a low-cardinality index. The rebalancing of the trees could slow down the batch inserts that are happening every x milliseconds. A composite index on (id,timestamp) would be better than a simple index.
If you rarely need to sort multiple products but are most often selecting based on a single product code, then a non-traditional DBMS that uses a hashed-key sparse table rather than a b-tree might be a very viable, even superior, alternative for you. In such a database, all of the records for a given key would be found physically on the same set of contiguous "pages"; the hashing algorithm looks at the key and returns the page number where the record will be found. There is no need to rebalance an index, as there isn't an index, and so you completely avoid the related scaling worries.
However, while hashed-file databases excel at low-overhead nearly instant retrieval based on a key value, they tend to be poor performers at sorting large groups of records on an attribute, because the data are not stored physically in any meaningful order, and gathering the records can involve much thrashing. In your case, timestamp would be that attribute. If I were in your shoes, I would base my decision on the cardinality of the id: in a dataset of a million records, how many DISTINCT ids would be found?
YET ANOTHER EDIT SINCE THE SITE IS NOT LETTING ME ADD ANOTHER ANSWER:
The simplest way is to have two tables: one with the ongoing history, which always has new values inserted, and the other, containing only 250 records, one per part, where the latest value overwrites/replaces the previous one.
Update latest
set value = x
where id = ?
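A minimal sketch of the two-table layout in MySQL (all names and types are assumptions; 250 is the number of parts mentioned above):
CREATE TABLE history (
    value_id  INT       NOT NULL,
    ts        DATETIME  NOT NULL,
    value     DOUBLE    NOT NULL,
    KEY (value_id, ts)
);

CREATE TABLE latest (
    value_id  INT PRIMARY KEY,   -- one row per part, ~250 rows total
    ts        DATETIME NOT NULL,
    value     DOUBLE   NOT NULL
);

-- on every poll: append to history, overwrite in latest
INSERT INTO history (value_id, ts, value) VALUES (42, NOW(), 3.14);
INSERT INTO latest  (value_id, ts, value) VALUES (42, NOW(), 3.14)
    ON DUPLICATE KEY UPDATE ts = VALUES(ts), value = VALUES(value);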
You have a choice of
indexes (composite, covering value_id, timestamp and value, or some combination of them): you should test performance with different indexes, composite and non-composite; also be aware that there are quite a few significantly different ways to get 'max per group' (search SO, especially the MySQL version with variables)
triggers: you might use triggers to maintain the max/latest row values in another table (best performance for later selects; the extra table is redundant and could be kept in memory) - see the sketch after this list
lazy statistics/triggers: since your database is updated quite often, you can save cycles by updating your statistics only periodically (if you can allow the stats to be y seconds old and you poll 1000 / x times a second, then you potentially save about y * 1000 / x updates; this can be noticeable, especially in terms of scalability)
The above is true if you are looking for the last bit of performance; if not, keep it simple.
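A sketch of the trigger option from the list above, reusing the placeholder history/latest tables from the previous answer's sketch (MySQL syntax):
CREATE TRIGGER keep_latest AFTER INSERT ON history
FOR EACH ROW
    REPLACE INTO latest (value_id, ts, value)
    VALUES (NEW.value_id, NEW.ts, NEW.value);
With the trigger in place, the application only ever inserts into history, and the latest table is kept current automatically.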
Using SQLite, I've got a table with ~10 columns. There are ~25 million rows.
That table has an INDEX on 'sid, uid, area, type'.
I run a select like so:
SELECT sid from actions where uid=1234 and area=1 and type=2
That returns me 1571 results, and takes 4 minutes to complete.
Is that sane?
I'm far from an SQL expert, so hopefully someone can fill me in on what I'm missing. Why could this possibly take 4+ minutes with everything indexed?
Any recommended resources to learn about achieving high SQL performance? I feel like a lot of the Google results just give me opinions or anecdotes, I wouldn't mind a solid book.
Create a uid+area+type index instead, or uid+area+type+sid.
Since the index starts with the sid column, it must do a scan (start at the beginning, read to the end) of either the index or the table to find your data matching the other 3 columns. This means it has to read all 25 million rows to find the answer. Even if it's reading just the rows of the index rather than the table, that's a lot of work.
Imagine a phone book of the greater New York metropolitan area, organized by (with an 'index' on) Last Name, First Name.
You submit SELECT [Last Name] FROM NewYorkPhoneBook WHERE [First Name] = 'Thelma'
It has to read all 25 million entries to find all those Thelmas. Unless you either specify the last name and can then turn directly to the page where that last name first appears (a seek), or have an index organized by First Name (a seek on the index followed by a seek on the table, aka a "bookmark lookup"), there's no way around it.
The index you would create to make your query faster is on uid, area, type. You could include sid, though leave it out if sid is part of the primary key.
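A sketch of that index in SQLite syntax (the index name is a placeholder; table and columns come from the question):
CREATE INDEX actions_uid_area_type_idx ON actions (uid, area, type);
-- adding sid as a fourth column makes it a covering index, so the query
-- can be answered from the index alone:
-- CREATE INDEX actions_uid_area_type_sid_idx ON actions (uid, area, type, sid);
After that, SELECT sid FROM actions WHERE uid=1234 AND area=1 AND type=2 becomes an index range scan over the 1571 matching entries instead of a pass over all 25 million rows.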
Note: Tables often do have multiple indexes. Just note that the more indexes, the slower the write performance. Unnecessary indexes can slow overall performance, sometimes radically so. Testing and eventually experience will help guide you in this. Also, reasoning it out as a real-world problem (like my phone book examples) can really help. If it wouldn't make sense with phone books (and separate phone book indexes) then it probably won't make sense in the database.
One more thing: even if you put an index on those columns, if your query is going to end up pulling a great percentage of the rows in the main table, it will still be cheaper to scan the table rather than do the bookmark lookup (seek the index then seek the table for each row found). The exact "tipping point" of whether to do a bookmark lookup with a seek, or to do a table scan isn't something I can tell you off the top of my head, but it is based on solid math.
The index is not really useful, as it starts with the wrong field... which means a table scan.
Looks like you have a normal computer there, not something made for databases. I run table scans over 650 million rows in about a minute on my lower-end db server, but that means reading about a gigabyte per second from the discs, which are a RAID 10 of 10k RPM discs. Just to say, basically, that databases love IO, to a degree you have never seen before. Basically, larger db servers have many discs to satisfy the IOPS (IO per second) requirement. I have seen a server with 190 discs.
So, you have two choices: beef up your IOPS capability (which means spending money), or set up indices that get used because they are "proper".
Proper means: an index is only useful if the fields it contains are used from left to right. Not necessarily in the same order... but if a field is skipped, there is a chance the SQL engine decides it is not worth pursuing the index and instead does a table scan (as in your case).
When you create your new index on uid, area and type, you should also do a SELECT DISTINCT on each one to determine which has the fewest distinct entries, then order the columns in the index definition so that the ones with fewer distinct values appear earlier.
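A quick way to check those cardinalities, using the table and columns from the question:
SELECT COUNT(DISTINCT uid)  AS distinct_uids,
       COUNT(DISTINCT area) AS distinct_areas,
       COUNT(DISTINCT type) AS distinct_types
FROM actions;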