PostgreSQL with TimescaleDB only uses a single core during index creation

we have a PostgreSQL hypertable with a few billion rows and we're trying to create a unique index on top of it like so:
CREATE UNIQUE INDEX device_data__device_id__value_type__timestamp__idx ON public.device_data(device_id, value_type, "timestamp" DESC);
We created the hypertable like this:
SELECT create_hypertable('device_data', 'timestamp');
Since we want to create the index as fast as possible, we'd like to parallelize the index creation, and followed this guide.
We tested various settings for work_mem, maintenance_work_mem, max_worker_processes, max_parallel_maintenance_workers, and max_parallel_workers. We also set the parallel_workers setting on our table: ALTER TABLE device_data SET (parallel_workers = 10);. But no matter what we do, the index creation always only uses a single core (we have 16 available), and therefore the creation takes very long.
Any idea what we might be missing here?
Our PostgreSQL version is 12.5 and the server runs Ubuntu 18.

Unfortunately, Timescale doesn't currently support parallel index creation. I would recommend filing a GitHub issue asking for it, though it is a bit of a heavy lift and might not get prioritized quickly. Another option that could be useful would be to take the transaction_per_chunk option of https://docs.timescale.com/latest/api#create_index and allow the user to control how the indexes are created: a simple API that creates the index for all future chunks, but not on older chunks, and then lets you call create_index(chunk_name, ht_index_name) on each existing chunk. You could then parallelize that operation in your own code. This ends up being a much simpler lift, because the transactionality of parallel index creation is the hardest part.
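For reference, a sketch of the per-chunk approach described above, assuming a TimescaleDB version that supports the transaction_per_chunk option from the linked docs (the chunk table name below is illustrative; actual names come from show_chunks):

```sql
-- Build the index one chunk at a time rather than in a single
-- large transaction (transaction_per_chunk option):
CREATE UNIQUE INDEX device_data__device_id__value_type__timestamp__idx
    ON public.device_data (device_id, value_type, "timestamp" DESC)
    WITH (timescaledb.transaction_per_chunk);

-- To parallelize manually instead, list the chunks...
SELECT show_chunks('device_data');
-- ...and index each chunk table from separate sessions
-- (chunk name is illustrative):
CREATE INDEX ON _timescaledb_internal._hyper_1_1_chunk
    (device_id, value_type, "timestamp" DESC);
```

Each per-chunk CREATE INDEX runs in its own session, so the work can be spread across cores in your own tooling.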

Related

Performance difference in Couchbase's get by Key and select by index

As we are doing benchmark tests on our Couchbase DB, we tried to compare search for item by their id / key and search for items by a query that uses secondary index.
Following this article about indexing and performance in Couchbase we thought the performance of the two will be the same.
However, in our tests we discovered that sometimes the search by key/id was much faster than the search that uses the secondary index.
E.g. ~3 ms to search using the index and ~0.3 ms to search by the key (a factor of 10).
The point is that this difference is not consistent: the search by key varies from 0.3 ms to 15 ms.
We are wondering if:
There should be better performance for search by key over search by secondary index?
There should be such time difference between key searches?
The results you get are consistent with what I would expect. Couchbase works as a key-value store when you do any operation using the id. A key-value store is roughly a big distributed hashmap, and in this data structure you get very good performance on get/save/delete while using the id.
Whenever you store a new document, Couchbase hashes the key and assigns a virtual bucket to it (something similar to a shard). When you need to get this document back, it uses the same algorithm to find out in which virtual bucket the document is located; since the SDK has the cluster map and knows exactly which node owns which shards, your application requests the document directly from the node that owns it.
When you query the database, on the other hand, Couchbase has to run an internal map/reduce to find out where the document is located; that is why operations by id are faster.
As for the results ranging from 0.3 ms to 15 ms, it is hard to tell without debugging your environment. However, a number of factors could contribute to it, e.g. whether the document is cached or not, whether the node is undersized, etc.
To add to deniswrosa's answer, the secondary index will always be slower, because first the index must be traversed based on your query to find the document key, and then a key lookup is performed. Doing just the key lookup is faster if you already have the key. The amount of work to traverse the index can vary depending on how selective the index is, whether the entire index is in memory, etc. Memory-optimized indexes can ensure that the whole index is in memory, if you have enough memory to support that.
Of course even a simple key lookup can be slower if the document in question is not in the cache, and needs to be brought in to memory from storage.
It is possible to achieve sub-millisecond secondary lookups at scale, but it requires some tuning of your query, your index, and possibly some of Couchbase's system parameters. Consider the following simple example:
Sample document in userBucket:
"user::000000000001" : {
"email" : "benjamin1#couchbase.com",
"userId" : "000000000001"
}
This query:
SELECT userId
FROM userBucket
WHERE
email = "benjamin1#couchbase.com"
AND userId IS NOT NULL
;
...should be able to achieve sub-millisecond performance with a properly tuned secondary index:
CREATE INDEX idx01 ON userBucket(email, userId);
Since the index covers the query completely, there is no need for the Query engine to FETCH the document from the K/V store. However, "SELECT * ..." will always cause the Query service to FETCH the document and thus will be slower than a simple k/v GET("user::000000000001").
For the best latencies, make sure to review your query plan (using EXPLAIN syntax) and make sure your query is not FETCHing. https://docs.couchbase.com/server/6.0/n1ql/n1ql-language-reference/explain.html
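For instance, a covered plan for the query above should show an IndexScan on idx01 and no Fetch operator (a sketch using the bucket and index names from the example above):

```sql
EXPLAIN
SELECT userId
FROM userBucket
WHERE email = "benjamin1@couchbase.com"
  AND userId IS NOT NULL;
-- In the resulting JSON plan, look for an "IndexScan" operator
-- with a "covers" list, and confirm there is no "Fetch" operator.
```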

Firebird indexes - create and use

For an existing table in our database, if I create an index such as:
create index i_system_code_system_type_no on system_code (SYSTEM_TYPE_NO);
and then set statistics:
set statistics index i_system_code_system_type_no;
then a query against the table which "should" use the index doesn't use it. For whatever reason, the optimizer, I'm guessing (unless I've missed something), doesn't feel it's the best approach. However, if I restart the Firebird Guardian service, it does use the index.
Does it take time for the server to catch up and build the index? Short of forcing the index in the actual query itself, is there a way to update or do something else to tell the server to use my new index? I can tell it's using the new index because the "Adapted Plan" shows it, and the query execution time drops from about 0.5 sec to "instant".
This new index will be applied to 00's of databases of our customers, so I'm just trying to determine the best way to distribute this update without having to restart the individual services on each customer's machine.

Do I need to run gather_table_stats every time I create an index in order for the Oracle optimizer to use it?

I saw some examples where indexes were being created. Afterwards the following was executed:
exec dbms_stats.gather_table_stats(...)
Is this necessary for Oracle to pay attention to the index? I think stats are gathered every night (?), but there have been situations where I created an index and was disappointed by the explain plans that followed. Maybe I'm missing a step?
It depends on the version of Oracle.
In versions prior to 9i, you had to explicitly gather statistics after creating an index before the cost-based optimizer would have any realistic chance of using it.
In 9i, Oracle added the COMPUTE STATISTICS clause to the CREATE INDEX statement. That allowed you to gather statistics on the index as part of the index creation process. If you didn't specify COMPUTE STATISTICS, you still had to manually gather the statistics before the CBO would be likely to consider it.
In 10g, the default behavior changed and Oracle would automatically compute the statistics on the index when you created it without requiring you to specify COMPUTE STATISTICS. Out of force of habit or because they're just updating older example code, people will often still include the GATHER_INDEX_STATS call in examples they post.
In 10g and later, there is a background job that is created by default that gathers statistics on objects that are missing statistics and objects whose statistics are stale at night. DCookie's explanation of the 10g job is spot on. Oracle changed how the job was set up in 11g but it's still essentially doing the same things.
There is an out of the box default scheduled job named GATHER_STATS_JOB. It runs in the MAINTENANCE_WINDOW_GROUP. Your table may or may not get analyzed, depending on if there's enough time in the window to get to it, or if the window has been altered by the DBA's, etc. The best way to ensure it gets analyzed in this case is to manually run a gather_table_stats job for that table after you create an index.
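To illustrate both approaches mentioned above (table and index names here are made up for the example):

```sql
-- 9i style: gather index statistics as part of the creation
CREATE INDEX emp_dept_idx ON emp (dept_id) COMPUTE STATISTICS;

-- 10g and later compute index statistics automatically at
-- CREATE INDEX time; to refresh the table's statistics manually:
BEGIN
  DBMS_STATS.GATHER_TABLE_STATS(
    ownname => USER,
    tabname => 'EMP',
    cascade => TRUE  -- also gather statistics for the table's indexes
  );
END;
/
```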

Does MySQL use existing indexes on creating new indexes?

I have a large table with millions of records.
Table `price`
------------
id
product
site
value
The table is brand new, and there are no indexes created.
I then issued a request for new index creation with the following query:
CREATE INDEX ix_price_site_product_value_id ON price (site, product, value, id);
This took a long, long time; the last run I checked took 5000+ seconds, because of the machine.
I am wondering if I issue another index creation, will it use the existing index in the process calculation? If so in what form?
Next to run query 1:
CREATE INDEX ix_price_product_value_id ON price (product, value, id);
Next to run query 2:
CREATE INDEX ix_price_value_id ON price (value, id);
I am wondering if I issue another index creation, will it use the existing index in the process calculation? If so in what form?
No, it won't.
Theoretically, an index on (site, product, value, id) has everything required to build an index on any subset of these fields (including the indices on (product, value, id) and (value, id)).
However, building an index from a secondary index is not supported.
First, MySQL does not support fast full index scan (that is, scanning an index in physical order rather than logical order), which makes an index access path more expensive than the table read. This is not a problem for InnoDB, since the table itself is always clustered.
Second, the record orders in these indexes are completely different so the records need to be sorted anyway.
However, the main problem with index creation speed in MySQL is that it builds the index in place (just inserting the records one by one into a B-Tree) instead of using a presorted source. As Daniel mentioned, fast index creation solves this problem. It is available as a plugin for 5.1 and comes preinstalled in 5.5.
If you're using MySQL version 5.1, and the InnoDB storage engine, you may want to use the InnoDB Plugin 1.0, which supports a new feature called Fast Index Creation. This allows the storage engine to create indexes without copying the contents of the entire table.
Overview of the InnoDB Plugin:
Starting with version 5.1, MySQL AB has promoted the idea of a “pluggable” storage engine architecture, which permits multiple storage engines to be added to MySQL. Currently, however, most users have accessed only those storage engines that are distributed by MySQL AB, and are linked into the binary (executable) releases.
Since 2001, MySQL AB has distributed the InnoDB transactional storage engine with its releases (both source and binary). Beginning with MySQL version 5.1, it is possible for users to swap out one version of InnoDB and use another.
Source: Introduction to the InnoDB Plugin
Overview of Fast Index Creation:
In MySQL versions up to 5.0, adding or dropping an index on a table with existing data can be very slow if the table has many rows. The CREATE INDEX and DROP INDEX commands work by creating a new, empty table defined with the requested set of indexes. It then copies the existing rows to the new table one-by-one, updating the indexes as it goes. Inserting entries into the indexes in this fashion, where the key values are not sorted, requires random access to the index nodes, and is far from optimal. After all rows from the original table are copied, the old table is dropped and the copy is renamed with the name of the original table.
Beginning with version 5.1, MySQL allows a storage engine to create or drop indexes without copying the contents of the entire table. The standard built-in InnoDB in MySQL version 5.1, however, does not take advantage of this capability. With the InnoDB Plugin, however, users can in most cases add and drop indexes much more efficiently than with prior releases.
...
Changing the clustered index requires copying the data, even with the InnoDB Plugin. However, adding or dropping a secondary index with the InnoDB Plugin is much faster, since it does not involve copying the data.
Source: Overview of Fast Index Creation
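If the remaining indexes must be built with the old copy algorithm, one mitigation (a sketch, using the column names from the question) is to add them in a single ALTER TABLE, which should rebuild the table only once rather than once per index:

```sql
-- One table copy for both index builds instead of two:
ALTER TABLE price
    ADD INDEX ix_price_product_value_id (product, value, id),
    ADD INDEX ix_price_value_id (value, id);
```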

How do I force Postgres to use a particular index?

How do I force Postgres to use an index when it would otherwise insist on doing a sequential scan?
Assuming you're asking about the common "index hinting" feature found in many databases, PostgreSQL doesn't provide such a feature. This was a conscious decision made by the PostgreSQL team. A good overview of why and what you can do instead can be found here. The reasons are basically that it's a performance hack that tends to cause more problems later down the line as your data changes, whereas PostgreSQL's optimizer can re-evaluate the plan based on the statistics. In other words, what might be a good query plan today probably won't be a good query plan for all time, and index hints force a particular query plan for all time.
As a very blunt hammer, useful for testing, you can use the enable_seqscan and enable_indexscan parameters. See:
Examining index usage
enable_ parameters
These are not suitable for ongoing production use. If you have issues with query plan choice, you should see the documentation for tracking down query performance issues. Don't just set enable_ params and walk away.
Unless you have a very good reason for using the index, Postgres may be making the correct choice. Why?
For small tables, it's faster to do sequential scans.
Postgres doesn't use indexes when datatypes don't match properly; you may need to include appropriate casts.
Your planner settings might be causing problems.
See also this old newsgroup post.
Probably the only valid reason for using
set enable_seqscan=false
is when you're writing queries and want to quickly see what the query plan would actually be were there large amounts of data in the table(s). Or of course if you need to quickly confirm that your query is not using an index simply because the dataset is too small.
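For such a quick check, the setting can be scoped to a single transaction so it never leaks into other sessions (the table and predicate below are hypothetical):

```sql
BEGIN;
SET LOCAL enable_seqscan = off;  -- only affects this transaction
EXPLAIN ANALYZE
    SELECT * FROM my_table WHERE some_col = 42;
ROLLBACK;  -- the setting is discarded with the transaction
```

Comparing this plan against the default one shows what the planner thinks the index scan would cost.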
TL;DR
Run the following three commands and check whether the problem is fixed:
ANALYZE;
SET random_page_cost = 1.0;
SET effective_cache_size = 'X GB'; -- replace X with total RAM size minus 2 GB
Read on for further details and background information about this.
Step 1: Analyze tables
As a simple first attempt to fix the issue, run the ANALYZE; command as the database superuser in order to update all table statistics. From the documentation:
The query planner uses these statistics to help determine the most efficient execution plans for queries.
Step 2: Set the correct random page cost
Index scans require non-sequential disk page fetches. PostgreSQL uses the random_page_cost configuration parameter to estimate the cost of such non-sequential fetches in relation to sequential fetches. From the documentation:
Reducing this value [...] will cause the system to prefer index scans; raising it will make index scans look relatively more expensive.
The default value is 4.0, thus assuming an average cost factor of 4 compared to sequential fetches, taking caching effects into account. However, if your database is stored on an SSD drive, then you should actually set random_page_cost to 1.1 according to the documentation:
Storage that has a low random read cost relative to sequential, e.g., solid-state drives, might also be better modeled with a lower value for random_page_cost, e.g., 1.1.
Also, if an index is mostly (or even entirely) cached in RAM, then an index scan will always be significantly faster than a disk-served sequential scan. The query planner however doesn't know which parts of the index are already cached, and thus might make an incorrect decision.
If your database indices are frequently used, and if the system has sufficient RAM, then the indices are likely to be cached eventually. In such a case, random_page_cost can be set to 1.0, or even to a value below 1.0 to aggressively prefer using index scans (although the documentation advises against doing that). You'll have to experiment with different values and see what works for you.
As a side note, you could also consider using the pg_prewarm extension to explicitly cache your indices into RAM.
You can set the random_page_cost like this:
SET random_page_cost = 1.0;
Step 3: Set the correct cache size
On a system with 8 or more GB RAM, you should set the effective_cache_size configuration parameter to the amount of memory which is typically available to PostgreSQL for data caching. From the documentation:
A higher value makes it more likely index scans will be used, a lower value makes it more likely sequential scans will be used.
Note that this parameter doesn't change the amount of memory which PostgreSQL will actually allocate, but is only used to compute cost estimates. A reasonable value (on a dedicated database server, at least) is the total RAM size minus 2 GB. The default value is 4 GB.
You can set the effective_cache_size like this:
SET effective_cache_size = '14 GB'; -- e.g. on a dedicated server with 16 GB RAM
Step 4: Fix the problem permanently
You probably want to use ALTER SYSTEM SET ... or ALTER DATABASE db_name SET ... to set the new configuration parameter values permanently (either globally or per-database). See the documentation for details about setting parameters.
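For example (values illustrative; ALTER SYSTEM requires superuser privileges):

```sql
-- Persist globally in postgresql.auto.conf, then reload the config:
ALTER SYSTEM SET random_page_cost = 1.0;
ALTER SYSTEM SET effective_cache_size = '14GB';
SELECT pg_reload_conf();

-- Or scope the settings to a single database:
ALTER DATABASE mydb SET random_page_cost = 1.0;
```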
Step 5: Additional resources
If it still doesn't work, then you might also want to take a look at this PostgreSQL Wiki page about server tuning.
Sometimes PostgreSQL fails to make the best choice of indexes for a particular condition. As an example, suppose there is a transactions table with several million rows, of which there are several hundred for any given day, and the table has four indexes: transaction_id, client_id, date, and description. You want to run the following query:
SELECT client_id, SUM(amount)
FROM transactions
WHERE date >= 'yesterday'::timestamp AND date < 'today'::timestamp AND
description = 'Refund'
GROUP BY client_id
PostgreSQL may choose to use the index transactions_description_idx instead of transactions_date_idx, which may lead to the query taking several minutes instead of less than one second. If this is the case, you can force using the index on date by fudging the condition like this:
SELECT client_id, SUM(amount)
FROM transactions
WHERE date >= 'yesterday'::timestamp AND date < 'today'::timestamp AND
description||'' = 'Refund'
GROUP BY client_id
The question in itself is very much invalid. Forcing the planner (by setting enable_seqscan=off, for example) is a very bad idea. It might be useful to check whether the query would be faster, but production code should never use such tricks.
Instead, run EXPLAIN ANALYZE on your query, read the output, and find out why PostgreSQL chooses a bad (in your opinion) plan.
There are tools on the web that help with reading explain analyze output - one of them is explain.depesz.com - written by me.
Another option is to join the #postgresql channel on the Freenode IRC network and talk to the people there, as optimizing a query is not a matter of "ask a question, get an answer, be happy". It's more like a conversation, with many things to check and many things to be learned.
One thing to check with PostgreSQL, when you are expecting an index to be used and it is not, is to VACUUM ANALYZE the table.
VACUUM ANALYZE schema.table;
This updates the statistics used by the planner to determine the most efficient way to execute a query, which may result in the index being used.
Another thing to check is the types.
Is the index on an int8 column while you are querying with numeric? The query will work, but the index will not be used.
There is a trick to push Postgres to prefer a seqscan: add an OFFSET 0 to the subquery.
This is handy for optimizing requests joining big/huge tables when all you need is only the first/last n elements.
Let's say you are looking for the first/last 20 elements involving multiple tables with 100k (or more) entries each; there is no point building the whole query over all the data when what you're looking for is in the first 100 or 1000 entries. In this scenario, for example, it turns out to be over 10x faster to do a sequential scan.
see How can I prevent Postgres from inlining a subquery?
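A sketch of the pattern (table and column names are made up): the OFFSET 0 keeps PostgreSQL from inlining the subquery, so the inner LIMIT is applied before the join.

```sql
SELECT *
FROM (
    SELECT id, created_at
    FROM big_table
    ORDER BY created_at DESC
    LIMIT 20
    OFFSET 0  -- optimization fence: the subquery is planned on its own
) AS recent
JOIN other_table USING (id);
```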
Indexes can only be used under certain circumstances.
For example, the type of the value must fit the type of the column.
You must not do an operation on the column before comparing it to the value.
Given a customer table with 3 columns with 3 indexes on all of the columns.
create table customer(id numeric(10), age int, phone varchar(200))
It might happen that the database tries to use, for example, the index idx_age instead of the index on the phone number.
You can sabotage the usage of the index on age by doing an operation on age:
select * from customer where phone = '1235' and age+1 = 24
(although you are looking for the age 23)
This is of course a very simple example, and the intelligence of Postgres is probably good enough to make the right choice. But sometimes there is no other way than tricking the system.
Another example is to
select * from customer where phone = '1235' and age::varchar = '23'
But this is probably more costly than the option above.
Unfortunately you CANNOT put the name of the index into the query like you can in MSSQL or Sybase:
select * from customer (index idx_phone) where phone = '1235' and age = 23.
This would help a lot to avoid problems like this.
Apparently there are cases where Postgres can be hinted into using an index by repeating a similar condition twice.
The specific case I observed was using PostGIS gin index and the ST_Within predicate like this:
select *
from address
natural join city
natural join restaurant
where st_within(address.location, restaurant.delivery_area)
and restaurant.delivery_area ~ address.location
Note that the first predicate st_within(address.location, restaurant.delivery_area) is automatically decomposed by PostGIS into (restaurant.delivery_area ~ address.location) AND _st_contains(restaurant.delivery_area, address.location), so adding the second predicate restaurant.delivery_area ~ address.location is completely redundant. Nevertheless, the second predicate convinced the planner to use the spatial index on address.location and, in the specific case I needed, improved the running time by a factor of 8.