I have a table which I currently define as follows:
CREATE TABLE pairs (
id INTEGER PRIMARY KEY,
p1 INTEGER,
p2 INTEGER,
r INTEGER,
UNIQUE(p1, p2) ON CONFLICT IGNORE,
FOREIGN KEY (p1) REFERENCES points(id),
FOREIGN KEY (p2) REFERENCES points(id)
)
After that it is filled with gigabytes of data. Now I will need to do a lot of selects exactly like this:
SELECT id, r FROM pairs WHERE p1 = 666 OR p2 = 666
So the question is: what indexes should I create to speed up this select?
CREATE INDEX p1_index ON pairs(p1)
CREATE INDEX p2_index ON pairs(p2)
or maybe
CREATE UNIQUE INDEX p_index ON pairs(p1, p2)
or maybe even both? (and buy a new HDD for them).
(Note: SQLite does automatically create an index for a UNIQUE constraint, even one spanning multiple columns; it shows up as sqlite_autoindex_pairs_1 in the query plans below.)
Since you are using the OR condition, I would go with multiple indexes. If it was an AND condition then a multi-column index would work better.
For the OR condition:
The optimizer can use each index separately: it seeks into one index for the rows matching p1, seeks into the other for the rows matching p2, and unions the two result sets (deduplicating rows that satisfy both predicates). Each lookup is a cheap index seek rather than a table scan. Awesome, right?
For the AND condition:
If 2 indexes are available, the optimizer will have to look at both of them, merge the output of the two index scans and then fetch the results from the base table. This may turn out to be expensive. Here, a multi-column index would have been great.
But then again, the optimizer may choose a different path based on the available table and index statistics.
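You can give the planner real statistics to base that choice on: SQLite's ANALYZE command gathers them into sqlite_stat1. A minimal sketch against the question's schema:
ANALYZE;
EXPLAIN QUERY PLAN SELECT id, r FROM pairs WHERE p1 = 666 OR p2 = 666;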
Hope this helps.
Use EXPLAIN QUERY PLAN to check if indexes are used.
For your example query, both the single-column indexes would be used:
> EXPLAIN QUERY PLAN SELECT id, r FROM pairs WHERE p1 = 666 OR p2 = 666;
0|0|0|SEARCH TABLE pairs USING INDEX p1_index (p1=?) (~10 rows)
0|0|0|SEARCH TABLE pairs USING INDEX p2_index (p2=?) (~10 rows)
The multi-column index (which you already have because of the UNIQUE constraint) would be used if the lookup of a single record needs both columns:
> EXPLAIN QUERY PLAN SELECT id, r FROM pairs WHERE p1 = 666 AND p2 = 666;
0|0|0|SEARCH TABLE pairs USING INDEX sqlite_autoindex_pairs_1 (p1=? AND p2=?) (~1 rows)
However, a multi-column index can also be used for lookups on its first column(s):
> DROP INDEX p1_index;
> EXPLAIN QUERY PLAN SELECT id, r FROM pairs WHERE p1 = 666 OR p2 = 666;
0|0|0|SEARCH TABLE pairs USING INDEX sqlite_autoindex_pairs_1 (p1=?) (~10 rows)
0|0|0|SEARCH TABLE pairs USING INDEX p2_index (p2=?) (~10 rows)
Also see the documentation:
Query Optimizer Overview,
Query Planning.
I am experiencing very slow performance (35 seconds) when trying to join 2 tables: one has 39M rows, the other 10k. This runs on an Azure SQL Premium instance, which is a very decent server:
select m39.*
from [Table_With_39M_Rows] m39
inner join [Table_With_10K_Rows] k10 on m39.[Id] = k10.[Id]
Even a count(*) takes around 10 seconds:
select count(*)
from [Table_With_39M_Rows] m39
inner join [Table_With_10K_Rows] k10 on m39.[Id] = k10.[Id]
Here are the table details:
Table [Table_With_39M_Rows] has around 39 million rows (50 columns) with a clustered columnstore index:
CREATE CLUSTERED COLUMNSTORE INDEX CCI_Table_With_39M_Rows
ON Table_With_39M_Rows
CREATE UNIQUE NONCLUSTERED INDEX UNCI_Table_With_39M_Rows_Id
ON Table_With_39M_Rows (Id ASC)
Table [Table_With_10K_Rows] has around 10k rows (50 columns) and Id as the primary key
ALTER TABLE Table_With_10K_Rows
ADD CONSTRAINT PK_Table_With_10K_Rows
PRIMARY KEY CLUSTERED([Id] ASC)
The clustered columnstore index scan takes 99% of the plan cost and slows everything down.
How can I optimize this particular join? What indexing strategy should I employ?
Clustered columnstore indexes are helpful if rowgroup elimination works (you can think of this as skipping entire segments of rows that don't satisfy the predicate) and if queries are analytical in nature.
To check whether segment elimination is occurring, you can watch the "segment reads / segment skipped" counters from SET STATISTICS IO ON and inspect the segment metadata, as sketched below.
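A sketch, assuming the demo table sales used below; sys.column_store_segments exposes each segment's min/max metadata, which is what elimination is based on:
SET STATISTICS IO ON;  -- reports 'Segment reads N, segment skipped M' per table
SELECT p.partition_number, s.segment_id, s.row_count, s.min_data_id, s.max_data_id
FROM sys.column_store_segments s
JOIN sys.partitions p ON p.hobt_id = s.hobt_id
WHERE p.object_id = OBJECT_ID('sales');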
Below is a sample demo for a query I have (since we don't have your test data) which may help you understand more.
query:
select s.* from sales s
join
numbers n
on n.number=s.id
The numbers table has only 65356 rows and the sales table has more than 3 million rows. Each segment can hold at most one million rows. If you observe the STATISTICS IO output, SQL Server reads 2 segments (2 million rows) and skips 2 segments, which is not great; I would expect only one segment to be read and the remaining three to be skipped. But 2 are read, as shown below:
Table 'sales'. Segment reads 2, segment skipped 2.
This is happening because you might have created the clustered columnstore index on a heap, so try the following. Drop your existing clustered columnstore index; in my case it is:
drop index nci on sales
Now create a clustered index first and the clustered columnstore index next; this helps SQL Server insert the rows in order into the clustered columnstore index. You might also want to use MAXDOP 1 to avoid parallelism producing unordered rows:
create clustered index nci on sales(id)
create clustered columnstore index nci on sales
with (drop_existing = on, maxdop = 1)
If you run the query now, you can see segment elimination occurs and the query is fast:
Table 'sales'. Segment reads 1, segment skipped 2.
References and further reading:
https://www.sqlpassion.at/archive/2017/01/30/columnstore-segment-elimination/
https://blogs.msdn.microsoft.com/sqlserverstorageengine/2016/07/17/columnstore-index-how-do-they-defer-from-traditional-btree-indices-on-rowstore-tables/
https://blogs.msdn.microsoft.com/sql_server_team/columnstore-index-performance-rowgroup-elimination/
I suggest you be consistent in your use of [].
ID for a foreign key is not a good name.
Columnstore Indexes Described
Columnstore indexes give high performance gains for queries that use
full table scans, and are not well-suited for queries that seek into
the data, searching for a particular value.
Just because you need columnstore for other purposes does not make it a good application for this.
Try a regular nonclustered index on [Table_With_39M_Rows].[Id].
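If the plan still insists on scanning the columnstore, you can test the difference with a table hint against the unique nonclustered index the question already defines:
select count(*)
from [Table_With_39M_Rows] m39 with (index(UNCI_Table_With_39M_Rows_Id))
inner join [Table_With_10K_Rows] k10 on m39.[Id] = k10.[Id]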
I am designing a mostly read-only database containing 300,000 documents with around 50,000 distinct tags, with each document having 15 tags on average. For now, the only query I care about is selecting all documents with no tag from a given set of tags. I'm only interested in the document_id column (no other columns in the result).
My schema is essentially:
CREATE TABLE documents (
document_id SERIAL PRIMARY KEY,
title TEXT
);
CREATE TABLE tags (
tag_id SERIAL PRIMARY KEY,
name TEXT UNIQUE
);
CREATE TABLE documents_tags (
document_id INTEGER REFERENCES documents,
tag_id INTEGER REFERENCES tags,
PRIMARY KEY (document_id, tag_id)
);
I can write this query in Python by pre-computing the set of documents for a given tag, which reduces the problem to a few fast set operations:
In [17]: %timeit all_docs - (tags_to_docs[12345] | tags_to_docs[7654])
100 loops, best of 3: 13.7 ms per loop
Translating the set operations to Postgres doesn't work that fast, however:
stuff=# SELECT document_id AS id FROM documents WHERE document_id NOT IN (
stuff(# SELECT documents_tags.document_id AS id FROM documents_tags
stuff(# WHERE documents_tags.tag_id IN (12345, 7654)
stuff(# );
document_id
---------------
...
Time: 201.476 ms
Replacing NOT IN with EXCEPT makes it even slower.
I have btree indexes on document_id and tag_id in all three tables and another one on (document_id, tag_id).
The default memory limits on Postgres' process have been increased significantly, so I don't think Postgres is misconfigured.
How do I speed up this query? Is there any way to pre-compute the mapping between tags and documents like I did in Python, or am I thinking about this the wrong way?
Here is the result of an EXPLAIN ANALYZE:
EXPLAIN ANALYZE
SELECT document_id AS id FROM documents
WHERE document_id NOT IN (
SELECT documents_tags.document_id AS id FROM documents_tags
WHERE documents_tags.tag_id IN (12345, 7654)
);
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------------
Seq Scan on documents (cost=20280.27..38267.57 rows=83212 width=4) (actual time=176.760..300.214 rows=20036 loops=1)
Filter: (NOT (hashed SubPlan 1))
Rows Removed by Filter: 146388
SubPlan 1
-> Bitmap Heap Scan on documents_tags (cost=5344.61..19661.00 rows=247711 width=4) (actual time=32.964..89.514 rows=235093 loops=1)
Recheck Cond: (tag_id = ANY ('{12345,7654}'::integer[]))
Heap Blocks: exact=3300
-> Bitmap Index Scan on documents_tags__tag_id_index (cost=0.00..5282.68 rows=247711 width=0) (actual time=32.320..32.320 rows=243230 loops=1)
Index Cond: (tag_id = ANY ('{12345,7654}'::integer[]))
Planning time: 0.117 ms
Execution time: 303.289 ms
(11 rows)
Time: 303.790 ms
The only settings I changed from the default configuration were:
shared_buffers = 5GB
temp_buffers = 128MB
work_mem = 512MB
effective_cache_size = 16GB
Running Postgres 9.4.5 on a server with 64GB RAM.
Optimize setup for read performance
Your memory settings seem reasonable for a 64GB server - except maybe work_mem = 512MB. That's high. Your queries are not particularly complex and your tables are not that big.
4.5 million rows (300k x 15) in the simple junction table documents_tags should occupy ~ 156 MB and the PK another 96 MB. For your query you typically don't need to read the whole table, just small parts of the index. For "mostly read-only" like you have, you should see index-only scans on the index of the PK exclusively. You don't need nearly as much work_mem - which may not matter much - except where you have many concurrent queries. Quoting the manual:
... several running sessions could be doing such operations concurrently. Therefore, the total memory used could be many times the value of work_mem; it is necessary to keep this fact in mind when choosing the value.
Setting work_mem too high may actually impair performance:
Increasing work_mem and shared_buffers on Postgres 9.2 significantly slows down queries
I suggest reducing work_mem to 128 MB or less to avoid possible memory starvation, unless you have other common queries that require more. You can always set it higher locally for special queries.
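For example, scoped to a single transaction (SET LOCAL reverts automatically at commit):
BEGIN;
SET LOCAL work_mem = '512MB';
-- run the one query that actually needs the big sort or hash here
COMMIT;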
There are several other angles to optimize read performance:
Configuring PostgreSQL for read performance
Key problem: leading index column
All of this may help a little. But the key problem is this:
PRIMARY KEY (document_id, tag_id)
300k documents, 2 tags to exclude. Ideally, you have an index with tag_id as leading column and document_id as 2nd. With an index on just (tag_id) you can't get index-only scans. If this query is your only use case, change your PK as demonstrated below.
Or probably even better: create an additional plain index on (tag_id, document_id) if you need both, and drop the two other indexes on documents_tags on just (tag_id) and (document_id). They offer nothing over the two multicolumn indexes. The remaining 2 indexes (as opposed to 3 indexes before) are smaller and superior in every way; a sketch follows the links below. Rationale:
Is a composite index also good for queries on the first field?
Working of indexes in PostgreSQL
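A sketch of that change; documents_tags__tag_id_index is the name visible in the EXPLAIN output above, while the new index name and the name of the index on just (document_id) are assumptions:
CREATE INDEX documents_tags_tag_doc_idx ON documents_tags (tag_id, document_id);
DROP INDEX documents_tags__tag_id_index;
-- DROP INDEX documents_tags__document_id_index;  -- whatever your index on just (document_id) is named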
While being at it, I suggest to also CLUSTER the table using the new PK, all in one transaction, possibly with some extra maintenance_work_mem locally:
BEGIN;
SET LOCAL maintenance_work_mem = '256MB';
ALTER TABLE documents_tags
DROP CONSTRAINT documents_tags_pkey
, ADD PRIMARY KEY (tag_id, document_id); -- tag_id first.
CLUSTER documents_tags USING documents_tags_pkey;
COMMIT;
Don't forget to:
ANALYZE documents_tags;
Queries
The query itself is run-of-the-mill. Here are the 4 standard techniques:
Select rows which are not present in other table
NOT IN is - to quote myself:
Only good for small sets without NULL values
Your use case exactly: all involved columns NOT NULL and your list of excluded items is very short. Your original query is a hot contender.
NOT EXISTS and LEFT JOIN / IS NULL are always hot contenders. Both have been suggested in other answers. LEFT JOIN has to be an actual LEFT [OUTER] JOIN, though.
EXCEPT ALL would be shortest, but often not as fast.
1. NOT IN
SELECT document_id
FROM documents d
WHERE document_id NOT IN (
SELECT document_id -- no need for column alias, only value is relevant
FROM documents_tags
WHERE tag_id IN (12345, 7654)
);
2. NOT EXISTS
SELECT document_id
FROM documents d
WHERE NOT EXISTS (
SELECT 1
FROM documents_tags
WHERE document_id = d.document_id
AND tag_id IN (12345, 7654)
);
3. LEFT JOIN / IS NULL
SELECT d.document_id
FROM documents d
LEFT JOIN documents_tags dt ON dt.document_id = d.document_id
AND dt.tag_id IN (12345, 7654)
WHERE dt.document_id IS NULL;
4. EXCEPT ALL
SELECT document_id
FROM documents
EXCEPT ALL -- ALL, to keep duplicate rows and make it faster
SELECT document_id
FROM documents_tags
WHERE tag_id IN (12345, 7654);
Benchmark
I ran a quick benchmark on my old laptop with 4 GB RAM and Postgres 9.5.3 to put my theories to the test:
Test setup
SET random_page_cost = 1.1;
SET work_mem = '128MB';
CREATE SCHEMA temp;
SET search_path = temp, public;
CREATE TABLE documents (
document_id serial PRIMARY KEY,
title text
);
-- CREATE TABLE tags ( ... -- actually irrelevant for this query
CREATE TABLE documents_tags (
document_id integer REFERENCES documents,
tag_id integer -- REFERENCES tags -- irrelevant for test
-- no PK yet, to test seq scan
-- it's also faster to create the PK after filling the big table
);
INSERT INTO documents (title)
SELECT 'some dummy title ' || g
FROM generate_series(1, 300000) g;
INSERT INTO documents_tags(document_id, tag_id)
SELECT i.*
FROM documents d
CROSS JOIN LATERAL (
SELECT DISTINCT d.document_id, ceil(random() * 50000)::int
FROM generate_series (1,15)) i;
ALTER TABLE documents_tags ADD PRIMARY KEY (document_id, tag_id); -- your current index
ANALYZE documents_tags;
ANALYZE documents;
Note that rows in documents_tags are physically clustered by document_id due to the way I filled the table - which is likely your current situation as well.
Test
3 test runs with each of the 4 queries, best of 5 every time to exclude caching effects.
Test 1: With documents_tags_pkey like you have it. Index and physical order of rows are bad for our query.
Test 2: Recreate the PK on (tag_id, document_id) like suggested.
Test 3: CLUSTER on new PK.
Execution time of EXPLAIN ANALYZE in ms:
time in ms | Test 1 | Test 2 | Test 3
1. NOT IN | 654 | 70 | 71 -- winner!
2. NOT EXISTS | 684 | 103 | 97
3. LEFT JOIN | 685 | 98 | 99
4. EXCEPT ALL | 833 | 255 | 250
Conclusions
Key element is the right index with leading tag_id - for queries involving few tag_id and many document_id.
To be precise, it's not important that there are more distinct document_id than tag_id. This could be the other way round as well. Btree indexes basically perform the same with any order of columns. It's the fact that the most selective predicate in your query filters on tag_id. And that's faster on the leading index column(s).
The winning query for few tag_id to exclude is your original with NOT IN.
NOT EXISTS and LEFT JOIN / IS NULL result in the same query plan. For more than a couple of dozen IDs, I expect these to scale better ...
In a read-only situation you'll see index-only scans exclusively, so the physical order of rows in the table becomes irrelevant. Hence, test 3 did not bring any more improvements.
If writes to the table happen and autovacuum can't keep up, you'll see (bitmap) index scans. Physical clustering is important for those.
Use an outer join with the tag condition in the join clause, keeping only rows with no match, so that only documents where none of the specified tags match are returned:
select d.document_id
from documents d
left join documents_tags t on t.document_id = d.document_id
and t.tag_id in (12345, 7654)
where t.document_id is null
I'm new to SQL. If I use this query frequently:
SELECT * FROM student WHERE key1=? AND key2=?
I want to create index on student, what is the main difference between these two below?
CREATE INDEX idx_key1 on student (key1);
CREATE INDEX idx_key2 on student (key2);
and
CREATE INDEX idx_keys on student (key1, key2);
Thanks!
The second one (CREATE INDEX idx_keys on student (key1, key2)) will return all the rows you need in a single index seek (to find the rows) + key lookups to get the columns.
If you create 2 single-column indexes, only one of them can be used for index seek. Then for every returned row you need a key lookup to get the other key and filter the results. Or the DB engine will simply decide it's faster to just do a table scan and filter.
So the 2nd one is much better for your query.
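If the query selected a few named columns instead of *, you could even avoid the key lookups with a covering index. A sketch; INCLUDE is available in SQL Server and Postgres 11+, and name/grade here are stand-ins for whatever columns you actually select:
CREATE INDEX idx_keys_covering ON student (key1, key2) INCLUDE (name, grade);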
I've created an Oracle Text index like the following:
create index my_idx on my_table (text) indextype is ctxsys.context;
And I can then do the following:
select * from my_table where contains(text, '%blah%') > 0;
But lets say we have a have another column in this table, say group_id, and I wanted to do the following query instead:
select * from my_table where contains(text, '%blah%') > 0 and group_id = 43;
With the above index, Oracle will have to search for all items that contain 'blah', and then check all of their group_ids.
Ideally, I'd prefer to only search the items with group_id = 43, so I'd want an index like this:
create index my_idx on my_table (group_id, text) indextype is ctxsys.context;
Kind of like a normal index, so a separate text search can be done for each group_id.
Is there a way to do something like this in Oracle (I'm using 10g if that is important)?
Edit (clarification)
Consider a table with one million rows and the following two columns among others, A and B, both numeric. Lets say there are 500 different values of A and 2000 different values of B, and each row is unique.
Now let's consider select ... where A = x and B = y
Separate indexes on A and B will, as far as I can tell, do an index search on B, which will return 500 different rows, and then do a join/scan on those rows. In any case, at least 500 rows have to be looked at (aside from the database getting lucky and finding the required row early).
An index on (A,B), on the other hand, is much more effective: it finds the one row in a single index search.
Putting separate indexes on group_id and the text, I feel, only leaves the query planner with three options:
(1) Use the group_id index, and scan all the resulting rows for the text.
(2) Use the text index, and scan all the resulting rows for the group_id.
(3) Use both indexes, and do a join.
Whereas I want:
(4) Use the (group_id, "text") index to find the text index under the particular group_id and scan that text index for the particular row/rows I need. No scanning and checking or joining required, much like when using an index on (A,B).
Oracle Text
1 - You can improve performance by creating the CONTEXT index with FILTER BY:
create index my_idx on my_table(text) indextype is ctxsys.context filter by group_id;
In my tests the filter by definitely improved the performance, but it was still slightly faster to just use a btree index on group_id.
2 - CTXCAT indexes use "sub-indexes", and seem to work similar to a multi-column index. This seems to be the option (4) you're looking for:
begin
ctx_ddl.create_index_set('my_table_index_set');
ctx_ddl.add_index('my_table_index_set', 'group_id');
end;
/
create index my_idx2 on my_table(text) indextype is ctxsys.ctxcat
parameters('index set my_table_index_set');
select * from my_table where catsearch(text, 'blah', 'group_id = 43') > 0
This is likely the fastest approach. Using the above query against 120MB of random text similar to your A and B scenario required only 18 consistent gets. But on the downside, creating the CTXCAT index took almost 11 minutes and used 1.8GB of space.
(Note: Oracle Text seems to work correctly here, but I'm not familiar with Text and I can't guarantee this isn't an inappropriate use of these indexes like @NullUserException said.)
Multi-column indexes vs. index joins
For the situation you describe in your edit, normally there would not be a significant difference between using an index on (A,B) and joining separate indexes on A and B. I built some tests with data similar to what you described and an index join required only 7 consistent gets versus 2 consistent gets for the multi-column index.
The reason for this is because Oracle retrieves data in blocks. A block is usually 8K, and an index block is already sorted, so you can probably fit the 500 to 2000 values in a few blocks. If you're worried about performance, usually the IO to read and write blocks is the only thing that matters. Whether or not Oracle has to join together a few thousand rows is an inconsequential amount of CPU time.
However, this doesn't apply to Oracle Text indexes. You can join a CONTEXT index with a btree index (a "bitmap and"?), but the performance is poor.
I'd put an index on group_id and see if that's good enough. You don't say how many rows we're talking about or what performance you need.
Remember, the order in which the predicates are handled is not necessarily the order in which you wrote them in the query. Don't try to outsmart the optimizer unless you have a real reason to.
Short version: There's no need to do that. The query optimizer is smart enough to decide what's the best way to select your data. Just create a btree index on group_id, ie:
CREATE INDEX my_group_idx ON my_table (group_id);
Long version: I created a script (testperf.sql) that inserts 136 rows of dummy data.
DESC my_table;
Name Null Type
-------- -------- ---------
ID NOT NULL NUMBER(4)
GROUP_ID NUMBER(4)
TEXT CLOB
There is a btree index on group_id. To ensure the index will actually be used, run this as a dba user:
EXEC DBMS_STATS.GATHER_TABLE_STATS('<YOUR USER HERE>', 'MY_TABLE', cascade=>TRUE);
Here's how many rows each group_id has and the corresponding percentage:
GROUP_ID COUNT PCT
---------------------- ---------------------- ----------------------
1 1 1
2 2 1
3 4 3
4 8 6
5 16 12
6 32 24
7 64 47
8 9 7
Note that the query optimizer will use an index only if it thinks it's a good idea - that is, you are retrieving up to a certain percentage of rows. So, if you ask it for a query plan on:
SELECT * FROM my_table WHERE group_id = 1;
SELECT * FROM my_table WHERE group_id = 7;
You will see that for the first query, it will use the index, whereas for the second query, it will perform a full table scan, since there are too many rows for the index to be effective when group_id = 7.
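You can verify the chosen plan yourself:
EXPLAIN PLAN FOR SELECT * FROM my_table WHERE group_id = 7;
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);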
Now, consider a different condition - WHERE group_id = Y AND text LIKE '%blah%' (since I am not very familiar with ctxsys.context).
SELECT * FROM my_table WHERE group_id = 1 AND text LIKE '%ipsum%';
Looking at the query plan, you will see that it will use the index on group_id. Note that the order of your conditions is not important:
SELECT * FROM my_table WHERE text LIKE '%ipsum%' AND group_id = 1;
Generates the same query plan. And if you try to run the same query on group_id = 7, you will see that it goes back to the full table scan:
SELECT * FROM my_table WHERE group_id = 7 AND text LIKE '%ipsum%';
Note that stats are gathered automatically by Oracle every day (it's scheduled to run every night and on weekends), to continually improve the effectiveness of the query optimizer. In short, Oracle does its best to optimize the optimizer, so you don't have to.
I do not have an Oracle instance at hand to test, and have not used the full-text indexing in Oracle, but I have generally had good performance with inline views, which might be an alternative to the sort of index you had in mind. Is the following syntax legit when contains() is involved?
This inline view gets you the PK values of the rows in group 43:
(
select T.pkcol
from T
where group = 43
)
If group has a normal index, and doesn't have low cardinality, fetching this set should be quick. Then you would inner join that set with T again:
select * from T
inner join
(
select T.pkcol
from T
where group = 43
) MyGroup
on T.pkcol = MyGroup.pkcol
where contains(text, '%blah%') > 0
Hopefully the optimizer would be able to use the PK index to optimize the join and then apply the contains predicate only to the group 43 rows.
I have a table named Workflow. There are 38M rows in the table. There is a PK on the following columns:
ID: Identity Int
ReadTime: dateTime
If I perform the following query, the PK is not used. The query plan shows an index scan being performed on one of the nonclustered indexes plus a sort. It takes a very long time with 38M rows.
Select TOP 100 ID From Workflow
Where ID > 1000
Order By ID
However, if I perform this query, a nonclustered index (on LastModifiedTime) is used. The query plan shows an index seek being performed. The query is very fast.
Select TOP 100 * From Workflow
Where LastModifiedTime > '6/12/2010'
Order By LastModifiedTime
So, my question is this. Why isn't the PK used in the first query, but the nonclustered index in the second query is used?
Without being able to fish around in your database, there are a few things that come to mind; rough sketches in code follow the list.
Are you certain that the PK is (id, ReadTime) as opposed to (ReadTime, id)?
What execution plan does SELECT MAX(id) FROM WorkFlow yield?
What about if you create an index on (id, ReadTime) and then retry the test, or your query?
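Sketches of those checks; the index name is made up:
SELECT MAX(id) FROM Workflow;  -- note which access path the plan uses
CREATE INDEX idx_Workflow_Id_ReadTime ON Workflow (ID, ReadTime);  -- then retry the query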
Since Id is an identity column, having ReadTime participate in the index is superfluous. The clustered key already points to the leaf data. I recommend you modify your indexes:
CREATE TABLE Workflow
(
Id int IDENTITY,
ReadTime datetime,
-- ... other columns,
CONSTRAINT PK_WorkFlow
PRIMARY KEY CLUSTERED
(
Id
)
)
CREATE INDEX idx_LastModifiedTime
ON WorkFlow
(
LastModifiedTime
)
Also, check that statistics are up to date.
Finally, if there are 38 million rows in this table, then the optimizer may conclude that the criteria > 1000 on a unique column is non-selective, because > 99.997% of the Ids are > 1000 (if your identity seed started at 1). In order for an index to be considered helpful, the optimizer must conclude that < 5% of the records would be selected. You can use an index hint to force the issue (as already stated by Dan Andrews), as sketched below. What is the structure of the non-clustered index that was scanned?
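A sketch of such a hint, assuming the PK is named PK_WorkFlow as in the DDL suggested above (your actual constraint name may differ):
Select TOP 100 ID From Workflow WITH (INDEX(PK_WorkFlow))
Where ID > 1000
Order By ID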