Postgres Functional Index on string_to_array(long_string, ',') not being used

Postgres Functional Index on string_to_array(long_string, ',') not being used - sql

I'm dealing with a ~20M line table in Postgres 10.9 that has a text column in that is a bunch of comma delimited strings. This table gets joined all over to many tables that are much longer, and every time the previous authors did so, they join with the on clause being some_other_string = Any(string_to_array(col, ','))
I'm trying to implement a quick optimization to make queries faster while I work on a better solution with the following index:
My functional index:
create index string_to_array_index on happy_table (string_to_array(col));
Test query:
select string_to_array(col, ',') from happy_table;
When I execute an explain on the test query in order to see if the index is being used, I can see that it isn't. I see examples of functional indexes on strings where they lowercase the string or perform some basic operation like that. Do functional indexes work with string_to_array?
select a.id
from joyful_table a
join happy_table b on a.col = any(string_to_array(b.col, ','));

That is a bad design. No matter what you do and how big the tables are, you are stuck with a nested loop join (because the join condition does not use the = operator).
You are right; the best you can do is to speed up that nested loop with an index.
Your index doesn't work because it is a B-tree index, and that cannot be used with arrays in a meaningful way. What you need is a GIN index:
CREATE INDEX ON happy_table USING gin (string_to_array(col, ','));
But that index won't be used with = ANY. You'll have to rewrite the join to
SELECT a.id
FROM joyful_table a
JOIN happy_table b
ON ARRAY[a.col] <# string_to_array(b.col, ',');

Related

Approach to index on Multiple Join columns on same Table?

I've many tables joining each other and for a perticular table I've multiple columns on joining condition.
For e.g.
select a.av, b.qc
TableA a INNER JOIN TableB b
ON (a.id = b.id and a.status = '20' and a.flag='false' and a.num in (1,2,4))
how should be the approach.
1. CREATE NONCLUSTERED INDEX N_IX_Test
ON TableA (id,status,flag,num)
INCLUDE(av);
2. CREATE NONCLUSTERED INDEX N_IX_Test1
ON TableB (id)
INCLUDE(qc);
This two approaches I could think off, everytime i see multiple columns for same table on joining condition i make it as composite index and add select list column to include is it fine?

If id is a unique key in each table, there is no benefit to the join (harmful in fact) from adding more fields to the index.
Now if ID is not unique and not well distributed and by the using the extra columns, you are making a covering index then yes, you are making an index that will make for fast selects. However the covering index maintenance itself is an extra load on SQL server. Hard to tell from your example if this is what your are saying.
So if ID unique or at least not many duplicates for a given ID, I would be reluctant to add covering indexes unless a large percentage of your queries can be satisfied by selecting from the covering index.

Different join algorithms need different indexing. Your indexing approaches are only good for nested loops joins, but I guess hash join might be a better option in that case. However, there is a trick which makes an index useful for nested loops as well as for hash join: put the non-join predicates first into the index:
CREATE NONCLUSTERED INDEX N_IX_Test
ON TableA (status,flag,id,num)
INCLUDE(av);
num is still last because it's not an equality comparison.
This is just a wild guess, exact advice is only possible if you provide more info such as the clustered indexes (if any) and also the execution plan.
References:
about indexing joins (nested loops, hash & merge)

SQLite does not like inner joins?

I am having trouble with sqlite in an android application.
It seems that any JOIN OPERATION totally kills my performance
One table is a fts3 table because my application is a dictionary and I read fts3 benefits dictionary like look ups.
These are my 2 tables I want to join (mainly getting the meaning of the word (okurigana) in different languages :
CREATE VIRTUAL TABLE tango USING fts3 (okurigana, kana, pos, pos_detail);
CREATE TABLE translation (_id int(7), language VARCHAR(10), meaning VARCHAR(100), FOREIGN KEY (_id) REFERENCES tango(rowid));
CREATE INDEX lang_match ON translation (language);
I query these tables with this command:
Select a.rowid, a.okurigana, a.kana, b.meaning
from tango a inner join translation b
ON a.rowid=b._id AND b.language='eng'
WHERE a.okurigana MATCH 'A*'"
The query takes several seconds to complete. I dont understand why. If I use this query (removed the inner join) the query is extremely fast.
Select a.rowid, a.okurigana, a.kana
from tango a
WHERE a.okurigana MATCH 'A*';
Why does a join kills the performance o.0?

You can speed up the query with the use of indexes. This is your query:
Select a.rowid, a.okurigana, a.kana, b.meaning
from tango a inner join
translation b
ON a.rowid = b._id AND b.language = 'eng'
WHERE a.okurigana MATCH 'A*'" ;
There are basically two ways for the engine to process this query. One way is to do the filtering on tango using the where clause and then to look up the values in translation. For this, a useful index would be:
create index translation_id_language_meaning on translation(_id, language, meaning)
The other way would be to scan translation and then do the the lookup on tango. For this, a useful index would be:
create index translation_language_id_meaning on translation(language, _id, meaning)
The first is probably most appropriate for your query, but the better solution depends on the table statistics and distribution of values.

If adding an inner join slows the query down without increasing significantly the number of rows that you get back, it is usually because your schema lacks an index.
In your case, it looks like your translation._id or translation.language is not indexed (perhaps both columns need indexing).
Adding indexes using the CREATE INDEX ... command for these two columns should speed up your query.

Several layers of nested subqueries with Exists/In, best performance?

I'm working on some rather large queries for a search function. There are a number of different inputs and the queries are pretty big as a result. It's grown to where there are nested subqueries 2 layers deep. Performance has become an issue on the ones that will return a large dataset and likely have to sift through a massive load of records to do so. The ones that have less comparing to do perform fine, but some of these are getting pretty bad. The database is DB2 and has all of the necessary indexes, so that shouldn't be an issue. I'm wondering how to best write/rewrite these queries to perform as I'm not quite sure how the optimizer is going to handle it. I obviously can't dump the whole thing here, but here's an example:
Select A, B
from TableA
--A series of joins--
WHERE TableA.A IN (
Select C
from TableB
--A few joins--
WHERE TableB.C IN (
Select D from TableC
--More joins and conditionals--
)
)
There are also plenty of conditionals sprinkled throughout, the vast majority of which are simple equality. You get the idea. The subqueries do not provide any data to the initial query. They exist only to filter the results. A problem I ran into early on is that the backend is written to contain a number of partial query strings that get assembled into the final query (with 100+ possible combinations due to the search options, it simply isn't feasible to write a query for each), which has complicated the overall method a bit. I'm wondering if EXISTS instead of IN might help at one or both levels, or another bunch of joins instead of subqueries, or perhaps using WITH above the initial query for TableC, etc. I'm definitely looking to remove bottlenecks and would appreciate any feedback that folks might have on how to handle this.
I should probably also add that there are potential unions within both subqueries.

It would probably help to use inner joins instead.
Select A, B
from TableA
inner join TableB on TableA.A = TableB.C
inner join TableC on TableB.C = TableC.D
Databases were designed for joins, but the optimizer might not figure out that it can use an index for a sub-query. Instead it will probably try to run the sub-query, hold the results in memory, and then do a linear search to evaluate the IN operator for every record.
Now, you say that you have all of the necessary indexes. Consider this for a moment.
If one optional condition is TableC.E = 'E' and another optional condition is TableC.F = 'F',
then a query with both would need an index on fields TableC.E AND TableC.F. Many young programmers today think they can have one index on TableC.E and one index on TableC.F, and that's all they need. In fact, if you have both fields in the query, you need an index on both fields.
So, for 100+ combinations, "all of the necessary indexes" could require 100+ indexes.
Now an index on TableC.E, TableC.F could be use in a query with a TableC.E condition and no TableC.F condition, but could not be use when there is a TableC.F condition and no TableC.E condition.
Hundreds of indexes? What am I going to do?
In practice it's not that bad. Let's say you have N optional conditions which are either in the where clause or not. The number of combinations is 2 to the nth, or for hundreds of combinations, N is log2 of the number of combinations, which is between 6 and 10. Also, those log2 conditions are spread across three tables. Some databases support multiple table indexes, but I'm not sure DB2 does, so I'd stick with single table indexes.
So, what I am saying is, say for the TableC.E, and TableC.F example, it's not enough to have just the following indexes:
TableB ON C
TableC ON D
TableC ON E
TableC ON F
For one thing, the optimizer has to pick among which one of the last three indexes to use. Better would be to include the D field in the last two indexes, which gives us
TableB ON C
TableC ON D, E
TableC ON D, F
Here, if neither field E nor F is in the query, it can still index on D, but if either one is in the query, it can index on both D and one other field.
Now suppose you have an index for 10 fields which may or may not be in the query. Why ever have just one field in the index? Why not add other fields in descending order of likelihood of being in the query?
Consider that when planning your indexes.

I found out "IN" predicate is good for small subqueries and "EXISTS" for large subqueries.
Try to execute query with "EXISTS" predicate for large ones.
SELECT A, B
FROM TableA
WHERE EXISTS (
Select C
FROM TableB
WHERE TableB.C = TableA.A)

mysql: which queries can untilize which indexes?

I'm using Mysql 5.0 and am a bit new to indexes. Which of the following queries can be helped by indexing and which index should I create?
(Don't assume either table to have unique values. This isn't homework, its just some examples I made up to try and get my head around indexing.)
Query1:
Select a.*, b.*
From a
Left Join b on b.type=a.type;
Query2:
Select a.*, b.*
From a,b
Where a.type=b.type;
Query3:
Select a.*
From a
Where a.type in (Select b.type from b where b.brand=5);
Here is my guess for what indexes would be use for these different kinds of queries:
Query1:
Create Index Query1 Using Hash on b (type);
Query2:
Create Index Query2a Using Hash on a (type);
Create Index Query2b Using Hash on b (type);
Query3:
Create Index Query2a Using Hash on b (brand,type);
Am I correct that neither Query1 or Query3 would utilize any indexes on table a?
I believe these should all be hash because there is only = or !=, right?
Thanks

using the explain command in mysql will give a lot of great info on what mysql is doing and how a query can be optimized.
in q1 and q2: an index on (a.type, all other a cols) and one on (b.type, all other b cols)
in q3: an index on (a.b_type, all other a cols) and one on b (brand, type)
ideally, you'd want all the columns that were selected stored directly in the index so that mysql doesn't have to jump from the index back to the table data to fetch the selected columns. however, that is not always manageable (i.e.: sometimes you need to select * and indexing all columns is too costly), in which case indexing just the search columns is fine.
so everything you said works great.

query 3 is invalid, but i assume you meant
where a.type in ....
Query 1 is the same as query two, just better syntax, both probably have the same query plan and both will use both indexes.
Query 3 will use the index on b.brand, but not the type portion of it. It would also use an index on a.type if you had one.
You are right that they should be hash indexes.

Query 3 could utilize an index on a.type if the number of b's with brand=5 is close to zero
Query2 will utilize indices if they are B-trees (and thus are sorted). Using hash indices with index-join may slow down your query (because you'll have to read Size(a) values in non-sequential way)

Query optimization and indexing is a huge topic, so you'll definitely want to read about MySQL and the specific storage engines you're using. The "using hash" is supported by InnoDB and NDB; I don't think MyISAM supports it.
The joins you have will perform a full table or index scan even though the join condition is equality; Every row will have to be read because there's no where clause.
You'll probably be better off with a standard b-tree index, but measure it and investigate the query plan with "explain". MySQL InnoDB stores row data organized by primary key so you should also have a primary key on your tables, not just an index. It's best if you can use the primary key in your joins because otherwise MySQL retrieves the primary key from the index, then does another fetch to get the row. The nice exception to that rule is if your secondary index includes all the columns you need in the query. That's called a covering index and MySQL will not have to lookup the row at all.

Creating Indexes for Group By Fields?

Do you need to create an index for fields of group by fields in an Oracle database?
For example:
select *
from some_table
where field_one is not null and field_two = ?
group by field_three, field_four, field_five
I was testing the indexes I created for the above and the only relevant index for this query is an index created for field_two. Other single-field or composite indexes created on any of the other fields will not be used for the above query. Does this sound correct?

It could be correct, but that would depend on how much data you have. Typically I would create an index for the columns I was using in a GROUP BY, but in your case the optimizer may have decided that after using the field_two index that there wouldn't be enough data returned to justify using the other index for the GROUP BY.

No, this can be incorrect.
If you have a large table, Oracle can prefer deriving the fields from the indexes rather than from the table, even there is no single index that covers all values.
In the latest article in my blog:
NOT IN vs. NOT EXISTS vs. LEFT JOIN / IS NULL: Oracle
, there is a query in which Oracle does not use full table scan but rather joins two indexes to get the column values:
SELECT l.id, l.value
FROM t_left l
WHERE NOT EXISTS
(
SELECT value
FROM t_right r
WHERE r.value = l.value
)
The plan is:
SELECT STATEMENT
HASH JOIN ANTI
VIEW , 20090917_anti.index$_join$_001
HASH JOIN
INDEX FAST FULL SCAN, 20090917_anti.PK_LEFT_ID
INDEX FAST FULL SCAN, 20090917_anti.IX_LEFT_VALUE
INDEX FAST FULL SCAN, 20090917_anti.IX_RIGHT_VALUE
As you can see, there is no TABLE SCAN on t_left here.
Instead, Oracle takes the indexes on id and value, joins them on rowid and gets the (id, value) pairs from the join result.
Now, to your query:
SELECT *
FROM some_table
WHERE field_one is not null and field_two = ?
GROUP BY
field_three, field_four, field_five
First, it will not compile, since you are selecting * from a table with a GROUP BY clause.
You need to replace * with expressions based on the grouping columns and aggregates of the non-grouping columns.
You will most probably benefit from the following index:
CREATE INDEX ix_sometable_23451 ON some_table (field_two, field_three, field_four, field_five, field_one)
, since it will contain everything for both filtering on field_two, sorting on field_three, field_four, field_five (useful for GROUP BY) and making sure that field_one is NOT NULL.

Do you need to create an index for fields of group by fields in an Oracle database?
No. You don't need to, in the sense that a query will run irrespective of whether any indexes exist or not. Indexes are provided to improve query performance.
It can, however, help; but I'd hesitate to add an index just to help one query, without thinking about the possible impact of the new index on the database.
...the only relevant index for this query is an index created for field_two. Other single-field or composite indexes created on any of the other fields will not be used for the above query. Does this sound correct?
Not always. Often a GROUP BY will require Oracle to perform a sort (but not always); and you can eliminate the sort operation by providing a suitable index on the column(s) to be sorted.
Whether you actually need to worry about the GROUP BY performance, however, is an important question for you to think about.

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas