Is this index defined correctly for this join usage? (Postgres)

select *
from tbl1 as a
inner join tbl2 as b
  on a.id = b.id
left join tbl3 as c
  on b.id = c.parent_id
  and c.some_col = 2
  and c.attribute_id = 3
In the example above:
If I want optimal performance on the join, should I create the index on tbl3 as (parent_id, some_col, attribute_id)?

The answer depends on the chosen join type.
If PostgreSQL chooses a nested loop or a merge outer join, your index is perfect.
If PostgreSQL chooses a hash outer join, the index won't help at all. In that case you need an index on (some_col, attribute_id).
Work with EXPLAIN to make the best choice for your case.
Note: If one of the conditions on some_col and attribute_id is not selective (doesn't filter out a significant number of rows), it is often better to omit that column in the index. In that case, it is better to get the benefit of a smaller index and more HOT updates.
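The two candidate indexes described above, sketched as DDL (index names are illustrative):

```sql
-- If the planner picks a nested loop or merge join against tbl3,
-- the join key can lead the index:
CREATE INDEX tbl3_join_idx ON tbl3 (parent_id, some_col, attribute_id);

-- If the planner picks a hash join, only the filter conditions
-- on tbl3 can use an index:
CREATE INDEX tbl3_filter_idx ON tbl3 (some_col, attribute_id);
```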

My answer is "Maybe". I am speaking from experience with SQL Server, so someone please correct me if I am wrong and it is different in Postgres.
Your index looks fine for the most part. One issue that may arise is the SELECT *: if tbl3 has more columns than those defined in your index and you are querying those fields, they won't be in the index and the engine will have to do additional lookups outside it.
Another consideration is the cardinality of your fields, meaning which are the most selective. If parent_id has high cardinality (very few duplicates), it could cause more reads against the index; however, if your lowest-cardinality field comes first and the DB can quickly filter out huge chunks of data, that might be more efficient.
I have seen both approaches work very well in SQL Server. SQL Server has even recommended indexes, I've applied them, and then it recommended a different one based on field cardinality. Again, I am not familiar with the Postgres engine and am assuming these topics apply to both. If all else fails, create three indexes with different column orders and see which one the engine likes best.
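That trial-and-error approach from the last sentence might look like this (index names and column orders are illustrative; keep only the winner):

```sql
-- Candidate column orders for the same three columns:
CREATE INDEX tbl3_try1 ON tbl3 (parent_id, some_col, attribute_id);
CREATE INDEX tbl3_try2 ON tbl3 (some_col, attribute_id, parent_id);
CREATE INDEX tbl3_try3 ON tbl3 (attribute_id, some_col, parent_id);
-- Run the query, check the actual plan, then drop the unused ones.
```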

Related

Performance of JOINS in SAP HANA Calculation View

For example:
I have 4 columns (A, B, C, D).
I thought that instead of connecting each and every column in the join, I should make a concatenated column in both projections (CA_CONCAT -> A+B+C+D) and join on that, just to check which method performs better.
It worked faster at first, but in a few CVs this method is sometimes slower, especially when filtering!
Can anyone suggest which method is more efficient?
I don't think JOIN conditions on concatenated fields will perform better.
Although we generally say there is no need for indexes on column tables in a HANA database, column tables have a structure that effectively works like an index on every column.
So if you concatenate 4 columns into a new calculated field, you lose the option to use those per-column structures on the 4 columns and the corresponding join columns.
I did not check the execution plan, but it will probably do a full scan on these columns.
In fact, I'm surprised you found it faster at all and only saw problems in a few cases, because concatenating or applying a function to a database column is by itself extra work during the SELECT. It may also involve an implicit type cast, which can add more overhead than expected.
First, I would suggest setting your table to column store and checking the new performance.
After that, I would suggest splitting the JOIN into multiple JOINs if you are using an OR condition in your join.
Third, an INNER JOIN will give you better performance compared to a LEFT JOIN (LEFT OUTER JOIN).
Another thing about JOINs and performance: you are better off joining on PRIMARY KEYs and not on arbitrary columns.
For me, joining on multiple fields performed faster than joining on a concatenated field both times. For the filtering scenario, PlanViz shows that when I join on multiple fields, the filter gets pushed down to both tables; when I join on the concatenated field, only one table gets filtered.
However, if you put a filter on both fields (like PRODUCT from Tab1 and MATERIAL from Tab2), then the filter can be pushed down to both tables.
Like:
SELECT * FROM CalculationView WHERE PRODUCT = 'A' AND MATERIAL = 'A'
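For reference, the two join styles being compared, sketched with the hypothetical columns A-D from the question:

```sql
-- Join on the individual columns (each column keeps its own
-- columnar structure usable for the join):
SELECT *
FROM Tab1 AS t1
INNER JOIN Tab2 AS t2
  ON t1.A = t2.A AND t1.B = t2.B AND t1.C = t2.C AND t1.D = t2.D;

-- Join on a concatenated calculated column (the expression must be
-- evaluated row by row before matching):
SELECT *
FROM Tab1 AS t1
INNER JOIN Tab2 AS t2
  ON (t1.A || t1.B || t1.C || t1.D) = (t2.A || t2.B || t2.C || t2.D);
```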

Where (implicit inner join) vs. explicit inner join - does it affect indexing?

For the query
SELECT * from table_a, b WHERE table_a.id = b.id AND table_a.status ='success'
or
SELECT * from a WHERE table_a.status ='success' JOIN b ON table_a.id = b.id
Somehow, I would tend to create one index (id, status) on table_a for the top form,
whereas my natural tendency for the bottom form would be to create two separate indices,
id and status, on table_a.
The two queries are effectively the same, right? Would you index both the same way?
How would you index table_a (assuming this is the only query in the system, to avoid other considerations)? One index or two?
The "traditional style" and the SQL 92 style inner join are semantically equivalent, and most DBMS will treat them the same (Oracle, for example, does). They will use the same execution plan for both forms (this is, nevertheless, implementation-dependent, and not guaranteed by any standard).
Hence, indexes are used the same way in both forms, too.
Independently of the syntax you use, the appropriate indexing strategy is implementation-dependent: some DBMS (such as Postgres) generally prefer single-column indexes and can combine them very efficiently, others, such as Oracle, can take more advantage from combined (or even covering) indexes (although both forms work for both DBMS of course).
Regarding the syntax of your example, the position of the WHERE clause in your second query surprises me a little bit.
The following two queries are processed the same way in most DBMS:
SELECT * FROM table_a, b WHERE table_a.id = b.id AND table_a.status ='success'
and
SELECT * FROM a JOIN b ON table_a.id = b.id WHERE table_a.status ='success'
However, your second query shifts the WHERE clause inside the FROM clause, which is not valid SQL.
A quick check for
SELECT * from a WHERE table_a.status ='success' JOIN b ON table_a.id = b.id
confirms: MySQL 5.5, Postgres 9.3, and Oracle 11g all yield a syntax error for it.
The two queries should be optimized to perform the same; however, the JOIN syntax is ANSI-compliant, and the older form should be considered deprecated. As far as index usage is concerned, you only want to touch a table (index) once. The RDBMS and table design you are using will determine whether you need to include the PRIMARY KEY (assuming that's what id represents in your example) in a covering index. Also, SELECT * may or may not be covered; it is better to use specific column names.
Well, you ruled out other queries, but there are still open questions, particularly about the data distribution. E.g., how does the number of rows WHERE table_a.status = 'success' compare to the size of table_b? Depending on its estimates, the optimizer has to make two important decisions:
Which join algorithm to use (nested loops, hash, or sort/merge)
In which order to process the tables
Unfortunately, these decisions affect indexing (and are affected by indexing!).
Example: suppose there is only one row WHERE table_a.status = 'success'. Then it would be fine to have an index on table_a.status to find that row quickly. Next, we'd like an index on table_b.id to find the corresponding rows quickly using a nested loops join. Considering that you SELECT *, it doesn't make sense to include additional columns in these indexes (ignoring any other queries in the system).
But now imagine you don't have an index on table_a.status but one on table_a.id, and that this table is huge compared to table_b. For demonstration, let's assume table_b has only one row (an extreme case, of course). Now it would be better to go to table_b, fetch all rows (just one), and then fetch the corresponding rows from table_a using the index. You see how indexing affects the join order? (For a nested loops join, in this example.)
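The candidate indexes from the two scenarios above can be sketched as (index names are illustrative):

```sql
CREATE INDEX a_status_idx ON table_a (status); -- drive the join from table_a
CREATE INDEX a_id_idx     ON table_a (id);     -- drive the join from table_b
CREATE INDEX b_id_idx     ON table_b (id);

-- Check which ones the planner actually uses:
EXPLAIN
SELECT *
FROM table_a
JOIN table_b ON table_a.id = table_b.id
WHERE table_a.status = 'success';
```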
This is just one simple example of how things interact. Most databases have three join algorithms to choose from (except MySQL).
If you create the three indexes mentioned and look at how the database executes the join (explain plan), you'll notice that one or two of the indexes remain unused for the specific join algorithm and join order chosen for your query. In theory, you could drop those indexes. However, keep in mind that the optimizer makes its decision based on the statistics available to it, and that its estimates might be wrong.
You can find more about indexing joins on my website: http://use-the-index-luke.com/sql/join

Index spanning multiple tables in PostgreSQL

Is it possible in PostgreSQL to place an index on an expression containing fields of multiple tables? For example, an index to speed up a query of the following form:
SELECT *, (table1.x + table2.x) AS z
FROM table1
INNER JOIN table2
ON table1.id = table2.id
ORDER BY z ASC
No, it's not possible to have an index across many tables, and it wouldn't really guarantee a speedup anyway, since you won't always get an Index Only Scan. What you really want is a materialized view, but PG doesn't have those either. You can try implementing one yourself using triggers, like this or this.
Update
As noted by @petter, materialized views were introduced in 9.3.
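On 9.3 or later, the materialized-view workaround might be sketched like this (view and index names are illustrative):

```sql
CREATE MATERIALIZED VIEW t1_t2_sum AS
SELECT table1.id, (table1.x + table2.x) AS z
FROM table1
INNER JOIN table2 ON table1.id = table2.id;

-- A plain index on the precomputed expression:
CREATE INDEX t1_t2_sum_z_idx ON t1_t2_sum (z);

-- The view is a snapshot; refresh it when the base tables change:
REFRESH MATERIALIZED VIEW t1_t2_sum;
```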
No, that's not possible in any currently shipping SQL DBMS. Oracle supports bitmap join indexes, but that might not be relevant here. It's not clear to me whether you want an index on only the join columns of multiple tables, or an index on arbitrary columns of joined tables.
To determine the real source of performance problems, learn to read the output of PostgreSQL's EXPLAIN ANALYZE.
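Applied to the query in the question, that looks like:

```sql
EXPLAIN ANALYZE
SELECT *, (table1.x + table2.x) AS z
FROM table1
INNER JOIN table2 ON table1.id = table2.id
ORDER BY z ASC;
```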

mysql: which queries can utilize which indexes?

I'm using MySQL 5.0 and am a bit new to indexes. Which of the following queries can be helped by indexing, and which index should I create?
(Don't assume either table has unique values. This isn't homework, it's just some examples I made up to try and get my head around indexing.)
Query1:
Select a.*, b.*
From a
Left Join b on b.type=a.type;
Query2:
Select a.*, b.*
From a,b
Where a.type=b.type;
Query3:
Select a.*
From a
Where a.type in (Select b.type from b where b.brand=5);
Here is my guess for what indexes would be use for these different kinds of queries:
Query1:
Create Index Query1 Using Hash on b (type);
Query2:
Create Index Query2a Using Hash on a (type);
Create Index Query2b Using Hash on b (type);
Query3:
Create Index Query2a Using Hash on b (brand,type);
Am I correct that neither Query1 or Query3 would utilize any indexes on table a?
I believe these should all be hash because there is only = or !=, right?
Thanks
Using the EXPLAIN command in MySQL will give a lot of great info on what MySQL is doing and how a query can be optimized.
In Q1 and Q2: an index on (a.type, all other a cols) and one on (b.type, all other b cols).
In Q3: an index on (a.type, all other a cols) and one on b (brand, type).
Ideally, you'd want all the selected columns stored directly in the index so that MySQL doesn't have to jump from the index back to the table data to fetch them. However, that is not always manageable (i.e., sometimes you need to SELECT * and indexing all columns is too costly), in which case indexing just the search columns is fine.
So everything you said works great.
Query 3 is invalid, but I assume you meant
WHERE a.type IN ...
Query 1 is the same as query 2, just better syntax; both probably have the same query plan, and both will use both indexes.
Query 3 will use the index on b.brand, but not the type portion of it. It would also use an index on a.type if you had one.
You are right that they should be hash indexes.
Query 3 could utilize an index on a.type if the number of b rows with brand = 5 is close to zero.
Query 2 will utilize indices if they are B-trees (and thus sorted). Using hash indices with an index join may slow down your query (because you'll have to read size(a) values in a non-sequential way).
Query optimization and indexing is a huge topic, so you'll definitely want to read about MySQL and the specific storage engines you're using. The USING HASH clause is supported by InnoDB and NDB; I don't think MyISAM supports it.
The joins you have will perform a full table or index scan even though the join condition is an equality; every row will have to be read because there's no WHERE clause.
You'll probably be better off with a standard B-tree index, but measure it and investigate the query plan with EXPLAIN. MySQL InnoDB stores row data organized by primary key, so you should also have a primary key on your tables, not just an index. It's best if you can use the primary key in your joins, because otherwise MySQL retrieves the primary key from the index and then does another fetch to get the row. The nice exception to that rule is when your secondary index includes all the columns you need in the query. That's called a covering index, and MySQL will not have to look up the row at all.
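A sketch of the covering-index idea, assuming (hypothetically) that the query only needs b.type and b.brand rather than SELECT *:

```sql
-- This secondary index contains every column the query touches,
-- so InnoDB never has to fetch the full row:
CREATE INDEX b_type_brand_idx ON b (type, brand);

SELECT b.type, b.brand
FROM a
JOIN b ON a.type = b.type;
```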

SQL Server Index question

I have a query that joins 3 tables in SQL Server 2005 but has no WHERE clause, so I am indexing the fields found in the join conditions.
If my index is set to (col1, col2, col3) and my join is
tbl1
inner join tbl2
  on tbl1.col3 = tbl2.col3
  and tbl1.col2 = tbl2.col2
  and tbl1.col1 = tbl2.col1
does the order of the join conditions make a difference compared to the order of the index columns? Should I set my index to (col3, col2, col1), or rearrange my join conditions to (col1, col2, col3)?
Thanks
The SQL Server query optimiser should work it out. No need to change for the example you gave.
This is the simple answer though, and it depends on what columns you are selecting and how you are joining the 3 tables.
Note: I'd personally prefer to change the JOIN around to match a "natural" order. That is, I try to use my columns in the same order (JOIN, WHERE) that matches my keys and/or indexes. As Joel mentioned, it can help later on for troubleshooting.
For querying purposes, it does not matter. You may consider alternate ordering sequences based on the following:
possible use of the index for other queries (including some with ORDER BY on one of these columns)
limiting index fragmentation (by using orders that tend to add records towards the end of the index and/or near non-selective values)
Edit: on second thought, having the most selective column first may help the optimizer, for example by giving it a better estimated row count. But this important issue may be getting off topic, as the OP's question was whether the order of the join conditions mattered.
If you always have a join on col1-col3, then you should build the index so that the most distinctive (selective) column is the first field and the most general ones are the last.
So a "Status OK" / "Status denied" field should be field 3, and an SSN or phone number should be field 1 of the index.
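Sketching that ordering with the hypothetical columns mentioned (an SSN, a phone number, and a low-cardinality status flag):

```sql
-- Most selective column first, least selective last:
CREATE INDEX tbl1_join_idx ON tbl1 (ssn, phone_number, status);
```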