Does a 4-column composite index benefit a 3-column query?

I have a table with columns A, B, C, D, E, ..., N, with PK(A). I also have a composite, unique, non-clustered index defined on columns D, C, B, A, in that order.
If I use a query like:
where D = 'a' and C = 'b' and B = 'c'
without a clause for A, do I still get the benefits of the index?

Yes, SQL Server can perform a seek operation on the index (D, C, B, A) for these queries:
WHERE D = 'd'
WHERE D = 'd' AND C = 'c'
WHERE D = 'd' AND C = 'c' AND B = 'b'
WHERE D = 'd' AND C = 'c' AND B = 'b' AND A = 'a'
And it could perform a scan operation on that index for these:
WHERE C = 'c' AND B = 'b'
WHERE A = 'a'
-- etc
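For reference, here is a sketch of the index these examples assume, written out as DDL (the table name dbo.T and the index name are hypothetical; the question doesn't name them):
CREATE UNIQUE NONCLUSTERED INDEX IX_T_DCBA
    ON dbo.T (D, C, B, A);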
But there is one more thing to consider: the order of columns inside an index matters. It is possible for two indexes, (D, C, B, A) and (B, C, D, A), to perform differently. Consider the following example:
WHERE Active = 1 AND Type = 2 AND Date = '2019-09-10'
Assuming that the data contains two distinct values for Active, 10 for Type and 1000 for Date, an index on (Date, Type, Active, Other) will be more efficient than one on (Active, Type, Date, Other): leading with the most selective column narrows the candidate rows fastest.
You could also consider creating different variations of the index for different queries.
PS: if column A is not used inside the WHERE clause, then you can simply INCLUDE it instead of keying on it, as sketched below.
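A minimal sketch of that INCLUDE variant (again with hypothetical names): keying only on D, C, B preserves the seeks above, while A is carried at the leaf level so the index still covers it. Note it can no longer be declared UNIQUE unless (D, C, B) by itself is unique:
CREATE NONCLUSTERED INDEX IX_T_DCB
    ON dbo.T (D, C, B)
    INCLUDE (A);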

Yeah, it will still use the index properly and make your query more efficient.
Think of it like a nested table of contents in a text-book:
Skim till you see the chapter name (e.g. "Data Structures")
Skim until you see the relevant section title ("Binary Trees")
Skim until you see the relevant topic (e.g. "Heaps").
If you only want to get to the binary trees section, the fact that the contents go all the way down to the topic level doesn't hurt you at all :).
Now... if you wanted to find binary trees and you didn't know they were a data structure, then this table of contents wouldn't be very useful for you (e.g. if you were missing "D").

Yes, now SQL can do a seek to just those records, instead of having to scan the whole table.
In addition:
1. SQL Server will have better statistics (it auto-creates statistics on the set of columns composing the index) as to how many rows are likely to satisfy your query.
2. The fact that the index is UNIQUE also tells SQL Server about the data, which may result in a better plan (for example, if you do DISTINCT or UNION, it will know that those columns are already distinct).
3. SQL Server will have to read less data: instead of having to read all "N" columns (even though you only need 3), it can read the index, which has 4, so only one is "superfluous".
Note that because this particular query filters on all of D, C, and B, the relative order of those columns within the index won't matter here. If the WHERE clause were only on C and B, you would get much less benefit, because SQL Server could no longer seek on D, and (1) and (2) above wouldn't apply. (3) still would, though.
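If you want to see what SQL Server keeps for point (1), DBCC SHOW_STATISTICS displays the density and histogram information the optimizer uses for those row estimates. A minimal sketch, reusing the hypothetical table and index names from above:
DBCC SHOW_STATISTICS ('dbo.T', 'IX_T_DCBA');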

Related

Adding a "calculated column" to BigQuery query without repeating the calculations

I want to reuse the values of calculated columns in a new third column.
For example, this query works:
select
    countif(cond1) as A,
    countif(cond2) as B,
    countif(cond1)/countif(cond2) as prct_pass
from ...
where ...
group by ...
But when I try to use A and B instead of repeating the countif calls, it doesn't work, because A and B are not valid there:
select
    countif(cond1) as A,
    countif(cond2) as B,
    A/B as prct_pass
from ...
where ...
group by ...
Can I somehow make the more readable second version work?
Is the first one inefficient?
You should construct a subquery (i.e. a double select), like this:
SELECT A, B, A/B as prct_pass
FROM (
    SELECT countif(cond1) as A,
           countif(cond2) as B
    FROM <yourtable>
)
The same amount of data will be processed in both queries.
In the subquery version you do only 2 countif() calls; in case computing them takes a long time, doing 2 instead of 4 should indeed be more efficient.
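If you find it more readable, the same rewrite can also be spelled as a WITH clause, which BigQuery standard SQL supports; this is just the subquery given a name (cond1, cond2 and the elided from/where/group by are placeholders from your question):
WITH counts AS (
    SELECT
        countif(cond1) as A,
        countif(cond2) as B
    FROM ...
    WHERE ...
    GROUP BY ...
)
SELECT A, B, A/B as prct_pass
FROM counts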
Looking at an example using BigQuery public datasets:
SELECT
    countif(homeFinalRuns>3) as A,
    countif(awayFinalRuns>3) as B,
    countif(homeFinalRuns>3)/countif(awayFinalRuns>3) as division
FROM `bigquery-public-data.baseball.games_post_wide`
or
SELECT A, B, A/B as division
FROM (
    SELECT countif(homeFinalRuns>3) as A,
           countif(awayFinalRuns>3) as B
    FROM `bigquery-public-data.baseball.games_post_wide`
)
we can see that doing it all in one (without a subquery) is actually slightly faster. (I ran the queries 6 times with different values in the inequality; the one-step version was faster 5 times and slower once.)
In any case, the efficiency will depend on how taxing it is to compute the condition in your particular dataset.

select distinct performance is not consistent

There is a distinct query on a single table
select distinct d, e, f, a, b, c from t where a = 1 and e = 2;
The number of distinct values in cols a, b, c is high (high column cardinality) and cols d, e, f are low-cardinality columns. My data is in ORC format in S3, and I have external tables in Athena and Redshift Spectrum pointing to the same file.
When the above query is run in Athena it comes back in a couple of seconds, whereas in Redshift Spectrum it takes a couple of minutes.
But when I move col f to the end of the select list, it works fine in Redshift Spectrum too. This happens only for this particular column; moving d or e to the end does not make any difference, i.e. those variants still run long. Col f is a varchar column, as are the others, and the max length of this column is 30 bytes.
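For concreteness, the variant that runs fast in Redshift Spectrum is simply the same query with f moved to the end of the list:
select distinct d, e, a, b, c, f from t where a = 1 and e = 2;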
Two questions:
(a) Any insight or pointers on the peculiar behavior where moving col f to the end of the list makes the query run faster, whereas putting it in between makes it slower?
(b) Is there a recommended SQL best practice to list the columns in decreasing order of column cardinality in distinct or group by statements? Does it make a difference in the execution times if columns of lower cardinality are put first, or if they are put in a mixed arrangement?
Updating your Redshift driver to the latest version can usually bring your Redshift Spectrum speed almost in line with Athena.
https://docs.aws.amazon.com/redshift/latest/mgmt/configure-jdbc-connection.html#download-jdbc-driver
This may not be the cause in your use case but it is definitely worth a try!

Using previous table in pig group syntax after filter

Suppose I have a table in Pig with 3 columns, a, b, c. Now suppose I want to filter the table by b == 4 and then group it by a. I believe that would look something like this:
t1 = my_table; -- the table contains three columns a, b, c
t1_filtered = FILTER t1 by (
    b == 4
);
t1_grouped = GROUP t1_filtered by my_table.a;
My question is why can't it look like this:
t1 = my_table; -- the table contains three columns a, b, c
t1_filtered = FILTER t1 by (
    b == 4
);
t1_grouped = GROUP t1_filtered by t1_filtered.a;
Why do you have to reference the table from before the filter? I'm trying to learn Pig and I find myself making this mistake a lot. It seems to me that t1_filtered should equal a table that is just the filtered version of t1. Therefore a simple group should make sense, but I've been told you need to reference the table from before. Does anyone know what's going on behind the scenes and why this makes sense? Also, help naming this question is appreciated.
The way you have dereferenced (.) is also not correct. This is how it should be:
A = LOAD '/filepath/to/tabledata' using PigStorage(',') as (a:int,b:int,c:int);
B = FILTER A BY a==1;
C = GROUP B BY a;
But your way of dereferencing (.) will also work in some cases. You can only use the dot (.) when you are referencing a complex data type like a map, tuple or bag. If you use the dot operator to access a normal field, Pig expects a scalar output. If it has more than one row in the output, you will get an error something like this:
java.lang.Exception: org.apache.pig.backend.executionengine.ExecException: ERROR 0: Scalar has more than one row in the output. 1st : (1,2,3), 2nd :(2,2,2)
Your way of using the dot operator would work only if the output of your GROUP BY had a single row; otherwise you will end up with this error. Relation B is not a complex data type, which is why we do not use any dereferencing operator in the GROUP BY clause.
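For illustration, here is a place where the dot operator does belong: after the GROUP, each tuple of C is (group, B), where B is a bag, i.e. a complex type, so projecting through it with a dot is legal. A small sketch continuing the aliases above:
-- B inside C is the bag of all tuples of relation B sharing the same a
D = FOREACH C GENERATE group, COUNT(B), B.c;
DUMP D;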
Hope this answers your question.

PostgreSQL: How to access column on anonymous record

I have a problem that I'm working on. Below is a simplified query to show the problem:
WITH the_table AS (
    SELECT a, b
    FROM (VALUES ('data1', 2), ('data3', 4), ('data5', 6)) x (a, b)
), my_data AS (
    SELECT 'data7' AS c, array_agg(ROW(a, b)) AS d
    FROM the_table
)
SELECT c, d[array_upper(d, 1)]
FROM my_data
In the my_data section, you'll notice that I'm creating an array from multiple rows, and the array is returned in one row with other data. This array needs to contain the information for both a and b, and keep the two values linked together. What would seem to make sense would be to use an anonymous row or record (I want to avoid actually creating a composite type).
This all works well until I need to start pulling data back out. In the above instance, I need to access the last entry in the array, which is done easily by using array_upper, but then I need to access the value in what used to be the b column, which I cannot figure out how to do.
Essentially, right now the above query is returning:
"data7";"(data5,6)"
And I need to return
"data7";6
How can I do this?
NOTE: While in the above example I'm using text and integers as the types for my data, they are not the actual final types, but are rather used to simplify the example.
NOTE: This is using PostgreSQL 9.2
EDIT: For clarification, something like SELECT 'data7', 6 is not what I'm after. Imagine that the_table is actually pulling from database tables and not the WITH statement that I put in for convenience, and that I don't readily know what data is in the table.
In other words, I want to be able to do something like this:
SELECT c, (d[array_upper(d, 1)]).b
FROM my_data
And get this back:
"data7";6
Essentially, once I've put something into an anonymous record by using the row() function, how do I get it back out? How do I split up the 'data5' part and the 6 part so that they don't both return in one column?
For another example:
SELECT ROW('data5', 6)
makes 'data5' and 6 return in one column. How do I take that one column and break it back into the original two?
I hope that clarifies
If you can install the hstore extension:
with the_table as (
    select a, b
    from (values ('data1', 2), ('data3', 4), ('data5', 6)) x (a, b)
), my_data as (
    select 'data7' as c, array_agg(row(a, b)) as d
    from the_table
)
select c, (avals(hstore(d[array_upper(d, 1)])))[2]
from my_data;
c | avals
-------+-------
data7 | 6
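For completeness, installing the extension is a one-time step per database (hstore ships as a contrib module; you need sufficient privileges):
CREATE EXTENSION hstore;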
This is just something I threw together very quickly around a similar problem - not an answer to your question. But it appears to be one direction towards identifying columns.
with x as (select 1 a, 2 b union all values (1,2),(1,2),(1,2))
select a from x;

How to "default" a column in a SELECT query

Say I have a database table T with 4 fields, A, B, C, and D. A, B, and C are the primary key. For any combination of [A, B], there is always a row where C == spaces. There may or may not be other rows where C != spaces. I have a query that gets all rows where [A, B] == [in_a, in_b], and also where C == in_c if such a row exists, or C == spaces if the in_c row doesn't exist. So, if there is a row that matches the particular C value, I want that one; otherwise I want the spaces one. It is very important that, if there is a matching C row, the spaces row is not returned along with it.
I have a working query, but it's not very fast. This is executing on DB2 for z/OS. I have full control over these tables, so I can define new indexes if needed. The only index on the table right now is [A, B, C], the primary key. This SQL is kinda messy, and I feel there's a better way to accomplish this task. What can I do to make this query faster?
The query I have now is:
SELECT A, B, C, D FROM T
WHERE A = :IN_A AND B = :IN_B
  AND (C = :IN_C
       OR (NOT EXISTS(SELECT B FROM T
                      WHERE A = :IN_A AND B = :IN_B AND C = :IN_C)
           AND C = ' '));
Caveat emptor, as I am not familiar with DB2 SQL...
You could try using an ORDER BY clause to sort the matching rows such that a row with c = spaces is last in the sorted set, then retrieve just the first row of the set. Something like:
select A, B, C, D
from T
where A = :IN_A
  and B = :IN_B
  and C in (:IN_C, ' ')
order by C desc
fetch first 1 row only;
This assumes that the FETCH FIRST and ORDER BY ... DESC clauses do what I expect them to, i.e. that any non-blank :IN_C collates higher than ' ', so the spaces row sorts last.
This will work on DB2 LUW, not sure if the order by clause works on DB2 Z:
select a, b, c, d
from t
where a = :IN_A
  and b = :IN_B
  and c in (:IN_C, ' ')
order by
  case c when ' ' then 2 else 1 end
fetch first 1 row only
Make sure that the ' ' value matches the actual value of the column.
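On the index side: since each lookup touches at most two C values per (A, B) pair, the existing primary-key index on (A, B, C) should already let DB2 seek straight to both candidate rows. If you want the query to be index-only, one option is a sketch like the following, assuming your DB2 version supports INCLUDE on unique indexes (the index name is hypothetical):
CREATE UNIQUE INDEX T_IX_ABC_D
    ON T (A, B, C)
    INCLUDE (D);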
Good luck,
Why not start up the index advisor and read its advice? (Or is this only on DB2 for i/OS?)
We use the advisor for our very big production environment and it gives great advice. That said, it's always good to start with a good statement.