SQL Covering Columns Order - sql

Does the order of covering columns matter in an index?
CREATE INDEX idx1 ON MyTable (Col1, Col2) INCLUDE (Col3, Col4)
Specifically, does the order of Col3 and Col4 in the INCLUDE list above make any difference?

No. Included columns are not part of the index key and are not ordered, so the order in which they appear does not matter.

Related

Get one row per unique column value combination (`DISTINCT ON` operation without using it)

I have a table with 5 columns. For each unique combination of the first three columns, I want a single sample row. I don't care which values are picked for columns 4 and 5, as long as they come from the same row (they should be aligned).
I realise I can do a DISTINCT on the first three columns to fetch their unique combinations. But the problem is that I then cannot get the 4th and 5th column values.
I considered GROUP BY: I can group by the first three columns and then take MIN(col4), MIN(col5). But there is no guarantee that the values of col4 and col5 are aligned (from the same row).
The DB I am using does not support the DISTINCT ON operation, which I realise is what I really need.
How do I perform what DISTINCT ON does without actually using that operation?
I am guessing this is how I would write the SQL if DISTINCT ON was supported:
SELECT
DISTINCT ON (col1, col2, col3)
col1, col2, col3, col4, col5
FROM TABLE_NAME
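select
col1, col2, col3, col4, col5
from (
select col1, col2, col3, col4, col5,
-- some databases (e.g. SQL Server) also require an ORDER BY inside OVER (...)
row_number() over (partition by col1, col2, col3) as n
from table_name
) t -- many databases require an alias on the derived table
where n = 1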
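To check that this really returns one aligned row per combination, here is a minimal sketch. It assumes SQLite (3.25 or later, for window functions) driven from Python's sqlite3 module as a stand-in for the asker's unnamed database; the sample rows are made up for illustration.

```python
import sqlite3

# Assumption: SQLite stands in for the asker's database; sample data is invented.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table_name (col1, col2, col3, col4, col5)")
conn.executemany(
    "INSERT INTO table_name VALUES (?, ?, ?, ?, ?)",
    [
        (1, 1, 1, "a", "A"),  # two rows share the key (1, 1, 1)
        (1, 1, 1, "b", "B"),
        (2, 2, 2, "c", "C"),
    ],
)

# Number the rows inside each (col1, col2, col3) group and keep the first one.
rows = conn.execute(
    """
    SELECT col1, col2, col3, col4, col5
    FROM (
        SELECT col1, col2, col3, col4, col5,
               ROW_NUMBER() OVER (PARTITION BY col1, col2, col3) AS n
        FROM table_name
    ) AS t
    WHERE n = 1
    ORDER BY col1, col2, col3
    """
).fetchall()

for row in rows:
    print(row)
```

Which of the two (1, 1, 1) rows survives is not specified, but col4 and col5 always come from the same source row, which is the property MIN(col4), MIN(col5) cannot guarantee.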

Why does the optimizer choose a keylookup instead of 2 separate queries?

I have a table that has a primary key/clustered index on an ID column and a nonclustered index on a system date column. If I query all the columns from the table using the system date column (a covering index wouldn't make sense here), the execution plan shows a key lookup, because for each record it finds it has to go to the ID to get all of the column data.
The weird thing is, if I write 2 queries with a temp table it performs much faster. I can query the system date to get a table of IDs and then use that table to search the ID column. This makes sense because you're no longer doing the slow key lookup for each record.
Why doesn't the optimizer do this for us?
--slow version with key lookup
--id primary key/clustered index
--systemdate nonclustered index
select ID, col1, col2, col3, col4, col5, SystemDate
from MyTable
where SystemDate > '2019-01-01'
--faster version
--id primary key/clustered index
--systemdate nonclustered index
select ID, SystemDate
into #myTempTable
from MyTable
where SystemDate > '2019-01-01'
select t1.ID, t1.col1, t1.col2, t1.col3, t1.col4, t1.col5, t1.SystemDate
from MyTable t1
inner join #myTempTable t2
on t1.ID = t2.ID
Well, in second case you're actually doing a key lookup yourself, aren't you? ; )
The optimizer could pick a slower plan because of outdated (or missing) statistics or a fragmented index.
To tell why it's actually slower, it would be best to post your execution plans here. That would make it much easier to explain what happens.
The query optimizer chooses a key lookup because the query is not supported by a covering index. It has to fetch the missing columns from the table itself:
/*
--slow version with key lookup
--id primary key/clustered index
--systemdate nonclustered index
*/
select ID, col1, col2, col3, col4, col5, SystemDate
from MyTable
where SystemDate > '2019-01-01';
Adding a covering index should boost the performance:
CREATE INDEX my_idx ON MyTable(SystemDate) INCLUDE(col1, col2, col3, col4, col5);
For query without JOIN:
select ID, col1, col2, col3, col4, col5, SystemDate
from MyTable -- single table
where SystemDate > '2019-01-01';
the execution plan still contains a join, which is the key lookup against the clustered index.
After introducing the covering index, there is no need for the additional key lookup.

Performance for Avg & Max in SQL

I want to decrease the query execution time for the following query.
This query is taking around 1 min 20 secs for about 2k records.
Numbers of records in table: 1348474
Number of records processed through where query: 25000
Number of records returned: 2152
SELECT Col1, Col2,
ISNULL(AVG(Col3),0) AS AvgCol,
ISNULL(MAX(Col3),0) AS MaxCol,
COUNT(*) AS Col5
FROM TableName WITH(NOLOCK)
GROUP BY Col1, Col2
ORDER BY Col1, MaxCol DESC
I tried removing the AVG & MAX columns and the execution time dropped to 1 sec.
Is there a more optimized way to write this query?
I have no indexes other than the primary key.
Update
Indexes added:
nonclustered located on PRIMARY - Col1
nonclustered located on PRIMARY - Col2
clustered, unique, primary key located on PRIMARY - Id
======
Thanks in advance..Happy coding !!!
For this query:
SELECT Col1, Col2,
COALESCE(AVG(Col3), 0) AS AvgCol,
COALESCE(MAX(Col3), 0) AS MaxCol,
COUNT(*) AS Col5
FROM TableName
GROUP BY Col1, Col2
ORDER BY Col1, MaxCol DESC;
I would start with an index on (Col1, Col2, Col3).
I'm not sure if this will help. It is possible that the issue is the time for ordering the results.
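As a rough way to try the suggestion, here is a small sketch. It assumes SQLite through Python's sqlite3 module rather than SQL Server, an invented four-row data set, and a hypothetical index name idx_col1_col2_col3. The idea is that an index on (Col1, Col2, Col3) contains every column the query reads, so the engine may be able to aggregate from the index alone. Whether it does depends on the engine, so the sketch prints the plan instead of assuming it.

```python
import sqlite3

# Assumption: SQLite stands in for SQL Server; table contents are invented.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE TableName (Id INTEGER PRIMARY KEY, Col1, Col2, Col3)")
conn.executemany(
    "INSERT INTO TableName (Col1, Col2, Col3) VALUES (?, ?, ?)",
    [(1, 1, 10), (1, 1, 20), (1, 2, None), (2, 1, 5)],
)

# The suggested index: grouping columns first, then the aggregated column.
conn.execute("CREATE INDEX idx_col1_col2_col3 ON TableName (Col1, Col2, Col3)")

QUERY = """
    SELECT Col1, Col2,
           COALESCE(AVG(Col3), 0) AS AvgCol,
           COALESCE(MAX(Col3), 0) AS MaxCol,
           COUNT(*) AS Col5
    FROM TableName
    GROUP BY Col1, Col2
    ORDER BY Col1, MaxCol DESC
"""

rows = conn.execute(QUERY).fetchall()
# The last column of each EXPLAIN QUERY PLAN row is the human-readable step.
plan = [r[-1] for r in conn.execute("EXPLAIN QUERY PLAN " + QUERY)]

for row in rows:
    print(row)
for step in plan:
    print(step)
```

On SQL Server the equivalent check would be to compare the actual execution plans before and after creating the index. The final sort on MaxCol cannot come from the index, which is why the ordering cost may remain.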

De-duplicating rows in a table with respect to certain columns and retaining the corresponding values in the other columns in HIVE

I need to create a temporary table in HIVE using an existing table that has 7 columns. I just want to get rid of duplicates with respect to the first three columns and also retain the corresponding values in the other 4 columns. I don't care which row is actually dropped while de-duplicating using the first three columns alone.
You could use something like the query below if you are not concerned about ordering:
create table table2 as
select col1, col2, col3
-- split() takes a regular expression, so the pipe has to be escaped
,split(agg_col,"\\|")[0] as col4
,split(agg_col,"\\|")[1] as col5
,split(agg_col,"\\|")[2] as col6
,split(agg_col,"\\|")[3] as col7
from (Select col1, col2, col3,
max(concat(cast(col4 as string),"|",
cast(col5 as string),"|",
cast(col6 as string),"|",
cast(col7 as string))) as agg_col
from table1
group by col1,col2,col3 ) A;
Below is another approach, which gives much more control over ordering but is slower than the approach above:
create table table2 as
select col1, col2, col3,max(col4), max(col5), max(col6), max(col7)
from (Select col1, col2, col3,col4, col5, col6, col7,
rank() over ( partition by col1, col2, col3
order by col4 desc, col5 desc, col6 desc, col7 desc ) as col_rank
from table1 ) A
where A.col_rank = 1
GROUP BY col1, col2, col3;
The rank() over(..) function returns more than one row with rank '1' if the order by columns are all equal. In our case, if there are 2 rows with exactly the same values for all seven columns, then there will be duplicates when we filter on col_rank = 1. These duplicates can be eliminated using the max and group by clauses, as written in the query above.
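The tie behaviour of rank() is easy to see on a tiny example. The sketch below assumes SQLite (3.25 or later) through Python's sqlite3 module instead of Hive, and a reduced table with four columns and invented rows. It also runs the same filter with row_number(), which Hive provides as well and which never assigns the same number twice within a partition, so the extra max/group by step would not be needed with it.

```python
import sqlite3

# Assumption: SQLite stands in for Hive; table1 is cut down to four columns.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (col1, col2, col3, col4)")
conn.executemany(
    "INSERT INTO table1 VALUES (?, ?, ?, ?)",
    [
        (1, 1, 1, "x"),  # exact duplicate rows tie for first place
        (1, 1, 1, "x"),
        (1, 1, 1, "w"),
    ],
)


def top_rows(window_function):
    """Return the rows numbered 1 by RANK or ROW_NUMBER (hypothetical helper)."""
    sql = f"""
        SELECT col1, col2, col3, col4
        FROM (
            SELECT col1, col2, col3, col4,
                   {window_function}() OVER (
                       PARTITION BY col1, col2, col3
                       ORDER BY col4 DESC
                   ) AS col_rank
            FROM table1
        ) AS a
        WHERE col_rank = 1
    """
    return conn.execute(sql).fetchall()


rank_rows = top_rows("RANK")              # both tied rows come back
row_number_rows = top_rows("ROW_NUMBER")  # exactly one row per partition

print(len(rank_rows), len(row_number_rows))
```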

Multiple Non-Clustered index and performance?

I have a table in SQL Server that has 700 000 records. When I run a simple select query with 3 to 4 conditions in the where clause, it takes up to 45 seconds. I already had 2 non-clustered indexes and 1 clustered index on the table, so I decided to add 2 more non-clustered indexes. With those, the table has indexes on all the columns I use in the where clause of my query. After adding them, the result comes back considerably faster than before.
Can having 5 to 6 non-clustered indexes harm database performance, or would it not matter much?
My Query structure is
SELECT ( SOME COLUMNS) FROM MyTable
WHERE COL1 = #Id AND COL2 >= #SomeDate AND (NOT (COL3 = 1)) AND
(COL4 <= #SomeOtherDate)
Table has 35 columns.
This is your query:
SELECT ( SOME COLUMNS)
FROM MyTable
WHERE COL1 = #Id AND COL2 >= #SomeDate AND (NOT (COL3 = 1)) AND
(COL4 <= #SomeOtherDate)
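Unfortunately, your query can only make direct use of two columns in this clause: the equality on col1 and then the range on col2. Once an index seek hits a range condition, the key columns after it can only be used to filter the rows already found, not to narrow the seek. I would suggest the following composite index: (col1, col2, col3, col4). This index covers the where clause, but can only be used directly for the first two conditions.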
A clustered index would probably be a marginal improvement over a non-clustered b-tree index.
Note that if col3 only takes on the values 0 and 1, then you should write the where clause as:
WHERE COL1 = #Id AND COL2 >= #SomeDate AND COL3 = 0 AND
(COL4 <= #SomeOtherDate)
And use either (col1, col3, col2, col4) or (col1, col3, col4, col2).
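The effect of putting the equality columns first can be checked by comparing query plans. This is only a sketch: it uses SQLite's planner through Python's sqlite3 module, not SQL Server's, with a hypothetical cut-down MyTable and a hypothetical index name idx_demo. The B-tree prefix rule it illustrates is the same in both engines, but the plan text is SQLite-specific.

```python
import sqlite3

# Assumption: SQLite stands in for SQL Server; MyTable is reduced to a few columns.
QUERY = """
    SELECT col5 FROM MyTable
    WHERE col1 = ? AND col2 >= ? AND col3 = 0 AND col4 <= ?
"""


def plan_for(index_columns):
    """Build an empty MyTable with one index and return the plan steps."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE MyTable (id INTEGER PRIMARY KEY, col1, col2, col3, col4, col5)"
    )
    conn.execute(f"CREATE INDEX idx_demo ON MyTable ({index_columns})")
    plan = conn.execute(
        "EXPLAIN QUERY PLAN " + QUERY, (1, "2019-01-01", "2019-12-31")
    ).fetchall()
    conn.close()
    # The last column of each plan row is the human-readable description.
    return [row[-1] for row in plan]


range_first_plan = plan_for("col1, col2, col3, col4")
equality_first_plan = plan_for("col1, col3, col2, col4")

print("range column second:   ", range_first_plan)
print("equality columns first:", equality_first_plan)
```

The terms listed in parentheses in each plan step are the conditions the seek itself uses. With the equality columns leading, col3 joins col1 in the seek before the first range condition is reached.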