Optimize insert of max dates from a 1 million row table - SQL

I need to get the max dates from a detail table for rows that meet the following condition.
The transaction table is approaching 1 million rows.
Is there a better query than this?
insert into SCH1.maxDATES
select a.ID, a.STATUS, max(detail.REGISTER_DATE) max_DATE
from SCH1.User a
inner join SCH1.Transaction detail on detail.ID = a.ID
where a.STATUS = 3 and detail.REGISTER_DATE is not null
group by a.ID, a.STATUS

Determine what the indexes are on those tables and join on indexed columns where possible. Also, being as specific as possible in your predicates, without filtering out data you actually want, always helps.
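For example, if they do not already exist, covering indexes along these lines might let the query be answered largely from the indexes. This is only a sketch: the index names are made up, and the best column order depends on your data and database.
CREATE INDEX ix_user_status_id ON SCH1.User (STATUS, ID);
CREATE INDEX ix_transaction_id_regdate ON SCH1.Transaction (ID, REGISTER_DATE);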
Here is a helpful site I commonly look at for optimization advice:
http://beginner-sql-tutorial.com/sql-query-tuning.htm

Related

Any tips and tricks to avoid or reduce cost of one-to-many joins and non-equi joins when dataset is large?

I am wondering how people grapple with large one-to-many joins, and in particular non-equi joins, when they have large data. If the keys of the two tables A and B are sufficiently repetitive, the output of the join between the two can be nearly the size of |A| * |B|. This must come up frequently in analytics at large companies, so I am wondering what ways there are to reduce the computation time of these joins.
However, many times A and B are different tables, and in those cases I do not think LAG() can be used.
Example of a non-equi, one-to-many join
As a simplified example of a situation where a non-equi and one-to-many join might be warranted, I have tables A and B, each having a numeric id column, a date field date_created and some field group. For each row in table A, I want the id column of A and all data of the corresponding row in table B where B.date_created is the largest possible value such that A.date_created > B.date_created and A.group = B.group. In other words, I want the most recent row of table B with respect to the date_created and group fields of each row in table A.
Code when using a window function
In most use cases where these non-equi-joins come up, A and B are the same table and the date_created fields in fact correspond to the same column. In this situation, I could use the LAG() window function:
WITH id_tuples AS
(
SELECT A.id,
LAG(A.id, 1) OVER (PARTITION BY A.group ORDER BY A.date_created) AS lagged_id
FROM A
)
SELECT id_t.id,
A.*
FROM id_tuples id_t
INNER JOIN A ON A.id = id_t.lagged_id
which I believe is more efficient than a self-join. However, this approach is not possible when the columns being compared are different, or belong to different tables.
Code when window function is not feasible
I use the following code to compute the most recent row of table B for each row in table A.
SELECT *
FROM
(
SELECT A.id,
B.*,
DENSE_RANK() OVER (PARTITION BY A.id ORDER BY B.date_created DESC) AS date_rank
FROM A
INNER JOIN B ON B.group = A.group
AND B.date_created < A.date_created
) ranked
WHERE date_rank = 1
The problem here is that the grouping variables A.group and B.group may have only a few distinct values. The join then becomes nearly a Cartesian product, and the number of rows produced by the subquery can be many orders of magnitude greater than the sum of the rows of A and B. This is wasteful, since the outer query proceeds to throw out the majority of those rows by filtering for date_rank = 1.
Is there a better way of structuring the query to reduce the cost of these joins, or avoid them entirely in these situations? I am asking in the abstract but I've found that neither my relational database, nor my Spark cluster (once I move the data there) has enough memory to handle such a join. Even on smaller datasets, this operation takes a large amount of time to run. And I don't believe my dataset is particularly large relative to what others are doing.
Your first query can simply be written as:
SELECT A.id,
LAG(A.id, 1) OVER (PARTITION BY A.group ORDER BY A.date_created) AS lagged_id
FROM A;
There is no need for the JOIN.
For the second query, one method is a lateral join:
SELECT A.id, B.*
FROM A LEFT JOIN LATERAL
(SELECT B.*
FROM B
WHERE B.group = A.group AND
B.date_created < A.date_created
ORDER BY B.date_created DESC
FETCH FIRST 1 ROW ONLY
) B ON true;
This should use an index on B(GROUP, date_created).
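A sketch of that index (the index name is made up, and group will likely need quoting or renaming, since it is a reserved word in most dialects):
CREATE INDEX b_group_created_idx ON B ("group", date_created);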

SQL Query to count the records

I am making up a SQL query which will get all the transaction types from one table, and from the other table it will count the frequency of that transaction type.
My query is this:
with CTE as
(
select a.trxType,a.created,b.transaction_key,b.description,a.mode
FROM transaction_data AS a with (nolock)
RIGHT JOIN transaction_types b with (nolock) ON b.transaction_key = a.trxType
)
SELECT COUNT (trxType) AS Frequency, description as trxType,mode
from CTE where created >='2017-04-11' and created <= '2018-04-13'
group by trxType ,description,mode
The transaction_types table contains all the types of transactions only and transaction_data contains the transactions which have occurred.
The problem I am facing is that even though it's the RIGHT join, it does not select all the records from the transaction_types table.
I need to select all the transactions from the transaction_types table and show the number of counts for each transaction, even if it's 0.
Please help.
LEFT JOIN is so much easier to follow.
I think you want:
select tt.transaction_key, tt.description, t.mode, count(t.trxType)
from transaction_types tt left join
transaction_data t
on tt.transaction_key = t.trxType and
t.created >= '2017-04-11' and t.created <= '2018-04-13'
group by tt.transaction_key, tt.description, t.mode;
Notes:
Use reasonable table aliases! a and b mean nothing. t and tt are abbreviations of the table name, so they are easier to follow.
t.mode will be NULL for non-matching rows.
The condition on dates needs to be in the ON clause; otherwise, the outer join is turned into an inner join (see the sketch after these notes).
LEFT JOIN is easier to follow (at least for people whose native language reads left-to-right) because it means "keep all the rows in the table you have already read".
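To make that ON-versus-WHERE point concrete, here is the same query with the date filter moved into the WHERE clause; transaction types with no matching transactions have a NULL t.created, fail the filter, and are silently dropped, so the zero counts disappear:
-- anti-pattern: filtering the outer table in WHERE turns the LEFT JOIN into an inner join
select tt.transaction_key, tt.description, t.mode, count(t.trxType)
from transaction_types tt left join
transaction_data t
on tt.transaction_key = t.trxType
where t.created >= '2017-04-11' and t.created <= '2018-04-13'
group by tt.transaction_key, tt.description, t.mode;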

SQL grouping. How to select row with the highest column value when joined. No CTEs please

I've been banging my head against the wall over something that I think should be simple but just can't get to work.
I'm trying to retrieve the row with the highest multi_flag value when I join table A and table B but I can't seem to get the SQL right because it returns all the rows rather than the one with the highest multi_flag value.
Here are my tables...
Table A
Table B
This is almost my desired result, but only if I leave out the value_id column
SELECT CATALOG, VENDOR_CODE, INVLINK, NAME_ID, MAX(multi_flag) AS multiflag
FROM TBLINVENT_ATTRIBUTE AS A
INNER JOIN TBLATTRIBUTE_VALUE AS B
ON A.VALUE_ID = B.VALUE_ID
GROUP BY CATALOG, VENDOR_CODE, INVLINK, NAME_ID
ORDER BY CATALOG DESC
This is close to what I want to retrieve, but not quite: notice how it returns one row per name_id with the highest multi_flag, but I also need the value_id that belongs to that multi_flag / name_id grouping...
If I include the value_id in my SQL statement then it returns all rows and is no longer grouped
Notice in the results below how it no longer returns only the row with the highest multi_flag, and how all the different values for name_id (e.g. name_id 1) are also returned.
You can choose to use a sub-query, derived table or CTE to solve this problem. Performance will depend on the amount of data you are querying. To get the max multi_flag you must first compute the max value based on the grouping you want; you can use a CTE or a sub-query for that. The CTE below gives the max multi_flag per value_id, which you can then use to join back to your other tables. There are three tables in this example, but that can be reduced, and performance-wise it may be better to use a sub-query; you won't know until you see the actual execution plans side by side.
;with highest_multi_flag as
(
select value_id, max(multi_flag) AS multiflag
FROM TBLINVENT_ATTRIBUTE
group by value_id
)
select A.CATALOG, A.VENDOR_CODE, A.INVLINK, B.NAME_ID, M.multiflag
from highest_multi_flag M
inner join TBLINVENT_ATTRIBUTE AS A on A.VALUE_ID = M.VALUE_ID
INNER JOIN TBLATTRIBUTE_VALUE AS B ON M.VALUE_ID = B.VALUE_ID
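Since the question asks to avoid CTEs, the same idea can be written with a derived table instead; this is only a sketch, using the table and column names from the question:
select A.CATALOG, A.VENDOR_CODE, A.INVLINK, B.NAME_ID, M.multiflag
from (
select value_id, max(multi_flag) AS multiflag
FROM TBLINVENT_ATTRIBUTE
group by value_id
) M
inner join TBLINVENT_ATTRIBUTE AS A on A.VALUE_ID = M.VALUE_ID
INNER JOIN TBLATTRIBUTE_VALUE AS B ON M.VALUE_ID = B.VALUE_ID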
You can use LATERAL too; it's another solution:
SELECT
A.CATALOG, A.VENDOR_CODE, A.INVLINK, B.NAME_ID, M.maxmultiflag
FROM TBLINVENT_ATTRIBUTE AS A
inner join lateral
(
select max(C.multi_flag) as maxmultiflag from TBLINVENT_ATTRIBUTE C
where A.VALUE_ID = C.VALUE_ID
) M on 1=1
INNER JOIN TBLATTRIBUTE_VALUE AS B ON M.maxmultiflag = B.VALUE

Difference between Two Queries - Join vs IN

I have the following two queries. Query1 returns 1000 as the row count, whereas Query2 returns 4000. Can someone please explain the difference between the two queries? I was hoping both would return the same count.
Query1:
SELECT COUNT(*)
FROM TableA A
WHERE A.VIN IN (
SELECT VIN
FROM TableB B, TableC C
WHERE B.MODEL_YEAR = '2014' AND B.VIN_NBR = C.VIN
)
Query2:
SELECT COUNT(*)
FROM TABLEA A, TableB B, TableC C
WHERE B.MODEL_YEAR = '2014' AND B.VIN_NBR = C.VIN AND A.VIN = C.VIN
In many cases, they will return the same answer, but not necessarily. The first counts the number of rows in A that match the conditions -- each row is counted only once, regardless of the number of matches. The second does a join, which can multiply the number of rows.
The second query would be equivalent in results if it used count(distinct A.id), where id is unique or a primary key.
That said, although they are similar in functionality, how they are executed can be quite different. Different SQL engines might do a better job of optimizing one version or the other.
By the way, you should avoid the archaic join syntax that you are using. Since 1992, explicit joins have been part of SQL syntax.
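For instance, Query2 written with explicit join syntax, a direct transcription of the predicates above with nothing else changed:
SELECT COUNT(*)
FROM TableA A
INNER JOIN TableC C ON A.VIN = C.VIN
INNER JOIN TableB B ON B.VIN_NBR = C.VIN
WHERE B.MODEL_YEAR = '2014'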

SQL, only if matching all foreign key values to return the record?

I have two tables
Table A
type_uid, allowed_type_uid
9,1
9,2
9,4
1,1
1,2
24,1
25,3
Table B
type_uid
1
2
From table A I need to return
9
1
Using a WHERE IN clause I can return
9
1
24
SELECT
TableA.type_uid
FROM
TableA
INNER JOIN
TableB
ON TableA.allowed_type_uid = TableB.type_uid
GROUP BY
TableA.type_uid
HAVING
COUNT(distinct TableB.type_uid) = (SELECT COUNT(distinct type_uid) FROM TableB)
Join the two tables together, so that you only have the records matching the types you are interested in.
Group the result set by TableA.type_uid.
Check that each group has the same number of allowed_type_uid values as exist in TableB.type_uid.
distinct is required only if there can be duplicate records in either table. If both tables are known to contain only unique values, the distinct can be removed.
It should also be noted that as TableA grows in size, this type of query will quickly degrade in performance. This is because indexes are not actually much help here.
It can still be a useful structure, but not one where I'd recommend running the queries in real-time. Rather use it to create another persisted/cached result set, and use this only to refresh those results as/when needed.
Or a slightly cheaper version (resource-wise):
SELECT
Data.type_uid
FROM
A AS Data -- one row per candidate (type_uid, allowed_type_uid)
CROSS JOIN
B -- paired with every type_uid that must be allowed
LEFT JOIN
A -- check whether each required pairing actually exists in TableA
ON Data.type_uid = A.type_uid AND B.type_uid = A.allowed_type_uid
GROUP BY
Data.type_uid
HAVING
MIN(ISNULL(A.allowed_type_uid,-999)) != -999 -- any missing pairing becomes NULL, i.e. -999, and fails the check
Your explanation is not very clear. I think you want to get those type_uids from table A for which every record in table B has a matching A.allowed_type_uid.
SELECT T2.type_uid
FROM (SELECT COUNT(*) as AllAllowedTypes FROM #B) as T1,
(SELECT #A.type_uid, COUNT(*) as AllowedTypes
FROM #A
INNER JOIN #B ON
#A.allowed_type_uid = #B.type_uid
GROUP BY #A.type_uid
) as T2
WHERE T1.AllAllowedTypes = T2.AllowedTypes
(Dems, you were faster than me :) )
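For completeness, a minimal self-contained script (T-SQL assumed, matching the ISNULL and #temp-table usage above; data taken from the question) that reproduces the sample tables and confirms the first query returns 9 and 1:
CREATE TABLE #A (type_uid int, allowed_type_uid int);
CREATE TABLE #B (type_uid int);

INSERT INTO #A (type_uid, allowed_type_uid) VALUES
(9,1),(9,2),(9,4),(1,1),(1,2),(24,1),(25,3);
INSERT INTO #B (type_uid) VALUES (1),(2);

-- Relational division: keep only type_uids whose allowed set covers all of #B
SELECT A.type_uid
FROM #A AS A
INNER JOIN #B AS B ON A.allowed_type_uid = B.type_uid
GROUP BY A.type_uid
HAVING COUNT(DISTINCT B.type_uid) = (SELECT COUNT(DISTINCT type_uid) FROM #B);
-- expected: 9 and 1 (24 is excluded because it only allows 1, not 2)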