How to make testing an SQL Query faster?

I was given this code for a SQL query, but for the life of me can't figure out what exactly is going on. I'm pretty new to SQL so any help is greatly appreciated.
SELECT *
FROM ( SELECT rownum as rn
, a.*
FROM ( SELECT outbound.MSG_ID
, outbound.MSG_TYPE
, outbound.FROM_ADDR
, outbound.TO_ADDR
, outbound.EMAIL_SUBJECT
, outbound.CREATION_DATE
, outbound.MQ_MSG_ID
FROM MESSAGES outbound
WHERE (1 = 1)
GROUP BY outbound.MSG_ID
, outbound.MSG_TYPE
, outbound.FROM_ADDR
, outbound.TO_ADDR
, outbound.EMAIL_SUBJECT
, outbound.CREATION_DATE
, outbound.MQ_MSG_ID
ORDER BY CREATION_DATE DESC ) a
)
WHERE rn BETWEEN 1 AND 25
I'm specifically having trouble understanding SELECT rownum as rn, a.* FROM (...a ) but I assume this is where I would edit the query to only check 1000 rows (which is my goal). Right now it's checking all entries in the database (750,000) and I only want it to check 1000 for testing.
Thanks!

Alright, let's start picking this guy apart. Starting with the subquery
SELECT outbound.MSG_ID
, outbound.MSG_TYPE
, outbound.FROM_ADDR
, outbound.TO_ADDR
, outbound.EMAIL_SUBJECT
, outbound.CREATION_DATE
, outbound.MQ_MSG_ID
FROM MESSAGES outbound
WHERE (1 = 1)
GROUP BY outbound.MSG_ID
, outbound.MSG_TYPE
, outbound.FROM_ADDR
, outbound.TO_ADDR
, outbound.EMAIL_SUBJECT
, outbound.CREATION_DATE
, outbound.MQ_MSG_ID
ORDER BY CREATION_DATE DESC
What's happening there is the subquery is selecting msg_id, msg_type, etc from the table MESSAGES. It's aliasing that table and calling it outbound. FROM MESSAGES outbound means "get the data from MESSAGES but call the table outbound instead."
Now, you might note the WHERE (1=1) clause... that's trivially true, and will always occur. Sometimes people use WHERE (1=1) because a script somewhere adds additional filters if certain parameters are selected. For now don't worry about that.
Next, the GROUP BY {blah blah blah} is telling your database to dedup these data. It's effectively SELECT DISTINCT. Last, the subquery is ordered by CREATION_DATE DESC so the most recent messages come first. (Because CREATION_DATE is itself one of the grouped columns, only rows that are identical in every selected column get collapsed.) If I had to guess, the deduping and ordering is because this is a messaging system that might contain essentially duplicate records (like maybe someone resent the same email) or because messaging systems are often distributed and don't emphasize consistency on write, but rather write speed. I have no idea why exactly they needed to dedup these guys, but the important thing for you is that someone thought it was necessary and they were probably right.
Outside of the subquery you see
SELECT rownum as rn
, a.*
Everything that the subquery was doing got labelled "a". Remember that alias concept from earlier. Your entire subquery has an alias too, and it's called "a". So, we are selecting everything from a ("a.*") and we are also selecting the rownumber and calling that rn. The where clause at the very end says "give me the first 25 rows."
So... if you want to select 1000 rows in this manner (dedup, keep the most recent, etc) then just change WHERE rn BETWEEN 1 AND 25 to WHERE rn BETWEEN 1 AND 1000.
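As a rough illustration of the same dedup, order, number, and slice pattern, here is a minimal sketch run against an in-memory SQLite database from Python. The table contents are made up, only three of the columns are kept, and ROW_NUMBER() stands in for Oracle's ROWNUM (SQLite has no ROWNUM), so treat it as an analogy rather than the original query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE messages (msg_id INTEGER, msg_type TEXT, creation_date TEXT)"
)
# Hypothetical data: 30 distinct messages plus 5 exact duplicates.
rows = [(i, "EMAIL", f"2020-01-{i:02d}") for i in range(1, 31)]
rows += rows[:5]
conn.executemany("INSERT INTO messages VALUES (?, ?, ?)", rows)


def fetch_page(first, last):
    """Dedup with GROUP BY, number rows newest first, return rows first..last."""
    sql = """
        SELECT rn, msg_id, msg_type, creation_date
        FROM (SELECT ROW_NUMBER() OVER (ORDER BY a.creation_date DESC) AS rn,
                     a.*
              FROM (SELECT msg_id, msg_type, creation_date
                    FROM messages
                    GROUP BY msg_id, msg_type, creation_date) a)
        WHERE rn BETWEEN ? AND ?
        ORDER BY rn
    """
    return conn.execute(sql, (first, last)).fetchall()


first_page = fetch_page(1, 25)
print(len(first_page), first_page[0])
```

Changing the upper bound passed to fetch_page plays the same role as changing BETWEEN 1 AND 25 to BETWEEN 1 AND 1000. Note that the inner query still reads and dedups the whole table before the slice is taken.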
If, on the other hand, you don't want to dedup messages at all and only want the top 1000 rows of the table, then
SELECT outbound.MSG_ID
, outbound.MSG_TYPE
, outbound.FROM_ADDR
, outbound.TO_ADDR
, outbound.EMAIL_SUBJECT
, outbound.CREATION_DATE
, outbound.MQ_MSG_ID
FROM MESSAGES outbound
WHERE ROWNUM <= 1000;
should do the job.
Does this help?

To answer your question, you need to determine how early you want to limit the subset of records your query will be checking for testing.
Also, you have to define what your goal is with testing: Are you looking to do a simple check to determine whether the query can be executed? Or are you actually looking to prove correctness?
If you just want to test that it executes you could put a limit very early on, something like this:
-- first part of query omitted for brevity
SELECT outbound.MSG_ID
, outbound.MSG_TYPE
, outbound.FROM_ADDR
, outbound.TO_ADDR
, outbound.EMAIL_SUBJECT
, outbound.CREATION_DATE
, outbound.MQ_MSG_ID
FROM MESSAGES outbound
WHERE (1 = 1)
AND ROWNUM <= 1000 -- Oracle has no TOP; ROWNUM caps the rows before GROUP BY
-- rest of query (GROUP BY onward) omitted for brevity
Or, for fastest performance, limit the initial source:
-- first part of query omitted for brevity
SELECT outbound.MSG_ID
, outbound.MSG_TYPE
, outbound.FROM_ADDR
, outbound.TO_ADDR
, outbound.EMAIL_SUBJECT
, outbound.CREATION_DATE
, outbound.MQ_MSG_ID
FROM (SELECT * FROM MESSAGES WHERE ROWNUM <= 1000) outbound -- Oracle has no TOP, so use ROWNUM
-- bottom part of query omitted for brevity

Related

My SQL MERGE statement runs for too long

I have this Hive MERGE statement:
MERGE INTO destination dst
USING (
SELECT
-- DISTINCT fields
company
, contact_id as id
, ct.cid as cid
-- other fields
, email
, timestamp_utc
-- there are actually about 6 more
-- deduplication
, ROW_NUMBER() OVER (
PARTITION BY company
, ct.id
, contact_id
ORDER BY timestamp_utc DESC
) as r
FROM
source
LATERAL VIEW explode(campaign_id) ct AS cid
) src
ON
dst.company = src.company
AND dst.campaign_id = src.cid
AND dst.id = src.id
-- On match: keep latest loaded
WHEN MATCHED
AND dst.updated_on_utc < src.timestamp_utc
AND src.r = 1
THEN UPDATE SET
email = src.email
, updated_on_utc = src.timestamp_utc
WHEN NOT MATCHED AND src.r = 1 THEN INSERT VALUES (
src.id
, src.email
, src.timestamp_utc
, src.license_name
, src.cid
)
;
Which runs for a very long time (30 minutes for 7GB of avro compressed data on disk).
I wonder if there are any SQL ways to improve it.
ROW_NUMBER() is here to deduplicate the source table, so that in the MATCH clause we only select the latest row (r = 1 under ORDER BY timestamp_utc DESC).
One thing I am not sure of, is that hive says:
SQL Standard requires that an error is raised if the ON clause is such
that more than 1 row in source matches a row in target. This check is
computationally expensive and may affect the overall runtime of a
MERGE statement significantly. hive.merge.cardinality.check=false may
be used to disable the check at your own risk. If the check is
disabled, but the statement has such a cross join effect, it may lead
to data corruption.
I do indeed disable the cardinality check, as although the ON statement might give 2 rows in source, those rows are limited to 1 only thanks to the r=1 later in the MATCH clause.
Overall I like this MERGE statement but it is just too slow and any help would be appreciated.
Note that the destination table is partitioned. The source table is not as it is an external table which for every run must be fully merged, so fully scanned (in the background already merged data files are removed and new files are added before next run). Not sure that partitioning would help in that case
What I have done:
play with hdfs/hive/yarn configuration
try with a temporary table (2 steps) instead of a single MERGE, the run time jumped to more than 2 hours.

Option 1: Move the src.r = 1 filter inside the src subquery (wrap the ROW_NUMBER query in one more subquery and filter on r = 1 there), then check the merge performance. This will reduce the number of source rows before the merge.
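To show what filtering on the row number before the merge looks like, here is a minimal sketch using an in-memory SQLite database from Python. The table is a simplified, already-exploded stand-in for the Hive source (the sample rows and the reduced column list are assumptions), and it only demonstrates the dedup step, not the MERGE itself.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE source (company TEXT, id INTEGER, cid INTEGER, "
    "email TEXT, timestamp_utc INTEGER)"
)
# Hypothetical rows: contact 1 appears twice, the later row should win.
conn.executemany(
    "INSERT INTO source VALUES (?, ?, ?, ?, ?)",
    [
        ("acme", 1, 10, "old@example.com", 100),
        ("acme", 1, 10, "new@example.com", 200),
        ("acme", 2, 10, "b@example.com", 150),
    ],
)

# The window function is computed in the inner query; the outer query keeps
# only r = 1, so later steps never see the older duplicates.
latest = conn.execute(
    """
    SELECT company, id, cid, email, timestamp_utc
    FROM (SELECT s.*,
                 ROW_NUMBER() OVER (PARTITION BY company, cid, id
                                    ORDER BY timestamp_utc DESC) AS r
          FROM source s)
    WHERE r = 1
    ORDER BY id
    """
).fetchall()
print(latest)
```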
The other two options do not require ACID mode; they rewrite the full target table.
Option 2: Rewrite using UNION ALL + row_number (this should be the fastest one):
insert overwrite table destination
select -- column order must match the destination table definition
company
, id
, cid
, email
, timestamp_utc
-- add more fields
from
(
select --dedupe, select last updated rows using row_number
s.*
, ROW_NUMBER() OVER (PARTITION BY company, cid, id ORDER BY timestamp_utc DESC) as rn
from
(
select --union all source and target
company
, contact_id as id
, ct.cid as cid
, email
, timestamp_utc
-- add more fields
from source LATERAL VIEW explode(campaign_id) ct AS cid
UNION ALL
select
company
, id
, campaign_id as cid
, email
, updated_on_utc as timestamp_utc
-- add more fields
from destination
)s --union all
)s --numbered rows
where rn=1 --filter duplicates
;
If source contains a lot of duplicates, you can apply additional row_number filtering to the src subquery as well before union.
One more approach using full join: https://stackoverflow.com/a/37744071/2700344

SQL Server CTE use IDs from single column with EXCEPT?

Having received kindness the other day from someone whose eyes were less bleary than mine I thought I'd give it another shot. Thanks in advance for your assistance.
I have a single SQL Server (2012) table named Contacts. That table has four columns I am currently concerned with. The table has a total of 71,454 rows. There are two types of records in the table; Companies and Employees. Both use the same column, named (Client ID), for their primary key. The existence of a Company Name is what differentiates between Company and Employee data. Employees have no associated Company Name. There are 29,021 Companies leaving 42,433 Employees.
There may be 0-n number of Employees associated with any one Company. I am attempting to create output that will reflect the relationship between Companies and Clients, if there are any. I would like to use the Company ID (Client ID column) as my anchor data set.
Not sure my definition is correct but the thought was to create a CTE of the known Companies by virtue of a given Company Name. Then, use the remaining Client IDs but use the EXCEPT clause to filter the already-retrieved Client IDs out of the result set.
Here is the code I currently have:
;
WITH cte ( BaseID, Client_id, Company_name,
First_name, Last_name, [level] )
AS ( SELECT Client_id AS BaseID ,
Client_id ,
Company_name ,
First_name ,
Last_name ,
1
FROM dbo.Conv_client_clean
WHERE ( COMPANY_NAME IS NOT NULL
OR COMPANY_NAME != ''
)
UNION ALL
SELECT c.BaseID ,
children.Client_id ,
children.Company_name ,
children.First_name ,
children.Last_name ,
cte.[level] + 1
FROM dbo.Conv_client_clean children
INNER JOIN cte c ON c.Client_id = children.CLIENT_ID
EXCEPT
SELECT children.Client_id
FROM cte
)
SELECT BaseID ,
Client_id ,
Company_name ,
first_name ,
Last_name ,
[Level]
FROM cte
OPTION ( MAXRECURSION 0 );
In this instance I receive the following error;
Msg 252, Level 16, State 1, Line 3
Recursive common table expression 'cte' does not contain a top-level UNION ALL operator.
Any suggestions?
Thanks!

In a recursive cte query, you cannot have more set operations (UNION, EXCEPT, UNION ALL, INTERSECT) after the one UNION ALL that refers to the cte itself. I think what you can try is to change the query as below and check:
...
UNION ALL
SELECT c.BaseID ,
children.Client_id ,
children.Company_name ,
children.First_name ,
children.Last_name ,
cte.[level] + 1
FROM dbo.Conv_client_clean children
WHERE children.Client_id NOT IN (SELECT Client_id FROM cte)

As mentioned to Kiran, I was able to concoct an 'old fashioned' approach that is good enough for now.
Thank you everyone for your kind attention.

I'm not sure what you are trying to do with level. It seems that it will be 1 for companies and 2 for employees. If that's the case, you don't even need recursion. The first part of your cte creates a list of companies. That's fine. Now use that to join back to the original table to show all the employees too.
WITH
cte( BaseID, ClientID, Company_name, First_name, Last_name )AS(
SELECT Base_ID,
Base_ID AS Client_id ,
Company_name,
First_name,
Last_name
FROM dbo.Conv_client_clean
WHERE COMPANY_NAME IS NOT NULL
AND COMPANY_NAME <> ''
)
select c2.Base_id, c2.Client_id,
c1.Company_Name, c2.First_Name, c2.Last_Name,
case when c2.client_id is null then 1 else 2 end Level
from cte c1
join Conv_client_clean c2
on c1.BaseID = isnull( c2.Client_ID, c2.Base_id )
order by c1.BaseID, c2.Base_id;
Here's where I fiddled with it.

Unfortunately anything besides UNION ALL, after you've made your recursive reference, will not work. And if you think about it, it makes sense.
Recursion is conceptually identical to the following where recursion continues until max depth is reached or a query returns no results upon which another execution could act.
WITH Anchor AS (select...)
,recurse1 as (<Some body referring to Anchor>)
,recurse2 as (<Identical body except referring to recurse1>)
,recurse3 as (<Identical body except referring to recurse2>)
...
select * from Anchor
union all
select * from recurse1
union all
select * from recurse2
...
The problem is that a conjunctive operator applies to EVERYTHING that precedes it. In your case, EXCEPT operates on everything to its left side, which includes the Anchor query. Afterwards, when looking for the anchor to which the recursive part must be applied, the query compiler doesn't find a 'top level union all operator' any more because it's been consumed as part of the left side of your recursive query.
It wouldn't help to contrive some syntax akin to parentheses that could delimit the scope of the left side of your table conjunction, because you would then build a case of 'multiple recursive references', which is also illegal.
BOTTOM LINE IS: The only conjunction that works in the recursive part of your query is UNION ALL because it simply concatenates the right side. It doesn't require knowledge of the left side to determine which rows to include.
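For illustration, here is a minimal sketch of a recursive CTE whose only set operator is UNION ALL, run on an in-memory SQLite database from Python. The clients table, its parent_id column, and the sample rows are hypothetical (the original table has no such column shown), and SQLite's WITH RECURSIVE syntax differs slightly from SQL Server's.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE clients (client_id INTEGER, parent_id INTEGER, name TEXT)"
)
# Hypothetical hierarchy: companies have no parent, employees point at one.
conn.executemany(
    "INSERT INTO clients VALUES (?, ?, ?)",
    [
        (1, None, "Acme"),
        (2, 1, "Ann"),
        (3, 1, "Bob"),
        (4, None, "Globex"),
        (5, 4, "Cy"),
    ],
)

# Anchor member, then exactly one UNION ALL, then the recursive member.
# Any extra filtering goes into JOIN/WHERE conditions, not another set operator.
tree = conn.execute(
    """
    WITH RECURSIVE cte (base_id, client_id, name, level) AS (
        SELECT client_id, client_id, name, 1
        FROM clients
        WHERE parent_id IS NULL
        UNION ALL
        SELECT c.base_id, ch.client_id, ch.name, c.level + 1
        FROM clients ch
        JOIN cte c ON ch.parent_id = c.client_id
    )
    SELECT base_id, client_id, name, level
    FROM cte
    ORDER BY base_id, level, client_id
    """
).fetchall()
print(tree)
```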

Datediff between two tables

I have those two tables
1-Add to queue table
TransID , ADD date
10 , 10/10/2012
11 , 14/10/2012
11 , 18/11/2012
11 , 25/12/2012
12 , 1/1/2013
2-Removed from queue table
TransID , Removed Date
10 , 15/1/2013
11 , 12/12/2012
11 , 13/1/2013
11 , 20/1/2013
TransID is the key between the two tables, and I can't modify those tables. What I want is to query the amount of time each transaction spent in the queue.
It's easy when there is one item in each table, but when an item gets queued more than once, how do I calculate that?

Assuming the order TransIDs are entered into the Add table is the same order they are removed, you can use the following:
WITH OrderedAdds AS
( SELECT TransID,
AddDate,
[RowNumber] = ROW_NUMBER() OVER(PARTITION BY TransID ORDER BY AddDate)
FROM AddTable
), OrderedRemoves AS
( SELECT TransID,
RemovedDate,
[RowNumber] = ROW_NUMBER() OVER(PARTITION BY TransID ORDER BY RemovedDate)
FROM RemoveTable
)
SELECT OrderedAdds.TransID,
OrderedAdds.AddDate,
OrderedRemoves.RemovedDate,
[DaysInQueue] = DATEDIFF(DAY, OrderedAdds.AddDate, ISNULL(OrderedRemoves.RemovedDate, CURRENT_TIMESTAMP))
FROM OrderedAdds
LEFT JOIN OrderedRemoves
ON OrderedAdds.TransID = OrderedRemoves.TransID
AND OrderedAdds.RowNumber = OrderedRemoves.RowNumber;
The key part is that each record gets a rownumber based on the transaction id and the date it was entered, you can then join on both rownumber and transID to stop any cross joining.
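The pairing idea can be tried out with a small sketch on an in-memory SQLite database from Python, using the sample dates from the question rewritten in ISO format. The snake_case table and column names are assumptions, julianday() replaces DATEDIFF (which SQLite lacks), and entries that were never removed are left as NULL here instead of being measured against the current time.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE add_table (trans_id INTEGER, add_date TEXT)")
conn.execute("CREATE TABLE remove_table (trans_id INTEGER, removed_date TEXT)")
conn.executemany(
    "INSERT INTO add_table VALUES (?, ?)",
    [(10, "2012-10-10"), (11, "2012-10-14"), (11, "2012-11-18"),
     (11, "2012-12-25"), (12, "2013-01-01")],
)
conn.executemany(
    "INSERT INTO remove_table VALUES (?, ?)",
    [(10, "2013-01-15"), (11, "2012-12-12"), (11, "2013-01-13"),
     (11, "2013-01-20")],
)

# Number adds and removes per transaction, then pair the n-th add with the
# n-th removal so repeated queue entries do not cross join.
pairs = conn.execute(
    """
    WITH ordered_adds AS (
        SELECT trans_id, add_date,
               ROW_NUMBER() OVER (PARTITION BY trans_id ORDER BY add_date) AS rn
        FROM add_table
    ), ordered_removes AS (
        SELECT trans_id, removed_date,
               ROW_NUMBER() OVER (PARTITION BY trans_id ORDER BY removed_date) AS rn
        FROM remove_table
    )
    SELECT a.trans_id,
           CAST(ROUND(julianday(r.removed_date) - julianday(a.add_date))
                AS INTEGER) AS days_in_queue
    FROM ordered_adds a
    LEFT JOIN ordered_removes r
      ON a.trans_id = r.trans_id AND a.rn = r.rn
    ORDER BY a.trans_id, a.rn
    """
).fetchall()
print(pairs)
```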
Example on SQL Fiddle

DISCLAIMER: There are probably problems with this, but I hope to send you in one possible direction. Expect to run into issues.
You can try in the following direction (which might work in some way depending on your system, version, etc) :
SELECT transId, (sum(remove_date_sum) - sum(add_date_sum)) / (60*60*24)
FROM
(
SELECT transId, SUM(UNIX_TIMESTAMP(add_date)) as add_date_sum, 0 as remove_date_sum
FROM add_to_queue
GROUP BY transId
UNION ALL
SELECT transId, 0 as add_date_sum, SUM(UNIX_TIMESTAMP(remove_date)) as remove_date_sum
FROM remove_from_queue
GROUP BY transId
) t
GROUP BY transId;
A bit of explanation: as far as I know, you cannot sum dates, but you can convert them to some sort of timestamp. Check if UNIX_TIMESTAMP works for you, or figure out something else. Then you can sum in each table, build the union by conveniently leaving the other column as zero, and then subtract the two sums in the outer query. Note that this only gives a sensible result when every add has a matching removal.
As for the division at the end of the first SELECT: in MySQL, UNIX_TIMESTAMP returns seconds, so you divide by 60*60*24 to get days, or by whatever unit you want.
This all said - I would probably solve this using a stored procedure or some client script. SQL is not a weapon for every battle. Making two separate queries can be much simpler.

Answer 2: after your comments. (As a side note, some of your dates 15/1/2013,13/1/2013 do not represent proper date formats )
select transId, sum(numberOfDays) totalQueueTime
from (
select a.transId,
datediff(day,a.addDate,isnull(r.removeDate,a.addDate)) numberOfDays
from AddTable a left join RemoveTable r on a.transId = r.transId
-- no ORDER BY here: SQL Server rejects it inside a derived table without TOP
) X
group by transId
Answer 1: before your comments
Assuming that a record won't be added again until it has been removed. Also note that the following query will return numberOfDays as zero for unremoved records:
select a.transId, a.addDate, r.removeDate,
datediff(day,a.addDate,isnull(r.removeDate,a.addDate)) numberOfDays
from AddTable a left join RemoveTable r on a.transId = r.transId
order by a.transId, a.addDate, r.removeDate

To return only the latest row [duplicate]

I have a table called TRANSFER that stores transactions. I needed to write a query to return only the newest transaction entry for a given stock tag (which is a unique key to identify the material), so I used the following query:
SELECT a.TRANSFER_ID
, a.TRANSFER_DATE
, a.ASSET_CATEGORY_ID
, a.ASSET_ID
, a.TRANSFER_FROM_ID
, a.TRANSFER_TO_ID
, a.STOCK_TAG
FROM TRANSFER a
INNER JOIN (
SELECT STOCK_TAG
, MAX(TRANSFER_DATE) maxDATE
FROM TRANSFER
GROUP BY STOCK_TAG
) b
ON a.STOCK_TAG = b.STOCK_TAG AND
a.Transfer_Date =b.maxDATE
But I end up with a problem: when more than one transfer happens on the same transfer date, it returns all of those rows, whereas I need only the latest. How can I get the latest row?
edited:
transfer_id transfer_date asset_category_id asset_id stock_tag
1 24/12/2010 100 111 2000
2 24/12/2011 100 111 2000

To avoid the potential situation of rows not being inserted in transfer_date order, and maybe for performance reasons, you might like to try:
select
TRANSFER_ID ,
TRANSFER_DATE ,
ASSET_CATEGORY_ID,
ASSET_ID ,
TRANSFER_FROM_ID ,
TRANSFER_TO_ID ,
STOCK_TAG
from (
SELECT
TRANSFER_ID ,
TRANSFER_DATE ,
ASSET_CATEGORY_ID,
ASSET_ID ,
TRANSFER_FROM_ID ,
TRANSFER_TO_ID ,
STOCK_TAG ,
row_number() over (
partition by stock_tag
order by transfer_date desc,
transfer_id desc) rn
FROM TRANSFER)
where rn = 1

Consider selecting MAX(TRANSFER_ID) in your subquery, assuming that TRANSFER_ID is an incrementing field, such that later transfers always have larger IDs than earlier transfers.
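A minimal sketch of that MAX(TRANSFER_ID) variant, run on an in-memory SQLite database from Python. The sample rows and the reduced column list are made up, and the result is only meaningful under the assumption stated above, that later transfers always get larger IDs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE transfer (transfer_id INTEGER PRIMARY KEY, "
    "transfer_date TEXT, stock_tag INTEGER)"
)
# Hypothetical rows: transfers 2 and 3 share a date for stock tag 2000.
conn.executemany(
    "INSERT INTO transfer VALUES (?, ?, ?)",
    [
        (1, "2010-12-24", 2000),
        (2, "2011-12-24", 2000),
        (3, "2011-12-24", 2000),
        (4, "2011-06-01", 3000),
    ],
)

# Join each stock tag to its largest transfer_id, so ties on the date
# cannot produce more than one row per tag.
latest = conn.execute(
    """
    SELECT a.transfer_id, a.transfer_date, a.stock_tag
    FROM transfer a
    JOIN (SELECT stock_tag, MAX(transfer_id) AS max_id
          FROM transfer
          GROUP BY stock_tag) b
      ON a.stock_tag = b.stock_tag AND a.transfer_id = b.max_id
    ORDER BY a.stock_tag
    """
).fetchall()
print(latest)
```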

SQL query, select from 2 tables random

Hello all, I have a problem that I just can't get to work the way I want.
I want to show news and reviews (2 tables), and I want random output rather than the same output every time.
Here is my query; I really hope someone can explain what I am doing wrong.
SELECT
anmeldelser.billed_sti ,
anmeldelser.overskrift ,
anmeldelser.indhold ,
anmeldelser.id ,
anmeldelser.godkendt
FROM
anmeldelser
LIMIT 0,6
UNION ALL
SELECT
nyheder.id ,
nyheder.billed_sti ,
nyheder.overskrift ,
nyheder.indhold ,
nyheder.godkendt
FROM nyheder
ORDER BY rand() LIMIT 0,6

First off, it looks like the column order of the two SELECT statements doesn't match, which it needs to for a UNION.
What does the following return?
SELECT
anmeldelser.billed_sti ,
anmeldelser.overskrift ,
anmeldelser.indhold ,
anmeldelser.id ,
anmeldelser.godkendt
FROM
anmeldelser
LIMIT 0,6
UNION ALL
SELECT
nyheder.billed_sti ,
nyheder.overskrift ,
nyheder.indhold ,
nyheder.id ,
nyheder.godkendt
FROM nyheder
ORDER BY rand() LIMIT 0,6
(which RDBMS are you using? the SQL you have is not valid for Sybase but there may be techniques depending on the 'flavour' of SQL you are using)

Since RAND() appears only in the ORDER BY clause, would it not only be evaluated once for the whole query, and not once per row?

The problem is the first table is not selecting random elements
SELECT temp.* FROM
(
SELECT
anmeldelser.id ,
anmeldelser.billed_sti ,
anmeldelser.overskrift ,
anmeldelser.indhold ,
anmeldelser.godkendt,
'Review' as artType
FROM anmeldelser
UNION
SELECT
nyheder.id ,
nyheder.billed_sti ,
nyheder.overskrift ,
nyheder.indhold ,
nyheder.godkendt,
'News' as artType
FROM nyheder
) temp
ORDER BY rand() LIMIT 0,6
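The union-first, shuffle-afterwards idea can be sketched on an in-memory SQLite database from Python. The tables are cut down to two columns with invented rows, the alias articles and the column art_type are illustrative names, and SQLite spells the function random() where MySQL uses rand().

```python
import sqlite3

conn = sqlite3.connect(":memory:")
for table in ("anmeldelser", "nyheder"):
    conn.execute(f"CREATE TABLE {table} (id INTEGER, overskrift TEXT)")
    conn.executemany(
        f"INSERT INTO {table} VALUES (?, ?)",
        [(i, f"{table} {i}") for i in range(1, 6)],  # 5 made-up rows each
    )


def random_articles(limit=6):
    """Combine both tables first, then shuffle and cut the combined set."""
    sql = """
        SELECT articles.*
        FROM (SELECT id, overskrift, 'Review' AS art_type FROM anmeldelser
              UNION ALL
              SELECT id, overskrift, 'News' AS art_type FROM nyheder) articles
        ORDER BY random()
        LIMIT ?
    """
    return conn.execute(sql, (limit,)).fetchall()


sample = random_articles()
print(sample)
```

Because ORDER BY and LIMIT apply to the combined set, rows from either table can appear in any proportion, and the selection changes from run to run.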
