I have a query as follows:
SELECT c.irn,
       pLog.policingname,
       ce.*
INTO   #caselist
FROM   employeereminder_ilog ce
       JOIN cases c
         ON ce.caseid = c.caseid
       JOIN policinglog pLog
         ON ce.logdatetimestamp BETWEEN pLog.startdatetime AND pLog.finishdatetime
WHERE  ce.logdatetimestamp BETWEEN #start_pre AND #end_pre
employeereminder_iLOG is a pretty huge table, around 32M rows.
POLICINGLOG has around 50 rows.
CASES around 0.5m rows.
#start_pre and #end_pre are predefined variables around 30 minutes apart.
This query took around 30 minutes to run, and returns around 600 results.
I was trying to find a way to speed up the query by looking at the execution plan. However, I couldn't work out why the insert was taking up 99% of the query, as opposed to the select from employeereminder_iLOG.
So, my questions are:
Why is the cost coming from the insert, and not the select from employeereminder_iLOG?
Is it possible to speed up this query?
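One way to see where the time actually goes is to time the read side on its own, then the full SELECT ... INTO, with I/O and timing statistics enabled. A minimal sketch, assuming SQL Server and using ordinary variables with made-up values in place of the question's #start_pre / #end_pre parameters:

SET STATISTICS IO, TIME ON;

-- Stand-ins for #start_pre / #end_pre (types and values assumed)
DECLARE @start_pre datetime = '2024-01-01 09:00',
        @end_pre   datetime = '2024-01-01 09:30';

-- 1) Read side only: same joins and filter, but no INTO
SELECT c.irn,
       pLog.policingname,
       ce.*
FROM   employeereminder_ilog ce
       JOIN cases c
         ON ce.caseid = c.caseid
       JOIN policinglog pLog
         ON ce.logdatetimestamp BETWEEN pLog.startdatetime AND pLog.finishdatetime
WHERE  ce.logdatetimestamp BETWEEN @start_pre AND @end_pre;

-- 2) Then run the original SELECT ... INTO #caselist and compare the elapsed
--    time and logical reads reported for each statement.

If the two timings are close, the cost is really in reading employeereminder_iLOG and the plan is simply attributing it to the insert operator; if the second run is much slower, the write to tempdb is the actual problem.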
Scenario: Medical records reporting to state government, which requires a pipe-delimited text file as input.
Challenge: Select hundreds of values from a fact table and produce a wide result set to be (Redshift) UNLOADed to disk.
What I have tried so far is a SQL query that I want to make into a VIEW.
;WITH
CTE_patient_record AS
(
SELECT
record_id
FROM fact_patient_record
WHERE update_date = <yesterday>
)
,CTE_patient_record_item AS
(
SELECT
record_id
,record_item_name
,record_item_value
FROM fact_patient_record_item fpri
INNER JOIN CTE_patient_record cpr ON fpri.record_id = cpr.record_id
)
Note that fact_patient_record has 87M rows and fact_patient_record_item has 97M rows.
The above code runs in 2 seconds for 2 test records and the CTE_patient_record_item CTE has about 200 rows per record for a total of about 400.
Now, produce the result set:
,CTE_result AS
(
SELECT
cpr.record_id
,cpri002.record_item_value AS diagnosis_1
,cpri003.record_item_value AS diagnosis_2
,cpri004.record_item_value AS medication_1
...
FROM CTE_patient_record cpr
INNER JOIN CTE_patient_record_item cpri002 ON cpr.record_id = cpri002.record_id
AND cpri002.record_item_name = 'diagnosis_1'
INNER JOIN CTE_patient_record_item cpri003 ON cpr.record_id = cpri003.record_id
AND cpri003.record_item_name = 'diagnosis_2'
INNER JOIN CTE_patient_record_item cpri004 ON cpr.record_id = cpri004.record_id
AND cpri004.record_item_name = 'medication_1'
...
) SELECT * FROM CTE_result
Result set looks like this:
record_id  diagnosis_1  diagnosis_2  medication_1  ...
100001     09           9B           88X           ...
...and then I use the Redshift UNLOAD command to write the pipe-delimited file to disk.
I am testing this on a full production sized environment but only for 2 test records.
Those 2 test records have about 200 items each.
Processing output is 2 rows 200 columns wide.
It takes 30 to 40 minutes to process just the 2 records.
You might ask me why I am joining on the item name which is a string. Basically there is no item id, no integer, to join on. Long story.
I am looking for suggestions on how to improve performance. With only 2 records, 30 to 40 minutes is unacceptable. What will happen when I have 1000s of records?
I have also tried making the VIEW a MATERIALIZED VIEW; however, it also takes 30 to 40 minutes (not surprisingly) to build the materialized view.
I am not sure which route to take from here.
Stored procedure? I have experience with stored procs.
Create new tables so I can create integer IDs to join on, plus indexes? However, my managers are "new table" averse.
Something else?
I could just stop with the first two CTEs, pull the data down to Python, and process it with a pandas DataFrame, which I've done before successfully, but it would be nice if I could have an efficient query, just use Redshift UNLOAD, and be done with it.
Any help would be appreciated.
UPDATE: Many thanks to Paul Coulson and Bill Weiner for pointing me in the right direction! (Paul, I am unable to upvote your answer as I am too new here.)
Using (pseudo code):
MAX(CASE WHEN t1.name = 'somename' THEN t1.value END ) AS name
...
FROM table1 t1
reduced execution time from 30 minutes to 30 seconds.
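For reference, a hedged sketch of how that pattern maps onto the tables named in the question (the full item list is elided, as in the original post):

SELECT
    fpri.record_id
    ,MAX(CASE WHEN fpri.record_item_name = 'diagnosis_1'  THEN fpri.record_item_value END) AS diagnosis_1
    ,MAX(CASE WHEN fpri.record_item_name = 'diagnosis_2'  THEN fpri.record_item_value END) AS diagnosis_2
    ,MAX(CASE WHEN fpri.record_item_name = 'medication_1' THEN fpri.record_item_value END) AS medication_1
    -- ... one MAX(CASE ...) per item name ...
FROM fact_patient_record_item fpri
INNER JOIN fact_patient_record fpr
        ON fpr.record_id = fpri.record_id
       AND fpr.update_date = <yesterday>   -- placeholder, as in the question
GROUP BY fpri.record_id;

This reads fact_patient_record_item once instead of once per item, which is why the plan collapses from thousands of lines to a few dozen.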
The EXPLAIN plan for the original solution is 2,700 lines long; for the new solution using conditional aggregation it is 40 lines long.
Thanks guys.
Without some more information it is impossible to know for sure what is going on, but what you are doing is likely not ideal. An explain plan and the execution time per step would help a bunch.
What I suspect is getting you is that you are reading a 97M row table 200 times. This will slow things down but shouldn't take 40 min. So I also suspect that record_item_name is not unique per value of record_id. This will lead to row replication and could be expanding the data set many fold. Also, is record_id unique in fact_patient_record? If not, then this will also cause row replication. If all of this is large enough to cause significant spill and significant network broadcasting, your 40 min execution time is very plausible.
There is no need to be joining when all the data is in a single copy of the table. #PhilCoulson is correct that some sort of conditional aggregation could be applied and the decode() syntax could save you space if you don't like case. Several of the above issues that might be affecting your joins would also make this aggregation complicated. What are you looking for if there are several values for record_item_value for each record_id and record_item_name pair? I expect you have some discovery of what your data holds in your future.
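To illustrate the decode() alternative (a sketch only, reusing the column names from the question; Redshift's DECODE here behaves like the simple CASE form and returns NULL when nothing matches):

SELECT record_id
      ,MAX(DECODE(record_item_name, 'diagnosis_1', record_item_value)) AS diagnosis_1
      ,MAX(DECODE(record_item_name, 'diagnosis_2', record_item_value)) AS diagnosis_2
FROM fact_patient_record_item
GROUP BY record_id;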
I encounter a problem with optimization.
When I use a query like this:
Select * (around 100 columns)
from x
where RepoDate = '2020-05-18'
It's taking around 0.2 seconds.
But when I use a query like this:
Select * (around 100 columns)
from x
where RepoDate = (select max(RepoDate) from y)
It takes around 1 hour.
Table y has only dates (2020-05-17, 2020-05-18, ... )
Can you tell me why there is so much difference in time to execute?
Technically, you are comparing a simple first query with a query that contains a subquery, which may already be doing heavy processing on table "y".
So these two queries are not the same to start with. We need the explain plans to estimate the cost of the subquery first (i.e. how many rows, index usage, etc.), and then we can move on to the parent query.
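One quick experiment (a sketch, assuming SQL Server-style syntax; other engines have equivalents) is to evaluate the subquery once into a variable. That both isolates the cost of scanning y and turns the outer filter into a simple comparison against a constant:

-- Run the subquery once and capture the scalar
DECLARE @maxRepoDate date = (SELECT MAX(RepoDate) FROM y);

-- Then filter x with the captured value
SELECT *              -- around 100 columns
FROM x
WHERE RepoDate = @maxRepoDate;

If this version is fast, the hour is being spent on how the optimizer combines the subquery with the scan of x, not on the subquery itself.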
I have a query that cross joins two tables. TABLE_1 has 15,000 rows and TABLE_2 has 50,000 rows. A query very similar to this one has run in the past in roughly 10 minutes. Now it is running indefinitely with the same server situation (i.e. nothing else running), and the very similar query is also running indefinitely.
SELECT A.KEY_1
,A.FULL_TEXT_1
,B.FULL_TEXT_2
,B.KEY_2
,MDS_DB.MDQ.SIMILARITY(A.FULL_TEXT_1,B.FULL_TEXT_2, 2, 0, 0) AS confidence
FROM #TABLE_1 A
CROSS JOIN #TABLE_2 B
WHERE MDS_DB.MDQ.SIMILARITY(A.FULL_TEXT_1,B.FULL_TEXT_2, 2, 0, 0) >= 0.9
When I run the estimated execution plan for this query, the Nested Loops (Inner Join) node is estimated at 96% of the execution. The estimated number of rows is 218 million, even though cross joining the tables should result in 15,000 * 50,000 = 750 million rows. When I add INSERT INTO #temp_table to the beginning of the query, the estimated execution plan puts Insert Into at 97% and estimates the number of rows as 218 million. In reality, there should be less than 100 matches that have a similarity score above 0.9.
I have read that large differences in estimated vs. actual row counts can impact performance. What could I do to test/fix this?
I have read that large differences in estimated vs. actual row counts can impact performance. What could I do to test/fix this?
Yes, this is true. It particularly affects optimizations involving join algorithms, aggregation algorithms, and indexes.
But it is not true for your query. Your query has to do a nested loops join with no indexes. All pairs of values in the two tables need to be compared. There is little algorithmic flexibility and (standard) indexes cannot really help.
For better performance, use the minScoreHint parameter. This allows the function to skip the full similarity calculation and exit early for many pairs.
So this should run quicker:
SELECT A.KEY_1
,A.FULL_TEXT_1
,B.FULL_TEXT_2
,B.KEY_2
,MDS_DB.MDQ.SIMILARITY(A.FULL_TEXT_1,B.FULL_TEXT_2, 2, 0, 0, 0.9) AS confidence
FROM #TABLE_1 A
CROSS JOIN #TABLE_2 B
WHERE MDS_DB.MDQ.SIMILARITY(A.FULL_TEXT_1,B.FULL_TEXT_2, 2, 0, 0, 0.9) >= 0.9
It is not clear from the docs whether pairs scoring exactly 0.9 would be included. If not, change 0.9 to 0.89.
The link provided by scsimon will help you prove whether it's statistics or not. Have the estimates changed significantly compared to when it was running fast?
Parallelism springs to mind. If the query was going parallel, but now isn't (e.g. because a server setting or the statistics have changed), then that could cause significant performance degradation.
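If statistics are the suspect, a quick check (a sketch, assuming SQL Server, since the query uses # temp tables) is to refresh them and compare the estimated plan before and after:

-- Refresh statistics on the temp tables, then re-generate the estimated plan
UPDATE STATISTICS #TABLE_1;
UPDATE STATISTICS #TABLE_2;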
vw_Project is a view that involves 20 CTEs, joins them multiple times, and returns 56 columns.
Many of these CTEs are self-joins (the classic "last row per group"; in our case we get the last related product / customer / manager object per Project).
Most of the tables (maybe 40?) involved don't exceed 1,000 rows, and the view itself returns 634 rows.
We are trying to improve the very bad performance of this view.
We denormalized (went from TPT to TPH) and reduced the number of joins by half, with almost no impact.
But I don't understand the following results I am obtaining:
select * from vw_Project (TPT): 2 sec
select * from vw_Project (TPH): 2 sec
select Id from vw_Project (TPH; TPT is instant): 6 sec
select 1 from vw_Project (TPH; TPT is instant): 6 sec
select count(1) from vw_Project (TPH; TPT is instant): 6 sec
Execution plan for the last one (6 sec):
https://www.brentozar.com/pastetheplan/?id=r1DqRciBW
Execution plan after sp_updatestats:
https://www.brentozar.com/pastetheplan/?id=H1Cuwsor-
To me, that seems absurd. I don't understand what's happening, and it's hard to know whether my optimization strategies are relevant since I have no idea what explains the apparently irrational behavior I'm observing...
Any clue?
CTEs have no guaranteed order of execution, and 20 CTEs are far too many in my opinion. You can use OPTION (FORCE ORDER) to force execution from top to bottom.
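For example (a sketch only, using the view name from the question):

SELECT COUNT(1)
FROM vw_Project
OPTION (FORCE ORDER);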
For selecting a few thousand rows, however, anything more than 1 sec is not acceptable regardless of complexity. I would choose a table function approach so I would have the luxury of creating hash (#) tables or table variables inside and have full control of each step. This way you limit the optimizer's scope to each step alone, as in the sketch below.
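A minimal sketch of that approach, with hypothetical table and column names (the real schema isn't shown, and only one of the 56 columns is illustrated); note that a multi-statement table function can use table variables but not # temp tables:

CREATE FUNCTION dbo.fn_Project()
RETURNS @result TABLE (Id int, LastProductId int)
AS
BEGIN
    -- Step 1: materialize the "last row per group" lookup on its own
    DECLARE @lastProduct TABLE (ProjectId int PRIMARY KEY, ProductId int);

    INSERT INTO @lastProduct (ProjectId, ProductId)
    SELECT ProjectId, ProductId
    FROM (
        SELECT ProjectId, ProductId,
               ROW_NUMBER() OVER (PARTITION BY ProjectId ORDER BY CreatedOn DESC) AS rn
        FROM dbo.ProjectProduct          -- hypothetical source table
    ) x
    WHERE rn = 1;

    -- Step 2: join the small, already-reduced intermediate result
    INSERT INTO @result (Id, LastProductId)
    SELECT p.Id, lp.ProductId
    FROM dbo.Project p                   -- hypothetical source table
    LEFT JOIN @lastProduct lp ON lp.ProjectId = p.Id;

    RETURN;
END;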
I have a table of IP addresses and a table of IP address ranges (start ip, end ip) that I'd like to join together. I've been able to make this work with the following query:
SELECT * FROM `ips` i
JOIN `ranges` a
ON NET.SAFE_IP_FROM_STRING(i.ip)
BETWEEN NET.SAFE_IP_FROM_STRING(a.start_ip)
AND NET.SAFE_IP_FROM_STRING(a.end_ip)
The problem I'm having is that it scales really badly. To do it for 10 IPs takes around 8 seconds, 100 takes 30 seconds, and 1000 takes a few minutes. I'd like to be able to do this for tens of millions of rows. (I have tried writing the output of NET.SAFE_IP_FROM_STRING to the ranges table, but it only speeds things up by around 10%, and doesn't help with scaling).
The ranges don't overlap, so for every row in the input table I expect 0 or 1 rows in the output table. A LATERAL JOIN would let me do that and almost certainly speed things up, but I don't think BigQuery supports them. Is there any other way to make this query faster and scalable?
After reviewing the article at https://cloudplatform.googleblog.com/2014/03/geoip-geolocation-with-google-bigquery.html that was linked to in Felipe's answer, I was able to put something together that is incredibly fast and scales really well. As Felipe alluded to, the trick is to do a direct join on a prefix (I went with /16) and then filter with a BETWEEN. I'm pre-processing the ranges to split anything larger than a /16 into multiple blocks. I then overwrite the table with this query, which adds some additional fields:
SELECT *,
  NET.SAFE_IP_FROM_STRING(start_ip) AS start_b,
  NET.SAFE_IP_FROM_STRING(end_ip) AS end_b,
  NET.IP_TRUNC(NET.SAFE_IP_FROM_STRING(start_ip), 16) AS prefix
FROM `ranges`
The join query then looks something like this:
SELECT * FROM `ips` i
JOIN `ranges` a
ON a.prefix = NET.IP_TRUNC(NET.SAFE_IP_FROM_STRING(i.ip), 16)
WHERE NET.SAFE_IP_FROM_STRING(i.ip) BETWEEN a.start_b AND a.end_b
Joining 10 million IPs to 1 million ranges now takes less than 30 seconds at billing tier 1!
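For completeness, the pre-split step isn't shown above; here is a hedged sketch of one way to do it in a single pass (IPv4 only, assuming start_ip / end_ip are stored as strings). It emits one row per /16 block covered by each range, clipping start_b / end_b to the block, so it replaces both the pre-processing and the field-adding query:

SELECT
  r.*,
  NET.IPV4_FROM_INT64(GREATEST(NET.IPV4_TO_INT64(NET.SAFE_IP_FROM_STRING(r.start_ip)),
                               block * 65536)) AS start_b,
  NET.IPV4_FROM_INT64(LEAST(NET.IPV4_TO_INT64(NET.SAFE_IP_FROM_STRING(r.end_ip)),
                            block * 65536 + 65535)) AS end_b,
  NET.IPV4_FROM_INT64(block * 65536) AS prefix    -- same /16 prefix used in the join
FROM `ranges` r,
UNNEST(GENERATE_ARRAY(
  DIV(NET.IPV4_TO_INT64(NET.SAFE_IP_FROM_STRING(r.start_ip)), 65536),
  DIV(NET.IPV4_TO_INT64(NET.SAFE_IP_FROM_STRING(r.end_ip)), 65536))) AS block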
I did something like this on https://stackoverflow.com/a/20156581
I'll need to update my queries for #standardSQL, but the basic secret is generating a smaller JOIN area.
If you can share a sample dataset, I'll be happy to provide a new query.