I've got a pretty basic SQL query that's become a bottleneck in my processing. It selects a large varchar(999) column, and that's what slows it down: removing the column from the select speeds it up considerably, so I know that column is causing the problem.
I was experimenting with breaking it up into smaller 300 record batches to see if that helped and I saw something weird happening. Some of the batches were taking almost 30 seconds, and some were taking 0.012 seconds. I don't know what's causing this discrepancy.
I have a reproducible scenario where the first query runs many times faster than the second:
select r.ID, r.FileID, r.Data
from Calls c
join RawData r on r.ID = c.ID
join DataFiles f on f.ID = r.FileID
where r.ID between 1118482415 and 1118509835
0.3 seconds
select r.ID, r.FileID, r.Data
from Calls c
join RawData r on r.ID = c.ID
join DataFiles f on f.ID = r.FileID
where r.ID between 1115330220 and 1118482415
8 seconds
I see no visible differences in the returned data. Both queries return 300 records, and every record's "Data" column value is about 170 characters long. I'm running this directly from the SqlStudio client. Also, there's no other traffic in this database.
Does anybody know what could be causing this problem or have any suggestions to try? I can't decrease the size of the column because there are some bigger records in there, just not in this example. I do have indexes on all the columns used in the joins (Calls.ID, RawData.ID, RawData.FileID, DataFiles.ID).
Scenario: Medical records reporting to state government which requires a pipe delimited text file as input.
Challenge: Select hundreds of values from a fact table and produce a wide result set to be (Redshift) UNLOADed to disk.
What I have tried so far is a SQL query that I want to turn into a VIEW.
;WITH
CTE_patient_record AS
(
SELECT
record_id
FROM fact_patient_record
WHERE update_date = <yesterday>
)
,CTE_patient_record_item AS
(
SELECT
fpri.record_id
,fpri.record_item_name
,fpri.record_item_value
FROM fact_patient_record_item fpri
INNER JOIN CTE_patient_record cpr ON fpri.record_id = cpr.record_id
)
Note that fact_patient_record has 87M rows and fact_patient_record_item has 97M rows.
The above code runs in 2 seconds for 2 test records and the CTE_patient_record_item CTE has about 200 rows per record for a total of about 400.
Now, produce the result set:
,CTE_result AS
(
SELECT
cpr.record_id
,cpri002.record_item_value AS diagnosis_1
,cpri003.record_item_value AS diagnosis_2
,cpri004.record_item_value AS medication_1
...
FROM CTE_patient_record cpr
INNER JOIN CTE_patient_record_item cpri002 ON cpr.record_id = cpri002.record_id
AND cpri002.record_item_name = 'diagnosis_1'
INNER JOIN CTE_patient_record_item cpri003 ON cpr.record_id = cpri003.record_id
AND cpri003.record_item_name = 'diagnosis_2'
INNER JOIN CTE_patient_record_item cpri004 ON cpr.record_id = cpri004.record_id
AND cpri004.record_item_name = 'medication_1'
...
) SELECT * FROM CTE_result
Result set looks like this:
record_id  diagnosis_1  diagnosis_2  medication_1  ...
100001     09           9B           88X           ...
...and then I use the Redshift UNLOAD command to write it to disk, pipe delimited.
I am testing this on a full production sized environment but only for 2 test records.
Those 2 test records have about 200 items each.
The output is 2 rows, 200 columns wide.
It takes 30 to 40 minutes to process just the 2 records.
You might ask me why I am joining on the item name which is a string. Basically there is no item id, no integer, to join on. Long story.
I am looking for suggestions on how to improve performance. With only 2 records, 30 to 40 minutes is unacceptable. What will happen when I have 1000s of records?
I have also tried making the VIEW a MATERIALIZED VIEW; however, it also takes 30 to 40 minutes (not surprisingly) to build the materialized view.
I am not sure which route to take from here.
Stored procedure? I have experience with stored procs.
Create new tables so I can create integer IDs to join on, plus indexes? However, my managers are "new table" averse.
?
I could just stop with the first two CTEs, pull the data down to python and process using pandas dataframe which I've done before successfully but it would be nice if I could have an efficient query, just use Redshift UNLOAD and be done with it.
Any help would be appreciated.
UPDATE: Many thanks to Paul Coulson and Bill Weiner for pointing me in the right direction! (Paul I am unable to upvote your answer as I am too new here).
Using (pseudo code):
MAX(CASE WHEN t1.name = 'somename' THEN t1.value END ) AS name
...
FROM table1 t1
reduced execution time from 30 minutes to 30 seconds.
EXPLAIN PLAN for the original solution is 2700 lines long, for the new solution using conditional aggregation is 40 lines long.
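For anyone who wants to see the conditional aggregation pattern end to end, here is a minimal sketch. It runs against an in-memory SQLite database through Python's sqlite3 module rather than Redshift, and only the 100001 values come from the sample row above; record 100002 and its values are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE fact_patient_record_item (
    record_id         INTEGER,
    record_item_name  TEXT,
    record_item_value TEXT
);
-- 100001 mirrors the sample row in the question; 100002 is made-up data.
INSERT INTO fact_patient_record_item VALUES
    (100001, 'diagnosis_1',  '09'),
    (100001, 'diagnosis_2',  '9B'),
    (100001, 'medication_1', '88X'),
    (100002, 'diagnosis_1',  '11'),
    (100002, 'medication_1', '42Y');
""")

# One pass over the item table: each CASE picks out one item name, and
# MAX collapses the rows of a record into a single wide row.
pivot_sql = """
SELECT record_id,
       MAX(CASE WHEN record_item_name = 'diagnosis_1'  THEN record_item_value END) AS diagnosis_1,
       MAX(CASE WHEN record_item_name = 'diagnosis_2'  THEN record_item_value END) AS diagnosis_2,
       MAX(CASE WHEN record_item_name = 'medication_1' THEN record_item_value END) AS medication_1
FROM fact_patient_record_item
GROUP BY record_id
ORDER BY record_id
"""
rows = conn.execute(pivot_sql).fetchall()

# Pipe-delimited output, with missing items written as empty fields.
for row in rows:
    print("|".join("" if value is None else str(value) for value in row))
```

One difference from the self-join version worth knowing about: a record that lacks an item (100002 has no diagnosis_2 here) still produces a row with a NULL in that column, whereas the chain of INNER JOINs would drop the record entirely.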
Thanks guys.
Without some more information it is impossible to know for sure what is going on, but what you are doing is likely not ideal. An explain plan and the execution time per step would help a bunch.
What I suspect is getting you is that you are reading a 97M row table 200 times. This will slow things down but shouldn't take 40 min. So I also suspect that record_item_name is not unique per value of record_id. This will lead to row replication and could be expanding the data set many fold. Also is record_id unique in fact_patient_record? If not then this will cause row replication. If all of this is large enough to cause significant spill and significant network broadcasting your 40 min execution time is very plausible.
There is no need to be joining when all the data is in a single copy of the table. #PhilCoulson is correct that some sort of conditional aggregation could be applied and the decode() syntax could save you space if you don't like case. Several of the above issues that might be affecting your joins would also make this aggregation complicated. What are you looking for if there are several values for record_item_value for each record_id and record_item_name pair? I expect you have some discovery of what your data holds in your future.
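A quick way to start that discovery is to count how many rows exist per (record_id, record_item_name) pair. The sketch below runs the check against an in-memory SQLite database from Python, purely so it is runnable; the same GROUP BY / HAVING query should work on Redshift. The sample rows are made up, with one deliberate duplicate.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE fact_patient_record_item (
    record_id         INTEGER,
    record_item_name  TEXT,
    record_item_value TEXT
);
-- Made-up data: record 100001 has two rows named 'diagnosis_1'.
INSERT INTO fact_patient_record_item VALUES
    (100001, 'diagnosis_1', '09'),
    (100001, 'diagnosis_1', '10'),
    (100001, 'diagnosis_2', '9B'),
    (100002, 'diagnosis_1', '11');
""")

# Any pair returned here would multiply rows in the self-join version
# and would force a choice (MAX, MIN, ...) in the aggregated version.
dupes = conn.execute("""
SELECT record_id, record_item_name, COUNT(*) AS n
FROM fact_patient_record_item
GROUP BY record_id, record_item_name
HAVING COUNT(*) > 1
""").fetchall()

print(dupes)
```

If this returns no rows on the real table, each item name is unique per record and the pivot is unambiguous.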
I have a SQLite table A with numeric columns start and stop that is quite large (1M rows), and a second list of numeric ranges B (beginning and end) that is medium-sized (10K rows).
I would like to find the set of entries in A that overlap with ranges in B.
I could do this with a python script that iterates through list B and does 10K database queries, but I'm wondering if there's a more SQLish way to do it. List B could potentially be slurped into the database as an indexed TEMP TABLE if that helps the process.
A possible simplification, though not optimal, is to treat each entry in A as a single location, position, so that we only need to look for A.position values that fall between B.beginning and B.end.
One trick I use to speed this up is to define a CHUNK. This can be as simple as the midpoint of the start and end, divided by a chunksize and then cast as an integer. To build off #Gordon Linoff's answer, you could use a 10k window chunk as follows:
with a_chunk as (
select a.*, cast((a.start+a.end)/(2*10000) as integer) as CHUNK
from a
),
b_chunk as (
select b.*, cast((b.start+b.end)/(2*10000) as integer) as CHUNK
from b
)
select ac.*, bc.*
from a_chunk ac join b_chunk bc
on ac.CHUNK = bc.CHUNK
and ac.start < bc.end
and ac.end > bc.start;
This divides your search space so that rather than joining every row in a against every row in b, you're only joining entries within the same 10k-width window. This should still be an O(m*n) operation but will be considerably faster due to the restricted search space and smaller m/n sizes.
However, this comes with caveats. For instance, the intervals (9995, 9999) and (9998, 10008) will get placed in different chunks despite being clearly overlapping, and your resultant query would miss that. Therefore you can get your edge cases by replacing the single select statement with
select ac.*, bc.*
from a_chunk ac join
b_chunk bc
on ac.CHUNK = bc.CHUNK - 1
and ac.start < bc.end
and ac.end > bc.start
union
select ac.*, bc.*
from a_chunk ac join
b_chunk bc
on ac.CHUNK = bc.CHUNK
and ac.start < bc.end
and ac.end > bc.start
union
select ac.*, bc.*
from a_chunk ac join
b_chunk bc
on ac.CHUNK = bc.CHUNK + 1
and ac.start < bc.end
and ac.end > bc.start;
Even this isn't perfect though. If you have intervals significantly larger than your 10k window size, you could likely still overlook some results. Increasing the window size to accommodate this would come at the cost of joining more entries at a time, which the chunks were designed to avoid. The best balance will likely be finding an appropriate window size and then covering edge cases by including enough UNIONs to include on ac.CHUNK = bc.CHUNK + {-n...n} for however large you think n should be.
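Here is a rough sketch of the "+/- n chunks" idea, run against in-memory SQLite from Python. It assumes you know an upper bound on interval length (MAX_LEN below is made up, as is the random test data), and it writes the neighbouring-chunk condition as a single BETWEEN instead of a stack of UNIONs. If two intervals overlap, their midpoints are less than MAX_LEN apart, so their chunk numbers differ by at most MAX_LEN // CHUNK_SIZE + 1. The script checks the chunked result against the plain join.

```python
import random
import sqlite3

CHUNK_SIZE = 10_000
MAX_LEN = 25_000  # assumed upper bound on interval length (toy data only)

random.seed(0)


def random_intervals(count):
    """Return `count` random (start, end) pairs no longer than MAX_LEN."""
    intervals = []
    for _ in range(count):
        start = random.randrange(0, 1_000_000)
        intervals.append((start, start + random.randrange(1, MAX_LEN + 1)))
    return intervals


conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE a (start INTEGER, "end" INTEGER)')
conn.execute('CREATE TABLE b (start INTEGER, "end" INTEGER)')
conn.executemany("INSERT INTO a VALUES (?, ?)", random_intervals(2000))
conn.executemany("INSERT INTO b VALUES (?, ?)", random_intervals(200))

# Number of neighbouring chunks to search on each side.
n = MAX_LEN // CHUNK_SIZE + 1

chunked_sql = """
WITH a_chunk AS (SELECT a.*, (a.start + a."end") / (2 * :size) AS chunk FROM a),
     b_chunk AS (SELECT b.*, (b.start + b."end") / (2 * :size) AS chunk FROM b)
SELECT ac.start, ac."end", bc.start, bc."end"
FROM a_chunk ac JOIN b_chunk bc
  ON ac.chunk BETWEEN bc.chunk - :n AND bc.chunk + :n
 AND ac.start < bc."end"
 AND ac."end" > bc.start
"""
naive_sql = """
SELECT a.start, a."end", b.start, b."end"
FROM a JOIN b ON a.start < b."end" AND a."end" > b.start
"""

chunked = sorted(conn.execute(chunked_sql, {"size": CHUNK_SIZE, "n": n}).fetchall())
naive = sorted(conn.execute(naive_sql).fetchall())
print(len(chunked), "pairs from the chunked join;", len(naive), "from the plain join")
```

Because each pair satisfies the BETWEEN condition at most once, no UNION is needed to remove duplicates. Whether a range condition on chunk is as index-friendly as the equality joins above is something to check with EXPLAIN QUERY PLAN on real data.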
Rather than using a CTE, you can also speed this up in SQLite by storing CHUNK as an actual column in your tables and then creating an index on (CHUNK, start) for each table. (SQLite has no separate clustered-index syntax; the closest equivalent is a WITHOUT ROWID table whose primary key starts with those columns.) You may or may not benefit from including end in this index as well; you'll have to run EXPLAIN QUERY PLAN on your specific case to see whether the optimizer actually uses it. The trade-off, of course, is increased storage space, which may not be ideal depending on what you're trying to do.
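A small sketch of the stored-column variant, again using in-memory SQLite from Python. The index names and the sample intervals are made up, and the plan text SQLite prints will vary with version and data, so treat the output as something to inspect rather than rely on.

```python
import sqlite3

CHUNK_SIZE = 10_000

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE a (start INTEGER, "end" INTEGER, chunk INTEGER);
CREATE TABLE b (start INTEGER, "end" INTEGER, chunk INTEGER);
-- Hypothetical index names; the columns follow the (CHUNK, start) suggestion.
CREATE INDEX idx_a_chunk_start ON a (chunk, start);
CREATE INDEX idx_b_chunk_start ON b (chunk, start);
""")


def insert(table, intervals):
    """Insert intervals, computing the midpoint chunk once at load time."""
    conn.executemany(
        f'INSERT INTO {table} (start, "end", chunk) VALUES (?, ?, ?)',
        [(s, e, (s + e) // (2 * CHUNK_SIZE)) for s, e in intervals],
    )


# Made-up sample data.
insert("a", [(100, 200), (9_000, 12_000), (50_000, 50_500)])
insert("b", [(150, 300), (49_000, 51_000)])

query = """
SELECT a.start, a."end", b.start, b."end"
FROM a JOIN b
  ON a.chunk = b.chunk
 AND a.start < b."end"
 AND a."end" > b.start
"""

# The last column of each plan row is the human-readable step description.
plan = conn.execute("EXPLAIN QUERY PLAN " + query).fetchall()
for step in plan:
    print(step[-1])

matches = sorted(conn.execute(query).fetchall())
print(matches)
```

If the plan shows a SEARCH step that names one of the indexes, the chunk column is doing its job; a plain SCAN of both tables would suggest the index is not being used for the join.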
This admittedly feels like a hack and I'm trying to answer a similar question for my own project. I've heard that the only efficient solution is to manually take the data and implement an interval tree. However, with millions of rows, I'm not sure how efficient it would be to take this from sqlite and build a tree manually in your programming language of choice. If anyone has any better solutions I'd be happy to hear. At least in python, the ncls library seems like it could get the job done.
You can easily express this in SQL as a join. For partial overlaps, this would be:
select a.*, b.*
from a join
b
on a.start < b.end and a.end > b.start;
However, this will be slow, because it will be doing a nested loop comparison. So, although concise, this won't necessarily be much faster than issuing the 10K queries from Python.
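To make the join concrete with the column names from the question (A.start/A.stop, B.beginning/B.end), here is a minimal sketch that loads list B into a TEMP TABLE from Python and runs the overlap join once. The id column and all of the sample values are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE a (id INTEGER PRIMARY KEY, start INTEGER, stop INTEGER)")
conn.executemany(
    "INSERT INTO a (start, stop) VALUES (?, ?)",
    [(1, 5), (10, 20), (30, 40), (100, 110)],  # made-up rows for table A
)

# List B lives in Python; slurp it into a temp table in one executemany call.
ranges_b = [(4, 12), (35, 36)]
conn.execute('CREATE TEMP TABLE b (beginning INTEGER, "end" INTEGER)')
conn.executemany("INSERT INTO b VALUES (?, ?)", ranges_b)

# DISTINCT because an entry in A may overlap more than one range in B.
rows = conn.execute("""
SELECT DISTINCT a.id, a.start, a.stop
FROM a JOIN b
  ON a.start < b."end"
 AND a.stop > b.beginning
ORDER BY a.id
""").fetchall()

print(rows)
```

This replaces 10K round trips with one statement, though the database still has to compare the two tables row by row unless an index or a chunking scheme narrows the search.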
I currently have two tables, cts(time, symbol, open, close, high, low, volume) and dividends(time, symbol, dividend). I am attempting to make a third table named dividend_percent with columns Time, Date and Percent. To get the percentage for the dividend, I believe the formula is ((close-(open+dividend))/open)*100.
However, the request exceeded the size allowed by Oracle XE and failed, but I don't believe my request should have been that big.
SQL> create table dividend_percent
2 as (select c.Time, c.Symbol, (((c.close-(c.open+d.dividend))/c.open)*100) PRCNT
3 from cts c inner join dividend d
4 on c.Symbol=d.Symbol);
from cts c inner join dividend d
*
ERROR at line 3:
ORA-12953: The request exceeds the maximum allowed database size of 11 GB
Am I writing the query wrong, or in a way that's really inefficient? The two tables are big, but I don't think they're too big.
Perhaps you could make a view which combines the two tables and performs the necessary calculations when needed:
CREATE VIEW DIVIDEND_PERCENT_VIEW AS
SELECT c.TIME,
c.SYMBOL,
((c.CLOSE - (c.OPEN + d.DIVIDEND)) / c.OPEN) * 100 AS PRCNT
FROM CTS c
INNER JOIN DIVIDEND d
ON c.SYMBOL = d.SYMBOL AND
c.TIME = d.TIME
WHERE c.OPEN <> 0;
This avoids storing a second copy of the data, and the PRCNT calculation applies to rows added after the view is created as well as to pre-existing rows.
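Note that the view joins on TIME as well as SYMBOL, which the original CREATE TABLE did not. My guess (not confirmed by the question) is that this is what blew up the size: joining on SYMBOL alone pairs every price row of a symbol with every dividend row of that symbol. A small sketch with made-up data in in-memory SQLite shows the difference in row counts:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE cts (time TEXT, symbol TEXT, open REAL, close REAL);
CREATE TABLE dividend (time TEXT, symbol TEXT, dividend REAL);
""")

# Made-up data: 2 symbols, 10 days each, one price row and one dividend row per day.
days = [f"2020-01-{day:02d}" for day in range(1, 11)]
for symbol in ("AAA", "BBB"):
    conn.executemany(
        "INSERT INTO cts VALUES (?, ?, 10.0, 11.0)",
        [(day, symbol) for day in days],
    )
    conn.executemany(
        "INSERT INTO dividend VALUES (?, ?, 0.5)",
        [(day, symbol) for day in days],
    )

by_symbol = conn.execute("""
SELECT COUNT(*) FROM cts c JOIN dividend d
  ON c.symbol = d.symbol
""").fetchone()[0]

by_symbol_and_time = conn.execute("""
SELECT COUNT(*) FROM cts c JOIN dividend d
  ON c.symbol = d.symbol AND c.time = d.time
""").fetchone()[0]

print("join on symbol only:    ", by_symbol)
print("join on symbol and time:", by_symbol_and_time)
```

With 10 rows per symbol in each table, the symbol-only join produces 10 x 10 rows per symbol. On tables with thousands of rows per symbol, the same quadratic growth would be far larger.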
Perhaps you could use a materialized view if you intend to perform DML operations as well as keep the table in sync.
I'm using an existing Oracle database (that I did not construct, and know nothing about beyond its table structure). Some queries are pretty fast, and other seemingly very similar ones are very slow. For example
SELECT a.price, c.banner_id, c.short_name
FROM ret_price_current a
JOIN ret_store b ON a.store_id = b.store_id
JOIN ret_banner c ON b.banner_id = c.banner_id
JOIN ret_store2cbsa_csa d ON a.store_id = d.store_id
WHERE rownum<3
(1.09, 74, 'Safeway')
(1.09, 74, 'Safeway')
that took 0.243073940277 seconds
but if I add a seemingly simple WHERE condition:
SELECT a.price, c.banner_id, c.short_name
FROM ret_price_current a
JOIN ret_store b ON a.store_id = b.store_id
JOIN ret_banner c ON b.banner_id = c.banner_id
JOIN ret_store2cbsa_csa d ON a.store_id = d.store_id
WHERE c.banner_id = 74
AND rownum<3
it has been running without returning for many minutes now. What is going on? (For reference, ret_price_current has ~300m entries and the others are much smaller.) I imagine it has to do with indices -- can someone point me to a book about database algorithms (like how queries actually work on the back end) so I can understand wtf is going on?
The reason is that ROWNUM is assigned to rows as they are output.
Your first query has no criteria, so it spits out the first 2 rows it finds and is done. Any 2 rows can generally be found very quickly.
Your second query has to find 2 rows that match the criteria before it can stop (and it might never find them).
The queries are completely different hence the different times to execute.
The way to get this running fast would be to index c.banner_id (and, in fact, all your FKs).
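As a toy model of the early-stop behaviour (plain Python, not how Oracle actually executes anything; the real plan depends on join order and available indexes), think of the query as a lazy row stream that stops as soon as enough rows have been output. The data below is made up, with the matching rows deliberately placed at the end as a worst case.

```python
from itertools import islice


def scan(rows, counter):
    """Yield rows one at a time, counting how many were examined."""
    for row in rows:
        counter[0] += 1
        yield row


def first_n(rows, predicate, n):
    """Return the first n rows passing predicate and the number of rows examined."""
    counter = [0]
    matching = (row for row in scan(rows, counter) if predicate(row))
    result = list(islice(matching, n))  # stops pulling rows once n are found
    return result, counter[0]


# Made-up table: 100,000 rows for another banner, then 2 rows for banner 74.
rows = [{"banner_id": 1, "price": 1.09}] * 100_000
rows += [{"banner_id": 74, "price": 1.09}] * 2

_, examined_without_filter = first_n(rows, lambda row: True, 2)
_, examined_with_filter = first_n(rows, lambda row: row["banner_id"] == 74, 2)

print("rows examined without filter:", examined_without_filter)
print("rows examined with filter:   ", examined_with_filter)
```

Without a filter, the stream stops after the first two rows. With the filter and no index to jump straight to the matching rows, it has to walk past everything that does not match first.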
oops - just noticed the timestamp on this by another answer. I'll leave this here anyway as it does answer the questions, as does one of the comments.
I ran across a problem with a SQL statement today that I was able to fix by adding additional criteria, however I really want to know why my change fixed the problem.
The problem query:
SELECT *
FROM
(SELECT ah.*,
com.location,
ha.customer_number,
d.name applicance_NAME,
house.name house_NAME,
tr.name RULE_NAME
FROM actionhistory ah
INNER JOIN community com
ON (ah.city_id = com.city_id)
INNER JOIN house_address ha
ON (ah.applicance_id = ha.applicance_id
AND ha.status_cd = 'ACTIVE')
INNER JOIN applicance d
ON (ah.applicance_id = d.applicance_id)
INNER JOIN house house
ON (house.house_id = ah.house_id)
LEFT JOIN the_rule tr
ON (tr.the_rule_id = ah.the_rule_id)
WHERE actionhistory_id >= 'ACT100010000'
ORDER BY actionhistory_id
)
WHERE rownum <= 30000;
The "fix"
SELECT *
FROM
(SELECT ah.*,
com.location,
ha.customer_number,
d.name applicance_NAME,
house.name house_NAME,
tr.name RULE_NAME
FROM actionhistory ah
INNER JOIN community com
ON (ah.city_id = com.city_id)
INNER JOIN house_address ha
ON (ah.applicance_id = ha.applicance_id
AND ha.status_cd = 'ACTIVE')
INNER JOIN applicance d
ON (ah.applicance_id = d.applicance_id)
INNER JOIN house house
ON (house.house_id = ah.house_id)
LEFT JOIN the_rule tr
ON (tr.the_rule_id = ah.the_rule_id)
WHERE actionhistory_id >= 'ACT100010000' and actionhistory_id <= 'ACT100030000'
ORDER BY actionhistory_id
)
All of the _id columns are indexed sequences.
The first query's explain plan had a cost of 372 and the second was 14. This is running on an Oracle 11g database.
Additionally, if actionhistory_id in the where clause is anything less than ACT100000000, the original query returns instantly.
This is because of the index on the actionhistory_id column.
With the first query, Oracle has to read all the index blocks containing entries for records at or after 'ACT100010000', then visit the table for each of those entries to get the records, and only then take the first 30,000 rows of the result set.
With the second query, Oracle only has to read the index blocks containing entries between 'ACT100010000' and 'ACT100030000', and then fetch from the table only the records those entries point to. That is a lot less work in the table-lookup step than in the first query.
Noticing your last line about if the id is less than ACT100000000 - sounds to me that those records may all be in the same memory block (or in a contiguous set of blocks).
EDIT: Please also consider what is said by Justin - I was talking about actual performance, but he is pointing out that the id being a varchar greatly increases the potential values (as opposed to a number) and that the estimated plan may reflect a greater time than reality because the optimizer doesn't know the full range until execution. To further optimize, taking his point into consideration, you could put a function based index on the id column or you could make it a combination key, with the varchar portion in one column and the numeric portion in another.
What are the plans for both queries?
Are the statistics on your tables up to date?
Do the two queries return the same set of rows? It's not obvious that they do but perhaps ACT100030000 is the largest actionhistory_id in the system. It's also a bit confusing because the first query has a predicate on actionhistory_id with a value of TRA100010000 which is very different than the ACT value in the second query. I'm guessing that is a typo?
Are you measuring the time required to fetch the first row? Or the time required to fetch the last row? What are those elapsed times?
My guess without that information is that the fact that you appear to be using the wrong data type for your actionhistory_id column is affecting the Oracle optimizer's ability to generate appropriate cardinality estimates which is likely causing the optimizer to underestimate the selectivity of your predicates and to generate poorly performing plans. A human may be able to guess that actionhistory_id is a string that starts with ACT10000 and then has 30,000 sequential numeric values from 00001 to 30000 but the optimizer is not that smart. It sees a 13 character string and isn't able to figure out that the last 10 characters are always going to be numbers so there are only 10 possible values rather than 256 (assuming 8-bit characters) and that the first 8 characters are always going to be the same constant value. If, on the other hand, actionhistory_id was defined as a NUMBER and had values between 1 and 30000, it would be dramatically easier for the optimizer to make reasonable estimates about the selectivity of various predicates.