understanding sql query times - sql

I'm using an existing Oracle database (that I did not construct, and know nothing about beyond its table structure). Some queries are pretty fast, and other seemingly very similar ones are very slow. For example
SELECT a.price, c.banner_id, c.short_name
FROM ret_price_current a
JOIN ret_store b ON a.store_id = b.store_id
JOIN ret_banner c ON b.banner_id = c.banner_id
JOIN ret_store2cbsa_csa d ON a.store_id = d.store_id
WHERE rownum<3
(1.09, 74, 'Safeway')
(1.09, 74, 'Safeway')
that took 0.243073940277 seconds
but if I add a seemingly simple WHERE condition:
SELECT a.price, c.banner_id, c.short_name
FROM ret_price_current a
JOIN ret_store b ON a.store_id = b.store_id
JOIN ret_banner c ON b.banner_id = c.banner_id
JOIN ret_store2cbsa_csa d ON a.store_id = d.store_id
WHERE c.banner_id = 74
AND rownum<3
it has been running without returning for many minutes now. What is going on? (For reference, ret_price_current has ~300m entries and the others are much smaller.) I imagine it has to do with indices -- can someone point me to a book about database algorithms (like how queries actually work on the back end) so I can understand wtf is going on?

The reason is that ROWNUM is generated on the rows as they are outputted.
Your first query has no critera, therefore it will spit out the first 3 rows and be done with it. You can generally find any 3 rows that match pretty fast.
Your second has to find 3 rows that match the criteria before it can stop (and it might never find those 3 rows).
The queries are completely different hence the different times to execute.
The way to get this running fast would be to index c.banner_id (and, in fact, all your FKs).
oops - just noticed the timestamp on this by another answer. I'll leave this here anyway as it does answer the questions, as does one of the comments.

Related

SQL query to get overlapping entries between two big lists of ranges

I have a SQLite database A with numeric columns for start and stop that is quite large (1M rows). And I have a second list of numeric ranges B beginning and end that is medium (10K rows).
I would like to find the set of entries in A that overlap with ranges in B.
I could do this with a python script that iterates through list B and does 10K database queries, but I'm wondering if there's a more SQLish way to do it. List B could potentially be slurped into the database as an indexed TEMP TABLE if that helps the process.
Possible simplification, though not optimal, is that list A could be treated as a single location, position, allowing us to only look for A.position that fall inside B.beginning and B.end.
One trick I use to speed this up is to define a CHUNK. This can be as simple as the midpoint of the start and end, divided by a chunksize and then cast as an integer. To build off #Gordon Linoff's answer, you could use a 10k window chunk as follows:
with a_chunk as (
select a.*, cast((a.start+a.end)/(2*10000) as integer) as CHUNK
from a
),
b_chunk as (
select b.*, cast((b.start+b.end)/(2*10000) as integer) as CHUNK
from b
)
select ac.*, bc.*
from a_chunk ac join b_chunk bc
on ac.CHUNK = bc.CHUNK
and ac.start < bc.end
and ac.end > bc.start;
This divides your search space so that rather than joining every row in a against every row in b, you're only joining entries within the same 10k-width window. This should still be an O(m*n) operation but will be considerably faster due to the restricted search space and smaller m/n sizes.
However, this comes with caveats. For instance, the intervals (9995, 9999) and (9998, 10008) will get placed in different chunks despite being clearly overlapping, and your resultant query would miss that. Therefore you can get your edge cases by replacing the single select statement with
select ac.*, bc.*
from a_chunk ac join
b_chunk bc
on ac.CHUNK = bc.CHUNK - 1
and ac.start < bc.end
and ac.end > bc.start
union
select ac.*, bc.*
from a_chunk ac join
b_chunk bc
on ac.CHUNK = bc.CHUNK
and ac.start < bc.end
and ac.end > bc.start
union
select ac.*, bc.*
from a_chunk ac join
b_chunk bc
on ac.CHUNK = bc.CHUNK + 1
and ac.start < bc.end
and ac.end > bc.start;
Even this isn't perfect though. If you have intervals significantly larger than your 10k window size, you could likely still overlook some results. Increasing the window size to accommodate this would come at the cost of joining more entries at a time, which the chunks were designed to avoid. The best balance will likely be finding an appropriate window size and then covering edge cases by including enough UNIONs to include on ac.CHUNK = bc.CHUNK + {-n...n} for however large you think n should be.
Rather than using a CTE, you can also speed this up in SQLite by hard-coding CHUNK as a column in your tables and then creating clustered indexes on each table for (CHUNK, start). You may or may not benefit from including end in this index as well, though you'll have to EXPLAIN QUERY PLAN on your specific case to see whether the optimizer actually does this. The trade-off, of course, is increased storage space, which may not be ideal depending on what you're trying to do.
This admittedly feels like a hack and I'm trying to answer a similar question for my own project. I've heard that the only efficient solution is to manually take the data and implement an interval tree. However, with millions of rows, I'm not sure how efficient it would be to take this from sqlite and build a tree manually in your programming language of choice. If anyone has any better solutions I'd be happy to hear. At least in python, the ncls library seems like it could get the job done.
You can easily express this in SQL as a join. For partial overlaps, this would be:
select a.*, b.*
from a join
b
on a.start < b.end and a.end > b.start;
However, this will be slow, because it will be doing a nested loop comparison. So, although concise, this won't necessarily be much faster.

Access query speed differs

I have a local access database and in it a query which takes values from a form to populate a drop down menu. The weird (to me) thing is that with most options this query is quick (blink of an eye), but with a few options it's very slow (>10 seconds).
What the query is does is a follows: It populates a dropdown menu to record animals seen at a specific sighting, but only those animals which have not been recorded at that specific sighting yet (to avoid duplicate entries).
SELECT DISTINCT tblAnimals.AnimalID, tblAnimals.Nickname, tblAnimals.Species
FROM tblSightings INNER JOIN (tblAnimals INNER JOIN tblAnimalsatSighting ON tblAnimals.AnimalID = tblAnimalsatSighting.AnimalID) ON tblSightings.SightingID = tblAnimalsatSighting.SightingID
WHERE (((tblAnimals.Species)=[form]![Species]) AND ((tblAnimals.CurrentGroup)=[form]![AnimalGroup2]) AND ((tblAnimals.[Dead?])=False) AND ((Exists (select tblAnimalsatSighting.AnimalID FROM tblAnimalsatSighting WHERE tblAnimals.AnimalID = tblAnimalsatSighting.AnimalID AND tblAnimalsatSighting.SightingID = [form]![SightingID]))=False));
It performs well for all groups of 2 of the 4 possible species, for 1 species it performs well for 4 of the 5 groups, but not for the last group, and for the last species it performs very slowly for both groups. Anybody an idea what can be the cause of this kind of behavior? Is it problems with the query? Or duplicate entries in the tables which can cause this? I don't think it's duplicates in the tables, I've checked that, and there are some, but they appear both for groups where there are problems and where there aren't. Could I re-write the query so it performs faster?
As noted in our comments above, you confirmed that the extra joins were not really need and were in fact going to limit the results to animal that had already had a sighting. Those joins would also likely contribute to a slowdown.
I know that Access probably added most of the parentheses automatically but I've removed them and converted the subquery to a not exists form that's a lot more readable.
SELECT tblAnimals.AnimalID, tblAnimals.Nickname, tblAnimals.Species
FROM tblAnimals
WHERE
tblAnimals.Species = [form]![Species]
AND tblAnimals.CurrentGroup = [form]![AnimalGroup2]
AND tblAnimals.[Dead?] = False
AND NOT EXISTS (
SELECT tblAnimalsatSighting.AnimalID
FROM tblAnimalsatSighting
WHERE
tblAnimals.AnimalID = tblAnimalsatSighting.AnimalID
AND tblAnimalsatSighting.SightingID = [form]![SightingID]
);

Speed discrepancies when selecting different records in same table

I've got a pretty basic SQL query that's become a bottleneck in my processing. It's selecting a large varchar(999) column that's slowing it down. Removing that column from the select speeds it up considerably so I know it's the column that's causing problem.
I was experimenting with breaking it up into smaller 300 record batches to see if that helped and I saw something weird happening. Some of the batches were taking almost 30 seconds, and some were taking 0.012 seconds. I don't know what's causing this discrepancy.
I have a reproducible scenario where the first query is taking many times faster than the 2nd:
select r.ID, r.FileID, r.Data
from Calls c
join RawData r on r.ID = c.ID
join DataFiles f on f.ID = r.FileID
where r.ID between 1118482415 and 1118509835
0.3 seconds
select r.ID, r.FileID, r.Data
from Calls c
join RawData r on r.ID = c.ID
join DataFiles f on f.ID = r.FileID
where r.ID between 1115330220 and 1118482415
8 seconds
I see no visible differences in the returned data. They both return 300 records and all of the record's "Data" column values are about 170 characters long. I'm running this directly from the SqlStudio client. Also there's no other traffic in this database.
Does anybody know what could be causing this problem or have any suggestions to try? I can't decrease the size of the column because there are some bigger records in there, just not in this example. I do have indexes on all the columns used in the joins (Calls.ID, RawData.ID, RawData.FileID, DataFiles.ID).

Group By seems to add inordinate amount of computation in a simple Postgres query

I have a Postgres query as such:
select id,
from ads_1 as a
join ads_2 as b
on a.id_key = b.id_key
where b.date between '2014-01-01' and '2014-01-02'
group by id
order by id;
It's nothing fancy but works fine -- only takes about 3 minutes when querying a large database to return the result.
My question is, why does this slight modification to the above code cause the time for the query to more-than quadruple?
select id, b.ad_description
from ads_1 as a
join ads_2 as b
on a.id_key = b.id_key
where b.date between '2014-01-01' and '2014-01-02'
group by id, b.ad_description
order by id;
What is going on? The mere inclusion of one simple column of (albeit unique) information is bogging my query down. I am somehow asking Postgres to do a tremendously larger amount of work. For the life of me, I don't see how.
I'd like to preemptively apologize for not including any raw data. I'm hoping this simplified example of what I'm really facing is clear enough for some kind soul to make an enlightening comment. I can say that I'm going over a million rows in each table.
Thanks in advance.
i think the size of the return set is the issue. since your result set now has a millionish rows, that would explain the extra time (although it does seem excessive). a couple of things. first, the first column in the select is not bound to a range variable, it probably doesn't matter. but, you might want to make it select b.id, b.ad_description instead of id, b.ad_description. second, normally i use a group by when one of the columns is an aggregate that isn't in the group by statement, like a 'count()' or something. maybe i'm missing something, but, you might get the same result with:
select distinct b.id, b.ad_description
from ads_1 as a
join ads_2 as b
on a.id_key = b.id_key
where b.date between '2014-01-01' and '2014-01-02'
order by b.id, b.ad_description;
you might want to play with some of the numbers in the postgresql.conf file to beef up workspace/query space.
finally, i have this funny hunch that the order by and the group by matching might also help performance.
the LIMIT 5 wouldn't help much with performance, because the entire query would need to be finished before the first 5 rows would come out.
-g

Why does changing the where clause on this criteria reduce the execution time so drastically?

I ran across a problem with a SQL statement today that I was able to fix by adding additional criteria, however I really want to know why my change fixed the problem.
The problem query:
SELECT *
FROM
(SELECT ah.*,
com.location,
ha.customer_number,
d.name applicance_NAME,
house.name house_NAME,
dr.name RULE_NAME
FROM actionhistory ah
INNER JOIN community com
ON (t.city_id = com.city_id)
INNER JOIN house_address ha
ON (t.applicance_id = ha.applicance_id
AND ha.status_cd = 'ACTIVE')
INNER JOIN applicance d
ON (t.applicance_id = d.applicance_id)
INNER JOIN house house
ON (house.house_id = t.house_id)
LEFT JOIN the_rule tr
ON (tr.the_rule_id = t.the_rule_id)
WHERE actionhistory_id >= 'ACT100010000'
ORDER BY actionhistory_id
)
WHERE rownum <= 30000;
The "fix"
SELECT *
FROM
(SELECT ah.*,
com.location,
ha.customer_number,
d.name applicance_NAME,
house.name house_NAME,
dr.name RULE_NAME
FROM actionhistory ah
INNER JOIN community com
ON (t.city_id = com.city_id)
INNER JOIN house_address ha
ON (t.applicance_id = ha.applicance_id
AND ha.status_cd = 'ACTIVE')
INNER JOIN applicance d
ON (t.applicance_id = d.applicance_id)
INNER JOIN house house
ON (house.house_id = t.house_id)
LEFT JOIN the_rule tr
ON (tr.the_rule_id = t.the_rule_id)
WHERE actionhistory_id >= 'ACT100010000' and actionhistory_id <= 'ACT100030000'
ORDER BY actionhistory_id
)
All of the _id columns are indexed sequences.
The first query's explain plan had a cost of 372 and the second was 14. This is running on an Oracle 11g database.
Additionally, if actionhistory_id in the where clause is anything less than ACT100000000, the original query returns instantly.
This is because of the index on the actionhistory_id column.
During the first query Oracle has to return all the index blocks containing indexes for records that come after 'ACT100010000', then it has to match the index to the table to get all the records, and then it pulls 29999 records from the result set.
During the second query Oracle only has to return the index blocks containing records between 'ACT100010000' and 'ACT100030000'. Then it grabs from the table those records that are represented in the index blocks. A lot less work in that step of grabbing the record after having found the index than if you use the first query.
Noticing your last line about if the id is less than ACT100000000 - sounds to me that those records may all be in the same memory block (or in a contiguous set of blocks).
EDIT: Please also consider what is said by Justin - I was talking about actual performance, but he is pointing out that the id being a varchar greatly increases the potential values (as opposed to a number) and that the estimated plan may reflect a greater time than reality because the optimizer doesn't know the full range until execution. To further optimize, taking his point into consideration, you could put a function based index on the id column or you could make it a combination key, with the varchar portion in one column and the numeric portion in another.
What are the plans for both queries?
Are the statistics on your tables up to date?
Do the two queries return the same set of rows? It's not obvious that they do but perhaps ACT100030000 is the largest actionhistory_id in the system. It's also a bit confusing because the first query has a predicate on actionhistory_id with a value of TRA100010000 which is very different than the ACT value in the second query. I'm guessing that is a typo?
Are you measuring the time required to fetch the first row? Or the time required to fetch the last row? What are those elapsed times?
My guess without that information is that the fact that you appear to be using the wrong data type for your actionhistory_id column is affecting the Oracle optimizer's ability to generate appropriate cardinality estimates which is likely causing the optimizer to underestimate the selectivity of your predicates and to generate poorly performing plans. A human may be able to guess that actionhistory_id is a string that starts with ACT10000 and then has 30,000 sequential numeric values from 00001 to 30000 but the optimizer is not that smart. It sees a 13 character string and isn't able to figure out that the last 10 characters are always going to be numbers so there are only 10 possible values rather than 256 (assuming 8-bit characters) and that the first 8 characters are always going to be the same constant value. If, on the other hand, actionhistory_id was defined as a NUMBER and had values between 1 and 30000, it would be dramatically easier for the optimizer to make reasonable estimates about the selectivity of various predicates.