SQL Server - Weird Index Usage

So here is the original query I'm working with:
SELECT TOP (10) *
FROM Orders o
WHERE o.DateAdded >= DATEADD(DAY, -30, GETUTCDATE())
  AND o.DateAdded <= GETUTCDATE()
ORDER BY o.DateAdded ASC,
         o.Price ASC,
         o.Quantity DESC
Data types:
DateAdded - smalldatetime
Price - decimal(19,8)
Quantity - int
I have an index on the Orders table with the same 3 columns in the same order, so when I run this, it's perfect: execution time is essentially 0 ms, and Live Query Statistics shows it only reads the 10 rows. Awesome.
However, as soon as I add this line to the WHERE clause:
AND o.Price BETWEEN convert(decimal(19,8), 0) AND @BuyPrice
it all goes to hell (and unfortunately I need that line). It also behaves the same if it's just o.Price <= @BuyPrice. Live Query Statistics shows ~30k rows read, and it also shows that the o.Price comparison isn't being used as a seek predicate; I'm having a hard time understanding why it isn't. I've verified @BuyPrice is the right data type, as I found several articles that discuss issues with implicit conversions. At first I thought it was because I had two ranges (first DateAdded, then Price), but I have other queries using multi-column indexes with multiple ranges and they all perform just fine. I'm absolutely baffled as to why this one has decided to be a burden. I've tried changing the order of columns in the index and changing them from ASC to DESC, but nada.
I would highly appreciate anyone telling me what I'm missing. Thanks.

It is impossible for the optimizer to use two range predicates on the same index as seek predicates at the same time.
Think about it: it starts scanning from a certain spot in the index, sorted by DateAdded. Within each individual DateAdded value it would then need to seek to a particular Price, start scanning, stop at another Price, and then jump to the next DateAdded value.
This is called skip-scanning. It is only efficient when the first column has few distinct values to step through; otherwise it is inefficient, and because of this only Oracle has implemented it, not SQL Server.
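Not an explanation of the plan, but a skip scan can be emulated by hand when the 30-day window contains relatively few distinct DateAdded values. A rough sketch, assuming the (DateAdded, Price, Quantity) index and a @BuyPrice variable as in the question:
SELECT TOP (10) x.*
FROM (
    SELECT DISTINCT o.DateAdded
    FROM Orders o
    WHERE o.DateAdded >= DATEADD(DAY, -30, GETUTCDATE())
      AND o.DateAdded <= GETUTCDATE()
) AS d
CROSS APPLY (
    SELECT TOP (10) o.*
    FROM Orders o
    WHERE o.DateAdded = d.DateAdded   -- equality on the leading key,
      AND o.Price <= @BuyPrice        -- so Price can now be a seek predicate
    ORDER BY o.Price ASC, o.Quantity DESC
) AS x
ORDER BY x.DateAdded ASC, x.Price ASC, x.Quantity DESC;
Each APPLY call is a separate seek, so the total work grows with the number of distinct DateAdded values in the window.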

I think this is due to the TOP 10, which cannot take place before the ORDER BY.
And this ORDER BY must wait until the result set is ready.
Without your additional price range, the TOP 10 can be taken from the existing index directly. But adding the second range forces another operation to run first.
In short:
First, the filter must find the rows matching both the date range and the price range.
Then the resulting set is sorted and the top 10 rows are taken.
Did you try adding a separate index on your Price column? This should speed up the first filter.
We cannot predict the execution plan in many cases, but you might try to:
write an intermediate set, filtered by the date range, into a temp table and proceed from there. You might even create an index on the Price column there (depends on the expected row count; probably the best option - a sketch of this follows below).
use a CTE to define a set filtered by the date range and use this set to apply your price range. But a CTE is not the same as a temp table, and the final execution plan might be the same as before...
use two CTEs to define two sets (one per range) and use an INNER JOIN as a way to get the same result as with WHERE condition1 AND condition2.
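A minimal sketch of the temp-table idea, assuming the Orders columns and a @BuyPrice variable as in the question (whether it actually beats the original plan has to be tested):
SELECT o.DateAdded, o.Price, o.Quantity   -- plus whatever other columns you need
INTO #Recent
FROM Orders o
WHERE o.DateAdded >= DATEADD(DAY, -30, GETUTCDATE())
  AND o.DateAdded <= GETUTCDATE();

CREATE INDEX IX_Recent_Price ON #Recent (Price);

SELECT TOP (10) *
FROM #Recent
WHERE Price BETWEEN convert(decimal(19,8), 0) AND @BuyPrice
ORDER BY DateAdded ASC, Price ASC, Quantity DESC;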

Related

Why does using CONVERT(DATETIME, [date], [format]) in WHERE clause take so long?

I'm running the following code on a dataset of 100M to test some things out before I eventually join the entire range (not just the top 10) on another table to make it even smaller.
SELECT TOP 10 *
FROM Table
WHERE CONVERT(datetime, DATE, 112) BETWEEN '2020-07-04 00:00:00' AND '2020-07-04 23:59:59'
The table isn't mine but a client's, so unfortunately I'm not responsible for the data types of the columns. The DATE column, along with the rest of the data, is in varchar. As for the dates in the BETWEEN clause, I just put in a relatively small range for testing.
I have heard that CONVERT shouldn't be in the WHERE clause, but I need to convert it to dates in order to filter. What is the proper way of going about this?
Going to summarise my comments here, as they are "second class citizens" and thus could be removed.
Firstly, the reason your query is slow is the CONVERT on the column DATE in your WHERE. Applying functions to a column in your WHERE will almost always make your query non-SARGable (there are some exceptions, but that doesn't make them a good idea). As a result, the entire table must be scanned to find rows that are applicable for your WHERE; it can't use an index to help it.
The real problem, therefore, is that you are storing a date (and time) value in your table as a non-date (and time) data type, presumably a (n)varchar. This is, in truth, a major design flaw and needs to be fixed. String values aren't validated to be valid dates, so someone could easily insert the "date" '20210229' or even '20211332'. Fixing the design not only stops this, but also makes your data smaller (a date is 3 bytes in size, a varchar(8) would be 10 bytes), and you could pass strongly typed date and time values to your query and it would be SARGable.
"Fortunately" it appears your data is in the style code 112, which is yyyyMMdd; this at least means that the ordering of the dates is the same as if it were a strongly typed date (and time) data type. This means that the below query will work and return the results you want:
SELECT TOP 10 * --Ideally don't use * and list your columns properly
FROM dbo.[Table]
WHERE [DATE] >= '20200704' AND [DATE] < '20200705'
ORDER BY {Some Column};
You can use this to get better performance:
SELECT TOP 10 *
FROM Table
WHERE cast(DATE as date) BETWEEN '2020-07-04' AND '2020-07-04'
  AND cast(DATE as time) BETWEEN '00:00:00' AND '23:59:59'
There's no need to include the time portion if you want to search a full day.

SQL to group time intervals by arbitrary time period

I need help with this SQL query. I have a big table with the following schema:
time_start (timestamp) - start time of the measurement,
duration (double) - duration of the measurement in seconds,
count_event1 (int) - number of measured events of type 1,
count_event2 (int) - number of measured events of type 2
I am guaranteed that no rows will overlap - in SQL terms, there are no two rows such that time_start1 < time_start2 AND time_start1 + duration1 > time_start2.
I would like to design an efficient SQL query which would group the measurements by some arbitrary time period (I call it the group_period), for instance 3 hours. I have already tried something like this:
SELECT
ROUND(time_start/group_period,0) AS time_period,
SUM(count_event1) AS sum_event1,
SUM(count_event2) AS sum_event2
FROM measurements
GROUP BY time_period;
However, there seems to be a problem. If there is a measurement with a duration greater than the group_period, I would expect such a measurement to be grouped into all the time periods it belongs to, but since the duration is never taken into account, it gets grouped only into the first one. Is there a way to fix this?
Performance is of concern to me because, in time, I expect the table to grow considerably, reaching millions, possibly tens or hundreds of millions of rows. Do you have any suggestions for indexes or any other optimizations to improve the speed of this query?
Based on Timekiller's advice, I have come up with the following query:
-- Since there's a problem with declaring variables in PostgreSQL,
-- we will be using aliases for the arguments required by the script.
-- First some configuration:
-- group_period = 3600 -- group by 1 hour (= 3600 seconds)
-- min_time = 1440226301 -- Sat, 22 Aug 2015 06:51:41 GMT
-- max_time = 1450926301 -- Thu, 24 Dec 2015 03:05:01 GMT
-- Calculate the number of started periods in the given interval in advance.
-- period_count = CEIL((max_time - min_time) / group_period)
SET TIME ZONE UTC;
BEGIN TRANSACTION;
-- Create a temporary table and fill it with all time periods.
CREATE TEMP TABLE periods (period_start TIMESTAMP)
ON COMMIT DROP;
INSERT INTO periods (period_start)
SELECT to_timestamp(min_time + group_period * coefficient)
FROM generate_series(0, period_count) as coefficient;
-- Group data by the time periods.
-- Note that we don't require exact overlap of intervals:
-- A. [period_start, period_start + group_period]
-- B. [time_start, time_start + duration]
-- This would yield the best possible result but it would also slow
-- down the query significantly because of the part B.
-- We require only: period_start <= time_start <= period_start + group_period
SELECT
period_start,
COUNT(measurements.*) AS count_measurements,
SUM(count_event1) AS sum_event1,
SUM(count_event2) AS sum_event2
FROM periods
LEFT JOIN measurements
ON time_start BETWEEN period_start AND (period_start + group_period)
GROUP BY period_start;
COMMIT TRANSACTION;
It does exactly what I was going for, so mission accomplished. However, I would still appreciate it if anybody could give me some feedback on the performance of this query under the following conditions:
I expect the measurements table to have about 500-800 million rows.
The time_start column is the primary key and has a unique btree index on it.
I have no guarantees about min_time and max_time. I only know that the group period will be chosen so that 500 <= period_count <= 2000.
(This turned out way too large for a comment, so I'll post it as an answer instead).
Adding to my comment on your answer: you should probably go with getting the best (most accurate) results first and optimize later if it turns out to be slow.
As for performance, one thing I've learned while working with databases is that you can't really predict performance. Query optimizers in advanced DBMS are complex and tend to behave differently on small and large data sets. You'll have to get your table filled with some large sample data, experiment with indexes and read the results of EXPLAIN, there's no other way.
There are a few things to suggest, though I know the Oracle optimizer much better than the Postgres one, so some of them might not work.
Things will be faster if all the fields you're checking against are included in the index. Since you're performing a left join and periods is the driving table, there's probably no reason to index it; it will be read in full either way. duration should be included in the index, though, if you're going to go with proper interval overlap - that way Postgres won't have to fetch the row to calculate the join condition; the index will suffice. Chances are it will not even fetch the table rows at all, since it needs no data other than what exists in the indexes. I think it will perform better if duration is included as the second column of the time_start index - at least in Oracle it would - but IIRC Postgres is able to combine indexes, so perhaps a second, separate index would perform better. You'll have to check it with EXPLAIN.
Indexes and math don't mix well. Even if duration is included in the index, there's no guarantee it will be used for (time_start + duration) - though, again, look at EXPLAIN first. If it's not used, try to either create a function-based (expression) index - that is, index time_start + duration as an expression - or alter the structure of the table a bit so that time_start + duration is a separate column, and index that column instead.
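A hypothetical expression index along those lines, assuming duration is stored as seconds (check with EXPLAIN whether the planner actually uses it):
CREATE INDEX measurements_end_idx
    ON measurements ((time_start + duration * interval '1 second'));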
If you don't really need the left join (that is, you're fine with missing empty periods), then use an inner join instead - the optimizer will likely start with the larger table (measurements) and join periods against it, possibly using a hash join instead of nested loops. If you do that, then you should also index your periods table in the same fashion, and perhaps restructure it the same way, so that it contains the start and end of each period explicitly; the optimizer has even more options when it doesn't have to perform any operations on the columns.
Perhaps most important: if you have max_time and min_time, USE THEM to limit the set of measurements before joining! The smaller your sets, the faster it will work.
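For example, a sketch of the same join with measurements pre-filtered by the known bounds, reusing the min_time / max_time / group_period placeholders from the query above:
SELECT
  period_start,
  COUNT(m.time_start)  AS count_measurements,
  SUM(m.count_event1)  AS sum_event1,
  SUM(m.count_event2)  AS sum_event2
FROM periods
LEFT JOIN (
  SELECT time_start, count_event1, count_event2
  FROM measurements
  WHERE time_start BETWEEN to_timestamp(min_time)
                       AND to_timestamp(max_time + group_period)
) AS m
  ON m.time_start BETWEEN period_start AND (period_start + group_period)
GROUP BY period_start;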

SQL: Minimising rows in subqueries/partitioning

So here's an odd thing. I have limited SQL access to a database - the most relevant restriction here being that if I create a query, a maximum of 10,000 rows is returned.
Anyway, I've been trying to have a query return individual case details, but only at busy times - say when 50+ cases are attended to in an hour. So, I inserted the following line:
COUNT(CaseNo) OVER (PARTITION BY DATEADD(hh,
DATEDIFF(hh, 0, StartDate), 0)) AS CasesInHour
... And then used this as a subquery, selecting only those cases where CasesInHour >= 50
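For reference, the full pattern looks roughly like this (the table name Cases is made up; CaseNo and StartDate are as above):
SELECT *
FROM (
    SELECT c.*,
           COUNT(CaseNo) OVER (PARTITION BY DATEADD(hh,
               DATEDIFF(hh, 0, StartDate), 0)) AS CasesInHour
    FROM Cases c
) AS t
WHERE CasesInHour >= 50;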
However, it turns out that the 10,000 rows limit affects the partitioning - when I tried to run over a longer period nothing came up, as it was counting the cases in any given hour from only a (fairly random) much smaller selection.
Can anyone think of a way to get around this limit? The final total returned will be much lower than 10,000 rows, but it will be looking at far more than 10,000 as a starting point.
If this is really MySQL we're talking about, sql_big_selects and max_join_size affect the number of rows examined, not the number of rows "returned". So you'll need to reduce the number of rows examined by being more selective and using proper indexes.
For example, the following query may be examining over 10,000 rows:
SELECT * FROM stats
To limit the selectivity, you might want to grab only the rows from the last 30 days:
SELECT * FROM stats
WHERE created > DATE_SUB(NOW(), INTERVAL 30 DAY)
However, this only reduces the number of rows examined if there is an index on the created column and the cardinality of the index is sufficient to reduce the rows examined.
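For instance, a hypothetical index on that column (the index name is made up):
CREATE INDEX idx_stats_created ON stats (created);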

Converting a string to a date time in SQL

I'm importing data from a different system and the datetime is stored as string in this format:
20061105084755ES
yyyymmddhhmmss(es/ed) where es is EST and ed is EDT.
I will have to query this table for the last 30 days. I'm using the conversion query:
select convert(
    datetime,
    left(cdts, 4) + '-' + substring(cdts, 5, 2) + '-' + substring(cdts, 7, 2) + ' ' + substring(cdts, 9, 2) + ':' + substring(cdts, 11, 2) + ':' + substring(cdts, 13, 2)
) as dt
from tb1
where dt < getdate() - 30
I'm looking for a more efficient query that will reduce the time taken. This table has around 90 million records and the query runs forever.
No calculation at runtime is going to speed this query up if you are performing the calculation and then need to filter against the result of the calculation - SQL Server will be forced to perform a table scan. The main problem is that you've chosen to store your dates as a string. For a variety of reasons, this is a terrible decision. Is the string column indexed at least? If so, then this may help get the data only from the last 30 days:
DECLARE @ThirtyDays CHAR(8);
SET @ThirtyDays = CONVERT(CHAR(8), DATEADD(DAY, DATEDIFF(DAY, 0, GETDATE()), 0) - 30, 112);
SELECT ...
WHERE cdts >= @ThirtyDays;
If you need to return all the data from all of history except the past 30 days, this isn't going to help either, because unless you are only pulling data from the indexed column, the most efficient approach for retrieving most of the data in the table is a clustered index scan. (If you are retrieving a narrow set of columns, it may opt for a nonclustered index scan, if you have a covering index.) So your bottleneck in most of these scenarios is not something a formula can fix, but rather the time it takes to actually retrieve a large volume of data, transmit it over the network, and render it on the client.
Also, as an aside, you can't do this:
SELECT a + b AS c FROM dbo.somewhere
WHERE c > 10;
c doesn't exist in dbo.somewhere; it is an expression derived in the SELECT list. The SELECT list is logically evaluated second to last (right before ORDER BY), so you can't reference something in the WHERE clause that doesn't exist yet. Typical workarounds are to repeat the expression or to use a subquery / CTE.
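For instance, a minimal sketch of the CTE workaround for the dbo.somewhere example above:
WITH x AS (
    SELECT a + b AS c
    FROM dbo.somewhere
)
SELECT c
FROM x
WHERE c > 10;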
One potential option is to add a date column to your table and populate that information on load. This way the conversion is all done before you need to query for it.
Then, make sure you have an index on that field which the actual query can take advantage of.
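A rough sketch of that approach, reusing the tb1 / cdts names from the question; the column name cdts_dt and the index name are made up, and the backfill assumes every cdts value really is in the yyyymmddhhmmss(es/ed) format described above:
ALTER TABLE dbo.tb1 ADD cdts_dt datetime NULL;

-- one-time backfill; new rows would be populated on load
UPDATE dbo.tb1
SET cdts_dt = CONVERT(datetime, STUFF(STUFF(STUFF(LEFT(cdts, 14), 9, 0, ' '), 12, 0, ':'), 15, 0, ':'));

CREATE INDEX IX_tb1_cdts_dt ON dbo.tb1 (cdts_dt);

-- the query can then seek on the new column
SELECT ...
FROM dbo.tb1
WHERE cdts_dt >= DATEADD(DAY, -30, GETDATE());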
convert(datetime, stuff(stuff(stuff(left(datevalue, 14), 9, 0, ' '), 12, 0, ':'), 15, 0, ':'))
or
convert(datetime,
        dateadd(SECOND,
                right(left(DateValue, 14), 2) / 1,
                dateadd(MINUTE,
                        right(left(DateValue, 14), 4) / 100,
                        dateadd(HOUR,
                                right(left(DateValue, 14), 6) / 10000,
                                '1900-01-01'))))
+ convert(datetime, LEFT(DateValue, 8))
(left(DateValue, 14) strips the trailing es/ed time-zone marker before converting, and the time part is built as a datetime so the two values can be added.)

BigQuery gives Response Too Large error for whole dataset but not for equivalent subqueries

I have a table in BigQuery with the following fields:
time,a,b,c,d
time is a string in ISO8601 format but with a space, a is an integer from 1 to 16000, and the other columns are strings. The table contains one month's worth of data, and there are a few million records per day.
The following query fails with "response too large":
select UTC_USEC_TO_DAY(PARSE_UTC_USEC(time)) as day,b,c,d,count(a),count(distinct a, 1000000)
from [myproject.mytable]
group by day,b,c,d
order by day,b,c,d asc
However, this query works (the data starts at 2012-01-01)
select UTC_USEC_TO_DAY(PARSE_UTC_USEC(time)) as day,
b,c,d,count(a),count(distinct a)
from [myproject.mytable]
where UTC_USEC_TO_DAY(PARSE_UTC_USEC(time)) = UTC_USEC_TO_DAY(PARSE_UTC_USEC('2012-01-01 00:00:00'))
group by day,b,c,d
order by day,b,c,d asc
This looks like it might be related to this issue. However, because of the group by clause, the top query is equivalent to repeatedly calling the second query. Is the query planner not able to handle this?
Edit: To clarify my test data:
I am using fake test data I generated. I originally used several fields and tried to get hourly summaries for a month (group by hour, where hour is defined by an expression in the SELECT part of the query, as above). When that failed I tried switching to daily. When that failed I reduced the columns involved. That also failed when using count(distinct xxx, 1000000), but it worked when I just did one day's worth. (It also works if I remove the 1000000 parameter, but since it does work with the one-day query, it seems the query planner is not separating things as I would expect.)
The column checked with count(distinct) has cardinality 16,000, and the group by columns have cardinalities 2 and 20, for a total of just 1,200 expected rows. Column values are quite short, around ten characters.
How many results do you expect? There is currently a limitation of about 64MB on the total size of results that are allowed. If you're expecting millions of rows as a result, then this may be an expected error.
If the number of results isn't extremely large, it may be that the size problem is not in the final response, but in the internal calculation. Specifically, if there are too many results from the GROUP BY, the query can run out of memory. One possible solution is to change "GROUP BY" to "GROUP EACH BY", which alters the way the query is executed. This is a feature that is currently experimental and, as such, is not yet documented.
For your query, since you reference fields computed in the select in the group by, you might need to do this:
select day, b, c, d, count(a), count(distinct a, 1000000)
FROM (
  select UTC_USEC_TO_DAY(PARSE_UTC_USEC(time)) as day, a, b, c, d
  from [myproject.mytable]
)
group EACH by day, b, c, d
order by day, b, c, d asc