How to optimize SQL Server code?

I have a table with the columns: Id, time, value.
First step: given a signal id, a start time, and an end time as input parameters, extract the rows with that signal id whose time falls between the start time and the end time.
Second: assume the first step selected 100 rows. Given another input parameter, max_num, I want to further select max_num samples out of those 100 rows, uniformly spaced. For example, if max_num is set to 10, I will select rows 1, 11, 21, ..., 91 out of the 100.
I am not sure the stored procedure below is optimal; if you find any inefficiencies in the code, please point them out and offer suggestions.
create procedure data_selection
@sig_id bigint,
@start_time datetime2,
@end_time datetime2,
@max_num float
AS
BEGIN
declare @tot float
declare @step int
declare @selected table (id int identity(1,1) not null primary key, Date datetime2, Value real)
-- first step
insert into @selected (Date, Value) select Date, Value from Table
where Id = @sig_id
and Date >= @start_time and Date <= @end_time
order by Date
-- second step
select @tot = count(1) from @selected
set @step = ceiling(@tot / @max_num)
select * from @selected
where id % @step = 1
END

EDITED to calculate step on the fly. I had first thought this was an argument.
;with data as (
select row_number() over (order by [Date]) as rn, *
from Table
where Id = @sig_id and Date between @start_time and @end_time
), calc as (
select cast(ceiling(max(rn) / @max_num) as int) as step from data
)
select * from data cross apply calc as c
where (rn - 1) % step = 0 --and rn <= (@max_num - 1) * step + 1
Or I guess you can just order/filter by your identity value as you already had it:
;with calc as (select cast(ceiling(max(id) / @max_num) as int) as step from @selected)
select * from @selected cross apply calc as c
where (id - 1) % step = 0 --and id <= (@max_num - 1) * step + 1
I think that because you're rounding step up with ceiling you'll easily find scenarios where you get fewer rows than @max_num. You might want to round down instead: case when floor(max(rn) / @max_num) = 0 then 1 else floor(max(rn) / @max_num) end as step?
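A minimal sketch of that floor-based step wired into the CTE form above (same placeholder Table and parameters; note the previously commented-out cap becomes necessary, since rounding down can otherwise return more than @max_num rows):

;with data as (
    select row_number() over (order by [Date]) as rn, *
    from Table
    where Id = @sig_id and Date between @start_time and @end_time
), calc as (
    -- round down, but never let step fall to zero
    select case when floor(max(rn) / @max_num) = 0 then 1
                else cast(floor(max(rn) / @max_num) as int) end as step
    from data
)
select * from data cross apply calc as c
where (rn - 1) % step = 0 and rn <= (@max_num - 1) * step + 1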

Related

SQL Loop to build Case based on Variables

I am trying to build a CASE query based on variables.
The idea is that when the variables are populated, the CASE statement alters accordingly.
My current query takes values from a table and groups them together into a sort of bucket.
This works fine provided it's always the same set ranges and number of ranges; I want to make this configurable by passing variables.
From my original query, all I wanted was to configure the number of buckets and the From and To value of each bucket, i.e. +5 or +10.
Here is my original query:
SELECT subq.Bucket, COUNT(*) 'Count'
FROM
(
SELECT
CASE
WHEN R.Value < 10 THEN '0-10'
WHEN R.Value Between 10 and 20 THEN '10-20'
WHEN R.Value Between 20 and 30 THEN '20-30'
WHEN R.Value Between 30 and 40 THEN '30-40'
WHEN R.Value > 40 THEN '40+'
END Bucket
FROM Table R
Where DateTime Between '2022-10-01' and '2022-11-10' and Type = 1
) subq
GROUP BY subq.Bucket
This is what I was trying to accomplish, if it makes any sense in the realm of SQL:
DECLARE @NoRows Int, @Range Int, @Count Int, @StartRange Int
Set @NoRows = 5
Set @StartRange = 0
Set @Range = 10
Set @Count = 0
SELECT subq.Bucket, COUNT(*) 'Count'
FROM
(
WHILE @NoRows <= @Count
BEGIN
SELECT
(
CASE
WHEN R.Value Between @StartRange and @Range THEN '@StartRange-@Range'
SET @Count = @Count + 1
SET @StartRange = @StartRange + @Range
END
WHEN R.Value > @StartRange THEN '@StartRange'
END Bucket
FROM Table R
Where DateTime Between '2022-10-01' and '2022-11-10' and Type = 1
) subq
GROUP BY subq.Bucket
This is untested, due to no sample data, but this should be enough to get you to where you need to be. I use an inline tally here to generate the numbers, but you could also use a tally function, or even build your own bucket function:
DECLARE @NoRows int = 5,
        @Range int = 10,
        @StartRange int = 0;

WITH N AS(
    SELECT N
    FROM (VALUES(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL))N(N)),
Tally AS(
    SELECT TOP(@NoRows)
        ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS I
    FROM N N1, N N2), --Up to 100 rows; add more cross joins for more rows
Buckets AS(
    SELECT @StartRange + ((I-1)*@Range) AS RangeStart,
           @StartRange + ((I)*@Range) AS RangeEnd
    FROM Tally)
SELECT YT.{Needed Columns},
       CASE WHEN B.RangeStart IS NULL THEN CONCAT(@NoRows * @Range,'+')
            ELSE CONCAT(B.RangeStart,'-', B.RangeEnd-1)
       END AS Bucket
FROM dbo.YourTable YT
     LEFT JOIN Buckets B ON YT.YourColumn >= B.RangeStart
                        AND YT.YourColumn < B.RangeEnd;
In SQL Server 2022+, you even have the built-in function GENERATE_SERIES, which makes this even easier.
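A minimal sketch of the Buckets CTE rebuilt on GENERATE_SERIES (SQL Server 2022+; same variables as above, and the function's output column is named value):

DECLARE @NoRows int = 5,
        @Range int = 10,
        @StartRange int = 0;

WITH Buckets AS(
    SELECT @StartRange + ((value - 1) * @Range) AS RangeStart,
           @StartRange + (value * @Range) AS RangeEnd
    FROM GENERATE_SERIES(1, @NoRows)  -- one row per bucket; no tally CTEs needed
)
SELECT RangeStart, RangeEnd
FROM Buckets;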

Incremental Group BY

How can I achieve incremental grouping in a query?
I need to group all the non-zero values into different named groups.
Please help me write a query based on the columns date and subscribers.
If you have SQL Server 2012 or newer, you can use a few tricks with window functions to get this kind of grouping without cursors, with something like this:
select
    Date, Subscribers,
    case when Subscribers = 0 then 'No group'
         else 'Group' + convert(varchar, GRP) end as GRP
from (
    select
        Date, Subscribers,
        sum(GRP) over (order by Date asc) as GRP
    from (
        select
            *,
            case when Subscribers > 0 and
                 isnull(lag(Subscribers) over (order by Date asc), 0) = 0 then 1 else 0 end as GRP
        from SubscribersCountByDay S
    ) X
) Y
Example in SQL Fiddle
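To make the behaviour concrete, here is a hand trace (mine, not SQL Fiddle output) against the dbo.SubscribersCountByDay sample rows defined in the next answer:

-- Date        Subscribers  lag   flag  sum(GRP)  result
-- 2015-10-01  1            NULL  1     1         Group1
-- 2015-10-02  2            1     0     1         Group1
-- 2015-10-03  0            2     0     1         No group
-- 2015-10-04  4            0     1     2         Group2
-- 2015-10-05  5            4     0     2         Group2
-- 2015-10-06  0            5     0     2         No group
-- 2015-10-07  7            0     1     3         Group3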
In general I advocate AGAINST cursors, but in this case it will not hurt, since it will iterate, sum up, and do the conditional all in one pass.
Also note I hinted it with FAST_FORWARD so as not to degrade performance.
I'm guessing you do want what @HABO commented.
See the working example below; it just sums up until it finds a zero, then resets and starts again. Note the and @Sum > 0 handles the case where the first row is zero.
create table dbo.SubscribersCountByDay
(
[Date] date not null
,Subscribers int not null
)
GO
insert into dbo.SubscribersCountByDay
([Date], Subscribers)
values
('2015-10-01', 1)
,('2015-10-02', 2)
,('2015-10-03', 0)
,('2015-10-04', 4)
,('2015-10-05', 5)
,('2015-10-06', 0)
,('2015-10-07', 7)
GO
declare
@Date date
,@Subscribers int
,@Sum int = 0
,@GroupId int = 1
declare @Result as Table
(
GroupName varchar(10) not null
,[Sum] int not null
)
declare ScanIt cursor fast_forward
for
(
select [Date], Subscribers
from dbo.SubscribersCountByDay
union
select '2030-12-31', 0 -- sentinel row to flush the last group
) order by [Date]
open ScanIt
fetch next from ScanIt into @Date, @Subscribers
while @@FETCH_STATUS = 0
begin
if (@Subscribers = 0 and @Sum > 0)
begin
insert into @Result (GroupName, [Sum]) values ('Group ' + cast(@GroupId as varchar(6)), @Sum)
set @GroupId = @GroupId + 1
set @Sum = 0
end
else begin
set @Sum = @Sum + @Subscribers
end
fetch next from ScanIt into @Date, @Subscribers
end
close ScanIt
deallocate ScanIt
select * from @Result
GO
For the OP: please post the table as text next time; just posting an image is lazy.
In a version of SQL Server modern enough to support CTEs you can use the following cursorless query:
-- Sample data.
declare @SampleData as Table ( Id Int Identity, Subscribers Int );
insert into @SampleData ( Subscribers ) values
-- ( 0 ), -- Test edge case when we have a zero first row.
( 200 ), ( 100 ), ( 200 ),
( 0 ), ( 0 ), ( 0 ),
( 50 ), ( 50 ), ( 12 ),
( 0 ), ( 0 ),
( 43 ), ( 34 ), ( 34 );
select * from @SampleData;
-- Run the query.
with ZerosAndRows as (
-- Add IsZero to indicate zero/non-zero and a row number to each row.
select Id, Subscribers,
case when Subscribers = 0 then 0 else 1 end as IsZero,
Row_Number() over ( order by Id ) as RowNumber
from @SampleData ),
Groups as (
-- Add a group number to every row.
select Id, Subscribers, IsZero, RowNumber, 1 as GroupNumber
from ZerosAndRows
where RowNumber = 1
union all
select FAR.Id, FAR.Subscribers, FAR.IsZero, FAR.RowNumber,
-- Increment GroupNumber only when we move from a non-zero row to a zero row.
case when Groups.IsZero = 1 and FAR.IsZero = 0 then Groups.GroupNumber + 1 else Groups.GroupNumber end
from ZerosAndRows as FAR inner join Groups on Groups.RowNumber + 1 = FAR.RowNumber
)
-- Display the results.
select Id, Subscribers,
case when IsZero = 0 then 'no group' else 'Group' + Cast( GroupNumber as VarChar(10) ) end as Grouped
from Groups
order by Id;
To see the intermediate results just replace the final select with select * from ZerosAndRows or select * from Groups.
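One caveat: the recursive member walks a single row per recursion level, and SQL Server stops at 100 levels by default, so with more than about 100 sample rows the final select needs a MAXRECURSION hint, e.g.:

select Id, Subscribers,
    case when IsZero = 0 then 'no group' else 'Group' + Cast( GroupNumber as VarChar(10) ) end as Grouped
from Groups
order by Id
option ( maxrecursion 0 ); -- 0 removes the default 100-level cap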

Group data without changing query flow

It's hard for me to explain what I want, so the title may be unclear, but I hope I can describe it with code.
I have some data with two important values: call them time t and value f(t). They are stored in a table, for example:
1 - 1000
2 - 1200
3 - 1100
4 - 1500
...
I want to plot a graph from this data, and the graph should contain N points. If the table has fewer rows than N, then we just return the table. But if it has more, we should group the points. For example, with N = Count/2, the example above becomes:
1 - (1000+1200)/2 = 1100
2 - (1100+1500)/2 = 1300
...
I wrote an SQL script (it works fine for N >> Count; MonitoringDateTime is t, and ResultCount is f(t)):
ALTER PROCEDURE [dbo].[usp_GetRequestStatisticsData]
@ResourceTypeID bigint,
@DateFrom datetime,
@DateTo datetime,
@EstimatedPointCount int
AS
BEGIN
SET NOCOUNT ON;
SET ARITHABORT ON;
declare @groupSize int;
declare @resourceCount int;
select @resourceCount = Count(*)
from ResourceType
where ID & @ResourceTypeID > 0
SELECT d.ResultCount
,MonitoringDateTime = d.GeneratedOnUtc
,ResourceType = a.ResourceTypeID,
ROW_NUMBER() OVER(ORDER BY d.GeneratedOnUtc asc) AS Row
into #t
FROM dbo.AgentData d
INNER JOIN dbo.Agent a ON a.CheckID = d.CheckID
WHERE d.EventType = 'Result' AND
a.ResourceTypeID & @ResourceTypeID > 0 AND
d.GeneratedOnUtc between @DateFrom AND @DateTo AND
d.Result = 1
select @groupSize = Count(*) / (@EstimatedPointCount * @resourceCount)
from #t
if @groupSize = 0 -- return all points
select ResourceType, MonitoringDateTime, ResultCount
from #t
else
select ResourceType, CAST(AVG(CAST(#t.MonitoringDateTime AS DECIMAL(18, 6))) AS DATETIME) MonitoringDateTime, AVG(ResultCount) ResultCount
from #t
where [Row] % @groupSize = 0
group by ResourceType, [Row]
order by MonitoringDateTime
END
, but it doesn't work for N ~= Count, and it spends a lot of time on inserts.
That is why I wanted to use CTEs, but they don't work with an if/else statement.
So I calculated a formula for the group number (to use in the GROUP BY clause), because we have
GroupNumber = Count < N ? Row : Row*NumberOfGroups
where Count is the number of rows in the table, and NumberOfGroups = Count/EstimatedPointCount.
Using some trivial mathematics we get the formula
GroupNumber = Row + (Row*Count/EstimatedPointCount - Row)*MAX(Count - Count/EstimatedPointCount,0)/(Count - Count/EstimatedPointCount)
but it doesn't work because of the Count aggregate function:
Column 'dbo.AgentData.ResultCount' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause.
My English is very bad and I know it (I'm trying to improve it), but hope dies last, so please advise.
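A minimal sketch of one way around that error, assuming the #t temp table built by the procedure above (and ignoring the @resourceCount factor for brevity): compute the total as a window aggregate in a CTE, so the GROUP BY expression references only columns:

;with t as (
    select ResourceType, MonitoringDateTime, ResultCount, [Row],
           count(*) over () as Total -- row count without a separate aggregate query
    from #t
)
select ResourceType,
       CAST(AVG(CAST(MonitoringDateTime AS DECIMAL(18, 6))) AS DATETIME) MonitoringDateTime,
       AVG(ResultCount) ResultCount
from t
group by ResourceType,
         case when Total <= @EstimatedPointCount then [Row]      -- few rows: one group per row
              else [Row] / (Total / @EstimatedPointCount) end    -- else: buckets of Total/N rows
order by MonitoringDateTime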
Results of the query:
SELECT d.ResultCount
, MonitoringDateTime = d.GeneratedOnUtc
, ResourceType = a.ResourceTypeID
FROM dbo.AgentData d
INNER JOIN dbo.Agent a ON a.CheckID = d.CheckID
WHERE d.GeneratedOnUtc between '2015-01-28' AND '2015-01-30' AND
a.ResourceTypeID & 1376256 > 0 AND
d.EventType = 'Result' AND
d.Result = 1
https://onedrive.live.com/redir?resid=58A31FC352FC3D1A!6118&authkey=!AATDebemNJIgHoo&ithint=file%2ccsv
Here's an example using NTILE and your simple sample data at the top of your question:
declare @samples table (ID int, sample int)
insert into @samples (ID, sample) values
(1, 1000),
(2, 1200),
(3, 1100),
(4, 1500)
declare @results int
set @results = 2
;With grouped as (
select *, NTILE(@results) OVER (order by ID) as nt
from @samples
)
select nt, AVG(sample) from grouped
group by nt
Which produces:
nt
-------------------- -----------
1                    1100
2                    1300
If @results is changed to 4 (or any higher number) then you just get back your original result set.
Unfortunately, I don't have your full data nor can I fully understand what you're trying to do with the full stored procedure, so the above would probably need to be adapted somewhat.
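If it helps, a rough sketch of that adaptation against the #t temp table and @EstimatedPointCount parameter from the question (untested; when there are fewer rows than tiles, NTILE gives each row its own tile, so the if/else branch disappears):

;with grouped as (
    select ResourceType, MonitoringDateTime, ResultCount,
           NTILE(@EstimatedPointCount) over (order by MonitoringDateTime) as nt -- add partition by ResourceType if tiles should be per resource
    from #t
)
select ResourceType,
       CAST(AVG(CAST(MonitoringDateTime AS DECIMAL(18, 6))) AS DATETIME) MonitoringDateTime,
       AVG(ResultCount) ResultCount
from grouped
group by ResourceType, nt
order by MonitoringDateTime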
I haven't tried it, but how about instead of
select ResourceType, CAST(AVG(CAST(#t.MonitoringDateTime AS DECIMAL( 18, 6))) AS DATETIME) MonitoringDateTime, AVG(ResultCount) ResultCount
from #t
where [Row] % @groupSize = 0
group by ResourceType, [Row]
order by MonitoringDateTime
perhaps something like
select ResourceType, CAST(AVG(CAST(#t.MonitoringDateTime AS DECIMAL( 18, 6))) AS DATETIME) MonitoringDateTime, AVG(ResultCount) ResultCount
from #t
group by ResourceType, convert(int,[Row]/@groupSize)
order by MonitoringDateTime
Maybe that points you in some new direction? By converting to int we are truncating everything after the decimal, so I'm hoping that will give you a better grouping. You might need to partition your row number by resource type for this to work.
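For what it's worth, a small illustration of that truncation with a hypothetical @groupSize of 3 (between two integers T-SQL division already truncates, so the convert(int, ...) just makes it explicit):

declare @groupSize int = 3
select [Row], convert(int, [Row] / @groupSize) as grp
from (values (1),(2),(3),(4),(5),(6),(7),(8),(9)) v([Row])
-- rows 1-2 -> 0, rows 3-5 -> 1, rows 6-8 -> 2, row 9 -> 3: buckets of @groupSize consecutive rows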

SQL Data Sampling

We have had a request to provide some data to an external company.
They require only a sample of the data. Simple, right? Wrong.
Here are their sampling criteria:
Total number of records divided by 720 (required sample size) - this gives the sampling interval (if the result is a fraction, round down to the next whole number).
Halve the sampling interval to get the starting point.
Return each record by adding on the sampling interval.
EXAMPLE:
10,000 Records - Sampling interval = 13 (10,000/720)
Starting Point = 6 (13/2 Rounded)
Return records 6, 19 (6+13), 32 (19+13), 45 (32+13) etc.....
Please can someone tell me how (if) something like this is possible in SQL.
If you have use of ROW_NUMBER(), then you can do this relatively easily.
SELECT
*
FROM
(
SELECT
ROW_NUMBER() OVER (ORDER BY a, b, c, d) AS record_id,
*
FROM
yourTable
)
AS data
WHERE
(record_id + 360) % 720 = 0
ROW_NUMBER() gives all your data a sequential identifier (this is important, as the id field must be unique and must NOT have ANY gaps). It also defines the order you want the data in (ORDER BY a, b, c, d).
With that id, if you use modulo (often the % operator), you can test whether a record is the 720th record, the 1440th record, etc. (because 720 % 720 = 0).
Then, if you offset your id value by 360, you can change the starting point of your result set.
EDIT
After re-reading the question, I see you don't want every 720th record, but uniformly selected 720 records.
As such, replace 720 with (SELECT COUNT(*) / 720 FROM yourTable)
And replace 360 with (SELECT (COUNT(*) / 720) / 2 FROM yourTable)
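Putting both substitutions together, a sketch of the final query (integer division throughout, so the spec's round-down falls out for free):

SELECT *
FROM (
    SELECT ROW_NUMBER() OVER (ORDER BY a, b, c, d) AS record_id, *
    FROM yourTable
) AS data
WHERE (record_id + (SELECT (COUNT(*) / 720) / 2 FROM yourTable))
      % (SELECT COUNT(*) / 720 FROM yourTable) = 0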
EDIT
Ignoring the rounding conditions will allow a result of exactly 720 records. This requires using non-integer values, and the result of the modulo being less than 1.
WHERE
(record_id + (SELECT COUNT(*) FROM yourTable) / 1440.0)
%
((SELECT COUNT(*) FROM yourTable) / 720.0)
<
1.0
declare @sample_size int, @starting_point int
select @sample_size = 200
select top (@sample_size) col1, col2, col3, col4
from (
select *, row_number() over (order by col1, col2) as row
from your_table
) t
where (row % ((select count(*) from your_table) / @sample_size)) - ((select count(*) from your_table) / @sample_size / 2) = 0
It's going to work in SQL Server 2005+.
TOP (@variable) is used to limit rows (the where condition alone, because of integer rounding, might not be enough and may return more rows than needed), and ROW_NUMBER() to number and order the rows.
Working example: https://data.stackexchange.com/stackoverflow/query/62315/sql-data-sampling, code below:
declare @tab table (id int identity(1,1), col1 varchar(3), col2 varchar(3))
declare @i int
set @i = 0
while @i <= 1000
begin
insert into @tab
select 'aaa', 'bbb'
set @i = @i + 1
end
declare @sample_size int
select @sample_size = 123
select ((select count(*) from @tab) / @sample_size) as sample_interval
select top (@sample_size) *
from (
select *, row_number() over (order by col1, col2, id desc) as row
from @tab
) t
where (row % ((select count(*) from @tab) / @sample_size)) - ((select count(*) from @tab) / @sample_size / 2) = 0
SQL Server has a built-in clause for this:
SELECT FirstName, LastName
FROM Person.Person
TABLESAMPLE (10 PERCENT) ;
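One caveat: TABLESAMPLE samples whole pages rather than individual rows, so the row count is approximate and the rows are not uniformly spaced, which may not satisfy the interval requirement above; the REPEATABLE option at least makes the sample stable while the data is unchanged:

SELECT FirstName, LastName
FROM Person.Person
TABLESAMPLE (10 PERCENT) REPEATABLE (42); -- same seed, same pages sampled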
You can use RANK to get a row number. The following code will create 10,000 records in a table, then select the 6th, 19th, 32nd, etc., for a total of 769 rows.
CREATE TABLE Tbl (
Data varchar (255)
)
GO
DECLARE @i int
SET @i = 0
WHILE (@i < 10000)
BEGIN
INSERT INTO Tbl (Data) VALUES (CONVERT(varchar(255), NEWID()))
SET @i = @i + 1
END
GO
DECLARE @interval int
DECLARE @start int
DECLARE @total int
SELECT @total = COUNT(*),
@start = FLOOR(COUNT(*) / 720) / 2,
@interval = FLOOR(COUNT(*) / 720)
FROM Tbl
PRINT 'Start record: ' + CAST(@start as varchar(10))
PRINT 'Interval: ' + CAST(@interval as varchar(10))
SELECT rank, Data
FROM (
SELECT rank()
OVER (ORDER BY t.Data) as rank, t.Data AS Data
FROM Tbl t) q
WHERE ((rank + 1) + @start) % @interval = 0

Clearing prioritized overlapping ranges in SQL Server

This one is nasty complicated to solve.
I have a table containing date ranges, each date range has a priority. Highest priority means this date range is the most important.
Or in SQL
create table #ranges (Start int, Finish int, Priority int)
insert #ranges values (1 , 10, 0)
insert #ranges values (2 , 5 , 1)
insert #ranges values (3 , 4 , 2)
insert #ranges values (1 , 5 , 0)
insert #ranges values (200028, 308731, 0)
Start Finish Priority
----------- ----------- -----------
1 10 0
2 5 1
3 4 2
1 5 0
200028 308731 0
I would like to run a series of SQL queries on this table that will result in the table having no overlapping ranges: higher-priority ranges take precedence over lower ones, ranges are split off as required, and duplicate ranges are removed. Gaps are allowed.
So the result should be:
Start Finish Priority
----------- ----------- -----------
1 2 0
2 3 1
3 4 2
4 5 1
5 10 0
200028 308731 0
Anyone care to give a shot at the SQL? I would also like it to be as efficient as possible.
This is most of the way there; a possible improvement would be joining up adjacent ranges of the same priority (see the sketch after the query). It's full of cool trickery.
select Start, cast(null as int) as Finish, cast(null as int) as Priority
into #processed
from #ranges
union
select Finish, NULL, NULL
from #ranges
update p
set Finish = (
select min(p1.Start)
from #processed p1
where p1.Start > p.Start
)
from #processed p
create clustered index idxStart on #processed(Start, Finish, Priority)
create index idxFinish on #processed(Finish, Start, Priority)
update p
set Priority =
(
select max(r.Priority)
from #ranges r
where
(
(r.Start <= p.Start and r.Finish > p.Start) or
(r.Start >= p.Start and r.Start < p.Finish)
)
)
from #processed p
delete from #processed
where Priority is null
select * from #processed
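A sketch of the adjacent-range merge mentioned above (assuming SQL Server 2012+ for lag; it collapses consecutive #processed rows that touch and share a priority):

;with marked as (
    select Start, Finish, Priority,
           case when lag(Finish) over (order by Start) = Start
                 and lag(Priority) over (order by Start) = Priority
                then 0 else 1 end as is_break -- 1 marks the start of a new merged range
    from #processed
), grouped as (
    select Start, Finish, Priority, sum(is_break) over (order by Start) as grp
    from marked
)
select min(Start) as Start, max(Finish) as Finish, Priority
from grouped
group by grp, Priority
order by Start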
Here is something to get you started. It is helpful if you use a calendar table:
CREATE TABLE dbo.Calendar
(
dt SMALLDATETIME NOT NULL
PRIMARY KEY CLUSTERED
)
GO
SET NOCOUNT ON
DECLARE @dt SMALLDATETIME
SET @dt = '20000101'
WHILE @dt < '20200101'
BEGIN
INSERT dbo.Calendar(dt) SELECT @dt
SET @dt = @dt + 1
END
GO
Code to setup the problem:
create table #ranges (Start DateTime NOT NULL, Finish DateTime NOT NULL, Priority int NOT NULL)
create table #processed (dt DateTime NOT NULL, Priority int NOT NULL)
ALTER TABLE #ranges ADD PRIMARY KEY (Start,Finish, Priority)
ALTER TABLE #processed ADD PRIMARY KEY (dt)
declare @day0 datetime,
@day1 datetime,
@day2 datetime,
@day3 datetime,
@day4 datetime,
@day5 datetime
select @day0 = '2000-01-01',
@day1 = @day0 + 1,
@day2 = @day1 + 1,
@day3 = @day2 + 1,
@day4 = @day3 + 1,
@day5 = @day4 + 1
insert #ranges values (@day0, @day5, 0)
insert #ranges values (@day1, @day4, 1)
insert #ranges values (@day2, @day3, 2)
insert #ranges values (@day1, @day4, 0)
Actual solution:
DECLARE @start datetime, @finish datetime, @priority int
WHILE 1=1 BEGIN
SELECT TOP 1 @start = start, @finish = finish, @priority = priority
FROM #ranges
ORDER BY priority DESC, start, finish
IF @@ROWCOUNT = 0
BREAK
INSERT INTO #processed (dt, priority)
SELECT dt, @priority FROM calendar
WHERE dt BETWEEN @start and @finish
AND NOT EXISTS (SELECT * FROM #processed WHERE dt = calendar.dt)
DELETE FROM #ranges WHERE @start=start AND @finish=finish AND @priority=priority
END
Results: SELECT * FROM #processed
dt Priority
----------------------- -----------
2000-01-01 00:00:00.000 0
2000-01-02 00:00:00.000 1
2000-01-03 00:00:00.000 2
2000-01-04 00:00:00.000 2
2000-01-05 00:00:00.000 1
2000-01-06 00:00:00.000 0
The solution is not in the exact same format, but the idea is there.
I'm a little confused about what you want to end up with. Is this the same as simply having a set of dates where one range continues until the next one starts (in which case you don't really need the Finish date, do you?)
Or can a range Finish and there's a gap until the next one starts sometimes?
If the range Start and Finish are explicitly set, then I'd be inclined to leave both, but have the logic to apply the higher priority during the overlap. I'd suspect that if dates start getting adjusted, you'll eventually need to roll back a range that got shaved, and the original setting will be gone.
And you'll never be able to explain "how it got that way".
Do you want simply a table with a row for each date, including its priority value? Then when you have a new rule, you can bump the dates that would be trumped by the new rule?
I did a medical office scheduling app once that started with work/vacation/etc. requests with range-type data (plus a default work-week template.) Once I figured out to store the active schedule info as user/date/timerange records, things fell into place a lot more easily. YMMV.
This can be done in one SQL statement (I first made the query in Oracle using lag and lead, but since MSSQL doesn't support those functions I rewrote the query using row_number; I'm not sure the result is MSSQL compliant, but it should be very close):
with x as (
    select rdate
         , row_number() over (order by rdate) rn
    from (
        select Start rdate
        from ranges
        union
        select Finish rdate
        from ranges
    ) dates
)
select d.[begin]
     , d.[end]
     , max(r.Priority)
from (
    select b.rdate [begin]
         , e.rdate [end]
    from x b
       , x e
    where b.rn = e.rn - 1
) d
   , ranges r
where r.Start <= d.[begin]
  and r.Finish >= d.[end]
  and d.[begin] <> d.[end]
group by d.[begin]
       , d.[end]
order by 1, 2
I first made a table (x) with all the dates. Then I turned it into buckets by joining x with itself on consecutive row numbers. After this I linked all the possible priorities to the result. Taking max(priority) gives the requested result.