SQL, Auxiliary table of numbers
For certain types of SQL queries, an auxiliary table of numbers can be very useful. It may be created as a table with as many rows as you need for a particular task, or as a user-defined function that returns the number of rows required in each query.
What is the optimal way to create such a function?
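For context, here is a minimal sketch of the UDF variant the question describes: an inline table-valued function built on a cross-joined CTE (the function name and parameter are illustrative, not taken from any answer below; the cross-join technique itself appears later in this thread).
CREATE FUNCTION dbo.GetNumbers (@MaxN BIGINT)
RETURNS TABLE
AS
RETURN
    -- Stack cross joins of a two-row set: 2 -> 4 -> 16 -> 256 -> 65,536 -> ~4.3 billion rows
    WITH E1(N)  AS (SELECT 1 UNION ALL SELECT 1),
         E2(N)  AS (SELECT 1 FROM E1 a CROSS JOIN E1 b),
         E4(N)  AS (SELECT 1 FROM E2 a CROSS JOIN E2 b),
         E8(N)  AS (SELECT 1 FROM E4 a CROSS JOIN E4 b),
         E16(N) AS (SELECT 1 FROM E8 a CROSS JOIN E8 b),
         E32(N) AS (SELECT 1 FROM E16 a CROSS JOIN E16 b)
    SELECT TOP (@MaxN)
           ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS N
    FROM E32;
GO
-- Example use: only as many rows as requested are ever generated
SELECT N FROM dbo.GetNumbers(1000);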
Heh... sorry I'm so late responding to an old post. And, yeah, I had to respond because the most popular answer (at the time, the recursive CTE answer with the link to 14 different methods) on this thread is, ummm... performance-challenged at best.
First, the article with the 14 different solutions is fine for seeing the different methods of creating a Numbers/Tally table on the fly, but as pointed out in the article and in the cited thread, there's a very important quote...
"suggestions regarding efficiency and
performance are often subjective.
Regardless of how a query is being
used, the physical implementation
determines the efficiency of a query.
Therefore, rather than relying on
biased guidelines, it is imperative
that you test the query and determine
which one performs better."
Ironically, the article itself contains many subjective statements and "biased guidelines" such as "a recursive CTE can generate a number listing pretty efficiently" and "This is an efficient method of using WHILE loop from a newsgroup posting by Itzik Ben-Gan" (which I'm sure he posted just for comparison purposes). C'mon folks... Just mentioning Itzik's good name may lead some poor slob into actually using that horrible method. The author should practice what (s)he preaches and should do a little performance testing before making such ridiculously incorrect statements, especially where scalability is concerned.
With the thought of actually doing some testing before making any subjective claims about what any code does or what someone "likes", here's some code you can do your own testing with. Set up Profiler for the SPID you're running the test from and check it out for yourself... just do a "Search'n'Replace" of the number 1000000 for your "favorite" number and see...
--===== Test for 1000000 rows ==================================
GO
--===== Traditional RECURSIVE CTE method
WITH Tally (N) AS
(
SELECT 1 UNION ALL
SELECT 1 + N FROM Tally WHERE N < 1000000
)
SELECT N
INTO #Tally1
FROM Tally
OPTION (MAXRECURSION 0);
GO
--===== Traditional WHILE LOOP method
CREATE TABLE #Tally2 (N INT);
SET NOCOUNT ON;
DECLARE @Index INT;
SET @Index = 1;
WHILE @Index <= 1000000
BEGIN
INSERT #Tally2 (N)
VALUES (@Index);
SET @Index = @Index + 1;
END;
GO
--===== Traditional CROSS JOIN table method
SELECT TOP (1000000)
ROW_NUMBER() OVER (ORDER BY (SELECT 1)) AS N
INTO #Tally3
FROM master.sys.all_columns ac1
CROSS JOIN master.sys.all_columns ac2;
GO
--===== Itzik's CROSS JOINED CTE method
WITH E00(N) AS (SELECT 1 UNION ALL SELECT 1),
E02(N) AS (SELECT 1 FROM E00 a, E00 b),
E04(N) AS (SELECT 1 FROM E02 a, E02 b),
E08(N) AS (SELECT 1 FROM E04 a, E04 b),
E16(N) AS (SELECT 1 FROM E08 a, E08 b),
E32(N) AS (SELECT 1 FROM E16 a, E16 b),
cteTally(N) AS (SELECT ROW_NUMBER() OVER (ORDER BY N) FROM E32)
SELECT N
INTO #Tally4
FROM cteTally
WHERE N <= 1000000;
GO
--===== Housekeeping
DROP TABLE #Tally1, #Tally2, #Tally3, #Tally4;
GO
While we're at it, here are the numbers I get from SQL Profiler for the values of 100, 1000, 10000, 100000, and 1000000...
SPID TextData Dur(ms) CPU Reads Writes
---- ---------------------------------------- ------- ----- ------- ------
51 --===== Test for 100 rows ============== 8 0 0 0
51 --===== Traditional RECURSIVE CTE method 16 0 868 0
51 --===== Traditional WHILE LOOP method CR 73 16 175 2
51 --===== Traditional CROSS JOIN table met 11 0 80 0
51 --===== Itzik's CROSS JOINED CTE method 6 0 63 0
51 --===== Housekeeping DROP TABLE #Tally 35 31 401 0
51 --===== Test for 1000 rows ============= 0 0 0 0
51 --===== Traditional RECURSIVE CTE method 47 47 8074 0
51 --===== Traditional WHILE LOOP method CR 80 78 1085 0
51 --===== Traditional CROSS JOIN table met 5 0 98 0
51 --===== Itzik's CROSS JOINED CTE method 2 0 83 0
51 --===== Housekeeping DROP TABLE #Tally 6 15 426 0
51 --===== Test for 10000 rows ============ 0 0 0 0
51 --===== Traditional RECURSIVE CTE method 434 344 80230 10
51 --===== Traditional WHILE LOOP method CR 671 563 10240 9
51 --===== Traditional CROSS JOIN table met 25 31 302 15
51 --===== Itzik's CROSS JOINED CTE method 24 0 192 15
51 --===== Housekeeping DROP TABLE #Tally 7 15 531 0
51 --===== Test for 100000 rows =========== 0 0 0 0
51 --===== Traditional RECURSIVE CTE method 4143 3813 800260 154
51 --===== Traditional WHILE LOOP method CR 5820 5547 101380 161
51 --===== Traditional CROSS JOIN table met 160 140 479 211
51 --===== Itzik's CROSS JOINED CTE method 153 141 276 204
51 --===== Housekeeping DROP TABLE #Tally 10 15 761 0
51 --===== Test for 1000000 rows ========== 0 0 0 0
51 --===== Traditional RECURSIVE CTE method 41349 37437 8001048 1601
51 --===== Traditional WHILE LOOP method CR 59138 56141 1012785 1682
51 --===== Traditional CROSS JOIN table met 1224 1219 2429 2101
51 --===== Itzik's CROSS JOINED CTE method 1448 1328 1217 2095
51 --===== Housekeeping DROP TABLE #Tally 8 0 415 0
As you can see, the recursive CTE method is second worst only to the WHILE loop for duration and CPU, and generates 8 times the memory pressure, in the form of logical reads, of the WHILE loop. It's RBAR on steroids and should be avoided at all costs for any single-row calculations, just as a WHILE loop should be avoided. There are places where recursion is quite valuable, but this ISN'T one of them.
As a sidebar, Mr. Denny is absolutely spot on... a correctly sized permanent Numbers or Tally table is the way to go for most things. What does correctly sized mean? Well, most people use a Tally table to generate dates or to do splits on VARCHAR(8000). If you create an 11,000-row Tally table with the correct clustered index on "N", you'll have enough rows to create more than 30 years' worth of dates (I work with mortgages a fair bit, so 30 years is a key number for me) and certainly enough to handle a VARCHAR(8000) split. Why is "right sizing" so important? If the Tally table is used a lot, it easily fits in cache, which makes it blazingly fast without much pressure on memory at all.
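To make that concrete, here's a minimal sketch of building such a permanent table (the table and constraint names are illustrative); it reuses the cross-join trick from the test code above and adds the clustered index on N that makes it fast and cache-friendly:
--===== Build an 11,000 row permanent Tally table (run once)
SELECT TOP (11000)
       IDENTITY(INT, 1, 1) AS N
  INTO dbo.Tally
  FROM master.sys.all_columns ac1
 CROSS JOIN master.sys.all_columns ac2;
--===== The unique clustered index on N is what keeps lookups and range scans cheap
ALTER TABLE dbo.Tally
  ADD CONSTRAINT PK_Tally_N PRIMARY KEY CLUSTERED (N);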
Last but not least, everyone knows that if you create a permanent Tally table, it doesn't much matter which method you use to build it because 1) it's only going to be made once and 2) if it's something like an 11,000 row table, all of the methods are going to run "good enough". So why all the indignation on my part about which method to use???
The answer is that some poor guy or gal who doesn't know any better, and just needs to get his or her job done, might see something like the recursive CTE method and decide to use it for something much larger and much more frequently used than building a permanent Tally table, and I'm trying to protect those people, the servers their code runs on, and the company that owns the data on those servers. Yeah... it's that big a deal. It should be for everyone else, as well. Teach the right way to do things instead of "good enough". Do some testing before posting or using something from a post or book... the life you save may, in fact, be your own, especially if you think a recursive CTE is the way to go for something like this. ;-)
Thanks for listening...
The optimal "function" would be to use a table instead of a function. Using a function causes extra CPU load to create the values for the data being returned, especially if the values being returned cover a very large range.
This article gives 14 different possible solutions with discussion of each. The important point is that:
suggestions regarding efficiency and
performance are often subjective.
Regardless of how a query is being
used, the physical implementation
determines the efficiency of a query.
Therefore, rather than relying on
biased guidelines, it is imperative
that you test the query and determine
which one performs better.
I personally liked:
WITH Nbrs ( n ) AS (
SELECT 1 UNION ALL
SELECT 1 + n FROM Nbrs WHERE n < 500 )
SELECT n FROM Nbrs
OPTION ( MAXRECURSION 500 )
This view is super fast and contains all positive int values.
CREATE VIEW dbo.Numbers
WITH SCHEMABINDING
AS
WITH Int1(z) AS (SELECT 0 UNION ALL SELECT 0)
, Int2(z) AS (SELECT 0 FROM Int1 a CROSS JOIN Int1 b)
, Int4(z) AS (SELECT 0 FROM Int2 a CROSS JOIN Int2 b)
, Int8(z) AS (SELECT 0 FROM Int4 a CROSS JOIN Int4 b)
, Int16(z) AS (SELECT 0 FROM Int8 a CROSS JOIN Int8 b)
, Int32(z) AS (SELECT TOP 2147483647 0 FROM Int16 a CROSS JOIN Int16 b)
SELECT ROW_NUMBER() OVER (ORDER BY z) AS n
FROM Int32
GO
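A quick usage sketch against the view above: because the view is fully inline, a TOP (or a WHERE on n) limits how many rows are actually generated rather than enumerating all ~2.1 billion.
-- First 100 numbers; row generation stops as soon as TOP is satisfied
SELECT TOP (100) n
FROM dbo.Numbers
ORDER BY n;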
From SQL Server 2022 you will be able to do
SELECT Value
FROM GENERATE_SERIES(1, 100, 1);
(The early previews accepted named parameters, GENERATE_SERIES(START = 1, STOP = 100, STEP = 1), as used in the CTP 2.0 answer below, but the released version takes positional arguments only.)
In the public preview of SQL Server 2022 (CTP2.0) there are some very promising elements and other less so. Hopefully the negative aspects can be addressed before the actual release.
✅ Execution time for number generation
The below generates 10,000,000 numbers in 700 ms on my test VM (assigning to a variable removes any overhead from sending results to the client):
DECLARE @Value INT
SELECT @Value = [value]
FROM GENERATE_SERIES(START=1, STOP=10000000)
✅ Cardinality estimates
It is simple to calculate how many numbers will be returned from the operator, and SQL Server takes advantage of this, producing an exact cardinality estimate in the execution plan.
❌ Unnecessary Halloween Protection
The plan for the below insert has a completely unnecessary spool - presumably as SQL Server does not currently have logic to determine that the source of the rows cannot also be the destination.
CREATE TABLE dbo.NumberHeap(Number INT);
INSERT INTO dbo.NumberHeap
SELECT [value]
FROM GENERATE_SERIES(START=1, STOP=10);
When inserting into a table with a clustered index on Number, the spool may be replaced by a sort instead (which also provides the phase separation).
❌ Unnecessary sorts
The below will return the rows in order anyway, but SQL Server apparently does not yet have the properties set to guarantee this and take advantage of it in the execution plan.
SELECT [value]
FROM GENERATE_SERIES(START=1, STOP=10)
ORDER BY [value]
Regarding this last point, Aaron Bertrand indicates that this is not a box currently ticked, but that it may be forthcoming.
In SQL Server 2016+, to generate a numbers table you could use OPENJSON:
-- range from 0 to @max - 1
DECLARE @max INT = 40000;
SELECT rn = CAST([key] AS INT)
FROM OPENJSON(CONCAT('[1', REPLICATE(CAST(',1' AS VARCHAR(MAX)), @max-1), ']'));
Idea taken from How can we use OPENJSON to generate series of numbers?
Jeff Moden's answer is great ... but I find on Postgres that the Itzik method fails unless you remove the E32 row.
Slightly faster on Postgres (40 ms vs 100 ms) is another method I found on here, adapted for Postgres:
WITH
E00 (N) AS (
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 ),
E01 (N) AS (SELECT a.N FROM E00 a CROSS JOIN E00 b),
E02 (N) AS (SELECT a.N FROM E01 a CROSS JOIN E01 b ),
E03 (N) AS (SELECT a.N FROM E02 a CROSS JOIN E02 b
LIMIT 11000 -- end record 11,000 good for 30 yrs dates
), -- max is 100,000,000, starts slowing e.g. 1 million 1.5 secs, 2 mil 2.5 secs, 3 mill 4 secs
Tally (N) as (SELECT row_number() OVER (ORDER BY a.N) FROM E03 a)
SELECT N
FROM Tally
As I am moving from the SQL Server to the Postgres world, I may have missed a better way to do tally tables on that platform... INTEGER()? SEQUENCE()?
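For anyone landing here with the same question: Postgres does ship a built-in set-returning function for exactly this, which replaces the whole CTE above.
-- Built-in series generator; 11,000 rows to match the tally above
SELECT n
FROM generate_series(1, 11000) AS t(n);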
Still much later, I'd like to contribute a slightly different 'traditional' CTE (does not touch base tables to get the volume of rows):
--===== Hans CROSS JOINED CTE method
WITH Numbers_CTE (Digit)
AS
(
SELECT 0 UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4 UNION ALL
SELECT 5 UNION ALL SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9
)
SELECT HundredThousand.Digit * 100000
     + TenThousand.Digit * 10000
     + Thousand.Digit * 1000
     + Hundred.Digit * 100
     + Ten.Digit * 10
     + One.Digit AS Number
INTO #Tally5
FROM Numbers_CTE AS One
CROSS JOIN Numbers_CTE AS Ten
CROSS JOIN Numbers_CTE AS Hundred
CROSS JOIN Numbers_CTE AS Thousand
CROSS JOIN Numbers_CTE AS TenThousand
CROSS JOIN Numbers_CTE AS HundredThousand
This CTE performs more reads than Itzik's CTE, but fewer than the traditional recursive CTE.
However, it consistently performs fewer writes than the other queries.
As you know, writes are consistently much more expensive than reads.
The duration depends heavily on the number of cores (MAXDOP) but, on my 8-core box, it consistently runs quicker (less duration in ms) than the other queries.
I am using:
Microsoft SQL Server 2012 - 11.0.5058.0 (X64)
May 14 2014 18:34:29
Copyright (c) Microsoft Corporation
Enterprise Edition (64-bit) on Windows NT 6.3 <X64> (Build 9600: )
on Windows Server 2012 R2, 32 GB, Xeon X3450 @ 2.67GHz, 4 cores, HT enabled.
Related
Possible explanation on WITH RECURSIVE Query Postgres
I have been reading around WITH queries in Postgres, and this is what surprised me:
WITH RECURSIVE t(n) AS (
    VALUES (1)
  UNION ALL
    SELECT n+1 FROM t WHERE n < 100
)
SELECT sum(n) FROM t;
I'm not able to understand how the evaluation of the query works. t(n) sounds like a function with a parameter. How is the value of n passed? Any insight on how the breakdown of the recursive statement happens in SQL?
This is called a common table expression and is a way of expressing a recursive query in SQL. t(n) defines the name of the CTE as t, with a single column named n. It's similar to an alias for a derived table:
select ... from ( ... ) as t(n);
The recursion starts with the value 1 (that's the VALUES (1) part) and then recursively adds one until the condition n < 100 stops letting rows through, so it generates the numbers from 1 to 100. The final query then sums up all those numbers. n is a column name, not a "variable", and the "assignment" happens in the same way as any data retrieval.
WITH RECURSIVE t(n) AS (
    VALUES (1)                           --<< this is the recursion "root"
  UNION ALL
    SELECT n+1 FROM t WHERE n < 100      --<< this is the "recursive part"
)
SELECT sum(n) FROM t;
If you "unroll" the recursion (which in fact is an iteration) then you'd wind up with something like this:
select x.n + 1
from (
  select x.n + 1
  from (
    select x.n + 1
    from (
      select x.n + 1
      from (values (1)) as x(n)
    ) as x(n)
  ) as x(n)
) as x(n)
More details in the manual: https://www.postgresql.org/docs/current/static/queries-with.html
If you are looking for how it is evaluated, the recursion occurs in two phases: the root is executed once, then the recursive part is executed repeatedly until no rows are returned. The documentation is a little vague on that point.
Now, normally in databases we think of "function" in a different way than we do in imperative programming. In database terms, the best way to think of a function is "a correspondence where for every domain value you have exactly one corresponding value." So one of the immediate challenges is to stop thinking in terms of programming functions. Even user-defined functions are best thought about in this other way, since it avoids a lot of potential nastiness regarding the intersection of running the query and the query planner.
So it may look like a function, but that is not correct. Instead, the WITH clause uses a different, almost inverse notation: here you have the set name t, followed (optionally in this case) by the tuple structure (n). So this is not a function with a parameter, but a relation with a structure.
So how this breaks down:
SELECT 1 as n WHERE n < 100
UNION ALL
SELECT n + 1 FROM (SELECT 1 as n) WHERE n < 100
UNION ALL
SELECT n + 1 FROM (SELECT n + 1 FROM (SELECT 1 as n)) WHERE n < 100
Of course that is a simplification, because internally we keep track of the CTE state and keep joining against the last iteration, so in practice these get folded back to near-linear complexity (while the above diagram would suggest much worse performance than that). So in reality you get something more like:
SELECT 1 as n WHERE 1 < 100
UNION ALL
SELECT 1 + 1 as n WHERE 1 + 1 < 100
UNION ALL
SELECT 2 + 1 AS n WHERE 2 + 1 < 100
...
In essence, the previous values carry over.
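A quick way to see those two phases in action is to shrink the bound and list the rows (plain Postgres, nothing assumed beyond the query being discussed):
-- The root produces 1; each pass produces n+1 from the previous pass's rows
WITH RECURSIVE t(n) AS (
    VALUES (1)
  UNION ALL
    SELECT n + 1 FROM t WHERE n < 5
)
SELECT n FROM t;
-- Returns 1 through 5; iteration stops once WHERE n < 5 filters out every row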
SQL accommodating meter rollover in utility meter readings
The job is actually a machine cycle count that rolls over to zero at 32,000, but the utility / electricity / odometer analogy gets the idea across. Let's say we have a three-digit meter. After 999 it will roll over to 0.
Reading  Value  Difference
1        990    -
2        992    2
3        997    5
4        003    6 *
5        008    5
I have a CTE query generating the difference between rows, but the line
Cur.Value - Prv.Value as Difference
on reading 4 above returns -994 due to the clock rollover. (It should return '6'.) Can anyone suggest an SQL trick to accommodate the rollover? E.g., here's a trick to get around SQL's lack of a GREATEST function:
-- SQL doesn't have LEAST/GREATEST functions so we use a math trick
-- to return the greater number:
-- 0.5*((A+B) + abs(A-B))
0.5 * (Cur._VALUE - Prv._VALUE + ABS(Cur._VALUE - Prv._VALUE)) AS Difference
Can anyone suggest a similar trick for the rollover problem?
Fiddle: http://sqlfiddle.com/#!3/ce9d4/10
You could use a CASE statement to detect the negative value (which indicates a rollover condition) and compensate for it:
--Create CTE
;WITH tblDifference AS
(
    SELECT Row_Number() OVER (ORDER BY Reading) AS RowNumber, Reading, Value
    FROM t1
)
SELECT
    Cur.Reading AS This,
    Cur.Value AS ThisRead,
    Prv.Value AS PrevRead,
    CASE WHEN Cur.Value - Prv.Value < 0     -- this happens during a rollover
         THEN Cur.Value - Prv.Value + 1000  -- compensate for the rollover
         ELSE Cur.Value - Prv.Value
    END as Difference
FROM tblDifference Cur
LEFT OUTER JOIN tblDifference Prv
    ON Cur.RowNumber = Prv.RowNumber + 1
ORDER BY Cur.Reading
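In the same spirit as the GREATEST math trick in the question, a branch-free alternative is modulo arithmetic (a sketch assuming the same three-digit 0-999 meter; substitute 32000 for the actual cycle counter):
-- (Cur - Prv + 1000) % 1000 maps the rollover case -994 to 6
-- and leaves ordinary positive differences unchanged
(Cur.Value - Prv.Value + 1000) % 1000 AS Difference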
Ordered count of consecutive repeats / duplicates
I highly doubt I'm doing this in the most efficient manner, which is why I tagged plpgsql on here. I need to run this on 2 billion rows for a thousand measurement systems.
You have measurement systems that often report the previous value when they lose connectivity, and they lose connectivity for spurts often, but sometimes for a long time. You need to aggregate, but when you do so, you need to look at how long it was repeating and apply various filters based on that information. Say you are measuring mpg on a car but it's stuck at 20 mpg for an hour, then moves around to 20.1 and so on. You'll want to evaluate the accuracy when it's stuck. You could also place some alternative rules that look for when the car is on the highway, and with window functions you can generate the 'state' of the car and have something to group on. Without further ado:
--here's my data: you have different systems, the time of measurement, and the actual measurement
--as well, the raw data has whether or not it's a repeat (hence the included window function)
select *
into temporary table cumulative_repeat_calculator_data
FROM
(
    select
        system_measured,
        time_of_measurement,
        measurement,
        case when measurement = lag(measurement,1) over (partition by system_measured order by time_of_measurement asc)
             then 1 else 0 end as repeat
    FROM
    (
        SELECT 5 as measurement, 1 as time_of_measurement, 1 as system_measured UNION
        SELECT 150 as measurement, 2 as time_of_measurement, 1 as system_measured UNION
        SELECT 5 as measurement, 3 as time_of_measurement, 1 as system_measured UNION
        SELECT 5 as measurement, 4 as time_of_measurement, 1 as system_measured UNION
        SELECT 5 as measurement, 1 as time_of_measurement, 2 as system_measured UNION
        SELECT 5 as measurement, 2 as time_of_measurement, 2 as system_measured UNION
        SELECT 5 as measurement, 3 as time_of_measurement, 2 as system_measured UNION
        SELECT 5 as measurement, 4 as time_of_measurement, 2 as system_measured UNION
        SELECT 150 as measurement, 5 as time_of_measurement, 2 as system_measured UNION
        SELECT 5 as measurement, 6 as time_of_measurement, 2 as system_measured UNION
        SELECT 5 as measurement, 7 as time_of_measurement, 2 as system_measured UNION
        SELECT 5 as measurement, 8 as time_of_measurement, 2 as system_measured
    ) as data
) as data;
--unfortunately you can't have window functions within window functions, so I had to break it down into a subquery
--what we need is something to partition on, the 'state' of the system if you will, so I ran a running total of the nonrepeats
--this creates a value that stays the same while your data is repeating - aka something you can partition/group on
select *
into temporary table cumulative_repeat_calculator_step_1
FROM
(
    select *,
        sum(case when repeat = 0 then 1 else 0 end) over (partition by system_measured order by time_of_measurement asc) as cumulative_sum_of_nonrepeats_by_system
    from cumulative_repeat_calculator_data
    order by system_measured, time_of_measurement
) as data;
--finally, the query. I didn't bother showing my desired output, because this (finally) got it.
--I wanted a sequential count of repeats that restarts when it stops repeating, and starts with the first repeat
--what you can do now is take the average measurement under some condition based on how long it was repeating, for example
select *,
    case when repeat = 0 then 0
         else row_number() over (partition by cumulative_sum_of_nonrepeats_by_system, system_measured order by time_of_measurement) - 1
    end as ordered_repeat
from cumulative_repeat_calculator_step_1
order by system_measured, time_of_measurement
So, what would you do differently in order to run this on a huge table, or what alternative tools would you use? I'm thinking plpgsql because I suspect this needs to be done in-database, or during the data insertion process, although I generally work with the data after it's loaded. Is there any way to get this in one sweep without resorting to sub-queries?
I have tested one alternative method, but it still relies on a sub-query, and I think this is faster. For that method you create a "starts and stops" table with start_timestamp, end_timestamp, system. Then you join to the larger table, and if the timestamp is between those, you classify it as being in that state, which is essentially an alternative to cumulative_sum_of_nonrepeats_by_system. But when you do this, you join on 1=1 for thousands of devices and thousands or millions of 'events'. Do you think that's a better way to go?
Test case
First, a more useful way to present your data - or even better, in an sqlfiddle, ready to play with:
CREATE TEMP TABLE data(
  system_measured int
, time_of_measurement int
, measurement int
);
INSERT INTO data VALUES
 (1, 1, 5)
,(1, 2, 150)
,(1, 3, 5)
,(1, 4, 5)
,(2, 1, 5)
,(2, 2, 5)
,(2, 3, 5)
,(2, 4, 5)
,(2, 5, 150)
,(2, 6, 5)
,(2, 7, 5)
,(2, 8, 5);
Simplified query
Since it remains unclear, I am assuming only the above as given. Next, I simplified your query to arrive at:
WITH x AS (
   SELECT *, CASE WHEN lag(measurement) OVER (PARTITION BY system_measured
                                              ORDER BY time_of_measurement) = measurement
                  THEN 0 ELSE 1 END AS step
   FROM   data
   )
, y AS (
   SELECT *, sum(step) OVER(PARTITION BY system_measured
                            ORDER BY time_of_measurement) AS grp
   FROM   x
   )
SELECT *, row_number() OVER (PARTITION BY system_measured, grp
                             ORDER BY time_of_measurement) - 1 AS repeat_ct
FROM   y
ORDER  BY system_measured, time_of_measurement;
Now, while it is all nice and shiny to use pure SQL, this will be much faster with a plpgsql function, because it can do it in a single table scan where this query needs at least three scans.
Faster with plpgsql function:
CREATE OR REPLACE FUNCTION x.f_repeat_ct()
  RETURNS TABLE (
    system_measured int
  , time_of_measurement int
  , measurement int
  , repeat_ct int
  ) LANGUAGE plpgsql AS
$func$
DECLARE
   r  data;  -- table name serves as record type
   r0 data;
BEGIN
-- SET LOCAL work_mem = '1000 MB';  -- uncomment and adapt if needed, see below!
repeat_ct := 0;  -- init
FOR r IN
   SELECT * FROM data d ORDER BY d.system_measured, d.time_of_measurement
LOOP
   IF r.system_measured = r0.system_measured
      AND r.measurement = r0.measurement THEN
      repeat_ct := repeat_ct + 1;  -- same value as last row: increment count
   ELSE
      repeat_ct := 0;              -- start new count
   END IF;
   RETURN QUERY SELECT r.*, repeat_ct;
   r0 := r;  -- remember last row
END LOOP;
END
$func$;
Call:
SELECT * FROM x.f_repeat_ct();
Be sure to table-qualify your column names at all times in this kind of plpgsql function, because we use the same names as output parameters, which would take precedence if not qualified.
Billions of rows
If you have billions of rows, you may want to split this operation up. I quote the manual here:
Note: The current implementation of RETURN NEXT and RETURN QUERY stores the entire result set before returning from the function, as discussed above. That means that if a PL/pgSQL function produces a very large result set, performance might be poor: data will be written to disk to avoid memory exhaustion, but the function itself will not return until the entire result set has been generated. A future version of PL/pgSQL might allow users to define set-returning functions that do not have this limitation. Currently, the point at which data begins being written to disk is controlled by the work_mem configuration variable. Administrators who have sufficient memory to store larger result sets in memory should consider increasing this parameter.
Consider computing rows for one system at a time, or set a high enough value for work_mem to cope with the load. Follow the link provided in the quote for more about work_mem. One way would be to set a very high value for work_mem with SET LOCAL in your function, which is only effective for the current transaction. I added a commented line in the function. Do not set it very high globally, as this could nuke your server. Read the manual.
Selecting SUM of TOP 2 values within a table with multiple GROUP in SQL
I've been playing with sets in SQL Server 2000 and have the following table structure for one of my temp tables (#Periods):
RestCTR  HoursCTR  Duration  Rest
----------------------------------
1        337       2         0
2        337       46        1
3        337       2         0
4        337       46        1
5        338       1         0
6        338       46        1
7        338       2         0
8        338       46        1
9        338       1         0
10       339       46        1
...
What I'd like to do is to calculate the sum of the 2 longest rest periods for each HoursCTR, preferably using sets and temp tables (rather than cursors or nested subqueries). Here's the dream query that just won't work in SQL (no matter how many times I run it):
Select HoursCTR, SUM ( TOP 2 Duration ) as LongestBreaks
FROM #Periods
WHERE Rest = 1
Group By HoursCTR
The HoursCTR can have any number of rest periods (including none). My current solution is not very elegant and basically involves the following steps:
1. Get the max duration of rest, grouped by HoursCTR
2. Select the first (min) RestCTR row that returns this max duration for each HoursCTR
3. Repeat step 1 (excluding the rows already collected in step 2)
4. Repeat step 2 (again, excluding rows collected in step 2)
5. Combine the RestCTR rows (from steps 2 and 4) into a single table
6. Get the SUM of the Duration pointed to by the rows in step 5, grouped by HoursCTR
If there are any set functions that cut this process down, they would be very welcome.
The best way to do this in SQL Server is with a common table expression, numbering the rows in each group with the windowing function ROW_NUMBER():
WITH NumberedPeriods AS (
  SELECT HoursCTR, Duration,
    ROW_NUMBER() OVER (PARTITION BY HoursCTR ORDER BY Duration DESC) AS RN
  FROM #Periods
  WHERE Rest = 1
)
SELECT HoursCTR, SUM(Duration) AS LongestBreaks
FROM NumberedPeriods
WHERE RN <= 2
GROUP BY HoursCTR
edit: I've added an ORDER BY clause in the partitioning, to get the two longest rests.
Mea culpa, I did not notice that you need this to work in Microsoft SQL Server 2000. That version doesn't support CTEs or windowing functions. I'll leave the answer above in case it helps someone else. In SQL Server 2000, the common advice is to use a correlated subquery:
SELECT p1.HoursCTR,
  (SELECT SUM(t.Duration)
   FROM (SELECT TOP 2 p2.Duration
         FROM #Periods AS p2
         WHERE p2.HoursCTR = p1.HoursCTR
         ORDER BY p2.Duration DESC) AS t) AS LongestBreaks
FROM #Periods AS p1
SQL 2000 does not have CTEs, nor ROW_NUMBER(). Correlated subqueries can need an extra step when using GROUP BY. This should work for you:
SELECT
    F.HoursCTR,
    MAX (F.LongestBreaks) AS LongestBreaks -- Dummy MAX() so that GROUP BY can be used.
FROM
    (
        SELECT
            Pm.HoursCTR,
            (
                SELECT COALESCE (SUM (S.Duration), 0)
                FROM
                    (
                        SELECT TOP 2 T.Duration
                        FROM #Periods AS T
                        WHERE T.HoursCTR = Pm.HoursCTR
                          AND T.Rest = 1
                        ORDER BY T.Duration DESC
                    ) AS S
            ) AS LongestBreaks
        FROM #Periods AS Pm
    ) AS F
GROUP BY F.HoursCTR
Unfortunately for you, Alex, you've got the right solution: correlated subqueries, depending upon how they're structured, will end up firing multiple times, potentially giving you hundreds of individual query executions. Put your current solution into the Query Analyzer, enable "Show Execution Plan" (Ctrl+K), and run it. You'll have an extra tab at the bottom which will show you how the engine went about the process of gathering your results. If you do the same with the correlated subquery, you'll see what that option does. I believe that it's likely to hammer the #Periods table about as many times as you have individual rows in that table. Also - something's off about the correlated subquery, seems to me. Since I avoid them like the plague, knowing that they're evil, I'm not sure how to go about fixing it up.
Display more than one row with the same result from a field
I need to show more than one result from each field in a table. I need to do this with only one SQL sentence; I don't want to use a cursor. This seems silly, but the number of rows may vary for each item. I need this to print this information afterwards as a Crystal Report detail. Suppose I have this table:
idItem   Cantidad   <more fields>
-------- -----------
1000     3
2000     2
3000     5
4000     1
I need this result, using only one SQL sentence:
1000
1000
1000
2000
2000
3000
3000
3000
3000
3000
4000
where each idItem has Cantidad rows. Any ideas?
It seems like something that should be handled in the UI (or the report). I don't know Crystal Reports well enough to make a suggestion there. If you really, truly need to do it in SQL, then you can use a Numbers table (or something similar):
SELECT idItem
FROM Some_Table ST
INNER JOIN Numbers N
    ON N.number > 0
    AND N.number <= ST.cantidad
You can replace the Numbers table with a subquery or function or whatever other method you want to generate a result set of numbers that is at least large enough to cover your largest cantidad.
Check out UNPIVOT (MSDN).
If you use a "numbers" table, which is useful for this and many similar purposes, you can use the following SQL:
select t.idItem
from myTable t
join numbers n on n.num between 1 and t.Cantidad
order by t.idItem
The numbers table should just contain all integer numbers from 0 or 1 up to a number big enough that Cantidad never exceeds it.
As others have said, you need a Numbers or Tally table, which is just a sequential list of integers. However, if you knew that Cantidad was never going to be larger than five, for example, you can do something like:
Select idItem
From Table
Join (
    Select 1 As Value
    Union All Select 2
    Union All Select 3
    Union All Select 4
    Union All Select 5
) As Numbers
    On Numbers.Value <= Table.Cantidad
If you are using SQL Server 2005, you can use a CTE to do:
With Numbers As
(
    Select 1 As Value
    Union All
    Select N.Value + 1
    From Numbers As N
    Where N.Value < 100 -- cap the recursion; without a bound the CTE never terminates
)
Select idItem
From Table
Join Numbers As N
    On N.Value <= Table.Cantidad
Option (MaxRecursion 0);