Updating a subset of data through a CTE - sql

Question
I've just come across the concept of using update statements on CTEs.
This seems a great approach, but I've not seen it used before, and the context in which I was introduced to it (i.e. uncovered it in some badly written code) suggests the author didn't know what they were doing.
Is anyone aware of any reason not to perform updates on CTEs, or of any considerations which should be made when doing so (assuming the CTE gives some benefit, such as allowing you to update an arbitrary subset of data)?
Full Info
I recently found some horrendous code in our production environment where someone had clearly been experimenting with ways to update a single row of data. I've tidied up the layout to make it readable, but have left the original logic as is.
CREATE PROCEDURE [dbo].[getandupdateWorkOrder]
-- Add the parameters for the stored procedure here
-- @p1 xml Output
AS
BEGIN
WITH T AS
(
SELECT XMLDoc
, Retrieved
from [Transcode].[dbo].[WorkOrder]
where WorkOrderId IN
(
SELECT TOP 1 WorkOrderId
FROM
(
SELECT DISTINCT(WorkOrderId)
, Retrieved
FROM [Transcode].[dbo].[WorkOrder]
WHERE Retrieved = 0
) as c
)
AND Retrieved = 0
)
UPDATE T
SET Retrieved = 1
Output inserted.XMLDoc
END
I can easily update this to the below without affecting the logic:
CREATE PROCEDURE [dbo].[GetAndUpdateWorkOrder]
AS
BEGIN
WITH T AS
(
SELECT top 1 XMLDoc
, Retrieved
from [WorkOrder]
where Retrieved = 0
)
UPDATE T
SET Retrieved = 1
Output inserted.XMLDoc
END
However, the code also introduced me to a new concept: that you can update CTEs and see those updates in the underlying tables (I'd previously assumed that CTEs were read-only, in-memory copies of the data selected from the original table, and thus not possible to amend).
Had I not seen the original code, but needed something which behaved like this I'd have written it as follows:
CREATE PROCEDURE [dbo].[GetAndUpdateWorkOrder]
AS
BEGIN
UPDATE [WorkOrder]
SET Retrieved = 1
Output inserted.XMLDoc
where Id in
(
select top 1 Id
from [WorkOrder]
where Retrieved = 0
--order by Id --I'd have included this too; but not including here to ensure my rewrite is *exactly* the same as the original in terms of functionality, including the unpredictable part (the bonus of not including this is a performance benefit; though that's negligible given the data in this table)
)
END
The code which performs the update via the CTE looks much cleaner (i.e. you don't even need to rely on a unique id for this to work).
However, because the rest of the original code is badly written, I'm apprehensive about this new technique, so I want to see what the experts say about this approach before adding it to my arsenal.

Updating CTEs is fine. There are limitations on the subqueries that you can use (such as no aggregations).
However, you have a misconception about CTEs in SQL Server. They do not create in-memory tables. Instead, they operate more like views: the code is expanded into the query, and the overall query is then optimized. Note: this behavior differs from other databases, and there is no way to override it, even with a hint.
This is an important distinction. If you have a complex CTE and use it more than once, then it will typically execute for each reference in the overall query.
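A quick sketch of that point, with table and column names borrowed from the question (the exact plan is the optimizer's choice, but re-execution per reference is the typical outcome):

```sql
-- A CTE referenced twice behaves like an inline view: each reference
-- may re-execute the underlying query.
WITH Unretrieved AS
(
    SELECT WorkOrderId, XMLDoc
    FROM dbo.WorkOrder
    WHERE Retrieved = 0
)
SELECT COUNT(*) FROM Unretrieved
UNION ALL
SELECT COUNT(DISTINCT WorkOrderId) FROM Unretrieved;  -- likely a second scan

-- If the CTE is expensive, materialize it once into a temp table instead:
SELECT WorkOrderId, XMLDoc
INTO #Unretrieved
FROM dbo.WorkOrder
WHERE Retrieved = 0;
```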

Updating through CTEs is fine. It's especially handy when you have to deal with window functions. For example, you can use this query to give the top 10 performing employees in each department a 10% raise:
WITH TopPerformers AS
(
SELECT DepartmentID, EmployeeID, Salary,
RANK() OVER (PARTITION BY DepartmentID ORDER BY PerformanceScore DESC) AS EmployeeRank
FROM Employees
)
UPDATE TopPerformers
SET Salary = Salary * 1.1
WHERE EmployeeRank <= 10
(I'm ignoring the fact that there can be more than 10 employees per department in case many have the same score, but that's beside the point here.)
Nice, clean, and easy to understand. I see a CTE as a temporary view, so I tend to follow what Microsoft says about updating views. See the Updatable Views section on this page.

Related

How can I create a temporary numbers table with SQL?

So I came upon a question where someone asked for a list of unused account numbers. The query I wrote for it works, but it is kind of hacky and relies on the existence of a table with more records than existing accounts:
WITH tmp
AS (SELECT Row_number()
OVER(
ORDER BY cusno) a
FROM custtable
fetch first 999999 rows only)
SELECT tmp.a
FROM tmp
WHERE a NOT IN (SELECT cusno
FROM custtable)
This works because customer numbers are reused and there are significantly more records than unique customer numbers. But, like I said, it feels hacky and I'd like to just generate a temporary table with 1 column and x records that are numbered 1 through x. I looked at some recursive solutions, but all of it looked way more involved than the solution I wound up using. Is there an easier way that doesn't rely on existing tables?
I think the simple answer is no. To be able to make a determination of absence, the platform needs to know the expected data set. You can either generate that as a temporary table or data set at runtime - using the method you've used (or a variation thereof) - or you can create a reference table once and compare against it each time. I'd favour the latter: a table with a single column of integers won't put much of a dent in your disk space, and it doesn't make sense to compute an identical result set over and over again.
Here's a really good article from Aaron Bertrand that deals with this very issue:
https://sqlperformance.com/2013/01/t-sql-queries/generate-a-set-1
(Edit: The queries in that article are TSQL specific, but they should be easily adaptable to DB2 - and the underlying analysis is relevant regardless of platform)
If you want all unused account numbers, you can do it like this:
with MaxNumber as
(
select max(cusno) MaxID from custtable
),
RecurceNumber (id) as
(
values 1
union all
select id + 1 from RecurceNumber cross join MaxNumber
where id<=MaxID
)
select f1.* from RecurceNumber f1 exception join custtable f2 on f1.id=f2.cusno

iSeries query changes selected RRN of subquery result rows

I'm trying to make an optimal SQL query for an iSeries database table that can contain millions of rows (perhaps up to 3 million per month). The only key I have for each row is its RRN (relative record number, which is the physical record number for the row).
My goal is to join the table with another small table to give me a textual description of one of the numeric columns. However, the number of rows involved can exceed 2 million, which typically causes the query to fail due to an out-of-memory condition. So I want to rewrite the query to avoid joining a large subset with any other table. So the idea is to select a single page (up to 30 rows) within a given month, and then join that subset to the second table.
However, I ran into a weird problem. I use the following query to retrieve the RRNs of the rows I want for the page:
select t.RRN2 -- Gives correct RRNs
from (
select row_number() over() as SEQ,
rrn(e2) as RRN2, e2.*
from TABLE1 as e2
where e2.UPDATED between '2013-05-01' and '2013-05-31'
order by e2.UPDATED, e2.ACCOUNT
) as t
where t.SEQ > 270 and t.SEQ <= 300 -- Paging
order by t.UPDATED, t.ACCOUNT
This query works just fine, returning the correct RRNs for the rows I need. However, when I attempted to join the result of the subquery with another table, the RRNs changed. So I simplified the query to a subquery within a simple outer query, without any join:
select rrn(e) as RRN, e.*
from TABLE1 as e
where rrn(e) in (
select t.RRN2 -- Gives correct RRNs
from (
select row_number() over() as SEQ,
rrn(e2) as RRN2, e2.*
from TABLE1 as e2
where e2.UPDATED between '2013-05-01' and '2013-05-31'
order by e2.UPDATED, e2.ACCOUNT
) as t
where t.SEQ > 270 and t.SEQ <= 300 -- Paging
order by t.UPDATED, t.ACCOUNT
)
order by e.UPDATED, e.ACCOUNT
The outer query simply grabs all of the columns of each row selected by the subquery, using the RRN as the row key. But this query does not work - it returns rows with completely different RRNs.
I need the actual RRN, because it will be used to retrieve more detailed information from the table in a subsequent query.
Any ideas about why the RRNs end up different?
Resolution
I decided to break the query into two calls: one to issue the simple subquery and return just the RRNs (row IDs), and a second to do the rest of the JOINs and so forth to retrieve the complete info for each row. (Since the table gets updated only once a day, and rows never get deleted, there are no potential timing problems to worry about.)
This approach appears to work quite well.
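For illustration, the two calls might look roughly like this; only TABLE1 and its columns come from the question, while the lookup table (CODETABLE) and its columns are made-up placeholders:

```sql
-- Call 1: page of RRNs only - no join, so RRN() reads the physical rows.
SELECT t.RRN2
FROM (
    SELECT ROW_NUMBER() OVER () AS SEQ,
           RRN(e2) AS RRN2
    FROM TABLE1 AS e2
    WHERE e2.UPDATED BETWEEN '2013-05-01' AND '2013-05-31'
    ORDER BY e2.UPDATED, e2.ACCOUNT
) AS t
WHERE t.SEQ > 270 AND t.SEQ <= 300;

-- Call 2 (separate statement): full detail for just those RRNs,
-- with the small-table join. The application binds in the RRN list.
SELECT RRN(e) AS RRN, e.*, d.DESCRIPTION
FROM TABLE1 AS e
JOIN CODETABLE AS d ON d.CODE = e.CODE
WHERE RRN(e) IN (270001, 270002 /* , ... RRNs from call 1 */);
```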
Addendum
As to the question of why an out-of-memory error occurs, this appears to be a limitation on only some of our test servers. Some can only handle up to around 2m rows, while others can handle much more than that. So I'm guessing that this is some sort of limit imposed by the admins on a server-by-server basis.
Trying to use RRN as a primary key is asking for trouble.
I find it hard to believe there isn't a key available.
Granted, there may be no explicit primary key defined in the table itself. But is there a unique key defined in the table?
It's possible there are no keys defined in the table itself (a practice that is 20 years out of date), but in that case there's usually a logical file with a unique key defined that is used by the application as the de facto primary key to the table.
Try looking for related objects via green screen (DSPDBR) or GUI (via "Show related"). Keyed logical files show in the GUI as views. So you'd need to look at the properties to determine if they are uniquely keyed DDS logicals instead of non-keyed SQL views.
A few times I've run into tables with no existing de-facto primary key. Usually, it was possible to figure out what could be defined as one from the existing columns.
When there truly is no PK, I simply add one. Usually a generated identity column. There's a technique you can use to easily add columns without having to recompile or test any heritage RPG/COBOL programs. (and note LVLCHK(*NO) is NOT it!)
The technique is laid out in Chapter 4 of the modernizing Redbook
http://www.redbooks.ibm.com/abstracts/sg246393.html
1) Move the data to a new PF (or SQL table)
2) create new LF using the name of the existing PF
3) repoint existing LF to new PF (or SQL table)
Done properly, the record format identifiers of the existing objects don't change and thus you don't have to recompile any RPG/COBOL programs.
I find it hard to believe that querying a table of a mere 3 million rows, even when joined with something else, should cause an out-of-memory condition, so in my view you should address this issue first (or cause it to be addressed).
As for your question of why the RRNs end up different I'll take the liberty of quoting the manual:
If the argument identifies a view, common table expression, or nested table expression derived from more than one base table, the function returns the relative record number of the first table in the outer subselect of the view, common table expression, or nested table expression.
A construct of the type ...where something in (select somethingelse...) typically translates into a join, so there.
Unless you can specifically control it, e.g., via ALWCPYDTA(*NO) for STRSQL, SQL may make copies of result rows for any intermediate set of rows. The RRN() function always accesses physical record number, as contrasted with the ROW_NUMBER() function that returns a logical row number indicating the relative position in an ordered (or unordered) set of rows. If a copy is generated, there is no way to guarantee that RRN() will remain consistent.
Other considerations apply over time; but in this case it's as likely to be simple copying of intermediate result rows as anything.

Create a function with whole columns as input and output

I have several programs written in R that now I need to translate in T-SQL to deliver them to the client. I am new to T-SQL and I'm facing some difficulties in translating all my R functions.
An example is the numerical derivative function, which for two input columns (values and time) would return another column (of same length) with the computed derivative.
My current understanding is:
I can't use SP, because I'll need to use this functions inline with
select statement, like:
SELECT Customer_ID, Date, Amount, derivative(Amount, Date) FROM Customer_Detail
I can't use UDF, because they can take, as input parameter, only scalar. I'll need vectorised function due to speed and also because for some functions I have, like the one above, running row by row wouldn't be meaningful (for each value it needs the next and the previous)
UDA take whole column but, as the name says..., they will aggregate the column like sum or avg would.
If the above is correct, which other techniques would allow me to create the type of function I need? An example of a SQL built-in function similar to what I'm after is square(), which (apparently) takes a column and returns itself^2. My goal is to create a library of functions which behave like square, power, etc., but internally it'll be different, because square takes and returns each scalar as it is read through the rows. I would like to know whether it is possible to have a user-defined function with an accumulate method (like a UDA's) able to operate on all the data at the end of the import and then return a column of the same length.
NB: At the moment I'm on SQL-Server 2005 but we'll switch soon to 2012 (or possibly 2014 in few months) so answers based on any 2005+ version of SQL-Server are fine.
EDIT: added the R tag for R developers who have, hopefully, already faced such difficulties.
EDIT2: Added CLR tag: I went through CLR user-defined aggregates as described in the Pro T-SQL 2005 programmer's guide. I already said above that this type of function wouldn't fit my needs, but it was worth looking into. The 4 methods needed by a UDA are: Init, Accumulate, Merge and Terminate. My request would need the whole data set to be analysed all together by the same instance of the UDA, so options involving merge methods to combine partial results from multicore processing won't work.
I think you may want to change your mind a bit. The SQL language is very good when working with sets of data, especially in modern RDBMS implementations (like SQL Server 2012), but you have to think in sets, not in rows or columns. While I still don't know your exact tasks, consider this: SQL Server 2012 has a very nice set of window functions + ranking functions + analytic functions + common table expressions, so you can write almost any query inline. You can use chains of common table expressions to turn your data any way you want, to calculate running totals, to calculate averages or other aggregates over a window, and so on.
Actually, I've always liked SQL, and when I learned a bit of functional programming (ML and Scala), my thought was that my approach to SQL is very similar to the functional paradigm - just slicing and dicing data without saving anything into variables, until you have the result set you need.
Just quick example, here's a question from SO - How to get average of the 'middle' values in a group?. The goal was to get the average for each group of the middle 3 values:
TEST_ID TEST_VALUE GROUP_ID
1 5 1 -+
2 10 1 +- these values for group_id = 1
3 15 1 -+
4 25 2 -+
5 35 2 +- these values for group_id = 2
6 5 2 -+
7 15 2
8 25 3
9 45 3 -+
10 55 3 +- these values for group_id = 3
11 15 3 -+
12 5 3
13 25 3
14 45 4 +- this value for group_id = 4
For me, it's not an easy task to do in R, but in SQL it could be a really simple query like this:
with cte as (
select
*,
row_number() over(partition by group_id order by test_value) as rn,
count(*) over(partition by group_id) as cnt
from test
)
select
group_id, avg(test_value)
from cte
where
cnt <= 3 or
(rn >= cnt / 2 - 1 and rn <= cnt / 2 + 1)
group by group_id
You can also easily expand this query to get 5 values around the middle.
Take a closer look at analytic functions and try to rethink your calculations in terms of window functions; maybe it's not so hard to rewrite your R procedures in plain SQL.
Hope it helps.
I would solve this by passing a reference to the record(s) you want to process, and using a so-called "inline table-valued function" to return the record(s) after processing the initial records.
You find the table-function reference here:
http://technet.microsoft.com/en-en/library/ms186755.aspx
A Sample:
CREATE FUNCTION Sales.CustomerExtendedInfo (@CustomerID int)
RETURNS TABLE
AS
RETURN
(
SELECT FirstName + LastName AS CompleteName,
DATEDIFF(Day,CreateDate,GetDate()) AS DaysSinceCreation
FROM Customer_Detail
WHERE CustomerID = @CustomerID
);
GO
CustomerID would be the primary key of the records you want to process.
Table-Function can afterwards be joined to other Query results if you want to process more than one record at once.
Here is a Sample:
SELECT * FROM Customer_Detail
CROSS APPLY Sales.CustomerExtendedInfo (CustomerID)
Using a normal stored procedure would do more or less the same, but it's a bit tricky to work with the results programmatically.
But keep one thing in mind: SQL Server is not really good at "functional programming". It's brilliant at working with data and sets of data, but the more you use it as an "application server", the more you will realize it's not made for that.
I don't think this is possible in pure T-SQL without using cursors. But with cursors, things will usually be very slow. Cursors process the table row by row, and some people call this "slow-by-slow".
But you can create your own aggregate function (see Technet for more details). You have to implement the function using the .NET CLR (e.g. C# or R.NET).
For a nice example see here.
I think interfacing R with SQL is a very nice solution. Oracle is offering this combo as a commercial product, so why not go the same way with SQL Server.
When integrating R in the code using your own aggregate functions, you will only pay a small performance penalty. Custom aggregate functions are quite fast according to the Microsoft documentation: "Managed code generally performs slightly slower than built-in SQL Server aggregate functions". The R.NET solution also seems to be quite fast, loading the native R DLL directly into the running process. So it should be much faster than using R over ODBC.
ORIGINAL RESPONSE:
If you already know which functions you will need, one approach I can think of is creating one inline table-valued function for each method/operation you want to apply per table.
What do I mean by that? For example, you mentioned SELECTing FROM the Customer_Detail table; there you might need one method, "derivative(Amount, Date)". Let's say a second method you might need (I am just making this up for explanation) is "derivative1(Amount1, Date1)".
We create two inline functions; each does its own calculation on the intended columns inside the function, and also returns the remaining columns as they are. That way you get all the columns you would get from the table, and you also perform the custom calculation as a set-based operation instead of a scalar operation.
Later, you can combine the independent column calculations into the same function if that makes sense.
You can still use all these functions and JOIN them to get all custom calculations in a single set if needed, as all the functions will have the common/unprocessed columns coming through as they are.
see the example below.
IF object_id('Product','u') IS NOT NULL
DROP TABLE Product
GO
CREATE TABLE Product
(
pname sysname NOT NULL
,pid INT NOT NULL
,totalqty INT NOT NULL DEFAULT 1
,uprice NUMERIC(28,10) NOT NULL DEFAULT 0
)
GO
INSERT INTO Product( pname, pid, totalqty, uprice )
SELECT 'pen',1,100,1.2
UNION ALL SELECT 'book',2,300,10.00
UNION ALL SELECT 'lock',3,500,15.00
GO
IF object_id('ufn_Product_totalValue','IF') IS NOT NULL
DROP FUNCTION ufn_Product_totalValue
GO
CREATE FUNCTION ufn_Product_totalValue
(
@newqty int
,@newuprice numeric(28,10)
)
RETURNS TABLE AS
RETURN
(
SELECT pname,pid,totalqty,uprice,totalqty*uprice AS totalValue
FROM
(
SELECT
pname
,pid
,totalqty+@newqty AS totalqty
,uprice+@newuprice AS uprice
FROM Product
)qry
)
GO
IF object_id('ufn_Product_totalValuePct','IF') IS NOT NULL
DROP FUNCTION ufn_Product_totalValuePct
GO
CREATE FUNCTION ufn_Product_totalValuePct
(
@newqty int
,@newuprice numeric(28,10)
)
RETURNS TABLE AS
RETURN
(
SELECT pname,pid,totalqty,uprice,totalqty*uprice/100 AS totalValuePct
FROM
(
SELECT
pname
,pid
,totalqty+@newqty AS totalqty
,uprice+@newuprice AS uprice
FROM Product
)qry
)
GO
SELECT * FROM ufn_Product_totalValue(10,5)
SELECT * FROM ufn_Product_totalValuepct(10,5)
select tv.pname,tv.pid,tv.totalValue,pct.totalValuePct
from ufn_Product_totalValue(10,5) tv
join ufn_Product_totalValuePct(10,5) pct
on tv.pid=pct.pid
EDIT2:
Three-point smoothing algorithm:
IF OBJECT_ID('Test3PointSmoothingAlgo','u') IS NOT NULL
DROP TABLE Test3PointSmoothingAlgo
GO
CREATE TABLE Test3PointSmoothingAlgo
(
qty INT NOT NULL
,id INT IDENTITY NOT NULL
)
GO
INSERT Test3PointSmoothingAlgo( qty ) SELECT 10 UNION SELECT 20 UNION SELECT 30
GO
IF object_id('ufn_Test3PointSmoothingAlgo_qty','IF') IS NOT NULL
DROP FUNCTION ufn_Test3PointSmoothingAlgo_qty
GO
CREATE FUNCTION ufn_Test3PointSmoothingAlgo_qty
(
@ID INT --this is a dummy parameter
)
RETURNS TABLE AS
RETURN
(
WITH CTE_3PSA(SmoothingPoint,Coefficients)
AS --finding the ID of adjacent points
(
SELECT id,id
FROM Test3PointSmoothingAlgo
UNION
SELECT id,id-1
FROM Test3PointSmoothingAlgo
UNION
SELECT id,id+1
FROM Test3PointSmoothingAlgo
)
--Apply the 3-point smoothing formula (divide by 3.0 to avoid integer division)
SELECT a.SmoothingPoint, SUM(ISNULL(b.qty,0))/3.0 AS Qty_Smoothed
FROM CTE_3PSA a
LEFT JOIN Test3PointSmoothingAlgo b
ON a.Coefficients=b.id
GROUP BY a.SmoothingPoint
)
GO
SELECT SmoothingPoint,Qty_Smoothed FROM dbo.ufn_Test3PointSmoothingAlgo_qty(NULL)
I think you may need to break your functionality into two parts: UDAs, which can work on scopes thanks to the OVER (...) clause, and formulas which combine the resulting scalars.
What you are asking for - defining objects in such a way as to make an aggregate/scalar combo - is probably beyond the scope of regular SQL Server's capabilities, unless you fall back to CLR code that would effectively be equivalent to a cursor in terms of performance, or worse.
Your best shot is probably to define an SP (I know you don't want that) that will produce the whole result - for example, a [derivative] stored procedure that takes table and column names as parameters. You can even expand on the idea, but in the end that's not exactly what you want.
Since you mention you will be upgrading to SQL Server 2012: SQL Server 2008 introduced Table-Valued Parameters.
This feature will do what you want. You will have to define a User Defined Type (UDT) in your DB which is like a table definition with columns & their respective types.
You can then use that UDT as a parameter type for any other stored procedure or function in your DB.
You can combine these UDTs with CLR integration to achieve what you require.
As mentioned, SQL is not good when you are comparing rows to other rows; it's much better at set-based operations where every row is treated as an independent entity.
But, before looking at cursors & CLR, you should make sure it can't be done in pure TSQL which will almost always be faster & scale better as your table grows.
One method for comparing rows based on order is to wrap your data in a CTE, adding a ranking function like ROW_NUMBER to set the row order, followed by a self-join of the CTE onto itself.
The join is performed on the ordered field, e.g. ROW_NUMBER = (ROW_NUMBER - 1).
Look at this article for an example
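A minimal sketch of that pattern, using the question's Customer_Detail columns (this works on SQL Server 2005, before LAG/LEAD were available):

```sql
WITH Ordered AS
(
    SELECT Customer_ID, [Date], Amount,
           ROW_NUMBER() OVER (PARTITION BY Customer_ID
                              ORDER BY [Date]) AS rn
    FROM Customer_Detail
)
SELECT cur.Customer_ID, cur.[Date],
       cur.Amount - prev.Amount AS AmountChange  -- NULL on each customer's first row
FROM Ordered AS cur
LEFT JOIN Ordered AS prev
       ON prev.Customer_ID = cur.Customer_ID
      AND prev.rn = cur.rn - 1;
```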

SQL - renumbering a sequential column to be sequential again after deletion

I've researched and realize I have a unique situation.
First off, I am not allowed to post images yet to the board since I'm a new user, so see appropriate links below
I have multiple tables where a column (not always the identifier column) is sequentially numbered and shouldn't have any breaks in the numbering. My goal is to make sure this stays true.
Down and Dirty
We have an 'Event' table where we randomly select a percentage of the rows and insert the rows into table 'Results'. The "ID" column from the 'Results' is passed to a bunch of delete queries.
This more or less ensures that there are missing rows in several tables.
My problem:
Figuring out an SQL query that will renumber the column I specify. I prefer not to drop the column.
Example delete query:
delete ItemVoid
from ItemTicket
join ItemVoid
on ItemTicket.item_ticket_id = itemvoid.item_ticket_id
where itemticket.ID in (select ID
from results)
Example Tables Before:
Example Tables After:
As you can see, 2 rows were deleted from both tables based on the ID column. So now I've got to figure out how to renumber the item_ticket_id and the item_void_id columns so that the higher numbers decrease to fill the missing values, then the next highest decreases, etc. Problem #2: if the item_ticket_id changes in order to be sequential in ItemTickets, then
that change has to be propagated to ItemVoid's item_ticket_id.
I appreciate any advice you can give on this.
(answering an old question as it's the first search result when I was looking this up)
(MS T-SQL)
Resequencing an ID column (not an Identity one) that has gaps
can be done with a simple CTE containing a ROW_NUMBER() to generate the new sequence.
The UPDATE works through the CTE "virtual table" without any extra problems, actually updating the underlying original table.
Don't worry about the ID values clashing during the update: if you wonder what happens when IDs are set that already exist, it
doesn't suffer that problem - the original sequence is changed to the new sequence in one go.
WITH NewSequence AS
(
SELECT
ID,
ROW_NUMBER() OVER (ORDER BY ID) as ID_New
FROM YourTable
)
UPDATE NewSequence SET ID = ID_New;
Since you are looking for advice on this, my advice is that you need to redesign this, as I see a big flaw in your design.
Instead of deleting the records and then going through the hassle of renumbering the remaining records, use a bit flag to mark the records as inactive. Then when you are querying the records, just include a WHERE clause to only include the records that are active:
SELECT *
FROM yourTable
WHERE Inactive = 0
Then you never have to worry about re-numbering the records. This also gives you the ability to go back and see the records that would have been deleted and you do not lose the history.
If you really want to delete the records and renumber them then you can perform this task the following way:
create a new table
Insert your original data into your new table using the new numbers
drop your old table
rename your new table with the corrected numbers
As you can see there would be a lot of steps involved in re-numbering the records. You are creating much more work this way when you could just perform an UPDATE of the bit flag.
You would change your DELETE query to something similar to this:
UPDATE ItemVoid
SET InActive = 1
FROM ItemVoid
JOIN ItemTicket
on ItemVoid.item_ticket_id = ItemTicket.item_ticket_id
WHERE ItemTicket.ID IN (select ID from results)
The bit flag is much easier and that would be the method that I would recommend.
The function that you are looking for is a window function. In SQL Server (and standard SQL), the function is row_number(). You use it as follows:
select row_number() over (order by <col>)
from <table>
In order to use this in your case, you would delete the rows from the table, then use a with statement to recalculate the row numbers, and then assign them using an update. For transactional integrity, you might wrap the delete and update into a single transaction.
Oracle supports similar functionality, but the syntax is a bit different. Oracle calls these functions analytic functions and they support a richer set of operations on them.
I would strongly caution you against using cursors, since they have lousy performance. Of course, this will not work on an identity column, since such a column cannot be modified.
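Put together with the question's tables, the delete-then-renumber approach might be sketched like this (assuming item_void_id is not an identity column):

```sql
BEGIN TRANSACTION;

-- Delete the sampled rows, as in the question's delete query.
DELETE iv
FROM ItemVoid AS iv
JOIN ItemTicket AS it ON it.item_ticket_id = iv.item_ticket_id
WHERE it.ID IN (SELECT ID FROM results);

-- Renumber the surviving rows 1..n through an updatable CTE.
WITH Renumbered AS
(
    SELECT item_void_id,
           ROW_NUMBER() OVER (ORDER BY item_void_id) AS new_id
    FROM ItemVoid
)
UPDATE Renumbered SET item_void_id = new_id;

COMMIT TRANSACTION;
```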

SQL "WITH" Performance and Temp Table (possible "Query Hint" to simplify)

Given the example queries below (Simplified examples only)
DECLARE @DT int; SET @DT=20110717; -- yes this is an INT
WITH LargeData AS (
SELECT * -- This is a MASSIVE table indexed on dt field
FROM mydata
WHERE dt=@DT
), Ordered AS (
SELECT TOP 10 *
, ROW_NUMBER() OVER (ORDER BY valuefield DESC) AS Rank_Number
FROM LargeData
)
SELECT * FROM Ordered
and ...
DECLARE @DT int; SET @DT=20110717;
BEGIN TRY DROP TABLE #LargeData END TRY BEGIN CATCH END CATCH; -- dump any possible table.
SELECT * -- This is a MASSIVE table indexed on dt field
INTO #LargeData -- put smaller results into temp
FROM mydata
WHERE dt=@DT;
WITH Ordered AS (
SELECT TOP 10 *
, ROW_NUMBER() OVER (ORDER BY valuefield DESC) AS Rank_Number
FROM #LargeData
)
SELECT * FROM Ordered
Both produce the same results, which is a limited and ranked list of values from a list based on a fields data.
When these queries get considerably more complicated (many more tables, lots of criteria, multiple levels of "with" table aliases, etc.) the bottom query executes MUCH faster than the top one - sometimes on the order of 20x-100x faster.
The Question is...
Is there some kind of query HINT or other SQL option that would tell SQL Server to perform the same kind of optimization automatically, or another format that would involve a cleaner approach (trying to keep the format as much like query 1 as possible)?
Note that the "Ranking" and secondary queries are just fluff for this example; the actual operations performed really don't matter too much.
This is sort of what I was hoping for (or something similar, but I hope the idea is clear). Remember, the query below does not actually work.
DECLARE @DT int; SET @DT=20110717;
WITH LargeData AS (
SELECT * -- This is a MASSIVE table indexed on dt field
FROM mydata
WHERE dt=@DT
**OPTION (USE_TEMP_OR_HARDENED_OR_SOMETHING) -- EXAMPLE ONLY**
), Ordered AS (
SELECT TOP 10 *
, ROW_NUMBER() OVER (ORDER BY valuefield DESC) AS Rank_Number
FROM LargeData
)
SELECT * FROM Ordered
EDIT: Important follow up information!
If in your sub query you add
TOP 999999999 -- improves speed dramatically
Your query will behave in a similar fashion to using a temp table in the previous query. I found the execution times improved in almost exactly the same fashion, WHICH IS FAR SIMPLER than using a temp table and is basically what I was looking for.
However
TOP 100 PERCENT -- does NOT improve speed
Does NOT perform in the same fashion (you must use the static-number style TOP 999999999).
Explanation:
From what I can tell from the actual execution plans of the query in both formats (the original one with normal CTEs, and the one where each subquery has TOP 999999999):
The normal query joins everything together as if all the tables were in one massive query, which is what is expected. The filtering criteria are applied almost at the join points in the plan, which means many more rows are being evaluated and joined together all at once.
In the version with TOP 999999999, the actual execution plan clearly separates the subqueries from the main query in order to apply the TOP statement's action, thus forcing creation of an in-memory "Bitmap" of the subquery that is then joined to the main query. This appears to do exactly what I wanted, and in fact it may even be more efficient, since servers with large amounts of RAM will be able to do the query execution entirely in memory without any disk IO. In my case we have 280 GB of RAM, so well more than could ever really be used.
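In other words, the first query can be nudged toward the temp-table-like plan by changing only the CTE. A sketch of the workaround described above (whether the optimizer actually materializes the set remains its decision):

```sql
DECLARE @DT int; SET @DT = 20110717;

WITH LargeData AS (
    SELECT TOP 999999999 *   -- literal TOP; TOP 100 PERCENT does not work
    FROM mydata
    WHERE dt = @DT
), Ordered AS (
    SELECT TOP 10 *,
           ROW_NUMBER() OVER (ORDER BY valuefield DESC) AS Rank_Number
    FROM LargeData
)
SELECT * FROM Ordered;
```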
Not only can you use indexes on temp tables, but they also allow the use of statistics and of hints. I can find no reference to being able to use statistics in the documentation on CTEs, and it specifically says you can't use hints.
Temp tables are often the most performant way to go when you have a large data set; when the choice is between temp tables and table variables, they win even when you don't use indexes (possibly because the plan is developed using statistics), and I suspect the implementation of the CTE is more like the table variable than the temp table.
I think the best thing to do, though, is to see how the execution plans differ, to determine whether it is something that can be fixed.
What exactly is your objection to using the temp table when you know it performs better?
The problem is that in the first query the SQL Server query optimizer is able to generate a query plan. In the second query, a good query plan can't be generated because you're inserting the values into a new temporary table. My guess is there is a full table scan going on somewhere that you're not seeing.
What you may want to do in the second query is insert the values into the #LargeData temporary table like you already do and then create a non-clustered index on the "valuefield" column. This might help to improve your performance.
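A sketch of that suggestion applied to the second query (the index column comes from the example's ORDER BY; the index name is made up):

```sql
DECLARE @DT int; SET @DT = 20110717;

SELECT *
INTO #LargeData
FROM mydata
WHERE dt = @DT;

-- Index to support the ROW_NUMBER() ... ORDER BY valuefield DESC step.
CREATE NONCLUSTERED INDEX IX_LargeData_valuefield
    ON #LargeData (valuefield DESC);
```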
It is quite possible that SQL is optimizing for the wrong value of the parameters.
There are a couple of options
Try using OPTION (RECOMPILE). There is a cost to this, as it recompiles the query every time, but if different plans are needed it might be worth it.
You could also try using OPTION (OPTIMIZE FOR @DT = SomeRepresentativeValue). The problem with this is that you might pick the wrong value.
See I Smell a Parameter! from The SQL Server Query Optimization Team blog
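Both options might be sketched as follows (the representative value is a placeholder you would pick from your own data):

```sql
-- Option 1: recompile every time, so each @DT gets its own plan.
SELECT TOP 10 *
FROM mydata
WHERE dt = @DT
ORDER BY valuefield DESC
OPTION (RECOMPILE);

-- Option 2: always optimize as if @DT were a typical value.
SELECT TOP 10 *
FROM mydata
WHERE dt = @DT
ORDER BY valuefield DESC
OPTION (OPTIMIZE FOR (@DT = 20110717));
```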