Update behaviour - SQL

When I made a mistake in an UPDATE query, I noticed an unpredictable result. Here is the query text:
DECLARE @T TABLE (Id int, [Name] nvarchar(100), RNA int)

INSERT INTO @T (Id, [Name])
SELECT [Id], [Name]
FROM (VALUES (1, N'D'),
             (2, N'B'),
             (3, N'S'),
             (4, N'A'),
             (5, N'F')
) AS vtable([Id], [Name])

UPDATE @T
SET RNA = T.RN
FROM (
    SELECT PP.Name, ROW_NUMBER() OVER (ORDER BY PP.Name) AS RN, PP.RNA
    FROM @T PP
) T

SELECT * FROM @T
I know where the mistake was made:
UPDATE @T
should be
UPDATE T
But why does the result (with the "bad" query) look like this:
Id Name RNA
---- ----- -------
1 D 1
2 B 5
3 S 1
4 A 5
5 F 1
I suspect that 1 and 5 values are MIN(Id) and MAX(Id).
The execution plan looks like this: [execution plan screenshot]
Will the result be the same in every situation with this kind of mistake?
If yes, does this behaviour have any practical value?

The situation will not be the same for every kind of mistake. You have a non-deterministic UPDATE statement; that is to say, theoretically any of the values for RN in your subquery T could be applied to any of the rows in @T. You are essentially running the UPDATE version of this:
SELECT *
FROM @T a
CROSS JOIN
(   SELECT TOP 1
        PP.Name,
        ROW_NUMBER() OVER (ORDER BY PP.Name) AS RN,
        PP.RNA
    FROM @T PP
    ORDER BY NEWID()
) T
OPTION (FORCE ORDER);
The online manual states:
The results of an UPDATE statement are undefined if the statement
includes a FROM clause that is not specified in such a way that only
one value is available for each column occurrence that is updated,
that is if the UPDATE statement is not deterministic.
What is slightly interesting is that if you run the above you will get a different result each time (barring the 1-in-25 chance of getting the same result twice in a row). If you remove the random sorting using NEWID(), you will get the same value of RN for each row, yet the update consistently returns the same results, with 2 different RNs. I am not surprised the result remains consistent with no random ordering: with no changes to the data and no random factor introduced, I would expect the optimiser to come up with the same execution plan no matter how many times it is run.
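For contrast, a deterministic version ties each computed RN back to exactly one target row. This is what the fix mentioned in the question (UPDATE T instead of UPDATE @T) achieves, since SQL Server lets you update through a derived table:

```sql
-- Deterministic: updating through the derived table alias means each
-- base row receives the RN that was computed for that same row.
UPDATE T
SET RNA = T.RN
FROM (
    SELECT PP.Name,
           ROW_NUMBER() OVER (ORDER BY PP.Name) AS RN,
           PP.RNA
    FROM @T PP
) T;
```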
Since no explicit ordering is specified in your update query, the order is determined by the order of the records at the leaf level; if the order of the records is altered, the result is altered. This can be shown by inserting the records of @T into a new table with different IDs:
DECLARE @T2 TABLE (Id int, [Name] nvarchar(100), RNA int);

INSERT @T2
SELECT Id, Name, NULL
FROM @T
ORDER BY ROW_NUMBER() OVER (ORDER BY NEWID())
OPTION (FORCE ORDER);

UPDATE @T2
SET RNA = T.RN
FROM (
    SELECT PP.Name, ROW_NUMBER() OVER (ORDER BY PP.Name) AS RN, PP.RNA
    FROM @T2 PP
) T

SELECT *
FROM @T2;
I can see no reason why this is always the min or max value of RN, though; I expect you would have to delve deep into the optimiser to find out, which is probably a new question better suited to the DBA Stack Exchange.

Related

SQL Server 2008: Need to do math on the previous row

I'm working in SQL Server 2008, so the analytic functions are not an option.
Basically I have amount financed and payment made, but need to calculate interest for the first row - which is done, but need for the next row so need to grab the balance from the previous row.
Without any schema context, I can only provide a general structure, but in SQL Server 2008 you should be able to do something like this:
-- This is called a CTE (Common Table Expression)
-- Think of it as a named sub-query
;WITH computed_table AS (
    -- The ROW_NUMBER() function produces an ordered computed
    -- column, ordered by the values in the column specified in
    -- the OVER clause
    SELECT ROW_NUMBER() OVER (ORDER BY Id) AS row_num
          ,*
    FROM my_table
)
SELECT *
      -- perform calculations on t1 and prev
      ,(t1.amount - prev.amount) AS CalculatedAmt -- example calculation
FROM computed_table t1
OUTER APPLY (
    SELECT *
    FROM computed_table t2
    WHERE t2.row_num = t1.row_num - 1
) AS prev
The CTE and the ROW_NUMBER() function are necessary to make sure you have a perfectly ordered column with no gaps, something which can't be guaranteed with a primary key field since rows could be deleted. The OUTER APPLY allows you to perform a table-valued operation on the individual values of the rows in the left hand table.
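As a self-contained sketch of the pattern, with a hypothetical @pay table standing in for the real schema (which the question doesn't show):

```sql
-- Hypothetical sample data; the real table and columns are unknown.
DECLARE @pay TABLE (Id int, amount decimal(10,2));
INSERT INTO @pay VALUES (1, 100.00), (2, 80.00), (3, 60.00);

WITH computed_table AS (
    SELECT ROW_NUMBER() OVER (ORDER BY Id) AS row_num, Id, amount
    FROM @pay
)
SELECT t1.Id,
       t1.amount,
       prev.amount AS prev_amount,               -- NULL for the first row
       t1.amount - prev.amount AS CalculatedAmt  -- example calculation
FROM computed_table t1
OUTER APPLY (
    SELECT t2.amount
    FROM computed_table t2
    WHERE t2.row_num = t1.row_num - 1
) AS prev;
```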
EDIT: To insert the results into a table, rather than just selecting them, you can add an INTO clause after the select list:
...(CTE HERE)...
SELECT *
-- perform calculations on t1 and prev
,(t1.amount - prev.amount) AS CalculatedAmt -- example calculation
-- This INTO clause creates my_table and inserts the result set into it.
-- Note that SELECT ... INTO creates a NEW table; to load an existing
-- table, use INSERT INTO my_table (...) SELECT ... instead.
INTO my_table
FROM computed_table t1
...(REST OF QUERY HERE)...
Try this example
DECLARE @tbl TABLE (ID INT, Test VARCHAR(100), SortKey INT);
INSERT INTO @tbl VALUES (1,'Test 1 3',3),(2,'Test 2 4',4),(3,'Test 3 1',1),(4,'Test 4 2',2);

WITH Sorted AS
(
    SELECT ROW_NUMBER() OVER (ORDER BY SortKey) AS Nr
          ,*
    FROM @tbl
)
SELECT s.Test
      ,(SELECT prev.Test FROM Sorted AS prev WHERE s.Nr = prev.Nr + 1) AS PreviousRow
      ,(SELECT nxt.Test  FROM Sorted AS nxt  WHERE s.Nr = nxt.Nr - 1)  AS NextRow
FROM Sorted AS s
Attention
ROW_NUMBER() OVER() will only work as expected if the values you are sorting by are unique!
The result
Test PreviousRow NextRow
Test 3 1 NULL Test 4 2
Test 4 2 Test 3 1 Test 1 3
Test 1 3 Test 4 2 Test 2 4
Test 2 4 Test 1 3 NULL

How to do batch insert when there is no identity column?

Table with million rows, two columns.
code | name
xyz | product1
abc | Product 2
...
...
I want to do insert in small batches (10000) via the insert into/select query.
How can we do this when there is no identity key to create a batch?
You could use a LEFT OUTER JOIN in your SELECT statement to identify records that are not already in the INSERT table, then use TOP to grab the first 10000 that the database finds. Something like:
INSERT INTO tableA
SELECT TOP 10000 code, name
FROM tableB LEFT OUTER JOIN tableA ON tableB.Code = tableA.Code
WHERE tableA.Code IS NULL;
And then run that over and over and over again until it's full.
You could also use windowing functions to batch, like:
INSERT INTO tableA
SELECT code, name
FROM (
    SELECT code, name, ROW_NUMBER() OVER (ORDER BY name) AS rownum
    FROM tableB
) AS numbered
WHERE rownum BETWEEN 1 AND 10000;
And then just keep changing the BETWEEN to get your batch. Personally, if I had to do this, I would use the first method though since it's guaranteed to catch everything that isn't already in TableA.
Also, if there is the possibility that tableB will gain records during this batching process, then option 1 is definitely better. Essentially, with option 2, the ROW_NUMBER() is determined on the fly, so newly inserted records can cause other records to be missed if they show up in the middle of batches.
If TableB is static, then Option 2 may be faster since the DB just has to sort and number the records, instead of having to join HUGE table to HUGE table and then grab 10000 records.
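The "run it over and over" step can be automated with a simple loop. This is a sketch driving the anti-join version, using @@ROWCOUNT to detect when there is nothing left to copy (table and column names follow the answer above):

```sql
WHILE 1 = 1
BEGIN
    INSERT INTO tableA (code, name)
    SELECT TOP (10000) b.code, b.name
    FROM tableB b
    LEFT OUTER JOIN tableA a ON b.code = a.code
    WHERE a.code IS NULL;

    -- Stop once a pass copies no rows.
    IF @@ROWCOUNT = 0 BREAK;
END
```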
You can do the pagination on the SELECT, selecting records in a batch/page size of, say, 10000 or whatever you need, and insert them into the target table. In the sample below you will have to change the values of @Min and @Max for each iteration to step through batches of the size you want.
INSERT INTO EmployeeNew
SELECT Name
FROM
(
    SELECT DENSE_RANK() OVER (ORDER BY EmployeeId) AS Rank, Name
    FROM Employee
) AS RankedEmployee
WHERE Rank >= @Min AND Rank < @Max
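A sketch of the surrounding loop that advances @Min and @Max by the batch size on each pass (the 10000 batch size comes from the question; EmployeeNew/Employee are the answer's example names):

```sql
DECLARE @Min int = 1, @Max int = 10001;

WHILE 1 = 1
BEGIN
    INSERT INTO EmployeeNew
    SELECT Name
    FROM (
        SELECT DENSE_RANK() OVER (ORDER BY EmployeeId) AS Rank, Name
        FROM Employee
    ) AS RankedEmployee
    WHERE Rank >= @Min AND Rank < @Max;

    IF @@ROWCOUNT = 0 BREAK;

    -- Slide the window forward by the batch size.
    SET @Min = @Min + 10000;
    SET @Max = @Max + 10000;
END
```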

Get value of a specific column of a row without primary key

This is purely out of curiosity.
create table test (ename varchar(50))
insert into test values ('abcd')
insert into test values ('pqrs')
insert into test values ('lmno')
insert into test values ('xxxx')
insert into test values ('tops')
I want the value of 3rd row from this table in a variable. i.e "lmno"
If I do this :
DECLARE @value varchar(50)

SELECT @value = ename FROM
(
    SELECT ROW_NUMBER() OVER (ORDER BY ename) AS Rowno, *
    FROM test
) X WHERE Rowno = 3

PRINT @value
I will get pqrs.
I cannot use this:
DECLARE @value varchar(50)

SELECT @value = ename FROM
(
    SELECT ROW_NUMBER() OVER (ORDER BY 1) AS Rowno, *
    FROM test
) X WHERE Rowno = 3
because
Windowed functions do not support integer indices as ORDER BY clause
expressions.
Any options?
EDIT :
If I query it as
Select * from test
I do get records in the order in which they were inserted. That means somewhere there is a record as to how they were inserted. I just want to capture this sequence.
You are making a very, very poor assumption about RDBMSs. The order in which an RDBMS stores records, or the order in which they are written to the table, is absolutely inconsequential. It means nothing; it's arbitrary and you can't rely on it.
You will need to either add a new column to hold the 'order' that you desire, or better define why you want 'lmno' in your recordset, since "3rd record" is meaningless in this sense.
To your edit: there is no record of the order in which the records were inserted. There is an order in which the records are returned to the record set, and an order in which they naturally lie in the DB's structure underneath, but it is arbitrary. The reason you get them back in the order in which they were written is that you have a tiny little table on an RDBMS that stores data in a single spot. This fails as soon as you scale your architecture up. You cannot, and should never, rely on the order in which your RDBMS retrieves records.
Let's look at it step by step.
select ename from test order by ename;
This orders by ename.
select ename from test order by 1;
Here 1 is an alias for the 1st element in your select clause, which is ename. So you order by ename again.
select Row_number() over(order by ename) Rowno, * from test
The row_number function works on records ordered by ename.
select Row_number() over(order by 1) Rowno, * from test
What is 1 supposed to mean here? We are inside an OVER clause, and there is no select-list element the 1 could refer to. So a number is not allowed here (it would only be confusing, as it could only mean a literal 1 for every record, which doesn't order anything).
As to "I do get records in the order in which they were inserted. That means somewhere there is a record as to how they were inserted. I just want to capture this sequence.": No, that isn't the case. Right now you happen to get the records in the order they were inserted, but this is in no way guaranteed. The only way to guarantee an order is to have fields to represent the desired order and use them in ORDER BY.
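If insertion order matters, it has to be recorded explicitly. A minimal sketch using an IDENTITY column (test2 is a hypothetical replacement for the question's test table):

```sql
CREATE TABLE test2 (seq int IDENTITY(1,1), ename varchar(50));

-- The IDENTITY value captures the order of arrival.
INSERT INTO test2 (ename) VALUES ('abcd'), ('pqrs'), ('lmno'), ('xxxx'), ('tops');

DECLARE @value varchar(50);
SELECT @value = ename FROM test2 WHERE seq = 3;  -- 'lmno'
PRINT @value;
```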
Try this code:
create table test (ename varchar(50))
insert into test values ('abcd')
insert into test values ('pqrs')
insert into test values ('lmno')
insert into test values ('xxxx')
insert into test values ('tops')
SELECT *
FROM (
    SELECT ROW_NUMBER() OVER (ORDER BY HH) AS RNO, ename
    FROM (SELECT ename, '' AS HH FROM test) T
) T1
WHERE RNO = 3
WITH MyCte AS
(
SELECT *, row_number() OVER(ORDER BY (SELECT 0)) ID FROM test
)
SELECT *
FROM MyCte
WHERE ID = 3
Try:
create table #test (ename varchar(50))
insert into #test values ('abcd')
insert into #test values ('pqrs')
insert into #test values ('lmno')
insert into #test values ('xxxx')
insert into #test values ('tops')
CREATE TABLE #Temp (RowID INT PRIMARY KEY IDENTITY(1,1), ename VARCHAR(50))

INSERT INTO #Temp (ename)
SELECT ename FROM #test
SELECT T2.*
FROM #temp T1
JOIN #test T2 ON T1.ename = T2.ename
WHERE T1.RowID = 3

Retrieving most recent data in SQL

Total disclosure: I'm a SQL beginner.
I have a data set of certain accounting and governance metrics for US companies. It has about 15 columns and roughly 18 million rows. Each row is a unique combination of company, date and metric being measured. The columns include certain identifiers like isin number, ticker symbol, etc, the date the metric was released, the metric description, and the metric itself.
What I'm trying to do is write a query that will yield the NEWEST values for a certain metric for all companies. In my hopeless search over the past few days I've come to think that the GROUP BY clause may be what I'm looking for. However, it doesn't seem to do exactly what I need. I've got it working with just 2 columns: isin number (company identifier), and date. In other words, I can spit out a list that shows the most recent date for each company, but I'm not sure how to add more columns to this, how to specify what metric to look at.
Any guidance would be appreciated, even if it's just pointing me in the right direction towards what kind of commands I should be looking into.
Thanks!
EDIT: Wow. Thanks for the quick and thorough replies. And point taken on the clarity and example data sets/starting query. Update: I think I have it working. Here's what I used:
SELECT a1.["id_isin_number"], a1.["metric_description"], a1.["date_period_ends"], a1.["company_metric_value"], a2.maxdate
FROM [AGR Metrics].[dbo].[Audit_Integrity_Metric_Data_File_NA Original_0] a1
INNER JOIN (
SELECT a2.["id_isin_number"], MAX(a2.["date_period_ends"]) AS maxdate
FROM [AGR Metrics].[dbo].[Audit_Integrity_Metric_Data_File_NA Original_0] a2
GROUP BY a2.["id_isin_number"]
) a2
ON a1.["date_period_ends"] = a2.maxdate
AND a1.["id_isin_number"] = a2.["id_isin_number"]
WHERE a1.["metric_description"] = '"Litigation: Class Action"'
I'm looking over the responses now to make sure I'm doing this as efficiently as possible.
You can use the ROW_NUMBER() function for this (if using SQL Server 2005 or newer):
SELECT *
FROM (SELECT *,ROW_NUMBER() OVER(PARTITION BY isin ORDER BY [date] DESC) AS RowRank
FROM YourTable
)sub
WHERE RowRank = 1
Just list out the fields you want in place of * if you don't want them all returned.
The ROW_NUMBER() function adds a number to each row. PARTITION BY is optional and defines a group within which the numbering starts over at 1; in this case, you want the most recent row for each value of isin, so we PARTITION BY that. ORDER BY is required and defines the order of the numbering, in this case by date.
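Applied to the table and column names from the question (a sketch; the unusual quoted column names are copied verbatim from the asker's query):

```sql
SELECT sub.*
FROM (
    SELECT a1.*,
           ROW_NUMBER() OVER (PARTITION BY a1.["id_isin_number"]
                              ORDER BY a1.["date_period_ends"] DESC) AS RowRank
    FROM [AGR Metrics].[dbo].[Audit_Integrity_Metric_Data_File_NA Original_0] a1
    WHERE a1.["metric_description"] = '"Litigation: Class Action"'
) sub
WHERE sub.RowRank = 1;
```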
Your current query can also be used, but the ROW_NUMBER() method is simpler and more efficient:
SELECT a.*
FROM YourTable a
JOIN (SELECT isin, MAX([date]) AS [date]
      FROM YourTable
      GROUP BY isin
     ) b
  ON a.isin = b.isin
 AND a.[date] = b.[date]
Well, as you say you have the date the metric was released, you can use it to sort your table with ORDER BY.
This is a very basic example of simply sorting the data and selecting the top 1 value:
CREATE TABLE trialOne (
    Id INT NULL,
    NAME VARCHAR(50) NULL,
    [Date] DATETIME NULL
)

INSERT INTO trialOne VALUES(1,'john','2009-01-06 11:39:51.827')
INSERT INTO trialOne VALUES(2,'joseph','2010-01-06')
INSERT INTO trialOne VALUES(3,'Ajay','2009-05-06')
INSERT INTO trialOne VALUES(4,'Dave','2009-11-06')
INSERT INTO trialOne VALUES(5,'jonny','2004-01-06')
INSERT INTO trialOne VALUES(6,'sunny','2005-01-06')
INSERT INTO trialOne VALUES(7,'elle','2013-01-06')
INSERT INTO trialOne VALUES(8,'mac','2012-01-06')
INSERT INTO trialOne VALUES(9,'Sam','2008-01-06')
INSERT INTO trialOne VALUES(10,'xxxxx','2013-08-06')

SELECT TOP(1) NAME FROM trialOne ORDER BY [Date] DESC

SQL "over" partition WHERE date between two values

I have a query that partitions and ranks "Note" records, grouping them by ID_Task (users add notes for each task). I want to rank the notes by date, but I also want to restrict it so they're ranked between two dates.
I'm using SQL Server 2008. So far my SELECT looks like this:
SELECT Note.ID,
Note.ID_Task,
Note.[Days],
Note.[Date],
ROW_NUMBER() OVER (PARTITION BY ID_Task ORDER BY CAST([Date] AS DATE), Edited ASC) AS Rank
FROM
Note
WHERE
Note.Locked = 1 AND Note.Deleted = 0
Now, I assume that if I put the WHERE clause at the bottom, although the rows will still have ranks, I might not get an item with rank 1, as it might get filtered out. So is there a way to partition only the records matching the WHERE clause, ignoring all of the others? I could partition a sub-query, I guess.
The intention is to use the rank number to find the most recent note for each task, in another query. So in that query I'll join with this result WHERE rank = 1.
ROW_NUMBER() operates after the WHERE clause, so you'll always get a row numbered 1.
For example:
declare @t table (id int)
insert @t values (3), (1), (4)

select row_number() over (order by id)
from @t
where id > 1
This prints:
1
2
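Applying that to the query in the question: the date restriction simply goes in the WHERE clause, since rows outside the range are filtered before numbering, and ranking restarts at 1 within each ID_Task for the rows that remain. (@From and @To are hypothetical date parameters, not from the original query.)

```sql
SELECT Note.ID,
       Note.ID_Task,
       Note.[Days],
       Note.[Date],
       ROW_NUMBER() OVER (PARTITION BY ID_Task
                          ORDER BY CAST([Date] AS DATE), Edited ASC) AS Rank
FROM Note
WHERE Note.Locked = 1
  AND Note.Deleted = 0
  AND Note.[Date] >= @From   -- hypothetical lower bound
  AND Note.[Date] <  @To;    -- hypothetical upper bound
```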