I want to end up with a sorted table, i.e. when I run a query like select * from NewTable I get the rows back in sorted order.
I've tried the following, but it does not sort the table the way I specify:
select column1,column2,column3,column4
into NewTable
from Table1,Table2
order by column1,column2
You only get result sets in a particular order when you use order by. Tables represent unordered sets, so they have no order except when being output as result sets.
However, you can use a trick in SQL Server to make that order by fast. The trick is to use the order by in the insert and to have an identity primary key. Ordering by the primary key is then very efficient. You could do this as:
create table NewTable (
NewTableId int identity(1, 1) not null primary key,
column1 . . .
. . .
);
insert into NewTable(column1, column2, column3, column4)
select column1, column2, column3, column4
from Table1 cross join Table2
order by column1, column2;
Now when you select from the table doing:
select column1, column2, column3, column4
from NewTable
order by NewTableId;
You are ordering by the primary key, and no real sort is being done.
The clustered index of a table decides how the data is ordered, this example will demonstrate it:
CREATE TABLE test (id int, value varchar(10))
INSERT INTO test VALUES(1, 'z')
INSERT INTO test VALUES(2, 'y')
INSERT INTO test VALUES(3, 'x')
SELECT * FROM test
CREATE CLUSTERED INDEX IX_test ON test (value ASC)
SELECT * FROM test
This is the result:
id value
----------- -----
1 z
2 y
3 x
id value
----------- -----
3 x
2 y
1 z
After creating the index, the result is reversed, because the index sorts the value column ascending.
However please note, as others have mentioned, that the only 100% guaranteed way to get a correctly ordered result is to use an ORDER BY clause.
The answer "The clustered index of a table decides how the data is ordered..." is incorrect.
Without an ORDER BY, the order of the result set is undefined.
It depends on numerous factors, one of the most obvious being which index is actually used.
Here's a simple example to show that the quoted statement is wrong:
create table #t (id int identity (1,1) primary key clustered, col1 int)
INSERT INTO #t (col1)
values
(5),
(4),
(3),
(2),
(1),
(0)
SELECT col1 FROM #t
CREATE INDEX IX_t
ON #t (col1);
SELECT col1 FROM #t
Even if the clustered index is present, with a covering index in a different sort order the data will more likely be returned in the order of the index being used rather than in the order of the clustered index.
But if some pages are already in memory and others need to be loaded from disk, the result set might look different again.
To summarize: Without ORDER BY the sort order cannot be guaranteed.
The order of your result set is dictated by the execution plan: if the query uses an index and there is no ORDER BY, the order will be a by-product of that index. There is no guarantee on the order unless you issue an ORDER BY.
You can see that SKU1 has 2 rows, but the content of these 2 rows is actually the same; only the order of "b" and "c" differs.
What if I want to remove the duplicate rows as shown in the 2nd picture?
In Oracle there are the LEAST/GREATEST functions that could handle this, but I'm using SQL Server, so it doesn't work following the instructions of the post below:
How to remove duplicate rows in SQL
If it's only 2 columns whose order should not matter for the group by?
Then you could use IIF (or a CASE WHEN) to calculate the maximum and minimum values,
and use those calculated values in the GROUP BY.
For example:
select Name,
MAX(Val1) as Val1,
MIN(Val2) as Val2
from Table1
GROUP BY Name,
IIF(Val2 is null or Val1 < Val2, Val1, Val2),
IIF(Val1 is null or Val1 < Val2, Val2, Val1);
For the example records that would give the result:
Name Val1 Val2
SKU1 20 10
SKU2 20 10
Or if you want to use a fancy XML trick:
select Name, max(Val1) as Val1, min(Val2) as Val2
from (
select *,
cast(
convert(XML,
concat('<n>',Val1,'</n><n>',Val2,'</n>')
).query('for $n in /n order by $n return string($n)'
) as varchar(6)) as SortedValues
from Table1
) q
group by Name, SortedValues;
The last method could be more useful when there are more columns involved.
To actually remove the duplicates?
Here's an example that uses a table variable to demonstrate:
declare @Table1 TABLE (Id int, Name varchar(20), Val1 int, Val2 int);
Insert Into @Table1 values
(1,'SKU1',10,20),
(2,'SKU1',20,10),
(3,'SKU1',12,15),
(4,'SKU2',10,null),
(5,'SKU2',null,10),
(6,'SKU2',10,20);
delete from @Table1
where Id in (
select Id
from (
select Id,
row_number() over (partition by Name,
IIF(Val2 is null or Val1 < Val2, Val1, Val2),
IIF(Val1 is null or Val1 < Val2, Val2, Val1)
order by Val1 desc, Val2 desc
) as rn
from @Table1
) q
where rn > 1
);
select * from @Table1;
Use the MAX() and MIN() functions instead of Oracle's LEAST and GREATEST. I used the following steps and got the same result.
Create Table Transactions (Name varchar(255),Quantity1 int,Quantity2 int)
Insert Into Transactions values
('SKU1',10,20),
('SKU1',20,10),
('SKU2',10,20),
('SKU2',10,20)
Now I used the query below to get the result:
Select T1.Name,MAX(T1.Quantity1),MIN(T2.Quantity2) From Transactions T1
join Transactions T2
on T1.Name=T2.Name
group by T1.Name
Please Reply
greatest() can be simulated using a CASE expression
greatest(b,c) is the same as:
case
when b > c then b
else c
end
You can use this together with a distinct to remove your duplicates:
select distinct
a,
case when b > c then b else c end as x
from the_table
order by a;
Try %%physloc%%. It is the equivalent of Oracle's ROWID.
Find it
select *, %%physloc%% from [MyTable] where ...
Delete what you want
delete from [MyTable] where %%physloc%% = 0xDEADBEEF -- (your address)
Consider adding a Unique / Primary Key to prevent future occurrences.
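For instance, a unique constraint on the natural-key columns would stop the duplicates from coming back. A minimal sketch, assuming hypothetical table MyTable with key columns A and B:

```sql
-- Hypothetical names: adjust the table and column names to your schema.
ALTER TABLE MyTable
ADD CONSTRAINT UQ_MyTable_A_B UNIQUE (A, B);
```

After this, any insert that would create a duplicate (A, B) pair fails instead of silently adding another row.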
SELECT * FROM abc WHERE (A='SKU1' AND B=20) OR (A='SKU2' AND B=10)
a b c
SKU1 20 10
SKU2 10 20
This may seem a little complicated at first, but we can also use PIVOT/UNPIVOT to obtain the results
Below is the query
select *
from
(
select
*,
'quantity'+ cast(row_number() over (partition by name order by data) as nvarchar) cols
from
(
select
distinct name, data
from
(select * from transactions)s
unpivot
(
data for cols in (quantity1,quantity2)
)u
)s
)s
pivot
(
max(data) for cols in (quantity1,quantity2)
)p
From your question it is not clear whether you want to filter duplicates by row or duplicates by column. Let me describe both to make sure your question is addressed completely.
In Example 1, you can see that we have duplicate rows:
To filter them, just add the keyword DISTINCT to your query, as follows:
SELECT DISTINCT * FROM myTable;
It filters the duplicate rows and returns:
Hence, you don't need a least or greatest function in this case.
In Example 2, you can see that we have duplicates in the columns:
Here, SELECT DISTINCT * from abc will still return all 4 rows.
If we regard only the first column in the filtering, it can be achieved by the following query:
select distinct t.Col1,
(select top 1 Col2 from myTable ts where t.Col1=ts.Col1) Col2,
(select top 1 Col3 from myTable ts where t.Col1=ts.Col1) Col3
from myTable t
It will pick the first matching value in each column, so the result of the query will be:
The difference between Example 1 and this example is that it eliminated just the duplicate occurrences of the values in Col1 of myTable and then returned the related values of the other columns - hence the results in Col1 and Col2 are different.
Note:
In this case you can't just join the table myTable, because then you would be forced to list the columns in the select distinct, which would return more rows than you want to have. Unfortunately, T-SQL does not offer something like SELECT DISTINCT ON(fieldname), i.e. you can't directly specify a distinct (single) fieldname.
You might have thought "why not use GROUP BY?" The answer to that question is this: with GROUP BY you are forced either to specify all columns, which is a technical DISTINCT equivalent, or to use aggregate functions like MIN or MAX, which don't return what you want either.
A more advanced query (you might have seen it once before!) which has the same result is:
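To illustrate why the aggregate route falls short, here is a small sketch against the same myTable: MIN and MAX are computed independently per group, so the combination they return may never have existed as an actual row.

```sql
-- MIN(Col2) and MAX(Col3) can come from *different* rows of each group,
-- so the (Col2, Col3) pair in the output may not match any single row.
SELECT Col1, MIN(Col2) AS Col2, MAX(Col3) AS Col3
FROM myTable
GROUP BY Col1;
```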
SELECT Col1, Col2, Col3
FROM (
SELECT *, ROW_NUMBER() OVER (PARTITION BY Col1 ORDER BY Col1) AS RowNumber
FROM myTable
) t
WHERE RowNumber=1
This statement numbers each occurrence of a value in Col1 in the subquery and then takes the first of each set of duplicate rows - which effectively is a grouping by Col1 (but without the disadvantages of GROUP BY).
N.B. In the examples above, I am assuming a table definition like:
CREATE TABLE [dbo].[myTable](
[Col1] [nvarchar](max) NULL,
[Col2] [int] NULL,
[Col3] [int] NULL
) ON [PRIMARY] TEXTIMAGE_ON [PRIMARY]
For the examples above, we don't need to declare a primary key column. But generally speaking you'll need a primary key in database tables to be able to reference rows efficiently.
If you want to permanently delete rows not needed, you should introduce a primary key, because then you can delete the rows not displayed easily as follows (i.e. it is the inverse filter of the advanced query mentioned above):
DELETE FROM [dbo].[myTable]
WHERE myPK NOT IN
(SELECT myPK
FROM (
SELECT *, ROW_NUMBER() OVER (PARTITION BY Col1 ORDER BY Col1) AS RowNumber
FROM [dbo].[myTable]
) t
WHERE RowNumber = 1)
This assumes you have added an integer primary key myPK which auto-increments (you can do that via the SQL Management Studio easily by using the designer).
Or you can execute the following query to add it to the existing table:
BEGIN TRANSACTION
GO
ALTER TABLE dbo.myTable ADD
myPK int NOT NULL IDENTITY (1, 1)
GO
ALTER TABLE dbo.myTable ADD CONSTRAINT
PK_myTable PRIMARY KEY CLUSTERED (myPK)
WITH (STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF,
ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON)
ON [PRIMARY]
GO
ALTER TABLE dbo.myTable SET (LOCK_ESCALATION = TABLE)
GO
COMMIT
You can find some more examples here at MSDN.
Ok so I know you can't pull specific fields without an aggregate when you perform a SQL command with a group by, but it seems to me that if you are doing an aggregate on a primary key that is guaranteed to be unique, there should be a way to pull the other columns of that row along with it. Something like this:
SELECT Max(id),
foo,
bar
FROM mytable
GROUP BY value1,
value2
So ID is guaranteed to be unique, so it will have exactly 1 value for foo and bar, is there a way to generate a query like this?
I have tried this, but MyTable in this case has millions of rows, and the run-time for this is unacceptable:
SELECT *
FROM mytable
WHERE id IN (SELECT Max(id)
FROM mytable
GROUP BY value1,
value2)
AND ...
Ideally I would like a solution that works at least as far back as SQL server 2005, but if there are better solutions in the later versions I would like to hear them as well.
Make sure you have an index defined such as:
CREATE NONCLUSTERED INDEX idx_GroupBy ON mytable (value1, value2)
IN can tend to be slow if you have many rows. It may help to turn your IN into an INNER JOIN.
SELECT
data.*
FROM
mytable data
INNER JOIN
(
SELECT id = MAX(id)
FROM mytable
GROUP BY value1,
value2
) ids
ON data.id = ids.id
Unfortunately, SQL Server does not have any features that will do this any better.
Try this:
SELECT foo
,bar
,MAX(ID) OVER(Partition By Value1, Value2)
FROM myTable
CREATE NONCLUSTERED INDEX myIndex ON MyTable(Value1,Value2) INCLUDE(foo,bar)
This NCI will cover the whole query, so it will not need to lookup the table at all
When you create an index on a column or number of columns in MS SQL Server (I'm using version 2005), you can specify that the index on each column be either ascending or descending. I'm having a hard time understanding why this choice is even here. Using binary sort techniques, wouldn't a lookup be just as fast either way? What difference does it make which order I choose?
This primarily matters when used with composite indexes:
CREATE INDEX ix_index ON mytable (col1, col2 DESC);
can be used for either:
SELECT *
FROM mytable
ORDER BY
col1, col2 DESC
or:
SELECT *
FROM mytable
ORDER BY
col1 DESC, col2
, but not for:
SELECT *
FROM mytable
ORDER BY
col1, col2
An index on a single column can be efficiently used for sorting in both ways.
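As a sketch (index and column names are illustrative): an ascending single-column index can satisfy a descending sort via a backward scan, so no separate descending index is needed.

```sql
CREATE INDEX ix_col2 ON mytable (col2);       -- ascending by default

SELECT col2 FROM mytable ORDER BY col2;       -- forward scan of ix_col2
SELECT col2 FROM mytable ORDER BY col2 DESC;  -- backward scan of the same index
```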
See the article in my blog for details:
Descending indexes
Update:
In fact, this can matter even for a single column index, though it's not so obvious.
Imagine an index on a column of a clustered table:
CREATE TABLE mytable (
pk INT NOT NULL PRIMARY KEY,
col1 INT NOT NULL
)
CREATE INDEX ix_mytable_col1 ON mytable (col1)
The index on col1 keeps ordered values of col1 along with the references to rows.
Since the table is clustered, the references to rows are actually the values of the pk. They are also ordered within each value of col1.
This means that the leaves of the index are actually ordered on (col1, pk), and this query:
SELECT col1, pk
FROM mytable
ORDER BY
col1, pk
needs no sorting.
If we create the index as follows:
CREATE INDEX ix_mytable_col1_desc ON mytable (col1 DESC)
, then the values of col1 will be sorted descending, but the values of pk within each value of col1 will be sorted ascending.
This means that the following query:
SELECT col1, pk
FROM mytable
ORDER BY
col1, pk DESC
can be served by ix_mytable_col1_desc but not by ix_mytable_col1.
In other words, the columns that constitute a CLUSTERED INDEX on any table are always the trailing columns of any other index on that table.
For a true single column index it makes little difference from the Query Optimiser's point of view.
For the table definition
CREATE TABLE T1( [ID] [int] IDENTITY NOT NULL,
[Filler] [char](8000) NULL,
PRIMARY KEY CLUSTERED ([ID] ASC))
The Query
SELECT TOP 10 *
FROM T1
ORDER BY ID DESC
Uses an ordered scan with scan direction BACKWARD as can be seen in the Execution Plan. There is a slight difference however in that currently only FORWARD scans can be parallelised.
However it can make a big difference in terms of logical fragmentation. If the index is created with keys descending but new rows are appended with ascending key values then you can end up with every page out of logical order. This can severely impact the size of the IO reads when scanning the table and it is not in cache.
See the fragmentation results
name   page_count   avg_fragmentation_in_percent   fragment_count   avg_fragment_size_in_pages
-----  -----------  ----------------------------   --------------   --------------------------
T1     1000         0.4                            5                200
T2     1000         99.9                           1000             1
for the script below
/*Uses T1 definition from above*/
SET NOCOUNT ON;
CREATE TABLE T2( [ID] [int] IDENTITY NOT NULL,
[Filler] [char](8000) NULL,
PRIMARY KEY CLUSTERED ([ID] DESC))
BEGIN TRAN
GO
INSERT INTO T1 DEFAULT VALUES
GO 1000
INSERT INTO T2 DEFAULT VALUES
GO 1000
COMMIT
SELECT object_name(object_id) AS name,
page_count,
avg_fragmentation_in_percent,
fragment_count,
avg_fragment_size_in_pages
FROM
sys.dm_db_index_physical_stats(db_id(), object_id('T1'), 1, NULL, 'DETAILED')
WHERE index_level = 0
UNION ALL
SELECT object_name(object_id) AS name,
page_count,
avg_fragmentation_in_percent,
fragment_count,
avg_fragment_size_in_pages
FROM
sys.dm_db_index_physical_stats(db_id(), object_id('T2'), 1, NULL, 'DETAILED')
WHERE index_level = 0
It's possible to use the spatial results tab to verify the supposition that this is because the later pages have ascending key values in both cases.
SELECT page_id,
[ID],
geometry::Point(page_id, [ID], 0).STBuffer(4)
FROM T1
CROSS APPLY sys.fn_PhysLocCracker(%%physloc%%)
UNION ALL
SELECT page_id,
[ID],
geometry::Point(page_id, [ID], 0).STBuffer(4)
FROM T2
CROSS APPLY sys.fn_PhysLocCracker(%%physloc%%)
The sort order matters when you want to retrieve lots of sorted data, not individual records.
Note that (as you are suggesting with your question) the sort order is typically far less significant than what columns you are indexing (the system can read the index in reverse if the order is opposite what it wants). I rarely give index sort order any thought, whereas I agonize over the columns covered by the index.
@Quassnoi provides a great example of when it does matter.
In Oracle we would use rownum in the select as we created this table. Now in Teradata, I can't seem to get it to work. There isn't a single column I can sort on that has unique values (there's lots of duplication) unless I use 3 columns together.
The old way would be something like,
create table temp1 as
select
rownum as insert_num,
col1,
col2,
col3
from tables a join b on a.id=b.id
;
This is how you can do it:
create table temp1 as
(
select
sum(1) over( rows unbounded preceding ) insert_num
,col1
,col2
,col3
from a join b on a.id=b.id
) with data ;
Teradata has a concept of identity columns on its tables beginning around V2R6.x. These columns differ from Oracle's sequence concept in that the assigned numbers are not guaranteed to be sequential. An identity column in Teradata is simply used to guarantee row uniqueness.
Example:
CREATE MULTISET TABLE MyTable
(
ColA INTEGER GENERATED BY DEFAULT AS IDENTITY
(START WITH 1
INCREMENT BY 20),
ColB VARCHAR(20) NOT NULL
)
UNIQUE PRIMARY INDEX pidx (ColA);
Granted, ColA may not be the best primary index for data access or joins with other tables in the data model. It just shows that you could use it as the PI on the table.
This works too:
create table temp1 as
(
select
ROW_NUMBER() over( ORDER BY col1 ) insert_num
,col1
,col2
,col3
from a join b on a.id=b.id
) with data ;
I have a query, and I need a row number for each row that the query returns. I do not have a counter field. How do I do it?
Thanks in advance.
SELECT
ROW_NUMBER() OVER (ORDER BY <field_here>) as RowNum,
<the_rest_of_your_fields_here>
FROM
<my_table>
If you have a primary key, you can use this method on SQL Server 2005 and up:
SELECT
ROW_NUMBER() OVER (ORDER BY PrimaryKeyField) as RowNumber,
Field1,
Field2
FROM
YourSourceTable
If you don't have a primary key, what you may have to do is copy your table into a memory table (or a temp table if it is very large) using a method like this:
DECLARE @NewTable table (
RowNumber BIGINT IDENTITY(1,1) PRIMARY KEY,
Field1 varchar(50),
Field2 varchar(50)
)
INSERT INTO @NewTable
(Field1, Field2)
SELECT
Field1,
Field2
FROM
YourSourceTable
SELECT
RowNumber,
Field1,
Field2
FROM
@NewTable
The nice part about this is that you will be able to detect identical rows if your source table does not have a primary key.
Also, at this point, I would suggest adding a primary key to every table you have, if they don't already have one.
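A minimal sketch of retrofitting one onto an existing table (assuming it does not already have an Id column; names are placeholders):

```sql
ALTER TABLE YourSourceTable
ADD Id int IDENTITY(1,1) NOT NULL
    CONSTRAINT PK_YourSourceTable PRIMARY KEY;
```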
create table t1 (N1 int ,N2 int)
insert into t1
select 200,300
insert into t1
select 200,300
insert into t1
select 300,400
insert into t1
select 400,400
.....
......
select row_number() over (order by [N1]) RowNumber,* from t1
I have a Query, and I need to find the row number that the query return the answer. I do not have any counter field. how I do it ?
If I understand your question correctly, you have some sort of query that returns a row, or a few rows (answers), and you want a row number associated with each "answer".
As you said, you do not have any counter field, so what you need to do is the following:
decide on the ordering criteria
add the counter
run the query
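Those steps sketched in T-SQL (the table and field names are placeholders, not from your question):

```sql
SELECT ROW_NUMBER() OVER (ORDER BY SomeField) AS RowNumber,  -- the counter
       *
FROM YourTable;  -- replace with your actual query
```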
It would really help if you provided more details to your question(s).
If you confirm that I understand your question correctly, I will add a code example