SQL Server: update primary key after different insert statements

I want to update the primary key in SQL Server. I executed three insert statements on my table, and the primary key column now looks like this:
Id NUM
-------
1 T1
2 T2
3 T3
7 T4
8 T5
9 T6
13 T7
14 T8
15 T9
16 T10
I want to update the column Id to get this:
Id NUM
-------
1 T1
2 T2
3 T3
4 T4
5 T5
6 T6
7 T7
8 T8
9 T9
10 T10
Can someone please guide me on how to resolve this?
Thanks in advance.

Don't do it! Remember the purpose of primary keys. They are non-NULL keys that uniquely identify each row in a table. They serve multiple uses. In particular, they are used for foreign key references. And, in SQL Server, they are (by default) used to sort the original data.
The identity column provides an increasing sequence of numbers, balancing the objective of an increasing number with performance. As a result, gaps appear for various reasons, but particularly due to deletes, failed inserts, and performance optimizations in a parallel environment.
In general, the aesthetics of gapless numbers are less important than the functionality provided by the keys -- and gaps have basically no impact on performance.
And, in particular, changing primary keys can be quite expensive:
The data on the pages needs to be re-sorted for the clustered index; the rows get rewritten even when their relative order does not change.
Foreign keys have to be updated, if you have cascading updates set up for them.
Foreign keys are invalidated -- a really bad thing -- if you happen not to have the proper foreign key definitions.
And, even if you do go through the trouble of doing this, gaps are going to appear in the future, due to deletes, failed inserts, and database optimizations.

Use row_number() to generate the new sequence. You need to order by NUM, ignoring the leading character T:
UPDATE t
SET Id = rn
FROM
(
    SELECT Id, NUM,
           rn = row_number() OVER (ORDER BY convert(int, substring(NUM, 2, len(NUM) - 1)))
    FROM yourtable
) t
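Before running the UPDATE, it may be worth executing the inner SELECT on its own to confirm the new numbering (a quick sketch, reusing the yourtable placeholder above):
-- dry run of the renumbering: rn is what Id would become
SELECT Id, NUM,
       row_number() OVER (ORDER BY convert(int, substring(NUM, 2, len(NUM) - 1))) AS rn
FROM yourtable
ORDER BY rn;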


SQL Server 2016: query performance with join and without join

I have 2 tables, TABLE1 and TABLE2.
TABLE1 has columns masterId, Id, col1, col2, category
TABLE2 has columns Id, col1, col2
TABLE2.Id is primary key and TABLE1.Id is foreign key.
TABLE1.masterId is primary key of TABLE1.
TABLE1 has 10 million rows with Id 1 to 10 million and first 10 rows having category = 1
TABLE2 has only 10 rows with Id 1 to 10.
Now I want the col1 and col2 values where category = 1 (either from TABLE1 or TABLE2, because the values are the same in both tables).
Which among below 2 queries gives output faster?
Solution1:
SELECT T1.col1, T1.col2
FROM TABLE1 T1
WHERE T1.category = 1
Solution2:
SELECT T2.col1, T2.col2
FROM TABLE2 T2
INNER JOIN TABLE1 T1 ON T1.Id = T2.Id
WHERE T1.category = 1
Does Solution2 save table scan time on the millions of rows of TABLE1?
Limitation is:
In my real DB scenario, I can make Table1.Id a nonclustered index and Table1.category a nonclustered index as well. I cannot make Table1.Id the clustered index because I actually have another auto-increment column as the primary key of Table1 in the real scenario. So please share your thoughts with this limitation in mind.
Please confirm and share thoughts on this.
It depends on the existing indexes. With a nonclustered index on Id in T1, solution 2 might perform better than solution 1, which would require a complete table scan to select the rows with category = 1. If instead we also have a nonclustered index on category, then solution 1 will be faster, since it would only have to seek that nonclustered index to find the rows.
Without any index on Id in T1, a full scan would be required to find each T2.Id row, so there might be 10 full scans of T1 for solution 2 and 1 full scan on T1.category for solution 1; solution 1 might then be faster. But this depends on the query optimizer, and testing the real case to see the actual execution plans would be the best way to answer.
But the way to go is to implement the right model and then proceed to create the indexes needed to make the query run fast.
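As a rough sketch of the indexes discussed above (table and column names are taken from the question; the index names are made up):
-- nonclustered index on Id to support the join in solution 2
CREATE NONCLUSTERED INDEX IX_Table1_Id ON TABLE1 (Id);
-- nonclustered index on category to support the filter in solution 1
CREATE NONCLUSTERED INDEX IX_Table1_Category ON TABLE1 (category);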
Edit: adapted the answer according to the query edits.
Edit2: index coverage would be expensive, and 10 index seeks on the PK of table 1 would not cost that much.
[Notice]
This answer was given for an older version of the question, https://stackoverflow.com/revisions/65263530/7
The scenario back then was:
T2 also had a category column, and,
the second query was:
SELECT T2.col1, T2.col2
FROM TABLE2 T2
INNER JOIN TABLE1 T1 ON T1.categoryId = T2.categoryId
WHERE T2.category = 1
Assuming the only indices are the PKs, nope, Solution 2 will NOT avoid the table scan. Worse:
Solution 1
Full table scan
Solution 2
Full table scan on T2 (T2.category) and then nested loops (T2.category = T1.category)
Please, what are your goals here?
To begin with, this statement shows a lack of understanding of databases:
first 10 rows having category = 1
SQL tables represent unordered sets. There is no such thing as "first 10 rows". In the context of your question, I think you mean "the 10 rows with the lowest values of the id". However, the ordering of the table is still arbitrary from the perspective of the engine. There are situations where a clustered index could reasonably be assumed to be a "table ordering", but there is never a guarantee that:
select *
from t;
returns data in a particular ordering even with a clustered index.
Two possible execution plans for the first query -- depending on the indexing -- are:
Scanning the table (i.e. reading millions of rows) and doing the test for each row.
Scanning an index on category and just fetching the rows that are needed.
In general, (1) would be much, much slower than (2) when the scanned rows number in the millions and the returned rows are just a few. However, this may not be true if a significant proportion of all records were returned.
I interpret your question as asking whether the second query could ever be faster than the first:
SELECT T2.col1, T2.col2
FROM TABLE2 T2 INNER JOIN
TABLE1 T1
ON T1.Id = T2.Id
WHERE T1.category = 1;
The answer is "definitely faster than the scan". This is a possible if you have an index on Table1(id, category). However, the query would be better written using EXISTS:
select t2.*
from table2 t2
where exists (select 1
              from table1 t1
              where t1.id = t2.id and t1.category = 1
             );
I would expect this to be faster than the indexed version of the first query as well. Even with an index on (category), the database still has to fetch the data for the select. If the data is on one page (as the "first" statement might suggest), then the two might be quite comparable. However, it would be hard to measure the difference in performance with the correct indexing on table1.
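For reference, the composite index mentioned above could be created roughly like this (the index name is invented):
-- composite index supporting the join / EXISTS version of the query
CREATE INDEX idx_table1_id_category ON Table1 (Id, category);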
A note about clustered indexes in SQL Server. If the id is an identity primary key and there is no other clustered index, then it is automatically used as the clustered index.

Why query with "in" and "on" statement runs infinitely

I have three tables; table3 is basically the intermediate table between table1 and table2. When I execute the query statement that contains "in" and joins table1 and table3, it just keeps running and I never get the result. If I use id = 134 instead of id in (134,267,390,4234 ... ), the result comes up. I don't understand why "in" has this effect; does anyone have an idea?
Query statement:
select count(*) from table1, table3 on id=table3.table1_id where table3.table2_id = 123 and id in (134,267,390,4234) and item = 30;
table structure:
table1:
id integer primary key,
item integer
table2:
id integer,
item integer
table3:
table1_id integer,
table2_id integer
-- the DB without indexes was 0.8 TB; after the three indices it is now 2.5 TB
indices on: table1.item, table3.table1_id, table3.table2_id
env: Linux, sqlite 3.7.17
from table1, table3 is a cross join on most databases, and with the size of your data a cross join would be enormous; in SQLite3, however, the comma is just another inner join operator. From the SQLite SELECT docs:
Side note: Special handling of CROSS JOIN. There is no difference between the "INNER JOIN", "JOIN" and "," join operators. They are completely interchangeable in SQLite.
That's not your problem in this specific instance, but let's not tempt fate; always write out your joins explicitly.
select count(*)
from table1
join table3 on id=table3.table1_id
where table3.table2_id = 123
and id in (134,267,390,4234);
Since you're just counting, you don't need any data from table1 but the ID. table3 has table1_id, so there's no need to join with table1. We can do this entirely with the table3 join table.
select count(*)
from table3
where table2_id = 123
and table1_id in (134,267,390,4234);
SQLite can only use one index per table in a given query. For this to be performant on such a large data set, you need a composite index of both columns: table3(table1_id, table2_id). Presumably you don't want duplicates, so this should take the form of a unique index. That will cover queries for just table1_id and queries for both table1_id and table2_id; you should drop your table1_id index to save space and time.
create unique index table3_unique on table3(table1_id, table2_id);
The composite index will not help queries which use only table2_id, so keep your existing table2_id index.
Your query should now run lickety-split.
For more, read about the SQLite Query Optimizer.
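If you want to confirm the index is being picked up, SQLite's EXPLAIN QUERY PLAN will show it; a quick sketch against the rewritten query:
EXPLAIN QUERY PLAN
SELECT count(*)
FROM table3
WHERE table2_id = 123
  AND table1_id IN (134, 267, 390, 4234);
-- expect a SEARCH ... USING COVERING INDEX table3_unique line rather than a SCAN of table3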
A terabyte is a lot of data. While SQLite technically can handle this, it might not be the best choice. It's great for small and simple databases, but it's missing a lot of features. You should look into a more powerful database such as PostgreSQL. It is not a magic bullet, and all the same principles apply, but it is much more appropriate for data at that scale.

Improving run time of SQL - currently 61 hours

I have a complex select statement with approximately 20 left outer joins. Many of the joins are needed only to obtain data from a single column in that table (poorly designed database). The current runtime according to EXPLAIN is estimated at 61 hours (45 GB).
I have limited options due to user permissions. How can I optimise the SQL? So far I have considered:
identifying and removing unnecessary joins
writing statements to include data rather than exclude data I don't need
trying to get user permission to CREATE Table ('hell no')
trying to get access to a sandpit like space on a server to create a view ('oh hells no no no').
SELECT t1.column1, t1.column2, t2.column1, t3.column2, t4.column3
--- (etc - approximately 30 items)
, CASE WHEN t1.column2 is NULL
THEN t2.column3
ELSE t1.column2
END as Derived_Column_1
FROM TABLE1 T1
LEFT OUTER JOIN TABLE2 t2
ON t1.column1 = t2.column3
LEFT OUTER JOIN TABLE3 T3
ON T1.column5 = t3.column6
AND t1.column6 = t3.column7
LEFT OUTER JOIN TABLE4 T4
ON T2.Column4 = T4.Column8
AND T2.Column5 = '16'
--- (etc - approximately 16 other joins, some of which are only required to connect table 1 to 5, because they have no direct common fields)
--- select data that was timestamped in the last 120 days
WHERE CAST(t1.Column3 as Date) > CURRENT_DATE - 120
-- de-duplify the data by four values and use the latest entry
QUALIFY RANK() OVER (PARTITION BY t1.column1, t2.column1, t3.column2, t3.column4 ORDER BY t1.Column3 desc) = 1
The desired output is a single result set with the 30 fields plus the derived column, for data that was timestamped in the last 120 days.
I would like to remove duplicates based on four fields, but the QUALIFY RANK() OVER (PARTITION BY t1.column1, t2.column1, t3.column2, t3.column4 ORDER BY t1.Column3 desc) = 1 adds a lot of time to the run.
I think you could CREATE VOLATILE TABLE ... ON COMMIT PRESERVE ROWS to store some intermediate data. It may need some checking, but I think you would not need any special rights to do that (only the spool space quota you already have as a means to run your SELECTs).
The usual optimization technique is as follows: you take control of the execution plan by cutting your large SELECT into pieces which sequentially compute intermediate results (saving those into volatile tables) and redistribute them (by specifying the PRIMARY INDEX of the volatile tables) to take advantage of Teradata's parallelism.
Usually, you choose the columns that are used in join conditions as a primary index; you may encounter a skew, which you may solve by cutting your intermediate volatile table in two and choosing different primary indexes for the two parts. That would make your code more sophisticated, but much more optimal.
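A minimal sketch of that approach, assuming the first piece materializes TABLE1 filtered to the last 120 days (column names are taken from the question; the choice of primary index is only an illustration):
CREATE VOLATILE TABLE vt_t1 AS
(
    SELECT t1.column1, t1.column2, t1.column3, t1.column5, t1.column6
    FROM TABLE1 t1
    WHERE CAST(t1.Column3 AS DATE) > CURRENT_DATE - 120
)
WITH DATA
PRIMARY INDEX (column1)  -- join column, so the rows are redistributed for the next join
ON COMMIT PRESERVE ROWS;
The remaining joins would then be written against vt_t1, building further volatile tables in the same way if needed.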
By the way, do not let the "hours" estimate of the Teradata plan fool you; those are not the actual hours, minutes or seconds, only synthetic ones. Usually, they are pretty far from the actual query run time.

INSERT INTO SELECT CROSS JOIN Composite Primary Key

I'm performing an INSERT INTO SELECT statement in SQL Server. The situation is that there are two Primary keys of two different tables, without anything in common, that are both foreign keys of a third table, forming a composite primary key in that last table. This can usually be accomplished with a cross join - for example,
Table1.ID(PK)
Table2.Code(PK)
-- Composite PK for Table3
Table3.ID(FK)
Table3.Code(FK)
INSERT INTO Table3
SELECT ID, Code
FROM Table1
CROSS JOIN Table2
WHERE Some_conditions...
I'm getting a "Cannot insert duplicate key row" error. It will not allow Table2.Code to be repeated in Table3, since it is a unique ID, even though the primary key of Table3 is Table1.ID combined with Table2.Code. Hence, the following pairs should be recognized as different PK values in Table3 for example: {1024, PSV} and {1027, PSV}.
Is there a way to fix this, or have I designed this database incorrectly?
I have considered creating a third unique ID for Table3, but it is highly impractical in this scenario.
This will help you locate the problem:
SELECT ID, Code
FROM Table1
CROSS JOIN Table2
WHERE Some_conditions...
GROUP BY ID, Code
HAVING COUNT(*) > 1
I presume that the reason you are getting this error is that table 2 has multiple rows with the same code for the same ID.
For example, table 2 might have two or more rows of ID 1024 and code 'PSV'.
A simple solution to fix this would be to modify your code as follows:
INSERT INTO Table3
SELECT DISTINCT ID, Code
FROM Table1
CROSS JOIN Table2
WHERE Some_conditions...
SQL Server had created a unique, non-clustered index for Table3 that was preventing the INSERT INTO statement from executing. I disabled it with SQL Server Management Studio Object Explorer and it allowed me to enter the rows.
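For anyone who would rather do this in T-SQL than in Object Explorer, something along these lines should work (the index name here is invented; look up the real one in sys.indexes first):
-- find the unique nonclustered index on Table3
SELECT name, is_unique, type_desc
FROM sys.indexes
WHERE object_id = OBJECT_ID('Table3');

-- disable it (or drop it, if it should never have existed)
ALTER INDEX IX_Table3_Code ON Table3 DISABLE;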

Index suggestion for a table which has id columns

I have a table whose columns all store ids of other (huge) tables.
CREATE TABLE #mytable (
    Table1Id int,
    Table2Id int,
    Table3Id int,
    Table4Id int,
    Table5Id int
)
Now my select joins to all the tables whose ids are stored in the columns of my table.
select T1.col1, t2.Col1, T3.col1... from
#mytable MyTable inner join table1 T1 on MyTable.Table1Id = T1.Id
inner join table2 T2 on MyTable.Table2Id = T2.Id
inner join table3 T3 on MyTable.Table3Id = T3.Id
inner join table4 T4 on MyTable.Table4Id = T4.Id
inner join table5 T5 on MyTable.Table5Id = T5.Id
order by T1.Col1, T2.col1
At the moment I only have an index on Table1Id and on all the id columns of the other tables. Any suggestions to improve the performance?
You don't say which column your index is currently defined on, but based on your example query, you should create an index covering all five columns:
Table1Id, Table2Id, Table3Id, Table4Id, Table5Id
This allows the SQL engine to resolve the query just by reading the index, which should be faster than reading the index, then reading the table.
If you run queries that access only some of the columns, then you need an index for those columns as well. Let's say you run a query on Table3Id and Table4Id. Then you need to create an index on:
Table3Id, Table4Id
I can't tell from the information you provided in your question if these indexes should be unique or non unique. You would have to make that determination.
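A sketch of those suggestions (index names are invented; whether they should be unique depends on your data, as noted above):
-- covering index for the five-way join query
CREATE INDEX IX_mytable_AllIds
    ON #mytable (Table1Id, Table2Id, Table3Id, Table4Id, Table5Id);

-- narrower index for queries that only touch Table3Id and Table4Id
CREATE INDEX IX_mytable_T3T4
    ON #mytable (Table3Id, Table4Id);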
Examine #mytable
You have no search criteria on that table
no where
no order by
no group by
You are just going to get those rows in no particular order.
There is no use for any index on #mytable
The index Table1Id is not used by that query and will slow down inserts
I suspect #mytable is just an output table and the where conditions are used to populate that table.
The join will use the ID on the table to be joined.
So index ID on table1-x and index it as a PK (clustered) if you can.
If that index is fragmented then defrag.
That join should be an index seek and you can't do any better.
Verify the query plan has index seeks on the joins.
If you don't have index seeks on those joins, then post the query plan.
You could experiment with hints on the join but I suspect the query optimizer will get it right - that may be a big query but it is not a complex query.
Since SQL Server reads data in pages, if you order #mytable by the individual columns you have a better chance of the page you need already being in memory.
A PK is free IF you can insert in the order of the PK.
In that case you would put the column with the most values in the last position.
Actually, you would put the column with the tightest groupings of the PK in the last position.
And then sort by PK.
For the statement that you have put in your question, there is probably little that you can do. In fact, indexes could even hurt under some circumstances if you are in a memory limited environment.
As a first step, though, you should have indexes in the numbered tables on the id column. That is, you should be storing and then joining on the primary key of these tables (the index is automatic on a primary key).
Generally, the purpose of indexes is to prevent scanning an entire table to find a particular set of records. In this case, it looks like you want all the records anyway, so full-table scans are necessary. That limits the applicability of indexes. There is a good chance that SQL Server will turn these joins into hash joins, which is an efficient way of joining a table when you need to read all the rows.
Additional indexes might be warranted depending on where and group by clauses.