Most efficient way to maintain a 'set' in SQL Server 2008? - sql

I have ~2 million rows or so of data, each row with an artificial PK, and two Id fields (so: PK, ID1, ID2). I have a unique constraint (and index) on ID1+ID2.
I get two sorts of updates, both with a distinct ID1 per update.
100-1000 rows of all-new data (ID1 is new)
100-1000 rows of largely, but not necessarily completely overlapping data (ID1 already exists, maybe new ID1+ID2 pairs)
What's the most efficient way to maintain this 'set'? Here are the options as I see them:
[1] Delete all the rows with ID1, insert all the new rows (yikes)
[2] Query all the existing rows from the set of new data ID1+ID2, only insert the new rows
[3] Insert all the new rows, ignore inserts that trigger unique constraint violations
Any thoughts?

If you're using SQL Server 2008 (or 2008 R2), you can look at the MERGE statement, something like:
MERGE INTO MyTable mt
USING NewRows nr
ON mt.ID1 = nr.ID1 and mt.ID2 = nr.ID2
WHEN NOT MATCHED THEN
INSERT (ID1,ID2,<more columns>) VALUES (nr.ID1,nr.ID2,<other columns>);

Not all of your listed solutions are functionally equivalent, so without more knowledge about what you want or need to accomplish, it's hard to say which is most appropriate.
[1] You may lose data that you want or need to keep.
[2] Based on the table schema that you mentioned, this should be reasonable.
[3] This will only work if you perform each INSERT separately.
I'd suggest [2] based on the available info.
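For illustration, a minimal sketch of option [2], inserting only the ID1+ID2 pairs that don't already exist (the NewRows staging table and the column placeholders are assumptions, matching the MERGE example above):

INSERT INTO MyTable (ID1, ID2 /*, other columns */)
SELECT nr.ID1, nr.ID2 /*, other columns */
FROM NewRows nr
WHERE NOT EXISTS (SELECT 1
                  FROM MyTable mt
                  WHERE mt.ID1 = nr.ID1
                    AND mt.ID2 = nr.ID2);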

Related

Constraint on multiple columns with specific values

My schema has 3 columns: ID1, ID2, Status
Each of the above columns is a string.
I would like to create a constraint which is the following:
There cannot be multiple records which have the same ID1 and ID2 which are in the 'UNPROCESSED' state. It is ok if there are multiple records which have the same ID1 and ID2 which are not in the UNPROCESSED state.
Is it possible to do this in SQL Server?
Assuming you're using SQL Server 2008 or later, you can apply a filtered index:
CREATE UNIQUE INDEX UQ_Unprocessed_IDs ON UnnamedTable (ID1,ID2)
WHERE (Status='Unprocessed')
I don't believe you can do that with a constraint. You need to implement a trigger on insert/update operations. The problem with SQL Server is that triggers are 'AFTER' triggers; there's no such thing as a 'BEFORE' trigger (though there is an 'INSTEAD OF' trigger type).
Hence, you need to do all the work to perform the transaction, vet it and roll it back if the constraint fails, rather than simply checking to see if the transaction would cause the constraint to be violated.
I would do something like having a column called ProcessId (instead of Status) and assign values based on whether the ID1 and ID2 have been processed or not. To clarify, see below:
[ID1, ID2, ProcessId]
SomeId1, SomeId2, -1
SomeOtherId1, SomeOtherId2, 100304
So any unprocessed set of IDs will always have a ProcessId of -1, blocking any duplicates, and any processed set of IDs will be assigned some sort of sequential number, allowing duplicates. Make sense?
So continuing with my example above, if that record set came in again unprocessed it would have a ProcessId of -1 and cause a PK violation.
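A minimal sketch of that layout, assuming the uniqueness is enforced over (ID1, ID2, ProcessId) together (the table name, column types and constraint name are illustrative, not from the answer):

CREATE TABLE UnnamedTable (
    ID1       varchar(50) NOT NULL,
    ID2       varchar(50) NOT NULL,
    ProcessId int         NOT NULL DEFAULT (-1),  -- -1 = unprocessed
    CONSTRAINT PK_UnnamedTable PRIMARY KEY (ID1, ID2, ProcessId)
);

Two unprocessed rows with the same ID1 and ID2 both carry -1 and collide on the key, while processed rows get distinct ProcessId values and can repeat.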
You can do this using a view that is a union of two underlying tables, one for processed records, one for unprocessed records.
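For reference, a rough sketch of that two-table-plus-view layout (table names, column types and status literals are assumptions, not from the answer):

CREATE TABLE UnprocessedRecords (
    ID1 varchar(50) NOT NULL,
    ID2 varchar(50) NOT NULL,
    CONSTRAINT UQ_Unprocessed UNIQUE (ID1, ID2)   -- uniqueness only enforced here
);
CREATE TABLE ProcessedRecords (
    ID1 varchar(50) NOT NULL,
    ID2 varchar(50) NOT NULL                      -- duplicates allowed here
);
GO
CREATE VIEW AllRecords AS
    SELECT ID1, ID2, 'UNPROCESSED' AS Status FROM UnprocessedRecords
    UNION ALL
    SELECT ID1, ID2, 'PROCESSED'   AS Status FROM ProcessedRecords;
GO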
However, the view itself cannot be made updatable, and if records ever change from processed back to unprocessed, and do so very often, this will perform poorly.
It will also make your thinking about parallel processing scenarios more complex.
Both of the limitations above are because you are replacing some updates by delete+insert.
A filtered index is typically a better solution if you have at least SQL Server 2008.

SQL - renumbering a sequential column to be sequential again after deletion

I've researched and realize I have a unique situation.
First off, I am not allowed to post images yet to the board since I'm a new user, so see appropriate links below
I have multiple tables where a column (not always the identifier column) is sequentially numbered and shouldn't have any breaks in the numbering. My goal is to make sure this stays true.
Down and Dirty
We have an 'Event' table where we randomly select a percentage of the rows and insert the rows into table 'Results'. The "ID" column from the 'Results' is passed to a bunch of delete queries.
This more or less ensures that there are missing rows in several tables.
My problem:
Figuring out an SQL query that will renumber the column I specify. I prefer not to drop the column.
Example delete query:
delete ItemVoid
from ItemTicket
join ItemVoid
on ItemTicket.item_ticket_id = itemvoid.item_ticket_id
where itemticket.ID in (select ID
from results)
Example Tables Before: (image not available)
Example Tables After: (image not available)
As you can see, 2 rows were deleted from both tables based on the ID column. So now I have to figure out how to renumber the item_ticket_id and the item_void_id columns so that the higher numbers shift down to fill the missing values, and so on. Problem #2: if the item_ticket_id changes in order to stay sequential in ItemTicket, then that change has to be propagated to ItemVoid's item_ticket_id.
I appreciate any advice you can give on this.
(answering an old question as it's the first search result when I was looking this up)
(MS T-SQL)
Resequencing an ID column (not an IDENTITY one) that has gaps can be done with a simple CTE that uses ROW_NUMBER() to generate the new sequence.
The UPDATE works through the CTE 'virtual table' without any extra problems, actually updating the underlying original table.
Don't worry about the ID values clashing during the update: if you wonder what happens when IDs are assigned that already exist in other rows, it doesn't suffer from that problem; the original sequence is changed to the new sequence in one go.
WITH NewSequence AS
(
SELECT
ID,
ROW_NUMBER() OVER (ORDER BY ID) as ID_New
FROM YourTable
)
UPDATE NewSequence SET ID = ID_New;
Since you are looking for advice on this, my advice is you need to redesign this as I see a big flaw in your design.
Instead of deleting the records and then going through the hassle of renumbering the remaining records, use a bit flag that will mark the records as inactive. Then when you are querying the records, just include a WHERE clause to only include the records that are active:
SELECT *
FROM yourTable
WHERE Inactive = 0
Then you never have to worry about re-numbering the records. This also gives you the ability to go back and see the records that would have been deleted and you do not lose the history.
If you really want to delete the records and renumber them then you can perform this task the following way:
create a new table
insert your original data into the new table, using the corrected numbers
drop your old table
rename the new table to the old table's name
As you can see there would be a lot of steps involved in re-numbering the records. You are creating much more work this way when you could just perform an UPDATE of the bit flag.
You would change your DELETE query to something similar to this:
UPDATE ItemVoid
SET InActive = 1
FROM ItemVoid
JOIN ItemTicket
on ItemVoid.item_ticket_id = ItemTicket.item_ticket_id
WHERE ItemTicket.ID IN (select ID from results)
The bit flag is much easier and that would be the method that I would recommend.
The function that you are looking for is a window function. In standard SQL (supported by SQL Server, and by MySQL 8.0 and later), the function is row_number(). You use it as follows:
select row_number() over (order by <col>)
from <table>
In order to use this in your case, you would delete the rows from the table, then use a with statement to recalculate the row numbers, and then assign them using an update. For transactional integrity, you might wrap the delete and update into a single transaction.
Oracle supports similar functionality, but the syntax is a bit different. Oracle calls these functions analytic functions and they support a richer set of operations on them.
I would strongly caution you against using cursors, since these have lousy performance. Of course, this will not work on an identity column, since such a column cannot be modified.
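Putting that together, a rough T-SQL sketch of the delete-then-renumber flow described above, shown for the ItemVoid table only (table and column names are taken from the question; the transaction wrapper is the suggestion above):

BEGIN TRANSACTION;

-- 1. Delete the rows selected into Results (as in the question's delete query)
DELETE iv
FROM ItemVoid iv
JOIN ItemTicket it ON it.item_ticket_id = iv.item_ticket_id
WHERE it.ID IN (SELECT ID FROM Results);

-- 2. Renumber the remaining rows so the sequence has no gaps
WITH NewSequence AS
(
    SELECT item_void_id,
           ROW_NUMBER() OVER (ORDER BY item_void_id) AS item_void_id_new
    FROM ItemVoid
)
UPDATE NewSequence SET item_void_id = item_void_id_new;

COMMIT TRANSACTION;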

Adding new columns to a column database having billions of rows

I want to add new columns to a table which already contains billions of rows. The new columns are derived from existing columns.
For example,
new_col1 = old_col1 + old_col2
new_col2 = old_col1 / old_col2
I am trying to do this in following way -
Add new columns
ALTER TABLE table_name
ADD ( column_1 column-definition,
column_2 column-definition,
...
column_n column_definition )
Read rows one by one from the table and fill the values for new columns.
There is no primary key on the table, so I cannot refer to an individual row. To read rows one by one, I would have to do a select *, which would give a huge result set (considering billions of records).
Is there any better way to do this?
Different DBMS have different SQL dialects, it is useful to specify which you are using in the question.
In SQL Server you could use a Computed Column but this would calculate the result every time you select the data, you could flag it as persisted but it may take a while to make the change. But you can't do that if you are going to remove the old columns.
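A hedged sketch of the computed-column variant (SQL Server syntax, reusing the names from the question):

-- PERSISTED stores the values on disk instead of recalculating them on every read
ALTER TABLE table_name
ADD new_col1 AS (old_col1 + old_col2) PERSISTED,
    new_col2 AS (old_col1 / old_col2) PERSISTED;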
Alternatively create the new column allowing nulls and then update it in batches
UPDATE TOP (1000) table_name SET new_col1 = old_col1 + old_col2 WHERE new_col1 IS NULL
Again the query is for SQL Server, but there will be alternatives for your DBMS.
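A minimal sketch of looping that batched UPDATE until no unfilled rows remain (SQL Server; the 1000-row batch size is arbitrary):

WHILE 1 = 1
BEGIN
    -- Fill the derived columns for the next 1000 unprocessed rows
    UPDATE TOP (1000) table_name
    SET new_col1 = old_col1 + old_col2,
        new_col2 = old_col1 / old_col2
    WHERE new_col1 IS NULL;

    -- Stop once a batch updates nothing
    IF @@ROWCOUNT = 0 BREAK;
END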
Also read Mr Hooper's comment about adding an index to the new column to make sure that the performance of the UPDATE doesn't get worse as more data is added. The update is a read and write operation; the index will speed up the reads and slightly delay the writes (maintaining the index), but it should be worthwhile.
I think Mr Diver's method would be fine if you also added an index on one of your new columns; otherwise, as the job progresses, it will have to do more and more scanning to find the rows it hasn't already updated. Adding an index will mean it doesn't have to do that. A possible drawback is that the index's selectivity will be frightful when the column is first created, but I don't think that would be a problem as you only care about NULL or NOT NULL. You could drop the index when the update is complete.
Use a stored procedure that updates, say, 100 rows at a time, and add it as a job that runs every 30 seconds or so.

ignore insert of rows that violate duplicate key index

I perform an insert as follows:
INSERT INTO foo (a,b,c)
SELECT x,y,z
FROM fubar
WHERE ...
However, if some of the rows that are being inserted violate the duplicate key index on foo, I want the database to ignore those rows, and not insert them and continue inserting the other rows.
The DB in question is Informix 11.5. Currently all that happens is that the DB is throwing an exception. If I try to handle the exception with:
ON EXCEPTION IN (-239)
END EXCEPTION WITH RESUME;
... it does not help because after the exception is caught, the entire insert is skipped.
I don't think informix supports INSERT IGNORE, or INSERT ... ON DUPLICATE KEY..., but feel free to correct me if I am wrong.
Use an IF statement and the EXISTS function to check for existing records. Or you can probably include that EXISTS check in the WHERE clause, like below:
INSERT INTO foo (a,b,c)
SELECT x,y,z
FROM fubar
WHERE (NOT EXISTS(SELECT a FROM foo WHERE ...))
Depending on whether you want to know all about all the errors (typically as a result of a data loading operation), consider using violations tables.
START VIOLATIONS TABLE FOR foo;
This will create a pair of tables foo_vio and foo_dia to contain information about rows that violate the integrity constraints on the table.
When you've had enough, you use:
STOP VIOLATIONS TABLE FOR foo;
You can clean up the diagnostic tables at your leisure. There are bells and whistles on the command to control which table is used, etc. (I should perhaps note that this assumes you are using IDS (IBM Informix Dynamic Server) and not, say, Informix SE or Informix OnLine.)
Violations tables are a heavy-duty option - suitable for loads and the like. They are not ordinarily used to protect run-of-the-mill SQL. For that, the protected INSERT (with SELECT and WHERE NOT EXISTS) is fairly effective - it requires the data to be in a table already, but temp tables are easy to create.
There are a couple of other options to consider.
From version 11.50 onwards, Informix supports the MERGE statement. This could be used to insert rows from fubar where the corresponding row in foo does not exist, and to update the rows in foo with the values from fubar where the corresponding row already exists in foo (the duplicate key problem).
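A rough sketch of what that could look like (assuming Informix 11.50+ MERGE syntax; the join column and column mappings are placeholders based on the question's foo/fubar names):

MERGE INTO foo
USING fubar
ON foo.a = fubar.x
WHEN MATCHED THEN
    UPDATE SET b = fubar.y, c = fubar.z
WHEN NOT MATCHED THEN
    INSERT (a, b, c) VALUES (fubar.x, fubar.y, fubar.z);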
Another way of looking at it is:
SELECT fubar.*
FROM fubar JOIN foo ON fubar.pk = foo.pk
INTO TEMP duplicate_entries;
DELETE FROM fubar WHERE pk IN (SELECT pk FROM duplicate_entries);
INSERT INTO foo SELECT * FROM fubar;
...process duplicate_entries
DROP TABLE duplicate_entries;
This cleans the source table (fubar) of the duplicate entries (assuming it is only the primary key that is duplicated) before trying to insert the data. The duplicate_entries table contains the rows in fubar with the duplicate keys - the ones that need special processing in some shape or form. Or you can simply delete and ignore those rows, though in my experience, that is seldom a good idea.
GROUP BY may be your friend in this, to prevent duplicate rows from being inserted. Use GROUP BY in your SELECT; this will collapse the duplicates into a single row. The only thing I would do is test to see if there are any performance issues. Also, make sure you include all of the columns you want to be unique in the GROUP BY, or you could exclude rows that are not duplicates.
INSERT INTO FOO(Name, Age, Gadget, Price)
select Name, Age, Gadget, Price
from foobar
group by Name, Age, Gadget, Price
Where Name, Age, Gadget, Price form the primary key index (or unique key index).
The other possibility is to write the duplicated rows to an error table without the index and then resolve the duplicates before inserting them into the new table. Just need to add a having count(*) > 1 clause to the above.
I don't know about Informix, but with SQL Server, you can create an index, make it unique and then set a property to have it ignore duplicate keys so no error gets thrown on a duplicate. It's just ignored. Perhaps Informix has something similar.
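In SQL Server that looks something like the following (a sketch; the index name and columns are made up, not from the question):

-- Rows with duplicate keys are discarded with a warning instead of failing the whole INSERT
CREATE UNIQUE INDEX UQ_foo_abc
    ON foo (a, b, c)
    WITH (IGNORE_DUP_KEY = ON);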

Improving performance of Sql Delete

We have a query to remove some rows from the table based on an id field (primary key). It is a pretty straightforward query:
delete from OUR_TABLE where ID in (123, 345, ...)
The problem is that the number of IDs can be huge (e.g. 70k), so the query takes a long time. Is there any way to optimize this?
(We are using sybase - if that matters).
There are two ways to make statements like this one perform:
Create a new table and copy all but the rows to delete. Swap the tables afterwards (alter table name ...); see the sketch after this list. I suggest giving it a try even if it sounds stupid. Some databases are much faster at copying than at deleting.
Partition your tables. Create N tables and use a view to join them into one. Sort the rows into different tables grouped by the delete criterion. The idea is to drop a whole table instead of deleting individual rows.
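A rough sketch of the first approach in Sybase/T-SQL style (the object names, the #ids_to_delete temp table and sp_rename are assumptions; SELECT INTO may require the 'select into/bulkcopy' database option, and indexes and permissions must be recreated on the new table):

-- Copy only the rows we want to keep
SELECT *
INTO   OUR_TABLE_new
FROM   OUR_TABLE
WHERE  ID NOT IN (SELECT ID FROM #ids_to_delete)

-- Swap: drop the old table and rename the copy
DROP TABLE OUR_TABLE
EXEC sp_rename 'OUR_TABLE_new', 'OUR_TABLE'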
Consider running this in batches. A loop running 1000 records at a time may be much faster than one query that does everything and in addition will not keep the table locked out to other users for as long at a stretch.
If you have cascade delete (and lots of foreign key tables affected) or triggers involved, you may need to run in even smaller batches. You'll have to experiment to see which is the best number for your situation. I've had tables where I had to delete in batches of 100 and others where 50000 worked (fortunate in that case as I was deleting a million records).
But in any event I would put my key values that I intend to delete into a temp table and delete from there.
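A rough Sybase-style sketch of that batched delete via a temp table (the names and the 1000-row batch size are assumptions):

-- #ids_to_delete holds the keys to remove
SET ROWCOUNT 1000             -- limit each DELETE to 1000 rows
WHILE 1 = 1
BEGIN
    DELETE OUR_TABLE
    FROM   OUR_TABLE, #ids_to_delete
    WHERE  OUR_TABLE.ID = #ids_to_delete.ID

    IF @@rowcount = 0 BREAK   -- nothing left to delete
END
SET ROWCOUNT 0                -- reset the limit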
I'm wondering if parsing an IN clause with 70K items in it is a problem. Have you tried a temp table with a join instead?
Can Sybase handle 70K arguments in an IN clause? All databases I have worked with have some limit on the number of arguments for an IN clause. For example, Oracle has a limit of around 1000.
Can you use a subselect instead of the IN list? That will shorten the SQL. Maybe that could help for such a big number of values in the IN clause. Something like this:
DELETE FROM OUR_TABLE WHERE ID IN
(SELECT ID FROM somewhere WHERE some_condition)
Deleting a large number of records can be sped up with some interventions in the database, if the database model permits. Here are some strategies:
1. You can speed things up by dropping indexes, deleting the records and recreating the indexes again. This eliminates rebalancing index trees while deleting records:
drop all indexes on the table
delete records
recreate indexes
2. If you have lots of relations to this table, try disabling constraints if you are absolutely sure that the delete command will not break any integrity constraint. The delete will go much faster because the database won't be checking integrity. Enable the constraints after the delete:
disable integrity constraints, disable check constraints
delete records
enable constraints
3. Disable triggers on the table, if you have any and if your business rules allow that. Delete the records, then enable the triggers.
4. Last, do as others suggested: make a copy of the table that contains the rows that are not to be deleted, then drop the original, rename the copy and recreate integrity constraints, if there are any.
I would try a combination of 1, 2 and 3. If that does not work, then 4. If everything is slow, I would look for a bigger box - more memory, faster disks.
Find out what is using up the performance!
In many cases you might use one of the solutions provided. But there might be others (these are based on Oracle knowledge, so things will be different on other databases; edit: just saw that you mentioned Sybase):
Do you have foreign keys on that table? Make sure the referring ids are indexed.
Do you have indexes on that table? It might be that dropping them before the delete and recreating them afterwards is faster.
Check the execution plan. Is it using an index where a full table scan might be faster? Or the other way round? HINTS might help.
Instead of a select into new_table as suggested above, a create table as select might be even faster.
But remember: Find out what is using up the performance first.
When you are using DDL statements make sure you understand and accept the consequences it might have on transactions and backups.
Try sorting the IDs you are passing into the IN clause in the same order as the table or index is stored in. You may then get more hits on the disk cache.
Putting the IDs to be deleted into a temp table that has the IDs sorted in the same order as the main table may let the database do a simple scan over the main table.
You could try using more than one connection and splitting the work over the connections so as to use all the CPUs on the database server; however, think about what locks will be taken out first.
I also think that the temp table is likely the best solution.
If you were to do a "delete from .. where ID in (select id from ...)" it can still be slow with large queries, though. I thus suggest that you delete using a join - many people don't know about that functionality.
So, given this example table:
-- set up tables for this example
if exists (select id from sysobjects where name = 'OurTable' and type = 'U')
drop table OurTable
go
create table OurTable (ID integer primary key not null)
go
insert into OurTable (ID) values (1)
insert into OurTable (ID) values (2)
insert into OurTable (ID) values (3)
insert into OurTable (ID) values (4)
go
We can then write our delete code as follows:
create table #IDsToDelete (ID integer not null)
go
insert into #IDsToDelete (ID) values (2)
insert into #IDsToDelete (ID) values (3)
go
-- ... etc ...
-- Now do the delete - notice that we aren't using 'from'
-- in the usual place for this delete
delete OurTable from #IDsToDelete
where OurTable.ID = #IDsToDelete.ID
go
drop table #IDsToDelete
go
-- This returns only items 1 and 4
select * from OurTable order by ID
go
Does our_table have a reference on delete cascade?