SQL Server unique index on nvarchar issue

I have a SQL Server table with a nvarchar(50) column. This column must be unique and can't be the PK, since another column is the PK. So I have set a non-clustered unique index on this column.
While a large number of insert statements runs in a serializable transaction, I want to perform select queries based on this column only, in a different transaction. But these inserts seem to lock the table. If I change the datatype of the unique column to bigint, for example, no locking occurs.
Why isn't nvarchar working, whereas bigint does? How can I achieve the same, using nvarchar(50) as the datatype?

After all, mystery solved! Rather stupid situation, I guess.
The problem was in the select statement. The where clause was missing the quotes, but by a devilish coincidence the existing data were all numbers, so the select wasn't failing; it just wasn't executing until the inserts committed. When the first alphanumeric data were inserted, the select statement began failing with 'Error converting data type nvarchar to numeric'.
e.g.
Instead of
SELECT [my_nvarchar_column]
FROM [dbo].[my_table]
WHERE [my_nvarchar_column] = '12345'
the select statement was
SELECT [my_nvarchar_column]
FROM [dbo].[my_table]
WHERE [my_nvarchar_column] = 12345
I guess an implicit cast was performed and the unique index was not being used, which resulted in the block.
Fixed the statement and everything works as expected now.
Thanks everyone for their help, and sorry for the rather stupid issue!
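For completeness, a hedged sketch of keeping the parameter explicitly typed as nvarchar (reusing the table and column names from the example above), so the implicit conversion cannot sneak back in:
EXEC sp_executesql
    N'SELECT [my_nvarchar_column]
      FROM [dbo].[my_table]
      WHERE [my_nvarchar_column] = @val',
    N'@val nvarchar(50)',
    @val = N'12345';
-- Because @val is declared as nvarchar(50), the comparison happens on the
-- column's own type and the unique index can be used for a seek.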

First, you could change the PK to a non-clustered index, and then create a clustered index on this field. Of course, that may be a bad idea based on your usage, or simply not help.
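A rough T-SQL sketch of that idea, assuming the PK column is called [id] and reusing the table/column names from the question (adjust to your real schema):
-- Assumes the existing PK/indexes have been dropped first; names are illustrative.
ALTER TABLE [dbo].[my_table]
    ADD CONSTRAINT [PK_my_table] PRIMARY KEY NONCLUSTERED ([id]);

CREATE UNIQUE CLUSTERED INDEX [UX_my_table_my_nvarchar_column]
    ON [dbo].[my_table] ([my_nvarchar_column]);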
You might have a use case for a covering index, see previous question re: covering index
You might be able to change your "other queries" to non-blocking by changing the isolation level of those queries.
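For example (a hedged sketch; the database name is a placeholder), row versioning lets readers avoid being blocked by writers:
-- Option 1: turn on read-committed snapshot for the whole database
-- (needs a moment with no other active connections).
ALTER DATABASE [my_database] SET READ_COMMITTED_SNAPSHOT ON;

-- Option 2: use snapshot isolation just for the reading session
-- (requires ALLOW_SNAPSHOT_ISOLATION to be ON for the database).
SET TRANSACTION ISOLATION LEVEL SNAPSHOT;
SELECT [my_nvarchar_column]
FROM [dbo].[my_table]
WHERE [my_nvarchar_column] = N'12345';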
It is relatively uncommon to genuinely need to insert a large number of rows in a single transaction. You may be able to simply not use a transaction, or split it up into a smaller set of transactions to avoid locking large sections of the table. E.g., you can insert the records into a pending table (that is not otherwise used in normal activity) in a transaction, then migrate these records in smaller transactions to the main table if real-time posting to the main table is not required.
ADDED
Perhaps the most obvious question: are you sure you have to use a serializable transaction to insert a large number of records? Serializable transactions are rarely necessary outside of financial processing, and they impose a high concurrency cost compared to the other isolation levels.
ADDED
Based on your comment about "all or none", you are describing atomicity, not serializability. I.e., you might be able to use a different isolation level for your large insert transaction, and still get atomicity.
Second thing, I notice you specify a large number of insert statements. This sounds like you should be able to push these inserts into a pending/staging table, then perform a single insert or batches of inserts from the staging table into the production table. Yes, it is more work, but you may just have an existing problem that requires the extra effort.
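A hedged sketch of that staging-table idea, reusing the names from the question (the staging table name and the [id] column are assumptions):
-- Staging table mirrors the columns you need to load.
CREATE TABLE [dbo].[my_table_staging]
(
    [id] bigint NOT NULL,
    [my_nvarchar_column] nvarchar(50) NOT NULL
);

-- Bulk-load the staging table first (no serializable transaction needed),
-- then move the rows across in one short transaction or in smaller batches.
BEGIN TRANSACTION;
INSERT INTO [dbo].[my_table] ([id], [my_nvarchar_column])
SELECT s.[id], s.[my_nvarchar_column]
FROM [dbo].[my_table_staging] AS s;
COMMIT TRANSACTION;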

You may want to add the NOLOCK hint (a.k.a. READUNCOMMITTED) to your query. It will allow you to perform a "dirty read" of the data that has already been inserted.
e.g.
SELECT
[my_nvarchar_column]
FROM
[dbo].[locked_table] WITH (NOLOCK)
Take a look at a better explanation here:
http://www.mssqltips.com/sqlservertip/2470/understanding-the-sql-server-nolock-hint/
And the READUNCOMMITTED section here:
http://technet.microsoft.com/en-us/library/ms187373.aspx

Related

Delete where Date = DD.MM.YYYY on table with 20m records

At the moment I am using the statement:
delete
from tbl_name
where trunc(tbl_name.Timestamp) = to_date('06.09.2013','DD/MM/YYYY');
But this takes very very long.
Is there a way to speed things up?
Thank you
First, this is certainly a big table; otherwise I see no reason why a statement like yours would take very long.
Then there are two possibilities:
1) The delete statement affects many, many records.
Then it's only natural for the statement to take long. The whole table will have to be scanned (full table scan). You could only speed this up by parallelizing the statement:
delete /*+parallel(tbl_name,4)*/ from tbl_name ...
2) The delete statement affects a rather small percentage of the records in the table.
Then it would be advisable for Oracle to use an index. As you are asking only for the date portion of the timestamp column, you would create a function-based index:
create index index_name on tbl_name( trunc(the_timestamp) );
Once there is that index available on the table, Oracle can use it to find the desired records on selects, updates and deletes based on trunc(the_timestamp).
Use a BETWEEN range on the right-hand side instead of applying trunc() on the left-hand side.
delete from tbl_name
where tbl_name.Timestamp between to_date('06.09.2013 00:00:00','DD/MM/YYYY hh24:mi:ss')
and to_date('06.09.2013 23:59:59','DD/MM/YYYY hh24:mi:ss');
Since you are referring to the tbl_name.Timestamp column, I am sure there must also be a primary key associated with the table. If the primary key is a numeric value, you could use that index in your delete statement, something like:
delete from tbl_name where tbl_name.ID between LOWER_ID_VALUE AND UPPER_ID_VALUE
Note: The LOWER_ID_VALUE and UPPER_ID_VALUE are inclusive in nature, meaning records of those two IDs will also be deleted.
ppeterka 66's idea of breaking down the delete using rownum isn't feasible. Take a look at http://asktom.oracle.com/pls/apex/f?p=100:11:0::::P11_QUESTION_ID:2345591157689
Assuming you are running Oracle, when the DELETE command is issued, the deleted rows are stored in the rollback segments, so, if required, changes can be undone. So there is an image of the rows in rollback which are currently not present in the table. Now all the rollback blocks are written to the redo log files too. So you have the data blocks with the table (without the deleted rows, of course) and the rollback blocks with the old image both producing redo, which accounts for additional archive logs.
Hence, it's better to do it in a single step rather than breaking it down into multiple transactions.
You could use a better condition instead of the trunc, especially if you have an index on the tbl_name.Timestamp field:
tbl_name.Timestamp >= to_date('06.09.2013','DD/MM/YYYY') --inclusive
AND tbl_name.Timestamp < to_date('07.09.2013','DD/MM/YYYY') --exclusive
I assumed you don't have a function-based index on trunc(tbl_name.Timestamp); if that's the case, the range condition should provide better performance.
Based on my previous experience (which is not backed up by hard facts or theory; if anything, the facts are against this approach): you could break it up into smaller batches
delete
from tbl_name
where id in (select id
             from tbl_name
             where trunc(tbl_name.Timestamp) = to_date('06.09.2013','DD.MM.YYYY')
               and rownum < 10000); -- tune this to your liking
And repeat this. Don't forget to commit in between... This helped me; we were using SNAPSHOT isolation level.
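A hedged PL/SQL sketch of the "delete a batch, commit, repeat" loop described above (the batch size and date are illustrative, and the same caveats about extra redo/undo apply):
BEGIN
  LOOP
    DELETE FROM tbl_name
    WHERE trunc(tbl_name.Timestamp) = to_date('06.09.2013', 'DD.MM.YYYY')
      AND rownum < 10000;          -- tune the batch size
    EXIT WHEN SQL%ROWCOUNT = 0;    -- stop when nothing is left to delete
    COMMIT;
  END LOOP;
  COMMIT;
END;
/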

Performance of single table vs. joined dual structure

This is not a question about using another tool. It is not a question about using a different data structure. It is a question about WHY I see what I see -- please read to the end before answering. Thank you.
THE STORY
I have one single table which has one rule: records are not deleted. Instead, a record is marked as not active (there is a field for that), and in that case all fields (except the identifiers and this isActive field) are considered irrelevant.
More on identifiers -- there are two fields:
id -- int, primary key, clustered
name -- unique, varchar, external index
How the update is done, for example (I use C#/Linq/MSSQL2005): I fetch the record based on name, then change the required fields and commit the changes, so the update is executed (the UPDATE uses id, not name).
However, there is a problem with storage. So why not break this table into a dual structure -- a "header" table (id, name, isActive) and a data table (id, rest of the fields)? A rough sketch of both layouts follows below. In case of a problem with storage we can delete all records from the data table for real (for isActive=false).
edit (by Shimmy): header+data are not retrieved by LINQ with join. Data records are loaded on demand (and this always occurs because of the code).
comment (by poster): AFAIR there is no join, so this is irrelevant. Data for headers were loaded manually. See below.
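For reference, a hedged sketch of the two layouts being compared; the payload columns are purely illustrative, only id/name/isActive come from the description above:
-- Single ("mono") table: identifiers, flag and data fields together.
CREATE TABLE item_mono (
    id       int IDENTITY PRIMARY KEY CLUSTERED,
    name     varchar(100) NOT NULL UNIQUE,      -- the external (non-clustered) index
    isActive bit NOT NULL,
    payload1 varchar(200) NULL,                 -- illustrative "data" fields
    payload2 varchar(200) NULL
);

-- Dual structure: "header" table plus a 1:1 "data" table.
CREATE TABLE item_header (
    id       int IDENTITY PRIMARY KEY CLUSTERED,
    name     varchar(100) NOT NULL UNIQUE,
    isActive bit NOT NULL
);
CREATE TABLE item_data (
    id       int PRIMARY KEY CLUSTERED REFERENCES item_header (id),
    payload1 varchar(200) NULL,
    payload2 varchar(200) NULL
);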
Performance -- (my) Theory
Now, what about performance? Which one will be faster? Let's say you have 10000 records in each table (single, header, data) and you update them one by one (in all 3 tables) -- the isActive field and some field from the "data" fields.
My calculation was/is:
mono table -- search using external index, then jumping into the structure, fetching all the data, update using primary key.
dual tables -- search using external index, jumping into the header table, fetching all the data, search using primary key on data table (no jumping here, it is clustered index), fetching all the data, update both tables using primary keys.
So, for me mono structure should be faster, because in dual case I have the same operations plus some extras.
The results
No matter what I do -- update, select, insert -- the dual structure is either slightly better (speed-wise) or up to 30% faster. And now I am all puzzled -- I would understand it if I were inserting/updating/selecting only header records, but in every case the data records are used as well.
The question -- why/how dual structure can be faster?
I think this all boils down to how much data is being fetched, inserted, and updated.
SELECT case - in the dual-table configuration you're fetching less data. Database runtime is heavily dominated by I/O time, so having the "header" fields replicated on each row in the single-table configuration means you have to read that same data over and over again. In the dual-table configuration you only read the header data once.
INSERT case - similar to the above, but related to writing the data instead of reading it.
UPDATE case - your code updates the "isActive" field, which if I've read it correctly is part of the "header" fields. In the single-table configuration you're forcing many rows to be updated for each "isActive" change. In the dual-table configuration you're only updating a single header row for each "isActive" change.
I think this is a case of premature optimization. I get the feeling that you understood that according to data normalization rules the dual-table configuration was "better" - but because the single-table case seemed like it would provide better performance you wanted to go with that design. Thankfully you took the time to test what would happen and found that observed performance did not match your expectations. GOOD JOB! I wish more people would take the time to test things out like this. I think the lesson to learn here is that data normalization is a Good Thing.
Remember, the best time to optimize something is NEVER! The second-best time to optimize things is when you have an observed performance problem. The worst time to optimize is during analysis.
I hope this helps.
Assumption: SQL Server as the database.
SQL Server tends to perform better on narrow tables than on wide ones, though this might not be true for something such as a mainframe.
This really points to normalizing until you decide NOT to for performance reasons, and in this case the assumption that de-normalized tables would be more efficient is incorrect. Normalized structures can make better use of resources than de-normalized ones in this environment. I suspect (with no citable basis for this) that the resources (hardware, multiprocessors, threading, etc.) make the normalized structure faster because more work gets done at the same time.
Have you looked at the two query plans? That often gives it away.
As for speculation: the size of a row in a table affects how fast you can scan it. Smaller rows mean more rows fit into a data page. The brunt of a query is usually in the I/O time, so using two smaller tables greatly reduces the amount of data you have to sift through in the indexes.
Also, the locks are more granular -- the first update could write to table1, and then the second update could write to table1 while you're finishing up table2.

MySQL SELECT statement using Regex to recognise existing data

My web application parses data from an uploaded file and inserts it into a database table. Due to the nature of the input data (bank transaction data), duplicate data can exist from one upload to another. At the moment I'm using hideously inefficient code to check for the existence of duplicates by loading all rows within the date range from the DB into memory, and iterating over them and comparing each with the uploaded file data.
Needless to say, this can become very slow as the data set size increases.
So, I'm looking to replace this with a SQL query (against a MySQL database) which checks for the existence of duplicate data, e.g.
SELECT count(*) FROM transactions WHERE desc = ? AND dated_on = ? AND amount = ?
This works fine, but my real-world case is a little bit more complicated. The description of a transaction in the input data can sometimes contain erroneous punctuation (e.g. "BANK 12323 DESCRIPTION" can often be represented as "BANK.12323.DESCRIPTION") so our existing (in memory) matching logic performs a little cleaning on this description before we do a comparison.
Whilst this works in memory, my question is can this cleaning be done in a SQL statement so I can move this matching logic to the database, something like:
SELECT count(*) FROM transactions WHERE CLEAN_ME(desc) = ? AND dated_on = ? AND amount = ?
Where CLEAN_ME is a proc which strips the field of the erroneous data.
Obviously the cleanest (no pun intended!) solution would be to store the already cleaned data in the database (either in the same column, or in a separate column), but before I resort to that I thought I'd try and find out whether there's a cleverer way around this.
Thanks a lot
can this cleaning be done in a SQL statement
Yes, you can write a stored procedure to do it in the database layer:
mysql> CREATE FUNCTION clean_me (s VARCHAR(255))
-> RETURNS VARCHAR(255) DETERMINISTIC
-> RETURN REPLACE(s, '.', ' ');
mysql> SELECT clean_me('BANK.12323.DESCRIPTION');
BANK 12323 DESCRIPTION
This will perform very poorly across a large table though.
Obviously the cleanest (no pun intended!) solution would be to store the already cleaned data in the database (either in the same column, or in a separate column), but before I resort to that I thought I'd try and find out whether there's a cleverer way around this.
No, as far as databases are concerned the cleanest way is always the cleverest way (as long as performance isn't awful).
Do that, and add indexes to the columns you're doing bulk compares on, to improve performance. If it's actually intrinsic to the type of data that desc/dated-on/amount are always unique, then express that in the schema by making it a UNIQUE index constraint.
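A hedged sketch of that constraint, using the column names from the question (note that desc is a reserved word in MySQL, hence the backticks; the index name is made up):
ALTER TABLE transactions
    ADD UNIQUE KEY uq_transactions (`desc`, dated_on, amount);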
The easiest way to do that is to add a unique index on the appropriate columns and to use ON DUPLICATE KEY UPDATE. I would further recommend transforming the file into a CSV and loading it into a temporary table to get the most out of MySQL's built-in functions, which are surely faster than anything you could write yourself -- if you consider that you would otherwise have to pull the data into your own application, while MySQL does everything in place.
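A hedged sketch of that load-then-merge approach; the file path, the staging table name and the column list are assumptions:
-- Stage the upload with the same structure (and unique key) as the target.
CREATE TEMPORARY TABLE transactions_stage LIKE transactions;

LOAD DATA LOCAL INFILE '/tmp/upload.csv'
INTO TABLE transactions_stage
FIELDS TERMINATED BY ',' ENCLOSED BY '"'
LINES TERMINATED BY '\n'
(`desc`, dated_on, amount);

-- Insert only the rows that are not already present; duplicates hit the
-- unique key and are turned into harmless no-op updates.
INSERT INTO transactions (`desc`, dated_on, amount)
SELECT `desc`, dated_on, amount
FROM transactions_stage
ON DUPLICATE KEY UPDATE amount = VALUES(amount);  -- for exact duplicates this changes nothing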
The cleanest way is indeed to make sure only correct data is in the database.
In this example the "BANK.12323.DESCRIPTION" would be returned by:
SELECT count(*) FROM transactions
WHERE desc LIKE 'BANK%12323%DESCRIPTION' AND dated_on = ? AND amount = ?
But this might impose performance issues when you have a lot of data in the table.
Another way that you could do it is as follows:
Clean the description before inserting.
Create a primary key for the table that is a combination of the columns that uniquely identify the entry. It sounds like that might be the cleaned description, date and amount.
Use either the 'replace' or the 'on duplicate key' syntax, whichever is more appropriate. 'replace' actually replaces the existing row in the db with the updated one when a unique key conflict occurs, e.g.:
REPLACE INTO transactions (desc, dated_on, amount) values (?,?,?)
'on duplicate key' allows you to specify which columns to update on a duplicate key error:
INSERT INTO transactions (desc, dated_on, amount) values (?,?,?)
ON DUPLICATE KEY UPDATE amount = amount
By using the multi-column primary key, you will gain a lot of performance since primary key lookups are usually quite fast.
If you prefer to keep your existing primary key, you could also create a unique index on those three columns.
Whichever way you choose, I would recommend cleaning the description before going into the db, even if you also store the original description and just use the cleaned one for indexing.

Is it safe to use MS SQL's WITH (NOLOCK) option for select statements and insert statements if

Is it safe to use MS SQL's WITH (NOLOCK) option for select statements and insert statements if you never modify a row, but only insert or delete rows?
I.e. you never do an UPDATE to any of the rows.
If you're asking whether or not you'll get data that may no longer be accurate, then it depends on your queries. For example, if you do something like:
SELECT
my_id,
my_date
FROM
My_Table
WHERE
my_date >= '2008-01-01'
at the same time that a row is being inserted with a date on or after 2008-01-01 then you may not get that new row. This can also affect queries which generate aggregates.
If you are just mimicking updates through a delete/insert then you also may get an "old" version of the data.
Not in general. (i.e. UPDATE is not the only locking issue)
If you are inserting (or deleting) records and a select could potentially specify records which would be in that set, then yes, NOLOCK will give you a dirty read which may or may not include those records.
If the inserts or deletes would never potentially be selected (for instance, the data read is always yesterday's data, whereas today's data coming in or being manipulated is never read), then yes, it is "safe".
If you are never doing any UPDATEs, then why does locking give you a problem in the first place?
If there are referential integrity or trigger-firing issues at play, then NOLOCK is just going to turn those errors into mysterious inconsistencies.
Well, 'safe' is a very generic term; it all depends on the context of your application and its use. But there is always a chance of skipping and double-counting previously committed rows when the NOLOCK hint is used.
Anyways, have a read of this:
http://blogs.msdn.com/b/sqlcat/archive/2007/02/01/previously-committed-rows-might-be-missed-if-nolock-hint-is-used.aspx
Not sure how SELECT statements could conflict if you're limiting yourself to INSERTs and DELETEs. INSERT is problematic because conflicting primary keys may have been inserted during your query, for instance. Both INSERTs and DELETEs expose you to the risk that the conditions expressed in your WHERE clause, JOINs, etc. become invalid during your statement's execution.

What is the best way to implement soft deletion?

Working on a project at the moment and we have to implement soft deletion for the majority of users (user roles). We decided to add an is_deleted='0' field on each table in the database and set it to '1' if particular user roles hit a delete button on a specific record.
For future maintenance now, each SELECT query will need to ensure they do not include records where is_deleted='1'.
Is there a better solution for implementing soft deletion?
Update: I should also note that we have an Audit database that tracks changes (field, old value, new value, time, user, ip) to all tables/fields within the Application database.
I would lean towards a deleted_at column that contains the datetime of when the deletion took place. Then you get a little bit of free metadata about the deletion. For your SELECT just get rows WHERE deleted_at IS NULL
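A minimal sketch of that approach (SQL Server syntax assumed; the table and column names are illustrative):
ALTER TABLE users ADD deleted_at datetime NULL;

-- "Deleting" a record just stamps it:
UPDATE users SET deleted_at = GETDATE() WHERE id = 42;

-- Normal reads only see live rows:
SELECT id, login
FROM users
WHERE deleted_at IS NULL;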
You could perform all of your queries against a view that contains the WHERE IS_DELETED='0' clause.
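For instance (a hedged sketch; the view and column names are made up):
CREATE VIEW active_records AS
SELECT id, name, created_at
FROM my_table
WHERE is_deleted = '0';

-- Applications then query the view instead of the base table:
SELECT id, name FROM active_records WHERE name = 'example';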
Having is_deleted column is a reasonably good approach.
If it is in Oracle, to further increase performance I'd recommend partitioning the table by creating a list partition on the is_deleted column.
Then deleted and non-deleted rows will physically be in different partitions, though for you it'll be transparent.
As a result, if you type a query like
SELECT * FROM table_name WHERE is_deleted = 1
then Oracle will perform the 'partition pruning' and only look into the appropriate partition. Internally a partition is a different table, but it is transparent for you as a user: you'll be able to select across the entire table no matter if it is partitioned or not. But Oracle will be able to query ONLY the partition it needs. For example, let's assume you have 1000 rows with is_deleted = 0 and 100000 rows with is_deleted = 1, and you partition the table on is_deleted. Now if you include condition
WHERE ... AND IS_DELETED=0
then Oracle will ONLY scan the partition with 1000 rows. If the table weren't partitioned, it would have to scan 101000 rows (both partitions).
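A hedged Oracle sketch of that list partitioning (table and column names are illustrative):
CREATE TABLE table_name (
    id         NUMBER PRIMARY KEY,
    payload    VARCHAR2(200),
    is_deleted NUMBER(1) NOT NULL
)
PARTITION BY LIST (is_deleted) (
    PARTITION p_active  VALUES (0),
    PARTITION p_deleted VALUES (1)
);

-- A query such as the following only touches the p_active partition:
SELECT * FROM table_name WHERE is_deleted = 0;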
The best response, sadly, depends on what you're trying to accomplish with your soft deletions and the database you are implementing this within.
In SQL Server, the best solution would be to use a deleted_on/deleted_at column with a type of SMALLDATETIME or DATETIME (depending on the necessary granularity) and to make that column nullable. In SQL Server, the row header data contains a NULL bitmask for each of the columns in the table so it's marginally faster to perform an IS NULL or IS NOT NULL than it is to check the value stored in a column.
If you have a large volume of data, you will want to look into partitioning your data, either through the database itself or through two separate tables (e.g. Products and ProductHistory) or through an indexed view.
I typically avoid flag fields like is_deleted, is_archive, etc because they only carry one piece of meaning. A nullable deleted_at, archived_at field provides an additional level of meaning to yourself and to whoever inherits your application. And I avoid bitmask fields like the plague since they require an understanding of how the bitmask was built in order to grasp any meaning.
If the table is large and performance is an issue, you can always move 'deleted' records to another table, which has additional info like time of deletion, who deleted the record, etc.
That way you don't have to add another column to your primary table.
That depends on what information you need and what workflows you want to support.
Do you want to be able to:
know what information was there (before it was deleted)?
know when it was deleted?
know who deleted it?
know in what capacity they were acting when they deleted it?
be able to un-delete the record?
be able to tell when it was un-deleted?
etc.
If the record was deleted and un-deleted four times, is it sufficient for you to know that it is currently in an un-deleted state, or do you want to be able to tell what happened in the interim (including any edits between successive deletions!)?
Careful of soft-deleted records causing uniqueness constraint violations.
If your DB has columns with unique constraints then be careful that the prior soft-deleted records don’t prevent you from recreating the record.
Think of the cycle:
create user (login=JOE)
soft-delete (set deleted column to non-null.)
(re) create user (login=JOE). ERROR. LOGIN=JOE is already taken
The second create results in a constraint violation because login=JOE is already present in the soft-deleted row.
Some techniques:
1. Move the deleted record to a new table.
2. Make your uniqueness constraint span the login and the deleted_at timestamp column (sketched below)
My own opinion is +1 for moving to a new table. It takes lots of discipline to maintain the AND deleted_at IS NULL across all your queries (for all of your developers).
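A hedged sketch of technique 2, using the login and deleted_at names from the example above (note that databases differ in how unique indexes treat NULLs, so verify the behaviour on yours):
ALTER TABLE users
    ADD CONSTRAINT uq_users_login_deleted_at UNIQUE (login, deleted_at);

-- In SQL Server this allows exactly one live row per login
-- (login = 'JOE', deleted_at = NULL) plus any number of soft-deleted
-- rows with distinct deleted_at timestamps.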
You will definitely have better performance if you move your deleted data to another table like Jim said, as well as having a record of when it was deleted, why, and by whom.
Adding where deleted=0 to all your queries will slow them down significantly, and hinder the usage of any of the indexes you may have on the table. Avoid having "flags" in your tables whenever possible.
You don't mention which product, but SQL Server 2008 and PostgreSQL (and others, I'm sure) allow you to create filtered indexes, so you could create a covering index where is_deleted=0, mitigating some of the negatives of this particular approach.
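For example, a hedged SQL Server 2008+ sketch (index, table and column names are illustrative):
CREATE NONCLUSTERED INDEX ix_my_table_active
ON my_table (name)
INCLUDE (created_at)          -- covering columns for the common query
WHERE is_deleted = 0;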
Something that I use on projects is a statusInd tinyint not null default 0 column
Using statusInd as a bitmask allows me to perform data management (delete, archive, replicate, restore, etc.). Using this in views I can then do the data distribution, publishing, etc. for the consuming applications. If performance is a concern regarding views, use small fact tables to support this information; dropping the fact drops the relation and allows for scaled deletes.
It scales well and is data centric, keeping the data footprint pretty small -- key for 350GB+ DBs with real-time concerns. Using alternatives (tables, triggers) has some overhead that, depending on the need, may or may not work for you.
SOX related Audits may require more than a field to help in your case, but this may help.
Enjoy
Use a view, function, or procedure that checks is_deleted = 0; i.e. don't select directly on the table in case the table needs to change later for other reasons.
And index the is_deleted column for larger tables.
Since you already have an audit trail, tracking the deletion date is redundant.
I prefer to keep a status column, so I can use it for several different states, e.g. published, private, deleted, needsApproval...
Create another schema and grant it all privileges on your data schema.
Implement VPD on your new schema so that each and every query will have a predicate appended to it that allows selection of the non-deleted rows only.
http://download.oracle.com/docs/cd/E11882_01/server.112/e16508/cmntopc.htm#CNCPT62345
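A heavily hedged sketch of what that VPD setup might look like (the schema, function and policy names are all made up; see the linked documentation for the real details):
-- Policy function returning the predicate to append to every query.
CREATE OR REPLACE FUNCTION not_deleted_predicate (
    p_schema IN VARCHAR2,
    p_object IN VARCHAR2
) RETURN VARCHAR2
IS
BEGIN
    RETURN 'is_deleted = 0';
END;
/

-- Attach the policy to the table in the data schema.
BEGIN
  DBMS_RLS.ADD_POLICY(
      object_schema   => 'APP_DATA',
      object_name     => 'MY_TABLE',
      policy_name     => 'HIDE_DELETED_ROWS',
      function_schema => 'APP_SEC',
      policy_function => 'NOT_DELETED_PREDICATE',
      statement_types => 'SELECT');
END;
/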
@AdditionalCriteria("this.status <> 'deleted'")
Put this on top of your @Entity.
http://wiki.eclipse.org/EclipseLink/Examples/JPA/SoftDelete