SQL Server - Index on nvarchar field

What is a good approach to keeping an nvarchar field unique? I have a field which stores URLs of MP3 files. The URL length can be anything from 10 characters to 4,000. I tried to create an index and it says it cannot create the index because the total length exceeds 900 bytes.
If the field is not indexed, it's going to be slow to search anything. I am using C# and ASP.NET MVC for the front end.

You could use the CHECKSUM function and put an index on a column holding the checksum.
--*** Add extra column to your table that will hold checksum
ALTER TABLE Production.Product
ADD cs_Pname AS CHECKSUM(Name);
GO
--*** Create index on new column
CREATE INDEX Pname_index ON Production.Product (cs_Pname);
GO
Then you can retrieve data quickly using the following query:
SELECT *
FROM Production.Product
WHERE CHECKSUM(N'Bearing Ball') = cs_Pname
AND Name = N'Bearing Ball';
Here is the documentation: http://technet.microsoft.com/en-us/library/ms189788.aspx

You can use a hash function (theoretically it doesn't guarantee that two different titles will have different hashes, but it should be good enough; see MD5 Collisions) and then apply the index on that column.
MD5 in SQL Server
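A minimal sketch of that idea, assuming a hypothetical dbo.Mp3Urls table with Id and Url columns (MD5 yields 16 bytes; note that HASHBYTES is limited to 8,000 input bytes before SQL Server 2016, which an nvarchar(4000) just fits):
--*** Computed column holding the 16-byte MD5 of the URL
ALTER TABLE dbo.Mp3Urls
ADD UrlMd5 AS CAST(HASHBYTES('MD5', Url) AS binary(16)) PERSISTED;
GO
--*** A unique index on 16 bytes is fine, unlike one on 4000 characters
--*** (a genuine MD5 collision would reject a distinct URL, per the caveat above)
CREATE UNIQUE INDEX UX_Mp3Urls_UrlMd5 ON dbo.Mp3Urls (UrlMd5);
GO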

You could create a hash code of the URL and use this integer as a unique index on your db. Be sure to convert all characters to lowercase first, to ensure that all URLs are in the same format; the same URL will then generate an equal hash code.
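A sketch of that variant using CHECKSUM plus a collision-safe lookup (again assuming a hypothetical dbo.Mp3Urls table):
--*** Integer hash of the lowercased (normalized) URL
ALTER TABLE dbo.Mp3Urls
ADD UrlHash AS CHECKSUM(LOWER(Url));
GO
CREATE INDEX IX_Mp3Urls_UrlHash ON dbo.Mp3Urls (UrlHash);
GO
--*** Duplicate check: seek on the hash, then compare the full URL
DECLARE @Url nvarchar(4000) = N'http://example.com/song.mp3';
SELECT Id
FROM dbo.Mp3Urls
WHERE UrlHash = CHECKSUM(LOWER(@Url))
AND Url = @Url;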

Related

Oracle SQL trying to Hash 25 columns in one column

So when I was trying to hash 25 columns using the ORA_HASH function I was getting the error: too many parameters.
Is there any way we can hash all 25 columns, and quickly, because we have around 60M rows and no update date :(
select ORA_HASH(id, name, c...., ...) from table_name
Use concatenation with some special string as a delimiter, e.g. chr(10) here, assuming this character doesn't appear in your data:
col1||chr(10)||col2||....
Be careful with numeric and date columns.
Either convert them explicitly to character columns, e.g.
...||to_char(col_date,'yyyy-mm-dd hh24:mi:ss')||...
or temporarily override the session settings to get constant values:
ALTER SESSION SET NLS_NUMERIC_CHARACTERS = ',.';
ALTER SESSION SET NLS_DATE_FORMAT = 'DD.MM.YYYY HH24:MI:SS';
The problem with NLS settings is that when they change and you perform a default conversion to a character string, you get a different hash code.
Note also that ORA_HASH can lead to duplicates; consider e.g. an MD5 hash code to recognise changes in the table data.
Final note: Oracle has a (not well known) function DBMS_SQLHASH.GETHASH which may or may not be what you are looking for.
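Putting that together, a minimal sketch (table and column names are placeholders):
--*** chr(10) as delimiter; the date converted explicitly, per the notes above
select id,
       ORA_HASH(col1
                || chr(10) || col2
                || chr(10) || to_char(col_date, 'yyyy-mm-dd hh24:mi:ss')) as row_hash
from my_table;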
Surely your ultimate goal is not to get a hash? What is the hash for? It may very well not be the right way to achieve your goal.
Second, ORA_HASH is a weak, 32-bit hash that will produce a hash collision about every 25,000 rows! I wrote a whole blog post about this, see:
https://stewashton.wordpress.com/2014/02/15/compare-and-sync-tables-dbms_comparison/
Third, starting with version 12c there is a STANDARD_HASH function that seems to perform quite well and that goes up to 512 bits! (not bytes as I said before editing this answer...)
Finally, the right way to hash several things together is "hash chaining", not concatenating the values. ORA_HASH appears to support hash chaining (or something of similar effect) using the third parameter:
ora_hash(column1, 4294967295, ora_hash(column2))
With STANDARD_HASH, I would first use it on each column individually, then use UTL_RAW.CONCAT to concatenate the results, then either use STANDARD_HASH on the concatenated result or just use the concatenated value as if it were a big hash.
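A minimal sketch of that STANDARD_HASH approach (Oracle 12c and later; names are placeholders, and NULL columns would need explicit handling, e.g. NVL):
select STANDARD_HASH(
         UTL_RAW.CONCAT(
           STANDARD_HASH(col1, 'SHA256'),
           STANDARD_HASH(col2, 'SHA256'),
           STANDARD_HASH(to_char(col_date, 'yyyy-mm-dd hh24:mi:ss'), 'SHA256')
         ),
         'SHA256') as row_hash
from my_table;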

DB2 creating a random but unique character identifier upon row insert

I currently have a column in a DB2 table which is being passed through web calls and procedure by a character-encrypted value. It is type CHARACTER(13) with a CSSID for encryption.
This has become a huge pain to accommodate through multiple APIs but was initially intended to allow us a unique ID to use in calls that wasn't the primary key.
In DB2-400, what would be the next best thing as far as a 13 or more character string that is unique and randomly created upon insert, but doesn't require decryption (just a plain string)?
Is there a commonly-gravitated-to method for this? We aren't passing secure data, so there's no need for encryption; we just want a randomly created and unique character string.
Try hex(generate_unique()). It's a unique CHAR(26) string.
Or to_char(timestamp(generate_unique()), 'YYYYMMDDHH24MISSFF6'). You may play with the format of the to_char function as well. It may be useful to use, say, a reversed format like FF6SSMIHH24DDMMYYYY to avoid unique-index page contention under heavy insert activity.
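For example, a quick sketch of both variants (run against SYSDUMMY1; assumes a DB2 version where TO_CHAR/VARCHAR_FORMAT accepts this format string):
SELECT HEX(GENERATE_UNIQUE()) AS id_26,
       TO_CHAR(TIMESTAMP(GENERATE_UNIQUE()), 'YYYYMMDDHH24MISSFF6') AS id_ts
FROM SYSIBM.SYSDUMMY1;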
This is a comment that doesn't fit in the comments section.
I don't have access to a DB2-400 (anymore), but I tested the code below in DB2 10.5 for Linux.
create sequence seq1;
select concat('A', varchar_format(next value for seq1, '000000000000')) as my_id
from sysibm.sysdummy1;
Result, if you run it 4 times in a row:
A0000000000001
A0000000000002
A0000000000003
A0000000000004
Maybe there's something equivalent in DB2-400.
Sounds like you might be using GENERATE_UNIQUE()
GENERATE_UNIQUE function returns a bit data character string 13 bytes
long (CHAR(13) FOR BIT DATA)
Doesn't really have anything to do with encryption...
And it's pretty much the ideal solution, in my opinion, for generating a unique value other than a simple numeric identity. So what is the problem you are having?

SQL Server column datatype and indexability

I need to store variable-length strings in a SQL Server table; the length can vary from 0 to 10,000 or so characters.
I need to filter the text with the LIKE operator: '%abc' or 'abc%'.
I thought of using varchar or nvarchar, but I'm unable to create an index on this datatype.
Please suggest design advice for choosing the column data type and making the column indexable.
Thank you
SQL Server will use an index for LIKE 'abc%'. However, it will not use the index if the wildcard comes first.
If you are searching for complete words, then you should investigate contains() and full text search.
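For instance, a hedged sketch of the full-text route (this assumes a full-text catalog already exists and that the table has a unique, single-column key index, here hypothetically named PK_Docs):
CREATE FULLTEXT INDEX ON dbo.Docs (Body)
KEY INDEX PK_Docs;
GO
--*** CONTAINS matches complete words, unlike LIKE '%abc%'
SELECT Id, Body
FROM dbo.Docs
WHERE CONTAINS(Body, N'abc');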
I thought of using varchar or nvarchar, but I'm unable to create index on this datatype.
Those are the proper datatypes for text based information.
The reason you're not able to create an index has nothing to do with the datatype per se; in SQL Server, an index can be created only if the column (or set of columns) in the index key has a maximum theoretical size of 900 bytes or less (1,700 bytes for a nonclustered index key on SQL Server 2016 and later).
This means a VARCHAR(1000) (or VARCHAR(MAX)) cannot be indexed. This size limit is your issue, not the datatype.
Solution: use fewer bytes in the columns you want to index. Or alternatively, as Gordon suggested, check out the full-text indexing capabilities of SQL Server.

Storing HASHBYTES output in NVARCHAR vs BINARY

I am going to create:
a table for storing IDs and unique text values (which are expected to be large), and
a stored procedure which will take a text value as an input parameter (it will check whether the value exists in the above table and return the corresponding ID if it does, or insert a new record if not and return the new ID as well).
I want to optimize the search of text values by using a hash value of the text and creating an index on it. So, during the search I expect the non-clustered index to be used (not the clustered index).
I decided to use HASHBYTES with SHA2_256, and I am wondering whether there are any differences/benefits if I store the hash value as BINARY(32) versus NVARCHAR(16)?
You can't reasonably store a hash value as chars because binary data is not text. Various text-processing and comparison functions interpret those chars; for example, trailing whitespace is sometimes ignored, leading to incorrect results.
Since you've got 32 totally random, unstructured bytes to store, BINARY(32) is the most natural format, and it is the fastest one.
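A sketch of the whole setup under those choices (hypothetical names; HASHBYTES over values longer than 8,000 bytes requires SQL Server 2016 or later):
CREATE TABLE dbo.TextValues (
    Id        int IDENTITY(1,1) PRIMARY KEY,
    TextValue nvarchar(max) NOT NULL,
    TextHash  AS CAST(HASHBYTES('SHA2_256', TextValue) AS binary(32)) PERSISTED
);
GO
--*** Non-clustered index on the fixed-size 32-byte hash
CREATE INDEX IX_TextValues_TextHash ON dbo.TextValues (TextHash);
GO
--*** Lookup: seek on the hash, then compare the full text to rule out collisions
DECLARE @TextValue nvarchar(max) = N'some large text value';
SELECT Id
FROM dbo.TextValues
WHERE TextHash = CAST(HASHBYTES('SHA2_256', @TextValue) AS binary(32))
AND TextValue = @TextValue;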

Fastest way to find string by substring in SQL?

I have a huge table with 2 columns: Id and Title. Id is bigint and I'm free to choose the type of the Title column: varchar, char, text, whatever. The Title column contains random text strings like "abcdefg", "q", "allyourbasebelongtous", with a maximum of 255 chars.
My task is to find strings by a given substring. Substrings also have random length and can be the start, middle, or end of a string. The most obvious way to do it:
SELECT * FROM t WHERE Title LIKE '%abc%'
I don't care about INSERT performance; I only need to do fast SELECTs. What can I do to make the search as fast as possible?
I use MS SQL Server 2008 R2; full-text search will be useless, as far as I can see.
If you don't care about storage, you can create another table with partial Title entries: one row for each trailing substring of each title (up to 255 entries per title).
This way you can index these substrings and match only against the beginning of the string, which should greatly improve performance.
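A hedged sketch of that suffix-table idea (names are hypothetical; it assumes Title is at most 255 characters, per the question):
CREATE TABLE dbo.TitleSuffixes (
    TitleId bigint       NOT NULL,
    Suffix  varchar(255) NOT NULL
);
GO
--*** One row per trailing substring of each title
WITH Numbers AS (
    SELECT TOP (255) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS n
    FROM sys.all_columns
)
INSERT INTO dbo.TitleSuffixes (TitleId, Suffix)
SELECT t.Id, SUBSTRING(t.Title, n.n, 255)
FROM dbo.t AS t
JOIN Numbers AS n ON n.n <= LEN(t.Title);
GO
CREATE INDEX IX_TitleSuffixes_Suffix ON dbo.TitleSuffixes (Suffix);
GO
--*** '%abc%' becomes an index-friendly prefix search on the suffixes
SELECT DISTINCT t.*
FROM dbo.t AS t
JOIN dbo.TitleSuffixes AS s ON s.TitleId = t.Id
WHERE s.Suffix LIKE 'abc%';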
If you want to use less space than Randy's answer and there is considerable repetition in your data, you can create an N-Ary tree data structure where each edge is the next character and hang each string and trailing substring in your data on it.
You number the nodes in depth first order. Then you can create a table with up to 255 rows for each of your records, with the Id of your record, and the node id in your tree that matches the string or trailing substring. Then when you do a search, you find the node id that represents the string you are searching for (and all trailing substrings) and do a range search.
Sounds like you've ruled out all good alternatives.
You already know that your query
SELECT * FROM t WHERE TITLE LIKE '%abc%'
won't use an index, it will do a full table scan every time.
If you were sure that the string was at the beginning of the field, you could do
SELECT * FROM t WHERE TITLE LIKE 'abc%'
which would use an index on Title.
Are you sure full text search wouldn't help you here?
Depending on your business requirements, I've sometimes used the following logic:
Do a "begins with" query (LIKE 'abc%') first, which will use an index.
Depending on whether any rows are returned (or how many), conditionally move on to the "harder" search that will do the full scan (LIKE '%abc%')
Depends on what you need, of course, but I've used this in situations where I can show the easiest and most common results first, and only move on to the more difficult query when necessary.
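A rough T-SQL sketch of that two-step logic (table/column names are placeholders; assumes an index on Title):
DECLARE @q varchar(255) = 'abc';
--*** Phase 1: index-friendly prefix search
SELECT * FROM dbo.t WHERE Title LIKE @q + '%';
--*** Phase 2: fall back to the full scan only when phase 1 found nothing
IF @@ROWCOUNT = 0
    SELECT * FROM dbo.t WHERE Title LIKE '%' + @q + '%';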
You can add a computed column to the table: TitleLength AS LEN(Title) PERSISTED. This stores the length of the Title column. Create an index on it.
Also add another computed column: ReverseTitle AS REVERSE(Title) PERSISTED.
Now when someone searches for a keyword, check if the length of the keyword equals TitleLength; if so, do an "=" search. If the keyword is shorter than TitleLength, do a LIKE: first Title LIKE 'abc%', then ReverseTitle LIKE 'cba%'. Similar to Brad's approach, i.e. you run the more expensive query only when required.
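A sketch of those computed columns and the search order (placeholders as before):
ALTER TABLE dbo.t ADD TitleLength AS LEN(Title) PERSISTED;
ALTER TABLE dbo.t ADD ReverseTitle AS REVERSE(Title) PERSISTED;
GO
CREATE INDEX IX_t_TitleLength ON dbo.t (TitleLength);
CREATE INDEX IX_t_ReverseTitle ON dbo.t (ReverseTitle);
GO
DECLARE @q varchar(255) = 'abc';
IF EXISTS (SELECT 1 FROM dbo.t WHERE TitleLength = LEN(@q) AND Title = @q)
    --*** Keyword is as long as the title: exact match
    SELECT * FROM dbo.t WHERE Title = @q;
ELSE
BEGIN
    --*** Prefix match (assumes an index on Title), then suffix match via ReverseTitle
    SELECT * FROM dbo.t WHERE Title LIKE @q + '%';
    SELECT * FROM dbo.t WHERE ReverseTitle LIKE REVERSE(@q) + '%';
END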
Also, if the 80-20 rule applies to your keywords/substrings (i.e. most searches are for a minority of the keywords), you can also consider some sort of caching. For example, say you find that many users search for the keyword "abc" and this search returns records with IDs 20, 22, 24, 25: you can store this in a separate table and have it indexed.
Now when someone searches for a new keyword, first look in this "cache" table to see if the search was already performed by an earlier user. If so, there is no need to look again in the main table; simply return the results from the "cache" table.
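A sketch of that cache table (hypothetical names; invalidating the cache when dbo.t changes is left out):
CREATE TABLE dbo.KeywordCache (
    Keyword varchar(255) NOT NULL,
    TitleId bigint       NOT NULL,
    PRIMARY KEY (Keyword, TitleId)
);
GO
DECLARE @q varchar(255) = 'abc';
--*** Populate the cache once per new keyword with the expensive scan
IF NOT EXISTS (SELECT 1 FROM dbo.KeywordCache WHERE Keyword = @q)
    INSERT INTO dbo.KeywordCache (Keyword, TitleId)
    SELECT @q, Id FROM dbo.t WHERE Title LIKE '%' + @q + '%';
--*** All later searches for the same keyword are indexed lookups
SELECT t.*
FROM dbo.KeywordCache AS c
JOIN dbo.t AS t ON t.Id = c.TitleId
WHERE c.Keyword = @q;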
You can also combine the above with SQL Server full-text search (assuming you have a valid reason not to use it on its own). You could nevertheless use full-text search first to shortlist the result set, and then run a SQL query against your table to get the exact results, using the IDs returned by the full-text search as a parameter along with your keyword.
All this obviously assumes you have to use SQL. If not, you can explore something like Apache Solr.
Create an indexed view (a newer SQL Server feature): create an index on the column that you need to search and then use that view in your searches; it will give you faster results.
Use an ASCII character set and a clustered index on the char column. The character set influences search performance because of the data size in both RAM and on disk; the bottleneck is often I/O.
Your column is at most 255 characters long, so you can use a normal index on your char field rather than full-text search, which is faster. Do not select unnecessary columns in your SELECT statement.
Lastly, add more RAM to the server and increase the cache size.
Do one thing: use a primary key on the specific column and index it in clustered form.
Then search using any method (wildcard, =, or anything else); it will search optimally because the table is already in clustered form, so the engine knows where to find the value (the column is already in sorted order).