In SQL Server, is it possible to generate a GUID using a specific piece of data as an input value? For example,
DECLARE @seed1 VARCHAR(10) = 'Test'
DECLARE @seed2 VARCHAR(10) = 'Testing'
SELECT NEWID(@seed1) -- will always return the same output value
SELECT NEWID(@seed2) -- will always return the same output value, and will be different to the example above
I know this completely goes against the point of GUIDs, in that the ID would not be unique. I'm looking for a way to detect duplicate records based on certain criteria (the @seed value).
I've tried generating a VARBINARY string using the HASHBYTES function, however joining between tables using VARBINARY seems extremely slow. I'm hoping to find a similar alternative that is more efficient.
Edit: for more information on why I'm looking to achieve this.
I'm looking for a fast and efficient way of detecting duplicate information that exists in two tables. For example, I have first name, last name & email. When these are concatenated, they can be used to check whether a record exists in both table A and table B.
Simply joining on these fields is possible and produces the correct result, but it is quite slow. Therefore, I was hoping to find a way of transforming the data into something such as a GUID, which would make the joins much more efficient.
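One option (just a sketch; the table and column names below are made up) is to persist the hash once per row as a computed column and index it, so the join compares a short fixed-width key instead of several strings:

ALTER TABLE TableA ADD NameEmailHash AS
    CAST(HASHBYTES('MD5', FirstName + '|' + LastName + '|' + Email) AS BINARY(16)) PERSISTED;
CREATE INDEX IX_TableA_NameEmailHash ON TableA (NameEmailHash);

ALTER TABLE TableB ADD NameEmailHash AS
    CAST(HASHBYTES('MD5', FirstName + '|' + LastName + '|' + Email) AS BINARY(16)) PERSISTED;
CREATE INDEX IX_TableB_NameEmailHash ON TableB (NameEmailHash);

-- The join now compares one 16-byte key per row instead of three strings.
-- NULL in any column makes the whole hash NULL; wrap the columns in ISNULL if that matters.
SELECT a.*
FROM TableA AS a
JOIN TableB AS b
  ON b.NameEmailHash = a.NameEmailHash;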
I think you can use the CHECKSUM function, which returns an INT.
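For example (the column names are assumed), you could join on the CHECKSUM of the fields; since CHECKSUM is only an INT and can collide, it's worth verifying the match on the real columns as well:

SELECT a.*
FROM TableA AS a
JOIN TableB AS b
  ON CHECKSUM(b.FirstName, b.LastName, b.Email) = CHECKSUM(a.FirstName, a.LastName, a.Email)
 AND b.FirstName = a.FirstName
 AND b.LastName  = a.LastName
 AND b.Email     = a.Email;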
You should use HASHBYTES and not CHECKSUM, like this:
SELECT hashbytes('MD5', 'JOHN' + ',' + 'SMITH' + ',' + 'JSMITH@EXAMPLE.COM')
Although there's only a small chance that CHECKSUM will produce the same number for two completely different values, I've had it happen with datasets of around a million rows. As iamdave noted (thanks!), it's a good idea to throw in some kind of delimiter (a comma in my example) so that you don't compare 'JOH' + 'NSMITH' and 'JOHN' + 'SMITH' as the same.
http://www.sqlservercentral.com/blogs/microsoft-business-intelligence-and-data-warehousing/2012/02/01/checksum-vs-hashbytes/
Perhaps I am not creative or knowledgeable enough with SQL... but it looks like there is no way to do a DROP TABLE or DELETE FROM within a SELECT without the ability to start a new statement.
Basically, we have a situation where our codebase has some gigantic, "less-than-robust" SQL generation component that never uses prepared statements and we now have an API that interacts with this legacy component.
Right now we can modify a query by appending to the end of it, but have been unable to insert any semicolons. Thus, we can do something like this:
/query?[...]&location_ids=loc1')%20or%20L1.ID%20in%20('loc2
which will result in this
SELECT...WHERE L1.PARENT_ID='1' and L1.ID IN ('loc1') or L1.ID in ('loc2');...
This is just one example.
Basically we can append pretty much anything to the end of any/most generated SQL queries, short of adding a semicolon.
Any ideas on how this could potentially do some damage? Can you add something to the end of a SQL query that deletes from or drops tables? Or create a query so absurd that it takes up all CPU and never completes?
You said that this:
/query?[...]&location_ids=loc1')%20or%20L1.ID%20in%20('loc2
will result in this:
SELECT...WHERE L1.PARENT_ID='1' and L1.ID IN ('loc1') or L1.ID in ('loc2');
so it looks like this:
/query?[...]&location_ids=');DROP%20TABLE users;--
will result in this:
SELECT...WHERE L1.PARENT_ID='1' and L1.ID IN ('');DROP TABLE users;--');
which is a SELECT, a DROP and a comment.
If it’s not possible to inject another statement, you’re limited to the existing statement and its abilities.
Like in this case, if you are limited to SELECT and you know where the injection happens, have a look at PostgreSQL’s SELECT syntax to see what your options are. Since you’re injecting into the WHERE clause, you can only inject additional conditions or other clauses that are allowed after the WHERE clause.
If the result of the SELECT is returned back to the user, you may want to add your own SELECT with a UNION operation. However, PostgreSQL requires compatible data types for corresponding columns:
The two SELECT statements that represent the direct operands of the UNION must produce the same number of columns, and corresponding columns must be of compatible data types.
So you would need to know the number and data types of the columns of the original SELECT first.
The number of columns can be detected with the ORDER BY clause by specifying the column number like ORDER BY 3, which would order the result by the values of the third column. If the specified column does not exist, the query will fail.
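For example, with the same injection point as above, a probe like
loc1') ORDER BY 3 --
keeps the query valid as long as the original SELECT returns at least three columns, and fails as soon as the number exceeds the real column count.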
After determining the number of columns, you can inject a UNION SELECT with the appropriate number of columns, using a NULL value for each column of your UNION SELECT:
loc1') UNION SELECT null,null,null,null,null --
Now you determine the type of each column by trying a different value for each column, one by one. If the value's type is incompatible with the column, you may get an error that hints at the expected data type, such as:
ERROR: invalid input syntax for integer
ERROR: UNION types text and integer cannot be matched
After you have determined enough column types (one column may be sufficient when it’s one that is presented to the user), you can change your SELECT to select whatever you want.
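For example, if the original SELECT turned out to have five columns and the third one accepts text, a payload along these lines (the users table and its column names are purely illustrative) would append rows of your choosing to the result:
loc1') UNION SELECT null, null, username, null, null FROM users --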
Is there a "semi-portable" way to get the md5() or the sha1() of an entire row? (Or better, of an entire group of rows ordered by all their fields, i.e. order by 1,2,3,...,n)? Unfortunately not all DBs are PostgreSQL... I have to deal with at least microsoft SQL server, Sybase, and Oracle.
Ideally, I'd like to have an aggregator (server side) and use it to detect changes in groups of rows. For example, in tables that have some timestamp column, I'd like to store a unique signature for, say, each month. Then I could quickly detect months that have changed since my last visit (I am mirroring certain tables to a server running Greenplum) and re-load those.
I've looked at a few options, e.g. CHECKSUM(*) in T-SQL (horror: it's very collision-prone, since it's based on a bunch of XORs and 32-bit values), and HASHBYTES('MD5', field), but the latter can't be applied to an entire row. And that would give me a solution for just one of the SQL flavors I have to deal with.
Any idea? Even for just one of the SQL idioms mentioned above, that would be great.
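To make the goal concrete, here is a rough sketch of a per-month signature for the SQL Server flavor only (table and column names are made up; note that HASHBYTES input is limited to 8000 bytes before SQL Server 2016, so very large months would need chunking):

WITH row_hashes AS (
    SELECT DATEFROMPARTS(YEAR(ts), MONTH(ts), 1) AS month_start,
           id,
           CONVERT(CHAR(32), HASHBYTES('MD5', CONCAT(col1, '|', col2, '|', col3)), 2) AS row_hash
    FROM dbo.some_table
)
SELECT m.month_start,
       HASHBYTES('MD5',
                 (SELECT rh.row_hash AS [text()]
                  FROM row_hashes rh
                  WHERE rh.month_start = m.month_start
                  ORDER BY rh.id   -- deterministic order so the signature is stable
                  FOR XML PATH(''))) AS month_signature
FROM (SELECT DISTINCT month_start FROM row_hashes) AS m;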
You could calculate the HASHBYTES value for the entire row in an update trigger. I used this as part of an ETL process where previously they were comparing all columns in the tables; the speed increase was huge.
HASHBYTES works on varchar, nvarchar, or varbinary data types, and I wanted to compare integer keys and text fields; casting everything would have been a nightmare, so I used the FOR XML clause in SQL Server as follows:
CREATE TRIGGER get_hash_value ON staging_table
FOR UPDATE, INSERT AS
UPDATE staging_table
SET sha1_hash = (SELECT hashbytes('sha1', (SELECT col1, col2, col3 FOR XML RAW)))
GO
Alternatively, you could calculate the values in a similar way outside of a trigger, using a subquery with the FOR XML clause, if you plan to do many updates on all the rows. If you go this route, you can even change it to a SELECT *, but not in the trigger, because SELECT * would include the sha1_hash column itself, so each time you ran it you would get a different value.
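For instance, the same calculation could be run as a one-off statement outside the trigger (a sketch reusing the column names from the trigger above; add a WHERE clause to limit which rows get rehashed):

UPDATE staging_table
SET sha1_hash = (SELECT hashbytes('sha1', (SELECT col1, col2, col3 FOR XML RAW)));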
You could modify the SELECT statement to cover more than one row.
In MSSQL you can use HASHBYTES across the entire row by using XML:
SELECT MBT.id,
       hashbytes('MD5',
                 (SELECT MBT.*
                  FROM (VALUES (NULL)) foo(bar)
                  FOR XML AUTO)) AS [Hash]
FROM <Table> AS MBT;
You need the FROM (VALUES(NULL)) foo(bar) clause to use XML AUTO; it serves no other purpose.
I am looking for a way to fuzzy match strings (in my case contact names) to see where there might be possible duplicates in the database. The 'duplicates' are actually cases where the names are very similar, as each row will have unique data.
I have been looking around and think that this JaroWinkler function would best suit my needs; it works quite well on small sets of strings.
However, I am looking to compare about 260,000 distinct strings, and want to see if there is a way to avoid checking through all possible combinations (as this would give me around 29 billion rows of checking).
As it stands the query I am using for a small sample set:
CREATE TABLE #data
(
ROW INT IDENTITY (1,1)
,string VARCHAR(50)
)
INSERT INTO #data SELECT 'Watts' AS string
UNION ALL SELECT 'Burns'
UNION ALL SELECT 'McLaughlan'
UNION ALL SELECT 'Darry'
UNION ALL SELECT 'Storie'
UNION ALL SELECT 'Mcluangan'
UNION ALL SELECT 'Burnsysx'
SELECT
data1.string as string1
,data1.row as row1
,data2.string as string2
,data2.row as row2
,dbo.JaroWinkler(data1.string,data2.string) as correlation
from #data data1
CROSS JOIN #data data2
WHERE data1.row < data2.row
For this sample data that returns 21 rows, but I am only interested in rows where the correlation is above 0.7, so the majority of these can be removed from the output and, if possible, not even used as a comparison point.
So for the example data above, I would want to return the following rows:
string1 row1 string2 row2 correlation
McLaughlan 3 Mcluangan 6 0.8962954
Burns 2 Burnsysx 7 0.874999125
I know that using inequality triangular joins is not a good idea, so would using a cursor be a better one? I do unfortunately need to check all records against each other to make sure duplicates don't exist.
For the purposes of testing, the Difference(data1.string,data2.string) could be used, filtering only cases where the value = 4 (so that I can at least get a sense of how best to move forwards with this)!!
Thanks!
The fuzzy logic feature in SSIS might be worth a shot, if you haven't tried it yet. It might be more performant than the query you have and has more "tweakable" parameters. It is relatively easy to set up.
http://msdn.microsoft.com/en-us/magazine/cc163731.aspx
If you are trying to find duplicate names, have you considered using the built-in SOUNDEX() function to find matches?
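For instance, a rough sketch against the #data table from the question, using SOUNDEX as a cheap blocking key so the JaroWinkler UDF only runs on likely candidates (note the trade-off: pairs whose SOUNDEX codes differ are never compared, so some fuzzy matches can be missed):

SELECT string1, row1, string2, row2, correlation
FROM (
    SELECT data1.string AS string1,
           data1.row    AS row1,
           data2.string AS string2,
           data2.row    AS row2,
           dbo.JaroWinkler(data1.string, data2.string) AS correlation
    FROM #data data1
    JOIN #data data2
      ON SOUNDEX(data1.string) = SOUNDEX(data2.string)  -- equality join prunes most of the candidate pairs
     AND data1.row < data2.row
) AS candidates
WHERE correlation > 0.7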
I've been reading around and found that using LIKE causes a big slowdown in queries.
A workmate recommended we use
SELECT Name
FROM mytable a
WHERE a.Name IN (SELECT Name
                 FROM mytable
                 WHERE Name LIKE '%' + ISNULL(@Name, N'') + '%'
                 GROUP BY Name)
in lieu of
SELECT Name
FROM mytable a
WHERE a.Name LIKE '%' + ISNULL(@Name, N'') + '%'
Now I'm no SQL expert and I don't really understand the inner workings of these statements. Is this a better option worth the effort of typing a few extra characters with each like statement? Is there an even better (and easier to type) alternative?
There are a couple of performance issues to address...
Don't Access the Same Table More Than Once, If Possible
Don't use a subquery for criteria that can be done without the need for referencing additional copies of the same table. It's acceptable if you need data from a copy of the table due to using aggregate functions (MAX, MIN, etc), though analytic functions (ROW_NUMBER, RANK, etc) might be more accommodating (assuming supported).
Don't Compare What You Don't Need To
If your parameter is NULL, and that means you want any value for the column you are comparing against, don't include the filter criteria at all. A statement like this:
WHERE a.Name LIKE '%' + ISNULL(@Name, N'') + '%'
...guarantees the optimizer will have to compare values for the Name column, wildcard or not. Worse still, in the case of LIKE, wildcarding the left side of the evaluation ensures that an index can't be used even if one is present on the column being searched.
A better performing approach would be:
IF @Name IS NOT NULL
BEGIN
SELECT ...
FROM ...
WHERE a.name LIKE '%' + @Name + '%'
END
ELSE
BEGIN
SELECT ...
FROM ...
END
Well-performing SQL is all about tailoring to exactly what you need, which is why you should consider dynamic SQL when you have queries with two or more independent criteria.
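A minimal sketch of that idea with sp_executesql, reusing the query from the question (the NVARCHAR(50) parameter type is an assumption):

DECLARE @sql NVARCHAR(MAX) = N'SELECT Name FROM mytable';

-- Only add the filter when there is actually something to filter on.
IF @Name IS NOT NULL
    SET @sql = @sql + N' WHERE Name LIKE ''%'' + @Name + ''%''';

EXEC sp_executesql @sql, N'@Name NVARCHAR(50)', @Name = @Name;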
Use The Right Tool
The LIKE operator isn't very efficient at searching text when you're checking for the existence of a string within text data. Full Text Search (FTS) technology was designed to address the shortcomings:
IF @Name IS NOT NULL
BEGIN
SELECT ...
FROM ...
WHERE CONTAINS(a.name, @Name)
END
ELSE
BEGIN
SELECT ...
FROM ...
END
Always Test & Compare
I agree with LittleBobbyTables - the solution ultimately relies on checking the query/execution plan for all the alternatives, because table design & data can impact optimizer decisions & performance. In SQL Server, the one with the lowest subtree cost is the most efficient, but that can change over time if the table statistics and indexes aren't maintained.
Simply compare the execution plans and you should see the difference.
I don't have your exact data, but I ran the following queries against a SQL Server 2005 database of mine (yes, it's nerdy):
SELECT UnitName
FROM Units
WHERE (UnitName LIKE '%Space Marine%')
SELECT UnitName
FROM Units
WHERE UnitName IN (
    SELECT UnitName FROM Units
    WHERE UnitName LIKE '%Space Marine%' GROUP BY UnitName
)
Here were my execution plan results:
Your co-worker's suggestion adds a nested loop and a second clustered index scan to my query as you can see above. Your mileage may vary, but definitely check the execution plans to see how they compare. I can't imagine how it would be more efficient.
Unless IIQR is some smaller table that indexes the names somehow (and is not the original table being queried here from the start), I don't see how that longer version helps at all; it's doing the exact same thing, but adding an extra step of creating a set of results which is then used in an IN.
But I'd be dubious even if IIQR is a smaller 'index' table. I'd want to see more about the database in question and what the query plan ends up being for each.
LIKE can have a negative effect on query performance because it often requires a table scan - physically loading each record's relevant field and searching for the text in question. Even if the field is indexed, this is likely the case. But there may be no way around it, if what you need to do is search for partial text at any possible location inside a field.
Depending on the size of the table in question, though, it may really not matter at all.
For you, though, I would suggest that keeping it simple is best. Unless you really know what the full effect of complicating a query would be on performance, it can be hard to decide which way to do things.
So I have a database table in MySQL that has a column containing a string. Given a target string, I want to find all the rows that have a substring contained in the target, ie all the rows for which the target string is a superstring for the column. At the moment I'm using a query along the lines of:
SELECT * FROM table WHERE 'my superstring' LIKE CONCAT('%', column, '%')
My worry is that this won't scale. I'm currently doing some tests to see if this is a problem but I'm wondering if anyone has any suggestions for an alternative approach. I've had a brief look at MySQL's full-text indexing but that also appears to be geared toward finding a substring in the data, rather than finding out if the data exists in a given string.
You could create a temporary table with a full text index and insert 'my superstring' into it. Then you could use MySQL's full text match syntax in a join query with your permanent table. You'll still be doing a full table scan on your permanent table because you'll be checking for a match against every single row (what you want, right?). But at least 'my superstring' will be indexed so it will likely perform better than what you've got now.
Alternatively, you could consider simply selecting column from table and performing the match in a high level language. Depending on how many rows are in table, this approach might make more sense. Offloading heavy tasks to a client server (web server) can often be a win because it reduces load on the database server.
If your superstrings are URLs, and you want to find substrings in them, it would be useful to know if your substrings can be anchored on the dots.
For instance, you have superstrings :
www.mafia.gov.ru
www.mymafia.gov.ru
www.lobbies.whitehouse.gov
If your rules contain 'mafia' and you want the first 2 to match, then what I'll say doesn't apply.
Otherwise, you can parse your URLs into things like: ['www', 'mafia', 'gov', 'ru']
Then, it will be much easier to look up each element in your table.
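A sketch of that lookup (the url_tokens table and its columns are hypothetical; it would hold one row per dot-separated label of each stored URL):

-- Return stored URLs whose every token appears in the target's token set.
SELECT t.url_id
FROM url_tokens AS t
WHERE t.token IN ('www', 'lobbies', 'whitehouse', 'gov')
GROUP BY t.url_id
HAVING COUNT(DISTINCT t.token) =
       (SELECT COUNT(DISTINCT token) FROM url_tokens WHERE url_id = t.url_id);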
Well it appears the answer is that you don't. This type of indexing is generally not available and if you want it within your MySQL database you'll need to create your own extensions to MySQL. The alternative I'm pursuing is to do the indexing in my application.
Thanks to everyone that responded!
I created a search solution using views that needed to be robust enough to grow with the customer's needs. For example:
CREATE TABLE tblMyData
(
MyId bigint identity(1,1),
Col01 varchar(50),
Col02 varchar(50),
Col03 varchar(50)
)
CREATE VIEW viewMySearchData
as
SELECT
MyId,
ISNULL(Col01,'') + ' ' +
ISNULL(Col02,'') + ' ' +
ISNULL(Col03,'') + ' ' AS SearchData
FROM tblMyData
SELECT
t1.MyId,
t1.Col01,
t1.Col02,
t1.Col03
FROM tblMyData t1
INNER JOIN viewMySearchData t2
ON t1.MyId = t2.MyId
WHERE t2.SearchData like '%search string%'
If they then decide to add columns to tblMyData and want those columns to be searched, modify viewMySearchData by adding the new columns to the "AS SearchData" section, as sketched below.
If they decide that there are too many columns in the search, just modify viewMySearchData by removing the unwanted columns from the "AS SearchData" section.
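For example, if a new Col04 were added to tblMyData and should be searchable (the column name here is just an illustration):

ALTER VIEW viewMySearchData
AS
SELECT
    MyId,
    ISNULL(Col01,'') + ' ' +
    ISNULL(Col02,'') + ' ' +
    ISNULL(Col03,'') + ' ' +
    ISNULL(Col04,'') AS SearchData
FROM tblMyData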