I'm looking for advice on how to tackle the issue of different spelling for the same name.
I have a SQL Server database with company names, and there are some companies that are the same but the spelling is different.
For example:
Building Supplies pty
Buidings Supplies pty
Building Supplied l/d
The problem is that the variations are not consistent. Sometimes it's an extra 's', other times it's an extra space.
Unfortunately I don't have a lookup list, so I can't use Fuzzy LookUp. I need to create the clean list.
Is there a method that people use to deal with this problem?
P.S. I tried searching for this problem but couldn't find a similar thread.
Thanks
You can use SOUNDEX() and DIFFERENCE() for this purpose.
-- Sample data held in a table variable
DECLARE @SampleData TABLE (ID INT, BLD VARCHAR(50), SUP VARCHAR(50))

INSERT INTO @SampleData
SELECT 1, 'Building', 'Supplies'
UNION
SELECT 2, 'Buidings', 'Supplies'
UNION
SELECT 3, 'Biulding', 'Supplied'
UNION
SELECT 4, 'Road', 'Contractor'
UNION
SELECT 5, 'Raod', 'Consractor'
UNION
SELECT 6, 'Highway', 'Supplies'

-- DIFFERENCE() compares the SOUNDEX codes of two strings: 0 = no similarity, 4 = strongest match
SELECT *, DIFFERENCE('Building', BLD) AS DIF
FROM @SampleData
WHERE DIFFERENCE('Building', BLD) >= 3
Result
ID BLD SUP DIF
1 Building Supplies 4
2 Buidings Supplies 3
3 Biulding Supplied 4
If this serves your purpose, you can write an UPDATE query to correct the selected records accordingly.
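Since you have no lookup list to start from, the same idea can be used to build the clean list itself. The following is only a rough sketch in Python, not T-SQL: it swaps SOUNDEX for the standard library's difflib similarity ratio, and the helper names (normalise, group_names) and the 0.8 threshold are my own assumptions that you would need to tune against your data.

```python
from difflib import SequenceMatcher


def normalise(name):
    # Lower-case and collapse repeated whitespace so an extra space does not count as a difference.
    return " ".join(name.lower().split())


def similarity(a, b):
    # Ratio between 0.0 (nothing in common) and 1.0 (identical after normalisation).
    return SequenceMatcher(None, normalise(a), normalise(b)).ratio()


def group_names(names, threshold=0.8):
    # Greedy grouping: each name joins the first group whose representative
    # (its first member) is similar enough, otherwise it starts a new group.
    groups = []
    for name in names:
        for group in groups:
            if similarity(name, group[0]) >= threshold:
                group.append(name)
                break
        else:
            groups.append([name])
    return groups
```

Each resulting group is a candidate for one clean name. I would still review the groups by hand before running any UPDATE, because a greedy pass like this can lump together names that only look alike.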
Aside from the SOUNDEX()/DIFFERENCE() option (which is a very good one, by the way), you could look into SSIS more.
Provided your data is in English and not exclusively names of people, there is a lot you can do with these components:
Term extraction
Term lookup
Fuzzy grouping
Fuzzy lookup
The main flow would be a tiered structure in which you look for duplicates with progressively less certain methods. Instead of applying the changes automatically, you send all the names and the keys needed to apply the changes to a staging area, where they can be reviewed and, if needed, applied.
If you go about it really smartly, you can use the reviewed data as a repository for making the package "learn". For example, "iu" is hardly ever valid in English, so if it is found and changing it to "ui" makes a valid English word, you might want to start applying those corrections automatically at some point.
One other thing to consider is keeping a list of all validated names and using it to check for duplicates of those names and to prevent unnecessary recursion/load when checking the source data.
This question might appear a bit strange to you, but I'll try to explain it.
In the production department of our company we track machine data. This data is also used for evaluating the quality of the production process.
In the following I will refer to these attributes:
productId
componentOfProduct -> the component which is affected by the error
routeStepOfError
causeOfError
The problem is that the data the machine produces is not in the form that management wants for evaluation.
So we have to do data matching. Most of the time it is a simple relationship, e.g. matching several productId numbers to one product name/group.
But in the case of routeStepOfError it's different. In some cases the routeStep that the production lines log can be matched to the routeStep for the management reports in the way described above for the productIds.
But for some routeSteps a much more complicated matching is done. So far it's implemented in a VBA app that matches the database output and writes the data into a spreadsheet. The matching is done via Select Case statements like this:
Select Case routeStep
    Case EOL
        Select Case productId
            Case 1111, 1112, 1113
                Select Case causeOfError
                    Case A1:
                        Select Case componentOfProduct
                            Case "be1": routeStepReport = "final optical test"
                            Case Else: routeStepReport = "end of line"
                        End Select
                    Case Else: routeStepReport = "end of line"
                End Select
            Case Else: routeStepReport = "end of line"
        End Select
    Case...
End Select
...I know that the syntax might not be correct, but I hope you get what I'm trying to say: sometimes the matching from routeStep to routeStepReport (i.e. the value we need for our management reports) depends on the routeStep, the productId, the componentOfProduct and the causeOfError.
...and these Select Case statements are really long, as there are many products and many routeSteps in our production process. So each time there is a change in the production program/process, it has to be maintained in the VBA code, which is far from perfect, as only one guy in our company really knows where in the code to look and how to maintain it.
So I proposed implementing the whole matching in an SQL database and simply creating the right relationships between the values from the machines and the values management wants. Together with an interface in PHP or whatever, people could do the matching quite easily.
Well, for the simple matchings like productIds to product groups this works quite well, but for routeSteps like the one described above it might be a problem.
I would have created one table with the following attributes:
|------------------|-----------|-----------------|--------------|-----------------|
| routeStepOfError | productId | componentOfProd | causeOfError | routeStepReport |
|------------------|-----------|-----------------|--------------|-----------------|
But let's say we have about 20 routeSteps and 50 productIds, each with about 4 components and 10 causes of error: this table might be endless as well, and really hard to maintain.
Maybe I should have said earlier that for the majority of routeStepOfError values there is a simple matching from routeStepOfError to routeStepReport, regardless of productIds, components and causes... but if some matchings depend on all 4 criteria, I have to fill the table above completely, don't I?
Maybe there's an easier way to achieve this, but I can't see it yet.
So I would be really grateful for any hint you could give me for solving this problem (I cannot change the way of matching itself; they still want to have "their" well-known figures :-) ).
Thanks a lot in advance!
Regards
You might use two tables, tblRouteStepErrorMatch and tblRouteStepErrorException.
tblRouteStepErrorMatch
    routeStepOfError
    routeStepReport
tblRouteStepErrorException
    routeStepOfError
    productId
    componentOfProd
    causeOfError
    routeStepReport
Then in your code, check the Exception table first. If there's no match, go to the Match table.
ExcRecordset = SELECT * FROM tblRouteStepErrorException WHERE ...
If ExcRecordset.BOF And ExcRecordset.EOF Then 'No match in exception table
    MatchRecordset = SELECT * FROM tblRouteStepErrorMatch WHERE ... 'fall back to the match table
    get result from MatchRecordset
Else
    get result from ExcRecordset
End If
Now your exceptions are a lot easier to maintain because there are far fewer of them and the match table becomes the fallback for when a special case isn't found.
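To make the lookup order concrete, here is a minimal sketch in Python with an in-memory SQLite database standing in for your real database. The table and column names follow the ones above; the sample rows are taken from the Select Case example in the question, and the function name route_step_report is just something I made up for illustration.

```python
import sqlite3


def build_db():
    # In-memory database with one default mapping and one exception row.
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE tblRouteStepErrorMatch (
            routeStepOfError TEXT PRIMARY KEY,
            routeStepReport  TEXT NOT NULL
        );
        CREATE TABLE tblRouteStepErrorException (
            routeStepOfError TEXT,
            productId        INTEGER,
            componentOfProd  TEXT,
            causeOfError     TEXT,
            routeStepReport  TEXT NOT NULL
        );
        INSERT INTO tblRouteStepErrorMatch VALUES ('EOL', 'end of line');
        INSERT INTO tblRouteStepErrorException
            VALUES ('EOL', 1111, 'be1', 'A1', 'final optical test');
    """)
    return conn


def route_step_report(conn, route_step, product_id, component, cause):
    # 1. Look for a special case that matches all four criteria.
    row = conn.execute(
        "SELECT routeStepReport FROM tblRouteStepErrorException "
        "WHERE routeStepOfError = ? AND productId = ? "
        "AND componentOfProd = ? AND causeOfError = ?",
        (route_step, product_id, component, cause),
    ).fetchone()
    if row is None:
        # 2. No exception found: fall back to the simple one-to-one mapping.
        row = conn.execute(
            "SELECT routeStepReport FROM tblRouteStepErrorMatch "
            "WHERE routeStepOfError = ?",
            (route_step,),
        ).fetchone()
    # None means the route step is not known in either table.
    return row[0] if row else None
```

With this layout only the genuinely special combinations need a row in the exception table; everything else is covered by one row per route step.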
I would really appreciate a bit of help/pointers on the following problem.
Background Info:
Database version: Oracle 9i
Java version: 1.4.2
The problem
I have a database table with multiple columns representing various meta data about a document.
E.g.:
CREATE TABLE mytable
(
    document_id integer,
    filename varchar(255),
    added_date date,
    created_by varchar(32),
    ....
)
Due to networking/latency issues between a webserver and database server, I would like to minimise the number of queries made to the database.
The documents are listed in a web page, but there are thousands of different documents.
To aid navigation, we provide filters on the web page to select just documents matching a certain value - e.g. created by user 'joe bloggs' or created on '01-01-2011'. Also, paging is provided so triggering a db call to get the next 50 docs or whatever.
The web pages themselves are kept pretty dumb - they just present what's returned by a java servlet. Currently, these filters are each provided with their distinct values through separate queries for distinct values on each column.
This is taking quite a long time due to networking latency and the fact it means 5 extra queries.
My Question
I would like to know if there is a way to get this same information in just one query?
For example, is there a way to get distinct results from that table in a form like:
DistinctValue Type
01-01-2011 added_date
01-02-2011 added_date
01-03-2011 added_date
Joe Bloggs created_by
AN Other created_by
.... ...
I'm guessing one issue with the above is that the datatypes are different across the columns, so dates and varchars could not both be returned in a "DistinctValue" column.
Is there a better/standard approach to this problem?
Many thanks in advance.
Jay
Edit
As I mentioned in a comment below, I thought of a possibly more memory/load effective approach that removes the original requirement to join the queries up -
I imagine another way it could work is
instead of populating the drop-downs
initially, have them react to a user
typing and then have a "suggester"
style drop-down appear of just those
distinct values that match the entered
text. I think this would mean a)
keeping the separate queries for
distinct values, but b) only running
the queries individually as needed,
and c) reducing the resultset by
filtering the unique values on the
user's text.
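A query along these lines will return output in the form you describe:
SELECT filename AS DistinctValue, 'filename' AS Type
FROM mytable
UNION
SELECT TO_CHAR(added_date, 'DD-MM-YYYY'), 'added_date' FROM mytable
UNION
SELECT created_by, 'created_by' FROM mytable
UNION
....
UNION removes duplicate rows, so each value appears once per type as long as you leave the document id out of the select list. The date is converted with TO_CHAR because every branch of a UNION must return the same datatype in each column. If you have the privilege, you could create a view using this SQL and use it for your queries.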
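As a rough illustration of the one-round-trip idea, here is a small sketch in Python against an in-memory SQLite database (not Oracle, so it uses CAST where Oracle would need TO_CHAR). The sample rows and the function name distinct_filter_values are invented for the example; only the table and column names come from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE mytable (
        document_id INTEGER,
        filename    TEXT,
        added_date  TEXT,
        created_by  TEXT
    );
    INSERT INTO mytable VALUES
        (1, 'a.doc', '2011-01-01', 'Joe Bloggs'),
        (2, 'b.doc', '2011-01-01', 'AN Other'),
        (3, 'c.doc', '2011-01-02', 'Joe Bloggs');
""")


def distinct_filter_values(conn):
    # One query returns the distinct values of every filter column,
    # tagged with the column they came from. UNION drops duplicates.
    rows = conn.execute("""
        SELECT 'added_date' AS type, CAST(added_date AS TEXT) AS distinct_value
        FROM mytable
        UNION
        SELECT 'created_by', created_by FROM mytable
        ORDER BY 1, 2
    """).fetchall()
    # Regroup on the application side: one list of values per filter.
    values = {}
    for type_, value in rows:
        values.setdefault(type_, []).append(value)
    return values
```

The servlet would then fill each drop-down from its own entry in the returned dictionary, so all the filters cost a single query instead of one per column.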
Due to networking/latency issues
between a webserver and database
server, I would like to minimise the
number of queries made to the
database.
The documents are listed in a web
page, but there are thousands of
different documents.
You may want to look into Lucene. Whenever I see "minimise queries to db" combined with "searching documents", this is what I think of. I've used it with very good success, and it can be used in read-only or updating environments. Oracle's answer is Oracle Text, but (to me anyway) it's a bit of a bear to set up and use. It depends on your company's technical resources and strengths.
Anyway, sure beats the heck out of multiple queries to the db for each connection.
Browsing through the more dubious parts of the web, I happened to come across this particular SQL injection:
http://server/path/page.php?id=1+union+select+0,1,concat_ws(user(),0x3a,database(),0x3a,version()),3,4,5,6--
My knowledge of SQL - which I thought was half decent - seems very limited as I read this.
Since I develop extensively for the web, I was curious to see what this code actually does and more importantly how it works.
It turns an improperly built query like this, where the parameter is concatenated straight into the SQL:
$sql = '
SELECT *
FROM products
WHERE id = ' . $_GET['id'];
with this query:
SELECT *
FROM products
WHERE id = 1
UNION
SELECT 0,1,concat_ws(user(),0x3a,database(),0x3a,version()),3,4,5,6 --
which reveals the database name, the server version and the username of the connection.
The injection result relies on some assumptions about the underlying query syntax.
What is being assumed here is that there is a query somewhere in the code which will take the "id" parameter and substitute it directly into the query, without bothering to sanitize it.
It's assuming a naive query syntax of something like:
select * from records where id = {id param}
What this does is result in a substituted query (in your above example) of:
select * from records where id = 1 union select 0, 1 , concat_ws(user(),0x3a,database(),0x3a,version()), 3, 4, 5, 6 --
Now, what makes this useful is that it grabs not only the record the program was interested in, but also UNIONs it with a bogus row that tells the attacker (all packed into the third column):
the username with which we are connected to the database
the name of the database
the version of the db software
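You could get the same information by simply running:
select concat_ws(user(),0x3a,database(),0x3a,version())
directly at a SQL prompt. Note that CONCAT_WS() treats its first argument as the separator, so here user() ends up as the glue between ':', the database name, ':' and the version, and you'll get something like:
:joeproduction_dbjoe:joemysql v. whatever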
Additionally, since UNION does an implicit sort, and the first column in the bogus data set starts with a 0, chances are pretty good that your bogus result will be at the top of the list. This is important because the program is probably only using the first result, or there is an additional little bit of SQL in the basic expression I gave you above that limits the result set to one record.
The reason for the above noise (e.g. the select 0,1,...etc) is that, for this to work, the statement you are UNIONing must have the same number of columns as the first result set. As a consequence, the above injection attack only works if the corresponding record table has 7 columns. Otherwise you'll get an error and the attack won't give you what you want. The double dashes (--) are just to make sure that anything that comes afterwards in the substituted query is ignored, so I get the results I want. The 0x3a is just the hex code for a colon, used to mark off the values.
Now, what makes this query useful as an attack vector is that it is easily rewritten by hand if the table has more or fewer than 7 columns.
For example if the above query didn't work, and the table in question has 5 columns, after some experimentation I would hit upon the following query url to use as an injection vector:
http://server/path/page.php?id=1+union+select+0,1,concat_ws(user(),0x3a,database(),0x3a,version()),3,4--
The number of columns the attacker is guessing is probably based on an educated look at the page. For example if you're looking at a page listing all the Doodads in a store, and it looks like:
Name | Type | Manufacturer
Doodad Foo Shiny Shiny Co.
Doodad Bar Flat Simple Doodads, Inc.
It's a pretty good guess that the table you're looking at has 4 columns (remember there's most likely a primary key hiding somewhere if we're searching by an 'id' parameter).
Sorry for the wall of text, but hopefully that answers your question.
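If you want to see the mechanics without pointing anything at a real site, here is a small sketch in Python using an in-memory SQLite database instead of PHP/MySQL. The products table, its three columns and the function names are made up for the demo, and sqlite_version() stands in for the MySQL functions used in the original URL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE products (id INTEGER, name TEXT, price REAL);
    INSERT INTO products VALUES (1, 'Doodad Foo', 9.99);
""")


def find_unsafe(conn, id_param):
    # VULNERABLE: the request parameter is pasted straight into the SQL text,
    # so anything after the number is executed as part of the query.
    return conn.execute(
        "SELECT * FROM products WHERE id = " + id_param
    ).fetchall()


def find_safe(conn, id_param):
    # The parameter is bound as a value and can never change the query's structure.
    return conn.execute(
        "SELECT * FROM products WHERE id = ?", (id_param,)
    ).fetchall()


# Attacker-supplied "id": three columns to match the three columns of products.
payload = "1 UNION SELECT 0, sqlite_version(), 2 --"
```

Calling find_unsafe(conn, payload) returns the real product row plus a bogus row carrying the engine version, while find_safe(conn, payload) returns nothing because the whole payload is compared as a single value. A payload with the wrong number of columns makes the unsafe version raise an error, which is exactly the trial-and-error signal described above.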
This code adds an additional UNION query to the SELECT statement that is being executed on page.php. The injector has determined that the original query has 7 fields, hence the numeric filler values (column counts must match in a UNION). The concat_ws call packs the database user, the database name and the version into a single field, along with colons (0x3a).
It seems to retrieve the user used to connect to the database, the name of the current database, and the server version. The result then ends up in the page output or in the error message.
I am wondering how others would handle a scenario like this:
Say I have multiple choices for a user to choose from.
Like, Color, Size, Make, Model, etc.
What is the best solution or practice for building your query in this scenario?
For example, what if they select 6 of the 8 possible colors, 4 of the 7 possible makes, and 8 of the 12 possible brands?
You could do dynamic OR statements or dynamic IN Statements, but I am trying to figure out if there is a better solution for handling this "WHERE" criteria type logic?
EDIT:
I am getting some really good feedback (thanks everyone)...one other thing to note is that some of the selections could even be like (40 of the selections out of the possible 46) so kind of large. Thanks again!
Thanks,
S
What I would suggest is creating a function that takes a delimited list of makeIds, colorIds, etc. (probably ints, or whatever your key type is) and splits it into a table for you.
Your SP will take in a list of makes, colors, etc as you've said above.
YourSP '1,4,7,11', '1,6,7', '6'....
Inside your SP you'll call your splitting function, which will return a table:
SELECT * FROM
Cars C
JOIN YourFunction(@models) YF ON YF.Id = C.ModelId
JOIN YourFunction(@colors) YF2 ON YF2.Id = C.ColorId
Then, if they select nothing they get nothing. If they select everything, they'll get everything.
What is the best solution or practice for handling the build of your query for this scenario?
Dynamic SQL.
A single parameter represents two states: NULL/non-existent, or having a value. Each additional parameter doubles the number of possible combinations: 2 parameters yield 4, 3 yield 8, and so on. A single, non-dynamic query can cover all the possibilities, but it will perform horribly because of:
ORs
overall non-sargability
and inability to reuse the query plan
...when compared to a dynamic SQL query that constructs the query out of only the absolutely necessary parts.
The query plan is cached in SQL Server 2005+, if you use the sp_executesql command - it is not if you only use EXEC.
I highly recommend reading The Curse and Blessing of Dynamic SQL.
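To show what "only the absolutely necessary parts" can look like, here is a minimal sketch in Python rather than T-SQL. The items table, the column whitelist and the function name build_query are assumptions for the example; the point is that the SQL text is assembled from fixed fragments while every user-selected value is passed as a bound parameter.

```python
# Hypothetical whitelist: only these column names may ever reach the SQL text.
ALLOWED_COLUMNS = {"color", "size", "make"}


def build_query(filters):
    # filters maps a column name to the list of values the user ticked.
    clauses, params = [], []
    for column, values in filters.items():
        if column not in ALLOWED_COLUMNS:
            raise ValueError("unexpected filter column: " + column)
        if not values:
            # Nothing selected for this criterion: leave it out of the WHERE clause.
            continue
        placeholders = ", ".join("?" for _ in values)
        clauses.append(column + " IN (" + placeholders + ")")
        params.extend(values)
    sql = "SELECT id FROM items"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    return sql + " ORDER BY id", params
```

In SQL Server the equivalent would be building the string inside the procedure and running it through sp_executesql with parameters, so the plan can be reused and the selections are never concatenated into the statement as literals.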
For something this complex, you may want a session table that you update when the user selects their criteria. Then you can join the session table to your items table.
This solution may not scale well to thousands of users, so be careful.
If you want to create dynamic SQL it won't matter if you use the OR approach or the IN approach. SQL Server will process the statements the same way (maybe with little variation in some situations.)
You may also consider using temp tables for this scenario. You can insert the selections for each criterion into temp tables (e.g., #tmpColor, #tmpSize, #tmpMake, etc.). Then you can create a non-dynamic SELECT statement. Something like the following may work:
SELECT <column list>
FROM MyTable
WHERE MyTable.ColorID in (SELECT ColorID FROM #tmpColor)
OR MyTable.SizeID in (SELECT SizeID FROM #tmpSize)
OR MyTable.MakeID in (SELECT MakeID FROM #tmpMake)
The dynamic OR/IN and the temp table solutions work fine if each condition is independent of the other conditions. In other words, if you need to select rows where ((Color is Red and Size is Medium) or (Color is Green and Size is Large)) you'll need to try other solutions.
I'd like to use MySQL in this form:
SELECT 1 AS one, one*2 AS two
because it's shorter and sweeter than
SELECT one*2 AS two FROM ( SELECT 1 AS one ) AS sub1
but the former doesn't seem to work because it expects one to be a column.
Is there any easier way to accomplish this effect without subqueries?
And no, SELECT 2 AS two is not an option. ;)
Considering this SQL code
SELECT 1 AS one, one*2 AS two
from the perspective of SQL the language (and why not; MySQL has a good track record of compliance with the ISO/ANSI SQL Standards), your one is not a variable; rather, it is a column correlation name. You cannot use the correlation name in the SELECT clause within the same scope, hence the error.
FWIW your 'shorter and sweeter' syntax does actually work when using the MS Access Database Engine -- is that where you learned it, perchance? Sadly, the Access Database Engine has a poor track record of compliance with the Standards. It is said to take a long time to un-learn Access-speak and learn SQL code ;)
You can do this with MySQL user-defined variables:
select @one := 1 as one, 2 * @one as two;