Struggling with input validation/check constraint - sql

I'm trying to create a table with an email address column and want to make sure that only addresses in the correct format (contains "#") are allowed. I know how to use the LIKE operator in queries but not how to put a value constraint on a column.

You can add the check constraint like this For your basic example:
alter table t add contraint chk_email
(check (email like '%#%') );
Of course, that is really only a betting. Perhaps something more like this:
alter table t add contraint chk_email
(check (email like '%_#[^.]%' and -- one # and something before and after
email not like '%#%#%' and -- not more than one #
email not like '%[^-.a-zA-Z0-9_]%' -- valid characters)
);
This still will allow invalid emails, but it is at least closer.

This would be better to do in a programming language that supports regular expressions or already has a built in isValidEmail method.
Validating an email address using a regular expression or a simple like pattern is pretty darn hard to get right, if not nearly impossible.
You can read more about it on I Knew How To Validate An Email Address Until I Read The RFC - And to quote the part I think illustrates the problem best:
These are all valid email addresses!
Abc#def#example.com
Fred\ Bloggs#example.com
Joe.\Blow#example.com
"Abc#def"#example.com
"Fred Bloggs"#example.com
customer/department=shipping#example.com
$A12345#example.com
!def!xyz%abc#example.com
_somename#example.com
Attempting to get all the logic needed to validate such a wide range of possibilities in a T-SQL statement is like attempting to climb the Everest blind-folded with your hands tied behind your back.
You could have a simple validation that might return a lot of false-positives or false-negatives using a simple like pattern like the one suggested in How to Validate Email Address in SQL Server? by Pinal Dave: '%_#__%.__%', but that's really just a naive attempt to clog a dam with a band-aid.
Having said all that, you might be able to use a CLR Scalar-Valued Function to validate your email addresses, using code like the one from this SO post.
Personally, I have no experience with CLR functions, so it would probably be irresponsible of me to try and write you a code example (especially since I don't really have a test environment to check it before I post this answer), but I hope what I've written so far was helpful enough, and with the help of the links in this answer and some web searches you will be able to solve the problem.

Related

Custom, user-definable "wildcard" constants in SQL database search -- possible?

My client is making database searches using a django webapp that I've written. The query sends a regex search to the database and outputs the results.
Because the regex searches can be pretty long and unintuitive, the client has asked for certain custom "wildcards" to be created for the regex searches. For example.
Ω := [^aeiou] (all non-vowels)
etc.
This could be achieved with a simple permanent string substitution in the query, something like
query = query.replace("Ω", "[^aeiou]")
for all the elements in the substitution list. This seems like it should be safe, but I'm not really sure.
He has also asked that it be possible for the user to define custom wildcards for their searches on the fly. So that there would be some other input box where a user could define
∫ := some other regex
And to store them you might create a model
class RegexWildcard(models.Model):
symbol = ...
replacement = ...
I'm personally a bit wary of this, because it does not seem to add a whole lot of functionality, but does seem to add a lot of complexity and potential problems to the code. Clients can now write their queries to a db. Can they overwrite each other's symbols?
That I haven't seen this done anywhere before also makes me kind of wary of the idea.
Is this possible? Desirable? A great idea? A terrible idea? Resources and any guidance appreciated.
Well, you're getting paid by the hour....
I don't see how involving the Greek alphabet is to anyone's advantage. If the queries are stored anywhere, everyone approaching the system would have to learn the new syntax to understand them. Plus, there's the problem of how to type the special symbols.
If the client creates complex regular expressions they'd like to be able to reuse, that's understandable. Your application could maintain a list of such expressions that the user could add to and choose from. Notionally, the user would "click on" an expression, and it would be inserted into the query.
The saved expressions could have user-defined names, to make them easier to remember and refer to. And you could define a syntax that referenced them, something otherwise invalid in SQL, such as ::name. Before submitting the query to the DBMS, you substitute the regex for the name.
You still have the problem of choosing good names, and training.
To prevent malformed SQL, I imagine you'll want to ensure the regex is valid. You wouldn't want your system to store a ; drop table CUSTOMERS; as a "regular expression"! You'll either have to validate the expression or, if you can, treat the regex as data in a parameterized query.
The real question to me, though, is why you're in the vicinity of standardized regex queries. That need suggests a database design issue: it suggests the column being queried is composed of composite data, and should be represented as multiple columns that can be queried directly, without using regular expressions.

Remove all rows from a table where an invalid "URL" has been entered

A somewhat odd Postgresql question for our highly specific use case. We have a table which accepts URLs as a part of a comment input from our users. This is on a highly trafficked site. We had some PHP code that was validating that users only entered correctly-formed URLs, if they included one in their comment (usually comment text does not include any URLs).
However, sadly, our PHP is old on an old server. So at some point the ereg logic we had became dysfunctional. Which means miscreant users have had a field day entering comments with badly formed URLs like the following:
l%20are%20generally%20included%20almost%20anyplace--even%20if%20your%20"yard"%20is%20bound%20to%20an%20outdoor%20patio%20or%20balcony.Adding%20water%20to%20your%20patio%20could%20be%20as%20simple%20as%20aiming%20a%20low%20dish%20of%20water%20designed%20for%20use%20in%20the%20form%20of%20birdbath.Any%20cursory%20container%20around%206%20in%20.wide%20and%20a%20half-inch%20deep%20will%20attempt%20to%20work.Pie%20pans,%20garbage%20can%20lids,%20or%20flo
Note that it's not a URL at all. Hence, our question: is there a Postgresql-only way, perhaps through some PL/SQL function or some stored function or something, that we can use to delete all these rubbish records from our database? We'd ideally not want to use a PHP program that went through the entire database and checked it against the valid URL pattern.
We'd like to execute this within PG itself. We can take the database offline to perform this task for as long as it takes.
Thank you!
SELECT * FROM table WHERE url_column !~* '(https?|ftp)://(-\.)?([^\s/?\.#-]+\.?)+(/[^\s]*)?'
Try this query, validate the output en then you could create a DELETE query with this example.

Replace Character in String with SQL Server Table Trigger on Insert\Update

**Answered
I am attempting to create a trigger that will replace a character ’ (MS Word Smart Quote) with a proper apostrophe ' when new data is inserted or updated by a user from our website.
The special apostrophe may be found anywhere on a 5000 NVarchar column and may be found multiple times in the same string.
Any easy replace statement for this?
REPLACE(Column,'’','''')
I'm going to argue that you should probably look at doing this in your applications instead of from within SQL Server. That's NOT the answer you're looking for - but it would probably make more sense.
Typically, when I see questions like this I instantly worry about devs trying to 'defeat' SQL Injection. If that's the case, this approach will NEVER work - as per:
http://sqlmag.com/database-security/sql-injection-beyond-basics
That said, if you're not focused on that and just need to get rid of 'pesky' characters, then REPLACE() will work (and likely be your best option), but I'd still argue that you're probably better off tackling 'formatting' issues like this from within your applications. Or in other words, treat SQL Server as your data repository - something that stores your raw data. Then, if you need to make it 'pretty' or 'tweak' it for various outputs/displays, then do that on the way out to your users by means of your application(s).

First Name Variations in a Database

I am trying to determine what the best way is to find variations of a first name in a database. For example, I search for Bill Smith. I would like it return "Bill Smith", obviously, but I would also like it to return "William Smith", or "Billy Smith", or even "Willy Smith". My initial thought was to build a first name hierarchy, but I do not know where I could obtain such data, if it even exists.
Since users can search the directory, I thought this would be a key feature. For example, people I went to school with called me Joe, but I always go by Joseph now. So, I was looking at doing a phonetic search on the last name, either with NYSIIS or Double Metaphone and then searching on the first name using this name heirarchy. Is there a better way to do this - maybe some sort of graded relevance using a full text search on the full name instead of a two part search on the first and last name? Part of me thinks that if I stored a name as a single value instead of multiple values, it might facilitate more search options at the expense of being able to address a user by the first name.
As far as platform, I am using SQL Server 2005 - however, I don't have a problem shifting some of the matching into the code; for example, pre-seeding the phonetic keys for a user, since they wouldn't change.
Any thoughts or guidance would be appreciated. Countless searches have pretty much turned up empty. Thanks!
Edit: It seems that there are two very distinct camps on the functionality and I am definitely sitting in the middle right now. I could see the argument of a full-text search - most likely done with a lack of data normalization, and a multi-part approach that uses different criteria for different parts of the name.
The problem ultimately comes down to user intent. The Bill / William example is a good one, because it shows the mutation of a first name based upon the formality of the usage. I think that building a name hierarchy is the more accurate (and extensible) solution, but is going to be far more complex. The fuzzy search approach is easier to implement at the expense of accuracy. Is this a fair comparison?
Resolution: Upon doing some tests, I have determined to go with an approach where the initial registration will take a full name and I will split it out into multiple fields (forename, surname, middle, suffix, etc.). Since I am sure that it won't be perfect, I will allow the user to edit the "parts", including adding a maiden or alternate name. As far as searching goes, with either solution I am going to need to maintain what variations exists, either in a database table, or as a thesaurus. Neither have an advantage over the other in this case. I think it is going to come down to performance, and I will have to actually run some benchmarks to determine which is best. Thank you, everyone, for your input!
In my opinion you should either do a feature right and make it complete, or you should leave it off to avoid building a half-assed intelligence into a computer program that still gets it wrong most of the time ("Looks like you're writing a letter", anyone?).
In case of human names, a computer will get it wrong most of the time, doing it right and complete is impossible, IMHO. Maybe you can hack something that does the most common English names. But actually, the intelligence to look for both "Bill" and "William" is built into almost any English speaking person - I would leave it to them to connect the dots.
The term you are looking for is Hypocorism:
http://en.wikipedia.org/wiki/Hypocorism
And Wikipedia lists many of them. You could bang out some Python or Perl to scrape that page and put it in a db.
I would go with a structure like this:
create table given_names (
id int primary key,
name text not null unique
);
create table hypocorisms (
id int references given_names(id),
name text not null,
primary key (id, name)
);
insert into given_names values (1, 'William');
insert into hypocorisms values (1, 'Bill');
insert into hypocorisms values (1, 'Billy');
Then you could write a function/sproc to normalize a name:
normalize_given_name('Bill'); --returns William
One issue you will face is that different names can have the same hypocorism (Albert -> Al, Alan -> Al)
I think your basic approach is solid. I don't think fulltext is going to help you. For seeding, behindthename.com seems to have large amount of the data you want.
Are you using SQl Server 2005 Express with Advanced Services as to me it sounds you would benefit from the Full Text indexing and more specifically Contains and Containstable which you can use with specific instructions here is a link for the uses of Containstable:
http://msdn.microsoft.com/en-us/library/ms189760.aspx
and here is the download link for SQL Server 2005 With Advanced Services:
http://www.microsoft.com/downloads/details.aspx?familyid=4C6BA9FD-319A-4887-BC75-3B02B5E48A40&displaylang=en
Hope this helps,
Andrew
You can use the SQL Server Full Text Search and do an inflectional search.
Basically like:
SELECT ProductId, ProductName
FROM ProductModel
WHERE CONTAINS(CatalogDescription, ' FORMSOF(THESAURUS, metal) ')
Check out:
http://en.wikipedia.org/wiki/SQL_Server_Full_Text_Search#Inflectional_Searches
http://msdn.microsoft.com/en-us/library/ms345119.aspx
http://www.mssqltips.com/tip.asp?tip=1491
Not sure what your application is, but if your users know at the time of sign up that people from their past might be searching the database for them, you could offer them the chance in the user profile to define other names they might be known as (including last names, women change these all the time and makes finding them much harder!) and that they want people to be able to search on. Store these in a separate related table. Then search on that. Just make the structure such that you can define one name as the main name (the one you use for everything except the search.)
You'll find that you're dabbling in an area known as "Natural Language Processing" and you'll need to do several things, most of which can be found under the topic of stemming.
Simplistic stemming simply breaks the word apart, but more advanced algorithms associate words that mean the same thing - for instance Google might use stemming to convert "cat" and "kitten" to "feline" and search for all three, weighing the actual word provided by the user as slightly heavier so exact matches return before stemmed matches.
It's a known problem, and there are open source stemmers available.
-Adam
No, Full Text searches will not help to solve your problem.
I think you might want to take a look at some of the following links: (Funny, no one mentioned SoundEx till now)
SoundEx - MSDN
SoundEx - Google results
InformIT - Tolerant Search algorithms
Basically SoundEx allows you to evaluate the level of similarity in similar sounding words. The function is also available on SQL 2005.
As a side issue, instead of returning similar results, it might prove more intuitive to the user to use a AJAX based script to deliver similar sounding names before the user initiates his/her search. That way you can show the user "similar names" or "did you mean..." kind of data.
Here's an idea for automatically finding "name synonyms" like Bill/William. That problem has been studied in the broader context of synonyms in general: inducing them from statistics of which words commonly appear in the same contexts in a large text corpus like the Web. You could try combining that approach with a list of names like Moby Names; I don't know if it's been done before.
Here are some pointers.

Need Pattern for dynamic search of multiple sql tables

I'm looking for a pattern for performing a dynamic search on multiple tables.
I have no control over the legacy (and poorly designed) database table structure.
Consider a scenario similar to a resume search where a user may want to perform a search against any of the data in the resume and get back a list of resumes that match their search criteria. Any field can be searched at anytime and in combination with one or more other fields.
The actual sql query gets created dynamically depending on which fields are searched. Most solutions I've found involve complicated if blocks, but I can't help but think there must be a more elegant solution since this must be a solved problem by now.
Yeah, so I've started down the path of dynamically building the sql in code. Seems godawful. If I really try to support the requested ability to query any combination of any field in any table this is going to be one MASSIVE set of if statements. shiver
I believe I read that COALESCE only works if your data does not contain NULLs. Is that correct? If so, no go, since I have NULL values all over the place.
As far as I understand (and I'm also someone who has written against a horrible legacy database), there is no such thing as dynamic WHERE clauses. It has NOT been solved.
Personally, I prefer to generate my dynamic searches in code. Makes testing convenient. Note, when you create your sql queries in code, don't concatenate in user input. Use your #variables!
The only alternative is to use the COALESCE operator. Let's say you have the following table:
Users
-----------
Name nvarchar(20)
Nickname nvarchar(10)
and you want to search optionally for name or nickname. The following query will do this:
SELECT Name, Nickname
FROM Users
WHERE
Name = COALESCE(#name, Name) AND
Nickname = COALESCE(#nick, Nickname)
If you don't want to search for something, just pass in a null. For example, passing in "brian" for #name and null for #nick results in the following query being evaluated:
SELECT Name, Nickname
FROM Users
WHERE
Name = 'brian' AND
Nickname = Nickname
The coalesce operator turns the null into an identity evaluation, which is always true and doesn't affect the where clause.
Search and normalization can be at odds with each other. So probably first thing would be to get some kind of "view" that shows all the fields that can be searched as a single row with a single key getting you the resume. then you can throw something like Lucene in front of that to give you a full text index of those rows, the way that works is, you ask it for "x" in this view and it returns to you the key. Its a great solution and come recommended by joel himself on the podcast within the first 2 months IIRC.
What you need is something like SphinxSearch (for MySQL) or Apache Lucene.
As you said in your example lets imagine a Resume that will composed of several fields:
List item
Name,
Adreess,
Education (this could be a table on its own) or
Work experience (this could grow to its own table where each row represents a previous job)
So searching for a word in all those fields with WHERE rapidly becomes a very long query with several JOINS.
Instead you could change your framework of reference and think of the Whole resume as what it is a Single Document and you just want to search said document.
This is where tools like Sphinx Search do. They create a FULL TEXT index of your 'document' and then you can query sphinx and it will give you back where in the Database that record was found.
Really good search results.
Don't worry about this tools not being part of your RDBMS it will save you a lot of headaches to use the appropriate model "Documents" vs the incorrect one "TABLES" for this application.