Are "charlist" style wildcards part of the SQL-92 standard?

The SQL wildcards "%" and "_" are well documented and widely known. However, as w3schools explains, there are also "charlist" style wildcards for matching a single character inside or outside a given range. For example, to find all the people called Carl but not those called Earl:
select * from Person where FirstName like '[A-D]arl'
... or to find the opposite, use either:
select * from Person where FirstName like '[!A-D]arl'
or (depending on the RDBMS, presumably):
select * from Person where FirstName like '[^A-D]arl'
Is this type of wildcard part of the SQL-92 standard, and what databases actually support it? For example:
Oracle 11g doesn't support it
SQL Server 2005 supports it, with the negation operator being "^" (not "!")

The SQL-99 Standard has a SIMILAR TO predicate which uses "charlist" style as well as the "%" and "_" wildcard characters.
Nothing similar (no pun intended) in the SQL-92 Standard, though.

The "charlist" operators look like regular expressions, or a limited subset of them. AFAIK there's no regular expression syntax specified in SQL-92, although many databases support regexes, and how they support them varies. Oracle, for example, has functions to do regular expression comparisons and substitutions. I don't know how others do it.
Share and enjoy.
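As one more data point on how much support varies, here is a small sketch using Python's built-in sqlite3 module. SQLite is not one of the databases mentioned above, so this only shows what that one engine does: its LIKE treats the brackets as literal characters, while its non-standard GLOB operator understands ranges and uses ^ for negation. The Person rows are made up for the demo.

```python
import sqlite3

# In-memory database with a few made-up names.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Person (FirstName TEXT)")
conn.executemany(
    "INSERT INTO Person (FirstName) VALUES (?)",
    [("Carl",), ("Darl",), ("Earl",)],
)


def first_names(condition):
    """Return the sorted FirstName values matching a WHERE condition."""
    sql = f"SELECT FirstName FROM Person WHERE {condition} ORDER BY FirstName"
    return [row[0] for row in conn.execute(sql)]


# LIKE in SQLite has no charlist support: the brackets are literal text.
like_hits = first_names("FirstName LIKE '[A-D]arl'")

# GLOB (SQLite-specific, case-sensitive) does understand ranges.
glob_hits = first_names("FirstName GLOB '[A-D]arl'")

# GLOB negates a range with ^.
glob_negated = first_names("FirstName GLOB '[^A-D]arl'")

print(like_hits, glob_hits, glob_negated)
```

On other engines the same patterns may behave differently, which is the point of the question.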

Related

Difference between % vs * in string comparison in Hive

When trying to list all table names in a database that follow a specific name format, the following query works fine:
show tables like '*case*';
while the following does not
show tables like '%case%';
On the other hand, when comparing the actual data inside string columns, it's the other way around.
Working query:
select column from database.table where column like '%ABC%' limit 5;
Not working query:
select column from database.table where column like '*ABC*' limit 5;
What's the difference between the 2 operators * and % ?
This is the difference between regular expressions and like patterns.
LIKE is built into the SQL language. It has two wildcards:
% represents any number of characters including zero.
_ represents exactly one character.
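To make the two LIKE wildcards concrete, here is a minimal sketch in Python that translates a LIKE pattern into an equivalent regular expression. The helper names like_to_regex and like are made up for this illustration, and the sketch ignores ESCAPE clauses and collation, so it is not how Hive or any other engine actually implements LIKE.

```python
import re


def like_to_regex(pattern):
    """Translate a SQL LIKE pattern into a compiled regex (hypothetical helper)."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")   # % -> any number of characters, including zero
        elif ch == "_":
            parts.append(".")    # _ -> exactly one character
        else:
            parts.append(re.escape(ch))  # everything else is literal, * included
    return re.compile("".join(parts), re.DOTALL)


def like(value, pattern):
    """Return True if the whole value matches the LIKE pattern."""
    return like_to_regex(pattern).fullmatch(value) is not None


print(like("xxABCyy", "%ABC%"))  # the % form matches
print(like("xxABCyy", "*ABC*"))  # * is just a literal asterisk under LIKE rules
```

Under these rules an asterisk has no special meaning, which is consistent with the '*ABC*' query above finding nothing.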
Regular expressions are much more flexible for matching almost any pattern in a string.
When SQL was invented, I don't think regular expressions were in common use in computer systems. At the very least, the folks at IBM who worked on relational databases may not have been familiar with the folks at AT&T who were inventing Unix.
Regular expressions are much more powerful than LIKE patterns, of course. And Hive supports them via the RLIKE operator (and some other functions).
The SHOW functionality is not standard SQL. So, the developers of Hive chose the more flexible method for pattern matching.
HiveQL attempts to mimic SQL, but it does not strictly follow the standard.
Which wildcards apply depends not on the LIKE clause but on the statement it appears in. SHOW statements interpret the pattern using Java regular expression conventions, whereas in SELECT statements Hive sticks with SQL's LIKE wildcards.

Why is the use of 'from' behind the table name allowed? What does it do?

SQL for DB2 is pretty strict, that's why I was surprised this query succeeded:
select 1 from sysibm.sysdummy1 from
Is it exactly the same as this?
select 1 from sysibm.sysdummy1
If the double from is allowed, why isn't a double where/select/order by/having allowed? Is there any difference in the output when running this query on a 'real' table?
Db2 (for Linux, Unix, Windows) provides a list of reserved schemas and words. As stated in the docs, the list is not enforced by Db2, but the recommendation is to not use them for portability reasons.
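A trailing from succeeds but a trailing where does not because of what the grammar allows at that position. Since Db2 does not enforce the reserved-word list, the trailing from can be read as an ordinary identifier there (presumably as a correlation name, i.e. an alias, for the table). An optional WHERE clause can also follow at that point, so a trailing where starts a WHERE clause that is never completed, and that violates the grammar rules. Thus, the recommendation is to respect the list of reserved words and not use them. You may (freedom of expression... ;-) ), but you should be considerate...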

What are pros and cons of using special characters in SQL identifiers?

Should I avoid special characters like "é á ç" in SQL table names and column names?
What are the pros and cons of using special characters?
As you can guess, there are pros and cons. This is more or less a subjective question.
SQL (unlike most programming languages) allows you to use special characters, whitespace, punctuation, or reserved words in your table or column identifiers.
It's pretty nice that people have the choice to use appropriate characters for their native language.
Especially in cases where a word changes its meaning significantly when spelled with the closest ASCII characters: e.g. año vs. ano.
But the downside is that if you do this, you have to use "delimited identifiers" every time you reference the table with special characters. In standard SQL, delimited identifiers use double-quotes.
SELECT * FROM "SELECT"
This is actually okay! If you want to use an SQL reserved word as a table name, you can do it. But it might cause some confusion for some readers of the code.
Likewise if you use special non-ASCII characters, it might make it hard for English-speaking programmers to maintain the code, because they are not familiar with the key sequence to type those special characters. Or they might forget that they have to delimit the table names.
SELECT * FROM "año"
Then there are non-standard delimited identifiers. Microsoft uses square brackets by default:
SELECT * FROM [año]
And MySQL uses back-ticks by default:
SELECT * FROM `año`
Though both can use the standard double-quotes as identifier delimiters if you enable certain options, you can't always rely on that, and if the option gets disabled, your code will stop working. So users of Microsoft and MySQL are kind of stuck using the non-standard delimiters, unfortunately.
Maintaining the code is simpler in some ways if you can stick with ASCII characters. But there are legitimate reasons to want to use special characters too.
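For what it's worth, the double-quoted forms above can be tried out with Python's built-in sqlite3 module. This is only a sketch against SQLite, not against the Microsoft or MySQL products discussed above, and the column and its value are invented for the demo.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# A reserved word as the table name and a non-ASCII column name,
# both written as standard double-quoted delimited identifiers.
conn.execute('CREATE TABLE "SELECT" ("año" INTEGER)')
conn.execute('INSERT INTO "SELECT" ("año") VALUES (2011)')
quoted_rows = conn.execute('SELECT "año" FROM "SELECT"').fetchall()

# Forgetting the delimiters around the reserved word is a syntax error.
try:
    conn.execute("SELECT * FROM SELECT")
    unquoted_failed = False
except sqlite3.OperationalError:
    unquoted_failed = True

print(quoted_rows, unquoted_failed)
```

The second half shows the maintenance hazard mentioned above: drop the delimiters once and the statement no longer parses.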

What does the SQL Standard say about usage of backtick(`)?

I once spent hours debugging a simple SQL query using mysql_query() in PHP/MySQL, only to realise that I had missed the backtick around the table name. Since then I have always used it around table names.
But when I used the same in SQLite/C++, the symbol was not even recognized. It's confusing whether to use it or not. What does the standard say about its usage?
Also, it would be helpful if anyone could tell me when to use quotes and when not. I mean around values and field names.
The SQL standard (current version is ISO/IEC 9075:2011, in multiple parts) says nothing about the 'back-tick' or 'back-quote' symbol (Unicode U+0060 or GRAVE ACCENT); it doesn't recognize it as a character with special meaning that can appear in SQL.
The Standard SQL mechanism for quoting identifiers is with delimited identifiers enclosed in double quotes:
SELECT "select" FROM "from" WHERE "where" = "group by";
In MySQL, that might be written:
SELECT `select` FROM `from` WHERE `where` = `group by`;
In MS SQL Server, that might be written:
SELECT [select] FROM [from] WHERE [where] = [group by];
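If you generate SQL from application code, the three notations can be wrapped in a small helper. The following Python sketch is only an illustration: the function name quote_identifier and the style labels "standard", "mysql" and "mssql" are made up here, and it assumes the usual escaping convention of doubling an embedded closing delimiter.

```python
def quote_identifier(name, style="standard"):
    """Wrap an identifier in the delimiters of the given (hypothetical) style."""
    if style == "standard":
        # Standard SQL: double quotes, embedded double quote doubled.
        return '"' + name.replace('"', '""') + '"'
    if style == "mysql":
        # MySQL default: backticks, embedded backtick doubled.
        return "`" + name.replace("`", "``") + "`"
    if style == "mssql":
        # MS SQL Server default: square brackets, embedded ] doubled.
        return "[" + name.replace("]", "]]") + "]"
    raise ValueError(f"unknown quoting style: {style!r}")


for style in ("standard", "mysql", "mssql"):
    print(quote_identifier("group by", style))
```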
The trouble with the SQL Standard notation is that C programmers are used to enclosing strings in double quotes, so most DBMS use double quotes as an alternative to the single quotes recognized by the standard. But that then leaves you with a problem when you want to enclose identifiers.
Microsoft took one approach; MySQL took another; Informix allows interchangeable use of single and double quotes, but if you want delimited identifiers, you set an environment variable and then you have to follow the standard (single quotes for strings, double quotes for identifiers); DB2 only follows the standard, AFAIK; SQLite appears to follow the standard; Oracle also appears to follow the standard; Sybase appears to allow either double quotes (standard) or square brackets (as with MS SQL Server, which means SQL Server might allow double quotes too). This page (link AWOL since 2013; now available in The Wayback Machine) documented all these servers (and was helpful filling out the gaps in my knowledge) and noted whether the strings inside delimited identifiers are case-sensitive or not.
As to when to use a quoting mechanism around identifiers, my attitude is 'never'. Well, not quite never, but only when absolutely forced into doing so.
Note that delimited identifiers are case-sensitive; that is, "from" and "FROM" refer to different columns (in most DBMS — see URL above). Most of SQL is not case-sensitive; it is a nuisance to know which case to use. (The SQL Standard has a mainframe orientation — it expects names to be converted to upper-case; most DBMS convert names to lower-case, though.)
In general, you must delimit identifiers which are keywords to the version of SQL you are using. That means most of the keywords in Standard SQL, plus any extras that are part of the particular implementation(s) that you are using.
One continuing source of trouble is when you upgrade the server, where a column name that was not a keyword in release N becomes a keyword in release N+1. Existing SQL that worked before the upgrade stops working afterwards. Then, at least as a short-term measure, you may be forced into quoting the name. But in the ordinary course of events, you should aim to avoid needing to quote identifiers.
Of course, my attitude is coloured by the fact that Informix (which is what I work with mostly) accepts this SQL verbatim, whereas most DBMS would choke on it:
CREATE TABLE TABLE
(
DATE INTEGER NOT NULL,
NULL FLOAT NOT NULL,
FLOAT INTEGER NOT NULL,
NOT DATE NOT NULL,
INTEGER FLOAT NOT NULL
);
Of course, the person who produces such a ridiculous table for anything other than demonstration purposes should be hung, drawn, quartered and then the residue should be made to fix the mess they've created. But, within some limits which customers routinely manage to hit, keywords can be used as identifiers in many contexts. That is, of itself, a useful form of future-proofing. If a word becomes a keyword, there's a moderate chance that the existing code will continue to work unaffected by the change. However, the mechanism is not perfect; you can't create a table with a column called PRIMARY, but you can alter a table to add such a column. There is a reason for the idiosyncrasy, but it is hard to explain.
Trailing underscore
You said:
it would be helpful if anyone could tell me when to use quotes and when not
Years ago I surveyed several relational database products looking for commands, keywords, and reserved words. Shockingly, I found over a thousand distinct words.
Many of them were surprisingly counter-intuitive as a "database word". So I feared there was no simple way to avoid unintentional collisions with reserved words while naming my tables, columns, and such.
Then I found this tip somewhere on the internet:
Use a trailing underscore in all your SQL naming.
Turns out the SQL specification makes an explicit promise to never use a trailing underscore in any SQL-related names.
Because the SQL spec is copyright-protected, I cannot quote it directly. But section 5.2.11 <token> and <separator> from a supposed draft of ISO/IEC 9075:1992, Database Language SQL (SQL-92) says (in my own re-wording):
In the current and future versions of the SQL spec, no keyword will end with an underscore
➥ Though oddly dropped into the SQL spec without discussion, that simple statement to me screams out “Name your stuff with a trailing underscore to avoid all naming collisions”.
Instead of:
person
name
address
…use:
person_
name_
address_
Since adopting this practice, I have found a nice side-effect. In our apps we generally have classes and variables with the same names as the database objects (tables, columns, etc.). So an inherent ambiguity arises as to when referring to the database object versus when referring to the app state (classes, vars). Now the context is clear: When seeing a trailing underscore on a name, the database is specifically indicated. No underscore means the app programming (Java, etc.).
Further tip on SQL naming: For maximum portability, use all-lowercase with underscore between words, as well as the trailing underscore. While the SQL spec requires (not suggests) an implementation to store identifiers in all uppercase while accepting other casing, most/all products ignore this requirement. So after much reading and experimenting, I learned the all-lowercase with underscores will be most portable.
If you use all-lowercase names, underscores between words, plus a trailing underscore, you may never need to care about quoting identifiers with double quotes, back-ticks, or brackets.
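Since the app-side names and database names differ only by convention, the mapping can be automated. Here is a rough Python sketch; the function to_db_name is invented for this illustration and only handles simple camel-case names.

```python
import re


def to_db_name(app_name):
    """Map an app-side name to a lowercase, trailing-underscore database name."""
    # Insert an underscore wherever a lowercase letter or digit meets an uppercase letter.
    snake = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", app_name).lower()
    # Ensure exactly one trailing underscore, even if one is already present.
    return snake.rstrip("_") + "_"


for name in ("person", "FirstName", "address_"):
    print(name, "->", to_db_name(name))
```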

Does the SQL standard allow whitespace between function names and parentheses?

Checking a few RDBMSs, I find that things like
SELECT COUNT (a), SUM (b)
FROM TABLE
are allowed (notice the space between the aggregate functions and the parentheses).
Could anyone provide a pointer to SQL standard itself where this is defined (any version will do)?
EDIT:
The above works in Postgres; MySQL needs set sql_mode = "IGNORE_SPACE"; as defined here (for the full list of functions affected by this server mode, see this ref).
MS SQL is reported to accept the above.
Also, it seems that the answer is most likely in the standard. I can follow the BNF regarding the regular symbols and terms, but I get lost when it comes to the definition of whitespace and separators in that part of the select.
Yes; the white space between tokens is substantially ignored. The only exception is, officially, with adjacent string literal concatenation - but the standard is weirder than any implementation would be.
See: http://savage.net.au/SQL/
This works in SQL Server 2005:
SELECT COUNT (*)
FROM TABLE
...while one space between COUNT and (*) on MySQL causes a MySQL 1064 error (syntax error). I don't have Oracle or Postgres handy to test.
Whatever the standard may say, the behavior depends on the vendor's implementation and the version you are using.
I can't provide a pointer, but I believe that white space like that is ignored.
I know that it is in T-SQL, and I'm about 80% certain about MySQL's implementation.
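Since behavior differs by product, the quickest check is to run the query against the engine you care about. As a sketch, here is that check against SQLite using Python's built-in sqlite3 module. SQLite is not one of the products discussed above, and the table t and its rows are made up for the demo.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a INTEGER, b INTEGER)")
conn.executemany(
    "INSERT INTO t (a, b) VALUES (?, ?)",
    [(1, 10), (2, 20), (3, 30)],
)

# Same query with and without a space before the opening parenthesis.
with_space = conn.execute("SELECT COUNT (a), SUM (b) FROM t").fetchone()
without_space = conn.execute("SELECT COUNT(a), SUM(b) FROM t").fetchone()

print(with_space, without_space)
```

In SQLite both spellings parse and return the same row; that says nothing about what the standard requires, only what this one implementation accepts.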