Plain SQL vs Dialects

DBMS vendors use SQL dialect features to differentiate their products while at the same time claiming to support the SQL standards. 'Nuff said on this.
Is there any example of SQL you have coded that can't be translated to SQL:2008 standard SQL?
To be specific, I'm talking about DML (a query statement), NOT DDL, stored procedure syntax or anything that is not a pure SQL statement.
I'm also talking about queries you would use in Production, not for ad-hoc stuff.
Edit Jan 13
Thanks for all of your answers: they have conveyed to me an impression that a lot of the DBMS-specific SQL is created to allow work-arounds for poor relational design. Which leads me to the conclusion that you probably wouldn't want to port most existing applications.

Typical differences include subtly different semantics (for example, Oracle handles NULLs differently from other SQL dialects in some cases), different exception handling mechanisms, different types, and proprietary methods for doing things like string operations, date operations or hierarchical queries. Query hints also tend to have syntax that varies across platforms, and different optimisers may get confused by different types of constructs.
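To make that concrete, here is a small, hypothetical sketch (an orders table with an order_date column is assumed) of how the same intent is spelled differently per dialect:

-- Standard SQL (also accepted by PostgreSQL and Oracle): interval literal
SELECT order_date + INTERVAL '1' DAY FROM orders;

-- Oracle additionally allows plain numeric arithmetic on dates
SELECT order_date + 1 FROM orders;

-- SQL Server has no interval literal here; you use DATEADD instead
SELECT DATEADD(day, 1, order_date) FROM orders;

-- NULL semantics differ too: Oracle treats '' as NULL, so this concatenation
-- yields 'abc' on Oracle but NULL on engines that follow the standard
SELECT 'abc' || NULL FROM orders;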
One can use ANSI SQL for the most part across database systems and expect to get reasonable results on a database with no significant tuning issues like missing indexes. However, on any non-trivial application there is likely to be some requirement for code that cannot easily be done portably.
Typically, this requirement will be fairly localised within an application code base - a handful of queries where this causes an issue. Reporting is much more likely to throw up this type of issue and doing generic reporting queries that will work across database managers is very unlikely to work well. Some applications are more likely to cause grief than others.
Therefore, it is unlikely that relying on 'portable' SQL constructs for an application will work in the general case. A better strategy is to use generic statements where they will work and break out to a database specific layer where this does not work.
A generic query mechanism could be to use ANSI SQL where possible; another possible approach would be to use an O/R mapper, which can take drivers for various database platforms. This type of mechanism should suffice for the majority of database operations but will require you to do some platform-specific work where it runs out of steam.
You may be able to use stored procedures as an abstraction layer for more complex operations and code a set of platform specific sprocs for each target platform. The sprocs could be accessed through something like ADO.net.
In practice, subtle differences in parameter passing and exception handling may cause problems with this approach. A better approach is to produce a module that wraps the platform-specific database operations with a common interface. Different 'driver' modules can be swapped in and out depending on which DBMS platform you are using.

Oracle has some additions, such as the MODEL clause or hierarchical (CONNECT BY) queries, that are very difficult, if not impossible, to translate into pure standard SQL.
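For a rough illustration, assuming a hypothetical employees(employee_id, manager_id) table: the hierarchical case can usually be rewritten as a recursive common table expression, while the MODEL clause has no standard counterpart at all.

-- Oracle-specific hierarchical query
SELECT employee_id, manager_id, LEVEL
FROM employees
START WITH manager_id IS NULL
CONNECT BY PRIOR employee_id = manager_id;

-- Standard SQL rewrite as a recursive CTE
-- (Oracle spells this without the RECURSIVE keyword, itself a dialect quirk)
WITH RECURSIVE emp_tree (employee_id, manager_id, lvl) AS (
    SELECT employee_id, manager_id, 1
    FROM employees
    WHERE manager_id IS NULL
    UNION ALL
    SELECT e.employee_id, e.manager_id, t.lvl + 1
    FROM employees e
    JOIN emp_tree t ON e.manager_id = t.employee_id
)
SELECT employee_id, manager_id, lvl FROM emp_tree;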

Even when SQL:2008 can do something, sometimes the syntax is not the same. Take regular expression matching, for example: SQL:2008 uses LIKE_REGEX, whereas MySQL uses REGEXP.
And yes, I agree, it's very annoying.
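A minimal sketch of the difference, with a hypothetical products(name) table (note that few engines actually implement LIKE_REGEX as spelled in the standard):

-- SQL:2008 predicate (XQuery regular expressions)
SELECT name FROM products WHERE name LIKE_REGEX '^Gadget[0-9]+$';

-- MySQL
SELECT name FROM products WHERE name REGEXP '^Gadget[0-9]+$';

-- Oracle and PostgreSQL use yet other spellings: REGEXP_LIKE(name, '...') and name ~ '...'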

Part of the problem with Oracle is that it's still based on the SQL 1992 ANSI standard. SQL Server is on SQL 1999 standard, so some of the things that look like "extensions" are in fact newer standards. (I believe that the "OVER" clause is one of these.)
Oracle is also far more restrictive about placing subqueries in SQL. SQL Server is far more flexible and permissive about allowing subqueries almost anywhere.
SQL Server has a rational way to select the "top" row of a result: "SELECT TOP 1 * FROM CUSTOMERS ORDER BY SALES_TOTAL". In Oracle, this becomes "SELECT * FROM (SELECT * FROM CUSTOMERS ORDER BY SALES_TOTAL) WHERE ROWNUM <= 1".
And of course there's always Oracle's infamous SELECT (expression) FROM DUAL.
Edit to add:
Now that I'm at work and can access some of my examples, here's a good one. This is generated by LINQ-to-SQL, but it's a clean query to select rows 41 through 50 from a table, after sorting. It uses the "OVER" clause:
SELECT [t1].[CustomerID], [t1].[CompanyName], [t1].[ContactName], [t1].[ContactTitle], [t1].[Address], [t1].[City], [t1].[Region], [t1].[PostalCode], [t1].[Country], [t1].[Phone], [t1].[Fax]
FROM (
SELECT ROW_NUMBER() OVER (ORDER BY [t0].[ContactName]) AS [ROW_NUMBER], [t0].[CustomerID], [t0].[CompanyName], [t0].[ContactName], [t0].[ContactTitle], [t0].[Address], [t0].[City], [t0].[Region], [t0].[PostalCode], [t0].[Country], [t0].[Phone], [t0].[Fax]
FROM [dbo].[Customers] AS [t0]
) AS [t1]
WHERE [t1].[ROW_NUMBER] BETWEEN 40 + 1 AND 40 + 10
ORDER BY [t1].[ROW_NUMBER]

Common here on SO
ISNULL (SQL Server)
NVL (Oracle)
IFNULL (MySQL, DB2?)
COALESCE (ANSI)
To answer exactly:
ISNULL can easily give different results from COALESCE on SQL Server because of data type precedence, as per my answer/comments here
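A T-SQL sketch of the kind of surprise meant here (variable and values are made up):

DECLARE @v VARCHAR(3) = NULL;

-- ISNULL takes the data type of its first argument (VARCHAR(3)),
-- so the replacement value is silently truncated
SELECT ISNULL(@v, 'abcdef');    -- 'abc'

-- COALESCE applies data type precedence across all of its arguments
SELECT COALESCE(@v, 'abcdef');  -- 'abcdef'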

Related

How different are SQL dialects for basic queries?

New to SQL, so please excuse imprecision in the question.
For "normal" queries, is SQL syntax mutually intelligible between dialects? To take a concrete example, would SELECT * FROM [Pending Scans] be valid in all common dialects?
Not looking for an exhaustive list!
No. This would be:
select *
from pending_scans;
Square brackets are non-standard for escaping identifiers. The standard escape character for identifiers is the double quote, and most databases now support that.
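For illustration, the same query under the most common quoting conventions (behaviour can further depend on settings such as QUOTED_IDENTIFIER or ANSI_QUOTES):

-- Standard SQL (and PostgreSQL, Oracle, DB2): double quotes
SELECT * FROM "Pending Scans";

-- SQL Server: square brackets (double quotes also work with QUOTED_IDENTIFIER ON)
SELECT * FROM [Pending Scans];

-- MySQL in its default mode: backticks
SELECT * FROM `Pending Scans`;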
I should note that for all but the simplest queries -- such as the one you have written -- slight differences between dialects make it a fool's errand to try to write completely portable code. For instance, function names are different, such as len() versus length(), and date/time operations are quite bespoke.
If you are writing an application that needs to support multiple different database types, a typical method would be to define the API as a set of views and use the views with simple SELECT queries (as in your example).
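A minimal sketch of that idea, with made-up table and column names; each platform gets its own view definition, while the application sticks to the simple SELECT:

-- per-platform view hiding the dialect-specific details
CREATE VIEW pending_scans AS
SELECT scan_id, scheduled_at, status
FROM scan_queue
WHERE status = 'PENDING';

-- the application then only ever issues plain, portable queries
SELECT * FROM pending_scans;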

Adding conditions to database queries in an orthogonal way

The problem at hand is how to add conditions to SQL database queries issued by pre-existing applications in an "orthogonal way", meaning that this operation should be done in such a way that applications need not be concerned. In case you are curious, the actual problem involves adding multitenancy-related criteria that should keep tenants from getting at each other's data. In practical terms, that means adding extra conditions to WHERE clauses (in possibly multiple locations within the query).
My team has been working on a solution which involves "intercepting" and parsing the queries, before they are sent to the DBMS. As many of you probably know, SQL parsing is not the simplest thing to implement, especially when you need to support most of the major DBMS' syntax extensions.
Another idea which has been thrown around is that one could let the DBMS itself handle the criteria injection. It is, after all, the one element which has full knowledge over the query syntax and (hopefully) should have no problem tinkering with it before carrying out its execution. The problem then would be figuring out how to pass the multitenancy metadata to the DBMS. Is this viable at all? Is this a bad idea?
Any other ideas on how this problem might be approached?
Thank you for your time.
Would it not be easier to work through views, each view restricted to the applicable user's data? The pre-written SQL could use a base name for the view, which is then modified in code to add a prefix or suffix to the view name to give the user's view.
Example:
Table tenant_data has views named tenant_data_user1 and tenant_data_user2.
Your SQL is select col1, col2 from tenant_data_{view}
Your program code obtains the name of the current user (user1 or user2), and replaces {view} with their userid in the SQL, then executes the SQL.
Depending upon the DBMS (and language?) you are using, you could probably grant access so that user1 can only use the xxx_user1 views and so on, so there is no chance of them accessing the wrong data by misusing a view or by accessing the underlying table directly.
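A rough sketch of the grants, with made-up names and a tenant_id column assumed on the base table; the exact GRANT/REVOKE syntax varies a little per DBMS:

-- one view per tenant over the shared table
CREATE VIEW tenant_data_user1 AS
SELECT col1, col2 FROM tenant_data WHERE tenant_id = 1;

-- let each account use only its own view, never the base table
GRANT SELECT ON tenant_data_user1 TO user1;
REVOKE ALL ON tenant_data FROM user1;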
Using server-side technology
The ideal place to implement this functionality is on the server side, of course. Some databases implement features, such as Oracle's Virtual Private Database, that do exactly what you need.
You could also emulate the feature by replacing all your direct table access by (possibly updatable) views, which contain a filter on relevant columns using SYS_CONTEXT (in Oracle). Using this approach, database clients will never be able to bypass those additional predicates that you will be adding everywhere.
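A hypothetical Oracle sketch of that approach, assuming an orders table with a tenant_id column and an application context APP_CTX that is populated at login:

CREATE OR REPLACE VIEW orders_v AS
SELECT *
FROM orders
WHERE tenant_id = SYS_CONTEXT('APP_CTX', 'TENANT_ID');

GRANT SELECT ON orders_v TO app_role;   -- clients see only the view
-- (no grant on the base table, so the tenant predicate cannot be bypassed)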
Using client-side technology
You didn't mention the technology stack you are using, but in Java, this could be done with jOOQ. jOOQ ships with a SQL parser that can parse the SQL statement into an expression tree. One of the main use-cases for this is to translate SQL from one dialect to another, as can be seen here:
https://www.jooq.org/translate
Top 5 Hidden jOOQ Features - Parsing Connection
E.g. this statement that runs on SQL Server or PostgreSQL:
try (DSLContext ctx = DSL.using("...");
     Connection c = ctx.parsingConnection(); // Magic here
     Statement s = c.createStatement();
     ResultSet rs = s.executeQuery(
         "SELECT * FROM (VALUES (1), (2), (3)) t(a)")) {
    while (rs.next())
        System.out.println(rs.getInt(1));
}
Could be translated to this equivalent statement in Oracle:
select t.a from (
(select null a from dual where 1 = 0) union all
(select * from (
(select 1 from dual) union all
(select 2 from dual) union all
(select 3 from dual)
) t)
) t
Once you have the jOOQ expression tree available, you could transform it using jOOQ's VisitListener as can be seen in these articles:
Row Level Security implementation in JOOQ
Implementing Client-Side Row-Level Security with jOOQ
Disclaimers:
jOOQ can only parse the SQL functionality that is also supported by the jOOQ API. The grammar can be seen here. This may not be enough for your existing applications, depending on how much vendor-specific functionality you're using.
I work for the company behind jOOQ, so obviously, this answer is biased.

Does select * impact stored procedure performance?

I know this could be a trivial question, but I keep hearing one of my teachers' voices saying
don't use SELECT * within a stored procedure; it affects performance, and it returns data that could break its clients if the schema changes, causing unknown ripple effects
I can't find any article confirming that concept, and I think that should be noticeable if true.
In most modern professional SQL implementations (Oracle, SQL Server, DB2, etc.), the use of SELECT * has a negative impact only in a top-level SELECT. In all other cases the SQL compiler should perform column-optimization anyway, eliminating any columns that are not used.
And the negative effect of * in a top-level SELECT is almost entirely related to returning all of the columns when you probably do not need all of them.
IMHO, in all other cases (**), including most ad-hoc cases, the use of * is perfectly fine and has no detrimental effects (and obvious beneficial conveniences). The widespread universal pronouncements against using * are largely an archaic holdover from the time (10-15 years ago) when most SQL compilers did not have effective column-elimination optimization techniques.
(** - one exception is in VIEW definitions in SQL Server, because it doesn't automatically notice if the bound column list changes.)
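A T-SQL sketch of that view pitfall (object names are made up):

CREATE TABLE t (a INT, b INT);
GO
CREATE VIEW v AS SELECT * FROM t;
GO
ALTER TABLE t ADD c INT;
GO
SELECT * FROM v;             -- still returns only a, b
EXEC sp_refreshview 'v';     -- rebind the view's metadata
SELECT * FROM v;             -- now returns a, b, c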
The other reason that you sometimes see for not using SELECT * is not because of any performance issue, but just as a matter of coding practices. That is, that it's generally better to write your SQL code to be explicit about what columns you (or your client code) expects and thus are dependent on. If you use * then it's implicit and someone reading your SQL code cannot easily tell if your application is truly dependent on a certain column or not. (And IMHO, this is the more valid reason.)
I found this quote in a paper about using the SELECT * instruction:
“[…] real harm occurs when a sort is required. Every SELECTed column, with the sorting columns repeated, makes up the width of the sort work file wider. The wider and longer the file, the slower the sort is.” In http://www.quest.com/whitepapers/10_SQL_Tips.pdf
This paper is about the DB2 engine, but this likely applies to other engines too.

Portable SQL : unique primary keys

I'm trying to develop something which should be portable between the bigger RDBMSes.
The issue is around generating and using auto-increment numbers as the primary key for a table.
There are two topics here:
(1) The mechanism used to generate the auto-increment numbers.
(2) How to specify that you want to use this as the primary key on a table.
I'm looking for verification for what I think is the current state of affairs:
Unfortunately, standardization came late to this area and in some respects is still not implemented (as a mandatory standard). This means that in 2013 it is still impossible to write a CREATE TABLE statement in a portable way ... if you want it with an auto-generated primary key.
Can this really be so?
Re (1). This is standardized, because it came in SQL:2003. As far as I understand, the way to go is SEQUENCEs. I believe these are a mandatory part of SQL:2003, right? The other possibility is the IDENTITY keyword, which is also defined in SQL:2003, but that one is - as far as I can tell - an optional part of the standard ... which means a key player like Oracle doesn't implement it... and can still claim compliance. Ok, so SEQUENCEs are the designated portable method for this, right?
Re (2). Database vendors implement this in different ways. In PostgreSQL you can link the CREATE TABLE statement directly with the sequence, in Oracle you would have to create a trigger to ensure the SEQUENCE is used with the table.
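For concreteness, a rough sketch of the two approaches with made-up names (the Oracle form assumes 11g, where a trigger may assign NEXTVAL directly):

-- PostgreSQL: the sequence can be wired directly into the column default
CREATE SEQUENCE customers_id_seq;
CREATE TABLE customers (
    id   integer PRIMARY KEY DEFAULT nextval('customers_id_seq'),
    name text
);

-- Oracle (11g and earlier): sequence plus a trigger to populate the key
CREATE SEQUENCE customers_seq;
CREATE TABLE customers (
    id   NUMBER PRIMARY KEY,
    name VARCHAR2(100)
);
CREATE OR REPLACE TRIGGER customers_bi
    BEFORE INSERT ON customers
    FOR EACH ROW
BEGIN
    :NEW.id := customers_seq.NEXTVAL;
END;
/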
So my conclusion is that without a standardized solution to (2) it really doesn't help much that all the major players now support SEQUENCEs. I would still have to write db-specific code for something as simple as a CREATE TABLE statement.
Is this right?
Standards and their implementation aside, I would also be interested if anyone has a portable solution to the problem, no matter if it is a hack from an RDBMS best practice perspective. For such a solution to work it would have to be independent from any application, i.e. it must be the database that solves the issue, not the application layer. Perhaps if both the concept of TRIGGERs and SEQUENCEs can be said to be standardized, then a solution that combines the two of them would be portable?
As for "portable create table statements": It starts with the data types: Whether boolean, int or long data types are part of any SQL standard or not, I really appreciate these types. PostgreSql supports these data types, Oracle does not. Ironically Oracle supports boolean in PL/SQL, but not as a data type in a table. Even the length of table/column names etc. are restricted in Oracle to 30 characters. So not even the most simple "create table" is always portable.
As for auto-generated primary keys: I am not aware of a syntax which is portable, so I do not define this in the "create table". Of course this only delays the problem, and leaves it to the insert statements. This topic is connected with another problem: getting the generated key after an insert using JDBC in the most efficient way. This differs substantially between Oracle and PostgreSql, and if you have ever dared to use case sensitive table/column names in Oracle, it won't be fun.
As for constraints, I prefer to add them in separate statements after "create table". The set of constraints may differ, if you implement a boolean data type in Oracle using char(1) together with a check constraint whereas PostgreSql supports this data type directly.
As for "standards": One example
SQL99 standard: for SELECT DISTINCT, ORDER BY expressions must appear in select list
This message is from PostgreSql, Oracle 11g does not complain. After 14 years, will they change it?
Generally speaking, you still have to write database specific code.
As for your conclusion: In our scenario we implemented a portable database application using a model driven approach. This logical meta data is used by the application, and there are different back ends for different database types. We do not use any ORM, just "direct SQL", because this simplifies tuning of SQL statements, and it gives full access to all SQL features. We wrote our own library, and later we found out that the key ideas match those of "Anorm".
The good news is that while there are tons of small annoyances, it works pretty well, even with complex queries. For example, window aggregate functions are quite portable (row_number(), partition by). You have to use listagg on Oracle, whereas you need string_agg on PostgreSql. Recursive common table expressions require "with recursive" in PostgreSql; Oracle does not accept that keyword. PostgreSql supports "limit" and "offset" in queries; you need to wrap this in Oracle. It drives you crazy if you use SQL arrays both in Oracle and PostgreSql (arrays as columns in tables). There are materialized views on Oracle, but they do not exist in PostgreSql. Surprisingly enough, it is possible to write database stored procedures not only in Java, but in Scala, and this works amazingly well in both Oracle and PostgreSql. This list is not complete. But so far we managed to find an acceptable (= fast) solution for any "portability problem".
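Two of those annoyances sketched out, assuming a hypothetical emp(dept, name) table (the Oracle paging form is for versions before 12c):

-- String aggregation: Oracle
SELECT dept, LISTAGG(name, ', ') WITHIN GROUP (ORDER BY name)
FROM emp GROUP BY dept;

-- String aggregation: PostgreSql
SELECT dept, string_agg(name, ', ' ORDER BY name)
FROM emp GROUP BY dept;

-- Paging: PostgreSql
SELECT * FROM emp ORDER BY name LIMIT 10 OFFSET 20;

-- Paging: Oracle needs a wrapped row_number() query
SELECT *
FROM (
    SELECT e.*, ROW_NUMBER() OVER (ORDER BY name) AS rn
    FROM emp e
)
WHERE rn BETWEEN 21 AND 30
ORDER BY rn;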
Does it pay off? In our scenario, there is a central Oracle installation (RAC, read/write), but there are distributed PostgreSql installations as localhost databases on each application server (only readonly). This gives a big performance and scalability boost, without the cost penalty.
If you really want to have it solved in the database only, there is one possibility: Put anything in stored procedures, write these in Java/Scala, and restrict yourself in the application to call these procedures, and to read the result sets. This of course just moves the complexity from the application layer into the database, but you accepted hacks :-)
Triggers are quite standardized, if you use Java stored procedures. And if it is supported by your databases, by your management, your data center people, and your colleagues. The non-technical/social aspects are to be considered as well. I have even heard of database tuning people who do not accept the general "left outer join" syntax; they insisted on the Oracle way of using "(+)".
So even if triggers (PL/SQL) and sequences were standardized, there would be so many other things to consider.
Update
As for returning the generated primary keys I can only judge the situation from JDBC's perspective.
PostgreSql returns it, if you use Statement.getGeneratedKeys (I consider this the normal way).
Oracle requires you to explicitly specify the (primary key) column(s) whose values you want to get back when you create the prepared statement. This works, but only if you are not using case sensitive table names. If you are, all you receive is a misleading ORA-00942: table or view does not exist thrown in Oracle's JDBC driver: there was/is a bug in Oracle's JDBC driver, and I have not found a way to get the value using a portable JDBC method. So at the cost of an additional proprietary "select sequence.currVal from dual" within the same transaction right after the insert, you can get back the primary key. The additional time was acceptable in our case; we compared the times to insert 100000 rows: PostgreSql is faster until the 10000th row, after that Oracle performs better.
See a stackoverflow question regarding the ways to get the primary key and
the bug report with case sensitive table names from 2008
This example shows pretty well the problems. Normally PostgreSql follows the way you expect it to work, but you may have to find a special way for Oracle.

LINQ syntax vs SQL syntax

Why did Anders Hejlsberg design LINQ syntax to be different from that of SQL (thereby creating overhead for programmers, who have to learn a whole new thing)?
Wouldn't it have been better if it used the same syntax as SQL?
LINQ isn't meant to be SQL. It's meant to be a query language which is as independent of the data source as reasonably possible. Now admittedly it has a strong SQL bias, but it's not meant to just be embedding SQL in source code (fortunately).
Personally, I vastly prefer LINQ's syntax to SQL's. In particular, the ordering is much more logical in LINQ. Just by looking at the order of the query clauses, you can see the logical order in which the query is processed. You start with a data source, possibly do some filtering, ordering etc, and usually end with a projection or grouping. Compare that with SQL, where you start off saying which columns you're interested in, not even knowing what table you're talking about yet.
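To illustrate with plain SQL (hypothetical customers table): the clause you write first is evaluated almost last.

-- written order vs. the (logical) order in which the clauses are evaluated
SELECT city, COUNT(*) AS n        -- 5. projection
FROM customers                    -- 1. data source
WHERE country = 'DE'              -- 2. row filtering
GROUP BY city                     -- 3. grouping
HAVING COUNT(*) > 10              -- 4. group filtering
ORDER BY n DESC;                  -- 6. ordering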
Not only is LINQ more logical in that respect, but it allows tools to work with you better - if Visual Studio knows what data you're starting with, then when you start writing a select clause (for example) it can help you with IntelliSense. Additionally, it allows the translation from LINQ query expressions into "dot notation" to be relatively simple using extension methods, without the compiler having to be aware of any details of what the query will actually do.
So from my point of view: no, LINQ would be a lot worse if it had slavishly followed SQL's syntax.
First, choose your flavor of SQL - there are several! (T-, PL-, etc).
Ultimately, there are similarities and differences. A lot of the LINQ changes make more sense - i.e. choosing your source (FROM) before you try filtering (WHERE) / projection (SELECT), allowing better static analysis etc (including intellisense), and a more natural query comprehension syntax. This helps both the developer and the compiler, so I'm happy.
It is simpler to parse an expression when the initial data source is provided at its beginning.
Because of this, VS provides code completion even for partially written LINQ queries (great feature IMO).
The reason the C# language designers chose this approach is that when you first specify where the data is coming from, Visual Studio and the C# compiler know what your data looks like, and you can have IntelliSense help in the rest of the query, because Visual Studio will know that "city" (for example) is a string, and that it has operations like StartsWith and a property named Length. And really, inside of a relational database like SQL Server, the select clause that you write at the top of a SQL statement is one of the last pieces of information the query engine has to figure out. Before that it has to figure out what table you are working against in the from clause, even though the from clause comes later in SQL syntax.