Is it possible to have Postgres reject queries which use its proprietary extensions to the SQL language?
e.g. select a::int from b; should throw an error, forcing the use of proper casts as in select cast(a as int) from b;
Perhaps more to the point is the question of whether it is possible to write SQL that is supported by all RDBMS with the same resulting behaviour?
PostgreSQL has no such feature. Even if it did, it wouldn't help you tons because interpretations of the SQL standard vary, support for standard syntax and features vary, and some DBs are relaxed about restrictions that others enforce or have limitations others don't. Syntax is the least of your problems.
The only reliable way to write cross-DB portable SQL is to test that SQL on every target database as part of an automated test suite. And to swear a lot.
In many places the query parser/rewriter transforms the standard "spelling" of a query into the PostgreSQL internal form, which will be emitted on dump/reload. In particular, PostgreSQL doesn't store the raw source code for things like views, check constraint expressions, index expressions, etc. It stores the internal parse tree, and reconstructs the source from that when it's asked to dump or display the object.
For example:
regress=> CREATE TABLE sometable ( x varchar(100) );
CREATE TABLE
regress=> CREATE VIEW someview AS SELECT CAST (x AS integer) FROM sometable;
CREATE VIEW
regress=> SELECT pg_get_viewdef('someview');
pg_get_viewdef
-------------------------------------
SELECT (sometable.x)::integer AS x
FROM sometable;
(1 row)
It'd be pretty useless anyway, since the standard fails to specify some pretty common and important pieces of functionality and often has rather ambiguous specifications of things it does define. Until recently it didn't define a way to limit the number of rows returned by a query, for example, so every database had its own different syntax (TOP, LIMIT / OFFSET, etc).
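SQL:2008 did eventually standardize row limiting with FETCH FIRST. A minimal sketch of the competing spellings (customers and sales_total are hypothetical names):
SELECT * FROM customers ORDER BY sales_total DESC FETCH FIRST 10 ROWS ONLY;  -- SQL:2008 standard
SELECT TOP 10 * FROM customers ORDER BY sales_total DESC;                    -- SQL Server
SELECT * FROM customers ORDER BY sales_total DESC LIMIT 10;                  -- PostgreSQL, MySQL, SQLite
Support for the standard spelling arrived at very different times in different products, so for a long time you couldn't rely on it.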
Other things the standard specifies are not implemented by most vendors, so using them is pretty pointless. Good luck using the SQL-standard generated and identity columns across all DB vendors.
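For illustration, this is the SQL-standard identity syntax (hypothetical table); it works on DB2, PostgreSQL 10+ and Oracle 12c+, but SQL Server uses its own IDENTITY(1,1) column property instead:
CREATE TABLE orders (
    id      integer GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    placed  date
);
-- SQL Server's proprietary equivalent:
-- CREATE TABLE orders (id int IDENTITY(1,1) PRIMARY KEY, placed date);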
It'd be quite nice to have a "prefer standard spelling" dump mode, that used CAST instead of ::, etc, but it's really not simple to do because some transformations aren't 1:1 reversible, e.g.:
regress=> CREATE VIEW v AS SELECT '1234' SIMILAR TO '%23%';
CREATE VIEW
regress=> SELECT pg_get_viewdef('v');
SELECT ('1234'::text ~ similar_escape('%23%'::text, NULL::text));
or:
regress=> CREATE VIEW v2 AS SELECT extract(dow FROM current_date);
CREATE VIEW
regress=> SELECT pg_get_viewdef('v2');
SELECT date_part('dow'::text, ('now'::text)::date) AS date_part;
so you see that significant changes would need to be made to how PostgreSQL internally represents and works with functions and expressions before what you want would be possible.
Lots of the SQL standard stuff uses funky one-off syntax that PostgreSQL converts into function calls and casts during parsing, so it doesn't have to add special case features every time the SQL committee have another brain-fart and pull some new creative bit of syntax out of ... somewhere. Changing that would require adding tons of new expression node types and general mess, all for no real gain.
"Perhaps more to the point is the question of whether it is possible to write SQL that is supported by all RDBMS with the same resulting behaviour?"
No, not even for many simple statements:
select top 10 ... -- tsql
select ... limit 10 -- everyone else
Many more examples exist. Use an ORM or something similar if you want to insulate yourself from database choice.
If you do write SQL by hand, then trying to follow the SQL standard is always a good choice :-)
You could use a tool like Mimer's SQL Validator to validate that queries follow the SQL spec before running them:
http://developer.mimer.com/validator/parser92/index.tml
You could force users to write queries in HQL or JPQL, which would then get translated into the correct SQL dialect for your database.
Related
The problem at hand is how to add conditions to SQL database queries issued by pre-existing applications in an "orthogonal way", meaning that this operation should be done in such a way that applications need not be concerned. In case you are curious, the actual problem involves adding multitenancy-related criteria that should keep tenants from getting to each other's data. In practical terms, that means adding extra conditions to WHERE clauses (in possibly multiple locations within the query).
My team has been working on a solution which involves "intercepting" and parsing the queries, before they are sent to the DBMS. As many of you probably know, SQL parsing is not the simplest thing to implement, especially when you need to support most of the major DBMS' syntax extensions.
Another idea which has been thrown around is that one could let the DBMS itself handle the criteria injection. It is, after all, the one element which has full knowledge over the query syntax and (hopefully) should have no problem tinkering with it before carrying out its execution. The problem then would be figuring out how to pass the multitenancy metadata to the DBMS. Is this viable at all? Is this a bad idea?
Any other ideas on how this problem might be approached?
Thank you for your time.
Would it not be easier to work through views, each view restricted to the applicable user's data? The pre-written SQL could use a base name for the view which is then modified in code to add a prefix or suffix to the view name to give the user's view.
Example:
Table tennant_data has views named tennant_data_user1 and tennant_data_user2.
Your SQL is select col1, col2 from tennant_data_{view}
Your program code obtains the name of the current user (user1 or user2), and replaces {view} with their userid in the SQL, then executes the SQL.
Depending upon the DBMS (and language?) you are using, you could probably grant access so that user1 can only use the xxx_user1 views and so on, so there is no chance of them accessing the wrong data by misusing a view or direct access to the underlying table.
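A minimal sketch of the idea (names follow the example above; tennant_id is a hypothetical discriminator column):
CREATE VIEW tennant_data_user1 AS
    SELECT col1, col2 FROM tennant_data WHERE tennant_id = 'user1';

CREATE VIEW tennant_data_user2 AS
    SELECT col1, col2 FROM tennant_data WHERE tennant_id = 'user2';

-- grant each user access to their own view only, and no
-- direct access to the underlying table
GRANT SELECT ON tennant_data_user1 TO user1;
GRANT SELECT ON tennant_data_user2 TO user2;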
Using server-side technology
The ideal place to implement this functionality is on the server side, of course. Some databases implement features like Oracle's Virtual Private Database doing exactly what you need.
You could also emulate the feature by replacing all your direct table access by (possibly updatable) views, which contain a filter on relevant columns using SYS_CONTEXT (in Oracle). Using this approach, database clients will never be able to bypass those additional predicates that you will be adding everywhere.
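A rough sketch of the emulation in Oracle, assuming a hypothetical application context app_ctx whose tenant_id attribute your connection layer sets at login:
CREATE OR REPLACE VIEW orders_v AS
    SELECT *
    FROM orders
    WHERE tenant_id = SYS_CONTEXT('app_ctx', 'tenant_id');
-- clients query orders_v instead of orders; the predicate is
-- applied on the server side and cannot be bypassed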
Using client-side technology
You didn't mention the technology stack you are using, but in Java, this could be done with jOOQ. jOOQ ships with a SQL parser that can parse the SQL statement into an expression tree. One of the main use-cases for this is to translate SQL from one dialect to another, as can be seen here:
https://www.jooq.org/translate
Top 5 Hidden jOOQ Features - Parsing Connection
E.g. this statement that runs on SQL Server or PostgreSQL:
try (DSLContext ctx = DSL.using("...");
Connection c = ctx.parsingConnection(); // Magic here
Statement s = c.createStatement();
ResultSet rs = s.executeQuery(
"SELECT * FROM (VALUES (1), (2), (3)) t(a)")) {
while (rs.next())
System.out.println(rs.getInt(1));
}
Could be translated to this equivalent statement in Oracle:
select t.a from (
(select null a from dual where 1 = 0) union all
(select * from (
(select 1 from dual) union all
(select 2 from dual) union all
(select 3 from dual)
) t)
) t
Once you have the jOOQ expression tree available, you could transform it using jOOQ's VisitListener as can be seen in these articles:
Row Level Security implementation in JOOQ
Implementing Client-Side Row-Level Security with jOOQ
Disclaimers:
jOOQ can only parse the SQL functionality that is also supported by the jOOQ API. The grammar can be seen here. This may not be enough for your existing applications, depending on how much vendor-specific functionality you're using.
I work for the company behind jOOQ, so obviously, this answer is biased.
I know this could be a trivial question but I keep hearing one of my teachers voice saying
don't use SELECT * within a stored procedure; it affects performance, and it returns data that could be breaking its clients if the schema changes, causing unknown ripple effects
I can't find any article confirming that concept, and I think that should be noticeable if true.
In most modern professional SQL implementations (Oracle, SQL Server, DB2, etc.), the use of SELECT * has a negative impact only in a top-level SELECT. In all other cases the SQL compiler should perform column-optimization anyway, eliminating any columns that are not used.
And the negative effect of * in a top-level SELECT is almost entirely related to returning all of the columns when you probably do not need all of them.
IMHO, in all other cases (**), including most ad-hoc cases, the use of * is perfectly fine and has no detrimental effects (and obvious beneficial conveniences). The widespread universal pronouncements against using * are largely an archaic holdover from the time (10-15 years ago) when most SQL compilers did not have effective column-elimination optimization techniques.
(** - one exception is in VIEW definitions in SQL Server, because it doesn't automatically notice if the bound column list changes.)
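A sketch of that SQL Server quirk (hypothetical names); the view's column list is bound at creation time and has to be refreshed explicitly:
CREATE TABLE t (a int, b int);
GO
CREATE VIEW v AS SELECT * FROM t;   -- column list is captured here
GO
ALTER TABLE t ADD c int;
SELECT * FROM v;                    -- still returns only a, b
EXEC sp_refreshview 'v';            -- re-binds the view metadata
SELECT * FROM v;                    -- now returns a, b, c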
The other reason that you sometimes see for not using SELECT * is not because of any performance issue, but just as a matter of coding practices. That is, that it's generally better to write your SQL code to be explicit about what columns you (or your client code) expects and thus are dependent on. If you use * then it's implicit and someone reading your SQL code cannot easily tell if your application is truly dependent on a certain column or not. (And IMHO, this is the more valid reason.)
I found this quote in a paper about what happens when we use the SELECT * instruction:
“[…] real harm occurs when a sort is required. Every SELECTed column, with the sorting columns repeated, makes up the width of the sort work file wider. The wider and longer the file, the slower the sort is.” In http://www.quest.com/whitepapers/10_SQL_Tips.pdf
This paper is from the DB2 engine, but likely this applies to other engines too.
Trying to develop something which should be portable between the bigger RDBMSes.
The issue is around generating and using auto-increment numbers as the primary key for a table.
There are two topics here:
1. The mechanism used to generate the auto-increment numbers.
2. How to specify that you want to use this as the primary key on a table.
I'm looking for verification for what I think is the current state of affairs:
Unfortunately standardization came late to this area and in some respects is still not implemented (as a mandatory standard). This means it is still in 2013 impossible to write a CREATE TABLE statement in a portable way ... if you want it with an auto-generated primary key.
Can this really be so?
Re (1). This is standardized because it came in SQL:2003. As far as I understand the way to go is SEQUENCEs. I believe these are a mandatory part of SQL:2003, right? The other possibility is the IDENTITY keyword which is also defined in SQL:2003 but that one is - as far as I can tell - an optional part of the standard ... which means a key player like Oracle doesn't implement it... and can still claim compliance. Ok, so SEQUENCEs is the designated portable method for this, right ?
Re (2). Database vendors implement this in different ways. In PostgreSQL you can link the CREATE TABLE statement directly with the sequence, in Oracle you would have to create a trigger to ensure the SEQUENCE is used with the table.
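A sketch of both variants with hypothetical names (Oracle 12c later added standard IDENTITY columns, which removes the need for the trigger):
-- PostgreSQL: the sequence is wired into the column default
CREATE SEQUENCE t_seq;
CREATE TABLE t (
    id integer PRIMARY KEY DEFAULT nextval('t_seq')
);

-- Oracle (pre-12c): sequence plus a BEFORE INSERT trigger
CREATE SEQUENCE t_seq;
CREATE TABLE t (
    id NUMBER PRIMARY KEY
);
CREATE OR REPLACE TRIGGER t_bi
    BEFORE INSERT ON t
    FOR EACH ROW
BEGIN
    SELECT t_seq.NEXTVAL INTO :NEW.id FROM dual;
END;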
So my conclusion is that without a standardized solution to (2) it really doesn't help much that all the major players now support SEQUENCEs. I would still have to write db-specific code for something as simple as a CREATE TABLE statement.
Is this right?
Standards and their implementation aside I would also be interested if anyone has a portable solution to the problem, no matter if it is a hack from a RDBMS best practice perspective. For such a solution to work it would have to be independent from any application, i.e. it must be the database that solves the issue, not the application layer. Perhaps if both the concept of TRIGGERs and SEQUENCEs can be said to be standardized then a solution that combines the two of them would be portable?
As for "portable create table statements": It starts with the data types: Whether boolean, int or long data types are part of any SQL standard or not, I really appreciate these types. PostgreSql supports these data types, Oracle does not. Ironically Oracle supports boolean in PL/SQL, but not as a data type in a table. Even the length of table/column names etc. are restricted in Oracle to 30 characters. So not even the most simple "create table" is always portable.
As for auto-generated primary keys: I am not aware of a syntax which is portable, so I do not define this in the "create table". Of course this only delays the problem, and leaves it to the insert statements. This topic is connected with another problem: Getting the generated key after an insert using JDBC in the most efficient way. This differs substantially between Oracle and PostgreSql, and if you have ever dared to use case sensitive table/column names in Oracle, it won't be funny.
As for constraints, I prefer to add them in separate statements after "create table". The set of constraints may differ, if you implement a boolean data type in Oracle using char(1) together with a check constraint whereas PostgreSql supports this data type directly.
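For example, a hypothetical active flag could be declared per platform like this, with the constraint kept in a separate statement:
-- PostgreSql: native boolean
ALTER TABLE t ADD active boolean DEFAULT false NOT NULL;

-- Oracle: char(1) emulation, check constraint added separately
ALTER TABLE t ADD active CHAR(1) DEFAULT 'N' NOT NULL;
ALTER TABLE t ADD CONSTRAINT t_active_ck CHECK (active IN ('Y', 'N'));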
As for "standards": One example
SQL99 standard: for SELECT DISTINCT, ORDER BY expressions must appear in select list
This message is from PostgreSql, Oracle 11g does not complain. After 14 years, will they change it?
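A hypothetical query of the kind that draws this complaint from PostgreSql:
SELECT DISTINCT first_name
FROM employees
ORDER BY UPPER(first_name);  -- rejected: UPPER(first_name) is not in the select list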
Generally speaking, you still have to write database specific code.
As for your conclusion: In our scenario we implemented a portable database application using a model driven approach. This logical meta data is used by the application, and there are different back ends for different database types. We do not use any ORM, just "direct SQL", because this simplifies tuning of SQL statements, and it gives full access to all SQL features. We wrote our own library, and later we found out that the key ideas match those of "Anorm".
The good news is that while there are tons of small annoyances, it works pretty well, even with complex queries. For example, window aggregate functions are quite portable (row_number(), partition by). You have to use listagg on Oracle, whereas you need string_agg on PostgreSql. Recursive common table expressions require "with recursive" in PostgreSql, Oracle does not like it. PostgreSql supports "limit" and "offset" in queries, you need to wrap this in Oracle. It drives you crazy if you use SQL arrays both in Oracle and PostgreSql (arrays as columns in tables). There are materialized views on Oracle, but they do not exist in PostgreSql. Surprisingly enough, it is possible to write database stored procedures not only in Java, but in Scala, and this works amazingly well in both Oracle and PostgreSql. This list is not complete. But so far we managed to find an acceptable (= fast) solution for any "portability problem".
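One concrete example of such an annoyance, using hypothetical names: the same string aggregation has two spellings.
-- Oracle
SELECT dept, LISTAGG(name, ', ') WITHIN GROUP (ORDER BY name)
FROM emp GROUP BY dept;

-- PostgreSql
SELECT dept, string_agg(name, ', ' ORDER BY name)
FROM emp GROUP BY dept;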
Does it pay off? In our scenario, there is a central Oracle installation (RAC, read/write), but there are distributed PostgreSql installations as localhost databases on each application server (only readonly). This gives a big performance and scalability boost, without the cost penalty.
If you really want to have it solved in the database only, there is one possibility: Put everything in stored procedures, write these in Java/Scala, and restrict yourself in the application to calling these procedures and reading the result sets. This of course just moves the complexity from the application layer into the database, but you accepted hacks :-)
Triggers are quite standardized, if you use Java stored procedures. And if it is supported by your databases, by your management, your data center people, and your colleagues. The non-technical/social aspects are to be considered as well. I have even heard of database tuning people who do not accept the general "left outer join" syntax; they insisted on the Oracle way of using "(+)".
So even if triggers (PL/SQL) and sequences were standardized, there would be so many other things to consider.
Update
As for returning the generated primary keys I can only judge the situation from JDBC's perspective.
PostgreSql returns it, if you use Statement.getGeneratedKeys (I consider this the normal way).
Oracle requires you to specify the (primary key) column(s) whose values you want to get back explicitly when you create the prepared statement. This works, but only if you are not using case sensitive table names. In that case all you receive is a misleading ORA-00942: table or view does not exist thrown in Oracle's JDBC driver: There was/is a bug in Oracle's JDBC driver, and I have not found a way to get the value using a portable JDBC method. So at the cost of an additional proprietary "select sequence.currVal from dual" within the same transaction right after the insert, you can get back the primary key. The additional time was acceptable in our case, we compared the times to insert 100000 rows: PostgreSql is faster until the 10000th row, after that Oracle performs better.
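In SQL terms the two approaches look roughly like this (hypothetical names):
-- PostgreSql: the generated key can be returned by the INSERT itself
INSERT INTO t (name) VALUES ('x') RETURNING id;

-- Oracle workaround: extra round trip in the same transaction
INSERT INTO t (id, name) VALUES (t_seq.NEXTVAL, 'x');
SELECT t_seq.CURRVAL FROM dual;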
See a stackoverflow question regarding the ways to get the primary key and
the bug report with case sensitive table names from 2008
This example shows pretty well the problems. Normally PostgreSql follows the way you expect it to work, but you may have to find a special way for Oracle.
In an imperative programming language you are allowed to place an expression anywhere an integer is expected. Why isn't this so in SQL? Why doesn't SQL accept something like this:
Create Table Entries (Title Varchar((Select Count(ID) From SomeOtherTable)))
When, for example, this works:
Select * From Entries Where id = (Select Count(id) From SomeOtherTable)
Now, of course, these examples are silly. I don't really care if I can create tables based on a dynamic information. I want to understand why SQL is different in this regard from, say, C or C# or Delphi, etc. In these languages I know what will work based on intuitive rules. When I am writing a query, I am kind of unsure whether some construction will be accepted: sometimes it will be and sometimes it won't (I am still new to SQL, so maybe this is the main reason).
I am using Interbase 7.5 if this has some relevance.
CREATE TABLE is DDL (Data Definition Language), that is defining schema structure of the database while SELECT is DML (Data Manipulation Language), that is working with the data stored in the schema.
You cannot mix and match DDL and DML in one query.
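If you really need data-dependent DDL, the usual workaround is dynamic SQL: build the DDL statement as a string at runtime and execute it. A minimal sketch in PostgreSQL's PL/pgSQL; Oracle has EXECUTE IMMEDIATE for the same purpose, and I believe Interbase/Firebird's PSQL offers a similar EXECUTE STATEMENT:
DO $$
DECLARE
    n integer;
BEGIN
    SELECT count(id) INTO n FROM SomeOtherTable;
    -- splice the computed length into the DDL string
    EXECUTE format('CREATE TABLE Entries (Title varchar(%s))', n);
END
$$;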
DBMS Vendors use SQL dialect features to differentiate their product, at the same time claiming to support SQL standards. 'Nuff said on this.
Is there any example of SQL you have coded that can't be translated to SQL:2008 standard SQL ?
To be specific, I'm talking about DML (a query statement), NOT DDL, stored procedure syntax or anything that is not a pure SQL statement.
I'm also talking about queries you would use in Production, not for ad-hoc stuff.
Edit Jan 13
Thanks for all of your answers : they have conveyed to me an impression that a lot of the DBMS-specific SQL is created to allow work-arounds for poor relational design. Which leads me to the conclusion you probably wouldn't want to port most existing applications.
Typical differences include subtly different semantics (for example Oracle handles NULLs differently from other SQL dialects in some cases), different exception handling mechanisms, different types and proprietary methods for doing things like string operations, date operations or hierarchical queries. Query hints also tend to have syntax that varies across platforms, and different optimisers may get confused on different types of constructs.
One can use ANSI SQL for the most part across database systems and expect to get reasonable results on a database with no significant tuning issues like missing indexes. However, on any non-trivial application there is likely to be some requirement for code that cannot easily be done portably.
Typically, this requirement will be fairly localised within an application code base - a handful of queries where this causes an issue. Reporting is much more likely to throw up this type of issue and doing generic reporting queries that will work across database managers is very unlikely to work well. Some applications are more likely to cause grief than others.
Therefore, it is unlikely that relying on 'portable' SQL constructs for an application will work in the general case. A better strategy is to use generic statements where they will work and break out to a database specific layer where this does not work.
A generic query mechanism could be to use ANSI SQL where possible; another possible approach would be to use an O/R mapper, which can take drivers for various database platforms. This type of mechanism should suffice for the majority of database operations but will require you to do some platform-specific work where it runs out of steam.
You may be able to use stored procedures as an abstraction layer for more complex operations and code a set of platform specific sprocs for each target platform. The sprocs could be accessed through something like ADO.net.
In practice, subtle differences in parameter passing and exception handling may cause problems with this approach. A better approach is to produce a module that wraps the platform-specific database operations with a common interface. Different 'driver' modules can be swapped in and out depending on what DBMS platform you are using.
Oracle has some additions, such as the MODEL clause or hierarchical queries (CONNECT BY), that are very difficult, if not impossible, to translate into pure SQL.
Even when SQL:2008 can do something sometimes the syntax is not the same. Take the REGEXP matching syntax for example, SQL:2008 uses LIKE_REGEX vs MySQL's REGEXP.
And yes, I agree, it's very annoying.
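Roughly, the two spellings look like this (hypothetical table and pattern; note that LIKE_REGEX uses XQuery regular expressions while MySQL's REGEXP uses POSIX-style ones, so the patterns themselves may need translating too):
SELECT * FROM t WHERE name LIKE_REGEX '^J.*n$';  -- SQL:2008
SELECT * FROM t WHERE name REGEXP '^J.*n$';      -- MySQL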
Part of the problem with Oracle is that it's still based on the SQL 1992 ANSI standard. SQL Server is on SQL 1999 standard, so some of the things that look like "extensions" are in fact newer standards. (I believe that the "OVER" clause is one of these.)
Oracle is also far more restrictive about placing subqueries in SQL. SQL Server is far more flexible and permissive about allowing subqueries almost anywhere.
SQL Server has a rational way to select the "top" row of a result: "SELECT TOP 1 * FROM CUSTOMERS ORDER BY SALES_TOTAL". In Oracle, this becomes "SELECT * FROM (SELECT * FROM CUSTOMERS ORDER BY SALES_TOTAL) WHERE ROWNUM <= 1".
And of course there's always Oracle's infamous SELECT (expression) FROM DUAL.
Edit to add:
Now that I'm at work and can access some of my examples, here's a good one. This is generated by LINQ-to-SQL, but it's a clean query to select rows 41 through 50 from a table, after sorting. It uses the "OVER" clause:
SELECT [t1].[CustomerID], [t1].[CompanyName], [t1].[ContactName], [t1].[ContactTitle], [t1].[Address], [t1].[City], [t1].[Region], [t1].[PostalCode], [t1].[Country], [t1].[Phone], [t1].[Fax]
FROM (
SELECT ROW_NUMBER() OVER (ORDER BY [t0].[ContactName]) AS [ROW_NUMBER], [t0].[CustomerID], [t0].[CompanyName], [t0].[ContactName], [t0].[ContactTitle], [t0].[Address], [t0].[City], [t0].[Region], [t0].[PostalCode], [t0].[Country], [t0].[Phone], [t0].[Fax]
FROM [dbo].[Customers] AS [t0]
) AS [t1]
WHERE [t1].[ROW_NUMBER] BETWEEN 40 + 1 AND 40 + 10
ORDER BY [t1].[ROW_NUMBER]
Common here on SO
ISNULL (SQL Server)
NVL (Oracle)
IFNULL (MySQL, DB2?)
COALESCE (ANSI)
To answer exactly:
ISNULL can easily give different results from COALESCE on SQL Server because of data type precedence, as per my answer/comments here
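A minimal demonstration of that trap on SQL Server (variable name is hypothetical): ISNULL takes the data type of its first argument, so the replacement value can be silently truncated, while COALESCE determines the result type from all arguments by data type precedence:
DECLARE @s varchar(2) = NULL;
SELECT ISNULL(@s, 'abcdef');    -- 'ab'     (truncated to varchar(2))
SELECT COALESCE(@s, 'abcdef');  -- 'abcdef' (result type is varchar(6))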