Besides a declarative language, is SQL a functional language? - sql

Why yes or why not?

SQL was designed as a declarative language, in sense that you tell what you want to get and the SQL engine decides how.
However, SQL operates on sets, and the results of the functions can be first class sets in Oracle, SQL Server and PostgreSQL.
One can say that SQL is functional language, as long as a function takes a set as its input and produces a set as its output.
That is, you can write something like this:
SELECT *
FROM mytable t
JOIN myfunction(x) f
ON f.col1 = t.col2
, or even this:
SELECT *
FROM mytable t
CROSS APPLY
myfunction(t.col2) f
(in SQL Server)
or this:
SELECT t.*, myfunction(t.col2)
FROM mytable t
(in PostgreSQL)
This is not a part of SQL standard, though.
Just like a C++ compiler tries to find an optimal way to multiply two floats (in terms of plain algebra), SQL optimizer tries to find an optimal ways to multiply two sets (in terms of relational algebra).
In C++, you just write a * b and rely on compiler to generate an optimal assembly for this.
In SQL, you write SELECT * FROM a NATURAL JOIN b and rely on optimizer.
However, with all SQL's declared declarativity (no pun intended), most real optimizers are able to do only very basic query rewrites.
Say, no optimizer I'm aware of is able to use same efficient plan for this query:
SELECT t1.id, t1.value, SUM(t2.value)
FROM mytable t1
JOIN mytable t2
ON t2.id <= t1.id
GROUP BY
t1.id, t1.value
and for this one:
SELECT id, value, SUM(t1.value) OVER (ORDER BY id)
FROM mytable
, to say nothing of more complex queries.
That's why you still need to formulate your queries so that they use an efficient plan (while still producing the same result), thus making SQL quite a bit less of a declarative language.
I recently made post in my blog on this:
Double-thinking in SQL

No, SQL is not a functional language. The paradigm is somewhat different. Note that there are other types of declarative programming languages other than functional - the canonical example being logic programming and PROLOG.
Technically, Relational Algebra (the theoretical basis of SQL) is not actually turing complete. Although modern SQL dialects add enough procedural features so that one can implement stored procedures and are turing complete at this level, a single SQL query is not a turing complete computation. Relational Algebra has the property of godel completeness. Godel completeness implies the ability to express any computation that can be defined in terms of first order predicate calculus - basically what you would know as ordinary logical expressions.

Are functions first-class objects in SQL? hardly. So I'd say no.

I think SQL and functional languages are very different from each other. In a functional language computation is done by evaluating functions. Functions do not mutate state. All they do is compute a value from their arguments. I other words, functions do not cause side-effects. Functional languages are general purpose.
SQL is a language designed for dealing with relational database management systems. It can be viewed as a Domain Specific Language. It is designed to work on "sets" of data. It can mutate global state (i.e, the database) by using commands like UPDATE. There is no concept of functions getting evaluated to a value. As far as I understand, SQL is not even Turing complete.

Declarative and functional? That would be a spreadsheet.

There's no single true definition of what a functional language is (or for that matter, what a procedural or object-oriented one is).
But I can't really think of much that points to SQL being functional.
It doesn't have functions, it doesn't have recursion, it doesn't have closures, it doesn't have nested functions, it doesn't have functions as first-class types.
A more commonly asked question is whether SQL is a programming language at all. It's not turing-complete.

Not generally speaking, as without extensions to the syntax (e.g. PL/SQL, T-SQL), you can't write functions.
But it is certainly very expression-oriented, which is a feature it has in common with functional languages.

In a sense: yes, SQL is functional.
Look at
SELECT as map(),
WHERE as filter(),
ORDER BY as sort(),
DISTINCT as set()
etc.

Since the point of a functional language is that you program with, well, functions, I would say no. SQL is programming with relations (if you can even call SQL a programming language - in it's basic form, SQL is not Turing complete).

I would say yes. A SQL Query maps one set of data (ie, data from tables) to another (the result set).
In mathematics, a function is basically a mapping from one set to another.
Thus, a query is a function, and since queries are first class citizens in SQL, therefore SQL is a functional language.

Related

Compound "OR" evaluation in DB2

I've searched the forums and found a few related threads but no definitive answer.
(case
when field1 LIKE '%T001%' OR field1 LIKE '%T201%' OR field1 LIKE '%T301%'...
In the above statement, if field1 is "like" t001, will the others even be evaluated?
(case
when (field1 LIKE '%T001%' OR field1 LIKE '%T201%' OR field1 LIKE '%T301%')...
Does adding parenthesis as shown above change the evaluation?
In general, databases short-circuit boolean operations. That is, they stop at the first value that defines the result -- the first "true" for OR, the first "false" for AND.
That said, there are no guarantees. Nor are there guarantees about the order of evaluation. So, DB2 could decide to test the middle, the last, and then the first. That said, these are pretty equivalent, so I would expect the ordering to be either first to last or last to first.
Remember: SQL is a descriptive language, not a procedural language. A SQL query describers the result set, but not the steps used to generate it.
You don't know.
SQL is a declarative language, not an imperative one. You describe what you want, the engine provides it. The database engine will decide in which sequence it will evaluate those predicates, and you don't have control over that.
If you get the execution plan today it may show one sequence of steps, but if you get it tomorrow it may show something different.
Not strictly answering your question, but if you have many of these matches, a simpler, possibly faster, and easier to maintain solution would be to use REGEXP_LIKE. The example you've posted could be written like this:
CASE WHEN REGEXP_LIKE(field1, '.*T(0|2|3)01.*') ...
Just for indication how it really works in this simple case.
select
field1
, case
when field1 LIKE '%T001%' OR RAISE_ERROR('75000', '%T201%') LIKE '%T201%' then 1 else 0
end as expr
from
(
values
'abcT001xyz'
--, '***'
) t (field1);
The query returns expected result for the statement above as is.
But if you uncomment the commented out line, you get SQLCODE=-438.
This means, that 2-nd OR operand is not evaluated, if 1-st returns true.
Note, that it's just for demonstration. There is no any guarantee, that this will work in such a way in any case.
Just to add to some points made about the difference between so-called procedural languages on the one hand, and SQL (which is variously described as a declarative or descriptive language) on the other.
SQL defines a set of relatively high-level operators for working with arrays of data. They are "high-level" in this sense because they work with arrays in a concise fashion that has not been typical of general purpose or procedural languages. Like all operators (whether array-based or not), there is typically more than one algorithm capable of implementing an operator.
In contrast to "general purpose" programming languages to which the vast majority of programmers are exposed, the existence of these array operators - in particular, the ability to combine them algebraically into an expression which defines a composite operation (or a query), and the absence of any explicit algorithms for iteration - was once a defining feature of SQL.
The distinction is less sharp nowadays, with a resurgent interest in functional languages and features, but most still view SQL as a beast of its own kind amongst commercially-popular tooling.
It is often said that in SQL you define what results you want not how to get them. But this is true for every language. It is true even for machine code operators, if you account for how the implementation in circuitry can be varied - and does vary, between CPU designs. It is certainly true for all compiled languages, where compilers employ many different machine code algorithms to actually implement operations specified in the source code - loop unrolling, for one example.
The feature which continues to distinguish SQL (and the relational databases which implement it), is that the algorithm which implements an operation is determined at the time of each execution, and it does this not just by algebraic manipulation of queries (which is not dissimilar to what compilers do), but also by continuously generating and analysing statistics about the data being operated upon and the consequences of previous executions.
Put another way, the database execution engine is engaged in a constant search for the best practical algorithms (and combinations thereof) to implement its overall workload. It is capable of accomodating not just past experience, but of reacting to changes (such as in data volumes, in degree of concurrency and transactional conflict, or in systemic constraints like available memory or overall workload).
The upshot of all this is that there is a specific order of evaluation in SQL, just like any other language. It is this order which defines a correct result. But unless written in so-called RBAR style (and even then, but to a more limited extent...), the database engine has tremendous latitude to implement shortcuts and performance optimisations, provided these do not change the final result.
Many operators fall into a class where it is possible to determine the result in many cases without evaluating all operands. I'm not sure what the formal word is to describe this property - partial evaluativity, maybe - but casually it is called short-circuiting. The OR operator has this property.
Another property of the OR operation is that it is associative. That is, the order in which a series of them are applied does not matter - it behaves like the addition operator does, where you can add numbers in any order without affecting the result.
With a series of OR conditions, these are capable of being reordered and partially evaluated, provided the evaluation of any particular operand does not cause side-effects or depend on hidden variables. It is therefore quite likely that the database engine may reorder or partially evaluate them.
If the operands do cause side-effects or depend on hidden variables (functions which get the current date or time being a prime example of the latter), these often cause problems in queries - either because the database engine does not realise they have side-effects or hidden variables, or because the database does realise it but doesn't handle the case in the way the programmer expects. In such cases, a query may have to be completely rewritten (typically, cracked into multiple statements) to force a specific evaluation order or guarantee full evaluation.

Natural Join -- Relational theory and SQL

This question comes from my readings of C.J Date's SQL and Relational Theory: How to Write Accurate SQL Code and looking up about joins on the internet (which includes coming across multiple posts here on NATURAL JOINs (and about SQL Server's lack of support for it))
So here is my problem...
On one hand, in relational theory, natural joins are the only joins that should happen (or at least are highly preferred).
On the other hand, in SQL it is advised against using NATURAL JOIN and instead use alternate means (e.g inner join with restriction).
Is the reconciliation of these that:
Natural joins work in true RDBMS. SQL however, fails at completely reproducing the relational model and none of the popular SQL DBMSs are true RDBMS.
and / or
Good/Better table design should remove/minimise the problems that natural join creates.
?
a number of points regarding your question (even if I'm afraid I'm not really answering anything you asked),
"On one hand, in relational theory, natural joins are the only joins that should happen (or at least are highly preferred)."
This seems to suggest that you interpret theory as if it proscribes against "other kinds" of joins ... That is not really true. Relational theory does not say "you cannot have antijoins", or "you should never use antijoins", or anything like that. What it DOES say, is that in the relational algebra, a set of primitive operators can be identified, in which natural join is the only "join-like" operator. All other "join-like" operators, can always be expressed equivalently in terms of the primitive operators defined. Cartesian product, for example, is a special case of a natural join (where the set of common attributes is empty), and if you want the cartesian product of two tables that do have an attribute name in common, you can address this using RENAME. Semijoin, for example, is the natural join of the first table with some projection on the second. Antijoin, for example (SEMIMINUS or NOT MATCHING in Date's book), is the relational difference between the first table and a SEMIJOIN of the two. etc. etc.
"On the other hand, in SQL it is advised against using NATURAL JOIN and instead use alternate means (e.g inner join with restriction)."
Where are such things advised ? In the SQL standard ? I don't really think so. It is important to distinguish between the SQL language per se, which is defined by an ISO standard, and some (/any) particular implementation of that language, which is built by some particular vendor. If Microsoft advises its customers to not use NJ in SQL Server 200x, then that advice has a completely different meaning than an advice by someone to not ever use NJ in SQL altogether.
"Natural joins work in true RDBMS. SQL however, fails at completely reproducing the relational model and none of the popular SQL DBMSs are true RDBMS."
While it is true that SQL per se fails to faithfully comply with relational theory, that actually has very little to do with the question of NJ.
Whether an implementation gives good performance for invocations of NJ, is a characteristic of that implementation, not of the language, or of the "degree of trueness" of the 'R' in 'RDBMS'. It is very easy to build a TRDBMS that doesn't use SQL, and that gives ridiculous execution times for NJ. The SQL language per se has everything that is needed to support NJ. If an implementation supports NJ, then NJ will work in that implementation too. Whether it gives good performance, is a characteristic of that implementation, and poor performance of some particular implementation should not be "extrapolated" to other implementations, or be seen as a characteristic of the SQL language per se.
"Good/Better table design should remove/minimise the problems that natural join creates."
Problems that natural join creates ? Controlling the columns that appear in the arguments to a join is easily done by adding explicit projections (and renames if needed) on the columns you want. Much like you also want to avoid SELECT * as much as possible, for basically the same reason ...
First, the choice between theory and being practical is a fallacy. To quote Chris Date: "the truth is that theory--at least the theory I'm talking about here, which is relational theory--is most definitely very practical indeed".
Second, consider that natural join relies on attribute naming. Please (re)read the following sections of the Accurate SQL Code book:
6.12. The Reliance on Attribute Names. Salient quote:
The operators of the relational algebra… all rely heavily on attribute
naming.
3.9. Column Naming in SQL. Salient quote:
Strong recommendation: …if two columns in SQL represent "the same kind
of information," give them the same name wherever possible. (That's
why, for example, the two supplier number columns in the
suppliers-and-parts database are both called SNO and not, say, SNO in
one table and SNUM in the other.) Conversely, if two columns represent
different kinds of information, it's usually a good idea to give them
different names.
I'd like to address #kuru kuru pa's point (a good one too) about columns being added to a table over which you have no control, such as a "web service you're consuming." It seems to me that this problem is effectively mitigated using the strategy suggested by Date in section 3.9 (referenced above): quote:
For every base table, define a view identical to that base table except possibly for some column renaming.
Make sure the set of views so defined abides by the column naming discipline described above.
Operate in terms of those views instead of the underlying base tables.
Personally, I find the "natural join considered dangerous" attitude frustrating. Not wishing to sound self-righteous but my own naming convention, which follows the guidance of ISO 11179-5 Naming and identification principles, results in schema highly suited to natural join.
Sadly, natural join perhaps won't be supported anytime soon in the DBMS product I use professionally (SQL Server): the relevant feature request on Microsoft Connect
is currently closed as "won't fix" despite currently having a respectable +38 / -2 score
has been reopened and gained a respectable 46 / -2 score
(go vote for it now :)
The main problem with the NATURAL JOIN syntax in SQL is that it is typically too verbose.
In Tutorial D syntax I can very simply write a natural join as:
R{a,b,c} JOIN S{a,c,d};
But in SQL the SELECT statement needs either derived table subqueries or a WHERE clause and aliases to achieve the same thing. That's because a single "SELECT statement" is really a non-relational, compound operator in which the component operations always happen in a predetermined order. Projection comes after joins and columns in the result of a join don't necessarily have unique names.
E.g. the above query can be written in SQL as:
SELECT DISTINCT a, b, c, d
FROM
(SELECT a,b,c FROM R) R
NATURAL JOIN
(SELECT a,c,d FROM S) S;
or:
SELECT DISTINCT R.a, R.b, R.c, S.d
FROM R,S
WHERE R.a = S.a AND R.c = S.c;
People will likely prefer the latter version because it is shorter and "simpler".
Theory versus reality...
Natural joins are not practical.
There is no such thing as a pure (i.e. practice is idetical to theory) RDBMS, as far as I know.
I think Oracle and a few others actually support support natural joins -- TSQL doesn't.
Consider the world we live in -- chances of two tables each having a column with the same name is pretty high (like maybe [name] or [id] or [date], etc.). Maybe those chances are narrowed down a bit by grouping only those tables you might actually want to join. But regardless, without a careful examination of the table structure, you won't know if a "natural join" is a good idea or not. And even if it is, at that moment, it might not be in another year when the application gets an upgrade which adds columns to certain tables, etc., or the web service you're consuming adds fields you didn't know about, etc.
I think a "pure" system would have to be one you had 100% control over at a minimum, and then also, one that would have some good validation in the alter table / create table process that would warn / prevent you from creating a new column in some table that could be "naturally" joined to some other table you might not be intending it to be join-able to.
I guess bottom-line for me would be, valuing my sanity, wanting my applications to have maximum up-time, valuing quick/clean maintenance and upgrades, etc. -- good table design in this context means not using natural joins (ever).

Is there any way I can compare two sql strings to check if they are semantically equivalent?

I am writing some Java unit tests and I need to compare two sql strings where the sql statements are semantically equivalent but can be syntactically different. I can't do string comparison since an order of the from clause and where clause can be different yet the two queries might be equivalent.
Is there anyway to do this in java without having to write my own Oracle SQL Parser ? :)
P.S. The query can be very complicated !
Thank you !
The general answer is NO, because you can always call some kind of stored procedure which hides a Turing machine.
The fact that you can do arithmetic in a SQL statement I think pushes you over the Turing cliff, too.
Of course, theorists always tell us everything is impossible, so we should all roll over and die.
Nah.
So what can you do? Well, a "simple" possibility is to normalize the SQL queries, much as you simplify algebraic equations. If you could somehow, for a SQL statement, "normalize" (convert) it into the absolutely shortest equivalent SQL that did the same thing, then you could normalize both SQL statements and compare the result statements; if the they are equal modulo identifier renaming, then they have the same "semantics". For every operator in SQL, there is some semantics behind it, and some set of equivalent operations, just as there is in algebra. So, if you can determine the set of algebraic equivalences for each SQL operator, you can replace each algebraic computation by the shortest algebriac equivalent which does the same thing.
To do this, you have to be able to parse SQL, and apply SQL rewrites to the parsed SQL, which means you need a program transformation engine. (You can see an analog of this at Parsing and Rewriting Algebra)
This doesn't work in all cases. First, there may be several same-length "shortest" SQL statements that are equivalent (2+X is the same as X+2 but it isn't obvious to a a tool). Now you've got a theorem proving problem (for our X+2 example, use the commutative law to prove they are equal), back to stuck in theory. Second, you may not know how to generate the shortest possible sequence using your rewrites; even math equations sometimes have to swell up before they can get small again. Technically you have to search all possible algebraic equivalences to find the shortest, and that's impossibly large.
So, hard to do in practice, too. So, NO.
Not a direct solution for your problem, but you may want to look into JSqlParser which may already cover part of what you need.

LINQ syntax vs SQL syntax

Why did Andres Heilsberg designed LINQ syntax to be different than that of SQL (whereby made an overhead for the programmers to learn a whole new thing)?
Weren't it better if it used same syntax as of SQL?
LINQ isn't meant to be SQL. It's meant to be a query language which is as independent of the data source as reasonably possible. Now admittedly it has a strong SQL bias, but it's not meant to just be embedding SQL in source code (fortunately).
Personally, I vastly prefer LINQ's syntax to SQL's. In particular, the ordering is much more logical in LINQ. Just by looking at the order of the query clauses, you can see the logical order in which the query is processed. You start with a data source, possibly do some filtering, ordering etc, and usually end with a projection or grouping. Compare that with SQL, where you start off saying which columns you're interested in, not even knowing what table you're talking about yet.
Not only is LINQ more logical in that respect, but it allows tools to work with you better - if Visual Studio knows what data you're starting with, then when you start writing a select clause (for example) it can help you with IntelliSense. Additionally, it allows the translation from LINQ query expressions into "dot notation" to be relatively simple using extension methods, without the compiler having to be aware of any details of what the query will actually do.
So from my point of view: no, LINQ would be a lot worse if it had slavishly followed SQL's syntax.
First, choose your flavor of SQL - there are several! (T-, PL-, etc).
Ultimately, there are similarities and differences. A lot of the LINQ changes make more sense - i.e. choosing your source (FROM) before you try filtering (WHERE) / projection (SELECT), allowing better static analysis etc (including intellisense), and a more natural query comprehension syntax. This helps both the developer and the compiler, so I'm happy.
It is simpler to parse expression when initial data is provided in its beginning.
Because of this VS provides code completion even for partially written LINQ queries (great feature IMO).
The reason is the C# language designers use this approach because of when I first specify where the data is coming from, now Visual Studio and the C# compiler know what my data looks like. And I can have IntelliSense help in the rest of the query because Visual Studio will know that "city" (for example) is a string, and it has operations like startsWith and a property named length. And really, inside of a relational database like SQL Server, the select clause that you are writing in a SQL statement at the top is really one of the last pieces of the information the query engine has to figure out. Before that it has to figure out what table you are working against in the from clause even though the from clause comes later in SQL syntax

Plain SQL vs Dialects

DBMS Vendors use SQL dialect features to differentiate their product, at the same time claiming to support SQL standards. 'Nuff said on this.
Is there any example of SQL you have coded that can't be translated to SQL:2008 standard SQL ?
To be specific, I'm talking about DML (a query statement), NOT DDL, stored procedure syntax or anything that is not a pure SQL statement.
I'm also talking about queries you would use in Production, not for ad-hoc stuff.
Edit Jan 13
Thanks for all of your answers : they have conveyed to me an impression that a lot of the DBMS-specific SQL is created to allow work-arounds for poor relational design. Which leads me to the conclusion you probably wouldn't want to port most existing applications.
Typical differences include subtly differnt semantics (for example Oracle handles NULLs differently from other SQL dialects in some cases), different exception handling mechanisms, different types and proprietary methods for doing things like string operations, date operations or hierarchical queries. Query hints also tend to have syntax that varies across platforms, and different optimisers may get confused on different types of constructs.
One can use ANSI SQL for the most part across database systems and expect to get reasonable results on a database with no significant tuning issues like missing indexes. However, on any non-trivial application there is likely to be some requirement for code that cannot easily be done portably.
Typically, this requirement will be fairly localised within an application code base - a handful of queries where this causes an issue. Reporting is much more likely to throw up this type of issue and doing generic reporting queries that will work across database managers is very unlikely to work well. Some applications are more likely to cause grief than others.
Therefore, it is unlikely that relying on 'portable' SQL constructs for an application will work in the general case. A better strategy is to use generic statements where they will work and break out to a database specific layer where this does not work.
A generic query mechanism could be to use ANSI SQL where possible; another possible approach would be to use an O/R mapper, which can take drivers for various database platforms. This type of mechanism should suffice for the majority of database operations but will require you to do some platform-specifc work where it runs out of steam.
You may be able to use stored procedures as an abstraction layer for more complex operations and code a set of platform specific sprocs for each target platform. The sprocs could be accessed through something like ADO.net.
In practice, subtle differences in paramter passing and exception handling may cause problems with this approach. A better approach is to produce a module that wraps the
platform-specific database operations with a common interface. Different 'driver' modules can be swapped in and out depending on what DBMS platform you are using.
Oracle has some additions, such as model or hierarchical queries that are very difficult, if not impossible, to translate into pure SQL
Even when SQL:2008 can do something sometimes the syntax is not the same. Take the REGEXP matching syntax for example, SQL:2008 uses LIKE_REGEX vs MySQL's REGEXP.
And yes, I agree, it's very annoying.
Part of the problem with Oracle is that it's still based on the SQL 1992 ANSI standard. SQL Server is on SQL 1999 standard, so some of the things that look like "extensions" are in fact newer standards. (I believe that the "OVER" clause is one of these.)
Oracle is also far more restrictive about placing subqueries in SQL. SQL Server is far more flexible and permissive about allowing subqueries almost anywhere.
SQL Server has a rational way to select the "top" row of a result: "SELECT TOP 1 FROM CUSTOMERS ORDER BY SALES_TOTAL". In Oracle, this becomes "SELECT * FROM (SELECT CUSTOMERS ORDER BY SALES_TOTAL) WHERE ROW_NUMBER <= 1".
And of course there's always Oracle's infamous SELECT (expression) FROM DUAL.
Edit to add:
Now that I'm at work and can access some of my examples, here's a good one. This is generated by LINQ-to-SQL, but it's a clean query to select rows 41 through 50 from a table, after sorting. It uses the "OVER" clause:
SELECT [t1].[CustomerID], [t1].[CompanyName], [t1].[ContactName], [t1].[ContactTitle], [t1].[Address], [t1].[City], [t1].[Region], [t1].[PostalCode], [t1].[Country], [t1].[Phone], [t1].[Fax]
FROM (
SELECT ROW_NUMBER() OVER (ORDER BY [t0].[ContactName]) AS [ROW_NUMBER], [t0].[CustomerID], [t0].[CompanyName], [t0].[ContactName], [t0].[ContactTitle], [t0].[Address], [t0].[City], [t0].[Region], [t0].[PostalCode], [t0].[Country], [t0].[Phone], [t0].[Fax]
FROM [dbo].[Customers] AS [t0]
) AS [t1]
WHERE [t1].[ROW_NUMBER] BETWEEN 40 + 1 AND 40 + 10
ORDER BY [t1].[ROW_NUMBER]
Common here on SO
ISNULL (SQL Server)
NVL ('Orable)
IFNULL (MySQL, DB2?)
COALESCE (ANSI)
To answer exactly:
ISNULL can easily give different results as COALESCE on SQL Server because of data type precedence, as per my answer/comments here