Why Microsoft invented KQL rather than using SQL for Azure Data Explorer

Azure Data Explorer uses the Kusto Query Language (KQL). Why didn't Microsoft just use SQL?
LogEvents
| where StartTime > datetime(2021-12-31)
| where EventType == 'Error'
| project StartTime, EventType, Message
The same could be written in SQL, saving developers the effort of learning a new language:
select StartTime, EventType, Message from LogEvents where StartTime > ....

And from another perspective -
Why KQL (and not just SQL)?
The SQL query structure embodies some bad design decisions.
Starting a query with a SELECT clause (and not a FROM clause) harms the natural flow of query writing and the chances for decent IntelliSense.
The relation between the GROUP BY clause and the SELECT clause (which contains the aggregate functions) has a similar effect.
The SQL result dataset schema is always known in advance.
While useful from multiple perspectives, it is also very limiting. Consider plugins like pivot or bag_unpack (without an output schema).
SQL has a rigid structure that requires cumbersome nesting.
Consider aggregation over aggregation, or filtering on a window function (Teradata has had the QUALIFY operator for many years, and it was recently adopted by Spark as well, but it is not part of the SQL standard).
SQL expression aliasing is very limited (with Teradata as an exception), leading to cumbersome query nesting and code duplication, which harms readability and is prone to errors.
SQL has various filtering clauses (WHERE / HAVING / QUALIFY) for different scenarios, which is quite confusing for users.
SQL is standardized only on its basics.
Different RDBMS / SQL engines use different data types, functions, hints, non-standard clauses, etc. Sometimes the exact same syntax leads to different results (consider a window function with an ORDER BY clause, or even something allegedly as trivial as a substring function that gets a negative value as a start position).
If different SQL engines are so different from each other, why be bound by SQL at all?
SQL development (as a standard) seems to be stuck.
It is somewhere between painfully slow and non-existent.
Spark SQL has made massive progress, but this is an initiative around a single SQL engine, detached from the SQL standard.
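The nesting complaint above is easy to demonstrate. In standard SQL there is no QUALIFY, so filtering on a window function forces an extra level of nesting. A minimal sketch using SQLite via Python (SQLite is used here only as a convenient standards-ish engine; the table and data are illustrative):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE sales (region TEXT, amount INT);
    INSERT INTO sales VALUES
        ('north', 10), ('north', 30), ('south', 20), ('south', 5);
""")

# Without QUALIFY, filtering on a window function requires wrapping
# the ranked query in a derived table and filtering in an outer query.
rows = con.execute("""
    SELECT region, amount
    FROM (
        SELECT region, amount,
               ROW_NUMBER() OVER (PARTITION BY region
                                  ORDER BY amount DESC) AS rn
        FROM sales
    )
    WHERE rn = 1
    ORDER BY region
""").fetchall()

print(rows)  # the top-selling row per region
```

With a QUALIFY clause (Teradata, Spark, and a few others), the derived table would disappear and the filter would sit directly in the ranked query.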

Actually, while KQL is the preferred language of Azure Data Explorer, it does support basic SQL.
See T-SQL support.
Please note the intention here is SQL in the T-SQL dialect, not the T-SQL procedural language.
--
with LogEvents as (select now() as StartTime, 'Error' as EventType, 'Hello World' as Message)
select StartTime, EventType, Message
from LogEvents
where StartTime > cast('2021-12-31' as date) and EventType = 'Error'
StartTime                   | EventType | Message
----------------------------|-----------|------------
2022-11-29T17:16:43.414266Z | Error     | Hello World

Related

Adding conditions to database queries in an orthogonal way

The problem at hand is how to add conditions to SQL database queries issued by pre-existing applications in an "orthogonal way", meaning that this operation should be done in such a way that applications need not be concerned. In case you are curious, the actual problem involves adding multitenancy-related criteria that should keep tenants from getting at each other's data. In practical terms, that means adding extra conditions to WHERE clauses (in possibly multiple locations within the query).
My team has been working on a solution which involves "intercepting" and parsing the queries, before they are sent to the DBMS. As many of you probably know, SQL parsing is not the simplest thing to implement, especially when you need to support most of the major DBMS' syntax extensions.
Another idea which has been thrown around is that one could let the DBMS itself handle the criteria injection. It is, after all, the one element which has full knowledge over the query syntax and (hopefully) should have no problem tinkering with it before carrying out its execution. The problem then would be figuring out how to pass the multitenancy metadata to the DBMS. Is this viable at all? Is this a bad idea?
Any other ideas on how this problem might be approached?
Thank you for your time.
Would it not be easier to work through views, each view restricted to the applicable user's data? The pre-written SQL could use a base name for the view, which is then modified in code to add a prefix or suffix to the view name to give the user's view.
Example:
Table tenant_data has views named tenant_data_user1 and tenant_data_user2.
Your SQL is select col1, col2 from tenant_data_{view}
Your program code obtains the name of the current user (user1 or user2), replaces {view} with their userid in the SQL, then executes the SQL.
Depending upon the DBMS (and language?) you are using, you could probably grant access so that user1 can only use the xxx_user1 views and so on, so there is no chance of them accessing the wrong data by misusing a view or directly accessing the underlying table.
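The view-per-tenant idea above can be sketched end to end. A minimal example using SQLite via Python (the table, view names, and data are illustrative, and SQLite has no per-user grants, so only the name-substitution part is shown):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE tenant_data (tenant TEXT, col1 TEXT, col2 INT);
    INSERT INTO tenant_data VALUES
        ('user1', 'a', 1), ('user2', 'b', 2);

    -- One restricted view per tenant, as suggested above.
    CREATE VIEW tenant_data_user1 AS
        SELECT col1, col2 FROM tenant_data WHERE tenant = 'user1';
    CREATE VIEW tenant_data_user2 AS
        SELECT col1, col2 FROM tenant_data WHERE tenant = 'user2';
""")

def query_for(user):
    # The application substitutes the current user's id into the base
    # view name; it never filters on the tenant column itself.
    sql = "SELECT col1, col2 FROM tenant_data_{view}"
    return con.execute(sql.format(view=user)).fetchall()

print(query_for("user1"))  # only user1's rows
```

On a real multi-user DBMS you would additionally GRANT each user access only to their own view, so the substitution cannot be abused.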
Using server-side technology
The ideal place to implement this functionality is on the server side, of course. Some databases implement features like Oracle's Virtual Private Database that do exactly what you need.
You could also emulate the feature by replacing all your direct table access by (possibly updatable) views, which contain a filter on relevant columns using SYS_CONTEXT (in Oracle). Using this approach, database clients will never be able to bypass those additional predicates that you will be adding everywhere.
Using client-side technology
You didn't mention the technology stack you are using, but in Java, this could be done with jOOQ. jOOQ ships with a SQL parser that can parse the SQL statement into an expression tree. One of the main use-cases for this is to translate SQL from one dialect to another, as can be seen here:
https://www.jooq.org/translate
Top 5 Hidden jOOQ Features - Parsing Connection
E.g. this statement that runs on SQL Server or PostgreSQL:
try (DSLContext ctx = DSL.using("...");
     Connection c = ctx.parsingConnection(); // Magic here
     Statement s = c.createStatement();
     ResultSet rs = s.executeQuery(
         "SELECT * FROM (VALUES (1), (2), (3)) t(a)")) {
    while (rs.next())
        System.out.println(rs.getInt(1));
}
Could be translated to this equivalent statement in Oracle:
select t.a from (
(select null a from dual where 1 = 0) union all
(select * from (
(select 1 from dual) union all
(select 2 from dual) union all
(select 3 from dual)
) t)
) t
Once you have the jOOQ expression tree available, you could transform it using jOOQ's VisitListener as can be seen in these articles:
Row Level Security implementation in JOOQ
Implementing Client-Side Row-Level Security with jOOQ
Disclaimers:
jOOQ can only parse the SQL functionality that is also supported by the jOOQ API. The grammar can be seen here. This may not be enough for your existing applications, depending on how much vendor-specific functionality you're using.
I work for the company behind jOOQ, so obviously, this answer is biased.

PostgreSQL force standard SQL syntax

Is it possible to have Postgres reject queries which use its proprietary extensions to the SQL language?
e.g. select a::int from b; should throw an error, forcing the use of proper casts as in select cast(a as int) from b;
Perhaps more to the point is the question of whether it is possible to write SQL that is supported by all RDBMS with the same resulting behaviour?
PostgreSQL has no such feature. Even if it did, it wouldn't help you tons because interpretations of the SQL standard vary, support for standard syntax and features vary, and some DBs are relaxed about restrictions that others enforce or have limitations others don't. Syntax is the least of your problems.
The only reliable way to write cross-DB portable SQL is to test that SQL on every target database as part of an automated test suite. And to swear a lot.
In many places the query parser/rewriter transforms the standard "spelling" of a query into the PostgreSQL internal form, which will be emitted on dump/reload. In particular, PostgreSQL doesn't store the raw source code for things like views, check constraint expressions, index expressions, etc. It stores the internal parse tree, and reconstructs the source from that when it's asked to dump or display the object.
For example:
regress=> CREATE TABLE sometable ( x varchar(100) );
CREATE TABLE
regress=> CREATE VIEW someview AS SELECT CAST (x AS integer) FROM sometable;
CREATE VIEW
regress=> SELECT pg_get_viewdef('someview');
pg_get_viewdef
-------------------------------------
SELECT (sometable.x)::integer AS x
FROM sometable;
(1 row)
It'd be pretty useless anyway, since the standard fails to specify some pretty common and important pieces of functionality and often has rather ambiguous specifications of things it does define. Until recently it didn't define a way to limit the number of rows returned by a query, for example, so every database had its own different syntax (TOP, LIMIT / OFFSET, etc).
Other things the standard specifies are not implemented by most vendors, so using them is pretty pointless. Good luck using the SQL-standard generated and identity columns across all DB vendors.
It'd be quite nice to have a "prefer standard spelling" dump mode, that used CAST instead of ::, etc, but it's really not simple to do because some transformations aren't 1:1 reversible, e.g.:
regress=> CREATE VIEW v AS SELECT '1234' SIMILAR TO '%23%';
CREATE VIEW
regress=> SELECT pg_get_viewdef('v');
SELECT ('1234'::text ~ similar_escape('%23%'::text, NULL::text));
or:
regress=> CREATE VIEW v2 AS SELECT extract(dow FROM current_date);
CREATE VIEW
regress=> SELECT pg_get_viewdef('v2');
SELECT date_part('dow'::text, ('now'::text)::date) AS date_part;
so you see that significant changes would need to be made to how PostgreSQL internally represents and works with functions and expressions before what you want would be possible.
Lots of the SQL standard stuff uses funky one-off syntax that PostgreSQL converts into function calls and casts during parsing, so it doesn't have to add special-case features every time the SQL committee has another brain-fart and pulls some new creative bit of syntax out of ... somewhere. Changing that would require adding tons of new expression node types and general mess, all for no real gain.
Perhaps more to the point is the question of whether it is possible to write SQL that is supported by all RDBMS with the same resulting behaviour?
No, not even for many simple statements:
select top 10 ... -- tsql
select ... limit 10 -- everyone else
Many more examples exist. Use an ORM or something similar if you want to insulate yourself from database choice.
If you do write SQL by hand, then trying to follow the SQL standard is always a good choice :-)
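The divergence is easy to see in practice. A quick check using SQLite via Python (standing in here for "everyone else"): the LIMIT spelling runs, while T-SQL's TOP is rejected outright as a syntax error.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE t (x INT);
    INSERT INTO t VALUES (1), (2), (3);
""")

# LIMIT works on SQLite (and PostgreSQL, MySQL, ...).
limited = con.execute("SELECT x FROM t ORDER BY x LIMIT 2").fetchall()

# T-SQL's SELECT TOP is simply a syntax error on this engine.
try:
    con.execute("SELECT TOP 2 x FROM t ORDER BY x")
    top_works = True
except sqlite3.OperationalError:
    top_works = False

print(limited, top_works)
```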
You could use a tool like Mimer's SQL Validator to validate that queries follow the SQL spec before running them:
http://developer.mimer.com/validator/parser92/index.tml
You could force users to write queries in HQL or JPQL, which would then get translated into the correct SQL dialect for your database.

Cassandra CQL - NoSQL or SQL

I am pretty new to Cassandra, just started learning Cassandra a week ago.
I first read that it was a NoSQL database, but when I started using CQL,
I started to wonder whether Cassandra is a NoSQL or a SQL DB.
Can someone explain why CQL is more or less like SQL?
CQL is declarative like SQL and the very basic structure of the query component of the language (select things where condition) is the same. But there are enough differences that one should not approach using it in the same way as conventional SQL.
The obvious items:
1. There are no joins or subqueries.
2. No transactions.
Less obvious but equally important to note:
Except for the primary key, you can only apply a WHERE condition on a column if you have created an index on that column. In SQL, you don't have to index a column to filter on it but in CQL the select statement will fail outright.
There are no OR or NOT logical operators, only AND. It is very important to model your data so you won't need these two; it is very easy to accidentally forget.
Date handling is profoundly different. CQL permits ONLY the equality operator for timestamps, so extremely common and useful expressions like this do not work: where dateField > TO_TIMESTAMP('2013-01-01','YYYY-MM-DD'). Also, CQL does not permit string insertion of dates accurate to millis (seconds only), but it does permit entry of millis since the epoch as a long int, which most other DB engines do NOT permit. Lastly, the timezone (as a GMT offset) is invisibly captured for both the long-millis and string formats without a timezone. This can lead to confusion for systems that deliberately do not conflate local time and GMT offset.
You can ONLY update a table based on primary key (or an IN list of primary keys). You cannot update based on other column data, nor can you do a mass update like this: update table set field = value; CQL demands a where clause with the primary key.
The grammar for AND does not permit parens. To be fair, parens are not necessary given the lack of an OR operator, but this means traditional SQL rewriters that add "protective" parens around expressions will not work with CQL, e.g.: select * from www where (str1 = 'foo2') and (dat1 = 12312442);
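For the millis-since-epoch point above, the long value CQL accepts is easy to compute client-side, which sidesteps both the string-format and seconds-only precision issues. A small Python sketch (the date is illustrative):

```python
from datetime import datetime, timezone

# CQL accepts timestamps as a long of milliseconds since the Unix
# epoch. Computing it explicitly in UTC avoids the invisible
# local-timezone capture described above.
dt = datetime(2013, 1, 1, tzinfo=timezone.utc)
epoch_millis = int(dt.timestamp() * 1000)
print(epoch_millis)
```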
In general, it is best to use Cassandra as a big, resilient permastore of data for which a small number of very high level, very high performance queries can be applied to drag out a subset of data to work with at the application layer. That subset might be 1 million rows, yes. CQL and the Cassandra model is not designed for 2 page long SELECT statements with embedded cases, aggregations, etc. etc.
For all intents and purposes, CQL is SQL, so in the strictest sense Cassandra is an SQL database. However, most people closely associate SQL with the relational databases it is usually applied to. Under this (mis)interpretation, Cassandra should not be considered an "SQL database" since it is not relational, and does not support ACID properties.
Docs for CQLV3.0
CQL DESCRIBE to get schema of keyspace, column family, cluster
CQL doesn't support some things known from SQL, such as joins, GROUP BY, triggers, cursors, procedures, transactions, and stored procedures.
CQL3.0 Supports ORDER BY
CQL Supports all DML and DDL functionalities
CQL Supports BATCH
BATCH is not an analogue for SQL ACID transactions.
The docs mentioned above are the best reference :)

Are sub-SELECT's in SELECT and FROM clauses standard-compliant?

The title pretty much says it all. By "standard-compliant" SQL I mean SQL constructs allowed in any of the SQL standards.
I looked through the "Understanding SQL" book, but it mentions subqueries only inside WHERE, GROUP BY, HAVING etc. clauses, not SELECT and FROM (or maybe I'm missing something).
I know MS SQL allows sub-SELECT's in SELECT and FROM. I would like to know if it is a standard behavior. Or maybe it isn't standard, but is now implemented in major SQL databases (I have very little experience with DB's other than MS SQL)?
Yes. You can use a subquery as a derived table wherever you can use a table in a select statement.
SQL ANSI 92
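Both forms (a derived table in FROM and a scalar subquery in the SELECT list) work across mainstream engines. A quick check using SQLite via Python (table and data are illustrative):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE orders (customer TEXT, total INT);
    INSERT INTO orders VALUES ('ann', 10), ('ann', 20), ('bob', 5);
""")

rows = con.execute("""
    SELECT customer,
           spent,
           -- scalar subquery in the SELECT list
           (SELECT MAX(total) FROM orders) AS biggest_order
    FROM (
        -- derived table: a subquery in the FROM clause
        SELECT customer, SUM(total) AS spent
        FROM orders
        GROUP BY customer
    )
    ORDER BY customer
""").fetchall()

print(rows)
```

Note that a scalar subquery in the SELECT list must return at most one row and one column; the derived table has no such restriction.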

Plain SQL vs Dialects

DBMS Vendors use SQL dialect features to differentiate their product, at the same time claiming to support SQL standards. 'Nuff said on this.
Is there any example of SQL you have coded that can't be translated to SQL:2008 standard SQL ?
To be specific, I'm talking about DML (a query statement), NOT DDL, stored procedure syntax or anything that is not a pure SQL statement.
I'm also talking about queries you would use in Production, not for ad-hoc stuff.
Edit Jan 13
Thanks for all of your answers: they have conveyed to me the impression that a lot of DBMS-specific SQL is created to allow workarounds for poor relational design, which leads me to the conclusion that you probably wouldn't want to port most existing applications.
Typical differences include subtly different semantics (for example, Oracle handles NULLs differently from other SQL dialects in some cases), different exception-handling mechanisms, different types, and proprietary methods for doing things like string operations, date operations, or hierarchical queries. Query hints also tend to have syntax that varies across platforms, and different optimisers may get confused by different types of constructs.
One can use ANSI SQL for the most part across database systems and expect to get reasonable results on a database with no significant tuning issues like missing indexes. However, on any non-trivial application there is likely to be some requirement for code that cannot easily be done portably.
Typically, this requirement will be fairly localised within an application code base - a handful of queries where this causes an issue. Reporting is much more likely to throw up this type of issue and doing generic reporting queries that will work across database managers is very unlikely to work well. Some applications are more likely to cause grief than others.
Therefore, it is unlikely that relying on 'portable' SQL constructs for an application will work in the general case. A better strategy is to use generic statements where they will work and break out to a database specific layer where this does not work.
A generic query mechanism could be to use ANSI SQL where possible; another possible approach would be to use an O/R mapper, which can take drivers for various database platforms. This type of mechanism should suffice for the majority of database operations but will require you to do some platform-specifc work where it runs out of steam.
You may be able to use stored procedures as an abstraction layer for more complex operations and code a set of platform specific sprocs for each target platform. The sprocs could be accessed through something like ADO.net.
In practice, subtle differences in parameter passing and exception handling may cause problems with this approach. A better approach is to produce a module that wraps the platform-specific database operations in a common interface. Different 'driver' modules can be swapped in and out depending on which DBMS platform you are using.
Oracle has some additions, such as MODEL or hierarchical queries, that are very difficult, if not impossible, to translate into pure SQL.
Even when SQL:2008 can do something, sometimes the syntax is not the same. Take regex matching, for example: SQL:2008 uses LIKE_REGEX vs MySQL's REGEXP.
And yes, I agree, it's very annoying.
Part of the problem with Oracle is that it's still based on the SQL 1992 ANSI standard. SQL Server is on SQL 1999 standard, so some of the things that look like "extensions" are in fact newer standards. (I believe that the "OVER" clause is one of these.)
Oracle is also far more restrictive about placing subqueries in SQL. SQL Server is far more flexible and permissive about allowing subqueries almost anywhere.
SQL Server has a rational way to select the "top" row of a result: SELECT TOP 1 * FROM CUSTOMERS ORDER BY SALES_TOTAL. In Oracle, this becomes SELECT * FROM (SELECT * FROM CUSTOMERS ORDER BY SALES_TOTAL) WHERE ROWNUM <= 1.
And of course there's always Oracle's infamous SELECT (expression) FROM DUAL.
Edit to add:
Now that I'm at work and can access some of my examples, here's a good one. This is generated by LINQ-to-SQL, but it's a clean query to select rows 41 through 50 from a table, after sorting. It uses the "OVER" clause:
SELECT [t1].[CustomerID], [t1].[CompanyName], [t1].[ContactName], [t1].[ContactTitle], [t1].[Address], [t1].[City], [t1].[Region], [t1].[PostalCode], [t1].[Country], [t1].[Phone], [t1].[Fax]
FROM (
SELECT ROW_NUMBER() OVER (ORDER BY [t0].[ContactName]) AS [ROW_NUMBER], [t0].[CustomerID], [t0].[CompanyName], [t0].[ContactName], [t0].[ContactTitle], [t0].[Address], [t0].[City], [t0].[Region], [t0].[PostalCode], [t0].[Country], [t0].[Phone], [t0].[Fax]
FROM [dbo].[Customers] AS [t0]
) AS [t1]
WHERE [t1].[ROW_NUMBER] BETWEEN 40 + 1 AND 40 + 10
ORDER BY [t1].[ROW_NUMBER]
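The ROW_NUMBER() idiom above and LIMIT/OFFSET fetch the same page of rows. A sketch comparing the two using SQLite via Python (which happens to support both; the table and data are illustrative):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE customers (contact_name TEXT)")
con.executemany("INSERT INTO customers VALUES (?)",
                [(f"name{i:03d}",) for i in range(100)])

# Rows 41..50 via ROW_NUMBER(), mirroring the generated query above.
via_rownum = con.execute("""
    SELECT contact_name FROM (
        SELECT ROW_NUMBER() OVER (ORDER BY contact_name) AS rn,
               contact_name
        FROM customers
    ) WHERE rn BETWEEN 40 + 1 AND 40 + 10
    ORDER BY rn
""").fetchall()

# The same page via LIMIT/OFFSET, the spelling most other engines use.
via_limit = con.execute("""
    SELECT contact_name FROM customers
    ORDER BY contact_name LIMIT 10 OFFSET 40
""").fetchall()

print(via_rownum == via_limit)  # both return the same ten rows
```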
Common here on SO
ISNULL (SQL Server)
NVL (Oracle)
IFNULL (MySQL, DB2?)
COALESCE (ANSI)
To answer exactly:
ISNULL can easily give different results than COALESCE on SQL Server because of data type precedence, as per my answer/comments here
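Of the spellings above, COALESCE is the portable ANSI one. A quick check using SQLite via Python, which implements both COALESCE and the two-argument IFNULL (the values are illustrative):

```python
import sqlite3

con = sqlite3.connect(":memory:")

# COALESCE is the ANSI spelling and accepts any number of arguments,
# returning the first non-NULL one; IFNULL is a two-argument synonym.
row = con.execute("""
    SELECT COALESCE(NULL, NULL, 'fallback'),
           IFNULL(NULL, 'fallback')
""").fetchone()

print(row)
```

The data-type-precedence difference between ISNULL and COALESCE mentioned above is specific to SQL Server's static typing and cannot be reproduced in dynamically typed SQLite.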