In what order does a SQL statement execute if the SELECT statement has a CONCAT function on some columns?

I have been trying to understand some code that is part of the BigQuery course on Coursera. The query looks like this:
SELECT
CONCAT(fullVisitorId, CAST(visitID AS STRING)) AS unique_session_id,
sessionQualityDim,
SUM(productRevenue) AS transactions_revenue
FROM
transaction_table
WHERE sessionQualityDim > 60
GROUP BY unique_session_id, sessionQualityDim
My question is about the order in which this SQL statement executes. The GROUP BY is done on unique_session_id, which is the CONCAT of two columns, so how does GROUP BY know about the computed CONCAT result (unique_session_id)? As far as I know, the SELECT clause runs last. But in this case it seems that the first field is calculated using CONCAT and then GROUP BY uses it for grouping. Can someone give more insight into this?

According to the SQL standard, the GROUP BY is parsed before the SELECT.
However, this is not a hard-and-fast rule among databases. What BigQuery is doing is determining the column aliases from the SELECT. It then allows these aliases in the GROUP BY. Other databases do this as well -- Postgres and all its derived databases for example.
Do not confuse the parsing of the query with the execution, though. The execution is through some very complicated parallel directed acyclic graph. What is happening here is simply that the BigQuery parser is (conveniently) allowing users to use column aliases in GROUP BY.
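To see the alias resolution in action, here is a minimal sketch using Python's built-in sqlite3 module. SQLite stands in for BigQuery here (an assumption: it also accepts SELECT aliases in GROUP BY), the sample rows are made up, and `||` replaces CONCAT because older SQLite builds lack that function. Grouping by the alias and grouping by the full expression return the same rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE transaction_table ("
    "fullVisitorId TEXT, visitId INTEGER, "
    "sessionQualityDim INTEGER, productRevenue REAL)"
)
# Made-up sample rows; the last one is removed by the WHERE filter.
conn.executemany(
    "INSERT INTO transaction_table VALUES (?, ?, ?, ?)",
    [
        ("A", 1, 70, 10.0),
        ("A", 1, 70, 5.0),
        ("B", 2, 90, 7.0),
        ("C", 3, 40, 99.0),
    ],
)

select_part = """
SELECT fullVisitorId || CAST(visitId AS TEXT) AS unique_session_id,
       sessionQualityDim,
       SUM(productRevenue) AS transactions_revenue
FROM transaction_table
WHERE sessionQualityDim > 60
"""

# GROUP BY refers to the SELECT alias.
by_alias = conn.execute(
    select_part + "GROUP BY unique_session_id, sessionQualityDim ORDER BY 1"
).fetchall()

# GROUP BY restates the expression, as the SQL standard expects.
by_expression = conn.execute(
    select_part
    + "GROUP BY fullVisitorId || CAST(visitId AS TEXT), sessionQualityDim ORDER BY 1"
).fetchall()

print(by_alias)
print(by_alias == by_expression)
```

The alias is only a name the parser substitutes for the expression. It says nothing about when the SELECT list is actually computed.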

Related

Why don't DISTINCT and ORDER BY work together in a SQL query?

I am learning how ORDER BY is used in SQL queries. I learned that ORDER BY and DISTINCT don't work together, but when I tried it in practice, it worked. Even after asking ChatGPT, I am still confused about the relationship between ORDER BY and DISTINCT.
I learned that when executing SQL queries, the ORDER BY clause comes after the SELECT clause. This means that the database will first retrieve the data specified in the SELECT clause, and then sort it based on the criteria specified in the ORDER BY clause. If the column used in the ORDER BY clause is not present in the SELECT clause, the database will automatically include that column in the selection, sort on both columns, and return only the columns listed in SELECT.
However, when using both DISTINCT and ORDER BY together, the outcome may not be what is expected. This is because DISTINCT acts on both the columns in the SELECT clause and the column in the ORDER BY clause. This may cause unexpected results, especially in MySQL.
When I tried this in practice, it still produced the desired results, which makes me wonder whether I learned something incorrectly or whether there is information I am missing.
I am using the MySQL database.
It seems there are some spaces before your unique value. It would be appropriate to apply TRIM/RTRIM and remove these spaces in the DISTINCT clause itself.
It should be something like this:
DISTINCT TRIM(value) AS trim_value
...
ORDER BY trim_value
Also, it is possible that these are not spaces but some other characters, which would need to be replaced, too.
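A small sketch of the effect, using Python's sqlite3 module rather than MySQL (an assumption made so the example is self-contained). The table `items`, the column `value`, and the rows are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (value TEXT)")
# The same two words, once with and once without leading spaces.
conn.executemany(
    "INSERT INTO items VALUES (?)",
    [("  apple",), ("apple",), ("banana",), (" banana",)],
)

# Without TRIM, the padded strings count as separate distinct values.
raw = conn.execute(
    "SELECT DISTINCT value FROM items ORDER BY value"
).fetchall()

# With TRIM inside DISTINCT, the duplicates collapse before sorting.
trimmed = conn.execute(
    "SELECT DISTINCT TRIM(value) AS trim_value FROM items ORDER BY trim_value"
).fetchall()

print(len(raw))
print(trimmed)
```

If the stray characters are tabs or non-breaking spaces, a plain TRIM will not remove them and they would have to be replaced explicitly.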

When does the aliasing take effect? [duplicate]

I have a question regarding aliases in SQL. If I want to use an alias in the same query, can I? For example:
Consider a table named xyz with columns a and b
select (a/b) as temp , temp/5 from xyz
Is this possible in some way?
You are talking about giving an identifier to an expression in a query and then reusing that identifier in other parts of the query?
That is not possible in Microsoft SQL Server, to which nearly all of my SQL experience is limited. You can, however, do the following.
SELECT temp, temp / 5
FROM (
SELECT (a/b) AS temp
FROM xyz
) AS T1
Obviously that example isn't particularly useful, but if you were using the expression in several places it may be more useful. It can come in handy when the expressions are long and you want to group on them too because the GROUP BY clause requires you to re-state the expression.
In MSSQL you also have the option of creating computed columns which are specified in the table schema and not in the query.
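Here is a runnable sketch of that workaround using Python's sqlite3 module (SQLite is an assumption standing in for SQL Server, and it happens to reject same-level alias reuse as well). The alias is renamed to `ratio`, a hypothetical name, to stay clear of the TEMP keyword, and the rows are made up:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE xyz (a REAL, b REAL)")
conn.executemany("INSERT INTO xyz VALUES (?, ?)", [(10.0, 2.0), (30.0, 3.0)])

# Reusing the alias inside the same SELECT list is rejected.
try:
    conn.execute("SELECT (a/b) AS ratio, ratio/5 FROM xyz").fetchall()
    alias_reuse_error = None
except sqlite3.OperationalError as exc:
    alias_reuse_error = str(exc)

# The derived table defines the alias one level down,
# so the outer query can use it as an ordinary column.
rows = conn.execute(
    "SELECT ratio, ratio / 5 "
    "FROM (SELECT (a/b) AS ratio FROM xyz) AS T1 "
    "ORDER BY ratio"
).fetchall()

print(alias_reuse_error)
print(rows)
```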
You can use the Oracle WITH statement too. There are similar statements available in other databases. Here is the one we use for Oracle.
with t
as (select a/b as temp
from xyz)
select temp, temp/5
from t
/
This can have a performance advantage, particularly for complex queries involving several nested queries, because the WITH subquery can be evaluated once and reused by the statements that reference it.
Not possible in the same SELECT clause, assuming your SQL product is compliant with entry level Standard SQL-92.
Expressions (and their correlation names) in the SELECT clause come into existence 'all at once'; there is no left-to-right evaluation that you seem to hope for.
As per @Josh Einstein's answer here, you can use a derived table as a workaround (hopefully using a more meaningful name than 'temp' and providing one for the temp/5 expression -- keep in mind the person who will inherit your code).
Note that the code you posted would work on the MS Access Database Engine (and would assign a meaningless correlation name such as Expr1 to your second expression) but then again it is not a real SQL product.
It's possible, I guess:
SELECT (A/B) as temp, (temp/5)
FROM xyz,
(SELECT numerator_field as A, Denominator_field as B FROM xyz),
(SELECT (numerator_field/denominator_field) as temp FROM xyz);
This is now available in Amazon Redshift
E.g.
select clicks / impressions as probability, round(100 * probability, 1) as percentage from raw_data;
Ref:
https://aws.amazon.com/about-aws/whats-new/2018/08/amazon-redshift-announces-support-for-lateral-column-alias-reference/
You might find W3Schools "SQL Alias" to be of good help.
Here is an example from their tutorial:
SELECT po.OrderID, p.LastName, p.FirstName
FROM Persons AS p,
Product_Orders AS po
WHERE p.LastName='Hansen' AND p.FirstName='Ola'
Regarding using the Alias further in the query, depending on the database you are using it might be possible.

Is it possible to have an SQL query that uses AGG functions in this way?

Assuming I have the following aggregate functions:
AGG1
AGG2
AGG3
AGG4
Is it possible to write valid SQL (in a db agnostic way) like this:
SELECT [COL1, COL2 ....], AGG1(param1), AGG2(param2) FROM [SOME TABLES]
WHERE [SOME CRITERIA]
GROUP BY COL1, COL2, ... COLN
HAVING AGG3(param2) > -1 AND AGG4(param4) < 123
ORDER BY COL1, COLN ASC
LIMIT 10
Where COL1 ... COLN are columns in the tables being queried, and param1 ... paramX are parameters passed to the AGG funcs.
Note: AGG1 and AGG2 are returned in the results as columns (but do not appear in the HAVING clause), and AGG3 and AGG4 appear in the HAVING clause but are not returned in the result set.
Ideally, I want a DB agnostic answer to the solution, but if I have to be tied to a db, I am using PostgreSQL (v9.x).
Edit
Just a matter of clarification: I am not opposed to using GROUP BY in the query. My SQL is not very good, so the example SQL above may have been slightly misleading. I have edited the pseudo sql statement above to hopefully make my intent more clear.
The main thing I wanted to find out was whether a select query that used AGG functions could:
Have agg functions values in the returned column without them being specified in a HAVING clause.
Have agg functions specified in a HAVING clause, but are not returned in the result set.
From the answers I have received so far, it would seem the answer to both questions is YES. The only thing I have to do to correct my SQL is to add a GROUP BY clause to make sure that the returned rows are unique.
PostgreSQL major versions include the first digit after the dot, so "PostgreSQL (v9.x)" is not specific enough. As @kekekela said, there is no (cheap) completely db-agnostic way. Even between PostgreSQL 9.0 and 9.1 there is an important syntactical difference.
If you had only the aggregated values AGG1(param1), AGG2(param2), you would get away without providing an explicit GROUP BY clause. Since you mix aggregated and non-aggregated columns, you have to provide a GROUP BY clause with all non-aggregated columns that appear in the SELECT. That's true for any version of PostgreSQL. Read about GROUP BY and HAVING in the manual.
Starting with version 9.1, however, once you list a primary key in the GROUP BY you can skip additional columns for this table and still use them in the SELECT list. The release notes for version 9.1 tell us:
Allow non-GROUP BY columns in the query target list when the primary
key is specified in the GROUP BY clause (Peter Eisentraut)
Concerning parameters
Do you intend to feed a constant value to an aggregate function? What's the point? The docs tell us
An aggregate function computes a single result from multiple input rows.
Or do you want those parameters to be column names? That kind of dynamic SQL works as long as the statement is generated before it is sent to the database. It does not work for prepared statements or for simple SQL or plpgsql functions. You have to use EXECUTE in a plpgsql function for that purpose.
As a safeguard against SQL injection, use the USING $1, $2 syntax for values and quote_ident() for your column or table names.
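The same split between identifiers and values can be sketched on the client side. What follows is a Python analogue using sqlite3, not plpgsql: the `quote_ident` function below is a hypothetical Python helper that only borrows the PostgreSQL function's name, the `?` placeholder plays the role of USING $1, and the `sales` table is made up:

```python
import sqlite3


def quote_ident(name: str) -> str:
    """Hypothetical helper: double-quote an identifier, doubling embedded quotes."""
    return '"' + name.replace('"', '""') + '"'


def sum_column(conn: sqlite3.Connection, column: str, min_value: float):
    # The column name is spliced in as a quoted identifier;
    # the value travels separately as a bound parameter.
    sql = (
        f"SELECT SUM({quote_ident(column)}) FROM sales "
        f"WHERE {quote_ident(column)} >= ?"
    )
    return conn.execute(sql, (min_value,)).fetchone()[0]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL, quantity INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("north", 10.0, 1), ("south", 20.0, 2), ("north", 5.0, 3)],
)

amount_total = sum_column(conn, "amount", 6)
quantity_total = sum_column(conn, "quantity", 2)
print(amount_total, quantity_total)
```

Quoting keeps a column name from breaking out of the statement, but checking the name against a list of known columns is still a sensible extra step.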
The only way to aggregate over columns without using GROUP BY is to use windowing functions. You left out details of your problem, but the following might be what you are looking for:
SELECT *
FROM (
SELECT [COL1, COL2 ....],
AGG1(param1) over (partition by some_grouping_column) as agg1,
AGG2(param2) over (partition by some_grouping_column) as agg2,
AGG3(param2) over (partition by some_grouping_column) as agg3,
AGG4(param4) over (partition by some_grouping_column) as agg4,
row_number() over () as rn
FROM [SOME TABLES]
WHERE [SOME CRITERIA]
ORDER BY COL1
) t
WHERE agg3 > -1
AND agg4 < 123
AND rn <= 10
ORDER BY col1
This is standard ANSI SQL and works on most databases including PostgreSQL (since 8.4).
Note that you do not need to use the same grouping column for both aggregates in the partition by clause.
If you want to stick with ANSI SQL then you should use the row_number() function to limit the result. If you run this only on PostgreSQL (or another DBMS that supports LIMIT in some way), move the LIMIT clause into the derived table (the inner query).
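For a concrete picture of the windowing approach, here is a minimal sketch with Python's sqlite3 module. It assumes the bundled SQLite is 3.25 or newer (the first release with window functions), and the `sales` table and its rows are invented for illustration. Every input row survives, each carrying the aggregate of its partition, and the outer query filters on that aggregate:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", 10), ("north", 30), ("south", 5)],
)

rows = conn.execute(
    """
    SELECT *
    FROM (
        SELECT region,
               amount,
               -- Window aggregates: no GROUP BY, rows are not collapsed.
               SUM(amount) OVER (PARTITION BY region) AS region_total,
               COUNT(*)    OVER (PARTITION BY region) AS region_rows
        FROM sales
    ) t
    WHERE region_total > 20   -- plays the role of a HAVING condition
    ORDER BY region, amount
    """
).fetchall()

print(rows)
```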
That should work from a high level perspective, except you'd need COL1, COL2 etc in a GROUP BY statement or else they won't be valid in the SELECT list. Having AGG1, etc in the SELECT list and not in the HAVING is not a problem.
As far as db agnostic, you're going to have to tweak syntax no matter what you do (the LIMIT for example is going to be different in PostgreSQL, SQL SERVER and Oracle that I know off the top of my head), but you could build logic to construct the statements properly for each provided your high-level representation is solid.
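A quick way to convince yourself of both points is a small sketch with Python's sqlite3 module (SQLite here instead of PostgreSQL, purely as an assumption to keep it self-contained; the `orders` table and its rows are made up). SUM and COUNT are returned but never appear in HAVING, while MAX and MIN appear only in HAVING:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount INTEGER, discount INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        ("ann", 10, 0),
        ("ann", 20, 5),
        ("bob", 7, 0),
        ("cid", 100, 50),
        ("cid", 100, 90),
    ],
)

rows = conn.execute(
    """
    SELECT customer,
           SUM(amount) AS total,      -- returned, not used in HAVING
           COUNT(*)    AS order_count -- returned, not used in HAVING
    FROM orders
    WHERE amount > 5
    GROUP BY customer
    HAVING MAX(discount) < 60         -- used in HAVING, not returned
       AND MIN(amount) > 0            -- used in HAVING, not returned
    ORDER BY customer
    LIMIT 10
    """
).fetchall()

print(rows)
```

Note the clause order: GROUP BY comes before HAVING. LIMIT is the part that will need rewriting for other databases.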

SQL "WITH" Clause/Statement

Before I post a lot of SQL statements to help solve my issue, I might be able to get the answer by asking a simple question. I use SQL Server 2005 on a daily basis and use the "WITH" clause to perform subqueries. I am unfortunately in a situation now where I have to use SQL Compact, which does not allow the use of the "WITH" clause to perform subqueries. What is the substitute for the "WITH" clause in SQL Compact? On average I am using 10 subqueries at a time.
As long as none of your CTEs (Common Table Expressions, the formal name for the feature you are using) are recursive, remember that in the simplest form,
;WITH Q1 As
(
SELECT columns FROM Table1
)
SELECT columns FROM Q1
can be roughly translated to:
SELECT columns FROM (SELECT columns FROM Table1) Q1
Note the 'Q1' on the end there. You have to give the subquery a name. The name you choose often doesn't matter, and so simple names are common here -- even just single letters. With 10 subqueries to string together, you might need to choose something more meaningful.
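The translation also works when CTEs are chained, since each one simply becomes one more level of nesting. Here is a sketch with Python's sqlite3 module (an assumption: SQLite supports both forms, so the two can be compared side by side, which SQL Compact cannot do). The columns `id` and `amount` and the second CTE `Q2` are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Table1 (id INTEGER, amount INTEGER)")
conn.executemany("INSERT INTO Table1 VALUES (?, ?)", [(1, 5), (2, 20), (3, 15)])

# Two chained CTEs: Q2 reads from Q1.
with_cte = """
WITH Q1 AS (SELECT id, amount FROM Table1 WHERE amount > 10),
     Q2 AS (SELECT id, amount * 2 AS doubled FROM Q1)
SELECT id, doubled FROM Q2 ORDER BY id
"""

# The same logic as nested, named subqueries.
nested = """
SELECT id, doubled
FROM (SELECT id, amount * 2 AS doubled
      FROM (SELECT id, amount FROM Table1 WHERE amount > 10) Q1) Q2
ORDER BY id
"""

cte_rows = conn.execute(with_cte).fetchall()
nested_rows = conn.execute(nested).fetchall()
print(cte_rows)
print(cte_rows == nested_rows)
```

With ten subqueries the nesting gets deep quickly, which is where meaningful subquery names start to pay off.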
Create a temp table with the result of each with clause; use the temp tables instead of the with clause.
