How to create custom abbreviations in SQL? - sql

Is it possible to "reuse" an SQL parameter in an SQL SELECT statement?
Similar to this pseudocode:
<DEFINE> city as c, destination as d
SELECT c, d FROM thetable t INNER JOIN(SELECT c FROM...) etc
(so that each parameter only has to be written out explicitly once)?

Most databases support common table expressions (CTEs), which let you express this as:
with t as (
select city as c, destination as d
from thetable
)
select c, d
from t;
This may not do what you are asking for. A common table expression is really a subquery, and within the subquery you can use shorter names. However, it will not affect column names across tables, unless you include more CTEs.
Also, CTEs can have performance implications in some databases. Some databases actually materialize them, creating intermediate tables, and that can affect performance. Other databases optimize them in place.
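To see the CTE approach end to end, here is a minimal sketch using Python's built-in sqlite3 module. The sample rows, the join condition and the alias origins are assumptions made up for illustration; only thetable, city, destination, c and d come from the question.

```python
import sqlite3

# In-memory database with hypothetical sample rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE thetable (city TEXT, destination TEXT)")
conn.executemany(
    "INSERT INTO thetable VALUES (?, ?)",
    [("Oslo", "Rome"), ("Lima", "Oslo"), ("Rome", "Paris")],
)

# The long column names are written once, inside the CTE.
# Everything after that, including the joined subquery, uses c and d.
query = """
WITH t AS (
    SELECT city AS c, destination AS d
    FROM thetable
)
SELECT t.c, t.d
FROM t
INNER JOIN (SELECT c FROM t) AS origins ON origins.c = t.d
ORDER BY t.c
"""
rows = conn.execute(query).fetchall()
print(rows)
```

The short names only exist inside this one statement; a second query would need its own WITH clause.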

It is possible to use common table expressions in PostgreSQL. As far as I know, MySQL did not support them before version 8.0. Database management systems differ in how closely they follow the SQL standard.

Related

When does the aliasing take effect? [duplicate]

I have a question about aliases in SQL. If I define an alias, can I use it in the same query? For example:
Consider a table named xyz with columns a and b:
select (a/b) as temp , temp/5 from xyz
Is this possible in some way?
You are talking about giving an identifier to an expression in a query and then reusing that identifier in other parts of the query?
That is not possible in Microsoft SQL Server, which is where nearly all of my SQL experience lies. However, you can do the following:
SELECT temp, temp / 5
FROM (
SELECT (a/b) AS temp
FROM xyz
) AS T1
Obviously that example isn't particularly useful, but if you were using the expression in several places it may be more useful. It can come in handy when the expressions are long and you want to group on them too because the GROUP BY clause requires you to re-state the expression.
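As a rough illustration of the GROUP BY case, the sketch below names the expression once in a derived table and then groups on the alias. It runs against SQLite through Python's sqlite3 module rather than SQL Server, and the sample rows and the alias ratio are assumptions, not part of the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE xyz (a REAL, b REAL)")
conn.executemany("INSERT INTO xyz VALUES (?, ?)", [(10, 2), (20, 4), (9, 3)])

# The expression a / b is written once, inside the derived table.
# The outer query selects and groups on its alias instead of repeating it.
query = """
SELECT ratio, COUNT(*) AS n
FROM (
    SELECT a / b AS ratio
    FROM xyz
) AS t1
GROUP BY ratio
ORDER BY ratio
"""
rows = conn.execute(query).fetchall()
print(rows)
```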
In MSSQL you also have the option of creating computed columns which are specified in the table schema and not in the query.
You can use the Oracle WITH clause too. Similar constructs are available in other databases. Here is the one we use for Oracle:
with t
as (select a/b as temp
from xyz)
select temp, temp/5
from t
/
This can have a performance advantage, particularly for complex queries involving several nested queries, because the database may evaluate the WITH clause only once and reuse the result in the rest of the statement.
Not possible in the same SELECT clause, assuming your SQL product is compliant with entry level Standard SQL-92.
Expressions (and their correlation names) in the SELECT clause come into existence 'all at once'; there is no left-to-right evaluation that you seem to hope for.
As per @Josh Einstein's answer here, you can use a derived table as a workaround (hopefully using a more meaningful name than 'temp' and providing one for the temp/5 expression; keep in mind the person who will inherit your code).
Note that the code you posted would work on the MS Access Database Engine (and would assign a meaningless correlation name such as Expr1 to your second expression), but then again it is not a real SQL product.
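The "all at once" behaviour can be observed in SQLite, used here through Python's sqlite3 module purely as a convenient test bed; other products may behave differently. The sample row and the aliases ratio, ratio_fifth and t are illustrative assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE xyz (a REAL, b REAL)")
conn.execute("INSERT INTO xyz VALUES (10, 2)")

# Referencing an alias inside the SELECT list that defines it.
try:
    conn.execute("SELECT (a / b) AS ratio, ratio / 5 FROM xyz").fetchall()
    alias_visible = True
except sqlite3.OperationalError as exc:
    alias_visible = False
    print("rejected:", exc)

# Derived-table workaround: the alias exists by the time the outer SELECT runs.
rows = conn.execute(
    "SELECT ratio, ratio / 5 AS ratio_fifth "
    "FROM (SELECT a / b AS ratio FROM xyz) AS t"
).fetchall()
print(rows)
```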
It's possible, I guess, if the expression is named once in a derived table (numerator_field and denominator_field stand in for your actual columns):
SELECT t.temp, t.temp / 5 AS temp_fifth
FROM (SELECT numerator_field / denominator_field AS temp FROM xyz) AS t;
This is now available in Amazon Redshift
E.g.
select clicks / impressions as probability, round(100 * probability, 1) as percentage from raw_data;
Ref:
https://aws.amazon.com/about-aws/whats-new/2018/08/amazon-redshift-announces-support-for-lateral-column-alias-reference/
You might find W3Schools "SQL Alias" to be of good help.
Here is an example from their tutorial:
SELECT po.OrderID, p.LastName, p.FirstName
FROM Persons AS p,
Product_Orders AS po
WHERE p.LastName='Hansen' AND p.FirstName='Ola'
As for using the alias further along in the query, that may be possible depending on the database you are using.

Any resources for this SQL filtering?

I have 100 tables, each on the order of a few tenths of a GB in size. The schema of each table is the following:
A: string | B: string | C: string
In each table I would like to retain only the rows whose (B, C) pair appears at least 10 times in the concatenation of all 100 tables. Is there an efficient way to achieve this?
This is a rather vague question, and not naming your DBMS doesn't help, since SQL comes in different dialects.
But first, you would have to combine all of the tables into one result set. There may be a faster way of doing this, but without knowing which flavor of SQL you are using it is hard to tell.
Something like this will work (UNION ALL rather than UNION, because UNION removes duplicate rows and would distort the counts):
SELECT * FROM table_1
UNION ALL
SELECT * FROM table_2
...
UNION ALL
SELECT * FROM table_100
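One thing worth checking when combining the tables is whether duplicate rows survive, because the later count depends on them. The sketch below compares UNION with UNION ALL on two tiny tables in SQLite via Python's sqlite3 module; the rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
for name in ("table_1", "table_2"):
    conn.execute(f"CREATE TABLE {name} (A TEXT, B TEXT, C TEXT)")

# table_1 holds the same row twice, and table_2 repeats it once more.
conn.executemany(
    "INSERT INTO table_1 VALUES (?, ?, ?)",
    [("a1", "b", "c"), ("a1", "b", "c")],
)
conn.executemany(
    "INSERT INTO table_2 VALUES (?, ?, ?)",
    [("a1", "b", "c"), ("a2", "b", "c")],
)

# UNION removes duplicate rows; UNION ALL keeps every row.
union_rows = conn.execute(
    "SELECT * FROM table_1 UNION SELECT * FROM table_2"
).fetchall()
union_all_rows = conn.execute(
    "SELECT * FROM table_1 UNION ALL SELECT * FROM table_2"
).fetchall()
print(len(union_rows), len(union_all_rows))
```

With plain UNION the (B, C) pair would be counted twice instead of four times, so a threshold on occurrences would give a different answer.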
Once you have all of the data you do something like this:
WITH tables_with_counts as (SELECT
A,
B,
C,
COUNT(1) OVER(PARTITION BY B, C) AS bc_count
FROM
aggregated_tables)
SELECT
A,
B,
C
FROM
tables_with_counts
WHERE
bc_count >= 10
Here is my take:
Step 1 : Aggregate all tables into one. It would be bulky but if you are using Oracle database, I think it shouldn't be an issue.
Step 2: Create md5 checksum hash values for B,C columns like below :
SELECT APEX_ITEM.MD5_CHECKSUM(B,C) md5_cks,
A,B,C
FROM aggregated_tables
Step 3: Take a count based on the checksum values and retain the rows where count >= 10 (the question asks for at least 10).
Step 4: Get rid of duplicate data using rank() or dense rank() in delete statement.
The short answer, which I'm sure that you don't want to hear, is "no." In the context of relational databases there is no efficient query to merge 100 tables.
It is not all bad news though. If it were just one table (let's say it was named "combined" just to have concrete examples) you could use an elegant query with a window function:
select A,B,C from (select A,B,C,count(1) over (partition by B,C) as counts from combined)counted where counts>=10
Option 1. So the question is how to get a "combined" table so that the snippet above works. If we stick with ANSI (standard) SQL, you could use UNION ALL and collect the result in a WITH clause to keep things neat.
Here is an example:
with
combined as (
select * from table_1
union all
select * from table_2),
counted as (
select
A,B,C,
count(1) over (partition by B,C) as counts
from
combined)
select A,B,C from counted where counts>=10;
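The two-table version of this query can be tried out on toy data. The following is a minimal sketch against SQLite through Python's sqlite3 module, assuming a SQLite build with window function support (3.25 or later). The sample rows and the extra src column, which records which table each row came from, are assumptions added for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
for name in ("table_1", "table_2"):
    conn.execute(f"CREATE TABLE {name} (A TEXT, B TEXT, C TEXT)")

# ('x', 'y') appears 6 + 4 = 10 times across both tables; ('p', 'q') only 3 times.
conn.executemany(
    "INSERT INTO table_1 VALUES (?, 'x', 'y')", [(f"t1-{i}",) for i in range(6)]
)
conn.executemany(
    "INSERT INTO table_2 VALUES (?, 'x', 'y')", [(f"t2-{i}",) for i in range(4)]
)
conn.executemany(
    "INSERT INTO table_2 VALUES (?, 'p', 'q')", [(f"t2-rare-{i}",) for i in range(3)]
)

query = """
WITH combined AS (
    SELECT 'table_1' AS src, A, B, C FROM table_1
    UNION ALL
    SELECT 'table_2' AS src, A, B, C FROM table_2
),
counted AS (
    SELECT src, A, B, C,
           COUNT(*) OVER (PARTITION BY B, C) AS counts
    FROM combined
)
SELECT src, A, B, C FROM counted WHERE counts >= 10
"""
kept = conn.execute(query).fetchall()
print(len(kept))
```

Carrying a source column like src through the query is one way to split the surviving rows back out per original table afterwards.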
I only included 2 tables, but the real query would extend up to table_100. That's a lot of typing and not a very efficient use of the programmer's time. Also, UNION and UNION ALL are notoriously poor performers in databases, so this is not efficient in terms of system resources or time either. Personally I would not do it this way, but it is an answer.
Option 2. There are other options which do not exactly match your question, but may be helpful to know. Any time you are tempted to create multiple tables with exactly the same schema, you will be better off creating a single table with multiple partitions (MySQL, Postgres, SQL Server, Oracle, and Hive all offer table partitioning). Every database platform has its own syntax for partitioning tables, but they are all similar. For this table, each of the original tables becomes a single partition, and the original table name would be a really good candidate for the string value of the partition identifier (partition column).
If you are able to stuff all of your 100 tables into 100 partitions of one table then you can run the first query after all. The advantage is that the database can optimize that query because all modern databases are optimized to manage partitioned queries.
In addition, adding a partition to a table is really no more trouble than creating a new table instead, but supporting and maintaining one table is a lot less trouble than 100 tables.
A third option, since you tagged "big data" is to use a big data engine like Spark with SparkSQL. This would be objectively best because you can actually load a dataframe with 100 combined tables very efficiently with spark, and the SQL after that is not much different from the relational database sql we have been considering. That's kind of out of scope here, but worth considering. If you submit a more specific question and specifically for spark we could go into more details.

Clarification about Select from (select...) statement

I came across a SQL practice question. The revealed answer is
SELECT ROUND(ABS(a - c) + ABS(b - d), 4) FROM (
SELECT MIN(lat_n) AS a, MIN(long_w) AS b, MAX(lat_n) AS c, MAX(long_w) AS d
FROM station);
Normally, I would encounter
select [] from [] where [] (select ...)
which implies that the values selected by the inner query in the WHERE clause determine what is queried in the outer query. As mentioned at the beginning, this time the SELECT comes after
FROM, and I'm curious what that does. Is it creating an imaginary table?
The piece in parentheses:
(SELECT MIN(lat_n) AS a, MIN(long_w) AS b, MAX(lat_n) AS c, MAX(long_w) AS d FROM station)
is a subquery.
What's important here is that the result of a subquery looks like a regular table to the outer query. In some SQL flavors, an alias is necessary immediately following the closing parenthesis (i.e. a name by which to refer to the table-like result).
Whether this is technically a "temporary table" is a matter of detail, as its result isn't stored outside the scope of the query; there is also a separate thing called a temporary table, which is stored.
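To make the "looks like a regular table" point concrete, the sketch below runs the inner query on its own and then the full query. It uses SQLite through Python's sqlite3 module; the three station rows and the alias extremes are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE station (lat_n REAL, long_w REAL)")
conn.executemany(
    "INSERT INTO station VALUES (?, ?)",
    [(10.0, 20.0), (12.5, 26.0), (11.0, 21.0)],
)

# Run on its own, the inner query yields a one-row, four-column result set.
inner = (
    "SELECT MIN(lat_n) AS a, MIN(long_w) AS b, "
    "MAX(lat_n) AS c, MAX(long_w) AS d FROM station"
)
inner_rows = conn.execute(inner).fetchall()
print(inner_rows)

# The outer query reads columns a, b, c, d from that result as if it were a table.
outer = f"SELECT ROUND(ABS(a - c) + ABS(b - d), 4) FROM ({inner}) AS extremes"
result = conn.execute(outer).fetchone()[0]
print(result)
```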
Additionally (and this might be the source of confusion), subqueries can also be used in the WHERE clause with an operator (e.g. IN) like this:
SELECT student_name
FROM students
WHERE student_school IN (SELECT school_name FROM schools WHERE location='Springfield')
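Here is a small runnable version of the IN form, again on SQLite via Python's sqlite3 module. The table contents are invented for the example; only the table and column names come from the query above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE schools (school_name TEXT, location TEXT)")
conn.execute("CREATE TABLE students (student_name TEXT, student_school TEXT)")
conn.executemany(
    "INSERT INTO schools VALUES (?, ?)",
    [("Springfield Elementary", "Springfield"), ("Shelbyville High", "Shelbyville")],
)
conn.executemany(
    "INSERT INTO students VALUES (?, ?)",
    [
        ("Bart", "Springfield Elementary"),
        ("Lisa", "Springfield Elementary"),
        ("Shelby", "Shelbyville High"),
    ],
)

# Here the subquery supplies a list of values to compare against,
# rather than a table to read columns from.
query = """
SELECT student_name
FROM students
WHERE student_school IN (
    SELECT school_name FROM schools WHERE location = 'Springfield'
)
ORDER BY student_name
"""
rows = conn.execute(query).fetchall()
print(rows)
```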
This is, as discussed in the comments and the other answer, a subquery.
Logically, such a subquery (when it appears in the FROM clause) is executed "first", and then the results are treated as a table [1]. Importantly though, that is not required by the SQL language [2]. The entire query (including any subqueries) is optimized as a whole.
This can include the optimizer doing things like pushing a predicate from the outer WHERE clause (which, admittedly, your query doesn't have) down into the subquery, if it's better to evaluate that predicate earlier rather than later.
Similarly, if you had two subqueries in your query that both access the same base table, that does not necessarily mean that the database system will actually query that base table exactly twice.
In any case, whether the database system chooses to materialize the results (store them somewhere) is also decided during the optimization phase. So without knowing your exact RDBMS and the decisions that the optimizer takes to optimize this particular query, it's impossible to say whether it will result in something actually being stored.
[1] Note that there is no standard terminology for this "result set as a table" produced by a subquery. Some people have mentioned "temporary tables" but since that is a term with a specific meaning in SQL, I shall not be using it here. I generally use the term "result set" to describe any set of data consisting of both columns and rows. This can be used both as a description of the result of the overall query and to describe smaller sections within a query.
[2] Provided that the final results are the same "as if" the query had been executed in its logical processing order, implementations are free to perform processing in any order they choose.
As there are so many terms involved, I just thought I'll throw in another answer ...
In a relational database we deal with tables. A query reads from tables and its result again is a table (albeit not a stored one).
So in the FROM clause we can access query results just like any stored table:
select * from (select * from t) x;
This makes the inner query a subquery to our main query. We could also call this an ad-hoc view, because view is the word we use for queries we access data from. We can move it to the beginning of our main query in order to enhance readability and possibly use it multiple times in it:
with x as (select * from t) select * from x;
We can even store such queries for later access:
create view v as select * from t;
select * from v;
In the SQL standard these terms are used:
BASE TABLE is a stored table we create with CREATE TABLE .... t in the examples above is supposed to be a base table.
VIEWED TABLE is a view we create with CREATE VIEW .... v in the examples above is a viewed table.
DERIVED TABLE is an ad-hoc view, such as x in the examples above.
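The three kinds of table can be put side by side in a quick sketch. It uses SQLite through Python's sqlite3 module; the columns of t and its rows are assumptions, and the sqlite_master lookup is SQLite-specific.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Base table: stored data.
conn.execute("CREATE TABLE t (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO t VALUES (?, ?)", [(1, "one"), (2, "two")])

# Derived table: an ad-hoc query in the FROM clause, named x.
derived = conn.execute(
    "SELECT * FROM (SELECT * FROM t) AS x ORDER BY id"
).fetchall()

# The same ad-hoc query moved in front of the main query with WITH.
cte = conn.execute(
    "WITH x AS (SELECT * FROM t) SELECT * FROM x ORDER BY id"
).fetchall()

# Viewed table: the query is stored under a name, the data is not.
conn.execute("CREATE VIEW v AS SELECT * FROM t")
view = conn.execute("SELECT * FROM v ORDER BY id").fetchall()

# SQLite's catalog records t as a table and v as a view; x is not stored at all.
kinds = dict(
    conn.execute(
        "SELECT name, type FROM sqlite_master WHERE name IN ('t', 'v')"
    ).fetchall()
)
print(derived, kinds)
```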
When using subqueries in other clauses than FROM (e.g. in the SELECT clause or the WHERE clause), we don't use the term "derived table". This is because in these clauses we don't access tables (i.e. something like WHERE mytable = ... does not exist), but columns and expression results. So the term "subquery" is more general than the term "derived table". In those clauses we still use various terms for subqueries, though. There are correlated and non-correlated subqueries and scalar and non-scalar ones.
And to make things even more complicated, we can use correlated subqueries in the FROM clause in modern DBMSs that feature lateral joins (sometimes implemented as CROSS APPLY and OUTER APPLY). The standard calls these LATERAL DERIVED TABLES.
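For the subquery kinds used outside FROM, a short sketch may help tell correlated from non-correlated ones. It runs on SQLite via Python's sqlite3 module; the emp table and its rows are hypothetical. Lateral joins are left out because SQLite does not offer them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp (name TEXT, dept TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO emp VALUES (?, ?, ?)",
    [("Ann", "eng", 120), ("Bob", "eng", 100), ("Cy", "ops", 80), ("Di", "ops", 90)],
)

# Non-correlated scalar subquery: one value, independent of the outer row.
above_avg = conn.execute(
    """
    SELECT name FROM emp
    WHERE salary > (SELECT AVG(salary) FROM emp)
    ORDER BY name
    """
).fetchall()

# Correlated scalar subquery: it refers to the outer row through e.dept,
# so logically it is evaluated once per outer row.
top_in_dept = conn.execute(
    """
    SELECT name FROM emp AS e
    WHERE salary = (SELECT MAX(salary) FROM emp AS i WHERE i.dept = e.dept)
    ORDER BY name
    """
).fetchall()
print(above_avg, top_in_dept)
```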

SQL - IN clause vs equals operator for small list

Which should be the preferred and efficient way?
where @TeamId in (Team1Id, Team2Id)
or
where @TeamId=Team1Id or @TeamId=Team2Id
I am using sql server 2008.
Edit
When I checked the execution plans, both queries used indexes and produced the same execution plan.
Both are the same.
SQL Server converts this
where @TeamId in (Team1Id, Team2Id)
into
where @TeamId=Team1Id or @TeamId=Team2Id
It's better to write IN than OR, because it is more readable and easier to maintain.
For the specific example you provide, of testing a variable, IN is simply syntactic sugar for multiple ORs.
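That the two forms select the same rows is easy to check on toy data. The sketch below uses SQLite through Python's sqlite3 module, with a bound parameter standing in for the SQL Server variable; the matches table and its rows are assumptions. It says nothing about execution plans, which are specific to each product.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE matches (match_id INTEGER, Team1Id INTEGER, Team2Id INTEGER)"
)
conn.executemany(
    "INSERT INTO matches VALUES (?, ?, ?)",
    [(1, 10, 20), (2, 30, 10), (3, 20, 30)],
)

team_id = 10  # plays the role of the TeamId variable

with_in = conn.execute(
    "SELECT match_id FROM matches WHERE ? IN (Team1Id, Team2Id) ORDER BY match_id",
    (team_id,),
).fetchall()
with_or = conn.execute(
    "SELECT match_id FROM matches WHERE ? = Team1Id OR ? = Team2Id ORDER BY match_id",
    (team_id, team_id),
).fetchall()
print(with_in, with_or)
```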
However, in the related case of selecting rows of a relation, a join to another relation is superior, particularly if the data field being compared is indexed or the list of comparison values grows. Such a comparison relation is easily created using a static sub-query like this:
select *
from data
join (
select Team1Id as TeamId union all
select Team2Id
) comparison on comparison.TeamId = data.TeamId
This technique of a static sub-query is widely applicable to many circumstances.
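A runnable sketch of the static sub-query join follows, using SQLite via Python's sqlite3 module. The score column, the sample rows and the two bound values standing in for Team1Id and Team2Id are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (TeamId INTEGER, score INTEGER)")
conn.executemany(
    "INSERT INTO data VALUES (?, ?)", [(1, 5), (2, 7), (3, 9), (1, 11)]
)

# The comparison values become a small two-row relation that data is joined to.
query = """
SELECT data.TeamId, data.score
FROM data
JOIN (
    SELECT ? AS TeamId UNION ALL
    SELECT ?
) AS comparison ON comparison.TeamId = data.TeamId
ORDER BY data.TeamId, data.score
"""
rows = conn.execute(query, (1, 3)).fetchall()
print(rows)
```

If the same value could appear twice in the comparison relation, UNION rather than UNION ALL would avoid matching each data row more than once.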
