SQL DISTINCT keyword

I would be thankful if someone could help me categorize the word DISTINCT. I am learning SQL and understand what it does, but is it a function, an attribute, or a keyword like SELECT, FROM, WHERE, etc.? I guess it to be a keyword, in which case what does it mean to write two keywords together (i.e. SELECT DISTINCT <tuple of attributes> FROM <relation>)?

It is a keyword, and it can be used in different contexts:
Select distinct field1, field2, field3
from myTable;
In this context, the returned data has only one row for each distinct combination of the field1, field2, and field3 column values. For example, given:
field1, field2, field3
1, 2, 3
1, 2, 3
1, 1, 1
1, 2, 1
with DISTINCT the query would return:
1, 2, 3
1, 1, 1
1, 2, 1
In other words, it acts like a GROUP BY on all fields included in the SELECT list.
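A rough GROUP BY equivalent of the DISTINCT query above (a sketch, reusing the same hypothetical myTable):
Select field1, field2, field3
from myTable
group by field1, field2, field3;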
It is also used with aggregations like this:
Select count(distinct productId)
from OrderDetails;
This counts each productId only once within the group (the example doesn't add any special grouping, so the whole table is one group). The query above would, for example, answer a question like: how many of our products have had any sale so far?
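To see the difference, compare it with a plain COUNT (a sketch against the same hypothetical OrderDetails table):
Select count(productId), count(distinct productId)
from OrderDetails;
The first count includes repeat sales of the same product; the second counts each product at most once.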

I have done a bit of searching and found it is called:
Conditional expressions: DISTINCT predicate.
Not sure if it is what you are looking for.

It is a keyword, used primarily in two contexts:
SELECT DISTINCT
COUNT(DISTINCT . . .)
In some databases, it is also allowed with set operations, such as UNION DISTINCT. There it spells out the default behavior of the set operation (which removes duplicates).
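For instance, in databases that support the keyword (MySQL and BigQuery, among others), a minimal sketch:
SELECT 1 AS x
UNION DISTINCT
SELECT 1 AS x;
This returns a single row, exactly like a plain UNION, whereas UNION ALL would return two.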
Conceptually, it modifies the action to work only on distinct values.
It can be used with other aggregation functions (SUM(DISTINCT ...), AVG(DISTINCT ...), and so on), but that usage is rarely useful; it typically indicates a problem with the data, often a data-modeling problem.

In the language it is a keyword, but note that at the core, those language keywords serve to denote invocations of operators ("relational" operators, though pseudo-relational is really more accurate), and operators are functions.
So there is a case to be made that it serves both purposes, and that the distinction you are asking about is actually rather irrelevant.
Examples:
SELECT ... : invocation of "bag" projection / SELECT DISTINCT ... : invocation of "relational" projection.
WHERE a IS NOT DISTINCT FROM b : the "more relational" equality operator (that yields true if both a and b are null) / WHERE a = b : the "less relational" equality operator (that yields false if both a and b are null).
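A minimal sketch of that NULL-handling difference (IS NOT DISTINCT FROM is standard SQL but not supported by every engine; PostgreSQL, for one, accepts it):
SELECT
NULL = NULL AS plain_equals, -- UNKNOWN, shown as NULL
NULL IS NOT DISTINCT FROM NULL AS null_safe; -- TRUE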

Related

BigQuery SQL: convert array to columns

I have a table with a field A where each entry is a fixed-length array of integers (say, length = 1000). I want to know how to convert it into 1000 columns, with column names index_i for i = 0, 1, 2, ..., 999, where each element is the corresponding integer. I can get it done by something like
A[OFFSET(0)] as index_0,
A[OFFSET(1)] as index_1,
A[OFFSET(2)] as index_2,
A[OFFSET(3)] as index_3,
A[OFFSET(4)] as index_4,
...
A[OFFSET(999)] as index_999,
I want to know what would be an elegant way of doing this. Thanks!
The first thing to say is that, sadly, this is going to be much more complicated than most people expect. It can be conceptually easier to pass the values into a scripting language (e.g. Python) and work there, but clearly keeping things inside BigQuery is going to be much more performant. So here is an approach.
Cross-joining to turn array fields into long-format tables
I think the first thing you're going to want to do is get the values out of the arrays and into rows.
Typically in BigQuery this is accomplished using CROSS JOIN. The syntax is a tad unintuitive:
WITH raw AS (
SELECT "A" AS name, [1,2,3,4,5] AS a
UNION ALL
SELECT "B" AS name, [5,4,3,2,1] AS a
),
long_format AS (
SELECT name, vals
FROM raw
CROSS JOIN UNNEST(raw.a) AS vals
)
SELECT * FROM long_format
UNNEST(raw.a) is taking those arrays of values and turning each array into a set of (five) rows, every single one of which is then joined to the corresponding value of name (the definition of a CROSS JOIN). In this way we can 'unwrap' a table with an array field.
This yields results like:
name | vals
-------------
A | 1
A | 2
A | 3
A | 4
A | 5
B | 5
B | 4
B | 3
B | 2
B | 1
Confusingly, there is a shorthand for this syntax in which CROSS JOIN is replaced with a simple comma:
WITH raw AS (
SELECT "A" AS name, [1,2,3,4,5] AS a
UNION ALL
SELECT "B" AS name, [5,4,3,2,1] AS a
),
long_format AS (
SELECT name, vals
FROM raw, UNNEST(raw.a) AS vals
)
SELECT * FROM long_format
This is more compact but may be confusing if you haven't seen it before.
Typically this is where we stop. We have a long-format table, created without any requirement that the original arrays all had the same length. What you're asking for is harder to produce: you want a wide-format table containing the same information (relying on the fact that each array has the same length).
Pivot tables in BigQuery
The good news is that BigQuery now has a PIVOT function! That makes this kind of operation possible, albeit non-trivial:
WITH raw AS (
SELECT "A" AS name, [1,2,3,4,5] AS a
UNION ALL
SELECT "B" AS name, [5,4,3,2,1] AS a
),
long_format AS (
SELECT name, vals, offset
FROM raw, UNNEST(raw.a) AS vals WITH OFFSET
)
SELECT *
FROM long_format PIVOT(
ANY_VALUE(vals) AS vals
FOR offset IN (0,1,2,3,4)
)
This makes use of WITH OFFSET to generate an extra offset column (so that we know which order the values in the array originally had).
Also, in general pivoting requires us to aggregate the values returned in each cell. But here we expect exactly one value for each combination of name and offset, so we simply use the aggregation function ANY_VALUE, which non-deterministically selects a value from the group you're aggregating over. Since, in this case, each group has exactly one value, that's the value retrieved.
The query yields results like:
name vals_0 vals_1 vals_2 vals_3 vals_4
----------------------------------------------
A 1 2 3 4 5
B 5 4 3 2 1
This is starting to look pretty good, but we have a fundamental issue, in that the column names are still hard-coded. You wanted them generated dynamically.
Unfortunately expressions for the pivot column values aren't something PIVOT can accept out-of-the-box. Note that BigQuery has no way to know that your long-format table will resolve neatly to a fixed number of columns (it relies on offset having the values 0-4 for each and every set of records).
Dynamically building/executing the pivot
And yet, there is a way. We will have to leave behind the comfort of standard SQL and move into the realm of BigQuery Procedural Language.
What we must do is use the statement EXECUTE IMMEDIATE, which allows us to dynamically construct and execute a standard SQL query!
(as an aside, I bet you - OP or future searchers - weren't expecting this rabbit hole...)
This is, of course, inelegant to say the least. But here is the above toy example, implemented using EXECUTE IMMEDIATE. The trick is that the executed query is defined as a string, so we just have to use an expression to inject the full range of values you want into this string.
Recall that || can be used as a string concatenation operator.
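As a quick illustration (the literal strings here are just placeholders):
SELECT 'FOR offset IN (' || '0,1,2' || ')'; -- yields the string 'FOR offset IN (0,1,2)'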
EXECUTE IMMEDIATE """
WITH raw AS (
SELECT "A" AS name, [1,2,3,4,5] AS a
UNION ALL
SELECT "B" AS name, [5,4,3,2,1] AS a
),
long_format AS (
SELECT name, vals, offset
FROM raw, UNNEST(raw.a) AS vals WITH OFFSET
)
SELECT *
FROM long_format PIVOT(
ANY_VALUE(vals) AS vals
FOR offset IN ("""
|| (SELECT STRING_AGG(CAST(x AS STRING)) FROM UNNEST(GENERATE_ARRAY(0,4)) AS x)
|| """
)
)
"""
Ouch. I've tried to make that as readable as possible. Near the bottom there is an expression that generates the list of column suffixes (pivoted values of offset):
(SELECT STRING_AGG(CAST(x AS STRING)) FROM UNNEST(GENERATE_ARRAY(0,4)) AS x)
This generates the string "0,1,2,3,4" which is then concatenated to give us ...FOR offset IN (0,1,2,3,4)... in our final query (as in the hard-coded example before).
REALLY dynamically executing the pivot
It hasn't escaped my notice that this is still technically insisting on your knowing up-front how long those arrays are! It's a big improvement (in the narrow sense of avoiding painful repetitive code) to use GENERATE_ARRAY(0,4), but it's not quite what was requested.
Unfortunately, I can't provide a working toy example, but I can tell you how to do it. You would simply replace the pivot values expression with
(SELECT STRING_AGG(DISTINCT CAST(offset AS STRING)) FROM long_format)
But doing this in the example above won't work, because long_format is a Common Table Expression that is only defined inside the EXECUTE IMMEDIATE block. The statement in that block won't be executed until after building it, so at build-time long_format has yet to be defined.
Yet all is not lost. This will work just fine:
SELECT *
FROM d.long_format PIVOT(
ANY_VALUE(vals) AS vals
FOR offset IN ("""
|| (SELECT STRING_AGG(DISTINCT CAST(offset AS STRING)) FROM d.long_format)
|| """
)
)
... provided you first define a BigQuery VIEW (for example) called long_format (or, better, some more expressive name) in a dataset d. That way, both the job that builds the query and the job that runs it will have access to the values.
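For completeness, a minimal sketch of such a view, assuming the toy data lives in a table d.raw with an array column a (both names hypothetical):
CREATE OR REPLACE VIEW d.long_format AS
SELECT name, vals, offset
FROM d.raw, UNNEST(a) AS vals WITH OFFSET;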
If successful, you should see both jobs execute and succeed. You should then click 'VIEW RESULTS' on the job that ran the query.
As a final aside, this assumes you are working from the BigQuery console. If you're instead working from a scripting language, that gives you plenty of options to either load and manipulate the data, or build the query in your scripting language rather than massaging BigQuery into doing it for you.
Consider the approach below:
execute immediate ( select '''
select * except(id) from (
select to_json_string(A) id, * except(A)
from your_table, unnest(A) value with offset
)
pivot (any_value(value) index for offset in ('''
|| (select string_agg('' || val order by offset) from unnest(generate_array(0,999)) val with offset) || '))'
)
If applied to dummy data like below (with 10 instead of 1000 elements):
select [10,11,12,13,14,15,16,17,18,19] as A union all
select [20,21,22,23,24,25,26,27,28,29] as A union all
select [30,31,32,33,34,35,36,37,38,39] as A
the output is the pivoted table, with one column per array position (index_0 through index_9).

Match count of a regular expression for every row

I use the query below to get content rows that match my_regex_pattern, but I don't know how many times the pattern hits in each row. What is the best way to get the match count for every row in Postgres?
For example, if a row's content is 'abcdefabcgh' and my regular expression is 'abc', I want 2, since 'abcdefabcgh' contains two occurrences of 'abc'.
SELECT content
FROM table1
WHERE content ~ 'my_regex_pattern'
Also, how can I get the rows that have more than a specific number of matches? For example: just give me the records that contain 'abc' more than 4 times.
Of course you can make it work with regexp_matches(). Or better yet, regexp_split_to_table(). To apply to a whole table, use a LATERAL join (requires Postgres 9.3+):
SELECT content, ct
FROM table1 t, LATERAL (
SELECT count(*) - 1 AS ct
FROM regexp_split_to_table(t.content, 'abc')
) c
WHERE t.content ~ 'abc'; -- eliminate rows without match
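To answer the second part of the question (rows with more than a given number of matches), the count from the LATERAL subquery can be filtered directly; a sketch reusing the same hypothetical table1:
SELECT content, ct
FROM table1 t, LATERAL (
SELECT count(*) - 1 AS ct
FROM regexp_split_to_table(t.content, 'abc')
) c
WHERE c.ct > 4; -- only rows where 'abc' occurs more than 4 times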
For simple patterns like in the example in your question, you could also:
SELECT content, (length(content) - length(replace(content, 'abc', ''))) / length('abc')
FROM table1
WHERE content LIKE '%abc%';
Typically faster, since regular expression functions are costly. Also works for older versions.

Using part of the select clause without rewriting it

I am using an Oracle SQL database and I am trying to count the number of terms starting with a given letter in a dictionary.
Here is my query:
SELECT Substr(Lower(Dict.Term),0,1) AS Initialchar,
Count(Lower(Dict.Term))
FROM Dict
GROUP BY Substr(Lower(Dict.Term),0,1)
ORDER BY Substr(Lower(Dict.Term),0,1);
This query works as expected, but the thing I'm not really happy about is that I have to rewrite the long "Substr(Lower(Dict.Term),0,1)" in the GROUP BY and ORDER BY clauses. Is there any way to reuse the one I defined in the SELECT part?
Thanks
You can use a subquery. Because Oracle follows the SQL standard, substr() starts counting at 1. Although Oracle does explicitly allow 0 ("If position is 0, then it is treated as 1"), I find it misleading because "0" and "1" refer to the same position.
So:
select first_letter, count(*)
from (select d.*, substr(lower(d.term), 1, 1) as first_letter
from dict d
) d
group by first_letter
order by first_letter;
Not directly. The output columns can only be referred to in the ORDER BY clause, not anywhere else. The only way would be to make it into a subselect, but it wouldn't be any clearer and might cause issues with performance.
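To illustrate the one place the alias can be reused (a sketch against the same Dict table; Oracle accepts an alias in ORDER BY but not in GROUP BY):
SELECT Substr(Lower(Dict.Term), 1, 1) AS Initialchar,
Count(*)
FROM Dict
GROUP BY Substr(Lower(Dict.Term), 1, 1) -- alias not allowed here
ORDER BY Initialchar; -- alias allowed here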
I prefer subquery factoring for this purpose.
with init as (
select substr(lower(d.term), 1, 1) as Initialchar
from dict d)
select Initialchar, count(*)
from init
group by Initialchar
order by Initialchar;
Contrary to the opinion above, IMO this makes the query much clearer and defines a natural order, especially when using more subqueries.
I'm not aware of performance caveats, but there are some limitations, such as it not being possible to use a WITH clause within another WITH clause (ORA-32034: unsupported use of WITH clause).

Aggregate multiple columns without groupBy in Slick 2.0

I would like to perform an aggregation with Slick that executes SQL like the following:
SELECT MIN(a), MAX(a) FROM table_a;
where table_a has an INT column a
In Slick given the table definition:
class A(tag: Tag) extends Table[Int](tag, "table_a") {
  def a = column[Int]("a")
  def * = a
}
val A = TableQuery[A]
val as = A.map(_.a)
It seems like I have 2 options:
Write something like: Query(as.min, as.max)
Write something like:
as
  .groupBy(_ => 1)
  .map { case (_, as) => (as.map(identity).min, as.map(identity).max) }
However, the generated SQL is not good in either case. With option 1, two separate sub-selects are generated, which is like writing two separate queries. With option 2, the following is generated:
select min(x2."a"), max(x2."a") from "table_a" x2 group by 1
However, this syntax is not correct for Postgres (it interprets GROUP BY 1 as grouping by the first output column, which is invalid in this case, since that column is an aggregate). Indeed, AFAIK it is not possible to group by a constant value in Postgres, except by omitting the GROUP BY clause.
Is there a way to cause Slick to emit a single query with both aggregates without the GROUP BY?
The syntax error is a bug. I created a ticket: https://github.com/slick/slick/issues/630
The subqueries are a limitation of Slick's SQL compiler currently producing non-optimal code in this case. We are working on improving the situation.
As a workaround, here is a pattern to swap out the generated SQL under the hood and leave everything else intact: https://gist.github.com/cvogt/8054159
I use the following trick in SQL Server, and it seems to work in Postgres:
select min(x2."a"), max(x2."a")
from "table_a" x2
group by (case when x2.a = x2.a then 1 else 1 end);
The column reference in the GROUP BY expression tricks the optimizer into thinking that there could be more than one group.

SQL - IN vs. NOT IN

Suppose I have a table with a column that takes values from 1 to 10. I need to select all rows except those where the value is 9 or 10. Will there be a difference (performance-wise) when I use this query:
SELECT * FROM tbl WHERE col NOT IN (9, 10)
and this one?
SELECT * FROM tbl WHERE col IN (1, 2, 3, 4, 5, 6, 7, 8)
Use "IN" as it will most likely make the DBMS use an index on the corresponding column.
"NOT IN" could in theory also be translated into an index usage, but in a more complicated way which DBMS might not "spend overhead time" using.
When it comes to performance you should always profile your code (i.e. run your queries a few thousand times and measure each loop's performance using some kind of stopwatch).
But here I highly recommend using the first query, for better maintainability in the future. The logic is that you need all records except 9 and 10. If you add the value 11 to your table and use the second query, your application's logic will break, which will of course lead to a bug.
I have seen Oracle have trouble optimizing some queries with NOT IN if columns are nullable. If you can write your query either way, IN is preferred as far as I'm concerned.
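To illustrate why nullable columns make NOT IN tricky (a sketch of standard three-valued logic, reusing the question's hypothetical tbl; not Oracle-specific):
-- if col is NULL, col NOT IN (9, 10) evaluates to UNKNOWN, so the row is filtered out:
SELECT * FROM tbl WHERE col NOT IN (9, 10);
-- and if the list itself contains a NULL, no row can ever satisfy NOT IN:
SELECT * FROM tbl WHERE col NOT IN (9, 10, NULL); -- always returns zero rows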
For a list of constants, MySQL will internally expand your code to:
SELECT * FROM tbl WHERE ((col <> 9 and col <> 10))
The same happens for the other one, with eight = comparisons instead.
So yes, the first one will be faster: fewer comparisons to be done. The chance that this is measurable is negligible, though; the overhead of a handful of constant comparisons is nothing compared to the general overhead of parsing SQL and retrieving data.
"IN" statement works internally like a serie of "OR" statements.
For example:
SELECT * FROM tbl WHERE col IN (1, 2, 3)
It is equivalent to
SELECT * FROM tbl WHERE col = 1 OR col = 2 OR col = 3
"OR" statements could cause some performance issues as explained in this article:
https://bertwagner.com/2018/02/20/or-vs-union-all-is-one-better-for-performance/
When you use a NOT IN statement, it is all the same, but the result is logically negated. BUT, you could write an equivalent query with much better performance. In your example:
SELECT * FROM tbl WHERE col NOT IN (9, 10)
It is equivalent to
SELECT * FROM tbl WHERE col <> 9 AND col <> 10
With an "AND" statement, the database stop analizing when one of all conditionals its false, so, its much better in performance than "OR" used in "IN" statement.