Grouping sets columns in aggregate arguments and NULL replacement - sql

There are many grouping sets examples on the internet like query Q1 in the example below. But query Q2 is different, because A2 is both a grouping column and the argument to SUM().
Which of the following is correct for Q2 according to the SQL Standard (any version since 2003 that supports grouping sets)? If (1) is correct, please explain why, with reference to the Standard.
(1) A2 is replaced by NULL unless it is in an argument to an aggregate. This interpretation gives results R1 below. This is Oracle's behaviour (which seems more useful).
(2) A2 is replaced by NULL even where it is used in an aggregate, which means the aggregate will return NULL. This interpretation gives results R2 below. This is how I have understood the SQL Standard (possibly incorrectly).
Example code:
-- Setup
create table A (A1 int, A2 int, A3 int);
insert into A values (1, 1, 100);
insert into A values (1, 2, 40);
insert into A values (2, 1, 70);
insert into A values (5, 1, 90);
-- Query Q1
-- Expected/Observed results:
--
--         A1         A2    SUM(A3)
-- ---------- ---------- ----------
--          1          -        140
--          2          -         70
--          5          -         90
--          -          1        260
--          -          2         40
--          -          -        300
select A1, A2, sum (A3)
from A
group by grouping sets ((A1), (A2), ())
order by 1, 2;
-- Query Q2
-- Results R1 (Oracle):
--         A1         A2    SUM(A2)
-- ---------- ---------- ----------
--          1          -          3
--          2          -          1
--          5          -          1
--          -          1          3
--          -          2          2
--          -          -          5
--
-- Results R2 (SQL Standard?):
--         A1         A2    SUM(A2)
-- ---------- ---------- ----------
--          1          -          -
--          2          -          -
--          5          -          -
--          -          1          3
--          -          2          2
--          -          -          -    (NULL grand-total row)
select A1, A2, sum (A2)
from A
group by grouping sets ((A1), (A2), ())
order by 1, 2;
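As an aside, the Standard's GROUPING function (the PCBIT_i indicators discussed further down) distinguishes the NULLs generated by the grouping sets from NULLs in the data, which makes it easy to see which rows had A2 replaced. A small sketch against the same table:
-- GROUPING(A2) returns 1 on rows where A2 was replaced by NULL
select A1, A2, sum (A2), grouping(A2) as g_a2
from A
group by grouping sets ((A1), (A2), ())
order by 1, 2;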
I am aware of the following from SQL 2003 subclause 7.9, Syntax Rule 17, which describes how columns are replaced with NULLs. However, I might have missed or misunderstood a rule elsewhere that excludes arguments to aggregates.
m) For each GS_i:
iii) Case:
1) If GS_i is an <ordinary grouping set>, then
A) Transform SL2 to obtain SL3, and transform HC to obtain
HC3, as follows:
II) Replace each <column reference> in SL2 and HC that
references PC_k by "CAST(NULL AS DTPCk)"
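To make interpretation (2) concrete, here is a sketch of what that replacement would produce for Q2's (A1) grouping set, if "references" is read as including aggregate arguments (column names as in the example above; INT stands in for the declared type DTPCk):
-- Q2 transformed for grouping set (A1) under interpretation (2):
-- both references to A2 become CAST(NULL AS INT), including the one
-- inside SUM(), so SUM returns NULL on these rows
select A1, CAST(NULL AS INT), sum (CAST(NULL AS INT))
from A
group by A1;
-- under interpretation (1) only the bare A2 would be replaced:
-- select A1, CAST(NULL AS INT), sum (A2) from A group by A1;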

As with many difficult SQL features, it can help to look at earlier versions of the standard where the phrasing might be simpler. And it turns out that grouping sets were introduced in SQL 1999 and were then revised in SQL 2003.
SQL 1999
Syntax Rule 4 states:
Let SING be the <select list> constructed by removing from SL every <select
sublist> that is not a <derived column> that contains at least one <set
function specification>.
Syntax Rule 11 then defines PC_k as the column references contained in the group by list. It constructs a derived table projecting the union of the GSQQL_i, which are query specifications projecting each PC_k or NULL as appropriate, the PCBIT_i grouping-function indicators, and SING.
Thus any <derived column> that contains a set function will not have its argument replaced, and its other column references won't be replaced either. So answer (1) is correct.
However, in the following query the GSQQL_i corresponding to the <grand total> doesn't group by C1, so I think it will give an error rather than replacing C1 with NULL for that grouping set.
select C1 + MAX(C2) from T group by grouping sets ((C1), ());
SQL 2003 - 2011
I still don't have a definitive answer for this. It hinges on what they meant (or forgot to specify?) by "references" in the replacement rule. It would be clearer if it said one of "immediately contained", "simply contained" or "directly contained", as defined in ISO 9075-1 (SQL Part 1: Framework).
The note (number 134 in SQL 2003) at the start of the General Rules says "As a result of the syntactic transformations specified in the Syntax Rules of this Sub-clause, only primitive <group by clause>s are left to consider." So the aggregate argument either has or has not actually been replaced: we aren't expected to evaluate aggregates in a special way (whereas if General Rule 3 were, in effect, applied before the NULL substitution of Syntax Rule 17, then answer (1) would be correct).
I found a draft of Technical Corrigendum 5 [pdf], which is a "diff" against SQL 2003. This includes the relevant changes on pages 80-87. Unfortunately the bulk of the change has only the brief rationale "Provide a correct, unified treatment of CUBE and ROLLUP". General Rule 3, mentioned above, has the rationale "clarify the semantics of column references".

Related

Is there an aggregate operator for multiplication in SQL?

In SQL there are aggregation operators, like AVG, SUM, COUNT. Why doesn't it have an operator for multiplication? "MUL" or something.
I was wondering, does it exist for Oracle, MSSQL, MySQL ? If not is there a workaround that would give this behaviour?
By MUL do you mean progressive multiplication of values?
Even with 100 rows of some small size (say 10s), your MUL(column) is going to overflow any exact numeric type! With such a high probability of misuse and abuse, and very limited scope for use, it does not need to be in the SQL Standard. As others have shown, there are mathematical ways of working it out, just as there are many, many ways to do tricky calculations in SQL using standard (and common-use) methods.
Sample data:
Column
------
     1
     2
     4
     8
COUNT : 4 items (1 for each non-null)
SUM   : 1 + 2 + 4 + 8 = 15
AVG   : 3.75 (SUM/COUNT)
MUL   : 1 x 2 x 4 x 8 = 64
For completeness, the Oracle, MSSQL, MySQL core implementations *
Oracle : EXP(SUM(LN(column))) or POWER(N,SUM(LOG(column, N)))
MSSQL : EXP(SUM(LOG(column))) or POWER(N,SUM(LOG(column)/LOG(N)))
MySQL : EXP(SUM(LOG(column))) or POW(N,SUM(LOG(N,column)))
Take care when using EXP/LOG in SQL Server; watch the return type: http://msdn.microsoft.com/en-us/library/ms187592.aspx
The POWER form allows for larger numbers (using bases larger than Euler's number). In cases where the result grows too large to convert back using POWER, you can return just the logarithmic value and calculate the actual number outside of the SQL query, as sketched below.
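A minimal sketch of that idea (SQL Server syntax; positive values assumed, with the same placeholder names yourColumn/yourTable used elsewhere in this thread):
-- returns ln(product); apply EXP() on the client if the product itself
-- would overflow the SQL numeric types
SELECT SUM(LOG(yourColumn)) AS LogProduct FROM yourTable;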
* LOG(0) and LOG of a negative number are undefined. The code below shows only how to handle this in SQL Server; equivalents can be built for the other SQL flavours using the same concept.
create table MUL (data int)
insert MUL
select  1 union all
select  2 union all
select  4 union all
select  8 union all
select -2 union all
select  0

select CASE WHEN MIN(abs(data)) = 0 then 0
            ELSE EXP(SUM(Log(abs(nullif(data, 0)))))  -- the base mathematics
                 * round(0.5 - count(nullif(sign(sign(data) + 0.5), 1)) % 2, 0)  -- pairs up negatives
       END
from MUL
Ingredients:
Taking the abs() of data: if the min is 0, multiplying by whatever else is futile, the result is 0.
When data is 0, NULLIF converts it to NULL; the abs() and log() then both return NULL, causing it to be excluded from SUM().
If data is not 0, abs() allows us to multiply a negative number using the LOG method; we keep track of the negativity elsewhere.
Working out the final sign:
sign(data) returns 1 for >0, 0 for 0 and -1 for <0.
We add another 0.5 and take the sign() again, so we have now classified 0 and 1 both as 1, and only -1 as -1.
Again we use NULLIF, to remove the 1's from COUNT(), since we only need to count up the negatives.
% 2 against the count() of negative numbers returns either
--> 1 if there is an odd number of negative numbers
--> 0 if there is an even number of negative numbers
One more mathematical trick: we take that 1 or 0 off 0.5, so that the above becomes
--> (0.5 - 1 = -0.5 => rounds to -1) if there is an odd number of negative numbers
--> (0.5 - 0 =  0.5 => rounds to  1) if there is an even number of negative numbers
We multiply this final 1/-1 against the SUM-PRODUCT value for the real result.
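For example, with the 0 row removed from the table above, the query returns -128: EXP(SUM(LOG(...))) yields the absolute product 1 x 2 x 4 x 8 x 2 = 128, and the single negative value makes the sign term -1.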
No, but you can use Mathematics :)
if yourColumn is always bigger than zero:
select EXP(SUM(LOG(yourColumn))) As ColumnProduct from yourTable
I see an Oracle answer is still missing, so here it is:
SQL> with yourTable as
2 ( select 1 yourColumn from dual union all
3 select 2 from dual union all
4 select 4 from dual union all
5 select 8 from dual
6 )
7 select EXP(SUM(LN(yourColumn))) As ColumnProduct from yourTable
8 /
COLUMNPRODUCT
-------------
64
1 row selected.
Regards,
Rob.
With PostgreSQL, you can create your own aggregate functions, see http://www.postgresql.org/docs/8.2/interactive/sql-createaggregate.html
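For example, a minimal sketch of such a product aggregate (the names mul and mul_sfunc are invented here; numeric input assumed):
-- state transition function: multiply the running product by each value
-- (STRICT makes the aggregate skip NULLs, like SUM does)
CREATE FUNCTION mul_sfunc(numeric, numeric) RETURNS numeric
    AS 'SELECT $1 * $2' LANGUAGE SQL IMMUTABLE STRICT;

-- the aggregate itself, starting the product at 1
CREATE AGGREGATE mul(numeric) (
    SFUNC    = mul_sfunc,
    STYPE    = numeric,
    INITCOND = '1'
);

-- usage: SELECT mul(yourColumn) FROM yourTable;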
To create an aggregate function on MySQL, you'll need to build an .so (linux) or .dll (windows) file. An example is shown here: http://www.codeproject.com/KB/database/mygroupconcat.aspx
I'm not sure about MSSQL and Oracle, but I bet they have options to create custom aggregates as well.
You'll break any datatype fairly quickly as numbers mount up.
Using LOG/EXP is tricky because of numbers <= 0, which fail when passed to LOG. I wrote a solution in this question that deals with this.
Using a recursive CTE in MS SQL:
CREATE TABLE Foo(Id int, Val int)
INSERT INTO Foo VALUES(1, 2), (2, 3), (3, 4), (4, 5), (5, 6)

;WITH cte AS
(
    SELECT Id, Val AS Multiply, row_number() over (order by Id) as rn
    FROM Foo
    WHERE Id = 1
    UNION ALL
    SELECT ff.Id, cte.Multiply * ff.Val as Multiply, ff.rn
    FROM (SELECT f.Id, f.Val, row_number() over (order by f.Id) as rn
          FROM Foo f) ff
    INNER JOIN cte ON ff.rn - 1 = cte.rn
)
SELECT * FROM cte
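Note that the final SELECT returns the running product on every row; if you want just the overall product, a small tweak (SQL Server syntax) is to end the statement with:
-- the highest rn carries the product of all rows
SELECT TOP 1 Multiply FROM cte ORDER BY rn DESC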
Not sure about Oracle or SQL Server, but in MySQL you can just use * like you normally would.
mysql> select count(id), count(id)*10 from tablename;
+-----------+--------------+
| count(id) | count(id)*10 |
+-----------+--------------+
|       961 |         9610 |
+-----------+--------------+
1 row in set (0.00 sec)

SQL pivot with text based fields

Forgive me, but I can't get this working.
I can find lots of complex pivots using numeric values, but nothing basic based on strings to build upon.
Let's suppose this is my source query from a temp table. I can't change this:
select * from #tmpTable
This provides 12 rows:
Row | Name           | Code
---------------------------------
1   | July 2019      | 19/20-01
2   | August 2019    | 19/20-02
3   | September 2019 | 19/20-03
..  | ..             | ..
12  | June 2020      | 19/20-12
I want to pivot this and return the data like this:
Data Type | [1]       | [2]         | [3]            | [12]
---------------------------------------------------------------------------
Name      | July 2019 | August 2019 | September 2019 | June 2020
Code      | 19/20-01  | 19/20-02    | 19/20-03       | 19/20-12
Thanks in advance..
Strings and numbers aren't much different in pivot terms; it's just that you can't use numeric aggregators like SUM or AVG on them. MAX will be fine, and in this case each cell will only have one value, so nothing will be lost.
You need to pull your data out into a taller key/value representation before pivoting it back to look the other way round from how it does now.
First, unpivot the data:
WITH upiv AS (
    SELECT 'Name' as t, Row as r, Name as v FROM #tmpTable
    UNION ALL
    SELECT 'Code' as t, Row, Code FROM #tmpTable
)
Now the data can be regrouped and conditionally aggregated on the r column:
SELECT
    t,
    MAX(CASE WHEN r = 1  THEN v END) as r1,
    MAX(CASE WHEN r = 2  THEN v END) as r2,
    ...
    MAX(CASE WHEN r = 12 THEN v END) as r12
FROM upiv
GROUP BY t
You'll need to put the two SQL blocks presented here together so they form a single statement. If you want to know more about how this works, run the statement inside the WITH block on its own and take a look at it; also remove the GROUP BY/MAX parts from the full statement and look at that result. You'll see the WITH block query makes the data taller, essentially key/value pairs that track which type the data is (Name or Code). When you run the full SQL without the GROUP BY/MAX, you'll see the tall data spread out sideways, giving a lot of NULLs and a diagonal set of cell data (if ordered by r). The GROUP BY collapses all these NULLs, because MAX will pick any value over a NULL (and there is only one value per cell).
You could also do this as an UNPIVOT followed by a PIVOT; a sketch follows. I've always preferred the form above because not every database supports the UNPIVOT/PIVOT keywords. Arguably UNPIVOT/PIVOT could perform better, because there may be specific optimizations the developers can make (e.g. UNPIVOT can scan the table once, whereas this UNION approach may require multiple scans or be more memory intensive), but in this case it's only 12 rows. I suspect you're using SQL Server, but if you're using a database that doesn't understand WITH, you can place the bracketed statement of the WITH (including the brackets) between the FROM and the upiv to make it a subquery, in the pattern SELECT ... FROM (SELECT ... UNION ALL SELECT ...) upiv GROUP BY ...; there is no difference.
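For reference, a sketch of that UNPIVOT/PIVOT form in SQL Server (column names as in the question; the CASTs are needed because UNPIVOT requires identically typed source columns, and the varchar lengths are a guess):
SELECT DataType, [1], [2], [3], [12]          -- add [4]..[11] the same way
FROM (
    SELECT Row, DataType, v
    FROM (SELECT Row,
                 CAST(Name AS varchar(20)) AS Name,
                 CAST(Code AS varchar(20)) AS Code
          FROM #tmpTable) s
    UNPIVOT (v FOR DataType IN (Name, Code)) u
) up
PIVOT (MAX(v) FOR Row IN ([1], [2], [3], [12])) p;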
I'll leave renaming the output columns as an exercise for you, but I would urge you to consider not putting spaces or square brackets in the column names as you show in your question.

substring and trim in Teradata

I am working in Teradata with some descriptive data that needs to be transformed from a generic varchar(60) into different field lengths based on the type of data element and the attribute value. So I need to take whatever is in the varchar(60) and, based on field 'ABCD', act on field 'XYZ'. In this case XYZ is a varchar(3). To do this I am using CASE logic within my select. What I want to do is:
1. Eliminate all occurrences of non-alphanumeric data; all I want left are upper-case alpha chars and numbers. In this case, where abcd = 'GROUP', xyz should come out as '000', '002', 'A', 'C'.
2. Eliminate extra padding.
3. Shift everything right.
   abcd    xyz
1  GROUP   NULL
2  GROUP   $
3  GROUP   000000000000000000000000000000000000000000000000000000000000
4  GROUP   000000000000000000000000000000000000000000000000000000000002
5  GROUP   A
6  GROUP   C
7  GROUP   r
To do this I have tried TRIM and SUBSTR, amongst several other things that did not work. I have pasted what I have working now, but it is not reliably working through the data within the select. I am really looking for some options on how to better work with strings in Teradata. I have been working out of the "SQL Functions, Operators, Expressions and Predicates" online PDF. Is there a better reference? We are on TD 13.
SELECT abcd
     , CASE
           -- xxxxxxxxxxxxxxxxxxxxxxxxxxxxx
           WHEN abcd = 'GROUP'
           THEN CASE
                    WHEN SUBSTR(tx.abcd, 60, 4) = 0
                    THEN SUBSTR(tx.abcd, 60, 3)
                    ELSE TRIM(TRAILING FROM tx.abcd)
                END
       END AS abcd
FROM db.descr tx
WHERE tx.abcd IS IN ('GROUP')
The end result should look like this:
   abcd    xyz
1  GROUP   000
2  GROUP   002
3  GROUP   A
4  GROUP   C
I will have to deal with approx 60 different 'abcd' types, but they should all conform to the type of data I am currently seeing, i.e. mixed case, non-numeric, non-alphabetic, padded, etc.
I know there is a better way, but I have gone in several circles trying to figure this out over the weekend and need a little push in the right direction.
Thanks in advance,
Pat
The SQL below uses the CHARACTER_LENGTH function to first determine if there is a need to perform what amounts to a RIGHT(tx.xyz, 3) using the native functions in Teradata 13.x. I think this may accomplish what you are looking to do. I hope I have not misinterpreted your explanation:
SELECT CASE WHEN tx.abcd = 'GROUP'
             AND CHARACTER_LENGTH(TRIM(BOTH FROM tx.xyz)) > 3
            THEN SUBSTRING(TRIM(BOTH FROM tx.xyz)
                   FROM (CHARACTER_LENGTH(TRIM(BOTH FROM tx.xyz)) - 2))
            ELSE tx.abcd
       END
FROM db.descr tx;
EDIT: Fixed parenthesis in SUBSTRING

EXCEPT query in SQL

I am doing an except query in SQL like this
Q1 EXCEPT Q2 EXCEPT Q3
Where Q1, Q2 and Q3 are sub-query.
I just want to know what will be its output, (Q1-Q2)-Q3 or Q1-(Q2-Q3)? and if 2nd, how to get 1st one as output?
I don't have an installation of DB2, but in PostgreSQL, Q1 except Q2 except Q3 appears to be interpreted as (Q1 except Q2) except Q3 (note: generate_series(m,n) is a PostgreSQL function that generates a single column of integer values from m to n, where m<n, of course):
select generate_series(1,10) except select generate_series(5,15) except select generate_series(10,20);
generate_series
-----------------
1
2
3
4
(4 rows)
select * from (select generate_series(1,10) except select generate_series(5,15))a except select generate_series(10,20);
generate_series
-----------------
1
2
3
4
(4 rows)
select generate_series(1,10) except select * from (select generate_series(5,15) except select generate_series(10,20))a;
generate_series
-----------------
1
2
3
4
10
(5 rows)
However, it's best to use parentheses to make certain that the order of evaluation is how you want it.
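For example (PostgreSQL syntax; the single-column tables t1..t3 are hypothetical stand-ins for Q1..Q3):
-- force the grouping you want explicitly
(SELECT c FROM t1 EXCEPT SELECT c FROM t2) EXCEPT SELECT c FROM t3;
SELECT c FROM t1 EXCEPT (SELECT c FROM t2 EXCEPT SELECT c FROM t3);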
The SQL-92 Standard confirms (Annex C) what should be fairly intuitive: in the absence of parentheses, both occurrences of EXCEPT in the sample code are at the same level of precedence, in which case they are evaluated in left-to-right order.
Out of interest, the precedence of EXCEPT, INTERSECT and UNION (parentheses aside) is implementation dependent, so consult your SQL product's documentation. However, a quick Google search suggests that most SQL products (including DB2) have EXCEPT and UNION at the same level of precedence and INTERSECT at a higher level of precedence.
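For example, in products where INTERSECT binds tighter (PostgreSQL among them):
SELECT 1 UNION SELECT 2 INTERSECT SELECT 2;
-- parses as SELECT 1 UNION (SELECT 2 INTERSECT SELECT 2), returning 1 and 2;
-- pure left-to-right evaluation would return only 2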

SQL - suppressing duplicate *adjacent* records

I need to run a Select statement (DB2 SQL) that does not pull adjacent row duplicates based on a certain field. Specifically, I am trying to find out when data changes, which is made difficult because it might change back to its original value.
That is to say, I have a table that vaguely resembles the below, sorted by Letter and then by Date:
A, 5, 2009-01-01
A, 12, 2009-02-01
A, 12, 2009-03-01
A, 12, 2009-04-01
A, 9, 2009-05-01
A, 9, 2009-06-01
A, 5, 2009-07-01
And I want to get the results:
A, 5, 2009-01-01
A, 12, 2009-02-01
A, 9, 2009-05-01
A, 5, 2009-07-01
discarding adjacent duplicates but keeping the last row (despite it having the same number as the first row). The obvious:
Select Letter, Number, Min(Update_Date) from Table group by Letter, Number
does not work -- it doesn't include the last row.
Edit: As there seems to have been some confusion, I have clarified the month column into a date column. It was meant as a human-parseable short form, not as actual valid data.
Edit: The last row is not important BECAUSE it is the last row, but because it has a "new value" that is also an "old value". Grouping by NUMBER would wrap it in with the first row; it needs to remain a separate entity.
Depending on which DB2 version you're on, there are analytic functions which can make this problem easy to solve. An example in Oracle is below, but the DB2 SELECT syntax appears to be pretty similar.
create table t1 (c1 char, c2 number, c3 date);
insert into t1 VALUES ('A', 5, DATE '2009-01-01');
insert into t1 VALUES ('A', 12, DATE '2009-02-01');
insert into t1 VALUES ('A', 12, DATE '2009-03-01');
insert into t1 VALUES ('A', 12, DATE '2009-04-01');
insert into t1 VALUES ('A', 9, DATE '2009-05-01');
insert into t1 VALUES ('A', 9, DATE '2009-06-01');
insert into t1 VALUES ('A', 5, DATE '2009-07-01');
SQL> l
  1  SELECT C1, C2, C3
  2  FROM (SELECT C1, C2, C3,
  3               LAG(C2)  OVER (PARTITION BY C1 ORDER BY C3) AS PRIOR_C2,
  4               LEAD(C2) OVER (PARTITION BY C1 ORDER BY C3) AS NEXT_C2
  5        FROM T1
  6       )
  7  WHERE C2 <> PRIOR_C2
  8     OR PRIOR_C2 IS NULL -- to pick up the first value
  9  ORDER BY C1, C3
SQL> /
C         C2 C3
- ---------- -------------------
A          5 2009-01-01 00:00:00
A         12 2009-02-01 00:00:00
A          9 2009-05-01 00:00:00
A          5 2009-07-01 00:00:00
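A hedged sketch of the same approach in DB2 itself, assuming a version with LAG support (e.g. DB2 for LUW 9.7 and later) and using the session.sample table created in the answer further down:
SELECT letter, number, update_date
FROM (SELECT letter, number, update_date,
             LAG(number) OVER (PARTITION BY letter
                               ORDER BY update_date) AS prior_number
      FROM session.sample) t
WHERE prior_number IS NULL           -- first row per letter
   OR number <> prior_number         -- value changed
ORDER BY letter, update_date;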
This is not possible with set based commands (i.e. group by and such).
You may be able to do this by using cursors.
Personally, I would get the data into my client application and do the filtering there.
The first thing you'd have to do is identify the sequence within which you wish to view/consider the data. Values of "Jan, Feb, Mar" don't help, because the data's not in alphabetical order. And what happens when you flip from Dec to Jan? Step 1: identify a sequence that uniquely defines each row with regard to your problem.
Next, you have to be able to compare item #x with item #x-1, to see if it has changed. If changed, include; if not changed, exclude. This is trivial with procedural code loops (cursors in SQL), but would you want to use those? They tend not to perform too well.
One SQL-based way to do this is to join the table on itself, with a join clause along the lines of "cur.SequenceVal = prev.SequenceVal + 1" (see the sketch below). Throw in the comparison, make sure you don't toss the very first row of the set (where there is no x-1), and you're done. Note that performance may suffer if SequenceVal is not indexed.
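A minimal sketch of that self-join, assuming a gap-free SequenceVal column and the hypothetical table/column names MyTable, Letter, Number, Update_Date from the question:
SELECT cur.Letter, cur.Number, cur.Update_Date
FROM MyTable cur
LEFT JOIN MyTable prev
       ON cur.Letter = prev.Letter
      AND cur.SequenceVal = prev.SequenceVal + 1
WHERE prev.Number IS NULL            -- the very first row (no x-1)
   OR cur.Number <> prev.Number;     -- the value changed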
Using an "EXCEPT" clause is one way to do it. See below for the solution. I've included all of my test steps here. First, I created a session table (this will go away after I disconnect from my database).
CREATE TABLE session.sample (
letter CHAR(1),
number INT,
update_date DATE
);
Then I imported your sample data:
IMPORT FROM sample.csv OF DEL INSERT INTO session.sample;
Verified that your sample information is in the database:
SELECT * FROM session.sample;
LETTER NUMBER      UPDATE_DATE
------ ----------- -----------
A                5  01/01/2009
A               12  02/01/2009
A               12  03/01/2009
A               12  04/01/2009
A                9  05/01/2009
A                9  06/01/2009
A                5  07/01/2009

  7 record(s) selected.
I wrote this with an EXCEPT clause, and used the "WITH" to try to make it clearer. Basically, rows_with_previous selects every row that is followed one month later by a row with the same letter and number. I then exclude all of those rows from a select on the whole table, which keeps the last row of each run:
WITH rows_with_previous AS (
SELECT s.*
FROM session.sample s
JOIN session.sample s2
ON s.letter = s2.letter
AND s.number = s2.number
AND s.update_date = s2.update_date - 1 MONTH
)
SELECT *
FROM session.sample
EXCEPT ALL
SELECT *
FROM rows_with_previous;
Here is the result:
LETTER NUMBER      UPDATE_DATE
------ ----------- -----------
A                5  01/01/2009
A               12  04/01/2009
A                9  06/01/2009
A                5  07/01/2009

  4 record(s) selected.