Condition while aggregation in Spark


This question is about conditional aggregation in SQL. Normally we express conditions with a 'case' expression in the select clause, but that condition only checks the row under consideration. Consider the data below:
BEGIN TRANSACTION;
/* Create a table called NAMES */
CREATE TABLE NAMES(M CHAR, D CHAR, A INTEGER);
/* Create few records in this table */
INSERT INTO NAMES VALUES('M1','Y',2);
INSERT INTO NAMES VALUES('M1','Y',3);
INSERT INTO NAMES VALUES('M2','Y',2);
INSERT INTO NAMES VALUES('M2',null,3);
INSERT INTO NAMES VALUES('M3',null,2);
INSERT INTO NAMES VALUES('M3',null,3);
COMMIT;
This query groups by column 'M', checks column 'D' separately for each record, and sums column 'A', counting 0 for every record where 'D' is 'Y':
select M, sum(case when D = 'Y' then 0 else A end) from NAMES group by M;
Output for this query is:
M1|0
M2|3
M3|5
But what if we want to check column 'D' across all records in the group? If any record in the group has 'Y', the 'sum' aggregation should not be performed at all.
In brief, the expected output for the above scenario is:
M1|0
M2|0
M3|5
Answers in Spark SQL are highly appreciated.

You can use another case expression that tests the whole group before summing:
select M,
(case when max(case when D = 'Y' then 1 else 0 end) = 1 -- at least one 'Y' in the group
then 0
else sum(A)
end) as total
from NAMES
group by M;
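To check a group-level condition like this against the sample data, here is a minimal sketch using Python's built-in sqlite3 module rather than Spark. The 0/1 flag inside MAX is one possible way to ask "does any row in the group have D = 'Y'?". It uses only standard SQL, so it is expected to carry over to Spark SQL, but that is an assumption and has not been verified here.

```python
import sqlite3

# In-memory database loaded with the sample rows from the question.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE NAMES (M TEXT, D TEXT, A INTEGER)")
conn.executemany(
    "INSERT INTO NAMES VALUES (?, ?, ?)",
    [("M1", "Y", 2), ("M1", "Y", 3),
     ("M2", "Y", 2), ("M2", None, 3),
     ("M3", None, 2), ("M3", None, 3)],
)

# MAX over a 0/1 flag is 1 when at least one row in the group has D = 'Y'.
# In that case the group yields 0; otherwise A is summed over the group.
query = """
SELECT M,
       CASE WHEN MAX(CASE WHEN D = 'Y' THEN 1 ELSE 0 END) = 1
            THEN 0
            ELSE SUM(A)
       END AS total
FROM NAMES
GROUP BY M
ORDER BY M
"""
rows = conn.execute(query).fetchall()
for m, total in rows:
    print(f"{m}|{total}")
```

On the sample data this prints M1|0, M2|0 and M3|5, which matches the expected output in the question.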

Related

How to create an alias that counts Ws, Ls, and Ds to create a record

I have a table of sports results with a column labeled 'Result' whose values are W, L, or D. I would like to create an alias column that counts the Ws, Ls, and Ds across the whole table in that column and displays them as 'Count W-Count L-Count D'.
I'm very new to SQL and I haven't figured this specific of a request out, nor can I find the correct search terms in Google to discover a video or forum result for the situation I am looking for.

If you want the values in separate columns, use conditional aggregation:
select sum(case when result = 'W' then 1 else 0 end) as w_cnt,
sum(case when result = 'L' then 1 else 0 end) as l_cnt,
sum(case when result = 'D' then 1 else 0 end) as d_cnt
from t;
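If the three counts should appear as a single 'W-L-D' record, as the question asks, they can be concatenated in the same query. The sketch below runs in Python's sqlite3 module. The table name t and the sample rows are hypothetical, and the || concatenation operator is an assumption about the dialect (MySQL and SQL Server typically need CONCAT instead).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table and rows; only the column name `result` comes from the question.
conn.execute("CREATE TABLE t (result TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?)",
    [("W",), ("W",), ("L",), ("D",), ("W",), ("D",)],
)

# Count each outcome with conditional aggregation, then join the counts with '-'.
query = """
SELECT SUM(CASE WHEN result = 'W' THEN 1 ELSE 0 END) || '-' ||
       SUM(CASE WHEN result = 'L' THEN 1 ELSE 0 END) || '-' ||
       SUM(CASE WHEN result = 'D' THEN 1 ELSE 0 END) AS record
FROM t
"""
record = conn.execute(query).fetchone()[0]
print(record)  # 3 wins, 1 loss, 2 draws -> 3-1-2
```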

The best option is to use group by:
select result, count(*) from t
where result in ('W', 'L', 'D')
group by result;

BigQuery(standard SQL) grouping values based on first CASE WHEN statement

Here is my query with the output below the syntax.
SELECT DISTINCT CASE WHEN id = 'RUS0261431' THEN value END AS sr_type,
COUNT(CASE WHEN id in ('RUS0290788') AND value in ('1','2','3','4') THEN respondentid END) AS sub_ces,
COUNT(CASE WHEN id IN ('RUS0290788') AND value in ('5','6','7') THEN respondentid END) AS pos_ces,
COUNT(*) as total_ces
FROM `some_table`
WHERE id in ( 'RUS0261431') AND id <> '' AND value IS NOT NULL
GROUP BY 1
As you can see in the attached table, I'm unable to group the values for id RUS0290788 by the distinct values that map to RUS0261431. Is there any way to pivot by altering my CASE WHEN statements so that I can group sub_ces and pos_ces by sr_type? Thanks in advance.

You can simplify your WHERE condition to WHERE id = ('RUS0261431'). Only records with this value will be selected so you do not have to repeat this in the CASE statements.
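Beyond simplifying the WHERE clause, one possible way to get sub_ces and pos_ces per sr_type is to pivot each respondent's two answers onto one row first and then group. The sketch below assumes a layout the question does not confirm: one row per respondent per question id, with sr_type stored under RUS0261431 and the score under RUS0290788. The sample rows and the per_respondent name are hypothetical, and the query is run with Python's sqlite3 module rather than BigQuery.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical layout: one row per respondent per question id.
conn.execute("CREATE TABLE some_table (respondentid TEXT, id TEXT, value TEXT)")
conn.executemany(
    "INSERT INTO some_table VALUES (?, ?, ?)",
    [("r1", "RUS0261431", "billing"), ("r1", "RUS0290788", "2"),
     ("r2", "RUS0261431", "billing"), ("r2", "RUS0290788", "6"),
     ("r3", "RUS0261431", "outage"),  ("r3", "RUS0290788", "7")],
)

# Step 1: pivot both answers of a respondent onto a single row.
# Step 2: count score bands per sr_type.
query = """
WITH per_respondent AS (
    SELECT respondentid,
           MAX(CASE WHEN id = 'RUS0261431' THEN value END) AS sr_type,
           MAX(CASE WHEN id = 'RUS0290788' THEN value END) AS ces
    FROM some_table
    WHERE id IN ('RUS0261431', 'RUS0290788')
    GROUP BY respondentid
)
SELECT sr_type,
       COUNT(CASE WHEN ces IN ('1', '2', '3', '4') THEN respondentid END) AS sub_ces,
       COUNT(CASE WHEN ces IN ('5', '6', '7') THEN respondentid END) AS pos_ces,
       COUNT(*) AS total_ces
FROM per_respondent
WHERE sr_type IS NOT NULL
GROUP BY sr_type
ORDER BY sr_type
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

With the hypothetical rows above, 'billing' gets one respondent in each band and 'outage' gets one in the positive band.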

Not a group expression: troubleshooting an Oracle query

Oracle query
I have a column that contains the hardcoded value 'N/A' along with other character values. I need to write a select query that returns the min of this column, grouped by another set of columns. The challenge is that I need to replace the hardcoded value 'N/A' with another string, 'Abc', while applying the min function.
Option 1: nvl won't work because the value is hardcoded rather than null.
Option 2: decode in the select list with min inside the decode arguments, and a group by clause on the other columns used in the select list.
However, I am getting an error:
ORA-00979: not a GROUP BY expression
Example :
Select a, b, decode(z,'N/A','abc',min(z))
From table1, table2
Where table1.p=table2.q
Group by a,b
Having table1.c >= table2.d

You should be using DECODE inside the MIN function, not the other way around. But, I would probably just use a single CASE expression here:
SELECT
a,
b,
MIN(CASE WHEN z = 'N/A' THEN 'abc' ELSE z END) AS min_value
FROM table1 t1
INNER JOIN table2 t2
ON t1.p = t2.q
GROUP BY
a,
b;
The above CASE expression simply takes the minimum value of z for each group. The only difference from a plain MIN(z) is that a value of N/A is treated as abc.
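To see the effect of putting the replacement inside MIN, here is a small sketch run with Python's sqlite3 module instead of Oracle. The table joined and its rows are hypothetical and stand in for the result of joining table1 and table2. String comparison here uses SQLite's default binary collation, which is an assumption that may not match a given Oracle session.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical single table standing in for the joined table1/table2 result.
conn.execute("CREATE TABLE joined (a TEXT, b TEXT, z TEXT)")
conn.executemany(
    "INSERT INTO joined VALUES (?, ?, ?)",
    [("x", "p", "N/A"), ("x", "p", "def"),
     ("y", "q", "bcd"), ("y", "q", "xyz")],
)

# plain_min compares the raw values; min_value maps 'N/A' to 'abc' first.
query = """
SELECT a,
       b,
       MIN(z) AS plain_min,
       MIN(CASE WHEN z = 'N/A' THEN 'abc' ELSE z END) AS min_value
FROM joined
GROUP BY a, b
ORDER BY a, b
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

For group (x, p) the plain minimum is 'N/A' while the CASE version returns 'abc'. Group (y, q) has no 'N/A' row, so both columns agree.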

How to limit the columns of the select statement?

I have to create a new table and inside should be the columns I get from the CASE statement. I do not need the rest of the columns resulting from the select statement
for example:
CREATE TABLE test
AS (
SELECT a.id, ...
CASE WHEN a.id = 1 THEN 2
ELSE 0
END as LegalType
FROM table a, ...
WHERE ...);
Now my question: how can I select only the column LegalType from the CASE statement? I do not want the new table to have the column a.id.

You can use SELECT INTO and list only the CASE expression in the select list:
SELECT CASE WHEN a.id = 1 THEN 2 ELSE 0 END as LegalType
....
INTO test
FROM table a
WHERE 1=1;
This will create a table based on the data returned by the SELECT. See https://msdn.microsoft.com/en-GB/library/ms188029.aspx
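The same idea also works with the CREATE TABLE ... AS form used in the question: the new table gets exactly the columns named in the select list. A minimal sketch with Python's sqlite3 module follows. The table source and its rows are hypothetical, since the question elides the real tables and WHERE clause.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical source table; the question only names the column a.id.
conn.execute("CREATE TABLE source (id INTEGER)")
conn.executemany("INSERT INTO source VALUES (?)", [(1,), (2,), (3,)])

# Only the CASE expression is selected, so `test` has a single column.
conn.execute("""
CREATE TABLE test AS
SELECT CASE WHEN a.id = 1 THEN 2 ELSE 0 END AS LegalType
FROM source a
""")

# PRAGMA table_info returns one row per column; index 1 is the column name.
columns = [row[1] for row in conn.execute("PRAGMA table_info(test)")]
values = [row[0] for row in conn.execute("SELECT LegalType FROM test")]
print(columns, values)
```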

How do I determine if a group of data exists in a table, given the data that should appear in the group's rows?

I am writing data to a table and allocating a "group-id" for each batch of data that is written. To illustrate, consider the following table.
GroupId Value
------- -----
1 a
1 b
1 c
2 a
2 b
3 a
3 b
3 c
3 d
In this example, there are three groups of data, each with similar but varying values.
How do I query this table to find a group that contains a given set of values? For instance, if I query for (a,b,c) the result should be group 1. Similarly, a query for (b,a) should result in group 2, and a query for (a, b, c, e) should result in the empty set.
I can write a stored procedure that performs the following steps:
1. select distinct GroupId from Groups -- and store locally
2. for each distinct GroupId: perform a set-difference (except) between the input and table values (for the group), and vice versa
3. return the GroupId if both set-difference operations produced empty sets
This seems a bit excessive, and I am hoping to leverage some other commands in SQL to simplify it. Is there a simpler way to perform a set comparison in this context, or to select the group ID that contains exactly the input values for the query?

This is a set-within-sets query. I like to solve it using group by and having:
select groupid
from GroupValues gv
group by groupid
having sum(case when value = 'a' then 1 else 0 end) > 0 and
sum(case when value = 'b' then 1 else 0 end) > 0 and
sum(case when value = 'c' then 1 else 0 end) > 0 and
sum(case when value not in ('a', 'b', 'c') then 1 else 0 end) = 0;
The first three conditions in the having clause check that each element exists. The last condition checks that there are no other values. This method is quite flexible and handles various inclusion and exclusion conditions on the values you are looking for.
EDIT:
If you want to pass in a list, you can use:
with thelist as (
select 'a' as value union all
select 'b' union all
select 'c'
)
select groupid
from GroupValues gv left outer join
thelist
on gv.value = thelist.value
group by groupid
having count(distinct gv.value) = (select count(*) from thelist) and
count(distinct (case when gv.value = thelist.value then gv.value end)) = count(distinct gv.value);
Here the having clause counts the number of matching values and makes sure that this is the same size as the list.
EDIT:
The query failed to compile because a table alias was missing. It has been updated with the right table alias.
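The group by / having approach can also be driven from application code when the list of values varies. The sketch below uses Python's sqlite3 module with the sample data from the question. The helper name find_groups is hypothetical, and it uses a slightly different pair of conditions from the queries above: the group must have as many distinct values as the wanted set and no value outside it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE GroupValues (groupid INTEGER, value TEXT)")
conn.executemany(
    "INSERT INTO GroupValues VALUES (?, ?)",
    [(1, "a"), (1, "b"), (1, "c"),
     (2, "a"), (2, "b"),
     (3, "a"), (3, "b"), (3, "c"), (3, "d")],
)

def find_groups(conn, wanted):
    """Return ids of groups whose set of values equals `wanted` exactly."""
    wanted = sorted(set(wanted))
    if not wanted:
        return []
    # One bound parameter per wanted value; values are never spliced into the SQL.
    placeholders = ", ".join("?" for _ in wanted)
    query = f"""
        SELECT groupid
        FROM GroupValues
        GROUP BY groupid
        HAVING COUNT(DISTINCT value) = ?
           AND SUM(CASE WHEN value IN ({placeholders}) THEN 0 ELSE 1 END) = 0
        ORDER BY groupid
    """
    # Parameter order follows the order of the '?' marks in the query text.
    return [row[0] for row in conn.execute(query, [len(wanted), *wanted])]

print(find_groups(conn, ["a", "b", "c"]))       # [1]
print(find_groups(conn, ["b", "a"]))            # [2]
print(find_groups(conn, ["a", "b", "c", "e"]))  # []
```

If a group has no value outside the wanted set and as many distinct values as the set itself, the two sets must be equal, which is why only two conditions are needed.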

This is kind of ugly, but it works. On larger datasets I'm not sure what performance would look like, but the nested instances of #GroupValues key off GroupID in the main table so I think as long as you have a good index on GroupID it probably wouldn't be too horrible.
If Object_ID('tempdb..#GroupValues') Is Not Null Drop Table #GroupValues
Create Table #GroupValues (GroupID Int, Val Varchar(10));
Insert #GroupValues (GroupID, Val)
Values (1,'a'),(1,'b'),(1,'c'),(2,'a'),(2,'b'),(3,'a'),(3,'b'),(3,'c'),(3,'d');
If Object_ID('tempdb..#FindValues') Is Not Null Drop Table #FindValues
Create Table #FindValues (Val Varchar(10));
Insert #FindValues (Val)
Values ('a'),('b'),('c');
Select Distinct gv.GroupID
From (Select Distinct GroupID
From #GroupValues) gv
Where Not Exists (Select 1
From #FindValues fv2
Where Not Exists (Select 1
From #GroupValues gv2
Where gv.GroupID = gv2.GroupID
And fv2.Val = gv2.Val))
And Not Exists (Select 1
From #GroupValues gv3
Where gv3.GroupID = gv.GroupID
And Not Exists (Select 1
From #FindValues fv3
Where gv3.Val = fv3.Val))
