Counting columns with a where clause - sql

Is there a way to count a number of columns which has a particular value for each rows in Hive.
I have data which looks like in input and I want to count how many columns have value 'a' and how many column have value 'b' and get the output like in 'Output'.
Is there a way to accomplish this with Hive query?

One method in Hive is:
select ( (case when cl_1 = 'a' then 1 else 0 end) +
(case when cl_2 = 'a' then 1 else 0 end) +
(case when cl_3 = 'a' then 1 else 0 end) +
(case when cl_4 = 'a' then 1 else 0 end) +
(case when cl_5 = 'a' then 1 else 0 end)
) as count_a,
( (case when cl_1 = 'b' then 1 else 0 end) +
(case when cl_2 = 'b' then 1 else 0 end) +
(case when cl_3 = 'b' then 1 else 0 end) +
(case when cl_4 = 'b' then 1 else 0 end) +
(case when cl_5 = 'b' then 1 else 0 end)
) as count_b
from t;
To get the total count, I would suggest using a subquery and adding count_a and count_b.

Use lateral view with explode on the data and do the aggregations on it.
select id
,sum(cast(col='a' as int)) as cnt_a
,sum(cast(col='b' as int)) as cnt_b
,sum(cast(col in ('a','b') as int)) as cnt_total
from tbl
lateral view explode(array(ci_1,ci_2,ci_3,ci_4,ci_5)) tbl as col
group by id

Related

SQL sum of two conditional aggregation

So yesterday I learned about conditional aggregation. I'm fairly new to SQL.
Here is my query:
select
Year_CW,
sum(case when col = 0 then 1 else 0 end) as "Total_sampled(Checked)",
sum(case when col = 1 then 1 else 0 end) as "Total_unsampled(Not_Checked)",
sum(case when col = 0 AND col2 = 'accepted' then 1 else 0 end) as "Accepted",
sum(case when col = 0 AND col2 = 'accepted with comments' then 1 else 0 end) as "Accepted with comments",
sum(case when col = 0 AND col2 = 'request for rework' then 1 else 0 end) as "Request for rework",
sum(case when col = 0 AND col2 = 'rejected' then 1 else 0 end) as "Rejected",
sum(case when col = 0 Or col = 1 then 1 else 0 end) as "Total_DS"
from
(select
Year_CW, SAMPLED as col, APPROVAL as col2
from
View_TEST tv) tv
group by
Year_CW
order by
Year_CW desc
I'm basically just calculating some KPIs grouped by week.
Look at the row for "Total_DS". It is essentially the sum of the first two sums, "Total_sampled(Checked)" and "Total_unsampled(Not_Checked)".
Is there a way that I can add the two columns from the first two sums to get the third one instead of trying to get the data all over again? I feel performance wise this would be terrible practice. It doesn't matter for this database but I don't want to learn bad code practice from the start.
Thanks for helping.
You probably won't see a significant performance hit from what you're doing now as you already have all the data available, you're just repeating the case evaluation.
But you can't refer to the column aliases for the first two columns within the same level of query.
If you can't do a simple count as #Zeki suggested because you aren't sure if there might be values other than zero and one (though this looks rather like a binary true/false equivalent, so there may well be a check constraint limiting you to those values), or if you're just more interested in a more general case, you can use an inline view as #jarhl suggested:
select Year_CW,
"Total_sampled(Checked)",
"Total_unsampled(Not_Checked)",
"Accepted",
"Accepted with comments",
"Request for rework",
"Rejected",
"Total_sampled(Checked)" + "Total_unsampled(Not_Checked)" as "Total_DS"
from (
select Year_CW,
sum(case when col = 0 then 1 else 0 end) as "Total_sampled(Checked)",
sum(case when col = 1 then 1 else 0 end) as "Total_unsampled(Not_Checked)",
sum(case when col = 0 AND col2 = 'accepted' then 1 else 0 end) as "Accepted",
sum(case when col = 0 AND col2 = 'accepted with comments' then 1 else 0 end)
as "Accepted with comments",
sum(case when col = 0 AND col2 = 'request for rework' then 1 else 0 end)
as "Request for rework",
sum(case when col = 0 AND col2 = 'rejected' then 1 else 0 end) as "Rejected"
from (
select Year_CW, SAMPLED as col, APPROVAL as col2
from View_TEST tv
) tv
group by Year_CW
)
order by Year_CW desc;
The inner query gets the data and calculates the conditional aggregate values. The outer query just gets those values from the inner query, and also adds the Total_DS column to the result set by adding together the rwo values from the inner query.
You should generally avoid quoted identifiers, and if you really need them in your result set you should apply them at the last possible moment - so use unquoted identifiers in the inner query, and give them qupted aliases in the outer query. And personally if the point of a query is to count things, I prefer to use a conditional count over a conditional sum. I'm also not sure why you already have a subquery against your view, which just changes the column names and makes the main query slightly more obscure. So I might do this as:
select year_cw,
total_sampled_checked as "Total_sampled(Checked)",
total_unsampled_not_checked as "Total_unsampled(Not_Checked)",
accepted as "Accepted",
accepted_with_comments as "Accepted with comments",
request_for_rework as "Request for rework",
rejected as "Rejected",
total_sampled_checked + total_unsampled_not_checked as "Total_DS"
from (
select year_cw,
count(case when sampled = 0 then 1 end) as total_sampled_checked,
count(case when sampled = 1 then 1 end) as total_unsampled_not_checked,
count(case when sampled = 0 and approval = 'accepted' then 1 end) as accepted,
count(case when sampled = 0 and approval = 'accepted with comments' then 1 end)
as accepted_with_comments,
count(case when sampled = 0 and approval = 'request for rework' then 1 end)
as request_for_rework,
count(case when sampled = 0 and approval = 'rejected' then 1 end) as rejected
from view_test
group by year_cw
)
order by year_cw desc;
Note that in the case expression, then 1 can be then <anything that isn't null>, so you could do then sampled or whatever. I've left out the implicit else null. As count() ignores nulls, all the case expression has to do is evaluate to any not-null value for the rows you want to include in the count.
You can try below
select Year_CW,
sum(case when col = 0 then 1 else 0 end) as "Total_sampled(Checked)",
sum(case when col = 1 then 1 else 0 end) as "Total_unsampled(Not_Checked)",
sum(case when col = 0 AND col2 = 'accepted' then 1 else 0 end) as "Accepted",
sum(case when col = 0 AND col2 = 'accepted with comments' then 1 else 0 end) as "Accepted with comments",
sum(case when col = 0 AND col2 = 'request for rework' then 1 else 0 end) as "Request for rework",
sum(case when col = 0 AND col2 = 'rejected' then 1 else 0 end) as "Rejected",
sum(sum(case when col = 0 then 1 else 0 end) = 0 Or sum(case when col = 1 then 1 else 0 end) = 1 then 1 else 0 end) as "Total_DS"
from (select Year_CW, SAMPLED as col, APPROVAL as col2
from View_TEST tv
) tv
group by Year_CW
order by Year_CW desc

Create and classify a row based on column values

I am attempting to assign a classification to a row of data based on whether certain values exist. Utilizing the sample code below I have gotten to a place where I've gotten stuck.
proc sql;
create table test
(id char(4),
task char(4),
id2 char(4),
status char(10),
seconds num);
insert into test
values('1','A','1','COMP',15)
values('1','B','2','WORK',20)
values('1','C','3','COMP',50)
values('1','D','3','COMP',null)
values('2','A','1','COMP',15)
values('2','B','2','COMP',520)
values('2','C','2','COMP',NULL)
values('2','D','3','COMP',221)
values('2','E','3','COMP',null)
values('2','F','3','COMP',null);
proc sql;
create table test2 as
select
ID,
ID2,
STATUS,
SUM(SECONDS) AS SECONDS,
sum(case when task='A' THEN 1 ELSE 0 END) AS A,
sum(case when task='B' THEN 1 ELSE 0 END) AS B,
sum(case when task='C' THEN 1 ELSE 0 END) AS C,
sum(case when task='D' THEN 1 ELSE 0 END) AS D,
sum(case when task='E' THEN 1 ELSE 0 END) AS E,
sum(case when task='F' THEN 1 ELSE 0 END) AS F
from
test
GROUP BY
ID,
ID2,
STATUS
;
quit;
Ultimately I would like to classify each row that gets created in the second step 'test2' to have a column that looks to the values in each lettered column(A-F) and Label them as such. So when the Row has a 1 in Column A only, it would be labeled 'A' but when a row has a 1 in multiple columns like 'D', 'E' and 'F' I would like it to be labeled as D_E_F.
Best way to do this is in a DATA STEP:
data test3;
format classifier $32.;
set test2;
array vars[6] A B C D E F;
classifier = "";
do i=1 to 6;
if vars[i] then
classifier = catx("_",classifier,vname(vars[i]));
end;
drop i;
run;
I create a character variable CLASSIFIER with length 32.
I define an array that groups the columns A through F. This allows me to loop over those columns easily.
Initialize the CLASSIFIER variable.
Loop over the array. If the value =1, then add the name of the variable to the CLASSIFIER string.
CATX(delim,str1,str2) concatenates str1 and str2 with the delim in the middle. It also removes whitespace.
VNAME(array[i]) returns the variable name of the variable pointed to by array[i].
Finally remove the i loop variable, unless you really want it in your output.
I know it is ugly, but you can do it with CASE statements accumulating the wanted result in another field. You have the SQL Fiddle here.
Note that if it is possible that the concatenation is empty you will have to check this condition to avoid performing the substring.
select
ID,
ID2,
STATUS,
SUM(SECONDS) AS SECONDS,
sum(case when task='A' THEN 1 ELSE 0 END) AS A,
sum(case when task='B' THEN 1 ELSE 0 END) AS B,
sum(case when task='C' THEN 1 ELSE 0 END) AS C,
sum(case when task='D' THEN 1 ELSE 0 END) AS D,
sum(case when task='E' THEN 1 ELSE 0 END) AS E,
sum(case when task='F' THEN 1 ELSE 0 END) AS F,
substring(
case when sum(case when task='A' THEN 1 ELSE 0 END) = 1 then '_A' else '' end
+ case when sum(case when task='B' THEN 1 ELSE 0 END) = 1 then '_B' else '' end
+ case when sum(case when task='C' THEN 1 ELSE 0 END) = 1 then '_C' else '' end
+ case when sum(case when task='D' THEN 1 ELSE 0 END) = 1 then '_D' else '' end
+ case when sum(case when task='E' THEN 1 ELSE 0 END) = 1 then '_E' else '' end
+ case when sum(case when task='F' THEN 1 ELSE 0 END) = 1 then '_F' else '' end,
2, len(case when sum(case when task='A' THEN 1 ELSE 0 END) = 1 then '_A' else '' end
+ case when sum(case when task='B' THEN 1 ELSE 0 END) = 1 then '_B' else '' end
+ case when sum(case when task='C' THEN 1 ELSE 0 END) = 1 then '_C' else '' end
+ case when sum(case when task='D' THEN 1 ELSE 0 END) = 1 then '_D' else '' end
+ case when sum(case when task='E' THEN 1 ELSE 0 END) = 1 then '_E' else '' end
+ case when sum(case when task='F' THEN 1 ELSE 0 END) = 1 then '_F' else '' end) - 1) as wantedOutput
from
test
GROUP BY
ID,
ID2,
STATUS

How to use count distinct along multiple columns?

Input:
id sem1 sem2 sem3 sem4 sem5 sem6 sem7
1 S O S R null null null
2 O O R R S null null
Desired Output:
id O R S
1 1 1 2
2 2 2 1
If your database supports APPLY/UNPIVOT operator then use this
CROSS APPLY method
SELECT id,
SUM(CASE WHEN val = 'O' THEN 1 ELSE 0 END) O,
SUM(CASE WHEN val = 'R' THEN 1 ELSE 0 END) R,
SUM(CASE WHEN val = 'S' THEN 1 ELSE 0 END) S
FROM mytable
CROSS apply (VALUES (sem1),
(sem2),
(sem3),
(sem4),
(sem5),
(sem6),
(sem7)) cs(val)
GROUP BY id
SQL FIDDLE DEMO
UNPIVOT method
SELECT id,
SUM(CASE WHEN val = 'O' THEN 1 ELSE 0 END) O,
SUM(CASE WHEN val = 'R' THEN 1 ELSE 0 END) R,
SUM(CASE WHEN val = 'S' THEN 1 ELSE 0 END) S
FROM (SELECT *
FROM mytable) a
UNPIVOT (val
FOR col IN ( sem1,
sem2,
sem3,
sem4,
sem5,
sem6,
sem7 )) upv
GROUP BY id
SQL FIDDLE DEMO
I personally prefer CROSS APPLY method over UNPIVOT since it is more readable. Performance wise both will be identical

Sum data for many different results for same field

I am trying to find a better way to write this sql server code 2008. It works and data is accurate. Reason i ask is that i will be asked to do this for several other reports going forward and want to reduce the amount of code to upkeep going forward.
How can i take a field where i sum for the yes/no/- (dash) in each field without doing an individual sum as i have in code. Each table is a month of detail data which i sum using in a CTE. i changed the table name for each month and Union All to put data together. Is there a better way to do this. This is a small sample of code. Thanks for the help.
WITH H AS (
SELECT 'August' AS Month_Name
, SUM(CASE WHEN G.FFS = '-' THEN 1 ELSE 0 END) AS FFS_Dash
, SUM(CASE WHEN G.FFS = 'Yes' THEN 1 ELSE 0 END) AS FFS_Yes
, SUM(CASE WHEN G.FFS = 'No' THEN 1 ELSE 0 END) AS FFS_No
, SUM(CASE WHEN G.DNA = '-' THEN 1 ELSE 0 END) AS DNA_Dash
, SUM(CASE WHEN G.DNA = 'Yes' THEN 1 ELSE 0 END) AS DNA_Yes
, SUM(CASE WHEN G.DNA = 'No' THEN 1 ELSE 0 END) AS DNA_No
FROM table08 G )
, G AS (
SELECT 'July' AS Month_Name
, SUM(CASE WHEN G.FFS = '-' THEN 1 ELSE 0 END) AS FFS_Dash
, SUM(CASE WHEN G.FFS = 'Yes' THEN 1 ELSE 0 END) AS FFS_Yes
, SUM(CASE WHEN G.FFS = 'No' THEN 1 ELSE 0 END) AS FFS_No
, SUM(CASE WHEN G.DNA = '-' THEN 1 ELSE 0 END) AS DNA_Dash
, SUM(CASE WHEN G.DNA = 'Yes' THEN 1 ELSE 0 END) AS DNA_Yes
, SUM(CASE WHEN G.DNA = 'No' THEN 1 ELSE 0 END) AS DNA_No
FROM table07 G )
select * from H
UNION ALL
select * from G
How about:
SELECT Month_Name,
SUM(CASE WHEN G.FFS = '-' THEN 1 ELSE 0 END) AS FFS_Dash,
SUM(CASE WHEN G.FFS = 'Yes' THEN 1 ELSE 0 END) AS FFS_Yes,
SUM(CASE WHEN G.FFS = 'No' THEN 1 ELSE 0 END) AS FFS_No,
SUM(CASE WHEN G.DNA = '-' THEN 1 ELSE 0 END) AS DNA_Dash,
SUM(CASE WHEN G.DNA = 'Yes' THEN 1 ELSE 0 END) AS DNA_Yes,
SUM(CASE WHEN G.DNA = 'No' THEN 1 ELSE 0 END) AS DNA_No
FROM ((select 'July' as Month_Name, G.*
from table07 G
) union all
(select 'August', H.*
from table08 H
)
) gh
GROUP BY Month_Name;
However, having tables with the same structure is usually a sign of poor database design. You should have a single table with a column representing the month.

Oracle SQL dividing two self defined columns

if i have the following select two count cases:
COUNT(CASE WHEN STATUS ='Færdig' THEN 1 END) as completed_callbacks,
COUNT(CASE WHEN SOLVED_SECONDS /60 /60 <= 2 THEN 1 END) as completed_within_2hours
and i want to devide the two results with eachother how can i achieve this?
this is my attemt however that failed:
CASE(completed_callbacks / completed_within_2hours * 100) as Percentage
i know this is a rather simple question but i havnt been able to find the answer anywhere
You have to create a derived table:
SELECT completed_callbacks / completed_within_2hours * 100
FROM (SELECT Count(CASE
WHEN status = 'Færdig' THEN 1
END) AS completed_callbacks,
Count(CASE
WHEN solved_seconds / 60 / 60 <= 2 THEN 1
END) AS completed_within_2hours
FROM yourtable
WHERE ...)
Try this:
with x as (
select 'Y' as completed, 'Y' as completed_fast from dual
union all
select 'Y' as completed, 'N' as completed_fast from dual
union all
select 'Y' as completed, 'Y' as completed_fast from dual
union all
select 'N' as completed, 'N' as completed_fast from dual
)
select
sum(case when completed='Y' then 1 else 0 end) as count_completed,
sum(case when completed='N' then 1 else 0 end) as count_not_completed,
sum(case when completed='Y' and completed_fast='Y' then 1 else 0 end) as count_completed_fast,
case when (sum(case when completed='Y' then 1 else 0 end) = 0) then 0 else
((sum(case when completed='Y' and completed_fast='Y' then 1 else 0 end) / sum(case when completed='Y' then 1 else 0 end))*100)
end pct_completed_fast
from x;
Results:
"COUNT_COMPLETED" "COUNT_NOT_COMPLETED" "COUNT_COMPLETED_FAST" "PCT_COMPLETED_FAST"
3 1 2 66.66666666666666666666666666666666666667
The trick is to use SUM rather than COUNT, along with a decode or CASE.
select
COUNT(CASE WHEN STATUS ='Færdig' THEN 1 END)
/
COUNT(CASE WHEN SOLVED_SECONDS /60 /60 <= 2 THEN 1 END)
* 100
as
Percentage