In SQL, how do you count if any rows in a group match a certain criteria? - sql

I'm new to SQL, but I have a dataset that has students, their class subjects, and if there was an error in their work. I want to know how many students have at least 1 error in any subject. Thus, whether a student has one subject with an error (like students 2 and 3 in the example) or multiple errors (like student 4), they'd be flagged. Only if they have no errors should they be categorized as 'no'.
I know I have to use GROUP BY and COUNT, and I'm thinking I have to use HAVING as well, but I can't seem to put it together. Here's a sample dataset:
ID Rating Error
==========================================
1 English No
1 Math No
2 English Yes
2 Math No
2 Science No
3 English Yes
4 English Yes
4 Math Yes
And the desired output:
Error Count Percent
==========================================
No 1 .25
Yes 3 .75

There are many different ways to do it; here is one example using a CTE (common table expression):
with t as (
select
id,
case when sum(case when error='Yes' then 1 else 0 end) > 0 then 'Yes' else 'No' end as error
from students
group by id
)
select
error,
count(*),
(0.0 + count(*)) / (select count(*) from t) as perc
from t
group by error
Basically, the inner query (t) calculates the error status for each student, and the outer query calculates the error distribution and percentages.
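For anyone who wants to try the idea without creating a table, here is a minimal sketch with the sample data inlined. It is a variant, not the exact query above: it flags each student with MAX(CASE ...) and computes the percentage with a window function instead of a scalar subquery. The names per_student, has_error and cnt are made up for illustration, and the VALUES-in-a-CTE syntax assumes a dialect such as PostgreSQL or SQLite 3.25+.

```sql
-- Sample data from the question, inlined so the sketch runs on its own.
WITH students (id, rating, error) AS (
    VALUES (1, 'English', 'No'),  (1, 'Math', 'No'),
           (2, 'English', 'Yes'), (2, 'Math', 'No'), (2, 'Science', 'No'),
           (3, 'English', 'Yes'),
           (4, 'English', 'Yes'), (4, 'Math', 'Yes')
),
per_student AS (
    -- One row per student: has_error is 1 if any of their rows has error = 'Yes'.
    SELECT id,
           MAX(CASE WHEN error = 'Yes' THEN 1 ELSE 0 END) AS has_error
    FROM students
    GROUP BY id
)
SELECT CASE WHEN has_error = 1 THEN 'Yes' ELSE 'No' END AS error,
       COUNT(*) AS cnt,
       -- SUM(COUNT(*)) OVER () is the total number of students across all groups.
       1.0 * COUNT(*) / SUM(COUNT(*)) OVER () AS perc
FROM per_student
GROUP BY has_error
ORDER BY has_error;
```

With the sample data this should give one 'No' student (0.25) and three 'Yes' students (0.75), matching the desired output in the question.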

There are several useful functions you can use:
bool_or(boolean) → boolean - Returns TRUE if any input value is TRUE, otherwise FALSE.
if(condition, true_value, false_value) - Evaluates and returns true_value if condition is true, otherwise evaluates and returns false_value.
select count(distinct id) - to count distinct ids.
with dataset (ID,Rating,Error) as (
values (1,'Math','No'),
(2,'English','Yes'),
(1,'English','No'),
(2,'Math','No'),
(2,'Science','No'),
(3,'English','Yes'),
(4,'English','Yes'),
(4,'Math','Yes')
)
select if(has_error, 'Yes', 'No') Error,
count(*) Count,
cast(count(*) as double) / (select count(distinct id) from dataset) Percent
from (
select bool_or(Error = 'Yes') has_error
from dataset
group by id
)
group by has_error;
Output:
Error Count Percent
Yes 3 0.75
No 1 0.25

Related

sum values per id for a specific condition without group by syntax

I would like to sum the values that have the same id and 'Y' in the YN column, inside a CASE expression, hence I cannot use the GROUP BY syntax. Please see an example and my code below. Table T:
ID Value YN
1 4 Y
1 6 Y
2 3 N
Request:
select
case when YN = 'Y'
then ( select sum(Value) from T group by ID)
else Value
end as TotalResult;
Can you help me display only TotalResult?
Just because you use GROUP BY does not mean that you have to include that column in the SELECT...
SELECT
SUM(Value) AS TotalResult
FROM
T
GROUP BY
ID, YN
=>
Total Result
--------------
10
3
Exactly what query you need, however, is unclear as you have not demonstrated clearly what you want the query to actually do, or what the expected results should be for your sample data.
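If the goal is to keep one output row per input row (which may be why GROUP BY felt unusable), a window function is one possible way to do it. This is only a sketch under that assumption, since the expected result was never stated; the sample rows are inlined with a VALUES CTE, which assumes a dialect such as PostgreSQL or SQLite 3.25+.

```sql
-- Sample table T from the question, inlined for a self-contained run.
WITH T (ID, Value, YN) AS (
    VALUES (1, 4, 'Y'),
           (1, 6, 'Y'),
           (2, 3, 'N')
)
SELECT ID,
       Value,
       YN,
       CASE WHEN YN = 'Y'
            -- Sum over all rows sharing this row's ID and YN flag; no GROUP BY needed.
            THEN SUM(Value) OVER (PARTITION BY ID, YN)
            ELSE Value
       END AS TotalResult
FROM T;
```

For the sample data, both 'Y' rows of ID 1 get TotalResult 10 and the 'N' row of ID 2 keeps its own value, 3. Drop ID, Value and YN from the select list if only TotalResult should be displayed.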

HIVE/Impala query: Count the number of rows between rows fulfilling certain conditions

I need to count the number of rows that fulfill certain conditions contained in intervals defined by other rows that fulfill other conditions. Examples: the number of rows N between 'Reference' having values 1 and 4 that fulfill the condition 'Other_condition' = b is N=1, the number of rows N between 'Reference' having values 2 and 5 that fulfill the condition 'Other_condition' = b is N=2 etc.
Date Reference Other_condition
20171111 1 a
20171112 2 a
20171113 3 b
20171114 4 b
20171115 5 b
I'm accessing the database through Hive/Impala SQL queries and unfortunately I have no idea where to start implementing such a window function. A half-pseudocode version of what I want would be something like:
SELECT COUNT (DISTINCT database.Date) AS counter, Other_condition, reference
FROM database
WHERE database.Other_condition = a AND database.Reference BETWEEN
(window function condition 1: database.Reference = 2) AND
(window function condition 2: database.Reference = 5)
GROUP BY counter
Your question is rather hard to follow. I understand the first condition, which is the number of rows between "1" and "4".
Here is one method that should be pretty easy to generalize:
select (max(case when reference = 4 then seqnum end) -
max(case when reference = 1 then seqnum end)
) as num_rows_1_4
from (select t.*,
row_number() over (order by date) as seqnum
from t
) t;
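The query above measures the distance between the two reference rows. To also apply the Other_condition = 'b' filter, one option is to number the rows once and then count the matching rows that fall strictly between the two bounds; that reading reproduces the N=1 and N=2 examples in the question. This is a rough sketch: the CTE names numbered and bounds and the aliases lo and hi are invented here, and the date column is called dt to avoid clashing with the DATE keyword.

```sql
-- Sample rows from the question, inlined with UNION ALL.
WITH t AS (
    SELECT 20171111 AS dt, 1 AS reference, 'a' AS other_condition
    UNION ALL SELECT 20171112, 2, 'a'
    UNION ALL SELECT 20171113, 3, 'b'
    UNION ALL SELECT 20171114, 4, 'b'
    UNION ALL SELECT 20171115, 5, 'b'
),
numbered AS (
    -- Give every row a position in date order.
    SELECT t.*, ROW_NUMBER() OVER (ORDER BY dt) AS seqnum
    FROM t
),
bounds AS (
    -- Positions of the two reference rows that delimit the interval.
    SELECT MAX(CASE WHEN reference = 1 THEN seqnum END) AS lo,
           MAX(CASE WHEN reference = 4 THEN seqnum END) AS hi
    FROM numbered
)
SELECT COUNT(*) AS num_rows
FROM numbered n
CROSS JOIN bounds b
WHERE n.other_condition = 'b'
  AND n.seqnum > b.lo      -- strictly between the two reference rows
  AND n.seqnum < b.hi;
```

With references 1 and 4 this counts one row (reference 3); changing the bounds to 2 and 5 counts two rows (references 3 and 4).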

Calculate percentages of columns in Oracle SQL

I have three columns, all consisting of 1's and 0's. For each of these columns, how can I calculate the percentage of people (one person is one row/ id) who have a 1 in the first column and a 1 in the second or third column in oracle SQL?
For instance:
id marketing_campaign personal_campaign sales
1 1 0 0
2 1 1 0
1 0 1 1
4 0 0 1
So in this case, of all the people who were subjected to a marketing_campaign, 50 percent were subjected to a personal campaign as well, but zero percent is present in sales (no one bought anything).
Ultimately, I want to find out the order in which people get to the sales moment. Do they first go from marketing campaign to a personal campaign and then to sales, or do they buy anyway regardless of these channels.
This is a fictional example, so I realize that in this example there are many other ways to do this, but I hope anyone can help!
The outcome that I'm looking for is something like this:
percentage marketing_campaign/ personal campaign = 50 %
percentage marketing_campaign/sales = 0%
etc (for all the three column combinations)
Use count, sum and case expressions, together with basic arithmetic operators +,/,*
COUNT(*) gives a total count of people in the table
SUM(column) gives a sum of 1 in given column
CASE expressions make it possible to implement more complex conditions
The common pattern is X / COUNT(*) * 100 which is used to calculate a percent of given value ( val / total * 100% )
An example:
SELECT
-- percentage of people that have 1 in marketing_campaign column
SUM( marketing_campaign ) / COUNT(*) * 100 As marketing_campaign_percent,
-- percentage of people that have 1 in sales column
SUM( sales ) / COUNT(*) * 100 As sales_percent,
-- complex condition:
-- percentage of people (one person is one row/ id) who have a 1
-- in the first column and a 1 in the second or third column
COUNT(
CASE WHEN marketing_campaign = 1
AND ( personal_campaign = 1 OR sales = 1 )
THEN 1 END
) / COUNT(*) * 100 As complex_condition_percent
FROM the_table; -- TABLE is a reserved word in Oracle, so substitute your actual table name
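Note that dividing by COUNT(*) gives a percentage of all rows. If the percentage should be taken only among rows that had a marketing campaign, as in the 50 % / 0 % example in the question, the denominator can be restricted too. A minimal sketch, assuming Oracle syntax (FROM dual) and treating each row on its own; the column aliases pct_marketing_personal and pct_marketing_sales are made up.

```sql
-- Sample rows from the question, inlined via dual.
WITH the_table AS (
    SELECT 1 AS id, 1 AS marketing_campaign, 0 AS personal_campaign, 0 AS sales FROM dual
    UNION ALL SELECT 2, 1, 1, 0 FROM dual
    UNION ALL SELECT 1, 0, 1, 1 FROM dual
    UNION ALL SELECT 4, 0, 0, 1 FROM dual
)
SELECT
    -- Rows with both flags set, divided by rows with marketing_campaign = 1.
    100 * SUM(CASE WHEN marketing_campaign = 1 AND personal_campaign = 1 THEN 1 ELSE 0 END)
        / NULLIF(SUM(marketing_campaign), 0) AS pct_marketing_personal,
    100 * SUM(CASE WHEN marketing_campaign = 1 AND sales = 1 THEN 1 ELSE 0 END)
        / NULLIF(SUM(marketing_campaign), 0) AS pct_marketing_sales
FROM the_table;
```

On the sample rows this yields 50 for marketing/personal and 0 for marketing/sales. NULLIF guards against division by zero when no row has a marketing campaign. If a person can span several rows, collapse to one row per id first.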
You can get your percentages like this :
SELECT COUNT(*),
ROUND(100*(SUM(personal_campaign) / sum(count(*)) over ()),2) perc_personal_campaign,
ROUND(100*(SUM(sales) / sum(count(*)) over ()),2) perc_sales
FROM (
SELECT ID,
CASE
WHEN SUM(personal_campaign) > 0 THEN 1
ELSE 0
end AS personal_campaign,
CASE
WHEN SUM(sales) > 0 THEN 1
ELSE 0
end AS sales
FROM the_table
WHERE ID IN
(SELECT ID FROM the_table WHERE marketing_campaign = 1)
GROUP BY ID
)
I have overcomplicated things a bit because your data is still unclear to me. The subquery ensures that all duplicates are cleaned up and that each person has a single 1 or 0 in personal_campaign and sales.
About your second question :
Ultimately, I want to find out the order in which people get to the
sales moment. Do they first go from marketing campaign to a personal
campaign and then to sales, or do they buy anyway regardless of these
channels.
This is impossible with the table in its current state, because it has neither:
a unique row identifier that preserves the order in which the rows were inserted, nor
a timestamp column that records when the rows were inserted.
Without one of these, the order of rows returned from your table is unpredictable or, if you prefer, effectively random.

DB2 SQL filter query result by evaluating an ID which has two types of entries

After many attempts I have failed at this and am hoping someone can help. The query returns every entry a user makes when items are made in the factory against an order number. For example:
Order Number Entry type Quantity
3000 1 1000
3000 1 500
3000 2 300
3000 2 100
4000 2 1000
5000 1 1000
What I want the query to do is filter the results like this:
If the order number has an entry type 1 and 2 return the row which is type 1 only
otherwise just return row whatever the type is for that order number.
So the above would end up:
Order Number Entry type Quantity
3000 1 1000
3000 1 500
4000 2 1000
5000 1 1000
Currently my query (DB2) looks like this, in very basic terms, and it was correct until a change request came through!
Select * from bookings where type=1 or type=2
thanks!
select * from bookings
left outer join (
select order_number,
max(case when type=1 then 1 else 0 end) +
max(case when type=2 then 1 else 0 end) as type_1_and_2
from bookings
group by order_number
) has_1_and_2 on
type_1_and_2 = 2 and
has_1_and_2.order_number = bookings.order_number
where
bookings.type = 1 or
has_1_and_2.order_number is null
Find all the orders that have both type 1 and type 2, and then join it.
If the row matched the join, only return it if it is type 1
If the row did not match the join (has_1_and_2.order_number is null), return it no matter what the type is.
A "common table expression" [CTE] can often simplify your logic. You can think of it as a way to break a complex problem into conceptual steps. In the example below, you can think of g as the name of the result set of the CTE, which will then be joined to the bookings table.
WITH g as
( SELECT order_number, min(type) as low_type
FROM bookings
GROUP BY order_number
)
SELECT b.*
FROM g
JOIN bookings b ON g.order_number = b.order_number
AND g.low_type = b.type
The JOIN ON conditions will work so that if both types are present then low_type will be 1, and only that type of record will be chosen. If there is only one type it will be identical to low_type.
This should work fine as long as 1 and 2 are the only types allowed in the bookings table. If not then you can simply add a WHERE clause in the CTE and in the outer SELECT.
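If other entry types might appear, the rule can also be written directly with NOT EXISTS: keep a row when it is type 1, or when its order has no type 1 row at all. This is a sketch of a different technique from the two answers above, not a replacement for them; the sample data is inlined with a VALUES CTE, which I believe DB2 accepts, and the alias b1 is just illustrative.

```sql
-- Sample bookings from the question, inlined for a self-contained run.
WITH bookings (order_number, type, quantity) AS (
    VALUES (3000, 1, 1000), (3000, 1, 500),
           (3000, 2, 300),  (3000, 2, 100),
           (4000, 2, 1000), (5000, 1, 1000)
)
SELECT b.order_number, b.type, b.quantity
FROM bookings b
WHERE b.type = 1
   -- Otherwise keep the row only if this order has no type 1 entry at all.
   OR NOT EXISTS (
        SELECT 1
        FROM bookings b1
        WHERE b1.order_number = b.order_number
          AND b1.type = 1
      )
ORDER BY b.order_number;
```

On the sample data this returns the two type 1 rows of order 3000, the type 2 row of order 4000 and the type 1 row of order 5000, which matches the desired result in the question.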

mysql SELECT COUNT(*) ... GROUP BY ... not returning rows where the count is zero

SELECT student_id, section, count( * ) as total
FROM raw_data r
WHERE response = 1
GROUP BY student_id, section
There are 4 sections on the test, each with a different number of questions. I want to know, for each student, and each section, how many questions they answered correctly (response=1).
However, with this query, if a student gets no questions right in a given section, that row will be completely missing from my result set. How can I make sure that for every student, 4 rows are ALWAYS returned, even if the "total" for a row is 0?
Here's what my result set looks like:
student_id section total
1 DAP--29 3
1 MEA--16 2
1 NNR--13 1 --> missing the 4th section for student #1
2 DAP--29 1
2 MEA--16 4
2 NNR--13 2 --> missing the 4th section for student #2
3 DAP--29 2
3 MEA--16 3
3 NNR--13 3 --> missing the 4th section for student #3
4 DAP--29 5
4 DAP--30 1
4 MEA--16 1
4 NNR--13 2 --> here, all 4 sections show up because student 4 got at least one question right in each section
Thanks for any insight!
UPDATE: I tried
SELECT student_id, section, if(count( * ) is null, 0, count( * )) as total
and that didn't change the results at all. Other ideas?
UPDATE 2: I got it working thanks to the response below:
SELECT student_id, section, SUM(CASE WHEN response = '1' THEN 1 ELSE 0 END ) AS total
FROM raw_data r
WHERE response = 1
GROUP BY student_id, section
SELECT student_id, section, sum(case when response=1 then 1 else 0 end) as total
FROM raw_data r GROUP BY student_id, section
Note that there's no WHERE condition.
SELECT r.student_id,
r.section,
sum( r.response ) as total
FROM raw_data r
GROUP BY r.student_id, r.section
If you have a separate table with student information, you can select students from that table and left join the raw_data table to it:
SELECT si.student_name, si.student_id, rd.section, COUNT(rd.student_id) AS total
FROM student_info AS si LEFT JOIN raw_data AS rd ON rd.student_id = si.student_id
GROUP BY si.student_name, si.student_id, rd.section
This way, it first selects all students, then executes the count command.
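If raw_data has no row at all for some student/section pair, conditional aggregation alone cannot produce a zero for it, because there is nothing to group. One possible workaround is to build every student/section combination first and left join the correct answers onto it. A sketch, assuming the raw_data(student_id, section, response) table from the question; the aliases st and se are invented.

```sql
SELECT st.student_id,
       se.section,
       -- COUNT of a column ignores NULLs, so unmatched combinations count as 0.
       COUNT(r.student_id) AS total
FROM (SELECT DISTINCT student_id FROM raw_data) AS st
CROSS JOIN (SELECT DISTINCT section FROM raw_data) AS se
LEFT JOIN raw_data AS r
       ON r.student_id = st.student_id
      AND r.section    = se.section
      AND r.response   = 1      -- the filter lives in ON, not WHERE, to keep zero rows
GROUP BY st.student_id, se.section
ORDER BY st.student_id, se.section;
```

Putting response = 1 in a WHERE clause would discard the unmatched combinations again, which is the same problem as in the original query.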