Grouping results based on CASE expression? - sql

I'm working with a table that stores the results of a questionnaire administered to people. Each question and its result is stored as a separate record, as shown below. I've written a CASE expression that creates a simple 1/0 flag depending on people's answers to certain questions. My results look something like this.
PersonID Question Answer Flag
---------------------------------------------
1001 Question 1 yes 1
1001 Question 2 3 0
1001 Question 3 1 or more 1
1234 Question 1 no 0
1234 Question 2 2 0
1234 Question 3 none 0
My issue now is that if a person has even one flagged response, I need to flag their entire questionnaire. I've been looking around for other examples of this issue; this is dealing with almost exactly the same thing, but I don't want to actually group my results, because I still want to be able to see the flags for the individual questions ("oh, this person's questionnaire got flagged, let's see which question it was for and what their response was"). I know it's probably not the most efficient thing, but I'm hoping for results that look like this:
PersonID Question Answer Flag Overall
--------------------------------------------------------
1001 Question 1 yes 1 1
1001 Question 2 3 0 1
1001 Question 3 1 or more 1 1
1234 Question 1 no 0 0
1234 Question 2 2 0 0
1234 Question 3 none 0 0
This is where I'm at with my query. It works fine for flagging the individual questions, but I'm not sure what steps to take in order to flag the whole questionnaire based on the individual answers. What kind of logic/syntax should I be looking at?
SELECT
PersonID,
QuestionDescription as Question,
ResultValue as Answer,
(CASE
WHEN (QuestionDescription LIKE '%ion 1%' AND ResultValue = 'yes') THEN 1
WHEN (QuestionDescription LIKE '%ion 2%' AND ResultValue >= 5) THEN 1
WHEN (QuestionDescription LIKE '%ion 3%' AND ResultValue = '1 or more') THEN 1
ELSE 0
END) as Flag
FROM Questionnaire
ORDER BY PersonID, QuestionDescription

At its simplest, you can add up the flags within a partition for each person and check whether the sum is greater than 0:
WITH x AS
(
    SELECT
        PersonID,
        QuestionDescription as Question,
        ResultValue as Answer,
        CASE
            WHEN (QuestionDescription LIKE '%ion 1%' AND ResultValue = 'yes') THEN 1
            WHEN (QuestionDescription LIKE '%ion 2%' AND ResultValue >= 5) THEN 1
            WHEN (QuestionDescription LIKE '%ion 3%' AND ResultValue = '1 or more') THEN 1
            ELSE 0
        END as Flag
    FROM Questionnaire
)
SELECT
    *,
    CASE WHEN SUM(Flag) OVER(PARTITION BY PersonID) > 0 THEN 1 ELSE 0 END as Overall
FROM
    x
SUM(...) OVER(...) is a bit like doing the following:
WITH x AS (
    --your existing query here
)
SELECT x.*, CASE WHEN y.SumFlag > 0 THEN 1 ELSE 0 END as Overall
FROM
    x
    INNER JOIN
    (SELECT PersonID, SUM(Flag) AS SumFlag FROM x GROUP BY PersonID) y ON x.PersonID = y.PersonID
That is, SUM OVER groups on PersonID, sums the flags, and then automatically joins the result back to every row of the group it was computed from (PersonID). Window functions like this are incredibly powerful and useful.
The latter form (where a separate query groups and is rejoined) also works if you can't use the window function (SUM OVER) approach. It's akin to what datarocker pointed to in their answer.
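To check that the two forms really do agree, here is a minimal sketch that runs both against the sample data from the question. It uses Python's built-in sqlite3 module rather than SQL Server (window functions need SQLite 3.25 or later), and the table name flagged is my own assumption: it stands in for the output of the existing CASE query. I've used MAX(Flag) instead of SUM(Flag) > 0, which gives the same 1/0 result when Flag is only ever 0 or 1.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# "flagged" is a hypothetical table holding the per-question flags
# that the CASE expression in the question already produces.
conn.execute(
    "CREATE TABLE flagged (PersonID INTEGER, Question TEXT, Answer TEXT, Flag INTEGER)"
)
conn.executemany(
    "INSERT INTO flagged VALUES (?, ?, ?, ?)",
    [
        (1001, "Question 1", "yes", 1),
        (1001, "Question 2", "3", 0),
        (1001, "Question 3", "1 or more", 1),
        (1234, "Question 1", "no", 0),
        (1234, "Question 2", "2", 0),
        (1234, "Question 3", "none", 0),
    ],
)

# Window form: every row keeps its own Flag and gains the per-person maximum.
window_sql = """
SELECT PersonID, Question, Answer, Flag,
       MAX(Flag) OVER (PARTITION BY PersonID) AS Overall
FROM flagged
ORDER BY PersonID, Question
"""

# Join-back form: group once per person, then rejoin to the detail rows.
join_sql = """
SELECT f.PersonID, f.Question, f.Answer, f.Flag, g.Overall
FROM flagged f
INNER JOIN (SELECT PersonID, MAX(Flag) AS Overall
            FROM flagged
            GROUP BY PersonID) g
        ON g.PersonID = f.PersonID
ORDER BY f.PersonID, f.Question
"""

window_rows = conn.execute(window_sql).fetchall()
join_rows = conn.execute(join_sql).fetchall()

for row in window_rows:
    print(row)
```

Both queries return the six detail rows, with Overall set to 1 on every row for person 1001 and 0 on every row for person 1234.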

Related

SQL Conditional Aggregation not returning all expected rows

So I've been trying to get a conditional aggregation running on one of my tables in SQL Server Management Studio and I've run across a problem: only one row is being returned when there should be 2.
SELECT ListID,
MAX(CASE WHEN QuestionName = 'Probability Value' THEN Answer END) AS 'prob',
MAX(CASE WHEN QuestionName = 'Impact Value' THEN Answer END) As 'impa',
MAX(CASE WHEN QuestionName = 'What is the Risk Response Strategy' THEN Answer END) AS 'strat',
MAX(CASE WHEN QuestionName = 'Response Comment' THEN Answer END) AS 'rrap'
FROM table1
GROUP BY ListID
Based on the information stored in the table, it should return two rows, something like:
ListID | Prob | Impa | Strat | rrap |
1 2 3 Admin text1
1 5 5 Elim text2
but only the first row appears. I don't have any good leads at the moment, but I wonder if you good people might have spotted something obviously wrong with the initial query.
Your only GROUP BY column is ListID, and both of your rows have ListID 1, which is why they collapse into a single row.
Why do you think it should return more than 1 row? You are grouping by ListID and getting the MAX answer for all these questions.
If you want more rows returned you will have to group by other columns/expressions as well. You can't expect ListID 1 to appear more than once if you grouped by ListID only.
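As a rough illustration of that point, the sketch below pivots the same rows twice: once grouped by ListID only, and once grouped by ListID plus a second key. The question doesn't show which column distinguishes the two entries, so EntryID here is purely hypothetical, as is the sample data layout. It runs on Python's built-in sqlite3 module rather than SQL Server.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# EntryID is a hypothetical column that tells the two sets of answers apart.
conn.execute(
    "CREATE TABLE table1 (ListID INTEGER, EntryID INTEGER, QuestionName TEXT, Answer TEXT)"
)
conn.executemany(
    "INSERT INTO table1 VALUES (?, ?, ?, ?)",
    [
        (1, 1, "Probability Value", "2"),
        (1, 1, "Impact Value", "3"),
        (1, 1, "What is the Risk Response Strategy", "Admin"),
        (1, 1, "Response Comment", "text1"),
        (1, 2, "Probability Value", "5"),
        (1, 2, "Impact Value", "5"),
        (1, 2, "What is the Risk Response Strategy", "Elim"),
        (1, 2, "Response Comment", "text2"),
    ],
)


def pivot(group_by):
    """Run the conditional aggregation grouped by the given column list."""
    sql = f"""
    SELECT {group_by},
           MAX(CASE WHEN QuestionName = 'Probability Value' THEN Answer END) AS prob,
           MAX(CASE WHEN QuestionName = 'Impact Value' THEN Answer END) AS impa,
           MAX(CASE WHEN QuestionName = 'What is the Risk Response Strategy' THEN Answer END) AS strat,
           MAX(CASE WHEN QuestionName = 'Response Comment' THEN Answer END) AS rrap
    FROM table1
    GROUP BY {group_by}
    ORDER BY {group_by}
    """
    return conn.execute(sql).fetchall()


one_row = pivot("ListID")            # one group, so one row
two_rows = pivot("ListID, EntryID")  # one row per entry

print(one_row)
print(two_rows)
```

Grouping by ListID alone yields a single row built from the per-column maxima, while adding the second key yields one row per entry.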

How to identify subsequent user actions based on prior visits

I want to identify the users who visited section a and then subsequently visited b. Given the following data structure. The table contains 300,000 rows and updates daily with approx. 8,000 rows:
USERID  VISITID  SECTION  Conversion (desired solution)
1       1        a        0
1       2        a        0
2       1        b        0
2       1        b        0
2       1        b        0
1       3        b        1
Ideally I want a new column that flags the visit to section b. For example, on the third visit User 1 visited section b for the first time. I was attempting to do this using a CASE WHEN statement, but after many failed attempts I am not sure it is even possible with CASE WHEN, and I feel that I should take a different approach. I am just not sure what that approach should be. I also have a date column at my disposal.
Any suggestions on a new way to approach the problem would be appreciated. Thanks!
Correlated sub-queries should be avoided at all costs when working with Redshift. Keep in mind that Redshift has no indexes, so you'd have to rescan and restitch the column data back together for each value in the parent, resulting in an O(n^2) operation (in this particular case going from 300 thousand values scanned to 90 billion).
The best approach when you are looking to span a series of rows is to use an analytic function. There are a couple of options depending on how your data is structured but in the simplest case, you could use something like
select case
         when section != lag(section) over (partition by userid order by visitid)
         then 1
         else 0
       end
from ...
This assumes that your data for userid 2 increments the visitid as below. If not, you could also order by your timestamp column
USERID  VISITID  SECTION  Conversion (desired solution)
1       1        a        0
1       2        a        0
2       1        b        0
2       *2*      b        0
2       *3*      b        0
1       3        b        1
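Here is a small sketch of the LAG approach run against the table above, using Python's built-in sqlite3 module in place of Redshift (window functions need SQLite 3.25 or later). The table name visits is my own placeholder, and visitid is assumed to increment per user as shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# "visits" is a placeholder table name; rows match the adjusted sample data.
conn.execute("CREATE TABLE visits (userid INTEGER, visitid INTEGER, section TEXT)")
conn.executemany(
    "INSERT INTO visits VALUES (?, ?, ?)",
    [(1, 1, "a"), (1, 2, "a"), (2, 1, "b"), (2, 2, "b"), (2, 3, "b"), (1, 3, "b")],
)

# LAG returns the previous row's section within the same user.
# For a user's first visit LAG is NULL, the comparison is NULL, and ELSE 0 applies.
lag_sql = """
SELECT userid, visitid, section,
       CASE
         WHEN section <> LAG(section) OVER (PARTITION BY userid ORDER BY visitid)
         THEN 1
         ELSE 0
       END AS conversion
FROM visits
ORDER BY userid, visitid
"""

lag_rows = conn.execute(lag_sql).fetchall()
for row in lag_rows:
    print(row)
```

Only user 1's third visit comes back with conversion = 1. Note that this expression flags any change of section, including a move from b back to a; if only a followed by b should count, the condition could be tightened to section = 'b' and the lagged value = 'a'.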
select t.*, case when v.ts is null then 0 else 1 end as conversion
from tbl t
left join (select *
           from tbl x
           where section = 'b'
             and exists (select 1
                         from tbl y
                         where y.userid = x.userid
                           and y.section = 'a'
                           and y.ts < x.ts)) v
  on t.userid = v.userid
 and t.visitid = v.visitid
 and t.section = v.section
Fiddle:
http://sqlfiddle.com/#!15/5b954/5/0
I added sample timestamp data as that field is necessary to determine whether a comes before b or after b.
To incorporate analytic functions you could use the following. I've also made it so that only the first occurrence of b (after an a) gets flagged with 1:
select t.*,
       case
         when v.first_b_after_a is not null
         then 1
         else 0
       end as conversion
from tbl t
left join (select userid, min(ts) as first_b_after_a
           from (select t.*,
                        sum(case when t.section = 'a' then 1 end)
                          over (partition by userid
                                order by ts) as a_sum
                 from tbl t) x
           where section = 'b'
             and a_sum is not null
           group by userid) v
  on t.userid = v.userid
 and t.ts = v.first_b_after_a
Fiddle: http://sqlfiddle.com/#!1/fa88f/2/0

Group and Separate into Different Columns

I'm trying to group some data and separate one column into several. The following is the kind of table I'm working with, although each of these columns comes from a separate table, and the tables are connected through an ID on each:
ParticipantId QuestionText QuestionAnswer
1 What is your gender? 2
2 What is your gender? 1
3 What is your gender? 1
4 What is your gender? 2
5 What is your gender? 1
1 What is your age? 28
2 What is your age? NULL
3 What is your age? 55
4 What is your age? 63
And this is what I want to achieve:
ParticipantId Question1Answer Question2Answer Question3Answer
1 2 28 3
2 1 NULL 4
I imagine this is quite difficult to do, as the questionnaire contains around 100 questions. I don't think using CASE would be suitable without typing each QuestionID out. I'm using SQL Server 2008. The following describes some of the table structures I'm working with; I'm sure there's a clearer way to show it than typing it out.
The QuestionnaireQuestion table contains QuestionNumber for the sequence and joins to the Question table via QuestionID, which is the Question table's PID. The Question table contains QuestionText and links to the Answer table, which contains the answer field, using QuestionID. The Answer table then goes through a link table called QuestionnaireInstance, which finally links to the PaperQuestionnaire table, which contains the ParticipantID.
That probably hasn't made it any clearer; just let me know if anything else would help clear it up.
In case you don't want to have to type out all of the question text each time, you could always use this:
;with sample_data as
(
    SELECT
        ParticipantId
        ,QuestionText
        ,QuestionAnswer
        ,row_number() OVER (PARTITION BY ParticipantId ORDER BY (SELECT NULL)) AS rn
    FROM yourdatatable
)
SELECT
    ParticipantId
    ,MAX(CASE WHEN rn = 1 THEN QuestionAnswer END) AS Q1
    ,MAX(CASE WHEN rn = 2 THEN QuestionAnswer END) AS Q2
    ,MAX(CASE WHEN rn = 3 THEN QuestionAnswer END) AS Q3
    ,MAX(CASE WHEN rn = 4 THEN QuestionAnswer END) AS Q4
FROM sample_data
GROUP BY ParticipantId
Although it might be better in your case to consider dynamic pivoting instead, depending on how many columns you want to ultimately end up with
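One way to picture dynamic pivoting is to let client code build the MAX(CASE ...) column list from whatever questions exist, instead of typing out 100 of them. The sketch below does that with Python's built-in sqlite3 module, not SQL Server 2008, so treat it as an illustration of the idea only. It assumes the joins have already been flattened into a single table, here called answers (a made-up name), keyed by the QuestionNumber mentioned in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# "answers" is a hypothetical, already-joined table keyed by QuestionNumber.
conn.execute(
    "CREATE TABLE answers (ParticipantId INTEGER, QuestionNumber INTEGER, QuestionAnswer INTEGER)"
)
conn.executemany(
    "INSERT INTO answers VALUES (?, ?, ?)",
    [
        (1, 1, 2), (2, 1, 1), (3, 1, 1), (4, 1, 2), (5, 1, 1),
        (1, 2, 28), (2, 2, None), (3, 2, 55), (4, 2, 63),
    ],
)

# Discover the question numbers present in the data.
question_numbers = [
    row[0]
    for row in conn.execute(
        "SELECT DISTINCT QuestionNumber FROM answers ORDER BY QuestionNumber"
    )
]

# Build one MAX(CASE ...) column per question; int() keeps the generated SQL numeric-only.
columns = ",\n       ".join(
    f"MAX(CASE WHEN QuestionNumber = {int(n)} THEN QuestionAnswer END) AS Question{int(n)}Answer"
    for n in question_numbers
)
pivot_sql = (
    f"SELECT ParticipantId,\n       {columns}\n"
    "FROM answers\nGROUP BY ParticipantId\nORDER BY ParticipantId"
)

pivot_rows = conn.execute(pivot_sql).fetchall()
print(pivot_sql)
for row in pivot_rows:
    print(row)
```

Keying the columns on QuestionNumber also sidesteps the arbitrary ordering that ORDER BY (SELECT NULL) gives row_number(), so each output column always corresponds to the same question.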
If the combination of ParticipantId and QuestionText is unique in your table, you can also use the query below to achieve the desired output:
SELECT Participantid,
MAX(CASE
WHEN Questiontext = 'What is your gender?' THEN
Questionanswer
ELSE
NULL
END) AS Question1answer,
MAX(CASE
WHEN Questiontext = 'What is your age?' THEN
Questionanswer
ELSE
NULL
END) AS Question2answer,
MAX(CASE
WHEN Questiontext = '...your third question...' THEN
Questionanswer
ELSE
NULL
END) AS Question3answer,
..
..
FROM Your_Table_Name
GROUP BY Participantid

Exclude value of a record in a group if another is present v2

In the example table below, I'm trying to figure out a way to sum amount over marks in two situations: the first, when mark 'C' exists within a single id, and the second, when mark 'C' doesn't exist within an id (see id 1 or 2). In the first situation, I want to exclude the amount against mark 'A' within that id (see id 3 in the desired conversion table below). In the second situation, I want to perform no exclusion and take a simple sum of the amounts against the marks.
In other words, for id's containing both mark 'A' and 'C', I want to make the amount against 'A' as zero. For id's that do not contain mark 'C' but contain mark 'A', keep the original amount against mark 'A'.
My desired output is at the bottom. I've considered trying to partition over id or use the EXISTS command, but I'm having trouble conceptualizing the solution. If any of you could take a look and point me in the right direction, it would be greatly appreciated :)
example table:
id mark amount
------------------
1 A 1
2 A 3
2 B 2
3 A 1
3 C 3
desired conversion:
id mark amount
------------------
1 A 1
2 A 3
2 B 2
3 A 0
3 C 3
desired output:
mark sum(amount)
--------------------
A 4
B 2
C 3
You could slightly modify my previous answer and end up with this:
SELECT
mark,
sum(amount) AS sum_amount
FROM atable t
WHERE mark <> 'A'
OR NOT EXISTS (
SELECT *
FROM atable
WHERE id = t.id
AND mark = 'C'
)
GROUP BY
mark
;
There's a live demo at SQL Fiddle.
Try:
select
    mark,
    sum(amount) as sum_amount
from ( select
           id,
           mark,
           case
               when (mark = 'A' and id in (select id from atable where mark = 'C')) then 0
               else amount
           end as amount
       from atable ) t1
group by mark
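Since the question mentions partitioning over id, here is a hedged sketch of that variant as well: a windowed MAX marks every row of an id that contains a 'C', and the outer query zeroes out 'A' on those ids. It runs the example table through Python's built-in sqlite3 module (SQLite 3.25 or later for window functions); the table name atable follows the first answer, and has_c is a column name I made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE atable (id INTEGER, mark TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO atable VALUES (?, ?, ?)",
    [(1, "A", 1), (2, "A", 3), (2, "B", 2), (3, "A", 1), (3, "C", 3)],
)

# has_c (hypothetical name) is 1 on every row of an id that has a 'C' mark.
partition_sql = """
SELECT mark,
       SUM(CASE WHEN mark = 'A' AND has_c = 1 THEN 0 ELSE amount END) AS sum_amount
FROM (
    SELECT id, mark, amount,
           MAX(CASE WHEN mark = 'C' THEN 1 ELSE 0 END)
               OVER (PARTITION BY id) AS has_c
    FROM atable
) AS t
GROUP BY mark
ORDER BY mark
"""

partition_rows = conn.execute(partition_sql).fetchall()
for row in partition_rows:
    print(row)
```

On the example table this gives A = 4, B = 2 and C = 3, matching the desired output in the question.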

In SQL, how can I count the number of values in a column and then pivot it so the column becomes the row?

I have a survey database with one column for each question and one row for each person who responds. Each question is answered with a value from 1 to 3.
Id Quality? Speed?
-- ------- -----
1 3 1
2 2 1
3 2 3
4 3 2
Now, I need to display the results as one row per question, with a column for each response number, and the value in each column being the number of responses that used that answer. Finally, I need to calculate the total score, which is the number of 1's plus two times the number of 2's plus three times the number of threes.
Question 1 2 3 Total
-------- -- -- -- -----
Quality? 0 2 2 10
Speed? 2 1 1 7
Is there a way to do this in set-based SQL? I know how to do it using loops in C# or cursors in SQL, but I'm trying to make it work in a reporting tool that doesn't support cursors.
This will give you what you're asking for:
SELECT
'quality' AS question,
SUM(CASE WHEN quality = 1 THEN 1 ELSE 0 END) AS [1],
SUM(CASE WHEN quality = 2 THEN 1 ELSE 0 END) AS [2],
SUM(CASE WHEN quality = 3 THEN 1 ELSE 0 END) AS [3],
SUM(quality) AS Total
FROM
dbo.Answers
UNION ALL
SELECT
'speed' AS question,
SUM(CASE WHEN speed = 1 THEN 1 ELSE 0 END) AS [1],
SUM(CASE WHEN speed = 2 THEN 1 ELSE 0 END) AS [2],
SUM(CASE WHEN speed = 3 THEN 1 ELSE 0 END) AS [3],
SUM(speed) AS Total
FROM
dbo.Answers
Keep in mind, though, that this will quickly balloon as you add questions or even potential answers. You might be much better off if you normalized a bit and had an Answers table with one row per answer, identified by a question code or id, instead of putting the questions across as columns in one table. It starts to look a little bit like the entity-attribute-value (EAV) design, but I think that it's different enough to be useful here.
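To make the normalization suggestion concrete, here is a minimal sketch using Python's built-in sqlite3 module rather than SQL Server. The table name answers_long and its columns are assumptions of mine; the data is the four survey rows from the question, stored as one row per respondent and question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# answers_long is a hypothetical normalized table: one row per respondent per question.
conn.execute(
    "CREATE TABLE answers_long (respondent_id INTEGER, question TEXT, value INTEGER)"
)

# The wide rows from the question: (Id, Quality?, Speed?)
wide_rows = [(1, 3, 1), (2, 2, 1), (3, 2, 3), (4, 3, 2)]
for respondent_id, quality, speed in wide_rows:
    conn.execute(
        "INSERT INTO answers_long VALUES (?, 'Quality?', ?)", (respondent_id, quality)
    )
    conn.execute(
        "INSERT INTO answers_long VALUES (?, 'Speed?', ?)", (respondent_id, speed)
    )

# With one row per answer, a single GROUP BY covers every question.
summary_sql = """
SELECT question,
       SUM(CASE WHEN value = 1 THEN 1 ELSE 0 END) AS ones,
       SUM(CASE WHEN value = 2 THEN 1 ELSE 0 END) AS twos,
       SUM(CASE WHEN value = 3 THEN 1 ELSE 0 END) AS threes,
       SUM(value) AS total
FROM answers_long
GROUP BY question
ORDER BY question
"""

summary_rows = conn.execute(summary_sql).fetchall()
for row in summary_rows:
    print(row)
```

Adding a new question then means inserting rows, not adding another SELECT to the UNION ALL. SUM(value) equals the weighted total here because each answer's weight is its own value.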
You can also leverage SQL Server 2005's PIVOT and UNPIVOT operators to achieve what you want. This way you don't need to hard-code a separate SELECT for each question as you do in cross-tabulation; the question columns are listed only once, in the UNPIVOT clause. Note that I called the source table "mytable" and I used common table expressions for readability, but you could also use subqueries.
WITH unpivoted AS (
SELECT id, value, question
FROM mytable a
UNPIVOT (value FOR question IN (quality,speed) ) p
)
,counts AS (
SELECT question, value, count(*) AS counts
FROM unpivoted
GROUP BY question, value
)
, repivoted AS (
SELECT question, counts, [1], [2], [3]
FROM counts
PIVOT (count(value) FOR value IN ([1],[2],[3])) p
)
SELECT question, sum(counts*[1]) AS [1], sum(counts*[2]) AS [2], sum(counts*[3]) AS [3]
,sum(counts*[1]) + 2*sum(counts*[2]) + 3*sum(counts*[3]) AS Total
FROM repivoted
GROUP BY question
Note that if you don't want the breakdown, the query is simpler:
WITH unpivoted AS (
SELECT id, value, question
FROM mytable a
UNPIVOT (value FOR question IN (quality,speed) ) p
)
, totals AS (
SELECT question, value, count(value)*value AS score
FROM unpivoted
GROUP BY question, value
)
SELECT question, sum(score) AS score
FROM totals
GROUP BY question