PostgreSQL - Only different values in fields - sql

I have a table called questions containing a single column filled with Exam question IDs. Let's say:
Q1
Q2
Q3
...
Qn
Now I'd like to pick all the combinations of three questions, something like:
1, 2, 3
...
2, 5, 6
4, 7, 1
...
9, 6, 8
And select a subset of them made of rows that have globally unique values only. In the previous case:
1, 2, 3
9, 6, 8
Because the other two records contain 2 and 1, which both already appear in the (1, 2, 3) record.
How can this be achieved in SQL? The purpose is to create, let's say, 8 exams made of questions that are all different from each other.

Let's say the questions table contains:
Table : questions
Qid
1
2
3
.
n
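Now if you want to select only 8 distinct randomized subsets (of three questions each), then:
SELECT fo, STRING_AGG(qid::text, ',')
FROM (SELECT qid, (ROW_NUMBER() OVER () - 1) / 3 AS fo -- number the picks 0..23, three per group
FROM (SELECT qid FROM questions
ORDER BY RANDOM()
LIMIT 8 * 3) AS picked
) AS numbered
GROUP BY fo;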

A trivial approach is the following (the cross joins will be slow; for 100 questions it may take half a minute). Note that it only guarantees distinct questions within each row, not across rows:
SELECT Q1.ID AS Q1, Q2.ID AS Q2, Q3.ID AS Q3
FROM Questions AS Q1, Questions AS Q2, Questions AS Q3
WHERE Q1.ID <> Q2.ID AND Q1.ID <> Q3.ID AND Q2.ID <> Q3.ID
ORDER BY RANDOM() LIMIT 8
However, there are more clever ways to do it - this answer is for MS SQL Server but can be adapted to PostgreSQL

Give your questions numbers: 1, 2, 3, 4, 5 ... n. Then subtract 1 and divide by 3, discarding the remainder: 0, 0, 0, 1, 1, 1, ... to get groups of three. It's up to you how to number the questions, e.g. by ID (least ID is record #1, next ID is record #2, ...) or randomly. Here is an example with random numbering:
select *, (row_number() over (order by random()) - 1 ) / 3 as grp
from questions
order by grp;
Keep the result as is or pivot it to get one row per grp with three columns instead, e.g.
select
max(case when rn % 3 = 0 then q end) as q1,
max(case when rn % 3 = 1 then q end) as q2,
max(case when rn % 3 = 2 then q end) as q3
from
(
select *, row_number() over (order by random()) - 1 as rn
from questions
) numbered
group by rn / 3
order by rn / 3;
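The numbering-and-dividing idea can also be checked outside the database. The following is only a minimal Python sketch of the same logic, not a replacement for the queries above; the function name `random_exams` and its parameters are made up for illustration.

```python
import random

def random_exams(question_ids, group_size=3, n_groups=8, seed=None):
    """Shuffle the IDs, then cut the shuffled list into consecutive groups.

    Position rn in the shuffled list plays the role of row_number() - 1,
    and rn // group_size is the group number (grp).
    """
    rng = random.Random(seed)
    shuffled = list(question_ids)
    rng.shuffle(shuffled)
    exams = []
    for grp in range(n_groups):
        chunk = shuffled[grp * group_size:(grp + 1) * group_size]
        if len(chunk) < group_size:
            break  # not enough questions left for a full group
        exams.append(chunk)
    return exams

if __name__ == "__main__":
    for exam in random_exams(range(1, 31)):
        print(exam)
```

Because every question occupies exactly one position in the shuffled list, no question can appear in two groups.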

Related

Progressive Select Query in Oracle Database

I want to write a select query that selects distinct rows of data progressively.
Explaining with an example,
Say I have 5000 accounts selected for repayment of a loan, and these accounts are ordered in descending order (the 1st account has the highest outstanding amount while the 5000th has the lowest).
I want to select 1000 unique accounts 5 times such that the total outstanding amount of repayment in all 5 cases is similar.
I have tried a few methods, such as selecting rownums based on odd/even or similar schemes, but they are only good for up to 2 distributions. I was expecting something more like an A.P. as in maths that selects data progressively.
A naïve method of splitting sets into (for example) 5 bins, numbered 0 to 4, is to give each row a unique sequential numeric index and then, in order of size, assign the first 10 rows to bins 0,1,2,3,4,4,3,2,1,0 and then repeat for additional sets of 10 rows:
WITH indexed_values (value, rn) AS (
SELECT value,
ROW_NUMBER() OVER (ORDER BY value DESC) - 1
FROM table_name
),
assign_bins (value, rn, bin) AS (
SELECT value,
rn,
CASE WHEN MOD(rn, 2 * 5) >= 5
THEN 5 - MOD(rn, 5) - 1
ELSE MOD(rn, 5)
END
FROM indexed_values
)
SELECT bin,
COUNT(*) AS num_values,
SUM(value) AS bin_size
FROM assign_bins
GROUP BY bin
Which, for some random data:
CREATE TABLE table_name ( value ) AS
SELECT FLOOR(DBMS_RANDOM.VALUE(1, 1000001)) FROM DUAL CONNECT BY LEVEL <= 1000;
May output:
BIN  NUM_VALUES  BIN_SIZE
0    200         100012502
1    200         100004633
2    200         99980342
3    200         99976774
4    200         100005756
It will not get the bins to have equal values but it is relatively simple and will get a close approximation if your values are approximately evenly distributed.
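To see why the back-and-forth assignment balances the bins, here is a small Python sketch of the same CASE/MOD logic. The names `serpentine_bin` and `bin_totals` are invented for this illustration and do not come from the query.

```python
def serpentine_bin(rn, n_bins=5):
    """Map a 0-based row index to a bin: 0,1,2,3,4,4,3,2,1,0 and repeat."""
    pos = rn % (2 * n_bins)
    if pos >= n_bins:
        return 2 * n_bins - 1 - pos  # walk back down on the second half
    return pos

def bin_totals(values, n_bins=5):
    """Sort values in descending order and sum them per serpentine bin."""
    totals = [0] * n_bins
    for rn, value in enumerate(sorted(values, reverse=True)):
        totals[serpentine_bin(rn, n_bins)] += value
    return totals

if __name__ == "__main__":
    print([serpentine_bin(rn) for rn in range(10)])
    print(bin_totals(range(1, 21)))
```

For perfectly evenly spaced values such as 1 to 20, every bin ends up with the same total; with real data the totals are only approximately equal.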
If you want to select values from a certain bin then:
WITH indexed_values (value, rn) AS (
SELECT value,
ROW_NUMBER() OVER (ORDER BY value DESC) - 1
FROM table_name
),
assign_bins (value, rn, bin) AS (
SELECT value,
rn,
CASE WHEN MOD(rn, 2 * 5) >= 5
THEN 5 - MOD(rn, 5) - 1
ELSE MOD(rn, 5)
END
FROM indexed_values
)
SELECT value
FROM assign_bins
WHERE bin = 0

SQL - Returning unique row based on criteria and a priority

I have a data table that looks in practice like this:
Team Shirt Number Name
1 1 Seaman
1 13 Lucas
2 1 Bosnic
2 14 Schmidt
2 23 Woods
3 13 Tubilandu
3 14 Lev
3 15 Martin
I want to remove duplicates of team by the following logic - if there is a "1" shirt number, use that. If not, look for a 13. If not look for 14 then any.
I realise it is probably quite basic but I don't seem to be making any progress with case statements. I know it's something with sub-queries and case statements but I'm struggling and any help gratefully received!
Using SSMS.
Since you didn't specify a DBMS, let me assume row_number() works for that:
DELETE t
FROM (SELECT t.*,
ROW_NUMBER() OVER (PARTITION BY Team
ORDER BY (CASE WHEN Shirt_Number = 1
THEN 1
WHEN Shirt_Number = 13
THEN 2
WHEN Shirt_Number = 14
THEN 3
ELSE 4
END)
) AS Seq
FROM table t
) t
WHERE Seq > 1; -- Seq = 1 is the row to keep for each team
This assumes a team may have shirt numbers between the preferred ones (for example between 1 and 13); otherwise ordering by Shirt_Number alone is enough.
I think you are looking for the PARTITION BY clause. The solution below worked in SQL Server. Note that it only returns teams that have a shirt number 1, 13 or 14:
create table #eray
(team int, shirtnumber int, name varchar(200))
insert into #eray values
(1, 1, 'Seaman'),
(1, 13, 'Lucas'),
(2, 1, 'Bosnic'),
(2, 14, 'Schmidt')
;with cte as (
Select Team, ShirtNumber, Name,
ROW_NUMBER() OVER (PARTITION BY Team ORDER BY ShirtNumber ASC) AS rn
From #eray
where ShirtNumber in (1,13,14)
)
select * from cte where rn=1
If you have a table of teams, you can use cross apply:
select ts.*
from teams t cross apply
(select top (1) ts.*
from teamshirts ts
where ts.team = t.team
order by (case shirt_number when 1 then 1 when 13 then 2 when 14 then 3 else 4 end)
) ts;
If you have no numbers between 2 and 12, you can simplify this to:
select ts.*
from teams t cross apply
(select top (1) ts.*
from teamshirts ts
where ts.team = t.team
order by shirt_number
) ts;
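The priority rule behind these answers (1 first, then 13, then 14, then anything else) can be sketched in a few lines of Python. This is only an illustration of the ranking logic using the sample rows from the question; `PRIORITY` and `pick_per_team` are hypothetical names.

```python
# Lower rank wins; any shirt number not listed here gets rank 4.
PRIORITY = {1: 1, 13: 2, 14: 3}

ROWS = [
    (1, 1, "Seaman"), (1, 13, "Lucas"),
    (2, 1, "Bosnic"), (2, 14, "Schmidt"), (2, 23, "Woods"),
    (3, 13, "Tubilandu"), (3, 14, "Lev"), (3, 15, "Martin"),
]

def pick_per_team(rows):
    """Keep one (shirt, name) per team, chosen by the priority ranking."""
    best = {}
    for team, shirt, name in rows:
        rank = PRIORITY.get(shirt, 4)
        if team not in best or rank < best[team][0]:
            best[team] = (rank, shirt, name)
    return {team: (shirt, name) for team, (_, shirt, name) in best.items()}

if __name__ == "__main__":
    print(pick_per_team(ROWS))
```

When several rows tie on rank 4, this sketch keeps the first one it sees, which mirrors the "then any" part of the requirement.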

How do I filter out data that has duplicate values under a specific column?

I have a table that holds the results of a survey:
submitter issue q1 q2 q3 q4 q5
mike 11557 4 3 4 5 1
mark 13554 5 5 5 5 5
luke 15110 1 1 1 1 1
luke 15110 1 1 1 1 1
donald 16900 4 2 2 4 5
joe 11562 5 5 5 5 5
joe 11562 5 5 5 5 5
sam 12485 2 3 4 3 4
sam 12485 2 3 4 3 4
sam 12485 2 3 4 3 4
I want to be able to filter out multiple submissions and count only 1 of them.
Some folks submitted 3 or 4 times.
I know how to find out how many times a survey was submitted and by whom:
SELECT
submitter
,issue
,COUNT(*) as '# of times Survey submitted'
FROM
Survey
GROUP BY
submitter, issue
HAVING
COUNT(*) > 1
But, I'm not sure how I can use this query to filter out the multiple submissions.
The current query I am working with is:
SELECT 'Question #1' as 'Survey Question'
,CAST(CAST(SUM(q1) AS float)/COUNT(q1) AS decimal (4,2)) as 'Average Score'
FROM Survey
WHERE COALESCE(q1,q2,q3,q4,q5) IS NOT NULL
UNION ALL
SELECT 'Question #2' as 'Survey Question'
,CAST(CAST(SUM(q2) AS float)/COUNT(q2) AS decimal (4,2)) as 'Average Score'
FROM Survey
WHERE COALESCE(q1,q2,q3,q4,q5) IS NOT NULL
UNION ALL
etc...
The desired outcome is: (Note: this result set is not accurate. It just shows the format I would like to have.)
Survey Question Average Score
Question #1 4.58
Question #2 4.80
Question #3 4.60
Question #4 4.59
Question #5 4.64
Can anyone provide a clue?
Thanks so much!
I think I got the math right, but my results don't match yours exactly. Are you sure your desired results are correct?
DECLARE @yourTable TABLE (submitter VARCHAR(10), Issue INT, q1 TINYINT, q2 TINYINT,q3 TINYINT, q4 TINYINT,q5 TINYINT);
INSERT INTO @yourTable
VALUES ('mike',11557,4,3,4,5,1),
('mark',13554,5,5,5,5,5),
('luke',15110,1,1,1,1,1),
('luke',15110,1,1,1,1,1),
('donald',16900,4,2,2,4,5),
('joe',11562,5,5,5,5,5),
('joe',11562,5,5,5,5,5),
('sam',12485,2,3,4,3,4),
('sam',12485,2,3,4,3,4),
('sam',12485,2,3,4,3,4);
WITH CTE_Distinct
AS
(
SELECT DISTINCT *
FROM @yourTable --just change this to your actual table name.
)
SELECT REPLACE(question,'q','Question #') AS [Survey Question],
CAST(AVG(val * 1.0) AS DECIMAL(4,2)) AS [Average Score]
FROM CTE_Distinct
UNPIVOT
(
val FOR question IN (q1,q2,q3,q4,q5)
) unpvt
GROUP BY question
Results:
Survey Question Average Score
-------------------- ---------------------------------------
Question #1 3.50
Question #2 3.17
Question #3 3.50
Question #4 3.83
Question #5 3.50
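The arithmetic behind these results can be checked by hand or with a short script. The Python sketch below is only a cross-check of the "deduplicate, then average" idea on the sample data from the question; `SURVEY` and `average_scores` are names made up for the example.

```python
SURVEY = [
    ("mike", 11557, 4, 3, 4, 5, 1),
    ("mark", 13554, 5, 5, 5, 5, 5),
    ("luke", 15110, 1, 1, 1, 1, 1),
    ("luke", 15110, 1, 1, 1, 1, 1),
    ("donald", 16900, 4, 2, 2, 4, 5),
    ("joe", 11562, 5, 5, 5, 5, 5),
    ("joe", 11562, 5, 5, 5, 5, 5),
    ("sam", 12485, 2, 3, 4, 3, 4),
    ("sam", 12485, 2, 3, 4, 3, 4),
    ("sam", 12485, 2, 3, 4, 3, 4),
]

def average_scores(rows):
    """Drop exact duplicate rows (like SELECT DISTINCT *), then average q1..q5."""
    distinct = set(rows)
    n_questions = 5
    averages = []
    for i in range(n_questions):
        scores = [row[2 + i] for row in distinct]
        averages.append(round(sum(scores) / len(scores), 2))
    return averages

if __name__ == "__main__":
    for number, avg in enumerate(average_scores(SURVEY), start=1):
        print(f"Question #{number}: {avg:.2f}")
```

Six distinct submissions remain after deduplication, so each average is a sum divided by 6.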
WITH TestData AS (
SELECT *
FROM (VALUES
('Mike', 11557, 4, 3, 4, 5, 1)
, ('Mark', 13554, 5, 3, 5, 5, 5)
, ('Luke', 15110, 1, 1, 1, 1, 1)
, ('Luke', 15110, 1, 1, 1, 1, 1)
, ('Donald', 16900, 4, 2, 2, 4, 5)
, ('Joe', 11562, 5, 5, 5, 5, 5)
, ('Joe', 11562, 5, 5, 5, 5, 5)
, ('Sam', 12485, 2, 3, 4, 3, 4)
, ('Sam', 12485, 2, 3, 4, 3, 4)
, ('Sam', 12485, 2, 3, 4, 3, 4)
) A (Submitter, Issue, Q1, Q2, Q3, Q4, Q5)
)
SELECT SurveyQuestion
, AverageScore = AVG(QuestionAnswer * 1.) -- Change the math here if this isn't what you want
FROM (
SELECT A.Submitter
, A.Issue
, B.SurveyQuestion
, B.QuestionAnswer
, RowNum = ROW_NUMBER() OVER(PARTITION BY A.Submitter, A.Issue, B.SurveyQuestion ORDER BY (SELECT NULL)) -- Replace ORDER BY (SELECT NULL) with something more meaningful if you can
FROM TestData A
CROSS APPLY(VALUES -- Unpivot
('Question #1', A.Q1)
, ('Question #2', A.Q2)
, ('Question #3', A.Q3)
, ('Question #4', A.Q4)
, ('Question #5', A.Q5)
) B (SurveyQuestion, QuestionAnswer)
WHERE B.SurveyQuestion IS NOT NULL
) A
WHERE RowNum = 1
GROUP BY SurveyQuestion;
The first solution I think you can apply is: pick the submitter and issue and the max value of each of the answers given per submitter:
select submitter, issue,
(select max(q1)
from survey
where submitter = parent.submitter
and issue = parent.issue) as q1,
(select max(q2)
from survey
where submitter = parent.submitter
and issue = parent.issue) as q2,
(select max(q3)
from survey
where submitter = parent.submitter
and issue = parent.issue) as q3,
(select max(q4)
from survey
where submitter = parent.submitter
and issue = parent.issue) as q4,
(select max(q5)
from survey
where submitter = parent.submitter
and issue = parent.issue) as q5
from survey as parent
group by submitter, issue;
But the problem with this solution is that it returns the greatest answer per question, which might not be the desired output.
Another approach is to add an id to every row:
alter table survey add id bigint auto_increment primary key;
With an id that marks every row as different, this is a different kettle of fish. The select is much, much simpler:
select *
from survey
where (submitter, issue, id ) in
(
select submitter, issue, max(id)
from survey
group by submitter, issue);
The inner select (the one that has the group by) identifies which id you want to get; the outer select retrieves all the information: submitter, id, and the answers. You can use it with max() to treat the last answer as the good one, or with min() to treat the first answer as the good one.
Update
Sorry, I didn't read the "average" request you made. If you want an average value instead of the answer, I humbly recommend the second approach. The select would then be:
select avg(q1) as avg_q1,
avg(q2) as avg_q2,
....
from survey
where (submitter, issue, id ) in
(
select submitter, issue, max(id)
from survey
group by submitter, issue);

SQL random number that doesn't repeat within a group

Suppose I have a table:
HH SLOT RN
--------------
1 1 null
1 2 null
1 3 null
--------------
2 1 null
2 2 null
2 3 null
I want to set RN to be a random number between 1 and 10. It's ok for the number to repeat across the entire table, but it's bad to repeat the number within any given HH. E.g.,:
HH SLOT RN_GOOD RN_BAD
--------------------------
1 1 9 3
1 2 4 8
1 3 7 3 <--!!!
--------------------------
2 1 2 1
2 2 4 6
2 3 9 4
This is on Netezza if it makes any difference. This one's being a real headscratcher for me. Thanks in advance!
To get a random number between 1 and the number of rows in the hh, you can use:
select hh, slot, row_number() over (partition by hh order by random()) as rn
from t;
The larger range of values is a bit more challenging. The following calculates a table (called randoms) with numbers and a random position in the same range. It then uses slot to index into the position and pull the random number from the randoms table:
with nums as (
select 1 as n union all select 2 union all select 3 union all select 4 union all select 5 union all
select 6 union all select 7 union all select 8 union all select 9 union all select 10
),
randoms as (
select n, row_number() over (order by random()) as pos
from nums
)
select t.hh, t.slot, hnum.n
from (select hh, randoms.n, randoms.pos
from (select distinct hh
from t
) t cross join
randoms
) hnum join
t
on t.hh = hnum.hh and
t.slot = hnum.pos;
Here is a SQLFiddle that demonstrates this in Postgres, which I assume is close enough to Netezza to have matching syntax.
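As a plain illustration of the requirement (numbers from 1 to 10, never repeated within one HH), here is a Python sketch that samples without replacement per household. It is not Netezza code; `assign_random_numbers` and its parameters are hypothetical, and unlike a single shared shuffle it draws independently for each household.

```python
import random
from collections import defaultdict

def assign_random_numbers(rows, low=1, high=10, seed=None):
    """rows: iterable of (hh, slot). Returns {(hh, slot): rn}.

    Within one hh the numbers are drawn without replacement,
    so they cannot repeat; across hh values they may repeat.
    """
    rng = random.Random(seed)
    slots_by_hh = defaultdict(list)
    for hh, slot in rows:
        slots_by_hh[hh].append(slot)
    result = {}
    for hh, slots in slots_by_hh.items():
        # random.sample raises ValueError if there are more slots than numbers
        draws = rng.sample(range(low, high + 1), len(slots))
        for slot, rn in zip(slots, draws):
            result[(hh, slot)] = rn
    return result

if __name__ == "__main__":
    table = [(hh, slot) for hh in (1, 2) for slot in (1, 2, 3)]
    for key, rn in sorted(assign_random_numbers(table).items()):
        print(key, rn)
```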
I am not an expert on SQL, but you could probably do something like this:
1. Initialize a counter CNT=1.
2. Create a table such that you sample 1 row randomly from each group and a count of null RN, say C_NULL_RN.
3. With probability C_NULL_RN/(10-CNT+1) for each row, assign CNT as RN.
4. Increment CNT and go to step 2.
Well, I couldn't get a slick solution, so I did a hack:
1. Created a new integer field called rand_inst.
2. Assign a random number to each empty slot.
3. Update rand_inst to be the instance number of that random number within this household. E.g., if I get two 3's, then the second 3 will have rand_inst set to 2.
4. Update the table to assign a different random number anywhere that rand_inst>1.
5. Repeat assignment and update until we converge on a solution.
Here's what it looks like. Too lazy to anonymise it, so the names are a little different from my original post:
/* Iterative hack to fill 6 slots with a random number between 1 and 13.
A random number *must not* repeat within a household_id.
*/
update c3_lalfinal a
set a.rand_inst = b.rnum
from (
select household_id
,slot_nbr
,row_number() over (partition by household_id,rnd order by null) as rnum
from c3_lalfinal
) b
where a.household_id = b.household_id
and a.slot_nbr = b.slot_nbr
;
update c3_lalfinal
set rnd = CAST(0.5 + random() * (13-1+1) as INT)
where rand_inst>1
;
/* Repeat until this query returns 0: */
select count(*) from (
select household_id from c3_lalfinal group by 1 having count(distinct(rnd)) <> 6
) x
;

How can a group by be converted to a self-join

for a table such as:
employeeID | groupCode
1 red111
2 red111
3 blu123
4 blu456
5 red553
6 blu423
7 blu341
how can I count the number of employeeIDs that are in parent groups (such as red or blu, but there are many more groups in the real table) that have a total number of group members greater than 2 (so all those with blu in this particular example) excluding themselves.
To expand: groupCode consists of a parent group (three letters), followed by some numbers for the subgroup.
using a self-join, or at least without using a group by statement.
So far I have:
SELECT T1.employeeID
FROM TABLE T1, TABLE T2
WHERE T1.groupCode <> T2.groupCode
AND SUBSTR(T1.groupCode, 1, 3) = SUBSTR(T2.groupCode, 1, 3);
but that doesn't do much for me...
Add an index on the first 3 characters of groupCode in EMPLOYEE.
Then try this one:
SELECT ed.e3
, COUNT(*)
FROM EMPLOYEE e
JOIN
( SELECT DISTINCT
SUBSTR(groupCode, 1, 3) AS e3
FROM EMPLOYEE
) ed
ON e.groupCode LIKE CONCAT(ed.e3, '%')
GROUP BY ed.e3
HAVING COUNT(*) >= 3 --- or whatever is wanted
What about
SELECT DISTINCT substring(myTable.empshirtno, 1, 3),
(SELECT COUNT(*) FROM myTable AS myTable2
WHERE substring(myTable.empshirtno, 1, 3) = substring(myTable2.empshirtno, 1, 3))
FROM myTable
Maybe counting from a subquery is speedier with an index.
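For a quick experiment without GROUP BY, the sample table from the question can be loaded into an in-memory SQLite database. The sketch below is only one possible reading of the requirement: it uses a correlated subquery (a self-reference rather than an explicit join) and assumes that "greater than 2 excluding themselves" means more than two other employees share the same three-letter prefix. The table name `employee` and the helper `count_in_large_groups` are made up for the example.

```python
import sqlite3

ROWS = [
    (1, "red111"), (2, "red111"), (3, "blu123"), (4, "blu456"),
    (5, "red553"), (6, "blu423"), (7, "blu341"),
]

# For each employee, count the *other* employees sharing the 3-letter prefix.
QUERY = """
SELECT COUNT(*)
FROM employee AS t1
WHERE (SELECT COUNT(*)
       FROM employee AS t2
       WHERE substr(t2.groupCode, 1, 3) = substr(t1.groupCode, 1, 3)
         AND t2.employeeID <> t1.employeeID) > 2
"""

def count_in_large_groups(rows=ROWS):
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE employee (employeeID INTEGER, groupCode TEXT)")
    conn.executemany("INSERT INTO employee VALUES (?, ?)", rows)
    (count,) = conn.execute(QUERY).fetchone()
    conn.close()
    return count

if __name__ == "__main__":
    print(count_in_large_groups())
```

With the sample data, each "blu" employee has three other "blu" colleagues while each "red" employee has only two, so only the four "blu" rows are counted.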