Split a table randomly 50/50 in PostgreSQL - sql

I have a table with two columns in postgresql: original id and duplicate id.
Sample data:
original_id duplicate_id
1 1
2 2
3 3
4 4
5 5
6 6
I would like to randomly split this table 50/50, so that I can assign a specific tag to each half.
Desired output:
original_id duplicate_id tag
1 1 control
2 2 treatment
3 3 treatment
4 4 control
5 5 treatment
6 6 control
What is important:
1. The selection has to be random
2. The split has to be 50/50 (or the closest to this if the number of rows is odd)

You can select half of the rows in a random order with this query:
select *
from my_table
order by random()
limit (select count(*)/ 2 from my_table)
Use it to tag the rows:
with control as (
select *
from my_table
order by random()
limit (select count(*)/ 2 from my_table)
)
select
*,
case when t in (select t from control t) then 'control' else 'treatment' end
from my_table t;
Working example in rextester.

You can use row_number() OVER (ORDER BY random()) to assign a random rank to each record. Then use it in a CASE expression to assign the tag 'control' when the rank is less than or equal to half the number of rows in the table, and 'treatment' otherwise.
For a SELECT that looks like this:
SELECT original_id,
duplicate_id,
CASE
WHEN rn <= (SELECT count(*) / 2
FROM elbat) THEN
'control'
ELSE
'treatment'
END tag
FROM (SELECT original_id,
duplicate_id,
row_number() OVER (ORDER BY random()) rn
FROM elbat) x;
If you want an UPDATE (I'm not sure about this), and assuming that the pair of original_id and duplicate_id is unique, it could look like this:
UPDATE elbat t
SET tag = CASE
WHEN rn <= (SELECT count(*) / 2
FROM elbat) THEN
'control'
ELSE
'treatment'
END
FROM (SELECT original_id,
duplicate_id,
row_number() OVER (ORDER BY random()) rn
FROM elbat) x
WHERE x.original_id = t.original_id
AND x.duplicate_id = t.duplicate_id;
db<>fiddle
(BTW, that SELECT result on the Fiddle gives a nice example, that the order of the rows returned can be totally different from the physical one, if the optimizer likes it better that way.)
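As a rough check of the row_number() approach above, here is a minimal sketch that runs the same SELECT against an in-memory SQLite database through Python's sqlite3 module. SQLite stands in for PostgreSQL here (an assumption; it needs SQLite 3.25 or newer for window functions), and the helper name tag_rows is hypothetical.

```python
import sqlite3

def tag_rows(n_rows):
    """Return (original_id, tag) pairs for a table holding n_rows rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE elbat (original_id INTEGER, duplicate_id INTEGER)")
    conn.executemany(
        "INSERT INTO elbat VALUES (?, ?)",
        [(i, i) for i in range(1, n_rows + 1)],
    )
    rows = conn.execute(
        """
        SELECT original_id,
               CASE WHEN rn <= (SELECT count(*) / 2 FROM elbat)
                    THEN 'control' ELSE 'treatment' END AS tag
        FROM (SELECT original_id,
                     -- random rank 1..n, one per row
                     row_number() OVER (ORDER BY random()) AS rn
              FROM elbat) x
        ORDER BY original_id
        """
    ).fetchall()
    conn.close()
    return rows

for original_id, tag in tag_rows(6):
    print(original_id, tag)
```

Because count(*) / 2 is an integer division, an odd number of rows leaves the extra row in the 'treatment' group.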

I would use window functions:
select t.*,
(case when seqnum <= cnt / 2
then 'treatment' else 'control'
end) as tag
from (select t.*,
count(*) over () as cnt,
row_number() over (order by random()) as seqnum
from t
) t;
Actually, random is random, so you don't need the count. You can use modulo arithmetic on the row number instead:
select t.*,
(case when row_number() over (order by random()) % 2 = 1
then 'treatment' else 'control'
end) as tag
from t;
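To see how the modulo variant behaves for even and odd row counts, the following sketch counts the tags it produces. It uses SQLite through Python's sqlite3 as a stand-in for PostgreSQL (an assumption), and split_by_parity is a hypothetical helper name.

```python
import sqlite3

def split_by_parity(n_rows):
    """Tag rows by the parity of a random row number and count each tag."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (id INTEGER)")
    conn.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(n_rows)])
    counts = dict(
        conn.execute(
            """
            SELECT tag, count(*)
            FROM (SELECT CASE WHEN row_number() OVER (ORDER BY random()) % 2 = 1
                              THEN 'treatment' ELSE 'control' END AS tag
                  FROM t) tagged
            GROUP BY tag
            """
        ).fetchall()
    )
    conn.close()
    return counts

print(split_by_parity(6))
print(split_by_parity(7))
```

Odd row numbers always outnumber even ones by at most one, so with an odd row count the extra row goes to 'treatment'.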

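You can make random() generate the values 1 or 2 with the formula (random() + 1)::int. Note that each row is then tagged independently, so the split is only 50/50 on average rather than exact: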
select t.*,
case (random() + 1)::int
when 1 then 'treatment'
else 'control'
end as tag
from t;
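In general, (random() * (upper_limit - lower_limit) + lower_limit)::int generates integers between lower_limit and upper_limit (inclusive). With limits 1 and 2 the multiplication can be dropped (it would be * 1, which changes nothing). Be aware that the cast rounds to the nearest integer, so with more than two values the two endpoints come up only half as often as the values in between; floor(random() * n)::int + lower_limit gives n equally likely values. If that skew is acceptable, you can generate, for example, four random values like this: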
select t.*,
case (random() * 3 + 1)::int
when 1 then 'treatment'
when 2 then 'control'
when 3 then 'something'
else 'some other thing'
end as tag
from t;
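The arithmetic behind the rounding formula can be checked without a database. The sketch below evaluates both formulas on an evenly spaced grid of inputs in place of random() and counts how often each result occurs. It assumes that Python's round() is a fair stand-in for PostgreSQL's float-to-integer cast (both round to the nearest integer); the function names are hypothetical.

```python
import math
from collections import Counter

def rounded_bucket(u, lower, upper):
    # Mirrors (random() * (upper - lower) + lower)::int, assuming the cast
    # rounds to the nearest integer like Python's round().
    return round(u * (upper - lower) + lower)

def floored_bucket(u, lower, upper):
    # Mirrors floor(random() * n)::int + lower with n = upper - lower + 1.
    return math.floor(u * (upper - lower + 1)) + lower

N = 6000
# Evenly spaced stand-ins for random(), all strictly between 0 and 1.
grid = [(i + 0.5) / N for i in range(N)]

rounded_counts = Counter(rounded_bucket(u, 1, 4) for u in grid)
floored_counts = Counter(floored_bucket(u, 1, 4) for u in grid)

print(sorted(rounded_counts.items()))  # endpoints 1 and 4 get half the share
print(sorted(floored_counts.items()))  # all four values get the same share
```

With rounding, the values 1 and 4 each cover an interval half as wide as the intervals for 2 and 3, which is why the floor variant is the safer choice when the groups should be equally likely.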

Related

Randomly flagging records in an Oracle Table

Given a table of IDs in an Oracle database, what is the best method to randomly flag (x) percent of them? In the example below, I am randomly flagging 20% of all records.
ID ELIG
1 0
2 0
3 0
4 1
5 0
My current approach shown below works fine, but I am wondering if there is a more efficient way to do this?
WITH DAT
AS ( SELECT LEVEL AS "ID"
FROM DUAL
CONNECT BY LEVEL <= 5),
TICKETS
AS (SELECT "ID", 1 AS "ELIG"
FROM ( SELECT *
FROM DAT
ORDER BY DBMS_RANDOM.VALUE ())
WHERE ROWNUM <= (SELECT ROUND (COUNT (*) / 5, 0) FROM DAT)),
RAFFLE
AS (SELECT "ID", 0 AS "ELIG"
FROM DAT
WHERE "ID" NOT IN (SELECT "ID"
FROM TICKETS)
UNION
SELECT * FROM TICKETS)
SELECT *
FROM RAFFLE;
You could use a ROW_NUMBER approach here:
WITH cte AS (
SELECT t.*, ROW_NUMBER() OVER (ORDER BY dbms_random.value) rn,
COUNT(*) OVER () cnt
FROM yourTable t
)
SELECT t.*,
CASE WHEN rn / cnt <= 0.2 THEN 'FLAG' END AS flag -- 0.2 to flag 20%
FROM cte t
ORDER BY ID;
Demo
In one run of the above query, one of the five records was flagged, which is 20%.
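A minimal sketch of the ROW_NUMBER approach above, run on SQLite through Python's sqlite3 instead of Oracle (an assumption: random() replaces dbms_random.value, and rn is multiplied by 1.0 because SQLite would otherwise do integer division). The helper name flag_fraction is hypothetical.

```python
import sqlite3

def flag_fraction(n_rows, fraction):
    """Return (id, elig) pairs with roughly `fraction` of the rows flagged."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE yourTable (id INTEGER PRIMARY KEY)")
    conn.executemany(
        "INSERT INTO yourTable VALUES (?)",
        [(i,) for i in range(1, n_rows + 1)],
    )
    rows = conn.execute(
        """
        WITH cte AS (
            SELECT id,
                   row_number() OVER (ORDER BY random()) AS rn,
                   count(*) OVER () AS cnt
            FROM yourTable
        )
        SELECT id,
               CASE WHEN rn * 1.0 / cnt <= ? THEN 1 ELSE 0 END AS elig
        FROM cte
        ORDER BY id
        """,
        (fraction,),
    ).fetchall()
    conn.close()
    return rows

for row in flag_fraction(5, 0.2):
    print(row)
```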

Oracle Specific Sorting

I have the following problem:
I need to sort some products where one needs to be a specific row and others to be random.
So if I have products: A B C D, I need for example B to be the third product while others can be random like:
C 1
A 2
B 3
D 4
The best attempt I have so far is (3 is a dynamic value):
SELECT
product_name,
CASE
WHEN product = 'B' THEN 3
ELSE ( CASE WHEN rownum < 3 THEN rownum ELSE rownum + 1 END )
END sorting
FROM
products
ORDER BY
sorting ASC;
but I'm not always getting the desired outcome.
Any help or lead is appreciated.
This is rather tricky, but you can use row_number() and a bunch of arithmetic:
select p.*
from (select p.*,
row_number() over (order by (case when product = 'B' then 2 else 1 end),
dbms_random.value
) as seqnum
from products p
) p
order by (case when seqnum < 3 then seqnum end),
(case when product = 'B' then 1 else 2 end),
seqnum;
The logic is:
1. Enumerate the values randomly, with the special value going last.
2. Put in the rows with lower values.
3. Put in the row with the special value.
4. Put in the rest of the rows.
The above uses a subquery because the randomness is enforced. You can do this without a subquery as:
order by (case when row_number() over (order by (case when product = 'B' then 2 else 1 end)) < 3
then dbms_random.value
else 2 -- bigger than value
end),
(case when product = 'B' then 1 else 2 end),
dbms_random.value;
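The subquery version above can be tried out with the following sketch, which uses SQLite through Python's sqlite3 rather than Oracle (an assumption). Since SQLite sorts NULLs first in ascending order while Oracle sorts them last, the first sort key is written as an explicit 0/1 flag instead of relying on NULL placement. The helper name order_with_pinned and its parameters are hypothetical.

```python
import sqlite3

def order_with_pinned(products, pinned, position):
    """Return products in random order, with `pinned` at 1-based `position`."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE products (product TEXT)")
    conn.executemany("INSERT INTO products VALUES (?)", [(p,) for p in products])
    rows = conn.execute(
        """
        SELECT product
        FROM (SELECT product,
                     -- random enumeration, pinned product numbered last
                     row_number() OVER (
                         ORDER BY (CASE WHEN product = :pinned THEN 2 ELSE 1 END),
                                  random()) AS seqnum
              FROM products) p
        ORDER BY (CASE WHEN seqnum < :position THEN 0 ELSE 1 END),  -- early rows first
                 (CASE WHEN product = :pinned THEN 1 ELSE 2 END),   -- then the pinned row
                 seqnum                                             -- then the rest
        """,
        {"pinned": pinned, "position": position},
    ).fetchall()
    conn.close()
    return [r[0] for r in rows]

print(order_with_pinned(["A", "B", "C", "D"], "B", 3))
```

This sketch assumes the table has at least as many rows as the requested position.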

get intervals of nonchanging value from a sequence of numbers

I need to summarize a sequence of values into intervals of non-changing values: begin, end and value for each such interval. I can easily do it in PL/SQL but would like a pure SQL solution for both performance and educational reasons. I have been trying for some time to solve it with analytic functions, but can't figure out how to properly define the windowing clause. The problem I am having is with repeated values.
Simplified example -
given input:
id value
1 1
2 1
3 2
4 2
5 1
I'd like to get output
from to val
1 2 1
3 4 2
5 5 1
You want to identify groups of adjacent values. One method is to use lag() to find the beginning of the sequence, then a cumulative sum to identify the groups.
Another method is the difference of row number:
select value, min(id) as from_id, max(id) as to_id
from (select t.*,
(row_number() over (order by id) -
row_number() over (partition by value order by id)
) as grp
from table t
) t
group by grp, value;
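The first method mentioned above, lag() followed by a cumulative sum, could be sketched as follows. It runs on SQLite through Python's sqlite3 (an assumption; the original question does not name a SQLite target), and the names is_start, flagged and grouped are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER, value INTEGER)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [(1, 1), (2, 1), (3, 2), (4, 2), (5, 1)],
)

intervals = conn.execute(
    """
    SELECT min(id) AS from_id, max(id) AS to_id, value
    FROM (SELECT id, value,
                 -- running count of interval starts = interval number
                 sum(is_start) OVER (ORDER BY id) AS grp
          FROM (SELECT id, value,
                       -- 1 when the value differs from the previous row
                       -- (the first row has no previous row, so it is a start)
                       CASE WHEN value = lag(value) OVER (ORDER BY id)
                            THEN 0 ELSE 1 END AS is_start
                FROM t) flagged) grouped
    GROUP BY grp, value
    ORDER BY from_id
    """
).fetchall()
conn.close()

print(intervals)  # [(1, 2, 1), (3, 4, 2), (5, 5, 1)]
```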
Use a CTE to mark each row with a group key (the difference of two row numbers) that stays constant within a run of equal values, then group by that key and the value:
CREATE TABLE #temp (
ID INT NOT NULL IDENTITY(1,1),
[Value] INT NOT NULL
)
GO
INSERT INTO #temp ([Value])
SELECT 1 UNION ALL
SELECT 1 UNION ALL
SELECT 2 UNION ALL
SELECT 2 UNION ALL
SELECT 1;
WITH Marked AS (
SELECT
*,
grp = ROW_NUMBER() OVER (ORDER BY ID)
- ROW_NUMBER() OVER (PARTITION BY Value ORDER BY ID)
FROM #temp
)
SELECT MIN(ID) AS [From], MAX(ID) AS [To], [VALUE]
FROM Marked
GROUP BY grp, Value
ORDER BY MIN(ID)
DROP TABLE #temp;

Duplicate Counts - TSQL

I want to get all records that have duplicate values in some of the fields (i.e. the key columns).
My code:
CREATE TABLE #TEMP (ID int, Descp varchar(5), Extra varchar(6))
INSERT INTO #Temp
SELECT 1,'One','Extra1'
UNION ALL
SELECT 2,'Two','Extra2'
UNION ALL
SELECT 3,'Three','Extra3'
UNION ALL
SELECT 1,'One','Extra4'
SELECT ID, Descp, Extra FROM #TEMP
;WITH Temp_CTE AS
(SELECT *
, ROW_NUMBER() OVER (PARTITION BY ID, Descp ORDER BY (SELECT 0))
AS DuplicateRowNumber
FROM #TEMP
)
SELECT * FROM Temp_cte
DROP TABLE #TEMP
The last column tells me how many times each row has appeared so far, based on the ID and Descp values.
I want that column, but I also need another column that indicates that both rows with ID = 1 and Descp = 'One' have shown up more than once.
So an extra column (e.g. MultipleOccurrences (bool)) which has 1 for the two rows with ID = 1 and Descp = 'One', and 0 for the other rows, as they only show up once.
How can I achieve that? (I want to avoid using Count(1) > 1 or something similar if possible.)
Edit:
Desired output:
ID Descp Extra DuplicateRowNumber IsMultiple
1 One Extra1 1 1
1 One Extra4 2 1
2 Two Extra2 1 0
3 Three Extra3 1 0
SQL Fiddle
You say "I want to avoid using Count", but it is probably the best way. It uses the same partitioning you already have on the row_number:
SELECT *,
ROW_NUMBER() OVER (PARTITION BY ID, Descp
ORDER BY (SELECT 0)) AS DuplicateRowNumber,
CASE
WHEN COUNT(*) OVER (PARTITION BY ID, Descp) > 1 THEN 1
ELSE 0
END AS IsMultiple
FROM #Temp
And the execution plan just shows a single sort
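To check the COUNT(*) OVER approach against the desired output from the question, here is a small sketch on SQLite through Python's sqlite3 (an assumption; the original is T-SQL). The table is named records instead of #TEMP, and the row number is ordered by Extra rather than (SELECT 0) so that the result is deterministic; both are choices made for this example only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (ID INTEGER, Descp TEXT, Extra TEXT)")
conn.executemany(
    "INSERT INTO records VALUES (?, ?, ?)",
    [
        (1, "One", "Extra1"),
        (2, "Two", "Extra2"),
        (3, "Three", "Extra3"),
        (1, "One", "Extra4"),
    ],
)

rows = conn.execute(
    """
    SELECT ID, Descp, Extra,
           row_number() OVER (PARTITION BY ID, Descp ORDER BY Extra)
               AS DuplicateRowNumber,
           -- size of the (ID, Descp) group, turned into a 0/1 flag
           CASE WHEN count(*) OVER (PARTITION BY ID, Descp) > 1
                THEN 1 ELSE 0 END AS IsMultiple
    FROM records
    ORDER BY ID, Extra
    """
).fetchall()
conn.close()

for row in rows:
    print(row)
```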
Well, I have this solution, but using a Count...
SELECT T1.*,
ROW_NUMBER() OVER (PARTITION BY T1.ID, T1.Descp ORDER BY (SELECT 0)) AS DuplicateRowNumber,
CASE WHEN T2.C = 1 THEN 0 ELSE 1 END MultipleOcurrences FROM #temp T1
INNER JOIN
(SELECT ID, Descp, COUNT(1) C FROM #TEMP GROUP BY ID, Descp) T2
ON T1.ID = T2.ID AND T1.Descp = T2.Descp

Oracle SQL -- Analytic functions OVER a group?

My table:
ID NUM VAL
1 1 Hello
1 2 Goodbye
2 2 Hey
2 4 What's up?
3 5 See you
If I want to return the max number for each ID, it's really nice and clean:
SELECT MAX(NUM) FROM table GROUP BY (ID)
But what if I want to grab the value associated with the max of each number for each ID?
Why can't I do:
SELECT MAX(NUM) OVER (ORDER BY NUM) FROM table GROUP BY (ID)
Why is that an error? I'd like to have this select grouped by ID, rather than partitioning separately for each window...
EDIT: The error is "not a GROUP BY expression".
You could probably use the MAX() KEEP(DENSE_RANK LAST...) function:
with sample_data as (
select 1 id, 1 num, 'Hello' val from dual union all
select 1 id, 2 num, 'Goodbye' val from dual union all
select 2 id, 2 num, 'Hey' val from dual union all
select 2 id, 4 num, 'What''s up?' val from dual union all
select 3 id, 5 num, 'See you' val from dual)
select id, max(num), max(val) keep (dense_rank last order by num)
from sample_data
group by id;
When you use a window function, you don't need GROUP BY anymore; this would suffice:
select id,
max(num) over(partition by id)
from x
Actually you can get the result without using windowing function:
select *
from x
where (id,num) in
(
select id, max(num)
from x
group by id
)
Output:
ID NUM VAL
1 2 Goodbye
2 4 What's up?
3 5 See you
http://www.sqlfiddle.com/#!4/a9a07/7
If you want to use a window function, you might be tempted to write this:
select id, val,
case when num = max(num) over(partition by id) then
1
else
0
end as to_select
from x
where to_select = 1
Or this:
select id, val
from x
where num = max(num) over(partition by id)
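But neither is allowed: a window function cannot appear in WHERE, and a column alias cannot be referenced in the WHERE clause of the same query level. You have to do this instead: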
with list as
(
select id, val,
case when num = max(num) over(partition by id) then
1
else
0
end as to_select
from x
)
select *
from list
where to_select = 1
http://www.sqlfiddle.com/#!4/a9a07/19
If you're looking to get the rows which contain the values from MAX(num) GROUP BY id, this tends to be a common pattern...
WITH
sequenced_data
AS
(
SELECT
ROW_NUMBER() OVER (PARTITION BY id ORDER BY num DESC) AS sequence_id,
*
FROM
yourTable
)
SELECT
*
FROM
sequenced_data
WHERE
sequence_id = 1
EDIT
I don't know if TeraData will allow this, but the logic seems to make sense...
SELECT
*
FROM
yourTable
WHERE
num = MAX(num) OVER (PARTITION BY id)
Or maybe...
SELECT
*
FROM
(
SELECT
*,
MAX(num) OVER (PARTITION BY id) AS max_num_by_id
FROM
yourTable
)
AS sub_query
WHERE
num = max_num_by_id
This is slightly different from my previous answer; if multiple records are tied with the same MAX(num), this will return all of them, the other answer will only ever return one.
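The difference in tie handling can be seen in a small sketch run on SQLite through Python's sqlite3 (an assumption; the question targets Oracle). The sample data is altered so that id 2 has two rows tied at num = 4, which is a hypothetical change made only to show the effect.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE yourTable (id INTEGER, num INTEGER, val TEXT)")
conn.executemany(
    "INSERT INTO yourTable VALUES (?, ?, ?)",
    [
        (1, 1, "Hello"),
        (1, 2, "Goodbye"),
        (2, 4, "Hey"),          # tied with the next row (hypothetical data)
        (2, 4, "What's up?"),
        (3, 5, "See you"),
    ],
)

# ROW_NUMBER keeps exactly one row per id; which tied row wins is unspecified.
one_per_id = conn.execute(
    """
    SELECT id, num, val
    FROM (SELECT id, num, val,
                 row_number() OVER (PARTITION BY id ORDER BY num DESC) AS sequence_id
          FROM yourTable) sequenced_data
    WHERE sequence_id = 1
    ORDER BY id
    """
).fetchall()

# MAX() OVER keeps every row whose num equals the maximum for its id.
all_ties = conn.execute(
    """
    SELECT id, num, val
    FROM (SELECT id, num, val,
                 max(num) OVER (PARTITION BY id) AS max_num_by_id
          FROM yourTable) sub_query
    WHERE num = max_num_by_id
    ORDER BY id, val
    """
).fetchall()
conn.close()

print(len(one_per_id), len(all_ties))
```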
EDIT
In your proposed SQL the error relates to the fact that the OVER() clause contains a field not in your GROUP BY. It's like trying to do this...
SELECT id, num FROM yourTable GROUP BY id
num is invalid, because there can be multiple values in that field for each row returned (with the rows returned being defined by GROUP BY id).
In the same way, you can't put num inside the OVER() clause.
SELECT
id,
MAX(num), <-- Valid as it is an aggregate
MAX(num) <-- still valid
OVER(PARTITION BY id), <-- Also valid, as id is in the GROUP BY
MAX(num) <-- still valid
OVER(PARTITION BY num) <-- Not valid, as num is not in the GROUP BY
FROM
yourTable
GROUP BY
id
See this question for when you can't specify something in the OVER() clause, and an answer showing when (I think) you can: over-partition-by-question