Randomly flagging records in an Oracle Table - sql

Given a table of IDs in an Oracle database, what is the best method to randomly flag (x) percent of them? In the example below, I am randomly flagging 20% of all records.
ID ELIG
1 0
2 0
3 0
4 1
5 0
My current approach shown below works fine, but I am wondering if there is a more efficient way to do this?
WITH DAT
AS ( SELECT LEVEL AS "ID"
FROM DUAL
CONNECT BY LEVEL <= 5),
TICKETS
AS (SELECT "ID", 1 AS "ELIG"
FROM ( SELECT *
FROM DAT
ORDER BY DBMS_RANDOM.VALUE ())
WHERE ROWNUM <= (SELECT ROUND (COUNT (*) / 5, 0) FROM DAT)),
RAFFLE
AS (SELECT "ID", 0 AS "ELIG"
FROM DAT
WHERE "ID" NOT IN (SELECT "ID"
FROM TICKETS)
UNION
SELECT * FROM TICKETS)
SELECT *
FROM RAFFLE;

You could use a ROW_NUMBER approach here:
WITH cte AS (
SELECT t.*, ROW_NUMBER() OVER (ORDER BY dbms_random.value) rn,
COUNT(*) OVER () cnt
FROM yourTable t
)
SELECT t.*,
CASE WHEN rn / cnt <= 0.2 THEN 'FLAG' END AS flag -- 0.2 to flag 20%
FROM cte t
ORDER BY ID;
Demo
Sample output for one run of the above query:
Note that one of five records is flagged, which is 20%.

Related

Oracle get rank for only latest date

I have origin table A:
dt
c1
value
2022/10/1
1
1
2022/10/2
1
2
2022/10/3
1
3
2022/10/1
2
4
2022/10/2
2
6
2022/10/3
2
5
Currently I got the latest dt's percent_rank by:
select * from
(
select
*,
percent_rank() over (partition by c1 order by value) as prank
from A
) as pt
where pt.dt = Date'2022-10-3'
Demo: https://www.db-fiddle.com/f/rXynTaD5nmLqFJdjDSCZpL/0
the excepted result looks like:
dt
c1
value
prank
2022/10/3
1
3
1
2022/10/3
2
5
0.5
Which means at 2022-10-3, the value in c1 group's percent_rank in history is 100% while in c2 group is 66%.
But this sql will sort evey partition which I thought it's time complexity is O(n log n).
I just need the latest date's rank and I thought I could do that by calculating count(last_value > value)/count() which cost O(n).
Any suggestions?
Rather than hard-coding the maximum date, you can use the ROW_NUMBER() analytic function:
SELECT *
FROM (
SELECT t.*,
PERCENT_RANK() OVER (PARTITION BY c1 ORDER BY value) AS prank,
ROW_NUMBER() OVER (PARTITION BY c1 ORDER BY dt DESC) AS rn
FROM table_name t
) t
WHERE rn = 1
Which, for the sample data:
CREATE TABLE table_name (dt, c1, value) AS
SELECT DATE '2022-10-01', 1, 1 FROM DUAL UNION ALL
SELECT DATE '2022-10-02', 1, 2 FROM DUAL UNION ALL
SELECT DATE '2022-10-03', 1, 3 FROM DUAL UNION ALL
SELECT DATE '2022-10-01', 2, 4 FROM DUAL UNION ALL
SELECT DATE '2022-10-02', 2, 6 FROM DUAL UNION ALL
SELECT DATE '2022-10-03', 2, 5 FROM DUAL;
Outputs:
DT
C1
VALUE
PRANK
RN
2022-10-03 00:00:00
1
3
1
1
2022-10-03 00:00:00
2
5
.5
1
fiddle
But this sql will sort every partition which I thought it's time complexity is O(n log n).
Whatever you do you will need to iterate over the entire result-set.
I just need the latest date's rank and I thought I could do that by calculating count(last_value > value)/count().
Then you will need to find the last value which (unless you are hardcoding the last date) will involve using an index- or table-scan over all the values in each partition and sorting the values and then to find a count of the greater values will require a second index- or table-scan. You can profile both solutions but I expect you would find that using analytic functions is going to be equally efficient, if not better, than trying to use aggregation functions.
For example:
SELECT c1,
dt,
value,
( SELECT ( COUNT(CASE WHEN value <= t.value THEN 1 END) - 1 )
/ ( COUNT(*) - 1 )
FROM table_name c
WHERE c.c1 = t.c1
) AS prank
FROM table_name t
WHERE dt = DATE '2022-10-03'
If going to access the table twice and you are likely to find that the I/O costs of table access are going to far outweight any potential savings from using a different method. However, if you look at the explain plan (fiddle) then the query is still performing an aggregate sort so there is not going to be any cost savings, only additional costs from this method.
Try this
select t.c1, t.dt, t.value
from TABLENAME t
inner join (
select c1, max(dt) as MaxDate
from TABLENAME
group BY dt
) tm on t.c1 = tm.c1 and t.dt = tm.MaxDate ORDER BY dt DESC;
Or as simple as
SELECT * from TABLENAME ORDER BY dt DESC;
I fiddled it a bit, it is almost the same answer as MT0 already put.
select dt, c1, val, prank*100 as percent_rank from (select
t1.*,
percent_rank() over (partition by c1 order by val) as prank,
row_number() over (partition by c1 order by dt desc) rn from t1) where rn=1;
result
DT C1 VAL PERCENT_RANK
2022-10-03 1 3 100
2022-10-03 2 5 50
http://sqlfiddle.com/#!4/ec60a/23
I used Row_number = 1 to get the latest date.
And also pushed the percent_rank as percent.
Is this what you desire?

Random records in Oracle table based on conditions

I have a Oracle table with the following columns
Table Structure
In a query I need to return all the records with CPER>=40 which is trivial. However, apart from CPER>=40 I need to list 5 random records for each CPID.
I have attached a sample list of records. However, in my table I have around 50,000 records.
Appreciate if you can help.
Oracle solution:
with CTE as
(
select t1.*,
row_number() over(order by DBMS_RANDOM.VALUE) as rn -- random order assigned
from MyTable t1
where CPID <40
)
select *
from CTE
where rn <=5 -- pick 5 at random
union all
select t2.*, null
from my_table t2
where CPID >= 40
SQL Server:
with CTE as
(
select t1.*,
row_number() over(order by newid()) as rn -- random order assigned
from MyTable t1
where CPID <40
)
select *
from CTE
where rn <=5 -- pick 5 at random
union all
select t2.*, null
from my_table t2
where CPID >= 40
How about something like this...
SELECT *
FROM (SELECT CID,
CVAL,
CPID,
CPER,
Row_number() OVER (partition BY CPID ORDER BY CPID ASC ) AS RN
FROM Table) tmp
WHERE CPER>=40 OR pids <= 5
However, this is not random.
Assuming that you want five additional random records, you can do:
select t.*
from (select t.*,
row_number() over (partition by cpid,
(case when cper >= 40 then 1 else 2 end)
order by dbms_random.value
) as seqnum
from t
) t
where seqnum <= 5 or cper >= 40;
The row_number() is enumerating the rows for each cpid in two groups -- based on the cper value. The outer where is taking all cper values in the range you want as well as five from the other group.

Oracle: Need to fetch the rows exponentially

I got below query from another post which selects 100 rows from every 2000 rows.
Like this: 1-100,2001-2100,4001-4100,6001-6100,8001-8100 and so on.
SELECT * FROM (SELECT t.*,ROWNUM AS rn FROM(SELECT * FROM your_table ORDER BY your_condition) t)WHERE MOD( rn - 1, 2000 ) < 100;
Now I want to select my data exponentially.Such that it will select 100 rows from first 1000 rows, then from next 2000 rows, then from next 4000 rows.
Like this: 1-100,2000-2100,4000-4100,8000-8100,16000-16100 and so on.
The idea is to scan rows with a specific pattern.
You asked this in a comment on your previous question and I answered there...
SELECT *
FROM (
SELECT t.*,
ROWNUM AS rn -- Secondly, assign a row number to the ordered rows
FROM (
SELECT *
FROM your_table
ORDER BY your_condition -- First, order the data
) t
)
WHERE rn - POWER( -- Finally, filter the top 100.
2,
TRUNC( CAST( LOG( 2, CEIL( rn / 1000 ) ) AS NUMBER(20,4) ) )
) * 1000 + 1000 <= 100
This will take the first 100 rows from the groups 1-1000, 1001-3000, 3001-7000, 7001-15000, etc.
Or, to get the rows:
1-100,2000-2100,4000-4100,8000-8100,16000-16100, 32000-32100 and so on.
Then:
WHERE CASE -- Finally, filter the top 100.
WHEN rn <= 2000 THEN rn
ELSE rn - POWER(
2,
TRUNC( CAST( LOG( 2, CEIL( rn / 1000 - 1 ) ) AS NUMBER(20,4) ) )
) * 1000
END <= 100
You could use power function and simple hierarchical query, then join it with your table. Here is example with all_objects view:
with rng as (select 0 num from dual union all
select 1000 * power(2, level) from dual connect by level < 10 )
select *
from (select row_number() over (order by object_name) rn, object_name from all_objects)
join rng on rn between num + 1 and num + 100
From what you describe, you can use logs to define the groups. This is probably close enough to what you want:
select t.*
from (select t.*,
row_number() over (floor(log(2, floor(1 + (seqnum - 1) / 1000) ))
order by col
) as seqnum_2
from (select t.*, row_number() over (order by col) as seqnum
from t
) t
where seqnum_2 <= 100;
The difference from your description is that the first group is 1-999, 1000-1999, and so on.

Oracle: I need to select n rows from every k rows of a table

For example:
My table has 10000 rows. First I will divide it in 5 sets of 2000(k) rows. Then from each set of 2000 rows I will select only top 100(n) rows.
With this approach I am trying to scan some rows of table with a specific pattern.
Assuming you are ordering them 1 - 10000 using some logic and want to output only rows 1-100,2001-2100,4001-4100,etc then you can use the ROWNUM pseudocolumn:
SELECT *
FROM (
SELECT t.*,
ROWNUM AS rn -- Secondly, assign a row number to the ordered rows
FROM (
SELECT *
FROM your_table
ORDER BY your_condition -- First, order the data
) t
)
WHERE MOD( rn - 1, 2000 ) < 100; -- Finally, filter the top 100 per 2000.
Or you could use the ROW_NUMBER() analytic function:
SELECT *
FROM (
SELECT t.*,
ROW_NUMBER() OVER ( ORDER BY your_condition ) AS rn
FROM your_table
)
WHERE MOD( rn - 1, 2000 ) < 100;
Is it possible to increase the set of sample data exponentially. Like 1k, 2k, 4k,8k....and then fetch some rows from these.
Replace the WHERE clause with:
WHERE rn - POWER(
2,
TRUNC( CAST( LOG( 2, CEIL( rn / 1000 ) ) AS NUMBER(20,4) ) )
) * 1000 + 1000 <= 100
This solution uses the analytic ntile() to split the raw data into five buckets. That result set is labelled using the analytic row_number() which provides a filter to produce the final set:
with sq1 as ( select id, col1, ntile(5) over (order by id asc) as quintile
from t23
)
, sq2 as ( select id, col1, quintile
, row_number() over ( partition by quintile order by id asc) as rn
from sq1 )
select *
from sq2
where rn <= 200
order by quintile, rn
/
use partition by and order by with row_number. it will look like following:
row_number()over(partition by partition_column order by order_column)<=100
partition_column will be your condition to divide set.
order_column will be your condition to select top 100.

How do I intermerge multiple SELECT results?

First of all, I'm not familiar with SQL in depth, so this may be a beginner question.
I know how to select data ordered by Id: SELECT * FROM foo ORDER BY id LIMIT 100 as well as how to select a random subset: SELECT * FROM foo ORDER BY RAND() LIMIT 100.
I'd like to merge these two queries into 1 in a zip manner, choosing limit/2 from each (i.e. 50). For example:
0
85
1
35
2
38
3
19
4
...
I would like to avoid duplicates. The easiest way is probably to just add a WHERE id > 100/2 to the part of the query that retrieves randomly ordered rows.
Additional info: It is unknown how many rows exist.
To get the "zip-manner" merge add a generated rownumber to each query and use an union with order by rownnumber.
Use even numbers for one and odd numbers for the other query.
Try this for MySQL
SELECT
#rownum0:=#rownum0+2 rn,
f.*
FROM ( SELECT * FROM foo ORDER BY id ) f, (SELECT #rownum0:=0) r
UNION
SELECT #rownum1:=#rownum1+2 rn,
b.*
FROM ( SELECT * FROM bar ORDER BY RAND() ) b, (SELECT #rownum1:=-1) r
ORDER BY rn
LIMIT 100
This should be self-explicative but doesnt remove duplicates:
select #rownum:=#rownum+1 as rownum,
(#rownum-1) % 50 as sortc, u.id
from (
(select id from player order by id limit 50)
union all
(select id from player order by rand() limit 50)) u,
(select #rownum:=0) r
order by sortc,rownum;
If you replace "union all" with "union", you remove duplicates but get less rows as a consequence.
This will deal with duplicates, does not restrict random numbers in the ids > 50, and always return 100 rows:
SELECT #rownum := #rownum + 1 AS rownum,
( #rownum - 1 ) % 50 AS sortc,
u.id
FROM ((SELECT id
FROM foo
ORDER BY Rand()
LIMIT 50)
UNION
(SELECT id
FROM foo
WHERE id <= 100
ORDER BY id)) u,
(SELECT #rownum := 0) r
WHERE #rownum < 100
ORDER BY sortc,
rownum DESC
LIMIT 100;
SELECT * FROM foo ORDER BY id LIMIT 50 UNION SELECT * FROM foo ORDER BY RAND() LIMIT 50
If I understand your requirement correctly. UNION removes duplicates by itself