Generating a distributed number of records - SQL

I have some code that generates a parent record and a random number of child records for each parent. I want there to be 5 or more child records for each parent, and fewer than 20.
I ran this several times and I seem to be getting no, or very few, parents with child counts in the range of 5-13.
Can someone please explain how I can get a more evenly distributed number of child records?
If you run the last query below you will see there are no, or very few, COUNT(*) values in the range 6-15.
No doubt I have a problem with my logic, but I can't seem to find it. I'm also open to any new code that accomplishes the same task and produces a distributed number of child records with an INSERT ALL statement.
My goal is to generate a huge amount of test data to exercise the application queries. For now I'm only generating 30 days' worth.
CREATE TABLE emp_info
(
empid INTEGER,
empname VARCHAR2(50)
);
CREATE TABLE emp_attendance
(
empid INTEGER,
start_date DATE,
end_date DATE
);
INSERT ALL
-- WHEN rn = 1, insert the parent record.
-- 1 = 1 is always true, so a child
-- record is always inserted.
WHEN rn = 1 then into emp_info (empid, empname) values (id, name)
WHEN 1 = 1 then into emp_attendance (empid, start_date, end_date)
VALUES(id, d1, d1 + DBMS_RANDOM.value (0, .75))
SELECT *
FROM
(
-- Get the highest empid as the start,
-- so this can be run more than once.
-- If never run before, start with 0.
WITH t AS ( SELECT nvl(max(empid), 0) maxid FROM emp_info )
SELECT CEIL(maxid + level/20) id,
CASE MOD(maxid + level, 20) WHEN 1 THEN 1 END rn,
-- create an alpha name from 3-15
-- characters in length.
DBMS_RANDOM.string('U', DBMS_RANDOM.value(3, 15)) name,
-- Set the start date anywhere from
-- today to today + 30 days.
TRUNC(sysdate) + DBMS_RANDOM.value (1, 30) d1,
CASE WHEN ROW_NUMBER() OVER (PARTITION BY CEIL(maxid + level/20) ORDER BY level) > 5 THEN
-- Ensure there is a minimum of
-- 5 and a maximum of 20 child
-- records for each parent.
--
-- Exclude first 5 records and then
-- for 6-20 records, generating
-- random number between 5-20.
-- We can then compare with any
-- number between 5-20 so that it
-- can give us any number of
-- records.
DBMS_RANDOM.value(5, 20) ELSE 5 END AS random_val
FROM t
CONNECT BY level <= 20 * 1000
)
WHERE random_val <= 19;
-- why is this where clause needed?
SELECT empid, COUNT(*)
FROM emp_attendance
GROUP BY empid
ORDER BY empid;
EMPID   COUNT(*)
-----   --------
    1         20
    2         20
    3         20
    4         18
    5         19
    6         20
    7         20
    8         19
    9         20
   10         20
   11         19
  ...
   50         20

Something like this should get you going.
with
  emps as
  ( select level empid, dbms_random.value(5, 20) children
    from dual
    connect by level <= 20 ),
  empatt as
  ( select e.empid, x.start_date,
           x.start_date + dbms_random.value(0, 0.75) end_date
    from emps e,
         lateral (
           select trunc(sysdate) + dbms_random.value(1, 30) start_date
           from dual
           connect by level <= e.children
         ) x )
select empid, count(*)
from empatt
group by empid
order by 1;
EMPID COUNT(*)
---------- ----------
1 5
2 14
3 17
4 6
5 10
6 18
7 12
8 13
9 16
10 11
11 7
12 14
13 7
14 7
15 7
16 13
17 18
18 9
19 9
20 12

INSERT ALL
WHEN attendid = 1
THEN INTO emp_info (empid, empname) VALUES
(empid, dbms_random.string ( 'U', dbms_random.value (3, 15))
)
WHEN attendid <= attend_cnt
THEN INTO emp_attendance (empid, start_date, end_date) VALUES
(empid, start_date, start_date + dbms_random.value (0, .75))
WITH got_maxid AS
(
SELECT NVL (MAX (empid), 0) AS maxid
FROM emp_info
)
, new_emps AS
(
SELECT maxid + CEIL (LEVEL / 50) AS empid
, MOD (LEVEL, 50) + 1 AS attendid,
CASE
WHEN MOD (LEVEL, 50) = 0
THEN dbms_random.value (5, 50+ 1)
END AS attend_cnt0
FROM got_maxid
CONNECT BY LEVEL <= 2000
)
SELECT n.*,
MIN (attend_cnt0) OVER (PARTITION BY empid) AS attend_cnt,
TRUNC (SYSDATE) + dbms_random.value (5, 30) AS start_date
FROM new_emps n;


Is it possible to use an aggregate function over partition by as a case condition in SQL?

The problem is to calculate the median from a table that has two columns: one specifying a number and the other specifying the frequency of that number.
For example, table "Numbers":
Num  Freq
1    3
2    3
The median needs to be found for the flattened array of values:
1,1,1,2,2,2
Query:
with ct1 as (
  select num, frequency, sum(frequency) over (order by num) as sf
  from numbers o
)
select case when count(num) over (order by num) = 1 then num
            when count(num) over (order by num) > 1 then sum(num)/2
       end median
from ct1 b
where sf <= (select max(sf)/2 from ct1)
   or (sf - frequency) <= (select max(sf)/2 from ct1)
Is it not possible to use count(num) over(order by num) as the condition in the case statement?
Find the relevant row (or two rows) based on the accumulated frequencies, and take the average of num.
The example and Fiddle also show you the computations leading to the result.
If you already know that num is unique, rowid can be removed from the ORDER BY clauses.
with
t1 as
(
select t.*
,nvl(sum(freq) over (order by num,rowid rows between unbounded preceding and 1 preceding),0) as freq_acc_sum_1
,sum(freq) over (order by num, rowid) as freq_acc_sum_2
,sum(freq) over () as freq_sum
from t
)
select t1.*
,case
when freq_sum/2 between freq_acc_sum_1 and freq_acc_sum_2
then 'V'
end as relevant_record
from t1
order by num, rowid
Fiddle
Example:
ID  NUM  FREQ  FREQ_ACC_SUM_1  FREQ_ACC_SUM_2  FREQ_SUM  RELEVANT_RECORD
 7    8     1               0               1        18
 5   10     1               1               2        18
 1   29     3               2               5        18
 6   31     1               5               6        18
 3   33     2               6               8        18
 4   41     1               8               9        18  V
 9   49     2               9              11        18  V
 2   52     1              11              12        18
 8   56     3              12              15        18
10   92     3              15              18        18

MEDIAN
45
Fiddle for 1M records
You can find the one (or two) middle value(s) and then average:
SELECT AVG(num) AS median
FROM (
SELECT num,
freq,
SUM(freq) OVER (ORDER BY num) AS cum_freq,
(SUM(freq) OVER () + 1)/2 AS median_freq
FROM table_name
)
WHERE cum_freq - freq < median_freq
AND median_freq < cum_freq + 1
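As a sanity check (my own arithmetic, not part of the original answer), working through the filter for the sample data confirms it picks exactly the two middle rows:

```latex
\text{median\_freq} = \frac{6 + 1}{2} = 3.5
% num = 1:\; cum\_freq = 3:\quad 3 - 3 = 0 < 3.5 \;\wedge\; 3.5 < 3 + 1 \Rightarrow \text{row kept}
% num = 2:\; cum\_freq = 6:\quad 6 - 3 = 3 < 3.5 \;\wedge\; 3.5 < 6 + 1 \Rightarrow \text{row kept}
\text{AVG}(1, 2) = 1.5
```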
Or, expand the values using a LATERAL join to a hierarchical query and then use the MEDIAN function:
SELECT MEDIAN(num) AS median
FROM table_name t
CROSS JOIN LATERAL (
SELECT LEVEL
FROM DUAL
WHERE freq > 0
CONNECT BY LEVEL <= freq
)
Which, for the sample data:
CREATE TABLE table_name (Num, Freq) AS
SELECT 1, 3 FROM DUAL UNION ALL
SELECT 2, 3 FROM DUAL;
Outputs:
MEDIAN
1.5
(Note: For your sample data, there are 6 items, an even number, so the MEDIAN will be halfway between the values of the 3rd and 4th items; so halfway between 1 and 2 = 1.5.)
db<>fiddle here

Break up running sum into maximum group size / length

I am trying to break up a running (ordered) sum into groups of a max value. When I implement the following example logic...
IF OBJECT_ID(N'tempdb..#t') IS NOT NULL DROP TABLE #t
SELECT TOP (ABS(CHECKSUM(NewId())) % 1000) ROW_NUMBER() OVER (ORDER BY name) AS ID,
LEFT(CAST(NEWID() AS NVARCHAR(100)),ABS(CHECKSUM(NewId())) % 30) AS Description
INTO #t
FROM sys.objects
DECLARE @maxGroupSize INT
SET @maxGroupSize = 100
;WITH t AS (
SELECT
*,
LEN(Description) AS DescriptionLength,
SUM(LEN(Description)) OVER (/*PARTITION BY N/A */ ORDER BY ID) AS [RunningLength],
SUM(LEN(Description)) OVER (/*PARTITION BY N/A */ ORDER BY ID)/@maxGroupSize AS GroupID
FROM #t
)
SELECT *, SUM(DescriptionLength) OVER (PARTITION BY GroupID) AS SumOfGroup
FROM t
ORDER BY GroupID, ID
I am getting groups that are larger than the maximum group size (length) of 100.
A recursive common table expression (rcte) would be one way to resolve this.
Sample data
Limited set of fixed sample data.
create table data
(
id int,
description nvarchar(20)
);
insert into data (id, description) values
( 1, 'qmlsdkjfqmsldk'),
( 2, 'mldskjf'),
( 3, 'qmsdlfkqjsdm'),
( 4, 'fmqlsdkfq'),
( 5, 'qdsfqsdfqq'),
( 6, 'mds'),
( 7, 'qmsldfkqsjdmfqlkj'),
( 8, 'qdmsl'),
( 9, 'mqlskfjqmlkd'),
(10, 'qsdqfdddffd');
Solution
For every recursion step, evaluate (r.group_running_length + len(d.description) <= @group_max_length) in a CASE expression to decide whether the previous group must be extended or a new group must be started.
The group max length is set to 40 here to better fit the sample data.
declare @group_max_length int = 40;
with rcte as
(
select d.id,
d.description,
len(d.description) as description_length,
len(d.description) as running_length,
1 as group_id,
len(d.description) as group_running_length
from data d
where d.id = 1
union all
select d.id,
d.description,
len(d.description),
r.running_length + len(d.description),
case
when r.group_running_length + len(d.description) <= @group_max_length
then r.group_id
else r.group_id + 1
end,
case
when r.group_running_length + len(d.description) <= @group_max_length
then r.group_running_length + len(d.description)
else len(d.description)
end
from rcte r
join data d
on d.id = r.id + 1
)
select r.id,
r.description,
r.description_length,
r.running_length,
r.group_id,
r.group_running_length,
gs.group_sum
from rcte r
cross apply ( select max(r2.group_running_length) as group_sum
from rcte r2
where r2.group_id = r.group_id ) gs -- group sum
order by r.id;
Result
Contains both the running group length as well as the group sum for every row.
id description description_length running_length group_id group_running_length group_sum
-- ---------------- ------------------ -------------- -------- -------------------- ---------
1 qmlsdkjfqmsldk 14 14 1 14 33
2 mldskjf 7 21 1 21 33
3 qmsdlfkqjsdm 12 33 1 33 33
4 fmqlsdkfq 9 42 2 9 39
5 qdsfqsdfqq 10 52 2 19 39
6 mds 3 55 2 22 39
7 qmsldfkqsjdmfqlkj 17 72 2 39 39
8 qdmsl 5 77 3 5 28
9 mqlskfjqmlkd 12 89 3 17 28
10 qsdqfdddffd 11 100 3 28 28
Fiddle to see things in action (includes random data version).

SQL for a generated table with column 1 a sequence of numbers and column 2 a running sum

What would be the SQL (standard, or any major variant) to produce a table like the following?
1 1 -- 1
2 3 -- 2+1
3 6 -- 3+2+1
4 10 -- 4+3+2+1
5 15 -- 5+4+3+2+1
6 21 -- 6+5+4+3+2+1
... ...
The second column is the sum of the numbers in the first.
I couldn't get past this:
select rownum from all_objects where rownum <= 10;
Which produces column 1 (PL/SQL)
I tried to think along the following lines, but it is clearly wrong, even syntactically:
select rownum, count(t2.rownum)
from
(select sum(rownum) from all_objects where rownum <= 10) t2,
all_objects
where rownum <= 10;
It is simple math:
select rownum, rownum * (rownum + 1) / 2
from all_objects
where rownum <= 10;
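The closed form used here is the standard arithmetic-series identity; a sketch of the derivation, for reference:

```latex
\sum_{k=1}^{n} k = \frac{n(n+1)}{2}
% Pair the first and last terms: (1 + n) + (2 + (n-1)) + \dots
% gives n/2 pairs, each summing to n + 1.
% e.g. n = 10: 10 \cdot 11 / 2 = 55
```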
You don't need to hit the all_objects view; you could use a hierarchical query:
select level as position, sum(level) over (order by level) as running_sum
from dual
connect by level <= 10;
POSITION RUNNING_SUM
---------- -----------
1 1
2 3
3 6
4 10
5 15
6 21
7 28
8 36
9 45
10 55
or using #forpas' arithmetic-series method:
select level as position, level * (level + 1) / 2 as running_sum
from dual
connect by level <= 10;
POSITION RUNNING_SUM
---------- -----------
1 1
2 3
3 6
4 10
5 15
6 21
7 28
8 36
9 45
10 55
Or recursive subquery factoring (11gR2+):
with rcte (position, running_sum) as (
select 1, 1 from dual
union all
select position + 1, running_sum + position + 1
from rcte
where position < 10
)
select * from rcte
order by position;
POSITION RUNNING_SUM
---------- -----------
1 1
2 3
3 6
4 10
5 15
6 21
7 28
8 36
9 45
10 55
You are looking for a cumulative sum:
select rownum, sum(rownum) over (order by rownum)
from all_objects
where rownum <= 10;
I wasn't sure this would actually work on rownum, but it does.

skip consecutive rows after specific value

Note: I have a working query, but am looking for optimisations to use it on large tables.
Suppose I have a table like this:
id session_id value
1 5 7
2 5 1
3 5 1
4 5 12
5 5 1
6 5 1
7 5 1
8 6 7
9 6 1
10 6 3
11 6 1
12 7 7
13 8 1
14 8 2
15 8 3
I want the id's of all rows with value 1 with one exception:
skip groups with value 1 that directly follow a value 7 within the same session_id.
Basically I would look for groups of value 1 that directly follow a value 7, limited by the session_id, and ignore those groups. I then show all the remaining value 1 rows.
The desired output showing the id's:
5
6
7
11
13
I took some inspiration from this post and ended up with this code:
declare #req_data table (
id int primary key identity,
session_id int,
value int
)
insert into #req_data(session_id, value) values (5, 7)
insert into #req_data(session_id, value) values (5, 1) -- preceded by value 7 in same session, should be ignored
insert into #req_data(session_id, value) values (5, 1) -- ignore this one too
insert into #req_data(session_id, value) values (5, 12)
insert into #req_data(session_id, value) values (5, 1) -- preceded by value != 7, show this
insert into #req_data(session_id, value) values (5, 1) -- show this too
insert into #req_data(session_id, value) values (5, 1) -- show this too
insert into #req_data(session_id, value) values (6, 7)
insert into #req_data(session_id, value) values (6, 1) -- preceded by value 7 in same session, should be ignored
insert into #req_data(session_id, value) values (6, 3)
insert into #req_data(session_id, value) values (6, 1) -- preceded by value != 7, show this
insert into #req_data(session_id, value) values (7, 7)
insert into #req_data(session_id, value) values (8, 1) -- new session_id, show this
insert into #req_data(session_id, value) values (8, 2)
insert into #req_data(session_id, value) values (8, 3)
select id
from (
select session_id, id, max(skip) over (partition by grp) as 'skip'
from (
select tWithGroups.*,
( row_number() over (partition by session_id order by id) - row_number() over (partition by value order by id) ) as grp
from (
select session_id, id, value,
case
when lag(value) over (partition by session_id order by session_id) = 7
then 1
else 0
end as 'skip'
from #req_data
) as tWithGroups
) as tWithSkipField
where tWithSkipField.value = 1
) as tYetAnotherOutput
where skip != 1
order by id
This gives the desired result, but with 4 select blocks I think it's way too inefficient to use on large tables.
Is there a cleaner, faster way to do this?
The following should work well for this.
WITH
cte_ControlValue AS (
SELECT
rd.id, rd.session_id, rd.value,
ControlValue = ISNULL(CAST(SUBSTRING(MAX(bv.BinVal) OVER (PARTITION BY rd.session_id ORDER BY rd.id), 5, 4) AS INT), 999)
FROM
#req_data rd
CROSS APPLY ( VALUES (CAST(rd.id AS BINARY(4)) + CAST(NULLIF(rd.value, 1) AS BINARY(4))) ) bv (BinVal)
)
SELECT
cv.id, cv.session_id, cv.value
FROM
cte_ControlValue cv
WHERE
cv.value = 1
AND cv.ControlValue <> 7;
Results...
id session_id value
----------- ----------- -----------
5 5 1
6 5 1
7 5 1
11 6 1
13 8 1
Edit: How and why it works...
The basic premise is taken from Itzik Ben-Gan's "The Last non NULL Puzzle".
Essentially, we are relying on 2 different behaviors that most people don't usually think about...
1) NULL + anything = NULL.
2) You can CAST or CONVERT an INT into a fixed length BINARY data type and it will continue to sort as an INT (as opposed to sorting like a text string).
This is easier to see when the intermittent steps are added to the query in the CTE...
SELECT
rd.id, rd.session_id, rd.value,
bv.BinVal,
SmearedBinVal = MAX(bv.BinVal) OVER (PARTITION BY rd.session_id ORDER BY rd.id),
SecondHalfAsINT = CAST(SUBSTRING(MAX(bv.BinVal) OVER (PARTITION BY rd.session_id ORDER BY rd.id), 5, 4) AS INT),
ControlValue = ISNULL(CAST(SUBSTRING(MAX(bv.BinVal) OVER (PARTITION BY rd.session_id ORDER BY rd.id), 5, 4) AS INT), 999)
FROM
#req_data rd
CROSS APPLY ( VALUES (CAST(rd.id AS BINARY(4)) + CAST(NULLIF(rd.value, 1) AS BINARY(4))) ) bv (BinVal)
Results...
id session_id value BinVal SmearedBinVal SecondHalfAsINT ControlValue
----------- ----------- ----------- ------------------ ------------------ --------------- ------------
1 5 7 0x0000000100000007 0x0000000100000007 7 7
2 5 1 NULL 0x0000000100000007 7 7
3 5 1 NULL 0x0000000100000007 7 7
4 5 12 0x000000040000000C 0x000000040000000C 12 12
5 5 1 NULL 0x000000040000000C 12 12
6 5 1 NULL 0x000000040000000C 12 12
7 5 1 NULL 0x000000040000000C 12 12
8 6 7 0x0000000800000007 0x0000000800000007 7 7
9 6 1 NULL 0x0000000800000007 7 7
10 6 3 0x0000000A00000003 0x0000000A00000003 3 3
11 6 1 NULL 0x0000000A00000003 3 3
12 7 7 0x0000000C00000007 0x0000000C00000007 7 7
13 8 1 NULL NULL NULL 999
14 8 2 0x0000000E00000002 0x0000000E00000002 2 2
15 8 3 0x0000000F00000003 0x0000000F00000003 3 3
Looking at the BinVal column, we see an 8-byte hex value for all non-[value] = 1 rows and NULLs where [value] = 1. The first 4 bytes are the id (used for ordering) and the last 4 bytes are [value] (used to carry the previous non-1 value, or to make the whole expression NULL).
The 2nd step is to "smear" the non-NULL values into the NULLs using the window framed MAX function, partitioned by session_id and ordered by id.
The 3rd step is to parse out the last 4 bytes and convert them back to an INT data type (SecondHalfAsINT) and deal with any nulls that result from not having any non-1 preceding value (ControlValue).
Since we can't reference a windowed function in the WHERE clause, we have to throw the query into a CTE (a derived table would work just as well) so that we can use the new ControlValue in the where clause.
SELECT CRow.id
FROM #req_data AS CRow
CROSS APPLY (SELECT MAX(id) AS id FROM #req_data PRev WHERE PRev.Id < CRow.id AND PRev.session_id = CRow.session_id AND PRev.value <> 1 ) MaxPRow
LEFT JOIN #req_data AS PRow ON MaxPRow.id = PRow.id
WHERE CRow.value = 1 AND ISNULL(PRow.value,1) <> 7
You can use the following query:
select id, session_id, value,
coalesce(sum(case when value <> 1 then 1 end)
over (partition by session_id order by id), 0) as grp
from #req_data
to get:
id session_id value grp
----------------------------
1 5 7 1
2 5 1 1
3 5 1 1
4 5 12 2
5 5 1 2
6 5 1 2
7 5 1 2
8 6 7 1
9 6 1 1
10 6 3 2
11 6 1 2
12 7 7 1
13 8 1 0
14 8 2 1
15 8 3 2
So, this query detects islands of consecutive 1 records that belong to the same group, as specified by the first preceding row with value <> 1.
You can use a window function once more to detect all islands that contain a 7. If you wrap this in a second cte, then you can finally get the desired result by filtering out those islands:
;with session_islands as (
select id, session_id, value,
coalesce(sum(case when value <> 1 then 1 end)
over (partition by session_id order by id), 0) as grp
from #req_data
), islands_with_7 as (
select id, grp, value,
count(case when value = 7 then 1 end)
over (partition by session_id, grp) as cnt_7
from session_islands
)
select id
from islands_with_7
where cnt_7 = 0 and value = 1

Assigning random value to each record, not just whole query

I have written a query which needs to randomly assign one of five possible values to each of around 1500 records. I have managed to get it to assign a value randomly, but the value assigned is the same for every record. Is there a way of doing this without using PL/SQL? Please let me know your thoughts. The query is below (database is Oracle 11g):
select
ioi.ioi_mstc
,ioi.ioi_seq2
,max(decode(rn, (select round(dbms_random.value(1,5)) num from intuit.srs_ioi where rownum < 2), uddc))
from
intuit.srs_ioi ioi
,intuit.srs_cap cap
,(select
sub.udd_code uddc
,row_number() over(partition by sub.udd_udvc order by rownum) rn
from
(select * from
intuit.men_udd udd
where
udd.udd_udvc = 'PA_REJ_REAS'
order by dbms_random.value) sub
where rownum <= 5) rejReas
where
ioi.ioi_stuc = cap.cap_stuc
and ioi.ioi_iodc = 'PAPERLESS'
and cap.cap_ayrc = '2013/4'
and cap.cap_idrc like '%R%'
group by ioi.ioi_mstc
,ioi.ioi_seq2
This is due to sub-query caching. Consider the following query and the values returned:
with numbers as (
  select level as lvl
  from dual
  connect by level <= 10
)
select lvl
     , ( select dbms_random.value(1,5)
         from dual ) as sq
     , dbms_random.value(1,5) as nsq
from numbers;
LVL SQ NSQ
---------- ---------- ----------
1 2.56973281 2.86381746
2 2.56973281 3.54313541
3 2.56973281 1.71969631
4 2.56973281 3.71918833
5 2.56973281 3.10287264
6 2.56973281 3.9887797
7 2.56973281 2.6800834
8 2.56973281 3.57760938
9 2.56973281 2.47035426
10 2.56973281 3.77448435
10 rows selected.
The value is being cached by the sub-query; simply remove it.
select ioi.ioi_mstc
, ioi.ioi_seq2
, max(decode(rn, round(dbms_random.value(1,5)) , uddc))
from ...
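For completeness, here is a sketch of the full corrected query, assembled from the original query in the question with only the cached scalar sub-query replaced by the inline DBMS_RANDOM call. I can't run it against the intuit schema, so treat it as untested:

```sql
select
    ioi.ioi_mstc
   ,ioi.ioi_seq2
   -- call DBMS_RANDOM per output row instead of via a
   -- cached scalar sub-query
   ,max(decode(rn, round(dbms_random.value(1, 5)), uddc))
from
    intuit.srs_ioi ioi
   ,intuit.srs_cap cap
   ,(select
         sub.udd_code uddc
        ,row_number() over (partition by sub.udd_udvc order by rownum) rn
     from
        (select *
         from intuit.men_udd udd
         where udd.udd_udvc = 'PA_REJ_REAS'
         order by dbms_random.value) sub
     where rownum <= 5) rejReas
where
    ioi.ioi_stuc = cap.cap_stuc
and ioi.ioi_iodc = 'PAPERLESS'
and cap.cap_ayrc = '2013/4'
and cap.cap_idrc like '%R%'
group by ioi.ioi_mstc
        ,ioi.ioi_seq2;
```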