Oracle SQL - Efficient Join N Rows Per Match

The basic idea is to join two tables, let's call them MYTABLE1 and MYTABLE2, on a field JOINKEY. There will be many matches per JOINKEY (one row from MYTABLE1 corresponding to many rows in MYTABLE2; for the purposes of testing, MYTABLE1 has 50 rows), but we want to select only up to N matches per JOINKEY value. I have seen lots of inefficient solutions, for example:
select t1.*, t2.*
from MYTABLE1 t1 inner join
     (select MYTABLE2.*,
             row_number() over (partition by MYTABLE2.JOINKEY order by 1) as seqnum
      from MYTABLE2) t2
     on t1.joinkey = t2.joinkey and seqnum <= 2;
which takes over 5 minutes for me to run and returns fewer than 100 results, whereas something like
select t1.*, t2.*
from MYTABLE1 t1 inner join MYTABLE2 t2 on t1.JOINKEY = t2.JOINKEY
where rownum <= 100;
returns 100 results in ~60 milliseconds.
(To be sure of the validity of these results, I selected a different test table and performed the second query above on a single specific JOINKEY until I got a result set with fewer than 100 results, meaning it did in fact search through all of MYTABLE2. The total query time was 30 milliseconds. Afterwards, I started the original query, but this time getting 50 joins per row of MYTABLE1, which again took over 5 minutes to complete.)
How can I approach this in a not-so-terribly-inefficient manner?
It seems so simple: all we need to do is go through the rows of MYTABLE1, matching the JOINKEY field to rows of MYTABLE2, and move on to the next row of MYTABLE1 once we have matched the desired number for that row.
In the worst-case scenario for my second example, we should have to spend 30 ms searching through the full MYTABLE2 per row of MYTABLE1, of which there are 50, for a total execution time of 1.5 seconds.
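Intuitively, that per-row probe is exactly what a lateral join expresses. As a sketch of the shape I mean (assuming Oracle 12c+, untested):
SELECT t1.*, t2.*
FROM MYTABLE1 t1
CROSS APPLY (
    SELECT *
    FROM MYTABLE2 m
    WHERE m.JOINKEY = t1.JOINKEY
    FETCH FIRST 2 ROWS ONLY
) t2;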

I wouldn't call the below approach efficient by any means, and it cheats a little and has some clunkiness, but it comes in under the 1500 ms limit you provided, so I'll add it as something to consider.
This example cheats in that it compiles a TYPE so that the result of an anonymous function can be queried with TABLE().
This approach just iteratively probes MYTABLE2 with each JOINKEY from MYTABLE1 using an anonymous subquery-factoring-clause function and accumulates the results as it goes.
I don't know the real structure of the tables involved, so this example pretends MYTABLE2 has one additional CHAR attribute called OTHER_DATA that is the target of the SELECT.
First, setup the test tables:
CREATE TABLE MYTABLE1 (
  JOINKEY NUMBER NOT NULL
);

CREATE TABLE MYTABLE2 (
  JOINKEY NUMBER NOT NULL,
  OTHER_DATA CHAR(1) NOT NULL
);

CREATE INDEX MYTABLE2_I
  ON MYTABLE2 (JOINKEY);
Then add the test data: 50 rows to MYTABLE1 and 100M rows to MYTABLE2:
INSERT INTO MYTABLE1
SELECT LEVEL
FROM DUAL
CONNECT BY LEVEL < 51;

BEGIN
  <<COMMIT_LOOP>>
  FOR OUTER_POINTER IN 1..4000 LOOP
    <<DATA_LOOP>>
    FOR POINTER IN 1..10 LOOP
      INSERT INTO MYTABLE2
      SELECT JOINKEY, OTHER_DATA
      FROM (SELECT LEVEL AS JOINKEY FROM DUAL CONNECT BY LEVEL < 51)
      CROSS JOIN
           (SELECT CHR(64 + LEVEL) AS OTHER_DATA FROM DUAL CONNECT BY LEVEL < 51);
    END LOOP DATA_LOOP;
    COMMIT;
  END LOOP COMMIT_LOOP;
END;
/
Then gather stats...
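For example, using the standard DBMS_STATS API:
BEGIN
  DBMS_STATS.GATHER_TABLE_STATS(ownname => USER, tabname => 'MYTABLE1');
  DBMS_STATS.GATHER_TABLE_STATS(ownname => USER, tabname => 'MYTABLE2');
END;
/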
Verify the table counts:
SELECT COUNT(*) FROM MYTABLE1;
50
SELECT COUNT(*) FROM MYTABLE2;
100000000
Then create a TYPE that includes the desired data:
CREATE OR REPLACE TYPE JOINKEY_OTHER_DATA IS OBJECT (JOINKEY1 NUMBER, OTHER_DATA CHAR(1));
/
CREATE OR REPLACE TYPE JOINKEY_OTHER_DATA_LIST IS TABLE OF JOINKEY_OTHER_DATA;
/
And then run a query that uses an anonymous subquery-factoring-clause function that caps the number of rows returned per JOINKEY. In this first example, it fetches two MYTABLE2 rows per JOINKEY:
SELECT SYSTIMESTAMP FROM DUAL;
WITH FUNCTION FETCH_N_ROWS (P_MATCHES_LIMIT IN NUMBER)
  RETURN JOINKEY_OTHER_DATA_LIST
AS
  V_JOINKEY_OTHER_DATAS JOINKEY_OTHER_DATA_LIST;
BEGIN
  V_JOINKEY_OTHER_DATAS := JOINKEY_OTHER_DATA_LIST();
  FOR JOINKEY_POINTER IN (SELECT MYTABLE1.JOINKEY FROM MYTABLE1)
  LOOP
    DECLARE
      V_MYTABLE2_JOINKEYS JOINKEY_OTHER_DATA_LIST;
    BEGIN
      SELECT JOINKEY_OTHER_DATA(MYTABLE2.JOINKEY, MYTABLE2.OTHER_DATA)
      BULK COLLECT INTO V_MYTABLE2_JOINKEYS
      FROM MYTABLE2
      WHERE MYTABLE2.JOINKEY = JOINKEY_POINTER.JOINKEY
      FETCH FIRST P_MATCHES_LIMIT ROWS ONLY;
      V_JOINKEY_OTHER_DATAS := V_JOINKEY_OTHER_DATAS MULTISET UNION ALL V_MYTABLE2_JOINKEYS;
    END;
  END LOOP;
  RETURN V_JOINKEY_OTHER_DATAS;
END;
SELECT *
FROM TABLE (FETCH_N_ROWS(2));
/
SELECT SYSTIMESTAMP FROM DUAL;
Results in:
SYSTIMESTAMP
18-APR-17 03.32.10.623056000 PM -06:00
JOINKEY1 OTHER_DATA
1 A
1 B
2 A
2 B
3 A
3 B
...
49 A
49 B
50 A
50 B
100 rows selected.
SYSTIMESTAMP
18-APR-17 03.32.11.014554000 PM -06:00
By changing the number passed to FETCH_N_ROWS, different data volumes can be fetched with fairly consistent performance.
...
SELECT * FROM TABLE (FETCH_N_ROWS(13));
Returns:
...
50 K
50 L
50 M
650 rows selected.

You cannot compare the two queries. The second query simply returns whatever rows come first. The first has to go through all the data before returning anything. A more apt comparison would use:
select . . .
from (select t1.*, t2.*
      from MYTABLE1 t1 inner join
           MYTABLE2 t2
           on t1.JOINKEY = t2.JOINKEY
      order by t1.JOINKEY
     ) t1
where rownum <= 100;
This has to read all the data before returning anything, so it is more analogous to using row_number().
But, start with this query:
select t1.*, t2.*
from MYTABLE1 t1 inner join
     (select t2.*,
             row_number() over (partition by t2.JOINKEY order by 1) as seqnum
      from MYTABLE2 t2
     ) t2
     on t1.joinkey = t2.joinkey and seqnum <= 2;
For this query, you want an index on MYTABLE2(JOINKEY). If the ORDER BY has another key, that should be in the index as well.
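For example (CREATED_AT here is a hypothetical stand-in for whatever the real ORDER BY key would be):
CREATE INDEX MYTABLE2_JOINKEY_I ON MYTABLE2 (JOINKEY);
-- with a real sort key, cover both the partition key and the sort key:
CREATE INDEX MYTABLE2_JOINKEY_SORT_I ON MYTABLE2 (JOINKEY, CREATED_AT);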

Related

Randomly choosing a row

I have table s1, which has 3 rows. How can I RANDOMLY pick a row from s1 and INSERT its corresponding value into d1?
I don't want a hard-coded solution. Can ROWNUM be used together with dbms_random? Let's say I want 10 rows in d1.
An example would be appreciated.
Create table s1(
val NUMBER(4)
);
INSERT into s1
(val) VALUES (30);
INSERT into s1
(val) VALUES (40);
INSERT into s1
(val) VALUES (50);
Create table d1(
val NUMBER(4)
);
You can sort by a random value and select one row:
insert into d1 (val)
select val
from (select s1.*
from s1
order by dbms_random.value
) s1
where rownum = 1;
In Oracle 12C+, you don't need the subquery:
insert into d1 (val)
select val
from s1
order by dbms_random.value
fetch first 1 row only;
Note: This assumes that you really mean random and not arbitrary. A random row means that any row in the table has an equal chance of being chosen in any given invocation of the query.
For huge tables, the standard way of sorting by dbms_random.value is not efficient because you need to scan the whole table, and dbms_random.value is a pretty slow function that requires context switches. For such cases, there are 2 well-known methods:
Use sample clause:
https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlrf/SELECT.html#GUID-CFA006CA-6FF1-4972-821E-6996142A51C6
for example:
select *
from s1 sample block(1)
order by dbms_random.value
fetch first 1 rows only
i.e. get 1% of all blocks, sort them randomly, and return just 1 row.
If you have an index/primary key on a column with uniformly distributed values, you can get the min and max values, generate a random value in this range, and take the first row with a value greater than or equal to that randomly generated value.
Example:
-- big table with 1 million rows, with primary key on ID, uniformly distributed:
Create table s1(id primary key,padding) as
select level, rpad('x',100,'x')
from dual
connect by level<=1e6;
select *
from s1
where id >= (select dbms_random.value(
                        (select min(id) from s1),
                        (select max(id) from s1))
             from dual)
order by id
fetch first 1 rows only;
Update
And a third variant: get a random table block, generate a ROWID, and fetch the row from the table by that ROWID:
select *
from s1
where rowid = (
select
DBMS_ROWID.ROWID_CREATE (
1,
objd,
file#,
block#,
1)
from
(
select /*+ rule */ file#, block#, objd
from v$bh b
where b.objd in (select o.data_object_id from user_objects o where object_name='S1' /* table_name */)
order by dbms_random.value
fetch first 1 rows only
)
);

Oracle SQL - How can I write an insert statement that is conditional and looped?

Context:
I have two tables: markettypewagerlimitgroups (mtwlg) and stakedistributionindicators (sdi). When a mtwlg is created, 2 rows are created in the sdi table which are linked to the mtwlg - each row with the same values bar two: the id, and another field (let's call it column X) which must contain 0 for one row and 1 for the other.
There was a bug in our codebase which prevented this happening automatically, so any mtwlgs created while that bug was present do not have the related sdi rows, causing NPEs in various places.
To fix this, a patch needs to be written to loop through the mtwlg table and, for each ID, search the sdi table for the 2 related rows. If both rows are present, do nothing; if there is only 1 row, check whether X is 0 or 1, and insert a row with the other value; if neither row is present, insert them both. This needs to be done for every mtwlg, and a unique ID needs to be inserted too.
Pseudocode:
For each market type wager limit group ID
Check if there are 2 rows with that id in the stake distributions table, 1 where column X = 0 and one where column X = 1
if none
create 2 rows in the stake distributions table with unique id's; 1 for each X value
if one
create the missing row in the stake distributions table with a unique id
if 2
do nothing
If it helps at all - the patch will be applied using liquibase.
Anyone with any advice or thoughts as to if and how this will be possible to write in SQL/a liquibase patch?
Thanks in advance, let me know of any other information you need.
EDIT:
I've actually just been advised to do this using PL/SQL, do you have any thoughts/suggestions in regards to this?
Thanks again.
Oooooh, an excellent job for MERGE.
Here's your pseudo code again:
For each market type wager limit group ID
Check if there are 2 rows with that id in the stake distributions table,
1 where column X = 0 and one where column X = 1
if none
create 2 rows in the stake distributions table with unique id's;
1 for each X value
if one
create the missing row in the stake distributions table with a unique id
if 2
do nothing
Here's the MERGE variant (still pseudo-code'ish as I don't know how your data really looks):
MERGE INTO stake_distributions d
USING (
SELECT limit_group_id, 0 AS x
FROM market_type_wagers
UNION ALL
SELECT limit_group_id, 1 AS x
FROM market_type_wagers
) t
ON (
d.limit_group_id = t.limit_group_id AND d.x = t.x
)
WHEN NOT MATCHED THEN INSERT (d.limit_group_id, d.x)
VALUES (t.limit_group_id, t.x);
No loops, no PL/SQL, no conditional statements, just plain beautiful SQL.
Nice alternative suggested by Boneist in the comments uses a CROSS JOIN rather than UNION ALL in the USING clause, which is likely to perform better (unverified):
MERGE INTO stake_distributions d
USING (
SELECT w.limit_group_id, x.x
FROM market_type_wagers w
CROSS JOIN (
SELECT 0 AS x FROM DUAL
UNION ALL
SELECT 1 AS x FROM DUAL
) x
) t
ON (
d.limit_group_id = t.limit_group_id AND d.x = t.x
)
WHEN NOT MATCHED THEN INSERT (d.limit_group_id, d.x)
VALUES (t.limit_group_id, t.x);
Answer: you don't. There is absolutely no need to loop through anything - you can do it in a single insert. All you need to do is identify the rows that are missing, and then you just need to add them in.
Here is an example:
drop table t1;
drop table t2;
drop sequence t2_seq;
create table t1 (cola number,
colb number,
colc number);
create table t2 (id number,
cola number,
colb number,
colc number,
colx number);
create sequence t2_seq
START WITH 1
INCREMENT BY 1
MAXVALUE 99999999
MINVALUE 1
NOCYCLE
CACHE 20
NOORDER;
insert into t1 values (1, 10, 100);
insert into t2 values (t2_seq.nextval, 1, 10, 100, 0);
insert into t2 values (t2_seq.nextval, 1, 10, 100, 1);
insert into t1 values (2, 20, 200);
insert into t2 values (t2_seq.nextval, 2, 20, 200, 0);
insert into t1 values (3, 30, 300);
insert into t2 values (t2_seq.nextval, 3, 30, 300, 1);
insert into t1 values (4, 40, 400);
commit;
insert into t2 (id, cola, colb, colc, colx)
with dummy as (select 1 id from dual union all
select 0 id from dual)
select t2_seq.nextval,
t1.cola,
t1.colb,
t1.colc,
d.id
from t1
cross join dummy d
left outer join t2 on (t2.cola = t1.cola and d.id = t2.colx)
where t2.id is null;
commit;
select * from t2
order by t2.cola;
ID COLA COLB COLC COLX
---------- ---------- ---------- ---------- ----------
1 1 10 100 0
2 1 10 100 1
3 2 20 200 0
5 2 20 200 1
7 3 30 300 0
4 3 30 300 1
6 4 40 400 0
8 4 40 400 1
If the processing logic is too gnarly to be encapsulated in a single SQL statement, you may need to resort to cursor for loops and row types - basically allows you to do things like the following:
DECLARE
r_mtwlg markettypewagerlimitgroups%ROWTYPE;
BEGIN
FOR r_mtwlg IN (
SELECT mtwlg.*
FROM markettypewagerlimitgroups mtwlg
)
LOOP
-- do stuff here
-- refer to elements of the current row like this
DBMS_OUTPUT.PUT_LINE(r_mtwlg.id);
END LOOP;
END;
/
You can obviously nest another loop inside this one that hits the stakedistributionindicators table, but I'll leave that as an exercise for you. You could also left join to stakedistributionindicators a couple of times in this first cursor so that you only return rows that don't already have an x=1 and x=0, again you can probably work that bit out for yourself.
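For illustration, a sketch of that cursor query with the two left joins (the sdi column names id, mtwlg_id and x are assumptions based on the question):
SELECT mtwlg.*
FROM markettypewagerlimitgroups mtwlg
LEFT JOIN stakedistributionindicators sdi0
       ON sdi0.mtwlg_id = mtwlg.id AND sdi0.x = 0
LEFT JOIN stakedistributionindicators sdi1
       ON sdi1.mtwlg_id = mtwlg.id AND sdi1.x = 1
WHERE sdi0.id IS NULL
   OR sdi1.id IS NULL;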
If you would rather write your logic in Java than in PL/SQL, Liquibase allows you to create custom changes. The custom change points to a Java class you write that can do whatever logic you need. A simple example can be found here.

create a table of duplicated rows of another table using the select statement

I have a table with one column containing different integers.
For each integer in the table, I would like to duplicate it as many times as its number of digits -
For example:
12345 (5 digits):
1. 12345
2. 12345
3. 12345
4. 12345
5. 12345
I thought of doing it using with recursion t (...) as () but I didn't manage, since I don't really understand how it works and what is happening "behind the scenes".
I don't want to use insert because I want it to be scalable and automatic for as many integers as needed in a table.
Any thoughts and an explanation would be great.
The easiest way is to join to a table with numbers from 1 to n in it.
SELECT n, x
FROM yourtable
JOIN
(
SELECT day_of_calendar AS n
FROM sys_calendar.CALENDAR
WHERE n BETWEEN 1 AND 12 -- maximum number of digits
) AS dt
ON n <= CHAR_LENGTH(TRIM(ABS(x)))
In my example I abused Teradata's built-in calendar, but that's not a good choice: the optimizer doesn't know how many rows will be returned, and since the plan must be a Product Join, it might decide to do something stupid. So better to use a numbers table...
Create a numbers table that will contain the integers from 1 to the maximum number of digits that the numbers in your table will have (I went with 6):
create table numbers(num int)
insert numbers
select 1 union select 2 union select 3 union select 4 union select 5 union select 6
You already have your table (but here's what I was using to test):
create table your_table(num int)
insert your_table
select 12345 union select 678
Here's the query to get your results:
select ROW_NUMBER() over (partition by b.num order by b.num) row_num,
       b.num,
       LEN(cast(b.num as char)) num_digits
into #temp
from your_table b
cross join numbers n

select t.num
from #temp t
where t.row_num <= t.num_digits
I found a nice way to perform this action. Here goes:
with recursive t (num,num_as_char,char_n)
as
(
select num
,cast (num as varchar (100)) as num_as_char
,substr (num_as_char,1,1)
from numbers
union all
select num
,substr (t.num_as_char,2) as num_as_char2
,substr (num_as_char2,1,1)
from t
where char_length (num_as_char2) > 0
)
select *
from t
order by num,char_length (num_as_char) desc
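For comparison, in Oracle the same fan-out can be written without recursion, using a row generator correlated in a lateral view (a sketch, assuming Oracle 12c+ and the your_table/num names used above):
SELECT t.num
FROM your_table t
CROSS JOIN LATERAL (
    SELECT LEVEL AS n
    FROM dual
    CONNECT BY LEVEL <= LENGTH(TO_CHAR(ABS(t.num)))  -- one row per digit
) d;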

Returning rows that had no matches

I've read and read and read but I haven't found a solution to my problem.
I'm doing something like:
SELECT a
FROM t1
WHERE t1.b IN (<external list of values>)
There are other conditions of course, but this is the gist of it.
My question is: is there a way to show which values in the manually entered list didn't find a match? I've looked, but I can't find an answer and I'm going in circles.
Create a temp table with the external list of values, then you can do:
select item
from tmptable t
where t.item not in ( select b from t1 )
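A minimal way to set up that temp table (assuming Oracle; a global temporary table keeps the list private to the session, and the tmptable/item names match the query above):
CREATE GLOBAL TEMPORARY TABLE tmptable (item NUMBER)
  ON COMMIT PRESERVE ROWS;

-- sample values; load your real list here
INSERT INTO tmptable (item) VALUES (1);
INSERT INTO tmptable (item) VALUES (15);
INSERT INTO tmptable (item) VALUES (50);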
If the list is short enough, you can do something like:
with t as (
  select case when t1.b = 'FIRSTITEM' then 1 else 0 end firstfound,
         case when t1.b = '2NDITEM' then 1 else 0 end secondfound,
         case when t1.b = '3RDITEM' then 1 else 0 end thirdfound
         ...
  from t1 where t1.b in (<list of values>)
)
select sum(firstfound), sum(secondfound), sum(thirdfound), ...
from t
But with proper rights, I would use Nicholas' answer.
To display which values in the list haven't found a match, as one of the approaches, you could create a nested table SQL data type (a schema object):
-- assuming that the values in the list
-- are of number datatype
create type T_NumList as table of number;
and use it as follows:
-- sample of data: generates numbers from 1 to 11
with t1(col) as (
  select level
  from dual
  connect by level <= 11
)
select s.column_value as without_match
from table(t_NumList(1, 2, 15, 50, 23)) s -- here goes your list of values
left join t1 t
  on (s.column_value = t.col)
where t.col is null;
Result:
WITHOUT_MATCH
-------------
15
50
23
There is no easy way to convert an externally provided list into a table that can be used to do the comparison. One way is to use one of the (undocumented) system types to generate a table on the fly based on the values supplied:
with value_list (id) as (
select column_value
from table(sys.odcinumberlist (1, 2, 3)) -- this is the list of values
)
select l.id as missing_id
from value_list l
left join t1 on t1.id = l.id
where t1.id is null;
There are ways to get what you have described, but they have requirements which exceed the statement of the problem. From the minimal description provided, there's no way to have the SQL return the list of the manually-entered values that did not match.
For example, if it's possible to insert the manually-entered values into a separate table - let's call it matchtbl, with the column named b - then the following should do the job:
SELECT matchtbl.b
FROM matchtbl
WHERE matchtbl.b NOT IN (SELECT distinct b
FROM t1)
Of course, if the data is being processed by a programming language, it should be relatively easy to keep track of the set of values returned by the original query, by adding the b column to the output, and then perform the set difference.
Putting the list in an in clause makes this hard. If you can put the list in a table, then the following works:
with list as (
select val1 as value from dual union all
select val2 from dual union all
. . .
select valn
)
select list.value, count(t1.b)
from list left outer join
t1
on t1.b = list.value
group by list.value;

Most efficient way of selecting the changes between timestamped snapshots

I have a table that holds data about items that existed at a certain time - regular snapshots taken.
Simple example:
Timestamp ID
1 A
1 B
2 A
2 B
2 C
3 A
3 D
4 D
4 E
In this case, Item C gets created sometime between snapshots 1 and 2; sometime between snapshots 2 and 3, B and C disappear and D gets created; and so on.
The table is reasonably large (millions of records) and for each timestamp there are about 50 records.
What's the most efficient way of selecting the item IDs for items that disappear between two consecutive timestamps?
So for the above example ...
Between 1 and 2: NULL
Between 2 and 3: B, C
Between 3 and 4: A
If it doesn't make the query inefficient, can it be extended to automatically use the latest (i.e. MAX) timestamp and the previous one?
Another way to view this is that you want to find records that exist in timestamp #1 that do not exist in timestamp #2. The easiest way?
SELECT t1.ID, t1.Timestamp
FROM records AS t1
WHERE NOT EXISTS (SELECT 1 FROM records AS t2
                  WHERE t2.ID = t1.ID AND t2.Timestamp = t1.Timestamp + 1)
Of course, I'm exploiting here the fact that your example timestamps are integers, when in reality I imagine they are genuine timestamps. But it turns out the integers work so well for this particular purpose, they'd be really handy to have around. So, perhaps we should make a numbered list of all available timestamps. The easiest way to get that?
CREATE TEMPORARY TABLE timestamp_map (
  timestamp_id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
  timestamp_value DATETIME
);
INSERT INTO timestamp_map (timestamp_value) (SELECT DISTINCT timestamp FROM records ORDER BY timestamp);
(You could also maintain such a table permanently by use of triggers.)
It's a bit out there, but I've gotten similar techniques to work very efficiently in the past for data like what you describe, when lots of back-and-forth subqueries and NOT EXISTS proved too slow.
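As a sketch of how the map can be used once it exists (records and its columns as in the query above): join consecutive timestamp_ids, then anti-join on id to list what disappeared between each pair of snapshots:
SELECT m1.timestamp_value AS from_ts,
       m2.timestamp_value AS to_ts,
       r1.id
FROM timestamp_map m1
JOIN timestamp_map m2 ON m2.timestamp_id = m1.timestamp_id + 1
JOIN records r1 ON r1.timestamp = m1.timestamp_value
LEFT JOIN records r2
       ON r2.timestamp = m2.timestamp_value
      AND r2.id = r1.id
WHERE r2.id IS NULL;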
Update:
See this entry in my blog for performance details:
MySQL: difference between sets
SELECT ts,
(
SELECT GROUP_CONCAT(id)
FROM mytable mi
WHERE mi.ts =
(
SELECT MAX(ts)
FROM mytable mp
WHERE mp.ts = mo.pts
)
AND NOT EXISTS
(
SELECT NULL
FROM mytable mn
WHERE mn.ts = mo.ts
AND mn.id = mi.id
)
)
FROM (
SELECT @r AS pts,
@r := ts AS ts
FROM (
SELECT @r := NULL
) vars,
(
SELECT DISTINCT ts
FROM mytable
) moo
) mo
To select only the last change:
SELECT ts,
(
SELECT GROUP_CONCAT(id)
FROM mytable mi
WHERE mi.ts =
(
SELECT MAX(ts)
FROM mytable mp
WHERE mp.ts < mo.ts
)
AND NOT EXISTS
(
SELECT NULL
FROM mytable mn
WHERE mn.ts = mo.ts
AND mn.id = mi.id
)
)
FROM (
SELECT MAX(ts) AS ts
FROM mytable
) mo
For this to be efficient, you need a composite index on mytable (ts, id) (in this order).
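That is:
CREATE INDEX ix_mytable_ts_id ON mytable (ts, id);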