SQL join where value in second table is first lower value w.r.t. the first table

Let's say I have two tables, and both of them have a column that contains timestamps for various events. The timestamp values in the two tables are different, as they are for different events.
I want to join the two tables such that every record in table1 is joined with the first lower timestamp in table2.
For example:
Table1    Table2
142.13    141.16
157.34    145.45
168.45    155.85
170.23    166.76
          168.44
The joined table should be:
142.13,141.16
157.34,155.85
168.45,166.76
170.23,168.44
I am using Apache Spark SQL.
I am a noob in SQL and this doesn't look like a job for a noob :). Thanks.

Try this:
with t1 as (
select 142.13 v from dual union all
select 157.34 v from dual union all
select 168.45 v from dual union all
select 170.23 v from dual
),
t2 as (
select 141.16 v from dual union all
select 145.45 v from dual union all
select 155.85 v from dual union all
select 166.76 v from dual union all
select 168.44 v from dual
)
select v, ( select max(v) from t2 where t2.v <= t1.v )
from t1;
V          (SELECT MAX(V) FROM T2 WHERE T2.V <= T1.V)
---------- ------------------------------------------
142.13     141.16
157.34     155.85
168.45     168.44
170.23     168.44

4 rows selected.
the WITH clause is just me faking the data ...
the simplified query is just:
select t1.v, ( select max(t2.v) from table2 t2 where t2.v <= t1.v ) from table1 t1
[edit]
admittedly, I'm not familiar with Spark .. but this is simple enough SQL .. I'm assuming it works :)
[/edit]

Ditto has shown the straightforward way to solve this. If Apache Spark really has problems with this very basic query, then join first (which can lead to a big intermediate result) and aggregate afterwards:
select t1.v, max(t2.v)
from table1 t1
join table2 t2 on t2.v <= t1.v
group by t1.v
order by t1.v;

If you are using Apache Spark SQL, you can join these two tables as DataFrames after adding a column with monotonically_increasing_id():
import org.apache.spark.sql.functions.monotonically_increasing_id
import spark.implicits._

val t1 = spark.sparkContext.parallelize(Seq(142.13, 157.34, 168.45, 170.23)).toDF("c1")
val t2 = spark.sparkContext.parallelize(Seq(141.16, 145.45, 155.85, 166.76, 168.44)).toDF("c2")
val t11 = t1.withColumn("id", monotonically_increasing_id())
val t22 = t2.withColumn("id", monotonically_increasing_id())
val res = t11.join(t22, t11("id") + 1 === t22("id")).drop("id")
Output:
+------+------+
| c1| c2|
+------+------+
|142.13|145.45|
|168.45|166.76|
|157.34|155.85|
|170.23|168.44|
+------+------+
Hope this helps

Related

BigQuery recursively join based on links between 2 ID columns

Given a table representing a many-to-many join between IDs like the following:
WITH t AS (
SELECT 1 AS id_1, 'a' AS id_2,
UNION ALL SELECT 2, 'a'
UNION ALL SELECT 2, 'b'
UNION ALL SELECT 3, 'b'
UNION ALL SELECT 4, 'c'
UNION ALL SELECT 5, 'c'
UNION ALL SELECT 6, 'd'
UNION ALL SELECT 6, 'e'
UNION ALL SELECT 7, 'f'
)
SELECT * FROM t
| id_1 | id_2 |
|------|------|
| 1    | a    |
| 2    | a    |
| 2    | b    |
| 3    | b    |
| 4    | c    |
| 5    | c    |
| 6    | d    |
| 6    | e    |
| 7    | f    |
I would like to be able to recursively join and then aggregate rows in order to find each disconnected sub-graph represented by these links, that is, each collection of IDs that are linked together.
The desired output for the example above would look something like this:
| id_1_coll | id_2_coll |
|-----------|-----------|
| 1, 2, 3   | a, b      |
| 4, 5      | c         |
| 6         | d, e      |
| 7         | f         |
where each row contains all the other IDs one could reach following the links in the table.
Note that 1 links to b even though there is no explicit link row, because we can follow the path 1 --> a --> 2 --> b using the links in the first 3 rows.
One potential approach is to remodel the relationship between id_1 and id_2 so that we get all the links from id_1 to itself, then use a recursive common table expression to traverse all the possible paths between id_1 values, and finally aggregate (somewhat arbitrarily) to the lowest such value that can be reached from each id_1.
Explanation
Our steps are:
1. Remodel the relationship into a series of self-joins for id_1.
2. Map each id_1 to the lowest id_1 that it is linked to via a recursive CTE.
3. Aggregate the recursive CTE using the lowest id_1s as the GROUP BY column, grabbing all the linked id_1 and id_2 values via the ARRAY_AGG() function.
We can use something like this to remodel the relationships into a self join (1.):
SELECT
a.id_1, a.id_2, b.id_1 AS linked_id
FROM t as a
INNER JOIN t as b
ON a.id_2 = b.id_2
WHERE a.id_1 != b.id_1
Next - to set up the recursive table expression (2.) we can tweak the query above to also give us the lowest (LEAST) of the values for id_1 at each link then use this as the base iteration:
WITH RECURSIVE base_iter AS (
SELECT
a.id_1, b.id_1 AS linked_id, LEAST(a.id_1, b.id_1) AS lowest_linked_id
FROM t as a
INNER JOIN t as b
ON a.id_2 = b.id_2
WHERE a.id_1 != b.id_1
)
We can also grab the lowest id_1 value at this time:
| id_1 | linked_id | lowest_linked_id |
|------|-----------|------------------|
| 1    | 2         | 1                |
| 2    | 1         | 1                |
| 2    | 3         | 2                |
| 3    | 2         | 2                |
| 4    | 5         | 4                |
| 5    | 4         | 4                |
For our recursive loop, we want to maintain an ARRAY of linked ids and join each new iteration such that the id_1 value of the (n+1)th iteration is equal to the linked_id value of the nth iteration AND the nth linked_id value is not in the array of previously linked ids.
We can code this as follows:
recursive_loop AS (
SELECT id_1, linked_id, lowest_linked_id, [linked_id] AS linked_ids
FROM base_iter
UNION ALL
SELECT
prev_iter.id_1, prev_iter.linked_id,
iter.lowest_linked_id,
ARRAY_CONCAT(iter.linked_ids, [prev_iter.linked_id])
FROM base_iter AS prev_iter
JOIN recursive_loop AS iter
ON iter.id_1 = prev_iter.linked_id
AND iter.lowest_linked_id < prev_iter.lowest_linked_id
AND prev_iter.linked_id NOT IN UNNEST(iter.linked_ids)
)
Giving us the following results:
| id_1 | linked_id | lowest_linked_id | linked_ids |
|------|-----------|------------------|------------|
| 3    | 2         | 1                | [1,2]      |
| 2    | 3         | 1                | [1,2,3]    |
| 4    | 5         | 4                | [5]        |
| 1    | 2         | 1                | [2]        |
| 5    | 4         | 4                | [4]        |
| 2    | 3         | 2                | [3]        |
| 2    | 1         | 1                | [1]        |
| 3    | 2         | 2                | [2]        |
which we can now link back to the original table for the id_2 values and then aggregate (3.), as shown in the complete query below.
Solution
WITH RECURSIVE t AS (
SELECT 1 AS id_1, 'a' AS id_2,
UNION ALL SELECT 2, 'a'
UNION ALL SELECT 2, 'b'
UNION ALL SELECT 3, 'b'
UNION ALL SELECT 4, 'c'
UNION ALL SELECT 5, 'c'
UNION ALL SELECT 6, 'd'
UNION ALL SELECT 6, 'e'
UNION ALL SELECT 7, 'f'
),
base_iter AS (
SELECT
a.id_1, b.id_1 AS linked_id, LEAST(a.id_1, b.id_1) AS lowest_linked_id
FROM t as a
INNER JOIN t as b
ON a.id_2 = b.id_2
WHERE a.id_1 != b.id_1
),
recursive_loop AS (
SELECT id_1, linked_id, lowest_linked_id, [linked_id] AS linked_ids
FROM base_iter
UNION ALL
SELECT
prev_iter.id_1, prev_iter.linked_id,
iter.lowest_linked_id,
ARRAY_CONCAT(iter.linked_ids, [prev_iter.linked_id])
FROM base_iter AS prev_iter
JOIN recursive_loop AS iter
ON iter.id_1 = prev_iter.linked_id
AND iter.lowest_linked_id < prev_iter.lowest_linked_id
AND prev_iter.linked_id NOT IN UNNEST(iter.linked_ids)
),
link_back AS (
SELECT
t.id_1, IFNULL(lowest_linked_id, t.id_1) AS lowest_linked_id, t.id_2
FROM t
LEFT JOIN recursive_loop
ON t.id_1 = recursive_loop.id_1
),
by_id_1 AS (
SELECT
id_1,
MIN(lowest_linked_id) AS grp
FROM link_back
GROUP BY 1
),
by_id_2 AS (
SELECT
id_2,
MIN(lowest_linked_id) AS grp
FROM link_back
GROUP BY 1
),
result AS (
SELECT
by_id_1.grp,
ARRAY_AGG(DISTINCT id_1 ORDER BY id_1) AS id1_coll,
ARRAY_AGG(DISTINCT id_2 ORDER BY id_2) AS id2_coll,
FROM
by_id_1
INNER JOIN by_id_2
ON by_id_1.grp = by_id_2.grp
GROUP BY grp
)
SELECT grp, TO_JSON(id1_coll) AS id1_coll, TO_JSON(id2_coll) AS id2_coll
FROM result ORDER BY grp
Giving us the required output:
| grp | id1_coll | id2_coll |
|-----|----------|----------|
| 1   | [1,2,3]  | [a,b]    |
| 4   | [4,5]    | [c]      |
| 6   | [6]      | [d,e]    |
| 7   | [7]      | [f]      |
Limitations/Issues
Unfortunately this approach is inefficient (we have to traverse every single pathway before aggregating it back together) and fails for the real-world case, where we have several million join rows. When trying to execute on that data, BigQuery runs up a huge "Slot time consumed" and then eventually errors out with:
Resources exceeded during query execution: Your project or organization exceeded the maximum disk and memory limit available for shuffle operations. Consider provisioning more slots, reducing query concurrency, or using more efficient logic in this job.
I hope there might be a better way of doing the recursive join such that pathways can be merged/aggregated as we go (if an id_1 value AND a linked_id are already in the list of linked_ids, we don't need to check it further).
Using ROW_NUMBER(), the query is as follows:
WITH RECURSIVE
t AS (
SELECT 1 AS id_1, 'a' AS id_2,
UNION ALL SELECT 2, 'a'
UNION ALL SELECT 2, 'b'
UNION ALL SELECT 3, 'b'
UNION ALL SELECT 4, 'c'
UNION ALL SELECT 5, 'c'
UNION ALL SELECT 6, 'd'
UNION ALL SELECT 6, 'e'
UNION ALL SELECT 7, 'f'
),
t1 AS (
SELECT ROW_NUMBER() OVER(ORDER BY t.id_1) n, t.id_1, t.id_2 FROM t
),
t2 AS (
SELECT n, [n] n_arr, [id_1] arr_1, [id_2] arr_2, id_1, id_2 FROM t1
WHERE n IN (SELECT MIN(n) FROM t1 GROUP BY id_1)
UNION ALL
SELECT t2.n, ARRAY_CONCAT(t2.n_arr, [t1.n]),
CASE WHEN t1.id_1 NOT IN UNNEST(t2.arr_1)
THEN ARRAY_CONCAT(t2.arr_1, [t1.id_1])
ELSE t2.arr_1 END,
CASE WHEN t1.id_2 NOT IN UNNEST(t2.arr_2)
THEN ARRAY_CONCAT(t2.arr_2, [t1.id_2])
ELSE t2.arr_2 END,
t1.id_1, t1.id_2
FROM t2 JOIN t1 ON
t2.n < t1.n AND
t1.n NOT IN UNNEST(t2.n_arr) AND
(t2.id_1 = t1.id_1 OR t2.id_2 = t1.id_2) AND
(t1.id_1 NOT IN UNNEST(t2.arr_1) OR t1.id_2 NOT IN UNNEST(t2.arr_2))
),
t3 AS (
SELECT
n,
ARRAY_AGG(DISTINCT id_1 ORDER BY id_1) arr_1,
ARRAY_AGG(DISTINCT id_2 ORDER BY id_2) arr_2
FROM t2
WHERE n IN (SELECT MIN(n) FROM t2 GROUP BY id_1)
GROUP BY n
)
SELECT n, TO_JSON(arr_1), TO_JSON(arr_2) FROM t3 ORDER BY n
t1: append row numbers to t.
t2: recursively extract rows matching on either id_1 or id_2.
t3: build arrays of id_1 and id_2 values with ARRAY_AGG().
However, it may not help your Limitations/Issues.
The way this question is phrased makes it appear you want "show me distinct groups from a presorted list, unchained to a previous group". For that, something like this should suffice (assuming an auto-incrementing order where one or both IDs move to the next value):
SELECT GrpNr,
STRING_AGG(DISTINCT CAST(id_1 as STRING), ',') as id_1_coll,
STRING_AGG(DISTINCT CAST(id_2 as STRING), ',') as id_2_coll
FROM
(
SELECT id_1, id_2,
SUM(CASE WHEN a.id_1 <> a.previous_id_1 and a.id_2 <> a.previous_id_2 THEN 1 ELSE 0 END)
OVER (ORDER BY RowNr) as GrpNr
FROM
(
SELECT *,
ROW_NUMBER() OVER () as RowNr,
LAG(t.id_1, 1) OVER (ORDER BY 1) AS previous_id_1,
LAG(t.id_2, 1) OVER (ORDER BY 1) AS previous_id_2
FROM t
) a
ORDER BY RowNr
) a
GROUP BY GrpNr
ORDER BY GrpNr
I don't think this is the question you mean to ask. This seems to be a graph-walking problem, as referenced in the other answers and in the response from @GordonLinoff to the question here, which I tested (and presume works for BigQuery).
This can also be done using sequential updates, as done by @RomanPekar
here (which I also tested). The main consideration seems to be performance. I'd assume DBMSs have gotten better at recursion since this was posted.
Rolling it up in either case should be fairly easy using STRING_AGG() as given above, or as you have it.
I'd be curious to see a more accurate representation of the data. If there is some consistency in how the data is stored, or limits on the levels of nesting or other group structures, there may be a shortcut approach other than recursion or iterative updates.

Using Analytical Clauses with DISTINCT

The purpose is to query multiple tables using DISTINCT (otherwise I get millions of rows as results), but at the same time use SAMPLE to gather a 10% sample from the results, which should all be unique. I am getting the following error:
ORA-01446: cannot select ROWID from, or sample, a view with DISTINCT, GROUP BY, etc.
Here is the code I have written:
WITH V AS (SELECT DISTINCT AL1."NO", AL3."IR", AL1."ACCT", AL3."CUST_DA", AL1."NA",
AL3."1_LINE", AL3."2_LINE", AL3."3_LINE", AL1."DA",
AL1."CD", AL1."TITLE_NA", AL1."ENT_NA", AL3."ACCT",
AL3."ACCTLNK_ENRL_CNT"
FROM "DOC"."DOCUMENT" AL1, "DOC"."VNDR" AL2, "DOC"."CUST_ACCT" AL3
WHERE (AL1."ACCT"=AL2."VNDR"
AND AL2."ACCT"=AL3."ACCT")
AND ((AL1."IMG_DA" >= Trunc(sysdate-1)
AND AL1."PROC"='A'
AND AL3."ACCT"<>'03')))
SELECT * FROM V SAMPLE(10.0)
You can't sample a join view like this.
Simpler test case (MCVE):
with v as
( select d1.dummy from dual d1
join dual d2 on d2.dummy = d1.dummy
)
select * from v sample(10);
Fails with:
ORA-01445: cannot select ROWID from, or sample, a join view without a key-preserved table
The simplest fix would be to move the sample clause to the driving table:
with v as
( select d1.dummy from dual sample(10) d1
join dual d2 on d2.dummy = d1.dummy
)
select * from v;
I would therefore rewrite your view as:
with v as
( select distinct
d.no
, a.ir
, d.acct
, a.cust_da
, d.na
, a."1_LINE", a."2_LINE", a."3_LINE"
, d.da, d.cd, d.title_na, d.ent_na
, a.acct
, a.acctlnk_enrl_cnt
from doc.document sample(10) d
join doc.vndr v
on v.vndr = d.acct
join doc.cust_acct a
on a.acct = v.acct
and d.img_da >= trunc(sysdate - 1)
and d.proc = 'A'
and a.acct <> '03'
)
select * from v;

can't use JOIN with generate_series on Redshift

The generate_series function works as expected on Redshift when used in a simple select statement.
WITH series AS (
SELECT n as id from generate_series (-10, 0, 1) n
) SELECT * FROM series;
-- Works fine
As soon as I add a JOIN condition, Redshift throws:
com.amazon.support.exceptions.ErrorException: Function "generate_series(integer,integer,integer)" not supported
DROP TABLE testing;
CREATE TABLE testing (
id INT
);
WITH series AS (
SELECT n as id from generate_series (-10, 0, 1) n
) SELECT * FROM series S JOIN testing T ON S.id = T.id;
-- Function "generate_series(integer,integer,integer)" not supported.
Redshift Version
SELECT version();
-- PostgreSQL 8.0.2 on i686-pc-linux-gnu, compiled by GCC gcc (GCC) 3.4.2 20041017 (Red Hat 3.4.2-6.fc3), Redshift 1.0.1485
Are there any workarounds to make this work?
generate_series is not supported by Redshift. It works only standalone, on the leader node.
A workaround would be using row_number against any table that has a sufficient number of rows:
with
series as (
  -- 11 rows cover the range -10 .. 0
  select (row_number() over ()) - 11 as id from some_table limit 11
) ...
also, this question was asked multiple times already
You are correct that this does not work on Redshift.
See here.
The easiest workaround is to create a permanent table "manually" beforehand with the needed values in it; e.g. you could have rows in that table for -1000 to +1000, then select the range you need from it.
So for your example you would have something like
WITH series AS (
SELECT n as id from (select num as n from newtable where num between -10 and 0) n
) SELECT * FROM series S JOIN testing T ON S.id = T.id;
Does that work for you?
Alternatively, if you cannot create the table beforehand or prefer not to, you could use something like this:
with ten_numbers as (
  select 1 as num union select 2 union select 3 union select 4 union select 5
  union select 6 union select 7 union select 8 union select 9 union select 0
),
generated_numbers as (
  select (1000 * t1.num) + (100 * t2.num) + (10 * t3.num) + t4.num - 5000 as gen_num
  from ten_numbers as t1
  join ten_numbers as t2 on 1 = 1
  join ten_numbers as t3 on 1 = 1
  join ten_numbers as t4 on 1 = 1
)
select gen_num from generated_numbers
where gen_num between -10 and 0
order by 1;

Oracle 'where clause' to become shorter

Let's assume 'table1' has three columns:
'key',
'singleID',
'multipleIDs'
Rows would be like:
1,'8736', '1234;6754;9785;6749'
2,'7446', '9959;7758;6485;9264'
To search for all rows which have an ID either in 'singleID' or as part of the concatenated IDs in 'multipleIDs', I would:
select key from table1 where
singleID = '8888' or multipleIDs like '%8888%';
When searching not only for one ID (8888), as in this statement, but for 100 of them, it would be necessary to repeat the where clause 100 times with different IDs, like:
select key from table1 where
singleID = '8888' or multipleIDs like '%8888%' or
singleID = '9999' or multipleIDs like '%9999%' or
....;
The IDs to search for are taken dynamically from another query like
select id from table2;
The query shall be created dynamically, since the number of IDs might vary.
Written like this, the SQL statement would become quite long.
Is there a nice, short way to express that in Oracle SQL? PL/SQL perhaps?
Something like this?
This is the test version:
with sv_qry
as
(
SELECT trim(regexp_substr(search_values, '[^,]+', 1, LEVEL)) val
FROM (select '1234,7446' as search_values
from dual
)
CONNECT BY LEVEL <= regexp_count(search_values, ',')+1
)
, table1_qry
as
(select 1 as id,'8736' as single_id, '1234;6754;9785;6749' as multiple_id from dual
union all
select 2,'7446' as single_id, '9959;7758;6485;9264' as multiple_id from dual
)
select *
from table1_qry
inner join
sv_qry
on single_id = val or multiple_id like '%'||val||'%'
And this would be with a table called table1:
with sv_qry
as
(
SELECT trim(regexp_substr(search_values, '[^,]+', 1, LEVEL)) val
FROM (select '1234,7446' as search_values
from dual
)
CONNECT BY LEVEL <= regexp_count(search_values, ',')+1
)
select *
from table1
inner join
sv_qry
on single_id = val or multiple_id like '%'||val||'%'
Partial credit goes here:
Splitting string into multiple rows in Oracle
You can express the query like this:
select key
from table1 a
join ( select id from table2 where id in ('yyyy','xxxx','zzzz',...) ) b
on a.singleID = b.id or a.multipleIDs like '%'||b.id||'%';

SELECT DISTINCT for data groups

I have following table:
ID Data
1 A
2 A
2 B
3 A
3 B
4 C
5 D
6 A
6 B
etc. In other words, I have groups of data per ID. You will notice that the data group (A, B) occurs multiple times. I want a query that can identify the distinct data groups and number them, such as:
DataID Data
101 A
102 A
102 B
103 C
104 D
So DataID 102 would represent data group (A,B), DataID 103 would represent data group (C), etc., in order to be able to rewrite my original table in this form:
ID DataID
1 101
2 102
3 102
4 103
5 104
6 102
How can I do that?
PS. Code to generate the first table:
CREATE TABLE #t1 (id INT, data VARCHAR(10))
INSERT INTO #t1
SELECT 1, 'A'
UNION ALL SELECT 2, 'A'
UNION ALL SELECT 2, 'B'
UNION ALL SELECT 3, 'A'
UNION ALL SELECT 3, 'B'
UNION ALL SELECT 4, 'C'
UNION ALL SELECT 5, 'D'
UNION ALL SELECT 6, 'A'
UNION ALL SELECT 6, 'B'
In my opinion you have to create a custom aggregate that concatenates data (for strings, a CLR approach is recommended for performance reasons).
Then I would group by ID and select distinct from the grouping, adding a ROW_NUMBER() or a DENSE_RANK(), your choice. Either way it should look like this:
with groupings as (
    select concat(data) as groups  -- concat() stands in for the custom aggregate
    from #t1
    group by id
)
select groups, row_number() over (order by groups) from groupings
The following query using CASE will give you the result shown below.
From there on, getting the distinct datagroups and proceeding further should not really be a problem.
SELECT
id,
MAX(CASE data WHEN 'A' THEN data ELSE '' END) +
MAX(CASE data WHEN 'B' THEN data ELSE '' END) +
MAX(CASE data WHEN 'C' THEN data ELSE '' END) +
MAX(CASE data WHEN 'D' THEN data ELSE '' END) AS DataGroups
FROM #t1
GROUP BY id
ID DataGroups
1 A
2 AB
3 AB
4 C
5 D
6 AB
However, this kind of logic will only work in case the "Data" values are both fixed and known beforehand.
In your case, you do say that is the case. However, considering that you also say there are 1000 of them, this will be, frankly, a ridiculous-looking query for sure :-)
LuckyLuke's suggestion above would, frankly, be the more generic and probably saner way to go about implementing the solution in your case, though.
From your sample data (having added the missing 2,'A' tuple), the following gives the renumbered (and uniquified) data:
with NonDups as (
select t1.id
from #t1 t1 left join #t1 t2
on t1.id > t2.id and t1.data = t2.data
group by t1.id
having COUNT(t1.data) > COUNT(t2.data)
), DataAddedBack as (
select ID,data
from #t1 where id in (select id from NonDups)
), Renumbered as (
select DENSE_RANK() OVER (ORDER BY id) as ID,Data from DataAddedBack
)
select * from Renumbered
Giving:
1 A
2 A
2 B
3 C
4 D
I think then, it's a matter of relational division to match up rows from this output with the rows in the original table.
Just to share my own dirty solution that I'm using for the moment:
SELECT DISTINCT t1.id, D.data
FROM #t1 t1
CROSS APPLY (
SELECT CAST(Data AS VARCHAR) + ','
FROM #t1 t2
WHERE t2.id = t1.id
ORDER BY Data ASC
FOR XML PATH('') )
D ( Data )
And then proceed analogously to LuckyLuke's solution.