BigQuery - Find the closest region - sql

I have two tables, and for each region in A, I want to find the closest regions in B.
A:
------------------------
ID | Start | End | Color
------------------------
1 | 400 | 500 | White
------------------------
1 | 10 | 20 | Red
------------------------
2 | 2 | 10 | Blue
------------------------
4 | 88 | 90 | Color
------------------------
B:
------------------------
ID | Start | End | Name
------------------------
1 | 1 | 2 | XYZ1
------------------------
1 | 50 | 60 | XYZ4
------------------------
2 | 150 | 160 | ABC1
------------------------
2 | 50 | 60 | ABC2
------------------------
4 | 100 | 120 | EFG
------------------------
RS:
---------------------------------------
ID | Start | End | Color | Closest Name
---------------------------------------
1 | 400 | 500 | White | XYZ4
---------------------------------------
1 | 10 | 20 | Red | XYZ1
---------------------------------------
2 | 2 | 10 | Blue | ABC2
---------------------------------------
4 | 88 | 90 | Color | EFG
---------------------------------------
Currently, I first find min distance by joining two tables:
MinDist Table:
SELECT A.ID, A.Start, A.End,
MIN(CASE
WHEN (ABS(A.End-B.Start)>=ABS(A.Start - B.End))
THEN ABS(A.Start-B.End)
ELSE ABS(A.End - B.Start)
END) AS distance
FROM ( Select A ... )
Join B On A.ID=B.ID)
Group By A.ID, A.Start, A.End
Then I recompute the distance by joining tables A and B again,
GlobDist Table (Note, the query retrieves B.Name in this case):
SELECT A.ID, A.Start, A.End,
CASE
WHEN (ABS(A.End-B.Start)>=ABS(A.Start - B.End))
THEN ABS(A.Start-B.End)
ELSE ABS(A.End - B.Start)
END AS distance,
B.Name
FROM ( Select A ... )
Join B On A.ID=B.ID)
Finally, I join the MinDist and GlobDist tables on
GlobDist.ID= MinDist.ID,
GlobDist.Start=MinDist.Start,
GlobDist.End= MinDist.End,
GlobDist.distance= MinDist.distance.
I tested ROW_NUMBER() and PARTITION BY over (ID, Start, End), but it took much longer. So, what's the fastest and most efficient way of solving this problem? How can I reduce duplicate computation?
Thanks!

The solution below is for BigQuery Standard SQL and is short and simple:
#standardSQL
SELECT a_id, a_start, a_end, color,
ARRAY_AGG(name ORDER BY POW(ABS(a_start - b_start), 2) + POW(ABS(a_end - b_end), 2) LIMIT 1)[SAFE_OFFSET(0)] name
FROM A JOIN B ON a_id = b_id
GROUP BY a_id, a_start, a_end, color
-- ORDER BY a_id
You can test it using the sample data from your question:
#standardSQL
WITH A AS (
SELECT 1 a_id, 400 a_start, 500 a_end, 'White' color UNION ALL
SELECT 1, 10, 20 , 'Red' UNION ALL
SELECT 2, 2, 10, 'Blue' UNION ALL
SELECT 4, 88, 90, 'Color'
), B AS (
SELECT 1 b_id, 1 b_start, 2 b_end, 'XYZ1' name UNION ALL
SELECT 1, 50, 60, 'XYZ4' UNION ALL
SELECT 2, 150, 160,'ABC1' UNION ALL
SELECT 2, 50, 60, 'ABC2' UNION ALL
SELECT 4, 100, 120,'EFG'
)
SELECT a_id, a_start, a_end, color,
ARRAY_AGG(name ORDER BY POW(ABS(a_start - b_start), 2) + POW(ABS(a_end - b_end), 2) LIMIT 1)[SAFE_OFFSET(0)] name
FROM A JOIN B ON a_id = b_id
GROUP BY a_id, a_start, a_end, color
ORDER BY a_id
with result as below
Row a_id a_start a_end color name
1 1 400 500 White XYZ4
2 1 10 20 Red XYZ1
3 2 2 10 Blue ABC2
4 4 88 90 Color EFG

Related

Defining a new variable in SELECT clause in SQL developer

I am new to SQL and the extraction of data from databases, so please bear with me. I only have experience with coding in statistical programs, including Stata, SAS, and R.
I currently have a SELECT clause that extracts a table from an Oracle database.
To simplify the question, I make use of an illustrative example:
I am interested in CREATING a new variable, which is not included in the database and must be defined based on the other variables, that contains the weight of their mother. Since I am new to SQL, I do not know if this is possible to do in the SELECT clause or if there exist more efficient options.
Note that,
Mother and Mother_number are referring to the "same numbers", meaning that mothers and daughters are represented in the model.
AA (number 1) and CC (number 3) have the same mother (BB) (number 2)
I need to do some conversion of the date, e.g. to_char(a.from_date, 'dd-mm-yyyy') as fromdate, since SQL confuses the year with the day of the month.
The SQL code:
select to_char(a.from_date, 'dd-mm-yyyy') as fromdate, a.Name, a.Weight, a.Number, a.Mother_number
from table1 a, table2 b
where 1=1
and a.family_ref=b.family_ref
and .. (other conditions)
What I currently obtain:
| fromdate | Name | Weight | Number | Mother_number |
|------------|------|--------|--------|---------------|
| 06-07-2021 | AA | 100 | 1 | 2 |
| 06-07-2021 | BB | 200 | 2 | 3 |
| 06-07-2021 | CC | 300 | 3 | 2 |
| 06-07-2021 | DD | 400 | 4 | 5 |
| 06-07-2021 | EE | 500 | 5 | 6 |
| ... | ... | ... | ... | ... |
What I wish to obtain:
| fromdate | Name | Weight | Number | Mother_number | Mother_weight |
|------------|------|--------|--------|---------------|---------------|
| 06-07-2021 | AA | 100 | 1 | 2 | 200 |
| 06-07-2021 | BB | 200 | 2 | 3 | 300 |
| 06-07-2021 | CC | 300 | 3 | 2 | 200 |
| 06-07-2021 | DD | 400 | 4 | 5 | 500 |
| 06-07-2021 | EE | 500 | 5 | 6 | … |
| … | … | … | … | … | … |
Assuming the MOTHER_NUMBER value references the same values as the NUMBER variable, just join the table with itself:
select a.fromdate
, a.name
, a.weight
, a.number
, a.mother_number
, b.weight as mother_weight
from HAVE a
left join HAVE b
on a.mother_number = b.number
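The self-join above can be tried on the question's five sample rows. This is a rough sketch only, assuming SQLite through Python's sqlite3 module rather than Oracle; the table name have is taken from the answer, and the date column is left out for brevity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE have (name TEXT, weight INT, number INT, mother_number INT);
INSERT INTO have VALUES ('AA', 100, 1, 2), ('BB', 200, 2, 3),
                        ('CC', 300, 3, 2), ('DD', 400, 4, 5),
                        ('EE', 500, 5, 6);
""")

# Alias a is the daughter row, alias b is the matching mother row.
# LEFT JOIN keeps daughters whose mother is not in the table (EE here),
# with mother_weight returned as NULL (None in Python).
mother_rows = conn.execute("""
SELECT a.name, a.weight, a.number, a.mother_number,
       b.weight AS mother_weight
FROM have a
LEFT JOIN have b
       ON a.mother_number = b.number
ORDER BY a.number
""").fetchall()

for row in mother_rows:
    print(row)
```

An inner join would silently drop EE, whose mother (number 6) has no row of her own, which is why the left join is used.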
Although I'm not sure I'm following the "mother" logic, the way to implement the last column in your SELECT statement is to add b.weight as Mother_Weight at the end of the first line, before the FROM keyword.
Since the b table references "Mothers", you can add the column just by taking the weight of the person in table b.
If instead you wish to add the data of a person's mother's weight, you can do that by adding a column to the relevant table and then updating each row in your table by executing the statements below:
ALTER TABLE table1 ADD Mother_weight FLOAT;
UPDATE table1 SET Mother_weight=(SELECT (Weight) FROM table2 WHERE table1.family_ref=table2.family_ref);
Then you add the a.Mother_weight clause in your SELECT statement.
Use a hierarchical query:
SELECT to_char(a.fromdate, 'dd-mm-yyyy') as fromdate,
a.Name,
a.Weight,
a."NUMBER",
a.Mother_number,
PRIOR weight AS mother_weight
FROM table1 a
INNER JOIN table2 b
ON (a.family_ref=b.family_ref)
WHERE LEVEL = 2
OR ( LEVEL = 1
AND NOT EXISTS(
SELECT 1
FROM table1 x
WHERE a.mother_number = x."NUMBER"
)
)
CONNECT BY NOCYCLE
PRIOR "NUMBER" = mother_number
AND PRIOR a.family_ref = a.family_ref
ORDER BY a."NUMBER"
Or, a sub-query factoring clause and a self-join:
WITH data (fromdate, name, weight, "NUMBER", mother_number) AS (
SELECT to_char(a.fromdate, 'dd-mm-yyyy'),
a.Name,
a.Weight,
a."NUMBER",
a.Mother_number
FROM table1 a
INNER JOIN table2 b
ON (a.family_ref=b.family_ref)
)
SELECT d.*,
m.weight AS mother_weight
FROM data d
LEFT OUTER JOIN data m
ON (d.mother_number = m."NUMBER")
ORDER BY d."NUMBER"
Which, for the sample data:
CREATE TABLE table1 (family_ref, fromdate, Name, Weight, "NUMBER", Mother_number) AS
SELECT 1, DATE '2021-07-06', 'AA', 100, 1, 2 FROM DUAL UNION ALL
SELECT 1, DATE '2021-07-06', 'BB', 200, 2, 3 FROM DUAL UNION ALL
SELECT 1, DATE '2021-07-06', 'CC', 300, 3, 2 FROM DUAL UNION ALL
SELECT 1, DATE '2021-07-06', 'DD', 400, 4, 5 FROM DUAL UNION ALL
SELECT 1, DATE '2021-07-06', 'EE', 500, 5, 6 FROM DUAL;
CREATE TABLE table2 (family_ref) AS
SELECT 1 FROM DUAL;
Both output:
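| FROMDATE   | NAME | WEIGHT | NUMBER | MOTHER_NUMBER | MOTHER_WEIGHT |
|------------|------|--------|--------|---------------|---------------|
| 06-07-2021 | AA   | 100    | 1      | 2             | 200           |
| 06-07-2021 | BB   | 200    | 2      | 3             | 300           |
| 06-07-2021 | CC   | 300    | 3      | 2             | 200           |
| 06-07-2021 | DD   | 400    | 4      | 5             | 500           |
| 06-07-2021 | EE   | 500    | 5      | 6             |               |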

Join number of pairs in a single table using SQL

I have two tables of events in BigQuery that look as follows. The main idea is to count the number of events in each table (the rows are always pairs of event_id and user_id) and join the counts in a single table that, for each pair appearing in either table, gives the number of events.
table 1:
| event_id | user id |
| -------- | ------- |
| 1 | 1 |
| 2 | 1 |
| 2 | 3 |
| 2 | 5 |
| 1 | 1 |
| 4 | 7 |
table 2:
| event_id | user id |
| -------- | ------- |
| 1 | 1 |
| 3 | 1 |
| 2 | 3 |
I would like to get a table which has the number of events of each table:
| event_id | user id | num_events_table1 | num_events_table2 |
| -------- | ------- | ----------------- | ----------------- |
| 1 | 1 | 2 | 1 |
| 2 | 1 | 1 | 0 |
| 2 | 3 | 1 | 1 |
| 2 | 5 | 1 | 0 |
| 4 | 7 | 1 | 0 |
| 3 | 1 | 0 | 1 |
Any idea of how to do this with sql? I have tried this:
SELECT i1, e1, num_viewed, num_displayed FROM
(SELECT id as i1, event as e1, count(*) as num_viewed
FROM table_1
group by id, event) a
full outer JOIN (SELECT id as i2, event as e2, count(*) as num_displayed
FROM table_2
group by id, event) b
on a.i1 = b.i2 and a.e1 = b.e2
This is not getting exactly what I want. I am getting i1 values that are null and e1 values that are null.
Consider below
#standardSQL
with `project.dataset.table1` as (
select 1 event_id, 1 user_id union all
select 2, 1 union all
select 2, 3 union all
select 2, 5 union all
select 1, 1 union all
select 4, 7
), `project.dataset.table2` as (
select 1 event_id, 1 user_id union all
select 3, 1 union all
select 2, 3
)
select event_id, user_id,
countif(source = 1) as num_events_table1,
countif(source = 2) as num_events_table2
from (
select 1 source, * from `project.dataset.table1`
union all
select 2, * from `project.dataset.table2`
)
group by event_id, user_id
Applied to the sample data in your question, this produces the counts shown in your expected table (row order is not guaranteed without an ORDER BY).
If I understand correctly, the simplest method is to modify your query via a USING clause along with COALESCE():
SELECT id, event, COALESCE(num_viewed, 0), COALESCE(num_displayed, 0)
FROM (SELECT id, event, count(*) as num_viewed
FROM table_1
GROUP BY id, event
) t1 FULL JOIN
(SELECT id , event, COUNT(*) as num_displayed
FROM table_2
GROUP BY id, event
) t2
USING (id, event);
Note: This requires that the two columns used for the JOIN have the same name. If this is not the case, then you might still need column aliases in the subqueries.
One way is to aggregate the union:
select event_id, user_id, sum(cnt1) cnt1, sum(cnt2) cnt2
from (
select event_id, user_id, 1 cnt1, 0 cnt2
from table_1
union all
select event_id, user_id, 0 cnt1, 1 cnt2
from table_2 ) t
group by event_id, user_id
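The union-then-aggregate approach can be verified against the expected table from the question. A minimal sketch, assuming SQLite through Python's sqlite3 module instead of BigQuery, and assuming the columns are named event_id and user_id:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE table_1 (event_id INT, user_id INT);
CREATE TABLE table_2 (event_id INT, user_id INT);
INSERT INTO table_1 VALUES (1, 1), (2, 1), (2, 3), (2, 5), (1, 1), (4, 7);
INSERT INTO table_2 VALUES (1, 1), (3, 1), (2, 3);
""")

# Tag each row with a 1 in the counter column of its source table,
# stack both tables, then sum the counters per (event_id, user_id).
# Pairs missing from one table get 0 for it, with no NULL handling needed.
pair_counts = conn.execute("""
SELECT event_id, user_id,
       SUM(cnt1) AS num_events_table1,
       SUM(cnt2) AS num_events_table2
FROM (
    SELECT event_id, user_id, 1 AS cnt1, 0 AS cnt2 FROM table_1
    UNION ALL
    SELECT event_id, user_id, 0 AS cnt1, 1 AS cnt2 FROM table_2
) t
GROUP BY event_id, user_id
ORDER BY event_id, user_id
""").fetchall()

for row in pair_counts:
    print(row)
```

Compared with the FULL OUTER JOIN attempt in the question, this form never produces NULL keys, because every output row comes from a real (event_id, user_id) pair in one of the two tables.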

Filtering a table via another table's values

I have 2 tables:
Value
+----+-------+
| id | name |
+----+-------+
| 1 | Peter |
| 2 | Jane |
| 3 | Joe |
+----+-------+
Filter
+----+---------+------+
| id | valueid | type |
+----+---------+------+
| 1 | 1 | A |
| 2 | 1 | B |
| 3 | 1 | C |
| 4 | 1 | D |
| 5 | 2 | A |
| 6 | 2 | C |
| 7 | 2 | E |
| 8 | 3 | A |
| 9 | 3 | D |
+----+---------+------+
I need to retrieve the values from the Value table where the related Filter table does not contain the type 'B' or 'C'
So in this quick example this would be only Joe.
Please note this is a DB2 database and I have limited permissions: I can only run SELECT statements.
Or also a NOT IN (<fullselect>) predicate:
Only that my result is 'Joe', not 'Jane', and the data constellation would point to that.
WITH
-- your input, sans reserved words
val(id,nam) AS (
SELECT 1,'Peter' FROM sysibm.sysdummy1
UNION ALL SELECT 2,'Jane' FROM sysibm.sysdummy1
UNION ALL SELECT 3,'Joe' FROM sysibm.sysdummy1
)
,
filtr(id,valueid,typ) AS (
SELECT 1,1,'A' FROM sysibm.sysdummy1
UNION ALL SELECT 2,1,'B' FROM sysibm.sysdummy1
UNION ALL SELECT 3,1,'C' FROM sysibm.sysdummy1
UNION ALL SELECT 4,1,'D' FROM sysibm.sysdummy1
UNION ALL SELECT 5,2,'A' FROM sysibm.sysdummy1
UNION ALL SELECT 6,2,'C' FROM sysibm.sysdummy1
UNION ALL SELECT 7,2,'E' FROM sysibm.sysdummy1
UNION ALL SELECT 8,3,'A' FROM sysibm.sysdummy1
UNION ALL SELECT 9,3,'D' FROM sysibm.sysdummy1
)
-- real query starts here
SELECT
*
FROM val
WHERE id NOT IN (
SELECT valueid FROM filtr WHERE typ IN ('B','C')
)
;
-- out id | nam
-- out ----+-------
-- out 3 | Joe
Or also, a failing left join:
SELECT
val.*
FROM val
LEFT JOIN (
SELECT valueid FROM filtr WHERE typ IN ('B','C')
) filtr
ON filtr.valueid = val.id
WHERE valueid IS NULL
You can use EXISTS, as in:
select *
from value v
where not exists (
select null from filter f
where f.valueid = v.id and f.type in ('B', 'C')
);
Result:
ID NAME
--- -----
3 Joe
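The NOT EXISTS filter can be reproduced on the sample data as follows. This is a sketch under assumptions: it runs on SQLite through Python's sqlite3 module rather than DB2, and it uses the names val, filtr, nam and typ (borrowed from the NOT IN answer) to stay clear of reserved words.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE val (id INT, nam TEXT);
CREATE TABLE filtr (id INT, valueid INT, typ TEXT);
INSERT INTO val VALUES (1, 'Peter'), (2, 'Jane'), (3, 'Joe');
INSERT INTO filtr VALUES (1, 1, 'A'), (2, 1, 'B'), (3, 1, 'C'),
                         (4, 1, 'D'), (5, 2, 'A'), (6, 2, 'C'),
                         (7, 2, 'E'), (8, 3, 'A'), (9, 3, 'D');
""")

# Keep a value only if no related filter row has type 'B' or 'C'.
kept = conn.execute("""
SELECT v.id, v.nam
FROM val v
WHERE NOT EXISTS (
    SELECT 1
    FROM filtr f
    WHERE f.valueid = v.id
      AND f.typ IN ('B', 'C')
)
ORDER BY v.id
""").fetchall()

print(kept)
```

Peter has both a B and a C row and Jane has a C row, so only Joe remains. Unlike NOT IN, NOT EXISTS also behaves predictably if valueid can be NULL.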

Grouping by similar values in multiple columns

I have a table of entities with an id, and a category (few different values with NULL allowed) from 3 different years (category can be different from 1 year to another), in 'wide' table format:
| ID | CATEG_Y1 | CATEG_Y2 | CATEG_Y3 |
+-----+----------+----------+----------+
| 1 | NULL | B | C |
| 2 | A | A | C |
| 3 | B | A | NULL |
| 4 | A | C | B |
| ... | ... | ... | ... |
I would like to simply count the number of entities by category, grouped by category, independently for the year:
+-------+----+----+----+
| CATEG | Y1 | Y2 | Y3 |
+-------+----+----+----+
| A | 6 | 4 | 5 | <- 6 entities w/ categ_y1, 4 w/ categ_y2, 5 w/ categ_y3
| B | 3 | 1 | 10 |
| C | 8 | 4 | 5 |
| NULL | 3 | 3 | 3 |
+-------+----+----+----+
I guess I could do it by grouping values one column after the other and UNION ALL the results, but I was wondering if there was a more rapid & convenient way, and if it can be generalized if I have more columns/years to manage (e.g. 20-30 different values)
A bit clumsy, but probably someone has a better idea. The query first collects all different categories (the union query in the FROM part), and then counts the occurrences with dedicated subqueries in the SELECT part. One could omit the union part if there is a table already defining the available categories (I suppose categ_y1 is a foreign key to such a primary category table). Hope there are not too many typos:
select categories.cat,
(select count(categ_y1) from table ty1 where categories.cat = categ_y1) as y1,
(select count(categ_y2) from table ty2 where categories.cat = categ_y2) as y2,
(select count(categ_y3) from table ty3 where categories.cat = categ_y3) as y3
from ( select categ_y1 as cat from table t1
union select categ_y2 as cat from table t2
union select categ_y3 as cat from table t3) categories
Use jsonb functions to transpose the data (from the question) to this format:
select categ, jsonb_object_agg(key, count) as jdata
from (
select value as categ, key, count(*)
from my_table t,
jsonb_each_text(to_jsonb(t)- 'id')
group by 1, 2
) s
group by 1
order by 1;
categ | jdata
-------+-----------------------------------------------
A | {"categ_y1": 2, "categ_y2": 2}
B | {"categ_y1": 1, "categ_y2": 1, "categ_y3": 1}
C | {"categ_y2": 1, "categ_y3": 2}
| {"categ_y1": 1, "categ_y3": 1}
(4 rows)
For a known (static) number of years you can easily unpack the jsonb column:
select categ, jdata->'categ_y1' as y1, jdata->'categ_y2' as y2, jdata->'categ_y3' as y3
from (
select categ, jsonb_object_agg(key, count) as jdata
from (
select value as categ, key, count(*)
from my_table t,
jsonb_each_text(to_jsonb(t)- 'id')
group by 1, 2
) s
group by 1
) s
order by 1;
categ | y1 | y2 | y3
-------+----+----+----
A | 2 | 2 |
B | 1 | 1 | 1
C | | 1 | 2
| 1 | | 1
(4 rows)
To get fully dynamic solution you can use the function create_jsonb_flat_view() described in Flatten aggregated key/value pairs from a JSONB field.
I would do this using union all followed by aggregation:
select categ, sum(categ_y1) as y1, sum(categ_y2) as y2,
sum(categ_y3) as y3
from ((select categ_y1 as categ, 1 as categ_y1, 0 as categ_y2, 0 as categ_y3
from t
) union all
(select categ_y2, 0 as categ_y1, 1 as categ_y2, 0 as categ_y3
from t
) union all
(select categ_y3, 0 as categ_y1, 0 as categ_y2, 1 as categ_y3
from t
)
) x
group by categ ;
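To see the union all plus aggregation approach on the four sample rows from the question, here is a minimal sketch. It assumes SQLite through Python's sqlite3 module; the table name t follows the answer above, and the flag columns are named y1, y2, y3 for brevity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t (id INT, categ_y1 TEXT, categ_y2 TEXT, categ_y3 TEXT);
INSERT INTO t VALUES (1, NULL, 'B', 'C'),
                     (2, 'A',  'A', 'C'),
                     (3, 'B',  'A', NULL),
                     (4, 'A',  'C', 'B');
""")

# Unpivot each year column into (categ, flag per year), then sum the flags.
# GROUP BY puts all NULL categories into a single group.
categ_counts = conn.execute("""
SELECT categ, SUM(y1) AS y1, SUM(y2) AS y2, SUM(y3) AS y3
FROM (
    SELECT categ_y1 AS categ, 1 AS y1, 0 AS y2, 0 AS y3 FROM t
    UNION ALL
    SELECT categ_y2, 0, 1, 0 FROM t
    UNION ALL
    SELECT categ_y3, 0, 0, 1 FROM t
) x
GROUP BY categ
ORDER BY categ IS NULL, categ
""").fetchall()

for row in categ_counts:
    print(row)
```

Each additional year needs one more UNION ALL branch and one more flag column, so for 20 to 30 years the query text would normally be generated rather than typed by hand.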

Select a row X times

I have a very specific sql problem.
I have a table given with order positions (each position belongs to one order, but this isn't a problem):
| Article ID | Amount |
|--------------|----------|
| 5 | 3 |
| 12 | 4 |
For the customer, I need an export with every physical item that is ordered, e.g.
| Article ID | Position |
|--------------|------------|
| 5 | 1 |
| 5 | 2 |
| 5 | 3 |
| 12 | 1 |
| 12 | 2 |
| 12 | 3 |
| 12 | 4 |
How can I build my select statement to give me these results? I think there are two key tasks:
1) Select a row X times based on the amount
2) Set the position for each physical article
You can do it like this
SELECT ArticleID, n.n Position
FROM table1 t JOIN
(
SELECT a.N + b.N * 10 + 1 n
FROM
(SELECT 0 AS N UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4 UNION ALL SELECT 5 UNION ALL SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9) a
,(SELECT 0 AS N UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4 UNION ALL SELECT 5 UNION ALL SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9) b
) n
ON n.n <= t.amount
ORDER BY ArticleID, Position
Note: subquery n generates a sequence of numbers from 1 to 100 on the fly. If you run many such queries, consider creating a persisted tally (numbers) table and using it instead.
or using a recursive CTE
WITH tally AS (
SELECT 1 n
UNION ALL
SELECT n + 1 FROM tally WHERE n < 100
)
SELECT ArticleID, n.n Position
FROM table1 t JOIN tally n
ON n.n <= t.amount
ORDER BY ArticleID, Position
Output in both cases:
| ARTICLEID | POSITION |
|-----------|----------|
| 5 | 1 |
| 5 | 2 |
| 5 | 3 |
| 12 | 1 |
| 12 | 2 |
| 12 | 3 |
| 12 | 4 |
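The tally-table join can be run end to end as follows. This is only a sketch, assuming SQLite through Python's sqlite3 module; SQLite spells the recursive CTE as WITH RECURSIVE, and the cap of 100 is carried over from the answer as an assumed upper bound on Amount.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE table1 (ArticleID INT, Amount INT);
INSERT INTO table1 VALUES (5, 3), (12, 4);
""")

# tally yields 1..100; joining on n <= Amount repeats each order
# position Amount times and numbers the copies 1..Amount.
positions = conn.execute("""
WITH RECURSIVE tally(n) AS (
    SELECT 1
    UNION ALL
    SELECT n + 1 FROM tally WHERE n < 100
)
SELECT t.ArticleID, tally.n AS Position
FROM table1 t
JOIN tally
  ON tally.n <= t.Amount
ORDER BY t.ArticleID, tally.n
""").fetchall()

for row in positions:
    print(row)
```

Any Amount above the cap would be silently truncated to 100 rows, so the bound should be at least the largest amount that can occur.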
Query:
SELECT t1.[Article ID],
t2.number
FROM Table1 t1,
master..spt_values t2
WHERE t1.Amount >= t2.number
AND t2.type = 'P'
AND t2.number <= 255
AND t2.number <> 0
Result:
| ARTICLE ID | NUMBER |
|------------|--------|
| 5 | 1 |
| 5 | 2 |
| 5 | 3 |
| 12 | 1 |
| 12 | 2 |
| 12 | 3 |
| 12 | 4 |