BigQuery - find N nearest vectors - sql

I have a BigQuery table with a repeated (array) column that holds a 512-dimensional float vector.
I would like to run a query that finds the N most similar vectors.
In my case, similarity can simply be defined as the inner product of the target vector and each vector in the table.
I have found and run the query below, which computes the cosine similarity for every pair of rows in the table:
#standardSQL
CREATE TABLE ml.url_cosine_similarity AS
WITH pairwise AS (
SELECT t1.url AS id_1, t2.url AS id_2
FROM `project.dataset.table` t1
INNER JOIN `project.dataset.table` t2
ON t1.url < t2.url
)
SELECT id_1, id_2, (
SELECT
SUM(value1 * value2)/
SQRT(SUM(value1 * value1))/
SQRT(SUM(value2 * value2))
FROM UNNEST(a.page_vector) value1 WITH OFFSET pos1
JOIN UNNEST(b.page_vector) value2 WITH OFFSET pos2
ON pos1 = pos2
) cosine_similarity
FROM pairwise t
JOIN `project.dataset.table` a ON a.url = id_1
JOIN `project.dataset.table` b ON b.url = id_2
However, since I do not have a good grasp of how arrays work in BigQuery, I am unsure how to change this query so that it takes a target vector and returns its N nearest neighbours.

See the simplified example below; it returns the top 3 nearest pairs of vectors in the table:
#standardSQL
WITH `project.dataset.table` AS (
SELECT 1 id, [1,2,3,4,5] page_vector UNION ALL
SELECT 2, [1,3,4,5,16] UNION ALL
SELECT 3, [2,3,4,5,6] UNION ALL
SELECT 4, [2,4,6,8,9] UNION ALL
SELECT 5, [1,3,4,5,16] UNION ALL
SELECT 6, [11,12,13,14,15]
)
SELECT a.id id1, b.id id2, (
SELECT
SUM(value1 * value2)/
SQRT(SUM(value1 * value1))/
SQRT(SUM(value2 * value2))
FROM UNNEST(a.page_vector) value1 WITH OFFSET pos1
JOIN UNNEST(b.page_vector) value2 WITH OFFSET pos2
ON pos1 = pos2
) cosine_similarity
FROM `project.dataset.table` a
JOIN `project.dataset.table` b
ON a.id < b.id
ORDER BY cosine_similarity DESC
LIMIT 3
with output
Row  id1  id2  cosine_similarity
1    2    5    1.0
2    1    4    0.9986422261219272
3    3    4    0.9962894120648842
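The question asks for the N nearest rows to a single target vector, ranked by inner product, rather than the nearest pairs. The ranking logic can be sketched outside BigQuery as follows. This is a minimal Python illustration, not BigQuery code: the names `inner_product` and `nearest` are hypothetical, and the data is the sample table from the example above. In SQL terms it corresponds to scoring every row against one fixed vector and then applying ORDER BY score DESC LIMIT N.

```python
from typing import Dict, List, Sequence, Tuple


def inner_product(u: Sequence[float], v: Sequence[float]) -> float:
    """Sum of element-wise products; both vectors must have the same length."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same number of dimensions")
    return sum(a * b for a, b in zip(u, v))


def nearest(target: Sequence[float],
            table: Dict[int, Sequence[float]],
            n: int) -> List[Tuple[int, float]]:
    """Return the n (id, score) pairs with the largest inner product.

    Ties are broken by the smaller id so the result is deterministic.
    """
    scored = [(row_id, inner_product(target, vec)) for row_id, vec in table.items()]
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:n]


# Sample rows copied from the simplified example (id -> page_vector).
table = {
    1: [1, 2, 3, 4, 5],
    2: [1, 3, 4, 5, 16],
    3: [2, 3, 4, 5, 6],
    4: [2, 4, 6, 8, 9],
    5: [1, 3, 4, 5, 16],
    6: [11, 12, 13, 14, 15],
}

top3 = nearest([1, 2, 3, 4, 5], table, 3)
print(top3)  # [(6, 205), (2, 119), (5, 119)]
```

Note that a raw inner product favours long vectors (id 6 here), which is why the queries in this answer normalise by both vector lengths to get cosine similarity.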
If you want the nearest vectors (say, two) for every vector in the table, see the example below:
#standardSQL
WITH `project.dataset.table` AS (
SELECT 1 id, [1,2,3,4,5] page_vector UNION ALL
SELECT 2, [1,3,4,5,16] UNION ALL
SELECT 3, [2,3,4,5,6] UNION ALL
SELECT 4, [2,4,6,8,9] UNION ALL
SELECT 5, [1,3,4,5,16] UNION ALL
SELECT 6, [11,12,13,14,15]
)
SELECT id, ANY_VALUE(page_vector) page_vector,
ARRAY_AGG(
STRUCT(id2 AS id, page_vector2 AS page_vector, cosine_similarity AS cosine_similarity)
ORDER BY cosine_similarity DESC
LIMIT 2
) similar_vectors
FROM (
SELECT a.id, a.page_vector,
b.id id2, b.page_vector page_vector2, (
SELECT
SUM(value1 * value2)/
SQRT(SUM(value1 * value1))/
SQRT(SUM(value2 * value2))
FROM UNNEST(a.page_vector) value1 WITH OFFSET pos1
JOIN UNNEST(b.page_vector) value2 WITH OFFSET pos2
ON pos1 = pos2
) cosine_similarity
FROM `project.dataset.table` a
JOIN `project.dataset.table` b
ON a.id != b.id
)
GROUP BY id
ORDER BY id
this will produce below output

Related

BigQuery: Symmetric difference (xor) between two sets

BigQuery has UNION, INTERSECT, and EXCEPT [1], but not XOR.
SELECT * FROM [0, 1,2,3] XOR SELECT * FROM [2,3,4]
would return
0
1
4
As 0 and 1 are present in the first select but not second, and 4 is present in the second select, but not first.
I'd like to use it to find discrepancies between two tables, e.g. find customers that are present in one table but not the other, and vice versa.
Any hints how to best do it?
[1] https://cloud.google.com/bigquery/docs/reference/standard-sql/query-syntax#set_operators
BigQuery does not need an XOR operator, because the same result can be obtained from existing operators:
One way, as @Genato points out, is to use a JOIN, as in this issue.
Another way is to use set operators: A XOR B can be rewritten as (A AND NOT B) OR (B AND NOT A), so for your example you could write
(
SELECT * FROM UNNEST(ARRAY<int64>[0, 1, 2, 3]) AS number
EXCEPT DISTINCT SELECT * FROM UNNEST(ARRAY<int64>[2, 3, 4]) AS number)
UNION ALL
(
SELECT * FROM UNNEST(ARRAY<int64>[2,3,4]) AS number
EXCEPT DISTINCT SELECT * FROM UNNEST(ARRAY<int64>[0, 1, 2, 3]) AS number);
which results in 0, 1 and 4 (the row order is not guaranteed).
A few workarounds:
Option 1
with table1 as (
select * from unnest([0, 1,2,3]) num
), table2 as (
select * from unnest([2,3,4]) num
)
select * from table1 where not num in (select num from table2)
union all
select * from table2 where not num in (select num from table1)
Option 2
with table1 as (
select * from unnest([0,1,2,3]) num
), table2 as (
select * from unnest([2,3,4]) num
)
select num from (
select distinct num from table1 union all
select distinct num from table2
)
group by num
having count(*) = 1
In both cases the output is 0, 1 and 4.
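To check that the set-operator form and the GROUP BY form really agree, here is a small sketch that runs both against an in-memory SQLite database from Python. SQLite is only a stand-in engine, chosen because it ships with Python: it has no UNNEST, so the sample numbers are loaded into two ordinary tables, and it does not allow parentheses around the operands of a compound SELECT, so each EXCEPT is wrapped in a subquery.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (num INTEGER);
    CREATE TABLE table2 (num INTEGER);
    INSERT INTO table1 VALUES (0), (1), (2), (3);
    INSERT INTO table2 VALUES (2), (3), (4);
""")

# (A EXCEPT B) UNION ALL (B EXCEPT A)
set_operator_form = """
    SELECT num FROM (SELECT num FROM table1 EXCEPT SELECT num FROM table2)
    UNION ALL
    SELECT num FROM (SELECT num FROM table2 EXCEPT SELECT num FROM table1)
"""

# Values that appear in exactly one of the two de-duplicated inputs.
count_form = """
    SELECT num FROM (
        SELECT DISTINCT num FROM table1
        UNION ALL
        SELECT DISTINCT num FROM table2
    )
    GROUP BY num
    HAVING COUNT(*) = 1
"""

via_except = sorted(row[0] for row in conn.execute(set_operator_form))
via_count = sorted(row[0] for row in conn.execute(count_form))
print(via_except, via_count)  # [0, 1, 4] [0, 1, 4]
```

Both results match Python's own symmetric difference, `{0, 1, 2, 3} ^ {2, 3, 4}`.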

SQL for storing numbers from cold to hot for specific range?

We have a table that looks like this: date, val1, val2, val3, val4, val5
For a given row, val1 to val5 are unique and between 1 and 37.
Using T-SQL, how can I list the numbers 1 to 37 from cold to hot, with their frequency, for a given date range?
Sample Output (NOT ACTUAL): Numbers by frequency, ascending (cold to hot):
36=0, 2=1, 5=1, 7=3, 34=5, 30=6, etc.
With a recursive CTE, create the dataset 1-37, and use UNION ALL to unpivot all the numbers in the table into a single column.
Then left join the two datasets, group by the number and aggregate:
with cte(n) as (
select 1 union all select (cte.n + 1) n from cte where cte.n < 37
)
select
cte.n, count(t.number) counter
from cte left join (
select date, val1 number from tablename union all
select date, val2 from tablename union all
select date, val3 from tablename union all
select date, val4 from tablename union all
select date, val5 from tablename
) t on t.number = cte.n and t.date between '2019-05-01' and '2019-05-31'
group by cte.n
order by counter, cte.n
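The same pattern (generate the full number range, unpivot with UNION ALL, left join, count) can be tried on a toy dataset. The sketch below uses Python's built-in sqlite3 module instead of SQL Server, so some details differ: SQLite requires the RECURSIVE keyword, the table name `draws` and its rows are made up for illustration, and the range is 1 to 10 instead of 1 to 37 to keep the output short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE draws (dt TEXT, val1 INT, val2 INT, val3 INT, val4 INT, val5 INT);
    INSERT INTO draws VALUES
        ('2019-05-03', 1, 2, 3, 4, 5),
        ('2019-05-10', 2, 3, 4, 5, 6),
        ('2019-05-17', 3, 4, 5, 6, 7),
        ('2019-06-01', 7, 8, 9, 10, 1);  -- outside the date range used below
""")

query = """
    WITH RECURSIVE nums(n) AS (
        SELECT 1 UNION ALL SELECT n + 1 FROM nums WHERE n < :max_n
    ),
    unpivoted(dt, num) AS (
        SELECT dt, val1 FROM draws UNION ALL
        SELECT dt, val2 FROM draws UNION ALL
        SELECT dt, val3 FROM draws UNION ALL
        SELECT dt, val4 FROM draws UNION ALL
        SELECT dt, val5 FROM draws
    )
    SELECT nums.n, COUNT(u.num) AS counter
    FROM nums
    LEFT JOIN unpivoted AS u
        ON u.num = nums.n AND u.dt BETWEEN :date_from AND :date_to
    GROUP BY nums.n
    ORDER BY counter, nums.n
"""

params = {"max_n": 10, "date_from": "2019-05-01", "date_to": "2019-05-31"}
rows = conn.execute(query, params).fetchall()
print(rows)
# [(8, 0), (9, 0), (10, 0), (1, 1), (7, 1), (2, 2), (6, 2), (3, 3), (4, 3), (5, 3)]
```

The date filter sits in the ON clause rather than in WHERE; that is what keeps numbers with zero hits in the result.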
Generate a table of 37 numbers and left join your data:
WITH E1(N) AS (
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
), --10E+1 or 10 rows
E2(N) AS (SELECT 1 FROM E1 a, E1 b), --10E+2 or 100 rows
E4(N) AS (SELECT 1 FROM E2 a, E2 b), --10E+4 or 10,000 rows max
numbers(N) AS (
SELECT TOP (37) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E4
)
select n.N, count(t.val)
from numbers N
left join (
select dt, val
from
-- your table here
( values
('2017-01-01', 22, 23, 4, 22, 5)
) myTable (dt, val1, val2,val3,val4,val5)
-- end of your table
cross apply (
values (val1),(val2),(val3),(val4),(val5)
) t(val)
) t on t.val = n.N
group by n.N
order by n.N;
You need to generate a list of 37 numbers (a recursive CTE is handy for this).
Then you can use a join if the values are unique in each row:
with n(n) as (
select 1 as n union all
select (n.n + 1) as n -- the recursive member must reference the CTE by its own name, n
from n
where n.n < 37
)
select n.n, count(t.id) counter
from n left join
t
on n.n in (t.val1, t.val2, t.val3, t.val4, t.val5)
group by n.n;
If the numbers can be repeated within a row, then the above only counts the row once (it really counts matching rows rather than matching values). If you want them counted separately, then unpivot. For this, I recommend apply:
select n.n, count(v.val) counter
from n left join
(t cross apply
(values (t.val1), (t.val2), (t.val3), (t.val4), (t.val5)
) v(val)
)
on n.n = v.val
group by n.n;

SQL Dynamic Self Cross Join using TSql

I have a table that has data like this
Group Value
1 A
1 B
1 C
2 F
2 G
3 J
3 K
I want to join all members of one group to all members of each of the other groups into a single column like this:
AFJ
AFK
AGJ
AGK
BFJ
BFK
BGJ
BGK
CFJ
CFK
CGJ
CGK
There can be n number of groups and n number of values
Thank you
SQL does not offer many options for such a query. The one standard method is a recursive CTE. Other methods would involve recursive functions or stored procedures.
The following is an example of a recursive CTE that solves this problem:
with groups as (
select v.*
from (values (1, 'a'), (1, 'b'), (1, 'c'), (2, 'f'), (2, 'g'), (3, 'j'), (3, 'k')
) v(g, val)
),
cte as (
select 1 as groupid, cast(val as varchar(max)) as val, 1 as lev -- cast so the anchor and recursive column types match in SQL Server
from groups
where g = 1
union all
select cte.groupid + 1, cte.val + g.val, lev + 1
from cte join
groups g
on g.g = cte.groupid + 1
)
select val
from (select cte.*, max(lev) over () as max_lev
from cte
) cte
where lev = max_lev
order by 1;
Some databases that support recursive CTEs don't use the recursive keyword.
Here is a db<>fiddle.
If you need to handle gaps between the groups (for example group 1, no group 2, then group 3), number the distinct groups first. The following uses Oracle syntax:
with t(Grp, val) as (
select 1, 'A' from dual
union all
select 1, 'B' from dual
union all
select 1, 'C' from dual
union all
select 2, 'F' from dual
union all
select 2, 'G' from dual
union all
select 3, 'J' from dual
union all
select 3, 'K' from dual
)
, grpIndexes as (
SELECT ROWNUM indx
, grp
FROM (select distinct
grp
from t)
)
, Grps as (
select t.*
, grpIndexes.indx
from t
inner join grpIndexes
on grpIndexes.grp = t.grp
)
, rt(val, indx, lvl) as (
select val
, indx
, 0 as lvl
from Grps
where indx = (select min(indx) from Grps)
union all
select previous.val || this.val as val
, this.indx
, previous.lvl + 1 as lvl
from rt previous
, Grps this
where this.indx = (select min(indx) from Grps where indx > previous.indx)
and previous.indx <> (select max(indx) from Grps)
)
select val
from rt
where lvl = (select max(lvl) from rt)
order by val
;
Note that I renamed the columns because Group and Value are reserved words.
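What both recursive queries compute is the Cartesian product of the groups, taken in group order, with the values of each combination concatenated. As a cross-check of the expected result, here is a short Python sketch using `itertools.product`. The function name `combine_groups` is hypothetical, and the sample rows deliberately skip group 3 to show that gaps in the group numbers do not matter.

```python
from itertools import product
from typing import Dict, Iterable, List, Tuple


def combine_groups(rows: Iterable[Tuple[int, str]]) -> List[str]:
    """Concatenate one value from every group, for all combinations."""
    groups: Dict[int, List[str]] = {}
    for group_id, value in rows:
        groups.setdefault(group_id, []).append(value)
    if not groups:
        return []
    # Order the groups by id; gaps (a missing group number) are simply skipped.
    ordered = [groups[group_id] for group_id in sorted(groups)]
    return ["".join(combo) for combo in product(*ordered)]


# Same values as in the question, but the last group is numbered 4, not 3.
rows = [(1, "A"), (1, "B"), (1, "C"), (2, "F"), (2, "G"), (4, "J"), (4, "K")]

combined = combine_groups(rows)
print(len(combined))  # 12
print(combined[:4])   # ['AFJ', 'AFK', 'AGJ', 'AGK']
```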

Bigquery- Struct format

WITH yourTable AS (
SELECT 1 AS id, '2013,1625,1297,7634' AS string_col UNION ALL
SELECT 2, '1,2,3,4,5'
)
SELECT id,
(SELECT ARRAY_AGG(CAST(num AS INT64))
FROM UNNEST(SPLIT(string_col)) AS num
) AS num,
ARRAY(SELECT CAST(num AS INT64)
FROM UNNEST(SPLIT(string_col)) AS num
) AS num_2
FROM yourTable
This is exactly how my actual table is designed. Now I would like to multiply num * num_2 element by element and then sum the products. Is there a way to get this into a struct format like id, nums.num, nums.num_2, so that I can simply multiply and get the required result?
PS: I am looking for a solution in the SELECT statement above, not within the WITH clause.
OK, assuming that you really have a reason to keep your table the way it is (see my comment on your question), the query below should work:
#standardSQL
SELECT id,
(
SELECT SUM(num * num_2)
FROM (SELECT pos, num FROM UNNEST(num) num WITH OFFSET pos) a
JOIN (SELECT pos_2, num_2 FROM UNNEST(num_2) num_2 WITH OFFSET pos_2) b
ON a.pos = b.pos_2
) mul
FROM yourTable
You can test it with the query below:
#standardSQL
WITH yourTable AS (
SELECT 1 id, [2013,1625,1297,7634] num, [2013,1625,1297,7634] num_2 UNION ALL
SELECT 2, [1,2,3,4,5], [1,2,3,4,5]
)
SELECT id,
(
SELECT SUM(num * num_2)
FROM (SELECT pos, num FROM UNNEST(num) num WITH OFFSET pos) a
JOIN (SELECT pos_2, num_2 FROM UNNEST(num_2) num_2 WITH OFFSET pos_2) b
ON a.pos = b.pos_2
) mul
FROM yourTable
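The join on `WITH OFFSET` positions pairs up the elements of the two arrays index by index, so `mul` is the dot product of `num` and `num_2`. For a quick sanity check of the numbers, the same computation can be sketched in Python. The helper names are hypothetical and the strings are the sample `string_col` values from the question; `zip` stops at the shorter list, much as the inner join drops positions that exist in only one array.

```python
from typing import Dict, List


def parse_numbers(csv_text: str) -> List[int]:
    """Split a comma-separated string into integers, like SPLIT + CAST."""
    return [int(part) for part in csv_text.split(",")]


def dot(a: List[int], b: List[int]) -> int:
    """Multiply elements at the same position and sum the products."""
    return sum(x * y for x, y in zip(a, b))


# Sample rows from the question (id -> string_col).
your_table: Dict[int, str] = {
    1: "2013,1625,1297,7634",
    2: "1,2,3,4,5",
}

# In the question both arrays come from the same string, so num == num_2.
mul = {
    row_id: dot(parse_numbers(text), parse_numbers(text))
    for row_id, text in your_table.items()
}
print(mul[2])  # 55
```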

Select query select based on a priority

Someone please change my title to better reflect what I am trying to ask.
I have a table like
Table (id, value, value_type, data)
ID is NOT unique. There is no unique key.
value_type has two possible values, let's say A and B.
Type B is better than A, but often not available.
For each id if any records with value_type B exists, I want all the records with that id and value_type B.
If no record for that id with value_Type B exists I want all records with that id and value_type A.
Notice that if B exists for that id I don't want records with type A.
I currently do this with a series of temp tables. Is there a single select statement (sub queries OK) that can do the job?
Thanks so much!
Additional details:
SQL Server 2005
RANK, rather than ROW_NUMBER, because you want ties (those with the same B value) to have the same rank value:
WITH summary AS (
SELECT t.*,
RANK() OVER (PARTITION BY t.id
ORDER BY t.value_type DESC) AS rank
FROM TABLE t
WHERE t.value_type IN ('A', 'B'))
SELECT s.id,
s.value,
s.value_type,
s.data
FROM summary s
WHERE s.rank = 1
Non CTE version:
SELECT s.id,
s.value,
s.value_type,
s.data
FROM (SELECT t.*,
RANK() OVER (PARTITION BY t.id
ORDER BY t.value_type DESC) AS rank
FROM TABLE t
WHERE t.value_type IN ('A', 'B')) s
WHERE s.rank = 1
WITH test AS (
SELECT 1 AS id, 'B' AS value_type
UNION ALL
SELECT 1, 'B'
UNION ALL
SELECT 1, 'A'
UNION ALL
SELECT 2, 'A'
UNION ALL
SELECT 2, 'A'),
summary AS (
SELECT t.*,
RANK() OVER (PARTITION BY t.id
ORDER BY t.value_type DESC) AS rank
FROM test t)
SELECT *
FROM summary
WHERE rank = 1
I get:
id value_type rank
----------------------
1 B 1
1 B 1
2 A 1
2 A 1
SELECT *
FROM table
WHERE value_type = 'B'
UNION ALL
SELECT *
FROM table
WHERE ID not in (SELECT distinct id
FROM table
WHERE value_type = 'B')
The shortest query to do the job I can think of:
SELECT TOP 1 WITH TIES *
FROM #test
ORDER BY Rank() OVER (PARTITION BY id ORDER BY value_type DESC)
This uses about 50% more CPU than OMG Ponies' and Christoperous 5000's solutions, but does the same number of reads. The extra sort is what costs the additional CPU.
The best-performing original query I've come up with so far is:
SELECT *
FROM #test
WHERE value_type = 'B'
UNION ALL
SELECT *
FROM #test T1
WHERE NOT EXISTS (
SELECT *
FROM #test T2
WHERE
T1.id = T2.id
AND T2.value_type = 'B'
)
This consistently beats all the others presented on CPU by about 1/3rd (the others are about 50% more) but has 3x the number of reads. The duration on this query is often 2/3rds the time of all the others. I consider it a good contender.
Indexes and data types could change everything.
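To confirm that the different formulations pick the same rows, the sketch below runs the NOT EXISTS query and a MAX(value_type) join against a small in-memory SQLite table from Python. SQLite is only a stand-in for SQL Server 2005 here, and the table name `test` and its rows are illustrative sample data; it says nothing about relative performance.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE test (id INT, value TEXT, value_type TEXT, data INT);
    INSERT INTO test VALUES
        (1, 'X', 'A', 1), (1, 'X', 'A', 2), (1, 'X', 'A', 3), (1, 'X', 'A', 4),
        (2, 'X', 'A', 5), (2, 'X', 'B', 6), (2, 'X', 'B', 7),
        (2, 'X', 'A', 8), (2, 'X', 'A', 9);
""")

# All B rows, plus every row whose id has no B row at all.
not_exists_form = """
    SELECT * FROM test WHERE value_type = 'B'
    UNION ALL
    SELECT * FROM test AS t1
    WHERE NOT EXISTS (
        SELECT 1 FROM test AS t2
        WHERE t2.id = t1.id AND t2.value_type = 'B'
    )
"""

# Relies on 'B' sorting after 'A', so MAX(value_type) is the preferred type.
max_join_form = """
    SELECT x.* FROM test AS x
    JOIN (SELECT id, MAX(value_type) AS value_type FROM test GROUP BY id) AS y
        ON x.id = y.id AND x.value_type = y.value_type
"""

picked_not_exists = sorted(conn.execute(not_exists_form).fetchall())
picked_max_join = sorted(conn.execute(max_join_form).fetchall())
print([row[3] for row in picked_not_exists])  # [1, 2, 3, 4, 6, 7]
```

Id 1 has no B rows, so all four of its A rows are returned; id 2 has B rows, so only those two come back.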
declare @test as table(
id int , value [nvarchar](255),value_type [nvarchar](255),data int)
INSERT INTO @test
SELECT 1, 'X', 'A',1 UNION
SELECT 1, 'X', 'A',2 UNION
SELECT 1, 'X', 'A',3 UNION
SELECT 1, 'X', 'A',4 UNION
SELECT 2, 'X', 'A',5 UNION
SELECT 2, 'X', 'B',6 UNION
SELECT 2, 'X', 'B',7 UNION
SELECT 2, 'X', 'A',8 UNION
SELECT 2, 'X', 'A',9
SELECT * FROM @test x
INNER JOIN
(SELECT id, MAX(value_type) as value_type FROM
@test GROUP BY id) as y
ON x.id = y.id AND x.value_type = y.value_type
Try this (MSSQL).
Select id, value_typeB, null
from myTable
where value_typeB is not null
Union All
Select id, null, value_typeA
from myTable
where value_typeB is null and value_typeA is not null
Perhaps something like this:
select * from mytable
where value_type = 'B' -- only the B rows; A rows of the same id are not wanted
union all
select * from mytable
where id in (select distinct id from mytable where value_type = 'A')
and id not in (select distinct id from mytable where value_type = 'B')
This uses a union, combining all records of value B with all records that have only A values:
SELECT *
FROM mainTable
WHERE value_type = 'B'
UNION ALL -- UNION ALL keeps duplicate rows, which matter because there is no unique key
SELECT *
FROM mainTable
WHERE value_type = 'A'
AND id NOT IN(SELECT id
FROM mainTable
WHERE value_type = 'B');