How to compute cosine similarity between two texts in Presto? - sql

Hello everyone: I wanted to use COSINE_SIMILARITY in Presto SQL to compute the similarity between two texts. Unfortunately, COSINE_SIMILARITY does not take texts as inputs; it takes maps instead. I am not sure how to convert the texts into those maps in Presto. I want the following: if we have a table like this:
id | text1 | text2
1  | a b b | b c
Then we can compute the cosine similarity as:
COSINE_SIMILARITY(
MAP(ARRAY['a', 'b', 'c'], ARRAY[1, 2, 0]),
MAP(ARRAY['a', 'b', 'c'], ARRAY[0, 1, 1])
)
i.e., the two texts combined have three distinct words: 'a', 'b', and 'c'; text1 has 1 count of 'a', 2 counts of 'b', and 0 counts of 'c', which goes in the first MAP; similarly, text2 has 0 counts of 'a', 1 count of 'b', and 1 count of 'c', which goes in the second MAP.
The final table should look like this:
id | text1 | text2 | all_unique_words | map1      | map2      | similarity
1  | a b b | b c   | [a, b, c]        | [1, 2, 0] | [0, 1, 1] | 0.63
How can we convert two texts into two such maps in Presto? Thanks in advance!

Use split to transform each string into an array and then, depending on your Presto version, either use the unnest + histogram trick or array_frequency:
-- sample data
with dataset(id, text1, text2) as (values (1, 'a b b', 'b c'))
-- query
select id, COSINE_SIMILARITY(histogram(t1), histogram(t2))
from dataset,
unnest (split(text1, ' '), split(text2, ' ')) as t(t1, t2) -- parallel unnest; the shorter array is padded with NULLs, which histogram ignores
group by id;
Output:
id | _col1
1  | 0.6324555320336759
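With a newer Presto that has array_frequency, the word-count maps can be built directly from the split arrays, so the unnest and group by can be dropped (a sketch, assuming your version ships that function):
-- sample data
with dataset(id, text1, text2) as (values (1, 'a b b', 'b c'))
-- array_frequency turns each word array into a word -> count map
select id, COSINE_SIMILARITY(array_frequency(split(text1, ' ')),
                             array_frequency(split(text2, ' '))) as similarity
from dataset;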

Related

How to get all overlapping (ordered) 3-tuples from an array in BigQuery

Given a table like the following
elems
['a', 'b', 'c', 'd', 'e']
['v', 'w', 'x', 'y']
I'd like to transform it into something like this:
tuple
['a', 'b', 'c']
['b', 'c', 'd']
['c', 'd', 'e']
['v', 'w', 'x']
['w', 'x', 'y']
I.e., I'd like to get all overlapping 3-tuples.
My current attempt looks as follows:
WITH foo AS (
SELECT ['a', 'b', 'c', 'd', 'e'] AS elems UNION ALL
SELECT ['v', 'w', 'x', 'y']),
single AS (
SELECT * FROM
foo,
UNNEST(elems) elem
),
tuples AS (
SELECT ARRAY_AGG(elem) OVER (ROWS BETWEEN 2 PRECEDING AND 0 FOLLOWING) AS tuple
FROM single
)
SELECT * FROM tuples
WHERE ARRAY_LENGTH(tuple) >= 3
But the problem is, it returns some unwanted rows too, i.e., the ones that are "between" the original rows from the foo table.
tuple
['a', 'b', 'c']
['b', 'c', 'd']
['c', 'd', 'e']
['d', 'e', 'v'] <--- unwanted
['e', 'v', 'w'] <--- unwanted
['v', 'w', 'x']
['w', 'x', 'y']
Also, is it guaranteed that the order of rows in single is correct, or does it only work in my minimal example by chance, because of the low cardinality? (I guess there may be a simple solution without this step in between.)
Consider the approach below:
select [elems[offset(index - 1)], elems[offset(index)], elems[offset(index + 1)]] as tuple
from your_table, unnest([array_length(elems)]) len, -- bind the array length once
unnest(generate_array(1, len - 2)) index -- index walks over the middle element of each 3-tuple
If applied to the sample data in your question, the output is the five desired tuples:
tuple
['a', 'b', 'c']
['b', 'c', 'd']
['c', 'd', 'e']
['v', 'w', 'x']
['w', 'x', 'y']
You might consider the query below.
Also, is it guaranteed that the order of rows in single is correct, or does it only work in my minimal example by chance, because of the low cardinality?
afaik, it's not guaranteed without explicitly using WITH OFFSET in the query.
WITH foo AS (
SELECT ['a', 'b', 'c', 'd', 'e'] AS elems UNION ALL
SELECT ['v', 'w', 'x', 'y']),
single AS (
SELECT * FROM
foo,
UNNEST(elems) elem WITH OFFSET
),
tuples AS (
SELECT ARRAY_AGG(elem) OVER (PARTITION BY FORMAT('%t', elems) ORDER BY offset ROWS BETWEEN 2 PRECEDING AND 0 FOLLOWING) AS tuple
FROM single
)
SELECT * FROM tuples
WHERE ARRAY_LENGTH(tuple) >= 3;
Just to give you another idea:
create temp function slice(arr ARRAY<string>, pos float64, len float64)
returns array<string> language js as
r"return arr.slice(pos, pos + len);";
select slice(elems, index, 3) as tuple
from foo, unnest([array_length(elems)]) len,
unnest(generate_array(0, len - 3)) index
leaving it up to you to refactor the above query to the point where it looks something like
select tuple
from foo, unnest(slices(elems, 3)) as tuple
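One possible shape for that slices helper (just a sketch; note that BigQuery does not allow ARRAY<ARRAY<...>>, so each slice is wrapped in a STRUCT and the query unnests to s.tuple):
create temp function slices(arr ARRAY<STRING>, len FLOAT64)
returns ARRAY<STRUCT<tuple ARRAY<STRING>>> language js as
r"""
var out = [];
for (var i = 0; i + len <= arr.length; i++)
  out.push({tuple: arr.slice(i, i + len)});  // one overlapping window per start offset
return out;
""";
select s.tuple
from foo, unnest(slices(elems, 3)) as s;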

Extract last N elements of an array in SQL (hive)

I have a column with arrays and I want to extract the last X elements of each array.
Example, trying to extract the last two elements:
Column A
['a', 'b', 'c']
['d', 'e']
['f', 'g', 'h', 'i']
Expected output:
Column A
['b', 'c']
['d', 'e']
['h', 'i']
Best case scenario would be to do it without using a UDF
One method, using reverse, explode, filtering, and re-assembling the array again:
with your_table as (
select stack (4,
0, array(), --empty array to check it works if no elements or less than n
1, array('a', 'b', 'c'),
2, array('d', 'e'),
3, array('f', 'g', 'h', 'i')
) as (id, col_A)
)
select s.id, collect_list(s.value) as col_A
from
(select s.id, a.value, a.pos
from your_table s
lateral view outer posexplode(split(reverse(concat_ws(',',s.col_A)),',')) a as pos, value -- note: string reverse also reverses the characters inside each element, so this works as-is only for single-character elements
where a.pos between 0 and 1 --last two (use n-1 instead of 1 if you want last n)
distribute by s.id sort by a.pos desc --keep original order
)s
group by s.id
Result:
s.id col_a
0 []
1 ["b","c"]
2 ["d","e"]
3 ["h","i"]
A more elegant way uses the Brickhouse numeric_range UDF, described in this answer.

how to groupby statement in pandas other than crosstab

Let's say there is a dataframe
df = pd.DataFrame( {'col1': ['a', 'a', 'a', 'b', 'b'], 'col2': ['x', 'x', 'y', 'y', 'y']} )
I want to show this in a table indexed by col1 with one column per col2 value, so that for index a column x is 2 and y is 1, just like in the 1st attached picture (https://imgur.com/MiWmdIz).
I used crosstab but I am getting two separate rows, like in the 2nd attached picture (https://imgur.com/WiJWT15).
You can use:
df = df.set_index('col1')
Here is the result:
     col2
col1
a       x
a       x
a       y
b       y
b       y
Same as your first attached picture.
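If the goal is the count table from the first picture, a groupby-based sketch (my assumption about the desired layout) would be:
import pandas as pd

df = pd.DataFrame({'col1': ['a', 'a', 'a', 'b', 'b'],
                   'col2': ['x', 'x', 'y', 'y', 'y']})

# count each (col1, col2) pair, then pivot the col2 values into columns
counts = df.groupby(['col1', 'col2']).size().unstack(fill_value=0)
print(counts)
# col2  x  y
# col1
# a     2  1
# b     0  2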

Idiomatic equivalent to map structure

My analytics involves the need to aggregate rows and to store, for each group, the number of occurrences of each value of a field someField across all of the group's rows.
Sample data structure
[someField, someKey]
I'm trying to GROUP BY someKey and then be able to know, for each of the results, how many times each someField value occurred
Example:
[someField: a, someKey: 1],
[someField: a, someKey: 1],
[someField: b, someKey: 1],
[someField: c, someKey: 2],
[someField: d, someKey: 2]
What I would like to achieve:
[someKey: 1, fields: {a: 2, b: 1}],
[someKey: 2, fields: {c: 1, d: 1}],
Does it work for you?
WITH data AS (
select 'a' someField, 1 someKey UNION all
select 'a', 1 UNION ALL
select 'b', 1 UNION ALL
select 'c', 2 UNION ALL
select 'd', 2)
SELECT
someKey,
ARRAY_AGG(STRUCT(someField, freq)) fields
FROM(
SELECT
someField,
someKey,
COUNT(someField) freq
FROM data
GROUP BY 1, 2
)
GROUP BY 1
Results:
someKey | fields
1       | [{someField: a, freq: 2}, {someField: b, freq: 1}]
2       | [{someField: c, freq: 1}, {someField: d, freq: 1}]
It won't give exactly the format you are looking for, but it should serve the same queries your desired result would: as you said, for each key you can retrieve how many times (column freq) each someField value appeared.
I've been looking for a way to aggregate structs directly and couldn't find one. But retrieving the results as an ARRAY of STRUCTs turned out to be quite straightforward.
There's probably a smarter way to do this (and get it in the format you want e.g. using an Array for the 2nd column), but this might be enough for you:
with sample as (
select 'a' as someField, 1 as someKey UNION all
select 'a' as someField, 1 as someKey UNION ALL
select 'b' as someField, 1 as someKey UNION ALL
select 'c' as someField, 2 as someKey UNION ALL
select 'd' as someField, 2 as someKey)
SELECT
someKey,
SUM(IF(someField = 'a', 1, 0)) AS a,
SUM(IF(someField = 'b', 1, 0)) AS b,
SUM(IF(someField = 'c', 1, 0)) AS c,
SUM(IF(someField = 'd', 1, 0)) AS d
FROM
sample
GROUP BY
someKey ORDER BY someKey ASC
Results:
someKey  a  b  c  d
-------------------
1        2  1  0  0
2        0  0  1  1
This is a well-used technique in BigQuery (see here).
I'm trying to GROUP BY someKey and then be able to know, for each of the results, how many times each someField value occurred
#standardSQL
SELECT
someKey,
someField,
COUNT(someField) freq
FROM yourTable
GROUP BY 1, 2
-- ORDER BY someKey, someField
What I would like to achieve:
[someKey: 1, fields: {a: 2, b: 1}],
[someKey: 2, fields: {c: 1, d: 1}],
This is different from what you expressed in words - it is called pivoting, and based on your comment - the a, b, c, and d keys are potentially infinite - it is most likely not what you need. At the same time, pivoting is easily doable too (if you have some finite number of field values) and you can find plenty of related posts.

Spark SQL: column values can only be a combination of A,T,G,C or N

I'm trying to query a spark table to find all rows in the 'ref' column that contain letters that are not A, T, G, C or N.
A valid result should only contain those letters, and can contain any length or combination of those letters.
For example:
Valid = AA, ATTTGGGGCCCC, C, G, TTG, N, etc.
Invalid = P, ., NULL
The following query is returning rows with single nucleotides only:
SELECT ref
from test_set
where ref not in ('*A*', '*T*', '*G*', '*C*', '*N*')
ref
1 T
2 C
3 T
4 C
5 T
The following query works in Impala SQL, but not in Spark, and is also pretty ugly:
SELECT regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(ref, 'A', ''), 'T', ''), 'G', ''), 'C', ''), 'N', '')
from spark_df
Ok.. I figured it out:
SELECT regexp_extract(ref, '[^ATGCN]', 0) -- extracts the first disallowed character ('' when the row is valid)
from test_set
Or
SELECT alt
FROM test_set
WHERE regexp_extract( alt, '([^ACGTN.])', 0 ) = '' -- Spark returns '' (not NULL) when nothing matches
If you did not want to use regexp_extract, a comparable result is obtainable by performing:
SELECT ref
from test_set
where not (
ref like '%A%' or
ref like '%T%' or
ref like '%C%' or
ref like '%G%' or
ref like '%N%'
)
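A simpler way to express the check in Spark SQL, assuming the goal is to flag rows whose ref contains any character outside A, T, G, C, N (a sketch, not from the original answers):
-- rows containing at least one disallowed character (NULL refs are also invalid per the question)
SELECT ref
FROM test_set
WHERE ref RLIKE '[^ATGCN]' OR ref IS NULL;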