Postgres similarity operator with ANY clause - limit for each parameter?

I have a query:
SELECT
word,
similarity(word, 'foo') as "similarity"
FROM words
WHERE word ~* 'foo'
ORDER BY similarity DESC
LIMIT 5
which gets called once for each word in a search. For example, searching for 'oxford university england' would call this query 3 times.
I would now like to write a query that can find words similar to all 3 in one trip to the database, and so far I have something like this:
SELECT set_limit(0.4);
SELECT word
FROM words
WHERE word % ANY(ARRAY['foo', 'bar', 'baz']);
However, I can't think of a way to say 'give me 5 results for each word'.
Is there a way to write this? Thanks.

Unnest the array of words you're searching for and join on the % condition instead of applying it in WHERE, then number rows for each search word and filter on that number:
SELECT subq.search_word
     , subq.word
FROM (SELECT srch.word AS search_word
           , wrd.word
           , ROW_NUMBER() OVER (PARTITION BY srch.word
                                ORDER BY similarity(wrd.word, srch.word) DESC) AS rn
      FROM words wrd
      JOIN UNNEST(ARRAY['foo', 'bar', 'baz']) AS srch(word)
        ON wrd.word % srch.word) subq
WHERE subq.rn <= 5
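If you'd rather not rely on a session-level set_limit() call, a variant of the same query (just a sketch) can apply the 0.4 threshold explicitly with similarity():
SELECT subq.search_word, subq.word
FROM (SELECT srch.word AS search_word
           , wrd.word
           , ROW_NUMBER() OVER (PARTITION BY srch.word
                                ORDER BY similarity(wrd.word, srch.word) DESC) AS rn
      FROM words wrd
      JOIN UNNEST(ARRAY['foo', 'bar', 'baz']) AS srch(word)
        ON similarity(wrd.word, srch.word) >= 0.4) subq
WHERE subq.rn <= 5;
Note, though, that the % operator can use a trigram (GIN/GiST) index while an explicit similarity() comparison generally cannot, so the set_limit() + % version is usually the faster choice on large tables.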

Related

Count the number of times a word appears in a BigQuery column

I have a column with some long strings and need to count the most-used words in it.
I need something that works like this: https://towardsdatascience.com/very-simple-python-script-for-extracting-most-common-words-from-a-story-1e3570d0b9d0 (the word-counting part, at least).
It is also very important that I have the option to blacklist some words so they don't count.
Try the simple approach below:
with blacklist as (
  select 'with' word union all
  select 'that' union all
  select 'add more as you see needed'
)
select word, count(*) frequency
from data, unnest(regexp_extract_all(lower(col), r'\w+')) word
where length(word) > 3
  and word not in (select word from blacklist)
group by word
order by frequency desc
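To sanity-check it, you can fake the data table with a CTE (the table and column names here are just placeholders for your own):
with data as (
  select 'The quick brown fox jumps over that lazy brown dog' col
),
blacklist as (
  select 'that' word
)
select word, count(*) frequency
from data, unnest(regexp_extract_all(lower(col), r'\w+')) word
where length(word) > 3
  and word not in (select word from blacklist)
group by word
order by frequency desc
This should return brown with a frequency of 2, and quick, jumps, over and lazy with 1 each.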

Postgresql - Count number of instances of substring in results of ILIKE query

I have a query like this that returns rows of text matching the given search term:
SELECT content
FROM messages
WHERE created_at >= '2021-07-24T20:17:18.141Z'
AND created_at <= '2021-07-31T20:11:20.542Z'
AND content ILIKE '%Search Term%';
Right now, if I just count the rows, it returns the number of messages that contain the search term. However, I'd like to count the number of instances of the search term, rather than the number of rows that contain it.
I thought of making a stored function that looped through the results of the above query and counted the instances, but it ended up being insanely slow. I'm okay with it being pretty slow, but is there a solution that either runs slightly faster or doesn't require a function?
Try this:
SELECT SUM(CASE WHEN content ILIKE '%Search Term%' THEN 1 ELSE 0 END)
FROM messages
WHERE created_at >= '2021-07-24T20:17:18.141Z'
  AND created_at <= '2021-07-31T20:11:20.542Z';
You can use regexp_matches():
select count(*)
from messages m cross join lateral
regexp_matches(m.content, $search_string, 'g')
Note: This assumes that $search_string doesn't contain regular expression special characters.
If you want the count on each row, you can use:
select m.*,
(select count(*)
from regexp_matches(m.content, $search_string, 'g')
) as num_matches
from messages m;
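One caveat: the original query uses ILIKE, which is case-insensitive, while regexp_matches() is case-sensitive by default. To mirror the ILIKE behaviour, add the 'i' flag:
select count(*)
from messages m cross join lateral
     regexp_matches(m.content, $search_string, 'gi')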
Now I understand your question properly. You can use regexp_matches() in a scalar subquery to get what you want:
SELECT content,
       (SELECT count(*) FROM regexp_matches(content, 'Search Term', 'g')) AS search_count
FROM messages
WHERE created_at >= '2021-07-24T20:17:18.141Z'
  AND created_at <= '2021-07-31T20:11:20.542Z'
  AND content ILIKE '%Search Term%';
Note: the 'g' flag returns every match in the string, not just the first.
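If the search term may contain regex metacharacters, a regex-free alternative (just a sketch) is to count instances by comparing string lengths before and after stripping the term:
SELECT SUM((length(content) - length(replace(lower(content), lower('Search Term'), '')))
           / length('Search Term')) AS total_instances
FROM messages
WHERE created_at >= '2021-07-24T20:17:18.141Z'
  AND created_at <= '2021-07-31T20:11:20.542Z'
  AND content ILIKE '%Search Term%';
Lower-casing both sides keeps the count consistent with the case-insensitive ILIKE filter.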

How to identify stopwords with BigQuery?

I'm looking at reddit comments. I'm using some common stopword lists, but I want to create a custom one for this dataset. How can I do this with SQL?
One approach to identifying stopwords is to look at the ones that show up in the most documents.
Steps in this query:
Filter posts for relevancy, quality (choose your subreddits, choose a minimum score, choose a minimum length).
Unescape reddit HTML encoded values.
Decide what counts as a word (in this case r'[a-z]{1,20}\'?[a-z]+').
Each word counts only once per doc (comment), regardless of how many times it's repeated in each comment.
Get the top x words by counting how many documents they show up in.
Query:
#standardSQL
WITH words_by_post AS (
SELECT CONCAT(link_id, '/', id) id, REGEXP_EXTRACT_ALL(
REGEXP_REPLACE(REGEXP_REPLACE(LOWER(body), '&amp;', '&'), r'&[a-z]{2,4};', '*')
, r'[a-z]{1,20}\'?[a-z]+') words
FROM `fh-bigquery.reddit_comments.2017_07`
WHERE body NOT IN ('[deleted]', '[removed]')
AND subreddit IN ('AskReddit', 'funny', 'movies')
AND score > 100
), words_per_doc AS (
SELECT id, word
FROM words_by_post, UNNEST(words) word
WHERE ARRAY_LENGTH(words) > 20
GROUP BY id, word
)
SELECT word, COUNT(*) docs_with_word
FROM words_per_doc
GROUP BY 1
ORDER BY docs_with_word DESC
LIMIT 100
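If you then save the result as a table (the my_dataset.my_stopwords name below is just a placeholder), you can filter those words out of later analyses:
#standardSQL
SELECT word, COUNT(*) c
FROM `my_dataset.comments`, UNNEST(REGEXP_EXTRACT_ALL(LOWER(body), r'[a-z]+')) word
WHERE word NOT IN (SELECT word FROM `my_dataset.my_stopwords`)
GROUP BY word
ORDER BY c DESC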

SQL group by number and replace characters

I have data stored in my database for mobile numbers.
I want to group by the column number in the database.
For example, some numbers may show as 44123456789 and 0123456789, which are the same number. How can I group these together?
SELECT DIGITS(column_name) FROM table_name
You can apply this format in the database, assign the result to a variable, and then match the digits against the other numbers.
Not sure it really suits you, but you could build this kind of subquery:
SELECT ta.`phone_nbr`,
       COALESCE(list.`normalized_nbr`, ta.`phone_nbr`) AS nbr
FROM (
    SELECT t.`phone_nbr`,
           SUBSTRING(t.`phone_nbr`, 2) AS normalized_nbr
    FROM `your_table` t
    WHERE LEFT(t.`phone_nbr`, 1) = '0'
    UNION
    SELECT t.`phone_nbr`,
           sub.`filter_nbr` AS normalized_nbr
    FROM `your_table` t,
         (SELECT SUBSTRING(t2.`phone_nbr`, 2) AS filter_nbr
          FROM `your_table` t2
          WHERE LEFT(t2.`phone_nbr`, 1) = '0') sub
    WHERE LEFT(t.`phone_nbr`, 1) != '0'
      AND t.`phone_nbr` LIKE CONCAT('%', sub.`filter_nbr`)
) list
LEFT OUTER JOIN `your_table` ta
    ON ta.`phone_nbr` = list.`phone_nbr`
It will return a list of phone numbers with their "normalized" number, i.e. with the 0 or international prefix removed if there is a duplicate match, and the raw number otherwise.
You can then use a GROUP BY clause on the nbr field, and join on phone_nbr for the rest of your query.
It has some limits: it can unfortunately group distinct numbers whose stripped forms collide. +49123456789, +44123456789 and 0123456789 will all end up with the same normalized number.
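If your numbering plan lets you assume the significant part of each number is its last nine digits (an assumption you'd need to verify for your data), a much simpler sketch is to group on a fixed-length suffix:
SELECT RIGHT(`phone_nbr`, 9) AS nbr_key,
       COUNT(*) AS cnt
FROM `your_table`
GROUP BY RIGHT(`phone_nbr`, 9)
Both 44123456789 and 0123456789 end in 123456789, so they fall into the same group.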

Computing a moving maximum in BigQuery

Given a BigQuery table with some ordering, and some numbers, I'd like to compute a "moving maximum" of the numbers -- similar to a moving average, but for a maximum instead. From "Trying to calculate EMA (exponential moving average) using BigQuery" it seems like the best way to do this is by using LEAD() and then doing the aggregation myself. ("Bigquery moving average" suggests essentially a CROSS JOIN, but that seems like it would be quite slow, given the size of the data.)
Ideally, I might be able to just return a single repeated field, rather than 20 individual fields, from the inner query, and then use normal aggregation over the repeated field, but I haven't figured out a way to do that, so I'm stuck with rolling my own aggregation. While this is easy enough for a sum or average, computing the max inline is pretty tricky, and I haven't figured out a good way to do it.
(The examples below are of course somewhat contrived in order to use public datasets. They also do the rolling max over 3 elements, whereas I'd like to do it for around 20. I'm already generating the query programmatically, so making it short isn't a big issue.)
One approach is to do the following:
SELECT word,
       (CASE WHEN word_count >= word_count_1 AND word_count >= word_count_2 THEN word_count
             WHEN word_count_1 >= word_count AND word_count_1 >= word_count_2 THEN word_count_1
             ELSE word_count_2
        END) AS max_count
FROM (
  SELECT word, word_count,
         LEAD(word_count, 1) OVER (ORDER BY word) AS word_count_1,
         LEAD(word_count, 2) OVER (ORDER BY word) AS word_count_2
  FROM [publicdata:samples.shakespeare]
  WHERE corpus = 'macbeth'
)
This is O(n^2), but it at least works. I could also do a nested chain of IFs, like this:
SELECT word,
IF(word_count >= word_count_1,
IF(word_count >= word_count_2, word_count, word_count_2),
IF(word_count_1 >= word_count_2, word_count_1, word_count_2)) AS max_count
FROM ...
This is O(n) to evaluate, but the query size is exponential in n, so I don't think it's a good option; certainly it would surpass the BigQuery query size limit for n=20. I could also do n nested queries:
SELECT word,
IF(word_count_2 >= max_count, word_count_2, max_count) AS max_count
FROM (
SELECT word,
IF(word_count_1 >= word_count, word_count_1, word_count) AS max_count
FROM ...
)
It seems like doing 20 nested queries might not be a great idea performance-wise, though.
Is there a good way to do this kind of query? If not, am I correct that for n around 20, the first is the least bad?
A trick I use for rolling windows: CROSS JOIN with a table of numbers. In this case, to get a moving window of 3 years, I cross join with the numbers 0, 1, 2. Then you can create an id for each group (ending_at_year == year - i) and group by that.
SELECT ending_at_year, MAX(mean_temp) max_temp, COUNT(DISTINCT year) c
FROM (
  SELECT mean_temp, year - i ending_at_year, year
  FROM [publicdata:samples.gsod] a
  CROSS JOIN
    (SELECT i FROM [fh-bigquery:public_dump.numbers_255] WHERE i < 3) b
  WHERE station_number = 722860
)
GROUP BY ending_at_year
HAVING c = 3
ORDER BY ending_at_year;
Here is another way to do what you're trying to achieve; see the query below.
SELECT word, MAX(words)
FROM
  (SELECT word, word_count AS words
   FROM [publicdata:samples.shakespeare]
   WHERE corpus = 'macbeth'),
  (SELECT word, LEAD(word_count, 1) OVER (ORDER BY word) AS words
   FROM [publicdata:samples.shakespeare]
   WHERE corpus = 'macbeth'),
  (SELECT word, LEAD(word_count, 2) OVER (ORDER BY word) AS words
   FROM [publicdata:samples.shakespeare]
   WHERE corpus = 'macbeth')
GROUP BY word
ORDER BY word
You can try it and compare its performance with your approach (I haven't measured it myself).
There's an example of creating a moving average using a window function in the docs here.
Quoting:
The following example calculates a moving average of the values in the current row and the row preceding it. The window frame comprises two rows that move with the current row.
#legacySQL
SELECT
name,
value,
AVG(value)
OVER (ORDER BY value
ROWS BETWEEN 1 PRECEDING AND CURRENT ROW)
AS MovingAverage
FROM
(SELECT "a" AS name, 0 AS value),
(SELECT "b" AS name, 1 AS value),
(SELECT "c" AS name, 2 AS value),
(SELECT "d" AS name, 3 AS value),
(SELECT "e" AS name, 4 AS value);