How to identify stopwords with BigQuery?

I'm looking at reddit comments. I'm using some common stopword lists, but I want to create a custom one for this dataset. How can I do this with SQL?

One approach to identifying stopwords is to look at the words that show up in the most documents.
Steps in this query:
Filter posts for relevance and quality (choose your subreddits, a minimum score, and a minimum length).
Unescape reddit's HTML-encoded values.
Decide what counts as a word (in this case r'[a-z]{1,20}\'?[a-z]+').
Count each word only once per doc (comment), regardless of how many times it's repeated within that comment.
Get the top x words by counting how many documents each one shows up in.
Query:
#standardSQL
WITH words_by_post AS (
  SELECT
    CONCAT(link_id, '/', id) id,
    REGEXP_EXTRACT_ALL(
      REGEXP_REPLACE(REGEXP_REPLACE(LOWER(body), '&amp;', '&'), r'&[a-z]{2,4};', '*'),
      r'[a-z]{1,20}\'?[a-z]+') words
  FROM `fh-bigquery.reddit_comments.2017_07`
  WHERE body NOT IN ('[deleted]', '[removed]')
    AND subreddit IN ('AskReddit', 'funny', 'movies')
    AND score > 100
), words_per_doc AS (
  SELECT id, word
  FROM words_by_post, UNNEST(words) word
  WHERE ARRAY_LENGTH(words) > 20
  GROUP BY id, word
)
SELECT word, COUNT(*) docs_with_word
FROM words_per_doc
GROUP BY 1
ORDER BY docs_with_word DESC
LIMIT 100
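If you then want to apply the result, here's a minimal sketch of one option, assuming you saved the top words into a hypothetical table my_dataset.reddit_stopwords; it excludes those words when counting term frequencies later:
#standardSQL
SELECT word, COUNT(*) freq
FROM `fh-bigquery.reddit_comments.2017_07`,
  UNNEST(REGEXP_EXTRACT_ALL(LOWER(body), r'[a-z]{1,20}\'?[a-z]+')) word
WHERE word NOT IN (SELECT word FROM `my_dataset.reddit_stopwords`)  -- hypothetical stopword table
GROUP BY word
ORDER BY freq DESC
LIMIT 100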

Related

Count the most popular occurrences of a hashtag in a string column (PostgreSQL)

I have a column in my dataset with the following format:
hashtags
1 [#newyears, #christmas, #christmas]
2 [#easter, #newyears, #fourthofjuly]
3 [#valentines, #christmas, #easter]
I have managed to count the hashtags like so:
SELECT hashtags, (LENGTH(hashtags) - LENGTH(REPLACE(hashtags, ',', '')) + 1) AS hashtag_count
FROM full_data
ORDER BY hashtag_count DESC NULLS LAST
But I'm not sure if it's possible to count the occurrences of each hashtag. Is it possible to return the count of the most popular hashtags in the following format:
hashtags count
christmas 3
newyears 2
The datatype is just varchar, but I'm a bit confused on how I should approach this. Any help would be appreciated!
Storing the data like this is a bad idea. It's risky because we don't know whether the text will always be stored in exactly this form. It would be better to save the individual strings in separate columns.
Anyway, if you can't improve that and have to deal with this structure, you can use a combination of UNNEST, STRING_TO_ARRAY and GROUP BY to split the hashtags and count them.
So the general idea is something like this:
WITH unnested AS (
  SELECT UNNEST(STRING_TO_ARRAY(hashtags, ',')) AS hashtag
  FROM full_data
)
SELECT hashtag, COUNT(hashtag)
FROM unnested
GROUP BY hashtag
ORDER BY COUNT(hashtag) DESC;
Due to the brackets and spaces within your column, this will not produce the correct result.
So we can additionally use TRIM and TRANSLATE to get rid of everything except the hashtags themselves.
With your sample data, the following construct will produce the intended outcome:
WITH unnested AS (
  SELECT TRIM(TRANSLATE(UNNEST(STRING_TO_ARRAY(hashtags, ',')), '#,[,]', '')) AS hashtag
  FROM full_data
)
SELECT hashtag, COUNT(hashtag)
FROM unnested
GROUP BY hashtag
ORDER BY COUNT(hashtag) DESC;
But as already said, this is unpleasant and risky.
So if possible, find out which hashtags can occur (it seems these are all special days) and then create columns or a mapping table for them.
With that in place, store 0 or 1 in each column to indicate whether the hashtag appears, and then sum the values per column, as sketched below.
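For illustration, a minimal sketch of that idea, assuming hypothetical 0/1 columns has_christmas, has_newyears and has_easter have been added to full_data (one per known hashtag):
-- each column holds 1 if the row's hashtags contain that tag, otherwise 0
SELECT
  SUM(has_christmas) AS christmas,
  SUM(has_newyears)  AS newyears,
  SUM(has_easter)    AS easter
FROM full_data;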
I think you should split the array data into records and then count them with GROUP BY (note this assumes hashtags is actually an array column; for a plain varchar you'd need string_to_array first, as in the other answer). Something like this query:
SELECT hashtag, count(*) as hashtag_count
FROM full_data, unnest(hashtags) s(hashtag)
GROUP BY hashtag
ORDER BY hashtag_count DESC
Hopefully, it will match your request!
You can do it as follows:
select unnest(string_to_array(REGEXP_REPLACE(hashtags,'[^\w,]+','','g'), ',')) as tags, count(1)
from full_data
group by tags
order by count(1) desc
Result:
tags count
christmas 3
newyears 2
easter 2
fourthofjuly 1
valentines 1
REGEXP_REPLACE to remove any special characters.
string_to_array to generate an array
unnest to expand an array to a set of rows

Count the number of times a word appears in a BigQuery column

I have a column with some long strings and need to count the most used words in it.
I need something that works like this https://towardsdatascience.com/very-simple-python-script-for-extracting-most-common-words-from-a-story-1e3570d0b9d0. The word counting part at least...
And it is very important that I have the option to blacklist some words so they don't count.
Try the simple approach below:
with blacklist as (
  select 'with' word union all
  select 'that' union all
  select 'add more as you see needed'
)
select word, count(*) frequency
from data, unnest(regexp_extract_all(lower(col), r'[\w]+')) word  -- lowercase first so the blacklist matches
where length(word) > 3                                            -- skip very short words
  and word not in (select word from blacklist)
group by word
order by frequency desc

Postgres similarity operator with ANY clause - limit for each parameter?

I have a query:
SELECT
word,
similarity(word, 'foo') as "similarity"
FROM words
WHERE word ~* 'foo'
ORDER BY similarity DESC
LIMIT 5
which gets called for each word in a search. For example searching for 'oxford university england' would call this query 3 times.
I would now like to write a query which can find similar words to all 3 in one trip to the database, and so far I have something like this:
SELECT set_limit(0.4);
SELECT word
FROM words
WHERE word % ANY(ARRAY['foo', 'bar', 'baz']);
However, I can't think of a way to say 'give me 5 results for each word'.
Is there a way to write this? Thanks.
Unnest the array of words you're searching for and join on the % condition instead of applying it in WHERE, then number rows for each search word and filter on that number:
SELECT subq.search_word
     , subq.word
FROM (SELECT srch.word AS search_word
           , wrd.word
           , ROW_NUMBER() OVER (PARTITION BY srch.word
                                ORDER BY similarity(wrd.word, srch.word) DESC) AS rn
      FROM words wrd
      JOIN UNNEST(ARRAY['foo', 'bar']) AS srch(word)
        ON wrd.word % srch.word) subq
WHERE subq.rn <= 5
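If a per-word LIMIT reads more naturally to you, here's a rough alternative sketch using a LATERAL join (it assumes the pg_trgm extension is installed, as in your original query):
SELECT srch.word AS search_word, m.word
FROM UNNEST(ARRAY['foo', 'bar', 'baz']) AS srch(word)
CROSS JOIN LATERAL (
  SELECT w.word
  FROM words w
  WHERE w.word % srch.word
  ORDER BY similarity(w.word, srch.word) DESC
  LIMIT 5  -- five matches per search word
) m;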

Computing a moving maximum in BigQuery

Given a BigQuery table with some ordering, and some numbers, I'd like to compute a "moving maximum" of the numbers -- similar to a moving average, but for a maximum instead. From Trying to calculate EMA (exponential moving average) using BigQuery it seems like the best way to do this is by using LEAD() and then doing the aggregation myself. (Bigquery moving average suggests essentially a CROSS JOIN, but that seems like it would be quite slow, given the size of the data.)
Ideally, I might be able to just return a single repeated field, rather than 20 individual fields, from the inner query, and then use normal aggregation over the repeated field, but I haven't figured out a way to do that, so I'm stuck with rolling my own aggregation. While this is easy enough for a sum or average, computing the max inline is pretty tricky, and I haven't figured out a good way to do it.
(The examples below are of course somewhat contrived in order to use public datasets. They also do the rolling max over 3 elements, whereas I'd like to do it for around 20. I'm already generating the query programmatically, so making it short isn't a big issue.)
One approach is to do the following:
SELECT word,
  (CASE
     WHEN word_count >= word_count_1 AND word_count >= word_count_2 THEN word_count
     WHEN word_count_1 >= word_count AND word_count_1 >= word_count_2 THEN word_count_1
     ELSE word_count_2
   END) AS max_count
FROM (
  SELECT word, word_count,
    LEAD(word_count, 1) OVER (ORDER BY word) AS word_count_1,
    LEAD(word_count, 2) OVER (ORDER BY word) AS word_count_2
  FROM [publicdata:samples.shakespeare]
  WHERE corpus = 'macbeth'
)
This is O(n^2), but it at least works. I could also do a nested chain of IFs, like this:
SELECT word,
  IF(word_count >= word_count_1,
     IF(word_count >= word_count_2, word_count, word_count_2),
     IF(word_count_1 >= word_count_2, word_count_1, word_count_2)) AS max_count
FROM ...
This is O(n) to evaluate, but the query size is exponential in n, so I don't think it's a good option; certainly it would surpass the BigQuery query size limit for n=20. I could also do n nested queries:
SELECT word,
  IF(word_count_2 >= max_count, word_count_2, max_count) AS max_count
FROM (
  SELECT word,
    IF(word_count_1 >= word_count, word_count_1, word_count) AS max_count
  FROM ...
)
It seems like doing 20 nested queries might not be a great idea performance-wise, though.
Is there a good way to do this kind of query? If not, am I correct that for n around 20, the first is the least bad?
A trick I'm using for rolling windows: CROSS JOIN with a table of numbers. In this case, to have a moving window of 3 years, I cross join with the numbers 0,1,2. Then you can create an id for each group (ending_at_year==year-i) and group by that.
SELECT ending_at_year, MAX(mean_temp) max_temp, COUNT(DISTINCT year) c
FROM (
  SELECT mean_temp, year-i ending_at_year, year
  FROM [publicdata:samples.gsod] a
  CROSS JOIN
    (SELECT i FROM [fh-bigquery:public_dump.numbers_255] WHERE i<3) b
  WHERE station_number=722860
)
GROUP BY ending_at_year
HAVING c=3
ORDER BY ending_at_year;
Here is another way to do what you are trying to achieve. See the query below:
SELECT word, MAX(words)
FROM
  (SELECT word, word_count AS words
   FROM [publicdata:samples.shakespeare]
   WHERE corpus = 'macbeth'),
  (SELECT word, LEAD(word_count, 1) OVER (ORDER BY word) AS words
   FROM [publicdata:samples.shakespeare]
   WHERE corpus = 'macbeth'),
  (SELECT word, LEAD(word_count, 2) OVER (ORDER BY word) AS words
   FROM [publicdata:samples.shakespeare]
   WHERE corpus = 'macbeth')
GROUP BY word
ORDER BY word
You can try it and compare performance with your approach (I didn't try that)
There's an example of calculating a moving average using a window function in the docs here.
Quoting:
The following example calculates a moving average of the values in the current row and the row preceding it. The window frame comprises two rows that move with the current row.
#legacySQL
SELECT
  name,
  value,
  AVG(value)
    OVER (ORDER BY value
          ROWS BETWEEN 1 PRECEDING AND CURRENT ROW)
    AS MovingAverage
FROM
  (SELECT "a" AS name, 0 AS value),
  (SELECT "b" AS name, 1 AS value),
  (SELECT "c" AS name, 2 AS value),
  (SELECT "d" AS name, 3 AS value),
  (SELECT "e" AS name, 4 AS value);

How to count collocations in a repeating field

I have a repeating field A which contains a list of strings. What would be a good way to find the TOP strings which coincide with a given string? So, if A holds hashtags, then for a given hashtag #T1, find the tags that coincide with #T1 in the highest number of records.
You can use WITHIN and SUM(IF(...)) to find the matches. For example:
SELECT hashtag, COUNT(*) AS cnt
FROM (
  SELECT tweet.hashtag AS hashtag,
         SUM(IF(tweet.hashtag = '#T1', 1, 0)) WITHIN RECORD AS tagz
  FROM [tweets])
WHERE tagz > 0
GROUP BY hashtag
ORDER BY cnt DESC
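For reference, a rough standard SQL sketch of the same idea, assuming a hypothetical table my_dataset.tweets with a repeated STRING field hashtags:
#standardSQL
SELECT tag, COUNT(*) AS cnt
FROM `my_dataset.tweets`, UNNEST(hashtags) AS tag
WHERE '#T1' IN UNNEST(hashtags)  -- keep only records that contain the given tag
  AND tag != '#T1'               -- don't count the tag itself
GROUP BY tag
ORDER BY cnt DESC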