I have a query like this, which returns the rows of text that match a given search term:
SELECT content
FROM messages
WHERE created_at >= '2021-07-24T20:17:18.141Z'
AND created_at <= '2021-07-31T20:11:20.542Z'
AND content ILIKE '%Search Term%';
Right now, if I just count the rows, I get the number of messages that contain the search term. However, I'd like to count the number of instances of the search term rather than the number of rows that contain it.
I thought of writing a stored function that looped through the results of the above query and counted the instances, but it ended up being insanely slow. I'm okay with it being pretty slow, but is there a solution that either runs a bit faster or doesn't require a function?
Try this: REPLACE() removes every occurrence of the term, so the drop in length divided by the term's length gives the number of instances per row, and SUM adds them up:
SELECT
    SUM(
        (LENGTH(content) - LENGTH(REPLACE(LOWER(content), LOWER('Search Term'), '')))
        / LENGTH('Search Term')
    )
FROM
    messages
WHERE
    created_at >= '2021-07-24T20:17:18.141Z' AND
    created_at <= '2021-07-31T20:11:20.542Z';
The LOWER() calls keep the match case-insensitive, like the ILIKE in your original query.
You can use regexp_matches():
select count(*)
from messages m cross join lateral
regexp_matches(m.content, $search_string, 'g')
Note: This assumes that $search_string doesn't contain regular expression special characters.
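One way around that caveat (a sketch, assuming PostgreSQL with standard_conforming_strings on): prefix every non-alphanumeric character in the term with a backslash before passing it to regexp_matches(), since in PostgreSQL's regex flavor a backslash before a non-alphanumeric character matches that character literally:

```
SELECT count(*)
FROM messages m
CROSS JOIN LATERAL regexp_matches(
    m.content,
    -- escape every non-alphanumeric character so the term is matched literally
    regexp_replace($search_string, '([^a-zA-Z0-9])', '\\\1', 'g'),
    'g'
);
```

Using the 'gi' flags instead of 'g' would also restore the case-insensitivity of the question's ILIKE.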
If you want the count on each row, you can use:
select m.*,
(select count(*)
from regexp_matches(m.content, $search_string, 'g')
) as num_matches
from messages m;
Now I understand your question properly. You can use regexp_matches() in a scalar subquery to get what you want:
SELECT content,
       (SELECT count(*) FROM regexp_matches(content, 'Search Term', 'g')) AS search_count
FROM messages
WHERE created_at >= '2021-07-24T20:17:18.141Z'
AND created_at <= '2021-07-31T20:11:20.542Z'
AND content ILIKE '%Search Term%';
Note: the 'g' flag makes the regex return all matches in the string, not just the first.
Related
I have a column in my dataset with the following format:
hashtags
1 [#newyears, #christmas, #christmas]
2 [#easter, #newyears, #fourthofjuly]
3 [#valentines, #christmas, #easter]
I have managed to count the hashtags like so:
SELECT hashtags, (LENGTH(hashtags) - LENGTH(REPLACE(hashtags, ',', '')) + 1) AS hashtag_count
FROM full_data
ORDER BY hashtag_count DESC NULLS LAST
But I'm not sure if it's possible to count the occurrences of each hashtag. Is it possible to return the count of the most popular hashtags in the following format:
hashtags count
christmas 3
newyears 2
The datatype is just varchar, but I'm a bit confused on how I should approach this. Any help would be appreciated!
Storing the data like this is a bad idea: it's risky because we don't know whether the text will always be stored in exactly this form. It would be better to save the individual strings in separate columns.
Anyway, if you can't improve that and must deal with this structure, we could basically use a combination of UNNEST, STRING_TO_ARRAY and GROUP BY to split the hashtags and count them.
So the general idea is something like this:
WITH unnested AS
(SELECT
UNNEST(STRING_TO_ARRAY(hashtags, ',')) AS hashtag
FROM full_data)
SELECT hashtag, COUNT(hashtag)
FROM unnested
GROUP BY hashtag
ORDER BY COUNT(hashtag) DESC;
Due to the brackets and spaces within your column, this alone will not produce the correct result.
So we can additionally use TRIM and TRANSLATE to get rid of everything except the hashtag text.
With your sample data, following construct will produce the intended outcome:
WITH unnested AS
(SELECT
TRIM(TRANSLATE(UNNEST(STRING_TO_ARRAY(hashtags, ',')),'#,[,]','')) AS hashtag
FROM full_data)
SELECT hashtag, COUNT(hashtag)
FROM unnested
GROUP BY hashtag
ORDER BY COUNT(hashtag) DESC;
But as already said, this is unpleasant and risky.
So if possible, find out which hashtags can occur (it seems these are all special days) and then create columns or a mapping table for them.
With that in place, store 0 or 1 in each column to indicate whether the hashtag appears, and then sum the values per column.
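A minimal sketch of the mapping-table idea (all table and column names here are made up):

```
-- one row per (post, hashtag) pair instead of a comma-separated string
CREATE TABLE post_hashtags (
    post_id int,
    hashtag text
);

-- counting is then a plain aggregate, with no string surgery needed
SELECT hashtag, COUNT(*) AS hashtag_count
FROM post_hashtags
GROUP BY hashtag
ORDER BY hashtag_count DESC;
```

This also makes it trivial to add or remove hashtags later without touching the schema.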
I think you should split the data into one row per array element and then count them with GROUP BY, something like the query below. Note that unnest() works directly only on an array column; since your hashtags column is varchar, convert it first with string_to_array():
SELECT hashtag, count(*) AS hashtag_count
FROM full_data, unnest(string_to_array(hashtags, ',')) AS s(hashtag)
GROUP BY hashtag
ORDER BY hashtag_count DESC;
Hopefully it will match your request!
You can do it as follows:
select unnest(string_to_array(REGEXP_REPLACE(hashtags,'[^\w,]+','','g'), ',')) as tags, count(1)
from full_data
group by tags
order by count(1) desc
Result :
tags count
christmas 3
newyears 2
easter 2
fourthofjuly 1
valentines 1
REGEXP_REPLACE to remove any special characters.
string_to_array to generate an array
unnest to expand an array to a set of rows
While working in GCP with the "GDELT" database I encountered a problem, and would like your help solving it.
I would like to write a query that finds which person the media mentioned most during the days when the "Abraham Accords" were signed between Israel and the Arab countries.
SELECT
V2Persons,
COUNT(1) AS count
FROM (
SELECT
UNIQUE(REGEXP_REPLACE(SPLIT(V2Persons,';'), r',.*', ")) V2Persons
FROM
`gdelt-bq.gdeltv2.gkg_partitioned`
WHERE
DATE>20200914000000
AND DATE < 20200916000000
AND LOWER(AllNames) LIKE '%Abraham Accords%' )
GROUP BY
Persons
ORDER BY
2 DESC
LIMIT
300
Can you show me what the problem is with the code and how to solve it?
There are many issues here; it's simplest to debug step by step, using a WITH clause and looking at some data.
So I moved the nested query into a WITH subquery. There is no UNIQUE function, so I removed it. Then, the SPLIT function returns an array, and you cannot call a regex function on an array directly; the array first has to be UNNEST'ed. I got this in the end:
WITH persons AS (
SELECT SPLIT(gkg.V2Persons,';') pers_arr
FROM
`gdelt-bq.gdeltv2.gkg_partitioned` gkg
WHERE
DATE > 20200914000000 AND DATE < 20200916000000 AND
LOWER(AllNames) LIKE '%abraham accord%'
)
SELECT REGEXP_REPLACE(V2Persons, r',.*', '') V2Persons, COUNT(1) AS count
FROM persons pers, UNNEST(pers.pers_arr) V2Persons
GROUP BY V2Persons
ORDER BY 2 DESC LIMIT 300;
I have a query:
SELECT
word,
similarity(word, 'foo') as "similarity"
FROM words
WHERE word ~* 'foo'
ORDER BY similarity DESC
LIMIT 5
which gets called for each word in a search. For example searching for 'oxford university england' would call this query 3 times.
I would now like to write a query which can find similar words to all 3 in one trip to the database, and so far I have something like this:
SELECT set_limit(0.4);
SELECT word
FROM words
WHERE word % ANY(ARRAY['foo', 'bar', 'baz']);
However, I can't think of a way to say 'give me 5 results for each word'.
Is there a way to write this? Thanks.
Unnest the array of words you're searching for and join on the % condition instead of applying it in WHERE, then number rows for each search word and filter on that number:
SELECT subq.search_word
, subq.word
FROM (SELECT srch.word as search_word
, wrd.word
, ROW_NUMBER() OVER (PARTITION BY srch.word ORDER BY similarity(wrd.word, srch.word) DESC) AS rn
FROM words wrd
JOIN UNNEST(ARRAY['foo', 'bar']) as srch(word)
ON wrd.word % srch.word) subq
WHERE subq.rn <= 5
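An alternative sketch without the window function: a LATERAL subquery with LIMIT pulls the top 5 matches per search word directly (this assumes the same pg_trgm setup as in the question):

```
SELECT srch.word AS search_word, w.word, w.sim
FROM unnest(ARRAY['foo', 'bar', 'baz']) AS srch(word)
CROSS JOIN LATERAL (
    -- top 5 similar words for this one search word
    SELECT word, similarity(word, srch.word) AS sim
    FROM words
    WHERE word % srch.word
    ORDER BY sim DESC
    LIMIT 5
) w;
```

With a trigram index on words.word, each lateral probe can stop early, which often helps when the word list is short.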
I need to figure out how to eliminate older revisions from my query's results, my database stores orders as 'Q000000' and revisions have an appended '-number'. My query currently is as follows:
SELECT DISTINCT Estimate.EstimateNo
FROM Estimate
INNER JOIN EstimateDetails ON EstimateDetails.EstimateID = Estimate.EstimateID
INNER JOIN EstimateDoorList ON EstimateDoorList.ItemSpecID = EstimateDetails.ItemSpecID
WHERE (Estimate.SalesRepID = '67' OR Estimate.SalesRepID = '61') AND Estimate.EntryDate >= '2017-01-01 00:00:00.000' AND EstimateDoorList.SlabSpecies LIKE '%MDF%'
ORDER BY Estimate.EstimateNo
So for instance, the results would include:
Q120455-10
Q120445-11
Q121675-2
Q122361-1
Q123456
Q123456-1
From this, I need to eliminate 'Q120455-10' because of the presence of '-11' for that order, and 'Q123456' because of the presence of the '-1' revision. I'm struggling greatly with figuring out how to do this, my immediate thought was to use case statements but I'm not sure what is the best way to implement them and how to filter. Thank you in advance, let me know if any more information is needed.
First you have to parse your EstimateNo column into a sequence number and a revision number using CHARINDEX and SUBSTRING (or STRING_SPLIT in newer versions) and CAST/CONVERT the revision to a numeric type. A CASE guards the estimates that have no revision suffix (like Q123456), which would otherwise break the CAST:
SELECT
    CASE WHEN CHARINDEX('-', Estimate.EstimateNo) > 0
         THEN LEFT(Estimate.EstimateNo, CHARINDEX('-', Estimate.EstimateNo) - 1)
         ELSE Estimate.EstimateNo
    END AS [EstimateNo],
    CASE WHEN CHARINDEX('-', Estimate.EstimateNo) > 0
         THEN CAST(SUBSTRING(Estimate.EstimateNo, CHARINDEX('-', Estimate.EstimateNo) + 1, LEN(Estimate.EstimateNo)) AS INT)
         ELSE 0  -- un-revised estimates count as revision 0
    END AS [EstimateRevision]
FROM
...
You can then use
APPLY - to select TOP 1 row that matches the EstimateNo or
Window function such as ROW_NUMBER to select only records with row number of 1
For example, using a ROW_NUMBER would look something like below:
SELECT
ROW_NUMBER() OVER(PARTITION BY EstimateNo ORDER BY EstimateRevision DESC) AS "LastRevisionForEstimate",
-- rest of the needed columns
FROM
(
-- query above goes here
)
You can then wrap the query above in a simple select with a where predicate filtering out a specific value of LastRevisionForEstimate, for instance
SELECT --needed columns
FROM -- result set above
WHERE LastRevisionForEstimate = 1
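The APPLY option mentioned above could look roughly like this (a sketch; `parsed` is a hypothetical name standing in for the parsing query from the first snippet, stubbed with sample rows here):

```
WITH parsed AS (
    -- stand-in for the EstimateNo / EstimateRevision parsing query above
    SELECT 'Q123456' AS EstimateNo, 0 AS EstimateRevision
    UNION ALL
    SELECT 'Q123456', 1
)
SELECT p.EstimateNo, a.EstimateRevision
FROM (SELECT DISTINCT EstimateNo FROM parsed) p
CROSS APPLY (
    -- highest revision for this estimate
    SELECT TOP 1 x.EstimateRevision
    FROM parsed x
    WHERE x.EstimateNo = p.EstimateNo
    ORDER BY x.EstimateRevision DESC
) a;
```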
Please note that this is, to a certain extent, pseudocode, as I do not have your schema and cannot test the query.
If you dislike the nested selects, check out Common Table Expressions.
I have data stored in my database for mobile numbers.
I want to group by the column number in the database.
For example, some numbers may show 44123456789 and 0123456789 which is the same number. How can I group these together?
SELECT DIGITS(column_name) FROM table_name
You can use this representation in the database, assign it to a variable, and then match its digits against the other numbers.
Not sure it really suits you, but you could build this kind of subquery:
SELECT ta.`phone_nbr`,
COALESCE(list.`normalized_nbr`,ta.`phone_nbr`) AS nbr
FROM (
SELECT
t.`phone_nbr`,
SUBSTRING(t.`phone_nbr`,2) AS normalized_nbr
FROM `your_table` t
WHERE LEFT(t.`phone_nbr`,1) = '0'
UNION
SELECT
t.`phone_nbr`,
sub.`filter_nbr` AS normalized_nbr
FROM `your_table` t,
( SELECT
SUBSTRING(t2.`phone_nbr`,2) AS filter_nbr
FROM `your_table` t2
WHERE LEFT(t2.`phone_nbr`,1) = '0') sub
WHERE LEFT(t.`phone_nbr`,1) != '0'
AND t.`phone_nbr` LIKE CONCAT('%',sub.`filter_nbr`)
) list
LEFT OUTER JOIN `your_table` ta
ON ta.`phone_nbr` = list.`phone_nbr`
It will return you a list of phone numbers with their "normalized" number, i.e. with the 0 or international prefix removed if there is a duplicate match, and the raw number otherwise.
You can then use a GROUP BY clause on the nbr field, join on the phone_nbr for the rest of your query.
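Wiring that into the grouping could look like this (a sketch; `normalized` is a hypothetical CTE standing in for the subquery above, stubbed with sample rows):

```
WITH normalized AS (
    -- stand-in for the phone_nbr / nbr subquery above
    SELECT '0123456789' AS phone_nbr, '123456789' AS nbr
    UNION ALL
    SELECT '44123456789', '123456789'
)
SELECT nbr, COUNT(*) AS how_many
FROM normalized
GROUP BY nbr;
```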
It has some limits: it can unfortunately group distinct numbers whose stripped forms collide. For example, +49123456789, +44123456789 and 0123456789 would all end up with the same normalized number.