Trouble filtering out plural words - sql

I have a table with the most frequent words in the English language which looks like this:
word count
cat 43534889
dog 34584357
hat 4343878
...
hats 44747
I'd like to exclude all the plural words like 'hats' if they already exist in singular form.
So I wrote this query
SELECT
word,
CASE WHEN CONCAT(word,'s') IN (
SELECT freq.word from `words.freq` as freq
WHERE freq.word LIKE '%s' AND LENGTH(freq.word) > 4
)
THEN 'plural'
ELSE 'sing'
END AS plural
FROM `words.freq` LIMIT 1000
My logic is: if the word 'hat' + 's' is found among the words ending in 's' (the subquery), then 'hats' is just the plural form of that noun. Somehow CONCAT doesn't seem to just add 's' to each word. For example, when I run this query, words like 'that' are displayed as 'plural', as if they were longer than 4 characters and ended in 's'. I am really confused. Can anyone help?

This (in MySQL syntax) should do what you're looking for. As you say, it doesn't capture all the ways that English forms plurals, and it will also produce some false positives ("hiss" would be considered plural because "his" exists).
The idea is to look for words of at least 4 characters ending in 's' and check whether the corresponding word with the final 's' removed exists:
SELECT word,
CASE
WHEN CHAR_LENGTH(word) >= 4 AND word LIKE '%s' AND LEFT(word, CHAR_LENGTH(word)-1) IN (SELECT word FROM words) THEN 'plural'
ELSE 'singular'
END AS plurality
FROM words;
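To sanity-check the rule outside the database, here is a minimal Python sketch of the same idea. The function name `label_plurals` and the sample list are made up for illustration; only the rule itself (at least 4 characters, ends in 's', stem exists) comes from the query above.

```python
def label_plurals(words):
    """Label a word 'plural' if it has at least 4 characters, ends in 's',
    and the same word without the final 's' is also in the list."""
    known = set(words)
    labels = {}
    for word in words:
        is_plural = len(word) >= 4 and word.endswith("s") and word[:-1] in known
        labels[word] = "plural" if is_plural else "singular"
    return labels


# Hypothetical sample, including the "his"/"hiss" false positive mentioned above.
sample = ["cat", "dog", "hat", "hats", "his", "hiss", "that"]
labels = label_plurals(sample)
print(labels)
```

Here 'hats' comes out as plural and 'that' as singular, while 'hiss' is still (wrongly) labelled plural because 'his' is in the list.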

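I think you can sort the words in descending alphabetical order and compare each word with the next one to check whether that next word is its singular form. Note that this only catches a plural whose singular form sorts immediately next to it.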
WITH sample_table AS (
SELECT 'cat' word, 43534889 count UNION ALL
SELECT 'dog', 34584357 UNION ALL
SELECT 'hat', 4343878 UNION ALL
SELECT 'dogs', 38738 UNION ALL
SELECT 'hats', 44747
)
SELECT *,
IF(CONCAT(LEAD(word) OVER (ORDER BY word DESC), 's') = word, 'plural', 'singular') is_plural
FROM sample_table;
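The neighbour comparison can be traced in a few lines of Python. This is only a sketch: `label_by_neighbor` is a hypothetical helper, and Python's default string ordering stands in for the database collation. It also shows what happens when another word (here the made-up addition 'hate') sorts between a plural and its singular.

```python
def label_by_neighbor(words):
    """Mimic LEAD(word) OVER (ORDER BY word DESC): compare each word with
    the word that follows it in descending order."""
    ordered = sorted(words, reverse=True)
    labels = {}
    for i, word in enumerate(ordered):
        following = ordered[i + 1] if i + 1 < len(ordered) else None
        is_plural = following is not None and following + "s" == word
        labels[word] = "plural" if is_plural else "singular"
    return labels


labels_ok = label_by_neighbor(["cat", "dog", "hat", "dogs", "hats"])
# 'hate' sorts between 'hats' and 'hat', so 'hats' is no longer detected.
labels_miss = label_by_neighbor(["cat", "dog", "hat", "hate", "dogs", "hats"])
print(labels_ok)
print(labels_miss)
```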

Related

TSQL: how to select a sentence by complex logic

Would you help me to select a sentence by complex logic.
Platform: TSQL.
Initial data:
sentence    result
company "Apple corp" has an apple on its logotype    1
company "Apple computers" is a large company    0
Apple company    1
conditions:
must have %Apple%
not take into account %"%Apple%"%
This means: if the sentence has only %"%Apple%"%, condition not met
But if the sentence has both %Apple% AND %"%Apple%"% - condition met
I tried to apply some kinds of logic:
First:
Substitute the word "Apple" with some rare symbol. Eg. "|"
Delete in the sentence all the symbols but | and quotes
To look for the "|" symbol and to look left and right from it. If the quote is absent on one of the sides, condition met.
Second:
Split the sentence on the basis of the word Apple
Third:
Split the sentence on the basis of the quotes
But either I don't know how to implement the logic technically, or the logic doesn't meet the goal.
If you really have to use sql, just use multiple conditions in your WHERE clause. This way, you don't have to call a function for replacements or other manipulations.
You can rephrase your conditions like this:
Text contains only Apple but not "Apple"
OR
Text contains both Apple and "Apple"
Possibility 1: First apple, then "apple"
Possibility 2: First "apple", then apple
WHERE
(Col LIKE '%apple%' AND Col NOT LIKE '%"%apple%"%') -- APPLE, but not "APPLE"
OR Col LIKE '%apple%"%apple%"%' -- APPLE .. "APPLE" ..
OR Col LIKE '%"%apple%"%apple%' -- "APPLE" .. APPLE ..
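As a quick check of these three conditions against the sample sentences, here is a small Python sketch that runs the same WHERE clause in an in-memory SQLite database. SQLite is only a stand-in for SQL Server here (an assumption for the sake of a runnable example); its LIKE is case-insensitive for ASCII letters by default, which matches the behaviour the conditions rely on. The table name `t` is made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (col TEXT)")
sentences = [
    'company "Apple corp" has an apple on its logotype',
    'company "Apple computers" is a large company',
    'Apple company',
]
conn.executemany("INSERT INTO t (col) VALUES (?)", [(s,) for s in sentences])

# Same three conditions as in the WHERE clause above.
query = """
SELECT col
FROM t
WHERE (col LIKE '%apple%' AND col NOT LIKE '%"%apple%"%')
   OR col LIKE '%apple%"%apple%"%'
   OR col LIKE '%"%apple%"%apple%'
ORDER BY rowid
"""
matched = [row[0] for row in conn.execute(query)]
print(matched)
```

The first and third sentences are returned; the second, where Apple only appears inside quotes, is not.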
Your sample data and explanation appear to require just the following. Does this work for you?
with d as (
select 'company "Apple corp" has an apple on its logotype' sentence union
select 'company "Apple computers" is a large company' union
select 'Apple company'
)
select * , case when Replace(Replace(sentence,'"apple',''),'apple"','') like '%apple%' then 1 else 0 end
from d;
I solved the problem by applying this logic:
1. Substitute Apple with '|'.
2. Delete from the sentence all characters except '|' and '"'.
3. Substitute '"' with '""', to handle the cases where one quote belongs to 2 words, as in '"Apple"Apple"'.
4. Delete from the sentence all '|' characters enclosed in quotes.
5. Select the sentences which contain '|'.
First, we should create the function for point 2.
create function [dbo].[fn_KeepCharacters](@String varchar(2000), @KeepValues varchar(255))
returns varchar(2000)
as
begin
-- Wrap the character class in a PATINDEX pattern, e.g. '^"|' becomes '%[^"|]%'
set @KeepValues = '%['+@KeepValues+']%'
-- Delete matching characters one at a time until none remain
while patindex(@KeepValues, @String) > 0
set @String = stuff(@String, patindex(@KeepValues, @String), 1, '')
return @String
end
go
Full code:
with d as (
select 'big company "Apple"' col
union select 'Apple, start'
union select 'comp Apple" computers'
union select 'inc " int Apple ap'
union select '"i Apple""mac Apple" aa'
union select 'book "pen Apple"pineApple"'
union select 'leaf Apple"Apple"'
)
--, b as (
select col, case when replace(replace(dbo.fn_KeepCharacters(replace(col, 'Apple', '|'), '^"|'), '"', '""'), '"|"', '""')
like '%|%' then 1 else null end
col_sec
from d
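The substitution steps can be traced outside the database with a short Python sketch. The function name `contains_unquoted_apple` is hypothetical, the comparison is case-sensitive (unlike a typical SQL Server collation), and it assumes the input never contains a literal '|' of its own.

```python
def contains_unquoted_apple(sentence, word="Apple"):
    marked = sentence.replace(word, "|")                 # substitute the word with '|'
    kept = "".join(ch for ch in marked if ch in '|"')    # keep only '|' and '"'
    doubled = kept.replace('"', '""')                    # double every quote
    stripped = doubled.replace('"|"', '""')              # drop each '|' enclosed in quotes
    return "|" in stripped                               # any '|' left is an unquoted hit


samples = [
    'big company "Apple"',
    'Apple, start',
    'comp Apple" computers',
    'inc " int Apple ap',
    '"i Apple""mac Apple" aa',
    'book "pen Apple"pineApple"',
    'leaf Apple"Apple"',
]
flags = [contains_unquoted_apple(s) for s in samples]
print(list(zip(samples, flags)))
```

Because the filtering is a single pass over the string, this avoids the character-by-character loop; whether an equivalent set-based rewrite is available in T-SQL depends on the server version.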
I thank the Stack Overflow members for their help, especially @Stu for the nested replace advice.
The problem is that fn_KeepCharacters contains a WHILE loop, which is very slow. I would appreciate faster solutions.

Looking for some SQL help: I have a column that I want to pull from and can't create the SELECT for it. Any advice would be helpful

Can someone help me figure out how to write a SELECT statement that grabs the word after the # symbol in a column like this? Any help would be appreciated. I am using SQL Runner in Looker. The # is attached to the word; I just couldn't do that in the post.
I will never know how many words I'll need to pull (it could be 1 or 50), and I would like the resulting column to look like revjahwar,nhl etc. I made this union, but there may still be a ton of #'s I'd have to pull out, so it's not too efficient.
# revjahwar #51&Done #21reasons #21dayswithPrime #IminmyPurpose #Purpose=Peace #iBelieve #Tiredofplayinggames #2019AintNobodyCare # NFL
So far, if you follow the comments, there is a way to pull out the first one.
Snowflake's REGEXP_SUBSTR() function "Returns the substring that matches a regular expression within a string," which seems to be what you want to do here. Here is an example.
with INSTAGRAM_POST_METRICS as (select $1 caption from values('# revjahwar #51&Done #21reasons #21dayswithPrime #IminmyPurpose #Purpose=Peace #iBelieve #Tiredofplayinggames #2019AintNobodyCare # NFL'))
select regexp_substr(
caption,
'# ([^ ]+)',
1,
1,
'e'
) word from INSTAGRAM_POST_METRICS;
word
revjahwar
Here is a way to get all of the #words
with INSTAGRAM_POST_METRICS as (select $1 caption from values('# revjahwar #51&Done #21reasons #21dayswithPrime #IminmyPurpose #Purpose=Peace #iBelieve #Tiredofplayinggames #2019AintNobodyCare # NFL'))
SELECT array_to_string(array_agg(word), ',') word_list
FROM (
SELECT caption,
split_part(t.value, ' ', 2) word
FROM instagram_post_metrics,
lateral flatten(split(caption, '#')) t
WHERE t.value != '')
GROUP BY caption;
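For comparison, the same extraction can be sketched in Python with the regular expression from the first query. This is only an illustration of the pattern's behaviour on the sample caption, not Snowflake code; like the queries above, it only picks up words that follow "# " with a space.

```python
import re

caption = (
    "# revjahwar #51&Done #21reasons #21dayswithPrime #IminmyPurpose "
    "#Purpose=Peace #iBelieve #Tiredofplayinggames #2019AintNobodyCare # NFL"
)

# Capture the run of non-space characters after each "# ".
words = re.findall(r"# ([^ ]+)", caption)
word_list = ",".join(words)
print(word_list)
```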

Find the names of employees whose name has more than two ‘a’ in it and ends with ‘s’. ORACLE SQL

I'm attempting to select from a table the names of employees whose name has two 'a' in it and ends with 's'. Here's what I have so far:
select NAME from CLASS where NAME LIKE '%s'
I know how to find names that end with 's', but I'm not sure how to search for names having at least two 'a'.
Am I missing something, or couldn't you just write the following?
select NAME from CLASS where LOWER(NAME) LIKE '%a%a%a%s'
This selects every name that has at least three (i.e. more than two) 'a' characters and ends with an 's'.
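To see which names this LIKE pattern lets through, here is a small Python sketch that runs the query in an in-memory SQLite database. SQLite is used only as a convenient stand-in for Oracle, and the names in the table are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE CLASS (NAME TEXT)")
# Hypothetical names: two have three 'a's and end in 's', the rest do not qualify.
names = ["Barbaras", "Anas", "Alannas", "Savannah", "Thomas"]
conn.executemany("INSERT INTO CLASS (NAME) VALUES (?)", [(n,) for n in names])

rows = conn.execute(
    "SELECT NAME FROM CLASS WHERE LOWER(NAME) LIKE '%a%a%a%s' ORDER BY NAME"
).fetchall()
matched = [row[0] for row in rows]
print(matched)
```

'Anas' is rejected because it has only two 'a's, and 'Savannah' because it does not end in 's'.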
One option might be
where regexp_count(name, 'a', 1, 'i') = 2
and substr(lower(name), -1) = 's'
This checks that the number of 'a' letters (counting from position 1, with a case-insensitive search, 'i') equals 2, and that the last character is 's'.
Found a solution:
select NAME from CLASS where NAME LIKE '%s' and REGEXP_COUNT(NAME, 'a') > 2;
Try this (the pattern requires at least two 'a' characters anywhere before a final 's'):
select NAME from test where regexp_like(NAME, 'a.*a.*s$');

Regex - SQL - Query for finding all words that contain at least 3 capital letters (Does not have to be in order)

I basically want to catch all of the words that contain at least 3 capital letters anywhere in the word.
Example words that I am trying to catch:
sksDDKDeS4Ataow,
dS19DsA2NTbpctK
My bad regex:
regexp_like(word, '[A-Z]{1,4}?+[a-z]{1,16}+[A-Z]{1,4}?+[a-z]{1,16}+[A-Z]{1,4}?')
Try this one. It matches exactly the sample words whose names say they should match:
WITH
words(word) AS (
SELECT 'noMatch'
UNION ALL SELECT 'onlYtwoNomatch'
UNION ALL SELECT 'thrEECapsmatch'
UNION ALL SELECT 'ThReeCapsmatch'
UNION ALL SELECT 'FourMatcHToo'
)
SELECT
*
FROM words
WHERE REGEXP_LIKE(word,'([A-Z]\w*){3}')
;
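The same pattern can be tried in Python, whose `re` module accepts it unchanged. The sketch below also compares it with a plain count of capital letters; the helper names are made up, and regex dialects differ between databases, so treat this as a check of the idea rather than of any particular engine.

```python
import re

PATTERN = re.compile(r"([A-Z]\w*){3}")


def has_three_capitals(word):
    """True if the regex finds three capitals separated only by word characters."""
    return PATTERN.search(word) is not None


def count_capitals(word):
    """Count ASCII capital letters directly, as a cross-check."""
    return sum(1 for ch in word if "A" <= ch <= "Z")


words = [
    "noMatch", "onlYtwoNomatch", "thrEECapsmatch", "ThReeCapsmatch",
    "FourMatcHToo", "sksDDKDeS4Ataow", "dS19DsA2NTbpctK",
]
results = {w: has_three_capitals(w) for w in words}
print(results)
```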

MS-SQL List of email addresses LIKE statement/regex

I have a column in my table called TO which is a comma separated list of email addresses. (1-n)
I am not concerned with a row if it ONLY contains addresses to Whatever@mycompany.com and want to flag that as 0. However, if a row contains a NON mycompany address (even if there are mycompany addresses present) I'd like to flag it as 1. Is this possible using one LIKE statement?
I've tried:
AND
[To] like '%@%[^m][^y][^c][^o][^m][^p][^a][^n][^y]%.%'
The ideal output will be:
alice@mycompany.com, bob@mycompany.com, malory@yourcompany.com 1
alice@mycompany.com, bob@mycompany.com 0
malory@yourcompany.com 1
Would it be better to write some kind of parsing function to split out addresses into a table if this isn't possible? I don't have an exhaustive list of other domains in the data.
It's ugly but it works. The CASE expression compares the number of occurrences of the @ symbol with the number of occurrences of @mycompany.com (the XXX... replacement just keeps the length of the string unchanged):
select
*
, flag = case when len(field) - len(replace(replace(field,'@mycompany.com','XXXXXXXXXXXXXX'),'@','')) > 0 then 1 else 0 end
from (
select 'alice@mycompany.com, bob@mycompany.com, malory@yourcompany.com' as field union all
select 'alice@mycompany.com, bob@mycompany.com' union all
select 'malory@yourcompany.com'
) x
I would suggest a simple counting approach. Count the number of times that "@mycompany.com" appears and count the number of addresses (the number of commas plus one). If these differ, then the row contains a non-company address:
select emails,
(case when len(emails) - len(replace(emails, ',', '')) + 1 =
len(emails) - len(replace(emails, '@mycompany.com', 'mycompany.com'))
then 0
else 1
end) as HasNonCompanyEmail
from t
To simplify the arithmetic, I replace "@mycompany.com" with "mycompany.com". This removes exactly one character per occurrence.
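On the question of parsing the list instead of counting characters, the flag can be sketched in Python as follows. This assumes addresses are comma-separated and use the usual @ separator; the function name `has_non_company_email` and the default domain argument are illustrative only.

```python
COMPANY_DOMAIN = "mycompany.com"


def has_non_company_email(to_field, domain=COMPANY_DOMAIN):
    """Return 1 if any address in the comma-separated list is outside the domain."""
    addresses = [a.strip() for a in to_field.split(",") if a.strip()]
    suffix = "@" + domain
    return int(any(not a.lower().endswith(suffix) for a in addresses))


rows = [
    "alice@mycompany.com, bob@mycompany.com, malory@yourcompany.com",
    "alice@mycompany.com, bob@mycompany.com",
    "malory@yourcompany.com",
]
flags = [has_non_company_email(r) for r in rows]
print(flags)
```

Splitting first and testing each address separately sidesteps the length arithmetic, at the cost of needing a split function or a client-side step when done in the database.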