TSQL: how to select a sentence by complex logic

Could you help me select sentences using some complex logic?
Platform: TSQL.
Initial data:
sentence | result
-----------------
company "Apple corp" has an apple on its logotype | 1
company "Apple computers" is a large company | 0
Apple company | 1
Conditions:
the sentence must match %Apple%
occurrences matching %"%Apple%"% are not taken into account
This means: if the sentence has only %"%Apple%"%, the condition is not met.
But if the sentence has both %Apple% AND %"%Apple%"%, the condition is met.
I tried to apply some kinds of logic:
First:
Substitute the word "Apple" with some rare symbol. Eg. "|"
Delete in the sentence all the symbols but | and quotes
To look for the "|" symbol and to look left and right from it. If the quote is absent on one of the sides, condition met.
Second:
Split the sentence on the basis of the word Apple
Third:
Split the sentence on the basis of the quotes
But either I don't know how to implement the logic technically, or the logic doesn't meet the goal.

If you really have to use sql, just use multiple conditions in your WHERE clause. This way, you don't have to call a function for replacements or other manipulations.
You can rephrase your conditions like this:
Text contains only Apple but not "Apple"
OR
Text contains both Apple and "Apple"
Possibility 1: First apple, then "apple"
Possibility 2: First "apple", then apple
WHERE
(Col LIKE '%apple%' AND Col NOT LIKE '%"%apple%"%') -- APPLE, but not "APPLE"
OR Col LIKE '%apple%"%apple%"%' -- APPLE .. "APPLE" ..
OR Col LIKE '%"%apple%"%apple%' -- "APPLE" .. APPLE ..
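One way to check the three-branch condition against the sample rows from the question is to run it through SQLite from Python. This is only a sketch under assumptions: the table name `sentences` and the helper `evaluate` are made up for illustration, and SQLite's `LIKE` (case-insensitive for ASCII by default) stands in for a case-insensitive SQL Server collation.

```python
import sqlite3

# Sample sentences from the question, paired with the expected flag.
ROWS = [
    ('company "Apple corp" has an apple on its logotype', 1),
    ('company "Apple computers" is a large company', 0),
    ('Apple company', 1),
]

# The WHERE clause from the answer, written as a 0/1 expression.
QUERY = """
SELECT col,
       (col LIKE '%apple%' AND col NOT LIKE '%"%apple%"%')
       OR col LIKE '%apple%"%apple%"%'
       OR col LIKE '%"%apple%"%apple%' AS result
FROM sentences
"""


def evaluate(rows):
    """Load the sentences into an in-memory table and return {sentence: flag}."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sentences (col TEXT)")
    conn.executemany("INSERT INTO sentences VALUES (?)", [(s,) for s, _ in rows])
    result = dict(conn.execute(QUERY).fetchall())
    conn.close()
    return result


if __name__ == "__main__":
    for sentence, flag in evaluate(ROWS).items():
        print(flag, sentence)
```

With a case-sensitive collation the patterns would have to use the same capitalisation as the data.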

Your sample data and explanation appear to require just the following. Does this work for you?
with d as (
select 'company "Apple corp" has an apple on its logotype 1' sentence union
select 'company "Apple computers" is a large company 0' union
select 'Apple company'
)
select * , case when Replace(Replace(sentence,'"apple',''),'apple"','') like '%apple%' then 1 else 0 end
from d;
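As a rough cross-check of the nested-REPLACE idea, here is a Python sketch (not T-SQL). The function name is hypothetical, and `lower()` is my stand-in for a case-insensitive collation, which is an assumption about the database.

```python
def has_unquoted_apple(sentence: str) -> bool:
    """Remove every Apple that touches a quote, then look for a leftover one."""
    lowered = sentence.lower()  # assumption: the collation ignores case
    stripped = lowered.replace('"apple', "").replace('apple"', "")
    return "apple" in stripped


if __name__ == "__main__":
    samples = [
        'company "Apple corp" has an apple on its logotype',
        'company "Apple computers" is a large company',
        "Apple company",
    ]
    for sample in samples:
        print(int(has_unquoted_apple(sample)), sample)
```

Note that only an Apple directly adjacent to a quote is removed, so a sentence such as `big "the Apple corp"` would still be flagged.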

I solved the problem by applying this logic:
1. Substitute Apple with '|'.
2. Delete from the sentence all characters except '|' and '"'.
3. Substitute '"' with '""', to handle cases where one quote belongs to two words, e.g. '"Apple"Apple"'.
4. Delete every '|' that is enclosed in quotes.
5. Select the sentences that contain '|'.
First, we need a function for step 2.
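create function [dbo].[fn_KeepCharacters](@String varchar(2000), @KeepValues varchar(255))
returns varchar(2000)
as
begin
    -- Build a PATINDEX pattern from the character class, e.g. '^"|' becomes '%[^"|]%'
    set @KeepValues = '%['+@KeepValues+']%'
    -- Remove one matching character per iteration until none remain
    while patindex(@KeepValues, @String) > 0
        set @String = stuff(@String, patindex(@KeepValues, @String), 1, '')
    return @String
end
go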
Full code:
with d as (
select 'big company "Apple"' col
union select 'Apple, start'
union select 'comp Apple" computers'
union select 'inc " int Apple ap'
union select '"i Apple""mac Apple" aa'
union select 'book "pen Apple"pineApple"'
union select 'leaf Apple"Apple"'
)
--, b as (
select col, case when replace(replace(dbo.fn_KeepCharacters(replace(col, 'Apple', '|'), '^"|'), '"', '""'), '"|"', '""')
like '%|%' then 1 else null end
col_sec
from d
Thanks to the Stack Overflow members for their help, especially to @Stu for the nested-REPLACE advice.
The problem is that fn_KeepCharacters contains a WHILE loop, which is very slow. I would appreciate faster solutions.
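The steps listed above can be traced outside the database with a small Python model. This is a sketch, not the T-SQL itself: the function names are hypothetical, `str.replace` is case-sensitive (whereas T-SQL REPLACE follows the collation), and the character filter runs in a single pass instead of a loop.

```python
def keep_characters(text: str, keep: str) -> str:
    """Drop every character that is not listed in `keep`."""
    return "".join(ch for ch in text if ch in keep)


def has_unquoted_apple(sentence: str) -> bool:
    marked = sentence.replace("Apple", "|")    # substitute Apple with '|'
    skeleton = keep_characters(marked, '|"')   # keep only '|' and '"'
    doubled = skeleton.replace('"', '""')      # double every quote
    unquoted = doubled.replace('"|"', '""')    # drop each '|' enclosed in quotes
    return "|" in unquoted                     # any '|' left means a match


if __name__ == "__main__":
    samples = [
        'big company "Apple"',
        "Apple, start",
        'comp Apple" computers',
        'inc " int Apple ap',
        '"i Apple""mac Apple" aa',
        'book "pen Apple"pineApple"',
        'leaf Apple"Apple"',
    ]
    for sample in samples:
        print(has_unquoted_apple(sample), sample)
```

The model assumes the input itself never contains a '|' character.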

Related

Trouble filtering out plural words

I have a table with the most frequent words in the English language which looks like this:
word count
cat 43534889
dog 34584357
hat 4343878
...
hats 44747
I'd like to exclude all the plural words like 'hats' if they already exist in singular form.
So I wrote this query
SELECT
word,
CASE WHEN CONCAT(word,'s') IN (
SELECT freq.word from `words.freq` as freq
WHERE freq.word LIKE '%s' AND LENGTH(freq.word) > 4
)
THEN 'plural'
ELSE 'sing'
END AS plural
FROM `words.freq` LIMIT 1000
My logic is: if the word 'hat' + 's' is found among words ending in 's' (subquery), it means it's just the plural form of that noun. Somehow the function CONCAT doesn't seem just to add 's' to each word, but it changes it so for example when I run this query, words like 'that' are somehow displayed as 'plural' as if they were longer than 4 characters and contained 's' at the end. I am really confused. Can anyone help?
This (in MySQL syntax) should do what you're looking for: as you say, this doesn't capture all the ways that English can make plurals, and it will also get some false positives ("hiss" would be considered as plural because "his" exists).
The idea is to look for words of >=4 characters ending in 's', and check whether the corresponding word with the final 's' removed exists:
SELECT word,
CASE
WHEN CHAR_LENGTH(word) >= 4 AND word LIKE '%s' AND LEFT(word, CHAR_LENGTH(word)-1) IN (SELECT word FROM words) THEN 'plural'
ELSE 'singular'
END AS plurality
FROM words;
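To see what this rule does on a handful of words, including the "hiss" false positive mentioned above, here is a small Python model of the same check. The function name and the word list are made up for illustration.

```python
def classify(words):
    """Label a word 'plural' if it has >= 4 characters, ends in 's',
    and the word without its final 's' is also in the list."""
    known = set(words)
    labels = {}
    for word in words:
        is_plural = len(word) >= 4 and word.endswith("s") and word[:-1] in known
        labels[word] = "plural" if is_plural else "singular"
    return labels


if __name__ == "__main__":
    sample = ["cat", "dog", "hat", "hats", "his", "hiss", "that"]
    for word, label in classify(sample).items():
        print(word, label)
```

Unlike the query in the question, the label lands on the longer word ("hats"), not on the word that merely has a longer sibling ("hat").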
I think you can sort the words in descending alphabetical order and compare each word with the next one to check whether that next word is its singular form.
WITH sample_table AS (
SELECT 'cat' word, 43534889 count UNION ALL
SELECT 'dog', 34584357 UNION ALL
SELECT 'hat', 4343878 UNION ALL
SELECT 'dogs', 38738 UNION ALL
SELECT 'hats', 44747
)
SELECT *,
IF(CONCAT(LEAD(word) OVER (ORDER BY word DESC), 's') = word, 'plural', 'singular') is_plural
FROM sample_table;

Conditional regexp_replace Oracle / PLSQL

I'm trying to do a conditional replace within one regexp_replace statement.
For example, if I have the string, 'Dog Cat Donkey', I would like to be able to replace 'Dog' with 'BigDog', 'Cat' with 'SmallCat' and 'Donkey' with 'MediumDonkey' to get the following:
'BigDog SmallCat MediumDonkey'
I can do it where all are prefixed with the word Big but can't seem to make it replace conditionally.
I currently have this
select regexp_replace('Dog Cat Donkey', '(Cat)|(Dog)|(Donkey)', ' Big\1\2\3')
from dual
but of course this only returns 'BigDog BigCat BigDonkey'.
I'm aware this isn't the best way of doing this but is it possible?
Have you considered just doing multiple replace()s?
select replace(replace(replace(str, 'Dog', 'BigDog'), 'Cat', 'SmallCat'), 'Donkey', 'MediumDonkey')
I get that regexp_replace() is really powerful. And it might be able to do this. But I'm not sure that's a better solution in terms of expressing what you are doing.
Query -
select listagg(final_str,' ') within group (order by sort_str) as output from (
SELECT
CASE LST
WHEN 'Dog' THEN 'BigDog'
WHEN 'Cat' THEN 'SmallCat'
WHEN 'Donkey' THEN 'MediumDonkey'
END AS final_str,
CASE LST
WHEN 'Dog' THEN 1
WHEN 'Cat' THEN 2
WHEN 'Donkey' THEN 3
END AS sort_str
from (
SELECT
trim(REGEXP_SUBSTR('Dog Cat Donkey', '(\S*)(\s*)', 1, LEVEL)) AS LST
FROM
DUAL
CONNECT BY
REGEXP_SUBSTR('Dog Cat Donkey', '(\S*)(\s*)', 1, LEVEL) IS NOT NULL
));
Output -
BigDog SmallCat MediumDonkey
For a conditional replacement via REGEXP_REPLACE, you currently have to repeat the call for each different replacement string.
But you can still use | (OR) within a single capture group to change more than one word with the same replacement string.
And, as Gordon Linoff pointed out, you don't really need REGEXP_REPLACE when a plain REPLACE is sufficient to match a single word.
select regexp_replace(
regexp_replace(
regexp_replace( str,
'(Dog|Snoopy)', 'Big\1')
,'(Cat|Feline)', 'Small\1')
,'(Donkey|Ass)', 'Medium\1')
from (select 'You Ass, that is not a Dog, but a Cat on a Donkey.' as str from dual);
Returns:
You MediumAss, that is not a BigDog, but a SmallCat on a MediumDonkey.
Do note, however, that when you use the pipe in a regex, the order of the alternatives matters.
So if some words start with the same characters, it is better to list them in order of descending length.
Example:
select
regexp_replace(str, '(foo|foobar)', '[\1]') as foo_foobar,
regexp_replace(str, '(foobar|foo)', '[\1]') as foobar_foo
from (select 'foo foobar' as str from dual);
Returns:
FOO_FOOBAR      FOOBAR_FOO
--------------- ---------------
[foo] [foo]bar  [foo] [foobar]
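The same ordering effect can be reproduced in Python, whose `re` module also tries alternatives from left to right. This is a sketch in a different regex engine than Oracle's, and the helper name `bracket` is hypothetical.

```python
import re


def bracket(pattern: str, text: str) -> str:
    """Wrap every match of `pattern` (which has one capture group) in brackets."""
    return re.sub(pattern, r"[\1]", text)


# Shorter alternative first: 'foo' wins even where 'foobar' would fit.
short_first = bracket(r"(foo|foobar)", "foo foobar")
# Longer alternative first: 'foobar' is matched as a whole.
long_first = bracket(r"(foobar|foo)", "foo foobar")

print(short_first)
print(long_first)
```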

PL SQL replace conditionally suggestion

I need to replace the entire word with 0 if the word has any non-digit character. For example, if digital_word='22B4' then replace with 0, else if digital_word='224' then do not replace.
SELECT replace_funtion(digital_word,'has non numeric character pattern',0,digital_word)
FROM dual;
I tried decode, regexp_instr, regexp_replace but could not come up with the right solution.
Please advise.
Thank you.
The idea is simple: you need to check whether the value is numeric or not.
script:
with nums as
(
select '123' as num from dual union all
select '456' as num from dual union all
select '7A9' as num from dual union all
select '098' as num from dual
)
select n.*
,nvl2(LENGTH(TRIM(TRANSLATE(num, ' +-.0123456789', ' '))),'0',num)
from nums n
result
1 123 123
2 456 456
3 7A9 0
4 098 098
See the articles below to decide which approach suits you best:
How can I determine if a string is numeric in SQL?
https://asktom.oracle.com/pls/asktom/f?p=100:11:0::::P11_QUESTION_ID:15321803936685
How to tell if a value is not numeric in Oracle?
You might try the following:
SELECT CASE WHEN REGEXP_LIKE(digital_word, '\D') THEN '0' ELSE digital_word END
FROM dual;
The regular expression class \D matches any non-digit character. You could also use [^0-9] to the same effect:
SELECT CASE WHEN REGEXP_LIKE(digital_word, '[^0-9]') THEN '0' ELSE digital_word END
FROM dual;
Alternately you could see if the value of digital_word is made up of nothing but digits:
SELECT CASE WHEN REGEXP_LIKE(digital_word, '^\d+$') THEN digital_word ELSE '0' END
FROM dual;
Hope this helps.
The fastest way is to replace all digits with null (to simply delete them) and see if anything is left. You don't need regular expressions (slow!) for this, you just need the standard string function TRANSLATE().
Unfortunately, Oracle has to work around their own inconsistent treatment of NULL - sometimes as empty string, sometimes not. In the case of the TRANSLATE() function, you can't simply translate every digit to nothing; you must also translate a non-digit character to itself, so that the third argument is not an empty string (which is treated as a real NULL, as in relational theory). See the Oracle documentation for the TRANSLATE() function. https://docs.oracle.com/cd/E11882_01/server.112/e41084/functions216.htm#SQLRF06145
Then, the result can be obtained with a CASE expression (or various forms of NULL handling functions; I prefer CASE, which is SQL Standard):
with
nums ( num ) as (
select '123' from dual union all
select '-56' from dual union all
select '7A9' from dual union all
select '0.9' from dual
)
-- End of simulated inputs (for testing only, not part of the solution).
-- SQL query begins BELOW THIS LINE. Use your own table and column names.
select num,
case when translate(num, 'z0123456789', 'z') is null
then num
else '0'
end as result
from nums
;
NUM RESULT
--- ------
123 123
-56 0
7A9 0
0.9 0
Note: everything here is in varchar2 data type (or some other kind of string data type). If the results should be converted to number, wrap the entire case expression within TO_NUMBER(). Note also that the strings '-56' and '0.9' are not all-digits (they contain non-digits), so the result is '0' for both. If this is not what you needed, you must correct the problem statement in the original post.
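For readers who want to trace the "delete all digits and see if anything is left" idea outside the database, here is a Python sketch of the same logic. The function name is hypothetical; Python's `str.translate` plays the role of Oracle's TRANSLATE(), and an empty leftover string plays the role of NULL.

```python
DIGITS = "0123456789"
# Translation table that deletes every ASCII digit.
_DELETE_DIGITS = str.maketrans("", "", DIGITS)


def zero_unless_all_digits(value: str) -> str:
    """Return `value` unchanged if it consists only of digits, else '0'."""
    leftover = value.translate(_DELETE_DIGITS)
    return value if leftover == "" else "0"


if __name__ == "__main__":
    for num in ["123", "-56", "7A9", "0.9"]:
        print(num, zero_unless_all_digits(num))
```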
Something like the following update query will help you:
update [table] set [col] = '0'
where REGEXP_LIKE([col], '.*\D.*', 'i')

Sorting (or usage of ORDER BY clause) in T-SQL / SQL SERVER without considering some words

I'm wondering whether it is possible to use the ORDER BY clause (or any other clause) to sort without considering some words.
For example, the article 'the':
Bank of Switzerland
Bank of America
The Bank of England
should be sorted into:
Bank of America
The Bank of England
Bank of Switzerland
and NOT
Bank of America
Bank of Switzerland
The Bank of England
select * from #test
order by
case when test like 'The %' then substring(test, 5, 8000) else test end
If you have a limited number of words that you wish to eliminate, then you might be able to remove them by judicious use of REPLACE, e.g.
ORDER BY REPLACE(REPLACE(' ' + Column + ' ',' the ',' '),' and ',' ')
However, as the number of words add up, you'll have more and more nested REPLACE calls. In addition, this ORDER BY will be unable to benefit from any indexes, and doesn't cope with punctuation marks.
If this sort is frequent and the queries would otherwise be able to benefit from an index, you might consider making the above a computed column, and creating an index over it (You would then order by the computed column).
You need to encode a method of turning one string into another and then ordering by that.
For example, if the method is just to strip away a leading occurrence of 'The '...
ORDER BY
CASE WHEN LEFT(yourField, 4) = 'The ' THEN RIGHT(yourField, LEN(yourField)-4) ELSE yourField END
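The "turn one string into another, then order by that" idea is the same as sorting with a key function. A minimal Python sketch of the leading-'The ' case, using the bank names from the question (the function name is hypothetical, and the prefix test is case-sensitive here, unlike a case-insensitive collation):

```python
def sort_key(name: str) -> str:
    """Sort by the name with a leading 'The ' stripped, if present."""
    prefix = "The "
    return name[len(prefix):] if name.startswith(prefix) else name


banks = ["Bank of Switzerland", "Bank of America", "The Bank of England"]
ordered = sorted(banks, key=sort_key)

for bank in ordered:
    print(bank)
```

The stored values are untouched; only the comparison uses the stripped form.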
Or, if you want to ignore all occurrences of 'the', wherever they occur, just use REPLACE...
ORDER BY
REPLACE(yourField, 'The', '')
You may end up with a fairly complex transposition, in which case you can do things like this...
SELECT
*
FROM
(
SELECT
<complex transposition> AS new_name,
*
FROM
whatever
)
AS data
ORDER BY
new_name
No, not really, because 'the' is arbitrary in this case. The closest you can get is to modify the field value, as below:
SELECT field1
FROM table
ORDER BY REPLACE(field1, 'The ', '')
The problem is that to replace two words, you have to nest REPLACE calls, which becomes a huge issue if you have more than about five words:
SELECT field1
FROM table
ORDER BY REPLACE(REPLACE(field1, 'of ', ''), 'The ', '')
Update: You don't really need to check whether 'the' or 'of' appears at the beginning of the field, because you only want to sort by the important words anyway. For example, Bank of America should appear before Bank England (the 'of' shouldn't make it sort after).
My solution is a little bit shorter:
DECLARE @Temp TABLE ( Name varchar(100) );
INSERT INTO @Temp (Name)
SELECT 'Bank of Switzerland'
UNION ALL
SELECT 'Bank of America'
UNION ALL
SELECT 'The Bank of England';
SELECT * FROM @Temp
ORDER BY LTRIM(REPLACE(Name, 'The ', ''))

Is it possible to query a comma separated column for a specific value?

I have (and don't own, so I can't change) a table with a layout similar to this.
ID | CATEGORIES
---------------
1 | c1
2 | c2,c3
3 | c3,c2
4 | c3
5 | c4,c8,c5,c100
I need to return the rows that contain a specific category id. I started by writing the queries with LIKE statements, because the values can be anywhere in the string:
SELECT id FROM table WHERE categories LIKE '%c2%';
would return rows 2 and 3.
SELECT id FROM table WHERE categories LIKE '%c3%' and categories LIKE '%c2%';
would again get me rows 2 and 3, but not row 4.
SELECT id FROM table WHERE categories LIKE '%c3%' or categories LIKE '%c2%';
would get me rows 2, 3, and 4.
I don't like all the LIKE statements. I've found FIND_IN_SET() in the Oracle documentation but it doesn't seem to work in 10g. I get the following error:
ORA-00904: "FIND_IN_SET": invalid identifier
00904. 00000 - "%s: invalid identifier"
when running this query: SELECT id FROM table WHERE FIND_IN_SET('c2', categories); (example from the docs) or this query: SELECT id FROM table WHERE FIND_IN_SET('c2', categories) <> 0; (example from Google)
I would expect it to return rows 2 and 3.
Is there a better way to write these queries instead of using a ton of LIKE statements?
You can, using LIKE. You don't want to match for partial values, so you'll have to include the commas in your search. That also means that you'll have to provide an extra comma to search for values at the beginning or end of your text:
select
*
from
YourTable
where
',' || CommaSeparatedValueColumn || ',' LIKE '%,SearchValue,%'
But this query will be slow, as will all queries using LIKE, especially with a leading wildcard.
And there's always a risk. If there are spaces around the values, or values can contain commas themselves in which case they are surrounded by quotes (like in csv files), this query won't work and you'll have to add even more logic, slowing down your query even more.
A better solution would be to add a child table for these categories. Or, better still, a separate table for the categories and a table that cross-links them to YourTable.
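To see the delimiter-wrapping trick on the question's data, here is a sketch that runs it through SQLite from Python. SQLite shares the `||` concatenation operator with Oracle, but it is a different engine, and the table name `items` and the helper name are made up for illustration.

```python
import sqlite3

# Rows from the question.
ROWS = [(1, "c1"), (2, "c2,c3"), (3, "c3,c2"), (4, "c3"), (5, "c4,c8,c5,c100")]


def ids_with_category(category: str):
    """Return the ids whose comma-separated list contains exactly `category`."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE items (id INTEGER, categories TEXT)")
    conn.executemany("INSERT INTO items VALUES (?, ?)", ROWS)
    cur = conn.execute(
        "SELECT id FROM items "
        "WHERE ',' || categories || ',' LIKE '%,' || ? || ',%' "
        "ORDER BY id",
        (category,),
    )
    ids = [row[0] for row in cur]
    conn.close()
    return ids


if __name__ == "__main__":
    print(ids_with_category("c2"))
    print(ids_with_category("c1"))
```

Because both the column and the search value are wrapped in commas, searching for c1 does not pick up c100.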
You can write a PIPELINED table function which return a 1 column table. Each row is a value from the comma separated string. Use something like this to pop a string from the list and put it as a row into the table:
PIPE ROW(ltrim(rtrim(substr(l_list, 1, l_idx - 1),' '),' '));
Usage:
SELECT * FROM MyTable
WHERE 'c2' IN (SELECT COLUMN_VALUE FROM TABLE(Util_Pkg.split_string(categories)));
See more here: Oracle docs
Yes and No...
"Yes":
Normalize the data (strongly recommended) - i.e. split the categories column so that you have each category in a separate... then you can just query it in a normal fashion...
"No":
As long as you keep this "pseudo-structure" there will be several issues (performance and others) and you will have to do something similar to:
SELECT * FROM MyTable WHERE categories LIKE 'c2,%' OR categories = 'c2' OR categories LIKE '%,c2,%' OR categories LIKE '%,c2'
If you absolutely must, you could define a function named FIND_IN_SET like the following:
CREATE OR REPLACE Function FIND_IN_SET
( vSET IN varchar2, vToFind IN VARCHAR2 )
RETURN number
IS
rRESULT number;
BEGIN
rRESULT := -1;
SELECT COUNT(*) INTO rRESULT FROM DUAL WHERE vSET LIKE ( vToFind || ',%' ) OR vSET = vToFind OR vSET LIKE ('%,' || vToFind || ',%') OR vSET LIKE ('%,' || vToFind);
RETURN rRESULT;
END;
You can then use that function like:
SELECT * FROM MyTable WHERE FIND_IN_SET (categories, 'c2' ) > 0;
For the sake of future searchers, don't forget the regular expression way:
with tbl as (
select 1 ID, 'c1' CATEGORIES from dual
union
select 2 ID, 'c2,c3' CATEGORIES from dual
union
select 3 ID, 'c3,c2' CATEGORIES from dual
union
select 4 ID, 'c3' CATEGORIES from dual
union
select 5 ID, 'c4,c8,c5,c100' CATEGORIES from dual
)
select *
from tbl
where regexp_like(CATEGORIES, '(^|\W)c3(\W|$)');
ID CATEGORIES
---------- -------------
2 c2,c3
3 c3,c2
4 c3
This matches on a word boundary, so even if the comma was followed by a space it would still work. If you want to be more strict and match only where a comma separates values, replace the '\W' with a comma. At any rate, read the regular expression as:
match a group of either the beginning of the line or a word boundary, followed by the target search value, followed by a group of either a word boundary or the end of the line.
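The same pattern can be tried out in Python's `re` module, which is a different engine than Oracle's but treats `^`, `$` and `\W` the same way for this data. A sketch with a hypothetical helper name; `re.escape` is an addition of mine so that the search value is taken literally.

```python
import re

# Rows from the question, keyed by id.
ROWS = {1: "c1", 2: "c2,c3", 3: "c3,c2", 4: "c3", 5: "c4,c8,c5,c100"}


def ids_matching(category: str):
    """Return ids where `category` is delimited by non-word characters or the string ends."""
    pattern = re.compile(r"(^|\W)" + re.escape(category) + r"(\W|$)")
    return [row_id for row_id, cats in ROWS.items() if pattern.search(cats)]


if __name__ == "__main__":
    print(ids_matching("c3"))
    print(ids_matching("c1"))
```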
As long as the comma-delimited list is short enough, you can also use a regular expression in this instance (the pattern argument of Oracle's regular expression functions, e.g., REGEXP_LIKE(), is limited to 512 bytes, and here the pattern is built from the list itself):
SELECT id, categories
FROM mytable
WHERE REGEXP_LIKE('c2', '^(' || REPLACE(categories, ',', '|') || ')$', 'i');
In the above I'm replacing the commas with the regular expression alternation operator |. If your list of delimited values is already |-delimited, so much the better.
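A quick way to see how the list turns into a pattern is to build it in Python. This is a sketch in a different regex engine with a hypothetical function name; like the query above, it assumes the list values contain no regex metacharacters.

```python
import re


def list_contains(categories: str, value: str) -> bool:
    """Turn 'c2,c3' into the pattern '^(c2|c3)$' and test `value` against it."""
    pattern = "^(" + categories.replace(",", "|") + ")$"
    return re.search(pattern, value, re.IGNORECASE) is not None


if __name__ == "__main__":
    print(list_contains("c2,c3", "c2"))
    print(list_contains("c4,c8,c5,c100", "c1"))
```

Note the reversed roles compared with the earlier answers: the list becomes the pattern and the search value becomes the string being tested.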