How to regex match multiple items - sql

I have a reviews table as follows:
r_id | comment
1    | Weight cannot exceed 40 kg
2    | You must not make the weight go over 31 k.g
3    | Don't excel above 94kg
4    | Optimal weight is 45 kg
5    | Don't excel above 62 kg
6    | Weight cannot exceed 7000g
What I want to select is the maximum weight each r_id cannot exceed, so my desired output is:
r_id | max weight
1    | 40
2    | 31
3    | 94
5    | 62
As you can see, r_id 4 wasn't included since it wasn't talking about the maximum weight, and 6 wasn't included because it is in grams. I am struggling with two things.
There are multiple phrases; how can I do an OR check in my regex?
Sometimes the kg number is written like 40kg, 40 KG, 40 k.g or 40kilos. While all of these mean kilograms, the "kg" is written in different ways. How can I extract only the number (while ensuring the kg is written in one of the above ways, so I don't accidentally extract something like 4000g)?
SELECT
    r_id,
    REGEXP_SUBSTR(REGEXP_SUBSTR(comment, 'cannot exceed [0-9]+ kg'), '[0-9]+') as "max weight"
FROM reviews;
My statement only checks for one particular type of sentence and doesn't check if the number is in kilograms.

You could just extract the number from the string (there appears to be only one per comment) and then check whether the string matches certain patterns:
select regexp_substr(comment, '[0-9]+') as "max weight"
from reviews
where regexp_like(comment, '(exceed|go over|above).*[0-9]+ ?(kg|k\.g)');
Here is a db<>fiddle.

You can use a more robust regular expression to extract the number.
I don't have an Oracle DB, but try something like:
SELECT
r_id,
REGEXP_SUBSTR(comment, '([0-9]+) ?(k\.?g\.?|kilos)', 1, 1, 'i', 1) as "max weight"
FROM reviews;
You can see this regex matching the given string in action at https://regex101.com/r/07Rstk/1. This also explains what the regex means.
We also turn on the case-insensitive flag ('i') in order to properly handle any capitalization; see https://docs.oracle.com/cd/E18283_01/olap.112/e17122/dml_functions_2069.htm
Edit: to also check for exceed, go over, etc., add those phrases as a leading group. The number is now the second capture group, so we pass 2 as the subexpression argument (the sixth parameter) to get it back:
SELECT
    r_id,
    REGEXP_SUBSTR(comment, '(exceed|go over|above)[[:space:]]*([0-9]+) ?(k\.?g\.?|kilos)', 1, 1, 'i', 2) as "max weight"
FROM reviews;
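If it helps to sanity-check the pattern outside the database, here is a minimal sketch of the same logic using Python's re module (Python regex syntax, not Oracle's; sample rows copied from the question):

```python
import re

# Same idea as the Oracle regex: a limiting phrase, then the number,
# then a kg-style unit. re.IGNORECASE plays the role of the 'i' flag.
PATTERN = re.compile(
    r"(?:exceed|go over|above)\s*([0-9]+) ?(?:k\.?g\.?|kilos)",
    re.IGNORECASE,
)

def max_weight(comment):
    """Return the kg limit mentioned in a comment, or None if absent."""
    m = PATTERN.search(comment)
    return int(m.group(1)) if m else None

rows = [
    (1, "Weight cannot exceed 40 kg"),
    (2, "You must not make the weight go over 31 k.g"),
    (3, "Don't excel above 94kg"),
    (4, "Optimal weight is 45 kg"),
    (5, "Don't excel above 62 kg"),
    (6, "Weight cannot exceed 7000g"),
]
matches = [(r_id, max_weight(c)) for r_id, c in rows
           if max_weight(c) is not None]
# matches == [(1, 40), (2, 31), (3, 94), (5, 62)]
```

Note how r_id 4 is skipped (no limiting phrase) and r_id 6 is skipped (grams, not kilograms), matching the desired output.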


SQL get numbers that contain certain digits

How do I get the Number values that contain the digits of 878, as reflected in the Result below? Each should contain two 8s and one 7.
Edit: using T-SQL.
Number
878
780
788
700
Result
878
788
There are only 3 combinations so what about the crudest of them all?
Number = 788 OR Number = 878 OR Number = 887
Or even:
Number IN (788,878,887)
If the numbers are not just 3 digits, then cast the number as a string and then:
NumberAsString LIKE '%887%' OR NumberAsString LIKE '%878%' OR NumberAsString LIKE '%788%'
A three digit number consisting of exactly one seven and two eights:
where number in (788, 878, 887)
Everything else would be overkill for this simple task.
If the task were different, say, the number can have more digits and must contain exactly one seven and two eights, then we could use an appropriate combination of LIKE and NOT LIKE to get what we are looking for. E.g:
where number like '%7%' -- contains a 7
and number like '%8%8%' -- contains two 8s
and number not like '%7%7%' -- doesn't contain more than one 7
and number not like '%8%8%8%' -- doesn't contain more than two 8s
UPDATE:
This is not a good solution; instead I would go for the next solution suggested here.
If you are using MySQL, REGEXP is your friend:
SELECT * FROM _TABLE_
WHERE `Number` REGEXP '7' AND `Number` REGEXP '8.*8';
(Note that "[8]{2}" would only match two consecutive 8s, which 878 does not have; '8.*8' matches two 8s anywhere in the value.)
More info: https://dev.mysql.com/doc/refman/8.0/en/regexp.html
I think in SQL Server it should be something like this (T-SQL LIKE has no repetition quantifiers, so repetition is expressed by repeating the wildcard pattern):
SELECT * FROM _TABLE_
WHERE Number LIKE '%7%' AND Number LIKE '%8%8%';
More info: https://learn.microsoft.com/en-us/sql/t-sql/language-elements/like-transact-sql?redirectedfrom=MSDN&view=sql-server-ver15
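Outside of SQL, the check is just a digit-multiset comparison; here is a small Python sketch of the same idea (function name is mine):

```python
# A number qualifies when its digits are exactly one 7 and two 8s,
# i.e. some permutation of "788".
def matches_878(number):
    return sorted(str(number)) == sorted("878")

candidates = [878, 780, 788, 700]
qualifying = [n for n in candidates if matches_878(n)]
# qualifying == [878, 788]
```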

Generate a progressive number when new records are inserted (some records need to have the same number)

The title can be a little confusing, so let me explain the problem. I have a pipeline that loads new records daily. These records contain sales; the key is <date, location, ticket, line>. The data is loaded into a Redshift table and then exposed through a view that is read by another system. That system has a limit: its ticket column is a varchar(10), but the ticket is a string of 30 characters. If the system takes only the first 10 characters, it will generate duplicates. The ticket number can be a "fake" number; it doesn't matter if it isn't equal to the real one. So I'm thinking of adding a new column to the Redshift table that contains a progressive number. The problem is that I cannot use an identity column, because records belonging to the same ticket must have the same "progressive number". I will then expose this new column (ticket_id) instead of the original one.
That is what I want:
day        | location | ticket   | line | amount | ticket_id
12/12/2020 | 67       | 123...GH | 1    | 10     | 1
12/12/2020 | 67       | 123...GH | 2    | 5      | 1
12/12/2020 | 67       | 123...GH | 3    | 23     | 1
12/12/2020 | 23       | 123...GB | 1    | 13     | 2
12/12/2020 | 23       | 123...GB | 2    | 45     | 2
...        | ...      | ...      | ...  | ...    | ...
12/12/2020 | 78       | 123...AG | 5    | 100    | 153
The next day, when new data is loaded, I want to start with ticket_id 154, and so on.
Each row has a column which specifies the instant at which it was inserted; rows inserted the same day have the same insert_time.
My solution is:
insert the records with ticket_id computed as a dense_rank. But each time I load new records (so each day) the ticket_id starts from one, so...
...update the rows just inserted, setting ticket_id = ticket_id + the max value found in the ticket_id column where insert_time != max(insert_time).
Do you think there is a better solution? It would be very nice if there were a hash function that takes <day, location, ticket> as input and returns a number of at most 10 characters.
So from the comments it sounds like you cannot add a dimension table to just look up the number or 10-character string that identifies each ticket, as this would be a data-model change. Such a lookup table would likely be the best and most accurate way to do this.
You asked about a hash function to do this, and there are several. But first let's talk about hashes: these take strings of varying length and make a signature out of them. Since this process can significantly reduce the number of characters, there is a possibility that two different strings will generate the same hash. The longer the hash value, the lower the odds of such a collision, but the odds are never zero. Since you can only have 10 chars, this caps how low the collision odds can go.
The md5() function on Redshift will take a string and make a 32-character string (base-16 characters) out of it. md5(day::text || location || ticket::text) will make such a hash out of the columns you mentioned. This process can produce 16^32 possible different strings, which is a big number.
But you only want a string of 10 character. The good news is that hash functions like md5() spread the differences between strings across the whole output so you can just pick any 10 characters to use. Doing this will reduce the number of unique values to 16^10 or about 1.1 trillion - still a big number but if you have billions of rows you could see a collision. One way to improve this would be to base64 encode the md5() output and then truncate to 10 characters. Doing this will require a UDF but would improve the number of possible hashes to 1.1E18 - a million times larger. If you want the output to be an integer you can convert hex strings to integers with strtol() but a 10 digit number only has 10 billion possible values.
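As a sketch of the truncation idea (using Python's hashlib rather than Redshift SQL, and with invented column values), base64-encoding the md5 digest before truncating keeps more entropy per character than hex does:

```python
import base64
import hashlib

def ticket_id(day, location, ticket, length=10):
    """Hash <day, location, ticket> down to a short, stable identifier.

    base64 packs ~6 bits per character, so 10 chars give about
    64^10 (~1.1e18) possible values, versus 16^10 (~1.1e12) for hex.
    """
    key = f"{day}|{location}|{ticket}".encode()
    digest = hashlib.md5(key).digest()         # 16 raw bytes
    return base64.b64encode(digest).decode()[:length]

a = ticket_id("2020-12-12", 67, "123GH")
b = ticket_id("2020-12-12", 67, "123GH")   # same ticket -> same id
c = ticket_id("2020-12-12", 23, "123GB")   # different ticket -> different id
```

Because the id is a pure function of the key columns, every line of the same ticket gets the same value with no sequencing step at load time.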
So if you are sure you want to use a hash this is quite possible. Just remember what a hash does.

display non-printable ascii characters in SQL as :ascii: or :print: does not work

I am trying to fetch all non-printable ASCII characters from the DESCRIPTION field of a table using SQL in TOAD; however, the query below is not working.
select regexp_instr(a.description, '[^[:ascii:]]') as description
from poline a
where a.ponum = 'XXX' and a.siteid = 'YYY'
and regexp_instr(a.description, '[^[:ascii:]]') > 0
The above query raised the error ORA-12729: invalid character class in regular expression. I tried :print: instead of :ascii:, but it didn't bring any result. Below is the description for this record, which has non-printable characters.
Sherlock 16 x 6.5” Wide Wheelbarrow wheel .M100P.10R – Effluent care bacteria and enzyme formulation
:ascii: is not a valid character class, and even if it were, it doesn't appear to be what you are trying to get here (ASCII does contain non-printable characters). Valid classes are listed in the Oracle regular-expression documentation.
Actually if you replace :ascii: with :print: in your original query, it will indeed return the first position in each POLINE.DESCRIPTION that is a non-printable character. (If it returns nothing for you, it may be because your DESCRIPTION data is actually all printable.)
But as you stated you want to identify Every non-printable char in each DESCRIPTION in POLINE, some changes would be needed. I'll include an example that gets every match as a starting place.
In this example, each DESCRIPTION will be decomposed to its individual constituent characters, and each char will be checked for printability. The location within the DESCRIPTION string along with the ASCII number of the non-printable character will be returned.
This example assumes there is a unique identifier for each row in POLINE, here called POLINE_ID.
First, create the test table:
CREATE TABLE POLINE(
POLINE_ID NUMBER PRIMARY KEY,
PONUM VARCHAR2(32),
SITEID VARCHAR2(32),
DESCRIPTION VARCHAR2(256)
);
And load some data. I inserted a couple non-printing chars in the example Sherlock string you provided, #23 and #17. An example string composed of only the first 64 ASCII chars (of which the first 31 are not in :print:) is also included, and some fillers to fall through the PONUM and SITEID predicates.
INSERT INTO POLINE VALUES (1,'XXX','YYY','Sherlock'||CHR(23)||' 16 x 6.5” Wide Wheelbarrow wheel .M100P.10R –'||CHR(17)||' Effluent care bacteria and enzyme formulation');
DECLARE
V_STRING VARCHAR2(64) := CHR(1);
BEGIN
FOR POINTER IN 2..64 LOOP
V_STRING := V_STRING||CHR(POINTER);
END LOOP;
INSERT INTO POLINE VALUES (2, 'XXX','YYY',V_STRING);
INSERT INTO POLINE VALUES (3, 'AAA','BBB',V_STRING);
END;
/
INSERT INTO POLINE VALUES(4,'XXX','YYY','VOLTRON');
Now we have 4 rows total. Three of them contain (multiple) non-printable characters, but only two of them should match all the restrictions.
Then run a query. There are two example queries below: the first uses REGEXP_INSTR as in your initial example query (substituting [[:cntrl:]] for [^[:print:]]). As an alternative, a second variant is also included that just checks whether each char is among the first 31 ASCII chars.
Both example queries will index every char of each DESCRIPTION, check whether it is printable, and collect the ASCII number and location of each non-printable character in each candidate DESCRIPTION. The example table here has DESCRIPTIONs that are 256 chars long, so this is used as the max index in the cartesian join.
Please note, these are not efficient, and are designed to get EVERY match. If you end up only needing the first match after all, your original query with :print: substituted will perform much better. Also, this could be tuned by dropping into PL/SQL or perhaps going recursive (if PL/SQL is allowed in your use case, or you are on 11gR2+, etc.). Also, some predicates here, such as REGEXP_LIKE, do not impact the end result and serve only to allow preliminary filtration. These could be superfluous (or worse) for you, depending on your data set.
First example, using regex and [[:cntrl:]]:
SELECT
POLINE_ID,
STRING_INDEX AS NON_PRINTABLE_LOCATION,
ASCII(REGEXP_SUBSTR(SUBSTR(DESCRIPTION, STRING_INDEX, 1), '[[:cntrl:]]', 1, 1)) AS NON_PRINTABLE_ASCII_NUMBER
FROM POLINE
CROSS JOIN (SELECT LEVEL AS STRING_INDEX
FROM DUAL
CONNECT BY LEVEL < 257) CANDIDATE_LOCATION
WHERE PONUM = 'XXX'
AND SITEID = 'YYY'
AND REGEXP_LIKE(DESCRIPTION, '[[:cntrl:]]')
AND REGEXP_INSTR(SUBSTR(DESCRIPTION, STRING_INDEX, 1), '[[:cntrl:]]', 1, 1, 0) > 0
AND STRING_INDEX <= LENGTH(DESCRIPTION)
ORDER BY 1 ASC, 2 ASC;
Second example, using ASCII numbers:
SELECT
POLINE_ID,
STRING_INDEX AS NON_PRINTABLE_LOCATION,
ASCII(SUBSTR(DESCRIPTION, STRING_INDEX, 1)) AS NON_PRINTABLE_ASCII_NUMBER
FROM POLINE
CROSS JOIN (SELECT LEVEL AS STRING_INDEX
FROM DUAL
CONNECT BY LEVEL < 257) CANDIDATE_LOCATION
WHERE PONUM = 'XXX'
AND SITEID = 'YYY'
AND REGEXP_LIKE(DESCRIPTION, '[[:cntrl:]]')
AND ASCII(SUBSTR(DESCRIPTION, STRING_INDEX, 1)) BETWEEN 1 AND 31
AND STRING_INDEX <= LENGTH(DESCRIPTION)
ORDER BY 1 ASC, 2 ASC;
In our test data, these queries will produce equivalent output. We should expect this to have two hits (for chrs 17 and 23) in the Sherlock DESCRIPTION, and 31 hits for the first-64-ascii DESCRIPTION.
Result:
POLINE_ID NON_PRINTABLE_LOCATION NON_PRINTABLE_ASCII_NUMBER
1 9 23
1 56 17
2 1 1
2 2 2
2 3 3
2 4 4
2 5 5
2 6 6
2 7 7
2 8 8
2 9 9
2 10 10
2 11 11
2 12 12
2 13 13
2 14 14
2 15 15
2 16 16
2 17 17
2 18 18
2 19 19
2 20 20
2 21 21
2 22 22
2 23 23
2 24 24
2 25 25
2 26 26
2 27 27
2 28 28
2 29 29
2 30 30
2 31 31
33 rows selected.
EDIT In response to comments, here is some elaboration on what we can expect from [[:cntrl:]] and [^[:cntrl:]] with regexp_instr.
[[:cntrl:]] will match any of the first 31 ascii characters, while [^[:cntrl:]] is the logical negation of [[:cntrl:]], so it will match anything except the first 31 ascii characters.
To compare these, we can start with the simplest case of only one character, ascii #31. Since there's only one character, the result can only be either match or miss. One will expect the following to return 1 for the match:
SELECT REGEXP_INSTR(CHR(31),'[[:cntrl:]]',1,1,0) AS MATCH_INDEX FROM DUAL;
MATCH_INDEX
1
But 0 for the miss with negating [^[:cntrl:]] :
SELECT REGEXP_INSTR(CHR(31),'[^[:cntrl:]]',1,1,0) AS MATCH_INDEX FROM DUAL;
MATCH_INDEX
0
Now if we include two (or more) characters that are a mix of printable and non-printable, there are more possible outcomes. Both [[:cntrl:]] and [^[:cntrl:]] can match, but they can only match different things. If we move from just ascii #31 to CHR(64)||CHR(31), we will still expect [[:cntrl:]] to match, but it should now return 2, since the non-printable character is in the second position.
SELECT REGEXP_INSTR(CHR(64)||CHR(31),'[[:cntrl:]]',1,1,0) AS MATCH_INDEX FROM DUAL;
MATCH_INDEX
2
And now [^[:cntrl:]] also has the opportunity to match (at the first position):
SELECT REGEXP_INSTR(CHR(64)||CHR(31),'[^[:cntrl:]]',1,1,0) AS MATCH_INDEX FROM DUAL;
MATCH_INDEX
1
When there are a mix of printable and control characters, both [[:cntrl:]] and [^[:cntrl:]] can match, but they will match at different indices.
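The same per-character scan is easy to verify outside the database; a short Python sketch (the sample string is mine, mimicking the Sherlock row):

```python
# Report the 1-based position (as Oracle's REGEXP_INSTR would) and the
# ASCII code of every control character (codes 1-31) in a string.
def control_chars(text):
    return [(i, ord(ch))
            for i, ch in enumerate(text, start=1)
            if 1 <= ord(ch) <= 31]

sample = "Sherlock" + chr(23) + " wheel" + chr(17) + " formulation"
positions = control_chars(sample)
# positions == [(9, 23), (16, 17)]
```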

SQL search with like

I'm trying to write a query for a table which has the following columns: from, to, range, with values like:
from | to  | range
1    | 100 | A
101  | 200 | B
201  | 300 | C
The columns are integers.
A user is going to give a number and I have to find which range it is in. Say the user sends 105: I know that with a query I can get that it is in range B. But the problem is that sometimes users do not know the complete number that is going to be sent; say they only know the first two digits of the number, something like 10. I have to return all the possibilities that could involve 10, say 10, 101, 1001, 10001. The problem is that if I use LIKE I will not receive all the values, because I do not have them in a column.
Any ideas how I can do this?
Just for your case (given the first two digits), you can use substr to pick the first two digits of FROM and the corresponding head of TO (length(to||'') - length(from||'') + 2 characters). Then compare them with the user input to find the range.
With this query, you can get all ranges which contain a number like what the user sent.
Using a user input of 12 as an example, the result will be the ranges containing numbers like '12%':
select from, to, range from
(select Integer(substr(from,1,2)) substr_from,
        Integer(substr(to,1,length(to||'')-length(from||'')+2)) substr_to,
        from, to, range
 from your_table)
where (substr_from<=12 and substr_to>=12)
   or (from<=12 and to>=12)
The table below shows the values of from, to, substr_from, substr_to:
from  | to    | substr_from | substr_to
1     | 100   | 1           | 100
101   | 200   | 10          | 20
201   | 300   | 20          | 30
...   | ...   | ...         | ...
901   | 1000  | 90          | 100
1001  | 1100  | 10          | 11
1101  | 1200  | 11          | 12
...   | ...   | ...         | ...
9901  | 10000 | 99          | 100
10001 | 10100 | 10          | 10
10101 | 10200 | 10          | 10
...   | ...   | ...         | ...
Since the input is 12, these ranges will be returned: 1-100, 101-200, 1101-1200, 1201-1300, ...
Of course this won't work if the given digits are from the middle of the number.
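The underlying question, "which ranges contain a number starting with these digits?", can be cross-checked with a brute-force Python sketch (range values taken from the example above):

```python
# Brute force: does [lo, hi] contain any number whose decimal form
# starts with the given prefix?
def range_has_prefix(lo, hi, prefix):
    return any(str(n).startswith(prefix) for n in range(lo, hi + 1))

ranges = [(1, 100), (101, 200), (201, 300), (1101, 1200), (1201, 1300)]
hits = [r for r in ranges if range_has_prefix(*r, "12")]
# hits == [(1, 100), (101, 200), (1101, 1200), (1201, 1300)]
```

Brute force is fine for validation but not for a production query; the substr comparison above approximates this check without enumerating every number.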

Finding what words a set of letters can create?

I am trying to write some SQL that will accept a set of letters and return all of the possible words it can make. My first thought was to create a basic three table database like so:
Words -- contains 200k words in real life
------
1 | act
2 | cat
Letters -- contains the whole alphabet in real life
--------
1 | a
3 | c
20 | t
WordLetters --First column is the WordId and the second column is the LetterId
------------
1 | 1
1 | 3
1 | 20
2 | 3
2 | 1
2 | 20
But I'm a bit stuck on how I would write a query that returns words that have an entry in WordLetters for every letter passed in. It also needs to account for words that have two of the same letter. I started with this query, but it obviously does not work:
SELECT DISTINCT w.Word
FROM Words w
INNER JOIN WordLetters wl
ON wl.LetterId = 20 AND wl.LetterId = 3 AND wl.LetterId = 1
How would I write a query to return only words that contain all of the letters passed in and accounting for duplicate letters?
Other info:
My Word table contains close to 200,000 words which is why I am trying to do this on the database side rather than in code. I am using the enable1 word list if anyone cares.
Ignoring, for the moment, the SQL part of the problem, the algorithm I'd use is fairly simple: start by taking each word in your dictionary, and producing a version of it with the letters in sorted order, along with a pointer back to the original version of that word.
This would give a table with entries like:
sorted_text word_id
act 123 /* we'll assume `act` was word number 123 in the original list */
act 321 /* we'll assume 'cat' was word number 321 in the original list */
Then when we receive an input (say, "tac"), we sort its letters and look that up in our table of sorted letters joined to the table of the original words, and that gives us the list of words that can be created from that input.
If I were doing this, I'd have the tables for that in a SQL database, but probably use something else to pre-process the word list into the sorted form. Likewise, I'd probably leave sorting the letters of the user's input to whatever I was using to create the front-end, so SQL would be left to do what it's good at: relational database management.
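The pre-processing step can be sketched in a few lines of Python (a dict stands in for the SQL table; word list from the question):

```python
from collections import defaultdict

# Index every word under its letters-in-sorted-order key; looking up the
# sorted input then yields every word spellable from exactly those letters.
def build_index(words):
    index = defaultdict(list)
    for word in words:
        index["".join(sorted(word))].append(word)
    return index

index = build_index(["act", "cat", "tom", "moot"])
anagrams = index.get("".join(sorted("tac")), [])
# anagrams == ["act", "cat"]
```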
If you use the solution you provide, you'll need to add an order column to the WordLetters table. Without that, there's no guarantee that the rows you retrieve will be in the same order you inserted them.
However, I think I have a better solution. Based on your question, it appears that you want to find all words with the same component letters, independent of order or number of occurrences. This means that you have a limited number of possibilities. If you translate each letter of the alphabet into a different power of two, you can create a unique value for each combination of letters (aka a bitmask). You can then simply add together the values for each letter found in a word. This will make matching the words trivial, as all words with the same letters will map to the same value. Here's an example:
WITH letters
AS (SELECT Cast('a' AS VARCHAR) AS Letter,
1 AS LetterValue,
1 AS LetterNumber
UNION ALL
SELECT Cast(Char(97 + LetterNumber) AS VARCHAR),
Power(2, LetterNumber),
LetterNumber + 1
FROM letters
WHERE LetterNumber < 26),
words
AS (SELECT 1 AS wordid, 'act' AS word
UNION ALL SELECT 2, 'cat'
UNION ALL SELECT 3, 'tom'
UNION ALL SELECT 4, 'moot'
UNION ALL SELECT 5, 'mote')
SELECT wordid,
word,
Sum(distinct LetterValue) as WordValue
FROM letters
JOIN words
ON word LIKE '%' + letter + '%'
GROUP BY wordid, word
As you'll see if you run this query, "act" and "cat" have the same WordValue, as do "tom" and "moot", despite the difference in number of characters.
What makes this better than your solution? You don't have to build a lot of non-words to weed them out. This will constitute a massive savings of both storage and processing needed to perform the task.
There is a solution to this in SQL. It involves using a trick to count the number of times that each letter appears in a word. The following expression counts the number of times that 'a' appears:
select len(word) - len(replace(word, 'a', ''))
The idea is to count the total of all the letters in the word and see if that matches the overall length:
select wls.word
from
(
    select w.word,
           (LEN(w.word) - LEN(replace(w.word, wl.letter, ''))) as LettersInWord
    from word w
    cross join wordletters wl
) wls
group by wls.word
having LEN(wls.word) = SUM(LettersInWord)
This particular solution allows multiple occurrences of a letter. I'm not sure if this was desired in the original question or not. If we want up to a certain number of occurrences, then we might do the following:
select wls.word
from
(
    select w.word,
           (case when (LEN(w.word) - LEN(replace(w.word, wl.letter, ''))) <= maxcount
                 then (LEN(w.word) - LEN(replace(w.word, wl.letter, '')))
                 else maxcount end) as LettersInWord
    from word w
    cross join
    (
        select letter, count(*) as maxcount
        from wordletters
        group by letter
    ) wl
) wls
group by wls.word
having LEN(wls.word) = SUM(LettersInWord)
If you want an exact match to the letters, then the case statement should use " = maxcount" instead of " <= maxcount".
In my experience, I have actually seen decent performance with small cross joins, so this might actually work server-side. There are two big advantages to doing this work on the server. First, it takes advantage of the parallelism on the box. Second, a much smaller set of data needs to be transferred across the network.
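The occurrence-counting approach maps directly onto collections.Counter in Python; a sketch (word list and input letters are mine):

```python
from collections import Counter

# A word can be spelled from the input letters when, for every letter,
# the word needs no more copies than the input supplies -- the same
# per-letter "maxcount" cap as in the SQL above.
def spellable(word, letters):
    need, have = Counter(word), Counter(letters)
    return all(have[ch] >= n for ch, n in need.items())

words = ["act", "cat", "moot", "tom"]
found = [w for w in words if spellable(w, "tact")]
# found == ["act", "cat"]
```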