Split a string to its subwords - sql

Every letter has a value
a b c d e f g h i j k l m n o p q r s t u v w x y z
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
TableA
String Length Value Subwords
exampledomain 13 132 #example-domain#example-do-main#
creditcard 10 85 #credit-card#credit-car-d#
TableB
Words Length Value
example 7 76
do 2 19
main 4 37
domain 6 56
credit 6 59
card 4 26
car 3 22
d 1 4
Explanation
TableA has string based over milion rows, and it will be new added 100k rows/daily to tableA.
And also "string" column has no whitespaces
TableB has words based over milion rows,there is every letter and words in 1-2 languages
What i want to do
i want to split strings in TableA to its subwords, as you see in example; "creditcard" i search in TableB all words and try to find which words when comes together matches the string
What i did,and couldnt solve my question
i took the string and JOIN the TableB with INNER JOINS i made 2-3 times INNER JOINS because there can be 3word 4word strings too, and that WORKED!! but it takes too much time even doing it for 100-200 strings. Guess i want to do it for 100k/everyday???
Now what i try to do
i gave values to everyletter as you see above,
Took the strings one by one and from their including letters i count the value of strings..
And the same for the words too in TableB..
Now i have every string in TableA and everyword in TableB with their VALUES..
_
1- i will take the string,length and value of it (Exmple; creditcard - 10 - 85)
2- and make a search in TableB to find the possible words when they come together, with their SUM(length), and SUM(value) matches the strings length and value, and write theese possibilities to a new column.
At last even their sum of length and sum of values matches each other there can be some posibilities that doesnt match the whole string i will elliminate theese ones (Example; "doma-in" can be "moda-in" too and their lengths and values are same but not same words)
I dont know but,i guess with that value method i can solve the time proplem??? , or if there is another ways to do that, i will be gratefull taking your advices.
Thanks

You could try to find the solutions recursively by looking always at the next letter. For example for the word DOMAIN
D - no
DO - is a word!
M - no
MA - no
MAI - no
MAIN - is a word!
No more letters --> DO + MAIN
DOM - is a word!
A - no
AI - no
AIN - no
Finished without result
DOMA - no
DOMAI - no
DOMAIN - is a word!
No more letters --> DOMAIN

Related

SQL Max Consecutive Values in a number set using recursion

The following SQL query is supposed to return the max consecutive numbers in a set.
WITH RECURSIVE Mystery(X,Y) AS (SELECT A AS X, A AS Y FROM R)
UNION (SELECT m1.X, m2.Y
FROM Mystery m1, Mystery m2
WHERE m2.X = m1.Y + 1)
SELECT MAX(Y-X) + 1 FROM Mystery;
This query on the set {7, 9, 10, 14, 15, 16, 18} returns 3, because {14 15 16} is the longest chain of consecutive numbers and there are three numbers in that chain. But when I try to work through this manually I don't see how it arrives at that result.
For example, given the number set above I could create two columns:
m1.x
m2.y
7
7
9
9
10
10
14
14
15
15
16
16
18
18
If we are working on rows and columns, not the actual data, as I understand it WHERE m2.X = m1.Y + 1 takes the value from the next row in Y and puts it in the current row of X, like so
m1.X
m2.Y
9
7
10
9
14
10
15
14
16
15
18
16
18
Null?
The main part on which I am uncertain is where in the SQL recursion actually happens. According to Denis Lukichev recursion is the R part - or in this case the RECURSIVE Mystery(X,Y) - and stops when the table is empty. But if the above is true, how would the table ever empty?
Since I don't know how to proceed with the above, let me try a different direction. If WHERE m2.X = m1.Y + 1 is actually a comparison, the result should be:
m1.X
m2.Y
14
14
15
15
16
16
But at this point, it seems that it should continue recursively on this until only two rows are left (nothing else to compare). If it stops here to get the correct count of 3 rows (2 + 1), what is actually stopping the recursion?
I understand that for the above example the MAX(Y-X) + 1 effectively returns the actual number of recursion steps and adds 1.
But if I have 7 consecutive numbers and the recursion flows down to 2 rows, should this not end up with an incorrect 3 as the result? I understand recursion in C++ and other languages, but this is confusing to me.
Full disclosure, yes it appears this is a common university question, but I am retired, discovered this while researching recursion for my use, and need to understand how it works to use similar recursion in my projects.
Based on this db<>fiddle shared previously, you may find it instructive to alter the CTE to include an iteration number as follows, and then to show the content of the CTE rather than the output of final SELECT. Here's an amended CTE and its content after the recursion is complete:
Amended CTE
WITH RECURSIVE Mystery(X,Y) AS ((SELECT A AS X, A AS Y, 1 as Z FROM R)
UNION (SELECT m1.X, m2.A, Z+1
FROM Mystery m1
JOIN R m2 ON m2.A = m1.Y + 1))
CTE Content
x
y
z
7
7
1
9
9
1
10
10
1
14
14
1
15
15
1
16
16
1
18
18
1
9
10
2
14
15
2
15
16
2
14
16
3
The Z field holds the iteration count. Where Z = 1 we've simply got the rows from the table R. The, values X and Y are both from the field A. In terms of what we are attempting to achieve these represent sequences consecutive numbers, which start at X and continue to (at least) Y.
Where Z = 2, the second iteration, we find all the rows first iteration where there is a value in R which is one higher than our Y value, or one higher than the last member of our sequence of consecutive numbers. That becomes the new highest number, and we add one to the number of iterations. As only three numbers in our original data set have successors within the set, there are only three rows output in the second iteration.
Where Z = 3, the third iteration, we find all the rows of the second iteration (note we are not considering all the rows of the first iteration again), where there is, again, a value in R which is one higher than our Y value, or one higher than the last member of our sequence of consecutive numbers. That, again, becomes the new highest number, and we add one to the number of iterations.
The process will attempt a fourth iteration, but as there are no rows in R where the value is one more than the Y values from our third iteration, no extra data gets added to the CTE and recursion ends.
Going back to the original db<>fiddle, the process then searches our CTE content to output MAX(Y-X) + 1, which is the maximum difference between the first and last values in any consecutive sequence, plus one. This finds it's value from the record produced in the third iteration, using ((16-14) + 1) which has a value of 3.
For this specific piece of code, the output is always equivalent to the value in the Z field as every addition of a row through the recursion adds one to Z and adds one to Y.

Regex: extracting a house number from an address

I have following patterns:
13 R 2
48 B / 5
42 B
42B
303 Box 15
303 Bte 15
303 B Bt 15
and only want to have the following results (because Box 15, Bte 15 are the box numbers, and I only want the house nbr + potentially the letter attached to the house number):
13 R 2
48 B / 5
42 B
42B
303
303
303 B
Is this possible using a regular expression? I tried the following: REGEXP_SUBSTR(my_string_variable, '^\d+(\s*\w$)?'). This however only works for the patterns 3-5, and not for the first 2 and last patterns. Dropping the $ from the regex would incorrectly 'strip' the first letter for patterns 5 and 6.
I am basically assuming that if the letter behind the numeric is more than 1 character, that it belongs to the box number. For example, BTE is the French abbreviation for Boite which means Box. I realise this might be invalid if a house number has 2 letters (e.g.: 11 AA), but I would not know a solution for this and I don't think it occurs much.
This will remove: a space followed by an uppercase letter followed by at least one lowercase letter followed by an optional space followed by any number of digits:
RegExp_Replace(house_number, '\s[A-Z][a-z]+\s+\d+$')
See regex101.com

Extracting a word from string from n rows and append that word as a new col in SQL Server

I have got a data set that contains 3 columns and has 15565 observations. one of the columns has got several words in the same row.
What I am looking to do is to extract a particular word from each row and append it to a new column (i will have 4 cols in total)
The problem is that the word that i am looking for are not the same and they are not always on the same position.
Here is an extract of my DS:
x y z
-----------------------------------------------------------------------
1 T 3C00652722 (T558799A)
2 T NA >> MSP: T0578836A & 3C03024632
3 T T0579010A, 3C03051500, EAET03051496
4 U T0023231A > MSP: T0577506A & 3C02808556
8 U (T561041A C72/59460)>POPMigr.T576447A,C72/221816*3C00721502
I am looking to extract all the words that start with 3Cand are 10 characters long and then append the to a new col so it looks like this:
x y z Ref
----------------------------------------------------------------
1 T 3C00652722 (T558799A) 3C00652722
2 T NA >> MSP: T0578836A & 3C03024632 3C03024632
3 T T0579010A, 3C03051500, EAET03051496 3C03051500
4 U T0023231A > MSP: T0577506A & 3C02808556 3C02808556
8 U >POPMigr.T576447A,C72/221816*3C00721502 3C00721502
I have tried using the Contains, Like and substring methods but it does not give me the results i am looking for as it basically finds the rows that have the 3C number but does not extract it, it just copies the whole cell and pastes is on the Ref column.
SQL Server doesn't have good string functions, but this should suffice if you only want to extract one value per row:
select t.*,
left(stuff(col,
1,
patindex('%3C[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]%', col),
''
), 10)
from t ;

SQL find all rows with assigned values

MSSQL: i have this example data:
NAME AValue BValue
A 1 11
B 1 11
C 2 11
D 2 21
E 3 21
F 3 21
G 4 31
H 4 31
I 5 41
J 5 NULL
...
I am looking for algorhitm which looks for all the Names closed by values by different seed (AValue and Bvalue, in this case seed is given by 2 for AValue and by 3 for Bvalue, but this can be skipped and given later and so on, not only looking for smallest multiple). In this case output should be 1,2,3,4,11,21,31 as a first group/result. Then all the Names with these values can be updated etc.
I need to find out all the Names in "closed circle" of values by different seed.
EDIT:
(try of simplier example)
Imagine that you have list of names. Each name is given two numbers. In most cases these numbers are given by some seed (in this example AValue is given twice, BValue three times) but some numbers can be skipped, so you cannot just count smallest multiple of these different seeds(in this case it would be 2x3, ever 6 names you have closed group where no Name contains AValue or BValue from next/different group). For example Name A have 1 and 11. 1 is given for A and B, 11 for A, B, C. These Names have 1,2,11,21. So you check for 2 and 21 and then you get E and F in addition and then the loop of checking should continue, but as long as no more Names are contained there should be output 1,2,3,11,21. "Closed circle"

In postgresql How to get all rows that ends with 1?

Suppose I have a table as follows:
id name length
1 A 21.5
2 B 12.4
3 C 0
4 D 17
5 E 1
I wish to get:
id name length
1 A 21.5
5 E 1
Meaning all rows that hase length that ends up with 1.
length is a numeric column.
It's very simple thing to do with programing languages but it seems quite not natural for SQL. How can I do that efficiently and simply?
My only thought is to convert the field to Text and then lose eveything after the . then convert it to array and choose the letter in the position of array length. This will probebly work but it seems like a very bad solution.
You can use FLOOR and modulo division:
SELECT *
FROM tab
WHERE FLOOR(length) % 10 = 1;
SqlFiddleDemo