Comparing two tables and finding partial match (SQL / Oracle)

I haven't quite found an answer to this problem; it seems a bit tricky (and yes, I am a beginner). I have two tables, eb_site and eb_register, and both have the column id_glo, which connects them. The values in these fields are not quite the same, though; the number is the connecting factor. An example:
eb_site = kplus.hs.dlsn.3074823
eb_register = kplus.hs.register.3074823-1
How could I list the rows where the number in eb_register deviates from the number in eb_site (disregarding the dlsn/register mismatch)?
And also the rows where eb_register has a -1 at the end, as in the example (the fixed ones don't have the -1 at the end).
Thanks for any replies.
edit: oops sorry guys, worded it badly, have edited
Rgds,
Steinar

The quality of the solution will depend on the possible id_glo values and the SQL dialect you can use.
As a start, try:
select s.id_glo
, r.id_glo
from eb_site s
inner join eb_register r on ( replace(replace(s.id_glo, 'kplus.hs.register.', ''), 'kplus.hs.dlsn.', '') <> replace(replace(r.id_glo, 'kplus.hs.register.', ''), 'kplus.hs.dlsn.', '')
and replace(replace(r.id_glo, 'kplus.hs.register.', ''), 'kplus.hs.dlsn.', '') not like replace(replace(s.id_glo, 'kplus.hs.register.', ''), 'kplus.hs.dlsn.', '') || '-%'
)
;
This query assumes that:
there are no prefixes other than the two you've given
number suffixes (such as -1) occur only in records from eb_register
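To make the comparison rule in the query above easier to check outside the database, here is a minimal Python sketch of the same logic. It relies on the same two assumptions (only the two known prefixes, suffixes only on the eb_register side); the function names are hypothetical and not part of the original answer.

```python
# The two prefixes from the question; any other prefix would not be stripped.
PREFIXES = ("kplus.hs.register.", "kplus.hs.dlsn.")


def strip_prefix(id_glo: str) -> str:
    """Remove either known prefix, mirroring the nested replace() calls."""
    for prefix in PREFIXES:
        id_glo = id_glo.replace(prefix, "")
    return id_glo


def deviates(site_id: str, register_id: str) -> bool:
    """True when the register number is neither equal to the site number
    nor the site number followed by a '-...' suffix."""
    site_num = strip_prefix(site_id)
    reg_num = strip_prefix(register_id)
    return reg_num != site_num and not reg_num.startswith(site_num + "-")
```

With the example values from the question, deviates("kplus.hs.dlsn.3074823", "kplus.hs.register.3074823-1") is False, because the only difference is the -1 suffix.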

If the numbers match, then the reversed numbers match too. The following extracts the number (together with the period that precedes it) from each key, using SQL Server syntax:
select *
from eb_site s join
eb_register r
on left(REVERSE(s.id_glo), charindex('.', reverse(s.id_glo))) =
left(REVERSE(r.id_glo), charindex('.', reverse(r.id_glo)))
In other databases, charindex() might need to be replaced by another function, such as instr(), locate(), or position().
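The "everything after the last period" idea can be sketched in Python as follows. This is an illustration only: number_part mirrors what the reverse/charindex expression isolates, while base_number is an added assumption (not in the answer above) for ignoring a suffix such as -1.

```python
def number_part(id_glo: str) -> str:
    """Return everything after the last '.', e.g. '3074823-1'."""
    return id_glo.rpartition(".")[2]


def base_number(id_glo: str) -> str:
    """Assumption: drop an optional '-N' suffix so that '3074823-1'
    compares equal to '3074823'."""
    return number_part(id_glo).split("-", 1)[0]
```

Comparing number_part values treats 3074823 and 3074823-1 as different keys; comparing base_number values treats them as the same.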

Related

SQL - Version comparison

this is my first question here.
I am building an SQL query in which I need to verify that the version of object B is always lower than or equal to the version of object A. This is a link table; here is an example:
The query is:
SELECT *
FROM TABLE
WHERE B_VERSION <= A_VERSION
As you can see, it works for the first two rows but not the third, because AA0 is treated as smaller than H08 when it shouldn't be (after Z99 the next version number is AA0, so the <= operator no longer works).
So I would like to parse the versions, compare how many letters each one contains, and use the <= operator only if both versions have the same number of letters.
However, I don't know how to do that in an SQL query, and I didn't find anything useful on Google either. Do you have a solution?
Thanks in advance

The key to solving this problem is the PATINDEX function.
This query takes the value of A_VERSION and finds the first occurrence of a digit, then uses that position to split the value into two parts. The first part is left-padded with spaces because it is alphabetic, while the second part is left-padded with zeros ('0') because it is numeric.
The same process is applied to B_VERSION.
Note that in this example each part is assumed to be at most 5 characters long, so this will work in your case for versions ranging from A0 to ZZZZZ99999. Feel free to adjust as needed.
SELECT *
FROM TABLE
WHERE RIGHT(SPACE(5)
+ SUBSTRING(A_VERSION,
1,
PATINDEX('%[0-9]%', A_VERSION) - 1), 5)
+ RIGHT(REPLICATE('0', 5)
+ SUBSTRING(A_VERSION,
PATINDEX('%[0-9]%', A_VERSION),
LEN(A_VERSION)), 5)
>= RIGHT(SPACE(5)
+ SUBSTRING(B_VERSION,
1,
PATINDEX('%[0-9]%', B_VERSION) - 1), 5)
+ RIGHT(REPLICATE('0', 5)
+ SUBSTRING(B_VERSION,
PATINDEX('%[0-9]%', B_VERSION),
LEN(B_VERSION)) ,5)
If you are going to do this operation in many places, you might consider creating a function for this operation.
Hope this helps.
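The padding idea can be checked quickly with a small Python sketch. It builds the same kind of fixed-width sort key (letters right-aligned with spaces, digits right-aligned with zeros); version_key is a hypothetical name, and unlike RIGHT(..., 5) in the query it does not truncate parts longer than the width.

```python
import re


def version_key(version: str, width: int = 5) -> str:
    """Build a fixed-width key: letters padded on the left with spaces,
    digits padded on the left with zeros."""
    match = re.search(r"[0-9]", version)
    split_at = match.start() if match else len(version)
    letters, digits = version[:split_at], version[split_at:]
    return letters.rjust(width) + digits.rjust(width, "0")
```

Because a space sorts before any letter, a version with fewer letters always sorts first: version_key("Z99") is less than version_key("AA0"), whereas the raw strings compare the other way round.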

Many thanks! It helped a lot. However, I am using SQL Developer and cannot use PATINDEX there; I found an equivalent, REGEXP_INSTR, which works very similarly.
I used this algorithm, which keeps the rows where VERSION_B has fewer letters than VERSION_A, plus the rows where both have the same number of letters and VERSION_B is not greater than VERSION_A:
WHERE
(REGEXP_INSTR(VERSION_B, '[0-9]') < REGEXP_INSTR(VERSION_A, '[0-9]')) OR
(REGEXP_INSTR(VERSION_B, '[0-9]') = REGEXP_INSTR(VERSION_A, '[0-9]') AND VERSION_B <= VERSION_A)
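For readers who want to trace that condition by hand, here is a rough Python equivalent. It is a sketch under the assumption that every version is a run of letters followed by digits; first_digit_pos and b_not_newer are hypothetical names.

```python
import re


def first_digit_pos(version: str) -> int:
    """1-based position of the first digit, or 0 if there is none
    (the same convention REGEXP_INSTR uses)."""
    match = re.search(r"[0-9]", version)
    return match.start() + 1 if match else 0


def b_not_newer(version_b: str, version_a: str) -> bool:
    """Mirror of the WHERE clause: fewer letters wins outright; with the
    same number of letters, fall back to plain string comparison."""
    pos_b = first_digit_pos(version_b)
    pos_a = first_digit_pos(version_a)
    return pos_b < pos_a or (pos_b == pos_a and version_b <= version_a)
```

For example, b_not_newer("H08", "AA0") is True and b_not_newer("AA0", "H08") is False, which is the ordering the plain <= operator gets wrong.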

Count occurrences of a pattern in SQL Server column

I have a varchar column in SQL Server 2012 with 3-letter patterns that are concatenated, like this value:
DECLARE @str VARCHAR(MAX) = 'POKPOKPOKHRSPOKPOKPOKPOKPOKPOIHEFHEFPOKPOHHRTHRT'
I need a query to search and count the occurrences of the pattern POK in that string. The trick is, all POK that are together must be counted as one. So, in the string above there are 3 "chains" of POK:
POKPOKPOK, interrupted by a HRS
POKPOKPOKPOKPOK, interrupted by a POI
POK, interrupted by a POH
So, my desired result is 3. If I use the following query, I get 9, that are the total POKs in string, which is not what I need.
SELECT (LEN(@str) - LEN(REPLACE(@str, 'POK', '')))/LEN('POK')
I think I need some sort of regexp to isolate the POKs and then count, but couldn't find a way to apply that in SQL Server. Any help much appreciated.

This is really not something that you want to do in SQL. But you can. Here is one method to reduce the adjacent 'POK's to a single POK:
select replace(replace(@str, 'POK', '<POK>'), 'POK><', '')
Well, this actually creates a '<POK>', but that is fine for our purposes.
Now, you can search in that:
select (len(replace(replace(@str, 'POK', '<POK>'), 'POK><', '')) -
len(replace(replace(replace(@str, 'POK', '<POK>'), 'POK><', ''), 'POK', ''))
) / 3
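The replace trick is easy to verify outside SQL Server. The Python sketch below applies the same two replacements and cross-checks the result by splitting the string into 3-letter tokens and counting runs; both function names are hypothetical, and the token-based check assumes the column really is a concatenation of 3-letter codes.

```python
from itertools import groupby


def count_chains_replace(text: str, pattern: str = "POK") -> int:
    """Collapse adjacent patterns to one via the '<POK>' trick, then count."""
    marked = text.replace(pattern, "<" + pattern + ">")
    collapsed = marked.replace(pattern + "><", "")
    return collapsed.count(pattern)


def count_chains_tokens(text: str, pattern: str = "POK", size: int = 3) -> int:
    """Split into fixed-size tokens and count runs equal to the pattern."""
    tokens = [text[i:i + size] for i in range(0, len(text), size)]
    return sum(1 for token, _run in groupby(tokens) if token == pattern)
```

Both functions return 3 for the sample string from the question.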

BigQuery Standard performance of REGEXP_REPLACE vs RTRIM

I was wondering about performances related to some RTRIM and REGEXP_REPLACE in BigQuery Standard SQL.
Which of the following two would be more performant:
DISTINCT RTRIM("12367e","abcdefghijklmnopqrstuvwxyz")
versus
REGEXP_REPLACE("12367e", r"\D$", "")
I am not sure if there is a big performance change between these two approaches.

It doesn't look like there is much difference unless you have a significant amount of data. I tried a few queries over the bigquery-public-data.github_repos.commits table, applying these string transformations to the commit column, which has values like 0000120032a071dcd7e4bb1c8d418ca7a0028431.
The queries that I tried were:
SELECT COUNTIF(RTRIM(commit,'abcdefghijklmnopqrstuvwxyz') = '')
FROM `bigquery-public-data`.github_repos.commits;
SELECT COUNTIF(REGEXP_REPLACE(commit, r'\D$', '') = '')
FROM `bigquery-public-data`.github_repos.commits;
SELECT COUNT(*)
FROM `bigquery-public-data`.github_repos.commits
WHERE RTRIM(commit,'abcdefghijklmnopqrstuvwxyz') = '';
SELECT COUNT(*)
FROM `bigquery-public-data`.github_repos.commits
WHERE REGEXP_REPLACE(commit, r'\D$', '') = '';
These all process 7.91 GB of data (from just the string column) and take between two and three seconds to run, without any query being much faster than the rest. I intentionally filtered the data such that the results would be empty, since I didn't want to include write time.
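Performance aside, the two expressions are not interchangeable: RTRIM with a set of letters removes every trailing lowercase letter, while the pattern \D$ removes only a single trailing non-digit. The Python sketch below illustrates the difference using str.rstrip and the re module as stand-ins; it is not BigQuery code, and the function names are hypothetical.

```python
import re
import string


def rtrim_letters(value: str) -> str:
    """Strip all trailing lowercase ASCII letters (like RTRIM with a set)."""
    return value.rstrip(string.ascii_lowercase)


def drop_last_non_digit(value: str) -> str:
    """Remove one trailing non-digit character, if present (like \\D$)."""
    return re.sub(r"\D$", "", value)
```

Both return 12367 for the input 12367e, but for 123ab the first returns 123 and the second returns 123a.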

How to substring records with variable length

I have a table which has a column with doc locations, such as AA/BB/CC/EE
I am trying to get only one of these parts, lets say just the CC part (which has variable length). Until now I've tried as follows:
SELECT RIGHT(doclocation,CHARINDEX('/',REVERSE(doclocation),0)-1)
FROM Table
WHERE doclocation LIKE '%CC %'
But I'm not getting the expected result

Use PARSENAME function like this,
DECLARE @s VARCHAR(100) = 'AA/BB/CC/EE'
SELECT PARSENAME(replace(@s, '/', '.'), 2)

This is painful to do in SQL Server. One method is a series of string operations. I find this simplest using outer apply (unless I need subqueries for a different reason):
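select t.*, t4.cc
from t outer apply
-- drop everything up to and including the first '/'
(select stuff(t.doclocation, 1, charindex('/', t.doclocation), '') as rest1) t2 outer apply
-- drop everything up to and including the second '/'
(select stuff(t2.rest1, 1, charindex('/', t2.rest1), '') as rest2) t3 outer apply
-- keep what precedes the next '/', or the whole remainder if there is none
(select left(t3.rest2, charindex('/', t3.rest2 + '/') - 1) as cc) t4;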

The PARSENAME function is meant to return a specified part of an object name and should not be used for this purpose, as it only parses strings with at most four parts (see the SQL Server PARSENAME documentation on MSDN).
SQL Server 2016 has a new function, STRING_SPLIT, but if you are not on SQL Server 2016 you have to fall back on the solutions described here: How do I split a string so I can access item x?
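To make the four-part limitation concrete, here is a small Python sketch that contrasts a general "n-th part from the left" lookup with a PARSENAME-like lookup that counts from the right and gives up beyond four parts. Both helpers are hypothetical illustrations, not SQL Server APIs.

```python
from typing import Optional


def part_from_left(path: str, index: int, sep: str = "/") -> Optional[str]:
    """Return the index-th part (1-based, from the left), or None."""
    parts = path.split(sep)
    return parts[index - 1] if 1 <= index <= len(parts) else None


def parsename_like(path: str, piece: int, sep: str = "/") -> Optional[str]:
    """Count parts from the right, as PARSENAME does, and return None
    when there are more than four parts or the piece is missing."""
    parts = path.split(sep)
    if len(parts) > 4 or not 1 <= piece <= len(parts):
        return None
    return parts[-piece]
```

For AA/BB/CC/EE, part_from_left(path, 3) and parsename_like(path, 2) both return CC, but adding a fifth segment makes the PARSENAME-like lookup return None.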

The question is not entirely clear. Can you specify which value you need? If you need the value after CC, you can run CHARINDEX on 'CC'. The query also looks incorrect: the string you provided, 'AA/BB/CC/EE', contains no spaces, yet the query filters on WHERE doclocation LIKE '%CC %', with a space after CC.
SELECT SUBSTRING(doclocation,CHARINDEX('CC',doclocation)+2,LEN(doclocation))
FROM Table
WHERE doclocation LIKE '%CC%'

SQL query - LEFT 1 = char, RIGHT 3-5 = numbers in Name

I need to filter out junk data in SQL (SQL Server 2008) table. I need to identify these records, and pull them out.
Char[0] = A..Z, a..z
Char[1] = 0..9
Char[2] = 0..9
Char[3] = 0..9
Char[4] = 0..9
{No blanks allowed}
Basically, a clean record will look like this:
T1234, U2468, K123, P50054 (4 record examples)
Junk data looks like this:
T12.., .T12, MARK, TP1, SP2, BFGL, BFPL (7 record examples)
Can someone please assist with a SQL query to do a LEFT and RIGHT method and extract those characters, and do a LIKE IN or something?
A function would be great though!

The following should work in a few different systems:
SELECT *
FROM TheTable
WHERE Data LIKE '[A-Za-z][0-9][0-9][0-9][0-9]%'
AND Data NOT LIKE '% %'
This approach will indeed match P2343, P23423JUNK, and other similar text but requires that the format is A0000*.
Now, if the OP implies a format of 1st position is a character and all succeeding positions are numeric, as in A0+, then use the following (in SQL Server and a good deal of other database systems):
SELECT *
FROM TheTable
WHERE SUBSTRING(Data, 1, 1) LIKE '[A-Za-z]'
AND SUBSTRING(Data, 2, LEN(Data) - 1) NOT LIKE '%[^0-9]%'
AND LEN(Data) >= 5
To incorporate this into a SQL Server 2008 function, since this appears to be what you'd like most, you can write:
CREATE FUNCTION ufn_IsProperFormat(@Data VARCHAR(50))
RETURNS BIT
AS
BEGIN
RETURN
CASE
WHEN SUBSTRING(@Data, 1, 1) LIKE '[A-Za-z]'
AND SUBSTRING(@Data, 2, LEN(@Data) - 1) NOT LIKE '%[^0-9]%'
AND LEN(@Data) >= 5 THEN 1
ELSE 0
END
END
...and call into it like so:
SELECT *
FROM TheTable
WHERE dbo.ufn_IsProperFormat(Data) = 1
...this query needs to change for Oracle queries because Oracle doesn't appear to support bracket notation in LIKE clauses:
SELECT *
FROM TheTable
WHERE REGEXP_LIKE(Data, '^[A-Za-z]\d{4,}$')
This is the expansion gbn is doing in his answer, but these versions allow for varying string lengths without the OR conditions.
EDIT: Updated to support examples in SQL Server and Oracle for ensuring the format A0+, so that A1324, A2342388, and P2342 match but A2342JUNK and A234 do not.
The Oracle REGEXP_LIKE code was borrowed from Mark's post but updated to support 4 or more numeric digits.
Added a custom SQL Server 2008 approach which implements these techniques.
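The A0+ rule with at least four digits is compact enough to test against the examples from the question. The Python sketch below is only an illustration of that rule (one letter followed by four or more digits); is_proper_format is a hypothetical name that loosely mirrors the SQL Server function above.

```python
import re

# One ASCII letter followed by four or more ASCII digits, nothing else.
PROPER_FORMAT = re.compile(r"[A-Za-z][0-9]{4,}")


def is_proper_format(data: str) -> bool:
    """True when the whole string is a letter followed by 4+ digits."""
    return PROPER_FORMAT.fullmatch(data) is not None
```

Under this rule T1234, U2468 and P50054 pass, while K123 fails because it has only three digits; if three-digit values should count as clean, the quantifier would need to be {3,}.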

Depends on your database. Many have regex functions (note examples not tested so check)
e.g. Oracle
SELECT x
FROM table
WHERE REGEXP_LIKE(x, '^[A-Za-z][[:digit:]]{4}$')
Sybase uses LIKE

Given that you're allowing between 3 and 6 digits for the number in your examples then it's probably better to use the ISNUMERIC() function on the 2nd character onwards:
SELECT *
FROM TheTable
-- start with a letter
WHERE Data LIKE '[A-Za-z]%'
-- everything from 2nd character onwards is a number
AND ISNUMERIC( SUBSTRING( Data, 2, 50 ) ) = 1
-- number doesn't have a decimal place
AND Data NOT LIKE '%.%'
For more information look at the ISNUMERIC function on MSDN.
Also note that:
I've limited the 2nd part with the number to 50 characters maximum, change this to suit your needs.
Strictly speaking you should check for currency symbols etc, as ISNUMERIC allows them, as well as +/- and some others
A better option might be to create a function that checks that each character after the first is between 0 and 9 (or between 48 and 57 if you're comparing ASCII codes).
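A per-character check along those lines might look like the following Python sketch. The 3-to-6 digit range is taken from the remark at the top of this answer, and the function name and parameters are hypothetical.

```python
def digits_only_after_first(data: str, min_digits: int = 3, max_digits: int = 6) -> bool:
    """True when data is one ASCII letter followed by min..max ASCII digits."""
    if not 1 + min_digits <= len(data) <= 1 + max_digits:
        return False
    first, rest = data[0], data[1:]
    is_letter = "A" <= first <= "Z" or "a" <= first <= "z"
    # ASCII codes 48..57 are the characters '0'..'9'.
    return is_letter and all(48 <= ord(ch) <= 57 for ch in rest)
```

Unlike an ISNUMERIC-based test, this rejects values such as A+123 or A$123, because signs and currency symbols are not digits.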

You can't use Regular Expressions in SQL Server, so you have to use OR. Correcting David Andres' answer...
WHERE
(
Data LIKE '[A-Za-z][0-9][0-9][0-9]'
OR
Data LIKE '[A-Za-z][0-9][0-9][0-9][0-9]'
OR
Data LIKE '[A-Za-z][0-9][0-9][0-9][0-9][0-9]'
)
David's answer allows "D1234junk" through
You also only need "[A-Z]" if you don't have case sensitivity
