Count occurrences of a pattern in SQL Server column

I have a varchar column in SQL Server 2012 with 3-letter patterns that are concatenated, like this value:
DECLARE @str VARCHAR(MAX) = 'POKPOKPOKHRSPOKPOKPOKPOKPOKPOIHEFHEFPOKPOHHRTHRT'
I need a query to search and count the occurrences of the pattern POK in that string. The trick is, all POK that are together must be counted as one. So, in the string above there are 3 "chains" of POK:
POKPOKPOK, interrupted by a HRS
POKPOKPOKPOKPOK, interrupted by a POI
POK, interrupted by a POH
So, my desired result is 3. With the following query I get 9, the total number of POKs in the string, which is not what I need.
SELECT (LEN(@str) - LEN(REPLACE(@str, 'POK', '')))/LEN('POK')
I think I need some sort of regexp to isolate the POKs and then count, but couldn't find a way to apply that in SQL Server. Any help much appreciated.

This is really not something that you want to do in SQL. But you can. Here is one method to reduce the adjacent 'POK's to a single POK:
select replace(replace(@str, 'POK', '<POK>'), 'POK><', '')
Well, this actually creates a '<POK>', but that is fine for our purposes.
Now, you can search in that:
select (len(replace(replace(@str, 'POK', '<POK>'), 'POK><', '')) -
len(replace(replace(replace(@str, 'POK', '<POK>'), 'POK><', ''), 'POK', ''))
) / 3
Here is a SQL Fiddle.
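
As a rough sketch of what the two steps do, the script below stores the collapsed string in a variable so the intermediate value can be inspected. The variable name @collapsed is made up for illustration, and the approach assumes the characters '<' and '>' never occur in the data.

```sql
DECLARE @str VARCHAR(MAX) = 'POKPOKPOKHRSPOKPOKPOKPOKPOKPOIHEFHEFPOKPOHHRTHRT';

-- Step 1: wrap every POK in angle brackets, so adjacent ones become '<POK><POK>'.
-- Step 2: remove every 'POK><', which collapses each chain into a single '<POK>'.
-- Assumption: '<' and '>' do not appear in the original data.
DECLARE @collapsed VARCHAR(MAX) =
    REPLACE(REPLACE(@str, 'POK', '<POK>'), 'POK><', '');

-- Step 3: count the remaining POKs with the usual length-difference trick.
SELECT @collapsed AS collapsed,  -- '<POK>HRS<POK>POIHEFHEF<POK>POHHRTHRT'
       (LEN(@collapsed) - LEN(REPLACE(@collapsed, 'POK', ''))) / LEN('POK') AS chains;  -- 3
```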

Related

How to substring records with variable length

I have a table which has a column with doc locations, such as AA/BB/CC/EE
I am trying to get only one of these parts, let's say just the CC part (which has variable length). So far I've tried the following:
SELECT RIGHT(doclocation,CHARINDEX('/',REVERSE(doclocation),0)-1)
FROM Table
WHERE doclocation LIKE '%CC %'
But I'm not getting the expected result.

Use the PARSENAME function like this:
DECLARE @s VARCHAR(100) = 'AA/BB/CC/EE'
SELECT PARSENAME(replace(@s, '/', '.'), 2)

This is painful to do in SQL Server. One method is a series of string operations. I find this simplest using outer apply (unless I need subqueries for a different reason):
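select t.*, t4.cc
from t outer apply
-- strip the first segment: 'AA/BB/CC/EE' -> 'BB/CC/EE'
(select stuff(t.doclocation, 1, charindex('/', t.doclocation), '') as rest1) t2 outer apply
-- strip the second segment: 'BB/CC/EE' -> 'CC/EE'
(select stuff(t2.rest1, 1, charindex('/', t2.rest1), '') as rest2) t3 outer apply
-- keep everything before the next slash: 'CC'
(select left(t3.rest2, charindex('/', t3.rest2 + '/') - 1) as cc) t4;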

The PARSENAME function is meant to return a specified part of an object name and should not be used for this purpose, as it only parses strings with at most 4 parts (see the SQL Server PARSENAME documentation on MSDN).
SQL Server 2016 has a new function, STRING_SPLIT, but if you don't use SQL Server 2016 you have to fall back on the solutions described here: How do I split a string so I can access item x?

The question is not entirely clear. Can you specify which value you need? If you need the value after CC, you can use CHARINDEX on 'CC'. Also, the query does not look correct: the string you provided, 'AA/BB/CC/EE', contains no space, but the WHERE clause searches for one (LIKE '%CC %'). Without the space:
SELECT SUBSTRING(doclocation, CHARINDEX('CC', doclocation) + 2, LEN(doclocation))
FROM Table
WHERE doclocation LIKE '%CC%'

Comparing two tables and finding partial match (SQL / Oracle)

I haven't found an answer to this problem, and it seems a bit tricky (and yes, I am a beginner). I have two tables, eb_site and eb_register, and both have the column id_glo, which connects them. The values in these columns are not quite the same, though; the number is the connecting factor. An example:
eb_site = kplus.hs.dlsn.3074823
eb_register = kplus.hs.register.3074823-1
How can I list the rows where the number in eb_register deviates from the number in eb_site (disregarding the dlsn/register mismatch)?
And also the rows where eb_register has a -1 at the end, as in the example (the fixed ones don't have the -1 at the end)?
Thanks for any replies.
edit: oops sorry guys, worded it badly, have edited
Rgds,
Steinar

The quality of the solution will depend on the possible id_glo values and the SQL dialect you can use.
As a start, try
select s.id_glo
, r.id_glo
from eb_site s
inner join eb_register r on ( replace(replace(s.id_glo, 'kplus.hs.register.', ''), 'kplus.hs.dlsn.', '') <> replace(replace(r.id_glo, 'kplus.hs.register.', ''), 'kplus.hs.dlsn.', '')
and replace(replace(r.id_glo, 'kplus.hs.register.', ''), 'kplus.hs.dlsn.', '') not like replace(replace(s.id_glo, 'kplus.hs.register.', ''), 'kplus.hs.dlsn.', '') || '-%'
)
;
This query assumes that:
there are no prefixes other than the ones you've given
number suffixes (such as -1) only occur in records from eb_register

If the numbers match, then the reverses of the numbers match. The following extracts the number (and the period that precedes it) from each key, using SQL Server syntax:
select *
from eb_site s join
eb_register r
on left(REVERSE(s.id_glo), charindex('.', reverse(s.id_glo))) =
left(REVERSE(r.id_glo), charindex('.', reverse(r.id_glo)))
In other databases, charindex() might need to be replaced by another function, such as instr(), locate(), or position().
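
Since the question mentions Oracle, here is a hedged sketch of the same idea in Oracle syntax. It assumes INSTR with a negative start position (which searches backwards from the end of the string) instead of reversing the string, and it uses two inline CTEs with made-up sample rows in place of the real tables.

```sql
-- Sample rows are invented for illustration; the CTEs stand in for the real tables.
WITH eb_site AS (
    SELECT 'kplus.hs.dlsn.3074823' AS id_glo FROM dual
),
eb_register AS (
    SELECT 'kplus.hs.register.3074823' AS id_glo FROM dual
    UNION ALL
    SELECT 'kplus.hs.register.3074823-1' FROM dual
)
SELECT s.id_glo AS site_id,
       r.id_glo AS register_id
FROM eb_site s
JOIN eb_register r
  -- INSTR(..., '.', -1) finds the last period; SUBSTR takes everything after it.
  ON SUBSTR(s.id_glo, INSTR(s.id_glo, '.', -1) + 1) =
     SUBSTR(r.id_glo, INSTR(r.id_glo, '.', -1) + 1);
```

With these sample rows only the register row without the -1 suffix joins, because '3074823-1' is compared as a whole against '3074823'.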

SQL Server substring

I need a good expression to correctly select part of a field.
For example, the field can be of the type: "google_organic" or "google_campaign_HereGoesMyCode" . The part I am interested in is "organic" or "campaign" without any other addition.
So far I select with this:
substring(Referer, charIndex('_',Referer)+1, len(Referer))
But in the case of "campaign" I select the whole thing... I don't know how to manage the existence or non-existence of the second underscore...
thank you

One way is to emulate a last-index-of search with the expression below, which returns the position of the last underscore:
len(Referer) - (charindex('_', reverse(Referer)) - 1)
You can then rewrite your query as follows, using that position minus the start position as the length. The first charindex has to be repeated, so the expression gets fairly dense:
substring(Referer, charindex('_', Referer) + 1, (len(Referer) - (charindex('_', reverse(Referer)) - 1)) - (charindex('_', Referer) + 1))
Note that this only works if there are two underscores. You can choose which expression to apply with a CASE expression.
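
A minimal sketch of that CASE idea, assuming a table variable @t with invented sample values: the number of underscores is counted with the LEN/REPLACE trick, and each branch applies the expression that fits.

```sql
-- @t and its rows are hypothetical sample data.
DECLARE @t TABLE (Referer VARCHAR(100));
INSERT INTO @t (Referer)
VALUES ('google_organic'), ('google_campaign_HereGoesMyCode');

SELECT Referer,
       CASE
           -- exactly one underscore: take everything after it
           WHEN LEN(Referer) - LEN(REPLACE(Referer, '_', '')) = 1
               THEN SUBSTRING(Referer, CHARINDEX('_', Referer) + 1, LEN(Referer))
           -- two or more underscores: take the text between the first and the last one
           WHEN LEN(Referer) - LEN(REPLACE(Referer, '_', '')) >= 2
               THEN SUBSTRING(Referer,
                              CHARINDEX('_', Referer) + 1,
                              (LEN(Referer) - (CHARINDEX('_', REVERSE(Referer)) - 1))
                                  - (CHARINDEX('_', Referer) + 1))
           -- no underscore at all: the CASE falls through and yields NULL
       END AS Part
FROM @t;
```

For the two sample rows this yields 'organic' and 'campaign'.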

Parse a string before the Last Index Of a character in SQL Server

I started with this but is it the best way to perform the task?
select
reverse(
substring(reverse(some_field),
charindex('-', reverse(some_field)) + 1,
len(some_field) - charindex('-', reverse(some_field))))
from SomeTable
How does SQL Server treat the multiple calls to reverse(some_field)?
Besides a UDF that iterates through the string looking for the charindex of the '-' and stores the last index of it, is there a more efficient way to perform this task in T-SQL?
Note that what I have works, I just am really wondering if it is the best way about it.
Below are some sample values for some_field.
s2-st, s1-st, s3-st, s3-sss-zzz, s4-sss-zzzz
EDIT:
Sample output for this would be...
s1, s2, s3-sss, s3, s4-sss
The solution ErikE wrote actually returns the end of the string, that is, everything after the last hyphen. I modified his version to get everything before it instead, using a similar method with the left function. Thanks for all of your help.
select left(some_field, abs(charindex('-', reverse(some_field)) - len(some_field)))
from (select 's2-st' as some_field
union select 's1-st'
union select 's3-st'
union select 's3-sss-zzz'
union select 's4-sss-zzzz') as SomeTable

May I suggest this simplification of your expression:
select right(some_field, charindex('-', reverse(some_field)) - 1)
from SomeTable
Also, there's no harm, as far as I know, in specifying 8000 characters in length with the substring function when you want the rest of the string. As long as it's not varchar(max), it works just fine.
If this is something you have to do all the time, over and over, how about #1 splitting out the data into separate columns and storing it that way, or #2 adding a calculated column with an index on it, which will perform the calculation once at update/insert time and not again later.
Last, I don't know if SQL Server is smart enough to reverse(some_field) only once and inject it into the other instance. When I get some time I'll try to figure it out.
Update
Oops, somehow I got backwards what you wanted. Sorry about that. The new expression you showed can still be simplified a little:
select left(some_field, len(some_field) - charindex('-', reverse(some_field)))
from (
select 's2-st'
union all select 's1-st'
union all select 's3-st'
union all select 's3-sss-zzz'
union all select 's4-sss-zzzz'
union all select 's5'
) X (some_field)
The abs() in your expression was just reversing the sign. So I put + len - charindex instead of + charindex - len and all is well now. It even works for strings without dashes.
One more thing to mention: your UNION SELECTs should be UNION ALL SELECT because without the ALL, the engine has to remove duplicates just as if you'd indicated SELECT DISTINCT. Simply get in the habit of using ALL and you'll be much better off. :)
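
The computed-column suggestion (#2 above) could look roughly like the sketch below. The table, column, and index names are invented for illustration, and the column is marked PERSISTED on the assumption that the expression is deterministic, which the string functions used here should be.

```sql
-- Hypothetical demo table; names are not from the original question.
CREATE TABLE dbo.SomeTableDemo (
    some_field VARCHAR(100) NOT NULL,
    -- everything before the last hyphen, calculated at insert/update time
    before_last_dash AS
        LEFT(some_field, LEN(some_field) - CHARINDEX('-', REVERSE(some_field))) PERSISTED
);

-- Index the computed column so lookups do not recalculate the expression per row.
CREATE INDEX IX_SomeTableDemo_before_last_dash
    ON dbo.SomeTableDemo (before_last_dash);

INSERT INTO dbo.SomeTableDemo (some_field)
VALUES ('s2-st'), ('s3-sss-zzz'), ('s5');

-- The filter can now use the stored value directly.
SELECT some_field, before_last_dash
FROM dbo.SomeTableDemo
WHERE before_last_dash = 's3-sss';
```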

Not sure about #1, but I would say that you might be better off doing this in code. Is there a reason you have to do it in the database?
Are you experiencing performance problems because of some similar code, or is this purely hypothetical?

I am also not sure how SQL Server handles the multiple calls to REVERSE and CHARINDEX.
You can eliminate the last call to CHARINDEX since you want to take everything to the end of the string:
select
reverse(
substring(reverse(some_field),
charindex('-', reverse(some_field)) + 1,
len(some_field)))
from SomeTable
Although I would recommend against it, you could also replace the LEN function call with the size of the column:
select
reverse(
substring(reverse(some_field),
charindex('-', reverse(some_field)) + 1,
1024))
from SomeTable
I am curious how much of a difference either of these changes would make.

The 3 inner reverses are independent of each other. The outer reverse will reverse anything that is already reversed by the inner ones.
ErikE's approach is best as a pure T-SQL solution. You don't need LEN.

SQL query - LEFT 1 = char, RIGHT 3-5 = numbers in Name

I need to filter out junk data in a SQL Server 2008 table. I need to identify these records and pull them out.
Char[0] = A..Z, a..z
Char[1] = 0..9
Char[2] = 0..9
Char[3] = 0..9
Char[4] = 0..9
{No blanks allowed}
Basically, a clean record will look like this:
T1234, U2468, K123, P50054 (4 record examples)
Junk data looks like this:
T12.., .T12, MARK, TP1, SP2, BFGL, BFPL (7 record examples)
Can someone please assist with a SQL query to do a LEFT and RIGHT method and extract those characters, and do a LIKE IN or something?
A function would be great though!

The following should work in a few different systems:
SELECT *
FROM TheTable
WHERE Data LIKE '[A-Za-z][0-9][0-9][0-9][0-9]%'
AND Data NOT LIKE '% %'
This approach will indeed match P2343, P23423JUNK, and other similar text but requires that the format is A0000*.
Now, if the OP implies a format of 1st position is a character and all succeeding positions are numeric, as in A0+, then use the following (in SQL Server and a good deal of other database systems):
SELECT *
FROM TheTable
WHERE SUBSTRING(Data, 1, 1) LIKE '[A-Za-z]'
AND SUBSTRING(Data, 2, LEN(Data) - 1) NOT LIKE '%[^0-9]%'
AND LEN(Data) >= 5
To incorporate this into a SQL Server 2008 function, since this appears to be what you'd like most, you can write:
CREATE FUNCTION ufn_IsProperFormat(@Data VARCHAR(50))
RETURNS BIT
AS
BEGIN
RETURN
CASE
WHEN SUBSTRING(@Data, 1, 1) LIKE '[A-Za-z]'
AND SUBSTRING(@Data, 2, LEN(@Data) - 1) NOT LIKE '%[^0-9]%'
AND LEN(@Data) >= 5 THEN 1
ELSE 0
END
END
...and call into it like so:
SELECT *
FROM TheTable
WHERE dbo.ufn_IsProperFormat(Data) = 1
...this query needs to change for Oracle queries because Oracle doesn't appear to support bracket notation in LIKE clauses:
SELECT *
FROM TheTable
WHERE REGEXP_LIKE(Data, '^[A-Za-z]\d{4,}$')
This is the expansion gbn is doing in his answer, but these versions allow for varying string lengths without the OR conditions.
EDIT: Updated to support examples in SQL Server and Oracle for ensuring the format A0+, so that A1324, A2342388, and P2342 match but A2342JUNK and A234 do not.
The Oracle REGEXP_LIKE code was borrowed from Mark's post but updated to support 4 or more numeric digits.
Added a custom SQL Server 2008 approach which implements these techniques.

Depends on your database. Many have regex functions (note examples not tested so check)
e.g. Oracle
SELECT x
FROM table
WHERE REGEXP_LIKE(x, '^[A-Za-z][[:digit:]]{4}$')
Sybase uses LIKE

Given that you're allowing between 3 and 6 digits for the number in your examples then it's probably better to use the ISNUMERIC() function on the 2nd character onwards:
SELECT *
FROM TheTable
-- start with a letter
WHERE Data LIKE '[A-Za-z]%'
-- everything from 2nd character onwards is a number
AND ISNUMERIC( SUBSTRING( Data, 2, 50 ) ) = 1
-- number doesn't have a decimal place
AND Data NOT LIKE '%.%'
For more information look at the ISNUMERIC function on MSDN.
Also note that:
I've limited the 2nd part with the number to 50 characters maximum, change this to suit your needs.
Strictly speaking you should check for currency symbols etc, as ISNUMERIC allows them, as well as +/- and some others
A better option might be to create a function that checks that each character after the first is between 0 and 9 (or between 48 and 57 if you're using ASCII codes).
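
To make the ISNUMERIC caveat concrete, the sketch below compares it with a plain digits-only check side by side. The sample values and the derived table v are invented for illustration; ISNUMERIC is documented to accept signs, currency symbols, and exponent notation, so several junk values pass it while the NOT LIKE check rejects them.

```sql
SELECT v.Data,
       -- ISNUMERIC accepts things like '$123', '-123' and '1e5'
       ISNUMERIC(SUBSTRING(v.Data, 2, 50)) AS IsNumericResult,
       -- stricter: at least one character after the first, and no non-digit among them
       CASE WHEN LEN(v.Data) > 1
             AND SUBSTRING(v.Data, 2, 50) NOT LIKE '%[^0-9]%'
            THEN 1 ELSE 0 END AS DigitsOnly
FROM (VALUES ('T1234'), ('T$123'), ('T-123'), ('T1e5'), ('MARK')) AS v(Data);
```

Only 'T1234' should get DigitsOnly = 1, which is the behaviour the question asks for.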

You can't use Regular Expressions in SQL Server, so you have to use OR. Correcting David Andres' answer...
WHERE
(
Data LIKE '[A-Za-z][0-9][0-9][0-9]'
OR
Data LIKE '[A-Za-z][0-9][0-9][0-9][0-9]'
OR
Data LIKE '[A-Za-z][0-9][0-9][0-9][0-9][0-9]'
)
David's answer allows "D1234junk" through.
You also only need "[A-Z]" if you don't have case sensitivity.
