Remove invalid data based on particular pattern SQL Server

Remove invalid data based on particular pattern SQL Server - sql

I have a sample data like shown below
------------------------------------------------
| ID | Column 1 | Column 2 |
------------------------------------------------
| 1 | 0229-10010 |Valid |
------------------------------------------------
| 2 | 20483 |InValid |
------------------------------------------------
| 3 | 319574R06-STAT |Valid |
------------------------------------------------
| 4 | ,,,,,,,,,,,,,,1,,,,,,, |InValid |
------------------------------------------------
| 5 | "PBOM-SSE, CHAMBER" |Valid |
------------------------------------------------
| 6 | ""PBOM-SSE, CHAMBER |InValid |
------------------------------------------------
| 7 | "PBOM-SSE CHAMBER", |InValid |
------------------------------------------------
| 8 | #DRM-1102.Z |InValid |
------------------------------------------------
| 9 | DRM#1102.Z |Valid |
------------------------------------------------
| 10 |OEM-2-202 4079 KALREZ |Valid |
------------------------------------------------
| 11 |-OEM2202 4079 KALREZ# |InValid |
------------------------------------------------
What i want to do is i need to create a pattern in such a way that i need to fetch only invalid data. Just for representation i have mentioned Valid and Invalid. In my table i don't have any flag as such.
Here the trick is same, wildcard characters appearing at different places makes different sense. Consider record ID-5 and Id-6. In both the cases wildcard characters are same, but the position decides whether its valid or not. Again position is also not so clear. I guess you can make out why particular record in column 1 is valid and invalid. In record 8, '#' before that item doesn't makes sense, where as # after Alphabet makes sense (in record 9).
In record 2, there are lot of blank spaces before number, that's why its invalid, but that doesn't mean that space itself is wild card.
I have written query like below.
SELECT [PartNumber]
FROM [IBSSSystems].[dbo].[Part]
WHERE (PartNumber LIKE '%[?;.,$^#&*{}:"<>/|\ %'']%'
OR PartNumber LIKE '%[%'
OR PartNumber LIKE '%]%')
The above query understands that whenever it see any wildcard character in a record , it fetches that. But I need the query in such a way that it understands and fetches only invalid data. I guess there will be lot of And and Or in the resulting query, but i'm confused. I hope you can help me out. Thanks in advance.

SELECT [PartNumber]
FROM [IBSSSystems].[dbo].[Part]
WHERE (PartNumber LIKE '[^A-Za-z0-9"]%' ESCAPE '\' -- When the First character is special charater its InValid ( " is an exception)
OR PartNumber LIKE '%[^A-Za-z0-9" ]' ESCAPE '\' -- When the Last character is special charater its InValid ( " is an exception, also trailing spaces are exception)
OR PartNumber LIKE '%[^A-Za-z0-9 ][^A-Za-z0-9 ]%' -- When there are two or more consecutive special charaters its InValid
OR PartNumber LIKE '%[\^\[\]\\_?;$#&*{}:<>/|''~`]%' ESCAPE '\' -- Add characters here which do not allowed to have any occurrence in the string
)

Related

Filtering records not containing numbers

I have a table that has numbers in string format. Ideally the table should contain 10 digit number in string format, but it has many junk values. I wanted to filter out the records that are not ideal in nature.
Below is the sample table that I have:
+---------------+--------+----------------------------------+
| ID_UID | Length | ##Comment |
+---------------+--------+----------------------------------+
| +112323456705 | 13 | Contains special character |
| 4323456432 | 11 | Contains blank |
| 3423122334 | 10 | As expected, 10 character number |
| 6758439239 | 10 | As expected, 10 character number |
| 58_4323129 | 10 | Contains special character |
| 4567$%6790 | 10 | Contains special character |
| 45684938901 | 11 | Is 11 characters |
| 4568 38901 | 10 | Contains blank |
+---------------+--------+----------------------------------+
Expected Output:
+---------------+--------+----------------------------+
| ID_UID | Length | ##Comment |
+---------------+--------+----------------------------+
| +112323456705 | 13 | Contains special character |
| 4323456432 | 11 | Contains blank |
| 58_4323129 | 10 | Contains special character |
| 4567$%6790 | 10 | Contains special character |
| 45684938901 | 11 | Is 11 characters |
| 4568 38901 | 10 | Contains blank |
+---------------+--------+----------------------------+
Basically I want all the records that dont have 10 digit numbers in them.
I have tried out below query:
SELECT *
FROM t1
WHERE ID_UID LIKE '%[^0-9]%'
But this does not returns any records.
Have created a fiddle for the same.
P.S. The columns length and ##Comment are illustrative in nature.

You want RLIKE not LIKE:
SELECT *
FROM t1
WHERE ID_UID RLIKE '[^0-9]'
Note that % is a LIKE wildcard, not a regular expression wildcard. Also, regular expressions match the pattern anywhere it occurs, so no wildcards are needed for the beginning and end of the string.
If you want to find values that are not ten digits, then be explicit:
SELECT *
FROM t1
WHERE ID_UID NOT RLIKE '^[0-9]{10}$'

Generate rows from input array

Let's assume I have a table with many records called comments, and each record includes only a text body:
CREATE TABLE comments(id INT NOT NULL, body TEXT NOT NULL, PRIMARY KEY(id));
INSERT INTO comments VALUES (generate_series(1,100), md5(random()::text));
Now, I have an input array with N substrings, with arbitrary length. For example:
abc
xyzw
123456
not_found
For each input value, I want to return all rows that match a certain condition.
For example, given that the table includes the following records:
| id | body |
| -- | ----------- |
| 11 | abcd1234567 |
| 22 | unkown12 |
| 33 | abxyzw |
| 44 | 12345abc |
| 55 | found |
I need a query that returns the following result:
| substring | comments.id | comments.body |
| --------- | ----------- | ------------- |
| abc | 11 | abcd1234567 |
| abc | 44 | 12345abc |
| xyzw | 33 | abxyzw |
| 123456 | 11 | abcd1234567 |
So far, I have this SQL query:
SELECT substrings, comments.id, comments.body
FROM unnest(ARRAY[
'abc',
'xyzw',
'123456',
'not_found'
]) AS substrings
JOIN comments ON comments.id IN (
SELECT id
FROM comments as inner_comments
WHERE inner_comments.body LIKE ('%' || substrings || '%')
);
But the database client gets stuck for more than 10 minutes. And I missing something about joins?
Please note that this is a simplified example of my problem. My current check on the comment is not a LIKE statement, but a complex switch-case statement of different functions (fuzzy matching).

The detour with the IN is unnecessary and unless the optimizer can rewrite this and it likely cannot, adds overhead. Try if it gets better without.
SELECT un.substring,
comments.id,
comments.body
FROM unnest(ARRAY['abc',
'xyzw',
'123456',
'not_found']) un (substring)
INNER JOIN comments
ON comments.body LIKE ('%' || un.substring || '%');
But still indexes cannot be used here because of the wildcard at the beginning. You might want to look at Full Text Search and see what options you have with it to improve the situation.

Basically you are performing FULLTEXT search in a column that most likely doesn't have a FULLTEXT index.
A first step you could try would be to have your column "body" FULLTEXT indexed. See details here and then perform the search using CONTAINS but, quite honestly, since you want to perform fuzzy matching you cannot rely on SQL server to perform the search - it would just not work properly. You will need an indexing service such as ElasticSearch, CloudSearch, Azure Search, etc

Match a set of characters from one table into the records of an other table

I have two tables (T-SQL):
tblInvalidCharactersList tblMonthsRecords
+-----------+-----------+ +--------+-------------+
| CodePoint | Character | | RecRef | Name |
+-----------+-----------+ +--------+-------------+
| 38 | & | | 21 | Firs> name |
+-----------+-----------+ +--------+-------------+
| 64 | # | | 89 | #Second name|
+-----------+-----------+ +--------+-------------+
| 62 | > | | 321 | Third n«me |
+-----------+-----------+ +--------+-------------+
| 171 | « | | 381 | Fourth name |
+-----------+-----------+ +--------+-------------+
I want to find those records of the tblMonthsRecords which have at least one (or more) character(s) from the Character column of the tblInvalidCharactersList table.
I tried:
SELECT
[RecRef],
[Name]
FROM [tblMonthsRecords]
WHERE [Name] IN (SELECT Character FROM [tblInvalidCharactersList])
and it returns no results at all.
I even tried the NOT IN clause and as you may guess, returns all records.
The reason why I am not hardcoding the characters list within a LIKE clause is because I want the list to be dynamically updated.
You can think the tblInvalidCharactersList as a characters "black list".

I would use exists:
select mr.*
from tblMonthsRecords mr
where exists (select 1
from tblInvalidCharactersList icl
where charindex(icl.Character, mr.name) > 0
);
You don't seem to care about the actual invalid character.

IN will look for exact character match in Name column it will not search for the character in Name column
Use LIKE operator
select Distinct a.*
from tblMonthsRecords a
join tblInvalidCharactersList b
on a.Name like '%' + b.Character + '%'
Another way using charindex
charindex(b.Character,a.Name) > 0

Oracle SQL regex extraction

I have data as follows in a column
+----------------------+
| my_column |
+----------------------+
| test_PC_xyz_blah |
| test_PC_pqrs_bloh |
| test_Mobile_pqrs_bleh|
+----------------------+
How can I extract the following as columns?
+----------+-------+
| Platform | Value |
+----------+-------+
| PC | xyz |
| PC | pqrs |
| Mobile | pqrs |
+----------+-------+
I tried using REGEXP_SUBSTR
Default first pattern occurrence for platform:
select regexp_substr(my_column, 'test_(.*)_(.*)_(.*)') as platform from table
Getting second pattern occurrence for value:
select regexp_substr(my_column, 'test_(.*)_(.*)_(.*)', 1, 2) as value from table
This isn't working, however. Where am I going wrong?

For Non-empty tokens
select regexp_substr(my_column,'[^_]+',1,2) as platform
,regexp_substr(my_column,'[^_]+',1,3) as value
from my_table
;
For possibly empty tokens
select regexp_substr(my_column,'^.*?_(.*)?_.*?_.*$',1,1,'',1) as platform
,regexp_substr(my_column,'^.*?_.*?_(.*)?_.*$',1,1,'',1) as value
from my_table
;
+----------+-------+
| PLATFORM | VALUE |
+----------+-------+
| PC | xyz |
+----------+-------+
| PC | pqrs |
+----------+-------+
| Mobile | pqrs |
+----------+-------+

(.*) is greedy by nature, it will match all character including _ character as well, so test_(.*) will match whole of your string. Hence further groups in pattern _(.*)_(.*) have nothing to match, whole regex fails. The trick is to match all characters excluding _. This can be done by defining a group ([^_]+). This group defines a negative character set and it will match to any character except for _ . If you have better pattern, you can use them like [A-Za-z] or [:alphanum]. Once you slice your string to multiple sub strings separated by _, then just select 2nd and 3rd group.
ex:
SELECT REGEXP_SUBSTR( my_column,'(([^_]+))',1,2) as platform, REGEXP_SUBSTR( my_column,'(([^_]+))',1,3) as value from table;
Note: AFAIK there is no straight forward method to Oracle to exact matching groups. You can use regexp_replace for this purpose, but it unlike capabilities of other programming language where you can exact just group 2 and group 3. See this link for example.

How to get records from a table where some field's value is in camel-case

I have a table like this,
+----+-----------+
| Id | Value |
+----+-----------+
| 1 | ABC_DEF |
| 31 | AcdEmc |
| 44 | AbcDef |
| 2 | BAA_CC_CD |
| 55 | C_D_EE |
+----+-----------+
I need a query to get the records which Value is only in camelcase (ex: AcdEmc, AbcDef etc. not ABC_DEF).
Please note that this table has only these two types of string values.

You can use UPPER() for this
select * from your_table
where upper(value) <> value COLLATE Latin1_General_CS_AS
If your default collation is case-insensitive you can force a case-sensitive collation in your where clause. Otherwise you can remove that part from your query.

Based on the sample data, the following will work. I think the issue we're dealing with is checking whether the string contains underscores.
SELECT * FROM [Foo]
WHERE Value NOT LIKE '%[_]%';
See Fiddle
UPDATE: Corrected error. I forgot '_' meant "any character".

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

Remove invalid data based on particular pattern SQL Server - sql

Related

Filtering records not containing numbers

Generate rows from input array

Match a set of characters from one table into the records of an other table

Oracle SQL regex extraction

How to get records from a table where some field's value is in camel-case

Categories

Resources