Oracle SQL regex extraction

I have data as follows in a column
+----------------------+
| my_column |
+----------------------+
| test_PC_xyz_blah |
| test_PC_pqrs_bloh |
| test_Mobile_pqrs_bleh|
+----------------------+
How can I extract the following as columns?
+----------+-------+
| Platform | Value |
+----------+-------+
| PC | xyz |
| PC | pqrs |
| Mobile | pqrs |
+----------+-------+
I tried using REGEXP_SUBSTR
Default first pattern occurrence for platform:
select regexp_substr(my_column, 'test_(.*)_(.*)_(.*)') as platform from table
Getting second pattern occurrence for value:
select regexp_substr(my_column, 'test_(.*)_(.*)_(.*)', 1, 2) as value from table
This isn't working, however. Where am I going wrong?

For Non-empty tokens
select regexp_substr(my_column,'[^_]+',1,2) as platform
,regexp_substr(my_column,'[^_]+',1,3) as value
from my_table
;
For possibly empty tokens
select regexp_substr(my_column,'^.*?_(.*)?_.*?_.*$',1,1,'',1) as platform
,regexp_substr(my_column,'^.*?_.*?_(.*)?_.*$',1,1,'',1) as value
from my_table
;
+----------+-------+
| PLATFORM | VALUE |
+----------+-------+
| PC | xyz |
+----------+-------+
| PC | pqrs |
+----------+-------+
| Mobile | pqrs |
+----------+-------+

(.*) is greedy by nature: it matches any character, including _. The engine backtracks just far enough for the rest of the pattern to match, so test_(.*)_(.*)_(.*) does match, but the match is the whole string, and REGEXP_SUBSTR returns the whole match rather than a capture group. The fourth argument is the occurrence of the entire pattern, not a group number, so asking for occurrence 2 returns NULL because the pattern matches only once. The trick is to match runs of characters excluding _. This can be done with the pattern [^_]+, a negated character set that matches any character except _. If your tokens follow a stricter pattern, you can use a narrower class such as [A-Za-z] or [[:alnum:]]. Each occurrence of the pattern is then one substring between underscores, so just select the 2nd and 3rd occurrences.
For example:
SELECT REGEXP_SUBSTR( my_column,'(([^_]+))',1,2) as platform, REGEXP_SUBSTR( my_column,'(([^_]+))',1,3) as value from table;
Note: AFAIK, before Oracle 11g there is no straightforward way to extract a matching group. You can use regexp_replace with backreferences for this purpose, but it is unlike other programming languages, where you can extract just group 2 and group 3. From 11g on, the sixth argument of REGEXP_SUBSTR selects a capture group.
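The difference between the greedy pattern and the negated character set can be checked outside the database. The following is only a sketch in Python's re module, on the assumption that these simple patterns behave the same way there as in Oracle; the function name is made up for illustration.

```python
import re

rows = ["test_PC_xyz_blah", "test_PC_pqrs_bloh", "test_Mobile_pqrs_bleh"]

# The greedy pattern still matches, but the match is the entire string.
whole = re.search(r"test_(.*)_(.*)_(.*)", rows[0]).group(0)

def platform_and_value(s):
    # Occurrences of [^_]+ are the underscore-separated tokens, in order.
    tokens = re.findall(r"[^_]+", s)
    return tokens[1], tokens[2]  # 2nd and 3rd occurrence

extracted = [platform_and_value(s) for s in rows]
print(whole)
print(extracted)
```

Taking the 2nd and 3rd occurrence here corresponds to passing 2 and 3 as the occurrence argument of REGEXP_SUBSTR.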

Related

Redshift skip the first character of split_part()

I have a table column like below:
| column_a |
| ------------------ |
| Alpha_Black_1 |
| Alpha_Black_2323 |
| Alpha_Red_100 |
| Alpha_Blue_2344 |
| Alpha_Orange_33333 |
| Alpha_White_2 |
| |
Usually, when I want to split on a symbol or character, I use split_part(text, text, integer), e.g. split_part(column_a, '_', 1).
I need to remove the numeric part of each value and keep only the text part, like Alpha_Black.
I cannot use the trim function because the numeric part varies.
How can I skip the first underscore and split at the second one?
I would suggest using REGEXP_REPLACE here:
SELECT
column_a,
REGEXP_REPLACE(column_a, '_\\d+$', '') AS column_a_out
FROM yourTable;
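The pattern removes one trailing underscore-plus-digits group and leaves everything else alone. As a rough check of that behaviour outside Redshift, here is a sketch with Python's re module (the helper name is hypothetical, and Python needs only a single backslash in a raw string):

```python
import re

def strip_numeric_suffix(value):
    # Remove one trailing "_<digits>" group, anchored at the end of the string.
    return re.sub(r"_\d+$", "", value)

samples = ["Alpha_Black_1", "Alpha_Black_2323", "Alpha_Orange_33333", ""]
cleaned = [strip_numeric_suffix(v) for v in samples]
print(cleaned)
```

Because the pattern is anchored with $, the first underscore in Alpha_Black is never touched, and an empty value stays empty.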

How can I combine PostgreSQL's ArrayField ANY option with LIKE

I'm trying to filter a queryset on the first characters of an element in an ArrayField in postgresql.
Data
--------------------
| id | registration_date | sbi_codes |
| 1 | 2007-11-13 | {9002, 1002, 85621} |
| 2 | 2010-10-11 | {1002, 9022, 9033} |
| 3 | 2019-02-02 | {9001, 8921} |
| 4 | 2012-02-02 | {120} |
I've tried the following (which obviously doesn't work), but I think it clearly indicates what I'm trying to achieve.
select count(*)
from administrations_administration
where '90' = left(any(sbi_codes),2)
or
select count(*)
from administrations_administration
where '90%' like any(sbi_codes)
So the sbi_codes can be, for example, 9002 or 9045, and I'm trying to filter all the records that contain an element that starts with 90.
expected result
____
| count | sbi_codes |
| 3 | 90 |
Thanks!
The thing on the left-hand side of LIKE is the string, in which % is just a %. The thing on the right-hand side is the pattern, in which % is a wildcard. Using ANY doesn't change these semantics; the pattern still goes on the right.
To solve this, you could create your own operator which is like LIKE, but has its arguments reversed.
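The left/right asymmetry can be seen in a small experiment. This sketch uses Python's sqlite3 rather than PostgreSQL, and assumes a hypothetical admin_codes table holding one row per array element (roughly what unnest(sbi_codes) would produce), so it illustrates the semantics and is not a drop-in query:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table: one row per array element, as unnest(sbi_codes) would produce.
conn.execute("CREATE TABLE admin_codes (admin_id INTEGER, code TEXT)")
conn.executemany(
    "INSERT INTO admin_codes VALUES (?, ?)",
    [(1, "9002"), (1, "1002"), (1, "85621"),
     (2, "1002"), (2, "9022"), (2, "9033"),
     (3, "9001"), (3, "8921"),
     (4, "120")],
)

# String on the left, pattern on the right: % acts as a wildcard.
matching = conn.execute(
    "SELECT count(DISTINCT admin_id) FROM admin_codes WHERE code LIKE '90%'"
).fetchone()[0]

# Reversed: '90%' is now the string and each code is a wildcard-free pattern.
reversed_count = conn.execute(
    "SELECT count(DISTINCT admin_id) FROM admin_codes WHERE '90%' LIKE code"
).fetchone()[0]

print(matching, reversed_count)
```

With the element on the left the three expected records are counted; with the operands swapped nothing matches, which is the situation '90%' like any(sbi_codes) ends up in.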

Generate rows from input array

Let's assume I have a table with many records called comments, and each record includes only a text body:
CREATE TABLE comments(id INT NOT NULL, body TEXT NOT NULL, PRIMARY KEY(id));
INSERT INTO comments VALUES (generate_series(1,100), md5(random()::text));
Now, I have an input array with N substrings, with arbitrary length. For example:
abc
xyzw
123456
not_found
For each input value, I want to return all rows that match a certain condition.
For example, given that the table includes the following records:
| id | body |
| -- | ----------- |
| 11 | abcd1234567 |
| 22 | unkown12 |
| 33 | abxyzw |
| 44 | 12345abc |
| 55 | found |
I need a query that returns the following result:
| substring | comments.id | comments.body |
| --------- | ----------- | ------------- |
| abc | 11 | abcd1234567 |
| abc | 44 | 12345abc |
| xyzw | 33 | abxyzw |
| 123456 | 11 | abcd1234567 |
So far, I have this SQL query:
SELECT substrings, comments.id, comments.body
FROM unnest(ARRAY[
'abc',
'xyzw',
'123456',
'not_found'
]) AS substrings
JOIN comments ON comments.id IN (
SELECT id
FROM comments as inner_comments
WHERE inner_comments.body LIKE ('%' || substrings || '%')
);
But the database client gets stuck for more than 10 minutes. Am I missing something about joins?
Please note that this is a simplified example of my problem. My current check on the comment is not a LIKE statement, but a complex switch-case statement of different functions (fuzzy matching).
The detour through IN is unnecessary and, unless the optimizer can rewrite it (which it likely cannot), adds overhead. Try whether it gets better without it.
SELECT un.substring,
comments.id,
comments.body
FROM unnest(ARRAY['abc',
'xyzw',
'123456',
'not_found']) un (substring)
INNER JOIN comments
ON comments.body LIKE ('%' || un.substring || '%');
But still indexes cannot be used here because of the wildcard at the beginning. You might want to look at Full Text Search and see what options you have with it to improve the situation.
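To confirm that the direct join produces the rows the question asks for, here is a small sketch using Python's sqlite3 as a stand-in for PostgreSQL. The VALUES list replaces unnest(ARRAY[...]) and the column name needle is made up for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE comments (id INTEGER PRIMARY KEY, body TEXT NOT NULL)")
conn.executemany(
    "INSERT INTO comments VALUES (?, ?)",
    [(11, "abcd1234567"), (22, "unkown12"), (33, "abxyzw"),
     (44, "12345abc"), (55, "found")],
)

# The VALUES list stands in for unnest(ARRAY[...]); "needle" is a hypothetical name.
rows = conn.execute(
    """
    WITH un(needle) AS (VALUES ('abc'), ('xyzw'), ('123456'), ('not_found'))
    SELECT un.needle, comments.id, comments.body
    FROM un
    INNER JOIN comments ON comments.body LIKE ('%' || un.needle || '%')
    ORDER BY un.needle, comments.id
    """
).fetchall()

for row in rows:
    print(row)
```

Each input substring is compared against each comment exactly once, so the work is N substrings times M comments, without the extra pass the IN subquery adds.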
Basically you are performing FULLTEXT search in a column that most likely doesn't have a FULLTEXT index.
A first step you could try would be to put a FULLTEXT index on your "body" column and then perform the search using CONTAINS. Quite honestly, though, since you want to perform fuzzy matching, you cannot rely on SQL Server to perform the search; it would just not work properly. You will need an indexing service such as Elasticsearch, CloudSearch, Azure Search, etc.

Remove invalid data based on particular pattern SQL Server

I have a sample data like shown below
------------------------------------------------
| ID | Column 1 | Column 2 |
------------------------------------------------
| 1 | 0229-10010 |Valid |
------------------------------------------------
| 2 | 20483 |InValid |
------------------------------------------------
| 3 | 319574R06-STAT |Valid |
------------------------------------------------
| 4 | ,,,,,,,,,,,,,,1,,,,,,, |InValid |
------------------------------------------------
| 5 | "PBOM-SSE, CHAMBER" |Valid |
------------------------------------------------
| 6 | ""PBOM-SSE, CHAMBER |InValid |
------------------------------------------------
| 7 | "PBOM-SSE CHAMBER", |InValid |
------------------------------------------------
| 8 | #DRM-1102.Z |InValid |
------------------------------------------------
| 9 | DRM#1102.Z |Valid |
------------------------------------------------
| 10 |OEM-2-202 4079 KALREZ |Valid |
------------------------------------------------
| 11 |-OEM2202 4079 KALREZ# |InValid |
------------------------------------------------
What I want is a pattern that fetches only the invalid data. The Valid and InValid labels are just for illustration; my table has no such flag.
The tricky part is that the same special character means different things in different positions. Consider records 5 and 6: both contain the same special characters, but their position decides whether the record is valid. The position rule is not clear-cut either. I guess you can make out why each value in Column 1 is valid or invalid. In record 8, the '#' before the item doesn't make sense, whereas the '#' after a letter does (record 9).
In record 2 there are a lot of blank spaces before the number, which is why it is invalid, but that doesn't mean that a space is itself a special character.
I have written the query below.
SELECT [PartNumber]
FROM [IBSSSystems].[dbo].[Part]
WHERE (PartNumber LIKE '%[?;.,$^#&*{}:"<>/|\ %'']%'
OR PartNumber LIKE '%[%'
OR PartNumber LIKE '%]%')
The above query fetches a record whenever it sees any special character in it. But I need a query that fetches only the invalid data. I guess the resulting query will contain a lot of ANDs and ORs, but I'm confused. I hope you can help me out. Thanks in advance.
SELECT [PartNumber]
FROM [IBSSSystems].[dbo].[Part]
WHERE (PartNumber LIKE '[^A-Za-z0-9"]%' ESCAPE '\' -- Invalid when the first character is a special character (" is an exception)
OR PartNumber LIKE '%[^A-Za-z0-9" ]' ESCAPE '\' -- Invalid when the last character is a special character (" and trailing spaces are exceptions)
OR PartNumber LIKE '%[^A-Za-z0-9 ][^A-Za-z0-9 ]%' -- Invalid when there are two or more consecutive special characters
OR PartNumber LIKE '%[\^\[\]\\_?;$&*{}:<>/|''~`]%' ESCAPE '\' -- Characters that must not occur anywhere in the string; # is left out so that DRM#1102.Z (record 9) stays valid
)
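The four rules can be checked against the sample rows from the question. The sketch below translates each LIKE pattern into a Python regular expression (all names are hypothetical). It assumes record 2 really has leading spaces, as the question describes, and it leaves # out of the banned-anywhere set so that DRM#1102.Z stays valid, as the question requires.

```python
import re

# Regex equivalents of the four LIKE checks (hypothetical names).
BAD_FIRST = re.compile(r'^[^A-Za-z0-9"]')           # first character is special
BAD_LAST = re.compile(r'[^A-Za-z0-9" ]$')           # last character is special
BAD_PAIR = re.compile(r'[^A-Za-z0-9 ]{2}')          # two consecutive specials
BANNED = re.compile(r"[\^\[\]\\_?;$&*{}:<>/|'~`]")  # never allowed ('#' left out)

def is_invalid(part_number):
    checks = (BAD_FIRST, BAD_LAST, BAD_PAIR, BANNED)
    return any(p.search(part_number) for p in checks)

samples = [
    "0229-10010",
    "   20483",  # leading spaces assumed, as described in the question
    "319574R06-STAT",
    ",,,,,,,,,,,,,,1,,,,,,,",
    '"PBOM-SSE, CHAMBER"',
    '""PBOM-SSE, CHAMBER',
    '"PBOM-SSE CHAMBER",',
    "#DRM-1102.Z",
    "DRM#1102.Z",
    "OEM-2-202 4079 KALREZ",
    "-OEM2202 4079 KALREZ#",
]
flags = [is_invalid(s) for s in samples]
for sample, flag in zip(samples, flags):
    print("InValid" if flag else "Valid", repr(sample))
```

Records 8 and 11 are still caught by the first-character and last-character rules, so # does not need to be banned everywhere.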

How to get records from a table where some field's value is in camel-case

I have a table like this,
+----+-----------+
| Id | Value |
+----+-----------+
| 1 | ABC_DEF |
| 31 | AcdEmc |
| 44 | AbcDef |
| 2 | BAA_CC_CD |
| 55 | C_D_EE |
+----+-----------+
I need a query to get the records whose Value is in camel case only (e.g. AcdEmc, AbcDef, not ABC_DEF).
Please note that this table has only these two types of string values.
You can use UPPER() for this
select * from your_table
where upper(value) <> value COLLATE Latin1_General_CS_AS
If your default collation is case-insensitive you can force a case-sensitive collation in your where clause. Otherwise you can remove that part from your query.
Based on the sample data, the following will work. I think the issue we're dealing with is checking whether the string contains underscores.
SELECT * FROM [Foo]
WHERE Value NOT LIKE '%[_]%';
UPDATE: Corrected error. I forgot '_' meant "any character".
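The effect of an unescaped _ is easy to reproduce. This sketch uses Python's sqlite3, which has no [_] bracket syntax in LIKE, so it swaps in an ESCAPE clause to make the underscore literal; that substitution and the helper name are assumptions for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
values = ["ABC_DEF", "AcdEmc", "AbcDef", "BAA_CC_CD", "C_D_EE"]

def not_like(pattern_sql):
    # Return the values for which "value NOT LIKE <pattern>" holds.
    return [v for v in values
            if conn.execute(f"SELECT ? NOT LIKE {pattern_sql}", (v,)).fetchone()[0]]

unescaped = not_like("'%_%'")             # _ matches any single character
escaped = not_like(r"'%\_%' ESCAPE '\'")  # _ is a literal underscore

print(unescaped)
print(escaped)
```

With the bare underscore every non-empty value matches the pattern, so nothing survives the NOT LIKE; once the underscore is literal, only the camel-case values remain.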