Similar to with regex in Postgresql - sql

In Postgresql database I have a column called names where I have some names which need to be parsed using regex to clean up punctuation parts. I am able to get a clean name using regexp_replace as follows:
select regexp_replace(name,'\.COM|''[A-Z]|[^a-zA-Z0-9 -]+|\s(?=&)|(?<!\w\w)(?:\s+|-)(?!\w\w)','','g')
from tableA
However, I would like to compare with some strings that are also cleaned of punctuation. How can I use similar to with the formed regular expression?
select name
from tableA
where (lower(name) ~ '\.COM|''[A-Za-z]|[^a-zA-Z0-9 -]+|\s(?=&)|(?<!\w\w)(?:\s+|-)(?!\w\w)') as nameParsed similar to '(fg )%' and
(lower(name) ~ '\.COM|''[A-Za-z]|[^a-zA-Z0-9 -]+|\s(?=&)|(?<!\w\w)(?:\s+|-)(?!\w\w)') as nameParsed similar to '%( cargo| carrier| cartage )%'
With the previous query I am getting this error:
LINE 3: ...-zA-Z0-9 -]+|\s(?=&)|(?<!\w\w)(?:\s+|-)(?!\w\w)') as namePar...
I have tried in where clause like this and it seems to be working:
select name
from tableA
where (select lower(regexp_replace(name,'\.COM|''[A-Z]|[^a-zA-Z0-9 -]+|\s(?=&)|(?<!\w\w)(?:\s+|-)(?!\w\w)','','g'))) similar to '(fg )%'
Is this the best approach? The execution time went to 46 seconds :(
Thanks in advance

You're trying to get a column name in a WHERE clause (is a comparison, not a column). So, you can use as follows:
SELECT name
FROM "tableA"
WHERE (regexp_replace(name,'\.COM|''[A-Z]|[^a-zA-Z0-9 -]+|\s(?=&)|(?<!\w\w)(?:\s+|-)(?!\w\w)','','g') similar to '(fg )%'
OR regexp_replace(name,'\.COM|''[A-Z]|[^a-zA-Z0-9 -]+|\s(?=&)|(?<!\w\w)(?:\s+|-)(?!\w\w)','','g') similar to '%( cargo| carrier| cartage )%');
Alternatively, you can use ilike instead of similar to if you want to find a specific word.

Related

How run Select Query with LIKE on thousands of rows

Newbie here. Been searching for hours now but I can seem to find the correct answer or properly phrase my search.
I have thousands of rows (orderids) that I want to put on an IN function, I have to run a LIKE at the same time on these values since the columns contains json and there's no dedicated table that only has the order_id value. I am running the query in BigQuery.
Sample Input:
ORD12345
ORD54376
Table I'm trying to Query: transactions_table
Query:
SELECT order_id, transaction_uuid,client_name
FROM transactions_table
WHERE JSON_VALUE(transactions_table,'$.ordernum') LIKE IN ('%ORD12345%','%ORD54376%')
Just doesn't work especially if I have thousands of rows.
Also, how do I add the order id that I am querying so that it appears under an order_id column in the query result?
Desired Output:
Option one
WITH transf as (Select order_id, transaction_uuid,client_name , JSON_VALUE(transactions_table,'$.ordernum') as o_num from transactions_table)
Select * from transf where o_num like '%ORD12345%' or o_num like '%ORD54376%'
Option two
split o_num by "-" as separator , create table of orders like (select 'ORD12345' as num
Union
Select 'ORD54376' aa num) and inner join it with transf.o_num
One method uses OR:
WHERE JSON_VALUE(transactions_table, '$.ordernum') LIKE IN '%ORD12345%' OR
JSON_VALUE(transactions_table, '$.ordernum') LIKE '%ORD54376%'
An alternative method uses regular expressions:
WHERE REGEXP_CONTAINS(JSON_VALUE(transactions_table, '$.ordernum'), 'ORD12345|ORD54376')
According to the documentation, here, the LIKE operator works as described:
Checks if the STRING in the first operand X matches a pattern
specified by the second operand Y. Expressions can contain these
characters:
A percent sign "%" matches any number of characters or
bytes.
An underscore "_" matches a single character or byte.
You can escape "\", "_", or "%" using two backslashes. For example, "\%". If
you are using raw strings, only a single backslash is required. For
example, r"\%".
Thus , the syntax would be like the following:
SELECT
order_id,
transaction_uuid,
client_name
FROM
transactions_table
WHERE
JSON_VALUE(transactions_table,
'$.ordernum') LIKE '%ORD12345%'
OR JSON_VALUE(transactions_table,
'$.ordernum') LIKE '%ORD54376%
Notice that we specify two conditions connected with the OR logical operator.
As a bonus information, when querying large datasets it is a good pratice to select only the columns you desire in your out output ( either in a Temp Table or final view) instead of using *, because BigQuery is columnar, one of the reasons it is faster.
As an alternative for using LIKE, you can use REGEXP_CONTAINS, according to the documentation:
Returns TRUE if value is a partial match for the regular expression, regex.
Using the following syntax:
REGEXP_CONTAINS(value, regex)
However, it will also work if instead of a regex expression you use a STRING between single/double quotes. In addition, you can use the pipe operator (|) to allow the searched components to be logically ordered, when you have more than expression to search, as follows:
where regexp_contains(email,"gary|test")
I hope if helps.

SQL LIKE - Using square bracket (character range) matching to match an entire word

In REGEX you can do something like [a-c]+, which will match on
aaabbbccc
abcccaabc
cbccaa
b
aaaaaaaaa
In SQL LIKE it seems that one can either do the equivalent of ".*" which is "%", or [a-c]. Is it possible to use the +(at least one) quantifier in SQL to do [a-c]+?
EDIT: Just to clarify, the desired end-query would look something like
SELECT * FROM table WHERE column LIKE '[a-c]+'
which would then match on the list above, but would NOT match on e.g "xxxxxaxxxx"
As a general rule, SQL Server's LIKE patterns are much weaker than regular expressions. For your particular example, you can do:
where col not like '%[^a-c]%'
That is, the column contains no characters that are not a, b, or c.
You can use regex in SQL with combination of LIKE e.g :
SELECT * FROM Table WHERE Field LIKE '%[^a-z0-9 .]%'
This works in SQL
Or in your case
SELECT * FROM Table WHERE Field LIKE '%[^a-c]%'
I seems you want some data from database, That is you don't know exactly, You must show your column and the all character that you want in that filed.

SELECT middle part of a String if it exists. Postgresql

i've got a problem with transferring "real-World" data into my schema.
It's actually a "project" for my Database course and they gave us ab table with dog race results. This Table has a column which contains the name of the Dog (which itself consists of the actuall name and the name of the breeder) and informations about the Birthcountry, actual living Country and the birth year.
Example filed are "Lillycette [AU 2012]" or "Black Bear Lee [AU/AU 2013]" or "Lemon Ralph [IE/UK 1998]".
I've managed it to get out the first word and save it in the right column with split_part like this:
INSERT INTO tblHund (rufname)
SELECT
split_part(name, ' ', 1) AS rufname,
FROM tblimport;
tblimport is a table where I dumped the data from the csv file.
That works just as it should.
Accessing the second part of the Name with this fails because sometimes there isn't a second part and sometimes times there second part consists of two words.
And this is the where iam stuck right now.
I tried it with substring and regular expressions:
INSERT INTO tblZwinger (Name)
SELECT
substring(vatertier from E'[^ ]*\\ ( +)$')AS Name
FROM tblimport
WHERE substring(vatertier from E'[^ ]*\\ ( +)$') != '';
The above code is executed without errors but actually does nothing because the SELECT statement just give empty strings back.
It took me more then 3h to understand a bit of this regular Expressions but I still feel pretty stupid when I look at them.
Is there any other way of doing this. If so just give me a hint.
If not what is wrong with my expression above?
Thanks for your help.
You need to use atom ., which matches any single character inside capturing group:
E'[^ ]*\\ (.+)$'
SELECT
tblimport.*,
ti.parts[1] as f1,
ti.parts[2] as f2, -- It should be the "middle part"
ti.parts[3] as f3
FROM
tblimport,
regexp_matches(tblimport.vatertier, '([^\s]+)\s*(.*)\s+\[(.*)\]') as ti(parts)
WHERE
nullif(ti.parts[2], '') is not null
Something like above.

Select query that displays Joined words separately, not using a function

I require a select query that adds a space to the data based on the placement of the capital letters i.e. 'HelpMe' using this query would be displayed as 'Help Me' . Note i cannot use a stored function to do this the it must be done in the query itself. The Data is of variable length and query must be in SQL. Any Help will be appreciated.
Thanks
You need to use user defined function for this until MS give us support for regular expressions. Solution would be something like:
SELECT col1, dbo.RegExReplace(col1, '([A-Z])',' \1') FROM Table
Aldo this would produce leading space that you can remove with TRIM.
Replace regular expresion function:
http://connect.microsoft.com/SQLServer/feedback/details/378520
About dbo.RegexReplace you can read at:
TSQL Replace all non a-z/A-Z characters with an empty string
Assume if you are using Oracle RDBMS, you use the following,
REGEX_REPLACE
SELECT REGEXP_REPLACE('ILikeToWatchCSIMiami',
'([A-Z.])', ' \1')
AS RX_REPLACE
FROM dual
;
Managed to get this output: * SQLFIDDLE
But as you see it doesn't treat well on words such as CSI though.

regarding like query operator

For the below data (well..there are many more nodes in the team foundation server table which i need to refer to..below is just a sample)
Nodes
------------------------
\node1\node2\node3\
\node1\node2\node5\
\node1\node2\node3\node4\
\node1\node2\node3\node4\node5\
I was wondering if i can apply something like (below query does not give the required results)
select * from table_a where nodes like '\node1\node2\%\'
to get the below data
\node1\node2\node3\
\node1\node2\node5\
and something like (below does not give the required results)
select * from table_a where nodes like '\node1\node2\%\%\'
to get
\node1\node2\node3\
\node1\node2\node5\
\node1\node2\node3\node4\
Can the above be done with like operator? Pls. suggest.
Thanks
You'll need to combine two terms, LIKE and NOT LIKE:
select * from table_a where
nodes like '\node1\node2\%\' AND
nodes NOT like '\node1\node2\%\%\'
for the first query, and a similar solution for the second. That's with "plain SQL". There are probably SQL Server specific functions which will count the number of "\" characters in the column, for instance.
maybe use the delimiter to get the resutls.
it is unclear what you are actually trying to get, but you could use the
substr
function to either count or find the position of the delimiter '/' character.
It seems like this would work (basically just eliminating the last backslash):
select * from table_a where nodes like '\node1\node2\%\%'
EDIT
You could also try this:
select * from table_a where
nodes like '\node1\node2\%\' or
nodes like '\node1\node2\%\%\'
A little late to the party, but it appears that the problem is still open. Could it be that the backslashes are escaping the wildcard meaning of the percent signs? And the backslash n could be getting interpreted as well.
Doesn't sql-server know a wildcard for a single character?
select * from table_a
where nodes LIKE '#node1#node2#node_#';
nodes
---------------------
#node1#node2#node5#
#node1#node2#node3#
I testet this on postgresql, where it is hard to insert a backslash, which is the reason why I replaced them with #.
Here is another possibility - negate more than one backslash (# used for my convenience):
SELECT * FROM table_a
WHERE (nodes LIKE '#node1#node2#%#'
AND NOT nodes LIKE '#node1#node2#%#%#');
On postgresql there is too the possibility to match against patterns, with SIMILAR TO, or ~:
SELECT * FROM table_a
WHERE nodes SIMILAR TO '#node1#node2#[^#]*#';
nodes
---------------------
#node1#node2#node5#
#node1#node2#node3#
[] encapsulates a group of alternatively allowed characters, for example [aeiou] would be a lowercase vocal. But when the caret is the first sign in the brackets, the sign(s) are negated so [^aeiou] would mean anything but a lowercase vocal, and [^#] means anything but a #.
The asterix behind that expression means that the preceding sign can occur as often as you like, 0 to million times. (+ would mean at least one times, ? would mean 0 or 1 times).
So '#node1#node2#[^#]*#' means '#node1#node2#', followed by anything but a hash, 0 or single or multiple times, and then, finally a hash.