How to run a SELECT query with LIKE on thousands of rows - sql

Newbie here. I've been searching for hours but I can't seem to find the correct answer or phrase my search properly.
I have thousands of values (order IDs) that I want to pass to an IN clause. I also have to apply LIKE to those values, because the column contains JSON and there is no dedicated table that holds only the order_id value. I am running the query in BigQuery.
Sample Input:
ORD12345
ORD54376
Table I'm trying to Query: transactions_table
Query:
SELECT order_id, transaction_uuid,client_name
FROM transactions_table
WHERE JSON_VALUE(transactions_table,'$.ordernum') LIKE IN ('%ORD12345%','%ORD54376%')
This just doesn't work, especially when I have thousands of values.
Also, how do I add the order id that I am querying so that it appears under an order_id column in the query result?
Desired Output:

Option one
WITH transf as (Select order_id, transaction_uuid,client_name , JSON_VALUE(transactions_table,'$.ordernum') as o_num from transactions_table)
Select * from transf where o_num like '%ORD12345%' or o_num like '%ORD54376%'
Option two
Split o_num using "-" as the separator, create a table of orders such as (select 'ORD12345' as num
union all
select 'ORD54376' as num), and inner join it with transf.o_num.
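The lookup-table idea in Option two can be sketched as follows. This is only an illustration run against an in-memory SQLite database from Python, not BigQuery; the table `orders`, the sample rows, and the assumption that `o_num` has already been extracted from the JSON are all hypothetical. Joining on LIKE also returns the searched ID as its own `order_id` column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE transf (transaction_uuid TEXT, client_name TEXT, o_num TEXT);
INSERT INTO transf VALUES
  ('u1', 'acme', 'X-ORD12345-1'),
  ('u2', 'bolt', 'ORD54376'),
  ('u3', 'core', 'ORD99999');
CREATE TABLE orders (num TEXT);
""")

# The (possibly thousands of) order IDs go into a table, not into the query text.
wanted = ["ORD12345", "ORD54376"]
conn.executemany("INSERT INTO orders VALUES (?)", [(w,) for w in wanted])

# Join on a LIKE built from each ID; the matched ID becomes the order_id column.
rows = conn.execute("""
SELECT o.num AS order_id, t.transaction_uuid, t.client_name
FROM transf AS t
JOIN orders AS o ON t.o_num LIKE '%' || o.num || '%'
ORDER BY o.num
""").fetchall()
print(rows)
```

In BigQuery the same join shape should be expressible with CONCAT('%', o.num, '%'), but that variant is untested here.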

One method uses OR:
WHERE JSON_VALUE(transactions_table, '$.ordernum') LIKE '%ORD12345%' OR
JSON_VALUE(transactions_table, '$.ordernum') LIKE '%ORD54376%'
An alternative method uses regular expressions:
WHERE REGEXP_CONTAINS(JSON_VALUE(transactions_table, '$.ordernum'), 'ORD12345|ORD54376')

According to the documentation, the LIKE operator works as follows:
Checks if the STRING in the first operand X matches a pattern
specified by the second operand Y. Expressions can contain these
characters:
A percent sign "%" matches any number of characters or
bytes.
An underscore "_" matches a single character or byte.
You can escape "\", "_", or "%" using two backslashes. For example, "\\%". If
you are using raw strings, only a single backslash is required. For
example, r"\%".
Thus, the syntax would be as follows:
SELECT
order_id,
transaction_uuid,
client_name
FROM
transactions_table
WHERE
JSON_VALUE(transactions_table, '$.ordernum') LIKE '%ORD12345%'
OR JSON_VALUE(transactions_table, '$.ordernum') LIKE '%ORD54376%'
Notice that we specify two conditions connected with the OR logical operator.
As a bonus, when querying large datasets it is good practice to select only the columns you need in your output (whether in a temp table or a final view) instead of using *. BigQuery is a columnar store, so it only reads the columns you reference, which is one of the reasons it is fast.
As an alternative for using LIKE, you can use REGEXP_CONTAINS, according to the documentation:
Returns TRUE if value is a partial match for the regular expression, regex.
Using the following syntax:
REGEXP_CONTAINS(value, regex)
However, it also works if you pass a plain STRING in single or double quotes instead of a regular expression. In addition, when you have more than one expression to search for, you can use the pipe operator (|) to combine them with a logical OR, as follows:
where regexp_contains(email,"gary|test")
I hope it helps.
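With thousands of IDs, the pipe-separated pattern would normally be generated rather than typed. A minimal sketch of that step, using Python's `re` module as a stand-in (BigQuery uses the RE2 engine, so details may differ); the sample values are made up:

```python
import re

order_ids = ["ORD12345", "ORD54376"]

# Escape each ID in case it contains regex metacharacters, then join with "|".
pattern = re.compile("|".join(re.escape(oid) for oid in order_ids))

ordernums = ["web-ORD12345-a", "ORD54376", "ORD00000"]

matched = []
for value in ordernums:
    hit = pattern.search(value)  # partial match, like REGEXP_CONTAINS
    if hit:
        # Keep the ID that matched alongside the original value.
        matched.append((hit.group(0), value))

print(matched)
```

The joined string could then be passed to REGEXP_CONTAINS as a query parameter instead of being pasted into the SQL text.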

Related

BigQuery - Using regexp with LIKE operator (?)

I'd like to get productids from url and I've almost finetuned a query to do it but still there is an issue I cannot solve.
The url usually looks like this:
/xp-pen/toll-spe43-deco-pro-small-medium-spe43-tobuy-p665088831/
or
/harry-potter-es-a-tuz-serlege-2019-m19247107/
As you can see there are two types of ids:
in general, ids start with '-p'
ids of some special products start with '-m'
I created this case when statement:
CASE
WHEN MAX(hits.page.pagePath) LIKE '%-p%'
THEN MAX(REGEXP_REPLACE(REGEXP_EXTRACT(
hits.page.pagePath, '-p[0-9]+/'), '\\-|p|/', ''))
WHEN MAX(hits.page.pagePath) LIKE '%-m%'
THEN MAX(REGEXP_REPLACE(REGEXP_EXTRACT(
hits.page.pagePath, '-m[0-9]+/'), '\\-|m|/', ''))
ELSE NULL
END AS productId
It looks a little complicated at first, but I really needed both regexp_replace and regexp_extract because the '-p' or '-m' characters don't appear only before the id; they can occur multiple times in a url.
The problem with my code is that there are some special cases when the url looks like this:
/elveszett-profeciak-2019-m17855487/
As you can see, the id starts with '-m' but the url also contains '-p'. In this case the query returns an empty value.
I think it could be solved by modifying the LIKE operator in the WHEN part of the CASE statement: LIKE '%-p%' or LIKE '%-m%'.
It would be great to have a regular expression after or instead of the LIKE operator, something similar to the '-p[0-9]+/' pattern that I used in the regexp_extract function.
So what I need is for the WHEN part of the statement to check whether the '-p' or '-m' text is followed by numbers in the url.
I'm not sure whether this is possible in BQ.
So what I would need is to define in the when part of the statement that if the '-p' or '-m' text is followed by numbers in the urls
I think you want '-p' and '-m' followed by digits. If so, I think this does what you want:
select regexp_extract(url, '-[pm][0-9]+')
from (select '/xp-pen/toll-spe43-deco-pro-small-medium-spe43-tobuy-p665088831/' as url union all
select '/elveszett-profeciak-2019-m17855487/' union all
select '/harry-potter-es-a-tuz-serlege-2019-m19247107/'
) x
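To see how that pattern behaves on the three sample URLs, here is a small check written with Python's `re` module (an assumption made for illustration; BigQuery itself uses RE2). The capture group is an addition that returns only the digits, so no separate replace step is needed:

```python
import re

# "-p" or "-m" followed by one or more digits; the group keeps only the digits.
PRODUCT_ID = re.compile(r"-[pm]([0-9]+)")

urls = [
    "/xp-pen/toll-spe43-deco-pro-small-medium-spe43-tobuy-p665088831/",
    "/elveszett-profeciak-2019-m17855487/",
    "/harry-potter-es-a-tuz-serlege-2019-m19247107/",
]

def product_id(url):
    """Return the numeric product id from the URL, or None if there is none."""
    match = PRODUCT_ID.search(url)
    return match.group(1) if match else None

ids = [product_id(u) for u in urls]
print(ids)
```

Earlier '-p' and '-m' occurrences such as '-pen', '-pro' or '-medium' are skipped because they are not followed by a digit.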

Similar to with regex in Postgresql

In Postgresql database I have a column called names where I have some names which need to be parsed using regex to clean up punctuation parts. I am able to get a clean name using regexp_replace as follows:
select regexp_replace(name,'\.COM|''[A-Z]|[^a-zA-Z0-9 -]+|\s(?=&)|(?<!\w\w)(?:\s+|-)(?!\w\w)','','g')
from tableA
However, I would like to compare with some strings that are also cleaned of punctuation. How can I use similar to with the formed regular expression?
select name
from tableA
where (lower(name) ~ '\.COM|''[A-Za-z]|[^a-zA-Z0-9 -]+|\s(?=&)|(?<!\w\w)(?:\s+|-)(?!\w\w)') as nameParsed similar to '(fg )%' and
(lower(name) ~ '\.COM|''[A-Za-z]|[^a-zA-Z0-9 -]+|\s(?=&)|(?<!\w\w)(?:\s+|-)(?!\w\w)') as nameParsed similar to '%( cargo| carrier| cartage )%'
With the previous query I am getting this error:
LINE 3: ...-zA-Z0-9 -]+|\s(?=&)|(?<!\w\w)(?:\s+|-)(?!\w\w)') as namePar...
I have tried in where clause like this and it seems to be working:
select name
from tableA
where (select lower(regexp_replace(name,'\.COM|''[A-Z]|[^a-zA-Z0-9 -]+|\s(?=&)|(?<!\w\w)(?:\s+|-)(?!\w\w)','','g'))) similar to '(fg )%'
Is this the best approach? The execution time went to 46 seconds :(
Thanks in advance
You're trying to alias an expression (AS nameParsed) inside a WHERE clause, but a WHERE clause holds comparisons, not column definitions. So you can write it as follows:
SELECT name
FROM "tableA"
WHERE (regexp_replace(name,'\.COM|''[A-Z]|[^a-zA-Z0-9 -]+|\s(?=&)|(?<!\w\w)(?:\s+|-)(?!\w\w)','','g') similar to '(fg )%'
OR regexp_replace(name,'\.COM|''[A-Z]|[^a-zA-Z0-9 -]+|\s(?=&)|(?<!\w\w)(?:\s+|-)(?!\w\w)','','g') similar to '%( cargo| carrier| cartage )%');
Alternatively, you can use ilike instead of similar to if you want to find a specific word.

How to use an underscore character in a LIKE filter give me all the results from a column

I am trying to filter my SQL query with a LIKE condition that contains an underscore in my WHERE clause. However, I want only the values that contain exactly TS_AW19, not every value that includes TS, which is what my query currently returns. Could someone help me with the correct syntax for the query below?
SELECT
date,
creative_name,
SUM(revenue)*2 as spend
FROM `crate-media-group-client-data.DV360_ALL.GGLDV360BM_CREATIVE_*`
WHERE advertiser LIKE '%Topshop/Topman%' AND creative_name LIKE '%TS_AW19%' AND date = '2019-09-20'
GROUP BY 1,2
ORDER by creative_name
Note: I am using BigQuery syntax for this query.
Underscore (and percent) has a special meaning when used with LIKE: an underscore is a wildcard for any single character. To work around this, from the BigQuery documentation for LIKE:
You can escape "\", "_", or "%" using two backslashes. For example, "\\%". If you are using raw strings, only a single backslash is required. For example, r"\%".
You may try double-escaping the underscore, if you intend for it to be literal in your LIKE expression:
SELECT
date,
creative_name,
SUM(revenue)*2 AS spend
FROM `crate-media-group-client-data.DV360_ALL.GGLDV360BM_CREATIVE_*`
WHERE
advertiser LIKE '%Topshop/Topman%' AND
creative_name LIKE '%TS\\_AW19%' AND
date = '2019-09-20'
GROUP BY 1,2
ORDER BY
creative_name;
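The difference between the wildcard and the escaped underscore can be tried out locally. The sketch below uses SQLite from Python rather than BigQuery, so the escaping syntax is different: SQLite needs an explicit ESCAPE clause and a single backslash, whereas BigQuery string literals need the doubled backslash shown above. The table and rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE creatives (creative_name TEXT)")
conn.executemany(
    "INSERT INTO creatives VALUES (?)",
    [("TS_AW19_banner",), ("TSXAW19_banner",), ("TS_SS20_banner",)],
)

# Unescaped: "_" is a wildcard, so "TSXAW19_banner" matches as well.
wildcard = [r[0] for r in conn.execute(
    "SELECT creative_name FROM creatives "
    "WHERE creative_name LIKE '%TS_AW19%' ORDER BY creative_name"
)]

# Escaped: "\_" is a literal underscore (SQLite syntax, via the ESCAPE clause).
literal = [r[0] for r in conn.execute(
    r"SELECT creative_name FROM creatives "
    r"WHERE creative_name LIKE '%TS\_AW19%' ESCAPE '\' ORDER BY creative_name"
)]

print(wildcard)
print(literal)
```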

Too big number for repeat range when using regexp_like in where clause

I tried to run the following query:
select * from table where regexp_like('^{{', text_field)
And got the following error:
too big number for repeat range
Thinking perhaps regexp_like is confusing { for the repeat count operator, I also tried the following variations:
select * from table where regexp_like('^\{\{', text_field)
select * from table where regexp_like('^[{][{]', text_field)
select * from table where regexp_like('^[[:punct:]]{2}', text_field)
None of which worked. For now, text_field like '{{' suffices, but I may want to include a more flexible version of this that would require regular expressions. What's wrong with my approach here? And what does this error message mean?
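You are calling the prestodb regexp_like function with its arguments swapped. The string comes first and the pattern second, so with your call each text_field value is treated as the pattern, which is most likely where the error comes from: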
regexp_like(string, pattern)
Evaluates the regular expression pattern and determines if it is
contained within string. This function is similar to the LIKE
operator, except that the pattern only needs to be contained within
string, rather than needing to match all of string. In other words,
this performs a contains operation rather than a match operation. You
can match the entire string by anchoring the pattern using ^ and $:
SELECT regexp_like('1a 2b 14m', '\d+b'); -- true
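The "contains" versus "match" distinction in that quote can be illustrated with Python's `re` module, used here only as an analogy for the Presto behaviour: `re.search` plays the role of regexp_like, and anchoring with ^ and $ turns it into a whole-string match.

```python
import re

text = "1a 2b 14m"

# Contains: the pattern only has to occur somewhere inside the string.
contains = re.search(r"\d+b", text) is not None

# Whole-string match: anchor the pattern with ^ and $.
anchored = re.search(r"^\d+b$", text) is not None

# An anchored pattern succeeds only when the entire string fits.
anchored_ok = re.search(r"^\d+b$", "2b") is not None

print(contains, anchored, anchored_ok)
```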

regarding like query operator

For the below data (well..there are many more nodes in the team foundation server table which i need to refer to..below is just a sample)
Nodes
------------------------
\node1\node2\node3\
\node1\node2\node5\
\node1\node2\node3\node4\
\node1\node2\node3\node4\node5\
I was wondering if i can apply something like (below query does not give the required results)
select * from table_a where nodes like '\node1\node2\%\'
to get the below data
\node1\node2\node3\
\node1\node2\node5\
and something like (below does not give the required results)
select * from table_a where nodes like '\node1\node2\%\%\'
to get
\node1\node2\node3\
\node1\node2\node5\
\node1\node2\node3\node4\
Can the above be done with like operator? Pls. suggest.
Thanks
You'll need to combine two terms, LIKE and NOT LIKE:
select * from table_a where
nodes like '\node1\node2\%\' AND
nodes NOT like '\node1\node2\%\%\'
for the first query, and a similar solution for the second. That's with "plain SQL". There are probably SQL Server specific functions which will count the number of "\" characters in the column, for instance.
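A quick way to check the LIKE / NOT LIKE combination is to run it on the sample rows. The sketch below uses SQLite from Python instead of SQL Server (an assumption for the sake of a runnable example); in SQLite a backslash has no special meaning in LIKE unless an ESCAPE clause is given. The patterns are passed as parameters so that only Python's own backslash escaping is involved.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table_a (nodes TEXT)")

# "\\" in Python source is a single backslash in the stored value.
paths = [
    "\\node1\\node2\\node3\\",
    "\\node1\\node2\\node5\\",
    "\\node1\\node2\\node3\\node4\\",
    "\\node1\\node2\\node3\\node4\\node5\\",
]
conn.executemany("INSERT INTO table_a VALUES (?)", [(p,) for p in paths])

one_level = "\\node1\\node2\\%\\"      # at least one level below node2
two_levels = "\\node1\\node2\\%\\%\\"  # at least two levels below node2

rows = conn.execute(
    "SELECT nodes FROM table_a "
    "WHERE nodes LIKE ? AND nodes NOT LIKE ? ORDER BY nodes",
    (one_level, two_levels),
).fetchall()
direct_children = [r[0] for r in rows]
print(direct_children)
```

The first pattern alone also matches the deeper paths, because % can match backslashes too; the NOT LIKE term is what removes them.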
Maybe use the delimiter to get the results. It is unclear what you are actually trying to get, but you could use the substr function to either count or find the position of the delimiter '\' character.
It seems like this would work (basically just eliminating the last backslash):
select * from table_a where nodes like '\node1\node2\%\%'
EDIT
You could also try this:
select * from table_a where
nodes like '\node1\node2\%\' or
nodes like '\node1\node2\%\%\'
A little late to the party, but it appears that the problem is still open. Could it be that the backslashes are escaping the wildcard meaning of the percent signs? And the backslash n could be getting interpreted as well.
Doesn't sql-server know a wildcard for a single character?
select * from table_a
where nodes LIKE '#node1#node2#node_#';
nodes
---------------------
#node1#node2#node5#
#node1#node2#node3#
I tested this on PostgreSQL, where it is hard to insert a backslash, which is why I replaced them with #.
Here is another possibility - negate more than one backslash (# used for my convenience):
SELECT * FROM table_a
WHERE (nodes LIKE '#node1#node2#%#'
AND NOT nodes LIKE '#node1#node2#%#%#');
On PostgreSQL there is also the possibility of matching against patterns, with SIMILAR TO or ~:
SELECT * FROM table_a
WHERE nodes SIMILAR TO '#node1#node2#[^#]*#';
nodes
---------------------
#node1#node2#node5#
#node1#node2#node3#
[] encloses a set of alternative allowed characters; for example, [aeiou] matches a lowercase vowel. But when the caret is the first character inside the brackets, the set is negated, so [^aeiou] means anything but a lowercase vowel, and [^#] means anything but a #.
The asterisk after that expression means that the preceding element can occur as often as you like, zero or more times (+ would mean at least once, ? would mean zero or one time).
So '#node1#node2#[^#]*#' means '#node1#node2#', followed by anything but a hash zero or more times, and then, finally, a hash.
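The same pattern can be tried outside the database. A small sketch with Python's `re` module, using `fullmatch` because SIMILAR TO has to match the whole value; the rows are the question's sample data with # in place of the backslash:

```python
import re

nodes = [
    "#node1#node2#node3#",
    "#node1#node2#node5#",
    "#node1#node2#node3#node4#",
    "#node1#node2#node3#node4#node5#",
]

# Exactly one more segment after node2: any run of non-# characters, then a #.
pattern = re.compile(r"#node1#node2#[^#]*#")

direct_children = [n for n in nodes if pattern.fullmatch(n)]
print(direct_children)
```

The deeper paths are rejected because [^#]* cannot cross the next #, so the whole value no longer fits the pattern.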