BigQuery SPLIT() and grouping by result - google-bigquery

Using SPLIT() & NTH(), I'm splitting a string value, and taking the 2nd substring as the result. I then want to group on that result. However, when I use SPLIT() in conjunction with a GROUP BY, it keeps giving the error:
Error: (L1:55): Cannot group by an aggregate
The result is a string, so why is it not possible to group on it?
For example, this works and returns the correct string:
SELECT NTH(2,SPLIT('FIRST-SECOND','-')) as second_part FROM [FOO.bar] limit 10
But then grouping on the result does not work:
SELECT NTH(2,SPLIT('FIRST-SECOND','-')) as second_part FROM [FOO.bar] GROUP BY second_part limit 10

My best guess would be that you can get an equivalent result by using a subquery. Something like:
SELECT second_part FROM (SELECT NTH(2,SPLIT('FIRST-SECOND','-')) AS second_part FROM [FOO.bar] LIMIT 10) GROUP BY second_part
My guess is that NTH is treated as an aggregate function internally, which would explain why its result cannot be grouped on directly.

If there are always just 2 values separated by a delimiter, then a simpler approach would be to use REGEXP_EXTRACT:
SELECT REGEXP_EXTRACT('FIRST-SECOND','-(.*)') as second_part
from [FOO.bar]
GROUP BY second_part
limit 10

I like David's answer - sometimes splitting can get a bit more complicated using RegEx. Extracting the first option from a split command, then GROUPing BY is a very common operation. The way I normally do this in BigQuery is using a REGEXP_EXTRACT as follows:
In this simple example, the column "splitme" is pipe-delimited (|).
SELECT REGEXP_EXTRACT(splitme, r'(?U)^(.*)\|') AS title, COUNT(*) as c
FROM [my_table]
GROUP BY title;
This means, extract the string from the beginning of "splitme" to the first occurrence of a pipe (|). The "(?U)" is the "un-greedy" match flag in the re2 RegEx engine's syntax. Without this flag, if there are multiple pipe-delimited values, this RegEx would match everything up until the last pipe.
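The difference between the greedy and un-greedy match can be checked outside BigQuery. The following is a minimal sketch in Python, whose `re` module has no `(?U)` flag, so the lazy quantifier `.*?` stands in for it; the sample value is hypothetical.

```python
import re

value = "title|author|year"  # hypothetical pipe-delimited value

# Greedy: .* runs as far as it can, so the match ends at the LAST pipe.
greedy = re.match(r"^(.*)\|", value).group(1)

# Lazy: .*? stops at the FIRST pipe, which is what (?U)^(.*)\| does in re2.
lazy = re.match(r"^(.*?)\|", value).group(1)

print(greedy)  # title|author
print(lazy)    # title
```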

In practice, I usually use something like the query below, where N is the number of values in the list to skip.
SELECT REGEXP_EXTRACT(string + '|', r'(?U)^(?:.*\|){N}(.*)\|') AS substring
So if I were interested in the third value in the list, I would use:
SELECT
REGEXP_EXTRACT(string + '|', r'(?U)^(?:.*\|){2}(.*)\|') AS substring,
COUNT(1) AS weight
FROM yourtable
GROUP BY 1
More details are in the re2 syntax documentation.
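To see how the skip-N pattern behaves, here is a small Python sketch of the same idea. The helper name `nth_field` is hypothetical, and `.*?` again replaces the re2 `(?U)` flag. As in the query, the separator is appended to the value so that the last field is also followed by a pipe.

```python
import re
from typing import Optional


def nth_field(value: str, skip: int, sep: str = "|") -> Optional[str]:
    """Return the field that follows `skip` leading fields, or None if there are too few."""
    esc = re.escape(sep)
    # Equivalent of (?U)^(?:.*\|){N}(.*)\| with lazy quantifiers.
    pattern = rf"^(?:.*?{esc}){{{skip}}}(.*?){esc}"
    match = re.match(pattern, value + sep)
    return match.group(1) if match else None


print(nth_field("a|b|c", 2))    # c  (third value: skip two)
print(nth_field("a|b|c|d", 2))  # c
print(nth_field("a|b", 2))      # None (only two values)
```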

Related

How run Select Query with LIKE on thousands of rows

Newbie here. Been searching for hours now but I can't seem to find the correct answer or properly phrase my search.
I have thousands of rows (orderids) that I want to put on an IN function, I have to run a LIKE at the same time on these values since the columns contains json and there's no dedicated table that only has the order_id value. I am running the query in BigQuery.
Sample Input:
ORD12345
ORD54376
Table I'm trying to Query: transactions_table
Query:
SELECT order_id, transaction_uuid,client_name
FROM transactions_table
WHERE JSON_VALUE(transactions_table,'$.ordernum') LIKE IN ('%ORD12345%','%ORD54376%')
Just doesn't work especially if I have thousands of rows.
Also, how do I add the order id that I am querying so that it appears under an order_id column in the query result?
Desired Output:
Option one
WITH transf as (Select order_id, transaction_uuid,client_name , JSON_VALUE(transactions_table,'$.ordernum') as o_num from transactions_table)
Select * from transf where o_num like '%ORD12345%' or o_num like '%ORD54376%'
Option two
Split o_num using "-" as the separator, create a table of orders, such as (SELECT 'ORD12345' AS num
UNION ALL
SELECT 'ORD54376' AS num), and inner join it with transf.o_num
One method uses OR:
WHERE JSON_VALUE(transactions_table, '$.ordernum') LIKE '%ORD12345%' OR
JSON_VALUE(transactions_table, '$.ordernum') LIKE '%ORD54376%'
An alternative method uses regular expressions:
WHERE REGEXP_CONTAINS(JSON_VALUE(transactions_table, '$.ordernum'), 'ORD12345|ORD54376')
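With thousands of order ids, the alternation is easier to generate than to type. The sketch below, in Python, builds the pattern from a list and also returns which id matched, which is one way to fill an order_id column. The list contents and the helper name `matched_order_id` are hypothetical. In BigQuery the generated string could presumably be passed to REGEXP_CONTAINS for filtering and REGEXP_EXTRACT for the matched id, but that part is an assumption and not tested here.

```python
import re

order_ids = ["ORD12345", "ORD54376"]  # hypothetical; in practice thousands of ids

# Longest ids first, so a short id cannot shadow a longer id that shares its prefix.
alternation = "|".join(
    re.escape(order_id) for order_id in sorted(order_ids, key=len, reverse=True)
)
pattern = re.compile(alternation)


def matched_order_id(ordernum):
    """Return the listed order id found inside ordernum, or None if there is none."""
    match = pattern.search(ordernum) if ordernum else None
    return match.group(0) if match else None


print(alternation)                         # ORD12345|ORD54376
print(matched_order_id("ref-ORD54376-x"))  # ORD54376
print(matched_order_id("ORD99999"))        # None
```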
According to the documentation, the LIKE operator works as follows:
Checks if the STRING in the first operand X matches a pattern
specified by the second operand Y. Expressions can contain these
characters:
A percent sign "%" matches any number of characters or
bytes.
An underscore "_" matches a single character or byte.
You can escape "\", "_", or "%" using two backslashes. For example, "\%". If
you are using raw strings, only a single backslash is required. For
example, r"\%".
Thus, the syntax would be like the following:
SELECT
order_id,
transaction_uuid,
client_name
FROM
transactions_table
WHERE
JSON_VALUE(transactions_table,
'$.ordernum') LIKE '%ORD12345%'
OR JSON_VALUE(transactions_table,
'$.ordernum') LIKE '%ORD54376%'
Notice that we specify two conditions connected with the OR logical operator.
As a bonus, when querying large datasets it is good practice to select only the columns you need in your output (either in a temp table or the final view) instead of using *. BigQuery is columnar, which is one of the reasons it is fast, so reading fewer columns means scanning less data.
As an alternative for using LIKE, you can use REGEXP_CONTAINS, according to the documentation:
Returns TRUE if value is a partial match for the regular expression, regex.
Using the following syntax:
REGEXP_CONTAINS(value, regex)
It also works if you pass a plain STRING between single or double quotes instead of a regular expression. In addition, when you have more than one expression to search for, you can use the pipe operator (|) to combine them with a logical OR, as follows:
where regexp_contains(email,"gary|test")
I hope it helps.

Get total number of users where usernames have different case

I have a SQL table where usernames have different cases, for example "ACCOUNTS\Ninja.Developer" or "ACCOUNTS\ninja.developer".
I want to find how many records have a username in which the first letter of both the first and last name is capitalized. How can I use a regex to find the total?
x table
User
"ACCOUNTS\James.McAvoy"
"ACCOUNTS\michael.fassbender"
"ACCOUNTS\nicholas.hoult"
"ACCOUNTS\Oscar.Isaac"
Do you want something like this?
select count(*)
from t
where name rlike 'ACCOUNTS\[A-Z][a-z0-9]*[.][A-Z][a-z0-9]*'
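Of course, different databases implement regular expressions differently, so the actual comparator may not be rlike, and the backslash may need to be escaped (for example as \\) so that it matches literally instead of escaping the bracket that follows it.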
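The pattern itself can be tried against the sample data outside the database. This is a minimal Python sketch, not the SQL above: the backslash is written as `\\` so it matches literally, and `[^.]*` is used for the rest of the first name, which is a slightly looser choice than `[a-z0-9]*`.

```python
import re

# Sample usernames from the question (raw strings keep the backslash literal).
users = [
    r"ACCOUNTS\James.McAvoy",
    r"ACCOUNTS\michael.fassbender",
    r"ACCOUNTS\nicholas.hoult",
    r"ACCOUNTS\Oscar.Isaac",
]

# Upper-case letter right after the backslash, and another right after the dot.
capitalized = re.compile(r"^ACCOUNTS\\[A-Z][^.]*\.[A-Z]")

count = sum(1 for user in users if capitalized.match(user))
print(count)  # 2
```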
In SQL Server, you can do:
select count(*)
from t
where name like 'ACCOUNTS\[A-Z]%[.][A-Z]%';
You might need to be sure that you have a case-sensitive collation.
In most cases the string collation in MS SQL is case-insensitive, so we need a trick. Here is an example:
declare @accts table(acct varchar(100))
--sample data
insert @accts values
('ACCOUNTS\James.McAvoy'),
('ACCOUNTS\michael.fassbender'),
('ACCOUNTS\nicholas.hoult'),
('ACCOUNTS\Oscar.Isaac')
;with accts as (
select
--cleanup and split values
left(replace(acct,'ACCOUNTS\',''),charindex('.',replace(acct,'ACCOUNTS\',''),0)-1) frst,
--everything after the dot: total length minus the position of the dot
right(replace(acct,'ACCOUNTS\',''),len(replace(acct,'ACCOUNTS\',''))-charindex('.',replace(acct,'ACCOUNTS\',''),0)) last
from @accts
)
,groups as (--add comparison columns
select frst, last,
case when CAST(frst as varbinary(max)) = CAST(lower(frst) as varbinary(max)) then 'lower' else 'Upper' end frstCase, --circumvent the case-insensitive collation
case when CAST(last as varbinary(max)) = CAST(lower(last) as varbinary(max)) then 'lower' else 'Upper' end lastCase
from accts
)
--and gather fruit
select frstCase, lastCase, count(frst) cnt
from groups
group by frstCase,lastCase
Your question is a little vague, but you might be looking for the DISTINCT keyword.
I don't think you need regex.
Maybe do something like:
Get distinct names from Table X as Table A
Use inputs table A as where clause on Table X
count
union
I hope this helps,
Rhys
Given your example set you can use a combination of techniques. First if the user name always begins with "ACCOUNTS\" then you can use substr to select the characters that start after the "\" character.
For the first name:
Then you can use a regex function to see if it matches against [A-Z] or [a-z] assuming your username must start with an alpha character.
For the last name:
Use the instr function on the substr and search for the character '.' and again apply the regex function to match against [A-Z] or [a-z] to see if the last name starts with an upper or a lower character.
To total:
Select all matches where both first and last match against upper and do a count. Repeat for the lower matches and you'll have both totals.
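The steps above can be sketched in Python to check the expected totals for the sample data. The helper name `name_case` is hypothetical, and the sketch assumes every username contains the "ACCOUNTS\" prefix and a single dot between first and last name.

```python
from collections import Counter

users = [
    "ACCOUNTS\\James.McAvoy",
    "ACCOUNTS\\michael.fassbender",
    "ACCOUNTS\\nicholas.hoult",
    "ACCOUNTS\\Oscar.Isaac",
]


def name_case(user, prefix="ACCOUNTS\\"):
    """Classify the first letter of the first and last name as 'upper' or 'lower'."""
    # substr step: drop the prefix up to and including the backslash.
    name = user[len(prefix):] if user.startswith(prefix) else user
    # instr step: split at the dot.
    first, _, last = name.partition(".")

    def kind(part):
        return "upper" if part[:1].isupper() else "lower"

    return kind(first), kind(last)


totals = Counter(name_case(user) for user in users)
print(totals[("upper", "upper")])  # 2
print(totals[("lower", "lower")])  # 2
```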

oracle get the char of the longest charsequence from a "characterlist" in select

I think that would be a good question :)
So, I have a characterlist like '111122333334458888888888'
and I want to get only the char of the longest sequence.(it's '8' in that example)
It's a maxsearch of course, but I need to do it in the SELECT statement.
You can try something like this:
select character
from
(
select character, count(1)
from
(
select substr('111122333334458888888888', level, 1) as character
from dual
connect by level <= length('111122333334458888888888')
)
group by character
order by 2 desc
)
where rownum = 1
This uses the inner query to split the starting string into single characters, then counts the occurrence of every character ordering to get the character with the greatest number of occurrences.
You can rewrite this in different ways, for example with analytic functions; I believe this way is one of the most readable.
If more than one character has the maximum number of occurrences, this will return one of them, unpredictably; if you need to choose, for example, the minimum char, you can edit the ORDER BY clause accordingly.
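Note that the query counts total occurrences of each character, which matches the longest consecutive sequence in the example but not in general. The Python sketch below contrasts the two; the function names are hypothetical, and both functions assume a non-empty string.

```python
from collections import Counter
from itertools import groupby


def longest_run_char(text):
    """Character of the longest run of consecutive identical characters."""
    runs = ((char, sum(1 for _ in group)) for char, group in groupby(text))
    return max(runs, key=lambda run: run[1])[0]


def most_frequent_char(text):
    """Character with the most occurrences overall (what the query above computes)."""
    return Counter(text).most_common(1)[0][0]


sample = "111122333334458888888888"
print(longest_run_char(sample))    # 8
print(most_frequent_char(sample))  # 8

# The two differ when a frequent character is scattered:
print(longest_run_char("abababcc"))  # c
```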

Remove last two characters from each database value

I run the following query:
select * from my_temp_table
And get this output:
PNRP1-109/RT
PNRP1-200-16
PNRP1-209/PG
013555366-IT
How can I alter my query to strip the last two characters from each value?
Use the SUBSTR() function.
SELECT SUBSTR(my_column, 1, LENGTH(my_column) - 2) FROM my_table;
Another way using a regular expression:
select regexp_replace('PNRP1-109/RT', '^(.*).{2}$', '\1') from dual;
This replaces your string with group 1 from the regular expression, where group 1 (inside of the parens) includes the set of characters after the beginning of the line, not including the 2 characters just before the end of the line.
While not as simple for your example, arguably more powerful.
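Both approaches can be compared on the sample values. This is a small Python sketch where slicing plays the role of SUBSTR and `re.sub` the role of regexp_replace.

```python
import re

values = ["PNRP1-109/RT", "PNRP1-200-16", "PNRP1-209/PG", "013555366-IT"]

# SUBSTR(my_column, 1, LENGTH(my_column) - 2): keep everything but the last two characters.
by_slice = [value[:-2] for value in values]

# regexp_replace(value, '^(.*).{2}$', '\1'): keep group 1, drop the last two characters.
by_regex = [re.sub(r"^(.*).{2}$", r"\1", value) for value in values]

print(by_slice)  # ['PNRP1-109/', 'PNRP1-200-', 'PNRP1-209/', '013555366-']
```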

Oracle: a query, which counts occurrences of all non alphanumeric characters in a string

What would be the best way to count occurrences of all non alphanumeric characters that appear in a string in an Oracle database column.
When attempting to find a solution I realised I had a query that was unrelated to the problem, but I noticed I could modify it in the hope to solve this problem. I came up with this:
SELECT COUNT (*), SUBSTR(TITLE, REGEXP_INSTR(UPPER(TITLE), '[^A-Z,^0-9]'), 1)
FROM TABLE_NAME
WHERE REGEXP_LIKE(UPPER(TITLE), '[^A-Z,^0-9]')
GROUP BY SUBSTR(TITLE, REGEXP_INSTR(UPPER(TITLE), '[^A-Z,^0-9]'), 1)
ORDER BY COUNT(*) DESC;
This works to find the FIRST non alphanumeric character, but I would like to count the occurrences throughout the entire string, not just the first occurrence. E. g. currently my query analysing "a (string)" would find one open parenthesis, but I need it to find one open parenthesis and one closed parenthesis.
There is an obscure Oracle TRANSLATE function that will let you do that instead of regexp:
select a.*,
length(translate(lower(title),'.0123456789abcdefghijklmnopqrstuvwxyz','.'))
from table_name a
Try this:
SELECT a.*, LENGTH(REGEXP_REPLACE(TITLE, '[a-zA-Z0-9]', ''))
FROM TABLE_NAME a
The best option, as you discovered, is to use a PL/SQL procedure. I don't think there's any way to write a regular expression that will return multiple counts like you're expecting (at least, not in Oracle).
One way to get around this is to use a recursive query to examine each character individually, which could be used to return a row for each character found. The following example will work for a single row:
with d as (
select '(1(2)3)' as str_value
from dual)
select char_value, count(*)
from (select substr(str_value,level,1) as char_value
from d
connect by level <= length(str_value))
where regexp_instr(upper(char_value), '[^A-Z0-9]', 1) <> 0
group by char_value;
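The expected per-character counts are easy to verify outside the database. The Python sketch below does the same thing as the query, looking at each character and counting the ones that are not ASCII letters or digits; the function name is hypothetical.

```python
import re
from collections import Counter


def non_alnum_counts(text):
    """Count each character that is not an ASCII letter or digit."""
    return Counter(re.findall(r"[^A-Za-z0-9]", text))


print(non_alnum_counts("(1(2)3)"))     # '(' twice and ')' twice
print(non_alnum_counts("a (string)"))  # one space, one '(' and one ')'
```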