Extract particular character using StandardSQL - google-bigquery

I would like to extract a particular value from strings using Standard SQL.
Specifically, I would like to extract the value that follows limit=.
For instance, from the strings below I would like to extract 10, 3 and null. I would also like every null to become 1.
partner=&limit=10
partner=aex&limit=3&filters%5Bpartner%5D
partner=aex&limit=&filters%5Bpartner%5D
I only know how to use the substring function, but the problem here is that the position of limit= is not always the same.

You can use REGEXP_EXTRACT. For example:
SELECT REGEXP_EXTRACT('partner=aex&limit=3&filters%5Bpartner%5D', 'limit=(\\d+)');
+-------+
| $col1 |
+-------+
| 3 |
+-------+
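The answer above covers the extraction but not the "null becomes 1" part of the question. REGEXP_EXTRACT returns NULL when there is no match, so in BigQuery the call can be wrapped in IFNULL or COALESCE to supply the default. A minimal Python sketch of the same logic, for checking the pattern locally, is below; extract_limit is a hypothetical helper name, and Python's re module stands in for BigQuery's RE2 engine (the two agree for this simple pattern).

```python
import re

def extract_limit(query: str, default: int = 1) -> int:
    """Return the digits after 'limit=', or `default` when there are none."""
    match = re.search(r"limit=(\d+)", query)
    return int(match.group(1)) if match else default

# The three sample strings from the question.
samples = [
    "partner=&limit=10",
    "partner=aex&limit=3&filters%5Bpartner%5D",
    "partner=aex&limit=&filters%5Bpartner%5D",
]
limits = [extract_limit(s) for s in samples]
print(limits)  # [10, 3, 1]
```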

Related

BigQuery - Regex to match a pattern after a known string (positive lookbehind alternative)

I need to extract 8 digits after a known string:
| MyString | Extract: |
| ---------------------------- | -------- |
| mypasswordis 12345678 | 12345678 |
| # mypasswordis 12345678 | 12345678 |
| foobar mypasswordis 12345678 | 12345678 |
I can do this with regex like:
(?<=mypasswordis.*)[0-9]{8}
However, when I want to do this in BigQuery using the REGEXP_EXTRACT command, I get the error message, "Cannot parse regular expression: invalid perl operator: (?<".
I searched through the re2 library and saw there doesn't seem to be an equivalent for positive lookbehind.
Is there any way I can do this using other methods? Something like
SELECT REGEXP_EXTRACT(MyString, r"(?<=mypasswordis.*)[0-9]{8}")
You need a capturing group here to extract a part of a pattern, see the REGEXP_EXTRACT docs you linked to:
If the regular expression contains a capturing group, the function returns the substring that is matched by that capturing group. If the expression does not contain a capturing group, the function returns the entire matching substring.
Also, the .* pattern is too costly; you only need to match the whitespace between the word and the digits.
In general, to "convert" a (?<=mypasswordis).* pattern with a positive lookbehind, you can use mypasswordis(.*).
In this case, you can use
SELECT REGEXP_EXTRACT(MyString, r"mypasswordis\s*([0-9]{8})")
Or just
SELECT REGEXP_EXTRACT(MyString, r"mypasswordis\s*([0-9]+)")
See the re2 regex online test.
Try to avoid regular expressions where you can; they can be quite slow. For example, use SUBSTR and INSTR:
SELECT SUBSTR(MyString, INSTR(MyString,'mypasswordis') + LENGTH('mypasswordis')+1)
Otherwise, Wiktor Stribiżew probably has the right answer.
Use REGEXP_REPLACE instead to match what you don't want and delete that:
REGEXP_REPLACE(str, r'^.*mypasswordis ', '')
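The capturing-group rewrite suggested above can be checked outside BigQuery. The sketch below uses Python's re module as a stand-in for RE2, which is an assumption; extract_digits and the fourth, non-matching row are made up for illustration. Python's re also rejects variable-width lookbehind, so the capturing group is the portable form here too.

```python
import re

# Capturing group in place of the unsupported lookbehind (?<=mypasswordis.*)
PATTERN = re.compile(r"mypasswordis\s*([0-9]{8})")

def extract_digits(text):
    """Return the 8 digits that follow 'mypasswordis', or None if absent."""
    match = PATTERN.search(text)
    return match.group(1) if match else None

rows = [
    "mypasswordis 12345678",
    "# mypasswordis 12345678",
    "foobar mypasswordis 12345678",
    "no match here",  # hypothetical row to show the no-match case
]
results = [extract_digits(r) for r in rows]
print(results)
```

Only group 1 is returned, which mirrors how REGEXP_EXTRACT returns the substring matched by the capturing group.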

Format a number to NOT have commas (1,000,000 -> 1000000) in Google BigQuery

In BigQuery, how do we format a number in the result set so that it has no commas, e.g. 1,000,000 becomes 1000000?
I am assuming that your data type is string here.
You can use the REGEXP_REPLACE function to remove certain symbols from strings.
SELECT REGEXP_REPLACE("1,000,000", r',', '') AS Output
Returns:
+-----+---------+
| Row | Output |
+-----+---------+
| 1 | 1000000 |
+-----+---------+
If your data contains strings both with and without commas, this function returns the ones without commas unchanged, so you don't need to worry about filtering the input.
Documentation for this function can be found here.

Concat multiple rows with a delimiter in Hive

I need to concat string values row wise with '~' as delimiter.
I have the following data:
I need to concat 'Comment' column for each 'id' in the ascending order of 'row_id' with '~' as delimiter.
Expected output is as below:
GROUP_CONCAT is not an option since it's not recognized in my Hive version.
I can use collect_set or collect_list, but I won't be able to insert delimiter in between.
Is there any workaround?
collect_list returns an array, not a string.
An array can be converted to a delimited string using concat_ws.
This will work, but with no specific order of comments.
select id
,concat_ws('~',collect_list(comment)) as comments
from mytable
group by id
;
+----+-------------+
| id | comments |
+----+-------------+
| 1 | ABC~PRQ~XYZ |
| 2 | LMN~OPQ |
+----+-------------+
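The query above does not guarantee the row_id ordering the question asks for. As a rough illustration of the intended result (sort by id and row_id, then join with '~'), here is a small Python sketch. The sample rows are hypothetical, since the question's data did not survive, and they are chosen only to reproduce the output shown above.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical (id, row_id, comment) rows, deliberately out of order.
rows = [
    (1, 3, "XYZ"),
    (2, 1, "LMN"),
    (1, 1, "ABC"),
    (2, 2, "OPQ"),
    (1, 2, "PRQ"),
]

# Sort by id, then row_id, so each group is joined in ascending row_id order.
ordered = sorted(rows, key=itemgetter(0, 1))
comments = {
    key: "~".join(comment for _, _, comment in group)
    for key, group in groupby(ordered, key=itemgetter(0))
}
print(comments)  # {1: 'ABC~PRQ~XYZ', 2: 'LMN~OPQ'}
```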

Query to search substring in column

I have a table that has a substring value in the column and I want to write a query that checks if input string has the substring.
My table looks like:
| company | host |
| ------- | ---------- |
| ebay | ebay.com |
| google | google.com |
| yahoo | yahoo.com |
My input will be like www.ebay.com or https://www.ebay.com or www.qa.ebay.com or www.dev.ebay.com.
If I get any of these inputs, I want to return the first record.
I tried looking at CHARINDEX and INSTR, but they work in reverse. In my scenario, the substring to search for is in the table and the full string is the input.
Any help is appreciated.
You can use like for this, but you also need string concatenation. In ANSI standard SQL, this looks like:
select t.*
from t
where @inputstring like concat('%.', t.host)
where @inputstring is the string you are inputting.
Note: You can also use the concatenation infix operation, which is typically || (standard) or +.
You can use the SQL wildcard like so:
SELECT * FROM table WHERE host LIKE '%ebay.com';
Go for this:
SELECT * FROM table WHERE host LIKE '%SearchString%'
It will pull all rows containing the SearchString.
You can achieve this using the LIKE operator.
Select * from yourtable
where ? like concat('%', company, '%');
Replace the ? parameter with your input.
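The "reversed" LIKE, where the input is on the left and the pattern is built from the column, can be tried with an in-memory SQLite database. This is only a sketch: SQLite and the find_company helper are assumptions for illustration, the table mirrors the one in the question, and || is used for concatenation as one of the answers notes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (company TEXT, host TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [("ebay", "ebay.com"), ("google", "google.com"), ("yahoo", "yahoo.com")],
)

def find_company(input_string):
    """Return companies whose host, preceded by a dot, ends the input string."""
    rows = conn.execute(
        "SELECT company FROM t WHERE ? LIKE '%.' || host",
        (input_string,),
    ).fetchall()
    return [row[0] for row in rows]

for url in ("www.ebay.com", "https://www.ebay.com", "www.qa.ebay.com"):
    print(url, find_company(url))
```

Requiring the leading '.' keeps a host such as notebay.com from matching ebay.com, at the cost of not matching a bare ebay.com input.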

Wildcard of Number in SQL Server

How to match numbers in SQL Server 'LIKE'.
SpaceName
------------
| New_Space_1
| .
| .
| New_Space_8
| New_Space_9
| New_Space_10
| New_Space_11
| New_Space_SomeString
| New_Space_SomeString1
Above is my table contents.
I want to get only the records ending with numeric characters, i.e. the records from New_Space_1 to New_Space_11.
I don't want New_Space_SomeString or New_Space_SomeString1.
I have some query like this.
SELECT SpaceName FROM SpaceTable
WHERE SpaceName LIKE 'New_Space_%'
But this returns all records.
What about:
SELECT SpaceName FROM SpaceTable
WHERE SpaceName LIKE 'New[_]Space[_][0-9]%'
The reason I put the underscore in brackets is that in a LIKE pattern _ matches any single character. Read up on LIKE here: http://msdn.microsoft.com/en-us/library/ms179859.aspx
This solution from @SteveKass works perfectly.
SELECT SpaceName FROM SpaceTable WHERE SpaceName LIKE 'New[_]Space[_]%' AND SpaceName NOT LIKE 'New[_]Space[_]%[^0-9]%'
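The two-condition query above keeps names that start with New_Space_ and contain no non-digit character after that prefix. A small Python sketch of the same filter is below, using a regular expression in place of the two LIKE clauses; that substitution is an assumption for illustration, and it ignores the case-insensitive collation SQL Server typically uses.

```python
import re

# Literal prefix followed only by digits; '*' (not '+') mirrors the SQL,
# which would also accept the bare prefix 'New_Space_'.
PATTERN = re.compile(r"New_Space_[0-9]*")

names = [
    "New_Space_1",
    "New_Space_8",
    "New_Space_9",
    "New_Space_10",
    "New_Space_11",
    "New_Space_SomeString",
    "New_Space_SomeString1",
]
numeric_only = [n for n in names if PATTERN.fullmatch(n)]
print(numeric_only)
```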