Google BigQuery Complex UDF - sql

I am creating a UDF in BigQuery to call from a larger query. The input to the UDF is a string made up of numbers and length units. There are three main cases; each is listed below with an example. The "#" represents a number. "Unit" represents one of three distance units (Miles, Yards, or Furlongs). Case 3 consists of two different units that are added together. The end goal of the UDF is to normalize the input to one unit (yards), remove any alphabetic characters, and convert any mixed fractions (such as 5 1/2) to floats. The UDF would then return a string.
Cases:
Case 1: '# Unit'; Example: Input '350 Y', Output '350'
Case 2: '# #/#Unit'; Example: Input '5 1/2F', Output '1210'
Case 3: '#Unit1 #Unit2'; Example: Input '4F 70Y', Output '950.002'
I have tried to do this using IF statements. In my first attempt, I could only handle the fraction case and the case of two units that get added together. Is there a way to chain many IF statements to cover all possible combinations? I haven't found a way to use else-if conditional statements. Any advice, guidance, or code would be very much appreciated. I am relatively new to SQL and BigQuery, so please let me know if I am going about this the wrong way. Below is the code from my first attempt:
CREATE OR REPLACE FUNCTION `location`(str STRING) AS (
(
if(
REGEXP_CONTAINS(str, r' ') AND REGEXP_CONTAINS(str, r'/')=FALSE #does not contain /
, (1760 * SAFE_CAST(SPLIT(REGEXP_REPLACE(str, 'M',''),' ')[OFFSET(0)] AS FLOAT64)) + SAFE_CAST(SPLIT(str,' ')[OFFSET(1)] AS FLOAT64)
,if(
REGEXP_CONTAINS(str, r'/'),
SAFE_CAST(SPLIT(str,' ')[OFFSET(0)] AS FLOAT64) + SAFE_CAST(SPLIT(SPLIT(str,' ')[OFFSET(1)], '/')[OFFSET(0)] AS INT64) / SAFE_CAST(SPLIT(SPLIT(str,' ')[OFFSET(1)], '/')[OFFSET(1)] AS INT64),
IFNULL(SAFE_CAST(REGEXP_REPLACE(str, r'FYM','') AS FLOAT64), -1)
)
)
)
);

Consider below
create temp function eval (str string) returns float64
language js as r"""
return eval(str);
""";
select str,
(
select sum(
case right(x,1)
when 'M' then 1760
when 'F' then 220
when 'Y' then 1
end * eval(replace(trim(translate(x, 'MFY', '')), ' ', '+')))
from unnest(regexp_extract_all(str, r'[^MFY]+(?:M|F|Y)')) val,
unnest([struct(trim(val) as x)])
) yards
from your_table
If applied to the sample data in your question, the output is 350, 1210, and 950 yards for the three example inputs.
Update (per recent comments): you can package the whole thing into a JS UDF and a SQL UDF, as in the example below
create temp function eval (str string) returns float64
language js as r"""
return eval(str);
""";
create temp function to_yards(str string) as ((
select sum(
case right(x,1)
when 'M' then 1760
when 'F' then 220
when 'Y' then 1
end * eval(replace(trim(translate(x, 'MFY', '')), ' ', '+')))
from unnest(regexp_extract_all(str, r'[^MFY]+(?:M|F|Y)')) val,
unnest([struct(trim(val) as x)])
));
select str, to_yards(str) as yards
from your_table
with same output as above
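To sanity-check the conversion logic outside BigQuery, here is a minimal Python sketch of the same idea. It is not part of the answer above: the names `to_yards`, `YARDS_PER_UNIT`, and `CHUNK` are made up for illustration, and it parses the numbers with `fractions.Fraction` instead of calling `eval`. It assumes the same factors as the SQL (1760 yards per mile, 220 per furlong).

```python
import re
from fractions import Fraction

# Yards per unit, matching the CASE expression in the SQL above.
YARDS_PER_UNIT = {"M": 1760, "F": 220, "Y": 1}

# One chunk is everything up to and including a unit letter,
# mirroring the pattern r'[^MFY]+(?:M|F|Y)' used in the query.
CHUNK = re.compile(r"([^MFY]+)([MFY])")


def to_yards(text: str) -> float:
    """Convert strings such as '5 1/2F' or '4F 70Y' to yards."""
    total = Fraction(0)
    for amount, unit in CHUNK.findall(text):
        # '5 1/2' splits into ['5', '1/2']; Fraction parses both forms.
        quantity = sum(Fraction(part) for part in amount.split())
        total += quantity * YARDS_PER_UNIT[unit]
    return float(total)


for sample in ["350 Y", "5 1/2F", "4F 70Y"]:
    print(sample, to_yards(sample))
```

With exactly 220 yards per furlong, '4F 70Y' comes out as 950 rather than the '950.002' quoted in the question, so the question presumably used a slightly different furlong factor.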

Related

How to add up a string of numbers using SQL (BigQuery)?

I have a string of numbers like this:
670000000000100000000000000000000000000000000000000000000000000
I want to add up these numbers which in the above example would result in 14: 6+7+0+...+1+0+...+0+0+0=14
How would I do this in BigQuery?
Consider below approach
with example as (
select '670000000000100000000000000000000000000000000000000000000000000' as s
)
select s, (select sum(cast(num as int64)) from unnest(split(s,'')) num) result
from example
with output 14 in the result column
Yet another [fun] option
create temp function sum_digits(expression string)
returns int64
language js as """
return eval(expression);
""";
with example as (
select '670000000000100000000000000000000000000000000000000000000000000' as s
)
select s, sum_digits(regexp_replace(replace(s, '0', ''), r'(\d)', r'+\1')) result
from example
with output
What it does:
first it transforms the initial long string into a shorter one by dropping the zeros - 671,
then it transforms that into an expression - +6+7+1,
and finally it passes the expression to the JavaScript eval function (unfortunately BigQuery does not have [hopefully yet] an eval function of its own)
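Both approaches can be checked quickly in Python. This is only an illustrative sketch with hypothetical function names (`sum_digits`, `build_expression`); the second function reproduces the string transformation from the "fun" option without actually evaluating it.

```python
import re


def sum_digits(digits: str) -> int:
    """Add up the ASCII decimal digits in a string, ignoring anything else."""
    return sum(int(ch) for ch in digits if "0" <= ch <= "9")


def build_expression(digits: str) -> str:
    """Drop the zeros, then prefix every remaining digit with '+'."""
    return re.sub(r"(\d)", r"+\1", digits.replace("0", ""))


s = "670000000000100000000000000000000000000000000000000000000000000"
print(sum_digits(s))        # 14
print(build_expression(s))  # +6+7+1
```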

Proper Case in Big Query

I have this sentence "i want to buy bananas" across column 'Bananas' in Big Query.
I want to get "I Want To Buy Bananas". How do I do it? I was expecting a PROPER(Bananas) function when I saw LOWER and UPPER, but it seems like PROPER case is not supported?
October 2020 Update:
BigQuery now supports the INITCAP function, which takes a STRING and returns it with the first character in each word in uppercase and all other characters in lowercase. Non-alphabetic characters remain the same.
So the kind of fancy-shmancy UDF shown further down is no longer needed; instead you can just use
#standardSQL
SELECT str, INITCAP(str) proper_str
FROM `project.dataset.table`
-- ~~~~~~~~~~~~~~~~~~
The example below is for BigQuery Standard SQL
#standardSQL
CREATE TEMP FUNCTION PROPER(str STRING) AS ((
SELECT STRING_AGG(CONCAT(UPPER(SUBSTR(w,1,1)), LOWER(SUBSTR(w,2))), ' ' ORDER BY pos)
FROM UNNEST(SPLIT(str, ' ')) w WITH OFFSET pos
));
WITH `project.dataset.table` AS (
SELECT 'i Want to buy bananas' str
)
SELECT str, PROPER(str) proper_str
FROM `project.dataset.table`
result is
Row | str                   | proper_str
1   | i Want to buy bananas | I Want To Buy Bananas
I expanded on Mikhail Berlyant's answer to also capitalise after hyphens (-), as I needed to use proper case for place names. I had to switch from the SPLIT function to a regex to do this.
I test for an empty string at the start and return an empty string (as opposed to null) to match the behaviour of the native UPPER and LOWER functions.
CREATE TEMP FUNCTION PROPER(str STRING) AS ((
SELECT
IF(str = '', '',
STRING_AGG(
CONCAT(
UPPER(SUBSTR(single_words,1,1)),
LOWER(SUBSTR(single_words,2))
),
'' ORDER BY position
)
)
FROM UNNEST(REGEXP_EXTRACT_ALL(str, r' +|-+|.[^ -]*')) AS single_words
WITH OFFSET AS position
));
WITH test_table AS (
SELECT 'i Want to buy bananas' AS str
UNION ALL
SELECT 'neWCASTle upon-tyne' AS str
)
SELECT str, PROPER(str) AS proper_str
FROM test_table
Output
Row | str                   | proper_str
1   | i Want to buy bananas | I Want To Buy Bananas
2   | neWCASTle upon-tyne   | Newcastle Upon-Tyne
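The tokenising regex is the key part of this version, so here is a small Python sketch of the same idea for experimenting with inputs. The names `TOKEN` and `proper` are chosen for illustration only, and Python's `re` is assumed to treat this particular pattern the same way BigQuery's regex engine does.

```python
import re

# Same tokenisation idea as the SQL pattern r' +|-+|.[^ -]*':
# a run of spaces, a run of hyphens, or a word (one character followed
# by anything that is neither a space nor a hyphen).
TOKEN = re.compile(r" +|-+|.[^ -]*")


def proper(text: str) -> str:
    """Upper-case the first character of every token, lower-case the rest."""
    return "".join(
        token[:1].upper() + token[1:].lower() for token in TOKEN.findall(text)
    )


print(proper("i Want to buy bananas"))  # I Want To Buy Bananas
print(proper("neWCASTle upon-tyne"))    # Newcastle Upon-Tyne
```

Because separators are kept as tokens and everything is joined with an empty string, the original spacing and hyphens are preserved, and an empty input gives an empty output.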

How to get the first field from an anonymous row type in PostgreSQL 9.4?

=# select row(0, 1) ;
row
-------
(0,1)
(1 row)
How can I get 0 within the same query? I figured out the approach below, which sort of works, but is there a simpler way?
=# select json_agg(row(0, 1))->0->'f1' ;
?column?
----------
0
(1 row)
No luck with array-like syntax [0].
Thanks!
Your row type is anonymous and therefore you cannot access its elements easily. What you can do is create a TYPE and then cast your anonymous row to that type and access the elements defined in the type:
CREATE TYPE my_row AS (
x integer,
y integer
);
SELECT (row(0,1)::my_row).x;
Like Craig Ringer commented in your question, you should avoid producing anonymous rows to begin with, if you can help it, and type whatever data you use in your data model and queries.
If you just want the first element from any row, convert the row to JSON and select f1...
SELECT row_to_json(row(0,1))->'f1'
Or, if you are always going to have two integers or a strict structure, you can create a temporary table (or type) and a function that selects the first column.
CREATE TABLE tmptable(f1 int, f2 int);
CREATE FUNCTION gettmpf1(tmptable) RETURNS int AS 'SELECT $1.f1' LANGUAGE SQL;
SELECT gettmpf1(ROW(0,1));
Resources:
https://www.postgresql.org/docs/9.2/static/functions-json.html
https://www.postgresql.org/docs/9.2/static/sql-expressions.html
The json solution is very elegant. Just for fun, this is a solution using regexp (much uglier):
WITH r AS (SELECT row('quotes, "commas",
and a line break".',null,null,'"fourth,field"')::text AS r)
--WITH r AS (SELECT row('',null,null,'')::text AS r)
--WITH r AS (SELECT row(0,1)::text AS r)
SELECT CASE WHEN r.r ~ '^\("",' THEN ''
WHEN r.r ~ '^\("' THEN regexp_replace(regexp_replace(regexp_replace(right(r.r, -2), '""', '\"', 'g'), '([^\\])",.*', '\1'), '\\"', '"', 'g')
ELSE (regexp_matches(right(r.r, -1), '^[^,]*'))[1] END
FROM r
When converting a row to text, PostgreSQL uses quoted CSV formatting. I couldn't find any tools for importing quoted CSV into an array, so the above is a crude text manipulation via mostly regular expressions. Maybe someone will find this useful!
With PostgreSQL 13+, you can just reference individual elements in the row with .fN notation. For your example:
select (row(0, 1)).f1; --> returns 0.
See https://www.postgresql.org/docs/13/sql-expressions.html#SQL-SYNTAX-ROW-CONSTRUCTORS

PostgreSQL count number of times substring occurs in text

I'm writing a PostgreSQL function to count the number of times a particular text substring occurs in another piece of text. For example, calling count('foobarbaz', 'ba') should return 2.
I understand that to test whether the substring occurs, I use a condition similar to the below:
WHERE 'foobarbaz' like '%ba%'
However, I need it to return 2 for the number of times 'ba' occurs. How can I proceed?
Thanks in advance for your help.
I would highly suggest checking out this answer I posted to "How do you count the occurrences of an anchored string using PostgreSQL?". The chosen answer was shown to be massively slower than an adapted version of regexp_replace(). The overhead of creating the rows and then running the aggregate is simply too high.
The fastest way to do this is as follows...
SELECT
(length(str) - length(replace(str, replacestr, '')) )::int
/ length(replacestr)
FROM ( VALUES
('foobarbaz', 'ba')
) AS t(str, replacestr);
Here we
Take the length of the string, L1.
Subtract from L1 the length of the string with all of the replacements removed, L2, to get L3, the difference in string length.
Divide L3 by the length of the replacement to get the number of occurrences.
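The three steps above can be sketched in Python to see the arithmetic on the example. The function name `count_occurrences` is hypothetical, and the guard against an empty substring is an addition that the SQL does not have (there it would be a division by zero).

```python
def count_occurrences(text: str, needle: str) -> int:
    """Count non-overlapping occurrences of needle via the length difference."""
    if not needle:
        raise ValueError("needle must be a non-empty string")
    full_length = len(text)                          # L1
    stripped_length = len(text.replace(needle, ""))  # L2
    removed = full_length - stripped_length          # L3
    return removed // len(needle)


# 'foobarbaz' has 9 characters; removing both 'ba' leaves 'foorz' (5),
# so 4 characters were removed, i.e. 4 / 2 = 2 occurrences.
print(count_occurrences("foobarbaz", "ba"))
```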
For comparison, that's about five times faster than the regexp_matches() method, which looks like this.
SELECT count(*)
FROM ( VALUES
('foobarbaz', 'ba')
) AS t(str, replacestr)
CROSS JOIN LATERAL regexp_matches(str, replacestr, 'g');
How about using a regular expression:
SELECT count(*)
FROM regexp_matches('foobarbaz', 'ba', 'g');
The 'g' flag returns every match in the string (not just the first).
There is a
str_count( src, occurrence )
function based on
SELECT (length( str ) - length(replace( str, occurrence, '' ))) / length( occurrence )
and a
str_countm( src, regexp )
based on the @MikeT-mentioned
SELECT count(*) FROM regexp_matches( str, regexp, 'g')
available here: postgres-utils
Try with:
SELECT array_length (string_to_array ('1524215121518546516323203210856879', '1'), 1) - 1
--RESULT: 7

sql function to return table of names and values given a querystring

Does anyone have a T-SQL function that takes a querystring from a URL and returns a table of name/value pairs?
E.g. I have a value like this stored in my database:
foo=bar&baz=qux&x=y
and I want to produce a 2-column (key and val) table (with 3 rows in this example), like this:
name | value
-------------
foo | bar
baz | qux
x | y
UPDATE: there's a reason I need this in a t-sql function; I can't do it in application code. Perhaps I could use CLR code in the function, but I'd prefer not to.
UPDATE: by 'querystring' I mean the part of the url after the '?'. I don't mean that part of a query will be in the url; the querystring is just used as data.
create function dbo.fn_splitQuerystring(@querystring nvarchar(4000))
returns table
as
/*
* Splits a querystring-formatted string into a table of name-value pairs
* Example Usage:
select * from dbo.fn_splitQueryString('foo=bar&baz=qux&x=y&y&abc=')
*/
return (
select 'name' = SUBSTRING(s,1,case when charindex('=',s)=0 then LEN(s) else charindex('=',s)-1 end)
, 'value' = case when charindex('=',s)=0 then '' else SUBSTRING(s,charindex('=',s)+1,4000) end
from dbo.fn_split('&',@querystring)
)
go
Which utilises this general-purpose split function:
create function dbo.fn_split(@sep nchar(1), @s nvarchar(4000))
returns table
/*
* From https://stackoverflow.com/questions/314824/
* Splits a string into a table of values, with single-char delimiter.
* Example Usage:
select * from dbo.fn_split(',', '1,2,5,2,,dggsfdsg,456,df,1,2,5,2,,dggsfdsg,456,df,1,2,5,2,,')
*/
AS
RETURN (
WITH Pieces(pn, start, stop) AS (
SELECT 1, 1, CHARINDEX(@sep, @s)
UNION ALL
SELECT pn + 1, stop + 1, CHARINDEX(@sep, @s, stop + 1)
FROM Pieces
WHERE stop > 0
)
SELECT pn,
SUBSTRING(@s, start, CASE WHEN stop > 0 THEN stop-start ELSE 4000 END) AS s
FROM Pieces
)
go
Ultimately letting you do something like this:
select name, value
from dbo.fn_splitQuerystring('foo=bar&baz=something&x=y&y&abc=&=whatever')
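To make the expected rows for that last call concrete, here is a rough Python sketch that follows the same splitting rules as the two functions above: split on '&', then split each piece at the first '=', with a missing '=' giving an empty value. The function name `split_querystring` is made up for illustration, and like the T-SQL it does no URL decoding.

```python
from typing import List, Tuple


def split_querystring(querystring: str) -> List[Tuple[str, str]]:
    """Split 'a=b&c=d' into (name, value) pairs without URL decoding."""
    pairs = []
    for piece in querystring.split("&"):
        # partition splits at the first '='; if there is none,
        # the whole piece becomes the name and the value is ''.
        name, _, value = piece.partition("=")
        pairs.append((name, value))
    return pairs


for name, value in split_querystring("foo=bar&baz=something&x=y&y&abc=&=whatever"):
    print(repr(name), repr(value))
```

In application code, Python's standard `urllib.parse.parse_qsl` covers the same ground and also handles percent-decoding, which this sketch and the T-SQL version do not.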
I'm sure TSQL could be coerced to jump through this hoop for you, but why not parse the querystring in your application code where it most probably belongs?
Then you can look at this answer for what others have done to parse querystrings into name/value pairs.
Or this answer.
Or this.
Or this.
Please don't encode your query strings directly in URLs, for security reasons: anyone can easily substitute any old query to gain access to information they shouldn't have -- or worse, "DROP DATABASE;". Checking for suspicious "keywords" or things like quote characters is not a solution -- creative hackers will work around these measures, and you'll annoy everyone whose last name is "O'Reilly."
Exceptions: in-house-only servers or public HTTPS URLs. But even then, there's no reason why you can't build the SQL query on the client side and submit it from there.