How to create a BQ SQL UDF that iterates over a string?

TL;DR:
Is there a way to do string manipulation in BQ only with SQL UDF?
Eg:
____________________________________________________
id | payload
----------------------------------------------------
1 | key1=val1&key2=val2&key3=val3=&key4=val4
----------------------------------------------------
2 | key5=val5&key6=val6=
select removeExtraEqualToFromPayload(payload) from table
should give
____________________________________________________
payload
----------------------------------------------------
key1=val1&key2=val2&key3=val3&key4=val4
----------------------------------------------------
key5=val5&key6=val6
Long version:
My goal is to iterate over a string that is part of one of the columns
This is our table structure
____________________________________________________
id | payload
----------------------------------------------------
1 | key1=val1&key2=val2&key3=val3=&key4=val4
----------------------------------------------------
2 | key5=val5&key6=val6=
As you can see, key3 in the first row has an = after val3, and key6 in the second row has an = after val6, which we do not want.
So the goal is to iterate over the string and remove these extra = characters.
I went through https://cloud.google.com/bigquery/docs/reference/standard-sql/user-defined-functions, which explains how to use custom functions in BQ. As of now, a SQL UDF only supports a SQL query, whereas with a JS UDF we can write our own custom logic, add loops, etc.
Since JS UDFs are very slow, using them has been ruled out and we have to rely on SQL UDFs only.
I thought of using BQ Scripting (https://cloud.google.com/bigquery/docs/reference/standard-sql/scripting) in combination with a SQL UDF, but that doesn't seem to work. It looks like a script has to be something separate altogether.
I also explored stored procedures in BQ for the same purpose, but that is not working either. I'm not sure if I am doing it right.
I've created a procedure like this:
CREATE PROCEDURE test.AddDelta(INOUT x INT64, delta INT64)
BEGIN
SET x = x + delta;
END;
I'm not able to use the above procedure like this:
with ta as (select 1 id union all select 2 id)
select id from ta;
call test.AddDelta(id, 1);
select id;
I'm wondering if there is a way to parse strings like this without using a JavaScript UDF.

Disclaimer: my regex-fu is not good, so definitely have a look at the RE2 syntax.
You should be able to do it with REGEXP_REPLACE:
SELECT
payload,
REGEXP_REPLACE(payload,r'=(&)|=$','\\1') AS payload_clean
FROM
`myproject.mydataset.mytable`
Example output:
payload | payload_clean
----------------------------------------------------
key1=val1&key2=val2&key3=val3=&key4=val4= | key1=val1&key2=val2&key3=val3&key4=val4
Executable example:
WITH
payload_table AS (
SELECT "key1=val1&key2=val2&key3=val3=&key4=val4" AS payload UNION ALL
SELECT "key5=val5&key6=val6=" AS payload UNION ALL
SELECT "key1=val1&key2=val2&key3=val3=&key4=val4=" AS payload UNION ALL
SELECT "key3=val3=abc&key4=val4" AS payload
)
SELECT
payload,
-- \\3 puts back the & captured by the =(&) alternative; without it that & would be dropped
REGEXP_REPLACE(payload,r'(=val\pN)=(\pL*&)|=(&)|=$','\\1\\2\\3') AS payload_clean
FROM
payload_table
Of course, (=val\pN)=(\pL*&) in the pattern won't necessarily work for you, since your values probably follow different patterns. If there is no pattern to match, I'm not sure how you would remove the extra '=' from your strings automatically.
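To sanity-check the simpler pattern outside BigQuery, here is a minimal Python sketch of the same replacement. It assumes Python's re module behaves like RE2 for this particular pattern (which holds for these simple constructs, but not for every pattern), and the name remove_extra_equals is made up for illustration.

```python
import re

# Same idea as the =(&)|=$ pattern in the REGEXP_REPLACE call:
# drop an '=' that sits directly before '&' or at the very end of the string.
EXTRA_EQUALS = re.compile(r"=(&)|=$")


def remove_extra_equals(payload: str) -> str:
    """Hypothetical helper: strip a stray '=' before '&' or at end of string."""
    # \1 restores the '&' when the first alternative matched;
    # it expands to an empty string when the trailing-'=' alternative matched.
    return EXTRA_EQUALS.sub(r"\1", payload)


if __name__ == "__main__":
    samples = [
        "key1=val1&key2=val2&key3=val3=&key4=val4",
        "key5=val5&key6=val6=",
    ]
    for payload in samples:
        print(payload, "->", remove_extra_equals(payload))
```

Like the SQL version, this leaves a value such as val3=abc untouched, because that '=' is followed by neither '&' nor the end of the string.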

Related

Parsing within a field using SQL

We are receiving data in one column where further parsing is needed. In this example the separator is ~.
Goal is to grab the pass or fail value from its respective pair.
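SL | Data
----------------------------------------------------
1 | "PARAM-0040,PASS~PARAM-0045,PASS~PARAM-0070,PASS"
2 | "PARAM-0040,FAIL~PARAM-0045,FAIL~PARAM-0070,PASS"
Required outcome:
SL | PARAM-0040 | PARAM-0045 | PARAM-0070
----------------------------------------------------
1 | PASS | PASS | PASS
2 | FAIL | FAIL | PASS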
This will be a part of a bigger SQL query where we are selecting many other columns, and these three columns are to be picked up from the source as well and passed in the query as selected columns.
E.g.
Select Column1, Column2, [ Parse code ] as PARAM-0040, [ Parse code ] as PARAM-0045, [ Parse code ] as PARAM-0070, Column6 .....
Thanks
You can do that with a regular expression, but regexp functions are not standardized across SQL dialects.
This is how it is done in PostgreSQL: REGEXP_MATCHES()
https://www.postgresqltutorial.com/postgresql-regexp_matches/
In PostgreSQL, regexp_matches returns a set of text arrays (zero or more rows), so the result then has to be unpacked (thus the {} in its output).
A simpler way, also in PostgreSQL, is to use substring, which returns the part of the string matched by the parenthesized group:
substring('foobar' from 'o(.)b')
Like:
select substring('PARAM-0040,PASS~PARAM-0045,PASS~PARAM-0070,PASS' from 'PARAM-0040,([^~]+)~');
substring
-----------
PASS
(1 row)
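The capture-group idea can be tried out in Python. This is only a sketch of the technique, not PostgreSQL code: the helper name param_value is hypothetical, and the pattern here drops the trailing ~ so that the last pair in the string (which has no ~ after it) is matched too.

```python
import re
from typing import Optional


def param_value(data: str, name: str) -> Optional[str]:
    """Hypothetical helper: return the value paired with `name`, or None."""
    # (?:^|~) anchors the name at the start of a pair, so a longer name that
    # merely ends with `name` is not picked up by accident.
    pattern = r"(?:^|~)" + re.escape(name) + r",([^~]*)"
    match = re.search(pattern, data)
    return match.group(1) if match else None


if __name__ == "__main__":
    row = "PARAM-0040,FAIL~PARAM-0045,FAIL~PARAM-0070,PASS"
    for name in ("PARAM-0040", "PARAM-0045", "PARAM-0070"):
        print(name, "=", param_value(row, name))
```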
You may use the str_to_map function (available in Hive and Spark SQL; the question does not say which engine is in use) to split your data and then extract each param's value. This example first splits the param/value pairs on ~ and then splits each pair into parameter and value on ,.
Reproducible example with your sample data:
WITH my_table AS (
SELECT 1 as SL, "PARAM-0040,PASS~PARAM-0045,PASS~PARAM-0070,PASS" as DATA
UNION ALL
SELECT 2 as SL, "PARAM-0040,FAIL~PARAM-0045,FAIL~PARAM-0070,PASS" as DATA
),
param_mapped_data AS (
SELECT SL, str_to_map(DATA,"~",",") param_map FROM my_table
)
SELECT
SL,
param_map['PARAM-0040'] AS PARAM0040,
param_map['PARAM-0045'] AS PARAM0045,
param_map['PARAM-0070'] AS PARAM0070
FROM
param_mapped_data
Actual code assuming your table is named my_table
WITH param_mapped_data AS (
SELECT SL, str_to_map(DATA,"~",",") param_map FROM my_table
)
SELECT
SL,
param_map['PARAM-0040'] AS PARAM0040,
param_map['PARAM-0045'] AS PARAM0045,
param_map['PARAM-0070'] AS PARAM0070
FROM
param_mapped_data
Outputs:
sl | param0040 | param0045 | param0070
----------------------------------------------------
1 | PASS | PASS | PASS
2 | FAIL | FAIL | PASS
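For readers without access to an engine that has str_to_map, the two-level split it performs can be sketched in Python. The function name str_to_map_sketch and its defaults are assumptions chosen to mirror the call above; this is not the engine's implementation.

```python
from typing import Dict


def str_to_map_sketch(text: str, pair_delim: str = "~", kv_delim: str = ",") -> Dict[str, str]:
    """Hypothetical stand-in for str_to_map(text, pair_delim, kv_delim)."""
    result: Dict[str, str] = {}
    for pair in text.split(pair_delim):
        # partition splits on the first key/value delimiter only,
        # so a value that itself contains the delimiter stays intact.
        key, _, value = pair.partition(kv_delim)
        result[key] = value
    return result


if __name__ == "__main__":
    param_map = str_to_map_sketch("PARAM-0040,FAIL~PARAM-0045,FAIL~PARAM-0070,PASS")
    print(param_map["PARAM-0040"], param_map["PARAM-0045"], param_map["PARAM-0070"])
```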

Evaluating a variable using the IN() Function

I'm trying to resolve a datastep variable in the in() function. I have a dataset that looks like the following:
|Run|Sample Level|Samples Tested|
| 1 | 1 | 1-5 |
| 1 | 2 | 1-5 |
...etc
| 1 | 5 | 1-5 |
---------------------------------
| 2 | 1 | 1-4 |
| 2 | 2 | 1-4 |
The samples tested vary by run. Normally the only sample levels in the dataset are the ones in the range provided by "Samples Tested". However occasionally this is not the case, and it can get messy. For example the last one I worked on looked like this:
|Run|Sample Level|Samples Tested|
| 1 | 1 |2-9, 12-35, 37-40|
In this case I'd want to drop all rows with sample levels that were not included in Samples Tested, which I did by manually adding the code:
Data Want;
set Have;
if sample_level not in (2:9, 12:35, 37:40) then delete;
run;
But what I want to do is have this done automatically by looking at the samples tested column. It's easy enough to turn a "-" into a ":", but where I'm stuck is getting the IN() function to recognize or resolve a variable. I would like code that looks like this: if sample_level not in(Samples_Tested) then delete; where samples_tested has been transformed to be something that the IN() function can handle. I'm also not opposed to using proc sql; if anyone has a solution that they think will work. I know you can do things like
Proc sql; Create table want as select * from HAVE where Sample_Level in (Select Samples_Tested from Have); Quit;
But the problem is that the samples tested varies by run and there could be 16 different runs. Hopefully I've explained the challenge clearly enough. Thanks for taking the time to read this and thanks in advance for your help!
Assuming the value of SAMPLES_TESTED is constant for each value of RUN, you could use it to generate the selection criteria. For example, you could use a data _null_ step to write a WHERE statement to a file and then %include that code into another data step.
filename code temp;
data _null_;
file code;
if eof then put ';';
set have end=eof;
by run;
if first.run;
if _n_=1 then put 'where ' @ ;
else put ' or ' @ ;
samples_tested=translate(samples_tested,':','-');
put '(' run= 'and sample_level in (' samples_tested '))';
run;
data want;
set have;
%include code;
run;
Note: IN is an operator and not a function.
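The underlying logic (turn the Samples Tested text into ranges, then keep a row only if its sample level falls inside one of them) can be sketched outside SAS. The Python below is an illustration only: the function names are hypothetical, and it assumes the text always has the form "a-b, c-d, ..." with optional single values.

```python
from typing import List, Tuple


def parse_ranges(spec: str) -> List[Tuple[int, int]]:
    """Turn '2-9, 12-35, 37-40' into [(2, 9), (12, 35), (37, 40)]."""
    ranges = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        low, sep, high = part.partition("-")
        # A bare number such as '7' is treated as the one-element range 7-7.
        ranges.append((int(low), int(high) if sep else int(low)))
    return ranges


def is_tested(sample_level: int, spec: str) -> bool:
    """True when sample_level falls inside any range listed in spec."""
    return any(low <= sample_level <= high for low, high in parse_ranges(spec))


if __name__ == "__main__":
    rows = [(1, 1, "2-9, 12-35, 37-40"), (1, 12, "2-9, 12-35, 37-40")]
    kept = [row for row in rows if is_tested(row[1], row[2])]
    print(kept)
```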
Good to see SAS code ;-)
That would work with one range:
select * from HAVE where level in (tested);
For multiple ranges I would use SUBSTRING_INDEX in MySQL, or just a combination of SUBSTRING and INDEX, to find the next condition.
select * from HAVE where level in (tested1) or level in (tested2) or level in (tested3);
Where you replace tested1, for example, with substr(tested, 1, index(tested, ',') - 1)
I used the following to generate sample:
create table have
(run int,
level int,
tested varchar(20));
INSERT INTO have (run, level, tested)
VALUES (1, 1, "3-5");
INSERT INTO have (run, level, tested)
VALUES (1, 3, "3-5, 12:35");
INSERT INTO have (run, level, tested)
VALUES (1, 20, "3-5, 12-35");

How to read and modify the data in oracle database (instead of using Replace function)

I started learning and working with Oracle SQL a few months ago, and I have a question for which I could not find a similar problem on Stack Overflow.
In Oracle SQL,
I am trying to find a way to read the data from a column and modify it (add to or remove part of the value). What I have so far uses REPLACE, but I do not want to chain multiple REPLACE calls to make it work. In case my question is unclear, I have listed below what I have so far, which uses multiple REPLACE calls.
COMMOD_CODE (Given) | MODEL(Desired_result)
|
X2-10GB-LR | X2-10GB-LR (same)
15454-OSC-CSM | 15454-OSC
15454-PP64LC | 15454-PP_64-LC
CAT3550 | WS-C3550-48-SMI
CAT3560G-48 | WS-C3560G-48PS-S
CAT3550 | WS-C3550-48-SMI
DWDM-GBIC-30 | DWDM-GBIC-30.33
Select
REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(commod.COMMODITY_CODE,
'15454-OSC-CSM', '15454-OSC'),
'15454-PP64LC','15454-PP_64-LC'),
'CAT3550','WS-C3550-48-SMI'),
'CAT3560G-48','WS-C3560G-48PS-S'),
'CAT3550','WS-C3550-48-SMI'),
'DWDM-GBIC-30','DWDM-GBIC-30.33')
MODEL,
NVL(commod.COMMODITY_CODE, ' ') as COMMOD_CODE
FROM tablename.table commod
I got the right answer. However, I think I used REPLACE far too many times. So my question is whether there is an easier way to do this than nesting REPLACE repeatedly, which makes the script look awful.
Is someone able to give me some guidance, please?
Thanks in advance,
Use DECODE or CASE for this, I think. Or, better yet, maybe a mapping table.
You can use the DECODE function in this case:
with
test_data as (
select '15454-OSC-CSM' as COMMODITY_CODE from dual
union all select '15454-PP64LC' from dual
union all select 'CAT3550' from dual
union all select 'CAT3560G-48' from dual
union all select 'CAT3550' from dual
union all select 'DWDM-GBIC-30' from dual
)
select
decode(COMMODITY_CODE,
'15454-OSC-CSM', '15454-OSC',
'15454-PP64LC', '15454-PP_64-LC',
'CAT3550', 'WS-C3550-48-SMI',
'CAT3560G-48', 'WS-C3560G-48PS-S',
'CAT3550', 'WS-C3550-48-SMI',
'DWDM-GBIC-30', 'DWDM-GBIC-30.33')
from test_Data
;
Result:
COL
------------------
15454-OSC
15454-PP_64-LC
WS-C3550-48-SMI
WS-C3560G-48PS-S
WS-C3550-48-SMI
DWDM-GBIC-30.33
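What the DECODE function does: it checks its first argument; if it is equal to the second argument, it returns the third argument; otherwise, if it is equal to the 4th argument, it returns the 5th argument, and so on. If nothing matches and no default is supplied as a final argument, it returns NULL, so pass COMMODITY_CODE as the last argument if unmatched codes such as X2-10GB-LR should come back unchanged.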
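The mapping-table suggestion boils down to a lookup with a fallback to the original value. As a rough illustration of that behaviour (not Oracle code), here is a Python sketch; the names MODEL_BY_CODE and to_model are hypothetical, and the pairs are the ones from the question.

```python
# Lookup table built from the code/model pairs listed in the question.
MODEL_BY_CODE = {
    "15454-OSC-CSM": "15454-OSC",
    "15454-PP64LC": "15454-PP_64-LC",
    "CAT3550": "WS-C3550-48-SMI",
    "CAT3560G-48": "WS-C3560G-48PS-S",
    "DWDM-GBIC-30": "DWDM-GBIC-30.33",
}


def to_model(commodity_code: str) -> str:
    """Return the mapped model, or the code itself when there is no mapping."""
    # The second argument to get() plays the role of a DECODE default.
    return MODEL_BY_CODE.get(commodity_code, commodity_code)


if __name__ == "__main__":
    for code in ("X2-10GB-LR", "15454-OSC-CSM", "CAT3550"):
        print(code, "->", to_model(code))
```

In the database, the same shape would be a two-column mapping table joined to the source table, with the original code used whenever the join finds no row.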

How to cast float to string with no exponents in BigQuery

I have float data in a BigQuery table like 5302014.2 and 5102014.4.
I'd like to run a BigQuery SQL that returns the values in String format, but the following SQL yields this result:
select a, string(a) from my_table
5302014.2 "5.30201e+06"
5102014.4 "5.10201e+06"
How can I rewrite my SQL to return:
5302014.2 "5302014.2"
5102014.4 "5102014.4"
Standard SQL doesn't have this problem:
$ bq query '#standardSQL
SELECT a, CAST(a AS STRING) AS a_str FROM UNNEST(ARRAY[530201111114.2, 5302014.4]) a'
+-------------------+----------------+
| a | a_str |
+-------------------+----------------+
| 5302014.4 | 5302014.4 |
| 5.302011111142E11 | 530201111114.2 |
+-------------------+----------------+
SELECT STRING(INTEGER(f)) + '.' + SUBSTR(STRING(f-INTEGER(f)), 3)
FROM (SELECT 5302014.5642 f)
(not a nice hack, but a better method would be a great feature request to post at https://code.google.com/p/google-bigquery/issues/list?can=2&q=label%3DFeature-Request)
Converting your legacy SQL to standard SQL is really the best way forward as far as working with GBQ is concerned. Standard SQL is much faster and has a much better implementation of features.
For your use case, going with standard SQL and CAST(a AS STRING) would be best.

Searching a column containing CSV data in a MySQL table for existence of input values

I have a table say, ITEM, in MySQL that stores data as follows:
ID FEATURES
--------------------
1 AB,CD,EF,XY
2 PQ,AC,A3,B3
3 AB,CDE
4 AB1,BC3
--------------------
As input, I will get a CSV string, something like "AB,PQ". I want to get the records that contain AB or PQ. I realized that we have to write a MySQL function to achieve this. So, if we had this magical function MATCH_ANY defined in MySQL, I would then simply execute SQL as follows:
select * from ITEM where MATCH_ANY(FEATURES, "AB,PQ") = 0
The above query would return the records 1, 2 and 3.
But I'm running into all sorts of problems while implementing this function, as I realized that MySQL doesn't support arrays and there's no simple way to split strings on a delimiter.
Remodeling the table is the last option for me, as it involves a lot of issues.
I might also want to execute queries containing multiple MATCH_ANY functions such as:
select * from ITEM where MATCH_ANY(FEATURES, "AB,PQ") = 0 and MATCH_ANY(FEATURES, "CDE")
In the above case, we would get an intersection of records (1, 2, 3) and (3) which would be just 3.
Any help is deeply appreciated.
Thanks
First of all, the database should of course not contain comma separated values, but you are hopefully aware of this already. If the table was normalised, you could easily get the items using a query like:
select distinct i.Itemid
from Item i
inner join ItemFeature f on f.ItemId = i.ItemId
where f.Feature in ('AB', 'PQ')
You can match the strings in the comma separated values, but it's not very efficient:
select Id
from Item
where
instr(concat(',', Features, ','), ',AB,') <> 0 or
instr(concat(',', Features, ','), ',PQ,') <> 0
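For all you REGEXP lovers out there, I thought I would add this as a solution. Note the parentheses around the alternation: without them each word boundary applies to only one alternative, so AB1 would match as well. On MySQL 8.0.4 and later, which use the ICU regex library, write \\b in place of [[:<:]] and [[:>:]].
SELECT * FROM ITEM WHERE FEATURES REGEXP '[[:<:]](AB|PQ)[[:>:]]';
and for case sensitivity:
SELECT * FROM ITEM WHERE FEATURES REGEXP BINARY '[[:<:]](AB|PQ)[[:>:]]';
For the second query:
SELECT * FROM ITEM WHERE FEATURES REGEXP '[[:<:]](AB|PQ)[[:>:]]' AND FEATURES REGEXP '[[:<:]]CDE[[:>:]]';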
Cheers!
select *
from ITEM
where CONCAT(',',FEATURES,',') LIKE '%,AB,%'
or CONCAT(',',FEATURES,',') LIKE '%,PQ,%'
or create a custom function to do your MATCH_ANY
Alternatively, consider using RLIKE (in MySQL, + is numeric addition, so the string has to be built with CONCAT):
select *
from ITEM
where CONCAT(',',FEATURES,',') RLIKE ',AB,|,PQ,';
Just a thought:
Does it have to be done in SQL? This is the kind of thing you might normally expect to write in PHP or Python or whatever language you're using to interface with the database.
This approach means you can build your query string using whatever complex logic you need and then just submit a vanilla SQL query, rather than trying to build a procedure in SQL.
Ben
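As a rough sketch of that build-the-query-in-the-application idea, the Python below turns the input CSV into a parameterized MySQL query. It relies on MySQL's FIND_IN_SET function for the per-value test. The function name build_match_any_query is hypothetical, and the %s placeholders assume a DB-API driver that uses that parameter style, as the common MySQL drivers do.

```python
from typing import List, Tuple


def build_match_any_query(csv_values: str) -> Tuple[str, List[str]]:
    """Build a query matching rows whose FEATURES list contains any given value."""
    values = [value.strip() for value in csv_values.split(",") if value.strip()]
    if not values:
        raise ValueError("expected at least one value")
    # One FIND_IN_SET test per value; the values themselves travel as
    # parameters, so they are never spliced into the SQL text.
    condition = " OR ".join(["FIND_IN_SET(%s, FEATURES) > 0"] * len(values))
    return "SELECT * FROM ITEM WHERE " + condition, values


if __name__ == "__main__":
    sql, params = build_match_any_query("AB,PQ")
    print(sql)
    print(params)
    # With an open connection this would be run as: cursor.execute(sql, params)
```

For the intersection case in the question, the same helper could be called once per MATCH_ANY and the resulting conditions combined with AND.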