BigQuery UDF to remove accents/diacritics in a string

Using this JavaScript code we can remove accents/diacritics in a string:
var originalText = "éàçèñ";
var result = originalText.normalize('NFD').replace(/[\u0300-\u036f]/g, "");
console.log(result); // eacen
If we create a BigQuery UDF from it, it does not work (even with doubled backslashes, \\u0300):
CREATE OR REPLACE FUNCTION project.remove_accent(x STRING)
RETURNS STRING
LANGUAGE js AS """
return x.normalize("NFD").replace(/[\u0300-\u036f]/g, "");
""";
SELECT project.remove_accent("éàçèñ") -- returns "éàçèñ" (accents not removed)
Any thoughts on that?

Consider below approach
select originalText,
  regexp_replace(normalize(originalText, NFD), r"\pM", '') output
If applied to the sample data in your question, the output is:
originalText	output
éàçèñ	eacen
You can easily wrap it with a SQL UDF if you wish, as shown below.
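For example, a minimal sketch of such a wrapper, reusing the project.remove_accent name from the question:
CREATE OR REPLACE FUNCTION project.remove_accent(x STRING)
RETURNS STRING
AS (
  REGEXP_REPLACE(NORMALIZE(x, NFD), r"\pM", '')
);
SELECT project.remove_accent("éàçèñ") -- "eacen"
Because this is a pure SQL UDF, there is no JavaScript escaping to worry about.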

Related

Split by delimiter which is contained in a record

I have a column which I am splitting in Snowflake.
The format is as follows (this is the sample string used in the query below):
Car > Bike,Bike > Scooter,Scooter > Sprinting, Jogging and Walking,Walking > Flying
I have been using split_to_table(A, ',') inside my query, but as you can probably tell this incorrectly also splits the Scooter > Sprinting, Jogging and Walking record.
Perhaps the delimiter should only apply if there is no space on either side of it? I cannot see a different condition that could work.
I have been researching online but haven't found a suitable workaround yet; has anyone encountered a similar problem in the past?
Thanks
This is a custom rule for the split, so we can use a UDTF to apply it:
create or replace function split_to_table2(STR string, DELIM string, ROW_MUST_CONTAIN string)
returns table (VALUE string)
language javascript
strict immutable
as
$$
{
  initialize: function (argumentInfo, context) {
  },
  processRow: function (row, rowWriter, context) {
    var buffer = "";
    var i;
    const s = row.STR.split(row.DELIM);
    for (i = 0; i < s.length - 1; i++) {
      buffer += s[i];
      // only emit a row once the next segment contains the required string;
      // otherwise re-attach the delimiter and keep accumulating
      if (s[i + 1].includes(row.ROW_MUST_CONTAIN)) {
        rowWriter.writeRow({VALUE: buffer});
        buffer = "";
      } else {
        buffer += row.DELIM;
      }
    }
    // flush whatever is still buffered plus the final segment
    rowWriter.writeRow({VALUE: buffer + s[i]});
  },
}
$$;
select VALUE from
table(split_to_table2('Car > Bike,Bike > Scooter,Scooter > Sprinting, Jogging and Walking,Walking > Flying', ',', '>'))
;
Output:
VALUE
Car > Bike
Bike > Scooter
Scooter > Sprinting, Jogging and Walking
Walking > Flying
This UDTF adds one more parameter than the two in the built-in table function split_to_table. The third parameter, ROW_MUST_CONTAIN, is the string a row must contain. It splits the string on DELIM, but if a piece does not contain the ROW_MUST_CONTAIN string, it concatenates the pieces to form a complete string for a row. In this case we just specify , for the delimiter and > for ROW_MUST_CONTAIN.
We can get a little clever with regexp_replace by replacing the actual delimiters with something else before the table split. I am using double pipes '||', but you can change that to something else. The '\|\|\\1' trick is called back-referencing, which allows us to include the captured group (\\1) as part of the replacement (\|\|):
set str='car>bike,bike>car,truck, and jeep,horse>cat,truck>car,truck, and jeep';
select $str, *
from table(split_to_table(regexp_replace($str,',([^>,]+>)','\|\|\\1'),'||'))
Yes, you are right. The only pattern I can see is the whitespace after the comma.
It's a small workaround, but we can make use of this pattern. In the code below I replace the commas that are followed by whitespace, then apply the split-to-table function, and then convert the replacement back.
It's not super pretty and would break if your string contains "my_replacement" or some other new pattern, but it's working for me:
select replace(t.value, 'my_replacement', ', ')
from table(
split_to_table(replace('Car > Bike,Bike > Scooter,Scooter > Sprinting, Jogging and Walking,Walking > Flying', ', ', 'my_replacement'),',')) t

Dynamic variable in Snowflake to pass in to a stage

I have a query that selects from an S3 bucket, but the value I need to select from changes each quarter - indicated below by {}. Is there any way in Snowflake I can write logic to always use the most recent quarter?
Select $1:date
from
'@lake.lake./s3key/{variable}/data.json.gzip';
I would want variable = 2021Q3, and then next quarter 2022Q1, etc.
Is this possible? Or will I have to get Python involved?
I tried to use IDENTIFIER for stages, but it does not seem to work.
I guess you can use a Stored Procedure to achieve this:
create or replace procedure get_stage_data(quarter varchar)
returns string
language javascript
as
$$
// bound arguments are exposed in uppercase inside the JavaScript body
var query = "select $1:date from '@lake.lake./s3key/" + QUARTER + "/data.json.gzip'";
var stmt = snowflake.createStatement({sqlText: query});
var res = stmt.execute();
var retVal = '';
while (res.next()) {
  retVal += res.getColumnValue(1) + "\n";
}
return retVal;
$$;
Maybe write the data to a table so you can analyze it later.
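A quick usage sketch, assuming the procedure compiles as above (the quarter value comes from the question):
call get_stage_data('2021Q3');
The procedure returns the dates as a single newline-separated string, which is fine for a smoke test but another reason the write-to-a-table route may be more practical.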

DBArrayList to List<Map> Conversion after Query

Currently, I have a SQL query that returns information to me in a DBArrayList.
It returns data in this format: [{id=2kjhjlkerjlkdsf324523}]
For the next step, I need it to be in a List<Map> format without the id: [2kjhjlkerjlkdsf324523]
The datatypes being used are DBArrayList and List.
If it helps any, the next step is a function to collect the list and then to replace all single quotes, if any [SQL-injection prevention]. Using:
listMap = listMap.collect() { "'" + Util.removeSingleQuotes(it) + "'" }
public static String removeSingleQuotes(s) {
  return s ? s.replaceAll(/'"/, '') : s
}
I spent this morning working on it, and I found out that I needed to actually collect the DBArrayList like this:
listMap = dbArrayList.collect { it.getAt('id') }
If you're in a bind like I was and constrained to a specific schema this might help, but @ou_ryperd has the correct answer!
While using a DBArrayList is not wrong, Groovy's idiom is to use the db result as a collection. I would suggest you use it that way directly from the db:
Map myMap = [:]
// iterate over the result set directly; each row is passed to the closure
dbhandle.eachRow("select fieldSomeID, fieldSomeVal from yourTable;") { row ->
  myMap[row.fieldSomeID] = row.fieldSomeVal.replaceAll(/'"/, '')
}

Lodash: Regex pattern stored in a variable

var s="Fred";
_.replace('Hi Fred', s, 'Barney');
Result : "Hi Barney"
I want to know how to use replace function when regex pattern is stored in a variable.
var s="Fred";
_.replace('Hi Fred', /s/, 'Barney');
Result : "Hi Fred"
This question isn't specific to lodash, really. You just need to create the regex with the RegExp constructor instead of the literal syntax:
var s="Fred";
var r=new RegExp(s);
var result = _.replace('Hi Fred', r, 'Barney');
console.log(result);
// "Hi Barney"
Check here for more help:
How do you use a variable in a regular expression?
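One caveat worth noting: if the variable may contain regex metacharacters (., *, +, ? and so on), escape it before building the pattern, for example with lodash's _.escapeRegExp. A minimal sketch of that variant ("Fred?" is an invented input for illustration):
var s = "Fred?";
// _.escapeRegExp backslash-escapes regex metacharacters so the
// variable's text is matched literally
var r = new RegExp(_.escapeRegExp(s));
console.log(_.replace('Hi Fred?', r, 'Barney'));
// "Hi Barney"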

JSON_EXTRACT in BigQuery Standard SQL?

I'm converting some SQL code from BigQuery's legacy SQL to BigQuery Standard SQL.
I can't seem to find JSON_EXTRACT_SCALAR in BigQuery Standard SQL; is there an equivalent?
Edit: we implemented the JSON functions a while back. You can read about them in the documentation.
Not that I know of, but there is always a workaround.
Let's assume we want to mimic the example from the JSON_EXTRACT_SCALAR documentation:
SELECT JSON_EXTRACT_SCALAR('{"a": ["x", {"b":3}]}', '$.a[1].b') as str
The code below does the same:
CREATE TEMPORARY FUNCTION CUSTOM_JSON_EXTRACT(json STRING)
RETURNS STRING
LANGUAGE js AS """
try {
  var parsed = JSON.parse(json);
} catch (e) {
  return null;
}
return parsed.a[1].b;
""";
SELECT CUSTOM_JSON_EXTRACT('{"a": ["x", {"b":3}]}') AS str
I think this can be a good starting point to experiment with.
See more about scalar UDFs in BigQuery Standard SQL.
Quick update
After a cup of coffee, I decided to complete this "exercise" myself.
Looks like a good short-term solution to me :o)
CREATE TEMPORARY FUNCTION CUSTOM_JSON_EXTRACT(json STRING, json_path STRING)
RETURNS STRING
LANGUAGE js AS """
try {
  var parsed = JSON.parse(json);
} catch (e) {
  return null;
}
return eval(json_path.replace("$", "parsed"));
""";
SELECT
CUSTOM_JSON_EXTRACT('{"a": ["x", {"b":3}]}', '$.a[1].b') AS str1,
CUSTOM_JSON_EXTRACT('{"a": ["x", {"b":3}]}', '$.a[0]') AS str2,
CUSTOM_JSON_EXTRACT('{"a": 1, "b": [4, 5]}', '$.b') AS str3