Working with url-encoded values in BigQuery - google-bigquery

I work with gzipped log files that contain URL-encoded columns (a space character is encoded as "%20", and so on).
My plan was to import these files directly from Google Cloud Storage into BigQuery.
I did not find any option in the Load config to automatically decode values during the import.
I assume you wouldn't advise using a series of REGEXP_REPLACE calls in all my queries.
Is there a way to avoid parsing all the logs and unescaping these characters before importing them into BigQuery (which would be risky if one of them turns out to be the separator)?

The accepted answer is for Legacy SQL.
For Standard SQL:
#standardSQL
CREATE TEMPORARY FUNCTION DECODE_URI_COMPONENT(path STRING)
RETURNS STRING
LANGUAGE js AS """
  if (path == null) return null;
  try {
    return decodeURIComponent(path);
  } catch (e) {
    return path;
  }
""";
WITH source AS (SELECT "/work.json?myfield=R%C3%A9gions%2CSport" AS path)
SELECT DECODE_URI_COMPONENT(REGEXP_EXTRACT(path, r"[?&]myfield=([^&]+)")) AS myfield FROM source
This returns:
myfield
---------------
Régions,Sport
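If you use this in many queries, the same JavaScript body can also be saved as a persistent UDF so the temporary function does not have to be repeated; a minimal sketch, assuming a dataset named mydataset:
CREATE OR REPLACE FUNCTION mydataset.DECODE_URI_COMPONENT(path STRING)
RETURNS STRING
LANGUAGE js AS """
  // same JavaScript body as above
  if (path == null) return null;
  try {
    return decodeURIComponent(path);
  } catch (e) {
    return path;
  }
""";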

Most likely you already ended up with something like below :o)
SELECT url FROM
js(
  (SELECT url FROM
    (SELECT 'http://example.com/query?q=my%20query%20string' AS url),
    (SELECT 'http://example.com/query?q=your%20query%20string' AS url),
    (SELECT 'http://example.com/query?q=his%20query%20string' AS url)
  ),
  // Input columns.
  url,
  // Output schema.
  "[{name: 'url', type: 'string'}]",
  // The function.
  "function(r, emit) {
    var url = decodeURI(r.url);
    emit({url: url});
  }"
)
https://cloud.google.com/bigquery/user-defined-functions

Related

Karate: Unable to replace embedded expression value inside a XML chunk read from a JS function

I am unable to replace an embedded expression value inside an XML chunk read from a JS function. I had a look at string-to-XML conversion, but I can't figure out what I am missing.
The scenario file has the following, which calls a JS function to get an XML chunk containing an embedded expression:
* def customerNumber = functions.getRandomNumber()
* xml Security1 = functions.fetchSecExistMort()
* print Security1
Below is my JavaScript function:
function() {
  return {
    fetchPrimaryResiAddress: function() {
      var PrimaryResidentialAddress =
`<Address>
<StreetNo>#(customerNumber)</StreetNo>
<Street Type="Street">RAWSON</Street>
<City>DEAKIN</City>
<State Name="ACT"/>
<Postcode>2600</Postcode>
<Country ISO3166="AU"/>
</Address>`;
      return PrimaryResidentialAddress;
    },
    getRandomNumber: function() {
      var temp = '';
      karate.repeat(14, function(){ temp += Math.floor(Math.random() * 9) + 1 });
      return temp;
    }
  }
}
In the print output, the embedded expression #(customerNumber) isn't getting replaced with the customerNumber value.
Embedded expressions will not work within JS. They are designed to work only within feature files, or when using the read() API.
If you are using JS, just do some string concatenation and move on, as in the sketch below.
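For example, a minimal sketch of the same helper using plain concatenation (passing customerNumber in as a function argument is an assumption, not something the original code does):
fetchPrimaryResiAddress: function(customerNumber) {
  // build the XML with plain string concatenation instead of an embedded expression
  return '<Address>' +
    '<StreetNo>' + customerNumber + '</StreetNo>' +
    '<Street Type="Street">RAWSON</Street>' +
    '<City>DEAKIN</City>' +
    '<State Name="ACT"/>' +
    '<Postcode>2600</Postcode>' +
    '<Country ISO3166="AU"/>' +
    '</Address>';
}
and in the feature file:
* xml Security1 = functions.fetchPrimaryResiAddress(customerNumber)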

BigQuery UDF to remove accents/diacritics in a string

Using this javascript code we can remove accents/diacritics in a string.
var originalText = "éàçèñ"
var result = originalText.normalize('NFD').replace(/[\u0300-\u036f]/g, "")
console.log(result) // eacen
If we create a BigQuery UDF, it does not work (even with a doubled backslash \\):
CREATE OR REPLACE FUNCTION project.remove_accent(x STRING)
RETURNS STRING
LANGUAGE js AS """
return x.normalize("NFD").replace(/[\u0300-\u036f]/g, "");
""";
SELECT project.remove_accent("éàçèñ") --"éàçèñ"
Any thoughts on that?
Consider below approach
select originalText,
regexp_replace(normalize(originalText, NFD), r"\pM", '') output
if applied to the sample data in your question, the output is eacen
You can easily wrap it in a SQL UDF if you wish, for example:
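A minimal sketch of such a wrapper (the name project.remove_accent_sql is just a placeholder mirroring the dataset from the question):
CREATE OR REPLACE FUNCTION project.remove_accent_sql(x STRING)
RETURNS STRING
AS (REGEXP_REPLACE(NORMALIZE(x, NFD), r"\pM", ''));
SELECT project.remove_accent_sql("éàçèñ") -- eacen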

Do strings need to be escaped inside parametrized queries?

I'm discovering Express by creating a simple CRUD without an ORM.
The issue is that I'm not able to find any record through the Model.findBy() function:
class User {
  static async findBy(payload) {
    try {
      let attr = Object.keys(payload)[0]
      let value = Object.values(payload)[0]
      let user = await pool.query(
        `SELECT * from users WHERE $1::text = $2::text LIMIT 1;`,
        [attr, value]
      );
      return user.rows; // empty :-(
    } catch (err) {
      throw err
    }
  }
}
User.findBy({ email: 'foo#bar.baz' }).then(console.log);
User.findBy({ name: 'Foo' }).then(console.log);
I've no issue using psql if I surround $2::text with single quotes ', like:
SELECT * FROM users WHERE email = 'foo#bar.baz' LIMIT 1;
Though that's not possible inside parametrized queries. I've tried stuff like '($2::text)' (and escaped variations), but that looks far from what the documentation recommends.
I must be missing something. Is the emptiness of user.rows related to the way I fetch attr & value? Or maybe, is some kind of escaping required when passing string parameters?
"Answer":
As stated in the comment section, the issue isn't related to string escaping, but to dynamic column names.
Column names are identifiers, not values, and therefore cannot be set dynamically through a query parameter. A common workaround is to validate the name against an allow-list and interpolate it, keeping only the value as a parameter, as sketched below.
See: https://stackoverflow.com/a/50813577/11509906
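A minimal sketch of that workaround, reusing the pool and users table from the question (the ALLOWED_COLUMNS allow-list is hypothetical and should list only real column names):
const ALLOWED_COLUMNS = ['email', 'name'];

class User {
  static async findBy(payload) {
    const attr = Object.keys(payload)[0];
    const value = Object.values(payload)[0];
    // Validate the identifier against the allow-list, then interpolate it;
    // only the value is sent as a query parameter.
    if (!ALLOWED_COLUMNS.includes(attr)) {
      throw new Error('Unsupported column: ' + attr);
    }
    const user = await pool.query(
      `SELECT * FROM users WHERE ${attr} = $1::text LIMIT 1;`,
      [value]
    );
    return user.rows;
  }
}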

Karate - String concatenation of JSON value with a variable [duplicate]

The embedded expressions are not replaced when appended, prepended or surrounded by characters in the following simplified and very basic scenario:
* def jobId = '0001'
* def out =
"""
{
"jobId": "#(jobId)",
"outputMetadata": {
"fileName_OK": "#(jobId)",
"fileName_Fail_1": "some_text_#(jobId)",
"fileName_Fail_2": "#(jobId)-and-some-more-text",
"fileName_Fail_3": "prepend #(jobId) and append"
}
}
"""
* print out
Executing the scenario returns:
{
"jobId": "0001",
"outputMetadata": {
"fileName_OK": "0001",
"fileName_Fail_1": "some_text_#(jobId)",
"fileName_Fail_2": "#(jobId)-and-some-more-text",
"fileName_Fail_3": "prepend #(jobId) and append"
}
}
Is it a feature, a limitation, or a bug? Or, did I miss something?
This is as designed! You can do this:
"fileName_Fail_2": "#(jobId + '-and-some-more-text')"
Any valid JS expression can be stuffed into an embedded expression, so this is not a limitation. And this works only within JSON string values, or when the entire RHS is a string within quotes; this keeps the parsing simple. Hope that helps!

Stream analytics - How to handle json in reference input

I have an Azure Stream Analytics (ASA) job which processes device telemetry data from an event hub. The stream should be joined with reference data from a SQL table, to enrich each message with additional device metadata. The merged entry should be stored in Cosmos DB.
The SQL table that serves the device metadata:
CREATE TABLE [dbo].[MyTable]
(
[DeviceId] NVARCHAR(20) NOT NULL PRIMARY KEY,
[MetaData] NVARCHAR(MAX) NULL /* this stores json, which can vary per record */
)
In ASA I have configured the reference data input with a simple query:
SELECT DeviceId, JSON_QUERY(MetaData) FROM [dbo].[MyTable]
And I have the main ASA query that performs the join:
WITH temptable AS (
SELECT * FROM [telemetry-input] TD PARTITION BY PartitionId
LEFT OUTER JOIN [metadata-input] MD
ON TD.DeviceId = MD.DeviceId
)
SELECT TD.*, MD.MetaData
INTO [cosmos-db-output]
FROM temptable PARTITION BY PartitionId
It all works and the merged data gets stored in Cosmos DB. However, the value of the MetaData column from SQL is treated as a string, and stored in Cosmos with quotes and escape chars. Example:
{ "DeviceId" : "abc1234", … , "MetaData" : "{ \"TestKey\": \"test value\" }" };
Is there a way to handle & store the JSON from MetaData as a proper JSON object, i.e.
{ "DeviceId" : "abc1234", … , "MetaData" : { "TestKey": "test value" } };
I found a way to achieve it in ASA - you need to create a JavaScript user-defined function:
function parseJson(strjson) {
  return JSON.parse(strjson);
}
And call it in your query:
...
SELECT TD.*, udf.parseJson(MD.MetaData)
...
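Put together with the join from the question, the full query might look like this (assuming the function is registered in the job as parseJson; the AS MetaData alias is added so the output column keeps its original name):
WITH temptable AS (
  SELECT * FROM [telemetry-input] TD PARTITION BY PartitionId
  LEFT OUTER JOIN [metadata-input] MD
    ON TD.DeviceId = MD.DeviceId
)
SELECT TD.*, udf.parseJson(MD.MetaData) AS MetaData
INTO [cosmos-db-output]
FROM temptable PARTITION BY PartitionId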
As you mentioned in your question, the reference JSON data is treated as a JSON string, not a JSON object. Based on my research into the Query Syntax in ASA, there is no built-in function to convert it.
However, I'd suggest using an Azure Function with a Cosmos DB Trigger to process each document as it is created. Please refer to my function code:
using System;
using System.Collections.Generic;
using Microsoft.Azure.Documents;
using Microsoft.Azure.Documents.Client;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Host;
using Newtonsoft.Json.Linq;

namespace ProcessJson
{
    public class Class1
    {
        [FunctionName("DocumentUpdates")]
        public static void Run(
            [CosmosDBTrigger(databaseName: "db", collectionName: "item",
                ConnectionStringSetting = "CosmosDBConnection",
                LeaseCollectionName = "leases",
                CreateLeaseCollectionIfNotExists = true)]
            IReadOnlyList<Document> documents,
            TraceWriter log)
        {
            log.Verbose("Start.........");
            String endpointUrl = "https://***.documents.azure.com:443/";
            String authorizationKey = "***";
            String databaseId = "db";
            String collectionId = "import";
            DocumentClient client = new DocumentClient(new Uri(endpointUrl), authorizationKey);

            for (int i = 0; i < documents.Count; i++)
            {
                Document doc = documents[i];
                // Skip documents that were already converted on a previous pass.
                bool? alreadyFormat = doc.GetPropertyValue<bool?>("alreadyFormat");
                if (alreadyFormat != true)
                {
                    String metaData = doc.GetPropertyValue<String>("MetaData");
                    JObject o = JObject.Parse(metaData);
                    doc.SetPropertyValue("MetaData", o);
                    doc.SetPropertyValue("alreadyFormat", true);
                    client.ReplaceDocumentAsync(UriFactory.CreateDocumentUri(databaseId, collectionId, doc.Id), doc);
                    log.Verbose("Update document Id " + doc.Id);
                }
            }
        }
    }
}
In addition, please refer to the case: Azure Cosmos DB SQL - how to unescape inner json property