Parse a string BLOB as JSON, modify a few fields, and serialize it back to a string in BigQuery - sql

So I have a column which contains JSONs as string BLOBs. For example,
{
  "a": "a-1",
  "b": "b-1",
  "c": {
    "c-1": "c-1-1",
    "c-2": "c-2-1"
  },
  "d": [
    { "k": "v1" },
    { "k": "v2" },
    { "k": "v3" }
  ]
}
Now my use case is to parse out the key b, hash its value, assign the hash back to key b, and store the whole document back as a string in BigQuery.
I initially tried a lazy approach where I extract only the key b using the JSON_EXTRACT_SCALAR function in BigQuery, and for the other keys (like c and d, which I don't want to modify) I use the JSON_EXTRACT function. Then I convert back to a string after hashing the value of key b. Here is the query:
SELECT
  TO_JSON_STRING(
    STRUCT(
      JSON_EXTRACT(COL_NAME, "$.a") AS a,
      MD5(JSON_EXTRACT_SCALAR(COL_NAME, "$.b")) AS b,
      JSON_EXTRACT(COL_NAME, "$.c") AS c,
      JSON_EXTRACT(COL_NAME, "$.d") AS d))
FROM
  dataset.TABLE
But the issue with this query is that the JSON objects get converted to strings and the double quotes get escaped by TO_JSON_STRING (I tried CAST AS STRING on top of the STRUCT, but that isn't supported). For example, the output row for this query looks like this:
{
  "a": "a-1",
  "b": "b-1",
  "c": "{
    \"c-1\": \"c-1-1\",
    \"c-2\": \"c-2-1\"
  }",
  "d": "[
    { \"k\": \"v1\" },
    { \"k\": \"v2\" },
    { \"k\": \"v3\" }
  ]"
}
I can achieve the required output if I use the JSON_EXTRACT and JSON_EXTRACT_SCALAR functions on every key (and on every nested key), but this approach isn't scalable: I have close to 200 keys, and many of them are nested 2-3 levels deep.
Can anyone suggest a better approach to achieving this? TIA

This should work
declare _json string default """{
  "a": "a-1",
  "b": "b-1",
  "c": {
    "c-1": "c-1-1",
    "c-2": "c-2-1"
  },
  "d": [
    { "k": "v1" },
    { "k": "v2" },
    { "k": "v3" }
  ]
}""";
SELECT REGEXP_REPLACE(_json, r'"b": "(\w+)-(\w+)"', CONCAT('"b":', TO_JSON_STRING(MD5(JSON_EXTRACT_SCALAR(_json, "$.b")))))
output
{
  "a": "a-1",
  "b": "sdENsgFsL4PBOyX8sXDN6w==",
  "c": {
    "c-1": "c-1-1",
    "c-2": "c-2-1"
  },
  "d": [
    { "k": "v1" },
    { "k": "v2" },
    { "k": "v3" }
  ]
}
If you need a more specific regex, please share more example values for b.
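A regex-free alternative, sketched on the assumption that BigQuery's native JSON type and its mutation functions (PARSE_JSON, JSON_SET, JSON_VALUE) are available in your project: parse the blob, overwrite $.b with the hash, and serialize back.
-- Sketch: hash only $.b, leaving the other ~200 keys untouched as real JSON.
-- TO_BASE64(MD5(...)) yields the same base64 text as the output above.
SELECT
  TO_JSON_STRING(
    JSON_SET(
      PARSE_JSON(COL_NAME),
      '$.b',
      TO_BASE64(MD5(JSON_VALUE(COL_NAME, '$.b')))))
FROM
  dataset.TABLE
Because c and d stay real JSON objects here, nothing gets re-escaped.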

Related

Postgres 12.9 - Getting the value of a json array on matching key

I have a column in my database that stores a JSON string listing the weapons used in each game activity. I need to return the 'values'->'uniqueWeaponKills'->'basic'->>'value' when the 'referenceId' key = 1994645182, and 0 if that key/value pair is not in the column.
Example 'weapons' column data
{
  "weapons": [
    {
      "values": {
        "uniqueWeaponKills": {
          "basic": { "value": 14, "displayValue": "14" }
        },
        "uniqueWeaponPrecisionKills": {
          "basic": { "value": 0, "displayValue": "0" }
        },
        "uniqueWeaponKillsPrecisionKills": {
          "basic": { "value": 0, "displayValue": "0%" }
        }
      },
      "referenceId": 1994645182
    },
    {
      "values": {
        "uniqueWeaponKills": {
          "basic": { "value": 2, "displayValue": "2" }
        },
        "uniqueWeaponPrecisionKills": {
          "basic": { "value": 1, "displayValue": "1" }
        },
        "uniqueWeaponKillsPrecisionKills": {
          "basic": { "value": 0.5, "displayValue": "50%" }
        }
      },
      "referenceId": 1853180924
    }
  ]
}
Edit 1:
Using the suggestion from Kendle I got to the following query, but I haven't found a way to look at each of the array elements dynamically instead of having to specify which one to look at.
Query
select weapons::json->'weapons'->1->'values'->'uniqueWeaponKills'->'basic'->>'value' as "uniqueWeaponKills",
weapons::json->'weapons'->1->'referenceId' as "weaponId"
from activities
where (weapons::json->'weapons'->1->>'referenceId')::BIGINT = 1687353095;
You could try
SELECT
  weapons::json->'values'->'uniqueWeaponKills'->'basic'->>'value'
FROM table_name
WHERE
  (weapons::json->>'referenceId')::BIGINT = 1994645182;
See also How to parse JSON in postgresql
I think I found the solution I was looking for, using json_array_elements():
SELECT obj->'values'->'uniqueWeaponKills'->'basic'->>'value' as "uniqueWeaponKills"
FROM activities a, json_array_elements(a.weapons#>'{weapons}') obj
WHERE (obj->>'referenceId')::BIGINT = 1687353095;
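If you also need the 0 default from the original question when no array element matches, a sketch using LEFT JOIN LATERAL plus COALESCE (assuming the weapons column is json, as in the query above):
SELECT COALESCE(
         (obj->'values'->'uniqueWeaponKills'->'basic'->>'value')::numeric,
         0) AS "uniqueWeaponKills"
FROM activities a
LEFT JOIN LATERAL json_array_elements(a.weapons#>'{weapons}') obj
  ON (obj->>'referenceId')::BIGINT = 1687353095;
The LEFT JOIN keeps activities rows with no matching referenceId, and COALESCE turns the resulting NULL into 0.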

PySpark / Spark SQL DataFrame - Error while parsing Struct Type when data is null

I am trying to parse a JSON file and selectively read only 50+ data elements (out of 800+) into a DataFrame in PySpark. One of the data elements (issues.customfield_666) is a Struct Type (with 3 fields, Id/Name/Tag, under it). Sometimes the data in this struct field comes as null. When that happens, the Spark job execution fails with the below error. How can I ignore/suppress this error for null values?
The error happens only when parsing JSON file #1 (where customfield_666 comes as null).
AnalysisException: Can't extract value from issues.customfield_666: need struct type but got string
JSON File 1 (Where customfield_666 has only null)
{
  "startAt": 0,
  "total": 1,
  "issues": [
    {
      "id": "1",
      "key": "BSE-444",
      "issuetype": {
        "id": "30",
        "name": "Epic1"
      },
      "customfield_666": null
    }
  ]
}
JSON File 2 (Where customfield_666 has both null and struct values)
{
  "startAt": 0,
  "total": 2,
  "issues": [
    {
      "id": "1",
      "key": "BSE-444",
      "issuetype": {
        "id": "30",
        "name": "Epic1"
      },
      "customfield_666": null
    },
    {
      "id": "2",
      "key": "BSE-555",
      "issuetype": {
        "id": "40",
        "name": "Epic2"
      },
      "customfield_666": {
        "tag": "Smoke Testing",
        "id": "666-01"
      }
    }
  ]
}
Below is the PySpark code used to parse the above JSON data.
from pyspark.sql.functions import col, explode

rawDF = spark.read.json(
    "abfss://users@mydlsgen2rk.dfs.core.windows.net/raw/MyData.json",
    multiLine=True)
DF = rawDF.select(explode("issues").alias("issues")) \
    .select(
        col("issues.id").alias("IssueId"),
        col("issues.key").alias("IssueKey"),
        col("issues.fields").alias("IssueFields"),
        col("issues.issuetype.name").alias("IssueTypeName"),
        col("issues.customfield_666.tag").alias("IssueCust666Tag")
    )
You could check whether it is null first:
from pyspark.sql import functions as F
DF = rawDF.select(F.explode("issues").alias("issues")) \
.select(
F.col("issues.id").alias("IssueId"),
F.col("issues.key").alias("IssueKey"),
F.col("issues.fields").alias("IssueFields"),
F.col("issues.issuetype.name").alias("IssueTypeName"),
F.when(
F.col("issues.customfield_666").isNull() | (F.trim(F.col("issues.customfield_666").cast("string"))==""), None
).otherwise(
F.col("issues.customfield_666.tag")
).alias("IssueCust666Tag")
)
Let me know if this works for you
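If the AnalysisException still occurs, it is usually because Spark inferred customfield_666 as a plain string for file #1 (it contains only nulls), so customfield_666.tag cannot be resolved even inside when(). A sketch of the schema-first alternative; the field list is an assumption based on the sample JSON above:
from pyspark.sql.types import (ArrayType, LongType, StringType,
                               StructField, StructType)

# Declare customfield_666 as a struct up front, so schema inference
# can never degrade it to a string when a file holds only nulls.
schema = StructType([
    StructField("startAt", LongType()),
    StructField("total", LongType()),
    StructField("issues", ArrayType(StructType([
        StructField("id", StringType()),
        StructField("key", StringType()),
        StructField("issuetype", StructType([
            StructField("id", StringType()),
            StructField("name", StringType()),
        ])),
        StructField("customfield_666", StructType([
            StructField("id", StringType()),
            StructField("name", StringType()),
            StructField("tag", StringType()),
        ])),
    ]))),
])

rawDF = spark.read.schema(schema).json(
    "abfss://users@mydlsgen2rk.dfs.core.windows.net/raw/MyData.json",
    multiLine=True)
With the explicit schema, customfield_666.tag resolves for both files and simply yields null where the struct is null.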

Karate remove null value and the key from nested array list returned from DB [duplicate]

This question already has an answer here:
Comparing two JSON arrays in Karate DSL
(1 answer)
Closed 1 year ago.
For example, I get the below list when querying the DB:
[
  {
    "a": "123",
    "b": "CAT"
  },
  {
    "a": "456",
    "b": null
  },
  {
    "a": "789",
    "b": "DOG"
  },
  {
    "a": "134",
    "b": null
  }
]
I want to remove the key and value when the value is null
Expected
[
  {
    "a": "123",
    "b": "CAT"
  },
  {
    "a": "456"
  },
  {
    "a": "789",
    "b": "DOG"
  },
  {
    "a": "134"
  }
]
Please can someone help with this request? The reason I want to remove the nulls is that the response coming from the API ignores the nulls in the DB and looks like the expected output above, and I have to match the response against the DB. Thanks in advance, I appreciate your time.
This is exactly what karate.filter() is for; please search for it in the docs (and read them).
And in the latest version, this short-cut will also work:
* def list = [{ "a": "123", "b": "CAT" }, { "a": "456", "b": null }, { "a": "789", "b": "DOG" }, { "a": "134", "b": null }]
* def list = list.filter(x => x.b != null)
* print list
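Note that filtering removes whole entries. If you instead need to keep every entry and drop only the null-valued keys (to match the expected output above), a sketch using karate.map() with a small JS transform:
* def list = [{ "a": "123", "b": "CAT" }, { "a": "456", "b": null }, { "a": "789", "b": "DOG" }, { "a": "134", "b": null }]
* def stripNulls = function(x){ for (var k in x) if (x[k] == null) delete x[k]; return x }
* def cleaned = karate.map(list, stripNulls)
* print cleaned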

Get JSON Array from JSON Object and Count Number of Objects

I have a column that contains some data like this:
{
  "activity_goal": 200,
  "members": [
    { "json": "data" },
    { "HAHA": "HAHA" },
    { "HAHA": "HAHA" }
  ],
  "name": "Hunters Team v3",
  "total_activity": "0",
  "revenue_goal": 200,
  "total_active_days": "0",
  "total_raised": 300
}
I am using cast(team_data -> 'members' as jsonb) to get the "Members" JSON array, which gives me a column like this:
[
  { "json": "data" },
  { "HAHA": "HAHA" },
  { "HAHA": "HAHA" }
]
I am using array_length(cast(team_data -> 'members' as jsonb), 1) to pull a column with the number of Members that exist in the list. When I do this, I am given this error:
function array_length(jsonb, integer) does not exist
Note: I have also tried casting as "json" instead of "jsonb"
I am following this documentation. What am I doing wrong?
Use the JSON functions, such as json_array_length, when working with json:
select json_array_length(team_data -> 'members') from mytable
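If you prefer to keep the jsonb cast from the question, the jsonb counterpart works the same way:
select jsonb_array_length(cast(team_data -> 'members' as jsonb)) from mytable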

SQL to mongodb conversion

I have two fields in mongodb, A and B
I would like to perform the following sql query in mongo
SELECT DISTINCT A FROM table WHERE B LIKE 'asdf'
EDIT for clarification
foo = {
  bar: [
    {
      baz: [
        'one',
        'two'
      ]
    },
    { ... }
  ]
}
I would like to select distinct foo objects where bar.baz contains 'one'.
The query:
db.runCommand({
    "distinct": "foo",
    "query": {
        "bar.baz": "one"
    },
    "key": "bar.baz"
});
This query, oddly enough, returns foo objects whose bar.baz doesn't contain 'one'.
There seems to be a misunderstanding here of how the MongoDB distinct command works or indeed how any query works with arrays.
I am going to consider that you actually have documents that look something like this:
{
  "_id" : ObjectId("5398f8bf0b5d1b43d3e26816"),
  "bar" : [
    { "baz" : [ "one", "two" ] },
    { "baz" : [ "three" ] },
    { "baz" : [ "one", "four" ] }
  ]
}
So the query that you have run, and these two forms are equivalent:
db.runCommand({
    "distinct": "foo",
    "query": { "bar.baz": "one" },
    "key": "bar.baz"
})
db.foo.distinct("bar.baz", { "bar.baz": "one" })
Returns essentially this:
[ "four", "one", "three", "two" ]
Why? Well, because you asked it to. Let's consider a declarative way of describing what you actually invoked.
Your "query" essentially says 'Find me all the "documents" that have "bar.baz" equal to "one" ' then you are asking 'And return me all of the "distinct" values for "bar.baz"
So the "query" part of your statement does exactly that, and matched "documents" and not array members that match the value you specified. In the above example you are then asking for the "distinct" values of "bar.baz", which is exactly what you get, with there only being the value of "one" returned once from all of the values of "bar.baz".
So "query" statements do not "filter" array contents they just "match" where the condition exists. The above document matches the condition and "bar.baz" has a value of "one", and twice even. So selecting the distinct "foo" or basically the document is really:
db.foo.find({ "bar.baz": "one" })
Matching all documents that meet the condition. This is how embedding works, but perhaps you wanted something like filtering the results. So, looking at returning only those items of "bar" whose "baz" has a value of "one", you would do:
db.collection.aggregate([
  // Matches documents
  { "$match": { "bar.baz": "one" } },
  // Unwind to de-normalize arrays as documents
  { "$unwind": "$bar" },
  // Match to "filter" documents without "bar.baz" matching "one"
  { "$match": { "bar.baz": "one" } },
  // Maybe group back to document with the array
  { "$group": {
    "_id": "$_id",
    "bar": { "$push": "$bar" }
  }}
])
The result of this .aggregate() statement is the document without the member of "bar" that does not contain "one" under "baz":
{
  "_id" : ObjectId("5398f8bf0b5d1b43d3e26816"),
  "bar" : [
    { "baz" : [ "one", "two" ] },
    { "baz" : [ "one", "four" ] }
  ]
}
But suppose you actually want just the elements of "bar.baz" equal to "one", and the total count of those occurrences over your whole collection; then you would want to do this:
db.collection.aggregate([
  // Matches documents
  { "$match": { "bar.baz": "one" } },
  // Unwind to de-normalize arrays as documents
  { "$unwind": "$bar" },
  // And the inner array as well
  { "$unwind": "$bar.baz" },
  // Then just match and filter out everything but the matching items
  { "$match": { "bar.baz": "one" } },
  // Group to get the count
  { "$group": {
    "_id": "$bar.baz",
    "count": { "$sum": 1 }
  }}
])
And from our single document collection sample you get:
{ "_id": "one", "count": 2 }
As there are two occurrences of that matching value.
As for your SQL at the head of your question, that really doesn't apply to this sort of data. The more practical example would be something with data like this:
{ "A": "A", "B": "BASDFJJ" }
{ "A": "A", "B": "ASDFTT" }
{ "A": "B", "B": "CASDF" }
{ "A": "B", "B": "DKITB" }
So the "distinct" values of "A" where "B" is like "ASDF", again using aggregate and noting you are not wildcarding on either side:
db.foo.aggregate([
  { "$match": { "B": "ASDF" } },
  { "$group": { "_id": "$A" } }
])
Which essentially produces:
{ "_id": "A" }
Or, with wildcards on either side ("%ASDF%"), this is a $regex query to match:
db.foo.aggregate([
  { "$match": { "B": { "$regex": "ASDF" } } },
  { "$group": { "_id": "$A" } }
])
So only two results:
{ "_id": "A" }
{ "_id": "B" }
Where if you were "counting" the distinct matches then you would see 2 and 1 as the counts respectively according to the documents that matched.
Take a further look at the SQL Mapping Chart and the SQL to Aggregation Mapping Chart contained within the documentation. It should help you in understanding how common actions actually translate.