PySpark / Spark SQL DataFrame - Error while parsing Struct Type when data is null

I am trying to parse a JSON file and selectively read only 50+ data elements (out of 800+) into a DataFrame in PySpark. One of the data elements (issues.customfield_666) is a Struct Type (with 3 fields Id/Name/Tag under it). Sometimes the data in this struct field comes as null. When that happens, the Spark job fails with the error below. How can I ignore or suppress this error for null values?
The error happens only when parsing JSON file #1 (where customfield_666 is always null).
AnalysisException: Can't extract value from issues.customfield_666: need struct type but got string
JSON File 1 (Where customfield_666 has only null)
{
"startAt": 0,
"total": 1,
"issues": [
{
"id": "1",
"key": "BSE-444",
"issuetype": {
"id": "30",
"name": "Epic1"
},
"customfield_666": null
}
]
}
JSON File 2 (Where customfield_666 has both null and struct values)
{
"startAt": 0,
"total": 2,
"issues": [
{
"id": "1",
"key": "BSE-444",
"issuetype": {
"id": "30",
"name": "Epic1"
},
"customfield_666": null
},
{
"id": "2",
"key": "BSE-555",
"issuetype": {
"id": "40",
"name": "Epic2"
},
"customfield_666":
{
"tag": "Smoke Testing",
"id": "666-01"
}
}
]
}
Below is the PySpark code used to parse above JSON data.
from pyspark.sql.functions import col, explode
rawDF = spark.read.json("abfss://users@mydlsgen2rk.dfs.core.windows.net/raw/MyData.json", multiLine=True)
DF = rawDF.select(explode("issues").alias("issues")) \
.select(
col("issues.id").alias("IssueId"),
col("issues.key").alias("IssueKey"),
col("issues.fields").alias("IssueFields"),
col("issues.issuetype.name").alias("IssueTypeName"),
col("issues.customfield_666.tag").alias("IssueCust666Tag")
)

You may check if it is null first
from pyspark.sql import functions as F
DF = rawDF.select(F.explode("issues").alias("issues")) \
.select(
F.col("issues.id").alias("IssueId"),
F.col("issues.key").alias("IssueKey"),
F.col("issues.fields").alias("IssueFields"),
F.col("issues.issuetype.name").alias("IssueTypeName"),
F.when(
F.col("issues.customfield_666").isNull() | (F.trim(F.col("issues.customfield_666").cast("string"))==""), None
).otherwise(
F.col("issues.customfield_666.tag")
).alias("IssueCust666Tag")
)
Let me know if this works for you

Related

Transform JSON: select one row from array of json objects

I can't get a specific row from this JSON array.
I want to get the object whose field 'type' is equal to 'No-Data'.
Are there any SQL functions or expressions that can select that row?
"metadata": { "value": "JABC" },
"force": false,
"users": [
{ "id": "111", "comment": "aaa", "type": "Data" },
{ "id": "222", "comment": "bbb", "type": "No-Data" },
{ "id": "333", "comment": "ccc", "type": "Data" }
]
You can use a JSON path query:
select jsonb_path_query_first(the_column, '$.users[*] ? (@.type == "No-Data")')
from the_table
This assumes that the column is defined as jsonb (which it should be). If it's not, you have to cast it: the_column::jsonb
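For readers who want to check the selection logic outside the database, here is a rough plain-Python sketch of what the path query does: it scans the users array and returns the first element whose type equals "No-Data". The function name first_user_of_type is hypothetical, and the sample fragment is wrapped in braces so that it parses as JSON.

```python
import json

# Sample fragment from the question, wrapped in braces to make it valid JSON.
doc = json.loads("""
{
  "metadata": {"value": "JABC"},
  "force": false,
  "users": [
    {"id": "111", "comment": "aaa", "type": "Data"},
    {"id": "222", "comment": "bbb", "type": "No-Data"},
    {"id": "333", "comment": "ccc", "type": "Data"}
  ]
}
""")


def first_user_of_type(document, wanted):
    """Return the first element of 'users' whose 'type' equals wanted, else None."""
    users = document.get("users", [])
    return next((u for u in users if u.get("type") == wanted), None)


match = first_user_of_type(doc, "No-Data")
print(match)
```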

Transform Array into columns in BigQuery

I have a JSON string stored in a string column in BigQuery. It contains an array. I would like to pick some fields from the array and write their values to BQ columns.
For example, consider the JSON below stored in BQ:
{
"pool": "mypool",
"statusCode": "0",
"payloads": [
{
"name": "request",
"fullpath": "com.gcp.commontools.edlpayload.EDLPayloadManagerTest$Request",
"jsonPayload": {
"body": "{\"data\":\"foo\"}"
},
"orientation": "REQUEST",
"httpTransport": {
"httpMethod": "POST",
"headers": {
"headers": {
"a": "1"
}
},
"sourceEndpoint": "/v1/foobar"
}
},
{
"name": "response",
"fullpath": "com.gcp.commontools.edlpayload.EDLPayloadManagerTest$Response",
"jsonPayload": {
"body": "{\"data\":\"bar\"}"
},
"orientation": "RESPONSE",
"httpTransport": {
"headers": {
"headers": {
"b": "2"
}
},
"httpResponseCode": 200
}
},
{
"name": "attributes",
"fullpath": "java.util.HashMap",
"nameValuePairs": {
"data": {
"one": "1"
}
},
"orientation": "TRANSITORY"
}
],
"uuid": "11EC-C714-8ADE2390-9619-1B80E63968CC",
"payloadName": "my-overall-name"
}
Consider the target BQ table schema:
pool, requestFullPath, requestPayload, responseFullPath, responsePayload
From the above JSON, I would like to pick a few elements and map their values to columns in BQ. Please note that the payloads array is dynamic: it may contain only one payload or several, and their order is not fixed. For example, the request payload can come at position 0, position 1, etc.
Consider below
select * from (
select
json_value(json_col, '$.pool') as pool,
json_value(payload, '$.name') as name,
json_value(payload, '$.fullpath') as FullPath,
json_value(payload, '$.jsonPayload.body') as Payload,
from your_table t
, unnest(json_extract_array(json_col, '$.payloads')) payload
)
pivot (any_value(FullPath) as FullPath, any_value(Payload) as Payload for name in ('request', 'response') )
if applied to sample data in your question - output is
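To make the reshaping concrete, here is a hedged plain-Python sketch of what the unnest-plus-pivot does to the sample document: one output row keyed by pool, with a FullPath and Payload value for each payload named request or response, regardless of its position in the array. The column names (FullPath_request and so on) reflect my understanding of how BigQuery names pivoted columns and should be treated as an assumption; the function pivot_payloads is hypothetical, and the sample is trimmed to the fields the query reads.

```python
import json

# Trimmed copy of the sample document: only the fields the query reads.
doc = json.loads(r'''
{
  "pool": "mypool",
  "payloads": [
    {
      "name": "request",
      "fullpath": "com.gcp.commontools.edlpayload.EDLPayloadManagerTest$Request",
      "jsonPayload": {"body": "{\"data\":\"foo\"}"}
    },
    {
      "name": "response",
      "fullpath": "com.gcp.commontools.edlpayload.EDLPayloadManagerTest$Response",
      "jsonPayload": {"body": "{\"data\":\"bar\"}"}
    },
    {
      "name": "attributes",
      "fullpath": "java.util.HashMap"
    }
  ]
}
''')


def pivot_payloads(document, names=("request", "response")):
    """Build one row with FullPath_<name> / Payload_<name> columns per wanted name."""
    row = {"pool": document.get("pool")}
    for name in names:
        # Columns exist even if the payload is absent, mirroring a NULL in SQL.
        row[f"FullPath_{name}"] = None
        row[f"Payload_{name}"] = None
    for payload in document.get("payloads", []):
        name = payload.get("name")
        if name in names:
            row[f"FullPath_{name}"] = payload.get("fullpath")
            row[f"Payload_{name}"] = (payload.get("jsonPayload") or {}).get("body")
    return row


row = pivot_payloads(doc)
for column, value in row.items():
    print(column, "=", value)
```

Payloads whose name is not in the list (here, attributes) are ignored, which matches the pivot's explicit list of values.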

Use Athena SQL to get a value from JSON key

I need to get the email addresses from the 'facets' table I created from my Firehose logs (JSON).
I am using Athena to query it.
This is the output from 'facets' when I run:
SELECT * FROM "sampledb"."facets" limit 10
{email_channel={mail_event={mail={message_id=oadfosadu6237864237615, message_send_timestamp=1622696691764, from_address=abcd@jk.com, destination=[abcd@jk.com], headers_truncated=false, headers=[{name=From, value=abcd@jk.com}, {name=To, value=abcd@jk.com}, {name=MIME-Version, value=1.0}], common_headers={from=ghjk@li.com, to=[abcd@jk.com]}}, send={}, rendering_failure=null}}}
Assuming you have one column which stores json in provided format you can use json_extract with needed paths (and maybe some casts):
with dataset1 as (
select * from (values(JSON
'{
"email_channel": {
"mail_event": {
"mail": {
"message_id": "oadfosadu6237864237615",
"message_send_timestamp": 1622696691764,
"from_address": "abcd@jk.com",
"destination": [
"abcd@jk.com"
],
"headers_truncated": false,
"headers": [
{
"name": "From",
"value": "abcd@jk.com"
},
{
"name": "To",
"value": "abcd@jk.com"
},
{
"name": "MIME-Version",
"value": "1.0"
}
],
"common_headers": {
"from": "ghjk@li.com",
"to": [
"abcd@jk.com"
]
}
},
"send": {},
"rendering_failure": null
}
}
}')) as facets(facet))
select
json_extract(facet, '$.email_channel.mail_event.mail.from_address') mail_from,
CAST(json_extract(facet, '$.email_channel.mail_event.mail.destination') AS ARRAY(VARCHAR)) destination
from dataset1
And output:
mail_from     | destination
--------------+--------------
"abcd@jk.com" | {abcd@jk.com}
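The quotes around the mail_from value are there because json_extract returns a JSON value rather than plain text. A small plain-Python sketch of the same path lookup shows the difference between the JSON-encoded result and the bare string. The helper extract is hypothetical and only handles simple '$.a.b.c' paths; the document is trimmed to the fields used.

```python
import json

# Trimmed copy of the sample facet: only the fields read below.
facet = json.loads("""
{
  "email_channel": {
    "mail_event": {
      "mail": {
        "from_address": "abcd@jk.com",
        "destination": ["abcd@jk.com"]
      },
      "send": {},
      "rendering_failure": null
    }
  }
}
""")


def extract(document, path):
    """Walk a simple '$.a.b.c' path; return None if any key is missing."""
    node = document
    for key in path.lstrip("$").strip(".").split("."):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


mail_path = "$.email_channel.mail_event.mail"
from_address = extract(facet, mail_path + ".from_address")
destination = extract(facet, mail_path + ".destination")

# JSON-encoded form keeps the quotes; the plain value does not.
mail_from_as_json = json.dumps(from_address)
print(mail_from_as_json)
print(from_address)
print(destination)
```

If the unquoted string is what you need in the query itself, casting the extracted value to varchar (or using a scalar extraction function) is the usual route.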

Unexpected behavior of ARRAY_SLICE in Cosmos Db SQL API

I have Cosmos DB collection (called sample) containing the following documents:
[
{
"id": "id1",
"messages": [
{
"messageId": "message1",
"Text": "Value1"
},
{
"messageId": "message2",
"Text": "Value2"
}
]
},
{
"id": "id2",
"messages": [
{
"messageId": "message3",
"Text": "Value3"
},
{
"messageId": "message4",
"Text": "Value1"
}
]
},
{
"id": "id3",
"messages": [
{
"messageId": "message5",
"Text": "Value1"
},
{
"messageId": "message6",
"Text": "Value2"
}
]
},
{
"id": "id4",
"messages": [
{
"messageId": "message7",
"Text": "Value5"
},
{
"messageId": "message8",
"Text": "Value2"
}
]
}
]
I am trying to retrieve all the documents that have messages and whose first message has the field "Text" = 'Value1'.
In this sample, the documents with the ids 'id1' and 'id3' would be retrieved. Note that the document with id 'id2' wouldn't be retrieved, since the text of its first message is 'Value3'.
The collection as mentioned is called sample and I am running the following Query:
"select sample.id, sample.messages, ARRAY_SLICE(sample.messages, 0, 1)[0].Text as valueOfText from sample"
As you can see in the first two images, I retrieve all documents, and each of them has the field "valueOfText" set to the text of the first message, as expected.
Now when I filter the collection (the third image), I retrieve no results at all.
Is this expected behavior?
Following your SQL, I got the same results.
But why do you need ARRAY_SLICE? It is used to return a truncated array. Since your requirement is specific:
trying to retrieve all the Documents, having messages and the first message has the field "Text"= 'Value1'
just use this SQL:
SELECT c.id,c.messages,c.messages[0].Text as valueOfText FROM c
where c.messages[0].Text = 'Value1'
Output:
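As a cross-check of the filter, the condition c.messages[0].Text = 'Value1' can be sketched in plain Python against the sample documents; it keeps 'id1' and 'id3'. The function name first_message_text_is is hypothetical, and documents with no messages are simply skipped.

```python
# Sample documents from the question.
sample = [
    {"id": "id1", "messages": [
        {"messageId": "message1", "Text": "Value1"},
        {"messageId": "message2", "Text": "Value2"},
    ]},
    {"id": "id2", "messages": [
        {"messageId": "message3", "Text": "Value3"},
        {"messageId": "message4", "Text": "Value1"},
    ]},
    {"id": "id3", "messages": [
        {"messageId": "message5", "Text": "Value1"},
        {"messageId": "message6", "Text": "Value2"},
    ]},
    {"id": "id4", "messages": [
        {"messageId": "message7", "Text": "Value5"},
        {"messageId": "message8", "Text": "Value2"},
    ]},
]


def first_message_text_is(documents, wanted):
    """Keep documents whose first message has Text equal to wanted."""
    matches = []
    for document in documents:
        messages = document.get("messages") or []
        if messages and messages[0].get("Text") == wanted:
            matches.append(document)
    return matches


ids = [d["id"] for d in first_message_text_is(sample, "Value1")]
print(ids)
```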

Query to extract ids from a deeply nested json array object in Presto

I'm using Presto and trying to extract every 'id' where 'source'='dd' from a nested JSON structure like the following:
{
"results": [
{
"docs": [
{
"id": "apple1",
"source": "dd"
},
{
"id": "apple2",
"source": "aa"
},
{
"id": "apple3",
"source": "dd"
}
],
"group": 99806
}
]
}
I expect to extract the ids [apple1, apple3] into a column in Presto.
What is the right way to achieve this in a Presto query?
If your data has a regular structure as in the example you posted, you can use a combination of parsing the value as JSON, casting it to a structured SQL type (array/map/row), and then using array processing functions to filter, transform and extract the elements you want:
WITH data(value) AS (VALUES '{
"results": [
{
"docs": [
{
"id": "apple1",
"source": "dd"
},
{
"id": "apple2",
"source": "aa"
},
{
"id": "apple3",
"source": "dd"
}
],
"group": 99806
}
]
}'),
parsed(value) AS (
SELECT cast(json_parse(value) AS row(results array(row(docs array(row(id varchar, source varchar)), "group" bigint))))
FROM data
)
SELECT
transform( -- extract the id from the resulting docs
filter( -- filter docs with source = 'dd'
flatten( -- flatten all docs arrays into a single doc array
transform(value.results, r -> r.docs) -- extract the docs arrays from the result array
),
doc -> doc.source = 'dd'),
doc -> doc.id)
FROM parsed
The query above produces:
_col0
------------------
[apple1, apple3]
(1 row)
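The three array steps in the query (flatten, filter, transform) map directly onto ordinary list operations. A plain-Python sketch of the same pipeline, offered only as an illustration of the logic and not as Presto code:

```python
import json
from itertools import chain

value = json.loads("""
{
  "results": [
    {
      "docs": [
        {"id": "apple1", "source": "dd"},
        {"id": "apple2", "source": "aa"},
        {"id": "apple3", "source": "dd"}
      ],
      "group": 99806
    }
  ]
}
""")

# flatten: merge the docs arrays of every result into one list
all_docs = list(chain.from_iterable(r.get("docs", []) for r in value["results"]))

# filter: keep docs with source = 'dd'
dd_docs = [doc for doc in all_docs if doc.get("source") == "dd"]

# transform: extract the id from each remaining doc
ids = [doc["id"] for doc in dd_docs]
print(ids)
```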