PySpark / Spark SQL DataFrame - Error while parsing Struct Type when data is null

I am trying to parse a JSON file and selectively read only 50+ data elements (out of 800+) into a DataFrame in PySpark. One of the data elements (issues.customfield_666) is a struct type (with 3 fields, Id/Name/Tag, under it). Sometimes the data in this struct field comes as null. When that happens, the Spark job fails with the error below. How can I ignore or suppress this error for null values?
The error happens only when parsing JSON file #1 (where customfield_666 is always null).
AnalysisException: Can't extract value from issues.customfield_666: need struct type but got string
JSON File 1 (Where customfield_666 has only null)
"startAt": 0,
"total": 1,
"issues": [
"id": "1",
"key": "BSE-444",
"issuetype": {
"id": "30",
"name": "Epic1",
"customfield_666": null
JSON File 2 (Where customfield_666 has both null and struct values)
"startAt": 0,
"total": 2,
"issues": [
"id": "1",
"key": "BSE-444",
"issuetype": {
"id": "30",
"name": "Epic1",
"customfield_666": null
"id": "2",
"key": "BSE-555",
"issuetype": {
"id": "40",
"name": "Epic2",
"tag": "Smoke Testing",
"id": "666-01",
Below is the PySpark code used to parse above JSON data.
from pyspark.sql.functions import *
rawDF ="abfss://", multiLine = "true")
DF ="issues").alias("issues")) \

You can check whether it is null first:
from pyspark.sql import functions as F
DF ="issues").alias("issues")) \
F.col("issues.customfield_666").isNull() | (F.trim(F.col("issues.customfield_666").cast("string"))==""), None
Let me know if this works for you
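The idea of checking for null before reaching into the struct can be illustrated outside Spark, in plain Python over the parsed JSON. This is only a sketch of the logic, not Spark code: the helper name extract_customfield, the lowercase field names id/name/tag, and the "Example" name value are assumptions made for illustration.

```python
import json

# Field names inside the struct; lowercase spellings are an assumption
# based on the sample JSON in the question.
STRUCT_FIELDS = ("id", "name", "tag")

def extract_customfield(issue, key="customfield_666"):
    """Return the struct's fields, or all None when the value is null or not an object."""
    value = issue.get(key)
    if not isinstance(value, dict):
        # Covers null, a missing key, and stray string values alike.
        return {field: None for field in STRUCT_FIELDS}
    return {field: value.get(field) for field in STRUCT_FIELDS}

# Hypothetical document modelled on the two files in the question.
doc = json.loads("""
{
  "issues": [
    {"id": "1", "key": "BSE-444", "customfield_666": null},
    {"id": "2", "key": "BSE-555",
     "customfield_666": {"id": "666-01", "name": "Example", "tag": "Smoke Testing"}}
  ]
}
""")

rows = [{"key": issue["key"], **extract_customfield(issue)} for issue in doc["issues"]]
for row in rows:
    print(row)
```

Every issue yields the same set of columns, so rows where the field is null simply carry None values instead of causing a failure.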


Transform JSON: select one row from array of json objects

I can't get a specific row from this JSON array.
I want to get the object where the field 'type' is equal to 'No-Data'.
Are there any SQL functions or expressions that can select that row?
{
  "metadata": { "value": "JABC" },
  "force": false,
  "users": [
    { "id": "111", "comment": "aaa", "type": "Data" },
    { "id": "222", "comment": "bbb", "type": "No-Data" },
    { "id": "333", "comment": "ccc", "type": "Data" }
  ]
}
You can use a JSON path query:
select jsonb_path_query_first(the_column, '$.users[*] ? (@.type == "No-Data")')
from the_table
This assumes that the column is defined as jsonb (which it should be). If it's not you have to cast it: the_column::jsonb
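For readers who want to see what the path expression selects, the same "first array element whose type matches" logic can be sketched in plain Python. This is an illustration of the semantics only, not PostgreSQL code; the function name first_matching_user is hypothetical.

```python
import json

def first_matching_user(doc, wanted_type):
    """Return the first element of doc["users"] whose "type" equals wanted_type, else None."""
    return next(
        (user for user in doc.get("users", []) if user.get("type") == wanted_type),
        None,
    )

doc = json.loads("""
{
  "metadata": {"value": "JABC"},
  "force": false,
  "users": [
    {"id": "111", "comment": "aaa", "type": "Data"},
    {"id": "222", "comment": "bbb", "type": "No-Data"},
    {"id": "333", "comment": "ccc", "type": "Data"}
  ]
}
""")

match = first_matching_user(doc, "No-Data")
print(match)
```

Like the SQL function, the sketch returns a single object (or nothing) rather than a list of all matches.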

Transform Array into columns in BigQuery

I have a JSON string stored in a string column in BigQuery. There is an array in it. I would like to pick some fields from the array and write their values to BQ columns.
For example, consider the JSON below, stored in BQ:
"pool": "mypool",
"statusCode": "0",
"payloads": [
"name": "request",
"fullpath": "com.gcp.commontools.edlpayload.EDLPayloadManagerTest$Request",
"jsonPayload": {
"body": "{\"data\":\"foo\"}"
"orientation": "REQUEST",
"httpTransport": {
"httpMethod": "POST",
"headers": {
"headers": {
"a": "1"
"sourceEndpoint": "/v1/foobar"
"name": "response",
"fullpath": "com.gcp.commontools.edlpayload.EDLPayloadManagerTest$Response",
"jsonPayload": {
"body": "{\"data\":\"bar\"}"
"orientation": "RESPONSE",
"httpTransport": {
"headers": {
"headers": {
"b": "2"
"httpResponseCode": 200
"name": "attributes",
"fullpath": "java.util.HashMap",
"nameValuePairs": {
"data": {
"one": "1"
"orientation": "TRANSITORY"
"uuid": "11EC-C714-8ADE2390-9619-1B80E63968CC",
"payloadName": "my-overall-name"
Assume the target BQ table schema is:
pool, requestFullPath, requestPayload, responseFullPath, responsePayload
From the above JSON, I would like to pick a few elements and map their values to columns in BQ. Please note that the payloads array is dynamic in nature. There can be only one payload in the array, or there can be several, and their order is not fixed. For example, the request payload can come at position 0, position 1, and so on.
Consider below
select * from (
  select
    json_value(json_col, '$.pool') as pool,
    json_value(payload, '$.name') as name,
    json_value(payload, '$.fullpath') as FullPath,
    json_value(payload, '$.jsonPayload.body') as Payload,
  from your_table t
  , unnest(json_extract_array(json_col, '$.payloads')) payload
)
pivot (any_value(FullPath) as FullPath, any_value(Payload) as Payload for name in ('request', 'response') )
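If applied to the sample data in your question, the output is a single row with the columns pool, FullPath_request, Payload_request, FullPath_response and Payload_response.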
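The unnest-then-pivot idea can also be sketched in plain Python, which may help when checking the logic against payloads that arrive in a different order. This is an illustration only, not BigQuery code: the function name pivot_payloads and the output key names are choices made for this sketch, and the document below is a trimmed, reordered version of the sample.

```python
def pivot_payloads(doc, names=("request", "response")):
    """Build one flat row: pool plus FullPath_<name> and Payload_<name> per wanted payload."""
    row = {"pool": doc.get("pool")}
    # Pre-fill the columns so a missing payload yields None rather than a missing key.
    for name in names:
        row[f"FullPath_{name}"] = None
        row[f"Payload_{name}"] = None
    # Payload order does not matter: each element is matched by its "name".
    for payload in doc.get("payloads", []):
        name = payload.get("name")
        if name in names:
            row[f"FullPath_{name}"] = payload.get("fullpath")
            row[f"Payload_{name}"] = (payload.get("jsonPayload") or {}).get("body")
    return row

# Trimmed version of the sample document; payload order is deliberately shuffled.
doc = {
    "pool": "mypool",
    "payloads": [
        {"name": "attributes", "fullpath": "java.util.HashMap"},
        {
            "name": "response",
            "fullpath": "com.gcp.commontools.edlpayload.EDLPayloadManagerTest$Response",
            "jsonPayload": {"body": '{"data":"bar"}'},
        },
        {
            "name": "request",
            "fullpath": "com.gcp.commontools.edlpayload.EDLPayloadManagerTest$Request",
            "jsonPayload": {"body": '{"data":"foo"}'},
        },
    ],
}

row = pivot_payloads(doc)
print(row)
```

Payloads with other names, such as "attributes", are ignored, which mirrors restricting the pivot to 'request' and 'response'.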

Use Athena SQL to get a value from JSON key

I need to get the email addresses from this 'facets' table, which I created from my Firehose logs (JSON). I am using Athena to query it.
This is the output of 'facets' when I run:
SELECT * FROM "sampledb"."facets" limit 10
{email_channel={mail_event={mail={message_id=oadfosadu6237864237615, message_send_timestamp=1622696691764,, destination=[], headers_truncated=false, headers=[{name=From,}, {name=To,}, {name=MIME-Version, value=1.0}], common_headers={, to=[]}}, send={}, rendering_failure=null}}}
Assuming you have one column which stores JSON in the provided format, you can use json_extract with the needed paths (and maybe some casts):
with dataset1 as (
select * from (values(JSON '{
"email_channel": {
"mail_event": {
"mail": {
"message_id": "oadfosadu6237864237615",
"message_send_timestamp": 1622696691764,
"from_address": "",
"destination": [
"headers_truncated": false,
"headers": [
"name": "From",
"value": ""
"name": "To",
"value": ""
"name": "MIME-Version",
"value": "1.0"
"common_headers": {
"from": "",
"to": [
"send": {},
"rendering_failure": null
}')) as facets(facet))
select
json_extract(facet, '$.email_channel.mail_event.mail.from_address') mail_from,
CAST(json_extract(facet, '$.email_channel.mail_event.mail.destination') AS ARRAY(VARCHAR)) destination
from dataset1
And output:

Unexpected behavior of ARRAY_SLICE in Cosmos Db SQL API

I have a Cosmos DB collection (called sample) containing the following documents:
[
  {
    "id": "id1",
    "messages": [
      { "messageId": "message1", "Text": "Value1" },
      { "messageId": "message2", "Text": "Value2" }
    ]
  },
  {
    "id": "id2",
    "messages": [
      { "messageId": "message3", "Text": "Value3" },
      { "messageId": "message4", "Text": "Value1" }
    ]
  },
  {
    "id": "id3",
    "messages": [
      { "messageId": "message5", "Text": "Value1" },
      { "messageId": "message6", "Text": "Value2" }
    ]
  },
  {
    "id": "id4",
    "messages": [
      { "messageId": "message7", "Text": "Value5" },
      { "messageId": "message8", "Text": "Value2" }
    ]
  }
]
I am trying to retrieve all the Documents, having messages and the first message has the field "Text"= 'Value1'.
In this sample, the documents with the ids 'id1' and 'id3' would be retrieved. Please notice that the document with id='id2' wouldn't be retrieved,
since the value of the text of the first message is 'Value3'.
The collection as mentioned is called sample and I am running the following Query:
"select sample.id, sample.messages, ARRAY_SLICE(sample.messages, 0, 1)[0].Text as valueOfText from sample"
As you can see in the first two images, I retrieve all documents, and every one of them has the field "valueOfText" set to the value of the first message, as expected.
Now when I filter the collection (the third image), I retrieve no results at all.
Is this an expected behavior?
Following your SQL, I got the same results.
But why do you have to use ARRAY_SLICE? It is used to return a truncated array. Since your requirement is specific:
trying to retrieve all the Documents, having messages and the first
message has the field "Text"= 'Value1'
Just use this SQL:
SELECT c.id, c.messages, c.messages[0].Text as valueOfText FROM c
where c.messages[0].Text = 'Value1'
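To double-check which documents that filter should return, the same condition can be sketched in plain Python over the sample documents. This only mirrors the logic of indexing the first message directly; it is not Cosmos DB code, and the helper name first_message_text is hypothetical.

```python
documents = [
    {"id": "id1", "messages": [{"messageId": "message1", "Text": "Value1"},
                               {"messageId": "message2", "Text": "Value2"}]},
    {"id": "id2", "messages": [{"messageId": "message3", "Text": "Value3"},
                               {"messageId": "message4", "Text": "Value1"}]},
    {"id": "id3", "messages": [{"messageId": "message5", "Text": "Value1"},
                               {"messageId": "message6", "Text": "Value2"}]},
    {"id": "id4", "messages": [{"messageId": "message7", "Text": "Value5"},
                               {"messageId": "message8", "Text": "Value2"}]},
]

def first_message_text(doc):
    """Return the Text of the first message, or None when there are no messages."""
    messages = doc.get("messages") or []
    return messages[0].get("Text") if messages else None

# Keep only documents whose first message has Text == 'Value1'.
matches = [doc["id"] for doc in documents if first_message_text(doc) == "Value1"]
print(matches)  # ['id1', 'id3']
```

Documents without any messages are skipped, and 'id2' is excluded because only its second message carries 'Value1'.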

Query to extract ids from a deeply nested json array object in Presto

I'm using Presto and trying to extract every 'id' where 'source'='dd' from a nested JSON structure like the following.
{
  "results": [
    {
      "docs": [
        { "id": "apple1", "source": "dd" },
        { "id": "apple2", "source": "aa" },
        { "id": "apple3", "source": "dd" }
      ],
      "group": 99806
    }
  ]
}
I expect to extract the ids [apple1, apple3] into a column in Presto.
What is the right way to achieve this in a Presto query?
If your data has a regular structure as in the example you posted, you can use a combination of parsing the value as JSON, casting it to a structured SQL type (array/map/row) and then using array processing functions to filter, transform and extract the elements you want:
WITH data(value) AS (VALUES '{
  "results": [
    {
      "docs": [
        { "id": "apple1", "source": "dd" },
        { "id": "apple2", "source": "aa" },
        { "id": "apple3", "source": "dd" }
      ],
      "group": 99806
    }
  ]
}'),
parsed(value) AS (
    SELECT cast(json_parse(value) AS row(results array(row(docs array(row(id varchar, source varchar)), "group" bigint))))
    FROM data
)
SELECT
    transform( -- extract the id from the resulting docs
        filter( -- filter docs with source = 'dd'
            flatten( -- flatten all docs arrays into a single doc array
                transform(value.results, r -> r.docs)), -- extract the docs arrays from the result array
            doc -> doc.source = 'dd'),
        doc -> doc.id)
FROM parsed
The query above produces:
[apple1, apple3]
(1 row)
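The flatten, filter and transform steps map closely onto ordinary list processing, so the pipeline can be sketched in plain Python to make each stage visible. This is an illustration of the logic rather than Presto code; the variable names are chosen for this sketch.

```python
from itertools import chain

data = {
    "results": [
        {
            "docs": [
                {"id": "apple1", "source": "dd"},
                {"id": "apple2", "source": "aa"},
                {"id": "apple3", "source": "dd"},
            ],
            "group": 99806,
        }
    ]
}

# Extract the docs array from every result and flatten them into one sequence.
all_docs = list(chain.from_iterable(result["docs"] for result in data["results"]))

# Filter docs with source == 'dd'.
dd_docs = [doc for doc in all_docs if doc["source"] == "dd"]

# Transform each remaining doc into its id.
ids = [doc["id"] for doc in dd_docs]
print(ids)  # ['apple1', 'apple3']
```

Because the docs arrays are flattened first, the same code works unchanged when "results" holds more than one element.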