Flatten complex json using Databricks and ADF - pandas

I have the following JSON, which I have partially flattened using explode:
{
"result":[
{
"employee":[
{
"employeeType":{
"name":"[empName]",
"displayName":"theName"
},
"groupValue":"value1"
},
{
"employeeType":{
"name":"#bossName#",
"displayName":"theBoss"
},
"groupValue":[
{
"id":"1",
"type":{
"name":"firstBoss",
"displayName":"CEO"
},
"name":"Martha"
},
{
"id":"2",
"type":{
"name":"secondBoss",
"displayName":"cto"
},
"name":"Alex"
}
]
}
]
}
]
}
I need to get the following fields:
employeeType.name
groupValue
I am able to extract those fields and their values. However, when the name value starts with #, as in "name":"#bossName#", I get groupValue as a string, from which I need to extract id and name.
"groupValue":[
{
"id":"1",
"type":{
"name":"firstBoss",
"displayName":"CEO"
},
"name":"Martha"
},
{
"id":"2",
"type":{
"name":"secondBoss",
"displayName":"cto"
},
"name":"Alex"
}
]
How can I convert this string to JSON and get the values?
My code so far:
from pyspark.sql.functions import col, explode
# Explode the employee array, then keep the type name and the raw groupValue.
db_flat = (df.select(explode("result.employee").alias("emp"))
.withColumn("emp_name", col("emp.employeeType.name"))
.withColumn("emp_val", col("emp.groupValue")).drop("emp"))
How can I extract groupValue from db_flat and get id and name from it? Perhaps using the Python pandas library?

Since the structure is not dynamic, you can traverse the JSON while mapping, as shown below. Identify each record and array, and specify the index [i] as needed.
Example:
id --> $['employee'][1]['groupValue'][0]['id']
name --> $['employee'][1]['groupValue'][0]['type']['name']
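
The same traversal can be sketched in plain Python with the standard json module. This is only an illustration, assuming the document has the shape shown in the question; the function name extract_group_values is made up for this example, and the string check covers the case where groupValue arrives as a JSON string rather than a parsed array.

```python
import json

# Sample document copied from the question (only the fields used here).
doc = json.loads("""
{
  "result": [{
    "employee": [
      {"employeeType": {"name": "[empName]", "displayName": "theName"},
       "groupValue": "value1"},
      {"employeeType": {"name": "#bossName#", "displayName": "theBoss"},
       "groupValue": [
         {"id": "1", "type": {"name": "firstBoss", "displayName": "CEO"}, "name": "Martha"},
         {"id": "2", "type": {"name": "secondBoss", "displayName": "cto"}, "name": "Alex"}
       ]}
    ]
  }]
}
""")


def extract_group_values(document):
    """Return (employeeType.name, id, name) tuples; id is None for scalar groupValue."""
    rows = []
    for result in document["result"]:
        for emp in result["employee"]:
            emp_name = emp["employeeType"]["name"]
            group_value = emp["groupValue"]
            # Assumption: a groupValue string that starts with "[" is a
            # serialized JSON array and should be parsed before traversal.
            if isinstance(group_value, str) and group_value.lstrip().startswith("["):
                group_value = json.loads(group_value)
            if isinstance(group_value, list):
                for item in group_value:
                    rows.append((emp_name, item["id"], item["name"]))
            else:
                rows.append((emp_name, None, group_value))
    return rows


rows = extract_group_values(doc)
for row in rows:
    print(row)
```

Each array element becomes its own row, so the fixed indexes [0] and [1] are not needed when the goal is to collect every id and name.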

Related

Get JSON Array from JSON Object and Count Number of Objects

I have a column that contains some data like this:
{
"activity_goal": 200,
"members": [
{
"json": "data"
},
{
"HAHA": "HAHA"
},
{
"HAHA": "HAHA"
}
],
"name": "Hunters Team v3",
"total_activity": "0",
"revenue_goal": 200,
"total_active_days": "0",
"total_raised": 300
}
I am using cast(team_data -> 'members' as jsonb) to get the "Members" JSON array, which gives me a column like this:
[
{
"json": "data"
},
{
"HAHA": "HAHA"
},
{
"HAHA": "HAHA"
}
]
I am using array_length(cast(team_data -> 'members' as jsonb), 1) to pull a column with the number of Members that exist in the list. When I do this, I am given this error:
function array_length(jsonb, integer) does not exist
Note: I have also tried casting as "json" instead of "jsonb"
I am following this documentation. What am I doing wrong?
Use the JSON functions when working with JSON, such as json_array_length (or jsonb_array_length if the value is jsonb):
select json_array_length(team_data -> 'members') from mytable
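
For comparison, the same count can be sketched outside the database in Python. This assumes the column value has already been fetched as a JSON string; the variable name team_data simply mirrors the column in the question, and the sample is trimmed to a few of its fields.

```python
import json

# Trimmed copy of the sample column value from the question.
team_data = """
{
  "activity_goal": 200,
  "members": [{"json": "data"}, {"HAHA": "HAHA"}, {"HAHA": "HAHA"}],
  "name": "Hunters Team v3"
}
"""

# Parse the document, pick the "members" array, and count its elements.
members = json.loads(team_data)["members"]
member_count = len(members)
print(member_count)
```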

map two payload data based on common field and country

I have a payload from which I need to extract only the created_by__v fields, as a list of strings, from the payload objects where abbreviation__c == 'CN'.
The payload is:
{
"data": [{
"created_by__v": 2447129,
"document_country__vr": {
"responseDetails": {
"limit": 250
},
"data": [{
"name__v": "China",
"abbreviation__c": "CN"
}]
},
"version_modified_date__v": "2020-11-30T06:33:41.000Z"
}
]
}
You can use filter to keep only the entries you need, and then map to extract the created_by__v values:
(payload.data filter $.document_country__vr.data[0].abbreviation__c == "CN")
map $.created_by__v as String
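
The filter-then-map idea above can also be sketched in Python for readers who want to check the logic outside DataWeave. This is not DataWeave code; it is a plain-Python equivalent under the assumption that the payload has already been parsed into dictionaries and lists.

```python
# Payload from the question, already parsed into Python objects.
payload = {
    "data": [
        {
            "created_by__v": 2447129,
            "document_country__vr": {
                "responseDetails": {"limit": 250},
                "data": [{"name__v": "China", "abbreviation__c": "CN"}],
            },
            "version_modified_date__v": "2020-11-30T06:33:41.000Z",
        }
    ]
}


def created_by_for_country(doc, abbreviation):
    """Filter entries by the first country's abbreviation, then map to strings."""
    result = []
    for entry in doc["data"]:
        countries = entry["document_country__vr"]["data"]
        # Like the expression above, only the first country entry is checked.
        if countries and countries[0]["abbreviation__c"] == abbreviation:
            result.append(str(entry["created_by__v"]))
    return result


created_by = created_by_for_country(payload, "CN")
print(created_by)
```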

Exporting data map in JSON format through T-SQL

I'm trying to export a JSON format metadata file that describes my CSV. Here's what I have so far :
SELECT DISTINCT
[ProjectID] AS [Project.ProjectID],
[Study_Name] AS [Project.Study_Name],
Gender.[Gender] AS [Gender.GenderLabel],
Gender.GenderID AS [Gender.GenderID]
FROM
[ProjectTable] Project
JOIN
[Lkup].[Gender] Gender ON Project.Gender = Gender.GenderID
FOR JSON PATH, INCLUDE_NULL_VALUES
The output I see has a Gender body under every project, but ideally I want all the projects in one array and each Gender shown only once.
The output from my code above:
{
"Project":{
"ProjectID":"112",
"Study_Name":"Jul-Aug Study"
},
"Gender":{
"GenderLabel":"Female",
"GenderID":2
}
},
{
"Project":{
"ProjectID":"112",
"Study_Name":"Jul-Aug Study"
},
"Gender":{
"GenderLabel":"Male",
"GenderID":1
}
}
The output I'm trying for :
{"Project": [
{
"ProjectID":"112",
"Study_Name":"Jul-Aug Study"
},
{
"ProjectID":"113",
"Study_Name":"Aug-Sept Study"
},
{
"ProjectID":"114",
"Study_Name":"Sept-Oct Study"
},
]
},
{"Gender": [
{
"GenderLabel":"Male",
"GenderID":1
},
{
"GenderLabel":"Female",
"GenderID":2
},
]
}
This is my first time exporting JSON, so I am not sure whether this structure is feasible to export from SQL Server, but any ideas would be most helpful.
Thank you!

MarkLogic - Xpath on JSON document

MarkLogic Version: 9.0-6.2
I am trying to apply XPath in extract-document-data (using Query Options) on the JSON document shown below. I need to keep only the "Channel" properties whose nested "OptIn" property has a value of "True".
{
"Category":
{
"Name": "Severe Weather",
"Channels":[
{
"Channel":
{
"Name":"Email",
"OptIn": "True"
}
},
{
"Channel":
{
"Name":"Text",
"OptIn": "False"
}
}
]
}
}
I tried the code below,
'<extract-document-data selected="include">' +
'<extract-path>//*[OptIn="True"]/../..</extract-path>' +
'</extract-document-data>' +
which is only pulling from "Channel" property as shown below.
[
{
"Channel": {
"Name": "Email",
"OptIn": "True"
}
}
]
But I need to pull from the parent "Category" property while filtering out the Channels that have an OptIn value of False.
Any pointers?
If I understand correctly, you'd like to extract 'Category', but only with those 'Channel's that have 'OptIn' equal to 'True', right?
Extract-document-data is not advanced enough for that. Your best option is to extract entire Categories that have at least one OptIn equal to 'True' (//Category[//OptIn = 'True'], matching the capitalization in your data, since the string comparison is case-sensitive), and use a REST transform on the search response to trim out the unwanted Channels.
HTH!
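
The trimming step described above can be illustrated in plain Python. This is not a MarkLogic REST transform (those are written in server-side JavaScript, XQuery, or XSLT); it only sketches the filtering logic such a transform would need, and the function name keep_opted_in is made up for this example.

```python
import copy

# Document from the question, parsed into Python objects.
document = {
    "Category": {
        "Name": "Severe Weather",
        "Channels": [
            {"Channel": {"Name": "Email", "OptIn": "True"}},
            {"Channel": {"Name": "Text", "OptIn": "False"}},
        ],
    }
}


def keep_opted_in(doc):
    """Return a copy of doc whose Channels list keeps only OptIn == "True"."""
    trimmed = copy.deepcopy(doc)  # leave the caller's document untouched
    category = trimmed["Category"]
    category["Channels"] = [
        entry
        for entry in category["Channels"]
        if entry["Channel"]["OptIn"] == "True"  # string comparison, case-sensitive
    ]
    return trimmed


trimmed_document = keep_opted_in(document)
print(trimmed_document)
```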

express-graphql: How to remove external "data" object layer.

I am replacing an existing REST endpoint with GraphQL.
In our existing REST endpoint, we return a JSON array.
[{
"id": "ABC"
},
{
"id": "123"
},
{
"id": "xyz"
},
{
"id": "789"
}
]
GraphQL seems to be wrapping the array in two additional object layers. Is there any way to remove the "data" and "Client" layers?
Response data:
{
"data": {
"Client": [
{
"id": "ABC"
},
{
"id": "123"
},
{
"id": "xyz"
},
{
"id": "789"
}
]
}
}
My query:
{
Client(accountId: "5417727750494381532d735a") {
id
}
}
No. That is the whole purpose of GraphQL: to have a single endpoint and let users fetch data of different types and granularity by specifying the input as a query, as opposed to REST APIs, and then map it onto the returned JSON output.
'data' acts as a parent/root-level container for the different entities that you have queried. Without these keys in the returned JSON data, there would be no way to separate the corresponding data.
For example, your query above can be modified to include another entity such as Owner:
{
Client(accountId: "5417727750494381532d735a") {
id
}
Owner {
id
}
}
In which case, the output will be something like
{
"data": {
"Client": [
...
],
"Owner": [
...
]
}
}
Without the 'Client' and 'Owner' keys in the JSON output, there is no way to separate the corresponding array values.
In your case, you can get only the array by doing data.Client on the returned output.
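
As a small illustration of that last step, here is how a client could unwrap the array in Python. The response text is the one from the question; how the response is fetched is left out, so the sketch starts from the raw JSON string.

```python
import json

# Response body from the question, as the client would receive it.
response_text = """
{
  "data": {
    "Client": [
      {"id": "ABC"},
      {"id": "123"},
      {"id": "xyz"},
      {"id": "789"}
    ]
  }
}
"""

response = json.loads(response_text)

# Equivalent of data.Client: drop both wrapper layers and keep the array.
clients = response["data"]["Client"]
print(json.dumps(clients))
```

The resulting list has the same shape as the array returned by the original REST endpoint.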