Extracting multiple elements within a nested array structure in a Spark 2.4 DataFrame

I am reading a Parquet file and trying to extract elements within a struct of a struct of an array. However, the null values come back empty when I use getItem(). This pattern works in Spark 1.6, but in Spark 2.4 on AWS Glue it seems to ignore the null values and only pulls empty strings.
The input is Parquet, but I have written it out here in JSON format:
{
  "ExampleMessage": {
    "activity": {
      "exampleSport": [
        { "exampleRole": null },
        { "exampleRole": null },
        { "exampleRole": "Runner" }
      ]
    }
  }
}
Attempted Extraction:
raw_df.select(col("ExampleMessage.activity").getItem("exampleSport").getItem("exampleRole"))
Current Output:
,,Runner
Desired Output:
null,null,Runner

Try the code below; instead of null you will get None:
raw_df.select(col("ExampleMessage.activity").getItem("exampleSport").getItem("exampleRole")).head()
The output will be:
Row(ExampleMessage.activity AS `activity`.exampleSport.exampleRole=[None, None, 'Runner'])
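To make the behaviour concrete, here is a minimal, self-contained sketch that rebuilds the nested layout from the question with an explicit StructType (the schema and sample rows are assumptions based on the JSON above) and shows that the nulls survive extraction even though show() may render them as empty:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from pyspark.sql.types import StructType, StructField, ArrayType, StringType

spark = SparkSession.builder.getOrCreate()

# Explicit schema mirroring the struct > struct > array-of-struct layout above
schema = StructType([
    StructField("ExampleMessage", StructType([
        StructField("activity", StructType([
            StructField("exampleSport", ArrayType(StructType([
                StructField("exampleRole", StringType())
            ])))
        ]))
    ]))
])

data = [{"ExampleMessage": {"activity": {"exampleSport": [
    {"exampleRole": None},
    {"exampleRole": None},
    {"exampleRole": "Runner"},
]}}}]

raw_df = spark.createDataFrame(data, schema)

roles = raw_df.select(
    col("ExampleMessage.activity").getItem("exampleSport").getItem("exampleRole").alias("roles")
)

roles.show(truncate=False)  # nested nulls may be displayed as empty, e.g. [,, Runner]
print(roles.head())         # Row(roles=[None, None, 'Runner']) -- the nulls are still there
In other words, the nulls are not dropped; they are only displayed as empty by show() and by CSV-style output, so collecting the row (or writing to a format that distinguishes null from empty string) shows them as None.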

Related

Push data to mongoDB using spark from hive

I want to extract data from Hive using a SQL query, convert the result to a nested dataframe, and push it into MongoDB using Spark.
Can anyone suggest an efficient way to do that?
For example:
Flat query result -->
{"columnA":123213 ,"Column3 : 23,"Column4" : null,"Column5" : "abc"}
Nested Record to be pushed to mongo -->
{
  "columnA": 123213,
  "newcolumn": {
    "Column3": 23,
    "Column4": null,
    "Column5": "abc"
  }
}
You may use the map function in Spark SQL to achieve the desired transformation, e.g.:
df.selectExpr("ColumnA","map('Column3',Column3,'Column4',Column4,'Column5',Column5) as newcolumn")
or you may run the following on your Spark session after creating a temp view:
df.createOrReplaceTempView("my_temp_view")
sparkSession.sql("<insert sql below here>")
SELECT
ColumnA,
map(
"Column3",Column3,
"Column4",Column4,
"Column5",Column5
) as newcolumn
FROM
my_temp_view
Moreover, if this is the only transformation you wish to apply, you can also run this query directly in Hive.
Additional resources:
Spark Writing to Mongo
Let me know if this works for you.
For a nested-level array from your Hive dataframe, we can try something like:
from pyspark.sql import functions as F

df.withColumn(
    "newcolumn",
    F.struct(
        F.col("Column3").alias("Column3"),
        F.col("Column4").alias("Column4"),
        F.col("Column5").alias("Column5")
    )
)
followed by groupBy and F.collect_list to create a nested array wrapped in a single record.
We can then write this to Mongo:
df.write.format('com.mongodb.spark.sql.DefaultSource').mode("append").save()
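Putting the struct, the groupBy/collect_list step, and the Mongo write together, a minimal end-to-end sketch might look like the following; the sample rows and the Mongo URI are placeholders, and the connector options depend on your mongo-spark-connector setup:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical flat result of the Hive query (two rows sharing the same ColumnA)
df = spark.createDataFrame(
    [(123213, 23, None, "abc"), (123213, 24, "xyz", "def")],
    ["ColumnA", "Column3", "Column4", "Column5"],
)

nested = (
    df.withColumn(
        "newcolumn",
        F.struct(F.col("Column3"), F.col("Column4"), F.col("Column5")),
    )
    .groupBy("ColumnA")
    .agg(F.collect_list("newcolumn").alias("newcolumn"))  # nested array per ColumnA
)

# Write via the MongoDB Spark connector; replace the URI with your own
(nested.write
    .format("com.mongodb.spark.sql.DefaultSource")
    .option("uri", "mongodb://<host>/<database>.<collection>")
    .mode("append")
    .save())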

Need Pentaho JSON without array

I wanted to output JSON data not as an array object, and I made the changes mentioned in the Pentaho documentation, but the output is always an array, even for a single set of values. I am using PDI 9.1 and I tested using the ktr from the link below:
https://wiki.pentaho.com/download/attachments/25043814/json_output.ktr?version=1&modificationDate=1389259055000&api=v2
The statement below is from https://wiki.pentaho.com/display/EAI/JSON+output:
Another special case is when 'Nr. rows in a block' = 1.
If used with an empty JSON block name, the output will look like:
{
"name" : "item",
"value" : 25
}
My output comes out like below:
{ "": [ {"name":"item","value":25} ] }
I have resolved this myself. I added another JSON Input step and defined it as below:
$.wellDesign[0] to get the array as a string object
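For reference, this works because the JSONPath $.wellDesign[0] selects the first (and only) element of the array, which strips the surrounding [ ]. The plain-Python snippet below only illustrates that selection; the block name wellDesign comes from the answer above and is an assumption for the example:
import json

# Array-wrapped output, assuming the JSON block name is set to "wellDesign"
payload = json.loads('{"wellDesign": [{"name": "item", "value": 25}]}')

single_object = payload["wellDesign"][0]   # what $.wellDesign[0] selects
print(json.dumps(single_object))           # {"name": "item", "value": 25}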

Flatten Hive struct column or avro file using pyspark

I have a Hive table which has a struct data type column (sample below). The table is created on an Avro file.
Using pyspark, how can I flatten the records so that I get a simple data type value (not a struct, array, or list) in each column, to load into another Hive table?
I can use either the Hive table or the Avro file as the source.
Sample data:
Hive Column Name: Contract_Data
{"contract":
{"contractcode":"CCC",
unit:
{"value":"LOCAL",
desc:"LOCAL"},
segmentlist:
{"segment":[ #"segment" is array of struct here
{"transaction":"1",
"plans":
{"identifier":[ #"identifier" is array of struct here
{"value":"123","desc":"L1"},
{"value":"456","desc":"L2"}]
}
}]
}
},
plans:
{"listplans":[ #"listplans" is array of struct here
{"plantype":"M",
plandesign:
{"value":"PV","desc":"PD"},
state:
{"value":"ST","desc":"ST"}
}]
}
}
You can first read the Hive table as a Spark DataFrame as below.
df = spark.table("<DB_NAME>.<Table_Name>")
Then you can use the explode function from Spark's DataFrame API to flatten the structure. Please find below sample code, which should work.
from pyspark.sql.functions import explode

# explode applies to array columns, e.g. the "segment" array inside the struct
df.select(explode("Contract_Data.contract.segmentlist.segment"))
If the structure is nested, as I can see in your sample data above, you can apply explode multiple times.
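As a more concrete illustration, here is a hedged sketch of what "applying explode multiple times" could look like for the sample structure above; the column and field names are taken from the question and may differ from your actual schema:
from pyspark.sql.functions import col, explode

df = spark.table("<DB_NAME>.<Table_Name>")

flattened = (
    df
    # first explode the array of segments
    .select(
        col("Contract_Data.contract.contractcode").alias("contractcode"),
        col("Contract_Data.contract.unit.value").alias("unit_value"),
        col("Contract_Data.contract.unit.desc").alias("unit_desc"),
        explode(col("Contract_Data.contract.segmentlist.segment")).alias("segment"),
    )
    # then explode the identifier array nested inside each segment
    .select(
        "contractcode", "unit_value", "unit_desc",
        col("segment.transaction").alias("transaction"),
        explode(col("segment.plans.identifier")).alias("identifier"),
    )
    # finally pull the struct fields out into simple columns
    .select(
        "contractcode", "unit_value", "unit_desc", "transaction",
        col("identifier.value").alias("identifier_value"),
        col("identifier.desc").alias("identifier_desc"),
    )
)
flattened.show(truncate=False)
The same pattern (dot notation for struct fields, explode for arrays) can be repeated for the plans.listplans branch, and the flattened result can then be written to the target Hive table.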
Hope it helps.
Regards,
Neeraj

mule able to modify json output datamapper

I'm trying to convert CSV to JSON format using Mule DataMapper, and it is working fine.
Below is the output it produces:
[ {
"propertyState" : "AL",
"propertyCity" : "NJ",
"propertyZipCode" : "67890",
"propertyCounty" : "US"
} ]
But I want to remove the [ ] from the JSON output. Is it possible using DataMapper?
[] defines a List of elements. It is required if your elements are repeating, and it is valid JSON format.
If you don't want the [] to be there, the workaround is to use <json:json-to-object-transformer returnClass="java.util.List" doc:name="JSON to List" /> to extract each element's value from the JSON payload and then build your payload using an expression.
But again, this is not a recommended approach: since you will be getting multiple rows from the CSV file, the JSON should keep the [], which represents them as a List and is the valid format.

Loading data into Google Big Query

My question is the following:
Let's say I have a JSON file that I want to load into BigQuery.
It contains these two lines of data:
{"value":"123"}
{"value": 123 }
I have defined the following schema for my data.
[
{ "name":"value", "type":"String"}
]
When I try to load the JSON file into BigQuery, it fails with the following error:
Field:value: Could not convert value to string
Is there a way to get around this issue other than transforming the data in the JSON file?
Thanks!
You can set the maxBadRecords property on the load job to skip a number of errors but still load the data.
Following your example, you could still load the data if you set it as:
"configuration": {
"load": {
"maxBadRecords": 1,
}
}
This is a way to get around the issue while still loading your JSON data into the table; the erroneous rows will simply be skipped. If you are loading a list of files, you could set it as a function of the number of files you are loading (e.g. maxBadRecords = 20 * fileCount).
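If you are loading from Python rather than through the raw REST API, a rough equivalent using the google-cloud-bigquery client library might look like the sketch below; the bucket, file, and table names are placeholders:
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    schema=[bigquery.SchemaField("value", "STRING")],
    max_bad_records=1,  # skip up to one row that cannot be converted
)

load_job = client.load_table_from_uri(
    "gs://<bucket>/data.json",
    "<project>.<dataset>.<table>",
    job_config=job_config,
)
load_job.result()  # waits for completion; the job fails only if bad rows exceed max_bad_records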