My Spark application running on AWS EMR loads data from a JSON array stored in S3. The DataFrame created from it is then processed by the Spark engine.
My source JSON data is spread across multiple S3 objects. I need to compact them into a single JSON array to reduce the number of S3 objects my Spark application has to read. I tried using "s3-dist-cp --groupBy", but the result is concatenated JSON data which is not itself a valid JSON file, so I cannot create a DataFrame from it.
Here is a simplified example to illustrate this further.
Source data :
S3 Object Record1.json : {"Name" : "John", "City" : "London"}
S3 Object Record2.json : {"Name" : "Mary" , "City" : "Paris"}
s3-dist-cp --src s3://source/ --dest s3://dest/ --groupBy='.*Record.*(\w+)'
Aggregated output
{"Name" : "Mary" , "City" : "Paris"}{"Name" : "John", "City" : "London"}
What I need :
[{"Name" : "John", "City" : "London"},{"Name" : "Mary" , "City" : "Paris"}]
Application code for reference
import org.apache.spark.sql.types.{StringType, StructType}

val schema = new StructType()
  .add("Name", StringType, true)
  .add("City", StringType, true)

val df = spark.read.option("multiline", "true").schema(schema).json("test.json")
df.show()
Expected output
+----+------+
|Name| City|
+----+------+
|John|London|
|Mary| Paris|
+----+------+
Is s3-dist-cp the right tool for this? Are there any other suggestions for aggregating JSON data so that it can be loaded by a Spark application as a DataFrame?
Alternatively, you can use regexp_replace to turn the single-line concatenated JSON string into multiple JSON lines before it is parsed into a Dataset.
Check the sample below:
import org.apache.spark.sql.functions.{col, from_json, regexp_replace}

// insert a newline between the concatenated objects before parsing with the schema
val df = spark.read.text("test.json")
  .withColumn("json", from_json(regexp_replace(col("value"), "\\}\\{", "}\n{"), schema))
  .select("json.*")
df.show()
About regexp_replace:
Pyspark replace strings in Spark dataframe column
I want to extract data from Hive using a SQL query, convert the result to a nested DataFrame, and push it into MongoDB using Spark.
Can anyone suggest an efficient way to do that?
eg:
Flat query result -->
{"columnA":123213 ,"Column3 : 23,"Column4" : null,"Column5" : "abc"}
Nested Record to be pushed to mongo -->
{
  "columnA": 123213,
  "newcolumn": {
    "Column3": 23,
    "Column4": null,
    "Column5": "abc"
  }
}
You may use the map function in Spark SQL to achieve the desired transformation, e.g.
df.selectExpr("ColumnA","map('Column3',Column3,'Column4',Column4,'Column5',Column5) as newcolumn")
Or you may run the following on your Spark session after creating a temp view:
df.createOrReplaceTempView("my_temp_view")
sparkSession.sql("<insert sql below here>")
SELECT
ColumnA,
map(
"Column3",Column3,
"Column4",Column4,
"Column5",Column5
) as newcolumn
FROM
my_temp_view
Moreover, if this is the only transformation you need, you can run this query directly on Hive as well.
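Putting the pieces together, here is a minimal PySpark sketch of the end-to-end flow. The Hive table name and the MongoDB uri/database/collection option values are placeholders, and it assumes the MongoDB Spark connector is available on the classpath:
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# build the nested column with map() directly against the Hive table
# ("<hive_db>.<hive_table>" is a placeholder for the real table name)
nested_df = spark.sql("""
    SELECT ColumnA,
           map('Column3', Column3, 'Column4', Column4, 'Column5', Column5) AS newcolumn
    FROM <hive_db>.<hive_table>
""")

# push the nested records to MongoDB; all option values are placeholders
(nested_df.write
    .format("com.mongodb.spark.sql.DefaultSource")
    .mode("append")
    .option("uri", "mongodb://<host>:<port>")
    .option("database", "<database>")
    .option("collection", "<collection>")
    .save())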
Additional resources:
Spark Writing to Mongo
Let me know if this works for you.
To create the nested structure for your Hive DataFrame, we can try something like:
from pyspark.sql import functions as F

# wrap the three columns into a single struct column named "newcolumn"
df = df.withColumn(
    "newcolumn",
    F.struct(
        F.col("Column3").alias("Column3"),
        F.col("Column4").alias("Column4"),
        F.col("Column5").alias("Column5")
    )
)
This is followed by groupBy and F.collect_list to create a nested array wrapped in a single record, as sketched below.
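A rough sketch of that step; using ColumnA as the grouping key is an assumption taken from the sample record:
from pyspark.sql import functions as F

# group by the record key and collect the nested structs into an array column;
# ColumnA as the grouping key is an assumption based on the sample above
df = df.groupBy("ColumnA").agg(F.collect_list("newcolumn").alias("newcolumn"))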
We can then write this to Mongo (assuming the output URI is already configured on the Spark session):
df.write.format('com.mongodb.spark.sql.DefaultSource').mode("append").save()
I want to output JSON data not as an array object. I made the changes mentioned in the Pentaho documentation, but the output is always an array, even for a single set of values. I am using PDI 9.1, and I tested using the ktr from the link below:
https://wiki.pentaho.com/download/attachments/25043814/json_output.ktr?version=1&modificationDate=1389259055000&api=v2
The statement below is from https://wiki.pentaho.com/display/EAI/JSON+output:
Another special case is when 'Nr. rows in a block' = 1.
If used with an empty JSON block name, the output will look like:
{
"name" : "item",
"value" : 25
}
My output looks like the following:
{ "": [ {"name":"item","value":25} ] }
I resolved it myself. I added another JSON Input step with the path defined as below, to get the array as a string object:
$.wellDesign[0]
I'm trying to figure out the best way to push data from a dataframe (DF) into a SQL Server table. I did some research on this yesterday and came up with this.
import com.microsoft.azure.sqldb.spark.config.Config
import com.microsoft.azure.sqldb.spark.connect._
// Acquire a DataFrame collection (val collection)
val config = Config(Map(
"url" -> "my_sql_server.database.windows.net",
"databaseName" -> "my_db_name",
"dbTable" -> "dbo.my_table",
"user" -> "xxxxx",
"password" -> "xxxxx",
"connectTimeout" -> "5", //seconds
"queryTimeout" -> "5" //seconds
))
import org.apache.spark.sql.SaveMode
DF.write.mode(SaveMode.Append).sqlDB(config)
The idea is from this link.
https://docs.databricks.com/data/data-sources/sql-databases-azure.html#connect-to-spark-using-this-library
Everything works fine if I use the original DF headers, which are ordinal-position field names (_c0, _c1, _c2, etc.). I have to have these field names in my table to make this work, and obviously that's not sustainable. Is there a way to get the DF loaded into a table without matching header names (the order of the fields will always be the same in the DF and the table)? Or is there a better way to do this, like renaming the field names of the Spark DF? Thanks.
I found a solution!
val newNames = Seq("ID", "FName", "LName", "Address", "ZipCode", "file_name")
val dfRenamed = df.toDF(newNames: _*)
dfRenamed.printSchema
I have a Hive table which has a struct data type column (sample below). The table is created on an Avro file.
Using PySpark, how can I flatten the records so that I get a simple data type value (not a struct, array, or list) in each column, in order to load another Hive table?
I can use the Hive table or the Avro file as the source.
Sample data-
Hive Column Name: Contract_Data
{"contract":
{"contractcode":"CCC",
unit:
{"value":"LOCAL",
desc:"LOCAL"},
segmentlist:
{"segment":[ #"segment" is array of struct here
{"transaction":"1",
"plans":
{"identifier":[ #"identifier" is array of struct here
{"value":"123","desc":"L1"},
{"value":"456","desc":"L2"}]
}
}]
}
},
plans:
{"listplans":[ #"listplans" is array of struct here
{"plantype":"M",
plandesign:
{"value":"PV","desc":"PD"},
state:
{"value":"ST","desc":"ST"}
}]
}
}
You can first read the Hive table as a Spark DataFrame, as below:
df = spark.table("<DB_NAME>.<Table_Name>")
Then you can use the explode function from Spark's DataFrame API to flatten the arrays of structs (struct fields themselves can be selected with dot notation). PFB sample code, which should work:
from pyspark.sql.functions import col, explode
# "segment" is an array of structs in the sample, so explode turns it into one row per element
df.select(explode(col("Contract_Data.contract.segmentlist.segment")).alias("segment"))
If the structure is nested, as I can see in your sample data above, you can apply explode multiple times, as sketched below.
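A minimal sketch of that flattening, assuming the field names from the sample above (struct fields are selected with dot notation, and each array of structs gets its own explode in a separate select); adjust the names and the output table to your real schema:
from pyspark.sql.functions import col, explode

df = spark.table("<DB_NAME>.<Table_Name>")

flat = (
    df
    # pick the simple struct fields and explode the first array of structs
    .select(
        col("Contract_Data.contract.contractcode").alias("contractcode"),
        col("Contract_Data.contract.unit.value").alias("unit_value"),
        col("Contract_Data.contract.unit.desc").alias("unit_desc"),
        explode(col("Contract_Data.contract.segmentlist.segment")).alias("segment"),
    )
    # "segment" contains another array of structs ("identifier"), so explode again
    .select(
        "contractcode",
        "unit_value",
        "unit_desc",
        col("segment.transaction").alias("transaction"),
        explode(col("segment.plans.identifier")).alias("identifier"),
    )
    # keep only simple leaf values
    .select(
        "contractcode",
        "unit_value",
        "unit_desc",
        "transaction",
        col("identifier.value").alias("identifier_value"),
        col("identifier.desc").alias("identifier_desc"),
    )
)

# load the flattened rows into another Hive table (name is a placeholder)
flat.write.mode("overwrite").saveAsTable("<DB_NAME>.<Flattened_Table>")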
Hope it helps.
Regards,
Neeraj
I'm trying to convert CSV to JSON format using the Mule DataMapper, and it is working fine.
Below is the output it produces:
[ {
"propertyState" : "AL",
"propertyCity" : "NJ",
"propertyZipCode" : "67890",
"propertyCounty" : "US"
} ]
But I want to remove the [ ] from the JSON output. Is that possible using DataMapper?
[] defines a list of elements. It is required if your elements repeat, and it is valid JSON format.
If you don't want the [] to be there, the workaround is to use <json:json-to-object-transformer returnClass="java.util.List" doc:name="JSON to List" /> to extract each element's value from the JSON payload and build your payload using an expression.
But again, this is not a recommended approach: you will be getting multiple rows from the CSV file, so your JSON should always have [], since that represents the data as a list, which is the valid format.