Best Practice to Write Dataframe to Azure SQL Server Table? - dataframe

I'm trying to figure out the best way to push data from a dataframe (DF) into a SQL Server table. I did some research on this yesterday and came up with this.
import com.microsoft.azure.sqldb.spark.config.Config
import com.microsoft.azure.sqldb.spark.connect._
// Acquire a DataFrame collection (val collection)
val config = Config(Map(
"url" -> "my_sql_server.database.windows.net",
"databaseName" -> "my_db_name",
"dbTable" -> "dbo.my_table",
"user" -> "xxxxx",
"password" -> "xxxxx",
"connectTimeout" -> "5", //seconds
"queryTimeout" -> "5" //seconds
))
import org.apache.spark.sql.SaveMode
DF.write.mode(SaveMode.Append).sqlDB(config)
The idea is from this link.
https://docs.databricks.com/data/data-sources/sql-databases-azure.html#connect-to-spark-using-this-library
Everything works fine if I use the original DF headers, which are just ordinal positions used as field names (_c0, _c1, _c2, etc.), but then my table has to have these field names too. Obviously, that's not sustainable. Is there a way to load the DF into a table without matching header names (the order of the fields will always be the same in the DF and the table)? Or is there a better way to do this, like renaming the fields of the Spark DF? Thanks.

I found a solution!
val newNames = Seq("ID", "FName", "LName", "Address", "ZipCode", "file_name")
val dfRenamed = df.toDF(newNames: _*)
dfRenamed.printSchema
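A minimal sketch in plain Python of the same positional rename, with a guard for the case where the number of names does not match the number of columns. The helper `positional_rename` and the `_c0`..`_c5` header are assumptions for illustration; they are not part of Spark.

```python
def positional_rename(current, new_names):
    """Map existing column names to new ones purely by position."""
    if len(current) != len(new_names):
        raise ValueError(
            f"expected {len(current)} names, got {len(new_names)}"
        )
    if len(set(new_names)) != len(new_names):
        raise ValueError("new column names must be unique")
    return dict(zip(current, new_names))


# Hypothetical header of a DataFrame read from a headerless file.
current = ["_c0", "_c1", "_c2", "_c3", "_c4", "_c5"]
new_names = ["ID", "FName", "LName", "Address", "ZipCode", "file_name"]

mapping = positional_rename(current, new_names)
print(mapping)
```

With a PySpark DataFrame, the validated names could then be applied with `df.toDF(*new_names)`, which mirrors the Scala `toDF(newNames: _*)` call.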

Related

Push data to mongoDB using spark from hive

I want to extract data from Hive using a SQL query, convert it to a nested DataFrame, and push it into MongoDB using Spark.
Can anyone suggest an efficient way to do that?
e.g.:
Flat query result -->
{"columnA":123213 ,"Column3" : 23,"Column4" : null,"Column5" : "abc"}
Nested Record to be pushed to mongo -->
{
"columnA":123213,
"newcolumn" : {
"Column3" : 23,
"Column4" : null,
"Column5" : "abc"
}
}
You may use the map function in spark sql to achieve the desired transformation eg
df.selectExpr("ColumnA","map('Column3',Column3,'Column4',Column4,'Column5',Column5) as newcolumn")
or you may run the following on your spark session after creating a temp view
df.createOrReplaceTempView("my_temp_view")
sparkSession.sql("<insert sql below here>")
SELECT
ColumnA,
map(
"Column3",Column3,
"Column4",Column4,
"Column5",Column5
) as newcolumn
FROM
my_temp_view
Moreover, if this is the only transformation that you wish to use, you may run this query on hive also.
Additional resources:
Spark Writing to Mongo
Let me know if this works for you.
For a nested level array for your hive dataframe we can try something like:
from pyspark.sql import functions as F
df.withColumn(
"newcolumn",
F.struct(
F.col("Column3").alias("Column3"),
F.col("Column4").alias("Column4"),
F.col("Column5").alias("Column5")
)
)
followed by groupBy and F.collect_list to create a nested array wrapped in a single record.
We can then write this to MongoDB:
df.write.format('com.mongodb.spark.sql.DefaultSource').mode("append").save()
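To make the target shape concrete, here is a small plain-Python sketch of the flat-to-nested reshaping on a single record, using the example from the question. The helper `nest_fields` is hypothetical and only illustrates the record-level transformation; in Spark the same effect comes from the struct or map expressions shown above.

```python
import json


def nest_fields(record, keys, target):
    """Return a copy of record with `keys` moved under a `target` sub-document."""
    nested = {k: record[k] for k in keys if k in record}
    result = {k: v for k, v in record.items() if k not in keys}
    result[target] = nested
    return result


flat_record = {"columnA": 123213, "Column3": 23, "Column4": None, "Column5": "abc"}
nested_record = nest_fields(
    flat_record, ["Column3", "Column4", "Column5"], "newcolumn"
)
print(json.dumps(nested_record, indent=2))
```

Note that the null in Column4 is kept as an explicit null inside the sub-document rather than dropped.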

dask read parquet and specify schema

Is there a dask equivalent of spark's ability to specify a schema when reading in a parquet file? Possibly using kwargs passed to pyarrow?
I have a bunch of parquet files in a bucket, but some of the fields have slightly inconsistent names. I could create a custom delayed function to handle these cases after reading them, but I'm hoping I could specify the schema when opening them via globbing. Maybe not though, as I guess opening them via globbing is going to try to concatenate them. This currently fails because of the inconsistent field names.
Create a parquet file:
import dask.dataframe as dd
df = dd.demo.make_timeseries(
start="2000-01-01",
end="2000-01-03",
dtypes={"id": int, "z": int},
freq="1h",
partition_freq="24h",
)
df.to_parquet("df.parquet", engine="pyarrow", overwrite=True)
Read it in via dask and specify the schema after reading:
df = dd.read_parquet("df.parquet", engine="pyarrow")
df["z"] = df["z"].astype("float")
df = df.rename(columns={"z": "a"})
Read it in via spark and specify the schema:
from pyspark.sql import SparkSession
import pyspark.sql.types as T
spark = SparkSession.builder.appName('App').getOrCreate()
schema = T.StructType(
[
T.StructField("id", T.IntegerType()),
T.StructField("a", T.FloatType()),
T.StructField("timestamp", T.TimestampType()),
]
)
df = spark.read.format("parquet").schema(schema).load("df.parquet")
Some of the options are:
Specify dtypes after loading (requires consistent column names):
custom_dtypes = {"a": float, "id": int, "timestamp": "datetime64[ns]"}
df = dd.read_parquet("df.parquet", engine="pyarrow").astype(custom_dtypes)
This currently fails because of the inconsistent field names.
If the column names are not the same across files, you might want to use a custom delayed before loading:
import glob
import pandas as pd
import dask.dataframe as dd
from dask import delayed
@delayed
def custom_load(path):
    df = pd.read_parquet(path)
    # some logic to ensure consistent columns
    # for example:
    if "z" in df.columns:
        df = df.rename(columns={"z": "a"}).astype(custom_dtypes)
    return df
dask_df = dd.from_delayed([custom_load(path) for path in glob.glob("some_path/*parquet")])
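The "logic to ensure consistent columns" could be driven by a table of known aliases instead of hard-coded `if` checks. A minimal sketch in plain Python, assuming a hypothetical `ALIASES` table and helper `rename_map` (neither is part of dask or pandas):

```python
# Hypothetical table of inconsistent names and the canonical name for each.
ALIASES = {"z": "a"}


def rename_map(columns, aliases):
    """Build an {old: canonical} mapping for the columns that use a known alias."""
    return {col: aliases[col] for col in columns if col in aliases}


# A file that uses the old name gets a one-entry mapping ...
old_style = rename_map(["id", "z"], ALIASES)
# ... while a file that already uses the canonical name gets an empty one.
new_style = rename_map(["id", "a"], ALIASES)
print(old_style, new_style)
```

Inside `custom_load`, the result could be passed as `df.rename(columns=rename_map(df.columns, ALIASES))`; an empty mapping leaves the frame unchanged.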

JSON aggregation using s3-dist-cp for Spark application consumption

My Spark application running on AWS EMR loads data from a JSON array stored in S3. The DataFrame created from it is then processed by the Spark engine.
My source JSON data is in the form of multiple S3 objects. I need to compact them into a JSON array to reduce the number of S3 objects to read from within my Spark application. I tried using "s3-dist-cp --groupBy", but the result is concatenated JSON data, which is not itself a valid JSON file, so I cannot create a DataFrame from it.
Here is a simplified example to illustrate it further.
Source data :
S3 Object Record1.json : {"Name" : "John", "City" : "London"}
S3 Object Record2.json : {"Name" : "Mary" , "City" : "Paris"}
s3-dist-cp --src s3://source/ --dest s3://dest/ --groupBy='.*Record.*(\w+)'
Aggregated output
{"Name" : "Mary" , "City" : "Paris"}{"Name" : "John", "City" : "London"}
What I need :
[{"Name" : "John", "City" : "London"},{"Name" : "Mary" , "City" : "Paris"}]
Application code for reference
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.types.StringType
val schema = new StructType()
.add("Name",StringType,true)
.add("City",StringType,true)
val df = spark.read.option("multiline","true").schema(schema).json("test.json")
df.show()
Expected output
+----+------+
|Name| City|
+----+------+
|John|London|
|Mary| Paris|
+----+------+
Is s3-dist-cp the right tool for my need? Any other suggestion for aggregating json data to be loaded by Spark app as Dataframe?
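Alternatively, you can use regexp_replace to insert a line break between the concatenated objects, split the string on those breaks, and parse each piece with from_json before turning the result into a dataset.
Check this sample:
import org.apache.spark.sql.functions._
// schema is the StructType defined in the question
val df = spark.read.text("test.json")
  // put a newline between back-to-back objects, then emit one row per object
  .withColumn("value", explode(split(regexp_replace(col("value"), "\\}\\{", "}\n{"), "\n")))
  .withColumn("json", from_json(col("value"), schema))
  .select("json.*")
df.show()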
About regexp_replace:
Pyspark replace strings in Spark dataframe column
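If the compaction can happen in a small preprocessing step outside Spark, the concatenated output can also be turned into the JSON array the question asks for without any regular expression. A minimal sketch using only the Python standard library; `split_concatenated_json` is a hypothetical helper name, and the input string is the aggregated output from the question.

```python
import json


def split_concatenated_json(text):
    """Parse back-to-back JSON values such as '{...}{...}' into a list."""
    decoder = json.JSONDecoder()
    records = []
    pos = 0
    while pos < len(text):
        # raw_decode does not skip leading whitespace, so do it here.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            break
        obj, pos = decoder.raw_decode(text, pos)
        records.append(obj)
    return records


aggregated = '{"Name" : "Mary" , "City" : "Paris"}{"Name" : "John", "City" : "London"}'
records = split_concatenated_json(aggregated)

# Serialising the list yields a valid JSON array.
print(json.dumps(records))
```

Unlike a `}{` pattern match, this approach is not confused by a `}{` sequence that happens to appear inside a string value.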

Need help on using Spark Filter

I am new to Apache Spark and need help forming either a SQL query or a Spark filter on a DataFrame.
Below is how my data is formed, i.e. I have a large number of users, each with data like the following.
{ "User1":"Joey", "Department": ["History","Maths","Geography"] }
I have multiple search conditions like the one below, where I need to search the array based on an operator defined by the user, for example and / or.
{
"SearchCondition":"1",
"Operator":"and",
"Department": ["Maths","Geography"]
}
Can someone point me toward how to achieve this in Spark?
Thanks,
-Jack
I assume you use Scala and you have parsed the data in a DataFrame
val df = spark.read.json(pathToFile)
I would use DataSets for this because they provide type safety
import spark.implicits._ // encoders needed by .as[User]
case class User(department: Array[String], user1: String)
val ds = df.as[User]
def pred(user: User): Boolean = Set("Geography","Maths").subsetOf(user.department.toSet)
ds.filter(pred _)
You can read more about DataSets here and here.
If you prefer to use Dataframes you can do it with user defined functions
import org.apache.spark.sql.functions._
val pred = udf((arr: Seq[String]) => Set("Geography","Maths").subsetOf(arr.toSet))
df.filter(pred($"Department"))
In the same package you can find a Spark built-in function for this. You can do
df.filter(array_contains($"Department", "Maths")).filter(array_contains($"Department", "Geography"))
but someone could argue that this is not so efficient and that the optimizer can't improve it much.
Note that for each search condition you need a different predicate.
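One way to get a different predicate per search condition is to build it from the condition itself. A minimal sketch in plain Python, assuming the condition layout from the question and that "and" means all listed departments must be present while "or" means at least one; `build_predicate` and the second user are hypothetical.

```python
def build_predicate(condition):
    """Turn a search condition into a function that tests one user record."""
    wanted = set(condition["Department"])
    operator = condition["Operator"].lower()
    if operator == "and":
        # every wanted department must be present
        return lambda user: wanted <= set(user["Department"])
    if operator == "or":
        # at least one wanted department must be present
        return lambda user: bool(wanted & set(user["Department"]))
    raise ValueError(f"unsupported operator: {condition['Operator']}")


users = [
    {"User1": "Joey", "Department": ["History", "Maths", "Geography"]},
    {"User1": "Ann", "Department": ["Maths"]},  # made-up second user
]
condition = {"SearchCondition": "1", "Operator": "and", "Department": ["Maths", "Geography"]}

matches_and = [u["User1"] for u in users if build_predicate(condition)(u)]
matches_or = [
    u["User1"] for u in users if build_predicate({**condition, "Operator": "or"})(u)
]
print(matches_and, matches_or)
```

The same set logic is what the `subsetOf` predicates above express for the "and" case.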

Creating User Defined Function in Spark-SQL

I am new to Spark and Spark SQL, and I was trying to query some data using Spark SQL.
I need to fetch the month from a date which is given as a string.
I don't think it is possible to query the month directly in Spark SQL, so I was thinking of writing a user-defined function in Scala.
Is it possible to write a UDF in Spark SQL, and if so, can anybody suggest the best way to write one?
You can do this, at least for filtering, if you're willing to use a language-integrated query.
For a data file dates.txt containing:
one,2014-06-01
two,2014-07-01
three,2014-08-01
four,2014-08-15
five,2014-09-15
You can pack as much Scala date magic in your UDF as you want but I'll keep it simple:
def myDateFilter(date: String) = date contains "-08-"
Set it all up as follows -- a lot of this is from the Programming guide.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext._
// case class for your records
case class Entry(name: String, when: String)
// read and parse the data
val entries = sc.textFile("dates.txt").map(_.split(",")).map(e => Entry(e(0),e(1)))
You can use the UDF as part of your WHERE clause:
val augustEntries = entries.where('when)(myDateFilter).select('name, 'when)
and see the results:
augustEntries.map(r => r(0)).collect().foreach(println)
Notice the version of the where method I've used, declared as follows in the doc:
def where[T1](arg1: Symbol)(udf: (T1) ⇒ Boolean): SchemaRDD
So, the UDF can only take one argument, but you can compose several .where() calls to filter on multiple columns.
Edit for Spark 1.2.0 (and really 1.1.0 too)
While it's not really documented, Spark now supports registering a UDF so it can be queried from SQL.
The above UDF could be registered using:
sqlContext.registerFunction("myDateFilter", myDateFilter)
and if the table was registered
sqlContext.registerRDDAsTable(entries, "entries")
it could be queried using
sqlContext.sql("SELECT * FROM entries WHERE myDateFilter(when)")
For more details see this example.
In Spark 2.0, you can do this:
// define the UDF
def convert2Years(date: String) = date.substring(7, 11)
// register to session
sparkSession.udf.register("convert2Years", convert2Years(_: String))
val moviesDf = getMoviesDf // create dataframe usual way
moviesDf.createOrReplaceTempView("movies") // 'movies' is used in sql below
val years = sparkSession.sql("select convert2Years(releaseDate) from movies")
In PySpark 1.5 and above, we can easily achieve this with a built-in function.
Following is an example:
from pyspark.sql.functions import to_date
raw_data = [
    ("2016-02-27 23:59:59", "Gold", 97450.56),
    ("2016-02-28 23:00:00", "Silver", 7894.23),
    ("2016-02-29 22:59:58", "Titanium", 234589.66)]
Time_Material_revenue_df = sqlContext.createDataFrame(raw_data, ["Sold_time", "Material", "Revenue"])
Day_Material_reveneu_df = Time_Material_revenue_df.select(to_date("Sold_time").alias("Sold_day"), "Material", "Revenue")
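For the original question (the month of a date given as a string), the body of such a function can be sketched in plain Python with the standard library. `month_of` is a hypothetical name, and the rows are the contents of dates.txt from the earlier answer; the default format string assumes dates like 2014-08-15.

```python
from datetime import datetime


def month_of(date_string, fmt="%Y-%m-%d"):
    """Return the month (1-12) of a date given as a string."""
    return datetime.strptime(date_string, fmt).month


rows = [
    ("one", "2014-06-01"),
    ("two", "2014-07-01"),
    ("three", "2014-08-01"),
    ("four", "2014-08-15"),
    ("five", "2014-09-15"),
]

# Keep only the August entries, like the myDateFilter example.
august = [name for name, when in rows if month_of(when) == 8]
print(august)
```

In PySpark itself a custom function may not be needed for this, since `pyspark.sql.functions` also provides a built-in `month` function.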