Push data to MongoDB using Spark from Hive - SQL

I want to extract data from Hive using a SQL query, convert the result to a nested DataFrame, and push it into MongoDB using Spark.
Can anyone suggest an efficient way to do that?
eg:
Flat query result -->
{"columnA":123213 ,"Column3 : 23,"Column4" : null,"Column5" : "abc"}
Nested Record to be pushed to mongo -->
{
"columnA":123213,
"newcolumn" : {
"Column3 : 23,
"Column4" : null,
"Column5" : "abc"
}
}

You may use the map function in Spark SQL to achieve the desired transformation, e.g.
df.selectExpr("ColumnA","map('Column3',Column3,'Column4',Column4,'Column5',Column5) as newcolumn")
or you may run the following on your spark session after creating a temp view
df.createOrReplaceTempView("my_temp_view")
sparkSession.sql("<insert sql below here>")
SELECT
  ColumnA,
  map(
    "Column3", Column3,
    "Column4", Column4,
    "Column5", Column5
  ) AS newcolumn
FROM
  my_temp_view
Moreover, if this is the only transformation that you wish to apply, you may run this query directly in Hive as well.
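One caveat worth noting: map() produces a Spark MapType, whose values all share a single type, while the target document above mixes a number, a null, and a string. If you want the nested object to keep each field's own type, a struct may fit better; here is a minimal sketch of that alternative (using named_struct, with the column names taken from the example above):
df.selectExpr(
    "ColumnA",
    "named_struct('Column3', Column3, 'Column4', Column4, 'Column5', Column5) as newcolumn"
)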
Additional resources:
Spark Writing to Mongo
Let me know if this works for you.

For a nested array in your Hive DataFrame, we can try something like:
from pyspark.sql import functions as F

df = df.withColumn(
    "newcolumn",
    F.struct(
        F.col("Column3").alias("Column3"),
        F.col("Column4").alias("Column4"),
        F.col("Column5").alias("Column5")
    )
)
followed by a groupBy and F.collect_list to create a nested array wrapped in a single record, as sketched below.
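A minimal sketch of that grouping step, assuming df already carries the newcolumn struct from the snippet above and that ColumnA is the grouping key (an assumption; adjust to your actual schema):
from pyspark.sql import functions as F

# group on ColumnA (assumed key) and collect the structs into a nested array
df = df.groupBy("ColumnA").agg(F.collect_list("newcolumn").alias("newcolumn"))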
We can then write this to Mongo:
df.write.format('com.mongodb.spark.sql.DefaultSource').mode("append").save()
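Note that the connector still needs to know where to write. As a sketch, assuming the MongoDB Spark connector 2.x is on the classpath, the target can be supplied either through spark.mongodb.output.uri in the Spark config or as options on the writer (the URI, database, and collection names below are placeholders):
# placeholder connection details; replace with your own
df.write.format("com.mongodb.spark.sql.DefaultSource") \
    .mode("append") \
    .option("uri", "mongodb://<host>:27017") \
    .option("database", "<db>") \
    .option("collection", "<collection>") \
    .save()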

Related

Flatten Hive struct column or avro file using pyspark

I have a Hive table which has a struct data type column (sample below). The table is created on an Avro file.
Using PySpark, how can I flatten the records so that I get a simple data type value (not a struct, array, or list) in each column, to load into another Hive table?
I can use the Hive table or the Avro file as the source.
Sample data-
Hive Column Name: Contract_Data
{"contract":
{"contractcode":"CCC",
unit:
{"value":"LOCAL",
desc:"LOCAL"},
segmentlist:
{"segment":[ #"segment" is array of struct here
{"transaction":"1",
"plans":
{"identifier":[ #"identifier" is array of struct here
{"value":"123","desc":"L1"},
{"value":"456","desc":"L2"}]
}
}]
}
},
plans:
{"listplans":[ #"listplans" is array of struct here
{"plantype":"M",
plandesign:
{"value":"PV","desc":"PD"},
state:
{"value":"ST","desc":"ST"}
}]
}
}
You can first read the Hive table as a Spark DataFrame as below.
df = spark.table("<DB_NAME>.<Table_Name>")
Then you can use the explode function from Spark's DataFrame API to flatten the structure. Note that explode applies to array (or map) columns, so point it at the array fields inside the struct; scalar struct fields can be pulled out with plain dot notation. Below is sample code which should work.
from pyspark.sql.functions import explode
df.select(explode("Contract_Data.contract.segmentlist.segment").alias("segment"))
Since the structure is nested, as I can see in your sample data above, you can apply explode multiple times.
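As a fuller sketch, assuming the sample schema above is accurate (the field paths and output column names are illustrative and will need adjusting to your actual Avro/Hive schema), the flattening could look like this:
from pyspark.sql.functions import col, explode

# field paths below are taken from the sample data; adjust to your real schema
flat_df = (
    df
    # explode the array of segments under the contract
    .withColumn("segment", explode(col("Contract_Data.contract.segmentlist.segment")))
    # explode the array of plan identifiers inside each segment
    .withColumn("identifier", explode(col("segment.plans.identifier")))
    .select(
        col("Contract_Data.contract.contractcode").alias("contractcode"),
        col("Contract_Data.contract.unit.value").alias("unit_value"),
        col("segment.transaction").alias("transaction"),
        col("identifier.value").alias("identifier_value"),
        col("identifier.desc").alias("identifier_desc")
    )
)
# every column is now a simple type, so it can be loaded into another Hive table
flat_df.write.mode("overwrite").saveAsTable("<DB_NAME>.<Flat_Table_Name>")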
Hope it helps.
Regards,
Neeraj

Need help on using Spark Filter

I am new to Apache Spark and need help in forming either a SQL query or a Spark filter on a DataFrame.
Below is how my data is formed, i.e. I have a large number of users, each with data like the following.
{ "User1":"Joey", "Department": ["History","Maths","Geography"] }
I have multiple search conditions like the ones below, where I need to search the array of data based on an operator defined by the user, for example and / or.
{
"SearchCondition":"1",
"Operator":"and",
"Department": ["Maths","Geography"]
}
Can you point me to a way to achieve this in Spark?
Thanks,
-Jack
I assume you use Scala and you have parsed the data into a DataFrame
val df = spark.read.json(pathToFile)
I would use DataSets for this because they provide type safety
case class User(department: Array[String], user1: String)
val ds = df.as[User]
def pred(user: User): Boolean = Set("Geography", "Maths").subsetOf(user.department.toSet)
ds.filter(pred _)
You can read more about DataSets here and here.
If you prefer to use Dataframes you can do it with user defined functions
import org.apache.spark.sql.functions._
val pred = udf((arr: Seq[String]) => Set("Geography", "Maths").subsetOf(arr.toSet))
df.filter(pred($"Department"))
In the same package you can find a Spark built-in function for this. You can do
df.filter(array_contains($"Department", "Maths")).filter(array_contains($"Department", "Geography"))
but someone could argue that this is not as efficient and that the optimizer can't improve it a lot.
Note that for each search condition you need a different predicate.

Reading data from SQL Server using Spark SQL

Is it possible to read data from Microsoft SQL Server (and Oracle, MySQL, etc.) into an RDD in a Spark application? Or do we need to create an in-memory set and parallelize that into an RDD?
In Spark 1.4.0+ you can now use sqlContext.read.jdbc
That will give you a DataFrame instead of an RDD of Row objects.
The equivalent to the solution you posted above would be
sqlContext.read.jdbc("jdbc:sqlserver://omnimirror;databaseName=moneycorp;integratedSecurity=true;", "TABLE_NAME", "id", 1, 100000, 1000, new java.util.Properties)
It should pick up the schema of the table, but if you'd like to force it, you can use the schema method after read: sqlContext.read.schema(...insert schema here...).jdbc(...rest of the things...)
Note that you won't get an RDD of SomeClass here (which is nicer in my view). Instead you'll get a DataFrame of the relevant fields.
More information can be found here: http://spark.apache.org/docs/latest/sql-programming-guide.html#jdbc-to-other-databases
Found a solution to this from the mailing list. JdbcRDD can be used to accomplish this. I needed to get the MS SQL Server JDBC driver JAR and add it to the lib for my project. I wanted to use integrated security, so I needed to put sqljdbc_auth.dll (available in the same download) in a location that java.library.path can see. Then the code looks like this:
import java.sql.{DriverManager, ResultSet}
import org.apache.spark.rdd.JdbcRDD

val rdd = new JdbcRDD[SomeClass](sc,
  () => DriverManager.getConnection(
    "jdbc:sqlserver://omnimirror;databaseName=moneycorp;integratedSecurity=true;"),
  "SELECT * FROM TABLE_NAME Where ? < X and X < ?",
  1, 100000, 1000,
  (r: ResultSet) => SomeClass(r.getString("Col1"), r.getString("Col2"), r.getString("Col3")))
This gives an RDD of SomeClass. The lower bound, upper bound, and number of partitions parameters are required and are used to split the query across partitions; in other words, the source data needs to be partitionable by longs for this to work.

Spark SQL count() returns wrong number

I'm new to Apache Spark and Scala (also a beginner with Hadoop in general).
I completed the Spark SQL tutorial: https://spark.apache.org/docs/latest/sql-programming-guide.html
I tried to perform a simple query on a standard csv file to benchmark its performance on my current cluster.
I used data from https://s3.amazonaws.com/hw-sandbox/tutorial1/NYSE-2000-2001.tsv.gz, converted it to csv and copy/pasted the data to make it 10 times as big.
I loaded it into Spark using Scala:
// sc is an existing SparkContext.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
// createSchemaRDD is used to implicitly convert an RDD to a SchemaRDD.
import sqlContext.createSchemaRDD
Define classes:
case class datum(exchange: String,stock_symbol: String,date: String,stock_price_open: Double,stock_price_high: Double,stock_price_low: Double,stock_price_close: Double,stock_volume: String,stock_price_adj_close: Double)
Read in data:
val data = sc.textFile("input.csv").map(_.split(";")).filter(line => "exchange" != "exchange").map(p => datum(p(0).trim.toString, p(1).trim.toString, p(2).trim.toString, p(3).trim.toDouble, p(4).trim.toDouble, p(5).trim.toDouble, p(6).trim.toDouble, p(7).trim.toString, p(8).trim.toDouble))
Convert to table:
data.registerAsTable("data")
Define query (list all rows with 'IBM' as stock symbol):
val IBMs = sqlContext.sql("SELECT * FROM data WHERE stock_symbol ='IBM'")
Perform count so query actually runs:
IBMs.count()
The query runs fine, but returns res: 0 instead of 5000 (which is what it returns using Hive with MapReduce).
filter(line => "exchange" != "exchange")
Since "exchange" is equal to "exchange" filter will return a collection of size 0. And since there is no data, querying for any result will return 0. You need to re-write your logic.

Creating User Defined Function in Spark-SQL

I am new to Spark and Spark SQL, and I was trying to query some data using Spark SQL.
I need to fetch the month from a date which is given as a string.
I think it is not possible to query the month directly in Spark SQL, so I was thinking of writing a user defined function in Scala.
Is it possible to write a UDF in Spark SQL, and if so, can anybody suggest the best method of writing one?
You can do this, at least for filtering, if you're willing to use a language-integrated query.
For a data file dates.txt containing:
one,2014-06-01
two,2014-07-01
three,2014-08-01
four,2014-08-15
five,2014-09-15
You can pack as much Scala date magic in your UDF as you want but I'll keep it simple:
def myDateFilter(date: String) = date contains "-08-"
Set it all up as follows -- a lot of this is from the Programming guide.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext._
// case class for your records
case class Entry(name: String, when: String)
// read and parse the data
val entries = sc.textFile("dates.txt").map(_.split(",")).map(e => Entry(e(0),e(1)))
You can use the UDF as part of your WHERE clause:
val augustEntries = entries.where('when)(myDateFilter).select('name, 'when)
and see the results:
augustEntries.map(r => r(0)).collect().foreach(println)
Notice the version of the where method I've used, declared as follows in the doc:
def where[T1](arg1: Symbol)(udf: (T1) ⇒ Boolean): SchemaRDD
So, the UDF can only take one argument, but you can compose several .where() calls to filter on multiple columns.
Edit for Spark 1.2.0 (and really 1.1.0 too)
While it's not really documented, Spark now supports registering a UDF so it can be queried from SQL.
The above UDF could be registered using:
sqlContext.registerFunction("myDateFilter", myDateFilter)
and if the table was registered
sqlContext.registerRDDAsTable(entries, "entries")
it could be queried using
sqlContext.sql("SELECT * FROM entries WHERE myDateFilter(when)")
For more details see this example.
In Spark 2.0, you can do this:
// define the UDF
def convert2Years(date: String) = date.substring(7, 11)
// register to session
sparkSession.udf.register("convert2Years", convert2Years(_: String))
val moviesDf = getMoviesDf // create dataframe usual way
moviesDf.createOrReplaceTempView("movies") // 'movies' is used in sql below
val years = sparkSession.sql("select convert2Years(releaseDate) from movies")
In PySpark 1.5 and above, we can easily achieve this with builtin functions.
Following is an example:
raw_data = [
    ("2016-02-27 23:59:59", "Gold", 97450.56),
    ("2016-02-28 23:00:00", "Silver", 7894.23),
    ("2016-02-29 22:59:58", "Titanium", 234589.66)
]
Time_Material_revenue_df = sqlContext.createDataFrame(raw_data, ["Sold_time", "Material", "Revenue"])
from pyspark.sql.functions import *
Day_Material_reveneu_df = Time_Material_revenue_df.select(to_date("Sold_time").alias("Sold_day"), "Material", "Revenue")
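Since the question asks for the month specifically, pyspark.sql.functions also provides month() (alongside year(), dayofmonth(), etc.); a minimal follow-on sketch (the Month_Material_revenue_df name is just illustrative):
from pyspark.sql.functions import month, to_date

# extract the month number directly from the sold timestamp string
Month_Material_revenue_df = Time_Material_revenue_df.select(
    month(to_date("Sold_time")).alias("Sold_month"), "Material", "Revenue")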