Spark SQL count() returns wrong number

I'm new to Apache Spark and Scala (also a beginner with Hadoop in general).
I completed the Spark SQL tutorial: https://spark.apache.org/docs/latest/sql-programming-guide.html
I tried to perform a simple query on a standard csv file to benchmark its performance on my current cluster.
I used data from https://s3.amazonaws.com/hw-sandbox/tutorial1/NYSE-2000-2001.tsv.gz, converted it to csv and copy/pasted the data to make it 10 times as big.
I loaded it into Spark using Scala:
// sc is an existing SparkContext.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
// createSchemaRDD is used to implicitly convert an RDD to a SchemaRDD.
import sqlContext.createSchemaRDD
Define classes:
case class datum(
  exchange: String,
  stock_symbol: String,
  date: String,
  stock_price_open: Double,
  stock_price_high: Double,
  stock_price_low: Double,
  stock_price_close: Double,
  stock_volume: String,
  stock_price_adj_close: Double)
Read in data:
val data = sc.textFile("input.csv")
  .map(_.split(";"))
  .filter(line => "exchange" != "exchange")
  .map(p => datum(p(0).trim.toString, p(1).trim.toString, p(2).trim.toString,
    p(3).trim.toDouble, p(4).trim.toDouble, p(5).trim.toDouble,
    p(6).trim.toDouble, p(7).trim.toString, p(8).trim.toDouble))
Convert to table:
data.registerAsTable("data")
Define query (list all rows with 'IBM' as stock symbol):
val IBMs = sqlContext.sql("SELECT * FROM data WHERE stock_symbol ='IBM'")
Perform count so query actually runs:
IBMs.count()
The query runs fine, but returns res: 0 instead of 5000 (which is what it returns using Hive with MapReduce).

filter(line => "exchange" != "exchange")
Since "exchange" is equal to "exchange" filter will return a collection of size 0. And since there is no data, querying for any result will return 0. You need to re-write your logic.

Related

Apache Spark: using plain SQL queries vs using Spark SQL methods

I'm very new to Apache Spark.
I have a very basic question: which of the two approaches below performs better: passing a plain SQL query, or using Spark SQL methods like select, filter, etc.?
Here's a short example in Java that illustrates my question.
// requires: import static org.apache.spark.sql.functions.col;
private static void queryVsSparkSQL() throws AnalysisException {
    SparkConf conf = new SparkConf();
    SparkSession spark = SparkSession
        .builder()
        .master("local[4]")
        .config(conf)
        .appName("queryVsSparkSQL")
        .getOrCreate();

    // using a predefined query
    Dataset<Row> ds1 = spark
        .read()
        .format("jdbc")
        .option("url", "jdbc:oracle:thin:hr/hr@localhost:1521/orcl")
        .option("user", "hr")
        .option("password", "hr")
        .option("query", "select * from hr.employees t where t.last_name = 'King'")
        .load();

    ds1.show();

    // using Spark SQL methods: select, filter
    Dataset<Row> ds2 = spark
        .read()
        .format("jdbc")
        .option("url", "jdbc:oracle:thin:hr/hr@localhost:1521/orcl")
        .option("user", "hr")
        .option("password", "hr")
        .option("dbtable", "hr.employees")
        .load()
        .select("*")
        .filter(col("last_name").equalTo("King"));

    ds2.show();
}
Try .explain on your second query and check whether a pushdown predicate is used.
It should be there in that second case. If so, it is technically equivalent in performance to passing the explicit query with the filter already contained in the query option.
Below is a simulated version against MySQL, based on your approach.
CASE 1: select statement via passed query containing filter
val dataframe_mysql = sqlContext.read.format("jdbc").option("url", "jdbc:mysql://mysql-rfam-public.ebi.ac.uk:4497/Rfam").option("driver", "org.mariadb.jdbc.Driver").option("query","select * from family where rfam_acc = 'RF01527'").option("user", "rfamro").load().explain()
== Physical Plan ==
*(1) Scan JDBCRelation((select * from family where rfam_acc = 'RF01527') SPARK_GEN_SUBQ_4) [numPartitions=1] [rfam_acc#867,rfam_id#868,auto_wiki#869L,description#870,author#871,seed_source#872,gathering_cutoff#873,trusted_cutoff#874,noise_cutoff#875,comment#876,previous_id#877,cmbuild#878,cmcalibrate#879,cmsearch#880,num_seed#881L,num_full#882L,num_genome_seq#883L,num_refseq#884L,type#885,structure_source#886,number_of_species#887L,number_3d_structures#888,num_pseudonokts#889,tax_seed#890,... 11 more fields] PushedFilters: [], ReadSchema: struct<rfam_acc:string,rfam_id:string,auto_wiki:bigint,description:string,author:string,seed_sour...
Here PushedFilters is empty because a full query was passed: the filter is already contained in the query that is sent to the database.
CASE 2: No select statement, rather using Spark SQL APIs referencing a filter
val dataframe_mysql = sqlContext.read
  .format("jdbc")
  .option("url", "jdbc:mysql://mysql-rfam-public.ebi.ac.uk:4497/Rfam")
  .option("driver", "org.mariadb.jdbc.Driver")
  .option("dbtable", "family")
  .option("user", "rfamro")
  .load()
  .select("*")
  .filter(col("rfam_acc").equalTo("RF01527"))
  .explain()
== Physical Plan ==
*(1) Scan JDBCRelation(family) [numPartitions=1] [rfam_acc#1149,rfam_id#1150,auto_wiki#1151L,description#1152,author#1153,seed_source#1154,gathering_cutoff#1155,trusted_cutoff#1156,noise_cutoff#1157,comment#1158,previous_id#1159,cmbuild#1160,cmcalibrate#1161,cmsearch#1162,num_seed#1163L,num_full#1164L,num_genome_seq#1165L,num_refseq#1166L,type#1167,structure_source#1168,number_of_species#1169L,number_3d_structures#1170,num_pseudonokts#1171,tax_seed#1172,... 11 more fields] PushedFilters: [*IsNotNull(rfam_acc), *EqualTo(rfam_acc,RF01527)], ReadSchema: struct<rfam_acc:string,rfam_id:string,auto_wiki:bigint,description:string,author:string,seed_sour...
PushedFilters is set to the filter criteria, so the filtering is applied in the database itself before data is returned to Spark. Note the * on the PushedFilters entries, which signifies that the filter is handled at the data source, i.e. the database.
Summary
I ran both options and both were quick. They are equivalent in terms of the processing done in the database: only the filtered rows are returned to Spark, via two different mechanisms that end up with the same physical behaviour, performance, and results.

Need help on using Spark Filter

I am new to Apache Spark and need help forming either a SQL query or a Spark filter on a DataFrame.
Below is how my data is formed, i.e. I have a large number of users, each with data like the following.
{ "User1":"Joey", "Department": ["History","Maths","Geography"] }
I have multiple search conditions like the one below, where I need to match the array of departments according to an operator defined by the user, for example and / or.
{
  "SearchCondition": "1",
  "Operator": "and",
  "Department": ["Maths", "Geography"]
}
Can someone point me in the right direction for achieving this in Spark?
Thanks,
-Jack
I assume you are using Scala and have parsed the data into a DataFrame:
val df = spark.read.json(pathToFile)
I would use DataSets for this because they provide type safety
case class User(department: Array[String], user1: String)
val ds = df.as[User]
def pred(user: User): Boolean = Set("Geography", "Maths").subsetOf(user.department.toSet)
ds.filter(pred _)
You can read more about DataSets here and here.
If you prefer to use Dataframes you can do it with user defined functions
import org.apache.spark.sql.functions._
val pred = udf((arr: Seq[String]) => Set("Geography", "Maths").subsetOf(arr.toSet))
df.filter(pred($"Department"))
In the same package you can find a Spark built-in function for this. You can do
df.filter(array_contains($"Department", "Maths")).filter(array_contains($"Department", "Geography"))
but someone could argue that this is not as efficient and that the optimizer can't improve it much.
Note that for each search condition you need a different predicate.
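As a rough sketch of how the Operator field from the question's search condition could be handled (the column and field names are taken from the question's JSON; the helper itself is hypothetical):

import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{array_contains, col}

// Build one array_contains predicate per department and combine them
// with AND or OR according to the search condition's Operator field.
def departmentPredicate(departments: Seq[String], operator: String): Column = {
  val perDepartment = departments.map(d => array_contains(col("Department"), d))
  operator.toLowerCase match {
    case "and" => perDepartment.reduce(_ && _)
    case "or"  => perDepartment.reduce(_ || _)
  }
}

df.filter(departmentPredicate(Seq("Maths", "Geography"), "and")).show()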

Slick Plain Sql Generic Return Type

I am trying to write a configurable SQL query executor using Slick. The user provides a prepared statement with ? placeholders, and at run time the exact query is formed by replacing each ? with a value.
Generally this is how one would run a plain sql query using slick.
val query = sql"#$queryString".as[(String,Int)]
In my case I do not know the result type in advance, so I want to get back a generic result type, maybe a list of tuples with each tuple representing a row of the result set.
Any ideas on how this would be done?
I found a solution in one of the Scala GitHub issues. Here it is:
import slick.jdbc.{GetResult, PositionedResult}

// Maps every row to a Map of column name -> value, regardless of the schema.
implicit object ResultMap extends GetResult[Map[String, Any]] {
  def apply(pr: PositionedResult) = {
    val resultSet = pr.rs
    val metaData = resultSet.getMetaData()
    (1 to pr.numColumns).map { i =>
      metaData.getColumnName(i) -> resultSet.getObject(i)
    }.toMap
  }
}
and then we can simply do val query = sql"#$queryString".as(ResultMap)
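A minimal usage sketch follows; the profile import, the db handle, and the query string are assumptions to adapt to your setup (the implicit ResultMap above must be in scope):

import scala.concurrent.Await
import scala.concurrent.duration._
import slick.jdbc.H2Profile.api._                 // swap in the profile matching your database

val db = Database.forConfig("mydb")               // hypothetical config key
val queryString = "select id, name from users"    // hypothetical runtime-built query

val action = sql"#$queryString".as[Map[String, Any]]
val rows: Seq[Map[String, Any]] = Await.result(db.run(action), 10.seconds)
rows.foreach(println)                             // each row is a Map(columnName -> value)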
Hope it helps!!

Aerospike limiting records by lexicographic order

Can Aerospike get records by lexicographic order? For example, if you want all the records that start with "a", you would like to search for bin >= "a" AND bin <= "az".
Aerospike supports UDF modules (in Lua and C): https://www.aerospike.com/docs/udf/developing_lua_modules.html
These can serve your purpose.
User-Defined Functions written in Lua extend the core functionality of Aerospike. You would create a stream UDF and attach it to a query.
One best practice for stream UDFs in Aerospike is to eliminate as many records as possible before passing the results into the UDF, so in this case I would create another bin to hold a prefix (the first letter, or a longer substring, depending on your use case) and build a secondary index on it. The idea is that the query portion should reliably return as small a subset as possible. For your example the prefix can be a single character: add a new bin 'firstchar' to the records in the set, then build a secondary index on it.
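A rough sketch of that setup, using the Aerospike Java client from Scala (the namespace test, set this, and bin firstchar match the Python example further below; the host, user key, and other names are assumptions):

import com.aerospike.client.{AerospikeClient, Bin, Key}
import com.aerospike.client.query.IndexType

val client = new AerospikeClient("127.0.0.1", 3000)   // assumed host/port

// Write the record with an extra 'firstchar' bin holding the prefix.
val name = "alice"
val key = new Key("test", "this", "user-42")           // hypothetical user key
client.put(null, key,
  new Bin("name", name),
  new Bin("firstchar", name.substring(0, 1)))

// Build a secondary index on the prefix bin so queries can use it.
client.createIndex(null, "test", "this", "idx_firstchar", "firstchar", IndexType.STRING)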
The stream UDF module would look something like:
local function range_filter(bin_name, substr_from, substr_to)
  return function(record)
    local val = record[bin_name]
    if type(val) ~= 'string' then
      return false
    end
    if val >= substr_from and val <= substr_to then
      return true
    else
      return false
    end
  end
end

local function rec_to_map(record)
  local xrec = map()
  for i, bin_name in ipairs(record.bin_names(record)) do
    xrec[bin_name] = record[bin_name]  -- copy each bin value from the record
  end
  return xrec
end

function str_between(stream, bin_name, substr_from, substr_to)
  return stream : filter(range_filter(bin_name, substr_from, substr_to)) : map(rec_to_map)
end
In the Python client you'd invoke it as follows:
import aerospike
from aerospike import predicates as p

# instantiate the client and connect to the cluster, then:
query = client.query('test', 'this')
query.where(p.equals('firstchar', 'a'))
query.apply('strrangemod', 'str_between', ['a', 'az'])
results = query.results()  # execute the stream UDF query and collect the mapped records

Creating User Defined Function in Spark-SQL

I am new to Spark and Spark SQL, and I was trying to query some data using Spark SQL.
I need to fetch the month from a date which is given as a string.
I think it is not possible to query the month directly in Spark SQL, so I was thinking of writing a user defined function in Scala.
Is it possible to write a UDF in Spark SQL, and if so, can anybody suggest the best method of writing one?
You can do this, at least for filtering, if you're willing to use a language-integrated query.
For a data file dates.txt containing:
one,2014-06-01
two,2014-07-01
three,2014-08-01
four,2014-08-15
five,2014-09-15
You can pack as much Scala date magic in your UDF as you want but I'll keep it simple:
def myDateFilter(date: String) = date contains "-08-"
Set it all up as follows -- a lot of this is from the Programming guide.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext._
// case class for your records
case class Entry(name: String, when: String)
// read and parse the data
val entries = sc.textFile("dates.txt").map(_.split(",")).map(e => Entry(e(0),e(1)))
You can use the UDF as part of your WHERE clause:
val augustEntries = entries.where('when)(myDateFilter).select('name, 'when)
and see the results:
augustEntries.map(r => r(0)).collect().foreach(println)
Notice the version of the where method I've used, declared as follows in the doc:
def where[T1](arg1: Symbol)(udf: (T1) ⇒ Boolean): SchemaRDD
So, the UDF can only take one argument, but you can compose several .where() calls to filter on multiple columns.
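For example, chaining two single-column filters could look like this (the second filter is a hypothetical one on name, just to show the composition):

def myNameFilter(name: String) = name startsWith "f"   // hypothetical second UDF

val filtered = entries
  .where('when)(myDateFilter)
  .where('name)(myNameFilter)
  .select('name, 'when)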
Edit for Spark 1.2.0 (and really 1.1.0 too)
While it's not really documented, Spark now supports registering a UDF so it can be queried from SQL.
The above UDF could be registered using:
sqlContext.registerFunction("myDateFilter", myDateFilter)
and if the table was registered
sqlContext.registerRDDAsTable(entries, "entries")
it could be queried using
sqlContext.sql("SELECT * FROM entries WHERE myDateFilter(when)")
For more details see this example.
In Spark 2.0, you can do this:
// define the UDF
def convert2Years(date: String) = date.substring(7, 11)
// register to session
sparkSession.udf.register("convert2Years", convert2Years(_: String))
val moviesDf = getMoviesDf // create dataframe usual way
moviesDf.createOrReplaceTempView("movies") // 'movies' is used in sql below
val years = sparkSession.sql("select convert2Years(releaseDate) from movies")
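If you prefer the DataFrame API over SQL, a minimal sketch wrapping the same function with udf (assuming the same moviesDf and its releaseDate column):

import org.apache.spark.sql.functions.{col, udf}

val convert2YearsUdf = udf(convert2Years _)   // wrap the plain Scala function as a Spark UDF
val years2 = moviesDf.select(convert2YearsUdf(col("releaseDate")).as("releaseYear"))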
In PySpark 1.5 and above, we can easily achieve this with built-in functions.
Following is an example:
raw_data = [
    ("2016-02-27 23:59:59", "Gold", 97450.56),
    ("2016-02-28 23:00:00", "Silver", 7894.23),
    ("2016-02-29 22:59:58", "Titanium", 234589.66)]

Time_Material_revenue_df = sqlContext.createDataFrame(raw_data, ["Sold_time", "Material", "Revenue"])
from pyspark.sql.functions import *
Day_Material_revenue_df = Time_Material_revenue_df.select(to_date("Sold_time").alias("Sold_day"), "Material", "Revenue")
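For the original question, extracting the month itself, Spark also ships a built-in month function; a rough Scala sketch (df is a hypothetical DataFrame with a string column Sold_time like the one above):

import org.apache.spark.sql.functions.{col, month, to_date}

val withMonth = df.withColumn("Sold_month", month(to_date(col("Sold_time"))))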