Pass date values from a dataframe to a query in Spark/Scala - sql

I have a dataframe with a date column whose values are as below.
df.show()
+----+----------+
|name| dob|
+----+----------+
| Jon|2001-04-15|
| Ben|2002-03-01|
+----+----------+
Now I need to query a Hive table for the rows whose date of birth matches the "dob" values from the above dataframe (both 2001-04-15 and 2002-03-01). So I need to pass the values under the dob column as a parameter to my Hive query.
I tried to collect the values into a variable like below, which gives me an array of strings.
val dobRead = df.select("updt_d").distinct().as[String].collect()
dobRead: Array[String] = Array(2001-04-15, 2002-03-01)
However, when I try to pass it to the query, I see it is not substituted properly and I get an error.
val tableRead = hive.executeQuery(s"select emp_name,emp_no,martial_status from <<table_name>> where dateOfBirth in ($dobRead)")
org.apache.hadoop.hive.ql.metadata.HiveException: Failed to compile query: org.apache.hadoop.hive.ql.parse.ParseException: line 1:480 cannot recognize input near '(' '[' 'Ljava' in expression specification
Can you please help me with how to pass date values to a query in Spark?

You can collect the dates as follows (Row.getAs):
val rows: Array[Row] = df.select("updt_d").distinct().collect()
val dates: Array[String] = rows.map(_.getAs[String](0))
And then build the query:
val hql: String = s"select ... where dateOfBirth in (${
  dates.map(d => s"'${d}'").mkString(", ")
})"
Option 2
If the number of dates in the first DataFrame is too big, you should use a join instead of collecting them onto the driver.
First, load both tables as DataFrames (I'll call them dfEmp and dfDates). Then you can join on the date fields to filter, either using a standard join plus filtering out the null fields, or directly using a left_semi join (a sketch of the first variant follows the snippet below):
import org.apache.spark.sql.functions.col

val dfEmp = hiveContext.table("EmpTable")
val dfDates = df.select("updt_d").distinct()  // the distinct dates from the first DataFrame
val dfEmpFiltered = dfEmp.join(dfDates,
  col("dateOfBirth") === col("updt_d"), "left_semi")

Related

pyspark hive sql convert array(map(varchar, varchar)) to string by rows

I would like to transform a column of
array(map(varchar, varchar))
into strings as rows of a table on a Presto DB, programmatically via PySpark/Hive SQL from a Jupyter notebook (Python 3).
example
user_id  sport_ids
'aca'    [ {'sport_id': '5818'}, {'sport_id': '6712'}, {'sport_id': '1065'} ]
expected results
user_id  sport_ids
'aca'    '5818'
'aca'    '6712'
'aca'    '1065'
I have tried
sql_q= """
select distinct, user_id, transform(sport_ids, x -> element_at(x, 'sport_id')
from tab """
spark.sql(sql_q)
but got error:
'->' cannot be resolved
I have also tried
sql_q= """
select distinct, user_id, sport_ids
from tab"""
spark.sql(sql_q)
but got error:
org.apache.spark.sql.AnalysisException: Cannot have map type columns in DataFrame which calls set operations(intersect, except, etc.), but the type of column request_features[0] is map<string,string>;;
Did I miss something?
I have also tried these, but they were not helpful:
hive convert array<map<string, string>> to string
Extract map(varchar, array(varchar)) - Hive SQL
thanks
Let's try using higher-order functions to extract the map values and explode them into individual rows:
from pyspark.sql.functions import explode, expr
df.withColumn('sport_ids', explode(expr("transform(sport_ids, x -> map_values(x)[0])"))).show()
+-------+---------+
|user_id|sport_ids|
+-------+---------+
| aca| 5818|
| aca| 6712|
| aca| 1065|
+-------+---------+
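For readers working in Scala rather than PySpark, a roughly equivalent sketch (assuming the same DataFrame, with sport_ids as an array<map<string,string>> column):
import org.apache.spark.sql.functions.{explode, expr}

val exploded = df.withColumn("sport_ids",
  explode(expr("transform(sport_ids, x -> map_values(x)[0])")))
exploded.show()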
You can process the JSON data (json_parse, cast to array of JSON, and json_extract_scalar; see the Presto documentation for more JSON functions) and flatten it (unnest) on the Presto side:
-- sample data
WITH dataset(user_id, sport_ids) AS (
    VALUES
        ('aca', '[ {"sport_id": "5818"}, {"sport_id": "6712"}, {"sport_id": "1065"} ]')
)
-- query
select user_id,
       json_extract_scalar(record, '$.sport_id') sport_id
from dataset,
     unnest(cast(json_parse(sport_ids) as array(json))) as t(record)
Output:
user_id  sport_id
aca      5818
aca      6712
aca      1065

Databricks notebook - return all values from a table

For test purposes, I have an empty DB into which I populate a tiny amount of data, extracted and transformed from a json file.
I would like to create a notebook using Scala which gets all values from all columns of a given table, and exits the notebook returning this result as a string.
I've tried variations of the following:
val result = spark.sql("select * from table.DB").as[String];
dbutils.notebook.exit(result)
However the first command fails with error:
AnalysisException: Try to map struct<Version:bigint,metadataInformation:struct<metadataID:string... etc ...> to Tuple1, but failed as the number of fields does not line up.;
However, something like the following works to retrieve the value of a specific field from a column:
val result = spark.sql("select column.jsonfield from table.DB").as[String].first();
dbutils.notebook.exit(result)
How can I return the content of all columns?
val result = spark.sql("SELECT x FROM y").collect().toList.flatMap(x => x.toSeq).mkString(",")
dbutils.notebook.exit(result)
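If you would rather keep the column names, a hedged alternative (my own sketch, assuming the result is small enough to collect to the driver) is to return the rows as a JSON array string:
val result = spark.sql("SELECT x FROM y")
  .toJSON                      // one JSON string per row, keyed by column name
  .collect()
  .mkString("[", ",", "]")     // wrap into a single JSON array string
dbutils.notebook.exit(result)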

Cannot have map type columns in DataFrame which calls set operations

org.apache.spark.sql.AnalysisException: Cannot have map type columns in DataFrame which calls set operations(intersect, except, etc.), but the type of column map_col is map
I have a Hive table with a column of type MAP<Float, Float>. I get the above error when I try to do an insert on this table in a Spark context. The insert works fine without the 'distinct'.
create table test_insert2(`test_col` string, `map_col` MAP<INT,INT>)
location 's3://mybucket/test_insert2';
insert into test_insert2
select distinct 'a' as test_col, map(0,0) as map_col
Try converting the dataframe to an RDD and then applying the .distinct function.
Example:
spark.sql("select 'a'test_col,map(0,0)map_col
union all
select 'a'test_col,map(0,0)map_col").rdd.distinct.collect
Result:
Array[org.apache.spark.sql.Row] = Array([a,Map(0 -> 0)])
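To finish the insert from the question, one possible sketch (my own assumption on the variable name src; test_insert2 is the target table from the question) is to deduplicate through the RDD and rebuild a DataFrame before writing:
val src = spark.sql("""select 'a' as test_col, map(0,0) as map_col
                       union all
                       select 'a' as test_col, map(0,0) as map_col""")
// distinct at the RDD level, since map columns block DataFrame-level set operations
val deduped = spark.createDataFrame(src.rdd.distinct(), src.schema)
deduped.write.insertInto("test_insert2")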

Writing where query using pyspark on SQL table

I'm querying a SQL table using PySpark.
I have a SQL table with two columns (value, isDelayed), where "value" is of double type and "isDelayed" has the value 0 or 1. How do I write a PySpark aggregation query that gives the sum of "value" when "isDelayed" is 1?
I've already tried the code below, which is giving an error:
def __main__(self, data):
    delayedData = data.where(col('isDelayed').cast('int')==='1')
    groupByIsDelayed = delayedData.agg(sum(total))
    return groupByIsDelayed
I'm getting
"Syntax Error: invalid syntax"
on the line below:
delayedData = data.where(col('isDelayed').cast('int')==='1')
Replace data.where(col('isDelayed').cast('int')==='1') with data.where(col('isDelayed').cast('int') == 1):
== only (the equality operator in Python is two = signs, not three)
1 without quotes (because you are comparing an int, not a string)
or
data.where("isDelayed=1")

How to use a list in where clause in spark-sql

I have data of the following format:
df
uid  String  event
a    djsan   C
a    fbja    V
a    kakal   Conversion
b    jshaj   V
b    jjsop   C
c    dqjka   V
c    kjkk    Conversion
I need to extract all the rows for users who have an event of 'Conversion', so the expected outcome should be:
uid  String  event
a    djsan   C
a    fbja    V
a    kakal   Conversion
c    dqjka   V
c    kjkk    Conversion
I'm trying to use spark-sql for the same. I was trying to use a simple subquery of the form:
Select * from df where uid in (Select uid from df where event = 'Conversion')
but this is giving me an exception.
Also, I wanted to see whether, if I had a list object of the uids, I could use that in a SQL statement, and if yes, how?
list : List[String] = List('a','c')
The subquery syntax you've written is not supported by Spark yet. Here is how you can use your list to form a query:
val list = List("a","b")
val query = s"select * from df where uid in (${list.map ( x => "'" + x + "'").mkString(",") })"
and then run it with spark.sql(query) to select the desired rows.
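Alternatively, if the data is already available as a DataFrame, a sketch using the DataFrame API's isin (assuming the DataFrame is called df and the list contains plain strings) avoids building the SQL string at all:
import org.apache.spark.sql.functions.col

val uids = List("a", "c")
val filtered = df.filter(col("uid").isin(uids: _*))
filtered.show()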