How to use a list in a WHERE clause in spark-sql

I have data in the following format:
df
uid  String  event
a    djsan   C
a    fbja    V
a    kakal   Conversion
b    jshaj   V
b    jjsop   C
c    dqjka   V
c    kjkk    Conversion
I need to extract all the rows of every user who has at least one event equal to Conversion, so the expected output is:
uid  String  event
a    djsan   C
a    fbja    V
a    kakal   Conversion
c    dqjka   V
c    kjkk    Conversion
I'm trying to use spark-sql for this. I tried a simple subquery of the form
Select * from df where uid in (Select uid from df where event = 'Conversion')
but this throws an exception.
Also, if I had a list of uids, could I use it in a SQL statement, and if so, how?
val list: List[String] = List("a", "c")

The subquery syntax you've written is not supported by Spark yet (support for IN subqueries only arrived in Spark 2.0). Here is how you can use your list to build the query:
val list = List("a","c")
val query = s"select * from df where uid in (${list.map ( x => "'" + x + "'").mkString(",") })"
and use it to select desired rows.
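For reference, the intended result of both forms (the IN subquery and the list expanded into an IN list) can be checked outside Spark. This is a minimal sketch in Python, assuming SQLite as a stand-in engine (it is not Spark), with the table and column names taken from the question:

```python
import sqlite3

# Sample data from the question; SQLite is only a stand-in engine here.
rows = [
    ("a", "djsan", "C"),
    ("a", "fbja", "V"),
    ("a", "kakal", "Conversion"),
    ("b", "jshaj", "V"),
    ("b", "jjsop", "C"),
    ("c", "dqjka", "V"),
    ("c", "kjkk", "Conversion"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE df (uid TEXT, string TEXT, event TEXT)")
conn.executemany("INSERT INTO df VALUES (?, ?, ?)", rows)

# Variant 1: the IN subquery from the question.
by_subquery = conn.execute(
    "SELECT * FROM df WHERE uid IN "
    "(SELECT uid FROM df WHERE event = 'Conversion') ORDER BY rowid"
).fetchall()

# Variant 2: a list of uids expanded into one placeholder per element.
uids = ["a", "c"]
placeholders = ",".join("?" for _ in uids)
by_list = conn.execute(
    f"SELECT * FROM df WHERE uid IN ({placeholders}) ORDER BY rowid", uids
).fetchall()

for row in by_subquery:
    print(row)
```

Both variants return the five rows of users a and c, which matches the expected output in the question.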

Related

Pass date values from dataframe to query in Spark/Scala

I have a dataframe with a date column that contains the values below.
df.show()
+----+----------+
|name|       dob|
+----+----------+
| Jon|2001-04-15|
| Ben|2002-03-01|
+----+----------+
Now I need to query a Hive table using the "dob" values from the above dataframe (both 2001-04-15 and 2002-03-01), so I need to pass the values in the dob column as a parameter to my Hive query.
I tried collecting the values into a variable as shown below, which gives me an array of strings.
val dobRead = df.select("updt_d").distinct().as[String].collect()
dobRead: Array[String] = Array(2001-04-15, 2002-03-01)
However, when I try to pass it to the query, it is not substituted properly and I get an error.
val tableRead = hive.executeQuery(s"select emp_name,emp_no,martial_status from <<table_name>> where dateOfBirth in ($dobRead)")
org.apache.hadoop.hive.ql.metadata.HiveException: Failed to compile query: org.apache.hadoop.hive.ql.parse.ParseException: line 1:480 cannot recognize input near '(' '[' 'Ljava' in expression specification
Can you please help me pass date values to a query in Spark?
You can collect the dates as follows (Row.getAs):
val rows: Array[Row] = df.select("updt_d").distinct().collect()
val dates: Array[String] = rows.map(_.getAs[String](0))
And then build the query:
val hql: String = s"select ... where dateOfBirth in (${
dates.map(d => s"'${d}'").mkString(", ")
})"
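The same quoting idea, sketched in Python for illustration. Interpolating the collection itself yields its default string form rather than a list of literals, so each value has to be rendered individually. The helper name sql_in_list and the table name emp_table are made up for this example; doubling embedded single quotes is the standard SQL way of escaping them, but check what your dialect expects:

```python
def sql_in_list(values):
    """Render values as a comma-separated list of single-quoted SQL literals."""
    # Double any embedded single quote so each literal stays well-formed.
    return ", ".join("'" + str(v).replace("'", "''") + "'" for v in values)


dates = ["2001-04-15", "2002-03-01"]

# emp_table is a hypothetical table name used only for this sketch.
query = f"select emp_name from emp_table where dateOfBirth in ({sql_in_list(dates)})"
print(query)
```

Where the client library supports bind parameters, those are generally preferable to building literals by hand.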
Option 2
If the number of dates in the first DataFrame is too large, you should use a join instead of collecting them to the driver.
First, load each table as a DataFrame (I'll call them dfEmp and dfDates). Then you can join on the date fields to filter, either using a standard inner join plus filtering out null fields or directly using a left_semi join:
val dfEmp = hiveContext.table("EmpTable")
val dfEmpFiltered = dfEmp.join(dfDates,
col("dateOfBirth") === col("updt_d"), "left_semi")

Writing where query using pyspark on SQL table

I'm querying a SQL table using pyspark.
If I have a SQL table with two columns (value, isDelayed), where "value" is of type double and "isDelayed" is 0 or 1, how do I write a pyspark aggregation query that gives the sum of "value" where "isDelayed" is 1?
I've already tried the code below, which gives an error:
def __main__(self, data):
    delayedData = data.where(col('isDelayed').cast('int')==='1')
    groupByIsDelayed = delayedData.agg(sum(total))
    return groupByIsDelayed
I'm getting "Syntax Error: invalid syntax" on the line below:
delayedData = data.where(col('isDelayed').cast('int')==='1')
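Replace data.where(col('isDelayed').cast('int')==='1') with data.where(col('isDelayed').cast('int') == 1). There are two changes:
== instead of === (the equality operator in Python is two = signs; === is the Scala Column operator)
1 without quotes (because you are comparing against an int, not a string)
Alternatively, pass the condition as a SQL expression string:
data.where("isDelayed=1")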

How to set a condition for arrays in a WHERE clause [BigQuery]

I have an array in my table - something like this:
I need to take into account only rows where top_authors.author = 'Caivi' and top_authors.total_score = 3
I was trying to use the unnest function, but I still get the error "No matching signature for operator = for argument types: ARRAY, STRING. Supported signatures: ANY = ANY"
Could you help me with that?
You can unnest() in a subquery in the where clause:
where exists (select 1
              from unnest(top_authors) ta
              where ta.author = 'Caivi' and ta.total_score = 3
             )
Or you can do this in the main query:
select . . .
from t cross join
     unnest(top_authors) ta
where ta.author = 'Caivi' and ta.total_score = 3;
Assuming you don't have duplicates in the array, these should produce equivalent results.
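The caveat about duplicates can be illustrated with a small Python model of the two queries. The rows and ids below are invented sample data (the original table is not shown): the EXISTS form returns each matching row once, while the CROSS JOIN UNNEST form returns it once per matching array element.

```python
# Invented sample rows; each row carries an array of author structs.
rows = [
    {"id": 1, "top_authors": [{"author": "Caivi", "total_score": 3},
                              {"author": "Other", "total_score": 5}]},
    {"id": 2, "top_authors": [{"author": "Caivi", "total_score": 3},
                              {"author": "Caivi", "total_score": 3}]},
    {"id": 3, "top_authors": [{"author": "Caivi", "total_score": 1}]},
]


def matches(ta):
    """The predicate from the WHERE clause, applied to one array element."""
    return ta["author"] == "Caivi" and ta["total_score"] == 3


# Model of WHERE EXISTS (SELECT 1 FROM UNNEST(top_authors) ...): one result per row.
exists_ids = [r["id"] for r in rows if any(matches(ta) for ta in r["top_authors"])]

# Model of CROSS JOIN UNNEST(top_authors): one result per matching element.
unnest_ids = [r["id"] for r in rows for ta in r["top_authors"] if matches(ta)]

print(exists_ids)  # [1, 2]
print(unnest_ids)  # [1, 2, 2]
```

Row 2 appears twice in the second result because its array holds the same matching element twice.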

Python cx_oracle bind variable with a list of items

I have a query like this:
SELECT prodId, prod_name , prod_type FROM mytable WHERE prod_type in (:list_prod_names)
I want to get product information depending on the product type. The possible types are "day", "week", "weekend", and "month". Depending on the date, there may be just one of those options or a combination of them.
This info (a list) is returned by the function prod_names(date_search)
I am using cx_oracle bindings with code like:
def get_prod_by_type(search_date :datetime):
    query_path = r'./queries/prod_by_name.sql'
    raw_query = open(query_path).read().strip().replace('\n', ' ').replace('\t', ' ').replace(' ', ' ')
    print(sql_read_op)
    # Depending on the date the product types may be different
    prod_names(search_date) #This returns a list with possible names
    qry_params = {"list_prod_names": prod_names} # See attempts below
    try:
        db = DB(username='username', password='pss', hostname="localhost")
        df = db.get(raw_query,qry_params)
    except Exception:
        exception_error = traceback.format_exc()
        exception_error = 'Exception on DB.get_short_cov_op2() : %s\n%s' % exception_error
        print(exception_error)
    return df
For this: qry_params = {"list_prod_names": prod_names} I have tried multiple different things such as:
prod_names = ''.join(prod_names)
prod_names = str(prod_names)
prod_names = " \'"+''.join(prod_names)+"\'"
The only way I have managed to get it to work is:
new_query = raw_query.format(list_prod_names=prodnames_for_date(search_date)).replace('[', '').replace(']','')
df = db.query(new_query)
I am trying to avoid .format() because formatting values into a SQL string is bad practice and opens the door to injection attacks.
db.py contains among other functions:
def get(self, sql, params={}):
    cur = self.con.cursor()
    cur.prepare(sql)
    try:
        cur.execute(sql, **params)
        df = pd.DataFrame(cur.fetchall(), columns=[c[0] for c in cur.description])
    except Exception:
        exception_error = traceback.format_exc()
        exception_error = 'Exception on DB.get() : %s\n%s' % exception_error
        print(exception_error)
        self.con.rollback()
    cur.close()
    df.columns = df.columns.map(lambda x: x.upper())
    return df
I would like to be able to do a type binding.
I am using:
python = 3.6
cx_oracle = 6.3.1
I have read the following articles but I am still unable to find a solution:
Python cx_Oracle bind variables
Python cx_Oracle SQL with bind string variable
Search for name in cx_Oracle
Unfortunately you cannot bind an array directly unless you convert it to a SQL type and use a subquery -- which is fairly complex. So instead you need to do something like this:
inClauseParts = []
for i, inValue in enumerate(ARRAY_VALUE):
    argName = "arg_" + str(i + 1)
    inClauseParts.append(":" + argName)
clause = "%s in (%s)" % (columnName, ",".join(inClauseParts))
This works fine, but be aware that if the number of elements in the array changes regularly, this technique will create a separate statement that must be parsed for each number of elements. If you know that (in general) you won't have more than, for example, 10 elements in the array, it would be better to append None to the incoming array so that the number of elements is always 10.
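A sketch of that approach as a reusable Python helper, including the padding suggestion. The name build_in_clause and its pad_to parameter are hypothetical, not part of cx_Oracle; padding with None is harmless because a NULL in an IN list never matches any value:

```python
def build_in_clause(column_name, values, pad_to=None):
    """Return an IN clause with one named bind per value, plus the bind dict.

    Hypothetical helper for illustration only.
    """
    values = list(values)
    # Optionally pad with None so the statement text has a fixed number of binds.
    if pad_to is not None and len(values) < pad_to:
        values += [None] * (pad_to - len(values))
    names = ["arg_%d" % (i + 1) for i in range(len(values))]
    clause = "%s in (%s)" % (column_name, ",".join(":" + n for n in names))
    params = dict(zip(names, values))
    return clause, params


clause, params = build_in_clause("prod_type", ["day", "week"], pad_to=4)
print(clause)  # prod_type in (:arg_1,:arg_2,:arg_3,:arg_4)
print(params)
```

The returned dictionary can then be supplied as the named bind parameters when the statement is executed, for example cursor.execute(sql, params).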
Hopefully that is clear enough!
I have finally managed to do it. It might not be pretty, but it works.
I have modified my sql query to include an extra select which returns the value of my list of descriptors:
inner join (
    SELECT regexp_substr(:my_list_of_items, '[^,]+', 1, LEVEL) as mylist
    FROM dual
    CONNECT BY LEVEL <= length(:my_list_of_items) - length(REPLACE(:my_list_of_items, ',', '')) + 1
) d
on d.mylist= a.corresponding_columns
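To see what that CONNECT BY subquery produces, here is a rough Python model of it (the function name is made up for this sketch). It emits one row per comma plus one and takes the n-th run of non-comma characters for each row, so the bind value has to be a single comma-joined string such as ",".join(prod_names):

```python
import re


def split_like_connect_by(csv):
    """Model the rows produced by the regexp_substr / CONNECT BY LEVEL subquery."""
    # Number of generated rows: count of commas + 1 (the CONNECT BY condition).
    levels = len(csv) - len(csv.replace(",", "")) + 1
    # regexp_substr(csv, '[^,]+', 1, n) picks the n-th run of non-comma characters.
    tokens = re.findall(r"[^,]+", csv)
    # When the n-th occurrence does not exist, regexp_substr yields NULL (None here).
    return [tokens[i] if i < len(tokens) else None for i in range(levels)]


print(split_like_connect_by("day,week,month"))  # ['day', 'week', 'month']
print(split_like_connect_by("a,,b"))            # ['a', 'b', None]
```

The second call shows the edge case: an empty item between commas does not produce an empty string, it shifts the later items forward and leaves a trailing NULL row.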

Entity framework join with a subquery via linq syntax

I'm trying to translate a SQL query into LINQ syntax, but I'm having a lot of trouble.
This is my query in SQL:
select * FROM dbo.ITEM item inner join
(
    select SUM([QTA_PRIMARY]) QtaTotale,
        TRADE_NUM,
        ORDER_NUM,
        ITEM_NUM
    from [dbo].[LOTTI]
    where FLAG_ATTIVO=1
    group by [TRADE_NUM],[ORDER_NUM],[ITEM_NUM]
)
TotQtaLottiGroupByToi
    on item.TRADE_NUM = TotQtaLottiGroupByToi.TRADE_NUM
    and item.ORDER_NUM = TotQtaLottiGroupByToi.ORDER_NUM
    and item.ITEM_NUM = TotQtaLottiGroupByToi.ITEM_NUM
where item.PRIMARY_QTA > TotQtaLottiGroupByToi.QtaTotale
    and item.FLAG_ATTIVO=1
How can I translate this into LINQ syntax?
This approach doesn't work:
var res= from i in context.ITEM
         join d in
         (
             from l in context.LOTTI
             group l by new { l.TRADE_NUM, l.ORDER_NUM, l.ITEM_NUM } into g
             select new TotQtaByTOI()
             {
                 TradeNum = g.Key.TRADE_NUM,
                 OrderNum = g.Key.ORDER_NUM,
                 ItemNum = g.Key.ITEM_NUM,
                 QtaTotale = g.Sum(oi => oi.QTA_PRIMARY)
             }
         )
         on new { i.TRADE_NUM, i.ORDER_NUM, i.ITEM_NUM} equals new { d.TradeNum, d.OrderNum, d.ItemNum }
I get this error
The type of one of the expressions in the join clause is incorrect. Type inference failed in the call to 'Join'
Can you help me with this query?
Thank you!
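The problem is the anonymous type comparison. The two anonymous types in the join key must have the same property names (and property types) in the same order, so give both sides matching names (e.g. first, second, third): on new { first = i.TRADE_NUM, second = i.ORDER_NUM, third = i.ITEM_NUM } equals new { first = d.TradeNum, second = d.OrderNum, third = d.ItemNum }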
I tried it out, here's an example: http://pastebin.com/hRj0CMzs