Spark SQL: Can't UNNEST lambda variables

I am seeing strange behaviour: I can't access a lambda variable inside UNNEST in my Spark SQL code:
FILTER(boxes.clicks, x -> EXISTS (SELECT 1 FROM UNNEST(x) AS clicks WHERE clicks.href IS NOT NULL))
This will complain that x does not exist: cannot resolve 'x' given input columns: []
However, without UNNEST, x can be accessed without any problems. For example, this will work just fine:
FILTER(boxes.clicks, x -> size(x) > 1)
Is it possible to use lambda variables in combination with UNNEST?
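To make the intent of the failing expression concrete, here is a minimal Python sketch of what it is meant to compute: keep each inner click list that has at least one click with a non-null href. The sample data and the name `filter_click_lists` are hypothetical, and the shape of `boxes.clicks` (a list of lists of records) is an assumption drawn from the fact that `size(x)` works. Spark SQL 2.4+ also provides an `exists(array, lambda)` higher-order function that expresses the same per-element test without UNNEST; whether it fits this schema is not confirmed by the question.

```python
from typing import Dict, List, Optional

# Hypothetical stand-in for one click record with a nullable href.
Click = Dict[str, Optional[str]]


def filter_click_lists(click_lists: List[List[Click]]) -> List[List[Click]]:
    """Keep the inner lists that contain at least one click with a non-null href."""
    return [
        x
        for x in click_lists
        # Plays the role of: EXISTS (... WHERE clicks.href IS NOT NULL)
        if any(c.get("href") is not None for c in x)
    ]


# Made-up sample data for illustration only.
sample = [[{"href": None}], [{"href": "/a"}, {"href": None}], []]
kept = filter_click_lists(sample)
print(kept)
```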

Related

log function in redshift

I am trying to run the following query.
CREATE TEMP TABLE tmp_variables AS SELECT
0.99::numeric(10,8) AS y ;
select y, log(y) from tmp_variables
It gives me the following error. Is there a way to get around this?
[Amazon](500310) Invalid operation: Specified types or functions (one per INFO message) not supported on Redshift tables.;
Warnings:
Function "log(numeric,numeric)" not supported.

A workaround is to use "float" instead.
CREATE TEMP TABLE tmp_variables AS SELECT
0.99::float AS y ;
select y, log(y) from tmp_variables
works fine and returns
y log
0.99 -0.004364805402450088

The LOG function requires an argument that is data type "double precision". Your code is passing in a data type of "numeric", that's why you are getting an error.
This will work:
CREATE TEMP TABLE tmp_variables AS
SELECT 0.99::numeric(10,8) AS y ;
select y, log(cast(y as double precision)) from tmp_variables;
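As a sanity check on the value reported for the float workaround above, Redshift's single-argument LOG is a base-10 logarithm, so the result can be reproduced outside the database. A minimal sketch in Python (the variable names are illustrative):

```python
import math

# Value from the question, held as a float (double precision).
y = 0.99

# Base-10 logarithm, the same operation as log(y) on a double precision column.
log_y = math.log10(y)
print(y, log_y)
```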

Zip parallel arrays in Hive

I have parallel arrays in a hive table, like this:
with tbl as ( select array(1,2,3) as x, array('a','b','c') as y)
select x,y from tbl;
x y
[1,2,3] ["a","b","c"]
1 row selected (0.108 seconds)
How can I zip them together (like the python zip function), so that I get back a list of structs, like
[(1, "a"), (2, "b"), (3,"c")]

You can posexplode so it gives the positions in the array which can then be used for filtering.
select x,y,collect_list(struct(val1,val2))
from tbl
lateral view posexplode(x) t1 as p1,val1
lateral view posexplode(y) t2 as p2,val2
where p1=p2
group by x,y
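The logic of the query above can be sketched in Python to show why it behaves like zip: each posexplode yields (position, value) rows, the two lateral views form a cross product, and the `p1 = p2` filter keeps only the aligned pairs. The function name `zip_by_position` is hypothetical, and the sketch does not model Hive's execution or the ordering of collect_list.

```python
from typing import List, Tuple


def zip_by_position(x: List[int], y: List[str]) -> List[Tuple[int, str]]:
    """Pair elements of x and y that share the same array position."""
    # posexplode(x) -> rows of (p1, val1); posexplode(y) -> rows of (p2, val2)
    rows_x = list(enumerate(x))
    rows_y = list(enumerate(y))
    # Two lateral views give every (row_x, row_y) combination;
    # the "where p1 = p2" filter keeps only matching positions.
    return [
        (val1, val2)
        for p1, val1 in rows_x
        for p2, val2 in rows_y
        if p1 == p2
    ]


pairs = zip_by_position([1, 2, 3], ["a", "b", "c"])
print(pairs)
```

The cross product means the work grows with the product of the two array lengths, which is one reason to look for a single-explode alternative on long arrays.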

Here was my attempt at avoiding a double-explode:
with tbl as (select array(1,2,3,4,5) as x, array('a','b','c','d','e') as y)
select collect_list(struct(xi, y[i-1]))
from tbl lateral view posexplode(x) tbl2 as xi, i;
However, I ran into a strange error:
Error: Error while compiling statement: FAILED: IllegalArgumentException Size requested for unknown type: java.util.Collection (state=42000,code=40000)
I was able to work around it using
set hive.execution.engine=mr;
which is not as fast / optimized as using spark or tez as the back end.

Python cx_oracle bind variable with a list of items

I have a query like this:
SELECT prodId, prod_name , prod_type FROM mytable WHERE prod_type in (:list_prod_names)
I want to get the information for a product depending on its type. The possible types are "day", "week", "weekend", and "month". Depending on the date, the query may need to match just one of those options or any combination of them.
This info (List type) is returned by the function prod_names(date_search)
I am using cx_oracle bindings with code like:
def get_prod_by_type(search_date: datetime):
    query_path = r'./queries/prod_by_name.sql'
    raw_query = open(query_path).read().strip().replace('\n', ' ').replace('\t', ' ').replace(' ', ' ')
    print(sql_read_op)
    # Depending on the date the product types may be different
    prod_names(search_date)  # This returns a list with possible names
    qry_params = {"list_prod_names": prod_names}  # See attempts below
    try:
        db = DB(username='username', password='pss', hostname="localhost")
        df = db.get(raw_query, qry_params)
    except Exception:
        exception_error = traceback.format_exc()
        exception_error = 'Exception on DB.get_short_cov_op2() : %s\n%s' % exception_error
        print(exception_error)
    return df
For this: qry_params = {"list_prod_names": prod_names} I have tried multiple different things such as:
prod_names = ''.join(prod_names)
prod_names = str(prod_names)
prod_names = "\'" + ''.join(prod_names) + "\'"
The only thing I have managed to get it work is by doing:
new_query = raw_query.format(list_prod_names=prodnames_for_date(search_date)).replace('[', '').replace(']','')
df = db.query(new_query)
I am trying not to use .format() because building SQL with .format() is bad practice: it opens the door to injection attacks.
db.py contains among other functions:
def get(self, sql, params={}):
    cur = self.con.cursor()
    cur.prepare(sql)
    try:
        cur.execute(sql, **params)
        df = pd.DataFrame(cur.fetchall(), columns=[c[0] for c in cur.description])
    except Exception:
        exception_error = traceback.format_exc()
        exception_error = 'Exception on DB.get() : %s\n%s' % exception_error
        print(exception_error)
        self.con.rollback()
    cur.close()
    df.columns = df.columns.map(lambda x: x.upper())
    return df
I would like to be able to do a type binding.
I am using:
python = 3.6
cx_oracle = 6.3.1
I have read the following articles but I am still unable to find a solution:
Python cx_Oracle bind variables
Python cx_Oracle SQL with bind string variable
Search for name in cx_Oracle

Unfortunately you cannot bind an array directly unless you convert it to a SQL type and use a subquery -- which is fairly complex. So instead you need to do something like this:
inClauseParts = []
for i, inValue in enumerate(ARRAY_VALUE):
    argName = "arg_" + str(i + 1)
    inClauseParts.append(":" + argName)
clause = "%s in (%s)" % (columnName, ",".join(inClauseParts))
This works fine, but be aware that if the number of elements in the array changes regularly, this technique creates a separate statement for each distinct element count, and each one must be parsed. If you know that (in general) you won't have more than, for example, 10 elements in the array, it is better to append None to the incoming array so that the number of elements is always 10.
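The padding idea described above can be sketched as a small helper that returns both the IN clause and the matching bind dictionary. This is a minimal sketch, not part of cx_Oracle: the name `build_in_clause` and the `pad_to` parameter are hypothetical, and the column name is assumed to come from trusted code rather than user input.

```python
from typing import Any, Dict, Sequence, Tuple


def build_in_clause(
    column_name: str, values: Sequence[Any], pad_to: int = 10
) -> Tuple[str, Dict[str, Any]]:
    """Build 'col in (:arg_1,...,:arg_N)' with N fixed at pad_to, plus its binds."""
    if len(values) > pad_to:
        raise ValueError("got %d values, but pad_to is %d" % (len(values), pad_to))
    # Pad with None so the statement text is identical for any input length.
    padded = list(values) + [None] * (pad_to - len(values))
    names = ["arg_%d" % (i + 1) for i in range(pad_to)]
    clause = "%s in (%s)" % (column_name, ",".join(":" + n for n in names))
    binds = dict(zip(names, padded))
    return clause, binds


clause, binds = build_in_clause("prod_type", ["day", "week"], pad_to=4)
print(clause)
print(binds)
```

The returned dictionary can then be passed as the named bind parameters when executing the statement. The None entries are bound as NULL, which never compares equal inside an IN list, so the padding does not change which rows match.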
Hopefully that is clear enough!

I have finally managed to do it. It might not be pretty but it works.
I have modified my sql query to include an extra select which returns the value of my list of descriptors:
inner join (
SELECT regexp_substr(:my_list_of_items, '[^,]+', 1, LEVEL) as mylist
FROM dual
CONNECT BY LEVEL <= length(:my_list_of_items) - length(REPLACE(:my_list_of_items, ',', '')) + 1
) d
on d.mylist= a.corresponding_columns
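With this approach the whole list travels as one comma-separated string in a single bind variable. The sketch below shows how that string might be prepared on the Python side and emulates what the subquery produces: the LEVEL count comes from the number of commas plus one, and each row is the LEVEL-th run of non-comma characters. The helper names are hypothetical, the emulation is only an approximation of the Oracle behaviour, and values are assumed not to contain commas themselves.

```python
import re
from typing import List, Optional


def to_bind_value(items: List[str]) -> str:
    """Join the items into the single string bound to :my_list_of_items."""
    if any("," in item for item in items):
        raise ValueError("items must not contain commas")
    return ",".join(items)


def split_like_query(bound: str) -> List[Optional[str]]:
    """Emulate the rows generated by the CONNECT BY subquery."""
    # length(s) - length(REPLACE(s, ',', '')) + 1
    levels = len(bound) - len(bound.replace(",", "")) + 1
    # regexp_substr(s, '[^,]+', 1, LEVEL): the LEVEL-th run of non-comma characters
    tokens = re.findall(r"[^,]+", bound)
    return [
        tokens[level - 1] if level <= len(tokens) else None
        for level in range(1, levels + 1)
    ]


bound = to_bind_value(["day", "week", "month"])
rows = split_like_query(bound)
print(bound, rows)
```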

How can I use arrayExists function when the array contains a null value?

I have a nullable array column in my table: Array(Nullable(UInt16)). I want to be able to query this column using arrayExists (or arrayAll) to check if it contains a value above a certain threshold but I'm getting an exception when the array contains a null value:
Exception: Expression for function arrayExists must return UInt8, found Nullable(UInt8)
My query is below where distance is the array column:
SELECT * from TracabEvents_ArrayTest
where arrayExists(x -> x > 9, distance);
I've tried updating the comparison in the lambda to "(isNotNull(x) and x > 9)" but I'm still getting the error. Is there any way of handling nulls in these expressions or are they not supported yet?

Add a notEmpty condition to filter out rows with an empty list, and wrap x in assumeNotNull inside arrayExists.
SELECT * FROM TracabEvents_ArrayTest WHERE notEmpty(distance) AND arrayExists(x -> assumeNotNull(x) > 9, distance)
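The result the question is after, "does any non-null element exceed the threshold", can be stated in a few lines of Python. This is only a sketch of the intended semantics with a hypothetical function name; it does not model how ClickHouse evaluates assumeNotNull on an element that is actually NULL.

```python
from typing import List, Optional


def exists_above(distance: List[Optional[int]], threshold: int = 9) -> bool:
    """True if at least one non-null element is greater than threshold."""
    # Null elements are skipped explicitly instead of being compared.
    return any(x is not None and x > threshold for x in distance)


print(exists_above([1, None, 12]))
print(exists_above([None, 3]))
```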

ERROR: column mm.geom does not exist in PostgreSQL execution using R

I am trying to run a model in R that calls functions from GRASS GIS (version 7.0.2) and PostgreSQL (version 9.5). I created a database in PostgreSQL, added the PostGIS extension, and imported the required vector layers into the database using the PostGIS shapefile importer. Every time I run it from R (as an administrator), it returns an error like:
Error in fetch(dbSendQuery(con, q, n = -1)) :
error in evaluating the argument 'res' in selecting a method for function 'fetch': Error in postgresqlExecStatement(conn, statement, ...) :
RS-DBI driver: (could not Retrieve the result : ERROR: column mm.geom does not exist
LINE 5: (st_dump(st_intersection(r.geom, mm.geom))).geom as geom,
^
HINT: Perhaps you meant to reference the column "r.geom".
QUERY:
insert into m_rays
with os as (
select r.ray, st_endpoint(r.geom) as s,
(st_dump(st_intersection(r.geom, mm.geom))).geom as geom,
mm.legend, mm.hgt as hgt, r.totlen
from rays as r,bh_gd_ne_clip as mm
where st_intersects(r.geom, mm.geom)
)
select os.ray, os.geom, os.hgt, l.absorb, l.barrier, os.totlen,
st_length(os.geom) as shape_length, st_distance(os.s, st_endpoint(os.geom)) as near_dist
from os left join lut as l
on os.legend = l.legend
CONTEXT: PL/pgSQL function do_crtn(text,text,text) line 30 at EXECUTE
I have checked over and over again: the geometry column does exist under Schema > Public > Views in PostgreSQL. Any advice on how to resolve this error?

Add quotes: use r."geom" instead of r.geom.
