Python cx_Oracle bind variable with a list of items - sql

I have a query like this:
SELECT prodId, prod_name , prod_type FROM mytable WHERE prod_type in (:list_prod_names)
I want to get the information for a product depending on its type. The possible types are: "day", "week", "weekend", "month". Depending on the date, the list might contain at least one of those options, or a combination of all of them.
This info (a Python list) is returned by the function prod_names(date_search).
I am using cx_oracle bindings with code like:
def get_prod_by_type(search_date: datetime):
    query_path = r'./queries/prod_by_name.sql'
    raw_query = open(query_path).read().strip().replace('\n', ' ').replace('\t', ' ').replace('  ', ' ')
    print(raw_query)
    # Depending on the date the product types may be different
    prod_names = prod_names(search_date)  # This returns a list with the possible names
    qry_params = {"list_prod_names": prod_names}  # See attempts below
    try:
        db = DB(username='username', password='pss', hostname="localhost")
        df = db.get(raw_query, qry_params)
    except Exception:
        exception_error = traceback.format_exc()
        exception_error = 'Exception on DB.get_short_cov_op2(): %s' % exception_error
        print(exception_error)
    return df
For this: qry_params = {"list_prod_names": prod_names} I have tried multiple different things such as:
prod_names = ''.join(prod_names)
prod_names = str(prod_names)
prod_names = "\'" + ''.join(prod_names) + "\'"
The only thing I have managed to get it work is by doing:
new_query = raw_query.format(list_prod_names=prodnames_for_date(search_date)).replace('[', '').replace(']','')
df = db.query(new_query)
I am trying not to use .format() because formatting values directly into a SQL statement is bad practice and leaves the query open to injection attacks.
db.py contains among other functions:
def get(self, sql, params={}):
    cur = self.con.cursor()
    cur.prepare(sql)
    try:
        cur.execute(sql, **params)
        df = pd.DataFrame(cur.fetchall(), columns=[c[0] for c in cur.description])
    except Exception:
        exception_error = traceback.format_exc()
        exception_error = 'Exception on DB.get(): %s' % exception_error
        print(exception_error)
        self.con.rollback()
    cur.close()
    df.columns = df.columns.map(lambda x: x.upper())
    return df
I would like to be able to do this type of binding.
I am using:
python = 3.6
cx_oracle = 6.3.1
I have read the following articles but I am still unable to find a solution:
Python cx_Oracle bind variables
Python cx_Oracle SQL with bind string variable
Search for name in cx_Oracle

Unfortunately you cannot bind an array directly unless you convert it to a SQL type and use a subquery -- which is fairly complex. So instead you need to do something like this:
inClauseParts = []
for i, inValue in enumerate(ARRAY_VALUE):
    argName = "arg_" + str(i + 1)
    inClauseParts.append(":" + argName)
clause = "%s in (%s)" % (columnName, ",".join(inClauseParts))
This works fine, but be aware that if the number of elements in the array changes regularly, this technique will create a separate statement that must be parsed for each distinct number of elements. If you know that (in general) you won't have more than (for example) 10 elements in the array, it would be better to append None to the incoming array so that the number of elements is always 10.
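To make that concrete, here is a minimal sketch of the same idea applied to the query from the question (the list values, the connect string and the column name are assumptions used only for illustration):

import cx_Oracle

prod_names_list = ["day", "weekend"]  # in the question this comes from prod_names(search_date)

# one bind variable per element: :prod_1, :prod_2, ...
bind_names = [":prod_%d" % (i + 1) for i in range(len(prod_names_list))]
sql = ("SELECT prodId, prod_name, prod_type FROM mytable "
       "WHERE prod_type IN (%s)" % ", ".join(bind_names))
params = {"prod_%d" % (i + 1): value for i, value in enumerate(prod_names_list)}

connection = cx_Oracle.connect("username", "pss", "localhost/orclpdb")  # assumed connect string
cursor = connection.cursor()
cursor.execute(sql, params)
rows = cursor.fetchall()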
Hopefully that is clear enough!

I have finally managed to do it. It might not be pretty but it works.
I have modified my SQL query to include an extra inner join against a subquery that splits my comma-separated list of descriptors:
inner join (
    SELECT regexp_substr(:my_list_of_items, '[^,]+', 1, LEVEL) as mylist
    FROM dual
    CONNECT BY LEVEL <= length(:my_list_of_items) - length(REPLACE(:my_list_of_items, ',', '')) + 1
) d
    on d.mylist = a.corresponding_columns
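On the Python side the whole list is then passed as one comma-joined string for the single :my_list_of_items bind variable (a sketch; the variable names are assumptions based on the code above):

prod_list = prod_names(search_date)  # e.g. ['day', 'weekend']
qry_params = {"my_list_of_items": ",".join(prod_list)}
df = db.get(raw_query, qry_params)

Even though the bind name appears three times in the statement, it only needs to be supplied once because it is a named bind.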

Related

how to dynamically build a select list from an API payload using PyPika

I have a JSON API payload containing tablename and columnlist - how do I build a SELECT query from it using pypika?
So far I have been able to use a string columnlist, but I am not able to do advanced querying using functions, analytics, etc.
from pypika import Table, Query, functions as fn

def generate_sql(tablename, collist):
    table = Table(tablename)
    columns = [str(table) + '.' + each for each in collist]
    q = Query.from_(table).select(*columns)
    return q.get_sql(quote_char=None)

tablename = 'customers'
collist = ['id', 'fname', 'fn.Sum(revenue)']
print(generate_sql(tablename, collist))  # 1

table = Table(tablename)
q = Query.from_(table).select(table.id, table.fname, fn.Sum(table.revenue))
print(q.get_sql(quote_char=None))  # 2
#1 outputs
SELECT "customers".id,"customers".fname,"customers".fn.Sum(revenue) FROM customers
#2 outputs correctly
SELECT id,fname,SUM(revenue) FROM customers
You should not be trying to assemble the query as a string yourself; that defeats the whole purpose of pypika.
In your case, since the table name and the columns arrive as text in a JSON object, you can use * to unpack the values from collist and the obj[key] syntax to look up a table attribute by name from a string.
q = Query.from_(table).select(*(table[col] for col in collist))
# SELECT id,fname,fn.Sum(revenue) FROM customers
Hmm... that doesn't quite work for the fn.Sum(revenue). The goal is to get SUM(revenue).
This can get much more complicated from this point on. If you are only sending column names that you know belong to that table, the above solution is enough.
But if you have complex SQL expressions that reference SQL functions or even different tables, I suggest you rethink the decision of sending them as JSON. You might end up building something as complex as pypika itself, such as a custom parser. In that case, your better option is to change the format of your JSON response object.
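For instance (purely illustrative; the payload shape and field names are made up), a more structured payload removes the need to parse SQL-like strings at all:

from pypika import Table, Query, functions as fn

# hypothetical structured payload instead of SQL-like strings
payload = {
    "table": "customers",
    "columns": [
        {"name": "id"},
        {"name": "fname"},
        {"name": "revenue", "aggregate": "Sum"},
    ],
}

table = Table(payload["table"])
selects = [
    getattr(fn, col["aggregate"])(table[col["name"]]) if "aggregate" in col
    else table[col["name"]]
    for col in payload["columns"]
]
q = Query.from_(table).select(*selects)
print(q.get_sql(quote_char=None))  # expected: SELECT id,fname,SUM(revenue) FROM customers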
If, on the other hand, you know you only need to support a very limited set of capabilities, parsing the strings can be feasible. For example, you can assume the following constraints:
all column names refer to only one table, no joins or aliases
all functions will be prefixed by fn.
no fancy stuff like window functions, distinct, count(*)...
Then you can do something like:
from pypika import Table, Query, functions as fn
import re

tablename = 'customers'
collist = ['id', 'fname', 'fn.Sum(revenue / 2)', 'revenue % fn.Count(id)']

def parsed(cols):
    pattern = r'(?:\bfn\.[a-zA-Z]\w*)|([a-zA-Z]\w*)'
    subst = lambda m: f"{'' if m.group().startswith('fn.') else 'table.'}{m.group()}"
    yield from (re.sub(pattern, subst, col) for col in cols)

table = Table(tablename)
env = dict(table=table, fn=fn)
q = Query.from_(table).select(*(eval(col, env) for col in parsed(collist)))
print(q.get_sql(quote_char=None))
Output:
SELECT id,fname,SUM(revenue/2),MOD(revenue,COUNT(id)) FROM customers

find tables with specific column names in a database on Databricks with pyspark

I would like to find tables with a specific column in a database on databricks by pyspark sql.
I use the following code but it does not work.
https://medium.com/#rajnishkumargarg/find-all-the-tables-by-column-name-in-hive-51caebb94832
On SQL server my code:
SELECT Table_Name, Column_Name
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_CATALOG = 'YOUR_DATABASE'
AND COLUMN_NAME LIKE '%YOUR_COLUMN%'
but I cannot find out how to do the same thing with pyspark sql.
Thanks
The SparkSession has a catalog property. This catalog's listTables method returns a list of all tables known to the SparkSession. With this list you can query all columns of each table with listColumns:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("test").getOrCreate()
spark.sql("CREATE TABLE tab1 (name STRING, age INT) USING parquet")
spark.sql("CREATE TABLE tab2 (name STRING, age INT) USING parquet")
spark.sql("CREATE TABLE tab3 (street STRING, age INT) USING parquet")

for table in spark.catalog.listTables():
    for column in spark.catalog.listColumns(table.name):
        if column.name == 'name':
            print('Found column {} in table {}'.format(column.name, table.name))
prints
Found column name in table tab1
Found column name in table tab2
Both methods, listTables and listColumns, accept a database name as an optional argument if you want to restrict your search to a single database.
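For example (the database name here is just an assumed placeholder):

# restrict the search to a single database instead of the whole catalog
for table in spark.catalog.listTables("my_database"):
    for column in spark.catalog.listColumns(table.name, dbName="my_database"):
        if column.name == 'name':
            print('Found column {} in table {}'.format(column.name, table.name))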
I had a similar problem to OP: I needed to find all columns - including nested columns - that match a LIKE clause.
I wrote a post about it here https://medium.com/helmes-people/how-to-view-all-databases-tables-and-columns-in-databricks-9683b12fee10
But you can find the full code below.
The benefit of this solution, in comparison with the previous answers, is that it works in case you need to search columns with LIKE '%%', as written by OP. Also, it allows you to search for names in nested fields. Finally, it creates a SQL-like view, similar to the INFORMATION_SCHEMA views.
from pyspark.sql.types import StructType

# get field name from schema (recursive for getting nested values)
def get_schema_field_name(field, parent=None):
    if type(field.dataType) == StructType:
        if parent == None:
            prt = field.name
        else:
            prt = parent + "." + field.name  # using dot notation
        res = []
        for i in field.dataType.fields:
            res.append(get_schema_field_name(i, prt))
        return res
    else:
        if parent == None:
            res = field.name
        else:
            res = parent + "." + field.name
        return res

# flatten list, from https://stackoverflow.com/a/12472564/4920394
def flatten(S):
    if S == []:
        return S
    if isinstance(S[0], list):
        return flatten(S[0]) + flatten(S[1:])
    return S[:1] + flatten(S[1:])

# list of databases
db_list = [x[0] for x in spark.sql("SHOW DATABASES").rdd.collect()]
for i in db_list:
    spark.sql("SHOW TABLES IN {}".format(i)).createOrReplaceTempView(str(i) + "TablesList")

# create a query for fetching all tables from all databases
union_string = "SELECT database, tableName FROM "
for idx, item in enumerate(db_list):
    if idx == 0:
        union_string += str(item) + "TablesList WHERE isTemporary = 'false'"
    else:
        union_string += " UNION ALL SELECT database, tableName FROM {}".format(str(item) + "TablesList WHERE isTemporary = 'false'")
spark.sql(union_string).createOrReplaceTempView("allTables")

# full list = schema, table, column
full_list = []
for i in spark.sql("SELECT * FROM allTables").collect():
    table_name = i[0] + "." + i[1]
    table_schema = spark.sql("SELECT * FROM {}".format(table_name))
    column_list = []
    for j in table_schema.schema:
        column_list.append(get_schema_field_name(j))
    column_list = flatten(column_list)
    for k in column_list:
        full_list.append([i[0], i[1], k])
spark.createDataFrame(full_list, schema=['database', 'tableName', 'columnName']).createOrReplaceTempView("allColumns")
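Once the allColumns view exists, it can be queried much like the INFORMATION_SCHEMA query from the question (the LIKE pattern is only an example):

spark.sql("""
    SELECT database, tableName, columnName
    FROM allColumns
    WHERE columnName LIKE '%name%'
""").show()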
# The following code will create a TempView containing all the tables,
# and all their columns along with their type, for a specified database
cls = []
spark.sql("Drop view if exists allTables")
spark.sql("Drop view if exists allColumns")
for table in spark.catalog.listTables("TYPE_IN_YOUR_DB_NAME_HERE"):
    for column in spark.catalog.listColumns(table.name, table.database):
        cls.append([table.database, table.name, column.name, column.dataType])
spark.createDataFrame(cls, schema=['databaseName', 'tableName', 'columnName',
                                   'columnDataType']).createOrReplaceTempView("allColumns")
SparkSession really does have a catalog property, as werner mentioned.
If I understand you correctly, you want to get the tables that have a specific column.
You can try this code (sorry for the Scala code instead of Python):
val databases = spark.catalog.listDatabases().select($"name".as("db_name")).as("databases")
val tables = spark.catalog.listTables().select($"name".as("table_name"), $"database").as("tables")
val tablesWithDatabase = databases.join(tables, $"databases.db_name" === $"tables.database", "inner").collect()

tablesWithDatabase.foreach(row => {
  val dbName = row.get(0).asInstanceOf[String]
  val tableName = row.get(1).asInstanceOf[String]
  val columns = spark.catalog.listColumns(dbName, tableName)
  columns.foreach(column => {
    if (column.name == "Your column")
      // Do your logic here
      null
  })
})
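Since the question asked for PySpark, a rough Python sketch of the same idea (untested; the catalog methods can simply be called per database, so no join is needed) might look like this:

for db in spark.catalog.listDatabases():
    for table in spark.catalog.listTables(db.name):
        for column in spark.catalog.listColumns(table.name, dbName=db.name):
            if column.name == "Your column":
                # Do your logic here
                pass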
Notice that I am doing collect, so if you have a lot of tables/databases it can cause an OOM error. The reason I am doing collect is that, in contrast to the listTables and listDatabases methods, which can be called without any arguments, listColumns needs to be given a dbName and tableName, and there is no unique column id matching a column to its table.
So the search for the column will be done locally on the driver.
Hope that was helpful.

Passing a parameter to a sql query using pyodbc failing

I have read dozens of similar posts and tried everything, but I still get an error message when trying to pass a parameter to a simple query using pyodbc. Apologies if there is an answer to this elsewhere, but I cannot find it.
I have a very simple table:
select * from Test
yields
a
b
c
This works fine:
import pyodbc
import pandas
connection = pyodbc.connect('DSN=HyperCube SYSTEST',autocommit=True)
result = pandas.read_sql("""select * from Test where value = 'a'""",connection,params=None)
print(result)
result:
value
0 a
However if I try to do the where clause with a parameter it fails
result = pandas.read_sql("""select * from Test where value = ?""",connection,params='a')
yields
Error: ('01S02', '[01S02] Unknown column/parameter value (9001) (SQLPrepare)')
I also tried this
cursor = connection.cursor()
cursor.execute("""select * from Test where value = ?""",['a'])
pyodbcResults = cursor.fetchall()
and still received the same error
Does anyone know what is going on? Could it be an issue with the database I am querying?
PS. I looked at the following post and the syntax there in the first part of answer 9 where dates are passed by strings looks identical to what I am doing
pyodbc the sql contains 0 parameter markers but 1 parameters were supplied' 'hy000'
Thanks
pandas.read_sql(sql, con, index_col=None, coerce_float=True, params=None, parse_dates=None, columns=None, chunksize=None)
(https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql.html)
params : list, tuple or dict, optional, default: None
example:
cursor.execute("select * from Test where value = %s",['a'])
or a named-arguments example:
result = pandas.read_sql('select * from Test where value = %(par)s',
                         db, params={"par": 'p'})
In pyodbc you can also write the parameters directly after the sql argument:
cursor.execute(sql, *parameters)
for example:
onepar = 'a'
cursor.execute("select * from Test where value = ?", onepar)
cursor.execute("select a from tbl where b=? and c=?", x, y)
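Applied to the original pandas call, the likely fix is simply to pass the parameter as a one-element list (or tuple) instead of a bare string, since params expects a sequence or dict:

result = pandas.read_sql("select * from Test where value = ?",
                         connection, params=['a'])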

Sql query dynamic variable passing

I've looked at the documentation in various places to see how to do this, but I haven't had any success. I want to pass the name of a column into a SQL query. I'm using psycopg2, and my most recent attempt was based off of this doc page http://initd.org/psycopg/docs/sql.html#module-psycopg2.sql
Here is my latest attempt, but I get an error IndexError: tuple index out of range that points to the format() where I'm passing in the parameter.
def parse_files(cursor):
    for name in column_names:
        cursor.execute(
            sql.SQL(
                "select planet_osm_point.{}, count(*) from planet_osm_point group by planet_osm_point.{}"
            ).format(sql.Identifier(name)))
        for row in cursor:
            print(str(row[0]) + str(row[1]))
It's not clear from the documentation, but it looks like I need to pass a value inside the {} specifying which argument I want to use. In this case it's {0}, so the single identifier is reused for both placeholders:
column_names = ['col1', 'col2']
for column in column_names:
    query = sql.SQL('''
        select {0}, count(*)
        from planet_osm_point pop
        group by {0}
    ''').format(sql.Identifier('pop.' + column))
    cursor.execute(query)
    for row in cursor.fetchall():
        print(str(row[0]) + str(row[1]))
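One caveat (my note, not part of the original answer): sql.Identifier('pop.' + column) quotes the whole string as a single identifier ("pop.col1"). On psycopg2 2.8 or newer, a qualified name can instead be passed as separate parts so each piece is quoted individually (a sketch, assuming the same cursor and column list):

from psycopg2 import sql

column_names = ['col1', 'col2']
for column in column_names:
    query = sql.SQL('''
        select {0}, count(*)
        from planet_osm_point pop
        group by {0}
    ''').format(sql.Identifier('pop', column))  # renders as "pop"."col1"
    cursor.execute(query)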

Efficient way of iterating rows in pandas

I have a dataframe in pandas containing the following information
Using a for loop over each entry in TRANSACTION_ID, I am calling the following function:
def checkForImages(TransNum):
    """pass function a transaction number and get the string with image found information then store that
    string into the same row in a new column"""
    try:
        cursor.execute('select CAMERA_TYPE from VEHICLE_IMAGE where TRANSACTION_ID=' + str(TransNum))
        result = ''
        for img_type in cursor:
            result = result + img_type[0]
        if result == '':
            result = 'No image available'
        print 'Images found: ' + str(TransNum) + " " + result
        resultSort = result.split()
        resultSort.sort()
        result = ''
        for i in range(len(resultSort)):
            result = result + " " + resultSort[i]
        cursor.close()
        return result
    except Exception as e:
        # print 'Error occured while getting image references: ', e
        pass
This function returns a string which is either 'No images available' or has the image information if found. I have to create a new column in the dataframe populated with this result so my final dataframe should look like this
My question is: how can I speed up this process? Using a for loop on rows with 100k+ entries is extremely slow and painful. I have looked into functions like DataFrame.map and DataFrame.apply but haven't been able to get them working. Other options I see are using Cython or multiple threads. In which option should I invest my time? Any help is appreciated.
You query Oracle for each transaction and then additionally aggregate fetched data for each transaction in a loop - it's very inefficient.
First I would create a "mapping" DataFrame as follows:
transaction_id images
111 No image available
112 FRONT REAR
113 OVERVIEW
this can be done using Oracle's LISTAGG function:
qry = """
    select
        transaction_id,
        NVL(listagg(camera_type, ' ') within group (order by camera_type), 'No image available') as images
    from vehicle_image
    group by transaction_id
"""
# `engine` is a SQLAlchemy engine connection ...
cam = pd.read_sql(qry, con=engine, index_col=['transaction_id'])
After that we can use the Series.map() method:
df['Image_Found'] = df.transaction_id.map(cam.images)
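One caveat (my assumption, not part of the original answer): transactions that have no row at all in VEHICLE_IMAGE will not appear in cam and will therefore map to NaN, so a final fillna keeps the 'No image available' convention:

# fill transactions that had no VEHICLE_IMAGE rows at all
df['Image_Found'] = df['Image_Found'].fillna('No image available')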