column names are coming as 0,1,2,3 while executing SQL Query via python in snowflake using snowflake connector - sql

I am executing a SQL query from a Python script to retrieve data from Snowflake on Windows 10, but the result is missing the column names: they are replaced by 0, 1, 2, 3 and so on. Executing the same query in the Snowflake web interface and downloading the CSV gives the column names in the file. I am passing the column names as aliases in my query.
Below is the code:
import pandas as pd

def _CONSUMPTION(con):
    data2 = con.cursor().execute("""select sd.sales_force_lvl_1_code "Plan-To Code",sd.sales_force_lvl_1_desc "Plan-To Description",pd.matl_code "Product Code",pd.matl_desc "Product Description",pd.ean_upc_code "UPC",dd.fiscal_week_desc "Fiscal Week Description",f.unit_sales_qty "Sales Units",f.incr_units_qty "Incremental Units"
        from DW.consumption_fact1 f, DW.market_dim md, DW.matl_dim pd, DW.fiscal_week_dim dd, (select sales_force_lvl_1_code,max(sales_force_lvl_1_desc) sales_force_lvl_1_desc from DW.mv_us_sales_force_dim group by sales_force_lvl_1_code) sd
        where dd.fiscal_week_key = f.fiscal_week_key
        and pd.matl_key = f.matl_key
        and md.market_key = f.market_key
        and sd.sales_force_lvl_1_code = md.curr_sales_force_lvl_1_code
        and dd.fiscal_week_key between (select curr_fy_week_key-6 from DW.curr_date_lkp) and (select curr_fy_week_key-1 from DW.curr_date_lkp)
        and f.company_key = 6006
        and (f.unit_sales_qty <> 0 and f.sales_amt <> 0)
        and md.curr_sales_force_lvl_1_code is not null
        UNION
        select '5000016240' "Plan-To Code", 'AWG TOTAL' "Plan-To Description",pd.matl_code "Product Code",pd.matl_desc "Product Description",pd.ean_upc_code "UPC",dd.fiscal_week_desc "Fiscal Week Description",f.unit_sales_qty "Sales Units",f.incr_units_qty "Incremental Units"
        from DW.consumption_fact1 f, DW.market_dim md, DW.matl_dim pd, DW.fiscal_week_dim dd
        where dd.fiscal_week_key = f.fiscal_week_key
        and pd.matl_key = f.matl_key
        and md.market_key = f.market_key
        and dd.fiscal_week_key between (select curr_fy_week_key-6 from DW.curr_date_lkp) and (select curr_fy_week_key-1 from DW.curr_date_lkp)
        and f.company_key = 6006
        and (f.unit_sales_qty <> 0 and f.sales_amt <> 0)
        and md.market_code = '20267'""").fetchall()
    df = pd.DataFrame(data2)
    df.head(5)
    df.to_csv('CONSUMPTION.csv', index=False)

Looking at the docs, it seems the easiest way is to use the cursor method .fetch_pandas_all():
query = "SELECT 1 a, 2 b, 'a' c UNION ALL SELECT 7,4,'snow'"
cur = connection.cursor()
cur.execute(query).fetch_pandas_all()
Or if you want to dump the results into a CSV, just do so as in the question:
query = "SELECT 1 a, 2 b, 'a' c UNION ALL SELECT 7,4,'snow'"
cur = connection.cursor()
df = cur.execute(query).fetch_pandas_all()
df.to_csv('x.csv', index = False)
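Applied to the function from the question, a minimal sketch (assuming the same con connection and the same query text; fetch_pandas_all also needs the connector's pandas extras, i.e. pyarrow, installed):
def _CONSUMPTION(con):
    # Same long query as in the question, abbreviated here for readability
    sql = """select sd.sales_force_lvl_1_code "Plan-To Code", ...
             ...rest of the statement exactly as in the question... """
    df = con.cursor().execute(sql).fetch_pandas_all()
    df.to_csv('CONSUMPTION.csv', index=False)
    return df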

Looks like you haven't supplied the column names when building the data frame, so pandas falls back to the default 0, 1, 2, 3 labels.
My recommendation would be to set the column names explicitly (df.columns), for example from cur.description.
In addition, refer to the Snowflake docs for details:
https://docs.snowflake.com/en/user-guide/python-connector-pandas.html
Try this:
import pandas as pd

def fetch_pandas_old(cur, sql):
    cur.execute(sql)
    rows = 0
    while True:
        dat = cur.fetchmany(50000)
        if not dat:
            break
        # The first element of each cur.description tuple is the column name
        df = pd.DataFrame(dat, columns=[col[0] for col in cur.description])
        rows += df.shape[0]
    print(rows)
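If you want the data back rather than just a row count, a small variation of the same pattern (a sketch) collects the chunks and concatenates them:
import pandas as pd

def fetch_pandas_chunked(cur, sql):
    cur.execute(sql)
    chunks = []
    while True:
        dat = cur.fetchmany(50000)
        if not dat:
            break
        # First element of each description tuple is the column name
        chunks.append(pd.DataFrame(dat, columns=[col[0] for col in cur.description]))
    return pd.concat(chunks, ignore_index=True) if chunks else pd.DataFrame()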

A nice way to extract the column headings from the cursor description and save in a pandas df using the Snowflake connector (also works for psycopg2 btw) is as follows:
import snowflake.connector

# Create the connection
def connect_snowflake(uname, pword, acct, role_name, whouse, dbase, schema_name):
    conn = snowflake.connector.connect(
        user=uname,
        password=pword,
        account=acct,
        role=role_name,
        warehouse=whouse,
        database=dbase,
        schema=schema_name
    )
    cur = conn.cursor()
    return conn, cur
Then execute your query. The cur.description object returns a list of tuples, the first of each being the column name :)
conn, cur = connect_snowflake(username, password, account_name, role, warehouse, database, schema)
cur.execute('select * from my_schema.my_table')
result = cur.fetchall()
# Extract the column names
col_names = []
for elt in cur.description:
    col_names.append(elt[0])
df = pd.DataFrame(result, columns=col_names)
cur.close()
conn.close()
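The same extraction also fits in a one-line list comprehension:
col_names = [elt[0] for elt in cur.description]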

Related

PLSQL NOT IN ('val1','val2') not working in procedure [duplicate]

I'd like to use the IN clause with a prepared Oracle statement using cx_Oracle in Python.
E.g. query - select name from employee where id in ('101', '102', '103')
On the Python side, I have a list [101, 102, 103], which I converted to a string like ('101', '102', '103'), and used the following code in Python:
import cx_Oracle
ids = [101, 102, 103]
ALL_IDS = "('{0}')".format("','".join(map(str, ids)))
conn = cx_Oracle.connect('username', 'pass', 'schema')
cursor = conn.cursor()
results = cursor.execute('select name from employee where id in :id_list', id_list=ALL_IDS)
names = [x[0] for x in cursor.description]
rows = results.fetchall()
This doesn't work. Am I doing something wrong?
This concept is not supported by Oracle -- and you are definitely not the first person to try this approach either! You must either:
create separate bind variables for each in value -- something that is fairly easy and straightforward to do in Python (see the sketch after this list)
create a subquery using the cast operator on Oracle types as is shown in this post: https://asktom.oracle.com/pls/asktom/f?p=100:11:0::::p11_question_id:210612357425
use a stored procedure to accept the array and perform multiple queries directly within PL/SQL
or do something else entirely!
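A minimal sketch of the first option, using the question's employee table (the bind variable names here are made up for illustration):
import cx_Oracle

ids = [101, 102, 103]
# One named bind variable per value: :id_1, :id_2, :id_3
bind_names = [":id_{}".format(i + 1) for i in range(len(ids))]
sql = "select name from employee where id in ({})".format(",".join(bind_names))
bind_values = {"id_{}".format(i + 1): v for i, v in enumerate(ids)}

conn = cx_Oracle.connect('username', 'pass', 'schema')
cursor = conn.cursor()
cursor.execute(sql, bind_values)
names = [row[0] for row in cursor.fetchall()]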
Just transform your list into a tuple and format the SQL string with it:
ids = [101, 102, 103]
param = tuple(ids)
results = cursor.execute("select name from employee where id IN {}".format(param))
Another option is to format a string with the query:
import cx_Oracle
ids = [101, 102, 103]
ALL_IDS = "('{0}')".format("','".join(map(str, ids)))
conn = cx_Oracle.connect('username', 'pass', 'schema')
cursor = conn.cursor()
query = """
select name from employee where id in ('{}')
""".format("','".join(map(str, ids)))
results = cursor.execute(query)
names = [x[0] for x in cursor.description]
rows = results.fetchall()
Since you created the string, you're almost there. This should work:
results = cursor.execute('select name from employee where id in ' + ALL_IDS)

find tables with specific columns' names in a database on databricks by pyspark

I would like to find tables with a specific column in a database on Databricks using PySpark SQL.
I tried the approach from the following article, but it does not work:
https://medium.com/@rajnishkumargarg/find-all-the-tables-by-column-name-in-hive-51caebb94832
On SQL server my code:
SELECT Table_Name, Column_Name
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_CATALOG = 'YOUR_DATABASE'
AND COLUMN_NAME LIKE '%YOUR_COLUMN%'
But I cannot find out how to do the same thing with PySpark SQL.
Thanks!
The SparkSession has a property catalog. This catalog's method listTables returns a list of all tables known to the SparkSession. With this list you can query all columns for each table with listColumns
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("test").getOrCreate()

spark.sql("CREATE TABLE tab1 (name STRING, age INT) USING parquet")
spark.sql("CREATE TABLE tab2 (name STRING, age INT) USING parquet")
spark.sql("CREATE TABLE tab3 (street STRING, age INT) USING parquet")

for table in spark.catalog.listTables():
    for column in spark.catalog.listColumns(table.name):
        if column.name == 'name':
            print('Found column {} in table {}'.format(column.name, table.name))
prints
Found column name in table tab1
Found column name in table tab2
Both methods, listTables and listColumns accept a database name as an optional argument if you want to restrict your search to a single database.
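For example (a sketch; "my_database" is a placeholder name):
# Limit the search to a single database instead of the whole catalog
for table in spark.catalog.listTables("my_database"):
    for column in spark.catalog.listColumns(table.name, "my_database"):
        if column.name == 'name':
            print('Found column {} in table my_database.{}'.format(column.name, table.name))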
I had a similar problem to OP, I needed to find all columns - including nested columns - that match a LIKE clause.
I wrote a post about it here https://medium.com/helmes-people/how-to-view-all-databases-tables-and-columns-in-databricks-9683b12fee10
But you can find the full code below.
The benefit of this solution, in comparison with the previous answers, is that it works when you need to search for columns with LIKE '%%', as written by the OP. Also, it allows you to search for names in nested fields. Finally, it creates a SQL-like view, similar to INFORMATION_SCHEMA views.
from pyspark.sql.types import StructType

# get field name from schema (recursive for getting nested values)
def get_schema_field_name(field, parent=None):
    if type(field.dataType) == StructType:
        if parent == None:
            prt = field.name
        else:
            prt = parent + "." + field.name  # using dot notation
        res = []
        for i in field.dataType.fields:
            res.append(get_schema_field_name(i, prt))
        return res
    else:
        if parent == None:
            res = field.name
        else:
            res = parent + "." + field.name
        return res

# flatten list, from https://stackoverflow.com/a/12472564/4920394
def flatten(S):
    if S == []:
        return S
    if isinstance(S[0], list):
        return flatten(S[0]) + flatten(S[1:])
    return S[:1] + flatten(S[1:])

# list of databases
db_list = [x[0] for x in spark.sql("SHOW DATABASES").rdd.collect()]
for i in db_list:
    spark.sql("SHOW TABLES IN {}".format(i)).createOrReplaceTempView(str(i) + "TablesList")

# create a query for fetching all tables from all databases
union_string = "SELECT database, tableName FROM "
for idx, item in enumerate(db_list):
    if idx == 0:
        union_string += str(item) + "TablesList WHERE isTemporary = 'false'"
    else:
        union_string += " UNION ALL SELECT database, tableName FROM {}".format(str(item) + "TablesList WHERE isTemporary = 'false'")
spark.sql(union_string).createOrReplaceTempView("allTables")

# full list = schema, table, column
full_list = []
for i in spark.sql("SELECT * FROM allTables").collect():
    table_name = i[0] + "." + i[1]
    table_schema = spark.sql("SELECT * FROM {}".format(table_name))
    column_list = []
    for j in table_schema.schema:
        column_list.append(get_schema_field_name(j))
    column_list = flatten(column_list)
    for k in column_list:
        full_list.append([i[0], i[1], k])
spark.createDataFrame(full_list, schema=['database', 'tableName', 'columnName']).createOrReplaceTempView("allColumns")
# The following code will create a TempView containing all the tables,
# and all their columns along with their type, for a specified database
cls = []
spark.sql("Drop view if exists allTables")
spark.sql("Drop view if exists allColumns")
for table in spark.catalog.listTables("TYPE_IN_YOUR_DB_NAME_HERE"):
    for column in spark.catalog.listColumns(table.name, table.database):
        cls.append([table.database, table.name, column.name, column.dataType])
spark.createDataFrame(cls, schema=['databaseName', 'tableName', 'columnName',
                                   'columnDataType']).createOrReplaceTempView("allColumns")
SparkSession really does have a catalog property, as werner mentioned.
If I understand you correctly, you want to get the tables that have a specific column.
You can try this code (sorry for Scala code instead of Python):
val databases = spark.catalog.listDatabases().select($"name".as("db_name")).as("databases")
val tables = spark.catalog.listTables().select($"name".as("table_name"), $"database").as("tables")
val tablesWithDatabase = databases.join(tables, $"databases.db_name" === $"tables.database", "inner").collect()
tablesWithDatabase.foreach(row => {
  val dbName = row.get(0).asInstanceOf[String]
  val tableName = row.get(1).asInstanceOf[String]
  val columns = spark.catalog.listColumns(dbName, tableName)
  columns.foreach(column => {
    if (column.name == "Your column")
      // Do your logic here
      null
  })
})
Notice that I am doing a collect, so if you have a lot of tables/databases it can cause an OOM error. The reason I am doing collect is that, in contrast to the listTables or listDatabases methods, which can be called without any arguments, listColumns needs a dbName and tableName, and there is no unique column id to match back to a table.
So the search for the column is done locally on the driver.
Hope that helps.

Python cx_oracle bind variable with a list of items

I have a query like this:
SELECT prodId, prod_name , prod_type FROM mytable WHERE prod_type in (:list_prod_names)
I want to get the information for a product depending on its type. The possible types are: "day", "week", "weekend", "month". Depending on the date it might be at least one of those options, or a combination of all of them.
This info (List type) is returned by the function prod_names(date_search)
I am using cx_oracle bindings with code like:
def get_prod_by_type(search_date: datetime):
    query_path = r'./queries/prod_by_name.sql'
    raw_query = open(query_path).read().strip().replace('\n', ' ').replace('\t', ' ').replace('  ', ' ')
    print(sql_read_op)
    # Depending on the date the product types may be different
    prod_names(search_date)  # This returns a list with possible names
    qry_params = {"list_prod_names": prod_names}  # See attempts below
    try:
        db = DB(username='username', password='pss', hostname="localhost")
        df = db.get(raw_query, qry_params)
    except Exception:
        exception_error = traceback.format_exc()
        exception_error = 'Exception on DB.get_short_cov_op2() : %s\n%s' % exception_error
        print(exception_error)
    return df
For this: qry_params = {"list_prod_names": prod_names} I have tried multiple different things such as:
prod_names = ''.join(prod_names)
prod_names = str(prod_names)
prod_names = " \'" + ''.join(prod_names) + "\'"
The only thing I have managed to get to work is by doing:
new_query = raw_query.format(list_prod_names=prodnames_for_date(search_date)).replace('[', '').replace(']','')
df = db.query(new_query)
I am trying not to use .format() because it is bad practice to build SQL with .format (to prevent injection attacks).
db.py contains among other functions:
def get(self, sql, params={}):
    cur = self.con.cursor()
    cur.prepare(sql)
    try:
        cur.execute(sql, **params)
        df = pd.DataFrame(cur.fetchall(), columns=[c[0] for c in cur.description])
    except Exception:
        exception_error = traceback.format_exc()
        exception_error = 'Exception on DB.get() : %s\n%s' % exception_error
        print(exception_error)
        self.con.rollback()
    cur.close()
    df.columns = df.columns.map(lambda x: x.upper())
    return df
I would like to be able to do a type binding.
I am using:
python = 3.6
cx_oracle = 6.3.1
I have read the following articles but I am still unable to find a solution:
Python cx_Oracle bind variables
Python cx_Oracle SQL with bind string variable
Search for name in cx_Oracle
Unfortunately you cannot bind an array directly unless you convert it to a SQL type and use a subquery -- which is fairly complex. So instead you need to do something like this:
inClauseParts = []
for i, inValue in enumerate(ARRAY_VALUE):
    argName = "arg_" + str(i + 1)
    inClauseParts.append(":" + argName)
clause = "%s in (%s)" % (columnName, ",".join(inClauseParts))
This works fine but be aware that if the number of elements in the array changes regularly that using this technique will create a separate statement that must be parsed for each number of elements. If you know that (in general) you won't have more than (for example) 10 elements in the array it would be better to append None to the incoming array so that the number of elements is always 10.
Hopefully that is clear enough!
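For completeness, a minimal sketch of feeding the generated clause and the matching bind values to the cursor (the variable names here are illustrative, not taken from the question's db.py):
# Build one bind variable per list element plus a dict of their values
prod_list = ["day", "week"]  # e.g. what prod_names(search_date) returned
in_parts = []
bind_values = {}
for i, value in enumerate(prod_list):
    name = "prod_{}".format(i + 1)
    in_parts.append(":" + name)
    bind_values[name] = value

sql = ("SELECT prodId, prod_name, prod_type FROM mytable "
       "WHERE prod_type IN ({})".format(",".join(in_parts)))
cur.execute(sql, bind_values)  # cur is an open cx_Oracle cursor
rows = cur.fetchall()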
I have finally managed to do it. It might not be pretty, but it works.
I have modified my SQL query to include an extra select which returns the values of my list of descriptors:
inner join (
    SELECT regexp_substr(:my_list_of_items, '[^,]+', 1, LEVEL) as mylist
    FROM dual
    CONNECT BY LEVEL <= length(:my_list_of_items) - length(REPLACE(:my_list_of_items, ',', '')) + 1
) d
on d.mylist = a.corresponding_columns
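On the Python side the whole list is then bound once as a single comma-separated string, e.g. (a sketch reusing the db.get helper and prod_names function from the question):
# The CONNECT BY query splits this string server-side, one row per item
qry_params = {"my_list_of_items": ",".join(prod_names(search_date))}
df = db.get(raw_query, qry_params)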

How to use variables in SQL query when using Python and pyodbc

I am using Python to extract data from a SQL database, using ODBC to link Python to the database. When I run the query, I need to use variables in it so that the result can change. For example, my code is:
import pandas as pd
import pyodbc

myConnect = pyodbc.connect('DSN=B1P HANA;UID=***;PWD=***')
myCursor = myConnect.cursor()
Start = 20180501
End = 20180501
myOffice = pd.Series([1, 2, 3])

myRow = myCursor.execute("""
    SELECT "CALDAY" AS "Date",
           "/BIC/ZSALE_OFF" AS "Office"
    FROM "SAPB1P"."/BIC/AZ_RT_A212"
    WHERE "CALDAY" BETWEEN 20180501 AND 20180501
    GROUP BY "CALDAY","/BIC/ZSALE_OFF"
""")
Result = myRow.fetchall()

d = pd.DataFrame(columns=['Date', 'Office'])
for i in Result:
    d = d.append({'Date': i.Date,
                  'Office': i.Office},
                 ignore_index=True)
You can see that I retrieve data from the SQL database and save it into a list (Result), then convert this list to a data frame (d).
But my problems are:
I need to specify a start date and an end date in the myCursor.execute part, something like "CALDAY" BETWEEN Start AND End
Let's say I have 100 offices in my data and I only need 3 of them (myOffice). So I need to put a condition in the myCursor.execute part, like myOffice in (1,2,3)
In R, I know how to deal with these two problems. The code is like:
office_clause = ""
if (myOffice != 0) {
  office_clause = paste(
    'AND "/BIC/ZSALE_OFF" IN (', paste(myOffice, collapse=", "), ')'
  )
}
a <- sqlQuery(ch, paste(' SELECT ***
                          FROM ***
                          WHERE "CALDAY" BETWEEN', Start, 'AND', End, '
                          ', office_clause1, '
                          GROUP BY ***
                        '))
But I do not know how to do this in Python. How can I do this?
You can use string formatting operations for this.
First define
query = """
SELECT
    "CALDAY" AS "Date",
    "/BIC/ZSALE_OFF" AS "Office"
FROM
    "SAPB1P"."/BIC/AZ_RT_A212"
WHERE
    "CALDAY" BETWEEN {start} AND {end}
    {other_conds}
GROUP BY
    "CALDAY","/BIC/ZSALE_OFF"
"""
Now you can use
myRow = myCursor.execute(query.format(
    start='20180501',
    end='20180501',
    other_conds=''))
and
myRow = myCursor.execute(query.format(
    start='20180501',
    end='20180501',
    other_conds='AND "/BIC/ZSALE_OFF" IN (1,2,3)'))
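Note that this builds the whole statement by string formatting. For the date bounds you can also let pyodbc bind the values with ? parameter markers, which avoids quoting issues; a sketch (the office list is still formatted in, since its length varies):
# Dates are bound as parameters; only the office condition is formatted into the SQL text
office_ids = [1, 2, 3]
office_clause = 'AND "/BIC/ZSALE_OFF" IN ({})'.format(','.join(str(o) for o in office_ids))

sql = """
SELECT
    "CALDAY" AS "Date",
    "/BIC/ZSALE_OFF" AS "Office"
FROM
    "SAPB1P"."/BIC/AZ_RT_A212"
WHERE
    "CALDAY" BETWEEN ? AND ?
    {}
GROUP BY
    "CALDAY","/BIC/ZSALE_OFF"
""".format(office_clause)

myRow = myCursor.execute(sql, Start, End)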

Adjusting a for loop into a list comprehension

I loop through a list of tables and dates from a database to gather data. Something like this:
df_list = []
for table in table_list:
    for date in required_date_range:
        query = 'SELECT * FROM {} WHERE row_date = {};'.format(table, date)
        df = pd.read_sql_query(sql=query, con=engine)
        df_list.append(df)
result = pd.concat(df_list)
Is there a way to put a loop like that into a list comprehension? Is it even worth it?
I found some example code from https://tomaugspurger.github.io/modern-4-performance.html
files = glob.glob('weather/*.csv')
weather_dfs = [pd.read_csv(fp, names=columns) for fp in files]
weather = pd.concat(weather_dfs)
It looks better and the charts show it performs better but I just can't seem to wrap my head around it when I try to adjust my own code.
Edit-
It seems to work if I make a list of the queries instead. Is there a way to get that initial for loop and .format into a list comprehension as well?
queries = []
for table in table_list:
    for date in required_date_range:
        queries.append('SELECT * FROM {} WHERE row_date = {};'.format(table, date))
dfs = [pd.read_sql_query(query, con=pg_engine) for query in queries]
I don't think a list comprehension by itself would give you a significant performance boost. It might give you a slight performance increase compared to a loop, but I don't think it'd be significant relative to the other things that need to be done, e.g. querying the database, initializing the dataframe, concatenating.
What could potentially give you a performance boost is eliminating your inner loop by using the SQL IN operator:
SELECT * FROM table_name WHERE row_date IN (date1, date2, date3,...);
So, that would change your loop to something like:
df_list = []
for table in table_list:
    query = 'SELECT * FROM {} WHERE row_date IN ({});'.format(table, ','.join(date_range))
    df = pd.read_sql_query(sql=query, con=engine)
    df_list.append(df)
From there it's fairly straightforward to convert it to a comprehension:
query = 'SELECT * FROM {} WHERE row_date IN ({});'
dfs = (pd.read_sql_query(sql=query.format(table, ','.join(date_range)), con=engine) for table in table_list)
df = pd.concat(dfs)
If the columns from each table are identical and in the same order, you could even eliminate the table loop by using UNION ALL to build a single query along the lines of:
SELECT * FROM table1 WHERE row_date IN (date1, date2, date3,...)
UNION ALL
SELECT * FROM table2 WHERE row_date IN (date1, date2, date3,...)
UNION ALL
...
And then just do a single read_sql_query call:
df = pd.read_sql_query(sql=union_all_query, con=engine)
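A sketch of building that combined query string in Python (assuming, as in the snippets above, that the values in date_range are already valid SQL literals):
# One UNION ALL query across all tables, read with a single call
in_list = ','.join(date_range)
union_all_query = '\nUNION ALL\n'.join(
    'SELECT * FROM {} WHERE row_date IN ({})'.format(table, in_list)
    for table in table_list
)
df = pd.read_sql_query(sql=union_all_query, con=engine)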
I think this should work:
def q(table, date):
    query = 'SELECT * FROM {} WHERE row_date = {};'.format
    return pd.read_sql_query(sql=query(table, date), con=engine)

df_list = [q(table, date) for table in table_list for date in required_date_range]
Demonstration
Note: I switched to returning just the query as this is a demonstration and I don't have your database connections.
table_list = ['table1', 'table2']
required_date_range = ['date1', 'date2']

def q(table, date):
    query = 'SELECT * FROM {} WHERE row_date = {};'.format
    return query(table, date)

df_list = [q(table, date) for table in table_list for date in required_date_range]
df_list
['SELECT * FROM table1 WHERE row_date = date1;',
 'SELECT * FROM table1 WHERE row_date = date2;',
 'SELECT * FROM table2 WHERE row_date = date1;',
 'SELECT * FROM table2 WHERE row_date = date2;']