Getting a column full of selected column name instead of data in column (Python & SQLite3) - python-3.8

I have a SQLite file that I am trying to extract some data from. I established the connection and selected the columns that I wanted to save in a CSV, but the saved file contains each column filled with its own column name, repeated on every row. The number of rows is the same as the data available (which I was able to see when I selected all columns).
#code:
import sqlite3
import pandas as pd
file = sqlite_file
def SQLiteQuery(query, file):
    con = sqlite3.connect(file)
    return pd.read_sql_query(query, con)
#I used this to see the available db tables
query = "SELECT * FROM sqlite_master"
table_names = SQLiteQuery(query, file)
#Selecting specific columns
table = table_names.loc[15, 'name']
print(table)
query = (f"SELECT 'column_name1', 'column_name2' FROM '{table}';")
df = SQLiteQuery(query, file)
save_to = 'path_to_folder/filename.csv'
df.to_csv(save_to)
result in csv:
1 column_name1 column_name2
2 column_name1 column_name2
3 column_name1 column_name2
4 column_name1 column_name2
...
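A likely cause, going by standard SQLite quoting rules: single quotes denote string literals, so SELECT 'column_name1' returns that literal text once per row, whereas double quotes (or no quotes at all) denote identifiers. The sketch below reproduces the effect against an in-memory database; the table name demo and its values are made up for illustration and are not from the original file.

```python
import sqlite3

# Hypothetical in-memory table standing in for the real SQLite file.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE demo (column_name1 TEXT, column_name2 INTEGER)")
con.executemany("INSERT INTO demo VALUES (?, ?)", [("a", 1), ("b", 2)])

# Single quotes make string literals: every row repeats the literal text.
literal_rows = con.execute(
    "SELECT 'column_name1', 'column_name2' FROM demo"
).fetchall()

# Double quotes make identifiers: the actual column values come back.
data_rows = con.execute(
    'SELECT "column_name1", "column_name2" FROM demo'
).fetchall()

print(literal_rows)
print(data_rows)
con.close()
```

If that is what is happening here, switching the column names in the f-string to double quotes (or dropping the quotes) should be enough.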

Related

How can I count all NULL values, without column names, using SQL?

I'm reading and executing sql queries from file and I need to inspect the result sets to count all the null values across all columns. Because the SQL is read from file, I don't know the column names and thus can't call the columns by name when trying to find the null values.
I think using a CTE is the best way to do this, but how can I reference the columns when I don't know what the column names are?
WITH query_results AS
(
<sql_read_from_file_here>
)
select count_if(<column_name> is not null) FROM query_results
If you are using Python to read the file of SQL statements, you can do something like this which uses pglast to parse the SQL query to get the columns for you:
import pglast
sql_read_from_file_here = "SELECT 1 foo, 1 bar"
ast = pglast.parse_sql(sql_read_from_file_here)
cols = ast[0]['RawStmt']['stmt']['SelectStmt']['targetList']
sum_sql = "sum(iff({col} is null,1,0))"
sums = [sum_sql.format(col = col['ResTarget']['name']) for col in cols]
print(f"select {' + '.join(sums)} total_null_count from query_results")
# outputs: select sum(iff(foo is null,1,0)) + sum(iff(bar is null,1,0)) total_null_count from query_results
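If parsing the SQL is not an option, another route is to let the database report the result column names through the DB-API cursor.description attribute and build the same kind of statement from them. The following is only a sketch under assumptions: it uses sqlite3 as a stand-in connection, swaps iff for a portable CASE expression, and the helper name count_nulls is made up for illustration.

```python
import sqlite3


def count_nulls(con, query):
    """Count NULLs across all columns of `query` without knowing the names up front."""
    # Executing the query once exposes the result column names via the cursor.
    cur = con.execute(query)
    cols = [d[0] for d in cur.description]
    # Quote each name as an identifier, doubling any embedded double quotes.
    quoted = ['"' + c.replace('"', '""') + '"' for c in cols]
    sums = [f"SUM(CASE WHEN {q} IS NULL THEN 1 ELSE 0 END)" for q in quoted]
    sql = (
        f"WITH query_results AS ({query}) "
        f"SELECT {' + '.join(sums)} AS total_null_count FROM query_results"
    )
    return con.execute(sql).fetchone()[0]


con = sqlite3.connect(":memory:")
sample_query = "SELECT 1 AS foo, NULL AS bar UNION ALL SELECT NULL, NULL"
total = count_nulls(con, sample_query)
print(total)
```

The query runs twice in this sketch (once for the names, once for the counts), which may matter for expensive statements.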

how to refer values based on column names in python

I am trying to extract and read the data from a SQL query.
Below is the sample data from SQL developer:
target_name expected_instances environment system_name hostname
--------------------------------------------------------------------------------------
ORAUAT_host1 1 UAT ORAUAT_host1_sys host1.sample.net
ORAUAT_host2 1 UAT ORAUAT_host1_sys host2.sample.net
Normally I pass the system_name to the query (which has a bind variable for system_name) and get the data as a list, but not the column names.
Is there a way in Python to retrieve the data along with the column names, and to reference values by column name, so that target_name[0] gives the value ORAUAT_host1? Please suggest. Thanks.
If what you want is to get the column names from the table you are querying, you can do something like this:
My example writes the result to a CSV file:
import csv
import cx_Oracle
db = cx_Oracle.connect('user/pass@host:1521/service_name')
SQL = "select * from dual"
print(SQL)
cursor = db.cursor()
# raw string so the backslash in the Windows path is not read as an escape
f = open(r"C:\dual.csv", "w")
writer = csv.writer(f, lineterminator="\n", quoting=csv.QUOTE_NONNUMERIC)
r = cursor.execute(SQL)
#this takes the column names
col_names = [row[0] for row in cursor.description]
writer.writerow(col_names)
for row in cursor:
    writer.writerow(row)
f.close()
The column names come from the description attribute of the cursor object:
Cursor.description
This read-only attribute is a sequence of 7-item sequences. Each of
these sequences contains information describing one result column:
(name, type, display_size, internal_size, precision, scale, null_ok).
This attribute will be None for operations that do not return rows or
if the cursor has not had an operation invoked via the execute()
method yet.
The type will be one of the database type constants defined at the
module level.
https://cx-oracle.readthedocs.io/en/latest/api_manual/cursor.html#
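To reference values by column name, as the question asks, the names from cursor.description can be zipped with each row. This is a minimal sketch, assuming any DB-API cursor; sqlite3 stands in for cx_Oracle so it runs without a database server, and the table targets and the helper fetch_as_dicts are made up for illustration. Oracle typically reports unquoted column names in upper case, so the key there would likely be TARGET_NAME.

```python
import sqlite3


def fetch_as_dicts(cursor):
    """Return all remaining rows as dicts keyed by column name."""
    col_names = [d[0] for d in cursor.description]
    return [dict(zip(col_names, row)) for row in cursor.fetchall()]


# sqlite3 stands in for cx_Oracle here; both expose cursor.description.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE targets (target_name TEXT, environment TEXT, hostname TEXT)")
con.executemany(
    "INSERT INTO targets VALUES (?, ?, ?)",
    [
        ("ORAUAT_host1", "UAT", "host1.sample.net"),
        ("ORAUAT_host2", "UAT", "host2.sample.net"),
    ],
)
cur = con.execute("SELECT * FROM targets WHERE environment = ?", ("UAT",))
rows = fetch_as_dicts(cur)

# Column-wise view, so that columns["target_name"][0] works as in the question.
columns = {name: [r[name] for r in rows] for name in rows[0]}
print(rows[0]["target_name"], columns["target_name"][0])
```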

column names are coming as 0,1,2,3 while executing SQL Query via python in snowflake using snowflake connector

I am executing a SQL query from a Python script to retrieve data from Snowflake on Windows 10, but the result is missing the column names; they are replaced by 0, 1, 2, 3 and so on. Executing the same query in the Snowflake interface and downloading the CSV does give the column names in the file. I am passing the column names as aliases in my query.
Below is the code:
def _CONSUMPTION(con):
    data2 = con.cursor().execute("""select sd.sales_force_lvl_1_code "Plan-To Code",sd.sales_force_lvl_1_desc "Plan-To Description",pd.matl_code "Product Code",pd.matl_desc "Product Description",pd.ean_upc_code "UPC",dd.fiscal_week_desc "Fiscal Week Description",f.unit_sales_qty "Sales Units",f.incr_units_qty "Incremental Units"
from DW.consumption_fact1 f, DW.market_dim md, DW.matl_dim pd, DW.fiscal_week_dim dd, (select sales_force_lvl_1_code,max(sales_force_lvl_1_desc) sales_force_lvl_1_desc from DW.mv_us_sales_force_dim group by sales_force_lvl_1_code) sd
where dd.fiscal_week_key = f.fiscal_week_key
and pd.matl_key = f.matl_key
and md.market_key = f.market_key
and sd.sales_force_lvl_1_code = md.curr_sales_force_lvl_1_code
and dd.fiscal_week_key between (select curr_fy_week_key-6 from DW.curr_date_lkp) and (select curr_fy_week_key-1 from DW.curr_date_lkp)
and f.company_key = 6006
and (f.unit_sales_qty <> 0 and f.sales_amt <> 0)
and md.curr_sales_force_lvl_1_code is not null
UNION
select '5000016240' "Plan-To Code", 'AWG TOTAL' "Plan-To Description",pd.matl_code "Product Code",pd.matl_desc "Product Description",pd.ean_upc_code "UPC",dd.fiscal_week_desc "Fiscal Week Description",f.unit_sales_qty "Sales Units",f.incr_units_qty "Incremental Units"
from DW.consumption_fact1 f, DW.market_dim md, DW.matl_dim pd, DW.fiscal_week_dim dd
where dd.fiscal_week_key = f.fiscal_week_key
and pd.matl_key = f.matl_key
and md.market_key = f.market_key
and dd.fiscal_week_key between (select curr_fy_week_key-6 from DW.curr_date_lkp) and (select curr_fy_week_key-1 from DW.curr_date_lkp)
and f.company_key = 6006
and (f.unit_sales_qty <> 0 and f.sales_amt <> 0)
and md.market_code = '20267'""").fetchall()
    df = pd.DataFrame(data2)
    df.head(5)
    df.to_csv('CONSUMPTION.csv',index = False)
Looking at the docs, the easiest way seems to be the cursor method .fetch_pandas_all():
query = "SELECT 1 a, 2 b, 'a' c UNION ALL SELECT 7,4,'snow'"
cur = connection.cursor()
cur.execute(query).fetch_pandas_all()
Or if you want to dump the results into a CSV, just do so as in the question:
query = "SELECT 1 a, 2 b, 'a' c UNION ALL SELECT 7,4,'snow'"
cur = connection.cursor()
df = cur.execute(query).fetch_pandas_all()
df.to_csv('x.csv', index = False)
It looks like you haven't set the column names on the DataFrame in your code.
My recommendation is to assign them first via df.columns.
In addition, refer to the Snowflake page for details:
https://docs.snowflake.com/en/user-guide/python-connector-pandas.html
Try this
import pandas as pd
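def fetch_pandas_old(cur, sql):
    cur.execute(sql)
    rows = 0
    while True:
        dat = cur.fetchmany(50000)
        if not dat:
            break
        # cur.description holds one metadata tuple per column; the name is its first item
        df = pd.DataFrame(dat, columns=[col[0] for col in cur.description])
        rows += df.shape[0]
    print(rows)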
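The same fetchmany pattern can be wrapped in a generator that hands back the column names together with each batch, so that every batch can be turned into a labelled DataFrame or written out with a header. This is a sketch rather than Snowflake-specific code: sqlite3 stands in for the Snowflake cursor, and the helper name iter_batches and the table t are made up for illustration.

```python
import sqlite3


def iter_batches(cur, sql, batch_size=50000):
    """Yield (column_names, rows) for each fetchmany() batch of a query."""
    cur.execute(sql)
    col_names = [d[0] for d in cur.description]
    while True:
        rows = cur.fetchmany(batch_size)
        if not rows:
            break
        yield col_names, rows


# sqlite3 stands in for the Snowflake connection in this sketch.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (a INTEGER, b TEXT)")
con.executemany("INSERT INTO t VALUES (?, ?)", [(i, str(i)) for i in range(5)])

batch_sizes = []
names_seen = None
for names, rows in iter_batches(con.cursor(), "SELECT a, b FROM t", batch_size=2):
    names_seen = names
    batch_sizes.append(len(rows))
print(names_seen, batch_sizes)
```

Each (names, rows) pair could then be passed to pd.DataFrame(rows, columns=names).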
A nice way to extract the column headings from the cursor description and save in a pandas df using the Snowflake connector (also works for psycopg2 btw) is as follows:
import pandas as pd
import snowflake.connector
#Create the connection
def connect_snowflake(uname, pword, acct, role_name, whouse, dbase, schema_name):
    conn = snowflake.connector.connect(
        user=uname,
        password=pword,
        account=acct,
        role=role_name,
        warehouse=whouse,
        database=dbase,
        schema=schema_name
    )
    cur = conn.cursor()
    return conn, cur
Then execute your query. cur.description returns a list of tuples, the first element of each being the column name :)
conn, cur = connect_snowflake(username, password, account_name, role, warehouse, database, schema)
cur.execute('select * from my_schema.my_table')
result = cur.fetchall()
# Extract the column names
col_names = []
for elt in cur.description:
    col_names.append(elt[0])
df = pd.DataFrame(result, columns=col_names)
cur.close()
conn.close()

find tables with specific columns' names in a database on databricks by pyspark

I would like to find tables with a specific column in a database on Databricks using PySpark SQL.
I tried the code from the following article, but it does not work:
https://medium.com/@rajnishkumargarg/find-all-the-tables-by-column-name-in-hive-51caebb94832
On SQL Server, my code is:
SELECT Table_Name, Column_Name
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_CATALOG = 'YOUR_DATABASE'
AND COLUMN_NAME LIKE '%YOUR_COLUMN%'
But I cannot find out how to do the same thing in PySpark SQL.
Thanks.
The SparkSession has a property catalog. This catalog's method listTables returns a list of all tables known to the SparkSession. With this list you can query all columns for each table with listColumns:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("test").getOrCreate()
spark.sql("CREATE TABLE tab1 (name STRING, age INT) USING parquet")
spark.sql("CREATE TABLE tab2 (name STRING, age INT) USING parquet")
spark.sql("CREATE TABLE tab3 (street STRING, age INT) USING parquet")
for table in spark.catalog.listTables():
    for column in spark.catalog.listColumns(table.name):
        if column.name == 'name':
            print('Found column {} in table {}'.format(column.name, table.name))
prints
Found column name in table tab1
Found column name in table tab2
Both methods, listTables and listColumns accept a database name as an optional argument if you want to restrict your search to a single database.
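The loop above matches one exact column name. To get closer to the LIKE '%YOUR_COLUMN%' search from the question, the comparison could be swapped for a wildcard match. The sketch below assumes only the listTables(dbName) and listColumns(tableName, dbName) methods of spark.catalog; the function find_columns and the FakeCatalog class are hypothetical, the latter existing only so the example runs without a Spark installation.

```python
from fnmatch import fnmatchcase
from types import SimpleNamespace


def find_columns(catalog, pattern, db_name=None):
    """Return (table, column) pairs whose column name matches a shell-style pattern."""
    matches = []
    for table in catalog.listTables(db_name):
        for column in catalog.listColumns(table.name, db_name):
            # Lower-case both sides for a case-insensitive wildcard match.
            if fnmatchcase(column.name.lower(), pattern.lower()):
                matches.append((table.name, column.name))
    return matches


class FakeCatalog:
    """Hypothetical stand-in for spark.catalog so the sketch runs without Spark."""

    def __init__(self, tables):
        self._tables = tables

    def listTables(self, dbName=None):
        return [SimpleNamespace(name=t) for t in self._tables]

    def listColumns(self, tableName, dbName=None):
        return [SimpleNamespace(name=c) for c in self._tables[tableName]]


catalog = FakeCatalog(
    {"tab1": ["name", "age"], "tab2": ["name", "age"], "tab3": ["street", "age"]}
)
found = find_columns(catalog, "*nam*")
print(found)
```

With a real session, the call would presumably be find_columns(spark.catalog, "*nam*", "my_db"), where my_db is a placeholder database name.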
I had a similar problem to OP, I needed to find all columns - including nested columns - that match a LIKE clause.
I wrote a post about it here https://medium.com/helmes-people/how-to-view-all-databases-tables-and-columns-in-databricks-9683b12fee10
But you can find the full code below.
The benefit of this solution, in comparison with the previous answers, is that it works in case you need to search columns with LIKE '%%', as written by OP. Also, it allows you to search for name in nested fields. Finally, it creates a SQL like view, similar to INFORMATION_SCHEMA views.
from pyspark.sql.types import StructType
# get field name from schema (recursive for getting nested values)
def get_schema_field_name(field, parent=None):
    if isinstance(field.dataType, StructType):
        if parent is None:
            prt = field.name
        else:
            prt = parent+"."+field.name # using dot notation
        res = []
        for i in field.dataType.fields:
            res.append(get_schema_field_name(i, prt))
        return res
    else:
        if parent is None:
            res = field.name
        else:
            res = parent+"."+field.name
        return res
# flatten list, from https://stackoverflow.com/a/12472564/4920394
def flatten(S):
    if S == []:
        return S
    if isinstance(S[0], list):
        return flatten(S[0]) + flatten(S[1:])
    return S[:1] + flatten(S[1:])
# list of databases
db_list = [x[0] for x in spark.sql("SHOW DATABASES").rdd.collect()]
for i in db_list:
    spark.sql("SHOW TABLES IN {}".format(i)).createOrReplaceTempView(str(i)+"TablesList")
# create a query for fetching all tables from all databases
union_string = "SELECT database, tableName FROM "
for idx, item in enumerate(db_list):
    if idx == 0:
        union_string += str(item)+"TablesList WHERE isTemporary = 'false'"
    else:
        union_string += " UNION ALL SELECT database, tableName FROM {}".format(str(item)+"TablesList WHERE isTemporary = 'false'")
spark.sql(union_string).createOrReplaceTempView("allTables")
# full list = schema, table, column
full_list = []
for i in spark.sql("SELECT * FROM allTables").collect():
    table_name = i[0]+"."+i[1]
    table_schema = spark.sql("SELECT * FROM {}".format(table_name))
    column_list = []
    for j in table_schema.schema:
        column_list.append(get_schema_field_name(j))
    column_list = flatten(column_list)
    for k in column_list:
        full_list.append([i[0],i[1],k])
spark.createDataFrame(full_list, schema = ['database', 'tableName', 'columnName']).createOrReplaceTempView("allColumns")
# The following code will create a TempView containing all the tables,
# and all their columns along with their type, for a specified database
cls = []
spark.sql("Drop view if exists allTables")
spark.sql("Drop view if exists allColumns")
for table in spark.catalog.listTables("TYPE_IN_YOUR_DB_NAME_HERE"):
    for column in spark.catalog.listColumns(table.name, table.database):
        cls.append([table.database, table.name, column.name, column.dataType])
spark.createDataFrame(cls, schema = ['databaseName','tableName','columnName',
                                     'columnDataType']).createOrReplaceTempView("allColumns")
SparkSession does have a catalog property, as werner mentioned.
If I understand you correctly, you want to get the tables that have a specific column.
You can try this code (sorry, it is Scala rather than Python):
val databases = spark.catalog.listDatabases().select($"name".as("db_name")).as("databases")
val tables = spark.catalog.listTables().select($"name".as("table_name"), $"database").as("tables")
val tablesWithDatabase = databases.join(tables, $"databases.db_name" === $"tables.database", "inner").collect()
tablesWithDatabase.foreach(row => {
  val dbName = row.get(0).asInstanceOf[String]
  val tableName = row.get(1).asInstanceOf[String]
  val columns = spark.catalog.listColumns(dbName, tableName)
  columns.foreach(column => {
    if (column.name == "Your column")
      // Do your logic here
      null
  })
})
Note that I call collect, so a large number of tables or databases can cause an OOM error. I collect because, unlike listTables and listDatabases, which can be called without any arguments, listColumns needs both dbName and tableName, and its result carries no identifier that links a column back to its table.
So the column search is done locally on the driver.
Hope that helps.

READ TABLE with dynamic key fields?

I have the name of a table DATA lv_tablename TYPE tabname VALUE 'xxxxx', and a generic FIELD-SYMBOLS: <lt_table> TYPE ANY TABLE. which contains entries selected from that corresponding table.
I've defined my line structure FIELD-SYMBOLS: <ls_line> TYPE ANY. which I'd use for reading from the table.
Is there a way to create a READ statement on <lt_table> fully specifying the key fields?
I am aware of the statement / addition READ TABLE xxxx WITH KEY (lv_field_name) = 'asdf'., but this however wouldn't work (afaik) for a dynamic number of key fields, and I wouldn't like to create a large number of READ TABLE statements with an increasing number of key field specifications.
Can this be done?
Actually, I found this to work:
DATA lt_bseg TYPE TABLE OF bseg.
DATA ls_bseg TYPE bseg.
DATA lv_string1 TYPE string.
DATA lv_string2 TYPE string.
lv_string1 = ` `.
lv_string2 = lv_string1.
SELECT whatever FROM wherever INTO TABLE lt_bseg.
READ TABLE lt_bseg INTO ls_bseg
WITH KEY ('MANDT') = 800
(' ') = ''
('BUKRS') = '0005'
('BELNR') = '0100000000'
('GJAHR') = 2005
('BUZEI') = '002'
('') = ''
(' ') = ''
(' ') = ' '
(lv_string1) = '1'
(lv_string2) = ''.
By using this syntax one can just specify as many key fields as required. If some fields will be empty, then these will just get ignored, even if values are specified for these empty fields.
One must pay attention that using this exact syntax (static definitions), 2 fields with the exact same name (even blank names) will not be allowed.
As shown with the variables lv_string1 and lv_string2, at run-time this is no problem.
And lastly, one can specify the fields in any order (I don't know what performance benefits or penalties one might get while using this syntax).
There seems to be a possibility (like a dynamic SELECT statement with binding and lt_dynwhere).
Please refer to this post, where someone asked about the same requirement:
http://scn.sap.com/thread/1789520
3 ways:
1. READ TABLE itab WITH [TABLE] KEY (comp1) = value1 (comp2) = value2 ...
   You can define a dynamic number of key fields by statically coding the maximum number of key fields, and at runtime passing empty key field names for the ones that are not needed.
2. LOOP AT itab WHERE (where) (see Addition 4 "WHERE (cond_syntax)")
   Available since ABAP 7.02.
3. SELECT ... FROM @itab WHERE (where) ...
   Available since ABAP 7.52. It may be slow if the condition is complex and cannot be handled by the ABAP kernel, i.e. it needs to be executed by the database. In that case, only few databases are supported (I think only HANA is supported currently).
Examples (ASSERT statements are used here to prove that the conditions are true, otherwise the program would fail):
TYPES: BEGIN OF ty_table_line,
key_name_1 TYPE i,
key_name_2 TYPE i,
attr TYPE c LENGTH 1,
END OF ty_table_line,
ty_internal_table TYPE SORTED TABLE OF ty_table_line WITH UNIQUE KEY key_name_1 key_name_2.
DATA(itab) = VALUE ty_internal_table( ( key_name_1 = 1 key_name_2 = 1 attr = 'A' )
( key_name_1 = 1 key_name_2 = 2 attr = 'B' ) ).
"------------------ READ TABLE
DATA(key_name_1) = 'KEY_NAME_1'.
DATA(key_name_2) = 'KEY_NAME_2'.
READ TABLE itab WITH TABLE KEY
(key_name_1) = 1
(key_name_2) = 2
ASSIGNING FIELD-SYMBOL(<line>).
ASSERT <line> = VALUE ty_table_line( key_name_1 = 1 key_name_2 = 2 attr = 'B' ).
key_name_2 = ''. " ignore this key field
READ TABLE itab WITH TABLE KEY
(key_name_1) = 1
(key_name_2) = 2 "<=== will be ignored
ASSIGNING FIELD-SYMBOL(<line_2>).
ASSERT <line_2> = VALUE ty_table_line( key_name_1 = 1 key_name_2 = 1 attr = 'A' ).
"------------------ LOOP AT
DATA(where) = 'key_name_1 = 1 and key_name_2 = 1'.
LOOP AT itab ASSIGNING FIELD-SYMBOL(<line_3>)
WHERE (where).
EXIT.
ENDLOOP.
ASSERT <line_3> = VALUE ty_table_line( key_name_1 = 1 key_name_2 = 1 attr = 'A' ).
"---------------- SELECT ... FROM @itab
SELECT SINGLE * FROM @itab WHERE (where) INTO @DATA(line_3).
ASSERT line_3 = VALUE ty_table_line( key_name_1 = 1 key_name_2 = 1 attr = 'A' ).