Find tables with specific column names in a database on Databricks with PySpark SQL

I would like to find tables that contain a specific column in a database on Databricks, using PySpark SQL.
I tried the approach from the following article, but it does not work:
https://medium.com/#rajnishkumargarg/find-all-the-tables-by-column-name-in-hive-51caebb94832
On SQL Server my code is:
SELECT Table_Name, Column_Name
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_CATALOG = 'YOUR_DATABASE'
AND COLUMN_NAME LIKE '%YOUR_COLUMN%'
but I cannot figure out how to do the same thing with PySpark SQL.
Thanks.

The SparkSession has a catalog property. The catalog's listTables method returns a list of all tables known to the SparkSession. With this list you can query the columns of each table with listColumns:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("test").getOrCreate()
spark.sql("CREATE TABLE tab1 (name STRING, age INT) USING parquet")
spark.sql("CREATE TABLE tab2 (name STRING, age INT) USING parquet")
spark.sql("CREATE TABLE tab3 (street STRING, age INT) USING parquet")

for table in spark.catalog.listTables():
    for column in spark.catalog.listColumns(table.name):
        if column.name == 'name':
            print('Found column {} in table {}'.format(column.name, table.name))
prints
Found column name in table tab1
Found column name in table tab2
Both methods, listTables and listColumns, accept a database name as an optional argument if you want to restrict your search to a single database.
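For example, a minimal sketch that limits the search above to a single database (the database name "my_db" and the column 'name' are placeholders):

# Restrict the search to one database; "my_db" is a placeholder name.
for table in spark.catalog.listTables("my_db"):
    for column in spark.catalog.listColumns(table.name, dbName="my_db"):
        if column.name == 'name':
            print('Found column {} in table {}'.format(column.name, table.name))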

I had a similar problem to the OP: I needed to find all columns, including nested columns, that match a LIKE clause.
I wrote a post about it here https://medium.com/helmes-people/how-to-view-all-databases-tables-and-columns-in-databricks-9683b12fee10
But you can find the full code below.
The benefit of this solution, compared with the previous answers, is that it works when you need to search for columns with LIKE '%%', as the OP wrote. It also allows you to search for a name in nested fields. Finally, it creates a SQL-like view, similar to the INFORMATION_SCHEMA views.
from pyspark.sql.types import StructType

# get field name from schema (recursive for getting nested values)
def get_schema_field_name(field, parent=None):
    if type(field.dataType) == StructType:
        if parent == None:
            prt = field.name
        else:
            prt = parent + "." + field.name  # using dot notation
        res = []
        for i in field.dataType.fields:
            res.append(get_schema_field_name(i, prt))
        return res
    else:
        if parent == None:
            res = field.name
        else:
            res = parent + "." + field.name
        return res

# flatten list, from https://stackoverflow.com/a/12472564/4920394
def flatten(S):
    if S == []:
        return S
    if isinstance(S[0], list):
        return flatten(S[0]) + flatten(S[1:])
    return S[:1] + flatten(S[1:])

# list of databases
db_list = [x[0] for x in spark.sql("SHOW DATABASES").rdd.collect()]
for i in db_list:
    spark.sql("SHOW TABLES IN {}".format(i)).createOrReplaceTempView(str(i) + "TablesList")

# create a query for fetching all tables from all databases
union_string = "SELECT database, tableName FROM "
for idx, item in enumerate(db_list):
    if idx == 0:
        union_string += str(item) + "TablesList WHERE isTemporary = 'false'"
    else:
        union_string += " UNION ALL SELECT database, tableName FROM {}".format(str(item) + "TablesList WHERE isTemporary = 'false'")
spark.sql(union_string).createOrReplaceTempView("allTables")

# full list = schema, table, column
full_list = []
for i in spark.sql("SELECT * FROM allTables").collect():
    table_name = i[0] + "." + i[1]
    table_schema = spark.sql("SELECT * FROM {}".format(table_name))
    column_list = []
    for j in table_schema.schema:
        column_list.append(get_schema_field_name(j))
    column_list = flatten(column_list)
    for k in column_list:
        full_list.append([i[0], i[1], k])
spark.createDataFrame(full_list, schema=['database', 'tableName', 'columnName']).createOrReplaceTempView("allColumns")
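Once the allColumns view exists, the LIKE search from the question becomes a plain SQL query. A minimal sketch (the pattern '%YOUR_COLUMN%' is a placeholder):

# Query the view built above; '%YOUR_COLUMN%' is a placeholder pattern.
spark.sql("""
    SELECT database, tableName, columnName
    FROM allColumns
    WHERE columnName LIKE '%YOUR_COLUMN%'
""").show(truncate=False)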

# The following code will create a TempView containing all the tables,
# and all their columns along with their type, for a specified database
cls = []
spark.sql("DROP VIEW IF EXISTS allTables")
spark.sql("DROP VIEW IF EXISTS allColumns")
for table in spark.catalog.listTables("TYPE_IN_YOUR_DB_NAME_HERE"):
    for column in spark.catalog.listColumns(table.name, table.database):
        cls.append([table.database, table.name, column.name, column.dataType])
spark.createDataFrame(cls, schema=['databaseName', 'tableName', 'columnName',
                                   'columnDataType']).createOrReplaceTempView("allColumns")
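With that TempView in place, finding every table that contains a given column is a simple query; a quick sketch with a placeholder column name:

# 'name' is a placeholder; replace it with the column you are looking for.
spark.sql("SELECT databaseName, tableName, columnDataType FROM allColumns WHERE columnName = 'name'").show()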

SparkSession really does have a catalog property, as werner mentioned.
If I understand you correctly, you want to get the tables that have a specific column.
You can try this code (sorry for the Scala code instead of Python):
val databases = spark.catalog.listDatabases().select($"name".as("db_name")).as("databases")
val tables = spark.catalog.listTables().select($"name".as("table_name"), $"database").as("tables")
val tablesWithDatabase = databases.join(tables, $"databases.db_name" === $"tables.database", "inner").collect()
tablesWithDatabase.foreach(row => {
  val dbName = row.get(0).asInstanceOf[String]
  val tableName = row.get(1).asInstanceOf[String]
  val columns = spark.catalog.listColumns(dbName, tableName)
  columns.foreach(column => {
    if (column.name == "Your column")
      // Do your logic here
      null
  })
})
Note that I am calling collect, so if you have a lot of tables/databases it can cause an OOM error. The reason I'm calling collect is that, in contrast to listTables or listDatabases, which can be called without any arguments, listColumns needs a dbName and tableName, and there is no unique column id that maps back to a table.
So the search for the column is done locally on the driver.
Hope that was helpful.
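Since the question asks for PySpark, here is a rough Python sketch of the same idea ('your_column' is a placeholder); like the Scala version, it iterates on the driver, so be careful with very large catalogs:

# Rough PySpark equivalent of the Scala snippet above; 'your_column' is a placeholder.
for db in spark.catalog.listDatabases():
    for table in spark.catalog.listTables(db.name):
        for column in spark.catalog.listColumns(table.name, db.name):
            if column.name == 'your_column':
                print('Found {} in {}.{}'.format(column.name, db.name, table.name))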

Related

How to dynamically build a select list from an API payload using PyPika

I have a JSON API payload containing tablename and columnlist. How do I build a SELECT query from it using PyPika?
So far I have been able to use a plain string columnlist, but I am not able to do more advanced querying using functions, analytics, etc.
from pypika import Table, Query, functions as fn

def generate_sql(tablename, collist):
    table = Table(tablename)
    columns = [str(table) + '.' + each for each in collist]
    q = Query.from_(table).select(*columns)
    return q.get_sql(quote_char=None)

tablename = 'customers'
collist = ['id', 'fname', 'fn.Sum(revenue)']
print(generate_sql(tablename, collist))  #1

table = Table(tablename)
q = Query.from_(table).select(table.id, table.fname, fn.Sum(table.revenue))
print(q.get_sql(quote_char=None))  #2
#1 outputs
SELECT "customers".id,"customers".fname,"customers".fn.Sum(revenue) FROM customers
#2 outputs correctly
SELECT id,fname,SUM(revenue) FROM customers
You should not be trying to assemble the query string yourself; that defeats the whole purpose of PyPika.
In your case, where the table name and the columns arrive as text in a JSON object, you can use * to unpack the values from collist and use the obj[key] syntax to look up a table attribute by name from a string.
q = Query.from_(table).select(*(table[col] for col in collist))
# SELECT id,fname,fn.Sum(revenue) FROM customers
Hmm... that doesn't quite work for the fn.Sum(revenue). The goal is to get SUM(revenue).
This can get much more complicated from this point. If you are only sending column names that you know to belong to that table, the above solution is enough.
But if you have complex SQL expressions that reference SQL functions or even other tables, I suggest you rethink your decision to send that as JSON. You might end up writing something as complex as PyPika itself, like a custom parser or whatever. In that case your better option is to change the format of your JSON response object.
If you know you only need to support a very limited set of capabilities, it could be feasible. For example, you can assume the following constraints:
all column names refer to only one table, no joins or alias
all functions will be prefixed by fn.
no fancy stuff like window functions, distinct, count(*)...
Then you can do something like:
from pypika import Table, Query, functions as fn
import re

tablename = 'customers'
collist = ['id', 'fname', 'fn.Sum(revenue / 2)', 'revenue % fn.Count(id)']

def parsed(cols):
    pattern = r'(?:\bfn\.[a-zA-Z]\w*)|([a-zA-Z]\w*)'
    subst = lambda m: f"{'' if m.group().startswith('fn.') else 'table.'}{m.group()}"
    yield from (re.sub(pattern, subst, col) for col in cols)

table = Table(tablename)
env = dict(table=table, fn=fn)
q = Query.from_(table).select(*(eval(col, env) for col in parsed(collist)))
print(q.get_sql(quote_char=None))  #2
Output:
SELECT id,fname,SUM(revenue/2),MOD(revenue,COUNT(id)) FROM customers

Push a SQL query to a server from JDBC connection which reads from multiple databases within that server

I'm pushing a query down to a server to read data into Databricks as below:
val jdbcUsername = dbutils.secrets.get(scope = "", key = "")
val jdbcPassword = dbutils.secrets.get(scope = "", key = "")
Class.forName("com.microsoft.sqlserver.jdbc.SQLServerDriver")
val jdbcHostname = ""
val jdbcPort = ...
val jdbcDatabase = ""
// Create the JDBC URL without passing in the user and password parameters.
val jdbcUrl = s"jdbc:sqlserver://${jdbcHostname}:${jdbcPort};database=${jdbcDatabase}"
// Create a Properties() object to hold the parameters.
import java.util.Properties
val connectionProperties = new Properties()
connectionProperties.put("user", s"${jdbcUsername}")
connectionProperties.put("password", s"${jdbcPassword}")
val driverClass = "com.microsoft.sqlserver.jdbc.SQLServerDriver"
connectionProperties.setProperty("Driver", driverClass)
// define a query to be passed to database to display the tables available for a given DB
val query_results = "(SELECT * FROM INFORMATION_SCHEMA.TABLES) as tables"
// push the query down to the server to retrieve the list of available tables
val table_names = spark.read.jdbc(jdbcUrl, query_results, connectionProperties)
table_names.createOrReplaceTempView("table_names")
Running display(table_names) provides a list of tables for the defined database. That is no issue; however, when trying to read and join tables from multiple databases on the same server I haven't yet found a solution that works.
An example would be:
// define a query to be passed to the database to display a result across many tables
val report1_query = "(SELECT a.Field1, b.Field2 FROM database_1 as a left join database_2 as b on a.Field4 == b.Field8) as report1"
// push the query down to the server to retrieve the query result
val report1_results = spark.read.jdbc(jdbcUrl, report1_query, connectionProperties)
report1_results.createOrReplaceTempView("report1_results")
Any pointers on restructuring this code would be appreciated (an equivalent in Python would also be super helpful).
SQL Server uses 3-part naming like database.schema.table. This example comes from the SQL Server information_schema docs:
SELECT TABLE_CATALOG, TABLE_SCHEMA, TABLE_NAME, COLUMN_NAME, COLUMN_DEFAULT
FROM AdventureWorks2012.INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_NAME = N'Product';
To query across databases you need to specify all 3 parts in the query being pushed down to SQL Server.
SELECT a.Field1, b.Field2
FROM database_1.schema_1.table_1 AS a
LEFT JOIN database_2.schema_2.table_2 AS b
  ON a.Field4 = b.Field8
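Since you asked about a Python equivalent, a hedged PySpark sketch of the same pushdown might look like this; the secret scope/key, host, port, and the three-part table names are all placeholders:

# Placeholders throughout: secret scope/key, host, port, database/schema/table names.
jdbcUsername = dbutils.secrets.get(scope="", key="")
jdbcPassword = dbutils.secrets.get(scope="", key="")
jdbcUrl = "jdbc:sqlserver://{0}:{1};database={2}".format("your_host", 1433, "database_1")

report1_query = """(SELECT a.Field1, b.Field2
                    FROM database_1.schema_1.table_1 AS a
                    LEFT JOIN database_2.schema_2.table_2 AS b
                      ON a.Field4 = b.Field8) AS report1"""

report1_results = (spark.read.format("jdbc")
                   .option("url", jdbcUrl)
                   .option("dbtable", report1_query)
                   .option("user", jdbcUsername)
                   .option("password", jdbcPassword)
                   .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
                   .load())
report1_results.createOrReplaceTempView("report1_results")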

Querying DynamoDB table using Global Secondary Indexes

I was trying to query a DynamoDB table using a Lambda function.
My table's partition key is id. I am trying to query it on another key named dipl_idpp. I understood that this is not possible.
I found a solution: I need to create a Global Secondary Index pointing to the column that I want to query on (in my case dipl_idpp).
I did that on Dynamo. But when I execute my function I still have the same problem:
An error occurred (ValidationException) when calling the Query operation: Query condition missed key schema element: id', 'occurred at index 0')"
This is the code I use:
import boto3
import pandas as pd
from boto3.dynamodb.conditions import Key

def query_dipl_dynamo(key_table, valeur_query, name_table):
    dynamoDBResource = boto3.resource('dynamodb')
    table = dynamoDBResource.Table(name_table)
    response = table.query(
        KeyConditionExpression=Key(key_table).eq(valeur_query))
    df_fr = pd.DataFrame([response['Items']])
    if len(df_fr.columns) > 0:
        print("hellooo1")
        df = pd.DataFrame([response['Items'][0]])
        return valeur_query, df["dipl_libelle"].iloc[0]

df9_tmp["dipl_idpp"] = df8_tmp.apply(
    lambda x: query_dipl_dynamo("dipl_idpp", x["num_auto"], "ddb-dev-PS_LibreAcces_Dipl_AutExerc")[0],
    axis=1)
Should I change something else besides creating the index? There is too little documentation available.
Thank you!
I just found the solution. When we use indexes we must provide an argument named IndexName, which takes the name of the index in DynamoDB.
I had to change my code to:
def query_dipl_dynamo(key_table, valeur_query, name_table):
    dynamoDBResource = boto3.resource('dynamodb')
    table = dynamoDBResource.Table(name_table)
    response = table.query(
        IndexName="NameOfTheIndexInDynamoDB",
        KeyConditionExpression=Key(key_table).eq(valeur_query))
    df_fr = pd.DataFrame([response['Items']])
    if len(df_fr.columns) > 0:
        df = pd.DataFrame([response['Items'][0]])
        return valeur_query, df["dipl_libelle"].iloc[0]

df9_tmp["dipl_idpp"] = df8_tmp.apply(
    lambda x: query_dipl_dynamo("dipl_idpp", x["num_auto"], "ddb-dev-PS_LibreAcces_Dipl_AutExerc")[0],
    axis=1)

Python cx_oracle bind variable with a list of items

I have a query like this:
SELECT prodId, prod_name , prod_type FROM mytable WHERE prod_type in (:list_prod_names)
I want to get the information for a product; the possible types are "day", "week", "weekend", and "month". Depending on the date, it might be at least one of those options, or a combination of all of them.
This info (a list) is returned by the function prod_names(date_search).
I am using cx_oracle bindings with code like:
def get_prod_by_type(search_date: datetime):
    query_path = r'./queries/prod_by_name.sql'
    raw_query = open(query_path).read().strip().replace('\n', ' ').replace('\t', ' ').replace('  ', ' ')
    print(sql_read_op)
    # Depending on the date the product types may be different
    prod_names(search_date)  # This returns a list with possible names
    qry_params = {"list_prod_names": prod_names}  # See attempts below
    try:
        db = DB(username='username', password='pss', hostname="localhost")
        df = db.get(raw_query, qry_params)
    except Exception:
        exception_error = traceback.format_exc()
        exception_error = 'Exception on DB.get_short_cov_op2() : %s\n%s' % exception_error
        print(exception_error)
    return df
For this: qry_params = {"list_prod_names": prod_names} I have tried multiple different things such as:
prod_names = ''.join(prod_names)
prod_names = str(prod_names)
prod_names = "\'" + ''.join(prod_names) + "\'"
The only thing I have managed to get to work is:
new_query = raw_query.format(list_prod_names=prodnames_for_date(search_date)).replace('[', '').replace(']','')
df = db.query(new_query)
I am trying not to use .format(), because building SQL with .format() is bad practice (it opens the door to injection attacks).
db.py contains among other functions:
def get(self, sql, params={}):
    cur = self.con.cursor()
    cur.prepare(sql)
    try:
        cur.execute(sql, **params)
        df = pd.DataFrame(cur.fetchall(), columns=[c[0] for c in cur.description])
    except Exception:
        exception_error = traceback.format_exc()
        exception_error = 'Exception on DB.get() : %s\n%s' % exception_error
        print(exception_error)
        self.con.rollback()
    cur.close()
    df.columns = df.columns.map(lambda x: x.upper())
    return df
I would like to be able to do a type binding.
I am using:
python = 3.6
cx_oracle = 6.3.1
I have read the following articles but I am still unable to find a solution:
Python cx_Oracle bind variables
Python cx_Oracle SQL with bind string variable
Search for name in cx_Oracle
Unfortunately you cannot bind an array directly unless you convert it to a SQL type and use a subquery -- which is fairly complex. So instead you need to do something like this:
inClauseParts = []
for i, inValue in enumerate(ARRAY_VALUE):
    argName = "arg_" + str(i + 1)
    inClauseParts.append(":" + argName)
clause = "%s in (%s)" % (columnName, ",".join(inClauseParts))
This works fine, but be aware that if the number of elements in the array changes regularly, this technique will create a separate statement that must be parsed for each distinct number of elements. If you know that (in general) you won't have more than (for example) 10 elements in the array, it would be better to append None to the incoming array so that the number of elements is always 10.
Hopefully that is clear enough!
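Applied to the question's query, a minimal end-to-end sketch might look like the following; cur is assumed to be an open cx_Oracle cursor and prod_names the list of product types from the question:

# Build one named bind per list element, then pass the values as a dict.
# `cur` (open cx_Oracle cursor) and `prod_names` (list of types) are assumed to exist.
bind_names = ["name_" + str(i) for i in range(len(prod_names))]
sql = ("SELECT prodId, prod_name, prod_type FROM mytable "
       "WHERE prod_type IN (" + ",".join(":" + b for b in bind_names) + ")")
cur.execute(sql, dict(zip(bind_names, prod_names)))
rows = cur.fetchall()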
I have finally managed to do it. It might not be pretty, but it works.
I have modified my SQL query to include an extra select that returns the values of my list of descriptors:
inner join (
    SELECT regexp_substr(:my_list_of_items, '[^,]+', 1, LEVEL) as mylist
    FROM dual
    CONNECT BY LEVEL <= length(:my_list_of_items) - length(REPLACE(:my_list_of_items, ',', '')) + 1
) d
on d.mylist = a.corresponding_columns
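On the Python side the bind variable then simply receives the comma-joined list; a small sketch, assuming prod_names(search_date) returns the list as in the question:

# Join the list into one string for the :my_list_of_items bind variable above.
qry_params = {"my_list_of_items": ",".join(prod_names(search_date))}
df = db.get(raw_query, qry_params)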

How to get the partition column name of a Hive table

I'm developing a unix script where I'll be dealing with Hive tables partitioned by either column A or column B. I'd like to find out which column a table is partitioned on so that I can do subsequent operations on those partition instances.
Is there any property in Hive which returns the partition column directly?
I'm thinking I'll have to do a show create table and extract the partition name somehow if there isn't any other way possible.
Maybe not the best, but one more approach is to use the describe command.
Create table:
create table employee ( id int, name string ) PARTITIONED BY (city string);
Command:
hive -e 'describe formatted employee' | awk '/Partition/ {p=1}; p; /Detailed/ {p=0}'
Output:
# Partition Information
# col_name data_type comment
city string
You can improve it as per your needs.
One more option, which I didn't explore, is querying the metastore repository tables to get the partition column information for a table.
Through the Scala/Java API, we can get to the Hive metastore and fetch the partition column names:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.hive.conf.HiveConf
import org.apache.hadoop.hive.metastore.HiveMetaStoreClient

val conf = new Configuration()
conf.set("hive.metastore.uris", "thrift://hdppmgt02.domain.com:9083")
val hiveConf = new HiveConf(conf, classOf[HiveConf])
val metastoreClient = new HiveMetaStoreClient(hiveConf)
metastoreClient.getTable(db, tbl).getPartitionKeys.foreach(x => println("Keys : " + x))
# use python pyhive:
import hive_client

def get_partition_column(table_name):
    # hc = hive connection
    hc = hive_client.HiveClient()
    cur = hc.query("desc " + table_name)
    return cur[len(cur) - 1][0]

#################
# hive_client.py
from pyhive import hive

default_encoding = 'utf-8'
host_name = 'localhost'
port = 10000
database = "xxx"

class HiveClient:
    def __init__(self):
        self.conn = hive.Connection(host=host_name, port=port, username='hive', database=database)

    def query(self, sql):
        cursor = self.conn.cursor()
        cursor.execute(sql)
        return cursor.fetchall()

    def execute(self, sql):
        cursor = self.conn.cursor()
        cursor.execute(sql)

    def close(self):
        self.conn.close()
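Usage would then be something along these lines, reusing the employee table created earlier in this thread:

# Prints the partition column of the example table, i.e. 'city'.
print(get_partition_column("employee"))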
// client is a HiveMetaStoreClient, as in the Scala example above
List<String> parts = new ArrayList<>();
try {
    List<FieldSchema> partitionKeys = client.getTable(dbName, tableName).getPartitionKeys();
    for (FieldSchema partition : partitionKeys) {
        parts.add(partition.getName());
    }
} catch (Exception e) {
    throw new RuntimeException("Fail to get Hive partitions", e);
}