I am trying to query a database directly:
file_df.createOrReplaceTempView("file_contents")
QUERY = "SELECT * FROM TABLE1 INNER JOIN file_contents on TABLE1.ID = file_contents.ID"
df = sqlContext.read.format("jdbc").options(
    url=URL,
    driver=DRIVER,
    query=QUERY,
    user=USER,
    password=PASSWORD
).load()
TABLE1 is in the Oracle Database.
However, this code results in the following error:
py4j.protocol.Py4JJavaError: An error occurred while calling o343.load.
: java.sql.SQLSyntaxErrorException: ORA-00942: table or view does not exist
How can I fix this? In other words, I do not want to load the large database table into Spark; instead I want to push the query down and load only the rows that result from the inner join with the TempView file_contents.
You cannot do it this way: the temp view lives in Spark, not in Oracle, so both datasets have to be brought onto the same platform before they can be joined.
Option 1 - Preferred
spark = SparkSession.builder.getOrCreate()
jdbcUrl = "jdbc:oracle:thin:@{0}:{1}/{2}".format("asdas", "1521", "asdasd")
connectionProperties = {
    "user": "asdasd",
    "password": "asdasda",
    "driver": "oracle.jdbc.driver.OracleDriver",
    "fetchsize": "100000"
}
pushdown_query = "(SELECT * FROM TABLE1 ) aliasname"
df = spark.read.jdbc(url=jdbcUrl, table=pushdown_query, properties=connectionProperties)
file_df.createOrReplaceTempView("file_contents")
df.createOrReplaceTempView("table1")
result_df = spark.sql("SELECT * FROM TABLE1 INNER JOIN file_contents on TABLE1.ID = file_contents.ID")
Option 2
For this option you need a staging table on the Oracle side: write file_contents to Oracle first (say the table is called FILECONTENT), then extract only the required rows with a pushdown query (see the sketch after the write below).
file_df.write.format('jdbc').options(url=tgt_url, driver=tgtdriver, dbtable="FILECONTENT", user=tgt_username, password=tgt_password).mode("overwrite").option("truncate", "true").save()
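Once file_contents has been staged in Oracle, you can push the join down and read back only the matching rows. A minimal sketch, assuming the staged table is called FILECONTENT and reusing jdbcUrl/connectionProperties from Option 1:
pushdown_join = "(SELECT t1.* FROM TABLE1 t1 INNER JOIN FILECONTENT f ON t1.ID = f.ID) joined"
joined_df = spark.read.jdbc(url=jdbcUrl, table=pushdown_join, properties=connectionProperties)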
Option 3 - If the file contents can be collected as a list and passed in an IN clause
file_df_id = file_df.select("ID").rdd.flatMap(lambda x: x).collect()
query_param = ",".join(str(x) for x in file_df_id)
query = f'(SELECT * FROM TABLE1 WHERE ID IN ({query_param})) query_temp'
print(query)
df = spark.read.jdbc(url=jdbcUrl, table=query, properties=connectionProperties)
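If the IDs are strings rather than numbers, the values in the IN list need quoting; a small variation of the same sketch (same placeholder names as above):
file_df_id = [row["ID"] for row in file_df.select("ID").collect()]
query_param = ",".join("'{}'".format(x) for x in file_df_id)
query = "(SELECT * FROM TABLE1 WHERE ID IN ({})) query_temp".format(query_param)
df = spark.read.jdbc(url=jdbcUrl, table=query, properties=connectionProperties)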
Related
I'm pushing a query down to a server to read data into Databricks as below:
val jdbcUsername = dbutils.secrets.get(scope = "", key = "")
val jdbcPassword = dbutils.secrets.get(scope = "", key = "")
Class.forName("com.microsoft.sqlserver.jdbc.SQLServerDriver")
val jdbcHostname = ""
val jdbcPort = ...
val jdbcDatabase = ""
// Create the JDBC URL without passing in the user and password parameters.
val jdbcUrl = s"jdbc:sqlserver://${jdbcHostname}:${jdbcPort};database=${jdbcDatabase}"
// Create a Properties() object to hold the parameters.
import java.util.Properties
val connectionProperties = new Properties()
connectionProperties.put("user", s"${jdbcUsername}")
connectionProperties.put("password", s"${jdbcPassword}")
val driverClass = "com.microsoft.sqlserver.jdbc.SQLServerDriver"
connectionProperties.setProperty("Driver", driverClass)
// define a query to be passed to database to display the tables available for a given DB
val query_results = "(SELECT * FROM INFORMATION_SCHEMA.TABLES) as tables"
// push the query down to the server to retrieve the list of available tables
val table_names = spark.read.jdbc(jdbcUrl, query_results, connectionProperties)
table_names.createOrReplaceTempView("table_names")
Running display(table_names) provides a list of tables for the defined database. That part is no issue; however, when trying to read and join tables from multiple databases on the same server, I haven't yet found a solution that works.
An example would be:
// define a query to be passed to database to display a result across many tables
val report1_query = "(SELECT a.Field1, b.Field2 FROM database_1 as a left join database_2 as b on a.Field4 == b.Field8) as report1"
// push the query down to the server to retrieve the query result
val report1_results = spark.read.jdbc(jdbcUrl, report1_query, connectionProperties)
report1_results.createOrReplaceTempView("report1_results")
Any pointers appreciated with respect to restructuring this code (an equivalent in Python would also be super helpful).
SQL Server uses 3-part naming like database.schema.table. This example comes from the SQL Server information_schema docs:
SELECT TABLE_CATALOG, TABLE_SCHEMA, TABLE_NAME, COLUMN_NAME, COLUMN_DEFAULT
FROM AdventureWorks2012.INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_NAME = N'Product';
To query across databases you need to specify all 3 parts in the query being pushed down to SQL Server.
SELECT a.Field1, b.Field2
FROM database_1.schema_1.table_1 as a
LEFT JOIN database_2.schema_2.table_2 as b
on a.Field4 = b.Field8
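Wrapped as a pushdown subquery for Spark, that would look roughly like the sketch below (in Python, since the OP asked for it; it assumes a jdbcUrl string and a connectionProperties dict equivalent to the Scala Properties above, and the table/schema names are placeholders):
report1_query = """(SELECT a.Field1, b.Field2
                    FROM database_1.schema_1.table_1 AS a
                    LEFT JOIN database_2.schema_2.table_2 AS b
                      ON a.Field4 = b.Field8) AS report1"""
report1_results = spark.read.jdbc(url=jdbcUrl, table=report1_query, properties=connectionProperties)
report1_results.createOrReplaceTempView("report1_results")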
I would like to create a Temporary View from the results of a SQL Query - which sounds like a basic thing to do, but I just couldn't make it work and don't understand what is wrong.
This is my SQL query - which works fine and returns Col1.
%sql
SELECT
Col1
FROM
Table1
WHERE EXISTS (
select *
from TempView1)
I would like to write the results to another table which I can query. Therefore I do this:
df = spark.sql("""
SELECT
Col1
FROM
Table1
WHERE EXISTS (
select *
from TempView1)""")
OK
df
Out[28]: DataFrame[Col1: bigint]
df.createOrReplaceTempView("df_tmp_view")
OK
%sql
select * from df_tmp_view
Error in SQL statement: AnalysisException: Table or view not found: df_tmp_view; line 1 pos 14;
'Project [*]
+- 'UnresolvedRelation [df_tmp_view], [], false
display(df_tmp_view)
NameError: name 'df_tmp_view' is not defined
What am I doing wrong?
I don't understand the error saying that the name is not defined, although I define it just one command above. Also, the SQL query works and returns data... so what am I missing?
Thanks !
You need to register the view as a global temporary view and reference it through the global temp database; for example, in your case:
global_temp_db = spark.conf.get("spark.sql.globalTempDatabase")
display(spark.table(global_temp_db + "." + 'df_tmp_view'))
See the documentation. For example:
import pandas as pd

df_pd = pd.DataFrame(
    {
        'Name': [231232, 12312321, 3213231],
    }
)
df = spark.createDataFrame(df_pd)
df.createOrReplaceGlobalTempView('test_tmp_view')
global_temp_db = spark.conf.get("spark.sql.globalTempDatabase")
display(spark.table(global_temp_db + "." + 'test_tmp_view'))
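If you also want to query the view from a %sql cell as in the question, qualify it with the global temp database (global_temp by default):
%sql
select * from global_temp.test_tmp_view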
I would like to find tables with a specific column in a database on databricks by pyspark sql.
I tried the approach from the following post, but it does not work:
https://medium.com/#rajnishkumargarg/find-all-the-tables-by-column-name-in-hive-51caebb94832
On SQL server my code:
SELECT Table_Name, Column_Name
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_CATALOG = 'YOUR_DATABASE'
AND COLUMN_NAME LIKE '%YOUR_COLUMN%'
but I cannot figure out how to do the same thing in PySpark SQL.
thanks
The SparkSession has a catalog property. This catalog's listTables method returns a list of all tables known to the SparkSession. With this list you can query all columns for each table with listColumns:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("test").getOrCreate()
spark.sql("CREATE TABLE tab1 (name STRING, age INT) USING parquet")
spark.sql("CREATE TABLE tab2 (name STRING, age INT) USING parquet")
spark.sql("CREATE TABLE tab3 (street STRING, age INT) USING parquet")
for table in spark.catalog.listTables():
    for column in spark.catalog.listColumns(table.name):
        if column.name == 'name':
            print('Found column {} in table {}'.format(column.name, table.name))
prints
Found column name in table tab1
Found column name in table tab2
Both methods, listTables and listColumns, accept a database name as an optional argument if you want to restrict your search to a single database.
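For example, to restrict the search above to a single (hypothetical) database called my_db:
for table in spark.catalog.listTables("my_db"):
    for column in spark.catalog.listColumns(table.name, "my_db"):
        if column.name == 'name':
            print('Found column {} in table {}'.format(column.name, table.name))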
I had a similar problem to the OP: I needed to find all columns, including nested columns, that match a LIKE clause.
I wrote a post about it here https://medium.com/helmes-people/how-to-view-all-databases-tables-and-columns-in-databricks-9683b12fee10
But you can find the full code below.
The benefit of this solution, in comparison with the previous answers, is that it works when you need to search for columns with LIKE '%%', as the OP asked. It also allows you to search for names in nested fields. Finally, it creates a SQL-like view, similar to the INFORMATION_SCHEMA views.
from pyspark.sql.types import StructType

# get field name from schema (recursive for getting nested values)
def get_schema_field_name(field, parent=None):
    if type(field.dataType) == StructType:
        if parent == None:
            prt = field.name
        else:
            prt = parent + "." + field.name  # using dot notation
        res = []
        for i in field.dataType.fields:
            res.append(get_schema_field_name(i, prt))
        return res
    else:
        if parent == None:
            res = field.name
        else:
            res = parent + "." + field.name
        return res

# flatten list, from https://stackoverflow.com/a/12472564/4920394
def flatten(S):
    if S == []:
        return S
    if isinstance(S[0], list):
        return flatten(S[0]) + flatten(S[1:])
    return S[:1] + flatten(S[1:])

# list of databases
db_list = [x[0] for x in spark.sql("SHOW DATABASES").rdd.collect()]
for i in db_list:
    spark.sql("SHOW TABLES IN {}".format(i)).createOrReplaceTempView(str(i) + "TablesList")

# create a query for fetching all tables from all databases
union_string = "SELECT database, tableName FROM "
for idx, item in enumerate(db_list):
    if idx == 0:
        union_string += str(item) + "TablesList WHERE isTemporary = 'false'"
    else:
        union_string += " UNION ALL SELECT database, tableName FROM {}".format(str(item) + "TablesList WHERE isTemporary = 'false'")
spark.sql(union_string).createOrReplaceTempView("allTables")

# full list = schema, table, column
full_list = []
for i in spark.sql("SELECT * FROM allTables").collect():
    table_name = i[0] + "." + i[1]
    table_schema = spark.sql("SELECT * FROM {}".format(table_name))
    column_list = []
    for j in table_schema.schema:
        column_list.append(get_schema_field_name(j))
    column_list = flatten(column_list)
    for k in column_list:
        full_list.append([i[0], i[1], k])
spark.createDataFrame(full_list, schema=['database', 'tableName', 'columnName']).createOrReplaceTempView("allColumns")
# The following code will create a TempView containing all the tables,
# and all their columns along with their type, for a specified database
cls = []
spark.sql("Drop view if exists allTables")
spark.sql("Drop view if exists allColumns")
for table in spark.catalog.listTables("TYPE_IN_YOUR_DB_NAME_HERE"):
    for column in spark.catalog.listColumns(table.name, table.database):
        cls.append([table.database, table.name, column.name, column.dataType])
spark.createDataFrame(cls, schema=['databaseName', 'tableName', 'columnName',
                                   'columnDataType']).createOrReplaceTempView("allColumns")
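Once the allColumns view exists, you can search it with a LIKE clause much like an INFORMATION_SCHEMA view, for example (the column pattern is a placeholder):
display(spark.sql("SELECT tableName, columnName FROM allColumns WHERE columnName LIKE '%YOUR_COLUMN%'"))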
SparkSession really does have a catalog property, as werner mentioned.
If I understand you correctly, you want to get the tables that have a specific column.
You can try this code (sorry for Scala code instead of Python; a rough Python equivalent is sketched at the end):
val databases = spark.catalog.listDatabases().select($"name".as("db_name")).as("databases")
val tables = spark.catalog.listTables().select($"name".as("table_name"), $"database").as("tables")
val tablesWithDatabase = databases.join(tables, $"databases.db_name" === $"tables.database", "inner").collect()
tablesWithDatabase.foreach(row => {
  val dbName = row.get(0).asInstanceOf[String]
  val tableName = row.get(1).asInstanceOf[String]
  val columns = spark.catalog.listColumns(dbName, tableName)
  columns.foreach(column => {
    if (column.name == "Your column")
      // Do your logic here
      null
  })
})
Notice that I am calling collect, so if you have a lot of tables/databases it can cause an OOM error. The reason for the collect is that, in contrast to listTables and listDatabases, which can be called without any arguments, listColumns needs a dbName and tableName, and there is no unique column id that maps a column back to its table.
So the search for the column is done locally on the driver.
Hope that helps.
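For completeness, here is a rough Python equivalent of the same idea (a sketch; replace "your_column" with the column you are looking for):
for db in spark.catalog.listDatabases():
    for table in spark.catalog.listTables(db.name):
        for column in spark.catalog.listColumns(table.name, db.name):
            if column.name == "your_column":
                print(db.name, table.name, column.name)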
I'm using this code to load only the ids that are in my df:
library(dplyr)
tbl(conn, "table") %>%
filter(idvar %in% df$id) %>%
select(var1, var2, var3) %>%
collect()
The question is how to combine that with a join and other criteria, as in the code below, while still loading only the matched ids - there are millions of ids in my db but only hundreds in my df.
SELECT *
FROM table
LEFT JOIN table2 on table2.id = table.id
WHERE date > '2010-01-01' and column3 is not null
Hope this helps you with a little workaround.
I tried a similar scenario and it worked for me.
Note: I didn't try using dplyr.
I used MySQL as the db; DBI & pool are the R packages.
library(DBI)
library(pool)
pool <- dbPool(drv = RMySQL::MySQL(),dbname = "db_name",host = "host_name",username = "User_name", password = "password", port = 3306, unix.sock = "/path/to/mysqld/mysqld.sock")
In the above line I passed the MySQL socket path to unix.sock because I encountered a problem without it. To get the socket path:
mysql_config --socket (ubuntu)
users <- lapply(df$id, function(x){
dbGetQuery(pool, paste0("SELECT * FROM table LEFT JOIN table2 on table2.id = table.id
WHERE table.user_id IN('", x,"');" ))
})
Please edit the SQL query up to the WHERE condition according to your requirements.
It fetches the results from the database as a list. Process that list as per your requirement.
Been trying to get the following query working for a few hours now and am running out of ideas. Can anyone spot where I'm going wrong? Any pointers much appreciated.
CalEvents = (List<CalEvent>)session.CreateSQLQuery(@"
SELECT *
FROM dbo.tb_calendar_calEvents
INNER JOIN dbo.tb_calEvents
ON (dbo.tb_calendar_calEvents.calEventID = dbo.tb_calEvents.id)
WHERE dbo.tb_calendar_calEvents.calendarID = 'theCalID'"
)
.AddEntity(typeof(CalEvent))
.SetInt64("theCalID", cal.id);
Error:
Kanpeki.NUnit.CalUserTest.Should_return_logged_in_user:
System.ArgumentException : Parameter theCalID does not exist as a
named parameter in [SELECT * FROM dbo.tb_calendar_calEvents INNER JOIN
dbo.tb_calEvents ON (dbo.tb_calendar_calEvents.calEventID =
dbo.tb_calEvents.id) WHERE dbo.tb_calendar_calEvents.calendarID =
'theCalID']
"SELECT * FROM dbo.tb_calendar_calEvents INNER JOIN dbo.tb_calEvents ON (dbo.tb_calendar_calEvents.calEventID = dbo.tb_calEvents.id) WHERE dbo.tb_calendar_calEvents.calendarID = 'theCalID'"
should be
"SELECT * FROM dbo.tb_calendar_calEvents INNER JOIN dbo.tb_calEvents ON (dbo.tb_calendar_calEvents.calEventID = dbo.tb_calEvents.id) WHERE dbo.tb_calendar_calEvents.calendarID = :theCalID"
= 'theCalID' should be written as = :theCalID; :theCalID is how you use named parameters, even in native SQL queries.
You should remove the query.ExecuteUpdate() call.
Calling query.List() is enough to issue the query on the session and return the result set.