Loading data from a SQL table to a Spark DataFrame in Azure Databricks - azure-sql-database

I'm new to Databricks and still learning it. I am trying to load a SQL table into a DataFrame, following the official documentation from Microsoft.
But I am getting this error:
java.lang.ClassNotFoundException:
My notebook block:
connectionProperties = {
    "Driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver"
}
server_name = "jdbc:sqlserver://removed.database.windows.net"
database_name = "demo"
url = server_name + ";" + "databaseName=" + database_name + ";"
table_name = "Production.Data"
username = "removed"
password = "dummy"

try:
    Dataframe = spark.read \
        .format("com.microsoft.sqlserver.jdbc.spark") \
        .option("url", url) \
        .option("dbtable", table_name) \
        .option("user", username) \
        .option("password", password).load()
except ValueError as error:
    print("Connector write failed", error)
Error:

java.lang.ClassNotFoundException
This error mainly happens because a required package is missing; install the Spark connector for SQL Server (the library that provides the com.microsoft.sqlserver.jdbc.spark format) on the cluster.
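As a minimal sketch (assuming the Spark connector library that provides this format has already been attached to the cluster), the original read should then find the format class:

# Assumes the Spark connector library for SQL Server is installed on the cluster;
# hostname, database and credentials are the placeholders from the question.
url = "jdbc:sqlserver://removed.database.windows.net;databaseName=demo;"

df = (spark.read
      .format("com.microsoft.sqlserver.jdbc.spark")
      .option("url", url)
      .option("dbtable", "Production.Data")
      .option("user", "removed")
      .option("password", "dummy")
      .load())

display(df)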
Please follow the repro below; it has detailed information about how to connect Azure SQL to Databricks.
PySpark
Updated code:
jdbcHostname = "xxxx.database.windows.net"
jdbcDatabase = "Databasename"
jdbcPort = "1433"
username = "sql Username"
password = "xxxxx"

jdbcUrl = "jdbc:sqlserver://{0}:{1};database={2}".format(jdbcHostname, jdbcPort, jdbcDatabase)

conProp = {
    "user": username,
    "password": password,
    "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver"
}

def1 = "(Select * from Table_name where Table_ID = 1) Table_ID"

df = spark.read.jdbc(url=jdbcUrl, table=def1, properties=conProp)
display(df)
References:
https://learn.microsoft.com/en-us/azure/databricks/data/data-sources/sql-databases
https://www.sqlshack.com/load-data-into-azure-sql-database-from-azure-databricks/

Related

Login failed for user token-identified principal when connecting via sql_driver_manager.getConnection

I am trying to connect to Azure SQL using a Service Principal to create views, but it says:
com.microsoft.sqlserver.jdbc.SQLServerException: Login failed for user ClientConnectionId: XXXXX-XXXX-XXXX
However, with the same SPN I was able to connect, create tables, and read tables.
import adal
resource_app_id_url = "https://database.windows.net/"
service_principal_id = dbutils.secrets.get(scope = "XX", key = "XXX")
service_principal_secret = dbutils.secrets.get(scope = "XX", key = "spn-XXXX")
tenant_id = dbutils.secrets.get(scope = "XX", key = "xxId")
authority = "https://login.windows.net/" + tenant_id
azure_sql_url = "jdbc:sqlserver://xxxxxxx.windows.net"
database_name = "testDatabase"
encrypt = "true"
host_name_in_certificate = "*.database.windows.net"
context = adal.AuthenticationContext(authority)
token = context.acquire_token_with_client_credentials(resource_app_id_url, service_principal_id, service_principal_secret)
access_token = token["accessToken"]
Using the above code I am able to create and read tables. There is a requirement to create views, so I am using sql_driver_manager to connect to Azure SQL:
properties = spark._sc._gateway.jvm.java.util.Properties()
properties.setProperty("accessToken", access_token)
sql_driver_manager = spark._sc._gateway.jvm.java.sql.DriverManager
sql_con = sql_driver_manager.getConnection(azure_sql_url, properties)
query = """
create or alter view test_view as select * from dbo.test_table
"""
stmt = sql_con.createStatement()
stmt.executeUpdate(query)
stmt.close()
This results in an error:
Py4JJavaError: An error occurred while calling
z:java.sql.DriverManager.getConnection. :
com.microsoft.sqlserver.jdbc.SQLServerException: Login failed for user token-identified principal. ClientConnectionId:
If I try the same with a username and password instead of the token, it works, but I need to use the SPN token for authentication.
Working code:
sql_driver_manager = spark._sc._gateway.jvm.java.sql.DriverManager
sql_con = sql_driver_manager.getConnection(azure_sql_url, username, password)
query = """
create or alter view test_view as select * from dbo.test_table
"""
stmt = sql_con.createStatement()
stmt.executeUpdate(query)
stmt.close()
What is it that I am missing? Can someone help me understand the issue? Thanks.
You cannot use that method, since it was built to execute an update statement and return a row count.
See the documentation; use the prepareCall() and execute() methods instead.
https://learn.microsoft.com/mt-mt/sql/connect/jdbc/reference/executeupdate-method-java-lang-string?view=azuresqldb-current
#
# 5 - Upsert from stage to active table.
# Grab driver manager conn, exec sp
#
driver_manager = spark._sc._gateway.jvm.java.sql.DriverManager
connection = driver_manager.getConnection(url, user_name, password)
connection.prepareCall("exec [stage].[UpsertFactInternetSales]").execute()
connection.close()
This is sample code from an article I wrote. It calls a stored procedure to execute an UPSERT. However, any DML or DDL will work as long as it does not return a result set.
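A rough sketch of the same pattern applied to the asker's view-creation statement, reusing the working username/password connection from the question (untested; azure_sql_url, username and password are the names defined there):

# Grab a DriverManager connection through the JVM gateway, as in the question.
driver_manager = spark._sc._gateway.jvm.java.sql.DriverManager
connection = driver_manager.getConnection(azure_sql_url, username, password)

# prepareCall()/execute() runs the DDL without expecting a result set or row count.
connection.prepareCall(
    "create or alter view test_view as select * from dbo.test_table"
).execute()

connection.close()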

Why I'm getting an error in DataBricks connection with a SQL database?

I am trying to connect to a SQL Server, but somehow I'm getting the error below when trying to connect to the database from Databricks using Python:
Error:java.lang.ClassNotFoundException: com.microsoft.sqlserver.jdbd.SQLServerDriver
My connection code is the following:
jdbcHostname = "hostname"
jdbcDatabase = "databasename"
jdbcPort = port
username = 'username'
password = 'password'
jdbcUrl = "jdbc:sqlserver://{0}:{1};database={2}".format(jdbcHostname, jdbcPort, jdbcDatabase)
connectionProperties = {
    "user" : username,
    "password" : password,
    "driver" : "com.microsoft.sqlserver.jdbd.SQLServerDriver"
}
The last code works, but when I try to execute a query I get the mentioned error, coming from the second line of the next block:
pushdown_query = "select * from table"
df = spark.read.jdbc(url=jdbcUrl, table=pushdown_query, properties=connectionProperties)
display(df)
I tried to install different connectors, but I have not had any luck with it. Could you help me?
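One thing that stands out in both the error and the connection properties is the driver class name: it reads com.microsoft.sqlserver.jdbd.SQLServerDriver ("jdbd"), whereas the Microsoft JDBC driver class is com.microsoft.sqlserver.jdbc.SQLServerDriver ("jdbc"). A corrected properties dictionary would look like the sketch below (reusing the username, password and jdbcUrl from the question); note also that the table argument of spark.read.jdbc expects a table name or an aliased subquery in parentheses rather than a bare SELECT:

# Driver class name corrected: "jdbc", not "jdbd".
connectionProperties = {
    "user": username,
    "password": password,
    "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver"
}

# Either a plain table name or a parenthesised subquery with an alias.
pushdown_query = "(select * from table) AS t"
df = spark.read.jdbc(url=jdbcUrl, table=pushdown_query, properties=connectionProperties)
display(df)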

how to set superset SQLALCHEMY URI to connect HiveServer2 with custom auth

Superset needs to connect to a HiveServer2 datasource with custom auth (that is, specifying a username and password). The Python code below works:
from pyhive import hive
host_name = "192.168.0.38"
port = 10000
user = "admin"
password = "password"
database="test_db"
def hiveconnection(host_name, port, user, password, database):
    conn = hive.Connection(host=host_name, port=port, username=user, password=password,
                           database=database, auth='CUSTOM')
    cur = conn.cursor()
    cur.execute('select item_sk, reason_sk, account_credit from returns limit 5')
    result = cur.fetchall()
    return result
But how should the Superset SQLALCHEMY URI be set?
hive://hive@{hostname}:{port}/{database}
Please see:
https://github.com/apache/superset/discussions/18699#discussioncomment-2170334
hive://username:password@hostname:port/database?auth=CUSTOM
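As an illustrative sketch (the host, credentials and database are the placeholders from the question), the URI can be assembled in Python and then pasted into Superset; quoting the password guards against special characters:

from urllib.parse import quote_plus

host_name = "192.168.0.38"
port = 10000
user = "admin"
password = "password"
database = "test_db"

# SQLAlchemy URI for Superset using pyhive's CUSTOM auth mode.
sqlalchemy_uri = f"hive://{user}:{quote_plus(password)}@{host_name}:{port}/{database}?auth=CUSTOM"
print(sqlalchemy_uri)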

How to perform a SQL query with SQLAlchemy to later pass it into a pandas dataframe

I have followed this article's setup and it all seems to work properly, but now I would like to perform a SQL query and pass the result into a pandas DataFrame. How could I proceed?
This is what I have now:
host_server = os.environ.get('host_server', 'localhost')
db_server_port = urllib.parse.quote_plus(str(os.environ.get('db_server_port', '5432')))
database_name = os.environ.get('database_name', 'my_data_base123')
db_username = urllib.parse.quote_plus(str(os.environ.get('db_username', 'my_user_name123')))
db_password = urllib.parse.quote_plus(str(os.environ.get('db_password', 'my_password123')))
ssl_mode = urllib.parse.quote_plus(str(os.environ.get('ssl_mode', 'prefer')))

DATABASE_URL = 'postgresql://{}:{}@{}:{}/{}?sslmode={}'.format(db_username, db_password, host_server, db_server_port, database_name, ssl_mode)

database = databases.Database(DATABASE_URL)
metadata = sqlalchemy.MetaData()
engine = sqlalchemy.create_engine(
    DATABASE_URL, pool_size=3, max_overflow=0
)
metadata.create_all(database)

app = FastAPI(title="REST API using FastAPI PostgreSQL Async EndPoints")
app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"]
)

@app.on_event("startup")
async def startup():
    await database.connect()

@app.on_event("shutdown")
async def shutdown():
    await database.disconnect()
Trying to access the database with a SQL query:
with engine.connect() as conn:
    result = conn.execute(text("select 'hello world'"))
    print(result.all())
as it says in the SQLAlchemy documentation, but I get some errors, like:
print(result.all())
AttributeError: 'ResultProxy' object has no attribute 'all'
Even if I try to access the tables of my database:
with engine.connect() as conn:
    result = conn.execute(text("select * FROM users"))
    print(result.all())
I get the same error.
It is solved; I had to upgrade SQLAlchemy:
sudo pip install sqlalchemy --upgrade
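To get the query result into a pandas DataFrame (the part asked about in the title), here is a minimal sketch, assuming the engine defined above and pandas installed:

import pandas as pd
from sqlalchemy import text

# Run the query through the existing SQLAlchemy engine and load the
# result set into a pandas DataFrame.
with engine.connect() as conn:
    df = pd.read_sql(text("select * from users"), conn)

print(df.head())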

Character encoding in R MySQL on Linux machine

I'm trying to fetch data which includes some German words with umlaut characters. Following the structure below, everything is fine on a Windows machine:
Sys.setlocale('LC_ALL','C')
library(RMySQL)
conn <- dbConnect(MySQL(), user = "user", dbname = "database",
host = "host", password = "pass")
sql.query <- paste0("some query")
df <- dbSendQuery(conn, sql.query)
names <- fetch(df, -1)
dbDisconnect(conn)
As an example I have:
names[1230]
[1] "Strübbel"
What should I change in order to get the same result on Linux (Ubuntu)?
The query runs without problems, but the result is:
names[1230]
[1] "Str\374bbel"
I have checked this solution, but when I put 'set character set "utf8"' inside the query I'm getting the following error:
df <- dbSendQuery(conn, sql.query, 'set character set "utf8"')
names <- fetch(df, -1)
Error in .local(conn, statement, ...) :
unused argument ("set character set \"utf8\"")
I should mention that the encoding of the result is unknown:
Encoding(names[1230])
[1] "unknown"
and doing the following:
Encoding(names[1230]) <- "UTF-8"
names[1230]
[1] "Str<fc>bbel"
does not solve the problem!
Instead of:
Sys.setlocale('LC_ALL','C')
you have to use:
Sys.setlocale('LC_ALL','en_US.UTF-8')
and on the SQL side, set the character set before sending the query:
library(RMySQL)
conn <- dbConnect(MySQL(), user = "user", dbname = "database",
host = "host", password = "pass")
sql.query <- paste0("some query")
dbSendQuery(conn,'set character set "utf8"')
df <- dbSendQuery(conn, sql.query)
names <- fetch(df, -1)
dbDisconnect(conn)
Not sure if this solution will help you, but you could try this approach:
con <- dbConnect(MySQL(), user = "user", dbname = "database",
host = "host", password = "pass", encoding = "ISO-8859-1")
If this encoding doesn't work, then try "brute force" with different variants.