Is it possible in SparkSQL to join data from MySQL and Oracle databases? I tried to join them, but I am having trouble setting multiple jars (the JDBC drivers for MySQL and Oracle) in SPARK_CLASSPATH.
Here is my code:
import os
import sys

os.environ['SPARK_HOME'] = "/home/x/spark-1.5.2"
sys.path.append("/home/x/spark-1.5.2/python/")

try:
    from pyspark import SparkContext, SparkConf
    from pyspark.sql import SQLContext

    MYSQL_DRIVER_PATH = "/home/x/spark-1.5.2/python/lib/mysql-connector-java-5.1.38-bin.jar"
    MYSQL_CONNECTION_URL = "jdbc:mysql://192.111.333.999:3306/db?user=us&password=pasw"

    ORACLE_DRIVER_PATH = "/home/x/spark-1.5.2/python/lib/ojdbc6.jar"
    Oracle_CONNECTION_URL = "jdbc:oracle:thin:user/pasw@192.111.333.999:1521:xe"

    # Define Spark configuration
    conf = SparkConf()
    conf.setMaster("local")
    conf.setAppName("MySQL_Oracle_imp_exp")

    # Initialize a SparkContext and SQLContext
    sc = SparkContext(conf=conf)
    #sc.addJar(MYSQL_DRIVER_PATH)
    sqlContext = SQLContext(sc)

    ora_tmp = sqlContext.read.format('jdbc').options(
        url=Oracle_CONNECTION_URL,
        dbtable="TABLE1",
        driver="oracle.jdbc.OracleDriver"
    ).load()
    ora_tmp.show()

    tmp2 = sqlContext.load(
        source="jdbc",
        path=MYSQL_DRIVER_PATH,
        url=MYSQL_CONNECTION_URL,
        dbtable="(select city,zip from TABLE2 limit 10) as tmp2",
        driver="com.mysql.jdbc.Driver")
    c_rows = tmp2.collect()
    ....
except Exception as e:
    print(e)
    sys.exit(1)
Could someone please help me to solve this problem?
Thanks in advance :)
Here are the steps you need to follow:
First, register SPARK_CLASSPATH with the jar of one of the databases, say MySQL, using:
os.environ['SPARK_CLASSPATH'] = "/usr/share/java/mysql-connector-java.jar"
Run the query against the MySQL database and assign the result to an RDD.
Then register SPARK_CLASSPATH with the jar of the second database by changing the path in the command above.
Run the query against the second database.
If you have issues with lazy evaluation, make sure you write the first data set to files before proceeding, as in the sketch below.
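A rough, untested sketch of those steps, reusing the connection URLs and driver paths from the question (the Parquet path and the join key "city" are illustrative assumptions, not part of the original answer):
import os
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext

MYSQL_CONNECTION_URL = "jdbc:mysql://192.111.333.999:3306/db?user=us&password=pasw"
ORACLE_CONNECTION_URL = "jdbc:oracle:thin:user/pasw@192.111.333.999:1521:xe"

# Step 1: register the MySQL driver jar before creating the context
os.environ['SPARK_CLASSPATH'] = "/usr/share/java/mysql-connector-java.jar"

sc = SparkContext(conf=SparkConf().setMaster("local").setAppName("mysql_oracle_join"))
sqlContext = SQLContext(sc)

# Step 2: query MySQL and materialize the result to files, so lazy
# evaluation does not re-read MySQL after the classpath changes
mysql_df = sqlContext.read.format('jdbc').options(
    url=MYSQL_CONNECTION_URL,
    dbtable="(select city, zip from TABLE2) as tmp2",
    driver="com.mysql.jdbc.Driver").load()
mysql_df.write.parquet("/tmp/mysql_tmp")

# Step 3: point SPARK_CLASSPATH at the Oracle driver jar instead
os.environ['SPARK_CLASSPATH'] = "/home/x/spark-1.5.2/python/lib/ojdbc6.jar"

# Step 4: query Oracle, reload the materialized MySQL data, and join
ora_df = sqlContext.read.format('jdbc').options(
    url=ORACLE_CONNECTION_URL,
    dbtable="TABLE1",
    driver="oracle.jdbc.OracleDriver").load()
joined = ora_df.join(sqlContext.read.parquet("/tmp/mysql_tmp"), "city")  # "city" is an assumed join key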
I have Postgres installed on PC1 and I am connecting to the database from PC2. I have modified the settings so that Postgres on PC1 is accessible to the local network.
On PC2 I am doing the following:
import pandas as pd, pyodbc
from sqlalchemy import create_engine
z1 = create_engine('postgresql://postgres:***@192.168.40.154:5432/myDB')
z2 = pd.read_sql(fr"""select * from public."myTable" """, z1)
I get the error:
File "C:\Program Files\Python311\Lib\site-packages\pandas\io\sql.py", line 1405, in execute
return self.connectable.execution_options().execute(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'OptionEngine' object has no attribute 'execute'
While running the same code on PC1 I get no error.
I just noticed that it happens only when reading from the db; to_sql works fine. Something seems to be missing on PC2, because I get the same error even if I use localhost:5432 instead of 192.168.40.154:5432.
Edit:
The following modification worked, but I am not sure why. Can someone please explain what the reason for this could be?
from sqlalchemy.sql import text
connection = z1.connect()
stmt = text("SELECT * FROM public.myTable")
z2 = pd.read_sql(stmt, connection)
Edit2:
PC1:
pd.__version__
'1.5.2'
import sqlalchemy
sqlalchemy.__version__
'1.4.46'
PC2:
pd.__version__
'1.5.3'
import sqlalchemy
sqlalchemy.__version__
'2.0.0'
Does it mean that if I update the packages on PC1 everything is going to break?
I ran into the same problem just today, and it is basically the SQLAlchemy version: if you look at the documentation here, SQLAlchemy 2.0.0 was released only a few days ago and pandas has not been updated for it yet. For now, I think the solution is to stick with the 1.4.x versions.
The sqlalchemy.sql.text() part is not the issue. Calling .connect() on the engine returned by create_engine(), instead of passing the engine directly, seems to have done the trick.
You should also use a context manager, in addition to wrapping the SQL in a SQLAlchemy text() clause, e.g.:
import pandas as pd, pyodbc
from sqlalchemy import create_engine, text

engine = create_engine('postgresql://postgres:***@192.168.40.154:5432/myDB')

with engine.begin() as connection:
    res = pd.read_sql(
        sql=text('SELECT * FROM public."myTable"'),
        con=connection,
    )
As explained here https://pandas.pydata.org/docs/reference/api/pandas.read_sql.html :
con : SQLAlchemy connectable, str, or sqlite3 connection
Using SQLAlchemy makes it possible to use any DB supported by that library. If a DBAPI2 object, only sqlite3 is supported. The user is responsible for engine disposal and connection closure for the SQLAlchemy connectable; str connections are closed automatically. See here.
--> especially this point: https://docs.sqlalchemy.org/en/20/core/connections.html#connect-and-begin-once-from-the-engine
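A minimal sketch of what that responsibility looks like in practice, reusing the masked connection string from this thread (the context manager closes the connection; dispose() releases the engine's pool once it is no longer needed):
import pandas as pd
from sqlalchemy import create_engine, text

engine = create_engine('postgresql://postgres:***@192.168.40.154:5432/myDB')

# begin() opens a connection and a transaction, and closes both on exit
with engine.begin() as connection:
    df = pd.read_sql(text('SELECT * FROM public."myTable"'), connection)

# release the pooled connections when the engine will not be used again
engine.dispose()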
I have an Azure data lake external table, and want to remove all rows from it. I know that the 'truncate' command doesn't work for external tables, and BTW I don't really want to re-create the table (but might consider that option for certain flows). Anyway, the best I've gotten to work so far is to create an empty data frame (with a defined schema) and overwrite the folder containing the data, e.g.:
from pyspark.sql.types import *

data = []
schema = StructType(
    [
        StructField('Field1', IntegerType(), True),
        StructField('Field2', StringType(), True),
        StructField('Field3', DecimalType(18, 8), True)
    ]
)

sdf = spark.createDataFrame(data, schema)
#sdf.printSchema()
#display(sdf)

sdf.write.format("csv").option('header', True).mode("overwrite").save("/db1/table1")
This mostly works, except that if I go to select from the table, it will fail with the below error:
Error: Job aborted due to stage failure: Task 0 in stage 3.0 failed 4 times, most recent failure: Lost task 0.3 in stage 3.0 (TID 13) (vm-9cb62393 executor 2): java.io.FileNotFoundException: Operation failed: "The specified path does not exist."
It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.
I tried running 'refresh' on the table but the error persisted. Restarting the spark session fixes it, but that's not ideal. Is there a correct way for me to be doing this?
UPDATE: I don't have it working yet, but at least I now have a function that dynamically clears the table:
from pyspark.sql.types import *
from pyspark.sql.types import _parse_datatype_string

def empty_table(database_name, table_name):
    data = []
    schema = StructType()
    for column in spark.catalog.listColumns(table_name, database_name):
        datatype_string = _parse_datatype_string(column.dataType)
        schema.add(column.name, datatype_string, True)
    sdf = spark.createDataFrame(data, schema)
    path = "/{}/{}".format(database_name, table_name)
    sdf.write.format("csv").mode("overwrite").save(path)
When I insert my query result into an Oracle or Netezza table, the code below works well with the queryResultToTable() method; all records are loaded as expected.
import cx_Oracle
import pandas
import sys
from multiprocessing.pool import ThreadPool  # Import ThreadPool to enable parallel execution
from sqlalchemy import create_engine, inspect  # Import create_engine to use Pandas database functions, e.g. dataframe.to_sql()
from sqlalchemy.dialects.oracle import \
    BFILE, BLOB, CHAR, CLOB, DATE, \
    DOUBLE_PRECISION, FLOAT, INTERVAL, LONG, NCLOB, \
    NUMBER, NVARCHAR, NVARCHAR2, RAW, TIMESTAMP, VARCHAR, \
    VARCHAR2
import netezza_dialect


class databaseOperation():

    def queryResultToTable(self, sourceDBEngineURL, targetDBEngineURL, targetSchemaName, targetTableName, targetDataTypes, queryScript):
        sourceDBEngine = create_engine(sourceDBEngineURL)
        try:
            with sourceDBEngine.connect() as sourceDBConnection:
                try:
                    queryResult = pandas.read_sql(queryScript, sourceDBConnection)
                except Exception as e:
                    print(e)
        except Exception as e:
            print(e)
            return

        targetDBEngine = create_engine(targetDBEngineURL)
        try:
            with targetDBEngine.connect() as targetDBConnection:
                targetDBConnection.execution_options(autocommit=True)  # submit commit() automatically
                try:
                    queryResult.to_sql(targetTableName, targetDBConnection, targetSchemaName, if_exists='append', index=False, dtype=targetDataTypes, method=None)
                    # !!! method = 'multi' doesn't work for Oracle database
                except Exception as e:
                    print(e)
        except Exception as e:
            print(e)
            return


if __name__ == '__main__':
    db = databaseOperation()

    sourceORAEngineURL = "....."  # in the format "oracle+cx_oracle://user:pwd@server_address1/db1"
    targetORAEngineURL = "....."  # in the format "oracle+cx_oracle://user:pwd@server_address2/db2"

    sql = "SELECT abc, def, ggg FROM table_name WHERE abc = 'txt'"

    ORA_targetSCHEMANAME = 'hr'
    ORA_targetTABLENAME = 'cmpresult'
    ORA_tagetDATATYPES = {
        'abc': NVARCHAR2(20),
        'def': NVARCHAR2(100),
        'ggg': NVARCHAR2(100)
    }

    db.queryResultToTable(sourceORAEngineURL, targetORAEngineURL, ORA_targetSCHEMANAME, ORA_targetTABLENAME, ORA_tagetDATATYPES, sql)

    sys.exit(0)
But when I change method = None to method = 'multi', like:
queryResult.to_sql(targetTableName, targetDBConnection, targetSchemaName, if_exists = 'append', index = False, dtype = targetDataTypes, method = 'multi')
With the same call, Netezza works fine, but Oracle returns the message below:
'CompileError' object has no attribute 'orig'
Other than that, no more information is displayed, and I have no idea what the issue is. I also tried switching Connection.execution_options(autocommit=True) on and off, but nothing changed.
Could someone help me out?
It looks like this is not supported for Oracle.
Per the pandas docs on method='multi':
Pass multiple values in a single INSERT clause. It uses a special SQL syntax not supported by all backends.
The SQLAlchemy documentation notes:
Two phase transactions are not supported under cx_Oracle due to poor driver support. As of cx_Oracle 6.0b1, the interface for two phase transactions has been changed to be more of a direct pass-through to the underlying OCI layer with less automation. The additional logic to support this system is not implemented in SQLAlchemy.
This question suggests using executemany from cx_Oracle; a rough sketch is below.
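A minimal, untested sketch of that executemany approach, reusing the schema, table, and column names from the question above (the user, password, and dsn arguments are placeholders):
import cx_Oracle
import pandas

def insert_with_executemany(df: pandas.DataFrame, user: str, password: str, dsn: str):
    # Convert the DataFrame to a list of plain tuples for positional binds
    rows = list(df.itertuples(index=False, name=None))
    with cx_Oracle.connect(user=user, password=password, dsn=dsn) as connection:
        cursor = connection.cursor()
        # :1, :2, :3 map positionally to the tuple elements
        cursor.executemany(
            "INSERT INTO hr.cmpresult (abc, def, ggg) VALUES (:1, :2, :3)",
            rows,
        )
        connection.commit()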
I am trying to do a POC in pyspark for a very simple requirement. As a first step, I am just trying to copy records from one table to another table. There are more than 20 tables, but at first I am trying to do it only for one table and later enhance it to multiple tables.
The code below works fine when I copy only 10 records. But when I try to copy all records from the main table, the code gets stuck and eventually I have to terminate it manually. As the main table has 1 million records, I was expecting it to finish in a few seconds, but it just never completes.
Spark UI: (screenshot omitted)
Could you please suggest how I should handle it?
Host: local machine
Spark version: 3.0.0
Database: Oracle
Code:
from pyspark.sql import SparkSession
from configparser import ConfigParser

#read configuration file
config = ConfigParser()
config.read('config.ini')

#setting up db credentials
url = config['credentials']['dbUrl']
dbUsr = config['credentials']['dbUsr']
dbPwd = config['credentials']['dbPwd']
dbDrvr = config['credentials']['dbDrvr']
dbtable = config['tables']['dbtable']
#print(dbtable)


# database connection
def dbConnection(spark):
    pushdown_query = "(SELECT * FROM main_table) main_tbl"
    prprDF = spark.read.format("jdbc")\
        .option("url", url)\
        .option("user", dbUsr)\
        .option("dbtable", pushdown_query)\
        .option("password", dbPwd)\
        .option("driver", dbDrvr)\
        .option("numPartitions", 2)\
        .load()
    prprDF.write.format("jdbc")\
        .option("url", url)\
        .option("user", dbUsr)\
        .option("dbtable", "backup_tbl")\
        .option("password", dbPwd)\
        .option("driver", dbDrvr)\
        .mode("overwrite").save()


if __name__ == "__main__":
    spark = SparkSession\
        .builder\
        .appName("DB refresh")\
        .getOrCreate()
    dbConnection(spark)
    spark.stop()
It looks like you are using only one thread (executor) to process the data over the JDBC connection. Check the executor and driver details in the Spark UI and try increasing the resources. Also share the error it is failing with; you can get it from the same UI, or via the CLI with "yarn logs -applicationId <application id>". A sketch of a partitioned JDBC read follows below.
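As a rough illustration of that point (not from the original answer): with a plain dbtable option and no partition column, the JDBC read runs as a single partition, so one common way to parallelise it is to supply a partition column and bounds. The column name "id" and the bounds below are assumptions about main_table, not values from the question:
prprDF = (
    spark.read.format("jdbc")
    .option("url", url)
    .option("user", dbUsr)
    .option("password", dbPwd)
    .option("driver", dbDrvr)
    .option("dbtable", "(SELECT * FROM main_table) main_tbl")
    # "id" is an assumed numeric key column of main_table; the bounds are illustrative
    .option("partitionColumn", "id")
    .option("lowerBound", 1)
    .option("upperBound", 1000000)
    .option("numPartitions", 4)
    .load()
)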
spark-shell --packages com.databricks:spark-csv_2.11:1.2.0
1. Using SQLContext
~~~~~~~~~~~~~~~~~~~~
import org.apache.spark.sql.SQLContext
val sqlctx = new SQLContext(sc)
import sqlctx._
val df = sqlctx.read.format("com.databricks.spark.csv").option("inferSchema","true").option("delimiter",";").option("header","true").load("/user/cloudera/data.csv")
df.select(avg($"col1")).show() // this works fine
sqlctx.sql("select percentile_approx(balance,0.5) as median from port_bank_table").show()
// or
sqlctx.sql("select percentile(balance,0.5) as median from port_bank_table").show()
// neither works; I get the error below:
org.apache.spark.sql.AnalysisException: undefined function percentile_approx; line 0 pos 0
at org.apache.spark.sql.catalyst.analysis.SimpleFunctionRegistry$$anonfun$2.apply(FunctionRegistry.scala:65)
at org.apache.spark.sql.catalyst.analysis.SimpleFunctionRegistry$$anonfun$2.apply(FunctionRegistry.scala:65)
using HiveContext
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
So I tried using HiveContext:
scala> import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.hive.HiveContext
scala> val hivectx = new HiveContext(sc)
18/01/09 22:51:06 WARN metastore.ObjectStore: Failed to get database default, returning NoSuchObjectException
hivectx: org.apache.spark.sql.hive.HiveContext = org.apache.spark.sql.hive.HiveContext#5be91161
scala> import hivectx._
import hivectx._
Getting the below error:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Failed to start database 'metastore_db' with class loader org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@be453c4,
see the next exception for details.
I can't find any percentile_approx or percentile function among the Spark aggregation functions; this functionality does not seem to be built into Spark DataFrames. For more information, please see How to calculate Percentile of column in a DataFrame in spark?
I hope it helps you.
I don't think so; it should work. For that, save the DataFrame as a table using saveAsTable; then you will be able to run your query using HiveContext.
someDF.write.mode(SaveMode.Overwrite)
  .format("parquet")
  .saveAsTable("Table_name")
// In my case "mode" works as mode("Overwrite")
hivectx.sql("select avg(col1) as median from Table_name").show()
It will work.