I have Postgres installed on PC1 and I am connecting to the database from PC2. I have modified the settings so that Postgres on PC1 is accessible from the local network.
On PC2 I am doing the following:
import pandas as pd, pyodbc
from sqlalchemy import create_engine
z1 = create_engine('postgresql://postgres:***@192.168.40.154:5432/myDB')
z2 = pd.read_sql(fr"""select * from public."myTable" """, z1)
I get the error:
File "C:\Program Files\Python311\Lib\site-packages\pandas\io\sql.py", line 1405, in execute
return self.connectable.execution_options().execute(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'OptionEngine' object has no attribute 'execute'
While running the same code on PC1 I get no error.
I just noticed that it happens only when reading from the db; to_sql works fine. Something seems to be missing on PC2, because I get the same error whether I use 192.168.40.154:5432 or localhost:5432.
Edit:
The following modification worked, but I am not sure why. Can someone please explain what the reason for this could be?
from sqlalchemy.sql import text
connection = z1.connect()
stmt = text("SELECT * FROM public.myTable")
z2 = pd.read_sql(stmt, connection)
Edit2:
PC1:
pd.__version__
'1.5.2'
import sqlalchemy
sqlalchemy.__version__
'1.4.46'
PC2:
pd.__version__
'1.5.3'
import sqlalchemy
sqlalchemy.__version__
'2.0.0'
Does it mean that if I update the packages on PC1 everything is going to break?
I ran into the same problem just today, and it is basically the SQLAlchemy version. If you look at the documentation here, SQLAlchemy 2.0.0 was released only a few days ago, so pandas has not been updated for it yet. For now I think the solution is sticking with the 1.4.x versions.
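If you go that route, pinning the version with pip (assuming pip is the installer in use) would look something like:
pip install "SQLAlchemy>=1.4,<2.0"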
The sqlalchemy.sql.text() part is not the issue. Calling .connect() on the engine returned by create_engine() seems to have done the trick.
In addition to wrapping the statement in a SQLAlchemy text() clause, you should also use a context manager, e.g.:
import pandas as pd, pyodbc
from sqlalchemy import create_engine, text
engine = create_engine('postgresql://postgres:***@192.168.40.154:5432/myDB')
with engine.begin() as connection:
    res = pd.read_sql(
        sql=text('SELECT * FROM public."myTable"'),
        con=connection,
    )
As explained here https://pandas.pydata.org/docs/reference/api/pandas.read_sql.html :
con : SQLAlchemy connectable, str, or sqlite3 connection
Using SQLAlchemy makes it possible to use any DB supported by that library. If a DBAPI2 object, only sqlite3 is supported. The user is responsible for engine disposal and connection closure for the SQLAlchemy connectable; str connections are closed automatically. See here.
--> especially this point: https://docs.sqlalchemy.org/en/20/core/connections.html#connect-and-begin-once-from-the-engine
Related
I have a file called im.db in my current working directory, as shown here:
I also am able to query this database directly from sqlite3 at the command line:
% sqlite3
SQLite version 3.28.0 2019-04-15 14:49:49
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
sqlite> .open im.db
sqlite> select count(*) from writers;
255873
However, when running the same query inside of my notebook:
import sqlite3 as sql
import pandas as pd

con = sql.connect("im.db")
writers_dataframe = pd.read_sql_query("SELECT COUNT(*) FROM writers", con)
I get an error message which states no such table: writers
Any help would be much appreciated. Thanks!
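One thing worth checking (just a guess, not something confirmed by the question): whether the notebook's working directory is really the folder that contains im.db, because sqlite3.connect() silently creates a new, empty database when the file does not exist at the given path. A quick diagnostic sketch (the absolute path below is hypothetical):
import os
import sqlite3 as sql
import pandas as pd

print(os.getcwd())              # where the notebook is actually running
print(os.path.exists("im.db"))  # False would explain "no such table: writers"

# Connecting with an absolute path removes the ambiguity
con = sql.connect(r"/full/path/to/im.db")  # hypothetical path, adjust to your machine
print(pd.read_sql_query("SELECT COUNT(*) FROM writers", con))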
Querying data from BigQuery has been working for me. Then I updated my Google packages (e.g. google-cloud-bigquery) and suddenly I could no longer download data. Unfortunately, I no longer know which old version of the package I was using. Now I'm using version '1.26.1' of google-cloud-bigquery.
Here is my code which was running:
from google.cloud import bigquery
from google.oauth2 import service_account
import pandas as pd
KEY_FILE_LOCATION = "path_to_json"
PROJECT_ID = 'bigquery-123454'
credentials = service_account.Credentials.from_service_account_file(KEY_FILE_LOCATION)
client = bigquery.Client(credentials=credentials, project=PROJECT_ID)
query_job = client.query("""
SELECT
x,
y
FROM
`bigquery-123454.624526435.ga_sessions_*`
WHERE
_TABLE_SUFFIX BETWEEN '20200501' AND '20200502'
""")
results = query_job.result()
df = results.to_dataframe()
Except for the last line, df = results.to_dataframe(), the code works perfectly. Now I get a weird error which consists of three parts:
Part 1:
_InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
status = StatusCode.UNAVAILABLE
details = "failed to connect to all addresses"
debug_error_string = "{"created":"#1596627109.629000000","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":3948,"referenced_errors":[{"created":"#1596627109.629000000","description":"failed to connect to all addresses","file":"src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":394,"grpc_status":14}]}"
>
Part 2:
ServiceUnavailable: 503 failed to connect to all addresses
Part 3:
RetryError: Deadline of 600.0s exceeded while calling functools.partial(<function _wrap_unary_errors.<locals>.error_remapped_callable at 0x0000000010BD3C80>, table_reference {
project_id: "bigquery-123454"
dataset_id: "_a0003e6c1ab4h23rfaf0d9cf49ac0e90083ca349e"
table_id: "anon2d0jth_f891_40f5_8c63_76e21ab5b6f5"
}
requested_streams: 1
read_options {
}
format: ARROW
parent: "projects/bigquery-123454"
, metadata=[('x-goog-request-params', 'table_reference.project_id=bigquery-123454&table_reference.dataset_id=_a0003e6c1abanaw4egacf0d9cf49ac0e90083ca349e'), ('x-goog-api-client', 'gl-python/3.7.3 grpc/1.30.0 gax/1.22.0 gapic/1.0.0')]), last exception: 503 failed to connect to all addresses
I don't have an explanation for this error. I don't think it has anything to do with my updating the packages.
I once had problems with the proxy, but those caused a different error.
My colleague said that the project "bigquery-123454" is still available in BigQuery.
Any ideas?
Thanks for your help in advance!
A 503 error occurs when there is a network issue. Try again after some time, or retry the job.
You can read more about the error on the Google Cloud page.
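If the 503 is transient, a simple client-side retry around the query can be enough. A minimal sketch, assuming the client object from the question and using query_sql as a stand-in for the query string (the attempt count and wait time are arbitrary):
import time
from google.api_core.exceptions import ServiceUnavailable

for attempt in range(3):
    try:
        # Re-run the query and pull the results into a DataFrame
        df = client.query(query_sql).result().to_dataframe()
        break
    except ServiceUnavailable:
        time.sleep(30)  # wait a bit before retrying the transient 503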
I found the answer:
After downgrading the package "google-cloud-bigquery" from version 1.26.1 to 1.18.1 the code worked again! So the new package caused the errors.
I downgraded the package using pip install google-cloud-bigquery==1.18.1 --force-reinstall
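To avoid losing track of a known-good version again, it can help to record the installed versions before upgrading, e.g.:
pip freeze > requirements.txt
pip show google-cloud-bigquery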
I am trying to do a POC in PySpark for a very simple requirement. As a first step, I am just trying to copy the records from one table to another table. There are more than 20 tables, but at first I am trying to do it only for one table and later enhance it to multiple tables.
The below code works fine when I copy only 10 records. But when I try to copy all records from the main table, the code gets stuck and eventually I have to terminate it manually. As the main table has 1 million records, I was expecting it to finish in a few seconds, but it just never completes.
Spark UI :
Could you please suggest how I should handle it?
Host : Local Machine
Spark version : 3.0.0
database : Oracle
Code :
from pyspark.sql import SparkSession
from configparser import ConfigParser
#read configuration file
config = ConfigParser()
config.read('config.ini')
#setting up db credentials
url = config['credentials']['dbUrl']
dbUsr = config['credentials']['dbUsr']
dbPwd = config['credentials']['dbPwd']
dbDrvr = config['credentials']['dbDrvr']
dbtable = config['tables']['dbtable']
#print(dbtable)
# database connection
def dbConnection(spark):
    pushdown_query = "(SELECT * FROM main_table) main_tbl"
    prprDF = spark.read.format("jdbc")\
        .option("url", url)\
        .option("user", dbUsr)\
        .option("dbtable", pushdown_query)\
        .option("password", dbPwd)\
        .option("driver", dbDrvr)\
        .option("numPartitions", 2)\
        .load()
    prprDF.write.format("jdbc")\
        .option("url", url)\
        .option("user", dbUsr)\
        .option("dbtable", "backup_tbl")\
        .option("password", dbPwd)\
        .option("driver", dbDrvr)\
        .mode("overwrite").save()

if __name__ == "__main__":
    spark = SparkSession\
        .builder\
        .appName("DB refresh")\
        .getOrCreate()
    dbConnection(spark)
    spark.stop()
It looks like you are using only one thread (executor) to process the data over the JDBC connection. Check the executor and driver details in the Spark UI and try increasing the resources. Also share the error it is failing with; you can get it from the same UI, or use the CLI to fetch the logs with "yarn logs -applicationId <application ID>".
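If the read side is the bottleneck, one common cause is that the JDBC source pulls the whole table over a single connection: as far as I know, numPartitions alone does not split the read unless partitionColumn, lowerBound and upperBound are also supplied. A sketch of a partitioned read, reusing the variables from the question ("ID" is a hypothetical numeric column; pick a real indexed column and real bounds):
# "ID" is a placeholder for a numeric column used to split the table into
# parallel queries; the bounds should roughly cover its min/max values.
prprDF = spark.read.format("jdbc")\
    .option("url", url)\
    .option("user", dbUsr)\
    .option("password", dbPwd)\
    .option("driver", dbDrvr)\
    .option("dbtable", "(SELECT * FROM main_table) main_tbl")\
    .option("partitionColumn", "ID")\
    .option("lowerBound", "1")\
    .option("upperBound", "1000000")\
    .option("numPartitions", "4")\
    .option("fetchsize", "10000")\
    .load()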
I am trying to perform the simple task of bringing table data from an MS Access database into pandas as a DataFrame. I had this working recently and now I cannot figure out why it no longer works. I remember that when initially troubleshooting the connection I needed to install a new Microsoft database driver with the correct bitness, so I have revisited that and reinstalled the driver. Below is my setup.
Record of install on Laptop:
OS: Windows 7 Professional 64-bit (verified 9/6/2017)
Access version: Access 2016 32bit (verified 9/6/2017)
Python version: Python 3.6.1 (64-bit) found using >Python -V (verified 9/11/2017)
the AccessDatabaseEngine needed will be based on the Python bitness above
Windows database engine driver installed with AccessDatabaseEngine_X64.exe from 2010 release using >AccessDatabaseEngine_X64.exe /passive (verified 9/11/2017)
I am running the following simple test code to try out the connection to a test database.
import pyodbc
import pandas as pd
[x for x in pyodbc.drivers() if x.startswith('Microsoft Access Driver')]
returns:
['Microsoft Access Driver (*.mdb, *.accdb)']
Setting the connection string.
dbpath = r'Z:\1Users\myfiles\software\JupyterNotebookFiles\testDB.accdb'
conn_str = (r'Driver={Microsoft Access Driver (*.mdb, *.accdb)};''DBQ=%s;' %(dbpath))
cnxn = pyodbc.connect(conn_str)
crsr = cnxn.cursor()
Verifying that I am connected to the db...
for table_info in crsr.tables(tableType='TABLE'):
print(table_info.table_name)
returns:
TestTable1
Trying to connect to TestTable1 gives the error below.
dfTable = pd.read_sql_table(TestTable1, cnxn)
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-14-a24de1550834> in <module>()
----> 1 dfTable = pd.read_sql_table(TestTable1, cnxn)
2 #dfQuery = pd.read_sql_query("SELECT FROM [TestQuery1]", cnxn)
NameError: name 'TestTable1' is not defined
Trying again with single quotes gives the error below.
dfTable = pd.read_sql_table('TestTable1', cnxn)
---------------------------------------------------------------------------
NotImplementedError Traceback (most recent call last)
<ipython-input-15-1f89f9725f0a> in <module>()
----> 1 dfTable = pd.read_sql_table('TestTable1', cnxn)
2 #dfQuery = pd.read_sql_query("SELECT FROM [TestQuery1]", cnxn)
C:\Users\myfiles\Anaconda3\lib\site-packages\pandas\io\sql.py in read_sql_table(table_name, con, schema, index_col, coerce_float, parse_dates, columns, chunksize)
250 con = _engine_builder(con)
251 if not _is_sqlalchemy_connectable(con):
--> 252 raise NotImplementedError("read_sql_table only supported for "
253 "SQLAlchemy connectable.")
254 import sqlalchemy
NotImplementedError: read_sql_table only supported for SQLAlchemy connectable.
I have tried going back to the driver issue and reinstalling a 32bit version without any luck.
Anybody have any ideas?
Per the docs of pandas.read_sql_table:
Given a table name and an SQLAlchemy connectable, returns a DataFrame.
This function does not support DBAPI connections.
Since pyodbc is a DBAPI connection, use the query method, pandas.read_sql, whose con argument does support DBAPI connections:
dfTable = pd.read_sql("SELECT * FROM TestTable1", cnxn)
Reading db table with just table_name
import pandas
from sqlalchemy import create_engine
engine=create_engine('postgresql+psycopg2://user:password@localhost/db_name')
df=pandas.read_sql_table("table_name",engine)
Is it possible in SparkSQL to join data from MySQL and Oracle databases? I tried to join them, but I had some trouble setting the multiple jars (JDBC drivers for MySQL and Oracle) in SPARK_CLASSPATH.
Here is my code:
import os
import sys
os.environ['SPARK_HOME']="/home/x/spark-1.5.2"
sys.path.append("/home/x/spark-1.5.2/python/")
try:
    from pyspark import SparkContext, SparkConf
    from pyspark.sql import SQLContext

    MYSQL_DRIVER_PATH = "/home/x/spark-1.5.2/python/lib/mysql-connector-java-5.1.38-bin.jar"
    MYSQL_CONNECTION_URL = "jdbc:mysql://192.111.333.999:3306/db?user=us&password=pasw"
    ORACLE_DRIVER_PATH = "/home/x/spark-1.5.2/python/lib/ojdbc6.jar"
    Oracle_CONNECTION_URL = "jdbc:oracle:thin:user/pasw@192.111.333.999:1521:xe"

    # Define Spark configuration
    conf = SparkConf()
    conf.setMaster("local")
    conf.setAppName("MySQL_Oracle_imp_exp")

    # Initialize a SparkContext and SQLContext
    sc = SparkContext(conf=conf)
    #sc.addJar(MYSQL_DRIVER_PATH)
    sqlContext = SQLContext(sc)

    ora_tmp = sqlContext.read.format('jdbc').options(
        url=Oracle_CONNECTION_URL,
        dbtable="TABLE1",
        driver="oracle.jdbc.OracleDriver"
    ).load()
    ora_tmp.show()

    tmp2 = sqlContext.load(
        source="jdbc",
        path=MYSQL_DRIVER_PATH,
        url=MYSQL_CONNECTION_URL,
        dbtable="(select city,zip from TABLE2 limit 10) as tmp2",
        driver="com.mysql.jdbc.Driver")
    c_rows = tmp2.collect()
    ....
except Exception as e:
    print e
    sys.exit(1)
Could someone please help me to solve this problem?
Thanks in advance :)
Here are the steps you need to follow:
1. First register SPARK_CLASSPATH to the jars of one of the databases, say MySQL, using the command
os.environ['SPARK_CLASSPATH'] = "/usr/share/java/mysql-connector-java.jar"
2. Run the query against the MySQL database and assign the result to an RDD.
3. Register SPARK_CLASSPATH with the jars of the second database by changing the path in the command above.
4. Run the query against the second database.
If you have issues with lazy evaluation, make sure you first write the first data set into files and then proceed further; see the sketch below.
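For illustration, here is a minimal sketch of steps 1-2 plus the write-to-files tip, reusing the jar path from step 1 and the connection details from the question (the output path and the limited column list are illustrative; a second run with the Oracle jar on the classpath would read the saved files back and perform the join):
import os
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext

# SPARK_CLASSPATH must be set before the SparkContext (and its JVM) is created
os.environ['SPARK_CLASSPATH'] = "/usr/share/java/mysql-connector-java.jar"

sc = SparkContext(conf=SparkConf().setMaster("local").setAppName("mysql_extract"))
sqlContext = SQLContext(sc)

# Step 2: read the MySQL table into a DataFrame
mysql_df = sqlContext.read.format('jdbc').options(
    url="jdbc:mysql://192.111.333.999:3306/db?user=us&password=pasw",
    dbtable="(select city, zip from TABLE2) as tmp2",
    driver="com.mysql.jdbc.Driver").load()

# Materialize the MySQL data to files so the second run (with the Oracle jar)
# can read it back and join it against the Oracle table
mysql_df.write.parquet("/tmp/mysql_table2")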