SQLAlchemy - override database engine in tests

I have the following SQLAlchemy initialization in my application:
from sqlalchemy import create_engine
from sqlalchemy.orm import scoped_session, sessionmaker
from sqlalchemy.ext.declarative import declarative_base
engine = create_engine(config.DATABASE_URL, echo=True)
db_session = scoped_session(sessionmaker(bind=engine))
Base = declarative_base()
Base.query = db_session.query_property()
I would like to override the database engine in tests (pytest) with an in-memory SQLite engine.
Is it possible to do this per test?
My goal is that each test case should start with an empty in-memory SQLite database.
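One way to do this, as a minimal sketch assuming the initialization above lives in a module importable as myapp.db (the module name is hypothetical): an autouse pytest fixture can re-bind the existing scoped session to a fresh in-memory SQLite engine before each test and dispose of it afterwards, so every test starts with an empty database.

import pytest
from sqlalchemy import create_engine

import myapp.db as db  # hypothetical module containing engine, db_session, Base


@pytest.fixture(autouse=True)
def sqlite_session():
    # Fresh in-memory engine for every test
    test_engine = create_engine("sqlite://")
    db.Base.metadata.create_all(test_engine)
    # Re-bind the existing scoped session to the test engine
    db.db_session.remove()
    db.db_session.configure(bind=test_engine)
    yield db.db_session
    # Tear down: drop the session and the engine so nothing leaks between tests
    db.db_session.remove()
    test_engine.dispose()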

Related

SQL in Spark Structured Streaming

I am exploring Structured Streaming by doing a small POC. Below is the code I have written so far. However, I would like to validate some answers that I could not find in the Spark documentation (I may have missed them).
Validated so far:
Can we process a SQL query dynamically or conditionally? Yes, I could pass the SQL query as an argument and start the execution.
Can SQL queries run in parallel? Yes, as per (How does Structured Streaming execute separate streaming queries (in parallel or sequentially)?)
Need to validate:
What are the limitations of the SQL queries? I found that we cannot perform every type of SQL query that we normally would against a relational database, for example partitioning.
Can execution of a particular SQL query be terminated conditionally?
Can anyone guide me on the limitations I need to consider while generating SQL queries? I know it's a very broad question, but any guidance that could point me in the right direction would be very helpful.
The POC code.
"""
Run the example
`$ bin/spark-submit examples/src/main/python/sql/streaming/structured_kafka_SQL_query.py \
host1:port1,host2:port2 subscribe topic1,topic2`
"""
import sys
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode
from pyspark.sql.functions import split
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType
from pyspark.sql import SQLContext
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType
from pyspark.sql import window as w
if __name__ == "__main__":
if len(sys.argv) != 4:
print("""
Usage: structured_kafka_wordcount.py <bootstrap-servers> <subscribe-type> <topics>
""", file=sys.stderr)
sys.exit(-1)
bootstrapServers = sys.argv[1]
subscribeType = sys.argv[2]
topics = sys.argv[3]
spark = SparkSession\
.builder\
.appName("StructuredKafkaWordCount")\
.getOrCreate()
spark.sparkContext.setLogLevel('WARN')
schema = StructType([
StructField("p1", StringType(), True),
StructField("p2", StringType(), True),
StructField("p3" , StringType(), True)
])
lines = spark\
.readStream\
.format("kafka")\
.option("kafka.bootstrap.servers", bootstrapServers)\
.option(subscribeType, topics)\
.load()\
.select(from_json(col("value").cast("string"), schema).alias("parsed_value"))\
.select(col("parsed_value.*"))
query="select count(*),p1,p2 from events group by p1,p2"
query1="select count(*),p2,p3 from events group by p2,p3"
query_list=[query,query1] # it can be passed as an argument.
lines.createOrReplaceTempView("events")
for q in query_list:
spark.sql(q).writeStream\
.outputMode('complete')\
.format('console')\
.start()
spark.streams.awaitAnyTermination()
Please let me know if my question is still unclear; I can update it accordingly.
Answering one part of my own question:
Can execution of a particular SQL query be terminated conditionally?
Yes, Spark provides the streaming query management API to stop streaming queries:
StreamingQuery.stop()
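A minimal sketch of what that could look like here, assuming each query above is started with .queryName(...) so it can be looked up and stopped by name (the query name used below is hypothetical):

# Start each query with a name, e.g.:
#   spark.sql(query).writeStream.queryName("p1_p2_counts").outputMode('complete').format('console').start()

# Later, stop only the query that matches some condition.
for active_query in spark.streams.active:
    if active_query.name == "p1_p2_counts":  # hypothetical query name
        active_query.stop()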

Suppress output of SQL statements when calling pandas to_sql()

to_sql() is printing every INSERT statement in my Jupyter Notebook, and this makes everything run very slowly for millions of records. How can I decrease the verbosity significantly? I haven't found any verbosity setting for this function. I've tried %%capture as described in "How do you suppress output in IPython Notebook?"; the same method works for another simple test case with print(), but not for to_sql().
from sqlalchemy import create_engine, NVARCHAR
import cx_Oracle

df.to_sql('table_name', engine, if_exists='append',
          schema='schema', index=False, chunksize=10000,
          dtype={col_name: NVARCHAR(length=20) for col_name in df})
Inside create_engine(), set echo=False and all logging will be disabled. More detail here: https://docs.sqlalchemy.org/en/13/core/engines.html#more-on-the-echo-flag.
from sqlalchemy import create_engine, NVARCHAR
import cx_Oracle

host = 'hostname.net'
port = 1521
sid = 'DB01'  # instance is the same as SID
user = 'USER'
password = 'password'

sid = cx_Oracle.makedsn(host, port, sid=sid)
cstr = 'oracle://{user}:{password}@{sid}'.format(
    user=user,
    password=password,
    sid=sid
)

engine = create_engine(
    cstr,
    convert_unicode=False,
    pool_recycle=10,
    pool_size=50,
    echo=False
)
Thanks to @GordThompson for pointing me in the right direction!

Can't add vertex with Python in Neptune workbench

I'm trying to put together a demo of Neptune using Neptune workbench, but something's not working right. I've got this block set up:
from __future__ import print_function  # Python 2/3 compatibility
from gremlin_python import statics
from gremlin_python.structure.graph import Graph
from gremlin_python.process.graph_traversal import __
from gremlin_python.process.strategies import *
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection

graph = Graph()
cluster_url = #my cluster
remoteConn = DriverRemoteConnection(f'wss://{cluster_url}:8182/gremlin', 'g')
g = graph.traversal().withRemote(remoteConn)

import uuid
tmp = uuid.uuid4()
tmp_id = str(id)

def get_id(name):
    uid = uuid.uuid5(uuid.NAMESPACE_DNS, f"{name}.licensing.company.com")
    return str(uid)

def add_sku(name):
    tmp_id = get_id(name)
    g.addV('SKU').property('id', tmp_id, 'name', name)
    return name

def get_values():
    return g.V().properties().toList()
The problem is that calling add_sku doesn't result in a vertex being added to the graph. Doing the same operation in a cell with the %%gremlin cell magic works, and I can retrieve values through Python, but I can't add vertices. Does anyone see what I'm missing here?
The Python code is not working because it is missing a terminal step (next() or iterate()) at the end, which forces the traversal to evaluate. If you add the terminal step it should work:
g.addV('SKU').property('id', tmp_id, 'name', name).next()
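As a follow-up usage sketch, a hypothetical variant of add_sku that sets one key/value per .property() call instead of the multi-argument form, with the terminal step added:

def add_sku(name):
    tmp_id = get_id(name)
    # The terminal next() forces the traversal to execute against the server.
    g.addV('SKU').property('id', tmp_id).property('name', name).next()
    return name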

SQL not working within a Python function - when printed it reflects as a string

I am trying to run SQL code to retrieve data from IBM DB2 within a Python function that pulls data from the SAP GUI and, based on certain criteria, queries IBM DB2. When I print the DB2 connection it works; however, the SQL code is just printed as a string. Note that I have not included the entire SAP login code as it would be very long. When I run the same SQL separately it works fine and retrieves the required data. Any idea why it is treating it as a string and not as a SQL query?
Result of the print statements:
<ibm_db_dbi.Connection object at 0x000002B8DF807588>
Select * from DBA.M82 T82
WHERE T82.EID IN 324809
Code is:
import win32com.client
import sys
import subprocess
import time
import pandas as pd
import numpy as np
from datetime import date
from datetime import datetime, timedelta
from multiprocessing import Process
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.common.exceptions import TimeoutException
from time import sleep
from selenium.webdriver.support.ui import Select
import ibm_db
import ibm_db_dbi as db

def sap_login():
    dsn = "DRIVER={{IBM DB2 ODBC DRIVER}};" + \
          "DATABASE=;" + \
          "HOSTNAME=;" + \
          "PORT=;" + \
          "UID=;" + \
          "PWD=;"
    hdbc = db.connect(dsn, "", "")
    e_id = session.findById("wnd[0]/usr/cntlBCALV_GRID_DEMO_0100_CONT1/shellcont/shell").GetCellValue(i, "ZEMP_CODE")
    sql = """ Select *
    from DBA.M82 T82
    WHERE T82.EID in {}""".format(e_id)
    print(sql)
    fsdd = pd.read_sql(sql, hdbc)

sap_login()
A note: you would need to convert your list to a tuple when using format to pull data for multiple values.
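A minimal sketch of that note, assuming e_id may hold several employee IDs (the values below are hypothetical): format a tuple into the IN (...) clause and read the result into a DataFrame, rather than only printing the SQL text.

ids = [324809, 324810]  # hypothetical employee IDs
sql = """Select *
from DBA.M82 T82
WHERE T82.EID in {}""".format(tuple(ids))  # renders as (324809, 324810)
fsdd = pd.read_sql(sql, hdbc)
print(fsdd.head())  # print the query result, not just the SQL string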

How to connect to AWS Neptune (graph database) with Python?

I'm following this tutorial:
https://docs.aws.amazon.com/neptune/latest/userguide/access-graph-gremlin-python.html
How can I add a node and then retrieve the same node?
from __future__ import print_function # Python 2/3 compatibility
from gremlin_python import statics
from gremlin_python.structure.graph import Graph
from gremlin_python.process.graph_traversal import __
from gremlin_python.process.strategies import *
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection
graph = Graph()
remoteConn = DriverRemoteConnection('wss://your-neptune-endpoint:8182/gremlin','g')
g = graph.traversal().withRemote(remoteConn)
print(g.V().limit(2).toList())
remoteConn.close()
All the above is doing right now is retrieving 2 nodes, right?
If you want to add a vertex and then return the information about the vertex (assuming you did not provide your own ID) you can do something like
newId = g.addV("mylabel").id().next()
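A hedged follow-up for the second part of the question (retrieving the same node), assuming the connection setup above is still open; the label and property used here are hypothetical:

# Add a vertex, set a property, and keep the generated id
new_id = g.addV("mylabel").property("name", "example").id().next()

# Retrieve the same vertex by id, including its label and properties
same_vertex = g.V(new_id).valueMap(True).next()
print(same_vertex)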