Airflow + pandas read_sql_query() with commit - sql

Question
Can I commit a SQL transaction to a DB using read_sql()?
Use Case and Background
I have a use case where I want to allow users to execute some predefined SQL and have a pandas dataframe returned. In some cases, this SQL will need to query a pre-populated table, and in other cases, this SQL will execute a function which will write to a table and then that table will be queried.
This logic is currently contained inside a method in an Airflow DAG in order to leverage database connection information accessible to Airflow via the PostgresHook - the method is eventually called in a PythonOperator task. It's my understanding, through testing, that the PostgresHook creates a psycopg2 connection object.
Code
from airflow.hooks.postgres_hook import PostgresHook
import pandas as pd

def create_df(job_id, other_unrelated_inputs):
    conn = job_type_to_connection(job_type)  # method that helps choose a database
    sql = open('/sql_files/job_id_{}.sql'.format(job_id))  # chooses arbitrary SQL
    sql_template = sql.read()
    hook = PostgresHook(postgres_conn_id=conn)  # connection information for alias is predefined elsewhere within Airflow

    try:
        hook_conn_obj = hook.get_conn()
        print(type(hook_conn_obj))  # <class 'psycopg2.extensions.connection'>
        # Runs SQL template with variables, but does not commit.
        # Alternatively, have used hook.get_pandas_df(sql_template)
        df = pd.io.sql.read_sql(sql_template, con=hook_conn_obj)
    except:
        # catches some errors
        pass
    return df
Problem
Currently, when executing a SQL function, this code generates a dataframe but does not commit any of the DB changes made in that function. To be more precise, if the SQL function INSERTs a row into a table, that transaction will not commit and the row will not appear in the table.
Attempts
I've attempted a few fixes but am stuck. My latest effort was to change the autocommit attribute of the psycopg2 connection that read_sql uses in order to autocommit the transaction.
I'll admit that I haven't been able to figure out when the attributes of the connection have an impact on the execution of the SQL.
I recognize that an alternative path is to replicate some of the logic in PostgresHook.run() to commit and then add some code to push results into a dataframe, but it seems more parsimonious and easier for future support to use the methods already created, if possible.
The most analogous SO question I could find was this one, but I'm interested in an Airflow-independent solution.
EDIT
...
try:
    hook_conn_obj = hook.get_conn()
    print(type(hook_conn_obj))  # <class 'psycopg2.extensions.connection'>
    hook_conn_obj.autocommit = True
    df = pd.io.sql.read_sql(sql_template, con=hook_conn_obj)  # Runs SQL template with variables
except:
    # catches some errors
    pass
return df
This seems to work. If anyone has any commentary or thoughts on a better way to achieve this, I'm still interested in learning from a discussion.
Thank you!

read_sql won't commit because, as the method name implies, its goal is to read data, not write it. That is a good design choice on pandas' part: it prevents accidental writes and enables interesting scenarios like running a procedure and reading its effects without persisting anything. Expressing intent directly is a gold-standard principle.
A more explicit way to express your intent would be to execute (with a commit) explicitly before fetching the results. But because pandas offers no simple way to read from a cursor object, you would lose the convenience provided by read_sql and have to create the DataFrame yourself.
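For illustration, here is a minimal sketch of that explicit route, reusing the hook and sql_template names from the question (everything else is an assumption, not tested against your setup): execute and commit through a psycopg2 cursor, then build the DataFrame by hand from the fetched rows and the cursor description.

hook_conn_obj = hook.get_conn()
with hook_conn_obj.cursor() as cursor:
    cursor.execute(sql_template)                      # runs the SQL that also writes
    rows = cursor.fetchall()                          # read the result set it returns
    columns = [col[0] for col in cursor.description]  # column names from the cursor
hook_conn_obj.commit()                                # persist the writes explicitly

df = pd.DataFrame(rows, columns=columns)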
So, all in all, your solution is fine: by setting autocommit=True you're indicating that your database interactions will persist whatever they do, so there should be no accidents. It's a bit odd to read, but if you named your sql_template variable something like write_then_read_sql, or explained it in a docstring, the intent would be clearer.

I had a similar use case -- load data into SQL Server with Pandas, call a stored procedure that does heavy lifting and writes to tables, then capture the result set into a new DataFrame.
I solved it by using a context manager and explicitly committing the transaction:
import pandas
import sqlalchemy

# Connect to SQL Server
engine = sqlalchemy.create_engine('db_string')

with engine.connect() as connection:
    # Write dataframe to table with replace
    df.to_sql(name='myTable', con=connection, if_exists='replace')

    with connection.begin() as transaction:
        # Execute verification routine and capture results
        df_processed = pandas.read_sql(sql='exec sproc', con=connection)
        transaction.commit()

Related

Why are Pandas and GeoPandas able to read a database table using a DBAPI (psycopg2) connection but have to rely on SQLAlchemy to write one?

Context
I just got into trouble while trying to do some I/O operations on some databases from a Python 3 script.
When I want to connect to a database, I habitually use psycopg2 in order to handle the connections and cursors.
My data are usually stored as Pandas DataFrames and/or GeoPandas's equivalent GeoDataFrames.
Difficulties
In order to read data from a database table:
Using Pandas:
I can rely on its .read_sql() method, which takes a con parameter, as stated in the doc:
con : SQLAlchemy connectable (engine/connection) or database str URI
or DBAPI2 connection (fallback mode)
Using SQLAlchemy makes it possible to use any DB supported by that
library. If a DBAPI2 object, only sqlite3 is supported. The user is responsible
for engine disposal and connection closure for the SQLAlchemy connectable. See
`here <https://docs.sqlalchemy.org/en/13/core/connections.html>`_
Using GeoPandas:
I can rely on its .read_postgis() method, which takes a con parameter, as stated in the doc:
con : DB connection object or SQLAlchemy engine
Active connection to the database to query.
In order to write data to a database table:
Using Pandas:
I can rely on the .to_sql() method, which takes a con parameter, as stated in the doc:
con : sqlalchemy.engine.Engine or sqlite3.Connection
Using SQLAlchemy makes it possible to use any DB supported by that
library. Legacy support is provided for sqlite3.Connection objects. The user
is responsible for engine disposal and connection closure for the SQLAlchemy
connectable See `here <https://docs.sqlalchemy.org/en/13/core/connections.html>`_
Using GeoPandas:
I can rely on its .to_sql() method (which relies directly on the Pandas .to_sql()), which takes a con parameter, as stated in the doc:
con : sqlalchemy.engine.Engine or sqlite3.Connection
Using SQLAlchemy makes it possible to use any DB supported by that
library. Legacy support is provided for sqlite3.Connection objects. The user
is responsible for engine disposal and connection closure for the SQLAlchemy
connectable See `here <https://docs.sqlalchemy.org/en/13/core/connections.html>`_
From this, I easily understand that GeoPandas is built on Pandas, especially for its GeoDataFrame object, which is, in short, a special DataFrame that can handle geographic data.
But I'm wondering why GeoPandas is able to directly take a psycopg2 connection as an argument while Pandas is not, and whether this is planned for the latter.
And why is this the case for neither of them when it comes to writing data?
I would like (as probably many others [1], [2]) to give them a psycopg2 connection argument directly instead of relying on a SQLAlchemy engine.
Because even if this tool is really great, it makes me use two different frameworks to connect to my database and thus handle two different connection strings (and I personally prefer the way psycopg2 handles parameter expansion from a dictionary to build a connection string properly, e.g. psycopg2.connect(**dict_params), vs URL injection, as explained here for example: Is it possible to pass a dictionary into create_engine function in SQLAlchemy?).
Workaround
I was first creating my connection string with psycopg2 from a dictionary of parameters this way:
connParams = ("user={}", "password={}", "host={}", "port={}", "dbname={}")
conn = ' '.join(connParams).format(*dict_params.values())
Then I figured out it was better and more pythonic this way:
conn = psycopg2.connect(**dict_params)
Which I finally replaced with this, so that I can use it interchangeably to build either a psycopg2 connection or a SQLAlchemy engine:
def connector():
    return psycopg2.connect(**dict_params)
a) Initializing a psycopg2 connection is now done by:
conn = connector()
curs = conn.cursor()
b) And initializing a SQLAlchemy engine by:
engine = create_engine('postgresql+psycopg2://', creator=connector)
(or with any of your flavored db+driver)
This is well documented here:
https://docs.sqlalchemy.org/en/13/core/engines.html#custom-dbapi-args
and here:
https://docs.sqlalchemy.org/en/13/core/engines.html#sqlalchemy.create_engine
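For completeness, a hypothetical dict_params (the keys follow psycopg2's connection keyword names; the values are placeholders) and the two objects built from it:

# Hypothetical connection parameters -- adjust to your own setup.
dict_params = {
    "user": "me",
    "password": "secret",
    "host": "localhost",
    "port": 5432,
    "dbname": "mydb",
}

conn = connector()                                                    # psycopg2 connection
engine = create_engine('postgresql+psycopg2://', creator=connector)  # SQLAlchemy engine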
[1] Dataframe to sql without Sql Alchemy engine
[2] How to write data frame to Postgres table without using SQLAlchemy engine?
Probably the main reason why to_sql needs a SQLAlchemy Connectable (Engine or Connection) object is that to_sql needs to be able to create the database table if it does not exist or if it needs to be replaced. Early versions of pandas worked exclusively with DBAPI connections, but I suspect that when they were adding new features to to_sql they found themselves writing a lot of database-specific code to work around the quirks of the various DDL implementations.
On realizing that they were duplicating a lot of logic that was already in SQLAlchemy, they likely decided to "outsource" all of that complexity to SQLAlchemy itself by simply accepting an Engine/Connection object and using SQLAlchemy's (database-independent) SQL Expression language to create the table.
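To make that concrete, here is a small sketch (table and column names are made up) of the database-independent DDL generation pandas gets by delegating to SQLAlchemy: one Table definition compiles to dialect-specific CREATE TABLE statements with no hand-written SQL.

# Sketch: one table definition, two dialect-specific CREATE TABLE statements.
import sqlalchemy as sa
from sqlalchemy.schema import CreateTable
from sqlalchemy.dialects import postgresql, sqlite

metadata = sa.MetaData()
my_table = sa.Table(
    "my_table", metadata,
    sa.Column("id", sa.Integer, primary_key=True),
    sa.Column("name", sa.Text),
)

print(CreateTable(my_table).compile(dialect=postgresql.dialect()))
print(CreateTable(my_table).compile(dialect=sqlite.dialect()))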
it makes me use two different frameworks to connect to my database
No, because .read_sql_query() also accepts a SQLAlchemy Connectable object so you can just use your SQLAlchemy connection for both reading and writing.
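A minimal sketch of that (connection string and table name are placeholders): one engine serves both to_sql() for writing and read_sql_query() for reading, so psycopg2 never has to be handled directly.

import pandas as pd
from sqlalchemy import create_engine

# One engine for both directions; the URL and table name below are assumptions.
engine = create_engine("postgresql+psycopg2://user:pw@localhost/mydb")

df = pd.DataFrame({"id": [1, 2], "name": ["a", "b"]})
df.to_sql("my_table", engine, if_exists="replace", index=False)

df_back = pd.read_sql_query("SELECT * FROM my_table", engine)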

EXASol set a custom session variable

In SQL Server (2016) we have the SESSION_CONTEXT() and sp_set_session_context to retrieve/store custom variables in a key-value store. These values are available only in the session and their lifetime ends when the session is terminated. (Or in earlier versions the good old CONTEXT_INFO to store some data in a varbinary).
I am looking for a similar solution in EXASol (6.0).
An obvious one would be to create a table and store this info there; however, this requires a scheduled cleanup script and is more error-prone than a built-in solution. This is the fallback plan, but I'd like to be sure that there are no other options.
Another option could be to create individual users in the database and configure them, but this was ruled out simply because of the number of users that would have to be added.
The use case is the following: an application has several users, and each user has some values to be used in every query. The application has access only to some views.
This works wonderfully in SQL Server, but we want to test EXASol as an alternative with the same functionality.
I cannot find anything related in the EXASol Manual but it is possible, that I just missed something.
Here is a simplified sample code in SQL Server 2016
sp_set_session_context @key = 'filter', @value = 'asd', @read_only = 1;
CREATE VIEW FilteredMyTable AS
SELECT Col1, Col2, Col3 FROM MyTable
WHERE MyFilterCol = CONVERT(VARCHAR(32), SESSION_CONTEXT('filter'))
I've tried an obviously no-go solution, just to test if it works (it does not).
ALTER SESSION SET X_MY_CUSTOM_FILTER = "asd"
You cannot really set a session parameter in EXASOL; the only way to achieve something similar is to store the values that you need in a table with a structure like:
SESSION_ID   KEY      VALUE   READ_ONLY
8347387      filter   asd     1
With Lua you could create a script that makes it easier for you to manage these "session" variables.
I think you can achieve what you need by using the scripting capabilities within Exasol - see section 3.5 in the user manual.
You could also handle the parameterisation 'externally' via a shell script.

Convert sqlalchemy ORM query object to sql query for Pandas DataFrame

This question feels fiendishly simple but I haven't been able to find an answer.
I have an ORM query object, say
query_obj = session.query(Class1).join(Class2).filter(Class2.attr == 'state')
I can read it into a dataframe like so:
testdf = pd.read_sql(query_obj.statement, query_obj.session.bind)
But what I really want to do is use a traditional SQL query instead of the ORM:
with engine.connect() as connection:
    # Execute the query against the database
    results = connection.execute(query_obj)
    # Fetch all the results of the query
    fetchall = results.fetchall()
    # Build a DataFrame with the results
    dataframe = pd.DataFrame(fetchall)
Where query is a traditional SQL string. When I run this, I get an error along the lines of "query_obj is not executable". Does anyone know how to convert the ORM query to a traditional query? Also, how does one get the column names in after building the dataframe?
Context why I'm doing this: I've set up an ORM layer on top of my database and am using it to query data into a Pandas DataFrame. It works, but it's frequently maxing out my memory. I want to cut my in-memory overhead with some string folding (pass 3 outlined here: http://www.mobify.com/blog/sqlalchemy-memory-magic/). That requires (and correct me if I'm wrong here) not using the read_sql string and instead processing the query's return as raw tuples.
The long version is described in detail in the FAQ of sqlalchemy: http://sqlalchemy.readthedocs.org/en/latest/faq/sqlexpressions.html#how-do-i-render-sql-expressions-as-strings-possibly-with-bound-parameters-inlined
The short version is:
statement = query.statement
print(statement.compile(engine))
The result of this can be used in read_sql.
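If the statement contains bound parameters and you want them rendered inline in the string (the FAQ linked above covers the caveats), compile can be asked to do that as well:

# Render bound parameters inline in the emitted SQL (see the FAQ for caveats).
print(statement.compile(compile_kwargs={"literal_binds": True}))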
This may be from a later version of SQLAlchemy than was available when this was posted, but
print(query)
outputs the query, which you can copy and paste back into your script.
Fiendishly simple indeed. Per Jori's link to the docs, it's just query_obj.statement to get the SQL query. So my code is:
with engine.connect() as connection:
    # Execute the query against the database
    results = connection.execute(query_obj.statement)
    # Fetch all the results of the query
    fetchall = results.fetchall()
    # Build a DataFrame with the results
    dataframe = pd.DataFrame(fetchall)
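To also answer the column question: the result object already knows the column names, so (as a small, untested extension of the snippet above) they can be handed to the DataFrame directly.

# Name the DataFrame columns from the result set's keys.
dataframe = pd.DataFrame(fetchall, columns=list(results.keys()))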

Invoking a large set of SQL from a Rails 4 application

I have a Rails 4 application that I use in conjunction with sidekiq to run asynchronous jobs. One of the jobs I normally run outside of my Rails application is a large set of complex SQL queries that cannot really be modeled by ActiveRecord. The connection this set of SQL queries has with my Rails app is that it should be executed anytime one of my controller actions is invoked.
Ideally, I'd queue a job from my Rails application within the controller for Sidekiq to go ahead and run the queries. Right now they're stored in an external file, and I'm not entirely sure what the best way is to have Rails run said SQL.
Any solutions are appreciated.
I agree with Sharagoz: if you just need to run a specific query, the best way is to send the query string directly to the connection, like:
ActiveRecord::Base.connection.execute(File.read("myquery.sql"))
If the query is not static and you have to compose it, I would use Arel, it's already present in Rails 4.x:
https://github.com/rails/arel
You didn't say what database you are using, so I'm going to assume MySQL.
You could shell out to the mysql binary to do the work:
result = `mysql -u #{user} --password=#{password} #{database} < #{huge_sql_filename}`
Or use ActiveRecord::Base.connection.execute(File.read("huge.sql")), but it won't work out of the box if you have multiple SQL statements in your SQL file.
In order to run multiple statements you will need to create an initializer that monkey patches the ActiveRecord::Base.mysql2_connection to allow setting MySQL's CLIENT_MULTI_STATEMENTS and CLIENT_MULTI_RESULTS flags.
Create a new initializer config/initializers/mysql2.rb
module ActiveRecord
  class Base
    # Overriding ActiveRecord::Base.mysql2_connection
    # method to allow passing options from database.yml
    #
    # Example of database.yml
    #
    #   login: &login
    #     socket: /tmp/mysql.sock
    #     adapter: mysql2
    #     host: localhost
    #     encoding: utf8
    #     flags: 131072
    #
    # @param [Hash] config hash that you define in your
    #   database.yml
    # @return [Mysql2Adapter] new MySQL adapter object
    #
    def self.mysql2_connection(config)
      config[:username] = 'root' if config[:username].nil?

      if Mysql2::Client.const_defined? :FOUND_ROWS
        config[:flags] = config[:flags] ? config[:flags] | Mysql2::Client::FOUND_ROWS : Mysql2::Client::FOUND_ROWS
      end

      client = Mysql2::Client.new(config.symbolize_keys)
      options = [config[:host], config[:username], config[:password], config[:database], config[:port], config[:socket], 0]
      ConnectionAdapters::Mysql2Adapter.new(client, logger, options, config)
    end
  end
end
Then update config/database.yml to add flags:
development:
  adapter: mysql2
  database: app_development
  username: user
  password: password
  flags: <%= 65536 | 131072 %>
I just tested this on Rails 4.1 and it works great.
Source: http://www.spectator.in/2011/03/12/rails2-mysql2-and-stored-procedures/
Executing one query is - as outlined by other people - quite simply done through
ActiveRecord::Base.connection.execute("SELECT COUNT(*) FROM users")
You are talking about a 20,000-line SQL script of multiple queries. Assuming you have the file somewhat under control, you can extract the individual queries from it.
script = Rails.root.join("lib").join("script.sql").read # ah, Pathnames

# this needs to match the delimiter of your queries
STATEMENT_SEPARATOR = ";\n\n"

ActiveRecord::Base.transaction do
  script.split(STATEMENT_SEPARATOR).each do |stmt|
    ActiveRecord::Base.connection.execute(stmt)
  end
end
If you're lucky, the query delimiter is ";\n\n", but this of course depends on your script. In another example we had "\x0" as the delimiter. The point is that you split the script into individual queries and send them to the database. I wrapped it in a transaction to let the database know that more than one statement is coming. The block commits when no exception is raised while sending the script's queries.
If you do not have the script file under control, talk to those who control it to get a reliable delimiter. If it's not under your control and you cannot talk to whoever controls it, you probably shouldn't be executing it, I guess :-).
UPDATE
This is a generic way to solve this. For PostgreSQL, you don't need to split the statements manually. You can just send them all at once via execute. For MySQL, there seem to be solutions to get the adapter into a CLIENT_MULTI_STATEMENTS mode.
If you want to execute raw SQL through active record you can use this API:
ActiveRecord::Base.connection.execute("SELECT COUNT(*) FROM users")
If you are running big SQL queries every time, I suggest you create a SQL view for them. It will boost execution time. The other thing is, if possible, try to split all those SQL queries in such a way that they can be executed in parallel instead of sequentially, and then push them to the Sidekiq queue.
You have to use ActiveRecord::Base.connection.execute or ModelClass.find_by_sql to run custom SQL.
Also, keep an eye on ROLLBACK transactions; you will find many places where you don't need the ROLLBACK feature. If you avoid it, the queries will run faster, but it is dangerous.
That's all I can suggest.
Use available database tools to handle the complex queries, such as views, stored procedures, etc., and call them as other people have already suggested (ActiveRecord::Base.connection.execute and ModelClass.find_by_sql, for example) - it might very well cut down significantly on query preparation time in the DB and make your code easier to handle.
http://dev.mysql.com/doc/refman/5.0/en/create-view.html
http://dev.mysql.com/doc/connector-cpp/en/connector-cpp-tutorials-stored-routines-statements.html
Abstract your query input parameters into a hash so you can pass it on to Sidekiq; don't send SQL strings, as this will probably degrade performance (due to query preparation time) and make your life more complicated due to funny SQL driver parsing bugs.
Run your complex queries in a dedicated named queue and set the concurrency to a value that will prevent your database from getting overwhelmed by the queries, as they smell like they could be pretty DB-heavy.
https://github.com/mperham/sidekiq/wiki/API
https://github.com/mperham/sidekiq/wiki/Advanced-Options
Have a look at Squeel; it's a great addition to AR, and it might be able to pull off some of the things you are doing.
https://github.com/activerecord-hackery/squeel
http://railscasts.com/episodes/354-squeel
I'll assume you use MySQL for now, but your mileage will vary depending on the DB type that you use. For example, Oracle has some good gems for handling stored procedures, views etc, for example https://github.com/rsim/ruby-plsql
Let me know if some of this stuff doesn't fit your use case and I'll expand
I see this post is kind of old, but I would like to add my solution to it. I was in a similar situation; I also needed a way to force-feed "PRAGMA foreign_keys = on;" into my SQLite connection (I could not find a previous post that spelled out how to do it). Anyhow, this worked like a charm for me. It allowed me to write "pretty" SQL and still get it executed. Blank lines are ignored by the unless condition.
conn = ActiveRecord::Base.establish_connection(adapter: 'sqlite3', database: DB_NAME)

sqls = File.read(DDL_NAME).split(';')
sqls.each { |sql| conn.connection.execute(sql << ';') unless sql.strip.size == 0 }

conn.connection.execute('PRAGMA foreign_keys = on;')
I had the same problem with a set of SQL statements that I needed to execute all in one call to the server. What worked for me was to set up an initializer for the Mysql2 adapter (as explained in infused's answer), but also to do some extra work to process multiple results. A direct call to ActiveRecord::Base.connection.execute would only retrieve the first result and raise an Internal Error.
My solution was to get the Mysql2 adapter and work directly with it:
client = ActiveRecord::Base.connection.raw_connection
Then, as explained here, execute the query and loop through the results:
client.query(multiple_stms_query)
while client.next_result
  result = client.store_result
  # do something with it ...
end

DAL without web2py

I am using web2py to power my web site. I decided to use the web2py DAL for a long running program that runs behind the site.
This program does not seem to update its data or the database (sometimes).
from gluon.sql import *
from gluon.sql import SQLDB
from locdb import *
# contains
#   db = SQLDB("mysql://user/pw@localhost/mydb", pool_size=10)
#   db.define_table('orders', Field('status', 'integer'), Field('item', 'string'),
#                   migrate='orders.table')

orderid = 20  # there is a row with id == 20 in table orders

# when I do
db(db.orders.id == orderid).update(status=6703)
db.commit()
It does not update the database, and a select on orders with this id shows the correct data. In some circumstances a "db.rollback()" after a commit seems to help.
Very strange, to say the least. Have you seen this? More importantly, do you know the solution?
UPDATE:
Correction:
The select in question is done within the program, not outside it.
Sometimes, when doing a series of updates, some will work and be visible outside the program and some will not. Also, some queries will return the data they originally returned even though the data has changed in the DB since the original query.
I am tempted to dump this approach and move to another method, any suggestions?
This problem has been resolved:
MySQL runs at isolation level REPEATABLE READ (that is, once a transaction starts, the data reflected in the select output will not change until the transaction ends). Changing the isolation level to READ COMMITTED resolved the issue. By the way, READ COMMITTED is the isolation level at which Oracle and MSSQL run by default.
This can be set in the my.cnf. Details in the link below:
http://dev.mysql.com/doc/refman/5.1/en/set-transaction.html
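If editing my.cnf globally is not an option, the same setting can presumably be applied per connection from the web2py side; a hedged sketch using the DAL's executesql() with standard MySQL syntax, run right after the connection is created:

# Switch this connection's isolation level instead of editing my.cnf.
db.executesql("SET SESSION TRANSACTION ISOLATION LEVEL READ COMMITTED;")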