Convert SQLAlchemy ORM query object to SQL query for Pandas DataFrame

This question feels fiendishly simple but I haven't been able to find an answer.
I have an ORM query object, say
query_obj = session.query(Class1).join(Class2).filter(Class2.attr == 'state')
I can read it into a dataframe like so:
testdf = pd.read_sql(query_obj.statement, query_obj.session.bind)
But what I really want to do is use a traditional SQL query instead of the ORM:
with engine.connect() as connection:
    # Execute the query against the database
    results = connection.execute(query_obj)
    # Fetch all the results of the query
    fetchall = results.fetchall()
    # Build a DataFrame with the results
    dataframe = pd.DataFrame(fetchall)
Where query would be a traditional SQL string. When I run the code above I get an error along the lines of "query_obj is not executable". Does anyone know how to convert the ORM query to a traditional SQL query? Also, how does one get the column names into the DataFrame afterwards?
Context for why I'm doing this: I've set up an ORM layer on top of my database and am using it to query data into a Pandas DataFrame. It works, but it frequently maxes out my memory. I want to cut my in-memory overhead with some string folding (pass 3 outlined here: http://www.mobify.com/blog/sqlalchemy-memory-magic/). That requires (and correct me if I'm wrong here) not using read_sql and instead processing the query's results as raw tuples.
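For reference, the string-folding idea boils down to a generator along these lines (a hypothetical helper sketched from the linked post's description, not code from it): keep one shared copy of each distinct string and reuse it across rows before the DataFrame is built.
def fold_strings(rows):
    # Reuse a single shared object for each distinct string value so that
    # repeated strings across rows don't each occupy their own memory.
    seen = {}
    for row in rows:
        yield tuple(seen.setdefault(v, v) if isinstance(v, str) else v for v in row)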

The long version is described in detail in the SQLAlchemy FAQ: http://sqlalchemy.readthedocs.org/en/latest/faq/sqlexpressions.html#how-do-i-render-sql-expressions-as-strings-possibly-with-bound-parameters-inlined
The short version is:
statement = query.statement
print(statement.compile(engine))
The result of this can be used in read_sql.
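For example, a minimal sketch assuming the query_obj and engine from the question, with the bound parameters inlined as the FAQ describes:
import pandas as pd

statement = query_obj.statement
# Render the SELECT against the engine's dialect; literal_binds inlines the parameter values
sql_string = str(statement.compile(engine, compile_kwargs={"literal_binds": True}))
print(sql_string)

testdf = pd.read_sql(sql_string, engine)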

This may be due to a later version of SQLAlchemy than when the question was posted, but simply
print(query)
outputs the SQL, which you can copy and paste back into your script.

Fiendishly simple indeed. Per Jori's link to the docs, it's just query_obj.statement to get the SQL query. So my code is:
with engine.connect() as connection:
    # Execute the query against the database
    results = connection.execute(query_obj.statement)
    # Fetch all the results of the query
    fetchall = results.fetchall()
    # Build a DataFrame with the results
    dataframe = pd.DataFrame(fetchall)
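As for getting the column names in afterwards, one sketch (assuming the same query_obj and engine) is to read them off the result object with keys():
with engine.connect() as connection:
    results = connection.execute(query_obj.statement)
    # keys() exposes the column names of the result set
    dataframe = pd.DataFrame(results.fetchall(), columns=list(results.keys()))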

Related

Django: how to view raw sql response upon a query

In Django, if I want to see the raw SQL in debug mode, I can check it in the Django shell:
from django.db import connections
User.objects.all()
print(connections['default'].queries[-1]['sql'])
Similarly, can we see the raw response of that SQL? In the above case, the SQL query may return the raw results in CSV or tab-delimited format, and from those Django may then build the array of model objects.
There are better tools for debugging SQL queries in Django. The standard tool is django-debug-toolbar, which lets you view the SQL query, its result, and the EXPLAIN output for every query in a request/response cycle, along with the time each query took. The documentation is available at https://django-debug-toolbar.readthedocs.io/en/latest/
You can do:
str(User.objects.all().query)
# or
print(User.objects.all().query)
# or
print(User.objects.all().query.sql_with_params())
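As for the raw response itself, one option (a sketch, not the only approach) is to run the rendered SQL through Django's low-level database cursor and inspect the rows that come back as plain tuples:
from django.db import connection

raw_sql = str(User.objects.all().query)
with connection.cursor() as cursor:
    cursor.execute(raw_sql)
    print([col[0] for col in cursor.description])  # column names
    print(cursor.fetchall())                       # raw rows as tuples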

Spark Dataframe from SQL Query

I'm attempting to use Apache Spark to load the results of a (large) SQL query with multiple joins and sub-selects into a Spark DataFrame, as discussed in Create Spark Dataframe from SQL Query.
Unfortunately, my attempts to do so result in an error from Parquet:
Exception in thread "main" org.apache.spark.sql.AnalysisException:
Unable to infer schema for Parquet. It must be specified manually.
I have seen information from Google implying that this error occurs when a DataFrame is empty. However, the same query returns plenty of rows in DBeaver.
Here is an example query:
(SELECT REPORT_DATE, JOB_CODE, DEPT_NBR, QTY
FROM DBO.TEMP
WHERE BUSINESS_DATE = '2019-06-18'
AND STORE_NBR IN (999)
ORDER BY BUSINESS_DATE) as reports
My Spark code looks like this.
val reportsDataFrame = spark
  .read
  .option("url", db2JdbcUrl)
  .option("dbtable", queries.reports)
  .load()

reportsDataFrame.show(10)
I read in the previous answer that it is possible to run queries against an entire database using this method, in particular by specifying the "dbtable" parameter to be an aliased query when you first build your DataFrame in Spark. You can see I've done this by aliasing the entire query "as reports".
I don't believe this to be a duplicate question. I've extensively researched this specific problem and have not found anyone facing the same issue online. In particular, the Parquet error resulting from running the query.
It seems the consensus is that one should not run SQL queries this way and should instead use the many methods Spark DataFrames offer to filter, group, and aggregate data. However, it would be very valuable for us to be able to use raw SQL, even if it incurs a performance penalty.
A quick look at your code tells me you are missing .format("jdbc"):
val reportsDataFrame = spark
  .read
  .format("jdbc")
  .option("url", db2JdbcUrl)
  .option("dbtable", queries.reports)
  .load()
This should work provided you have username and password set to connect to the database.
A good resource for learning more about JDBC sources in Spark is https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html
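For reference, the same read in PySpark with the credentials passed explicitly (the option values below are placeholders, not taken from the question):
reports_df = (
    spark.read
    .format("jdbc")
    .option("url", db2_jdbc_url)                   # placeholder JDBC URL variable
    .option("dbtable", "(SELECT ...) AS reports")  # aliased subquery, as in the question
    .option("user", "db_user")                     # placeholder credentials
    .option("password", "db_password")
    .load()
)
reports_df.show(10)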

Airflow + pandas read_sql_query() with commit

Question
Can I commit a SQL transaction to a DB using read_sql()?
Use Case and Background
I have a use case where I want to allow users to execute some predefined SQL and have a pandas dataframe returned. In some cases, this SQL will need to query a pre-populated table, and in other cases, this SQL will execute a function which will write to a table and then that table will be queried.
This logic is currently contained inside a method in an Airflow DAG in order to leverage database connection information accessible to Airflow via the PostgresHook; the method is eventually called in a PythonOperator task. It's my understanding from testing that the PostgresHook creates a psycopg2 connection object.
Code
from airflow.hooks.postgres_hook import PostgresHook
import pandas as pd
def create_df(job_id, other_unrelated_inputs):
    conn = job_type_to_connection(job_type)  # method that helps choose a database
    sql = open('/sql_files/job_id_{}.sql'.format(job_id))  # chooses arbitrary SQL
    sql_template = sql.read()
    hook = PostgresHook(postgres_conn_id=conn)  # connection information for alias is predefined elsewhere within Airflow

    try:
        hook_conn_obj = hook.get_conn()
        print(type(hook_conn_obj))  # <class 'psycopg2.extensions.connection'>
        # Runs SQL template with variables, but does not commit. Alternatively, have used hook.get_pandas_df(sql_template)
        df = pd.io.sql.read_sql(sql_template, con=hook_conn_obj)
    except:
        pass  # catches some errors

    return df
Problem
Currently, when executing a SQL function, this code generates a dataframe, but does not commit any of the DB changes made in the SQL function. For example, to be more precise, if the SQL function INSERTs a row into a table, that transaction will not commit and the row will not appear in the table.
Attempts
I've attempted a few fixes but am stuck. My latest effort was to change the autocommit attribute of the psycopg2 connection that read_sql uses in order to autocommit the transaction.
I'll admit that I haven't been able to figure out when the attributes of the connection have an impact on the execution of the SQL.
I recognize that an alternative path is to replicate some of the logic in PostgresHook.run() to commit and then add some code to push results into a dataframe, but it seems more parsimonious and easier for future support to use the methods already created, if possible.
The most analogous SO question I could find was this one, but I'm interested in an Airflow-independent solution.
EDIT
...
try:
    hook_conn_obj = hook.get_conn()
    print(type(hook_conn_obj))  # <class 'psycopg2.extensions.connection'>
    hook_conn_obj.autocommit = True
    df = pd.io.sql.read_sql(sql_template, con=hook_conn_obj)  # Runs SQL template with variables
except:
    pass  # catches some errors

return df
This seems to work. If anyone has any commentary or thoughts on a better way to achieve this, I'm still interested in learning from a discussion.
Thank you!
read_sql won't commit because, as the method name implies, the goal is to read data, not write it. That's a good design choice on pandas' part: it prevents accidental writes and allows interesting scenarios like running a procedure and reading its effects without persisting anything. read_sql's intent is to read, not to write, and expressing intent directly is a gold-standard principle.
A more explicit way to express your intent would be to execute (and commit) explicitly before fetchall. But because pandas offers no simple way to read from a cursor object, you would lose the convenience provided by read_sql and have to create the DataFrame yourself.
So, all in all, your solution is fine: by setting autocommit = True you're indicating that your database interactions will persist whatever they do, so there should be no accidents. It's a bit odd to read, but if you named your sql_template variable something like write_then_read_sql, or explained it in a docstring, the intent would be clearer.
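For completeness, the explicit execute-then-fetch route could look roughly like this (a sketch using the psycopg2 connection from the hook; the cursor handling is an assumption, not tested code):
import pandas as pd

conn = hook.get_conn()
with conn.cursor() as cursor:
    cursor.execute(sql_template)  # runs the SQL, including any writes it performs
    columns = [col[0] for col in cursor.description]
    df = pd.DataFrame(cursor.fetchall(), columns=columns)
conn.commit()  # explicitly persist whatever the SQL wrote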
I had a similar use case -- load data into SQL Server with Pandas, call a stored procedure that does heavy lifting and writes to tables, then capture the result set into a new DataFrame.
I solved it by using a context manager and explicitly committing the transaction:
import pandas
import sqlalchemy

# Connect to SQL Server
engine = sqlalchemy.create_engine('db_string')
with engine.connect() as connection:
    # Write dataframe to table with replace
    df.to_sql(name='myTable', con=connection, if_exists='replace')
    with connection.begin() as transaction:
        # Execute verification routine and capture results
        df_processed = pandas.read_sql(sql='exec sproc', con=connection)
        transaction.commit()

DBI/Spark: how to store the result in a Spark Dataframe?

I am using sparklyr to run some analysis, but I am also interested in writing raw SQL queries using DBI.
I am able to run the following query
query <- "SELECT col1, FROM mydata WHERE some_condition"
dataframe <- dbGetQuery(spark_connection, query)
but this returns the data into R (in a dataframe).
What I want instead is keep the data inside Spark and store it in another Spark Dataframe for further interaction with sparklyr.
Any ideas?
The issue with using DBI is memory. You won't be able to fetch a huge amount of data with it: if your query returns a huge amount of data, it will overwhelm Spark's driver memory and cause out-of-memory errors.
What's happening with sparklyr is the following: DBI runs the SQL command and returns an R data frame, which means it collects the data to materialize it in a regular R context.
Thus, if you only want to return a small dataset, you don't need Spark for that.
If that's not your case, DBI isn't the solution for you; you ought to use regular SparkR if you want to stick with R.
This is an example of how you can use sql in sparklyr:
sc %>% spark_session %>%
  invoke("sql", "SELECT 1") %>%
  invoke("createTempView", "foo")
You may also do:
mydata_spark_df <- tbl(sc, sql("select * from mydata"))

When did sqlalchemy execute the query?

As I've only recently started learning to use SQLAlchemy, the result of the following code makes me confused about when SQLAlchemy executes the query:
query = db.session.query(MyTable)
query = query.filter(...)
query = query.limit(...)
query = query.offset(...)
records = query  # records = query.all()
for r in records:
    # do something
Note the line
records = query  # records = query.all()
It seems to give the same correct result (stored in the variable "records") whether I use "query" or "query.all()", so I wonder when the query was executed.
If it is executed during the first line, "db.session.query(MyTable)", the result set may be large at that point; if during the fifth line, "records = query", how could that happen when there's no function call at all?
In your example, the query gets executed upon for r in records. Accessing the query object via its iterator triggers the execution (normally, only then will it be compiled into a SELECT statement).
Up until that point, the query is only being built up (via filter, limit, etc.).
Please also read the ORM tutorial on querying.
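A quick way to see this for yourself (a sketch assuming the same MyTable model; the filter column is made up):
query = db.session.query(MyTable).filter(MyTable.id > 10).limit(5)

# Nothing has hit the database yet; the Query object only holds the criteria.
print(str(query))     # renders the SELECT that would be emitted, without executing it

records = query       # still no execution
rows = list(records)  # iterating finally emits the SELECT
rows = query.all()    # .all() does the same and returns a list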