Is there a way to query a postgresql database using pandas syntax? - pandas

Is there some sort of adaptor that allows querying a postgresql database like it was a pandas dataframe?

Update (16th March 2016)
It is possible, but you would need a compiler that evaluates your pandas-style query and translates it into SQL clauses.
The fact that SQL is a higher-level language, and that a DBMS interprets SQL clauses with regard to not only the query but also the data and its distribution, makes this really hard to do performantly.
Wes McKinney is trying to do this with the Ibis project and has a nice write-up about some of the challenges.
Previous post
Unfortunately that's not possible, because SQL is a higher-level language than Python.
With pandas you specify what and how you want to do something, whereas with SQL you only specify what you want. The SQL server is then free to decide how to serve your query. When you add an index to a table, the SQL server can then use that index to serve your query faster without you rewriting your query.
If you instructed your database how you want it to execute your query, then you would also need to rewrite your SQL statements if you wanted them to use an index.
That being said, I commonly use the pattern in neurite's answer for analysis, using SQL to perform the initial aggregation (and reduce the size of the data) and then performing further operations in pandas.
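A minimal sketch of that pattern, assuming a hypothetical events table and connection string: the database does the heavy aggregation, pandas does the lighter follow-up work.

import pandas as pd
from sqlalchemy import create_engine

# Hypothetical connection string and table -- adjust to your setup.
engine = create_engine("postgresql://user:password@localhost/mydb")

# Let the database do the heavy aggregation...
daily = pd.read_sql("SELECT date, COUNT(*) AS n FROM events GROUP BY date", con=engine)

# ...then do the lighter, exploratory work in pandas.
daily = daily.sort_values("date")
daily["rolling_mean"] = daily["n"].rolling(7).mean()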

Not sure if this is exactly what you want but you can load postgres tables into pandas and manipulate them from there.
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql.html
http://docs.sqlalchemy.org/en/latest/dialects/postgresql.html
Shamelessly stolen from the pages referenced above:
import pandas
from sqlalchemy import create_engine

engine = create_engine(
    'postgresql+pg8000://scott:tiger@localhost/test',  # note '@', not '#', before the host
    isolation_level='READ UNCOMMITTED'
)
df = pandas.read_sql('SELECT * FROM <TABLE>;', con=engine)

Related

pyspark using sql queries and doing group by optimisation

In Spark one can write SQL queries as well as use the Spark API functions. reduceByKey should be preferred over groupByKey as it avoids unnecessary shuffling.
I would like to know: when you run SQL queries by registering the DataFrame, how can we use reduceByKey? In SQL there is only GROUP BY, no reduce-by. Does it internally optimise this to use something like reduceByKey rather than a plain groupByKey?
I got it. I ran an explain to look at the physical plan: it first executes a partial_sum and only after that the final sum, which implies it has first performed a sum within the executors and then shuffled the partial results across.
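For reference, a small PySpark sketch (the data and names are made up) of that check: register a DataFrame, run the GROUP BY through SQL, and look for the partial aggregation step in the physical plan.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("a", 1), ("a", 2), ("b", 3)], ["key", "value"])
df.createOrReplaceTempView("t")

# The physical plan shows a partial aggregation (partial_sum inside a
# HashAggregate) before the exchange and a final aggregation after it,
# i.e. the map-side combine that reduceByKey gives you on RDDs.
spark.sql("SELECT key, SUM(value) FROM t GROUP BY key").explain()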

SparkSQL Query performance improvement by CLUSTER By

I am new to SparkSQL and I primarily work with writing SparkSQL queries. We often need to JOIN big tables in the queries, and it did not take long to face performance issues with them (e.g. joins, aggregates, etc.).
While searching for remedies online, I recently came across the terms - COALESCE(), REPARTITION(), DISTRIBUTE BY, CLUSTER BY etc and the fact that they could probably be used for enhancing performance of slow running SparkSQL queries.
Unfortunately, I could not find enough examples around for me to understand them clearly and start applying them to my queries. I am primarily looking for examples explaining their syntax, hints and usage scenarios.
Can anyone please help me out here and provide SparkSQL query examples of their usage and when to use them? E.g.
syntax
hint syntax
tips
scenarios
Note: I only have access to writing SparkSQL Queries but don't have access to PySpark-SQL.
Any help is much appreciated.
Thanks
coalesce
coalesce(expr1, expr2, ...) - Returns the first non-null argument if exists. Otherwise, null.
Examples:
SELECT coalesce(NULL, 1, NULL);
1
Since: 1.0.0
Distribute By and REPARTITION
Repartitions a DataFrame by the given expressions. The number of partitions is equal to spark.sql.shuffle.partitions. Note that in Spark, when a DataFrame is partitioned by some expression, all the rows for which this expression is equal are on the same partition (but not necessarily vice-versa)!
This is how it looks in practice. Let’s say we have a DataFrame with two columns: key and value.
SET spark.sql.shuffle.partitions = 2
SELECT * FROM df DISTRIBUTE BY key
Equivalent in DataFrame API:
df.repartition(2, $"key")
Cluster By
This is just a shortcut for using distribute by and sort by together on the same set of expressions.
In SQL:
SET spark.sql.shuffle.partitions = 2
SELECT * FROM df CLUSTER BY key
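A rough DataFrame-API equivalent of the CLUSTER BY example, sketched in PySpark with made-up data (CLUSTER BY key = DISTRIBUTE BY key + SORT BY key):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("a", 1), ("b", 2), ("a", 3)], ["key", "value"])

# DISTRIBUTE BY key + SORT BY key, with 2 shuffle partitions as in the SQL above
clustered = df.repartition(2, "key").sortWithinPartitions("key")
clustered.explain()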

sort_values vs order by and when and why should I use which

This is mainly in the scope of Jupyter notebooks and queries in pandas (I'm very new to both). I noticed that when I write a query where I need the dataframe in a certain order, I do:
df = pd.read_sql("select date, count(*) as count from " + tableName + " group by date", conn).sort_values(['date'], ascending=False)
My friends, who are much more experienced than me, do:
df = pd.read_sql("select date, count(*) as count from " + tableName + " group by date order by date", conn)
The results are the same, but I couldn't get an answer about why/when I would use ORDER BY over sort_values.
Like you said, both achieve the same output. The difference is in where the sorting operation takes place. In the first case, sort_values() is a pandas function which you've chained onto the read_sql() call. This means your Python process executes the sort after it retrieves the data over the database connection. This is equivalent to doing something like:
df = pd.read_sql("select date, count(*) as count from " + tableName + " group by date", conn)
df = df.sort_values(by='date', ascending=False)  # sorting done in Python, not by the database
The second method performs the sorting in the database, so the python environment doesn't sort anything. The key here is to remember that you're basically composing a SQL statement and running it using Python pandas.
Whether you should put the burden of sorting on the database or on the machine running the Python environment depends. If this is a very busy production database, you might not want to run expensive sorting operations on it but simply retrieve the data and perform all operations locally in pandas. Alternatively, if the database is for casual or non-critical use, it makes sense to sort the results in the database before loading the data into pandas.
Update:
To reinforce the notion that SQL-engine-driven (server-side, or db-driven) sorting isn't necessarily always the optimal thing to do, please read this article, which has some interesting profiling stats and common scenarios of when to load the db with data manipulation operations vs. when to do it "locally".
I can think of a few reasons here:
Performance
Many, many hours of effort have gone into tuning the code that runs SQL commands. SQL is fast, and I'm willing to bet it is going to be faster to sort with the SQL engine than with pandas.
Maintainability
If, for example, you determine that you do not need the result sorted tomorrow, then you can simply change the query string without having to change your code. This is particularly useful if you are passing the query to some function which runs it for you.
Aesthetics
As a programmer with good design sense, surely the second method should appeal to you: the query logic stays in one place. Splitting it up, with retrieval in SQL and sorting in pandas, is a recipe for bad design.

Spark sql queries vs dataframe functions

To get good performance with Spark, I'm wondering whether it is better to run SQL queries via SQLContext or to express queries via DataFrame functions like df.select().
Any idea? :)
There is no performance difference whatsoever. Both methods use exactly the same execution engine and internal data structures. At the end of the day, it all boils down to personal preference.
Arguably DataFrame queries are much easier to construct programmatically and provide a minimal type safety.
Plain SQL queries can be significantly more concise and easier to understand. They are also portable and can be used without any modifications with every supported language. With HiveContext, these can also be used to expose some functionalities which can be inaccessible in other ways (for example UDF without Spark wrappers).
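A quick way to convince yourself of this equivalence (sketched with made-up data): express the same query both ways and compare the plans that explain() prints.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("a", 1), ("a", 2), ("b", 3)], ["key", "value"])
df.createOrReplaceTempView("t")

# The optimized and physical plans should come out the same for both forms.
spark.sql("SELECT key, SUM(value) AS total FROM t GROUP BY key").explain(True)
df.groupBy("key").agg(F.sum("value").alias("total")).explain(True)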
Ideally, Spark's Catalyst optimizer should compile both calls to the same execution plan, so the performance should be the same; which one you call is just a matter of style.
In reality, there is a difference according to a report by Hortonworks (https://community.hortonworks.com/articles/42027/rdd-vs-dataframe-vs-sparksql.html), where SQL outperforms DataFrames for a case where you need grouped records with their total counts, sorted descending by record name.
By using DataFrame, one can break the SQL into multiple statements/queries, which helps in debugging, easy enhancements and code maintenance.
Breaking complex SQL queries into simpler queries and assigning the result to a DF brings better understanding.
By splitting a query into multiple DFs, developers gain the advantage of using cache and repartition (to distribute data evenly across the partitions using a unique or close-to-unique key).
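A small illustration of that splitting style, with hypothetical tables and columns: the intermediate DataFrame is cached because it feeds two downstream queries.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
orders = spark.createDataFrame([(1, "a", 10.0), (2, "b", 5.0), (3, "a", 7.5)],
                               ["order_id", "customer", "amount"])
customers = spark.createDataFrame([("a", "US"), ("b", "DE")], ["customer", "country"])

# Step 1: aggregate per customer once and cache it, since it is reused twice below.
per_customer = orders.groupBy("customer").agg(F.sum("amount").alias("total")).cache()

# Step 2: reuse the cached intermediate result in two separate queries.
top_customers = per_customer.orderBy(F.desc("total")).limit(10)
by_country = per_customer.join(customers, "customer").groupBy("country").agg(F.sum("total"))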
The only thing that matters is what kind of underlying algorithm is used for grouping.
HashAggregation is more efficient than SortAggregation. SortAggregation sorts the rows and then gathers together the matching rows: O(n log n).
HashAggregation creates a HashMap using the grouping columns as the key and the rest of the columns as the values in the map.
Spark SQL uses HashAggregation where possible (if the data for the values is of mutable types): O(n).
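One way to see which strategy the planner picked (made-up data; the exact plan text varies by Spark version): a numeric aggregation buffer is mutable and typically yields HashAggregate, while a string-valued buffer typically falls back to SortAggregate.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("a", 1, "x"), ("a", 2, "y"), ("b", 3, "z")],
                           ["key", "value", "name"])

# Mutable, fixed-width aggregation buffer -> usually HashAggregate in the plan.
df.groupBy("key").agg(F.sum("value")).explain()

# String aggregation buffer (immutable) -> usually SortAggregate instead.
df.groupBy("key").agg(F.max("name")).explain()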

How to create dynamic and safe queries

A "static" query is one that remains the same at all times. For example, the "Tags" button on Stackoverflow, or the "7 days" button on Digg. In short, they always map to a specific database query, so you can create them at design time.
But I am trying to figure out how to do "dynamic" queries where the user basically dictates how the database query will be created at runtime. For example, on Stackoverflow, you can combine tags and filter the posts in ways you choose. That's a dynamic query albeit a very simple one since what you can combine is within the world of tags. A more complicated example is if you could combine tags and users.
First of all, when you have a dynamic query, it sounds like you can no longer use the substitution api to avoid sql injection since the query elements will depend on what the user decided to include in the query. I can't see how else to build this query other than using string append.
Secondly, the query could potentially span multiple tables. For example, if SO allows users to filter based on Users and Tags, and these probably live in two different tables, building the query gets a bit more complicated than just appending columns and WHERE clauses.
How do I go about implementing something like this?
The first rule is that users are allowed to specify values in SQL expressions, but not SQL syntax. All query syntax should be literally specified by your code, not user input. The values that the user specifies can be provided to the SQL as query parameters. This is the most effective way to limit the risk of SQL injection.
Many applications need to "build" SQL queries through code, because as you point out, some expressions, table joins, order by criteria, and so on depend on the user's choices. When you build a SQL query piece by piece, it's sometimes difficult to ensure that the result is valid SQL syntax.
I worked on a PHP class called Zend_Db_Select that provides an API to help with this. If you like PHP, you could look at that code for ideas. It doesn't handle any query imaginable, but it does a lot.
Some other PHP database frameworks have similar solutions.
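A minimal Python sketch of that rule, using sqlite3 and made-up table/column names: all query syntax comes from a whitelist in code, and user-supplied values are only ever bound as parameters.

import sqlite3

# Whitelist: query syntax (column names) comes from here, never from user input.
ALLOWED_FILTERS = {"tag", "author"}

def build_post_query(filters):
    """filters: dict like {'tag': 'python', 'author': 'alice'}."""
    sql = "SELECT id, title FROM posts"
    clauses, params = [], []
    for column, value in filters.items():
        if column not in ALLOWED_FILTERS:
            raise ValueError("cannot filter on " + column)
        clauses.append(column + " = ?")  # syntax comes from code
        params.append(value)             # value is bound as a parameter
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    return sql, params

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (id INTEGER, title TEXT, tag TEXT, author TEXT)")
sql, params = build_post_query({"tag": "python", "author": "alice"})
rows = conn.execute(sql, params).fetchall()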
Though not a general solution, here are some steps that you can take to mitigate the dynamic yet safe query issue.
Criteria in which a column value belongs to a set of values of arbitrary cardinality do not need to be dynamic. Consider using either the instr function or a special filtering table which you join against. This approach can easily be extended to multiple columns as long as the number of columns is known. Filtering on users and tags could easily be handled with this approach (see the sketch after this answer).
When the number of columns in the filtering criteria is arbitrary yet small, consider using different static queries for each possibility.
Only when the number of columns in the filtering criteria is arbitrary and potentially large should you consider using dynamic queries. In which case...
To be safe from SQL injection, either build or obtain a library that defends against that attack. Though more difficult, this is not an impossible task. This is mostly about escaping SQL string delimiters in the values to filter for.
To be safe from expensive queries, consider using views that are specially crafted for this purpose and some up front logic to limit how those views will get invoked. This is the most challenging in terms of developer time and effort.
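A common variation of that set-membership idea, again as a sqlite3 sketch with a made-up schema: the number of values is arbitrary, but only placeholders are generated dynamically and every value is still bound as a parameter.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (id INTEGER, tag TEXT)")
conn.executemany("INSERT INTO posts VALUES (?, ?)",
                 [(1, "python"), (2, "sql"), (3, "pandas")])

wanted_tags = ["python", "pandas"]                # arbitrary-cardinality user input
placeholders = ", ".join("?" * len(wanted_tags))  # only '?' marks are generated
sql = "SELECT id, tag FROM posts WHERE tag IN (" + placeholders + ")"
rows = conn.execute(sql, wanted_tags).fetchall()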
If you are using Python to access your database, I would suggest you use the Django model system. There are many similar APIs, both for Python and for other languages (notably in Ruby on Rails). I am saving so much time by avoiding the need to talk directly to the database with SQL.
From the example link:
# Model definition
from django.db import models

class Blog(models.Model):
    name = models.CharField(max_length=100)
    tagline = models.TextField()

    def __unicode__(self):
        return self.name
Model usage (this is effectively an insert statement)
from mysite.blog.models import Blog
b = Blog(name='Beatles Blog', tagline='All the latest Beatles news.')
b.save()
The queries get much more complex: you pass around a query object and you can add filters / sort elements to it. When you are finally ready to use the query, Django creates an SQL statement that reflects all the ways you adjusted the query object. I think that it is very cute.
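For instance, with the Blog model above (the field lookups here are illustrative), filters and ordering can be added step by step, and Django only generates the SQL when the queryset is actually evaluated:

from mysite.blog.models import Blog

qs = Blog.objects.all()
qs = qs.filter(name__contains='Beatles')   # add a filter
qs = qs.order_by('-name')                  # add a sort element
posts = list(qs)                           # SQL is generated and executed here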
Other advantages of this abstraction
Your models can be created as database tables with foreign keys and constraints by Django
Many databases are supported (PostgreSQL, MySQL, SQLite, etc.)
Django analyses your models and creates an automatic admin site out of them.
Well, the options have to map to something.
Concatenating a SQL query string isn't a problem as long as you still use parameters for the option values.