Spark Dataframe from SQL Query

I'm attempting to use Apache Spark to load the results of a (large) SQL query with multiple joins and sub-selects into a DataFrame, as discussed in Create Spark Dataframe from SQL Query.
Unfortunately, my attempts to do so result in an error from Parquet:
Exception in thread "main" org.apache.spark.sql.AnalysisException:
Unable to infer schema for Parquet. It must be specified manually.
Information I've found on Google implies that this error occurs when a DataFrame is empty. However, the query returns plenty of rows when run in DBeaver.
Here is an example query:
(SELECT REPORT_DATE, JOB_CODE, DEPT_NBR, QTY
FROM DBO.TEMP
WHERE BUSINESS_DATE = '2019-06-18'
AND STORE_NBR IN (999)
ORDER BY BUSINESS_DATE) as reports
My Spark code looks like this:
val reportsDataFrame = spark
.read
.option("url", db2JdbcUrl)
.option("dbtable", queries.reports)
.load()
reportsDataFrame.show(10)
I read in the answer linked above that it is possible to run queries against an entire database this way, specifically by setting the "dbtable" parameter to an aliased query when you first build your DataFrame in Spark. You can see I've done this by aliasing the entire query "as reports".
I don't believe this to be a duplicate question. I've extensively researched this specific problem and have not found anyone else facing the same issue online, in particular the Parquet error resulting from running the query.
It seems the consensus is that one should not run SQL queries this way and should instead use the many methods of Spark's DataFrame API to filter, group and aggregate data. However, it would be very valuable for us to be able to use raw SQL instead, even if it incurs a performance penalty.

A quick look at your code tells me you are missing .format("jdbc"):
val reportsDataFrame = spark
.read
.format("jdbc")
.option("url", db2JdbcUrl)
.option("dbtable", queries.reports)
.load()
This should work, provided you have the username and password set to connect to the database.
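For anyone doing the same from PySpark rather than Scala, a minimal sketch of the equivalent read with the user and password options spelled out might look like the following (the JDBC URL, credentials and query alias are placeholders, not values from the question):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("jdbc-example").getOrCreate()

reports_df = (
    spark.read
    .format("jdbc")                                  # required, as noted above
    .option("url", "jdbc:db2://dbhost:50000/MYDB")   # placeholder JDBC URL
    .option("dbtable", "(SELECT ...) AS reports")    # aliased query, as in the question
    .option("user", "db_user")                       # placeholder credentials
    .option("password", "db_password")
    .load()
)

reports_df.show(10)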
A good resource for learning more about JDBC sources in Spark: https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html

Related

AWS Redshift Leader Node-Only Function with table reference

I have a requirement to return the server address, server port, and the count from a table in one query in AWS Redshift, i.e.
select inet_server_addr(), inet_server_port(), count(*) from my_table;
ERROR: 0A000: Specified types or functions (one per INFO message) not
supported on Redshift tables.
I understand that this query does not work because I am trying to execute a Leader Node-Only Function in conjunction with a query which needs to access the compute nodes.
I am wondering, however, if there is a workaround available to get the information that I need in one query execution.
Note: Editing the above query to use common table expressions (CTEs), sub-queries, views, scalar functions, etc. results in the same error message.
Note 2: PostgreSQL system information functions like inet_server_addr() are currently unsupported in AWS Redshift; however, they work when called without a table reference.

DBI/Spark: how to store the result in a Spark Dataframe?

I am using sparklyr to run some analysis, but I am interested also in writing raw SQL queries using DBI.
I am able to run the following query
query <- "SELECT col1, FROM mydata WHERE some_condition"
dataframe <- dbGetQuery(spark_connection, query)
but this returns the data into R (in a dataframe).
What I want instead is keep the data inside Spark and store it in another Spark Dataframe for further interaction with sparklyr.
Any ideas?
The issue with using DBI is memory. You won't be able to fetch a huge amount of data with it. If your query returns a huge amount of data, it will overwhelm Spark's driver memory and cause out-of-memory errors...
What's happening with sparklyr is the following: DBI runs the SQL command and returns an R data frame, which means it is collecting the data to materialize it in a regular R context.
Thus, if you want to use it to return a small dataset, you don't need Spark for that. Otherwise, DBI isn't the solution for you; you ought to use regular SparkR if you want to stick with R.
This is an example of how you can use SQL in sparklyr:
sc %>% spark_session %>%
  invoke("sql", "SELECT 1") %>%
  invoke("createTempView", "foo")
You may also do:
mydata_spark_df <- tbl(sc, sql("select * from mydata"))

When querying MongoDB using DBeaver, what's the right syntax for filtering by date?

I recently discovered that DBeaver can connect to MongoDB. My next discovery was that DBeaver expects SQL-like queries instead of the JavaScript-like queries I use with the mongo command line client. I've been unable to find any good documentation on the syntax I should be using, so I've been learning by trial and error. I need some help filtering query results by date.
I have a collection named tasks. Each object in the collection has a startedAt attribute that holds a timestamp.
This query gives me lots of results using the command line client: db.tasks.find({startedAt:{$gt:ISODate("2017-03-03")}});
I'm guessing the syntax in DBeaver should be something like this: select * from tasks where startedAt > '2017-03-03';
But I must be doing something wrong, because I don't get any results in DBeaver unless I drop the WHERE clause. What's the right way?

Is it possible to access a BigQuery partition in Standard SQL using the '$' decorator?

In Google BigQuery, I'm trying to use the $ decorator when querying a partitioned table using Standard SQL. I assume this is supposed to allow me to access partitions and table metadata as it did in Legacy SQL, but it doesn't appear to work in Standard SQL.
Both of the following queries return Error: Table "dataset.partitioned_table$___" cannot include decorator:
1) Accessing a partition directly:
#StandardSQL
SELECT a, b, c
FROM `mydataset.partitioned_table$20161115`
2) Accessing table metadata:
#StandardSQL
SELECT partition_id
FROM `mydataset.partitioned_table$__PARTITIONS_SUMMARY__`;
The obvious workaround for the first query is to use the _PARTITIONTIME pseudocolumn:
#StandardSQL
SELECT a, b, c
FROM mydataset.partitioned_table
WHERE _PARTITIONTIME = '2016-11-15'
However, I haven't been able to find a workaround for the second query, which is useful for retrieving the most recent partition (though using that info to actually query the latest partition seems broken as well; see: How to choose the latest partition in BigQuery table?).
Obtaining the partitions summary using a decorator is currently not supported in Standard SQL. We are planning some work in this area, but we don't currently have an ETA on when that might be available. The fastest option right now is to run the query over T$__PARTITIONS_SUMMARY__ using legacy SQL.
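If you need to pull that summary programmatically, a minimal sketch using the Python BigQuery client with legacy SQL enabled could look like this (the dataset and table names are taken from the question; the client setup is an assumption on my part):

# Fetch the partition summary with legacy SQL via the Python BigQuery client.
from google.cloud import bigquery

client = bigquery.Client()

# Legacy SQL uses square brackets around table names, which allows the decorator.
legacy_sql = """
SELECT partition_id
FROM [mydataset.partitioned_table$__PARTITIONS_SUMMARY__]
"""

job_config = bigquery.QueryJobConfig(use_legacy_sql=True)
rows = client.query(legacy_sql, job_config=job_config).result()

for row in rows:
    print(row.partition_id)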

Convert sqlalchemy ORM query object to sql query for Pandas DataFrame

This question feels fiendishly simple but I haven't been able to find an answer.
I have an ORM query object, say
query_obj = session.query(Class1).join(Class2).filter(Class2.attr == 'state')
I can read it into a dataframe like so:
testdf = pd.read_sql(query_obj.statement, query_obj.session.bind)
But what I really want to do is use a traditional SQL query instead of the ORM:
with engine.connect() as connection:
    # Execute the query against the database
    results = connection.execute(query_obj)
    # Fetch all the results of the query
    fetchall = results.fetchall()
    # Build a DataFrame with the results
    dataframe = pd.DataFrame(fetchall)
where query would be a traditional SQL string. When I run this I get an error along the lines of "query_obj is not executable". Does anyone know how to convert the ORM query to a traditional SQL query? Also, how does one get the column names in after building the DataFrame?
Context for why I'm doing this: I've set up an ORM layer on top of my database and am using it to query data into a Pandas DataFrame. It works, but it's frequently maxing out my memory. I want to cut my in-memory overhead with some string folding (pass 3 outlined here: http://www.mobify.com/blog/sqlalchemy-memory-magic/). That requires (and correct me if I'm wrong here) not using read_sql and instead processing the query's results as raw tuples.
The long version is described in detail in the SQLAlchemy FAQ: http://sqlalchemy.readthedocs.org/en/latest/faq/sqlexpressions.html#how-do-i-render-sql-expressions-as-strings-possibly-with-bound-parameters-inlined
The short version is:
statement = query.statement
print(statement.compile(engine))
The result of this can be used in read_sql.
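For example, a minimal sketch of feeding the compiled statement to pandas (assuming the query_obj and engine from the question; the literal_binds flag inlines the bound parameter values so the rendered string can run on its own):

import pandas as pd

# Render the ORM query as a plain SQL string for this engine's dialect.
# literal_binds inlines bound parameter values (e.g. the 'state' filter).
statement = query_obj.statement
compiled_sql = str(statement.compile(engine, compile_kwargs={"literal_binds": True}))

# The raw SQL string can now be handed to pandas directly.
testdf = pd.read_sql(compiled_sql, engine)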
This may be from a later version of SQLAlchemy than the post above, but simply running
print(query)
outputs the query, which you can copy and paste back into your script.
Fiendishly simple indeed. Per Jori's link to the docs, it's just query_obj.statement to get the SQL query. So my code is:
with engine.connect() as connection:
    # Execute the query against the database
    results = connection.execute(query_obj.statement)
    # Fetch all the results of the query
    fetchall = results.fetchall()
    # Build a DataFrame with the results
    dataframe = pd.DataFrame(fetchall)
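As for the second part of the question (getting the column names into the DataFrame), one option is the sketch below; using results.keys() is my own suggestion rather than part of the answer above:

with engine.connect() as connection:
    results = connection.execute(query_obj.statement)
    fetchall = results.fetchall()
    # results.keys() exposes the column names of the result set,
    # so they can be passed straight to the DataFrame constructor.
    dataframe = pd.DataFrame(fetchall, columns=list(results.keys()))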