I am working on a POC in Databricks where I need to move my R script into a Databricks notebook.
To run any SQL expression I need to point to the %sql interpreter and then write the query, which works fine.
However, is there any way to save the query result to an object?
%sql
a <- SHOW databases
This does not work; the following is the error:
Please let me know whether anything like this is possible. As of now I can run the query using library(DBI)
and then save the result using dbGetQuery(....)
I would recommend using the spark.sql interface, since you are working in a Databricks notebook. The code below works inside a Python Databricks notebook, for reference.
from pyspark.sql.functions import col
# execute and store query result in data frame, collect results to use
mytabs = spark.sql("show databases").select('databaseName').filter(col("databaseName")=="<insert your database here, for example>")
str(mytabs.collect()[0][0])
Just to add to Ricardo's answer, the first line in a command cell is parsed for an optional directive (beginning with a percentage symbol).
If no directive is supplied, then the default language (scala, python, sql, r) of the notebook is assumed. In your example, the default language of the notebook is Python.
When you supply %sql (it must be on the first parsed line), it assumes that everything in that command cell is a SQL command.
The command that you listed:
%sql
a <- SHOW databases
is actually mixing SQL with R.
If you want to return the result of a SQL query to an R variable, you would need to do something like the following:
%r
library(SparkR)
a <- sql("SHOW DATABASES")
You can find more such examples in the SparkR docs here:
https://docs.databricks.com/spark/latest/sparkr/overview.html#from-a-spark-sql-query
I'm attempting to use Apache Spark in order to load the results of a (large) SQL query with multiple joins and sub-selects into a DataFrame from Spark as discussed in Create Spark Dataframe from SQL Query.
Unfortunately, my attempts to do so result in an error from Parquet:
Exception in thread "main" org.apache.spark.sql.AnalysisException:
Unable to infer schema for Parquet. It must be specified manually.
I have seen information from google implying that this error occurs when a DataFrame is empty. However, the results of the query load plenty of rows in DBeaver.
Here is an example query:
(SELECT REPORT_DATE, JOB_CODE, DEPT_NBR, QTY
FROM DBO.TEMP
WHERE BUSINESS_DATE = '2019-06-18'
AND STORE_NBR IN (999)
ORDER BY BUSINESS_DATE) as reports
My Spark code looks like this.
val reportsDataFrame = spark
.read
.option("url", db2JdbcUrl)
.option("dbtable", queries.reports)
.load()
scheduledHoursDf.show(10)
I read in the previous answer that it is possible to run queries against an entire database using this method, in particular if you set the "dbtable" parameter to an aliased query when you first build your DataFrame in Spark. You can see I've done this by aliasing the entire query "as reports".
I don't believe this to be a duplicate question. I've extensively researched this specific problem and have not found anyone facing the same issue online, in particular the Parquet error that results from running the query.
It seems the consensus is that one should not run SQL queries this way and should instead use the many DataFrame methods in Spark to filter, group by and aggregate data. However, it would be very valuable for us to be able to use raw SQL instead, even if it incurs a performance penalty.
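A quick look at your code tells me you are missing .format("jdbc"). Without an explicit format, Spark falls back to its default data source, Parquet, which is why the error mentions Parquet.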
val reportsDataFrame = spark
.read
.format("jdbc")
.option("url", db2JdbcUrl)
.option("dbtable", queries.reports)
.load()
This should work, provided you have the username and password set to connect to the database.
A good resource for learning more about JDBC sources in Spark: https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html
This code:
cursor.execute('select RLAMBD from ?', OPTable)
print cursor.fetchone().RLAMBD
produces this error:
ProgrammingError: ('42S02', '[42S02] [Oracle][ODBC][Ora]ORA-00903: invalid table name\n (903) (SQLExecDirectW)')
OPTable is an alphanumeric string which I've built from another database query which contains the table name I want to select from.
The following code works just fine within the same script.
sql = 'select RLAMBD from ' + OPTable
cursor.execute(sql)
print cursor.fetchone().RLAMBD
I guess it's not a huge deal to build the sql statements this way, but I just don't understand why it's not accepting the ? parameters. I even have another query in the same script which uses the ? parameterization and works just fine. The parameters for the working query are produced using the raw_input function, though. Is there some subtle difference between the way those two strings might be formatted that's preventing me from getting the query to work? Thank you all.
I'm running python 2.7 and pyodbc 3.0.10.
Parameter placeholders cannot be used to represent object names (e.g., table or column names) or SQL keywords. They are only used to pass data values, e.g., numbers, strings, dates, etc.
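If the table name has to be built into the statement, one common precaution is to check it against a known list before concatenating, while still binding the data values with ?. The following is only a sketch: it uses the standard library's sqlite3 instead of pyodbc so that it runs on its own, and the table names, the ALLOWED_TABLES set and the select_rlambd helper are made up for illustration.

```python
import sqlite3

# Hypothetical allow-list of table names the script is permitted to query.
ALLOWED_TABLES = {"op_table_a", "op_table_b"}


def select_rlambd(conn, table_name, min_value):
    # Object names cannot be bound, so validate the name before interpolating it.
    if table_name not in ALLOWED_TABLES:
        raise ValueError("unexpected table name: %r" % (table_name,))
    # The table name is interpolated; the data value still goes through a placeholder.
    sql = 'select RLAMBD from "{}" where RLAMBD >= ?'.format(table_name)
    return conn.execute(sql, (min_value,)).fetchall()


# In-memory demo data, purely illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("create table op_table_a (RLAMBD real)")
conn.executemany("insert into op_table_a values (?)", [(0.5,), (1.5,), (2.5,)])

rows = select_rlambd(conn, "op_table_a", 1.0)
print(rows)
```

The same pattern should carry over to a pyodbc cursor, since it also uses ? as its placeholder for data values.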
I am using OrientDB and the Gremlin console that comes with it.
I am trying to search for a pattern in a text property. I have Email vertices with an eBodyText property. The problem is that querying with a SQL-like command and with the Gremlin language gives quite different results.
If I use a SQL-like query such as:
select count(*) from Email where eBodyText like '%Syria%'
it returns 24.
But if I query in gremlin console such as:
g.V.has('eBodyText').filter{it.eBodyText.matches('.*Syria.*')}.count()
it returns none.
The same queries with a different keyword, 'memo', return 161 with SQL but 20 with Gremlin.
Why does this behave like this? Is there a problem with the syntax of the Gremlin command? Is there a better way to search text in Gremlin?
I guess there might be a problem with how properties are set in the upload script, which uses the Python driver 'pyorient'.
Python script used to upload the dataset
Thanks for your help.
I tried with 2.1.15 and I had no problem.
These are the records.
EDITED
I added some vertices to my DB and now the count() is 11
QUERY:
g.V.has('eBodyText').filter{it.eBodyText.contains('Syria')}.count()
OUTPUT:
==>11
Hope it helps.
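One possible explanation for the difference, which is a guess and not something confirmed in this thread: matches() has to match the whole string, and in Java/Groovy regular expressions '.' does not match line breaks by default, so '.*Syria.*' would fail on any multi-line body, while contains() is a plain substring test. Python's re module behaves the same way in this respect, so the effect can be sketched there; the body text below is made up.

```python
import re

# Hypothetical email body; real bodies usually span several lines.
body = "Subject: update\nNews about Syria today\nRegards"

# Like String.matches in Java/Groovy, fullmatch must cover the whole string,
# and '.' does not match a newline by default.
whole_default = re.fullmatch(r".*Syria.*", body) is not None

# With DOTALL, '.' also matches newlines, so the pattern covers the whole body.
whole_dotall = re.fullmatch(r".*Syria.*", body, flags=re.DOTALL) is not None

# A plain substring test ignores line breaks entirely.
substring = "Syria" in body

print(whole_default, whole_dotall, substring)
```

If that is the cause, prefixing the Groovy pattern with the (?s) flag, as in '(?s).*Syria.*', should make matches() agree with contains().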
This question feels fiendishly simple but I haven't been able to find an answer.
I have an ORM query object, say
query_obj = session.query(Class1).join(Class2).filter(Class2.attr == 'state')
I can read it into a dataframe like so:
testdf = pd.read_sql(query_obj.statement, query_obj.session.bind)
But what I really want to do is use a traditional SQL query instead of the ORM:
with engine.connect() as connection:
# Execute the query against the database
results = connection.execute(query_obj)
# Fetch all the results of the query
fetchall = results.fetchall()
# Build a DataFrame with the results
dataframe = pd.DataFrame(fetchall)
Here the query is supposed to be a traditional SQL string. When I run this I get an error along the lines of "query_obj is not executable". Does anyone know how to convert the ORM query to a traditional query? Also, how does one get the column names in after building the DataFrame?
Context for why I'm doing this: I've set up an ORM layer on top of my database and am using it to query data into a Pandas DataFrame. It works, but it's frequently maxing out my memory. I want to cut my in-memory overhead with some string folding (pass 3 outlined here: http://www.mobify.com/blog/sqlalchemy-memory-magic/). That requires (and correct me if I'm wrong here) not using the read_sql string and instead processing the query's return as raw tuples.
The long version is described in detail in the FAQ of sqlalchemy: http://sqlalchemy.readthedocs.org/en/latest/faq/sqlexpressions.html#how-do-i-render-sql-expressions-as-strings-possibly-with-bound-parameters-inlined
The short version is:
statement = query.statement
print(statement.compile(engine))
The result of this can be used in read_sql.
This may only apply to a version of SQLAlchemy later than the one in use at the time of the original post:
print(query)
outputs the query you can copy and paste back into your script.
Fiendishly simple indeed. Per Jori's link to the docs, it's just query_obj.statement to get the SQL query. So my code is:
with engine.connect() as connection:
# Execute the query against the database
results = connection.execute(query_obj.statement)
# Fetch all the results of the query
fetchall = results.fetchall()
# Build a DataFrame with the results
dataframe = pd.DataFrame(fetchall)
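On the remaining question of getting the column names: as far as I know, the SQLAlchemy result object exposes them through results.keys(), so something like pd.DataFrame(fetchall, columns=results.keys()) should work. The sketch below shows the same idea at the plain DB-API level, using the standard library's sqlite3 so it runs without SQLAlchemy or pandas; the class1 table and its columns are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table standing in for the one mapped to Class1.
conn.execute("create table class1 (id integer, attr text)")
conn.executemany("insert into class1 values (?, ?)", [(1, "state"), (2, "other")])

cursor = conn.execute("select id, attr from class1 where attr = ?", ("state",))

# Each entry of cursor.description describes one column; its first item is the name.
columns = [d[0] for d in cursor.description]

# Rows come back as raw tuples, which can be processed before building a frame.
rows = cursor.fetchall()
records = [dict(zip(columns, row)) for row in rows]

print(columns, records)
```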
I have a sequence defined in my database that I want to use in my Django application. I assumed I could use the raw SQL method specified in the Django documentation here: http://docs.djangoproject.com/en/1.1/topics/db/sql/. My plan was to execute the following SQL statement:
select nextval('winner')
where winner is the sequence I want to get the next value from. Here is my code:
from django.db import connection, transaction
.....
cursor = connection.cursor()
result = cursor.execute("select nextval('winner')")
The result is always a NoneType object. This seems pretty simple and straightforward, but I haven't been able to make this work. I've tried it in the interactive console with the same results. If I look in the connection.queries object, I see this:
{'time': '0.000', 'sql': "select nextval('winner')"}
The sql generated is valid. Any ideas?
I'm using:
Django 1.1
Python 2.6
Postgres 8.3
Psycopg2
Mac OSX
This code will work:
from django.db import connection, transaction
cursor = connection.cursor()
cursor.execute("select nextval('winner')")
result = cursor.fetchone()
full documentation about cursor
cursor.execute returns None here; the DB API leaves its return value undefined, so it should not be relied on.
To get the value from the SQL statement, you have to use cursor.fetchone() (or fetchall()).
See the documentation, although note that this doesn't actually have anything to do with Django, but is the standard Python SQL DB API.
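Because this is DB-API behaviour and not Django-specific, the execute-then-fetch pattern can be tried with any driver. Here is a small sketch using the standard library's sqlite3; SQLite has no sequences, so a constant expression stands in for nextval('winner').

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cursor = conn.cursor()

# Do not rely on the return value of execute(); the DB-API leaves it undefined.
cursor.execute("select 41 + 1")

# fetchone() returns one row as a tuple, or None when no rows remain.
row = cursor.fetchone()
value = row[0]

# The single result row has been consumed, so a second fetch yields None.
after = cursor.fetchone()

print(value, after)
```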