At the office, we use Zeppelin notebooks with Spark as the default interpreter. Most of the time I use Spark SQL. Here is a simple example:
%SQL
select * from db.table1
How can I make Spark SQL the default interpreter so that I do not have to write %SQL in each cell?
When you Create New Note, set the Default Interpreter to sql.
I am working on my graduation project and it uses Impala,
so I want to ask: is there any way to use constructs like 'for', 'if', 'while', etc. in Cloudera Impala?
@Atef Ibrahim, you can use if():
if(boolean condition, type ifTrue, type ifFalseOrNull)
For more info, you can read the docs on Impala Conditional Functions.
Regarding the for/while loop statements, you can read
Write a While loop in Impala SQL?
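For illustration, here is a minimal Python sketch of how the if() function above might be used. It assumes the impyla DB-API client and a hypothetical students table; the host, port, and column names are placeholders, not anything from the original question.
from impala.dbapi import connect  # assumes impyla is installed (pip install impyla)

# Hypothetical connection details; replace with your own Impala daemon host/port.
conn = connect(host='impala-host.example.com', port=21050)
cur = conn.cursor()

# if(condition, value_if_true, value_if_false_or_null)
cur.execute("""
    SELECT name,
           if(score >= 50, 'pass', 'fail') AS result
    FROM students
""")

for name, result in cur.fetchall():
    print(name, result)

cur.close()
conn.close()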
I would like to know if there is a way to use standard SQL with the Airflow BigQueryValueCheckOperator in Apache Airflow 1.9. The Airflow BigQueryOperator normally has a flag, use_legacy_sql=False, to disable legacy SQL, but I can't find a way to achieve this with the BigQueryValueCheckOperator.
Rewriting the query in legacy SQL is not an option for now, since I want to use _PARTITIONTIME in my WHERE clause.
Thank you.
Currently, you can't use standard SQL with this operator.
However, for your use case, you can still use _PARTITIONTIME with legacy SQL, as mentioned in the docs: https://cloud.google.com/bigquery/docs/querying-partitioned-tables#querying_ingestion-time_partitioned_tables_using_time_zones
Sample Query:
#legacySQL
SELECT
  field1
FROM
  mydataset.partitioned_table
WHERE
  _PARTITIONTIME BETWEEN TIMESTAMP("2016-05-01")
    AND TIMESTAMP("2016-05-06")
  AND DATE_ADD([MY_TIMESTAMP_FIELD], 8, 'HOUR') BETWEEN TIMESTAMP("2016-05-01 12:00:00")
    AND TIMESTAMP("2016-05-05 14:00:00");
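For completeness, here is a minimal sketch of how the operator might be wired up in Airflow 1.9. The DAG, dataset, table, and pass value below are hypothetical; the operator simply runs its sql with legacy SQL, which is its only mode in this version.
from datetime import datetime
from airflow import DAG
from airflow.contrib.operators.bigquery_check_operator import BigQueryValueCheckOperator

# Hypothetical DAG; adjust start_date and schedule to your pipeline.
dag = DAG('bq_partition_check', start_date=datetime(2018, 1, 1), schedule_interval='@daily')

check_partition = BigQueryValueCheckOperator(
    task_id='check_partition_row_count',
    # Legacy SQL filtering on _PARTITIONTIME, as in the sample query above.
    sql="""
        SELECT COUNT(*)
        FROM mydataset.partitioned_table
        WHERE _PARTITIONTIME = TIMESTAMP("2016-05-01")
    """,
    pass_value=1000,   # hypothetical expected row count
    tolerance=0.1,     # allow +/- 10% around pass_value
    bigquery_conn_id='bigquery_default',
    dag=dag,
)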
Recently I have been working on a POC in Databricks, where I need to move my R script to a notebook in Databricks.
To run any SQL expression, I need to point to the %sql interpreter and then write the query, which works fine.
However, is there any way I can save this query result to an object:
%sql
a <- SHOW databases
This is not working; it throws an error.
Please let me know if anything like this is possible or not. As of now I can run the query using library(DBI)
and then save the result using dbGetQuery(...).
I would recommend using the spark.sql interface, since you are working in a Databricks notebook. Below is code that will work inside a Python Databricks notebook, for reference.
from pyspark.sql.functions import col
# execute and store query result in data frame, collect results to use
mytabs = spark.sql("show databases").select('databaseName').filter(col("databaseName")=="<insert your database here, for example>")
str(mytabs.collect()[0][0])
Just to add to Ricardo's answer, the first line in a command cell is parsed for an optional directive (beginning with a percentage symbol).
If no directive is supplied, then the default language (scala, python, sql, r) of the notebook is assumed. In your example, the default language of the notebook is Python.
When you supply %sql (it must be on the first parsed line), it assumes that everything in that command cell is a SQL command.
The command that you listed:
%sql
a <- SHOW databases
is actually mixing SQL with R.
If you want to return the result of a SQL query to an R variable, you would need to do something like the following:
%r
library(SparkR)
a <- sql("SHOW DATABASES")
You can find more such examples in the SparkR docs here:
https://docs.databricks.com/spark/latest/sparkr/overview.html#from-a-spark-sql-query
I'm using an HQL query that contains something similar to...
INSERT OVERWRITE TABLE ex_tb.ex_orc_tb
select *, SUBSTR(INPUT__FILE__NAME,60,4), CONCAT_WS('-', SUBSTR(INPUT__FILE__NAME,71,4), SUBSTR(INPUT__FILE__NAME,75,2), SUBSTR(INPUT__FILE__NAME,77,2))
from ex_db.ex_ext_tb
When I go into Hive and use that command, it works fine.
When I put it into a PySpark HiveContext command instead, I get the error...
pyspark.sql.utils.AnalysisException: u"cannot resolve 'INPUT__FILE__NAME' given input columns: [list_name, name, day, link_params, id, template]; line 2 pos 17"
Any ideas why this might be?
INPUT__FILE__NAME is a Hive-specific virtual column and it is not supported in Spark.
Spark provides the input_file_name function, which should work in a similar way:
SELECT input_file_name() FROM df
but it requires Spark 2.0 or later to work correctly with PySpark.
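As a minimal PySpark sketch (Spark 2.0+), assuming a hypothetical input path and view name, the function can be used either through the DataFrame API or through Spark SQL:
from pyspark.sql import SparkSession
from pyspark.sql.functions import input_file_name

spark = SparkSession.builder.getOrCreate()

# Hypothetical source directory; input_file_name() tags each row with the file it came from.
df = spark.read.text("/path/to/ex_ext_tb/")

# DataFrame API
df_with_source = df.withColumn("source_file", input_file_name())

# SQL form, after registering a temporary view
df.createOrReplaceTempView("ex_ext_tb")
spark.sql("SELECT *, input_file_name() AS source_file FROM ex_ext_tb").show(truncate=False)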
As I am new to Hive, I am trying to implement some SQL functions in
Hive. How can I implement SQL's over() function in Hive?
I am using Shark 0.8.0, which uses Hive 0.9, and in this version over() is not implemented.
You can see a full description of the syntax here.
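Window functions such as OVER() were only added in Hive 0.11, so they are not available in Hive 0.9 / Shark 0.8.0. For reference, in engines that do support them the syntax looks like this minimal sketch, run here through Spark SQL with a hypothetical sales table:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical `sales` table with columns dept, sale_date, and amount.
spark.sql("""
    SELECT dept,
           amount,
           SUM(amount)  OVER (PARTITION BY dept ORDER BY sale_date) AS running_total,
           ROW_NUMBER() OVER (PARTITION BY dept ORDER BY amount DESC) AS rank_in_dept
    FROM sales
""").show()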