SparkJob file name - sql

I'm using an HQL query that contains something similar to this:
INSERT OVERWRITE TABLE ex_tb.ex_orc_tb
select *, SUBSTR(INPUT__FILE__NAME,60,4), CONCAT_WS('-', SUBSTR(INPUT__FILE__NAME,71,4), SUBSTR(INPUT__FILE__NAME,75,2), SUBSTR(INPUT__FILE__NAME,77,2))
from ex_db.ex_ext_tb
When I go into Hive and use that command, it works fine.
When I put it into a PySpark HiveContext command, I instead get the error:
pyspark.sql.utils.AnalysisException: u"cannot resolve 'INPUT__FILE__NAME' given input columns: [list_name, name, day, link_params, id, template]; line 2 pos 17"
Any ideas why this might be?

INPUT__FILE__NAME is a Hive-specific virtual column and is not supported in Spark.
Spark provides the input_file_name function, which should work in a similar way:
SELECT input_file_name() FROM df
but it requires Spark 2.0 or later to work correctly with PySpark.
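For example, a minimal PySpark sketch (Spark 2.0+; the table name is reused from the question, and the src_file column name is just an illustration):
from pyspark.sql.functions import input_file_name

# add the source file path as an ordinary column
df = spark.table("ex_db.ex_ext_tb")
df.withColumn("src_file", input_file_name()).select("src_file").show(5, truncate=False)

# the same function is available directly from SQL
spark.sql("SELECT input_file_name() AS src_file FROM ex_db.ex_ext_tb").show(5, truncate=False)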

Write PySpark Dataframe to Impala table

I want to write a PySpark dataframe into an Impala table, but I am getting an error message. Basically, my code looks like:
properties = {'user': os.getenv('USERNAME'), 'password': os.getenv('SECRET'), 'driver': 'com.cloudera.impala.jdbc41.Driver'}
df.write.jdbc(url=os.getenv('URL'), table=os.getenv('URL'), mode='append', properties=properties)
The problem seems to be that the generated "CREATE TABLE" statement has some bad syntax:
When I run the same query in DBeaver, I get the same error message. Only when I delete the quotation marks does the table get created. I have no idea how to solve this. I created the dataframe by flattening a JSON file, using the withColumn and explode functions. Can I somehow avoid these quotation marks being generated?
As an alternative, would it be possible to write the dataframe into an already existing table, using an insert into query instead?
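What I have in mind is roughly the following (just a sketch; the table name is a placeholder, and the target table is assumed to already exist in Impala so that Spark never has to generate a "CREATE TABLE" statement):
import os

properties = {'user': os.getenv('USERNAME'),
              'password': os.getenv('SECRET'),
              'driver': 'com.cloudera.impala.jdbc41.Driver'}
# append into a pre-created table instead of letting Spark create one
df.write.jdbc(url=os.getenv('URL'), table='my_schema.my_existing_table',
              mode='append', properties=properties)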
Edit: Another issue I just realized: when it comes to string columns, the "create table" statement contains the word "TEXT" instead of "STRING" or "VARCHAR", which is also not recognized as a proper data type by Impala...
Thanks a lot!

Common syntax for creating named_struct in hive and presto

I am trying to define a SQL view that picks a subset of elements from a source struct data type and creates a new struct. In Hive I can do this:
create view myview as
select
id,
named_struct("cnt", bkg.cnt, "val", bkg.val) as bkg
from mybkgtable
This works. The trouble is that when this view is invoked from Presto, it fails with: Function named_struct not registered
I found that Presto has no struct data type but has ROW instead. It works with this syntax:
select
id,
CAST(ROW(bkg.cnt, bkg.val) as row(cnt integer, val double)) as bkg
from mybkgtable
However, this syntax isn't understood by Hive.
The question is: is it possible to have one view definition that works on both Hive and Presto?
Sadly, no.
You can use Coral to write the view definition in HiveQL and translate it to Presto.
https://github.com/linkedin/coral
https://engineering.linkedin.com/blog/2020/coral

Spark Dataframe from SQL Query

I'm attempting to use Apache Spark to load the results of a (large) SQL query with multiple joins and sub-selects into a Spark DataFrame, as discussed in Create Spark Dataframe from SQL Query.
Unfortunately, my attempts to do so result in an error from Parquet:
Exception in thread "main" org.apache.spark.sql.AnalysisException:
Unable to infer schema for Parquet. It must be specified manually.
I have seen information from Google implying that this error occurs when a DataFrame is empty. However, the query returns plenty of rows when run in DBeaver.
Here is an example query:
(SELECT REPORT_DATE, JOB_CODE, DEPT_NBR, QTY
FROM DBO.TEMP
WHERE BUSINESS_DATE = '2019-06-18'
AND STORE_NBR IN (999)
ORDER BY BUSINESS_DATE) as reports
My Spark code looks like this.
val reportsDataFrame = spark
.read
.option("url", db2JdbcUrl)
.option("dbtable", queries.reports)
.load()
scheduledHoursDf.show(10)
I read in the previous answer that it is possible to run queries against an entire database using this method, in particular by specifying the "dbtable" parameter to be an aliased query when you first build your DataFrame in Spark. You can see I've done this by aliasing the entire query "as reports".
I don't believe this to be a duplicate question. I've extensively researched this specific problem and have not found anyone facing the same issue online. In particular, the Parquet error resulting from running the query.
It seems the consensus is that one should not be running SQL queries this way and should instead use the many methods of Spark's DataFrames to filter, group and aggregate data. However, it would be very valuable for us to be able to use raw SQL instead, even if it incurs a performance penalty.
A quick look at your code tells me you are missing .format("jdbc"):
val reportsDataFrame = spark
.read
.format("jdbc")
.option("url", db2JdbcUrl)
.option("dbtable", queries.reports)
.load()
This should work provided you have username and password set to connect to the database.
A good resource to learn more about JDBC sources in Spark is the official documentation: https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html
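For completeness, a minimal PySpark equivalent that passes the credentials as options (the URL, query and credentials below are placeholders):
# read the aliased query through the JDBC source, supplying credentials explicitly
reportsDataFrame = (spark.read
    .format("jdbc")
    .option("url", "jdbc:db2://<host>:<port>/<database>")  # placeholder JDBC URL
    .option("dbtable", "(SELECT ...) AS reports")          # the aliased query from above
    .option("user", "db_user")                             # placeholder credentials
    .option("password", "db_password")
    .load())
reportsDataFrame.show(10)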

How to point %sql query in Notebook to an object in R

Recently I have been working on a POC in Databricks, where I need to move my R script to a notebook in Databricks.
For running any SQL expression I need to point to the %sql interpreter and then write the query, which works fine.
However, is there any way I can save this query result to an object? Something like:
%sql
a <- SHOW databases
This is not working; the following is the error:
Please let me know whether anything like this is possible or not. As of now I can run it using library(DBI)
and then save the result using dbGetQuery(...)
I would recommend using the spark.sql interface as you are working in a Databricks notebook. Below is code which will work inside a Python Databricks notebook, for reference.
from pyspark.sql.functions import col
# execute and store query result in data frame, collect results to use
mytabs = spark.sql("show databases").select('databaseName').filter(col("databaseName")=="<insert your database here, for example>")
str(mytabs.collect()[0][0])
Just to add to Ricardo's answer, the first line in a command cell is parsed for an optional directive (beginning with a percentage symbol).
If no directive is supplied, then the default language (scala, python, sql, r) of the notebook is assumed. In your example, the default language of the notebook is Python.
When you supply %sql (it must be on the first parsed line), it assumes that everything in that command cell is a SQL command.
The command that you listed:
%sql
a <- SHOW databases
is actually mixing SQL with R.
If you want to return the result of a SQL query to an R variable, you would need to do something like the following:
%r
library(SparkR)
a <- sql("SHOW DATABASES")
You can find more such examples in the SparkR docs here:
https://docs.databricks.com/spark/latest/sparkr/overview.html#from-a-spark-sql-query

Hive CLI doesn't support MySQL-style data import to tables

Why can't we import data in the Hive CLI as follows? The hive_test table has user and comments columns.
insert into table hive_test (user, comments)
value ("hello", "this is a test query");
Hive throws the following exception in the Hive CLI:
FAILED: ParseException line 1:28 cannot recognize input near '(' 'user' ',' in select clause
I don't want to import the data through a CSV file, like the following, just for testing purposes:
load data local inpath '/home/hduser/test_data.csv' into table hive_test;
It's worth noting that Hive advertises "SQL-like" syntax, rather than actual SQL syntax. There's no particular reason to think that pure SQL queries will actually run on Hive. HiveQL's DML is documented here on the Wiki, and does not support the column specification syntax or the VALUES clause. However, it does support this syntax:
INSERT INTO TABLE tablename1 SELECT ... FROM ...
Extrapolating from these test queries, you might be able to get something like the following to work:
INSERT INTO TABLE hive_test SELECT 'hello', 'this is a test query' FROM src LIMIT 1
However, it does seem that Hive is not really optimized for this small-scale data manipulation. I don't have a Hive instance to test any of this on.
I think it is because user is a built-in (reserved) keyword.
Try this:
insert into table hive_test ("user", comments)
value ('hello', 'this is a test query');