Here is my code:
sc <- sparkR.init(master = "local[*]", sparkEnvir = list(spark.driver.memory="8g"))
hiveContext <- sparkRHive.init(sc)
sqlQuery <- "SELECT * from table ABC"
joinSQL <- sql(hiveContext,sqlQuery)
This gives an error when I submit it with spark-submit sparkRscript.R:
Error: a37fe9-9e5c-4569-879d-475944333fb0/_tmp_space.db
Error: Character input expected Execution halted 18/09/16 23:33:29 INFO SparkContext: Invoking stop() from shutdown hook
But when I run it in the SparkR interactive shell, it works and gives the expected result.
I'm trying to push a param from SSHOperator into XCom and pull it in Python.
def decision_function(**context):
    ti = context['ti']
    output_ssh = ti.xcom_pull(task_ids='ssh_task')
    print('ssh output is: {}'.format(output_ssh))

ls = SSHOperator(
    task_id="ssh_task",
    command="ls",
    ssh_hook=sshHook,
    dag=mydag)

get_xcom = PythonOperator(
    task_id='test',
    python_callable=decision_function,
    provide_context=True,
    dag=mydag)

ls >> get_xcom
I get the wrong result from the xcom_pull:
ssh output is: MTcySyBhaXJmbG93CTIuMUcgQXV0b0hpdHVtX05ld
Everything works fine with BashOperator, but when I try to use the SSHOperator it does not work correctly.
I also tried changing the enable_xcom_pickling param in the config, but it still does not work.
Airflow: 2.1.4
Linux
Thank you.
The value of the XCom may be base64-encoded depending on the value of enable_xcom_pickling, as you can see in the source code.
The issue you are facing was discussed in a PR that suggested changing the functionality to be like other operators, but the PR was not merged; you can see the reasons for it in the code review comments.
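One simple workaround is to decode the value on the pulling side. A minimal sketch (assuming enable_xcom_pickling is left at its default of False, so the pushed XCom is a base64-encoded string):
import base64

def decision_function(**context):
    ti = context['ti']
    output_ssh = ti.xcom_pull(task_ids='ssh_task')
    # With enable_xcom_pickling=False the SSHOperator pushes its output
    # base64-encoded, so decode it back to plain text before using it.
    decoded = base64.b64decode(output_ssh).decode('utf-8')
    print('ssh output is: {}'.format(decoded))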
Alternatively, if you wish, you can create a custom version of the operator without the enable_xcom_pickling condition:
from typing import Union

from airflow.exceptions import AirflowException
from airflow.providers.ssh.operators.ssh import SSHOperator


class MySSHOperator(SSHOperator):

    def execute(self, context=None) -> Union[bytes, str]:
        result: Union[bytes, str]
        if self.command is None:
            raise AirflowException("SSH operator error: SSH command not specified. Aborting.")

        # Forcing get_pty to True if the command begins with "sudo".
        self.get_pty = self.command.startswith('sudo') or self.get_pty

        try:
            with self.get_ssh_client() as ssh_client:
                result = self.run_ssh_client_command(ssh_client, self.command)
        except Exception as e:
            raise AirflowException(f"SSH operator error: {str(e)}")
        return result.decode('utf-8')
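The custom operator can then be dropped in place of the original task; a short sketch reusing the names from the question (sshHook and mydag as defined there):
ls = MySSHOperator(
    task_id="ssh_task",
    command="ls",
    ssh_hook=sshHook,
    dag=mydag)

With this version, the value pulled by decision_function is already a decoded string.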
I'm trying to use the SQL chunk function available in the preview version of RStudio 1.0 to connect to a SQL Server (using the RSQLServer backend for DBI) and I'm having some difficulty passing variables.
If I connect to the server and then put the query in the chunk, it works as expected:
```{r, eval = F}
svr <- dbConnect(RSQLServer::SQLServer(), "Server_name", database = 'Database_name')
query <- 'SELECT TOP 10 * FROM Database_name.dbo.table_name'
```
```{sql, connection = svr, eval = F}
SELECT TOP 10 * FROM Database_name.dbo.table_name
```
But if I try to pass the query as a variable, it throws an error:
```{sql, connection = svr, eval = F}
?query
```
Error: Unable to retrieve JDBC result set for 'SELECT TOP 10 * FROM Database_name.dbo.table_name': Incorrect syntax near 'SELECT TOP 10 * FROM Database_name.dbo.table_name'.
Failed to execute SQL chunk
I think it's related to the way R wraps character vectors in quotes, because I get the same error if I run the following code.
```{sql, connection = svr, eval = F}
'SELECT TOP 10 * FROM Database_name.dbo.table_name'
```
Is there a way I can get around this error?
Currently I can achieve what I want by using inline expressions to print the query (with pygments for highlighting) and running the query in an R chunk with DBI commands, but using SQL chunks would be a bit nicer.
It looks like "Using R variables in queries" applies some kind of escaping, and can therefore only be used in cases like the example from the documentation (SELECT * FROM trials WHERE subjects >= ?subjects), but not to dynamically set up the whole query.
Instead, the `code` chunk option can be used to achieve the desired behavior:
The example uses the SQLite sample database from sqlitetutorial.net. Unzip it to your working directory before running the code.
```{r}
library(DBI)
db <- dbConnect(RSQLite::SQLite(), dbname = "chinook.db")
query <- "SELECT * FROM tracks"
```
```{sql, connection=db, code = query}
```
I haven't been able to determine a way to print and execute in the same chunk; however, with a few extra lines of code it is possible to achieve my desired output.
Printing is solved by CL.'s answer, and then I can use EXEC to run the query.
```{sql, code = query}
```
```{sql, connection = svr}
EXEC (?query)
```
I am trying to trigger Hive on Spark using the Hue interface. The job works perfectly when run from the command line, but when I try to run it from Hue it throws exceptions. In Hue, I tried mainly two things:
1) When I give all the properties in the .hql file using set commands:
set spark.home=/usr/lib/spark;
set hive.execution.engine=spark;
set spark.eventLog.enabled=true;
add jar /usr/lib/spark/assembly/lib/spark-assembly-1.5.0-cdh5.5.1-hadoop2.6.0-cdh5.5.1.jar;
set spark.eventLog.dir=hdfs://10.11.50.81:8020/tmp/;
set spark.executor.memory=2899102923;
I get an error:
ERROR : Failed to execute spark task, with exception 'org.apache.hadoop.hive.ql.metadata.HiveException(Unsupported execution engine: Spark. Please set hive.execution.engine=mr)'
org.apache.hadoop.hive.ql.metadata.HiveException: Unsupported execution engine: Spark. Please set hive.execution.engine=mr
2) When I give the properties in the Hue properties section, it works with the mr engine but not with the Spark execution engine.
Any help would be appreciated.
I solved this issue by using a shell action in Oozie.
This shell action invokes a PySpark script containing my SQL.
Even though the job shows up as MR in the JobTracker, the Spark history server recognizes it as a Spark job and the expected output is produced.
Shell file:
#!/bin/bash
export PYTHONPATH=`pwd`
spark-submit --master local testabc.py
Python file:
from pyspark.sql import HiveContext
from pyspark import SparkContext

sc = SparkContext()
sqlContext = HiveContext(sc)
result = sqlContext.sql("insert into table testing_oozie.table2 select * from testing_oozie.table1")
result.show()
I am having difficulty querying a SQL database from a knitr chunk. I can establish a connection, and the query works in an R session, but it hangs indefinitely when knitting from RStudio.
---
title: "Untitled"
author: "XXXXX XXXXXXXXX"
date: "Monday, April 27, 2015"
output: html_document
---
TEST TEST
```{r}
library(RJDBC)
jd<-JDBC(driverClass = "com.osisoft.jdbc.Driver",classPath = "C://Program Files (x86)//PIPC//JDBC//PIJDBCDriver.jar")
piDB<-dbConnect(drv = jd,"jdbc:pisql://XX.XXX.XX.XX/Data Source=XXX;Integrated Security=SSPI")
sql1<-"SELECT * FROM pipoints"
sql.dat <- dbGetQuery(piDB, sql1)
dbDisconnect(piDB)
print('Success')
```
If you can use a different connection driver, try ODBC.
RODBC works fine with knitr in RStudio:
```{r}
library(RODBC)
myconn = odbcConnect('myServer')
myquery = paste0("") #add some query
data = sqlQuery(myconn, myquery)
head(data)
```
With RStudio v1.0 you can now use sql chunks directly from your RMarkdown or RNotebook. I use the odbc package for this. I like this because it avoids hard-coding login details into your projects while still creating projects that run end-to-end without user input.
An RMarkdown example below:
```{r}
# Unfortunately, odbc is not on CRAN yet
# So we will need devtools
# install.packages("devtools")
library(devtools)
devtools::install_github("rstats-db/odbc")
# Get connection info from the Windows ODBC Data Source Administrator using the name you set manually.
# If you don't know what this is, just search in the windows start menu for "ODBC Data Source Administrator"
con <- dbConnect(odbc::odbc(), 'MyDataWarehouse')
```
```{sql connection = con, output.var = "result"}
-- This is sql code, comments need to be marked accordingly
SELECT * FROM SOMETABLE LIMIT 200;
```
```{R}
# And the result is available in the next chunk!
result
```
I have 4 files A, B, C, D under the directory /user/bizlog/cpc on HDFS, and the records look like this:
87465422^C376832^C27786^C21161214^Ckey
Here is my Pig script:
cpc_all = load '/user/bizlog/cpc' using PigStorage('\u0003') as (cpcid, accountid, cpcplanid, cpcgrpid, key);
cpc = foreach cpc_all generate accountid, key;
account_group = group cpc by accountid;
account_sort = order account_group by group;
account_key = foreach account_sort generate group, BagToTuple(cpc.key);
store account_key into 'last' using PigStorage('\u0003');
It produces results such as:
376832^Ckey1^Ckey2
The above script is supposed to process all 4 files, but I get this error:
Backend error message
---------------------
org.apache.pig.backend.executionengine.ExecException: ERROR 0: Exception while executing (Name: account_key: New For Each(false,false)[bag] - scope-18 Operator Key: scope-18): org.apache.pig.backend.executionengine.ExecException: ERROR 0: Error while executing ForEach at []
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:289)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNextTuple(POForEach.java:242)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.runPipeline(PigGenericMapReduce.java:464)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.processOnePackageOutput(PigGenericMapReduce.java:432)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.reduce(PigGenericMapReduce.java:412)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.
Pig Stack Trace
---------------
ERROR 0: Exception while executing (Name: account_key: New For Each(false,false)[bag] - scope-18 Operator Key: scope-18): org.apache.pig.backend.executionengine.ExecException: ERROR 0: Error while executing ForEach at []
org.apache.pig.backend.executionengine.ExecException: ERROR 0: Exception while executing (Name: account_key: New For Each(false,false)[bag] - scope-18 Operator Key: scope-18): org.apache.pig.backend.executionengine.ExecException: ERROR 0: Error while executing ForEach at []
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:289)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNextTuple(POForEach.java:242)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.runPipeline(PigGenericMapReduce.java:464)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.processOnePackageOutput(PigGenericMapReduce.java:432)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.reduce(PigGenericMapReduce.java:412)
================================================================================
Oddly, if I load one single file, such as load '/user/bizlog/cpc/A', then the script succeeds.
If I load each file first and then union them, it works fine too.
If I put the sort step last, the error goes away.
The Hadoop version is 0.20.2 and the Pig version is 0.12.1; any help will be appreciated.
As mentioned in the comments:
If I put the sort step last, the error goes away.
Though I did not find much on the topic, it appears that Pig does not like ordering the grouped relation itself.
As such, the 'solution' is to order the output generated from the group (i.e., sort account_key after the FOREACH) instead of ordering the grouped relation itself.