Spark SQL push down query issue with max function in MS SQL - sql

I want to execute an aggregate function MAX on a table's ID column residing in MS SQL. I am using spark SQL 1.6 and JDBC push down_query approach as I don't want spark SQL to pull all the data on spark side and do MAX (ID) calculation, but when I execute below code I get below exception, whereas If I try SELECT * FROM in code it works as expected.
Code:
def getMaxID(sqlContext: SQLContext,tableName:String) =
{
val pushdown_query = s"(SELECT MAX(ID) FROM ${tableName}) as t"
val maxID = sqlContext.read.jdbc(url = getJdbcProp(sqlContext.sparkContext).toString, table = pushdown_query, properties = getDBConnectionProperties(sqlContext.sparkContext))
.head().getLong(0)
maxID
}
Exception:
Exception in thread "main" java.sql.SQLException: No column name was specified for column 1 of 't'.
at net.sourceforge.jtds.jdbc.SQLDiagnostic.addDiagnostic(SQLDiagnostic.java:372)
at net.sourceforge.jtds.jdbc.TdsCore.tdsErrorToken(TdsCore.java:2988)
at net.sourceforge.jtds.jdbc.TdsCore.nextToken(TdsCore.java:2421)
at net.sourceforge.jtds.jdbc.TdsCore.getMoreResults(TdsCore.java:671)
at net.sourceforge.jtds.jdbc.JtdsStatement.executeSQLQuery(JtdsStatement.java:505)
at net.sourceforge.jtds.jdbc.JtdsPreparedStatement.executeQuery(JtdsPreparedStatement.java:1029)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:124)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation.<init>(JDBCRelation.scala:91)
at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:222)
at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:146)

This exception is not related to Spark. You have to provide an alias for the column
val pushdown_query = s"(SELECT MAX(ID) AS max_id FROM ${tableName}) as t"

Related

SQL error when using format() function with pyodbc in Django

I want to execute a command using pyodbc in my Django app. When I do simple update with one column it works great:
cursor.execute("UPDATE dbo.Table SET attr = 1 WHERE id = {}".format(id))
However when I try to use a string as a column value it throws error:
cursor.execute("UPDATE dbo.Table SET attr = 1, user = '{}' WHERE id = {}".format(id, str(request.user.username)))
Here's error message:
('42S22', "[42S22] [Microsoft][ODBC SQL Server Driver][SQL Server]Invalid column name 'Admin'. (207) (SQLExecDirectW)")
Suprisingly this method works:
cursor.execute("UPDATE dbo.Table SET attr = 1, user = 'Admin' WHERE id = {}".format(id))
What seems to be the problem? Why is sql mistaking column value for its name?
As mentioned above, you have your arguments backwards, but if you're going to use cursor.execute(), the far more important thing to do is use positional parameters (%s). This will pass the SQL and values separately to the database backend, and protect you from SQL injection:
from django.db import connection
cursor = connection.cursor()
cursor.execute("""
UPDATE dbo.Table
SET attr = 1,
user = %s
WHERE id = %s
""", [
request.user.username,
id,
])
You've got your format arguments backwards. You're passing id to user, and username to the id WHERE clause.

RDSdataService execute_statement returns (BadRequestException)

I am using boto3 library with executeStatement to get data from an RDS cluster using DATA API.
Query is working fine if i select 1 or 2 columns but as soon as I select another column to query, it returns an error with (BadRequestException) permission denied for relation table_name
I have checked using pgadmin the permissions are intact to query the whole db for the user I am using.
function included in call:
def execute_query(self, sql_query, sql_parameters=[]):
"""
Aurora DataAPI execute query. Generally used for select statements.
:param sql_query: Query
:param sql_parameters: parameters in sql query
:return: DataApi response
"""
client = self.api_access()
response = client.execute_statement(
resourceArn=RESOURCE_ARN,
secretArn=SECRET_ARN,
database='db_name',
sql=sql_query,
includeResultMetadata=True,
parameters=sql_parameters)
return response
function call: No errors
query = '''
SELECT id
FROM schema_name.table_name
limit 1
'''
print(query)
result = conn.execute_query(query)
print(result)
function call: fails with above error
query = '''
SELECT id,name,event
FROM schema_name.table_name
limit 1
'''
print(query)
result = conn.execute_query(query)
print(result)
Is there a horizontal limit on what we can get from DATA API using Boto3? I know there is a limit for 1MB, but it should return something as per the documentation if it exceeds the limit.
Backend is Postgres RDS
UPDATE:
I can select the same columns 10 times and its not a problem
query = '''
SELECT id,event,event,event,event,event
FROM schema_name.table_name
limit 1
'''
print(query)
result = conn.execute_query(query)
print(result)
So this means there are some columns that I cannot select.
I didnt know there are column level security in some tables. If there are column level securities in postgres for the user you are using that's obvious I cannot select those columns.

How do I access the output of sql in scala?

I'd like to get access to the results from a sql query that is getting sent over to a redshift database in scala.
val test = spark.sql("""
SELECT * FROM my_table
LIMIT 100""")
test.show
When I check the class it appears to be a sql data set.
test.getClass
res52: Class[_ <: org.apache.spark.sql.DataFrame] = class org.apache.spark.sql.Dataset
Unfortunately, it returns this error.
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree:
Exchange SinglePartition
How do I show the top results from my sql dataset?

Spark sql SQLContext

I'm trying to select data from MSSQL database via SQLContext.sql in Spark application.
Connection works but I'm not able to select data from table, because it always fail on table name.
Here is my code:
val prop=new Properties()
val url2="jdbc:jtds:sqlserver://servername;instance=MSSQLSERVER;user=sa;password=Pass;"
prop.setProperty("user","username")
prop.setProperty("driver" , "net.sourceforge.jtds.jdbc.Driver")
prop.setProperty("password","mypassword")
val test=sqlContext.read.jdbc(url2,"[dbName].[dbo].[Table name]",prop)
sqlContext.sql("""
SELECT *
FROM 'dbName.dbo.Table name'
""")
I tried table name without (') or [dbName].[dbo].[Table name] but still the same ....
Exception in thread "main" java.lang.RuntimeException: [3.14] failure:
``union'' expected but `.' found
dependencies:
// https://mvnrepository.com/artifact/org.apache.spark/spark-core_2.11
libraryDependencies += "org.apache.spark" % "spark-core_2.10" % "1.6.1" //%"provided"
// https://mvnrepository.com/artifact/com.datastax.spark/spark-cassandra-connector_2.10
libraryDependencies += "com.datastax.spark" % "spark-cassandra-connector_2.10" % "1.6.0"
// https://mvnrepository.com/artifact/org.apache.spark/spark-sql_2.10
libraryDependencies += "org.apache.spark" % "spark-sql_2.10" % "1.6.1" //%"provided"
I think the problem in your code is that the query you pass to sqlContext has no access to the original table in the source database. It has only access to the tables saved within the sql context, for example with df.write.saveAsTable() or with df.registerTempTable() (df.createTempView in Spark 2+).
So, in your specific case, I can suggest a couple of options:
1) if you want the query to be executed on the source database with the exact syntax of your database SQL, you can pass the query to the "dbtable" argument:
val query = "SELECT * FROM dbName.dbo.TableName"
val df = sqlContext.read.jdbc(url2, s"($query) AS subquery", prop)
df.show
Note that the query needs to be in parentheses, because it will be passed to a "FROM" clause, as specified in the docs:
dbtable: The JDBC table that should be read. Note that anything that is valid in a FROM clause of a SQL query can be used. For example, instead of a full table you could also use a subquery in parentheses.
2) If you don't need to run the query on the source database, you can just pass the table name and then create a temp view in the sqlContext:
val table = sqlContext.read.jdbc(url2, "dbName.dbo.TableName", prop)
table.registerTempTable("temp_table")
val df = sqlContext.sql("SELECT * FROM temp_table")
// or sqlContext.table("temp_table")
df.show()

Spark Sql Aggregation from cassandra

this is my query written for mysql database,
SELECT dcm.user, du.full_name, ROUND(AVG(fcg.final)) ,ROUND(AVG(fcg.participation))
FROM dimclassmem dcm LEFT JOIN factGlobal fcg on fcg.class_id=dcm.class_id GROUP BY dcm.user ORDER BY dcm.user
i can run this using java + mysql.
now i want to write this query using Spark Sql.
How can i write Aggregate function in Spark.
I am fetching data from Cassandra table , and perform simple query & it below code works,
val conf = new SparkConf()
conf.set("spark.cassandra.connection.host", "localhost")
val sc = new SparkContext("local", "Cassandra Connector Test", conf)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val df = sqlContext
.read.format("org.apache.spark.sql.cassandra")
.options(Map("keyspace" -> "Sample", "table" -> "testTable"))
.load()
df.registerTempTable("testTable")
val ddf = sqlContext.sql("select name from testTable order by name desc limit 10")
ddf.show()
but , if i used follwing code it won't work,
val countallRec = sqlContext.sql("Select count(name) from testTable")
countallRec.show()
i am getting below exception
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/rdd/MapPartitionsWithPreparationRDD
at org.apache.spark.sql.execution.aggregate.TungstenAggregate.doExecute(TungstenAggregate.scala:69)
I want to run above query Using Spark Sql, How can i do it?