How to run a SQL query in a PySpark notebook

I have an SQL query that I run in Azure Synapse Analytics to query data from ADLS.
Can I run the same query in a notebook using PySpark in Azure Synapse Analytics?
I searched for ways to run SQL in a notebook, but it looks like the query needs some modification to work with either of these:
%%sql or spark.sql("")
Query
SELECT *
FROM OPENROWSET(
BULK 'https://xxx.xxx.xxx.xxx.net/datazone/Test/parquet/test.snappy.parquet',
FORMAT = 'PARQUET'
)

Read the data lake file into a DataFrame, save it as a table with saveAsTable, and then query the table as shown below.
df = spark.read.load('abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/<filename>', format='parquet')
df.write.mode("overwrite").saveAsTable("testdb.test2")
Using %%sql
%%sql
select * from testdb.test2
Using %%pyspark
%%pyspark
df = spark.sql("select * from testdb.test2")
display(df)
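The OPENROWSET call in the question addresses the file by its https URL, while spark.read expects an abfss URI. Below is a minimal sketch of the conversion, assuming the usual ADLS Gen2 layout https://<account>.dfs.core.windows.net/<container>/<path>. The helper names to_abfss and query_parquet and the view name are hypothetical, and a temporary view is used here instead of saveAsTable so that nothing is persisted:

```python
from urllib.parse import urlparse


def to_abfss(https_url):
    """Convert an ADLS Gen2 https URL to the abfss URI form that Spark reads.

    Assumes the layout https://<account>.dfs.core.windows.net/<container>/<path>.
    """
    parsed = urlparse(https_url)
    container, _, file_path = parsed.path.lstrip("/").partition("/")
    if not parsed.netloc or not container:
        raise ValueError(f"not an ADLS Gen2 https URL: {https_url!r}")
    return f"abfss://{container}@{parsed.netloc}/{file_path}"


def query_parquet(spark, https_url, view_name="test_view"):
    """Register the Parquet file as a temporary view and query it with SQL."""
    df = spark.read.load(to_abfss(https_url), format="parquet")
    df.createOrReplaceTempView(view_name)  # session-scoped, nothing is written
    return spark.sql(f"SELECT * FROM {view_name}")
```

In a notebook, display(query_parquet(spark, url)) would then show the rows, where url is the address passed to BULK in the original query.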

Related

Retrieve df from spark.sql : [PARSE_SYNTAX_ERROR] Syntax error at or near 'SELECT'

I'm using a Databricks notebook and I'd like to retrieve a DataFrame from an SQL execution in Spark. I have:
statement = f""" USE {db}; SELECT * FROM {table}
"""
df = spark.sql(statement)
display(df)
However, unlike when I fire off the same statement in an SQL cell in the notebook, I get the following error:
[PARSE_SYNTAX_ERROR] Syntax error at or near 'SELECT': extra input 'SELECT'(line 1...
Where am I going wrong?
I reproduced this in my environment with a sample demo table named Persons.
Creating the DataFrame from a single SELECT statement works:
df = sqlContext.sql("select * from Persons")
display(df)
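The error itself comes from passing two statements in one call: spark.sql() parses a single statement, so the text after the semicolon is reported as extra input. Below is a minimal sketch of one workaround that runs each statement separately and keeps the last result. run_statements is a hypothetical helper, and the split on ';' is deliberately naive (it would break on semicolons inside string literals):

```python
def run_statements(spark, script):
    """Run each ';'-separated statement with spark.sql and return the last result.

    Naive split: assumes no semicolons appear inside string literals or comments.
    """
    statements = [s.strip() for s in script.split(";") if s.strip()]
    if not statements:
        raise ValueError("no SQL statement found")
    result = None
    for statement in statements:
        result = spark.sql(statement)  # one statement per call
    return result
```

With this helper, df = run_statements(spark, f"USE {db}; SELECT * FROM {table}") returns the DataFrame of the final SELECT. Qualifying the table name, as in spark.sql(f"SELECT * FROM {db}.{table}"), avoids the second statement altogether.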

Pyspark not reading all months from hive database

I am trying to read data from Hive into PySpark in order to write CSV files. The following SQL query returns 5 months:
select distinct posting_date from my_table
When I read the data with PySpark I only get 4 months:
sql_query = 'select * from my_table'
data = spark_session.sql(sql_query)
data.groupBy("posting_date").count().orderBy("posting_date").show()
I had the same problem in the past and solved it by using the deprecated API for running SQL:
from pyspark.sql import SQLContext
sql_context = SQLContext(spark_session.sparkContext)
data = sql_context.sql(sql_query)
data.groupBy("posting_date").count().orderBy("posting_date").show()
In my current project I have the same issue again, and this time neither approach solves it.
I also tried HiveContext instead of SQLContext, with no luck.

Querying avro data files stored in Azure Data Lake directly with raw SQL from Databricks

I'm using Databricks notebooks to read Avro files stored in an Azure Data Lake Gen2. The Avro files are created by an Event Hub Capture and have a specific schema. From these files I have to extract only the Body field, where the data I'm interested in is actually stored.
I already implemented this in Python and it works as expected:
path = 'abfss://file_system@storage_account.dfs.core.windows.net/root/YYYY/MM/DD/HH/mm/file.avro'
df0 = spark.read.format('avro').load(path) # 1
df1 = df0.select(df0.Body.cast('string')) # 2
rdd1 = df1.rdd.map(lambda x: x[0]) # 3
data = spark.read.json(rdd1) # 4
Now I need to translate this to raw SQL in order to filter the data directly in the SQL query. Considering the 4 steps above, steps 1 and 2 with SQL are as follows:
CREATE TEMPORARY VIEW file_avro
USING avro
OPTIONS (path "abfss://file_system@storage_account.dfs.core.windows.net/root/YYYY/MM/DD/HH/mm/file.avro");
WITH body_array AS (SELECT cast(Body AS STRING) FROM file_avro)
SELECT * FROM body_array
With this partial query I get the same as df1 above (step 2 with Python):
Body
[{"id":"a123","group":"0","value":1.0,"timestamp":"2020-01-01T00:00:00.0000000"},
{"id":"a123","group":"0","value":1.5,"timestamp":"2020-01-01T00:01:00.0000000"},
{"id":"a123","group":"0","value":2.3,"timestamp":"2020-01-01T00:02:00.0000000"},
{"id":"a123","group":"0","value":1.8,"timestamp":"2020-01-01T00:03:00.0000000"}]
[{"id":"b123","group":"0","value":2.0,"timestamp":"2020-01-01T00:00:01.0000000"},
{"id":"b123","group":"0","value":1.2,"timestamp":"2020-01-01T00:01:01.0000000"},
{"id":"b123","group":"0","value":2.1,"timestamp":"2020-01-01T00:02:01.0000000"},
{"id":"b123","group":"0","value":1.7,"timestamp":"2020-01-01T00:03:01.0000000"}]
...
I need to know how to introduce the steps 3 and 4 into the SQL query, to parse the strings into json objects and finally get the desired dataframe with columns id, group, value and timestamp. Thanks.
One way I found to do this with raw SQL is as follows, using the from_json Spark SQL built-in function and the schema of the Body field:
CREATE TEMPORARY VIEW file_avro
USING avro
OPTIONS (path "abfss://file_system@storage_account.dfs.core.windows.net/root/YYYY/MM/DD/HH/mm/file.avro");
WITH body_array AS (SELECT cast(Body AS STRING) FROM file_avro),
data1 AS (SELECT from_json(Body, 'array<struct<id:string,group:string,value:double,timestamp:timestamp>>') FROM body_array),
data2 AS (SELECT explode(*) FROM data1),
data3 AS (SELECT col.* FROM data2)
SELECT * FROM data3 WHERE id = "a123" --FILTERING BY CHANNEL ID
It runs faster than the Python code I posted in the question, presumably because from_json is given the schema of Body up front to extract the data inside it. My version of this approach in PySpark looks as follows:
path = 'abfss://file_system@storage_account.dfs.core.windows.net/root/YYYY/MM/DD/HH/mm/file.avro'
df0 = spark.read.format('avro').load(path)
df1 = df0.selectExpr("cast(Body as string) as json_data")
df2 = df1.selectExpr("from_json(json_data, 'array<struct<id:string,group:string,value:double,timestamp:timestamp>>') as parsed_json")
data = df2.selectExpr("explode(parsed_json) as json").select("json.*")
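To make the target shape concrete, the same parse-and-explode logic can be written in plain Python for a handful of Body strings. This is only an illustration of what from_json followed by explode produces, not a replacement for the Spark code. explode_bodies is a hypothetical name, and the sample strings are shortened from the output shown in the question:

```python
import json


def explode_bodies(bodies):
    """Parse each Body string as a JSON array and emit one dict per element."""
    rows = []
    for body in bodies:
        for record in json.loads(body):  # like from_json: string -> array of structs
            rows.append(record)          # like explode: one output row per element
    return rows


bodies = [
    '[{"id":"a123","group":"0","value":1.0,"timestamp":"2020-01-01T00:00:00.0000000"},'
    ' {"id":"a123","group":"0","value":1.5,"timestamp":"2020-01-01T00:01:00.0000000"}]',
    '[{"id":"b123","group":"0","value":2.0,"timestamp":"2020-01-01T00:00:01.0000000"}]',
]
rows = explode_bodies(bodies)
# Same filter as the WHERE clause in the SQL version.
filtered = [r for r in rows if r["id"] == "a123"]
print(len(rows), len(filtered))
```

Unlike from_json with an explicit schema, json.loads leaves the timestamp field as a string; the typed schema in the Spark version is what converts it.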

PySpark and HIVE/Impala

I want to build a classification model in PySpark. The input to this model is the result of a select query or view from Hive or Impala. Is there any way to include this query in the PySpark code itself, instead of storing the result in a text file and feeding that to the model?
Yes. For this you need to use HiveContext with the SparkContext.
Here is an example:
from pyspark.sql import HiveContext
sqlContext = HiveContext(sc)
tableData = sqlContext.sql("SELECT * FROM TABLE")
# tableData is a DataFrame that holds a reference to the table's schema; check it with tableData.printSchema()
tableData.collect()  # collect() executes the query and returns all rows to the driver
Alternatively, see the Spark SQL programming guide:
https://spark.apache.org/docs/1.6.0/sql-programming-guide.html
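HiveContext belongs to the Spark 1.x API; from Spark 2.0 onward the same query goes through a SparkSession created with Hive support. Below is a minimal sketch under that assumption. hive_session and load_model_input are hypothetical helper names, and the query string is whatever SELECT or view feeds the model:

```python
def hive_session(app_name="model-input"):
    """Create (or reuse) a SparkSession that can see Hive metastore tables."""
    from pyspark.sql import SparkSession  # imported lazily; requires pyspark

    return SparkSession.builder.appName(app_name).enableHiveSupport().getOrCreate()


def load_model_input(spark, query):
    """Run the query and return a DataFrame without writing an intermediate file."""
    df = spark.sql(query)
    df.printSchema()  # confirm column names and types before training
    return df
```

The returned DataFrame can be handed to the model pipeline directly; collect() is only appropriate when the result fits in driver memory.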

Reading partitioned parquet file into Spark results in fields in incorrect order

For a table defined as
create table my_table (
..
)
partitioned by (my_part_column String)
we execute a Hive SQL query as follows:
from pyspark.sql import HiveContext
hc = HiveContext(sc)
data = hc.sql("select * from my_table limit 10")
The values read back show my_part_column as the FIRST item of each row instead of the last.
It turns out this is a known bug, fixed in Spark 1.3.0 and 1.2.1:
https://issues.apache.org/jira/browse/SPARK-5049
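Independently of the bug, code that depends on column position is fragile when it relies on SELECT *. Where the order matters downstream, it can be pinned by name. Below is a minimal sketch, assuming the DataFrame API of Spark 1.3 or later; partition_last is a hypothetical helper, and this has not been verified as a workaround on the affected versions:

```python
def partition_last(columns, partition_columns):
    """Return the column names with the partition columns moved to the end."""
    partition_set = set(partition_columns)
    regular = [c for c in columns if c not in partition_set]
    trailing = [c for c in partition_columns if c in columns]
    return regular + trailing
```

With a DataFrame named data, data.select(*partition_last(data.columns, ["my_part_column"])) yields the regular columns first and my_part_column last.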