Querying Avro data files stored in Azure Data Lake directly with raw SQL from Databricks

I'm using Databricks notebooks to read Avro files stored in an Azure Data Lake Gen2. The Avro files are created by an Event Hub Capture and follow a specific schema. From these files I have to extract only the Body field, where the data I'm interested in is actually stored.
I already implemented this in Python and it works as expected:
path = 'abfss://file_system@storage_account.dfs.core.windows.net/root/YYYY/MM/DD/HH/mm/file.avro'
df0 = spark.read.format('avro').load(path) # 1
df1 = df0.select(df0.Body.cast('string')) # 2
rdd1 = df1.rdd.map(lambda x: x[0]) # 3
data = spark.read.json(rdd1) # 4
Now I need to translate this to raw SQL, so I can filter the data directly in the SQL query. Considering the 4 steps above, steps 1 and 2 translate to SQL as follows:
CREATE TEMPORARY VIEW file_avro
USING avro
OPTIONS (path "abfss://file_system@storage_account.dfs.core.windows.net/root/YYYY/MM/DD/HH/mm/file.avro");

WITH body_array AS (SELECT cast(Body AS STRING) AS Body FROM file_avro)
SELECT * FROM body_array
With this partial query I get the same result as df1 above (step 2 in Python):
Body
[{"id":"a123","group":"0","value":1.0,"timestamp":"2020-01-01T00:00:00.0000000"},
{"id":"a123","group":"0","value":1.5,"timestamp":"2020-01-01T00:01:00.0000000"},
{"id":"a123","group":"0","value":2.3,"timestamp":"2020-01-01T00:02:00.0000000"},
{"id":"a123","group":"0","value":1.8,"timestamp":"2020-01-01T00:03:00.0000000"}]
[{"id":"b123","group":"0","value":2.0,"timestamp":"2020-01-01T00:00:01.0000000"},
{"id":"b123","group":"0","value":1.2,"timestamp":"2020-01-01T00:01:01.0000000"},
{"id":"b123","group":"0","value":2.1,"timestamp":"2020-01-01T00:02:01.0000000"},
{"id":"b123","group":"0","value":1.7,"timestamp":"2020-01-01T00:03:01.0000000"}]
...
I need to know how to introduce steps 3 and 4 into the SQL query, to parse the strings into JSON objects and finally get the desired dataframe with columns id, group, value and timestamp. Thanks.

One way I found to do this with raw SQL is as follows, using the from_json Spark SQL built-in function and the schema of the Body field:
CREATE TEMPORARY VIEW file_avro
USING avro
OPTIONS (path "abfss://file_system@storage_account.dfs.core.windows.net/root/YYYY/MM/DD/HH/mm/file.avro");

WITH body_array AS (SELECT cast(Body AS STRING) AS Body FROM file_avro),
data1 AS (SELECT from_json(Body, 'array<struct<id:string,group:string,value:double,timestamp:timestamp>>') AS parsed FROM body_array),
data2 AS (SELECT explode(parsed) FROM data1), -- explode yields the default column name `col`
data3 AS (SELECT col.* FROM data2)
SELECT * FROM data3 WHERE id = "a123" -- filtering by channel id
It performs faster than the Python code I posted in the question, surely because from_json and the schema of Body are used to extract the data inside it. My version of this approach in PySpark looks as follows:
path = 'abfss://file_system@storage_account.dfs.core.windows.net/root/YYYY/MM/DD/HH/mm/file.avro'
df0 = spark.read.format('avro').load(path)
df1 = df0.selectExpr("cast(Body as string) as json_data")
df2 = df1.selectExpr("from_json(json_data, 'array<struct<id:string,group:string,value:double,timestamp:timestamp>>') as parsed_json")
data = df2.selectExpr("explode(parsed_json) as json").select("json.*")
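To reproduce the WHERE clause of the SQL version, a filter can simply be chained onto the resulting dataframe:

data_filtered = data.where(data.id == 'a123')  # equivalent to WHERE id = "a123" in the SQL query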

Related

Export data from SQL to ADLS using ADF as JSON

I am trying to load data to ADLS Gen2 from an Azure SQL DB in JSON format. Below is the query I am using:
SELECT k2.[mandt], k2.[kunnr],
    'knb1' = (SELECT [bukrs] AS 'bukrs', [pernr]
              FROM [ods_cdc_sdr].[knb1] k1
              WHERE k2.mandt = k1.mandt AND k1.kunnr = k2.kunnr
              FOR JSON PATH),
    'knvp' = (SELECT knvp.vkorg, vtweg
              FROM [ods_cdc_sdr].[knvp] knvp
              WHERE k2.mandt = knvp.mandt AND knvp.kunnr = k2.kunnr
              FOR JSON PATH)
FROM [ods_cdc_sdr].[kna1] k2
GROUP BY k2.[mandt], k2.[kunnr]
FOR JSON PATH
For one or two records the data looks fine, but when I try to load 1000 or more records, the JSON appears to be split across rows and is not in a proper format (example below):
**{"JSON_F52E2B61-18A1-11d1-B105-00805F49916B"**:"[{\"mandt\":\"172\",\"kunnr\":\"\"},{\"mandt\":\"172\",\"kunnr\":\"0000000001\"},{\"mandt\":\"172\",\"kunnr\":\"0000000004\",\"knvp\":[{\"vkorg\":\"FR12\",\"vtweg\":\"01\"},{\"vkorg\":\"FR12\",\"vtweg\":\"01\"},{\"vkorg\":\"FR12\",\"vtweg\":\"01\"},{\"vkorg\":\"FR12\",\"vtweg\":\"01\"},{\"vkorg\":\"FR12\",\"vtweg\":\"01\"},{\"vkorg\":\"FR65\",\"vtweg\":\"01\"},{\"vkorg\":\"FR65\",\"vtweg\":\"01\"},{\"vkorg\":\"FR65\",\"vtweg\":\"01\"},{\"vkorg\":\"FR65\",\"vtweg\":\"01\"}]},{\"mandt\":\"172\",\"kunnr\":\"0000000006\"},{\"mandt\":\"172\",\"kunnr\":\"0000000008\",\"knvp\":[{\"vkorg\":\"FR13\",\"vtweg\":\"04\"},{\"vkorg\":\"FR13\",\"vtweg\":\"04\"},{\"vkorg\":\"FR13\",\"vtweg\":\"04\"},{\"vkorg\":\"FR13\",\"vtweg\":\"04\"}]},{\"mandt\":\"172\",\"kunnr\":\"0000000012\",\"knvp\":[{\"vkorg\":\"FR13\",\"vtweg\":\"04\"},{\"vkorg\":\"FR13\",\"vtweg\":\"04\"},{\"vkorg\":\"FR13\",\"vtweg\":\"04\"},{\"vkorg\":\"FR13\",\"vtweg\":\"04\"}]},{\"mandt\":\"172\",\"kunnr\":\"0000000015\"},{\"mandt\":\"172\",\"kunnr\":\"0000000021\"},{\"mandt\":\"172\",\"kunnr\":\"0000000022\"},{\"mandt\":\"172\",\"kunnr\":\"0000000023\"},{\"mandt\":\"172\",\"kunnr\":\"0000000026\",\"knvp\":[{\"vkorg\":\"IN14\",\"vtweg\":\"01\"},{\"vkorg\":\"IN14\",\"vtweg\":\"01\"},{\"vkorg\":\"IN14\",\"vtweg\":\"01\"},{\"vkorg\":\"IN14\",\"vtweg\":\"01\"}]},{\"mandt\":\"172\",\"kunnr\":\"0000000045\",\"knvp\":[{\"vkorg\":\"FR13\",\"vtweg\":\"04\"},{\"vkorg\":\"FR13\",\"vtweg\":\"04\"},{\"vkorg\":\"FR13\",\"vtweg\":\"04\"},{\"vkorg\":\"FR13\",\"vtweg\":\"04\"}]},{\"mandt\":\"172\",\"kunnr\":\"0000000046\",\"knvp\":[{\"vkorg\":\"FR13\",\"vtweg\":\"04\"},{\"vkorg\":\"FR13\",\"vtweg\":\"04\"},{\"vkorg\":\"FR13\",\"vtweg\":\"04\"},{\"vkorg\":\"FR13\",\"vtweg\":\"04\"}]},{\"mandt\":\"172\",\"kunnr\":\"0000000048\"},{\"mandt\":\"172\",\"kunnr\":\"0000000050\"},{\"mandt\":\"172\",\"kunnr\":\"0000000054\"},{\"mandt\":\"172\",\"kunnr\":\"0000000057\"},{\"mandt\":\"172\",\"kunnr\":\"0000000058\"},{\"mandt\":\"172\",\"kunnr\":\"0000000060\"},{\"mandt\":\"172\",\"kunnr\":\"0000000065\"},{\"mandt\":\"172\",\"kunnr\":\"0000000085\"},{\"mandt\":\"172\",\"kunnr\":\"0000000086\"},{\"mandt\":\"172\",\"kunnr\":\"0000000089\"},{\"mandt\":\"172\",\"kunnr\":\"0000000090\"},{\"mandt\":\"172\",\"kunnr\":\"0000000092\"},{\"mandt\":\"172\",\"kunnr\":\"0000000106\"},{\"mandt\":\"172\",\"kunnr\":\"0000000124\"},{\"mandt\":\"172\",\"kunnr\":\"0000000129\",\"knvp\":[{\"vkorg\":\"FR40\",\"vtweg\":\"01\"},{\"vkorg\":\"FR40\",\"vtweg\":\"01\"},{\"vkorg\":\"FR40\""}
**{"JSON_F52E2B61-18A1-11d1-B105-00805F49916B"**:",\**"vtweg\":\"01\"},{\"vkorg\":\"FR40\",\"vtweg\":\"01\"}]},{\"mandt\":\"172\",\"kunnr\":\"0000000149\"},{\"mandt\":\"172\",\"kunnr\":\"0000000164\"},{\"mandt\":\"172\",\"kunnr\":\"0000000167\"},{\"mandt\":\"172\",\"kunnr\":\"0000000174\"},{\"mandt\":\"172\",\"kunnr\":\"0000000178\"},{\"mandt\":\"172\",\"kunnr\":\"0000000181\"},{\"mandt\":\"172\",\"kunnr\":\"0000000185\",\"knvp\":[{\"vkorg\":\"FR65\",\"vtweg\":\"01\"},{\"vkorg\":\"FR65\",\"vtweg\":\"01\"},{\"vkorg\":\"FR65\",\"vtweg\":\"01\"},{\"vkorg\":\"FR65\",\"vtweg\":\"01\"}]},{\"mandt\":\"172\",\"kunnr\":\"0000000189\"},{\"mandt\":\"172\",\"kunnr\":\"0000000214\"},{\"mandt\":\"172\",\"kunnr\":\"0000000223\"},{\"mandt\":\"172\",\"kunnr\":\"0000000228\"},{\"mandt\":\"172\",\"kunnr\":\"0000000239\"},{\"mandt\":\"172\",\"kunnr\":\"0000000240\"},{\"mandt\":\"172\",\"kunnr\":\"0000000249\"},{\"mandt\":\"172\",\"kunnr\":\"0000000251\"},{\"mandt\":\"172\",\"kunnr\":\"0000000257\"},{\"mandt\":\"172\",\"kunnr\":\"0000000260\"},{\"mandt\":\"172\",\"kunnr\":\"0000000261\"},{\"mandt\":\"172\",\"kunnr\":\"0000000262\"},{\"mandt\":\"172\",\"kunnr\":\"0000000286\"},{\"mandt\":\"172\",\"kunnr\":\"0000000301\"},{\"mandt\":\"172\",\"kunnr\":\"0000000320\"},{\"mandt\":\"172\",\"kunnr\":\"0000000347\"},{\"mandt\":\"172\",\"kunnr\":\"0000000350\"},{\"mandt\":\"172\",\"kunnr\":\"0000000353\"},{\"mandt\":\"172\",\"kunnr\":\"0000000364\"},{\"mandt\":\"172\",\"kunnr\":\"0000000370\"},{\"mandt\":\"172\",\"kunnr\":\"0000000372\"},{\"mandt\":\"172\",\"kunnr\":\"0000000373\"},{\"mandt\":\"172\",\"kunnr\":\"0000000375\"},{\"mandt\":\"172\",\"kunnr\":\"0000000377\"},{\"mandt\":\"172\",\"kunnr\":\"0000000380\"},{\"mandt\":\"172\",\"kunnr\":\"0000000381\"},{\"mandt\":\"172\",\"kunnr\":\"0000000383\"},{\"mandt\":\"172\",\"kunnr\":\"0000000384\"},{\"mandt\":\"172\",\"kunnr\":\"0000000386\"},{\"mandt\":\"172\",\"kunnr\":\"0000000387\"},{\"mandt\":\"172\",\"kunnr\":\"0000000391\"},{\"mandt\":\"172\",\"kunnr\":\"0000000393\"},{\"mandt\":\"172\",\"kunnr\":\"0000000396\"},{\"mandt\":\"172\",\"kunnr\":\"0000000397\"},{\"mandt\":\"172\",\"kunnr\":\"0000000408\"},{\"mandt\":\"172\",\"kunnr\":\"0000000416\"},{\"mandt\":\"172\",\"kunnr\":\"0000000421\"},{\"mandt\":\"172\",\"kunnr\":\"0000000424\"},{\"mandt\":\"172\",\"kunnr\":\"0000000425\"},{\"mandt\":\"172\",\"kunnr\":\"0000000428\"},{\"mandt\":\"172\",\"kunnr\":\"0000000443\"},{\"mandt\":\"172\",\"kunnr\":\"0000000447\"},{\"mandt\":\"172\",\"kunnr\":\"0000000453\"},{\"mandt"}
**{"JSON_F52E2B61-18A1-11d1-B105-00805F49916B"**:"\":\"172\",\"kunnr\":\"0000000475\"},{\"mandt\":\"172\",\"kunnr\":\"0000000478\"},{\"mandt\":\"172\\",\"kunnr\":\"2100000001\",\"knvp\":[{\"vkorg\":\"Y200\",\"vtweg\":\"Z1\"},{\"vkorg\":\"Y200\",\"vtweg\":\"Z1\"},{\"vkorg\":\"Y200\",\"vtweg\":\"Z1\"},{\"vkorg\":\"Y200\",\"vtweg\":\"Z1\"}]},{\"mandt\":\"172\",\"kunnr\":\"2100000002\",\"knvp\":[{\"vkorg\":\"Y200\",\"vtweg\":\"Z1\"},{\"vkorg\":\"Y200\",\"vtweg\":\"Z1\"},{\"vkorg\":\"Y200\",\"vtweg\":\"Z1\"},{\"vkorg\":\"Y200\",\"vtweg\":\"Z1\"}]}]"}
Please help me understand how I can get the entire message in a proper format.
If you just want to save the query result to ADLS in JSON format, you should remove FOR JSON PATH; we can use Azure Data Factory to generate the nested JSON instead. The splitting you see is expected behavior: SQL Server streams a large FOR JSON result to the client in chunks of roughly 2 KB, each returned as a separate row under that JSON_F52E2B61... column, and the client is responsible for concatenating them.
I created a simple test with two tables. Table Entities and table EntitiesEmails are related through the Id and EntitiyId fields. Usually we can use the following SQL to return a nested JSON array, but ADF will automatically add the escape character '\' before the double quotes:
SELECT
ent.Id AS 'Id',
ent.Name AS 'Name',
ent.Age AS 'Age',
EMails = (
SELECT
Emails.EntitiyId AS 'Id',
Emails.Email AS 'Email'
FROM EntitiesEmails Emails WHERE Emails.EntitiyId = ent.Id
FOR JSON PATH
)
FROM Entities ent
FOR JSON PATH
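If you do want to keep FOR JSON PATH and fetch the result outside of ADF, the client has to reassemble the chunked rows itself. A minimal Python sketch (the connection string and query variable are placeholders, not from the original post):

import pyodbc  # assumption: any DB-API client that can reach the SQL DB works the same way

conn = pyodbc.connect(connection_string)  # hypothetical connection string
cursor = conn.cursor()
cursor.execute(for_json_query)  # the FOR JSON PATH query shown above

# SQL Server returns one large FOR JSON result as several ~2 KB rows in a
# single column; joining them reassembles the complete JSON document
full_json = ''.join(row[0] for row in cursor.fetchall())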
Instead, I worked out how to use a Data Flow to generate the nested JSON array, as follows:
Set source1 to the SQL table Entities.
Set source2 to the SQL table EntitiesEmails.
At the Aggregate1 activity, set Group By to the entity id column.
Type in the expression collect(Email) at Aggregates. It will collect all email addresses into an array.
Then we can join these two streams at the Join1 activity.
Then we can filter out the extra columns at the Select1 activity.
Then we can sink the result to our JSON file in ADLS.
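For readers who prefer code to the Data Flow UI, roughly the same collect-and-join shape can be sketched in PySpark (a loose equivalent only, not the ADF method itself; dataframe names and the output path are hypothetical):

from pyspark.sql import functions as F

# Aggregate1: collect(Email) corresponds to collect_list here
emails = emails_df.groupBy("EntitiyId").agg(F.collect_list("Email").alias("EMails"))

# Join1 + Select1: join on the id columns and drop the duplicate key
nested = entities_df.join(emails, entities_df.Id == emails.EntitiyId, "left").drop("EntitiyId")

# Sink: write the nested result as JSON to ADLS (path is a placeholder)
nested.write.mode("overwrite").json("abfss://container@account.dfs.core.windows.net/output/")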

Create SQL table from parquet files

I am using R to handle large datasets (largest dataframe: 30,000,000 x 120). These are stored in Azure Data Lake Storage as parquet files, and we need to query them daily and load them into a local SQL database. Parquet files can be read without loading the data into memory, which is handy. However, creating SQL tables from parquet files is more challenging, as I'd prefer not to load the data into memory.
Here is the code I used. Unfortunately, this is not a perfect reprex, as the SQL database needs to exist for this to work.
# load packages
library(tidyverse)
library(arrow)
library(sparklyr)
library(DBI)
# Create test data
test <- data.frame(matrix(rnorm(20), nrow = 10))
# Save as parquet file
f <- tempfile(fileext = ".parquet")
write_parquet(test, f)
# Load main table (memory = FALSE keeps the data out of memory)
sc <- spark_connect(master = "local", spark_home = spark_home_dir())
test <- spark_read_parquet(sc, name = "test_main", path = f, memory = FALSE, overwrite = TRUE)
# Save into SQL table
DBI::dbWriteTable(conn = connection,
                  name = DBI::Id(schema = "schema", table = "table"),
                  value = test)
Is it possible to write a SQL table without loading parquet files into memory?
I lack experience with T-SQL bulk import and export, but this is likely where you'll find your answer.
library(arrow)
library(DBI)
test <- data.frame(matrix(rnorm(20), nrow = 10))
f <- tempfile(fileext = '.parquet')
write_parquet(test, f)
# Upload the table using bulk insert
dbExecute(connection,
          paste0("
            BULK INSERT [database].[schema].[table]
            FROM '", gsub('\\\\', '/', f), "' WITH (FORMAT = 'PARQUET');
          ")
)
Here I use T-SQL's own BULK INSERT command.
Disclaimer: I have not yet used this command in T-SQL myself, so it may be riddled with errors. For example, I can't see a place to specify snappy compression within the documentation, although it can be specified if one instead defines a custom file format with CREATE EXTERNAL FILE FORMAT.
Now, the above only inserts into an existing table. For your specific case, where you'd like to create a new table from the file, you would more likely be looking for OPENROWSET combined with CREATE TABLE ... AS [select statement]:
# column_defs: a named list or vector mapping each column name to its SQL
# data type, e.g. column_defs <- c(X1 = 'float', X2 = 'float')
column_definition <- paste(names(column_defs), column_defs, collapse = ', ')
dbExecute(connection,
          paste0("CREATE TABLE MySqlTable
                  AS
                  SELECT *
                  FROM OPENROWSET(
                      BULK '", f, "', FORMAT = 'PARQUET'
                  ) WITH (", column_definition, ");")
)
where column_defs would be a named list or vector giving the SQL data-type definition for each column. A (more or less) complete translation from R data types to T-SQL data types is available on the T-SQL documentation page (note two very necessary translations: Date and POSIXlt are not present). Once again, a disclaimer: my time in T-SQL did not get as far as BULK INSERT or similar.

Is it possible to change the delimiter of AWS athena output file

Here is my sample code, where I create a file in an S3 bucket using AWS Athena. The file is in CSV format by default. Is there a way to change it to a pipe delimiter?
import json
import boto3

def lambda_handler(event, context):
    s3 = boto3.client('s3')
    client = boto3.client('athena')
    # Start query execution
    response = client.start_query_execution(
        QueryString="""
            select * from srvgrp
            where category_code = 'ACOMNCDU'
        """,
        QueryExecutionContext={
            'Database': 'tmp_db'
        },
        ResultConfiguration={
            'OutputLocation': 's3://tmp-results/athena/'
        }
    )
    queryId = response['QueryExecutionId']
    print('Query id is: ' + str(queryId))
There is a way to do that with a CTAS query, BUT:
This is a hacky way and not what CTAS queries are supposed to be used for, since it will also create a new table definition in the AWS Glue Data Catalog.
I'm not sure about performance.
CREATE TABLE "UNIQU_PREFIX__new_table"
WITH (
format = 'TEXTFILE',
external_location = 's3://tmp-results/athena/__SOMETHING_UNIQUE__',
field_delimiter = '|',
bucketed_by = ARRAY['__SOME_COLUMN__'],
bucket_count = 1
) AS
SELECT *
FROM srvgrp
WHERE category_code = 'ACOMNCDU'
Note:
It is important to set bucket_count = 1, otherwise Athena will create multiple files.
The name of the table in CREATE TABLE ... should also be unique, e.g. use a timestamp prefix/suffix, which you can inject at Python runtime.
The external location should be unique, e.g. use a timestamp prefix/suffix, which you can inject at Python runtime. I would advise embedding the table name into the S3 path.
You need to include in bucketed_by only one of the columns from the SELECT.
At some point you will need to clean up the AWS Glue Data Catalog of all the table definitions that were created in this way. A sketch putting these notes together follows below.
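Here is how those notes could be applied from the same Lambda, injecting a unique, timestamped table name and external location at runtime (the table-name prefix below is made up for illustration):

import time
import boto3

client = boto3.client('athena')

# Unique suffix injected at runtime, as recommended above (prefix is hypothetical)
suffix = time.strftime('%Y%m%d%H%M%S')
table_name = 'tmp_export_' + suffix

ctas_query = f'''
CREATE TABLE "{table_name}"
WITH (
    format = 'TEXTFILE',
    external_location = 's3://tmp-results/athena/{table_name}',
    field_delimiter = '|',
    bucketed_by = ARRAY['category_code'],
    bucket_count = 1
) AS
SELECT * FROM srvgrp WHERE category_code = 'ACOMNCDU'
'''

response = client.start_query_execution(
    QueryString=ctas_query,
    QueryExecutionContext={'Database': 'tmp_db'},
    ResultConfiguration={'OutputLocation': 's3://tmp-results/athena/'}
)
print('Query id is: ' + response['QueryExecutionId'])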

SparkSQL Staging Table Row Count vs Hive Row count

I am attempting to extract data from Cassandra into a specific partitioned Hive table using Spark 2.1.1 on Hadoop 2.7. To do this, I read all the data from Cassandra into an RDD, transform it into a DataFrame via rdd.toDF(), and pass it into the following function:
def writeToHive(ss: SparkSession, df: DataFrame): Unit = {
  df.createOrReplaceTempView(tablename)
  val cols = df.columns
  val schema = df.schema
  // logs 358
  LOG.info(ss.sql(s"SELECT COUNT(*) FROM ${tablename}").first().toString)
  val outdf = ss.sql(s"""INSERT INTO TABLE ${db}.${t} PARTITION (date="${destPartition}") SELECT * FROM ${tablename}""")
  // Have also tried the following lines instead, but they yielded the same results
  // var dfInput_1 = dfInput.withColumn("region", lit(s"${destPartition}"))
  // dfInput_1.write.mode("append").insertInto(s"${db}.${t}")
  // logs 358
  LOG.info(ss.sql(s"SELECT COUNT(*) FROM ${tablename}").first().toString)
  // logs 423
  LOG.info(ss.sql(s"SELECT COUNT(*) FROM ${db}.${t} WHERE date='${destPartition}'").first().toString)
}
When looking in Cassandra, there are indeed 358 rows in the table. I saw this post on Hortonworks https://community.hortonworks.com/questions/51322/count-msmatch-while-using-the-parquet-file-in-spar.html but there doesn't seem to be a solution. I have tried setting spark.sql.hive.metastorePartitionPruning to true, but no changes were seen in the row counts.
Would love any feedback as to why there is a discrepancy between the row counts. Thanks!
EDIT: bad data coming in.... should've seen that coming
Sometimes data contains multi-byte, non-ASCII characters, such as Japanese or Chinese text. Check whether your data contains any such characters.
If that is the case, store the table in ORC format. By default the table format is text, and text does not handle such characters reliably.
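For example, the target table can be declared as ORC when it is created; a minimal sketch in PySpark (the column list is hypothetical and must match the dataframe's actual schema, while db, t and the date partition come from the question):

spark.sql("""
    CREATE TABLE IF NOT EXISTS db.t (
        id STRING,      -- hypothetical columns; replace with the real schema
        value DOUBLE
    )
    PARTITIONED BY (date STRING)
    STORED AS ORC
""")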

PySpark and HIVE/Impala

I want to build a classification model in PySpark. My input to this model is the result of a select query or view from Hive or Impala. Is there any way to include this query in the PySpark code itself, instead of storing the result in a text file and feeding that to the model?
Yes, for this you need to use a HiveContext created from the SparkContext.
Here is an example:
from pyspark.sql import HiveContext

sqlContext = HiveContext(sc)
tableData = sqlContext.sql("SELECT * FROM TABLE")
# tableData is a dataframe holding a reference to the table; check its schema with tableData.printSchema()
tableData.collect()  # collect executes the query and returns all rows
Or you may refer to the Spark SQL programming guide:
https://spark.apache.org/docs/1.6.0/sql-programming-guide.html
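Note that HiveContext belongs to the Spark 1.x API. On Spark 2.x and later, the same query can be run through a SparkSession with Hive support enabled, which replaces HiveContext (the app name below is arbitrary):

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("hive-query-example") \
    .enableHiveSupport() \
    .getOrCreate()

tableData = spark.sql("SELECT * FROM TABLE")  # returns the same kind of dataframe as above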