Synapse Spark - Delta Lake configs for schema evolution and write optimizations - azure-synapse

I am looking for the Databricks-equivalent properties in Synapse Spark. Please let me know if there are any, or a workaround for the same.
I am using the MERGE command to insert/update the data. However, it does not support schema merging. Is there any property to enable auto merge?
spark.conf.set("spark.databricks.delta.schema.autoMerge.enabled", "true") in Databricks
How do I control the number of part files or optimize writes with the Delta MERGE command?
set spark.databricks.delta.properties.defaults.autoOptimize.optimizeWrite = true;
set spark.databricks.delta.properties.defaults.autoOptimize.autoCompact = true;
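For reference, a minimal PySpark sketch of how these settings are applied around a Delta MERGE. The conf names are the Databricks ones quoted above; whether the Delta Lake build bundled with Synapse Spark honours them is exactly what this question asks. The paths and join key are hypothetical placeholders.

# Hedged sketch: apply the session-level settings quoted above before a Delta MERGE.
# Whether Synapse Spark's Delta build honours these Databricks conf names is the open question here.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Schema evolution for MERGE (note: no stray space inside the key name)
spark.conf.set("spark.databricks.delta.schema.autoMerge.enabled", "true")
# Write-optimization defaults for newly created Delta tables
spark.conf.set("spark.databricks.delta.properties.defaults.autoOptimize.optimizeWrite", "true")
spark.conf.set("spark.databricks.delta.properties.defaults.autoOptimize.autoCompact", "true")

updates = spark.read.format("delta").load("/staging/updates")   # hypothetical path
target = DeltaTable.forPath(spark, "/warehouse/target")         # hypothetical path

(target.alias("t")
       .merge(updates.alias("s"), "t.id = s.id")                # hypothetical join key
       .whenMatchedUpdateAll()
       .whenNotMatchedInsertAll()
       .execute())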

Related

How to use CETAS (Synapse Serverless Pool) in dbt?

In a Synapse Serverless Pool, I can use CETAS to create an external table and export the results to Azure Data Lake Storage.
CREATE EXTERNAL TABLE external_table
WITH (
LOCATION = 'location/',
DATA_SOURCE = staging_zone,
FILE_FORMAT = SynapseParquetFormat
)
AS
SELECT * FROM table
It will create an external table named external_table in Synapse and write a parquet file to my staging zone in Azure Data Lake.
How can I do this in dbt?
I was trying to do something very similar and run my dbt project with a Synapse Serverless Pool, but ran into several issues. Ultimately I was misled by CETAS. When you create the external table it creates a folder hierarchy in which it places the parquet file. If you were to run the same script as the one in your example, it would fail because you cannot overwrite with CETAS. So dbt would be able to run it like any other model, but it wouldn't be easy to overwrite. Maybe you could dynamically create a new parquet file every time the script is run and delete the old one, but that seems like putting a small bandage on the hemorrhaging wound that is the Synapse and serverless pool interaction. I had to switch up my architecture for this reason.
I was trying to export as parquet to maintain the column data types and descriptions so I didn't have to re-schematize, and also so I could create tables based on incremental points in my pipeline. I ended up finding a way to pull from a database that already had the data type schemas, using the dbt-synapse adapter. Then, if I needed an incremental table, I could materialize it as a table via dbt and dbt-synapse and access it that way.
What is your goal with the exported parquet file?
Maybe we can find another solution?
Here's the dbt-synapse-serverless adapter GitHub repo, where it lists the caveats for serverless pools.
I wrote a materialization for CETAS (Synapse Serverless Pool) here: https://github.com/intheroom/dbt-synapse-serverless
It's forked from dbt-synapse-serverless here: https://github.com/dbt-msft/dbt-synapse-serverless
Also you can use hooks in dbt to use CETAS.

Storing data obtained from Cassandra in Spark memory and making it available to other Spark Job Server jobs in the same context

I am using Spark Job Server and Spark SQL to get data from a Cassandra table as follows:
public Object runJob(JavaSparkContext jsc, Config config) {
    // Build a Cassandra-aware SQLContext on top of the job server's shared context
    CassandraSQLContext sq = new CassandraSQLContext(JavaSparkContext.toSparkContext(jsc));
    sq.setKeyspace("rptavlview");
    DataFrame vadevent = sq.sql("SELECT username,plan,plate,ign,speed,datetime,odo,gd,seat,door,ac FROM rptavlview.vhistory");
    // Register and cache the full result so later queries can reuse it
    vadevent.registerTempTable("history");
    sq.cacheTable("history");
    // params is assumed to be parsed from the job's Config (not shown in the original snippet)
    DataFrame vadevent1 = sq.sql("SELECT plate,ign,speed,datetime FROM history WHERE username='" + params[0] + "' AND plan='" + params[1] + "'");
    long count = vadevent.rdd().count();
    return count;  // the original snippet was missing a return value
}
But I am getting a "table not found: history" error.
Can anybody explain how to cache Cassandra data in Spark memory and reuse the same data, either across concurrent requests of the same job or as two jobs, one for caching and the other for querying?
I am using DSE 5.0.4, so the Spark version is 1.6.1.
You can allow Spark jobs to share the state of other contexts. This link goes more in depth.

SQL 2016 PolyBase Compute Pushdown to Hadoop HDI that uses WASBS aka Azure Blob

We have an Azure Hadoop HDI system where most of the files are stored in an Azure Storage Account Blob. Accessing the files from Hadoop requires the WASBS:// file system type.
I want to configure SQL 2016 PolyBase to push down compute to the HDI cluster for certain queries against data stored in the Azure blobs.
It is possible to use Azure Blobs outside Hadoop in PolyBase. I completely understand that the query hint "option (FORCE EXTERNALPUSHDOWN)" will not work on the Blob system.
Is it possible to configure an external data source to use HDI for compute on the blob?
A typical external data source configuration is:
CREATE EXTERNAL DATA SOURCE AzureStorage with (
TYPE = HADOOP,
LOCATION ='wasbs://clustername@storageaccount.blob.core.windows.net',
CREDENTIAL = AzureStorageCredential
);
I believe that as long as WASBS is in there, pushdown compute will not work.
If I change the above to use HDFS, then I can certainly point to my HDI cluster, but then what would the LOCATION for the EXTERNAL TABLE be?
If this is in WASBS, then how would it be found in HDFS?
LOCATION='/HdiSamples/HdiSamples/MahoutMovieData/'
Surely there is a way to get PolyBase to push down compute to an HDI cluster where the files are in WASBS. If not, then PolyBase does not support the most common and recommended way to set up HDI.
I know the above is a lot to consider and any help is appreciated. If you are really sure it is not possible, just answer NO. Please remember, though, that I realize PolyBase operating on Azure Blobs directly cannot push down compute. I want PolyBase to connect to HDI and let HDI compute on the blob.
EDIT
Consider the following setup in Azure with HDI.
Note that the default Hadoop file system is WASBS. That means a relative path such as /HdiSamples/HdiSamples/MahoutMovieData/user-ratings.txt will resolve to wasbs://YourClusterName@YourStorageAccount.blob.core.windows.net/HdiSamples/HdiSamples/MahoutMovieData/user-ratings.txt.
CREATE EXTERNAL DATA SOURCE HadoopStorage with (
TYPE = HADOOP,
LOCATION ='hdfs://172.16.1.1:8020',
RESOURCE_MANAGER_LOCATION = '172.16.1.1:8050',
CREDENTIAL = AzureStorageCredential
);
CREATE EXTERNAL TABLE [user-ratings] (
Field1 bigint,
Field2 bigint,
Field3 bigint,
Field4 bigint
)
WITH ( LOCATION='/HdiSamples/HdiSamples/MahoutMovieData/user-ratings.txt',
DATA_SOURCE = HadoopStorage,
FILE_FORMAT = [TabFileFormat]
);
There are many rows in the file in Hadoop. Yet, this query returns 0.
select count(*) from [user-ratings]
When I check the Remote Query Execution plan, it shows:
<external_uri>hdfs://172.16.1.1:8020/HdiSamples/HdiSamples/MahoutMovieData/user-ratings.txt</external_uri>
Notice the URI is an absolute path and is set to HDFS based on the External Data Source.
The query succeeds and returns zero because it is looking for a file/path that does not exist in the HDFS file system. A "table not found" error is not returned when there are no files at the location; that is normal. What is bad is that the real table is stored in WASBS and has many rows.
What this all means is that pushdown compute is not supported when using Azure Blobs as the Hadoop default file system. The recommended setup is to use Azure Blobs so that storage is separate from compute. It makes no sense that PolyBase would not support this setup, but as of now it appears not to.
I will leave this question up in case I am wrong. I really want to be wrong.
If you want PolyBase to push down computation to any Hadoop/HDI cluster, you need to specify RESOURCE_MANAGER_LOCATION while creating the external data source. The RESOURCE_MANAGER_LOCATION tells SQL Server where to submit the MapReduce job.

Does Parquet predicate pushdown work on S3 using Spark (non-EMR)?

Just wondering if Parquet predicate pushdown also works on S3, not only HDFS, specifically if we use Spark (non-EMR).
Further explanation might be helpful, since it may involve some understanding of distributed file systems.
I was wondering this myself so I just tested it out. We use EMR clusters and Spark 1.6.1.
I generated some dummy data in Spark and saved it as a parquet file locally as well as on S3.
I created multiple Spark jobs with different kind of filters and column selections. I ran these tests once for the local file and once for the S3 file.
I then used the Spark History Server to see how much data each job had as input.
Results:
For the local parquet file: the results showed that the column selection and filters were pushed down to the read, since the input size was reduced when the job contained filters or a column selection.
For the S3 parquet file: the input size was always the same as for the Spark job that processed all of the data. None of the filters or column selections were pushed down to the read; the parquet file was always loaded completely from S3, even though the query plan (.queryExecution.executedPlan) showed that the filters were pushed down.
I will add more details about the tests and results when I have time.
Yes. Filter pushdown does not depend on the underlying file system. It only depends on the spark.sql.parquet.filterPushdown and the type of filter (not all filters can be pushed down).
See https://github.com/apache/spark/blob/v2.2.0/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L313 for the pushdown logic.
Here are the keys I'd recommend for s3a work (a sketch applying them follows below):
spark.sql.parquet.filterPushdown true
spark.sql.parquet.mergeSchema false
spark.hadoop.parquet.enable.summary-metadata false
spark.sql.orc.filterPushdown true
spark.sql.orc.splits.include.file.footer true
spark.sql.orc.cache.stripe.details.size 10000
spark.sql.hive.metastorePartitionPruning true
For committing the work, use the S3A "zero-rename committer" (Hadoop 3.1+) or the EMR equivalent. The original FileOutputCommitters are slow and unsafe.
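A minimal hedged sketch of applying those keys when building a SparkSession; the conf names are the ones listed above, and the bucket path is a hypothetical placeholder:

# Hedged sketch: set the recommended s3a-related keys at session-build time.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("s3a-parquet-pushdown")
         .config("spark.sql.parquet.filterPushdown", "true")
         .config("spark.sql.parquet.mergeSchema", "false")
         .config("spark.hadoop.parquet.enable.summary-metadata", "false")
         .config("spark.sql.orc.filterPushdown", "true")
         .config("spark.sql.orc.splits.include.file.footer", "true")
         .config("spark.sql.orc.cache.stripe.details.size", "10000")
         .config("spark.sql.hive.metastorePartitionPruning", "true")
         .getOrCreate())

# Read a Parquet dataset from S3 through the s3a connector (hypothetical location)
df = spark.read.parquet("s3a://my-bucket/events/")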
Recently I tried this with Spark 2.4 and it seems like predicate pushdown works with S3.
This is the spark sql query:
explain select * from default.my_table where month = '2009-04' and site = 'http://jdnews.com/sports/game_1997_jdnsports__article.html/play_rain.html' limit 100;
And here is part of the output:
PartitionFilters: [isnotnull(month#6), (month#6 = 2009-04)], PushedFilters: [IsNotNull(site), EqualTo(site,http://jdnews.com/sports/game_1997_jdnsports__article.html/play_ra...
This clearly states that PushedFilters is not empty.
Note: the table used was created on top of AWS S3.
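For comparison, a small hedged sketch of the same check through the DataFrame API instead of a SQL EXPLAIN; the table name and filter values are the ones from the query above, and the plan output is not reproduced here:

# Hedged sketch: inspect the physical plan for PartitionFilters / PushedFilters.
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

df = (spark.table("default.my_table")
           .where("month = '2009-04'")
           .where("site = 'http://jdnews.com/sports/game_1997_jdnsports__article.html/play_rain.html'")
           .limit(100))

df.explain(True)   # the extended plan should list the site equality under PushedFilters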
Spark uses the same Parquet and S3 client libraries as it does for HDFS, so the same logic works.
(And in Spark 1.6 they added an even faster shortcut for flat-schema Parquet files.)

Can Spark SQL be executed against Hive tables without any Map/Reduce (/Yarn) running?

It is my understanding that Spark SQL reads HDFS files directly - no need for M/R here. Specifically, none of the MapReduce-based Hadoop InputFormat/OutputFormat classes are employed (except in special cases like HBase).
So are there any built-in dependencies on a functioning Hive server? Or is it only required to have:
a) Spark Standalone
b) HDFS and
c) Hive metastore server running
i.e. YARN/MRv1 are not required?
The Hadoop-related I/O formats for accessing Hive files seem to include:
TextInput/Output Format
ParquetFileInput/Output Format
Can Spark SQL/Catalyst read Hive tables stored in those formats with only the Hive Metastore server running?
Yes.
The Spark SQL Readme says:
Hive Support (sql/hive) - Includes an extension of SQLContext called HiveContext that allows users to write queries using a subset of HiveQL and access data from a Hive Metastore using Hive SerDes. There are also wrappers that allows users to run queries that include Hive UDFs, UDAFs, and UDTFs.
This is implemented by depending on Hive libraries for reading the data. But the processing happens inside Spark. So no need for MapReduce or YARN.
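For illustration, a minimal hedged PySpark sketch of that setup: a standalone master, HDFS, and a Hive metastore service, with no YARN or MapReduce involved. It uses the Spark 2.x+ SparkSession API for brevity (on Spark 1.x the HiveContext mentioned in the quoted README plays the same role); the master URL, metastore URI, and table name are hypothetical placeholders.

# Hedged sketch: Spark SQL reading a Hive-registered table with only a standalone
# master, HDFS, and the Hive metastore service running -- no YARN or MapReduce.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("spark://standalone-master:7077")                       # hypothetical standalone master
         .config("hive.metastore.uris", "thrift://metastore-host:9083")  # normally read from hive-site.xml
         .enableHiveSupport()
         .getOrCreate())

# The table's location and format (text, Parquet, ...) come from the metastore;
# the data files themselves are read directly from HDFS by the Spark executors.
spark.sql("SELECT count(*) FROM some_hive_table").show()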