CETAS times out for large tables in Synapse Serverless SQL

I'm trying to create a new external table with a CETAS statement (CREATE EXTERNAL TABLE AS SELECT * FROM <table>) from an already existing external table in an Azure Synapse Serverless SQL pool. The table I'm selecting from is a very large external table built on around 30 GB of Parquet data stored in ADLS Gen2 storage, but the query always times out after about 30 minutes. I've tried using premium storage and also tried most, if not all, of the suggestions made here, but it didn't help and the query still times out.
The error I get in Synapse Studio is:
Statement ID: {550AF4B4-0F2F-474C-A502-6D29BAC1C558} | Query hash: 0x2FA8C2EFADC713D | Distributed request ID: {CC78C7FD-ED10-4CEF-ABB6-56A3D4212A5E}. Total size of data scanned is 0 megabytes, total size of data moved is 0 megabytes, total size of data written is 0 megabytes. Query timeout expired.
The core use case is that, given only the external table name, I want to create a copy in Azure storage itself of the data over which that external table is built.
Is there a way to resolve this timeout issue or a better way to solve the problem?
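For reference, the statement I'm running is shaped roughly like this (the table, data source, file format, and location names below are placeholders, not my real ones):

CREATE EXTERNAL TABLE dbo.MyTableCopy               -- placeholder target table
WITH (
    LOCATION    = 'copy/mytable/',                  -- target folder in ADLS Gen2 (placeholder)
    DATA_SOURCE = MyAdlsDataSource,                 -- existing external data source (placeholder)
    FILE_FORMAT = ParquetFileFormat                 -- existing Parquet external file format (placeholder)
)
AS
SELECT * FROM dbo.MyLargeExternalTable;             -- the ~30 GB source external table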

This is a limitation of Serverless.
Query timeout expired
The error Query timeout expired is returned if the query executed for more than 30 minutes on the serverless SQL pool. This is a limit of serverless SQL pool that cannot be changed. Try to optimize your query by applying best practices, or try to materialize parts of your queries using CETAS. Check whether there is a concurrent workload running on the serverless pool, because the other queries might take the resources. In that case you might split the workload over multiple workspaces.
Self-help for serverless SQL pool - Query Timeout Expired
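As a rough sketch of the "materialize parts of your queries using CETAS" suggestion: instead of one CETAS over the whole 30 GB, you could run several smaller CETAS statements, each scanning only a slice of the source and writing to its own folder. The filter column year and all object names below are assumptions about your data:

-- Each statement scans only part of the source, so it has a better chance
-- of finishing within the 30-minute limit.
CREATE EXTERNAL TABLE dbo.MyTableCopy_2021
WITH (LOCATION = 'copy/mytable/2021/', DATA_SOURCE = MyAdlsDataSource, FILE_FORMAT = ParquetFileFormat)
AS SELECT * FROM dbo.MyLargeExternalTable WHERE year = 2021;

CREATE EXTERNAL TABLE dbo.MyTableCopy_2022
WITH (LOCATION = 'copy/mytable/2022/', DATA_SOURCE = MyAdlsDataSource, FILE_FORMAT = ParquetFileFormat)
AS SELECT * FROM dbo.MyLargeExternalTable WHERE year = 2022;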
The core use case is that, given only the external table name, I want to create a copy in Azure storage itself of the data over which that external table is built.
That's simple to do with a Data Factory copy job, a Spark job, or AzCopy.

Related

Azure SQL Serverless inbuilt Pool Column/Field Limitations

We have created a SQL Database from our Azure SQL Serverless Pool. We have a table that has over 450 fields.
Whenever we try to extract the table with all the fields the query times out and produces the following error:
Msg 15884, Level 16, State 1, Line 2
Query timeout expired.
However, when we try to extract just a few fields, it successfully gives us all the rows.
Therefore, can someone let me know if there are any limitations on the number of fields when extracting tables from an Azure SQL serverless pool?
Msg 15884, Level 16, State 1, Line 2
Query timeout expired.
This error occurs because the SQL query takes a long time to execute. Unfortunately, timeout settings cannot be modified in the Synapse serverless SQL pool. The solution is to either optimize the query or optimize the data stored in external storage.
Below are some points for better performance.
Store data in Parquet format rather than CSV or JSON. Parquet is a columnar format, and the same data takes less space than when stored as CSV or JSON.
Do not use the storage account for other workloads during query execution.
To query a large amount of data, use Azure Data Studio or SQL Server Management Studio rather than Azure Synapse Studio.
Make sure the Synapse serverless SQL pool and the storage account are in the same region.
Refer to the Microsoft document Best practices for serverless SQL pool - Azure Synapse Analytics.
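One way to act on the first point (and on your observation that extracting fewer fields works) is to materialize just the columns your extracts actually need into Parquet with CETAS, then query that smaller copy. Every name, path, and column below is a placeholder:

-- Materialize only the needed fields as Parquet once, then point queries at the copy.
CREATE EXTERNAL TABLE dbo.WideTable_Slim
WITH (
    LOCATION    = 'slim/widetable/',
    DATA_SOURCE = MyDataSource,
    FILE_FORMAT = ParquetFileFormat
)
AS
SELECT CustomerId, OrderDate, TotalAmount     -- the handful of fields actually used
FROM dbo.WideExternalTable;                   -- the 450-field source table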

Azure Synapse pipeline: How to move incremental updates from SQL Server into synapse for crunching numbers

We are working on building a new data pipeline for our project, and we have to move the incremental updates that happen throughout the day on our SQL Servers into Azure Synapse for some number crunching.
We have to get the updates that occur across 60+ tables (1-2 million updates a day) into Synapse to compute aggregates and statistics as they happen throughout the day.
One of the requirements is being near real time, and a bulk import into Synapse is not ideal because a full compute over all the data takes more than 10 minutes.
I have been reading about a CDC feed into Synapse (https://learn.microsoft.com/en-us/azure/data-factory/tutorial-incremental-copy-change-data-capture-feature-portal) and it is one possible solution.
I'm wondering if there are other alternatives or suggestions for achieving the end goal of near-real-time crunching of DB updates.
Change Data Capture (CDC) is a suitable way to capture the changes and load them into the destination (storage/database).
Apart from that, you can also use a watermark column to capture the changes across multiple tables in SQL Server.
Select one column for each table in the source data store which you can use to identify the new or updated records for every run. Normally, the data in this selected column (for example, last_modify_time or ID) keeps increasing when rows are created or updated. The maximum value in this column is used as a watermark.
A high-level solution diagram and a step-by-step approach are given in the official document Incrementally load data from multiple tables in SQL Server to Azure SQL Database using PowerShell.
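A rough T-SQL sketch of the watermark pattern (the Orders table, LastModifyTime column, and the watermark store are assumptions used for illustration):

-- 1) Read the last watermark, 2) pull only the rows changed since then,
-- 3) save the new maximum as the watermark for the next run.
DECLARE @old_watermark DATETIME2 =
    (SELECT WatermarkValue FROM dbo.WatermarkTable WHERE TableName = 'dbo.Orders');

SELECT *
FROM dbo.Orders
WHERE LastModifyTime > @old_watermark;

UPDATE dbo.WatermarkTable
SET WatermarkValue = (SELECT MAX(LastModifyTime) FROM dbo.Orders)
WHERE TableName = 'dbo.Orders';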

SQL external table takes a long time when selecting and inserting data into a temp table

I am using an external table from the main database in a data warehouse database. Selecting data from the external table into a # (temp) table takes almost 9 minutes, and sometimes longer. How can I improve the performance of the external table?
One option is to use T-SQL that performs the query in the external database and returns only the data required. The filter is applied first in the external database, and then only the filtered data is received by the local database.
When you enable the query's Actual Execution Plan option, you can see that the query Select * from PerformanceVarcharNVarchar brings the data from the external database into the local database first, and only then does the engine apply the filter.
Here is the official Microsoft documentation: EXECUTE (Transact-SQL) | Docs
Alternatively, you can use Azure SQL Data Sync: SQL Data Sync is a service built on Azure SQL Database that lets you synchronize selected data bidirectionally across multiple databases, both on-premises and in the cloud.
The original post has more detailed insights: Lesson Learned #56: External tables and performance issues | Tech Community
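A minimal sketch of the remote-execution idea, assuming an external data source named RemoteMainDb that supports pass-through execution (EXECUTE ... AT DATA_SOURCE on SQL Server 2019+; sp_execute_remote plays the same role on Azure SQL Database) and illustrative column and filter names:

-- Only the filtered rows cross the wire; the WHERE clause runs on the remote side.
CREATE TABLE #FilteredRows (Id INT, SomeValue NVARCHAR(100));

INSERT INTO #FilteredRows (Id, SomeValue)
EXECUTE (N'SELECT Id, SomeValue
           FROM dbo.PerformanceVarcharNVarchar
           WHERE LoadDate >= DATEADD(DAY, -1, GETUTCDATE())')
    AT DATA_SOURCE RemoteMainDb;   -- data source name and filter column are assumptions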

ERROR : FAILED: Error in acquiring locks: Error communicating with the metastore org.apache.hadoop.hive.ql.lockmgr.LockException

Getting "Error in acquiring locks" when trying to run count(*) on partitioned tables.
The table has 365 partitions; when the query is filtered to <= 350 partitions, it works fine.
When more partitions are included in the query, it fails with the error above.
We are working on Hive-managed ACID tables, with the following default values:
hive.support.concurrency=true // cannot set this to false; it throws "<table> is missing from the ValidWriteIdList config: null"; it should be true for ACID reads and writes
hive.lock.manager=org.apache.hadoop.hive.ql.lockmgr.zookeeper.ZooKeeperHiveLockManager
hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager
hive.txn.strict.locking.mode=false
hive.exec.dynamic.partition.mode=nonstrict
Tried increasing/decreasing the values of the following settings within a beeline session:
hive.lock.numretries
hive.unlock.numretries
hive.lock.sleep.between.retries
hive.metastore.batch.retrieve.max={default 300} //changed to 10000
hive.metastore.server.max.message.size={default 104857600} // changed to 10485760000
hive.metastore.limit.partition.request={default -1} //did not change as -1 is unlimited
hive.lock.query.string.max.length={default 10000} //changed to higher value
We are using an HDI 4.0 Interactive Query (LLAP) cluster; the metastore is backed by the default SQL Server database provided with it.
The problem is NOT due to the service tier of the Hive metastore database.
Based on the symptom, it is most probably due to too many partitions in one query.
I have met the same issue several times.
In hivemetastore.log, you should be able to see an error like this:
metastore.RetryingHMSHandler: MetaException(message:Unable to update transaction database com.microsoft.sqlserver.jdbc.SQLServerException: The incoming request has too many parameters. The server supports a maximum of 2100 parameters. Reduce the number of parameters and resend the request.
at com.microsoft.sqlserver.jdbc.SQLServerException.makeFromDatabaseError(SQLServerException.java:254)
at com.microsoft.sqlserver.jdbc.SQLServerStatement.getNextResult(SQLServerStatement.java:1608)
at com.microsoft.sqlserver.jdbc.SQLServerPreparedStatement.doExecutePreparedStatement(SQLServerPreparedStatement.java:578)
This is because, in the Hive metastore, each partition involved in the Hive query requires at most 8 parameters to acquire a lock.
Some possible workarounds:
Decompose the query into multiple sub-queries that read from fewer partitions (see the sketch after this list).
Reduce the number of partitions by setting different partition keys.
Remove partitioning if partition keys don't have any filters.
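For the first workaround, a rough sketch of decomposing a count(*) (the partition column dt and the date boundaries are assumptions about your schema):

-- Run counts over disjoint partition ranges and add up the results,
-- so each query needs locks on far fewer partitions at a time.
SELECT COUNT(*) FROM my_acid_table WHERE dt >= '2021-01-01' AND dt < '2021-07-01';
SELECT COUNT(*) FROM my_acid_table WHERE dt >= '2021-07-01' AND dt < '2022-01-01';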
The following parameters manage the batch size for the INSERT queries generated by direct SQL. Their default value is 1000. Set both of them to 100 (as a good starting point) in the Custom hive-site section of the Hive configs via Ambari, and restart ALL Hive-related components (including the Hive metastore).
hive.direct.sql.max.elements.values.clause=100
hive.direct.sql.max.elements.in.clause=100
We also faced the same error in HDInsight, and after making many configuration changes similar to what you have done, the only thing that worked was scaling up our Hive metastore SQL DB server.
We had to scale it all the way up to the P2 tier with 250 DTUs for our workloads to run without these lock exceptions. As you may know, as the tier and DTU count increase, the SQL server's IOPS and response time improve, so we suspected that metastore performance was the root cause of these lock exceptions as the workload grew.
The following link provides information about DTU-based performance tiers for SQL databases in Azure.
https://learn.microsoft.com/en-us/azure/sql-database/sql-database-service-tiers-dtu
Additionally, as far as I know, the default Hive metastore that gets provisioned when you opt not to provide an external DB at cluster creation is just an S1-tier DB, which is not suitable for any high-capacity workload. At the same time, as a best practice, always provision your metastore external to the cluster and attach it at cluster provisioning time: this gives you the flexibility to connect the same metastore to multiple clusters (so that your Hive-layer schema can be shared across clusters, e.g. Hadoop for ETL and Spark for processing / machine learning), and you retain full control to scale your metastore up or down as needed.
The only way to scale the default metastore is by engaging Microsoft support.
We faced the same issue in HDInsight and solved it by upgrading the metastore.
The default metastore had only 5 DTUs, which is not recommended for production environments. So we migrated to a custom metastore, spun up an Azure SQL Server database (P2, 250 DTUs), and set the properties below:
hive.direct.sql.max.elements.values.clause=200
hive.direct.sql.max.elements.in.clause=200
The above values are set because SQL Server cannot process more than 2,100 parameters. With more than 348 partitions you hit this issue, since each partition creates up to 8 metastore parameters (8 x 348 = 2,784, which exceeds the 2,100 limit).

Azure SQL Server database - Deleting data

I am currently working on a project that is based on:
Azure EventHub1-->Stream Analytics1-->SQL Server DB
Azure EventHub1-->Stream Analytics2-->Document DB
Both SQL Server and DocumentDB have their respective Stream job, but share the same EventHub stream.
DocumentDB is an archive sink and the SQL Server DB is a reporting base that should only house 3 days of data. This is per reporting and query-efficiency requirements.
Daily we receive around 30K messages through Event Hub, which are pushed through the Stream Analytics job (a basic SELECT query, no manipulation) to a SQL Server table.
To keep only 3 days of data, we designed a Logic App that calls a SQL stored procedure that deletes any data, based on date, that is more than 3 days old. It runs every day at 12 am.
There is also another business-rule Logic App that reads from the SQL table to perform business-logic checks. It runs every 5 minutes.
We noticed that, for some strange reason, the Logic App for data deletion isn't working, and over the months data has stacked up to 3 million rows. The SP can be run manually, as tested in the dev setup.
The Logic App shows a Succeeded status, but the SP execution step shows an amber check sign which, when expanded, says 3 tries occurred.
I am not sure why the SP doesn't delete the old data. My understanding is that because the Stream Analytics job keeps pushing data, the DELETE operation in the SP can't get a delete lock and times out.
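For reference, the SP is essentially a date-based DELETE of this shape (the procedure, table, and column names are simplified placeholders, not the real ones):

-- Simplified shape of the daily cleanup procedure called by the Logic App.
CREATE OR ALTER PROCEDURE dbo.PurgeOldEvents
AS
BEGIN
    SET NOCOUNT ON;

    -- Remove everything older than the 3-day reporting window.
    DELETE FROM dbo.EventMessages
    WHERE EventDate < DATEADD(DAY, -3, SYSUTCDATETIME());
END;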
Try using Azure Automation instead. Create a runbook that runs the stored procedure. Here you will find an example and step-by-step procedure.