How to Delete First Row of All Tables in Azure Synapse Serverless Pool - azure-synapse

I have create a number of tables in Azure Synapse Analytics and I would like to remove the first row from each of the tables.
Can someone assist with code that will remove the first row from each table.
I tried the following:
DELETE TOP (1)
FROM [dbo].[MyTable]
I got the error:
DML Operations are not supported with external tables.

External tables are read-only. They are just another abstraction layer to lake.
You can't perform DML operations over those files from Synapse.
What you could do is:
Try to employ REJECT_TYPE and REJECT_VALUE in OPTIONS while creating the table.
Only use Serverless Views so that you can filter some data out
Implement a strategy to re-stage data to a distributed pool table by using CTAS and providing the filter there
General rule of thumb is: only get what you need, deletion is usually slow.
Best,
Onur

Related

In SQL External table take a time for select and insert data into temp table

I am using external table of main database to Datawarehouse database. While selecting data from external table to # table it take almost 9 min, sometime it take more time. How I can improve this performance of external table?
Is to use the following TSQL to perform the query in the external database and get only the data required. The filter will be applied first in the external database, and then the data from the filter will be received by the database.
When you enable the query's Actual Execution Plan option, you can see
that the query : Select * from PerformanceVarcharNVarchar, brings data
from an external database to the temporal database, and then the
engine applies the filter.
Here is the Official Microsoft Documents :EXECUTE (Transact-SQL) | Docs
Else you can use Azure Data Sync : SQL Data Sync is an Azure SQL Database-based service that allows you to synchronize selected data bidirectionally between multiple databases, both on-premises and in the cloud.
The Original Post has got detailed insights: Lesson Learned #56:
External tables and performance issues | techcommunity

Azure Synapse: Performant way to promote data from staging schema to prod schema?

I've used ADF/synapse pipeline to extract my fact data from source to staging schema table in dedicated sql pool.
My task now is to enrich the fact data with a surrogate key, which comes via lookup from a small dimension table already in prod schema.
I've been using ADF/synapse dataflow for this, but it seems inefficient because it's moving the data BACK into ADF. I could do this via spark notebook, but also seems like an unnecessary data movement.
So i think best approach is to make a stored procedure on the dedicated sql pool to perform this enrichment. My concern is to ensure this sql proc is coded in a performant way (not a row by row insert).
There are lots of proc examples that create a table from nothing (ctas), but I haven't found examples that do the enrich/append action in a scalable way. What are good sql code practices for this?
I think the best practice for this would be to have your small dimension distributed as REPLICATE and use CTAS to create a HASH distributed table on a suitable column. Consider partition switching in addition if you have enough volume. A simplified example:
CREATE TABLE fact.yourBigTable
WITH
(
CLUSTERED COLUMNSTORE INDEX,
DISTRIBUTION = HASH( someColumn )
)
AS
SELECT ...
FROM staging. ...
CREATE TABLE dim.yourSmallTable (
...
)
WITH
(
CLUSTERED INDEX ( someColumn ),
DISTRIBUTION = REPLICATE
);
CTAS is optimised to work well on the MPP infrastructure of dedicated SQL pools. If you do not feel CTAS is appropriate look a straightforward INSERT instead. Dedicated SQL pools now support MERGE (in preview) so that may also be worth a look.
I would agree with you not to use ADF or Dataflows for this as there's nothing faster than a bit of SQL running directly on the server and I reserve these things for things you can't do with SQL, eg orchestrations / running tasks in parallel, advanced transform (eg with notebooks) and so on.

ADF - How should I copy table data from source Azure SQL Database to 6 other Azure SQL Databases?

We curate data in the "Dev" Azure SQL Database and then currently use RedGate's Data Compare tool to push up to 6 higher Azure SQL Databases. I am trying to migrate that manual process to ADFv2 and would like to avoid copy/pasting the 10+ copy data actives for each database (x6) to keep it more maintainable for future changes. The static tables have some customization in the copy data activity but the basic idea follows this post to perform an upsert.
How can the implementation described above be done in Azure Data Factory?
I was imagining something like the following:
Using one parameterized link service that has the server name & database name configurable to generate a dynamic connection to Azure SQL Database.
Creating a pipeline for each table's copy data activity.
Creating a master pipeline to then nest each table's pipeline in.
Using variables loop over the different connections an passing those to the sub-pipelines parameters.
Not sure if that is the most efficient plan or even works yet. Other ideas/suggestions?
we can not tell you if that's the most efficient plan. But I think so. Just make it works.
As you said in the comment:
we can use Dynamic Pipelines - Copy multiple tables in Bulk with
'Lookup' & 'ForEach'. we can perform dynamic copies of your data
table lists in bulk within a single pipeline. Lookup returns either
the lists of data or first row of data. ForEach - #activity('Azure
SQL Table lists').output.value ;
#concat(item().TABLE_SCHEMA,'.',item().TABLE_NAME,'.csv') + This is
efficient and cost optimized since we are using less number of
activities and datasets.
In usually, we also will choose same solution with you: dynamic parameter/pipeline, lookup + foreach active to achieve the scenario. In one word, make the pipeline has a strong logic, simple and efficient.
Added the same info mentioned in the Comment as Answer.
Yup, we can use Dynamic Pipelines - Copy multiple tables in Bulk with 'Lookup' & 'ForEach'.
We can perform dynamic copies of your data table lists in bulk within a single pipeline. Lookup returns either the lists of data or first row of data.
ForEach - #activity('Azure SQL Table lists').output.value ;
#concat(item().TABLE_SCHEMA,'.',item().TABLE_NAME,'.csv')
This is efficient and cost optimized since we are using less number of activities and datasets.
Attached pic as ref-

How to manage Schema Drift while streaming to BigQuery sharded table

We are new to BigQuery and are trying to figure out the best way to use it for real time analytics. We are sending a stream of logs from our back-end services to Kafka, and we want to stream those into BigQuery using streaming inserts. For queryability we are both partitioning by time, and sharding tables by event type (for use with wildcard queries). We put a view overtop of the family of tables created so that they look like 1 table and use the _TABLE_SUFFIX (well, when they roll out the feature, for now using UNION ALL) and _PARTITIONTIME columns to reduce the set of rows scanned for queries. So far so good.
What we are unsure of how to handle properly is schema changes. The schema of our log messages changes frequently. Having a manual process to keep BigQuery in sync is not tenable. Ideally our streaming pipeline would detect the change and apply the schema update (for adding columns) or table creation (for adding an event type) as necessary. We have tooling up-stream so that we know all schema updates will be backwards compatible.
My understanding is that all of the shards must have the same schema. How do we apply the schema update in such a fashion that:
We don't break queries that are run during the update.
We don't break streaming inserts.
Is #1 possible? I don't believe we can atomically change the schema of all the sharded tables.
For #2 I presume we have to stop our streaming pipelines while the schema update process is occurring.
Thanks,
--Ben
Wildcard tables with _TABLE_SUFFIX is available https://cloud.google.com/bigquery/docs/querying-wildcard-tables and you can use it even if the schemas of the tables are different, they just need to have compatible schemas. With UNION ALL, you need all the tables to have the same schema so it will not work if you're updating schemas at the same time.
Streaming insert will also work if you only specify a subset of fields. However you cannot add new fields as part of the streaming insert, you'll have to update table first and then insert the data with new schema.

when should we go for a external table and internal table in Hive?

I understand the difference between Internal tables and external tables in hive as below
1) if we drop the internal Table File and metadata will be deleted, however , in case of External only metadata will be
deleted
2) if the file data need to be shared by other tools/applications then we go for external table if not
internal table, so that if we drop the table(external) data will still be available for other tools/applications
I have gone through the answers for question "Difference between Hive internal tables and external tables? "
but still I am not clear about the proper uses cases for Internal Table
so my question is why is that I need to make an Internal table ? why cant I make everything as External table?
Use EXTERNAL tables when:
The data is also used outside of Hive.
For example, the data files are read and processed by an existing program that doesn't lock the files.
The data is permanent i.e used when needed.
Use INTERNAL tables when:
The data is temporary.
You want Hive to completely manage the lifecycle of the table and data.
Let's understand it with two simple scenarios:
Suppose you have a data set, and you have to perform some analytics/problem statements on it. Because of the nature of problem statements, few of them can be done by HiveQL, few of them need Pig Latin and few of them need Map Reduce etc., to get the job done. In this situation External Table comes into picture- the same data set can be used to solve entire analytics instead of having different different copies of same data set for the different different tools. Here Hive don't need authority on the data set because several tools are going to use it.
There can be a scenario, where entire analytics/problem statements can be solved by only HiveQL. In such situation Internal Table comes into picture- Means you can put the entire data set into Hive's Warehouse and Hive is going to have complete authority on the data set.