Azure Synapse External Table - Parameterise location

Is it possible to pass a parameter(date) to the location of the Azure Synapse External Table?
LOCATION='/Windows/Folder/File_name_yyyy-MM-dd.csv'
I want to parameterise the yyyy-MM-dd portion of File_name_yyyy-MM-dd.csv, which changes every day. I tried all of the below, but nothing allows me to create the table due to syntax errors:
LOCATION->'/Windows/Folder/File_name_'format(getdate(),'yyyy-MM-dd')'.csv'
LOCATION->'/Windows/Folder/File_name_'{format(getdate(),'yyyy-MM-dd')}'.csv'
LOCATION->'/Windows/Folder/File_name_'#{format(getdate(),'yyyy-MM-dd')}'.csv'
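For reference, the LOCATION clause of CREATE EXTERNAL TABLE only accepts a string literal, so expressions like the ones above are rejected. A common workaround (not from the original post) is to build the whole DDL statement as dynamic SQL; a minimal sketch, assuming an external data source and file format already exist - the table, column, data source and file format names below are placeholders:
-- Build the dated file name as a string, then execute the CREATE EXTERNAL TABLE dynamically.
DECLARE @location nvarchar(400) =
    '/Windows/Folder/File_name_' + CONVERT(char(10), GETDATE(), 120) + '.csv'; -- yyyy-MM-dd
DECLARE @sql nvarchar(max) =
    'CREATE EXTERNAL TABLE dbo.daily_file (col1 nvarchar(100)) '
  + 'WITH (LOCATION = ''' + @location + ''', '
  + 'DATA_SOURCE = my_data_source, FILE_FORMAT = my_csv_format);';
EXEC (@sql);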

Related

Add a column to a delta table in Azure Synapse

I have a delta table that I created in Azure Synapse using a mapping data flow. The data flow reads append-only changes from Dataverse, finds the latest value, and upserts them to the table.
Now, I'd like to add a column to the delta table. When you select Upsert in a mapping data flow, the Merge schema option is disabled, so it doesn't appear I can use that.
I tried creating a notebook and executing the following SQL, but I get an error.
ALTER TABLE delta.`https://xxxx.dfs.core.windows.net/path/to/table` ADD COLUMNS (mytest STRING)
Error: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.IllegalArgumentException: null path
The path provided is not in the default Synapse container.
How can I alter the table and add the column?
The issue I was running into was that the default Synapse container wasn't present in the storage account. After creating the container, the command executed successfully.
To find the name of the default container:
Open the Azure Portal
Navigate to the Synapse workspace
Click Properties
Check the value of Primary ADLS Gen2 file system
This helped me track down the issue. https://learn.microsoft.com/en-us/answers/questions/706694/unable-to-run-sql-queries-in-azure-synapse-error-o.html
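For reference, once the default container exists, the same ALTER can be run from a Synapse Spark notebook; a hedged sketch using the abfss:// form of the path (the storage account, container and path below are placeholders):
-- Spark SQL in a Synapse notebook; the Delta table is addressed by path.
ALTER TABLE delta.`abfss://mycontainer@myaccount.dfs.core.windows.net/path/to/table`
ADD COLUMNS (mytest STRING);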

How to create a blank "Delta" Lake table schema in Azure Data Lake Gen2 using Azure Synapse Serverless SQL Pool?

I have a file with data integrated from 2 different sources using Azure Mapping Data Flow and loaded into an ADLS Gen2 data lake container/folder, for example: /staging/EDW/Current/products.parquet.
I now need to process this file in staging using Azure Mapping Data Flow and load it into its corresponding dimension table using the SCD type 2 method to maintain history.
However, I want to try creating and processing this dimension table as a "Delta" table in Azure Data Lake using Azure Mapping Data Flow only. SCD type 2 requires a source lookup to check whether any records/rows already exist: if not (say, during the first load), insert them all; if records have changed, apply updates, and so on.
For that, I need to first create a default/blank "Delta" table in an Azure Data Lake folder, for example: /curated/Delta/Dimension/Products/. This is just like what we would have done in Azure SQL DW (Dedicated Pool), where we could first create a blank dbo.dim_products table with just the schema/structure and no rows.
I am trying to implement a DataLake-House architecture by utilizing and evaluating the best features of both Delta Lake and the Azure Synapse Serverless SQL pool using Azure Mapping Data Flow - for performance, cost savings, ease of development (low code) and understanding. At the same time, I want to avoid a Logical Data Warehouse (LDW) kind of architecture implementation for now.
For this, I tried creating a new database under the built-in Azure Synapse Serverless SQL pool, defined a data source and file format, and defined a blank Delta table/schema structure (without any rows), but with no luck.
create database delta_dwh;

create external data source deltalakestorage
with (location = 'https://aaaaaaaa.dfs.core.windows.net/curated/Delta/');

create external file format deltalakeformat
with (format_type = delta);

drop external table products;

create external table dbo.products
(
    product_skey int,
    product_id int,
    product_name nvarchar(max),
    product_category nvarchar(max),
    product_price decimal(38,18),
    valid_from date,
    valid_to date,
    is_active char(1)
)
with
(
    location = 'https://aaaaaaaa.dfs.core.windows.net/curated/Delta/Dimensions/Products',
    data_source = deltalakestorage,
    file_format = deltalakeformat
);
However, this fails since a Delta table/file requires the _delta_log/*.json folder/files to be present to maintain the transaction log. That means I would first have to write a few (dummy) rows in Delta format to the target folder, and only then could I read it and perform queries like the following, used in the SCD type 2 implementation:
select isnull(max(product_skey), 0)
FROM OPENROWSET(
    BULK 'https://aaaaaaaa.dfs.core.windows.net/curated/Delta/Dimensions/Products',
    FORMAT = 'DELTA') as rows
Any thoughts, inputs, or suggestions?
Thanks!
You may try to create an initial/dummy data flow + pipeline to create these empty Delta files.
It's only a simple workaround.
Create a CSV with your sample table data.
Create a data flow named initDelta.
Use this CSV as the source in the data flow.
In the projection panel, set up the correct data types.
Add a filter after the source and set up a dummy condition such as 1=2.
Add a sink with Delta output.
Put your initDelta data flow into a dummy pipeline and run it.
The folder structure for Delta should be created.
You mentioned that your initial data is in a Parquet file. You can use that file instead: the table schema (columns and data types) will be imported from it. Filter out all rows and save the result as Delta.
I think it should work, unless I missed something in your problem.
I don't think you can use the serverless SQL pool to create a Delta table... yet. I think it is coming soon, though.
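If a Spark pool is available in the workspace, another option (not mentioned in the answers above) is to create the empty Delta table from a notebook with Spark SQL, which writes the _delta_log folder immediately even with zero rows; a minimal sketch - the abfss:// location and the use of STRING for is_active are assumptions:
-- Creates an empty Delta table (schema only, no rows) at the given path.
CREATE TABLE products_dim (
    product_skey     INT,
    product_id       INT,
    product_name     STRING,
    product_category STRING,
    product_price    DECIMAL(38,18),
    valid_from       DATE,
    valid_to         DATE,
    is_active        STRING
)
USING DELTA
LOCATION 'abfss://curated@aaaaaaaa.dfs.core.windows.net/Delta/Dimensions/Products';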

Azure SQL External table of Azure Table storage data

Is it possible to create an external table in Azure SQL of the data residing in Azure Table storage?
The answer is no.
I am currently facing a similar issue and this is my research so far:
Azure SQL Database doesn't allow Azure Table Storage as an external data source.
Sources:
https://learn.microsoft.com/en-us/sql/t-sql/statements/create-external-data-source-transact-sql?view=sql-server-2017
https://learn.microsoft.com/en-us/sql/t-sql/statements/create-external-file-format-transact-sql?view=sql-server-2017
https://learn.microsoft.com/en-us/sql/t-sql/statements/create-external-table-transact-sql?view=sql-server-2017
Reason:
The possible data source scenarios are copying from Hadoop (Data Lake/Hive, ...), Blob (text files, CSV) or RDBMS (another SQL Server). Azure Table Storage is not listed.
The possible external data formats are only variations of text files/Hadoop: Delimited Text, Hive RCFile, Hive ORC, Parquet.
Note: even copying from Blob in JSON format requires implementing a custom data format.
Workarounds:
Create a copy pipeline with Azure Data Factory.
Create a copy function/script with Azure Functions using C# and manually transfer the data.
Yes, there are a couple options. Please see the following:
CREATE EXTERNAL TABLE (Transact-SQL)
APPLIES TO: SQL Server (starting with 2016) Azure SQL Database Azure SQL Data Warehouse Parallel Data Warehouse
Creates an external table for PolyBase, or Elastic Database queries. Depending on the scenario, the syntax differs significantly. An external table created for PolyBase cannot be used for Elastic Database queries. Similarly, an external table created for Elastic Database queries cannot be used for PolyBase, etc.
CREATE EXTERNAL DATA SOURCE (Transact-SQL)
APPLIES TO: SQL Server (starting with 2016) Azure SQL Database Azure SQL Data Warehouse Parallel Data Warehouse
Creates an external data source for PolyBase, or Elastic Database queries. Depending on the scenario, the syntax differs significantly. An external data source created for PolyBase cannot be used for Elastic Database queries. Similarly, an external data source created for Elastic Database queries cannot be used for PolyBase, etc.
What is your use case?
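For illustration, the Elastic Database query flavor of these statements, between two Azure SQL databases, looks roughly like this; a hedged sketch in which every server, database, credential and table name is a placeholder:
-- One-time setup in the database that will issue the remote queries.
CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<strong password>';

CREATE DATABASE SCOPED CREDENTIAL remote_cred
WITH IDENTITY = 'remote_login', SECRET = '<password>';

-- Points at the remote Azure SQL database.
CREATE EXTERNAL DATA SOURCE remote_db
WITH (
    TYPE = RDBMS,
    LOCATION = 'remoteserver.database.windows.net',
    DATABASE_NAME = 'RemoteDatabase',
    CREDENTIAL = remote_cred
);

-- Local external table that maps onto a table in the remote database.
CREATE EXTERNAL TABLE dbo.remote_orders (
    order_id   INT,
    order_date DATETIME2
)
WITH (DATA_SOURCE = remote_db);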

Create an Azure Data Factory pipeline to copy new records from DocumentDB to Azure SQL

I am trying to find the best way to copy yesterday's data from DocumentDB to Azure SQL.
I have a working DocumentDB database that is recording data gathered via a web service. I would like to routinely (daily) copy all new records from the DocumentDB to an Azure SQL DB table. In order to do so I have created and successfully executed an Azure Data Factory Pipeline that copies records with a datetime > '2018-01-01', but I've only ever been able to get it to work with an arbitrary date - never getting the date from a variable.
My research on DocumentDB SQL querying shows that it has Mathematical, Type checking, String, Array, and Geospatial functions but no date-time functions equivalent to SQL Server's getdate() function.
I understand that Data Factory Pipelines have some system variables that are accessible, including utcnow(). I cannot figure out, though, how to actually use those by editing the JSON successfully. If I try just including utcnow() within the query I get an error from DocumentDB that "'utcnow' is not a recognized built-in function name".
"query": "SELECT * FROM c where c.StartTimestamp > utcnow()",
If I try instead to build the string within the JSON using utcnow() I can't even save it because of a syntax error:
"query": "SELECT * FROM c where c.StartTimestamp > " + utcnow(),
I am willing to try a different technology than a Data Factory Pipeline, but I have a lot of data in our DocumentDB so I'm not interested in abandoning that, and I have much greater familiarity with SQL programming and need to move the data there for joining and other analysis.
What is the easiest and best way to copy those new entries over every day into the staging table in Azure SQL?
Are you using ADF V2 or V1?
For ADF V2.
I think you can follow the incremental approach that they recommend. For example, you could have a watermark table (it could be in your target Azure SQL database) and two Lookup activities: one lookup obtains the previous run's watermark value (it could be a date, an integer, whatever your audit value is), and the other lookup obtains the MAX watermark value (i.e. date) from your source documents. A Copy activity then gets all the values where c.StartTimestamp <= MaxWatermarkValueFromSource AND c.StartTimestamp > LastWatermarkValue.
I followed this example using the Python SDK and it worked for me.
https://learn.microsoft.com/en-us/azure/data-factory/tutorial-incremental-copy-powershell
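For reference, the watermark side of that tutorial boils down to a small table and a stored procedure in the target Azure SQL database; a minimal sketch, with illustrative names only:
-- Holds the last value copied per source; seeded with the initial cut-off date.
CREATE TABLE dbo.watermarktable (
    TableName      VARCHAR(255) NOT NULL,
    WatermarkValue DATETIME     NOT NULL
);

INSERT INTO dbo.watermarktable VALUES ('documentdb_source', '2018-01-01');
GO

-- Called by the pipeline after a successful copy to advance the watermark.
CREATE PROCEDURE dbo.usp_write_watermark
    @LastModifiedTime DATETIME,
    @TableName        VARCHAR(255)
AS
BEGIN
    UPDATE dbo.watermarktable
    SET WatermarkValue = @LastModifiedTime
    WHERE TableName = @TableName;
END;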

Azure Machine Learning Write output to Azure SQL Database

I am using Azure Machine Learning to cluster data.
The input data is from an Azure SQL Database, and it works fine.
At the end of everything I want to write the output to a table in the same Azure SQL Database, but I get this error:
Error: Error 1000: AFx Library library exception:
Sql encountered an error: Login failed for user
Does anyone have any idea?
Thank you very much!
Please follow the instructions and examine the examples provided here to properly use the Export Data module to save the output of ML to an Azure SQL Database.
How to Export Data to an Azure SQL Database
Add the Export Data module to your experiment. You can find this module in the Data Input and Output group in the experiment items list in Azure Machine Learning Studio.
Connect it to the module that produces the data that you want to export to Azure SQL DB.
For Data destination, select Azure SQL Database. This option supports Azure SQL Data Warehouse as well.
Set the following options specific to Azure SQL Database or Azure SQL Data Warehouse.
Database server name
Type the server name that is generated by Azure. Typically it has the form <generated_identifier>.database.windows.net.
Database name
Type the name of a database on the server you just specified. The database must already exist; Export Data cannot create it.
Server user account name
Type the user name of an account that has access permissions for the database.
Server user account password
Provide the password for the specified user account.
Comma-separated list of columns to be saved
Type the names of the columns in the experiment that you want to write to the database.
Data table name
Type the name of the table where data will be stored.
For Azure SQL Database, if the table does not exist, it will be created. For Azure SQL Data Warehouse, the table must already exist and have the correct schema, so be sure to create it in advance (see the sketch after these steps).
Comma-separated list of datatable columns
Type the names of the columns as you wish them to appear in the destination table. The columns should correspond in order with the column names that you list in Comma-separated list of columns to be saved.
If you are writing to Azure SQL Data Warehouse, the column names must match those already in the destination table schema.
Number of rows written per SQL Azure operation
Indicate how many rows should be written to the destination table in each batch. By default, the value is set to 50, which is the default batch size for Azure SQL Database. However, you should increase this value if you have a large number of rows to write.
TIP:
For Azure SQL Data Warehouse, we recommend that you set this value to 1. If you use a larger batch size, the size of the command string that is sent to Azure SQL Data Warehouse can exceed the allowed string length, causing an error.
If you don't want to write new results each time you run the experiment, select the Use cached results option. If there are no other changes to module parameters, the experiment will write the data the first time the module is run, and thereafter not perform writes.
However, a write will always be performed if any parameters have been changed in Export Data that would change the results.
Run the experiment.
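As noted above, when the destination is Azure SQL Data Warehouse the table must exist before the experiment runs; a minimal sketch of pre-creating it - the table and column names are purely illustrative:
-- Destination table created up front so Export Data can write into it.
CREATE TABLE dbo.cluster_output (
    record_id  INT   NOT NULL,
    cluster_id INT   NOT NULL,
    distance   FLOAT NULL
);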
Found the issue!
I needed to create a specific user with this SQL code:
CREATE USER AMLApplicationUser WITH PASSWORD = '************';
and then add the user to these roles on the database I want to write to:
ALTER ROLE db_datareader ADD MEMBER AMLApplicationUser;
ALTER ROLE db_datawriter ADD MEMBER AMLApplicationUser;
I guess the db_datawriter role alone would be enough, but I needed db_datareader too.
So in conclusion, it seems that the database admin role can be used to read data from AML, but not to write data.
Thank you for your help!