How to resolve special character issue in SQL Server data warehouse - sql-server-2016

I have to load the data from datalake into a SQL Server data warehouse using the polybase tables. I have created the set up for the creation of external tables. I have created the external tables and I am trying to do select * from ext_t1 table but I'm getting ???? for a column in ext_table.
Below is my external table script. I have found the issue with the special character in data. How can we escape the special character and need to use only varchar datatype not nvarchar. Can some help me on this issue?
CREATE EXTERNAL FILE FORMAT [CSVFileFormat_Test] WITH (FORMAT_TYPE = DELIMITEDTEXT, FORMAT_OPTIONS (FIELD_TERMINATOR = N',', STRING_DELIMITER = N'"',DATE_FORMAT='yyyy-MM-dd', FIRST_ROW = 2, USE_TYPE_DEFAULT = True,Encoding='UTF8'))
CREATE EXTERNAL TABLE [dbo].[EXT_TEST1]
( A VARCHAR(10),B VARCHAR(20))
(DATA_SOURCE = [Azure_Datalake],LOCATION = N'/A/Test_CSV/',FILE_FORMAT =csvfileformat,REJECT_TYPE = VALUE,REJECT_VALUE = 1)
Data: (special character in csv for A column as follows)
ÐК Ð’ÐЗМ Завод
ÐК Ð’ÐЗМ ЗаÑтройщик

This is data mismatch issue and this read may help you .
External Table Considerations
Creating an external table is easy, but there are some nuances that need to be discussed.
External Tables are strongly typed. This means that each row of the data being ingested must satisfy the table schema definition. If a row does not match the schema definition, the row is rejected from the load.
The REJECT_TYPE and REJECT_VALUE options allow you to define how many rows or what percentage of the data must be present in the final table. During load, if the reject value is reached, the load fails. The most common cause of rejected rows is a schema definition mismatch. For example, if a column is incorrectly given the schema of int when the data in the file is a string, every row will fail to load.
Data Lake Storage Gen1 uses Role Based Access Control (RBAC) to control access to the data. This means that the Service Principal must have read permissions to the directories defined in the location parameter and to the children of the final directory and files. This enables PolyBase to authenticate and load that data.

Related

How to Set default value of empty data of column in copy activity from csv file using azure data factory v2

I've multiple csv files and multiple tables.
The table name is file name and column name is first row of csv file.
Now I want to add default value of empty string to the sink table.
Consider my scenario,
employee:
id int, name varchar, is_active bit NULL
employee.csv:
id|name|is_active
1|raja|
Now I'm trying to copy the csv data to PostgreSQL table its throwing error.
Expected result is default value if its empty value.
You can use NULLIF in PostgreSQL:
NULLIF(argument_1,argument_2);
The NULLIF function returns a null value if argument_1 equals to argument_2, otherwise it returns argument_1.
This way you can replace NULL value with some other value
If your error is related to Type mismatch then consider typecasting the column first
Thanks!
As per the issue, tried to repro the scenario and here is the following outcome which was successfully copied. You have to use
Source Dataset: employee.csv from Azure Blob Storage
Sink Dataset : Here, I have used the sink as Azure SQL DB for some limitations but as you have used PostgreSQL is almost similar.
Copy Activity Settings:
Under the mapping settings there will be type conversion, where you have to import schema else you can dynamically add
Output:
Alternative to use DataFlow - if you have multiple data fields, you need to use the derived column transformation to generate new columns in your data flow or to modify existing fields.
For more details, refer Derived column transformation in mapping data flow.
You can even refer to this Microsoft Q&A post for more insights: Copy Task failure because of conversion failure

How to create a blank "Delta" Lake table schema in Azure Data Lake Gen2 using Azure Synapse Serverless SQL Pool?

I have a file with data integrated from 2 different sources using Azure Mapping Data Flow and loaded into an ADLS2 datalake container/folder i.e. for example :- /staging/EDW/Current/products.parquet file.
I now need to process this file in staging using Azure Mapping Data Flow and load into it's corresponding dimension table using SCD type2 method to maintain history.
However, I want to try creating & process this dimension table as "Delta" table in Azure Data Lake using Azure Mapping Data Flow only. However, since SCD type 2 requires a source lookup to check if there are any existing records/rows and if not insert all or if changed records do updates etc etc. (let's say during first time load).
For that, I need to first create a default/blank "Delta" table in Azure data lake folder i.e. for example :- /curated/Delta/Dimension/Products/. Just like we would have done if it were in Azure SQL DW (Dedicated Pool) in which we could have first created a blank dbo.dim_products table with just the schema/structure and no rows.
I am trying to implement a DataLake-House architecture implementation by utilizing & evaluating the best features of both Delta Lake and Azure Synapse Serverless SQL pool using Azure Mapping data flow - for performance, cost savings, ease of development (low code) & understanding. However, at the same time want to avoid a Logical Datawarehouse (LDW) kind of architecture implementation at this time.
For this, tried creating a new database under built-in Azure Synapse Serverless SQL pool, defined data source, format and a blank delta table/schema structure (without any rows); but no luck.
create database delta_dwh;
create external data source deltalakestorage
with ( location = 'https://aaaaaaaa.dfs.core.windows.net/curated/Delta/' );
create external file format deltalakeformat
with (format_type = delta);
drop external table products;
create external table dbo.products
(
product_skey int,
product_id int,
product_name nvarchar(max),
product_category nvarchar(max),
product_price decimal (38,18),
valid_from date,
valid_to date,
is_active char(1)
)
with
(
location='https://aaaaaaaa.dfs.core.windows.net/curated/Delta/Dimensions/Products',
data_source = deltalakestorage,
file_format = deltalakeformat
);
However, this fails since a Delta table/file requires _delta_log/*.json folder/file to be present which maintains transaction log. That means, I have to first write few (dummy) rows as in Delta format to the said target folder and then only I can read it and perform following queries used in for SCD type 2 implementation:
select isnull(max(product_skey), 0)
FROM OPENROWSET(
BULK 'https://aaaaaaaa.dfs.core.windows.net/curated/Delta/Dimensions/Products/*.parquet',
FORMAT = 'DELTA') as rows
Any thoughts, inputs, suggestions ??
Thanks!
You may try to create initial /dummy data_flow + pipiline to create this empty delta files.
It's only simple workaround.
Create CSV with your sample table data.
Create dataflow with name =initDelta
Use this CSV as source in data flow
In projection panel set up correct data types.
Add filtering after source and setup dummy filter 1=2 etc.
Add sink with delta output.
Put your initDelta dataflow into dummy pipeline and run it.
Folder structure for delta should created.
You mentioned the your initial data is in parque file. You can use this file. Schema of table(columns and data types) will be imported from file. Filter out all rows and save result as delta.
I think it should work or I missed something in your problem
I don't think you can use Serverless SQL pool to create a delta table........yet. I think it is coming soon though.

Copy Data from Blob to SQL via Azure data factory

I have two sample files in blob as sample1.csv and sample2.csv as below
data sample
SQL table name sample2, with column Name,id,last name,amount
Created a ADF flow without schema, it results as below
preview data
source settings are allow schema drift checked.
sink setting are auto mapping turned on. allow insert checked. table action none.
I have also tried setting a define schema in dataset, its result are same.
any help here?
my expected outcome would be data in sample1 will inserted null into the column "last name"
If I understand correctly, you said: "my expected outcome would be data in sample1 will inserted null into the column last name", you only need to add a derived column to you sample1.csv file.
You could follow my steps:
I create a sample1.csv file in Blob Storage and a sample2 table in my SQL database:
Using DerivedColumn to create new column last name with null value:
expression: toString(null())
Sink settings:
Run the pipeline and check the data in table:
Hope this helps.
You cannot mix schemas in the same source in the same data flow execution.
Schema Drift will handle changes to the schema on an execution-per-execution basis.
But if you are reading multiple different schemas from a folder, you will get non-deterministic results.
Instead, if you loop through those files in a pipeline ForEach one-by-one, data flow will be able to handle the evolving schema.

Azure Machine Learning Write output to Azure SQL Database

I am using Azure Machine Learning to clustering data.
The input data is from an Azure SQL Database, and it works fine.
At the end of everything I want to write the output to a table in the same Azure SQL Database, but I get this error:
Error: Error 1000: AFx Library library exception:
Sql encountered an error: Login failed for user
Anyone any idea?
Thank you very much!
Please follow the instructions and examine the examples provided here to properly use the Export Data module to save the data of ML to Azure SQL Database.
How to Export Data to an Azure SQL Database
Add the Export Data module to your experiment. You can find this module in the Data Input and Output group in the experiment items list in Azure Machine Learning Studio.
Connect it to the module that produces the data that you want to export to Azure SQL DB.
For Data destination, select Azure SQL Database. This option supports Azure SQL Data Warehouse as well.
Set the following options specific to Azure SQL Database or Azure SQL Data Warehouse.
Database server name
Type the server name that is generated by Azure. Typically it has the form <generated_identifier>.database.windows.net.
Database name
Type the name of a database on the server you just specified.The database must already exist; the Export Data cannot create it.
Server user account name
Type the user name of an account that has access permissions for the database.
Server user account password
Provide the password for the specified user account.
Comma-separated list of columns to be saved
Type the names of the columns in the experiment that you want to write to the database.
Data table name
Type the name of the table where data will be stored.
For Azure SQL Database, if the table does not exist, it will be created. For Azure SQL Data Warehouse, the table must already exist and have the correct schema, so be sure to create it in advance.
Comma-separated list of datatable columns
Type the names of the columns as you wish them to appear in the destination table. The columns should correspond in order with the column names that you list in Comma-separated list of columns to be saved.
if you are writing to Azure SQL Data Warehouse, the columns names must match those already in the destination table schema.
Number of rows written per SQL Azure operation
Indicate how many rows should be written to the destination table in each batch. By default, the value is set to 50, which is the default batch size for Azure SQL Database. However, you should increase this value if you have a large number of rows to write.
TIP:
For Azure SQL Data Warehouse, we recommend that you set this value to 1. If you use a larger batch size, the size of the command string that is sent to Azure SQL Data Warehouse can exceed the allowed string length, causing an error.
If you don't want to write new results each time you run the experiment, select the Use cached results option. If there are no other changes to module parameters, the experiment will write the data the first time the module is run, and thereafter not perform writes.
However, a write will always be performed if any parameters have been changed in Export Data that would change the results.
Run the experiment.
Find the issue!
I needed to create an specific user with this SQL code:
CREATE USER AMLApplicationUser WITH PASSWORD = '************';
and then add the user to these roles on the database I want to write.
ALTER ROLE db_datareader ADD MEMBER AMLApplicationUser;
ALTER ROLE db_datawriter ADD MEMBER AMLApplicationUser;
I guess only the datawriter role is enough, but I needed datareader too.
So in conclusion, seems that database admin role can be used to read data, but not to write data from AML.
Thank you for your help!

Keep Column types from Java ResultSet in CSV export

I'm currently building a tool that pulls data directly from a database because SPSS Modeler is too slow and store it in a Java ResultSet first of all.
But I try to export the data into a CSV (or similar) file and try to keep as much column types as possible.
Currently I'm using opencsv but it casts Decimals and many others to a String. When I load the file back into SPSS Modeler I get only Integers and Strings.
Are there any CSV libraries (maybe with a special encoding) or other file types I can use to export the data with its column types (like IBM InfoSphere Data Architect can do) so I can load it directly back into SPSS Modeler without changing it back manually there ?
Thank you!
Retrieving the Metadata from the DB Information Schema
If the data is currently stored in a database, you can retrieve the column type from the information schema. All you need to do is retrieving this information after your queried the table and store it so that you can reuse it later.
// connect to DB as usual
Statement stmt = conn.createStatement();
// create your query
// Note that you can use a dummy query here.
//You only need to access the metadata schema of the table, regardless of the actual query.
ResultSet rse = stmt.executeQuery("Select A,B FROM table WHERE ..");
// get the ResultSetMetadata
ResultSetMetaData rsmd = rse.getMetaData();
// Get database specific type
rsmd.getColumnTypeName(1); // database specific type name for column 1 (e.g. VARCHAR)
rsmd.getColumnTypeName(2); // database specific type name for column 2 (e.g. DateTime)
....
// Get generic JDBC type http://docs.oracle.com/javase/7/docs/api/java/sql/Types.html
rsmd.getColumnType(1) // generic type for col 1 (e.g. 12)
rsmd.getColumnType(2) // generic type for col 2
Processing
You could store this information in a CSV schema and process this during the transformation process.
I recommend that you use SuperCSV, which is available here.
This library provides so called cell processors, which allow you to define the type of the columns.
Description:
Cell processors are an integral part of reading and writing with Super CSV - they automate the data type conversions, and enforce constraints. They implement the chain of responsibility design pattern - each processor has a single, well-defined purpose and can be chained together with other processors to fully automate all of the required conversions and constraint validation for a single CSV column.