Azure Synapse Delta Table Creation and Import Data From ADLS Delta Lake

We have a requirement to load data from ADLS Delta data into a Synapse table. We write Delta-format data into ADLS Gen2 from Databricks, and now we want to load the data from ADLS Gen2 (with the Delta table) into a Synapse table. We followed the steps below to create the table, but we are getting issues.
CREATE EXTERNAL FILE FORMAT DeltaFileFormat
WITH (  
     FORMAT_TYPE = DELTA  
);   
CREATE EXTERNAL DATA SOURCE test_data_source
WITH
(     LOCATION = 'abfss://container@storage.dfs.core.windows.net/table_metadata/testtable'
        --,CREDENTIAL = <database scoped credential>
); 
CREATE EXTERNAL TABLE testtable (
     job_id int,
     source_type varchar(10),
     server_name varchar(10),
     database_name varchar(15),
     table_name varchar(20),
     custom_query varchar(100),
     source_location varchar(500),
     job_timestamp datetime2,
     job_user varchar(50)
) WITH (
        LOCATION = 'abfss://targetcontainer@targetstorage.dfs.core.windows.net/table_metadata/testtable',
        data_source = test_data_source,
        FILE_FORMAT = DeltaFileFormat
);
select * from testtable; 
When we run the SELECT statement, the following exception is thrown:
Content of directory on path 'https://container@storage.dfs.core.windows.net_delta_log/.' cannot be listed.

I also tried this and got a similar error:
Content of directory on path 'https://container@storage.dfs.core.windows.net_delta_log/.' cannot be listed.
This error message typically occurs when there are no data or Delta files present in the _delta_log directory of the specified ADLS Gen2 storage account. The _delta_log directory is created automatically when you create a Delta table, and it contains the transaction log files for the Delta table.
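Before creating the external table, a quick way to check that the Delta folder itself can be read (and that the path and permissions are right) is to query it directly with OPENROWSET from a serverless SQL pool. This is only a sketch, assuming a serverless pool is being used and reusing the storage/container names from the question:
-- Sanity check: read the Delta folder directly (serverless SQL pool).
-- Storage account, container, and path are the ones from the question and may need adjusting.
SELECT TOP 10 *
FROM OPENROWSET(
    BULK 'https://storage.dfs.core.windows.net/container/table_metadata/testtable/',
    FORMAT = 'DELTA'
) AS result;
If this query fails with the same "cannot be listed" error, the problem is the path or the storage permissions rather than the external table definition.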
In the code below, the Delta table files are located at /demodata/, and the external data source for this example is test_data_source18, which contains the file path information and credentials. LOCATION points to the Delta table folder; the table's transaction log sits underneath it, at /demodata/_delta_log/00000000000000000000.json.
CREATE EXTERNAL FILE FORMAT DeltaFileFormat
WITH (
FORMAT_TYPE = DELTA
);
CREATE EXTERNAL DATA SOURCE test_data_source18
WITH
( LOCATION = 'abfss://demo@dlsg2p.dfs.core.windows.net');
CREATE EXTERNAL TABLE testtable24(
Id varchar(20),
Name varchar(20)
) WITH (
LOCATION = '/demodata/',
data_source = test_data_source18,
FILE_FORMAT = DeltaFileFormat
);
select * from testtable24;
Reference: delta table with PolyBase

Related

Cannot get data from existing external table in Synapse Analytics database

I created an EXTERNAL TABLE and I was able to get data from it.
Now I get the error:
"External file access failed because the specified path name '...' does not exist. Enter a valid path and try again."
But the file path is correct and the file exists in the container.
Below is the script I used to create the external table:
CREATE DATABASE SCOPED CREDENTIAL AzureStorageCredential
WITH
IDENTITY = 'SHARED ACCESS SIGNATURE',
SECRET = 'XXXX' ;
CREATE EXTERNAL DATA SOURCE AzureStorageContainerName
WITH
( LOCATION = 'wasbs://BlobContainerName@StorageAccountName.blob.core.windows.net' ,
CREDENTIAL = [AzureStorageCredential],
TYPE = HADOOP
) ;
CREATE EXTERNAL FILE FORMAT FileFormat_csv
WITH
(
FORMAT_TYPE = DELIMITEDTEXT,
FORMAT_OPTIONS ( FIELD_TERMINATOR = ';',
FIRST_ROW = 1,
USE_TYPE_DEFAULT = FALSE,
ENCODING = 'UTF8')
);
CREATE EXTERNAL TABLE TableExternal
(
[code] [nvarchar](100) ,
[name] [nvarchar](100)
)
WITH (
LOCATION = '/subfolder/FileName.csv',
DATA_SOURCE = AzureStorageContainerName,
FILE_FORMAT = FileFormat_csv
)

Bulk load data conversion error (type mismatch or invalid character for the specified codepage) for row 1, column 4 in Azure Synapse

I have a Spotify CSV file in my Azure Data Lake. I am trying to create an external table using a SQL serverless pool in Azure Synapse.
I am getting the error message below:
Bulk load data conversion error (type mismatch or invalid character for the specified codepage) for row 1, column 4 (Track_popularity) in data file https://test.dfs.core.windows.net/data/folder/updated.csv.
I am using the script below:
IF NOT EXISTS (SELECT * FROM sys.external_file_formats WHERE name = 'SynapseDelimitedTextFormat')
CREATE EXTERNAL FILE FORMAT [SynapseDelimitedTextFormat]
WITH ( FORMAT_TYPE = DELIMITEDTEXT ,
FORMAT_OPTIONS (
FIELD_TERMINATOR = ',',
USE_TYPE_DEFAULT = FALSE
))
GO
IF NOT EXISTS (SELECT * FROM sys.external_data_sources WHERE name = 'test.dfs.core.windows.net')
CREATE EXTERNAL DATA SOURCE [test.dfs.core.windows.net]
WITH (
LOCATION = 'abfss://data@test.dfs.core.windows.net'
)
GO
CREATE EXTERNAL TABLE updated (
[Artist] nvarchar(4000),
[Track] nvarchar(4000),
[Track_id] nvarchar(4000),
[Track_popularity] bigint,
[Artist_id] nvarchar(4000),
[Artist_Popularity] bigint,
[Genres] nvarchar(4000),
[Followers] bigint,
[danceability] float,
[energy] float,
[key] bigint,
[loudness] float,
[mode] bigint,
[speechiness] float,
[acousticness] float,
[instrumentalness] float,
[liveness] float,
[valence] float,
[tempo] float,
[duration_ms] bigint,
[time_signature] bigint
)
WITH (
LOCATION = 'data/updated.csv',
DATA_SOURCE = [data_test_dfs_core_windows_net],
FILE_FORMAT = [SynapseDelimitedTextFormat]
)
GO
SELECT TOP 100 * FROM dbo.updated
GO
My CSV is UTF-8 encoded. I am not sure what the issue is; the error points to the column (Track_popularity). Please advise.
I’m guessing you may have a header row that should be skipped. Drop your external table and then drop and recreate the external file format as follows:
CREATE EXTERNAL FILE FORMAT [SynapseDelimitedTextFormat]
WITH ( FORMAT_TYPE = DELIMITEDTEXT ,
FORMAT_OPTIONS (
FIELD_TERMINATOR = ',',
USE_TYPE_DEFAULT = FALSE,
FIRST_ROW = 2
))
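A sketch of the full sequence, reusing the object names from the question; the external table has to be dropped before the file format it depends on:
-- Drop the external table first, then the file format it references.
DROP EXTERNAL TABLE dbo.updated;
DROP EXTERNAL FILE FORMAT [SynapseDelimitedTextFormat];
GO
-- Recreate the file format with FIRST_ROW = 2 as shown above,
-- then recreate the external table exactly as in the question and re-run the SELECT.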

Create External Table pointing to S3

How do we create an external table using Snowflake SQL that points to a directory in S3? Below is the code I have tried so far, but it didn't work. Any help is highly appreciated.
create external table my_table
(
column1 varchar(4000),
column2 varchar(4000)
)
LOCATION 's3a://<externalbucket>'
Note: The file that I have in the S3 bucket is a CSV file (comma separated, enclosed in double quotes, and with a header).
You will need to update your location to be an external stage, include the file_format parameter, and include the proper expression for the columns.
The location Parameter:
Specifies the external stage where the files containing data to be read are staged.
Additionally you'll need to define the file_format
https://docs.snowflake.com/en/sql-reference/sql/create-external-table.html#required-parameters
So your statement should look more like this:
create external table my_table
(
column1 varchar as (value:c1::varchar),
column2 varchar as (value:c2::varchar)
)
location = @[namespace.]ext_stage_name[/path]
file_format = (type = CSV)
You may need to define additional parameters in the file format to handle your file appropriately.
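For the file described in the question (comma separated, double-quote enclosed, with a header row), the inline file format might look something like this sketch; the stage name my_ext_stage is a placeholder:
create external table my_table
(
column1 varchar as (value:c1::varchar),
column2 varchar as (value:c2::varchar)
)
location = @my_ext_stage            -- placeholder external stage pointing at the S3 path
file_format = (
type = CSV
field_delimiter = ','
field_optionally_enclosed_by = '"'
skip_header = 1
);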
Finally I sorted this out. I am posting this answer to make it simple to understand, especially for beginners.
Say that I have a CSV file in the S3 location.
Step 1 :
Create a file format in which you define what type of file it is, the field delimiter, that data is enclosed in double quotes, that the header of the file should be skipped, etc.
create or replace file format schema_name.pipeformat
type = 'CSV'
field_delimiter = '|'
FIELD_OPTIONALLY_ENCLOSED_BY = '"'
skip_header = 1
https://docs.snowflake.com/en/sql-reference/sql/create-file-format.html
Step 2 :
Create a Stage to specify the S3 details and file format.
create or replace stage schema_name.stage_name
url='s3://<path where file is kept>'
credentials=(aws_key_id='****' aws_secret_key='****')
file_format = pipeformat
https://docs.snowflake.com/en/sql-reference/sql/create-stage.html#required-parameters
Step 3 :
Create the external table based on the Stage name and file format.
create or replace external table schema_name.table_name
(
RollNumber INT as (value:c1::int),
Name varchar(20) as ( value:c2::varchar),
Marks int as (value:c3::int)
)
with location = @stage_name
file_format = pipeformat
https://docs.snowflake.com/en/sql-reference/sql/create-external-table.html
Step 4 :
Now you should be able to query from the external table.
select *
from schema_name.table_name

Azure SQL bulk insert from blob storage failure: Referenced external data source "MyAzureBlobStorage" not found

I keep getting this error (Referenced external data source "MyAzureBlobStorage" not found.) when loading a CSV from blob storage to Azure SQL. I am following this example and I set my blob to be public, but the following just does not work:
CREATE EXTERNAL DATA SOURCE MyAzureBlobStorage
WITH ( TYPE = BLOB_STORAGE,
LOCATION = 'https://test.blob.core.windows.net/test'
);
BULK INSERT SubscriberQueue
FROM 'inputs.csv'
WITH (DATA_SOURCE = 'MyAzureBlobStorage', FORMAT='CSV');
Any ideas what I am missing here?
If you want to bulk insert from Azure Blob Storage, please refer to the following script.
My CSV file:
1,Peter,Jackson,pjackson@hotmail.com
2,Jason,Smith,jsmith@gmail.com
3,Joe,Raasi,jraasi@hotmail.com
Script:
create table listcustomer
(id int,
firstname varchar(60),
lastname varchar(60),
email varchar(60))
Go
CREATE EXTERNAL DATA SOURCE MyAzureBlobStorage
WITH ( TYPE = BLOB_STORAGE,
LOCATION = 'https://****.blob.core.windows.net/test'
);
Go
BULK INSERT listcustomer
FROM 'mycustomers.csv'
WITH (DATA_SOURCE = 'MyAzureBlobStorage', FORMAT='CSV');
Go
select * from listcustomer;

Adding a comma-separated table to Hive

I have a very basic question: how can I add a very simple table to Hive? My table is saved in a text file (.txt), which is stored in HDFS. I have tried to create an external table in Hive that points to this file, but when I run a SQL query (select * from table_name) I don't get any output.
Here is an example code:
create external table Data (
dummy INT,
account_number INT,
balance INT,
firstname STRING,
lastname STRING,
age INT,
gender CHAR(1),
address STRING,
employer STRING,
email STRING,
city STRING,
state CHAR(2)
)
LOCATION 'hdfs:///KibTEst/Data.txt';
KibTEst/Data.txt is the path of the text file in HDFS.
The rows in the table are separated by carriage returns, and the columns are separated by commas.
Thanks for your help!
You just need to create an external table pointing to your file location in HDFS, with the delimiter properties set as below:
create external table Data (
dummy INT,
account_number INT,
balance INT,
firstname STRING,
lastname STRING,
age INT,
gender CHAR(1),
address STRING,
employer STRING,
email STRING,
city STRING,
state CHAR(2)
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
LOCATION 'hdfs:///KibTEst/';  -- LOCATION must be the directory that contains the file, not the file itself
You just need to run the select query (the file is already in HDFS, and the external table fetches data from it directly once the location is given in the create statement). So you can test it using the select statement below:
SELECT * FROM Data;
create external table Data (
dummy INT,
account_number INT,
balance INT,
firstname STRING,
lastname STRING,
age INT,
gender CHAR(1),
address STRING,
employer STRING,
email STRING,
city STRING,
state CHAR(2)
)
row format delimited
FIELDS TERMINATED BY ','
stored as textfile
LOCATION 'Your hdfs location for external table';
If the data is in HDFS, then use:
LOAD DATA INPATH 'hdfs_file_or_directory_path' INTO TABLE tablename;
Then use select * from table_name.
create external table Data (
dummy INT,
account_number INT,
balance INT,
firstname STRING,
lastname STRING,
age INT,
gender CHAR(1),
address STRING,
employer STRING,
email STRING,
city STRING,
state CHAR(2)
)
row format delimited
FIELDS TERMINATED BY ','
stored as textfile
LOCATION '/Data';
Then load the file into the table:
LOAD DATA INPATH '/KibTEst/Data.txt' INTO TABLE Data;
Then
select * from Data;
I hope the inputs below answer the question asked by @mshabeen.
There are different ways to load data into a Hive table that is created as an external table.
While creating the Hive external table you can either use the LOCATION option and specify the HDFS, S3 (in the case of AWS), or file location from where you want to load the data, OR you can use the LOAD DATA INPATH option to load data from HDFS, S3, or a file after creating the Hive table.
Alternatively, you can also use the ALTER TABLE command to load data into Hive partitions.
Below are some details.
Using LOCATION - Used while creating the Hive table. In this case the data is already present at the specified location and is available in the Hive table as soon as it is created.
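Example - a minimal sketch, assuming a placeholder table name and an HDFS directory that already contains the comma-delimited files:
CREATE EXTERNAL TABLE my_table (
id INT,
name STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION 'hdfs:///path/to/existing/data/';  -- data in this directory is queryable immediately, no LOAD needed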
Using the LOAD DATA INPATH option - This Hive command can be used to load data from a specified location. The point to remember here is that the data will get MOVED from the input path to the Hive warehouse path.
Example -
LOAD DATA INPATH 'hdfs://cluster-ip/path/to/data/location/' INTO TABLE table_name;
Using the ALTER TABLE command - This is mostly used to add data from other locations into Hive partitions. In this case it is required that all partitions are already defined and that the values for the partitions are already known. In the case of dynamic partitions this command is not required.
Example -
ALTER TABLE table_name ADD PARTITION (date_col='2018-02-21') LOCATION 'hdfs/path/to/location/'
The above code will map the partition to the specified data location (in this case, HDFS). However, the data will NOT be MOVED to the Hive internal warehouse location.
Additional details are available here