Azure Synapse SQL Serverless, how to create external table from CSV with fields longer than 8Kb? - sql

I have a CSV with more than 500 fields, hosted on an Azure storage account; however, I just need a couple of columns, which may contain values longer than 8 KB. For this reason, I started by writing a simple query in Azure Synapse SQL Serverless like this:
SELECT TOP 100 C1, C2 FROM OPENROWSET(
BULK 'https://mysa.blob.core.windows.net/my_file.csv',
FORMAT = 'CSV',
PARSER_VERSION = '2.0'
) AS [result]
It fails with the error "String or binary data would be truncated while reading column of type 'VARCHAR'". And it does not just report this as a warning: because of it, the query returns no rows at all.
So, a simple workaround is to disable warnings; of course the value is truncated to 8 KB, but at least the query doesn't fail this way:
SET ANSI_WARNINGS OFF
SELECT TOP 100 * FROM OPENROWSET(
BULK 'https://mysa.blob.core.windows.net/my_file.csv',
FORMAT = 'CSV',
PARSER_VERSION = '2.0'
) AS [result]
SET ANSI_WARNINGS ON
Now I need some help to reach the final goal, which is to build an EXTERNAL TABLE rather than just a SELECT, leaving the CSV where it is (in other words, I don't want to create a materialized view, a CETAS, or a SELECT INTO, which would duplicate the data).
If I run it this way:
CREATE EXTERNAL TABLE my_CET (
C1 NVARCHAR(8000),
C2 NVARCHAR(8000)
)
WITH (
LOCATION = 'my_file.csv',
DATA_SOURCE = [my_data_source],
FILE_FORMAT = [SynapseDelimitedTextFormat]
)
, it seems to work, because it successfully creates the external table; however, if I try to read it, I get the error "External table my_CET is not accessible because location does not exist or it is used by another process.".
If I try setting ANSI_WARNINGS OFF, it tells me "The option 'ANSI_WARNINGS' must be turned ON to execute requests referencing external tables.".
As said, I don't need all 500 fields in the CSV, just a couple of them, including the one whose values I'm willing to truncate to 8 KB as in the example above.
If I use a CSV file where no field is larger than 8 KB, the external table works correctly, but I couldn't make it work when some values are longer than that.

I think that when creating an external table from a CSV you have to bring in all the columns; I'm sure someone can correct me if I'm wrong.
Depending on what you want to do, you could create a view on top of the external table using a select query, e.g.:
CREATE VIEW my_CET_Vw
AS
SELECT C1,
C2
FROM my_CET
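If the external table route stays blocked by the 8 KB limit, another option that keeps the CSV in place (no CETAS, no copy) is a view directly over OPENROWSET with an explicit WITH clause, so only the needed columns are read. This is just a hedged sketch: the column ordinals, widths, and parser version below are assumptions, and whether values beyond 8 KB actually come through should be tested in your environment rather than taken as a given.
CREATE VIEW my_csv_vw
AS
SELECT C1, C2
FROM OPENROWSET(
    BULK 'https://mysa.blob.core.windows.net/my_file.csv',
    FORMAT = 'CSV',
    PARSER_VERSION = '1.0'   -- assumption: worth trying 1.0 here, since the 8 KB error above came from parser 2.0
)
WITH (
    C1 VARCHAR(100) 1,       -- 1 and 2 are placeholder ordinals; point them at the real positions of your two columns
    C2 VARCHAR(MAX) 2        -- VARCHAR(MAX) to avoid declaring an 8000-byte limit; verify support in your setup
) AS [result];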

Related

How to create a blank "Delta" Lake table schema in Azure Data Lake Gen2 using Azure Synapse Serverless SQL Pool?

I have a file with data integrated from 2 different sources using Azure Mapping Data Flow and loaded into an ADLS Gen2 data lake container/folder, for example /staging/EDW/Current/products.parquet.
I now need to process this staging file using Azure Mapping Data Flow and load it into its corresponding dimension table using the SCD type 2 method to maintain history.
However, I want to try creating and processing this dimension table as a "Delta" table in Azure Data Lake using Azure Mapping Data Flow only. SCD type 2 requires a source lookup to check whether there are any existing rows, and then either insert everything (say, during the first load) or update the records that have changed.
For that, I first need to create a default/blank "Delta" table in an Azure Data Lake folder, for example /curated/Delta/Dimension/Products/, just like we would have done in Azure SQL DW (Dedicated Pool), where we could first create a blank dbo.dim_products table with just the schema/structure and no rows.
I am trying to implement a data lakehouse architecture by utilizing and evaluating the best features of both Delta Lake and Azure Synapse Serverless SQL pool with Azure Mapping Data Flow, for performance, cost savings, ease of development (low code), and understanding. At the same time, I want to avoid a Logical Data Warehouse (LDW) style implementation for now.
For this, I tried creating a new database under the built-in Azure Synapse Serverless SQL pool, defining a data source, a file format, and a blank Delta table/schema structure (without any rows), but with no luck.
create database delta_dwh;
create external data source deltalakestorage
with ( location = 'https://aaaaaaaa.dfs.core.windows.net/curated/Delta/' );
create external file format deltalakeformat
with (format_type = delta);
drop external table products;
create external table dbo.products
(
product_skey int,
product_id int,
product_name nvarchar(max),
product_category nvarchar(max),
product_price decimal (38,18),
valid_from date,
valid_to date,
is_active char(1)
)
with
(
location='https://aaaaaaaa.dfs.core.windows.net/curated/Delta/Dimensions/Products',
data_source = deltalakestorage,
file_format = deltalakeformat
);
However, this fails, since a Delta table requires the _delta_log/*.json folder/files (the transaction log) to be present. That means I first have to write a few (dummy) rows in Delta format to the target folder, and only then can I read it and run the following kind of query for the SCD type 2 implementation:
select isnull(max(product_skey), 0)
FROM OPENROWSET(
BULK 'https://aaaaaaaa.dfs.core.windows.net/curated/Delta/Dimensions/Products/',
FORMAT = 'DELTA') as rows
Any thoughts, inputs, or suggestions?
Thanks!
You may try to create an initial/dummy data flow + pipeline to create these empty Delta files.
It's only a simple workaround.
Create a CSV with your sample table data.
Create a data flow with the name initDelta.
Use this CSV as the source in the data flow.
In the projection panel, set up the correct data types.
Add a filter after the source and set up a dummy condition such as 1=2.
Add a sink with Delta output.
Put your initDelta data flow into a dummy pipeline and run it.
The folder structure for Delta should be created.
You mentioned that your initial data is in a parquet file. You can use that file; the table schema (columns and data types) will be imported from it. Filter out all rows and save the result as Delta.
I think it should work, unless I missed something in your problem.
I don't think you can use Serverless SQL pool to create a delta table........yet. I think it is coming soon though.
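If a Spark pool is available in the same workspace, one alternative people use (not serverless SQL, and only a hedged sketch; the table name and the abfss rewrite of the https location are assumptions) is to create the empty Delta table once from a Spark SQL cell, which also writes the _delta_log folder:
-- Run in a Synapse Spark pool notebook (%%sql cell), not in the serverless SQL pool
CREATE TABLE IF NOT EXISTS products_dim (
    product_skey INT,
    product_id INT,
    product_name STRING,
    product_category STRING,
    product_price DECIMAL(38,18),
    valid_from DATE,
    valid_to DATE,
    is_active STRING
)
USING DELTA
LOCATION 'abfss://curated@aaaaaaaa.dfs.core.windows.net/Delta/Dimensions/Products';  -- assumed container/path split
After that, the OPENROWSET ... FORMAT = 'DELTA' query from the question should at least find a valid (empty) Delta table.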

Flink SQL table backed by CSV with header

I have been searching on Google for the last 20 minutes. I have also gone through the Apache Flink documentation, specifically the CSV format, but haven't found a way to skip the first header row.
The CSV looks like this:
AccountKey,InstrumentCode,Quantity,BuySell,Price
SB11,MSFT,100,Buy,250.57
SB11,AMZN,125,Sell,309
I really am looking for a property to set which will make the CSV reader skip the header row. The following table definition does not work:
CREATE TABLE trades (
AccountNo varchar(10),
Symbol varchar(10),
Quantity integer,
BuySell varchar(4),
Price decimal
) WITH (
'connector' = 'filesystem',
'path' = '/mnt/d/Work/Github/Flink-Samples/data/trade-data.csv',
'format' = 'csv'
);
The error I see on sql-client.sh is:
[ERROR] Could not execute SQL statement. Reason:
java.lang.NumberFormatException: For input string: "Quantity"
I expect the CSV reader to read the rows when I execute a
select * from trades
So far, the only way I have found is to put a # character in front of the header row so that it appears as a comment, and to use the following table definition:
CREATE TABLE trades (
AccountNo varchar(10),
Symbol varchar(10),
Quantity integer,
BuySell varchar(4),
Price decimal
) WITH (
'connector' = 'filesystem',
'path' = '/mnt/d/Work/Github/Flink-Samples/data/trade-data.csv',
'format' = 'csv',
'csv.allow-comments' = 'true'
);
Flink's connectors are designed for large amounts of data. Usually, CSV data is split into multiple files for efficient parallel processing.
In that case, a header would not make much sense, because it would not be clear whether the header is located only in the first file (and which one is the first file?) or in every file.
If there are headers in your file, you have the following options:
Remove the header beforehand with a different tool.
Pre-process the CSV file using a CSV format configured with a single STRING column and use OFFSET / LIMIT to skip the first row before writing it out as a CSV file again. This is mostly useful for batch processing.
Enable ignoring parse errors and use OFFSET / LIMIT to definitely skip the first row.
When you create CSV-backed tables, you are just telling Flink to read data from the CSV file, so you can't have headers in the file.
If you want to import a CSV file into an already existing table, you need ETL; that is the case in almost any SQL engine.
This blog post is helpful on how to get CSV files into Flink:
From Streams to Tables and Back Again
What eshirvana said, plus this workaround:
Use the csv.ignore-parse-errors option so that NumberFormatException fails silently.
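For reference, a hedged sketch of the table definition from the question with that option added (everything else unchanged); rows that fail to parse, including the header, are then skipped silently:
CREATE TABLE trades (
    AccountNo varchar(10),
    Symbol varchar(10),
    Quantity integer,
    BuySell varchar(4),
    Price decimal
) WITH (
    'connector' = 'filesystem',
    'path' = '/mnt/d/Work/Github/Flink-Samples/data/trade-data.csv',
    'format' = 'csv',
    'csv.ignore-parse-errors' = 'true'
);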
This may be treading into hack territory, but one option to stay in the SQL context is to create a view that excludes the header row.
For example, the view below excludes the header row since column AccountNo would (likely) only equal AccountKey on the header row in the CSV.
CREATE VIEW trades_no_header AS SELECT * FROM trades WHERE AccountNo <> 'AccountKey';

How to resolve special character issue in SQL Server data warehouse

I have to load data from a data lake into a SQL Server data warehouse using PolyBase tables. I have created the setup for creating external tables. I have created the external tables, but when I run select * from ext_t1 I get ???? for one of the columns in the external table.
Below is my external table script. I have traced the issue to special characters in the data. How can we handle the special characters while using only the varchar data type, not nvarchar? Can someone help me with this issue?
CREATE EXTERNAL FILE FORMAT [CSVFileFormat_Test]
WITH (FORMAT_TYPE = DELIMITEDTEXT,
      FORMAT_OPTIONS (FIELD_TERMINATOR = N',', STRING_DELIMITER = N'"', DATE_FORMAT = 'yyyy-MM-dd',
                      FIRST_ROW = 2, USE_TYPE_DEFAULT = True, Encoding = 'UTF8'))
CREATE EXTERNAL TABLE [dbo].[EXT_TEST1]
( A VARCHAR(10), B VARCHAR(20) )
WITH (DATA_SOURCE = [Azure_Datalake], LOCATION = N'/A/Test_CSV/', FILE_FORMAT = [CSVFileFormat_Test], REJECT_TYPE = VALUE, REJECT_VALUE = 1)
Data (special characters in the CSV for column A, as follows):
ÐК Ð’ÐЗМ Завод
ÐК Ð’ÐЗМ ЗаÑтройщик
This is a data mismatch issue, and the following read may help you.
External Table Considerations
Creating an external table is easy, but there are some nuances that need to be discussed.
External Tables are strongly typed. This means that each row of the data being ingested must satisfy the table schema definition. If a row does not match the schema definition, the row is rejected from the load.
The REJECT_TYPE and REJECT_VALUE options allow you to define how many rows or what percentage of the data must be present in the final table. During load, if the reject value is reached, the load fails. The most common cause of rejected rows is a schema definition mismatch. For example, if a column is incorrectly given the schema of int when the data in the file is a string, every row will fail to load.
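For illustration, a hedged sketch of the percentage-based variant using the question's objects (the table name EXT_TEST1_PCT and the threshold values are made up; REJECT_SAMPLE_VALUE is required when REJECT_TYPE = PERCENTAGE):
CREATE EXTERNAL TABLE [dbo].[EXT_TEST1_PCT]
( A VARCHAR(10), B VARCHAR(20) )
WITH (
    DATA_SOURCE = [Azure_Datalake],
    LOCATION = N'/A/Test_CSV/',
    FILE_FORMAT = [CSVFileFormat_Test],
    REJECT_TYPE = PERCENTAGE,
    REJECT_VALUE = 5,             -- fail the load once more than 5% of the sampled rows are rejected
    REJECT_SAMPLE_VALUE = 1000    -- re-evaluate the percentage after every 1000 attempted rows
);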
Data Lake Storage Gen1 uses Role Based Access Control (RBAC) to control access to the data. This means that the Service Principal must have read permissions to the directories defined in the location parameter and to the children of the final directory and files. This enables PolyBase to authenticate and load that data.

How to use Vertica's COPY LOCAL as an sql statement from MATLAB on Windows

I'm trying to insert around 80 million records created using MATLAB into a Vertica database table. I wanted to know if we can call the COPY LOCAL statement from MATLAB as a regular SQL statement using exec(conn, sql). For testing purposes, I tried with a .dat file having around 4 million records, as follows:
sqlstmnt = 'COPY schema.table_name (FK_CUSTOMER_ID,FK_RUN_START_DATE_ID,FK_RUN_END_DATE_ID,FK_TRAVEL_ID,FK_ORIGIN_ID,FK_DEST_ID,FK_SEGMENT_ID,SEGMENT_PERCENTAGE,LAST_UPDATED) FROM LOCAL ''/my/file/full/path/test1.dat''';
results = exec(conn,sqlstmnt);
But it gave an error in results.Message like:
[Vertica]JDBC A ResultSet was expected but not generated from query "COPY schema.table_name(FK_CUSTOMER_ID,FK_RUN_START_DATE_ID,FK_RUN_END_DATE_ID,FK_TRAVEL_ID,FK_ORIGIN_ID,FK_DEST_ID,FK_SEGMENT_ID,SEGMENT_PERCENTAGE,LAST_UPDATED) FROM LOCAL '/my/file/full/path/test1.dat'". Query not executed.
I have the data in the '.dat' file in the order in which the columns are mentioned in COPY LOCAL.
I could not find any helpful resource explaining this error.
I have this test1.dat file which I'm able to load using COPY from vsql, but since I run my code in MATLAB over many iterations, each producing about a million records, I want to insert them during each iteration. Any help would be really great.
The COPY command returns a ResultSet that includes the amount of loaded data. I see two main options:
1) results = exec(conn,sqlstmnt);
2) results = runsqlscript(conn,'nameOfSQLScriptthatIncludeTheCopyCommand.sql')
I hope you will find it useful.
Thanks
I just finished reviewing your input sample data.
I see a major problem with the mapping of the input CSV to the target table.
The main issues are:
1) Lines are broken into 2 lines (you should have one record per line and avoid breaking it into 2 lines).
E.g.: "1,20150101,0,2,2573,2714,1,8.147237e-01
50,48,49,54,45,48,51,-28 12:11:46"
2) When you define data types on the Vertica table, e.g. timestamp, the data in the CSV must match them (what you have is "-28 12:11:46", which will not work).
After you fix all these issues, make sure you test it using vsql, then go and try it with MATLAB.
I hope you will find it useful.
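For that vsql test, a hedged sketch with an explicit delimiter and rejected-row logging (the option set and the output paths are illustrative; check them against your Vertica version):
COPY schema.table_name (FK_CUSTOMER_ID, FK_RUN_START_DATE_ID, FK_RUN_END_DATE_ID, FK_TRAVEL_ID,
    FK_ORIGIN_ID, FK_DEST_ID, FK_SEGMENT_ID, SEGMENT_PERCENTAGE, LAST_UPDATED)
FROM LOCAL '/my/file/full/path/test1.dat'
DELIMITER ','
EXCEPTIONS '/tmp/test1_exceptions.log'      -- parse errors are logged here instead of silently disappearing
REJECTED DATA '/tmp/test1_rejected.dat';    -- the offending input rows themselves
Once this loads cleanly in vsql, the same statement is what you would then try from MATLAB.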

Insert large amount of data efficiently with SQL

Hi, I often have to insert a lot of data into a table. For example, I might have data from Excel or a text file in the form of:
1,a
3,bsdf
4,sdkfj
5,something
129,else
Then I would construct 6 insert statements for this example and run the SQL script. I found this is slow when I have to send thousands of small packets to the server, and it also adds extra network overhead.
What's your best way of doing this?
Update: I'm using ORACLE 10g.
Use Oracle external tables.
See also e.g.
OraFaq about external tables
What Tom thinks about external tables
René Nyffenegger's notes about external tables
A simple example that should get you started
You need a file located in a server directory (get familiar with directory objects):
SQL> select directory_path from all_directories where directory_name = 'JTEST';
DIRECTORY_PATH
--------------------------------------------------------------------------------
c:\data\jtest
SQL> !cat ~/.gvfs/jtest\ on\ 192.168.xxx.xxx/exttable-1.csv
1,a
3,bsdf
4,sdkfj
5,something
129,else
Create an external table:
create table so13t (
id number(4),
data varchar2(20)
)
organization external (
type oracle_loader
default directory jtest /* jtest is an existing directory object */
access parameters (
records delimited by newline
fields terminated by ','
missing field values are null
)
location ('exttable-1.csv') /* the file located in jtest directory */
)
reject limit unlimited;
Now you can use all the powers of SQL to access the data:
SQL> select * from so13t order by data;
ID DATA
---------- ------------------------------------------------------------
1 a
3 bsdf
129 else
4 sdkfj
5 something
I'm not sure if this works in Oracle, but in SQL Server you can use the BULK INSERT statement to upload data from a txt or CSV file.
BULK
INSERT [TableName]
FROM 'c:\FileName.txt'
WITH
(
FIELDTERMINATOR = ',',
ROWTERMINATOR = '\n'
)
GO
Just make sure that the table columns correctly match what's in the txt file. For a more complicated solution you may want to use a format file; see the following:
http://msdn.microsoft.com/en-us/library/ms178129.aspx
There are a lot of ways to speed this up.
1) Do it in a single transaction. This will speed things up by avoiding connection opening / closing (see the sketch after this list).
2) Load directly as a CSV file. If you load data as a CSV file, the "SQL" statements aren't required at all. In MySQL, the "LOAD DATA INFILE" operation accomplishes this very intuitively and simply.
3) You can also simply dump the whole file as text into a table called "raw" and then let the database parse the data on its own using triggers. This is a hack, but it will simplify your application code and reduce network usage.
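To illustrate option 1 in Oracle terms, here is a hedged sketch (my_table and its columns are made up to match the sample rows) that batches the rows into a single statement and a single transaction instead of six autocommitted round trips:
-- hypothetical target table matching the sample data
INSERT ALL
    INTO my_table (id, data) VALUES (1, 'a')
    INTO my_table (id, data) VALUES (3, 'bsdf')
    INTO my_table (id, data) VALUES (4, 'sdkfj')
    INTO my_table (id, data) VALUES (5, 'something')
    INTO my_table (id, data) VALUES (129, 'else')
SELECT * FROM dual;  -- INSERT ALL requires a source subquery; dual works for literal rows
COMMIT;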