How to get filename when loading data - azure-synapse

I'm using Azure Synapse/SQL Pools/Data Warehouse (insert any other brand names I may have missed!) to load data from Azure blob store.
I'm doing this via an external table using PolyBase.
I want to capture the source file for each row of data.
I've tried testing with OPENROWSET, but it does not appear to work:
SELECT
    *,
    x.filename() AS [filename]
FROM
    OPENROWSET(
        WITH (
            DATA_SOURCE = [Analytics_AzureStorage],
            LOCATION = N'2022/06/21',
            FILE_FORMAT = [CompressedTSV]
        )
    ) x
Msg 103010, Level 16, State 1, Line 1
Parse error at line: 5, column: 5: Incorrect syntax near 'OPENROWSET'.
How can I load the filename into a table in the Azure Synapse pool?
Edit:
The OPENROWSET function is not supported in dedicated SQL pool.
That explains why it does not work. Is there a COPY/PolyBase equivalent command for getting the file name?

Your syntax is wrong: WITH should come later.
Have a look at the syntax here: https://learn.microsoft.com/en-us/azure/synapse-analytics/sql/develop-openrowset
OPENROWSET
( { BULK 'unstructured_data_path' , [DATA_SOURCE = <data source name>, ]
FORMAT= ['PARQUET' | 'DELTA'] }
)
[WITH ( {'column_name' 'column_type' }) ]
[AS] table_alias(column_alias,...n)
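For reference, here is roughly what the corrected query could look like with the options reordered per that syntax. This is only a sketch for a serverless SQL pool (per the edit above, OPENROWSET is not supported in dedicated pools), and the wildcard path, FORMAT = 'CSV', and tab delimiter are assumptions based on the original LOCATION and the [CompressedTSV] format name:
SELECT
    x.*,
    x.filename() AS [filename]
FROM OPENROWSET(
        BULK '2022/06/21/*',
        DATA_SOURCE = 'Analytics_AzureStorage',
        FORMAT = 'CSV',
        FIELDTERMINATOR = '\t'  -- TSV; compression options may need adjusting
    ) AS x;
The filename() function on the table alias is what exposes the source file for each row in serverless pools.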

Related

Partitioning Data in SQL On-Demand with Blob Storage as Data Source

In Amazon Redshift there is a way to create a partition key when using your S3 bucket as a data source. Link.
I am attempting to do something similar in Azure Synapse using the SQL On-Demand service.
Currently I have a storage account that is partitioned such that it follows this scheme:
-Sales (folder)
- 2020-10-01 (folder)
- File 1
- File 2
- 2020-10-02 (folder)
- File 3
- File 4
To create a view and pull in all 4 files I ran the command:
CREATE VIEW testview3 AS
SELECT *
FROM OPENROWSET(
    BULK 'Sales/*/*.csv',
    FORMAT = 'CSV',
    PARSER_VERSION = '2.0',
    DATA_SOURCE = 'AzureBlob',
    FIELDTERMINATOR = ',',
    FIRSTROW = 2
) AS tv1;
If I run SELECT * FROM [myview] I receive data from all 4 files.
How can I go about creating a partition key so that I could run a query such as
SELECT * FROM [myview] WHERE folderdate > '2020-10-01'
and only analyze the data from Files 3 and 4?
I know I can edit my OPENROWSET BULK statement, but I want to be able to pull all the data from my container at first and then constrain searches as needed.
Serverless SQL can parse partitioned folder structures using the filename function (when you wish to load a specific file or files) and the filepath function (when you wish to load all files under a given path). More information on syntax and usage is available in the online documentation.
In your case, you can parse all files from '2020-10-01' and beyond using the filepath syntax, such as filepath(1) > '2020-10-01'.
To expand on the answer from Raunak, I ended up with the following syntax for my query.
DROP VIEW IF EXISTS testview6
GO
CREATE VIEW testview6 AS
SELECT
    *,
    r.filepath(1) AS [date]
FROM OPENROWSET(
    BULK 'Sales/*/*.csv',
    FORMAT = 'CSV',
    PARSER_VERSION = '2.0',
    DATA_SOURCE = 'AzureBlob',
    FIELDTERMINATOR = ',',
    FIRSTROW = 2
) AS [r]
WHERE r.filepath(1) IN ('2020-10-02');
You can adjust the granularity of the partitioning by adding extra wildcards (*) and matching r.filepath(x) references.
For instance, you can create your query like this:
DROP VIEW IF EXISTS testview6
GO
CREATE VIEW testview6 AS
SELECT
    *,
    r.filepath(1) AS [year],
    r.filepath(2) AS [month]
FROM OPENROWSET(
    BULK 'Sales/*-*-01/*.csv',
    FORMAT = 'CSV',
    PARSER_VERSION = '2.0',
    DATA_SOURCE = 'AzureBlob',
    FIELDTERMINATOR = ',',
    FIRSTROW = 2
) AS [r]
WHERE r.filepath(1) IN ('2020')
  AND r.filepath(2) IN ('10');
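Since the original goal was to load everything first and constrain searches later, a variation (a sketch under the same data source and folder layout; the view name SalesAll is a placeholder) is to expose the filepath part as a column without any WHERE clause in the view, and filter at query time instead; serverless SQL can use such predicates to skip non-matching folders:
CREATE VIEW SalesAll AS
SELECT
    *,
    r.filepath(1) AS [folderdate]
FROM OPENROWSET(
    BULK 'Sales/*/*.csv',
    FORMAT = 'CSV',
    PARSER_VERSION = '2.0',
    DATA_SOURCE = 'AzureBlob',
    FIELDTERMINATOR = ',',
    FIRSTROW = 2
) AS [r];

-- Constrain at query time, e.g. only data after 2020-10-01 (Files 3 and 4)
SELECT * FROM SalesAll WHERE [folderdate] > '2020-10-01';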

Create SQL table from parquet files

I am using R to handle large datasets (the largest data frame is 30,000,000 x 120). These are stored in Azure Data Lake Storage as parquet files, and we need to query them daily and load them into a local SQL database. Parquet files can be read without loading the data into memory, which is handy. However, creating SQL tables from parquet files is more challenging, as I'd prefer not to load the data into memory.
Here is the code I used. Unfortunately, this is not a perfect reprex, as the SQL database needs to exist for this to work.
# load packages
library(tidyverse)
library(arrow)
library(sparklyr)
library(DBI)
# Create test data
test <- data.frame(matrix(rnorm(20), nrow = 10))
# Save as parquet file
parquet_path <- tempfile(fileext = ".parquet")
write_parquet(test, parquet_path)
# Load main table lazily (memory = FALSE keeps the data out of memory)
sc <- spark_connect(master = "local", spark_home = spark_home_dir())
test <- spark_read_parquet(sc, name = "test_main", path = parquet_path,
                           memory = FALSE, overwrite = TRUE)
# Save into SQL table
DBI::dbWriteTable(conn = connection,
                  name = DBI::Id(schema = "schema", table = "table"),
                  value = test)
Is it possible to write a SQL table without loading parquet files into memory?
I lack experience with T-SQL bulk import and export, but this is likely where you'll find your answer.
library(arrow)
library(DBI)

test <- data.frame(matrix(rnorm(20), nrow = 10))
f <- tempfile(fileext = '.parquet')
write_parquet(test, f)

# Upload table using bulk insert (paste0 avoids stray spaces around the path)
dbExecute(connection,
          paste0("
            BULK INSERT [database].[schema].[table]
            FROM '", gsub('\\\\', '/', f), "'
            WITH (FORMAT = 'PARQUET');
          "))
Here I use T-SQL's own BULK INSERT command.
Disclaimer: I have not yet used this command in T-SQL, so it may be riddled with errors. For example, I can't see a place to specify snappy compression within the documentation, although it can be specified if one instead defines a custom file format with CREATE EXTERNAL FILE FORMAT.
Now, the above only inserts into an existing table. For your specific case, where you'd like to create a new table from the file, you would likely be looking for OPENROWSET with CREATE TABLE AS [select statement].
# column_defs: a named vector mapping column names to T-SQL types
column_definition <- paste(names(column_defs), column_defs, collapse = ', ')
dbExecute(connection,
          paste0("CREATE TABLE MySqlTable
                  AS
                  SELECT *
                  FROM OPENROWSET(
                      BULK '", f, "', FORMAT = 'PARQUET'
                  ) WITH (
                      ", column_definition, "
                  );"))
where column_defs would be a named list or vector giving the SQL data-type definition for each column. A (more or less) complete translation from R data types to T-SQL data types is available in the T-SQL documentation (note two very necessary translations: Date and POSIXlt are not present). Once again, disclaimer: my time in T-SQL did not get to BULK INSERT or similar.
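As an illustration, a minimal helper for building such a column_defs vector from a data frame might look like the sketch below. The type mapping is an assumption covering only a few common R classes, not the full translation table:
# Hypothetical mapping from common R classes to T-SQL types (assumption)
r_to_sql <- function(x) {
  switch(class(x)[1],
         numeric   = "float",
         integer   = "int",
         logical   = "bit",
         character = "nvarchar(4000)",
         "nvarchar(4000)")  # fallback for unhandled classes
}
column_defs <- vapply(test, r_to_sql, character(1))
# e.g. c(X1 = "float", X2 = "float")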

Importing OLAP metadata in SQL Server via linked server results in out-of-range date

Currently, I am trying to extract metadata from an OLAP cube in SQL Server (via a linked server) using this simple query:
select *
into [dbo].[columns_metadata]
from openquery([LINKED_SERVER], '
select *
from $System.TMSCHEMA_COLUMNS
')
But in the result set there is a column named RefreshedTime with values like 31.12.1699 00:00:00.
Because of this value, the query fails with this error message:
Msg 8114, Level 16, State 9, Line 1
Error converting data type (null) to datetime.
The problem is that I need to run the query without specifying the columns in the SELECT statement.
Do you know a trick to avoid this error?
I know you wanted to avoid listing the columns explicitly, but in case nobody can suggest a way to handle the 1699-12-31 dates, you can fall back to this:
select *
into [dbo].[columns_metadata]
from openquery([LINKED_SERVER], '
SELECT [ID]
,[TableID]
,[ExplicitName]
,[InferredName]
,[ExplicitDataType]
,[InferredDataType]
,[DataCategory]
,[Description]
,[IsHidden]
,[State]
,[IsUnique]
,[IsKey]
,[IsNullable]
,[Alignment]
,[TableDetailPosition]
,[IsDefaultLabel]
,[IsDefaultImage]
,[SummarizeBy]
,[ColumnStorageID]
,[Type]
,[SourceColumn]
,[ColumnOriginID]
,[Expression]
,[FormatString]
,[IsAvailableInMDX]
,[SortByColumnID]
,[AttributeHierarchyID]
,[ModifiedTime]
,[StructureModifiedTime]
,CStr([RefreshedTime]) as [RefreshedTime]
,[SystemFlags]
,[KeepUniqueRows]
,[DisplayOrdinal]
,[ErrorMessage]
,[SourceProviderType]
,[DisplayFolder]
from $System.TMSCHEMA_COLUMNS
')
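If you later need RefreshedTime as a date again on the SQL Server side, a possible follow-up (a sketch, assuming the dd.mm.yyyy format shown above) is to parse the string into datetime2, which, unlike datetime, accepts dates before 1753; TRY_CONVERT returns NULL for anything that still fails to parse:
SELECT m.*,
       TRY_CONVERT(datetime2, m.RefreshedTime, 104) AS RefreshedTimeParsed
FROM [dbo].[columns_metadata] AS m;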

Saved data frame is not shown correctly in SQL Server

I have a data frame named distTest which has columns in UTF-8 encoding. I want to save distTest as a table in my SQL database. My code is as follows:
library(RODBC)
load("distTest.RData")
Sys.setlocale("LC_CTYPE", "persian")
dbhandle <- odbcDriverConnect('driver={SQL Server};server=****;database=TestDB;trusted_connection=true',
                              DBMSencoding = "UTF-8")
Encoding(distTest$regsub) <- "UTF-8"
Encoding(distTest$subgroup) <- "UTF-8"
sqlSave(dbhandle, distTest,
        tablename = "DistBars", verbose = TRUE, rownames = FALSE, append = TRUE)
I set DBMSencoding for my connection and the encodings
Encoding(distTest$regsub) <- "UTF-8"
Encoding(distTest$subgroup) <- "UTF-8"
for my columns. However, when I save the data frame to SQL, the columns are not stored in the correct format; they come out garbled.
When I set fast = FALSE in the sqlSave call, I got this error:
Error in sqlSave(dbhandle, Distbars, tablename = "DistBars", verbose =
T, : 22001 8152 [Microsoft][ODBC SQL Server Driver][SQL
Server]String or binary data would be truncated. 01000 3621
[Microsoft][ODBC SQL Server Driver][SQL Server]The statement has been
terminated. [RODBC] ERROR: Could not SQLExecDirect 'INSERT INTO
"DistBars" ( "regsub", "week", "S", "A", "F", "labeled_cluster",
"subgroup", "windows" ) VALUES ( 'ظâ€', 5, 4, 2, 3, 'cl1', 'ط­ظ…ظ„
ط²ط¨ط§ظ„ظ‡', 1 )'
I also tried NVARCHAR(MAX) for the UTF-8 columns in the table design; with fast = FALSE the truncation error went away, but the encoding problem remained.
By the way, a part of the data is exported as RData here.
Why is the data not shown correctly in SQL Server 2016?
UPDATE
I am now fully convinced that there is something wrong with the RODBC package.
As a test, I tried inserting into the table with
sqlQuery(channel = dbhandle, "insert into DistBars
         values(N'7من', NULL, NULL, NULL, NULL, NULL, NULL, NULL)")
and the format is still wrong. Unfortunately, adding CharSet=utf8; to the connection string does not work either.
I had the same issue in my code, and I managed to fix it by removing rows_at_time = 1 from my connection configuration.
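For reference, a minimal sketch of what that looks like (assuming rows_at_time = 1 was previously being passed to odbcDriverConnect; the server name is a placeholder):
library(RODBC)
# Connection without the rows_at_time = 1 argument
dbhandle <- odbcDriverConnect(
  'driver={SQL Server};server=myserver;database=TestDB;trusted_connection=true',
  DBMSencoding = "UTF-8")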

Exporting SQL Query to a local text file

This is for an approach that WRITES to a local file.
I am using SQL Workbench connected to an AWS Redshift instance (which is based on PostgreSQL). I would like to run a query and have the data exported from AWS Redshift to a local csv or text file. I have tried:
SELECT transaction_date ,
Variable 1 ,
Variable 2 ,
Variable 3 ,
Variable 4 ,
Variable 5
From xyz
into OUTFILE 'C:/filename.csv'
But I get the following error:
ERROR: syntax error at or near "'C:/filename.csv'"
Position: 148
into OUTFILE 'C:/filename.csv'
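INTO OUTFILE is MySQL syntax, which Redshift does not recognize. One client-side alternative (a sketch, assuming the client is SQL Workbench/J, whose WbExport command writes the result of the following statement to a local file; the path, delimiter, and variable_1..variable_5 column names are placeholders for the real ones):
-- WbExport applies to the statement that follows it
WbExport -type=text -file='C:/filename.csv' -delimiter=',' -header=true;
SELECT transaction_date, variable_1, variable_2, variable_3, variable_4, variable_5
FROM xyz;
Redshift's own UNLOAD command can also export query results, but only to an S3 bucket, not to a local path.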