Managing the escape character for an external table in Azure Synapse Analytics

I have an ADF pipeline that reads a SAP table and then writes to an ADLS Gen2 sink in CSV format.
The SAP table has an address field containing a comma (",") between the street
and the house number: this comma is part of the data, not a column delimiter.
So, in ADF, for the sink dataset I have:
column delimiter = comma;
row delimiter = default;
encoding = default (UTF-8);
escape character = backslash;
quote character = no quote character.
Inside Synapse Analytics (serverless SQL pool), in order to create a related external table
from the corresponding ADLS Gen2 CSV, an external file format was created with these options:
format type = DELIMITEDTEXT;
format options = (FIELD_TERMINATOR = N',', USE_TYPE_DEFAULT = False).
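In T-SQL, that file format corresponds to roughly the following (the format name here is just illustrative):
CREATE EXTERNAL FILE FORMAT CsvBackslashEscapeFormat
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (FIELD_TERMINATOR = N',', USE_TYPE_DEFAULT = FALSE)
);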
Viewing the data in the SQL external table, the values after the address are wrong
because the escape character was misinterpreted: the backslash was treated
as a field terminator.
Any suggestions on how to solve such an issue? Thanks

Unfortunately, you can't specify an escape character for external tables in Synapse SQL.
It is not supported at the moment.
There are 2 ways to achieve your scenario:
1. Change how files are generated from ADF
By adding the quote character " you can omit the escape character in ADF.
This way the serverless SQL pool will be able to read your files (see the sketch after the configuration below).
ADF configuration:
column delimiter = comma;
row delimiter = default;
encoding = default (UTF-8);
escape character = no escape character;
quote character = "
2. Use OPENROWSET
This scenario can be achieved with OPENROWSET.
Here is an example of it:
SELECT *
FROM OPENROWSET(
    BULK 'path',
    FORMAT = 'CSV',
    PARSER_VERSION = '2.0',
    FIELDTERMINATOR = ',',
    ESCAPECHAR = '\\'
) AS [r];
You can specify the escape character this way: ESCAPECHAR = '\\'.
Reference in the docs: Query CSV files - Escape characters.
You can create a new feature request on Azure feedback and the team will triage it accordingly.

Related

How to escape double quotes within a data when it is already enclosed by double quotes

I have CSV data separated by commas, like below, which has to be imported into a Snowflake table using the COPY command.
"1","2","3","2"In stick"
Since I am already passing the parameter OPTIONALLY_ENCLOSED_BY = '"' to the COPY command, I couldn't escape the " (double quote) within the data ("2"In stick").
The imported data that I want to see in the table is like below:
1,2,3,2"In stick
Can someone please help here? Thanks!
If you are on Windows, I have a funny solution for that. Open the CSV file in MS Excel. Excel consumes the correct double quotes to show the data in cell format and leaves the extra ones in the middle of a cell (if each cell is properly separated by commas). Then choose 'Replace' and replace the double quotes with something else (like two single quotes, or with nothing to remove them). Then save it again as a CSV. I assume other spreadsheet programs would do the same.
If you have an unescaped quote inside a field which is surrounded by quotes, that isn't really valid CSV. For example, here is an excerpt from the RFC 4180 spec:
If double-quotes are used to enclose fields, then a double-quote
appearing inside a field must be escaped by preceding it with another double quote.
For example:
"aaa","b""bb","ccc"
I think that whatever is generating the CSV file is doing it incorrectly and needs to be fixed before you will be able to load it into Snowflake. I don't think any file_format option will be able to solve this for you since it's not valid CSV.
The CSV row should either look like this:
"1","2","3","2""In stick"
or this:
"1","2","3","2\"In stick"
I had this same problem, and while writing up the question, I found an answer:
Import RFC4180 files (CSV spec) into snowflake? (Unable to create file format that matches CSV RFC spec)
Essentially, set:
Column Separator = Comma
Row Separator = New Line
Header lines to skip = {you have to decide what to put here}
Field optionally enclosed by = Double Quote
Escape Character = None
Escape Unenclosed Field = None
Here is my ALTER statement:
ALTER FILE FORMAT "DB_NAME"."SCHEMA_NAME"."CSV_SPEC3" SET
    COMPRESSION = 'NONE'
    FIELD_DELIMITER = ','
    RECORD_DELIMITER = '\n'
    SKIP_HEADER = 1
    FIELD_OPTIONALLY_ENCLOSED_BY = '\042'
    TRIM_SPACE = FALSE
    ERROR_ON_COLUMN_COUNT_MISMATCH = TRUE
    ESCAPE = 'NONE'
    ESCAPE_UNENCLOSED_FIELD = 'NONE'
    DATE_FORMAT = 'AUTO'
    TIMESTAMP_FORMAT = 'AUTO'
    NULL_IF = ('\\N');
As I mention in the answer, I don't know why the above works, but it is working for me. Go figure.
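For completeness, a COPY that uses a named file format like the one above would look something along these lines (the table, stage, and file names are hypothetical):
COPY INTO my_table
FROM @my_stage/data.csv
FILE_FORMAT = (FORMAT_NAME = 'DB_NAME.SCHEMA_NAME.CSV_SPEC3');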

Copy csv & json data from S3 to Redshift

I have data in the format below in an S3 bucket:
"2010-9","aws cloud","{"id":1,"name":"test"}"
"2010-9","aws cloud1","{"id":2,"name":"test2"}"
I want to copy the data into the database like below.
Table
year | env | desc
2010-9 | aws cloud |{"id":1,"name":"test"}
2010-9 | aws cloud1 |{"id":2,"name":"test2"}
I have written this command but it is not working. Could you please help me?
copy table
from 's3://bucketname/manifest' credentials 'aws_access_key_id=xx;aws_secret_access_key=xxx'
delimiter ','
IGNOREHEADER 1
REMOVEQUOTES
IGNOREBLANKLINES
manifest;
You are almost there - you just need to escape the double quotes inside the 3rd field (desc). Per the CSV spec:
If double-quotes are used to enclose fields, then a double-quote
appearing inside a field must be escaped by preceding it with
another double quote. For example: "aaa","b""bb","ccc"
This is per rfc-4180 - https://www.ietf.org/rfc/rfc4180.txt
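Once the inner double quotes are doubled as the spec describes, one way to load the file is to switch the COPY to CSV mode, which expects exactly that doubled-quote convention (a sketch based on the command in the question; note that CSV cannot be combined with REMOVEQUOTES):
copy table
from 's3://bucketname/manifest' credentials 'aws_access_key_id=xx;aws_secret_access_key=xxx'
csv
IGNOREHEADER 1
manifest;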
I've also loaded json into a text field in Redshift and then used the json functions to parse the field. Works great.
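As an illustration of that approach, assuming the desc column was loaded as plain text, something like this pulls values out of the JSON (the table name is hypothetical; desc is quoted because it is a reserved word):
SELECT year,
       env,
       json_extract_path_text("desc", 'id')   AS id,
       json_extract_path_text("desc", 'name') AS name
FROM my_table;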

How to remove new line characters from data rows in Presto/AWS Athena?

I'm querying some tables on Athena (Presto SAS) and then downloading the generated CSV file to use locally. Opening the file, I realised the data contains newline characters that don't appear in the AWS interface, only in the CSV, and I need to get rid of them. I tried using the function replace(string, search, replace) → varchar to escape the newline char, replacing \n with \\n, without success:
SELECT
p.recvepoch, replace(p.description, '\n', '\\n') AS description
FROM
product p
LIMIT 1000
How can I achieve that?
The problem was that the underlying table data doesn't actually contain \n anywhere; instead, it contains the actual newline character, which is represented by chr(10). I was able to achieve the expected behaviour using the replace function, passing it as a parameter:
SELECT
p.recvepoch, replace(p.description, chr(10), '\n') AS description
FROM
product p
LIMIT 1000

Custom delimiter while exporting Google Cloud SQL to CSV

I've been successfully exporting Google Cloud SQL to CSV with its default delimiter ",". I want to import this CSV into Google BigQuery and I've succeeded in doing this.
However, I'm experiencing a little problem. There's a "," in some of my cells/fields, which causes the BigQuery import process to not work properly. For example:
"Budi", "19", "Want to be hero, and knight"
My questions are:
Is it possible to export Google Cloud SQL with a custom delimiter, e.g. "|"?
If not, how can the above sample data be imported into Google BigQuery and become 3 fields/cells?
Cheers.
Is it possible to export Google Cloud SQL with a custom delimiter, e.g. "|"?
Yes, it is. See the BigQuery documentation page on how to set load options, provided in this link.
You will need to add --field_delimiter='|' to your command.
From the documentation:
(Optional) The separator for fields in a CSV file. The separator can be any ISO-8859-1 single-byte character. To use a character in the range 128-255, you must encode the character as UTF8. BigQuery converts the string to ISO-8859-1 encoding, and uses the first byte of the encoded string to split the data in its raw, binary state. BigQuery also supports the escape sequence "\t" to specify a tab separator. The default value is a comma (,).
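For example, a bq load invocation with that flag might look like this (the dataset, table, and bucket names are illustrative, and it assumes the destination table already exists):
bq load --source_format=CSV --field_delimiter='|' mydataset.mytable gs://mybucket/export.csv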
As far as I know there's no way of setting a custom delimiter when exporting from CloudSQL to CSV. I attempted to introduce my own delimiter by formulating my select query like so:
select column_1||'|'||column_2 from foo
But this only results in Cloud SQL enclosing the whole result in double quotes in the resulting CSV. This also aligns with the documentation, which states:
Exporting in CSV format is equivalent to running the following SQL statement:
SELECT <query> INTO OUTFILE ... CHARACTER SET 'utf8mb4'
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"'
ESCAPED BY '\\' LINES TERMINATED BY '\n'
https://cloud.google.com/sql/docs/mysql/import-export/exporting

PIG - Read each line into its field

Is there a way I can read each line from a logfile into its own field? I thought that with ('\n') as a delimiter I should be able to achieve that.
File - test
Audit file /u01/app/oracle/admin/st01/adump/st011_ora_27063_1.aud
Node name: test0041
CLIENT USER:[6] 'oracle'
So I would like to read this into three fields as
filename - Audit file /u01/app/oracle/admin/st01/adump/st011_ora_27063_1.aud
nodename - Node name: test0041
username - CLIENT USER:[6] 'oracle'
I tried this but it didn't help.
A = LOAD 'test' using PigStorage ('\n') AS (filename, nodename, username);
You can't use '\n' as a delimiter with PigStorage. According to the Pig10 docs:
Record Deliminters – For load statements Pig interprets the line feed ( '\n' ), carriage return ( '\r' or CTRL-M) and combined CR + LF ( '\r\n' ) characters as record delimiters (do not use these characters as field delimiters). For store statements Pig uses the line feed ('\n') character as the record delimiter.
If you want to parse the log file you'll have to write a custom loader.
If your file is that small, why don't you do some pre-processing of the file, like converting \n to \t, before the LOAD?
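A minimal sketch of that idea, assuming the file holds a single three-line record that has been rewritten as one tab-separated line in a file called test_tab (the file name and the pre-processing step are assumptions):
-- 'test_tab' is the pre-processed file where the three record lines were joined with tabs
A = LOAD 'test_tab' USING PigStorage('\t') AS (filename:chararray, nodename:chararray, username:chararray);
DUMP A;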