I have defined a table on top of files present in HDFS. I am using the OpenCSV
SerDe to read from the files, but backslash ('\') characters in the data are getting omitted from the final result set.
Is there a Hive SerDe property that I am not using correctly? As per the documentation, escapeChar = '\' should fix this problem, but the problem persists.
CREATE EXTERNAL TABLE `tsr`(
`last_update_user` string COMMENT 'from deserializer',
`last_update_datetime` string COMMENT 'from deserializer')
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
'escapeChar'='\',
'quoteChar'='\"',
'separatorChar'=',',
'serialization.encoding'='UTF-8')
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
'hdfs://edl/hive/db/tsr'
TBLPROPERTIES (
'COLUMN_STATS_ACCURATE'='{\"BASIC_STATS\":\"true\"}',
'numFiles'='1',
'numRows'='1869',
'rawDataSize'='0',
'serialization.null.format'='',
'totalSize'='144640',
'transient_lastDdlTime'='1524479930')
Sample Output:
DomainUser1 , 2017-07-04 19:07:27
Expected Result:
Domain\User1 , 2017-07-04 19:07:27
EDIT 1: I have tried both '\\' and '\' as the escapeChar, and both have the same problem.
Unfortunately, the CSV SerDe in Hive does not support multiple characters as the separator/quote/escape. It looks like you want to use two backslashes as the escapeChar, which is not possible, considering that OpenCSVSerde only supports a single character as the escape (it actually uses CSVReader, which only supports one). I am not aware of any other SerDe in Hive that supports multiple characters; you can always implement your own with another library, but that is not the most popular option (nobody wants to support their own stuff :)). I would recommend using a different character as the escape, ideally one that is not present in your data. A second option would be to modify your data during ingestion to replace \ with \\.
In the documentation, "escapeChar" = "\\" is specified with two backslashes. Please check it; a sketch applying it to the table from the question follows the snippet below:
WITH SERDEPROPERTIES (
"separatorChar" = "\t",
"quoteChar" = "'",
"escapeChar" = "\\"
)
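Applied to the table from the question, the DDL would look roughly like this (a sketch only; column comments and TBLPROPERTIES are omitted, and STORED AS TEXTFILE stands in for the explicit input/output formats):
-- same table as in the question, with escapeChar written as two backslashes
CREATE EXTERNAL TABLE `tsr`(
  `last_update_user` string,
  `last_update_datetime` string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  'separatorChar' = ',',
  'quoteChar' = '"',
  'escapeChar' = '\\')
STORED AS TEXTFILE
LOCATION 'hdfs://edl/hive/db/tsr';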
I had a similar issue. It can be solved by changing "escapeChar" = "\" to something else, for example "escapeChar" = "\n".
Related
I have an ADF pipeline that reads a SAP table and then writes to an ADLS gen2 sink in CSV format.
The SAP table has an address field containing the comma character (",") between the street
and the house number: this comma is part of the data and isn't a column delimiter.
So, in ADF, for the sink dataset I have:
column delimiter = comma;
row delimiter = default;
encoding = default (UTF-8);
escape character = backslash;
quote character = no quote character.
Inside Synapse Analytics (serverless SQL pool), in order to create a related external table
from the corresponding ADLS gen2 CSV, an external file format was created with these options:
format type = DELIMITEDTEXT;
format options = (FIELD_TERMINATOR = N',', USE_TYPE_DEFAULT = False).
Viewing the data in the SQL external table, the values next to the address are wrong
because the escape character was misinterpreted: the backslash was interpreted
as a field terminator.
Now, does anyone have any suggestions to solve such an issue? Thanks
Unfortunately, you can't specify escape characters for external tables in Synapse SQL.
It is not supported for now.
There are 2 ways to achieve your scenario:
1. Change how files are generated from ADF
By adding the quote character " you can omit the escape character in ADF.
This way the serverless SQL pool will be able to read your files (an external file format sketch for this follows the configuration below).
ADF configuration:
column delimiter = comma;
row delimiter = default;
encoding = default (UTF-8);
escape character = no escape character;
quote character = "
2. Use OPENROWSET
This scenario can be achieved with OPENROWSET.
Here is an example of it:
SELECT *
FROM OPENROWSET(
BULK 'path',
FORMAT = 'CSV',
PARSER_VERSION = '2.0',
FIELDTERMINATOR =',',
ESCAPECHAR = '\\'
) AS [r];
You can specify escape character this way: ESCAPECHAR = '\\'
Reference in docs Query CSV files - Escape characters
You can create a new feature request and the team will triage it accordingly.
Azure feedback
The Redshift UNLOAD command is replacing " with "".
Example:
UNLOAD($$ select '"Jane"' as name $$)
TO 's3://s3-bucket/test_'
iam_role 'arn:aws:iam::xxxxxx:role/xxxxxx'
HEADER
CSV
DELIMITER ','
ALLOWOVERWRITE
The output looks like: ""Jane""
If I run the same command with select 'Jane' as name, the output shows no quotes at all, just Jane. But I need the output to be "Jane".
You are asking for the unloaded file to be in CSV format, and the CSV format says that if you want a double quote in your data, you need to escape it with another double quote. See https://datatracker.ietf.org/doc/html/rfc4180
So Redshift is doing exactly what you requested. Now, if you just want a comma-delimited file, then you don't want to use "CSV", as this option adds all the characters necessary to make the file fully compliant with the CSV specification.
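For example, dropping the CSV keyword from the command in the question and keeping only a delimiter leaves the quotes in the data untouched (a sketch; the output is then plain comma-delimited text rather than RFC 4180 CSV):
UNLOAD($$ select '"Jane"' as name $$)
TO 's3://s3-bucket/test_'
iam_role 'arn:aws:iam::xxxxxx:role/xxxxxx'
HEADER
DELIMITER ','
ALLOWOVERWRITE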
This choice will come down to what tool or tools are reading the file and whether they expect an RFC-compliant CSV or just a simple file where fields are separated by commas.
This is a gripe of mine - tools that say they read CSV but don't follow the spec. If you say CSV, then follow the format. Or call what you read something different, like CDV - comma delimited values.
I'm querying some tables on Athena (Presto SQL) and then downloading the generated CSV file to use locally. Opening the file, I realised the data contains newline characters that don't appear in the AWS interface, only in the CSV, and I need to get rid of them. I tried using the function replace(string, search, replace) → varchar to escape the newline character, replacing \n with \\n, without success:
SELECT
p.recvepoch, replace(p.description, '\n', '\\n') AS description
FROM
product p
LIMIT 1000
How can I achieve that?
The problem was that the underlying table data doesn't actually contain a literal \n anywhere; instead, it contains the actual newline character, which is represented by chr(10). I was able to achieve the expected behaviour using the replace function and passing it as a parameter:
SELECT
p.recvepoch, replace(p.description, chr(10), '\n') AS description
FROM
product p
LIMIT 1000
I've been successfully exporting Google Cloud SQL to CSV with its default delimiter ",". I want to import this CSV into Google BigQuery, and I've succeeded in doing this.
However, I'm experiencing a little problem: there are commas (",") inside some of my cells/fields, and this keeps the BigQuery import process from working properly. For example:
"Budi", "19", "Want to be hero, and knight"
My questions are:
Is it possible to export Google Cloud SQL with a custom delimiter, e.g. "|"?
If not, how can the sample data above be imported into Google BigQuery and become 3 fields/cells?
Cheers.
Is it possible to export Google Cloud SQL with a custom delimiter, e.g. "|"?
Yes, it is. See the BigQuery documentation page on how to set load options, provided in this link.
You will need to add --field_delimiter='|' to your command; a SQL-statement equivalent is sketched after the quote below.
From the documentation:
(Optional) The separator for fields in a CSV file. The separator can be any ISO-8859-1 single-byte character. To use a character in the range 128-255, you must encode the character as UTF8. BigQuery converts the string to ISO-8859-1 encoding, and uses the first byte of the encoded string to split the data in its raw, binary state. BigQuery also supports the escape sequence "\t" to specify a tab separator. The default value is a comma (,).
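If you prefer to stay in SQL, the same option is also exposed by BigQuery's LOAD DATA statement (a sketch; mydataset.mytable and the gs:// path are placeholders):
LOAD DATA INTO mydataset.mytable
FROM FILES (
    format = 'CSV',
    field_delimiter = '|',
    uris = ['gs://my-bucket/export_*.csv']
);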
As far as I know, there's no way of setting a custom delimiter when exporting from Cloud SQL to CSV. I attempted to introduce my own delimiter by formulating my select query like so:
select column_1||'|'||column_2 from foo
But this only results in Cloud SQL wrapping the whole result in double quotes in the resulting CSV. This also aligns with the documentation, which states:
Exporting in CSV format is equivalent to running the following SQL statement:
SELECT <query> INTO OUTFILE ... CHARACTER SET 'utf8mb4'
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"'
ESCAPED BY '\\' LINES TERMINATED BY '\n'
https://cloud.google.com/sql/docs/mysql/import-export/exporting
There are some strings in a Hive table where I use the transform method to replace some characters. My mapper script looks like this:
<?php
// Mapper script: read each line from stdin and replace every \x07 (BEL)
// byte with \x01 (^A) before printing the line back out.
$strFrom = "\7";
$strTo = "\1"; // with "|" as the replacement it works well
$fd = fopen("php://stdin", "r");
while ($line = fgets($fd)) {
    $outStr = str_replace($strFrom, $strTo, $line);
    print $outStr;
}
fclose($fd);
My Hive SQL looks like this:
select transform (value)
using 'home/php/bin/php -c home/php/etc/php.ini replace.php'
as (v1 string)
from test_tbl
Actually, I am trying to replace "\7" with "\1" in the string. The replacement itself seems to work correctly, but it only outputs the first column. One input looks like this:
a\7b\7c\7d
Then the output looks like this:
a
yeah, just one column!
If I replace it with "|" instead, the output is:
a|b|c|d
So I am confused: why does Hive split the string on "\1"? How can I prevent it? I just want to get:
a\1b\1c\1d
I found my answer here (a sketch applying it follows the quote below):
Data written to the filesystem is serialized as text with columns separated by ^A and rows separated by newlines.
As of Hive 0.11.0 the separator used can be specified, in earlier versions it was always the ^A character (\001)
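So, on Hive 0.11.0 or later, one hedged workaround when writing the transformed rows out to the filesystem is to specify that separator explicitly, so the \001 bytes produced by the script are no longer reinterpreted as column boundaries (a sketch only, untested, reusing the query from the question):
INSERT OVERWRITE DIRECTORY '/tmp/test_tbl_out'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
SELECT TRANSFORM (value)
USING 'home/php/bin/php -c home/php/etc/php.ini replace.php'
AS (v1 string)
FROM test_tbl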
Thanks to everyone who has seen this question.