Copy CSV & JSON data from S3 to Redshift - sql

I have data in the below format in an S3 bucket.
"2010-9","aws cloud","{"id":1,"name":"test"}"
"2010-9","aws cloud1","{"id":2,"name":"test2"}"
I want to copy the data into a database table like below.
Table
year   | env        | desc
2010-9 | aws cloud  | {"id":1,"name":"test"}
2010-9 | aws cloud1 | {"id":2,"name":"test2"}
I have written this command, but it is not working. Could you please help me?
copy table
from 's3://bucketname/manifest' credentials 'aws_access_key_id=xx;aws_secret_access_key=xxx'
delimiter ','
IGNOREHEADER 1
REMOVEQUOTES
IGNOREBLANKLINES
manifest;

You are almost there - you just need to escape the double quotes inside the 3rd field (desc). Per the CSV specification:
If double-quotes are used to enclose fields, then a double-quote
appearing inside a field must be escaped by preceding it with
another double quote. For example: "aaa","b""bb","ccc"
This is per RFC 4180 - https://www.ietf.org/rfc/rfc4180.txt
I've also loaded JSON into a text field in Redshift and then used the JSON functions to parse the field. Works great.
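For illustration, a minimal sketch (the table name, bucket path, and credentials are placeholders, and it assumes the file has been rewritten with the inner quotes doubled). With the CSV option, COPY understands the doubled quotes, so REMOVEQUOTES is no longer needed (the two options cannot be combined), and the loaded text column can then be parsed with Redshift's JSON functions:
-- data file, with the quotes inside the JSON field doubled per RFC 4180:
-- "2010-9","aws cloud","{""id"":1,""name"":""test""}"
-- "2010-9","aws cloud1","{""id"":2,""name"":""test2""}"
copy mytable
from 's3://bucketname/manifest'
credentials 'aws_access_key_id=xx;aws_secret_access_key=xxx'
ignoreheader 1
ignoreblanklines
csv
manifest;

-- with desc loaded as plain text, pull values out of the JSON
select year, env, json_extract_path_text("desc", 'name') as json_name
from mytable;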

Related

How to include a double quote in a Splunk field value

When using Splunk, if we have a log event like
key="hello"
and search in Splunk with
* | table a
we can see the value hello.
We might print out a value containing a double quote without escaping it:
key="hel"lo"
We'll see the key value is hel; the value breaks at the position of the quote.
If we try to escape the double quote with \:
key="hel\"lo"
We'll see the key value is hel\
If we use single quotes around the value:
key='hel"lo'
We'll see the key value includes the single quotes; it's 'hel"lo'. In this case, the search criteria should be
* key="'ab\"c'" | table a
The single quotes are part of the value.
The question is: how do I include a double quote as part of the value?
Ideally, there should be a way to escape double quotes: the input
key="hel\"lo"
should match the query
key="hel\"lo"
But it doesn't.
I have had this problem for many years. Splunk values are dynamic and could contain double quotes. I'm not going to use JSON as my log format.
I'm curious why there is no answer on Splunk's official website.
Can someone help? Thanks.
| makeresults
| eval bub="hell\"o"
| table bub
Puts a double-quote mark right in the middle of the bub field
If you want to search for the double-quote mark, use | where match() like this:
| where match(bub,"\"")
Ideally, the data source would not generate events with embedded quotes without escaping them. Otherwise, how would a reader know the quote is embedded and not mismatched? This is the problem Splunk is struggling with.
The fix is to create your own parser using transforms.
In props.conf:
[mysourcetype]
TRANSFORMS-parseKey = parse_key
In transforms.conf:
[parse_key]
REGEX = (\w+)="(.*\".*)"
FORMAT = $1::$2
Of course, this regex is simplified. You'll need to modify it to match your data.

" replaced by ""

The Redshift UNLOAD command is replacing " with "".
Example:
UNLOAD($$ select '"Jane"' as name $$)
TO 's3://s3-bucket/test_'
iam_role 'arn:aws:iam::xxxxxx:role/xxxxxx'
HEADER
CSV
DELIMITER ','
ALLOWOVERWRITE;
The output looks like: ""Jane""
If I run the same command with select 'Jane' as name, the output shows no quotes at all, just Jane. But I need the output to be "Jane".
You are asking for the unloaded file to be in CSV format and CSV format says that if you want a double quote in your data you need to escape it with another double quote. See https://datatracker.ietf.org/doc/html/rfc4180
So Redshift is doing exactly as you requested. Now if you just want a comma delimited file then you don't want to use "CSV" as this will add all the necessary characters to make the file fully compliant with the CSV specification.
This choice will come down to what tool or tools are reading the file and if they expect an rfc compliant CSV or just a simple file where fields are separated by commas.
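For example, a sketch of the same unload without the CSV option, which should leave the quotes in the data untouched (bucket and role are placeholders as in the question; if the data itself could contain commas or newlines you would also want the ESCAPE option):
UNLOAD($$ select '"Jane"' as name $$)
TO 's3://s3-bucket/test_'
iam_role 'arn:aws:iam::xxxxxx:role/xxxxxx'
HEADER
DELIMITER ','
ALLOWOVERWRITE;
-- output row: "Jane"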
This is a gripe of mine - tools that say they read CSV but don't follow the spec. If you say CSV then follow the format. Or call what you read something different, like CDV - comma delimited values.

How to remove new line characters from data rows in Presto/AWS Athena?

I'm querying some tables on Athena (Presto SaaS) and then downloading the generated CSV file to use locally. Opening the file, I realised the data contains newline characters that don't appear in the AWS interface, only in the CSV, and I need to get rid of them. I tried using the function replace(string, search, replace) → varchar to replace the newline char, substituting \n with \\n, without success:
SELECT
p.recvepoch, replace(p.description, '\n', '\\n') AS description
FROM
product p
LIMIT 1000
How can I achieve that?
The problem was that the underlying table data doesn't actually contain \n anywhere; instead, it contains the actual newline character, which is represented by chr(10). I was able to achieve the expected behaviour by passing that to the replace function as a parameter:
SELECT
p.recvepoch, replace(p.description, chr(10), '\n') AS description
FROM
product p
LIMIT 1000
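If the rows can also contain carriage returns (Windows-style line endings), the same approach should extend by stripping chr(13) as well - a sketch, assuming the same table and columns as above:
SELECT
p.recvepoch, replace(replace(p.description, chr(13), ''), chr(10), '\n') AS description
FROM
product p
LIMIT 1000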

Custom delimiter while exporting Google Cloud SQL to CSV

I've been successfully exporting GCloud SQL to CSV with its default delimiter ",". I want to import this CSV into Google BigQuery, and I've succeeded in doing this.
However, I'm experiencing a little problem. There's a "," in some of my cells/fields. It causes the BigQuery import process to not work properly. For example:
"Budi", "19", "Want to be hero, and knight"
My questions are:
Is it possible to export Google Cloud SQL with custom delimiter e.g. "|"?
If not, how can the above sample data be imported into Google BigQuery as 3 fields/cells?
Cheers.
Is it possible to export Google Cloud SQL with custom delimiter e.g. "|"?
Yes it is. See the BigQuery documentation on how to set load options, provided in this link.
You will need to add --field_delimiter = '|' to your command
From the documentation:
(Optional) The separator for fields in a CSV file. The separator can be any ISO-8859-1 single-byte character. To use a character in the range 128-255, you must encode the character as UTF8. BigQuery converts the string to ISO-8859-1 encoding, and uses the first byte of the encoded string to split the data in its raw, binary state. BigQuery also supports the escape sequence "\t" to specify a tab separator. The default value is a comma (,).
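As a sketch, a bq load invocation with that flag could look like the following (the dataset, table, bucket, and schema here are made up for illustration):
bq load \
  --source_format=CSV \
  --field_delimiter='|' \
  --skip_leading_rows=1 \
  mydataset.mytable \
  gs://my-bucket/export.csv \
  name:STRING,age:INTEGER,description:STRING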
As far as I know there's no way of setting a custom delimiter when exporting from CloudSQL to CSV. I attempted to introduce my own delimiter by formulating my select query like so:
select column_1||'|'||column_2 from foo
But this only results in CloudSQL wrapping the whole result in double quotes in the resulting CSV. This also aligns with the documentation, which states:
Exporting in CSV format is equivalent to running the following SQL statement:
SELECT <query> INTO OUTFILE ... CHARACTER SET 'utf8mb4'
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"'
ESCAPED BY '\\' LINES TERMINATED BY '\n'
https://cloud.google.com/sql/docs/mysql/import-export/exporting

Escape all special characters in Presto

We are uploading data (CSV) to S3 and then into Presto. But due to problems with the data inside the files, we have problems loading from S3 into Presto.
The metadata are correctly formed, but because of problems in column B, the loads are failing.
A;B;DATE
EPA;Ørsted Energy Sales & Distribution;2019-01-11 12:10:13
EPA;De MARIA GærfaPepeer A/S; 2019-02-12 12:10:13
EPA;Scan Convert A/S; 2019-02-11 11:10:12
EPA;***Mega; 2019-02-11 11:10:13
EPA;sAYSlö-SähAAdkö Oy; 2019-02-11 11:11:11
We are adding replacement formulas in a previous step (Informatica Cloud) to add \ and read the values correctly.
Is there a list of characters we should look for and prefix with \?
The problem is that, according to the standard, if your B column could contain the separator then you should quote that column. If there are quotes inside (which in 99% of cases can happen) then you should add an escape character before them.
A;B;DATE
EPA;"company";01/01/2000
EPA;"Super \"company\""; 01/01/2000
EPA,"\"dadad\" \;"; 01/01/2000
I had a similar problem; it's quite easy to solve with regular expressions:
In your scenario you can search for:
(^EPA;) and replace it with: $1" ==> s/(^EPA;)/$1"/g
(;[0-9]{1,2}/[0-9]{1,2}) and replace it with: "$1 ==> s/\s*(;[0-9]{1,2}/[0-9]{1,2})/"$1/g
The final step would be a global backslash enrichment:
s/([^;"]|;")(")([^;\n])/$1\\$2$3/g
Please take a look at this:
https://fullouterjoin.wordpress.com/2019/04/05/dealing-with-broken-csv-strings-with-missing-escape-characters-powercenter/
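As a complementary option on the Athena/Presto side (not part of the original answer): once the files are quoted and backslash-escaped as described above, the table can be declared with the OpenCSV SerDe so that the separator, quote, and escape characters are honoured. A sketch, with the table name, columns, and S3 location as placeholders:
CREATE EXTERNAL TABLE companies (
  a string,
  b string,
  `date` string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  'separatorChar' = ';',
  'quoteChar'     = '"',
  'escapeChar'    = '\\'
)
LOCATION 's3://my-bucket/companies/'
TBLPROPERTIES ('skip.header.line.count' = '1');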