Escape all special characters in Presto - amazon-s3

We are uploading CSV data to S3 and then loading it from S3 into Presto, but because of problems with the data inside the files, the load from S3 into Presto fails.
The metadata are correctly formed, but the rows fail because of problems in column B.
A;B;DATE
EPA;Ørsted Energy Sales & Distribution;2019-01-11 12:10:13
EPA;De MARIA GærfaPepeer A/S; 2019-02-12 12:10:13
EPA;Scan Convert A/S; 2019-02-11 11:10:12
EPA;***Mega; 2019-02-11 11:10:13
EPA;sAYSlö-SähAAdkö Oy; 2019-02-11 11:11:11
We are adding replacement formulas in the previous step (Informatica Cloud) to insert \ so the values are read correctly.
Is there a list of characters we should look for and prefix with \?

The problem is that, according to the standard, if your B column can contain the separator, you should quote that column. If there can be quotation marks inside the value (which will almost certainly happen), you should put an escape character before them.
A;B;DATE
EPA;"company";01/01/2000
EPA;"Super \"company\""; 01/01/2000
EPA,"\"dadad\" \;"; 01/01/2000
I had a similar problem; it's quite easy to solve with regular expressions.
In your scenario you can search for:
(^EPA;) and replace it with: $1" ==> s/(^EPA;)/$1"/g
(;[0-9]{1,2}/[0-9]{1,2}) and replace it with: "$1 ==> s/\s*(;[0-9]{1,2}\/[0-9]{1,2})/"$1/g
The final step would be to add the backslashes globally:
s/([^;"]|;")(")([^;\n])/$1\\$2$3/g
Please take a look at this:
https://fullouterjoin.wordpress.com/2019/04/05/dealing-with-broken-csv-strings-with-missing-escape-characters-powercenter/
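For reference, the regex steps above approximate what a standard CSV writer does automatically. A minimal Python sketch (assuming the A;B;DATE layout, a semicolon delimiter, and backslash escaping as in the examples above):

import csv
import sys

rows = [
    ["EPA", "Ørsted Energy Sales & Distribution", "2019-01-11 12:10:13"],
    ["EPA", 'Super "company"', "2019-02-12 12:10:13"],
]

# Quote every field and escape embedded double quotes with a backslash.
writer = csv.writer(
    sys.stdout,
    delimiter=";",
    quoting=csv.QUOTE_ALL,
    doublequote=False,
    escapechar="\\",
)
writer.writerows(rows)

The second row prints as "EPA";"Super \"company\"";"2019-02-12 12:10:13", i.e. the same shape as the hand-built examples above.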

Related

How can a Splunk field value contain a double quote?

When using Splunk, if we have a log event such as
key="hello"
and search in Splunk with
* | table a
we can see the value hello.
We might print out a value containing a double quote without escaping it:
key="hel"lo"
We'll see the key value is hel; the value is cut off at the position of the embedded quote.
If we try to escape the double quote with \,
key="hel\"lo"
We'll see the key value is hel\ - the backslash does not escape the quote.
If we use single quotes around the value,
key='hel"lo'
We'll see the key value includes the single quotes: it's 'hel"lo'. In this case, the search criteria should be
* key="'ab\"c'" | table a
where the single quotes are part of the value.
The question is: how do we include a double quote as part of a value?
Ideally, there should be a way to escape double quotes, so that the input
key="hel\"lo"
should match the query
key="hel\"lo"
But it does not.
I have had this problem for many years. Splunk values are dynamic and can contain double quotes, and I'm not going to use JSON as my log format.
I'm curious why there is no answer on Splunk's official website.
Can someone help? Thanks.
| makeresults
| eval bub="hell\"o"
| table bub
This puts a double-quote mark right in the middle of the bub field.
If you want to search for the double-quote mark, use | where match() like this:
| where match(bub,"\"")
Ideally, the data source would not generate events with embedded quotes without escaping them. Otherwise, how would a reader know the quote is embedded and not mismatched? This is the problem Splunk is struggling with.
The fix is to create your own parser using transforms.
In props.conf:
[mysourcetype]
TRANSFORMS-parseKey = parse_key
In transforms.conf:
[parse_key]
REGEX = (\w+)="(.*\".*)"
FORMAT = $1::$2
Of course, this regex is simplified. You'll need to modify it to match your data.
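As a quick sanity check outside Splunk, the capture behaviour of that regex can be tried with Python's re module (just a sketch, using the sample event from the question):

import re

# Simplified version of the REGEX used in transforms.conf above.
pattern = re.compile(r'(\w+)="(.*\".*)"')

event = 'key="hel"lo"'  # event with an embedded, unescaped double quote
match = pattern.search(event)
if match:
    print(match.group(1))  # key
    print(match.group(2))  # hel"lo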

Recursively search directory for occurrences of each string from one column of a .csv file

I have a CSV file--let's call it search.csv--with three columns. For each row, the first column contains a different string. As an example (punctuation of the strings is intentional):
Col 1,Col 2,Col 3
string1,valueA,stringAlpha
string 2,valueB,stringBeta
string'3,valueC,stringGamma
I also have a set of directories contained within one overarching parent directory, each of which have a subdirectory we'll call source, such that the path to source would look like this: ~/parentDirectory/directoryA/source
What I would like to do is search the source subdirectories for any occurrences--in any file--of each of the strings in Col 1 of search.csv. Some of these strings will need to be manually edited, while others can be categorically replaced. I ran the following command:
awk -F "," '{print $1}' search.csv | xargs -I# grep -Frli # ~/parentDirectory/*/source/*
What I would want is a list of files that match the criteria described above.
My awk call gets a few hits, followed by xargs: unterminated quote. There are some single quotes in some of the strings in the first column that I suspect may be the problem. The larger issue, however, is that when I did a sanity check on the results I got (which seemed far too few to be right), there was a vast discrepancy. I ran the following:
ag -l "searchTerm" ~/parentDirectory
Where searchTerm is a substring of many (but not all) of the strings in the first column of search.csv. In contrast to my above awk-based approach which returned 11 files before throwing an error, ag found 154 files containing that particular substring.
Additionally, my current approach is too low-resolution even if it didn't error out, in that it wouldn't distinguish between which results are for which strings, which would be key to selectively auto-replacing certain strings. Am I mistaken in thinking this should be doable entirely in awk? Any advice would be much appreciated.
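As an illustration of the per-string bookkeeping described above, here is a rough Python sketch that records which files matched which string; the paths, the header skip, and the case-insensitive fixed-string match are assumptions based on the description in the question:

import csv
import os

parent = os.path.expanduser("~/parentDirectory")

# Read the first column of search.csv, skipping the header row.
with open("search.csv", newline="") as f:
    reader = csv.reader(f)
    next(reader)
    terms = [row[0] for row in reader]

matches = {term: [] for term in terms}

# Walk only the .../source subdirectories of each directory in parent.
for sub in sorted(os.listdir(parent)):
    source = os.path.join(parent, sub, "source")
    if not os.path.isdir(source):
        continue
    for dirpath, _, filenames in os.walk(source):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, errors="ignore") as fh:
                    text = fh.read().lower()
            except OSError:
                continue
            for term in terms:
                if term.lower() in text:  # fixed-string, case-insensitive
                    matches[term].append(path)

for term, files in matches.items():
    print(term, "->", len(files), "file(s)")

Because each string is matched as a fixed substring in the script, the single quotes in the strings never reach a shell and cannot cause an unterminated quote error.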

Copy csv & json data from S3 to Redshift

I have data in the format below in an S3 bucket.
"2010-9","aws cloud","{"id":1,"name":"test"}"
"2010-9","aws cloud1","{"id":2,"name":"test2"}"
I want to copy the data into the database like below.
Table
year | env | desc
2010-9 | aws cloud |{"id":1,"name":"test"}
2010-9 | aws cloud1 |{"id":2,"name":"test2"}
I have written this command but it is not working. Could you please help me?
copy table
from 's3://bucketname/manifest' credentials 'aws_access_key_id=xx;aws_secret_access_key=xxx'
delimiter ','
IGNOREHEADER 1
REMOVEQUOTES
IGNOREBLANKLINES
manifest;
You are almost there - you just need to escape the double quotes inside the 3rd field (desc). Per the RFC:
If double-quotes are used to enclose fields, then a double-quote
appearing inside a field must be escaped by preceding it with
another double quote. For example: "aaa","b""bb","ccc"
This is per rfc-4180 - https://www.ietf.org/rfc/rfc4180.txt
I've also loaded json into a text field in Redshift and then used the json functions to parse the field. Works great.
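For illustration, this is roughly what the corrected rows look like when the embedded double quotes are doubled as the RFC describes - a small Python sketch using the sample values above:

import csv
import sys

rows = [
    ["2010-9", "aws cloud", '{"id":1,"name":"test"}'],
    ["2010-9", "aws cloud1", '{"id":2,"name":"test2"}'],
]

# doublequote=True (the default) doubles any embedded double quote,
# which is the RFC 4180 escaping described above.
writer = csv.writer(sys.stdout, quoting=csv.QUOTE_ALL)
writer.writerows(rows)

# Output:
# "2010-9","aws cloud","{""id"":1,""name"":""test""}"
# "2010-9","aws cloud1","{""id"":2,""name"":""test2""}"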

SQL functions in NetSuite saved search results - how to fix these functions?

I am trying to achieve the following in the context of NetSuite saved search results output.
1. Remove every character after the first hyphen (-) or colon (:), including the space right before either of these characters.
So, for example:
Input: test 123 - xyz : 123
this should output test 123 (it should even remove the space that you see right before the hyphen).
I tried the two formulas below:
SUBSTR({custitem123}, 0, INSTR({custitem123}, '-')-1)
SUBSTR({custitem123}, 0, INSTR({custitem123}, ':')-1)
These work fine on their own, so I am trying to combine them into one formula that will look for either character and remove all characters after it. Apart from this, it should also look for any space right before the hyphen or colon and replace it with nothing. I am not sure how to achieve this.
2. Remove all non-alphabetic characters, plus any space before the first alphabetic character (if any).
For example, Input: 1. Test XYZ
This should have Output as:
Test XYZ
I tried achieving this with the formula below:
TRIM({class}, '[^A-Za-z ]', '')
The problem with this approach is that it fails to replace the space character before the first letter of Test. I understand this is because I told it to skip replacing space characters. What I don't know is how to tell it to replace only the space it finds before the first alphabetic character.
In short, how do I make sure the output is:
Test XYZ
and not:
Test XYZ with a leading space before Test?
You can use regexp_substr as
regexp_substr({custitem123}, '[^-]+') to extract only test 123 from the input test 123 - xyz : 123.
If you also add trim, the surrounding whitespace is removed as well:
e.g. trim(regexp_substr({custitem123}, '[^-]+')) gives test 123 as the trimmed output.
Use RTRIM instead of TRIM to remove only the trailing whitespace, like this:
RTRIM(regexp_substr({custitem123}, '[^-]+'))
test 123 - xyz : 123 resolves to test 123
Also, thanks for asking - this question helped me solve my own similar issue :D
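The combined logic can also be prototyped outside NetSuite before turning it into a saved search formula. A Python sketch of both requirements (the patterns mirror the RTRIM(regexp_substr(...)) idea above, extended to handle the colon and the leading non-letters; adapt to the formula syntax as needed):

import re

def cut_at_first_separator(value):
    # Keep everything before the first '-' or ':', then drop the
    # trailing space(s), e.g. "test 123 - xyz : 123" -> "test 123".
    return re.match(r"[^-:]*", value).group(0).rstrip()

def strip_leading_non_letters(value):
    # Remove everything up to the first letter, including the space
    # right before it, e.g. "1. Test XYZ" -> "Test XYZ".
    return re.sub(r"^[^A-Za-z]*", "", value)

print(cut_at_first_separator("test 123 - xyz : 123"))  # test 123
print(strip_leading_non_letters("1. Test XYZ"))        # Test XYZ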

Postgres 9.3 end-of-copy marker corrupt - Any way to change this setting?

I am trying to stream data through an AWK program to a Postgres COPY command. This usually works great. However, recently my data has contained long text strings with '\.' in them.
The Postgres documentation mentions that this combination of characters represents the end-of-data marker (http://www.postgresql.org/docs/9.2/static/sql-copy.html), and I am getting the associated errors when trying to insert with COPY.
My question is, is there a way to turn this off? Perhaps change the end-of-data marker to a different combination of characters? Or do I have to alter/remove these strings before trying to insert using the COPY command?
You can try to filter your data through sed 's:\\:\\\\:g' - this changes every \ in your data to \\, which is the correct escape sequence for a single backslash in COPY data.
But the backslash is probably not the only problem: newlines should also be encoded as \n, carriage returns as \r, and tabs as \t (tab is the default field delimiter in COPY).
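A small Python sketch of that escaping (backslash first, then the whitespace characters), assuming the default text format of COPY with tab as the field delimiter:

def escape_copy_field(value):
    # Escape the backslash first so the later replacements are not
    # themselves double-escaped.
    return (
        value.replace("\\", "\\\\")
             .replace("\n", "\\n")
             .replace("\r", "\\r")
             .replace("\t", "\\t")
    )

print(escape_copy_field("ends with \\."))  # ends with \\.
print(escape_copy_field("two\tcolumns"))   # two\tcolumns

After this transformation, a line containing '\.' arrives at COPY as '\\.', which is read back as a literal backslash followed by a dot instead of being treated as the end-of-data marker.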