Remove double quotes from Data - azure-data-lake

I am getting data in a CSV file with double quotes around string columns, but while reading the CSV file using U-SQL I get errors because there are double quotes inside the data as well.
I am thinking of replacing the double quotes that are inside the data as a first step and then reading the file, but I am not sure how to do that, since we have double quotes everywhere.
Any suggestions would be appreciated, or if someone could help me with PowerShell or .NET code to do the same, that would be a great help, as I am not good with .NET or PowerShell.
Sample Data
“Name”;”Department”
“Abc”;”Education”Teaching”
“Cde”;”Test”Another”
It should be
“Name”;”Department”
“Abc”;”EducationTeaching”
“Cde”;”TestAnother”

You can use a regex find/replace in Visual Studio Code. For example (assuming that the data only contains letters; you can edit the regex as needed):
Find regex: "([a-zA-Z]+)"([a-zA-Z]+)"
Replace string: "$1$2"
Input string:
"Name";"Department"
"Abc";"Education"Teaching"
"Cde";"Test"Another"
Output string:
"Name";"Department"
"Abc";"EducationTeaching"
"Cde";"TestAnother"
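Since the question also asks for PowerShell, here is a minimal sketch that applies the same replacement to a whole file (the file names are hypothetical, and it assumes straight quotes as in the example above):
# Read the whole file as one string, remove the stray inner quotes, write it back
$raw = Get-Content -Path .\data.csv -Raw
$raw -replace '"([a-zA-Z]+)"([a-zA-Z]+)"', '"$1$2"' | Set-Content -Path .\data_clean.csv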

So it seems that your quotes are not the standard [Char]34. Instead they are [Char]8220 and [Char]8221 (curly "smart" quotes), so we need to do a replace in PowerShell first:
$TEST = #"
“Name”;”Department” “Abc”;”Education”Teaching” “Cde”;”Test”Another”
"#
$TEST | %{
$_ = $_ -replace [char]8220, '"'
$_ = $_ -replace [char]8221, '"'
$_ -replace '"([a-zA-Z]+)"([a-zA-Z]+)"','"$2 $1"'
}
this would make the output :
"Name";"Department" "Abc";"Teaching Education" "Cde";"Another Test"

You could also do this in a custom row processor. Have the initial extract read the CSV file into a single-column rowset (one raw line per row), then pass each row through a row processor that parses the data and removes the offending characters. I've done something similar for handling fixed-width text files; see the sketch below.
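As a rough illustration of that idea (not the exact code referenced above), here is a hedged C# sketch of a U-SQL row processor; the column names RawLine and CleanLine and the regex are hypothetical, and it assumes the quotes have already been normalized to straight quotes:
using Microsoft.Analytics.Interfaces;
using System.Text.RegularExpressions;

[SqlUserDefinedProcessor]
public class CleanQuotesProcessor : IProcessor
{
    public override IRow Process(IRow input, IUpdatableRow output)
    {
        // Each input row holds one raw CSV line; drop any quote that is
        // not adjacent to the ; delimiter or the start/end of the line.
        string raw = input.Get<string>("RawLine");
        string clean = Regex.Replace(raw, "(?<!^)(?<!;)\"(?!;)(?!$)", "");
        output.Set<string>("CleanLine", clean);
        return output.AsReadOnly();
    }
}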

Related

Remove space between empty quotes in csv file using powershell

I have a CSV file with many empty quotes and I want to remove them using PowerShell. I tried various solutions but they didn't work.
Sample data: " ","abc",""," ","123"
Expected output: ,"abc",,"123"
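A minimal PowerShell sketch of one approach (the file names are hypothetical; it assumes field values never contain commas or quotes of their own):
Get-Content .\input.csv | ForEach-Object {
    # Drop quoted fields that are empty or whitespace-only, keeping
    # the surrounding commas so the empty fields remain in place
    $_ -replace '(?<=^|,)"\s*"(?=,|$)', ''
} | Set-Content .\output.csv
Note that this leaves an empty field for every removed pair of quotes, so the sample row becomes ,"abc",,,"123".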

How to escape double quotes within a data when it is already enclosed by double quotes

I have CSV data separated by commas, like below, which has to be imported into a Snowflake table using the COPY command.
"1","2","3","2"In stick"
Since I am already passing the parameter OPTIONALLY_ENCLOSED_BY = '"' to the COPY command, I couldn't escape the " (double quote) within the data ("2"In stick").
The imported data that I want to see in the table is like below:
1,2,3,2"In stick
Can someone please help here? Thanks!
If you are on Windows, I have a funny solution for that. Open the CSV file in MS Excel. Excel consumes the correct double quotes to show data in cellular format and leaves the extra ones in the middle of a cell (if each cell is separated properly by commas). Then choose 'replace' and replace the double quotes with something else (like two single quotes, or with nothing to remove them). Then save it again as a CSV. I assume other spreadsheet programs would do the same.
If you have an un-escaped quote inside a field which is surrounded by quotes, that isn't really valid CSV. For example, here is an excerpt from the RFC 4180 spec:
If double-quotes are used to enclose fields, then a double-quote
appearing inside a field must be escaped by preceding it with another double quote.
For example:
"aaa","b""bb","ccc"
I think that whatever is generating the CSV file is doing it incorrectly and needs to be fixed before you will be able to load it into Snowflake. I don't think any file_format option will be able to solve this for you since it's not valid CSV.
The CSV row should either look like this:
"1","2","3","2""In stick"
or this:
"1","2","3","2\"In stick"
I had this same problem, and while writing up the question, I found an answer:
Import RFC4180 files (CSV spec) into snowflake? (Unable to create file format that matches CSV RFC spec)
Essentially, set:
Column Separator: Comma
Row Separator: New Line
Header lines to skip: {you have to decide what to put here}
Field optionally enclosed by: Double Quote
Escape Character: None
Escape Unenclosed Field: None
Here is my ALTER statement:
ALTER FILE FORMAT "DB_NAME"."SCHEMA_NAME"."CSV_SPEC3" SET
  COMPRESSION = 'NONE'
  FIELD_DELIMITER = ','
  RECORD_DELIMITER = '\n'
  SKIP_HEADER = 1
  FIELD_OPTIONALLY_ENCLOSED_BY = '\042'
  TRIM_SPACE = FALSE
  ERROR_ON_COLUMN_COUNT_MISMATCH = TRUE
  ESCAPE = 'NONE'
  ESCAPE_UNENCLOSED_FIELD = 'NONE'
  DATE_FORMAT = 'AUTO'
  TIMESTAMP_FORMAT = 'AUTO'
  NULL_IF = ('\\N');
As I mention in the answer, I don't know why the above works, but it is working for me. Go figure.
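For reference, a sketch of the COPY command that would use such a file format (the table, stage, and file names here are hypothetical):
COPY INTO my_table
FROM @my_stage/data.csv
FILE_FORMAT = (FORMAT_NAME = 'DB_NAME.SCHEMA_NAME.CSV_SPEC3');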

Custom delimiter while exporting Google Cloud SQL to CSV

I've been successfully exporting Cloud SQL to CSV with its default delimiter ",". I want to import this CSV into Google BigQuery, and I've succeeded in doing this.
However, I'm experiencing a little problem: there is "," inside some of my cells/fields, which causes the BigQuery import process to not work properly. For example:
"Budi", "19", "Want to be hero, and knight"
My questions are:
Is it possible to export Google Cloud SQL with a custom delimiter, e.g. "|"?
If not, how can the sample data above be imported into Google BigQuery as 3 fields/cells?
Cheers.
Is it possible to export Google Cloud SQL with custom delimiter e.g. "|"?
Yes it is. See the BigQuery documentation page on how to set load options, provided in this link.
You will need to add --field_delimiter='|' to your command.
From the documentation:
(Optional) The separator for fields in a CSV file. The separator can be any ISO-8859-1 single-byte character. To use a character in the range 128-255, you must encode the character as UTF8. BigQuery converts the string to ISO-8859-1 encoding, and uses the first byte of the encoded string to split the data in its raw, binary state. BigQuery also supports the escape sequence "\t" to specify a tab separator. The default value is a comma (,).
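A hedged sketch of what that load command might look like (the dataset, table, bucket path, and schema here are hypothetical):
bq load --source_format=CSV --field_delimiter='|' --skip_leading_rows=1 mydataset.mytable gs://mybucket/export.csv name:STRING,age:INTEGER,bio:STRING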
As far as I know, there's no way of setting a custom delimiter when exporting from Cloud SQL to CSV. I attempted to introduce my own delimiter by formulating my select query like so:
select column_1||'|'||column_2 from foo
But this only results in Cloud SQL escaping the whole result in the resulting CSV with double quotes. This also aligns with the documentation, which states:
Exporting in CSV format is equivalent to running the following SQL statement:
SELECT <query> INTO OUTFILE ... CHARACTER SET 'utf8mb4'
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"'
ESCAPED BY '\\' LINES TERMINATED BY '\n'
https://cloud.google.com/sql/docs/mysql/import-export/exporting

Visual Studio Code Snippet Variable Transform not working

I'm trying to make a snippet that inserts the last two directories of the current file path.
My code:
${TM_DIRECTORY/\\(.*)\\([a-zA-Z]+)\\([a-zA-Z]+)/$1\\$2/}
So when the file path is
"...\htdocs\projectname\src"
the output should be
"projectname\src".
But instead I get this result:
${TM_DIRECTORY/(.*)\\([a-zA-Z]+)\\([a-zA-Z]+)/$1/}
What am I doing wrong?
Problem:
The issue is that VS Code converts \\ in the snippet JSON to \. For example, if you want to write \w, you have to write \\w in the snippet.
In the same way, you have to write \\\\ in the snippet JSON so that it is converted into \\ (an escaped backslash in the regex).
Solution:
${TM_DIRECTORY/.*?\\\\([a-zA-Z]+\\\\[a-zA-Z]+)$/$1/}
or, I think you should use \w instead of [a-zA-Z], because a directory name can contain characters like - or _ etc.:
${TM_DIRECTORY/.*?\\\\(\\w+\\\\\\w+)$/$1/}
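For context, here is a minimal snippet definition sketch showing where that transform goes in the snippet JSON (the snippet name, prefix, and description are hypothetical):
"last-two-dirs": {
  "prefix": "l2d",
  "body": ["${TM_DIRECTORY/.*?\\\\(\\w+\\\\\\w+)$/$1/}"],
  "description": "Insert the last two directories of the current file path"
}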

Quoting -replace & variables

This is in response to my previous question:
PowerShell: -replace, regex and ($) dollar sign woes
My question is: why do these 2 lines of code have different output:
'abc' -replace 'a(\w)', '$1'
'abc' -replace 'a(\w)', "$1"
AND, according to the 2 articles below, why doesn't the variable '$1' in single quotes get treated as a literal string? Everything in single quotes should be treated as literal text, right?
http://www.computerperformance.co.uk/powershell/powershell_quotes.htm
http://blogs.msdn.com/b/powershell/archive/2006/07/15/variable-expansion-in-strings-and-herestrings.aspx
When you use single quotes you tell PowerShell to use a string literal meaning everything between the opening and closing quote is to be interpreted literally.
When you use double quotes, PowerShell will interpret specific characters inside the double quotes.
See get-help about_quoting_rules or click here.
The dollar sign has a special meaning in regular expressions and in PowerShell. You want to use the single quotes if you intend the dollar sign to be used as the regular expression.
In your example the regex a(\w) matches the letter 'a' and then a word character, captured in back reference #1. So when you replace with '$1' you are replacing the matched text ab with the back-reference match b. So you get bc.
In your second example, using double quotes, PowerShell interprets "$1" as a string containing the variable $1. You don't have a variable named $1, so it's null. The regex therefore replaced ab with nothing, which is why you only get c.
In your second line:
'abc' -replace 'a(\w)', "$1"
Powershell replaces the $1 before it gets to the regex replace operation, as others have stated. You can avoid that replacement by using a backtick, as in:
'abc' -replace 'a(\w)', "`$1"
Thus, if you had a string in a variable $prefix which you wanted to include in the replacement string, you could use it in the double quotes like this:
'abc' -replace 'a(\w)', "$prefix`$1"
The '$1' is a regex backreference. It's created by the regex match, and it only exists within the context of that replace operation. It is not a powershell variable.
"$1" will be interpreted as a Powershell variable. If no variable called $1 exists, the replacement value will be null.
Since I cannot comment or upvote: David Rogers' answer worked for me. I needed to use both a regex backreference and a PowerShell variable in a regex replace.
I needed to understand what the backtick did before I implemented it. The explanation: the backtick is PowerShell's escape character.
My use case
$new = "AAA"
"REPORT.TEST998.TXT" -Replace '^([^.]+)\.([^.]+)([^.]{3})\.', "`$1.`$2$new."
Result
REPORT.TESTAAA.TXT
Alternatives
Format string
"REPORT.TEST998.TXT" -Replace '^([^.]+)\.([^.]+)([^.]{3})\.', ('$1.$2{0}.' -f )
Comments
As per https://get-powershellblog.blogspot.com/2017/07/bye-bye-backtick-natural-line.html, I'll probably use the format-string method to avoid the use of backticks.
Here's the PowerShell 7 version, where you don't have to deal with a single-quoted $1; it uses a script block as the second argument, replacing 'ab' with 'b':
'abc' -replace 'a(\w)', {$_.groups[1]}
bc