Spark CSV output: getting empty quotes instead of blank or space - apache-spark-sql

I am trying to find a solution to have the empty quotes in the CSV file replaced with a blank space, with no quotes. I tried the options below, but I get a 0 instead of a space in the CSV file. Can someone let me know how to get a space instead of "" in the final CSV file when the value is empty? Thanks.
```java
outputDS
    .write()
    .mode(SaveMode.Append)
    .option("header", true)
    .option("delimiter", "\t")
    .option("emptyValue", null)
    .option("encoding", "UTF-8")
    .csv(fullOutputPath);
```
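Not a definitive fix, but a minimal sketch of one thing to try: Spark's CSV writer exposes an emptyValue option (and a nullValue option), and passing a literal space string instead of null may be what is wanted here; verify against your Spark version.

```java
// Hedged sketch: write a single space where the column value is an empty
// string (emptyValue) or null (nullValue), instead of a quoted "".
outputDS
    .write()
    .mode(SaveMode.Append)
    .option("header", true)
    .option("delimiter", "\t")
    .option("emptyValue", " ")   // empty strings come out as a space
    .option("nullValue", " ")    // nulls come out as a space too
    .option("encoding", "UTF-8")
    .csv(fullOutputPath);
```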

Related

Remove space between empty quotes in csv file using powershell

I have a CSV file with many empty quotes and I want to remove them using PowerShell. I tried various solutions but none of them worked.
Sample data: " ","abc",""," ","123"
Expected output: ,"abc",,"123"

How to read $ character while reading a csv using pandas dataframe

I want to ignore the $ sign while reading the CSV file. I have tried multiple encoding options such as latin-1, utf-8, utf-16, utf-32, ascii, utf-8-sig, unicode_escape and rot_13, and also encoding_errors = 'replace', but nothing seems to work.
Below is a dummy data set showing how the '$' is read: the text between '$' characters gets rendered as bold-italic.
This is what the original data set looks like.
Code:
df = pd.read_csv("C:\\Users\\nitin2.bhatt\\Downloads\\CCL\\dummy.csv")
df.head()
Please help, as I have referred to multiple blogs but couldn't find a solution to this.

How to escape double quotes within a data when it is already enclosed by double quotes

I have comma-separated CSV data like the example below, which has to be imported into a Snowflake table using the COPY command.
"1","2","3","2"In stick"
Since I am already passing the parameter OPTIONALLY_ENCLOSED_BY = '"' to the COPY command, I couldn't escape the " (double quote) within the data ("2"In stick").
The imported data that I want to see in the table is like below:
1,2,3,2"In stick
Can someone please help here? Thanks!
If you are on Windows, I have a funny solution for that. Open the CSV file in MS Excel. Excel consumes the correct enclosing double quotes to show the data in cells and leaves the extra ones in the middle of a cell (if each cell is separated properly by commas). Then choose 'Replace' and replace the double quotes with something else (like two single quotes, or with nothing to remove them). Then save it again as a CSV. I assume other spreadsheet programs would do the same.
If you have an un-escaped quote inside a field which is surrounded by quotes, that isn't really valid CSV. For example, here is an excerpt from the RFC 4180 spec:
If double-quotes are used to enclose fields, then a double-quote
appearing inside a field must be escaped by preceding it with another double quote.
For example:
"aaa","b""bb","ccc"
I think that whatever is generating the CSV file is doing it incorrectly and needs to be fixed before you will be able to load it into Snowflake. I don't think any file_format option will be able to solve this for you since it's not valid CSV.
The CSV row should either look like this:
"1","2","3","2""In stick"
or this:
"1","2","3","2\"In stick"
I had this same problem, and while writing up the question, I found an answer:
Import RFC4180 files (CSV spec) into snowflake? (Unable to create file format that matches CSV RFC spec)
Essentially, set:
| Name | Value |
| --- | --- |
| Column Separator | Comma |
| Row Separator | New Line |
| Header lines to skip | {you have to decide what to put here} |
| Field optionally enclosed by | Double Quote |
| Escape Character | None |
| Escape Unenclosed Field | None |
Here is my ALTER statement:
ALTER FILE FORMAT "DB_NAME"."SCHEMA_NAME"."CSV_SPEC3" SET COMPRESSION = 'NONE' FIELD_DELIMITER = ',' RECORD_DELIMITER = '\n' SKIP_HEADER = 1 FIELD_OPTIONALLY_ENCLOSED_BY = '\042' TRIM_SPACE = FALSE ERROR_ON_COLUMN_COUNT_MISMATCH = TRUE ESCAPE = 'NONE' ESCAPE_UNENCLOSED_FIELD = 'NONE' DATE_FORMAT = 'AUTO' TIMESTAMP_FORMAT = 'AUTO' NULL_IF = ('\\N');
As I mention in the answer, I don't know why the above works, but it is working for me. Go figure.

Remove double quotes from Data

I am getting data in a CSV file with double quotes around string columns, but while reading the CSV file using U-SQL I am getting errors because of double quotes inside the data as well.
I am thinking of replacing the double quotes that are inside the data as a first step and then reading that file, but I'm not sure how to do that as we have double quotes everywhere.
Any suggestions would be appreciated, or if someone can help me with PowerShell or .NET code to do the same that would be a great help, as I am not good at .NET or PowerShell.
Sample Data
“Name”;”Department”
“Abc”;”Education”Teaching”
“Cde”;”Test”Another”
It should be
“Name”;”Department”
“Abc”;”EducationTeaching”
“Cde”;”TestAnother”
You can use a regex find/replace in Visual Studio Code. For example (assuming that the data only contains letters; you can edit the regex as needed):
Find regex: "([a-zA-Z]+)"([a-zA-Z]+)"
Replace string: "$1$2"
Input string: "Name";"Department" "Abc";"Education"Teaching" "Cde";"Test"Another"
Output string: "Name";"Department" "Abc";"EducationTeaching" "Cde";"TestAnother"
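If you would rather do the same replacement in code than in an editor, a minimal sketch in Java (same regex and letters-only assumption as above) could be:

```java
// Hedged sketch: collapse an embedded quote such as "Education"Teaching"
// into "EducationTeaching" using the same regex as the editor find/replace.
String input = "\"Abc\";\"Education\"Teaching\"";
String output = input.replaceAll("\"([a-zA-Z]+)\"([a-zA-Z]+)\"", "\"$1$2\"");
// output: "Abc";"EducationTeaching"
```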
So it seems that your quotes are not the standard [Char]34. Instead they are [Char]8220 and [Char]8221 (typographic curly quotes).
So we need to do a replace in PowerShell:
$TEST = @"
“Name”;”Department” “Abc”;”Education”Teaching” “Cde”;”Test”Another”
"@
$TEST | %{
    $_ = $_ -replace [char]8220, '"'   # left curly quote -> straight quote
    $_ = $_ -replace [char]8221, '"'   # right curly quote -> straight quote
    $_ -replace '"([a-zA-Z]+)"([a-zA-Z]+)"', '"$2 $1"'
}
This would make the output:
"Name";"Department" "Abc";"Teaching Education" "Cde";"Another Test"
You could also do this in a custom row processor. Have the initial read pull the CSV file into a variable as a single raw-data column per row, then pass each row through a row processor to parse the data and remove the offending characters. I've done something similar for handling fixed-width text files.

CSV to BQ: empty fields instead of null values

I have a pipeline that is loading a CSV file from GCS into BQ. The details are here: Import CSV file from GCS to BigQuery.
I'm splitting the CSV in a ParDo into a TableRow where some of the fields are empty.
String inputLine = c.element();
String[] split = inputLine.split(",");
TableRow output = new TableRow();
output.set("Event_Time", split[0]);
output.set("Name", split[1]);
...
c.output(output);
My question is, how can I have the empty fields show up as a null in BigQuery? Currently they are coming through as empty fields.
It's turning up in BigQuery as an empty String because split() returns an empty String (not null) in the array for consecutive delimiters such as ,,.
Two options:
Check for empty String in your result array and don't set the field in output.
Check for empty String in your result array and explicitly set null for the field in output.
Either way will result in null for BigQuery.
Note: be careful splitting Strings in Java like this. With the default limit, split() drops trailing empty strings. Use split(",", -1) instead. See here.
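A minimal sketch of the second option, reusing the field names from the question (this assumes a TableRow accepts null values, which BigQuery then stores as NULL):

```java
// Hedged sketch: keep trailing empty fields with split(",", -1), then
// write null instead of "" so the field lands in BigQuery as NULL.
String inputLine = c.element();
String[] split = inputLine.split(",", -1);

TableRow output = new TableRow();
output.set("Event_Time", split[0].isEmpty() ? null : split[0]);
output.set("Name", split[1].isEmpty() ? null : split[1]);
// ... same pattern for the remaining fields
c.output(output);
```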
BTW: unless you're doing some complex/advanced transformations in Dataflow, you don't have to use a pipeline to load in your CSV files. You could just load it or read it directly from GCS.