Spark write CSV not writing unicode character - dataframe

I have a string containing a unicode character (Ctrl-B) as the last character in one column of the dataframe.
After writing it to CSV using Spark, the trailing unicode character (Ctrl-B) is missing from the string.
df.show()
+---+--------+
|  a|       b|
+---+--------+
| 25|0^B^B0^B|
+---+--------+
df.write.format("com.databricks.spark.csv").save("/home/test_csv_data")
vim /home/test_csv_data/part*
25,0^B^B0
It doesn't have the last ctrl-B character.
But if I write it in ORC or parquet format using spark then the last ctrl-B is present.
Please guide me on why this is happening. How can I keep the trailing Ctrl-B in the CSV?

'^B' is treated as whitespace, and the CSV writer's ignoreTrailingWhiteSpace option defaults to true, which strips it. Set it to false:
df.write.option("ignoreTrailingWhiteSpace","false").format("com.databricks.spark.csv").save("/home/test_csv_data")
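For completeness, a minimal PySpark sketch of the same fix (the path and sample value are illustrative, and the built-in CSV writer is used, which on Spark 2+ is what com.databricks.spark.csv resolves to):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# sample row whose second column ends with the Ctrl-B (\x02) control character
df = spark.createDataFrame([("25", "0\x02\x020\x02")], ["a", "b"])

# keep trailing (and leading) "whitespace", which includes \x02, when writing CSV
(df.write
   .option("ignoreTrailingWhiteSpace", "false")
   .option("ignoreLeadingWhiteSpace", "false")
   .mode("overwrite")
   .csv("/tmp/test_csv_data"))

# read the files back (the CSV reader does not trim by default)
# to confirm the last character survived
spark.read.csv("/tmp/test_csv_data").show(truncate=False)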

Related

Pandas to_csv adds new rows when data has special characters

My data has multiple columns including a text column
id      | text                                        | date
8950026 | Make your EMI payments only through ABC     | 01-04-2021 07:43:54
8950969 | Pay from your Bank Account linked on XXL \r | 01-04-2021 02:16:48
8953627 | Do not share it with anyone. -\r            | 01-04-2021 08:04:57
I used pandas to_csv to export my data. That works well for the first row, but for the next 2 rows it inserts a line break, pushing the date onto the next line and adding to the total row count. Basically my output CSV will have 5 rows instead of 3.
df_in.to_csv("data.csv", index = False)
What is the best way to handle the special character "\r" here? I tried converting the text variable to string in pandas (its dtype is object now), but that doesn't help. I could remove every trailing \r from the text column before exporting, but is there a way to make to_csv export this in the right format?
**** EDIT ****
The question below is similar, and I can solve the problem by replacing all instances of \r in my dataframe, but how can this be solved without replacing? Does to_csv have options to handle these?
Pandas to_csv with escape characters and other junk causing return to next line
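One option that avoids touching the data, sketched below under the assumption that whatever reads the file handles RFC 4180 quoting: force to_csv to quote every field, so the embedded \r stays inside its (quoted) cell instead of being taken as a row break. The quoting argument is passed straight through to the csv module.

import csv
import pandas as pd

# illustrative copy of the data from the question
df_in = pd.DataFrame({
    "id": [8950026, 8950969, 8953627],
    "text": ["Make your EMI payments only through ABC",
             "Pay from your Bank Account linked on XXL \r",
             "Do not share it with anyone. -\r"],
    "date": ["01-04-2021 07:43:54", "01-04-2021 02:16:48", "01-04-2021 08:04:57"],
})

# quote every field so the carriage return stays inside its cell
df_in.to_csv("data.csv", index=False, quoting=csv.QUOTE_ALL)

# alternatively, strip the stray carriage returns before exporting:
# df_in["text"] = df_in["text"].str.replace("\r", "", regex=False)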

Copy csv & json data from S3 to Redshift

I have data in the format below in an S3 bucket.
"2010-9","aws cloud","{"id":1,"name":"test"}"
"2010-9","aws cloud1","{"id":2,"name":"test2"}"
I want to copy the data into the database like below.
Table
year   | env        | desc
2010-9 | aws cloud  | {"id":1,"name":"test"}
2010-9 | aws cloud1 | {"id":2,"name":"test2"}
I have written this command, but it is not working. Could you please help me?
copy table
from 's3://bucketname/manifest' credentials 'aws_access_key_id=xx;aws_secret_access_key=xxx'
delimiter ','
IGNOREHEADER 1
REMOVEQUOTES
IGNOREBLANKLINES
manifest;
You are almost there - you just need to escape the double quotes inside the 3rd field (desc). Per the CSV specification:
If double-quotes are used to enclose fields, then a double-quote
appearing inside a field must be escaped by preceding it with
another double quote. For example: "aaa","b""bb","ccc"
This is per rfc-4180 - https://www.ietf.org/rfc/rfc4180.txt
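So, assuming the files can be regenerated or pre-processed, the rows would need the inner quotes doubled, e.g.:
"2010-9","aws cloud","{""id"":1,""name"":""test""}"
"2010-9","aws cloud1","{""id"":2,""name"":""test2""}"
With the data in that shape, COPY's CSV format parameter (in place of DELIMITER ',' with REMOVEQUOTES, which cannot be combined with CSV) should preserve the embedded commas and quotes in the desc column.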
I've also loaded json into a text field in Redshift and then used the json functions to parse the field. Works great.

Pentaho - CSV Input - Incoming field trim type - unexpected behaviour

When using the CSV Input step in Pentaho (v7), we set the incoming field trim type to "both" to achieve the below, but it doesn't work as expected.
Here are the test data and expected output vs actual output
|Incoming Data |Expected Output |Actual Output |
|<space>abc<space> |abc |abc |
|abc<space> |abc |abc |
|<space>abc |abc |abc |
|"<space>abc<space>" |<space>abc<space> |abc |
|"<space>abc<space>"<space> |<space>abc<space> |abc |
|<space>"<space>abc<space>" |<space>abc<space> |"<space>abc |
|<space>"<space>abc<space>"<space> |<space>abc<space> |"<space>abc |
|"abc"<space> |abc |abc |
|<space>"abc" |abc |"abc |
|<space>"abc"<space> |abc |"abc |
Can someone please guide me on this?
If there's no technical reason for using CSV-Input, use Text-File-Input instead. TFI handles CSV input much better. If possible, you should talk to the CSV producer about data quality, though.
UPDATE: TFI 6.1.0.1-196 preview output
Not so bad, once we accept that trimming in Kettle is always applied to the field value, i.e. you can't protect leading or trailing spaces from trimming as expected in test cases 4 and 5.
It looks like the CSV input doesn't deal correctly with badly formed CSV data (surprise!). Having extra spaces between the delimiter and enclosure characters apparently doesn't sit well with the step. The trim function looks inside the enclosure to trim spaces, not outside.
I've tested the Text File Input step, which should be the default choice for CSV files as marabu says. Unfortunately, it gives the same undesired results as in the question.
The solution is to remove the double quotes from the enclosure definition box in the CSV Input step. The step will then correctly trim spaces outside of the strings, quoted or not. You then put the data through a "Replace in String" step to replace the " with nothing.

parse text file and remove white space

I have a file written from a COBOL program that produces a pipe-delimited file. The file contains whitespace and "null" strings that I need to get rid of before rewriting the file. What is the best way to do that? I have SQL Server and Visual Studio available to write the script, but I'm not sure which one to use or exactly how. The script will need to read through many different files in a folder. The data is being converted from an old system into a new one. I also need to keep spaces between words, e.g. in a business name or an address. I was going to use SQL, but can only find examples that read fields in a database.
Example file (one line):
0000000009|LName |FName | | | | | | |1|1|0|000|000|000000000|
1||null null null| | null null|null null null null| |1|0|
Desired output:
0000000009|LName|Fname|||||||1|1|0|000|000|000000000|1||||||1|0|
Thanks!!
You said you can use Visual Studio, so this example uses C#.
I suppose you will load your file content into a string, then you can apply some replaces:
// String.Replace returns a new string, so remember to assign the result back
s = s.Replace("null", string.Empty).Replace(" |", "|").Replace("| ", "|").Replace("| |", "||");
I know there are probably much more elegant solutions: this is quick and dirty, but it will output the string you need.
Hope this helps.
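If a standalone script is also an option, here is a rough Python sketch of the same cleanup, not from the original answer (the exports/*.txt glob is a placeholder): it splits each record on the pipe, drops the "null" tokens and surrounding whitespace inside every field, and keeps single spaces between real words.

import glob

def clean_line(line: str) -> str:
    fields = line.rstrip("\n").split("|")
    cleaned = []
    for field in fields:
        # drop "null" tokens and collapse runs of whitespace,
        # but keep the single spaces between real words
        words = [w for w in field.split() if w != "null"]
        cleaned.append(" ".join(words))
    return "|".join(cleaned)

# hypothetical folder of pipe-delimited export files
for path in glob.glob("exports/*.txt"):
    with open(path) as src:
        lines = [clean_line(line) for line in src]
    with open(path, "w") as dst:  # rewrite in place (or write to a new folder)
        dst.write("\n".join(lines) + "\n")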

Data between quotes and field separator

In the example given below, the last line is not uploaded. I get an error:
Data between close double quote (") and field separator:
This looks like a bug, since all the data between the pipe symbols should be treated as a single field.
Schema: one:string,two:string,three:string,four:string
Upload file:
This | is | test only | to check quotes
second | line | "with quotes" | no text
third line | with | "start quote" and | a word after quotes
The first and second lines above are processed, but not the third.
Update:
Can someone please explain why the following works, except for the third line?
This | is | test only | to check quotes
second | line | "with quotes" | no text
third line | with | "start quote" and | a word after quotes
forth line | enclosed | {"GPRS","MCC_DETECTED":false,"MNC_DETECTED":false} | how does this work?
fifth line | with | {"start quote"} and | a word after quotes
There may be some fancy explanation for this, but from the end-user perspective it is absurd.
From the CSV RFC4180 page: "If double-quotes are used to enclose fields, then a double-quote appearing inside a field must be escaped by preceding it with another double quote."
You probably want to do this:
This | is | test only | to check quotes
second | line | "with quotes" | no text
third line | with | " ""start quote"" and " | a word after quotes
More about our CSV input format here.
Using --quote worked perfectly.
bq load --source_format CSV --quote "" --field_delimiter \t --max_bad_records 10 -E UTF-8 <destination table> <source files>
API V2: https://cloud.google.com/bigquery/docs/reference/v2/jobs#configuration.load.quote
bq command: --quote: Quote character to use to enclose records. Default is ". To indicate no quote character at all, use an empty string.
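Putting that together for the pipe-delimited sample in the question, the load might look like this (dataset, table, and bucket names are placeholders); with no quote character, the double quotes in the third column are loaded as literal text:
bq load --source_format CSV --quote "" --field_delimiter '|' -E UTF-8 mydataset.mytable gs://mybucket/pipe_data.csv one:string,two:string,three:string,four:string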
Try this as an alternative:
Load the MySQL backup files into a Cloud SQL instance.
Read the data in BigQuery straight out of MySQL.
Longer how-to:
https://medium.com/google-cloud/loading-mysql-backup-files-into-bigquery-straight-from-cloud-sql-d40a98281229
You can also use the other flags while uploading the data. I used the bq tool with the following flags:
bq load -F , --source_format CSV --skip_leading_rows 1 --max_bad_records 1 --format csv -E UTF-8 yourdataset gs://datalocation.
Try loading every time with the bq shell.
I had to load 1100 columns. While trying with the console with all the error options, it threw a lot of errors, and ignoring the errors in the console means losing records.
Hence I tried with the shell and succeeded in loading all the records.
Try the following:
bq load --source_format CSV --quote "" --field_delimiter \t --allow_jagged_rows --ignore_unknown_values --allow_quoted_newlines --max_bad_records 10 -E UTF-8 {dataset_name}.{table_name} gs://{google_cloud_storage_location}/* {col_1}:{data_type1},{col_2}:{data_type2}, ....
References:
https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-csv#bigquery_load_table_gcs_csv-cli
https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-csv#csv-options