Parse data from text file into comma delimited values - sql

I have thousands of records like the one below in a plain text file. I am trying to create a delimited file of some sort to import into SQL. Be it by script, function, or even Excel, I just can't get it.
RECORD #: #####
NAME: Tim
DOB: 01/01/2012
SEX: male
DATE: 07/19/2012
NOTES IN PARAGRAPH FORM
END OF RECORD
RECORD #: #####
NAME: Tim
DOB: 01/01/2012
SEX: male
DATE: 07/19/2012
NOTES IN PARAGRAPH FORM
END OF RECORD
Desired output:
RECORD #: #####,NAME: Tim,DOB: 01/01/2012,SEX: male,DATE: 07/19/2012,NOTES IN PARAGRAPH FORM
RECORD #: #####,NAME: Tim,DOB: 01/01/2012,SEX: male,DATE: 07/19/2012,NOTES IN PARAGRAPH FORM

A plan:
Use .ReadAll() to load your input file into memory (fallback: line by line reading, "END OF RECORD" triggers processing of record)
Use Split(sAll, "END OF RECORD") to get an array of records (strings). For Each sRecord
Use Split(sRecord, EOL, 6) to get 5 'one line fields' and 1 text/notes/memo field that may contain EOLs or not
Use one RegExp ("\w+\s*#?:\s*(.+)") (fallback: specialized RegExps) to cut the data from the 'one line fields', trim leading/trailing whitespace from the 6th
Transform fields as needed: string data should be quoted, EOLs and quotes in the 6th should (probably) be escaped, and using a standard date format (yyyy-mm-dd) may avoid problems later
.WriteLine Join(aFields, sSep) to output.csv
Describe the format of your output.csv in a schema.ini file (choose easy/safe column names!)
Use the import facility of your DBMS or ADO to import the .csv into the database
Feel free to ask for details.
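The plan above is phrased in VBScript terms (.ReadAll, Split, RegExp, .WriteLine). Purely as an illustration of the same steps, here is a rough sketch in Python; records.txt, output.csv and the column names are assumed, everything else follows the sample records above.
import csv
import re

# Assumes the input looks like the sample above: five labelled lines, free-form
# notes, then a line "END OF RECORD".
FIELD_RE = re.compile(r"\w+\s*#?:\s*(.+)")   # grabs the value after "LABEL:" / "LABEL #:"

with open("records.txt", "r", encoding="utf-8") as f:
    all_text = f.read()                      # step 1: load the whole file

with open("output.csv", "w", newline="", encoding="utf-8") as out:
    writer = csv.writer(out)                 # csv.writer handles quoting/escaping
    writer.writerow(["record_nbr", "name", "dob", "sex", "rec_date", "notes"])
    for record in all_text.split("END OF RECORD"):       # step 2: one chunk per record
        record = record.strip()
        if not record:
            continue
        parts = record.split("\n", 5)        # step 3: 5 one-line fields + notes remainder
        fields = []
        for line in parts[:5]:               # step 4: cut the value out of each labelled line
            m = FIELD_RE.match(line.strip())
            fields.append(m.group(1).strip() if m else line.strip())
        notes = parts[5].strip() if len(parts) > 5 else ""
        writer.writerow(fields + [notes])    # step 5: write one CSV row per record
Reformatting the dates to yyyy-mm-dd and the schema.ini/import steps are left out here; they depend on the target DBMS.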

Related

" replaced by ""

The Redshift UNLOAD command is replacing " with "".
Example:
UNLOAD($$ select '"Jane"' as name $$)
TO 's3://s3-bucket/test_'
iam_role 'arn:aws:iam::xxxxxx:role/xxxxxx'
HEADER
CSV
DELIMITER ','
ALLOWOVERWRITE
The output looks like: ""Jane""
If I run the same command with select 'Jane' as name, the output shows no quotes at all, just Jane. But I need the output to be "Jane".
You are asking for the unloaded file to be in CSV format, and the CSV format says that if you want a double quote in your data you need to escape it with another double quote. See https://datatracker.ietf.org/doc/html/rfc4180
So Redshift is doing exactly as you requested. Now if you just want a comma-delimited file then you don't want to use "CSV", as this adds all the characters necessary to make the file fully compliant with the CSV specification.
This choice comes down to what tool or tools are reading the file and whether they expect an RFC-compliant CSV or just a simple file where fields are separated by commas.
This is a gripe of mine - tools that say they read CSV but don't follow the spec. If you say CSV then follow the format. Or call what you read something different, like CDV - comma delimited values.
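To see the escaping rule in isolation, a small sketch with Python's csv module (nothing Redshift-specific, purely for illustration) produces the same doubling:
import csv
import io

buf = io.StringIO()
writer = csv.writer(buf)           # default dialect follows RFC 4180-style quoting
writer.writerow(['"Jane"', 123])   # field data that itself contains double quotes
print(buf.getvalue())              # -> """Jane""",123
A reader that follows the spec (csv.reader, Excel, a CSV-mode COPY) turns """Jane""" back into "Jane"; a naive split on commas does not, which is exactly the choice described above.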

Pandas to_csv adds new rows when data has special characters

My data has multiple columns including a text column
id text date
8950026 Make your EMI payments only through ABC 01-04-2021 07:43:54
8950969 Pay from your Bank Account linked on XXL \r 01-04-2021 02:16:48
8953627 Do not share it with anyone. -\r 01-04-2021 08:04:57
I used pandas to_csv to export my data. That works well for the first row, but for the next 2 rows it creates a new line, moves the date to the next line, and adds to the total rows. Basically my output CSV will have 5 rows instead of 3.
df_in.to_csv("data.csv", index = False)
What is the best way to handle the special character "\r" here? I tried converting the text variable to string in pandas (its dtype is object now), but that doesn't help. I can try to remove all \r at the end of the text in my dataframe before exporting, but is there a way to modify to_csv to export this in the right format?
EDIT:
The question below is similar, and I can solve the problem by replacing all instances of \r in my dataframe, but how can this be solved without replacing? Does to_csv have options to handle these?
Pandas to_csv with escape characters and other junk causing return to next line
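A minimal sketch of the replacement route mentioned in the edit, assuming the column is called text as in the sample (the dataframe built here is just a stand-in for df_in):
import pandas as pd

df_in = pd.DataFrame({
    "id": [8950026, 8950969, 8953627],
    "text": ["Make your EMI payments only through ABC",
             "Pay from your Bank Account linked on XXL \r",
             "Do not share it with anyone. -\r"],
    "date": ["01-04-2021 07:43:54", "01-04-2021 02:16:48", "01-04-2021 08:04:57"],
})
# Strip carriage returns / newlines from the text column before exporting, so no
# field contains a line break that a downstream reader could mistake for a row break.
df_in["text"] = df_in["text"].str.replace(r"[\r\n]+", " ", regex=True).str.strip()
df_in.to_csv("data.csv", index=False)
If whatever reads data.csv honours RFC 4180 quoting, passing quoting=csv.QUOTE_ALL (from the csv module) to to_csv is another option, since quoted fields may legally contain line breaks; stripping the characters is the safer bet when the consumer simply splits on newlines.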

skipLeadingRows=1 in external table definition

In the below example, how can I set the skip leading row option?
bq --location=US query --external_table_definition=sales::Region:STRING,Quarter:STRING,Total_sales:INTEGER#CSV=gs://mybucket/sales.csv 'SELECT Region,Total_sales FROM sales;'
Regards,
Sreekanth
Flag options can be found under the installation home folder (the flag you are looking for is --skip_leading_rows, listed below).
/google-cloud-sdk/platform/bq/bq.py:
--[no]allow_jagged_rows: Whether to allow missing trailing optional columns in
CSV import data.
--[no]allow_quoted_newlines: Whether to allow quoted newlines in CSV import
data.
-E,--encoding: The character encoding used by the input
file. Options include:
ISO-8859-1 (also known as Latin-1)
UTF-8
-F,--field_delimiter: The character that indicates the boundary between
columns in the input file. "\t" and "tab" are accepted names for tab.
--[no]ignore_unknown_values: Whether to allow and ignore extra, unrecognized
values in CSV or JSON import data.
--max_bad_records: Maximum number of bad records allowed before the entire job
fails.
(default: '0')
(an integer)
--quote: Quote character to use to enclose records. Default is ". To indicate
no quote character at all, use an empty string.
--[no]replace: If true erase existing contents before loading new data.
(default: 'false')
--schema: Either a filename or a comma-separated list of fields in the form
name[:type].
--skip_leading_rows: The number of rows at the beginning of the source file to
skip.
(an integer)
--source_format: Format of
source data. Options include:
CSV
NEWLINE_DELIMITED_JSON
DATASTORE_BACKUP
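If the command-line flags are awkward to combine with --external_table_definition, the same skipLeadingRows option can also be set on the external table definition programmatically. A hedged sketch with the google-cloud-bigquery Python client, reusing the bucket, schema and query from the question (everything else here is an assumption):
from google.cloud import bigquery

client = bigquery.Client()
# Describe the external CSV source, including the rows to skip at the top.
external_config = bigquery.ExternalConfig("CSV")
external_config.source_uris = ["gs://mybucket/sales.csv"]
external_config.schema = [
    bigquery.SchemaField("Region", "STRING"),
    bigquery.SchemaField("Quarter", "STRING"),
    bigquery.SchemaField("Total_sales", "INTEGER"),
]
external_config.options.skip_leading_rows = 1   # the skipLeadingRows=1 part
# Register the definition under the table name used in the query, then run it.
job_config = bigquery.QueryJobConfig(table_definitions={"sales": external_config})
query_job = client.query("SELECT Region, Total_sales FROM sales;", job_config=job_config)
for row in query_job.result():
    print(row.Region, row.Total_sales)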

Line contains invalid enclosed character data or delimiter at position

I was trying to load data from a CSV file into Oracle SQL Developer, and when inserting the data I encountered this error:
Line contains invalid enclosed character data or delimiter at position
I am not sure how to tackle this problem!
For Example:
INSERT INTO PROJECT_LIST (Project_Number, Name, Manager, Projects_M,
Project_Type, In_progress, at_deck, Start_Date, release_date, For_work, nbr,
List, Expenses) VALUES ('5770','"Program Cardinal
(Agile)','','','','','',to_date('', 'YYYY-MM-DD'),'','','','','');
The errors shown were:
--Insert failed for row 4
--Line contains invalid enclosed character data or delimiter at position 79.
--Row 4
I've had success when I converted the CSV file to Excel via "Save As", changing the format to .xlsx. I then load the .xlsx version in SQL Developer. I think the conversion forces some of the bad formatting out. It worked, at least, on my last 2 files.
I fixed it by using the concatenate function in my CSV file first and then uploading it to SQL, which worked.
My guess is that it doesn't like to_date('', 'YYYY-MM-DD'). It's missing a date to format. Is that an actual input of your data?
But it could also possibly be the double quote in "Program Cardinal (Agile). Though I don't see why that would get picked up as an invalid character.
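One way to track down rows like that before importing (an assumption on my part, not part of the answers above; the file name is hypothetical) is to flag any CSV line containing an unbalanced double quote, like the one in "Program Cardinal (Agile):
# Flag lines whose raw text contains an odd number of double quotes; a stray
# enclosure character like this is a common cause of the
# "invalid enclosed character data or delimiter" error.
with open("project_list.csv", encoding="utf-8") as f:   # hypothetical file name
    for lineno, line in enumerate(f, start=1):
        if line.count('"') % 2 != 0:
            print(f"line {lineno}: unbalanced double quote -> {line.rstrip()}")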

How to select specific data from text file in VB?

I am making an Arduino weather station and I am outputting the data to a simple text file, but I want to make something like a year-long log of the highest and lowest temps. So my question is: how can I select only the data between some symbols and then use it in VISUAL BASIC? For example, my text file contains this string: "[29.11.2015 AT: 19:19:43] MR t. C:| 22.18 |Out t. C:| 7.36 |Aqu. H20 t. C:| 23.12 |Light(MR):| 1.63 | Door in MR:CLOSED!" As you can see, all the data is surrounded by these "|" characters. Can I make VB get only this data and then compare it to the previous one?
Take a look at TextFieldParser. Your delimiter will be the "|".