Parse text file into SQL table

How would I transform the following block of text, which is an excerpt from an application log in .txt format:
ID: 1
Name: John
ID: 2
Name: Doe
into the following format:
ID Name
1 John
2 Doe

I'd write an AutoHotKey script to read the log file, parse it, and output the new format.
You can use Loop, Parse to read the contents based on the colon (:) delimiter. It's quite neat!
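If AutoHotKey isn't a hard requirement, here is a minimal sketch of the same idea in Python; the file name log.txt is made up, and it assumes every record is exactly an ID line followed by a Name line:

print("ID Name")
record = {}
with open("log.txt") as f:
    for line in f:
        key, _, value = line.partition(":")   # split on the first colon
        record[key.strip()] = value.strip()
        if key.strip() == "Name":             # a Name line closes a record
            print(record["ID"], record["Name"])
            record = {}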

Related

REGEXP_SUBSTR with a URL

I have a string from which I'm trying to extract a URL. When I run it on this RegEx site, it works fine.
The regex pattern is: http:\/\/GNTXN.US\/\S+
The message I'm extracting from is below, and lives in a column called body in my SQL database.
Test Message: We want to hear from you! Take our 2022 survey & tell us what matters most to you this year: http://GNTXN.US/qsx Text STOP 2 stop/HELP 4 help
But when I run the following in SQL:
SELECT
    body,
    REGEXP_SUBSTR(body, 'http:\/\/GNTXN.US\/\S+') new_body
FROM
    table.test
It returns no value. I have to imagine it's something to do with the backslashes in the URL, but I've tried everything.
The new_body output should read as http://GNTXN.US/qsx
In MySQL you just need to escape the backslashes: the string-literal parser consumes one level of escaping before the pattern ever reaches the regex engine, so \S has to be written as \\S:
select body, REGEXP_SUBSTR(body, 'http:\\/\\/GNTXN.US\\/\\S+') as new_body
from table.test;
new_body output:
http://GNTXN.US/qsx

Greenplum gpload issue while loading a text file using a YAML file

I am trying to load a text file that is delimited with a pipe (|) into a Greenplum table, but some special characters in a column, such as 'ÉCLAIR', are causing the load to fail. Is there any option in Greenplum gpload that will load the data into the table without issue?
I am using a YAML file like this:
GPLOAD:
  INPUT:
    - SOURCE:
        FILE: [ /testfile.dat ]
    - FORMAT: TEXT
    - DELIMITER: '|'
    - ENCODING: 'LATIN1'
    - NULL_AS: ''
    - ERROR_LIMIT: 10000
    - ERROR_TABLE:
Is there any other option in gpload that we can use to load the file?
I am creating the file to load from Teradata, and because the Teradata columns contain special characters, it causes issues in Greenplum as well.
You can try adding - ESCAPE: 'OFF' in the INPUT section.
You may also need to change the ENCODING to something that recognizes those special characters. LATIN9, maybe?
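For instance, the INPUT section from the question would then look something like this (ESCAPE: 'OFF' turns off escape processing; LATIN9 is only a guess at an encoding that covers those characters, so treat both as things to test):

GPLOAD:
  INPUT:
    - SOURCE:
        FILE: [ /testfile.dat ]
    - FORMAT: TEXT
    - DELIMITER: '|'
    - ESCAPE: 'OFF'
    - ENCODING: 'LATIN9'
    - NULL_AS: ''
    - ERROR_LIMIT: 10000
    - ERROR_TABLE: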
Jim McCann
Pivotal

Grep for multiple instances of a string between a substring and a character?

Can you please tell me how to grep for every instance of a substring that occurs multiple times on multiple lines within a file?
I've looked at
https://unix.stackexchange.com/questions/131399/extract-value-between-two-search-patterns-on-same-line
and How to use sed/grep to extract text between two words?
But my problem is slightly different: each substring will be immediately preceded by the string name"> and will be terminated by a < character immediately after the last character of the substring I want.
So one line might be
<"name">Bob<125><adje></name><"name">Dave<123><adfe></name><"name">Fred<125><adfe></name>
And I would like the output to be:
Bob
Dave
Fred
Although awk is not the best tool for XML processing, it will help if your XML structure and data are simple enough.
$ awk -F"[<>]" '{for(i=1;i<NF;i++) if($i=="\"name\"") print $(++i)}' file
Bob
Dave
Fred
I doubt that the tag is really <"name">, though. If it's <name>, without the quotes, change the condition in the script to $i=="name".
With gawk (regex record separators are a gawk extension):
awk -vRS='<"name">|<' '/^[A-Z]/' file
Bob
Dave
Fred
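If a scripting language is also an option, here is a minimal Python sketch of the same extraction, assuming the literal tag really is <"name"> as shown and the input file is named file:

import re

# Pull out every run of characters between <"name"> and the next <.
with open("file") as f:
    for name in re.findall(r'<"name">([^<]+)<', f.read()):
        print(name)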

Trouble with opening files in Python

I was trying to write a program where I can read a file that has first and last names in it and create a new file with the first letter of each first name followed by the last name.
The file that I created is a text file, and the lines look like this:
firstname1 lastname1 (for example, john smith)
firstname2 lastname2 (for example, jane jones)
firstname3 lastname3 (for example, jane doe)
etc...
I want to create a file that would look like this:
jsmith
jjones
jdoe
The issue I am getting is that when I open the file in Python, it gives me all of these weird unwanted characters before getting to the actual text of the file. The book I am using to learn from doesn't say anything about this, which is why I am posting here.
For example, when I open the file and run the following code:
newfile = open("example.file.rtf", "r")
for i in newfile:
    print(i)
I get this:
{\rtf1\ansi\ansicpg1252\cocoartf949\cocoasubrtf540
{\fonttbl\f0\fswiss\fcharset0 Helvetica;}
{\colortbl;\red255\green255\blue255;}
\margl1440\margr1440\vieww9000\viewh8400\viewkind0
\pard\tx560\tx1120\tx1680\tx2240\tx2800\tx3360\tx3920\tx4480\tx5040\tx5600\tx6160\tx6720\ql\qnatural\pardirnatural
\f0\fs24 \cf0 name 1\
name 2\
name 3 \
name 4 \
The actual text that I wrote in the text file was just this:
name 1
name 2
name 3
name 4
Why is this happening? Why won't it just show the plain text? And if I can't get it to do that, how can I work around this issue when I loop through the file?
You are saving the file in RTF ("Rich Text Format"), which is not plain text. Those "weird unwanted characters" are RTF markup being written by your editor. Use a plain text editor like Notepad to create your file, or explicitly save it as plain text.
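Once the input really is plain text, the underlying task is short. A minimal sketch, assuming the names live in names.txt with one "first last" pair per line (both file names here are made up):

# Read "first last" pairs and write first initial + last name, lowercased.
with open("names.txt") as infile, open("usernames.txt", "w") as outfile:
    for line in infile:
        parts = line.split()
        if len(parts) >= 2:                  # skip blank or malformed lines
            outfile.write(parts[0][0].lower() + parts[-1].lower() + "\n")

For "john smith" this writes jsmith, as in your example.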

Parse data from text file into comma delimited values

I have thousands of records like the one below in a line-spaced text file. I am trying to create a delimited file of some sort to import into SQL. Be it by script, function, or even Excel, I just can't get it.
RECORD #: #####
NAME: Tim
DOB: 01/01/2012
SEX: male
DATE: 07/19/2012
NOTES IN PARAGRAPH FORM
END OF RECORD
RECORD #: #####
NAME: Tim
DOB: 01/01/2012
SEX: male
DATE: 07/19/2012
NOTES IN PARAGRAPH FORM
END OF RECORD
Desired output:
RECORD #: #####,NAME: Tim,DOB: 01/01/2012,SEX: male,DATE: 07/19/2012,NOTES IN PARAGRAPH FORM
RECORD #: #####,NAME: Tim,DOB: 01/01/2012,SEX: male,DATE: 07/19/2012,NOTES IN PARAGRAPH FORM
A plan:
1. Use .ReadAll() to load your input file into memory (fallback: read line by line, with "END OF RECORD" triggering processing of a record).
2. Use Split(sAll, "END OF RECORD") to get an array of records (strings), then For Each sRecord:
3. Use Split(sRecord, EOL, 6) to get five 'one line' fields and one text/notes/memo field that may or may not contain EOLs.
4. Use one RegExp ("\w+\s*#?:\s*(.+)") (fallback: specialized RegExps) to cut the data out of the 'one line' fields; trim leading/trailing whitespace from the 6th.
5. Transform the fields as needed: string data should be quoted, EOLs and quotes in the 6th should (probably) be escaped, and using a standard date format (yyyy-mm-dd) may avoid problems later.
6. Use .WriteLine Join(aFields, sSep) to write each record to output.csv.
7. Describe the format of your output.csv in a schema.ini file (choose easy/safe column names!).
8. Use the import facility of your DBMS or ADO to import the .csv into the database.
Feel free to ask for details.
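In case a concrete starting point helps, here is the same plan sketched in Python instead of VBScript; the file names records.txt and output.csv are made up, and it cuts the values out of the labeled lines per step 4 (skip the regex if you want the labels kept verbatim, as in your sample output):

import csv
import re

FIELD_RE = re.compile(r'\w+\s*#?:\s*(.+)')    # cuts the value out of 'LABEL: value'

with open("records.txt") as f:
    raw = f.read()                            # step 1: read everything at once

with open("output.csv", "w", newline="") as out:
    writer = csv.writer(out)                  # the csv module handles quoting/escaping (step 5)
    for chunk in raw.split("END OF RECORD"):  # step 2: one chunk per record
        lines = [l.strip() for l in chunk.strip().splitlines() if l.strip()]
        if len(lines) < 6:
            continue                          # skip the empty trailing chunk
        fields = []
        for line in lines[:5]:                # steps 3-4: five labeled one-line fields
            m = FIELD_RE.search(line)
            fields.append(m.group(1).strip() if m else line)
        notes = " ".join(lines[5:])           # the free-form notes, joined into one field
        writer.writerow(fields + [notes])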