How do I use awk to split a file into multiline records? - awk

On OSX, I've converted a Powerpoint deck to ASCII text, and now want to process this with awk.
I want to split the file into multiline records corresponding to slides in the deck.
Treating any line beginning with a capital Latin letter as the start of a new record provides a good approximation, but I can't figure out how to do this in awk.
I've tried resetting the record separator (RS = "\n^[A-Z]", RS = "\n^[[:alnum:]][[:upper:]]", and various permutations), but none of them differentiate. That is, awk keeps treating each individual line as a record, rather than grouping them as I want.
The cleaned text looks like this:
Welcome
++ Class will focus on:
– Basics of SQL syntax
– SQL concepts analogous to Excel concepts
Who Am I
++ Self-taught on LAMP(ython) stack
++ Plus some DNS, bash scripting, XML / XSLT
++ Prior professional experience:
– Office of Management and Budget
– Investment banking (JP Morgan, UBS, boutique)
– MBA, University of Chicago
Roadmap
+ Preliminaries
+ What is SQL
+ Excel vs SQL
+ Moving data from Excel to SQL and back
+ Query syntax basics
- Running queries
- Filtering, grouping
- Functions
- Combining tables
+ Using queries for analysis
Some 'slides' have blank lines, some don't.
Once past these hurdles I plan to wrap each record in a tag for use in deck.js. But getting the record definitions right is killing me.
How do I do those things?
EDIT: The question initially asked also about converting Unicode bullet characters to ASCII, but I've figured that out. Some remarks in comments focus on that stuff.

In awk you could collect the records by hand: start a new record at each line beginning with an uppercase letter, and append everything else to the current one:
/^[[:upper:]]/ {            # a capital letter starts a new slide
    if (r>0) print rec      # emit the previous record, if any
    r=1; rec=$0 RS; next
}
{
    rec=rec $0 RS           # otherwise append to the current record
}
END {
    print rec               # emit the final record
}
To remove bullets you could use
gsub(/•/, "++", rec)
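For comparison, the same grouping is easy to express in Python (a sketch, not the asker's pipeline; the bullet conversion is folded in):

```python
import re

def split_slides(text):
    """Group lines into slides: a line starting with an
    uppercase Latin letter begins a new slide."""
    slides = []
    for line in text.splitlines():
        line = line.replace("\u2022", "++")  # convert bullet characters
        if re.match(r"[A-Z]", line) or not slides:
            slides.append([line])            # start a new slide
        else:
            slides[-1].append(line)          # continue the current slide
    return ["\n".join(s) for s in slides]
```

Each returned string is one slide, ready to be wrapped in whatever tag deck.js expects.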

You might try using the "textutil" utility built into OSX to convert the file within a script, to save you doing it all by hand. Try typing the following into a Terminal window, pressing the space bar to move to the next page:
man textutil
Once you have got some converted text, try posting that so people can see what the inputs look like, then maybe someone can help you split it up how you want.

Related

Pentaho - Spoon Decimal from Text File Input

I'm new to Pentaho and have a little problem with the Text file Input.
Currently I have to have several data records written to a database. In the files, the decimal numbers are separated by a point.
Pentaho is currently transforming the number 123.3659 € to 12.33 €.
Can someone help?
When you read the file, do you read it as a CSV, an Excel sheet, or something like that? If so, you can specify the format of the column so the number is interpreted correctly (I think; I'm talking from memory now). Or maybe playing with the locale of the file might work.
If it's a file containing a string, you can use a step like "Replace in string" to replace the point with a comma.
This problem might come from various reasons, but I think the following steps will solve the issue:
- First, add a "Replace in string" step;
- Then search for the dot and replace it with nothing, or with a comma if the number you show is a float.
Hope this helped!
Give feedback if so!
Have a good day!
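If it is easier to fix the file before Pentaho reads it, the same replacement can be done in a short script (a sketch; it assumes '.' appears only as the decimal separator):

```python
def fix_decimals(line):
    """Swap the decimal point for a comma so a comma-decimal
    locale parses the number correctly."""
    return line.replace(".", ",")
```

Run this over each line of the input file before handing it to the Text file input step.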

Python: Search Journal.txt for dates and write the corresponding text into a new file for Evernote import

I've been learning Python for a week now and am currently at Exercise 26 (learnpythonthehardway). So I know nothing. I tried searching but couldn't find what I need.
What I need:
I want to write a script that breaks my Journal.txt file into several text files to be able to import them into Evernote. Evernote pulls the title for a note from the first line of the .txt
This is a random example of the date format I used in my Journal.txt
1/13/2013, 09/02/2012, so I'm afraid the date is not consistent. I know about:
if 'blabla' in open('example.txt').read():
but don't know how to use it with a date. Please help me extract the journal entries corresponding to each date from the large file into new ones. This is literally all I got so far:
Journal = open("D:/Australien/Journal.txt", 'r').read()
Consider doing it like recommended here, replacing YOUR_TEXT_HERE with a search pattern for a date, e.g. [0-9]+\/[0-9]+\/[0-9]+:
awk "/[0-9]+\/[0-9]+\/[0-9]+/{n++}{print > (n \".txt\")}" a.txt
If you don't have awk installed on your PC, you can fetch it here.
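Since you are learning Python anyway, the same split can be written there (a sketch; the date pattern and file names are assumptions):

```python
import re

# matches dates like 1/13/2013 or 09/02/2012
DATE = re.compile(r"\d{1,2}/\d{1,2}/\d{4}")

def split_journal(text):
    """Return a list of entries; each entry starts at a line
    containing a date, so Evernote can use it as the title."""
    entries = []
    for line in text.splitlines():
        if DATE.search(line):
            entries.append([line])    # a date line starts a new entry
        elif entries:
            entries[-1].append(line)  # text before any date is dropped
    return ["\n".join(e) for e in entries]
```

To write one file per entry, something like this would follow:

```python
# for i, entry in enumerate(split_journal(open("Journal.txt").read()), 1):
#     open("%d.txt" % i, "w").write(entry)
```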

Parsing a slash and apostrophe with Python regex

I am attempting to parse a Wikipedia SQL dump with the Python regular expressions library. The ultimate goal is to import this dump into PostgreSQL, but I know the apostrophes in strings need to be doubled, beforehand.
Every apostrophe in a string in this dump is preceded by a backwards slash, though, and I'd rather not remove the backwards slashes.
(42,'Thirty_Years\'_War',33,5,0,0)
Using the command
re.match(".*?([\w]+?'[\w\s]+?).*?", line)
I cannot identify the apostrophe in the middle of 'Thirty_Years\'_War', when 'line' is parsed from a text file.
For comparison, these lines work fine when parsed (sans the last line).
The person's car
The person's car's gasoline
Hodges' Harbrace Handbook
'Hodges' Harbrace Handbook'
portspeople',1475,29,0,0),(42,'Thirty_Years\'_War',33,5,0,0)
Correct and expected output (sans the last line):
The person''s car
The person''s car''s gasoline
Hodges'' Harbrace Handbook
('Hodges'' Harbrace Handbook')
portspeople',1475,29,0,0),(42,'Thirty_Years\'_War',33,5,0,0)
Using the command
re.match(".*?([\w\\]+?'[\w\s]+?).*?", line)
breaks it.
The person''s car
The person''''s car''''s gasoline
Hodges'' Harbrace Handbook
(''''''''Hodges'''''''' Harbrace Handbook'''''''')
portspeople'''''''''''''''',1475,29,0,0),(42,''''''''''''''''Thirty_Years\''''''''''''''''_War'''''''''''''''',33,5,0,0)
Is it stuck in some sort of loop? What is the correct regex code to use?
I am not thinking about SQL injection attacks because this script is only going to be used for parsing dumps of Wikipedia articles (that don't contain examples of SQL injection attacks).
If the dump consists of things like the string you provided, you could try something like this:
re.findall(r"[^,()]+", line)
where the negated character class excludes all known separators.
EDIT: Only use regex for parsing when there is no better way :)
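If the goal is only the doubling itself, a single substitution may be enough (a sketch, assuming a backslash-escaped apostrophe should be left alone, per the expected output's last line):

```python
import re

def double_quotes(line):
    """Double every apostrophe that is not already escaped
    with a backslash, using a negative lookbehind."""
    return re.sub(r"(?<!\\)'", "''", line)
```

This avoids the runaway doubling from putting the backslash inside the quantified character class.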
Most Python database interfaces will take care of quoting SQL statements for you. For example, with the psycopg driver, you would write something like:
mystring="""This is 'a string' that contains single quotes."""
c.execute('INSERT INTO mytable (mycolumn) VALUES (%s)', (mystring,))
...and the database driver will take care of correctly quoting the values for you. Look at some of the examples in the documentation. In fact, their first example is remarkably like this one.

vb.net VB 2010 Underscore and small rectangles in string outputs?

I've made some good progress with my first attempt at a program, but have hit another road block. I'm taking standard output (as a string) from a console CMD window (results of dsquery piped to dsget) and have found small rectangles in the output. I tried using Regex to clean the little bastards, but it seems they are related to the _ (underscore), which I need to keep (to return 2000/NT logins). Odd thing is, when I copy the character and paste it into VS2K10 Express it acts like a carriage return?
Any ideas on finding out what these little SOB's are -- and how to remove them?
Going to try using /U or /A CMD switch next..
The square is often just used whenever a character is not displayable. The character could very well be a CR. You can use a regular expression to keep only normal characters, or remove the CR/LF characters using String.Replace.
You mentioned that you are using the String.Replace function, and I am wondering if you are replacing the wrong character or something like that. If all you're trying to do is remove a carriage return, I would skip the regular expressions and stick with String.Replace.
Something like this should work...
strInputString = strInputString.Replace(Chr(13), "")
If not could you post a line or two of code.
On a side note, this might give some other examples....
Character replacement in strings in VB.NET

Tool to format lines of text into array

I frequently come across this problem. I have a file:
something
something2
something3
which I want output as:
"something","something2","something3"
any quick tool for this, preferably online?
If it's just a one-off thing, it'd be pretty easy to do it with a search & replace in any advanced-ish text editor...
For example in notepad++:
Do a Replace (CTRL+H)
Set "Search Mode" to "Extended"
Find: \r\n
Replace with: ","
(of course you'll need an extra quote at the very start & very end of the file).
If you need to do it more than once, writing a small script/program that did a regular expression replace over the file would be fairly straight forward too.
Edit: If you really wanted to do it online, you could use an online regular expression tester (in this case you want to use \n as the regex and "," as your replace pattern, leaving the other settings alone).
A quick Python hack?
lines = open('input.txt').read().splitlines()
txt = ','.join('"%s"' % x for x in lines)
open('output.txt', 'w').write(txt)