Python: Search Journal.txt for dates and write the corresponding text into a new file for Evernote import

I've been learning Python for a week now and am currently at Exercise 26 (learnpythonthehardway). So I know nothing. I tried searching but couldn't find what I need.
What I need:
I want to write a script that breaks my Journal.txt file into several text files so I can import them into Evernote. Evernote pulls the title for a note from the first line of the .txt file.
These are random examples of the date formats I used in my Journal.txt:
1/13/2013, 09/02/2012, so I'm afraid the date format is not consistent. I know about:
if 'blabla' in open('example.txt').read():
but I don't know how to use it with a date. Please help me extract the journal entries corresponding to each date from the large file into new files. This is literally all I've got so far:
Journal = open("D:/Australien/Journal.txt", 'r').read()

Consider doing it as recommended here, replacing YOUR_TEXT_HERE with a search pattern for a date, e.g. [0-9]+\/[0-9]+\/[0-9]+.
awk "/[0-9]+\/[0-9]+\/[0-9]+/{n++}{print > n \".txt\" }" a.txt
If you don't have awk installed on your PC, you can fetch it here.
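Since you are learning Python anyway, here is a rough Python sketch of the same idea, not a drop-in solution: it assumes every journal entry starts with a line that begins with a date such as 1/13/2013 or 09/02/2012, and it writes each entry to its own numbered .txt file whose first line is that date line, so Evernote can use it as the note title. The path is the one from your snippet; the output file names are made up.

import re

# A line that starts with something like 1/13/2013 or 09/02/2012 begins a new entry
date_pattern = re.compile(r'^\d{1,2}/\d{1,2}/\d{2,4}')

entries = []      # list of lists of lines, one list per journal entry
current = None

with open("D:/Australien/Journal.txt", "r") as journal:
    for line in journal:
        if date_pattern.match(line):
            current = [line]          # start a new entry with the date line first
            entries.append(current)
        elif current is not None:
            current.append(line)      # everything until the next date belongs here

for number, lines in enumerate(entries, start=1):
    # The first line is the date, which Evernote will pick up as the note title
    with open("entry_%03d.txt" % number, "w") as out:
        out.writelines(lines)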

Related

Cannot move a text object (variable) outside a function

I am trying to first convert pdf credit card statements to text then use regex to extract dates, amounts, and vendor from the individual lines. I can extract all the lines of text as they appear on the statement but when I call the variable with the text file, it only returns the last line.
I set the directory and read in the pdf credit card statement as "dfpdf".
I run this code:
with plumb.open(dfpdf) as pdf:
    pages = pdf.pages
    for page in pdf.pages:
        text = page.extract_text()
        global line
        for line in text.split('\n'):
            print(line)
this returns all the lines in the statement, which is what I want. But if I later call or try to print "line", all I get is the last line of the statement. In addition to what is probably a really simple answer, I would also love a suggestion for a really good tutorial or class on using Python to convert PDFs and then using regex to create pandas data frames. Thanks to all of you out there who know what you're doing and take the time to help amateurs like me. Mark
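What is happening is that line is reassigned on every pass through the loop, so once the loops finish it only holds the very last line that was printed. A rough sketch of one way around it, assuming plumb is pdfplumber and dfpdf is the path you set earlier (both names taken from the question, the rest is made up), is to collect the lines into a list as you go:

import pdfplumber as plumb        # assumed to be the import behind "plumb"

dfpdf = "statement.pdf"           # placeholder for the path set in the question
all_lines = []                    # every extracted line ends up here

with plumb.open(dfpdf) as pdf:
    for page in pdf.pages:
        text = page.extract_text()
        if not text:              # extract_text() can return None for empty pages
            continue
        for line in text.split('\n'):
            print(line)
            all_lines.append(line)

# all_lines now holds the whole statement, one element per line,
# ready for the regex step that pulls out dates, amounts and vendors.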

Pentaho - Spoon Decimal from Text File Input

I'm new to Pentaho and have a little problem with the Text file Input.
Currently I have several data records that need to be written to a database. In the files, the decimal numbers are separated by a point.
Pentaho is currently transforming the number 123.3659 € to 12.33 €.
Can someone help?
When you read the file, do you read it as a csv, excel or something like that? If that's the case, then you can specify the format of the column to interpret the number correctly (I think, I'm talking from memory now). Or maybe playing with the language of the file might work.
If it's a file containing a string, you can use some step like the string operator to replace the point with a comma.
This problem might arise for various reasons, but I think that by following the next steps you can solve the issue:
- First, you must get a "Replace in String" step;
- Then search for the dot and replace it with nothing, as I show in the following image, or with a comma if the number you show is a float;
[Example snip]
Hope this helped!
Give feedback if so!
Have a good day!

Extracting a specific value from a text file

I am running a script that outputs data.
I am specifically trying to extract one number. However, each time I run the script and get the output file, the position of the number I am interested in will be in a different position (due to the log nature of the output file).
I have tried several awk, sed, grep commands but I can't get any to work as many of them rely on the position of the word or number remaining constant.
This is what I am dealing with. The value I require is the bold one:
Energy initial, next-to-last, final =
-5.96306582435 -5.96306582435 -5.96349956298
You can try
awk '{print $(i++%3+6)}' infile
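If that awk invocation doesn't match how your lines wrap, a position-independent alternative is to anchor on the label itself. Here is a rough Python sketch (output.log is a placeholder file name, and it assumes the value you want is the third, "final" number) that works whether the numbers are on the same line as the label or on the line below it:

import re

with open("output.log") as f:       # "output.log" is a placeholder name
    text = f.read()

# Look for the label and grab the three numbers that follow it,
# wherever the label happens to land in the file.
match = re.search(
    r"Energy initial, next-to-last, final =\s*"
    r"(-?\d+\.\d+)\s+(-?\d+\.\d+)\s+(-?\d+\.\d+)",
    text,
)
if match:
    final_energy = float(match.group(3))   # the last of the three values
    print(final_energy)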

Apache Pig filtering out carriage returns

I'm fairly new to Apache Pig and am trying to work with some fixed-width text. In Pig, I'm reading every line in as a chararray (I know I can use FixedWidthLoader, but am not in this instance). One of the fields I'm working with is an email field, and one entry has a carriage return that generates extra lines of output in the finished data dump (I get 12 rows instead of the 9 I'm expecting). I know which entry has the error but I'm unable to filter it out using Pig.
Thus far I've tried to use Pig's REPLACE to replace \r or \uFFFD, and even tried a Python UDF which works on the command line but not when I run it as a UDF through Pig. Anyone have any suggestions? Please let me know if more details are required.
My original edit with a solution turned out to only work part of the time. This time I had to clean the data before I ran it through Pig: on the raw data file I ran perl -i -pe 's/\r//g' filename to remove the rogue carriage returns.
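The same pre-cleaning step can also be done without Perl; here is a rough Python sketch (the input and output file names are placeholders) that strips carriage returns from the raw file before it ever reaches Pig:

# Strip carriage returns from the raw input before loading it into Pig.
# "raw_input.txt" and "cleaned_input.txt" are placeholder file names.
with open("raw_input.txt", "rb") as src, open("cleaned_input.txt", "wb") as dst:
    for chunk in src:
        dst.write(chunk.replace(b"\r", b""))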

How do I use awk split file to multiline records?

On OSX, I've converted a Powerpoint deck to ASCII text, and now want to process this with awk.
I want to split the file into multiline records corresponding to slides in the deck.
Treating any line beginning with a capital Latin letter as the start of a record provides a good approximation, but I can't figure out how to do this in awk.
I've tried resetting the record separator, RS = "\n^[A-Z]" and RS = "\n^[[:alnum:]][[:upper:]]", and various permutations, but none of them make a difference. That is, awk keeps treating each individual line as a record, rather than grouping lines as I want.
The cleaned text looks like this:
Welcome
++ Class will focus on:
– Basics of SQL syntax
– SQL concepts analogous to Excel concepts
Who Am I
++ Self-taught on LAMP(ython) stack
++ Plus some DNS, bash scripting, XML / XSLT
++ Prior professional experience:
– Office of Management and Budget
– Investment banking (JP Morgan, UBS, boutique)
– MBA, University of Chicago
Roadmap
+ Preliminaries
+ What is SQL
+ Excel vs SQL
+ Moving data from Excel to SQL and back
+ Query syntax basics
- Running queries
- Filtering, grouping
- Functions
- Combining tables
+ Using queries for analysis
Some 'slides' have blank lines, some don't.
Once past these hurdles I plan to wrap each record in a tag for use in deck.js. But getting the record definitions right is killing me.
How do I do those things?
EDIT: The question initially asked also about converting Unicode bullet characters to ASCII, but I've figured that out. Some remarks in comments focus on that stuff.
In awk you could try to collect records using:
/^[[:upper:]]/ {                # a line starting with a capital letter begins a new record
    if (r > 0) print rec        # print the record collected so far, if any
    r = 1; rec = $0 RS; next    # start a new record with this line
}
{
    rec = rec $0 RS             # append every other line to the current record
}
END {
    print rec                   # print the final record
}
To remove bullets you could use
gsub(/•/, "++", rec)
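If awk keeps fighting you, the same grouping logic can also be expressed in Python; this is a rough sketch (slides.txt is a placeholder file name) that starts a new record whenever a line begins with a capital Latin letter:

import re

records = []      # each record is a list of lines belonging to one slide
current = None

with open("slides.txt") as f:              # "slides.txt" is a placeholder name
    for line in f:
        if re.match(r"[A-Z]", line):       # a capital Latin letter starts a new slide
            current = [line]
            records.append(current)
        elif current is not None:
            current.append(line)

for rec in records:
    print("".join(rec))                    # or wrap each record in a tag for deck.js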
You might try using the "textutil" utility built into OSX to convert the file within a script to save you doing it all by hand. Try typing the following into a Terminal window, pressing the space bar to move to the next page:
man textutil
Once you have got some converted text, try posting that so people can see what the inputs look like, then maybe someone can help you split it up how you want.