Cannot move a text object (variable) outside a function - dataframe

I am trying to first convert PDF credit card statements to text, then use regex to extract dates, amounts, and vendors from the individual lines. I can print all the lines of text as they appear on the statement, but when I later refer to the variable holding the text, it only contains the last line.
I set the directory and read in the PDF credit card statement as "dfpdf".
I run this code:
with plumb.open(dfpdf) as pdf:
    pages = pdf.pages
    for page in pdf.pages:
        text = page.extract_text()
        global line
        for line in text.split('\n'):
            print(line)
This prints every line in the statement, which is what I want. But if I later print "line", all I get is the last line of the statement. In addition to what is probably a really simple answer, I would also love a suggestion for a really good tutorial or class on using Python to convert PDFs and then using regex to create pandas data frames. Thanks to all of you out there who know what you're doing and take the time to help amateurs like me. Mark
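The variable `line` holds only the last line because a loop variable is rebound on every iteration, so after the loop it refers to whatever was assigned last; `global` does not change that. A minimal sketch of one way around it is to append each line to a list. To keep the example runnable without a PDF, the page texts below are stand-in strings (in the question they would come from page.extract_text()). The "date vendor amount" layout, the regex, and the helper names are assumptions for illustration, not the real statement format.

```python
import re

# Hypothetical line layout: "MM/DD VENDOR NAME AMOUNT"; adjust to the real statement.
LINE_PATTERN = re.compile(r"^(\d{2}/\d{2})\s+(.+?)\s+(-?[\d,]+\.\d{2})$")


def collect_lines(page_texts):
    """Gather the lines of every page into one list instead of one variable."""
    lines = []
    for text in page_texts:
        if text:  # skip pages that yielded no text
            lines.extend(text.split("\n"))
    return lines


def parse_lines(lines):
    """Return (date, vendor, amount) tuples for lines matching the assumed layout."""
    rows = []
    for line in lines:
        match = LINE_PATTERN.match(line.strip())
        if match:
            date, vendor, amount = match.groups()
            rows.append((date, vendor, float(amount.replace(",", ""))))
    return rows


# Stand-in for [page.extract_text() for page in pdf.pages]
sample_pages = [
    "Statement summary\n01/02 COFFEE SHOP 4.50",
    "01/05 GAS STATION 1,030.00",
]
all_lines = collect_lines(sample_pages)
rows = parse_lines(all_lines)
```

Because `all_lines` is an ordinary list, it stays available after the loops finish. A list of tuples like `rows` can then be handed to `pandas.DataFrame(rows, columns=["date", "vendor", "amount"])` if a data frame is the goal.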

Related

Pentaho - Spoon Decimal from Text File Input

I'm new to Pentaho and have a little problem with the Text file Input.
Currently I have to have several data records written to a database. In the files, the decimal numbers are separated by a point.
Pentaho is currently transforming the number 123.3659 € to 12.33 €.
Can someone help?
When you read the file, do you read it as CSV, Excel, or something like that? If so, you can specify the format of the column so the number is interpreted correctly (I think; I'm going from memory here). Changing the language setting of the file might also work.
If it's a file containing a string, you can use a step such as the string operator to replace the point with a comma.
This problem can have various causes, but I think the following steps will solve it:
- First, add a "Replace in String" step.
- Then search for the dot and replace it with nothing, or with a comma if the number is a float.
Hope this helped!
Give feedback if so!
Have a good day!

Extra quote marks being added to String field in dataframe

I'm trying to do some text processing on entries in a TSV file, so I loaded it as a dataframe. I want to add a quotation mark at the beginning of a certain entry in the dataframe. The code I'm using to do this is as follows:
episode_info.loc[i, 'word'] = "\"" + episode_info.loc[i, "word"]
But the result I get when I look at the output is """help" instead of just "help, while the previous entry is just help, so I don't know why this isn't working.
Okay, I printed the entries in question to the terminal and they look correct. I guess when I viewed the file in Sublime, which is what I was using, the quotation marks were being formatted weirdly. Apologies for the unnecessary question.
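One possible explanation, assuming the dataframe was written back to a file with default CSV-style quoting: a field that contains a quote character is wrapped in quotes and its inner quote is doubled, so "help becomes """help" on disk while the value in memory is unchanged. The sketch below shows this with Python's standard csv module rather than the asker's actual pandas code.

```python
import csv
import io

# Write one row with a plain field and a field that starts with a quote mark.
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="\t")
writer.writerow(["help", '"help'])
written = buffer.getvalue()

# Reading the text back undoes the escaping and recovers the original values.
rows_back = list(csv.reader(io.StringIO(written), delimiter="\t"))

print(repr(written))
print(rows_back)
```

Under that assumption, the extra quote marks in the text editor are the file format's escaping, not extra characters in the data.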

Import text from a .txt file using keywords in random positions

I'm new to this great platform and I have a question about Visual Basic .NET.
I would like to import data from a txt file (or if you prefer a richtextbox!) using keywords that can be placed in a random position within the txt file. For example a txt like this:
keyword 25
or like this:
keyword 25
In both cases the application should be able to recognise the line because of the presence of the keyword and get the number (25) that will be saved in a variable. Of course this number can vary in different files.
I was thinking of using code similar to this:
If line.StartsWith(keyword) Then
    .....
End If
But the problem is that the keyword is not always the first character on the line (there can be spaces before it), and I don't know which line of the txt file contains the keyword.
I would also like to ask how to get the number 25, which can be placed at a random position after the keyword (but always on the same line).
I hope everything is clear, and thanks if you can help me.
You may consider using .TrimStart() on the lines as you read them, like so:
If line.TrimStart.StartsWith(keyword) Then
    .......
End If
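The question is about VB.NET, so the following Python sketch only illustrates the logic: ignore leading whitespace, check for the keyword, then take the first integer that follows it on the same line. The function name and the regular expression are assumptions for illustration; the same pattern idea could be carried over to .NET's Regex class.

```python
import re


def find_keyword_value(lines, keyword):
    """Return the first integer after `keyword` on the first line that starts with it."""
    # Optional leading whitespace, the keyword, any non-digits, then the number.
    pattern = re.compile(r"^\s*" + re.escape(keyword) + r"\b\D*?(-?\d+)")
    for line in lines:
        match = pattern.search(line)
        if match:
            return int(match.group(1))
    return None  # keyword not found on any line


sample_lines = ["some other text 3", "   keyword      25", "trailing line"]
value = find_keyword_value(sample_lines, "keyword")
```

Scanning every line removes the need to know in advance which line holds the keyword.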

Removing handling newlines in a simple text import class

I have an input file, and I want to use the string Split function on each line, depending on the Type field. However, the description field sometimes contains newlines, which breaks my file reader because it uses StreamReader's ReadLine() function.
Handled:
Type|Name|User|Description
Type|Name|User|Description
Unhandled:
Type|Name|User|Description line 1
Description Line 2
Type|Name|User|Description
Besides not being able to validate on 'Type' for each line and keep reading the file for when the next Type field appears, are there any ways folks can come up with to properly read this file?
My solution was to have the file's producer replace newline characters in the description field with another unique character that I can later convert back. I'm still interested in solutions from the file reader's perspective, though.
I know I'm talking to myself a lot here, but I found another solution, which is to remove line feeds, since the output file creator wrote out carriage returns for each line.
You could easily set a conditional statement to see if the Split array contains more than one element, which would indicate that it's a line you want to parse.
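A reader-side version of that check can be sketched as follows, in Python rather than the .NET code the question implies. It assumes every real record has exactly four pipe-separated fields and that continuation lines of a description never contain enough pipes to look like a record; the function name is hypothetical.

```python
def read_records(lines, delimiter="|", field_count=4):
    """Merge description continuation lines into the record that precedes them."""
    records = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        # Split at most field_count - 1 times so pipes in the description survive.
        parts = line.split(delimiter, field_count - 1)
        if len(parts) == field_count or not records:
            records.append(parts)
        else:
            # Too few fields: treat the line as more description text.
            records[-1][-1] += "\n" + line
    return records


sample = [
    "Type|Name|User|Description line 1\n",
    "Description Line 2\n",
    "Type|Name|User|Description\n",
]
records = read_records(sample)
```

With the unhandled sample from the question, this yields two records, the first with a two-line description.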

making a list of traditional Chinese characters from a string

I am currently trying to estimate the number of times each character is used in a large sample of traditional Chinese text. I am interested in characters, not words. The file also includes punctuation and western characters.
I am reading in an example file. Here is a small subset:
首映鼓掌10分鐘 評語指不及《花樣年華》
該片在柏林首映,完場後獲全場鼓掌10分鐘。王家衛特別為該片剪輯「柏林版本
增減20處 趙本山香港戲分被刪
在柏林影展放映的《一代宗師》版本
教李小龍武功 葉問決戰散打王
另一增加的戲分是開場時葉問(梁朝偉飾)
My strategy is to read each line, split each line into a list, and check each character to see if it already exists in a list or a dictionary of characters. If the character does not yet exist in my list or dictionary, I will add it; if it does exist, I will increase the counter for that specific character. I will probably use two lists: a list of characters and a parallel list containing the counts. This will be more processing, but should also be much easier to code.
I have not gotten anywhere near this point yet.
I am able to read in the example file successfully. Then I am able to make a list for each line of my file. I am able to print out those individual lines into my output file and sort of reconstitute the original file, and the traditional Chinese comes out intact.
However, I run into trouble when I try to make a list of each character on a particular line.
I've read through the following article. I understood many of the comments, but unfortunately, was unable to understand enough of it to solve my problem.
How to do a Python split() on languages (like Chinese) that don't use whitespace as word separator?
My code looks like the following
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import codecs
wordfile = open('Chinese_example.txt', 'r')
output = open('Chinese_output_python.txt', 'w')
LINES = wordfile.readlines()
Through various tests I am sure the following line is not splitting the string LINES[0] into its component Chinese characters.
A_LINE = list(LINES[0])
output.write(A_LINE[0])
You probably want to use this, from answerer 'flow' at How to do a Python split() on languages (like Chinese) that don't use whitespace as word separator?:
from re import compile as _Re
_unicode_chr_splitter = _Re( '(?s)((?:[\ud800-\udbff][\udc00-\udfff])|.)' ).split
def split_unicode_chrs( text ):
    return [ chr for chr in _unicode_chr_splitter( text ) if chr ]
To successfully split a line of traditional Chinese characters, I just had to know the proper syntax for handling encoded characters. Pretty basic:
my_new_list = list(unicode(LINES[0].decode('utf8')))
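The code above is Python 2. In Python 3, strings are already Unicode, so iterating over a line yields one character at a time and the counting step described in the question can be sketched with collections.Counter instead of parallel lists. The helper names are hypothetical, and the filter keeps only the basic CJK Unified Ideographs block (U+4E00 to U+9FFF), which is an assumption that leaves out rarer characters in the extension blocks.

```python
from collections import Counter


def is_cjk(ch):
    """True for characters in the basic CJK Unified Ideographs block (assumed range)."""
    return "\u4e00" <= ch <= "\u9fff"


def count_characters(lines):
    """Count each Chinese character, skipping punctuation, digits, and Latin text."""
    counts = Counter()
    for line in lines:
        counts.update(ch for ch in line if is_cjk(ch))
    return counts


sample = ["首映鼓掌10分鐘", "完場後獲全場鼓掌10分鐘"]
counts = count_characters(sample)
```

For a real file, the lines would come from something like `open('Chinese_example.txt', encoding='utf-8')`, and `counts.most_common()` lists the characters from most to least frequent.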