I have an input file that I want to use the string Split function on for each line, depending on the Type field. However, the description field sometimes contains data with embedded new lines, which breaks my file reader since it uses StreamReader's ReadLine() function.
Handled:
Type|Name|User|Description
Type|Name|User|Description
Unhandled:
Type|Name|User|Description line 1
Description Line 2
Type|Name|User|Description
Other than validating on 'Type' for each line and reading ahead until the next Type field appears, are there any ways folks can come up with to properly read this file?
My solution was to have the file maker replace newline characters in their description field with another unique character that I can later add back in. I'm still interested in solutions from the file reader's perspective though
I know I'm talking to myself a lot here, but I found another solution, which is to remove line feeds, since the output file's creator wrote out carriage returns for each line.
You could easily set a conditional statement to see if the Split array contains more than one element, which would indicate that it's a line you want to parse.
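For what it's worth, a rough sketch of that idea in VB.NET (the "input.txt" path, the pipe delimiter, and the ProcessRecord routine are just placeholders based on the sample layout above):

Imports System.IO

Module ReadRecords
    Sub Main()
        ' "input.txt" is a placeholder path; each real record starts with a Type field
        ' in a pipe-delimited line (Type|Name|User|Description).
        Using reader As New StreamReader("input.txt")
            Dim current As String = Nothing
            Dim line As String = reader.ReadLine()
            While line IsNot Nothing
                If line.Split("|"c).Length > 1 Then
                    ' More than one element: this looks like the start of a new record.
                    If current IsNot Nothing Then ProcessRecord(current)
                    current = line
                ElseIf current IsNot Nothing Then
                    ' Single element: a continuation of the previous Description field.
                    current &= " " & line
                End If
                line = reader.ReadLine()
            End While
            If current IsNot Nothing Then ProcessRecord(current)
        End Using
    End Sub

    Sub ProcessRecord(ByVal record As String)
        ' Placeholder for the real parsing; here we just show the split fields.
        Dim fields As String() = record.Split("|"c)
        Console.WriteLine(String.Join(" | ", fields))
    End Sub
End Module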
I am trying to first convert PDF credit card statements to text, then use regex to extract dates, amounts, and vendors from the individual lines. I can extract all the lines of text as they appear on the statement, but when I later call the variable holding the text, it only returns the last line.
I set the directory and read in the PDF credit card statement as "dfpdf".
I run this code:
import pdfplumber as plumb  # assuming "plumb" is pdfplumber, per the calls below

with plumb.open(dfpdf) as pdf:
    pages = pdf.pages
    for page in pdf.pages:
        text = page.extract_text()
        global line
        for line in text.split('\n'):
            print(line)
This returns all the lines in the statement, which is what I want. But if I later call or try to print "line", all I get is the last line of the statement. In addition to what is probably a really simple answer, I would also love a suggestion for a really good tutorial or class on using Python to convert PDFs and then using regex to create pandas data frames. Thanks to all of you out there who know what you're doing and take the time to help amateurs like me. Mark
I've been searching around for this for a couple of hours and can't find anything which will do this correctly. When writing a string to a text file, a blank line is output at the end.
Dim writeString As New StreamWriter(path, False)
writeString.WriteLine("Hello World")
writeString.Flush()
writeString.Close()
This will write the following to file:
Hello World
(Blank Line)
I've tried removing the last character of the string (both as a regular string with varString.Substring(0, varString.Length - 1) and as a list of strings with varList.RemoveAt(varList.Count - 1)), but it just removes the literal last character.
I've also tried using Replace(vbCrLf, "") and many variations of it, but again, they only remove literal new lines created within the string, not the new line at the end that is magically created.
Preferably, I'm seeking a method which will be able to remove that magical newline before the string is ever written to the file. I found methods which read from the file and then write back to it which would require Write > Read > Write, but in all cases the magical new line still appeared. :(
If it's important to note: the file will contain a string which may contain actual new lines (it's 'Song Artist - Song Title', though it can contain other information, and new lines can be added if the user wishes). That text file is then read by other applications (such as mIRC etc.) which output the contents by various means depending on the application.
E.g. if an application were to read it and output it into a textbox, the new line will additionally be output to that textbox, which is a problem! I have no control over the applications which will read the file as input, since it's the client who decides the application, so the new line needs to be removed when the file is written.
Help is appreciated~!
Use the Write method instead of WriteLine. The WriteLine method is the one adding a blank, zero-length line to the file, because it terminates the "Hello World" string with a newline.
writeString.Write("Hello World")
I need to parse a CSV file with blocks of text being processed in different ways according to certain rules, e.g.
userone,columnone,columntwo
userthirteen,columnone,columntwo
usertwenty,columnone,columntwo
customerone,columnone
customertwo,columntwo
singlevalueone
singlevaluetwo
singlevalueone_otherruleapplies
singlevaluethree_otherruleapplies
Each block of text will be grouped, so the first three rows will be parsed using certain rules, and so on. Notice that the last two groups have only a single column, but each group must be handled in a different way.
I have the chance to propose the file format to the customer, so I'm thinking of proposing the following.
[group 1]
userone,columnone,columntwo
userthirteen,columnone,columntwo
usertwenty,columnone,columntwo
[group N]
rowN
A kind of sections, like the INI files from some years ago. However, I'd like to hear your comments, because I think there must be a better way to handle this.
I proposed using XML, but the customer prefers text files.
Any suggestions are welcome.
m0dest0.
PS: using VB.NET and VS 2008.
You can use regular expression groups. Run the match line by line if each record has the same format on one line, or use a multi-line mode (see RegexOptions) if a record is not constrained to a single line. When a record spans multiple lines you can include \n (a line feed) in your pattern so the match crosses line boundaries; if the record is on a single line you don't need \n in your matching pattern.
VB.NET, like many other modern programming languages, has extensive support for grouping operations. You can use indexed groups or named groups.
Each name, such as header1 or whatever you want to call it, is written into the pattern as (?<myname>...).
See this link for more info: How do I access named capturing groups in a .NET Regex?.
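For example, a minimal sketch in VB.NET, assuming the [group N] section headers proposed above (the pattern and the group names header and body are only illustrative):

Imports System.Text.RegularExpressions

Module GroupedParser
    Sub Main()
        ' Hypothetical input using the proposed [group N] section headers.
        Dim input As String = "[group 1]" & vbCrLf & _
                              "userone,columnone,columntwo" & vbCrLf & _
                              "usertwenty,columnone,columntwo" & vbCrLf & _
                              "[group 2]" & vbCrLf & _
                              "singlevalueone"

        ' Named groups: "header" captures the section name, "body" captures everything
        ' up to the next section header (or the end of the input).
        ' RegexOptions.Singleline lets "." match across line breaks.
        Dim pattern As String = "\[(?<header>[^\]]+)\]\r?\n(?<body>.*?)(?=\r?\n\[|\z)"
        For Each m As Match In Regex.Matches(input, pattern, RegexOptions.Singleline)
            Console.WriteLine("Section: " & m.Groups("header").Value)
            For Each row As String In m.Groups("body").Value.Split( _
                    New String() {vbCrLf, vbLf}, StringSplitOptions.RemoveEmptyEntries)
                Console.WriteLine("  Row: " & row)   ' apply the rules for this group here
            Next
        Next
    End Sub
End Module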
Good luck.
I have a lot of data in a .rtf file (usernames and passwords). How can I fetch that data into a table? I'm using sqlite3.
I have created a database, "userDatabase.sql", and in it a table, "usersList", with the fields "username" and "password". I want to get the data from the "list.rtf" file into my table "usersList". Please help me.
Thanks in advance.
Praveena.
I would write a little parser. Re-save the .rtf as a .txt file and assume it looks like this:
user1:pass1
user2:pass2
user5:pass5
Now do this (in your code):
open the .txt file (NSString -stringWithContentsOfFile:usedEncoding:error:)
read line by line
for each line, fetch the user and password (NSString -componentsSeparatedByString:, which returns an NSArray)
store user/password into your DB
Best,
Christian
Edit: for parsing Excel sheets I recommend exporting as a CSV file and then doing the same.
Parsing RTF files is mostly trivial. They're actually text, not binary (unlike .doc, .pdf, etc.).
Last I used it, I remember the file format wasn't too difficult either.
Example:
{\rtf1\ansi\ansicpg1252\deff0\deflang1033{\fonttbl{\f0\fnil\fcharset0 Calibri;}}
{\*\generator Msftedit 5.41.21.2510;}\viewkind4\uc1\pard\sa200\sl276\slmult1\lang9\f0\fs22 Username Password\par
Username2 Password2\par
UsernameN PasswordN\par
}
Do a regular expression match to get the last { ... } part. Be sure to match { and not \{.
Next, parse the text as you want (a rough sketch follows the list below), but keep in mind that:
everything starting with a \ is escaped, I would write a little function to unescape the text
the special identifier \par is for a new line
there are other special identifiers, such as \b which toggles bolding text
the color change identifier, \cfN changes the text color according to the color table defined in the file header. You would want to ignore this identifier since we're talking about plain text.
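A very rough sketch of that idea in VB.NET (it only handles the rules listed above, assumes you have already pulled out the text portion of the document, and is not a full RTF parser):

Imports System.Text.RegularExpressions

Module RtfToText
    ' Naive RTF-to-plain-text conversion following the rules above; a sketch only.
    Function StripRtf(ByVal rtf As String) As String
        Dim text As String = rtf
        ' \par marks a new line.
        text = Regex.Replace(text, "\\par\b", vbCrLf)
        ' Drop other control words such as \b, \cf2, \fs22, \ansicpg1252, etc.
        text = Regex.Replace(text, "\\[a-z]+-?\d*\s?", "")
        ' Unescape \{, \} and \\ sequences.
        text = Regex.Replace(text, "\\([{}\\])", "$1")
        ' Drop remaining group braces and control symbols like \*.
        text = text.Replace("\*", "").Replace("{", "").Replace("}", "")
        Return text.Trim()
    End Function
End Module

Applied to the text portion of the example above, this gives back lines like "Username Password"; you can then split on line breaks (dropping empty entries), split each line on the space, and insert the pairs into the usersList table.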
I have a CSV output on one of my applications. This produces a file from web form data.
In some cases I am getting a carriage return character in my notes field. This causes an error when importing the file. I would like to remove this character.
The issue appears to be happening when users paste information into the form from word documents or holding down the shift key and pressing enter.
The field is ntext and populated in a multi line text box control.
I have been trying to remove this with a replace function but some carriage return characters seem to be getting through.
SQL
REPLACE(Fieldname, CHAR(13) + CHAR(10), ' ') AS new_Fieldname
It may be best to replace the characters separately, as they do not always occur together or in that order:
REPLACE(REPLACE(Fieldname, CHAR(13),' '), CHAR(10), ' ') AS new_Fieldname
Note that you may have a carriage return + line feed, or just a carriage return (depending on the source platform, the source of the data etc.). So you will probably need to handle both cases.
You can read CSVs with carriage returns in them. The carriage return should be in a quoted field (i.e. surrounded by double quotes). This allows you to read lines and include them in your field. If you are reading your CSV one line at a time, you need to maintain state between lines and append the data as necessary.
In .NET, the easiest way to read a CSV is to use the Microsoft.VisualBasic.FileIO.TextFieldParser class (yes, you can use this from C# if you add a reference). This reads even the nastiest CSVs I've thrown at it with ease.
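A minimal sketch with TextFieldParser (the "notes.csv" path and the assumption that the notes column is the last field are just placeholders):

Imports Microsoft.VisualBasic.FileIO

Module CsvImport
    Sub Main()
        Using parser As New TextFieldParser("notes.csv")
            parser.TextFieldType = FieldType.Delimited
            parser.SetDelimiters(",")
            parser.HasFieldsEnclosedInQuotes = True   ' quoted fields may span multiple lines

            While Not parser.EndOfData
                Dim fields As String() = parser.ReadFields()
                ' Strip any embedded CR/LF from the notes column (assumed to be the last field)
                ' before handing the row to the import.
                Dim notes As String = fields(fields.Length - 1).Replace(vbCr, " ").Replace(vbLf, " ")
                Console.WriteLine(String.Join(" | ", fields) & " -> notes: " & notes)
            End While
        End Using
    End Sub
End Module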
In Word, there are different kinds of new-line characters. Maybe you should also search/replace the other ones.
I'm not sure what all the different possibilities are; at least the paragraph mark is one that I know of.