Extra quote marks being added to String field in dataframe - pandas

I'm trying to do some text processing on entries in a TSV file, so I loaded it in as a dataframe, and now I'm trying to add a quotation mark at the beginning of a certain entry. The code I'm using to do this is as follows:
episode_info.loc[i, 'word'] = "\"" + episode_info.loc[i, "word"]
but the result I'm getting when I look at the output is """help" instead of just "help (the previous value of the entry was just help), so I don't know why this isn't working.

Okay, I printed the entries in question to the terminal, and it was printing the correct thing. I guess when I viewed the output in Sublime, which is what I was using, the quotation marks were being formatted weirdly. Apologies for the unnecessary question.
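For anyone who hits the same thing: what Sublime shows is most likely standard CSV/TSV quoting in the written file, not a bug in the assignment. When a field contains a quote character, the writer wraps the field in quotes and doubles the embedded quote, which turns "help into """help". A minimal sketch (the dataframe and column name here are made up to mirror the question):

import pandas as pd

episode_info = pd.DataFrame({'word': ['help']})
episode_info.loc[0, 'word'] = '"' + episode_info.loc[0, 'word']

print(episode_info.loc[0, 'word'])
# "help   <- the stored value is correct

print(episode_info.to_csv(sep='\t', index=False))
# word
# """help"   <- the TSV writer quotes the field and doubles the embedded quote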

Related

Cannot move a text object (variable) outside a function

I am trying to first convert PDF credit card statements to text, then use regex to extract dates, amounts, and vendors from the individual lines. I can extract all the lines of text as they appear on the statement, but when I later call the variable holding the text, it only returns the last line.
I set the directory, read in the PDF credit card statement as "dfpdf", and run this code:
with plumb.open(dfpdf) as pdf:
    pages = pdf.pages
    for page in pdf.pages:
        text = page.extract_text()
        global line
        for line in text.split('\n'):
            print(line)
This returns all the lines in the statement, which is what I want. But if I later call or try to print "line", all I get is the last line of the statement. In addition to what is probably a really simple answer, I would also love a suggestion for a really good tutorial or class on using Python to convert PDFs and then using regex to create pandas data frames. Thanks to all of you out there who know what you're doing and take the time to help amateurs like me. Mark
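The loop variable is rebound on every iteration, so once the loop finishes, line holds only the value from the final pass; printing inside the loop works because it happens before the rebinding. Collecting the lines into a list keeps all of them. A minimal sketch, assuming plumb above is pdfplumber and dfpdf is the path from the question:

import pdfplumber

all_lines = []
with pdfplumber.open(dfpdf) as pdf:
    for page in pdf.pages:
        text = page.extract_text() or ""  # extract_text() can return None
        all_lines.extend(text.split('\n'))

# all_lines now holds every line from every page, in order,
# ready for regex matching; no global is needed.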

Pentaho - Spoon Decimal from Text File Input

I'm new to Pentaho and have a little problem with the Text file Input.
I currently have several data records that need to be written to a database. In the files, the decimal numbers use a point as the decimal separator.
Pentaho is currently transforming the number 123.3659 € to 12.33 €.
Can someone help?
When you read the file, do you read it as a CSV, Excel, or something like that? If that's the case, then you can specify the format of the column so the number is interpreted correctly (I think; I'm talking from memory now), or maybe playing with the locale setting of the file input might work.
If it's a file containing a string, you can use some step like the string operations step to replace the point with a comma.
This problem might arise for various reasons, but I think the following steps will solve the issue:
- First, add a "Replace in String" step;
- Then search for the dot and replace it with nothing, or with a comma if the number you show is a float.
Hope this helped! Give feedback if so, and have a good day!

Escape all commas in line except first and last

I have a CSV file which I'm trying to import to a SQL Server table. The file contains lines of 3 columns each, separated by a comma. The only problem is that some of the data in the second column contains an arbitrary number of commas. For example:
1281,I enjoy hunting, fishing, and boating,smith317
I would like to escape all occurrences of commas in each line except the first and the last, such that the result of this line would be:
1281,I enjoy hunting\, fishing\, and boating,smith317
I know I will need some type of regular expression to accomplish this task, but my knowledge of regular expressions is very limited. Currently, I'm trying to use Notepad++ find/replace with regex, but I am open to other ideas.
Any help would be greatly appreciated :-)
Okay, this can be done manually. Do this:
Normal (non-regex) find all the , and replace with \, so that every comma is escaped.
Regex find ^(.*)(\\,) and replace with $1, to un-escape the last comma (the greedy (.*) makes the match land on the final \,).
Regex find (\\,)(.*)$ and replace with ,$2 to un-escape the first comma.
Worked for me in Sublime Text 2.
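If you'd rather script it, the same idea needs no regex at all: split off the first and last fields and escape whatever commas remain in the middle. A minimal Python sketch (assuming every line really does start with an ID and end with a username, as in the example):

def escape_inner_commas(line):
    first, rest = line.split(',', 1)     # ID | everything else
    middle, last = rest.rsplit(',', 1)   # free-text field | username
    return first + ',' + middle.replace(',', r'\,') + ',' + last

print(escape_inner_commas('1281,I enjoy hunting, fishing, and boating,smith317'))
# 1281,I enjoy hunting\, fishing\, and boating,smith317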

Dealing with commas in csv files csv-river plugin

I am trying to index data from a CSV file into an Elasticsearch server. The problem is that the strings themselves contain multiple "," characters, so during indexing it throws an IndexOutOfBound exception.
How do I handle commas using the csv-river plugin?
Edit:
The example file would be:
MESSAGE_ID,PARENT_MESSAGE_ID,THREAD_ID,FORUM_ID,FORUMINDEX,USER_ID,SUBJECT,BODY,MODVALUE,FORUM_NAME,CATEGORY_NAME,LIKES,DISLIKES,IS_ROOT_MESSAGE,IS_QUESTION
244,195,103,4,3,341,Re: The most stupidest program I've ever seen--Amazon,"I know nothing of your case, but I do know that throwing around terms like ""stupid idiot"" doesn't exactly help your side any.",1,"Order Management, Shipping, Feedback & Returns",Sell on Amazon,,,no,no
You need to enclose your fields in quotes. If a field contains a quote, you need to escape it with a preceding quote.
For example:
"field1","field2","field3 with, commas","field4","field ""5"" with quotes","field6"

Removing/handling newlines in a simple text import class

I have an input file that I want to run the string Split function on for each line, keyed on the Type field. However, the description field sometimes contains embedded newlines, which breaks my file reader since it uses StreamReader's ReadLine() function.
Handled:
Type|Name|User|Description
Type|Name|User|Description
Unhandled:
Type|Name|User|Description line 1
Description Line 2
Type|Name|User|Description
Setting aside validating on 'Type' for each line and reading ahead until the next Type field appears, are there any ways folks can come up with to properly read this file?
My solution was to have the file's producer replace newline characters in the description field with another unique character that I can later convert back. I'm still interested in solutions from the file reader's perspective, though.
I know I'm talking to myself a lot here, but I found another solution, which is to remove line feeds, since the output file's creator wrote out carriage returns for each line.
You could easily add a conditional to check whether the Split array contains more than one element, which would indicate that it's a line you want to parse.
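That field-count check also handles multi-line descriptions when you read the whole file up front. A minimal sketch, in Python for illustration (the | separator and four-field layout come from the example above):

def merge_records(lines, sep='|', n_fields=4):
    records = []
    for line in lines:
        if line.count(sep) == n_fields - 1:  # a complete Type|Name|User|Description row
            records.append(line)
        elif records:                        # a continuation of the previous description
            records[-1] += ' ' + line
    return [r.split(sep) for r in records]

This assumes the description text never contains the separator itself; if it can, you'd need something like the unique-character workaround described above.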