Unicode error in the tokenization step only when doing stop-word removal in Python 2 (pandas)

I am trying to run this script: enter link description here
(The only difference is that instead of the TEST_SENTENCES I need to read my dataset (its text column), and I need to apply stop-word removal to that column before passing it to the rest of the code.)
df = pd.DataFrame({'text': ['the "superstar breakfast" is shrink wrapped muffins that can be bought at a convenience store.',
                            'The wireless internet was unreliable. ', 'i am still her . :). ',
                            'I appreciate your help ', 'I appreciate your help '],
                   'sentiment': ['positive', 'negative', 'neutral', 'positive', 'neutral']})
The error is not raised when I use the data frame defined inline like this; it is raised only when I read the exact same data from a CSV file and add this line of code to remove stop words:
df['text_without_stopwords'] = df['text'].apply(lambda x: ' '.join([word.encode('latin1', 'ignore').decode('latin1') for word in x.split() if word not in (stop)]))
TEST_SENTENCES = df['text_without_stopwords']
It keeps raising this error:
ValueError: All sentences should be Unicode-encoded!
Also, the error is raised in the tokenization step:
tokenized, _, _ = st.tokenize_sentences(TEST_SENTENCES)
I want to know what is happening here to cause this error, and the correct way to fix the code.
(I have tried different encodings such as utf-8, etc., but none worked.)

I don't know the reason yet, but when I did
df['text_without_stopwords'] = df['text_without_stopwords'].astype('unicode')
it worked.
Still very curious to know why this happens only when I do stop-word removal.
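A plausible explanation (hedged, since I can't reproduce your exact setup): DeepMoji's tokenize_sentences insists that every sentence is a Python 2 unicode object. Your lambda joins words with ' ', which is a byte str; in Python 2, ' '.join([]) returns a byte str '' whenever a row loses all of its words, and a CSV read can also hand back plain str cells (or float NaN for empty rows), so a single non-unicode entry is enough to trip the check. astype('unicode') coerces the whole column to unicode, which is why it makes the error go away. Below is a minimal Python 2 sketch that keeps everything unicode from the start; stop and the file name data.csv are illustrative assumptions, not your actual names:

# -*- coding: utf-8 -*-
# Python 2 sketch (hedged): keep every sentence unicode end to end.
# `stop` and the file name 'data.csv' are illustrative assumptions.
import pandas as pd

stop = set([u'the', u'a', u'is'])      # substitute your real stop-word set

def remove_stopwords(text):
    if not isinstance(text, unicode):  # CSV cells can come back as byte str
        text = text.decode('utf-8', 'ignore')
    # u' '.join keeps the result unicode even if every word is removed;
    # a plain ' '.join([]) returns a byte str '' in Python 2
    return u' '.join(w for w in text.split() if w not in stop)

df = pd.read_csv('data.csv')
df['text'] = df['text'].fillna(u'')    # NaN floats would also fail the check
df['text_without_stopwords'] = df['text'].apply(remove_stopwords)
TEST_SENTENCES = df['text_without_stopwords']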

Related

tcolorbox in RMarkdown with shortcuts

I'm using the LaTeX library tcolorbox in an RMarkdown document to list R code with tcblisting{...}. It works fine when the full command is used in the document:
\begin{tcblisting}{colback=red!5!white, colframe=red!50!black,listing only,before skip=5 cm,
title=R code for finding and plotting frequencies from sorted data - Figure 3.5,hbox, enhanced, drop fuzzy shadow, listing options={language=R,keywordstyle=\color{blue}},before=\begin{center}, after=\end{center}}
some text
\end{tcblisting}
Due to the length of the command and the multiple uses of similar boxes, changing each box to reflect, say, a new color, is tedious and error-prone. I'd like to create a shortcut using
\newtcblisting{mybox}[1]{%colback=red!5!white, colframe=red!50!black,listing only,before skip=5 cm,title={#1},hbox, enhanced, drop fuzzy shadow, listing options={language=R,keywordstyle=\color{blue}}, before=\begin{center}, after=\end{center}}
in the preamble, as in the LaTeX documentation, and then refer to the predefined box using
\begin{mybox}{my box title}
some text
\end{mybox}
but compiling to PDF gives me an error message:
! Package pgfkeys Error: I do not know the key '/tcb/[' and I am going to ignore it. Perhaps you misspelled it.
I'm thinking that either 1) RMarkdown/tcolorbox doesn't support this, or 2) something is wrong with my syntax. For 2), I've tried putting the \newtcblisting definition in the preamble's header-includes: section with the "-" preceding it (no good) and in the main body of the document without the "-" (also no good).
Can anyone help with this?

Correct Malformed CSV and pull corrected data back into a dataframe

UPDATE BELOW.....
We have automated CSV data dumping into our backend, and it looks like there are some malformed items buried in the data. There is a job family title that errantly has a \n between two words, which is wrecking our data; so that's the problem.
I want to read in the CSV as wholetext, regexp_replace the title with the correction, then load this fixed wholetext into a new dataframe as if I had loaded a correct CSV to start with. Here's the madness of where I'm at right now:
# Import the functions I need (col and regexp_replace are used below)
from pyspark.sql.functions import col, regexp_replace
# Looks like there is a job family title with an issue. There's a carriage return / line feed between two words messing up the csv
# This needs to be patched before we actually pull the data into the dataframes to begin work
data_requisitions_patch0 = spark.read.text('abfss://container#somethingcool.dfs.core.windows.net/Data/brokencsv.csv', wholetext=True)
data_requisitions_patch0.collect()
data_requisitions_new = data_requisitions_patch0
# print(data_requisitions_patch0)
# data_requisitions_patch0.printSchema()
# data_requisitions_patch0.show()
data_requisitions_patch1 = data_requisitions_patch0 \
.withColumn("value", regexp_replace(col('value'), 'Job - Starting\n', 'Job - Starting'))
data_requisitions_patch1.collect()
print('patch0')
data_requisitions_new.count()
print('patch1')
data_requisitions_patch1.count()
# print('Patch0 dataframe: ' + data_requisitions_patch0.count())
# print('Patch0 dataframe: ' + data_requisitions_patch1.count())
# data_requisitions_test0 = spark(data_requisitions_patch1, header=True)
# data_requisitions_test1 = spark.read.csv('abfss://container#somethingcool.dfs.core.windows.net/Data/brokencsv.csv', header=True)
# data_requisitions_test0.count()
# data_requisitions_test0.printSchema()
# data_requisitions_test1.count()
# data_requisitions_test1.printSchema()
It's obviously a mess right now; I'm trying to troubleshoot whether the regexp_replace is working, but I'm not having much luck. Then it occurred to me that I have a single-row, single-column dataframe. Now I'm trying to figure out how to take the dataframe after the 'patch' and turn it back into a normal CSV-style dataframe, as if everything had been OK to begin with.
I left in all my testing nonsense; the thought was that you might see where my head is. Unsure if that was helpful or not. Links have been faked, obviously.
First off: am I going in the right direction? No part of this is really working. I can't even get the counts to work: test1.count() does return, but test0.count() doesn't? I don't even really care about the counts; that's just me trying to figure out why it's not working.
Secondly: malformed CSV -> wholetext dataframe -> regexp fix for the problem -> fixed dataframe with correct headers and rows, like normal.
How off am I?
=======
UPDATE
Made some great progress: I ended up splitting the wholetext dataframe on \n line feeds and exploding that into rows. That works great; the dataframe now has exactly as many rows as it's supposed to have. Now I'm working on figuring out how to re-map the columns so they get created correctly.
My thought is to take the header row and try to use it as a map (see the sketch below). I don't know, still researching.
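For what it's worth, here is a hedged PySpark sketch of that whole pipeline (wholetext read, regexp patch, explode into per-line rows, header row used as the column map). The path is the faked one from above, the pattern is the one from the question, spark is assumed to be an active SparkSession, and the naive comma split does not handle quoted commas:

from pyspark.sql.functions import col, explode, split, regexp_replace

# Whole file as a single row so the stray \n can be patched in place
raw = spark.read.text(
    'abfss://container#somethingcool.dfs.core.windows.net/Data/brokencsv.csv',
    wholetext=True)
patched = raw.withColumn(
    'value', regexp_replace(col('value'), 'Job - Starting\n', 'Job - Starting'))

# Explode back into one row per original CSV line
lines = patched.select(explode(split(col('value'), '\n')).alias('line'))

# Assume the first line is the header and use it to name the columns
header = lines.first()['line'].split(',')
cells = lines.select(split(col('line'), ',').alias('cells'))
fixed = cells.select(*[col('cells')[i].alias(name)
                       for i, name in enumerate(header)])
fixed = fixed.where(col(header[0]) != header[0])  # drop the header row itself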
I wasn't approaching this right. I was handling it like a typical C# project: pull data from the DB and process it. But Spark doesn't really work well that way. I ended up putting the processed data into the dataframe itself and running my if checks against the contained columns. It works fantastically, and it's a lot faster than trying to extract the data to do the checks.

Unexpected indent in Python 3.7 (PyCharm)

I am very new to Python, but I want to extract some data about job postings from an online job portal.
With the following code I wanted to extract the titles of the job postings on a particular website:
def jobtitle(soup):
    jobs=[]
        for div in soup.find_all(name="div", attrs={"class"}:"row"}):
            for a in div.find_all(name="a",attrs={"data-tn-element":"jobTitle"}):
                jobs.append(a["title"])
    return(jobs)

jobtitle(soup)
I receive this error message:
for div in soup.find_all(name="div", attrs={"class"}:"row"}):
^
IndentationError: unexpected indent
I tried many different things that were recommended on other sites, but nothing worked. I just don't know what the problem is. I tried different whitespace, but I just don't understand what I am doing wrong.
Any ideas? I would be really grateful!
Thanks a lot :-)
Remove the indent on the first for line.
The first for statement should be directly under the jobs=[] declaration.
def jobtitle(soup):
    jobs=[]
    for div in soup.find_all(name="div", attrs={"class"}:"row"}):
        for a in div.find_all(name="a",attrs={"data-tn-element":"jobTitle"}):
            jobs.append(a["title"])
    return(jobs)

jobtitle(soup)
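Separately, note that the attrs argument as pasted, attrs={"class"}:"row"}, is itself a syntax error and will still fail once the indentation is fixed; presumably a plain dict was intended. A sketch of the fully corrected function (assuming soup is a BeautifulSoup object):

def jobtitle(soup):
    jobs = []
    # attrs takes a dict mapping attribute names to values
    for div in soup.find_all(name="div", attrs={"class": "row"}):
        for a in div.find_all(name="a", attrs={"data-tn-element": "jobTitle"}):
            jobs.append(a["title"])
    return jobs

jobtitle(soup)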

How to solve “illegal character” error when saving Google Script for Google Analytics

I'm working on automating Google Analytics to get results into a Google Drive dashboard, and I'm getting the error "illegal character. (line 7, file "code")". How do I fix this problem? Here is the full script. I appreciate the help in advance. Thanks.
// Get Data.
var results = Analytics.Data.Ga.get(
    tableId,
    startDate,
    endDate,
    'ga:visitors,ga:visits,ga:pageviews',
    {‘dimensions’: ‘ga:date’});

// Output to spreadsheet.
var sheet = SpreadsheetApp.getActiveSpreadsheet().insertSheet();
sheet.getRange(2, 1, results.getRows().length, headerNames.length)
    .setValues(results.getRows());

// Make Sandwich.
Not sure if line 7 in the error message is the same as line 7 in the code you pasted, but if it is, then it looks like the single quotes are different. Use the straight single quote ('), not the curly one.
Here's what I mean:
{‘dimensions’: ‘ga:date’});
should be
{'dimensions': 'ga:date'});
Hope this helps.

Can't find link using Mechanize follow_link()

I just started looking at the Python version of Mechanize today. I took most of this code from the first example on http://wwwsearch.sourceforge.net/mechanize/. The documentation for this module is very sparse, and I have no idea how to debug this.
I am trying to find and follow the first link with the text "Careers". When I run this, I get the error "mechanize._mechanize.LinkNotFoundError". Can anyone tell me what I am doing wrong?
import re
import mechanize
br = mechanize.Browser(factory=mechanize.RobustFactory())
br.open("http://www.amazon.com/")
response1 = br.follow_link(text_regex=r"Careers", nr=1)
assert br.viewing_html()
print br.title()
I just tried the sample code myself, and it looks like the problem is with the nr argument. It's not documented anywhere except in the source code (which is far more informative than the documentation!), which states:
nr: matches the nth link that matches all other criteria (default 0)
Because the nr argument is 0-based, when you passed an argument of 1 it was looking for the second "Careers" link, which doesn't exist.
Since nr defaults to 0, i.e. the first matching link, you can set it to 0 explicitly or leave it off entirely.
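Putting that together, a minimal corrected version of the original snippet; since nr=0 is the default, it can simply be omitted:

import mechanize

br = mechanize.Browser(factory=mechanize.RobustFactory())
br.open("http://www.amazon.com/")
# nr is 0-based; the default nr=0 already matches the first "Careers" link
response1 = br.follow_link(text_regex=r"Careers")
assert br.viewing_html()
print br.title()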