Remove stopwords using OpenRefine - Jython

Following this example https://github.com/OpenRefine/OpenRefine/wiki/Recipes#removeextract-words-contained-in-a-file I'm trying to remove stopwords listed in a file using OpenRefine.
Example: you want to remove from a text all stopwords contained in a file on your desktop. In this case, use Jython.
with open(r"C:\Users\ettor\Desktop\stopwords.txt",'r') as f :
stopwords = [name.rstrip() for name in f]
return " ".join([x for x in value.split(' ') if x not in stopwords])
Unfortunately I got an internal error.

Yes, this script works as you can see in this screencast.
I changed it a bit to ignore the letter case.
with open(r"~\Desktop\stopwords.txt",'r') as f :
stopwords = [name.rstrip().lower() for name in f]
return " ".join([x for x in value.split(' ') if x.lower() not in stopwords])
In an OpenRefine Jython script, "internal error" often means a syntax error, such as a forgotten parenthesis or bad indentation.
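To sanity-check the expression outside OpenRefine, here is a plain-Python sketch of the same logic; the stopword list and the value are hardcoded purely for the demo:
# Hardcoded stand-ins: in OpenRefine the list comes from stopwords.txt
# and `value` is the current cell.
stopwords = ["the", "a", "of"]
value = "the quick brown Fox of legend"
print(" ".join(x for x in value.split(' ') if x.lower() not in stopwords))
# -> quick brown Fox legend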

How to escape characters in SQL code in an R Markdown chunk?

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(tidyverse)
library(odbc)
library(DBI)
library(dbplyr)
```
```{sql, connection=con, output.var="df"}
SELECT DB_Fruit.Pear, Store.Name, Cal.Year, Sales.Qty FROM DB_Fruit
```
#> Error: unexpected symbol in "SELECT DB_Fruit.Pear"
I'm attempting to run SQL code in an R Markdown chunk as shown above, but I'm getting the "unexpected symbol" error. My best guess is that I need to escape the underscore with something such as \_ or \\_, but neither of those makes my error go away.
If I instead query using DBI (shown below) I do not get any errors:
df <- dbGetQuery(con, '
  SELECT DB_Fruit.Pear, Store.Name, Cal.Year, Sales.Qty
  FROM DB_Fruit
')
Maybe the dbGetQuery function is able to interpret things such as underscores _ correctly whereas the regular R Markdown parser can't? Or maybe there are blank spaces that were copy/pasted as some weird Unicode characters that, again, dbGetQuery can interpret but the regular R Markdown parser can't?
What's the likely culprit and what do I do about it?
Your chunk header probably should be
{sql, connection=con, output.var="df"}
instead of
{r sql, connection=con, output.var="df"}
With the second form the chunk runs under the R engine, so R itself tries to parse the SQL and stops at the first symbol it can't make sense of; the underscores are a red herring.
You have to use "Chunk Output Inline" in the R Markdown document:
---
editor_options:
  chunk_output_type: inline
---

apache uima ruta - non-English sentence processing

I tested a RUTA script with two different languages (English and Korean). I wanted to get the same result, with the text split by word, but the Korean sentence was not split into words.
Script:
DECLARE Last1;
W {-> Last1};
Document : "This is a sample."
Result :
This ,
is ,
a ,
sample
Document : "이것은 샘플입니다."
Result :
"" (nothing)
The result that I want to get:
이것은, 샘플입니다
Instead, the result is nothing. I want to know how I can detect non-English words as words in RUTA.
I hope you can help!
I solved it using SPLIT:
Sentence{-> SPLIT(SPACE)};
(Apache UIMA ruta-core 2.6.1)
Anyway, I would still like to know how to match Unicode words with the reserved keyword "W".

How to add asterisks to a list of filenames and join them into one line using Notepad++

I have a list of file names (about 4000).
For example:
A-67569
H-67985
J-87657
K-85897
...
I need to put an asterisk before and after each file name, and then join all the names into a single line.
Example:
*A-67569* *H-67985* *J-87657* *K-85897* so on...
Note that there is a space between filenames.
Forgot to mention, I'm trying to do this with Notepad++
How can I do it?
Please advise.
Thanks
A C# example for joining the list into one string:
List<string> list = new List<string> { "A-67569", "H-67985", "J-87657", "K-85897" };
string outString = "";
foreach (string item in list)
{
    outString += "*" + item + "* ";
}
Content of outString: *A-67569* *H-67985* *J-87657* *K-85897*
Use the Replace dialog in Notepad++ (Search > Replace...):
1. Select Extended (\n, \r, \t, \0, \x...) at the bottom of the Replace window.
2. In the field Find what write \r\n, and in the field Replace with write * * (asterisk, space, asterisk).
3. Replace All.
Note that you should manually place the single asterisk before the first and after the last file names.
If this doesn't work, in step 2 try only \n or \r instead of \r\n.
You can also use Regular expression as the Search Mode.
Find what:
(\S+)(\R|$)
Replace with (note the trailing space after the closing asterisk):
*$1* 
For the input file:
A-67569
H-67985
J-87657
K-85897
Output:
*A-67569* *H-67985* *J-87657* *K-85897*
Explanation of the regex:
(\S+) means: match one or more characters that are not blanks.
(\R|$) means: match any end of line, or the end of the file.
(\S+)(\R|$) means: match any group of non-blank characters that ends with an end of line or the end of the file.
Explanation of the Replace with field:
The $ symbol references the captured groups: $1 is the first group, in this case (\S+).
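If you'd rather script the transformation than do it in Notepad++, here is a minimal Python sketch of the same idea (the file name filenames.txt is only an assumed example):
import re

# "filenames.txt" is an assumed example name; it holds one file name per line.
with open("filenames.txt") as f:
    text = f.read()

# Same idea as the Notepad++ regex: wrap each non-blank run of characters
# in asterisks, then join everything into a single space-separated line.
tokens = re.findall(r"\S+", text)
print(" ".join("*%s*" % t for t in tokens))
# -> *A-67569* *H-67985* *J-87657* *K-85897*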

How to remove illegal characters so a dataframe can write to Excel

I am trying to write a dataframe to an Excel spreadsheet using ExcelWriter, but it keeps returning an error:
openpyxl.utils.exceptions.IllegalCharacterError
I'm guessing there's some character in the dataframe that ExcelWriter doesn't like. It seems odd, because the dataframe is formed from three Excel spreadsheets, so I can't see how there could be a character that Excel doesn't like!
Is there any way to iterate through a dataframe and replace characters that ExcelWriter doesn't like? I don't even mind if it simply deletes them.
What's the best way of removing or replacing illegal characters in a dataframe?
Based on Haipeng Su's answer, I added a function that does this:
dataframe = dataframe.applymap(
    lambda x: x.encode('unicode_escape').decode('utf-8') if isinstance(x, str) else x
)
Basically, it escapes the unicode characters if they exist. It worked and I can now write to Excel spreadsheets again!
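For a quick standalone look at what that escaping does to an offending character (the \x01 below just stands in for any control character openpyxl rejects):
s = "bad\x01value"
# unicode_escape turns the control character into the visible text \x01,
# which Excel has no problem storing.
print(s.encode('unicode_escape').decode('utf-8'))
# -> bad\x01value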
The same problem happened to me. I solved it as follows:
Install the Python package xlsxwriter:
pip install xlsxwriter
Then replace the default engine 'openpyxl' with 'xlsxwriter':
dataframe.to_excel("file.xlsx", engine='xlsxwriter')
Trying a different Excel writer engine solved my problem:
writer = pd.ExcelWriter('file.xlsx', engine='xlsxwriter')
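For completeness, a short sketch of using that writer object (the sheet name 'Sheet1' and the sample frame are just examples):
import pandas as pd

df = pd.DataFrame({"col": ["some", "data"]})

# The xlsxwriter engine does not raise IllegalCharacterError the way openpyxl does.
with pd.ExcelWriter('file.xlsx', engine='xlsxwriter') as writer:
    df.to_excel(writer, sheet_name='Sheet1', index=False)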
If you don't want to install another Excel writer engine (e.g. xlsxwriter), you may try to remove these illegal characters by looking at the pattern that causes the IllegalCharacterError to be raised.
Open cell.py, found at /path/to/your/python/site-packages/openpyxl/cell/, and look at the check_string function: it uses a predefined regular expression pattern, ILLEGAL_CHARACTERS_RE, to find those illegal characters. Locating its definition, you'll see this line:
ILLEGAL_CHARACTERS_RE = re.compile(r'[\000-\010]|[\013-\014]|[\016-\037]')
This line is what you need to remove those characters. Copy this line to your program and execute the below code before your dataframe is written to Excel:
dataframe = dataframe.applymap(lambda x: ILLEGAL_CHARACTERS_RE.sub(r'', x) if isinstance(x, str) else x)
The above line will remove those characters in every cell.
But the origin of these characters may be a problem. As you say, the dataframe comes from three Excel spreadsheets. If the source spreadsheets contain those characters, you will still face this problem. So if you can control the generation process of the source spreadsheets, try to remove these characters there to begin with.
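Putting those pieces together, a minimal self-contained sketch (the sample frame is made up; defining the pattern yourself avoids depending on openpyxl's internal import path):
import re
import pandas as pd

# The same pattern openpyxl uses internally to reject control characters.
ILLEGAL_CHARACTERS_RE = re.compile(r'[\000-\010]|[\013-\014]|[\016-\037]')

df = pd.DataFrame({"col": ["fine", "bad\x01value"]})

# Strip the offending characters from every string cell, then write as usual.
df = df.applymap(lambda x: ILLEGAL_CHARACTERS_RE.sub('', x) if isinstance(x, str) else x)
df.to_excel("clean.xlsx", index=False)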
I was also struggling with some weird characters in a data frame when writing it to HTML or CSV. For example, I couldn't write accented characters to an HTML file, so I needed to convert them into their unaccented equivalents.
My method may not be the best, but it helped me convert unicode strings into something ASCII-compatible.
# install unidecode first: pip install unidecode
from unidecode import unidecode

def FormatString(s):
    # Note: `unicode` exists only on Python 2; on Python 3 check against `str` instead.
    if isinstance(s, unicode):
        try:
            s.encode('ascii')
            return s  # already plain ASCII, keep as-is
        except:
            return unidecode(s)  # transliterate to the closest ASCII equivalent
    else:
        return s

df2 = df1.applymap(FormatString)
In your situation, if you just want to get rid of the illegal characters, change return unidecode(s) to return whatever replacement string you want.
Hope this gives you some ideas for dealing with your problem.
You can use the built-in strip() method of Python strings.
For each cell:
text = str(illegal_text).strip()
For the entire data frame:
dataframe = dataframe.applymap(lambda t: str(t).strip())
Note that strip() with no arguments only removes leading and trailing whitespace, so control characters embedded in the middle of a value will remain.
If you're still struggling to clean up the characters, this worked well for me:
import xlwings as xw
import pandas as pd
df = pd.read_pickle('C:\\Users\\User1\\picked_DataFrame_notWriting.df')
topath = 'C:\\Users\\User1\\tryAgain.xlsx'
wb = xw.Book(topath)
ws = wb.sheets['Data']
ws.range('A1').options(index=False).value = df
wb.save()
wb.close()

how to find a word in an ASCII file using python

I want to find a word and its index, but the problem is that I am only getting its first position, while the word appears more than once in the file. The file's content is:
[MAKE DATA:STUDENT1=AENIE:AGE14,STUDENT2=JOHN:AGE15,STUDENT3=KELLY:AGE14,STUDENT4=JACK:AGE16,STUDENT5=SNOW:AGE16;SET RECORD:STUDENT1=GOOD,STUDENT2=,STUDENT3=BAD,STTUDENT4=,STUDENT5=GOOD]
Following is my code:
import sys, os, csv

x = raw_input("Enter file name :") + '.ASCII'
fp = open(x, 'r')
data = fp.read()
fp.close()
found = data.find("STUDENT1")
print found
Here the word "STUDENT1" appears two times, while my code gives only its first index position. I want its second index position too. Similarly, a word may appear several times in a file, so how can I find all of its index positions?
Use the optional start parameter to str.find() to search the string again starting after the previous match:
found = data.find("STUDENT1")
while found != -1:
    print found
    found = data.find("STUDENT1", found + 1)
It would be slightly more efficient (but less concise) to use found+len("STUDENT1") instead of found+1.
Alternatively you could use re.finditer():
import re

for match in re.finditer("STUDENT1", data):
    print match.start()
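If you want all the positions collected in a list instead of printed one by one, a minimal sketch (the example output is illustrative; actual values depend on the file's content):
import re

# Collect every starting index of "STUDENT1" in one pass.
positions = [match.start() for match in re.finditer("STUDENT1", data)]
print positions  # e.g. [11, 120] -- actual values depend on the file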