Apache UIMA Ruta - non-English sentence processing

I tested a Ruta script with two different languages (English and Korean).
I expected both sentences to be split into words, but the Korean sentence was not split at all.
Script :
DECLARE Last1;
W {-> Last1};
Document : "This is a sample."
Result :
This ,
is ,
a ,
sample
Document : "이것은 샘플입니다."
Result :
"" (nothing)
The result I want:
이것은 , 샘플입니다
Instead, the result is empty. How can I get Ruta to detect a non-English word as a word?
Any help would be appreciated.

I solved it using SPLIT:
Sentence{-> SPLIT(SPACE)};
(Apache UIMA ruta-core 2.6.1)
Still, I would like to know how to separate Unicode words using the reserved keyword "W".
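To make the desired output concrete, here is a minimal sketch in Python rather than Ruta. It assumes that a "word" means a run of Unicode word characters; Python's `\w` is not related to Ruta's `W` token type, and `split_words` is a name made up for this illustration.

```python
import re

def split_words(text):
    # For str patterns in Python 3, \w is Unicode-aware, so Hangul
    # syllables count as word characters just like ASCII letters do.
    return re.findall(r"\w+", text)

print(split_words("This is a sample."))
print(split_words("이것은 샘플입니다."))
```

Both sentences come back as lists of words with the trailing period dropped, which is the behaviour the question is asking Ruta for.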

Kusto Query Language - Extract all between two Characters

I'm working on extracting an email address from the additionalextensions column in Sentinel. I've found a regex that works perfectly in a calculator, extracting everything after a colon (:) up to a semicolon followed by the letter s (;s). However, it does not work in Kusto, I suspect because it uses a lookbehind.
Below is the regex that worked in the calculator:
(?<=:).*(?=;s)
This is data from one of the logs:
cat=EXFILTRATION;account=O365:email.address#test.org.uk;start=1659975196000;end=165997519600
When using the calculator, it returns the below:
email.address#test.org.uk
However, when trying to use this in Kusto, it returns the original data. Is anyone able to come up with a way I can achieve this in KQL?
"extracting everything after a colon (:) up to a semicolon followed by the letter s (;s)."
You don't have to use a regular expression.
For instance, using the parse operator:
print input = 'cat=EXFILTRATION;account=O365:email.address#test.org.uk;start=1659975196000;end=165997519600'
| parse input with * ":" email_address ";s" *
input                                                                                        | email_address
cat=EXFILTRATION;account=O365:email.address#test.org.uk;start=1659975196000;end=165997519600 | email.address#test.org.uk
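If a regular expression is still preferred, the lookbehind and lookahead can usually be replaced by a capture group, which also works in engines that have no lookaround support. The sketch below shows the idea in Python, not KQL; `extract_between` is a hypothetical helper name. In KQL the same pattern would presumably go through the extract() function with capture group 1, which is not tested here.

```python
import re

LOG = ("cat=EXFILTRATION;account=O365:email.address#test.org.uk;"
       "start=1659975196000;end=165997519600")

def extract_between(text):
    # Match the ':' and the ';s' literally and capture what lies between.
    # The lazy .*? stops at the first ';s' instead of the last one.
    match = re.search(r":(.*?);s", text)
    return match.group(1) if match else None

print(extract_between(LOG))  # email.address#test.org.uk
```

The delimiters are consumed by the match but left out of the captured group, which is what the lookarounds were doing in the original pattern.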

Find and Replace script for textWrangler

I'm trying to write a script that will find and replace some XML structure.
I've flattened the XML: no spaces, no line breaks.
So far I have:
tell application "TextWrangler"
replace text "<key>amenity</key><string>drinking_water</string>" using text "<key>ico</key>
<string>NPpotableGPS30</string>"
end tell
The script editor says :
Error in TextWrangler : text "<key>amenity</key><string>drinking_water</string>" don't have the message « replace ».
This is my first script, so I'm probably doing something wrong, but what?
The keyword "text" seems to be misplaced. The following appears to work correctly (the part from "searching in" onwards is optional; it specifies which document to process, depending on your situation):
tell application "TextWrangler"
replace "<key>amenity</key><string>drinking_water</string>" using "<key>ico</key><string>NPpotableGPS30</string>" searching in text 1 of text document "..." options {search mode:grep, starting at top:true}
end tell

Postgresql database search with regex

I'm using PostgreSQL database with VB.NET and ODBC (Windows).
I'm searching sentences for whole words by combining SELECT with a regular expression, like this:
"SELECT dtbl_id, name
FROM mytable
WHERE name ~*'" + "( |^)" + TextBox1.Text + "([^A-z]|$)"
This searches well in some cases but because of syntax errors in text (or other reasons) it sometimes fails. For example, if I have the sentence
BILLY IDOL: WHITE WEDDING
the word "white" will be found. But if I have
CLASH-WHITE RIOT
then "white" will not be found, because there is no space before the word "white".
The simplest solution would be to temporarily replace characters such as :,.\/-= in the sentences with spaces.
Is it possible to do this in a single SELECT statement, so that it is suitable for use with .NET/ODBC? Maybe inside the same regular expression?
If so, how?
Try this:
SELECT 'CLASH-WHITE RIOT' ~ '[[:<:]]WHITE[[:>:]]';
[[:<:]] and [[:>:]] match the beginning and the end of a word, respectively.
More information is available at: http://www.postgresql.org/docs/9.1/static/functions-matching.html#FUNCTIONS-POSIX-REGEXP
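Because the search term comes from a text box, it is probably worth escaping it before building the pattern. Here is a minimal sketch of whole-word matching with escaped input, written in Python, where `\b` plays the role of `[[:<:]]` and `[[:>:]]`; `contains_word` is a name invented for this example.

```python
import re

def contains_word(sentence, word):
    # re.escape stops user input such as "a.b" or "AC/DC" from being
    # interpreted as regex syntax; \b anchors the match at word boundaries.
    pattern = r"\b" + re.escape(word) + r"\b"
    return re.search(pattern, sentence, flags=re.IGNORECASE) is not None

print(contains_word("BILLY IDOL: WHITE WEDDING", "white"))
print(contains_word("CLASH-WHITE RIOT", "white"))
print(contains_word("WHITEWASH", "white"))
```

The first two sentences match and the third does not, since "WHITE" there is only part of a longer word.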

REGEX for complete word matching

OK, so I am confused (obviously).
I'm trying to return rows (from Oracle) where a text field contains a complete word, not just the substring.
A simple example is the word 'I'.
Show me all rows where the string contains the word 'I', but not simply where 'I' is a substring somewhere, as in '%I%'.
So I wrote what I thought would be a simple regex:
select REGEXP_INSTR(upper(description), '\bI\b') from mytab;
I expected 'I' to be detected by its word boundaries, but I get no results (or rather, the result 0 for each row).
What I expect:
'I am the Administrator' -> 1
'I'm the administrator' -> 0
'Am I the administrator' -> 1
'It is the infamous administrator' -> 0
'The adminisrtrator, tis I' -> 1
Isn't \b supposed to find the contained string by word boundary?
Thanks in advance.
I believe that \b is not supported by your flavor of regex:
http://download.oracle.com/docs/cd/B19306_01/appdev.102/b14251/adfns_regexp.htm#i1007670
Therefore, you could do something like:
(^|\s)word(\s|$)
This at least ensures that your "word" is either delimited by whitespace or is the whole string.
Oracle doesn't support word boundary anchors, but even if it did, you wouldn't get the desired result: \b matches between an alphanumeric character and a non-alphanumeric character. The exact definition of what an alnum is differs between implementations, but in most flavors, it's [A-Za-z0-9_] (.NET also considers Unicode letters/digits).
So there are two boundaries around the I in %I%.
If you define your word boundary as "whitespace before/after the word", then you could use
(^|\s)I(\s|$)
which would also work at the start/end of the string.
Oracle's native regex support is limited: \b and \< cannot be used as word delimiters. You may want Oracle Text for word search.
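The whitespace-delimited pattern can be checked against the five sample rows from the question. The sketch below uses Python's re module only as a stand-in for Oracle's regex engine, so it illustrates the pattern rather than Oracle behaviour; `has_word_i` is a hypothetical helper that returns 1 or 0 to mirror the expected column in the question.

```python
import re

# The pattern suggested above: 'I' preceded by start-of-string or whitespace
# and followed by whitespace or end-of-string.
WORD_I = re.compile(r"(^|\s)I(\s|$)")

def has_word_i(description):
    # upper() mirrors the upper(description) call in the original query
    return 1 if WORD_I.search(description.upper()) else 0

samples = [
    "I am the Administrator",
    "I'm the administrator",
    "Am I the administrator",
    "It is the infamous administrator",
    "The adminisrtrator, tis I",
]
for s in samples:
    print(has_word_i(s), s)
```

The results are 1, 0, 1, 0, 1, matching the expectations listed in the question, including the "I'm" case that a plain \b would have counted as a match.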

How to remove strings contained in a list in VB.NET?

How can I find words like and, or, to, a, no, with, for, etc. in a sentence using VB.NET and remove them? Also, where can I find a complete list of such words?
Note that unless you use Regex word boundaries you risk falling afoul of the Scunthorpe (Sfannythorpe) problem.
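string pattern = @"\band\b";  // verbatim string; \b matches a word boundary
Regex re = new Regex(pattern);  // requires: using System.Text.RegularExpressions;
string input = "a band loves and its fans";
string output = re.Replace(input, ""); // "a band loves  its fans" (a doubled space is left behind)
Notice the 'and' in 'band' is untouched. (The snippet is C#; the same Regex class can be used from VB.NET.)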
You can indeed replace your list of words using the .Replace function (as colithium described) ...
myString.Replace("and", "")
Edit:
... but indeed, a nicer way is to use Regular Expressions (as edg suggested) to avoid replacing parts of words.
As your question suggests that you would like to clean up a sentence to keep meaningful words, you have to do more than just remove two- and three-letter words.
What you need is a list of stop words:
http://en.wikipedia.org/wiki/Stop_word
A comma-separated list of stop words for the English language can be found here:
http://www.textfixer.com/resources/common-english-words.txt
The easiest way is:
myString.Replace("and", "")
You'd loop over your word list and run a statement like the above for each word. You could search Google for a list of common English words:
List of English 2 Letter Words
List of English 3 Letter Words
You can match the words and remove them using regular expressions.
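Putting the suggestions together, here is a minimal sketch of the regex approach for a whole word list. It is written in Python rather than VB.NET, and the short `STOP_WORDS` list and the `remove_stop_words` name are assumptions for illustration only; a real list would come from a stop-word resource such as the ones linked above.

```python
import re

# Hypothetical, deliberately short stop-word list taken from the question
STOP_WORDS = ["and", "or", "to", "a", "no", "with", "for"]

def remove_stop_words(sentence, stop_words=STOP_WORDS):
    # One alternation wrapped in \b...\b, so only whole words are removed
    pattern = r"\b(?:" + "|".join(re.escape(w) for w in stop_words) + r")\b"
    stripped = re.sub(pattern, "", sentence, flags=re.IGNORECASE)
    # Collapse the gaps left behind by the removed words
    return " ".join(stripped.split())

print(remove_stop_words("a band loves and its fans"))  # band loves its fans
```

Building a single pattern avoids looping over the sentence once per word, and the final split and join removes the doubled spaces that a plain replacement leaves behind.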