getGloss() vs getDescription() in RiTa WordNet

I'm using RiTa WordNet for my college project, and I need to extract a definition for each and every word. When I call getGloss(), it returns a string containing one or more definitions. Since I want only one definition per word, I turned to getDescription().
So I'm wondering: does every word in RiTa WordNet have its definition in getDescription() too, or just in getGloss()?
Also, if you can, please help me with this: how do I retrieve words from RiTa WordNet? I can't find any method to retrieve words, like getWord() or something similar.

Does every word in RiTa WordNet have its definition in getDescription() too, or just in getGloss()?
getDescription() returns the definition;
getGloss() returns the definition plus example sentences (the full WordNet "gloss").
Looks like getGloss() is a superset of getDescription() in RiTa.
Following are the implementations of those methods in rita.RiWordNet.java:
/**
 * Returns full gloss for 1st sense of 'word' with 'pos'
 */
public String getGloss(String word, String pos)
{
  Synset synset = getSynsetAtIndex(word, pos, 1);
  return getGloss(synset);
}

/**
 * Returns description for <code>word</code> with <code>pos</code> or null if
 * not found
 */
public String getDescription(String word, String pos)
{
  String gloss = getGloss(word, pos);
  return WordnetUtil.parseDescription(gloss);
}
How do I retrieve words from RiTa WordNet?
I'm not sure what kind of words you'd like to retrieve. If you mean word definitions, I believe you can get those with getAllGlosses() and getAllSynsets().
You could even write your own getAllDescriptions(), as sketched below.
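A rough sketch of such a helper (getAllDescriptions is a hypothetical name; it assumes getAllGlosses() and WordnetUtil.parseDescription() behave as in the RiWordNet source quoted above):
/**
 * Hypothetical helper: returns one description per sense of 'word' with
 * 'pos', by stripping each gloss the same way getDescription() does.
 */
public String[] getAllDescriptions(String word, String pos)
{
  String[] glosses = getAllGlosses(word, pos);
  if (glosses == null) return null; // assumes nothing was found
  String[] descriptions = new String[glosses.length];
  for (int i = 0; i < glosses.length; i++)
    descriptions[i] = WordnetUtil.parseDescription(glosses[i]);
  return descriptions;
}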
Or check the RiTa library reference here to look for the methods you need.

The documentation shows that getDescription() has been removed. You can use getGloss(), getExamples(), or getSynset() as needed. Some examples:
RiWordNet rw = new RiWordNet("/path/to/WordNet");
System.out.println(Arrays.asList(rw.getSynset("dog", "n", 1)));
System.out.println(Arrays.asList(rw.getAllGlosses("dog", "n")));
System.out.println(rw.getGloss("dog", "n"));
System.out.println(Arrays.asList(rw.getExamples("dog", "n")));

Related

Find synonym for a specific word with Wikidata SPARQL

I'm trying to construct a query for https://query.wikidata.org/ that finds synonyms for a given string. I know it will involve finding lemmas of the particular string (the string is "kind" in this example) and the P5973 synonym property. This returns zero results:
select ?lexeme ?lemma ?synonym WHERE {
  ?lexeme wikibase:lemma ?lemma.
  FILTER (?lemma = 'kind')
  ?lexeme wdt:P5973 ?synonym .
}
I've used the string kind in this example. I would like to return synonyms for all meanings of the word kind, so both empathetic and type would be correct responses.

Efficient Way In SQL Server To Remove All "InvalidXMLCharacters" From an NVARCHAR

As part of this answer, I determined that one of the things that can break an OLAP cube is feeding it values (in the dimension names/values/etc.) that contain characters considered "InvalidXMLCharacters". Now I would like to filter out these values so that they never end up in the OLAP cubes I'm building in SQL. Often I find myself importing this input data from one table into another, something like the following:
INSERT INTO [dbo].[DestinationTableThatWillBeReferencedInMyOLAPCube]
SELECT TextDataColumn1, TextDataColumn2, etc...
FROM [dbo].[SourceTableContainingColumnsWithValuesWithInvalidXMLCharacters]
WHERE XYZ...
Is there an efficient way to remove all "InvalidXMLCharacters" from my columns in this query?
The obvious solution that comes to mind is some sort of regex, though from the previously linked posts that might be quite complex, and I'm not sure of the performance implications.
Another idea I've had is to convert the columns to the XML data type, but that errors if they contain invalid characters, which is not very helpful for removing them...
I've looked around and don't see many other cases where developers are trying to do exactly this; has this been tackled another way in a post that I haven't found?
.NET CLR integration in SQL Server could be helpful.
Here is a small C# example you can use as a starting point. Its most important line uses the XmlConvert.IsXmlChar(ch) call to remove invalid XML characters.
using System;
using System.Linq;
using System.Xml;

static class Program
{
    static void Main()
    {
        // https://www.w3.org/TR/xml/#charsets
        // ===================================
        // From the XML spec, the valid chars are:
        // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
        // i.e. any Unicode character, excluding the surrogate blocks, FFFE, and FFFF.
        string content = "fafa\v\f\0";
        Console.WriteLine(IsValidXmlString(content)); // False
        content = RemoveInvalidXmlChars(content);
        Console.WriteLine(content);                   // "fafa"
        Console.WriteLine(IsValidXmlString(content)); // True
    }

    // Keeps only the characters that XmlConvert considers valid XML chars.
    static string RemoveInvalidXmlChars(string text)
    {
        return new string(text.Where(ch => XmlConvert.IsXmlChar(ch)).ToArray());
    }

    // True if the whole string survives XmlConvert's character check.
    static bool IsValidXmlString(string text)
    {
        bool rc = true;
        try
        {
            XmlConvert.VerifyXmlChars(text);
        }
        catch
        {
            rc = false;
        }
        return rc;
    }
}

Using Hibernate Search (Lucene), I Need to Be Able to Search a Code With or Without Dashes

This is really the same as it would be for a social security #.
If I have a code with this format:
WHO-S-09-0003
I want to be able to do:
query = qb.keyword().onFields("key").matching("WHOS090003").createQuery();
I tried using a WhitespaceAnalyzer.
StandardAnalyzer and WhitespaceAnalyzer both have the same problem: they index 'WHO-S-09-0003' as-is, which means a search will only match if the search term contains the hyphens too.
One solution to your problem would be to implement your own TokenFilter which detects the format of your codes and removes the hyphens during indexing. You can use @AnalyzerDef to build a chain of token filters and an overall custom analyzer. Of course you will have to use the same analyzer when searching, but the Hibernate Search query DSL will take care of that.
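A rough sketch of what that could look like (the names codeAnalyzer and IssueCode are illustrative; it assumes Hibernate Search's @AnalyzerDef with the Solr-style factory classes, whose packages differ between Hibernate Search versions, and a simple PatternReplaceFilterFactory stands in for the custom hyphen-stripping TokenFilter):
import javax.persistence.Entity;
import javax.persistence.Id;

import org.apache.solr.analysis.KeywordTokenizerFactory;
import org.apache.solr.analysis.LowerCaseFilterFactory;
import org.apache.solr.analysis.PatternReplaceFilterFactory;
import org.hibernate.search.annotations.*;

@Entity
@Indexed
@AnalyzerDef(name = "codeAnalyzer",
    tokenizer = @TokenizerDef(factory = KeywordTokenizerFactory.class),
    filters = {
        @TokenFilterDef(factory = LowerCaseFilterFactory.class),
        // Strip every hyphen so 'WHO-S-09-0003' is indexed as 'whos090003'.
        @TokenFilterDef(factory = PatternReplaceFilterFactory.class, params = {
            @Parameter(name = "pattern", value = "-"),
            @Parameter(name = "replacement", value = ""),
            @Parameter(name = "replace", value = "all") }) })
public class IssueCode
{
  @Id
  private Long id;

  // The query DSL applies the same analyzer at search time, so both
  // matching("WHO-S-09-0003") and matching("WHOS090003") reduce to the
  // same indexed term.
  @Field(analyzer = @Analyzer(definition = "codeAnalyzer"))
  private String key;
}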
Actually, you can implement your own method like this one:
private String specialCharacters(String keyword) {
  String[] specialChars = { "-", "!", "?" };
  for (int i = 0; i < specialChars.length; i++)
    if (keyword.indexOf(specialChars[i]) > -1)
      keyword = keyword.replace(specialChars[i], "\\" + specialChars[i]);
  return keyword;
}
As you know, Lucene treats certain characters as special, so if you want to escape them you should insert a backslash before each one (the double backslash above is just Java string-literal escaping)...

Good way to format data in a large DataTable

I have a large data.DataTable and some formatting rules to apply. I'm sure this is not a unique problem.
For example, the LASTNAME column has a value of "Jones" but my formatting rule requires it be 10 characters padded with spaces on the right and uppercase only. Like: "JONES "
My initial thought is to loop through each row and generate a string. But, I wonder if I could accomplish this more efficiently with a DataView, LINQ or something else.
Can someone point me in a direction?
It really depends on how you display the results. If you display them in a grid, the easiest approach would be a quick loop; there's no real performance harm in that for a DataTable.
If you display the records individually, you can create an extension method for strings and simply call it, for example, like LastName.Padded():
public static class StringExtensions
{
    public static string Padded(this string s)
    {
        return s.ToUpper().PadRight(10);
    }
}

Hyphens in Lucene

I'm playing around with Lucene and noticed that the use of a hyphen (e.g. "semi-final") will result in two words ("semi" and "final") in the index. How is this supposed to match if a user searches for "semifinal", in one word?
Edit: I'm just playing around with the StandardTokenizer class actually, maybe that is why? Am I missing a filter?
Thanks!
(Edit)
My code looks like this:
StandardAnalyzer sa = new StandardAnalyzer();
TokenStream ts = sa.TokenStream("field", new StringReader("semi-final"));
while (ts.IncrementToken())
{
    string t = ts.ToString();
    Console.WriteLine("Token: " + t);
}
This is the explanation for the tokenizer in Lucene:

- Splits words at punctuation characters, removing punctuation. However, a dot that's not followed by whitespace is considered part of a token.
- Splits words at hyphens, unless there's a number in the token, in which case the whole token is interpreted as a product number and is not split.
- Recognizes email addresses and internet hostnames as one token.

Found here.
This explains why it splits your word.
This is probably the hardest thing to correct: human error. If a user types in semifinal, that is theoretically not the same as searching for semi-final. So if you have numerous words that could be typed in different ways, e.g.:
St-Constant
Saint Constant
Saint-Constant
you're stuck with the task of verifying both st and saint, as well as hyphenated and non-hyphenated forms. Your tokens would be huge, and each word would need to be compared to see if they matched.
I'm still looking to see if there is a good way of approaching this. Otherwise, if you don't have a lot of words to handle, you could store and test all the possibilities, or have a loop that splits the word starting at the first letter and moves through each letter, splitting the string in two to form two words and testing the whole way through to see if they match (see the sketch after this answer). But again, who's to say you only have two words? If you are verifying more than two words, you then have the problem of splitting the word into multiple sections.
For example:
saint-jean-sur-richelieu
If I come up with anything else, I will let you know.
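A minimal Java sketch of the brute-force split loop described above (twoWordSplits and the vocabulary set are hypothetical names; it only handles the two-word case):
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class SplitCheck
{
  // Try every split point of the query and keep the splits where both
  // halves are known words.
  static List<String[]> twoWordSplits(String query, Set<String> vocabulary)
  {
    List<String[]> hits = new ArrayList<String[]>();
    for (int i = 1; i < query.length(); i++)
    {
      String left = query.substring(0, i);
      String right = query.substring(i);
      if (vocabulary.contains(left) && vocabulary.contains(right))
        hits.add(new String[] { left, right });
    }
    return hits;
  }

  public static void main(String[] args)
  {
    Set<String> vocab = new HashSet<String>(Arrays.asList("semi", "final"));
    for (String[] hit : twoWordSplits("semifinal", vocab))
      System.out.println(hit[0] + " + " + hit[1]); // prints: semi + final
  }
}
As the answer notes, this blows up combinatorially once a term may split into more than two words.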
I would recommend you use the WordDelimiterFilter from Solr (you can use it in a plain Lucene application as a TokenFilter added to your analyzer; just grab the Java file for this filter from Solr and add it to your application).
This filter is designed to handle cases just like this:
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.WordDelimiterFilterFactory
If you're looking for a port of the WordDelimiterFilter, I advise googling WordDelimiter.cs; I found such a port here:
http://osdir.com/ml/attachments/txt9jqypXvbSE.txt
I then created a very basic WordDelimiterAnalyzer:
public class WordDelimiterAnalyzer : Analyzer
{
    #region Overrides of Analyzer

    public override TokenStream TokenStream(string fieldName, TextReader reader)
    {
        TokenStream result = new WhitespaceTokenizer(reader);
        result = new StandardFilter(result);
        result = new LowerCaseFilter(result);
        result = new StopFilter(true, result, StopAnalyzer.ENGLISH_STOP_WORDS_SET);
        result = new WordDelimiterFilter(result, 1, 1, 1, 1, 0);
        return result;
    }

    #endregion
}
I said it was basic :)
If anyone has an implementation I would be keen to see it!
You can write your own tokenizer which produces, for words with a hyphen, all possible combinations of tokens, like this:
semifinal
semi
final
You will need to set proper token offsets and position increments to tell Lucene that semi and semifinal actually start at the same place in the document.
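A minimal Java sketch of that idea as a TokenFilter rather than a full tokenizer (HyphenExpansionFilter is a hypothetical name; it assumes the upstream tokenizer, e.g. WhitespaceTokenizer, keeps hyphenated tokens whole, and it stacks the extra variants at the same position by setting their position increment to 0):
import java.io.IOException;
import java.util.LinkedList;
import java.util.Queue;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

public final class HyphenExpansionFilter extends TokenFilter
{
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final PositionIncrementAttribute posAtt = addAttribute(PositionIncrementAttribute.class);
  private final Queue<String> pending = new LinkedList<String>();

  public HyphenExpansionFilter(TokenStream input)
  {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException
  {
    if (!pending.isEmpty())
    {
      // Emit a queued variant stacked at the same position as the original.
      termAtt.setEmpty().append(pending.poll());
      posAtt.setPositionIncrement(0);
      return true;
    }
    if (!input.incrementToken())
      return false;
    String term = termAtt.toString();
    if (term.indexOf('-') >= 0)
    {
      // Queue the concatenated form ('semifinal') and the parts ('semi',
      // 'final'); the original hyphenated token is emitted first, unchanged.
      pending.add(term.replace("-", ""));
      for (String part : term.split("-"))
        if (part.length() > 0)
          pending.add(part);
    }
    return true;
  }

  @Override
  public void reset() throws IOException
  {
    super.reset();
    pending.clear();
  }
}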
The rule (for the classic analyzer) is written in jflex:
// floating point, serial, model numbers, ip addresses, etc.
// every other segment must have at least one digit
NUM = ({ALPHANUM} {P} {HAS_DIGIT}
    | {HAS_DIGIT} {P} {ALPHANUM}
    | {ALPHANUM} ({P} {HAS_DIGIT} {P} {ALPHANUM})+
    | {HAS_DIGIT} ({P} {ALPHANUM} {P} {HAS_DIGIT})+
    | {ALPHANUM} {P} {HAS_DIGIT} ({P} {ALPHANUM} {P} {HAS_DIGIT})+
    | {HAS_DIGIT} {P} {ALPHANUM} ({P} {HAS_DIGIT} {P} {ALPHANUM})+)

// punctuation
P = ("_"|"-"|"/"|"."|",")
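To see the rule in action, here is a small Java sketch (assuming Lucene 3.x, whose ClassicAnalyzer keeps this classic grammar; since 3.1, StandardAnalyzer follows UAX#29 instead): "semi-final" contains no digits, so it is split, while a digit-bearing code like "WHO-S-09-0003" matches NUM and stays whole.
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.ClassicAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class ClassicDemo
{
  public static void main(String[] args) throws Exception
  {
    Analyzer analyzer = new ClassicAnalyzer(Version.LUCENE_36);
    for (String text : new String[] { "semi-final", "WHO-S-09-0003" })
    {
      TokenStream ts = analyzer.tokenStream("field", new StringReader(text));
      CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
      ts.reset();
      System.out.print(text + " ->");
      while (ts.incrementToken())
        System.out.print(" [" + term + "]");
      System.out.println();
      ts.close();
    }
    // Expected output:
    // semi-final -> [semi] [final]
    // WHO-S-09-0003 -> [who-s-09-0003]
  }
}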