Why WordNet and JWI stemmer gives "ord" and "orde" in result of "order" stemming? - wordnet

I'm working on a project using WordNet and JWI 2.4.0.
Currently, I'm putting a lot of words within the included stemmer, it seems to work, until I asked for "order".
The stemmer answers me that "order", "orde", and "ord", are the possible stems of "order".
I'm not a native english speaker, but... I never saw the word "ord" in my life... and when I asked the WordNet dictionary for this definition : obviously there is nothing. (in BabelNet online, I found that it is a Nebraska's town !)
Well, why is there this strange stem ?
How can I filter the stems that are not present in the WordNet dictionary ? (because when I re-use the stemmed words, "orde" is making the program crash)
Thank you !
ANSWER : I didn't understood well what was a stem. So, this question has no sense.
Here is some code to test :
package JWIExplorer;
import java.io.File;
import java.io.IOException;
import java.net.URL;
import java.util.Arrays;
import java.util.Date;
import java.util.Iterator;
import java.util.List;
import edu.mit.jwi.Dictionary;
import edu.mit.jwi.IDictionary;
import edu.mit.jwi.morph.WordnetStemmer;
public class TestJWI
{
public static void main(String[] args) throws IOException
{
List<String> WordList_Research = Arrays.asList("dog", "cat", "mouse");
List<String> WordList_Research2 = Arrays.asList("order");
String path = "./" + File.separator + "dict";
URL url;
url = new URL("file", null, path);
System.out.println("BEGIN : " + new Date());
for (Iterator<String> iterstr = WordList_Research2.iterator(); iterstr.hasNext();)
{
String str = iterstr.next();
TestStem(url, str);
}
System.out.println("END : " + new Date());
}
public static void TestStem(URL url, String ResearchedWord) throws IOException
{
// construct the dictionary object and open it
IDictionary dict = new Dictionary(url);
dict.open();
// First, let's check for the stem word
WordnetStemmer Stemmer = new WordnetStemmer(dict);
List<String> StemmedWords;
// null for all words, POS.NOUN for nouns
StemmedWords = Stemmer.findStems(ResearchedWord, null);
if (StemmedWords.isEmpty())
return;
for (Iterator<String> iterstr = StemmedWords.iterator(); iterstr.hasNext();)
{
String str = iterstr.next();
System.out.println("Local stemmed iteration on : " + str);
}
}
}

Stems do not necessarily need to be words by themselves. "Order" and "Ordinal" share the stem "Ord".
The fundamental problem here is that stems are related to spelling, but language evolution and spelling are only weakly related (especially in English). As a programmer, we'd much rather describe a stem as a regex, e.g. ^ord[ie]. This captures that it's not the stem of "ordained"

Related

How to use Google translate for free? Maybe you have some analogs [duplicate]

If I pass a string (either in English or Arabic) as an input to the Google Translate API, it should translate it into the corresponding other language and give the translated string to me.
I read the same case in a forum but it was very hard to implement for me.
I need the translator without any buttons and if I give the input string it should automatically translate the value and give the output.
Can you help out?
You can use google script which has FREE translate API. All you need is a common google account and do these THREE EASY STEPS.
1) Create new script with such code on google script:
var mock = {
parameter:{
q:'hello',
source:'en',
target:'fr'
}
};
function doGet(e) {
e = e || mock;
var sourceText = ''
if (e.parameter.q){
sourceText = e.parameter.q;
}
var sourceLang = '';
if (e.parameter.source){
sourceLang = e.parameter.source;
}
var targetLang = 'en';
if (e.parameter.target){
targetLang = e.parameter.target;
}
var translatedText = LanguageApp.translate(sourceText, sourceLang, targetLang, {contentType: 'html'});
return ContentService.createTextOutput(translatedText).setMimeType(ContentService.MimeType.JSON);
}
2) Click Publish -> Deploy as webapp -> Who has access to the app: Anyone even anonymous -> Deploy. And then copy your web app url, you will need it for calling translate API.
3) Use this java code for testing your API:
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
public class Translator {
public static void main(String[] args) throws IOException {
String text = "Hello world!";
//Translated text: Hallo Welt!
System.out.println("Translated text: " + translate("en", "de", text));
}
private static String translate(String langFrom, String langTo, String text) throws IOException {
// INSERT YOU URL HERE
String urlStr = "https://your.google.script.url" +
"?q=" + URLEncoder.encode(text, "UTF-8") +
"&target=" + langTo +
"&source=" + langFrom;
URL url = new URL(urlStr);
StringBuilder response = new StringBuilder();
HttpURLConnection con = (HttpURLConnection) url.openConnection();
con.setRequestProperty("User-Agent", "Mozilla/5.0");
BufferedReader in = new BufferedReader(new InputStreamReader(con.getInputStream()));
String inputLine;
while ((inputLine = in.readLine()) != null) {
response.append(inputLine);
}
in.close();
return response.toString();
}
}
As it is free, there are QUATA LIMITS: https://docs.google.com/macros/dashboard
Use java-google-translate-text-to-speech instead of Google Translate API v2 Java.
About java-google-translate-text-to-speech
Api unofficial with the main features of Google Translate in Java.
Easy to use!
It also provide text to speech api. If you want to translate the text "Hello!" in Romanian just write:
Translator translate = Translator.getInstance();
String text = translate.translate("Hello!", Language.ENGLISH, Language.ROMANIAN);
System.out.println(text); // "Bună ziua!"
It's free!
As #r0ast3d correctly said:
Important: Google Translate API v2 is now available as a paid service. The courtesy limit for existing Translate API v2 projects created prior to August 24, 2011 will be reduced to zero on December 1, 2011. In addition, the number of requests your application can make per day will be limited.
This is correct: just see the official page:
Google Translate API is available as a paid service. See the Pricing and FAQ pages for details.
BUT, java-google-translate-text-to-speech is FREE!
Example!
I've created a sample application that demonstrates that this works. Try it here: https://github.com/IonicaBizau/text-to-speech
Generate your own API key here. Check out the documentation here.
You may need to set up a billing account when you try to enable the Google Cloud Translation API in your account.
Below is a quick start example which translates two English strings to Spanish:
import java.io.IOException;
import java.security.GeneralSecurityException;
import java.util.Arrays;
import com.google.api.client.googleapis.javanet.GoogleNetHttpTransport;
import com.google.api.client.json.gson.GsonFactory;
import com.google.api.services.translate.Translate;
import com.google.api.services.translate.model.TranslationsListResponse;
import com.google.api.services.translate.model.TranslationsResource;
public class QuickstartSample
{
public static void main(String[] arguments) throws IOException, GeneralSecurityException
{
Translate t = new Translate.Builder(
GoogleNetHttpTransport.newTrustedTransport()
, GsonFactory.getDefaultInstance(), null)
// Set your application name
.setApplicationName("Stackoverflow-Example")
.build();
Translate.Translations.List list = t.new Translations().list(
Arrays.asList(
// Pass in list of strings to be translated
"Hello World",
"How to use Google Translate from Java"),
// Target language
"ES");
// TODO: Set your API-Key from https://console.developers.google.com/
list.setKey("your-api-key");
TranslationsListResponse response = list.execute();
for (TranslationsResource translationsResource : response.getTranslations())
{
System.out.println(translationsResource.getTranslatedText());
}
}
}
Required maven dependencies for the code snippet:
<dependency>
<groupId>com.google.cloud</groupId>
<artifactId>google-cloud-translate</artifactId>
<version>LATEST</version>
</dependency>
<dependency>
<groupId>com.google.http-client</groupId>
<artifactId>google-http-client-gson</artifactId>
<version>LATEST</version>
</dependency>
I’m tired of looking for free translators and the best option for me was Selenium (more precisely selenide and webdrivermanager) and https://translate.google.com
import io.github.bonigarcia.wdm.ChromeDriverManager;
import com.codeborne.selenide.Configuration;
import io.github.bonigarcia.wdm.DriverManagerType;
import static com.codeborne.selenide.Selenide.*;
public class Main {
public static void main(String[] args) throws IOException, ParseException {
ChromeDriverManager.getInstance(DriverManagerType.CHROME).version("76.0.3809.126").setup();
Configuration.startMaximized = true;
open("https://translate.google.com/?hl=ru#view=home&op=translate&sl=en&tl=ru");
String[] strings = /some strings to translate
for (String data: strings) {
$x("//textarea[#id='source']").clear();
$x("//textarea[#id='source']").sendKeys(data);
String translation = $x("//span[#class='tlid-translation translation']").getText();
}
}
}
You can use Google Translate API v2 Java. It has a core module that you can call from your Java code and also a command line interface module.

Create data property XSD:string with Jena

the problem sounds so simple: I would like to create an data property for an individual as XSD:string in my ontology.
I can create properties of XSD:DateTime, XSD:Float or XSD:int, but if I use XSD:string, I get a untyped property!
I created a minimal example, which create an ontology with one class, one individual an two data properties. A DateTime, which works like expected and one string, which has no type in the ontology.
I tried with Jena versions 3.4 and 3.0.1 and have no idea who to fix it.
package dataproperty;
import java.io.FileOutputStream;
import org.apache.jena.datatypes.xsd.XSDDatatype;
import org.apache.jena.ontology.OntModel;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.rdf.model.Property;
import org.apache.jena.rdf.model.Resource;
import org.apache.jena.rdf.model.ResourceFactory;
public class DataProperty {
public static void main(String[] args) throws Exception {
OntModel model = ModelFactory.createOntologyModel();
String OWLPath = "DataProp.owl";
try{
String NS = "http://www.example.org/ontology.owl#";
//Create Ontology
model.createClass(NS+"Test");
Resource r = model.createResource(NS+"Test");
model.createIndividual(NS+"Indi1", r);
r = model.createResource(NS+"Indi1");
model.createDatatypeProperty(NS+"Name");
model.createDatatypeProperty(NS+"Date");
//Add Data Properties
Property p = model.getProperty(NS+"Name");
model.add(r, p, ResourceFactory.createTypedLiteral("MyName", XSDDatatype.XSDstring));
p = model.getProperty(NS+"Date");
model.add(r, p, ResourceFactory.createTypedLiteral("2017-08-12T09:03:40", XSDDatatype.XSDdateTime));
//Store the ontology
FileOutputStream output = null;
output = new FileOutputStream(OWLPath);
model.write(output);
}catch (Exception e) {
System.out.println("Error occured: " + e);
throw new Exception(e.getMessage());
}
}
}
It is not untyped in RDF 1.1 - it's written in short form (better compatibility).
e.g.
https://www.w3.org/TR/turtle/
Section 2.5.1
"If there is no datatype IRI and no language tag, the datatype is xsd:string."

Need help about #Get #Post #Put #Delete in Restful webservice

I need your help and advice. This is my first project in jersey. I don't know much about this topic. I'm still learning. I created my school project. But I have some problems on the web service side. Firstly I should explain my project. I have got 3 tables in my database. Movie,User,Ratings
Here my Movie table columns. I will ask you some points about Description column of the Movie table. I will use these methods to these columns.
Movie= Description (get,put,post and delete) I have to use all methods in this page.
movieTitle (get)
pictureURL (get,put)
generalRating (get,post)
I built my Description page. But I'm not sure if its working or not .(My database is not ready to check them). Here is my page. I wrote this page, looking at the example pages. Can u help me to find the problems and errors. I just want to do simple methods get(just reading data), post(update existing data), put(create new data), delete(delete specific data) these things.What should I do now, is my code okay or do you have alternative advice? :( I need your help guys ty
package com.vogella.jersey.first;
import javax.servlet.http.HttpServletResponse;
import javax.ws.rs.Consumes;
import javax.ws.rs.GET;
import javax.ws.rs.POST;
import javax.ws.rs.Path;
import javax.ws.rs.Produces;
import javax.ws.rs.core.MediaType;
import java.util.List;
import javax.ejb.*;
import javax.persistence.*;
import javax.ws.rs.*;
import javax.ws.rs.DELETE;
import javax.ws.rs.FormParam;
import javax.ws.rs.OPTIONS;
import javax.ws.rs.PUT;
import javax.ws.rs.PathParam;
import javax.ws.rs.core.Context;
#Path("/Movie/Description")
public class Description {
private Moviedao moviedao = new Moviedao();
#GET
#Path("/Description/")
#Produces(MediaType.APPLICATION_XML)
public Description getDescriptionID(#PathParam("sample6") string sample6){
return moviedao.getDescriptionID(id);
}
#POST
#Path("/Description/")
#Produces(MediaType.APPLICATION_XML)
#Consumes(MediaType.APPLICATION_FORM_URLENCODED)
public void updateDescription(#PathParam("sampleID")int sampleID,
#PathParam("sample2Description")string sample2Description)
throws IOException {
Description description = new Description (sampleID, sample2Description);
int result = moviedao.updateDescription(description);
if(result == 1){
return SUCCESS_RESULT;
}
return FAILURE_RESULT;
}
#PUT
#Path("/Description")
#Produces(MediaType.APPLICATION_XML)
#Consumes(MediaType.APPLICATION_FORM_URLENCODED)
public String createUser(#FormParam("sample8ID") int sample8ID,
#FormParam("sample8Description") String sample8Description,
#Context HttpServletResponse servletResponse) throws IOException{
Description description = new Description (sample8ID, sample8Description);
int result = movidao.addDescription(description);
if(result == 1){
return SUCCESS_RESULT;
}
return FAILURE_RESULT;
}
#DELETE
#Path("/Description/{descriptionID}")
#Produces(MediaType.APPLICATION_XML)
public String deleteUser(#PathParam("descriptionID") int descriptionID){
int result = moviedao.deleteDescription(descriptionID);
if(result == 1){
return SUCCESS_RESULT;
}
return FAILURE_RESULT;
}
#OPTIONS
#Path("/Description")
#Produces(MediaType.APPLICATION_XML)
public String getSupportedOperations(){
return "<operations>GET, PUT, POST, DELETE</operations>";
}
}
I just want to do simple methods get(just reading data), post(update
existing data), put(create new data), delete(delete specific data)
these things
POST should be used to create resources and PUT should be used to update resources.
Your class already has webservice path /Movie/Description, so there is no need to repeat word Description in the methods.
Also, I would recommend keep path names in lower case e.g. /movie/description

Lucene 4.1 : How split words that contains "dots" when indexing?

I'l trying to figure out what I should do to index my keywords that contains "." .
ex : this.name
I want to index the terms : this and name in my index.
I use the StandardAnalyser. I try to extends the WhitespaceTokensizer or extends TokenFilter, but I'm not sure if I'm in the right direction.
if I use the StandardAnalyser, I'll obtain "this.name" as a keyword, and that's not what I want, but the analyser do the rest correctly for me.
You can put a CharFilter in front of StandardTokenizer that converts periods and underscores to spaces. MappingCharFilter will work.
Here's MappingCharFilter added to a stripped-down StandardAnalyzer (see the original 4.1 version here):
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.charfilter.MappingCharFilter;
import org.apache.lucene.analysis.charfilter.NormalizeCharMap;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.core.StopAnalyzer;
import org.apache.lucene.analysis.core.StopFilter;
import org.apache.lucene.analysis.standard.StandardFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.util.StopwordAnalyzerBase;
import org.apache.lucene.util.Version;
import java.io.IOException;
import java.io.Reader;
public final class MyAnalyzer extends StopwordAnalyzerBase {
private int maxTokenLength = 255;
public MyAnalyzer() {
super(Version.LUCENE_41, StopAnalyzer.ENGLISH_STOP_WORDS_SET);
}
#Override
protected TokenStreamComponents createComponents
(final String fieldName, final Reader reader) {
final StandardTokenizer src = new StandardTokenizer(matchVersion, reader);
src.setMaxTokenLength(maxTokenLength);
TokenStream tok = new StandardFilter(matchVersion, src);
tok = new LowerCaseFilter(matchVersion, tok);
tok = new StopFilter(matchVersion, tok, stopwords);
return new TokenStreamComponents(src, tok) {
#Override
protected void setReader(final Reader reader) throws IOException {
src.setMaxTokenLength(MyAnalyzer.this.maxTokenLength);
super.setReader(reader);
}
};
}
#Override
protected Reader initReader(String fieldName, Reader reader) {
NormalizeCharMap.Builder builder = new NormalizeCharMap.Builder();
builder.add(".", " ");
builder.add("_", " ");
NormalizeCharMap normMap = builder.build();
return new MappingCharFilter(normMap, reader);
}
}
Here's a quick test to demonstrate it works:
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.BaseTokenStreamTestCase;
public class TestMyAnalyzer extends BaseTokenStreamTestCase {
private Analyzer analyzer = new MyAnalyzer();
public void testPeriods() throws Exception {
BaseTokenStreamTestCase.assertAnalyzesTo
(analyzer,
"this.name; here.i.am; sentences ... end with periods.",
new String[] { "name", "here", "i", "am", "sentences", "end", "periods" } );
}
public void testUnderscores() throws Exception {
BaseTokenStreamTestCase.assertAnalyzesTo
(analyzer,
"some_underscore_term _and____ stuff that is_not in it",
new String[] { "some", "underscore", "term", "stuff" } );
}
}
If I understand you correctly, you need to use a tokenizer that removes dots -- that is, any name that contains a dot should be split at that point ("here.i.am" becomes "here" + "i" + "am").
you are getting caught by behavior documented here:
However, a dot that's not followed by whitespace is considered part of a token.
StandardTokenizer introduces some more complex to parsing rules than you may not be looking for. This one, in particular, is intended to prevent tokenization of URLs, IPs, idenifiers, etc. A simpler implementation might suit your needs, like LetterTokenizer.
If that doesn't really suit your needs (and it might well turn out to be throwing the baby out with the bathwater), then you may need to modify StandardTokenizer yourself, which is explicitly encouraged by the Lucene docs:
Many applications have specific tokenizer needs. If this tokenizer does not suit your application, please consider copying this source code directory to your project and maintaining your own grammar-based tokenizer.
Sebastien Dionne: I didn't understand how to split a word, do I have to parse the document char by char ?
Sebastien Dionne: I still want to know how to split a token into multiple part, and index them all
You may have to write a custom analyzer.
Analyzer is a combination of Tokenizer and possibly a chain of TokenFilter instances.
Tokenizer : Takes in the input text passed by you probably as a java.io.Reader. It
JUST breakdowns the text. Doesn't alter, just breaks it down.
TokenFilter : Takes in the token emitted by Tokenizer, adds / removes / alters tokens and emits the same one by one until all are finished.
If it replaces a token with multiple tokens based on requirements, buffers all, emits them one by one to the Indexer.
You may check following resource, unfortunately, you may have to sign-up for a trial membership.
By writing a custom analyzer, you can breakdown the text the way you want to. You may even use some existing components like LowercaseFilter. Fortunately, it is achievable with Lucene to come up with some Analyzer that serves your purpose if you couldn't find that as a built-in or on the web.
" Writing Custom Filters: Lucene in Action 2"

How do I create a custom directive for Apache Velocity

I am using Apache's Velocity templating engine, and I would like to create a custom Directive. That is, I want to be able to write "#doMyThing()" and have it invoke some java code I wrote in order to generate the text.
I know that I can register a custom directive by adding a line
userdirective=my.package.here.MyDirectiveName
to my velocity.properties file. And I know that I can write such a class by extending the Directive class. What I don't know is how to extend the Directive class -- some sort of documentation for the author of a new Directive. For instance I'd like to know if my getType() method return "BLOCK" or "LINE" and I'd like to know what should my setLocation() method should do?
Is there any documentation out there that is better than just "Use the source, Luke"?
On the Velocity wiki, there's a presentation and sample code from a talk I gave called "Hacking Velocity". It includes an example of a custom directive.
Also was trying to come up with a custom directive. Couldn't find any documentation at all, so I looked at some user created directives: IfNullDirective (nice and easy one), MergeDirective as well as velocity build-in directives.
Here is my simple block directive that returns compressed content (complete project with some directive installation instructions is located here):
import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;
import org.apache.velocity.context.InternalContextAdapter;
import org.apache.velocity.exception.MethodInvocationException;
import org.apache.velocity.exception.ParseErrorException;
import org.apache.velocity.exception.ResourceNotFoundException;
import org.apache.velocity.exception.TemplateInitException;
import org.apache.velocity.runtime.RuntimeServices;
import org.apache.velocity.runtime.directive.Directive;
import org.apache.velocity.runtime.parser.node.Node;
import org.apache.velocity.runtime.log.Log;
import com.googlecode.htmlcompressor.compressor.HtmlCompressor;
/**
* Velocity directive that compresses an HTML content within #compressHtml ... #end block.
*/
public class HtmlCompressorDirective extends Directive {
private static final HtmlCompressor htmlCompressor = new HtmlCompressor();
private Log log;
public String getName() {
return "compressHtml";
}
public int getType() {
return BLOCK;
}
#Override
public void init(RuntimeServices rs, InternalContextAdapter context, Node node) throws TemplateInitException {
super.init(rs, context, node);
log = rs.getLog();
//set compressor properties
htmlCompressor.setEnabled(rs.getBoolean("userdirective.compressHtml.enabled", true));
htmlCompressor.setRemoveComments(rs.getBoolean("userdirective.compressHtml.removeComments", true));
}
public boolean render(InternalContextAdapter context, Writer writer, Node node)
throws IOException, ResourceNotFoundException, ParseErrorException, MethodInvocationException {
//render content to a variable
StringWriter content = new StringWriter();
node.jjtGetChild(0).render(context, content);
//compress
try {
writer.write(htmlCompressor.compress(content.toString()));
} catch (Exception e) {
writer.write(content.toString());
String msg = "Failed to compress content: "+content.toString();
log.error(msg, e);
throw new RuntimeException(msg, e);
}
return true;
}
}
Block directives always accept a body and must end with #end when used in a template. e.g. #foreach( $i in $foo ) this has a body! #end
Line directives do not have a body or an #end. e.g. #parse( 'foo.vtl' )
You don't need to both with setLocation() at all. The parser uses that.
Any other specifics i can help with?
Also, have you considered using a "tool" approach? Even if you don't use VelocityTools to automatically make your tool available and whatnot, you can just create a tool class that does what you want, put it in the context and either have a method you call to generate content or else just have its toString() method generate the content. e.g. $tool.doMyThing() or just $myThing
Directives are best for when you need to mess with Velocity internals (access to InternalContextAdapter or actual Nodes).
Prior to velocity v1.6 I had a #blockset($v)#end directive to be able to deal with a multiline #set($v) but this function is now handled by the #define directive.
Custom block directives are a pain with modern IDEs because they don't parse the structure correctly, assuming your #end associated with #userBlockDirective is an extra and paints the whole file RED. They should be avoided if possible.
I copied something similar from the velocity source code and created a "blockset" (multiline) directive.
import org.apache.velocity.runtime.directive.Directive;
import org.apache.velocity.runtime.RuntimeServices;
import org.apache.velocity.runtime.parser.node.Node;
import org.apache.velocity.context.InternalContextAdapter;
import org.apache.velocity.exception.MethodInvocationException;
import org.apache.velocity.exception.ResourceNotFoundException;
import org.apache.velocity.exception.ParseErrorException;
import org.apache.velocity.exception.TemplateInitException;
import java.io.Writer;
import java.io.IOException;
import java.io.StringWriter;
public class BlockSetDirective extends Directive {
private String blockKey;
/**
* Return name of this directive.
*/
public String getName() {
return "blockset";
}
/**
* Return type of this directive.
*/
public int getType() {
return BLOCK;
}
/**
* simple init - get the blockKey
*/
public void init( RuntimeServices rs, InternalContextAdapter context,
Node node )
throws TemplateInitException {
super.init( rs, context, node );
/*
* first token is the name of the block. I don't even check the format,
* just assume it looks like this: $block_name. Should check if it has
* a '$' or not like macros.
*/
blockKey = node.jjtGetChild( 0 ).getFirstToken().image.substring( 1 );
}
/**
* Renders node to internal string writer and stores in the context at the
* specified context variable
*/
public boolean render( InternalContextAdapter context, Writer writer,
Node node )
throws IOException, MethodInvocationException,
ResourceNotFoundException, ParseErrorException {
StringWriter sw = new StringWriter(256);
boolean b = node.jjtGetChild( 1 ).render( context, sw );
context.put( blockKey, sw.toString() );
return b;
}
}