Lucene's WordnetSynonymParser - lucene

I am trying to use Lucene's WordnetSynonymParser class to create a synonym filter, but I'm not sure which of the prolog files I'm meant to be passing into the parse() function.
The documentation says:
See http://wordnet.princeton.edu/man/prologdb.5WN.html for a
description of the format.
so I've downloaded the prolog files, but I'm not sure which ones I should be passing in, and how I go about it.
Could someone please point me in the right direction?
Thanks for your help
EDIT:
Thanks to femtoRgon for pointing me in the direction of wn_s.pl. I have now got the following code:
Analyzer tempanalyzer = new SimpleAnalyzer(Version.LUCENE_40);
WordnetSynonymParser synparser = new WordnetSynonymParser(true, true, tempanalyzer);
FileReader doctoread = new FileReader("wn_s.pl");
synparser.parse(doctoread);
SynonymMap synmap = synparser.build();
Analyzer analyzer = new Analyzer() {
#Override
protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
EnglishAnalyzer enganalyzer = new EnglishAnalyzer(Version.LUCENE_40);
CharArraySet engstopset = enganalyzer.getDefaultStopSet();
Tokenizer source = new StandardTokenizer(Version.LUCENE_40, reader);
TokenStream filter = new SynonymFilter(source, synmap, true);
filter = new StandardFilter(Version.LUCENE_40, filter);
filter = new LowerCaseFilter(Version.LUCENE_40, filter);
filter = new StopFilter(Version.LUCENE_40, filter, engstopset);
/*TokenStream filter = new StandardFilter(Version.LUCENE_40, source);
filter = new LowerCaseFilter(Version.LUCENE_40, filter);
filter = new StopFilter(Version.LUCENE_40, filter, engstopset);*/
return new TokenStreamComponents(source, filter);
}
};
which I then plan on passing into IndexWriterConfig, however I get the following compile error:
IndexFilesDB.java:133: cannot find symbol
symbol : method parse(java.io.FileReader)
location: class org.apache.lucene.analysis.synonym.WordnetSynonymParser
synparser.parse(doctoread);
I still don't fully understand WordnetSynonymParser, is it an error to do with the class or it just a simple error where the file is not being passes in correctly?
Thanks for your help.

wn_s.pl contains the synset pointers (that is, it defines groups of synonyms), which is what you need for a synonym filter, to my knowledge. I'd start with that.

Related

Graph traversal name to graph name mapping

Is there any API using which I can get graphTraversalName to graphName mapping defined in the script?
I am using the below messy code but it's error-prone if both graphs are using the same underlying storage.
Map<String, String> graphTraversalToNameMap = new ConcurrentHashMap<String, String>();
while(traversalSourceIterator.hasNext()){
String traversalSource = traversalSourceIterator.next();
String currentGraphString = ( (GraphTraversalSource) graphManager.getAsBindings().get(traversalSource)).getGraph().toString();
graphNameTraversalMap.put(currentGraphString, traversalSource);
}
Iterator<String> graphNamesIterator = graphManager.getGraphNames().iterator();
while(graphNamesIterator.hasNext()){
String graphName = graphNamesIterator.next();
String currentGraphString = graphManager.getGraph(graphName).toString();
String traversalSource = graphNameTraversalMap.get(currentGraphString);
graphTraversalToNameMap.put(traversalSource, graphName);
}
Does gremlinExecutor.getScriptEngineManager().getBindings().entrySet() provide order guarantee? I can iterate over this and populate my map
Is there any API using which I can get graphTraversalName to graphName mapping defined in the script?
No. They share the same namespace in Gremlin Server so the relationship gets lost programmatically. You would need to do something like what you are doing but I wouldn't rely on toString() of a Graph for equality. Perhaps use the Graph instance itself? Although that might not work either depending on your situation and what you want for equality as you could have two different Graph configurations pointed at the same data and want to resolve those as the same graph. I'm also not sure that any approach will work generally for all graph systems. Anyway, I think I'd experiment with using Map<Graph, String> graphTraversalToNameMap for your case and see how that goes.
Does gremlinExecutor.getScriptEngineManager().getBindings().entrySet() provide order guarantee?
No as it is backed by a ConcurrentHashMap. You would have to provide your own order.
Underlying storage details can be obtained from the configuration object and can be used for the mapping, sample code:
public class GraphTraversalMappingUtil {
public static void populateGraphTraversalToNameMapping(GraphManager graphManager){
if(graphTraversalToNameMap.size() != 0){
return;
}
Iterator<String> traversalSourceIterator = graphManager.getTraversalSourceNames().iterator();
Map<StorageBackendKey, String> storageKeyToTraversalMap = new HashMap<StorageBackendKey, String>();
while(traversalSourceIterator.hasNext()){
String traversalSource = traversalSourceIterator.next();
StorageBackendKey key = new StorageBackendKey(
graphManager.getTraversalSource(traversalSource).getGraph().configuration());
storageKeyToTraversalMap.put(key, traversalSource);
}
Iterator<String> graphNamesIterator = graphManager.getGraphNames().iterator();
while(graphNamesIterator.hasNext()) {
String graphName = graphNamesIterator.next();
StorageBackendKey key = new StorageBackendKey(
graphManager.getGraph(graphName).configuration());
graphTraversalToNameMap.put(storageKeyToTraversalMap.get(key), graphName);
}
}
}
For full code, refer: https://pastebin.com/7m8hi53p

lucene wikipedia querying

I'm using lucene to query from wiki dump and get the categories out. So, I get the relevant documents and for every document, I call the below function.
static List<String> getCategories(Document document) throws IOException
{
List<String> categories = new ArrayList<String>();
String text = document.get("text");
WikipediaTokenizer tf = new WikipediaTokenizer(new StringReader(text));
CharTermAttribute termAtt = tf.addAttribute(CharTermAttribute.class);
TypeAttribute typeAtt = tf.addAttribute(TypeAttribute.class);
while (tf.incrementToken())
{
String tokText = termAtt.toString();
if (typeAtt.type().equals(WikipediaTokenizer.CATEGORY) == true)
{
categories.add(tokText);
}
}
return categories;
}
but it throws the following error at the while statement.
Exception in thread "main" java.lang.NullPointerException
at org.apache.lucene.analysis.wikipedia.WikipediaTokenizerImpl.zzRefill(WikipediaTokenizerImpl.java:574)
at org.apache.lucene.analysis.wikipedia.WikipediaTokenizerImpl.getNextToken(WikipediaTokenizerImpl.java:781)
at org.apache.lucene.analysis.wikipedia.WikipediaTokenizer.incrementToken(WikipediaTokenizer.java:200)
at SearchIndex.getCategories(SearchIndex.java:82)
at SearchIndex.main(SearchIndex.java:54)
I looked at zzRefill() function but it I'm not able to understand it. Is this a known bug or something? I don't know what am I doing wrong. The lucene guys says that the whole wikipediaTokenizer section is in beta and maybe be subject to changes. I was hoping someone could help me.
I solved the problem by adding tf.reset() before calling the while loop

How can I do a search with Google Custom Search API for .NET?

I just discovered the Google APIs Client Library for .NET, but because of lack of documentation I have a hard time to figure it out.
I am trying to do a simple test, by doing a custom search, and I have looked among other, at the following namespace:
Google.Apis.Customsearch.v1.Data.Query
I have tried to create a query object and fill out SearchTerms, but how can I fetch results from that query?
My bad, my first answer was not using the Google APIs.
As a pre-requisite, you need to get the Google API client library
(In particular, you will need to reference Google.Apis.dll in your project). Now, assuming you've got your API key and the CX, here is the same code that gets the results, but now using the actual APIs:
string apiKey = "YOUR KEY HERE";
string cx = "YOUR CX HERE";
string query = "YOUR SEARCH HERE";
Google.Apis.Customsearch.v1.CustomsearchService svc = new Google.Apis.Customsearch.v1.CustomsearchService();
svc.Key = apiKey;
Google.Apis.Customsearch.v1.CseResource.ListRequest listRequest = svc.Cse.List(query);
listRequest.Cx = cx;
Google.Apis.Customsearch.v1.Data.Search search = listRequest.Fetch();
foreach (Google.Apis.Customsearch.v1.Data.Result result in search.Items)
{
Console.WriteLine("Title: {0}", result.Title);
Console.WriteLine("Link: {0}", result.Link);
}
First of all, you need to make sure you've generated your API Key and the CX. I am assuming you've done that already, otherwise you can do it at those locations:
API Key (you need to create a new browser key)
CX (you need to create a custom search engine)
Once you have those, here is a simple console app that performs the search and dumps all the titles/links:
static void Main(string[] args)
{
WebClient webClient = new WebClient();
string apiKey = "YOUR KEY HERE";
string cx = "YOUR CX HERE";
string query = "YOUR SEARCH HERE";
string result = webClient.DownloadString(String.Format("https://www.googleapis.com/customsearch/v1?key={0}&cx={1}&q={2}&alt=json", apiKey, cx, query));
JavaScriptSerializer serializer = new JavaScriptSerializer();
Dictionary<string, object> collection = serializer.Deserialize<Dictionary<string, object>>(result);
foreach (Dictionary<string, object> item in (IEnumerable)collection["items"])
{
Console.WriteLine("Title: {0}", item["title"]);
Console.WriteLine("Link: {0}", item["link"]);
Console.WriteLine();
}
}
As you can see, I'm using a generic JSON deserialization into a dictionary instead of being strongly-typed. This is for convenience purposes, since I don't want to create a class that implements the search results schema. With this approach, the payload is the nested set of key-value pairs. What interests you most is the items collection, which is the search result (first page, I presume). I am only accessing the "title" and "link" properties, but there are many more than you can either see from the documentation or inspect in the debugger.
look at API Reference
using code from google-api-dotnet-client
CustomsearchService svc = new CustomsearchService();
string json = File.ReadAllText("jsonfile",Encoding.UTF8);
Search googleRes = null;
ISerializer des = new NewtonsoftJsonSerializer();
googleRes = des.Deserialize<Search>(json);
or
CustomsearchService svc = new CustomsearchService();
Search googleRes = null;
ISerializer des = new NewtonsoftJsonSerializer();
using (var fileStream = new FileStream(filename, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
{
googleRes = des.Deserialize<Search>(fileStream);
}
with the stream you can also read off of webClient or HttpRequest, as you wish
Google.Apis.Customsearch.v1 Client Library
http://www.nuget.org/packages/Google.Apis.Customsearch.v1/
you may start from Getting Started with the API.

Rss20FeedFormatter Ignores TextSyndicationContent type for SyndicationItem.Summary

While using the Rss20FeedFormatter class in a WCF project, I was trying to wrap the content of my description elements with a <![CDATA[ ]]> section. I found that no matter what I did, the HTML content of the description elements was always encoded and the CDATA section was never added. After peering into the source code of Rss20FeedFormatter, I found that when building the Summary node, it basically creates a new TextSyndicationContent instance which wipes out whatever settings were previously specified (I think).
My Code
public class CDataSyndicationContent : TextSyndicationContent
{
public CDataSyndicationContent(TextSyndicationContent content)
: base(content)
{
}
protected override void WriteContentsTo(System.Xml.XmlWriter writer)
{
writer.WriteCData(Text);
}
}
... (The following code should wrap the Summary with a CDATA section)
SyndicationItem item = new SyndicationItem();
item.Title = new TextSyndicationContent(name);
item.Summary = new CDataSyndicationContent(
new TextSyndicationContent(
"<div>This is a test</div>",
TextSyndicationContentKind.Html));
Rss20FeedFormatter Code
(AFAIK, the above code does not work because of this logic)
...
else if (reader.IsStartElement("description", ""))
result.Summary = new TextSyndicationContent(reader.ReadElementString());
...
As a workaround, I've resorted to using the RSS20FeedFormatter to build the RSS, and then patch the RSS manually. For example:
StringBuilder buffer = new StringBuilder();
XmlTextWriter writer = new XmlTextWriter(new StringWriter(buffer));
feedFormatter.WriteTo(writer ); // feedFormatter = RSS20FeedFormatter
PostProcessOutputBuffer(buffer);
WebOperationContext.Current.OutgoingResponse.ContentType =
"application/xml; charset=utf-8";
return new MemoryStream(Encoding.UTF8.GetBytes(buffer.ToString()));
...
public void PostProcessOutputBuffer(StringBuilder buffer)
{
var xmlDoc = XDocument.Parse(buffer.ToString());
foreach (var element in xmlDoc.Descendants("channel").First()
.Descendants("item")
.Descendants("description"))
{
VerifyCdataHtmlEncoding(buffer, element);
}
foreach (var element in xmlDoc.Descendants("channel").First()
.Descendants("description"))
{
VerifyCdataHtmlEncoding(buffer, element);
}
buffer.Replace(" xmlns:a10=\"http://www.w3.org/2005/Atom\"",
" xmlns:atom=\"http://www.w3.org/2005/Atom\"");
buffer.Replace("a10:", "atom:");
}
private static void VerifyCdataHtmlEncoding(StringBuilder buffer,
XElement element)
{
if (!element.Value.Contains("<") || !element.Value.Contains(">"))
{
return;
}
var cdataValue = string.Format("<{0}><![CDATA[{1}]]></{2}>",
element.Name,
element.Value,
element.Name);
buffer.Replace(element.ToString(), cdataValue);
}
The idea for this workaround came from the following location, I just adapted it to work with WCF instead of MVC. http://localhost:8732/Design_Time_Addresses/SyndicationServiceLibrary1/Feed1/
I'm just wondering if this is simply a bug in Rss20FeedFormatter or is it by design? Also, if anyone has a better solution, I'd love to hear it!
Well #Page Brooks, I see this more as a solution then as a question :). Thanks!!! And to answer your question ( ;) ), yes, I definitely think this is a bug in the Rss20FeedFormatter (though I did not chase it as far), because had encountered precisely the same issue that you described.
You have a 'localhost:8732' referral in your post, but it wasn't available on my localhost ;). I think you meant to credit the 'PostProcessOutputBuffer' workaround to this post:
http://damieng.com/blog/2010/04/26/creating-rss-feeds-in-asp-net-mvc
Or actually it is not in this post, but in a comment to it by David Whitney, which he later put in a seperate gist here:
https://gist.github.com/davidwhitney/1027181
Thank you for providing the adaption of this workaround more to my needs, because I had found the workaround too, but was still struggling to do the adaptation from MVC. Now I only needed to tweak your solution to put the RSS feed to the current Http request in the .ashx handler that I was using it in.
Basically I'm guessing that the fix you mentioned using the CDataSyndicationContent, is from feb 2011, assuming you got it from this post (at least I did):
SyndicationFeed: Content as CDATA?
This fix stopped working in some newer ASP.NET version or something, due to the code of the Rss20FeedFormatter changing to what you put in your post. This code change might as well have been an improvement for other stuff that IS in the MVC framework, but for those using the CDataSyndicationContent fix it definitely causes a bug!
string stylesheet = #"<xsl:stylesheet version=""1.0"" xmlns:xsl=""http://www.w3.org/1999/XSL/Transform""><xsl:output cdata-section-elements=""description"" method=""xml"" indent=""yes""/></xsl:stylesheet>";
XmlReader reader = XmlReader.Create(new StringReader(stylesheet));
XslCompiledTransform t = new XslCompiledTransform(true);
t.Load(reader);
using (MemoryStream ms = new MemoryStream())
{
XmlWriter writer = XmlWriter.Create(ms, t.OutputSettings);
rssFeed.WriteTo(writer); // rssFeed is Rss20FeedFormatter
writer.Flush();
ms.Position = 0;
string niko = Encoding.UTF8.GetString(ms.ToArray());
}
I'm sure someone pointed this out already but this a stupid workaround I used.
t.OutputSettings is of type XmlWriterSettings with cdataSections being populated with a single XmlQualifiedName "description".
Hope it helps someone else.
I found the code for Cdata elsewhere
public class CDataSyndicationContent : TextSyndicationContent
{
public CDataSyndicationContent(TextSyndicationContent content)
: base(content)
{
}
protected override void WriteContentsTo(System.Xml.XmlWriter writer)
{
writer.WriteCData(Text);
}
}
Code to call it something along the lines:
item.Content = new Helpers.CDataSyndicationContent(new TextSyndicationContent("<span>TEST2</span>", TextSyndicationContentKind.Html));
However the "WriteContentsTo" function wasn't being called.
Instead of Rss20FeedFormatter I tried Atom10FeedFormatter - and it worked!
Obviously this gives Atom feed rather than traditional RSS - but worth mentioning.
Output code is:
//var formatter = new Rss20FeedFormatter(feed);
Atom10FeedFormatter formatter = new Atom10FeedFormatter(feed);
using (var writer = XmlWriter.Create(response.Output, new XmlWriterSettings { Indent = true }))
{
formatter.WriteTo(writer);
}

Serialize an Activity to xaml

I have Googled a bit, and cannot seem to find any examples of Xaml-fying Activities - good, bad, or otherwise!
public static string ToXaml (this Activity activity)
{
// i would use ActivityXamlServices to go from Xaml
// to activity, but how to go other way? documentation
// is slim, and cannot infer proper usage of
// ActivityXamlServices from Xml remarks :S
string xaml = string.Empty;
return xaml;
}
Hints, tips, pointers would be welcome :)
NOTE: so found this. Will work through and update once working. Anyone wanna beat me to the punch, by all means. Better yet, if you can find a way to be rid of WorkflowDesigner, seems odd it is required.
Alright, so worked through this forum posting.
You may Xaml-fy [ie transform an instance to declarative Xaml] a well-known Activity via
public static string ToXaml (this Activity activity)
{
StringBuilder xaml = new StringBuilder ();
using (XmlWriter xmlWriter = XmlWriter.Create (
xaml,
new XmlWriterSettings { Indent = true, OmitXmlDeclaration = true, }))
using (XamlWriter xamlWriter = new XamlXmlWriter (
xmlWriter,
new XamlSchemaContext ()))
using (XamlWriter xamlServicesWriter =
ActivityXamlServices.CreateBuilderWriter (xamlWriter))
{
ActivityBuilder activityBuilder = new ActivityBuilder
{
Implementation = activity
};
XamlServices.Save (xamlServicesWriter, activityBuilder);
}
return xaml.ToString ();
}
Your Xaml may contain certain artifacts, such as references to System.Activities.Presentation namespace appearing as xmlns:sap="...". If this presents an issue in your solution, read the source link above - there is a means to inject directives to ignore unrecognized namespaces.
Will leave this open for a while. If anyone can find a better solution, or improve upon this, please by all means :)
How about XamlServices.Save(filename, activity)?
Based on the other solution (for VS2010B2) and some Reflectoring, I found a solution for VS2010RC. Since XamlWriter is abstract in the RC, the new way to serialize an activity tree is this:
public static string ToXaml (this Activity activity)
{
var xamlBuilder = new StringBuilder();
var xmlWriter = XmlWriter.Create(xamlBuilder,
new XmlWriterSettings { Indent = true, OmitXmlDeclaration = true });
using (xmlWriter)
{
var xamlXmlWriter =
new XamlXmlWriter(xmlWriter, new XamlSchemaContext());
using (xamlXmlWriter)
{
XamlWriter xamlWriter =
ActivityXamlServices.CreateBuilderWriter(xamlXmlWriter);
using (xamlWriter)
{
var activityBuilder =
new ActivityBuilder { Implementation = sequence };
XamlServices.Save(xamlWriter, activityBuilder);
}
}
}
return xamlBuilder.ToString();
}