lucene wikipedia querying - lucene

I'm using lucene to query from wiki dump and get the categories out. So, I get the relevant documents and for every document, I call the below function.
static List<String> getCategories(Document document) throws IOException
{
List<String> categories = new ArrayList<String>();
String text = document.get("text");
WikipediaTokenizer tf = new WikipediaTokenizer(new StringReader(text));
CharTermAttribute termAtt = tf.addAttribute(CharTermAttribute.class);
TypeAttribute typeAtt = tf.addAttribute(TypeAttribute.class);
while (tf.incrementToken())
{
String tokText = termAtt.toString();
if (typeAtt.type().equals(WikipediaTokenizer.CATEGORY) == true)
{
categories.add(tokText);
}
}
return categories;
}
but it throws the following error at the while statement.
Exception in thread "main" java.lang.NullPointerException
at org.apache.lucene.analysis.wikipedia.WikipediaTokenizerImpl.zzRefill(WikipediaTokenizerImpl.java:574)
at org.apache.lucene.analysis.wikipedia.WikipediaTokenizerImpl.getNextToken(WikipediaTokenizerImpl.java:781)
at org.apache.lucene.analysis.wikipedia.WikipediaTokenizer.incrementToken(WikipediaTokenizer.java:200)
at SearchIndex.getCategories(SearchIndex.java:82)
at SearchIndex.main(SearchIndex.java:54)
I looked at zzRefill() function but it I'm not able to understand it. Is this a known bug or something? I don't know what am I doing wrong. The lucene guys says that the whole wikipediaTokenizer section is in beta and maybe be subject to changes. I was hoping someone could help me.

I solved the problem by adding tf.reset() before calling the while loop

Related

Lucene's WordnetSynonymParser

I am trying to use Lucene's WordnetSynonymParser class to create a synonym filter, but I'm not sure which of the prolog files I'm meant to be passing into the parse() function.
The documentation says:
See http://wordnet.princeton.edu/man/prologdb.5WN.html for a
description of the format.
so I've downloaded the prolog files, but I'm not sure which ones I should be passing in, and how I go about it.
Could someone please point me in the right direction?
Thanks for your help
EDIT:
Thanks to femtoRgon for pointing me in the direction of wn_s.pl. I have now got the following code:
Analyzer tempanalyzer = new SimpleAnalyzer(Version.LUCENE_40);
WordnetSynonymParser synparser = new WordnetSynonymParser(true, true, tempanalyzer);
FileReader doctoread = new FileReader("wn_s.pl");
synparser.parse(doctoread);
SynonymMap synmap = synparser.build();
Analyzer analyzer = new Analyzer() {
#Override
protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
EnglishAnalyzer enganalyzer = new EnglishAnalyzer(Version.LUCENE_40);
CharArraySet engstopset = enganalyzer.getDefaultStopSet();
Tokenizer source = new StandardTokenizer(Version.LUCENE_40, reader);
TokenStream filter = new SynonymFilter(source, synmap, true);
filter = new StandardFilter(Version.LUCENE_40, filter);
filter = new LowerCaseFilter(Version.LUCENE_40, filter);
filter = new StopFilter(Version.LUCENE_40, filter, engstopset);
/*TokenStream filter = new StandardFilter(Version.LUCENE_40, source);
filter = new LowerCaseFilter(Version.LUCENE_40, filter);
filter = new StopFilter(Version.LUCENE_40, filter, engstopset);*/
return new TokenStreamComponents(source, filter);
}
};
which I then plan on passing into IndexWriterConfig, however I get the following compile error:
IndexFilesDB.java:133: cannot find symbol
symbol : method parse(java.io.FileReader)
location: class org.apache.lucene.analysis.synonym.WordnetSynonymParser
synparser.parse(doctoread);
I still don't fully understand WordnetSynonymParser, is it an error to do with the class or it just a simple error where the file is not being passes in correctly?
Thanks for your help.
wn_s.pl contains the synset pointers (that is, it defines groups of synonyms), which is what you need for a synonym filter, to my knowledge. I'd start with that.

Sorting an ArrayList of NotesDocuments using a CustomComparator

I'm trying to sort a Documents Collection using a java.util.ArrayList.
var myarraylist:java.util.ArrayList = new java.util.ArrayList()
var doc:NotesDocument = docs.getFirstDocument();
while (doc != null) {
myarraylist.add(doc)
doc = docs.getNextDocument(doc);
}
The reason I'm trying with ArrayList and not with TreeMaps or HashMaps is because the field I need for sorting is not unique; which is a limitation for those two objects (I can't create my own key).
The problem I'm facing is calling CustomComparator:
Here how I'm trying to sort my arraylist:
java.util.Collections.sort(myarraylist, new CustomComparator());
Here my class:
import java.util.Comparator;
import lotus.notes.NotesException;
public class CustomComparator implements Comparator<lotus.notes.Document>{
public int compare(lotus.notes.Document doc1, lotus.notes.Document doc2) {
try {
System.out.println("Here");
System.out.println(doc1.getItemValueString("Form"));
return doc1.getItemValueString("Ranking").compareTo(doc2.getItemValueString("Ranking"));
} catch (NotesException e) {
e.printStackTrace();
}
return 0;
}
}
Error:
Script interpreter error, line=44, col=23: Error calling method
'sort(java.util.ArrayList, com.myjavacode.CustomComparator)' on java
class 'java.util.Collections'
Any help will be appreciated.
I tried to run your SSJS code in a try-catch block, printing the error in exception in catch block and I got the following message - java.lang.ClassCastException: lotus.domino.local.Document incompatible with lotus.notes.Document
I think you have got incorrect fully qualified class names of Document and NotesException. They should be lotus.domino.Document and lotus.domino.NotesException respectively.
Here the SSJS from RepeatControl:
var docs:NotesDocumentCollection = database.search(query, null, 0);
var myarraylist:java.util.ArrayList = new java.util.ArrayList()
var doc:NotesDocument = docs.getFirstDocument();
while (doc != null) {
myarraylist.add(doc)
doc = docs.getNextDocument(doc);
}
java.util.Collections.sort(myarraylist, new com.mycode.CustomComparator());
return myarraylist;
Here my class:
package com.mycode;
import java.util.Comparator;
public class CustomComparator implements Comparator<lotus.domino.Document>{
public int compare(lotus.domino.Document doc1, lotus.domino.Document doc2) {
try {
// Numeric comparison
Double num1 = doc1.getItemValueDouble("Ranking");
Double num2 = doc2.getItemValueDouble("Ranking");
return num1.compareTo(num2);
// String comparison
// return doc1.getItemValueString("Description").compareTo(doc2.getItemValueString("Description"));
} catch (lotus.domino.NotesException e) {
e.printStackTrace();
}
return 0;
}
}
Not that this answer is necessarily the best practice for you, but the last time I tried to do the same thing, I realized I could instead grab the documents as a NotesViewEntryCollection, via SSJS:
var col:NotesViewEntryCollection = database.getView("myView").getAllEntriesByKey(mtgUnidVal)
instead of a NotesDocumentCollection. I just ran through each entry, grabbed the UNIDs for those that met my criteria, added to a java.util.ArrayList(), then sent onward to its destination. I was already sorting the documents for display elsewhere, using a categorized column by parent UNID, so this is probably what I should have done first; still on leading edge of the XPages/Notes learning curve, so every day brings something new.
Again, if your collection is not equatable to a piece of a Notes View, sorry, but for those with an available simple approach, KISS. I remind myself frequently.

Rss20FeedFormatter Ignores TextSyndicationContent type for SyndicationItem.Summary

While using the Rss20FeedFormatter class in a WCF project, I was trying to wrap the content of my description elements with a <![CDATA[ ]]> section. I found that no matter what I did, the HTML content of the description elements was always encoded and the CDATA section was never added. After peering into the source code of Rss20FeedFormatter, I found that when building the Summary node, it basically creates a new TextSyndicationContent instance which wipes out whatever settings were previously specified (I think).
My Code
public class CDataSyndicationContent : TextSyndicationContent
{
public CDataSyndicationContent(TextSyndicationContent content)
: base(content)
{
}
protected override void WriteContentsTo(System.Xml.XmlWriter writer)
{
writer.WriteCData(Text);
}
}
... (The following code should wrap the Summary with a CDATA section)
SyndicationItem item = new SyndicationItem();
item.Title = new TextSyndicationContent(name);
item.Summary = new CDataSyndicationContent(
new TextSyndicationContent(
"<div>This is a test</div>",
TextSyndicationContentKind.Html));
Rss20FeedFormatter Code
(AFAIK, the above code does not work because of this logic)
...
else if (reader.IsStartElement("description", ""))
result.Summary = new TextSyndicationContent(reader.ReadElementString());
...
As a workaround, I've resorted to using the RSS20FeedFormatter to build the RSS, and then patch the RSS manually. For example:
StringBuilder buffer = new StringBuilder();
XmlTextWriter writer = new XmlTextWriter(new StringWriter(buffer));
feedFormatter.WriteTo(writer ); // feedFormatter = RSS20FeedFormatter
PostProcessOutputBuffer(buffer);
WebOperationContext.Current.OutgoingResponse.ContentType =
"application/xml; charset=utf-8";
return new MemoryStream(Encoding.UTF8.GetBytes(buffer.ToString()));
...
public void PostProcessOutputBuffer(StringBuilder buffer)
{
var xmlDoc = XDocument.Parse(buffer.ToString());
foreach (var element in xmlDoc.Descendants("channel").First()
.Descendants("item")
.Descendants("description"))
{
VerifyCdataHtmlEncoding(buffer, element);
}
foreach (var element in xmlDoc.Descendants("channel").First()
.Descendants("description"))
{
VerifyCdataHtmlEncoding(buffer, element);
}
buffer.Replace(" xmlns:a10=\"http://www.w3.org/2005/Atom\"",
" xmlns:atom=\"http://www.w3.org/2005/Atom\"");
buffer.Replace("a10:", "atom:");
}
private static void VerifyCdataHtmlEncoding(StringBuilder buffer,
XElement element)
{
if (!element.Value.Contains("<") || !element.Value.Contains(">"))
{
return;
}
var cdataValue = string.Format("<{0}><![CDATA[{1}]]></{2}>",
element.Name,
element.Value,
element.Name);
buffer.Replace(element.ToString(), cdataValue);
}
The idea for this workaround came from the following location, I just adapted it to work with WCF instead of MVC. http://localhost:8732/Design_Time_Addresses/SyndicationServiceLibrary1/Feed1/
I'm just wondering if this is simply a bug in Rss20FeedFormatter or is it by design? Also, if anyone has a better solution, I'd love to hear it!
Well #Page Brooks, I see this more as a solution then as a question :). Thanks!!! And to answer your question ( ;) ), yes, I definitely think this is a bug in the Rss20FeedFormatter (though I did not chase it as far), because had encountered precisely the same issue that you described.
You have a 'localhost:8732' referral in your post, but it wasn't available on my localhost ;). I think you meant to credit the 'PostProcessOutputBuffer' workaround to this post:
http://damieng.com/blog/2010/04/26/creating-rss-feeds-in-asp-net-mvc
Or actually it is not in this post, but in a comment to it by David Whitney, which he later put in a seperate gist here:
https://gist.github.com/davidwhitney/1027181
Thank you for providing the adaption of this workaround more to my needs, because I had found the workaround too, but was still struggling to do the adaptation from MVC. Now I only needed to tweak your solution to put the RSS feed to the current Http request in the .ashx handler that I was using it in.
Basically I'm guessing that the fix you mentioned using the CDataSyndicationContent, is from feb 2011, assuming you got it from this post (at least I did):
SyndicationFeed: Content as CDATA?
This fix stopped working in some newer ASP.NET version or something, due to the code of the Rss20FeedFormatter changing to what you put in your post. This code change might as well have been an improvement for other stuff that IS in the MVC framework, but for those using the CDataSyndicationContent fix it definitely causes a bug!
string stylesheet = #"<xsl:stylesheet version=""1.0"" xmlns:xsl=""http://www.w3.org/1999/XSL/Transform""><xsl:output cdata-section-elements=""description"" method=""xml"" indent=""yes""/></xsl:stylesheet>";
XmlReader reader = XmlReader.Create(new StringReader(stylesheet));
XslCompiledTransform t = new XslCompiledTransform(true);
t.Load(reader);
using (MemoryStream ms = new MemoryStream())
{
XmlWriter writer = XmlWriter.Create(ms, t.OutputSettings);
rssFeed.WriteTo(writer); // rssFeed is Rss20FeedFormatter
writer.Flush();
ms.Position = 0;
string niko = Encoding.UTF8.GetString(ms.ToArray());
}
I'm sure someone pointed this out already but this a stupid workaround I used.
t.OutputSettings is of type XmlWriterSettings with cdataSections being populated with a single XmlQualifiedName "description".
Hope it helps someone else.
I found the code for Cdata elsewhere
public class CDataSyndicationContent : TextSyndicationContent
{
public CDataSyndicationContent(TextSyndicationContent content)
: base(content)
{
}
protected override void WriteContentsTo(System.Xml.XmlWriter writer)
{
writer.WriteCData(Text);
}
}
Code to call it something along the lines:
item.Content = new Helpers.CDataSyndicationContent(new TextSyndicationContent("<span>TEST2</span>", TextSyndicationContentKind.Html));
However the "WriteContentsTo" function wasn't being called.
Instead of Rss20FeedFormatter I tried Atom10FeedFormatter - and it worked!
Obviously this gives Atom feed rather than traditional RSS - but worth mentioning.
Output code is:
//var formatter = new Rss20FeedFormatter(feed);
Atom10FeedFormatter formatter = new Atom10FeedFormatter(feed);
using (var writer = XmlWriter.Create(response.Output, new XmlWriterSettings { Indent = true }))
{
formatter.WriteTo(writer);
}

Google Search API - Number of Results

Whenever you perform a Google search, it spits out this little snippet of info
"About 8,110,000 results (0.10 seconds)"
I'm using the number of results certain terms return to rank them against each other, so if I could get this integer - 8,110,000 - via the API it would be very helpful. Some Google API's have recently been deprecated, so if you could point me to the right one that isn't deprecated, it would be very helpful.
Any other workarounds would also be much appreciated. I've seen one or two old posts on similar topics, but none seemed to be resolved successfully.
Completed using Bing instead of Google and with the following code:
string baseURL = "http://api.search.live.net/xml.aspx?Appid=<MyAppID>&query=%22" + name + "%22&sources=web";
WebClient c = new WebClient();
c.DownloadStringAsync(new Uri(baseURL));
c.DownloadStringCompleted += new DownloadStringCompletedEventHandler(findTotalResults);
and this calls findTotalResults:
void findTotalResults(object sender, DownloadStringCompletedEventArgs e)
{
lock (this)
{
string s = e.Result;
XmlReader reader = XmlReader.Create(new MemoryStream(System.Text.UTF8Encoding.UTF8.GetBytes(s)));
while (reader.Read())
{
if (reader.NodeType == XmlNodeType.Element)
{
if (reader.Name.Equals("web:Total"))
{
gResults = reader.ReadInnerXml();
}
}
}
}
}

Serialize an Activity to xaml

I have Googled a bit, and cannot seem to find any examples of Xaml-fying Activities - good, bad, or otherwise!
public static string ToXaml (this Activity activity)
{
// i would use ActivityXamlServices to go from Xaml
// to activity, but how to go other way? documentation
// is slim, and cannot infer proper usage of
// ActivityXamlServices from Xml remarks :S
string xaml = string.Empty;
return xaml;
}
Hints, tips, pointers would be welcome :)
NOTE: so found this. Will work through and update once working. Anyone wanna beat me to the punch, by all means. Better yet, if you can find a way to be rid of WorkflowDesigner, seems odd it is required.
Alright, so worked through this forum posting.
You may Xaml-fy [ie transform an instance to declarative Xaml] a well-known Activity via
public static string ToXaml (this Activity activity)
{
StringBuilder xaml = new StringBuilder ();
using (XmlWriter xmlWriter = XmlWriter.Create (
xaml,
new XmlWriterSettings { Indent = true, OmitXmlDeclaration = true, }))
using (XamlWriter xamlWriter = new XamlXmlWriter (
xmlWriter,
new XamlSchemaContext ()))
using (XamlWriter xamlServicesWriter =
ActivityXamlServices.CreateBuilderWriter (xamlWriter))
{
ActivityBuilder activityBuilder = new ActivityBuilder
{
Implementation = activity
};
XamlServices.Save (xamlServicesWriter, activityBuilder);
}
return xaml.ToString ();
}
Your Xaml may contain certain artifacts, such as references to System.Activities.Presentation namespace appearing as xmlns:sap="...". If this presents an issue in your solution, read the source link above - there is a means to inject directives to ignore unrecognized namespaces.
Will leave this open for a while. If anyone can find a better solution, or improve upon this, please by all means :)
How about XamlServices.Save(filename, activity)?
Based on the other solution (for VS2010B2) and some Reflectoring, I found a solution for VS2010RC. Since XamlWriter is abstract in the RC, the new way to serialize an activity tree is this:
public static string ToXaml (this Activity activity)
{
var xamlBuilder = new StringBuilder();
var xmlWriter = XmlWriter.Create(xamlBuilder,
new XmlWriterSettings { Indent = true, OmitXmlDeclaration = true });
using (xmlWriter)
{
var xamlXmlWriter =
new XamlXmlWriter(xmlWriter, new XamlSchemaContext());
using (xamlXmlWriter)
{
XamlWriter xamlWriter =
ActivityXamlServices.CreateBuilderWriter(xamlXmlWriter);
using (xamlWriter)
{
var activityBuilder =
new ActivityBuilder { Implementation = sequence };
XamlServices.Save(xamlWriter, activityBuilder);
}
}
}
return xamlBuilder.ToString();
}