AnalyzerUtil error on Lucene

I'm learning to work with Lucene. I wrote a simple program to test Lucene analyzers:
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.util.Version;
import org.apache.lucene.wordnet.AnalyzerUtils;
import java.io.IOException;

public class AnalyzerDemo {
    private static final String[] examples = {
        "The quick brown fox jumped over the lazy dog",
        "XY&Z Corporation - xyz@example.com"
    };
    private static final Analyzer[] analyzers = new Analyzer[] {
        new WhitespaceAnalyzer(),
        new SimpleAnalyzer(),
        new StopAnalyzer(Version.LUCENE_30),
        new StandardAnalyzer(Version.LUCENE_30)
    };

    public static void main(String[] args) throws IOException {
        String[] strings = examples;
        if (args.length > 0) {
            strings = args;
        }
        for (String text : strings) {
            analyze(text);
        }
    }

    private static void analyze(String text) throws IOException {
        System.out.println("Analyzing \"" + text + "\"");
        for (Analyzer analyzer : analyzers) {
            String name = analyzer.getClass().getSimpleName();
            System.out.println(" " + name + ":");
            System.out.print(" ");
            AnalyzerUtils.displayTokens(analyzer, text);
            System.out.println("\n");
        }
    }
}
but I got the following error:
AnalyzerDemo.java:7: package org.apache.lucene.wordnet does not exist
import org.apache.lucene.wordnet.AnalyzerUtils;
^
AnalyzerDemo.java:35: cannot find symbol
symbol : variable AnalyzerUtils
location: class AnalyzerDemo
AnalyzerUtils.displayTokens(analyzer, text);
^
Note: AnalyzerDemo.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
2 errors
I think the wordnet library or the AnalyzerUtils class is not available. How can I install this part of Lucene? Do you have any idea why it is missing? I've installed Lucene 3.5.0.

The lucene-wordnet contrib module was removed in Lucene 3.4.0. AnalyzerUtils doesn't exist in 3.5.0 either, so you either have to get Lucene 3.3.0 or write your own for 3.5.0 based on this one.

In the case of WordNet, the wordnet contrib was removed from Lucene 3.4.0 and its functionality was merged into the analyzers contrib. See new-features point 4 at: http://apache.spinellicreations.com/lucene/java/3.4.0/changes-3.4.0/Contrib-Changes.html#3.4.0.new_features.
Javadocs for the same can be found at: https://lucene.apache.org/core/old_versioned_docs/versions/3_5_0/api/all/org/apache/lucene/analysis/synonym/SynonymFilter.html

Instead of AnalyzerUtils.displayTokens(analyzer, text), use the following function:
// Requires: java.io.StringReader, org.apache.lucene.analysis.TokenStream,
// org.apache.lucene.analysis.tokenattributes.CharTermAttribute
private static void displayTokens(Analyzer analyzer, String text) throws IOException {
    // The field name is unused by these analyzers, so null is fine here
    TokenStream stream = analyzer.tokenStream(null, new StringReader(text));
    CharTermAttribute cattr = stream.addAttribute(CharTermAttribute.class);
    stream.reset();
    while (stream.incrementToken()) {
        System.out.print(cattr.toString() + " ");
    }
    stream.end();
    stream.close();
}
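For a quick sanity check, a usage sketch (assuming Lucene 3.5 and the function above):
// Hypothetical call; StandardAnalyzer lowercases and drops default stop words,
// so this should print something like: quick brown fox
displayTokens(new StandardAnalyzer(Version.LUCENE_35), "The quick brown fox");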

Related

Cleaning up unused images in PDF page resources

Please forgive me if this has been asked, but I have not found any matches yet.
I have some PDF files where images are duplicated in each page's resources but never used in its content stream. I think this is causing the PDFSplit command to create very bloated pages. Is there any utility code or an example for cleaning up unused resources like this? Maybe a starting point to get me going?
I was able to clean up the resources for each page by gathering a list of the images used inside the page's content stream. With the list of images, I then check the resources for the page and remove any that weren't used. See the PageExtractor.stripUnusedImages below for implementation details.
The resource object was shared between pages so I also had to make sure each page had its own copy of the resource object before removing images. See PageExtractor.copyResources below for implementation details.
The page splitter:
package org.apache.pdfbox.examples;

import org.apache.pdfbox.contentstream.operator.Operator;
import org.apache.pdfbox.cos.COSBase;
import org.apache.pdfbox.cos.COSDictionary;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDResources;
import org.apache.pdfbox.pdmodel.interactive.action.PDAction;
import org.apache.pdfbox.pdmodel.interactive.action.PDActionGoTo;
import org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotation;
import org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotationLink;
import org.apache.pdfbox.pdmodel.interactive.documentnavigation.destination.PDDestination;
import org.apache.pdfbox.pdmodel.interactive.documentnavigation.destination.PDPageDestination;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import java.io.IOException;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class PageExtractor {
    private final Logger log = LoggerFactory.getLogger(this.getClass());

    public PDDocument extractPage(PDDocument source, Integer pageNumber) throws IOException {
        PDDocument targetPdf = new PDDocument();
        targetPdf.getDocument().setVersion(source.getVersion());
        targetPdf.setDocumentInformation(source.getDocumentInformation());
        targetPdf.getDocumentCatalog().setViewerPreferences(source.getDocumentCatalog().getViewerPreferences());
        PDPage sourcePage = source.getPage(pageNumber);
        PDPage targetPage = targetPdf.importPage(sourcePage);
        targetPage.setResources(sourcePage.getResources());
        stripUnusedImages(targetPage);
        stripPageLinks(targetPage);
        return targetPdf;
    }

    /**
     * Collect the images used from a custom PDFStreamEngine (BI and DO operators).
     * Create an empty COSDictionary.
     * Loop through the page's XObjects that are images and add them to the new COSDictionary
     * if they were found by the PDFStreamEngine.
     * Assign the newly filled COSDictionary to the page's resources as COSName.XOBJECT.
     */
    protected void stripUnusedImages(PDPage page) throws IOException {
        PDResources resources = copyResources(page);
        COSDictionary pageObjects = (COSDictionary) resources.getCOSObject().getDictionaryObject(COSName.XOBJECT);
        COSDictionary newObjects = new COSDictionary();
        Set<String> imageNames = findImageNames(page);
        Iterable<COSName> xObjectNames = resources.getXObjectNames();
        for (COSName xObjectName : xObjectNames) {
            if (resources.isImageXObject(xObjectName)) {
                Boolean used = imageNames.contains(xObjectName.getName());
                if (used) {
                    newObjects.setItem(xObjectName, pageObjects.getItem(xObjectName));
                } else {
                    log.info("Found unused image: name={}", xObjectName.getName());
                }
            } else {
                newObjects.setItem(xObjectName, pageObjects.getItem(xObjectName));
            }
        }
        resources.getCOSObject().setItem(COSName.XOBJECT, newObjects);
        page.setResources(resources);
    }

    /**
     * It is necessary to copy the page's resources since they can be shared with other pages.
     * We must ensure changes to the resources are scoped to the current page.
     */
    protected PDResources copyResources(PDPage page) {
        return new PDResources(new COSDictionary(page.getResources().getCOSObject()));
    }

    protected Set<String> findImageNames(PDPage page) throws IOException {
        Set<String> imageNames = new HashSet<>();
        PdfImageStreamEngine engine = new PdfImageStreamEngine() {
            @Override
            void handleImage(Operator operator, List<COSBase> operands) {
                COSName name = (COSName) operands.get(0);
                imageNames.add(name.getName());
            }
        };
        engine.processPage(page);
        return imageNames;
    }

    /**
     * Borrowed from the PDFBox page splitter.
     *
     * @see org.apache.pdfbox.multipdf.Splitter#processAnnotations(org.apache.pdfbox.pdmodel.PDPage)
     */
    protected void stripPageLinks(PDPage imported) throws IOException {
        List<PDAnnotation> annotations = imported.getAnnotations();
        for (PDAnnotation annotation : annotations) {
            if (annotation instanceof PDAnnotationLink) {
                PDAnnotationLink link = (PDAnnotationLink) annotation;
                PDDestination destination = link.getDestination();
                if (destination == null && link.getAction() != null) {
                    PDAction action = link.getAction();
                    if (action instanceof PDActionGoTo) {
                        destination = ((PDActionGoTo) action).getDestination();
                    }
                }
                if (destination instanceof PDPageDestination) {
                    // TODO preserve links to pages within the split result
                    ((PDPageDestination) destination).setPage(null);
                }
            }
            // TODO preserve links to pages within the split result
            annotation.setPage(null);
        }
    }
}
The stream reader used to analyze the page's images:
package org.apache.pdfbox.examples;

import org.apache.pdfbox.contentstream.PDFStreamEngine;
import org.apache.pdfbox.contentstream.operator.Operator;
import org.apache.pdfbox.contentstream.operator.OperatorProcessor;
import org.apache.pdfbox.cos.COSBase;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdmodel.graphics.PDXObject;
import org.apache.pdfbox.pdmodel.graphics.form.PDFormXObject;
import org.apache.pdfbox.pdmodel.graphics.form.PDTransparencyGroup;
import java.io.IOException;
import java.util.List;

public abstract class PdfImageStreamEngine extends PDFStreamEngine {
    PdfImageStreamEngine() {
        addOperator(new DrawObjectCounter());
    }

    abstract void handleImage(Operator operator, List<COSBase> operands);

    protected class DrawObjectCounter extends OperatorProcessor {
        @Override
        public void process(Operator operator, List<COSBase> operands) throws IOException {
            if (operands != null && isImage(operands.get(0))) {
                handleImage(operator, operands);
            }
        }

        protected Boolean isImage(COSBase base) throws IOException {
            if (!(base instanceof COSName)) {
                return false;
            }
            COSName name = (COSName) base;
            if (context.getResources().isImageXObject(name)) {
                return true;
            }
            // Recurse into forms and transparency groups so nested images are seen too
            PDXObject xObject = context.getResources().getXObject(name);
            if (xObject instanceof PDTransparencyGroup) {
                context.showTransparencyGroup((PDTransparencyGroup) xObject);
            } else if (xObject instanceof PDFormXObject) {
                context.showForm((PDFormXObject) xObject);
            }
            return false;
        }

        @Override
        public String getName() {
            return "Do";
        }
    }
}
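For completeness, a minimal driver sketch (hypothetical file names, assuming PDFBox 2.0.x) that writes each page of a bloated PDF to its own trimmed document:
import java.io.File;
import org.apache.pdfbox.pdmodel.PDDocument;

public class SplitDemo {
    public static void main(String[] args) throws Exception {
        try (PDDocument source = PDDocument.load(new File("input.pdf"))) {
            PageExtractor extractor = new PageExtractor();
            for (int i = 0; i < source.getNumberOfPages(); i++) {
                // extractPage copies one page and strips its unused image resources
                try (PDDocument target = extractor.extractPage(source, i)) {
                    target.save("page-" + (i + 1) + ".pdf");
                }
            }
        }
    }
}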

pdfbox cannot find symbol

I'm using code suggested a while ago by @ASu:
package pdf_form_filler;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDDocumentCatalog;
import org.apache.pdfbox.pdmodel.interactive.form.*;
import java.io.File;
import java.util.*;

public class pdf_form_filler {
    public static void listFields(PDDocument doc) throws Exception {
        PDDocumentCatalog catalog = doc.getDocumentCatalog();
        PDAcroForm form = catalog.getAcroForm();
        List<PDFieldTreeNode> fields = form.getFields();
        for (PDFieldTreeNode field : fields) {
            Object value = field.getValue();
            String name = field.getFullyQualifiedName();
            System.out.print(name);
            System.out.print(" = ");
            System.out.print(value);
            System.out.println();
        }
    }

    public static void main(String[] args) throws Exception {
        File file = new File("test.pdf");
        PDDocument doc = PDDocument.load(file);
        listFields(doc);
    }
}
However, I keep getting a cannot find symbol error for PDFieldTreeNode. I have the latest PDFBox (2.0.4), but I cannot find the class in it anywhere. I tried using PDField instead, but then I get an error for .getValue.
To get all terminal fields, do this:
PDAcroForm form = catalog.getAcroForm();
Iterator<PDField> fieldIterator = form.getFieldIterator();
while (fieldIterator.hasNext())
{
    PDField pdField = fieldIterator.next();
    // do stuff with the field
}
form.getFields() only gets the top-level fields, which includes nonterminal fields (i.e. fields that have no value, only children).
A more advanced example can be found in the PrintFields.java example.
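Putting it together for the original listing, a sketch (assuming PDFBox 2.0.x, where PDField exposes getValueAsString() rather than the old getValue()):
import java.io.File;
import java.util.Iterator;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm;
import org.apache.pdfbox.pdmodel.interactive.form.PDField;

public class ListFields {
    public static void main(String[] args) throws Exception {
        try (PDDocument doc = PDDocument.load(new File("test.pdf"))) {
            PDAcroForm form = doc.getDocumentCatalog().getAcroForm();
            if (form != null) {
                Iterator<PDField> fields = form.getFieldIterator();
                while (fields.hasNext()) {
                    PDField field = fields.next();
                    // getValueAsString() is the 2.x replacement for the old getValue()
                    System.out.println(field.getFullyQualifiedName() + " = " + field.getValueAsString());
                }
            }
        }
    }
}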

Arquillian Graphene #Location placeholder

I'm learning Arquillian right now I wonder how to create page that has a placeholder inside the path. For example:
#Location("/posts/{id}")
public class BlogPostPage {
public String getContent() {
// ...
}
}
or
#Location("/posts/{name}")
#Location("/specific-page?requiredParam={value}")
I have looked for an answer in the Graphene and Arquillian reference guides without success. In another language I have used a page-object library that had built-in support for placeholders.
AFAIK there is nothing like this implemented in Graphene.
To be honest, I'm not sure how this should behave - how would you pass the values...?
Apart from that, I think it could also be limited by Java annotation abilities: https://stackoverflow.com/a/10636320/6835063
This is not possible currently in Graphene. I've created ARQGRA-500.
It's possible to extend Graphene to add dynamic parameters now. Here's how. (Arquillian 1.1.10.Final, Graphene 2.1.0.Final.)
Create an interface.
import java.util.Map;

public interface LocationParameterProvider {
    Map<String, String> provideLocationParameters();
}
Create a custom LocationDecider to replace the corresponding one in Graphene; here I replace the HTTP one. This decider will add location parameters to the URI if it sees that the test object implements our interface.
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.Map;
import java.util.Map.Entry;
import org.jboss.arquillian.core.api.Instance;
import org.jboss.arquillian.core.api.annotation.Inject;
import org.jboss.arquillian.graphene.location.decider.HTTPLocationDecider;
import org.jboss.arquillian.graphene.spi.location.Scheme;
import org.jboss.arquillian.test.spi.context.TestContext;

public class HTTPParameterizedLocationDecider extends HTTPLocationDecider {
    @Inject
    private Instance<TestContext> testContext;

    @Override
    public Scheme canDecide() {
        return new Scheme.HTTP();
    }

    @Override
    public String decide(String location) {
        String uri = super.decide(location);
        // not sure how reliable this method of getting the current test object is;
        // if it breaks, there is always the possibility of observing
        // org.jboss.arquillian.test.spi.event.suite.TestLifecycleEvent (or rather its
        // descendants) and storing the test object in a ThreadLocal
        Object testObject = testContext.get().getActiveId();
        if (testObject instanceof LocationParameterProvider) {
            Map<String, String> locationParameters =
                ((LocationParameterProvider) testObject).provideLocationParameters();
            StringBuilder uriParams = new StringBuilder(64);
            boolean first = true;
            for (Entry<String, String> param : locationParameters.entrySet()) {
                uriParams.append(first ? '?' : '&');
                first = false;
                try {
                    uriParams.append(URLEncoder.encode(param.getKey(), "UTF-8"));
                    uriParams.append('=');
                    uriParams.append(URLEncoder.encode(param.getValue(), "UTF-8"));
                } catch (UnsupportedEncodingException e) {
                    throw new RuntimeException(e);
                }
            }
            uri += uriParams.toString();
        }
        return uri;
    }
}
Our LocationDecider must be registered so that it overrides Graphene's.
import org.jboss.arquillian.core.spi.LoadableExtension;
import org.jboss.arquillian.graphene.location.decider.HTTPLocationDecider;
import org.jboss.arquillian.graphene.spi.location.LocationDecider;

public class MyArquillianExtension implements LoadableExtension {
    @Override
    public void register(ExtensionBuilder builder) {
        builder.override(LocationDecider.class, HTTPLocationDecider.class,
                HTTPParameterizedLocationDecider.class);
    }
}
MyArquillianExtension should be registered via SPI, so create the necessary file in your test resources; for me the file path is src/test/resources/META-INF/services/org.jboss.arquillian.core.spi.LoadableExtension. The file must contain the fully qualified class name of MyArquillianExtension.
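For illustration, if MyArquillianExtension lived in a hypothetical package com.example, the whole file would consist of this single line:
com.example.MyArquillianExtension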
And that's it. Now you can provide location parameters in a test.
import java.util.HashMap;
import java.util.Map;
import org.jboss.arquillian.graphene.page.InitialPage;
import org.jboss.arquillian.graphene.page.Location;
import org.junit.Test;

public class TestyTest implements LocationParameterProvider {
    @Override
    public Map<String, String> provideLocationParameters() {
        Map<String, String> params = new HashMap<>();
        params.put("mykey", "myvalue");
        return params;
    }

    @Test
    public void test(@InitialPage TestPage page) {
    }

    @Location("MyTestView.xhtml")
    public static class TestPage {
    }
}
I've focused on parameters specifically, but hopefully this paves the way for other dynamic path manipulations.
Of course this doesn't fix the Graphene.goTo API, which means that before using goTo you have to provide parameters via this roundabout provideLocationParameters way. It's awkward. You can build your own alternative goTo API that accepts parameters, and modify your LocationDecider to support other ParameterProviders.

StandardAnalyzer with stemming

Is there a way to integrate PorterStemFilter into StandardAnalyzer in Lucene, or do I have to copy/paste StandardAnalyzer's source code and add the filter, since StandardAnalyzer is declared final? Is there a smarter way?
Also, if I would like to ignore numbers, how can I achieve that?
Thanks
If you want to use this combination for English text analysis, then you should use Lucene's EnglishAnalyzer, which already includes Porter stemming. Otherwise, you can create a new Analyzer that extends AnalyzerWrapper, as shown below.
import java.io.IOException;
import java.io.StringReader;
import java.util.HashSet;
import java.util.Set;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.AnalyzerWrapper;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.TypeTokenFilter;
import org.apache.lucene.analysis.en.PorterStemFilter;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class PorterAnalyzer extends AnalyzerWrapper {
    private Analyzer baseAnalyzer;

    public PorterAnalyzer(Analyzer baseAnalyzer) {
        this.baseAnalyzer = baseAnalyzer;
    }

    @Override
    public void close() {
        baseAnalyzer.close();
        super.close();
    }

    @Override
    protected Analyzer getWrappedAnalyzer(String fieldName) {
        return baseAnalyzer;
    }

    @Override
    protected TokenStreamComponents wrapComponents(String fieldName, TokenStreamComponents components) {
        TokenStream ts = components.getTokenStream();
        // Drop numeric tokens (StandardTokenizer types them as "<NUM>"), then stem
        Set<String> filteredTypes = new HashSet<>();
        filteredTypes.add("<NUM>");
        TypeTokenFilter numberFilter = new TypeTokenFilter(Version.LUCENE_46, ts, filteredTypes);
        PorterStemFilter porterStem = new PorterStemFilter(numberFilter);
        return new TokenStreamComponents(components.getTokenizer(), porterStem);
    }

    public static void main(String[] args) throws IOException {
        PorterAnalyzer analyzer = new PorterAnalyzer(new StandardAnalyzer(Version.LUCENE_46));
        String text = "This is a testing example. It should tests the Porter stemmer version 111";
        TokenStream ts = analyzer.tokenStream("fieldName", new StringReader(text));
        CharTermAttribute ca = ts.addAttribute(CharTermAttribute.class);
        ts.reset();
        while (ts.incrementToken()) {
            System.out.println(ca.toString());
        }
        ts.end();
        ts.close();
        analyzer.close();
    }
}
The code above is based on a Lucene forum thread. The main work is done by the wrapComponents method: you first get the TokenStream object from the wrapped analyzer, then apply a type filter to ignore numeric tokens, and finally apply the Porter stem filter. I hope it is clear.
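If plain English analysis is all you need, the EnglishAnalyzer route is a one-liner (a sketch, assuming Lucene 4.6):
// org.apache.lucene.analysis.en.EnglishAnalyzer already chains standard
// tokenization, stop-word removal and Porter stemming, so no wrapper is needed.
Analyzer analyzer = new EnglishAnalyzer(Version.LUCENE_46);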

Cucumber: Unable to find step definition

I have the following feature file: MacroValidation.feature
@macroFilter
Feature: Separating out errors and warnings

  Scenario: No errors or warnings when separating out error list
    Given I have 0 macros
    When I filter out errors and warnings for Macros
    Then I need to have 0 errors
    And I need to have 0 warnings
My step definition file:
package com.test.definition;

import cucumber.api.java.After;
import cucumber.api.java.Before;
import cucumber.api.java.en.Given;
import cucumber.api.java.en.Then;
import cucumber.api.java.en.When;
import cucumber.runtime.java.StepDefAnnotation;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import static org.junit.Assert.assertEquals;
import static org.junit.Assert.assertTrue;
import static org.powermock.api.mockito.PowerMockito.doReturn;
import static org.powermock.api.mockito.PowerMockito.mock;
import static org.powermock.api.mockito.PowerMockito.spy;

@StepDefAnnotation
public class MacroValidationStepDefinitions {
    private final MacroService macroService = spy(new MacroService());
    private final LLRBusList busList = mock(LLRBusList.class);
    private final List<String> errorList = new ArrayList<String>();
    private final List<String> warningList = new ArrayList<String>();

    @Before({"@macroFilter"})
    public void setUp() {
        errorList.addAll(Arrays.asList("error 1, error2, error 3"));
        warningList.addAll(Arrays.asList("warning 1, warning 2, warning 3"));
    }

    @After({"@macroFilter"})
    public void tearDown() {
        errorList.clear();
        warningList.clear();
    }

    @Given("^I have (\\d+) macros$")
    public void i_have_macros(int input) {
        doReturn(input).when(busList).size();
    }

    @When("^I filtered out errors and warnings for Macros$")
    public void i_filtered_out_errors_and_warnings_for_Macros() {
        macroService.separateErrorsAndWarning(busList, errorList, warningList);
    }

    @Then("^I need to have (\\d+) errors$")
    public void i_need_to_have_errors(int numOfError) {
        if (numOfError == 0) {
            assertTrue(errorList.isEmpty());
        } else {
            assertEquals(errorList.size(), numOfError);
        }
    }

    @Then("^I need to have (\\d+) warnings$")
    public void i_need_to_have_warnings(int numOfWarnings) {
        if (numOfWarnings == 0) {
            assertTrue(warningList.isEmpty());
        } else {
            assertEquals(warningList.size(), numOfWarnings);
        }
    }
}
My unit test class.
@CucumberOptions(features = {"classpath:testfiles/MacroValidation.feature"},
        glue = {"com.macro.definition"},
        dryRun = false,
        monochrome = true,
        tags = "@macroFilter"
)
@RunWith(Cucumber.class)
public class PageMacroValidationTest {
}
When I execute the test, I get warnings in the log that the step definitions are not implemented.
Example log:
You can implement missing steps with the snippets below:

@Given("^I have (\\d+) macros$")
public void i_have_macros(int arg1) throws Throwable {
    // Write code here that turns the phrase above into concrete actions
    throw new PendingException();
}

@When("^I filter out errors and warnings for Macros$")
public void i_filter_out_errors_and_warnings_for_Macros() throws Throwable {
    // Write code here that turns the phrase above into concrete actions
    throw new PendingException();
}

@Then("^I need to have (\\d+) errors$")
public void i_need_to_have_errors(int arg1) throws Throwable {
    // Write code here that turns the phrase above into concrete actions
    throw new PendingException();
}

@Then("^I need to have (\\d+) warnings$")
public void i_need_to_have_warnings(int arg1) throws Throwable {
    // Write code here that turns the phrase above into concrete actions
    throw new PendingException();
}
I don't think the file name should matter, right?
It looks like Cucumber isn't finding your step definition class. In your unit test class you say:
glue = {"com.macro.definition"}
However, the step definition classes are in com.test.definition. Try changing that line to:
glue = {"com.test.definition"}
You may have to rebuild your project to pick up the change. Note also that your @When annotation says "I filtered out errors and warnings for Macros" while the feature file says "I filter out errors and warnings for Macros", so that step will remain undefined until the wording matches.
Also, Cucumber is sensitive to whitespace. If you tidy up your runner or feature file after having captured the snippets, you can run into this problem.
Here's an example that drove me nuts for several hours while creating my first BDD. I had created the feature file and a skeleton runner, which I ran to capture the snippets. Then I prettified the feature file, and when I ran the runner I got the errors.
Of course everything looked fine to my human brain, so the next few hours were spent in fruitless research here, checking versions and bug lists. Finally I decided to compare the first two lines of the snippet against my step definition to see what was different:
// #Then("^the test result is = \"([^\"]*)\"$")
// public void theTestResultIs(String ruleResult) throws Throwable {
#Then("^the test result is = \"([^\"]*)\"$")
public void theTestResultIs(String arg1) throws Throwable {
Doh!
Try using dependencies in your POM.xml with the io.cucumber group id, which has the latest jars, instead of info.cukes. After swapping the jars, update the project with the new imports and run it again. Remember, you have to replace all dependencies of info.cukes with io.cucumber.
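For example, a sketch of the Maven side (version numbers are illustrative; pick the latest io.cucumber releases):
<!-- replaces the old info.cukes:cucumber-java and info.cukes:cucumber-junit -->
<dependency>
    <groupId>io.cucumber</groupId>
    <artifactId>cucumber-java</artifactId>
    <version>4.2.0</version>
    <scope>test</scope>
</dependency>
<dependency>
    <groupId>io.cucumber</groupId>
    <artifactId>cucumber-junit</artifactId>
    <version>4.2.0</version>
    <scope>test</scope>
</dependency>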