Adding new facet to DSpace has no effect (DSpace 4.1)

I changed the discovery.xml file as described in the documentation to add a new facet over dc.type to our DSpace. After reindexing and clearing the cache, I see the new search filter in the advanced search, but not as a facet.
These are the changes I made to discovery.xml:
I added the filter to sidebarFacets and searchFilters:
<ref bean="searchFilterType" />
and this is the filter:
<bean id="searchFilterType" class="org.dspace.discovery.configuration.DiscoverySearchFilterFacet">
<property name="indexFieldName" value="type"/>
<property name="metadataFields">
<list>
<value>dc.type</value>
</list>
</property>
</bean>
Thanks in advance

The following modifications to discovery.xml on the latest DSpace master branch worked on my local setup:
https://github.com/bram-atmire/DSpace/commit/3f084569cf1bbc6c6684d114a09a1617c8d3de5d
One reason why the facet wouldn't appear in your setup could be that you did not add it to both the "defaultConfiguration" and the specific configuration for the DSpace homepage.
After building and deploying, a forced discovery re-index using the following command made the facet appear:
./dspace index-discovery -f
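As a rough sketch of what "both configurations" means: the facet reference has to appear in the sidebarFacets list of each DiscoveryConfiguration bean that should show it. The bean ids defaultConfiguration and homepageConfiguration below are assumed from the stock discovery.xml; check the ids in your own file, and keep your existing entries where the comments are.

```xml
<!-- Sketch only: bean ids assumed from the stock discovery.xml -->
<bean id="defaultConfiguration"
      class="org.dspace.discovery.configuration.DiscoveryConfiguration" scope="prototype">
    <property name="sidebarFacets">
        <list>
            <!-- ...existing facet refs stay here... -->
            <ref bean="searchFilterType" />
        </list>
    </property>
    <property name="searchFilters">
        <list>
            <!-- ...existing filter refs stay here... -->
            <ref bean="searchFilterType" />
        </list>
    </property>
    <!-- ...remaining properties unchanged... -->
</bean>

<bean id="homepageConfiguration"
      class="org.dspace.discovery.configuration.DiscoveryConfiguration" scope="prototype">
    <property name="sidebarFacets">
        <list>
            <!-- ...existing facet refs stay here... -->
            <ref bean="searchFilterType" />
        </list>
    </property>
    <!-- ...remaining properties unchanged... -->
</bean>
```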

Here is an example facet that I have configured in our instance. Try setting the facetLimit, sortOrder, and splitter. Re-index and see if that resolves the issue.
<bean id="searchFilterGeographic"
class="org.dspace.discovery.configuration.HierarchicalSidebarFacetConfiguration">
<property name="indexFieldName" value="geographic-region"/>
<property name="metadataFields">
<list>
<value>dc.coverage.spatial</value>
</list>
</property>
<property name="facetLimit" value="5"/>
<property name="sortOrder" value="COUNT"/>
<property name="splitter" value="::"/>
</bean>

Related

Ignite expiry policy is not working for old data

I have 400M records in an Ignite cache, and native persistence is enabled. I want to enable an expiry policy. To do so, I added the following to my XML config:
<!-- Enabling expiry policy -->
<property name="cacheConfiguration">
<list>
<bean class="org.apache.ignite.configuration.CacheConfiguration">
<property name="name" value="CACHE_L4_TRIGGER_NOTIFICATION"/>
<property name="expiryPolicyFactory">
<bean class="javax.cache.expiry.CreatedExpiryPolicy" factory-method="factoryOf">
<constructor-arg>
<bean class="javax.cache.expiry.Duration">
<constructor-arg value="MINUTES"/>
<constructor-arg value="60"/>
</bean>
</constructor-arg>
</bean>
</property>
</bean>
</list>
</property>
It works for newly added data, but I have 400M old records. I need help removing the data older than 30 days from these 400M records. How can I do this? I have searched but can't find anything. I also can't purge all the data, because it is important.
You can't do this for existing data. Ignite doesn't keep track of when an entry was created or modified in any way if expiry policy is not set. You have to iterate over all your data and clean it manually based on the contents (e.g. if you have a creation timestamp attribute).

Ignite: Configuring persistence to a custom directory

I want to provide a custom directory to persist the data. My persistence configuration is:
<property name="dataStorageConfiguration">
<bean class="org.apache.ignite.configuration.DataStorageConfiguration">
<property name="defaultDataRegionConfiguration">
<bean class="org.apache.ignite.configuration.DataRegionConfiguration">
<property name="persistenceEnabled" value="true"/>
</bean>
</property>
</bean>
</property>
As mentioned in the documentation, by default it persists under the ${IGNITE_HOME}/work/db directory on each node. I can change the directory by calling the setStoragePath() method. But how do I configure it through XML?
I have searched but couldn't find it in the documentation. Please help me find the right XML key for this setting.
Thanks!!
The correct one is the storagePath property of DataStorageConfiguration:
<property name="storagePath" value="$ENV_VAR/relative/path"/>
Javadoc link: https://ignite.apache.org/releases/latest/javadoc/org/apache/ignite/configuration/DataStorageConfiguration.html#getStoragePath--
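For context, here is a sketch of where that property sits in the configuration from the question. The path /data/ignite/storage is only an assumed example; use your own directory.

```xml
<property name="dataStorageConfiguration">
    <bean class="org.apache.ignite.configuration.DataStorageConfiguration">
        <!-- Assumed example path: set on DataStorageConfiguration itself,
             not on the data region -->
        <property name="storagePath" value="/data/ignite/storage"/>
        <property name="defaultDataRegionConfiguration">
            <bean class="org.apache.ignite.configuration.DataRegionConfiguration">
                <property name="persistenceEnabled" value="true"/>
            </bean>
        </property>
    </bean>
</property>
```

DataStorageConfiguration also has walPath and walArchivePath properties if you want the write-ahead log moved as well.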

Jackrabbit Indexing Config Whitelisting (Magnolia CMS 5.5.5 Fulltextsearch)

I want to whitelist which properties are indexed, searched, and shown in the excerpt of a Magnolia search.
I am changing the indexing_configuration.xml in my website workspace.
Removing the index and restarting Magnolia did not change anything...
So far I have this in my indexing_configuration.xml (next to other stuff).
These are the String properties I want to include in my excerpt; the rest should be excluded:
<index-rule nodeType="nt:hierarchyNode">
<property boost="10" useInExcerpt="true">introTitle</property>
<property boost="1.0" useInExcerpt="true">introAbstract</property>
<property boost="1.0" useInExcerpt="true">contentText</property>
<property boost="1.0" useInExcerpt="true">subText</property>
<property boost="10" useInExcerpt="true">title</property>
<!-- exclude jcr:* and mgnl:* properties -->
<property isRegexp="true" nodeScopeIndex="false" useInExcerpt="false">.*:.*</property>
</index-rule>
<index-rule nodeType="mgnl:contentNode">
<property boost="5" nodeScopeIndex="false" useInExcerpt="true">introTitle</property>
<property boost="2" nodeScopeIndex="false" useInExcerpt="true">introAbstract</property>
<property boost="2" nodeScopeIndex="false" useInExcerpt="true">contentText</property>
<property boost="2" nodeScopeIndex="false" useInExcerpt="true">subText</property>
<property boost="5" nodeScopeIndex="false" useInExcerpt="true">title</property>
<!-- exclude jcr:* and mgnl:* properties -->
<property isRegexp="true" nodeScopeIndex="false" useInExcerpt="false">.*:.*</property>
</index-rule>
How can I get this to work as intended? Thanks for your help.
The most likely cause is that Magnolia/Jackrabbit is not seeing your new configuration. Did you change your repo configuration (workspace.xml in the website workspace) to point it to the new indexing configuration?
Default looks like:
<SearchIndex class="org.apache.jackrabbit.core.query.lucene.SearchIndex">
<param name="path" value="${wsp.home}/index" />
<!-- SearchIndex will get the indexing configuration from the classpath, if not found in the workspace home -->
<param name="indexingConfiguration" value="/info/magnolia/jackrabbit/indexing_configuration.xml"/>
and you need to point it to your new file.
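As a sketch, assuming you copy your custom file into the workspace home directory (the location is my assumption, any path Jackrabbit can read should do), the entry in workspace.xml could look like this:

```xml
<SearchIndex class="org.apache.jackrabbit.core.query.lucene.SearchIndex">
  <param name="path" value="${wsp.home}/index" />
  <!-- Assumed location: custom indexing_configuration.xml placed in the workspace home -->
  <param name="indexingConfiguration" value="${wsp.home}/indexing_configuration.xml"/>
  <!-- ...other params unchanged... -->
</SearchIndex>
```

After changing it, remove the existing index folder again and restart so the workspace is re-indexed with the new rules.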
Also, I'm not sure why you are setting up indexing based on nt:hierarchyNode or mgnl:contentNode rather than using the more specific mgnl:page/mgnl:component.

Running transformation script in Alfresco

Why isn't my transformation script running on any uploaded files beyond the first file?
I set up a transformation rule in Alfresco that listens to a folder. When a new file is placed into the folder, the rule triggers a script that takes a PDF without a text layer, breaks it into JPEGs, OCRs the JPEGs, converts the JPEGs into PDFs, and merges those PDFs into a single OCRed PDF with a text layer. The result is then copied into another folder so we know it got done.
Running the script at command line works. The first time I drop a file into the Alfresco folder (upload) it runs the script and copies the file. But any subsequent time I drop files into the folder, the script isn't run, but the file is still copied to the target folder. So I know the rule is being called, but the script doesn't seem to be running on the following files. I have logging on the script, so I know the script isn't even getting called. The rule is being applied to all new and modified files in the folder with no filters. Then it runs the Transform and Copy command using our custom OCR script and with the target folder being defined as the parent folder.
Below is my alfresco transformation extension:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE beans PUBLIC "-//SPRING//DTD BEAN//EN" "http://www.springframework.org/dtd/spring-beans.dtd">
<beans>
<bean id="transformer.worker.PdfOCRTool" class="org.alfresco.repo.content.transform.RuntimeExecutableContentTransformerWorker">
<property name="mimetypeService">
<ref bean="mimetypeService"/>
</property>
<property name="transformCommand">
<bean name="transformer.pdftoocr.Command" class="org.alfresco.util.exec.RuntimeExec">
<property name="commandMap">
<map>
<entry key=".*">
<value>/opt/ocr/ocr.sh ${source} ${target}</value>
</entry>
</map>
</property>
<property name="errorCodes">
<value>1,2</value>
</property>
</bean>
</property>
<property name="explicitTransformations">
<list>
<bean class="org.alfresco.repo.content.transform.ExplictTransformationDetails">
<property name="sourceMimetype">
<value>application/pdf</value>
</property>
<property name="targetMimetype">
<value>application/pdf</value>
</property>
</bean>
</list>
</property>
</bean>
<bean id="transformer.proxy.PdfOCRTool" class="org.alfresco.repo.management.subsystems.SubsystemProxyFactory">
<property name="sourceApplicationContextFactory">
<ref bean="thirdparty"/>
</property>
<property name="sourceBeanName">
<value>transformer.worker.PdfOCRTool</value>
</property>
<property name="interfaces">
<list>
<value>org.alfresco.repo.content.transform.ContentTransformerWorker</value>
</list>
</property>
</bean>
<bean id="transformer.PdfOCRTool" class="org.alfresco.repo.content.transform.ProxyContentTransformer" parent="baseContentTransformer">
<property name="worker">
<ref bean="transformer.proxy.PdfOCRTool"/>
</property>
</bean>
</beans>
The transformation service is intended for converting items from one mimetype to another. I am not sure that converting from PDF to a second PDF is valid. You would be better off implementing a custom Java repository action, which in turn uses an org.alfresco.util.exec.RuntimeExec bean to fire off the command.
Since your Spring config already defines a RuntimeExec bean, you could re-use this definition but wrap it instead in your own custom class which extends org.alfresco.repo.action.executer.ActionExecuterAbstractBase. In fact, if you take a look at the source of org.alfresco.repo.action.executer.TransformActionExecuter then that might give you some clues on how to go about implementing things.
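A rough sketch of how the Spring side of that could look. The class com.example.OcrPdfActionExecuter, its command property, and both bean ids are hypothetical; you would have to write that class yourself (extending ActionExecuterAbstractBase) and call the RuntimeExec bean from its executeImpl method. The parent bean action-executer is the abstract definition Alfresco's own action executers use.

```xml
<!-- RuntimeExec moved to a top-level bean so the action can reference it;
     same command as in the transformer config from the question -->
<bean id="ocr.pdf.Command" class="org.alfresco.util.exec.RuntimeExec">
    <property name="commandMap">
        <map>
            <entry key=".*">
                <value>/opt/ocr/ocr.sh ${source} ${target}</value>
            </entry>
        </map>
    </property>
    <property name="errorCodes">
        <value>1,2</value>
    </property>
</bean>

<!-- Hypothetical custom action: class name and "command" property are placeholders -->
<bean id="ocr-pdf" class="com.example.OcrPdfActionExecuter" parent="action-executer">
    <property name="command">
        <ref bean="ocr.pdf.Command"/>
    </property>
</bean>
```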

How do i exclude everything but text/html from a heritrix crawl?

On the Heritrix Usecases page there is a use case for "Only Store Successful HTML Pages".
My problem: I don't know how to implement it in my cxml file. Especially:
Adding the ContentTypeRegExpFilter to the ARCWriterProcessor => set its regexp setting to text/html.*. ...
There is no ContentTypeRegExpFilter in the sample cxml Files.
Kris's answer is only half the truth (at least with Heritrix 3.1.x, which I'm using). A DecideRule returns ACCEPT, REJECT or NONE. If a rule returns NONE, it means that the rule has "no opinion" (like ACCESS_ABSTAIN in Spring Security). Now, ContentTypeMatchesRegexDecideRule (like all other MatchesRegexDecideRule classes) can be configured to return a decision if a regex matches, using the two properties "decision" and "regex". With decision set to ACCEPT, the rule returns ACCEPT if the regex matches, but NONE if it does not match. And as we have seen, NONE is not an opinion, so shouldProcessRule evaluates to ACCEPT because no decision has been made.
So to only archive responses with text/html* Content-Type, configure a DecideRuleSequence where everything is REJECTed by default and only selected entries will be ACCEPTed.
This looks like this:
<bean id="warcWriter" class="org.archive.modules.writer.WARCWriterProcessor">
<property name="shouldProcessRule">
<bean class="org.archive.modules.deciderules.DecideRuleSequence">
<property name="rules">
<list>
<!-- Begin by REJECTing all... -->
<bean class="org.archive.modules.deciderules.RejectDecideRule" />
<bean class="org.archive.modules.deciderules.ContentTypeMatchesRegexDecideRule">
<property name="decision" value="ACCEPT" />
<property name="regex" value="^text/html.*" />
</bean>
</list>
</property>
</bean>
</property>
<!-- other properties... -->
</bean>
To avoid downloading images, movies, etc. at all, configure the "scope" bean with a MatchesListRegexDecideRule that REJECTs URLs with well-known file extensions, like:
<!-- ...and REJECT those from a configurable (initially empty) set of URI regexes... -->
<bean class="org.archive.modules.deciderules.MatchesListRegexDecideRule">
<property name="decision" value="REJECT"/>
<property name="listLogicalOr" value="true" />
<property name="regexList">
<list>
<value>.*(?i)(\.(avi|wmv|mpe?g|mp3))$</value>
<value>.*(?i)(\.(rar|zip|tar|gz))$</value>
<value>.*(?i)(\.(pdf|doc|xls|odt))$</value>
<value>.*(?i)(\.(xml))$</value>
<value>.*(?i)(\.(txt|conf|pdf))$</value>
<value>.*(?i)(\.(swf))$</value>
<value>.*(?i)(\.(js|css))$</value>
<value>.*(?i)(\.(bmp|gif|jpe?g|png|svg|tiff?))$</value>
</list>
</property>
</bean>
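To show where that rule goes, here is a trimmed sketch of the "scope" bean, assuming it is a DecideRuleSequence as in the default crawler-beans.cxml. The regex list is shortened to one entry for brevity, and the existing rules are left as comments.

```xml
<bean id="scope" class="org.archive.modules.deciderules.DecideRuleSequence">
  <property name="rules">
    <list>
      <!-- ...the rules already present in your crawler-beans.cxml stay here... -->
      <bean class="org.archive.modules.deciderules.MatchesListRegexDecideRule">
        <property name="decision" value="REJECT"/>
        <property name="listLogicalOr" value="true" />
        <property name="regexList">
          <list>
            <value>.*(?i)(\.(bmp|gif|jpe?g|png|svg|tiff?))$</value>
          </list>
        </property>
      </bean>
      <!-- ...any later rules follow here... -->
    </list>
  </property>
</bean>
```

Position matters: in a DecideRuleSequence the last rule that returns something other than NONE wins, so a later ACCEPT rule can still override this REJECT.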
The use cases you cite are somewhat out of date and refer to Heritrix 1.x (filters have been replaced with decide rules, very different configuration framework). Still the basic concept is the same.
The cxml file is basically a Spring configuration file. You need to set the shouldProcessRule property on the ARCWriter bean to a ContentTypeMatchesRegexDecideRule.
A possible ARCWriter configuration:
<bean id="warcWriter" class="org.archive.modules.writer.ARCWriterProcessor">
<property name="shouldProcessRule">
<bean class="org.archive.modules.deciderules.ContentTypeMatchesRegexDecideRule">
<property name="decision" value="ACCEPT" />
<property name="regex" value="^text/html.*" />
</bean>
</property>
<!-- Other properties that need to be set ... -->
</bean>
This will cause the Processor to only process those items that match the DecideRule, which in turn only passes those whose content type (mime type) matches the provided regular expression.
Be careful about the 'decision' setting. Are you ruling things in or out? (My example rules things in; anything not matching is ruled out.)
As shouldProcessRule is inherited from Processor, this can be applied to any processor.
More information about configuring Heritrix 3 can be found on the Heritrix 3 Wiki (the user guide on crawler.archive.org is about Heritrix 1).