On: Heritrix Usecases there is an Use Case for "Only Store Successful HTML Pages"
My Problem: i dont know how to implement it in my cxml File. Especially:
Adding the ContentTypeRegExpFilter to the ARCWriterProcessor => set its regexp setting to text/html.*. ...
There is no ContentTypeRegExpFilter in the sample cxml Files.
Kris's answer is only half the truth (at least with Heritrix 3.1.x that I'm using). A DecideRule return ACCEPT, REJECT or NONE. If a rule returns NONE, it means that this rule has "no opinion" about that (like ACCESS_ABSTAIN in Spring Security). Now ContentTypeMatchesRegexDecideRule (as all other MatchesRegexDecideRule) can be configured to return a decision if a regex matches (configured by the two properties "decision" and "regex"). The setting means that this rule returns an ACCEPT decision if the regex matches, but returns NONE if it does not match. And as we have seen - NONE is not an opinion so that shouldProcessRule will evaluate to ACCEPT because no decisions have been made.
So to only archive responses with text/html* Content-Type, configure a DecideRuleSequence where everything is REJECTed by default and only selected entries will be ACCEPTed.
This looks like this:
<bean id="warcWriter" class="org.archive.modules.writer.WARCWriterProcessor">
<property name="shouldProcessRule">
<bean class="org.archive.modules.deciderules.DecideRuleSequence">
<property name="rules">
<list>
<!-- Begin by REJECTing all... -->
<bean class="org.archive.modules.deciderules.RejectDecideRule" />
<bean class="org.archive.modules.deciderules.ContentTypeMatchesRegexDecideRule">
<property name="decision" value="ACCEPT" />
<property name="regex" value="^text/html.*" />
</bean>
</list>
</property>
</bean>
</property>
<!-- other properties... -->
</bean>
To avoid that images, movies etc. are downloaded at all, configure the "scope" bean with a MatchesListRegexDecideRule that REJECTs urls with well known file extensions like:
<!-- ...and REJECT those from a configurable (initially empty) set of URI regexes... -->
<bean class="org.archive.modules.deciderules.MatchesListRegexDecideRule">
<property name="decision" value="REJECT"/>
<property name="listLogicalOr" value="true" />
<property name="regexList">
<list>
<value>.*(?i)(\.(avi|wmv|mpe?g|mp3))$</value>
<value>.*(?i)(\.(rar|zip|tar|gz))$</value>
<value>.*(?i)(\.(pdf|doc|xls|odt))$</value>
<value>.*(?i)(\.(xml))$</value>
<value>.*(?i)(\.(txt|conf|pdf))$</value>
<value>.*(?i)(\.(swf))$</value>
<value>.*(?i)(\.(js|css))$</value>
<value>.*(?i)(\.(bmp|gif|jpe?g|png|svg|tiff?))$</value>
</list>
</property>
</bean>
The use cases you cite are somewhat out of date and refer to Heritrix 1.x (filters have been replaced with decide rules, very different configuration framework). Still the basic concept is the same.
The cxml file is basically a Spring configuration file. You need to configure the property shouldProcessRule on the ARCWriter bean to be the ContentTypeMatchesRegexDecideRule
A possible ARCWriter configuration:
<bean id="warcWriter" class="org.archive.modules.writer.ARCWriterProcessor">
<property name="shouldProcessRule">
<bean class="org.archive.modules.deciderules.ContentTypeMatchesRegexDecideRule">
<property name="decision" value="ACCEPT" />
<property name="regex" value="^text/html.*">
</bean>
</property>
<!-- Other properties that need to be set ... -->
</bean>
This will cause the Processor to only process those items that match the DecideRule, which in turn only passes those whose content type (mime type) matches the provided regular expression.
Be careful about the 'decision' setting. Are you ruling things in our out? (My example rules things in, anything not matching is ruled out).
As shouldProcessRule is inherited from Processor, this can be applied to any processor.
More information about configuring Heritrix 3 can be found on the Heritrix 3 Wiki (the user guide on crawler.archive.org is about Heritrix 1)
Related
<bean class="org.apache.ignite.configuration.CacheConfiguration">
<property name="name" value="mycache"/>
<!-- Configure query entities -->
<property name="queryEntities">
<list>
<bean class="org.apache.ignite.cache.QueryEntity">
<!-- Setting indexed type's key class -->
<property name="keyType" value="java.lang.Long"/>
<!-- Setting indexed type's value class -->
<property name="valueType"
value="org.apache.ignite.examples.Person"/>
<!-- Defining fields that will be either indexed or queryable.
Indexed fields are added to 'indexes' list below.-->
<property name="fields">
<map>
<entry key="id" value="java.lang.Long"/>
<entry key="name" value="java.lang.String"/>
<entry key="salary" value="java.lang.Long "/>
</map>
</property>
<!-- Defining indexed fields.-->
<property name="indexes">
<list>
<!-- Single field (aka. column) index -->
<bean class="org.apache.ignite.cache.QueryIndex">
<constructor-arg value="id"/>
</bean>
<!-- Group index. -->
<bean class="org.apache.ignite.cache.QueryIndex">
<constructor-arg>
<list>
<value>id</value>
<value>salary</value>
</list>
</constructor-arg>
<constructor-arg value="SORTED"/>
</bean>
</list>
</property>
</bean>
</list>
</property>
</bean>
I understand the above XML configuration can be used to define an SQL entity in ignite with indexes. The documentation is better understandable from a code perspective either Java or NET because API is available. As we do most of the development C++ and API is not available , we would like to know few more details to use the XML configuration. Could anyone please answer below points?
1.Where does this configuration file can be used? Server side or client side (thin & thick) or both side.
2.Is it possible to change the field names, types and indexes once it has been created and loaded data in the same entity?#
3.<property name="valueType" value="org.apache.ignite.examples.Person"/> If not mistaken, we understand the value here is taken from a namespace and from a DLL (for example in c#) but How does ignite knows about the location of DLL or namespace to get load from? where does the binaries to be kept?
4.In the case of C++ , what binary file can be used to define the value type? .lib or .dll or some other way.
C++ Thick Client can use the XML config, see IgniteConfiguration.springCfgPath.
Think about the CacheConfiguration as the "starting" config for a cache. Most of it can't be changed later. A few things, like the set of SQL columns or indexes, can be changed via SQL DDL: ALTER TABLE..., CREATE INDEX..., etc. If something isn't available in the DDL, assume that it can't be changed without recreating the cache.
Check out this. The value type name will be mapped by each platform component - Java, C++, .NET - accordingly to the binary marshaller configuration. For example, it's common to use BinaryBasicNameMapper that will map all platform type names (with namespaces/packages) to simple names, so that different namespace/package naming conventions don't create a problem. When a class is needed to deserialize a value, it will be loaded via the regular platform-specific mechanism to load code. For Java, it'll be the classpath. For C++, I guess it's LD_LIBRARY_PATH. In any case, Ignite has nothing to do with that really.
Again, Ignite has nothing to do with that. Whatever works on your platform to load code can be used.
After few experiments, I found the solution and actually it is easy.
The value given at the valueType property is directly mapped to the binary object name when it create from the code.
for e.g below configuration
<bean class="org.apache.ignite.configuration.CacheConfiguration">
<property name="name" value="TEST"/>
<property name="cacheMode" value="PARTITIONED"/>
<property name="atomicityMode" value="TRANSACTIONAL"/>
<property name="writeSynchronizationMode" value="FULL_SYNC"/>
<!-- Configure type metadata to enable queries. -->
<property name="queryEntities">
<list>
<bean class="org.apache.ignite.cache.QueryEntity">
<property name="keyType" value="java.lang.Long"/>
<property name="valueType" value="TEST"/>
<property name="fields">
<map>
<entry key="ID" value="java.lang.Long"/>
<entry key="DATE" value="java.lang.String"/>
</map>
</property>
</bean>
</list>
</property>
</bean>
the below C++ code works
template<>
struct BinaryType<examples::TEST> : BinaryTypeDefaultAll<examples::TEST>
{
static void GetTypeName(std::string& dst)
{
dst = "TEST";
}
static void Write(BinaryWriter& writer, const examples::RHO& obj)
{
writer.WriteInt64("ID", obj.Id);
writer.WriteString("DATE", obj.dt);
}
static void Read(BinaryReader& reader, examples::RHO& dst)
{
dst.Id = reader.ReadInt64("Id");
dst.dt = reader.ReadString("dt");
}
};
I am trying to read HTTP header from inside an API that I defined in the ESB. I have tried various methods (see below) all of them print out "null" in the logs.
<log level="custom">
<property name="LOG-POSITION___________________________" value="...4"/>
<property name="AXIS2___________________________" expression="$axis2:accept"/>
<property name="AXIS2___________________________" expression="$axis2:Accept"/>
<property name="AXIS2___________________________" expression="$axis2:ACCEPT"/>
<property name="CTX___________________________" expression="$ctx:accept"/>
<property name="CTX___________________________" expression="$ctx:Accept"/>
<property name="CTX___________________________" expression="$ctx:ACCEPT"/>
<property name="TRP___________________________" expression="$trp:accept"/>
<property name="TRP___________________________" expression="$trp:Accept"/>
<property name="TRP___________________________" expression="$trp:ACCEPT"/>
</log>
I can't see why it is not working.
Here is the synapse code to read and log the content-type http header. Here $trp stands for the transport header.
<log level="custom">
<property name="Content_Type" expression="$trp:Content-Type"/>
</log>
Ref: https://docs.wso2.com/display/ESB500/Synapse+XPath+Variables
If it doesn't work, enable wire logs and post the logs in the question.
I have found an answer to this by looking through existing code, and doing a bit of trial and error.
The expression "$trp:Accept" works, but it must be used before using either a call or send mediator.
For any others experiencing this issue, move the property mediator for grabbing this to the beginning of your proxy or api and the values should come through.
I want to do a whitelisting of what properties are indexed/searched and shown in excerpt with a Magnolia search.
I am changing the indexing_configuration.xml in my website workspace.
Removing the index and restarting magnolia did not change anything...
By now I have this in my indexing_configuration.xml (next to other stuff)
but these are the String properties I want to include in my ecxcerpt the rest should be excluded:
<index-rule nodeType="nt:hierarchyNode">
<property boost="10" useInExcerpt="true">introTitle</property>
<property boost="1.0" useInExcerpt="true">introAbstract</property>
<property boost="1.0" useInExcerpt="true">contentText</property>
<property boost="1.0" useInExcerpt="true">subText</property>
<property boost="10" useInExcerpt="true">title</property>
<!-- exclude jcr:* and mgnl:* properties -->
<property isRegexp="true" nodeScopeIndex="false" useInExcerpt="false">.*:.*</property>
</index-rule>
<index-rule nodeType="mgnl:contentNode">
<property boost="5" nodeScopeIndex="false" useInExcerpt="true">introTitle</property>
<property boost="2" nodeScopeIndex="false" useInExcerpt="true">introAbstract</property>
<property boost="2" nodeScopeIndex="false" useInExcerpt="true">contentText</property>
<property boost="2" nodeScopeIndex="false" useInExcerpt="true">subText</property>
<property boost="5" nodeScopeIndex="false" useInExcerpt="true">title</property>
<!-- exclude jcr:* and mgnl:* properties -->
<property isRegexp="true" nodeScopeIndex="false" useInExcerpt="false">.*:.*</property>
</index-rule>
How can i get this to work as intended? Thanks for your help..
Most likely cause is that Magnolia/JR is not seeing your new configuration. Did you change your repo configuration (workspace.xml in website workspace) to point it to new index configuration?
Default looks like:
<SearchIndex class="org.apache.jackrabbit.core.query.lucene.SearchIndex">
<param name="path" value="${wsp.home}/index" />
<!-- SearchIndex will get the indexing configuration from the classpath, if not found in the workspace home -->
<param name="indexingConfiguration" value="/info/magnolia/jackrabbit/indexing_configuration.xml"/>
and you need to point it to your new file.
Also not sure why you are setting indexing based on nt:hierarchyNode or mgnl:contentNode rather then using more specific mgnl:page/mgnl:component
I changed the discovery.xml file as described in the documentation to add a new facet over dc.type to our DSpace. When I finished reindexing and deleting the cache I see the new search filter at advanced search but not as a facet.
These are the changes I made to discovery.xml:
Added filter to sidbarFacets and SearchFilter:
<ref bean="searchFilterType" />
and this is the filter:
<bean id="searchFilterType" class="org.dspace.discovery.configuration.DiscoverySearchFilterFacet">
<property name="indexFieldName" value="type"/>
<property name="metadataFields">
<list>
<value>dc.type</value>
</list>
</property>
</bean>
Thanks in advance
The following modifications to discovery.xml on the latest DSpace master branch worked on my local setup:
https://github.com/bram-atmire/DSpace/commit/3f084569cf1bbc6c6684d114a09a1617c8d3de5d
One reason why the facet wouldn't appear in your setup, could be that you omitted to add it to both the "defaultconfiguration" as well as the specific configuration for the DSpace homepage.
After building and deploying, a forced discovery re-index using the following command made the facet appear:
./dspace index-discovery -f
Here is an example facet that I have configured in our instance. Try setting the facetLimit, sortOrder, and splitter. Re-index and see if that resolves the issue.
<bean id="searchFilterGeographic"
class="org.dspace.discovery.configuration.HierarchicalSidebarFacetConfiguration">
<property name="indexFieldName" value="geographic-region"/>
<property name="metadataFields">
<list>
<value>dc.coverage.spatial</value>
</list>
</property>
<property name="facetLimit" value="5"/>
<property name="sortOrder" value="COUNT"/>
<property name="splitter" value="::"/>
</bean>
Why isn't my transformation script running on any uploaded files beyond the first file?
I set up a transformation rule in Alfresco that listens to a folder. When a new file is placed into the folder, the rule triggers a script to run that takes a PDF without a text layer, breaks it into jpegs, OCRs the jpegs, then converts the jpegs into PDFs and merges the PDFs, returning an OCRed PDF with a text layer then copies the result into another folder so we know it got done.
Running the script at command line works. The first time I drop a file into the Alfresco folder (upload) it runs the script and copies the file. But any subsequent time I drop files into the folder, the script isn't run, but the file is still copied to the target folder. So I know the rule is being called, but the script doesn't seem to be running on the following files. I have logging on the script, so I know the script isn't even getting called. The rule is being applied to all new and modified files in the folder with no filters. Then it runs the Transform and Copy command using our custom OCR script and with the target folder being defined as the parent folder.
Below is my alfresco transformation extension:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE beans PUBLIC "-//SPRING//DTD BEAN//EN" "http://www.springframework.org/dtd/spring-beans.dtd">
<beans>
<bean id="transformer.worker.PdfOCRTool" class="org.alfresco.repo.content.transform.RuntimeExecutableContentTransformerWorker">
<property name="mimetypeService">
<ref bean="mimetypeService"/>
</property>
<property name="transformCommand">
<bean name="transformer.pdftoocr.Command" class="org.alfresco.util.exec.RuntimeExec">
<property name="commandMap">
<map>
<entry key=".*">
<value>/opt/ocr/ocr.sh ${source} ${target}</value>
</entry>
</map>
</property>
<property name="errorCodes">
<value>1,2</value>
</property>
</bean>
</property>
<property name="explicitTransformations">
<list>
<bean class="org.alfresco.repo.content.transform.ExplictTransformationDetails">
<property name="sourceMimetype">
<value>application/pdf</value>
</property>
<property name="targetMimetype">
<value>application/pdf</value>
</property>
</bean>
</list>
</property>
</bean>
<bean id="transformer.proxy.PdfOCRTool" class="org.alfresco.repo.management.subsystems.SubsystemProxyFactory">
<property name="sourceApplicationContextFactory">
<ref bean="thirdparty"/>
</property>
<property name="sourceBeanName">
<value>transformer.worker.PdfOCRTool</value>
</property>
<property name="interfaces">
<list>
<value>org.alfresco.repo.content.transform.ContentTransformerWorker</value>
</list>
</property>
</bean>
<bean id="transformer.PdfOCRTool" class="org.alfresco.repo.content.transform.ProxyContentTransformer" parent="baseContentTransformer">
<property name="worker">
<ref bean="transformer.proxy.PdfOCRTool"/>
</property>
</bean>
</beans>
The transformation service is intended for converting items from one mimetype to another. I am not sure that converting from PDF to a second PDF is valid. You would be better implementing a custom Java repository action which then in turn uses a org.alfresco.util.exec.RuntimeExec bean to fire off the command.
Since your Spring config already defines a RuntimeExec bean, you could re-use this definition but wrap it instead in your own custom class which extends org.alfresco.repo.action.executer.ActionExecuterAbstractBase. In fact, if you take a look at the source of org.alfresco.repo.action.executer.TransformActionExecuter then that might give you some clues on how to go about implementing things.