cas server 3.5.2 with ldap - ldap

I work with cas server 3.5.2 ,
I want to integrate ldap with cas
I make this jar : cas-server-support-ldap-3.5.2.jar and spring-ldap-1.3.1.RELEASE-all.jar under : apache-tomcat-7.0.47\webapps\cas-server-webapp-3.5.2\WEB-INF\lib
I added :
<dependency>
<groupId>org.jasig.cas</groupId>
<artifactId>cas-server-support-ldap</artifactId>
<version>3.5.2</version>
</dependency>
in : apache-tomcat-7.0.47\webapps\cas-server-webapp-3.5.2\META-INF\maven\org.jasig.cas\cas-server-webapp\pom.xml
and I modified apache-tomcat-7.0.47\webapps\cas-server-webapp-3.5.2\WEB-INF\deployerConfigContext.xml
<?xml version="1.0" encoding="UTF-8"?>
<!--
Licensed to Jasig under one or more contributor license
agreements. See the NOTICE file distributed with this work
for additional information regarding copyright ownership.
Jasig licenses this file to you under the Apache License,
Version 2.0 (the "License"); you may not use this file
except in compliance with the License. You may obtain a
copy of the License at the following location:
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
<!--
| deployerConfigContext.xml centralizes into one file some of the declarative configuration that
| all CAS deployers will need to modify.
|
| This file declares some of the Spring-managed JavaBeans that make up a CAS deployment.
| The beans declared in this file are instantiated at context initialization time by the Spring
| ContextLoaderListener declared in web.xml. It finds this file because this
| file is among those declared in the context parameter "contextConfigLocation".
|
| By far the most common change you will need to make in this file is to change the last bean
| declaration to replace the default SimpleTestUsernamePasswordAuthenticationHandler with
| one implementing your approach for authenticating usernames and passwords.
+-->
<beans xmlns="http://www.springframework.org/schema/beans"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:p="http://www.springframework.org/schema/p"
xmlns:tx="http://www.springframework.org/schema/tx"
xmlns:sec="http://www.springframework.org/schema/security"
xsi:schemaLocation="http://www.springframework.org/schema/beans http://www.springframework.org/schema/beans/spring-beans-3.1.xsd
http://www.springframework.org/schema/tx http://www.springframework.org/schema/tx/spring-tx-3.1.xsd
http://www.springframework.org/schema/security http://www.springframework.org/schema/security/spring-security-3.1.xsd">
<!--
| This bean declares our AuthenticationManager. The CentralAuthenticationService service bean
| declared in applicationContext.xml picks up this AuthenticationManager by reference to its id,
| "authenticationManager". Most deployers will be able to use the default AuthenticationManager
| implementation and so do not need to change the class of this bean. We include the whole
| AuthenticationManager here in the userConfigContext.xml so that you can see the things you will
| need to change in context.
+-->
<bean id="authenticationManager"
class="org.jasig.cas.authentication.AuthenticationManagerImpl">
<!-- Uncomment the metadata populator to allow clearpass to capture and cache the password
This switch effectively will turn on clearpass.
<property name="authenticationMetaDataPopulators">
<list>
<bean class="org.jasig.cas.extension.clearpass.CacheCredentialsMetaDataPopulator">
<constructor-arg index="0" ref="credentialsCache" />
</bean>
</list>
</property>
-->
<!--
| This is the List of CredentialToPrincipalResolvers that identify what Principal is trying to authenticate.
| The AuthenticationManagerImpl considers them in order, finding a CredentialToPrincipalResolver which
| supports the presented credentials.
|
| AuthenticationManagerImpl uses these resolvers for two purposes. First, it uses them to identify the Principal
| attempting to authenticate to CAS /login . In the default configuration, it is the DefaultCredentialsToPrincipalResolver
| that fills this role. If you are using some other kind of credentials than UsernamePasswordCredentials, you will need to replace
| DefaultCredentialsToPrincipalResolver with a CredentialsToPrincipalResolver that supports the credentials you are
| using.
|
| Second, AuthenticationManagerImpl uses these resolvers to identify a service requesting a proxy granting ticket.
| In the default configuration, it is the HttpBasedServiceCredentialsToPrincipalResolver that serves this purpose.
| You will need to change this list if you are identifying services by something more or other than their callback URL.
+-->
<property name="credentialsToPrincipalResolvers">
<list>
<!--
| UsernamePasswordCredentialsToPrincipalResolver supports the UsernamePasswordCredentials that we use for /login
| by default and produces SimplePrincipal instances conveying the username from the credentials.
|
| If you've changed your LoginFormAction to use credentials other than UsernamePasswordCredentials then you will also
| need to change this bean declaration (or add additional declarations) to declare a CredentialsToPrincipalResolver that supports the
| Credentials you are using.
+-->
<bean class="org.jasig.cas.authentication.principal.UsernamePasswordCredentialsToPrincipalResolver" >
<property name="attributeRepository" ref="attributeRepository" />
</bean>
<!--
| HttpBasedServiceCredentialsToPrincipalResolver supports HttpBasedCredentials. It supports the CAS 2.0 approach of
| authenticating services by SSL callback, extracting the callback URL from the Credentials and representing it as a
| SimpleService identified by that callback URL.
|
| If you are representing services by something more or other than an HTTPS URL whereat they are able to
| receive a proxy callback, you will need to change this bean declaration (or add additional declarations).
+-->
<bean
class="org.jasig.cas.authentication.principal.HttpBasedServiceCredentialsToPrincipalResolver" />
</list>
</property>
<!--
| Whereas CredentialsToPrincipalResolvers identify who it is some Credentials might authenticate,
| AuthenticationHandlers actually authenticate credentials. Here we declare the AuthenticationHandlers that
| authenticate the Principals that the CredentialsToPrincipalResolvers identified. CAS will try these handlers in turn
| until it finds one that both supports the Credentials presented and succeeds in authenticating.
+-->
<property name="authenticationHandlers">
<list>
<!--
| This is the authentication handler that authenticates services by means of callback via SSL, thereby validating
| a server side SSL certificate.
+-->
<bean class="org.jasig.cas.authentication.handler.support.HttpBasedServiceCredentialsAuthenticationHandler"
p:httpClient-ref="httpClient" />
<!--
| This is the authentication handler declaration that every CAS deployer will need to change before deploying CAS
| into production. The default SimpleTestUsernamePasswordAuthenticationHandler authenticates UsernamePasswordCredentials
| where the username equals the password. You will need to replace this with an AuthenticationHandler that implements your
| local authentication strategy. You might accomplish this by coding a new such handler and declaring
| edu.someschool.its.cas.MySpecialHandler here, or you might use one of the handlers provided in the adaptors modules.
+-->
<!-- <bean
class="org.jasig.cas.authentication.handler.support.SimpleTestUsernamePasswordAuthenticationHandler" />-->
<bean class="org.jasig.cas.adaptors.ldap.BindLdapAuthenticationHandler">
<property name="filter" value="cn=%u" />
<property name="searchBase" value="OU=Users,OU=User Accounts,DC=MIN,DC=TN" />
<property name="contextSource" ref="contextSource" />
<property name="ignorePartialResultException" value="yes" />
</bean>
</list>
</property>
</bean>
<bean id="contextSource" class="org.jasig.cas.adaptors.ldap.util.AuthenticatedLdapContextSource">
<property name="urls">
<list>
<value>ldap://192.168.0.88:389</value>
</list>
</property>
<!--
+-->
<property name="userName"
value="CN=LDAP Requester,OU=Users,OU=Technical Accounts,DC=MIN,DC=TN"/>
<property name="password" value="min$2013"/>
<property name="baseEnvironmentProperties">
<map>
<entry>
<key>
<value>java.naming.security.authentication</value>
</key>
<value>simple</value>
</entry>
</map>
</property>
</bean>
<!--
This bean defines the security roles for the Services Management application. Simple deployments can use the in-memory version.
More robust deployments will want to use another option, such as the Jdbc version.
The name of this should remain "userDetailsService" in order for Spring Security to find it.
-->
<!-- <sec:user name="##THIS SHOULD BE REPLACED##" password="notused" authorities="ROLE_ADMIN" />-->
<sec:user-service id="userDetailsService">
<sec:user name="##THIS SHOULD BE REPLACED##" password="notused" authorities="ROLE_ADMIN" />
</sec:user-service>
<!--
Bean that defines the attributes that a service may return. This example uses the Stub/Mock version. A real implementation
may go against a database or LDAP server. The id should remain "attributeRepository" though.
-->
<bean id="attributeRepository"
class="org.jasig.services.persondir.support.StubPersonAttributeDao">
<property name="backingMap">
<map>
<entry key="uid" value="uid" />
<entry key="eduPersonAffiliation" value="eduPersonAffiliation" />
<entry key="groupMembership" value="groupMembership" />
</map>
</property>
</bean>
<!--
Sample, in-memory data store for the ServiceRegistry. A real implementation
would probably want to replace this with the JPA-backed ServiceRegistry DAO
The name of this bean should remain "serviceRegistryDao".
-->
<bean
id="serviceRegistryDao"
class="org.jasig.cas.services.InMemoryServiceRegistryDaoImpl">
<property name="registeredServices">
<list>
<bean class="org.jasig.cas.services.RegexRegisteredService">
<property name="id" value="0" />
<property name="name" value="HTTP and IMAP" />
<property name="description" value="Allows HTTP(S) and IMAP(S) protocols" />
<property name="serviceId" value="^(https?|imaps?)://.*" />
<property name="evaluationOrder" value="10000001" />
</bean>
<!--
Use the following definition instead of the above to further restrict access
to services within your domain (including subdomains).
Note that example.com must be replaced with the domain you wish to permit.
-->
<!--
<bean class="org.jasig.cas.services.RegexRegisteredService">
<property name="id" value="1" />
<property name="name" value="HTTP and IMAP on example.com" />
<property name="description" value="Allows HTTP(S) and IMAP(S) protocols on example.com" />
<property name="serviceId" value="^(https?|imaps?)://([A-Za-z0-9_-]+\.)*example\.com/.*" />
<property name="evaluationOrder" value="0" />
</bean>
-->
</list>
</property>
</bean>
<bean id="auditTrailManager" class="com.github.inspektr.audit.support.Slf4jLoggingAuditTrailManager" />
<bean id="healthCheckMonitor" class="org.jasig.cas.monitor.HealthCheckMonitor">
<property name="monitors">
<list>
<bean class="org.jasig.cas.monitor.MemoryMonitor"
p:freeMemoryWarnThreshold="10" />
<!--
NOTE
The following ticket registries support SessionMonitor:
* DefaultTicketRegistry
* JpaTicketRegistry
Remove this monitor if you use an unsupported registry.
-->
<bean class="org.jasig.cas.monitor.SessionMonitor"
p:ticketRegistry-ref="ticketRegistry"
p:serviceTicketCountWarnThreshold="5000"
p:sessionCountWarnThreshold="100000" />
</list>
</property>
</bean>
</beans>
but when I started cas server Ihave this error :
caused by: org.springframework.beans.factory.CannotLoadBeanClassException: Canno
t find class [org.jasig.cas.adaptors.ldap.util.AuthenticatedLdapContextSource] f
or bean with name 'contextSource' defined in ServletContext resource [/WEB-INF/d
eployerConfigContext.xml]; nested exception is java.lang.ClassNotFoundException:
org.jasig.cas.adaptors.ldap.util.AuthenticatedLdapContextSource
at org.springframework.beans.factory.support.AbstractBeanFactory.resolve
BeanClass(AbstractBeanFactory.java:1262)
at org.springframework.beans.factory.support.AbstractAutowireCapableBean
Factory.createBean(AbstractAutowireCapableBeanFactory.java:433)
at org.springframework.beans.factory.support.AbstractBeanFactory$1.getOb
ject(AbstractBeanFactory.java:294)
at org.springframework.beans.factory.support.DefaultSingletonBeanRegistr
y.getSingleton(DefaultSingletonBeanRegistry.java:225)
at org.springframework.beans.factory.support.AbstractBeanFactory.doGetBe
an(AbstractBeanFactory.java:291)

My suggestion is:
Create a webapp maven project
Add cas-server-webapp dependency (Take a look here)
Create war using mvn package.
Run cas and test it.
Add cas-server-support-ldap dependency
If you want to modify files, create the same structure in src/main/webapp. Your files will replace cas files.

Change the bean contextSource
<bean id="contextSource" class="org.jasig.cas.adaptors.ldap.util.AuthenticatedLdapContextSource">
....
</bean>
to new way of doing it in version 3.5.2
<bean id="contextSource" class="org.springframework.ldap.core.support.LdapContextSource">
You'll need of course spring-ldap jar, you already have it.
more details here, look for contextSource

Just fixed the same problem on cas 3.5.2+LDAP. You can replace spring-ldap-1.3.1.RELEASE-all.jar with following jar files in ..\WEB-INFO\lib folder for a try.
spring-ldap-core-1.3.2.RELEASE.jar
spring-ldap-core-tiger-1.3.2.RELEASE.jar
Download link:
http://mvnrepository.com/artifact/org.springframework.ldap/spring-ldap-core/1.3.2.RELEASE

Related

ActiveMQ & built-in Jetty: how to redirect HTTP to HTTPS? And how to signal which protocol to use?

I have modified the admin console of ActiveMQ, i.e. the built-in Jetty, to use HTTPS instead of plain HTTP. However, two (albeit minor) issues remain:
I only managed to disable the HTTP port and enable the HTTPS port as suggested in the jetty.xml file:
<list>
<!--
Default: Enable this connector if you wish to use http with web console
->
<bean id="Connector" class="org.eclipse.jetty.server.ServerConnector">
<constructor-arg ref="Server" />
<!- see the jettyPort bean ->
<property name="host" value="#{systemProperties['jetty.host']}" />
<property name="port" value="#{systemProperties['jetty.port']}" />
</bean>
<!- -->
<!--
Enable this connector if you wish to use https with web console
-->
<bean id="SecureConnector" class="org.eclipse.jetty.server.ServerConnector">
<constructor-arg ref="Server" />
<constructor-arg>
<bean id="handlers" class="org.eclipse.jetty.util.ssl.SslContextFactory">
<property name="keyStorePath" value="${activemq.conf}/broker.ks" />
<property name="keyStorePassword" value="password" />
</bean>
</constructor-arg>
<property name="port" value="8162" />
</bean>
I would have preferred to leave the HTTP port active but use it to redirect HTTP calls to HTTPS. Can one do that and if so, how? I found no documentation describing this.
If one looks at the log at startup one gets a line
...
2022-11-23 17:56:04,836 | INFO | ActiveMQ WebConsole available at http://0.0.0.0:8162/ | org.apache.activemq.web.WebConsoleStarter | WrapperSimpleAppMain
2022-11-23 17:56:04,836 | INFO | ActiveMQ Jolokia REST API available at http://0.0.0.0:8162/api/jolokia/ | org.apache.activemq.web.WebConsoleStarter | WrapperSimpleAppMain
...
I.e. the URL displayed obviously picks up the correct port (8162 which I had changed from the default 8161 when switching to HTTPS) but displays the wrong protocol ("http") which is not correct. HTTP is not served any more, only HTTPS.
Can one tweak that as well so that the log also displays the correct protocol, i.e https://0.0.0.0:8162/....
Specify your (http connector) HttpConfiguration properly with regards to securePort and secureScheme.
Then add the SecureRedirectHandler somewhere early in your Jetty Handler tree.

Could not refresh JMS Connection for destination 'queue://inventorydsDestination' - retrying in 5000 ms. Cause: AOP configuration seems to be invalid

I'm getting the following exception when I re-deploy the application war in the Tomcat manager. For example, on first time deployment it connects to the external ActiveMQ properly but when I stop/start the war in Tomcat manager, then the following execption is thrown repeatedly. After this, the JMS does not connect to ActiveMQ with the below exception:
[2015-09-13T04:03:33.689] | [ERROR] | [inventorydsRequestListenerContainer-1] | [Could not refresh JMS Connection for destination 'queue://inventorydsDestination' - retrying in 5000 ms. Cause: AOP configuration seems to be invalid: tried calling method [public abstract javax.jms.Connection javax.jms.ConnectionFactory.createConnection() throws javax.jms.JMSException] on target [org.springframework.jms.connection.UserCredentialsConnectionFactoryAdapter#168d95c7]; nested exception is java.lang.IllegalArgumentException: java.lang.ClassCastException#2fb6f3c3]
applicationContext-Jms.xml
<bean id="jmsJndiConnectionFactory" class="org.springframework.jndi.JndiObjectFactoryBean">
<property name="jndiName" value="${inventory.mq.name}"/>
<property name="lookupOnStartup" value="false"/>
<property name="cache" value="true" />
<property name="proxyInterface" value="javax.jms.QueueConnectionFactory" />
</bean>
<bean id="jmsConnectionFactory" class="org.springframework.jms.connection.CachingConnectionFactory">
<property name="targetConnectionFactory" ref="jmsJndiConnectionFactory" />
<property name="sessionCacheSize" value="10" />
</bean>
connectionFactory - JNDI configuration
<bean id="jndiName" class="java.lang.String">
<constructor-arg value="${inventory.mq.name}"/>
</bean>
<bean id="bindingObject" class="org.springframework.jms.connection.UserCredentialsConnectionFactoryAdapter">
<property name="targetConnectionFactory" ref="mqConnectionFactory" />
<property name="username" value="${inventory.activeMQ.username}" />
<property name="password" value="${inventory.activeMQ.password}" />
</bean>
<bean id="mqConnectionFactory" class="org.apache.activemq.ActiveMQConnectionFactory">
<property name="brokerURL" value="${inventory.activeMQ.brokerurl}" />
</bean>
Properties:
inventory.activeMQ.brokerurl=tcp://localhost:61616
inventory.activeMQ.username=admin
inventory.activeMQ.password=admin
inventory.mq.name=jms/connectionFactory
inventory.queue.type=org.apache.activemq.command.ActiveMQQueue
I had similar issue and discovered it was a classpath issue between tomcat and my web application. I needed to set the scope of the jms dependency in my web app to provided instead of the default (i.e. compile). That way, my WAR deployable did not contain another jms jar that clashed with the jms classes contained in the apache-activemq-all jar that was located in the tomcat lib folder.
<dependency>
<groupId>javax.jms</groupId>
<artifactId>jms-api</artifactId>
<version>1.1-rev-1</version>
<scope>provided</scope>
</dependency>
Try after skipping ALL Breakpoints in Debug mode/ Switch off Debug Mode or Run in Run Mode

allow specific role in cas

I have a cas server that authenticate and send back some attributes corerectly. Now i want to add a service that check user roles in principal attributes and allow access to a service only if logged in user has the Specific role ( like admin role!).
I read about 'requiredHandlers' and thought it can help, but i can not make it work!
For service, i have something like this:
<bean class="org.jasig.cas.services.RegexRegisteredService">
<property name="id" value="1"/>
<property name="name" value="Admin panel service"/>
<property name="serviceId" value="http://localhost:8080/admin"/>
<property name="evaluationOrder" value="0"/>
<property name="ignoreAttributes" value="true"/>
<property name="requiredHandlers" value="supporterAuthenticationHandler"> <!-- this is what i found so far -->
</property>
</bean>
where supporterAuthenticationHandler is defined in authenticationManager
<bean id="authenticationManager" class="org.jasig.cas.authentication.PolicyBasedAuthenticationManager">
<constructor-arg>
<map>
<entry key-ref="proxyAuthenticationHandler" value-ref="proxyPrincipalResolver"/>
<entry key-ref="primaryAuthenticationHandler"><null /></entry>
<entry key-ref="supporterAuthenticationHandler"><null /></entry>
</map>
</constructor-arg>
<property name="authenticationPolicy">
<bean class="org.jasig.cas.authentication.AnyAuthenticationPolicy"/>
</property>
</bean>
And supporterAuthenticationHandler is a simple implemention of AuthenticationHandler (nothing implemented in it yet.
problem is i can not make cas to check supporterAuthenticationHandler so that i can go further (and probably fall into another hole where i need the new Principal with attributes).
Am i going completely the wrong way? Shall i check user role in my admin application? is it even possible to check roles with cas with different services?
This goes to the point that CAS is an authentication service not an authorization service. CAS only ensures that users are who they say they are not what they can do in an application (service).
please see this answer
https://security.stackexchange.com/questions/7623/can-central-authentication-service-cas-do-authorization

Access activemq Poolable Connection factory as OSGI service

I am using fuse 6.0 and activemq 5.8. Instead of defining activemq poolable connection factory in each bundle, it makes sense to define in a common bundle and expose it as osgi service. I created blue print file in FUSE_HOME/etc and opened an osgi service like this.
<osgix:cm-properties id="prop" persistent-id="xxx.xxx.xxx.properties" />
<bean id="jmsConnectionFactory" class="org.apache.activemq.ActiveMQConnectionFactory">
<property name="brokerURL" value="${xxx.url}" />
<property name="userName" value="${xxx.username}" />
<property name="password" value="${xxx.password}" />
</bean>
<bean id="pooledConnectionFactory" class="org.apache.activemq.pool.PooledConnectionFactory" init-method="start" destroy-method="stop">
<property name="maxConnections" value="${maxconnections}" />
<property name="connectionFactory" ref="jmsConnectionFactory" />
</bean>
<service ref="pooledConnectionFactory" interface="javax.jms.ConnectionFactory">
<service-properties>
<entry key="name" value="localhost"/>
</service-properties>
</service>
and when i try to access this service in both blueprint files and spring text files like this
<reference id="pooledConnectionFactory" interface="javax.jms.ConnectionFactory"/>
bean id="jmsConfig" class="org.apache.camel.component.jms.JmsConfiguration">
<property name="connectionFactory" ref="pooledConnectionFactory"/>
<property name="concurrentConsumers" value="${xxx.concurrentConsumers}"/>
</bean>
<bean id="activemq" class="org.apache.activemq.camel.component.ActiveMQComponent">
<property name="configuration" ref="jmsConfig"/>
</bean>
but I am getting following expection during bundles startup.
Failed to add Connection ID:PLNL6237-55293-1401929434025-11:1201, reason: java.lang.SecurityException: User name [null] or password is invalid.
I even defined compendium definition in my bundles.
How can i solve this problem? any help is appreciated.
I found this online https://issues.apache.org/jira/i#browse/SM-2183
Do i need to upgrade?
It looks to me like you're using the property placeholders incorrectly. First of all, you should know what osgix:cm-properties only exposes the properties at the persistent id that you specify. You can treat it like a java.util.Properties object, and even inject it into a bean as one. This does however mean that it makes no attempt to resolve the properties.
To resolve properties, use spring's property placeholder configurer.
<bean class="org.springframework.beans.factory.config.PropertyPlaceholderConfigurer">
<property name="properties" ref="prop"/>
</bean>
P.S. The persistent id of cm-properties is the name of the file, not including the file type. You don't need the .properties at the end.

How do i exclude everything but text/html from a heritrix crawl?

On: Heritrix Usecases there is an Use Case for "Only Store Successful HTML Pages"
My Problem: i dont know how to implement it in my cxml File. Especially:
Adding the ContentTypeRegExpFilter to the ARCWriterProcessor => set its regexp setting to text/html.*. ...
There is no ContentTypeRegExpFilter in the sample cxml Files.
Kris's answer is only half the truth (at least with Heritrix 3.1.x that I'm using). A DecideRule return ACCEPT, REJECT or NONE. If a rule returns NONE, it means that this rule has "no opinion" about that (like ACCESS_ABSTAIN in Spring Security). Now ContentTypeMatchesRegexDecideRule (as all other MatchesRegexDecideRule) can be configured to return a decision if a regex matches (configured by the two properties "decision" and "regex"). The setting means that this rule returns an ACCEPT decision if the regex matches, but returns NONE if it does not match. And as we have seen - NONE is not an opinion so that shouldProcessRule will evaluate to ACCEPT because no decisions have been made.
So to only archive responses with text/html* Content-Type, configure a DecideRuleSequence where everything is REJECTed by default and only selected entries will be ACCEPTed.
This looks like this:
<bean id="warcWriter" class="org.archive.modules.writer.WARCWriterProcessor">
<property name="shouldProcessRule">
<bean class="org.archive.modules.deciderules.DecideRuleSequence">
<property name="rules">
<list>
<!-- Begin by REJECTing all... -->
<bean class="org.archive.modules.deciderules.RejectDecideRule" />
<bean class="org.archive.modules.deciderules.ContentTypeMatchesRegexDecideRule">
<property name="decision" value="ACCEPT" />
<property name="regex" value="^text/html.*" />
</bean>
</list>
</property>
</bean>
</property>
<!-- other properties... -->
</bean>
To avoid that images, movies etc. are downloaded at all, configure the "scope" bean with a MatchesListRegexDecideRule that REJECTs urls with well known file extensions like:
<!-- ...and REJECT those from a configurable (initially empty) set of URI regexes... -->
<bean class="org.archive.modules.deciderules.MatchesListRegexDecideRule">
<property name="decision" value="REJECT"/>
<property name="listLogicalOr" value="true" />
<property name="regexList">
<list>
<value>.*(?i)(\.(avi|wmv|mpe?g|mp3))$</value>
<value>.*(?i)(\.(rar|zip|tar|gz))$</value>
<value>.*(?i)(\.(pdf|doc|xls|odt))$</value>
<value>.*(?i)(\.(xml))$</value>
<value>.*(?i)(\.(txt|conf|pdf))$</value>
<value>.*(?i)(\.(swf))$</value>
<value>.*(?i)(\.(js|css))$</value>
<value>.*(?i)(\.(bmp|gif|jpe?g|png|svg|tiff?))$</value>
</list>
</property>
</bean>
The use cases you cite are somewhat out of date and refer to Heritrix 1.x (filters have been replaced with decide rules, very different configuration framework). Still the basic concept is the same.
The cxml file is basically a Spring configuration file. You need to configure the property shouldProcessRule on the ARCWriter bean to be the ContentTypeMatchesRegexDecideRule
A possible ARCWriter configuration:
<bean id="warcWriter" class="org.archive.modules.writer.ARCWriterProcessor">
<property name="shouldProcessRule">
<bean class="org.archive.modules.deciderules.ContentTypeMatchesRegexDecideRule">
<property name="decision" value="ACCEPT" />
<property name="regex" value="^text/html.*">
</bean>
</property>
<!-- Other properties that need to be set ... -->
</bean>
This will cause the Processor to only process those items that match the DecideRule, which in turn only passes those whose content type (mime type) matches the provided regular expression.
Be careful about the 'decision' setting. Are you ruling things in our out? (My example rules things in, anything not matching is ruled out).
As shouldProcessRule is inherited from Processor, this can be applied to any processor.
More information about configuring Heritrix 3 can be found on the Heritrix 3 Wiki (the user guide on crawler.archive.org is about Heritrix 1)