Issue parsing PDF with Apache Nutch - extractor plugin - pdf

I am trying to index web pages AND pdf documents from a website. I am using Nutch 1.9.
I downloaded the nutch-custom-search plugin from https://github.com/BayanGroup/nutch-custom-search. The plugin is awesome and does let me map selected divs to Solr fields.
The problem is that my site also contains numerous PDF files. I can see that they are fetched but never parsed. No PDFs show up when I query Solr, only web pages. I am trying to use Tika to parse the PDFs (I hope I have the right idea).
If I run parsechecker on Cygwin as shown below, the PDF seems to parse OK:
$ bin/nutch parsechecker -dumptext -forceAs application/pdf http://www.immunisationscotland.org.uk/uploads/documents/18304-Tuberculosis.pdf
I am not sure what to do next (my config is below).
extractor.xml
<config xmlns="http://bayan.ir" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://bayan.ir http://raw.github.com/BayanGroup/nutch-custom-search/master/zal.extractor/src/main/resources/extractors.xsd" omitNonMatching="true">
<fields>
<field name="pageTitleChris" />
<field name="contentChris" />
</fields>
<documents>
<document url="^.*\.(?!pdf$)[^.]+$" engine="css">
<extract-to field="pageTitleChris">
<text>
<expr value="head > title" />
</text>
</extract-to>
<extract-to field="contentChris">
<text>
<expr value="#primary-content" />
</text>
</extract-to>
</document>
</documents>
</config>
Inside my parse-plugins.xml I added:
<mimeType name="application/pdf">
<plugin id="parse-tika" />
</mimeType>
nutch-site.xml
<property>
<name>plugin.includes</name>
<value>protocol-http|urlfilter-regex|parse-(html|tika|text)|extractor|index-(basic|anchor)|query-(basic|site|url)|indexer-solr|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
<property>
<name>http.content.limit</name>
<value>65536666</value>
<description></description>
</property>
<property>
<name>extractor.file</name>
<value>extractor.xml</value>
</property>
Help would be much appreciated,
Thanks
Chris

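I think the problem is omitNonMatching="true" in your extractor.xml file.
omitNonMatching="true" means "don't index pages that match none of the extract-to rules in extractor.xml". Your only document rule has a url pattern that excludes URLs ending in .pdf, so the PDFs presumably match no rule and are left out of the index. The default value is false.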
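To see why the PDFs would count as "non-matching", here is a minimal sketch that tests the url pattern from the question's document rule against the PDF URL from the parsechecker command. It uses Python's re module as a stand-in for the Java regex engine the plugin runs on (an assumption, though this particular pattern behaves the same in both), and the HTML URL is a hypothetical page added only for comparison.

```python
import re

# The url pattern copied from the <document> rule in extractor.xml.
DOCUMENT_URL_PATTERN = re.compile(r"^.*\.(?!pdf$)[^.]+$")


def matches_document_rule(url: str) -> bool:
    """Return True if the url is covered by the <document> rule's pattern."""
    return DOCUMENT_URL_PATTERN.search(url) is not None


# PDF URL taken from the parsechecker command in the question.
PDF_URL = "http://www.immunisationscotland.org.uk/uploads/documents/18304-Tuberculosis.pdf"
# Hypothetical HTML page on the same site, used only for comparison.
HTML_URL = "http://www.immunisationscotland.org.uk/index.html"

for url in (PDF_URL, HTML_URL):
    verdict = "matches" if matches_document_rule(url) else "no match"
    print(f"{verdict}: {url}")
```

The PDF URL is rejected by the negative lookahead (?!pdf$), so with omitNonMatching="true" it has no rule to match and would be dropped from the index.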

Multi-App Kiosk mode provisioning failure code '0xC00CE223'

I'm trying to put in place a kiosk on a Surface Go using the following AssignedAccess.xml file in my provisioning package:
<?xml version="1.0" encoding="utf-8" ?>
<AssignedAccessConfiguration
xmlns="https://schemas.microsoft.com/AssignedAccess/2017/config"
xmlns:r1809="https://schemas.microsoft.com/AssignedAccess/201810/config"
>
<Profiles>
<Profile Id="{f46cfb9f-044f-4d96-bb33-ea1c1c18a354}">
<AllAppsList>
<AllowedApps>
<App AppUserModelId="Microsoft.Windows.Explorer" r1809:AutoLaunch="true" />
<App AppUserModelId="Microsoft.WindowsCalculator_8wekyb3d8bbwe!App" />
<App DesktopAppPath="C:\Program Files\SumatraPDF\SumatraPDF.exe" />
</AllowedApps>
</AllAppsList>
<r1809:FileExplorerNamespaceRestrictions>
<r1809:AllowedNamespace Name="Downloads" />
</r1809:FileExplorerNamespaceRestrictions>
<StartLayout>
<![CDATA[<LayoutModificationTemplate xmlns:defaultlayout="http://schemas.microsoft.com/Start/2014/FullDefaultLayout" xmlns:start="http://schemas.microsoft.com/Start/2014/StartLayout" Version="1" xmlns="http://schemas.microsoft.com/Start/2014/LayoutModification">
<LayoutOptions StartTileGroupCellWidth="6" />
<DefaultLayoutOverride>
<StartLayoutCollection>
<defaultlayout:StartLayout GroupCellWidth="6">
<start:Group Name="Apps">
<start:Tile Size="4x2" Column="0" Row="2" AppUserModelID="Microsoft.WindowsCalculator_8wekyb3d8bbwe!App" />
<start:DesktopApplicationTile Size="2x2" Column="0" Row="0" DesktopApplicationLinkPath="%APPDATA%\Microsoft\Windows\Start Menu\Programs\SumatraPDF.lnk" />
<start:DesktopApplicationTile Size="2x2" Column="2" Row="0" DesktopApplicationLinkPath="%APPDATA%\Microsoft\Windows\Start Menu\Programs\System Tools\File Explorer.lnk" />
</start:Group>
</defaultlayout:StartLayout>
</StartLayoutCollection>
</DefaultLayoutOverride>
</LayoutModificationTemplate>
]]>
</StartLayout>
<Taskbar ShowTaskbar="false" />
</Profile>
</Profiles>
<Configs>
<Config>
<Account>CouncilKiosk</Account>
<DefaultProfile Id="{f46cfb9f-044f-4d96-bb33-ea1c1c18a354}"/>
</Config>
</Configs>
</AssignedAccessConfiguration>
I looked at the logs, and they consistently report error code '0xC00CE223'. According to my research, this means "Validate failed because the document does not contain exactly one root node." (XML DOM Error Messages documentation). I'm not sure what is going wrong.
The provisioning package is also setting 2 user accounts (local admin and local user), hiding OOBE, enabling tablet mode as default, and running a provisioning command script that installs a single application and sets registry keys necessary for autologin.
UPDATE: I re-imaged the Surface Go with Windows 10 Pro and it still fails. But now I get an error '0x8000FFFF' which appears to be related to windows update and the windows store. I only have 1 USB port on this thing so it isn't connected to the internet at this time.
UPDATE 2: I re-imaged with a more up to date ISO of 10 Pro and I'm back to the original errors listed in the above post. I have updated the XML file and changed the tag as well as the xmlns from rs5 to r1809. I am not seeing any changes and this continues to be a frustrating problem to have.
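Try changing this namespace:
https://schemas.microsoft.com/AssignedAccess/2017/config
to the following (http instead of https):
http://schemas.microsoft.com/AssignedAccess/2017/config
The r1809 namespace on the next line of your file probably needs the same change.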
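The scheme matters because an XML namespace is just an identifier that is compared as a literal string and is never fetched as a URL. The sketch below illustrates that with Python's standard xml.etree.ElementTree. It does not reproduce the Windows provisioning validator, and the two one-line documents are cut-down stand-ins for the real file.

```python
import xml.etree.ElementTree as ET

# Namespace URI suggested in the answer (http, not https).
EXPECTED_NS = "http://schemas.microsoft.com/AssignedAccess/2017/config"


def root_namespace(xml_text: str) -> str:
    """Return the namespace URI of the root element, or '' if it has none."""
    tag = ET.fromstring(xml_text).tag  # namespaced tags look like "{uri}localname"
    return tag[1:].split("}", 1)[0] if tag.startswith("{") else ""


# Stand-ins for AssignedAccess.xml: only the root element is kept.
WITH_HTTPS = (
    '<AssignedAccessConfiguration '
    'xmlns="https://schemas.microsoft.com/AssignedAccess/2017/config"/>'
)
WITH_HTTP = (
    '<AssignedAccessConfiguration '
    'xmlns="http://schemas.microsoft.com/AssignedAccess/2017/config"/>'
)

print(root_namespace(WITH_HTTPS) == EXPECTED_NS)  # False
print(root_namespace(WITH_HTTP) == EXPECTED_NS)   # True
```

To a namespace-aware parser, the https variant is an element in a different, unknown namespace, which would be consistent with a schema validator failing to find the root element it expects.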

Font in Apache FOP 1.1

Can anybody help me? In my Oracle ADF project I use Apache FOP to print data to a PDF file. The application runs on CentOS, and I need to use the Arial font there. I want to set up Arial via auto-detect and MANIFEST.MF. My steps so far:
I've added folder with font to project
C:\dev\JdevUserDir\mywork\RIM282\ViewController\libs\font\arial.ttf
In ViewController.jpr I've added:
<hash>
<value n="id" v="Font"/>
<value n="isJDK" v="false"/>
</hash>
<hash>
<list n="classPath">
<url path="libs/font/"/>
</list>
<value n="deployedByDefault" v="true"/>
<value n="description" v="Font"/>
<value n="id" v="Font"/>
<value n="locked" v="true"/>
</hash>
In fop.xconf I've added:
<fonts>
<font-triplet name="Arial" style="normal" weight="bold"/>
<auto-detect/>
</fonts>
My manifest content ViewController/src/META-INF/MANIFEST.MF:
Manifest-Version: 1.0
Name: libs/font/arial.ttf
Content-Type: application/x-font
Fragment from xsl file:
<fo:block text-align="end"
font-size="10pt"
font-family="Arial">
Страница <fo:page-number />
</fo:block>
But when I open the PDF file, I see ##### instead of the Cyrillic characters. Everything renders correctly on Windows; the problem with Cyrillic occurs only on CentOS.
I've tried to use
<font-base>./libs/font</font-base>
in my fop.xconf file. But instead of a path relative to the folder that contains fop.xconf, I get this path: C:/dev/JdevUserDir/system12.2.1.2.42.161008.1648/DefaultDomain/libs/font. Why?

Removing indexes of master database from Content delivery log files

I want to stop the master database indexes from showing up in the log files on content delivery. I added SwitchMasterToWeb.config to the App_Config/Include folder, but I am still seeing master database indexing in my log files.
Is some additional configuration required, or do I need to customize some Sitecore files?
I guess you saw the exception message "Index sitecore_master_index was not found" in your log files on the Content Delivery server. It is a known issue in Sitecore, and you will need to install a support package for your Sitecore version; the packages are listed in the Sitecore Knowledge Base.
If you are still seeing references to the master database after applying SwitchMasterToWeb.config, it is possible that the file is either not being loaded correctly or is being loaded too early.
I try to put it in a sub-folder that will process last (such as App_Config\Include\zzz_FinalConfigs). That way I can be sure it runs after all of the Sitecore subfolders and configuration files.
At this point, load up ShowConfig.aspx and verify that all references to the master database have been removed. You can look for patch:source references to your switchmastertoweb.config file to see if your file is being read and parsed.
If not, you may be editing the wrong file system.
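The ShowConfig.aspx check described above can also be scripted. The following is a minimal sketch, assuming the ShowConfig.aspx output has been saved as XML text; the helper name and the sample document are hypothetical, and only the patch namespace URI is taken from Sitecore's include files.

```python
import xml.etree.ElementTree as ET

# Fully qualified name of the patch:source attribute
# (patch = http://www.sitecore.net/xmlconfig/).
PATCH_SOURCE = "{http://www.sitecore.net/xmlconfig/}source"


def elements_patched_by(config_xml: str, file_name: str) -> list:
    """Return the tags of elements whose patch:source mentions file_name."""
    wanted = file_name.lower()
    root = ET.fromstring(config_xml)
    return [
        element.tag
        for element in root.iter()
        if wanted in element.get(PATCH_SOURCE, "").lower()
    ]


# Hypothetical, heavily shortened stand-in for saved ShowConfig.aspx output.
SAMPLE = """<sitecore xmlns:patch="http://www.sitecore.net/xmlconfig/">
  <contentSearch>
    <indexUpdateStrategies>
      <syncMaster patch:source="SwitchMasterToWeb.config">
        <param desc="database">web</param>
      </syncMaster>
    </indexUpdateStrategies>
  </contentSearch>
</sitecore>"""

print(elements_patched_by(SAMPLE, "switchmastertoweb.config"))
```

An empty list for your own configuration would suggest that the patch file is not being picked up at all.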
You need to remove the master index from your content delivery server to get rid of these log entries.
In a Sitecore 7.2 solution, in my SwitchMasterToWeb.config file I have the following patch:
<?xml version="1.0" encoding="utf-8" ?>
<configuration xmlns:patch="http://www.sitecore.net/xmlconfig/" xmlns:set="http://www.sitecore.net/xmlconfig/set/">
<sitecore>
<search>
<configuration>
<indexes>
<index>
<locations>
<master>
<patch:delete />
</master>
</locations>
</index>
</indexes>
</configuration>
</search>
<contentSearch>
<indexUpdateStrategies>
<intervalAsyncCore type="Sitecore.ContentSearch.Maintenance.Strategies.IntervalAsynchronousStrategy, Sitecore.ContentSearch">
<patch:delete/>
</intervalAsyncCore>
<intervalAsyncMaster type="Sitecore.ContentSearch.Maintenance.Strategies.IntervalAsynchronousStrategy, Sitecore.ContentSearch">
<patch:delete/>
</intervalAsyncMaster>
<syncMaster type="Sitecore.ContentSearch.Maintenance.Strategies.SynchronousStrategy, Sitecore.ContentSearch">
<param desc="database">web</param>
</syncMaster>
</indexUpdateStrategies>
</contentSearch>
<!-- other patching configurations -->
</sitecore>
</configuration>

Datanucleus schema generation ignores "inheritance strategy=" directive

I'm working with the Datanucleus tutorial application for JDO, specifically this one.
Regardless of which "inheritance strategy" I try, the table layout is the same. I would like two tables, one for PRODUCT and one for BOOK, but with the configuration below I only get the PRODUCT table, with columns for both class Product and class Book.
<class name="Product" identity-type="sequence">
<inheritance strategy="complete-table"/>
<field name="name">
<column name="PRODUCT_NAME" length="100" jdbc-type="VARCHAR"/>
</field>
<field name="description">
<column length="255" jdbc-type="VARCHAR"/>
</field>
</class>
<class name="Book" identity-type="sequence">
<field name="author">
<column length="40" jdbc-type="VARCHAR"/>
</field>
<field name="isbn">
<column length="20" jdbc-type="CHAR"/>
</field>
<field name="publisher">
<column length="40" jdbc-type="VARCHAR"/>
</field>
</class>
The directory structure is exactly as in the tutorial, as is the build.xml. I have tried generating the schema via both the Ant task and the command line.
I use the sequence of commands:
ant clean
ant compile
ant enhance
ant createschema
The schema is generated, but not in the form the Datanucleus documentation describes for the inheritance strategy "complete-table".
My target database is PostgreSQL 8.4 running on Ubuntu 10.04 if that matters.
Anyone else run into this issue and found a solution?
To answer my own question:
In the datanucleus tutorial download, the build.xml file given has a "createschema" target like:
<target name="createschema">
...
<schematool ...>
<fileset dir="${basedir}/target/classes">
<include name="**/*.class"/>
</fileset>
...
</schematool>
</target>
It should be changed to include all .jdo files as shown below:
<target name="createschema">
...
<schematool ...>
<fileset dir="${basedir}/target/classes">
<include name="**/*.class"/>
<include name="**/*.jdo"/>
</fileset>
...
</schematool>
</target>
In addition the package-hsql.orm file needs to be renamed to package-hsql.jdo and its header needs to be changed to:
<?xml version="1.0"?>
<!DOCTYPE jdo PUBLIC
"-//Sun Microsystems, Inc.//DTD Java Data Objects ORM Metadata 2.0//EN"
"http://java.sun.com/dtd/orm_2_0.dtd">
<jdo>
...
</jdo>
Notice that the DOCTYPE and root element were changed. The root element was "orm" and changed to "jdo".
Once I made these changes the schema generation tool followed the "inheritance strategy" directive.
For my custom application, I had a similar issue, and it worked fine after making the changes in the header of the jdo file. I am using version 3.2.9.

How do I get my custom Maven Reporting Plugin to appear under the Project Reports in the generated site?

I have written a new Maven reporting plugin. It executes when the site is generated and produces an output HTML file.
How do I get the Project Reports section of the site to contain a link to it?
As a workaround you could manually re-define the project reports section in a custom site.xml:
<!-- Inherit this menu for sub modules. -->
<menu name="Module Documentation" inherit="top" >
<item name="Overview" href="index.html" >
</item>
<item name="Information" href="project-info.html" >
<item name="Dependencies" href="dependencies.html" />
<item name="Project Plugins" href="plugins.html" />
<item name="Project Summary" href="project-summary.html" />
...
</item>
<item name="Reports" href="project-reports.html" >
<item name="Checkstyle" href="checkstyle.html" />
<item name="Your Report" href="your-report.html" />
...
</item>
</menu>
The drawbacks are:
You have to specify every single report.
You have to provide project-reports.html as well (e.g. via project-reports.apt), and you will lose the automatically generated overview page with all reports.
If some modules do not contain all reports, you have to provide individual site.xml files for them to override the parent site.xml.
See Configuring the Site Descriptor in the documentation of the site plugin.