I've got Solr 9.0 running with a request handler set up for Tika per https://solr.apache.org/guide/solr/latest/indexing-guide/indexing-with-tika.html.
If I pass in a PDF document (that is, a text document stored as a PDF), I get the expected result of being able to query for the content of the document.
If, however, I pass in a PDF that is an image (a scanned page from a newsletter, saved as a PDF), no OCR takes place. I'm using SolrJ to communicate with the Solr install.
I also tried indexing the PDF after exporting it as a PNG. This worked when testing against a locally running Tika, but not with Solr.
import org.apache.solr.client.solrj.impl.HttpSolrClient
import org.apache.solr.client.solrj.impl.XMLResponseParser
import org.apache.solr.client.solrj.request.AbstractUpdateRequest.ACTION
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest
import java.io.File

fun index(file: File) {
    val urlString = "http://localhost:8983/solr/films"
    val solr = HttpSolrClient.Builder(urlString).build()
    solr.parser = XMLResponseParser()
    val req = ContentStreamUpdateRequest("/update/extract")
    // I've tried both "image/pdf" and "application/pdf"
    req.addFile(file, "image/pdf")
    req.setParam("literal.id", file.name)
    req.setAction(ACTION.COMMIT, true, true)
    val result = solr.request(req)
    println("Result: $result")
}
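For context, I invoke it with the scanned PDF, e.g.:

fun main() {
    index(File("285 October-5.pdf"))
}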
solrconfig.xml
<requestHandler name="/update/extract"
startup="lazy"
class="solr.extraction.ExtractingRequestHandler" >
<lst name="defaults">
<str name="lowernames">true</str>
<str name="uprefix">ignored_</str>
<!-- capture link hrefs but ignore div attributes -->
<str name="captureAttr">true</str>
<str name="fmap.a">links</str>
<str name="fmap.content">_text_</str>
</lst>
</requestHandler>
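From the Solr Cell docs it looks like Tika parser settings can be passed to the handler via a parseContext.config parameter; I assume switching on inline-image extraction for the PDF parser would mean adding something like this to the defaults above (untested):

<str name="parseContext.config">parseContext.xml</str>

with a parseContext.xml next to solrconfig.xml along these lines:

<entries>
  <entry class="org.apache.tika.parser.pdf.PDFParserConfig"
         impl="org.apache.tika.parser.pdf.PDFParserConfig">
    <property name="extractInlineImages" value="true"/>
  </entry>
</entries>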
I decided to test in isolation with Tika, so I started the Docker container:
docker run -it \
--name tika-server-ocr \
-d \
-p 9998:9998 \
apache/tika:1.24-full
If I pass the file in as a PDF, it does not work:
curl -T "285 October-5.pdf" http://localhost:9998/tika
If I pass in an exported png from the PDF, it does work:
curl -T "285 October-5 copy.png" http://localhost:9998/tika
NEGOTIATING GARRY'S ANCHORAGE
Garry's Anchorage is a popular rest spot on the western
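On the Tika server side, it looks like OCR of images embedded in a PDF has to be requested explicitly; the docs mention per-request PDF parser headers, so something like this may be worth trying (I have not confirmed it against 1.24):

curl -T "285 October-5.pdf" http://localhost:9998/tika \
  -H "X-Tika-PDFextractInlineImages: true"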
I'm guessing there is a bit of config, or perhaps a parameter I need to send to Solr during the extraction?
Related
I am trying to upgrade Solr 4.7 to Solr 5.5.5.
In Solr 4.7, solr.xml contained <str name="sharedLib">common</str>, so configuration files such as solrconfig.xml were loaded from the path '$SOLR_HOME/solr/common/conf' instead of from each core.
Now, after upgrading to Solr 5.5.5, I put in the same 'sharedLib', but after starting Solr I see errors in the UI for each core:
mycore_1: org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: Could not load conf for core mycore_1: Error loading solr config from ($SOLR_HOME)/solr/mycore_1/conf/solrconfig.xml
mycore_2: org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: Could not load conf for core mycore_2: Error loading solr config from ($SOLR_HOME)/solr/mycore_2/conf/solrconfig.xml
remote_shared_instance: org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: Could not load conf for core remote_shared_instance: Error loading solr config from ($SOLR_HOME)/solr/remote_shared_instance/conf/solrconfig.xml
It seems that Solr searched for the configuration files in each core's folder instead of in the shared folder.
solr.xml is placed in the right folder ($SOLR_HOME), because when I delete this file I get the message 'Solr home directory $SOLR_HOME/solr must contain a solr.xml file!' when starting Solr.
I start Solr from $SOLR_HOME, which is different from the Solr installation folder.
This is how the 'solr' section of the solr.xml file looks:
<solr>
  <solrcloud>
    <str name="host">${host:}</str>
    <int name="hostPort">${jetty.port:8983}</int>
    <str name="hostContext">${hostContext:solr}</str>
    <int name="zkClientTimeout">${zkClientTimeout:30000}</int>
    <bool name="genericCoreNodeNames">${genericCoreNodeNames:true}</bool>
  </solrcloud>
  <shardHandlerFactory name="shardHandlerFactory"
                       class="HttpShardHandlerFactory">
    <int name="socketTimeout">${socketTimeout:0}</int>
    <int name="connTimeout">${connTimeout:0}</int>
  </shardHandlerFactory>
  <str name="sharedLib">common</str>
</solr>
I also tried using configsets: I created a new folder 'solr/solr/configsets/common/conf' under $SOLR_HOME with conf files, but how can I tell Solr to refer to this folder? Note that the original 'configsets' folder is under the Solr installation folder 'solr-5.5.5/server/solr/configsets' and not under $SOLR_HOME.
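Based on my reading of the config sets documentation, pointing a core at a shared configset would look roughly like this (untested; 'common' is the shared folder from above):

# core.properties for mycore_1
name=mycore_1
configSet=common

with solr.xml optionally overriding where configsets are looked up:

<str name="configSetBaseDir">${solr.solr.home}/configsets</str>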
Have I missed something?
Thanks!
Mike
I want to index from two different databases, so I made two data-config.xml files with different names.
In solrconfig.xml I configured two request handlers with the DataImportHandler:
<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
<lst name="defaults">
<str name="config">data-config-847.xml</str>
</lst>
<requestHandler name="/dataimport857" class="org.apache.solr.handler.dataimport.DataImportHandler">
<lst name="defaults">
<str name="config">data-config-857.xml</str>
</lst>
But it does not work. I used the same configuration in Solr 4.7 and it worked without problems. What is different between Solr 4.7 and Solr 6.0? Or how should it work?
It is probably SOLR-8993 affecting the new Admin UI.
Workarounds:
Use the legacy Admin UI, accessible through a link at the top of the screen.
Pass the config value as a URL parameter by invoking the DIH URL directly and not via the Admin UI. The defaults section is just that - defaults that can be overridden with URL parameters.
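For example, for the second handler from the question (the core name is a placeholder):

curl "http://localhost:8983/solr/mycore/dataimport857?command=full-import&config=data-config-857.xml"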
Noob Solr question
I am trying to set up Solr, and to help I have been using the Apache Solr installer from Bitnami.
This installs Solr 5.4.
I have gone and created a new core, and everything looks good. However, when I restart Solr, the core I have just created is lost.
I have not altered any configuration from what Bitnami installs.
I have been reading up on how Solr 5 core discovery works, and I am sure that everything is correct.
This is a copy of my solr.xml file from C:\Bitnami\solr-5.4.0-0\apache-solr\solr
<?xml version="1.0" encoding="UTF-8" ?>
<solr>
  <solrcloud>
    <str name="host">${host:}</str>
    <int name="hostPort">${jetty.port:8983}</int>
    <str name="hostContext">${hostContext:solr}</str>
    <bool name="genericCoreNodeNames">${genericCoreNodeNames:true}</bool>
    <int name="zkClientTimeout">${zkClientTimeout:30000}</int>
    <int name="distribUpdateSoTimeout">${distribUpdateSoTimeout:600000}</int>
    <int name="distribUpdateConnTimeout">${distribUpdateConnTimeout:60000}</int>
  </solrcloud>
  <shardHandlerFactory name="shardHandlerFactory"
                       class="HttpShardHandlerFactory">
    <int name="socketTimeout">${socketTimeout:600000}</int>
    <int name="connTimeout">${connTimeout:60000}</int>
  </shardHandlerFactory>
</solr>
And I have checked: in the core folder I created, there is a core.properties file in the conf folder. These are the contents of the file:
#Written by CorePropertiesLocator
#Tue Dec 22 10:37:24 UTC 2015
name=sitecore_analytics_index
config=solrconfig.xml
schema=schema.xml
dataDir=data
loadOnStartup=true
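For reference, this is the layout I understood core discovery to expect, with core.properties in the core's root folder:

$SOLR_HOME/
  solr.xml
  sitecore_analytics_index/
    core.properties
    conf/
      solrconfig.xml
      schema.xml
    data/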
So I cannot understand why the core is not being discovered. Any help greatly appreciated.
PS: I am doing this on Windows and not *nix.
I am experimenting with Apache Nutch 2.3.
When I ran the parse command for the URL http://comptuergodzilla.blogspot.com, Nutch parsed the content correctly. I mean I got all the outlinks and content in the ol and p column families, respectively.
But when I did the same for the URL http://goal.com/en-india, it was not able to parse the site's outlinks and content.
What makes me scratch my head is that after running the parsechecker command for the URL http://www.goal.com/en-india, I get all the parsed content and outlinks.
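The commands were roughly the following (the batch id stands in for the one from my crawl):

bin/nutch parse <batchId>
bin/nutch parsechecker http://www.goal.com/en-india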
Regarding the above, my questions are:
i. Why is the parse command not working? It should work if parsechecker parses the URL correctly.
ii. Do I have to build a separate HTMLParser plugin to achieve the above?
Partial text is not rendering correctly on a site.
From browser:
From phantomjs:
Machine: Fedora 64-bit
PhantomJS version: 1.9.7
Dependencies already installed:
yum install urw-fonts
sudo yum install fontconfig freetype libfreetype.so.6 libfontconfig.so.1 libstdc++.so.6
On another Windows platform it works fine; the issues are only with the Linux environment. What am I missing?
I was having a problem with PhantomJS and the rendering of fonts for special characters (Japanese, Chinese, Hindi, etc.), and this is how I solved it:
Download the corresponding .ttf Unicode font files. There are several free sites on the web offering free Unicode font files for several different character maps.
Save the .ttf files on your server/machine, either in the directory /etc/fonts or in the directory ~/.fonts. The latter is also a Unix directory; if it does not exist, create it. You need it if you have no root privileges (on a shared server, for example).
Run either fc-cache -fv /etc/fonts or fc-cache -fv ~/.fonts/ accordingly. The fc-cache command scans the font directories on the system and builds font information cache files.
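To confirm the new fonts were picked up, you can list what fontconfig now sees and filter for the font's name, e.g.:

fc-list | grep -i "<font name>"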
Then, in the PhantomJS code, always use utf-8, whether for the file system:
var fs = require('fs');

var HTMLfile = fs.open(path + file_name, {
    mode: 'r', // read
    charset: 'utf-8'
});
var content = HTMLfile.read();
HTMLfile.close();
or for a web page:
var webPage = require('webpage');
var page = webPage.create();
var settings = {
    operation: "POST",
    encoding: "utf8",
    headers: {
        "Content-Type": "application/json"
    },
    data: JSON.stringify({
        some: "data",
        another: ["custom", "data"]
    })
};
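The settings object is then passed to page.open; a minimal sketch, with a placeholder URL:

page.open('http://example.com', settings, function (status) {
    console.log('Load status: ' + status);
    phantom.exit();
});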
After a long search, this worked for me.