I've been working with Solr for a while. I recently tried the Solr Cell component and I'm indexing some PDFs; however, I'm having the exact same problem presented in this thread.
When I search for *:* in the admin console, the PDFs are listed. However, when I search for content within the PDF I get no results.
I already tried the command from the answer given there with no luck; I'm still having the same problem. I've tried different Solr versions (I'm using 3.5, by the way), different PDFs, I've changed the fields in the schema.xml, and I've modified the RequestHandlers in solrconfig.xml, but nothing seems to work. Any help would be appreciated.
I finally got it working. It turns out it was a problem with the fmap.content input parameter. I didn't declare it directly on the RequestHandler in the solrconfig.xml file; instead, I was passing it in the curl command I was using to index the PDF file:
curl 'http://localhost:8080/solr/solrcell/update/extract?map.content=text&map.stream_name=id&commit=true' -F "file=@mccm.pdf"
I know this way should work too, but as you can see there was 'map' instead of 'fmap' (I was using a book example from a previous version of Solr).
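For reference, with the parameter names corrected the same command would look something like this (same handler path and file as above):
curl 'http://localhost:8080/solr/solrcell/update/extract?fmap.content=text&fmap.stream_name=id&commit=true' -F "file=@mccm.pdf"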
I opted to leave the fmap input parameter explicitly declared in the solrconfig.xml file to save myself any problems:
<str name="fmap.content">text</str>
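In context, that line sits inside the defaults of the extracting request handler in my solrconfig.xml, roughly like this (the handler name and the other defaults may differ in your setup):
<requestHandler name="/update/extract" class="solr.extraction.ExtractingRequestHandler" startup="lazy">
  <lst name="defaults">
    <str name="fmap.content">text</str>
    <str name="uprefix">ignored_</str>
  </lst>
</requestHandler>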
Thanks for your help.
This is my first time on Stack Overflow. Thanks to all for providing valuable information and helping one another.
I am currently working on Apache Solr 7. There is a POC I need to complete and I have little time, so I am putting this question here. I have set up Solr on my Windows machine, created a core, and uploaded a PDF document using /update/extract from the Admin UI. After uploading, I can see the metadata of the file if I query from the Admin UI using the query button. I was wondering if I can get the actual content of the PDF as well. I can see there is one tlog file generated under /data/tlog/tlog000... with raw PDF data, but not the actual file.
So the questions are:
1. Can I get the PDF content?
2. Does Solr store the actual file somewhere?
a. If it does, where does it store it?
b. If it does not, is there a way to store the file?
Regards,
Munish Arora
Solr will not store the actual file anywhere. Depending on your config, it can store the binary content, though.
Using the extract request handler, Apache Solr relies on Apache Tika[1] to extract the content from the document[2].
So you can search and return the content of the PDF, and a lot of other metadata if you like; a quick sketch follows below the links.
[1] https://tika.apache.org/
[2] https://lucene.apache.org/solr/guide/6_6/uploading-data-with-solr-cell-using-apache-tika.html
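For example, something along these lines should index a PDF and let you search the extracted text afterwards (core name, field name and file name are placeholders, and the target field has to exist in your schema and be stored):
# map the extracted body text to a stored field and index the file
curl "http://localhost:8983/solr/mycore/update/extract?literal.id=doc1&fmap.content=content_txt&commit=true" -F "file=@sample.pdf"
# search the extracted text and return it in the response
curl "http://localhost:8983/solr/mycore/select?q=content_txt:invoice&fl=id,content_txt"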
Problem: Need to convert local HTML (with local images etc.) to PDF from an AIX box running UniVerse 11.2.5 with System Builder.
Current solution: FTP the HTML file over to a Windows server, which converts in batches and sends the e-mail to the destination.
Proposed solution: Do everything on the AIX box, from converting HTML to PDF to sending the e-mail.
Current problem: Unable to find a way to convert local HTML to PDF on the AIX box. I have been trying many different approaches, including trying to install Python 3, but to no avail.
The only really difficult part of the process is getting the HTML to render into a format that will properly lay out your HTML into pages suitable for printing. There is a fair amount of magic that goes on between the HTTP GET and clicking print in a browser window that needs to be accounted for.
I was trying to accomplish something similar many moons ago on AIX but ran into a skill-level/time wall, because I would essentially have had to create a headless browser to render the HTML. It looks like there are now some utilities that you might be able to leverage. I found this recently updated question on Super User that actually got me somewhat excited, especially since I don't use AIX anymore, so precompiled binaries and well-understood, easily obtainable dependencies are something I can actually have in my life.
https://superuser.com/questions/280552/how-can-i-render-a-website-as-an-image-from-the-shell
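For example, if you can get wkhtmltopdf (one of the tools discussed in that thread) or a similar headless renderer built or installed on AIX, the conversion itself becomes a one-liner (file names are placeholders):
wkhtmltopdf /path/to/report.html /path/to/report.pdf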
Good Luck.
There seem to be several questions rolled into this one item.
Converting HTML to PDF is just data manipulation that you could do in BASIC, but writing such code would be a large task. The option you use now, sending it to another system, is valid, but it puts more points of failure into the system. I would think you could find code to do it on the AIX box.
Rocket plans on getting MV Python to work on AIX; this will make converting HTML to PDF much easier, since there are a lot of open source modules.
As for my suggestion of using sockets, that would be if you intend to send the HTML to a service that will return the PDF document.
i.e. Is there a web service for converting HTML to PDF?
Once you have the PDF document, you can either store it in a UniVerse type-19 file, or base64-encode it and store it in a UniVerse hashed file.
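As a rough sketch of that web-service route, assuming curl and openssl are available on the AIX box (the converter URL below is purely hypothetical):
# send the HTML to a hypothetical conversion service and save the returned PDF
curl -s -F "file=@report.html" http://html2pdf.example.com/convert -o report.pdf
# base64-encode it if you want to store it in a hashed file instead of a type-19 file
openssl base64 -in report.pdf -out report.b64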
Hope this helps,
Mike
We have custom content types that were created as extensions of the ATTypes: two of them extend the ATFile type and one extends the ATImage type. We recently upgraded from Plone 4.2 to Plone 4.3.2 and just discovered we are not using blob storage at all. No wonder our Data.fs is huge. So I have been trying to migrate these custom types.
I have followed all of the steps explained in this example and the product's notes from pypi, these Plone instructions, and used the example from the pypi page for archetypes.schemaextender (Sorry, since I'm still a noob my reputation won't let me post more than 2 links).
In the end, I created an extender script that just extends the ATFile type changing the FileField to BlobField. It seems to be working for new items. I can add a new CustomFileType and it appears to be uploading the file to blob, and my new upload field is showing (I changed the description as a quick way to verify which one it was using).
However, I am having a problem migrating all existing content items to move the binary files over to blob storage. I tried the generic migrate() script, then I created my own migrator and walker as suggested in the above resources. It doesn't seem like it is doing anything, though. When printing results for each item it tries to migrate, I do see this returned for each item:
DEBUG ATCT.migration Migrating /site/path/to/custom/file/filename.ext (CustomFile -> Blob)
When I navigate to the custom file type in the site, where it usually shows the link to the file, it is just empty. Then going to edit, it treats it as if there is no file there. As a check, I disabled the extender, restarted, and reloaded the custom file. The file was there now. So it looks like the script I am running just isn't moving that file over to where it should be now.
I feel like I am missing something simple, and it is right there, but I can't seem to find it. All of this is learn as I go and a bit over my head, so hopefully someone can easily set me straight.
If I need to provide any additional information leave a comment and I will try to provide what you need.
UPDATE
I used the Red Turtle objects as examples to migrate my custom types, as suggested by keul. I still was not able to get the file to migrate to blob within the type itself. So I tried a different approach: I created a new custom type, "CustomBlob", that mimics the setup of my CustomFile type, and made only this new type blob-aware. Then I migrated the CustomFiles to CustomBlob, did a complete clear and rebuild, and packed the ZEO. The migration seemed to work for the most part: the blobstorage grew by an expected amount and the new types worked. However, the Data.fs didn't go down in size. I would have thought that the binary files stored in Data.fs would be removed during the migration. Am I understanding this incorrectly? How can I remove these files so the Data.fs size goes down appropriately?
Not sure if this is the best solution, but here is how I was able to get this to work.
I created temporary content types parallel to each type (for CustomImage I made CustomImageBlob, and so on). I made the new types blob-aware only, migrated all types to their parallel blob types, then enabled the extender for the original types to make them blob-aware and migrated back. It is a little redundant and time consuming, but I just could not get the files to migrate to blob storage when migrating a type to itself.
Providing this as the best answer so far in case it helps someone else, or might encourage someone to find a better solution. Thanks for the tip keul, it definitely helped me get to this solution.
I am creating the index using Lucene, but I am using the Solr search engine to search.
My problem is that while indexing I add each field in my code, using something like doc.add(fieldname, val, tokenized). But my code does not see the schema file; even copy fields I have to add manually.
Now I want to use the autosuggest feature of Solr, and I do not know how to enable this feature while creating the index.
When I use the SimplePostTool to post the data through Solr, everything is fine, but I cannot do that because I have to get some text from different URLs.
So can someone please advise me how I can achieve this? Sample code would be very helpful. In any case, if I could have some code that can see the schema file and use the fieldTypes, it would be great.
Thanks everyone.
--pramila
See EmbeddedSolrServer at:
http://wiki.apache.org/solr/Solrj
It's a pure Java API to Solr, which will allow you to index your documents using the schema.xml you have defined.
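A minimal sketch of that approach, assuming the Solr 1.4/3.x-era SolrJ API described on that wiki page and a Solr home directory containing your schema.xml (paths, core name and field names are placeholders):
import org.apache.solr.client.solrj.embedded.EmbeddedSolrServer;
import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.core.CoreContainer;

public class EmbeddedIndexer {
    public static void main(String[] args) throws Exception {
        // point SolrJ at the Solr home that holds conf/schema.xml and conf/solrconfig.xml
        System.setProperty("solr.solr.home", "/path/to/solr/home");
        CoreContainer.Initializer initializer = new CoreContainer.Initializer();
        CoreContainer container = initializer.initialize();
        EmbeddedSolrServer server = new EmbeddedSolrServer(container, "core1");

        // fields are analyzed and copyFields are applied exactly as schema.xml defines them
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc-1");
        doc.addField("text", "text fetched from one of your urls");
        server.add(doc);
        server.commit();

        container.shutdown();
    }
}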
How can I add a new topic to TWiki programmatically?
I've got a working TWiki (http://twiki.org/) installation; everything works fine.
I need to find a way to create and add new Topics through the command line, programmatically.
Any ideas how this can be accomplished?
Thanx!
d.
What I did was:
1. take a look at some existing wiki pages (files in <twiki-home>/data/Main/<PageName>.txt) to figure out the file/text format (pretty much what you see in the browser, preceded by one line of meta info)
2. generate that text format with a Perl script, with content based on data from a DB or some Excel sheet (a minimal sketch follows below)
3. copy the files to the appropriate location using PuTTY's pscp on Windows
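A stripped-down sketch of the generating step, with the path, web and topic name as placeholders and the meta line modelled on what existing topics contain (check one of yours for the exact attributes your installation expects):
# write one topic file in the format observed under <twiki-home>/data/<Web>/
open(my $fh, '>', '/path/to/twiki/data/Main/GeneratedTopic.txt') or die "cannot write topic: $!";
print $fh '%META:TOPICINFO{author="ScriptUser" date="' . time() . '" format="1.1" version="1"}%' . "\n";
print $fh "---+ Generated Topic\n\nContent pulled from the database or spreadsheet goes here.\n";
close($fh);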
I think using TWiki scripts is a cleaner way, as you wouldn't have to worry about the metadata in the TXT file or updating the .changes file.
Just use wget to make a POST call using the 'save' script (see documentation here)
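Something along these lines, with the host, web, topic and credentials as placeholders (the text value must be URL-encoded, and your TWiki may require authentication or extra hidden fields, so check an existing edit form):
wget --http-user=WikiUser --http-password=secret \
     --post-data 'text=Hello%20from%20the%20command%20line&topicparent=WebHome' \
     'http://twiki.example.com/bin/save/Main/NewTopic' -O /dev/null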