Can Apache Solr store the actual files that are uploaded to it?

This is my first time on Stack Overflow. Thanks to all for providing valuable information and helping one another.
I am currently working on Apache Solr 7. I need to complete a POC on a tight schedule, so I am posting this question here. I have set up Solr on my Windows machine, created a core, and uploaded a PDF document using /update/extract from the Admin UI. After uploading, I can see the metadata of the file if I query from the Admin UI using the Query button. I was wondering if I can get the actual content of the PDF as well. I can see that one tlog file gets generated under /data/tlog/tlog000... containing raw PDF data, but not the actual file.
So the questions are:
1. Can I get the PDF content?
2. Does Solr store the actual file somewhere?
a. If it does, where does it store it?
b. If it does not, is there a way to store the file?
Regards,
Munish Arora

Solr will not store the actual file anywhere.
Depending on your config, it can store the binary content, though.
When you use the extract request handler, Apache Solr relies on Apache Tika [1] to extract the content from the document [2].
So you can search and return the content of the PDF, and a lot of other metadata, if you like.
[1] https://tika.apache.org/
[2] https://lucene.apache.org/solr/guide/6_6/uploading-data-with-solr-cell-using-apache-tika.html
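As a minimal sketch of both approaches (assuming Solr at http://localhost:8983, a core named mycore, and a stored schema field content_txt; the host, core, field, and file names are all placeholders, not something Solr gives you by default):

```python
import requests

SOLR = "http://localhost:8983/solr/mycore"   # assumed host and core name

# 1) Ask Solr Cell to run Tika and return the extracted text without indexing it.
with open("sample.pdf", "rb") as f:
    resp = requests.post(
        f"{SOLR}/update/extract",
        params={"extractOnly": "true", "extractFormat": "text", "wt": "json"},
        data=f,
        headers={"Content-Type": "application/pdf"},
    )
print(resp.json().keys())   # extracted body and Tika metadata come back in the response

# 2) Or index the PDF and keep the extracted text in a stored field.
#    "content_txt" is an assumed field name; it must exist in your schema and be stored.
with open("sample.pdf", "rb") as f:
    requests.post(
        f"{SOLR}/update/extract",
        params={"literal.id": "doc1", "fmap.content": "content_txt", "commit": "true"},
        data=f,
        headers={"Content-Type": "application/pdf"},
    )

# 3) Query the stored text back.
hit = requests.get(
    f"{SOLR}/select",
    params={"q": "id:doc1", "fl": "id,content_txt", "wt": "json"},
).json()
print(hit["response"]["docs"])
```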

Related

Where does the tool txt2pdf.exe store its configuration settings?

I am using txt2pdf to convert text files to PDFs. It's been working great, but I got a new PC and I can't get it to retain the settings for lines per page. I don't see any contact information on their web site.
https://www.sanface.com/txt2pdf.html
Does anyone know where older programs might store their data?
I found it using a system file watcher:
C:\Users\[user]\AppData\Local\VirtualStore\Windows\win.ini

AEM (Adobe Experience Manager) Indexed PDF Search Results

My employer has recently switched its CMS to AEM (Adobe Experience Manager).
We store a large amount of documentation, and our site users need to be able to find the information contained within those documents, some of which are hundreds of pages long.
Adobe is, disappointingly, saying that their search tool will not search PDFs. Is there any format for producing or saving PDFs that allows the content to be indexed?
I think you need to configure an external index/search tool like Apache Solr and use a REST endpoint to sync DAM data and fetch results for queries.
Out of the box, AEM indexes most binary formats without needing Solr. You only need Solr in advanced scenarios, such as exposing search outside of authoring or handling millions of assets.
When an asset is uploaded to the AEM DAM, it goes through the DAM Asset Update workflow, which has a Metadata Processor step. That step extracts content from the asset, so "binary" assets like Word documents, Excel files, and PDFs become searchable. As long as you have the DAM Asset Update workflow enabled, you will be fine.
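If you want to verify that the extracted text is searchable, here is a rough sketch using the QueryBuilder JSON servlet; the host, credentials, and search term are placeholders, and the servlet may be locked down outside of the author instance:

```python
import requests

AEM = "http://localhost:4502"     # assumed author instance
AUTH = ("admin", "admin")         # placeholder credentials

# Full-text search over DAM assets; this matches the text Tika extracted
# from PDFs, Word documents, spreadsheets, etc.
params = {
    "path": "/content/dam",
    "type": "dam:Asset",
    "fulltext": "quarterly report",   # placeholder search term
    "p.limit": "20",
}
resp = requests.get(f"{AEM}/bin/querybuilder.json", params=params, auth=AUTH)
for hit in resp.json().get("hits", []):
    print(hit.get("path"))
```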

Azure Data Factory HTTP Connector Data Copy - Bank Of England Statistical Database

I'm trying to use the HTTP connector to read a CSV of data from the BoE statistical database.
Take the SONIA rate for instance.
There is a download button for a CSV extract.
I've converted this to the following URL which downloads a CSV via web browser.
https://www.bankofengland.co.uk/boeapps/database/_iadb-fromshowcolumns.asp?csv.x=yes&Datefrom=01/Dec/2021&Dateto=01/Dec/2021 &SeriesCodes=IUDSOIA&CSVF=TN&UsingCodes=Y
Putting this in the Base URL it connects and pulls the data.
I'm trying to split this out so that I can parameterise some of it.
Base
https://www.bankofengland.co.uk/boeapps/database
Relative
_iadb-fromshowcolumns.asp?csv.x=yes&Datefrom=01/Dec/2021&Dateto=01/Dec/2021 &SeriesCodes=IUDSOIA&CSVF=TN&UsingCodes=Y
It won't fetch the data; however, when it's all combined in the base URL, it does.
I've tried to add a "/" at the start of the relative URL as well and that hasn't worked either.
According to the documentation, ADF inserts the "/" for you: "[Base]/[Relative]".
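For reference, this is roughly how I'd expect the base/relative split and the parameterised query values to combine outside of ADF; a small sketch assuming the endpoint accepts percent-encoded values, with the dates and series code pulled out as the parameters worth templating:

```python
import requests

# Base and relative pieces, mirroring the linked service / dataset split in ADF.
base_url = "https://www.bankofengland.co.uk/boeapps/database"
relative_path = "/_iadb-fromshowcolumns.asp"

# The dates and series code are the values worth parameterising.
params = {
    "csv.x": "yes",
    "Datefrom": "01/Dec/2021",
    "Dateto": "01/Dec/2021",
    "SeriesCodes": "IUDSOIA",
    "CSVF": "TN",
    "UsingCodes": "Y",
}

# requests percent-encodes the values, so no stray spaces end up in the query string.
resp = requests.get(base_url + relative_path, params=params)
print(resp.status_code, resp.headers.get("Content-Type"))
print(resp.text[:200])
```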
Does anyone know what I'm doing wrong?
Thanks,
Dan
I don't see a way you could download that data directly as a CSV file. The data seems to be meant to be copied manually from the site, using their Save as option.
They have used read-only blocks and hidden elements, so I doubt there is any easy or out-of-the-box method within the ADF Web activity to help with this.
You can just manually copy and paste the data into a CSV file.

Apache ManifoldCF TIKA

I am trying to extract the text content of a PDF using the Apache Tika integration in Apache ManifoldCF, in order to ingest some PDF files from my laptop into an Elasticsearch server.
After properly creating the Tika transformer and configuring it inside my job, I see that the resulting field "_content" in ES is filled with the binary encoding of the file, not the text.
I also saw this: Extract file content with ManifoldCF, but still no answer has been provided (since 2015!).
Can anybody help me?
Thanks!
In the Elasticsearch output connector, what field name have you specified for the content field?
Please provide a field name as well as a maximum document size.
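As a quick sanity check, you could pull one document back from Elasticsearch and look at the content field directly; a rough sketch assuming Elasticsearch on localhost:9200 and an index named manifoldcf (both placeholders), with "_content" as the field configured in the output connector:

```python
import requests

ES = "http://localhost:9200"   # assumed Elasticsearch endpoint
INDEX = "manifoldcf"           # placeholder index name

# Fetch one indexed document and inspect its content field.
resp = requests.get(f"{ES}/{INDEX}/_search", params={"size": 1})
hits = resp.json()["hits"]["hits"]
if hits:
    content = hits[0]["_source"].get("_content", "")
    print(content[:200])   # readable text means Tika extraction worked;
                           # a base64-looking blob means the raw binary was indexed
else:
    print("no documents indexed yet")
```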

iPad - how should I distribute offline web content for use by a UIWebView in my application?

I'm building an application that needs to download web content for offline viewing on an iPad. At present I'm loading some web content from the web for test purposes and displaying this with a UIWebView. Implementing that was simple enough. Now I need to make some modifications to support offline content. Eventually that offline content would be downloaded in user selectable bundles.
As I see it, I have a number of options, but I may have missed some:
Pack content in a ZIP (or other archive) file and unpack the content when it is downloaded to the iPad.
Put the content in a SQLite database. This seems to require some 3rd party libs like FMDB.
Use Core Data. From what I understand this supports a number of storage formats including SQLite.
Use the filesystem and download each required file individually. OK, not really a bundle but maybe this is the best option?
Considerations/Questions:
What are the storage limitations and performance limitations for each of these methods? And is there an overall storage limit per iPad app?
If I'm going to have the user navigate through the downloaded content, what option is easier to code up?
It would seem like spinning up a local web server would be one of the most efficient ways to handle the runtime aspects of displaying the content. Are there any open source examples of this which load from a bundle like options 1-3?
The other side of this is the content creation and it seems like zipping up the content (option 1) is the simplest from this angle. The other options would appear to require creation of tools to support the content creator.
If you have control over the content, I'd recommend a mix of the first and the third options. If the content is created by you (like levels, etc.), then simply store it on the server, download it as a zip, and store it locally. Use Core Data to store an index of the things you've downloaded, like the path of the folder each item is stored in and its name/origin/etc., but not the raw data. Databases are not meant to hold massive amounts of raw content; they are meant to hold structured data. And even if they can, I wouldn't do so.
For your considerations:
Disk space is the only limit I know of on the iPad. However, databases tend to get slower as they grow large. If you rarely scan through the data, using the file system directly may prove faster and cheaper.
The index in Core Data could store all relevant data. You will have very easy and very quick access. Opening a content item will load it from the file system, which is quick, cheap, and doesn't strain the index.
Why would you do that? Redirecting your web view to a file:// URL will have the same effect, won't it?
Should be answered by now.
If you don't have control, then use the same approach as above but download each file separately, as suggested in option four. After unzipping, both cases are basically the same.
Please get back if you have questions.
You could create an XML file for each bundle, containing the path to each file in the bundle, and place it in a folder common to each bundle. When downloading, download and parse the XML first, then download each resource one by one. This will spare you the overhead of zipping and unzipping the content. Create a folder for each bundle locally and recreate the folder structure of the bundle there. This way the content will work both online and offline without changes.
With a little effort, you could even keep track of file versions by including a version number in the XML file for each resource, so if your content has been partially updated, only the files with changed version numbers have to be downloaded again.
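As a rough illustration of the content-creation side, here is a small build-side sketch of generating such a manifest and comparing versions; the element and attribute names are purely illustrative, not any standard format:

```python
import os
import xml.etree.ElementTree as ET

def build_manifest(bundle_dir, out_path="manifest.xml"):
    """Walk a bundle folder and write an XML manifest listing each file with a version."""
    root = ET.Element("bundle", name=os.path.basename(bundle_dir))
    for dirpath, _, filenames in os.walk(bundle_dir):
        for name in filenames:
            rel = os.path.relpath(os.path.join(dirpath, name), bundle_dir)
            ET.SubElement(root, "resource", path=rel, version="1")
    ET.ElementTree(root).write(out_path, encoding="utf-8", xml_declaration=True)

def changed_resources(manifest_path, local_versions):
    """Return the paths whose version in the manifest differs from the local copy."""
    root = ET.parse(manifest_path).getroot()
    return [
        r.get("path")
        for r in root.findall("resource")
        if local_versions.get(r.get("path")) != r.get("version")
    ]
```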