AEM (Adobe Experience Manager) Indexed PDF Search Results - pdf

My employer has recently switched its CMS to AEM (Adobe Experience Manager).
We store a large amount of documentation, and our site users need to be able to find the information contained within those documents, some of which are hundreds of pages long.
Disappointingly, Adobe says its search tool will not search PDFs. Is there any format for producing or saving PDFs that allows the content to be indexed?

I think you need to configure an external index/search tool such as Apache Solr and use a REST endpoint to sync DAM data and fetch results for queries.

Out of the box, AEM indexes most binary formats without needing Solr. You only need Solr in advanced scenarios, such as exposing search outside of authoring or handling millions of assets.
When an asset is uploaded to the AEM DAM, it goes through the DAM asset workflow, which includes a metadata-processor step. That step extracts the content from the asset, so "binary" assets like Word documents, Excel spreadsheets, and PDFs become searchable. As long as you have the DAM Update Asset workflow enabled, you will be fine; a quick way to verify is a full-text query, as in the sketch below.
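To sanity-check that the extracted text is actually searchable, you can run a full-text query against the QueryBuilder JSON servlet. A minimal sketch in Python; the host, credentials, and search term are assumptions you'd adapt to your own author instance:

```python
# A minimal sketch of a full-text DAM query against AEM's QueryBuilder JSON
# servlet. Host, credentials, and the search term are assumptions for a
# local author instance.
import requests

params = {
    "path": "/content/dam",   # search under the DAM
    "type": "dam:Asset",      # assets only
    "fulltext": "invoice",    # a term expected inside a PDF's extracted text
    "p.limit": "10",
}
resp = requests.get("http://localhost:4502/bin/querybuilder.json",
                    params=params, auth=("admin", "admin"))
resp.raise_for_status()
for hit in resp.json().get("hits", []):
    print(hit.get("path"))
```

If the uploaded PDFs show up in the hits for a term that only occurs inside them, the workflow extraction is working.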

Related

How to compare content between two web pages in different environments?

We are in the process of rebuilding a website from scratch based on an existing site. The new site is meant to be an identical copy, and since it contains many pages we need a way to compare content between the two sites. It is of course possible to do this manually, but that takes a lot of time and carries a risk of human error.
I have seen services that offer this: you input two URLs, they are analyzed, and discrepancies are presented. However, these cannot be used because our test environment is local (built in Sitecore).
Is there a way to solve this without making our test environment available online (which is not possible)? For example, does software exist for this, or alternatively a service that can compare a page that is online with one that is local?
Note that we're only looking for content comparison (not visual).
(Un)fortunately there are many ways to do this, but some of them are simple.
What I would do is:
Get a list of URLs for each site. If the sitemap is exhaustive you could use that; if not, you might want to run some Sitecore PowerShell to build the lists.
Given the lists (from files, the Sitecore API, or similar), write a program that visits each URL, grabs the text of the page after it has finished rendering, and saves it to disk; something like Selenium is good for this, and you can use any language (see the sketch after this list). You'll want a folder structure like host/urlpart/urlpart/pagename.txt, essentially mirroring your content tree.
Use a filesystem diff tool like WinMerge to compare the two folders.
This is quick and dirty, but a good place to start.
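Here's a rough Python/Selenium sketch of step 2. The input file (urls.txt) and output root (dump/) are made-up names for illustration:

```python
# A rough sketch of step 2: visit each URL from a list and dump the rendered
# page text into a folder tree that mirrors the URL structure. The input file
# (urls.txt) and output root (dump/) are made-up names.
import pathlib
from urllib.parse import urlparse

from selenium import webdriver
from selenium.webdriver.common.by import By

OUT_ROOT = pathlib.Path("dump")

driver = webdriver.Chrome()  # any WebDriver works
for url in pathlib.Path("urls.txt").read_text().split():
    driver.get(url)
    text = driver.find_element(By.TAG_NAME, "body").text  # rendered text only
    parsed = urlparse(url)
    target = OUT_ROOT / parsed.netloc / parsed.path.strip("/")
    target.parent.mkdir(parents=True, exist_ok=True)
    target.with_suffix(".txt").write_text(text, encoding="utf-8")
driver.quit()
```

Run it once against each environment, then point WinMerge at the two dump folders for step 3.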

How to retrieve files in Domino Web documents to embed them instead of showing them as links?

I have a Notes app that was designed for the browser, not the client. It allowed users to upload files into the documents, so nearly all of the documents have attachments. The files are stored in the NSF as $FILE items and displayed in the documents as links.
I am using Adobe Acrobat Pro to create PDFs from the documents and need to include the file attachments within the PDFs; however, the PDFs include only links to the files, not the attachments themselves. Can I write an agent to run against the documents that retrieves those files and embeds them within the documents? When I view those documents through the client, I see all of the HTML etc., and at the bottom of the document the file attachments appear. When I view the same documents in the browser, the file attachments do not appear. If I could merely ensure that they are there, then when running the PDF generator in Acrobat Pro they would be included in the PDFs and openable.
I am really stuck here, with no other way to 'archive' this Notes database with all the data intact.
Thanks in advance for any insights!!
Ginni
There is a commercial product from Swing Software that does this. I hear that it's quite good, but I've never used it. Let me explain why...
The way I usually end up doing this is just quick-and-dirty. I write an agent to export the files, using the document UNID as part of the filename. The same agent exports all the data fields from the document into a CSV file, and I add a column with the filename of the extracted attachment. In your case, I would add two columns: one for the extracted attachment(s), and one for the generated PDF. The CSV serves as an index for the exported data. It can be imported into something more friendly, or just left as-is and opened in Excel, depending on the customer's usage requirements and available systems. I've recommended Swing Software's product and offered to explore other ideas for developing code (e.g., using wkhtmltopdf against a Domino web app to capture a WYSIWYG rendering based on an HTML crawl) to a couple of clients who needed PDF renderings of Notes documents, but none of them could justify the cost of buying licenses and/or writing the code. Quick and dirty always seems to win, even when retention and eDiscovery considerations are taken into account.

Can Apache Solr store the actual files that are uploaded to it?

This is my first time on Stack Overflow. Thanks to all for providing valuable information and helping one another.
I am currently working with Apache Solr 7. There is a POC I need to complete, and I am short on time, so I am putting this question here. I have set up Solr on my Windows machine. I have created a core and uploaded a PDF document using /update/extract from the Admin UI. After uploading, I can see the metadata of the file if I query from the Admin UI using the query button. I was wondering if I can get the actual content of the PDF as well. I can see one tlog file gets generated under /data/tlog/tlog000... with raw PDF data, but not the actual file.
So the questions are:
1. Can I get the PDF content?
2. Does Solr store the actual file somewhere?
a. If it does, where does it store it?
b. If it does not, is there a way to store the file itself?
Regards,
Munish Arora
Solr will not store the actual file anywhere.
Depending on your config, it can store the binary content, though.
With the extract request handler, Apache Solr relies on Apache Tika [1] to extract the content from the document [2].
So you can search and return the content of the PDF, along with a lot of other metadata if you like.
[1] https://tika.apache.org/
[2] https://lucene.apache.org/solr/guide/6_6/uploading-data-with-solr-cell-using-apache-tika.html
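To make that concrete, here is a minimal Python sketch of posting a PDF to /update/extract and mapping the Tika-extracted body to a stored field so it comes back in queries. The core name, field name, file, and search term are all assumptions:

```python
# A minimal sketch of indexing a PDF through Solr's /update/extract handler so
# the extracted text is both searchable and stored. Core name ("mycore"),
# field name ("content_txt"), file name, and search term are assumptions.
import requests

core_url = "http://localhost:8983/solr/mycore"

params = {
    "literal.id": "doc-1",          # unique key for this document
    "fmap.content": "content_txt",  # map Tika's extracted body to a stored field
    "commit": "true",
}
with open("report.pdf", "rb") as f:
    resp = requests.post(f"{core_url}/update/extract", params=params,
                         files={"file": ("report.pdf", f, "application/pdf")})
resp.raise_for_status()

# The extracted text now comes back in normal queries:
hits = requests.get(f"{core_url}/select",
                    params={"q": "content_txt:budget",
                            "fl": "id,content_txt"}).json()
print(hits["response"]["numFound"])
```

Depending on the configset, the extracted body may otherwise be ignored or indexed into an unstored catch-all field, which would explain why you only saw metadata in your query results.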

Using ElasticSearch and/or Solr as a datastore for MS Office and PDF documents

I'm currently designing a full-text search system where users perform text queries against MS Office and PDF documents, and the result returns a list of documents that best match the query. The user will then be able to select any returned document and view it in MS Word, Excel, or a PDF viewer.
Can I use Elasticsearch or Solr to import the raw binary documents (i.e. .docx, .xlsx, and .pdf files) into their "data store", and then export a document to the user's device on demand for viewing?
Previously, I used MongoDB 2.6.6 to import the raw files into GridFS and the extracted text into a separate collection (the collection had a text index), and that worked fine. However, MongoDB full-text search is quite basic, so I'm now looking at either Solr or Elasticsearch to perform more complex text searching.
Nick
Both Solr and Elasticsearch will index the content of the document. Solr has that built-in, Elasticsearch needs a plugin. Easy either way and both use Tika under the covers.
Neither of them will store the document itself. You can try making them do it, but they are not designed for it and you will suffer.
Additionally, neither Solr nor Elasticsearch is currently recommended as primary storage. They can do it, but it is not as mission-critical for them as it is for, say, a filesystem implementation.
So, I would recommend having the files somewhere else and using Solr/Elasticsearch for searching only. That's where they shine.
I would try the Elasticsearch attachment plugin. Details can be found here:
https://www.elastic.co/guide/en/elasticsearch/plugins/2.2/mapper-attachments.html
https://github.com/elasticsearch/elasticsearch-mapper-attachments
It's built on top of Apache Tika:
http://tika.apache.org/1.7/formats.html
Attachment Type
The attachment type allows indexing different "attachment" type fields
(encoded as base64), for example, Microsoft Office formats, open
document formats, ePub, HTML, and so on (the full list can be found here).
The attachment type is provided as a plugin extension. The plugin is a
simple zip file that can be downloaded and placed under the
$ES_HOME/plugins location. It will be automatically detected and the
attachment type will be added.
Supported Document Formats
HyperText Markup Language
XML and derived formats
Microsoft Office document formats
OpenDocument Format
iWorks document formats
Portable Document Format
Electronic Publication Format
Rich Text Format
Compression and packaging formats
Text formats
Feed and Syndication formats
Help formats
Audio formats
Image formats
Video formats
Java class files and archives
Source code
Mail formats
CAD formats
Font formats
Scientific formats
Executable programs and libraries
Crypto formats
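For illustration, here is a hedged sketch (Python elasticsearch client, ES 2.x era to match the docs linked above) of creating an attachment mapping, indexing a base64-encoded PDF, and searching the extracted text. The index, type, and field names are assumptions, and the plugin must already be installed. Note that mapper-attachments was later deprecated in favor of the ingest-attachment processor in Elasticsearch 5.x.

```python
# A hedged sketch of indexing a PDF via the mapper-attachments plugin
# (Elasticsearch 2.x era). Index, type, and field names are assumptions;
# the plugin must already be installed under $ES_HOME/plugins.
import base64
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

# Map 'file' as an attachment field so Tika extracts its content server-side.
es.indices.create(index="docs", body={
    "mappings": {"doc": {"properties": {"file": {"type": "attachment"}}}}
})

with open("report.pdf", "rb") as f:
    data = base64.b64encode(f.read()).decode("ascii")

es.index(index="docs", doc_type="doc", id="1", body={"file": data})

# Search the extracted text through the generated sub-field:
res = es.search(index="docs", body={
    "query": {"match": {"file.content": "budget"}}})
print(res["hits"]["total"])
```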
A bit late to the party but this may help someone :)
I had a similar problem and some research led me to fscrawler. Description:
This crawler helps to index binary documents such as PDF, OpenOffice, and MS Office.
Main features:
Local file system (or a mounted drive) crawling: index new files,
update existing ones, and remove old ones. Remote file system crawling
over SSH.
REST interface to let you "upload" your binary documents to elasticsearch.
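For instance, a small Python sketch of pushing one document through that REST interface, assuming fscrawler was started with --rest and listens on its default address (the file name is made up):

```python
# A small sketch of uploading one document through fscrawler's REST interface.
# Assumes fscrawler was started with --rest on its default address; the file
# name is an assumption.
import requests

with open("report.pdf", "rb") as f:
    resp = requests.post("http://127.0.0.1:8080/fscrawler/_upload",
                         files={"file": ("report.pdf", f)})
print(resp.json())  # fscrawler echoes back a small JSON status for the upload
```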
Regarding Solr:
If the docs only need to be returned on metadata searches, Solr features a BinaryField field type, to which you can send base64-encoded binary data. Keep in mind that people generally recommend against doing this, as it may bloat your index (RAM requirements/performance); if possible, a setup where you store the files externally (and only the path to the file in Solr) might be a better choice.
If you want Solr to automatically index the text inside the PDF/doc, that's possible with the ExtractingRequestHandler: https://wiki.apache.org/solr/ExtractingRequestHandler
Elasticsearch does store documents (.pdf and .doc files, for instance) in the _source field. It can be used as a NoSQL datastore (like MongoDB).
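As a sketch of what that means in practice, reusing the hypothetical "docs" index from the plugin example above: the base64 body you indexed comes straight back in _source, so the original file can be reconstructed from Elasticsearch alone.

```python
# Reusing the hypothetical "docs" index from the plugin sketch above: the
# base64 body sits in _source, so the original file can be recovered from
# Elasticsearch alone.
import base64
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])
doc = es.get(index="docs", doc_type="doc", id="1")
with open("roundtrip.pdf", "wb") as out:
    out.write(base64.b64decode(doc["_source"]["file"]))
```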

iPad - how should I distribute offline web content for use by a UIWebView in an application?

I'm building an application that needs to download web content for offline viewing on an iPad. At present I'm loading some web content from the web for test purposes and displaying it with a UIWebView. Implementing that was simple enough. Now I need to make some modifications to support offline content. Eventually that offline content would be downloaded in user-selectable bundles.
As I see it I have a number of options but I may have missed some:
Pack content in a ZIP (or other archive) file and unpack the content when it is downloaded to the iPad.
Put the content in a SQLite database. This seems to require some 3rd party libs like FMDB.
Use Core Data. From what I understand this supports a number of storage formats including SQLite.
Use the filesystem and download each required file individually. OK, not really a bundle but maybe this is the best option?
Considerations/Questions:
What are the storage limitations and performance limitations for each of these methods? And is there an overall storage limit per iPad app?
If I'm going to have the user navigate through the downloaded content, what option is easier to code up?
It would seem like spinning up a local web server would be one of the most efficient ways to handle the runtime aspects of displaying the content. Are there any open source examples of this which load from a bundle like options 1-3?
The other side of this is the content creation and it seems like zipping up the content (option 1) is the simplest from this angle. The other options would appear to require creation of tools to support the content creator.
If you have control over the content, I'd recommend a mix of the first and third options. If the content is created by you (levels, etc.), simply store it on the server, download it as a ZIP, and store it locally. Use Core Data to store an index of the things you've downloaded, such as the path of the folder each item is stored in and its name/origin/etc., but not the raw data. Databases are not meant to hold massive amounts of raw content; they are meant to hold structured data. And even if they can, I wouldn't do it.
For your considerations:
Disk space is the only limit I know of on the iPad. However, databases tend to get slower as they grow large. If you only rarely scan through the data, use the file system directly; it may prove faster and cheaper.
The index in Core Data could store all the relevant data. You will have very easy and very quick access. Opening an item will load it from the file system, which is quick, cheap, and doesn't strain the index.
Why would you do that? Redirecting your web view to a file:// URL will have the same effect, won't it?
Should be answered by now.
If you don't have control, do the same as above, but download each file separately, as suggested in option four. After unzipping, both cases are basically the same.
Please get back if you have questions.
You could create an XML file for each bundle, containing the path to each file in the bundle, and place it in a folder common to all bundles. When downloading, fetch and parse the XML first, then download each resource one by one. This spares you the overhead of zipping and unzipping the content. Create a folder for each bundle locally and recreate the bundle's folder structure there. This way the content will work online and offline without changes.
With a little effort, you could even keep track of file versions by including a version number in the XML file for each resource; if your content is partially updated, only the files with changed version numbers have to be downloaded again.
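As a sketch of the content-creation side, here is a small Python script that generates such a manifest for one bundle. The element names are made up, and a content hash stands in for a hand-maintained version number so changed files are detected automatically:

```python
# A content-creation-side sketch: walk a bundle folder and emit the XML
# manifest described above, one <file> entry per resource. Element names and
# the bundle path are assumptions; a content hash serves as the version stamp.
import hashlib
import pathlib
import xml.etree.ElementTree as ET

def build_manifest(bundle_dir: str) -> None:
    bundle = pathlib.Path(bundle_dir)
    root = ET.Element("bundle", name=bundle.name)
    for path in sorted(bundle.rglob("*")):
        if path.is_file() and path.name != "manifest.xml":
            digest = hashlib.sha1(path.read_bytes()).hexdigest()
            ET.SubElement(root, "file",
                          path=str(path.relative_to(bundle)),
                          version=digest)  # changes whenever the file changes
    ET.ElementTree(root).write(bundle / "manifest.xml",
                               encoding="utf-8", xml_declaration=True)

build_manifest("bundles/lesson-one")
```

On the device, comparing each entry's version attribute against the locally cached manifest tells you exactly which files to re-download.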