Indexing PDFs with page numbers in Solr

I'm indexing PDFs with Solr using the ExtractingRequestHandler. I would like to display the page number along with hits in a document, e.g. "term foo was found in bar.pdf on pages 2, 3 and 5."
Is it possible to include page numbers in the query result like this?

It would require some development effort, but you could achieve this by indexing each page of each document as a separate Solr document, and then using field collapsing to group the different page hits for each document.
Note that when this answer was first written you needed a nightly build for this, because field collapsing was not implemented in any released Solr version.
Update: field collapsing is implemented as of Solr 3.3. More updates are expected in the next major version (Solr 4.0).
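
As a rough illustration of the page-per-document idea, here is a minimal Python sketch. It assumes the page texts have already been extracted from the PDF by some other tool, and the field names (id, doc_id, filename, page, text) are hypothetical and would have to match your own schema. The group, group.field and group.limit parameters are Solr's result grouping parameters; everything else is made up for the example.

```python
from collections import defaultdict


def page_documents(doc_id, filename, page_texts):
    """Build one Solr document (as a dict) per PDF page.

    The field names are assumptions for this sketch, not part of Solr.
    """
    return [
        {
            "id": f"{doc_id}_p{number}",  # unique key per page
            "doc_id": doc_id,             # shared by all pages of one PDF
            "filename": filename,
            "page": number,
            "text": text,
        }
        for number, text in enumerate(page_texts, start=1)
    ]


def grouping_params(term):
    """Query parameters asking Solr to group page hits by source document."""
    return {
        "q": f"text:{term}",
        "group": "true",
        "group.field": "doc_id",
        "group.limit": 100,  # maximum number of page hits returned per document
        "fl": "filename,page",
    }


def pages_by_file(hits):
    """Collapse a flat list of page hits into {filename: sorted page numbers}."""
    pages = defaultdict(set)
    for hit in hits:
        pages[hit["filename"]].add(hit["page"])
    return {name: sorted(numbers) for name, numbers in pages.items()}
```

The last helper is only there to turn the returned page hits into something like "found in bar.pdf on pages 2, 3 and 5" on the client side.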

TYPO3 v7.6.x migration to Drupal 8

I have to migrate a complex TYPO3 v7.6.30 website to Drupal 8.
So far I have investigated how TYPO3's administration part works.
I've also been digging into the TYPO3 database to find the correct mapping pattern, but I just don't seem to be getting anywhere.
My question is whether there is a nice way to map/join all of the content with its images/files/categories, so that I can get all page content row by row, like:
title
description
text fields
images
documents
tables
...
In the end I would have a joined table with all of the data for each page on a single row, which I can then map in the migration.
I need a smooth way to map the pages with their fields.
I need the same for users (haven't researched this one yet).
The same is for the nesting of the pages in order to recreate the menus in the new CMS.
Any help on this will be highly appreciated.
You need a detailed picture of the configuration and a good understanding of how TYPO3 works.
Here is a basic introduction:
All content is organized in records, and the main table is pages, which holds the page tree.
For nearly all records you have some common fields:
uid unique identifier
pid page ID (the page in which the record is 'stored', important for editing; pages themselves are also stored in pages, which is what builds the page tree)
title name of record
hidden, deleted, starttime, endtime, fe_group for visibility
there are fields for
versioning and workspaces
language support
sorting
some records (especially tt_content) have type fields, which decide how the record is used and which of its fields apply
there are relations to files (which are represented by sys_file records, and other records like file metadata or categories).
Aside from the default content elements, where the data is stored in the tt_content record itself, you can have plugins which display other records (e.g. news, addresses, events, ...) or which get their data from another application or server.
You need to understand the complete configuration to capture everything.
What you might need is a special rendering of the pages.
That is doable with TYPO3: aside from the default HTML rendering you can define other page types where you can get the content in any format you define, e.g. XML, JSON, CSV, ...
This needs detailed knowledge of the individual TYPO3 configuration. So nobody can give you a full detailed picture of your installation.
And of course you need a good knowledge of your Drupal target installation to answer the question 'what information should be stored where?'
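
To make the uid/pid relationship more concrete, here is a small Python sketch that nests page rows into a tree, which is roughly what you would need to recreate menus. It assumes the rows were already read from the pages table as dicts containing the common fields mentioned above (uid, pid, title, sorting, hidden, deleted). The function name and the output shape are hypothetical, and it deliberately ignores workspaces, translations, starttime/endtime and fe_group.

```python
from collections import defaultdict


def build_page_tree(rows, root_pid=0):
    """Nest page rows by uid/pid, skipping hidden and deleted pages."""
    children = defaultdict(list)
    for row in rows:
        if row.get("deleted") or row.get("hidden"):
            continue  # pages below a skipped page become unreachable too
        children[row["pid"]].append(row)

    def subtree(pid):
        # Siblings are ordered by the 'sorting' field.
        ordered = sorted(children[pid], key=lambda r: r.get("sorting", 0))
        return [
            {"uid": r["uid"], "title": r["title"], "children": subtree(r["uid"])}
            for r in ordered
        ]

    return subtree(root_pid)
```

The same pid lookup applies to tt_content records: each one points at the page it belongs to, so content can be attached to the nodes of this tree in a second pass.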

Recrawl URL with Nutch just for updated sites

I crawled one URL with Nutch 2.1 and now I want to re-crawl pages after they have been updated. How can I do this? How can I know that a page has been updated?
Simply put, you can't. You need to recrawl the page to check whether it has been updated. So, according to your needs, prioritize the pages/domains and recrawl them within a time period. For that you need a job scheduler such as Quartz.
You need to write a function that compares the pages. However, Nutch by default saves the pages as index files. In other words, Nutch generates new binary files to store the HTML. I don't think it's possible to compare those binary files, as Nutch combines all crawl results within a single file. If you want to save pages in raw HTML format to compare, see my answer to this question.
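
If you do end up with the raw HTML, the comparison itself can be as simple as storing a digest per URL. The following Python sketch lives entirely outside Nutch and does not use any Nutch API; the function names and the in-memory `seen` dictionary are assumptions made for the example.

```python
import hashlib


def content_digest(html):
    """Return a stable fingerprint of a page body (whitespace-normalised)."""
    normalised = " ".join(html.split())
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


def has_changed(url, html, seen):
    """Compare a freshly fetched page with the digest stored for its URL.

    `seen` maps URL -> last digest and is updated in place.
    Returns True for pages that are new or whose content differs.
    """
    digest = content_digest(html)
    changed = seen.get(url) != digest
    seen[url] = digest
    return changed
```

Note that a plain digest treats any difference as an update, so pages with rotating ads or timestamps will always look changed unless you strip those parts first.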
You have to schedule a job to trigger the recrawl.
However, Nutch AdaptiveFetchSchedule should enable you to crawl and index pages and detect whether the page is new or updated and you don't have to do it manually.
This article describes the same in detail.
What about http://pascaldimassimo.com/2010/06/11/how-to-re-crawl-with-nutch/ ?
This is discussed in: How to recrawl Nutch
I am wondering if the above mentioned solution will indeed work. I am trying as we speak. I crawl news-sites and they update their frontpage quite frequently, so I need to re-crawl the index/frontpage often and fetch the newly discovered links.

Using Lucene Highlighter infrastructure to mark up arbitrary text

I am using Lucene 3.5 in a client-server architecture as follows: the client issues a query to the server. The server returns a list of terms used in the query, and a list of hits, including snippets generated by the application of a Highlighter to the retrieved documents. The user can then request that the full document be displayed. This document comes from another service that is part of the system I am building.
When the requested document is displayed, I would like to highlight the same terms that were used to retrieve it. I can write some other code to do this without involving the Lucene infrastructure, but since I already have code to generate the snippets, I was hoping to be able to re-use it. (DRY and all that.)
So my question is how best to do this: When the need to mark up a document with search results occurs, the client has the set of terms that were used to retrieve the document and the id of the document that was retrieved. It also knows which fields in the document can be marked up with query terms.
Some possible strategies:
Create a query filter that selects only the needed document and then re-run the query only on that document.
Somehow (how?) construct a Scorer that doesn't depend on a Query but that can be seeded with the terms I already have.
Skip the Lucene infrastructure entirely.
What else?
I believe you could index your documents with a TermVector, which will tell you the position of each term in the original document, making highlighting trivial. Or simply reuse the contrib highlighter.
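
For comparison, the "skip the Lucene infrastructure entirely" strategy from the question can be sketched in a few lines. This is Python rather than Java and purely illustrative: the function name and the default markers are hypothetical, and because no analyzer is applied it will not match stemmed or otherwise normalised terms the way the Lucene Highlighter does.

```python
import html
import re


def highlight(text, terms, pre="<b>", post="</b>"):
    """Wrap whole-word, case-insensitive matches of `terms` in markers.

    The surrounding text is HTML-escaped so the result is safe to display.
    """
    terms = [t for t in terms if t]
    if not terms:
        return html.escape(text)
    # Longest terms first so they win over shorter terms that are prefixes.
    ordered = sorted(terms, key=len, reverse=True)
    alternatives = "|".join(re.escape(t) for t in ordered)
    pattern = re.compile(rf"\b(?:{alternatives})\b", re.IGNORECASE)

    parts, last = [], 0
    for match in pattern.finditer(text):
        parts.append(html.escape(text[last:match.start()]))
        parts.append(pre + html.escape(match.group(0)) + post)
        last = match.end()
    parts.append(html.escape(text[last:]))
    return "".join(parts)
```

The drawback is exactly the DRY concern raised in the question: the matching rules here are separate from the ones that produced the snippets, so the two can drift apart.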

Sitecore System Lucene Index for custom queries

I have been using Sitecore query and FAST query for some sections of the website. But with growing content these queries have gotten slow and I'd like to implement Lucene querying for content to speed up things.
I am wondering if I can just use the System index instead of having to set up a separate index. Does Sitecore by default index all content in the content editor? Is this a good approach or should I just create my own index?
(I'm going to assume you're using Sitecore 6.4->6.6)
As with everything, it depends. Sitecore keeps an index of all the Sitecore items in its system index, and you are welcome to use that. Sometimes, though, you may want a more specialised or restricted set of items indexed, for example only items based on a certain template, or you may need a checkbox field indexed (the system index only indexes text fields by default).
Setting up your own search index is pretty easy. It does require some fiddling with the web.config though (and I'd recommend adding it as an include file).
Create a new <index> node with its own id, which will define the name of the collection and the folder it will go into. (You can check it's working by looking for the directory in the /data/indexes directory of your installation.)
Next, you can tell the crawler which database to look at (most likely master if you want unpublished content to be indexed, or web for published stuff) and where to start the search from (in this example I am indexing only the news section). You can also set tags and a boost, and say whether to IndexAllFields (otherwise it will only index fields it understands as text: rich-text / multi-line text / text etc).
Finally, you can tell the indexer which template types to include or exclude.
The indexer works by subscribing to item events within Sitecore, so every time an item is changed, moved or deleted the index is updated automatically. Obviously, if you are indexing the web db the items will need to have been published.
More in-depth info on the query syntax & indexing can be found here on SDN.
The search syntax and API is much improved in 6.4/6.5 but if you want to add extra kick then my colleague Alex Shyba's Advanced Database Crawler is worth checking out too.
Hope this helps :D
You will want to implement your own index. For the same reason that you are seeing things slow down when there is a lot of content, indexes slow down when there is a lot of content in them as well.
I prefer targeted indexes that are meant specifically to drive the functionality I need and that contain only the data that is required. This allows for smaller and more efficient index usage on your components.
Additionally, you probably want to look into the AdvancedDatabaseCrawler put together by Alex Shyba. There are a few blogs out there with some great posts on implementing this Lucene indexing module.
A separate index is always a wise decision; you can keep it light. In big environments the system index can grow to gigabytes.
You can exclude the content from the index, as you will only be using it for performing lookups, not showing content from the index.
Finally: the system index is for the master database, you'll be querying the web database, possibly on a content delivery server.

Drupal 6: How to sort/filter search results by date

How can I customize the standard search behavior in Drupal 6? I need search results to be sorted by date. For example, people want to see only items from the past 2 weeks or something like that.
I've tried a lot of things from this reference without luck. Have you ever encountered such a problem? Any help will be appreciated. Thanks!
You can sort by date using search solutions like Apache Solr. But I understand you want to use standard Drupal search.
In that situation I would recommend using the faceted search module http://drupal.org/project/faceted_search
The Faceted Search module does not require the installation of a separate search engine. It also has Views integration, which will allow you to do things like show results from the last 2 weeks and so on.
Please see:
http://drupalcode.org/viewvc/drupal/contributions/modules/faceted_search/README.txt?view=co
You can search for "views" in the above document for information.
You can choose to also not show any facets if you don't want your users to see them. In that case you would be installing the module only because of the benefits of views integration.