Creating dynamic facets using Apache Solr

I'm new to Apache Solr.
I have uploaded a few log files using Solr Cell and I want to create facets based on the content of those log files.
For example: my log file contains a record for each transaction. I would like to use transactionid as a facet, so that clicking a value runs a search over the uploaded log files and returns the results for that particular id.
Note: I need to facet on fields derived from the content of the log.

As long as the field is indexed, you can facet on it. So you can use either a schemaless configuration or dynamicField definitions to match and automatically create fields for your log records.
Go through the Solr examples first; there should be enough information there.
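To make the faceting request concrete, here is a minimal sketch in Python that only builds the query URL. The core name (logs) and the field name (transactionid_s, which assumes a *_s dynamicField rule like the one in the Solr example schemas) are assumptions for illustration, not something taken from the question.

```python
from urllib.parse import urlencode, urlparse, parse_qs

def facet_query_url(base_url, facet_field, selected_value=None):
    """Build a Solr select URL that facets on one field.

    If selected_value is given, a filter query (fq) narrows the results
    to documents with that value, which is what clicking a facet does.
    """
    params = [
        ("q", "*:*"),
        ("rows", "10"),
        ("facet", "true"),
        ("facet.field", facet_field),
        ("facet.mincount", "1"),
        ("wt", "json"),
    ]
    if selected_value is not None:
        # Quote the value so it is matched as a single term or phrase.
        params.append(("fq", '{}:"{}"'.format(facet_field, selected_value)))
    return base_url + "?" + urlencode(params)

# Hypothetical core name and field name.
url = facet_query_url(
    "http://localhost:8983/solr/logs/select", "transactionid_s", "TX-1001"
)
print(url)
```

Without selected_value the response lists each transaction id with its document count; with it, the same request is narrowed to the documents for that id.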
(updated based on the comments)
If the text needs to be pre-processed and split, there are two basic avenues:
Use DataImportHandler (DIH), probably with LineEntityProcessor and RegexTransformer, to split the field into multiple fields.
Use an UpdateRequestProcessor chain (in solrconfig.xml): clone the field multiple times and then use RegexReplaceProcessorFactory to extract the relevant part from each copy. That's even uglier than DIH, though, as there is no easy way to split one field into many.
Still, specifically for logs, it is better to use something like Logstash with a Solr output plugin.
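Whichever tool does the splitting, the step itself comes down to one regular expression per log format. The sketch below shows that idea in plain Python, outside Solr. The log line layout and the field names (with *_s and *_t suffixes, assuming dynamicField rules like those in the Solr examples) are made up for illustration.

```python
import re

# Hypothetical log format; adjust the pattern to the real log lines.
LINE_PATTERN = re.compile(
    r"^(?P<timestamp>\S+ \S+) (?P<level>[A-Z]+) "
    r"transactionid=(?P<transactionid>\S+) (?P<message>.*)$"
)

def split_log_line(line):
    """Return a dict of Solr-style fields, or None if the line does not match."""
    match = LINE_PATTERN.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    # The suffixes assume dynamicField rules such as *_s (string) and *_t (text).
    return {
        "timestamp_s": fields["timestamp"],
        "level_s": fields["level"],
        "transactionid_s": fields["transactionid"],
        "message_t": fields["message"],
    }

doc = split_log_line(
    "2014-03-01 12:00:05 INFO transactionid=TX-1001 payment accepted"
)
print(doc)
```

Each resulting dict is one document to index, and transactionid_s is then available as a facet field.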

+1 to Alex's answer.
Another alternative is to write a custom update processor where you figure out what field you want to facet on and explicitly add that field to your document.
This makes sense only if you know what kind of fields to expect, based on some pattern. If that is not the case, then using dynamic fields or a schemaless config is your best bet.


TYPO3 v7.6.x migration to Drupal 8

I have to migrate a complex TYPO3 v7.6.30 website to Drupal 8.
So far I have investigated how TYPO3's administration part works.
I've also been digging into the TYPO3 database to find the correct mapping pattern, but I just don't seem to be getting anywhere.
My question is whether there is a nice way to map/join all of the content with its images/files/categories, so I can get all page content row by row, like:
title
description
text fields
images
documents
tables
...
So in the end I will end up with a joined table with all of the data for each page on a single row, which then I can map in the migration.
I need a smooth way to map the pages with their fields.
I need the same for users (haven't researched this one yet).
The same goes for the nesting of the pages, in order to recreate the menus in the new CMS.
Any help on this will be highly appreciated.

You need a detailed picture of the configuration and a good understanding of how TYPO3 works.
Here is a basic introduction:
All content is organized in records, and the main table is pages, which forms the page tree.
For nearly all records you have some common fields:
uid unique identifier
pid page ID (the page in which the record is 'stored', important for editing; pages themselves are stored in pages, which builds the page tree)
title name of record
hidden, deleted, starttime, endtime, fe_group for visibility
there are fields for
versioning and workspaces
language support
sorting
some records (especially tt_content) have type fields, which decide how the record is used and which of its fields apply
there are relations to files (represented by sys_file records) and to other records such as file metadata or categories.
Aside from the default content elements, where the data is stored in the tt_content record itself, you can have plugins which display other records (e.g. news, addresses, events, ...) or which get their data from another application or server.
You need to understand the complete configuration to capture everything.
What you might need is a special rendering of the pages.
That is doable with TYPO3: aside from the default HTML rendering, you can define other page types that output the content in any format you define, e.g. XML, JSON, CSV, ...
This needs detailed knowledge of the individual TYPO3 configuration, so nobody can give you a full, detailed picture of your installation.
And of course you need a good knowledge of your Drupal target installation to answer the question 'what information should be stored where?'
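As a rough illustration of the uid/pid relationship described above, the following Python sketch rebuilds the page nesting from rows of the pages table. It uses an in-memory SQLite table with made-up sample rows as a stand-in for the real database, and it deliberately ignores translations, workspaces, starttime/endtime, and fe_group, so treat it as a starting point rather than a complete export.

```python
import sqlite3

def build_page_tree(conn, root_pid=0):
    """Return visible pages as nested dicts, following pid -> uid links."""
    rows = conn.execute(
        "SELECT uid, pid, title FROM pages "
        "WHERE deleted = 0 AND hidden = 0 ORDER BY pid, sorting"
    ).fetchall()
    children = {}
    for uid, pid, title in rows:
        children.setdefault(pid, []).append({"uid": uid, "title": title})

    def attach(pid):
        # Recursively hang each page's children under it.
        nodes = children.get(pid, [])
        for node in nodes:
            node["children"] = attach(node["uid"])
        return nodes

    return attach(root_pid)

# Tiny in-memory stand-in for the TYPO3 pages table (sample data is made up).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE pages (uid INTEGER, pid INTEGER, title TEXT, "
    "hidden INTEGER, deleted INTEGER, sorting INTEGER)"
)
conn.executemany(
    "INSERT INTO pages VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, 0, "Home", 0, 0, 1),
        (2, 1, "News", 0, 0, 2),
        (3, 1, "About", 0, 0, 1),
        (4, 1, "Old page", 0, 1, 3),  # deleted, so it is skipped
    ],
)
tree = build_page_tree(conn)
print(tree)
```

The nested result maps naturally onto a menu structure in the target CMS; content records such as tt_content can be attached to each node by their pid in the same way.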

Is it possible to set shared variables outside of the plugin pipeline CRM 2011

I want to create a record of an entity, but I need to pass a list of guids to the pre create plugin. I don't want to create fields or related entities to do this. Can I use the Shared Variables to do it?
In other words is it possible to set shared variables before initiating the action that will trigger the plugins that will consume them?
EDIT:
I may be creating records of this type from different points that integrate with CRM: Silverlight, external pages, or even plugins of other entities. My current problem can be solved with a field on the entity, but if I had to send parameters to control the plugin's execution for two or more independent actions, I would need one field per action, or a single field with a complex format/parse pattern to parameterize each action. Using fields for this looks a bit excessive.
If the shared variables could be set before the call that triggers the plugin, that would solve the problem and I wouldn't have to create fields in the CRM database. The data I want to pass to the plugin is only needed at that moment, like a parameter to a function, so there is no need to persist it in the database.
But if it is not possible I will have to stick with the fields :(

Not if they vary by entity/execution of the plugin.
Options:
1. Set them in the plugin configuration if they don't change but need to be updated without a recompile.
2. Apply them as a delimited string in a single field on the entity if they vary per record.
What's the reason for not wanting to use 2?

Nope. The easiest solution that I can think of is to add a BAT (big-ass text) field to the entity and populate it with a comma-delimited list of GUIDs, then access that field in your Create plugin. You could even clear it out if you don't want that extra data in your system.
Edit after your edit:
General comment about your thinking process: you are probably overthinking it. :) Using a single field, you could pass in any kind of "command" using a JSON- or XML-formatted string. As I said above, in the pre-create plugin, after you have extracted this "argument" field, you can clear out that field in the Target entity image and that data will never be persisted to the database. Technically it achieves the exact result you want, with the only side effect being one extra "argument" field that is always NULL in the database. Don't fight simplicity so hard! :)
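The "argument field" pattern can be sketched independently of the CRM SDK. The Python below is only a language-neutral illustration: the dict stands in for the record being created, and the field name new_pluginargument and the JSON layout are hypothetical. A real plugin would do the equivalent on the Target entity in the pre-create stage.

```python
import json

ARGUMENT_FIELD = "new_pluginargument"  # hypothetical field name

def pre_create(target):
    """Simulate a pre-create step: read the argument field, then clear it.

    `target` is a plain dict standing in for the record being created.
    """
    raw = target.get(ARGUMENT_FIELD)
    command = json.loads(raw) if raw else {}
    # Clear the field so the argument is not persisted with the record.
    target[ARGUMENT_FIELD] = None
    return command

target = {
    "name": "Sample record",
    ARGUMENT_FIELD: json.dumps(
        {"action": "link", "ids": ["6f9619ff-8b86-d011-b42d-00c04fc964ff"]}
    ),
}
command = pre_create(target)
print(command["action"], command["ids"])
```

Because the payload is a small JSON document, one field can carry arguments for any number of independent actions, keyed by something like the action name.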

Using Lucene Highlighter infrastructure to mark up arbitrary text

I am using Lucene 3.5 in a client-server architecture as follows: the client issues a query to the server. The server returns a list of terms used in the query, and a list of hits, including snippets generated by the application of a Highlighter to the retrieved documents. The user can then request that the full document be displayed. This document comes from another service that is part of the system I am building.
When the requested document is displayed, I would like to highlight the same terms that were used to retrieve it. I can write some other code to do this without involving the Lucene infrastructure, but since I already have code to generate the snippets, I was hoping to be able to re-use it. (DRY and all that.)
So my question is how best to do this: When the need to mark up a document with search results occurs, the client has the set of terms that were used to retrieve the document and the id of the document that was retrieved. It also knows which fields in the document can be marked up with query terms.
Some possible strategies:
Create a query filter that selects only the needed document and then re-run the query only on that document.
Somehow (how?) construct a Scorer that doesn't depend on a Query but that can be seeded with the terms I already have.
Skip the Lucene infrastructure entirely.
What else?

I believe you could index your documents with a TermVector, which will tell you the position of each term in the original document and make highlighting trivial. Or simply reuse the contrib highlighter.
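For comparison, the "skip the Lucene infrastructure entirely" strategy from the question can be sketched in a few lines. This is not the TermVector or contrib Highlighter approach suggested above, just plain regular-expression markup in Python, with function and parameter names chosen for illustration. It does no analysis (no stemming or stop words), so it can miss or over-mark terms compared with what Lucene actually matched.

```python
import re

def highlight_terms(text, terms, pre="<b>", post="</b>"):
    """Wrap whole-word, case-insensitive occurrences of the terms in markers."""
    if not terms:
        return text
    # Longest terms first so overlapping alternatives prefer the longer match.
    ordered = sorted(terms, key=len, reverse=True)
    alternatives = "|".join(re.escape(t) for t in ordered)
    pattern = re.compile(r"\b(?:%s)\b" % alternatives, re.IGNORECASE)
    return pattern.sub(lambda m: pre + m.group(0) + post, text)

marked = highlight_terms(
    "Lucene highlights the query terms.", ["lucene", "terms"]
)
print(marked)
```

The trade-off is the one the question hints at: this duplicates logic the snippet code already has, and its notion of a "match" can drift from the index's analyzer.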

Sitecore System Lucene Index for custom queries

I have been using Sitecore query and FAST query for some sections of the website. But with growing content these queries have gotten slow and I'd like to implement Lucene querying for content to speed up things.
I am wondering if I can just use the System index instead of having to setup a separate index. Does Sitecore by default index all content in the content editor? Is this a good approach or should I just create my own index?

(I'm going to assume you're using Sitecore 6.4 to 6.6.)
As with everything, it depends. Sitecore keeps an index of all the Sitecore items in its system index, and you are welcome to use that. Sometimes you may want a more specialised or restricted list of items, for example only items based on a certain template, or you may need a checkbox field indexed (the system index only indexes text fields by default).
Setting up your own search index is pretty easy. It does require some fiddling with the web.config, though (and I'd recommend adding it as a .include file).
Create a new <index> node with its own id, which defines the name of the collection and the folder it will go into. (You can check it's working by looking for the directory in the /data/indexes directory of your installation.)
Next, you can tell the crawler which database to look at (most likely master if you want unpublished content to be indexed, or web for published content) and where to start the search from (in this example I am indexing only the news section). You can also tag, boost, and say whether to IndexAllFields (otherwise it will only index fields it understands as text: rich text, multi-line text, text, etc.).
Finally, you can tell the indexer which template types to include or exclude.
The indexer works by subscribing to item events within Sitecore, so every time an item is changed, moved, or deleted, the index is updated automatically. Obviously, if you are indexing the web database, the items need to have been published.
More in-depth information on the query syntax and indexing can be found on SDN.
The search syntax and API is much improved in 6.4/6.5 but if you want to add extra kick then my colleague Alex Shyba's Advanced Database Crawler is worth checking out too.
Hope this helps :D

You will want to implement your own index. For the same reason that you are seeing things slow down when there is a lot of content, indexes slow down when there is a lot of content in them as well.
I prefer targeted indexes that are meant specifically to drive the functionality I need and that contain only the data that is required. This allows for smaller and more efficient index usage in your components.
Additionally, you probably want to look into the AdvancedDatabaseCrawler put together by Alex Shyba. There are a few blogs out there with some great posts on implementing this Lucene indexing module.

A separate index is always a wise decision, because you can keep it light. In big environments the system index can grow to gigabytes.
You can exclude the content from the index, as you will only be using it for performing lookups, not showing content from the index.
Finally: the system index is for the master database, you'll be querying the web database, possibly on a content delivery server.

Building a ColdFusion Application with Version Control

We have a CMS built entirely in house. I'm the new web developer guy with literally 4 weeks of ColdFusion experience. What I want to do is add version control to our dynamic pages, something like what WordPress does. When you modify a page in WordPress, it makes some database entries and keeps a copy of each page when you save it. So if you create a page and modify it 6 times, all in one day, you have 7 different versions to roll back to if necessary. Is there an easy way to do something similar in ColdFusion?
Please note I'm not talking about source control or version control of actual CFM files, all pages are done on the backend dynamically using SQL.

Sure you can. Just stash the page content in another database table. You can do that with ColdFusion or via a trigger in the database.

One way (there are many) to do this is to add a column called "version" and a column called "live" to the table where you're storing all of your CMS pages.
The column called "live" is optional but might make things easier for you in some ways when starting out.
The column "version" will tell you what revision number of a document in the CMS you have. By a process of elimination you could say the newest one (highest version #) would be the latest and live one. However, you may need to override this some time and turn an old page live, which is what the "live" setting can be set to.
So when you click "edit" on a page, you would take the version that was clicked and copy it into a new, higher version number. It stays as a draft until you click publish (at which time it's written as 'live').
I hope that helps. This kind of an approach should work okay with most schema designs but I can't say for sure either without seeing it.
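A minimal sketch of the "version" and "live" columns, written in Python with an in-memory SQLite database only to keep it self-contained. The table name cms_pages, the column names, and the helper functions are assumptions; the same SQL statements could be issued from ColdFusion with cfquery against the real schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE cms_pages (
        page_id INTEGER, version INTEGER, live INTEGER, content TEXT,
        PRIMARY KEY (page_id, version)
    )
    """
)

def save_draft(conn, page_id, content):
    """Store edited content as a new, higher version that is not live yet."""
    (latest,) = conn.execute(
        "SELECT COALESCE(MAX(version), 0) FROM cms_pages WHERE page_id = ?",
        (page_id,),
    ).fetchone()
    conn.execute(
        "INSERT INTO cms_pages (page_id, version, live, content) "
        "VALUES (?, ?, 0, ?)",
        (page_id, latest + 1, content),
    )
    return latest + 1

def publish(conn, page_id, version):
    """Make exactly one version live; publishing an old version is a rollback."""
    conn.execute("UPDATE cms_pages SET live = 0 WHERE page_id = ?", (page_id,))
    conn.execute(
        "UPDATE cms_pages SET live = 1 WHERE page_id = ? AND version = ?",
        (page_id, version),
    )

def live_content(conn, page_id):
    """Return the content of the live version, or None if nothing is live."""
    row = conn.execute(
        "SELECT content FROM cms_pages WHERE page_id = ? AND live = 1",
        (page_id,),
    ).fetchone()
    return row[0] if row else None

v1 = save_draft(conn, 1, "First text")
publish(conn, 1, v1)
v2 = save_draft(conn, 1, "Second text")
publish(conn, 1, v2)
publish(conn, 1, v1)  # roll back to the first version
print(live_content(conn, 1))
```

Keeping publish as a separate step is what lets a draft sit unpublished, and lets an older version be made live again without deleting the newer ones.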
Jas' solution works well if most of the changes are to one field, for example the full text of a page of content.
However, if you have many fields, and people only tend to change one or two at a time, a new entry in the table for each version can quickly get out of hand, with many almost identical versions in the history.
In this case what I like to do is store the changes on a per-field basis in a table called ChangeHistory. I include the table name, row ID, field name, previous value, new value, and who made the change and when.
This acts as a complete change history for any field in any table. I'm also able to view changes by record, by user, or by field.
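The per-field history could look roughly like the sketch below, again in Python with in-memory SQLite as a stand-in. The column names follow the list above, but their exact spelling, the record_changes helper, and the sample data are assumptions.

```python
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE ChangeHistory (
        table_name TEXT, row_id INTEGER, field_name TEXT,
        previous_value TEXT, new_value TEXT,
        changed_by TEXT, changed_at TEXT
    )
    """
)

def record_changes(conn, table_name, row_id, old_row, new_row, user):
    """Insert one ChangeHistory row per field whose value actually changed."""
    now = datetime.now(timezone.utc).isoformat()
    changed = []
    for field, new_value in new_row.items():
        old_value = old_row.get(field)
        if old_value != new_value:
            conn.execute(
                "INSERT INTO ChangeHistory VALUES (?, ?, ?, ?, ?, ?, ?)",
                (table_name, row_id, field, old_value, new_value, user, now),
            )
            changed.append(field)
    return changed

old = {"title": "Welcome", "body": "Hello"}
new = {"title": "Welcome", "body": "Hello, world"}
changed = record_changes(conn, "pages", 42, old, new, "jdoe")
print(changed)
```

Only the body field is logged here because the title did not change, which is what keeps the history small when edits touch one or two fields at a time.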

For real-time page generation from the database, your best bet is separate "live" and "versioned" tables, because keeping all data, live and versioned, in one table will negatively impact performance. If page generation relies on a single SELECT query against the live table, you can easily version the result set using ColdFusion's Web Distributed Data eXchange format (WDDX) via the <cfwddx> tag. WDDX is a serialized data format that works particularly well with ColdFusion data (a bit like Python's pickle, albeit without the ability to deal with objects).
The versioned table could be as such:
PageID
Created
Data
where Data is the column storing the WDDX.
Note that you could also use the built-in JSON support for version serialization (serializeJSON and deserializeJSON), but cfwddx tends to be more stable.
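The versioned table can be sketched with the JSON variant mentioned above. The Python below uses json and an in-memory SQLite table in place of ColdFusion's serializeJSON/deserializeJSON and the real database. The table name PageVersions and the helper names are assumptions, while the three columns follow the layout given above.

```python
import json
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE PageVersions (PageID INTEGER, Created TEXT, Data TEXT)"
)

def archive_version(conn, page_id, live_row):
    """Serialize the live row and append it to the versioned table."""
    created = datetime.now(timezone.utc).isoformat()
    conn.execute(
        "INSERT INTO PageVersions (PageID, Created, Data) VALUES (?, ?, ?)",
        (page_id, created, json.dumps(live_row)),
    )

def load_versions(conn, page_id):
    """Return the archived rows for a page, oldest first."""
    rows = conn.execute(
        "SELECT Data FROM PageVersions WHERE PageID = ? ORDER BY rowid",
        (page_id,),
    ).fetchall()
    return [json.loads(data) for (data,) in rows]

archive_version(conn, 7, {"title": "Welcome", "body": "Hello"})
archive_version(conn, 7, {"title": "Welcome", "body": "Hello, world"})
versions = load_versions(conn, 7)
print(len(versions), versions[-1]["body"])
```

Rolling back then means deserializing one archived row and writing its values back to the live table, which keeps the live table small and the page query unchanged.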