How to set up a Splunk summary index?

I'm a bit confused about setting up a summary index in Splunk.
I have an index named index_1 which receives logs from my app.
There are far too many logs, and I need to save an aggregation of them.
I have tried setting up the summary index from here to an index named summary,
but when I search that index there are no log entries.
My search is as follows:
index=index_1 ... level>30
I couldn't understand when to use the collect command and when setting it up from the web UI is enough.

Your search, index=index_1 ... level>30, should reduce the results to only those events you want to store in the summary index. In this case, it looks like you're only interested in keeping events where level>30.
At the end of your search, include the collect command. The collect command takes the remaining events and writes them to the named index, so collect index=summary.
Overall, your search should look like
index=index_1 ... level>30 | collect index=summary
As for the web UI: if you save this search, schedule it, and enable summary indexing on it, Splunk writes each scheduled run's results to the chosen summary index for you, so an explicit collect is mainly needed when you run the search ad hoc or want direct control over what gets written.
Here is an older blog post discussing summary indexing that may help you understand the process and good practices around using it.
https://davidveuve.com/tech/how-i-use-summary-indexes-in-splunk/

Related

Splunk: Record deduplication using a unique field

We are considering moving our log analytics solution from ElasticSearch/Kibana to Splunk.
We currently use the "document id" in ElasticSearch to deduplicate records when indexing:
https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-index_.html
We generate the id using a hash of the content of each log record.
In Splunk, I found the internal field "_cd", which is unique to each record in a Splunk index: https://docs.splunk.com/Documentation/Splunk/8.1.0/Knowledge/Usedefaultfields
However, using the HTTP Event Collector to ingest records, I couldn't find any way to embed this "_cd" field in the request:
https://docs.splunk.com/Documentation/Splunk/8.1.0/Data/HECExamples
Any tips on how to achieve this in Splunk?
What are you trying to achieve?
If you're sending "unique" events to the HEC, or you're running UFs on "unique" logs, you'll never get duplicate "records when indexing".
It sounds like you (perhaps routinely?) resend the same data to your aggregation platform - which is not a problem with the aggregator, but with your sending process.
Almost like you're doing a MySQL/PostgreSQL "insert if not exists" operation. If that is a correct understanding of your situation, based on your statement
We currently use "document id" in ElasticSearch to deduplicate records when indexing:
https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-index_.html
We generate the id using hash of the content of the each log-record.
then you need to evaluate what is going "wrong" in your sending process that makes you feel you need to pre-clean the data before ingesting it.
It is true that Splunk won't "deduplicate records when indexing" - because it presumes the data coming in to be 'correct' from whatever is submitting it.
How are you getting duplicate data in the first place?
Fields in Splunk which begin with an underscore (e.g. _time, _cd, etc.) are not editable/sendable - they're generated by Splunk when it receives data. In other words, they're all internal fields. Searchable. Usable. But not overrideable.
If you really have a problem with [lots of/too much] duplicate data, and there is no way to fix your sending process[es], then you'll need to rely on deduplication operations in SPL when searching for/reporting on whatever you've ingested (primarily by using stats and, when absolutely necessary/unavoidable, dedup).
HEC inputs don't go through the usual ingestion pipeline, so not all internal fields are present.
Not that it matters, really, because Splunk doesn't deduplicate at index time. There is no provision for searching data to see if a given record is already present. Any deduplication must be done at search time.
One cannot use the _cd field to deduplicate at search time because two identical records will have different _cd values.
Consider using a tool such as Cribl to add a hash to each ingested record and use that hash in Splunk to deduplicate in your searches.
Good call @RichG. Cribl has some nice options for this use case.
https://cribl.io/blog/streaming-data-deduplication-with-cribl/
Be aware you can add other fields to HEC data if you are using Cribl LogStream. You get many more options using LogStream. It saved my old team so much time and effort.
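If you add the hash on the sending side (with Cribl, or directly in whatever builds your HEC requests), a minimal Python sketch of the idea might look like this; the endpoint, token, and the event_hash field name are assumptions, not anything Splunk prescribes:

import hashlib
import json
import requests

HEC_URL = "https://splunk.example.com:8088/services/collector/event"  # assumed host
HEC_TOKEN = "00000000-0000-0000-0000-000000000000"                    # assumed token

def send_with_hash(record):
    # Hash the record content so duplicates can be grouped at search time,
    # e.g. ... | stats latest(*) AS * BY event_hash
    payload = json.dumps(record, sort_keys=True)
    content_hash = hashlib.sha256(payload.encode("utf-8")).hexdigest()

    body = {
        "event": record,
        "fields": {"event_hash": content_hash},  # sent as an indexed field
    }
    requests.post(
        HEC_URL,
        headers={"Authorization": "Splunk " + HEC_TOKEN},
        json=body,
        timeout=10,
    )

At search time the duplicates then collapse on event_hash rather than on _cd.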

Using index in DSE graph

I'm trying to get the list of persons in a DataStax graph that share the same address with other persons, where the number of persons at that address is between 3 and 5.
This is the query:
g.V().hasLabel('person').match(__.as('p').out('has_address').as('a').dedup().count().as('nr'),__.as('p').out('has_address').as('a')).select('nr').is(inside(3,5)).select('p','a','nr').range(0,20)
On the first run I noticed this error message:
Could not find an index to answer query clause and graph.allow_scan is
disabled: ((label = person))
I've enabled graph.allow_scan=true and now it's working.
I'm wondering how I can create an index to be able to run this query without enabling allow_scan=true?
Thanks
You can create an index by adding it to the schema using a command like this:
schema.vertexLabel('person').index('address').materialized().by('has_address').add()
Full documentation on adding indexes is available here: https://docs.datastax.com/en/latest-dse/datastax_enterprise/graph/using/createIndexes.html
You should not enable graph.allow_scan=true as under the covers it is turning on ALLOW FILTERING on the CQL queries. This will cause a lot of cluster scans and will inevitably time out with any real amount of data in the system. You should never enable this in any sort of production environment.
I am not sure that indexing is the solution for your problem.
The best way to do this would be to reify addresses as nodes and look for nodes with an indegree between 3 and 5.
You can use an index on the textual fields of your address nodes.
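If you do reify addresses as vertices, a rough sketch of that indegree lookup with the gremlinpython driver could look like the following; the endpoint, the address label, the has_address edge, and the name property are assumptions based on the question, not a verified schema:

from gremlin_python.process.anonymous_traversal import traversal
from gremlin_python.process.graph_traversal import __
from gremlin_python.process.traversal import P
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection

conn = DriverRemoteConnection('ws://localhost:8182/gremlin', 'g')  # assumed endpoint
g = traversal().withRemote(conn)

# Addresses pointed at by 3 to 5 persons (indegree on the has_address edge),
# together with the persons sharing them. P.between is inclusive/exclusive, so [3, 6) = 3..5.
shared = (g.V().hasLabel('address')
           .where(__.in_('has_address').count().is_(P.between(3, 6)))
           .project('address', 'persons')
           .by('name')
           .by(__.in_('has_address').values('name').fold())
           .toList())

for row in shared:
    print(row['address'], row['persons'])

conn.close()

With addresses indexed by their own properties, this traversal starts from address vertices instead of scanning every person.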

How to get all hashes in foo:* using a single id counter instead of a set/array

Introduction
My domain has articles, which have a title and text. Each article has revisions (like the SVN concept), so every time it is changed/edited, those changes will be stored as a revision. A revision is composed of the changes and a description of those changes.
I want to be able to obtain all revisions descriptions at once.
What's the problem?
I'm certain that I would store each revision as a hash at articles:revisions:<id>, storing the changes and the description in it.
What I'm not certain of is how do I get all of the descriptions at once.
I have many options to do this, but none of them convinces me.
Store the revision ids for an article as a set, and use SORT articles:revisions:idSet BY NOSORT GET articles:revisions:*->description. This means that I would store a set for each article. If every article had 50 revisions, and we had 10,000 articles, we would have 500,000 ids stored.
Is this the best way? Isn't this eating up too much RAM?
I have other ideas in mind, but I don't consider them good either.
Iterate from 0 to the last revision's id, doing an HGET for each id using MULTI.
Create the idSet for a specific article if it doesn't exist and is requested, and expire it after some time.
Isn't there a way for Redis to do a SORT array BY NOSORT GET, with array being an ad hoc array in the form of [0, MAX]?
Seems like you have a good solution.
As long as you keep those id numbers below 10,000 and your sets with fewer than 512 elements (set-max-intset-entries), your memory consumption will be much lower than you think.
Here's a good explanation of it.
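For reference, the set-plus-SORT approach from the question maps to redis-py roughly like this; the key layout follows the question, and the article id 42 is just an illustrative value:

import redis

r = redis.Redis()

article_id = 42
# Single id counter for revisions, then one hash per revision
rev_id = r.incr('articles:revisions:counter')
r.hset('articles:revisions:%d' % rev_id,
       mapping={'description': 'Fixed title', 'changes': '...'})
# Per-article set of revision ids
r.sadd('articles:%d:revisions' % article_id, rev_id)

# SORT <idset> BY nosort GET articles:revisions:*->description
descriptions = r.sort('articles:%d:revisions' % article_id,
                      by='nosort',
                      get='articles:revisions:*->description')
print(descriptions)

As long as the sets stay within set-max-intset-entries, they are stored as intsets and remain compact.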
This can be solved in a more optimized way using a trie or DAWG than with what Redis provides. I don't know your application or other info about your search problem (e.g. construction time, unsuccessful searches, update performance).
If you search much more often than you need to update/insert into your lookup storage, I'd suggest you have a look at DAWGDIC [1] as a library, and construct "search paths" (similar to what you already described) using a string format that can be search-completed later:
articleID:revisionID:"changeDescription":"change"
Example (I assume you have one description per revision, and n changes. This isn't clear to me from your question):
1:2:"Some changes":"Added two sentences here, removed one sentence there"
1:2:"Some changes":"Fixed article title"
2:4:"Advertisement changes":"Added this, removed that"
Note: Even though you construct these strings with duplicate prefixes, the DAWG will store them in a very space-efficient way (simply put, it will append the right side of the string to the data structure and create a shortcut for the common prefix; see also [2] for a comparison of trie data structures).
To list changes of article 1, revision 2, set the common prefix for your lookup:
completer.Start(index, "1:2");
Now you can simply call completer.Next() to look up the next record that shares the same prefix, and completer.value() to get the record's value. In our example we'll get:
1:2:"Some changes":"Added two sentences here, removed one sentence there"
1:2:"Some changes":"Fixed article title"
Of course you need to parse the strings yourself into your data object.
Maybe this is not what you're looking for and is overkill, but it can be a very space- and search-efficient approach if it meets your requirements.
[1] https://code.google.com/p/dawgdic/
[2] http://kmike.ru/python-data-structures/
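In Python, the same prefix-completion idea can be sketched with the DAWG package (by the author of [2]); the key format follows the example above and is an assumption, not something the library requires:

import dawg

records = [
    u'1:2:"Some changes":"Added two sentences here, removed one sentence there"',
    u'1:2:"Some changes":"Fixed article title"',
    u'2:4:"Advertisement changes":"Added this, removed that"',
]

# Build the DAWG once; lookups are cheap, but the structure must be rebuilt on updates
completion = dawg.CompletionDAWG(records)

# List all changes of article 1, revision 2 by completing the shared prefix
for key in completion.keys(u'1:2:'):
    print(key)

The trade-off is exactly the one mentioned above: very compact storage and fast prefix lookups, but no cheap in-place updates.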

Nested search in solr

I have an Activity model and ActivityOccurrence Model where Activity has_many :activity_occurrences
Activity: This model will have all the metadata required by ActivityOccurrence.
ActivityOccurrence: attrs - occurrence (datetime), completed.
Now we have a new requirement where we have to show all occurrences of an activity in search results when a user searches for activities in a particular range.
Previously we used to show only one record in case of repeating activities.
So as per new requirement we have decided to move search from Activity to ActivityOccurrence.
Now, I don't want to index the meta information of Activity in each of my ActivityOccurrence records, as my Activity has 10 more fields than ActivityOccurrence.
E.g.:
if I have an Activity with 1000 ActivityOccurrences, then I will be indexing all my Activity information in 1000 ActivityOccurrence records.
This will take up a huge amount of space as the app grows if we index this way.
Hence, my major concern is the amount of indexing I have to do.
So I am thinking of avoiding Activity indexes in ActivityOccurrence.
So is there a way to search Activity based on its filters first and then search ActivityOccurrence in the range based on the results from activities?
Note: We also have never-ending occurrences.
Any ideas?
Thanks in advance.
Unless you're dealing with millions of Activities/Occurrences, this may be a premature optimization - space is cheap, and SOLR is fast. Looking at this the other way around, have you considered just indexing a list of the activity occurrences that pertain to each activity (using callbacks to ensure that it gets updated)? It's hard to really optimize without more info about your data access patterns, but I'm never a fan of doing more round-trips than necessary.
That said, while I'm not sure how to write a pure SOLR query to do this, you can do it with Sunspot pretty easily:
Make sure that ActivityOccurrence is easily searchable by Activity (i.e. by Activity ID).
Search Activity for the metadata that you want, and use this to extract the ID's that are relevant:
search = Activity.solr_search {<some block that does what you want>}
activity_ids = search.hits.map { |hit| hit.primary_key.to_i }
Now you can just add a with parameter to your ActivityOccurrence search block:
with(:activity_id, activity_ids)
This will limit the search to the occurrences for those activities. Note that you are trading off search-time performance for index efficiency with this.
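Outside of Sunspot, the same two-step pattern could be sketched with pysolr; the core URL and the field names (type_s, title_t, activity_id_s, occurrence_dt) are placeholders for whatever your schema actually uses:

import pysolr

solr = pysolr.Solr('http://localhost:8983/solr/mycore', timeout=10)  # assumed core

# Step 1: search Activity documents for the metadata you care about
activities = solr.search('type_s:Activity AND title_t:yoga', fl='id', rows=200)
activity_ids = [doc['id'] for doc in activities]

# Step 2: restrict the ActivityOccurrence search to those activities and the date range
id_filter = 'activity_id_s:(%s)' % ' OR '.join('"%s"' % i for i in activity_ids)
occurrences = solr.search(
    'type_s:ActivityOccurrence',
    fq=[id_filter, 'occurrence_dt:[2015-01-01T00:00:00Z TO 2015-02-01T00:00:00Z]'],
    rows=100,
)
for doc in occurrences:
    print(doc)

The trade-off is the same as in the Sunspot version: two round trips, but a much smaller ActivityOccurrence index.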

Getting lucene to return only unique threads (indexing both threads and posts)

I have a StackOverflow-like system where content is organised into threads, each thread having content of its own (the question body / text), and posts / replies.
I'm building the ability to search this content via Lucene, and if possible I would like to index individual posts (it makes the index easier to update, and gives me more control and ability to tweak the results) rather than entire threads. The problem, however, is that I want the search to display a list of threads rather than a list of posts.
How can I get Lucene to return only unique threads as results, while also searching the content of the posts?
Each document can have a "threadId" field. After running a search, you can loop through your result set and return all the unique threadId's.
The tricky part is specifying how many results you want to return. If you want to show, say, 10 results on your results page, you'll probably need Lucene to return 10 + m results, since a certain percentage of the returned set will be de-duped out because they are posts belonging to the same thread. You'll need to incorporate some extra logic that runs another Lucene search if the deduped set is < 10.
This is what the Nutch project does when collapsing multiple search results that belong to the same domain.
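A minimal Python sketch of that over-fetch-and-dedupe loop; run_search(query, limit) is a hypothetical stand-in for your Lucene call and is assumed to return hits ordered by score, each carrying a threadId field:

def unique_threads(query, wanted=10, overfetch=2, max_rounds=5):
    seen = []                                   # threadIds in ranking order
    limit = wanted * overfetch
    for _ in range(max_rounds):
        for hit in run_search(query, limit):    # hypothetical Lucene search call
            tid = hit['threadId']
            if tid not in seen:
                seen.append(tid)
                if len(seen) == wanted:
                    return seen
        limit *= 2                              # deduped set still short: fetch more
    return seen

Each round widens the fetch until the desired number of distinct threads is collected or the round limit is hit.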
When you index the threads, you should break each thread into posts and make each post a Document with a field containing a unique id identifying the thread to which it belongs.
When you do the search implementation, I would recommend using Lucene 2.9 or later, which enables you to use a Collector. Collectors let you preprocess the retrieved documents, and thereby you'll be able to group together posts that originate from the same thread id.
Just for completeness, the latest Lucene versions (from 3.2 onwards) support a grouping API that is very useful for this kind of use case:
http://lucene.apache.org/java/3_2_0/api/contrib-grouping/org/apache/lucene/search/grouping/package-summary.html