I have some audio files in S3 bucket. I want to write a splunk query to find the top 500 audio files and categorize them by date and store them in a folder. Also I want to do it every night. What should be my query?
You would do something like the following.
index=audiofiles earliest=-1d@d | stats count by title
The earliest=-1d@d means you only want to look at the last day's worth of logs, snapped to the start of the day. The stats command counts the frequency of each song by title. I am assuming that there is a title field in your data.
You would probably choose to schedule a search like this to run every night. Should you wish to save the results, you could write them into a summary index with the collect command, or write them out to a CSV with the outputcsv command.
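If you specifically need the top 500 per night, a sketch along these lines might work (the title field, the day bucketing, and the CSV name are assumptions about your data and about what "categorize by date" should mean):

index=audiofiles earliest=-1d@d latest=@d
| eval day=strftime(_time, "%Y-%m-%d")
| stats count by day title
| sort -count
| head 500
| outputcsv audio_top500_daily

Scheduled to run shortly after midnight, this covers the previous full day; outputcsv writes the file under var/run/splunk/csv on the search head, and swapping it for collect index=audio_summary would keep the results in a summary index instead.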
Related
Can you please help me with the timestamp of a summary index?
We are having a disk space issue and are clearing the old logs, but we want to keep some field data. If I schedule a SI, will it add the data from the last 1 month all at one time? Then why do we need to schedule it? I have gone through the Splunk documentation but am unable to understand the steps and logic.
The idea of a summary index is to store the results of a search until they are needed for a later search. The classic example is the end-of-month report. Rather than run a huge search over thirty days to crunch the thousands of events of each day into a final report, a daily search crunches the events of that day into a SI; then the monthly report runs on day 30 to read the 30 summary events from the SI into a report that runs quickly. The same SI can then be used for end-of-week reports and to populate a dashboard with the daily sales (or whatever) figures.
The key is to make the summary smaller than the original data. One cannot dump 1 month of data into a SI and hope to save space - it won't happen.
A summary index can help save disk space by retaining a smaller set of summary data long after the original events have been discarded.
Summaries do not have to be scheduled, but that is the most common way to produce them. It means no one has to remember to run the daily sales report every day in order to get the monthly sales report. That said, one can write events to a summary index from an ad-hoc search using the collect command.
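As a rough sketch of that daily-crunch pattern (the sales index, the price field, and the summary_sales index name are all placeholders):

A nightly scheduled search writes one small summary event per day:

index=sales earliest=-1d@d latest=@d
| stats count as orders sum(price) as daily_sales
| collect index=summary_sales

The end-of-month report then reads roughly 30 tiny summary events instead of re-crunching the raw data:

index=summary_sales earliest=-30d@d
| stats sum(orders) as monthly_orders sum(daily_sales) as monthly_sales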
I am getting data into Splunk from Snowflake using Splunk DB Connect. This is just simple orders data. In Splunk Search & Reporting I am running the following query on my table to get a visualization.
source="big_data_table_inner_join" "UNITS_SOLD" | top COUNTRY
What I am seeing is that each time I run the query, the number of events in Splunk changes quite heavily. For example, after running it the first time there were 342000 events, and when I ran the same query again there were 67445 events. Any idea why this is happening?
I have Splunk holding 3 months of log details, which get rolled after that (we can see no history beyond that point). My requirement is to store those log details in another folder in Splunk that keeps all the log info with full history. I am not sure how to extract data from Splunk. Can we use any Java code, or any API, to extract the log data from Splunk and store it elsewhere?
I'm new to Splunk.
You need to investigate the following:
index retention (and for Smart Store)
storage availability
if you have an index set for 500G or 1 year, but you store 50G per day, you'll rotate at 10 days
if you have an index set for 500G or 1 year, but only have 400G of available storage, it will rotate sooner
In addition to the answer by @warren, look into the coldToFrozenDir and coldToFrozenScript settings in indexes.conf. These settings govern where and how data is archived rather than deleted. The data is not exported, however; it is stored in Splunk's proprietary format.
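For reference, those limits and archive settings live in the per-index stanza of indexes.conf; the values below are illustrative placeholders, not recommendations:

[my_index]
# freeze buckets once the index reaches ~500 GB or events are older than 1 year
maxTotalDataSizeMB = 512000
frozenTimePeriodInSecs = 31536000
# copy frozen buckets to this directory instead of deleting them
coldToFrozenDir = /opt/splunk_archive/my_index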
From Splunk, I am trying to get the user, the saved search name, and the last time a query ran.
A single Splunk query would be nice.
I am very new to Splunk and I have tried this query:
index=_audit action=search info=granted search=*
| search IsNotNull(savedsearch_name) user!="splunk-system-user"
| table user savedserach_name user search _time
The above query is always empty for savedsearch_name.
Splunk's audit log leaves a bit to be desired. For better results, search the internal index.
index=_internal savedsearch_name=* NOT user="splunk-system-user"
| table user savedsearch_name _time
You won't see the search query, however. For that, use REST.
| rest /services/saved/searches | fields title search
Combine them with something like this (there may be other ways):
index=_internal savedsearch_name=* NOT user="splunk-system-user"
| fields user savedsearch_name _time
| join savedsearch_name [| rest /services/saved/searches
| fields title search | rename title as savedsearch_name]
| table user savedsearch_name search _time
Note that you have a typo in your query. "savedserach_name" should be "savedsearch_name".
But I also recommend a free app that has a dedicated search tool for this purpose.
https://splunkbase.splunk.com/app/6449/
Specifically the "user activity" view within that app.
Why it's a complex problem: part of the puzzle is in the audit log's info="granted" event, another part is in the audit log's info="completed" event, and even more of it is over in the introspection index. You need those three stitched together, and the audit log is plagued with parsing problems; autokv compounds the problem by extracting all of the fields from the SPL itself.
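If you want to attempt the stitching by hand anyway, a very rough sketch (assuming the audit events carry search_id, info, and total_run_time fields, and ignoring the introspection index entirely) might start with:

index=_audit action=search (info=granted OR info=completed) user!="splunk-system-user"
| stats latest(user) as user latest(search) as search max(total_run_time) as total_run_time latest(_time) as _time by search_id

This won't cope with the parsing and autokv problems mentioned above, though.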
That User Activity view will do all of this for you, sidestep some pretty thorny autokv problems in the audit data, and not just give you all of this per search, but also present stats and rollups by user, app, dashboard, even by sourcetypes-that-were-actually-searched.
It also has a macro called "calculate pain" that scores a "pain" number for each search and then sums up all the "pain" in the by-user, by-app, and by-sourcetype rollups, so that admins can try to pick off the worst offenders first.
It's up on Splunkbase here and approved for both Cloud and on-prem: https://splunkbase.splunk.com/app/6449/
(And there's a #sideview_ui channel for it in the community Slack.)
I was wondering if it is possible to work on a per-row basis in Kettle?
I am trying to implement a reporting scheme which consists of a table where requests get queued for processing, and a Pentaho job that picks up the records from that table.
My job currently has 3 transformations in it:
1st is to get records from the queued requests table
2nd is to analyze the values on each record and come up with multiple results based on that record. For example, a user would request records of movies in the horror genre; it should then spit out the horror movies.
3rd is to further retrieve information about the movies, such as the year, director, etc., which is to be output to an Excel file.
This is the idea, but it's a bit challenging to do in Pentaho as it does everything all at the same time. Is there a way I can make my job work on records one by one?
EDIT.
Just to add: I have been trying to extend the implementation of the Pentaho cookbook sample, but compared to my design, it's like step 2 and step 3 only.
I can't seem to make the Table Input step work one row at a time.
I just made it act like the implementation in the cookbook, with some adjustments. Instead of using two transformations to gather all the necessary fields, I retrieved all the information that I need in 1 transformation.
After that I copied that information to the next steps, then ran some queries to complete the information, and it is now working.
Passing parameters between transformations is a bit confusing; there are parameters to be set on the transformation itself and also on the job where the transformations live, so I did some guessing for a while just to make it work.