Summary Index in Splunk - splunk

Can you please help me with the time stamp of a summary index?
We have a disk space issue and are clearing the old logs, but we want to keep some field data. If we schedule a summary index (SI), will it add the data from the last month all at once? If so, why do we need to schedule it? I have gone through the Splunk documentation but am unable to understand the steps and logic.

The idea of a summary index is to store the results of a search until they are needed for a later search. The classic example is the end-of-month report. Rather than run a huge search over thirty days to crunch the thousands of events of each day into a final report, a daily search crunches the events of that day into an SI. The monthly report then runs on day 30 and reads the 30 summary events from the SI, so it finishes quickly. The same SI can then be used for end-of-week reports and to populate a dashboard with the daily sales (or whatever) figures.
The key is to make the summary smaller than the original data. One cannot dump 1 month of data into an SI and hope to save space - it won't happen.
A summary index can help save disk space by retaining a smaller set of summary data long after the original events have been discarded.
Summaries do not have to be scheduled, but that is the most common way to produce them. It means no one has to remember to run the daily sales reports every day to be able to get the monthly sales report. That said, one can write events to a summary index in an ad-hoc search using the collect command.
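To make the daily-summary idea concrete, here is a minimal Python sketch of the same logic outside Splunk. It is not SPL and not how Splunk stores data internally; the event list, the function names and the sales figures are all made up for illustration. It only shows why the month-end report can run from a handful of small summary records once the raw events are gone.

```python
from datetime import date

# Hypothetical raw events: (day, sale amount). In Splunk these would be the
# original indexed events that are eventually deleted to free disk space.
raw_events = [
    (date(2024, 1, 1), 10.0),
    (date(2024, 1, 1), 15.0),
    (date(2024, 1, 2), 7.5),
    (date(2024, 1, 2), 2.5),
    (date(2024, 1, 3), 20.0),
]


def summarize_day(events, day):
    """Collapse all events of one day into a single small summary record."""
    amounts = [amount for d, amount in events if d == day]
    return {"day": day, "count": len(amounts), "total": sum(amounts)}


# Stand-in for the scheduled daily search: one summary record per day.
summary_index = []
for day in sorted({d for d, _ in raw_events}):
    summary_index.append(summarize_day(raw_events, day))


def monthly_report(summary, year, month):
    """Build the month-end figures from the summary records alone."""
    rows = [r for r in summary if r["day"].year == year and r["day"].month == month]
    return {
        "events": sum(r["count"] for r in rows),
        "total": sum(r["total"] for r in rows),
    }


report = monthly_report(summary_index, 2024, 1)
print(len(raw_events), "raw events ->", len(summary_index), "summary records")
print(report)
```

The daily step has to run while the raw events still exist, which is why it is usually scheduled: once the old logs are cleared, only what was already summarized can be reported on.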

Related

report scheduler system design using database as master

Problem
we have ~50k scheduled financial reports that we periodically deliver to clients via email
reports have their own delivery frequency (date&time format - as configured by clients)
weekly
daily
hourly
weekdays only
etc.
Current architecture
we have a table called report_metadata that holds report information
report_id
report_name
report_type
report_details
next_run_time
last_run_time
etc...
every week, all 6 instances of our scheduler service poll the report_metadata database, extract metadata for all reports that are to be delivered in the following week, and put them in an in-memory timed queue.
Only in the master/leader instance (which is one of the 6 instances):
data in the timed-queue is popped at the appropriate time
processed
a few API calls are made to get a fully-complete and current/up-to-date report
and the report is emailed to clients
the other 5 instances do nothing - they simply exist for redundancy
Proposed architecture
Numbers:
db can handle up to 1000 concurrent connections - which is good enough
total existing report number (~50k) is unlikely to get much larger in the near/distant future
Solution:
instead of polling the report_metadata db every week and storing data in a timed-queue in-memory, all 6 instances will poll the report_metadata db every 60 seconds (with a 10 s offset for each instance)
on average the scheduler will attempt to pick up work every 10 seconds
data for any single report whose next_run_time is in the past is extracted, the table row is locked, and the report is processed/delivered to clients by that specific instance
after the report is successfully processed, the table row is unlocked and the next_run_time, last_run_time, etc. for the report are updated
In general, the database serves as the master, individual instances of the process can work independently and the database ensures they do not overlap.
It would help if you could let me know:
whether the proposed architecture is a good/correct solution
which table columns can/should be indexed
any other considerations
I have worked on a different kind of scheduler, for a program that reported analyses at a specific moment of the month/week. What I did was combine the reports into so-called business-cycle-based time moments. These moments are "start of a new week", "start of the month", "start/end of a D/W/M/Q/Y". So I standardised the moments of sending the reports and added the IDs to a table that would carry the details of the report. Now you add things to the cycle or remove them when needed; you could do this by adding a tag such as EOD (end of day), EOM (end of month), SOW (start of week), etc.
So you could index the moments of when the clients want to receive the reports and build on that track. Hope that this comment can help you with your challenge.
It seems fine to have all 6 instances simply query that metadata table to check which report to process next, as you are suggesting.
It seems odd, though, to have a staggered approach with a check once every 60 seconds offset by 10 seconds for your servers. You have 6 servers now, but that may change. Also, I don't understand the "locking" you are suggesting. Why not simply set a flag on the row such as [State] = "processing", so that the next scheduler knows to skip that row and move on to the next available one? Once a run is processed, you can simply update a [Date_last_processed] column, or maybe something like [last_cycle_complete] = 'YES'.
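A rough sketch of the state-flag idea, using Python's built-in sqlite3 module as a stand-in for the real database. The state and claimed_by columns, the index, the function names and the integer timestamps are assumptions made for the example; only report_id, next_run_time and last_run_time come from the question. The point it tries to show is that the flag should be set with a conditional UPDATE, so that two instances polling at the same moment cannot both claim the same row.

```python
import sqlite3


def make_db():
    """Create an in-memory stand-in for the report_metadata table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE report_metadata ("
        " report_id INTEGER PRIMARY KEY,"
        " next_run_time INTEGER NOT NULL,"      # epoch seconds, for simplicity
        " last_run_time INTEGER,"
        " state TEXT NOT NULL DEFAULT 'idle',"  # assumed column
        " claimed_by TEXT)"                     # assumed column
    )
    # One plausible index for the polling query below (an assumption, not a rule).
    conn.execute("CREATE INDEX idx_due ON report_metadata (state, next_run_time)")
    return conn


def claim_due_report(conn, worker, now):
    """Try to claim one due report; return its id, or None if nothing was claimed."""
    row = conn.execute(
        "SELECT report_id FROM report_metadata "
        "WHERE state = 'idle' AND next_run_time <= ? "
        "ORDER BY next_run_time LIMIT 1",
        (now,),
    ).fetchone()
    if row is None:
        return None
    # The UPDATE repeats the state check, so if another instance claimed the
    # row in the meantime, this statement changes nothing and rowcount is 0.
    cur = conn.execute(
        "UPDATE report_metadata SET state = 'processing', claimed_by = ? "
        "WHERE report_id = ? AND state = 'idle'",
        (worker, row[0]),
    )
    conn.commit()
    return row[0] if cur.rowcount == 1 else None


def finish_report(conn, report_id, now, interval):
    """Release the row and schedule the next run after a successful delivery."""
    conn.execute(
        "UPDATE report_metadata SET state = 'idle', claimed_by = NULL, "
        "last_run_time = ?, next_run_time = ? WHERE report_id = ?",
        (now, now + interval, report_id),
    )
    conn.commit()


conn = make_db()
conn.executemany(
    "INSERT INTO report_metadata (report_id, next_run_time) VALUES (?, ?)",
    [(1, 100), (2, 200), (3, 900)],
)
now = 500
first = claim_due_report(conn, "instance-a", now)
second = claim_due_report(conn, "instance-b", now)
third = claim_due_report(conn, "instance-c", now)  # report 3 is not due yet
finish_report(conn, first, now, interval=3600)
print(first, second, third)
```

One thing this sketch leaves out is what happens when an instance dies while a row is in the 'processing' state; some kind of timeout on claimed rows would probably be needed in practice.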
Alternatively, you could have one server process go through the table and, for each available row, send it off to one of the instances in a round-robin fashion (or keep track of who is busy and who isn't).
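The round-robin variant could look roughly like this in Python. The function name, the worker names and the report IDs are invented for the example, and the actual hand-off to an instance (a queue, an HTTP call, and so on) is left out.

```python
from itertools import cycle


def dispatch_round_robin(due_report_ids, instances):
    """Assign each due report to the next instance in turn."""
    assignment = {name: [] for name in instances}
    turn = cycle(instances)
    for report_id in due_report_ids:
        assignment[next(turn)].append(report_id)
    return assignment


plan = dispatch_round_robin(
    [101, 102, 103, 104, 105],
    ["worker-1", "worker-2", "worker-3"],
)
print(plan)
```

Plain round-robin ignores how long each report takes, so tracking which instances are busy, as mentioned above, may balance the load better when report sizes vary a lot.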

Bin Packing Truck Load Optimization using Ms Excel

Currently we're doing the truck load optimization plan manually, and it takes a lot of time to finish the plan, especially during peak season. My initiative is to automate the plan using an Excel macro in order to save time so we can focus on other tasks. I've been searching the internet these past few days, but unfortunately I haven't been able to find any solution related to MS Excel.
Below is the scenario of my problem.
The first snapshot is the list of products and their configuration in pieces/box.
The second snapshot is the sample orders of the products and their equivalent boxes.
The constraint is that each truck can load only up to 1,000 boxes and the maximum weight is 5,000 kg. Any products can be combined to load the truck as long as the load does not exceed either limit.
The third snapshot is the expected result based on the given constraints.
Appreciate your help on how to do this load optimization using ms excel vba. Thanks in Advance!
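As a starting point, here is a small Python sketch of a greedy first-fit heuristic for the two limits stated above (1,000 boxes and 5,000 kg per truck). It is written in Python rather than VBA, and the product names, box counts and weights per box are invented, since the snapshots are not available here. It assumes an order may be split across trucks box by box. Being a heuristic, it does not guarantee the minimum number of trucks, but the same loop structure should port to an Excel macro fairly directly.

```python
MAX_BOXES = 1000      # per truck, from the question
MAX_WEIGHT_KG = 5000  # per truck, from the question


def plan_trucks(orders, max_boxes=MAX_BOXES, max_weight=MAX_WEIGHT_KG):
    """Greedy first-fit: heaviest boxes first, fill each truck as far as allowed.

    orders: list of (product, boxes, kg_per_box) tuples.
    Returns a list of trucks, each a dict with "boxes", "weight" and "load".
    """
    trucks = []
    for product, boxes, kg_per_box in sorted(orders, key=lambda o: o[2], reverse=True):
        if kg_per_box <= 0 or kg_per_box > max_weight:
            raise ValueError(f"cannot load a box of {product} weighing {kg_per_box} kg")
        remaining = boxes
        index = 0
        while remaining > 0:
            if index == len(trucks):
                trucks.append({"boxes": 0, "weight": 0, "load": {}})
            truck = trucks[index]
            room_by_count = max_boxes - truck["boxes"]
            room_by_weight = int((max_weight - truck["weight"]) // kg_per_box)
            take = min(remaining, room_by_count, room_by_weight)
            if take > 0:
                truck["load"][product] = truck["load"].get(product, 0) + take
                truck["boxes"] += take
                truck["weight"] += take * kg_per_box
                remaining -= take
            index += 1
    return trucks


# Hypothetical orders: (product, boxes, kg per box).
orders = [("A", 400, 8), ("B", 700, 5), ("C", 300, 2)]
trucks = plan_trucks(orders)
for number, truck in enumerate(trucks, start=1):
    print(number, truck["load"], truck["boxes"], "boxes", truck["weight"], "kg")
```

Sorting by weight per box is just one possible ordering; trying a few orderings and keeping the plan with the fewest trucks is a cheap way to improve on a single greedy pass.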

Database mining, Auto Graph and email at certain time of the day

I have some automated machines running a pasteurization process; sensors register values such as temperature, time, pressure, etc.
Our main control software does offer us a historic graph for such values; however, it's not possible to mail them. The software is able to log all the data into a Microsoft Access/SQL database. For the company, our days start at 6 AM, so a 24-hour period is meant to start again at 6 every day.
Now the question:
Is there a way to mine the database (I can choose either) to graph all the values from 6 AM the previous day to 6 AM the current day (X,Y plot) automatically in Excel, and have it automatically sent to some mail recipients every day at 6 AM?
If so, how can I do this?
You could get Excel to query the Access database (if the email recipients are in your LAN), so when you refresh the query (or you can get it to refresh automatically when you open the spreadsheet) it fetches the relevant information for the day only. So no emails would be necessary, as you always use the same spreadsheet.
To do this, first open your Access database and create a view (query) to fetch the data you need (use this as a reference to get today's or yesterday's data: https://support.office.com/en-us/article/Examples-of-using-dates-as-criteria-in-Access-queries-aea83b3b-46eb-43dd-8689-5fc961f21762).
Then follow this tutorial to fetch the data from your view into excel: http://www.excel-easy.com/examples/import-access-data.html
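Whichever tool runs the query, the date criteria have to express the 6 AM to 6 AM window. A minimal Python sketch of that calculation follows; the function names and the sample readings are made up, and the same start/end values would be used as the criteria in the Access or SQL query.

```python
from datetime import datetime, time, timedelta


def reporting_window(now, day_start=time(6, 0)):
    """Return (start, end) of the most recently completed 6 AM to 6 AM period."""
    end = datetime.combine(now.date(), day_start)
    if now < end:
        # Before 6 AM the latest completed period ended yesterday at 6 AM.
        end -= timedelta(days=1)
    return end - timedelta(days=1), end


def rows_in_window(rows, start, end):
    """Keep readings with start <= timestamp < end; rows are (timestamp, value)."""
    return [(ts, value) for ts, value in rows if start <= ts < end]


# Hypothetical sensor readings: (timestamp, temperature).
readings = [
    (datetime(2024, 3, 9, 5, 59), 71.8),
    (datetime(2024, 3, 9, 6, 0), 72.0),
    (datetime(2024, 3, 10, 5, 59), 72.4),
    (datetime(2024, 3, 10, 6, 0), 72.1),
]

start, end = reporting_window(datetime(2024, 3, 10, 6, 0))
selected = rows_in_window(readings, start, end)
print(start, end, len(selected))
```

Using a half-open interval (start included, end excluded) keeps a reading taken exactly at 6 AM from showing up in two consecutive daily graphs.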

Schedule algorithm for nightly SQL extract of data

I am looking for an algorithm to extract data from one system into another, but on a sliding scale. Here are the details:
Every two weeks, 80 weeks of data needs to be extracted.
Extracts take a long time and are resource intensive so we would like to distribute the load of the extract over time.
The first 8-12 weeks are the most important and need to be updated more often over the two week window. Data further out can be updated less frequently, to the point where the last 40+ weeks could even just be extracted once every two weeks.
Every two weeks, the start date shifts two weeks ahead and so two new weeks are extracted.
Extract procedure takes a start and end date (this is already made and should be treated like a black box). The procedure could be run for multiple date spans in a day if required but contiguous dates are faster than multiple blocks of dates.
Extract blocks should be no smaller than 2 weeks and probably no greater than 16 weeks. Longer blocks are possible, but 16 weeks is already a significant load on the system.
4 contiguous weeks of data takes approximately 1 hour. It takes a long time because the data needs to be generated/calculated.
Data that is newly extracted replaces the old data for the timespan. No need to merge or diff the data, it is just replaced.
This algorithm needs to be built into a SQL job which will handle the daily process (triggered once a day only).
My initial thought was to create a sliding schedule pretty much. Rotate the first 4 week block every second day and then the second 4 week block every 3 to 4 days. The rest of the data would be extracted in blocks in smaller chunks over the two week period.
What I am going to do will work but I wanted to spend some time seeing if there might be a better way to approach the problem. Mainly looking for an algorithm to do the start/end date schedule for the daily extract.
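One way to write down the sliding schedule described above is as a fixed 14-day cycle that maps each day to the week ranges to extract. The Python sketch below is only an illustration of that idea: the 8-week size for the later chunks, the "put each chunk on the lightest day" rule and all function names are assumptions, and the real version would live in the SQL job and call the existing extract procedure with the resulting dates.

```python
from datetime import date, timedelta

CYCLE_DAYS = 14


def build_cycle(tail_chunk_weeks=8, total_weeks=80):
    """Return {day_in_cycle: [(first_week, end_week_exclusive), ...]}."""
    plan = {day: [] for day in range(CYCLE_DAYS)}
    # Weeks 0-3 every second day, weeks 4-7 every fourth day (as in the question).
    for day in range(0, CYCLE_DAYS, 2):
        plan[day].append((0, 4))
    for day in range(1, CYCLE_DAYS, 4):
        plan[day].append((4, 8))
    # Remaining weeks once per cycle: each chunk goes to the day that currently
    # has the fewest weeks to extract (ties go to the earliest day).
    for first in range(8, total_weeks, tail_chunk_weeks):
        chunk = (first, min(first + tail_chunk_weeks, total_weeks))
        lightest = min(plan, key=lambda d: (sum(b - a for a, b in plan[d]), d))
        plan[lightest].append(chunk)
    return plan


def blocks_for(run_date, cycle_start, plan):
    """Translate the week offsets for one run date into (start_date, end_date) pairs."""
    elapsed = (run_date - cycle_start).days
    day_in_cycle = elapsed % CYCLE_DAYS
    # Every completed cycle shifts the 80-week window two weeks ahead.
    window_start = cycle_start + timedelta(weeks=2 * (elapsed // CYCLE_DAYS))
    return [
        (window_start + timedelta(weeks=a), window_start + timedelta(weeks=b) - timedelta(days=1))
        for a, b in plan[day_in_cycle]
    ]


plan = build_cycle()
heaviest_day_weeks = max(sum(b - a for a, b in blocks) for blocks in plan.values())
todays_blocks = blocks_for(date(2024, 1, 4), date(2024, 1, 1), plan)
print(plan)
print(heaviest_day_weeks, todays_blocks)
```

With these assumed numbers no day extracts more than 12 weeks, which at roughly 1 hour per 4 weeks is about 3 hours. Since contiguous dates are faster, a further refinement would be to prefer placing a chunk on a day whose existing block is adjacent to it.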

SSRS Data-Driven Subscription [based on static Subscription table] Not Picking Up Changes Made to Subscription Table

I have a .RDL report which I designed in BIDS and have deployed to my report server. The report asks for three parameters before viewing report: Year, Month and Customer ID. The report works great and does exactly what it is supposed to.
I used to run each report individually because there were only 2-3 customers, but now there are 30+ customers who receive the report, so I wanted to switch to a more automated fulfillment method to get the reports generated. After doing some research, it appears that using Report Manager to create a "Data Driven Subscription" (DDS) with the "Windows File Share" option gives me the capabilities I need.
As part of creating the DDS, I created a table called [Subscription] which is a table containing one row for each customer receiving the report and has the following columns:
Year
Month
CustomerID
FileName
FileLocation
Overwrite
Format
...so through using the DDS Wizard in Report Manager, I was able to successfully set up a Data Driven Subscription (which is linked to various columns in the [Subscription] table) which creates a new report for each customer in the [Subscription] table, saves [and overwrites, if necessary] it in a location of my choosing as a PDF (specified in [Subscription].[FileLocation], or the FileLocation column of my table for each row), and runs every minute (I plan on changing frequency to once a week, eventually).
This works flawlessly, giving me a new set of 30 reports in the directory of my choosing, with each report having a name I assigned in the FileName column of my table. Exactly what I was looking for.
HERE'S THE PROBLEM: When I update the FileLocation or FileName (or anything, really) in the [Subscription] table, it doesn't pick up the changes right away. Sometimes it doesn't pick them up at all. For example, I updated the [ReportName] column for one customer from Report_711622 to SpecialReport_711622, so that the output file for that customer should be named SpecialReport_711622 while all of the other reports should be called Report_XXXXX [no Special prefix]. But the file name of the report for Customer 711622 remains the same!
It's almost like the job only sees what it needs to do once a day and does not go back and reference the [Subscription] table until I leave for the night; when I come back in the morning, it has picked up the change.
Since I am about to scale this process out to a large customer-base using a different report, I need to be able to make edits to the [Subscription] table and have them get picked up by the Data Driven Subscription immediately (and if not immediately, at least a fixed interval of time that I can adjust, so that I can know 100% when the change will get picked up).
Does anyone know what's causing my lag? How do I change it so that updates to the Subscription table get picked up regularly? I'm also having issues with creating new DDS on other reports (following the exact process outlined above) - I've created the subscriptions, for every minute, and it says they are running and the number of outputs match the number of customers with 0 errors, but there are no files in the drive I specified (or anywhere else I've looked, for that matter).
Any help would be greatly appreciated!
I think the answer lies in the mechanism SSRS uses. There are a few places "lag" can occur.
The subscription is in fact an SQL Agent job which creates a record in the Event table. This table is a queue that SSRS checks to do scheduled tasks.
There is a small amount of time between the moment the subscription creates the Event record and the moment SQL reads it and starts creating the dataset for your DDS. The creation of the DDS dataset takes some time, too. During this time, the subscription will be in the Pending state. If you change anything in the data during this time, the subscription will still use the old data as report parameters, so you will not notice your change until the next scheduled run.
Which brings me to the following: if a subscription is still being run and the next schedule kicks in (chances are, because yours runs every minute), the engine will not execute it, but wait for the next subscription schedule, and so on. So that's another possibility of lag - and cause of missing reports for a certain schedule minute. The subscription processes reports sequentially, one row from your DDS recordset at a time. Again, this takes some time. You can also see that in the subscription window when it says: # of # processed.
I suggest you look at the Event table in the database ReportServer during an execution. Also the ExecutionHistory views (there are 3) may be interesting. A scheduled run shows up as a RequestType = 1 and generates one record for each report. You can see the exact timing and parameters of each report that is run in the subscription. You may be able to extract the data you need to resolve your other issues.
EDIT: Here is a more elaborate guide to DDS data and events
http://blogs.msdn.com/b/deanka/archive/2009/01/13/diagnosing-and-troubleshooting-subscriptions.aspx
http://blogs.msdn.com/b/deanka/archive/2010/02/16/troubleshooting-subscriptions-part-ii-using-the-report-services-trace-log-file.aspx
Could this "Double-Hop" problem be the source of my issues? I'm so stuck on this one!
The Double-Hop Problem - MSDN Knowledgecast