SQL Agent job failure universal handling - error-handling

I have a server running SQL Server 2012 with roughly two hundred scheduled jobs (all are SSIS package executions). Management has directed that some custom software must run to create a bug report ticket whenever a job fails. Right now, half the jobs notify an operator on failure, while the other half use a "go to step X - send failure email" action on failure for each step, where "step X" is some SQL that queries the DB and sends out an email saying which job failed at which step.
So what I'm looking for is some universal solution where I can have every job do the same thing when it fails (in this case, run some program that creates a bug tracking ticket). I am trying to avoid the situation where I manually go into every single job and add a new step at the end, with all previous steps changing to "go to step Y on failure" where step Y is this thing that creates the bug report.
My first thought was to create a new job that queries the execution history tables and looks for unhandled failures and then does the bug report creation itself. However, I already made the mistake of presenting this idea to the manager and was told it's not a viable solution because it's "reactive and not proactive" and also not creating tickets in real-time. I should know better than to brainstorm with non-programming management but it's too late, so that option is off the table and I haven't been able to uncover any other methods.
Any suggestions?

I'm proposing this as an answer, though it's not a technical solution. Present the possible solutions and let the manager decide:
Update all the Agent Jobs - This will take a lot of time and every job will need to be tested, which will also take a lot of time. I'd guess 2-8 weeks depending on how it's done.
Create an error handler job that monitors the logs and creates tickets based on those errors. This has two drawbacks: it is not "real-time" (as desired by the manager), and something will need to be put in place to ensure each error is reported only once. The upside is that there is only one change to manage. It can also be made near real-time if it runs every minute.
A third option, which would be more of a preliminary step, is to create an error report based on the logs. This will help you understand the quantity and types of failures, and may help shape the ultimate solution: do we want all these tickets, can they be broken up into different categories, do we want tickets for errors that are self-healing (e.g. connection errors which have built-in retries)?
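For the second option, the "report each error only once" part can be handled with a simple watermark. The sketch below is only an illustration in Python, not a finished monitor: the `HistoryRow` fields mirror columns of `msdb.dbo.sysjobhistory` (`instance_id`, `step_id`, `run_status`, where `run_status = 0` means failed and `step_id = 0` is the job outcome row), while the names `HistoryRow` and `new_failures` are hypothetical. How the rows are fetched and how the ticket is created are left out.

```python
from typing import Iterable, List, NamedTuple, Tuple


class HistoryRow(NamedTuple):
    instance_id: int  # increasing key of the history row
    job_name: str
    step_id: int      # 0 is the overall job outcome row
    run_status: int   # 0 = failed


FAILED = 0


def new_failures(rows: Iterable[HistoryRow], watermark: int) -> Tuple[List[HistoryRow], int]:
    """Return job-level failures newer than the watermark, plus the new watermark.

    The caller stores the returned watermark and passes it in on the next poll,
    so a failure that was already seen is never returned twice.
    """
    fresh = [r for r in rows if r.instance_id > watermark]
    failures = [r for r in fresh if r.step_id == 0 and r.run_status == FAILED]
    new_watermark = max((r.instance_id for r in fresh), default=watermark)
    return failures, new_watermark
```

Run every minute with the stored watermark, this produces one ticket per failed job run, regardless of how often the poll executes.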

Related

Pentaho Logging specify Job or Trans for each line

I am running Pentaho Kettle 6.1 through a java application. All of the Pentaho logs are directed through the java app and logged out into the same log file at the java level.
When a job starts or finishes the logs indicate which job is starting or finishing, but when the job is in the middle of running the log output only indicates the specific step it is on without any indication of which job or trans is executing.
This causes confusion and is difficult to follow when there is more than one job running simultaneously. Does anyone know of a way to prepend the name of the job or trans to each log entry?
Not that I know of, and I doubt there is one, for the simple reason that the same transformation/job may be split to run on more than one machine, by more than one user, and/or launched in parallel in different job hierarchies of callers.
The general answer is to log to a database (right-click anywhere, Parameters, Logging, define the logging table and what you want to log). All the logging will be copied to a database table together with a channel_id. This is a unique number that is attributed to each "run" and links together all the logging information that comes from all the dependent jobs/transformations. You can then view this info with a SELECT...WHERE channel_id=...
However, your case seems to be simpler. Use the database logging with a logging interval of, say, 2 seconds and run SELECT TRANSNAME/JOBNAME, LOG_FIELD FROM LOG_TABLE continuously in your terminal.
You can also follow a specific job/transformation by logging in a specific table, but this means you know in advance which is the job/transformation to debug.
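Once the log text is read back from the logging table, prepending the job or transformation name to each line is straightforward. The sketch below is in Python and assumes the rows have already been fetched as (name, log text) pairs, for example from the TRANSNAME/JOBNAME and LOG_FIELD columns mentioned above. The function name `prefix_log_lines` is hypothetical, and the database access itself is not shown.

```python
from typing import Iterable, Iterator, Tuple


def prefix_log_lines(rows: Iterable[Tuple[str, str]]) -> Iterator[str]:
    """Yield every non-empty log line prefixed with its job/transformation name.

    Each row is a (name, log_text) pair; log_text may span several lines.
    """
    for name, log_text in rows:
        for line in log_text.splitlines():
            if line.strip():
                yield f"[{name}] {line}"
```

Usage would be along the lines of `for line in prefix_log_lines(rows): print(line)`.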

Current session is no longer available due to structural changes in the database - Tabular

We are using a SQL Server Tabular model for self-service BI purposes. On a monthly basis, about 90 distinct people use the model. Recently we encountered some errors in the client tools (Excel and Power BI) that connect to the Tabular model. See screenshots. We have not made any significant changes to the model recently.
We noticed that the errors keep showing up after our incremental load, i.e. a full process of a number of partitions, which we run every 15 minutes. The process is kicked off by an SSIS job that is scheduled every 15 minutes and processes 5 partitions in 3 tables.
Edit: After some research I figured out that the problem lies in the perspectives. Every time I do a full process on any object, the error appears. This does not happen on the default model view. I still have not found a solution, though.
The error occurs when you interact with the Power BI report or the Excel file, for example when you do a refresh or click a filter. If you press refresh multiple times, the connection comes back and everything works as it is supposed to. It seems like the clients lose their connection to the model. After 15 minutes the problem occurs again.
This is very aggravating for the users. Especially when they are in the middle of a presentation.
This is what we tried:
We tried searching Google for a solution
Checked that we have the latest SQL Server 2016 update (13.0.5149.0)
SSAS builds from Visual Studio (2015 and 2017)
No full process on tables, only on partitions.
Upgrading the server from 4 to 8 cpu cores.
I hope somebody can help us.
You shouldn't get the error that you are seeing from just a full process of a partition, or even of the full table. We do this every hour for a number of core tables and we do not see any issues like this (and we would notice if we did).
I am starting from the hypothesis that
Your 15 minute process is doing more than just processing the partitions with a refresh command
Something else is happening on the environment (either scheduled or not). Who has permissions to change the schema? Could it be users / developers deliberately or not making changes?
The only things that should cause that kind of error would be Alter, Delete or CreateOrReplace TMSL commands
So unless that triggers your own ideas on a diagnostic process I would do the following steps
Note: I presume that your users also see this issue on your test environment when you run your 15 minute processing routine there. You should do the following on that test environment, where nothing else is running, to eliminate the possibility of someone else interfering with the experiment. If you don't have a representative test environment then you will have to do this on live, but I would do it out of hours or under some kind of change control process, with your 15 minute refresh turned off and admin permissions to the cube heavily locked down, to ensure that nothing can interfere with your experiment.
First prove that you can reproduce this issue with the 15 minute routine
Get your sample PowerBI report that is known to present the error (I'd prefer Power BI for a repro as it is slightly simpler than Excel)
Refresh your PowerBI and explore the data to prove that the error doesn't occur
Run your 15 minute process
You should now see the problem reported. If you do, great, you have a reproducible issue! If you don't, then it is not quite as you thought it was and you need to find a way of reliably reproducing these errors (perhaps something else is happening that isn't the 15 minute process).
So now you are sure how you can reproduce the issue, you need to isolate whether it is really the processing that is causing the problem
Refresh your PowerBI and explore the data to prove that the error doesn't occur
Execute (via SSMS) the command that fully processes the entire database
It should look something like this:
{
  "refresh": {
    "type": "full",
    "objects": [
      {
        "database": "yourdbname"
      }
    ]
  }
}
Do the thing that your users do when they see the issue.
If you too see the issue, then I would raise to Microsoft Support as this shouldn't happen
If you don't see the issue then you can refine this processing to just be the partition for a single table. But as we have done a process for the entire db above, it shouldn't change the result
If you still don't see the issue then it isn't the processing that is causing this issue (which I suspect) and it is something else in the 15 minute routine that is causing it. Look deeper into that process and understand what else it is doing.
Alongside this, checking the logs should show whether any other processing tasks or types of XMLA are happening.
I hope these ideas get you closer to finding the actual activity that is causing this experience for your users. It would be great if you could post with how you got on and what you found.
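For the narrower tests in the steps above (a single table or a single partition instead of the whole database), the same kind of refresh command can be scoped by adding "table" and "partition" keys to the object. The Python sketch below only builds the JSON text to paste into SSMS. The function name `refresh_command` and the example object names are hypothetical placeholders, not names from the model in question.

```python
import json
from typing import Optional


def refresh_command(database: str,
                    table: Optional[str] = None,
                    partition: Optional[str] = None,
                    refresh_type: str = "full") -> str:
    """Build a TMSL refresh command scoped to a database, table, or partition."""
    if partition is not None and table is None:
        raise ValueError("a partition can only be named together with its table")
    target = {"database": database}
    if table is not None:
        target["table"] = table
    if partition is not None:
        target["partition"] = partition
    command = {"refresh": {"type": refresh_type, "objects": [target]}}
    return json.dumps(command, indent=2)


# Placeholder names for illustration only.
print(refresh_command("yourdbname", table="yourtable", partition="yourpartition"))
```

Running the database, table, and partition variants one at a time, with nothing else touching the server, makes it easier to see which scope (if any) triggers the error.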
I have the same problem here if I install the latest CU on my SQL Server 2017. My production environment is still running with CU3 (Jan/2018) due to this problem.
Knowing that, I would suggest reverting your installation to a previous release, maybe 13.0.5026.0 (SP2) or even 13.0.4466.4 (Jan/2018).
I am facing the same issue with SQL Server 2017 CU 11 installed.
The issue indeed occurs in case of a 'full refresh' in combination with the use of a 'perspective' in an existing connection. The workaround to use the default 'Model' in the connection does indeed 'solve' the issue.

Select Query from web application times out but completes with an error?

Background:
Two nights ago the old-as-hell and very poorly designed website for the company I work for got attacked by a bot that submitted 5000+ phony orders. In the course of deleting all of those false orders from the database, SQL Management Studio crashed, and the application had to be stopped via Task Manager and restarted. After that I was getting optimistic concurrency control errors when trying to delete some of the fake records, and had to complete the cleanup via a DELETE statement.
(yes, I KNOW it's generally bad practice to delete records from the results pane, but for people like me who aren't actually programmers but get stuck with the IT work because we're the only ones who know how to find the on switch, it makes me less paranoid that I won't delete a record I didn't mean to)
Ever since then, there is a specific page in the admin section of the site that takes a VERY long time to perform a SELECT query for a specific range. The query will complete if you sit there long enough, but here's a screenshot of the ColdFusion error box that comes up with it:
[Screenshot: ColdFusion error message]
I suspect that between the bot attack and Studio Express crashing in the middle of a DELETE query, part of the table is corrupted, which is why the query exceeds the allowable time limit. I don't know if our webhost has a backup of the database (I've been in contact with them the last couple of days).
What tools can I use to check for and repair errors on that table?

How to get SQL executed or transaction history on a Table (AS400) DB2 for IBM i

I have an issue in our database (AS400 - DB2): all the rows in one of our tables were deleted. I do not know if it was a program or SQL that a user executed. All I know is that it happened at around 3 a.m. I did check for any scheduled jobs at that time.
We managed to get the data back from backups but I want to investigate what deleted the records or what user.
Are there any logs on the AS400 for physical tables that show what SQL was executed against a specified table, and when? This would help me determine what caused this.
I tried checking System i Navigator but could not find any logs... Is there a way of getting transactional data on a table using System i Navigator or the green screen? And can I get the SQL that was executed in that time frame?
Any help would be appreciated.
There was no mention of how the time was inferred or determined, but in the absence of journaling, I would suggest that a good approach is to gather information about the file and member immediately: DSPOBJD for both *SERVICE and *FULL, DSPFD for *ALL, DMPOBJ, and perhaps even a copy of the row for the TABLE from the catalog [to include the LAST_ALTERED_TIMESTAMP or ALTEREDTS column of SYSTABLES, or the based-on field DBXATS from QADBXREF]. Gathering those is worthwhile almost only if done before any other activity [esp. before any recovery activity]; it can help establish the time of the event and perhaps hint at what the event was. Most timestamps reflect only the most recent activity against the object [rather than serving as a historical log], so any recovery activity is likely to overwrite the timestamps that would have reflected the prior event or activity.
Even if there was no journal for the file and nothing in the plan cache, there may have been [albeit unlikely] an active SQL Monitor. An active monitor should be visible somewhere in the iNav GUI as well. I am not sure of the visibility of a monitor that may have been active in a prior time frame.
Similarly despite lack of journaling, there may be some system-level object or user auditing in effect for which the event was tracked either as a command-string or as an action on the file.member; combined with the inferred timing, all audit records spanning just before until just after can be reviewed.
Although there may have been nothing in the scheduled jobs, the History Log (DSPLOG) since that time may show jobs that ended, or [perhaps soon] prior to that time show jobs that started, which are more likely to have been responsible. In my experience, often the name of the job may be indicative; for example the name of the job as the name of the file, perhaps due only to the request having been submitted from PDM. Any spooled [or otherwise still available] joblogs could be reviewed for possible reference to the file and\or member name; perhaps a completion message for a CLRPFM request.
If the action may have been from a program, the file may be recorded as a reference-object such that output from DSPPGMREF may reveal programs with the reference, and any [service] program that is an SQL program could have their embedded SQL statements revealed with PRTSQLINF; the last-used for those programs could be reviewed for possible matches. Note: module and program sources can also be searched, but there is no way to know into what name they were compiled or into what they may have been bound if created only temporarily for the purpose of binding.
Using System i Navigator, expand Databases. Right click on your system database. Select SQL Plan Cache-> Show Statements. From here, you can filter based on a variety of criteria.
This is not sure-fire, but often saves me some time. Using System i Navigator, right-click on the table and choose Index Advisor. If you're lucky, one or more indexes are advised. If so, sort by date last advised and right click on the index with the newest date and select Show Statements... In that dialog box, either sort by date to help narrow things down or just scroll through the statements to find the one you're interested in. Right-click it and select Work with SQL Statement and there you go.

Start SQL Server Jobs when field = specific value

I don't know if this is even possible, so I would appreciate any ideas, even those outside of SQL Server 2005, on how this might be accomplished. I have a linked server set up to a remote mainframe and I have a simple import job that runs overnight. The problem is that the table on the mainframe that the import needs to come from is just a temporary report file that gets overwritten each time a user runs that report, sometimes with different parameters, so the data is always changing. One request was that the SQL job would run only when a specific user runs the report. The user is stored as a field in the same mainframe report table that the import is coming from. Setting up a scheduled run on the mainframe is not an option since we don't control it, and having the owners set it up would be costly, don't ask me why.
Any ideas that will keep me from forcing the user to run the mainframe report at a specific time would be helpful.
Well, the only thing that you could do from this side is to poll periodically and detect a change. You may try to set up a job that queries only the report version, timestamp and author. The job runs every 5 minutes and triggers the import job when it detects a change. Not elegant, but it may be good enough.
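The decision the polling job has to make on each run can be sketched as follows. This is an illustration in Python under stated assumptions: the report table is assumed to expose a version, a timestamp and an author, as suggested above, and the names `ReportStamp` and `should_import` are hypothetical. Reading the stamp through the linked server and starting the import job are not shown.

```python
from typing import NamedTuple, Optional


class ReportStamp(NamedTuple):
    version: str
    timestamp: str
    author: str


def should_import(previous: Optional[ReportStamp],
                  current: ReportStamp,
                  wanted_author: str) -> bool:
    """Decide whether the import job should be triggered on this poll.

    Triggers only when the report has changed since the last poll (or there
    was no earlier poll) and the latest run was made by the wanted author.
    """
    changed = previous is None or current != previous
    return changed and current.author == wanted_author
```

The job would store `current` after every poll and pass it in as `previous` five minutes later, so the same report run never triggers the import twice.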