CDC LSNs: queries return different minimum value - sql

I have CDC enable on a table and I'm trying to get the minimum LSN for that table to use in an ETL job. However when I run this query
select sys.fn_cdc_get_min_lsn('dbo_Table')
I get a different result to this query
select min(__$start_lsn) from cdc.dbo_Table_CT
Shouldn't these queries return the same values? and if they don't why not? and how to get them back in sync?

The first query:
select sys.fn_cdc_get_min_lsn('dbo_Table')
Invokes a system function that returns the lowest POSSIBLE lsn for the capture instance. This value is set when the cleanup function runs. It is recorded in, and queried from, cdc.change_tables.
The second query:
select min(__$start_lsn) from cdc.dbo_Table_CT
Looks at the actual capture instance and returns the lowest ACTUAL lsn for the instance. This value is set when the first actual change to the instance is logged after the cleanup function runs. It is recorded in, and queried from, cdc.dbo_Table_CT.
They're unlikely to tie out, statistically speaking. For the purposes of an ETL job, using the call to the system table will likely be quicker, and is a more accurate reflection of when the current set of change records started being accumulated.

Related

Reducing database load from consecutive queries

I have an application which calls the database multiple times to achieve one simple goal.
A little information about this application; In short, the application scrapes data from a webpage & stores specific information from this page into a database. The important information in this query is: Player name, Position. There can be multiple sitting at one specific position, kill points & Class
Player name has every potential to change or remain the same every day
Regarding the Position, there can be multiple sitting in one position
Kill points has the potential to increase or remain the same every day
Class, there is only 2 possibilities that a name can be, Ex: A can change to B or remain A (same in reverse), but cannot be C,D,E,F
The player name can change at any particular day, Position can also change dependent on the kill point increase from the last update which spins back around to the goal. This is to search the database day by day, from the current date to as far back as 2021-02-22 starting at the most recent entry for a player name and back track to the previous day to check if that player name is still the same or has changed.
What is being used as a main reference to the change is the kill points. As the days go on, this number will either be the exact same or increase, it can never decrease.
So now onto the implementation of this application.
The first query which runs finds the most recent entry for the player name
SELECT TOP(1) * FROM [changes] WHERE [CharacterName]=#charname AND [Territory]=#territory AND [Archived]=0 ORDER BY [Recorded] DESC
Then continue to check the previous days entries with the following query:
SELECT TOP(1) * FROM [changes] WHERE [Territory]=#territory AND [CharacterName]=#charname AND [Recorded]=#searchdate AND ([Class] LIKE '%{Class}%' OR [Class] LIKE '%{GetOpposite(Class)}%' AND [Archived]=0 )
If no results are found, will then proceed to find an alternative name with the following query:
SELECT TOP(5) * FROM [changes] WHERE [Kills] <= #kills AND [Recorded]='{Data.Recorded.AddDays(-1):yyyy-MM-dd}' AND [Territory]=#territory AND [Mode]=#mode AND ([Class] LIKE #original OR [Class] LIKE #opposite) AND [Archived]=0 ORDER BY [Kills] DESC
The aim of the query above is to get the top 5 entries that are the closest possible matches & Then cross references with the day ahead
SELECT COUNT(*) FROM [changes] WHERE [CharacterName]=#CharacterName AND [Territory]=#Territory AND [Recorded]=#SearchedDate AND [Archived]=0
So with checking the day ahead, if the character name is not found in the day ahead, then this is considered to be the old player name for this specific character, else after searching all 5 of the results and they are all found to be present in the day aheads searches, then this name is considered to be new to the table.
Now with the date this application started to run up to today's date which is over 400 individual queries on the database to achieve one goal.
It is also worth a noting that this table grows by 14,400 - 14,500 Rows each and every day.
The overall question to this specific? Is it possible to bring all these queries into less calls onto the database, reduce queries & improve performance?
What you can do to improve performance will be based on what parts of the application stack you can manipulate. Things to try:
Store Less Data - Database content retrieval speed is largely based on how well the database is ordered/normalized and just how much data needs to be searched for each query. Managing a cache of prior scraped pages and only storing data when there's been a change between the current scrape and the last one would guarantee less redundant requests to the db.
Separate specific classes of data - Separating data into dedicated tables would allow you to query a specific table for a specific character, etc... effectively removing one where clause.
Reduce time between queries - Less incoming concurrent requests means less resource contention and faster response times to prior requests.
Use another data structure - The only reason you're using top() is because you need data ordered in some specific way (most-recent, etc...). If you just used a code data structure that keeps the data ordered and still easily-query-able you could then perhaps offload some sql requests to this structure instead of the db.
The suggestions above are not exhaustive, but what you do to improve performance is largely a function of what in the application stack you have the ability to modify.

Bigquery return nested results without flattening it without using a table

It is possible to return nested results(RECORD type) if noflatten_results flag is specified but it is possible to just view them on screen without writing it to table first.
for example, here is an simple user table(my actual table is big large(400+col with multi-level of nesting)
ID,
name: {first, last}
I want to view record particular user & display in my applicable, so my query is
SELECT * FROM dataset.user WHERE id=423421 limit 1
is it possible to return the result directly?
You should write your output to "temp" table with noflatten_results option (also respective expiration to be set to purge table after it is used) and serve your client out of this temp table. All "on-fly"
Have in mind that no matter how small "temp" table is - if you will be querying it (in above second step) you will be billed for at least 10MB, so you better use Tabledata.list API in this step (https://cloud.google.com/bigquery/docs/reference/v2/tabledata/list) which is free!
So if you try to get repeated records it will fail on the interface/BQ console with the error:
Error: Cannot output multiple independently repeated fields at the same time.
and in order to get past this error is to FLATTEN your output.

Pentaho Job - Execute job based on condition

I have a pentaho job that is scheduled to run every week and gets data from one table and populates in another.
Now the job executes every week irrespective of whether the source table was updated or not.
I want to put a condition before the job runs to see if the source was updated last week or not and run the job only if the source was updated else dont run the job.
There are many ways you can do this. Assuming you have a table in your database that stores the last date your job was run, you could do something like this.
Create a Job and configure a parameter in it (I called mine RunJob). Create a transformation which gets your max run date, or row count then looks up the run date or row count from the previous run and compares them. It then sets the value of your job's variable based on the results of the comparison. Mine looks like this.
Note that the last step in the transform is a Set Variables step from the Job branch.
Then in your job use a Simple Evaluation step to test the variable. Mine looks like this:
Note here that my transform sets the value of the variable only if the job needs to be run, otherwise it will be NULL.
Note also to be sure to update your last run date or row count after doing the table load. That's what the SQL step does at the end of the job.
You could probably handle this with fewer steps if you used some JavaScript in there, but I prefer not to script if I can avoid it.

Trial column in client statictics SQL

I start attach Client Statistics to examinate my queries. There are "Trial x" columns, What do they mean?
I use SQL Server 2014
Trial X column specifies the number of times you execute the query. Trial column creates each time when you execute the query. Maximum of last 10 trial statistics will be shown. Average values of the trial will be shown in Average column.
MSDN says it “Displays information about the query execution grouped
into categories. When Include Client Statistics is selected from the
Query menu, a Client Statistics window is displayed upon query
execution. Statistics from successive query executions are listed
along with the average values. Select Reset Client Statistics from the
Query menu to reset the average.”
Every time you run a query with include client statistics, SQL Server includes a run history ("Trial n"). This allows you to easily compare recent execution statistics. The history is generally useful only if you are running the same basic query repeatedly and are making changes that may affect performance. The up and down arrows indicate progression or digression compared to the previous run.

Replacing Null Entries with a Zero

I've been working on learning SQL starting with basically no knowledge. I've only been at this for about a week.
I'm working with a piece of simulation software. I've been given a task to run many replications of a simulation that will generate random events of either something happens or nothing happens. The sim will create a table of the events and a column for replication numbers. If the event happened then it will get an event number, if it didn't, it will skip that rep and go to the next. I need a query to take the replications that have no event and replace the null entry with a zero.
As an added bonus, I might need to generate the number for the replication where there is a null.
The last bit of code I tried looked like this:
SELECT IsNull(table1.column1, 0)
FROM table1
SELECT COALESCE(table1.column1,0)
FROM table1
Should do the trick. COALESCE returns the first non-null value it finds, so this should work.