SQL - Incremental insert using SSIS? - sql

I have quite a large table that I need to import into my BI environment for reporting. I have an SSIS package that runs every 20 minutes and calls a stored procedure to extract data from the source and populate my table. The earliest date in the source table is 01-January-2012.
What I would like is that the first time the package runs, it imports all the data from the source for January 2012. The next time it runs, it loads all the data for February 2012, and so on.
Below is the query I would use to extract the data. It is based on the created and modified dates:
Select ID, Name, Company, Job, HRID, PayID, CreatedOn, ModifiedOn
from dbo.HRDetails
where CreatedOn between @MonthStart and @MonthEnd
or ModifiedOn between @MonthStart and @MonthEnd
I just need help making this incremental so that it picks up the data month by month dynamically.
Any help would be appreciated
-Jess

In your stored procedure, pull the current max date from the loaded table and set your variables based on that:
DECLARE @DateLoaded date = ISNULL((SELECT MAX(dateField) FROM yourLoadedTable), '20111231') --MAX date loaded; the default is the day before the earliest source date, so the first run loads January 2012
DECLARE @MonthStart date = DATEADD(DAY, 1, EOMONTH(@DateLoaded)) --End of max loaded month, plus 1 day to get first day of next month
DECLARE @MonthEnd date = EOMONTH(@DateLoaded, 1) --End of next month
Select ID, Name, Company, Job, HRID, PayID, CreatedOn, ModifiedOn
from dbo.HRDetails
where CreatedOn between @MonthStart and @MonthEnd
or ModifiedOn between @MonthStart and @MonthEnd
I like this type of approach because it's self-repairing if a pull fails, even if you've missed a few months before noticing the issue.
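One caveat with BETWEEN here: if CreatedOn and ModifiedOn are datetime columns (an assumption; the question does not state their types), EOMONTH returns the last day of the month at midnight, so rows stamped later on that day fall outside the window. Below is a minimal sketch of the same idea with a half-open range. @NextMonthStart is a name introduced for illustration, dateField and yourLoadedTable are the placeholders used in this answer, and the '20111231' default is chosen so that an empty target table starts with January 2012.

```sql
-- Sketch only: assumes SQL Server 2012+ (for EOMONTH) and datetime columns.
DECLARE @DateLoaded date =
    ISNULL((SELECT MAX(dateField) FROM yourLoadedTable), '20111231'); -- nothing loaded yet: start with January 2012
DECLARE @MonthStart date = DATEADD(DAY, 1, EOMONTH(@DateLoaded));      -- first day of the next month to load
DECLARE @NextMonthStart date = DATEADD(MONTH, 1, @MonthStart);         -- exclusive upper bound

SELECT ID, Name, Company, Job, HRID, PayID, CreatedOn, ModifiedOn
FROM dbo.HRDetails
WHERE (CreatedOn  >= @MonthStart AND CreatedOn  < @NextMonthStart)
   OR (ModifiedOn >= @MonthStart AND ModifiedOn < @NextMonthStart);
```

Because the upper bound is exclusive, the same predicate works whether the columns are date, datetime, or datetime2.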

I would do this by creating a metadata table in SQL Server.
Each time the package runs, insert a date/identifier into that table to note that the task has been completed for that month (as your final package step). For the first package step, you would use that table to get the next month that hasn't been completed (you would store this in a variable for use in the later insert). (You would also have a default/start month, which would be used if the table is empty.)
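A rough sketch of that control-table idea in T-SQL follows. The table dbo.LoadControl, its columns, and the variable names are hypothetical, and in SSIS the month would normally be passed between Execute SQL Tasks through a package variable rather than kept in one batch as shown here.

```sql
-- Hypothetical control table: one row per month that has been fully loaded.
CREATE TABLE dbo.LoadControl (
    MonthStart  date         NOT NULL PRIMARY KEY,
    CompletedAt datetime2(0) NOT NULL DEFAULT SYSDATETIME()
);
GO

-- First package step: the next month to load, or the default start month if the table is empty.
DECLARE @MonthStart date =
    ISNULL((SELECT DATEADD(MONTH, 1, MAX(MonthStart)) FROM dbo.LoadControl), '20120101');
DECLARE @NextMonthStart date = DATEADD(MONTH, 1, @MonthStart);

-- Only load months that have already ended.
IF @NextMonthStart <= CAST(SYSDATETIME() AS date)
BEGIN
    -- ... extract rows with dates >= @MonthStart and < @NextMonthStart here ...

    -- Final package step: record that this month is complete.
    INSERT INTO dbo.LoadControl (MonthStart) VALUES (@MonthStart);
END;
```

If a run fails before the final insert, the month is not recorded, so the next run retries the same month.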

Instead of using dates, you can use Change Tracking to identify all keys modified since the last update, even deleted ones. The feature is available in all editions, even Express, starting with SQL Server 2008.
The documentation page Work with Change Tracking (SQL Server) shows how you can retrieve any changes made to a table since the last sync operation.
The following query returns all rows of the Product table modified since the version specified in last_synchronization_version, along with the reason they were modified. Any deleted rows appear with D in the SYS_CHANGE_OPERATION field:
SELECT
CT.ProductID, P.Name, P.ListPrice,
CT.SYS_CHANGE_OPERATION, CT.SYS_CHANGE_COLUMNS,
CT.SYS_CHANGE_CONTEXT
FROM
SalesLT.Product AS P
RIGHT OUTER JOIN
CHANGETABLE(CHANGES SalesLT.Product, @last_synchronization_version) AS CT
ON
P.ProductID = CT.ProductID
The sync version you'll use for the next iteration should be retrieved before selecting changes. You can retrieve it with:
SET @synchronization_version = CHANGE_TRACKING_CURRENT_VERSION();
This query is very fast because it joins on the table's primary keys.
Another nice thing about this is that it doesn't care if you forget to run it - it will still pull all changes since the last execution. Running this more frequently results in better performance since the query has fewer changes to return.
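Applied to the table from the question, one sync cycle might look roughly like the sketch below. YourDatabase and the dbo.SyncState table are placeholders introduced for illustration, and it assumes dbo.HRDetails has a primary key on ID (Change Tracking requires a primary key).

```sql
-- One-time setup (names are placeholders).
ALTER DATABASE YourDatabase
    SET CHANGE_TRACKING = ON (CHANGE_RETENTION = 7 DAYS, AUTO_CLEANUP = ON);
ALTER TABLE dbo.HRDetails ENABLE CHANGE_TRACKING;
GO
CREATE TABLE dbo.SyncState (
    TableName   sysname NOT NULL PRIMARY KEY,
    LastVersion bigint  NOT NULL
);
-- Assumes the initial full load is done at this point.
INSERT INTO dbo.SyncState (TableName, LastVersion)
VALUES (N'dbo.HRDetails', CHANGE_TRACKING_CURRENT_VERSION());
GO

-- Each run:
DECLARE @last_sync bigint =
    (SELECT LastVersion FROM dbo.SyncState WHERE TableName = N'dbo.HRDetails');
DECLARE @this_sync bigint = CHANGE_TRACKING_CURRENT_VERSION(); -- read before selecting changes

IF @last_sync < CHANGE_TRACKING_MIN_VALID_VERSION(OBJECT_ID(N'dbo.HRDetails'))
BEGIN
    -- Older change history has been cleaned up, so an incremental pull is not reliable.
    RAISERROR(N'Change history was cleaned up; reload dbo.HRDetails in full.', 16, 1);
END
ELSE
BEGIN
    SELECT CT.ID, CT.SYS_CHANGE_OPERATION,
           H.Name, H.Company, H.Job, H.HRID, H.PayID, H.CreatedOn, H.ModifiedOn
    FROM CHANGETABLE(CHANGES dbo.HRDetails, @last_sync) AS CT
    LEFT OUTER JOIN dbo.HRDetails AS H ON H.ID = CT.ID; -- deleted rows have NULLs from H

    UPDATE dbo.SyncState SET LastVersion = @this_sync WHERE TableName = N'dbo.HRDetails';
END;
```

Changes committed between reading @this_sync and running the SELECT can show up again on the next run, so the load into the target table should tolerate repeats; running the cycle under snapshot isolation is one way to avoid that.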

Related

creating materialized view for annual report based on slow function

Consider the following scenario:
I have a table, products, with 1 million product IDs:
create table products (
pid number,
p_description varchar2(200)
)
There is also a relatively slow function
function getProductMetrics(pid,date) return number
which returns some metric for the given product at the given date.
There is also an annual report, run every year, that is based on the following query:
select pid,p_description,getProductMetrics(pid,'2019-12-31') from
products
That query takes about 20-40 minutes to execute for a given year.
Would it be a correct approach to create a materialized view (MV) for this scenario using the following:
CREATE TABLE mydates
(
mydate date
);
INSERT INTO mydates (mydate)
VALUES (DATE '2019-12-31');
INSERT INTO mydates (mydate)
VALUES (DATE '2018-12-31');
INSERT INTO mydates (mydate)
VALUES (DATE '2017-12-31');
CREATE MATERIALIZED VIEW metrics_summary
BUILD IMMEDIATE
REFRESH FORCE ON DEMAND
AS
SELECT pid,
getProductMetrics(pid, mydate) AS annual_metric,
mydate
FROM products,mydates
Or would it take forever?
Also, how and how often would I update this MV?
Metrics data is required for the end of each year.
But any year's data could be requested at any time.
Note that I have no control over the slow function; it's just a given.
Thanks.

First, you do not have a "group by" query, so you can remove that.
An MV would be most useful if you needed to recompute all of the data for all years. As this appears to be a summary, with no need to reprocess old data, updated only when certain threshold dates like end of year are passed, I would recommend putting the results in a normal table and only adding the updates as often as your threshold dates occur (annually?) using a stored procedure. Otherwise your MV will take longer to run and require more system resources with every execution that adds a new date.
Do not create a materialized view. This is not just a performance issue. It is also an archiving issue: You don't want to run the risk that historical results could change.
My advice is to create a single table with a "year" column. Run the query once per year and insert the rows into the new table. This is an archive of the results.
Note: If you want to recalculate previous years because the results may have changed (say the data is updated somehow), then you should store those results in a separate table and decide which version is the "right" version. You may find that you want an archive table with both the "as-of" date and the "run-date" to see how results might be changing.
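A minimal Oracle sketch of such an archive table is shown below. The table name, column names, and constraint name are made up for illustration, and it assumes that getProductMetrics accepts a DATE as its second argument (the question passes a string literal, so adjust as needed).

```sql
-- Hypothetical archive table: one row per product per as-of date.
CREATE TABLE product_metrics_archive (
    pid           NUMBER NOT NULL,
    as_of_date    DATE   NOT NULL,
    annual_metric NUMBER,
    run_date      DATE   DEFAULT SYSDATE NOT NULL,
    CONSTRAINT product_metrics_archive_pk PRIMARY KEY (pid, as_of_date)
);

-- Run once per year. The NOT EXISTS filter skips products already archived for
-- that date, so an interrupted run can be restarted without calling the slow
-- function again for rows that are already stored.
INSERT INTO product_metrics_archive (pid, as_of_date, annual_metric)
SELECT p.pid, DATE '2019-12-31', getProductMetrics(p.pid, DATE '2019-12-31')
FROM products p
WHERE NOT EXISTS (
    SELECT 1
    FROM product_metrics_archive a
    WHERE a.pid = p.pid
      AND a.as_of_date = DATE '2019-12-31'
);
COMMIT;
```

The annual report then reads from product_metrics_archive filtered by as_of_date, and the slow function is only paid for once per product per year.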

Automatically add date for each day in SQL

I'm working in BigQuery and have created a view from multiple tables. Each day the data needs to be synced with multiple platforms. I need to add a date or some other field via SQL that lets me identify which rows were added to the view or updated each day, so that I can forward only that data instead of syncing everything every day. The best way I can think of is to somehow set the current date whenever a row is updated, with that date staying constant until the record is updated again.
Ex:
Sample data
Say we get the view T1 on 1st September and T2 on the 2nd. I need to spot only ID:2 for 1st September and ID:3,4,5 on 2nd September. Note: there is no such date column. I need help creating such a column, or any other approach to identify which rows are updated or added daily.

You can create a BigQuery scheduled query with a daily (24-hour) frequency using the INSERT statement below:
INSERT INTO dataset.T1
SELECT
*
FROM
dataset.T2
WHERE
date > (SELECT MAX(date) FROM dataset.T1);

The table your data is streamed into (in your case, the sample data) needs to be configured as a partitioned table. Use "Partition by ingestion time" so that you don't need to handle the date yourself.
Configuration in BQ
After you have recreated the table, append your existing data to the new table using the append option in BQ, then run the job.
Then create a view based on that table with:
SELECT * EXCEPT (rank)
FROM (
SELECT
*,
ROW_NUMBER() OVER (PARTITION BY invoice_id ORDER BY _PARTITIONTIME desc) AS rank
FROM `your_dataset.your_sample_data_table`
)
WHERE rank = 1
From then on, always use that view.
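To pick up only the rows that were added or updated on the current day, the same pattern can expose the ingestion time and filter on it. This is a sketch under assumptions: the table and invoice_id names come from the answer above, while ingested_at and row_rank are aliases introduced here.

```sql
-- Latest version of each invoice, restricted to versions ingested today (UTC).
SELECT * EXCEPT (row_rank)
FROM (
  SELECT
    *,
    _PARTITIONTIME AS ingested_at,  -- pseudocolumns must be selected explicitly with an alias
    ROW_NUMBER() OVER (PARTITION BY invoice_id ORDER BY _PARTITIONTIME DESC) AS row_rank
  FROM `your_dataset.your_sample_data_table`
)
WHERE row_rank = 1
  AND DATE(ingested_at) = CURRENT_DATE()
```

Rows still sitting in the streaming buffer have a NULL _PARTITIONTIME, so very recent streamed rows may not appear until they have been written to a partition.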

Is it possible to have same SYSDATETIME result for 2 transactions

According to documentation, the precision of SYSDATETIME() function in sql server is approximately 100 nanoseconds. I have seen that just like GETDATE(), the SYSDATETIME function also returns the same result within a transaction. Also, the time differs in two batches separated by GO.
Now my real question is, is it safe to assume that two transactions will always have different SYSDATETIME, no matter how close to concurrency they can reach within the same server/database instance, irrespective of the number of cores/hardware, the server has?
Background: I am trying to implement auditing on an existing database using temporal tables. We are already keeping a modified by column in all tables. But we cannot identify who deleted a record using temporal tables. So I was thinking of dumping user id (end user's id) into a table for all transactions. So, if the time matches with the temporal table, I might be able to identify the user based on date-time.

First, note that GETDATE() and SYSDATETIME() return values with different types and precision. SYSDATETIME() returns a datetime2 value such as 2019-06-03 16:11:07.3683245, while GETDATE() returns a datetime value such as 2019-06-03 16:11:07.367. You need to add two columns to the table: who updated the record and at what time. If there is no time-consuming step between updating the temporal table and updating the main table, both may get the same time. But if for any reason the two updates take a while, the times in the two tables can differ.
To store the same datetime in both tables, capture the value once in a variable with DECLARE. It then does not matter if the updates happen at different times:
Declare @date Datetime2 = SYSDATETIME()
Select @date
You can then use @date wherever you set the datetime in your queries.
I hope it will work.

I need to calculate values for a record in a database based off of other values in other records

I need to calculate values for a record in a database based off of other values in other records. Using SqlServer 2012, what would be the best way to do this? I'm thinking some type of script that runs on the server that may be able to query for the values it needs to compute, compute them, and insert them into the record it needs to. I know you can have computed columns based off of other columns in SqlServer, but what about new records based off of different columns in different records?
Thanks!
EDIT:
I'm using a google charts table on an MVC4 Razor website to show items purchased by specific users by month and year; looks something like this:
Email Address | Purchase Value | Year | Month
This currently works absolutely fine. I query the database for purchases by user and group by month and year and sum the purchases, and I put the values in the table. I also have category filters that only show one month and one year, so only one user is shown at a time.
Now management wants an 'All' selection on the category filter, which means that for every month of every year, and every year total, I'm going to have to compute a cumulative purchase value for each user and put it in the table; you can imagine, if the users list gets very long, this could take some time. So, I think the best option would probably be to have a script that groups purchases by year and by user and updates a new record every time a donation is made anytime within that year; obviously, you'd do the same for each month of the year. That way, I wouldn't have to worry about computing this when the user requests the page. I'm just not sure how to go about writing a script for SQLServer that would be able to do something like this.
This shows how to calculate values for a record in a database based off of other values in other records. The example is written in TSQL and can be executed on SQL Server. You will need to change the script to use your tables and columns.
DECLARE @total dec(12,2), @num int --Variable declaration
SET @total = (SELECT SUM(Salary) FROM Employee) --Capture sum of employee salaries
SET @num = (SELECT COUNT(ID) FROM Employee) --Capture the number of employees
SELECT @total 'Total', --calculate values for a record in a database based off of other values in other records
@num 'Number of employees',
@total/@num 'Average'
INTO
dbo.AverageSalary
Hope this helps.
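For the purchases scenario in the question, a set-based query may be enough before resorting to a precomputed table. The sketch below is not the approach from the answer above; it uses window functions, which SQL Server 2012 supports, and assumes a hypothetical dbo.Purchases table with EmailAddress, PurchaseDate, and PurchaseValue columns.

```sql
-- Per user: monthly total, running total within the year, and full-year total.
SELECT
    EmailAddress,
    YEAR(PurchaseDate)  AS PurchaseYear,
    MONTH(PurchaseDate) AS PurchaseMonth,
    SUM(PurchaseValue)  AS MonthTotal,
    SUM(SUM(PurchaseValue)) OVER (
        PARTITION BY EmailAddress, YEAR(PurchaseDate)
        ORDER BY MONTH(PurchaseDate)
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) AS YearToDateTotal,
    SUM(SUM(PurchaseValue)) OVER (
        PARTITION BY EmailAddress, YEAR(PurchaseDate)
    ) AS YearTotal
FROM dbo.Purchases
GROUP BY EmailAddress, YEAR(PurchaseDate), MONTH(PurchaseDate)
ORDER BY EmailAddress, PurchaseYear, PurchaseMonth;
```

If this turns out to be too slow at page-request time, the same SELECT could feed a summary table that a scheduled job refreshes.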

How do I Automatically insert monthly records into a table via SQL?

I'm trying to generate monthly records in one table based on instructions in another table. Software - MS Access 2007, though I'm looking for an SQL solution here. To greatly simplify the matter, let's say the following describes the tables:
TaskManager:
- DayDue
- TaskName
Task:
- DateDue
- TaskName
So what happens is that there may be an entry in TaskManager {15, "Accounts due"}, so this should lead to an "Accounts due" record in the Task table with the due date being the 15th of each month. I'd want it to create records for the last few months and the next year.
What I'm thinking that I need to do is first create a SELECT query that results in x records for each record in the TaskManager table, with a date for each month. After that, I do an INSERT query which inserts records into the Task table if they do not EXIST in the aforementioned SELECT query.
I think I can manage the INSERT query, though I'm having trouble figuring out how to do the SELECT query. Could someone give me a pointer?

You could use a calendar table.
INSERT INTO Task ( DateDue, TaskName )
SELECT calendar.CalDate, TaskManager.TaskName
FROM calendar, TaskManager
WHERE (((Day([CalDate]))=TaskManager.DayDue)
AND ((calendar.CalDate)<#7/1/2013#));
The calendar table would simply contain all dates, plus other relevant fields such as a work-day flag (yes/no). Calendar tables are generally quite useful.

Here is the solution I developed using Remou's Calendar table idea.
First create a Calendar table, which simply contains all dates for a desired range. It's easy to just make the dates in Excel and paste them into the table. This is also a very reliable way of doing it, as Excel handles leap years correctly for the modern range of dates.
After building this table, there are three queries to run. The first is a SELECT, which selects every possible task generated by the TaskManager based on the date and frequency. This query is called TaskManagerQryAllOptions, and has the following code:
SELECT TaskManager.ID, Calendar.CalendarDate
FROM TaskManager INNER JOIN Calendar ON
TaskManager.DateDay = Day(Calendar.CalendarDate)
WHERE (TaskManager.Frequency = "Monthly")
OR (TaskManager.Frequency = "Yearly" AND
TaskManager.DateMonth = Month(Calendar.CalendarDate))
OR (TaskManager.Frequency = "Quarterly" AND
(((Month(Calendar.CalendarDate)- TaskManager.DateMonth) Mod 3) = 0));
The bulk of the above is to cover the different options a quarterly Day and Month pair could cover. The next step is another SELECT query, which selects records from the TaskManagerQryAllOptions in which the date is within the required range. This query is called TaskManagerQrySelect.
SELECT TaskManagerQryAllOptions.ID, TaskManager.TaskName,
TaskManagerQryAllOptions.CalendarDate
FROM TaskManagerQryAllOptions INNER JOIN TaskManager
ON TaskManagerQryAllOptions.ID = TaskManager.ID
WHERE (TaskManagerQryAllOptions.CalendarDate > Date()-60)
AND (TaskManagerQryAllOptions.CalendarDate < Date()+370)
AND (TaskManagerQryAllOptions.CalendarDate >= TaskManager.Start)
AND ((TaskManagerQryAllOptions.CalendarDate <= TaskManager.Finish)
OR (TaskManager.Finish Is Null))
ORDER BY TaskManagerQryAllOptions.CalendarDate;
The final query is an INSERT. As we will be using this query frequently, we don't want it to generate duplicates, so we need to filter out already created records.
INSERT INTO Task ( TaskName, TaskDate )
SELECT TaskManagerQrySelect.TaskName, TaskManagerQrySelect.CalendarDate
FROM TaskManagerQrySelect
WHERE Not Exists(
SELECT *
FROM Task
WHERE Task.TaskName = TaskManagerQrySelect.TaskName
AND Task.TaskDate = TaskManagerQrySelect.CalendarDate);
One limitation of this method is that if the date of repetition (e.g. the 15th of each month) is changed, the future records with the wrong day will remain. A solution to this would be to update all the future records with the adjusted date, then run the insert.

One possibility could be to create a table of Months, and a table of Years (prior year, current, and next one). I could run a SELECT query which takes the Day from the TaskManager table, the Month from the Month table, and the Year from the Year table - I imagine that this could somehow create my desired multiple records for a single TaskManager record. Though I'm not sure what the exact SQL would be.