Quarterly data load in SSIS package

I just started using SSIS and I need quick help loading data quarterly.
Here's my scenario:
I came up with a query (source database: DB2) that extracts data from 2010-01-01 to 2021-12-31 (12 years of data); however, the data volume is too high (around 300 million), so I would like to split the source query and load the data quarter by quarter.
I tried year-wise, but the volume per load is still more than my SSIS server can handle.
I have created a year loop, containing a Script Task followed by a Data Flow Task.
For example,
select * from tab1 where start_date >= '2010-01-01' and end_Date <= '2010-12-31'
I would like to loop this as follows (4 loads, one per quarter):
select * from tab1 where start_date >= '2010-01-01' and end_Date <= '2010-03-31'
select * from tab1 where start_date >= '2010-04-01' and end_Date <= '2010-06-30'
select * from tab1 where start_date >= '2010-07-01' and end_Date <= '2010-09-30'
select * from tab1 where start_date >= '2010-10-01' and end_Date <= '2010-12-31'
Year-wise works perfectly fine; however, I can't figure out how to load the data quarter-wise.
I want to pass each quarter's boundaries to the source query as parameters, so overall I need to loop 48 times (2010 to 2021 = 12 years × 4 quarters).
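For instance, the source query could stay fixed and take the quarter boundaries as parameters (a sketch only; the ? markers assume a source type that supports them, such as an OLE DB source, and would be mapped to two package variables holding the current quarter's first and last day, computed in the Script Task on each iteration):
-- Parameterized source query; the two ? placeholders receive the
-- current quarter's start and end dates on every loop iteration.
select * from tab1 where start_date >= ? and end_Date <= ?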
Any help is greatly appreciated.
I can send screenshots of what I have created for the year loop which is working perfectly fine.

I think the solution is to use the OFFSET FETCH clause to iterate over the data. Why loop over the data quarterly when chunking by a fixed number of rows is more precise (each iteration handles the same amount of data)? A step-by-step guide is provided in the following article:
SQL OFFSET FETCH Feature: Loading Large Volumes of Data Using Limited Resources with SSIS
One thing worth mentioning is that the article handles a SQL Server source, while you are using DB2, so you should take any syntax differences into consideration when using the OFFSET FETCH clause:
Getting top n to n rows from db2
Example: Using the OFFSET clause with a cursor
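For illustration, a sketch of one such chunk on a recent Db2 version (the ORDER BY column and the 100,000-row chunk size are my assumptions; older Db2 releases may need the ROW_NUMBER() approach from the links above):
-- One chunk of the extract; advance the OFFSET by the chunk size
-- on each loop iteration until no more rows come back.
SELECT *
FROM tab1
ORDER BY start_date
OFFSET 100000 ROWS
FETCH FIRST 100000 ROWS ONLY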
Similar questions:
Reading Huge volume of data from Sqlite to SQL Server fails at pre-execute
Loop 10 records at a time and assign it to variable
SSIS failing to save packages and reboots Visual Studio

Query where my table timestamp is <= 1 min from current timestamp

Due to some activity in my project, I want to run a query at some frequency and check for rows where the table timestamp is <= 1 minute from the current timestamp, i.e. a SQL query to check for recently updated data in the table.
Your question is incomplete: you haven't provided your existing table structure or any queries. I'm just giving you a generic solution here, which should work as long as you can adapt it to your specific needs.
So, you are trying to get the difference between two time values in minutes:
Time that record was saved
Current Time
If you have a table LogRecords with the below fields:
LogId
LogMessage
LogTimestamp
then you would write your query to pull the last minute's logs as:
select * from LogRecords
where DATEDIFF(MINUTE, LogTimestamp , GETDATE()) <= 1
I haven't tested this code, but it should be very close even if it doesn't work as-is. Please try it and let me know.
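One caveat worth adding (my note, not part of the original answer): because DATEDIFF is applied to the column, SQL Server cannot use an index on LogTimestamp. An equivalent, index-friendly way to express the same filter:
-- Sargable version: the column is compared against a precomputed
-- boundary, so an index on LogTimestamp can be used.
SELECT *
FROM LogRecords
WHERE LogTimestamp >= DATEADD(MINUTE, -1, GETDATE())
(Strictly speaking the semantics differ slightly: DATEDIFF counts minute-boundary crossings, while this version compares elapsed time.)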

Limit result rows for minimal time intervals for PostgreSQL

Background: I am running TeslaMate/Grafana for monitoring my car status; one of the gauges plots the battery level fetched from the database. My server is located remotely and runs in a Docker container on an old NAS, so both query performance and network overhead matter.
I found the kiosk page frequently hangs, and on investigation it might be caused by the query: two of the plots return 10-100k rows of results from the database. I want to limit the number of rows returned by the SQL queries, as the plots certainly don't have enough precision to draw such detailed intervals.
I tried to follow this answer and use row_number() to keep only every 100th row of the results, but a more complicated issue turned up: the time intervals between rows are not consistent.
The car has 4 statuses: driving / online / asleep / offline.
If the car is in the driving status, the time interval can be less than 200 ms, as the car pushes the status whenever it has new data.
If the car is in the online status, the time interval can be several minutes, as the system actively fetches the status from the car.
Even worse, if the system thinks the car is going to sleep and needs to stop fetching the status (to avoid preventing the car from sleeping), the interval can be up to 40 minutes, depending on settings.
If the car is in the asleep/offline status, no data is recorded at all.
This obviously makes skipping every n-th row a bad idea: for cases 2-4 above, lots of data points might be missing, so Grafana cannot plot a correct graph representing the battery level at satisfactory precision.
I wonder if it's possible to skip rows by the time interval from a datetime field rather than by row_number(), without much overhead in the query; i.e., fetch every row that is at least 1000 ms after the previously returned row.
E.g., given the following data in the table, I want rows 1, 4 and 5 returned.
row date
[1] 1610000001000
[2] 1610000001100
[3] 1610000001200
[4] 1610000002000
[5] 1610000005000
The current (problematic) method I am using is as follows:
SELECT $__time(t.date), t.battery_level AS "SOC [%]"
FROM (
  SELECT date, battery_level, row_number() OVER (ORDER BY date ASC) AS row
  FROM (
    SELECT battery_level, date
    FROM positions
    WHERE car_id = $car_id AND $__timeFilter(date)
    UNION ALL
    SELECT battery_level, date
    FROM charges c
    JOIN charging_processes p ON p.id = c.charging_process_id
    WHERE $__timeFilter(date) AND p.car_id = $car_id) AS data
  ORDER BY date ASC) AS t
WHERE t.row % 100 = 0;
This method clearly has the problem of returning every n-th row rather than what I want (e.g., if the last line read t.row % 2 = 0, it would simply return alternate rows).
PS: please ignore the table structures and the UNION in the sample code; I haven't dug deep enough into the tables, which might allow other tweaks, but that's irrelevant to this question anyway.
Thanks in advance!
You can use a recursive CTE:
WITH RECURSIVE rec(cur_row, cur_date) AS (
  (
    SELECT row, date
    FROM t
    ORDER BY date
    LIMIT 1
  )
  UNION ALL
  (
    SELECT row, date
    FROM t
    JOIN rec ON t.date >= cur_date + 1000
    ORDER BY t.date
    LIMIT 1
  )
)
SELECT *
FROM rec;
cur_row  cur_date
1        1610000001000
4        1610000002000
5        1610000005000
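One assumption worth spelling out (my note): each recursive step searches for the next row with date >= cur_date + 1000, so an index on the date column keeps every iteration an index seek rather than a scan (table and column names as in the sketch above):
-- Without this, each recursive step scans the table.
CREATE INDEX t_date_idx ON t (date);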
Using a function instead would probably be faster:
CREATE OR REPLACE FUNCTION f() RETURNS SETOF t AS
$$
DECLARE
  row t%ROWTYPE;
  cur_date BIGINT;
BEGIN
  FOR row IN
    SELECT *
    FROM t
    ORDER BY date
  LOOP
    IF row.date >= cur_date + 1000 OR cur_date IS NULL
    THEN
      cur_date := row.date;
      RETURN NEXT row;
    END IF;
  END LOOP;
END;
$$ LANGUAGE plpgsql;
SELECT *
FROM f();
row  date
1    1610000001000
4    1610000002000
5    1610000005000

Use DataStudio to specify the date range for a custom query in BigQuery, where the date range influences operators in the query

I currently have a DataStudio dashboard connected to a BigQuery custom query.
That BQ query has a hardcoded date range and the status of one of the columns (New_or_Relicensed) can change dynamically for a row, based on the dates specified in the range. I would like to be able to alter that range from DataStudio.
I have tried:
simply connecting the DS dashboard to the custom query in BQ and then introducing a date range filter, but as you can imagine, that does not work because it operates on an already hard-coded date range;
reviewing similar answers, but their problem doesn't appear to be quite the same, e.g. BigQuery Data Studio Custom Query.
Here is the query I have in BQ:
SELECT t0.New_Or_Relicensed, t0.Title_Category
FROM (
  WITH report_range AS
  (
    SELECT
      TIMESTAMP '2019-06-24 00:00:00' AS start_date,
      TIMESTAMP '2019-06-30 00:00:00' AS end_date
  )
  SELECT
    schedules.schedule_entry_id AS Schedule_Entry_ID,
    schedules.schedule_entry_starts_at AS Put_Up,
    schedules.schedule_entry_ends_at AS Take_Down,
    schedule_entries_metadata.contract AS Schedule_Entry_Contract,
    schedules.platform_id AS Platform_ID,
    platforms.platform_name AS Platform_Name,
    titles_metadata.title_id AS Title_ID,
    titles_metadata.name AS Title_Name,
    titles_metadata.category AS Title_Category,
    IF (other_schedules.schedule_entry_id IS NULL, "new", "relicensed") AS New_Or_Relicensed
  FROM report_range, client.schedule_entries AS schedules
  JOIN client.schedule_entries_metadata
    ON schedule_entries_metadata.schedule_entry_id = schedules.schedule_entry_id
  JOIN client.platforms
    ON schedules.platform_id = platforms.platform_id
  JOIN client.titles_metadata
    ON schedules.title_id = titles_metadata.title_id
  LEFT OUTER JOIN client.schedule_entries AS other_schedules
    ON schedules.platform_id = other_schedules.platform_id
   AND other_schedules.schedule_entry_ends_at < report_range.start_date
   AND schedules.title_id = other_schedules.title_id
  WHERE
    ((schedules.schedule_entry_starts_at >= report_range.start_date AND
      schedules.schedule_entry_starts_at <= report_range.end_date) OR
     (schedules.schedule_entry_ends_at >= report_range.start_date AND
      schedules.schedule_entry_ends_at <= report_range.end_date))
) AS t0
LIMIT 100;
Essentially - I would like to be able to set the start_date and end_date from google data studio, and have those dates incorporated into the report_range that then influences the operations in the rest of the query (that assign a schedule entry as new or relicensed).
Have you looked at using the Custom Query interface of the BigQuery connector in Data Studio to define start_date and end_date as parameters as part of a filter?
Your query would need a little re-work...
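For instance (a sketch only, not tested against your schema), the hardcoded report_range CTE could take its bounds from the Data Studio parameters, which arrive as strings in YYYYMMDD format:
report_range AS
(
  SELECT
    TIMESTAMP(PARSE_DATE('%Y%m%d', @DS_START_DATE)) AS start_date,
    TIMESTAMP(PARSE_DATE('%Y%m%d', @DS_END_DATE)) AS end_date
)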
The following example custom query uses the @DS_START_DATE and @DS_END_DATE parameters as part of a filter on the creation date column of a table. The records produced by the query will be limited to the date range selected by the report user, reducing the number of records returned and resulting in a faster query:
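A minimal sketch of such a query; the table my_dataset.my_table and its DATE column creation_date are hypothetical:
-- Rows outside the report's selected date range are filtered out
-- before they ever leave BigQuery.
SELECT *
FROM `my_dataset.my_table`
WHERE creation_date BETWEEN PARSE_DATE('%Y%m%d', @DS_START_DATE)
                        AND PARSE_DATE('%Y%m%d', @DS_END_DATE)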
Resources:
Introducing BigQuery parameters in Data Studio
https://www.blog.google/products/marketingplatform/analytics/introducing-bigquery-parameters-data-studio/
Running parameterized queries
https://cloud.google.com/bigquery/docs/parameterized-queries
I had a similar issue where I wanted to incorporate a 30-day look-back before the start (@DS_START_DATE). In this case I was using Google Analytics UA session data and the table suffix in my WHERE clause. I was able to calculate a date RELATIVE to the built-in Data Studio "string" dates by using the following:
...
WHERE
  _table_suffix BETWEEN
    CAST(FORMAT_DATE('%Y%m%d', DATE_SUB(PARSE_DATE('%Y%m%d', @DS_START_DATE), INTERVAL 30 DAY)) AS STRING)
  AND
    CAST(FORMAT_DATE('%Y%m%d', DATE_SUB(PARSE_DATE('%Y%m%d', @DS_END_DATE), INTERVAL 0 DAY)) AS STRING)

Historical Backlog Using Oracle SQL

I'm trying to develop a query that will show the historical backlog, month by month, for applications my organization processes.
In the table I'm querying, each row represents an application.
This is fairly simple for a single month, taking into account when the application was created and when it was closed; I can use the code below.
SELECT COUNT(*)
FROM APPLICATIONS
WHERE TRUNC(CRTE_DTE, 'mm') <= TO_DATE('09/01/2016', 'mm/dd/yyyy')
  AND (TRUNC(CLOS_DTE, 'mm') > TO_DATE('09/01/2016', 'mm/dd/yyyy') OR CLOS_DTE IS NULL);
What I would like to do is create something like this but grouped by month, i.e. with output that looks similar to:
January 250
February 350
March 290
etc.
For my purposes, I'm defining backlog as any application that hadn't yet been closed by the time that month had ended while excluding any applications opened after that month had ended.
I don't have access to run PL/SQL, but if that's a requirement please let me know.
First time posting here, so if I didn't explain something correctly, let me know!
Try something like this:
SELECT d.dt, COUNT(*)
FROM APPLICATIONS a,
     (SELECT ADD_MONTHS(:DATE1, ROWNUM - 1) DT
        FROM DUAL
     CONNECT BY ROWNUM <= MONTHS_BETWEEN(:DATE2, :DATE1) + 1) d
WHERE TRUNC(a.CRTE_DTE, 'mm') <= d.dt
  AND (TRUNC(a.CLOS_DTE, 'mm') > d.dt OR CLOS_DTE IS NULL)
GROUP BY D.DT
:date1 is the start date (for example 08/01/2016) and :date2 is the end date (for example 10/01/2016); the subquery generates one row for each month between :date1 and :date2, and we then count the backlog for each of them.
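To get output shaped like the question's example (January 250, February 350, ...), the month can be formatted in the select list; a sketch based on the query above:
-- FMMonth prints the month name without trailing padding.
SELECT TO_CHAR(d.dt, 'FMMonth') AS month, COUNT(*) AS backlog
FROM APPLICATIONS a,
     (SELECT ADD_MONTHS(:DATE1, ROWNUM - 1) dt
        FROM DUAL
     CONNECT BY ROWNUM <= MONTHS_BETWEEN(:DATE2, :DATE1) + 1) d
WHERE TRUNC(a.CRTE_DTE, 'mm') <= d.dt
  AND (TRUNC(a.CLOS_DTE, 'mm') > d.dt OR a.CLOS_DTE IS NULL)
GROUP BY d.dt
ORDER BY d.dt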

Calculating working days including holidays between dates without a calendar table in oracle SQL

Okay, so I've done quite a lot of reading on the possibility of emulating the NETWORKDAYS function of Excel in SQL, and have come to the conclusion that by far the easiest solution is to have a calendar table which flags working and non-working days. However, due to circumstances out of my control, we don't have access to such a luxury, and it's unlikely that we will any time in the near future.
Currently I have managed to bodge together what is undoubtedly a horribly inefficient query in SQL that does work; the catch is, it will only work for a single client record at a time.
SELECT O_ASSESSMENTS.ASM_ID,
O_ASSESSMENTS.ASM_START_DATE,
O_ASSESSMENTS.ASM_END_DATE,
sum(CASE
When TO_CHAR(O_ASSESSMENTS.ASM_START_DATE + rownum -1,'Day')
= 'Sunday ' THEN 0
When TO_CHAR(O_ASSESSMENTS.ASM_START_DATE + rownum -1,'Day')
= 'Saturday ' THEN 0
WHEN O_ASSESSMENTS.ASM_START_DATE + rownum - 1
IN ('03-01-2000','21-04-2000','24-04-2000','01-05-2000','29-05-2000','28-08-2000','25-12-2000','26-12-2000','01-01-2001','13-04-2001','16-04-2001','07-05-2001','28-05-2001','27-08-2001','25-12-2001','26-12-2001','01-01-2002','29-03-2002','01-04-2002','06-04-2002','03-06-2002','04-06-2002','26-08-2002','25-12-2002','26-12-2002','01-01-2003','18-04-2003','21-04-2003','05-05-2003','26-05-2003','25-08-2003','25-12-2003','26-12-2003','01-01-2004','09-04-2004','12-04-2004','03-05-2004','31-05-2004','30-08-2004','25-12-2004','26-12-2004','27-12-2004','28-12-2004','01-01-2005','03-01-2005','25-03-2005','28-03-2005','02-05-2005','30-05-2005','29-08-2005','27-12-2005','28-12-2005','02-01-2006','14-04-2006','17-04-2006','01-05-2006','29-05-2006','28-08-2006','25-12-2006','26-12-2006','02-01-2007','06-04-2007','09-04-2007','07-05-2007','28-05-2007','27-08-2007','25-12-2007','26-12-2007','01-01-2008','21-03-2008','24-03-2008','05-05-2008','26-05-2008','25-08-2008','25-12-2008','26-12-2008','01-01-2009','10-04-2009','13-04-2009','04-05-2009','25-05-2009','31-08-2009','25-12-2009','28-12-2009','01-01-2010','02-04-2010','05-04-2010','03-05-2010','31-05-2010','30-08-2010','24-12-2010','27-12-2010','28-12-2010','31-12-2010','03-01-2011','22-04-2011','25-04-2011','29-04-2011','02-05-2011','30-05-2011','29-08-2011','26-12-2011','27-12-2011')
THEN 0
ELSE 1
END)-1 AS Week_Day
From O_ASSESSMENTS,
ALL_OBJECTS
WHERE O_ASSESSMENTS.ASM_QSA_ID IN ('TYPE1')
AND O_ASSESSMENTS.ASM_END_DATE >= '01/01/2012'
AND O_ASSESSMENTS.ASM_ID = 'A00000'
AND ROWNUM <= O_ASSESSMENTS.ASM_END_DATE-O_ASSESSMENTS.ASM_START_DATE+1
GROUP BY
O_ASSESSMENTS.ASM_ID,
O_ASSESSMENTS.ASM_START_DATE,
O_ASSESSMENTS.ASM_END_DATE
Basically, I'm wondering whether a) I should stop wasting my time on this, or b) it's possible to get this to work for multiple clients? Any pointers appreciated, thanks!
Edit: Further clarification: I already work out timescales using Excel, but it would be ideal if we could do it in the report, as the report in question is something that we would like end users to be able to run without any further manipulation.
Edit:
MarkBannister's answer works perfectly, albeit slowly (though I had expected as much, given it's not the preferred solution); the challenge now lies in integrating this into an existing report!
with
calendar_cte as (select
to_date('01-01-2000')+level-1 calendar_date,
case when to_char(to_date('01-01-2000')+level-1, 'day') in ('sunday ','saturday ') then 0 when to_date('01-01-2000')+level-1 in ('03-01-2000','21-04-2000','24-04-2000','01-05-2000','29-05-2000','28-08-2000','25-12-2000','26-12-2000','01-01-2001','13-04-2001','16-04-2001','07-05-2001','28-05-2001','27-08-2001','25-12-2001','26-12-2001','01-01-2002','29-03-2002','01-04-2002','06-04-2002','03-06-2002','04-06-2002','26-08-2002','25-12-2002','26-12-2002','01-01-2003','18-04-2003','21-04-2003','05-05-2003','26-05-2003','25-08-2003','25-12-2003','26-12-2003','01-01-2004','09-04-2004','12-04-2004','03-05-2004','31-05-2004','30-08-2004','25-12-2004','26-12-2004','27-12-2004','28-12-2004','01-01-2005','03-01-2005','25-03-2005','28-03-2005','02-05-2005','30-05-2005','29-08-2005','27-12-2005','28-12-2005','02-01-2006','14-04-2006','17-04-2006','01-05-2006','29-05-2006','28-08-2006','25-12-2006','26-12-2006','02-01-2007','06-04-2007','09-04-2007','07-05-2007','28-05-2007','27-08-2007','25-12-2007','26-12-2007','01-01-2008','21-03-2008','24-03-2008','05-05-2008','26-05-2008','25-08-2008','25-12-2008','26-12-2008','01-01-2009','10-04-2009','13-04-2009','04-05-2009','25-05-2009','31-08-2009','25-12-2009','28-12-2009','01-01-2010','02-04-2010','05-04-2010','03-05-2010','31-05-2010','30-08-2010','24-12-2010','27-12-2010','28-12-2010','31-12-2010','03-01-2011','22-04-2011','25-04-2011','29-04-2011','02-05-2011','30-05-2011','29-08-2011','26-12-2011','27-12-2011','01-01-2012','02-01-2012') then 0 else 1 end working_day
from dual
connect by level <= 1825 + sysdate - to_date('01-01-2000') )
SELECT
a.ASM_ID,
a.ASM_START_DATE,
a.ASM_END_DATE,
sum(c.working_day)-1 AS Week_Day
From
O_ASSESSMENTS a
join calendar_cte c
on c.calendar_date between a.ASM_START_DATE and a.ASM_END_DATE
WHERE a.ASM_QSA_ID IN ('TYPE1')
and a.ASM_END_DATE >= '01/01/2012'
GROUP BY
a.ASM_ID,
a.ASM_START_DATE,
a.ASM_END_DATE
There are a few ways to do this. Perhaps the simplest might be to create a CTE that produces a virtual calendar table, based on Oracle's connect by syntax, and then join it to the Assessments table, like so:
with calendar_cte as (
select to_date('01-01-2000')+level-1 calendar_date,
case when to_char(to_date('01-01-2000')+level-1, 'Day')
in ('Sunday ','Saturday ') then 0
when to_date('01-01-2000')+level-1
in ('03-01-2000','21-04-2000','24-04-2000','01-05-2000','29-05-2000','28-08-2000','25-12-2000','26-12-2000','01-01-2001','13-04-2001','16-04-2001','07-05-2001','28-05-2001','27-08-2001','25-12-2001','26-12-2001','01-01-2002','29-03-2002','01-04-2002','06-04-2002','03-06-2002','04-06-2002','26-08-2002','25-12-2002','26-12-2002','01-01-2003','18-04-2003','21-04-2003','05-05-2003','26-05-2003','25-08-2003','25-12-2003','26-12-2003','01-01-2004','09-04-2004','12-04-2004','03-05-2004','31-05-2004','30-08-2004','25-12-2004','26-12-2004','27-12-2004','28-12-2004','01-01-2005','03-01-2005','25-03-2005','28-03-2005','02-05-2005','30-05-2005','29-08-2005','27-12-2005','28-12-2005','02-01-2006','14-04-2006','17-04-2006','01-05-2006','29-05-2006','28-08-2006','25-12-2006','26-12-2006','02-01-2007','06-04-2007','09-04-2007','07-05-2007','28-05-2007','27-08-2007','25-12-2007','26-12-2007','01-01-2008','21-03-2008','24-03-2008','05-05-2008','26-05-2008','25-08-2008','25-12-2008','26-12-2008','01-01-2009','10-04-2009','13-04-2009','04-05-2009','25-05-2009','31-08-2009','25-12-2009','28-12-2009','01-01-2010','02-04-2010','05-04-2010','03-05-2010','31-05-2010','30-08-2010','24-12-2010','27-12-2010','28-12-2010','31-12-2010','03-01-2011','22-04-2011','25-04-2011','29-04-2011','02-05-2011','30-05-2011','29-08-2011','26-12-2011','27-12-2011')
then 0
else 1
end working_day
from dual
connect by level <= 36525 + sysdate - to_date('01-01-2000') )
SELECT a.ASM_ID,
a.ASM_START_DATE,
a.ASM_END_DATE,
sum(c.working_day) AS Week_Day
From O_ASSESSMENTS a
join calendar_cte c
on c.calendar_date between a.ASM_START_DATE and a.ASM_END_DATE
WHERE a.ASM_QSA_ID IN ('TYPE1') and
a.ASM_END_DATE >= '01/01/2012' -- and a.ASM_ID = 'A00000'
GROUP BY
a.ASM_ID,
a.ASM_START_DATE,
a.ASM_END_DATE
This will produce a virtual table populated with dates from 01 January 2000 to roughly 100 years (36,525 days) after the current date, with all weekends marked as non-working days and all days specified in the second IN clause (i.e. up to 27 December 2011) also marked as non-working days.
The drawback of this method (or any method where the holiday dates are hardcoded into the query) is that each time new holiday dates are defined, every single query that uses this approach will have to have those dates added.
If you can't use a calendar table in Oracle, you might be better off exporting to Excel. Brute force always works.
Networkdays() "returns the number of whole working days between start_date and end_date. Working days exclude weekends and any dates identified in holidays."
Excluding weekends seems fairly straightforward. Every 7-day period will contain two weekend days. You'll just need to take some care with the leftover days.
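In fact the weekend part can be reduced to pure date arithmetic, with no row generation at all. A sketch (my own derivation, worth testing on edge cases) that counts weekdays between :start_date and :end_date inclusive; TRUNC(d, 'IW') gives the Monday of d's ISO week, so the result does not depend on NLS settings, and both bind variables are assumed to carry no time part:
-- Full weeks between the two Mondays contribute 5 weekdays each;
-- the two LEAST terms adjust for the partial weeks at either end.
SELECT (TRUNC(:end_date, 'IW') - TRUNC(:start_date, 'IW')) / 7 * 5
       + LEAST(:end_date - TRUNC(:end_date, 'IW'), 4)
       - LEAST(:start_date - TRUNC(:start_date, 'IW'), 5)
       + 1 AS weekdays
FROM DUAL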
Holidays are a different story. You have to either store them or pass them as an argument. If you could store them, you'd store them in a calendar table, and your problem would be over. But you can't do that.
So you're looking at passing them as an argument. Off the top of my head--and I haven't had any tea yet this morning--I'd consider a common table expression or a wrapper for a stored procedure.