Is SSIS suitable for my problem (database replication for query) - sql

I have a challenge that I am trying to solve and I can't work out from the documentation or the examples if SSIS is suitable for my problem.
I have 2 tables (jobs and tasks). Jobs represent a large piece of work, while tasks are tied to jobs. There will typically be anything from 1 task per job to 1,000,000 tasks per job. Each task has a column storing the job_id. The job_id in the jobs table is the primary key.
Every N hours, I want to do the following:
1. Take all of the job rows for jobs that have completed since I last ran (based on having an end_time value that falls between the last run and now) and add these to the jobs table in the 'query' database.
2. Copy all of the tasks whose job_id belongs to a job included in step 1 into the tasks table in the 'query' database.
Basically, I want to be able to regularly update my query database, but I only want to include completed jobs (hence the requirement of an end_time) and tasks from those completed jobs.
This is likely to be done 2 - 3 times per day so that users are able to query an almost-up-to-date copy of the live data.
Is SSIS suitable for this task, and if so, can you please point me to some documentation showing how a column from the results of one step is used as the criteria for a second step?
Thanks in advance...

Sure, SSIS can do that.
If you want to be sure that the child records are moved, then use a query as your data flow source for the second data flow. You insert the records into the main table in the first data flow, then use a source query that picks any records in the source child table that are not in the destination child table and that have records in the parent destination table. This way you also catch any changes to existing closed records (you know there will be some; someone will close a job too soon, then reopen it and add something to it).
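As a rough sketch of such a source query (the source_db/query_db names and the task_id key are assumptions for illustration, not part of the question):
-- Sketch only: query_db.dbo.* are the assumed destination ('query') tables.
select t.*
from source_db.dbo.tasks as t
where not exists (select 1
                  from query_db.dbo.tasks as dt
                  where dt.task_id = t.task_id)    -- not already copied (task_id is assumed)
  and exists (select 1
              from query_db.dbo.jobs as dj
              where dj.job_id = t.job_id);         -- parent job already moved to the query database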
Alternatively, you can add the records you are moving to a staging table, then join to that table in the data flow for the child tables. This ensures that the child tables are populated for exactly the records you moved.
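A minimal sketch of that variant, assuming a hypothetical jobs_staging table that holds the job_ids moved in the first data flow:
-- Sketch only: jobs_staging is a made-up staging table of the moved job_ids.
select t.*
from source_db.dbo.tasks as t
inner join source_db.dbo.jobs_staging as s
  on s.job_id = t.job_id;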
Or, if you are loading a denormalized data warehouse, just write a query that joins the parent and child tables together with a WHERE clause requiring that the end date is not null. Of course, don't forget to restrict it to records that aren't already in the data warehouse.

Related

Trigger scheduled query

I have a partitioned table (BigQuery), and records are streamed for each date multiple times over a period of a few days, e.g. records for 02.06.2022 are streamed on 03.06, 04.06, 05.06, etc.
Is there a way that these updates can trigger a scheduled query and insert some of the new records (based on certain criteria) to another table? Maybe using the last modified field from the table's details somehow?
The reason: we can't wait until all updates are done before using the data that we already have but at the same time we need to have all the data that is available after the first run.
Thank you in advance!
What I did was use TIMESTAMP_MILLIS(last_modified_time) in the WHERE clause of the scheduled query. The WHERE clause checks the last_modified_time of the list of tables I needed; if it is after the previous run, the query inserts the new records.
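As a rough sketch of that approach (the project, dataset, table, and record_date names are made up; last_modified_time comes from the dataset's __TABLES__ metadata view, and @run_time is the scheduled query's run parameter):
-- Sketch only: names are hypothetical, and the date filter stands in for the "certain criteria".
insert into `my_project.my_dataset.target_table`
select s.*
from `my_project.my_dataset.source_table` as s
where s.record_date >= date_sub(date(@run_time), interval 3 day)
  and exists (
    select 1
    from `my_project.my_dataset.__TABLES__` as t
    where t.table_id = 'source_table'
      and timestamp_millis(t.last_modified_time) > timestamp_sub(@run_time, interval 6 hour)
  );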

Oracle: Data consistency across multiple tables to be displayed

I have 3 reports based on 3 different tables, which ideally should match each other in an audit.
They are updated sequentially once a day.
The problem is that when one of the tables has been updated and the second one is still in progress, the customer sees a data discrepancy between the reports for some time.
We tried the solution where we commit only after all 3 tables are updated, but we started having issues with the UNDO tablespace; the application has many other things running as well.
I am looking for a solution where we can restrict the user to data as of a specific point, so that he sees updated data only after all 3 tables are refreshed/updated.
I think you can use SELECT ... FOR UPDATE on all 3 tables before starting the update procedure.
In that case users can still select data, and they will see only the unchanged data until the update session finishes and commits.
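A minimal sketch of that idea, assuming hypothetical names report_table_1..3 for the three report tables (run in the session that performs the refresh):
-- Sketch only: lock the report tables' rows before the sequential updates begin.
select * from report_table_1 for update;
select * from report_table_2 for update;
select * from report_table_3 for update;
-- ...run the three updates here...
commit;  -- readers see the new data for all three reports only after this commit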
You can use a flashback query to show data as-of a point in time:
select * from table1 as of timestamp timestamp '2021-12-10 12:00:00';
The application would need to determine the latest time when the tables were synchronized - perhaps with a log table that records when the update process last started. The flashback query also uses the UNDO tablespace, but it should at least use less UNDO, since some of the committed transactions will now free up some space.
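For illustration, a hypothetical refresh_log table (its name and columns are made up) could drive the as-of timestamp:
-- Sketch only: look up when the full 3-table refresh last completed...
select max(refresh_end_time) from refresh_log where status = 'COMPLETE';
-- ...then run each report as of that point in time.
select * from table1 as of timestamp to_timestamp('2021-12-10 12:00:00', 'YYYY-MM-DD HH24:MI:SS');
select * from table2 as of timestamp to_timestamp('2021-12-10 12:00:00', 'YYYY-MM-DD HH24:MI:SS');
select * from table3 as of timestamp to_timestamp('2021-12-10 12:00:00', 'YYYY-MM-DD HH24:MI:SS');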

Populating fact table with different sequence time

I am using the following query to populate my fact table:
Select sh.isbn_l, sh.id_c, sh.id_s, sh.data, sh.quantity, b.price
from Book as b
inner join Sales as sh
on b.isbn = sh.isbn_l
The main thing is that I want to load the table from a specific time to a specific time. So if I load today, I should get all the records from the last time I loaded up until now.
And if I load it the day after tomorrow, I should get the data from after today's load time up until the day after tomorrow.
What I mean is: no duplicated rows or data. What should I do?
Any ideas, please?
Thank you in advance
Streams (and maybe Tasks) are your friend here.
A Snowflake Stream records the delta of change data capture (CDC) information for a table (such as a staging table), including inserts and other DML changes. A stream allows querying and consuming a set of changes to a table, at the row level, between two transactional points of time.
In a continuous data pipeline, table streams record when staging tables and any downstream tables are populated with data from business applications using continuous data loading and are ready for further processing using SQL statements.
Snowflake Tasks may optionally use table streams to provide a convenient way to continuously process new or changed data. A task can transform new or changed rows that a stream surfaces. Each time a task is scheduled to run, it can verify whether a stream contains change data for a table (using SYSTEM$STREAM_HAS_DATA) and either consume the change data or skip the current run if no change data exists.
Users can define a simple tree-like structure of tasks that executes consecutive SQL statements to process data and move it to various destination tables.
https://docs.snowflake.com/en/user-guide/data-pipelines-intro.html
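A minimal sketch of a stream plus task over the asker's Sales/Book tables (fact_sales, sales_stream, load_fact_sales, and the my_wh warehouse are made-up names):
-- Sketch only: capture new Sales rows and load them into the fact table on a schedule.
create or replace stream sales_stream on table Sales;

create or replace task load_fact_sales
  warehouse = my_wh
  schedule = '60 minute'
  when system$stream_has_data('SALES_STREAM')
as
  insert into fact_sales (isbn_l, id_c, id_s, data, quantity, price)
  select s.isbn_l, s.id_c, s.id_s, s.data, s.quantity, b.price
  from sales_stream as s
  inner join Book as b
    on b.isbn = s.isbn_l
  where s.metadata$action = 'INSERT';   -- consume only newly inserted rows

alter task load_fact_sales resume;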

BigQuery update multi tables

I'm holding huge transaction data in daily multi tables according to the business date:
transaction_20140101
transaction_20140102
transaction_20140103 ...
The process flow is like this:
1. I'm loading the batch of new files that arrive into a temp table.
2. I group by the transaction_date field in order to work out which date each row belongs to; for each date I query the temp table for that date and insert the rows into the proper transaction_YYYYMMDD table.
3. I'm doing part 2 in parallel in order to save time, because the temp table might contain data belonging to 20 days.
My challenge is what to do if one of these processes fails and the others don't.
I can't run it all again, since that would cause duplications in the tables that were already successfully updated.
I solve this issue by managing the update myself, but it seems too complex.
Is this best practice for dealing with multi tables?
I would be glad to get some best practices in order to understand how others deal with loading data into multi tables according to business date and not just insert date (that part is easy).
You could add an extra step in the middle, where instead of moving directly from today's temp table into the permanent business-date tables, you extract into temporary daily tables and then copy the data over to the permanent tables.
1. Query from today's temp table, sharded by day into tmp_transaction_YYMMDD. Use the WRITE_EMPTY or WRITE_TRUNCATE write disposition so that this step is idempotent.
2. Verify that all expected tmp_transaction_YYMMDD tables exist. If not, debug the failures and go back to step 1.
3. Run parallel copy jobs from each tmp_transaction_YYMMDD table to append to the corresponding permanent transaction_YYMMDD table.
4. Verify the copy jobs succeeded. If not, retry the individual failures from step 3.
5. Delete the tmp_transaction_YYMMDD tables.
The advantage of this is that you can catch query errors before affecting any of the end destination tables, then copy over all the added data at once. You may still have the same issue if the copy jobs fail, but they should be easier to debug and retry individually.
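As a rough sketch of steps 1 and 3 for a single business date (the dataset and table names are made up, and the answer uses copy jobs for step 3, so this plain INSERT is only an approximation):
-- Step 1 (idempotent, like WRITE_TRUNCATE): rebuild the daily shard from the temp table.
create or replace table my_dataset.tmp_transaction_20140101 as
select *
from my_dataset.temp_load_table
where date(transaction_date) = '2014-01-01';

-- Step 3: append the verified shard to the permanent business-date table.
insert into my_dataset.transaction_20140101
select * from my_dataset.tmp_transaction_20140101;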
Our incentive for incremental load is cost, and therefore we are interested in "touching each record only once".
We use table decorators to identify the increment. We manage the increment timestamps independently and add them to the query at run time. It requires some logic to maintain, but nothing too complicated.

Query Performance help

I have a long-running job. The records to be processed are in a table with around 100K records.
During the whole job, whenever this table is queried, the query runs against all 100K records.
After processing, the status of every record is updated in the same table.
I want to know whether it would be better to add another table where I can update record statuses and keep deleting whatever records have been processed, so that as the job goes forward the number of records in the master table decreases, improving query performance.
EDIT: The master table is basically used for this load only. I receive a flat file, which I upload as-is before processing. After doing validations on this table, I pick one record at a time and move the data to the appropriate system tables.
I had a similar performance problem where a table generally has a few million rows but I only need to process what has changed since the start of my last execution. In my target table I have an IDENTITY column so when my batch process begins, I get the highest IDENTITY value from the set I select where the IDs are greater than my previous batch execution. Then upon successful completion of the batch job, I add a record to a separate table indicating this highest IDENTITY value which was successfully processed and use this as the start input for the next batch invocation. (I'll also add that my bookmark table is general purpose so I have multiple different jobs using it each with unique job names.)
If you are experiencing locking issues because your processing time per record takes a long time you can use the approach I used above, but break your sets into 1,000 rows (or whatever row chunk size your system can process in a timely fashion) so you're only locking smaller sets at any given time.
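A minimal sketch of that bookmark pattern in T-SQL (job_bookmark, master_load_table, and the column names are all hypothetical):
-- Sketch only: process just the slice added since the last successful run.
declare @last_id bigint, @max_id bigint;

select @last_id = coalesce(max(last_processed_id), 0)
from dbo.job_bookmark
where job_name = 'flat_file_load';

select @max_id = max(id) from dbo.master_load_table;

-- process only the new rows
select *
from dbo.master_load_table
where id > @last_id
  and id <= @max_id;

-- on success, record the new high-water mark for the next invocation
insert into dbo.job_bookmark (job_name, last_processed_id, processed_at)
values ('flat_file_load', @max_id, sysdatetime());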
A few pointers (my two cents):
Consider splitting that table, similar to the "slowly changing dimension" technique, into a few "intermediate" tables depending on the "system table" destination; then bulk load your system tables instead of going record by record.
Drop the "input" table before the bulk load, and re-create it to get rid of indexes, etc.
Do not put unnecessary (key) indexes on that table before the load.
Consider switching the database recovery model to BULK_LOGGED so that bulk transactions are minimally logged (see the sketch after this list).
Can you use an SSIS (ETL) task for loading, cleaning and validating?
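For illustration, the recovery-model switch might look like this (MyEtlDb is a made-up database name):
-- Sketch only: minimize logging of bulk operations during the load window.
alter database MyEtlDb set recovery bulk_logged;
-- ...perform the bulk load here...
alter database MyEtlDb set recovery full;   -- restore the normal recovery model afterwards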
UPDATE:
Here is a typical ETL scenario -- well, it depends on who you talk to.
1. Extract to flat_file_1 (you have that)
2. Clean flat_file_1 --> SSIS --> flat_file_2 (you can validate here)
3. Conform flat_file_2 --> SSIS --> flat_file_3 (apply all company standards)
4. Deliver flat_file_3 --> SSIS (bulk) --> db.ETL.StagingTables (several, one per your destination)
4B. insert into destination_table select * from db.ETL.StagingTable (bulk load your final destination)
This way, if a process (1-4) times out, you can always restart from the intermediate file. You can also inspect each stage and create report files from SSIS for each stage to control your data quality. Operations 1-3 are essentially slow; here they happen outside of the database and can be done on a separate server. If you archive flat_file(1-3) you also have an audit trail of what's going on -- good for debugging too. :)
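A rough sketch of steps 4 and 4B in T-SQL (the file path and delimiters are made up; the table names follow the scenario above):
-- Step 4: bulk load the conformed flat file into the staging table.
bulk insert db.ETL.StagingTable
from 'C:\etl\flat_file_3.csv'
with (fieldterminator = ',', rowterminator = '\n', tablock);

-- Step 4B: move the staged rows into the final destination in one set-based insert.
insert into destination_table
select * from db.ETL.StagingTable;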