I need to replicate a table from an external db to an internal db for performance reasons. Several apps will use this local db to do joins and compare data. I only need to replicate every hour or so, but if there is a performant solution, I would prefer to replicate every 5 to 10 minutes.
What would be the best way to replicate? The first thing that comes to mind is DROP and then CREATE:
DROP TABLE clonedTable;
CREATE TABLE clonedTable AS SELECT * FROM foo.sourceTable@extern_data;
There has to be a better way, right? Hopefully an atomic solution, to avoid the fraction of a second where the table doesn't exist but someone might try to query it.
The simplest possible solution would be a materialized view that is set to refresh every hour.
CREATE MATERIALIZED VIEW mv_cloned_table
REFRESH COMPLETE
START WITH sysdate + interval '1' minute
NEXT sysdate + interval '1' hour
AS
SELECT *
FROM foo.external_table@database_link;
This will delete all the data currently in mv_cloned_table, insert all the data from the table in the external database, and then schedule itself to run again an hour after it finishes (so it will actually be 1 hour + however long it takes to refresh between refreshes).
There are lots of ways to optimize this.
If the folks that own the source database are amenable to it, you can ask them to create a materialized view log on the source table. That would allow your materialized view to replicate just the changes which should be much more efficient and would allow you to schedule refreshes much more frequently.
If you have the cooperation of the folks that own the source database, you could also use Streams instead of materialized views, which would let you replicate the changes in near real time (a lag of a few seconds would be common). That also tends to be more efficient on the source system than maintaining the materialized view logs would be. But it tends to take more admin time to get everything working properly -- materialized views are much less flexible and less efficient but pretty easy to configure.
If you don't mind the table being empty during a refresh (it would exist, it would just have no data), you can do a non-atomic refresh on the materialized view which would do a TRUNCATE followed by a direct-path INSERT rather than a DELETE and conventional path INSERT. The former will be much more efficient but will mean that the table appears empty when you're doing joins and data comparisons on the local server which seems unlikely to be appropriate in this situation.
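For example, a non-atomic complete refresh can be requested explicitly through DBMS_MVIEW (a minimal sketch, assuming the materialized view name from above):
BEGIN
  -- method => 'C' forces a complete refresh;
  -- atomic_refresh => FALSE allows TRUNCATE + direct-path INSERT
  DBMS_MVIEW.REFRESH(
    list           => 'MV_CLONED_TABLE',
    method         => 'C',
    atomic_refresh => FALSE);
END;
/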
If you want to go down the path of having the source side create a materialized view log so that you can do an incremental refresh, on the source side, assuming the source table has a primary key, you'd ask them to
CREATE MATERIALIZED VIEW LOG ON foo.external_table
WITH PRIMARY KEY
INCLUDING NEW VALUES;
The materialized view that you would create would then be
CREATE MATERIALIZED VIEW mv_cloned_table
REFRESH FAST
START WITH sysdate + interval '1' minute
NEXT sysdate + interval '1' hour
WITH PRIMARY KEY
AS
SELECT *
FROM foo.external_table@database_link;
Related
We face the following situation (Teradata):
The business layer frequently executes long-running queries against X_Past UNION ALL X_Today.
X_Today is updated frequently, say once every 10 minutes; X_Past only once, shortly after midnight (via a full load).
The writing process should not block the reading process.
Writing should happen as soon as new data is available.
Proposed approach:
2 "Today" and a "past" table, plus a UNION ALL view that selects from one of them based on the value in a load status table.
X_Today_1
X_Today_0
X_Past
the loading process will load X_Today_1 and set the active_table value in the load status table to "X_Today_1"
the next time it will load X_Today_0 and set the active_table value to "X_Today_0"
etc.
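The switch itself could be a plain status update once the load completes (a sketch; the LOAD_STATUS column name is assumed from the description above):
-- after the load into X_Today_1 finishes:
UPDATE LOAD_STATUS SET active_table = 'X_Today_1';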
The view that is used to select from the tables will be built as follows:
SELECT *
FROM X_Past
UNION ALL
SELECT td1.*
FROM X_Today_1 td1
   , ( SELECT active_table FROM LOAD_STATUS ) active_tab1
WHERE active_tab1.active_table = 'X_Today_1'
UNION ALL
SELECT td0.*
FROM X_Today_0 td0
   , ( SELECT active_table FROM LOAD_STATUS ) active_tab0
WHERE active_tab0.active_table = 'X_Today_0';
My main questions:
When executing the SELECT, will there be a lock on ALL tables, or only on those that are actually accessed for data? Because of the WHERE clause, data from one of the Today_1/0 tables will always be ignored, and that table should remain available for loading.
Do we need any form of explicit locking, or is the default locking mechanism what we want (which I suspect it is)?
Will this work, or am I overlooking something?
It is important that the loading process waits if the reading process takes longer than 20 minutes and the loader is about to refresh the second table again. The reading process should never really be blocked, except maybe by itself.
Any input is much appreciated...
Thank you for your help.
A few comments on your questions:
Depending on the query structure, the Optimizer will try to get the default locks (in this case a READ lock) at different levels -- most likely table or row-hash locks. For example, if you do a SELECT * FROM my_table WHERE PI_column = 'value', you should get a row-hash lock and not a table lock.
Try running an EXPLAIN on your SELECT and see if it gives you any locking info. The Optimizer might be smart enough to determine there are 0 rows in one of the joined tables and reduce the lock requests. If it still locks both tables, see the end of this post for an alternative approach.
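For instance (a sketch; the first lines of the EXPLAIN output describe the locks the request will take):
EXPLAIN
SELECT * FROM X_Past
UNION ALL
SELECT * FROM X_Today_1;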
Your query written as-is will result in READ locks, which would block any WRITE requests on the tables. If you are worried about locking issues / concurrency, have you thought about using an explicit ACCESS lock? This would allow your SELECT to run without ever having to wait for your write queries to complete. This is called a "dirty read", since there could be other requests still modifying the tables while they are being read, so it may or may not be appropriate depending on your requirements.
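A minimal sketch of the ACCESS-lock variant, with one LOCKING modifier per table ahead of the SELECT (this performs a dirty read, as noted above):
LOCKING TABLE X_Past FOR ACCESS
LOCKING TABLE X_Today_1 FOR ACCESS
LOCKING TABLE X_Today_0 FOR ACCESS
SELECT * FROM X_Past
UNION ALL
SELECT * FROM X_Today_1
UNION ALL
SELECT * FROM X_Today_0;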
Your approach seems feasible. You could also do something similar, but instead of having two UNIONs, have a single "X_Today" view that points to the "active" table. After your load process completes, you could re-point the view to the appropriate table as needed via a MACRO call:
-- macros (switch between active / loading)
REPLACE MACRO switch_to_today_table_0 AS (
  REPLACE VIEW X_Today AS SELECT * FROM X_Today_0;
);
REPLACE MACRO switch_to_today_table_1 AS (
  REPLACE VIEW X_Today AS SELECT * FROM X_Today_1;
);
-- SELECT query
SELECT * FROM X_PAST UNION ALL SELECT * FROM X_Today;
-- Write request
MERGE INTO x_today_0...;
-- Switch active "today" table to the most recently loaded one
EXEC switch_to_today_table_0;
You'd have to manage which table to write to (or possibly do that using a view too) and which "switch" macro to call within your application.
One thing to think about: having two physical tables that logically represent the same table (i.e. they should contain the same data) can allow situations where one table is missing data and needs to be manually synced.
Also, if you haven't looked at them already, a few ideas to optimize your SELECT queries to run faster: row partitioning, indexes, compression, statistics, primary index selection.
I have a particular scenario where I need to truncate and batch insert into a table in ClickHouse DBMS every 30 minutes or so. I could find no reference to a truncate option in ClickHouse.
However, I could find suggestions that this can be achieved indirectly by dropping the old table, creating a new table with the same name, and inserting data into it.
With respect to that, I have a few questions.
How is this achieved? What is the sequence of steps in this process?
What happens to other queries, such as SELECTs, during the time when the table is being dropped and recreated?
How long does it usually take for a table to be dropped and recreated in ClickHouse?
Is there a better and cleaner way this can be achieved?
How is this achieved? What is the sequence of steps in this process?
TRUNCATE is supported. There is no need to drop and recreate the table now.
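A minimal example (the table name is a placeholder):
TRUNCATE TABLE IF EXISTS my_table;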
What happens to other queries, such as SELECTs, during the time when the table is being dropped and recreated?
That depends on which table engine you use. For the MergeTree family you get snapshot-like behavior for SELECTs.
How long does it usually take for a table to be dropped and recreated in ClickHouse?
I would assume it depends on how fast the underlying file system can handle file deletions. A large table might contain millions of data part files, which leads to slow truncation. In your case, however, I wouldn't worry much.
Is there a better and cleaner way this can be achieved?
I suggest using partitions keyed on a per-minute expression (e.g. DateTime / 60), along with a user script that periodically harvests out-of-date partitions.
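A sketch of that layout (table and column names are illustrative), with the per-minute partition key suggested above and the DROP PARTITION call a housekeeping script would issue:
CREATE TABLE events
(
    ts    DateTime,
    value String
)
ENGINE = MergeTree
PARTITION BY intDiv(toUInt32(ts), 60)  -- one partition per minute
ORDER BY ts;

-- run periodically from a user script to harvest stale partitions:
ALTER TABLE events DROP PARTITION 28744000;  -- a partition ID older than the cutoff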
I have created a table that is a join of 3-4 other tables. The field values in the original source tables from which this table was created DO change, but rarely.
Updating or recreating the table takes about 30 minutes to an hour, and then some reports are run against it. However, this requires keeping track of any changes in the original source tables.
If, instead, I run reports off a VIEW, I know with 100% certainty that all the field values are correct - but will my SELECT performance suffer and become slower due to the view 'going back and fetching' values each time?
In this case, speed is on the same level of importance as accuracy, and my ultimate question is whether to use a view or a table. Thank you to anyone who's taken the time to read this!
We have a system that makes use of a database view, which takes data from a few reference tables (lookups) and then does a lot of pivoting and complex work on a hierarchy table of (pretty much fixed and static) locations, returning a view of the data to the application.
This view is getting slow, as new requirements are added.
A solution that may be an option would be to create a normal table, select from the view into this table, and let the application use that highly indexed and fast table for its querying.
The issue, I guess, is that if the underlying tables change, the new table will show stale results. But the data that drives this table changes very infrequently. And if it does, a business/technical process could be put in place so that an 'update the table' procedure is run to refresh the data. Or even an update/insert trigger on the primary driving table?
Is this practice advised/ill-advised? And are there ways of making it safer?
The ideal solution is to optimise the underlying queries.
In SSMS, run the slow query and include the actual execution plan (Ctrl + M); this will give you a graphical representation of how the query is being executed against your database.
Another helpful tool is to turn on IO statistics, as IO is usually the main bottleneck with queries; put this line at the top of your query window:
SET STATISTICS IO ON;
Check if SQL Server recommends any missing indexes (displayed in green in the execution plan). As you say the data changes infrequently, it should be safe to add additional indexes if needed.
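Applying a recommended index is then a one-liner (a sketch; the table and column names are placeholders):
-- cover the filtering column and include the reported column
CREATE NONCLUSTERED INDEX IX_SourceTable_FilterColumn
ON dbo.SourceTable (FilterColumn)
INCLUDE (ReportedColumn);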
In the execution plan you can hover your mouse over any element for more information. Check the value for estimated rows vs. actual rows returned; if this varies greatly, update the statistics for the tables, which can help the query optimiser find the best execution plan.
To do this for all tables in a database:
USE [Database_Name]
GO
EXEC sp_updatestats
Still no luck in optimising the view / query?
Be careful with update triggers: if the schema changes on the view/table (say you add a new column to the source table), the new column will not be inserted into your 'optimised' table unless you update the trigger.
If it is not a business requirement to report on real-time data, there is not too much harm in having a separate optimised table for reporting (much like a data mart); just use a SQL Agent job to refresh it nightly during non-peak hours (a sketch follows the list of cons below).
There are a few cons to this approach though:
More storage space / duplicated data
More complex database
Additional workload during the refresh
Decreased cache hits
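If you do go the reporting-table route, the refresh step the SQL Agent job runs can be as simple as this sketch (the view and table names are placeholders):
CREATE PROCEDURE dbo.Refresh_ReportingTable
AS
BEGIN
    SET NOCOUNT ON;

    -- rebuild the reporting copy from the slow view
    TRUNCATE TABLE dbo.ReportingTable;

    INSERT INTO dbo.ReportingTable
    SELECT *
    FROM dbo.SlowView;
END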
I have a system with a materialized view that contains roughly 1 billion rows, and on a consistent two-hour basis I need to update about 200 million of them (20% of the records). My question is: what should the refresh strategy for my materialized view be? As of right now it refreshes on an interval. I am curious about the performance impact of refreshing on an interval versus never refreshing and renaming/replacing the old materialized view with the new one. The underlying issue is the index maintenance Oracle performs, which creates a massive amount of redo. Any suggestions are appreciated.
UPDATE
Since some people seem to think this is off topic, my current viewpoint is to do the following:
Create an Oracle Scheduler chain that invokes a series of PL/SQL (a programming language, I promise) functions to refresh the materialized view in a pseudo-parallel fashion. However, having fallen into the position of a DBA of sorts, I am looking to solve a data problem with an algorithm and/or some code.
OK, so here is the solution I came up with; your mileage may vary, and any feedback after the fact is appreciated. The overall strategy was to do the following:
1) Utilize the Oracle Scheduler making use of parallel execution of chains (jobs)
2) Utilize views (the regular kind) as the interface from the application into the database
3) Rely on materialized views built in the following manner:
CREATE MATERIALIZED VIEW foo
PARALLEL
NOLOGGING
NEVER REFRESH
AS
<select statement>;

As needed, use the following:

CREATE INDEX baz ON foo(bar) NOLOGGING;
The advantage of this is that we can build the materialized view in the background before dropping and recreating the view described in step 2. The trick is to create dynamically named materialized views while keeping the application-facing view name the same, and to not blow away the original materialized view until the new one is finished. This also allows for quick drops, as there is minimal redo to care about. This enabled materialized view creation over ~1 billion records in 5 minutes, which met our requirement of "refreshes" every thirty minutes. Further, this can be handled on a single database node, so it is possible even with constrained hardware.
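For the scheduling piece in step 1, a plain DBMS_SCHEDULER job is the simplest sketch (a full chain works the same way with more steps; the job name below is an assumption):
BEGIN
  DBMS_SCHEDULER.CREATE_JOB(
    job_name        => 'refresh_foo_job',
    job_type        => 'STORED_PROCEDURE',
    job_action      => 'FOO_BAR',  -- the build procedure shown below
    repeat_interval => 'FREQ=MINUTELY;INTERVAL=30',
    enabled         => TRUE);
END;
/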
Here is a PL/SQL procedure that will create it for you:
CREATE OR REPLACE PROCEDURE foo_bar AS
  foo_view VARCHAR2(500) := 'foo_' || TO_CHAR(SYSDATE, 'dd_MON_yyyy_hh_mi_ss');
BEGIN
  -- build the new materialized view under a timestamped name
  EXECUTE IMMEDIATE
    'CREATE MATERIALIZED VIEW ' || foo_view || '
     PARALLEL
     NOLOGGING
     NEVER REFRESH
     AS
     SELECT * FROM cats';
END foo_bar;
/
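The swap from step 2 is then just a view re-point followed by dropping the stale copy, for example (a sketch; the generated names are illustrative):
-- re-point the application-facing view at the freshly built MV
CREATE OR REPLACE VIEW foo_v AS
  SELECT * FROM foo_01_JAN_2024_10_30_00;

-- only now is it safe to remove the previous copy
DROP MATERIALIZED VIEW foo_01_JAN_2024_10_00_00;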