Here's the scenario I need help with:
I have a large Oracle table that is queried by mobile app users 1-10 times per second, which leaves very little room for downtime.
There is a backend process that refreshes all the data in the table, approximately 1 million rows. The process deletes every row from the table, then inserts the values from the source table. That's it.
The problem: this causes the table to be unavailable for too long (15 minutes).
I read about partition exchange, but all the examples I find deal with a specific partition range, which doesn't apply to my case.
My question: can I somehow refresh the data in a temp offline table and then just make that table my online/live table? That would just be a synonym/name swap, wouldn't it? Are there better methods?
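To illustrate what I mean, something like this, alternating between two copies (all object names here are just placeholders I made up):

    -- The app always reads through the synonym ACCOUNTS (currently pointing at ACCOUNTS_A).
    -- Refresh the offline copy:
    TRUNCATE TABLE accounts_b;
    INSERT /*+ APPEND */ INTO accounts_b SELECT * FROM source_table;
    COMMIT;

    -- Then flip the synonym; this is a quick metadata-only operation:
    CREATE OR REPLACE SYNONYM accounts FOR accounts_b;
    -- The next refresh would load ACCOUNTS_A and point the synonym back.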
Thanks!!!
Related
I am creating a Hive table by joining multiple source tables. The join takes approximately 3 hours because of the huge data volume. The Hive table is designed as truncate-and-load, and it is consumed further downstream.
We plan to refresh this Hive table 4 times a day because the data in the source tables keeps getting updated. Since the load is truncate-and-load, the table will be empty for roughly 3 hours on each refresh (the time the join query takes), and during that window no data is available downstream.
Can someone suggest how we can keep truncating and loading the table while the old data remains available to downstream during the fresh load?
One option to ensure downstream gets data during the ~3 hr downtime is to create a read copy of the table for the downstream systems. For example, create a tableB that is populated with a select * from tableA_with_joins. Downstream then reads from tableB even while a truncate-and-load is happening on tableA.
One downside of this approach is the additional time spent syncing the data from tableA to tableB, but it ensures downstream still has data during the load.
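A rough HiveQL sketch of this (table names follow the example above; the join and column names are placeholders):

    -- Rebuild tableA_with_joins with the long-running join; downstream keeps
    -- reading tableB in the meantime.
    INSERT OVERWRITE TABLE tableA_with_joins
    SELECT s1.id, s1.name, s2.amount
    FROM source1 s1
    JOIN source2 s2 ON s1.id = s2.id;

    -- Once tableA is complete, refresh the read copy in one comparatively quick step.
    INSERT OVERWRITE TABLE tableB
    SELECT * FROM tableA_with_joins;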
I have an application using an AWS Aurora PostgreSQL 10 database that expects over 5M records per day in one table. The application will run in a Kubernetes environment with ~5 pods.
One of the application's requirements is to expose a method that builds an object with all the possible values of 5 columns of the table, i.e. all distinct values of the name column, and so on.
We expect ~100 distinct values per column. A DISTINCT/GROUP BY takes more than 1 s per column, so the process does not meet the non-functional requirement on processing time.
The solution I found was to create a table/view with the distinct values of each column, refreshed by a cron-like task.
Is this the most effective approach to meet the non-functional (processing-time) requirement using only PostgreSQL tools?
One possible solution is a materialized view that you regularly refresh. Between these refreshes, the data will become slightly stale.
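A minimal sketch of that approach (table and column names are assumptions):

    -- One materialized view per column whose distinct values are needed.
    CREATE MATERIALIZED VIEW distinct_names AS
    SELECT DISTINCT name FROM big_table;

    -- A unique index is required for REFRESH ... CONCURRENTLY, which avoids
    -- blocking readers while the view is being rebuilt.
    CREATE UNIQUE INDEX ON distinct_names (name);

    -- Run periodically from a cron-like scheduler; between runs the values are slightly stale.
    REFRESH MATERIALIZED VIEW CONCURRENTLY distinct_names;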
Alternatively, you can maintain a separate table with just the distinct values and use triggers to keep the information up to date whenever rows are modified. This will require a combined index on all the affected columns to be fast.
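A sketch of the trigger-maintained variant for a single column (names are hypothetical; handling DELETEs, i.e. dropping a value once its last row is gone, is left out):

    -- Side table holding the distinct values of big_table.name.
    CREATE TABLE distinct_names (name text PRIMARY KEY);

    CREATE OR REPLACE FUNCTION track_distinct_name() RETURNS trigger AS $$
    BEGIN
        -- Record the value if it has not been seen before.
        INSERT INTO distinct_names (name)
        VALUES (NEW.name)
        ON CONFLICT (name) DO NOTHING;
        RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER big_table_track_name
        AFTER INSERT OR UPDATE OF name ON big_table
        FOR EACH ROW EXECUTE PROCEDURE track_distinct_name();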
DISTINCT is always a performance problem if it affects many rows.
Currently we have an AuditLog table that holds over 11M records. Regardless of the indexes and statistics, any query referencing this table takes a long time. Most reports don't look at audit records older than a year, but we would still like to keep those records. What's the best way to handle this?
I was thinking of keeping only records up to a year old in the AuditLog table and moving anything older to an AuditLogHistory table, maybe by running a batch job every night to move the records over and then updating the indexes and statistics on AuditLog. Is this a reasonable way to do it, or how else should I be storing the older records?
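Something like this is what I have in mind for the nightly job (the date column name is an assumption):

    -- Move audit rows older than one year into the history table, then delete them.
    -- (In practice the DELETE would probably be batched to keep the transaction log small.)
    BEGIN TRANSACTION;

    INSERT INTO dbo.AuditLogHistory
    SELECT *
    FROM dbo.AuditLog
    WHERE CreatedDate < DATEADD(year, -1, GETDATE());

    DELETE FROM dbo.AuditLog
    WHERE CreatedDate < DATEADD(year, -1, GETDATE());

    COMMIT;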
The records brought back from the AuditLog table hit a linked server and are checked against 6 different databases to see whether a certain member exists in them, based on a condition. I don't have access to change the linked-server databases, so I can only optimize what I have, which is the AuditLog. Hitting the linked-server databases accounts for over 90% of the query's cost, so I'm just trying to limit what I can.
First, I find it hard to believe that you cannot optimize a query on a table with 11 million records. You should investigate the indexes that you have relative to the queries that are frequently run.
In any case, the answer to your question is "partitioning". You would partition by the date column and be sure to include a filter on that column in all queries. That will reduce the amount of data scanned and probably speed up processing.
The documentation is a good place to start for learning about partitioning.
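For illustration, a date-partitioned layout in SQL Server might look roughly like this (the names, the date column, and the boundary values are all assumptions):

    -- Yearly partitions on the audit date column.
    CREATE PARTITION FUNCTION pf_AuditByYear (datetime2)
        AS RANGE RIGHT FOR VALUES ('2022-01-01', '2023-01-01', '2024-01-01');

    CREATE PARTITION SCHEME ps_AuditByYear
        AS PARTITION pf_AuditByYear ALL TO ([PRIMARY]);

    -- A table placed on the scheme is physically split by date; queries that
    -- filter on CreatedDate then scan only the relevant partitions.
    CREATE TABLE dbo.AuditLogPartitioned (
        AuditId     bigint        NOT NULL,
        CreatedDate datetime2     NOT NULL,
        Detail      nvarchar(max) NULL
    ) ON ps_AuditByYear (CreatedDate);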
I have an SSIS package that runs every hour. The package first truncates a table and then populates it with new data, and this process takes 15-20 minutes. While the package runs, the data is not available to users, so they have to wait until it completes. Is there any way to handle this situation so users don't have to wait?
Do not truncate the table. Instead, add an audit column with a date data type, partition the table into hourly partitions on this audit column, and drop the old partition once the new partition has been loaded with new data.
Make sure the users' queries are directed to the proper partition with the help of the audit column.
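A rough sketch of that sliding-window pattern (SQL Server 2016+; all names and the partition function are assumptions):

    -- New rows carry the load time in the audit column and land in the current
    -- hour's partition (the table is partitioned on AuditTime via pf_HourlyAudit).
    INSERT INTO dbo.LiveTable (Col1, Col2, AuditTime)
    SELECT Col1, Col2, SYSDATETIME()
    FROM dbo.StagingSource;

    -- Once the new hour is fully loaded, empty the previous hour's partition.
    -- The partition number for a given hour can be found with $PARTITION.pf_HourlyAudit(...).
    TRUNCATE TABLE dbo.LiveTable WITH (PARTITIONS (2));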
You can do an 'A-B flip'.
Instead of truncating the client-facing table and reloading it, you could use two tables to do the job.
For example, if the table in question is called ACCOUNT:
Load the data to a table called STG_ACCOUNT
Rename ACCOUNT to ACCOUNT_OLD
Rename STG_ACCOUNT to ACCOUNT
Rename ACCOUNT_OLD to STG_ACCOUNT
By doing this, you minimize the amount of time the users have an empty table.
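In SQL Server the flip itself could look like this (a minimal sketch using the names above):

    -- STG_ACCOUNT has just been fully loaded; the renames are metadata-only and take milliseconds.
    BEGIN TRANSACTION;
    EXEC sp_rename 'dbo.ACCOUNT',     'ACCOUNT_OLD';
    EXEC sp_rename 'dbo.STG_ACCOUNT', 'ACCOUNT';
    EXEC sp_rename 'dbo.ACCOUNT_OLD', 'STG_ACCOUNT';
    COMMIT;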
It's a very dangerous practice, but you can change the isolation level of your transactions (I mean the users' queries) from READ COMMITTED/SERIALIZABLE to READ UNCOMMITTED. However, the behavior of such queries is very hard to predict: if the table is being modified by the SSIS package (insert/delete/update) and end users do uncommitted reads (like SELECT * FROM Table1 WITH (NOLOCK)), some rows can be counted several times or missed.
If users only need to read the new hour's data, you can try changing the isolation level to allow 'dirty reads', but be careful!
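For example, a session-level dirty read would look like this (equivalent to the NOLOCK hint above):

    -- Affects every query in the session, not just one statement.
    SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
    SELECT * FROM Table1;   -- may see duplicated or missing rows while the load runs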
If they can work with the previous hour's data, the best solution is the one Arnab described, but partitioning is available only in Enterprise Edition. In other SQL Server editions, use the rename approach Zak described.
[Updated] If the main lag (tens of minutes, as you said) is caused by complex calculations (and NOT by the sheer number of loaded rows!), you can use another table as a buffer: store batches of rows there (hundreds, thousands, etc.) and then move them to the main table. That way new data becomes available in portions, without any 'dirty read' tricks.
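A rough sketch of that buffer idea (table and column names are hypothetical):

    -- The package writes each calculated batch into the buffer table first;
    -- a short transaction then publishes the batch to the live table.
    BEGIN TRANSACTION;

    INSERT INTO dbo.Table1 (Col1, Col2)
    SELECT Col1, Col2
    FROM dbo.Table1_Buffer;

    DELETE FROM dbo.Table1_Buffer;

    COMMIT;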
I am developing a multi-threaded application (it could be considered client-server) that processes data. Below is a high-level description of the application.
There is a table (with no key or ID field) with many rows on our database server. Several systems (threads) read (select) a fixed number of rows from the table, process them, and then remove (delete) those rows from the table.
I am looking for a way to remove (delete) the data without using a temp table, but any ideas involving temporary storage are also welcome.
P.S.: Using locks and a temp table, I solved the reading part, but I need help with the deleting part.
P.S. 2: One possible solution Jean suggested is not to remove rows physically from the table. The idea is great, but I forgot to mention that the table must be empty after a specific period of time, so with that solution I would also need a system that deletes all the marked rows at the end (which is not possible).